## NCDC Data Harmonization

The NCDC project leverages multiple data sources through a federated infrastructure. Using these sources together requires a data harmonization process that makes the data interoperable.

### OMOP CDM

A clinical model defines a structure and a set of relationships for representing different types of clinical data. Combined with standard vocabularies, such a model makes it possible to achieve a higher degree of interoperability, metadata description, and sustainability.

The OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model) is one such clinical model, and it has grown steadily to accommodate more types of clinical data. Its structure also includes standard vocabularies obtained from established sources, such as SNOMED.

The NCDC project uses the OMOP CDM to represent the data from each source and to maintain the source data. Although the OMOP CDM defines a more complex data structure with many more tables, the NCDC project mainly uses the following tables to represent the data (definitions quoted from https://ohdsi.github.io/CommonDataModel/cdm60.html):
- PERSON: “central identity management for all Persons in the database … uniquely identify each person or patient, and some demographic information.”
- OBSERVATION: “clinical facts about a Person obtained in the context of examination, questioning or a procedure.”
- MEASUREMENT: “structured values (numerical or categorical) obtained through systematic and standardized examination or testing of a Person or Person’s sample.”
- CONDITION_OCCURRENCE: “suggesting the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom.”

In [None]:
# Creating the database client using the "psycopg2" library
import psycopg2

# Build the URI to the DB following this specification:
# postgresql://[user[:password]@][host][:port][/dbname]
connection = psycopg2.connect("postgresql://")
db_client = connection.cursor()
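
The URI format in the comment above can be assembled from its parts. The following is a minimal sketch, not part of the NCDC tooling: the helper name `build_postgres_uri` and every value passed to it are hypothetical placeholders. It percent-encodes the credentials so that characters such as `@` or `:` in a password do not break the URI.

```python
from urllib.parse import quote


def build_postgres_uri(user, password, host, port, dbname):
    """Assemble a postgresql:// URI, percent-encoding user, password and dbname."""
    # safe="" makes quote() also encode "@", ":" and "/", which would
    # otherwise be read as URI separators.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql://{credentials}@{host}:{port}/{quote(dbname, safe='')}"


# Placeholder values only (assumptions); replace them with your own details.
example_uri = build_postgres_uri("reader", "p@ss:word", "localhost", 5432, "omop")
print(example_uri)
```

The resulting string can then be passed to `psycopg2.connect(...)` in place of the bare `"postgresql://"` shown above.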

#### Querying the OMOP CDM

Some examples of how to query the OMOP CDM are given below.
When writing a new query, it may help to consult both the NCDC mapping information and the OMOP CDM v6.0 definition (https://ohdsi.github.io/CommonDataModel/cdm60.html).

In [None]:
# Select the number of persons and the max, min, and average year of birth, grouped by gender
#
# Gender concept id:
#   - 8532: Female
#   - 8551: Unknown
#   - 8507: Male
sql_statement = """SELECT gender_concept_id, COUNT(person_id), 
MAX(year_of_birth), MIN(year_of_birth), AVG(year_of_birth) FROM 
PERSON GROUP BY gender_concept_id"""
db_client.execute(sql_statement)
result = db_client.fetchall()
print(result)

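# Select the average age of all persons with a dementia diagnosis
# (age is approximated as the current year minus the year of birth)
#
# Condition with concept id 4182210 from SNOMED "Dementia"
# The subquery only decides which persons are included, so a person with
# several dementia records is counted once.
sql_statement = """SELECT AVG(date_part('year', now()) - year_of_birth)
FROM PERSON
WHERE person_id IN (
    SELECT person_id FROM CONDITION_OCCURRENCE
    WHERE condition_concept_id = 4182210
)
"""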
db_client.execute(sql_statement)
result = db_client.fetchone()
print(result)
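
A join between PERSON and CONDITION_OCCURRENCE returns one row per condition record, so a person with several records for the same condition weighs more in an aggregate than a person with one. The sketch below illustrates the effect. It uses an in-memory SQLite database as a stand-in for the real PostgreSQL instance, and both the two-column tables and the rows in them are made up for the illustration.

```python
import sqlite3

# In-memory SQLite stand-in holding only the columns needed here (assumption).
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
INSERT INTO person VALUES (1, 1940), (2, 1960);
-- Person 1 has two dementia records, person 2 has one.
INSERT INTO condition_occurrence VALUES (1, 4182210), (1, 4182210), (2, 4182210);
""")

# One row per condition record: person 1 contributes twice to the average.
per_record = demo.execute("""
    SELECT AVG(p.year_of_birth)
    FROM person AS p
    INNER JOIN condition_occurrence AS c ON p.person_id = c.person_id
    WHERE c.condition_concept_id = ?
""", (4182210,)).fetchone()[0]

# One row per person: the subquery only decides who is included.
per_person = demo.execute("""
    SELECT AVG(year_of_birth)
    FROM person
    WHERE person_id IN (
        SELECT person_id FROM condition_occurrence
        WHERE condition_concept_id = ?
    )
""", (4182210,)).fetchone()[0]

print(per_record, per_person)
demo.close()
```

Which of the two averages is appropriate depends on the question being asked; for a per-person statistic such as average age, the second form is usually the intended one.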

### Simplified Table

One drawback of using a clinical model can be the greater complexity of its definition. This is the case with the OMOP CDM: it requires more knowledge of its schema, and querying can be more difficult, especially when taking the first steps. Although we recommend using the OMOP CDM, it is also possible to use a simplified table that mimics most of the representations used in the source data: a flat table with one entry per visit.

In [None]:
# Select the number of entries and the max, min, and average year of birth, grouped by sex
#
# NCDC coding:
#  - 0: Male
#  - 1: Female
#  - NULL: Unknown
sql_statement = """SELECT sex, COUNT(id), 
MAX(birth_year), MIN(birth_year), AVG(birth_year) FROM ncdc 
GROUP BY sex
"""
db_client.execute(sql_statement)
result = db_client.fetchall()
print(result)

# Selecting the average age for all persons with a dementia diagnosis
#
# NCDC variable for dementia: "dementia_diagnosis"
# NCDC coding: TRUE ('1'), FALSE ('0')
sql_statement = """SELECT AVG(date_part('year', now()) - birth_year) 
FROM ncdc WHERE dementia_diagnosis IS TRUE"""
# Alternative: dementia_diagnosis = '1'
db_client.execute(sql_statement)
result = db_client.fetchone()
print(result)
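
The grouped query on the simplified table returns coded values, which are easier to read once mapped to labels. The following is a minimal sketch under two assumptions: the `sex` column comes back as an integer (or `None` for NULL), and rows have the five-column shape produced by the grouped query above. The helper `label_sex_rows` and the sample rows are hypothetical.

```python
# Labels taken from the NCDC coding comments; psycopg2 returns NULL as None.
SEX_LABELS = {0: "Male", 1: "Female", None: "Unknown"}


def label_sex_rows(rows):
    """Turn (sex, count, max, min, avg) tuples into dicts with a readable label."""
    labelled = []
    for sex, count, newest, oldest, mean in rows:
        labelled.append({
            # Codes outside the mapping are reported instead of silently dropped.
            "sex": SEX_LABELS.get(sex, f"Unrecognised code {sex!r}"),
            "count": count,
            "max_birth_year": newest,
            "min_birth_year": oldest,
            # AVG() may come back as Decimal; convert it for display.
            "avg_birth_year": float(mean) if mean is not None else None,
        })
    return labelled


# Made-up rows in the shape returned by fetchall() for the grouped query.
sample_rows = [(0, 2, 1950, 1940, 1945.0), (None, 1, 1960, 1960, 1960.0)]
for row in label_sex_rows(sample_rows):
    print(row)
```

If the column is stored as text (for example `'0'` and `'1'`), the keys of `SEX_LABELS` would need to be adjusted accordingly.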