# PATIENT to OMOP CDM

Regarding the Observation, Measurement, and Condition_Occurrence tables, should all observation values be entered into the same row for `observation_concept_id`, or should multiple rows be created?

If an `observation_concept_id` is repeated, should it be removed? Otherwise, a large amount of data might be generated.

To be completed.

#### LINKS

To get the corresponding vocabulary value for the `ethnicity_concept_id` column in the `PERSON` table in OMOP (among others):
https://inspiredata.network/etl/IDSR2OMOP/IDS2OMOP-Assay-Part-1-v1.0/measurement.html

Vocabulary value for `condition_type_concept_id`:
https://github.com/OHDSI/OMOP-Queries/blob/master/md/Condition_Occurence.md

In [3]:
import mysql.connector
import pandas as pd
import psycopg2
import random
import numpy as np
from datetime import datetime, timedelta

def get_random_value(val):
    if isinstance(val, tuple):
        return np.random.choice(val)
    return val

ModuleNotFoundError: No module named 'psycopg2'

In [56]:
# Read data from the CSV file
df_patient_IDEA4RC = pd.read_csv("./IDEA4RC-data/cancerEpisodeIDEA4RC.csv")
df_patient_IDEA4RC.head(5)

Unnamed: 0,id,Gender,Race,Birth year,Country of Residence,Smoking,Cigarettes pack years smoked during life,Alcohol,Height/weight (BMI),Charslon Comorbidity Index,...,McCune-Albright syndrome,Multiple osteochondromas,Neurofibromatosis type 1,Rothmund-Thomson syndrome,Werner syndrome,Retinoblastoma,Paget disease,Other Genetic syndrome WHO 2020,Occurrence of other cancer,Previous cancer treatment
0,2,8532,8527,0,4329169,1585856,903650,1586197,4245997,42538860,...,37117262,37396802,377252,4286355,4197821,4158977,75910,1340204,1340204,1340204
1,27,8532,8515,0,4329169,1585856,903650,1586197,4245997,42538860,...,37117262,37396802,377252,4286355,4197821,4158977,75910,1340204,1340204,1340204
2,3,8532,8515,0,4329169,1585856,903650,1586197,4245997,42538860,...,37117262,37396802,377252,4286355,4197821,4158977,75910,1340204,1340204,1340204
3,44,8507,8516,0,4329169,1585856,903650,1586197,4245997,42538860,...,37117262,37396802,377252,4286355,4197821,4158977,75910,1340204,1340204,1340204
4,14,8507,8516,0,4329169,1585856,903650,1586197,4245997,42538860,...,37117262,37396802,377252,4286355,4197821,4158977,75910,1340204,1340204,1340204


In [57]:
# Connection to the OMOP CDM database
conn = psycopg2.connect(
    dbname="omopdb",
    user="postgres",
    password="mysecretpassword",
    host="localhost",
    port="5432"
)

# Cursor used for the mapping (inserting data from the IDEA4RC CSV into OMOP)
cur = conn.cursor()
config = {
    'user': 'user', 
    'password': 'password',
    'host': '127.0.0.1',
    'database': 'idea4rc_dm',
    'raise_on_warnings': True
}

conn2 = mysql.connector.connect(**config)
curIDEA = conn2.cursor()

### Patient to Person table

For each column in `patientsIDEA4RC.csv` corresponding to the `PERSON` table, we will perform the necessary transformation for its subsequent mapping.

In [1]:
df_tables = pd.DataFrame()
df_tables.index = range(len(df_patient_IDEA4RC))
# This value is not in the IDEA4RC Excel file. We need to include it and also
# verify whether this is the right number or whether there are more.
df_tables['Ethnicity'] = [4087925 if random.random() > 0.5 else 0 for _ in range(len(df_tables))]

# df_tables['Ethnicity'] = df_tables['Ethnicity'].astype(object)

# for column in person_mapping:
#     df_patient_IDEA4RC[column] = df_patient_IDEA4RC[column].astype(object)

sql = """
    INSERT INTO omopcdm.person (person_id, gender_concept_id, race_concept_id, year_of_birth, location_id, ethnicity_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s)
"""

cur.executemany(sql, zip(df_patient_IDEA4RC['id'], df_patient_IDEA4RC['Gender'], df_patient_IDEA4RC['Race'],
                         df_patient_IDEA4RC["Birth year"], df_patient_IDEA4RC["Country of Residence"], df_tables["Ethnicity"]))

conn.commit()

NameError: name 'pd' is not defined

### Patient to Observation table

For each column in `patientsIDEA4RC.csv` corresponding to the `OBSERVATION` table, we will perform the necessary transformation for its subsequent mapping.

It is important to emphasize that this mapping is different from the others. For each `id` = `person_id`, we need to add as many rows as necessary for each value in the observation columns for that patient.
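
The one-row-per-value idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `patient_to_observation_rows`, the sample patient record, and the sequential `observation_id` scheme are hypothetical and not part of the notebook. The two yes/no concept IDs are the ones the notebook itself uses for boolean columns.

```python
YES_CONCEPT = 4188539  # concept the notebook uses when a boolean column is 1
NO_CONCEPT = 4188540   # concept the notebook uses otherwise


def patient_to_observation_rows(patient, concept_ids, boolean_columns, first_id=1):
    """Expand one wide patient record into one observation row per mapped column."""
    rows = []
    next_id = first_id
    for column, concept_id in concept_ids.items():
        if column not in patient:
            continue  # skip mapped columns that this record does not carry
        value = patient[column]
        if column in boolean_columns:
            value = YES_CONCEPT if value == 1 else NO_CONCEPT
        rows.append({
            "observation_id": next_id,
            "person_id": patient["id"],
            "observation_concept_id": concept_id,
            "value": value,
        })
        next_id += 1
    return rows


# Made-up record: the values are placeholders, not real IDEA4RC data.
patient = {"id": 2, "Smoking": 0, "Dementia": 1}
concept_ids = {"Smoking": 1585856, "Dementia": 4182210}
rows = patient_to_observation_rows(patient, concept_ids, {"Dementia"})
for r in rows:
    print(r)
```

Each patient therefore contributes as many rows as it has mapped observation columns, all sharing the same `person_id`.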

In [59]:
observation_vocab_values_concept_id= { 
    'Smoking': 1585856,
    'Cigarettes pack years smoked during life': 903650,
    'Alcohol': 1586197,
    'Comorbidity': 46235351,
    'Myocardial infarction':4329847,
    'Congestive heart failure':319835,
    'Peripheral vascular disease':321052,
    'Cerebrovascular accident (except hemiplegia)':381316,
    'Dementia': 4182210,
    'Chronic pulmonary disease':4186898,
    'Connective tissue disease':4344165,
    'Ulcer':4177703,
    'Mild liver disease':194984,
    'Moderate to severe liver disease': 194984,
    'Diabetes (without complications)': 201820,
    'Diabetes with end organ damage': 201820,
    'Hemiplegia': 374022,
    'Moderate to severe renal disease': 198124,
    'Solid tumor (non metastatic)': 443392,
    'Metastatic solid tumor': 443392,
    'Leukemia': 317510,
    'Lymphoma': 44499278,
    'Multiple myeloma': 437233,
    'AIDS': 4267414,
    'Eastern Cooperative Oncology Group performance status (ECOG PS) at diagnosis': 36305384,
    'ECOG PS label': 36303470,
    'Karnofsy index at diagnosis': 4169154,
    'Karnofsy index label': 36303744,
    'No Genetic syndrome WHO 2020': 37204336,
    'Occurrence of other cancer':1340204,
    'Previous cancer treatment' : 1340204 #Duplicate key error, what should we do?
}

areBooleans=[
    "No Genetic syndrome WHO 2020", 
    "Olliers disease", 
    "Maffuci syndrome", 
    "Li-Fraumeni syndrome", 
    "McCune-Albright syndrome", 
    "Multiple osteochondromas", 
    "Neurofibromatosis type 1", 
    "Rothmund-Thomson syndrome", 
    "Werner syndrome", 
    "Retinoblastoma", 
    "Paget disease", 
    "Other Genetic syndrome WHO 2020",
    "Comorbidity",
    "Myocardial infarction",
    "Congestive heart failure",
    "Peripheral vascular disease",
    "Cerebrovascular accident (except hemiplegia)",
    "Dementia",
    "Chronic pulmonary disease",
    "Connective tissue disease",
    "Ulcer",
    "Mild liver disease",
    "Moderate to severe liver disease",
    "Diabetes (without complications)",
    "Diabetes with end organ damage",
    "Hemiplegia",
    "Moderate to severe renal disease",
    "Solid tumor (non metastatic)",
    "Metastatic solid tumor",
    "Leukemia",
    "Lymphoma",
    "Multiple myeloma",
    "AIDS"
]
areNumbers = [
    "Height/weight (BMI)",
    "Charlson Comorbidity index",
    "Eastern Cooperative Oncology Group performance status (ECOG PS) at diagnosis",
    "Karnofsy index at diagnosis"
]

sql = """
    INSERT INTO omopcdm.observation (observation_id ,person_id, observation_concept_id, observation_date, observation_type_concept_id, value_as_concept_id)
    VALUES (%s, %s, %s, %s, %s,%s)
"""

sqlNumbers = """
    INSERT INTO omopcdm.observation (observation_id ,person_id, observation_concept_id, observation_date, observation_type_concept_id, value_as_number)
    VALUES (%s, %s, %s, %s, %s,%s)
"""
Observation_type = 38000280 #This is the code we should use with every observation type

date = datetime.now().date()

# Iterate over the patient data: 'id' and the observation columns live in
# df_patient_IDEA4RC, not in df_tables.
observation_id = 1  # Assumption: the observation table is empty, so IDs start at 1
date_value = date.strftime('%Y-%m-%d')
for idx, row in df_patient_IDEA4RC.iterrows():
    person_id = row['id']
    for column, observation_concept in observation_vocab_values_concept_id.items():
        if column in areBooleans:
            observation_value = 4188539 if row[column] == 1 else 4188540
        else:
            observation_value = row[column]
        # Numeric values go to value_as_number, everything else to value_as_concept_id
        statement = sqlNumbers if column in areNumbers else sql
        # Both statements expect six parameters, starting with observation_id
        cur.execute(statement, (observation_id, person_id, observation_concept,
                                date_value, Observation_type, observation_value))
        observation_id += 1

# Commit all inserts at once
conn.commit()
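
Several source columns in `observation_vocab_values_concept_id` share a concept ID (for example, both liver disease entries use 194984), so their rows cannot be told apart by `observation_concept_id` alone. The sketch below lists such collisions so they can be reviewed. The helper name `columns_sharing_a_concept` is hypothetical, and whether these collisions relate to the duplicate key error noted in the mapping is not established here.

```python
from collections import defaultdict


def columns_sharing_a_concept(concept_ids):
    """Return {concept_id: [columns]} for every concept ID used by more than one column."""
    grouped = defaultdict(list)
    for column, concept_id in concept_ids.items():
        grouped[concept_id].append(column)
    return {cid: cols for cid, cols in grouped.items() if len(cols) > 1}


# A small excerpt of the mapping used in this notebook.
sample = {
    'Mild liver disease': 194984,
    'Moderate to severe liver disease': 194984,
    'Leukemia': 317510,
}
collisions = columns_sharing_a_concept(sample)
print(collisions)
```

Running the same helper on the full dictionary would show every group of columns that needs either a more specific concept or an additional distinguishing field.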

### Patient to Measurement table 

In [60]:

sqlNumbers = """
    INSERT INTO omopcdm.measurement (person_id, value_as_number, measurement_date, measurement_type_concept_id,measurement_concept_id)
    VALUES (%s, %s, %s, %s, %s)
"""
patient_column_measurement = {
    'Height/weight (BMI)': 4245997,
    'Charlson Comorbidity index': 42538860
}
measurement_type_value=32809
for index, row in df_tables.iterrows():
    person_id = row['id']
    date_value = row['Date'].strftime('%Y-%m-%d')
    
    for column, measurement_concept in patient_column_measurement.items():
        measurement_value = row[column]
        
        # Insert into the database
        cur.execute(sqlNumbers, (person_id, measurement_value, date_value, measurement_type_value, measurement_concept))
        
# Commit all changes at once
conn.commit()
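
Values read with `row[column]` inside `iterrows()` are often NumPy scalars (such as `numpy.int64`) rather than built-in Python numbers, depending on the column dtypes. Database drivers may refuse to adapt those types, which may be what the commented-out `astype(object)` lines earlier were meant to address. A minimal sketch of a conversion step follows. The helper names `to_native` and `rows_to_native` are hypothetical, and the approach assumes every value is a scalar.

```python
def to_native(value):
    """Return a built-in Python scalar for NumPy-style scalars, else the value unchanged."""
    item = getattr(value, "item", None)  # NumPy scalars expose .item(); int/float/str do not
    return item() if callable(item) else value


def rows_to_native(rows):
    """Convert every value in every parameter tuple before handing it to the driver."""
    return [tuple(to_native(v) for v in row) for row in rows]
```

With this in place, a parameter tuple could be passed as `cur.execute(sqlNumbers, rows_to_native([params])[0])`, or a whole batch could be converted before `cur.executemany`.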

### Patient to Condition_occurrence table 

`condition_type_value = 32879` corresponds to the Registry type concept. However, we still need to check whether this is the appropriate value here.


In [61]:
# Source column name -> OMOP condition concept ID
patient_column_condition_names = {
    'Olliers disease': 4145177,
    'Maffuci syndrome': 4187683,
    'Li-Fraumeni syndrome': 4323645,
    'McCune-Albright syndrome': 37117262,
    'Multiple osteochondromas': 37396802,
    'Neurofibromatosis type 1': 377252,
    'Rothmund-Thomson syndrome': 4286355,
    'Werner syndrome': 4197821,
    'Retinoblastoma': 4158977,
    'Paget disease': 75910,
    'Other Genetic syndrome WHO 2020': 1340204,
}


values_to_insert = []
condition_type_value = 32879
for _, row in df_tables.iterrows():
    person_id = row['id']
    date_value = row['Date'].strftime('%Y-%m-%d') 
    for column in patient_column_condition_names.keys():
        if row[column]==1 or row[column]==True or row[column] == 4188539:
            condition_value = patient_column_condition_names[column]
            values_to_insert.append((person_id, condition_value, date_value, condition_type_value))
        


sql = """
    INSERT INTO omopcdm.condition_occurrence (person_id, condition_concept_id, condition_start_date, condition_type_concept_id)
    VALUES (%s, %s, %s, %s)
"""

with conn.cursor() as cur:
    cur.executemany(sql, values_to_insert)

conn.commit()
cur.close()
conn.close()

In [62]:
df_tables.head(3)

Unnamed: 0,Ethnicity,Observation_type,Date,Measurement_type,condition_type
0,4087925,0,2024-06-06,0,42894222
1,4087925,32817,2024-06-06,32809,42894222
2,0,32817,2024-06-06,32809,42894222
