# Cancer Episode to OMOP

A Cancer Episode in OMOP represents an overarching episode, and the information about the diagnosis is associated with it.

In [None]:
import pandas as pd
import psycopg2
import random
import numpy as np
from uuid import uuid4
from datetime import datetime, timedelta

def get_random_value(val):
    if isinstance(val, tuple):
        return np.random.choice(val)
    return val

In [None]:
df_cancer_Episode_IDEA4RC = pd.read_csv("./IDEA4RC-data/cancerEpisodeIDEA4RC.csv")
df_cancer_Episode_IDEA4RC.head(5)

In [None]:
conn = psycopg2.connect(
    dbname="omopdb",
    user="postgres",
    password="mysecretpassword",
    host="localhost",
    port="5432"
)

cur = conn.cursor()

## General Explanation

As the images below show, the data in the IDEA4RC cancer episode map to an OMOP episode with concept_id = episode of care, which in OMOP is the code for a first occurrence of cancer (the overarching episode). This OMOP episode is linked to the diagnosis in two ways: through episode_object_concept_id, which holds the condition_concept_id of the condition occurrence that represents the base diagnosis, and through the episode_event table, which links that condition occurrence (the diagnosis) to the overarching episode.

![First image](images/CancerEpisodeIDEA1.png)

![Second image](images/CancerEpisodeIDEA2.png)
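
The linkage described above can be sketched with an in-memory SQLite database standing in for the OMOP schema. This is only an illustration under stated assumptions: the tables are reduced to the columns discussed here, the integer ids and the person id are made up, and the concept ids (32533, 32879, 1147129, 44500917) are the ones quoted in this notebook and still need to be validated.

```python
import sqlite3

# In-memory stand-in for the OMOP schema (assumption: reduced column set).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER
    );
    CREATE TABLE episode (
        episode_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        episode_concept_id INTEGER,
        episode_object_concept_id INTEGER,
        episode_type_concept_id INTEGER
    );
    CREATE TABLE episode_event (
        episode_id INTEGER,
        event_id INTEGER,
        episode_event_field_concept_id INTEGER
    );
""")

# Base diagnosis (hypothetical ids; 44500917 is the example concept used in this notebook).
demo.execute("INSERT INTO condition_occurrence VALUES (?, ?, ?)", (10, 1, 44500917))
# Overarching episode: its object concept repeats the concept of the diagnosis.
demo.execute("INSERT INTO episode VALUES (?, ?, ?, ?, ?)", (100, 1, 32533, 44500917, 32879))
# episode_event ties the diagnosis row to the episode.
demo.execute("INSERT INTO episode_event VALUES (?, ?, ?)", (100, 10, 1147129))

# Follow both links: episode -> episode_event -> condition_occurrence,
# and check that the object concept matches the diagnosis concept.
linked = demo.execute("""
    SELECT e.episode_id, c.condition_occurrence_id, c.condition_concept_id
    FROM episode e
    JOIN episode_event ee ON ee.episode_id = e.episode_id
    JOIN condition_occurrence c ON c.condition_occurrence_id = ee.event_id
    WHERE e.episode_object_concept_id = c.condition_concept_id
""").fetchall()
print(linked)
```

The query returns one row per episode whose two links agree, which is a cheap consistency check after loading.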






All of these data are considered measurement modifiers:


![Third image](images/CancerEpisodeIDEA3.png)

## Cancer Episode to Condition Occurrence

For the diagnosis, we have to combine the histology and topography fields from IDEA4RC to obtain the concept id. We may need an API that takes "histology/anatomic site" as input and returns the concept_id. For example, the input 8010/3-C67.4 should give the output 44500917: https://athena.ohdsi.org/api/v1/concepts?pageSize=15&domain=Condition&query=%208010/3-C67.4&boosts=&page=1
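
A minimal sketch of how the lookup input could be assembled, assuming the Athena endpoint shown in the URL above. The helper names `build_icdo3_code` and `build_athena_query_url` are hypothetical, the `boosts` parameter from the example URL is left out, and the code only builds the request URL: the shape of the response is not documented here, so parsing it is not attempted.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this notebook.
ATHENA_CONCEPTS_URL = "https://athena.ohdsi.org/api/v1/concepts"


def build_icdo3_code(histology_code, topography_code):
    """Join morphology and topography as 'histology-topography', e.g. '8010/3-C67.4'."""
    return f"{histology_code.strip()}-{topography_code.strip()}"


def build_athena_query_url(histology_code, topography_code, page_size=15):
    """Build the search URL for a histology/topography pair (hypothetical helper)."""
    params = {
        "pageSize": page_size,
        "domain": "Condition",
        "query": build_icdo3_code(histology_code, topography_code),
        "page": 1,
    }
    # urlencode percent-encodes the '/' in the morphology code.
    return f"{ATHENA_CONCEPTS_URL}?{urlencode(params)}"


url = build_athena_query_url("8010/3", "C67.4")
print(url)
```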

This table has four groups: Histology (WHO 2017) Sarcoma subgroup, Histology (WHO 2017) H&N subgroup, Subsite H&N, Subsite Sarc.

For example, when a sarcoma is diagnosed, the histology/topography value is built from ‘Histology (WHO 2017) Sarcoma subgroup’/‘Subsite Sarc’, and for H&N it is built from ‘Histology (WHO 2017) H&N subgroup’/‘Subsite H&N’. A sarcoma excludes H&N, and vice versa. There can be only one value across the two groups Histology (WHO 2017) Sarcoma subgroup and Histology (WHO 2017) H&N subgroup, and only one across the two groups Subsite H&N and Subsite Sarc.

1. Histology Group = Histology (WHO 2017) Sarcoma subgroup + Histology (WHO 2017) H&N subgroup
2. Site Group = Subsite H&N + Subsite Sarc
3. There is always 1 variable from the Histology group and 1 variable from the Site group for histology/topography
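
The "exactly one value per group" rule above can be checked before any mapping is attempted. The following is a sketch only: `is_missing` and `pick_single_value` are hypothetical helpers, the row is a plain dict standing in for a DataFrame row, and the topography value `"C49.6"` is a placeholder, since the site dictionary is not finished yet.

```python
import math


def is_missing(value):
    """True for None and NaN, the two ways an empty cell usually shows up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pick_single_value(row, columns):
    """Return (column, value) for the only filled-in column, or raise otherwise."""
    filled = [(col, row[col]) for col in columns if col in row and not is_missing(row[col])]
    if len(filled) != 1:
        raise ValueError(f"expected exactly one value, found {len(filled)}: {filled}")
    return filled[0]


# Hypothetical row: one histology value and one site value are filled in.
example_row = {
    "histology_squamous": float("nan"),
    "vascular_tumours": 3661612,
    "larynx_subsite": None,
    "trunk_wall_subsite": "C49.6",
}

histo = pick_single_value(example_row, ["histology_squamous", "vascular_tumours"])
site = pick_single_value(example_row, ["larynx_subsite", "trunk_wall_subsite"])
print(histo, site)
```

Raising on zero or several values makes violations of the rule visible instead of silently picking the first column.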

If we cannot obtain an OMOP concept id for a histology/topography combination, what should we do? Are all histology/topography combinations covered in Blueberry for the sarcoma registry?

We use condition_type_concept_id = 32835 (EHR Pathology Report). Is the same code used in Blueberry?

Link to mapping:
https://docs.google.com/spreadsheets/d/1Vw1Dr2K4oG__cDQTutGaJhZvGUvQTLwc4qWreP6qMSs/edit?gid=481021496#gid=481021496


We need ICD-O-3 codes for:

- Human papillomavirus positive squamous cell carcinoma: 37204531
- Human papillomavirus negative squamous cell carcinoma: 37204532

We still need sites_and_subsutes to be finished so that we can finish the dictionary.

In [None]:
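sql = """
    INSERT INTO omopcdm.condition_occurrence (condition_occurrence_id, person_id, condition_concept_id, condition_start_date, condition_type_concept_id)
    VALUES (%s, %s, %s, %s, %s);
"""

df_tables = df_cancer_Episode_IDEA4RC
# The loop below calls .strftime on date_of_diagnosis, so parse the raw 'Date' column first
df_tables['date_of_diagnosis'] = pd.to_datetime(df_tables['Date'])
df_tables['omopID'] = None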
histology = [
    "histology_squamous",
    "histology_adenocarcinoma",
    "histology_neuroendocrine",
    "histology_odontogenic_carcinoma",
    "histology_snuc",
    "histology_subgroup_adipocytic_tumours",
    "histology_subgroup_fmt_tumours",
    "histology_subgroupsocalled_fibrohistiocytic_tumours",
    "vascular_tumours",
    "pericytic_perivascular_tumours",
    "smooth_muscle_tumours",
    "skeletal_muscle_tumours",
    "chondro_osseous_tumours",
    "peripheral_nerve_sheath_tumours",
    "tumours_of_uncertain_differentiation",
    "undif_smallrcel_sarc_bon_and_sof_tis",
    "miscellanious_mesenchimal_tumors",
    "mixed_epithelial_and_mesenchymal_tumours",
    "est_and_related_tumours",
    "histology_subgroupsocalled_fibrohistiocytic_tumours",
    "histology_adenosquamous_carcinoma",
    "histology_teratocarcinosarcoma",
    "histology_NUT_carcinoma",
    "histology_HPV_related_carcinoma",
    "histology_olfactory_neuroblastoma",
    "carcinoma_undifferentiated",
    "endometrial_stromal_related_tumours"
]

topography = [
    "nasal_cavity_and_paranasal_sinuses_subsite",
    "nasopharynx_subsite",
    "hypopharynx_subsite",
    "oropharynx_subsite",
    "larynx_subsite",
    "oral_cavity_subsite",
    "lip_subsite",
    "upper_and_lower_limbs_subsite",
    "trunk_wall_subsite",
    "intra_abdominal_subsite",
    "intra_thoracic_subsite",
    "genito_urinary_subsite",
    "head_and_neck_subsite",
    "breast_subsite",
    "other_subsite",
    "partid_gland",
    "submandibular_gland",
    "sublingual_gland",
    "middle_ear",
    "subsite_sarc"
]
fromSNOMEDtoICDO3= {
    4078953:"8071/3",
    4166826:"8072/3",
    4206785:"8121/3",
    4009590:"8074/3",
    4247661:"8082/3",
    4029973:"8083/3",
    37156145:"8070/3",
    4312219:"8075/3",
    4052146:"8076/3",
    4175678:"8031/3",
    4233949:"8051/3",
    4191609:"8052/3",
    4097305:"8144/3",
    37152526: "8140/3",
    4050978: "8440/3",
    4247921: "8480/3",
    4238334: "8420/3",
    4172953: "8260/3",
    4022895: "8200/3",
    4253608: "8430/3",
    4030121: "8525/3",
    4164740: "8550/3",
    4148292: "8310/3",
    4146684: "8147/3",
    4221403: "8500/3",
    4224593: "8542/3",
    4323699: "8982/3",
    4029680: "8982/3",
    4321002: "8562/3",
    4066512: "8941/3",
    4182993: "8410/3",
    4271564: "8980/3",
    4212379: "8290/3",
    36714029: "8041/3",
    37018672: "8041/3",
    4029971: "8013/3",
    4133828: "8240/3",
    4133297: "8249/3",
    4098585: "9270/3",
    37116966: "9341/3",
    4005818: "9302/3",
    4243327: "8020/3",
    4241843: "8560/3",
    4284401: "9081/3",
    37116896: "8023/3",
    4295574: "9522/3",
    4021372: "8850/1",  # Atypical lipomatous tumor
    4102790: "8851/3",  # Liposarcoma, well differentiated
    4284833: "8858/3",  # Dedifferentiated liposarcoma
    4101123: "8852/3",  # Myxoid liposarcoma
    4010104: "8854/3",  # Pleomorphic liposarcoma
    45766452: "8813/1",  # Palmar/plantar-type fibromatosis
    4264204: "8821/1",  # Aggressive fibromatosis
    45766468: "8851/1",  # Lipofibromatosis
    4030132: "8834/1",  # Giant cell fibroblastoma
    4295283: "8832/1",  # Dermatofibrosarcoma protuberans
    37116995: "8815/0",  # Solitary fibrous tumor, grade 1
    37116996: "8815/1",  # Solitary fibrous tumor, grade 2
    37116997: "8815/3",  # Solitary fibrous tumor, grade 3
    4021373: "8825/1",  # Myofibroblastic tumour
    45766454: "8825/3",  # Low-grade myofibroblastic sarcoma
    4029679: "8810/1",  # Cellular fibroma
    45766450: "8811/1",  # Myxoinflammatory fibroblastic sarcoma
    4202374: "8810/3",  # Fibrosarcoma
    45771359: "8811/3",  # Myxofibrosarcoma
    45771361: "8840/3",  # Low-grade fibromyxoid sarcoma
    45766456: "8840/3",  # Sclerosing epithelioid fibrosarcoma
    4029525: "8835/1",  # Plexiform fibrohistiocytic tumour
    4218898: "9251/1",  # Giant cell tumour of soft parts
    4030140: "9252/0",  # Tenosynovial giant cell tumor
    45766525: "9252/1",  # Tenosynovial giant cell tumor, diffuse
    4029655: "9252/3",  # Malignant tenosynovial giant cell tumor
    4207381: "9133/3",  # Epithelioid haemangioendothelioma
    3661612: "9120/3",  # Angiosarcoma
    4281384: "9130/1",  # Hemangioendothelioma
    4030139: "9135/1",  # Spindle cell hemangioendothelioma
    45766493: "9136/1",  # Retiform haemangioendothelioma
    4030138: "9135/1",  # Papillary intralymphatic angioendothelioma
    45766494: "9136/1",  # Composite haemangioendothelioma
    45766492: "9138/1",  # Pseudomyogenic haemangioendothelioma
    4014760: "8711/0",  # Glomus tumor
    4028856: "8711/3",  # Glomus tumor, malignant
    4327493: "8897/1",  # Smooth muscle tumour
    607803: "8891/1",  # Epithelioid smooth muscle tumor of uncertain malignant potential
    600675: "8896/1",  # Myxoid smooth muscle tumour of uncertain malignant potential
    4029023: "8898/1",  # Metastasizing leiomyoma
    37151903: "8890/3",  # Leiomyosarcoma
    4173148: "8891/3",  # Epithelioid leiomyosarcoma
    4059632: "8896/3",  # Myxoid leiomyosarcoma
    4031038: "8910/3",  # Embryonal rhabdomyosarcoma
    4272931: "8920/3",  # Alveolar rhabdomyosarcoma
    4298312: "8901/3",  # Pleomorphic rhabdomyosarcoma
    4029528: "8912/3",  # Spindle cell rhabdomyosarcoma
    4029024: "8921/3",  # Ectomesenchymoma
    4298240: "9180/3",  # Extraskeletal osteosarcoma
    4062426: "9540/3",  # Malignant peripheral nerve sheath tumour
    4102399: "9540/3",  # Melanotic malignant peripheral nerve sheath tumor
    4005360: "9580/0",  # Granular cell tumour
    4027842: "9580/3",  # Granular cell tumour, malignant
    4096931: "8830/1",  # Atypical fibrous histiocytoma
    4029526: "8836/1",  # Angiomatoid fibrous histiocytoma
    45766448: "8802/1",  # Pleomorphic hyalinizing angiectatic tumour
    1243125: "8811/1",  # Hemosiderotic fibrolipomatous tumor
    4268491: "9040/3",  # Synovial sarcoma
    4244886: "8804/3",  # Epithelioid sarcoma
    4339208: "9581/3",  # Alveolar soft part sarcoma
    4155072: "9044/3",  # Clear cell sarcoma
    4297345: "9231/3",  # Extraskeletal myxoid chondrosarcoma
    4029522: "8806/3",  # Desmoplastic small round cell tumor
    4299132: "8963/3",  # Extrarenal rhabdoid tumor
    45771358: "8714/0",  # Perivascular epithelioid tumor, benign
    45766447: "8714/3",  # Perivascular epithelioid tumor, malignant
    4304780: "8714/0",  # Perivascular epithelioid cell tumor
    45766495: "9137/3",  # Intimal sarcoma
    4029021: "8805/3",  # Undifferentiated sarcoma
    4239519: "8982/0",  # Myoepithelioma
    4029680: "8982/3",  # Myoepithelial carcinoma
    4028565: "8842/0",  # Ossifying fibromyxoid tumour
    45766467: "8842/3",  # Ossifying fibromyxoid tumor, malignant
    4195947: "8990/0",  # Mesenchymoma
    45766484: "8990/0",  # Phosphaturic mesenchymal tumor, benign
    45766485: "8990/3",  # Phosphaturic mesenchymal tumour, malignant
    4239956: "8841/0",  # Angiomyxoma
    45766528: "9364/3",  # Ewing sarcoma
    37155974: "9366/3",  # Round cell sarcoma with EWSR1–non-ETS fusions
    37152438: "9367/3",  # CIC-rearranged sarcoma
    37152440: "9368/3",  # Sarcoma with BCOR genetic alterations
    4290926: "8930/3",  # Endometrial stromal sarcoma, high grade
    4028557: "8931/3",  # Endometrial stromal sarcoma, low grade
    45771357: "8590/1",  # Uterine tumour resembling ovarian sex cord tumour
    4040991: "9020/0",  # Benign phyllodes tumor
    4323562: "9020/1",  # Borderline phyllodes tumor
    4337106: "9020/3",  # Malignant phyllodes tumor
    4028710: "9758/3",  # Follicular dendritic cell sarcoma
    4029172:"9755/3",
    4028709:"9757/3",
    4029173:"9756/3",
    42872917:"9759/3",
    37311513:"9045/3",
    4135215:"8933/3"
}   

for idx, row in df_tables.iterrows():
    person_id = row['patient']
    date_value = row['date_of_diagnosis'].strftime('%Y-%m-%d')
    # NOTE: the OMOP CDM defines condition_occurrence_id as an integer; this string id
    # only works if the column type in omopcdm has been relaxed accordingly
    condition_id = datetime.now().strftime('%Y%m%d%H%M%S') + str(uuid4())
    df_tables.loc[idx, 'omopID'] = condition_id

    # First histology column whose value has an ICD-O-3 morphology code in the dictionary
    histo_code = None
    for histo in histology:
        histo_code = fromSNOMEDtoICDO3.get(row[histo])
        if histo_code:
            break

    # Same lookup for topography (the dictionary does not cover sites and subsites yet)
    topo_code = None
    for topo in topography:
        topo_code = fromSNOMEDtoICDO3.get(row[topo])
        if topo_code:
            break

    if histo_code and topo_code:
        # Histology/topography pair in the "8010/3-C67.4" form; it still has to be
        # resolved to a valid concept_id
        condition_concept_id = f"{histo_code}-{topo_code}"
    else:
        condition_concept_id = 0  # OMOP convention for "no matching concept"
    condition_type_concept_id = 32835  # EHR Pathology Report; this has to be checked
    cur.execute(sql, (condition_id, person_id, condition_concept_id, date_value, condition_type_concept_id))
    conn.commit()

In this case, the vocabulary is used for condition_concept_id.

Open question: which condition_type_concept_id should be used?

## Overarching Episode Creation

I am not sure about this part and need more time to check it.
Should episode_type_concept_id be 32879 (Registry)?



In [None]:
df_tables = df_cancer_Episode_IDEA4RC
df_tables['date_of_diagnosis'] = pd.to_datetime(df_tables['Date'])
# Six columns, so six placeholders
sql = """
    INSERT INTO omopcdm.episode (episode_id, person_id, episode_start_date, episode_concept_id, episode_object_concept_id, episode_type_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s);
"""

def episode_event_definition(idEpisode, idCondition):
    sql = """
        INSERT INTO omopcdm.episode_event (episode_id, event_id, episode_event_field_concept_id )
        VALUES (%s, %s, %s);
    """
    cur.execute(sql, (idEpisode, idCondition,1147129))
    conn.commit()

queryGetObject="""
    SELECT condition_concept_id FROM omopcdm.condition_occurrence WHERE condition_occurrence_id=%s;"""

for idx, row in df_tables.iterrows():
    person_id = row['patient']
    condition_id = row['omopID']
    episode_id = datetime.now().strftime('%Y%m%d%H%M%S') + str(uuid4())
    episode_start_date = row['date_of_diagnosis'].strftime('%Y-%m-%d')
    episode_concept_id = 32533
    # Query parameters must be a sequence, hence the one-element tuple
    cur.execute(queryGetObject, (condition_id,))
    episode_object_concept_id = cur.fetchone()[0]
    episode_type_concept_id = 32879
    # Insert the episode first so that episode_event can reference it
    cur.execute(sql, (episode_id, person_id, episode_start_date, episode_concept_id, episode_object_concept_id, episode_type_concept_id))
    episode_event_definition(episode_id, condition_id)

conn.commit()


### Cancer Episode to Procedure Occurrence

Required field that I still do not have:

procedure_type_concept_id

In [None]:
df_tables = df_cancer_Episode_IDEA4RC

# Five columns, so five placeholders
sql = """
    INSERT INTO omopcdm.procedure_occurrence (procedure_occurrence_id, person_id, procedure_concept_id, procedure_date, procedure_type_concept_id)
    VALUES (%s, %s, %s, %s, %s);
"""
def episode_event_definition(idEpisode, idProcedure):
    sql = """
        INSERT INTO omopcdm.episode_event (episode_id, event_id, episode_event_field_concept_id )
        VALUES (%s, %s, %s);
    """
    cur.execute(sql, (idEpisode, idProcedure, 1147082))
    conn.commit()

for idx, row in df_tables.iterrows():
    id_person = row['patient']
    date = row['date_of_diagnosis']
    procedure_concept_id = row['type_of_biopsy'] #We use the vocabulary since it is accepted as seen in https://athena.ohdsi.org/search-terms/terms?domain=Procedure&standardConcept=Standard&page=1&pageSize=15&query=
    provider_id = row['biopsy_done_by']  # read but not inserted yet
    procedure_type_concept_id = 32879  # We need to see how to manage this
    procedure_occurrence_id = datetime.now().strftime('%Y%m%d%H%M%S') + str(uuid4())
    # Insert the procedure before linking it through episode_event
    cur.execute(sql, (procedure_occurrence_id, id_person, procedure_concept_id, date, procedure_type_concept_id))
    # NOTE: omopID holds the id of the condition occurrence; the episode id generated in the
    # previous cell would have to be stored to link the procedure to the episode itself
    episode_event_definition(row['omopID'], procedure_occurrence_id)
    conn.commit()


### Cancer Episode to Observation

Is this OK? I do not think so, since I feel this should be a boolean.

In [None]:
sql = """
    INSERT INTO omopcdm.observation (person_id, observation_concept_id, observation_date, observation_type_concept_id, observation_event_id, obs_event_field_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s);
"""
df_tables = df_cancer_Episode_IDEA4RC
observation_concept_id = 37117814  # still to be validated
observation_type_concept_id = 32879
for idx, row in df_tables.iterrows():
    value = row['radiotherapy_induced_sarcoma']
    # Any filled-in value ('RADIATION_THERAPY_IND_CHANGE', 'POSITIVE', 'NEGATIVE', 'YES', ...)
    # currently produces an observation; whether 'NEGATIVE' should do so is still open
    if pd.notna(value) and value:
        cur.execute(sql, (row['patient'], observation_concept_id, row['date_of_diagnosis'], observation_type_concept_id, row['omopID'], 1147127))
conn.commit()

### Cancer Episode to Measurement

For each Cancer Episode, we need to create one new row in Measurement per mapped field (the dictionary below lists 8).

I do not understand what to use for measurement_type_concept_id, since I do not have any value that appears to be valid. For now the code uses 32879 (Registry).
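
Each measurement value ends up either in value_as_number or in value_as_concept_id. A small sketch of that decision, assuming the coded answers are exactly the concept ids listed in the measurement cell of this notebook; `split_measurement_value` is a hypothetical helper, not part of the pipeline, and it assumes the remaining values are numeric.

```python
# Concept ids that the measurement cell in this notebook treats as coded answers.
CODED_VALUES = {1634371, 1634752, 1633749, 1635587, 1634085, 9191, 9189, 45878602}


def split_measurement_value(value):
    """Return (value_as_number, value_as_concept_id); at most one of them is filled."""
    if value is None or value != value:  # NaN is the only value not equal to itself
        return (None, None)
    if value in CODED_VALUES:
        return (None, int(value))
    return (float(value), None)


print(split_measurement_value(9191))
print(split_measurement_value(12.5))
print(split_measurement_value(float("nan")))
```

Keeping the decision in one function makes it easy to route every row to a single INSERT that fills both columns, with NULL in the unused one.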

In [None]:
sql = """
    INSERT INTO omopcdm.measurement (person_id, measurement_concept_id, measurement_date, measurement_type_concept_id, value_as_number, measurement_event_id, meas_event_field_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s, %s);
"""
# Same insert into measurement, but the value goes into value_as_concept_id
sql_codes = """
    INSERT INTO omopcdm.measurement (person_id, measurement_concept_id, measurement_date, measurement_type_concept_id, value_as_concept_id, measurement_event_id, meas_event_field_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s, %s);
"""
cancerEpisode_vocab_values_concept_id = {
    'grading': 4159955,
    'tumor_size': 36768664,
    'superficial_depth': 36768911,
    'deep_depth': 36768749,
    'biopsy_mitotic_count': 4227243,
    'plasmatic_ebv_dna_at_baseline': 3043849,
    'hpv_status': 46236082,
    'crpcreactive_protein_tested': 3000965
}
# Values that are concept ids rather than numbers
coded_values = (1634371, 1634752, 1633749, 1635587, 1634085, 9191, 9189, 45878602)

df_tables = df_cancer_Episode_IDEA4RC
for idx, row in df_tables.iterrows():
    person_id = row['patient']
    date_value = row['date_of_diagnosis'].strftime('%Y-%m-%d')
    measurement_event_id = row['omopID']
    measurement_type_concept_id = 32879
    # visit_occurrence_id is not set: in IDEA we do not save it or need it
    for column, measurement_conc in cancerEpisode_vocab_values_concept_id.items():
        measurement_value = row[column]
        if pd.isna(measurement_value):
            continue  # nothing recorded for this field
        statement = sql_codes if measurement_value in coded_values else sql
        cur.execute(statement, (person_id, measurement_conc, date_value, measurement_type_concept_id, measurement_value, measurement_event_id, 1147127))
    conn.commit()

### Histology and topography to Measurement

Is value_as_concept_id necessary? Is the value used for it correct?

In [None]:
histology = [
    "histology_squamous",
    "histology_adenocarcinoma",
    "histology_neuroendocrine",
    "histology_odontogenic_carcinoma",
    "histology_snuc",
    "histology_subgroup_adipocytic_tumours",
    "histology_subgroup_fmt_tumours",
    "histology_subgroupsocalled_fibrohistiocytic_tumours",
    "vascular_tumours",
    "pericytic_perivascular_tumours",
    "smooth_muscle_tumours",
    "skeletal_muscle_tumours",
    "chondro_osseous_tumours",
    "peripheral_nerve_sheath_tumours",
    "tumours_of_uncertain_differentiation",
    "undif_smallrcel_sarc_bon_and_sof_tis",
    "miscellanious_mesenchimal_tumors",
    "mixed_epithelial_and_mesenchymal_tumours",
    "est_and_related_tumours",
    "histology_subgroupsocalled_fibrohistiocytic_tumours",
    "histology_adenosquamous_carcinoma",
    "histology_teratocarcinosarcoma",
    "histology_NUT_carcinoma",
    "histology_HPV_related_carcinoma",
    "histology_olfactory_neuroblastoma",
    "carcinoma_undifferentiated",
    "endometrial_stromal_related_tumours"
]

topography = [
    "nasal_cavity_and_paranasal_sinuses_subsite",
    "nasopharynx_subsite",
    "hypopharynx_subsite",
    "oropharynx_subsite",
    "larynx_subsite",
    "oral_cavity_subsite",
    "lip_subsite",
    "upper_and_lower_limbs_subsite",
    "trunk_wall_subsite",
    "intra_abdominal_subsite",
    "intra_thoracic_subsite",
    "genito_urinary_subsite",
    "head_and_neck_subsite",
    "breast_subsite",
    "other_subsite",
    "partid_gland",
    "submandibular_gland",
    "sublingual_gland",
    "middle_ear",
    "subsite_sarc"
]


# The columns belong to measurement, so that is the target table
sql_measurement = """
    INSERT INTO omopcdm.measurement (person_id, measurement_concept_id, measurement_date, measurement_type_concept_id, value_as_concept_id, measurement_event_id, meas_event_field_concept_id)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
"""
for idx, row in df_tables.iterrows():
    # First filled-in histology column (NaN is truthy, so test for it explicitly)
    histo_concept_id = None
    for histo in histology:
        if pd.notna(row[histo]) and row[histo]:
            histo_concept_id = row[histo]
            break
    # First filled-in topography column
    topo_concept_id = None
    for topo in topography:
        if pd.notna(row[topo]) and row[topo]:
            topo_concept_id = row[topo]
            break
    measurement_event_id = row['omopID']
    measurement_type_concept_id = 32879
    person_id = row['patient']
    date_value = row['date_of_diagnosis'].strftime('%Y-%m-%d')
    # Skip a measurement when no value was found for it
    if topo_concept_id is not None:
        cur.execute(sql_measurement, (person_id, topo_concept_id, date_value, measurement_type_concept_id, topo_concept_id, measurement_event_id, 1147127))
    if histo_concept_id is not None:
        cur.execute(sql_measurement, (person_id, histo_concept_id, date_value, measurement_type_concept_id, histo_concept_id, measurement_event_id, 1147127))
    conn.commit()
