# Exploratory Data Analysis of Synthea MCODE Breast Cancer Dataset
This notebook walks through a simple exploratory data analysis of the Synthea MCODE breast cancer dataset. It assumes that you have a local instance of Katsu running on the default port with the Synthea data ingested (as outlined in the federated-learning repository's README.md).

No model is trained on this data here, but the preprocessing steps are similar to those used before training one.

We use pandas for data cleaning and the correlation matrix, and seaborn for the heatmap.

In [1]:
import requests
import json
query = """
query{
  katsuDataModels
  {
    mcodeDataModels
    {
      mcodePackets{
        subject {
          dateOfBirth
          sex
        }
        cancerCondition {
          dateOfDiagnosis
        }
        cancerRelatedProcedures {
          code {
            label
          }
        }
        cancerDiseaseStatus {
          label
        }
        medicationStatement {
          medicationCode {
            label
          }
        }
      }
    }
  }
}
"""
url = "http://localhost:7999/graphql"
req = requests.post(url, json={'query': query})
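A GraphQL endpoint conventionally reports query failures in a top-level `errors` list, often together with an HTTP 200 status, so checking the status code alone may miss them. The following is a minimal sketch of a defensive extraction step. The helper name `extract_mcode_packets` is hypothetical, and whether this particular gateway reports errors this way is an assumption.

```python
def extract_mcode_packets(payload: dict) -> list:
    """Return the list of mcodePackets from a decoded GraphQL response body."""
    # Assumption: failures arrive in a top-level "errors" list (the GraphQL convention).
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL query failed: {payload['errors']}")
    return payload["data"]["katsuDataModels"]["mcodeDataModels"]["mcodePackets"]


# Offline demonstration with a hand-written payload shaped like the query above.
sample = {
    "data": {
        "katsuDataModels": {
            "mcodeDataModels": {"mcodePackets": [{"subject": {"sex": "FEMALE"}}]}
        }
    }
}
packets = extract_mcode_packets(sample)
print(len(packets))
```

With the live response this would be called as `extract_mcode_packets(req.json())`, ideally after `req.raise_for_status()`.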

In [10]:
print(req.status_code)
all_results = req.json()['data']['katsuDataModels']['mcodeDataModels']['mcodePackets']
all_results

200


[{'subject': {'dateOfBirth': '1978-01-29', 'sex': 'FEMALE'},
  'cancerCondition': [{'dateOfDiagnosis': '2018-01-19T22:03:59Z'}],
  'cancerRelatedProcedures': [{'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
   {'code': {'

In [17]:
all_results[-25] # an arbitrary entry of our selected MCODE data.

{'subject': {'dateOfBirth': '1978-01-29', 'sex': 'FEMALE'},
 'cancerCondition': [{'dateOfDiagnosis': '2018-01-19T22:03:59Z'}],
 'cancerRelatedProcedures': [{'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Megavoltage radiation therapy using photons (procedure)'}},
  {'code': {'label': 'Mega

## Data Cleaning
Here we drop empty columns, adjust null values, or cut rows.

In [18]:
import pandas as pd
df = pd.json_normalize(all_results) # converts our JSON list into a normalized pandas dataframe
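To make the effect of `pd.json_normalize` concrete, here is a small sketch on one hand-written record shaped like the Katsu output above. Nested dictionaries are flattened into dotted column names (such as `subject.sex`), while lists of dictionaries are left as Python lists in a single cell, which is why the procedure and medication columns need their own handling later.

```python
import pandas as pd

# One record shaped like an mcodePacket; the values are copied from the output above.
records = [
    {
        "subject": {"dateOfBirth": "1978-01-29", "sex": "FEMALE"},
        "cancerCondition": [{"dateOfDiagnosis": "2018-01-19T22:03:59Z"}],
        "cancerDiseaseStatus": {"label": "Patient's condition improved"},
    }
]

flat = pd.json_normalize(records)

# Nested dicts become dotted columns; the list stays a list inside one cell.
print(sorted(flat.columns))
print(type(flat.loc[0, "cancerCondition"]))
```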

In [19]:
for col in df.columns:
    if df[col].astype(str).nunique() == 1:
        print(col)
        print(df[col].astype(str).unique())
        # a column with a single distinct value (for example, all null) carries no information, so we drop it
        df = df.drop(col, axis=1)

cancerDiseaseStatus
['nan']


In [21]:
df

Unnamed: 0,cancerCondition,cancerRelatedProcedures,medicationStatement,subject.dateOfBirth,subject.sex,cancerDiseaseStatus.label
0,[{'dateOfDiagnosis': '2018-01-19T22:03:59Z'}],[{'code': {'label': 'Megavoltage radiation the...,[{'medicationCode': {'label': 'exemestane 25 M...,1978-01-29,FEMALE,Patient's condition improved
1,[{'dateOfDiagnosis': '2019-07-24T20:39:12Z'}],[{'code': {'label': 'Megavoltage radiation the...,[{'medicationCode': {'label': '5 ML hyaluronid...,1945-08-11,FEMALE,Patient's condition improved
2,[{'dateOfDiagnosis': '2018-12-02T14:18:11Z'}],[{'code': {'label': 'Megavoltage radiation the...,[{'medicationCode': {'label': 'tamoxifen citra...,1966-12-15,FEMALE,Patient's condition improved
3,[{'dateOfDiagnosis': '2019-10-22T08:27:51Z'}],[],[{'medicationCode': {'label': '100 ML Epirubic...,1953-11-07,FEMALE,
4,[{'dateOfDiagnosis': '2017-02-24T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...",[{'medicationCode': {'label': 'Fluorouracil'}}...,1954-03-01,FEMALE,
5,[{'dateOfDiagnosis': '2015-02-01T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...",[{'medicationCode': {'label': 'Fluorouracil'}}...,1967-05-01,FEMALE,
6,[{'dateOfDiagnosis': '2015-07-13T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...","[{'medicationCode': {'label': 'Irinotecan'}}, ...",1946-04-01,FEMALE,
7,[{'dateOfDiagnosis': '2015-08-31T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...","[{'medicationCode': {'label': 'Bevacizumab'}},...",1965-08-01,FEMALE,
8,[{'dateOfDiagnosis': '2015-06-13T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...","[{'medicationCode': {'label': 'Oxaliplatin'}},...",1971-09-01,MALE,
9,[{'dateOfDiagnosis': '2015-09-14T00:00:00Z'}],"[{'code': {'label': 'Not available'}}, {'code'...",[{'medicationCode': {'label': 'Fluorouracil'}}...,1955-09-01,MALE,


In [24]:
df = df.dropna(subset=['cancerDiseaseStatus.label']) # drop any rows that have empty disease status labels

### Enumerate cancerRelatedProcedures into Independent Columns

In [25]:
all_procs = set()
for _, row in df.iterrows():
    for i in row['cancerRelatedProcedures']:
        all_procs.add(i['code']['label'])
        
dict_list_procs = []
for _, row in df.iterrows():
    row_dict = dict.fromkeys(all_procs, 0)
    for i in row['cancerRelatedProcedures']:
        row_dict[i['code']['label']] += 1
    dict_list_procs.append(row_dict)
df_procs = pd.DataFrame(dict_list_procs)
df_procs

Unnamed: 0,Megavoltage radiation therapy using photons (procedure),Partial mastectomy (procedure)
0,34,1
1,34,1
2,34,1
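The two-pass loop above (collect every label, then count per row) can also be written with `collections.Counter`. This is a sketch of an equivalent approach, not the notebook's own method; the helper name `count_labels` is hypothetical, and the toy labels `A` and `B` stand in for real procedure names.

```python
from collections import Counter

import pandas as pd


def count_labels(cells, key):
    """Build one count column per distinct label found under item[key]['label']."""
    counts = [Counter(item[key]["label"] for item in cell) for cell in cells]
    # Labels missing from a row come out as NaN, so fill them with 0.
    return pd.DataFrame(counts).fillna(0).astype(int)


# Toy input: one patient with three procedures, one patient with none.
cells = [
    [{"code": {"label": "A"}}, {"code": {"label": "A"}}, {"code": {"label": "B"}}],
    [],
]
toy_counts = count_labels(cells, "code")
print(toy_counts)
```

On the real data this would be called as `count_labels(df['cancerRelatedProcedures'], 'code')`, and with the key `'medicationCode'` for the medication column.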


### Enumerate medicationStatement into Independent Columns

In [26]:
all_meds = set()
for _, row in df.iterrows():
    for i in row['medicationStatement']:
        all_meds.add(i['medicationCode']['label'])
        
dict_list_meds = []
for _, row in df.iterrows():
    row_dict = dict.fromkeys(all_meds, 0)
    for i in row['medicationStatement']:
        row_dict[i['medicationCode']['label']] += 1
    dict_list_meds.append(row_dict)
df_meds = pd.DataFrame(dict_list_meds)
df_meds

Unnamed: 0,palbociclib 100 MG Oral Capsule,5 ML fulvestrant 50 MG/ML Prefilled Syringe,5 ML hyaluronidase-oysk 2000 UNT/ML / trastuzumab 120 MG/ML Injection,exemestane 25 MG Oral Tablet,tamoxifen citrate 10 MG Oral Tablet
0,0,0,0,1,0
1,0,1,1,0,0
2,1,0,0,0,1


### Parse Diagnosis Age

In [27]:
import datetime
def parse_diagnosis_age(row) -> float:
    """
    A function that returns the patient's age at diagnosis, in whole hours (rounded down).
    
    Input: A row of the normalized dataframe with 'cancerCondition' and 'subject.dateOfBirth' columns.
    Output: The difference (in hours) between the first diagnosis date and the date of birth.
    """
    # only the date part (YYYY-MM-DD) is used; the time of day of the diagnosis is ignored
    diag_date = datetime.datetime.strptime(row['cancerCondition'][0]['dateOfDiagnosis'][:10], '%Y-%m-%d')
    birth_date = datetime.datetime.strptime(row['subject.dateOfBirth'][:10], '%Y-%m-%d')
    difference = diag_date - birth_date
    diff_in_hrs = difference.total_seconds() // 3600 # floor division: rounded down
    return diff_in_hrs


In [28]:
df = df.assign(diagnosisAge=df.apply(parse_diagnosis_age, axis=1)) # add the age at diagnosis (in hours) as a new column
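The same age calculation can be done without a row-wise `apply` by letting pandas parse the dates. The sketch below is an alternative, not the notebook's method; the function name `diagnosis_age_hours` is hypothetical, and it assumes the diagnosis timestamps have already been pulled out of the `cancerCondition` lists into a Series of strings.

```python
import pandas as pd


def diagnosis_age_hours(diag_dates: pd.Series, birth_dates: pd.Series) -> pd.Series:
    """Whole hours between date of birth and diagnosis date (time of day ignored)."""
    diag = pd.to_datetime(diag_dates.str[:10])  # keep only YYYY-MM-DD
    born = pd.to_datetime(birth_dates)
    return (diag - born).dt.total_seconds() // 3600


# Dates taken from the first two records shown earlier.
diag = pd.Series(["2018-01-19T22:03:59Z", "2019-07-24T20:39:12Z"])
born = pd.Series(["1978-01-29", "1945-08-11"])
hours = diagnosis_age_hours(diag, born)
print(hours.tolist())
```

Dividing the result by 8766 (365.25 days of 24 hours) gives an approximate age in years, which is easier to read than hours.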

### Drop Cancer Condition
This step probably would not be part of a real workflow with the Synthea MCODE dataset. I could not tell which parts of cancerCondition, if any, are relevant, and every patient has breast cancer, so I drop the column.

I also drop medicationStatement and cancerRelatedProcedures, since we have already parsed the information we need from them.

In [29]:
df = df.drop(axis=1, labels=['cancerCondition', 'medicationStatement', 'cancerRelatedProcedures'])

In [30]:
dfnew = pd.concat([df.reset_index(), df_procs, df_meds], axis=1) # reset_index aligns the rows and keeps the old index as an 'index' column

### One-Hot Encode cancerDiseaseStatus.label

In [31]:
one_hot = pd.get_dummies(dfnew['cancerDiseaseStatus.label'])
dfnew = dfnew.drop('cancerDiseaseStatus.label', axis=1)
dfnew = dfnew.join(one_hot["Patient's condition improved"])

### Drop Extraneous Columns
We drop any columns that deliver meta-information or information that is already provided by other columns.

In [32]:
dfnew

Unnamed: 0,index,subject.dateOfBirth,subject.sex,diagnosisAge,Megavoltage radiation therapy using photons (procedure),Partial mastectomy (procedure),palbociclib 100 MG Oral Capsule,5 ML fulvestrant 50 MG/ML Prefilled Syringe,5 ML hyaluronidase-oysk 2000 UNT/ML / trastuzumab 120 MG/ML Injection,exemestane 25 MG Oral Tablet,tamoxifen citrate 10 MG Oral Tablet,Patient's condition improved
0,0,1978-01-29,FEMALE,350400.0,34,1,0,0,0,1,0,1
1,1,1945-08-11,FEMALE,648240.0,34,1,0,1,1,0,0,1
2,2,1966-12-15,FEMALE,455520.0,34,1,1,0,0,0,1,1
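The frame displayed above still contains `index`, `subject.dateOfBirth`, and `subject.sex`. A minimal sketch of the drop this section describes follows, assuming those three are the extraneous columns: `index` is bookkeeping from `reset_index`, and the date of birth is already reflected in `diagnosisAge`. That choice is an assumption, not something the notebook confirms.

```python
import pandas as pd

# Toy frame with the same bookkeeping columns as dfnew.
toy = pd.DataFrame(
    {
        "index": [0, 1],
        "subject.dateOfBirth": ["1978-01-29", "1945-08-11"],
        "subject.sex": ["FEMALE", "FEMALE"],
        "diagnosisAge": [350400.0, 648240.0],
    }
)

extraneous = ["index", "subject.dateOfBirth", "subject.sex"]  # assumed choice

# errors="ignore" skips any listed column that is not present.
trimmed = toy.drop(columns=extraneous, errors="ignore")
print(list(trimmed.columns))
```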


## Analysis/Notable Results

First, we note that 456 of the 2052 fetched records (about 22%) had an empty cancerDiseaseStatus.label: the dropna during cleaning left us with 1596 rows.

Also, we note that every entry was female, and that everyone had some variant of breast cancer.

Also, there are only 66 entries where the cancer worsened, about 4% of the 1596 remaining rows. The cancerDiseaseStatus.label column is binary, so a condition that did not improve worsened.

This class imbalance can severely bias classification models trained to predict this value toward the majority class.
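To put a number on the imbalance, the sketch below computes the class shares and inverse-frequency class weights from the counts quoted above (66 worsened out of 1596). The weighting formula, total / (number of classes × class count), is a common heuristic and is shown only as an illustration; the notebook itself does not train or weight any model.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (n_classes * count) for each class."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts taken from the text above: 1596 rows, 66 of them worsened.
counts = {"improved": 1596 - 66, "worsened": 66}
shares = {label: n / sum(counts.values()) for label, n in counts.items()}
weights = balanced_class_weights(counts)

print(shares)
print(weights)
```

A classifier that always predicts "improved" would already be right on roughly 96% of rows, which is why accuracy alone is a poor metric here.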

In [33]:
len(dfnew[dfnew["Patient's condition improved"] == 0]) # number of patients whose condition did not improve (worsened)

0

### Correlation Matrix and Heatmap

We finish with a correlation matrix and heatmap. Several medications and procedures correlate strongly with one another, but few features correlate meaningfully with whether a patient's condition improved. This may be a consequence of the large share of improved results; even so, it suggests that the 66 patients who worsened shared no clear medication or procedure pattern.

In [34]:
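corr = dfnew.select_dtypes(include=["number", "bool"]).corr() # keep only numeric columns; recent pandas versions raise an error on string columns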
corr

Unnamed: 0,index,diagnosisAge,Megavoltage radiation therapy using photons (procedure),Partial mastectomy (procedure),palbociclib 100 MG Oral Capsule,5 ML fulvestrant 50 MG/ML Prefilled Syringe,5 ML hyaluronidase-oysk 2000 UNT/ML / trastuzumab 120 MG/ML Injection,exemestane 25 MG Oral Tablet,tamoxifen citrate 10 MG Oral Tablet,Patient's condition improved
index,1.0,0.34796,,,0.866025,0.0,0.0,-0.866025,0.866025,
diagnosisAge,0.34796,1.0,,,-0.167412,0.937509,0.937509,-0.770097,-0.167412,
Megavoltage radiation therapy using photons (procedure),,,,,,,,,,
Partial mastectomy (procedure),,,,,,,,,,
palbociclib 100 MG Oral Capsule,0.866025,-0.167412,,,1.0,-0.5,-0.5,-0.5,1.0,
5 ML fulvestrant 50 MG/ML Prefilled Syringe,0.0,0.937509,,,-0.5,1.0,1.0,-0.5,-0.5,
5 ML hyaluronidase-oysk 2000 UNT/ML / trastuzumab 120 MG/ML Injection,0.0,0.937509,,,-0.5,1.0,1.0,-0.5,-0.5,
exemestane 25 MG Oral Tablet,-0.866025,-0.770097,,,-0.5,-0.5,-0.5,1.0,-0.5,
tamoxifen citrate 10 MG Oral Tablet,0.866025,-0.167412,,,1.0,-0.5,-0.5,-0.5,1.0,
Patient's condition improved,,,,,,,,,,
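The empty rows and columns in the matrix above are NaN values. Pearson correlation divides by each column's standard deviation, so any column that is constant across the displayed rows (here the two procedure counts and the improvement flag) has no defined correlation. The sketch below reproduces this on a toy frame with hypothetical column names.

```python
import pandas as pd

# "constant" has zero variance, like the procedure counts in the rows shown above.
toy = pd.DataFrame(
    {
        "varying_a": [1, 2, 3],
        "varying_b": [2, 4, 7],
        "constant": [34, 34, 34],
    }
)

toy_corr = toy.corr()
print(toy_corr)

# Dropping zero-variance columns first gives a matrix without NaN entries.
informative = toy.loc[:, toy.nunique() > 1]
clean_corr = informative.corr()
```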


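In [37]:
# seaborn is not part of every Python environment; install it first if the import fails (e.g. pip install seaborn)
import seaborn as sns
import matplotlib.pyplot as plt

In [38]:
corr = dfnew.select_dtypes(include=["number", "bool"]).corr() # keep only numeric columns
sns.heatmap(corr)
plt.show()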