# Cardea

Cardea is a machine learning library built on top of the FHIR data standard to solve common prediction problems from electronic health records.

This Python notebook demonstrates Cardea's workflow from a user's perspective. It is organized around the elements of the framework. Documentation: https://DAI-Lab.github.io/Cardea/

Currently in support of version 0.1.0.

In this tutorial, we show how to predict patient mortality using the MIMIC-III demo dataset. Through machine learning, we want to predict this outcome with an end-to-end library that is easy to interpret.

In [1]:
# if you are running from Google Colab, uncomment the following commands to 
# install cardea.

# ! pip install cardea
# ! pip install 'urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1'

In [2]:
# imports 
import matplotlib.pyplot as plt
from mlblocks import MLPipeline

from cardea import Cardea


In [3]:
# optional
import warnings
warnings.filterwarnings("ignore")

## Download MIMIC-III Demo Dataset
We initialize Cardea with the path to the dataset CSV files. Cardea supports FHIR and MIMIC-III data, and in this tutorial we use MIMIC-III data. To use Cardea on the publicly available MIMIC-III demo dataset (as we do in this tutorial), uncomment the following code and run it to download the data. You can also obtain the full, access-restricted MIMIC-III dataset and initialize `Cardea` with the path to that dataset.

In [4]:
# from os import makedirs
# makedirs('mimic_demo_data', exist_ok=True)
# ! wget -r -N -c -np https://physionet.org/files/mimiciii-demo/1.4/ -P mimic_demo_data/

After importing the necessary packages, it is time to initialize a new Cardea object. This object serves as the main entry point for calling any method within Cardea.

In [5]:
cd = Cardea(data_path='mimic_demo_data/physionet.org/files/mimiciii-demo/1.4/', fhir=False)

In [6]:
cd.entityset

Entityset: mimic
  Entities:
    admissions [Rows: 129, Columns: 19]
    callout [Rows: 77, Columns: 24]
    caregivers [Rows: 7567, Columns: 4]
    chartevents [Rows: 758355, Columns: 15]
    cptevents [Rows: 1579, Columns: 12]
    d_cpt [Rows: 134, Columns: 9]
    d_icd_diagnoses [Rows: 14567, Columns: 4]
    d_icd_procedures [Rows: 3882, Columns: 4]
    d_items [Rows: 12487, Columns: 10]
    d_labitems [Rows: 753, Columns: 6]
    datetimeevents [Rows: 15551, Columns: 14]
    diagnoses_icd [Rows: 1761, Columns: 5]
    drgcodes [Rows: 297, Columns: 8]
    icustays [Rows: 136, Columns: 12]
    inputevents_cv [Rows: 34799, Columns: 22]
    inputevents_mv [Rows: 13224, Columns: 31]
    labevents [Rows: 76074, Columns: 9]
    microbiologyevents [Rows: 2003, Columns: 16]
    noteevents [Rows: 0, Columns: 11]
    outputevents [Rows: 11320, Columns: 13]
    patients [Rows: 100, Columns: 8]
    prescriptions [Rows: 10398, Columns: 19]
    procedureevents_mv [Rows: 753, Columns: 25]
    proced

The first section (Entities) lists the resources that were loaded into the framework, that is, the available dataframes with their numbers of rows and columns. The second section describes the relationships between the entities, which boil down to matching id columns.
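To make "matching id columns" concrete, here is a minimal plain-Python sketch. The two toy tables and their values are invented for illustration; only the table names and the `subject_id` and `hadm_id` columns mirror MIMIC-III. The helper `join_child_to_parent` is hypothetical and is not how Cardea stores or traverses its entityset.

```python
# Toy stand-ins for two entities; the rows are invented for illustration.
patients = [
    {"subject_id": 1, "gender": "F"},
    {"subject_id": 2, "gender": "M"},
]
admissions = [
    {"hadm_id": 10, "subject_id": 1, "admission_type": "EMERGENCY"},
    {"hadm_id": 11, "subject_id": 1, "admission_type": "ELECTIVE"},
    {"hadm_id": 12, "subject_id": 2, "admission_type": "EMERGENCY"},
]


def join_child_to_parent(parent_rows, child_rows, key, parent_name):
    """Attach to each child row the columns of the parent row with the same key."""
    parents_by_key = {row[key]: row for row in parent_rows}
    joined = []
    for child in child_rows:
        parent = parents_by_key[child[key]]
        merged = dict(child)
        # Prefix parent columns with the parent table name, e.g. `patients.gender`.
        for column, value in parent.items():
            if column != key:
                merged[f"{parent_name}.{column}"] = value
        joined.append(merged)
    return joined


# The relationship here is patients.subject_id -> admissions.subject_id.
joined_rows = join_child_to_parent(patients, admissions, "subject_id", "patients")
for row in joined_rows:
    print(row)
```

Each admission row ends up carrying its patient's attributes, which is the kind of lookup a relationship between two entities makes possible.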

## Problem Definition
You can display all the problems currently implemented in Cardea with the `list_labelers` method. Note that `appointment_no_show` is not supported on MIMIC data.

In [7]:
cd.list_labelers()

{'appointment_no_show',
 'diagnosis_prediction',
 'length_of_stay',
 'mortality_prediction',
 'readmission'}

In this case, we define the problem as _Mortality Prediction_, which predicts patient mortality. Note that you can create your own `labeler` function to define a custom prediction task.
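As a rough illustration of what a labeling step produces, the sketch below builds one label per admission in plain Python. It assumes the label comes from the `hospital_expire_flag` column of the MIMIC-III `admissions` table and that the admission time serves as the cutoff time. The function name, the output layout, and the rows are hypothetical; this does not reproduce the signature Cardea expects from a custom labeler.

```python
from datetime import datetime

# Invented admissions rows; `hospital_expire_flag` is the MIMIC-III column
# that marks whether the patient died during the hospital stay.
toy_admissions = [
    {"hadm_id": 100, "admittime": datetime(2130, 1, 5, 8, 0), "hospital_expire_flag": 0},
    {"hadm_id": 101, "admittime": datetime(2131, 3, 9, 14, 30), "hospital_expire_flag": 1},
]


def label_mortality(rows):
    """Hypothetical labeler: one (instance id, cutoff time, label) record per admission."""
    labels = []
    for row in rows:
        labels.append({
            "hadm_id": row["hadm_id"],
            # Only data known at or before the cutoff should feed the features.
            "time": row["admittime"],
            "label": bool(row["hospital_expire_flag"]),
        })
    return labels


toy_label_times = label_mortality(toy_admissions)
for record in toy_label_times:
    print(record)
```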



In [8]:
# select problem
from cardea.data_labeling.mortality import mortality_prediction

label_times = cd.label(mortality_prediction)

## AutoML
Automated machine learning consists of two main phases:

* **automated feature engineering**: through automated feature engineering (autofe), we extract information called features. Finding good features is crucial for building models, reaching a satisfactory answer, and interpreting the dataset as a whole.
* **automated modeling**: the library supports running multiple machine learning algorithms and optimizing their hyperparameters in order to find the best model.

Typically, this process is complex and comprises many elements, but Cardea provides an easier way of handling both phases.
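To give a feel for what automated feature engineering derives, the sketch below computes calendar features from a timestamp column by hand. The feature names follow the `MONTH(...)`, `WEEKDAY(...)`, `YEAR(...)` pattern, but the helper `date_features` and the sample row are hypothetical, and this is only an approximation of what the library does internally.

```python
from datetime import datetime


def date_features(name, value):
    """Derive calendar features from one timestamp; missing values stay missing."""
    if value is None:
        return {f"MONTH({name})": None, f"WEEKDAY({name})": None, f"YEAR({name})": None}
    return {
        f"MONTH({name})": value.month,
        f"WEEKDAY({name})": value.weekday(),  # Monday is 0, Sunday is 6
        f"YEAR({name})": value.year,
    }


# Invented admission row; `edregtime` is missing for this admission.
toy_row = {"admittime": datetime(2129, 5, 2, 0, 13), "edregtime": None}

toy_features = {}
for column, value in toy_row.items():
    toy_features.update(date_features(column, value))

print(toy_features)
```

An automated tool applies many such transformations across all related tables, which is why the resulting feature matrix has far more columns than the original table.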

In [9]:
# feature engineering
feature_matrix = cd.featurize(label_times[:1000]) # takes a while for the full dataset
feature_matrix.head(5)

| hadm_id | row_id | subject_id | admission_type | admission_location | insurance | language | religion | marital_status | ethnicity | diagnosis | ... | MONTH(edregtime) | WEEKDAY(admittime) | WEEKDAY(edouttime) | WEEKDAY(edregtime) | YEAR(admittime) | YEAR(edouttime) | YEAR(edregtime) | patients.row_id | patients.gender | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100375.0 | 12305.0 | 10056.0 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | Medicare | | CHRISTIAN SCIENTIST | UNKNOWN (DEFAULT) | WHITE | SEPSIS | ... | | 0 | | | 2129 | | | 9514.0 | F | False |
| 100969.0 | 40554.0 | 42430.0 | EMERGENCY | EMERGENCY ROOM ADMIT | Medicare | ENGL | CHRISTIAN SCIENTIST | | WHITE | CEREBROVASCULAR ACCIDENT | ... | 11.0 | 0 | 0.0 | 0.0 | 2142 | 2142.0 | 2142.0 | 31429.0 | M | True |
| 101361.0 | 40379.0 | 41914.0 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | Medicare | ENGL | CATHOLIC | MARRIED | WHITE | ELEVATED LIVER FUNCTIONS;S/P LIVER TRANSPLANT | ... | | 2 | | | 2145 | | | 31300.0 | M | False |
| 102203.0 | 40462.0 | 42135.0 | EMERGENCY | CLINIC REFERRAL/PREMATURE | Medicaid | ENGL | MUSLIM | MARRIED | AMERICAN INDIAN/ALASKA NATIVE FEDERALLY RECOGN... | FAILURE TO THRIVE | ... | | 2 | | | 2127 | | | 31350.0 | M | False |
| 103379.0 | 41092.0 | 44228.0 | EMERGENCY | EMERGENCY ROOM ADMIT | Private | ENGL | NOT SPECIFIED | SINGLE | WHITE | CHOLANGITIS | ... | 12.0 | 5 | 5.0 | 5.0 | 2170 | 2170.0 | 2170.0 | 31872.0 | F | False |


In [10]:
# shuffle the dataframe
feature_matrix = feature_matrix.sample(frac=1)

# pop the target labels
y = list(feature_matrix.pop('label'))
X = feature_matrix.values

In [11]:
# split the data into train and test
X_train, X_test, y_train, y_test = cd.train_test_split(
    X, y, test_size=0.2, shuffle=True)
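For readers unfamiliar with a shuffled hold-out split, the following plain-Python sketch shows the idea behind `test_size=0.2` and `shuffle=True`. The function `shuffled_split`, its `seed` parameter, and the toy data are hypothetical; this is not the implementation behind `cd.train_test_split`.

```python
import random


def shuffled_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out the last `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]
    # Index X and y with the same indices so rows and labels stay aligned.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Invented data: ten rows with one feature each, labeled by parity.
X_toy = [[i] for i in range(10)]
y_toy = [i % 2 == 0 for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffled_split(X_toy, y_toy, test_size=0.2)
print(len(X_tr), len(X_te))
```

The key property is that features and labels are shuffled together, so each row keeps its own label on both sides of the split.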

The pipeline variable represents the order in which machine learning algorithms are executed. It can be used to build a single end-to-end model for our problem task. For example:

```
pipeline = MLPipeline(['sklearn.ensemble.RandomForestClassifier'])
```

Here we define a Random Forest model.

In addition, you can use a sequence of primitives that allows you to (1) impute missing values, (2) one-hot encode the data, and (3) use Random Forest. This can be modeled as:
```
pipeline = MLPipeline([
    'sklearn.impute.SimpleImputer',
    'sklearn.preprocessing.OneHotEncoder',
    'sklearn.ensemble.RandomForestClassifier'])
```
If you run this on the MIMIC data, however, you will get errors, because the default SimpleImputer hyperparameters (i.e., function arguments) only work for continuous data. If you change the `strategy` hyperparameter to `most_frequent` instead, it will work on mixed categorical and continuous data. You can change hyperparameters in the pipeline as follows:
```
pipeline = MLPipeline(
    [
        'sklearn.impute.SimpleImputer',
        'sklearn.preprocessing.OneHotEncoder',
        'sklearn.ensemble.RandomForestClassifier',
    ],
    init_params={
        'sklearn.impute.SimpleImputer': {'strategy': 'most_frequent'},
    },
)
```
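To see why `most_frequent` copes with mixed data while an averaging strategy cannot, here is a small plain-Python sketch of most-frequent imputation. The helper `impute_most_frequent` and the rows are hypothetical and only approximate the behavior of scikit-learn's imputer. The sketch assumes missing values are represented as `None` and that every column has at least one observed value.

```python
from collections import Counter


def impute_most_frequent(rows):
    """Replace None in each column with that column's most common observed value."""
    n_columns = len(rows[0])
    fill_values = []
    for col in range(n_columns):
        observed = [row[col] for row in rows if row[col] is not None]
        # Counting works for strings and numbers alike; a mean would fail on strings.
        fill_values.append(Counter(observed).most_common(1)[0][0])
    return [
        [fill_values[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Invented rows mixing a categorical column and a numeric column.
toy_rows = [
    ["Medicare", 2.0],
    [None, 2.0],
    ["Medicare", None],
    ["Private", 5.0],
]

imputed_rows = impute_most_frequent(toy_rows)
print(imputed_rows)
```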

More on machine learning algorithms and MLPrimitives can be found here: https://HDI-Project.github.io/MLPrimitives

In [12]:
# create a ML pipeline
pipeline = MLPipeline(
    [
        'sklearn.impute.SimpleImputer',
        'sklearn.preprocessing.OneHotEncoder',
        'sklearn.ensemble.RandomForestClassifier',
    ],
    init_params={
        'sklearn.impute.SimpleImputer': {'strategy': 'most_frequent'},
    },
)
cd.set_pipeline(pipeline)

In [13]:
# modeling
cd.fit(X=X_train, y=y_train)
results = cd.evaluate(X=X, y=y)
results

Accuracy     0.922481
F1 Macro     0.901976
Precision    0.949495
Recall       0.875000
dtype: float64
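The four reported metrics can be reproduced by hand for a binary task. The sketch below is a plain-Python illustration under two assumptions: precision and recall refer to the positive (`True`) class, and F1 Macro is the unweighted mean of the per-class F1 scores. The function `binary_metrics` and the toy labels are hypothetical, and this is not necessarily how `cd.evaluate` computes its scores.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, positive-class precision and recall, and macro-averaged F1."""

    def scores_for(positive):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == positive and p == positive for t, p in pairs)
        fp = sum(t != positive and p == positive for t, p in pairs)
        fn = sum(t == positive and p != positive for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        return precision, recall, f1

    precision, recall, f1_positive = scores_for(True)
    _, _, f1_negative = scores_for(False)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "Accuracy": accuracy,
        "F1 Macro": (f1_positive + f1_negative) / 2,
        "Precision": precision,
        "Recall": recall,
    }


# Invented labels: one of the two positive cases is missed.
toy_true = [True, True, False, False]
toy_pred = [True, False, False, False]

toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

Applying the same function to predictions on `X_test` and `y_test` only would give an estimate based purely on rows the model did not see during fitting.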