# Cardea

Cardea is a machine learning library built on top of the FHIR data standard to solve common prediction problems from electronic health records.

This Python notebook demonstrates Cardea's workflow from a user's perspective. It is organized around the elements of the framework. Documentation: https://DAI-Lab.github.io/Cardea/

This notebook currently supports Cardea version 0.1.0.

In this tutorial, we show how to predict whether a patient will show up to an appointment, using Kaggle's Medical Appointment No Shows dataset. Over 30% of patients miss their scheduled appointments, which results in poor use of time and resources. Through machine learning, we want to predict future appointment no-shows with an end-to-end library that is easy to interpret.

In [1]:
%load_ext autoreload
%autoreload 2

from cardea import Cardea

In [2]:
# optional
import warnings
warnings.filterwarnings("ignore")

After importing the necessary packages, it is time to create a new Cardea object. This object is the entry point for calling any method within cardea.

In [3]:
cd = Cardea()

## Load Kaggle Dataset
Using cardea's `load_entityset`, we can load local files that are in [FHIR](https://hl7.org/fhir) format. In order to try out cardea, we instead download Kaggle's open dataset and load it from the extracted `kaggle` folder. Cardea can also load the Kaggle dataset into memory automatically when no folder path is given.

In [4]:
! curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip -d kaggle kaggle.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2988k  100 2988k    0     0  6345k      0 --:--:-- --:--:-- --:--:-- 6332k
Archive:  kaggle.zip
  inflating: kaggle/Patient.csv      
  inflating: kaggle/Coding.csv       
  inflating: kaggle/Appointment_Participant.csv  
  inflating: kaggle/Address.csv      
 extracting: kaggle/CodeableConcept.csv  
  inflating: kaggle/Reference.csv    
  inflating: kaggle/Observation.csv  
  inflating: kaggle/Identifier.csv   
  inflating: kaggle/Appointment.csv  


In [5]:
cd.load_entityset(data_path='kaggle', fhir=True)

# to view the loaded entityset
cd.es

Entityset: fhir
  Entities:
    Patient [Rows: 6100, Columns: 4]
    Coding [Rows: 3, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Address [Rows: 81, Columns: 2]
    CodeableConcept [Rows: 4, Columns: 2]
    Reference [Rows: 6100, Columns: 1]
    Observation [Rows: 110527, Columns: 3]
    Identifier [Rows: 227151, Columns: 1]
    Appointment [Rows: 110527, Columns: 5]
  Relationships:
    Patient.address -> Address.object_id
    Appointment_Participant.actor -> Reference.identifier
    CodeableConcept.coding -> Coding.object_id
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id

The first section (Entities) lists the resources that were loaded into the framework, that is, the available dataframes along with their number of rows and columns. The second section (Relationships) describes how the resources are connected. For example, the Patient resource has an `address` variable that is connected to the __Address__ resource.
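
To make the relationship notation concrete, here is a minimal sketch in plain Python of what `Patient.address -> Address.object_id` means: each patient row stores a key that points at one address row. The rows, the `gender` and `city` columns, and the `follow_relationship` helper are invented for illustration; this is not how Cardea stores or resolves relationships internally.

```python
# Toy rows standing in for two FHIR resources (all values are invented).
patients = [
    {"identifier": "p1", "gender": "female", "address": "a1"},
    {"identifier": "p2", "gender": "male", "address": "a2"},
    {"identifier": "p3", "gender": "female", "address": "a1"},
]
addresses = [
    {"object_id": "a1", "city": "Springfield"},
    {"object_id": "a2", "city": "Riverside"},
]


def follow_relationship(children, child_key, parents, parent_key):
    """Attach to each child row the parent row it points to (many-to-one)."""
    index = {row[parent_key]: row for row in parents}
    return [{**child, "_parent": index.get(child[child_key])} for child in children]


# Patient.address -> Address.object_id
joined = follow_relationship(patients, "address", addresses, "object_id")
for row in joined:
    print(row["identifier"], "->", row["_parent"]["city"])
```

Several patients can point at the same address, which is why the arrow goes from the child resource (Patient) to the parent resource (Address).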

## Problem Definition
You can display all the problems currently implemented in cardea with the `list_labelers` method.

In [6]:
cd.list_labelers()

{'appointment_no_show',
 'diagnosis_prediction',
 'length_of_stay',
 'mortality',
 'readmission'}

In this case, we will define the problem as _Missed Appointment_ to predict whether a patient will miss their next appointment.



In [8]:
# select problem
label_times = cd.create_label_times()
label_times.head(5)

Elapsed: 01:50 | Remaining: 00:00 | Progress: 100%|██████████| identifier: 110527/110527 


| | identifier | time | missed |
|---|---|---|---|
| 0 | 5030230 | 2015-11-10 07:13:56 | True |
| 1 | 5122866 | 2015-12-03 08:17:28 | False |
| 2 | 5134197 | 2015-12-07 10:40:59 | False |
| 3 | 5134220 | 2015-12-07 10:42:42 | True |
| 4 | 5134223 | 2015-12-07 10:43:01 | True |
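
Each row of the label times pairs an appointment identifier with a cutoff time and the target label. The sketch below rebuilds the first three rows by hand. It rests on two assumptions that the notebook does not state: that `time` is the moment the appointment was created, and that `missed` is true exactly when the appointment status is `noshow` (the status values are taken from the feature matrix shown later in this notebook). The `make_label_times` helper is hypothetical and is not Cardea's labeler.

```python
from datetime import datetime

# Identifiers and timestamps copied from the output above; statuses from the
# feature matrix printed later. "created" as the field name is an assumption.
appointments = [
    {"identifier": 5030230, "created": "2015-11-10 07:13:56", "status": "noshow"},
    {"identifier": 5122866, "created": "2015-12-03 08:17:28", "status": "fulfilled"},
    {"identifier": 5134197, "created": "2015-12-07 10:40:59", "status": "fulfilled"},
]


def make_label_times(rows):
    """Return one {identifier, time, missed} record per appointment."""
    records = []
    for row in rows:
        records.append({
            "identifier": row["identifier"],
            # Cutoff time: features should only use data known at this moment.
            "time": datetime.strptime(row["created"], "%Y-%m-%d %H:%M:%S"),
            # Assumed labeling rule: a "noshow" status means the visit was missed.
            "missed": row["status"] == "noshow",
        })
    return records


label_times_sketch = make_label_times(appointments)
for record in label_times_sketch:
    print(record["identifier"], record["time"], record["missed"])
```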


## AutoML
Automated machine learning consists of two main phases:

* **automated feature engineering**: through autofe, we extract information from the data in the form of features. Finding good features is crucial for building models, and it helps both in answering the prediction question and in interpreting the dataset as a whole.
* **automated modeling**: the library supports running multiple machine learning algorithms and optimizes their hyperparameters in order to find the best model.

Typically, this process is complex and comprises many elements, but Cardea provides an easier way of handling both phases.

In [10]:
# feature engineering
feature_matrix = cd.generate_features(label_times[:1000], verbose=True) # takes a while for the full dataset
feature_matrix.head(5)

Built 14 features
Elapsed: 00:26 | Progress: 100%|██████████


| identifier | status | participant | DAY(created) | DAY(start) | IS_WEEKEND(created) | IS_WEEKEND(start) | MONTH(created) | MONTH(start) | WEEKDAY(created) | WEEKDAY(start) | YEAR(created) | YEAR(start) | Appointment_Participant.actor | Appointment_Participant.COUNT(Appointment) | missed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5030230 | noshow | 3353377007 | 10 | 4 | False | False | 11 | 5 | 1 | 2 | 2015 | 2016 | 832000000000000 | 56 | True |
| 5122866 | fulfilled | 486500845 | 3 | 2 | False | False | 12 | 5 | 3 | 0 | 2015 | 2016 | 91600000000000 | 55 | False |
| 5134197 | fulfilled | 64062658 | 7 | 3 | False | False | 12 | 6 | 0 | 4 | 2015 | 2016 | 1220000000000 | 33 | False |
| 5134220 | noshow | 207195819 | 7 | 3 | False | False | 12 | 6 | 0 | 4 | 2015 | 2016 | 31900000000000 | 48 | True |
| 5134223 | noshow | 1089855247 | 7 | 3 | False | False | 12 | 6 | 0 | 4 | 2015 | 2016 | 9580000000000 | 38 | True |
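
Column names such as `DAY(created)` or `IS_WEEKEND(start)` follow the pattern of a transformation applied to a timestamp column. As a rough sketch of what such date features compute, the snippet below derives them with the standard library for the timestamp `2015-11-10 07:13:56`, which is assumed here to be the creation time of appointment 5030230. The `datetime_features` helper is hypothetical; Cardea's own feature generation is not shown in this notebook.

```python
from datetime import datetime


def datetime_features(name, timestamp):
    """Derive simple calendar features from one timestamp."""
    return {
        f"DAY({name})": timestamp.day,
        f"MONTH({name})": timestamp.month,
        f"YEAR({name})": timestamp.year,
        f"WEEKDAY({name})": timestamp.weekday(),        # Monday is 0, Sunday is 6
        f"IS_WEEKEND({name})": timestamp.weekday() >= 5,  # Saturday or Sunday
    }


created = datetime(2015, 11, 10, 7, 13, 56)
features = datetime_features("created", created)
for key, value in features.items():
    print(key, value)
```

Under that assumption, the values agree with the `created` columns of the first row in the feature matrix.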


Once we have the features, we can split the data into training and testing sets.

In [13]:
# pop the target labels
y = feature_matrix.pop('missed').values
X = feature_matrix.values

X_train, X_test, y_train, y_test = cd.train_test_split(
    X, y, test_size=0.2, shuffle=True)
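
For readers unfamiliar with a shuffled hold-out split, the idea can be sketched in a few lines of plain Python: shuffle the row indices once, then cut them into a test part and a training part so that features and labels stay aligned. The `shuffled_split` function, its `seed` parameter, and the toy data are illustrative assumptions; `cd.train_test_split` may differ in details such as how the test size is rounded.

```python
import random


def shuffled_split(X, y, test_size=0.2, seed=0):
    """Split rows and labels into train and test parts after one shuffle."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data: ten single-feature rows with a label derived from the feature.
X_demo = [[i] for i in range(10)]
y_demo = [i % 2 == 0 for i in range(10)]
X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr), "training rows,", len(X_te), "testing rows")
```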

Now that our feature matrix is properly divided, we can use it to train our machine learning pipeline. This step covers modeling, optimizing hyperparameters, and finding the best model.

In [15]:
cd.set_pipeline('Random Forest')
cd.fit(X_train, y_train)
y_pred = cd.predict(X_test)

Finally, you can evaluate the performance of the model.

In [19]:
cd.evaluate(X, y, fit=True, test_size=0.2, shuffle=True)

Accuracy     1.0
F1 Macro     1.0
Precision    1.0
Recall       1.0
dtype: float64
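
Perfect scores on every metric are worth a second look. In the five feature matrix rows printed earlier, `status` is `noshow` exactly when `missed` is `True`, so a model could reach these scores by reading the label off that column. The sketch below uses only those five rows to show how accuracy and macro F1 are computed and how a one-line rule on `status` reproduces a score of 1.0. It is a hint based on five visible rows, not a demonstration of what the trained pipeline actually learned, and the helper functions are illustrative rather than Cardea's evaluation code.

```python
# (status, missed) pairs copied from the five feature matrix rows shown earlier.
rows = [
    ("noshow", True),
    ("fulfilled", False),
    ("fulfilled", False),
    ("noshow", True),
    ("noshow", True),
]
y_true = [missed for _, missed in rows]
# "Model" that ignores everything except the status column.
y_rule = [status == "noshow" for status, _ in rows]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


print("Accuracy", accuracy(y_true, y_rule))
print("F1 Macro", f1_macro(y_true, y_rule))
```

If the same pattern holds across the whole dataset, dropping `status` from the feature matrix before training would give a more realistic estimate of predictive performance.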