# Decision Tree Classification

## **1 Introduction**

This notebook is my learning material for keeping track of the concepts covered in the [Advanced Learning Algorithms](https://www.coursera.org/learn/advanced-learning-algorithms?specialization=machine-learning-introduction) course from the [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning-introduction) offered by DeepLearning.AI and Stanford University.

Throughout this notebook, I use the dataset [Prediction model of in-hospital mortality in intensive care unit patients with heart failure: machine learning-based, retrospective analysis of the MIMIC-III database dataset](https://datadryad.org/stash/dataset/doi:10.5061/dryad.0p2ngf1zd), created by Zhou Jingmin.

### **1.0.1 Imports**

In [None]:
import wget

# Data manipulation
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Decision tree
import xgboost as xgb

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Options for seaborn
sns.set_style('darkgrid')
%matplotlib inline

from IPython import get_ipython
ipython = get_ipython()

# Autoreload extensions
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

### **1.1 Data**

#### **1.1.0.1 Download**

In [None]:
url = 'https://datadryad.org/stash/downloads/file_stream/773992'
filename = wget.download(url)

#### **1.1.0.2 Import**

In [None]:
mortality = pd.read_csv(filename)

#### **1.1.1 Exploratory Data Analysis**

In [None]:
mortality.info()
mortality.describe()

In [None]:
print(f'Number of missing values: {mortality.isna().sum().sum()}')
print(f'Number of missing values per column:\n{mortality.isna().sum()}')
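
Before imputing, it can help to look at the dtypes of the affected columns. pandas cannot store NaN in an `int64` column, so `read_csv` loads any integer column with missing entries as `float64`. The sketch below illustrates this on a toy CSV; the column names `age` and `bmi` are placeholders and not necessarily columns of the dataset.

```python
import io

import pandas as pd

# Toy CSV: 'age' holds whole numbers but has one missing entry
toy_csv = io.StringIO("age,bmi\n72,30.1\n,28.4\n65,\n")
toy = pd.read_csv(toy_csv)

# Both columns come back as float64 because each contains a NaN
print(toy.dtypes)

# Columns whose observed values are all whole numbers
whole_number_cols = [c for c in toy.columns
                     if (toy[c].dropna() % 1 == 0).all()]
print(whole_number_cols)
```

A check on the observed values, as in the last step, is one way to tell whole-number columns apart from continuous ones once the dtype no longer carries that information.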

## **2 Classification**

### **2.1 Preprocessing**

#### **2.1.1 Missing values**

In [None]:
# Apply research team 'Missing Data Handling' recommendation
for c in mortality.columns[mortality.isna().any()]:
    # read_csv stores columns containing NaN as float64, so an int64 dtype
    # check never matches here; test for whole-number values instead
    if (mortality[c].dropna() % 1 == 0).all():
        # Replace missing values with the series' median
        mortality[c] = mortality[c].fillna(mortality[c].median())
    else:
        # Replace missing values with the series' mean
        mortality[c] = mortality[c].fillna(mortality[c].mean())

print(f'Number of missing values: {mortality.isna().sum().sum()}')
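
The cell above computes the fill values on the full table, so rows that later end up in the validation and test sets contribute to them. A common alternative, sketched here on a toy frame as an assumption rather than the research team's procedure, is to learn the fill value on the training rows only and reuse it elsewhere. The names `toy`, `train_part`, `held_out`, and the column `heart_rate` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing measurements
toy = pd.DataFrame({
    "heart_rate": [80.0, np.nan, 100.0, 90.0, np.nan, 120.0],
    "outcome": [0, 0, 1, 0, 1, 1],
})

# Pretend the first four rows are the training set
train_part = toy.iloc[:4].copy()
held_out = toy.iloc[4:].copy()

# Learn the fill value from the training rows only
fill_value = train_part["heart_rate"].mean()

# Reuse that single value for both parts
train_part["heart_rate"] = train_part["heart_rate"].fillna(fill_value)
held_out["heart_rate"] = held_out["heart_rate"].fillna(fill_value)

print(fill_value)
print(held_out)
```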

#### **2.1.2 Remove non-useful features**

In [None]:
mortality.drop(['group', 'ID'], axis=1, inplace=True)

#### **2.1.3 Split data**

In [None]:
X = mortality.drop('outcome', axis=1)
y = mortality['outcome']

X_tmp, X_test, y_tmp, y_test = train_test_split(X, y,
                                                 test_size=0.2,
                                                 random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                  train_size=0.75,
                                                  random_state=42)

print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
print(X_test.shape, y_test.shape)
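
The two calls above first hold out 20% of the rows for testing, then keep 75% of the remaining 80% for training, which gives roughly a 60/20/20 train/validation/test split. The sketch below checks that arithmetic on 100 synthetic rows and additionally passes `stratify`, which is not used in the notebook, to keep the class ratio similar across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with a 20% positive class (made-up data)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# Step 1: hold out 20% of the rows for the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Step 2: 75% of the remaining 80% -> 60% train, 20% validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, train_size=0.75, random_state=42, stratify=y_rest)

print(len(X_tr), len(X_va), len(X_te))
```

Whether stratification is appropriate here depends on how unbalanced `outcome` is, which the class counts from `y.value_counts()` would show.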

### **2.2 Model**

#### **2.2.1 Building**

In [None]:
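# Number of features; computed here but not passed to the classifier below
depth = X_train.shape[1]

# XGBoost classifier with the library's default hyperparameters
xgb_cl = xgb.XGBClassifier()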

#### **2.2.2 Training**

In [None]:
_ = xgb_cl.fit(X_train, y_train,
               eval_set=[(X_val, y_val)])

#### **2.2.3 Test**

In [None]:
preds = xgb_cl.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test, preds)}')
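
Accuracy alone can be misleading when one outcome is much rarer than the other, which is often the case for mortality labels. As a hedged illustration on made-up labels (not the notebook's predictions), the sketch below shows a classifier with 90% accuracy that still misses half of the positive cases.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 8 negatives and 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Made-up predictions: one of the two positives is missed
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)     # 9 of 10 correct
rec = recall_score(y_true, y_pred)       # 1 of 2 positives found
prec = precision_score(y_true, y_pred)   # 1 of 1 predicted positives correct
print(acc, rec, prec)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The same four calls could be applied to `y_test` and `preds` to see how the errors are distributed between the two outcomes.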

## **3 Results**

In [None]:
fig, ax = plt.subplots(figsize=(35, 35))

# Draw the first boosted tree (plot_tree needs the graphviz package)
xgb.plot_tree(xgb_cl, ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

xgb.plot_importance(xgb_cl,
                    show_values=True,
                    ax=ax)