# Q11
Tree Based Models - Q11 - 13/July

Medical records for 270 patients have been provided in the file 04_heart_disease.xlsx
https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU

    1) Find out the variable importance using a Decision Tree classifier to predict heart disease.
    2)
        a) Train a decision tree model to predict heart disease using only the top 5 important variables. Use the entire data for training.
        b) What is the accuracy of the model with 5-fold cross-validation?

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_excel("04_heart_disease.xlsx", sheet_name="data")
df.head(2)

   age  sex  chest_pain_type   BP  cholestrol  bloodsugarlevel  ECG_result  Max_heart_rate  Angina  oldpeak  slopepeak  major_vessels  thal  disease
0   70    1                4  130         322                0           2             109       0      2.4          2              3     3        1
1   67    0                3  115         564                0           2             160       0      1.6          2              0     7        0


In [3]:
df.columns

Index(['age', 'sex', 'chest_pain_type', 'BP', 'cholestrol', 'bloodsugarlevel',
       'ECG_result', 'Max_heart_rate', 'Angina', 'oldpeak ', 'slopepeak',
       'major_vessels', 'thal', 'disease'],
      dtype='object')

# 1. 
Find out the variable importance using a Decision Tree classifier to predict heart disease.

In [4]:
# fit a decision tree classifier on the entire data to get the top 5 most important features
x_cols = ['age', 'sex', 'chest_pain_type', 'BP', 'cholestrol', 'bloodsugarlevel',
       'ECG_result', 'Max_heart_rate', 'Angina', 'oldpeak ', 'slopepeak',
       'major_vessels', 'thal']
y_var = 'disease'
df[x_cols].head(2)
dt_model = DecisionTreeClassifier(criterion='gini')
dt_model.fit(df[x_cols], df[y_var])

DecisionTreeClassifier()

In [5]:
feature_importance = pd.Series(dt_model.feature_importances_)
feature_importance.index = x_cols
feature_importance.sort_values(ascending=False)

thal               0.269628
major_vessels      0.149115
cholestrol         0.084385
chest_pain_type    0.081865
age                0.073170
BP                 0.067827
oldpeak            0.065650
Max_heart_rate     0.051350
sex                0.046375
Angina             0.045635
slopepeak          0.039250
ECG_result         0.025750
bloodsugarlevel    0.000000
dtype: float64

In [6]:
top_5_features = feature_importance.sort_values(ascending=False).index[0:5].to_list()
top_5_features

['thal', 'major_vessels', 'cholestrol', 'chest_pain_type', 'age']
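
The ranking above can be reproduced in a more compact form. The sketch below is only an illustration on synthetic data generated with `make_classification`, since the Excel file is not bundled here; the column names `f0` to `f7` and `random_state=0` are assumptions for the example, not values from the notebook. Fixing `random_state` matters because scikit-learn permutes the features at each split, so ties between equally good splits can be broken differently from run to run and the importance ranking can shift.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data (hypothetical columns f0..f7)
X, y = make_classification(n_samples=270, n_features=8, n_informative=4,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

# A fixed random_state makes the tree, and so the ranking, repeatable
model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Label each importance with its column name, then keep the five largest
importance = pd.Series(model.feature_importances_, index=X.columns)
top_5 = importance.nlargest(5).index.to_list()

print(importance.sort_values(ascending=False))
print(top_5)
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction in the fitted tree.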

# 2a
Train a decision tree model to predict heart disease using only the top 5 important variables. Use the entire data for training.

In [7]:
dt_model = DecisionTreeClassifier(criterion='gini')
dt_model.fit(df[top_5_features], df[y_var])
dt_model.score(df[top_5_features], df[y_var]) * 100 

100.0

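The model scores 100% accuracy, but this is measured on the same data it was trained on. An unrestricted decision tree can keep splitting until it memorizes every training row, so this score says little about unseen patients, and the model is very likely over-fitting.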
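
One way to see the over-fitting directly is to hold out part of the data and compare the two scores. The following is a minimal sketch on synthetic data; the data set, the 70/30 split, and `random_state=0` are assumptions chosen for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five selected features (not the real data)
X, y = make_classification(n_samples=270, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# An unrestricted tree keeps splitting until every training leaf is pure
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The gap between the two printed numbers is a rough measure of how much the tree has memorized rather than learned.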

# 2b
What is the accuracy of the model with 5-fold cross-validation?

In [8]:
tune_parm_space = {'min_samples_split':list(range(2,40)),  # integer values must be >= 2
                   'max_depth':list(range(1,40))
                  }

clf = GridSearchCV(DecisionTreeClassifier(), tune_parm_space, cv=5)
clf.fit(df[top_5_features], df[y_var])
# best_score_ is the mean accuracy over the 5 held-out folds, not a training score
print(f"Best mean CV accuracy is {np.round(clf.best_score_ * 100, 2)}")

Best mean CV accuracy is 82.96
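
`GridSearchCV` also records which parameter combination produced the best score. The sketch below shows how to read it back, using synthetic data and a much smaller grid than the notebook's to keep it fast; the grid values and `random_state=0` are assumptions. Note that `min_samples_split` must be at least 2 when given as an integer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five selected features (not the real data)
X, y = make_classification(n_samples=270, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Integer min_samples_split values start at 2; 1 is not a valid setting
param_grid = {"min_samples_split": [2, 5, 10, 20],
              "max_depth": [1, 2, 3, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# best_score_ is the mean accuracy over the 5 held-out folds for best_params_
print(search.best_params_)
print(f"mean CV accuracy: {search.best_score_ * 100:.2f}")
```

Because the best parameters are chosen on the same folds that produce `best_score_`, that score tends to be somewhat optimistic as an estimate of accuracy on new data.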


With 5-fold cross-validation, the mean accuracy on the held-out folds for the best parameter setting is approximately 83%, well below the 100% measured on the training data.
Cross-validation is therefore an important part of evaluating a decision tree: without it, the training score hides how much the model over-fits the data.
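
If the goal is the 5-fold accuracy of the untuned tree from 2a, rather than of the best tuned tree, `cross_val_score` gives it directly. This is a minimal sketch on synthetic data (the data set and `random_state=0` are assumptions); on the notebook's data, `X` and `y` would presumably be `df[top_5_features]` and `df[y_var]`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five selected features (not the real data)
X, y = make_classification(n_samples=270, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Each of the 5 scores is the accuracy on one fold the tree was not trained on
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(scores)
print(f"mean accuracy: {scores.mean() * 100:.2f} (std {scores.std() * 100:.2f})")
```

Reporting the spread alongside the mean helps on a data set of this size, where each fold holds only about 54 rows and single-fold scores can vary noticeably.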