# Week 8: Comparing Alternative Models
# Additional Exercises 

For this week, we will continue with the heart failure dataset. We will look at decision trees and random forests and compare the performance of the two models on this dataset. 

Q1. For your first task, please use the `train_test_split` function from scikit-learn to split the data into a training set and a testing set. Set the `shuffle` parameter to True and split the data so that 20% of the observations are in the test set. Finally, set the seed (`random_state`) to an arbitrary number, 7, for reproducibility.

<span style="background-color: #FFD700">**Write your code below**</span> 

In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [31]:
df = pd.read_csv("./Heart Failure Clinical Records.csv")
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


In [36]:
from sklearn.metrics import accuracy_score, confusion_matrix

def results(pred, true, test_str, title_str):
    classes = ['0', '1']
    
    print(test_str, round(accuracy_score(true, pred), 3))

    # Calculating confusion matrix, sensitivity, and specificity
    cm = confusion_matrix(true, pred)

    # For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = cm.ravel()
    print("Sensitivity: ", (tp / (tp + fn)))
    print("Specificity: ", (tn / (tn + fp)))

    # Plotting confusion matrix
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax, fmt='g') 
    ax.set_xlabel('Predicted')
    ax.set_ylabel('True') 
    ax.set_title(title_str) 
    ax.xaxis.set_ticklabels(classes) 
    ax.yaxis.set_ticklabels(classes, rotation=0)
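
To make the sensitivity and specificity calculation in `results` concrete, here is a small worked example. The label vectors are made up for illustration and are not taken from the heart failure data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (not from the dataset)
true = [0, 0, 0, 1, 1, 1, 1, 0]
pred = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(true, pred).ravel()

sensitivity = tp / (tp + fn)  # share of true 1s that were predicted as 1
specificity = tn / (tn + fp)  # share of true 0s that were predicted as 0

print(tn, fp, fn, tp)  # 3 1 1 3
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

Here three of the four positives and three of the four negatives are classified correctly, so both values come out to 0.75.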

In [78]:
target = df["DEATH_EVENT"]

# Removing time due to target leakage
predictors = df.drop(['DEATH_EVENT', 'time'], axis = 1)

# TODO: Split the dataset into training and testing.
train_inputs, test_inputs, train_targets, test_targets = train_test_split(predictors, target, test_size = 0.20, shuffle = True, random_state=7)

# Scaling the features (fit on the training set only, then apply to both sets)
sc = StandardScaler()
train_scaled = sc.fit_transform(train_inputs.astype("float64"))
test_scaled = sc.transform(test_inputs.astype("float64"))
train_scaled = pd.DataFrame(train_scaled, columns=train_inputs.columns)
test_scaled = pd.DataFrame(test_scaled, columns=test_inputs.columns)
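
As a quick sanity check on the split-then-scale pattern above, the following sketch repeats it on synthetic data. The column names `a`, `b`, `c` and all values are hypothetical stand-ins, not the heart failure records. It shows that an 80/20 split of 100 rows gives 80 and 20 rows, and that the scaled training columns have a mean of roughly zero while the test columns generally do not, since the scaler is fit on the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and target (hypothetical values)
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(loc=50, scale=10, size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.integers(0, 2, size=100), name="label")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=7
)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean().round(3))  # approximately 0 for every column
print(X_test_scaled.mean().round(3))   # usually close to, but not exactly, 0
```

Passing `index=` when rebuilding the DataFrame is optional; it simply keeps the scaled rows aligned with the original row labels.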

Q2. Now that we have preprocessed our data, we would like to see how a decision tree classifier compares to logistic regression. Fill in the code snippets to train the model on our training set and generate predictions using our testing set. Use the `fit` and `predict` methods.

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV

random_state = 0

dt = tree.DecisionTreeClassifier(random_state=random_state)

criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12,14]
class_weight = [None, 'balanced']

parameters = dict(criterion=criterion,
                  max_depth=max_depth,
                  class_weight=class_weight)

clf_GS = GridSearchCV(dt, parameters)

#TODO: train on the data


#TODO: Generate predictions
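
If the `fit`/`predict` pattern with `GridSearchCV` is unfamiliar, the sketch below shows it on a synthetic dataset from `make_classification`. The data, the variable names `search` and `param_grid`, and the reduced grid are assumptions for illustration only. By default `GridSearchCV` uses 5-fold cross-validation and refits the best parameter combination on the full training data, so `predict` can be called on the search object directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the heart failure dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# A small, hypothetical parameter grid
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid)

# fit() runs the cross-validated search and refits the best model
search.fit(X_train, y_train)

# predict() delegates to the refitted best estimator
test_pred = search.predict(X_test)

print(search.best_params_)
print(test_pred[:10])
```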

Q3. Run the code below to generate a visual representation of your Decision Tree. You may notice that some of the values that the features are split on are negative, which is clearly impossible in a biological context. Where in our workflow did we alter the features for this to occur, and what can we do to "fix" this?

<span style="background-color: #FFD700">**Write Answer Below**</span> 

In [None]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf_GS.best_estimator_,
                    feature_names=train_scaled.columns.tolist(),
                    class_names=['0', '1'],
                    filled=True)
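
When the plotted tree is too large to read comfortably, a plain-text view of the same splits can help. The sketch below uses scikit-learn's `export_text` on a small tree trained on synthetic data. The feature names `f0` to `f3` and the data are hypothetical, so the thresholds it prints say nothing about the heart failure features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the text output short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each line shows a split condition or the predicted class at a leaf
text_view = export_text(small_tree, feature_names=feature_names)
print(text_view)
```

With the notebook's own objects, the analogous call would pass `clf_GS.best_estimator_` and `train_scaled.columns.tolist()`.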

Q4: Below, we will look at the data using a Random Forest model. Investigate what happens when you alter the maximum depth (number of layers) of the trees in your random forest. First, set `max_depth` (the variable itself or its entry in the `parameters` dictionary) to 2 (remember that the value must be in a list even if there is only one). Afterwards, change it to 6. What effect does this have on performance, and can you think of a reason why?

<span style="background-color: #FFD700">**Write Answer Below**</span> 

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

random_state = 0

rf = RandomForestClassifier(random_state=random_state)

criterion = ['gini', 'entropy', 'log_loss']
# TODO: Change max depth parameter (start with [2], then rerun with [6])
max_depth = [2]
class_weight = [None, 'balanced']

parameters = dict(criterion=criterion,
                  max_depth=max_depth,
                  class_weight=class_weight)

clf_GS = GridSearchCV(rf, parameters)
clf_GS.fit(train_scaled, train_targets)

print(f"Best Criterion: {clf_GS.best_estimator_.get_params()['criterion']}")
print(f"Best max_depth: {clf_GS.best_estimator_.get_params()['max_depth']}")
print(f"Best class weighting: {clf_GS.best_estimator_.get_params()['class_weight']}")
print(f"{clf_GS.best_estimator_.get_params()}")
print("\n")

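# Print both scores; bare expressions in the middle of a cell are not displayed
print(f"Train accuracy: {clf_GS.score(train_scaled, train_targets):.3f}")
print(f"Test accuracy: {clf_GS.score(test_scaled, test_targets):.3f}")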

test_pred = clf_GS.predict(test_scaled)

results(test_pred, test_targets, "Random Forest Accuracy:", "Random Forest Confusion Matrix")
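
One way to organize the depth comparison is to record training and test accuracy side by side for each depth. The sketch below does this on synthetic data without a grid search. The dataset and the `scores` dictionary are assumptions for illustration, and the numbers it prints are not indicative of what you will see on the heart failure data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the heart failure dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# Map each depth to its (train accuracy, test accuracy) pair
scores = {}
for depth in (2, 6):
    rf = RandomForestClassifier(max_depth=depth, random_state=0)
    rf.fit(X_train, y_train)
    scores[depth] = (rf.score(X_train, y_train), rf.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

Looking at the gap between the two accuracies for each depth is a useful starting point for the "why" part of the question.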

Q5: Returning to the Decision Tree visualization, which feature carries the most information about the separation of the classes? Compare this to the results from the Week 3 investigation using statistics. Do you think the visualization supports the results we obtained from our bivariate analysis?

<span style="background-color: #FFD700">**Write Answer Below**</span> 
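
To cross-check what you read off the plot, a fitted tree also exposes impurity-based importances through `feature_importances_`, and the feature used at the root split through `tree_.feature[0]`. The sketch below shows both on synthetic data with hypothetical feature names `f0` to `f4`. It illustrates the mechanics only and is not a result for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"f{i}" for i in range(5)]

dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1, sorted in descending order
importances = sorted(
    zip(feature_names, dt.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances:
    print(f"{name}: {value:.3f}")

# Index 0 of tree_.feature is the feature used at the root split
root_feature = feature_names[dt.tree_.feature[0]]
print("Root split feature:", root_feature)
```

With the notebook's objects, the same attributes are available on `clf_GS.best_estimator_` after fitting the decision tree search.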