# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 11: NNs and Visualizing Prediction Models

**Harvard University**<br>
**Summer 2020**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras, Kevin Rader

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Review of Tree-based Models </li>     
<li> Architecture of Artificial Neural Networks (ANNs) </li>     
<li> Variable Importances </li> 
<li> Interpreting Prediction Models </li> 
</ol>

In [None]:
import pandas as pd
import sys
import numpy as np
import scipy as sp
import sklearn as sk
import sklearn.metrics  # load the submodule explicitly so `sk.metrics` resolves
import matplotlib.pyplot as plt


from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn import tree
from sklearn import ensemble


# Here are the decision trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

import tensorflow as tf

print(tf.__version__)  # You should see a 2.0.0 here!

# sns.set(style="ticks")
# %matplotlib inline

## Learning Goals

This Jupyter notebook accompanies Lecture 11. By the end of this lecture, you should be able to:

- have a better grasp of neural network architecture
- interpret a few different types of variable importances
- interpret a prediction model by exploring the relationships of predictors with the response through prediction plots.


## Part 0: Data Wrangling

For this notebook we will be using the heart data set we've used all semester for performing classification:

In [None]:
heart_df = pd.read_csv('../data/Heart.csv')
print(heart_df.shape)
heart_df.head()

In [None]:
heart_df.describe()

In [None]:
# Split into X and y (copy X so the column edits below act on an independent frame)
X = heart_df[['Age','Sex','ChestPain','RestBP','Chol','Fbs','RestECG','MaxHR','ExAng','Oldpeak','Slope','Ca','Thal']].copy()
y = 1*(heart_df['AHD']=='Yes')

In [None]:
# fix categorical data types for machine learning methods (any pandas warning here can be ignored)

X['ChestPain']=X['ChestPain'].astype('category')
X['ChestPain']=X['ChestPain'].cat.codes

X['Thal']=X['Thal'].astype('category')
X['Thal']=X['Thal'].cat.codes

X.dtypes
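
The conversion above replaces each category label with an integer code. A minimal sketch of what `.cat.codes` does, using a made-up toy column rather than the heart data (the labels and the missing entry are assumptions for illustration only):

```python
import pandas as pd

# Toy column standing in for a categorical predictor such as ChestPain;
# the values and the missing entry are invented for this illustration.
toy = pd.Series(["typical", "asymptomatic", "nonanginal", None])

as_cat = toy.astype("category")
codes = as_cat.cat.codes

# Categories are sorted alphabetically, so the integer codes carry no
# clinical ordering; a missing value receives the code -1.
print(list(as_cat.cat.categories))  # ['asymptomatic', 'nonanginal', 'typical']
print(codes.tolist())               # [2, 0, 1, -1]
```

Two things are worth keeping in mind when reading later results: the codes impose an arbitrary numeric order on the categories, and a missing value silently becomes `-1` instead of staying missing.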

In [None]:
# imputing zeroes for the missing values in `Ca`

X['Ca']=X['Ca'].fillna(0)

In [None]:
X.describe()

In [None]:
# split into train and test

from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(range(X.shape[0]), train_size=0.80)

X_train = X.iloc[itrain, :]
X_test = X.iloc[itest, :]
y_train = y.iloc[itrain]
y_test = y.iloc[itest]


---

## Part 1: Tree-based Models

Below, decision trees with `max_depth=3` and `max_depth=10` are fit.

In [None]:
#fit the simple (depth = 3) decision tree classifier
dt3= tree.DecisionTreeClassifier(max_depth = 3)
dt3.fit(X_train,y_train)

#fit an overfit (depth = 10) decision tree classifier
dt10 = tree.DecisionTreeClassifier(max_depth = 10)
dt10.fit(X_train,y_train)



**Q1.1** Calculate the AUC on both train and test for each tree, and interpret the results.

In [1]:

######
# Your code here
######



*your answer here*

We continue fitting tree-based models: first a random forest, and then a boosted tree model. Note: these are untuned.

In [None]:
np.random.seed(109)
randomforest = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=10)
randomforest.fit(X_train,y_train);

adaboost = AdaBoostClassifier(
    # note: scikit-learn >= 1.2 renames `base_estimator` to `estimator`
    base_estimator=DecisionTreeClassifier(max_depth=4),
    n_estimators=500,
    learning_rate=.75)
adaboost.fit(X_train,y_train);

In [None]:
# evaluating
print("AUC on train for randomforest:",sk.metrics.roc_auc_score(y_train,randomforest.predict_proba(X_train)[:,1]))
print("AUC on test for randomforest:",sk.metrics.roc_auc_score(y_test,randomforest.predict_proba(X_test)[:,1]))

print("AUC on train for adaboost:",sk.metrics.roc_auc_score(y_train,adaboost.predict_proba(X_train)[:,1]))
print("AUC on test for adaboost:",sk.metrics.roc_auc_score(y_test,adaboost.predict_proba(X_test)[:,1]))
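
The AUC values printed above can be read as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A minimal sketch of that pairwise definition on made-up toy labels and scores (`pairwise_auc` is a hypothetical helper, not a scikit-learn function):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy labels and predicted probabilities, invented for illustration.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, p_toy)
print(auc_toy)  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. Because only the ranking matters, AUC is unchanged by any monotone rescaling of the predicted probabilities.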

**Q1.2** What would happen to the above AUC on train and test (random forest and adaboost) if the number of estimators (base trees) were increased for each?

*your answer here*

---

## Part 2: NN model

Below we build our first NN model for these data.

In [None]:
model_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(100, input_shape=(pd.DataFrame(X_train).shape[1],), activation='relu'),
    tf.keras.layers.Dense(25, activation='tanh'),
    tf.keras.layers.Dense(1, activation='linear'),
])
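
To get a feel for the size of this network, the number of trainable parameters in each `Dense` layer can be worked out by hand: one weight per input-unit pair plus one bias per unit. A small sketch, assuming the 13 predictors selected in Part 0 (`dense_param_count` is a hypothetical helper written for this illustration):

```python
def dense_param_count(n_inputs, layer_sizes):
    """Trainable parameters per fully connected layer: weights plus biases."""
    counts = []
    fan_in = n_inputs
    for units in layer_sizes:
        counts.append(fan_in * units + units)  # weight matrix + bias vector
        fan_in = units                         # this layer feeds the next one
    return counts

# 13 predictors feeding Dense layers of 100, 25 and 1 units, as in the cell above.
counts = dense_param_count(13, [100, 25, 1])
print(counts, sum(counts))  # [1400, 2525, 26] 3951
```

These numbers should agree with the `Param #` column reported by `model_NN.summary()`. With only a few hundred observations in the heart data, a few thousand parameters leaves plenty of room for overfitting.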


**Q2.1** How many hidden layers does this model have?  What should be the loss function for this model?  What is incorrect in the model architecture above?  Be sure to fix it...

*your answer here*

In [None]:
# now fit the model, and evaluate:

model_NN.compile(optimizer='ADAM', loss='binary_crossentropy', metrics=['acc'])
history = model_NN.fit(X_train, y_train, epochs=100, batch_size=64, verbose=0)

# `predict` returns the output-layer values; `predict_proba` is not available in newer TensorFlow releases
print("AUC on train for NN_model:",sk.metrics.roc_auc_score(y_train,model_NN.predict(X_train)))
print("AUC on test for NN_model:",sk.metrics.roc_auc_score(y_test,model_NN.predict(X_test)))

**Q2.2** Create a new NN model called `model_NN2` that improves upon the fixed model above.  Why do you suppose it is doing a better job?

In [None]:
######
# your code here
######

---

## Part 3: Variable Importance

Below the variable importances are created for the 4 tree-based models:

In [None]:
#Default Variable Importance

plt.figure(figsize=(24,6))
#plt.set_xticks()
#plt.set_xticklabels(X.columns)
num=10 

plt.subplot(1, 4, 1)
dt3_importances = dt3.feature_importances_
order = np.flip(np.argsort(dt3_importances))[0:num]
plt.barh(range(num),dt3_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for dt3")

plt.subplot(1, 4, 2)
dt10_importances = dt10.feature_importances_
order = np.flip(np.argsort(dt10_importances))[0:num]
plt.barh(range(num),dt10_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for dt10")

plt.subplot(1, 4, 3)
rf_importances = randomforest.feature_importances_
order = np.flip(np.argsort(rf_importances))[0:num]
plt.barh(range(num),rf_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for rf")

plt.subplot(1, 4, 4)
adaboost_importances = adaboost.feature_importances_
order = np.flip(np.argsort(adaboost_importances))[0:num]
plt.barh(range(num),adaboost_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for adaboost");



**Q3.1** Interpret the plots above: why do they make sense?  How would the random forest variable importance change if `max_features` were altered?

*your answer here*

Below we use the [`eli5`](https://eli5.readthedocs.io/en/latest/autodocs/sklearn.html#eli5.sklearn.permutation_importance.PermutationImportance) package to perform permutation importance for the random forest model.  

In [None]:
#pip install eli5
#permutation importance
import eli5
from eli5.sklearn import PermutationImportance
from eli5.permutation_importance import get_score_importances


perm = PermutationImportance(randomforest).fit(X_test, y_test)
#eli5.show_weights(perm,feature_names=X.columns)
print(X.columns)
eli5.show_weights(perm, feature_names = X_train.columns.tolist())

**Q3.2** How do the permutation importances compare to the default feature importance?  What is the difference in interpretation?

*your answer here*

In [None]:
#Note: eli5 does not behave well with Keras, by default.

perm = PermutationImportance(model_NN, random_state=1).fit(X_train,y_train)
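
Permutation importance only needs predictions and a score, so it can also be computed by hand for a model that `eli5` does not wrap cleanly. A minimal NumPy sketch of the idea, using a toy prediction rule in place of the fitted network (`permutation_importance_sketch`, `toy_predict`, and `accuracy` are hypothetical names introduced here, not part of `eli5` or Keras):

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, score, n_repeats=5, seed=0):
    """Mean drop in score when each column is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    base = score(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between column j and the response.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += base - score(y, predict(X_perm))
    return base, drops / n_repeats

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def toy_predict(X):
    # Stand-in for a fitted model: it only ever looks at column 0.
    return (X[:, 0] > 0).astype(int)

# Toy data: the response depends on column 0 only; column 1 is pure noise.
rng = np.random.default_rng(109)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

base_score, mean_drops = permutation_importance_sketch(
    toy_predict, X_toy, y_toy, accuracy)
print(base_score, mean_drops)
```

Shuffling the informative column drops accuracy toward chance, while shuffling the noise column changes nothing. For the network in this notebook, one option would be to pass something like `lambda X: model_NN.predict(X).ravel()` as `predict` and an AUC function as `score`; that usage is an untested suggestion, not code from the lecture.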



---

## Part 4: Plotting Predictions


Below we start to interpret relationships from various models based on the predictions from those models

In [None]:
yhat_rf_train = randomforest.predict_proba(X_train)[:,1]
plt.scatter(X_train[['Age']],yhat_rf_train);
yhat_rf_test = randomforest.predict_proba(X_test)[:,1]
plt.scatter(X_test[['Age']],yhat_rf_test,marker='x');
plt.title("Predicted Probabilities vs. Age from the RF in train and test");

**Q4.1** What is the above plot showing?  How can it be interpreted?

*your answer here*

**Q4.2** Reproduce the above plot for your neural network model.  How does it compare?  What does it say about Age's relationship with Cardiac Arrest?

In [None]:
######
# Your code here
######


**Q4.3** Fit a logistic regression to the predicted response from your NN model based on Age (in train).  Interpret the result.

In [None]:
from sklearn.linear_model import LogisticRegression

######
# your code here
######


*your answer here*

Below, a few different plots are created:
1. The predicted probabilities vs. age for any reasonable value of age at the mean values for the other predictors
2. The predicted probabilities for each individual vs. Age (sometimes called profile plots) and the averaged individual probabilities vs. Age.
3. The median of these individual predicted probability curves, along with the middle 90% range (5th to 95th percentiles) at any particular value of Age.

In [None]:
means1 = X_train.mean(axis = 0)
#means1 =pd.Series(means)
means_df = (means1.to_frame()).transpose()
#df_repeated = pd.concat(means*3)
#print(df_repeated)
Ages = np.arange(np.min(X['Age']),np.max(X['Age']))
means_df  = pd.concat([means_df]*Ages.size,ignore_index=True)
means_df['Age'] = Ages


In [None]:
#plots at means
yhat_nn = model_NN.predict(means_df)  # the network fit in Part 2 is named `model_NN`
plt.scatter(X_train['Age'],y_train)
plt.plot(means_df['Age'],yhat_nn,color="red")
plt.title("Predicted Probabilities vs. Age from NN in train");

In [None]:
#Plots for all observations.  And then averaged

means1 = X_train.mean(axis = 0)
#means1 =pd.Series(means)
means_df = (means1.to_frame()).transpose()
#df_repeated = pd.concat(means*3)
#print(df_repeated)
Ages = np.arange(np.min(X['Age']),np.max(X['Age']))
means_df  = pd.concat([means_df]*Ages.size,ignore_index=True)
means_df['Age'] = Ages
yhat_nns = []
for i in range(0,X_train.shape[0]):
    obs = X_train.iloc[i,:].to_frame().transpose()
    obs_df  = pd.concat([obs]*Ages.size,ignore_index=True)
    obs_df['Age'] = Ages
    yhat_nn = model_NN.predict(obs_df)  # the network fit in Part 2 is named `model_NN`
    yhat_nns.append(yhat_nn.transpose())
    plt.plot(obs_df['Age'],yhat_nn,color='blue',alpha=0.05)

plt.plot(obs_df['Age'],np.mean(yhat_nns,axis=0)[0],color='red',linewidth=2);
    
plt.ylim(0,1)
plt.title("Predicted Probabilities vs. Age from NN in train for all observations");

In [None]:
plt.plot(obs_df['Age'],np.median(yhat_nns,axis=0)[0],color='red');
plt.plot(obs_df['Age'],np.quantile(yhat_nns,q=.05,axis=0)[0],color='blue');
plt.plot(obs_df['Age'],np.quantile(yhat_nns,q=.95,axis=0)[0],color='blue');
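
The per-observation curves and their summaries can also be collected in a single array, which avoids calling the model once per observation. A minimal sketch under stated assumptions: `profile_curves` is a hypothetical helper, and `toy_predict` is a made-up logistic rule standing in for the fitted network.

```python
import numpy as np

def profile_curves(predict, X, col, grid):
    """Return an (n_obs, n_grid) array: row i is observation i's prediction
    as column `col` sweeps over `grid` with its other columns held fixed."""
    X = np.asarray(X, dtype=float)
    grid = np.asarray(grid, dtype=float)
    curves = np.empty((X.shape[0], grid.size))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, col] = value          # set everyone to the same value of `col`
        curves[:, k] = np.ravel(predict(X_mod))
    return curves

def toy_predict(X):
    # Hypothetical stand-in for a fitted model: logistic in two columns.
    return 1.0 / (1.0 + np.exp(-(0.05 * (X[:, 0] - 55.0) + X[:, 1])))

# Toy data: column 0 plays the role of Age, column 1 is another predictor.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.integers(30, 78, size=200), rng.normal(size=200)])
age_grid = np.arange(30, 78)

curves = profile_curves(toy_predict, X_toy, col=0, grid=age_grid)
mean_curve = curves.mean(axis=0)            # averaged individual curves
median_curve = np.median(curves, axis=0)    # median curve
lower, upper = np.quantile(curves, [0.05, 0.95], axis=0)  # middle 90% band
```

Looping over grid values rather than observations means one prediction call per age value, which tends to be much faster for a neural network. Each row of `curves` can be drawn with `plt.plot(age_grid, row)` to recover the individual profile lines.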


**Q4.4** Interpret these plots.  What does the NN model say about the relationship between age and chances of cardiac arrest?

*your answer here*

**Q4.5** Why is it important to consider plotting for separate individuals rather than just doing the predictions at the mean value for the other predictors?

*your answer here*