# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2021-s109a/blob/master/lectures/crest.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 10 (Interpreting Machine Learning Models and Randomization Tests)

**Harvard University**<br>
**Summer 2021**<br>
**Instructor:** Kevin Rader<br>


---

# Table of Contents 
<ol start="0">
<li> Learning Goals </li>
<li> Interpreting Models </li> 
<li> LIME </li> 
<li> Randomization Testing </li> 
</ol>

## Learning Goals

This Jupyter notebook accompanies Lecture 10. By the end of this notebook, you should be able to:

- Interpret the results of machine learning models using several methods. 
- Use the LIME and ELI5 packages.
- Investigate the use of randomization testing for A/B testing data.


In [None]:
import pandas as pd
import sys
import numpy as np
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt

#from sklearn.linear_model import LogisticRegression
#from sklearn.decomposition import PCA
from sklearn import tree
from sklearn import ensemble
from sklearn import metrics  # makes sk.metrics available for the AUC calls below

# Here are the decision trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier


# sns.set(style="ticks")
# %matplotlib inline

## Part 1: Data Wrangling

Today, we will be using the `Heart.csv` data set we've seen many times before.  We are trying to perform analyses to predict `AHD` from the other predictors.  We start by reading in the data, loo

In [None]:
heart_df = pd.read_csv('../data/Heart.csv')

In [None]:
print(heart_df.shape)
heart_df.head()

In [None]:
heart_df.describe()

In [None]:
X = heart_df[['Age','Sex','ChestPain','RestBP','Chol','Fbs','RestECG','MaxHR','ExAng','Oldpeak','Slope','Ca','Thal']]
y = 1*(heart_df['AHD']=='Yes')

In [None]:
#X['ChestPain']=X['ChestPain'].astype('category')
#X['ChestPain']=X['ChestPain'].cat.codes

#X['Thal']=X['Thal'].astype('category')
#X['Thal']=X['Thal'].cat.codes

In [None]:
X = X.assign(ChestPain=X['ChestPain'].astype('category').cat.codes)
X = X.assign(Thal=X['Thal'].astype('category').cat.codes)
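
To make the effect of `.cat.codes` concrete, here is a small sketch on a toy series. The labels are illustrative stand-ins rather than rows from `Heart.csv`, and `pd.get_dummies` is shown only as one possible alternative encoding, not as what this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values, with one missing entry)
toy = pd.Series(["typical", "asymptomatic", "nonanginal", np.nan], name="ChestPain")

# Integer codes follow the alphabetical order of the categories;
# a missing value is silently coded as -1
codes = toy.astype("category").cat.codes
print(codes.tolist())

# One-hot encoding avoids imposing an order on the categories;
# by default the row with the missing value is all zeros
dummies = pd.get_dummies(toy)
print(dummies.astype(int))
```

Integer codes let a tree split on the column directly, but they also impose an arbitrary ordering and hide missing values behind the code `-1`, which is worth keeping in mind for Q1.1.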

In [None]:
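# fill missing `Ca` values with 0 first, so the summary below reflects the data used for modeling
X = X.assign(Ca=X['Ca'].fillna(0))
X.describe()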

In [None]:
from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(range(X.shape[0]), train_size=0.80)

X_train = X.iloc[itrain, :]
X_test = X.iloc[itest, :]
y_train = y.iloc[itrain]
y_test = y.iloc[itest]


**Q1.1**: How were the categorical variables handled?  How were missing values treated?  Were these wise choices?

*your answer here*

---

## Part 2: Fitting Four Untuned ML Models

Start with 2 decision tree models and evaluate using AUC:
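
As a reminder of what AUC measures, the sketch below computes it by hand as the fraction of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counting one half. The helper name `auc_by_ranking` and the toy arrays are made up for illustration; the notebook itself uses `roc_auc_score`.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # every positive score minus every negative score
    diffs = pos[:, None] - neg[None, :]
    return (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / (pos.size * neg.size)

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.1, 0.4, 0.35, 0.8])
toy_auc = auc_by_ranking(y_toy, p_toy)
print(toy_auc)
```

Because AUC depends only on the ranking of the predicted probabilities, it is computed from `predict_proba` output rather than from hard class predictions.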

In [None]:
# fit a possibly underfit (depth = 3) decision tree classifier
dt3 = tree.DecisionTreeClassifier(max_depth = 3)
dt3.fit(X_train,y_train)

# fit an overfit (depth = 10) decision tree classifier
dt10 = tree.DecisionTreeClassifier(max_depth = 10)
dt10.fit(X_train,y_train)




In [None]:
# Evaluate using AUC

print("AUC on train for dt3:",sk.metrics.roc_auc_score(y_train,dt3.predict_proba(X_train)[:,1]))
print("AUC on test for dt3:",sk.metrics.roc_auc_score(y_test,dt3.predict_proba(X_test)[:,1]))

print("AUC on train for dt10:",sk.metrics.roc_auc_score(y_train,dt10.predict_proba(X_train)[:,1]))
print("AUC on test for dt10:",sk.metrics.roc_auc_score(y_test,dt10.predict_proba(X_test)[:,1]))



Now fit the two ensemble models: Random Forest and Boosting:

In [None]:
# fit random forest and adaboost models

np.random.seed(109)
randomforest = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=10)
randomforest.fit(X_train,y_train);

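# Pass the base learner positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
adaboost = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=1000,
    learning_rate=.8)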
adaboost.fit(X_train,y_train);

In [None]:
# evaluate using AUC (fill in the blanks)
print("AUC on train for randomforest:",sk.metrics.roc_auc_score(---,---))
print("AUC on test for randomforest:",sk.metrics.roc_auc_score(---,---))

print("AUC on train for adaboost:",sk.metrics.roc_auc_score(---,---))
print("AUC on test for adaboost:",sk.metrics.roc_auc_score(---,---))

**Q2.1**: Which model performs best?  Which models are overfit?  How do you know?

*your answer here*

## Part 3: Variable Importance

Fill in the blanks below to calculate the variable importances from the 4 untuned models above.

In [None]:
#Default Variable Importance

plt.figure(figsize=(24,6))
#plt.set_xticks()
#plt.set_xticklabels(X.columns)
num=10 

plt.subplot(1, 4, 1)
dt3_importances = dt3.feature_importances_
order = np.flip(np.argsort(dt3_importances))[0:num]
plt.barh(range(num),dt3_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for dt3")

plt.subplot(1, 4, 2)
dt10_importances = dt10.feature_importances_
order = np.flip(np.argsort(dt10_importances))[0:num]
plt.barh(range(num),dt10_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for dt10")

plt.subplot(1, 4, 3)
rf_importances = ---
order = ---
plt.barh(---,---);
plt.title("Relative Variable Importance for rf")

plt.subplot(1, 4, 4)
adaboost_importances = adaboost.feature_importances_
adaboost_importances = pd.Series(adaboost_importances).fillna(0)
order = np.flip(np.argsort(adaboost_importances))[0:num]
plt.barh(range(num),adaboost_importances[order],tick_label=X.columns[order]);
plt.title("Relative Variable Importance for adaboost");



**Q3.1**: How do these variable importance measures compare for these 4 models?  Which predictor is most important in general?  How is it related to `AHD`? 

*your answer here*

---

## Part 4: Using ELI5 

We will "Explain It Like I'm 5" using the `eli5` package to calculate permutation importance.
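
The idea behind permutation importance can be sketched by hand: shuffle one column at a time and record how much the model's score drops. The helper `permutation_importance_by_hand` and the toy scoring function below are hypothetical stand-ins written for illustration; they are not how `eli5` implements it internally.

```python
import numpy as np

def permutation_importance_by_hand(score_fn, X, y, n_iter=10, seed=42):
    """Mean drop in score when each column of X is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_iter):
            X_perm = X.copy()
            # shuffling column j breaks its link with y but keeps its distribution
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - score_fn(X_perm, y)
    return drops / n_iter

# Hypothetical stand-in for a fitted model: it only ever looks at column 0
def toy_accuracy(X, y):
    return np.mean((X[:, 0] > 0).astype(int) == y)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # response depends on column 0 only

importances = permutation_importance_by_hand(toy_accuracy, X_toy, y_toy)
print(importances)
```

Only column 0 shows a drop here, because the toy model ignores the other two columns. With a fitted classifier, `score_fn` could instead wrap an AUC computed from `predict_proba` on held-out data.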

In [None]:
# install eli5
!pip install eli5

In [None]:
import eli5

In [None]:
#permutation importance for the random forest
from eli5.sklearn import PermutationImportance

seed = 42

perm = PermutationImportance(randomforest,random_state=seed,n_iter=10).fit(X_test, y_test)
eli5.show_weights(perm,feature_names=X.columns.tolist())
#eli5.explain_weights(perm, feature_names = X_train.columns.tolist())


**Q4.1**: Calculate and print out the permutation importances for the adaboost model

In [None]:
########
# your code below
########

**Q4.2**: How do the permutation importance measures compare to the default variable importance in the random forest?  How does the adaboost model compare to the random forest?

*your answer here*

---

## Part 5: Interpretation through Prediction Plots

We start by plotting the predictions for all the observed data.


In [None]:
yhat_rf_train = randomforest.predict_proba(X_train)[:,1]
plt.scatter(X_train[['Age']],yhat_rf_train);
yhat_rf_test = randomforest.predict_proba(X_test)[:,1]
plt.scatter(X_test[['Age']],yhat_rf_test,marker='x');
plt.title("Predicted Probabilities vs. Age from the RF in train and test");

In [None]:
#Edit the code below for the adaboost model

yhat_adaboost_train = adaboost.predict_proba(---)
plt.scatter(---,---);
yhat_adaboost_test = adaboost.predict_proba(---)
plt.scatter(---,---);
plt.title("Predicted Probabilities vs. Age from the adaboost model in train and test");

**Q5.1** How do the random forest and boosted models compare in the interpretation of Age with AHD?  Which is more reliable?

*your answer here*

In [None]:
# Create the data frame of means to do the prediction
means1 = X_train.mean(axis = 0)
means_df = (means1.to_frame()).transpose()

# Do the prediction at every integer age in the observed range (max age included)
Ages = np.arange(np.min(X['Age']),np.max(X['Age'])+1)
means_df  = pd.concat([means_df]*Ages.size,ignore_index=True)
means_df['Age'] = Ages


In [None]:
#plots at means
yhat_rf = randomforest.predict_proba(means_df)[:,1]
plt.scatter(X_train['Age'],y_train)
plt.plot(means_df['Age'],yhat_rf,color="red")
plt.title("Predicted Probabilities vs. Age from RF in train");

In [None]:
#Plots for all observations.  And then averaged

yhat_rfs = []
for i in range(0,X_train.shape[0]):
    obs = X_train.iloc[i,:].to_frame().transpose()
    obs_df  = pd.concat([obs]*Ages.size,ignore_index=True)
    obs_df['Age'] = Ages
    yhat_rf = randomforest.predict_proba(obs_df)[:,1]
    yhat_rfs.append(yhat_rf)
    plt.plot(obs_df['Age'],yhat_rf,color='blue',alpha=0.05)

plt.plot(obs_df['Age'],np.mean(yhat_rfs, axis=0),color='red',linewidth=2);
    
plt.ylim(0,1)
plt.title("Predicted Probabilities vs. Age from RF in train for all observations");

In [None]:
# plot the median curve and the 5th/95th percentiles (middle 90%) of the individual curves
plt.plot(obs_df['Age'],np.median(yhat_rfs,axis=0),color='red');
plt.plot(obs_df['Age'],np.quantile(yhat_rfs,q=.05,axis=0),color='blue');
plt.plot(obs_df['Age'],np.quantile(yhat_rfs,q=.95,axis=0),color='blue');


**Q5.2** Interpret the two plots above.  What is the difference in the interpretations?  Is there any evidence of interaction effects between Age and the other predictors?  How do you know?

*your answer here*

---

## Part 6: Using LIME

In [None]:
!pip install lime
import lime

In [None]:
from lime.lime_tabular import LimeTabularExplainer
#explainer = LimeTabularExplainer(X_train)#class_names = [0,1])

explainer = LimeTabularExplainer(X_train.values,
                                 feature_names=X_train.columns,
                                 class_names = [0,1],
                                 mode='classification')


In [None]:
idx = 42

exp = explainer.explain_instance(X_train.values[idx], 
                                 randomforest.predict_proba, 
                                 num_features = 13)#X_train.values[idx].size)

print('Observation #: %d' % idx)
print('Probability(AHD) =', randomforest.predict_proba(X_train)[idx][1])
# use positional indexing so the label matches X_train.values[idx]
print('True class: %s' % y_train.iloc[idx])

In [None]:
### Plot the results
# exp.as_list()
exp.as_pyplot_figure();

In [None]:
# change the observation number and see what changes.
idx = ---
exp = explainer.explain_instance(X_train.values[idx], 
                                 randomforest.predict_proba, 
                                 num_features = 13)

print('Observation #: %d' % idx)
print('Probability(AHD) =', randomforest.predict_proba(X_train)[idx][1])
# use positional indexing so the label matches X_train.values[idx]
print('True class: %s' % y_train.iloc[idx])

In [None]:
### Plot the results
# exp.as_list()
exp.as_pyplot_figure();

**Q6.1** Interpret the LIME results above.  Do they agree with the other interpretations for the random forest model seen so far?

*your answer here*

## Part 7: Randomization Testing

This part will investigate the power of performing a randomization test for comparing a response `y` between two groups (defined by `x`).
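
For reference, one way such a test can be structured is sketched below on separately simulated data. The function name `permutation_test`, the default of 1000 replicates, and the deliberately large true difference are assumptions made for illustration, not the course solution.

```python
import numpy as np

def permutation_test(x, y, replicates=1000, seed=109):
    """Two-sided randomization test for a difference in mean y between x == 1 and x == 0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x).astype(bool)
    y = np.asarray(y, dtype=float)
    observed = y[x].mean() - y[~x].mean()
    perm_diffs = np.empty(replicates)
    for r in range(replicates):
        # under the null, group labels are exchangeable, so reshuffle them
        x_perm = rng.permutation(x)
        perm_diffs[r] = y[x_perm].mean() - y[~x_perm].mean()
    # add-one correction keeps the estimated p-value strictly positive
    extreme = np.sum(np.abs(perm_diffs) >= np.abs(observed))
    p_value = (1 + extreme) / (replicates + 1)
    return observed, p_value

# Demo data with a large true difference (15) relative to the noise (sd = 10)
rng = np.random.default_rng(1)
x_demo = np.repeat([0, 1], 50)
y_demo = rng.normal(15 * x_demo, 10)

obs_diff, p_val = permutation_test(x_demo, y_demo)
print(obs_diff, p_val)
```

The p-value is the share of reshuffled differences at least as extreme as the observed one. Repeating the whole procedure over many simulated data sets and counting rejections gives an estimate of the type I error rate when the true difference is 0, and of power otherwise.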

In [None]:
# Here we create the simulated ("mythical") data

diff = 0
n = 100
sd = 10

x = np.random.binomial(1,0.5,n)
y = np.random.normal(diff*x,sd,n)

df = pd.DataFrame(np.array([x,y]).T, columns = ["x","y"])
df.head()

**Q7.1** Perform a permutation test (called a randomization test in this problem) on the data above.  What do you conclude?

In [None]:
replicates = 100

### your code here


*your answer here*

**Q7.2** Change the value of `diff` to reasonable values (start with 1). What do you conclude?

In [None]:
replicates = 100

### your code here


*your answer here*

**Q7.3** Replicate this data creation and permutation test 200 times (200 separate `experiments`) with `diff = 0`.  How often do you reject the null?  What happens as `diff` increases?

In [None]:
replicates = 100
experiments = 200

### your code here


*your answer here*