<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Visualizing CARTs with admissions data

_Instructor: Aymeric Flaisler_

---

Using the admissions data from earlier in the course, build CARTs, look at how they work visually, and compare their performance to more standard, parametric models.


---

### 1. Install and load the packages required to visually show decision tree branching

You will need to first:

1. Install `graphviz` with Homebrew: `brew install graphviz`
2. Install `pydotplus` with `conda install pydotplus`
3. Load the packages as shown below (you may need to restart the kernel after the installations).

In [None]:
# REQUIREMENTS:
# pip install pydotplus
# brew install graphviz

# Use graphviz to make a chart of the regression tree decision points:
# sklearn.externals.six was removed from scikit-learn; the standard library StringIO works here.
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

---

### 2. Load in admissions data and other python packages

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
admit = pd.read_csv('./datasets/admissions.csv')

---

### 3. Create regression and classification X, y data

The regression data will be:

    Xr = [admit, gre, prestige]
    yr = gpa
    
The classification data will be:

    Xc = [gre, gpa, prestige]
    yc = admit

In [None]:
# We focus on the data we have for the time being.
# We don't want to spend an unnecessary amount of time cleaning.
admit = admit.dropna()

In [None]:
admit.head()

In [None]:
Xr = admit[['admit','gre','prestige']]
yr = admit.gpa.values

Xc = admit[['gpa','gre','prestige']]
yc = admit.admit.values

---

### 4. Cross-validate regression and logistic regression on the data

Fit a linear regression for the regression problem and a logistic for the classification problem. Cross-validate the R2 and accuracy scores.
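As a rough illustration of the cross-validation step, the sketch below runs `cross_val_score` on synthetic stand-in data rather than the admissions file. The names `X_demo`, `y_reg_demo`, `y_cls_demo`, `reg_demo_scores`, and `cls_demo_scores` are made up for this example. It relies on the default scoring: R2 for a regressor and accuracy for a classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the admissions dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_reg_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=200)
y_cls_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Default scoring: R2 for regressors, accuracy for classifiers.
reg_demo_scores = cross_val_score(LinearRegression(), X_demo, y_reg_demo, cv=4)
cls_demo_scores = cross_val_score(LogisticRegression(), X_demo, y_cls_demo, cv=4)

# One score per fold, followed by the mean across folds.
print(reg_demo_scores, np.mean(reg_demo_scores))
print(cls_demo_scores, np.mean(cls_demo_scores))
```

The same pattern should carry over to `Xr`, `yr` and `Xc`, `yc` once they are defined.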

In [None]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

In [None]:
# cross val Linear Reg with 4 folds
# reg_scores = cross_val_score(...

# cross val Logistic Reg with 4 folds
# cls_scores = cross_val_score(...

#get scores
print(reg_scores, np.mean(reg_scores))
print(cls_scores, np.mean(cls_scores))

In [None]:
# fit models
linreg = LinearRegression().fit(Xr, yr) #R2
logreg = LogisticRegression().fit(Xc, yc) #accuracy

---

### 5. Building regression trees

With `DecisionTreeRegressor`:

1. Build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the R2 scores of each of the models and compare to the linear regression earlier.
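One possible way to organize the two steps above is a loop over the depths. This is only a sketch on synthetic data, not the admissions data; `X_demo`, `y_demo`, and `depth_scores` are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the admissions dataset).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

# Mean cross-validated R2 for each max_depth setting.
depth_scores = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_demo, y_demo, cv=4)
    depth_scores[depth] = scores.mean()
    print(depth, scores, scores.mean())
```

Comparing the entries of `depth_scores` against the linear regression's mean R2 shows whether deeper trees help or start to overfit.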

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
# set 4 models


In [None]:
# fit the 4 models


In [None]:
# cross validate the 4 models


In [None]:
# score the 4 models


---

### 6. Visualizing the regression tree decisions

Use the template code below to create charts that show the logic/branching of your four decision tree regressions from above.
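If rendering a PNG is inconvenient, `export_graphviz` can also return the DOT description of the tree as a plain string when `out_file=None`. The sketch below assumes a small synthetic dataset with made-up column names `a` and `b`; `demo_tree` and `dot_source` are illustrative names only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Synthetic stand-in data (assumption: NOT the admissions dataset).
rng = np.random.default_rng(4)
X_demo = pd.DataFrame({"a": rng.uniform(size=80), "b": rng.uniform(size=80)})
y_demo = 3.0 * X_demo["a"] + rng.normal(scale=0.1, size=80)

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(demo_tree, out_file=None,
                             filled=True, rounded=True,
                             feature_names=list(X_demo.columns))
print(dot_source[:200])
```

The resulting string can be passed to `pydotplus.graph_from_dot_data` in the same way as `dot_data.getvalue()` in the template.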

#### Interpreting a regression tree diagram

- First line is the condition used to split that node (go left if true, go right if false)
- `samples` is the number of observations in that node before splitting
- `mse` (labelled `squared_error` in recent scikit-learn releases) is the mean squared error calculated by comparing the actual response values in that node against the mean response value in that node
- `value` is the mean response value in that node
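The quantities in the list above can be checked by hand for the root node, which contains every training observation. This is a minimal sketch on synthetic data; `stump`, `root_samples`, `root_value`, and `root_mse` are names chosen for the example, and it assumes the default squared-error criterion.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the admissions dataset).
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=100)

stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Node 0 is the root of the fitted tree.
root_samples = stump.tree_.n_node_samples[0]  # "samples" in the diagram
root_value = stump.tree_.value[0, 0, 0]       # "value": mean response in the node
root_mse = stump.tree_.impurity[0]            # "mse": squared error around that mean

print(root_samples, len(y_demo))
print(root_value, y_demo.mean())
print(root_mse, np.mean((y_demo - y_demo.mean()) ** 2))
```

Each printed pair should agree, which confirms that the node's error is measured against the node's own mean.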

In [None]:
# # TEMPLATE CODE
# from io import StringIO  # sklearn.externals.six was removed from scikit-learn
# from IPython.display import Image  
# from sklearn.tree import export_graphviz
# import pydotplus

# # initialize the output file object
# dot_data = StringIO() 

# # my fit DecisionTreeRegressor object here is: dtr1
# # for feature_names i put the columns of my Xr matrix
# export_graphviz(dtr1, out_file=dot_data,  
#                 filled=True, rounded=True,
#                 special_characters=True,
#                 feature_names=Xr.columns)  

# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# Image(graph.create_png()) 
# To save the image of the tree
# graph.write_png('./dtr1.png')

In [None]:
# A:

---

### 7. Building classification trees

With `DecisionTreeClassifier`:

1. Again build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the accuracy scores of each of the models and compare to the logistic regression earlier.

Note that now you'll be using the classification task where we are predicting `admit`.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# set 4 models


In [None]:
# fit the 4 models


In [None]:
# cross validate the 4 models


In [None]:
# score the 4 models


---

### 8. Visualize the classification trees

The plotting code is the same as for regression; you just need to change the model and the feature names for each plot.

The output differs somewhat from the regression tree chart. Instead of the MSE, each node now reports its impurity (`gini` with the default criterion), and the `value` line gives the count of each class at that node rather than a mean response.
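To make the class-count reading concrete, the sketch below computes the Gini impurity of the root node from the class counts, using 1 minus the sum of squared class proportions, and compares it with what the fitted tree stores. It uses synthetic data and assumes the default `gini` criterion; `clf_stump`, `class_counts`, and `manual_gini` are illustrative names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the admissions dataset).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 2))
y_demo = (X_demo[:, 0] > 0.3).astype(int)

clf_stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# The root node holds all training rows, so its class counts are the overall counts.
class_counts = np.bincount(y_demo)
proportions = class_counts / class_counts.sum()
manual_gini = 1.0 - np.sum(proportions ** 2)

print(class_counts)
print(manual_gini, clf_stump.tree_.impurity[0])
```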

In [None]:
# TEMPLATE CODE for max_depth = 1
# dot_data = StringIO()  

# export_graphviz(dtc1, out_file=dot_data,  
#                 filled=True, rounded=True,
#                 special_characters=True,
#                 feature_names=Xc.columns)  

# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# Image(graph.create_png())

In [None]:
# max_depth = 2

In [None]:
# max_depth = 3

In [None]:
# max_depth = None

---

### 9. Using GridSearchCV to find the best decision tree classifier

Unrestricted decision trees tend to overfit the training data. Decision tree regression and classification models in sklearn offer a variety of ways to "pre-prune" the tree (by restricting how many times it can branch and what it can use).

Parameter         | What it does
------------------|-------------
max_depth         | How many levels deep can the decision tree go?
max_features      | How many features are considered when searching for the best split?
max_leaf_nodes    | How many leaves can the whole tree have, at most?
min_samples_leaf  | How many samples need to be included at a leaf, at a minimum?
min_samples_split | How many samples need to be included at a node, at a minimum, for it to be split?

It is not always best to search over _all_ of these in a grid search, unless you have a small dataset. Many of them, while not redundant, will have very similar effects on your model's fit.

Check out the documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
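A grid search over a couple of these pre-pruning parameters might look roughly like the sketch below. It uses `make_classification` to generate stand-in data, and the parameter values in `demo_params` are arbitrary choices for illustration, not recommendations; `demo_search` is likewise a made-up name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the college dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# Illustrative grid: every combination of these values is cross-validated.
demo_params = {
    "max_depth": [1, 2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                           demo_params, cv=4)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # best combination found
print(demo_search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `demo_search.best_estimator_` holds a tree refit on all of the data with the best parameters.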

---

#### Switch over to the college stats dataset

We are going to be predicting whether or not a college is public or private. Set up your X, y variables accordingly.

In [None]:
col = pd.read_csv('./datasets/College.csv')

In [None]:
col.head(2)

In [None]:
# Set up your X, y variables accordingly
# y = ...
# X = ...

---

### 10. Building classification trees

With `DecisionTreeClassifier`:

1. Build 4 models with different parameters for `max_depth`: `max_depth=1`, `max_depth=2`, `max_depth=3`, and `max_depth=None`
2. Cross-validate the accuracy scores of each of the models and compare to the logistic regression earlier.


In [None]:
# set 4 trees
# dtc1 = DecisionTreeClassifier(max_depth=1)
# dtc2 = DecisionTreeClassifier(max_depth=2)
# dtc3 = DecisionTreeClassifier(max_depth=3)
# dtcN = DecisionTreeClassifier(max_depth=None)

In [None]:
# fit 4 trees
# dtc1.fit(X, y)
# dtc2.fit(X, y)
# dtc3.fit(X, y)
# dtcN.fit(X, y)

In [None]:
# use CV to evaluate the 4 trees
# dtc1_scores = cross_val_score(dtc1, X, y, cv=4)
# dtc2_scores = cross_val_score(dtc2, X, y, cv=4)
# dtc3_scores = cross_val_score(dtc3, X, y, cv=4)
# dtcN_scores = cross_val_score(dtcN, X, y, cv=4)

# print(dtc1_scores, np.mean(dtc1_scores))
# print(dtc2_scores, np.mean(dtc2_scores))
# print(dtc3_scores, np.mean(dtc3_scores))
# print(dtcN_scores, np.mean(dtcN_scores))

---

### 11. Set up and run the gridsearch on the data

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# gridsearch params
# dtc_params = {'max_depth':...

In [None]:
# Set the gridsearch


In [None]:
# use the gridsearchCV model to fit the data


---

### 12. Print out the "feature importances"

The model has an attribute called `.feature_importances_` which tells us how important each feature was relative to the others. Each importance ranges from 0 to 1, with 1 being the most important, and the importances of a fitted tree sum to 1.

An easy way to think about the feature importance is how much that particular variable was used to make decisions. Really though, it also takes into account how much that feature contributed to splitting up the class or reducing the variance.

A feature with higher feature importance reduced the criterion (impurity) more than the other features.

Below, show the feature importances for each variable predicting private vs. not, sorted by most important feature to least.
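For reference, the general shape of such a table is sketched here on synthetic data, not the college dataset. The column names `feature_0` to `feature_4` and the variables `demo_tree` and `fi_demo` are invented for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the college dataset).
X_arr, y_demo = make_classification(n_samples=300, n_features=5,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
demo_cols = [f"feature_{i}" for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=demo_cols)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Pair each column name with its importance and sort from most to least important.
fi_demo = pd.DataFrame({
    "feature": X_demo.columns,
    "importance": demo_tree.feature_importances_,
}).sort_values("importance", ascending=False)

print(fi_demo)
```

When the tree comes from a grid search, the importances live on the refit model, for example `best_estimator_.feature_importances_`.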

In [None]:
# Fill in and print the dataframe:
fi = pd.DataFrame({
    'feature': ...,
    'importance': ...,
})