# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with a Bagging classifier. We will do that on a few datasets, starting with the ones offered by scikit-learn. This lab does not include much starter code, so that we ease into working independently. Do ask if you are stuck, though.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialise a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [9]:
from __future__ import print_function  # must come first; only needed on Python 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [29]:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Load the features and the binary target as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# Mean accuracy over 5 cross-validation folds for a single tree
tree = DecisionTreeClassifier(random_state=1)
print("Trees Score:\t", cross_val_score(tree, X, y, cv=5).mean())

Trees Score:	 0.917491342824


In [30]:
from sklearn.ensemble import BaggingClassifier

# Bag 100 copies of the tree, each fit on a bootstrap sample of the training fold
bagging = BaggingClassifier(tree, n_estimators=100, random_state=42)
print("Bagging Score:\t", cross_val_score(bagging, X, y, cv=5).mean())

Bagging Score:	 0.965063485956
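
One way to judge whether the gap between the two mean scores matters is to look at the spread of the per-fold scores rather than only their means. The sketch below is an illustration, not part of the original solution. It assumes a shuffled `StratifiedKFold` with an arbitrary seed so that both models are scored on the same folds, and the names `tree_scores`, `bagging_scores` and `diff` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumption: one fixed splitter, so both models are evaluated on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree = DecisionTreeClassifier(random_state=1)
bagging = BaggingClassifier(tree, n_estimators=100, random_state=42)

tree_scores = cross_val_score(tree, X, y, cv=cv)
bagging_scores = cross_val_score(bagging, X, y, cv=cv)

# Report mean and fold-to-fold standard deviation for each model
print("Tree:    %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("Bagging: %.3f +/- %.3f" % (bagging_scores.mean(), bagging_scores.std()))

# Per-fold differences show whether bagging wins consistently or only on average
diff = bagging_scores - tree_scores
print("Per-fold difference:", np.round(diff, 3))
```

If the difference in means is small compared with the fold-to-fold standard deviation, the two scores are hard to tell apart with only five folds.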


### 1.b Scaled pipelines
As you may have noticed, the features are not normalised. Do the scores improve with normalisation?
For further experience with pipelines and scaling:

1. Create 2 pipelines, with a scaling preprocessing step and then either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from the non-scaled data?

In [31]:
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

# Note: Normalizer rescales each sample (row) to unit norm;
# it does not standardise the individual features (columns).
norm = preprocessing.Normalizer()

pipe1 = Pipeline([('norm', norm), ('tree', tree)])
pipe2 = Pipeline([('norm', norm), ('bagging', bagging)])

print(cross_val_score(pipe1, X, y, cv=5).mean())
print(cross_val_score(pipe2, X, y, cv=5).mean())

0.919199692189
0.956183147364
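
The cell above uses `Normalizer`, which works row by row. For comparison, here is a minimal sketch with per-feature scaling via `StandardScaler`, which is a swapped-in choice and not what the original cell uses. Because a tree splits on thresholds of one feature at a time, rescaling each feature would be expected to change the score very little. The names `raw_tree`, `scaled_tree`, `raw_score` and `scaled_score` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same tree settings with and without per-feature standardisation
raw_tree = DecisionTreeClassifier(random_state=1)
scaled_tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

# The scaler is refit on the training part of every fold inside the pipeline
raw_score = cross_val_score(raw_tree, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_tree, X, y, cv=5).mean()

print("Unscaled tree:      %.4f" % raw_score)
print("StandardScaler tree: %.4f" % scaled_score)
```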


### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Classifier
- Search for a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Classifier
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator (named `estimator` in scikit-learn 1.2 and later)
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?

In [34]:
from sklearn.model_selection import GridSearchCV

# 4 x 3 x 4 = 48 parameter combinations, each scored with 5-fold cross-validation
params = {
    'max_depth': range(1, 5),
    'min_samples_split': range(2, 5),
    'min_samples_leaf': range(1, 5)
}
grid = GridSearchCV(tree, params, cv=5).fit(X, y)

In [35]:
print(grid.best_params_)
print(grid.best_score_)

{'min_samples_split': 2, 'max_depth': 2, 'min_samples_leaf': 1}
0.927943760984


In [39]:
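from sklearn.model_selection import GridSearchCV

# Parameters of the wrapped tree are addressed as <name>__<parameter>.
# From scikit-learn 1.2 on, the prefix is estimator__ instead of base_estimator__.
params = {
    'base_estimator__max_depth': range(1, 5),
    'base_estimator__min_samples_split': range(2, 5),
    'base_estimator__min_samples_leaf': range(1, 5)
}
grid = GridSearchCV(bagging, param_grid=params, cv=5, n_jobs=-1).fit(X, y)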

In [40]:
print(grid.best_params_)
print(grid.best_score_)

{'base_estimator__min_samples_split': 4, 'base_estimator__max_depth': 4, 'base_estimator__min_samples_leaf': 1}
0.959578207381
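
The search above only varies the parameters of the wrapped tree. As a hedged sketch of the "additional parameters" mentioned in the task list, the grid below also varies `n_estimators`, `max_samples` and `max_features` of the bagging ensemble itself. The values are arbitrary and kept small so that the search stays quick, and the `prefix` helper variable is made up for this example to cope with the `base_estimator` / `estimator` rename across scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=1), random_state=42)

# The wrapped tree is exposed as `estimator` in newer scikit-learn releases
# and as `base_estimator` in older ones; use whichever this version provides.
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

# Illustrative values only: 2 x 2 x 2 x 2 = 16 combinations
params = {
    prefix + "__max_depth": [2, 4],   # parameter of each tree
    "n_estimators": [10, 30],         # number of bagged trees
    "max_samples": [0.5, 1.0],        # fraction of rows drawn per tree
    "max_features": [0.5, 1.0],       # fraction of columns drawn per tree
}

grid = GridSearchCV(bagging, param_grid=params, cv=5, n_jobs=-1).fit(X, y)
print(grid.best_params_)
print(grid.best_score_)
```

Every extra parameter multiplies the number of fits, so it helps to start with two or three values per parameter and refine around the best combination.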


## 2. Diabetes and Regression

scikit-learn ships a dataset of diabetes patients, taken from this study:

http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.
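
As a quick check of the description above, this short sketch loads the dataset and prints its dimensions and feature names. The variable name `data` is made up for this example.

```python
from sklearn.datasets import load_diabetes

# Load the full dataset object to get at the feature names as well
data = load_diabetes()
X, y = data.data, data.target

print(X.shape, y.shape)      # number of patients and baseline variables
print(data.feature_names)    # short names of the 10 baseline variables
```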

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?

In [47]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# For regressors, cross_val_score defaults to the R^2 score
tree = DecisionTreeRegressor(random_state=1)
print("Trees Score:\t", cross_val_score(tree, X, y, cv=5).mean())

Trees Score:	 -0.140462749211


In [48]:
from sklearn.ensemble import BaggingRegressor
bagging=BaggingRegressor(tree, n_estimators=100, random_state=42)
print("Bagging Score:\t", cross_val_score(bagging, X, y, cv=5).mean())

Bagging Score:	 0.418741681617
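
On the question of which score to use: the cells above rely on the default, which for regressors is R^2. A negative R^2, like the one for the single tree, means the model does worse on the held-out folds than a constant prediction of the target mean. The sketch below shows how the `scoring` argument selects a different metric; mean absolute error is picked here only as an illustration, and the names `r2_scores` and `mae_scores` are made up for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(random_state=1)

# Explicit R^2, the same metric the default scorer uses for regressors
r2_scores = cross_val_score(tree, X, y, cv=5, scoring="r2")

# scikit-learn scorers follow "higher is better", so errors are returned negated;
# flip the sign to read them as ordinary mean absolute errors
mae_scores = -cross_val_score(tree, X, y, cv=5, scoring="neg_mean_absolute_error")

print("R^2 per fold:", r2_scores.round(3))
print("MAE per fold:", mae_scores.round(1))
```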


### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross validation for the Decision Tree Regressor
- Search for a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator (named `estimator` in scikit-learn 1.2 and later)
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?


In [49]:
params = {
    'max_depth':range(1,5),
    'min_samples_split':range(2,5),
    'min_samples_leaf':range(1,5)
}
grid = GridSearchCV(tree,params,cv=5,n_jobs=-1).fit(X,y)
print(grid.best_params_)
print(grid.best_score_)

{'min_samples_split': 2, 'max_depth': 3, 'min_samples_leaf': 4}
0.337383150805


In [53]:
params = {
    'base_estimator__max_depth':range(1,5),
    'base_estimator__min_samples_split':range(2,5),
    'base_estimator__min_samples_leaf':range(1,5)
}

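import cProfile

# Profile the parallel search. With n_jobs=-1 the fits run in worker processes,
# so cProfile only records the work done in the parent process.
cProfile.run('grid = GridSearchCV(bagging,param_grid = params,cv=5,n_jobs=-1).fit(X,y)')
print(grid.best_params_)
print(grid.best_score_)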

         373863 function calls (370404 primitive calls) in 11.698 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   11.715   11.715 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 <string>:1(FullArgSpec)
        1    0.000    0.000    0.000    0.000 <string>:8(__new__)
       16    0.000    0.000    0.001    0.000 Queue.py:107(put)
        1    0.000    0.000    0.000    0.000 Queue.py:197(_init)
       16    0.000    0.000    0.000    0.000 Queue.py:204(_put)
        1    0.000    0.000    0.000    0.000 Queue.py:26(__init__)
        2    0.000    0.000    0.002    0.001 __init__.py:102(Pipe)
        1    0.000    0.000    0.000    0.000 __init__.py:109(cpu_count)
        2    0.000    0.000    0.000    0.000 __init__.py:171(Lock)
        1    0.000    0.000    0.000    0.000 __init__.py:375(_get_n_jobs)
     2866    0.020    0.000    0.063    0.000 _abcoll.py:548(update)
    

In [54]:
cProfile.run('grid = GridSearchCV(bagging,param_grid = params,cv=5,n_jobs=1).fit(X,y)')
print(grid.best_params_)
print(grid.best_score_)

         62538101 function calls (62162137 primitive calls) in 61.381 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 <string>:1(FullArgSpec)
        1    0.000    0.000    0.000    0.000 <string>:8(__new__)
      721    0.000    0.000    0.000    0.000 __init__.py:375(_get_n_jobs)
      960    0.002    0.000    0.007    0.000 __init__.py:83(safe_indexing)
   592402    3.318    0.000   11.110    0.000 _abcoll.py:548(update)
    24100    0.016    0.000    0.105    0.000 _methods.py:25(_amax)
   100046    0.050    0.000    0.602    0.000 _methods.py:31(_sum)
      962    0.000    0.000    0.003    0.000 _methods.py:37(_any)
      486    0.001    0.000    0.001    0.000 _methods.py:43(_count_reduce_items)
      486    0.005    0.000    0.008    0.000 _methods.py:53(_mean)
      722    0.000    0.000    0.000    0.00
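
The profile listings above are long, and the parallel run only counts calls made in the parent process, so the two call counts are not directly comparable. If only the elapsed time is of interest, a lighter alternative is to time each search with a wall clock. This is a minimal sketch assuming Python 3 (`time.perf_counter`), a deliberately small grid and only 20 trees per ensemble so that it runs quickly. The names `timings` and `prefix` are made up for this example.

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
bagging = BaggingRegressor(DecisionTreeRegressor(random_state=1),
                           n_estimators=20, random_state=42)

# Handle the base_estimator / estimator rename across scikit-learn versions
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"
params = {
    prefix + "__max_depth": [2, 4],
    prefix + "__min_samples_leaf": [1, 4],
}

# Time the same search once serially (n_jobs=1) and once on all cores (n_jobs=-1)
timings = {}
for n_jobs in (1, -1):
    start = time.perf_counter()
    grid = GridSearchCV(bagging, param_grid=params, cv=5, n_jobs=n_jobs).fit(X, y)
    timings[n_jobs] = time.perf_counter() - start
    print("n_jobs=%2d: %.2f s, best score %.3f"
          % (n_jobs, timings[n_jobs], grid.best_score_))
```

On a grid this small, the cost of starting worker processes can outweigh the gain from parallelism; the benefit of `n_jobs=-1` shows up on larger grids like the one profiled above.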

## Bonus: Project 6 data

Get a head start on Project 6 by repeating this analysis on that dataset! You will first need to obtain it through the IMDB API.