# Decision Trees and Ensembles Lab

In this lab we will compare the performance of a simple Decision Tree classifier with that of a Bagging classifier. We will do that on a few datasets, starting with the ones bundled with scikit-learn.

## 1. Breast Cancer Dataset
We will start our comparison on the breast cancer dataset.
You can load it directly from scikit-learn using the `load_breast_cancer` function.

### 1.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Wrap a Bagging Classifier around the Decision Tree Classifier and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
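
One way to approach the last question is to look at the per-fold scores rather than only their mean. The sketch below is an illustration, not part of the original lab: it scores both models on the same `StratifiedKFold` splits and prints the mean and standard deviation of each, plus the per-fold differences. The fold object, `random_state=0`, and the variable names are assumptions made for the example. The wrapped tree is passed to `BaggingClassifier` positionally because the keyword for it differs between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Fixed folds so both models are scored on exactly the same splits
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# Wrapped estimator passed positionally (keyword is base_estimator or
# estimator, depending on the scikit-learn version)
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=folds)
bag_scores = cross_val_score(bagging, X, y, cv=folds)

for name, scores in [("tree", tree_scores), ("bagging", bag_scores)]:
    print("{}: mean={:.3f} std={:.3f}".format(name, scores.mean(), scores.std()))

# Per-fold differences: a gap that is small relative to the fold-to-fold
# spread is weak evidence that one model is really better
diff = bag_scores - tree_scores
print("mean difference={:.3f} std of differences={:.3f}".format(diff.mean(), diff.std()))
```

With only five folds this is a rough check. If the mean difference is small compared with the spread of the differences, the two models are hard to tell apart on this data.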

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
X.head()


Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [36]:
# Target vector: 0 = malignant, 1 = benign
y = data.target

In [64]:
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier

# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

# STEP 2: train a baseline logistic regression on the training set (5-fold CV selects C)
logreg_cv = LogisticRegressionCV(solver='liblinear', cv=5, penalty='l2')
logreg_cv.fit(X_train, y_train)

# STEP 3: test the model on the testing set, and check the accuracy
y_pred_class = logreg_cv.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.972027972028


In [65]:
dtree1= DecisionTreeClassifier(max_depth=1)
dtree1.fit(X_train, y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [66]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier


Tree = DecisionTreeClassifier()
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2
bagging = BaggingClassifier(base_estimator=Tree, max_samples=0.5, max_features=0.5)

print("CV Score:\t{}".format(cross_val_score(Tree, X, y, cv=5, n_jobs=-1).mean()))
print("Bagging Score:\t{}".format(cross_val_score(bagging, X, y, cv=5, n_jobs=-1).mean()))

CV Score:	0.910411696806
Bagging Score:	0.964971142747


### 1.b Scaled pipelines
As you may have noticed, the features are not normalized. Do the scores improve with normalization?
By now you should be very familiar with pipelines and scaling, so:

1. Create 2 pipelines, each with a scaling preprocessing step followed by either a decision tree or a bagging decision tree.
- Which score is better? Are the scores significantly different? How can you judge that?
- Are the scores different from those on the non-scaled data?
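
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `scaled`, the step names `"mms"` and `"model"`, and `random_state=0` are made up for the example, and an unscaled tree is included only as a reference point for the last question. Tree splits depend on the ordering of feature values rather than their magnitude, so large differences between scaled and unscaled scores would be surprising.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)


def scaled(model):
    """Hypothetical helper: put a MinMaxScaler in front of the given model."""
    return Pipeline([("mms", MinMaxScaler()), ("model", model)])


models = {
    "tree (unscaled)": DecisionTreeClassifier(random_state=0),
    "tree (scaled)": scaled(DecisionTreeClassifier(random_state=0)),
    "bagging (scaled)": scaled(
        BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)
    ),
}

results = {}
for name, model in models.items():
    # The scaler is refit on the training part of every fold, so no
    # information from the held-out fold leaks into the preprocessing
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores
    print("{}: mean={:.3f} std={:.3f}".format(name, scores.mean(), scores.std()))
```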

In [67]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
quote_clf = Pipeline([('mms', MinMaxScaler()),
                    ('Tree', DecisionTreeClassifier())])


quote_fit = quote_clf.fit(X_train, y_train)
quote_pred = quote_fit.predict(X_test)
print(classification_report(y_test, quote_pred))


             precision    recall  f1-score   support

          0       0.93      0.84      0.88        44
          1       0.93      0.97      0.95        99

avg / total       0.93      0.93      0.93       143



### 1.c Grid Search

Grid search is a great way to improve the performance of a classifier. Let's explore the parameter space of both models and see if we can improve their performance.

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Classifier
- Search over a few values of the parameters in order to improve the score of the classifier
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Classifier
    - This can take a really long time.
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Classifier?
- Which score is better? Are the scores significantly different? How can you judge that?
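
The note about changing parameter names refers to scikit-learn's double-underscore convention for nested estimators. The sketch below is one possible illustration, not the lab's reference solution: the grid values and `random_state=0` are assumptions, and the prefix is looked up at runtime because the wrapped estimator is exposed as `estimator` in recent releases and `base_estimator` in older ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)

# Pick the nested-parameter prefix that this scikit-learn version understands
prefix = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_grid = {
    prefix + "__max_depth": [2, 4, None],  # parameter of the wrapped tree
    "max_samples": [0.5, 1.0],             # floats: fraction of rows per tree
    "max_features": [0.5, 1.0],            # floats: fraction of columns per tree
}

# 3 * 2 * 2 = 12 candidates, each fit 5 times
grid = GridSearchCV(bagging, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

Passing `n_jobs=-1` to `GridSearchCV` spreads the fits across the available cores, which matters once the grid grows beyond a handful of candidates.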

In [77]:
# run gridsearch using GridSearchCV and 5 folds
# score on accuracy; what does this metric tell us?
criteria = ['entropy', 'gini']
max_vals = [1, 2, 3, 4, 5, 6]
Tree = DecisionTreeClassifier()

grid = GridSearchCV(estimator=Tree, param_grid=dict(max_depth=max_vals, criterion=criteria), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)
## Summarize the Results of the Grid Search
print(grid.best_score_)
print(grid.best_estimator_.max_depth)
# Note: quote_pred still holds the predictions of the scaled pipeline from 1.b, not of this grid search
print(classification_report(y_test, quote_pred))


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'criterion': ['entropy', 'gini'], 'max_depth': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.93848857645
4
             precision    recall  f1-score   support

          0       0.93      0.84      0.88        44
          1       0.93      0.97      0.95        99

avg / total       0.93      0.93      0.93       143



In [78]:


# run gridsearch using GridSearchCV and 5 folds
# score on accuracy; what does this metric tell us?
# Integer values are absolute counts, so each tree here is fit on at most
# 3 samples and 3 features, which likely explains the low score below.
# Use floats in (0, 1] to search over fractions instead.
max_f = [1, 2, 3]
max_vals = [1, 2, 3]
bagging = BaggingClassifier(base_estimator=Tree)

grid = GridSearchCV(estimator=bagging, param_grid=dict(max_samples=max_vals, max_features=max_f), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)
## Summarize the Results of the Grid Search
print(grid.best_score_)
print(grid.best_estimator_.max_samples)
# Note: quote_pred still holds the predictions of the scaled pipeline from 1.b, not of this grid search
print(classification_report(y_test, quote_pred))




GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, ...n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3], 'max_samples': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.806678383128
3
             precision    recall  f1-score   support

          0       0.93      0.84      0.88        44
          1       0.93      0.97      0.95        99

avg / total       0.93      0.93      0.93       143



## 2. Diabetes and Regression

scikit-learn includes a dataset of diabetes patients obtained from this study:

- http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
- http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

442 diabetes patients were measured on 10 baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements.

The target is a quantitative measure of disease progression one year after baseline.

Repeat the above comparison between a DecisionTreeRegressor and a Bagging version of the same.

### 2.a Simple comparison
1. Load the data and create X and y
- Initialize a Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds. Which score will you use?
- Wrap a Bagging Regressor around the Decision Tree Regressor and use cross_val_score to evaluate its performance. Set cross-validation to 5 folds.
- Which score is better? Are the scores significantly different? How can you judge that?
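
On the question of which score to use: when `scoring` is not given, `cross_val_score` falls back to the estimator's own `score` method, which for regressors is R². The sketch below is an illustration under assumptions (the shuffled `KFold`, `random_state=0`, and the `results` dictionary are not from the lab) that reports both R² and mean squared error for the two models on identical folds.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Same folds for every model and metric so the numbers are comparable
folds = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "bagging": BaggingRegressor(DecisionTreeRegressor(random_state=0), random_state=0),
}

results = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=folds)  # default scorer: R^2
    # scikit-learn scorers are "higher is better", hence the negated MSE
    neg_mse = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    results[name] = (r2.mean(), -neg_mse.mean())
    print("{}: R^2={:.3f} MSE={:.1f}".format(name, r2.mean(), -neg_mse.mean()))
```

An R² near zero means the model does little better than always predicting the mean of the target, which is a useful reference when reading the scores in this section.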

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [89]:
import pandas as pd
from sklearn.datasets import load_diabetes 
data = load_diabetes()
X = pd.DataFrame(data.data, columns= ['age', 'sex', 'bmi', 'map' ,'tc' ,'ldl' ,'hdl', 'tch' ,'ltg' ,'glu'])
X.head()


Unnamed: 0,age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641


In [91]:
# Target vector: quantitative measure of disease progression one year after baseline
y = data.target

In [96]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)



In [100]:
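from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
dtree1 = DecisionTreeRegressor()
dtree1.fit(X_train, y_train)
# For regressors, cross_val_score reports R^2 by default
print("Tree Score:\t{}".format(cross_val_score(dtree1, X_train, y_train, cv=5, n_jobs=-1).mean()))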

Tree Score:	0.0216524703807


In [106]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingRegressor


Tree = DecisionTreeRegressor()
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2
bagging = BaggingRegressor(base_estimator=Tree, max_samples=0.5, max_features=0.5)

print("tree Score:\t{}".format(cross_val_score(Tree, X_train, y_train, cv=5, n_jobs=-1).mean()))
print("Bagging Score:\t{}".format(cross_val_score(bagging, X_train, y_train, cv=5, n_jobs=-1).mean()))

tree Score:	0.0216524703807
Bagging Score:	0.383585324604


In [111]:
# run gridsearch using GridSearchCV and 5 folds
# GridSearchCV scores regressors with R^2 by default; what does this metric tell us?
criteria = ['mse']  # named 'squared_error' in scikit-learn >= 1.0
max_vals = [1, 2, 3, 4, 5, 6]
Tree = DecisionTreeRegressor()
max_features = [1, 2, 3, 4, 5]

grid = GridSearchCV(estimator=Tree, param_grid=dict(max_depth=max_vals, criterion=criteria, max_features=max_features), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)
## Summarize the Results of the Grid Search
print(grid.best_score_)
print(grid.best_estimator_.max_depth)


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3, 4, 5], 'criterion': ['mse'], 'max_depth': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.353553284801
3


In [115]:
# run gridsearch using GridSearchCV and 5 folds
# GridSearchCV scores regressors with R^2 by default; what does this metric tell us?

# Integer values are absolute counts, so each tree here is fit on at most
# 6 samples and 5 features. Use floats in (0, 1] to search over fractions instead.
max_vals = [1, 2, 3, 4, 5, 6]
Tree = DecisionTreeRegressor()
max_features = [1, 2, 3, 4, 5]


bagging = BaggingRegressor(base_estimator=Tree)


grid = GridSearchCV(estimator=bagging, param_grid=dict(max_samples=max_vals, max_features=max_features), cv=5)

grid.fit(X, y)

# find the best parameters of our gridsearch model.
grid.best_params_
print(grid)
## Summarize the Results of the Grid Search
print(grid.best_score_)
print(grid.best_estimator_.max_samples)


GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 2, 3, 4, 5], 'max_samples': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.272594942743
6


### 2.b Grid Search

Repeat Grid search as above:

1. Initialize a GridSearchCV with 5-fold cross-validation for the Decision Tree Regressor
- Search over a few values of the parameters in order to improve the score of the regressor
- Use the whole X, y dataset for your test
- Check the best\_score\_ once you've trained it. Is it better than before?
- How does the score of the Grid-searched DT compare with the score of the Bagging DT?
- Initialize a GridSearchCV with 5-fold cross-validation for the Bagging Decision Tree Regressor
- Repeat the search
    - Note that you'll have to change parameter names for the base_estimator
    - Note that there are also additional parameters to change
    - Note that you may end up with a grid space too large to search in a short time
    - Make use of the n_jobs parameter to speed up your grid search
- Does the score improve for the Grid-searched Bagging Regressor?
- Which score is better? Are the scores significantly different? How can you judge that?
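
When choosing values for the additional bagging parameters, it helps to know that `max_samples` and `max_features` read integers as absolute counts and floats as fractions. The sketch below is a small check of that behaviour, assuming a recent scikit-learn release in which `estimators_samples_` holds the row indices drawn for each tree. The helper `rows_per_tree`, `n_estimators=3`, and `random_state=0` are made up for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)


def rows_per_tree(max_samples):
    """Hypothetical helper: fit a tiny ensemble and count the rows given to its first tree."""
    model = BaggingRegressor(
        DecisionTreeRegressor(random_state=0),
        n_estimators=3,
        max_samples=max_samples,
        random_state=0,
    )
    model.fit(X, y)
    return len(model.estimators_samples_[0])


print(rows_per_tree(6))    # integer: an absolute number of rows
print(rows_per_tree(0.5))  # float: a fraction of the training rows
```

A grid such as `max_samples=[1, 2, 3, 4, 5, 6]` therefore trains every tree on a handful of rows, while a grid of floats such as `[0.25, 0.5, 1.0]` varies the share of the data each tree sees.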
