# Lab - Decision Trees 

This lab asks you to experiment with regression and classification trees
and find the best combination of hyperparameters.  We use the Wisconsin
Diagnostic Breast Cancer (WDBC) data for classification and the Boston
housing data for regression.  The two tasks are fairly similar.

The aim of this lab is to give you some experience with trees and
hyperparameter tuning.  Try to get as good accuracy/RMSE as possible!

## Classification 
In this task you work with the WDBC data.  As a reminder, your task is to
predict __diagnosis__ ("M" = cancer, "B" = no cancer).


1. Load wdbc data and ensure it looks good.


2. Create your feature matrix $X$ and label vector $y$.  The former should contain all 30 features, i.e. everything except __diagnosis__ and __id__.  The latter should be __diagnosis__, converted to either a logical or a numeric variable (otherwise sklearn will fail).


3.  Split your data into training and validation chunks (or do cross validation below, but that is slower).


In [238]:
#code goes here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [239]:
#1
wdbc = pd.read_csv("wdbc.csv.bz2")
wdbc.head()

Unnamed: 0,id,diagnosis,radius.mean,texture.mean,perimeter.mean,area.mean,smoothness.mean,compactness.mean,concavity.mean,concpoints.mean,...,radius.worst,texture.worst,perimeter.worst,area.worst,smoothness.worst,compactness.worst,concavity.worst,concpoints.worst,symmetry.worst,fracdim.worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [240]:
#1
print(wdbc.shape)
print()
print(wdbc.isna().sum())

(569, 32)

id                   0
diagnosis            0
radius.mean          0
texture.mean         0
perimeter.mean       0
area.mean            0
smoothness.mean      0
compactness.mean     0
concavity.mean       0
concpoints.mean      0
symmetry.mean        0
fracdim.mean         0
radius.se            0
texture.se           0
perimeter.se         0
area.se              0
smoothness.se        0
compactness.se       0
concavity.se         0
concpoints.se        0
symmetry.se          0
fracdim.se           0
radius.worst         0
texture.worst        0
perimeter.worst      0
area.worst           0
smoothness.worst     0
compactness.worst    0
concavity.worst      0
concpoints.worst     0
symmetry.worst       0
fracdim.worst        0
dtype: int64


In [241]:
#2
features = wdbc.loc[:, ~wdbc.columns.isin(['diagnosis', 'id'])].copy()
labels = wdbc.loc[:, wdbc.columns.isin(['diagnosis'])].copy()
labels["diagnosis"] = np.where(labels["diagnosis"] == "M", 1, 0)

In [242]:
#3
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size = 0.20)
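
The cross-validation alternative mentioned in step 3 can be sketched as follows. This is a minimal illustration, not part of the lab solution: it assumes that scikit-learn's bundled copy of the WDBC data (`load_breast_cancer`) stands in for `wdbc.csv.bz2`, and the hyperparameter values are arbitrary placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled WDBC copy replaces the lab's csv file.
# It codes the target as 0 = malignant, 1 = benign (the opposite of the
# M -> 1 coding used above); accuracy does not depend on the coding.
X, y = load_breast_cancer(return_X_y=True)

# Placeholder hyperparameters, chosen only for illustration
m = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)

# 5-fold cross-validation: the model is refit five times and
# scored on the held-out fold each time (accuracy by default)
scores = cross_val_score(m, X, y, cv=5)
print(scores.mean(), scores.std())
```

The mean over folds is less sensitive to one lucky or unlucky split than a single validation score, at the cost of fitting the model five times.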

Now everything should be ready for a few classification trees.  Your
task is to analyze the effect of three hyperparameters of DecisionTreeClassifier:
max_depth, min_samples_split and min_samples_leaf.  All these hyperparameters help to avoid
overfitting. 


4. Explain what these hyperparameters do.


5. Fit a decision tree (on training data), and compute accuracy (on validation data).  Use a combination of all three hyperparameters when defining the model.  As a refresher, you can create it along these lines:
```
m = DecisionTreeClassifier(max_depth=7, min_samples_leaf=..., ...)
```  
and you can compute accuracy on validation data as
```
m.score(Xv, yv)
```
where Xv and yv are your validation/test features $X$ and test labels $y$. 


In [243]:
#code goes here
#4

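- max_depth: the maximum depth the tree may grow to, i.e. the largest number of splits between the root and any leaf
- min_samples_split: the minimum number of samples a node must contain before it may be split further
- min_samples_leaf: the minimum number of samples that each leaf (terminal node) must contain; a split is only made if it leaves at least this many samples in both branches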
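
The effect of these three hyperparameters can be seen by comparing an unrestricted tree with a restricted one. The sketch below uses synthetic data from `make_classification` (an assumption made to keep it self-contained; it is not the WDBC data), and the restriction values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)
restricted = DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, min_samples_leaf=10, random_state=0
).fit(X, y)

for name, tree in [("unrestricted", unrestricted), ("restricted", restricted)]:
    # tree_.n_node_samples holds the training sample count of every node;
    # leaves are the nodes without a left child (children_left == -1)
    is_leaf = tree.tree_.children_left == -1
    smallest_leaf = tree.tree_.n_node_samples[is_leaf].min()
    print(name, tree.get_depth(), tree.get_n_leaves(), smallest_leaf)
```

The restricted tree cannot exceed depth 3 (at most 8 leaves) and none of its leaves holds fewer than 10 training samples, which is what limits overfitting.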

In [244]:
#5
from sklearn.tree import DecisionTreeClassifier
m = DecisionTreeClassifier(max_depth = 7, min_samples_split = 2, min_samples_leaf = 1)
_ = m.fit(X_train, y_train)
m.score(X_val, y_val)

0.9473684210526315

Now it is time to do a more thorough search through hyperparameters by
performing 3-D grid search.  

6. Write a triple nested loop where the outer loop runs over max depth, the next loop runs over min samples split, and the innermost loop runs over min samples leaf.  Use a meaningful set of values for each of these.  For instance, I am using:
```
depths = range(1,6)
splits = [2,5,10,20,50,100]
leafs = [1,2,5,10,20,50,100]
```
You may want to start with a smaller number of combinations to speed
up the process though.


Inside of the loop, define a decision tree classifier using these
parameters, fit it on training data, and compute accuracy on
validation data.  Essentially you repeat question 5, just inside of the loop.


6. Find the best accuracy and the corresponding hyperparameter combination your loop can detect.  You can just check inside the innermost loop if the current accuracy is better than the previous best accuracy.


7. Finally, compare the best accuracy you achieved using trees with the accuracy of logistic regression (on validation data). (You may want to increase max_iter.) Which model gives you better accuracy?


In [245]:
#code goes here
#6

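# candidate values for min_samples_split and min_samples_leaf
splits = [2, 5, 10, 20, 50, 100]
leafs = [1, 2, 5, 10, 20, 50, 100]

new_accu = 0   # accuracy of the current combination
curr_accu = 0  # best accuracy found so far

# hyperparameters that produced the best accuracy
depth = 0
split = 0
leaf = 0

# 3-D grid search: max_depth (i) x min_samples_split (j) x min_samples_leaf (k)
for i in range(1, 6):
    for j in splits:
        for k in leafs:
            m = DecisionTreeClassifier(max_depth = i, min_samples_split = j, min_samples_leaf = k)
            _ = m.fit(X_train, y_train)
            new_accu = m.score(X_val, y_val)
            # keep the combination only if it beats the best one so far
            if curr_accu < new_accu:
                curr_accu = new_accu
                depth = i
                split = j
                leaf = k

print(curr_accu) #best accuracy
print(depth, split, leaf) #corresponding hyperparameter combination - depth, split, leaf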

0.956140350877193
5 2 2
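
For comparison, the same grid can be searched with scikit-learn's `GridSearchCV` instead of a hand-written loop. This is a sketch under stated assumptions, not the lab's required method: it uses the bundled `load_breast_cancer` copy of the WDBC data, and it scores each combination by 5-fold cross-validation rather than by a single validation split, so its numbers are not directly comparable with the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled WDBC copy instead of wdbc.csv.bz2
X, y = load_breast_cancer(return_X_y=True)

# Same grid as in the nested loop: 5 * 6 * 7 = 210 combinations
param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "min_samples_split": [2, 5, 10, 20, 50, 100],
    "min_samples_leaf": [1, 2, 5, 10, 20, 50, 100],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_score_)   # mean cross-validated accuracy of the best combination
print(search.best_params_)  # the corresponding hyperparameters
```

Writing the loop by hand, as the lab asks, makes the mechanics explicit; `GridSearchCV` does the same bookkeeping and adds cross-validation.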


In [246]:
#7
from sklearn.linear_model import LogisticRegression
m = LogisticRegression(max_iter = 3000)
_ = m.fit(X_train, y_train.values.ravel())
m.score(X_val, y_val)

0.9473684210526315

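With a validation accuracy of 0.956, the best tree did slightly better than logistic regression (0.947). On the 114 validation cases this is a difference of a single observation, and the tree's hyperparameters were selected on the same validation data, so the gap is too small to call one model clearly better.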

## Regression Trees

This task is very similar to the previous one, except that you fit a regression model instead of a classification model.  So you can copy-paste most of your code and then modify it a little bit.
We use the Boston housing data and predict the median value (medv) using all other
attributes.  Instead of accuracy, we now use RMSE, and instead of comparing the result with logistic regression, we compare it with linear regression.

1. Load boston data and ensure it looks good.


2. Create your feature matrix $X$ and outcome/label vector $y$.  The former should contain all features except medv, and the latter is medv.


3. Split your data into training and validation chunks (or do cross validation below, but that is slower).

4. Fit a regression tree (on training data), and compute RMSE (on validation data).  Use a combination of the same hyperparameters when defining the model.  
  
As a refresher, RMSE is defined as
$RMSE = \sqrt{
      \frac{1}{N} \sum_{i=1}^{N} (\hat y_{i} - y_{i})^{2}
    }$


5. Write a similar triple nested loop over these three hyperparameters. Inside of the loop, define a decision tree regressor using these parameters, fit it on training data, and compute RMSE on validation data.  Essentially you repeat question 4, just inside of the loop.


6. Find the best (lowest) RMSE and the corresponding hyperparameter combination your loop can detect.  You can just check inside the innermost loop if the current RMSE is lower than the previous best RMSE.

7. Finally, compare the best RMSE you achieved using regression trees with the RMSE of linear regression (on validation data). Which model gives you the lower RMSE?

In [247]:
#code goes here 
#1
btsn = pd.read_csv("boston.csv.bz2", sep="\t")
btsn.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [248]:
#1
print(btsn.shape)
print()
print(btsn.isna().sum())

(506, 14)

crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64


In [249]:
#2
features = btsn.loc[:, ~btsn.columns.isin(['medv'])].copy()
labels = btsn.loc[:, btsn.columns.isin(['medv'])].copy()

In [250]:
#3
X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size = 0.20)

In [251]:
#4
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

m = DecisionTreeRegressor(max_depth = 7, min_samples_split = 2, min_samples_leaf = 1)
_ = m.fit(X_train, y_train)
y_pred = m.predict(X_val)

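# newer scikit-learn versions dropped the squared= argument,
# so take the square root of the MSE explicitly
rmse = np.sqrt(mean_squared_error(y_val, y_pred))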
rmse

3.7191659183051256
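
To make the RMSE definition concrete, here is a small worked example with made-up numbers (the three values are purely illustrative and have nothing to do with the Boston data). It computes the RMSE directly from the formula and checks it against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up outcomes and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

# Errors are -1, 0, 2; squared errors are 1, 0, 4; their mean is 5/3
rmse = np.sqrt(np.mean((y_hat - y_true) ** 2))

# The same value via sklearn: square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse, rmse_sklearn)  # both are sqrt(5/3), about 1.291
```

RMSE is in the same units as the outcome, so the value above for medv can be read as a typical prediction error in units of medv.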

In [252]:
#5
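curr_rmse = np.inf  # best (lowest) RMSE so far; any real RMSE is smaller

# hyperparameters that produced the best RMSE
depth = 0
split = 0
leaf = 0

# splits and leafs are the value lists defined in the classification part
for i in range(1, 6):
    for j in splits:
        for k in leafs:
            m = DecisionTreeRegressor(max_depth = i, min_samples_split = j, min_samples_leaf = k)
            _ = m.fit(X_train, y_train)
            y_pred = m.predict(X_val)
            # RMSE = sqrt(MSE); avoids the squared= argument that newer sklearn versions dropped
            new_rmse = np.sqrt(mean_squared_error(y_val, y_pred))
            # keep the combination only if its RMSE is lower than the best so far
            if curr_rmse > new_rmse:
                curr_rmse = new_rmse
                depth = i
                split = j
                leaf = k

print(curr_rmse) #best RMSE
print(depth, split, leaf) #corresponding hyperparameter combination - depth, split, leaf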

3.4426198143584723
4 20 2


In [253]:
#6
# Note: DecisionTreeRegressor.score returns R^2, not accuracy; higher is better
curr_accu = -np.inf  # best R^2 so far (R^2 can be negative)

# hyperparameters that produced the best R^2
depth = 0
split = 0
leaf = 0

for i in range(1, 6):
    for j in splits:
        for k in leafs:
            m = DecisionTreeRegressor(max_depth = i, min_samples_split = j, min_samples_leaf = k)
            _ = m.fit(X_train, y_train)
            new_accu = m.score(X_val, y_val)  # R^2 on validation data
            if curr_accu < new_accu:
                curr_accu = new_accu
                depth = i
                split = j
                leaf = k

print(curr_accu) #best R^2
print(depth, split, leaf) #corresponding hyperparameter combination - depth, split, leaf

0.8543646502629447
4 20 2
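
The `score` method of a regressor returns $R^2$, which on a fixed validation set equals $1 - MSE / Var(y)$. Because $Var(y)$ does not depend on the model, the combination with the highest $R^2$ is also the one with the lowest RMSE, which is consistent with both searches above reporting the same hyperparameters. The sketch below checks this identity on synthetic data from `make_regression` (an assumption made to keep it self-contained; it is not the Boston data).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

m = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
y_pred = m.predict(X_va)

# Mean squared error on validation data
mse = np.mean((y_pred - y_va) ** 2)

# R^2 = 1 - MSE / Var(y); np.var divides by N, matching the MSE
r2_manual = 1 - mse / np.var(y_va)

print(r2_manual, m.score(X_va, y_va))  # the two values agree
```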


In [254]:
#7
from sklearn.linear_model import LinearRegression

m = LinearRegression()
_ = m.fit(X_train, y_train)
y_pred = m.predict(X_val)
# square root of the MSE (newer scikit-learn versions dropped the squared= argument)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))

rmse

4.531841117074557

The best regression tree achieved a validation RMSE of about 3.44, clearly lower than the 4.53 of linear regression, so the tree predicts medv better on this validation split.