# Lab - Decision Trees 

This lab asks you to play with regression and classification trees,
and find the best combination of hyperparameters.  We use Wisconsin
Diagnostic Breast Cancer (WDBC) data for categorization and Boston
housing data for regression.  Both tasks are fairly similar.

The aim of this lab is to give you some experience with trees and
hyperparameter tuning.  Try to get as good accuracy/RMSE as possible!

## Classification 
In this task you work with WDBC data.  As a reminder, your task is to
predict __diagnosis__ (''M'' = cancer, ''B'' = no cancer).  


1. Load wdbc data and ensure it looks good.


2. Create your feature matrix $X$ and label vector $y$.  The former should contain all 30 features,  everything, except __diagnosis__ and __id__.  The latter should be __diagnosis__, converted to either logical or numeric variable (otherwise sklearn will fail).


3.  Split your data into training and validation chunks (or do cross validation below, but that is slower).


In [23]:
#code goes here
import pandas as pd
import numpy as np
from sklearn import tree
wdbc_df = pd.read_csv('wdbc.csv.bz2')
x = wdbc_df.drop('diagnosis', axis=1).drop('id', axis=1)
y = np.where(wdbc_df.diagnosis == 'M', 1, 0)

from sklearn.model_selection import train_test_split
f_train , f_test, l_train, l_test = train_test_split(x, y, test_size = 0.3)

Now everything should be ready for a few classification trees.  Your
task is to analyze the effect of three hyperparameters of DecisionTreeClassifier:
max_depth, min_samples_split and min_samples_leaf.  All these hyperparameters help to avoid
overfitting. 


4. Explain what do these hyperparameters do.


5. Fit a decision tree (on training data), and compute accuracy (on validation data).  Use a combination of all three hyperparameters when defining the model.  As a refresher, you can create it along these lines:
```
m = DecisionTreeClassifier(max_depth=7, min_samples_leaf=..., ...)
```  
and you can compute accuracy on validation data as
```
m.score(Xv, yv)
```
where Xv and yv are your validation/test features $X$ and test labels $y$. 


Hyperparamters do the following to the decision tree classifier:
1. max_depth controls the depth of the tree using an integer value. if a value is not provided it will continue until leaf nodes are pure.
2. min_samples_split is the minimum samples needed to split a node where the default is 2 and any float or int provided will help determine if the node will become a leaf.
3. min_samples_leaf is the minimum samples needed to be at a leaf node. A split can only happen if there is atleast min_samples_leaf in both of the resulting nodes post split with default of 1

In [28]:
#code goes here
d_tree = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
m = d_tree.fit(f_test, l_test)
m.score(f_test, l_test)

0.9941520467836257

Now it is time to do a more thorough search through hyperparameters by
performing 3-D grid search.  

6. Write a 3-fold nested loop where the outer loop runs over max depth, next loop runs over min samples split, and the innermost loop runs over min sample leafs.  Use a meaningful set of values for each of these.  For instance, I am using:
```
depths = range(1,6)
splits = [2,5,10,20,50,100]
leafs = [1,2,5,10,20,50,100]
```
You may want to start with a smaller number of combinations to speed
up the process though.


Inside of the loop, define a decision tree classifier using these
parameters, fit it on training data, and compute accuracy on
validation data.  Essentially you repeat question 5, just inside of the loop.


6. Find the best accuracy and the corresponding hyperparameter combination your loop can detect.  You can just check inside the innermost loop if the current accuracy is better than the previous best accuracy.


7. Finally, compare the best accuracy you achieved using trees with a similar accuracy using logistic regression (on validation data).(You may want to increase max_iter.) Which model gives you better accuracy?


In [36]:
#code goes here
best = 0
best_d = 0
best_s = 0
best_l = 0
for depth in range(1,6):
    for splits in [2,5,10,20,50,100]:
        for leafs in [1,2,5,10,20,50,100]:
            d_tree = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = depth, min_samples_split = splits, min_samples_leaf = leafs)
            m = d_tree.fit(f_test, l_test)
            
            score = m.score(f_test, l_test)
            if score > best:
                best_s = splits
                best_d = depth
                best_l = leafs
                best = score
print('the best model over the given intervals = depth: ' + str(best_d) + ' min_samples_split: ' + str(best_s) + ' min_sample_leafs: ' + str(best_l) + ' score: ' + str(best))

the best model over the given intervals = depth: 5 min_samples_split: 2 min_sample_leafs: 1 score: 1.0


In [47]:
#logistic regression 
import statsmodels.api as sm
log_reg = sm.Logit(l_test, f_test).fit(maxiter = 200)
log_reg.summary()

         Current function value: inf
         Iterations: 200




0,1,2,3
Dep. Variable:,y,No. Observations:,171.0
Model:,Logit,Df Residuals:,141.0
Method:,MLE,Df Model:,29.0
Date:,"Wed, 24 Apr 2024",Pseudo R-squ.:,-inf
Time:,14:06:11,Log-Likelihood:,-inf
converged:,False,LL-Null:,-113.56
Covariance Type:,nonrobust,LLR p-value:,1.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
radius.mean,-1.161e+13,,,,,
texture.mean,-1.394e+13,,,,,
perimeter.mean,-7.641e+13,,,,,
area.mean,-6.502e+14,,,,,
smoothness.mean,-6.737e+10,,,,,
compactness.mean,-9.036e+10,,,,,
concavity.mean,-1.007e+11,,,,,
concpoints.mean,-5.652e+10,,,,,
symmetry.mean,-1.242e+11,,,,,


## Regression Trees

This task is a very similar task to the previous one, just you should do a regression, not classification model.  So you can copy-paste most of your code, and then modify it a little bit.
We use Boston housing data and predict the median value (medv) using all other
attributes.  Instead of accuracy, we are now using RMSE, and instead of comparing the result with logistic regression, we compare it with linear regression.

1. Load boston data and ensure it looks good.


2. Create your feature matrix $X$ and outcome/label vector $y$.  The former should contain all features, exceot medv, and the latter is medv.


3. Split your data into training and validation chunks (or do cross validation below, but that is slower).

4. Fit a regression tree (on training data), and compute RMSE (on validation data).  Use a combination of the same hyperparameterswhen defining the model.  
  
As a refresher, RMSE is defined as
$RMSE = \sqrt{
      \frac{1}{N} \sum_{i=1}^{n} (\hat y_{i} - y_{i})^{2}
    }$


5. Write a similar 3-fold nested loop over these three hyperparameters. Inside of the loop, define a decision tree classifier using these parameters, fit it on training data, and compute RMSE on validation data.  Essentially you repeat question 4, just inside of the loop.


6. Find the best accuracy and the corresponding hyperparameter combination your loop can detect.  You can just check inside the innermost loop if the current accuracy is better than the previous best accuracy.

7. Finally, compare the best RMSE you achieved using regression trees with a RMSE of linear regression (on validation data). Which model gives you better accuracy?

In [None]:
#code goes here 