In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

<br>

Let's use the Boston dataset again, with the same DV ('medv') and one predictor ('lstat').

First, let's run a Simple Linear Regression and a Second-Degree Polynomial Regression so that we can compare these two models with a Regression Tree.

<br>

In [None]:
# Note: load_boston() was removed in scikit-learn 1.2, so this cell needs an older version
boston_data=datasets.load_boston()

In [None]:
boston_df=pd.DataFrame(boston_data.data, columns= boston_data.feature_names)

In [None]:
boston_df['medv']= boston_data.target

In [None]:
boston_df.head()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

One of the metrics that we are going to import is the mean_squared_error (MSE) because it is relevant when constructing regression trees.

### Simple Linear Regression with Scikit-learn

In [None]:
simple_reg_object= LinearRegression(fit_intercept=True)

In [None]:
X= boston_df['LSTAT'].values.reshape(-1,1)

In [None]:
y= boston_df['medv'].values

In [None]:
simple_reg_object.fit(X, y)

In [None]:
np.round ( r2_score(y, simple_reg_object.predict(X)), 2)

In [None]:
np.round( mean_squared_error(y, simple_reg_object.predict(X)), 2)

### Second degree polynomial regression with scikit-learn

In [None]:
poly2_object= PolynomialFeatures(degree=2)

In [None]:
X_poly2= poly2_object.fit_transform(X)

In [None]:
poly2_reg_object = LinearRegression(fit_intercept=True)

In [None]:
poly2_reg_object.fit(X_poly2, y)

In [None]:
round ( r2_score(y, poly2_reg_object.predict(X_poly2)), 2)

In [None]:
# Adj R Squared

top= sum((y - poly2_reg_object.predict(X_poly2))**2)/ (y.size-2-1)

bottom= sum((y - y.mean())**2)/(y.size-1)

round (1-(top/bottom), 2)

In [None]:
round (mean_squared_error(y, poly2_reg_object.predict(X_poly2)), 2)

### Regression Tree to predict 'medv' using only 'LSTAT' as predictor

In [None]:
from sklearn import tree

from sklearn.tree import DecisionTreeRegressor

#### Some important _hyperparameters_ of the DecisionTreeRegressor () algorithm:


__Hyperparameters__: Parameters that you set up for the method that you are using BEFORE applying the method to a specific dataset. We have set some hyperparameters in the past, for example:

- For linear regression: fit_intercept= True
- For polynomial regression: degree= 2 (or 3, 4,..., the degree that you want to use)
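
A minimal sketch of the difference between hyperparameters and fitted parameters, assuming a recent scikit-learn version. The toy arrays `X_toy` and `y_toy` are made up for illustration and are not part of the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hyperparameters are chosen when the object is created, before any data is seen.
tree_defaults = DecisionTreeRegressor().get_params()
for name in ["splitter", "max_depth", "min_samples_split", "min_samples_leaf"]:
    print(name, "=", tree_defaults[name])

# Parameters (here the slope and the intercept) are learned from the data by fit().
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])   # made-up predictor
y_toy = np.array([3.0, 5.0, 7.0, 9.0])           # made-up response: y = 2x + 1
lin_reg = LinearRegression(fit_intercept=True).fit(X_toy, y_toy)
print("slope:", lin_reg.coef_[0], "intercept:", lin_reg.intercept_)
```

`get_params()` lists every hyperparameter of an estimator with its current value, which is a quick way to check the defaults before changing them.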


<br>

__splitter: {“best”, “random”}, default=“best”__

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

<br>

__max_depth: int, default=None__

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure (RSS=0) or until all leaves contain fewer than __min_samples_split__ samples.

<br>


__min_samples_split: int or float, default=2__

The minimum number of samples required to split an internal node:

- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

<br>

__min_samples_leaf: int or float, default=1__

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least __min_samples_leaf__ training samples in each of the left and right branches. 

- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

<br>


__max_features: int, float or {“auto”, “sqrt”, “log2”}, default=None__

The number of features to consider when looking for the best split

The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

__Comment__: For now, we will not change this hyperparameter, which means that the algorithm will consider all the features when attempting a split.



<br>

__random_state: int, RandomState instance or None, default=None__

__Controls the randomness of the estimator.__

When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them __(THAT WILL LIKELY CREATE DIFFERENT RESULTS ACROSS ATTEMPTS!!)__. 

But the best found split may vary across different runs, __even if max_features=n_features__. That is the case __if the improvement of the criterion is identical for several splits and one split has to be selected at random.__ 

__To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer.__

<br>

__min_impurity_decrease: float, default=0.0__

min_impurity_decrease is the minimum decrease in impurity that we are willing to accept for a split.

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

In scikit-learn regression trees with the default criterion, impurity = MSE of the node (the node's RSS, or SSE, divided by the number of samples in the node)

The weighted impurity decrease equation is the following:

(N_t / N) * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
                    

where,
N is the total number of samples

N_t is the number of samples at the current node

N_t_L is the number of samples in the left child

N_t_R is the number of samples in the right child.
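
A small worked example of the weighted impurity decrease, assuming that impurity is measured as the MSE of each node (scikit-learn's default squared-error criterion). The functions `node_mse` and `weighted_impurity_decrease` and the four toy values are hypothetical, written only to illustrate the equation.

```python
import numpy as np

def node_mse(values):
    """Impurity of a regression node: mean squared deviation from the node mean."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def weighted_impurity_decrease(y_node, left_mask, n_total):
    """(N_t / N) * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)"""
    y_node = np.asarray(y_node, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    y_left, y_right = y_node[left_mask], y_node[~left_mask]
    n_t = y_node.size
    return (n_t / n_total) * (
        node_mse(y_node)
        - (y_right.size / n_t) * node_mse(y_right)
        - (y_left.size / n_t) * node_mse(y_left)
    )

# Toy root node (N_t = N = 4) split into two children of two observations each
y_toy = np.array([10.0, 12.0, 30.0, 32.0])
decrease = weighted_impurity_decrease(y_toy, [True, True, False, False], n_total=4)
print(decrease)
```

Here the parent MSE is 101 and each child has an MSE of 1, so the decrease is 1 * (101 - 0.5 * 1 - 0.5 * 1) = 100. The split would be accepted for any min_impurity_decrease up to 100.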


### Fitting a RT for medv based on lstat

We will fit our first regression tree without using any refined approach for selecting the hyperparameters (we will study a more refined approach later)

Let's just fit the tree defining one simple stopping criterion: the number of cases (observations) in each leaf must be at least 10% of the sample size (min_samples_leaf=0.1)

10% ---> 10% of 506 = 50.6, which is rounded up to 51
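
The rounding can be checked with a couple of lines. This is just a sketch of the ceil() rule described for min_samples_leaf; the helper name `min_samples_from_fraction` is hypothetical.

```python
import math

def min_samples_from_fraction(fraction, n_samples):
    """Hypothetical helper: turn a fractional setting into a number of samples."""
    return math.ceil(fraction * n_samples)

n_samples = 506  # sample size of the Boston data used in this notebook
leaf_min = min_samples_from_fraction(0.1, n_samples)
print(leaf_min)
```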

In [None]:
reg_tree_lstat= DecisionTreeRegressor(min_samples_leaf=0.1, random_state=1)

In [None]:
reg_tree_lstat.fit(X, y)

__To work INDIVIDUALLY FOR A COUPLE OF MINUTES__

Compute R2 for the RT by passing y and the predictions of the RT object to the function r2_score()

Compute MSE for the RT by passing y and the predictions of the RT object to the function mean_squared_error()

In [None]:
# R sq for RT

np.round ( r2_score(y, reg_tree_lstat.predict(X)), 2)

In [None]:
# MSE for RT
np.round ( mean_squared_error(y, reg_tree_lstat.predict(X)), 2)

How to compute __Adjusted R Squared for a RT__? 

Let's try to understand the formula of Adjusted R Squared a bit better.


Adjusted R Squared = 1 - [ (SSE / DF error) / (SSTotal / (n-1)) ]

a) In Linear Regression

__DF error= n- (K+1), K= # predictors__

Ex: For simple linear regression, K=1; therefore:

DF error= n-(1+1)= n-2


b) In general

__DF error= n - # parameters to estimate__


Ex: For a second degree polynomial equation on X:

DF error= n - 3 (3 parameters to estimate: the intercept, the coeff for X, and the coeff for X squared)


Ex: In a Regression Tree

DF error= n- (# of regions)

For each region, the RT method needs to estimate the mean of Y
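
The general formula can be sketched as one small function that works for any of the three models, as long as we tell it how many parameters were estimated. The name `adjusted_r2` and the toy numbers are hypothetical, used only for illustration.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_params):
    """Adjusted R squared with DF error = n - n_params.

    n_params is the number of estimated quantities: K + 1 for a linear
    regression with K predictors, or the number of leaves for a tree.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    sse = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - (sse / (n - n_params)) / (ss_total / (n - 1))

# Made-up observations and predictions
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred_demo = np.array([1.1, 1.9, 3.2, 3.9, 5.0])

adj_two_params = adjusted_r2(y_demo, pred_demo, n_params=2)    # e.g. simple linear regression
adj_three_params = adjusted_r2(y_demo, pred_demo, n_params=3)  # e.g. a tree with 3 leaves
print(adj_two_params, adj_three_params)
```

For the same predictions, the model that needed more parameters gets the lower adjusted R squared, which is the penalty that the formula is designed to apply.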

Computing adjusted R squared for the regression tree

We need to know how many regions resulted from the tree. How to find out the number of regions?

Option 1: Plot the tree and observe the number of regions (= number of leaves)

Option 2: Count the number of leaf nodes (see next code cell)

The attributes '.tree_.children_left' and '.tree_.children_right' of the tree object are arrays with the indexes of the left and right child of each node. 

These arrays contain a -1 at the position of every node that has no children, that is, at every leaf node.

Therefore, we can know the number of leaves in the tree by counting the number of -1 in the output of '.tree_.children_left' or '.tree_.children_right'

In [None]:
reg_tree_lstat.tree_.children_left

In [None]:
reg_tree_lstat.tree_.children_right

In [None]:
# Get the number of leaves in the tree by counting the number of -1 in any of the above arrays
# Save it in a variable called 'number_leaves'
# DO IT HERE

number_leaves = np.sum(reg_tree_lstat.tree_.children_right==-1)
number_leaves

In [None]:
# There is a direct method to get the number of leaves

reg_tree_lstat.get_n_leaves()

In [None]:
# Adjusted R squared for the tree

top= sum((y - reg_tree_lstat.predict(X))**2)/ (y.size-number_leaves)

bottom= (sum((y - y.mean())**2))/(y.size-1)

round (1-(top/bottom), 2)

In [None]:
# let's plot the tree!

tree.plot_tree(reg_tree_lstat)

plt.show()

In [None]:
# The figsize parameter sets the plot size (in inches)

plt.figure(figsize=(12,12))   
tree.plot_tree(reg_tree_lstat, fontsize=10, feature_names=['LSTAT'])
plt.show()

In [None]:
# The MSE of the top node is the variance of medv (= the variance of Y)

np.round( np.var(boston_df['medv']), 3)

In [None]:
# The prediction of Y for the top node is just the mean of Y

np.round( np.mean(boston_df['medv']), 3)

In [None]:
# The MSE of the left node on the second layer is the variance of medv for the observations where the LSTAT <= 9.725

np.round(np.var (boston_df['medv'][boston_df['LSTAT']<=9.725]), 3)

### Interpret the tree
##### Take 3 mins and try to answer the following questions independently. We will discuss the answers in 3 mins

What is the max depth of the tree?

How many regions were created by the tree algorithm?

Use the tree and graphically (i.e., going down the tree branches) predict the __medv__ for a neighborhood where lstat is 15.

Does the tree evidence any relationship (positive or negative) between lstat and medv? Where do you see that relationship in the tree?

In [None]:
# There is a method to get the depth of the tree

reg_tree_lstat.get_depth()
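
Going down the branches of a tree can also be mimicked in code. The sketch below uses synthetic data (not the Boston data) and a hypothetical helper `walk_tree`, and it assumes a tree with a single predictor. It follows the thresholds from the root to a leaf and compares the result with predict().

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 40, size=200).reshape(-1, 1)            # synthetic predictor
y_demo = 35 - 0.8 * x_demo[:, 0] + rng.normal(0, 2, size=200)   # synthetic response

demo_tree = DecisionTreeRegressor(min_samples_leaf=0.1, random_state=1).fit(x_demo, y_demo)

def walk_tree(fitted_tree, x_value):
    """Follow the splits from the root to a leaf and return the mean of that leaf."""
    t = fitted_tree.tree_
    node = 0                                  # the root is always node 0
    while t.children_left[node] != -1:        # -1 means that the node is a leaf
        if x_value <= t.threshold[node]:
            node = t.children_left[node]      # condition is true: go left
        else:
            node = t.children_right[node]     # condition is false: go right
    return t.value[node][0][0]

manual = walk_tree(demo_tree, 15.0)
automatic = demo_tree.predict(np.array([[15.0]]))[0]
print(manual, automatic)
```

Both numbers should agree, because predict() does the same walk internally: observations that satisfy the condition of a node go to the left branch, the others go to the right.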

<br>

Let's plot the lines for the three models (simple reg, poly reg, and the tree) to see if we can graphically observe the reason why the RT does better than the linear regression and seemingly even better than the poly of second degree!

In [None]:
# Note: from Matplotlib 3.6 on, this style is named 'seaborn-v0_8'
plt.style.use('seaborn')

plt.scatter(boston_df['LSTAT'], boston_df['medv'],c='grey',marker='o')

plt.xlabel("Lstat")

plt.ylabel("Medv")


plt.plot(boston_df['LSTAT'], simple_reg_object.predict(X), c='red', ls='-', label='Linear Model')


# Sort LSTAT once so that the tree and the poly curves are drawn from left to right
lstat_sorted= np.sort(boston_df['LSTAT'].values)

plt.plot(lstat_sorted, reg_tree_lstat.predict(lstat_sorted.reshape(-1,1)), c='black', ls='-', label='Reg Tree')


# poly2_reg_object is already fitted, so we only transform the sorted values and predict
plt.plot(lstat_sorted, poly2_reg_object.predict(poly2_object.transform(lstat_sorted.reshape(-1,1))), c='green', ls='-', label='Poly 2nd degree')


plt.legend()


# plt.legend() will show a legend that correctly identifies each model because 
# the parameter 'label' was passed to the plot of each model

plt.show()

It is clear from the scatterplot that both the tree and the second-degree poly fit the data better than the linear equation.

However, I am not fully convinced that the tree does better than the second-degree poly as the adjusted R squared suggested.

Why don't we estimate the test prediction error for both the tree and the second-degree poly and compare them?

Let's do this using CV

In [None]:
from sklearn.model_selection import cross_val_score

Poly of second degree. Getting the CV error for each of the ten iterations

In [None]:
# cross_val_score clones and refits the model on each fold, so no previous call to fit() is needed
cv_values=-1*cross_val_score(poly2_reg_object, X_poly2, y, scoring= 'neg_mean_squared_error', cv=10)
cv_values

Mean CV error = Mean test MSE ~ Estimation of test prediction error for Poly of second degree

In [None]:
np.mean(cv_values)

Regression Tree. Getting the CV error for each of the ten iterations

In [None]:
# cross_val_score clones and refits the model on each fold, so no previous call to fit() is needed
cv_values2=-1*cross_val_score(reg_tree_lstat, X, y, scoring= 'neg_mean_squared_error', cv=10)
cv_values2

Mean CV error = Mean test MSE ~ Estimation of test prediction error for Reg Tree

In [None]:
np.mean(cv_values2)

### A better CV implementation that shuffles the data before splitting it into K folds

### REVIEW INDEPENDENTLY AT HOME!

In [None]:
from sklearn.model_selection import KFold
import statsmodels.formula.api as smf

In [None]:
k10fold=KFold(n_splits=10, shuffle=True, random_state= 1)

In [None]:
indexes= np.arange(len(boston_df['medv']))

CV application for the second degree poly

In [None]:
cv_scores=np.empty(shape=10)

In [None]:
i=0
for train_index, test_index in k10fold.split(indexes):
    regression_model=smf.ols('medv~LSTAT+I(LSTAT**2)', data=boston_df.iloc[train_index,]).fit()
    predictions=regression_model.predict(boston_df['LSTAT'][test_index])
    # The next line computes the test Mean Squared Error for each iteration
    cv_scores[i]=sum((boston_df['medv'][test_index] -predictions)**2)/(test_index.size)
    i=i+1

In [None]:
np.mean (cv_scores)

CV application for the Regression Tree

In [None]:
cv_scores2=np.empty(shape=10)

In [None]:
i=0
for train_index, test_index in k10fold.split(indexes):
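    # Fit a fresh tree on each training fold so that reg_tree_lstat (fitted on all the data) is not overwritten
    regression_model=DecisionTreeRegressor(min_samples_leaf=0.1, random_state=1).fit(X[train_index], y[train_index])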
    predictions=regression_model.predict(X[test_index])
    # The next line computes the test Mean Squared Error for each iteration
    cv_scores2[i]=sum((y[test_index] -predictions)**2)/(test_index.size)
    i=i+1

In [None]:
np.mean (cv_scores2)
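
The manual loops above can be sketched more compactly by passing the KFold object directly to cross_val_score through its cv argument. The example uses synthetic data (an assumption, so that it runs on its own) rather than the Boston data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 40, size=300).reshape(-1, 1)            # synthetic predictor
y_demo = 35 - 0.8 * x_demo[:, 0] + rng.normal(0, 2, size=300)   # synthetic response

# Shuffled 10-fold splitter, same settings as k10fold
shuffled_folds = KFold(n_splits=10, shuffle=True, random_state=1)
demo_tree = DecisionTreeRegressor(min_samples_leaf=0.1, random_state=1)

# cross_val_score clones the estimator for each fold, so demo_tree itself stays unfitted
fold_mse = -cross_val_score(demo_tree, x_demo, y_demo,
                            scoring="neg_mean_squared_error", cv=shuffled_folds)
print(fold_mse)
print(fold_mse.mean())
```

This gives one test MSE per fold, just like the loop, with the advantage that the original model object is never modified.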

### START HERE ON TUESDAY 10-04 !!!

Share the methods I found with students:

get_n_leaves()

get_depth()

### Regression Tree to predict 'medv' using all the predictors in the Boston dataset

Now we are going to learn how to apply RT in a more refined way:

a) Considering multiple predictors

b) Following a __conventional selection methodology__ used to estimate a good regression tree. This selection methodology applies a __pre-pruning strategy__ to grow the tree.

__Pre-pruning__: One or more stopping criteria applied to the tree to prevent it from learning the training set without error. 

__Important__: In ML, when you hear that someone is pruning a tree, that usually refers to the application of a __post-pruning strategy__, a topic we will cover in the next session.

What is the __conventional selection methodology__ used to obtain a good regression tree?

Check out the scikit-learn documentation for a graphical representation of this:

https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation


1) The data is split into training data and test data (for ex, 80% training and 20% test). 

The training data is used AS IF IT WERE ALL THE AVAILABLE DATA and CV is applied on it. CV is applied with the purpose of tuning the hyperparameters of the algorithm and finding a good Tree. 

By tuning the hyperparameters of the algorithm I mean finding a good combination of hyperparameters (i.e., choosing max_depth, min_samples_split, min_impurity_decrease, etc).


2) The good combination of hyperparameters found after applying CV is used to estimate a Tree using the training data.

3) Finally, the performance of this Tree is evaluated on the testing data.

1) The data is split into training data and test data (next cells):

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test= train_test_split (boston_df.iloc[:,:-1],boston_df['medv'], test_size=0.2, random_state=1)

Now, we tune the hyperparameters of the regression tree algorithm by using CV

This is an INTENSE application of CV because __we need to apply CV for each combination of hyperparameters__... and there are many combinations of hyperparameters.

There is a class from scikit-learn called 'GridSearchCV' that facilitates the application of CV in this case.

In [None]:
from sklearn.model_selection import GridSearchCV

The first step before applying GridSearchCV() is to create a dictionary with a range of values for each hyperparameter (next cell):

In [None]:
hyperparam_grid = {
    'max_depth': np.arange(2,11), # testing depth from 2 to 10
    'min_samples_split':[0.1, 0.15, 0.2],
    'min_samples_leaf':[0.05, 0.1, 0.15], 
    'min_impurity_decrease': [0, 0.0005, 0.001, 0.005, 0.01, 0.05]
}
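
To get a feel for how many fits this grid implies, we can count the combinations with ParameterGrid from scikit-learn. This is a sketch that copies the grid into a new variable (`demo_grid`) and assumes five folds, as in the GridSearchCV call used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "max_depth": np.arange(2, 11),                                   # 9 values
    "min_samples_split": [0.1, 0.15, 0.2],                           # 3 values
    "min_samples_leaf": [0.05, 0.1, 0.15],                           # 3 values
    "min_impurity_decrease": [0, 0.0005, 0.001, 0.005, 0.01, 0.05],  # 6 values
}

n_combinations = len(ParameterGrid(demo_grid))   # 9 * 3 * 3 * 6
n_folds = 5
n_fits = n_combinations * n_folds                # one tree per combination and fold
print(n_combinations, n_fits)
```

That is 486 combinations and 2430 trees to fit during the search, which is why a coarse first grid followed by a narrower second grid can save time.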

One recommendation (just a recommendation !!!) is to apply GridSearchCV() two times: a first time to narrow down the possible good values of the hyperparameters and the second time to choose the final values of the hyperparameters to use in building the tree.

In [None]:
gridSearch1 = GridSearchCV(DecisionTreeRegressor(), hyperparam_grid, cv=5, scoring='neg_mean_squared_error')

In [None]:
gridSearch1.fit(X_train, y_train)

In [None]:
print('Initial parameters: ', gridSearch1.best_params_)

In [None]:
hyperparam_grid2 = {
    'max_depth': [3,4,5,6,7],  
    'min_samples_split': [0.08, 0.1, 0.12], 
    'min_samples_leaf': [0.02, 0.05, 0.08],   
    'min_impurity_decrease': [0.001, 0.005, 0.01]
}

In [None]:
gridSearch2 = GridSearchCV(DecisionTreeRegressor(), hyperparam_grid2, cv=5,scoring='neg_mean_squared_error')

In [None]:
gridSearch2.fit(X_train, y_train)

In [None]:
print('Improved parameters: ', gridSearch2.best_params_)

Now, we need to build the tree with these hyperparameter values (or a combination of these values and those we found in the previous step)

In [None]:
reg_tree_multiple_boston= DecisionTreeRegressor(max_depth= 7, min_samples_split= 0.08, min_samples_leaf= 0.02, min_impurity_decrease= 0.001, random_state=1)

In [None]:
# Fit the tree on the training data using the previous hyperparameters

reg_tree_multiple_boston.fit(X_train, y_train)

In [None]:
# Now, we evaluate the prediction performance of 'reg_tree_multiple_boston' on the test data

mean_squared_error( y_test, reg_tree_multiple_boston.predict (X_test))
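
As an alternative to typing the hyperparameter values by hand, the winning combination can be taken directly from the search object. The sketch below uses synthetic data and a deliberately small grid (both are assumptions made so that it runs quickly on its own). It relies on GridSearchCV's default refit=True, which refits the best combination on the whole training data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(300, 3))                                 # synthetic predictors
y_demo = 5 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 1, size=300)  # synthetic response

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    {"max_depth": [2, 3, 4], "min_samples_leaf": [0.02, 0.05]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

# Option 1: unpack the winning combination into a new tree and fit it on the training data
rebuilt_tree = DecisionTreeRegressor(random_state=1, **search.best_params_).fit(X_tr, y_tr)

# Option 2: with refit=True (the default), the search has already fitted that tree
refit_tree = search.best_estimator_

mse_rebuilt = mean_squared_error(y_te, rebuilt_tree.predict(X_te))
mse_refit = mean_squared_error(y_te, refit_tree.predict(X_te))
print(search.best_params_, mse_rebuilt, mse_refit)
```

Both options should give the same test MSE here, because they use the same hyperparameters, the same training data, and the same random_state.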

In [None]:
# Plot the tree

plt.figure(figsize=(25,20))   
tree.plot_tree(reg_tree_multiple_boston,filled=True, rounded= True, feature_names=X_train.columns, fontsize=12)
plt.show()

In [None]:
# What predictors were selected to be in the tree?

boston_df.iloc[:,:-1].columns [reg_tree_multiple_boston.feature_importances_!=0]

<br>

__Finally__, just check to see if the hyperparameters are actually being used when building the tree:

Max_depth was 7. Is this true in the tree?

min_samples_leaf was 2%. Is this true in the tree?

THINK ABOUT THIS FOR 2 MINUTES !!!
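
One possible way to check this in code, sketched on synthetic data rather than on the Boston tree: the depth comes from get_depth(), and the size of each leaf comes from the array tree_.n_node_samples at the positions where children_left is -1. All the variable names below are hypothetical.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(400, 3))                                 # synthetic predictors
y_demo = 5 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 1, size=400)  # synthetic response

leaf_fraction = 0.02
demo_tree = DecisionTreeRegressor(max_depth=7, min_samples_leaf=leaf_fraction,
                                  random_state=1).fit(X_demo, y_demo)

is_leaf = demo_tree.tree_.children_left == -1           # True at the leaf nodes
leaf_sizes = demo_tree.tree_.n_node_samples[is_leaf]    # training samples in each leaf
required = math.ceil(leaf_fraction * X_demo.shape[0])   # 2% of the sample, rounded up

print("depth:", demo_tree.get_depth())
print("smallest leaf:", leaf_sizes.min(), "required minimum:", required)
```

The depth can end up below max_depth when another stopping criterion kicks in first, so max_depth is an upper bound and not a target.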