In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

In [None]:
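# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
boston_data=datasets.load_boston()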

In [None]:
boston_df=pd.DataFrame(boston_data.data, columns= boston_data.feature_names)

In [None]:
boston_df['medv']= boston_data.target

In [None]:
from sklearn.model_selection import train_test_split

from sklearn import tree

from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error

### Post-pruning approach 1: selecting alpha via the validation set approach

In [None]:
X_train, X_test, y_train, y_test= train_test_split (boston_df.iloc[:,:-1],boston_df['medv'], test_size=0.2, random_state=1)

In [None]:
reg_tree_boston_unprunned= DecisionTreeRegressor(random_state=1)

In [None]:
path= reg_tree_boston_unprunned.cost_complexity_pruning_path(X_train, y_train)

In [None]:
alphas= path['ccp_alphas']
alphas

Let's fit one tree per alpha on the training data and compute its MSE on the test data, which plays the role of the validation set here.

Each alpha will therefore have a test MSE associated with it.

Our goal is to choose the alpha that leads to the lowest test MSE.
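As a standalone illustration of this procedure, here is a minimal sketch on synthetic data (an assumption: `make_regression` stands in for the Boston data, and all names ending in `_demo` are hypothetical). Besides the validation MSE, it records the number of leaves of each tree to show how alpha controls tree size.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Boston data used in this notebook)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Candidate alphas come from the pruning path of the full (unpruned) tree
path_demo = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
# Guard against tiny negative values caused by floating-point error
alphas_demo = np.unique(np.clip(path_demo.ccp_alphas, 0, None))

val_mse, n_leaves = [], []
for alpha in alphas_demo:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=1).fit(X_tr, y_tr)
    val_mse.append(mean_squared_error(y_val, pruned.predict(X_val)))
    n_leaves.append(pruned.get_n_leaves())

# Alpha with the lowest validation MSE
best_alpha_demo = alphas_demo[int(np.argmin(val_mse))]
print(f"{len(alphas_demo)} alphas tried; first tree has {n_leaves[0]} leaves, last has {n_leaves[-1]}")
print(f"best alpha: {best_alpha_demo:.4f}, validation MSE: {min(val_mse):.2f}")
```

Larger alphas penalize complexity more heavily, so the trees should shrink as alpha grows.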

In [None]:
mse_scores=[]
for i in alphas:
    treeloop= DecisionTreeRegressor(ccp_alpha=i,random_state=1)
    treeloop.fit(X_train, y_train)
    y_test_predicted=treeloop.predict(X_test)
    mse_scores.append(mean_squared_error( y_test,y_test_predicted)) 

__PROGRAMMING TIP:__ Initially, I had written the previous for loop like this:

mse_scores=[]

for i in alphas:

    tree= DecisionTreeRegressor(ccp_alpha=i,random_state=1)
    
    tree.fit(X_train, y_train)
    
    y_test_predicted=tree.predict(X_test)
    
    mse_scores.append(mean_squared_error( y_test,y_test_predicted)) 
    

Calling the object __tree__ was a huge mistake because it shadows the __tree__ module imported earlier with `from sklearn import tree`. After the loop runs, the name refers to a DecisionTreeRegressor instead of the module, which prevented the tree from plotting. I was getting the following error when attempting to plot the tree:

'DecisionTreeRegressor' object has no attribute 'plot_tree'

NEVER GIVE AN OBJECT THE SAME NAME AS AN IMPORTED MODULE, A CLASS, OR A BUILT-IN FUNCTION !!!
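The shadowing problem can be reproduced in isolation. The sketch below is illustrative only (the `has_plot_*` variable names are hypothetical): it checks whether the name `tree` exposes `plot_tree` before and after it is rebound to an estimator.

```python
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# `tree` is the scikit-learn module here, so plot_tree is available
has_plot_before = hasattr(tree, "plot_tree")

# Rebinding the name: `tree` now points to an estimator, not the module
tree = DecisionTreeRegressor(random_state=1)
has_plot_after = hasattr(tree, "plot_tree")

# Re-importing binds the name to the module again
from sklearn import tree
has_plot_restored = hasattr(tree, "plot_tree")

print(has_plot_before, has_plot_after, has_plot_restored)  # True False True
```

Re-running the import cell is enough to recover, but choosing a distinct name such as `treeloop` avoids the problem in the first place.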




In [None]:
min(mse_scores)

In [None]:
# Let's use the nice .index() method available for lists!

indexmin=mse_scores.index(min(mse_scores))
indexmin

In [None]:
alphas[indexmin]
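An alternative to the list `.index()` lookup is `np.argmin`, which returns the position of the minimum directly. A minimal sketch with made-up scores (the values in `scores_demo` are hypothetical, not results from this notebook):

```python
import numpy as np

# Hypothetical MSE values, for illustration only
scores_demo = [21.3, 18.9, 14.2, 14.2, 17.5]

idx_list = scores_demo.index(min(scores_demo))  # list-based lookup
idx_np = int(np.argmin(scores_demo))            # NumPy equivalent

# Both return 2, the first of the tied minima
print(idx_list, idx_np)
```

Both approaches return the first position when the minimum is tied. Because the alphas are sorted in increasing order, this picks the smallest alpha, and therefore the largest tree, among the tied candidates.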

Now, let's just obtain the tree with this alpha (the alpha that resulted in the lowest test MSE)

In [None]:
reg_tree_boston_prunned= DecisionTreeRegressor(ccp_alpha= alphas[indexmin], random_state=1)

In [None]:
reg_tree_boston_prunned.fit(X_train, y_train)

As we saw in a previous code cell, the estimated test MSE for this tree was 14.0471... If you want to get it again, do the following:

In [None]:
mean_squared_error( y_test, reg_tree_boston_prunned.predict (X_test))

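This is a very good (low) prediction error. It is much lower than the test MSE of the best tree obtained from pre-pruning (previous notebook). Keep in mind, however, that alpha was selected using this same test set, so this estimate is likely somewhat optimistic.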

In [None]:
# Plot the tree
plt.figure(figsize=(30,20))   
tree.plot_tree(reg_tree_boston_prunned, filled=True, rounded= True, feature_names=X_train.columns, fontsize=12)
plt.show()

In [None]:
# To know what variables are in the tree

boston_df.iloc[:,:-1].columns [reg_tree_boston_prunned.feature_importances_!=0]

### Interpreting the tree (Only if time permits. Otherwise, leave it for students to read at home independently)


- As we already knew from previous examples, lower values of LSTAT are linked to higher values of medv. This can be seen in the color pattern of the leaves: darker leaves (higher values of medv) are more prevalent on the branch to the left of the root node, which corresponds to lower values of LSTAT (LSTAT <= 9.73).


- When LSTAT is low, its influence on medv seems to depend on the values of other predictors. For large values of LSTAT, other predictors do not matter much! In other words, for neighborhoods where the percentage of the population with low socioeconomic status (LSTAT) is above 9.73%, the median house value does not depend MUCH on other factors because it is mostly determined by LSTAT. The only exception is when LSTAT is really high, that is, above 16.09%: there, a value of NOX (nitrogen oxides concentration in the air) less than or equal to 0.6 attenuates the impact of the high LSTAT on the house value.

Note: I say "attenuates" because, for NOX <= 0.6, medv stops its downward trend (medv at the leaf is 17.79, compared to 14.412 at the NOX node).


#### Because the left branch of the tree starting at the root node has many nodes, I think that interpreting this branch is simplified if we write the rules that can be extracted from that branch. Let's write these rules:

<br>

Rule 1: __When LSTAT <= 9.73% AND RM > 7.74 then: predicted medv= $ 44 318__

<br>

Rule 2: __When LSTAT <= 9.73% AND RM <= 7.74 AND DIS <= 1.485 then: predicted medv= $ 50 000__

<br>

Rule 3: When LSTAT <= 9.73% AND __RM <= 7.74__ AND DIS > 1.485 AND __RM > 6.64__ then: predicted medv= $ 31 360

We can simplify this rule a little bit as follows:

__When LSTAT <= 9.73% AND 6.64 < RM <= 7.74 AND DIS > 1.485 then: predicted medv= $ 31 360__



<br>

Rule 4: When LSTAT <= 9.73% AND __RM <= 7.74__ AND DIS > 1.485 AND __RM <= 6.64__ then: predicted medv= $ 23 746

We can simplify this rule a little bit as follows:

__When LSTAT <= 9.73% AND RM <= 6.64 AND DIS > 1.485 then: predicted medv= $ 23 746__


<br>

Some insights from these rules (more insights are possible):

In neighborhoods with low LSTAT, the average number of rooms per house (RM) plays an important role in determining the median house value.


When LSTAT is low and RM is above 7.74, the median house value in the neighborhood is quite high ($ 44 318).


When LSTAT is low, even if RM is less than or equal to 7.74, there is still a situation where the median house value can be even higher than in the previous case: when DIS (the weighted distance to five Boston employment centers) is low (at most 1.485). Under these three conditions, the median house value in the neighborhood is very high ($ 50 000).
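Rules like the ones written by hand above can also be printed programmatically with `sklearn.tree.export_text`, which is a convenient way to cross-check thresholds. A minimal sketch on synthetic data (an assumption: the data and the feature names `f0`, `f1`, `f2` are hypothetical stand-ins):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=1)
feature_names_demo = ["f0", "f1", "f2"]

# A shallow tree keeps the printed rule set short
small_tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X_demo, y_demo)

# Each root-to-leaf path in the printout corresponds to one rule
rules_text = export_text(small_tree, feature_names=feature_names_demo)
print(rules_text)
```

In this notebook, the equivalent call would presumably be `export_text(reg_tree_boston_prunned, feature_names=list(X_train.columns))`.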

### Post-pruning approach 2: selecting alpha via CV 

WORK IN PROGRESS !!!
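Until this section is written, here is one possible sketch of the idea, not necessarily the approach the notebook will end up using. It assumes `GridSearchCV` with 5-fold CV over alphas taken from the pruning path, runs on synthetic data, and uses hypothetical names throughout. Unlike approach 1, the test set is not touched while choosing alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Candidate alphas from the pruning path, clipped to guard against float error
path_demo = DecisionTreeRegressor(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
alphas_demo = np.unique(np.clip(path_demo.ccp_alphas, 0, None))
# Thin the grid to keep the search quick (an arbitrary choice)
alphas_grid = alphas_demo[::5].tolist()

# 5-fold CV on the training data only; the scorer is negated MSE, so higher is better
search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"ccp_alpha": alphas_grid},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_tr, y_tr)

best_alpha_cv = search.best_params_["ccp_alpha"]
cv_mse = -search.best_score_
# The best tree is refit on all training data, then evaluated once on the test set
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(f"best alpha: {best_alpha_cv:.4f}, CV MSE: {cv_mse:.2f}, test MSE: {test_mse:.2f}")
```

Because the test set plays no part in selecting alpha here, the final test MSE is a more honest estimate of prediction error than the one obtained in approach 1.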