### Tree pruning with cross-validation

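Having worked through a bootstrapping example, we'll now apply the same resampling ideas to choosing alpha, the cost-complexity pruning parameter (`ccp_alpha`). Cross-validation can be used to tune any hyperparameter or set of hyperparameters. Scoring each candidate value on several train/validation splits gives a more reliable estimate of the optimum, at the expense of computation time, because the model is refit once per fold for every candidate.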
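To make the "any hyperparameter" claim concrete, here is a minimal sketch that cross-validates a different setting, `max_depth`, using scikit-learn's `cross_val_score`. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, and `depth_candidates` are assumptions for illustration only; they are not part of the housing example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)

depth_candidates = [2, 4, 6, 8]  # hypothetical grid of max_depth values
mean_scores = []
for depth in depth_candidates:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # One R^2 score per fold; the mean summarises this candidate.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=3)
    mean_scores.append(fold_scores.mean())

# Keep the candidate with the highest mean validation score.
best_depth = depth_candidates[int(np.argmax(mean_scores))]
print(f'Best max_depth on the demo data: {best_depth}')
```

The cost is visible in the loop: with 4 candidates and 3 folds, the tree is fit 12 times instead of once.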

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from itertools import product
from multiprocessing import Pool
from os import cpu_count
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

In [2]:
df = pd.read_csv('/home/briggsc1-erau.edu/Downloads/housing.csv')
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income']
target = ['median_house_value']
df = df.dropna(subset = features+target)

In [3]:
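# test_size = 0.75 holds out 75% of the rows for testing; cross-validation runs on the remaining 25%
x_tr,x_te,y_tr,y_te = train_test_split(df[features],df[target],
                                       test_size = 0.75)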

To use multiprocessing, we need to write a function that performs the distributed task. Our distributed task will be fitting a tree to some training data for a given value of alpha. We'll pass the training data and alpha in as a single tuple, because `Pool.map` calls the function with exactly one argument per task.
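The single-argument convention can be sketched without any model fitting. In the example below, `describe` is a hypothetical worker that unpacks one tuple, and `itertools.product` builds one tuple per alpha while repeating the same data. The built-in `map` is used here so the sketch runs anywhere; `Pool.map` calls the worker in the same one-argument way.

```python
from itertools import product

def describe(inpt):
    # Unpack the single tuple argument into its three parts.
    x, y, alpha = inpt
    return f'x={x}, y={y}, alpha={alpha}'

x_demo = 'X'                   # stand-in for the training features
y_demo = 'Y'                   # stand-in for the training targets
alphas_demo = [0.0, 0.1, 0.5]  # hypothetical alpha values

# Wrapping x_demo and y_demo in one-element lists makes product()
# repeat them unchanged for every alpha.
tasks = list(product([x_demo], [y_demo], alphas_demo))
results = list(map(describe, tasks))

for line in results:
    print(line)
```

Each element of `tasks` is a complete, self-contained job description, which is what lets the jobs be handed to separate worker processes.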

In [4]:
def fit_tree(inpt):
    # inpt is a single (features, targets, alpha) tuple, as required by Pool.map
    x_tra,y_tra,alpha = inpt
    tr = DecisionTreeRegressor(ccp_alpha=alpha)
    tr.fit(x_tra, y_tra)
    return tr

In [5]:
kf = KFold(n_splits=3) # 3-fold cross-validation on the training set

dfs_acc = [] # a list to store our df_acc dataframe for each split

optimum_alphas = [] # a list to store the best alpha value for each split.
for train_index, val_index in kf.split(x_tr): # this loop is over the cross-val splits
    x_tra = x_tr.iloc[train_index]
    x_val = x_tr.iloc[val_index]
    y_tra = y_tr.iloc[train_index]
    y_val = y_tr.iloc[val_index]
    
    tr = DecisionTreeRegressor()
    path = tr.cost_complexity_pruning_path(x_tra, y_tra) # we must compute the ccp_alphas for each split
    ccp_alphas, impurities = path.ccp_alphas, path.impurities
    
    inpt = product([x_tra],[y_tra],ccp_alphas) # one (x_tra, y_tra, alpha) tuple per candidate alpha
    
    cores = cpu_count() or 2 # cpu_count() can return None; fall back to 2
    
    with Pool(processes = max(1, cores-1)) as p: # leave one core free, but always use at least one worker
        trees = p.map(fit_tree,inpt)

    data = [] # data for a dataframe showing the scores and attributes of each tree
    for tr in trees:
        alpha = tr.ccp_alpha
        # score() returns R^2 for a regressor; it is labelled 'acc' here for brevity
        acc_tr = tr.score(x_tra,y_tra)
        acc_va = tr.score(x_val,y_val)
        n_leaves = tr.get_n_leaves()
        depth = tr.get_depth()
        data.append({'alpha':alpha,'depth':depth,'n_leaves':n_leaves,
                     'acc_tr':acc_tr,'acc_va':acc_va})
    df_acc = pd.DataFrame(data)
    dfs_acc.append(df_acc)
    best_idx = df_acc['acc_va'].idxmax() # find the row with the best accuracy on the validation set
    best_row = df_acc.loc[best_idx]
    optimum_alphas.append(best_row['alpha'])
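The loop above relies on `cost_complexity_pruning_path` to supply the candidate alphas for each split. As a rough illustration of what that path contains, the sketch below computes it on synthetic data and refits at three alphas taken from the start, middle, and end of the path. The data and the names `X_demo`, `y_demo`, and `picks` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)

base = DecisionTreeRegressor(random_state=0)
path = base.cost_complexity_pruning_path(X_demo, y_demo)
alphas = path.ccp_alphas  # one effective alpha per pruning step

# Refit at the first, middle, and last alpha on the path.
picks = [alphas[0], alphas[len(alphas) // 2], alphas[-1]]
leaves = []
for a in picks:
    tree = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
    tree.fit(X_demo, y_demo)
    leaves.append(tree.get_n_leaves())

for a, n in zip(picks, leaves):
    print(f'alpha={a:.4g} -> {n} leaves')
```

Larger alphas prune more aggressively, so the leaf count falls along the path. Because the path depends on the data it was computed from, each fold gets its own set of candidate alphas.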

In [6]:
# assemble the dfs_acc into a single dataframe with a fold column
for i,df_acc in enumerate(dfs_acc):
    df_acc['fold'] = i
df_acc = pd.concat(dfs_acc)

In [7]:
df_acc

| | alpha | depth | n_leaves | acc_tr | acc_va | fold |
|---|---|---|---|---|---|---|
| 0 | 0.000000e+00 | 27 | 3288 | 1.000000 | 5.461439e-01 | 0 |
| 1 | 1.468429e-04 | 27 | 3284 | 1.000000 | 5.446578e-01 | 0 |
| 2 | 1.468429e-04 | 27 | 3284 | 1.000000 | 5.400846e-01 | 0 |
| 3 | 1.468429e-04 | 27 | 3284 | 1.000000 | 5.549033e-01 | 0 |
| 4 | 2.349540e-04 | 27 | 3284 | 1.000000 | 5.600465e-01 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3026 | 2.048820e+08 | 3 | 5 | 0.486657 | 4.405228e-01 | 2 |
| 3027 | 2.349431e+08 | 2 | 4 | 0.469352 | 4.227887e-01 | 2 |
| 3028 | 8.412136e+08 | 2 | 3 | 0.407391 | 3.596662e-01 | 2 |
| 3029 | 9.826828e+08 | 1 | 2 | 0.335010 | 2.725712e-01 | 2 |


In [8]:
best_alpha = np.mean(optimum_alphas) # average the per-fold optimum alphas
print(f'The optimum value of alpha is {round(best_alpha)}.')

The optimum value of alpha is 21678995.


In [10]:
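# refit on the full training set with the chosen alpha, then score on the held-out test set
tr = DecisionTreeRegressor(ccp_alpha = best_alpha,random_state = 0)
tr.fit(x_tr,y_tr)
test_score = tr.score(x_te,y_te) # R^2 on data not used for choosing alpha
print(f'The optimum tree has accuracy {round(test_score,3)} on the test set. It has a depth of {tr.get_depth()}.')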

The optimum tree has accuracy 0.66 on the test set. It has a depth of 10.
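For comparison, a similar search could be sketched with scikit-learn's built-in `GridSearchCV`. This is a swapped-in technique and not the procedure used above: it evaluates one shared grid of alphas on every fold and picks the alpha with the best mean validation score, whereas the notebook computes a separate pruning path per fold and averages the per-fold optima. The synthetic data and the names `X_demo`, `y_demo`, and `candidates` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=15.0, random_state=0)

# Build one shared grid from the pruning path of the full demo set,
# clipped at zero and thinned to every 10th value to limit refits.
base = DecisionTreeRegressor(random_state=0)
path = base.cost_complexity_pruning_path(X_demo, y_demo)
candidates = np.unique(np.clip(path.ccp_alphas, 0.0, None))[::10]

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={'ccp_alpha': candidates},
                      cv=3)
search.fit(X_demo, y_demo)

best_alpha_demo = search.best_params_['ccp_alpha']
print(f'Best alpha on the demo data: {best_alpha_demo:.4g}')
print(f'Leaves in the refit tree: {search.best_estimator_.get_n_leaves()}')
```

`GridSearchCV` also accepts an `n_jobs` argument for parallel fitting, which covers the role `Pool` plays in the manual loop.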
