# DECISION TREE HYPER-PARAMETERS. TUNING DECISION TREES

- **max_depth : int or None, optional (default=None)**
    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None.
    
- **min_samples_split : int, optional (default=2)**
    The minimum number of samples required to split an internal node.

- There are more hyper-parameters: 
  - help("sklearn.tree.DecisionTreeRegressor")
  - help("sklearn.tree.DecisionTreeClassifier")
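
As a rough illustration of how these two hyper-parameters constrain tree growth, the sketch below fits three regressors and compares their depth and number of leaves. The synthetic data and the names `X_demo`, `y_demo`, `full_tree`, `shallow_tree` and `coarse_tree` are assumptions made for this example only; they are not part of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this sketch only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# Unconstrained tree: grows until leaves are pure or too small to split
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# max_depth limits the number of levels in the tree
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# min_samples_split forbids splitting nodes that hold fewer than 40 samples
coarse_tree = DecisionTreeRegressor(min_samples_split=40, random_state=0).fit(X_demo, y_demo)

for name, model in [("default", full_tree),
                    ("max_depth=3", shallow_tree),
                    ("min_samples_split=40", coarse_tree)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

Both constraints should yield a smaller tree than the default settings, which is why they are natural candidates for tuning.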



First, the data is loaded: inputs go to X and outputs to y.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older version
from sklearn.datasets import load_boston
from sklearn import tree
from scipy.stats import sem

boston = load_boston()
X = boston.data
y = boston.target




## COMBINING HYPER-PARAMETER TUNING AND MODEL EVALUATION

The combination of model evaluation and hyper-parameter tuning can be understood as two nested loops. The external loop trains a model and tests it. The internal loop is the training process itself, which consists of looking for the best hyper-parameters and then fitting the model with those best hyper-parameters.

First, we are going to use **Holdout** (train/test) for model evaluation (external loop), and **3-fold crossvalidation** for hyper-parameter tuning (internal loop). Hyper-parameters will be adjusted with **Gridsearch**.
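
The two loops can be sketched in plain Python before turning to scikit-learn. To keep the example dependency-free, the sketch below swaps the decision tree for a toy k-nearest-neighbour regressor whose only hyper-parameter is `k`; the data, the helper functions and the grid of `k` values are all assumptions made for illustration.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Toy model: average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(t for _, t in nearest)

def mse(train, test, k):
    """Mean squared error on `test` of the toy model 'trained' on `train`."""
    return mean((knn_predict(train, x, k) - t) ** 2 for x, t in test)

def inner_cv_score(train, k, n_folds=3):
    """Internal loop: n-fold crossvalidation using the training partition only."""
    folds = [train[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        valid = folds[i]
        fit = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(fit, valid, k))
    return mean(scores)

# Toy data: y = x^2 plus Gaussian noise
rng = random.Random(0)
data = []
for _ in range(90):
    x = rng.uniform(-3, 3)
    data.append((x, x * x + rng.gauss(0, 0.5)))

# External loop (holdout): one third of the data is kept aside for testing
rng.shuffle(data)
train, test = data[:60], data[60:]

# Internal loop: grid search over k, scored by 3-fold crossvalidation on train
grid = [1, 3, 5, 9, 15]
cv_scores = {k: inner_cv_score(train, k) for k in grid}
best_k = min(cv_scores, key=cv_scores.get)

# The final model uses best_k and the whole training partition,
# and the test partition is touched only once, at the very end
test_mse = mse(train, test, best_k)
print("best k:", best_k, "inner CV MSE:", cv_scores[best_k], "test MSE:", test_mse)
```

The key point is that the test partition never takes part in choosing `k`, so the test MSE remains an honest estimate of performance.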

#### GRIDSEARCH

In [2]:
from sklearn.model_selection import train_test_split

# Random_state = 0 for reproducibility
# Holdout for model evaluation. 33% of available data for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [63]:
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# Search space
param_grid = {'max_depth': list(range(2,16,2)),
              'min_samples_split': list(range(2,16,2))}

# random.seed = 0 for reproducibility
np.random.seed(0)
clf = GridSearchCV(tree.DecisionTreeRegressor(), 
                   param_grid,
                   scoring='neg_mean_squared_error',
                   cv=3, 
                   n_jobs=1, verbose=1)

clf.fit(X=X_train, y=y_train)

# At this point, clf contains the model with the best hyper-parameters found by gridsearch
# and trained on the complete X_train

# Now, the performance of clf is computed on the test partition

y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, y_test_pred))


Fitting 3 folds for each of 49 candidates, totalling 147 fits
23.621531765022485


[Parallel(n_jobs=1)]: Done 147 out of 147 | elapsed:    0.1s finished


Let's see the best hyper-parameters and their score (MSE)

In [64]:
clf.best_params_, -clf.best_score_

({'max_depth': 12, 'min_samples_split': 10}, 21.63609319641027)
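
The figures in the log ("49 candidates", "147 fits") follow directly from the size of the grid. A minimal sketch, using only the standard library rather than scikit-learn's own machinery, enumerates the candidates; the names `candidates` and `n_folds` are introduced here for illustration.

```python
from itertools import product

# Same search space as in the cell above
param_grid = {'max_depth': list(range(2, 16, 2)),
              'min_samples_split': list(range(2, 16, 2))}

# Every combination of values is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_folds = 3
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
print("first:", candidates[0], "last:", candidates[-1])
```

Each hyper-parameter takes 7 values (2, 4, ..., 14), so there are 7 x 7 = 49 candidates, and each one is fitted once per fold. With the default `refit=True`, one further fit on the complete training partition produces the final model.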

If more control is needed when splitting data for grid search, we can use **KFold** to generate the crossvalidation iterator.

In [65]:
from sklearn.model_selection import KFold
# random_state=0 for reproducibility
cv_grid = KFold(n_splits=3, shuffle=True, random_state=0)
cv_grid

KFold(n_splits=3, random_state=0, shuffle=True)
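
To see what such an iterator produces, the sketch below splits nine toy samples with and without shuffling. The array `X_demo` and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(9).reshape(-1, 1)  # nine toy samples (illustrative)

plain = KFold(n_splits=3)  # no shuffling: folds are consecutive blocks
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

# Collect the indices that fall in the validation part of each fold
plain_folds = [test_idx.tolist() for _, test_idx in plain.split(X_demo)]
shuffled_folds = [test_idx.tolist() for _, test_idx in shuffled.split(X_demo)]

print("without shuffle:", plain_folds)
print("with shuffle:   ", shuffled_folds)
```

Without shuffling, the folds follow the order of the data, which can matter when the rows are sorted in some meaningful way.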

And now we can do hyper-parameter tuning with 3-fold crossvalidation again. Results differ from the previous run because the folds are different: with cv=3, a regressor is split by KFold without shuffling, whereas cv_grid shuffles the data before splitting.

In [66]:
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# Search space
param_grid = {'max_depth': list(range(2,16,2)),
              'min_samples_split': list(range(2,16,2))}

# random.seed = 0 for reproducibility
np.random.seed(0)
clf = GridSearchCV(tree.DecisionTreeRegressor(), 
                   param_grid,
                   scoring='neg_mean_squared_error',
                   cv=cv_grid, 
                   n_jobs=1, verbose=1)

clf.fit(X=X_train, y=y_train)

# At this point, clf contains the model with the best hyper-parameters found by gridsearch
# and trained on the complete X_train

# Now, the performance of clf is computed on the test partition

y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, y_test_pred))

Fitting 3 folds for each of 49 candidates, totalling 147 fits
20.845882154306768


[Parallel(n_jobs=1)]: Done 147 out of 147 | elapsed:    0.1s finished


#### RANDOMIZED SEARCH

Now, let's use **Randomized Search** instead of grid search. Only 20 hyper-parameter value combinations will be tried (budget=20).

In [67]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

# Search space
param_grid = {'max_depth': list(range(2,16,2)),
              'min_samples_split': list(range(2,16,2))}

budget = 20
# random.seed = 0 for reproducibility
np.random.seed(0)
clf = RandomizedSearchCV(tree.DecisionTreeRegressor(), 
                         param_grid,
                         scoring='neg_mean_squared_error',
                         cv=cv_grid, 
                         n_jobs=1, verbose=1,
                         n_iter=budget
                        )

clf.fit(X=X_train, y=y_train)

# At this point, clf contains the model with the best hyper-parameters found by randomized search
# and trained on the complete X_train

# Now, the performance of clf is computed on the test partition

y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, y_test_pred))



Fitting 3 folds for each of 20 candidates, totalling 60 fits
22.35062888533451


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
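
The idea behind the budget can be sketched in plain Python: out of the 49 combinations in the grid, only 20 are drawn at random and evaluated. This is an illustration of the principle, not scikit-learn's internal code; `random.sample` is used because, when every entry of the search space is a list, scikit-learn samples combinations without replacement.

```python
import random
from itertools import product

param_grid = {'max_depth': list(range(2, 16, 2)),
              'min_samples_split': list(range(2, 16, 2))}

# All 49 combinations of the grid
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

budget = 20
rng = random.Random(0)  # fixed seed for reproducibility
sampled = rng.sample(candidates, budget)  # drawn without replacement

print(len(sampled), "of", len(candidates), "candidates tried")
print(sampled[:3])
```

The cost drops from 147 fits to 60, at the price of possibly missing the best combination in the grid.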


For **Randomized Search**, we can define the search space with statistical distributions, rather than using particular values as we did before. Below you can see how to use a uniform distribution on the integers from 2 to 15 by means of *randint* (the upper bound, 16, is exclusive). For continuous hyper-parameters we could use continuous distributions such as *uniform* or *expon* (exponential).

In [68]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics


from scipy.stats import uniform, expon
from scipy.stats import randint as sp_randint

# Search space with integer uniform distributions
param_grid = {'max_depth': sp_randint(2,16),
              'min_samples_split': sp_randint(2,16)}

budget = 20
# random.seed = 0 for reproducibility
np.random.seed(0)
clf = RandomizedSearchCV(tree.DecisionTreeRegressor(), 
                         param_grid,
                         scoring='neg_mean_squared_error',
                         cv=cv_grid, 
                         n_jobs=1, verbose=1,
                         n_iter=budget
                        )

clf.fit(X=X_train, y=y_train)

# At this point, clf contains the model with the best hyper-parameters found by randomized search
# and trained on the complete X_train

# Now, the performance of clf is computed on the test partition

y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, y_test_pred))



Fitting 3 folds for each of 20 candidates, totalling 60 fits
22.250927406129268


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
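
The range covered by each distribution can be checked by sampling from it directly. The sketch below draws from `randint(2, 16)` and, as an example of a continuous distribution, from `uniform(loc=0.0, scale=0.5)`; the continuous range is an arbitrary choice for illustration and is not tied to any hyper-parameter used in this notebook.

```python
from scipy.stats import randint, uniform

# Discrete uniform distribution: low is inclusive, high is exclusive
int_dist = randint(2, 16)
int_samples = int_dist.rvs(size=1000, random_state=0)
print("randint range:", int_samples.min(), "to", int_samples.max())

# Continuous uniform distribution on [loc, loc + scale]
cont_dist = uniform(loc=0.0, scale=0.5)
cont_samples = cont_dist.rvs(size=1000, random_state=0)
print("uniform range:", cont_samples.min(), "to", cont_samples.max())
```

Note that `randint(2, 16)` never returns 16, so it covers the integers 2 to 15, whereas the list `range(2, 16, 2)` used earlier only contains even values up to 14.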


What if we wanted to do **model evaluation with 5-fold crossvalidation** and **hyper-parameter tuning with 3-fold crossvalidation**? This is called nested crossvalidation (https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html). There is an external loop (for evaluating models) and an internal loop (for hyper-parameter tuning).

In [70]:
from sklearn.model_selection import RandomizedSearchCV, KFold, cross_val_score
from sklearn import metrics

# random_state=0 for reproducibility
# Evaluation of model (outer loop)
cv_evaluation = KFold(n_splits=5, shuffle=True, random_state=0)


from scipy.stats import uniform, expon

# Search space
param_grid = {'max_depth': list(range(2,16,2)),
              'min_samples_split': list(range(2,16,2))}

budget = 20
# random.seed = 0 for reproducibility
np.random.seed(0)
# This is the internal 3-fold crossvalidation for hyper-parameter tuning
clf = RandomizedSearchCV(tree.DecisionTreeRegressor(), 
                         param_grid,
                         scoring='neg_mean_squared_error',
                         # 3-fold for hyper-parameter tuning
                         cv=3, 
                         n_jobs=1, verbose=1,
                         n_iter=budget
                        )

# This is the external 5-fold crossvalidation for model evaluation
# Notice that clf is the model resulting from hyper-parameter tuning
scores = -cross_val_score(clf, 
                            X, y, 
                            scoring='neg_mean_squared_error', 
                            cv = cv_evaluation)





Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished


Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished


In [71]:
print(scores)
# The mean of the 5-fold crossvalidation is the final score of the model
print(scores.mean(), "+-", scores.std())



[30.95506453 14.04830397 15.73059325 47.94659277 11.27318105]
23.990747111818475 +- 13.792773921974007
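
The summary figures can be reproduced from the five fold scores with the standard library. The sketch below also computes the standard error of the mean, which is one common way (an addition here, not something the notebook does) to express the uncertainty of a crossvalidation estimate.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Fold scores (MSE) copied from the output above
scores = [30.95506453, 14.04830397, 15.73059325, 47.94659277, 11.27318105]

fold_mean = mean(scores)
# Population standard deviation, matching NumPy's default scores.std()
fold_std = pstdev(scores)
# Standard error of the mean: sample standard deviation / sqrt(n),
# which is what scipy.stats.sem computes by default
std_error = stdev(scores) / sqrt(len(scores))

print("mean:", fold_mean)
print("std:", fold_std)
print("standard error:", std_error)
```

The spread across folds is large compared with the mean, so differences of one or two MSE units between tuning methods on a single holdout split should be read with caution.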


#### OBTAINING THE FINAL MODEL (FOR DEPLOYMENT, OR FOR SENDING TO A COMPETITION, ...)

If, at the end, we need a final model, we can obtain it by fitting clf to all the available data. Recall that clf performs hyper-parameter tuning internally, so this fit tunes the hyper-parameters again on the complete dataset.

In [72]:
np.random.seed(0)
clfFinal = clf.fit(X,y)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    0.0s finished
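
Once fitted, the search object exposes the tuned model and its hyper-parameters. The sketch below shows this on synthetic data, since it is meant to be runnable on its own; `X_demo`, `y_demo`, `search`, `final_search` and `final_model` are names assumed for this example.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for "all the available data" (illustrative)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(150, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.2, size=150)

param_grid = {'max_depth': list(range(2, 16, 2)),
              'min_samples_split': list(range(2, 16, 2))}

search = RandomizedSearchCV(DecisionTreeRegressor(random_state=0),
                            param_grid,
                            scoring='neg_mean_squared_error',
                            cv=3, n_iter=20, random_state=0)

# fit returns the search object itself, so both names refer to the same object
final_search = search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is a tree trained on all the
# data passed to fit, using the best hyper-parameters found
final_model = final_search.best_estimator_
print(final_search.best_params_)
print(final_model.predict(X_demo[:3]))
```

Calling `predict` on the search object delegates to `best_estimator_`, so either object can be used to make predictions.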


#### MODEL BASED OPTIMIZATION (BAYESIAN OPTIMIZATION)

scikit-optimize (skopt) will be used for this: https://scikit-optimize.github.io. We use **Holdout** for model evaluation and **3-fold crossvalidation** for hyper-parameter tuning (with **Model Based Optimization**).

In [73]:
# conda install -c conda-forge scikit-optimize 
from skopt import BayesSearchCV
from sklearn import metrics


from scipy.stats import uniform, expon
from scipy.stats import randint as sp_randint

# Search space with integer uniform distributions
param_grid = {'max_depth': (2,16),
              'min_samples_split': (2,16)}

budget = 20
# random.seed = 0 for reproducibility
np.random.seed(0)
clf = BayesSearchCV(tree.DecisionTreeRegressor(), 
                    param_grid,
                    scoring='neg_mean_squared_error',
                    cv=3,    
                    n_jobs=-1, verbose=1,
                    n_iter=budget
                    )

clf.fit(X=X_train, y=y_train)

# At this point, clf contains the model with the best hyper-parameters found by bayessearch
# and trained on the complete X_train

# Now, the performance of clf is computed on the test partition

y_test_pred = clf.predict(X_test)
print(metrics.mean_squared_error(y_test, y_test_pred))



Fitting 3 folds for each of 1 candidates, totalling 3 fits
22.386386715341622


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished
