<h1>Pruning your Decision Tree in Scikit-learn</h1>

<h3>Pre-pruning Parameters</h3>

<p>We will look at three pre-pruning parameters in Scikit-learn:</p>
<ul>
    <li><strong>Pre-pruning Technique 1</strong>: Limiting the depth (max_depth)</li>
</ul>
<p>We use the max_depth parameter to limit the number of steps the tree can have between the root node and the leaf nodes.</p>
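<p>As a minimal sketch of the effect of max_depth, the cell below compares an unrestricted tree with a depth-limited one. The synthetic data from make_classification and the names X_demo, y_demo, full_tree and shallow_tree are assumptions for illustration only; they are not the Titanic data used later.</p>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# One tree grown without limits, one capped at three levels below the root
full_tree = DecisionTreeClassifier(random_state=99).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=99).fit(X_demo, y_demo)

print("unrestricted depth:", full_tree.get_depth())
print("depth with max_depth=3:", shallow_tree.get_depth())
```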

<ul>
    <li><strong>Pre-pruning Technique 2</strong>: Avoiding leaves with few data points (min_samples_leaf)</li>
</ul>
<p>We use the min_samples_leaf parameter to tell the model to stop splitting early: a split is only made if it leaves at least min_samples_leaf data points in each of the two resulting branches.</p>
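<p>One way to check this behaviour is to count how many training points end up in each leaf. This is a sketch on synthetic data; X_demo, y_demo, tree and the threshold of 10 are illustrative assumptions.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=99).fit(X_demo, y_demo)

# apply() returns, for each sample, the index of the leaf it lands in
leaf_ids = tree.apply(X_demo)
leaf_sizes = np.bincount(leaf_ids)           # samples per node index
smallest = leaf_sizes[leaf_sizes > 0].min()  # ignore internal (non-leaf) nodes

print("smallest leaf holds", smallest, "training points")
```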

<ul>
    <li><strong>Pre-pruning Technique 3</strong>: Limiting the number of leaf nodes (max_leaf_nodes)</li>
</ul>
<p>We use max_leaf_nodes to set a limit on the number of leaf nodes in the tree.</p>
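<p>A short sketch of the cap in action, again on synthetic stand-in data (X_demo, y_demo, capped_tree and the limit of 8 are assumptions chosen for illustration):</p>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Allow the tree at most 8 leaves
capped_tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=99).fit(X_demo, y_demo)

print("number of leaves:", capped_tree.get_n_leaves())
```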

<h3>Grid Search</h3>

<p>To determine the best values for the pre-pruning parameters, we’ll use a grid search with cross-validation to compare several candidate values.</p>

<p>First, we import and prepare the data, and build our decision tree:</p>

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('../data/titanic.csv')
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values

dt = DecisionTreeClassifier(random_state=99)  # random_state ensures reproducible results

<p>The Grid Search class is called GridSearchCV. We start by importing it.</p>

In [2]:
from sklearn.model_selection import GridSearchCV

<p>GridSearchCV has four parameters that we’ll use:</p>
<ol>
    <li>The model (in this case a DecisionTreeClassifier)</li>
    <li>Param grid: a dictionary of the parameter names and the values to try for each</li>
    <li>What metric to use (the default is the model’s own score method, which is accuracy for a classifier)</li>
    <li>How many folds for k-fold cross-validation</li>
</ol>

<p>Below we create the param grid variable and give it a list of all the possible values for each parameter that we want to try.</p>

In [3]:
param_grid = {
    'max_depth': [5, 15, 25],
    'min_samples_leaf': [1, 3],
    'max_leaf_nodes': [10, 20, 35, 50]
}

<p>Now we create the grid search object. We’ll use the above parameter grid, set the scoring metric to the F1 score, and do a 5-fold cross validation.</p>

In [4]:
gs = GridSearchCV(dt, param_grid, scoring='f1', cv=5)

<p>Now we can fit the grid search object. This can take a little time to run as it’s trying every possible combination of the parameters.</p>

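<p>Since we have 3 possible values for max_depth, 2 for min_samples_leaf and 4 for max_leaf_nodes, we have 3 * 2 * 4 = 24 different combinations to try. With 5-fold cross-validation, each combination is fitted 5 times, for 24 * 5 = 120 fits in total, plus one final refit of the best model on all the data.</p>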
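<p>The count can be double-checked with Scikit-learn’s ParameterGrid helper, which enumerates the same combinations that GridSearchCV tries. The names n_combinations and n_fits are illustrative.</p>

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [5, 15, 25],
    'min_samples_leaf': [1, 3],
    'max_leaf_nodes': [10, 20, 35, 50]
}

n_combinations = len(ParameterGrid(param_grid))  # every combination of the listed values
n_fits = n_combinations * 5                      # one fit per fold with cv=5

print(n_combinations, "combinations,", n_fits, "cross-validation fits")
```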

In [5]:
gs.fit(X, y)

<p>At the end, we use the best_params_ attribute to see which model won.</p>

In [6]:
print("best params:", gs.best_params_)

best params: {'max_depth': 15, 'max_leaf_nodes': 35, 'min_samples_leaf': 1}


<strong>Thus we see that the best model has a maximum depth of 15, at most 35 leaf nodes, and a minimum of 1 sample per leaf.</strong>
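<p>Beyond best_params_, a fitted grid search also exposes best_score_ (the mean cross-validated score of the winner) and best_estimator_ (the winning model refit on all the data). The sketch below runs the same grid on synthetic stand-in data so that it is self-contained; X_demo, y_demo, gs_demo and best_tree are illustrative names, and the values it prints will differ from the Titanic results above.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {
    'max_depth': [5, 15, 25],
    'min_samples_leaf': [1, 3],
    'max_leaf_nodes': [10, 20, 35, 50]
}

gs_demo = GridSearchCV(DecisionTreeClassifier(random_state=99), param_grid, scoring='f1', cv=5)
gs_demo.fit(X_demo, y_demo)

print("best params:", gs_demo.best_params_)
print("best mean F1:", gs_demo.best_score_)

# The winning model, already refit on all of X_demo and y_demo
best_tree = gs_demo.best_estimator_
print("depth:", best_tree.get_depth(), "leaves:", best_tree.get_n_leaves())
```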