# Decision Trees in Practice

In this assignment we will explore various techniques for preventing overfitting in decision trees. We will extend the binary decision tree implementation from the previous assignment, so you will need your solutions from that assignment.

In this assignment you will:

* Implement binary decision trees with different early stopping methods.
* Compare models with different stopping parameters.
* Visualize the concept of overfitting in decision trees.

Let's get started!

# Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import math
import string
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

# Load LendingClub Dataset

This assignment will use the [LendingClub](https://www.lendingclub.com/) dataset used in the previous two assignments.

In [2]:
loans = pd.read_csv('lending-club-data.csv')
loans



| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | sub_grade_num | delinq_2yrs_zero | pub_rec_zero | collections_12_mths_zero | short_emp | payment_inc_ratio | final_d | last_delinq_none | last_record_none | last_major_derog_none |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975 | 36 months | 10.65 | 162.87 | B | B2 | ... | 0.4 | 1.0 | 1.0 | 1.0 | 0 | 8.143500 | 20141201T000000 | 1 | 1 | 1 |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500 | 60 months | 15.27 | 59.83 | C | C4 | ... | 0.8 | 1.0 | 1.0 | 1.0 | 1 | 2.393200 | 20161201T000000 | 1 | 1 | 1 |
| 2 | 1077175 | 1313524 | 2400 | 2400 | 2400 | 36 months | 15.96 | 84.33 | C | C5 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0 | 8.259550 | 20141201T000000 | 1 | 1 | 1 |
| 3 | 1076863 | 1277178 | 10000 | 10000 | 10000 | 36 months | 13.49 | 339.31 | C | C1 | ... | 0.2 | 1.0 | 1.0 | 1.0 | 0 | 8.275850 | 20141201T000000 | 0 | 1 | 1 |
| 4 | 1075269 | 1311441 | 5000 | 5000 | 5000 | 36 months | 7.90 | 156.46 | A | A4 | ... | 0.8 | 1.0 | 1.0 | 1.0 | 0 | 5.215330 | 20141201T000000 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 122602 | 9856168 | 11708132 | 6000 | 6000 | 6000 | 60 months | 23.40 | 170.53 | E | E5 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 1 | 4.487630 | 20190101T000000 | 0 | 1 | 0 |
| 122603 | 9795013 | 11647121 | 15250 | 15250 | 15250 | 36 months | 17.57 | 548.05 | D | D2 | ... | 0.4 | 0.0 | 0.0 | 1.0 | 0 | 10.117800 | 20170101T000000 | 0 | 0 | 0 |
| 122604 | 9695736 | 11547808 | 8525 | 8525 | 8525 | 60 months | 18.25 | 217.65 | D | D3 | ... | 0.6 | 0.0 | 1.0 | 1.0 | 0 | 6.958120 | 20190101T000000 | 0 | 1 | 0 |
| 122605 | 9684700 | 11536848 | 22000 | 22000 | 22000 | 60 months | 19.97 | 582.50 | D | D5 | ... | 1.0 | 1.0 | 0.0 | 1.0 | 0 | 8.961540 | 20190101T000000 | 1 | 0 | 1 |


As before, we reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.

In [3]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis=1)

We will be using the same 4 categorical features as in the previous assignment: 
1. grade of the loan 
2. the length of the loan term
3. the home ownership status: own, mortgage, rent
4. number of years of employment.

In the dataset, each of these features is categorical. Since we are building a binary decision tree, we will have to convert them to binary data in a subsequent section using 1-hot encoding. Extract these feature columns and the target column from the dataset, and discard the rest of the feature columns.

In [4]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
loans = loans[features + [target]]

In [5]:
loans

| | grade | term | home_ownership | emp_length | safe_loans |
|---|---|---|---|---|---|
| 0 | B | 36 months | RENT | 10+ years | 1 |
| 1 | C | 60 months | RENT | < 1 year | -1 |
| 2 | C | 36 months | RENT | 10+ years | 1 |
| 3 | C | 36 months | RENT | 10+ years | 1 |
| 4 | A | 36 months | RENT | 3 years | 1 |
| ... | ... | ... | ... | ... | ... |
| 122602 | E | 60 months | MORTGAGE | | -1 |
| 122603 | D | 36 months | MORTGAGE | 10+ years | 1 |
| 122604 | D | 60 months | MORTGAGE | 5 years | -1 |
| 122605 | D | 60 months | MORTGAGE | 10+ years | -1 |


## One Hot Encoding

In [6]:
categorical_vars = [var for var in loans.columns if loans[var].dtypes == object]
loans = pd.get_dummies(loans)
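To make the effect of the encoding step concrete, here is a minimal sketch on a made-up three-row table (the toy data and the `dtype=int` argument are assumptions for illustration, not part of the assignment). Each category value becomes its own 0/1 column, which is the form the binary splits later in this assignment rely on.

```python
import pandas as pd

# Toy data (made up for illustration): one categorical column and the target.
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'safe_loans': [1, -1, 1],
})

# dtype=int requests 0/1 integers; recent pandas versions default to booleans.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The numeric `safe_loans` column is left untouched, while `grade` is replaced by `grade_A` and `grade_B`.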

## Split data into training and validation sets

In [7]:
train_indices = pd.read_json('module-6-assignment-train-idx.json')
train_indices = train_indices.rename({0: 'indices'}, axis=1)
train_indices = list(train_indices['indices'])

validation_indices = pd.read_json('module-6-assignment-validation-idx.json')
validation_indices = validation_indices.rename({0: 'indices'}, axis=1)
validation_indices = list(validation_indices['indices'])

train_data = loans.iloc[train_indices]
validation_data = loans.iloc[validation_indices]

# Early stopping methods for decision trees

In this section, we will extend the **binary tree implementation** from the previous assignment in order to handle some early stopping conditions. Recall the 3 early stopping methods that were discussed in lecture:

1. Reached a **maximum depth**. (set by parameter `max_depth`).
2. Reached a **minimum node size**. (set by parameter `min_node_size`).
3. Don't split if the **gain in error reduction** is too small. (set by parameter `min_error_reduction`).

For the rest of this assignment, we will refer to these three as **early stopping conditions 1, 2, and 3**.
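As a rough sketch of how the three conditions fit together, the hypothetical helper below (its name and signature are assumptions, not part of the assignment) reports which early stopping condition would fire for a node. The comparisons mirror the ones used later in this assignment: depth is compared with `>=`, node size and error reduction with `<=`.

```python
def early_stopping_reason(current_depth, num_points, gain,
                          max_depth, min_node_size, min_error_reduction):
    """Hypothetical helper: name the first early stopping condition that fires, or None."""
    if current_depth >= max_depth:
        return 'condition 1: maximum depth'
    if num_points <= min_node_size:
        return 'condition 2: minimum node size'
    if gain <= min_error_reduction:
        return 'condition 3: minimum error reduction'
    return None

params = dict(max_depth=6, min_node_size=100, min_error_reduction=0.0)
print(early_stopping_reason(2, 500, 0.05, **params))  # no condition fires
print(early_stopping_reason(6, 500, 0.05, **params))  # depth limit reached
print(early_stopping_reason(3, 90, 0.05, **params))   # node too small
print(early_stopping_reason(3, 500, 0.0, **params))   # split does not help enough
```

In the actual tree-building code, condition 3 can only be checked after the best split has been found, because the gain depends on that split.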

## Early stopping condition 1: Maximum depth

Recall that we already implemented the maximum depth stopping condition in the previous assignment. In this assignment, we will experiment with this condition a bit more and also write code to implement the 2nd and 3rd early stopping conditions.

We will be reusing code from the previous assignment and then building upon it. We will **alert you** when you reach a function that was part of the previous assignment so that you can simply copy and paste your previous code.

## Early stopping condition 2: Minimum node size

The function **reached_minimum_node_size** takes 2 arguments:

1. The `data` (from a node)
2. The minimum number of data points that a node is allowed to split on, `min_node_size`.

This function simply calculates whether the number of data points at a given node is less than or equal to the specified minimum node size. This function will be used to detect this early stopping condition in the **decision_tree_create** function.

In [8]:
def reached_minimum_node_size(data, min_node_size):
    # Return True if the number of data points is less than or equal to the minimum node size.
    return len(data) <= min_node_size
    

# The code below answers the following quiz question.
data = [1,1,-1,1,-1,1,1,-1,1] # Toy node with 6 safe (+1) and 3 risky (-1) loans

if reached_minimum_node_size(data, 10):
    ans = 'Since the number of data points <= min_node_size, the tree learning algorithm should STOP'
else:
    ans = 'Since the number of data points > min_node_size, the tree learning algorithm should CONTINUE'

<font color='steelblue'><b> Quiz : Given an intermediate node with 6 safe loans and 3 risky loans, if the `min_node_size` parameter is 10, what should the tree learning algorithm do next? </b></font>

<font color='mediumvioletred'><b> Answer : {{ans}} </b></font>

## Early stopping condition 3: Minimum gain in error reduction

The function **error_reduction** takes 2 arguments:

1. The error **before** a split, `error_before_split`.
2. The error **after** a split, `error_after_split`.

This function computes the gain in error reduction, i.e., the difference between the error before the split and that after the split. This function will be used to detect this early stopping condition in the **decision_tree_create** function.

In [9]:
def error_reduction(error_before_split, error_after_split):
    # Return the error before the split minus the error after the split.
    return error_before_split - error_after_split
    
    
# The code below answers the following quiz question.
reductions = [0.0, 0.05, 0.1, 0.14]  # Error reduction for each of the 4 candidate features
min_gain = 0.2

# Even the best candidate split must exceed min_gain; otherwise the algorithm stops.
if max(reductions) <= min_gain:
    ans = 'STOP'
else:
    ans = 'CONTINUE'

<font color='steelblue'><b> Quiz : Assume an intermediate node has 6 safe loans and 3 risky loans. For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the minimum gain in error reduction parameter is set to 0.2, what should the tree learning algorithm do next? </b></font>

<font color='mediumvioletred'><b> Answer : {{ans}} </b></font>

## Grabbing binary decision tree helper functions from past assignment

Recall from the previous assignment that we wrote a function `intermediate_node_num_mistakes` that calculates the number of **misclassified examples** when predicting the **majority class**. This is used to help determine which feature is best to split on at a given node of the tree.

**Please copy and paste your code for `intermediate_node_num_mistakes` here**.

In [10]:
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0
    
    # Count the number of 1's (safe loans)
    count_safe = sum(labels_in_node == 1)
    
    # Count the number of -1's (risky loans)
    count_risky = sum(labels_in_node == -1)
                
    # Return the number of mistakes that the majority classifier makes.
    return min(count_safe, count_risky)

We then wrote a function `best_splitting_feature` that finds the best feature to split on given the data and a list of features to consider.

In [11]:
def best_splitting_feature(data, features, target):
    
    target_values = data[target]
    best_feature = None # Keep track of the best feature 
    best_error = 10     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should initialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        
        # The right split will have all data points where the feature value is 1
        right_split = data[data[feature] == 1]   
            
        # Calculate the number of misclassified examples in the left split.
        # Remember that we implemented a function for this! (It was called intermediate_node_num_mistakes)
        left_mistakes = intermediate_node_num_mistakes(left_split[target])            

        # Calculate the number of misclassified examples in the right split.
        right_mistakes = intermediate_node_num_mistakes(right_split[target])
        
        # Compute the classification error of this split.
        error = (left_mistakes + right_mistakes) / num_data_points

        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        if error < best_error:
            best_error = error
            best_feature = feature
    
    return best_feature # Return the best feature we found

In [12]:
def create_leaf(target_values):    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True    }
   
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])    

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] =  1         
    else:
        leaf['prediction'] =  -1                

    # Return the leaf node
    return leaf 

## Incorporating new early stopping conditions in binary decision tree implementation

Now, you will implement a function that builds a decision tree handling the three early stopping conditions described in this assignment. In particular, you will write code to detect early stopping conditions 2 and 3. You implemented the functions needed to detect these conditions above. The 1st early stopping condition, **max_depth**, was implemented in the previous assignment and you will not need to reimplement it. In addition to these early stopping conditions, the typical stopping conditions of having no mistakes or no more features to split on (which we denote by "stopping conditions" 1 and 2) are also included, as in the previous assignment.

**Implementing early stopping condition 2: minimum node size:**

* **Step 1:** Use the function **reached_minimum_node_size** that you implemented earlier to write an if condition to detect whether we have hit the base case, i.e., the node does not have enough data points and should be turned into a leaf. Don't forget to use the `min_node_size` argument.
* **Step 2:** Return a leaf. This line of code should be the same as the other (pre-implemented) stopping conditions.


**Implementing early stopping condition 3: minimum error reduction:**

**Note:** This has to come after finding the best splitting feature so we can calculate the error after splitting in order to calculate the error reduction.

* **Step 1:** Calculate the **classification error before splitting**.  Recall that classification error is defined as:

$$
\text{classification error} = \frac{\text{# mistakes}}{\text{# total examples}}
$$
* **Step 2:** Calculate the **classification error after splitting**. This requires calculating the number of mistakes in the left and right splits, and then dividing by the total number of examples.
* **Step 3:** Use the function **error_reduction** that you implemented earlier to write an if condition to detect whether the reduction in error is less than or equal to the constant provided (`min_error_reduction`). Don't forget to use that argument.
* **Step 4:** Return a leaf. This line of code should be the same as the other (pre-implemented) stopping conditions.
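The steps above can be traced on a small worked example. The sketch below uses a made-up node with 6 safe and 3 risky loans and a hypothetical split of it (both the split and the helper name `majority_mistakes` are assumptions for illustration); it works on plain Python lists rather than the data frames used in the assignment.

```python
def majority_mistakes(labels):
    """Mistakes made when predicting the majority class for these labels."""
    safe = sum(1 for y in labels if y == 1)
    risky = sum(1 for y in labels if y == -1)
    return min(safe, risky)

node = [1, 1, 1, 1, 1, 1, -1, -1, -1]  # 6 safe, 3 risky
left = [1, 1, 1, 1, 1]                 # hypothetical left split: 5 safe
right = [1, -1, -1, -1]                # hypothetical right split: 1 safe, 3 risky

# Step 1: error before splitting = 3 mistakes / 9 examples.
error_before = majority_mistakes(node) / len(node)

# Step 2: error after splitting = (0 + 1) mistakes / 9 examples.
error_after = (majority_mistakes(left) + majority_mistakes(right)) / len(node)

# Step 3: compare the reduction with the threshold.
min_error_reduction = 0.2
reduction = error_before - error_after
should_stop = reduction <= min_error_reduction

print(error_before, error_after, reduction, should_stop)
```

Here the reduction is 2/9, which is above 0.2, so this particular split would be kept and the node would not be turned into a leaf.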

In [13]:
def decision_tree_create(data, features, target, current_depth = 0, 
                         max_depth = 10, min_node_size=1, 
                         min_error_reduction=0.0):
    
    remaining_features = features[:] # Make a copy of the features.
    
    target_values = data[target]
    print('--------------------------------------------------------------------')
    print('Subtree, depth = %s (%s data points).' % (current_depth, len(target_values)))
    
    
    # Stopping condition 1: All data points have the same target value.
    if intermediate_node_num_mistakes(target_values) == 0:
        print('Stopping condition 1 reached. All data points have the same target value.')               
        return create_leaf(target_values)
    
    # Stopping condition 2: No more features to split on.
    if remaining_features == []:
        print('Stopping condition 2 reached. No remaining features.')               
        return create_leaf(target_values)    
    
    # Early stopping condition 1: Reached max depth limit.
    if current_depth >= max_depth:
        print('Early stopping condition 1 reached. Reached maximum depth.')
        return create_leaf(target_values)
    
    # Early stopping condition 2: Reached the minimum node size.
    # If the number of data points is less than or equal to the minimum size, return a leaf.
    if reached_minimum_node_size(data, min_node_size):
        print('Early stopping condition 2 reached. Reached minimum node size.')
        return create_leaf(target_values)
    
    # Find the best splitting feature
    splitting_feature = best_splitting_feature(data, features, target)
    
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    
    # Early stopping condition 3: Minimum error reduction
    # Calculate the error before splitting (number of misclassified examples 
    # divided by the total number of examples)
    error_before_split = intermediate_node_num_mistakes(target_values) / float(len(data))
    
    # Calculate the error after splitting (number of misclassified examples 
    # in both groups divided by the total number of examples)
    left_mistakes = intermediate_node_num_mistakes(left_split[target])
    right_mistakes = intermediate_node_num_mistakes(right_split[target])
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))
    
    # If the error reduction is LESS THAN OR EQUAL TO min_error_reduction, return a leaf.
    if error_reduction(error_before_split, error_after_split) <= min_error_reduction:
        print('Early stopping condition 3 reached. Minimum error reduction.')
        return create_leaf(target_values) 
    
    
    remaining_features.remove(splitting_feature)
    print('Split on feature %s. (%s, %s)' % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, 
                                     current_depth + 1, max_depth, min_node_size, min_error_reduction)        
    
    right_tree = decision_tree_create(right_split, remaining_features, target, 
                                     current_depth + 1, max_depth, min_node_size, min_error_reduction)
    
    
    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}

In [14]:
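def count_leaves(tree):
    # Note: despite its name, this counts every node (internal nodes and leaves),
    # which is why the test below expects 7 for a full tree of depth 2.
    if tree['is_leaf']:
        return 1
    return 1 + count_leaves(tree['left']) + count_leaves(tree['right'])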
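Since the check that follows expects the value 7 for a tree of depth 2, it may help to see node counting and leaf counting side by side. This is a minimal sketch on a hand-built tree; the helper names `make_leaf`, `make_split`, `count_all_nodes`, and `count_only_leaves` are hypothetical and only mimic the dictionary layout used in this assignment.

```python
def make_leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def make_split(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

def count_all_nodes(tree):
    # Every node counts as 1, whether it is a leaf or an internal node.
    if tree['is_leaf']:
        return 1
    return 1 + count_all_nodes(tree['left']) + count_all_nodes(tree['right'])

def count_only_leaves(tree):
    # Internal nodes contribute nothing themselves; only leaves count.
    if tree['is_leaf']:
        return 1
    return count_only_leaves(tree['left']) + count_only_leaves(tree['right'])

# A full binary tree of depth 2 with placeholder feature names.
full_depth_2 = make_split('f0',
                          make_split('f1', make_leaf(1), make_leaf(-1)),
                          make_split('f2', make_leaf(1), make_leaf(-1)))

print(count_all_nodes(full_depth_2))    # 1 + 2 + 4 nodes
print(count_only_leaves(full_depth_2))  # 4 leaves
```

A full tree of depth 2 has 7 nodes but only 4 leaves, so a count of 7 corresponds to counting all nodes.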

Run the following test code to check your implementation, making sure to get **'Test passed!'**.

In [15]:
small_decision_tree = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)),
                                           'safe_loans', max_depth = 2, min_node_size = 10, min_error_reduction=0.0)
if count_leaves(small_decision_tree) == 7:
    print('Test passed!')
else:
    print('Test failed... try again!')
    print('Number of nodes found                :', count_leaves(small_decision_tree))
    print('Number of nodes that should be there : 7' )

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
Split on feature grade_D. (23300, 4701)
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------------------------------------

## Build a tree!

Now that your code is working, we will train a tree model on the **train_data** with
* `max_depth = 6`
* `min_node_size = 100`, 
* `min_error_reduction = 0.0`

In [16]:
my_decision_tree_new = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', 
                                            max_depth = 6, min_node_size = 100, min_error_reduction=0.0)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 3 reached. Minimum error reduction.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
Split on feature emp_length_< 1 year. (90, 11)
--------------------------------------------------------------------
Subtree, depth = 3 (90 data points).
Early stopping condition 2 reached. Reached minimum node size.
--------------------------------------------------------------------
Subtree, depth = 3 (11 data points).
Early stopping condition 2 reached. Reached minimum node size.
------------------------------------

Let's now train a tree model **ignoring early stopping conditions 2 and 3** so that we get the same tree as in the previous assignment.  To ignore these conditions, we set `min_node_size=0` and `min_error_reduction=-1` (a negative value).
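A short sketch of why these two settings switch the conditions off, assuming the `<=` comparisons used in `decision_tree_create` (the node sizes below are arbitrary examples):

```python
min_node_size = 0
min_error_reduction = -1

# Condition 2 fires when len(data) <= min_node_size.
# With a threshold of 0, no non-empty node can trigger it.
node_sizes = [1, 11, 101, 9122]
condition_2_fires = [n <= min_node_size for n in node_sizes]

# Condition 3 fires when error_before - error_after <= min_error_reduction.
# Majority-class errors lie between 0 and 0.5, so the difference can never
# be as low as -1, even in the most pessimistic imaginable case.
most_pessimistic_reduction = 0.0 - 0.5
condition_3_fires = most_pessimistic_reduction <= min_error_reduction

print(condition_2_fires, condition_3_fires)
```

Empty nodes never reach condition 2 either, because a node with no data points makes zero mistakes and is already turned into a leaf by stopping condition 1.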

In [17]:
my_decision_tree_old = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)),
                                            'safe_loans', max_depth = 6, min_node_size = 0, min_error_reduction=-1)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Split on feature grade_B. (8074, 1048)
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
Split on feature grade_C. (5884, 2190)
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
Split on feature grade_D. (3826, 2058)
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
Split on feature grade_E. (1693, 2133)
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
...

Split on feature grade_A. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 6 (347 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 5 (11 data points).
Split on feature home_ownership_OWN. (9, 2)
--------------------------------------------------------------------
Subtree, depth = 6 (9 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 3 (1276 data points).
Split on feature gr

## Making predictions

Recall that in the previous assignment you implemented a function `classify` to classify a new point `x` using a given tree. We will need that function here. **Use your code from the previous assignment**.

In [18]:
def classify(tree, x, annotate = False):
       # if the node is a leaf node.
    if tree['is_leaf']:
        if annotate:
             print('At leaf, predicting %s' % tree['prediction'])
        return tree['prediction']
    
    else:
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print('Split on %s = %s' % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

Now, let's consider the first example of the validation set and see what the **my_decision_tree_new** model predicts for this data point.

In [19]:
print(validation_data.iloc[0])
print('Predicted class: %s ' % classify(my_decision_tree_new, validation_data.iloc[0]))

safe_loans                -1
grade_A                    0
grade_B                    0
grade_C                    0
grade_D                    1
grade_E                    0
grade_F                    0
grade_G                    0
term_ 36 months            0
term_ 60 months            1
home_ownership_MORTGAGE    0
home_ownership_OTHER       0
home_ownership_OWN         0
home_ownership_RENT        1
emp_length_1 year          0
emp_length_10+ years       0
emp_length_2 years         1
emp_length_3 years         0
emp_length_4 years         0
emp_length_5 years         0
emp_length_6 years         0
emp_length_7 years         0
emp_length_8 years         0
emp_length_9 years         0
emp_length_< 1 year        0
Name: 24, dtype: int64
Predicted class: -1 


Let's add some annotations to our prediction to see the prediction path that led to this predicted class:

In [20]:
classify(my_decision_tree_new, validation_data.iloc[0], annotate = True)

Split on term_ 36 months = 0
Split on grade_A = 0
At leaf, predicting -1


-1

Let's now recall the prediction path for the decision tree learned in the previous assignment, which we recreated here as my_decision_tree_old.

In [21]:
classify(my_decision_tree_old, validation_data.iloc[0], annotate = True)

# The code below is to answer the following quiz questions

print('\nPrediction path for my_decision_tree_new on validation_data.iloc[1]: ')
classify(my_decision_tree_new, validation_data.iloc[1], annotate = True)

print('\nPrediction path for my_decision_tree_new on validation_data.iloc[2]: ')
classify(my_decision_tree_new, validation_data.iloc[2], annotate = True)

Split on term_ 36 months = 0
Split on grade_A = 0
Split on grade_B = 0
Split on grade_C = 0
Split on grade_D = 1
Split on grade_E = 0
At leaf, predicting -1

Prediction path for my_decision_tree_new on validation_data.iloc[1]: 
Split on term_ 36 months = 1
Split on grade_D = 0
Split on grade_E = 0
Split on grade_F = 0
Split on grade_C = 0
Split on grade_G = 0
At leaf, predicting 1

Prediction path for my_decision_tree_new on validation_data.iloc[2]: 
Split on term_ 36 months = 0
Split on grade_A = 0
At leaf, predicting -1


-1

<font color='steelblue'><b> Quiz 1 : For my_decision_tree_new trained with <i><u>max_depth = 6, min_node_size = 100, min_error_reduction=0.0</u></i>, is the prediction path for validation_data.iloc[0] shorter, longer, or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3? </b></font>

<font color='mediumvioletred'><b> Answer 1 : The prediction path for <i><u>my_decision_tree_new trained with max_depth = 6, min_node_size = 100, min_error_reduction=0.0</u></i> is <i><u>shorter</u></i> for validation_data.iloc[0] than that of my_decision_tree_old </b></font>

<br/>

<font color='steelblue'><b> Quiz 2 : For my_decision_tree_new trained with <i><u>max_depth = 6, min_node_size = 100, min_error_reduction=0.0</u></i>, is the prediction path for any point always shorter, always longer, always the same, shorter or the same, or longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3? </b></font>

<font color='mediumvioletred'><b> Answer 2 : The prediction path for <i><u>my_decision_tree_new trained with max_depth = 6, min_node_size = 100, min_error_reduction=0.0</u></i> is always <i><u>shorter than or the same length as</u></i> that of my_decision_tree_old </b></font>

<br/>

<font color='steelblue'><b> Quiz 3 : For a tree trained on any dataset using <i><u>max_depth = 6, min_node_size = 100, min_error_reduction=0.0</u></i>, what is the maximum number of splits encountered while making a single prediction? </b></font>

<font color='mediumvioletred'><b> Answer 3 : 6 since that is the max_depth of the tree </b></font>
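The path lengths discussed in these quiz questions can also be measured directly. The sketch below is a hypothetical helper (not part of the assignment) that counts the splits visited for one data point, shown on a small hand-built tree that uses the same dictionary layout; the tree itself is made up.

```python
def count_splits_on_path(tree, x):
    """Count the splits encountered while classifying x (hypothetical helper)."""
    splits = 0
    node = tree
    while not node['is_leaf']:
        splits += 1
        # Same routing rule as classify: feature value 0 goes left, otherwise right.
        if x[node['splitting_feature']] == 0:
            node = node['left']
        else:
            node = node['right']
    return splits

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

# Made-up tree: the left branch is split once more, the right branch is a leaf.
toy_tree = {
    'is_leaf': False, 'prediction': None,
    'splitting_feature': 'term_ 36 months',
    'left': {'is_leaf': False, 'prediction': None,
             'splitting_feature': 'grade_A',
             'left': leaf(-1), 'right': leaf(1)},
    'right': leaf(1),
}

long_term = {'term_ 36 months': 0, 'grade_A': 0}
short_term = {'term_ 36 months': 1, 'grade_A': 0}
print(count_splits_on_path(toy_tree, long_term))
print(count_splits_on_path(toy_tree, short_term))
```

Because every split moves one level deeper, the count can never exceed the depth of the tree, which is what bounds it by `max_depth`.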

## Evaluating the model

Now let us evaluate the model that we have trained. You implemented this evaluation in the function **evaluate_classification_error** from the previous assignment. **Use your code from the previous assignment**.

In [22]:
def evaluate_classification_error(tree, data):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x), axis=1)
    
    # Once you've made the predictions, calculate the classification error and return it
    classification_error = round(sum(prediction != data['safe_loans']) / len(data),15)
    
    return classification_error
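Independently of any tree, the quantity being computed is just the fraction of predictions that disagree with the true labels. A minimal sketch on made-up lists (the function name and data are assumptions for illustration):

```python
def classification_error(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    mistakes = sum(1 for p, y in zip(predictions, labels) if p != y)
    return mistakes / len(labels)

predictions = [1, -1, 1, 1, -1]
labels = [1, 1, 1, -1, -1]

# Two of the five predictions are wrong.
error = classification_error(predictions, labels)
print(error)
```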

Now, let's use this function to evaluate the classification error of **my_decision_tree_new** on the **validation_data**.

In [23]:
error_validation_new = evaluate_classification_error(my_decision_tree_new, validation_data)
error_validation_new

0.377746660922016

Now, evaluate the validation error using **my_decision_tree_old**

In [24]:
error_validation_old = evaluate_classification_error(my_decision_tree_old, validation_data)
print(error_validation_old)

# The code below is to answer the following quiz question
if error_validation_new > error_validation_old :
    ans = 'The classification error is higher on the new decision tree than on the old decision tree on the validation data'

elif error_validation_new == error_validation_old :
    ans = 'The classification error is the same on the new decision tree as on the old decision tree on the validation data'
    
elif error_validation_new < error_validation_old :
    ans = 'The classification error is lower on the new decision tree than on the old decision tree on the validation data'

0.377746660922016


<font color='steelblue'><b> Quiz : Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment? </b></font>

<font color='mediumvioletred'><b> Answer : {{ans}} </b></font>

# Exploring the effect of max_depth

We will compare three models trained with different values of the stopping criterion. We intentionally picked models at the extreme ends (**too small**, **just right**, and **too large**).

Train three models with these parameters:

1. **model_1**: max_depth = 2 (too small)
2. **model_2**: max_depth = 6 (just right)
3. **model_3**: max_depth = 14 (may be too large)

For each of these three, we set `min_node_size = 0` and `min_error_reduction = -1`.

**Note:** Each tree can take up to a few minutes to train. In particular, `model_3` will probably take the longest to train.
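One way to get a feel for how quickly model complexity can grow with `max_depth` is the standard bound for binary trees: a tree of depth d has at most 2^d leaves and 2^(d+1) - 1 nodes. The sketch below only evaluates that bound (the function name is hypothetical); the trees actually learned here may be much smaller, since branches also stop when a node is pure or no features remain.

```python
def tree_size_bounds(max_depth):
    """Upper bounds on (leaves, total nodes) for a binary tree of the given depth."""
    max_leaves = 2 ** max_depth
    max_nodes = 2 ** (max_depth + 1) - 1
    return max_leaves, max_nodes

for depth in (2, 6, 14):
    leaves, nodes = tree_size_bounds(depth)
    print('max_depth = %s: at most %s leaves, %s nodes' % (depth, leaves, nodes))
```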

In [25]:
model_1 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 2, 
                                min_node_size = 0, min_error_reduction=-1)

model_2 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=-1)

model_3 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 14, 
                                min_node_size = 0, min_error_reduction=-1)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 2 (101 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 1 (28001 data points).
Split on feature grade_D. (23300, 4701)
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Early stopping condition 1 reached. Reached maximum depth.
-----------------------------------------------

Split on feature grade_D. (23300, 4701)
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Split on feature grade_E. (22024, 1276)
--------------------------------------------------------------------
Subtree, depth = 3 (22024 data points).
Split on feature grade_F. (21666, 358)
--------------------------------------------------------------------
Subtree, depth = 4 (21666 data points).
Split on feature grade_C. (14444, 7222)
--------------------------------------------------------------------
Subtree, depth = 5 (14444 data points).
Split on feature grade_G. (14347, 97)
--------------------------------------------------------------------
Subtree, depth = 6 (14347 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (97 data points).
Early stopping condition 1 reached. Reached maximum depth.
----------------------------------

Split on feature home_ownership_OWN. (11, 0)
--------------------------------------------------------------------
Subtree, depth = 14 (11 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 12 (5 data points).
Split on feature home_ownership_OWN. (5, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (5 data points).
Split on feature home_ownership_RENT. (5, 0)
--------------------------------------------------------------------
Subtree, depth = 14 (5 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All dat

Split on feature home_ownership_OWN. (1088, 0)
--------------------------------------------------------------------
Subtree, depth = 12 (1088 data points).
Split on feature home_ownership_RENT. (1088, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (1088 data points).
Split on feature emp_length_1 year. (1035, 53)
--------------------------------------------------------------------
Subtree, depth = 14 (1035 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (53 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 12 (0 data points).
Stopping condition 1

Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (148 data points).
Split on feature emp_length_< 1 year. (137, 11)
--------------------------------------------------------------------
Subtree, depth = 14 (137 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (11 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 12 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 11 (57 data points).
Split on feature home_ownership_OTHER. (57, 0)
--------------------------------------------------------------------
Subtree, depth = 12 (57 data points).
Split on fe

--------------------------------------------------------------------
Subtree, depth = 8 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 7 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 5 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 4 (79 data points).
Split on feature home_ownership_MORTGAGE. (34, 45)
--------------------------------------------------------------------
Subtree, depth = 5 (34 data points).
Split on feature grad

Split on feature home_ownership_OTHER. (26, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (26 data points).
Split on feature home_ownership_OWN. (18, 8)
--------------------------------------------------------------------
Subtree, depth = 14 (18 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (8 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 12 (2 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 11 (62 data points).
Spl

Split on feature grade_D. (23300, 4701)
--------------------------------------------------------------------
Subtree, depth = 2 (23300 data points).
Split on feature grade_E. (22024, 1276)
--------------------------------------------------------------------
Subtree, depth = 3 (22024 data points).
Split on feature grade_F. (21666, 358)
--------------------------------------------------------------------
Subtree, depth = 4 (21666 data points).
Split on feature grade_C. (14444, 7222)
--------------------------------------------------------------------
Subtree, depth = 5 (14444 data points).
Split on feature grade_G. (14347, 97)
--------------------------------------------------------------------
Subtree, depth = 6 (14347 data points).
Split on feature grade_A. (9318, 5029)
--------------------------------------------------------------------
Subtree, depth = 7 (9318 data points).
Split on feature home_ownership_OTHER. (9301, 17)
-------------------------------------------------------------

Split on feature home_ownership_RENT. (449, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (449 data points).
Split on feature emp_length_1 year. (431, 18)
--------------------------------------------------------------------
Subtree, depth = 14 (431 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (18 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 11 (9 data points).
Split on feature emp_length_3 years. (8, 1)
--------------------------------------------------------------------
Subtree, depth = 12 (8 data points).
Stopping condition 1 reached. 

Subtree, depth = 6 (4303 data points).
Split on feature emp_length_4 years. (3969, 334)
--------------------------------------------------------------------
Subtree, depth = 7 (3969 data points).
Split on feature home_ownership_OTHER. (3957, 12)
--------------------------------------------------------------------
Subtree, depth = 8 (3957 data points).
Split on feature emp_length_9 years. (3828, 129)
--------------------------------------------------------------------
Subtree, depth = 9 (3828 data points).
Split on feature emp_length_2 years. (3312, 516)
--------------------------------------------------------------------
Subtree, depth = 10 (3312 data points).
Split on feature grade_A. (3312, 0)
--------------------------------------------------------------------
Subtree, depth = 11 (3312 data points).
Split on feature grade_B. (3312, 0)
--------------------------------------------------------------------
Subtree, depth = 12 (3312 data points).
Split on feature grade_G. (3312, 0)
-----

Split on feature grade_G. (334, 0)
--------------------------------------------------------------------
Subtree, depth = 10 (334 data points).
Split on feature term_ 60 months. (334, 0)
--------------------------------------------------------------------
Subtree, depth = 11 (334 data points).
Split on feature home_ownership_OTHER. (334, 0)
--------------------------------------------------------------------
Subtree, depth = 12 (334 data points).
Split on feature home_ownership_OWN. (286, 48)
--------------------------------------------------------------------
Subtree, depth = 13 (286 data points).
Split on feature home_ownership_RENT. (0, 286)
--------------------------------------------------------------------
Subtree, depth = 14 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 14 (286 data points).
Early stopping condition 1 reached. Reached maximum depth.
-

Subtree, depth = 7 (347 data points).
Split on feature grade_C. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 8 (347 data points).
Split on feature grade_G. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 9 (347 data points).
Split on feature term_ 60 months. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 10 (347 data points).
Split on feature home_ownership_MORTGAGE. (237, 110)
--------------------------------------------------------------------
Subtree, depth = 11 (237 data points).
Split on feature home_ownership_OTHER. (235, 2)
--------------------------------------------------------------------
Subtree, depth = 12 (235 data points).
Split on feature home_ownership_OWN. (203, 32)
--------------------------------------------------------------------
Subtree, depth = 13 (203 data points).
Split on feature home_ownership_RENT. (0, 203)
--------

Split on feature home_ownership_RENT. (112, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (112 data points).
Split on feature emp_length_1 year. (102, 10)
--------------------------------------------------------------------
Subtree, depth = 14 (102 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (10 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 11 (6 data points).
Split on feature home_ownership_OWN. (6, 0)
--------------------------------------------------------------------
Subtree, depth = 12 (6 data points).
Split on feature home_ownershi

Split on feature home_ownership_RENT. (10, 0)
--------------------------------------------------------------------
Subtree, depth = 13 (10 data points).
Split on feature emp_length_1 year. (9, 1)
--------------------------------------------------------------------
Subtree, depth = 14 (9 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 14 (1 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 13 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 12 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 10 (1654 data

In [26]:
error_model_1_train = evaluate_classification_error(model_1, train_data)
error_model_2_train = evaluate_classification_error(model_2, train_data)
error_model_3_train = evaluate_classification_error(model_3, train_data)

print('Training data, classification error (model 1):', error_model_1_train)
print('Training data, classification error (model 2):', error_model_2_train)
print('Training data, classification error (model 3):', error_model_3_train)

# The code below answers the following quiz questions
errors_train_list = [error_model_1_train, error_model_2_train, error_model_3_train]
model_names = ['Model 1', 'Model 2', 'Model 3']

errors_train = dict(zip(model_names, errors_train_list))
min_error_train = min(errors_train, key=errors_train.get)

print('\n' + str(min_error_train) + ' has the smallest classification error on training data')


# Errors on validation data

error_model_1_validation = evaluate_classification_error(model_1, validation_data)
error_model_2_validation = evaluate_classification_error(model_2, validation_data)
error_model_3_validation = evaluate_classification_error(model_3, validation_data)

errors_validation_list = [error_model_1_validation, error_model_2_validation, error_model_3_validation]

errors_validation = dict(zip(model_names, errors_validation_list))
min_error_validation = min(errors_validation, key=errors_validation.get)

print('\n' + str(min_error_validation) + ' has the smallest classification error on validation data')

if min_error_train == min_error_validation:
    ans_2 = 'Yes, the tree with smallest error in training data also has the smallest error on validation data'
else:
    ans_2 = 'No, the tree with smallest error in training data does not have the smallest error on validation data'
print('\n' + str(ans_2))

Training data, classification error (model 1): 0.400037610143993
Training data, classification error (model 2): 0.380426606490436
Training data, classification error (model 3): 0.377256608639587

Model 3 has the smallest classification error on training data

Model 2 has the smallest classification error on validation data

No, the tree with smallest error in training data does not have the smallest error on validation data


<font color='steelblue'><b> Quiz 1 : Which tree has the smallest error on the validation data? </b></font>

<font color='mediumvioletred'><b> Answer 1 : {{min_error_validation}} has the smallest classification error on validation data </b></font>

<br/>

<font color='steelblue'><b> Quiz 2 : Does the tree with the smallest error in the training data also have the smallest error in the validation data? </b></font>

<font color='mediumvioletred'><b> Answer 2 : {{ans_2}} </b></font>

<br/>

<font color='steelblue'><b> Quiz 3 : Is it always true that the tree with the lowest classification error on the training set will result in the lowest classification error in the validation set? </b></font>

<font color='mediumvioletred'><b> Answer 3 : No. An overfit tree can achieve the lowest classification error on the training set without achieving the lowest classification error on the validation set. </b></font>
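The point of Quiz 3 can be illustrated with a tiny hand-made example. The data, the two models, and the helper `classification_error` below are all made up for illustration and are unrelated to the LendingClub data. The "complex" model memorizes every training row, much as a fully grown tree would, including one noisy label.

```python
# Toy data, made up for illustration: each row is ((x0, x1), label).
# The label is +1 when x0 == 1, except for one noisy training row.
train = [((0, 0), -1), ((0, 1), -1), ((1, 0), +1), ((1, 1), -1)]  # last row is noise
validation = [((0, 0), -1), ((0, 1), -1), ((1, 0), +1), ((1, 1), +1)]


def simple_model(x):
    """A depth-1 tree: split on x0 only."""
    return +1 if x[0] == 1 else -1


# A tree grown until every training row is classified correctly behaves
# like a lookup table of the training rows.
lookup = dict(train)


def complex_model(x):
    return lookup[x]


def classification_error(model, data):
    """Fraction of rows in data that the model misclassifies."""
    mistakes = sum(1 for x, y in data if model(x) != y)
    return mistakes / len(data)


for name, model in [('simple', simple_model), ('complex', complex_model)]:
    print(name,
          classification_error(model, train),
          classification_error(model, validation))
# simple 0.25 0.0
# complex 0.0 0.25
```

The complex model wins on the training rows only because it fits the noisy label, and that same label costs it a mistake on the validation rows.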

### Measuring the complexity of the tree

Recall from the lecture that deeper trees are more complex. We will measure the complexity of the tree as

```
  complexity(T) = number of leaves in the tree T
```

Using the function `count_leaves` as implemented earlier (above), compute the number of leaves in `model_1`, `model_2`, and `model_3`.
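As a reference point for this complexity measure, here is a minimal sketch of counting leaves and counting all nodes. It assumes the tree is stored as nested dictionaries with `is_leaf`, `left`, and `right` keys. That layout and the names `count_leaves_sketch`, `count_nodes_sketch`, `leaf`, and `node` are assumptions made for this example, chosen so they do not shadow the notebook's own `count_leaves`.

```python
def count_leaves_sketch(tree):
    """Count leaf nodes in a tree stored as nested dicts (assumed layout)."""
    if tree['is_leaf']:
        return 1
    return count_leaves_sketch(tree['left']) + count_leaves_sketch(tree['right'])


def count_nodes_sketch(tree):
    """Count every node (internal nodes and leaves) in the same layout."""
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes_sketch(tree['left']) + count_nodes_sketch(tree['right'])


def leaf():
    return {'is_leaf': True, 'left': None, 'right': None}


def node(left, right):
    return {'is_leaf': False, 'left': left, 'right': right}


# A full binary tree of depth 2: the root, two internal nodes, four leaves.
full_depth_2 = node(node(leaf(), leaf()), node(leaf(), leaf()))
print(count_leaves_sketch(full_depth_2))  # 4
print(count_nodes_sketch(full_depth_2))   # 7
```

A binary tree limited to depth 2 has at most 4 leaves and 7 nodes in total, which is a useful sanity check when reading the count reported for `model_1`.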

In [27]:
count_model_1 = count_leaves(model_1)
count_model_2 = count_leaves(model_2)
count_model_3 = count_leaves(model_3)

print('Number of leaves in Model 1 : ', count_model_1)
print('Number of leaves in Model 2 : ', count_model_2)
print('Number of leaves in Model 3 : ', count_model_3)

# The code below answers the following quiz questions
count_list = [count_model_1, count_model_2, count_model_3]
counts = dict(zip(model_names, count_list))

most_complex = max(counts, key=counts.get)
print('\n' + most_complex, 'has the largest complexity')

# Answer 'Yes' only if the most complex tree also attains the lowest validation error
if errors_validation[most_complex] == min(errors_validation.values()):
    ans = 'Yes, the most complex tree will always result in the lowest classification error in validation data'
else:
    ans = 'No, the most complex tree will not always result in the lowest classification error in validation data'

print('\n' + ans)

Number of leaves in Model 1 :  7
Number of leaves in Model 2 :  77
Number of leaves in Model 3 :  681

Model 3 has the largest complexity

No, the most complex tree will not always result in the lowest classification error in validation data


<font color='steelblue'><b> Quiz 1 : Which tree has the largest complexity? </b></font>

<font color='mediumvioletred'><b> Answer 1 : {{most_complex}} has the largest complexity </b></font>

<br/>

<font color='steelblue'><b> Quiz 2 : Is it always true that the most complex tree will result in the lowest classification error on the validation set? </b></font>

<font color='mediumvioletred'><b> Answer 2 : {{ans}} </b></font>

# Exploring the effect of min_error_reduction

We will compare three models trained with different values of the stopping criterion. We intentionally picked models at the extreme ends (**negative**, **just right**, and **too positive**).

Train three models with these parameters:
1. **model_4**: `min_error_reduction = -1` (ignoring this early stopping condition)
2. **model_5**: `min_error_reduction = 0` (just right)
3. **model_6**: `min_error_reduction = 5` (too positive)

For each of these three, we set `max_depth = 6`, and `min_node_size = 0`.

**Note:** Each tree can take up to 30 seconds to train.
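The effect of `min_error_reduction` can be worked through with class counts. The sketch below assumes that the error of a node is the fraction of points misclassified by a majority vote, and that a split is kept only when the reduction is strictly greater than `min_error_reduction`. The helper names and the counts are made up for illustration.

```python
def node_mistakes(num_safe, num_risky):
    """Mistakes made by predicting the majority class in a node."""
    return min(num_safe, num_risky)


def error_reduction_for_split(left, right):
    """left and right are (num_safe, num_risky) counts for the two children."""
    total = sum(left) + sum(right)
    before = node_mistakes(left[0] + right[0], left[1] + right[1]) / total
    after = (node_mistakes(*left) + node_mistakes(*right)) / total
    return before - after


# Hypothetical splits of a node holding 100 points.
useless_split = ((30, 10), (30, 30))  # mistakes stay at 40 out of 100
useful_split = ((40, 5), (10, 45))    # mistakes drop from 50 to 15

for min_error_reduction in (-1, 0, 5):
    for name, (left, right) in [('useless', useless_split), ('useful', useful_split)]:
        reduction = error_reduction_for_split(left, right)
        decision = 'keep' if reduction > min_error_reduction else 'stop'
        print(min_error_reduction, name, round(reduction, 2), decision)
```

With a threshold of -1 both splits are kept, with 0 only the useful split survives, and with 5 every split is rejected. Because the reduction is a difference of two error fractions it can never exceed 1, so under these assumptions a threshold of 5 stops the tree at the root.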

In [28]:
model_4 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=-1)

model_5 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=0)

model_6 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=5)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Split on feature grade_B. (8074, 1048)
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
Split on feature grade_C. (5884, 2190)
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
Split on feature grade_D. (3826, 2058)
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
Split on feature grade_E. (1693, 2133)
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
E

Split on feature home_ownership_MORTGAGE. (4303, 2919)
--------------------------------------------------------------------
Subtree, depth = 6 (4303 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2919 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 4 (358 data points).
Split on feature emp_length_8 years. (347, 11)
--------------------------------------------------------------------
Subtree, depth = 5 (347 data points).
Split on feature grade_A. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 6 (347 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All data point

In [29]:
error_model_4_validation = evaluate_classification_error(model_4, validation_data)
error_model_5_validation = evaluate_classification_error(model_5, validation_data)
error_model_6_validation = evaluate_classification_error(model_6, validation_data)

print('Validation data, classification error (model 4):', error_model_4_validation)
print('Validation data, classification error (model 5):', error_model_5_validation)
print('Validation data, classification error (model 6):', error_model_6_validation)

# The code below answers the following quiz questions
count_model_4 = count_leaves(model_4)
count_model_5 = count_leaves(model_5)
count_model_6 = count_leaves(model_6)

print('\nNumber of leaves in Model 4 : ', count_model_4)
print('Number of leaves in Model 5 : ', count_model_5)
print('Number of leaves in Model 6 : ', count_model_6)

count_list = [count_model_4, count_model_5, count_model_6]
model_names = ['Model 4', 'Model 5', 'Model 6']
counts = dict(zip(model_names, count_list))

most_complex = max(counts, key=counts.get)
print('\n' + most_complex + ' has the largest complexity')

Validation data, classification error (model 4): 0.377746660922016
Validation data, classification error (model 5): 0.377746660922016
Validation data, classification error (model 6): 0.503446790176648

Number of leaves in Model 4 :  77
Number of leaves in Model 5 :  23
Number of leaves in Model 6 :  1

Model 4 has the largest complexity


<font color='steelblue'><b> Quiz 1 : Using the complexity definition above, which model (model_4, model_5, or model_6) has the largest complexity? Did this match your expectation? </b></font>

<font color='mediumvioletred'><b> Answer 1 : {{most_complex}} has the largest complexity </b></font>

<br/>

<font color='steelblue'><b> Quiz 2 : *model_4* and *model_5* have similar classification error on the validation set, but *model_5* has lower complexity. Should you pick *model_5* over *model_4*? </b></font>

<font color='mediumvioletred'><b> Answer 2 : Yes, *model_5* should be picked over *model_4*: it reaches the same validation error with a much smaller tree (23 versus 77 by the count above), and a simpler tree is less prone to overfitting. </b></font>

# Exploring the effect of min_node_size

We will compare three models trained with different values of the stopping criterion. Again, we intentionally picked models at the extreme ends (**too small**, **just right**, and **too large**).

Train three models with these parameters:
1. **model_7**: min_node_size = 0 (too small)
2. **model_8**: min_node_size = 2000 (just right)
3. **model_9**: min_node_size = 50000 (too large)

For each of these three, we set `max_depth = 6`, and `min_error_reduction = -1`.

**Note:** Each tree can take up to 30 seconds to train.
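As a quick check on the three values, the sketch below applies an assumed `<=` comparison to node sizes that appear in the training logs in this notebook (37224 points at the root, then nodes of 9223 and 101 points). The helper `reached_minimum_node_size` and its comparison are assumptions about the earlier implementation.

```python
def reached_minimum_node_size(num_points, min_node_size):
    """Assumed form of early stopping condition 2."""
    return num_points <= min_node_size


node_sizes = [37224, 9223, 101]  # sizes taken from the training log

for min_node_size in (0, 2000, 50000):
    stopped = [n for n in node_sizes
               if reached_minimum_node_size(n, min_node_size)]
    print(min_node_size, stopped)
# 0 []
# 2000 [101]
# 50000 [37224, 9223, 101]
```

With `min_node_size = 50000` even the root is below the threshold, so under this assumption the tree never splits at all.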

In [30]:
model_7 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 0, min_error_reduction=-1)

model_8 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 2000, min_error_reduction=-1)

model_9 = decision_tree_create(train_data, list(train_data.drop('safe_loans',axis=1)), 'safe_loans', max_depth = 6, 
                                min_node_size = 50000, min_error_reduction=-1)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Split on feature grade_B. (8074, 1048)
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
Split on feature grade_C. (5884, 2190)
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
Split on feature grade_D. (3826, 2058)
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
Split on feature grade_E. (1693, 2133)
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
E

Split on feature home_ownership_MORTGAGE. (4303, 2919)
--------------------------------------------------------------------
Subtree, depth = 6 (4303 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (2919 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 4 (358 data points).
Split on feature emp_length_8 years. (347, 11)
--------------------------------------------------------------------
Subtree, depth = 5 (347 data points).
Split on feature grade_A. (347, 0)
--------------------------------------------------------------------
Subtree, depth = 6 (347 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All data point

Split on feature grade_B. (4701, 0)
--------------------------------------------------------------------
Subtree, depth = 4 (4701 data points).
Split on feature grade_C. (4701, 0)
--------------------------------------------------------------------
Subtree, depth = 5 (4701 data points).
Split on feature grade_E. (4701, 0)
--------------------------------------------------------------------
Subtree, depth = 6 (4701 data points).
Early stopping condition 1 reached. Reached maximum depth.
--------------------------------------------------------------------
Subtree, depth = 6 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 5 (0 data points).
Stopping condition 1 reached. All data points have the same target value.
--------------------------------------------------------------------
Subtree, depth = 4 (0 data points).
Stopping condition 1 reached. All data points 

Calculate the classification error of each model (**model_7**, **model_8**, and **model_9**) on the validation set.

In [31]:
error_model_7_validation = evaluate_classification_error(model_7, validation_data)
error_model_8_validation = evaluate_classification_error(model_8, validation_data)
error_model_9_validation = evaluate_classification_error(model_9, validation_data)

print('Validation data, classification error (model 7):', error_model_7_validation)
print('Validation data, classification error (model 8):', error_model_8_validation)
print('Validation data, classification error (model 9):', error_model_9_validation)

Validation data, classification error (model 7): 0.377746660922016
Validation data, classification error (model 8): 0.377423524342956
Validation data, classification error (model 9): 0.503446790176648


Using the `count_leaves` function, compute the number of leaves in each of the models (**model_7**, **model_8**, and **model_9**).

In [32]:
count_model_7 = count_leaves(model_7)
count_model_8 = count_leaves(model_8)
count_model_9 = count_leaves(model_9)

print('Number of leaves in Model 7 : ', count_model_7)
print('Number of leaves in Model 8 : ', count_model_8)
print('Number of leaves in Model 9 : ', count_model_9)

# The code below is used to answer the following quiz question

error_list = [error_model_7_validation, error_model_8_validation, error_model_9_validation]
model_names = ['Model 7', 'Model 8', 'Model 9']
errors = dict(zip(model_names, error_list))

least_error = min(errors, key=errors.get)
print('\n' + least_error + ' should be chosen since it has the smallest classification error')

Number of leaves in Model 7 :  77
Number of leaves in Model 8 :  39
Number of leaves in Model 9 :  1

Model 8 should be chosen since it has the smallest classification error


<font color='steelblue'><b> Quiz : Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use? </b></font>

<font color='mediumvioletred'><b> Answer : {{least_error}} should be chosen since it has the smallest classification error </b></font>