## Exploring decision trees

In [54]:
import numpy as np
import pandas as pd

In [55]:
loans = pd.read_csv("lending-club-data.csv")



### Exploring data

In [56]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,0.2,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,0.8,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1



The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We put this in a new column called `safe_loans`.
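As a minimal sketch, the same relabeling can be tried on a toy column first. The frame `toy` is made up for illustration, and a dictionary lookup via `Series.map` is used here as an alternative to a lambda.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loans table.
toy = pd.DataFrame({"bad_loans": [0, 1, 0, 0, 1]})

# 0 (safe) becomes +1, 1 (risky) becomes -1.
toy["safe_loans"] = toy["bad_loans"].map({0: 1, 1: -1})
print(toy["safe_loans"].tolist())
```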

In [57]:
loans["safe_loans"] = loans["bad_loans"].apply(lambda x: +1 if x ==0 else -1)

In [58]:
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none,safe_loans
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1,-1
2,1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5,...,1.0,1.0,1.0,0,8.25955,20141201T000000,1,1,1,1
3,1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1,...,1.0,1.0,1.0,0,8.27585,20141201T000000,0,1,1,1
4,1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4,...,1.0,1.0,1.0,0,5.21533,20141201T000000,1,1,1,1


Now, let us explore the distribution of the column `safe_loans`. This gives us a sense of how many safe and risky loans are present in the dataset.

In [59]:
print ("Fraction of safe loans in the input dataset is:")
print (round(sum(loans["safe_loans"] == +1)/len(loans),2))

Fraction of safe loans in the input dataset is:
0.81


In [60]:
print ("Fraction of risky loans in the input dataset is:")
print (round(sum(loans["safe_loans"] == -1)/len(loans),2))

Fraction of risky loans in the input dataset is:
0.19
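Both fractions can also be read off in a single call. The sketch below uses a made-up label column (`labels`) rather than the real data, so the numbers differ from the ones above.

```python
import pandas as pd

# Hypothetical labels: four safe loans (+1) and one risky loan (-1).
labels = pd.Series([1, 1, 1, 1, -1], name="safe_loans")

# normalize=True returns fractions instead of raw counts.
fractions = labels.value_counts(normalize=True)
print(fractions.round(2))
```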


### Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [62]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

In [66]:
loans = loans[features + [target]]

In [67]:
categorical_var = [m for m in loans.columns if loans[m].dtypes == object]
categorical_var

['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']

In [68]:
loans =pd.get_dummies(loans)
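`pd.get_dummies` one-hot encodes every string (object) column and leaves numeric columns untouched. A small sketch on a made-up frame (`toy` is hypothetical) shows the kind of column names this produces, such as `term_ 36 months`.

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "term": [" 36 months", " 60 months", " 36 months"],
    "dti": [11.2, 16.9, 14.0],
})

# Each category value becomes its own 0/1 (or boolean) indicator column.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

The indicator dtype depends on the pandas version (integer in older releases, boolean in newer ones), but comparisons such as `== 0` and `== 1` behave the same way for both.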

### Sample data to balance classes

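As we saw above, our data is disproportionately full of safe loans. Conceptually, balancing means separating the safe loans (`safe_loans_raw`) from the risky loans (`risky_loans_raw`) and undersampling the larger group. In this notebook, the precomputed train and validation indices loaded below already select a roughly balanced subset, as the class fractions computed afterwards confirm.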

In [69]:
import json
with open ("module-5-assignment-1-train-idx.json", "r") as f:
    train_idx = json.load(f)
with open ("module-5-assignment-1-validation-idx.json", "r") as f:
    valid_idx = json.load(f)

In [70]:
train_data = loans.iloc[train_idx]

In [71]:
valid_data = loans.iloc[valid_idx]

In [72]:
train_data.shape

(37224, 68)

In [73]:
valid_data.shape

(9284, 68)

In [74]:
safe_loans_prob = round(float(sum(valid_data['safe_loans'] == 1))/len(valid_data),2)
safe_loans_prob

0.5

In [75]:
sum(train_data["safe_loans"] == 1 ) / len(train_data)

0.5036535568450462
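For reference, undersampling the majority class can be sketched on a toy frame. The names `toy`, `safe_raw`, `risky_raw` and `safe_sampled` are made up for illustration, and the sketch assumes safe loans are the majority; it is not the procedure that produced the index files used above.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 safe loans (+1) and 2 risky loans (-1).
toy = pd.DataFrame({"safe_loans": [1] * 8 + [-1] * 2, "dti": range(10)})

safe_raw = toy[toy["safe_loans"] == 1]
risky_raw = toy[toy["safe_loans"] == -1]

# Keep only as many safe loans as there are risky loans.
safe_sampled = safe_raw.sample(n=len(risky_raw), random_state=1)
balanced = pd.concat([risky_raw, safe_sampled])
print(balanced["safe_loans"].value_counts().to_dict())
```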

### Build a decision tree classifier

In [76]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [77]:
decision_tree_model = tree.DecisionTreeClassifier(max_depth=6)

In [78]:
small_model = tree.DecisionTreeClassifier(max_depth=2)

In [81]:
train_input = train_data.drop("safe_loans", axis=1)

In [83]:
decision_tree_model.fit(train_input, train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [84]:
small_model.fit(train_input, train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

### Making predictions

In [85]:
validation_safe_loans = valid_data[valid_data[target] == 1]
validation_risky_loans = valid_data[valid_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
19,0,11,11.18,1,1,82.4,0.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0.0,-1,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0.0,-1,1,0,...,0,0,0,0,0,0,0,0,1,0


In [87]:
sample_validation_data["safe_loans"]

19    1
79    1
24   -1
41   -1
Name: safe_loans, dtype: int64

In [86]:
decision_tree_model.predict(sample_validation_data.drop('safe_loans', axis=1))

array([ 1, -1, -1,  1], dtype=int64)

### Explore probability predictions

In [88]:
decision_tree_model.predict_proba(sample_validation_data.drop('safe_loans', axis=1))

array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])
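The column order of `predict_proba` follows the fitted model's `classes_` attribute, which scikit-learn keeps sorted. With labels -1 and +1, the second column is therefore the probability of +1 (safe). A minimal sketch on made-up, perfectly separable data (the arrays `X` and `y` are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the single feature determines the label exactly.
X = np.array([[0], [0], [1], [1]])
y = np.array([-1, -1, 1, 1])

model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(model.classes_)  # the order of the predict_proba columns

proba = model.predict_proba(X)
# Look up the column for class +1 instead of hard-coding its position.
p_safe = proba[:, list(model.classes_).index(1)]
print(p_safe)
```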

### Tricky predictions!

Now, we will explore something pretty interesting. For each row in `sample_validation_data`, what is the probability (according to `small_model`) of a loan being classified as safe?

In [89]:
small_model.predict_proba(sample_validation_data.drop("safe_loans", axis=1))

array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

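The probabilities for the 2nd and 3rd records are identical. The likely reason is that the tree has depth 2, so it has at most four leaves, and these two records follow exactly the same decision path into the same leaf. Every record in a leaf receives that leaf's class fractions as its predicted probabilities.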
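This can be checked with the `apply` method of a fitted tree, which returns the index of the leaf each row lands in. The sketch below uses synthetic data (the arrays `X` and `y` and the model `small` are made up) rather than the loans table.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary features and noisy labels that depend on the first feature.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 4))
y = np.where(rng.rand(200) < 0.3 + 0.4 * X[:, 0], 1, -1)

small = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaves = small.apply(X)          # leaf index for every row
proba = small.predict_proba(X)   # class probabilities for every row

# Rows that share a leaf should share the same probability vector.
same_within_leaf = all(
    np.allclose(proba[leaves == leaf], proba[leaves == leaf][0])
    for leaf in np.unique(leaves)
)
print(len(np.unique(leaves)), "distinct leaves; identical within each leaf:", same_within_leaf)
```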

### Evaluating accuracy of the decision tree model

In [91]:
decision_tree_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None)

0.64052761659144641

In [92]:
small_model.score(train_data.drop('safe_loans', axis=1), train_data[target], sample_weight=None)

0.61350204169353106

In [93]:
decision_tree_model.score(valid_data.drop('safe_loans', axis=1), valid_data[target], sample_weight=None)

0.63614821197759586

In [94]:
small_model.score(valid_data.drop('safe_loans', axis=1), valid_data[target], sample_weight=None)

0.61934510986643687

### Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

In [95]:
big_model = tree.DecisionTreeClassifier(max_depth=10)

In [96]:
big_model.fit(train_input, train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [98]:
big_model.score(train_input, train_data[target])

0.6637384483129164

In [99]:
big_model.score(valid_data.drop("safe_loans", axis=1), valid_data[target])

0.62634640241275308
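The same comparison across depths can be run in a loop. The sketch below uses synthetic data from `make_classification` (an assumption made so that the block is self-contained), so its numbers will not match the loan results above; it only illustrates how to put training and validation accuracy side by side.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data standing in for the loans table.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for depth in (2, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for this depth.
    results[depth] = (model.score(X_train, y_train), model.score(X_valid, y_valid))
    print(depth, results[depth])
```

A growing gap between the two numbers as depth increases is the usual sign that the extra complexity is fitting noise.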

### Quantifying the cost of mistakes

In [100]:
valid_prediction = big_model.predict(valid_data.drop("safe_loans", axis=1))

In [101]:
valid_prediction

array([-1,  1, -1, ..., -1, -1,  1], dtype=int64)

In [103]:
false_pos = sum(valid_prediction>valid_data[target])
print("Number of False positives is:", false_pos)

Number of False positives is: 1650


In [104]:
false_neg = sum(valid_prediction<valid_data[target])
print ("Number of False negatives is:", false_neg)

Number of False negatives is: 1819


Let's assume that each mistake costs us money: a false negative costs $10,000, while a false positive costs $20,000. What is the total cost of the mistakes made on `valid_data` by `big_model`, the model whose predictions were counted above?

In [105]:
total_cost = false_neg*10000 + false_pos * 20000
print ("Total cost is:", total_cost)

Total cost is: 51190000
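The counting logic can be made explicit with boolean masks. This is a small sketch on made-up labels (`y_true` and `y_pred` are hypothetical); the costs are the ones assumed above.

```python
import numpy as np

# Hypothetical true labels and predictions (+1 safe, -1 risky).
y_true = np.array([1, -1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, 1, 1])

# False positive: predicted safe, actually risky.
false_pos = int(np.sum((y_pred == 1) & (y_true == -1)))
# False negative: predicted risky, actually safe.
false_neg = int(np.sum((y_pred == -1) & (y_true == 1)))

total_cost = 10_000 * false_neg + 20_000 * false_pos
print(false_pos, false_neg, total_cost)  # 2 1 50000
```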


## Implementing binary decision trees

In [166]:
loans = pd.read_csv("lending-club-data.csv")



In [167]:
loans["safe_loans"] = loans["bad_loans"].apply(lambda x: +1 if x ==0 else -1)
# axis=1 drops a column (axis=0 would drop rows).
loans = loans.drop("bad_loans", axis=1)

In [168]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'

In [169]:
loans = loans[features+[target]]

In [170]:
loans=pd.get_dummies(loans)

In [171]:
import json
with open ("module-5-assignment-2-train-idx.json", "r") as f:
    train_idx = json.load(f)
with open ("module-5-assignment-2-test-idx.json", "r") as f:
    test_idx = json.load(f)

In [172]:
train_data = loans.iloc[train_idx]

In [173]:
test_data = loans.iloc[test_idx]

### Decision tree implementation

In this section, we will implement binary decision trees from scratch. There are several steps involved in building a decision tree. For that reason, we have split the entire assignment into several sections.

The first building block counts the mistakes made when a node predicts its majority class. Steps to follow:

* Step 1: Calculate the number of safe loans and risky loans.

* Step 2: Since we are assuming majority class prediction, all the data points that are not in the majority class are considered mistakes.

* Step 3: Return the number of mistakes.

In [127]:
safe_loans = sum(loans["safe_loans"] == 1)

In [130]:
risky_loans = sum(loans["safe_loans"] == -1)

In [142]:
print ("Mistake is:",min(safe_loans,risky_loans))

Mistake is: 23150


Write the function `intermediate_node_num_mistakes`, which computes the number of misclassified examples of an intermediate node given the set of labels (y values) of the data points contained in the node. Your code should be analogous to the following:

In [143]:
def intermediate_node_num_mistakes(labels_in_node):
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0    
    # Count the number of 1's (safe loans)
    safe_loans = sum(labels_in_node==1)
    # Count the number of -1's (risky loans)
    risky_loans = sum(labels_in_node ==-1)          
    # Return the number of mistakes that the majority classifier makes.
    return min(risky_loans, safe_loans)  


In [147]:
# Test case 1
example_labels = np.array([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 1 failed... try again!')

# Test case 2
example_labels = np.array([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 2 failed... try again!')
    
# Test case 3
example_labels = np.array([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print ('Test passed!')
else:
    print ('Test 3 failed... try again!')

Test passed!
Test passed!
Test passed!


### Function to pick best feature to split on

The function `best_splitting_feature` takes 3 arguments:

* The data
* The features to consider for splits (a list of strings of column names to consider for splits)
* The name of the target/label column (string)

The function will loop through the list of possible features, and consider splitting on each of them. It will calculate the classification error of each split and return the feature that had the smallest classification error when split on.

Recall that the classification error is defined as follows:

$$\text{classification error} = \frac{\text{number of mistakes}}{\text{number of total examples}}$$


Follow these steps to implement best_splitting_feature:
<br />Step 1: Loop over each feature in the feature list 
<br />Step 2: Within the loop, split the data into two groups: one group where all of the data has feature value 0 or False (we will call this the left split), and one group where all of the data has feature value 1 or True (we will call this the right split). Make sure the left split corresponds with 0 and the right split corresponds with 1 to ensure your implementation fits with our implementation of the tree building process. 
<br />Step 3: Calculate the number of misclassified examples in both groups of data and use the above formula to compute the classification error. 
<br />Step 4: If the computed error is smaller than the best error found so far, store this feature and its error.

Note: Remember that since we are only dealing with binary features, we do not have to consider thresholds for real-valued features. This makes the implementation of this function much easier.

In [149]:
def best_splitting_feature(data, features, target):
    
    target_values = data[target]
    best_feature = None # Keep track of the best feature 
    best_error = 10     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should initialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature] == 0]
        
        # The right split will have all data points where the feature value is 1
        ## YOUR CODE HERE
        right_split =  data[data[feature] == 1]
                    
        # Calculate the number of misclassified examples in the left split.
        # Remember that we implemented a function for this! (It was called intermediate_node_num_mistakes)
        # YOUR CODE HERE
        left_mistakes =  intermediate_node_num_mistakes(left_split[target])

        # Calculate the number of misclassified examples in the right split.
        ## YOUR CODE HERE
        right_mistakes = intermediate_node_num_mistakes(right_split[target])
            
        # Compute the classification error of this split.
        # Error = (# of mistakes (left) + # of mistakes (right)) / (# of data points)
        ## YOUR CODE HERE
        error = (left_mistakes + right_mistakes)/num_data_points
        
        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        ## YOUR CODE HERE
        if error < best_error:
            best_feature = feature
            best_error = error
        
    
    return best_feature # Return the best feature we found
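As a sanity check, the same selection rule can be exercised on a tiny hand-made table. The sketch below re-implements the logic compactly so that it runs on its own; the names `num_mistakes`, `best_split_sketch`, `f_good` and `f_noise` are made up for illustration, and unlike the function above it also returns the error.

```python
import pandas as pd

def num_mistakes(labels):
    """Mistakes made by predicting the majority class for these labels."""
    if len(labels) == 0:
        return 0
    return min((labels == 1).sum(), (labels == -1).sum())

def best_split_sketch(data, features, target):
    """Return the binary feature with the lowest classification error, and that error."""
    best_feature, best_error = None, float("inf")
    for feature in features:
        left = data[data[feature] == 0]
        right = data[data[feature] == 1]
        error = (num_mistakes(left[target]) + num_mistakes(right[target])) / len(data)
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature, best_error

# Hypothetical data: f_good separates the labels almost perfectly, f_noise does not.
toy = pd.DataFrame({
    "f_good":  [0, 0, 0, 1, 1, 1],
    "f_noise": [0, 1, 0, 1, 0, 1],
    "y":       [-1, -1, -1, 1, 1, -1],
})
feature, error = best_split_sketch(toy, ["f_noise", "f_good"], "y")
print(feature, error)
```

Splitting on `f_good` leaves one mistake out of six examples, while `f_noise` leaves two, so `f_good` is selected.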

### Building the tree

With the above functions implemented correctly, we are now ready to build our decision tree. Each node in the decision tree is represented as a dictionary which contains the following keys and possible values:

{ 
<br />   'is_leaf'            : True/False.
<br />   'prediction'         : Prediction at the leaf node.
<br />   'left'               : (dictionary corresponding to the left tree).
<br />   'right'              : (dictionary corresponding to the right tree).
<br />   'splitting_feature'  : The feature that this node splits on
<br />}
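To make the representation concrete, here is a hand-written tree with one split and two leaves. The predictions in it are made up for illustration and are not learned from the loan data.

```python
# Hypothetical decision stump using the node layout described above.
toy_stump = {
    'is_leaf': False,
    'prediction': None,
    'splitting_feature': 'term_ 36 months',
    'left':  {'is_leaf': True, 'prediction': -1,
              'splitting_feature': None, 'left': None, 'right': None},
    'right': {'is_leaf': True, 'prediction': +1,
              'splitting_feature': None, 'left': None, 'right': None},
}

# A row is a mapping from feature name to 0/1; value 0 goes left, value 1 goes right.
row = {'term_ 36 months': 1}
branch = 'left' if row[toy_stump['splitting_feature']] == 0 else 'right'
prediction = toy_stump[branch]['prediction']
print(branch, prediction)
```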


Write a function that creates a leaf node given a set of target values. Your code should be analogous to the following:

In [152]:
def create_leaf(target_values):    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf':  True   }   ## YOUR CODE HERE 
   
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])    

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] =  1        ## YOUR CODE HERE
    else:
        leaf['prediction'] =  -1        ## YOUR CODE HERE        

    # Return the leaf 
    return leaf

We have provided a function that learns the decision tree recursively and implements 3 stopping conditions:
<br /> Stopping condition 1: All data points in a node are from the same class.
<br /> Stopping condition 2: No more features to split on.
<br /> Additional stopping condition: In addition to the above two stopping conditions covered in lecture, in this assignment we will also consider a stopping condition based on the max_depth of the tree. By not letting the tree grow too deep, we will save computational effort in the learning process.


In [158]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    
    target_values = data[target]
    print ("--------------------------------------------------------------------")
    print ("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    

    # Stopping condition 1
    # (Check if there are mistakes at current node.
    # Recall you wrote a function intermediate_node_num_mistakes to compute this.)
    if  intermediate_node_num_mistakes(target_values)== 0:  ## YOUR CODE HERE
        print ("Stopping condition 1 reached.")     
        # If no mistakes at current node, make current node a leaf node
        return create_leaf(target_values)
    
    # Stopping condition 2 (check if there are remaining features to consider splitting on)
    if remaining_features == []:   ## YOUR CODE HERE
        print ("Stopping condition 2 reached.")    
        # If there are no remaining features to consider, make current node a leaf node
        return create_leaf(target_values)    
    
    # Additional stopping condition (limit tree depth)
    if current_depth >= max_depth:  ## YOUR CODE HERE
        print ("Reached maximum depth. Stopping for now.")
        # If the max tree depth has been reached, make current node a leaf node
        return create_leaf(target_values)

    # Find the best splitting feature (recall the function best_splitting_feature implemented above)
    ## YOUR CODE HERE
    splitting_feature = best_splitting_feature(data, features, target)
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split =  data[data[splitting_feature] == 1]     ## YOUR CODE HERE
    remaining_features.remove(splitting_feature)
    print ("Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split)))
    
    # Create a leaf node if the split is "perfect"
    if len(left_split) == len(data):
        print ("Creating leaf node.")
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print ("Creating leaf node.")
        ## YOUR CODE HERE
        return create_leaf(right_split[target])
        
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth)        
    ## YOUR CODE HERE
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth +1, max_depth)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}
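Because the tree is a nested dictionary, simple recursive helpers are enough to inspect what was learned. The functions `count_leaves`, `tree_depth` and `leaf` below are hypothetical helpers, shown on a hand-built tree rather than on the output of `decision_tree_create`.

```python
def count_leaves(tree):
    """Number of leaf nodes in a tree stored as nested dictionaries."""
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

def tree_depth(tree):
    """Number of splits on the longest path from the root to a leaf."""
    if tree['is_leaf']:
        return 0
    return 1 + max(tree_depth(tree['left']), tree_depth(tree['right']))

def leaf(prediction):
    """Build a leaf node in the same layout as the nodes described above."""
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

# Hypothetical tree: the root splits on the term, its left child splits on the grade.
toy_tree = {
    'is_leaf': False, 'prediction': None, 'splitting_feature': 'term_ 36 months',
    'left': {'is_leaf': False, 'prediction': None, 'splitting_feature': 'grade_A',
             'left': leaf(-1), 'right': leaf(1)},
    'right': leaf(1),
}
print(count_leaves(toy_tree), tree_depth(toy_tree))  # 3 2
```

With `max_depth = 6`, a learned tree of this kind can have depth at most 6 and therefore at most 64 leaves.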

### Build the tree!

In [159]:
x = train_data.drop('safe_loans', axis=1)

In [160]:
features_new = [col for col in x.columns]

In [161]:
my_decision = decision_tree_create(train_data, features_new, target, current_depth = 0, max_depth = 6)

--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
Split on feature grade_B. (8074, 1048)
--------------------------------------------------------------------
Subtree, depth = 3 (8074 data points).
Split on feature grade_C. (5884, 2190)
--------------------------------------------------------------------
Subtree, depth = 4 (5884 data points).
Split on feature grade_D. (3826, 2058)
--------------------------------------------------------------------
Subtree, depth = 5 (3826 data points).
Split on feature grade_E. (1693, 2133)
--------------------------------------------------------------------
Subtree, depth = 6 (1693 data points).
Reached maximum depth. Stopping for now.
[output truncated]

### Making predictions with a decision tree

In [163]:
def classify(tree, x, annotate = False):
    # If the node is a leaf, return its prediction.
    if tree['is_leaf']:
        if annotate:
            print ("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        # Otherwise follow the branch that matches the value of the splitting feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
            print ("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

In [182]:
for a in test_data[0:1].to_dict(orient = 'records'):
    print(a)
    print ('Predicted class: %s ' % classify(my_decision, a))

{'safe_loans': -1, 'grade_A': 0, 'grade_B': 0, 'grade_C': 0, 'grade_D': 1, 'grade_E': 0, 'grade_F': 0, 'grade_G': 0, 'term_ 36 months': 0, 'term_ 60 months': 1, 'home_ownership_MORTGAGE': 0, 'home_ownership_OTHER': 0, 'home_ownership_OWN': 0, 'home_ownership_RENT': 1, 'emp_length_1 year': 0, 'emp_length_10+ years': 0, 'emp_length_2 years': 1, 'emp_length_3 years': 0, 'emp_length_4 years': 0, 'emp_length_5 years': 0, 'emp_length_6 years': 0, 'emp_length_7 years': 0, 'emp_length_8 years': 0, 'emp_length_9 years': 0, 'emp_length_< 1 year': 0, 'emp_length_n/a': 0}
Predicted class: -1 


In [184]:
classify(my_decision, a, annotate=True)

Split on term_ 36 months = 0
Split on grade_A = 0
Split on grade_B = 0
Split on grade_C = 0
Split on grade_D = 1
At leaf, predicting -1


-1

### Evaluating your decision tree

In [185]:
def evaluate_classification_error(tree, data):
    # Apply classify(tree, x) to each row in the data.
    # The predictions are kept in a separate Series instead of being written into `data`,
    # which may be a slice of another DataFrame (that triggers SettingWithCopyWarning).
    predictions = pd.Series([classify(tree, a) for a in data.to_dict(orient = 'records')],
                            index = data.index)

    # Once the predictions are made, calculate the classification error and return it
    classification_error = round(float(sum(predictions != data['safe_loans']))/len(data),2)
    return classification_error

In [186]:
evaluate_classification_error(my_decision,test_data)

0.38

### Printing out a decision stump

In [196]:
def print_stump(tree, name = 'root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term_ 36 months'
    if split_name is None:
        print ("(leaf, label: %s)" % tree['prediction'])
        return None
    # Split on the last underscore only: names such as 'home_ownership_RENT' contain several.
    split_feature, split_value = split_name.rsplit('_', 1)
    print ('                       ',name)
    print ('         |---------------|----------------|')
    print ('         |                                |')
    print ('         |                                |')
    print ('         |                                |')
    print ('  [{0} == 0]               [{0} == 1]    '.format(split_name))
    print ('         |                                |')
    print ('         |                                |')
    print ('         |                                |')
    print ('    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree')))

In [197]:
print_stump(my_decision,name = 'root')

                        root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [term_ 36 months == 0]               [term_ 36 months == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)
