# Lab 4.1:  Decision Trees

Decision trees can be used for either regression or classification tasks. They are a powerful tool, but they are very prone to overfitting the training dataset and therefore often fail to generalize well to test datasets. Even so, they are the building block for several other powerful machine learning algorithms and are therefore important to learn about.

<img src="../images/DecisionTreeExample.png" alt="Drawing" style="width: 500px; height: 500px;"/>

Here's a look ahead at what we'll be doing in this notebook:

1. [Load data and packages](#loaddata)
2. [Build Decision Tree](#decisiontree)
3. [Feature Importance](#featureimportance)

We will work incrementally from Decision Trees to Random Forests. At each stage we will experiment with tuning the model parameters and evaluate the model's performance.

As a reminder, our previous model, linear regression, was a parametric model that assumes linearity, among other things.

Decision trees and associated algorithms, by contrast, are non-parametric models. This means we are no longer restricted to independent variables that have a linear relationship with the target, and we no longer have to check that several assumptions hold.

Therefore we can start to bring in other features that could be useful.

After we run our decision trees, we will compare our new output to our output from the linear regressions we ran in the previous notebook. 
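The difference between a parametric linear model and a non-parametric tree can be illustrated with a small sketch. The data below is synthetic and made up for illustration (it is not the loan dataset): the target follows a step function of a single feature, which a straight line cannot capture but a shallow tree can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y jumps from ~20 to ~100 when x exceeds 5.
rng = np.random.RandomState(42)
x_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.where(x_demo[:, 0] > 5, 100.0, 20.0) + rng.normal(0, 5, size=500)

# Fit a linear model and a shallow tree to the same data.
linear_model = LinearRegression().fit(x_demo, y_demo)
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(x_demo, y_demo)

# .score returns the R^2 of the predictions on the data passed in.
linear_r2 = linear_model.score(x_demo, y_demo)
tree_r2 = shallow_tree.score(x_demo, y_demo)
print("Linear regression R^2:", linear_r2)
print("Decision tree R^2:", tree_r2)
```

On data like this the tree should score noticeably higher, because it can place a split at the jump instead of fitting a single slope.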

In this notebook, we will look at how we can predict the loan amount using decision trees. It is worth mentioning that these methods can also be used to classify data. Some interesting classification questions we could investigate are:
- Can we classify which loans expired and which ones were funded?
- Is a loan posted by a male or female?

Feel free to investigate these on your own time! Here is a very simple example of how the RandomForestClassifier works: https://bicorner.com/2015/10/26/random-forest-using-ipython/

## 1.1 Import packages
<a id='loaddata'></a>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from ggplot import *
plt.style.use('ggplot')
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import graphviz 
from sklearn import tree



If you don't have graphviz installed or are having problems displaying the tree structure later on, try:
- brew install graphviz (Mac, using Homebrew)
- sudo apt-get install graphviz (Linux)
- download the installer from the Graphviz website (Windows)

## 1.2 Load and format data

In [3]:
# Load data to pandas DataFrame
data_path = '../data/'
df = pd.read_csv(data_path+'df_end_of_linear.csv', low_memory=False)

In [4]:
# TODO: Remove this part and incorporate it into the feature engineering section.
# Potentially the feature engineering section should show how to do one-hot encoding, and this module
# will reference one-hot encoding and remove any columns that are strings, to highlight that sklearn's
# implementation cannot handle strings.
df = df.dropna()
cols = df[['loan_amount',
           'partner_delinquency_rate',
           'posted_year',
           'posted_month',
           'female',
           'num_tags',
           'parent',
           'tag_#Woman Owned Biz',
           'age_int',
           'tag_#Repeat Borrower',
           'children_int',
           'more_one_partner_country',
           'terms.repayment_term',
           'tag_#Schooling',
           'married',
           'pct_female',
           'exploratory_partner',
           'partner_dollar_amount',
           'top_partner_id',
           'num_partner_countries',
           'days_to_fund',
           'hours_to_fund',
           'bc_partner_others',
           'bc_partner_HIHEA',
           'bc_partner_OAF_high',
           'bc_partner_OAF_low',
           'sector_Personal Use',
           'sector_Health',
           'sector_Wholesale',
           'sector_Agriculture',
           'kids']]
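
Column names such as `sector_Health` look like the result of one-hot encoding a string column. As a minimal sketch of how such indicator columns could be produced, the example below uses `pd.get_dummies` on a hypothetical toy frame. The `sector` column and its values are assumptions made for illustration, not the actual preprocessing used for this dataset.

```python
import pandas as pd

# Hypothetical toy frame; the 'sector' values are made up for illustration.
toy = pd.DataFrame({
    "loan_amount": [500, 1200, 300],
    "sector": ["Health", "Agriculture", "Health"],
})

# get_dummies replaces the string column with one indicator column per category,
# so the resulting frame contains only numeric/boolean columns that sklearn can use.
toy_encoded = pd.get_dummies(toy, columns=["sector"])
print(toy_encoded.columns.tolist())
```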


Just like last module, we are going to build regressors to predict the loan amount.

However, instead of using just a few features, we will build a tree that considers many of the features in the dataset - including those we have engineered ourselves.

In [5]:
y = df['loan_amount']
# drop returns a copy of the DataFrame with the specified columns removed.  
X = cols.drop('loan_amount', axis=1)

In [6]:
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2.1 Decision Tree
<a id='decisiontree'></a>
Before we build our first decision tree, let's first learn about the input parameters for sklearn's implementation of a Decision Tree Regressor.  

Feel free to look at the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.get_params), or simply put a question mark before the class name. Prepending a ? to any function, variable, or class displays its docstring (way to go IPython!)

In [7]:
?DecisionTreeRegressor
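
Outside of IPython, where the `?` shortcut is not available, one way to inspect the defaults is the estimator's `get_params()` method. The short sketch below prints the default values of a few of the parameters that control tree size.

```python
from sklearn.tree import DecisionTreeRegressor

# get_params() returns a dict of every constructor parameter and its current value.
default_params = DecisionTreeRegressor().get_params()

# Print the defaults for the parameters that limit how large the tree can grow.
for name in ("max_depth", "min_samples_split", "min_samples_leaf", "max_leaf_nodes"):
    print(name, "=", default_params[name])
```

A value of `None` for `max_depth` or `max_leaf_nodes` means that the corresponding limit is switched off.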

Many of the sklearn algorithms are implemented using the same standard steps: 
- **Step 1: Initialize the algorithm** Define the parameters (and hyperparameters) of the algorithm, for example the maximum depth or the minimum number of samples in a leaf. To learn more about the parameters for each algorithm, either check the module documentation online or run a cell with a ? before the algorithm name, as we did above for the DecisionTreeRegressor. These resources will also tell you the default values used for each parameter. 

- **Step 2: Train the algorithm** Train the algorithm by fitting it to the X_train and y_train datasets.

- **Step 3: Evaluate the algorithm** Evaluate the predictive power of the algorithm by comparing the predicted loan amount values to the true values. We can do this for both the training and testing datasets.

First, let's build a function that encapsulates the three model implementation steps (initialize, train, evaluate) for our decision tree regressor. 

In [8]:
def train_score_regressor(sklearn_regressor, X_train, y_train, X_test, y_test, model_parameters):
    '''
    Purpose: 
        - train a regressor on training data
        - score data on training and test data
        - return trained model
    '''
    # Step 1: Initializing the sklearn regressor 
    regressor = sklearn_regressor(**model_parameters)
    # Step 2: Training the algorithm using the X_train dataset of features and y_train, the associated target features
    regressor.fit(X_train, y_train)
    # Step 3: Calculating the score of the predictive power on the training and testing dataset.
    training_score = regressor.score(X_train, y_train)
    testing_score = regressor.score(X_test, y_test)
    # Print the results!
    print("Train score: " + str(training_score))
    print("Test score: " + str(testing_score))
        
    return regressor
?train_score_regressor
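
The function above relies on the regressor's `.score` method. For sklearn regressors this returns the coefficient of determination (R²) of the predictions. The sketch below checks that on synthetic data (made up for illustration) by comparing `.score` against `r2_score` applied to the predictions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target depends on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Both lines compute R^2 of the tree's predictions on the same data.
score_from_method = demo_tree.score(X_demo, y_demo)
score_from_metric = r2_score(y_demo, demo_tree.predict(X_demo))
print(score_from_method, score_from_metric)
```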

With all tree algorithms, the major challenge is using the parameters to balance the bias-variance tradeoff.  

Before we get into exploring the parameters, let's see how the model performs when using the default values.

In [10]:
trained_regressor = train_score_regressor(sklearn_regressor = DecisionTreeRegressor,
                                          X_train = X_train, y_train = y_train, 
                                          X_test = X_test, y_test = y_test, 
                                          model_parameters = {'random_state':42})

Train score: 1.0
Test score: 0.363635374266


While the Decision Tree achieved a perfect R² score on the training data, it managed an abysmal 0.36 on the test data.  This is a clear indication that the model has overfit the training data.

By default, sklearn's implementation of a DecisionTreeRegressor does not put any restrictions on the depth of the tree, the number of samples per leaf, the number of leaf nodes, etc.  As a result, the Decision Tree will find signal in any and all noise of the training dataset, which causes the model to perform poorly on the test data.  

When a model overfits to a training data set, we say it has **high variance**.  Since an unconstrained decision tree will almost perfectly model any training data, it will vary tremendously depending on the training data that is provided.
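
One way to see this variance is to refit the same model on resampled versions of a dataset and measure how much its predictions move. The sketch below does this on synthetic data (made up for illustration), comparing an unconstrained tree with a depth-limited one. The helper `prediction_spread` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration).
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=400)

# Fixed points at which we compare the predictions of the refitted trees.
X_grid = np.linspace(0, 10, 200).reshape(-1, 1)

def prediction_spread(max_depth, n_repeats=20):
    """Average standard deviation of predictions across trees fit to bootstrap resamples."""
    predictions = []
    for seed in range(n_repeats):
        # Draw a bootstrap resample of the row indices.
        sample = np.random.RandomState(seed).choice(len(X_demo), size=len(X_demo), replace=True)
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        model.fit(X_demo[sample], y_demo[sample])
        predictions.append(model.predict(X_grid))
    # Standard deviation across the refits at each grid point, averaged over the grid.
    return np.std(predictions, axis=0).mean()

deep_spread = prediction_spread(max_depth=None)
shallow_spread = prediction_spread(max_depth=2)
print("Unconstrained tree spread:", deep_spread)
print("Depth-2 tree spread:", shallow_spread)
```

A larger spread for the unconstrained tree means its predictions depend more heavily on which training rows it happened to see.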

## 2.2 Parameter tuning
To reduce the variance, we constrain the model using some of the provided parameters, for example:
- Criterion (Cost function used to measure the purity of a split)
- Maximum depth of the tree
- Minimum samples for each node split
- Minimum samples for each terminal node
- Maximum number of terminal nodes

If you need a refresher to remember what these parameters are, look back over the [slides](https://docs.google.com/presentation/d/1leWPbwis9GJHJcQehlhPhtKEAErUPvlTpKjnkv1aWWU/edit?usp=sharing) or use this [useful blog](https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#four).
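
As a rough sketch, the constraints listed above map onto `DecisionTreeRegressor` constructor arguments as shown below. The data is synthetic and the specific values are arbitrary assumptions chosen for illustration, not recommended settings. The `criterion` argument is left at its default because its accepted names differ between sklearn versions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(0, 0.5, size=300)

constrained_tree = DecisionTreeRegressor(
    max_depth=5,            # maximum depth of the tree
    min_samples_split=20,   # minimum samples needed to split a node
    min_samples_leaf=10,    # minimum samples in each terminal node
    max_leaf_nodes=12,      # maximum number of terminal nodes
    random_state=42,
)
constrained_tree.fit(X_demo, y_demo)

# Inspect the fitted tree to confirm the constraints were respected.
print("Depth:", constrained_tree.get_depth())
print("Leaves:", constrained_tree.get_n_leaves())
```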

Initially, we are going to experiment with the max_depth parameter. 

In [13]:
# Define the model parameters 
# We are fixing the random state so that the results are reproducible.
parameters = {"max_depth": 4, "random_state": 42}
trained_regressor = train_score_regressor(sklearn_regressor = DecisionTreeRegressor,
                                          X_train = X_train, y_train = y_train, 
                                          X_test = X_test, y_test = y_test, 
                                          model_parameters = parameters)

Train score: 0.558032781957
Test score: 0.438339906714
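
To get a feel for how `max_depth` trades training fit against test performance, it can help to sweep several values in a loop. The sketch below does this on synthetic data (made up for illustration, not the loan dataset), so the scores it prints will not match the ones above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy, partly nonlinear target.
rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 10, size=(1000, 5))
y_demo = 3 * X_demo[:, 0] + 5 * np.sin(X_demo[:, 1]) + rng.normal(0, 3, size=1000)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit one tree per depth and record (train R^2, test R^2). None means no depth limit.
depth_scores = {}
for depth in [2, 4, 8, 16, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(Xd_train, yd_train)
    depth_scores[depth] = (model.score(Xd_train, yd_train), model.score(Xd_test, yd_test))
    print("max_depth =", depth, "train/test R^2 =", depth_scores[depth])
```

A widening gap between the train and test scores as the depth grows is the overfitting pattern described earlier.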
