## Library
We use the Turi Create library to implement the Boosted Trees model.

In [91]:
import turicreate

## Data
We will be using the same given [LendingClub](https://www.lendingclub.com/) dataset.

In [92]:
loans = turicreate.SFrame('../data/lending-club-data.sframe/')

## Target Column Definition

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

We reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We store this in a new column called `safe_loans` and use it as the `target` column.

In [93]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')
target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)
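
As a minimal sketch of this relabeling on plain Python lists (the sample values and the helper name `to_safe_label` are hypothetical and are not part of the dataset or of Turi Create):

```python
# Hypothetical sample of the original `bad_loans` column (1 = risky, 0 = safe).
bad_loans = [0, 1, 0, 0, 1]

def to_safe_label(bad_loan_flag):
    """Map the original 0/1 flag to the +1/-1 convention used in this notebook."""
    return +1 if bad_loan_flag == 0 else -1

safe_loans = [to_safe_label(flag) for flag in bad_loans]
print(safe_loans)  # [1, -1, 1, 1, -1]
```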

## Features Selection
As in the previous assignment, we use a subset of the features (categorical and numeric). The features are **described in the code comments** below:

In [94]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

# Extract the feature columns and target column
loans = loans[features + [target]]

## Class Balancing
One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used `seed=1` so everyone gets the same results.

We do this so that the algorithm learns from both classes equally, which helps it make more reliable predictions.

In [95]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed = 1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
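
The undersampling step can be sketched without Turi Create as follows. This is only an illustration: the toy class sizes and the `undersample` helper are made up, and keeping each row independently with a fixed probability is an assumption about how fraction-based sampling behaves, not a description of the `SFrame.sample` internals.

```python
import random

def undersample(majority_rows, minority_rows, seed=1):
    """Keep each majority row with probability len(minority) / len(majority)."""
    rng = random.Random(seed)
    keep_probability = len(minority_rows) / float(len(majority_rows))
    kept = [row for row in majority_rows if rng.random() < keep_probability]
    # Append the downsampled majority class to the untouched minority class.
    return minority_rows + kept

# Hypothetical toy data: 1000 safe loans (+1) and 200 risky loans (-1).
safe = [+1] * 1000
risky = [-1] * 200
balanced = undersample(safe, risky)

n_safe = balanced.count(+1)
n_risky = balanced.count(-1)
print(n_safe, n_risky)  # the two counts should be roughly equal
```

Because the sampling is random, the two classes end up approximately, not exactly, the same size.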

## Data Splitting
Roughly 80% of the balanced data (`loans_data`) is randomly assigned to the training set (`train_data`) and the remaining 20% to the test set (`test_data`). We use `seed=1` so everyone gets the same results.

In [96]:
train_data, test_data = loans_data.random_split(0.8, seed = 1)
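
One simple way to obtain such an approximate split is to assign each row to the first part with a fixed probability. The sketch below uses plain Python; the `random_split` helper here is hypothetical and is not claimed to reproduce the exact rows that `SFrame.random_split` selects.

```python
import random

def random_split(rows, fraction, seed=1):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # hypothetical row identifiers
train_rows, test_rows = random_split(rows, 0.8)
print(len(train_rows), len(test_rows))  # close to 800 and 200, not exact
```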

## Boosted Trees Model Building
We use Turi Create's `boosted_trees_classifier.create` function to build the model. The parameters are:

* `train_data`: the input data for the algorithm to train on.

* `validation_set`: set to None because we don't have a validation set.

* `target`: is the target column which is `safe_loans`.

* `features`: are the features the algorithm will use to learn.

* `max_iterations`: the number of trees grown for the model **(this is covered in the algorithm explanation below)**

* `min_child_weight`: the minimum weight of a child node. During training, if the weight of a child node would be smaller than `min_child_weight`, the node is not split further; larger values make tree learning more conservative

* `column_subsample`: the fraction of feature columns randomly sampled when building each tree, which can help reduce overfitting

* `random_seed`: the seed for the randomization used when selecting data points and feature subsets for each tree

* `max_depth`: the maximum depth allowed for all trees

In [97]:
boosted_trees_model = turicreate.boosted_trees_classifier.create(train_data,
                                                            validation_set = None,
                                                            target = target,
                                                            features = features,
                                                            max_iterations = 100,
                                                            min_child_weight = 1,
                                                            column_subsample =  0.85, 
                                                            random_seed = 1,
                                                            max_depth = 10)

Here is the summary of the model after building:

In [98]:
boosted_trees_model.summary()

Class                          : BoostedTreesClassifier

Schema
------
Number of examples             : 37224
Number of feature columns      : 12
Number of unpacked features    : 12
Number of classes              : 2

Settings
--------
Number of trees                : 100
Max tree depth                 : 10
Training time (sec)            : 2.3097
Training accuracy              : 0.8405
Training log_loss              : 0.4208
Training auc                   : 0.9275



## Accuracy Comparison With The Decision Tree Model
Using the same training set (`train_data`), we also build a Decision Tree model with the same `target` and `features`:

In [99]:
decision_tree_model = turicreate.decision_tree_classifier.create(train_data,
                                                            validation_set = None,
                                                            target = target,
                                                            features = features,
                                                            max_depth = 10,
                                                            )

Now, let's quickly compare the accuracy of the two models on the test set:

In [100]:
print("Decision Tree model's accuracy:", decision_tree_model.evaluate(test_data)['accuracy'])
print("Boosted Trees model's accuracy:", boosted_trees_model.evaluate(test_data)['accuracy'])

Decision Tree model's accuracy: 0.6274235243429557
Boosted Trees model's accuracy: 0.6140672124084446


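**Most of the time**, the accuracy of the **Boosted Trees model** is ***higher*** than that of the **Decision Tree model**, but this is not guaranteed: in the run shown above, the Decision Tree model is slightly ahead (0.627 versus 0.614). The Boosted Trees model's much higher training accuracy (0.8405) suggests that it may be overfitting with these settings.

If you get this opposite result, you can try to **rebuild** the **Boosted Trees model** by executing its code cell again and then repeating the accuracy test (or just hit **Run All**).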
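
The accuracy compared here is the fraction of test examples whose predicted label equals the true label. A minimal sketch of that computation, with hypothetical labels and a hypothetical helper name:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))

# Hypothetical labels, using the +1 (safe) / -1 (risky) convention.
y_true = [+1, -1, +1, +1, -1]
y_pred = [+1, -1, -1, +1, +1]
print(accuracy(y_true, y_pred))  # 0.6
```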

## **Boosted Tree Model Algorithm Explanation**

### **Main Idea**

Like other boosting methods, gradient boosting combines weak "learners" into a single strong learner in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to "teach" a model $F$ to predict values of the form $\hat{y} = F(x)$ by minimizing the mean squared error $\frac{1}{n} \sum \limits _{i} (\hat{y_{i}} - y_{i})^{2}$, where $i$ indexes over some training set of size $n$ of actual values of the output variable $y$:
- $\hat{y_{i}}$:  the predicted value $F(x_{i})$
- $y_{i}$: the observed value
- $n$: the number of samples in $y$

Now, let us consider a gradient boosting algorithm with $M$ stages. At each stage $m$ ($1 \le m \le M$) of gradient boosting, suppose there is some imperfect model $F_{m}$ (for low $m$, this model may simply return $\hat{y_{i}} = \bar{y}$, where the right-hand side is the mean of $y$).

To improve $F_{m}$, the algorithm adds a new estimator $h_{m}(x)$ such that $F_{m+1}(x) = F_{m}(x) + h_{m}(x) = y$ or, equivalently, $h_{m}(x) = y - F_{m}(x)$.

Therefore, gradient boosting will fit $h$ to the residual $y - F_{m}(x)$. As in other boosting variants, each $F_{m+1}$ attempts to correct the errors of its predecessor $F_{m}$. 

A generalization of this idea to loss functions other than squared error, and to classification and ranking problems, follows from the observation that the residuals $h_{m}(x_{i})$ for a given model are proportional to the negative gradients of the mean squared error (MSE) loss function with respect to $F(x_{i})$:

$L_{MSE} = \frac{1}{n} \sum \limits _{i} (y_{i} - F(x_{i}))^{2}$&emsp;&emsp;&emsp;&emsp;$-\frac{\partial L_{MSE}}{\partial F(x_{i})} = \frac{2}{n}(y_{i} - F(x_{i})) = \frac{2}{n}h_{m}(x_{i})$

So, gradient boosting can be viewed as a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
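
The relationship between residuals and negative gradients can be checked numerically. The sketch below compares the analytic expression $\frac{2}{n}(y_{i} - F(x_{i}))$ with a finite-difference estimate of the negative gradient of the MSE. The data values and function names are hypothetical, chosen only for illustration.

```python
def mse(y, predictions):
    """Mean squared error between observed values and predictions."""
    n = len(y)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, predictions)) / n

def negative_gradient(y, predictions):
    """Analytic negative gradient of the MSE: (2/n) * residual."""
    n = len(y)
    return [2.0 / n * (yi - fi) for yi, fi in zip(y, predictions)]

def numeric_negative_gradient(y, predictions, eps=1e-6):
    """Central finite-difference estimate of the same quantity."""
    grads = []
    for i in range(len(y)):
        up = list(predictions)
        down = list(predictions)
        up[i] += eps
        down[i] -= eps
        grads.append(-(mse(y, up) - mse(y, down)) / (2 * eps))
    return grads

y = [3.0, -1.0, 2.0, 0.5]   # hypothetical observed values
F = [2.0, 0.0, 2.5, 0.5]    # hypothetical current predictions F(x_i)

analytic = negative_gradient(y, F)
numeric = numeric_negative_gradient(y, F)
print(analytic)  # [0.5, -0.5, -0.25, 0.0]
print(numeric)   # agrees with the analytic values up to rounding error
```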


### **Gradient Tree Boosting Algorithm**

1.&emsp;Initialize model with a constant value:&emsp;&emsp;$f_{0}(x) = \underset{\gamma}{\textrm{arg min}} \sum \limits _{i=1} ^{N} L(y_{i}, \gamma)$<br>
2.&emsp;For $m = 1$ to $M$:<br><br>
&emsp;&emsp;(a)&emsp;For $i = 1, 2, \ldots, N$ compute: &emsp;&emsp;$r_{im} = - \displaystyle \Bigg[\frac{\partial L(y_{i}, f(x_{i}))}{\partial f(x_{i})}\Bigg]_{f=f_{m-1}}$<br><br>
&emsp;&emsp;(b)&emsp;Fit a regression tree to the targets $r_{im}$ giving terminal regions:&emsp;&emsp;$R_{jm}, j = 1, 2, \ldots, J_{m}.$<br><br>
&emsp;&emsp;(c)&emsp;For $j = 1, 2, \ldots, J_{m}$ compute: &emsp;&emsp;&emsp;&emsp;$\gamma_{jm} = \underset{\gamma}{\textrm{arg min}} \sum \limits _{x_{i} \in R_{jm}} L(y_{i}, f_{m-1}(x_{i}) + \gamma)$<br><br>
&emsp;&emsp;(d)&emsp;Update:&emsp;&emsp;$f_{m}(x) = f_{m-1}(x) + \sum _{j=1} ^{J_{m}} \gamma_{jm} I(x \in R_{jm})$<br><br>
3.&emsp;Output $\hat{f}(x) = f_{M}(x)$
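
The steps above can be sketched in plain Python under some simplifying assumptions: the loss is the squared error $L(y, f) = \frac{1}{2}(y - f)^{2}$, so the pseudo-residuals $r_{im}$ are the ordinary residuals and the optimal $\gamma_{jm}$ is the mean residual in each region; every tree is a depth-1 "stump" on a single numeric feature. All function names and the toy data are hypothetical, and this is not how Turi Create implements boosted trees.

```python
def mean(values):
    return sum(values) / float(len(values))

def fit_stump(x, residuals):
    """Fit a depth-1 regression tree to the residuals.

    Returns (threshold, left_value, right_value). Assumes x holds at least
    two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:      # candidate split points
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def gradient_boost(x, y, n_stages):
    f0 = mean(y)                               # step 1: best constant for squared error
    predictions = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):                  # step 2
        # (a) pseudo-residuals; for squared error these are y_i - f(x_i)
        residuals = [yi - fi for yi, fi in zip(y, predictions)]
        # (b) + (c) regions from the split, gamma = mean residual per region
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        # (d) update the model on the training points
        predictions = [fi + (left_value if xi <= threshold else right_value)
                       for xi, fi in zip(x, predictions)]
    return f0, stumps                          # step 3

def predict(model, xi):
    f0, stumps = model
    return f0 + sum(left if xi <= t else right for t, left, right in stumps)

def training_mse(model, x, y):
    return mean([(yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)])

# Hypothetical toy data: one numeric feature and a noisy step-shaped target.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2]

errors = {m: training_mse(gradient_boost(x, y, m), x, y) for m in (0, 1, 5)}
print(errors)  # training error shrinks as more stages are added
```

A shrinking training error does not by itself imply better test accuracy, which is why the number of stages and the tree depth are usually tuned on held-out data.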

### **Pros and Cons**
**Pros**
- Often provides predictive accuracy that is hard to beat.
- Lots of flexibility - can optimize different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
- Little data pre-processing required - often works well with categorical and numerical values as is.
- Handles missing data - imputation not required.

**Cons**
- Gradient Boosting Models keep improving to minimize all errors, which can overemphasize outliers and cause overfitting.
- Computationally expensive - often requires many trees (>1000), which can be time- and memory-intensive.
- The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning.
- Less interpretable in nature, although this can be addressed with various tools.

**Visualized Reference From**: https://www.youtube.com/watch?v=TyvYZ26alZs