## Library
We use the Turi Create library to implement the Random Forest classification model.

In [11]:
import turicreate as tc

## Data
We will be using the same given [LendingClub](https://www.lendingclub.com/) dataset.

In [12]:
loans = tc.SFrame('../data/lending-club-data.sframe/')

## Target Column Definition

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

We reassign the target to be:
* **+1** for a safe loan,
* **-1** for a risky (bad) loan.

We put this in a new column called `safe_loans` and use it as the `target` column.

In [13]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)

## Features Selection
As in the previous assignment, we will use a subset of the features (categorical and numeric). The features are **described in the code comments** below:

In [14]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
           ]

                  
# Extract the feature columns and target column
loans = loans[features + [target]]

## Class Balancing
One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used `seed=1` so everyone gets the same results.

We do this so that the algorithm learns from both classes equally instead of favoring the larger class, which helps it make better predictions.

In [15]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed = 1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
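
The undersampling idea can be sketched in plain Python, without Turi Create. This is only an illustration: the helper `undersample` and the toy label list are hypothetical, and it assumes that keeping each majority row with a fixed probability is a reasonable stand-in for what `sample` does on an SFrame.

```python
import random

def undersample(labels, majority=+1, seed=1):
    """Keep every minority row; keep each majority row with probability
    (#minority / #majority), so both classes end up roughly the same size."""
    rng = random.Random(seed)
    n_major = sum(1 for y in labels if y == majority)
    n_minor = len(labels) - n_major
    keep_prob = n_minor / float(n_major)
    # Minority rows always pass; majority rows pass with probability keep_prob.
    return [y for y in labels if y != majority or rng.random() < keep_prob]

# Hypothetical toy data: 800 safe (+1) loans and 200 risky (-1) loans.
toy_labels = [+1] * 800 + [-1] * 200
balanced = undersample(toy_labels)
print(balanced.count(+1), balanced.count(-1))
```

The two printed counts should be close to each other but not necessarily identical, because each safe loan is kept or dropped independently.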

## Data Splitting
The balanced data (`loans_data`) is randomly split into a training set (`train_data`) containing about 80% of the rows and a test set (`test_data`) containing the remaining 20%. We used `seed=1` so everyone gets the same results.

In [16]:
train_data, test_data = loans_data.random_split(.8, seed = 1)
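
As a rough mental model of a seeded random split, the sketch below sends each row to the first part with probability 0.8. The function name `random_split` and the toy rows are hypothetical, and the per-row coin flip is an assumption about the behavior, not Turi Create's actual implementation.

```python
import random

def random_split(rows, fraction, seed=1):
    """Send each row to the first list with probability `fraction`,
    otherwise to the second list. A fixed seed makes the split repeatable."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Hypothetical toy data: 1000 row ids.
toy_rows = list(range(1000))
toy_train, toy_test = random_split(toy_rows, 0.8)
print(len(toy_train), len(toy_test))
```

The sizes are only approximately 80% and 20%, and every row lands in exactly one of the two parts.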

## Random Forest Model Building
With Turi Create, we build the model using the `create` function of its `random_forest_classifier` module. The parameters are:

* `train_data`: the input data for the algorithm to train on.

* `validation_set`: set to None because we don't have a validation set.

* `target`: is the target column which is `safe_loans`.

* `features`: are the features the algorithm will use to learn.

* `max_iterations`: the number of trees grown for the model **(this will be covered in the algorithm explanation below)**

* `max_depth`: the maximum depth allowed for all trees

* `random_seed`: the seed for the randomization used when selecting data points as training data for the different trees and when selecting the subset of features for each tree **(this will be covered in the algorithm explanation below)**. For now, if you set it to None (`random_seed = None`), the accuracy can be different each time you build the model. If you set it to a fixed number (e.g. `random_seed = 1`), the result for each build will be the same.
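
The effect of a fixed seed versus no seed can be illustrated with Python's standard `random` module. This is a minimal sketch, not Turi Create code: `draw_rows` is a hypothetical helper that stands in for "randomly pick training rows for a tree".

```python
import random

def draw_rows(seed, n_rows=5, k=5):
    """Pick k row indices at random (with replacement) from n_rows rows."""
    rng = random.Random(seed)  # seed=None seeds from system entropy
    return [rng.randrange(n_rows) for _ in range(k)]

# A fixed seed reproduces exactly the same picks on every call.
print(draw_rows(seed=1) == draw_rows(seed=1))

# With seed=None the two draws are independent and will usually differ.
print(draw_rows(seed=None), draw_rows(seed=None))
```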

In [17]:
random_forest_model = tc.random_forest_classifier.create(train_data,
                                                            validation_set = None,
                                                            target = target,
                                                            features = features,
                                                            max_iterations = 100,
                                                            max_depth = 6,
                                                            random_seed = None)

Here is the summary of the model after building:

In [18]:
random_forest_model.summary()

Class                          : RandomForestClassifier

Schema
------
Number of examples             : 37224
Number of feature columns      : 8
Number of unpacked features    : 8
Number of classes              : 2

Settings
--------
Number of trees                : 100
Max tree depth                 : 6
Training time (sec)            : 1.5858
Training accuracy              : 0.6256
Training log_loss              : 0.6454
Training auc                   : 0.6754



## Accuracy Comparison With The Decision Tree Model
Using the same training data set (`train_data`), we also build a Decision Tree model with the same `target` and `features`.

In [19]:
decision_tree_model = tc.decision_tree_classifier.create(train_data,
                                                            validation_set = None,
                                                            target = target,
                                                            features = features)

Now, let's quickly compare the accuracy of the two models on the test set:

In [20]:
print("Decision Tree model's accuracy:", decision_tree_model.evaluate(test_data)['accuracy'])
print("Random Forest model's accuracy:", random_forest_model.evaluate(test_data)['accuracy'])

Decision Tree model's accuracy: 0.6209607927617407
Random Forest model's accuracy: 0.6210685049547608


We can see that **most of the time** (with `random_seed = None` when building the model), the accuracy of the **Random Forest model** is ***higher*** than that of the **Decision Tree model**. In the run above the difference is very small (about 0.6211 versus 0.6210).

If your result is the opposite, you can **rebuild** the **Random Forest model** by executing its code cell again and then rerun the accuracy test (or just hit **Run All**).

The result can differ on each run (with `random_seed = None` when building the model). So how does this work?

## **Random Forest Model Algorithm Explanation**

### **Main Idea**
The final prediction of this model is based on a combination of multiple different decision trees. In the binary classification case (the target column takes one of two values, such as 0 or 1), the model predicts, for each data point, the class that receives the most votes among all trees.
**For example**: let's say we have 5 trees and a data point. Consider the following result by putting the data point through all trees:

- Tree 1: predicts 1
- Tree 2: predicts 0
- Tree 3: predicts 1
- Tree 4: predicts 1
- Tree 5: predicts 0

The final prediction for this data point will be 1 (because 1 has more predictions than 0)
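
The vote in this example can be written as a few lines of Python. This is a minimal sketch of majority voting; `majority_vote` is a hypothetical helper, not a Turi Create function.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Predictions of Tree 1 to Tree 5 from the example above.
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints 1 (three votes for 1, two for 0)
```

With two classes and an odd number of trees a tie cannot occur. With an even number of trees, this sketch breaks a tie in favor of the class that appears first in the list.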

But why is it called ***Random***?

### **The Learning Process**
**Input**: the training data set `train_data`

**Learning**: build K trees (K is specified by the parameter `max_iterations` when building the model):
1. Generate K ***random*** new training sets from the original training set by **bootstrapping**, the sampling step of **Bootstrap Aggregating**, or **Bagging** for short. This means ***randomly*** selecting an item from the original set and appending it to the new set until the new set is the same size as the original one (duplicates are allowed). For example, if the original set is 1,2,3,4,5, the K new sets could be:
    - 1,1,2,3,4
    - 2,1,5,5,3
    - 4,2,3,1,5
    - ...
2. Use the i<sup>th</sup> new training set to train the i<sup>th</sup> tree.
3. For each tree, instead of using all features, ***randomly*** select a subset of features, a technique called **Feature Randomness**. The number of features selected is $\sqrt{\text{total features}}$, a common heuristic that usually leads to more accurate predictions.
4. Grow each of the K trees up to depth n (determined by the parameter `max_depth`).
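
Steps 1 and 3 above can be sketched with the standard library. The helpers `bootstrap_sample` and `random_feature_subset` are hypothetical names, and rounding $\sqrt{8}$ down to 2 is an assumption made here for illustration. This shows the idea described in this section rather than Turi Create's internal implementation.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (duplicates allowed)."""
    return rng.choices(rows, k=len(rows))

def random_feature_subset(feature_names, rng):
    """Pick about sqrt(total features) distinct features (rounded down here)."""
    k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)

rng = random.Random(1)
original = [1, 2, 3, 4, 5]
toy_features = ['grade', 'sub_grade', 'short_emp', 'emp_length_num',
                'home_ownership', 'term', 'last_delinq_none',
                'last_major_derog_none']

K = 3  # number of trees in this toy example
for i in range(K):
    # Each tree gets its own bootstrap sample and its own feature subset.
    print(i, bootstrap_sample(original, rng),
          random_feature_subset(toy_features, rng))
```

Each printed line corresponds to one tree: a resampled training set of the same size as the original, and the two features that tree would be allowed to use.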

**Predicting**: for each data point, the final prediction is the class predicted by the largest number of trees.

-> The learning process of this model involves ***randomness*** throughout. This is why it is called **Random Forest**. Forest here means multiple trees.

### **Pros and Cons**
**Pros**: Most of the time, it gives more accurate predictions than a single Decision Tree model. Since there are multiple trees and they differ randomly from each other, it reduces the overfitting problem of a single tree.

**Cons**: It consumes more computation power and time, because there are more trees to build, even though each of them uses fewer features.
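
One way to see why voting can help is an idealized calculation: if every tree were correct with probability p, independently of the others, the accuracy of the majority vote would follow from the binomial distribution. The independence assumption and the value p = 0.6 are purely illustrative. Trees in a real forest are correlated, so the actual gain is smaller, as the small difference measured earlier suggests.

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that more than half of k independent voters are correct,
    when each voter is correct with probability p (k should be odd)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

for k in (1, 5, 25, 101):
    print(k, round(majority_accuracy(0.6, k), 4))
```

For k = 5 this gives about 0.6826, compared with 0.6 for a single voter, and the value keeps growing as k increases.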



**Visualized Reference From**
https://www.youtube.com/watch?v=v6VJ2RO66Ag&list=LL&index=1&ab_channel=NormalizedNerd

