## Library
We use the Turi Create library to implement the random forest classification model.

In [1]:
import turicreate as tc

## Data
We use the same [LendingClub](https://www.lendingclub.com/) dataset as in the previous assignment.

In [2]:
loans = tc.SFrame('../data/lending-club-data.sframe/')

## Target Column Definition

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

We reassign the target so that:
* **+1** means a safe loan,
* **-1** means a risky (bad) loan.

We store this in a new column called `safe_loans` and use it as the `target` column.

In [3]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)
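
The remapping rule can be checked without loading the dataset. The following is a minimal sketch that applies the same rule to a small hand-made list; `remap_label` and the sample values are hypothetical names introduced for illustration and are not part of the notebook.

```python
def remap_label(bad_loan):
    """Map the original bad_loans flag (0 = safe, 1 = risky) to +1 / -1."""
    return +1 if bad_loan == 0 else -1

# Hypothetical sample of bad_loans values, not taken from the dataset
sample_bad_loans = [0, 1, 0, 0, 1]
sample_safe_loans = [remap_label(x) for x in sample_bad_loans]
print(sample_safe_loans)
```

Every 0 becomes +1 and every 1 becomes -1, which is the same convention the `safe_loans` column follows.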

## Features Selection
As in the previous assignment, we use a subset of the features (categorical and numeric). The features are **described in the code comments** below.

In [4]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

                  
# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a **subset of features** and the **target** that we will use for the rest of this notebook. 

## Class Balancing
One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we undersample the larger class (safe loans) to balance the dataset, which means throwing away many data points. We use `seed=1` so everyone gets the same results.
Balancing helps the algorithm learn both classes equally, so its predictions do not simply favor the majority class.

In [5]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
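
The effect of undersampling can be illustrated with plain Python. The sketch below assumes independent per-row sampling, which is one way to keep approximately a given fraction of rows; it is not a claim about how `SFrame.sample` is implemented. The `undersample` helper and the toy row counts are hypothetical.

```python
import random

def undersample(majority, minority, seed=1):
    """Keep each majority row with probability len(minority)/len(majority)."""
    fraction = len(minority) / float(len(majority))
    rng = random.Random(seed)
    kept = [row for row in majority if rng.random() < fraction]
    return minority + kept, fraction

# Hypothetical toy data: 8000 safe rows (+1) and 2000 risky rows (-1)
safe_rows = [+1] * 8000
risky_rows = [-1] * 2000

balanced, fraction = undersample(safe_rows, risky_rows)
safe_share = balanced.count(+1) / float(len(balanced))
print('sampling fraction:', fraction)
print('share of safe rows after balancing:', round(safe_share, 3))
```

Because the sampling is random, the share of safe rows ends up close to one half rather than exactly one half. All risky rows are kept.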

## Splitting the data
We randomly split the balanced data into a training set (`train_data`) with about 80% of the rows and a test set (`test_data`) with the remaining 20%. We use `seed=1` so everyone gets the same results.

In [6]:
train_data, test_data = loans_data.random_split(.8, seed=1)
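
To make the idea of a seeded random split concrete, here is a minimal pure-Python sketch. It assumes each row is assigned to the first part independently with the given probability; the `random_split` function below is a hypothetical stand-in and not Turi Create's implementation.

```python
import random

def random_split(rows, fraction, seed=1):
    """Send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(10000))  # hypothetical stand-in for the balanced loans
train_rows, test_rows = random_split(rows, 0.8)
print(len(train_rows), len(test_rows))

# The same seed reproduces the same split
assert random_split(rows, 0.8)[0] == train_rows
```

The two parts never share a row, and their sizes are only approximately 80% and 20% of the input.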

## Build the Random Forest Model
We use Turi Create's `random_forest_classifier.create` function to build the model. The parameters are:

* `train_data`: the input data for the algorithm to train on

* `validation_set`: set to `None` because we don't have a validation set

* `target`: the name of the target column, which is `safe_loans`

* `features`: the feature columns the algorithm will use to learn

In [7]:
random_forest_model = tc.random_forest_classifier.create(
    train_data,
    validation_set=None,
    target=target,
    features=features)

## Accuracy comparison with the original Decision Tree Model
Using the same training set (`train_data`), we also build a decision tree model with the same `target` and `features`.

In [8]:
decision_tree_model = tc.decision_tree_classifier.create(
    train_data,
    validation_set=None,
    target=target,
    features=features)

In [9]:
print('decision tree model:', decision_tree_model.evaluate(test_data)['accuracy'])
print('random forest model:', random_forest_model.evaluate(test_data)['accuracy'])

decision tree model: 0.6367944851357173
random forest model: 0.6402412753123654
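
Accuracy here is the fraction of test rows whose predicted label matches the true label. The sketch below computes it on a hypothetical toy example and then measures the gap between the two accuracies printed above; the `accuracy` helper and the toy labels are assumptions introduced for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))

# Hypothetical labels, for illustration only
true_labels      = [+1, -1, +1, +1, -1, -1, +1, -1]
predicted_labels = [+1, -1, -1, +1, -1, +1, +1, -1]
toy_accuracy = accuracy(true_labels, predicted_labels)
print('toy accuracy:', toy_accuracy)

# Gap between the two accuracies reported above, in percentage points
decision_tree_accuracy = 0.6367944851357173
random_forest_accuracy = 0.6402412753123654
gap = (random_forest_accuracy - decision_tree_accuracy) * 100
print('gap in percentage points:', round(gap, 2))
```

On this test set the random forest is ahead of the single decision tree by roughly a third of a percentage point.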


## Random Forest Model algorithm explanation