# Model Training

Our goal is to be able to quickly train a variety of regression models and evaluate which models perform best at predicting the return on investment (ROI) of a LendingClub loan. 

However, there are a few more steps necessary to prepare our data for the modeling process. The ROI of each loan is what we're trying to predict (our target), so we need to attach our `loan_rois` dictionary to the dataframe as a new column. Then, we need to turn various categorical columns into dummy columns. For example, the categorical column `home_ownership` can contain three possible values: `('RENT', 'MORTGAGE', 'OWN')`. Most models cannot automatically handle categorical columns, so we need to create new boolean columns such as `home_rent`, where the value is 1 if the original value was `RENT` and 0 otherwise.
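As a minimal sketch of the dummy-column idea, the snippet below applies `pandas.get_dummies` to a toy `home_ownership` column. It is an illustration under assumptions, not necessarily what the helper functions used later do internally; the toy data is made up, and the resulting column names (`home_RENT` rather than `home_rent`) simply follow the pandas default.

```python
import pandas as pd

# Toy data standing in for the real loans dataframe (illustrative values only).
toy = pd.DataFrame({"home_ownership": ["RENT", "MORTGAGE", "OWN", "RENT"]})

# One new 0/1 column per category; the original categorical column is dropped.
dummies = pd.get_dummies(toy, columns=["home_ownership"], prefix="home", dtype=int)
print(dummies)
```

Each row ends up with a 1 in exactly one of the new columns, which is the form most regression models expect.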

The code below will demonstrate the necessary steps, using functions I've written. Then we'll be able to train models and evaluate their performance. For now I am planning on training the following machine learning models:

* Decision Tree
* Random Forest
* XGBoost

Note: As of now I've only trained XGBoost as I expected it would be the winner and needed 1 trained model to test my portfolio simulator. I will be coming back to train more models.

Additionally, I will be trying some models that are not machine learning in order to test that we really are getting a performance boost over simpler methods. The strategies I will be testing are:

* Selecting High Interest Rate Loans
* Selecting Low Interest Rate Loans
* Selecting Random Loans
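
Each of these strategies can be expressed as a score that a ranking-based selector treats like a predicted ROI. The sketch below shows the idea for the two interest-rate strategies on toy data. The loan ids and rates are made up, and it assumes the selector always prefers the highest score.

```python
import pandas as pd

# Toy interest rates indexed by made-up loan ids.
int_rate = pd.Series([22.91, 14.08, 7.97], index=[101, 102, 103])

# High interest rate strategy: the rate itself is the score.
high_rate_score = int_rate
# Low interest rate strategy: negate the rate so the lowest rate ranks first.
low_rate_score = -1 * int_rate

best_high = high_rate_score.idxmax()
best_low = low_rate_score.idxmax()
print(best_high, best_low)
```

Negating the rate reverses the ranking without changing anything else about how loans are selected.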

### Cross Validation

It is essential that we perform cross validation to analyze how our trained models perform on unseen data. The traditional method is to use `k-fold` cross validation to select the best model parameters during the training process. For now, I am going to begin with a different method.

I will fit my models on all training loans and then use a custom "portfolio simulator" class I created to evaluate model performance.
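
For reference, the k-fold idea mentioned above can be sketched with NumPy alone. The function name `k_fold_indices` is hypothetical and only illustrates how the rows are partitioned; in practice a library utility such as scikit-learn's `KFold` would normally be used instead.

```python
import numpy as np

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        valid_idx = folds[i]
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

Every row appears in exactly one validation fold, so each candidate model is scored on data it was not fit on.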

In [1]:
import pandas as pd
import numpy as np
import pickle
from src.modeling import *
from src.feature_engineering import *
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

Read in our cleaned dataframe from previous steps, along with our loan ROI dictionary, and create a new column containing the ROI of all completed loans.

In [2]:
loans = pd.read_pickle('data/df_EDA.pkl.bz2', compression='bz2')

with open('data/loan_rois.pickle', 'rb') as handle:
    loan_rois = pickle.load(handle)
    
loans['roi'] = pd.Series(loan_rois)
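
Assigning a `pd.Series` built from a dictionary aligns on the dataframe index, so loans without an entry in the dictionary receive `NaN`. The toy example below illustrates that behavior with made-up loan ids and ROI values, assuming (as in the cell above) that the dictionary keys match the dataframe index.

```python
import pandas as pd

toy_loans = pd.DataFrame({"loan_amnt": [5000, 3000, 6000]}, index=[101, 102, 103])
toy_rois = {101: 4.2, 103: -1.5}  # loan 102 has no ROI yet

# Values are matched by index label, not by position.
toy_loans["roi"] = pd.Series(toy_rois)
print(toy_loans)
```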

Next, take the categorical columns, use their values to create boolean dummy columns, and then drop the original columns. We are using tree models so let's fill in `NaN` values with -99, allowing our trees to split on missing values.

In [3]:
loans = create_dummy_cols(loans)
loans = fill_nas(loans)
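
The missing-value step can be sketched with `DataFrame.fillna`. This is an assumption about the general approach rather than the body of `fill_nas`, and the column names and values below are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"dti": [12.5, np.nan, 30.1], "emp_length": [np.nan, 3.0, 10.0]})

# Replace every missing value with a sentinel far outside the normal range,
# so a tree can isolate "missing" with an ordinary threshold split.
filled = toy.fillna(-99)
print(filled)
```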

We are going to end up dropping the `issue_d` column for model training, but will end up needing it for the portfolio simulator. So at this point let's save the issue date of each loan in a dictionary.

In [4]:
issue_dates = dict(zip(loans.index, loans['issue_d']))

Remember, we only calculated ROI for loans that had been completed and will not be receiving any more payments. Therefore, our training dataset is going to be all loans that have an ROI calculated. Our testing dataset is going to be all loans whose id is not in the dictionary containing our ROI values.

In [6]:
training_rows = loans.index.isin(loan_rois)
testing_rows = np.logical_not(training_rows)

training_loans = loans.loc[training_rows, :]
testing_loans = loans.loc[testing_rows, :]
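
The same membership-based split is shown below on toy data so the effect is easy to see. The loan ids and values are made up.

```python
import pandas as pd

toy_loans = pd.DataFrame(
    {"loan_amnt": [5000, 3000, 6000, 15400]}, index=[101, 102, 103, 104]
)
toy_rois = {101: 4.2, 103: -1.5}  # only completed loans have an ROI

# Boolean mask: True where the loan id is a key of the ROI dictionary.
is_completed = toy_loans.index.isin(list(toy_rois))

train = toy_loans.loc[is_completed]
test = toy_loans.loc[~is_completed]
print(train.index.tolist(), test.index.tolist())
```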

The last step before training a model is to split the training set into features and a target (the helper function below calls the features "labels"). Our features are the columns we will be using to train the model, and the target is what we are trying to learn to predict (ROI). We also exclude any columns relating to the issue date of the loan, as that is not a feature we want to train on.

In [7]:
X_train, y_train = split_data_into_labels_and_target(training_loans)
X_test = testing_loans.drop(columns=['issue_d', 'roi'])
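
A helper like `split_data_into_labels_and_target` might look roughly like the sketch below. The function name `split_features_and_target`, its parameters, and the toy data are hypothetical; the actual helper may differ.

```python
import pandas as pd

def split_features_and_target(df, target="roi", drop_cols=("issue_d",)):
    """Return (X, y): X holds the model inputs, y the value to predict."""
    y = df[target]
    # Remove the target and any columns we do not want to train on.
    X = df.drop(columns=[target, *drop_cols])
    return X, y

toy = pd.DataFrame({
    "loan_amnt": [5000, 6000],
    "int_rate": [10.91, 7.97],
    "issue_d": pd.to_datetime(["2015-01-01", "2015-02-01"]),
    "roi": [4.2, -1.5],
})
X_toy, y_toy = split_features_and_target(toy)
print(list(X_toy.columns))
```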

### XGBoost

Let's use XGBoost as the first model we train on this dataset. 

In [53]:
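# Let's give GPU training a shot!
# Note: XGBoost 2.0+ deprecates tree_method='gpu_hist' and gpu_id in favor of
# tree_method='hist' with device='cuda'.
model = XGBRegressor(objective='reg:squarederror', tree_method='gpu_hist', gpu_id=0)
# CPU alternative:
# model = XGBRegressor(objective='reg:squarederror', n_jobs=-1)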
fit_model = train_model(model, X_train, y_train)

Excellent, we've got a trained model. As stated before, I intend to evaluate model performance by running it through a portfolio simulator. We need to take this trained model, make predictions for the ROI of loans in the testing set, and then save the dataframe in the format the portfolio simulator expects. I've written some helper functions to assist; let's walk through an example below.

In [54]:
predicted_rois = get_predictions(fit_model, X_test)
predicted_rois

array([-6.0914345 , -0.36336404,  2.7361727 , ...,  7.628992  ,
        1.6834992 ,  3.0342667 ], dtype=float32)

We need to add the loan issue date back into a dataframe. We won't be able to run a portfolio simulation if we don't know the date the loan was available.

In [55]:
df_testing = X_test.copy(deep=True)
df_testing['issue_d'] = pd.Series(issue_dates)

In [56]:
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | -6.091434 |
| 2017-08-01 | 113880484 | 5500 | -0.363364 |
| 2017-08-01 | 115705737 | 5000 | 2.736173 |
| 2017-08-01 | 115412547 | 6000 | 4.281939 |
| 2017-08-01 | 115402601 | 3000 | 6.662006 |


The above dataframe is the format needed for my portfolio simulator. The simulator runs month by month choosing loans to invest in, so I chose to make the month the index of the data frame. From that point we need the loan id, loan amount, and the predicted ROI coming from our trained model. I'm going to save this dataframe and then repeat the process for the other model types that I'll be working with.

In [57]:
simulation_df.to_pickle('data/model_xgb_predictions.pkl.bz2', compression='bz2')
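
To make the expected layout concrete, here is a rough sketch of how such a frame could be assembled. The function name `build_simulation_frame` is hypothetical and this is not the code behind `create_dataframe_for_simulation`; it also assumes the loan id is the index of the input dataframe.

```python
import numpy as np
import pandas as pd

def build_simulation_frame(df_with_dates, predicted_roi):
    """Return a frame indexed by issue date with id, loan_amnt, predicted_roi."""
    frame = pd.DataFrame({
        "issue_d": df_with_dates["issue_d"].to_numpy(),
        "id": df_with_dates.index.to_numpy(),
        "loan_amnt": df_with_dates["loan_amnt"].to_numpy(),
        "predicted_roi": np.asarray(predicted_roi),
    })
    # The issue month becomes the index so loans can be looked up by month.
    return frame.set_index("issue_d")

toy = pd.DataFrame(
    {
        "loan_amnt": [15400, 5500],
        "issue_d": pd.to_datetime(["2017-08-01", "2017-08-01"]),
    },
    index=[114844590, 113880484],
)
sim = build_simulation_frame(toy, [-6.09, -0.36])
print(sim)
```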

### Decision Tree

Let's repeat the above process for a decision tree regressor. XGBoost can handle `NaN` values natively, but the scikit-learn decision tree used here could not, which is why missing values were replaced with -99 earlier. Our tree-based models will then consider this value while checking all possible feature splits.

In [18]:
model = DecisionTreeRegressor()
fit_model = train_model(model, X_train, y_train)
predicted_rois = get_predictions(fit_model, X_test)
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | 27.764506 |
| 2017-08-01 | 113880484 | 5500 | 15.289537 |
| 2017-08-01 | 115705737 | 5000 | 12.11631 |
| 2017-08-01 | 115412547 | 6000 | 6.938766 |
| 2017-08-01 | 115402601 | 3000 | 8.493736 |


In [19]:
simulation_df.to_pickle('data/model_dt_predictions.pkl.bz2', compression='bz2')

### Random Forest


In [16]:
model = RandomForestRegressor(n_jobs=-1)
fit_model = train_model(model, X_train, y_train)
predicted_rois = get_predictions(fit_model, X_test)
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | 4.494387 |
| 2017-08-01 | 113880484 | 5500 | -1.467877 |
| 2017-08-01 | 115705737 | 5000 | -2.942033 |
| 2017-08-01 | 115412547 | 6000 | 1.127957 |
| 2017-08-01 | 115402601 | 3000 | 5.032741 |


In [17]:
simulation_df.to_pickle('data/model_rf_predictions.pkl.bz2', compression='bz2')

### Gradient Boosted Regressor

There is a lot of tuning I could do with the `GradientBoostingRegressor` model, but for now let's run it with the default parameters, apart from switching to the Huber loss.

In [34]:
model = GradientBoostingRegressor(loss='huber')
fit_model = train_model(model, X_train, y_train)
predicted_rois = get_predictions(fit_model, X_test)
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | 7.547071 |
| 2017-08-01 | 113880484 | 5500 | 8.797667 |
| 2017-08-01 | 115705737 | 5000 | 7.799025 |
| 2017-08-01 | 115412547 | 6000 | 6.282394 |
| 2017-08-01 | 115402601 | 3000 | 6.65625 |


In [35]:
simulation_df.to_pickle('data/model_gbrt_predictions.pkl.bz2', compression='bz2')

### High Interest Rate Strategy

In [28]:
predictions = X_test.int_rate
simulation_df = create_dataframe_for_simulation(df_testing, predictions)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | 22.91 |
| 2017-08-01 | 113880484 | 5500 | 14.08 |
| 2017-08-01 | 115705737 | 5000 | 10.91 |
| 2017-08-01 | 115412547 | 6000 | 10.91 |
| 2017-08-01 | 115402601 | 3000 | 7.97 |


In [29]:
simulation_df.to_pickle('data/model_high_interest_rate.pkl.bz2', compression='bz2')

### Low Interest Rate Strategy

In [30]:
predictions = -1*X_test.int_rate
simulation_df = create_dataframe_for_simulation(df_testing, predictions)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | -22.91 |
| 2017-08-01 | 113880484 | 5500 | -14.08 |
| 2017-08-01 | 115705737 | 5000 | -10.91 |
| 2017-08-01 | 115412547 | 6000 | -10.91 |
| 2017-08-01 | 115402601 | 3000 | -7.97 |


In [31]:
simulation_df.to_pickle('data/model_low_interest_rate.pkl.bz2', compression='bz2')

### Random Loan Strategy

Our portfolio simulator is set up to select the available loans with the highest predicted ROI, based on an investor's available cash and minimum return requirements. We can simulate choosing loans randomly by assigning random values from a continuous uniform distribution to our predicted ROI column.

In [32]:
# Set seed for reproducibility. 
np.random.seed(91)
predictions = np.random.uniform(50, 60, size=len(df_testing))
simulation_df = create_dataframe_for_simulation(df_testing, predictions)
simulation_df.head()

| issue_d | id | loan_amnt | predicted_roi |
|---|---|---|---|
| 2017-08-01 | 114844590 | 15400 | 52.010036 |
| 2017-08-01 | 113880484 | 5500 | 53.290205 |
| 2017-08-01 | 115705737 | 5000 | 52.964997 |
| 2017-08-01 | 115412547 | 6000 | 50.933337 |
| 2017-08-01 | 115402601 | 3000 | 53.330788 |


In [33]:
simulation_df.to_pickle('data/model_random_pick.pkl.bz2', compression='bz2')
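
To see why random scores amount to random picks, consider a toy greedy selector. Everything here is a hypothetical stand-in for the real portfolio simulator: the function `pick_loans`, the fixed `note_size` invested per loan, and the `min_roi` threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def pick_loans(month_df, cash, min_roi, note_size=25):
    """Greedy pick: highest predicted_roi first, for as long as cash lasts."""
    eligible = month_df[month_df["predicted_roi"] >= min_roi]
    ranked = eligible.sort_values("predicted_roi", ascending=False)
    n_affordable = int(cash // note_size)
    return ranked.head(n_affordable)["id"].tolist()

rng = np.random.default_rng(91)
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
# Random scores: the ranking, and therefore the selection, is arbitrary.
toy["predicted_roi"] = rng.uniform(50, 60, size=len(toy))

picked = pick_loans(toy, cash=50, min_roi=5)
print(picked)
```

Because every random score in this sketch clears the minimum return threshold, no loan is filtered out, and which loans get picked depends only on the random draw.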

### Modeling Part 2

In the modeling part 2 notebook, we will try more models, such as gradient boosted trees, and perform more traditional k-fold cross validation and parameter tuning.

### Portfolio Simulator

For now, let's move on to the portfolio simulator notebook and see how well our models perform on loans they haven't seen before.