# Model Training

Our goal is to be able to quickly train a variety of regression models and evaluate which models perform best at predicting the return on investment (ROI) of a LendingClub loan. 

However, a few more steps are necessary to prepare our data for the modeling process. The ROI of each loan is what we're trying to predict (our target), so we need to attach our `loan_rois` dictionary to the dataframe as a new column. Then, we need to turn various categorical columns into dummy columns. For example, the categorical column `home_ownership` can contain three possible values: `('RENT', 'MORTGAGE', 'OWN')`. Most models cannot automatically handle categorical columns, so we need to create new boolean columns such as `home_rent`, where the value is 1 if the original value was `RENT` and 0 otherwise.
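
As a rough illustration of the dummy-column idea, here is a minimal sketch on toy data using `pd.get_dummies`. It is not the `create_dummy_cols` helper used later in this notebook; the toy frame, the loan ids, and the `home` prefix are assumptions made for the example.

```python
import pandas as pd

# Toy frame; the category values follow the example in the text,
# the loan ids in the index are made up.
toy_loans = pd.DataFrame(
    {"home_ownership": ["RENT", "MORTGAGE", "OWN", "RENT"]},
    index=[101, 102, 103, 104],
)

# One 0/1 column per category, named with the assumed prefix "home".
dummies = pd.get_dummies(
    toy_loans["home_ownership"].str.lower(), prefix="home", dtype=int
)

# Replace the original categorical column with the dummy columns.
toy_encoded = toy_loans.drop(columns=["home_ownership"]).join(dummies)
print(toy_encoded)
```

Each row has a 1 in exactly one of the new columns, so the original category can still be recovered from the encoded frame.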

The code below will demonstrate the necessary steps, using functions I've written. Then we'll be able to train models and evaluate their performance. For now I am planning on training the following machine learning models:

* Decision Tree
* Random Forest
* XGBoost

Additionally, I will be trying some strategies that do not use machine learning, in order to test that we really are getting a performance boost over simpler methods. The strategies I will be testing are:

* Selecting High Interest Rate Loans
* Selecting Low Interest Rate Loans
* Selecting Random Loans
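
One way to express these three strategies is as a score per loan, so they can be ranked the same way as model predictions. The sketch below uses toy data and assumes that a higher score means "pick this loan first", which is an assumption about how the scores are consumed, not something the simulator is confirmed to do.

```python
import numpy as np
import pandas as pd

# Toy interest rates (percent), indexed by a made-up loan id.
toy_rates = pd.Series([7.5, 13.2, 21.0, 9.9], index=[1, 2, 3, 4], name="int_rate")

# High-rate strategy: the score is the rate itself, so high rates rank first.
high_rate_scores = toy_rates

# Low-rate strategy: negate the rate, so low rates rank first.
low_rate_scores = -toy_rates

# Random strategy: a seeded random score per loan, for reproducibility.
rng = np.random.default_rng(seed=0)
random_scores = pd.Series(rng.random(len(toy_rates)), index=toy_rates.index)

print(high_rate_scores.idxmax(), low_rate_scores.idxmax())
```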

### Cross Validation

It is essential that we perform cross validation to analyze how our trained models perform on unseen data. The traditional method is to use `k-fold` cross validation to select the best model parameters during the training process. For now, I am going to begin with a different method.
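
For reference, the traditional approach might look like the sketch below, which uses scikit-learn's `KFold` and `cross_val_score` on synthetic data. The data, the candidate `max_depth` values, and the R² scoring choice are all assumptions for illustration, not settings used in this project.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for the feature matrix and the ROI target.
rng = np.random.default_rng(seed=0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy[:, 0] * 0.5 + rng.normal(scale=0.1, size=200)

# Five folds, shuffled with a fixed seed so the split is reproducible.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Compare two candidate tree depths by mean cross-validated R^2.
cv_scores = {}
for depth in (2, 5):
    candidate = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_val_score(candidate, X_toy, y_toy, cv=folds, scoring="r2")
    cv_scores[depth] = scores.mean()

best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_depth)
```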

I will fit my models on all training loans and then use a custom "portfolio simulator" class I created to evaluate model performance.

# ADD MORE HERE

In [None]:
import pandas as pd
import pickle
from src.modeling import *
from src.feature_engineering import *
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

Read in our cleaned dataframe from previous steps, along with our loan ROI dictionary, and create a new column containing the ROI of all completed loans.

In [None]:
loans = pd.read_pickle('data/df_EDA.pkl.bz2', compression='bz2')

with open('data/loan_rois.pickle', 'rb') as handle:
    loan_rois = pickle.load(handle)
    
loans['roi'] = pd.Series(loan_rois)
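
The assignment above relies on pandas index alignment. The toy sketch below shows the behavior, assuming the dictionary is keyed by the same loan ids that form the dataframe index; the ids and ROI values are made up.

```python
import pandas as pd

# Toy frame indexed by a made-up loan id.
toy_loans = pd.DataFrame({"loan_amnt": [1000, 2500, 4000]}, index=[11, 12, 13])

# ROI is known only for loans 11 and 13 (illustrative values).
toy_rois = {11: 0.08, 13: -0.25}

# Assignment aligns on the index; loan 12 has no entry, so it becomes NaN.
toy_loans["roi"] = pd.Series(toy_rois)
print(toy_loans)
```

Loans without an entry in the dictionary end up with a missing `roi`, which is what later separates completed loans from the rest.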

Next, take the categorical columns, use their values to create boolean dummy columns, and then drop the original columns. We are using tree models so let's fill in `NaN` values with -99, allowing our trees to split on missing values.

In [None]:
loans = create_dummy_cols(loans)
# loans = fill_nas(loans)

We will drop the `issue_d` column before model training, but the portfolio simulator will need it later. So at this point let's save the issue date of each loan in a dictionary.

In [None]:
issue_dates = dict(zip(loans.index, loans['issue_d']))

Remember, we only calculated ROI for loans that had been completed and will not be receiving any more payments. Therefore, our training dataset is going to be all loans that have an ROI calculated. Our testing dataset is going to be all loans where the ROI value is missing.

In [None]:
training_loans = loans.loc[loans['roi'].notnull(), :]
testing_loans = loans.loc[loans['roi'].isnull(), :]
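
A quick sanity check on this kind of split is that the two masks partition the data. Here is a small sketch on toy data; the ids and values are illustrative only.

```python
import numpy as np
import pandas as pd

toy_loans = pd.DataFrame(
    {"loan_amnt": [1000, 2500, 4000, 1500], "roi": [0.08, np.nan, -0.25, np.nan]},
    index=[11, 12, 13, 14],
)

# Boolean mask: True where an ROI was calculated.
has_roi = toy_loans["roi"].notnull()
toy_training = toy_loans.loc[has_roi]
toy_testing = toy_loans.loc[~has_roi]

# Every loan lands in exactly one of the two sets.
assert len(toy_training) + len(toy_testing) == len(toy_loans)
print(len(toy_training), len(toy_testing))
```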

The last step before training a model is to split the training set into labels and a target. Our labels (the features) are the columns we will be using to train the model, and the target is what we are trying to learn to predict (ROI). We also exclude any columns relating to the issue date of the loan, as that is not a feature we want to train on.

In [None]:
X_train, y_train = split_data_into_labels_and_target(training_loans)
X_test = testing_loans.drop(columns=['issue_d', 'roi'])

### XGBoost

Let's use XGBoost as the first model we train on this dataset. 

In [None]:
model = XGBRegressor(objective='reg:squarederror', n_jobs=-1)
fit_model = train_model(model, X_train, y_train)

Excellent, we've got a trained model. As stated before, I intend to evaluate my model performance by running it through a portfolio simulator. We need to take this trained model, make predictions for the ROI of loans in the testing set, and then save the dataframe in the format necessary for the portfolio simulator. I've written some helper functions to assist, so let's walk through an example below.

In [None]:
predicted_rois = get_predictions(fit_model, X_test)
predicted_rois

We need to add the loan issue date back into the dataframe. We won't be able to run a portfolio simulation if we don't know the date the loan was available.

In [None]:
df_testing = X_test.copy(deep=True)
df_testing['issue_d'] = pd.Series(issue_dates)

In [None]:
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

The above dataframe is the format needed for my portfolio simulator. The simulator runs month by month choosing loans to invest in, so I chose to make the month the index of the data frame. From that point we need the loan id, loan amount, and the predicted ROI coming from our trained model. I'm going to save this dataframe and then repeat the process for the other model types that I'll be working with.

In [None]:
simulation_df.to_pickle('data/model_xgb_predictions.pkl.bz2', compression='bz2')
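
To make the described layout concrete, here is a minimal sketch of how such a frame could be assembled. `build_simulation_frame` is a hypothetical stand-in, not the notebook's `create_dataframe_for_simulation`; the column names `id`, `loan_amnt`, and `predicted_roi`, and the assumption that `issue_d` parses as a date, are mine.

```python
import numpy as np
import pandas as pd


def build_simulation_frame(df, predicted_rois):
    """Hypothetical helper: month-indexed frame with id, amount, and predicted ROI."""
    out = df[["loan_amnt"]].copy()
    out["id"] = df.index
    out["predicted_roi"] = np.asarray(predicted_rois)
    # Collapse each issue date to its calendar month.
    out["month"] = pd.to_datetime(df["issue_d"]).dt.to_period("M")
    out = out.set_index("month").sort_index()
    return out[["id", "loan_amnt", "predicted_roi"]]


# Toy testing frame indexed by a made-up loan id.
toy_test = pd.DataFrame(
    {
        "loan_amnt": [1000, 2500, 4000],
        "issue_d": ["2018-02-01", "2018-01-01", "2018-02-01"],
    },
    index=[21, 22, 23],
)
sim = build_simulation_frame(toy_test, [0.05, 0.02, -0.10])
print(sim)
```

Indexing by month means all loans available in a given month can be pulled with a single lookup such as `sim.loc["2018-02"]`.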

### Decision Tree

Let's repeat the above process for a decision tree regressor. For a decision tree we'll need to deal with `NaN` values. I'm choosing to replace missing values with -99. Our tree-based models will then consider this while checking possible feature splits.

In [31]:
model = DecisionTreeRegressor()
fit_model = train_model(model, X_train, y_train)
predicted_rois = get_predictions(fit_model, X_test)
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

ValueError: Found array with 0 sample(s) (shape=(0, 151)) while a minimum of 1 is required.
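
As a standalone illustration of the -99 sentinel fill described above, here is a minimal sketch on toy data. It is not the notebook's `fill_nas` helper; the `annual_inc` column and all values are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy features with missing values, plus a toy ROI target.
toy_X = pd.DataFrame(
    {
        "int_rate": [7.5, 13.2, np.nan, 9.9],
        "annual_inc": [50000, np.nan, 82000, 61000],
    }
)
toy_y = pd.Series([0.05, -0.20, 0.10, 0.07])

# Replace every missing value with the sentinel -99 before fitting.
toy_X_filled = toy_X.fillna(-99)

tree = DecisionTreeRegressor(random_state=0).fit(toy_X_filled, toy_y)
toy_preds = tree.predict(toy_X_filled)
print(toy_X_filled)
```

Because -99 lies below every real value in these non-negative columns, a single split can separate the filled rows from the rest.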

### Random Forest


In [None]:
model = RandomForestRegressor(n_jobs=-1)
fit_model = train_model(model, X_train, y_train)
predicted_rois = get_predictions(fit_model, X_test)
simulation_df = create_dataframe_for_simulation(df_testing, predicted_rois)
simulation_df.head()

### High Interest Rate Strategy

In [None]:
predictions = X_test.int_rate
simulation_df = create_dataframe_for_simulation(df_testing, predictions)
simulation_df.head()