# Assignment 3

In this assignment, you will train and tune a model for predicting the IRR of individual loans issued by the Prosper platform in 2017Q1--Q2.

# Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import numpy_financial as npf
import os

In [None]:
from pagayapro.paths.data_paths import ASSIGNMENT5_DATA

# Load the datasets

In [None]:
data = pd.read_parquet(os.path.join(ASSIGNMENT5_DATA, "prosper_data.parquet"))

Display the first five rows of your data

How many rows are there in your data? How many columns?

Load the cashflows from the following path

In [None]:
cashflows = pd.read_parquet(os.path.join(ASSIGNMENT5_DATA, "prosper_cashflows.parquet"))

Display the first five cashflows

What is the shape of the cashflows dataframe?

## Verify matching ID sets and make sure they are in the same order
We will use the cashflows table in order to generate labels for our prediction model. To do so, we must verify that the two data sets describe the same loans and in the same order. Please do so now.
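One possible way to run this check is sketched below on toy frames. The identifier column name `loan_id` and the helper `align_by_id` are hypothetical, so replace them with whatever your tables actually use.

```python
import pandas as pd


def align_by_id(data, cashflows, id_col="loan_id"):
    """Check that both frames hold the same IDs, then sort them identically.

    `id_col` is a hypothetical column name used for illustration only.
    """
    if not (data[id_col].is_unique and cashflows[id_col].is_unique):
        raise ValueError("IDs are duplicated in at least one table")
    ids_data, ids_cf = set(data[id_col]), set(cashflows[id_col])
    if ids_data != ids_cf:
        raise ValueError(f"{len(ids_data ^ ids_cf)} IDs appear in only one table")
    data_sorted = data.sort_values(id_col).reset_index(drop=True)
    cf_sorted = cashflows.sort_values(id_col).reset_index(drop=True)
    # After sorting, row i of both frames must describe the same loan.
    assert (data_sorted[id_col] == cf_sorted[id_col]).all()
    return data_sorted, cf_sorted


# Toy frames: same loans, different order.
data_demo = pd.DataFrame({"loan_id": [3, 1, 2], "int_rate": [0.10, 0.20, 0.15]})
cf_demo = pd.DataFrame({"loan_id": [1, 2, 3], "0": [-100.0, -200.0, -300.0]})
data_aligned, cf_aligned = align_by_id(data_demo, cf_demo)
```

If the IDs live in the index rather than in a column, the same idea applies with `index.equals` and `sort_index`.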

## Compute IRR

Compute the IRR of each row in the cashflows table individually, and convert it to an annual IRR in percent.
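The conversion step can be sketched as follows, assuming the cashflows are monthly so that the per-row IRR (for example, the output of `npf.irr`) is a monthly rate. Compounding over 12 months is one common convention; the helper name `annualize_monthly_irr` is made up for this illustration.

```python
import numpy as np


def annualize_monthly_irr(monthly_irr):
    """Compound a monthly rate over 12 months and express it in percent.

    Assumes the input is a per-month rate given as a fraction (0.01 == 1%).
    NaN inputs stay NaN, so they can still be counted afterwards.
    """
    monthly_irr = np.asarray(monthly_irr, dtype=float)
    return ((1.0 + monthly_irr) ** 12 - 1.0) * 100.0


# Made-up monthly rates: +1%, 0%, and -2% per month.
monthly = np.array([0.01, 0.0, -0.02])
annual_pct = annualize_monthly_irr(monthly)
```

A monthly rate of 1% compounds to roughly 12.68% per year, slightly more than the 12% a simple multiplication would give.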

Check for NaNs in the target: How many NaNs are there?

Determine why the NaNs in the cashflows appear.

Fix the NaN according to your findings. Make sure there are no more NaNs in the target.

# Data preprocessing for model train

Look at the data and see if there are features which we can't/shouldn't use. 

What are the different dtypes? How many features are there of each dtype?

Important: We must remove the co_mob, co_amount features from the data since they are not features we can use when deciding whether to approve a loan.
These features tell us how the payments of the loan behaved, which is information from the future which we don't have at the time of the approval decision.

However, we will not remove them yet, as we will use them later on to make sure we have the same ratio of charged-off loans in our train and test datasets

Let's look specifically at the non-numerical features.

It is safe to assume the date is not correlated to the IRR (and even if, for some reason, it is, there is no way for our model to correctly evaluate loans with future dates which are not used in training). Therefore, we can drop the date feature.

Look at the remaining categorical features. How many distinct values appear in each of these features?

Decide how to manipulate them so they can be used in the model.
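As one option among several, low-cardinality categorical features can be one-hot encoded. The sketch below uses a toy frame with hypothetical column names; for features with many distinct values another encoding may be more suitable.

```python
import pandas as pd

# Toy frame; the column names and values are hypothetical.
demo = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],
    "employment": ["Employed", "Other", "Employed", "Self-employed"],
    "loan_amount": [1000, 2500, 4000, 1500],
})

# Everything that is not numeric is treated as categorical here.
cat_cols = demo.select_dtypes(exclude="number").columns

# Number of distinct values per categorical feature.
n_distinct = demo[cat_cols].nunique()

# One-hot encode: each distinct value becomes its own 0/1 column.
encoded = pd.get_dummies(demo, columns=list(cat_cols), dtype=int)
```

Each categorical column is replaced by one indicator column per distinct value, which is why checking `n_distinct` first matters.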

## Remove features with high NaN rate


Check which features have high NaN values and decide how to treat them. Which is the feature with the most NaN values?

Remember that the XGBoost model can handle NaN values in the features, but the more NaNs we have, the harder it will be for the model to learn.
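A minimal sketch of ranking features by NaN rate is shown below on a toy frame. The 50% cutoff is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per feature, highest first.
nan_rate = demo.isna().mean().sort_values(ascending=False)

# Arbitrary illustrative cutoff: drop features that are mostly missing.
threshold = 0.5
to_drop = nan_rate[nan_rate > threshold].index
reduced = demo.drop(columns=to_drop)
```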

## Checking validity of cashflows


Show the payments of each loan for every month the loan existed

Find something in this data that doesn't make sense. Figure out why and when this happens. How many times does this behavior occur? Determine how to treat it. If you decide to drop these rows, make sure you do not create a CO-bias.

In order to make sure that we do not create a CO-bias by dropping a subset of loans, we should verify that the percentage of CO loans is roughly the same before and after dropping these loans.
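The comparison can be sketched as below. Both column names (`is_co`, `invalid_payment`) are hypothetical boolean flags on a toy frame; in the assignment they would be derived from the CO features and from whatever invalid behavior you found.

```python
import pandas as pd


def co_rate(is_co):
    """Fraction of charged-off loans in a boolean Series."""
    return float(is_co.mean())


# Toy data: 8 loans, 2 charged off, 1 flagged as having invalid payments.
demo = pd.DataFrame({
    "is_co": [True, False, False, False, True, False, False, False],
    "invalid_payment": [False, False, True, False, False, False, False, False],
})

rate_full = co_rate(demo["is_co"])
rate_valid = co_rate(demo.loc[~demo["invalid_payment"], "is_co"])
```

On real data, the two rates should be compared relative to each other: a shift that is small in absolute terms can still matter if the CO rate itself is small.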

Compute the CO rate of full data:

Compute the CO rate of data without invalid payments:

_Note_: Since the difference is very small, it is safe to say we are not creating a CO bias and we can safely remove the entries. 
We note that though we have determined that we are not creating a CO-bias by removing these entries, it is still very much possible that we are creating some other bias. Since the number of samples is really quite small, it is ok to ignore that for the moment (but better keep that in mind in case the model fails or vastly underperforms for some reason).

## Train test/validation split


As stated before, to make sure our model is well trained and tested we want to make sure we have roughly the same CO rate in the train and test sets. To do that, we use the "stratify" argument in the train_test_split function.

Read about using the "stratify" argument in [train-test-splitting](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use "stratify" and make sure you have the same CO-rate in train and test sets.

This argument will receive a boolean column stating whether the loan was a CO. Let's create that column:

Now split the data to train and test using the sklearn function - `train_test_split`. Make sure the train and test have roughly the same CO rate.

In [None]:
from sklearn.model_selection import train_test_split
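The effect of `stratify` can be seen on synthetic data. The frame below and its `is_co` column are made up for illustration, and the 10% CO rate and 20% test size are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% charged-off loans.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=1000)})
demo["is_co"] = rng.random(1000) < 0.1

# Stratifying on the boolean column keeps the CO share equal in both parts.
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["is_co"], random_state=42
)

train_co_rate = train["is_co"].mean()
test_co_rate = test["is_co"].mean()
```

Without `stratify`, the two rates would only match on average, and a rare class could end up noticeably over- or under-represented in the test set.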

As we can see, the CO ratio is indeed very similar between the datasets.
Now let's drop the CO feature from the data: it describes the outcome of the loan, so keeping it would leak information from the future into the model.

# Running XGB

First we import the XGBoost package and create an empty dictionary for saving the results:

In [None]:
import xgboost as xgb

evals_result = dict()

Next, we set the train and test matrices for the XGBoost (see [here](https://xgboost.readthedocs.io/en/latest/get_started.html)):

In [None]:
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

Here is an example of how to run XGBoost (the following hyperparameters are completely arbitrary).

Notice that the `num_boost_round` hyperparameter must be specified in the `xgb.train` function, and not as part of the params dictionary.

In [None]:
params = {
    "booster": "gbtree",
    "max_depth": 6,
    "colsample_bytree": 1,
    "subsample": 1,
    "eta": 0.3,
    "gamma": 1,
    "min_child_weight": 1,
    "max_delta_step": 0,
    "nthread": 8,
    "seed": 42,
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}
num_boost_round = 20

model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "eval"), (dtrain, "train")],
    evals_result=evals_result,
)

Choose an initial setting of hyperparameters on which you will later perform a hyperparameter grid search using cross validation. A good place to start the search from is important in order to establish good results (we need to search in the right area). Do not change the loss function. Fill in your initial values in the cell below.

Explore the content of the `evals_result` dictionary, and find the values of the loss function for the model.
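With the `evals` names and `eval_metric` used above, `evals_result` is a nested dictionary keyed first by dataset name and then by metric name, holding one value per boosting round. The sketch below uses a stand-in dictionary with made-up numbers so that it runs without a trained model.

```python
# Stand-in with the same layout as evals_result; the numbers are made up.
evals_result_demo = {
    "eval": {"rmse": [36.2, 35.4, 35.1, 35.3, 35.6]},
    "train": {"rmse": [36.0, 34.8, 33.9, 33.1, 32.4]},
}

test_rmse = evals_result_demo["eval"]["rmse"]
train_rmse = evals_result_demo["train"]["rmse"]

# Index of the boosting round with the lowest test RMSE (0-based).
best_round = min(range(len(test_rmse)), key=test_rmse.__getitem__)
```

The two lists can be passed directly to a plotting function to draw the loss curves.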

Plot the loss curves of the training and test.

Notice that training for longer (more rounds) may keep improving the training set RMSE while the RMSE on the test set gets worse. This happens due to overfitting on the train set. You can verify this phenomenon by setting the `num_boost_round` parameter to 100.

## Hyperparameters tuning using cross validation

Next, we will use cross-validation in order to fine-tune the hyperparameters.
Cross-validation is a good way to evaluate our model with as little bias as possible from the particular training/test split we created. This is because cross-validation uses k folds, where each fold is a different choice of test (validation) set, so that eventually every sample passed in is used both for training and for testing.

In order to avoid overfitting the hyperparameters, we will pass only the training set to the cross-validation. This gives us 3 datasets for each fold: the original test set, the per-fold training set, and the per-fold test set (more accurately called a validation set, which is a test set used during training but not for the final evaluation of the model).

### Setting parameters for grid-search

Set the gridsearch hyperparameters by indicating the values checked for each parameter. Make sure not to use too many values since the number of permutations increases multiplicatively!

In [None]:
gridsearch_params = [
    (max_depth, min_child_weight, colsample_bytree, subsample, eta, gamma)
    for max_depth in #set values here
    for min_child_weight in #set values here
    for colsample_bytree in #set values here
    for subsample in #set values here
    for eta in #set values here
    for gamma in #set values here
]
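To see how quickly the grid grows, the number of permutations can be counted before running anything. The values below are placeholders chosen only for the count, not suggested settings.

```python
from itertools import product

# Placeholder values: two candidates per hyperparameter.
grid = {
    "max_depth": [3, 6],
    "min_child_weight": [1, 5],
    "colsample_bytree": [0.8, 1.0],
    "subsample": [0.8, 1.0],
    "eta": [0.05, 0.3],
    "gamma": [0, 1],
}

# Cartesian product of all value lists, equivalent to the nested loops above.
combos = list(product(*grid.values()))
n_combos = len(combos)  # 2 * 2 * 2 * 2 * 2 * 2
```

Two values for each of six hyperparameters already give 64 permutations, and each permutation costs k model fits in k-fold cross-validation.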

Run k-fold cross validation on all permutations of parameters in the gridsearch. Save the hyperparameters of the best permutation.
Make sure this run doesn't take more than 1 hour. The CV step can be done using [XGBoost's cv method](https://www.rdocumentation.org/packages/xgboost/versions/1.4.1.1/topics/xgb.cv) (`xgb.cv` in the Python package).

## Training using the best hyperparameters

Let's now load the best hyperparameters and train the model. Then we will finally test the model on our test set!

In [None]:
evals_result = dict()

params = {
    "booster": "gbtree",
    "max_depth": #set values here,
    "colsample_bytree": #set values here,
    "subsample": #set values here,
    "eta": #set values here,
    "gamma": #set values here,
    "min_child_weight": #set values here,
    "max_delta_step": 0,
    "nthread": 8,
    "seed": 42,
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "eval"), (dtrain, "train")],
    evals_result=evals_result,
)

Plot the learning curve

What is the final test and train RMSE you got?

Intuitively, an RMSE of ~35, when the possible values of the target are (-100, 35), is not good at all.
As a sanity check, let's look at the feature importance in our model:

## Model evaluation

Run feature importance for your model. What are the 3 most important features in the model?

In [None]:
import matplotlib.pyplot as plt

ax = xgb.plot_importance(model)
fig = ax.figure
fig.set_size_inches(24, 16)
ax.tick_params(axis='both', which='major', labelsize=20)
plt.xlabel('F score', fontsize=24)
plt.ylabel('Features', fontsize=24)
plt.show()

The above graph describes the features which had the greatest impact on our model. We can use this data as a sanity check to verify our model.
For example, the fact that int_rate is the most important feature is very encouraging since if the loan did not CO, the IRR is supposed to be very close to the interest rate.  

We can also see that the 'debt_to_income' and 'loan_amount' features are at the top as well which is another good sign for the performance of our model.

Check what is the RMSE of the test set if instead of our model we use the mean IRR of the train set. What is the difference between this result and our model's result? What does that mean about our model?
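The baseline can be sketched as follows. The arrays are made-up stand-ins for the train and test targets, and the helper `rmse` is defined here only for the illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up annual IRRs in percent, including one charged-off loan per set.
y_train_demo = np.array([8.0, 10.0, -60.0, 12.0])
y_test_demo = np.array([9.0, -40.0, 11.0])

# Constant prediction: the mean of the TRAIN targets, applied to the test set.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
baseline_rmse = rmse(y_test_demo, baseline_pred)
```

The mean is taken from the train set rather than the test set, so the baseline sees no more information than the model did.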

Amazingly, the RMSE obtained by simply predicting the mean IRR of the train set is only slightly higher than that of our model. Usually this would mean that our model is terrible. In this case, however, it simply reflects how hard it is to create a model which predicts the IRR of a specific loan (you are more than welcome to try and do better!).

Fortunately, this is not the requirement of the assignment. We only need to identify the best loans in a given volume to get the highest IRR possible. Let's see how our model fares in this task.

# Checking the performance of the model on a portfolio of loans

Create a scatter plot for the predicted IRR vs the true IRR of the loans in the test set. Can you detect any visible trend? Try to use different stratifications of the data to further investigate the model's performance.

## Plotting volume-return

Plot the performance of your model (IRR-wise) if it had to choose the best X-volume loans (similar to what we did in the previous assignment), for X being various possible volumes of subsets of the overall data.
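One rough way to build such a curve is sketched below. It ranks loans by predicted IRR and, as a simplifying assumption, uses the amount-weighted mean of the realized per-loan IRRs as a proxy for the portfolio return. A proper portfolio IRR would instead be computed from the summed cashflows of the selected loans. The function name and the toy numbers are made up.

```python
import pandas as pd


def volume_return_curve(pred_irr, true_irr, amount):
    """Cumulative volume and proxy return when taking loans by predicted IRR.

    Proxy assumption: portfolio return is approximated by the amount-weighted
    mean of per-loan realized IRRs, not by an IRR of the pooled cashflows.
    """
    df = pd.DataFrame({"pred": pred_irr, "true": true_irr, "amount": amount})
    # Best predicted loans first.
    df = df.sort_values("pred", ascending=False).reset_index(drop=True)
    df["cum_volume"] = df["amount"].cumsum()
    df["cum_return"] = (df["true"] * df["amount"]).cumsum() / df["cum_volume"]
    return df[["cum_volume", "cum_return"]]


# Toy example with four loans (IRRs in percent, amounts in dollars).
curve = volume_return_curve(
    pred_irr=[5.0, 9.0, 2.0, 7.0],
    true_irr=[4.0, 10.0, -50.0, 8.0],
    amount=[1000.0, 2000.0, 1000.0, 1000.0],
)
```

Plotting `cum_return` against `cum_volume` gives the volume-return curve, and the same function can be reused with a different ranking column to compare models.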


What is the IRR if we set a volume of 1/5 out of the total volume? How would you evaluate your model in this task?

Since the IRR of the total test set is 4%, a result of 7.4% is very good! Hurray!

As we can see in the graph above, the first buckets, which sum up to $10MM, have a very high IRR (>10% as opposed to ~4% for the total portfolio). Furthermore, we see that the more loans we take, the smaller our IRR gets (most of the time).
This means that bins with lower predicted IRR actually do have a lower IRR, as a portfolio.

From this we can learn that though it seemed that our model did not learn well to predict the IRR of a single loan, it did do a good job when choosing the best loans.
This is in contrast to constantly predicting the mean IRR, which fares similarly to our model when predicting a single loan, but will fare poorly (no better than random) when choosing the best X loans.

We can also compare this model to the model created in the previous assignment and see this model better predicts the best loans in a given volume. Import your model and plot your results.

Also, as a comparison to the industry standard- plot the volume-return attained by selecting the top N loans in terms of credit score to get a given volume. How does your mode