# Assignment 6

In this assignment, we study and predict the activation of individual loan offers. In particular, you will investigate the probability that a loan offer is activated, given its features and, most crucially, its offered interest rate. You will then create a model for predicting this probability and use it for a basic pricing task.

# Basic imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import os

In [None]:
from pagayapro.paths.data_paths import ASSIGNMENT6_DATA

# Loading the data 

Start by loading your offers data from the following path:

In [4]:
offers = pd.read_parquet(os.path.join(ASSIGNMENT6_DATA, "upg_offers.parquet"))

Review your data, how many rows and columns? Select a small sample of rows and review it.

When was the earliest loan in your dataset issued? When was the latest?
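A minimal sketch of this kind of first look, run on a small made-up frame rather than on the real `offers` table. The column names `issue_date`, `int_rate` and `status` are assumptions made for illustration; substitute whatever your dataset actually calls them.

```python
import pandas as pd

# Toy stand-in for the offers table; all column names here are hypothetical.
toy_offers = pd.DataFrame({
    "issue_date": pd.to_datetime(
        ["2018-03-01", "2018-11-15", "2019-02-20", "2019-06-30"]
    ),
    "int_rate": [9.5, 14.0, 21.3, 11.2],
    "status": ["activated", "declined", "approved", "activated"],
})

n_rows, n_cols = toy_offers.shape                 # number of rows and columns
sample = toy_offers.sample(n=2, random_state=0)   # small random sample to eyeball
earliest = toy_offers["issue_date"].min()         # earliest issue date
latest = toy_offers["issue_date"].max()           # latest issue date
print(n_rows, n_cols, earliest.date(), latest.date())
```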

# Initial observations

Observe your data, what statuses do you find in the dataset? How many loans in the dataset are approved by the platform?

Compare the distribution of credit score (`fico9`) and interest rate in the entire dataset and among the set of accepted offers.

## Observing activation

In this assignment, we are interested in the activation rate among offers we have given to potential clients; therefore, we should restrict our dataset to only those loans which we did not decline. Create the dataset of approved loans.

In [None]:
approved = # write your code here

Draw a scatterplot of the probability of activation per interest rate (i.e. the mean of the RV attaining 1 if the loan is activated).

Does the scatter plot seem sensible? What is the general trend of the graph? Add regression lines to get a better feel for it. You can also use various methods (e.g. binning or rolling windows) to get a smoother view of the probabilities.
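One possible way to smooth the picture by binning is sketched here on synthetic data. The column names `int_rate` and `activated` (a 0/1 indicator) are assumptions, and the simulated downward trend is invented purely so the example has something to show; it says nothing about what the real data looks like.

```python
import numpy as np
import pandas as pd

# Synthetic approved offers; `int_rate` and `activated` are hypothetical names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"int_rate": rng.uniform(5, 30, size=2000)})
# Simulated activation that becomes less likely as the rate grows (made up).
p_true = 1 / (1 + np.exp(0.15 * (toy["int_rate"] - 15)))
toy["activated"] = (rng.uniform(size=len(toy)) < p_true).astype(int)

# Cut the rate into 10 equal-width bins and average the 0/1 indicator per bin.
toy["rate_bin"] = pd.cut(toy["int_rate"], bins=10)
binned = (
    toy.groupby("rate_bin", observed=True)["activated"]
    .agg(activation_rate="mean", n_offers="size")
    .reset_index()
)
# Use each bin's midpoint as the x-coordinate for plotting.
binned["rate_mid"] = [interval.mid for interval in binned["rate_bin"]]
print(binned[["rate_mid", "activation_rate", "n_offers"]])
```

The resulting `binned` frame can be passed to a plotting call such as `sns.regplot(data=binned, x="rate_mid", y="activation_rate")` to overlay a regression line.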


Do the results seem plausible to you? How would you explain them?

How does the above trend look when stratifying by various features? Split your data into bins according to features you find important and create the same plots:

### Soooo.... how do you explain this observation?

# Modelling activation

We will now try to create models for predicting the activation of individual loan offers. As we already noted, the probability of activating a loan can vary widely depending not only on the interest rate but also on various features of the client. Therefore, we will try to model the activation in a way that takes this into account.

**Feel free to make changes to any part of the models' design below**

## Logistic regression in bins 

A very interesting theoretical notion for predicting activation is that of the individual activation probability function. Broadly speaking, this function is defined for each client separately, and describes, for each interest rate $i$, the probability that the client will accept the given loan with interest rate $i$.

Of course, trying to estimate the activation probability function of a single client is, generally, impossible, since we only offer one interest rate per loan to each client. Furthermore, from a statistical point of view, it is more sensible to estimate this function from a sample of "similar" clients, in some sense which we have to define ourselves. Thus, our first model will be constructed as follows:
* Splitting the data into bins according to some binning logic
* Fitting a logistic regression model, one for each of the bins above
* Setting the threshold for declaring a loan as activated

Once we have this, our prediction will also happen per bin. 

Below is a scheme for creating such a model. Many constants and parameters are left for you to decide on and tune according to your own judgment.

### Split to train-test 

Split your data into a train set and a test set according to date (we'll be doing an out-of-time (OOT) split here). Use 2019-1-1 as your splitting point.
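A small sketch of a date-based split on toy data. The column name `issue_date` is an assumption, and placing the split date itself in the test set is one possible convention, not something the assignment prescribes.

```python
import pandas as pd

# Toy approved offers; `issue_date` is a hypothetical name for the date column.
toy = pd.DataFrame({
    "issue_date": pd.to_datetime(
        ["2018-06-01", "2018-12-31", "2019-01-01", "2019-07-04"]
    ),
    "activated": [1, 0, 1, 0],
})

split_date = pd.Timestamp("2019-01-01")
# Offers strictly before the split date go to train; the rest are held out.
train = toy[toy["issue_date"] < split_date].copy()
test = toy[toy["issue_date"] >= split_date].copy()
print(len(train), len(test))
```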

### Binning your training data according to credit_score 

We have already seen in the past (and above) that credit score is a strong confounding variable for many phenomena in the financial world, and for activation research in particular. This has a lot to do with the industry's use of this feature.

To begin with, we will also consider credit score as a central feature for classifying clients. Cut your training data into 5 (or any other number) bins according to credit_score, and add the bins as a column to your training data.

Recommended: save your bins (you can use `pd.IntervalIndex` for this purpose)
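A sketch of quantile binning with reusable bin definitions, on synthetic scores. Using `pd.qcut` (equal-frequency bins) and widening the outer edges to infinity are choices made for this example, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic scores; `credit_score` follows the column name used in the text.
train = pd.DataFrame({"credit_score": rng.integers(600, 851, size=500)})
test = pd.DataFrame({"credit_score": [610, 700, 845]})

# Quantile-based edges computed on the training set only.
_, edges = pd.qcut(train["credit_score"], q=5, retbins=True, duplicates="drop")
edges = edges.astype(float)
# Widen the outer edges so scores outside the training range still get a bin.
edges[0], edges[-1] = -np.inf, np.inf
score_bins = pd.IntervalIndex.from_breaks(edges)  # saved bin definition

# Apply the same saved bins to both sets so their categories match.
train["score_bin"] = pd.cut(train["credit_score"], bins=score_bins)
test["score_bin"] = pd.cut(test["credit_score"], bins=score_bins)
print(test)
```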

### Create new feature 

Only binning according to credit score will probably give acceptable results; however, we have seen in many cases that credit score alone is not a perfect predictor of the client's behaviour and should be considered in conjunction with other features. One solution would be to select a number of additional features and bin the data further according to them. What do you suppose would be problematic about doing this?

Instead, we will do something else. We will engineer a new feature, which should have some predictive ability for the activation of a loan, and use it as the only binning feature except for credit score.

Start by reading the loan features and locating the features corresponding to the offers in your training set.

In [80]:
features =  pd.read_parquet(os.path.join(ASSIGNMENT6_DATA, "upg_features.parquet"))

In [None]:
train_features = # write your code here

Find the 50 numerical features which are most correlated with activation in your training data (you can decide for yourself if and how to handle NaN values in the features).
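One way to rank features by correlation with the target is sketched here on synthetic data. The feature names `f0` to `f5` and the target name `activated` are made up, and ranking by absolute Pearson correlation is an assumption about what "most correlated" should mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
# Synthetic features; the names f0..f5 are hypothetical.
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
X.loc[::25, "f3"] = np.nan  # a few missing values
# Synthetic target driven by f0 and f1 plus noise.
y = pd.Series(
    (X["f0"] - X["f1"] + rng.normal(size=n) > 0).astype(int), name="activated"
)

numeric = X.select_dtypes(include="number")
# Correlate each column with the target and rank by absolute value.
corr = numeric.corrwith(y).abs().sort_values(ascending=False)
top_k = 3  # the assignment asks for 50; 3 keeps the toy example readable
top_features = corr.head(top_k).index.tolist()
print(top_features)
```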

We want to create a feature which captures most of the variance in the features selected above; to do so, we'll use PCA. Import `sklearn.decomposition.PCA` and use it to create a column holding the top component produced by PCA from the columns you found above.

In [83]:
from sklearn.decomposition import PCA
# write your code here

Add the PCA column to your training data.
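A sketch of extracting the first principal component as a new column. Median imputation and standardisation before PCA are assumptions added for this example (PCA is sensitive to feature scale and does not accept NaN values); the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
feature_cols = list("abcde")  # hypothetical names for the selected features
train_X = pd.DataFrame(rng.normal(size=(300, 5)), columns=feature_cols)
train_X.loc[::20, "b"] = np.nan

# Impute, standardise, then keep only the first principal component.
pca_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=1),
)
train_X["pca_1"] = pca_pipe.fit_transform(train_X[feature_cols])[:, 0]
explained = pca_pipe.named_steps["pca"].explained_variance_ratio_[0]
print(explained)
```

Keeping the fitted pipeline around lets you call `pca_pipe.transform(...)` on the test set later, so that the same imputation, scaling and projection are reused.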

### Binning

Cut each of the credit score bins you have already created into 5 (or any other number) bins according to the new feature you have just created.

Recommended: Save the newly created pca_bins

## Training your model[s] 

Now that your bins are ready, it is time to fit a logistic model to each of them. Write a fitting function that, given a training set with a binning as above, returns a series of logistic models, indexed by the corresponding bins. Optional: you can also accept the binning information as a parameter and perform the binning as part of the training process.
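A sketch of such a fitting function on synthetic data. The helper name `fit_models_per_bin`, the column names and the decision to skip single-class bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_models_per_bin(train, bin_cols, feature_cols, target_col):
    """Fit one logistic regression per bin; return them as a Series indexed by bin."""
    models = {}
    for key, group in train.groupby(bin_cols, observed=True):
        # A bin containing a single class cannot be fitted; skip it here.
        if group[target_col].nunique() < 2:
            continue
        models[key] = LogisticRegression().fit(group[feature_cols], group[target_col])
    return pd.Series(models)

rng = np.random.default_rng(4)
n = 1000
# Synthetic training set with two hypothetical bin columns and one feature.
toy = pd.DataFrame({
    "score_bin": rng.integers(0, 2, size=n),
    "pca_bin": rng.integers(0, 2, size=n),
    "int_rate": rng.uniform(5, 30, size=n),
})
p = 1 / (1 + np.exp(0.2 * (toy["int_rate"] - 15)))
toy["activated"] = (rng.uniform(size=n) < p).astype(int)

models = fit_models_per_bin(toy, ["score_bin", "pca_bin"], ["int_rate"], "activated")
print(models.index.tolist())
```

Bins that are skipped need a fallback at prediction time, for example a model fitted on the whole training set.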

### Finding thresholds for your models

In the setting of this assignment, the prospect of wrongly predicting an offer as activated is far worse than that of wrongly predicting it as not activated. Indeed, as the loans we consider here have already been approved, our main goal is to obtain maximum volume. Thus, wrongly predicting many offers as activated puts us at risk of overestimating the eventual volume of our portfolio, while the complementary case would result in underestimation, which is a much more manageable problem.

Therefore, while a high TPR is desirable, it is more important for us to keep the FPR low: the former indicates a low number of false negatives (i.e. reduced underestimation) compared to true positives, while the latter indicates a low number of false positives (i.e. reduced overestimation) compared to the number of true negatives.

For each bin above, find the minimal threshold for which the train FPR is at most 20%. You can use `sklearn.metrics.roc_curve` for this purpose. It is recommended that you write a function that accepts the series of models created in the previous section and returns a series, indexed in the same way, containing the threshold found for each model.
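A sketch of the threshold search for a single model, under the reading that the FPR should be kept at or below a 20% cap. The helper name `threshold_at_max_fpr` is hypothetical and the scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_max_fpr(y_true, y_score, max_fpr=0.2):
    """Smallest ROC threshold whose false positive rate does not exceed max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
    # fpr is non-decreasing while thresholds decrease, so the last admissible
    # point is the lowest threshold that still respects the cap.
    idx = np.flatnonzero(fpr <= max_fpr)[-1]
    return thresholds[idx], fpr[idx], tpr[idx]

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=500)
# Synthetic scores that are informative but noisy.
y_score = np.clip(
    0.5 + 0.25 * (2 * y_true - 1) + rng.normal(scale=0.2, size=500), 0, 1
)

thr, fpr_at_thr, tpr_at_thr = threshold_at_max_fpr(y_true, y_score)
y_pred = (y_score >= thr).astype(int)  # roc_curve treats score >= threshold as positive
print(thr, fpr_at_thr, tpr_at_thr)
```

Applying this helper to each bin's model and training subset, for instance in a loop over the models series, yields one threshold per bin.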

### Prediction

Write your prediction function. Note that in order to predict you must pass your test set through all the steps taken above in the training process and perform the prediction according to the same bins.

In addition, write a `predict_proba` function, which returns, for each offer, the probability of activation by the client.

## Checking model accuracy 

Assess your model; check accuracy score, plot confusion matrices and describe the roc plot of the model on the training and test set. What is the model's AUC? Feel free to use any other metric you find relevant.
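A sketch of the basic metrics on synthetic labels and probabilities. The 0.5 cut-off is used only to keep the example short; in the binned model you would use the per-bin thresholds instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

rng = np.random.default_rng(6)
y_true = rng.integers(0, 2, size=300)
# Synthetic predicted probabilities, informative but noisy.
y_proba = np.clip(
    0.5 + 0.2 * (2 * y_true - 1) + rng.normal(scale=0.25, size=300), 0, 1
)
y_pred = (y_proba >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
auc = roc_auc_score(y_true, y_proba)   # uses probabilities, not hard labels
print(acc, auc)
print(cm)
```

For the ROC plot itself, recent scikit-learn versions provide `sklearn.metrics.RocCurveDisplay.from_predictions(y_true, y_proba)`.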

How would you assess your model? Do you find it to be a good predictor of activation?

## Modeling using classification trees

An alternative approach to classifying data which is split in a non-obvious way is to use classification trees, which automatically split the data into bins and predict the probability of activation according to the specific bin. Since classification trees can be boosted using ensemble methods, we might as well use XGBoost for this model.

Import the `xgboost` package and create a classifier instance using the `{'objective':'binary:logistic'}` parameter. Use the same train-test split as above. You may use whichever features you wish, and perform any preprocessing on them, in order to improve your model's predictive ability.

### Checking model accuracy

Run the same checks as above to assess your model: check accuracy score, plot confusion matrices and describe the roc plot of the model on the training and test set. What is the model's AUC? Feel free to use any other metric you find relevant.

## Comparing the models

Select a sample of 5 loans and plot their clients' activation functions according to the two models. Does the prediction agree with your intuition regarding the shape of the activation function?
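A sketch of how an activation curve can be traced for one offer: hold its features fixed, vary only the interest rate over a grid, and ask the model for a probability at each rate. The helper name `activation_curve`, the column names and the simple logistic model used as a stand-in are all assumptions; any fitted model exposing `predict_proba` could be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 800
cols = ["int_rate", "credit_score"]  # hypothetical feature names
train = pd.DataFrame({
    "int_rate": rng.uniform(5, 30, size=n),
    "credit_score": rng.integers(600, 851, size=n),
})
logit = -0.2 * (train["int_rate"] - 15) + 0.01 * (train["credit_score"] - 700)
train["activated"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train[cols], train["activated"])

def activation_curve(model, offer, rate_grid, rate_col="int_rate"):
    """Predicted activation probability of one offer at each rate in rate_grid."""
    grid = pd.DataFrame([offer] * len(rate_grid)).reset_index(drop=True)
    grid[rate_col] = rate_grid  # vary only the interest rate
    return pd.Series(model.predict_proba(grid)[:, 1], index=rate_grid)

rate_grid = np.arange(5, 31)
curve = activation_curve(model, train.loc[0, cols], rate_grid)
print(curve.head())
```

Calling `curve.plot()` for each sampled loan, once per model, gives the comparison asked for above.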

Discuss the benefits and downsides of either of the models. Which do you find better for use in a business setting? Which would you recommend for deployment to production?

# Pricing

Once we have a functional predictor for the activation probability of a loan, it's time to use it in order to improve our pricing methodology. In the following path you will find features for ~78,000 accepted offers from 2019Q2-2020Q1 in Upgrade. 

Upload the data and review it.

In [None]:
features_2020 = pd.read_parquet("s3://pagaya-pro-source/data/assignment6/pricing/upg_features_2020.parquet")

What is the expected volume of the entire dataset?

In addition, in the following path you will find IRR predictions for each loan in the new dataset and for every integer interest rate in the range $[5,30]$. Load the IRR predictions and use them to assign a predicted IRR to each loan in the accepted offers table (_Note_: you may round the `int_rate` value to the nearest integer in order to use the IRR predictions table).

In [None]:
irr_preds = pd.read_parquet("s3://pagaya-pro-source/data/assignment6/pricing/irr_preds.parquet")
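A sketch of the lookup on toy data. The layout of the real IRR table is not described here, so this example assumes a long format with one row per loan and integer rate; the column names `loan_id`, `int_rate_rounded` and `irr` are hypothetical, and a wide table would need reshaping first.

```python
import numpy as np
import pandas as pd

# Toy accepted offers; `loan_id` and `int_rate` are hypothetical column names.
offers_2020 = pd.DataFrame({"loan_id": [1, 2, 3], "int_rate": [7.4, 12.6, 29.9]})

# Toy IRR table: one row per (loan, integer rate) with made-up IRR values.
rates = np.arange(5, 31)
irr_long = pd.DataFrame(
    [(lid, r, 0.002 * r - 0.001 * lid) for lid in offers_2020["loan_id"] for r in rates],
    columns=["loan_id", "int_rate_rounded", "irr"],
)

# Round to the nearest integer and clip to the range covered by the table.
offers_2020["int_rate_rounded"] = (
    offers_2020["int_rate"].round().clip(5, 30).astype("int64")
)
# A left merge keeps every offer; validate guards against duplicated keys.
priced = offers_2020.merge(
    irr_long, on=["loan_id", "int_rate_rounded"], how="left", validate="one_to_one"
)
print(priced)
```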

Find the portfolio consisting of the top 50% by volume (in \$) in terms of IRR. What is the expected volume of this portfolio? What is its mean expected IRR?
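A sketch of one possible reading of these quantities on toy numbers. It assumes that expected volume is the sum of loan amount times activation probability, that the top 50% means ranking by predicted IRR and keeping loans until half of the offered dollars are covered, and that the mean IRR is weighted by expected activated dollars. The column names `amount`, `p_activation` and `irr` are hypothetical.

```python
import pandas as pd

# Toy priced offers with made-up amounts, probabilities and IRR predictions.
priced = pd.DataFrame({
    "amount": [15_000, 10_000, 5_000, 10_000],
    "p_activation": [0.5, 0.2, 0.8, 0.4],
    "irr": [0.08, 0.12, 0.05, 0.10],
})

# Rank by predicted IRR and keep loans until half of the offered volume is covered.
ranked = priced.sort_values("irr", ascending=False)
cum_share = ranked["amount"].cumsum() / ranked["amount"].sum()
portfolio = ranked[cum_share <= 0.5]

# Expected activated dollars per loan, then portfolio-level totals.
weights = portfolio["p_activation"] * portfolio["amount"]
expected_volume = weights.sum()
mean_expected_irr = (weights * portfolio["irr"]).sum() / weights.sum()
print(expected_volume, mean_expected_irr)
```

A simple unweighted mean of `irr` is an equally defensible definition; whichever you choose, state it explicitly in your answer.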

Using the same portfolio, change the offers' interest rate in order to obtain maximum volume. What is the mean expected IRR of this portfolio?

Using the same portfolio, change the offers' interest rate in order to obtain maximum mean expected IRR. What is the expected volume of this portfolio?

# A challenge

Using the same portfolio, find a vector of interest rates which makes the portfolio optimal in the sense that any change to the portfolio's interest rates would cause either the expected volume or the mean expected IRR to decrease (such a portfolio is sometimes called _Pareto-optimal_).

Given a volume $0\le v\le \text{total volume}$, find a vector of interest rates which would make the expected volume as close as possible to $v$. Draw the curve of optimal expected volume against mean expected IRR.

_Hint_. Finding the correct vector of interest rates can be done using a greedy algorithm which selects, at each step, the loan for which changing the interest rate by $1\%$ would give the largest (resp. smallest) change in activation for the smallest (resp. largest) change in IRR.