<a href="https://colab.research.google.com/github/CBravoR/AdvancedAnalyticsLabs/blob/master/notebooks/python/Lab_LGD_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LGD Modelling

In this lab, we will model the LGD using two techniques: a linear regression and a fitted-distribution regression; a non-linear model is left as an exercise at the end. LGD models are particularly tricky because LGDs tend to have oddly shaped distributions, so traditional methods rarely fit them well.

First, we will load and study the data.


In [None]:
!gdown https://drive.google.com/uc?id=1nldxUFNGDziLZgE-fJv5KmNjnbdM29na

In [None]:
import numpy as np
import pandas as pd

In [None]:
LGD_data = pd.read_csv('LGD.csv')
LGD_data.describe()

Let's create a test / train split.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_LGD_test = train_test_split( LGD_data.iloc[:, 0:13], # Predictors
                                                    LGD_data['LGD'],         # Target variable
                                                    test_size=0.33,          # Test size percentage
                                                    random_state=20201209    # Seed
                                                    )

And finally let's plot the LGD distribution.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_theme(style="darkgrid")

In [None]:
# Create the figure with the density
fig = sns.displot(LGD_data['LGD'], kind = 'kde')

# Create a density histogram
sns.histplot(LGD_data['LGD'], stat = 'density')

# Plot the whole thing
plt.show()

As we can see, the LGD has an unbalanced bimodal distribution between 0 and 1.

## Linear regression

We will now try to fit a basic linear regression and see its performance. For this we will use the linear regression implementation of scikit-learn, [```linear_model```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). We will also regularize using ElasticNet.
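
For reference, this is a sketch of the objective that scikit-learn documents for ElasticNet. Here $\alpha$ is the overall penalty strength and $\rho$ is the `l1_ratio`; the symbols $w$, $X$, $y$ and $n$ (coefficients, predictors, target and number of samples) are notation introduced here, not taken from the lab.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty is pure LASSO, and as $\rho$ approaches 0 it approaches a ridge penalty, which is why it is worth trying a grid of `l1_ratio` values.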

In [None]:
# Importing the package.
from sklearn.linear_model import ElasticNetCV

In [None]:
LGD_linear_model = ElasticNetCV(l1_ratio=np.arange(0.01, 1.01, 0.05),  # l1_ratios to try.
                                n_alphas=10,                        # How many alphas to try per l1_ratio
                                fit_intercept=True,                 # Use constant?
                                max_iter=1000,                      # Iterations
                                tol=0.0001,                         # Parameter tolerance
                                cv=3,                               # Number of cross_validation folds
                                verbose=True,                       # Explicit or silent training
                                n_jobs=2,                           # Cores to use
                                random_state=20201209               # Random seed
                                )

In [None]:
LGD_linear_model.fit(x_train, y_train)

Let's check the output.

In [None]:
coef_df = pd.concat([pd.DataFrame({'column': x_train.columns}), 
                    pd.DataFrame(np.transpose(LGD_linear_model.coef_))],
                    axis = 1
                   )

coef_df

We can see some variables are not relevant. Let's check the goodness of fit over the test set.

In [None]:
# Predict over test set
linear_pred_test = LGD_linear_model.predict(x_test)

# Calculate the error
linear_error = np.abs(linear_pred_test - y_LGD_test)

# Print a scatter plot with distributions.
fig, ax = plt.subplots(figsize=(11, 8.5))
sns.scatterplot(x = y_LGD_test,            # The x is the real value
                y = linear_pred_test,  # The y value is the predictor
                hue = linear_error,    # The colour represents the error
                legend = False
                )

# Overlay a diagonal line
X_plot = np.linspace(0, 1, 100)
Y_plot = X_plot

plt.plot(X_plot, Y_plot, color='r')

plt.show()

We can see several values are predicted to be below 0, while many are predicted below their true value, particularly for large LGDs. The darker a point, the larger its error. Let's now compute the mean squared error (MSE).

In [None]:
from sklearn.metrics import mean_squared_error

linear_mse = mean_squared_error(y_LGD_test, linear_pred_test)
print('The MSE for the linear model is %.4f' % linear_mse)
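
Since an LGD below 0 makes no sense, one quick check is to count the out-of-range predictions and see how the MSE changes if they are clipped to [0, 1]. The following is only a minimal sketch on made-up numbers: `y_true` and `y_pred` are hypothetical stand-ins for `y_LGD_test` and `linear_pred_test`, and clipping is an illustrative patch, not a method used in this lab.

```python
import numpy as np

# Hypothetical true LGDs and linear predictions (made-up values).
y_true = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
y_pred = np.array([-0.2, 0.15, 0.4, 0.7, 1.1])

# How many predictions fall outside the valid [0, 1] range?
n_out_of_range = int(np.sum((y_pred < 0) | (y_pred > 1)))

# Clip the predictions to the valid range and compare the MSE.
y_clipped = np.clip(y_pred, 0.0, 1.0)
mse_raw = np.mean((y_true - y_pred) ** 2)
mse_clipped = np.mean((y_true - y_clipped) ** 2)

print('Out of range: %d' % n_out_of_range)
print('MSE raw: %.4f, MSE clipped: %.4f' % (mse_raw, mse_clipped))
```

Clipping can never increase the squared error when the true values lie in [0, 1], but it does not fix the shape of the predicted distribution, which is what the next section addresses.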

## General Linear Regression Fit

Regressions with an arbitrary output distribution are not readily available in the Python libraries we are using. This means we should use the GLM trick that we saw in the lectures to estimate a regression model that has an appropriate output distribution. Let's see how this would work out.

The first step is to look for the best distribution for our data. For this we can use the [```fitter```](https://github.com/cokelaer/fitter) package, which tries to find the best distribution among all those available in scipy. Let's install it and load it.

In [None]:
!pip install fitter
import fitter

Now we can look for the best distribution. The process is:
1. Create the fitter object.
2. Fit it over our LGD data.
3. Pick the best distribution among all available.

In [None]:
# Generate the fitter object.
dists_LGD = fitter.Fitter(LGD_data['LGD'],      # The data
                          timeout = 30,         # How long to wait before timeout. Some distributions are very hard to fit!
                          distributions = None, # Optionally you can give distributions. None means all of them, ironically.
                          )


Not all distributions are suitable for our problem, and fitting every one of them can greatly increase the fitting time. Let's restrict the candidates to those we believe might be adequate for our case.

In [None]:
# Get the full list of distributions.
dists_LGD.distributions

In [None]:
# Pick a few.
dists_LGD.distributions = ['beta', 'gamma', 'mielke', 'lognorm']

In [None]:
# Fit it
dists_LGD.fit(n_jobs = -1,      # How many cores to use.
              progress = True  # Show progress bar
              )

In [None]:
dists_LGD.summary()

We can see the best distributions are the Mielke distribution (Mielke's Beta-Kappa distribution, also known as the Dagum distribution, used for example to model physical phenomena) and the gamma distribution.
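
For intuition on how such a ranking can be produced, the sketch below fits a few scipy distributions by maximum likelihood and scores each one by the sum of squared errors between its fitted density and a histogram of the data. This mirrors the idea behind the error column in the summary, but it is an assumption-laden illustration and not the package's actual code: the sample, the bin count and the candidate list are all hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the LGD column: 2,000 draws from a Beta(2, 5).
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=2000)

# Empirical density evaluated on a histogram grid.
density, edges = np.histogram(sample, bins=50, density=True)
centres = (edges[:-1] + edges[1:]) / 2

sse = {}
for name in ['beta', 'gamma', 'norm']:
    dist = getattr(stats, name)
    try:
        params = dist.fit(sample)                 # Maximum-likelihood fit
        fitted_pdf = dist.pdf(centres, *params)   # Fitted density on the grid
        sse[name] = float(np.sum((fitted_pdf - density) ** 2))
    except Exception:
        # Some fits can fail to converge; skip those candidates.
        continue

# The candidate with the smallest squared error is the "best" one.
best = min(sse, key=sse.get)
print(sse)
print('Best candidate:', best)
```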

Let's use [Mielke's distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mielke.html) for this example. Every dataset can have its own distribution!

In [None]:
dists_LGD.get_best()

Mielke's density function has four parameters: k, s, loc and scale, which are conveniently given in the output above. We see there is no translation (loc is close to 0), but we do need to scale the distribution a bit (the fourth parameter). The k and s parameters give the shape of the distribution.

What functions do we need? Well, the general process for a regression of this type is:

1. Get where on the original cumulative distribution a point (the LGD) falls. For this we need the cumulative Mielke distribution, called the ```cdf``` function in scipy.
2. Get where on the normal distribution that particular point falls. For this we need the inverse of the cumulative function, also called the **percent point function** or ```ppf``` in scipy.
3. Apply this to all points in the dataset. Now everything is mapped to a normal variable.
4. Run a linear regression between our regressors and the z-transformed variable. You can use LASSO, ElasticNet, etc to get the best model.
5. Go back. For this you need the cumulative normal distribution function (```cdf```), which maps the predicted z-value back to a probability, and then the inverse cumulative distribution function of our target distribution (Mielke's ```ppf``` function).
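
The forward and backward mappings in the steps above can be sketched on a handful of toy values. This is a minimal illustration, assuming made-up shape parameters for the Mielke distribution (`k = 2`, `s = 4`) rather than the ones fitted in this lab, and skipping the regression step entirely; the names `target`, `y`, `u`, `z` and `y_back` are introduced here.

```python
import numpy as np
from scipy.stats import mielke, norm

# Hypothetical shape parameters, chosen only for illustration.
target = mielke(2.0, 4.0)

# Toy LGD values inside the (0, 1) range.
y = np.array([0.05, 0.2, 0.5, 0.8, 0.95])

# Forward: position on the target cdf, then the matching normal z-value.
u = target.cdf(y)
z = norm.ppf(u)

# (The regression on z would happen here; it is skipped in this sketch.)

# Backward: normal cdf to get a probability, then the target ppf.
y_back = target.ppf(norm.cdf(z))

print(z)
print(np.allclose(y, y_back))
```

Both mappings are monotonic, so the ordering of the LGDs is preserved on the z scale and the round trip recovers the original values.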

Let's import all required functions.

In [None]:
# Import the functions
from scipy.stats import mielke, norm

In [None]:
# Set the parameters to particular values.
LGD_mielke = mielke(*dists_LGD.fitted_param['mielke'])
LGD_normal = norm()

Let's begin the calculations. The first step is to get the Mielke CDF value of every LGD in the dataset and find its corresponding z-value in the normal distribution.

In [None]:
# Get the Mielke cdf point (vectorized over the whole column).
LGD_data['MielkeCDF'] = LGD_mielke.cdf(LGD_data['LGD'])

# Get the corresponding z-value in the normal function
LGD_data['Z-Mielke'] = norm.ppf(LGD_data['MielkeCDF'])
LGD_data['Z-Mielke'].describe()
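
One thing worth checking in the `describe()` output: a CDF value of exactly 0 or 1 maps to an infinite z-value, which would break the regression. Whether this occurs depends on the data and on the fitted parameters. The following is a minimal sketch of a guard on toy values, assuming a small tolerance `eps` (a hypothetical choice, not part of the lab).

```python
import numpy as np
from scipy.stats import norm

# Toy cdf values, including both edges of the [0, 1] interval.
u = np.array([0.0, 0.3, 1.0])

# Without a guard, the edges map to -inf and +inf.
z_raw = norm.ppf(u)

# Hypothetical tolerance: keep probabilities strictly inside (0, 1).
eps = 1e-6
z_safe = norm.ppf(np.clip(u, eps, 1 - eps))

print(np.isfinite(z_raw))
print(np.isfinite(z_safe))
```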

Our target is now mapped onto a standard normal scale. Now we are ready to run the regression! We can use the same code as before, but our target will now be the newly calculated Z-Mielke variable.

In [None]:
LGD_mielke_model = ElasticNetCV(l1_ratio=np.arange(0.01, 1.01, 0.05),  # l1_ratios to try.
                                n_alphas=10,                        # How many alphas to try per l1_ratio
                                fit_intercept=True,                 # Use constant?
                                max_iter=1000,                      # Iterations
                                tol=0.0001,                         # Parameter tolerance
                                cv=3,                               # Number of cross_validation folds
                                verbose=True,                       # Explicit or silent training
                                n_jobs=2,                           # Cores to use
                                random_state=20201209               # Random seed
                                )

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_mielke_test = train_test_split( LGD_data.iloc[:, 0:13], # Predictors
                                                    LGD_data['Z-Mielke'],         # Target variable
                                                    test_size=0.33,          # Test size percentage
                                                    random_state=20201209    # Seed
                                                    )

In [None]:
LGD_mielke_model.fit(x_train, y_train)

In [None]:
coef_df = pd.concat([pd.DataFrame({'column': x_train.columns}), 
                    pd.DataFrame(np.transpose(LGD_mielke_model.coef_))],
                    axis = 1
                   )

coef_df

Similar variables are relevant now, but the weights have clearly changed! We can now apply this model to the test data and then calculate the corresponding LGD by reversing our procedure.

In [None]:
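# Predict over test set (the model outputs z-values)
mielke_pred_test = LGD_mielke_model.predict(x_test)
# Map the z-values to probabilities with the normal cdf
mielke_pred_test = norm.cdf(mielke_pred_test)
# Map the probabilities back to LGDs with the Mielke ppf
mielke_pred_test = LGD_mielke.ppf(mielke_pred_test)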

# Calculate the error
mielke_error = np.abs(mielke_pred_test - y_LGD_test)


Now that we have the estimates and the error, we can plot our results and calculate the MSE.

In [None]:

# Print a scatter plot with distributions.
fig, ax = plt.subplots(figsize=(11, 8.5))
sns.scatterplot(x = y_LGD_test,            # The x is the real value
                y = mielke_pred_test,  # The y value is the predictor
                hue = mielke_error,    # The colour represents the error
                legend = False
                )

# Overlay a diagonal line
X_plot = np.linspace(0, 1, 100)
Y_plot = X_plot

plt.plot(X_plot, Y_plot, color='r')

plt.show()

In [None]:
from sklearn.metrics import mean_squared_error

# Store the result under its own name so the linear model's MSE is not overwritten.
mielke_mse = mean_squared_error(y_LGD_test, mielke_pred_test)
print('The MSE for the Mielke-distributed model is %.4f' % mielke_mse)

So we got a lower error! The improvement is not extreme in this dataset, but besides a lower error we also get a better distribution: our model's predictions start at 0 and cover most of the original range. We can use this trick to create a regression for any distribution we want. As an exercise, train an XGBoost model on this data and compare it with our Mielke-distributed model. Can you improve the MSE with a non-linear model?