<a href="https://colab.research.google.com/github/CBravoR/AdvancedAnalyticsLabs/blob/master/notebooks/python/Lab_2_Capital_Requirements_and_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capital requirements and Pandas

In this lab, we'll implement the formulas for capital requirements and apply them to a complete dataset. To do this we will use the excellent [```pandas```](https://pandas.pydata.org/) package, a general-purpose data handling library.

***Important self-study: Go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/docs/user_guide/10min.html).***

## Reading the data

For this exercise we will read a dataset from credit scoring. I previously uploaded the data to Google, and it is available at https://docs.google.com/spreadsheets/d/1Am74y2ZVQ6dRFYVZUv_VoyP-OTS8BM4x0svifHQvtNc/export?gid=819627738&format=csv

The dataset (called **Bankloan**, from IBM) has a set of 1,000 loans with default information. It includes the following variables:

- Customer: ID, or unique label, of the borrower (NOT predictive).
- Age: Age of the borrower in years.
- Education: Maximum education level the borrower reached. 1: Completed primary. 2: Completed secondary. 3: Incomplete higher ed. 4: Completed higher ed. 5: Postgraduate studies (completed MSc or PhD).
- Employ: Years at current job.
- Address: Years at current address.
- Income: Income in ‘000s USD.
- Leverage: Debt/Income Ratio.
- CredDebt: Outstanding credit card debt.
- OthDebt: Other debt in ‘000s USD.
- MonthlyLoad: Monthly percentage from salary used to repay debts.
- Default: 1 If default has occurred, 0 if not (Target variable).
- PD: The calibrated probability of default of the loan.
- LGD: The estimated LGD for the loan.
- Outstanding: EAD.

The goal is to construct a model to predict whether the loan is going to default or not. We will use this dataset for the next few labs.

To actually get the data, we could:

- Download the file following the link.
- Upload the file to our Google Drive.
- Connect the Google Drive to our own Colab session.
- Import the file

This is... tedious. The second alternative is to simply download the file from the web directly to our session. This can be done with the command-line tool ```gdown```. This is NOT a Python statement but a command run by the operating system, so we need to invoke it with the prefix ```!```, which means "run this in the operating system".

The command is

```
gdown Google_Path
```

In [None]:
!gdown https://drive.google.com/uc?id=1lyEd01JaoVbL1mbgn-wr3YvLmURAgQ8B

Note that the file is downloaded to ```/content/FILENAME```. To check what we downloaded we can use the ```head``` OS command, which prints the first lines of a file.

In [None]:
!head /content/bankloan_scored_nodefault.csv

## Pandas

Now we will use Pandas to read the CSV file. The function to do so is [```read_csv```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). We will store the results in a variable named ```bankloan_data```.

In [None]:
import pandas as pd

bankloan_data = pd.read_csv('/content/bankloan_scored_nodefault.csv')

Now we can start exploring the data. First, a list of the variables and their types:

In [None]:
bankloan_data.dtypes

```int64``` columns hold integers, ```float64``` columns hold decimals, and ```object``` is a general type; in this case, text.
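To make the type listing concrete, here is a minimal sketch on a made-up three-row sample. The column names mirror the Bankloan dataset, but the values and the `sample` dataframe are hypothetical. Depending on the pandas version, the text column may be reported as ```object``` or as a dedicated string type.

```python
import pandas as pd

# Hypothetical sample: column names follow the dataset, values are made up
sample = pd.DataFrame({
    "Customer": ["C001", "C002", "C003"],   # text identifiers
    "Age": [34, 51, 27],                    # integers
    "Income": [42.5, 88.0, 19.3],           # decimals
})

# One type per column
print(sample.dtypes)

# select_dtypes picks columns by type, here only the numerical ones
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Selecting by type like this is handy before calling functions that only make sense for numerical variables.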

Using the ```describe``` function we can get summary statistics of the numerical variables.

In [None]:
bankloan_data.describe()

To get an idea of the different distributions of the data, we can plot the histograms of the variables. First, we import the graphic environment.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
histograms = bankloan_data.hist()

In [None]:
bankloan_data.loc[:, 'Age'].iloc[1:10]
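The cell above chains two indexers: ```loc``` selects by label (all rows, column ```Age```) and ```iloc``` selects by position (positions 1 to 9, since the end of the slice is excluded and position 0 is skipped). A minimal sketch on a hypothetical dataframe, where only the ```Age``` column name comes from the dataset:

```python
import pandas as pd

# Hypothetical data with string row labels to make the difference visible
toy = pd.DataFrame({"Age": [25, 31, 47, 52, 38]},
                   index=["a", "b", "c", "d", "e"])

ages = toy.loc[:, "Age"]   # label-based: every row, the column named 'Age'
subset = ages.iloc[1:3]    # position-based: rows at positions 1 and 2

print(subset)
```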

However, there is a far more powerful package for visualizing data that works directly with Pandas dataframes: [seaborn](https://seaborn.pydata.org/introduction.html). Let's visualize the dataset using this tool.

In [None]:
import seaborn as sns
import numpy as np

In [None]:
sns.set(color_codes=True)
sns.pairplot(bankloan_data, hue = 'Default')

In a future lab we will focus on data cleaning and what we should look for in this dataset, but for now we can properly calculate the Basel Capital Requirements.

## Basel III Capital Requirements

Recalling the last lecture, the equation for the capital requirement of any operation is:

$$
K = LGD \cdot \left\{ N\left( \sqrt{\frac{1}{1-R}} \cdot N^{-1}(PD) + \sqrt{\frac{R}{1-R}} \cdot N^{-1}(0.999) \right) - PD \right\} \left( \frac{1 + (M - 2.5)b}{1 - 1.5b}\right)
$$
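The last factor in the equation is the maturity adjustment. As a rough sketch, the hypothetical helper ```maturity_adjustment``` below evaluates it, using the expression for $b$ as I recall it from the Basel framework for corporate exposures, $b = (0.11852 - 0.05478 \ln(PD))^2$. Treat that expression as an assumption and verify it against the regulation before relying on it.

```python
import numpy as np

def maturity_adjustment(PD, M):
    # Assumed Basel expression for b (corporate exposures); check the regulation
    b = (0.11852 - 0.05478 * np.log(PD)) ** 2
    # Last factor of the capital requirement equation
    return (1 + (M - 2.5) * b) / (1 - 1.5 * b)

# With M = 1 the numerator equals the denominator, so the factor is neutral
for M in (1.0, 2.5, 5.0):
    print('M = %.1f, adjustment = %.4f' % (M, maturity_adjustment(0.01, M)))
```

Whatever the value of $b$, setting $M = 1$ makes the numerator and the denominator identical, so the adjustment equals one.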

The values of $b$ and $M$ will be variable for bonds, but for retail and mortgages the maturity is fixed at 1, and the $b$ term disappears. The correlations are given by the regulation:

- Mortgages: $R = 0.15$
- Revolving: $R = 0.04$
- Other retail: $R = 0.03 \left( \frac{1 - e^{-35PD}}{1 - e^{-35}} \right) + 0.16 \left( 1 - \frac{1 - e^{-35PD}}{1 - e^{-35}} \right)$
- Corporate and sovereign exposures: $R = 0.12 \left( \frac{1 - e^{-50PD}}{1 - e^{-50}} \right) + 0.24 \left( 1 - \frac{1 - e^{-50PD}}{1 - e^{-50}} \right)$
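The correlations listed above can be sketched in code as follows. The function and constant names are hypothetical, chosen only for this illustration; the numbers are the ones in the list.

```python
import numpy as np

# Fixed correlations
CORRELATION_MORTGAGE = 0.15
CORRELATION_REVOLVING = 0.04

def _pd_weight(PD, k):
    # Weight that moves from 0 (PD = 0) to 1 (PD = 1)
    return (1 - np.exp(-k * PD)) / (1 - np.exp(-k))

def correlation_other_retail(PD):
    w = _pd_weight(PD, 35)
    return 0.03 * w + 0.16 * (1 - w)

def correlation_corporate(PD):
    w = _pd_weight(PD, 50)
    return 0.12 * w + 0.24 * (1 - w)

for pd_value in (0.001, 0.01, 0.1):
    print('PD = %.3f, retail R = %.4f, corporate R = %.4f'
          % (pd_value, correlation_other_retail(pd_value),
             correlation_corporate(pd_value)))
```

Both PD-dependent correlations interpolate between an upper bound for very safe borrowers and a lower bound for very risky ones, so the correlation falls as the PD rises.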

With this we can calculate the capital requirements and the Risk Weighted Assets (RWA) for this portfolio. Let's start by implementing the capital requirement function. Note that we require the cumulative normal distribution and its inverse. For this we will use numpy's sister package [```scipy```](https://scipy.org/), which includes the traditional statistical models and quantities for classic stats (not analytics!) in its subpackage [```stats```](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html).

Within the package stats, we find the statistical distribution we need: [```norm```](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm), the standard normal. Within it, we can call the cumulative function (```norm.cdf```) and the inverse function, ```norm.ppf``` which stands for *[percent point function](https://stackoverflow.com/questions/20626994/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p)*.
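As a quick check of these two functions, the sketch below evaluates the 99.9% quantile used in the capital formula and confirms that ```norm.cdf``` undoes ```norm.ppf```. The variable name ```q``` is only for illustration.

```python
from scipy.stats import norm

# Quantile of the standard normal at 99.9%, roughly 3.09
q = norm.ppf(0.999)
print(q)

# The cumulative function maps the quantile back to the probability
print(norm.cdf(q))

# Half of the probability mass lies below zero
print(norm.cdf(0.0))
```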



In [None]:
def capital_requirement_retail(PD, LGD):
  import numpy as np
  from scipy.stats import norm
  # First part of the equation, lower correlation
  R =  0.03 * ( (1 - np.exp(-35 * PD)) / (1 - np.exp(-35)) )
  # Second part of the equation, higher correlation 
  R += 0.16 * (1 - ( (1 - np.exp(-35 * PD)) / (1 - np.exp(-35)) ) )
  # Now we can calculate the capital
  K = norm.cdf(np.sqrt( (1 - R) ** (-1) ) * norm.ppf(PD) + 
               np.sqrt( R / (1 - R) ) * norm.ppf(0.999) ) - PD
  K *= LGD
  return K

In [None]:
# 50% PD rate and LGD = 0.5
print('PD = 0.5 & LGD = 0.5. K = %.3f' % capital_requirement_retail(0.5, 0.5))

# PD = 1 and LGD = 1
print('PD = 1 & LGD = 1. K = %.3f' % capital_requirement_retail(1, 1))

# PD = 0.99 and LGD = 0.99
print('PD = 0.99 & LGD = 0.99. K = %.3f' % capital_requirement_retail(0.99, 0.99))


We can see the capital requirement is a non-linear function of the PD and LGD (**why?**). 

Let's study the plot for a fixed LGD.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Nseries = np.arange(0, 1, 0.001)
plt.plot(Nseries, capital_requirement_retail(Nseries, 1))
plt.show()
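To complement the plot, this sketch locates the PD at which the curve peaks on a grid. It restates the retail formula in compact form so that it runs on its own; ```pd_grid```, ```k_values``` and ```peak``` are names introduced only for this example, and the grid starts at 0.001 rather than 0.

```python
import numpy as np
from scipy.stats import norm

def capital_requirement_retail(PD, LGD):
    # Same "other retail" formula as in the lab, written compactly
    w = (1 - np.exp(-35 * PD)) / (1 - np.exp(-35))
    R = 0.03 * w + 0.16 * (1 - w)
    K = norm.cdf(np.sqrt(1 / (1 - R)) * norm.ppf(PD)
                 + np.sqrt(R / (1 - R)) * norm.ppf(0.999)) - PD
    return LGD * K

pd_grid = np.arange(0.001, 1, 0.001)
k_values = capital_requirement_retail(pd_grid, 1)

# Position of the largest capital requirement on the grid
peak = np.argmax(k_values)
print('K peaks at PD = %.3f with K = %.3f' % (pd_grid[peak], k_values[peak]))
```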

With this we can now calculate the capital requirement of the portfolio, applying the function to every loan in the dataset. For this we need to use two different functions: the [```apply```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) function from Pandas, which applies a function to rows or columns from a dataset, and a short [lambda function](https://www.w3schools.com/python/python_lambda.asp). 

A lambda function is, quite simply, a short anonymous function defined inline, where you specify some inputs and an output. In this case, our Basel capital requirement function needs only the PD and LGD of each row, but Pandas will pass the whole row to the function. With a lambda function we can specify which columns to use, as follows:

In [None]:
bankloan_data['CapitalReq'] = bankloan_data.apply(lambda x : capital_requirement_retail(x['PD'], x['LGD']), axis = 1)
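Because the capital function is built from numpy and scipy operations, it should also accept whole columns at once. The sketch below compares the row-wise ```apply``` call with a direct column-wise call on a hypothetical three-loan dataframe; the PD and LGD values are made up, and the function is restated compactly so the example runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def capital_requirement_retail(PD, LGD):
    # Same "other retail" formula as in the lab, written compactly
    w = (1 - np.exp(-35 * PD)) / (1 - np.exp(-35))
    R = 0.03 * w + 0.16 * (1 - w)
    K = norm.cdf(np.sqrt(1 / (1 - R)) * norm.ppf(PD)
                 + np.sqrt(R / (1 - R)) * norm.ppf(0.999)) - PD
    return LGD * K

# Hypothetical loans, for illustration only
toy = pd.DataFrame({'PD': [0.01, 0.05, 0.20], 'LGD': [0.45, 0.60, 0.30]})

# Row by row, as in the lab
row_wise = toy.apply(lambda x: capital_requirement_retail(x['PD'], x['LGD']), axis=1)

# Whole columns in a single call
vectorized = capital_requirement_retail(toy['PD'], toy['LGD'])

print(np.allclose(row_wise, vectorized))
```

On large datasets the column-wise call usually avoids most of the per-row overhead of ```apply```.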

Now we have calculated the capital requirements into a new column of the dataframe, ```CapitalReq```! To see what it looks like:

In [None]:
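# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
sns.histplot(bankloan_data['CapitalReq'], kde=True)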

And we can finally calculate the maximum Risk Weighted Asset (RWA) value. Assuming a factor $F = 8\%$, remember that:

$$
RWA = \frac{1}{F} \cdot K \cdot EAD
$$

In retail lending the Exposure at Default is equal to the outstanding amount, leading to:

In [None]:
RWA = (1 / 0.08) * np.dot(bankloan_data['CapitalReq'], bankloan_data['Outstanding'])
RWA
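The dot product in the cell above is just the sum of the per-loan products $K \cdot EAD$. A minimal sketch with three hypothetical loans (the numbers are made up):

```python
import numpy as np

# Hypothetical capital requirements (as fractions) and exposures
K = np.array([0.05, 0.10, 0.02])
EAD = np.array([1000.0, 2500.0, 400.0])
F = 0.08

# RWA loan by loan, then for the whole portfolio
rwa_per_loan = K * EAD / F
rwa_total = np.dot(K, EAD) / F

print(rwa_per_loan)
print(rwa_total, rwa_per_loan.sum())
```

Keeping the per-loan values is useful when you want to see which loans drive the portfolio total.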

Every bank must hold a different fraction of its RWA as capital, depending on its own characteristics. If, for example, the bank had a 12% requirement, then the corresponding figure would be:

In [None]:
RWA = (1 / 0.12) * np.dot(bankloan_data['CapitalReq'], bankloan_data['Outstanding'])


# To format money correctly
import locale
locale.setlocale( locale.LC_ALL, '' )

# Display
out = locale.currency( RWA, grouping=True )
print('The maximum value for the RWA at a 12% capital requirement is equal to ' + out)
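The ```locale``` module depends on how the machine is configured, so the output can differ between environments. As an alternative sketch, Python's own format specification can produce a similar result; the helper ```format_usd``` is hypothetical and simply assumes a dollar sign.

```python
def format_usd(amount):
    # Comma as thousands separator, two decimals, assumed USD symbol
    return f"${amount:,.2f}"

print(format_usd(1234567.891))   # $1,234,567.89
```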

And that's it! Note that this example omits a few important steps: we did not check the PD or LGD floors that are in effect in Basel III, and the outstanding amount was given to you. You are now ready to answer the quantitative question for coursework 1!
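As a rough sketch of how floors could be applied before computing the capital requirement, the Pandas ```clip``` method raises any value below a threshold up to that threshold. The floor values and the data below are placeholders chosen for illustration, not the regulatory figures; look those up in the regulation for the relevant asset class.

```python
import pandas as pd

# Placeholder floors, for illustration only (NOT the regulatory values)
PD_FLOOR = 0.0005
LGD_FLOOR = 0.25

# Hypothetical loans
toy = pd.DataFrame({'PD': [0.0001, 0.02, 0.30], 'LGD': [0.10, 0.45, 0.80]})

# Values below the floor are raised to it, the rest are left untouched
toy['PD_floored'] = toy['PD'].clip(lower=PD_FLOOR)
toy['LGD_floored'] = toy['LGD'].clip(lower=LGD_FLOOR)

print(toy)
```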

## Self-Study

During the lab we used the ```apply``` function to cast a formula throughout a dataset. This uses only a limited amount of resources. To speed this up, and take advantage of modern machines with more than one core, we can use a multicore approach.

This is called **parallelizing**. The idea is to run your apply function using all available cores, as the rows do not interfere with each other.

To do so, read through the tutorial of the package [swifter](https://github.com/jmcarpenter2/swifter). Modify the apply call to include this multicore call, and test the running time. Your code should be much faster. 