<a href="https://colab.research.google.com/github/CBravoR/AdvancedAnalyticsLabs/blob/master/notebooks/python/Lab_2_Capital_Requirements_and_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lab 2 - Capital Requirements and Using Pandas

## Reading the data

For this exercise we will read a dataset from credit scoring. I previously uploaded the data to Google, and it is available at https://docs.google.com/spreadsheets/d/1Am74y2ZVQ6dRFYVZUv_VoyP-OTS8BM4x0svifHQvtNc/export?gid=819627738&format=csv

The dataset (called **Bankloan**, from IBM) has a set of 1,000 loans with default information. It includes the following variables:

- Customer: ID, or unique label, of the borrower (NOT predictive).
- Age: Age of the borrower in years.
- Education: Maximum education level the borrower reached. 1: Complete primary. 2: Complete secondary. 3: Incomplete higher ed. 4: Complete higher ed. 5: With postgraduate studies (complete MSc or PhD).
- Employ: Years at current job.
- Address: Years at current address.
- Income: Income in thousands of USD.
- Leverage: Debt/Income Ratio.
- CredDebt: Credit card standing debt.
- OthDebt: Other debt in thousands of USD.
- MonthlyLoad: Monthly percentage from salary used to repay debts.
- Default: 1 If default has occurred, 0 if not (Target variable).
- PD: The calibrated probability of default of the loan.
- LGD: The estimated LGD for the loan.
- Outstanding: EAD.

Target: Whether the loan is going to default or not (Default variable)

First, we will import the data. As the data is in a Google Drive folder, we will use the [gdown](https://github.com/wkentaro/gdown) utility to download it. Note that the command has an exclamation mark before it, which means *run this in the terminal and not in Python*. gdown can be used either from within a Python session or from the terminal, as we are doing now.

In [None]:
# Download the dataset from Google Drive
!gdown https://drive.google.com/uc?id=1lyEd01JaoVbL1mbgn-wr3YvLmURAgQ8B

In [None]:
# Check the first few lines of the dataset
!head /content/bankloan_scored_nodefault.csv

Now we will import Pandas. [Pandas](https://pandas.pydata.org/) is the best-known data management package in Python. It implements most of the functionality of R's dataframes in Python. However, it is not very efficient: it is a good alternative for small datasets, but it has significant speed and memory issues with large ones. We will use better alternatives later in the course. For now, however, it is a great option.

Let's import pandas and the numerical analysis package, numpy.

In [None]:
import pandas as pd
import numpy as np

We now read the data. Pandas can read from and write to many different file formats. The following cells read from a CSV file and then show the first 10 rows.

In [None]:
bankloan_data = pd.read_csv('./bankloan_scored_nodefault.csv', sep=',')

In [None]:
bankloan_data.head(10)
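
As a minimal sketch of reading and writing with Pandas, the snippet below round-trips a tiny table through CSV text held in memory. The table `toy` and its values are invented for illustration and are not taken from the Bankloan file.

```python
import io
import pandas as pd

# Hypothetical three-row table; the values are made up for illustration
toy = pd.DataFrame({'Customer': ['C1', 'C2', 'C3'],
                    'Age': [25, 41, 37],
                    'Income': [31.0, 120.0, 54.5]})

# Write to CSV text (without the row index), then read it back
csv_text = toy.to_csv(index=False)
toy_back = pd.read_csv(io.StringIO(csv_text), sep=',')

# Show the first two rows and the inferred data types
print(toy_back.head(2))
print(toy_back.dtypes)
```

Passing a file path instead of the in-memory buffer works the same way, which is what the cell above does with the downloaded file.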

The property ```dtypes``` shows the data types of all columns.

In [None]:
bankloan_data.dtypes

And the function ```describe()``` shows the summary statistics of the numerical variables. Note that categorical variables are not displayed by default.

In [None]:
bankloan_data.describe()
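
If non-numerical columns should be summarised too, ```describe()``` accepts an ```include``` argument. The sketch below uses a small invented table (the `Segment` column is hypothetical and does not exist in the Bankloan data) to show the difference.

```python
import pandas as pd

# Hypothetical mixed-type table, invented for illustration
toy = pd.DataFrame({'Age': [25, 41, 37, 52],
                    'Segment': ['A', 'B', 'A', 'A']})

# Default behaviour: only the numerical column is summarised
numeric_summary = toy.describe()

# include='all' adds count / unique / top / freq for non-numerical columns
full_summary = toy.describe(include='all')

print(numeric_summary)
print(full_summary)
```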

## Basic data exploration

Let's explore the data a bit further. Python has the powerful [```matplotlib```](https://matplotlib.org/) package and its [```pyplot```](https://matplotlib.org/stable/users/explain/quick_start.html) interface, which creates graphs using a programmatic style modelled on Matlab's plotting commands that has been around for many years. ```matplotlib``` is a very large and flexible package. Let's make some simple plots.

In [None]:
# Import the matplotlib library
import matplotlib.pyplot as plt

# Indicate to the notebooks plots should be rendered inline
%matplotlib inline

As a first plot, we will use one that comes preprogrammed in Pandas. ```hist()``` is a function that provides a histogram of all variables in a dataset.

In [None]:
histograms = bankloan_data.hist()

Now let's slice and dice the data a bit. Pandas has several ways to do this. In general, we can split slicing and dicing into three groups:

1. Slicing and dicing using variable names. This one requires no extra property.
2. Slicing and dicing based on attributes or characteristics of the data, or doing conditional filters. These are indexed using the ```.loc``` property.
3. Slicing and dicing using numerical indexes. This one requires using the ```.iloc``` property.

Note these are properties, not functions, so they use square brackets, not round parentheses.
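
The three approaches can be sketched side by side on a small invented table (the values below are hypothetical and only serve to make the results easy to check by eye):

```python
import pandas as pd

# Hypothetical table, invented for illustration
toy = pd.DataFrame({'Age': [25, 41, 37, 52],
                    'Income': [31.0, 120.0, 54.5, 80.0]})

# 1. By variable name: square brackets directly on the dataframe
ages = toy['Age']

# 2. By label or condition with .loc: rows where Age < 40, column 'Income'
young_income = toy.loc[toy['Age'] < 40, 'Income']

# 3. By integer position with .iloc: first two rows, first column
first_block = toy.iloc[0:2, 0:1]

print(ages.tolist())          # [25, 41, 37, 52]
print(young_income.tolist())  # [31.0, 54.5]
print(first_block.shape)      # (2, 1)
```

Note that ```.iloc``` slices exclude the end position, as usual in Python, so ```0:2``` selects positions 0 and 1.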

In [None]:
# Select all rows and the 'Age' column
bankloan_data.loc[:, 'Age']

In [None]:
# Select all columns for the rows where Age is less than 37. Note the use of .loc[] to filter the rows
bankloan_data.loc[bankloan_data.loc[:, 'Age'] < 37, :]

In [None]:
# Count the number of samples whose age is less than 37
np.sum(bankloan_data.loc[:, 'Age'] < 37)

In [None]:
# Integer indexing: first five rows (positions 0 to 4) and the column at position 1
bankloan_data.iloc[0:5, 1:2]

The original matplotlib plot is not very nice. There is a library oriented to data analysis called [```seaborn```](https://seaborn.pydata.org/) that provides some prettier graphs and some extra functions on top of matplotlib plots. Normally, you can use seaborn to generate a pretty graph, and then fine-tune the output using matplotlib's functions, as we do below. Let's import the package and create a plot.

In [None]:
import seaborn as sns

In [None]:
# Configure the basic structure of the plots
sns.set_theme(color_codes=True)

# Create a pairplot in seaborn. Takes a while as several plots are run.
sns.pairplot(bankloan_data)

# Use matplotlib to tweak the outcome and save the plot as a PDF and JPG file
plt.savefig('Hist.pdf')
plt.savefig('Hist.jpg')

# Show the plot inlined in the notebook
plt.show()

What do you see here? Any variables that jump to mind? Let's calculate the capital requirement of our loans.

## Basel III Capital Requirements

Recalling the last lecture, the equation for the capital requirement of any operation is:

$$
K = LGD \cdot \left\{ N\left( \sqrt{\frac{1}{1-R}} \cdot N^{-1}(PD) + \sqrt{\frac{R}{1-R}} \cdot N^{-1}(0.999) \right) - PD \right\} \left( \frac{1 + (M - 2.5)b}{1 - 1.5b}\right)
$$

The values of $b$ and $M$ will be variable for bonds, but for retail and mortgages the maturity is fixed at 1, and the $b$ term disappears. The correlations are given by the regulation:

- Mortgages: $R = 0.15$
- Revolving: $R = 0.04$
- Other retail: $R = 0.03 \left( \frac{1 - e^{-35PD}}{1 - e^{-35}} \right) + 0.16 \left( 1 - \frac{1 - e^{-35PD}}{1 - e^{-35}} \right)$
- Corporate and sovereign exposures: $R = 0.12 \left( \frac{1 - e^{-50PD}}{1 - e^{-50}} \right) + 0.24 \left( 1 - \frac{1 - e^{-50PD}}{1 - e^{-50}} \right)$
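
To get a feel for how the PD-dependent correlations behave, the sketch below evaluates the two interpolation formulas above. The function names are hypothetical helpers introduced here for illustration. In both cases the correlation starts at the higher value for very low PDs and decays towards the lower value as the PD grows.

```python
import numpy as np

def correlation_other_retail(PD):
    # Hypothetical helper: weight on the lower correlation, 0 at PD = 0 and 1 at PD = 1
    w = (1 - np.exp(-35 * PD)) / (1 - np.exp(-35))
    return 0.03 * w + 0.16 * (1 - w)

def correlation_corporate(PD):
    # Hypothetical helper: same structure with decay rate 50 and bounds 0.12 / 0.24
    w = (1 - np.exp(-50 * PD)) / (1 - np.exp(-50))
    return 0.12 * w + 0.24 * (1 - w)

# Print the correlations for a few illustrative PD values
for pd_value in [0.0003, 0.01, 0.05, 0.20]:
    print(pd_value,
          round(float(correlation_other_retail(pd_value)), 4),
          round(float(correlation_corporate(pd_value)), 4))
```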

The following code defines the capital requirement formula for other retail loans.


In [None]:
# Other retail
import numpy as np
from scipy.stats import norm

def capital_requirement_retail(PD, LGD):
  # Apply the PD floor of 0.03%
  if PD < 0.0003:
    PD = 0.0003

  # Weight on the lower correlation: goes from 0 to 1 as PD increases
  weight = (1 - np.exp(-35 * PD)) / (1 - np.exp(-35))

  # Asset correlation: between 0.16 (low PD) and 0.03 (high PD)
  R = 0.03 * weight + 0.16 * (1 - weight)

  # Now we can calculate the capital
  K = norm.cdf(np.sqrt(1 / (1 - R)) * norm.ppf(PD) +
               np.sqrt(R / (1 - R)) * norm.ppf(0.999)) - PD
  K *= LGD
  return K

Now we can calculate the function itself for a specific PD and LGD combination.

In [None]:
capital_requirement_retail(LGD = 0.5, PD = 0.4)

Or we can print it in a nicer format using an f-string (a formatted string literal).

In [None]:
print(f'PD = 0.5 & LGD = 0.5. K = {capital_requirement_retail(0.5, 0.5):.3f}')

Let's create a few plots showing the behaviour of the function. Note I am using a list comprehension, a compact Python construct that builds a list by applying the function to each element of a sequence.

In [None]:
# Generate the series and set the LGD to 1
Xseries = np.arange(0, 1.001, 0.001)
LGD = 1

# Calculate the capital requirement for each PD using a list comprehension
Yseries = [capital_requirement_retail(x, LGD) for x in Xseries]

# Plot the series
plt.plot(Xseries, Yseries)
plt.title('PD curve at LGD = 1')
plt.xlabel('PD')
plt.ylabel('Capital Req. %')
plt.show()
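
As an alternative to the list comprehension, the formula can be written so that it works directly on NumPy arrays. The function below is a hypothetical vectorised variant of the retail formula (it is not part of the original lab), sketched under the assumption that the same PD floor and correlation apply.

```python
import numpy as np
from scipy.stats import norm

def capital_requirement_retail_vec(PD, LGD):
    # Hypothetical vectorised variant: accepts scalars or NumPy arrays
    PD = np.maximum(np.asarray(PD, dtype=float), 0.0003)  # PD floor
    w = (1 - np.exp(-35 * PD)) / (1 - np.exp(-35))
    R = 0.03 * w + 0.16 * (1 - w)
    K = norm.cdf(np.sqrt(1 / (1 - R)) * norm.ppf(PD)
                 + np.sqrt(R / (1 - R)) * norm.ppf(0.999)) - PD
    return LGD * K

# Evaluate the whole PD grid in a single call
pd_grid = np.linspace(0, 1, 1001)
k_grid = capital_requirement_retail_vec(pd_grid, 1.0)
print(k_grid.shape, float(k_grid.max()))
```

For a grid of this size the speed difference is negligible, but the array version avoids the Python-level loop, which matters for larger inputs.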

In [None]:
# PD curve of the full load: capital requirement plus expected loss (PD * LGD)
Xseries = np.arange(0, 1.001, 0.001)
LGD = 1
Yseries = [capital_requirement_retail(x, LGD) + x * LGD for x in Xseries]
plt.plot(Xseries, Yseries)
plt.title('PD curve at LGD = 1')
plt.xlabel('PD')
plt.ylabel('Full Load %')
plt.show()

Now, let's apply the function to the full dataset. For this, we use the dataframe's ```apply``` method with a [lambda function](https://www.w3schools.com/python/python_lambda.asp) that maps the columns of each row to the function's inputs.

In [None]:
bankloan_data['CapitalReq'] = bankloan_data.apply(lambda x : capital_requirement_retail(x['PD'], x['LGD']), axis=1)

In [None]:
bankloan_data['CapitalReq']
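
To see what ```apply``` with ```axis=1``` does, here is a minimal sketch on two invented loans. The helper `expected_loss` is hypothetical and simply computes PD times LGD times EAD, so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical loans, invented for illustration
toy = pd.DataFrame({'PD': [0.01, 0.10],
                    'LGD': [0.5, 0.4],
                    'Outstanding': [1000.0, 2000.0]})

def expected_loss(PD, LGD, EAD):
    # Illustrative helper: expected loss = PD * LGD * EAD
    return PD * LGD * EAD

# axis=1 passes each row to the lambda as a Series indexed by column name
toy['EL'] = toy.apply(
    lambda row: expected_loss(row['PD'], row['LGD'], row['Outstanding']),
    axis=1)

# The same result computed column-wise, without apply
toy['EL_vec'] = toy['PD'] * toy['LGD'] * toy['Outstanding']
print(toy)
```

Row-wise ```apply``` is convenient when the function only accepts scalars, as is the case for the capital requirement function above; when the calculation can be expressed on whole columns, the column-wise form is usually faster.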

And now we can plot the distribution using Seaborn. The ```displot``` function does this, and ```kde=True``` adds a kernel density estimate (KDE).

In [None]:
sns.displot(bankloan_data['CapitalReq'], kde=True)
plt.show()

And we can finally calculate the maximum Risk Weighted Asset (RWA) value that would be required to cover these instruments. Assuming a factor $F = 8\%$, remember that:

$$
RWA = \frac{1}{F} \cdot K \cdot EAD
$$

In retail lending the Exposure at Default is equal to the outstanding amount, leading to:

In [None]:
RWA = (1 / 0.08) * np.dot(bankloan_data['CapitalReq'], bankloan_data['Outstanding'])
RWA
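
As a quick sanity check of the arithmetic, the sketch below repeats the calculation on two invented loans. The capital requirements and exposures are hypothetical numbers chosen so the result can be verified by hand.

```python
import numpy as np

# Hypothetical capital requirements (as fractions) and exposures at default
K = np.array([0.05, 0.10])
EAD = np.array([1000.0, 2000.0])
F = 0.08

# Capital per loan is K * EAD; the dot product sums it over the portfolio
capital = np.dot(K, EAD)   # 0.05 * 1000 + 0.10 * 2000 = 250
RWA_toy = capital / F      # 250 / 0.08 = 3125

print(capital, RWA_toy)
```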

In [None]:
np.sum(bankloan_data['Outstanding'])

In [None]:
# Set the locale to the default system locale
import locale
locale.setlocale(locale.LC_ALL, '')

# Display
out = locale.currency(RWA, grouping=True)
print('The maximum value for the RWA at an 8% capital requirement is equal to ' + out)
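
The output of ```locale.currency``` depends on the locale configured on the machine, and it can raise an error if the environment falls back to the plain ```C``` locale. A locale-independent alternative, sketched here with an invented amount and an assumed dollar sign, is to format the number with an f-string.

```python
# Hypothetical amount, used only to show the formatting
amount = 1234567.891

# ',' adds thousands separators and '.2f' keeps two decimals
formatted = f'${amount:,.2f}'
print(formatted)  # $1,234,567.89
```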

However, Basel says that the RWA per business line is 12.5 times the capital requirement (i.e. it does not adjust by the bank's own load), so the 12.5 factor is the correct value to use when calculating the RWA of the line. For their final capital requirement allocation, the bank may want to adjust this to their own load. Most banks in Canada have an 11.5% load that they would use to estimate their capital requirements, but that does not impact the RWA of the asset under a Standard Approach, just the final (and obviously more important) capital allocation.