# October
## Loan Default prediction
##### Find default loan clients

### Problem Description
Imagine a car loan company that specializes in providing car loans to retail clients. Since retail clients require authorization or regulation to operate in the financial markets, the company needs to decide wisely whether a client is eligible for a loan. To do so, it needs to secure its portfolio and streamline its decision processes. In this context, the goal of this case study is to build a system that detects, based on client data, whether a client will default on a loan. More precisely, a machine learning model is trained on client data to learn how to classify clients as default or non-default.

This is a binary classification problem that can be tackled with machine learning or deep learning algorithms. In summary:
- Decision on loan default
- Detection of loan default clients
- Binary classification problem

### Data Exploration

In this section the data are discussed and explored, and some interesting findings about their distributions and structure are visualized. The code blocks below import the required libraries, load the data and perform the preprocessing step. We performed all the basic steps of data manipulation and cleaning:

- Checking for duplicated clients
- Removing NaN, +inf, -inf
- Descriptive statistics
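
The cleaning steps listed above could look roughly like the following pandas sketch. It is not the actual `DataManager` implementation, which is not shown here; the function name `basic_clean` and the column `customer_id` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame, id_column: str = "customer_id") -> pd.DataFrame:
    """Drop duplicated clients and rows containing NaN, +inf or -inf."""
    # Keep the first record of each client (id_column is a hypothetical name).
    df = df.drop_duplicates(subset=id_column, keep="first")
    # Treat +inf and -inf as missing so that a single dropna removes them too.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "total_monthly_payment": [100.0, 100.0, np.inf, np.nan, 250.0],
})
clean = basic_clean(raw)
print(clean)
# Descriptive statistics of the cleaned data.
print(clean.describe())
```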

In [None]:
!pip install -r requirements.txt

In [None]:
from data_manager import DataManager
from model import Model

#### Initialize DataManager class
The data manager is responsible for data manipulation, preprocessing and cleaning. The next code blocks provide information about the raw data in the car_loan_trainset dataset.

In [None]:
data_manager = DataManager(path_to_data='../data_folder/car_loan_trainset.csv')

In [None]:
print(f'Initial data shape: {data_manager.data.shape}')
print('Data columns:', data_manager.data.columns)

In [None]:
data_manager.data.head(5)

The preprocessing step of the data manager applies the essential preprocessing to the data and returns the columns that will be fed to the model.
#### Excluded columns and records
- All NA and inf records
- All columns with ids

In [None]:
data = data_manager.get_preprocessed_data()

#### Feature extraction
We tried to extract additional information from specific variables in the dataset by computing ratios.
##### Feature extraction ratios
- Overdue_to_active_ratio = $\frac{total\_overdue}{main\_active\_loan + sub\_active\_loan}$


- Overdue_to_total_ratio = $\frac{total\_overdue}{total\_account\_loan}$



- Monthly_payment_to_outstanding_ratio = $\frac{total\_monthly\_payment}{total\_outstanding\_loan}$



- Outstanding_to_disburse_ratio = $\frac{total\_outstanding\_loan}{total\_disbursed\_loan}$
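
The ratios above can be sketched in pandas as follows. The column names are taken from the prediction request body used later in this notebook. Two points are assumptions rather than the author's confirmed method: the numerator of the first ratio is taken to be the total overdue count, and a zero denominator yields a ratio of 0. The helper names `safe_ratio` and `add_ratio_features` are hypothetical.

```python
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise division that returns 0 where the denominator is 0 (an assumption)."""
    denominator = denominator.astype(float)
    # Zeros become NaN, so the division never produces +/-inf.
    return (numerator / denominator.where(denominator != 0)).fillna(0.0)


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    active = out["main_account_active_loan_no"] + out["sub_account_active_loan_no"]
    out["overdue_to_active_ratio"] = safe_ratio(out["total_overdue_no"], active)
    out["overdue_to_total_ratio"] = safe_ratio(
        out["total_overdue_no"], out["total_account_loan_no"])
    out["monthly_payment_to_outstanding_ratio"] = safe_ratio(
        out["total_monthly_payment"], out["total_outstanding_loan"])
    out["outstanding_to_disburse_ratio"] = safe_ratio(
        out["total_outstanding_loan"], out["total_disbursed_loan"])
    return out


example = pd.DataFrame({
    "total_overdue_no": [2, 0],
    "main_account_active_loan_no": [3, 0],
    "sub_account_active_loan_no": [1, 0],
    "total_account_loan_no": [8, 0],
    "total_monthly_payment": [500.0, 0.0],
    "total_outstanding_loan": [10000.0, 0.0],
    "total_disbursed_loan": [20000.0, 0.0],
})
features = add_ratio_features(example)
print(features.filter(like="_ratio"))
```

Guarding the denominator matters here because clients with no active or disbursed loans would otherwise produce infinite or undefined ratios.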

The preprocessing step also applies a logarithmic transformation to skewed variables:
- Total_outstanding_loan
- Total_disbursed_loan
- Total_monthly_payment
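
The effect of such a transformation can be illustrated on synthetic data. The sketch below assumes `np.log1p`, that is $\log(1 + x)$, because loan amounts can be zero; the exact transformation used in the preprocessing step is not shown here, and the data are a generated stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, heavily right-skewed stand-in for total_outstanding_loan.
loans = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5_000),
                  name="total_outstanding_loan")

# log1p(x) = log(1 + x) is defined at x = 0, unlike log(x).
log_loans = np.log1p(loans)

print(f"skewness before: {loans.skew():.2f}")
print(f"skewness after:  {log_loans.skew():.2f}")
```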

In the preprocessing step we binarize the variable age into: $age = \begin{cases} 1, & \text{if age } \geq 30\\
0, & \text{if age } < 30 \end{cases}$

The variable employment type is a categorical variable with 3 levels, so it is one-hot encoded: it becomes an $n\times3$ matrix of zeros with a $1$ in the position that corresponds to the employment type level. In this way the model can handle the categorical variable.
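
A minimal sketch of the age binarization and the one-hot encoding, using pandas. The level codes 0, 1 and 2 for `employment_type` are an assumption for illustration.

```python
import pandas as pd

clients = pd.DataFrame({"age": [28, 30, 45], "employment_type": [0, 1, 2]})

# 1 when age >= 30, otherwise 0.
clients["age"] = (clients["age"] >= 30).astype(int)

# One indicator column per employment type level (an n x 3 block of 0/1 values).
encoded = pd.get_dummies(clients, columns=["employment_type"],
                         prefix="employment_type", dtype=int)
print(encoded)
```

Each row has exactly one `1` among the three `employment_type_*` columns.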

#### Skewed variables
The following two plots visualize total_outstanding_loan before and after the log transformation. The raw variable is clearly very skewed.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale = 2, rc={'axes.facecolor': 'white', 'figure.facecolor': 'white'})
sns.set_style("whitegrid")

In [None]:
data_subset = data_manager.data.sample(350)
sns.displot(data=data_subset, x="total_outstanding_loan", hue="loan_default", height=10, aspect=2).set(title='Raw total_outstanding_loan')

In [None]:
data_subset = data.sample(350)
sns.displot(data=data_subset, x="total_outstanding_loan", hue="loan_default", height=10, aspect=2).set(title='Log total_outstanding_loan')

#### Age
The age histogram illustrates the average age of a loan. We cannot observe any pattern or tendency that separates the clients into default and non-default.

In [None]:
data_subset = data[data['average_age'] < 50]
sns.displot(data=data_subset, x="average_age", hue="loan_default", height=10, aspect=2).set(title='Average Age')

In [None]:
sns.displot(data=data, x="loan_to_asset_ratio", hue="loan_default", height=10, aspect=2).set(title='Loan to asset ratio')

The following plot is a box plot of the loan to asset ratio. The two classes cannot be easily separated because their distributions overlap.

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(data=data, x="loan_default", y='loan_to_asset_ratio')

In [None]:
sns.displot(data=data, x="credit_history", hue="loan_default", height=10, aspect=2).set(title='Credit history')

In [None]:
sns.displot(data=data, x="Credit_level", hue="loan_default", height=10, aspect=2).set(title='Credit level')

In [None]:
data_subset = data[data['overdue_to_active_ratio'] < 2]
sns.displot(data=data_subset, x="overdue_to_active_ratio", hue="loan_default", height=10, aspect=2)

In [None]:
sns.displot(data=data_subset, x="overdue_to_total_ratio", hue="loan_default", height=10, aspect=2).set(title='Overdue to total ratio')

### Findings from exploration
The data are imbalanced: there are fewer loan default clients than non-default ones. We can also conclude from the plots and the exploration that the features do not separate the two classes well, so the classifier is unlikely to achieve high performance.

In [None]:
print('Proportion of class 1:', len(data.loan_default[data.loan_default==1])/data.loan_default.size)
print('Proportion of class 0:', len(data.loan_default[data.loan_default==0])/data.loan_default.size)

In [None]:
data.shape

### Modeling
Because the variables in the dataset take values with high variance, we perform a pre-training step that scales the data with a MinMax scaler, which maps each feature to the range $X_i \in [-2,2]$. We split the data into a training set and a validation (test) set, holding out $20\%$ of the data for the test set. The model is a multilayer perceptron with one hidden layer of 100 nodes. It was trained for 300 iterations.
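
The setup described above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's `Model` class; the stratified split, the random seeds and the use of a pipeline are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the preprocessed loan data.
X, y = make_classification(n_samples=1_000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# Hold out 20% of the data; stratify keeps the class ratio in both splits (an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The pipeline fits the scaler on the training data only, then trains the MLP.
model = make_pipeline(
    MinMaxScaler(feature_range=(-2, 2)),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0),
)
model.fit(X_train, y_train)
print(f"validation accuracy: {model.score(X_val, y_val):.3f}")
```

Wrapping the scaler and the classifier in one pipeline avoids leaking validation statistics into the scaling step.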

The evaluation metrics we used are:
- Precision
- Recall
- F1 score
- Accuracy
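
To make the four metrics concrete, the sketch below derives them from the confusion matrix of a small, made-up set of labels and predictions (class 1 stands for loan default). The toy values are for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, chosen only for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted defaults that are real defaults
recall = tp / (tp + fn)      # share of real defaults that are detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```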

The model can be trained through a FastAPI endpoint. Two methods are available for handling the imbalanced classes in the dataset:
- Undersampling
- Oversampling

Either of the two can be used when training the model. The /train request takes the model name and the method for handling the imbalanced data.
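
Random undersampling and oversampling can be sketched in plain pandas as shown below. The function `resample_classes` is hypothetical and is not the code behind the endpoint, which is not shown here.

```python
import pandas as pd


def resample_classes(df: pd.DataFrame, target: str = "loan_default",
                     oversampling: bool = False, seed: int = 0) -> pd.DataFrame:
    """Balance the classes by random over- or undersampling."""
    counts = df[target].value_counts()
    # Oversampling grows every class to the majority size;
    # undersampling shrinks every class to the minority size.
    size = int(counts.max() if oversampling else counts.min())
    parts = [
        # Sample with replacement only when a class has fewer rows than needed.
        group.sample(n=size, replace=len(group) < size, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


df = pd.DataFrame({"loan_default": [0] * 8 + [1] * 2, "x": range(10)})
under = resample_classes(df, oversampling=False)
over = resample_classes(df, oversampling=True)
print(under["loan_default"].value_counts().to_dict())
print(over["loan_default"].value_counts().to_dict())
```

Resampling is usually applied to the training split only, so that the validation set keeps the original class proportions.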

In [8]:
import requests
oversampling = "False"
model = "MLP"
request_url = f"http://127.0.0.1:8000/train/{model}/oversampling/{oversampling}"
response = requests.post(request_url)

In [9]:
response.text

'{"Model":"MLP","Accuracy score":0.5438149025323374,"Precision score":0.2290448343079922,"Recall score":0.6628205128205128,"F1 score":0.340445146845779}'
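
The response body is a JSON string, so it can be parsed into a dictionary for easier reading. The sketch below uses the standard `json` module on the text shown above; with a live `requests` response, `response.json()` gives the same dictionary.

```python
import json

# The response body shown above, copied verbatim.
response_text = (
    '{"Model":"MLP","Accuracy score":0.5438149025323374,'
    '"Precision score":0.2290448343079922,"Recall score":0.6628205128205128,'
    '"F1 score":0.340445146845779}'
)
metrics = json.loads(response_text)

# Print each numeric score rounded to three decimals.
for name, value in metrics.items():
    if isinstance(value, float):
        print(f"{name}: {value:.3f}")
```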

### Conclusions
The model's overall performance is modest according to accuracy (about 0.54) and F1 score (about 0.34). The interesting finding, though, is that the recall score is comparatively high (about 0.66), which means that the number of false negative predictions is relatively low. This matters because the task in this problem is to build a system that detects loan default clients. The system therefore needs a low number of false negatives, so that a client of class 1 (loan_default) is more likely to be detected and classified correctly. Unfortunately, precision and recall are not well balanced: the precision is very low, which means that the number of false positives is relatively high. Accuracy is not very informative about the performance of the binary classifier here, since the dataset is very imbalanced.
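
As a quick consistency check on the reported scores (rounded to three decimals), the F1 score follows from precision $P$ and recall $R$, and the precision alone determines how many false positives accompany each true positive:

$$F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.229 \times 0.663}{0.229 + 0.663} \approx 0.340, \qquad \frac{FP}{TP} = \frac{1-P}{P} = \frac{0.771}{0.229} \approx 3.4$$

In other words, roughly three to four non-default clients are flagged for every default client that is correctly detected.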

In [50]:
import base64
token = base64.b64encode(b'October:123').decode('utf-8')
body = {
  "Driving_flag": 1,
  "last_six_months_new_loan_no": 0,
  "last_six_month_defaulted_no": 12,
  "average_age": 40,
  "credit_history": 0,
  "loan_to_asset_ratio": 0.65,
  "total_outstanding_loan": 0,
  "total_disbursed_loan": 0,
  "total_monthly_payment": 123,
  "active_to_inactive_act_ratio": 0.43,
  "Credit_level": 2,
  "age": 28,
  "loan_default": 1,
  "employment_type": 2,
  "total_overdue_no": 1000,
  "main_account_active_loan_no": 0,
  "total_account_loan_no": 0,
  "sub_account_active_loan_no": 0
}
response = requests.post('http://127.0.0.1:8000/predict/MLP', headers={"Authorization": f"Basic {token}"}, json=body)

In [44]:
response.text

'{"prediction_outcome_class":0,"username":"October"}'

### Future work
- Extract more features that separate the classes better
- Use feature selection techniques, e.g. Elastic Net
- Experiment with various feature-scaling methods