## Building a Credit Scoring Model for Buy-Now-Pay-Later Services: A Comprehensive Analysis of Credit Risk Classification

## Overview
Credit scoring models are vital tools used by financial institutions to assess the creditworthiness of potential borrowers. As part of risk management, these models predict the likelihood of a borrower defaulting on a loan, which allows institutions to mitigate potential losses. Traditionally, credit scoring models rely on statistical methods to analyze historical data, identifying patterns and relationships between borrower behavior and loan outcomes.

In this challenge, we build a credit scoring model for Bati Bank, a leading financial institution collaborating with an eCommerce platform. The goal is to enhance their buy-now-pay-later service, which lets customers purchase items on credit based on their predicted creditworthiness. With a reliable credit scoring model, Bati Bank can make informed lending decisions while limiting the risk of defaults. The project covers data exploration, feature engineering, model development, and real-time deployment of the model via an API to assess credit risk and optimize loan terms.

## Objectives
The primary objectives of this project are as follows:

1. Define Credit Risk Proxy: Establish a proxy variable to categorize users as high-risk (bad) or low-risk (good) based on their likelihood of default.

2. Feature Engineering: Select relevant features from the data and engineer new ones that are strong predictors of default risk. This includes creating aggregate and extracted features, encoding categorical variables, handling missing data, and normalizing numerical features.

3. Develop a Credit Risk Model: Build and train machine learning models that assign risk probabilities to new customers based on historical transaction data.

4. Credit Scoring: Use the model's probability estimates to create a credit score for each customer, facilitating quick and accurate creditworthiness assessments.

5. Loan Optimization: Predict the optimal loan amount and duration for new customers, considering their risk profile to ensure sustainable lending practices.

6. Model Deployment: Deploy the trained credit scoring model through an API, enabling real-time credit scoring and decision-making. The API will accept customer transaction data and return predictions on credit risk and loan recommendations.
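
To make objective 4 concrete, here is a minimal sketch of one common way to turn a default probability into a score: log-odds scaling with "points to double the odds" (PDO). The function name `probability_to_score` and the parameter values (`base_score=600`, `base_odds=50`, `pdo=20`) are illustrative assumptions, not values taken from this project.

```python
import numpy as np

def probability_to_score(p_default, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map default probabilities to scores with log-odds scaling.

    base_score, base_odds, and pdo are hypothetical calibration choices:
    a customer with good:bad odds of base_odds gets base_score, and every
    doubling of the odds adds pdo points.
    """
    # Clip to avoid division by zero and log(0) at the extremes
    p = np.clip(np.asarray(p_default, dtype=float), 1e-6, 1 - 1e-6)
    odds_good = (1.0 - p) / p              # odds of repaying vs. defaulting
    factor = pdo / np.log(2.0)
    offset = base_score - factor * np.log(base_odds)
    return offset + factor * np.log(odds_good)

# Lower default probability should yield a higher score
scores = probability_to_score([0.01, 0.05, 0.20, 0.50])
print(np.round(scores, 1))
```

Under this scaling the score decreases monotonically as the predicted default probability increases, so the ranking produced by the model is preserved.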

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load dataset

In [3]:
df = pd.read_csv('../data/woe_feature_engineering.csv')
df

Unnamed: 0.1,Unnamed: 0,TransactionId,BatchId,AccountId,SubscriptionId,CustomerId,CountryCode,ProviderId,ProductId,Amount,...,ChannelId_ChannelId_2,ChannelId_ChannelId_3,ChannelId_ChannelId_5,Recency,Frequency,Monetary,Stability,RFMS_Score,Risk_Label,RFMS_Binned
0,0,TransactionId_76871,BatchId_36123,AccountId_3957,SubscriptionId_887,CustomerId_4406,0.0,5,1,0.092004,...,False,True,False,1.0,0.028851,0.557522,0.000919,0.396823,0,2.0
1,1,TransactionId_73770,BatchId_15642,AccountId_4841,SubscriptionId_3829,CustomerId_4406,0.0,3,19,0.091910,...,True,False,False,1.0,0.028851,0.557522,0.000919,0.396823,0,2.0
2,2,TransactionId_26203,BatchId_53941,AccountId_4229,SubscriptionId_222,CustomerId_4683,0.0,5,0,0.091958,...,False,True,False,0.0,0.000244,0.556944,0.000000,0.139297,0,0.0
3,3,TransactionId_380,BatchId_102363,AccountId_648,SubscriptionId_2185,CustomerId_988,0.0,0,11,0.093750,...,False,True,False,1.0,0.009046,0.558153,0.005187,0.393097,0,2.0
4,4,TransactionId_28195,BatchId_38780,AccountId_4841,SubscriptionId_3829,CustomerId_988,0.0,3,19,0.091853,...,True,False,False,1.0,0.009046,0.558153,0.005187,0.393097,0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95657,95657,TransactionId_89881,BatchId_96668,AccountId_4841,SubscriptionId_3829,CustomerId_3078,0.0,3,19,0.091820,...,True,False,False,1.0,0.139853,0.569883,0.006814,0.429138,0,2.0
95658,95658,TransactionId_91597,BatchId_3503,AccountId_3439,SubscriptionId_2643,CustomerId_3874,0.0,5,1,0.092004,...,False,True,False,1.0,0.010269,0.557249,0.000687,0.392051,0,2.0
95659,95659,TransactionId_82501,BatchId_118602,AccountId_4841,SubscriptionId_3829,CustomerId_3874,0.0,3,19,0.091910,...,True,False,False,1.0,0.010269,0.557249,0.000687,0.392051,0,2.0
95660,95660,TransactionId_136354,BatchId_70924,AccountId_1346,SubscriptionId_652,CustomerId_1709,0.0,5,8,0.092188,...,False,True,False,1.0,0.127873,0.561462,0.000969,0.422576,0,2.0


## 5. Train Test Split

### Separate dependent and independent variables

In [4]:
from sklearn.model_selection import train_test_split

# Define features and target variable.
# 'Unnamed: 0.1' and 'Unnamed: 0' are not in the drop list, so they remain in X.
non_feature_cols = ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId',
                    'CountryCode', 'ProviderId', 'ProductId', 'FraudResult', 'Risk_Label', 'RFMS_Score']
X = df.drop(columns=non_feature_cols)
y = df['Risk_Label']
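
The feature preview in this notebook still contains `Unnamed: 0.1` and `Unnamed: 0`. Columns with such names typically appear when a DataFrame is saved with its row index and then read back without `index_col`, so they carry row order rather than customer behaviour. The sketch below shows one possible way to filter them out. It runs on a made-up toy frame, and `toy`, `index_artifacts`, and `X_toy` are hypothetical names. Whether these columns should be removed from the real feature set is a judgment call this notebook does not make.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are made up
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "Amount": [0.0920, 0.0919, 0.0938],
    "Risk_Label": [0, 0, 1],
})

# Collect columns that look like index leftovers from an earlier to_csv call
index_artifacts = [c for c in toy.columns if c.startswith("Unnamed")]

# Drop the leftovers together with the target to obtain the feature matrix
X_toy = toy.drop(columns=index_artifacts + ["Risk_Label"])
print(X_toy.columns.tolist())
```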

### Split the data into training and testing sets

In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("Features (X):")
X.head()

Features (X):


Unnamed: 0.1,Unnamed: 0,Amount,Value,PricingStrategy,Transaction_Hour,Transaction_Day,Transaction_Month,Transaction_Year,Total_Transaction_Amount,Average_Transaction_Amount,...,ProductCategory_tv,ProductCategory_utility_bill,ChannelId_ChannelId_2,ChannelId_ChannelId_3,ChannelId_ChannelId_5,Recency,Frequency,Monetary,Stability,RFMS_Binned
0,0,0.092004,0.000101,0.666667,2,15,11,2018,0.557522,0.047184,...,False,False,False,True,False,1.0,0.028851,0.557522,0.000919,2.0
1,1,0.09191,2e-06,0.666667,2,15,11,2018,0.557522,0.047184,...,False,False,True,False,False,1.0,0.028851,0.557522,0.000919,2.0
2,2,0.091958,5e-05,0.666667,2,15,11,2018,0.556944,0.047137,...,False,False,False,True,False,0.0,0.000244,0.556944,0.0,0.0
3,3,0.09375,0.002206,0.666667,3,15,11,2018,0.558153,0.047749,...,False,True,False,True,False,1.0,0.009046,0.558153,0.005187,2.0
4,4,0.091853,6.5e-05,0.666667,3,15,11,2018,0.558153,0.047749,...,False,False,True,False,False,1.0,0.009046,0.558153,0.005187,2.0


In [6]:
print("\nTarget (y):")
y.head()


Target (y):


0    0
1    0
2    0
3    0
4    0
Name: Risk_Label, dtype: int64

In [7]:
# Display the size of the train and test sets
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 76529 samples
Testing set size: 19133 samples
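
The split above passes `stratify=y`, which is meant to keep the proportion of each `Risk_Label` class roughly equal in the training and testing sets. Here is a minimal sketch of how that could be checked. It uses synthetic data with an assumed 10% positive rate rather than the project's dataset, and `X_demo`, `y_demo`, and `ratios` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced stand-in for Risk_Label (about 10% positives)
y_demo = pd.Series((rng.random(10_000) < 0.10).astype(int), name="Risk_Label")
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class proportions across the full data and both splits
ratios = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(ratios)
```

Applying the same comparison to `y`, `y_train`, and `y_test` would show whether the real split kept the class balance.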
