* 1. Business Understanding

Objective:

The primary goal is to predict whether a borrower will default on their loan. This will help financial institutions in Kenya make better lending decisions and reduce the risk of bad loans.

Key Questions:

What factors contribute to loan defaults?

Can we predict which borrowers are likely to default based on their profile and loan details?

How can the model be used to improve loan approval processes?

* 2. Data Understanding

Dataset Overview:

The dataset contains information about borrowers, including their demographics, employment status, credit score, loan details, and whether they defaulted on their loan.

Key Variables:
ID: Unique identifier for each borrower.

GENDER: Gender of the borrower.

AGE: Age of the borrower.

NO_DEFAULT_LOAN: Number of previous loans without default.

EMPLOYMENT_STATUS: Employment status of the borrower.

SECTOR: Sector in which the borrower works.

MARITAL_STATUS: Marital status of the borrower.

CREDIT_SCORE: Credit score of the borrower.

SCOREGRADE: Credit score grade.

CRR: Credit risk rating.

CURRENCY: Currency of the loan (KES in this case).

NET INCOME: Net income of the borrower.

PRINCIPAL_AMOUNT: Loan amount.

EMI: Equated Monthly Installment (monthly payment).

OD_DAYS: Number of days the loan is overdue.

PRUDENTIAL_CLASSIFICATION: Classification of the loan (e.g., Normal, Doubtful).

RISK_GRADE: Risk grade of the borrower.

AREARS: Amount in arrears.

LOAN_STATUS: Current status of the loan (e.g., Current, Expired).


Data Exploration:

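Missing Values: Several columns have missing values, including GENDER, AGE, NO_DEFAULT_LOAN, SECTOR, CREDIT_SCORE, SCOREGRADE, and NET INCOME. The file also loads with 1,048,575 rows, of which only 18,197 carry an ID; the remaining rows are empty padding and should be dropped.

Data Types: Numerical columns such as CREDIT_SCORE, NET INCOME, and PRINCIPAL_AMOUNT load as float64, while GENDER, EMPLOYMENT_STATUS, and LOAN_STATUS are categorical. AGE loads as object rather than as a number, so it needs converting before use.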

Target Variable: LOAN_STATUS will be our target variable. We will classify loans as "Default" (Expired) or "Non-Default" (Current).
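The exploration steps above can be sketched on a tiny hand-made frame. Only the column names come from the dataset; the values are made up for illustration, and the `DEFAULT` column name is an assumption rather than something the dataset provides.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the values are invented, the column names are from the dataset.
sample = pd.DataFrame({
    "GENDER": ["FEMALE", "MALE", None, "MALE"],
    "AGE": ["37", "41", "36", None],
    "CREDIT_SCORE": [615.0, 529.0, np.nan, 618.0],
    "LOAN_STATUS": ["CURRENT", "EXPIRED", "CURRENT", "EXPIRED"],
})

# Count missing values per column.
missing_counts = sample.isna().sum()

# Build a binary target: EXPIRED -> 1 (default), CURRENT -> 0 (non-default).
# "DEFAULT" is a hypothetical column name chosen for this sketch.
sample["DEFAULT"] = sample["LOAN_STATUS"].map({"CURRENT": 0, "EXPIRED": 1})

print(missing_counts)
print(sample["DEFAULT"].value_counts())
```

Any LOAN_STATUS value other than CURRENT or EXPIRED would map to NaN here, which makes unexpected categories easy to spot before modeling.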


* 3. Data Preparation

*** Data Cleaning:

Handling Missing Values:

For numerical columns like AGE, we can fill missing values with the median (which is less sensitive to outliers) or the mean.

Removing Irrelevant Columns: Columns like CURRENCY (since all loans are in KES) can be dropped.

Encoding Categorical Variables: Convert categorical variables like GENDER, EMPLOYMENT_STATUS, and LOAN_STATUS into numerical values using one-hot encoding or label encoding.
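A minimal sketch of these cleaning steps, run on a small made-up frame rather than the real file. The last row mimics the empty padding rows seen at the tail of the loaded data (no ID, AGE of 0). The choice of median imputation and one-hot encoding is one of the options named above, not the only valid one.

```python
import numpy as np
import pandas as pd

# Made-up rows; the final row imitates an empty padding row from the file.
raw = pd.DataFrame({
    "ID": [1.0, 2.0, 3.0, np.nan],
    "GENDER": ["FEMALE", "MALE", "MALE", None],
    "AGE": ["37", "41", None, "0"],
    "EMPLOYMENT_STATUS": ["EMPLOYED", "SELF-EMPLOYED", "EMPLOYED", None],
    "CURRENCY": ["KES", "KES", "KES", None],
})

# Keep only rows that carry a borrower ID (drops the padding rows).
clean = raw.dropna(subset=["ID"]).copy()

# AGE may be read as text; coerce it to numbers, then fill gaps with the median.
clean["AGE"] = pd.to_numeric(clean["AGE"], errors="coerce")
clean["AGE"] = clean["AGE"].fillna(clean["AGE"].median())

# CURRENCY is constant (KES), so it carries no predictive information.
clean = clean.drop(columns=["CURRENCY"])

# One-hot encode the categorical predictors.
clean = pd.get_dummies(clean, columns=["GENDER", "EMPLOYMENT_STATUS"])

print(clean)
```

Dropping the padding rows before imputing matters: otherwise the many AGE values of 0 in those rows would drag the median toward zero.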

*** Feature Engineering:

Loan-to-Income Ratio: Create a new feature by dividing the PRINCIPAL_AMOUNT by NET INCOME. This can help assess the borrower's ability to repay the loan.

Debt-to-Income Ratio: Calculate the ratio of EMI to NET INCOME.

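Overdue Ratio: Calculate the ratio of OD_DAYS to the loan tenure. Tenure is not among the variables listed above, so it would have to be sourced or derived separately.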
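The two income-based ratios can be sketched as follows, using the NET INCOME, PRINCIPAL_AMOUNT, and EMI values from the first three rows of the dataset preview. One of those rows has a net income of 0, so the sketch treats zero income as missing to avoid infinite ratios; that handling is an assumption, as are the new column names `LOAN_TO_INCOME` and `EMI_TO_INCOME`. The overdue ratio is left out because no tenure column is available.

```python
import numpy as np
import pandas as pd

# Values taken from the first three rows shown in the dataset preview.
loans = pd.DataFrame({
    "NET INCOME": [5000.0, 0.0, 1294783.78],
    "PRINCIPAL_AMOUNT": [642000.0, 78000.0, 80000.0],
    "EMI": [59826.37, 6149.03, 7439.02],
})

# Treat zero income as missing so the ratios become NaN instead of inf.
income = loans["NET INCOME"].replace(0, np.nan)

# Hypothetical feature names chosen for this sketch.
loans["LOAN_TO_INCOME"] = loans["PRINCIPAL_AMOUNT"] / income
loans["EMI_TO_INCOME"] = loans["EMI"] / income

print(loans[["LOAN_TO_INCOME", "EMI_TO_INCOME"]])
```

The resulting NaN values still need a decision before modeling, for example imputation or a separate "income unknown" indicator.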

*** Data Splitting:
Split the data into training (80%) and testing (20%) sets to evaluate the model's performance.
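As a sketch of the 80/20 split, the example below uses placeholder features and labels instead of the real data. Stratifying on the label is an addition not stated above; it keeps the share of defaults the same in both sets, which is usually helpful when defaults are the minority class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # placeholder features
y = np.array([0] * 80 + [1] * 20)    # placeholder labels with 20% defaults

# 80% training, 20% testing; stratify=y preserves the default rate in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```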

* 4. Modeling

*** Model Selection:
We will use classification algorithms to predict whether a borrower will default. Some common algorithms include:

Logistic Regression

Decision Trees

Random Forest

Gradient Boosting (e.g., XGBoost, LightGBM)

Support Vector Machines (SVM)

*** Model Training:
Train each model on the training dataset.

Use cross-validation to ensure the model generalizes well to unseen data.
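A minimal cross-validation sketch, assuming synthetic data from `make_classification` in place of the loan dataset and logistic regression as the example model. Wrapping the scaler and the model in one pipeline means the scaler is refit inside each fold, so no information from the validation fold leaks into training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (about 20% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Scaling and the classifier are fitted together within each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds, scored by ROC-AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(scores, scores.mean())
```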

*** Hyperparameter Tuning:

Use techniques like Grid Search or Random Search to find the best hyperparameters for each model.
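Grid search can be sketched as below, again on synthetic data. The parameter grid, the random forest model, and the F1 scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared loan features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so the search runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

For larger grids, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them.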

* 5. Evaluation

*** Model Evaluation Metrics:

Accuracy: Percentage of correctly predicted defaults and non-defaults.

Precision: Percentage of predicted defaults that are actual defaults.

Recall: Percentage of actual defaults that are correctly predicted.

F1-Score: Harmonic mean of precision and recall.

ROC-AUC: Area under the ROC curve, which measures the model's ability to distinguish between default and non-default classes.
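The metrics above can be checked by hand on a small made-up example with eight loans, where 1 means default. Here the predictions give 2 true positives, 1 false negative, 1 false positive, and 4 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels, hard predictions, and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

accuracy = accuracy_score(y_true, y_pred)    # (2 + 4) / 8 = 0.75
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1) = 2/3
recall = recall_score(y_true, y_pred)        # 2 / (2 + 1) = 2/3
f1 = f1_score(y_true, y_pred)                # harmonic mean of 2/3 and 2/3 = 2/3
auc = roc_auc_score(y_true, y_score)         # 14 of 15 default/non-default pairs ranked correctly

print(accuracy, precision, recall, f1, auc)
```

Note that ROC-AUC is computed from the predicted probabilities, not from the thresholded predictions.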

*** Model Comparison:

Compare the performance of different models based on the above metrics.

Select the best-performing model for deployment.

*** Confusion Matrix:

Visualize the number of true positives, true negatives, false positives, and false negatives.
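A short sketch of building the confusion matrix for a made-up set of eight loans (1 means default). The row and column labels in `cm_table` are chosen here for readability and are not part of scikit-learn's output.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# With labels=[0, 1], rows are actual classes and columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Wrap the matrix in a labeled table for easier reading.
cm_table = pd.DataFrame(
    cm,
    index=["actual non-default", "actual default"],
    columns=["predicted non-default", "predicted default"],
)

print(cm_table)
```

For a visual version, `cm_table` can be passed to `sns.heatmap(cm_table, annot=True, fmt="d")`.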

* 6. Deployment

Model Deployment:

Deploy the best model as a web service or API that financial institutions can call to score loan applications in real time.

Use tools like Flask or FastAPI to create the API.
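Whichever framework is chosen, the API route would wrap a function that loads the saved model and scores one borrower. The sketch below shows only that framework-independent core. The stand-in model, its two features, the file name, the 0.5 threshold, and the `predict_default` helper are all hypothetical, and the training numbers are made up.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on made-up numbers; the real one comes from the modeling step.
train = pd.DataFrame({
    "CREDIT_SCORE": [520, 540, 600, 680, 700, 720],
    "EMI_TO_INCOME": [0.9, 0.8, 0.6, 0.2, 0.1, 0.05],
})
labels = [1, 1, 1, 0, 0, 0]
model = LogisticRegression(max_iter=1000).fit(train, labels)

# Persist the fitted model so a separate service process can load it.
model_path = os.path.join(tempfile.mkdtemp(), "loan_default_model.joblib")
joblib.dump(model, model_path)


def predict_default(payload: dict, path: str = model_path) -> dict:
    """Load the saved model and score one borrower described by `payload`."""
    loaded = joblib.load(path)
    # Order the columns exactly as they were during training.
    row = pd.DataFrame([payload])[list(train.columns)]
    probability = float(loaded.predict_proba(row)[0, 1])
    return {
        "default_probability": probability,
        "predicted_default": probability >= 0.5,
    }


result = predict_default({"CREDIT_SCORE": 530, "EMI_TO_INCOME": 0.85})
print(result)
```

In a real service the model would be loaded once at startup rather than on every request, and the payload would be validated before scoring.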

Monitoring:

Continuously monitor the model's performance in production.

Retrain the model periodically with new data to ensure it remains accurate.

User Interface:
Create a simple dashboard where loan officers can input borrower details and get a prediction on whether the borrower is likely to default.

* Conclusion
By following the CRISP-DM framework, we have built a Loan Default Detection System that can help financial institutions in Kenya make better lending decisions. The model predicts whether a borrower is likely to default based on their profile and loan details, reducing the risk of bad loans and improving the overall health of the financial sector.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [9]:
# Load the dataset
df = pd.read_csv(r'C:\Users\ADMIN\Downloads\personal\project 5\DSC-CapstoneProject\Data\Final_Loans_dataset.csv')
df



Unnamed: 0,ID,GENDER,AGE,NO_DEFAULT_LOAN,EMPLOYMENT_STATUS,SECTOR,MARITAL_STATUS,CREDIT_SCORE,SCOREGRADE,CRR,CURRENCY,NET INCOME,PRINCIPAL_AMOUNT,EMI,OD_DAYS,PRUDENTIAL_CLASSIFICATION,RISK_GRADE,AREARS,LOAN_STATUS
0,209801.0,FEMALE,37,7.0,EMPLOYED,FINANCE & INSURANCE,MARRIED,615.0,II,B10,KES,5000.00,642000.00,59826.37,204.0,DOUBTFUL,B9,453208.12,CURRENT
1,315048.0,MALE,41,3.0,SELF-EMPLOYED,,MARRIED,529.0,JJ,B20,KES,0.00,78000.00,6149.03,295.0,DOUBTFUL,B9,68917.29,EXPIRED
2,145878.0,MALE,36,6.0,EMPLOYED,TRANSPORT & COMMUNICATION,MARRIED,665.0,FF,A5,KES,1294783.78,80000.00,7439.02,0.0,NORMAL,A1-A6,4.84,CURRENT
3,295535.0,MALE,41,5.0,EMPLOYED,,0,618.0,HH,B20,KES,347554.00,172000.00,16062.90,323.0,DOUBTFUL,B9,195045.53,EXPIRED
4,493960.0,MALE,41,1.0,EMPLOYED,FINANCE & INSURANCE,SINGLE,696.0,DD,A5,KES,4210957.00,300502.44,28098.61,0.0,NORMAL,A1-A6,247.18,CURRENT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048570,,,0,,,,,,,,,,,,,,,,
1048571,,,0,,,,,,,,,,,,,,,,
1048572,,,0,,,,,,,,,,,,,,,,
1048573,,,0,,,,,,,,,,,,,,,,


In [10]:
# Display basic information about the dataset
display(df.head())
display(df.tail())
# df.info() prints its summary and returns None, so call it directly
df.info()


Unnamed: 0,ID,GENDER,AGE,NO_DEFAULT_LOAN,EMPLOYMENT_STATUS,SECTOR,MARITAL_STATUS,CREDIT_SCORE,SCOREGRADE,CRR,CURRENCY,NET INCOME,PRINCIPAL_AMOUNT,EMI,OD_DAYS,PRUDENTIAL_CLASSIFICATION,RISK_GRADE,AREARS,LOAN_STATUS
0,209801.0,FEMALE,37,7.0,EMPLOYED,FINANCE & INSURANCE,MARRIED,615.0,II,B10,KES,5000.0,642000.0,59826.37,204.0,DOUBTFUL,B9,453208.12,CURRENT
1,315048.0,MALE,41,3.0,SELF-EMPLOYED,,MARRIED,529.0,JJ,B20,KES,0.0,78000.0,6149.03,295.0,DOUBTFUL,B9,68917.29,EXPIRED
2,145878.0,MALE,36,6.0,EMPLOYED,TRANSPORT & COMMUNICATION,MARRIED,665.0,FF,A5,KES,1294783.78,80000.0,7439.02,0.0,NORMAL,A1-A6,4.84,CURRENT
3,295535.0,MALE,41,5.0,EMPLOYED,,0,618.0,HH,B20,KES,347554.0,172000.0,16062.9,323.0,DOUBTFUL,B9,195045.53,EXPIRED
4,493960.0,MALE,41,1.0,EMPLOYED,FINANCE & INSURANCE,SINGLE,696.0,DD,A5,KES,4210957.0,300502.44,28098.61,0.0,NORMAL,A1-A6,247.18,CURRENT


Unnamed: 0,ID,GENDER,AGE,NO_DEFAULT_LOAN,EMPLOYMENT_STATUS,SECTOR,MARITAL_STATUS,CREDIT_SCORE,SCOREGRADE,CRR,CURRENCY,NET INCOME,PRINCIPAL_AMOUNT,EMI,OD_DAYS,PRUDENTIAL_CLASSIFICATION,RISK_GRADE,AREARS,LOAN_STATUS
1048570,,,0,,,,,,,,,,,,,,,,
1048571,,,0,,,,,,,,,,,,,,,,
1048572,,,0,,,,,,,,,,,,,,,,
1048573,,,0,,,,,,,,,,,,,,,,
1048574,,,0,,,,,,,,,,,,,,,,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 19 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   ID                         18197 non-null    float64
 1   GENDER                     17749 non-null    object 
 2   AGE                        1045120 non-null  object 
 3   NO_DEFAULT_LOAN            17922 non-null    float64
 4   EMPLOYMENT_STATUS          18197 non-null    object 
 5   SECTOR                     18070 non-null    object 
 6   MARITAL_STATUS             18197 non-null    object 
 7   CREDIT_SCORE               17922 non-null    float64
 8   SCOREGRADE                 17922 non-null    object 
 9   CRR                        18197 non-null    object 
 10  CURRENCY                   18197 non-null    object 
 11  NET INCOME                 17763 non-null    float64
 12  PRINCIPAL_AMOUNT           18197 non-null    float64
 13  EMI         
