# **Business Understanding & Exploratory Data Analysis**
## **Business Understanding**
### **Background**
As the country faces economic woes, the banking sector has been hit especially hard: Kenyans are defaulting on their loans as economic pressure weighs heavily on households and businesses. According to the latest *FinAccess Household Survey*, 16.6% of borrowers defaulted in 2024, up significantly from 10.7% in 2021.

A default occurs when a borrower fails to make their loan payments on time or fails to meet other provisions of the loan agreement.

The rise in defaults coincides with a sharp rise in mobile and digital lending, which has made credit more accessible than ever. According to a survey by the credit rating agency *TransUnion*, the surge in credit access, especially among under-35s, has not been matched by financial literacy, leading many borrowers into avoidable debt cycles and, in turn, high default rates.

Among commercial banks, the *Central Bank of Kenya* (CBK) reports that the ratio of non-performing loans (NPLs), meaning loans unpaid for over 90 days, rose to 17.6 per cent by June 2025, up from 17.4 per cent in April. Applied to the total loan book of Sh4.045 trillion issued by banks as of December 2024, that ratio suggests that about Sh712 billion is now at risk of not being recovered.
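
The at-risk figure follows from applying the NPL ratio to the total loan book. The short sketch below reproduces that arithmetic; the variable names are illustrative, and it assumes, as the text does, that the June 2025 ratio can be applied to the December 2024 loan book.

```python
# Figures quoted in the text above.
npl_ratio = 0.176          # NPL ratio reported for June 2025 (assumed to apply to the loan book below)
loan_book_ksh = 4.045e12   # total loan book as of December 2024, in Kenyan shillings

# Estimated value of non-performing loans, in shillings.
npl_value_ksh = npl_ratio * loan_book_ksh

print(f"Estimated NPLs: Sh{npl_value_ksh / 1e9:.1f} billion")
```

The result is just under Sh712 billion, which matches the order of magnitude cited above.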

The CBK has linked this surge to job losses, stagnant incomes, and the overall rise in the cost of living, which have made it difficult for borrowers to service their loans. This marks **the highest default level recorded in 20 years** and highlights growing financial stress among borrowers.

### **Problem Statement**
Commercial lenders, and digital lenders **especially**, currently struggle to identify borrowers who are likely to default. Without early warning systems, lenders risk financial losses and borrowers risk deeper debt cycles. Without robust borrower profiling, digital lenders extend credit without fully assessing risk, a gap that this project's model seeks to close.

### **Project Objectives**
This project aims to:
- Develop a classification model that predicts loan default risk using data collected during loan applications. 

- Help lenders, through the model, to make informed, risk-sensitive decisions.

- Help borrowers manage their borrowing and avoid getting into debt cycles.

### **Stakeholders**
This model will benefit mobile loan companies, digital lenders, and financial analysts in Kenya's credit sector.

Other stakeholders include:
- **Investors**: Individuals or entities that own shares in a lending company and are interested in its profitability and stock performance.

- **Creditors**: Those who have lent money to a lending company and are concerned with its ability to repay debts and interest.

- **Management**: The people responsible for a lending company's financial performance and accountable to its stakeholders.

- **Borrowers**: The loan applicants.

### **Hypotheses**
- Applicants with no employment are more likely to default.

- Lower monthly income is associated with higher default rates.

- Shorter loan terms correlate with increased default risk.
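
One way hypotheses like these might be checked is by comparing default rates across groups. The sketch below is illustrative only: `default_rate_by` is a hypothetical helper, and the small frame holds just the five rows from the dataset preview in the Data Understanding section, so its numbers say nothing about the full data. The column names `EmploymentType`, `Income`, `LoanTerm`, and `Default` come from the dataset.

```python
import pandas as pd

# Five preview rows only; a stand-in for the full dataset.
sample = pd.DataFrame({
    "EmploymentType": ["Full-time", "Full-time", "Unemployed", "Full-time", "Unemployed"],
    "Income": [85994, 50432, 84208, 31713, 20437],
    "LoanTerm": [36, 60, 24, 24, 48],
    "Default": [0, 0, 1, 0, 0],
})


def default_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of defaulted loans within each group of `column` (hypothetical helper)."""
    return df.groupby(column)["Default"].mean()


rate_by_employment = default_rate_by(sample, "EmploymentType")
rate_by_term = default_rate_by(sample, "LoanTerm")

print(rate_by_employment)
print(rate_by_term)
```

For a continuous feature such as `Income`, one option is to bin the values first (for example with `pd.qcut`) and then group on the bins.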

## **Data Understanding**
In this section, we examine the data: its properties, data types, features, and distributions. We also inspect it for data quality issues. The activities in this section include:
- Understanding the source of the data.

- Importing relevant dependencies and loading the dataset.

- Inspecting the dataset's properties.

- Exploring the data features.

### **Data Source**
This dataset was sourced from Kaggle's [Loan Default Prediction Dataset](https://www.kaggle.com/datasets/nikhil1e9/loan-default?resource=download). It contains information on borrowers and whether or not they defaulted on their loans.

#### *Why is this Dataset Suitable for the Project?*


### **Importing Dependencies and Loading the Dataset**
This notebook focuses on analyzing the dataset, so we use only `Pandas` for the analysis and `Seaborn` and `Matplotlib` for visualization.

In [5]:
# Import relevant dependencies.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

To load the dataset, we use pandas' `read_csv()` function.

In [6]:
# Load dataset.
loans_info = pd.read_csv('../Data/RawData/Loan_default.csv')

# Preview the dataset.
loans_info.head()

| | LoanID | Age | Income | LoanAmount | CreditScore | MonthsEmployed | NumCreditLines | InterestRate | LoanTerm | DTIRatio | Education | EmploymentType | MaritalStatus | HasMortgage | HasDependents | LoanPurpose | HasCoSigner | Default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I38PQUQS96 | 56 | 85994 | 50587 | 520 | 80 | 4 | 15.23 | 36 | 0.44 | Bachelor's | Full-time | Divorced | Yes | Yes | Other | Yes | 0 |
| 1 | HPSK72WA7R | 69 | 50432 | 124440 | 458 | 15 | 1 | 4.81 | 60 | 0.68 | Master's | Full-time | Married | No | No | Other | Yes | 0 |
| 2 | C1OZ6DPJ8Y | 46 | 84208 | 129188 | 451 | 26 | 3 | 21.17 | 24 | 0.31 | Master's | Unemployed | Divorced | Yes | Yes | Auto | No | 1 |
| 3 | V2KKSFM3UN | 32 | 31713 | 44799 | 743 | 0 | 3 | 7.07 | 24 | 0.23 | High School | Full-time | Married | No | No | Business | No | 0 |
| 4 | EY08JDHTZP | 60 | 20437 | 9139 | 633 | 8 | 4 | 6.51 | 48 | 0.73 | Bachelor's | Unemployed | Divorced | No | Yes | Auto | No | 0 |


### **Data Properties Inspection**
We have successfully loaded the dataset into the notebook, so our inspection begins.

Here, we inspect the shape of the data, its data types, what the columns contain, and whether there are any quality issues to worry about. We also carry out a statistical analysis to better understand the features that contain numerical values.
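
As a rough sketch of what such quality checks can look like in pandas, the snippet below counts missing values per column, counts fully duplicated rows, and produces summary statistics for numeric columns. The helper `quality_report` and the toy frame are hypothetical and used for illustration only; on the real data the same calls would be applied to `loans_info`.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Collect basic quality checks for a DataFrame (hypothetical helper)."""
    return {
        # Number of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Count, mean, std, min, quartiles, and max for numeric columns.
        "numeric_summary": df.describe(),
    }


# Toy frame with one missing value and one duplicated row, for illustration only.
toy = pd.DataFrame({
    "Age": [56, 69, 69, None],
    "Income": [85994, 50432, 50432, 31713],
})

report = quality_report(toy)
print(report["missing_per_column"])
print(report["duplicate_rows"])
print(report["numeric_summary"])
```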

#### *Shape*
This helps us know whether the dataset is large enough to support meaningful predictions. We use Pandas' `.shape` attribute.

In [7]:
# Check the dataset's shape
loans_info.shape

(255347, 18)

With more than 250,000 records, we are well positioned to carry out our predictions. A sufficiently large dataset supports the reliability, accuracy, and generalizability of models and predictions, whereas insufficient data can lead to inaccurate outcomes and a lack of robust insights.
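
Row count is only part of the picture for a classification task, since the share of each class in the `Default` target also affects what a model can learn. A minimal sketch of that check follows. It uses only the five `Default` values visible in the preview, so the proportions are not representative of the full dataset.

```python
import pandas as pd

# Default flags from the five preview rows only (not the full dataset).
default_flags = pd.Series([0, 0, 1, 0, 0], name="Default")

# Proportion of each class in the target.
class_share = default_flags.value_counts(normalize=True)
print(class_share)
```

On the loaded data, the equivalent call would be `loans_info["Default"].value_counts(normalize=True)`.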

#### *Data Types*
We use `.info()` in this section. It is an important method for getting a concise summary of the dataset.

In [8]:
# Check the data types.
loans_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   LoanID          255347 non-null  object 
 1   Age             255347 non-null  int64  
 2   Income          255347 non-null  int64  
 3   LoanAmount      255347 non-null  int64  
 4   CreditScore     255347 non-null  int64  
 5   MonthsEmployed  255347 non-null  int64  
 6   NumCreditLines  255347 non-null  int64  
 7   InterestRate    255347 non-null  float64
 8   LoanTerm        255347 non-null  int64  
 9   DTIRatio        255347 non-null  float64
 10  Education       255347 non-null  object 
 11  EmploymentType  255347 non-null  object 
 12  MaritalStatus   255347 non-null  object 
 13  HasMortgage     255347 non-null  object 
 14  HasDependents   255347 non-null  object 
 15  LoanPurpose     255347 non-null  object 
 16  HasCoSigner     255347 non-null  object 
 17  Default   