TITLE : Loan Approval Analysis and Prediction using Financial Risk Indicators

PROBLEM STATEMENT

Financial institutions receive a large number of loan applications every day and must decide whether to approve or reject them based on the applicant's financial and personal information. Making incorrect decisions may lead to financial losses or rejection of eligible customers.

The objective of this project is to analyze loan applicant data using Exploratory Data Analysis (EDA) and identify key factors influencing loan approval such as income, employment status, dependents, assets, loan amount, loan term, and EMI burden.

Further, a machine learning model is built to predict whether a loan application will be approved or rejected based on these features. The project also evaluates risk indicators like EMI-to-Income ratio to understand repayment capacity and improve decision-making.

In [2]:
import numpy as np

In [3]:
import pandas as pd

In [4]:
import matplotlib.pyplot as plt

Importing the dataset into a DataFrame

In [5]:
data = pd.read_csv(r"C:\Users\G.Sreenivasulu\ABHI\loan_approval_project\dataset\loan_approval_dataset.csv")

Let's have an overview of the data


In [6]:
#first 5 rows
data.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [7]:
#column information and datatypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


In [8]:
#basic statistics of non-categorical values
data.describe()

Unnamed: 0,loan_id,no_of_dependents,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
count,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0
mean,2135.0,2.498712,5059124.0,15133450.0,10.900445,599.936051,7472617.0,4973155.0,15126310.0,4976692.0
std,1232.498479,1.69591,2806840.0,9043363.0,5.709187,172.430401,6503637.0,4388966.0,9103754.0,3250185.0
min,1.0,0.0,200000.0,300000.0,2.0,300.0,-100000.0,0.0,300000.0,0.0
25%,1068.0,1.0,2700000.0,7700000.0,6.0,453.0,2200000.0,1300000.0,7500000.0,2300000.0
50%,2135.0,3.0,5100000.0,14500000.0,10.0,600.0,5600000.0,3700000.0,14600000.0,4600000.0
75%,3202.0,4.0,7500000.0,21500000.0,16.0,748.0,11300000.0,7600000.0,21700000.0,7100000.0
max,4269.0,5.0,9900000.0,39500000.0,20.0,900.0,29100000.0,19400000.0,39200000.0,14700000.0
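The `min` row above reports -100000 for `residential_assets_value`, which is not a plausible asset value. A minimal sketch for counting such entries is shown here on a small made-up frame rather than the real data; the helper name `count_negative` and the sample rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

ASSET_COLS = [
    "residential_assets_value",
    "commercial_assets_value",
    "luxury_assets_value",
    "bank_asset_value",
]

def count_negative(df, cols):
    """Return the number of negative entries in each of the given columns."""
    return (df[cols] < 0).sum()

# Made-up rows for illustration only; not the real dataset.
sample = pd.DataFrame({
    "residential_assets_value": [2400000, -100000, 7100000],
    "commercial_assets_value": [17600000, 2200000, 0],
    "luxury_assets_value": [22700000, 8800000, 300000],
    "bank_asset_value": [8000000, 3300000, 0],
})

negatives = count_negative(sample, ASSET_COLS)
print(negatives)
```

On the notebook's `data`, the same call should work once the column names have been stripped of their leading spaces. Whether negative entries are better treated as missing, clipped to zero, or dropped depends on how the data was collected, so that choice is left open here.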


In [9]:
#shape of the dataframe
data.shape

(4269, 13)

In [10]:
#columns of the dataset
data.columns

Index(['loan_id', ' no_of_dependents', ' education', ' self_employed',
       ' income_annum', ' loan_amount', ' loan_term', ' cibil_score',
       ' residential_assets_value', ' commercial_assets_value',
       ' luxury_assets_value', ' bank_asset_value', ' loan_status'],
      dtype='object')

In [11]:
#finally, let's take 2 random rows
data.sample(2)

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
405,406,2,Not Graduate,Yes,1800000,5700000,4,809,1200000,1900000,6300000,2400000,Approved
940,941,5,Graduate,Yes,7600000,20000000,10,522,21500000,1400000,20400000,4300000,Rejected


The output above shows that the data was imported correctly.

Let's look more closely at the data and prepare it for EDA


In [12]:
#looking for the null values
data.isnull().sum()

loan_id                      0
 no_of_dependents            0
 education                   0
 self_employed               0
 income_annum                0
 loan_amount                 0
 loan_term                   0
 cibil_score                 0
 residential_assets_value    0
 commercial_assets_value     0
 luxury_assets_value         0
 bank_asset_value            0
 loan_status                 0
dtype: int64

In [13]:
# the columns have a leading space
data.columns = data.columns.str.strip()
data.columns

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')

In [14]:
#let us also check the string/object valued columns

print(data['loan_status'].values)
print(data['education'].values)

[' Approved' ' Rejected' ' Rejected' ... ' Rejected' ' Approved'
 ' Approved']
[' Graduate' ' Not Graduate' ' Graduate' ... ' Not Graduate'
 ' Not Graduate' ' Graduate']


In [15]:
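# the string-valued columns have a leading space in front of their values
data['loan_status'] = data['loan_status'].str.strip()
data['education'] = data['education'].str.strip()
# self_employed is also an object column, so strip it too in case it has the same issue
data['self_employed'] = data['self_employed'].str.strip()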

print(data['loan_status'].values)
print(data['education'].values)

['Approved' 'Rejected' 'Rejected' ... 'Rejected' 'Approved' 'Approved']
['Graduate' 'Not Graduate' 'Graduate' ... 'Not Graduate' 'Not Graduate'
 'Graduate']
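Stripping columns one at a time makes it easy to miss one. As a sketch of a more general approach, the helper below strips the column names and every string-valued column in one pass. The name `strip_string_columns` is a hypothetical helper, not something from the original notebook, and the demo frame is made up.

```python
import pandas as pd

def strip_string_columns(df):
    """Return a copy with whitespace stripped from column names and string values."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    for col in out.columns:
        # is_string_dtype covers object columns holding strings as well as
        # the dedicated pandas string dtype.
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out

# Made-up frame that mimics the leading-space problem seen above.
raw = pd.DataFrame({
    "loan_id": [1, 2],
    " education": [" Graduate", " Not Graduate"],
    " self_employed": [" No", " Yes"],
    " loan_status": [" Approved", " Rejected"],
})

cleaned = strip_string_columns(raw)
print(cleaned.columns.tolist())
print(cleaned["self_employed"].tolist())
```

Working on a copy keeps the raw frame available for comparison, which is convenient while exploring.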


In [16]:
# there aren't any null/missing values in this dataset,
# but in case future datasets contain some, fill them with defaults:
# the median for numeric columns and a fixed category for string columns

data['cibil_score'] = data['cibil_score'].fillna(data['cibil_score'].median())
data['loan_term'] = data['loan_term'].fillna(data['loan_term'].median())
data['loan_amount'] = data['loan_amount'].fillna(data['loan_amount'].median())
data['income_annum'] = data['income_annum'].fillna(data['income_annum'].median())
data['self_employed'] = data['self_employed'].fillna("No")
data['education'] = data['education'].fillna("Graduate")

In [17]:
# checking if there are any duplicates
data.duplicated().sum()

np.int64(0)

The data is cleaned and ready for further assessment

Analysing the cleaned data

In [18]:
# adding new columns that are used for credit risk analysis

# TOTAL ASSETS
data['total_assets'] = data['residential_assets_value'] + data['commercial_assets_value'] + data['luxury_assets_value'] + data['bank_asset_value']

#INCOME-TO-LOAN RATIO
data['inc_loan'] = data['income_annum']/data['loan_amount']
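The problem statement also mentions the EMI-to-Income ratio as a risk indicator. The dataset has no interest-rate column, so any EMI figure needs an assumed rate. The sketch below uses the standard reducing-balance EMI formula, EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), with r the monthly rate and n the number of monthly instalments. It assumes that `loan_term` is expressed in years and that a single annual rate applies to every loan; both are assumptions, and the function names are illustrative.

```python
def monthly_emi(principal, annual_rate, term_years):
    """Reducing-balance EMI for a loan repaid in equal monthly instalments.

    annual_rate is a fraction (0.12 means 12% per year) and is an ASSUMED
    input: the dataset itself does not record an interest rate.
    """
    n_months = term_years * 12
    if annual_rate == 0:
        # With no interest the loan is repaid in equal parts.
        return principal / n_months
    r = annual_rate / 12                # monthly interest rate
    growth = (1 + r) ** n_months        # compounding factor over the full term
    return principal * r * growth / (growth - 1)

def emi_to_income(principal, annual_income, annual_rate, term_years):
    """Share of monthly income consumed by the EMI."""
    return monthly_emi(principal, annual_rate, term_years) / (annual_income / 12)

# Worked example with made-up numbers: 1,200,000 borrowed for 1 year at 12%.
example_emi = monthly_emi(1_200_000, 0.12, 1)
example_ratio = emi_to_income(1_200_000, 2_400_000, 0.12, 1)
print(round(example_emi, 2), round(example_ratio, 4))
```

Because the functions use only arithmetic operators, passing pandas columns such as `data['loan_amount']`, `data['income_annum']` and `data['loan_term']` instead of scalars should yield a per-applicant ratio, as long as the rate stays a scalar.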


In [19]:
#let's check the count of the target variable

data['loan_status'].value_counts()

loan_status
Approved    2656
Rejected    1613
Name: count, dtype: int64
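The counts above are easier to interpret as proportions. A small sketch follows, rebuilding the target column from the two counts reported above rather than loading the file:

```python
import pandas as pd

# Rebuild the target column from the counts shown above.
status = pd.Series(["Approved"] * 2656 + ["Rejected"] * 1613, name="loan_status")

# normalize=True returns class shares instead of raw counts.
shares = status.value_counts(normalize=True)
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()
```

Roughly 62% of applications are approved, so the classes are moderately imbalanced. A model that always predicts "Approved" would already reach about 62% accuracy, which makes that figure a useful baseline when evaluating the classifier.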

In [21]:
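# index=False keeps the row index out of the file, so reloading it does not add an "Unnamed: 0" column
data.to_csv(r"C:\Users\G.Sreenivasulu\ABHI\loan_approval_project\dataset\01_cleaned.csv", index=False)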
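A small in-memory round trip shows what the `index` argument of `to_csv` changes. The frame and the helper `round_trip_columns` are made up for illustration, and an in-memory buffer stands in for the file on disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"loan_id": [1, 2], "loan_status": ["Approved", "Rejected"]})

def round_trip_columns(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV, read it back, and return its columns."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

with_index = round_trip_columns(df)                  # default: the index is written
without_index = round_trip_columns(df, index=False)  # the index is left out

print(with_index)
print(without_index)
```

With the default setting, the row index is written as an unnamed first column, which `read_csv` then labels `Unnamed: 0` on reload. Passing `index=False` avoids that extra column.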