# **Bank Loan Approval Prediction**

## Life Cycle of a Machine Learning Project

- Understanding the Problem Statement     
- Data Collection     
- Data Checks to perform     
- Exploratory data analysis     
- Data Pre-Processing     
- Model Training     
- Choose best model     

## (1) Problem Statement

Banks receive many loan applications every day, but **not every applicant qualifies for a loan**.

The goal is:

- To build a machine learning model that predicts whether a customer's loan will be approved or rejected, based on their personal and financial details.

## (2) Data Collection 

- Dataset Source - https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset
- The data consists of 13 columns and 4269 rows.

## 2.1 Import Data and Required Packages
Importing Pandas, NumPy, Matplotlib, Seaborn and the warnings library.

In [68]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
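Note that `warnings.filterwarnings('ignore')` silences every warning for the rest of the session, including ones that may point at real data problems. A narrower option, sketched here with only the standard library, hides a single category inside a limited scope. The helper name `run_quietly`, the stand-in `noisy_step` and the choice of `FutureWarning` are assumptions made for illustration; they are not part of the original notebook.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with FutureWarning hidden; other warnings stay visible.

    Hypothetical helper, for illustration only.
    """
    with warnings.catch_warnings():
        # The previous filters are restored when this block exits.
        warnings.simplefilter("ignore", category=FutureWarning)
        return func(*args, **kwargs)


def noisy_step():
    """Stand-in for a library call that emits two kinds of warnings."""
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("possible data problem", UserWarning)
    return "done"


# Record whatever still gets through while the helper is active.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = run_quietly(noisy_step)

print(result, [w.category.__name__ for w in caught])
```

Only the `UserWarning` is recorded, so messages outside the suppressed category remain visible.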

Import the CSV Data as Pandas DataFrame


In [69]:
# Raw string, so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\Loan_Approval_Prediction\LoanApproval_Data\loan_approval_prediction.csv")
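Hard-coded Windows paths are easy to get wrong because a backslash in a normal string literal can start an escape sequence. One option, sketched below, is to build the path with `pathlib` so that no backslashes need to be typed at all. The helper `load_loans` and the two-column demo file are assumptions made only so the example runs anywhere; the real file has 13 columns.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_loans(csv_path):
    """Read the loan CSV at csv_path into a DataFrame (hypothetical helper)."""
    return pd.read_csv(Path(csv_path))


# Demo on a throwaway file that mimics the folder layout used above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "LoanApproval_Data" / "loan_approval_prediction.csv"
    demo_path.parent.mkdir(parents=True)
    demo_path.write_text(
        "loan_id, loan_status\n1, Approved\n2, Rejected\n", encoding="utf-8"
    )
    demo_df = load_loans(demo_path)

print(demo_df.shape)
```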

Show Top 5 Records


In [70]:
df.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


Shape of the dataset


In [71]:
df.shape

(4269, 13)

## 2.2 Dataset Information

| Column name                                            | Meaning / Description                                                        |
| ------------------------------------------------------ | ---------------------------------------------------------------------------- |
| `loan_id`                                              | Unique ID for each loan application                                          |
| `no_of_dependents`                                     | Number of dependents (people dependent on applicant)                         |
| `education`                                            | Education level of applicant (e.g. graduate / not graduate)                  |
| `self_employed`                                        | Whether applicant is self-employed or not                                    |
| `income_annum`                                         | Annual income of the applicant                                               |
| `loan_amount`                                          | Amount of loan requested/required by applicant                               |
| `loan_term`                                            | Requested loan term / tenure (time to repay)                                 |
| `cibil_score`                                          | Credit score of the applicant — measure of creditworthiness                  |
| `residential_assets_value`                             | Value of residential assets owned by applicant                               |
| `commercial_assets_value`                              | Value of any commercial assets owned (if any) by applicant                   |
| `luxury_assets_value`                                  | Value of luxury assets owned by applicant (if any)                           |
| `bank_asset_value`                                     | Value/assets in bank (bank balance / holdings) of applicant                  |
| `loan_status`                                          | Target variable: whether loan was approved or rejected (Approved / Rejected) |

## 3. Data Checks to perform 

3.1 Check dataset columns

In [72]:
df.columns

Index(['loan_id', ' no_of_dependents', ' education', ' self_employed',
       ' income_annum', ' loan_amount', ' loan_term', ' cibil_score',
       ' residential_assets_value', ' commercial_assets_value',
       ' luxury_assets_value', ' bank_asset_value', ' loan_status'],
      dtype='object')

- The column names contain extra leading spaces.
- We remove those spaces so that selecting columns by name does not cause problems later.

3.2 Removing extra spaces from the column names.

In [73]:
# Strip leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()
df.columns

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')

- **Problem Solved**
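As an alternative to cleaning the names after loading, the padding can be dropped while the file is read. The sketch below uses the `skipinitialspace` option of `pd.read_csv` on a tiny in-memory stand-in for the raw file. It assumes the padding consists of spaces that directly follow each comma, as the column names above suggest. The three-column sample is invented for the demonstration.

```python
import io

import pandas as pd

# Tiny stand-in for the raw file: a space follows every comma,
# mirroring the padding seen in the real column names.
raw_csv = io.StringIO(
    "loan_id, education, loan_status\n"
    "1, Graduate, Approved\n"
    "2, Not Graduate, Rejected\n"
)

# skipinitialspace=True tells the parser to skip spaces after each delimiter.
clean_df = pd.read_csv(raw_csv, skipinitialspace=True)

print(list(clean_df.columns))
print(clean_df["education"].tolist())
```

The same option also trims the leading space from text values, while spaces inside a value such as `Not Graduate` are kept.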

3.3 Check dataset information

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   loan_id                   4269 non-null   int64 
 1   no_of_dependents          4269 non-null   int64 
 2   education                 4269 non-null   object
 3   self_employed             4269 non-null   object
 4   income_annum              4269 non-null   int64 
 5   loan_amount               4269 non-null   int64 
 6   loan_term                 4269 non-null   int64 
 7   cibil_score               4269 non-null   int64 
 8   residential_assets_value  4269 non-null   int64 
 9   commercial_assets_value   4269 non-null   int64 
 10  luxury_assets_value       4269 non-null   int64 
 11  bank_asset_value          4269 non-null   int64 
 12  loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


- There are 10 integer columns and 3 object columns. Every column has 4269 non-null entries, so the dataset contains no null values.

3.4 Check missing values

In [75]:
df.isna().sum()

loan_id                     0
no_of_dependents            0
education                   0
self_employed               0
income_annum                0
loan_amount                 0
loan_term                   0
cibil_score                 0
residential_assets_value    0
commercial_assets_value     0
luxury_assets_value         0
bank_asset_value            0
loan_status                 0
dtype: int64
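When a dataset does contain gaps, the raw counts are easier to judge next to the share of rows they represent. The following sketch wraps both in one helper. The name `missing_summary` and the toy frame with a single gap are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame with one gap, used only to show the shape of the result.
toy = pd.DataFrame({
    "cibil_score": [778, 417, np.nan, 467],
    "loan_status": ["Approved", "Rejected", "Rejected", "Rejected"],
})

summary = missing_summary(toy)
print(summary)
```

On the loan data both columns of the summary would be zero for every row, in line with the output above.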

3.5 Check Duplicates

In [76]:
df[df.duplicated()]

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status


- There are no duplicate records present in the dataset.
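Because `loan_id` is still part of the frame at this point and differs on every row, `df.duplicated()` cannot flag two applications that are identical apart from their ID. A stricter check, sketched here on an invented three-row frame, compares rows on every column except the identifier by passing `subset` to `duplicated`.

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "income_annum": [9600000, 4100000, 9600000],
    "loan_amount": [29900000, 12200000, 29900000],
})

# With the unique ID included, no full row can repeat.
full_row_duplicates = int(toy.duplicated().sum())

# Compare on every column except the ID instead.
feature_cols = [col for col in toy.columns if col != "loan_id"]
feature_duplicates = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_duplicates, feature_duplicates)
```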

3.6 Check the number of unique values of each column

- 3.6.1 for categorical columns

In [77]:
df.select_dtypes(include='object').nunique()

education        2
self_employed    2
loan_status      2
dtype: int64

- There are three categorical columns and each has 2 unique values.

- 3.6.2 for Integer columns

In [78]:
df.select_dtypes(include='int').nunique()

loan_id                     4269
no_of_dependents               6
income_annum                  98
loan_amount                  378
loan_term                     10
cibil_score                  601
residential_assets_value     278
commercial_assets_value      188
luxury_assets_value          379
bank_asset_value             146
dtype: int64

- `no_of_dependents` takes 6 distinct values and `loan_term` takes 10.
- **The `loan_id` column is unique for every row. It is only an identifier and carries no predictive information, so we remove it from the dataset.**

3.7 Remove loan_id column

In [79]:
df.drop('loan_id',axis=1,inplace=True)
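The reasoning behind dropping `loan_id` (as many distinct values as rows) can be turned into a small check. The sketch below lists such columns and removes them by returning a new frame, which keeps the cell safe to re-run, whereas `drop(..., inplace=True)` raises a `KeyError` on a second run. The helper name `identifier_like_columns` and the toy data are assumptions for illustration.

```python
import pandas as pd


def identifier_like_columns(frame):
    """List columns whose number of distinct values equals the row count."""
    return [col for col in frame.columns if frame[col].nunique() == len(frame)]


toy = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "no_of_dependents": [2, 0, 3, 3],
    "loan_status": ["Approved", "Rejected", "Rejected", "Rejected"],
})

id_cols = identifier_like_columns(toy)

# drop() without inplace=True returns a new frame and leaves toy untouched.
features = toy.drop(columns=id_cols)

print(id_cols, list(features.columns))
```

A continuous measurement can also be unique on every row, so the returned list is best reviewed by hand before anything is dropped.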

3.8 Check statistics of data set

In [80]:
df.describe(include='int')

Unnamed: 0,no_of_dependents,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
count,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0,4269.0
mean,2.498712,5059124.0,15133450.0,10.900445,599.936051,7472617.0,4973155.0,15126310.0,4976692.0
std,1.69591,2806840.0,9043363.0,5.709187,172.430401,6503637.0,4388966.0,9103754.0,3250185.0
min,0.0,200000.0,300000.0,2.0,300.0,-100000.0,0.0,300000.0,0.0
25%,1.0,2700000.0,7700000.0,6.0,453.0,2200000.0,1300000.0,7500000.0,2300000.0
50%,3.0,5100000.0,14500000.0,10.0,600.0,5600000.0,3700000.0,14600000.0,4600000.0
75%,4.0,7500000.0,21500000.0,16.0,748.0,11300000.0,7600000.0,21700000.0,7100000.0
max,5.0,9900000.0,39500000.0,20.0,900.0,29100000.0,19400000.0,39200000.0,14700000.0


- Applicants have at most 5 dependents, and the longest loan term in the dataset is 20.
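The `min` row above shows -100000 for `residential_assets_value`, which is hard to interpret as an asset value. A quick way to see how many such entries exist is to count values below zero in each numeric column, as in this sketch. The helper name `negative_counts` and the toy frame are assumptions, and how to treat the affected rows is left to the pre-processing step.

```python
import pandas as pd


def negative_counts(frame):
    """Count values below zero in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return (numeric < 0).sum()


toy = pd.DataFrame({
    "residential_assets_value": [2400000, -100000, 7100000],
    "bank_asset_value": [8000000, 3300000, 0],
    "education": ["Graduate", "Not Graduate", "Graduate"],
})

counts = negative_counts(toy)

# Show only the columns that actually contain negative values.
print(counts[counts > 0])
```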

3.9 Check the unique values of the categorical columns

In [81]:
for i in list(df.select_dtypes(include='object')):
    print(f"• Unique value of {i} column are : {df[i].unique()}")

• Unique value of education column are : [' Graduate' ' Not Graduate']
• Unique value of self_employed column are : [' No' ' Yes']
• Unique value of loan_status column are : [' Approved' ' Rejected']


- The values in the categorical columns contain extra leading spaces.
- We remove these spaces so that they do not cause issues while analyzing or processing the data.

3.10 Removing extra spaces from the values of the columns

In [82]:
# Strip leading/trailing whitespace from the values of every text column
for i in list(df.select_dtypes(include='object')):
    df[i] = df[i].str.strip()

for i in list(df.select_dtypes(include='object')):
    print(f"• Unique value of {i} column are : {df[i].unique()}")

• Unique value of education column are : ['Graduate' 'Not Graduate']
• Unique value of self_employed column are : ['No' 'Yes']
• Unique value of loan_status column are : ['Approved' 'Rejected']


- **Problem Solved**
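Rather than reading the printed lists by eye, the cleanup can be confirmed by looking for any text value that differs from its stripped form. This is a minimal sketch; the helper name `padded_columns` and the two-column toy frame are assumptions for illustration.

```python
import pandas as pd


def padded_columns(frame):
    """Return text columns holding any value with leading/trailing spaces."""
    return [
        col for col in frame.columns
        if pd.api.types.is_string_dtype(frame[col])
        and (frame[col] != frame[col].str.strip()).any()
    ]


toy = pd.DataFrame({
    "education": [" Graduate", " Not Graduate"],
    "self_employed": ["No", "Yes"],
})

before = padded_columns(toy)

# Strip only the affected columns, then check again.
for col in before:
    toy[col] = toy[col].str.strip()
after = padded_columns(toy)

print(before, after)
```

An empty list after the cleanup means no padded values remain.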

3.11 Check the head, tail and sample

In [83]:
df.head()

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


In [88]:
df.tail()

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
4264,5,Graduate,Yes,1000000,2300000,12,317,2800000,500000,3300000,800000,Rejected
4265,0,Not Graduate,Yes,3300000,11300000,20,559,4200000,2900000,11000000,1900000,Approved
4266,2,Not Graduate,No,6500000,23900000,18,457,1200000,12400000,18100000,7300000,Rejected
4267,1,Not Graduate,No,4100000,12800000,8,780,8200000,700000,14100000,5800000,Approved
4268,1,Graduate,No,9200000,29700000,10,607,17800000,11800000,35700000,12000000,Approved


In [86]:
df.sample(5)

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
2825,3,Graduate,No,600000,2100000,16,892,700000,400000,2400000,500000,Approved
1748,2,Graduate,No,4200000,10200000,14,407,5900000,4500000,12000000,2500000,Rejected
2296,4,Graduate,No,7000000,25700000,20,597,19900000,13500000,27500000,10100000,Approved
2746,4,Not Graduate,No,2600000,8800000,10,682,4500000,4700000,10300000,1600000,Approved
850,2,Graduate,Yes,1800000,4200000,8,648,1200000,3100000,6200000,1600000,Approved
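`df.sample(5)` draws different rows each time the cell runs, so the table above will not be reproduced exactly on a re-run. Passing a fixed seed through `random_state` makes the draw repeatable. In the sketch below the seed value 42 and the toy frame are arbitrary choices.

```python
import pandas as pd

toy = pd.DataFrame({"cibil_score": [778, 417, 506, 467, 382, 317, 559, 457]})

# The same seed yields the same rows on every run.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

print(first.index.tolist() == second.index.tolist())
```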


## 4. Univariate Analysis