# Loan Approval Prediction using Random Forest Classifier Algorithm

Citation:

The Loan Approval dataset is provided by: https://www.kaggle.com/

## Problem Definition

The goal of this project is to build a machine learning model that predicts whether a loan application will be approved based on various applicant information such as income, credit history, loan amount, and other relevant features. This will help financial institutions make informed decisions on loan approvals, minimizing the risk of default while improving customer satisfaction.

**Objectives**:
* Accurately predict loan approval using applicant data.
* Identify key features that impact loan approval decisions.
* Improve decision-making for loan officers by automating the approval process with a reliable model.

**Target Variable**: The target variable is `loan_status`.

#### Importing Required Libraries

In [567]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  

## Dataset Creation

In [568]:
df = pd.read_csv('loan_approval_dataset.csv')
df.head(3)

| | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |
| 2 | 3 | 3 | Graduate | No | 9100000 | 29700000 | 20 | 506 | 7100000 | 4500000 | 33300000 | 12800000 | Rejected |


Make a copy of the dataset so that the original DataFrame stays unchanged.

In [569]:
df_copy = df.copy()
df_copy.head(2)

| | loan_id | no_of_dependents | education | self_employed | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Graduate | No | 9600000 | 29900000 | 12 | 778 | 2400000 | 17600000 | 22700000 | 8000000 | Approved |
| 1 | 2 | 0 | Not Graduate | Yes | 4100000 | 12200000 | 8 | 417 | 2700000 | 2200000 | 8800000 | 3300000 | Rejected |


#### Initial Data Preprocessing

* Looking at the Data Structure: info(), describe(), value_counts()
* Handling missing values
* Removing duplicates
* Converting categorical data into numerical form 
* Basic feature selection (removing irrelevant columns)

In [570]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


In [571]:
df_copy.describe()

| | loan_id | no_of_dependents | income_annum | loan_amount | loan_term | cibil_score | residential_assets_value | commercial_assets_value | luxury_assets_value | bank_asset_value |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 | 4269.0 |
| mean | 2135.0 | 2.498712 | 5059124.0 | 15133450.0 | 10.900445 | 599.936051 | 7472617.0 | 4973155.0 | 15126310.0 | 4976692.0 |
| std | 1232.498479 | 1.69591 | 2806840.0 | 9043363.0 | 5.709187 | 172.430401 | 6503637.0 | 4388966.0 | 9103754.0 | 3250185.0 |
| min | 1.0 | 0.0 | 200000.0 | 300000.0 | 2.0 | 300.0 | -100000.0 | 0.0 | 300000.0 | 0.0 |
| 25% | 1068.0 | 1.0 | 2700000.0 | 7700000.0 | 6.0 | 453.0 | 2200000.0 | 1300000.0 | 7500000.0 | 2300000.0 |
| 50% | 2135.0 | 3.0 | 5100000.0 | 14500000.0 | 10.0 | 600.0 | 5600000.0 | 3700000.0 | 14600000.0 | 4600000.0 |
| 75% | 3202.0 | 4.0 | 7500000.0 | 21500000.0 | 16.0 | 748.0 | 11300000.0 | 7600000.0 | 21700000.0 | 7100000.0 |
| max | 4269.0 | 5.0 | 9900000.0 | 39500000.0 | 20.0 | 900.0 | 29100000.0 | 19400000.0 | 39200000.0 | 14700000.0 |


In [572]:
df_copy[' education'].value_counts()

 education
Graduate        2144
Not Graduate    2125
Name: count, dtype: int64

In [573]:
df_copy[' self_employed'].value_counts()

 self_employed
Yes    2150
No     2119
Name: count, dtype: int64

In [574]:
df_copy[' loan_status'].value_counts()

 loan_status
Approved    2656
Rejected    1613
Name: count, dtype: int64

Let's check class imbalance by calculating the ratio of the two classes.

In [575]:
class_counts = df_copy[' loan_status'].value_counts()

class_0_count = class_counts.iloc[0]
class_1_count = class_counts.iloc[1]

ratio = class_0_count / class_1_count
print(f"Ratio of {class_counts.index[0]} to {class_counts.index[1]}: {ratio:.2f}")

class_percentages = df_copy[' loan_status'].value_counts(normalize=True) * 100
print("\nClass Percentages:")
print(class_percentages)

Ratio of  Approved to  Rejected: 1.65

Class Percentages:
 loan_status
Approved    62.215976
Rejected    37.784024
Name: proportion, dtype: float64


The classes are moderately imbalanced: about 62% of applications are approved and 38% are rejected.

The column names contain a leading space between the opening quotation mark and the name (for example `' education'`). Let's strip those spaces.

In [576]:
# Strip leading and trailing whitespace from every column name in one step
df_copy.columns = df_copy.columns.str.strip()

df_copy.columns

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')

Remove leading and trailing whitespace from each string value in the categorical columns.

In [577]:
cat_columns = df_copy.select_dtypes(include=['object']).columns

# Strip surrounding whitespace from every string value in the categorical columns
for col in cat_columns:
    df_copy[col] = df_copy[col].str.strip()

for col in cat_columns:
    print(df_copy[col].unique())

['Graduate' 'Not Graduate']
['No' 'Yes']
['Approved' 'Rejected']


In [578]:
df_copy[df_copy.duplicated()].any()

loan_id                     False
no_of_dependents            False
education                   False
self_employed               False
income_annum                False
loan_amount                 False
loan_term                   False
cibil_score                 False
residential_assets_value    False
commercial_assets_value     False
luxury_assets_value         False
bank_asset_value            False
loan_status                 False
dtype: bool
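
The check above returns one boolean per column. A more direct summary counts duplicate rows and missing values per column. The following is a minimal sketch on a small hypothetical frame `demo` (invented values, standing in for `df_copy`); dropping the affected rows is only one possible policy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_copy: one duplicated row and one missing value.
demo = pd.DataFrame({
    "income_annum": [9600000, 4100000, 4100000, np.nan],
    "cibil_score": [778, 417, 417, 506],
})

n_duplicates = int(demo.duplicated().sum())  # rows identical to an earlier row
missing_per_column = demo.isna().sum()       # missing values in each column

print(f"Duplicate rows: {n_duplicates}")
print(missing_per_column)

# Drop duplicate rows and rows with missing values (one simple policy among several).
cleaned = demo.drop_duplicates().dropna()
print(cleaned.shape)
```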

Convert the `Approved` and `Rejected` values in the `loan_status` target to `1` and `0`, respectively.

In [579]:
df_copy['loan_status'] = [1 if status == 'Approved' else 0 for status in df_copy['loan_status']]
df_copy['loan_status'].unique()

array([1, 0], dtype=int64)
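
The remaining categorical columns (`education`, `self_employed`) can be encoded in a similar way, and the identifier column can be dropped as part of basic feature selection. The sketch below assumes a small hypothetical frame `sample` with the same column names; the 0/1 assignment is an arbitrary choice made for illustration.

```python
import pandas as pd

# Hypothetical sample with the same categorical columns as the dataset.
sample = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "self_employed": ["No", "Yes", "No"],
    "cibil_score": [778, 417, 506],
})

# Each column has exactly two categories, so a 0/1 mapping is sufficient.
sample["education"] = sample["education"].map({"Not Graduate": 0, "Graduate": 1})
sample["self_employed"] = sample["self_employed"].map({"No": 0, "Yes": 1})

# loan_id is a row identifier, not an applicant attribute, so it is dropped.
sample = sample.drop(columns=["loan_id"])
print(sample)
```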

#### Exploratory Data Analysis (EDA)

* Visualize the data using histograms, scatter plots, box-plots etc.
* Identify `patterns, relationships, or outliers` in the data.
* Understand the `distribution of features, correlations, redundancy and multicollinearity` etc.
* Check for class imbalance.
* Feature engineering might be done based on insights from EDA (e.g., creating new features or transforming existing ones).
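
The EDA steps above could be sketched as follows. The frame `eda_df` is synthetic: the value ranges follow the `describe()` output, but the link between `cibil_score` and `loan_status` is invented purely so the correlation has something to show.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for df_copy (assumption: not the real data).
rng = np.random.default_rng(0)
n = 500
eda_df = pd.DataFrame({
    "income_annum": rng.integers(200_000, 9_900_000, n),
    "loan_amount": rng.integers(300_000, 39_500_000, n),
    "cibil_score": rng.integers(300, 901, n),
})
eda_df["loan_status"] = (eda_df["cibil_score"] > 550).astype(int)

# Histograms show the distribution of each numeric feature.
feature_cols = ["income_annum", "loan_amount", "cibil_score"]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, feature_cols):
    ax.hist(eda_df[col], bins=30)
    ax.set_title(col)
fig.tight_layout()

# Pairwise correlations; large values between two features hint at redundancy.
corr = eda_df.corr()
print(corr["loan_status"].sort_values(ascending=False))
```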

#### Further Preprocessing

* Dealing with `outliers` found during EDA.
* Feature engineering
* Scaling/normalizing and creating pipeline.
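
One common way to handle outliers is the 1.5 × IQR rule, and a scikit-learn `Pipeline` keeps scaling and the model together so the scaler is fitted on training data only. The sketch below uses invented values and is not the notebook's actual preprocessing. Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler mainly matters if a scale-sensitive model is swapped in.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical asset values (in millions) with one extreme point.
values = pd.Series([2.2, 5.6, 7.1, 11.3, 4.5, 6.0, 95.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)

# Capping keeps the row but limits the extreme value to the fence.
capped = values.clip(lower=lower, upper=upper)

# Scaling and model in one estimator; fit() scales, then trains.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
```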

#### Train-Test Split

* Splitting the dataset into training and test sets.
* Training set: 70-80% of the dataset
* Testing set: 20-30% of the dataset.
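
A minimal sketch of an 80/20 split with `train_test_split`, using synthetic arrays in place of the real features and target. Passing `stratify=y` is an assumption made here because the classes are imbalanced: it keeps the class proportions nearly identical in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a target with roughly 62% positives (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.62).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"Positive share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```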

## Model Selection 

## Model Training

* Train a basic model without tuning any hyperparameters to establish a `baseline performance`.
* Fit the model to the entire training set
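
A baseline along these lines might look like the sketch below. It assumes a `RandomForestClassifier` with default hyperparameters, as the title suggests, and uses `make_classification` data in place of the loan dataset, so the printed scores say nothing about the real problem. The fitted model's `feature_importances_` gives a first look at which features matter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced data (assumption: stands in for the loan data).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.38, 0.62], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default hyperparameters establish the baseline.
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)

train_acc = baseline.score(X_train, y_train)
test_acc = baseline.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Impurity-based importances, one value per feature, summing to 1.
importances = baseline.feature_importances_
print(importances.round(3))
```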

## Model Assessment

* Evaluate on both training and test set.
* Cross-validation to evaluate model performance.
* Compare `training accuracy` and `test accuracy` to detect `overfitting or underfitting`.
* Compare `test accuracy` and `cross-validation scores` to provide `better measures of generalization`.
* Evaluate Initial Model: Generate and examine the `classification report`.
* Plot and visualize `learning curves`.
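
The assessment steps above could be sketched as follows, again on synthetic data with an assumed `RandomForestClassifier`. `cross_val_score` gives per-fold accuracy on the training set, and `learning_curve` returns the scores needed to plot training versus validation performance at increasing training sizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, learning_curve, train_test_split

# Synthetic data (assumption: stands in for the loan dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation on the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Per-class precision, recall, and F1 on the held-out test set.
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Scores at 20%, 60%, and 100% of each training fold; rows are sizes, columns are folds.
train_sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, train_sizes=[0.2, 0.6, 1.0]
)
print(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1))
```

A large, persistent gap between the mean training and validation scores suggests overfitting; two low, converging curves suggest underfitting.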

## Model Optimization

* Use `GridSearchCV` or `RandomizedSearchCV` to find the optimal combination of hyperparameters.
* Evaluate final Model: Generate and examine the `classification report`.
* After hyperparameter tuning, compute a `confusion matrix` on the test set to check that the tuned model generalizes to new, unseen data.
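
A minimal tuning sketch with `GridSearchCV` is shown below. The parameter grid, the F1 scoring choice, and the synthetic data are all assumptions for illustration; a real grid would be chosen from the baseline results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption: stands in for the loan dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid; every combination is evaluated with 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred))
```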


## Model Deployment (Optional)

* If the final model performs well, consider saving it with `joblib` or `pickle` for deployment.
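
Saving and restoring a fitted model with `joblib` could look like this sketch. The file name `loan_model.joblib` is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model (assumption: stands in for the tuned loan model).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "loan_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted estimator
    restored = joblib.load(model_path)  # load it back into memory

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match:", same_predictions)
```

Only load model files from trusted sources, since `joblib` and `pickle` can execute arbitrary code during loading.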

## Documentation

* Document the workflow, including the rationale for data preprocessing choices, model performance metrics, and any tuning steps you performed. Ensure reproducibility.