# Loan Eligibility Prediction Using Machine Learning

# Problem Statement:

The process of determining loan eligibility is critical for financial institutions to minimize risk while extending loans to individuals. This dataset consists of key demographic and financial information about loan applicants, including factors such as gender, marital status, income, credit history, loan amount, and property location. **The current challenge is to predict whether a loan applicant will be eligible for a loan in the future based on their profile.**

# Objective:

The goal of this analysis is to build a predictive model that can assess the eligibility of loan seekers by analyzing historical data. By doing so, the model will help financial institutions make informed decisions, improving approval accuracy and reducing the risk of loan defaults.



# Importing Useful Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# For handling warnings
import warnings
warnings.filterwarnings('ignore') 


# Loading the Dataset

In [None]:
train_file_path = '/kaggle/input/finance-loan-approval-prediction-data/train.csv'
test_file_path = '/kaggle/input/finance-loan-approval-prediction-data/test.csv'

df_train = pd.read_csv(train_file_path)

# Understanding the Dataset

In [None]:
print ('Brief info of the data:')
df_train.info() # prints the summary directly; info() returns None, so it is not stored
description = df_train.describe() # brief description of the numerical columns
quick_peek = df_train.head() # taking a peek at the data
data_shape = df_train.shape # (rows, columns)

no_of_rows, no_of_columns = data_shape
no_of_features = no_of_columns - 1 # all columns minus the target column
tot_num_data = no_of_rows * no_of_columns

print ('-----------------------------------------------------------------------')
print (f'\nDescriptions of Columns:\n{description}')
print ('------------------------------------------------------------------------')
print (f'\nA quick view of the dataset:\n{quick_peek}')
print ('------------------------------------------------------------------------')
print (f'\nshape of data:\n{data_shape}')

print (f'Number of rows: {no_of_rows}')
print (f'Number of columns: {no_of_columns}')
print (f'Number of features: {no_of_features}')
print (f'Total number of data: {tot_num_data}')



# Data Preprocessing

## Handling missing values

In [None]:
#Checking for missing values

missing_values = df_train.isnull().sum()

print ('Missing values in each column: \n', missing_values)

In [None]:
# Percentage of missing values

missing_percentage = (missing_values/len(df_train))*100

print ('percentage of missing values: \n', missing_percentage.astype(float))

## Filling the missing values

For categorical data columns (Gender, Married, Dependents, Self_Employed, Credit_History), the missing values will be filled with the **mode** of the column.

Missing values in the numerical data columns (LoanAmount, Loan_Amount_Term) will be filled with the **mean** of the column.

In [None]:
# Filling categorical data

df_train['Gender'] = df_train['Gender'].fillna(df_train['Gender'].mode()[0])
df_train['Married'] = df_train['Married'].fillna(df_train['Married'].mode()[0])
df_train['Dependents'] = df_train['Dependents'].fillna(df_train['Dependents'].mode()[0])
df_train['Self_Employed'] = df_train['Self_Employed'].fillna(df_train['Self_Employed'].mode()[0])
df_train['Credit_History'] = df_train['Credit_History'].fillna(df_train['Credit_History'].mode()[0])

In [None]:
# Filling numerical data

df_train['LoanAmount'] = df_train['LoanAmount'].fillna(df_train['LoanAmount'].mean())
df_train['Loan_Amount_Term'] = df_train['Loan_Amount_Term'].fillna(df_train['Loan_Amount_Term'].mean())
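The fill values can also be computed once and reused on unseen data, so that the training data and any later data are imputed consistently. The sketch below is a minimal illustration on made-up toy frames; `toy_train`, `toy_new`, and `fill_values` are hypothetical names and are not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the training set and for unseen data (hypothetical values)
toy_train = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})
toy_new = pd.DataFrame({
    "Gender": [np.nan, "Female"],
    "LoanAmount": [np.nan, 90.0],
})

categorical_cols = ["Gender"]
numerical_cols = ["LoanAmount"]

# Learn the fill values from the training data only
fill_values = {col: toy_train[col].mode()[0] for col in categorical_cols}
fill_values.update({col: toy_train[col].mean() for col in numerical_cols})

# Apply the same values to both frames
toy_train_filled = toy_train.fillna(value=fill_values)
toy_new_filled = toy_new.fillna(value=fill_values)

print(fill_values)
print(toy_new_filled)
```

Keeping the values in a dictionary makes it easy to apply exactly the same imputation to the test file later.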


In [None]:
# Crosschecking null values

df_train.isnull().sum()


In [None]:
#Visualising missing data to ensure all gaps have been covered
sns.heatmap(df_train.isnull(), cbar=False, cmap='viridis')

# Exploratory Data Analysis

## Data Visualization

### Visualizing numerical data

In [None]:
#Using scatterplot to check for outliers
from pandas.plotting import scatter_matrix

num_columns = ['LoanAmount', 'Loan_Amount_Term', 'ApplicantIncome', 'CoapplicantIncome']
scatter_matrix(df_train[num_columns], figsize = (12, 8))

In [None]:
sns.pairplot(df_train)

In [None]:
#Further examination of numerical outliers
plt.figure(figsize = (15,10))
sns.boxplot(data=df_train)


In [None]:
plt.figure(figsize = (15, 10))

Outlier_check = df_train[num_columns]
sns.stripplot(data = Outlier_check, palette='dark:red', jitter = 0.3, size = 5)

plt.title('Outlier Check')


plt.show()

Outliers are observed in the ApplicantIncome and CoapplicantIncome columns.

We will come back to this in the outlier-handling section.

### Visualizing Categorical data

In [None]:
# Visualising all categorical columns at once
cat_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Status']


#setting up plotting environment
num_cat = len(cat_columns)

fig, axes = plt.subplots(nrows=num_cat, ncols=1, figsize = (10, 5*num_cat))

#Plotting a bar chart of each categorical column
for i, col in enumerate(cat_columns):
    sns.countplot(data=df_train, x=col, ax=axes[i], hue='Loan_Status', palette='Set2')
    axes[i].set_title(f'plot of {col}')
    axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation = 45)

plt.tight_layout()
plt.show()

In [None]:
# Share of applicants with no dependents whose loan was not approved
dependents_0 = df_train[df_train['Dependents'] == '0']
total_dependents_0 = len(dependents_0)
loan_status_no = dependents_0[dependents_0['Loan_Status'] == 'N'] # Loan_Status is coded 'Y'/'N', not 'Yes'/'No'
percent_loan_status_no = (len(loan_status_no)/total_dependents_0)*100

percent_loan_status_no

In [None]:
# Step 1: Filter rows where Dependents is '0'
dependents_zero = df_train[df_train['Dependents'] == '0']

# Step 2: Calculate the total number of Dependents '0'
total_dependents_zero = len(dependents_zero)
total_dependents_notzero = len(df_train[df_train['Dependents'] != '0'])

# Step 3: Filter rows where Loan_Status is 'N' from the dependents_zero subset
loan_status_no = dependents_zero[dependents_zero['Loan_Status'] == 'N']

# Step 4: Calculate the percentage of Loan_Status 'N' in Dependents '0'
percentage_no_loan = (len(loan_status_no) / total_dependents_zero) * 100

print(f"Total number of Dependents '0': {total_dependents_zero}")
print(f"Total number of Dependents not '0': {total_dependents_notzero}")
print(f"Percentage of Dependents '0' with Loan_Status 'N': {percentage_no_loan:.2f}%")
print (len(df_train['Dependents']))
print (len(loan_status_no))

## Handling Outliers in the Numerical Columns

We use the interquartile range (IQR) to identify outliers.

By convention, values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers, where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 - Q1.
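As a small worked example of this rule, the sketch below wraps the bound calculation in a helper and applies it to a made-up series. The helper name `iqr_bounds` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier bounds for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy incomes with one obviously extreme value (hypothetical data)
toy_income = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(toy_income)
is_outlier = (toy_income < lower) | (toy_income > upper)
toy_outliers = toy_income[is_outlier]

print(f"bounds: ({lower}, {upper})")
print(toy_outliers)
```

Here Q1 = 2 and Q3 = 4, so IQR = 2 and the bounds are -1 and 7; only the value 100 falls outside them.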

In [None]:
#Calculating IQR for ApplicantIncome

Q1_app = df_train['ApplicantIncome'].quantile(0.25)
Q3_app = df_train['ApplicantIncome'].quantile(0.75)
IQR = Q3_app - Q1_app

lowerbound_app = Q1_app - 1.5*IQR
upperbound_app = Q3_app + 1.5*IQR

outliers_app = df_train[(df_train['ApplicantIncome'] < lowerbound_app) | (df_train['ApplicantIncome'] > upperbound_app)]

#Calculating IQR for CoapplicantIncome

Q1_co = df_train['CoapplicantIncome'].quantile(0.25)
Q3_co = df_train['CoapplicantIncome'].quantile(0.75)
IQR_co = Q3_co - Q1_co

lowerbound_co = Q1_co - 1.5*IQR_co
upperbound_co = Q3_co + 1.5*IQR_co

outliers_co = df_train[(df_train['CoapplicantIncome'] < lowerbound_co) | (df_train['CoapplicantIncome'] > upperbound_co)]

# Dropping the Outliers
df_train = df_train[~((df_train['ApplicantIncome'] < lowerbound_app) | (df_train['ApplicantIncome'] > upperbound_app))]
df_train = df_train[~((df_train['CoapplicantIncome'] < lowerbound_co) | (df_train['CoapplicantIncome'] > upperbound_co))]

In [None]:
plt.figure(figsize = (15, 9))

outlier_check2 = df_train[num_columns]
sns.stripplot(data=outlier_check2, palette='dark:red', jitter = 0.3, size = 5)
plt.show()

print ('Shape of treated dataset: ', df_train.shape)

It can be observed from the plot above that the outliers in the ApplicantIncome and CoapplicantIncome columns have been removed.

### Observing Correlations

In [None]:
#Selecting numeric columns from dataset
numeric_df = df_train.select_dtypes(include=[np.number]) 

plt.figure(figsize=(12, 9))

sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.show()

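* Applicant income is reported to be positively correlated with loan status. Note that `Loan_Status` is still a text column (`Y`/`N`) at this point, so `select_dtypes` excludes it from this heatmap.
* Co-applicant income appears to be slightly correlated.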

# Feature Engineering

## One Hot Encoding Categorical variables

Categorical values will be converted into numerical ones to ease model building.

The pandas `get_dummies()` function will be used for this.



In [None]:
df_train = pd.get_dummies(df_train, columns = cat_columns, drop_first=True)
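To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame (the values are hypothetical). With three categories, only two indicator columns are kept; the dropped first category is implied when both indicators are zero.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (hypothetical values)
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
})

# 'Rural' is first alphabetically, so its indicator column is dropped
encoded = pd.get_dummies(toy, columns=["Property_Area"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The same mechanism is why the encoded columns in this notebook carry suffixes such as `_Male`, `_Yes`, and `_Y`.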

## Feature Scaling

Standardizing numerical features to bring them to a common scale


In [None]:
from sklearn.preprocessing import StandardScaler

standard_scaler= StandardScaler()

df_train[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']] = standard_scaler.fit_transform(df_train[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']])

## Creating new features

Combining features to create new relationships.

New relationships:

* Total Income = Applicant Income + Co-applicant income
* Loan-to-income ratio = Loan amount / Total income

In [None]:
#Total Income

df_train['Total_Income'] = df_train['ApplicantIncome'] + df_train['CoapplicantIncome']

#Loan-to-income Ratio

df_train['Loan_to_Income'] = df_train['LoanAmount']/df_train['Total_Income'] 

# Viewing the updated features
df_train.head()
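In the cells above, the income and loan columns are standardized before these features are derived, so `Total_Income` is a sum of z-scores and can be close to zero or negative, which makes the ratio hard to interpret. One possible alternative, sketched below on made-up values, is to derive the features from the raw amounts first and standardize afterwards. The toy frame and the `cols_to_scale` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy raw (unscaled) values, hypothetical
toy = pd.DataFrame({
    "ApplicantIncome": [4000.0, 6000.0, 2500.0],
    "CoapplicantIncome": [1000.0, 0.0, 2500.0],
    "LoanAmount": [100.0, 150.0, 125.0],
})

# Step 1: derive the new features from the raw amounts
toy["Total_Income"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]
toy["Loan_to_Income"] = toy["LoanAmount"] / toy["Total_Income"]

# Step 2: standardize the numerical columns, including the derived ones
cols_to_scale = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
                 "Total_Income", "Loan_to_Income"]
scaled = toy.copy()
scaled[cols_to_scale] = StandardScaler().fit_transform(toy[cols_to_scale])

print(toy[["Total_Income", "Loan_to_Income"]])
print(scaled.round(3))
```

With raw amounts, total income is always positive for these rows, so the ratio keeps its intended meaning.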

In [None]:
df_train['Loan_Status_Y'].value_counts(normalize=True)*100

The Loan status classes appear to be reasonably balanced.

No resampling is therefore applied to correct for class imbalance.

In [None]:
# dropping unneeded columns

df_train = df_train.drop(['Loan_ID', 'Loan_Amount_Term'], axis=1)

column_update = {'Gender_Male': 'Gender', 'Married_Yes': 'Married',
                'Self_Employed_Yes': 'Self_Employed', 'Loan_Status_Y': 'Loan_Status' }

df_train.rename(columns=column_update, inplace=True)

#Display updated dataset
df_train.head()

# Model Building

## Data Preparation

1. Split the dataset into Features and Targets.

2. Train-Test Split: Split data into training and test sets (80% train, 20% test).

In [None]:
# Splitting into features vs targets

X = df_train.drop(columns = ['Loan_Status']) #Features

Y = df_train['Loan_Status'] #Target variables

print ('Shape of X: ', X.shape)
print ('Shape of Y: ', Y.shape)


In [None]:
# Splitting into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 42)
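A common variation on this split, shown here only as a sketch on synthetic data, is to stratify on the target so both sets keep the same class proportions, and to fit the scaler on the training portion only so that no information from the test set leaks into preprocessing. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a 70/30 binary target (hypothetical data)
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([1] * 70 + [0] * 30)

# stratify keeps the class ratio roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
print("positive share in train/test:", y_tr.mean(), y_te.mean())
```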

## Model Selection: 


In [None]:
#Importing evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.metrics import mean_squared_error

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = LogisticRegression()

log_reg.fit(X_train, Y_train)

log_reg_predict = log_reg.predict(X_train)

#Accuracy_score

log_reg_accuracy = accuracy_score(Y_train, log_reg_predict)
print ('Logistic Regression Accuracy: ', log_reg_accuracy)

log_reg_prob = log_reg.predict_proba(X_train)
log_reg_log_loss = log_loss(Y_train, log_reg_prob)
print("Logistic Regression Log Loss:", log_reg_log_loss)
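For reference, binary log loss is the negative mean of `y*log(p) + (1-y)*log(1-p)`, where `p` is the predicted probability of the positive class; lower is better. The sketch below checks a manual calculation against `sklearn.metrics.log_loss` on made-up labels and probabilities.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities of the positive class (hypothetical)
y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.3])

# Manual binary log loss
manual_log_loss = -np.mean(
    y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)
)

# scikit-learn accepts a 1-D array of positive-class probabilities
sklearn_log_loss = log_loss(y_true, p_pos)

print(manual_log_loss, sklearn_log_loss)
```

The confident but wrong prediction (0.3 for a positive example) contributes the largest penalty, which is why log loss complements accuracy.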

In [None]:
log_predict_test = log_reg.predict(X_test)

#Accuracy Score on the held-out test set

log_test_accuracy = accuracy_score(Y_test, log_predict_test)
print ('Logistic Regression Test Accuracy: ', log_test_accuracy)

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

#Initialize and Train

tree_clf = DecisionTreeClassifier()

tree_clf.fit(X_train, Y_train)
tree_clf_predict = tree_clf.predict(X_train)

#Accuracy Score

tree_clf_accuracy = accuracy_score(Y_train, tree_clf_predict)
print ('Decision Tree Accuracy: ', tree_clf_accuracy)

tree_clf_prob = tree_clf.predict_proba(X_train)
tree_clf_log_loss = log_loss(Y_train, tree_clf_prob)
print ('Decision Tree Log Loss: ', tree_clf_log_loss)

The Decision Tree appears to be overfitting the training data.
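One way to check for overfitting is to compare training and test accuracy, and to see how limiting the tree depth changes the gap. The sketch below does this on synthetic data generated with `make_classification`; the data, the depth value of 3, and the variable names are assumptions for illustration, not results from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the loan dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

An unconstrained tree typically fits the training set perfectly, so a large gap between its training and test scores is the signal to look for.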

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()

forest_clf.fit(X_train, Y_train)
forest_clf_predict = forest_clf.predict(X_train)

#Accuracy

forest_clf_accuracy = accuracy_score(Y_train, forest_clf_predict)
print ('Random Forest Classifier accuracy: ', forest_clf_accuracy)

forest_clf_prob = forest_clf.predict_proba(X_train)
forest_clf_log_loss = log_loss(Y_train, forest_clf_prob)
print ('Random Forest Classifier Log Loss: ', forest_clf_log_loss)

### Support Vector Machine

In [None]:
from sklearn.svm import SVC

svm_clf = SVC(probability = True)

svm_clf.fit(X_train, Y_train)
svm_clf_predict = svm_clf.predict(X_train)

#Accuracy

svm_clf_accuracy = accuracy_score(Y_train, svm_clf_predict)
print ('Support Vector Machine Accuracy: ', svm_clf_accuracy)

#Logloss

svm_clf_prob = svm_clf.predict_proba(X_train)
svm_clf_log_loss = log_loss(Y_train, svm_clf_prob)
print ('Support Vector Machine Log Loss: ', svm_clf_log_loss)

## Comparing Accuracy of Different models

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Dictionary to store evaluation metrics
metrics = {}

models = {
    'Logistic Regression': log_reg,
    'Decision Tree': tree_clf,
    'Forest Classifier': forest_clf,
    'Support Vector Machine': svm_clf
}

for model_name, model in models.items():
    #Predict class labels
    test_prediction = model.predict(X_test)

    # If `predict_proba` is available, get probabilities for log loss
    try:
        test_probabilities = model.predict_proba(X_test)
        model_log_loss = log_loss(Y_test, test_probabilities)
    except AttributeError:
        model_log_loss = 'N/A'

    #Calculating metrics

    model_accuracy = accuracy_score(Y_test, test_prediction)
    model_precision = precision_score(Y_test, test_prediction, average='binary')
    model_recall = recall_score(Y_test, test_prediction, average='binary')
    model_f1 = f1_score(Y_test, test_prediction, average='binary')

    # Store metrics
    metrics[model_name] = {
        'Accuracy': model_accuracy,
        'Precision': model_precision,
        'Recall': model_recall,
        'F1_Score': model_f1,
        'Log_Loss': model_log_loss
    }

#Display results: one line per model with its dictionary of test-set metrics

for model_name, model_metrics in metrics.items():
    print (f'{model_name}: {model_metrics}')

### Summary
Logistic Regression and SVM are the strongest performers with high accuracy, precision, and recall, and both provide reliable probability estimates.
Random Forest is also solid, but slightly trails Logistic Regression and SVM.
Decision Tree has noticeably lower performance across all metrics, suggesting it may not generalize as well to this dataset.
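Because a single 80/20 split can be noisy on a small dataset, a comparison like this is often backed up with cross-validation. The sketch below shows the idea with `cross_val_score` on synthetic data; the data and the two candidate models are stand-ins, and the scores it prints say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only to demonstrate the procedure
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each candidate
cv_scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean together with the standard deviation across folds helps judge whether a difference between two models is larger than the fold-to-fold variation.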

### Gradient Boosting

In [None]:
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(eval_metric='logloss') # use_label_encoder is deprecated in recent xgboost releases

#fit and predict
xgb_clf.fit(X_train, Y_train)
xgb_predict = xgb_clf.predict(X_test)

#Calculating metrics
xgb_accuracy = accuracy_score(Y_test, xgb_predict)
xgb_precision = precision_score(Y_test, xgb_predict)
xgb_recall = recall_score(Y_test, xgb_predict)
xgb_f1 = f1_score(Y_test, xgb_predict)

#log_loss
xgb_probability = xgb_clf.predict_proba(X_test)
xgb_log_loss = log_loss(Y_test, xgb_probability)

print("XGBoost Results")
print("Accuracy:", xgb_accuracy)
print("Precision:", xgb_precision)
print("Recall:", xgb_recall)
print("F1 Score:", xgb_f1)
print("Log Loss:", xgb_log_loss)

# Model Interpretability

Permutation importance is used to view how much each feature contributes to the predictive power of the **Logistic Regression model** (our model of choice).

In [None]:
from sklearn.inspection import permutation_importance

# Logistic Regression example
result = permutation_importance(log_reg, X_test, Y_test, n_repeats=10, random_state=42)

feature_importance = pd.DataFrame({
    'Feature': X_test.columns,
    'Importance': result.importances_mean
}).sort_values(by='Importance', ascending=False)

print(feature_importance)


In [None]:
plt.figure(figsize=(8, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='skyblue')
plt.xlabel('Permutation Importance')
plt.ylabel('Features')
plt.title('Feature Importance of Each Feature')
plt.show()

This reveals that Credit History is by far the most influential factor in determining loan eligibility.
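Permutation importance measures how much a model's score drops when one feature's values are shuffled, so a feature the model relies on should stand out clearly. The sketch below illustrates this on synthetic data in which only the column named `signal` determines the label; the data and column names are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on 'signal' only (hypothetical setup)
rng = np.random.default_rng(0)
X_syn = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"]
)
y_syn = (X_syn["signal"] > 0).astype(int)

model = LogisticRegression().fit(X_syn, y_syn)
result = permutation_importance(
    model, X_syn, y_syn, n_repeats=10, random_state=0
)

# Mean and spread of the score drop across the repeated shuffles
summary = pd.DataFrame({
    "Feature": X_syn.columns,
    "Mean": result.importances_mean,
    "Std": result.importances_std,
}).sort_values(by="Mean", ascending=False)

print(summary)
```

Looking at `importances_std` alongside the mean helps separate features with a real effect from those whose importance is within shuffle-to-shuffle noise.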

# Saving the Model

In [None]:
import pickle

with open('Eligibility Prediction Model.pkl', 'wb') as file:
    pickle.dump(log_reg, file)
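A saved model can be loaded back with `pickle.load` and should give the same predictions as the original. The sketch below shows a round trip with a small model trained on synthetic data and written to a temporary directory; the file name `toy_model.pkl` and the data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic problem standing in for the real training data
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 2))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Save the fitted model
    with open(model_path, "wb") as file:
        pickle.dump(model, file)

    # Load it back
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print("Predictions match after reload:", same_predictions)
```

When the loaded model is used on new applicants, the input must have the same columns, in the same order and with the same preprocessing, as the data the model was trained on. Pickle files should only be loaded from trusted sources.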