## Table of Contents: Random Forest – Loan Approval Prediction

Last Edited: October 1st, 2024

1. Uploading Dataset
2. Descriptive Analysis (`head()`, numerical summary / quantiles)
3. Data Cleaning (`dropna()`, IQR outlier filtering)
4. Encoding Categorical Features (`LabelEncoder`, `pd.get_dummies`)
5. Feature Selection and Target Definition
6. Train–Test Split (30% test)
7. Random Forest Model Training (hyperparameters: `n_estimators=150`, `random_state=42`)
8. Cross-Validation
9. Model Prediction on Test Set
10. Model Evaluation: Accuracy Score

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

from google.colab import drive
drive.mount("/content/drive")

path = "/content/drive/MyDrive/Kellton Tech/Model Code/dataset/LoanDataset - LoansDatasest.csv"
df = pd.read_csv(path)


In [None]:
df.isnull().sum()
df_data = df.dropna()
df_data.head()

Unnamed: 0,customer_id,customer_age,customer_income,home_ownership,employment_duration,loan_intent,loan_grade,loan_amnt,loan_int_rate,term_years,historical_default,cred_hist_length,Current_loan_status
0,1.0,22,59000,RENT,123.0,PERSONAL,C,"£35,000.00",16.02,10,Y,3,DEFAULT
2,3.0,25,9600,MORTGAGE,1.0,MEDICAL,B,"£5,500.00",12.87,5,N,3,DEFAULT
3,4.0,23,65500,RENT,4.0,MEDICAL,B,"£35,000.00",15.23,10,N,2,DEFAULT
4,5.0,24,54400,RENT,8.0,MEDICAL,B,"£35,000.00",14.27,10,Y,4,DEFAULT
5,6.0,21,9900,OWN,2.0,VENTURE,A,"£2,500.00",7.14,1,N,2,DEFAULT


In [None]:
# Flag customer_age outliers with the 1.5 * IQR rule
Q1 = df_data['customer_age'].quantile(0.25)
Q3 = df_data['customer_age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Boolean mask: True for rows whose age lies within the bounds
df_data1 = (df_data['customer_age'] >= lower_bound) & (df_data['customer_age'] <= upper_bound)

In [None]:
df_data2 = df_data[df_data1]

In [None]:
Q1 = df_data2['loan_int_rate'].quantile(0.25)
Q3 = df_data2['loan_int_rate'].quantile(0.75)
IQR = Q3 - Q1

lower_bound2 = Q1 - 1.5 * IQR
upper_bound2 = Q3 + 1.5 * IQR

In [None]:
# Use the interest-rate bounds (lower_bound2 / upper_bound2), not the age bounds
df_data3 = (df_data2['loan_int_rate'] >= lower_bound2) & (df_data2['loan_int_rate'] <= upper_bound2)
df_data4 = df_data2[df_data3]
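
The two outlier cells above repeat the same quartile arithmetic, which makes it easy to mix up one column's bounds with another's. One way to avoid that is a small helper that computes the bounds and applies them in one place. The following is a minimal sketch, not the notebook's own code: the function name `iqr_filter`, its parameter `k`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def iqr_filter(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows of `frame` whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends, matching the >= / <= comparisons
    return frame[frame[column].between(lower, upper)]

# Toy data (made up for illustration): one extreme age and one extreme rate
toy = pd.DataFrame({
    "customer_age": [22, 25, 23, 24, 21, 26, 144],
    "loan_int_rate": [16.02, 12.87, 15.23, 14.27, 7.14, 13.50, 11.00],
})

# Apply the filter column by column, as the notebook does
filtered = toy
for col in ["customer_age", "loan_int_rate"]:
    filtered = iqr_filter(filtered, col)

print(filtered)
```

Because each call derives its own bounds from the column it filters, no bound variable is shared between columns.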

In [None]:
target = 'Current_loan_status'
# customer_id is a row identifier, not a predictive feature, so exclude it as well
X = df_data4.drop(columns=[target, 'customer_id'])
y = df_data4[target]
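
In the `head()` output, `loan_amnt` is stored as a currency string such as `"£35,000.00"`. If it stays a string, `pd.get_dummies` will treat every distinct amount as its own category instead of a number. A possible fix is to convert the column to a float before encoding. The sketch below runs on a toy frame that mirrors the formatting shown above; it assumes every value follows the `£` plus thousands-separator pattern, which may not hold for the full dataset.

```python
import pandas as pd

# Toy frame mirroring the loan_amnt formatting shown in the head() output
toy = pd.DataFrame({
    "loan_amnt": ["£35,000.00", "£5,500.00", "£2,500.00"],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# Strip the currency symbol and thousands separators, then cast to float
toy["loan_amnt"] = (
    toy["loan_amnt"]
    .str.replace("£", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# loan_amnt now stays a single numeric column; only home_ownership is one-hot encoded
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.dtypes)
```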

In [None]:
X = pd.get_dummies(X, drop_first=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
model = RandomForestClassifier(n_estimators=150, random_state=42)
scores = cross_val_score(model, X, y, cv=5)

In [None]:
print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())

Cross-Validation Scores: [0.59263521 0.82451093 0.69545193 0.67875648 0.63672999]
Mean Score: 0.6856169089067363
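
The fold scores above range from roughly 0.59 to 0.82. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, so any ordering in the rows (for example, rows grouped by loan status) can make the folds differ systematically. One option is to pass an explicitly shuffled splitter. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features; the variable names and the smaller `n_estimators=50` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the loan features (not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Shuffle rows before splitting while keeping class proportions per fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestClassifier(n_estimators=50, random_state=42)
scores_demo = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold scores:", scores_demo)
print("Mean:", scores_demo.mean(), "Std:", scores_demo.std())
```

Reporting the standard deviation alongside the mean makes the fold-to-fold spread visible at a glance.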


In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Accuracy score:", accuracy_score(y_test, y_pred))

Accuracy score: 0.8772535481396241
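
Accuracy is the fraction of test rows whose predicted label equals the true label, so it is easier to interpret next to the share of the most common class (what a model would score by always predicting that class). The sketch below computes both by hand. The label arrays are made up for illustration, and the `"NO DEFAULT"` label is an assumption, since only `DEFAULT` appears in the output shown earlier.

```python
import numpy as np

# Hypothetical labels, made up for illustration
y_true_demo = np.array(["DEFAULT", "NO DEFAULT", "NO DEFAULT", "NO DEFAULT", "DEFAULT", "NO DEFAULT"])
y_pred_demo = np.array(["DEFAULT", "NO DEFAULT", "NO DEFAULT", "DEFAULT", "NO DEFAULT", "NO DEFAULT"])

# Accuracy: share of positions where prediction and truth agree
accuracy = (y_true_demo == y_pred_demo).mean()

# Majority-class baseline: accuracy of always predicting the most frequent label
_, counts = np.unique(y_true_demo, return_counts=True)
baseline = counts.max() / counts.sum()

print("Accuracy:", accuracy)
print("Majority-class baseline:", baseline)
```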

