# Loan Approval Risk Prediction Using Machine Learning

## 1. Business Problem

Financial institutions must determine whether loan applicants should be approved or declined while minimising default risk.

This project builds predictive classification models to identify high-risk applicants using historical loan data.


## 2. Dataset Overview

The dataset contains approximately 58,600 loan applications with financial and demographic attributes such as income, employment length, loan amount, interest rate, and credit history.

The target variable is `loan_approval_status`, indicating whether a loan was approved or declined.

Initial analysis shows the dataset is imbalanced, with approved loans significantly outnumbering declined loans.


In [12]:
# Standard core libraries
import pandas as pd
import numpy as np

# Model Selection and preprocessing
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Visualisation
from matplotlib import pyplot as plt

# Models
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import RocCurveDisplay

In [13]:
from pathlib import Path
DATA_PATH = Path("../data/raw/loan_approval_data.csv")
df = pd.read_csv(DATA_PATH)

df.describe(include="all").transpose()

In [None]:
#Check number of features and instances
df.shape

In [None]:
# Check dataset structure
df.head(10)

In [None]:
#Dropping the features we don't want
df.drop(['id', 'max_allowed_loan', 'Credit_Application_Acceptance'], axis=1, inplace=True)

In [14]:
#Display the new dataframe
df.head()

In [None]:
#Display basic stats since we dropped features
df.describe(include='all').transpose()

In [None]:
df.info()

In [None]:
df.dtypes

## 3. Exploratory Data Analysis

Before training models, exploratory analysis was performed to understand the structure and distribution of key variables.

In particular, the distribution of the target variable (`loan_approval_status`) was examined to assess class imbalance, which has implications for model evaluation.
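
As a rough illustration of how class imbalance can be quantified, the sketch below computes class proportions and a majority-to-minority ratio. The `status` series is hypothetical toy data standing in for `df['loan_approval_status']`, so the numbers are not results from the loan dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['loan_approval_status'].
status = pd.Series(["Approved"] * 8 + ["Declined"] * 2, name="loan_approval_status")

# Relative class frequencies; normalize=True returns proportions instead of counts.
class_share = status.value_counts(normalize=True)

# Imbalance ratio: majority class count divided by minority class count.
counts = status.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 is a signal that accuracy on its own will flatter a model that mostly predicts the majority class.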


In [None]:
df['loan_approval_status'].value_counts().plot(kind='bar')
plt.title('Distribution of Target Variable (Loan Approval Status)')
plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.show()

In [15]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe(include='all').transpose()

In [None]:
df.describe(include='all')

In [16]:
df['age'].value_counts()

In [None]:
df['Sex'].value_counts()

In [None]:
df['Education_Qualifications'].value_counts()

In [17]:
df['income'].value_counts()
#max(df['income'])
#min(df['income'])

In [None]:
df['home_ownership'].value_counts()

In [18]:
df['emplyment_length'].value_counts()

In [None]:
df['loan_intent'].value_counts()

In [None]:
df['loan_amount'].value_counts()

In [19]:
df['loan_interest_rate'].value_counts()

In [20]:
df['loan_income_ratio'].value_counts()

In [21]:
df['payment_default_on_file'].value_counts()

In [None]:
df['credit_history_length'].value_counts()

In [None]:
df['loan_approval_status'].value_counts()

In [None]:
df.shape

In [None]:
df.dtypes
#df.describe()

In [None]:
# Variable age before cleaning
#df['age'].describe()
#df['age'].min(), df['age'].max()
df['age'].value_counts()
df.describe(include='all')

In [None]:
# Converting valid values to float and invalid values to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Setting a range to find all unrealistic numbers (such as -30 and 156) and changing them to NaN
df.loc[(df['age']<18) | (df['age']>95),'age'] = None

# Drop NaN values
df = df.dropna(subset=['age'])

# Variable age after cleaning.
df['age'].describe(include='all').round(2)

#df['age'].isna().sum()

In [22]:
df['age'].dtypes

In [None]:
# Variable Sex before cleaning
df['Sex'].describe()

In [None]:
# Variable Sex after cleaning (drop column)
df = df.drop(columns=['Sex'])
df.columns

In [None]:
# Variable Education_Qualifications before cleaning
df['Education_Qualifications'].describe()

In [None]:
# Variable Education_Qualifications after cleaning (drop column)
df = df.drop(columns=['Education_Qualifications'])
df.columns

In [None]:
# Variable home_ownership before cleaning
df.filter(like='home_ownership')

In [None]:
# Variable home_ownership after cleaning (one-hot encoding)
df = pd.get_dummies(df, columns=['home_ownership'])

df.filter(like='home_ownership')


In [None]:
# Variable emplyment_length before cleaning
df['emplyment_length'].value_counts().sort_index().tail(10)
df['emplyment_length'].describe().round(2)

In [None]:
# Variable emplyment_length after cleaning (setting all values less than 0 and above 75 to NaN, and then removing them)

df.loc[(df['emplyment_length']<0) | (df['emplyment_length']>75), 'emplyment_length'] = None

df = df.dropna(subset=['emplyment_length'])

df['emplyment_length'].describe().round(2)

#df['emplyment_length'].isna().sum()

In [None]:
# Variable loan_intent before cleaning
df.filter(like='loan_intent')

In [None]:
# Variable loan_intent after cleaning (one-hot encoding)
df = pd.get_dummies(df, columns=['loan_intent'])

df.filter(like='loan_intent')

In [None]:
# Variable loan_interest_rate before cleaning
df['loan_interest_rate'].describe().round(2)
df['loan_interest_rate'].isna().sum()

In [None]:
# Variable loan_interest_rate after cleaning (setting all values less than 0 and above 50 to NaN, and then removing them)
df.loc[(df['loan_interest_rate']<0) | (df['loan_interest_rate']>50), 'loan_interest_rate'] = None

df = df.dropna(subset=['loan_interest_rate'])

df['loan_interest_rate'].describe().round(2)
df['loan_interest_rate'].isna().sum()

In [None]:
# Variable payment_default_on_file before cleaning
df['payment_default_on_file'].value_counts(dropna=False)
df['payment_default_on_file'].describe()
df.filter(like='payment_default_on_file')

In [None]:
# Variable payment_default_on_file after cleaning
#(standardising other values to either Y, or N)
df['payment_default_on_file']=df['payment_default_on_file'].replace({'YES': 'Y','NO': 'N'})

# Remove NaN values
df = df.dropna(subset=['payment_default_on_file'])

# Encode categorical values so they are binary.
df['payment_default_on_file']=df['payment_default_on_file'].map({'Y':1, 'N':0})

df.filter(like='payment_default_on_file')

df['payment_default_on_file'].value_counts(dropna=False)

#df['payment_default_on_file'].isna().sum()

In [None]:
# Variable loan_approval_status before cleaning
df['loan_approval_status'].value_counts(dropna=False)

In [23]:
# Variable loan_approval_status after cleaning

# Make all values lowercase and strip leading and trailing whitespace.
df['loan_approval_status']=df['loan_approval_status'].str.lower().str.strip()

# Standardising other values to either Approved, or Declined
df['loan_approval_status']=df['loan_approval_status'].replace({'accept': 'Approved','approved': 'Approved','reject': 'Declined','declined': 'Declined'})

# Remove rows with a missing value: this is the target variable, and imputing the mode could introduce bias.
df=df.dropna(subset=['loan_approval_status'])

# Encode categorical values so they are binary.
df['loan_approval_status']=df['loan_approval_status'].map({'Approved': 0,'Declined': 1})

df['loan_approval_status'].value_counts(dropna=False)

In [None]:
# All numerical values before scaling.
df.head()
df.describe().round(2)

In [None]:
# All numerical values after scaling.
scaler = StandardScaler()

numerical_features = ['age', 'income', 'emplyment_length', 'loan_amount', 'loan_interest_rate', 'loan_income_ratio', 'credit_history_length']

df[numerical_features]= scaler.fit_transform(df[numerical_features])

df.head()
df.describe().round(2)

In [None]:
df.shape
df.describe().round(2)

In [None]:
#Saving the classification dataset under the name loan_approval_status_data_cleaned
OUT_PATH = Path("../data/processed/loan_approval_status_data_cleaned.csv")
df.to_csv(OUT_PATH, index=False)

In [None]:
# Reload the cleaned classification dataset from disk
df_loan_status = pd.read_csv(OUT_PATH)
df_loan_status.head()

## 4. Model Development

The cleaned dataset was split into training and testing sets using an 80:20 ratio. Stratified sampling was applied to preserve class distribution across both sets.

Three classification models were trained and compared:

- Naïve Bayes  
- Logistic Regression  
- Random Forest  

Model performance was evaluated using precision and recall, prioritising accurate identification of high-risk (declined) loan applications.
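
The stratified 80:20 split described above can be sketched on synthetic data as follows. The sketch also wraps the scaler in a pipeline so that it is fit on the training portion only. This is a variation on the notebook's approach, not the method it uses (the notebook scales the full dataset before splitting). The data from `make_classification` and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (roughly 80% class 0, 20% class 1).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.8, 0.2], random_state=30
)

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=30))
pipe.fit(X_tr, y_tr)

print("Positive share (train):", y_tr.mean())
print("Positive share (test): ", y_te.mean())
print("Test accuracy:", pipe.score(X_te, y_te))
```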


In [None]:
# The inputs (X) are all features except the target variable
X = df_loan_status.drop(['loan_approval_status'], axis=1)

# The target variable is assigned to y
y = df_loan_status['loan_approval_status']

# Split the dataset in 80% Training and 20% Test with class stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30, stratify=y)

# Instantiate the models using default parameters
NB_clf = GaussianNB()
LR_clf = LogisticRegression(random_state=30)
RF_clf = RandomForestClassifier(random_state=30)

# Train the models
NB_clf.fit(X_train, y_train)
LR_clf.fit(X_train, y_train)
RF_clf.fit(X_train, y_train)

# Run the model and store the predicted values
NB_pred = NB_clf.predict(X_test)
LR_pred = LR_clf.predict(X_test)
RF_pred = RF_clf.predict(X_test)

# View the predicted values
NB_pred
LR_pred
RF_pred

In [None]:
# Features used for the model
for features in X.columns:
    print(features)

# Shape of datasets
print('\nShape of datasets:')
print('Whole Data shape',df_loan_status.shape)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

## 5. Model Evaluation

Model performance was evaluated using business-relevant classification metrics.

Due to class imbalance in the dataset, accuracy alone is not a reliable performance indicator. Instead, emphasis was placed on:

- **Recall (Declined class)** – ensuring high-risk applicants are correctly identified.
- **Precision (Declined class)** – ensuring that applicants predicted as declined are genuinely high-risk.
- Confusion matrices – to visualise true positives, false positives, false negatives, and true negatives.

The Random Forest model demonstrated the strongest balance between precision and recall for identifying high-risk applicants.
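
To make the metric definitions concrete, the following sketch derives precision and recall for the declined class from a confusion matrix. The label vectors are hypothetical toy values, using the same encoding as the notebook (0 = Approved, 1 = Declined). They are not model outputs.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = Approved, 1 = Declined.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

# Precision: of all predicted declines, how many were truly declined.
precision_declined = tp / (tp + fp)
# Recall: of all truly declined applications, how many were caught.
recall_declined = tp / (tp + fn)

print("Precision (declined):", precision_declined)
print("Recall (declined):", recall_declined)

# The same values from scikit-learn's helpers, with class 1 as the positive class.
print(precision_score(y_true, y_hat, pos_label=1), recall_score(y_true, y_hat, pos_label=1))
```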


In [None]:
# To plot the confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# Construct the confusion matrix cm
NB_cm = confusion_matrix(y_test, NB_pred, labels=NB_clf.classes_)
LR_cm = confusion_matrix(y_test, LR_pred, labels=LR_clf.classes_)
RF_cm = confusion_matrix(y_test, RF_pred, labels=RF_clf.classes_)

# Create a display to plot the confusion matrix
NB_disp = ConfusionMatrixDisplay(NB_cm,display_labels=NB_clf.classes_)
LR_disp = ConfusionMatrixDisplay(LR_cm,display_labels=LR_clf.classes_)
RF_disp = ConfusionMatrixDisplay(RF_cm,display_labels=RF_clf.classes_)

# Plot the confusion matrix
NB_disp.plot()
LR_disp.plot()
RF_disp.plot()

In [None]:
from sklearn.metrics import classification_report

# Produce the Naive Bayes classification report for test
print("Naive Bayes report \n", classification_report(y_test, NB_pred))
# Plot the ROC curve for Naive Bayes clf
NB_clf_Roc = RocCurveDisplay.from_estimator(NB_clf, X_test, y_test)

# Produce the Logistic Regression classification report for test
print("Logistic Regression report \n", classification_report(y_test, LR_pred))
# Plot the ROC curve for Logistic Regression clf
LR_clf_Roc = RocCurveDisplay.from_estimator(LR_clf, X_test, y_test)

# Produce the Random Forest classification report for test
print("Random Forest report \n", classification_report(y_test, RF_pred))
# Plot the ROC curve for Random Forest clf
RF_clf_Roc = RocCurveDisplay.from_estimator(RF_clf, X_test, y_test)

In [None]:
# Produce the Random Forest classification report for train
RF_train_pred = RF_clf.predict(X_train)
print("Random Forest report using train data\n", classification_report(y_train, RF_train_pred))

# Produce the Random Forest classification report for test
print("Random Forest report using test data \n", classification_report(y_test, RF_pred))

## 6. Hyperparameter Tuning

To improve model performance, hyperparameter optimisation was conducted using GridSearchCV with 5-fold cross-validation.

The Random Forest model was selected for tuning due to its strong baseline performance.

Key parameters such as the number of estimators, maximum depth, minimum samples split, and maximum features were evaluated to enhance generalisation and improve recall for high-risk (declined) loan applications.
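
By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for classifiers is accuracy. If recall on the declined class (encoded as 1) is the priority, one option is to pass `scoring="recall"`. The sketch below shows this on synthetic data with a deliberately small grid. The data and the reduced grid are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=30
)

# Reduced grid so the sketch runs quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

# scoring="recall" ranks candidates by recall on the positive class (label 1).
search = GridSearchCV(
    RandomForestClassifier(random_state=30), param_grid, cv=5, scoring="recall"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV recall:", search.best_score_)
```

Optimising recall alone can reduce precision, so a combined metric such as `scoring="f1"` may be a better fit when both matter.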

In [None]:
#create a new random forest classifier
rf = RandomForestClassifier(random_state=30)

#create a dictionary of the hyperparameter values we want to test
params_rf = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20], 'min_samples_split': [2, 5], 'max_features': ['sqrt', 'log2']}

#use gridsearch to test all values
rf_gs = GridSearchCV(rf, params_rf, cv=5, n_jobs = -1)

# fit model to training data
rf_gs.fit(X_train, y_train)

# save best model
rf_best = rf_gs.best_estimator_

# check best values
print(rf_gs.best_params_)

# make prediction on the test data
y_pred_rf = rf_best.predict(X_test)

In [None]:
# Before tuning hyperparameters values
params_before_tuning = {'max_depth': RF_clf.max_depth, 'max_features': RF_clf.max_features, 'min_samples_split': RF_clf.min_samples_split, 'n_estimators': RF_clf.n_estimators}
print(params_before_tuning)

In [None]:
print("confusion_matrix for RF after tuning")
best_rf_cm=confusion_matrix(y_test,y_pred_rf)
disp=ConfusionMatrixDisplay(confusion_matrix=best_rf_cm,display_labels = rf_best.classes_)
disp.plot()

In [None]:
print("Classification report for RF after tuning")
print(classification_report(y_test,y_pred_rf))

## 7. Final Model Performance

After hyperparameter tuning, the Random Forest model achieved strong performance on the test dataset while maintaining balanced precision and recall for the declined class.

Although training performance was near-perfect, the slight gap between training and test results indicates mild overfitting, which is common for tree-based ensemble models.

Overall, the tuned Random Forest model generalises well to the held-out test set and provides a sound basis for identifying high-risk loan applicants in real-world deployment scenarios.


## 8. Ethical Considerations

Automated loan approval systems must be deployed responsibly.

Potential risks include:

- Historical bias embedded in training data
- Disproportionate rejection of specific demographic groups
- Over-reliance on automated decision systems without human oversight

Machine learning models should support decision-making rather than replace human judgement in high-stakes financial contexts.


# Part B – Loan Amount Prediction (Regression)

## 1. Business Objective (Regression)

In addition to predicting loan approval status, this section aims to predict the maximum approved loan amount using regression techniques.

Accurate loan amount prediction can assist financial institutions in:
- Risk-adjusted lending decisions
- Capital allocation planning
- Personalised loan offers


## 2. Dataset Preparation

The original dataset was reloaded to construct a regression modelling dataset.

The target variable for this section is the maximum loan amount approved.


In [None]:
pd.options.display.float_format = '{:.2f}'.format

# Create the regression dataset from the original CSV file, which still contains
# the Credit_Application_Acceptance and max_allowed_loan columns
df2 = pd.read_csv(DATA_PATH)

# Check values in Credit Application Acceptance
df2['Credit_Application_Acceptance'].value_counts(dropna=False)

## 3. Data Preprocessing (Regression)

Data cleaning and transformation steps were applied to prepare the dataset for regression modelling.

Categorical variables were encoded and missing values were handled to ensure compatibility with regression algorithms.


In [24]:
# Remove the instance where there is no value from Credit Application Acceptance
df2=df2.dropna(subset=['Credit_Application_Acceptance'])

# Check NaN value has been dropped successfully
df2['Credit_Application_Acceptance'].value_counts(dropna=False)

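# Remove all applicants who were declined a loan.
# .copy() gives an independent DataFrame so the in-place drops below do not act on a slice of df2.
approved_loan_applicants = df2[df2['Credit_Application_Acceptance'] < 1].copy()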

# Checking basic stats to see if all applicants who were declined a loan have been removed successfully.
approved_loan_applicants.describe().transpose()

# Drop the Credit Application Acceptance and loan_approval_status variable from the data frame
approved_loan_applicants.drop('loan_approval_status',axis=1, inplace=True)
approved_loan_applicants.drop('Credit_Application_Acceptance',axis=1, inplace=True)
#approved_loan_applicants.head()

#Dropping features we don't need
approved_loan_applicants.drop(['id'], axis=1, inplace=True)

# Variable age cleaning
approved_loan_applicants['age'] = pd.to_numeric(approved_loan_applicants['age'], errors='coerce')
approved_loan_applicants.loc[(approved_loan_applicants['age']<18) | (approved_loan_applicants['age']>95),'age'] = None
approved_loan_applicants = approved_loan_applicants.dropna(subset=['age'])

# Variable Sex cleaning (drop column)
approved_loan_applicants = approved_loan_applicants.drop(columns=['Sex'])

# Variable Education_Qualifications cleaning (drop column)
approved_loan_applicants = approved_loan_applicants.drop(columns=['Education_Qualifications'])

# Variable home_ownership cleaning (one-hot encoding)
approved_loan_applicants = pd.get_dummies(approved_loan_applicants, columns=['home_ownership'])

# Variable emplyment_length cleaning (setting all values less than 0 and above 75 to NaN, and then removing them)
approved_loan_applicants.loc[(approved_loan_applicants['emplyment_length']<0) | (approved_loan_applicants['emplyment_length']>75), 'emplyment_length'] = None
approved_loan_applicants = approved_loan_applicants.dropna(subset=['emplyment_length'])

# Variable loan_intent cleaning (one-hot encoding)
approved_loan_applicants = pd.get_dummies(approved_loan_applicants, columns=['loan_intent'])

# Variable loan_interest_rate cleaning (setting all values less than 0 and above 50 to NaN, and then removing them)
approved_loan_applicants.loc[(approved_loan_applicants['loan_interest_rate']<0) | (approved_loan_applicants['loan_interest_rate']>50), 'loan_interest_rate'] = None
approved_loan_applicants = approved_loan_applicants.dropna(subset=['loan_interest_rate'])

# Variable payment_default_on_file cleaning - standardising, removing NaN values and encoding
approved_loan_applicants['payment_default_on_file']=approved_loan_applicants['payment_default_on_file'].replace({'YES': 'Y','NO': 'N'})
approved_loan_applicants = approved_loan_applicants.dropna(subset=['payment_default_on_file'])
approved_loan_applicants['payment_default_on_file']=approved_loan_applicants['payment_default_on_file'].map({'Y':1, 'N':0})

# Target variable max_allowed_loan cleaning
approved_loan_applicants = approved_loan_applicants[approved_loan_applicants['max_allowed_loan'] > 0]
approved_loan_applicants = approved_loan_applicants[approved_loan_applicants['max_allowed_loan'] < 10000000]

In [None]:
approved_loan_applicants.head(50)
approved_loan_applicants.describe().round(2)

In [None]:
# Export the cleaned regression data
save_path = Path("../data/processed/loan_max_amount_data.csv")
save_path.parent.mkdir(parents=True, exist_ok=True)
approved_loan_applicants.to_csv(save_path, index=False)

In [None]:
# Reload the cleaned regression dataset from disk
df_loan_max = pd.read_csv(save_path)
df_loan_max.head()

In [None]:
print ('Dimensions of Regression Dataset', df_loan_max.shape)

# Features used for the model
for feature in df_loan_max.columns:
    print(feature)

In [None]:
import matplotlib.pyplot as plt

# All numerical features
numeric_features = ['age', 'income', 'emplyment_length', 'loan_amount', 'loan_interest_rate', 'loan_income_ratio', 'credit_history_length', 'max_allowed_loan']


for features in numeric_features:
    df_loan_max[features].hist(edgecolor='black')
    plt.title(f'Distribution of {features}')
    plt.xlabel(features)
    plt.ylabel('Frequency')
    plt.grid(False)
    plt.show()


# All encoded categorical features
encoded_features = ['payment_default_on_file', 'home_ownership_MORTGAGE', 'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT', 'loan_intent_DEBTCONSOLIDATION', 'loan_intent_EDUCATION', 'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL', 'loan_intent_PERSONAL', 'loan_intent_VENTURE']

for features in encoded_features:
    df_loan_max[features].value_counts().plot(kind='bar', edgecolor='black')
    plt.title(f'Distribution of {features}')
    plt.ylabel('Count')
    plt.show()

In [None]:
df_loan_max.describe()

## 4. Model Development (Regression)

This section implements regression models to predict the maximum approved loan amount.

Two modelling approaches are evaluated:

- Model 1 – Decision Tree Regressor using numerical features only  
- Model 2 – Decision Tree Regressor using the full feature set (including encoded categorical variables)

The dataset is split into 80% training and 20% testing data.

Model performance is evaluated using regression metrics:
- R² Score
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)

This comparison allows assessment of whether additional categorical information improves predictive performance.
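
The three metrics can be computed by hand to show what each one measures. The sketch below uses four hypothetical loan amounts (not values from the dataset) and checks the manual results against scikit-learn's functions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted maximum loan amounts.
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_hat = np.array([12000.0, 18000.0, 33000.0, 39000.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as the target.
mae = np.mean(np.abs(errors))
# MSE: average squared error, which penalises large errors more heavily.
mse = np.mean(errors ** 2)
# R²: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("R2:", r2)

# Cross-check against scikit-learn.
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_hat))
assert np.isclose(r2, metrics.r2_score(y_true, y_hat))
```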


In [25]:
numeric_feature = ['age', 'income', 'emplyment_length', 'loan_amount', 'loan_interest_rate', 'loan_income_ratio', 'credit_history_length']

# DT Model 1
X1 = df_loan_max[numeric_feature]
# DT Model 2
X2 = df_loan_max.drop(['max_allowed_loan'], axis=1)
# Target variable for both models
y = df_loan_max['max_allowed_loan']

# Model 1 - Split the dataset in 80% Training and 20% Test
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, test_size=0.2, random_state=30)

# Model 2 - Split the dataset in 80% Training and 20% Test
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.2, random_state=30)

# Building Model 1
DT1 = DecisionTreeRegressor(random_state=30)
DT1.fit (X1_train, y1_train)

# To make predictions on the test set
y1_pred_test = DT1.predict(X1_test)

# Building Model 2
DT2 = DecisionTreeRegressor(random_state=30)
DT2.fit (X2_train, y2_train)

# To make predictions on the test set
y2_pred_test = DT2.predict(X2_test)

## 5. Train-Test Split Strategy

The regression dataset was divided into 80% training data and 20% testing data.

The training set is used to fit the decision tree models, while the test set is used to evaluate predictive performance on unseen data.

Evaluating on held-out data makes overfitting visible and provides a more realistic assessment of how the model would perform in real-world loan approval scenarios.
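
A minimal sketch of why the held-out set matters: an unpruned decision tree can fit its training data almost perfectly, and only the test score reveals how much of that fit carries over. The data comes from `make_regression` and is purely synthetic. The names `X_demo`, `y_demo`, and `tree_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise, standing in for the loan features.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=30)

# 80% training, 20% testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

# An unpruned tree grows until every training sample is fit.
tree_demo = DecisionTreeRegressor(random_state=30).fit(X_tr, y_tr)

# score() returns R² for regressors.
train_r2 = tree_demo.score(X_tr, y_tr)
test_r2 = tree_demo.score(X_te, y_te)

print(f"Train R2: {train_r2:.3f}")
print(f"Test R2:  {test_r2:.3f}")
```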


In [None]:
# DT Model 1s dimensions
print("Model 1 - Numeric features only")
print("X1_train shape:", X1_train.shape)
print("X1_test shape:", X1_test.shape)
print("y_train shape:", y1_train.shape)
print("y_test shape:", y1_test.shape)

# DT Model 1s features
print("\nModel 1 features:")
for features in X1.columns:
  print(features)

In [None]:
# DT Model 2s dimensions
print("Model 2 - All retained features")
print("X2_train shape:", X2_train.shape)
print("X2_test shape:", X2_test.shape)
print("y_train shape:", y2_train.shape)
print("y_test shape:", y2_test.shape)

# DT Model 2s features
print("\nModel 2 features:")
for features in X2.columns:
  print(features)

## 6. Feature Importance

To understand which variables most influence maximum approved loan amount, feature importance from the decision tree model was examined.

Income, loan amount requested, and credit history length appear to be the strongest predictors.

This aligns with financial intuition, as lenders prioritise repayment capacity and borrowing history when determining maximum lending thresholds.
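
One way to inspect feature importance for a fitted tree is its `feature_importances_` attribute. The sketch below is illustrative only. It builds a synthetic target in which `income` dominates by construction, and `noise_feature` is a hypothetical column, so the ranking it prints says nothing about the real loan data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(30)

# Synthetic features; only the first two influence the target.
demo = pd.DataFrame({
    "income": rng.uniform(20000, 100000, 500),
    "credit_history_length": rng.integers(1, 30, 500),
    "noise_feature": rng.normal(size=500),
})

# Made-up target, dominated by income by construction.
target = 1.5 * demo["income"] + 500 * demo["credit_history_length"]

model = DecisionTreeRegressor(max_depth=5, random_state=30).fit(demo, target)

# Impurity-based importances, normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the equivalent call would be on the fitted regressor with the training columns as the index, for example `DT1.feature_importances_` paired with `X1_train.columns`.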


In [None]:
print ('Metrics for Model 1')
print('Mean Squared Error:', metrics.mean_squared_error(y1_test, y1_pred_test))
print('Mean Absolute Error:', metrics.mean_absolute_error(y1_test, y1_pred_test))
print('R2:', metrics.r2_score(y1_test, y1_pred_test))

print('\nMetrics for Model 2')
print('Mean Squared Error:', metrics.mean_squared_error(y2_test, y2_pred_test))
print('Mean Absolute Error:', metrics.mean_absolute_error(y2_test, y2_pred_test))
print('R2:', metrics.r2_score(y2_test, y2_pred_test))

## 7. Model Comparison

Two regression models were evaluated:

- Model 1 – Numerical features only  
- Model 2 – Full feature set (including encoded categorical variables)

Model 1 achieved:
- R² ≈ 0.969  
- MAE ≈ £1,210  

Model 2 achieved:
- R² ≈ 0.971  
- MAE ≈ £1,257  

Both models demonstrate strong predictive performance.

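Model 2's R² is marginally higher, but its MAE is also slightly higher (£1,257 versus £1,210), so the encoded categorical variables bring no consistent gain. This suggests that numerical financial indicators (income, loan amount, credit history length) capture most of the predictive signal.

Given the negligible difference in performance, the simpler feature set may be preferable for efficiency and interpretability.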


In [None]:
# Limit the tree growth to 4 levels
DT1_pruned_regressor = DecisionTreeRegressor(max_depth=4, random_state=30)
DT1_pruned_regressor.fit(X1_train, y1_train)

# To make predictions on the test set
y1_pred_pruned = DT1_pruned_regressor.predict(X1_test)

# Plot the regression DT
pruned_Tree_model = plt.figure(figsize=(20,10))
pruned_Tree_model_Graph = tree.plot_tree(DT1_pruned_regressor, feature_names=list(X1_train.columns), filled=True)

# To save the DT graph as a png image
pruned_Tree_model.savefig("pruned_reg_decision_tree.png")

# Calculating the regression metrics for the pruned regression decision Tree
print ('Metrics for Pruned Regression Decision Tree')
print('Mean Squared Error:', metrics.mean_squared_error(y1_test, y1_pred_pruned))
print('Mean Absolute Error:', metrics.mean_absolute_error(y1_test, y1_pred_pruned))
print('R2:', metrics.r2_score(y1_test, y1_pred_pruned))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y1_test, y1_pred_pruned)))

## 8. Effect of Tree Pruning

Restricting the tree depth to four levels reduced model complexity and improved interpretability.

However, performance declined noticeably (R² fell from ≈ 0.969 for the unpruned Model 1 to ≈ 0.86), suggesting that deeper splits were capturing structure, such as interaction effects, that a shallow tree misses.

This highlights the trade-off between model simplicity and predictive power. In regulated financial environments, simpler models may be preferred for transparency, but at the cost of reduced accuracy.
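
The trade-off can be explored by sweeping `max_depth` and recording both tree size and test performance. This is a sketch on synthetic `make_regression` data, so the scores do not correspond to the loan models. The depth values chosen are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the regression dataset.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=30)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=30)

results = {}
for depth in [2, 4, 8, None]:  # None means the tree is not depth-limited
    model = DecisionTreeRegressor(max_depth=depth, random_state=30).fit(X_tr, y_tr)
    results[depth] = {
        "leaves": model.get_n_leaves(),         # proxy for complexity
        "test_r2": model.score(X_te, y_te),     # R² on held-out data
    }

for depth, row in results.items():
    print(f"max_depth={depth}: leaves={row['leaves']}, test R2={row['test_r2']:.3f}")
```

A tree limited to depth 4 has at most 16 leaves, which is small enough to read as a set of rules.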


In [None]:
# Create a new DataFrame from scratch to predict Maximum Loan Amount
data = []
data.append( {"age":56,
              "income":57000,
              "emplyment_length":15,
              "loan_amount":25700,
              "loan_interest_rate":23,
              "loan_income_ratio":0.10,
              "credit_history_length":35,
              } )
df3 = pd.DataFrame(data)

# Add a new column to `df3` with the predicted maximum loan amount:
df3["Predicted Max Loan Amount"] = DT1_pruned_regressor.predict(df3)
df3.head()

## 9. New Applicant Prediction

For a 56-year-old applicant with:
- Income: £57,000  
- Employment length: 15 years  
- Loan amount requested: £25,700  
- Interest rate: 23%  
- Loan-to-income ratio: 0.10  
- Credit history length: 35 years  

The pruned model predicts a maximum approved loan amount of approximately **£92,901**.

This demonstrates how the regression model can support lending decision simulations and scenario analysis in real-world banking environments.


## 10. Limitations

Although the Decision Tree model performs well, several limitations exist:

- Decision trees can overfit complex datasets.
- The model assumes historical approval patterns remain stable over time.
- External macroeconomic factors (e.g., inflation, interest rate shifts) are not incorporated.

Future work could explore ensemble methods such as Random Forest or Gradient Boosting to improve robustness.
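
As a starting point for that comparison, the sketch below scores a single decision tree and a random forest with 5-fold cross-validation. It runs on synthetic data, and the settings (`n_estimators=100`, five folds) are assumptions chosen for illustration. Which model wins on the loan data would need to be checked empirically.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the regression dataset.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=30)

candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=30),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=30),
}

# Mean R² across 5 folds for each candidate model.
cv_r2 = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in cv_r2.items():
    print(f"{name}: mean CV R2 = {score:.3f}")
```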


## 11. Conclusion – Regression Analysis

The Decision Tree Regressor demonstrates strong predictive performance for estimating maximum approved loan amounts.

Key findings:
- Numerical financial indicators drive most predictive power.
- Categorical variables provide limited additional improvement.
- Deep trees increase accuracy but reduce interpretability.
- Pruned trees simplify structure at the cost of performance.

Overall, regression modelling provides valuable decision support for capital allocation and personalised lending strategies.
