## Project: Loan Default Prediction System
## Author: Hassan Saifuddin
## Date: 28/4/2025
## Description: A machine learning model to predict loan default risks.



# Data Description:

Data Documentation:
The dataset consists of various features that describe loan applicants, their financial background, and other demographic information. Below is a detailed description of each column used in the analysis:

1. Age
Type: Numeric

Description: The age of the loan applicant in years.

Importance: Age is often an important factor when assessing loan eligibility or predicting the likelihood of default. Older individuals may have more stable income and a lower risk of default, but this is not always the case.

2. Income
Type: Numeric

Description: The annual income of the loan applicant in dollars.

Importance: Income is a key factor in evaluating an applicant’s ability to repay a loan. Higher income generally correlates with a lower risk of default.

3. LoanAmount
Type: Numeric

Description: The total loan amount requested by the applicant.

Importance: The loan amount can influence the risk of default. Larger loans may carry higher risks, particularly if the applicant's income is not substantial enough to support it.

4. CreditScore
Type: Numeric

Description: The credit score of the loan applicant.

Importance: Credit score is one of the most important indicators of creditworthiness. A higher score generally means lower risk, as the applicant has a history of managing credit responsibly.

5. MonthsEmployed
Type: Numeric

Description: The number of months the applicant has been employed at their current job.

Importance: This feature provides an indication of job stability. Longer employment may correlate with a stable income and, consequently, a lower risk of loan default.

6. NumCreditLines
Type: Numeric

Description: The number of credit lines (e.g., credit cards, mortgages) the applicant currently holds.

Importance: A higher number of credit lines could indicate a greater level of financial responsibility or the possibility of overextension of credit.

7. InterestRate
Type: Numeric

Description: The interest rate applied to the loan.

Importance: A higher interest rate may indicate greater financial risk or less favorable terms for the applicant, often associated with a higher likelihood of default.

8. LoanTerm
Type: Numeric

Description: The loan term in months.

Importance: Loan term affects monthly payments and total interest paid. Shorter loan terms may have higher monthly payments but lower total interest, while longer terms may have lower payments but higher total interest.
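The trade-off between term length, monthly payment, and total interest can be checked with the standard fixed-rate amortization formula. The sketch below is illustrative only: it assumes a fixed annual rate with monthly compounding, and the helper names `monthly_payment` and `total_interest` and the sample loan figures are hypothetical, not taken from the dataset.

```python
def monthly_payment(principal, annual_rate_pct, term_months):
    """Fixed monthly payment for a fully amortizing loan (assumes monthly compounding)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)


def total_interest(principal, annual_rate_pct, term_months):
    """Total interest paid over the life of the loan."""
    return monthly_payment(principal, annual_rate_pct, term_months) * term_months - principal


# Hypothetical loan: 50,000 at 10% annual interest, compared over two terms.
for term in (24, 60):
    pay = monthly_payment(50_000, 10.0, term)
    interest = total_interest(50_000, 10.0, term)
    print(f"{term} months: payment={pay:.2f}, total interest={interest:.2f}")
```

Running it shows the shorter term producing the larger monthly payment and the smaller total interest, which is the pattern described above.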

9. DTIRatio (Debt-to-Income Ratio)
Type: Numeric

Description: The ratio of the applicant's total debt to their income.

Importance: This is a critical indicator of an applicant’s ability to repay the loan. A higher DTI ratio indicates that the applicant may already be financially stretched and at higher risk of default.

10. Default
Type: Categorical (Binary)

Description: The target variable indicating whether the loan applicant defaulted (1) or did not default (0).

Importance: This is the outcome we are trying to predict. The model uses other features in the dataset to predict this variable.

11. Education
Type: Categorical

Description: The highest level of education attained by the applicant.

Categories (codes as produced in Step 6, where values are numbered in order of first appearance; only values visible in the data preview are listed):

0: Bachelor's

1: Master's

2: High School

Importance: Education level can affect earning potential and financial stability, potentially influencing the risk of loan default.

12. EmploymentType
Type: Categorical

Description: The type of employment the applicant has.

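Categories (codes as produced in Step 6, where values are numbered in order of first appearance; only values visible in the data preview are listed):

0: Full-time

1: Unemployed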

Importance: Employment type may indicate the stability of income. Self-employed individuals might experience income volatility, while full-time employed individuals generally have more stable incomes.

13. MaritalStatus
Type: Categorical

Description: The marital status of the applicant.

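Categories (codes as produced in Step 6, where values are numbered in order of first appearance; only values visible in the data preview are listed):

0: Divorced

1: Married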

Importance: Marital status can influence financial stability. Married individuals may have dual incomes or shared financial responsibilities, while single individuals may bear all financial responsibility themselves.

14. HasMortgage
Type: Categorical (Binary)

Description: Indicates whether the applicant currently has a mortgage.

Categories (codes as produced in Step 6):

0: Has mortgage (Yes)

1: No mortgage (No)

Importance: Having a mortgage implies an existing financial obligation, which could affect the applicant's ability to repay new loans.

15. HasDependents
Type: Categorical (Binary)

Description: Indicates whether the applicant has dependents (e.g., children, other dependents).

Categories (codes as produced in Step 6):

0: Has dependents (Yes)

1: No dependents (No)

Importance: Applicants with dependents may face higher living costs, which could reduce their available income for repaying loans.

16. LoanPurpose
Type: Categorical

Description: The purpose for which the loan is being requested.

Categories (codes as produced in Step 6, where values are numbered in order of first appearance; only values visible in the data preview are listed):

0: Other

1: Auto

2: Business

Importance: The purpose of the loan can affect the applicant’s likelihood of default, since loans taken for different purposes (for example, business versus auto loans) may follow different repayment patterns.

17. HasCoSigner
Type: Categorical (Binary)

Description: Indicates whether the applicant has a co-signer for the loan.

Categories (codes as produced in Step 6):

0: Has co-signer (Yes)

1: No co-signer (No)

Importance: Having a co-signer can reduce the lender's risk, because the co-signer is responsible for repayment in case the primary borrower defaults.

# CORE SYSTEM MODULES


In [1]:


import zipfile
# Module to handle extraction and management of ZIP compressed files.

import pandas as pd
# Primary library for structured data handling using DataFrames.

import numpy as np
# Essential package for numerical computing and array operations.

# =============================================================================
# MACHINE LEARNING MODEL ARCHITECTURE
# =============================================================================

from sklearn.ensemble import RandomForestClassifier
# Ensemble learning method: Builds multiple decision trees and merges their results for better accuracy and stability.

# =============================================================================
# DATA PREPROCESSING AND FEATURE ENGINEERING
# =============================================================================

from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
# - StandardScaler: Normalize features by removing the mean and scaling to unit variance.
# - OneHotEncoder: Encode categorical features as a one-hot numeric array.
# - PolynomialFeatures: Generate new polynomial and interaction features to capture non-linear relationships.

# =============================================================================
# MODEL VALIDATION AND HYPERPARAMETER OPTIMIZATION
# =============================================================================

from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
# - train_test_split: Partition data into training and testing sets.
# - cross_val_score: Evaluate model performance via cross-validation.
# - GridSearchCV: Exhaustive search over specified parameter values for an estimator.

# =============================================================================
# PIPELINING AND DATA TRANSFORMATION
# =============================================================================

from sklearn.pipeline import Pipeline as pipeline
# Pipeline: Assemble several steps (preprocessing + model) into one sequential object.

from sklearn.compose import ColumnTransformer
# Apply different preprocessing pipelines to different feature columns.

# =============================================================================
# MODEL EVALUATION METRICS
# =============================================================================

from sklearn.metrics import accuracy_score, make_scorer
# - accuracy_score: Calculates the ratio of correctly predicted observations to total observations.
# - make_scorer: Create a custom scoring function for model evaluation or hyperparameter tuning.

# =============================================================================
# HANDLING IMBALANCED DATASETS
# =============================================================================

from imblearn.over_sampling import RandomOverSampler, SMOTE
# - RandomOverSampler: Duplicate random records from the minority class to balance the dataset.
# - SMOTE (Synthetic Minority Over-sampling Technique): Create synthetic examples of the minority class.

from imblearn.under_sampling import RandomUnderSampler
# Randomly remove samples from the majority class to balance the class distribution.


In this step, we are importing the necessary libraries to carry out data preprocessing, modeling, and evaluation. These libraries serve the following purposes:

pandas and numpy: Used for data manipulation and numerical operations, like handling missing data, transformation, and reshaping.

sklearn: Includes functions for model creation (e.g., RandomForestClassifier), data preprocessing (e.g., StandardScaler, OneHotEncoder), and model evaluation (e.g., accuracy_score).

imblearn: Provides tools for handling class imbalances using techniques like SMOTE and RandomOverSampler.

These libraries enable various tasks such as data transformation, model fitting, hyperparameter tuning, and performance evaluation.

# STEP 1: Data Acquisition

In [2]:


# Define the file path for the input dataset.
file_location = '/kaggle/input/loan-default/Loan_default.csv'


# STEP 2: Data Loading

In [3]:

# Load the CSV data directly into a pandas DataFrame.
Data = pd.read_csv(file_location)

# Preview the first few rows to confirm successful loading.
Data.head()


,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


Here, we load the dataset from a CSV file into a pandas DataFrame. This allows us to perform all data manipulation and analysis within Python. read_csv() is a common function in pandas to load CSV files into DataFrame format for easy processing.

# STEP 3: Data Cleaning - Drop Irrelevant Features

In [4]:


# Drop the 'LoanID' column as it is an identifier and not useful for prediction.
Data = Data.drop(columns='LoanID')

# Verify that the column has been dropped by checking the DataFrame shape.
print("Updated Data Shape:", Data.shape)


Updated Data Shape: (255347, 17)


In this step, we drop the 'LoanID' column from the dataset. It is a unique identifier for each loan and carries no predictive information, so removing it avoids feeding noise to the model.

# STEP 4: Data Preprocessing - Select Numerical Features

In [5]:


# Select numerical columns (all integer and float types) for further analysis and model training.
# 'number' matches every numeric dtype regardless of platform-specific integer width.
num = Data.select_dtypes(include='number')

# Display the selected numerical features to verify.
print("Selected Numerical Features:", num.columns.tolist())


Selected Numerical Features: ['Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm', 'DTIRatio', 'Default']


In this step we select the numerical columns (integer and float types) into their own DataFrame, num. Note that the target column, Default, is stored as an integer and is therefore included in this selection. The categorical columns (strings or objects) are selected separately in Step 6.

Numerical columns are used for mathematical operations like scaling or feature selection.

Categorical columns require encoding into numerical representations before feeding them into machine learning models.
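As a small illustration of the split described above, the sketch below builds a toy DataFrame from a few values copied from the preview and separates numeric from non-numeric columns. It uses `include="number"` and `exclude="number"`, which is one possible way to express the split and not necessarily the notebook's own call; the variable names `toy`, `num_cols`, and `cat_cols` are hypothetical.

```python
import pandas as pd

# Toy data; column names and values mirror a few entries from the preview.
toy = pd.DataFrame({
    "Age": [56, 69, 46],
    "InterestRate": [15.23, 4.81, 21.17],
    "Education": ["Bachelor's", "Master's", "Master's"],
    "HasMortgage": ["Yes", "No", "Yes"],
})

num_cols = toy.select_dtypes(include="number")  # integer and float columns
cat_cols = toy.select_dtypes(exclude="number")  # everything else (here: strings)

print(num_cols.columns.tolist())
print(cat_cols.columns.tolist())
```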

# STEP 5: Feature Engineering - Categorical Encoding

In [6]:

def ColumnTrans(cat):
    """
    Convert categorical columns to integer codes by mapping each unique
    value to an index, in order of first appearance.

    Parameters:
    ----------
    cat : DataFrame
        A pandas DataFrame containing categorical columns to be transformed.

    Returns:
    -------
    cat : DataFrame
        A copy of the input with every column replaced by its integer codes.
    """
    # Work on a copy so the caller's DataFrame is not modified and pandas
    # does not emit a SettingWithCopyWarning.
    cat = cat.copy()

    # Iterate over each column in the DataFrame.
    for column in cat.columns:
        # Get unique values for the column (in order of first appearance)
        unique_values = cat[column].unique()

        # Create a mapping of each unique value to a corresponding integer
        value_map = {value: index for index, value in enumerate(unique_values)}

        # Map the column's categorical values to their integer indices
        cat[column] = cat[column].map(value_map)

    return cat


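We define a custom function, ColumnTrans, to map categorical values into numerical indices. This is a form of Label Encoding, where each unique category in a column is replaced with a corresponding number. The numbers are assigned in order of first appearance in the column, so they do not reflect any natural ordering of the categories.

For instance, if the values "Low", "Medium", and "High" first appear in that order, they will be transformed into 0, 1, and 2; if "High" appeared first, it would receive 0 instead.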
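The first-appearance behaviour can be reproduced with `pandas.factorize`, used here as a stand-in for ColumnTrans rather than as the notebook's method. The sketch also shows one possible way to impose a meaningful order with an ordered `Categorical`; the sample values are hypothetical.

```python
import pandas as pd

values = pd.Series(["Low", "High", "Low", "Medium"])  # hypothetical column

# Codes follow order of first appearance, like the dictionary built in ColumnTrans.
codes, uniques = pd.factorize(values)
print(codes.tolist(), list(uniques))

# To encode a genuine ordering, state the category order explicitly.
ordered = pd.Categorical(values, categories=["Low", "Medium", "High"], ordered=True)
print(ordered.codes.tolist())
```

With the explicit category order, "High" receives the largest code even though it appears before "Medium" in the data.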

# STEP 6: Data Preprocessing - Transform Categorical Columns

In [7]:


# Select the categorical columns (object type) from the DataFrame.
cat = Data.select_dtypes(include='object')

# Apply the custom ColumnTrans function to transform categorical columns into numerical format.
cat = ColumnTrans(cat)

# Verify the transformation by displaying the first few rows of the transformed categorical data.
print("Transformed Categorical Columns (first few rows):")
print(cat.head())


Transformed Categorical Columns (first few rows):
   Education  EmploymentType  MaritalStatus  HasMortgage  HasDependents  \
0          0               0              0            0              0   
1          1               0              1            1              1   
2          1               1              0            0              0   
3          2               0              1            1              1   
4          0               1              0            1              0   

   LoanPurpose  HasCoSigner  
0            0            0  
1            0            0  
2            1            1  
3            2            1  
4            1            1  


We apply the ColumnTrans function to the categorical DataFrame, converting all categorical columns into numerical representations.

# STEP 7: Data Preprocessing - Combining Numerical and Categorical Data

In [8]:


# Concatenate the numerical and transformed categorical features along the columns (axis=1).
df = pd.concat([num, cat], axis=1)

# Verify the combined DataFrame by displaying the first few rows.
print("Combined DataFrame (first few rows):")
print(df.head())


Combined DataFrame (first few rows):
   Age  Income  LoanAmount  CreditScore  MonthsEmployed  NumCreditLines  \
0   56   85994       50587          520              80               4   
1   69   50432      124440          458              15               1   
2   46   84208      129188          451              26               3   
3   32   31713       44799          743               0               3   
4   60   20437        9139          633               8               4   

   InterestRate  LoanTerm  DTIRatio  Default  Education  EmploymentType  \
0         15.23        36      0.44        0          0               0   
1          4.81        60      0.68        0          1               0   
2         21.17        24      0.31        1          1               1   
3          7.07        24      0.23        0          2               0   
4          6.51        48      0.73        0          0               1   

   MaritalStatus  HasMortgage  HasDependents  LoanPurpose  Ha

We combine the numerical features and transformed categorical features into a single DataFrame. The concat() function merges them along the columns axis (axis=1).

# STEP 8: Data Preprocessing - Splitting Features and Target Variable

In [9]:


# Separate features (X) from the target variable (y)
x1 = df.drop(columns='Default')  # Features: All columns except 'Default'
y1 = df['Default']  # Target: 'Default' column

# Verify the separation by displaying the shapes of the features and target.
print("Shape of Features (X):", x1.shape)
print("Shape of Target (y):", y1.shape)


Shape of Features (X): (255347, 16)
Shape of Target (y): (255347,)


We split the dataset into features (x1) and the target variable (y1). The target, 'Default', indicates whether a loan is defaulted (1) or not (0). This allows us to train a model to predict this target based on the features.

# STEP 9: Data Preprocessing - Balancing the Dataset

In [10]:


# Initialize the resampling techniques
ros = RandomOverSampler()  # Random Over-Sampling to balance the dataset by increasing the minority class.
rus = RandomUnderSampler()  # Random Under-Sampling to balance the dataset by decreasing the majority class.
smote = SMOTE()  # SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic examples for the minority class.

# Verify that the resampling methods are correctly initialized
print("Resampling techniques initialized:")
print("RandomOverSampler:", ros)
print("RandomUnderSampler:", rus)
print("SMOTE:", smote)


Resampling techniques initialized:
RandomOverSampler: RandomOverSampler()
RandomUnderSampler: RandomUnderSampler()
SMOTE: SMOTE()


We initialize three techniques for handling class imbalance:

RandomOverSampler: Increases the minority class by randomly duplicating samples.

RandomUnderSampler: Decreases the majority class by randomly removing samples.

SMOTE: Uses Synthetic Minority Over-sampling Technique to create synthetic samples for the minority class.

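These methods help keep the model from being biased towards the majority class. Note that none of the samplers is given a random_state here, so the resampled data can differ from run to run.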
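To make the difference between the three techniques concrete, here is a simplified pure-Python sketch on a toy one-dimensional dataset. It is not the imblearn implementation: the function names are hypothetical, and the SMOTE-like step interpolates between two randomly chosen minority rows, whereas real SMOTE interpolates towards one of the k nearest minority neighbours.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy 1-D feature with 8 majority (label 0) and 2 minority (label 1) samples.
majority = [(float(i), 0) for i in range(8)]
minority = [(10.0, 1), (12.0, 1)]


def random_oversample(minority, target):
    """Duplicate randomly chosen minority rows until `target` rows exist."""
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return minority + extra


def random_undersample(majority, target):
    """Keep a random subset of `target` majority rows."""
    return rng.sample(majority, target)


def smote_like(minority, target):
    """Create synthetic rows on the segment between two minority rows."""
    synthetic = []
    while len(minority) + len(synthetic) < target:
        (a, _), (b, _) = rng.sample(minority, 2)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append((a + lam * (b - a), 1))
    return minority + synthetic


over = majority + random_oversample(minority, len(majority))
under = random_undersample(majority, len(minority)) + minority
smoted = majority + smote_like(minority, len(majority))

for name, data in [("over", over), ("under", under), ("smote-like", smoted)]:
    print(name, Counter(label for _, label in data))
```

Oversampling and the SMOTE-like step both grow the minority class to the majority size (by copying versus by interpolating), while undersampling shrinks the majority class to the minority size.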

# STEP 10: Balancing the Dataset - Applying Resampling Techniques

In [11]:


# Apply Random Over-Sampling to duplicate minority-class rows until both classes are the same size
x2, y2 = ros.fit_resample(x1, y1)
print("Shape after RandomOverSampler (ROS):", x2.shape, y2.shape)

# Apply SMOTE; the classes are already balanced after ROS, so no synthetic samples are added
x3, y3 = smote.fit_resample(x2, y2)
print("Shape after SMOTE:", x3.shape, y3.shape)

# Apply Random Under-Sampling; with balanced classes, no majority rows are removed
x, y = rus.fit_resample(x3, y3)
print("Shape after RandomUnderSampler (RUS):", x.shape, y.shape)

# Final balanced dataset
print("Final Balanced Dataset Shape (X):", x.shape)
print("Final Balanced Dataset Shape (Y):", y.shape)


Shape after RandomOverSampler (ROS): (451388, 16) (451388,)
Shape after SMOTE: (451388, 16) (451388,)
Shape after RandomUnderSampler (RUS): (451388, 16) (451388,)
Final Balanced Dataset Shape (X): (451388, 16)
Final Balanced Dataset Shape (Y): (451388,)


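We apply the resampling techniques sequentially:

First, we oversample the minority class using RandomOverSampler. This grows the dataset from 255,347 to 451,388 rows, i.e. 225,694 rows per class.

Next, we apply SMOTE. Because the classes are already balanced at this point, it generates no synthetic samples and the shape is unchanged.

Finally, we apply RandomUnderSampler. For the same reason, it removes nothing.

In effect, only the random oversampling step balances the dataset before training the model.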

# STEP 11: Data Splitting - Train-Test Split

In [12]:


# Split the balanced dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets to verify the split
print("Shape of Training Features (X_train):", x_train.shape)
print("Shape of Testing Features (X_test):", x_test.shape)
print("Shape of Training Target (y_train):", y_train.shape)
print("Shape of Testing Target (y_test):", y_test.shape)


Shape of Training Features (X_train): (361110, 16)
Shape of Testing Features (X_test): (90278, 16)
Shape of Training Target (y_train): (361110,)
Shape of Testing Target (y_test): (90278,)


We split the resampled dataset into training and testing sets. train_test_split() divides the data into two subsets:

Training set (80%): Used to train the model.

Testing set (20%): Used to evaluate the model's performance.

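The random_state=42 ensures reproducibility of the split. Note that because oversampling was applied before the split, copies of the same minority-class row can end up in both the training and testing sets, which makes scores measured on this test set optimistic.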
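The effect of the ordering (resample then split, versus split then resample) can be seen on a toy example. The sketch below is pure Python with hypothetical helper names (`oversample`, `split`, `shared_ids`); it is a simplified stand-in for RandomOverSampler and train_test_split, not the notebook's code. It counts how many original rows occur in both the training and the testing part under each ordering.

```python
import random

rng = random.Random(42)

# Toy dataset: each row is (row_id, label); 90 majority rows and 10 minority rows.
rows = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]


def oversample(data):
    """Duplicate minority rows at random until both classes have equal counts."""
    minority = [r for r in data if r[1] == 1]
    majority = [r for r in data if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return data + extra


def split(data, test_fraction=0.2):
    """Shuffle a copy of the data and cut it into train and test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def shared_ids(train, test):
    """Row ids that occur in both the training and the test part."""
    return {r[0] for r in train} & {r[0] for r in test}


# Order used in the notebook: resample first, then split.
train_a, test_a = split(oversample(rows))
print("resample then split, shared ids:", len(shared_ids(train_a, test_a)))

# Alternative: split first, then resample only the training part.
train_b, test_b = split(rows)
train_b = oversample(train_b)
print("split then resample, shared ids:", len(shared_ids(train_b, test_b)))
```

Splitting first keeps the test part free of any row the model has trained on, at the cost of evaluating on the original, imbalanced class distribution.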

# STEP 12: Model Initialization - Random Forest Classifier

In [13]:


# Initialize the Random Forest Classifier with 2000 estimators (trees)
model = RandomForestClassifier(n_estimators=2000, random_state=42)

# Display the model's parameters to verify the configuration
print("Random Forest Classifier initialized with parameters:")
print(model)


Random Forest Classifier initialized with parameters:
RandomForestClassifier(n_estimators=2000, random_state=42)


We initialize a Random Forest Classifier with 2000 estimators (trees). Random forests are ensembles of decision trees that improve accuracy and robustness by averaging predictions from multiple trees. The random_state=42 ensures reproducibility.
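The averaging step can be sketched in a few lines. The per-tree probabilities below are made up for illustration, and the variable names are hypothetical; scikit-learn's RandomForestClassifier averages the class probabilities predicted by its trees and picks the most probable class, which for two classes is approximated here by a 0.5 threshold.

```python
# Hypothetical P(default) estimates from five trees for three applicants.
tree_probabilities = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [1.0, 0.0, 0.7],
    [0.6, 0.4, 0.2],
]

n_trees = len(tree_probabilities)
n_samples = len(tree_probabilities[0])

# Average the per-tree probabilities for each applicant.
forest_probability = [
    sum(tree[i] for tree in tree_probabilities) / n_trees for i in range(n_samples)
]

# Predict default (1) when the averaged probability exceeds 0.5.
forest_prediction = [int(p > 0.5) for p in forest_probability]
print(forest_probability, forest_prediction)
```

Individual trees disagree on the third applicant, but the averaged probability settles the prediction, which is where the added stability comes from.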

# STEP 13: Model Training - Fitting the Random Forest Classifier

In [14]:

# Train the model using the training data
model.fit(x_train, y_train)

# Display a message to indicate that training is complete
print("Model training complete with Random Forest Classifier.")


Model training complete with Random Forest Classifier.


We train the Random Forest model on the training data. The .fit() method uses the features (x_train) and the corresponding target labels (y_train) to build the model. With scikit-learn's default settings, each tree is grown on a bootstrap sample of the training rows and considers a random subset of the features at each split.

# STEP 14: Model Prediction - Predicting on Test Data

In [15]:


# Use the trained model to make predictions on the test data
prediction = model.predict(x_test)

# Display the shape of the predictions to verify
print("Shape of predictions:", prediction.shape)


Shape of predictions: (90278,)


In [16]:
# Since the original data was imbalanced, check whether the model predicts both classes or needs more work.
(prediction == 0).sum()

44334

In [17]:
(prediction == 1).sum()

45944

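After training, we use the .predict() method to generate predictions for the test set (x_test). This is the model’s attempt to predict whether loans in the test data will default or not. The counts above (44,334 predicted non-defaults and 45,944 predicted defaults) confirm that the model predicts both classes.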

# STEP 15: Model Evaluation - Precision, Recall, F1, and Accuracy

In [18]:


# Import the evaluation metrics from sklearn
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Calculate accuracy, precision, recall, and F1 score
accuracy = accuracy_score(y_test, prediction)
precision = precision_score(y_test, prediction)
recall = recall_score(y_test, prediction)
f1 = f1_score(y_test, prediction)

# Display the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


Accuracy: 0.9898
Precision: 0.9814
Recall: 0.9984
F1 Score: 0.9899


Here, we evaluate the model’s performance by calculating:

Accuracy: Proportion of correct predictions.

Precision: Proportion of true positive predictions out of all positive predictions.

Recall: Proportion of true positives correctly identified from all actual positives.

F1 Score: Harmonic mean of precision and recall, providing a balance between them.

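These metrics help us assess the model’s reliability and robustness. Note, however, that the test set here comes from the resampled (balanced) data and can contain duplicates of training rows, so these scores likely overstate performance on the original imbalanced data. A more reliable estimate would come from a test set held out before any resampling.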
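For reference, the four definitions above can be computed directly from the counts of true/false positives and negatives. This is a minimal sketch with a hypothetical helper name and made-up labels; the notebook itself uses the sklearn.metrics functions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```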