In this section, I extend the previous experiments by walking through the machine learning pipeline step by step: data preprocessing, model training, evaluation, and saving the model for deployment.

In [23]:
# Import necessary libraries
import pandas as pd  # For data manipulation
import matplotlib.pyplot as plt  # For visualization
import seaborn as sns  # For enhanced visualization
import warnings  # To suppress warnings
warnings.filterwarnings('ignore')  # Suppress unnecessary warnings
from sklearn.model_selection import train_test_split, GridSearchCV  # For splitting data and hyperparameter tuning
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler  # For categorical encoding and feature scaling
from lightgbm import LGBMClassifier  # For LightGBM model
from sklearn.metrics import f1_score  # For model evaluation
import joblib  # For saving and loading models/scalers

In [24]:
# Load the pre-processed dataset
df_orig = pd.read_csv('loan_analyzed.csv')  # Load data from CSV file
# Keep four predictor columns plus the target column `loan_approval_status`
selected_features = [
    'employment_status', 'credit_score', 
    'diff_income_to_expenses', 'debt_to_income_ratio', 'loan_approval_status'
]
df = df_orig[selected_features]
df.head()  # Display the first 5 rows of the dataset

Unnamed: 0,employment_status,credit_score,diff_income_to_expenses,debt_to_income_ratio,loan_approval_status
0,Employed,743,9125.416667,0.141686,1
1,Employed,468,-2277.5,0.86575,0
2,Self-Employed,389,-1135.083333,0.497969,0
3,Self-Employed,778,8755.75,0.207525,1
4,Employed,752,5164.75,0.107397,1


In [25]:
# Create a copy of the DataFrame for scaling and encoding
df_scaled = df.copy()

## One-Hot Encoding

The first step is **one-hot encoding** of the categorical features, which converts them into numerical form. Each category becomes its own binary column, set to 1 when a row belongs to that category and 0 otherwise.
Among the columns selected above, the only categorical feature is `employment_status`, so it is the only column encoded here.

In [26]:
# Select categorical columns for encoding
cat_cols = ['employment_status']  # Columns to be encoded using OneHotEncoder

# Fit the encoder and transform the categorical columns
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # Dense output; unseen categories encode as all zeros
encoded_values = encoder.fit_transform(df_scaled[cat_cols])

# Wrap the result in a DataFrame with meaningful column names and the original row index
df_encoded = pd.DataFrame(
    encoded_values,
    columns=encoder.get_feature_names_out(cat_cols),  # e.g. employment_status_Employed
    index=df_scaled.index,  # Keeps rows aligned when concatenating below
)

# Drop original categorical columns and concatenate the encoded columns
df_scaled = df_scaled.drop(columns=cat_cols)  # Remove original categorical columns
df_scaled = pd.concat([df_scaled, df_encoded], axis=1)  # Add encoded columns to the dataset
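
One detail worth knowing about this step is that `pd.concat(..., axis=1)` aligns rows by index label, not by position. That is harmless when both frames carry the default 0..n-1 index, but it would silently add NaN-padded rows if the source frame had been filtered or re-indexed first. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical frame whose index no longer starts at 0 (e.g. after filtering rows)
features = pd.DataFrame({"credit_score": [743, 468, 389]}, index=[10, 11, 12])

# An encoder output wrapped in a DataFrame gets a fresh 0..n-1 index by default
encoded = pd.DataFrame({"employment_status_Employed": [1.0, 1.0, 0.0]})

# The index labels do not match, so concat takes their union and pads with NaN
misaligned = pd.concat([features, encoded], axis=1)

# Reusing the original index keeps each encoded row next to its source row
aligned = pd.concat([features, encoded.set_axis(features.index)], axis=1)

print(misaligned.shape, aligned.shape)
```

Passing the source frame's index when building the encoded DataFrame avoids the problem entirely.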

In [27]:
df_scaled.head()  # Display the updated DataFrame

Unnamed: 0,credit_score,diff_income_to_expenses,debt_to_income_ratio,loan_approval_status,employment_status_Employed,employment_status_Self-Employed,employment_status_Unemployed
0,743,9125.416667,0.141686,1,1.0,0.0,0.0
1,468,-2277.5,0.86575,0,1.0,0.0,0.0
2,389,-1135.083333,0.497969,0,0.0,1.0,0.0
3,778,8755.75,0.207525,1,0.0,1.0,0.0
4,752,5164.75,0.107397,1,1.0,0.0,0.0


In [28]:
# Save the OneHotEncoder for future use
joblib.dump(encoder, '../models/one_hot_encoder.pkl')  # Save the encoder to a file


['../models/one_hot_encoder.pkl']

## Min-Max Scaling

Next, I applied the **Min-Max Scaler** to rescale the numerical features so that all values fall within the range 0 to 1. Putting features on a common scale matters most for distance-based models and models trained by gradient descent; tree-based models such as LightGBM split on thresholds and are largely unaffected by it.
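
Min-max scaling maps each value x to (x - min) / (max - min), where min and max are taken per column. The sketch below applies that formula by hand to three credit scores from the first rows of the dataset. The bounds 300 and 849 are an inference from the displayed rows, not values stated anywhere in this notebook, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Rescale value linearly so that lo maps to 0 and hi maps to 1."""
    return (value - lo) / (hi - lo)


# Assumed column bounds (inferred from the displayed rows, not confirmed by the source)
CREDIT_MIN, CREDIT_MAX = 300, 849

scaled_scores = {
    raw: round(min_max_scale(raw, CREDIT_MIN, CREDIT_MAX), 6)
    for raw in (743, 468, 389)
}
print(scaled_scores)
```

Under that assumption, the results agree with the `credit_score` values shown in the scaled DataFrame.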

In [40]:
# Initialize MinMaxScaler for feature scaling
scaler = MinMaxScaler()

# Columns to scale (the target column `loan_approval_status` is excluded)
cols_to_scale = ['credit_score', 'diff_income_to_expenses', 'debt_to_income_ratio']
df_scaled[cols_to_scale] = scaler.fit_transform(df_scaled[cols_to_scale])  # Scale the feature columns in place

In [41]:
# Save the MinMaxScaler for future use
joblib.dump(scaler, '../models/min_max_scaler.pkl')  # Save the scaler to a file

['../models/min_max_scaler.pkl']

In [42]:
df_scaled.head()  # Display the scaled DataFrame

Unnamed: 0,credit_score,diff_income_to_expenses,debt_to_income_ratio,loan_approval_status,employment_status_employed,employment_status_self-employed,employment_status_unemployed
0,0.806922,0.815031,0.095621,1.0,1.0,0.0,0.0
1,0.306011,0.067841,0.584275,0.0,1.0,0.0,0.0
2,0.162113,0.1427,0.336068,0.0,0.0,1.0,0.0
3,0.870674,0.790808,0.140054,1.0,0.0,1.0,0.0
4,0.823315,0.555503,0.07248,1.0,1.0,0.0,0.0


In [43]:
# Clean up column names by replacing spaces with underscores and converting to lowercase
df_scaled.columns = df_scaled.columns.str.replace(' ', '_').str.lower()  # Standardize column names
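
To make the effect of this renaming concrete, here is a minimal sketch on a hypothetical `pd.Index` holding a few of the column names from above. Hyphens are left untouched, so `Self-Employed` becomes `self-employed`, and any data prepared later for the model needs the same renaming applied.

```python
import pandas as pd

# A few column names as produced by the encoder (illustrative subset)
columns = pd.Index(
    [
        "credit_score",
        "employment_status_Employed",
        "employment_status_Self-Employed",
    ]
)

# Same two operations as in the cell above; regex=False makes the literal match explicit
cleaned = columns.str.replace(" ", "_", regex=False).str.lower()
print(list(cleaned))
```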

In [44]:
df_scaled.head()  # Display the updated DataFrame

Unnamed: 0,credit_score,diff_income_to_expenses,debt_to_income_ratio,loan_approval_status,employment_status_employed,employment_status_self-employed,employment_status_unemployed
0,0.806922,0.815031,0.095621,1.0,1.0,0.0,0.0
1,0.306011,0.067841,0.584275,0.0,1.0,0.0,0.0
2,0.162113,0.1427,0.336068,0.0,0.0,1.0,0.0
3,0.870674,0.790808,0.140054,1.0,0.0,1.0,0.0
4,0.823315,0.555503,0.07248,1.0,1.0,0.0,0.0


## Train-Test Split

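The dataset was split into training and testing sets in an 80:20 ratio, stratified on the target so that both sets keep the same class distribution. Holding out a test set does not prevent overfitting by itself, but it makes overfitting visible, because the model is scored on data it never saw during training.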

In [45]:
# Split the data into features (X) and target (y)
X = df_scaled.drop(columns=['loan_approval_status'])  # Features
y = df_scaled['loan_approval_status']  # Target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)  # 80% training, 20% testing, maintain class distribution

## Model Training

The preprocessed training data was then used to fit the model. Training minimizes the loss function on the training set only; the test set is held back to check how well the fitted model generalizes.

In [46]:
# Initialize the LightGBM classifier with explicit hyperparameters and a fixed random seed
lgbm_model = LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=123, reg_alpha=0.0, reg_lambda=0.0, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

In [47]:
# Train the LightGBM model on the training data
lgbm_model.fit(X_train, y_train)  # Fit the model

[LightGBM] [Info] Number of positive: 25636, number of negative: 14139
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000769 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 768
[LightGBM] [Info] Number of data points in the train set: 39775, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.644525 -> initscore=0.595061
[LightGBM] [Info] Start training from score 0.595061
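
The two numbers in the `BoostFromScore` line can be reproduced from the class counts in the first log line: `pavg` is the share of positive labels in the training set, and `initscore` is the log-odds of that share, which is where a binary log-loss model starts before any tree is added. A short check, as I read the log rather than as a statement about LightGBM internals:

```python
import math

# Class counts reported in the first line of the training log
n_pos, n_neg = 25636, 14139

pavg = n_pos / (n_pos + n_neg)           # share of positive labels
initscore = math.log(pavg / (1 - pavg))  # log-odds of the positive class

print(f"pavg={pavg:.6f} initscore={initscore:.6f}")
```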


In [48]:
# Generate predictions for the held-out test set
test_predictions = lgbm_model.predict(X_test)  # Predictions for testing data

## Model Testing and Evaluation

In [49]:
# Calculate the F1-Score for the testing set
f1 = f1_score(y_test, test_predictions)  # F1-Score balances precision and recall
print(f'F1-Score: {f1:.4f}')  # Print the F1-Score with 4 decimal places

F1-Score: 0.8908


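After training, I evaluated the model on the test set. It reached an F1-score of **0.8908** for the positive class (`loan_approval_status` = 1). F1 is the harmonic mean of precision and recall, not accuracy. With roughly 64% of the examples in the positive class according to the training log, a classifier that approves every application would score about 0.78, so the result is a clear improvement over that trivial baseline.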
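
For context on the score, the sketch below computes F1 from confusion counts and evaluates a trivial classifier that predicts the positive class for every row. It uses the class counts from the training log (25,636 positive, 14,139 negative) and assumes the stratified test set has the same class ratio. `f1_from_counts` is a hypothetical helper, not part of scikit-learn.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Class counts taken from the LightGBM training log
n_pos, n_neg = 25636, 14139

# Predicting "positive" for every row: all positives are hits, all negatives are false alarms
baseline_f1 = f1_from_counts(tp=n_pos, fp=n_neg, fn=0)
print(f"All-positive baseline F1: {baseline_f1:.4f}")
```

Comparing the model's score against a baseline like this is more informative on imbalanced data than reading the F1 value in isolation.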

## Model Saving for Deployment

Upon completing the evaluation, the model was saved so that it can be loaded for deployment without retraining. At inference time it has to be used together with the encoder and scaler saved earlier, because new records need the same preprocessing as the training data.

In [50]:
# Save the trained LightGBM model for future use
joblib.dump(lgbm_model, '../models/lgbm_model.pkl')  # Save the model to a file

['../models/lgbm_model.pkl']