# Bank Personal Loan Model 
## Model Objective:
This machine learning model is designed to predict which customers are most likely to accept the bank's personal loan offer.


In [1]:
# 1-Core Data Processing and Visualization Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2-Scikit-learn Tools for Preprocessing and Splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 3-Library for Handling Imbalanced Data
from imblearn.over_sampling import SMOTE

# 4-Machine Learning Models and Evaluation Metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, make_scorer, accuracy_score, precision_score, recall_score


## Load Dataset

* Dataset name: Bank_Personal_Loan_Modelling

* Source: Kaggle

In [2]:
df = pd.read_csv("Bank_Personal_Loan_Modelling.csv")

### Data Exploration: The First Look

* Column Information and Data Quality Check

In [3]:
df

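| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |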



* Gathering fundamental information about the dataset, including the count and data type of entries in each column, to ensure data quality and guide feature engineering.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


### Defining Features (X) and Target (y)

#### 1. Feature Matrix (X)
The feature matrix contains all the columns used to make the prediction. Irrelevant or redundant columns are dropped.

* Variable: X

* Columns dropped:

  * 'ID': A unique identifier with no predictive value.

  * 'Personal Loan': The target variable itself; leaving it among the features would hand the answer to the model.

  * 'ZIP Code': Treated as a non-predictive geographic identifier. ZIP codes can carry useful regional signal, but they are excluded in this notebook.

#### 2. Target Variable (y)
The target variable is the outcome the model is trying to predict.

* Variable: y

* Column: 'Personal Loan'

* Description: This is a binary variable (0 or 1), indicating whether the customer accepted the personal loan offer.


In [5]:
data = df.copy()
X = data.drop(['ID', 'Personal Loan','ZIP Code'], axis = 1)
y = data['Personal Loan']

In [6]:
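# Standardize features (mean=0, std=1), then split the data into 80% training and 20% testing sets.
# Caveat: the scaler is fit on the full dataset before the split, so test-set statistics
# influence the scaling; fitting it on X_train only would avoid this leakage.
scaler = StandardScaler()
X = scaler.fit_transform(X)
# random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)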
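
A leak-free variant of the scaling step can be sketched as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (11 features, roughly 10% positives), and the `_d`/`_s` variable names are hypothetical so they do not overwrite the notebook's own variables. `stratify=` is also added here, which the notebook's split does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption): 11 features, ~10% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Split first; stratify=y_demo keeps the class ratio the same in both parts.
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler_d = StandardScaler()
X_train_s = scaler_d.fit_transform(X_train_d)
X_test_s = scaler_d.transform(X_test_d)

print("train means ~0:", np.allclose(X_train_s.mean(axis=0), 0.0, atol=1e-8))
print("positive rate train/test:", y_train_d.mean(), y_test_d.mean())
```

The test set is transformed with the training mean and standard deviation, so its columns will be close to, but not exactly, zero-mean.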

## Model Training and Prediction
#### 1. Training the Model (Logistic Regression)
The goal is for the model to learn how customer features relate to the likelihood of accepting a loan.

In [7]:
# Initialize and train the Logistic Regression model on the standardized training data.
# max_iter=100 caps the solver's iterations (also scikit-learn's default); random_state=42 is set for reproducibility.
Lr = LogisticRegression(max_iter = 100, random_state=42)
Lr.fit(X_train, y_train)

#### 2. Making and Evaluating Predictions
After training, the model's performance must be tested on data it has never seen before to ensure it generalizes well.

In [8]:
y_pred = Lr.predict(X_test)

print("LogisticRegression:\n")
print(classification_report(y_test, y_pred, digits=4))

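# Note: roc_auc_score receives hard 0/1 predictions here, so the value equals balanced accuracy
# (the macro-average recall). Pass Lr.predict_proba(X_test)[:, 1] for a threshold-independent ROC-AUC.
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred))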

print("Weighted F1 Score:", f1_score(y_test, y_pred, average='weighted'))


LogisticRegression:

              precision    recall  f1-score   support

           0     0.9629    0.9866    0.9746       895
           1     0.8554    0.6762    0.7553       105

    accuracy                         0.9540      1000
   macro avg     0.9092    0.8314    0.8650      1000
weighted avg     0.9516    0.9540    0.9516      1000

ROC-AUC Score: 0.831391327480713
Weighted F1 Score: 0.9515877600864214
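
The ROC-AUC value above matches the macro-average recall (0.8314) because it was computed from hard class labels. The sketch below illustrates the difference between scoring labels and scoring probabilities. It assumes a synthetic dataset from `make_classification` in place of the loan data, and the names `clf`, `hard`, and `proba` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard = clf.predict(X_te)               # 0/1 labels at the default 0.5 threshold
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard)
auc_from_proba = roc_auc_score(y_te, proba)
bal_acc = balanced_accuracy_score(y_te, hard)

print("AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_proba)
```

The first two printed values coincide, while the probability-based AUC reflects how well the model ranks customers across all thresholds.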


### Random Forest Classifier: Training and Prediction
The Random Forest Classifier is an ensemble learning method that trains many decision trees and combines their predictions, which typically improves accuracy and limits overfitting compared with a single tree. It is often a strong choice for tabular classification tasks.

#### 1. Training the Model (Random Forest)
The model builds many decision trees, each on a bootstrap sample of the training data, and combines their votes for the final prediction.

In [9]:
Rf = RandomForestClassifier(n_estimators= 100, random_state = 42)
Rf.fit(X_train, y_train)

#### 2. Making and Evaluating Predictions
After the forest is grown (trained), it is used to classify the unseen test data.

In [10]:
y_pred = Rf.predict(X_test)

print("RandomForest:\n")
print(classification_report(y_test, y_pred, digits=4))
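# Note: computed from hard 0/1 predictions, this equals balanced accuracy; use Rf.predict_proba(X_test)[:, 1] for a true ROC-AUC.
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred))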
print("Weighted F1 Score:", f1_score(y_test, y_pred, average='weighted'))


RandomForest:

              precision    recall  f1-score   support

           0     0.9911    0.9989    0.9950       895
           1     0.9898    0.9238    0.9557       105

    accuracy                         0.9910      1000
   macro avg     0.9905    0.9613    0.9753      1000
weighted avg     0.9910    0.9910    0.9909      1000

ROC-AUC Score: 0.9613461026868848
Weighted F1 Score: 0.9908623568015659
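
How the forest combines its trees can be checked directly. In scikit-learn the "vote" is soft: the forest averages each tree's class probabilities and predicts the class with the highest mean. The sketch below reproduces that by hand on synthetic data; the dataset and the names `forest`, `per_tree`, and `manual_proba` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=42
)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(manual_proba, forest.predict_proba(X_demo)))
print("predictions match:  ", np.array_equal(manual_pred, forest.predict(X_demo)))
```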


In [11]:
del (Rf, Lr, y_pred)

### What this code does

This code performs **stratified 5-fold cross-validation** on the Random Forest model.
It trains and tests the model 5 times on different splits of the dataset while
preserving the class balance in every fold.

### Steps performed by the code

1. **Defines the model**  
   A RandomForestClassifier with 100 trees. The features in `X` were already standardized
   on the full dataset in In [6], so the scaler is not refit inside each fold.

2. **Sets evaluation metrics**  
   Accuracy, precision, recall, and F1-score are calculated for each fold.

3. **StratifiedKFold split**  
   Ensures each fold contains roughly the same proportion of classes (important for imbalanced data).

4. **Runs cross-validation**  
   `cross_validate()` trains and tests the model on each fold and collects all metrics.

5. **Prints average scores**  
   Shows the mean accuracy, precision, recall, and F1 over all 5 folds, giving a more
   reliable evaluation than a single train/test split.

### In short:
Stratified cross-validation with several metrics gives a steadier picture of model quality than one train/test split.



In [12]:
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold

# Model 
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

# Set up 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Performing cross-validation
cv_results = cross_validate(rf, X, y, cv=cv, scoring=scoring, return_train_score=True)

# Mean of Results
print("Train Accuracy:", cv_results['train_accuracy'].mean())
print("Test Accuracy:", cv_results['test_accuracy'].mean())
print("Test Precision:", cv_results['test_precision'].mean())
print("Test Recall:", cv_results['test_recall'].mean())
print("Test F1:", cv_results['test_f1'].mean())


Train Accuracy: 1.0
Test Accuracy: 0.9874
Test Precision: 0.9818198043835226
Test Recall: 0.8854166666666667
Test F1: 0.9308681374045866
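
The cross-validation above runs on features that were standardized once on the full dataset. If scaling should instead be refit inside each fold, a `Pipeline` is one way to do it. The sketch below is a minimal illustration on synthetic data; the dataset, the step names `"scaler"` and `"model"`, and the smaller forest size are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Unscaled synthetic stand-in for the loan features (assumption).
X_raw, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# The scaler is fit on each fold's training rows only, then applied to that fold's test rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(pipe, X_raw, y_demo, cv=cv_demo, scoring=metric_names)

for name in metric_names:
    print(name, results[f"test_{name}"].mean())
```

Tree-based models are largely insensitive to feature scaling, so the pipeline matters more for models such as logistic regression; the pattern is the same either way.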


#### What Is Cross-Validation?
Cross-validation estimates how well a machine-learning model will perform on new, unseen data.

It guards against:

**1. Undetected overfitting** (here, the gap between a train accuracy of 1.0 and a test accuracy of 0.9874 is visible only because both are measured)

**2. Getting lucky with one train/test split**

**3. Reporting a misleadingly high accuracy score**
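
The "lucky split" problem can be made concrete with a small experiment: score the same model on several different single splits, then compare the spread with a 5-fold estimate. This is a sketch on synthetic data, not the notebook's dataset; the seeds, the sample size, and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
model = LogisticRegression(max_iter=1000)

# Ten different single train/test splits: only the split seed changes.
single_split_f1 = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    model.fit(X_tr, y_tr)
    single_split_f1.append(f1_score(y_te, model.predict(X_te)))

# One stratified 5-fold estimate over the whole dataset.
cv_f1 = cross_val_score(
    model, X_demo, y_demo,
    cv=StratifiedKFold(5, shuffle=True, random_state=0), scoring="f1",
)

print("single-split F1 range:", min(single_split_f1), "to", max(single_split_f1))
print("5-fold mean F1:", cv_f1.mean(), "+/-", cv_f1.std())
```

The range across single splits shows how much a reported score can depend on which rows happen to land in the test set.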