
# Project Code: PRCP-1003

---


# Domain: Banking
# Problem Type: Binary Classification


# **Introduction**

Banks generate a massive amount of transactional data daily. Predicting whether a customer will perform a transaction in the future is crucial for improving customer engagement, optimizing marketing campaigns, and enhancing revenue generation.

This project aims to build a predictive machine learning model that identifies customers who are likely to make a transaction in the future, regardless of the transaction amount. The dataset provided is anonymized and structured to resemble real-world banking data.

In [None]:
# importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ML Models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# for ignoring warnings
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Load the training data (the path assumes the CSV was uploaded to the Colab session)
df = pd.read_csv("/content/train(1).csv")

df.head()


Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


# **Dataset Description**

The dataset consists of:

- ID_code – Unique customer identifier
- 200 anonymized numerical features (var_0 to var_199)
- Target variable (target):
  - 0 → Customer will not make a transaction
  - 1 → Customer will make a transaction

Since feature names are anonymized, domain-specific interpretation of features is not possible.

Dataset Overview

In [None]:
print("Dataset Shape:", df.shape)
print("\nTarget Distribution:")
print(df['target'].value_counts())

df.info()


Dataset Shape: (200000, 202)

Target Distribution:
target
0    179902
1     20098
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB


In [None]:
df.isnull().sum().sum()


np.int64(0)

# **Data Analysis Approach**
Why EDA Was Limited

- Feature names and meanings are hidden
- No categorical variables
- No business context for feature interpretation

Therefore, the analysis focused on:

- Data quality checks
- Statistical summaries
- Target class distribution

As a result, insights come from model behavior rather than from assumptions about what individual features mean.
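The checks listed above can be sketched in a few lines of pandas. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual analysis; the helper name `quality_report` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame, target: str) -> dict:
    """Summarize basic data-quality facts for a numeric feature table (hypothetical helper)."""
    features = frame.drop(columns=[target])
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns with a single distinct value carry no signal for a model.
        "constant_columns": [c for c in features.columns if features[c].nunique() <= 1],
        # Share of each class in the target column.
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }


# Tiny synthetic frame standing in for the real training data (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "var_0": rng.normal(size=10),
    "var_1": np.ones(10),       # deliberately constant
    "target": [0] * 9 + [1],    # 10% positives, similar to the real data
})

report = quality_report(demo, target="target")
print(report)
```

On the real data the same call would be `quality_report(df.drop(columns=['ID_code']), target='target')`; dropping `ID_code` first keeps the unique identifier from masking duplicate feature rows.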

# **Feature & Target Separation**

In [None]:
X = df.drop(columns=['ID_code', 'target'])
y = df['target']


Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
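The `stratify=y` argument keeps the class ratio the same in both splits, which matters when only about 10% of rows are positive. The sketch below checks this on synthetic labels; the array sizes and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 customers, 10% positives (assumed for illustration).
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the original 10% positive rate.
train_rate = y_tr_demo.mean()
test_rate = y_te_demo.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

The same effect is visible in the real split: the test set holds 35,980 negatives and 4,020 positives, which is 20% of each class.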


Feature Scaling (Important for Logistic Regression)

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
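The scaler is fitted on the training split only, and the test split is transformed with those training statistics, so no information from the test set leaks into preprocessing. A minimal sketch on synthetic data (all `_demo` names and distributions are assumptions) shows that `transform` is equivalent to subtracting the training mean and dividing by the training standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=10.0, scale=3.0, size=(800, 4))  # synthetic "train"
X_te_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))  # synthetic "test"

scaler_demo = StandardScaler().fit(X_tr_demo)   # statistics come from train only
X_tr_demo_scaled = scaler_demo.transform(X_tr_demo)
X_te_demo_scaled = scaler_demo.transform(X_te_demo)

# Equivalent manual computation using the training mean and std (ddof=0).
mu, sigma = X_tr_demo.mean(axis=0), X_tr_demo.std(axis=0)
manual_te = (X_te_demo - mu) / sigma

print(np.allclose(X_te_demo_scaled, manual_te))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the scaled test columns are only approximately standardized, which is expected.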


# **Model Building & Evaluation**

# **Model Selection Strategy**

Multiple models were trained to compare performance and robustness.

Models Considered:

- Logistic Regression
- Decision Tree Classifier
- Gaussian Naive Bayes
- XGBoost Classifier

Comparing several model families on the same held-out test set allows the final model to be chosen on measured performance rather than on assumptions.

In [None]:
# 1. Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)

y_pred_lr = lr.predict(X_test_scaled)
y_prob_lr = lr.predict_proba(X_test_scaled)[:,1]

print("Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("ROC AUC:", roc_auc_score(y_test, y_prob_lr))
print(classification_report(y_test, y_pred_lr))


Logistic Regression
Accuracy: 0.9134
ROC AUC: 0.8598618773835104
              precision    recall  f1-score   support

           0       0.92      0.99      0.95     35980
           1       0.68      0.26      0.38      4020

    accuracy                           0.91     40000
   macro avg       0.80      0.62      0.66     40000
weighted avg       0.90      0.91      0.90     40000
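To read the report above, it helps to recompute the class 1 F1-score from its precision and recall. The sketch below uses the rounded values printed in the report, so the derived true-positive count is only approximate; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 values from the Logistic Regression report (rounded to 2 decimals).
precision_1, recall_1, support_1 = 0.68, 0.26, 4020

f1_1 = f1_from(precision_1, recall_1)

# recall = TP / (TP + FN), and TP + FN equals the class support.
approx_true_positives = recall_1 * support_1

print(round(f1_1, 2))
print(round(approx_true_positives))
```

In other words, at the default 0.5 threshold the model finds roughly a quarter of the 4,020 customers who actually transact, which is why the high overall accuracy is misleading on its own.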



# **Logistic Regression**

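Why chosen:

- Simple baseline model
- Highly interpretable
- Fast to train

Drawbacks:

- Assumes a linear relationship between the features and the log-odds of the target
- Cannot capture non-linear patterns or feature interactions unless they are engineered by hand
- Can overfit in high-dimensional settings without regularization (scikit-learn applies L2 regularization by default)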

In [None]:
# 2. Decision Tree (DecisionTreeClassifier is already imported above)

dt = DecisionTreeClassifier(
    max_depth=10,              # LIMIT DEPTH
    min_samples_split=50,      # PREVENT OVER-SPLITTING
    min_samples_leaf=25,       # CONTROL LEAF SIZE
    random_state=42
)

dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
y_prob_dt = dt.predict_proba(X_test)[:,1]

print("Decision Tree (Optimized)")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("ROC AUC:", roc_auc_score(y_test, y_prob_dt))


Decision Tree (Optimized)
Accuracy: 0.89355
ROC AUC: 0.6591239052099149


In [None]:
# 3. Naive Bayes
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)

y_pred_nb = nb.predict(X_test_scaled)
y_prob_nb = nb.predict_proba(X_test_scaled)[:,1]

print("Naive Bayes")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("ROC AUC:", roc_auc_score(y_test, y_prob_nb))


Naive Bayes
Accuracy: 0.92015
ROC AUC: 0.8882247600242257


In [None]:
# 4. XGBoost
xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42
)

xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)
y_prob_xgb = xgb.predict_proba(X_test)[:,1]

print("XGBoost")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("ROC AUC:", roc_auc_score(y_test, y_prob_xgb))
print(classification_report(y_test, y_pred_xgb))


XGBoost
Accuracy: 0.906225
ROC AUC: 0.8618936791860596
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     35980
           1       0.87      0.08      0.14      4020

    accuracy                           0.91     40000
   macro avg       0.89      0.54      0.55     40000
weighted avg       0.90      0.91      0.87     40000



# **Drawbacks of the Models Used**

Naive Bayes has the drawback of assuming independence between features, which is often unrealistic in real-world datasets. This assumption can limit model performance when strong correlations exist among features. Decision Trees, while more expressive, are prone to overfitting, especially when trained on high-dimensional data. Small variations in the data can lead to significantly different tree structures, affecting model stability and generalization.
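The independence assumption is easiest to see in code: the class log-likelihood is a plain sum of per-feature Gaussian log-densities, with no term that couples two features. The following is a minimal NumPy sketch on synthetic data, not scikit-learn's implementation (which also adds a small variance-smoothing term); the function name `gaussian_nb_posteriors` is hypothetical.

```python
import numpy as np


def gaussian_nb_posteriors(X_train, y_train, X_new):
    """Class posteriors under the Gaussian Naive Bayes independence assumption."""
    classes = np.unique(y_train)
    log_joint = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_train))
        # Independence: the log-likelihood is a SUM over features,
        # so correlations between features are ignored entirely.
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var).sum(axis=1)
        log_joint.append(log_prior + log_lik)
    log_joint = np.column_stack(log_joint)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    probs = np.exp(log_joint)
    return classes, probs / probs.sum(axis=1, keepdims=True)


# Synthetic, imbalanced two-class data (assumption for illustration).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(900, 3)),   # class 0
    rng.normal(2.0, 1.0, size=(100, 3)),   # class 1
])
y_demo = np.array([0] * 900 + [1] * 100)

classes, posteriors = gaussian_nb_posteriors(X_demo, y_demo, X_demo[:5])
print(classes, posteriors.round(3))
```

When features really are close to independent, this simplification costs little, which is one plausible reading of why Naive Bayes scored well in the results reported here.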

# **XGBoost**

Why chosen:

- Excellent performance on structured/tabular data
- Handles high-dimensional datasets efficiently
- Built-in regularization
- Better generalization

Drawbacks:

- More complex to tune
- Requires more computational resources
- Less interpretable compared to linear models

# **Model Comparison Table**

In [None]:
model_results = pd.DataFrame({
    "Model": ["Logistic Regression", "Decision Tree", "Naive Bayes", "XGBoost"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_nb),
        accuracy_score(y_test, y_pred_xgb)
    ],
    "ROC AUC": [
        roc_auc_score(y_test, y_prob_lr),
        roc_auc_score(y_test, y_prob_dt),
        roc_auc_score(y_test, y_prob_nb),
        roc_auc_score(y_test, y_prob_xgb)
    ]
})

model_results


Unnamed: 0,Model,Accuracy,ROC AUC
0,Logistic Regression,0.9134,0.859862
1,Decision Tree,0.89355,0.659124
2,Naive Bayes,0.92015,0.888225
3,XGBoost,0.906225,0.861894
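To make the table easier to read, the reported metrics can be ranked by ROC AUC and compared against the accuracy of always predicting class 0 on the test set (35,980 of 40,000). This sketch copies the numbers from the table above rather than recomputing them; the column name `Beats baseline accuracy` is introduced here for illustration.

```python
import pandas as pd

# Metrics copied from the comparison table above.
reported = pd.DataFrame({
    "Model": ["Logistic Regression", "Decision Tree", "Naive Bayes", "XGBoost"],
    "Accuracy": [0.9134, 0.89355, 0.92015, 0.906225],
    "ROC AUC": [0.859862, 0.659124, 0.888225, 0.861894],
})

# Accuracy of a trivial model that always predicts class 0 on the test set.
majority_baseline = 35980 / 40000

ranked = reported.sort_values("ROC AUC", ascending=False).reset_index(drop=True)
ranked["Beats baseline accuracy"] = ranked["Accuracy"] > majority_baseline
print(ranked)
```

Ranked this way, Naive Bayes comes first, XGBoost and Logistic Regression are nearly tied behind it, and the Decision Tree is last and falls below the majority-class accuracy baseline.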


# **Model Evaluation and Comparison**

All four models were evaluated on the held-out test set using accuracy and ROC AUC; precision, recall, and F1-score were additionally reported for Logistic Regression and XGBoost. Accuracy alone is not sufficient here because of class imbalance: about 90% of test customers belong to class 0 (35,980 of 40,000), so a model that always predicts 0 would already score roughly 0.90. Naive Bayes achieved the highest accuracy (0.920) and the highest ROC AUC (0.888), despite its feature-independence assumption. XGBoost (ROC AUC 0.862) and Logistic Regression (ROC AUC 0.860) followed closely, but at the default 0.5 threshold both missed most transaction customers, with class 1 recall of 0.08 and 0.26 respectively. The Decision Tree performed worst: its accuracy of 0.894 is below the majority-class baseline, and its ROC AUC was only 0.659. Based on these metrics, Naive Bayes was the best-performing model on this split.
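The gap between accuracy and ROC AUC under class imbalance can be illustrated with a trivial "model" that ignores its input. This is a sketch on synthetic labels with a 90/10 split, assumed to mirror the imbalance in the real test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic test labels with a 90/10 class imbalance (assumption).
y_true = np.array([0] * 900 + [1] * 100)

# A "model" that always predicts class 0 and gives every row the same score.
always_zero_pred = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, always_zero_pred)
auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy={acc:.2f}, ROC AUC={auc:.2f}")
```

The trivial model reaches 0.90 accuracy while its ROC AUC stays at 0.50, the value for a ranking that carries no information. ROC AUC is therefore the more informative headline metric for this dataset.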

# **Challenges Faced & Solutions**

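Challenges Faced:
1. Anonymized features prevented domain-based EDA.
2. High dimensionality (200 features).
3. Class imbalance: only about 10% of customers belong to class 1 (20,098 of 200,000).

Solutions:
- Limited EDA to data quality checks and the target class distribution, as instructed.
- Compared linear, probabilistic, tree-based, and boosting models.
- Used ROC-AUC instead of accuracy alone.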


# **Future Enhancements**

The current solution can be improved further by:

- Hyperparameter tuning using GridSearchCV or Bayesian Optimization
- Feature selection or dimensionality reduction (PCA)
- Cost-sensitive learning to penalize false negatives
- Explainability tools like SHAP or LIME
- Deployment as an API for real-time predictions
- Monitoring model drift with new customer data
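As one example from this list, cost-sensitive learning can be tried with scikit-learn's `class_weight="balanced"` option, which weights errors inversely to class frequency. The sketch below uses synthetic data with overlapping classes (an assumption, not the bank dataset) and compares class 1 recall with and without weighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 9,000 negatives and 1,000 positives with overlap.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(9000, 5)),
    rng.normal(0.7, 1.0, size=(1000, 5)),
])
y_demo = np.array([0] * 9000 + [1] * 1000)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

plain = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)
# class_weight="balanced" makes a missed positive cost more than a false alarm.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr_demo, y_tr_demo)

recall_plain = recall_score(y_te_demo, plain.predict(X_te_demo))
recall_weighted = recall_score(y_te_demo, weighted.predict(X_te_demo))
print(f"recall (plain): {recall_plain:.2f}, recall (balanced): {recall_weighted:.2f}")
```

Higher recall usually comes at the cost of lower precision, so the weighting should be chosen against the business cost of each error type. XGBoost exposes a comparable `scale_pos_weight` parameter; for this dataset the negative-to-positive ratio is roughly 9.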

# **Conclusion**

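This project developed machine learning models to predict future customer transactions. Despite anonymized features and class imbalance, the best model (Naive Bayes) reached a test accuracy of about 0.92 and a ROC AUC of about 0.89, against a majority-class baseline accuracy of roughly 0.90.

Among the tested models, Naive Bayes achieved the best accuracy and ROC AUC on the held-out test set. XGBoost, run here with a single untuned configuration, reached a comparable ROC AUC of 0.862 but a class 1 recall of only 0.08 at the default threshold. It remains a strong candidate for production because of its scalability and built-in regularization, provided that tuning and imbalance handling close that gap. With those improvements, the solution can help banks with customer targeting, retention strategies, and revenue optimization.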