# Term Deposit Subscription Prediction
### Problem Statement:
Predict whether a bank customer will subscribe to a term deposit as a result of a direct marketing campaign, based on various customer attributes and historical campaign data.

### Objective:
The primary objective is to build and evaluate classification models (e.g., Logistic Regression, Random Forest) to accurately predict term deposit subscriptions. This involves loading and exploring the dataset, preprocessing features, performing exploratory data analysis, training models, evaluating their performance using metrics like Confusion Matrix, F1-Score, ROC Curve, and Accuracy, and gaining insights into customer behavior.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_auc_score, roc_curve

import shap
import warnings
warnings.filterwarnings('ignore')


## Load and clean the dataset

In [None]:
df = pd.read_csv("../data/bank.csv")
# Remove leading/trailing spaces from column names before inspecting the data
df.columns = df.columns.str.strip()
print("Top 5 rows of dataset\n")
print(df.head())
print("\nDataset Description\n")
print(df.describe())


**Insights:**
- Features include demographics (age, job, education), financial data (balance, loan), and call info (contact, duration, poutcome).

- Target column y is binary: 0 = no subscription, 1 = yes subscription.

- Many categorical fields (e.g., job, contact, month) need encoding for modeling.
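
As a small illustration of the encoding step mentioned above, the sketch below one-hot encodes a toy frame with `pd.get_dummies`. The rows are made up for illustration and are not taken from `bank.csv`; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame; values are illustrative, not rows from bank.csv
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["admin", "technician", "admin"],
    "contact": ["cellular", "unknown", "cellular"],
})

# drop_first=True drops one level per categorical column,
# so each two-level column here becomes a single indicator column
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(toy_encoded.columns.tolist())
```

Numeric columns pass through unchanged, while each categorical column is replaced by indicator columns named `<column>_<level>`.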

## Clean and preprocess the target column

In [None]:

print("Unique values before conversion:", df['y'].unique())

# Convert string '0'/'1' to integer 0/1
df['y'] = df['y'].astype(int)

print("Unique values after conversion:", df['y'].unique())

## Exploratory Data Analysis (EDA)

In [None]:
print("\nTarget class distribution:")
print(df['y'].value_counts(normalize=True))

# Visualize target distribution
sns.countplot(x='y', data=df)
plt.title("Target Class Distribution")
plt.show()



**Insights:**
- Class 0 (No subscription): 88.48%

- Class 1 (Yes subscription): 11.52%

**Data is highly imbalanced.**
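
To see why accuracy alone is misleading at this class ratio, the sketch below scores a majority-class baseline. It uses synthetic labels with an assumed 885/115 split that approximates the proportions above; it is not run on the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels approximating the reported class split (assumed 1000 rows)
y_toy = np.array([0] * 885 + [1] * 115)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Always predict the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_hat)
baseline_recall = recall_score(y_toy, y_hat)
print(f"accuracy={baseline_accuracy:.3f}, recall for class 1={baseline_recall:.3f}")
```

A model that never predicts a subscription already reaches about 88% accuracy while finding no subscribers, which is why the later sections look at precision, recall, and F1 for class 1.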

In [None]:
# Correlation heatmap for numeric features
plt.figure(figsize=(12, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

**Insights:**
- duration (0.40) has the strongest linear correlation with the target: longer calls are associated with more subscriptions.

- pdays (0.10) and previous (0.12) show weak positive correlations, suggesting that past contact matters somewhat.

- balance, age, and campaign have low correlation with the target.

- pdays and previous are moderately correlated with each other (0.58), which points to possible multicollinearity.

**Focus on duration, previous, and pdays for predictive power.**
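
The heatmap values can also be read off as a ranked list. The following is a minimal sketch on synthetic stand-in data: the column names mirror the notebook, but the values are random and the toy target is constructed to depend on `duration`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data; values are random, not from bank.csv
toy = pd.DataFrame({
    "duration": rng.normal(250, 100, n),
    "pdays": rng.integers(-1, 300, n),
    "balance": rng.normal(1500, 3000, n),
})
# Make the toy target depend on duration so one column stands out
toy["y"] = (toy["duration"] + rng.normal(0, 100, n) > 350).astype(int)

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["y"]
    .drop("y")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a tree-based model.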

## Preprocessing

In [None]:

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate features and target
X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

## Train-test split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
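
The `stratify=y` argument above keeps the class ratio roughly equal in both splits. The sketch below checks this on synthetic labels with an assumed 885/115 split rather than on the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels approximating the class split of the dataset
y_toy = np.array([0] * 885 + [1] * 115)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Share of positives in each split should stay close to 0.115
print(f"train positives: {y_tr.mean():.3f}, test positives: {y_te.mean():.3f}")
```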

## Feature Scaling with column names preserved

In [None]:

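scaler = StandardScaler()
# Fit on the training split only; keep column names and the original row index
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)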


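- `StandardScaler` standardizes each feature by removing the mean and scaling to unit variance.
- Fitting on `X_train` and reusing the same transform on `X_test` puts both splits on the same scale, which helps models like Logistic Regression converge and perform better.
- The model cells below are fitted on the unscaled `X_train`; pass `X_train_scaled` and `X_test_scaled` instead to apply the scaling.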
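
One way to make sure the scaler is applied consistently at fit and predict time is to bundle it with the estimator in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the bank features; the sample size and class weights are assumptions chosen to resemble the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only and
# reuses the fitted transform whenever predictions are requested
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
scaled_lr.fit(X_tr, y_tr)
test_proba = scaled_lr.predict_proba(X_te)[:, 1]
print(test_proba[:5])
```

With this pattern there is no separate scaled copy of the data to keep in sync with the model.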

## Class Imbalance Handling: Class Weights and SMOTE

## Model Training

#### Logistic Regression with `class_weight='balanced'`

In [None]:

lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_balanced.fit(X_train, y_train)
y_pred_lr_bal = lr_balanced.predict(X_test)
print("Logistic Regression (Balanced)")
print(classification_report(y_test, y_pred_lr_bal))


**Insights:**
- Accuracy: 82%

- High recall (0.78) for class 1 – detects positives well.

- Low precision (0.37) – many false positives.

- F1-score for class 1: 0.50 – decent balance for imbalanced data.
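
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula on synthetic labels with an assumed 885/115 split and compares it with scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels approximating the class split of the dataset
y_toy = np.array([0] * 885 + [1] * 115)
classes = np.unique(y_toy)

# 'balanced' weights: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
```

At this ratio a misclassified subscriber counts roughly 7.7 times as much as a misclassified non-subscriber, which is consistent with the high recall and low precision seen above.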


#### Random Forest with `class_weight='balanced'`

In [None]:

rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_balanced.fit(X_train, y_train)
y_pred_rf_bal = rf_balanced.predict(X_test)
print("Random Forest (Balanced)")
print(classification_report(y_test, y_pred_rf_bal))


**Insights:**
- Accuracy: 89%

- High precision (0.61) but very low recall (0.18) for class 1.

- F1-score for class 1: 0.28 – poor at detecting positives despite high accuracy.

- Biased toward majority class (0).

#### SMOTE Oversampling and Random Forest
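SMOTE creates synthetic minority-class samples by interpolating between existing ones. Here it is applied to the training data only, and a Random Forest is then trained on the resampled set.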

In [None]:

from imblearn.over_sampling import SMOTE

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train Random Forest on SMOTE data
rf_smote = RandomForestClassifier(random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
y_pred_rf_smote = rf_smote.predict(X_test)

print("Random Forest with SMOTE Oversampling")
print(classification_report(y_test, y_pred_rf_smote))


**Insights:**
- Accuracy: 88%

- Improved recall (0.34) for class 1, at the cost of lower precision (0.49).

- F1-score for class 1: 0.40 – better than the class-weighted Random Forest (0.28).

- More balanced performance across precision and recall.
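
To make the oversampling idea concrete, here is a simplified sketch of SMOTE-style interpolation. It is not the `imblearn` implementation: the helper `smote_like_samples` is hypothetical and written only for illustration, and the minority points are random.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_minority), size=n_new)            # random minority rows
    picked = neighbour_idx[base, rng.integers(0, k, size=n_new)]   # one neighbour per row
    gap = rng.random((n_new, 1))                                   # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))        # toy minority-class points
X_new = smote_like_samples(X_min, n_new=50)
print(X_new.shape)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, so the new points stay inside the region the minority class already occupies.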

## Model Evaluation

##### Logistic Regression Model Evaluation

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, classification_report, confusion_matrix, roc_curve
import seaborn as sns
import matplotlib.pyplot as plt

# Predictions and probabilities
log_preds = lr_balanced.predict(X_test)
log_proba = lr_balanced.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("\n--- Logistic Regression (Balanced) Evaluation ---")
print("F1 Score:", f1_score(y_test, log_preds))
print("ROC AUC Score:", roc_auc_score(y_test, log_proba))
print("Classification Report:\n", classification_report(y_test, log_preds))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, log_preds), annot=True, fmt='d', cmap="YlGnBu")
plt.title("Logistic Regression (Balanced) - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC Curve
fpr_log, tpr_log, _ = roc_curve(y_test, log_proba)
plt.plot(fpr_log, tpr_log, label=f"Logistic Regression AUC = {roc_auc_score(y_test, log_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.title("ROC Curve - Logistic Regression (Balanced)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


**Insights:**
Overall ranking ability is decent **(AUC ≈ 0.89)**, but the model's positive predictions are imprecise:

- Class 1 (Subscribed) — precision 0.37, recall 0.78, F1-score 0.50

- High number of false positives (137) compared to true subscribers correctly identified (81)

- Class 0 (Not Subscribed) detection is strong: precision 0.93, recall 0.70, F1-score 0.80

**Usage:**
- Use this model when your priority is to miss as few interested customers as possible.

- The cost is that some marketing offers go to people who won’t subscribe.

- Best when you have the resources to contact many people (e.g., mass email or calls).


**Logistic Regression**
- **While excellent at predicting Class 0, it has poor precision for Class 1 (many false positives).**
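
The false-positive and true-positive counts quoted above come from the confusion matrix. The sketch below shows one way to unpack the four cells and derive precision and recall from them, using a tiny hand-made label vector rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Tiny illustrative labels (not model output from this notebook)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted subscribers that are real
recall = tp / (tp + fn)     # share of real subscribers that were found
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```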

##### Random Forest Model Evaluation

In [None]:
# Predictions and probabilities
rf_preds = rf_balanced.predict(X_test)
rf_proba = rf_balanced.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("\n--- Random Forest (Balanced) Evaluation ---")
print("F1 Score:", f1_score(y_test, rf_preds))
print("ROC AUC Score:", roc_auc_score(y_test, rf_proba))
print("Classification Report:\n", classification_report(y_test, rf_preds))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, rf_preds), annot=True, fmt='d', cmap="YlGnBu")
plt.title("Random Forest (Balanced) - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC Curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_proba)
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest AUC = {roc_auc_score(y_test, rf_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.title("ROC Curve - Random Forest (Balanced)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


**Insights:**
Best overall performance: highest AUC (≈ 0.90) and weighted F1-score (≈ 0.94)

- Class 1 (Subscribed) — precision 0.61, recall 0.18, F1-score 0.28

- Very few false positives (12), but many missed subscribers (false negatives: 85)

- Class 0 (Not Subscribed) detection is excellent: precision 0.98, recall 0.99, F1-score 0.99

**Usage:**
- Use this model when you want to be very confident before reaching out.

- It minimizes false offers, so you don’t waste marketing budget or annoy customers.

- Best when you can only afford to target a few people and want them to be very likely to subscribe.

**Random Forest (Balanced)**
- **Excellent for Class 0 but very poor recall for Class 1 (misses most positive cases).**
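
As a quick arithmetic check, the class 1 metrics can be recomputed from the counts reported in this notebook: 12 false positives and 85 false negatives above, and the 19 correctly identified subscribers quoted in the model conclusion.

```python
# Class 1 counts as reported in this notebook for Random Forest (Balanced)
tp, fp, fn = 19, 12, 85

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# prints: precision=0.61, recall=0.18, f1=0.28
```

The recomputed values match the classification report, and they show how a handful of confident positive predictions can give respectable precision while most subscribers are still missed.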

##### Random Forest (SMOTE) Evaluation

In [None]:
# Predictions and probabilities
rf_preds = rf_smote.predict(X_test)
rf_proba = rf_smote.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("\n--- Random Forest (SMOTE) Evaluation ---")
print("F1 Score:", f1_score(y_test, rf_preds))
print("ROC AUC Score:", roc_auc_score(y_test, rf_proba))
print("Classification Report:\n", classification_report(y_test, rf_preds))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, rf_preds), annot=True, fmt='d', cmap="YlGnBu")
plt.title("Random Forest (SMOTE) - Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# ROC Curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_proba)
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (SMOTE) AUC = {roc_auc_score(y_test, rf_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.title("ROC Curve - Random Forest (SMOTE)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


**Insights:**
Better balance: boosts detection of actual subscribers with some trade-offs:

- Class 1 (Subscribed) — precision 0.49, recall 0.34, F1-score 0.40

- Recall improvement (18% → 34%) means it finds more real "yes" customers

- More false positives than before (36 vs 12)

- AUC slightly drops to ≈ 0.89, weighted F1 remains solid at ≈ 0.87

**Usage:**
- This model gives a better balance between reaching real subscribers and limiting mistakes.

- Use it if you want to increase conversions, even if a few extra non-subscribers are contacted.

- Good for mid-sized campaigns: you want reach + reasonable accuracy.

**Random Forest (SMOTE)**
- **Good performance for both classes with the most balanced results.**

## Model Conclusion

After testing all of these models, we conclude that
**Random Forest (with SMOTE)** best fits our problem requirements.

- It gives a good balance between recall (finding real subscribers) and precision (not guessing too many wrong ones).

- It finds 35 actual subscribers (vs only 19 in plain Random Forest).

- Yes, it makes some wrong guesses, but not too many (36 false positives).

- F1-score and AUC remain reasonable (F1 of 0.40 for "yes", and AUC ≈ 0.89).

**Use Random Forest + SMOTE to:**

- Target more potential subscribers.

- Accept a small number of extra offers sent to uninterested customers — worth it if the term deposit is valuable.
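
The class 1 metrics reported in the evaluation sections can be collected into one table for a side-by-side view. This is a minimal sketch; the `precision_recall_gap` column is an illustrative measure of balance added here, not a metric used elsewhere in the notebook.

```python
import pandas as pd

# Class 1 metrics as reported in the evaluation sections of this notebook
summary = pd.DataFrame(
    {
        "precision": [0.37, 0.61, 0.49],
        "recall": [0.78, 0.18, 0.34],
        "f1": [0.50, 0.28, 0.40],
    },
    index=[
        "Logistic Regression (Balanced)",
        "Random Forest (Balanced)",
        "Random Forest (SMOTE)",
    ],
)

# Illustrative balance measure: how far apart precision and recall are
summary["precision_recall_gap"] = (summary["precision"] - summary["recall"]).abs()
print(summary.sort_values("precision_recall_gap"))
```

Sorting by a different column (for example `recall`) gives the ranking that matters when the campaign priority changes.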

## Model Explainability with SHAP
To understand the influence of each feature on the predictions made by the best-performing model (Random Forest with SMOTE), we use SHAP (SHapley Additive exPlanations). This allows us to explain individual and global model behaviors.

## Random Forest + SMOTE

In [None]:
import shap
import numpy as np

# Initialize SHAP TreeExplainer
explainer = shap.TreeExplainer(rf_smote)
shap_values = explainer.shap_values(X_test)

# Older SHAP releases return one array per class (a list); newer ones may return
# a single (samples, features, classes) array. Select class 1 in either case.
shap_values_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Get predictions and probabilities
y_probs = rf_smote.predict_proba(X_test)[:, 1]
y_preds = rf_smote.predict(X_test)

# Explain 5 individual predictions
print("SHAP Explanation for 5 individual predictions:")

for i in range(5):
    pred_class = y_preds[i]
    actual_class = y_test.iloc[i]
    pred_prob = y_probs[i]

    print(f"\n🔹 Prediction {i+1} (Index {i})")
    print(f"Predicted class: {pred_class} ({'subscription' if pred_class == 1 else 'no subscription'})")
    print(f"Actual class: {actual_class}")
    print(f"Predicted probability of subscription: {pred_prob:.2f}")

    # With matplotlib=True the plot is drawn directly, so display() is not needed
    shap.force_plot(
        explainer.expected_value[1],
        shap_values_pos[i],
        X_test.iloc[i],
        matplotlib=True
    )


In [None]:
import shap
import numpy as np

# Initialize SHAP TreeExplainer
explainer = shap.TreeExplainer(rf_smote)
shap_values = explainer.shap_values(X_test)

# Older SHAP releases return one array per class (a list); newer ones may return
# a single (samples, features, classes) array. Select class 1 in either case.
shap_values_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

# Get model predictions and probabilities
y_probs = rf_smote.predict_proba(X_test)[:, 1]
y_preds = rf_smote.predict(X_test)

# Select indices where model predicts class 1 (subscription)
class_1_indices = np.where(y_preds == 1)[0]

print("SHAP Explanation for 5 instances predicted as class 1 (subscription):")

# Loop through first 5 predicted class 1 examples
for i, idx in enumerate(class_1_indices[:5]):
    pred_class = y_preds[idx]
    actual_class = y_test.iloc[idx]
    pred_prob = y_probs[idx]

    print(f"\n🔹 Prediction {i+1} (Index {idx})")
    print(f"Predicted class: {pred_class} (subscription)")
    print(f"Actual class: {actual_class}")
    print(f"Predicted probability of subscription: {pred_prob:.2f}")

    # With matplotlib=True the plot is drawn directly, so display() is not needed
    shap.force_plot(
        explainer.expected_value[1],
        shap_values_pos[idx],
        X_test.iloc[idx],
        matplotlib=True
    )


**Insights:**

Based on these prediction results:
- **Subscription** is influenced by **long call duration, higher education, and positive past campaign outcomes**.

- **Non-subscription** is driven by **short call duration, unknown contact methods, and contact during less effective months like May or July**.

**Random Forest with SMOTE** handles the class imbalance better than the class-weighted Random Forest while still predicting non-subscriptions accurately.

The model remains cautious and still misses many "Yes" cases (recall 0.34).

Overall, it fits the task well by balancing the two error types and offering interpretable predictions.
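
Force plots explain one prediction at a time. For a global view, a common summary is the mean absolute SHAP value per feature. The sketch below assumes a 2-D array of class 1 SHAP values (rows are samples, columns are features) such as the one the explainer produces; the numbers and feature names here are hypothetical placeholders, not output from `rf_smote`.

```python
import numpy as np
import pandas as pd

# Hypothetical class 1 SHAP values; in the notebook these would come
# from the TreeExplainer output for X_test
feature_names = ["duration", "poutcome_success", "contact_unknown"]
shap_pos = np.array([
    [0.20, 0.05, -0.03],
    [-0.10, 0.00, -0.04],
    [0.30, 0.10, 0.02],
])

# Mean absolute contribution per feature, largest first
global_importance = (
    pd.Series(np.abs(shap_pos).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)
print(global_importance)
```

Taking the absolute value first keeps positive and negative contributions from cancelling out, so the ranking reflects how strongly a feature moves predictions in either direction.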

## Conclusion
- The objective was to predict whether a customer will subscribe to a term deposit using bank marketing data.

- During analysis, we identified that the dataset was imbalanced, with significantly fewer "Yes" (subscription) outcomes compared to "No".

- To address this, we applied class weight balancing and SMOTE (Synthetic Minority Oversampling Technique) to improve detection of the minority class.

- Among the tested models, **Random Forest with SMOTE** performed best, giving balanced results for both classes while maintaining good accuracy.

- Although it made some incorrect predictions, especially on borderline cases, it offered a better balance between precision and recall than Logistic Regression and Random Forest with class weights.

- Model explanations using SHAP revealed that:

  - Subscription is influenced by **long call duration, higher education, and favorable past outcomes**.

  - Non-subscription is associated with **short calls, unknown contact methods, and less effective months like May or July**.

Overall, the **Random Forest + SMOTE** combination is effective and interpretable, making it well-suited for this prediction task.