In [1]:
 Introduction
This project, Bank Marketing Campaign Analysis, analyzes the Bank Marketing dataset from the UCI Machine Learning Repository. The dataset includes information on direct telemarketing campaigns conducted by a Portuguese banking institution to promote term deposit subscriptions.

Objectives:

Explore customer behavior and identify key factors influencing their decision to subscribe.

Preprocess and visualize the data for insights and model readiness.

Build and evaluate various machine learning models to predict whether a client will subscribe to a term deposit.

Determine feature importance to guide more effective future marketing strategies.

SyntaxError: invalid syntax (1265455369.py, line 2)

In [2]:
# Step 1: Import Libraries and Load Data
# This cell imports all the necessary Python libraries and loads the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, ConfusionMatrixDisplay
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv("bank-full.csv", sep=';')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'bank-full.csv'

In [None]:
# Step 2: Initial Data Exploration
print("Dataset shape:", df.shape)
df.info()
df.describe()

In [None]:
# Step 3: Check for Class Imbalance
sns.countplot(x='y', data=df)
plt.title("Class Distribution of Target Variable")
plt.show()

In [None]:
# Step 4: Remove Duplicates
print("Duplicates before:", df.duplicated().sum())
df = df.drop_duplicates()
print("Duplicates after:", df.duplicated().sum())

In [None]:
# Step 5: Missing Values
print("Missing values:\n", df.isnull().sum())

In [None]:
# Step 6: Visualizing Numerical Features
num_cols = df.select_dtypes(include='int64').columns
for col in num_cols:
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()

In [None]:
# Step 7: Visualizing Categorical Features
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f"Countplot of {col}")
    plt.show()

In [None]:
# Step 8: Outlier Treatment (e.g., Age, Duration)
df = df[df['age'] < 100]
Q1 = df['duration'].quantile(0.25)
Q3 = df['duration'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['duration'] >= Q1 - 1.5 * IQR) & (df['duration'] <= Q3 + 1.5 * IQR)]

In [None]:
# Step 9: Encode Categorical Variables
categorical_cols = df.select_dtypes(include='object').columns.drop('y')
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

In [None]:
# Step 10: Encode Target Variable
df['y'] = df['y'].map({'no': 0, 'yes': 1})

In [None]:
# Step 11: Correlation Matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Matrix")
plt.show()

In [None]:
# Step 12: Split Data into Features and Target
X = df.drop('y', axis=1)
y = df['y']

In [None]:
# Step 13: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Step 14: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Step 15: Train and Evaluate Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', use_label_encoder=False),
    'Support Vector Machine': SVC(probability=True)
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"\n{name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    ConfusionMatrixDisplay.from_estimator(model, X_test_scaled, y_test)
    plt.title(f"Confusion Matrix - {name}")
    plt.show()

In [None]:
# Step 16: ROC Curve Comparison
def get_roc_auc(model, X, y_true):
    y_prob = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return fpr, tpr, auc(fpr, tpr)

plt.figure(figsize=(10, 6))
for name, model in models.items():
    fpr, tpr, auc_val = get_roc_auc(model, X_test_scaled, y_test)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc_val:.2f})")

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison of Models")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Step 17: Feature Importance using XGBoost
importances = pd.Series(models['XGBoost'].feature_importances_, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=importances[:15], y=importances.index[:15])
plt.title("Top 15 Feature Importances – XGBoost")
plt.xlabel("Importance")
plt.show()

In [None]:
# Step 18: Model Comparison Summary
results_df = pd.DataFrame({
    'Model': list(model_scores.keys()),
    'Accuracy': list(model_scores.values()),
    'AUC': [model_aucs[model] for model in model_scores.keys()]
}).sort_values(by='Accuracy', ascending=False)

print("\nModel Comparison:")
print(results_df)

In [None]:
# Best Models Summary
print("\n🏆 Based on both accuracy and AUC scores, the best-performing models are:")
print("- Random Forest Classifier")
print("- XGBoost Classifier")
print("These models showed strong predictive power and clearly identified important features influencing customer decisions.")

In [None]:
# Step 19: Conclusion
print("\n✅ Conclusion:")
print("This project explored the Bank Marketing dataset to understand customer behavior and predict term deposit subscriptions.")
print("Several models were evaluated, with Random Forest and XGBoost performing best based on accuracy and AUC.")
print("Feature importance analysis highlighted key drivers like duration, contact method, and month of contact.")

In [None]:
Conclusion
Through extensive analysis and preprocessing of the Bank Marketing dataset, multiple machine learning models were implemented and evaluated. Key findings include:

Several models, including Random Forest and XGBoost, achieved strong performance in predicting term deposit subscriptions.

The ROC curve and AUC scores provided deeper insights into model performance beyond accuracy alone.

Feature importance analysis revealed that attributes like duration, month, and contact method significantly influence customer decisions.

This end-to-end workflow demonstrates the power of data-driven marketing and predictive modeling in supporting targeted campaign strategies.