# Bank Marketing Dataset Analysis

## Overview
This notebook analyzes the Bank Marketing dataset. The goal is to predict whether a client will subscribe to a term deposit (variable: 'deposit', yes/no).

The dataset includes features such as:
- Numerical: age, balance, day, duration, campaign, pdays, previous
- Categorical: job, marital, education, default, housing, loan, contact, month, poutcome

We will use sklearn.pipeline for preprocessing and modeling.

## Steps:
- Load and explore data
- Preprocess using Pipeline (scaling numerical, one-hot encoding categorical)
- Part A: Train Decision Tree, evaluate Accuracy, Precision, Recall
- Part B: Tune two hyperparameters with at least 2 values each, compare results
- Part C: Train Random Forest, report Accuracy

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the dataset
df = pd.read_csv('bank.csv')

# Display first few rows and info
print(df.head())
df.info()  # info() prints directly and returns None, so no print() wrapper
print(df.describe())

# Check for missing values
print(df.isnull().sum())

   age         job  marital  education default  balance housing loan  contact  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown   
1   56      admin.  married  secondary      no       45      no   no  unknown   
2   41  technician  married  secondary      no     1270     yes   no  unknown   
3   55    services  married  secondary      no     2476     yes   no  unknown   
4   54      admin.  married   tertiary      no      184      no   no  unknown   

   day month  duration  campaign  pdays  previous poutcome deposit  
0    5   may      1042         1     -1         0  unknown     yes  
1    5   may      1467         1     -1         0  unknown     yes  
2    5   may      1389         1     -1         0  unknown     yes  
3    5   may       579         1     -1         0  unknown     yes  
4    5   may       673         2     -1         0  unknown     yes  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 

## Data Preprocessing
We split the data into features (X) and target (y).
Target: 'deposit' (yes=1, no=0).

We use ColumnTransformer to:
- Scale numerical features
- One-hot encode categorical features

Split the data into stratified train/test sets (95/5, since the code uses test_size=0.05).

In [2]:
# 1. New Column 1: Age category
df['age_category'] = pd.cut(df['age'], bins=[0, 25, 35, 45, 55, 65, 100], labels=['<25', '25-35', '35-45', '45-55', '55-65', '65+'])

# 2. New Column 2: Balance category
df['balance_category'] = pd.cut(df['balance'], bins=[-np.inf, 0, 1000, 5000, np.inf], labels=['Negative', 'Low', 'Medium', 'High'])

# 3. New Column 3: Duration category
df['duration_category'] = pd.cut(df['duration'], bins=[0, 100, 300, 600, np.inf], labels=['Short', 'Medium', 'Long', 'Very Long'])

# 4. New Column 4: Campaign type
df['campaign_type'] = df['campaign'].apply(lambda x: 'Low' if x <= 5 else 'High')

# 5. New Column 5: Previous interaction
df['previous_interaction'] = df['previous'].apply(lambda x: 'No Interaction' if x == 0 else 'Interaction')
# Display the DataFrame with new columns
df.head()

| | age | job | marital | education | default | balance | housing | loan | contact | day | ... | campaign | pdays | previous | poutcome | deposit | age_category | balance_category | duration_category | campaign_type | previous_interaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | admin. | married | secondary | no | 2343 | yes | no | unknown | 5 | ... | 1 | -1 | 0 | unknown | yes | 55-65 | Medium | Very Long | Low | No Interaction |
| 1 | 56 | admin. | married | secondary | no | 45 | no | no | unknown | 5 | ... | 1 | -1 | 0 | unknown | yes | 55-65 | Low | Very Long | Low | No Interaction |
| 2 | 41 | technician | married | secondary | no | 1270 | yes | no | unknown | 5 | ... | 1 | -1 | 0 | unknown | yes | 35-45 | Medium | Very Long | Low | No Interaction |
| 3 | 55 | services | married | secondary | no | 2476 | yes | no | unknown | 5 | ... | 1 | -1 | 0 | unknown | yes | 45-55 | Medium | Long | Low | No Interaction |
| 4 | 54 | admin. | married | tertiary | no | 184 | no | no | unknown | 5 | ... | 2 | -1 | 0 | unknown | yes | 45-55 | Low | Very Long | Low | No Interaction |
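
The binning in the cell above relies on pd.cut's default right-closed intervals, so every bin is (left, right]. The sketch below uses made-up values (not rows from bank.csv) to show what that means at the bin edges: a duration of exactly 0 falls outside the first bin (0, 100] and becomes NaN, while include_lowest=True closes the first bin on the left. Whether bank.csv actually contains zero durations is not checked here.

```python
import numpy as np
import pandas as pd

# Made-up durations chosen to hit the bin edges; not taken from bank.csv
durations = pd.Series([0, 100, 101, 600, 601])
bins = [0, 100, 300, 600, np.inf]
labels = ['Short', 'Medium', 'Long', 'Very Long']

# Default behaviour: intervals are (left, right], so 0 is not in (0, 100]
default_cut = pd.cut(durations, bins=bins, labels=labels)

# include_lowest=True makes the first interval [0, 100]
closed_cut = pd.cut(durations, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'duration': durations,
    'default': default_cut,
    'include_lowest': closed_cut,
}))
```

If zero-length calls should count as 'Short', include_lowest=True is one way to keep them out of the missing-value path.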


In [3]:
print(df.head())

   age         job  marital  education default  balance housing loan  contact  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown   
1   56      admin.  married  secondary      no       45      no   no  unknown   
2   41  technician  married  secondary      no     1270     yes   no  unknown   
3   55    services  married  secondary      no     2476     yes   no  unknown   
4   54      admin.  married   tertiary      no      184      no   no  unknown   

   day  ... campaign  pdays  previous  poutcome  deposit age_category  \
0    5  ...        1     -1         0   unknown      yes        55-65   
1    5  ...        1     -1         0   unknown      yes        55-65   
2    5  ...        1     -1         0   unknown      yes        35-45   
3    5  ...        1     -1         0   unknown      yes        45-55   
4    5  ...        2     -1         0   unknown      yes        45-55   

  balance_category duration_category campaign_type previous_interaction  


In [4]:
# Define features and target
y = df['deposit'].map({'yes': 1, 'no': 0})
X = df.drop(columns='deposit')
# print(X.columns)

# Parameters for the Decision Tree classifier
dt_params = {'criterion': 'entropy',
             'max_depth': 10,
             'max_features': None,
             'min_samples_leaf': 4,
             'min_samples_split': 10}

# Duration in minutes: a pure rescaling, so tree splits are unaffected
X_eng = X.copy()
X_eng['duration_min'] = X_eng['duration'] / 60.0

# pdays == -1 is a sentinel: keep it as a 0/1 flag, then turn it into NaN
# so the numerical imputer (mean) fills it instead of treating -1 as a value
X_eng['never_contacted'] = (X_eng['pdays'] == -1).astype(int)
X_eng['pdays'] = X_eng['pdays'].replace(-1, np.nan)
print(X_eng.head())
# Identify numerical and categorical columns
NUMERICAL = {'age','balance','day','duration_min','campaign','pdays','previous'}
# CATEGORICAL = ['job','marital','education','default','housing','loan',
#             'contact','month','poutcome','never_contacted']
# Everything not listed in NUMERICAL is treated as categorical. Note that this
# includes the raw 'duration' column and the engineered *_category columns.
CATEGORICAL = set(X_eng.columns) - NUMERICAL
CATEGORICAL, NUMERICAL = list(CATEGORICAL), list(NUMERICAL)
# Preprocessing pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  
    ('scaler', StandardScaler())                 
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # fill_value only applies to strategy='constant'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))                    
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, NUMERICAL),
        ('cat', categorical_transformer, CATEGORICAL)
    ])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_eng, y, test_size=0.05, random_state=999, stratify=y)

print(f'Train shape: {X_train.shape}, Test shape: {X_test.shape}')


   age         job  marital  education default  balance housing loan  contact  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown   
1   56      admin.  married  secondary      no       45      no   no  unknown   
2   41  technician  married  secondary      no     1270     yes   no  unknown   
3   55    services  married  secondary      no     2476     yes   no  unknown   
4   54      admin.  married   tertiary      no      184      no   no  unknown   

   day  ... pdays  previous  poutcome  age_category  balance_category  \
0    5  ...   NaN         0   unknown         55-65            Medium   
1    5  ...   NaN         0   unknown         55-65               Low   
2    5  ...   NaN         0   unknown         35-45            Medium   
3    5  ...   NaN         0   unknown         45-55            Medium   
4    5  ...   NaN         0   unknown         45-55               Low   

  duration_category campaign_type previous_interaction duration_min  \
0  
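
Because CATEGORICAL is computed as "all columns minus NUMERICAL", any column missing from NUMERICAL is silently one-hot encoded. The sketch below uses a tiny made-up frame (hypothetical values, a few of the notebook's column names) to show that the raw duration column ends up in the categorical list. It also shows one possible alternative, which is an assumption and not the notebook's method: drop the redundant column and select columns by dtype with sklearn's make_column_selector. Applying this to the notebook would change the reported metrics.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny made-up frame reusing a few of the notebook's column names
toy = pd.DataFrame({
    'age': [59, 41],
    'duration': [1042, 1389],
    'duration_min': [1042 / 60.0, 1389 / 60.0],
    'job': ['admin.', 'technician'],
})

NUMERICAL = {'age', 'duration_min'}

# Set difference, as in the notebook: 'duration' is numeric but lands here
categorical_by_difference = sorted(set(toy.columns) - NUMERICAL)
print(categorical_by_difference)

# Alternative (assumption): drop the redundant column, then select by dtype
toy_clean = toy.drop(columns='duration')
numeric_cols = make_column_selector(dtype_include=np.number)(toy_clean)
categorical_cols = make_column_selector(dtype_exclude=np.number)(toy_clean)
print(numeric_cols, categorical_cols)
```

Selecting by dtype would also classify an integer flag such as never_contacted as numerical, which is harmless for tree models but worth knowing before switching.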

## Part A: Decision Tree Classifier
Train a Decision Tree using Pipeline.
Evaluate on test set: Accuracy, Precision, Recall.

In [5]:
# Pipeline for Decision Tree
dt_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(**dt_params, random_state=999))
])
# dt_pipeline = DecisionTreeClassifier(**dt_params, random_state=999)
# Train the model
dt_pipeline.fit(X_train, y_train)

# Predict on test
y_pred_dt = dt_pipeline.predict(X_test)

# Evaluate
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)

print(f'Accuracy: {accuracy_dt:.4f}')
print(f'Precision: {precision_dt:.4f}')
print(f'Recall: {recall_dt:.4f}')
print(classification_report(y_test, y_pred_dt))


Accuracy: 0.8533
Precision: 0.8188
Recall: 0.8868
              precision    recall  f1-score   support

           0       0.89      0.82      0.86       294
           1       0.82      0.89      0.85       265

    accuracy                           0.85       559
   macro avg       0.85      0.85      0.85       559
weighted avg       0.86      0.85      0.85       559
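
The three scores follow directly from the confusion matrix. The counts in the sketch below are not printed by the notebook; they are reconstructed from the report above (supports 294 and 265, recall 0.8868, precision 0.8188) as integer counts consistent with those figures, so treat them as an inferred illustration.

```python
# Confusion-matrix counts inferred from the printed report (an assumption,
# not notebook output): class 1 = 'yes', class 0 = 'no'
tp, fn = 235, 30   # 265 actual positives
tn, fp = 242, 52   # 294 actual negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted 'yes', how many were right
recall = tp / (tp + fn)      # of the actual 'yes', how many were found

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
```

In the notebook itself, sklearn.metrics.confusion_matrix(y_test, y_pred_dt) would give the actual counts.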



## Part B: Hyperparameter Tuning for Decision Tree
Select two hyperparameters:
1. max_depth: Controls tree depth to prevent overfitting.
2. min_samples_leaf: Minimum samples per leaf to control splitting.

Test at least 2 values for each:
- max_depth: 5, 10
- min_samples_leaf: 1, 5

Compare Accuracy, Precision, Recall for each combination.

In [6]:
# Define hyperparameters to test
hyperparams = [
    {'max_depth': 5, 'min_samples_leaf': 1},
    {'max_depth': 5, 'min_samples_leaf': 5},
    {'max_depth': 10, 'min_samples_leaf': 1},
    {'max_depth': 10, 'min_samples_leaf': 5}
]

# One result row (dict) per hyperparameter combination
results = []

for params in hyperparams:
    # Create pipeline with params
    dt_pipeline_tuned = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', DecisionTreeClassifier(**params, random_state=999))
    ])
    
    # Train
    dt_pipeline_tuned.fit(X_train, y_train)
    
    # Predict
    y_pred = dt_pipeline_tuned.predict(X_test)
    
    # Metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    
    results.append({
        'max_depth': params['max_depth'],
        'min_samples_leaf': params['min_samples_leaf'],
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec
    })

# Display results in a table
results_df = pd.DataFrame(results)
print(results_df)

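# Comparison (see the table printed by this cell):
# - max_depth=10 beats max_depth=5 on accuracy, precision and recall, which
#   suggests the shallower tree underfits here rather than generalizing better.
# - min_samples_leaf has little effect: 5 vs 1 gives a small gain at
#   max_depth=5 and identical metrics at max_depth=10.
# - The test set has only 559 rows, so small differences should be read with caution.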


   max_depth  min_samples_leaf  Accuracy  Precision    Recall
0          5                 1  0.838998   0.825279  0.837736
1          5                 5  0.842576   0.826568  0.845283
2         10                 1  0.853309   0.830325  0.867925
3         10                 5  0.853309   0.830325  0.867925
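
The loop above scores every combination on the test set, so the test set is also doing the model selection. A common alternative is to choose hyperparameters by cross-validation on the training data and use the test set once at the end. The sketch below shows this with GridSearchCV on synthetic data from make_classification (a stand-in, not bank.csv). The grid matches the one in this section; cv=5, accuracy scoring and the random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use X_train / y_train instead
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same 2 x 2 grid as in this section
param_grid = {'max_depth': [5, 10], 'min_samples_leaf': [1, 5]}

# 5-fold cross-validation on the training split only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

print('Best params:', search.best_params_)
print(f'Mean CV accuracy: {search.best_score_:.4f}')

# The held-out test set is used once, for the selected model
test_accuracy = search.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.4f}')
```

With the notebook's Pipeline as the estimator, the grid keys would carry the step name as a prefix, for example classifier__max_depth and classifier__min_samples_leaf.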


## Part C: Random Forest Classifier
Train a Random Forest using Pipeline.
Report Accuracy on test set.

In [7]:
# Pipeline for Random Forest
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=999))
])

# Train the model
rf_pipeline.fit(X_train, y_train)

# Predict on test
y_pred_rf = rf_pipeline.predict(X_test)

# Evaluate Accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f'Random Forest Accuracy: {accuracy_rf:.4f}')
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.8766
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       294
           1       0.85      0.91      0.87       265

    accuracy                           0.88       559
   macro avg       0.88      0.88      0.88       559
weighted avg       0.88      0.88      0.88       559
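
To judge how much weight the gap between the Random Forest (0.8766) and the Decision Tree from Part A (0.8533) can bear, a rough binomial standard error for each accuracy is a useful first check. This is a simplified sketch that assumes each test prediction is an independent trial; it is not part of the original analysis.

```python
import math

n_test = 559  # test-set size from the reports above


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy estimated on n samples."""
    return math.sqrt(acc * (1 - acc) / n)


se_dt = accuracy_standard_error(0.8533, n_test)  # Decision Tree (Part A)
se_rf = accuracy_standard_error(0.8766, n_test)  # Random Forest (Part C)
gap = 0.8766 - 0.8533

print(f'DT standard error: {se_dt:.4f}')
print(f'RF standard error: {se_rf:.4f}')
print(f'Accuracy gap: {gap:.4f}')
```

Both models are scored on the same 559 rows, so their errors are correlated and this is only a rough guide; a paired comparison or a larger test split would give a firmer answer.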

