# Bank Marketing Classification

This notebook explores the classic Portuguese bank marketing dataset. The bank ran a series of phone campaigns to convince customers to subscribe to a term deposit. Our goal is to build classification models that can predict whether a customer will say “yes”.

A key challenge in this dataset is class imbalance: most customers say “no”, and only about 11% (4,640 of 41,188) say “yes”. Because of that, plain accuracy can look very high even when the model identifies few of the positive cases. To handle this, we’ll try different resampling techniques (SMOTE, random oversampling, random undersampling, NearMiss) and compare two models (logistic regression and random forest), using F1 score as our main metric.

We’ll follow this flow:

1. Load and inspect the data

2. Preprocess (one-hot encode / dummies)

3. Build baseline models

4. Handle imbalance

5. Compare models and pick the best one

## Imports

In [1]:
# General libraries
import pandas as pd
import numpy as np

# scikit-learn: splitting and metrics
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# imbalanced-learn: resampling for imbalanced datasets
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE

## Load + inspect

In [2]:
bank = pd.read_csv("./bank.csv")
bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,0.0,0.0,0.0,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,,0.0,0.0,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,0.0,1.0,0.0,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,0.0,0.0,0.0,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,0.0,0.0,1.0,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [3]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         32591 non-null  float64
 5   housing         40198 non-null  float64
 6   loan            40198 non-null  float64
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  campaign        41188 non-null  int64  
 11  pdays           41188 non-null  int64  
 12  previous        41188 non-null  int64  
 13  poutcome        41188 non-null  object 
 14  emp.var.rate    41188 non-null  float64
 15  cons.price.idx  41188 non-null  float64
 16  cons.conf.idx   41188 non-null  float64
 17  euribor3m       41188 non-null 

In [4]:
bank['y'].value_counts()

y
0    36548
1     4640
Name: count, dtype: int64
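
To make the imbalance concrete, here is a small sketch that scores a hypothetical "model" which always predicts the majority class, using the class counts printed above. The label arrays are synthetic stand-ins built from those counts (an assumption for illustration), not the real rows of `bank`.

```python
import numpy as np
from sklearn import metrics

# Class counts taken from bank['y'].value_counts() above.
n_no, n_yes = 36548, 4640

# Synthetic labels with the same class balance (stand-in for the real target).
y_true = np.concatenate([np.zeros(n_no, dtype=int), np.ones(n_yes, dtype=int)])

# A hypothetical "model" that always predicts the majority class (0).
y_majority = np.zeros_like(y_true)

majority_acc = metrics.accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positives are predicted.
majority_f1 = metrics.f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy = {majority_acc:.3f}, F1 = {majority_f1:.3f}")
```

Accuracy here is simply the majority-class share (36,548 / 41,188), while F1 is zero because no positive case is ever predicted. That gap is why F1 is the main metric in the rest of the notebook.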

## Preprocessing

In [5]:
y = bank['y']
X = bank.drop(columns=['y'])

X = pd.get_dummies(X, drop_first=True)
X = X.fillna(0)

X.head()

Unnamed: 0,age,default,housing,loan,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,...,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success
0,56,0.0,0.0,0.0,1,999,0,1.1,93.994,-36.4,...,True,False,False,False,True,False,False,False,True,False
1,57,0.0,0.0,0.0,1,999,0,1.1,93.994,-36.4,...,True,False,False,False,True,False,False,False,True,False
2,37,0.0,1.0,0.0,1,999,0,1.1,93.994,-36.4,...,True,False,False,False,True,False,False,False,True,False
3,40,0.0,0.0,0.0,1,999,0,1.1,93.994,-36.4,...,True,False,False,False,True,False,False,False,True,False
4,56,0.0,0.0,1.0,1,999,0,1.1,93.994,-36.4,...,True,False,False,False,True,False,False,False,True,False


- Separated the target: `y = bank["y"]`, features: `X = bank.drop(columns=["y"])`.
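- One-hot encoded the categorical (object) columns with `pd.get_dummies(X, drop_first=True)`; `drop_first=True` drops one level per column, so each category is represented without a redundant dummy.
- Filled the missing values (present in `default`, `housing`, and `loan`) with `X = X.fillna(0)` so the models don’t error. Note that this treats a missing value the same as a 0.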
- Result: fully numeric feature matrix ready for train/test split.
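
The effect of these two steps is easier to see on a tiny frame. The sketch below uses a made-up three-row DataFrame (`toy` is hypothetical and only borrows column names from the dataset) to show which columns `get_dummies` creates and what `fillna(0)` does to a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column, one column with a missing
# value, and one categorical column with two levels.
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "default": [0.0, np.nan, 0.0],
    "contact": ["telephone", "cellular", "telephone"],
})

# Only the object column is expanded; with drop_first=True the first level
# in sorted order ("cellular") is dropped, leaving a single dummy column.
encoded = pd.get_dummies(toy, drop_first=True)

# Missing values in the numeric columns are replaced by 0.
filled = encoded.fillna(0)

print(list(filled.columns))
print(filled)
```

After `fillna(0)`, the row that had a missing `default` is indistinguishable from a row where `default` was recorded as 0, which is worth keeping in mind when interpreting that feature.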

## Baseline models

### Baseline: Logistic Regression (no resampling)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

lr_base = LogisticRegression(solver="liblinear")
lr_base.fit(X_train, y_train)

lr_base_preds = lr_base.predict(X_test)

metrics.accuracy_score(y_test, lr_base_preds), metrics.f1_score(y_test, lr_base_preds)

(0.902107409925221, 0.2951048951048951)

### Baseline: Random Forest (no resampling)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

rf_base = RandomForestClassifier()
rf_base.fit(X_train, y_train)

rf_base_preds = rf_base.predict(X_test)

metrics.accuracy_score(y_test, rf_base_preds), metrics.f1_score(y_test, rf_base_preds)

(0.8959891230455472, 0.3862464183381089)

## Imbalance handling and model comparison
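
Before the individual experiments, here is a rough sketch of what random oversampling and random undersampling do to class counts. It is a hand-rolled NumPy illustration on made-up data, not imblearn's implementation; SMOTE differs in that it creates new synthetic minority rows by interpolating between neighbouring minority samples rather than duplicating existing ones.

```python
import numpy as np

rng = np.random.default_rng(511)

# Toy training data: 90 negatives, 10 positives (made-up numbers).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = rng.normal(size=(100, 3))

neg_idx = np.flatnonzero(y_toy == 0)
pos_idx = np.flatnonzero(y_toy == 1)

# Random oversampling: draw minority rows with replacement until classes match.
extra_pos = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
over_idx = np.concatenate([neg_idx, pos_idx, extra_pos])
X_over, y_over = X_toy[over_idx], y_toy[over_idx]

# Random undersampling: keep only as many majority rows as there are minority rows.
kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([kept_neg, pos_idx])
X_under, y_under = X_toy[under_idx], y_toy[under_idx]

print("original:    ", np.bincount(y_toy))
print("oversampled: ", np.bincount(y_over))
print("undersampled:", np.bincount(y_under))
```

Oversampling grows the toy set from 100 to 180 rows, while undersampling shrinks it to 20. In every experiment below, resampling is applied to the training split only, so the test split keeps its original class balance.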

### SMOTE + Logistic Regression

In [8]:
# Same split as the baselines so the scores are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

# Resample the training split only; the test split keeps its original class balance
sm = SMOTE(random_state=511)
X_sm_resampled, y_sm_resampled = sm.fit_resample(X_train, y_train)

sm_lr = LogisticRegression(solver="liblinear")
sm_lr.fit(X_sm_resampled, y_sm_resampled)

sm_lr_preds = sm_lr.predict(X_test)
metrics.accuracy_score(y_test, sm_lr_preds), metrics.f1_score(y_test, sm_lr_preds)


(0.8436437797416724, 0.37109375)

### SMOTE + Random Forest

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

sm = SMOTE(random_state=511)
X_sm_resampled, y_sm_resampled = sm.fit_resample(X_train, y_train)

rf_sm = RandomForestClassifier()
rf_sm.fit(X_sm_resampled, y_sm_resampled)

rf_sm_preds = rf_sm.predict(X_test)
metrics.accuracy_score(y_test, rf_sm_preds), metrics.f1_score(y_test, rf_sm_preds)

(0.8879285228707391, 0.41951710261569414)

### RandomOverSampler + Random Forest

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

ros = RandomOverSampler(random_state=511)
X_ros_resampled, y_ros_resampled = ros.fit_resample(X_train, y_train)

rf_ros = RandomForestClassifier()
rf_ros.fit(X_ros_resampled, y_ros_resampled)

rf_ros_preds = rf_ros.predict(X_test)
metrics.accuracy_score(y_test, rf_ros_preds), metrics.f1_score(y_test, rf_ros_preds)


(0.8833640866271729, 0.4195263412276462)

### RandomOverSampler + Logistic Regression

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

ros = RandomOverSampler(random_state=511)
X_ros_resampled, y_ros_resampled = ros.fit_resample(X_train, y_train)

lr_ros = LogisticRegression(solver="liblinear")
lr_ros.fit(X_ros_resampled, y_ros_resampled)

lr_ros_preds = lr_ros.predict(X_test)
metrics.accuracy_score(y_test, lr_ros_preds), metrics.f1_score(y_test, lr_ros_preds)

(0.8266485384092455, 0.44340505144995324)

### RandomUnderSampler + Logistic Regression

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

rus = RandomUnderSampler(random_state=511)
X_rus_resampled, y_rus_resampled = rus.fit_resample(X_train, y_train)

lr_rus = LogisticRegression(solver="liblinear", max_iter=2000)
lr_rus.fit(X_rus_resampled, y_rus_resampled)

lr_rus_preds = lr_rus.predict(X_test)
metrics.accuracy_score(y_test, lr_rus_preds), metrics.f1_score(y_test, lr_rus_preds)

(0.828202389045353, 0.4417797412432944)

### RandomUnderSampler + Random Forest

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

rus = RandomUnderSampler(random_state=511)
X_rus_resampled, y_rus_resampled = rus.fit_resample(X_train, y_train)

rf_rus = RandomForestClassifier(random_state=511)
rf_rus.fit(X_rus_resampled, y_rus_resampled)

rf_rus_preds = rf_rus.predict(X_test)
metrics.accuracy_score(y_test, rf_rus_preds), metrics.f1_score(y_test, rf_rus_preds)

(0.7777022433718559, 0.39106145251396646)

### NearMiss (undersampling) + Logistic Regression

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

nm = NearMiss()
X_nm_resampled, y_nm_resampled = nm.fit_resample(X_train, y_train)

lr_nm = LogisticRegression(solver="liblinear")
lr_nm.fit(X_nm_resampled, y_nm_resampled)

nm_preds = lr_nm.predict(X_test)
metrics.accuracy_score(y_test, nm_preds), metrics.f1_score(y_test, nm_preds)

(0.5297659512479362, 0.25277777777777777)

### NearMiss (undersampling) + Random Forest

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=511)

nm = NearMiss()
X_nm_res, y_nm_res = nm.fit_resample(X_train, y_train)

rf_nm = RandomForestClassifier(random_state=511)
rf_nm.fit(X_nm_res, y_nm_res)

rf_nm_preds = rf_nm.predict(X_test)
metrics.accuracy_score(y_test, rf_nm_preds), metrics.f1_score(y_test, rf_nm_preds)

(0.5155870641934545, 0.25217391304347825)
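
The cells above all repeat the same pattern: split, resample the training data, fit, and score on the test data. One way to express that pattern once is a small helper, sketched below. Everything here is illustrative: `evaluate` and `DuplicateMinority` are hypothetical names, the data comes from `make_classification` rather than `bank.csv`, and the stand-in sampler only mimics the `fit_resample(X, y)` interface that imblearn samplers expose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


class DuplicateMinority:
    """Hypothetical stand-in sampler: duplicates minority rows at random."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.default_rng(self.random_state)
        X, y = np.asarray(X), np.asarray(y)
        counts = np.bincount(y)
        minority = int(np.argmin(counts))
        n_extra = int(counts.max() - counts.min())
        extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
        return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def evaluate(sampler, model, X_train, X_test, y_train, y_test):
    """Resample the training split only, fit, and score on the untouched test split."""
    if sampler is not None:
        X_train, y_train = sampler.fit_resample(X_train, y_train)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return {
        "accuracy": metrics.accuracy_score(y_test, preds),
        "f1": metrics.f1_score(y_test, preds, zero_division=0),
    }


# Synthetic imbalanced data (roughly 90/10), standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=511
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=511)

samplers = {"none": None, "duplicate_minority": DuplicateMinority(random_state=511)}
# Factories, so every run gets a fresh, unfitted model.
models = {
    "logreg": lambda: LogisticRegression(solver="liblinear"),
    "forest": lambda: RandomForestClassifier(random_state=511),
}

results = {
    (s_name, m_name): evaluate(sampler, make_model(), X_tr, X_te, y_tr, y_te)
    for s_name, sampler in samplers.items()
    for m_name, make_model in models.items()
}

for key, scores in results.items():
    print(key, {k: round(v, 3) for k, v in scores.items()})
```

With imblearn installed, objects such as `SMOTE(random_state=511)` or `RandomUnderSampler(random_state=511)` could be placed in the `samplers` dictionary instead, since they provide the same `fit_resample` method. Fixing `random_state` on the random forest also makes its scores repeatable from run to run.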

## Conclusion

This dataset was highly imbalanced: 36,548 rows were class 0 (no deposit) and only 4,640 were class 1 (deposit), so a model could reach roughly 0.89 accuracy just by always predicting the majority class. The plain logistic regression baseline behaved much like that: it achieved about 0.90 accuracy but only about 0.30 F1, which means it identified few of the positive cases. The random forest baseline had similar accuracy (about 0.90) and a somewhat better F1 of about 0.39.

To address this, I tested several imbalance-handling strategies:

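- SMOTE + Logistic Regression improved F1 to about 0.37 (accuracy ≈ 0.84).
- SMOTE + Random Forest and RandomOverSampler + Random Forest both did better, at about 0.42 F1 (accuracy ≈ 0.89 and ≈ 0.88).
- RandomOverSampler + Logistic Regression gave the best result overall, with accuracy ≈ 0.83 and F1 ≈ 0.44. Oversampling the minority class in the training set gave logistic regression enough positive examples to learn from.
- RandomUnderSampler + Logistic Regression was essentially tied with it (accuracy ≈ 0.83, F1 ≈ 0.44), while RandomUnderSampler + Random Forest dropped to F1 ≈ 0.39 (accuracy ≈ 0.78).
- NearMiss performed worst: with either model, accuracy fell to about 0.52 to 0.53 and F1 to about 0.25, below both unresampled baselines.

Undersampling throws away a lot of majority-class data, which may explain the weak random forest and NearMiss results, although random undersampling cost logistic regression almost nothing. All of these numbers come from a single train/test split, and some of the random forests were fit without a fixed `random_state`, so small differences in F1 should be read with caution.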

Final pick: Logistic Regression trained on the randomly oversampled training data (RandomOverSampler). It gave the highest F1 score, by a narrow margin over random undersampling, while still being simple and easy to explain. In a real business setting I would report F1 (and recall for the positive class), not just accuracy, because the goal is to catch as many likely depositors as possible.
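
As a sketch of what such a report could look like, the snippet below computes accuracy, precision, recall, and F1 together with the confusion-matrix counts. The labels are made up for illustration (they are not predictions from any model in this notebook).

```python
from sklearn import metrics

# Made-up labels for illustration: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

report = {
    "accuracy": metrics.accuracy_score(y_true, y_pred),
    "precision": metrics.precision_score(y_true, y_pred),
    "recall": metrics.recall_score(y_true, y_pred),
    "f1": metrics.f1_score(y_true, y_pred),
}

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

for name, value in report.items():
    print(f"{name:>9}: {value:.3f}")
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
```

Recall answers "what share of actual depositors did we catch?" and precision answers "what share of flagged customers actually subscribed?"; F1 is the harmonic mean of the two, so it stays low whenever either one is poor.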