# Exercise: Classification with the Bank Marketing Dataset


In this exercise you will work on a **classification** task using the [Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  
Your goal is to predict whether a client will **subscribe to a term deposit** (`y` = yes/no).  

You will follow the full machine learning workflow: from baseline to more advanced models.  
This exercise should take about 1–2 hours.  

---


## Step 1: Load the data

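We'll load the Bank Marketing dataset from OpenML with `fetch_openml` from `sklearn.datasets`. The OpenML version names the features `V1`–`V16` and the target `Class`.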

```python
from sklearn.datasets import fetch_openml

# Load the dataset
bank = fetch_openml(name="bank-marketing", version=1, as_frame=True)
df = bank.frame

# Inspect the data
df.head()
```


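|   | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | Class |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-------|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | 1 |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | 1 |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | 1 |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | 1 |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | 1 |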



## Step 2: Explore the dataset

- What variables do we have?
- Which ones are numerical, which are categorical?
- Are there missing values?
- What does the target look like (`y` in the UCI description, `Class` in this frame)?
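The questions above can be approached with a few pandas calls. The following is a minimal sketch on a hand-made toy frame (`toy` is a hypothetical stand-in that copies a few columns from the preview); in the notebook you would run the same calls on `df`. Note that `isna()` only counts real missing values, while this dataset also uses the string `unknown` as a placeholder.

```python
import pandas as pd

# Toy stand-in with a few of the columns from the preview (assumption:
# in the notebook, replace `toy` with the real `df`).
toy = pd.DataFrame({
    "V1": [58, 44, 33, 47, 33, 35],
    "V2": ["management", "technician", "entrepreneur",
           "blue-collar", "unknown", "management"],
    "V6": [2143, 29, 2, 1506, 1, 231],
    "Class": ["1", "1", "1", "1", "1", "2"],
})

# Which columns are numerical, which are categorical (target excluded)?
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = (
    toy.select_dtypes(exclude="number").columns.drop("Class").tolist()
)

# Real missing values (NaN) per column.
missing_per_column = toy.isna().sum()

# Placeholder "unknown" entries per categorical column.
unknown_per_column = (toy[categorical_cols] == "unknown").sum()

# Relative frequency of each target class.
class_share = toy["Class"].value_counts(normalize=True)

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(missing_per_column)
print(unknown_per_column)
print(class_share)
```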



## Step 3: A very naive baseline

Before building any models, always compare against a trivial baseline.

- Use the most frequent class (majority class) as the prediction for all rows.
- Calculate accuracy and other relevant metrics.
- This gives you a baseline to beat.
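One way to build such a baseline is scikit-learn's `DummyClassifier`. The sketch below uses a small made-up target (`y_toy` is hypothetical, and treating `"2"` as the positive class is an assumption you should verify against the real data). It illustrates why accuracy alone is misleading on an imbalanced target: the baseline scores well on accuracy but never finds a positive case.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced target: 8 negatives ("1") and 2 positives ("2").
y_toy = np.array(["1"] * 8 + ["2"] * 2)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)
# Recall for the (assumed) positive class "2".
baseline_recall = recall_score(y_toy, y_pred, pos_label="2", zero_division=0)

print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
```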



## Step 4: Data preparation

- Handle categorical variables (e.g. one-hot encoding or ordinal encoding).
- Scale or normalize numerical features if needed.
- Split the dataset into train and test sets.
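The three preparation steps can be combined with a `ColumnTransformer`. This is a sketch under assumptions: the frame `toy` is randomly generated to mimic the column types, and the choice of one-hot encoding plus standard scaling is just one reasonable option. The split happens before fitting the transformer, so that the test set does not leak into the scaling statistics.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in data (assumption: replace with the real `df`).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "V1": rng.integers(18, 90, size=n),        # age-like numeric column
    "V6": rng.normal(1000, 500, size=n),       # balance-like numeric column
    "V2": rng.choice(["management", "technician", "unknown"], size=n),
    "Class": rng.choice(["1", "2"], size=n, p=[0.85, 0.15]),
})

X = toy.drop(columns="Class")
y = toy["Class"]

# Stratify so both splits keep roughly the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # Categories unseen during fit are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training split only, then apply to the test split.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Tree-based models do not need scaling, so the `StandardScaler` step mainly matters for models such as logistic regression.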



## Step 5: Train simple models

Start with traditional algorithms from `scikit-learn`:

- Logistic Regression
- Decision Tree

Evaluate them against the baseline. Which performs better?
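A comparison loop along these lines might look as follows. It is only a sketch: the data comes from `make_classification` as a purely numeric stand-in for the prepared bank features, and the hyperparameters (`max_iter`, `max_depth`) are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the bank dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Fit each model on the same split and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, results[name])
```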



## Step 6: Improve with ensembles

Try tree-based ensemble models:

- Random Forest
- Gradient Boosted Trees (e.g. XGBoost, LightGBM)

Do they improve performance compared to the single models?



## Step 7: Model evaluation

- Use metrics beyond accuracy, such as Precision, Recall, and F1-Score.
- Which model balances performance best for this classification problem?
- Think about how this would be used in practice (cost of false positives vs. false negatives).
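The metrics and the cost question can be made concrete with a small hand-checkable example. The labels below are invented, and the two cost values are hypothetical placeholders (say, a wasted call versus a missed subscriber); real costs would have to come from the business context.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score,
)

# Invented labels: 1 = subscribed, 0 = did not subscribe.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Hypothetical costs per error type (assumption for illustration only).
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 5.0
total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"total cost={total_cost}")
```

Comparing `total_cost` across models is one way to pick a model when the two error types are not equally expensive.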



## Step 8: Reflection

- What did you learn from this dataset compared to the California housing exercise?
- How did categorical data and the classification target change your workflow?
- Which model would you choose if you had to deploy one today?


## Step 9: Model explanation

- Can you extract and plot the feature importances of one of the tree-based models?
- Visualize predicted vs. actual classes, for example with a confusion matrix (a scatter plot is not informative for a binary target).
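Extracting feature importances from a fitted tree ensemble could look like the sketch below. The data is synthetic and the names `feature_0` to `feature_5` are hypothetical; with the real pipeline you would take the names from the fitted preprocessing step instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the bank dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

With matplotlib installed, `importances.sort_values().plot.barh()` draws these values as a horizontal bar chart. Impurity-based importances can overstate high-cardinality features, so treat them as a first look rather than a final answer.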

## Step 10: Final reflection

Summarize your findings:

- Which model was best?
- Did tuning help (if you tuned any hyperparameters)?
- What did you learn about the trade-offs between models?
- Which model was easiest to use?

## Step 11: Extra

- Can you enrich the data from this or the previous exercise to improve your predictive power?