# Bank Marketing Dataset
- The [Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains data from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit.
- It is a fairly large dataset with 41K+ rows, a mixture of categorical and continuous columns, and data imperfections to identify and manage.

## Dataset
The data has the following columns:

Bank client data:

|col num | col name | description |
|:---|:---|:---|
| 1 | age | (numeric) |
| 2 | job | type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') |
| 3 | marital | marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |
| 4 | education | (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |
| 5 | default | has credit in default? (categorical: 'no','yes','unknown') |
| 6 | housing | has housing loan? (categorical: 'no','yes','unknown') |
| 7 | loan | has personal loan? (categorical: 'no','yes','unknown') |

Related to the last contact of the current campaign:

|col num | col name | description |
|:---|:---|:---|
| 8 | contact | contact communication type (categorical: 'cellular','telephone') |
| 9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
| 10 | day_of_week | last contact day of the week (categorical: 'mon','tue','wed','thu','fri') |


Other attributes:

|col num | col name | description |
|:---|:---|:---|
| 11 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 12 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
| 13 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 14 | poutcome | outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') |

Social and economic context attributes:

|col num | col name | description |
|:---|:---|:---|
| 15 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
| 16 | cons.price.idx | consumer price index - monthly indicator (numeric) |
| 17 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
| 18 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
| 19 | nr.employed | number of employees - quarterly indicator (numeric) |

Output variable (desired target):

|col num | col name | description |
|:---|:---|:---|
| 20 | y | This is the target column. Has the client subscribed to a term deposit? (binary: 'yes','no') |

## Goal
The goals of this project are to:
1. Build a scikit-learn model to predict the target column `y` and tune its hyperparameters using AWS SageMaker
1. Deploy the model as a `Serverless Inference Endpoint` and test it
1. Run `Batch Transform` on the entire input dataset
1. Calculate the performance of the model predictions on the entire input dataset

## Recommended Steps
1. **Data Exploration:** Understand the data by looking at distributions and unique values in the columns. Are there any issues with the data?
1. **Data Cleaning:** Handle any issues you found with the data.
1. **Feature Engineering:** Handle the various data types by applying the appropriate feature engineering techniques.
1. **Model Selection:** Choose an appropriate scikit-learn model for this problem and implement the SageMaker model training code.
1. **Hyperparameter tuning:** Choose appropriate hyperparameter ranges and an objective metric for the chosen model, and implement the SageMaker hyperparameter tuning code.
1. **Model training:** Submit the hyperparameter tuning job to SageMaker and monitor the execution progress.
1. **Model deployment as serverless inference:** Pick the best model from hyperparameter tuning, deploy it as a SageMaker serverless inference endpoint, and test that it works by posting some sample data to it.
1. **Batch transform:** Store the input dataset in a JSON Lines file, deploy the model as a batch transform, and run the batch transform job on the input JSON Lines file.
1. **Performance calculation:** Calculate model performance on the entire input dataset using output of the batch transform job.

## Tips
- You can use the code below to get the default S3 bucket for writing artifacts:
    ```
    import sagemaker
    session = sagemaker.Session()
    bucket = session.default_bucket()
    ```
- Are all the columns necessary or can we drop any?
- Does the data contain any issues?
- What ML task is this? Classification? Regression? Clustering?
- What are the data types of the columns? What pre-processing should you apply?
- What is the most appropriate metric for this model?

In [1]:
import pandas as pd
%matplotlib inline

df = pd.read_csv("https://raw.githubusercontent.com/stephenleo/sagemaker-deployment/main/data/final_project_bank.csv")

print(df.shape)
df.head()

(41188, 20)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56.0,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57.0,services,married,high.school,unknown,no,,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37.0,services,married,high.school,no,yes,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40.0,admin.,married,basic.6y,no,no,no,telephone,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56.0,services,married,high.school,no,no,yes,,may,mon,1.0,999.0,0.0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


## All the best!
Get started below...

1. Data Exploration

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             40767 non-null  float64
 1   job             40704 non-null  object 
 2   marital         40775 non-null  object 
 3   education       40764 non-null  object 
 4   default         40797 non-null  object 
 5   housing         40809 non-null  object 
 6   loan            40733 non-null  object 
 7   contact         40748 non-null  object 
 8   month           40767 non-null  object 
 9   day_of_week     40752 non-null  object 
 10  campaign        40775 non-null  float64
 11  pdays           40739 non-null  float64
 12  previous        40770 non-null  float64
 13  poutcome        40757 non-null  object 
 14  emp.var.rate    40770 non-null  float64
 15  cons.price.idx  40819 non-null  float64
 16  cons.conf.idx   40784 non-null  float64
 17  euribor3m       40759 non-null 

In [3]:
df.describe()

Unnamed: 0,age,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,40767.0,40775.0,40739.0,40770.0,40770.0,40819.0,40784.0,40759.0,40751.0
mean,40.02112,2.56699,962.34073,0.172823,0.08246,93.575781,-40.504127,3.620653,5167.062656
std,10.419903,2.76876,187.242913,0.494873,1.570749,0.578958,4.624825,1.73462,72.224169
min,17.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1
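
The `pdays` summary above is dominated by the sentinel value 999 (all three quartiles are 999), which the column description defines as "client was not previously contacted". One possible way to handle this, sketched here on a toy frame rather than the real data, is to split the column into a flag and a cleaned day count. The column names `previously_contacted` and `pdays_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real `pdays` column (999 = client not previously contacted).
toy = pd.DataFrame({"pdays": [999.0, 3.0, 999.0, 6.0, np.nan]})

# Flag clients who were reached in an earlier campaign.
# Comparisons against NaN are False, so rows with a missing value get 0 here.
toy["previously_contacted"] = (toy["pdays"] < 999).astype(int)

# Replace the sentinel with NaN so 999 is not read as a real number of days.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
```

Whether to keep the cleaned day count, impute it, or use only the flag is a modelling choice that depends on the estimator.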


In [4]:
df['job'].value_counts()

job,count
admin.,10315
blue-collar,9126
technician,6664
services,3917
management,2890
retired,1698
entrepreneur,1441
self-employed,1408
housemaid,1051
unemployed,1004


In [5]:
df['y'].value_counts()

y,count
no,36199
yes,4591
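
The counts above add up to 40790 labelled rows out of 41188, so 398 rows have no label, and only about 11% of the labelled rows are `yes`. A short sketch of what that imbalance means for accuracy, using only the two counts printed above:

```python
# Class counts copied from the `df['y'].value_counts()` output above.
counts = {"no": 36199, "yes": 4591}

total = sum(counts.values())
positive_rate = counts["yes"] / total

# Accuracy of a trivial model that always predicts the majority class "no".
baseline_accuracy = counts["no"] / total

print(f"labelled rows:     {total}")
print(f"positive rate:     {positive_rate:.4f}")
print(f"majority baseline: {baseline_accuracy:.4f}")
```

Because a model that never predicts `yes` already scores this baseline, metrics such as precision, recall or F1 on the `yes` class are probably more informative than accuracy alone.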


2. Data Cleaning

In [6]:
# View initial data
print("Initial shape:", df.shape)
print("Columns with 'unknown' values:\n", df.isin(['unknown']).sum())

# Replace 'unknown' with mode (or drop rows/columns based on your strategy)
columns_with_unknown = ['job', 'marital', 'education', 'default', 'housing', 'loan']

for col in columns_with_unknown:
    mode_val = df[col].mode()[0]
    df[col] = df[col].replace('unknown', mode_val)

# Convert all categorical text columns to lowercase
categorical_cols = df.select_dtypes(include='object').columns

for col in categorical_cols:
    df[col] = df[col].str.lower()

# Drop irrelevant or uninformative columns (e.g., 'duration' if present, or based on EDA)
# Example: dropping 'duration' if it leaks target info
if 'duration' in df.columns:
    df.drop(columns=['duration'], inplace=True)

# Final shape after cleaning
print("Final shape after cleaning:", df.shape)

Initial shape: (41188, 20)
Columns with 'unknown' values:
 age                  0
job                327
marital             79
education         1716
default           8520
housing            980
loan               978
contact              0
month                0
day_of_week          0
campaign             0
pdays                0
previous             0
poutcome             0
emp.var.rate         0
cons.price.idx       0
cons.conf.idx        0
euribor3m            0
nr.employed          0
y                    0
dtype: int64
Final shape after cleaning: (41188, 20)
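
The `df.info()` output earlier reports fewer than 41188 non-null values in the listed columns, so the data also contains true missing values (NaN) that the `'unknown'` replacement above does not touch. A minimal sketch of one way to handle them, on a toy frame with illustrative values: drop rows without a label, then fill numeric gaps with the median and categorical gaps with the mode. The choice of median and mode is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the real data (values are illustrative).
toy = pd.DataFrame({
    "age": [56.0, np.nan, 37.0, 40.0],
    "loan": ["no", np.nan, "no", "yes"],
    "y": ["no", "no", np.nan, "yes"],
})

print(toy.isna().sum())  # missing values per column

# A row without a label cannot be used for supervised training or scoring.
cleaned = toy.dropna(subset=["y"]).copy()

for col in cleaned.columns.drop("y"):
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        fill_value = cleaned[col].median()   # less sensitive to outliers than the mean
    else:
        fill_value = cleaned[col].mode()[0]  # most frequent category
    cleaned[col] = cleaned[col].fillna(fill_value)

print(cleaned)
```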


3. Feature Engineering

In [7]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

#  Convert target column 'y' to binary: 'yes' → 1, 'no' → 0
df['y'] = df['y'].map({'yes': 1, 'no': 0})

# Separate categorical and numerical features
categorical_cols = df.select_dtypes(include='object').columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).drop('y', axis=1).columns

#  One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

#  Normalize numerical features
scaler = StandardScaler()
df_encoded[numerical_cols] = scaler.fit_transform(df_encoded[numerical_cols])

# Final dataset ready for modeling
print("Shape after feature engineering:", df_encoded.shape)
df_encoded.head()


Shape after feature engineering: (41188, 47)


Unnamed: 0,age,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,...,month_may,month_nov,month_oct,month_sep,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_nonexistent,poutcome_success
0,1.533515,-0.565961,0.195787,-0.349232,0.647814,0.722374,0.887423,0.712757,0.331435,0.0,...,True,False,False,False,True,False,False,False,True,False
1,1.629486,-0.565961,0.195787,-0.349232,0.647814,0.722374,0.887423,0.712757,0.331435,0.0,...,True,False,False,False,True,False,False,False,True,False
2,-0.289941,-0.565961,0.195787,-0.349232,0.647814,0.722374,0.887423,0.712757,0.331435,0.0,...,True,False,False,False,True,False,False,False,True,False
3,-0.002027,-0.565961,0.195787,-0.349232,0.647814,0.722374,0.887423,0.712757,0.331435,0.0,...,True,False,False,False,True,False,False,False,True,False
4,1.533515,-0.565961,0.195787,-0.349232,0.647814,0.722374,0.887423,0.712757,0.331435,0.0,...,True,False,False,False,True,False,False,False,True,False
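
The cell above fits `StandardScaler` on every row, so validation rows influence the scaling statistics. An alternative, sketched here on toy data, is to put the encoding and scaling inside a scikit-learn `Pipeline` that is fitted on the training split only. `RandomForestClassifier` is an assumption: it matches the `n_estimators` and `max_depth` hyperparameters used later, but the notebook does not name its model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical feature (values are illustrative).
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "contact": ["cellular", "telephone"] * 4,
    "y": [0, 0, 1, 1, 0, 0, 1, 1],
})
X, y = toy.drop(columns="y"), toy["y"]

# Stratify so both splits keep the same class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contact"]),
])

# Scaler and encoder statistics are learned from the training split only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)),
])
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
print(val_pred)
```

Keeping the preprocessing inside the fitted object also means the same transformations are applied automatically at inference time.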


4. Model Selection

In [12]:
from sagemaker.sklearn.estimator import SKLearn
import sagemaker
import os
os.environ["AWS_DEFAULT_REGION"] = "us-west-2"

# Create the session used by the estimator (same call as in the Tips section)
session = sagemaker.Session()

sklearn_estimator = SKLearn(
    entry_point="train.py",  # training script, expected next to this notebook
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.large",
    framework_version="0.23-1",
    sagemaker_session=session,
    hyperparameters={"n_estimators": 100, "max_depth": 5}
)
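
The estimator above points at a `train.py` entry point that the notebook does not show. The following is a minimal sketch of what such a script could look like, not the author's script. It assumes a `RandomForestClassifier`, a single file named `data.csv` in each channel with the target in column `y`, and a model file named `model.joblib`; all three are illustrative choices. The `SM_MODEL_DIR` and `SM_CHANNEL_*` environment variables and the `model_fn` hook are the conventions the SageMaker scikit-learn container uses.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=5)
    # SageMaker sets these environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "."))
    return parser.parse_args(argv)


def load_xy(folder):
    # Assumes a single file named data.csv with the target in column "y".
    frame = pd.read_csv(os.path.join(folder, "data.csv"))
    return frame.drop(columns="y"), frame["y"]


def main(argv=None):
    args = parse_args(argv)
    X_train, y_train = load_xy(args.train)
    X_val, y_val = load_xy(args.validation)

    model = RandomForestClassifier(
        n_estimators=args.n_estimators, max_depth=args.max_depth, random_state=0
    )
    model.fit(X_train, y_train)

    # Printed so a tuner regex can pick the value out of the training log.
    accuracy = accuracy_score(y_val, model.predict(X_val))
    print(f"validation:accuracy={accuracy:.4f}")

    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    # Called by the SageMaker scikit-learn container to load the model for inference.
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# In the real script, finish with:
#     if __name__ == "__main__":
#         main()
```

Passing `argv` explicitly makes the script easy to smoke-test locally, for example `main(["--train", "data/train", "--validation", "data/val"])`, before submitting a training job.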


5. Hyperparameter Tuning

In [None]:
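from sagemaker.tuner import HyperparameterTuner, IntegerParameter

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(50, 200),
    "max_depth": IntegerParameter(3, 10),
}

# A custom training script needs a regex that extracts the objective metric from
# its logs. This one assumes train.py prints a line like "validation:accuracy=0.9012".
metric_definitions = [
    {"Name": "validation:accuracy", "Regex": r"validation:accuracy=([0-9\.]+)"}
]

tuner = HyperparameterTuner(
    estimator=sklearn_estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=10,
    max_parallel_jobs=2
)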
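
With a custom training script, the tuner reads the objective metric from the training log through a metric definition, a name plus a regex with one capture group. A quick local check that a regex matches what the script prints can save a failed tuning job. The log line below is hypothetical and assumes the script prints the metric as `validation:accuracy=<value>`.

```python
import re

# Hypothetical log line; the training script must print the metric in this form.
log_line = "validation:accuracy=0.9012"

# Pattern of the kind that would be passed to the tuner as a metric definition.
metric_definition = {
    "Name": "validation:accuracy",
    "Regex": r"validation:accuracy=([0-9\.]+)",
}

match = re.search(metric_definition["Regex"], log_line)
value = float(match.group(1))
print(metric_definition["Name"], value)
```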



6. Model Training

In [None]:
tuner.fit({"train": train_input, "validation": validation_input})
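
The `train_input` and `validation_input` channels passed to `tuner.fit` are not defined anywhere in the notebook. One way they might be created is sketched below on toy data: split the frame, write one CSV per channel, and upload each to S3. The `upload_channel` helper, the `bank-marketing` prefix and the `data.csv` file name are hypothetical; `Session.upload_data` and `TrainingInput` are the SageMaker SDK calls assumed for the upload.

```python
import os
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the engineered frame (values are illustrative).
toy = pd.DataFrame({
    "age": [0.1, -0.3, 1.2, 0.7, -1.1, 0.4, 0.9, -0.6],
    "y": [0, 0, 1, 1, 0, 0, 1, 1],
})

train_df, val_df = train_test_split(
    toy, test_size=0.25, stratify=toy["y"], random_state=0
)

# One folder per channel, each holding a single CSV file.
workdir = tempfile.mkdtemp()
paths = {}
for name, frame in {"train": train_df, "validation": val_df}.items():
    os.makedirs(os.path.join(workdir, name))
    paths[name] = os.path.join(workdir, name, "data.csv")
    frame.to_csv(paths[name], index=False)


def upload_channel(local_path, channel, prefix="bank-marketing"):
    """Upload one CSV to the session's default bucket and wrap it as a training input.

    Needs AWS credentials, so it is defined here but not called.
    """
    import sagemaker
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    s3_uri = session.upload_data(local_path, key_prefix=f"{prefix}/{channel}")
    return TrainingInput(s3_uri, content_type="text/csv")


# train_input = upload_channel(paths["train"], "train")
# validation_input = upload_channel(paths["validation"], "validation")
print(paths)
```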

7. Model Deployment (Serverless)

In [None]:
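from sagemaker.serverless import ServerlessInferenceConfig

best_model = tuner.best_estimator()
# deploy() expects a ServerlessInferenceConfig object, not a plain dict
serverless_config = ServerlessInferenceConfig(memory_size_in_mb=2048, max_concurrency=5)
predictor = best_model.deploy(serverless_inference_config=serverless_config)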
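
To test the endpoint, some sample rows have to be posted to it. The sketch below builds a headerless CSV payload locally and defines, without calling it, a hypothetical `invoke` helper around the `boto3` `sagemaker-runtime` client. The `text/csv` content type is an assumption about what the model's input handling accepts, and the feature values are illustrative.

```python
import csv
import io


def to_csv_payload(rows):
    """Serialise feature rows (lists of values) as headerless CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def invoke(endpoint_name, rows):
    """Post rows to a deployed endpoint; needs AWS credentials, so it is not called here."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(rows),
    )
    return response["Body"].read().decode("utf-8")


# Two illustrative rows with three already-encoded features each.
payload = to_csv_payload([[0.5, 1, 0], [-1.2, 0, 1]])
print(payload)
```

With a deployed predictor, the call would look like `invoke(predictor.endpoint_name, rows)`, where each row has the same columns, in the same order, as the training features.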


8. Batch Transform
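
The recommended steps call for storing the input in a JSON Lines file and running a batch transform job on it. The sketch below writes a toy frame as JSON Lines locally and defines, without calling it, a hypothetical `run_batch_transform` helper around the estimator's `transformer` method. As far as I know the default scikit-learn container handlers do not parse JSON Lines, so the entry point would likely need custom `input_fn` and `output_fn` functions; treat the content types here as assumptions.

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for the feature frame (values are illustrative).
toy = pd.DataFrame({"age": [0.1, -0.3], "campaign": [1.0, 2.0]})

# JSON Lines: one JSON object per line.
jsonl_path = os.path.join(tempfile.mkdtemp(), "input.jsonl")
toy.to_json(jsonl_path, orient="records", lines=True)

with open(jsonl_path) as handle:
    records = [json.loads(line) for line in handle if line.strip()]
print(records)


def run_batch_transform(estimator, input_s3_uri, output_s3_uri):
    """Run a batch transform job over a JSON Lines file already uploaded to S3.

    Needs AWS credentials and a trained estimator, so it is not called here.
    """
    transformer = estimator.transformer(
        instance_count=1,
        instance_type="ml.m5.large",
        output_path=output_s3_uri,
        accept="application/jsonlines",
        assemble_with="Line",
    )
    transformer.transform(
        input_s3_uri,
        content_type="application/jsonlines",
        split_type="Line",
    )
    transformer.wait()
    return transformer.output_path
```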

In [None]:
9. Performance Calculation

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

y_true = df_test['y']
y_pred = df_test['predicted']

# Metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f" Accuracy:  {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall:    {recall:.4f}")
print(f" F1 Score:  {f1:.4f}")
print("\n Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("\n📊 Classification Report:")
print(classification_report(y_true, y_pred))
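
The cell above assumes a `df_test` frame that holds the true labels in `y` and the batch predictions in `predicted`, but the notebook never builds it. A minimal sketch of how the two could be joined, assuming the batch output contains one JSON-encoded prediction per line in the same order as the input rows (the actual format depends on the model's output handling):

```python
import json

import pandas as pd

# Toy stand-ins: true labels and the raw text a batch transform job might return.
labels = pd.DataFrame({"y": [0, 1, 0, 1]})
raw_output = "0\n1\n1\n1\n"  # assumed format: one prediction per line, input order

predictions = [json.loads(line) for line in raw_output.splitlines() if line.strip()]

# Row counts must match, otherwise labels and predictions are misaligned.
assert len(predictions) == len(labels)

df_test = labels.copy()
df_test["predicted"] = predictions
print(df_test)
```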


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["No", "Yes"], yticklabels=["No", "Yes"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
