# **Preprocessing**

I need to:
1. Encode object variables
2. Remove duration as a feature from X

### Imports

In [1]:
import numpy as np
import pandas as pd

### Anchor Relative Dir

In [5]:
import os

# Move working directory to project root if executed inside notebooks/
if os.getcwd().endswith("notebooks"):
    os.chdir("..")

print("Working directory:", os.getcwd())


Working directory: c:\Coding\pytorch\bank-marketing-ml


### Define our output paths

In [8]:
processed_data_path = "data/processed/"

### Load Dataset

In [6]:
dataset_path = "data/raw/bank-full.csv"
bank_df = pd.read_csv(dataset_path, sep=";")


# Encoding

The non-numerical features, numbered by their column position in the dataset, are:
1)   job           object
2)   marital       object
3)   education     object
4)   default       object
6)   housing       object
7)   loan          object
8)   contact       object
10)  month         object
15)  poutcome      object
16)  y             object

In [7]:
object_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
# Visualize the unique values in each object feature
for feature in object_features:
    unique_values = bank_df[feature].unique()
    print(f"Feature: {feature}")
    print(f"Unique Values: {unique_values}\n")

Feature: job
Unique Values: ['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']

Feature: marital
Unique Values: ['married' 'single' 'divorced']

Feature: education
Unique Values: ['tertiary' 'secondary' 'unknown' 'primary']

Feature: default
Unique Values: ['no' 'yes']

Feature: housing
Unique Values: ['yes' 'no']

Feature: loan
Unique Values: ['no' 'yes']

Feature: contact
Unique Values: ['unknown' 'cellular' 'telephone']

Feature: month
Unique Values: ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']

Feature: poutcome
Unique Values: ['unknown' 'failure' 'other' 'success']

Feature: y
Unique Values: ['no' 'yes']



### Label Encoding
Only one of these features has a natural hierarchy:
* education (Primary < Secondary < Tertiary)

Instead of one-hot encoding it, we preserve the natural hierarchy by encoding Primary as 0, Secondary as 1, and Tertiary as 2.
It is important to note that neural networks do not gain any inherent benefit from this ordering.

However, we cannot place education "unknown" values on this scale. For this reason we drop those samples, as they make up less than 5% of our dataset.

### One-Hot
The remaining categorical features, whether binary or multi-class, have no natural hierarchy but must still be converted to numerical values.

In [10]:
# Label encode ordinal features (education)
# Drop 'unknown' education samples; .copy() avoids a SettingWithCopyWarning on the assignment below
bank_df = bank_df[bank_df['education'] != 'unknown'].copy()
education_mapping = {'primary': 0, 'secondary': 1, 'tertiary': 2}
bank_df['education'] = bank_df['education'].map(education_mapping)
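
The reason for dropping the `unknown` rows before mapping can be sketched on a toy Series: `Series.map` with a dict returns `NaN` for any value that is not a key of the dict. The Series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical), using the same categories as the `education` column
toy_education = pd.Series(["primary", "tertiary", "unknown", "secondary"])
education_mapping = {"primary": 0, "secondary": 1, "tertiary": 2}

# Values absent from the mapping become NaN
mapped_with_unknown = toy_education.map(education_mapping)

# Filtering first leaves only values that the mapping covers
known_only = toy_education[toy_education != "unknown"]
mapped_clean = known_only.map(education_mapping)

print(mapped_with_unknown.isna().sum())  # count of rows the mapping could not encode
print(mapped_clean.tolist())
```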


In [11]:
# one hot encode nominal features
nominal_features = ['job', 'marital', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
bank_df = pd.get_dummies(bank_df, columns=nominal_features, drop_first=True)
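
As a rough illustration of what `drop_first=True` does, the sketch below encodes a small made-up frame (the rows are hypothetical, not from the dataset). For each feature the first category in sorted order is dropped and becomes the implicit baseline, so a binary column such as `y` collapses into a single indicator column `y_yes`.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one multi-class and one binary feature
toy_df = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "y": ["no", "yes", "no", "no"],
})

encoded = pd.get_dummies(toy_df, columns=["marital", "y"], drop_first=True)

# 'divorced' and 'no' are dropped; a divorced row is all zeros in the marital columns
print(encoded.columns.tolist())
print(encoded.astype(int))
```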

In [12]:
# Remove duration as a feature, as it is not known before a call is performed
bank_df = bank_df.drop(columns=['duration'])

# Normalize the data

In [14]:
from sklearn.preprocessing import StandardScaler

# Standardize the numerical features to zero mean and unit variance
numerical_features = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']
scaler = StandardScaler()
bank_df[numerical_features] = scaler.fit_transform(bank_df[numerical_features])
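
For reference, `StandardScaler` rescales each column as $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation (`ddof=0`). A minimal NumPy sketch of the same computation, using made-up values rather than the real `age` column:

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([25.0, 35.0, 45.0, 55.0])

mean = ages.mean()
std = ages.std()  # NumPy defaults to ddof=0, the same convention StandardScaler uses
z = (ages - mean) / std

# After standardization the column has zero mean and unit standard deviation
print(z.mean(), z.std())
```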

## Class Imbalance
The target variable (`y`) is heavily imbalanced, so we need to address the class imbalance. How can we do this?
**2 Methods**

1) Oversampling
    * we create or find more samples of the minority class to balance the classes
2) Undersampling
    * we remove or condense samples of the majority class to balance the classes
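
Both methods can be sketched in their simplest random form with plain pandas. This is only an illustration on a made-up frame: the column name `y_yes` assumes the one-hot encoded target produced earlier, and random duplication is just one basic way of obtaining more minority samples.

```python
import pandas as pd

# Toy imbalanced frame (hypothetical): 8 majority rows (0) and 2 minority rows (1)
toy_df = pd.DataFrame({"x": range(10), "y_yes": [0] * 8 + [1] * 2})

majority = toy_df[toy_df["y_yes"] == 0]
minority = toy_df[toy_df["y_yes"] == 1]

# Random oversampling: draw minority rows with replacement up to the majority size
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Random undersampling: keep a random subset of majority rows matching the minority size
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print(oversampled["y_yes"].value_counts().to_dict())
print(undersampled["y_yes"].value_counts().to_dict())
```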


I will attempt to undersample (downsample) the data: I will cluster the majority class with k-means and replace some clusters with their centroids.
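
One possible reading of this plan is sketched below, under stated assumptions: the data is synthetic, the number of clusters is set equal to the number of minority samples, and every cluster (not just some) is replaced by its centroid. The variable names are hypothetical and the notebook's final approach may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data (hypothetical): 200 majority samples and 20 minority samples, 2 features
X_majority = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minority = rng.normal(loc=3.0, scale=1.0, size=(20, 2))

# Cluster the majority class into as many clusters as there are minority samples
kmeans = KMeans(n_clusters=len(X_minority), n_init=10, random_state=0)
kmeans.fit(X_majority)

# Each centroid stands in for all majority samples assigned to its cluster
X_majority_reduced = kmeans.cluster_centers_

# Reassemble a balanced dataset: label 0 for majority centroids, 1 for minority samples
X_balanced = np.vstack([X_majority_reduced, X_minority])
y_balanced = np.concatenate([
    np.zeros(len(X_majority_reduced), dtype=int),
    np.ones(len(X_minority), dtype=int),
])

print(X_balanced.shape, np.bincount(y_balanced))
```

Centroids are synthetic points rather than real rows, so one-hot and ordinal columns in the reduced majority set will generally take fractional values.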


# Save processed data

In [None]:
# save processed data to the already created processed data directory
bank_df.to_csv(os.path.join(processed_data_path, "bank_processed.csv"), index=False)  # if using the imbalanced data, change this
