# Predicting Subscription Numbers of JYB Telemarketing Dataset: Data Preparation and Feature Engineering

<em>"The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed."</em> - <a href="https://www.kaggle.com/datasets/aguado/telemarketing-jyb-dataset">Telemarketing JYB Dataset on Kaggle</a><br>

This is a binary classification problem: predict whether a customer will subscribe, based on the collected data.
The dataset comprises:
- Bank client data.
- Previous contact data.
- Social and economic attributes.
<br><br>
<b>In this notebook we engineer new features and prepare the features for model training.</b>

## Import Libraries and Data
Here the libraries pandas and NumPy are imported in order to load and manipulate the data.<br>
A custom script 'data_cleaning' is imported containing functions to clean the data.<br>
A custom script 'data_preparation' is imported containing functions to prepare the data and engineer features.<br>
All the training data is then cleaned using 'clean_all', which wraps the entire cleaning process in one function.

In [4]:
import pandas as pd
import numpy as np
import pickle

import sys
sys.path.append("../src/")
import data_cleaning as dclean
import data_preparation as dprep

df = pd.read_csv('../data/raw/train.csv', index_col=None, delimiter=";")
df = dclean.clean_all(df)

Dropping unnamed columns.
Renaming columns.
Dropped columns with 'unknown' proportion > 10.0% and rows with < 10.0%
Dropped columns: ['default'].
Dropped columns: ['prev_days']
Dropped columns: ['emp_var_rate', 'euribor_3_month'].


## Feature Engineering
Feature engineering consists of appending the statistics of each feature as additional features. These statistics include the min, max, mean and median, as well as other percentiles. <br>
The features are then encoded into numerical values as follows:
- Label encoding for binary features.
- Ordinal encoding for features with ordered values.
- One-hot encoding for multi-categorical features with no order.
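
The three encodings listed above can be sketched on a toy frame with plain pandas. This is a minimal illustration and not the code inside `dprep.encode`, which is not shown here: the sample values, the 1-based month ranking and the `no`/`yes` mapping are all assumptions made for the example.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "month": ["mar", "oct", "may"],
    "marital": ["married", "single", "divorced"],
})

# Label encoding: map a binary feature onto 0/1 (assumed mapping).
toy["housing"] = toy["housing"].map({"no": 0, "yes": 1})

# Ordinal encoding: map ordered categories onto their rank (assumed 1-based).
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
toy["month"] = toy["month"].map({m: i + 1 for i, m in enumerate(month_order)})

# One-hot encoding: one indicator column per unordered category.
toy = pd.get_dummies(toy, columns=["marital"], dtype=int)

print(toy)
```

One-hot encoding is kept for features such as `marital` because assigning them integer ranks would impose an order that the categories do not have.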

In [5]:
df = dprep.add_statistical_features(df)
df = dprep.encode(df)
df.to_csv('../data/interim/data_semi_prepped.csv')

df.head()

Label encoding of ['housing', 'loan', 'contact']
Ordinal encoding of ['month', 'day_of_week', 'education']
One-Hot Encoding of ['job', 'marital', 'prev_outcome']


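| Unnamed: 0 | age | education | housing | loan | contact | month | day_of_week | campaign | prev_nr_contacts | cons_price_idx | ... | job_services | job_student | job_technician | job_unemployed | marital_divorced | marital_married | marital_single | prev_outcome_failure | prev_outcome_nonexistent | prev_outcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 4 | 1 | 0 | 0 | 10 | 1 | 1 | 0 | 93.2 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 33 | 5 | 1 | 0 | 0 | 10 | 3 | 1 | 0 | 93.2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 54 | 5 | 1 | 0 | 0 | 4 | 0 | 1 | 0 | 92.893 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 53 | 4 | 0 | 1 | 0 | 5 | 3 | 1 | 2 | 92.963 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 42 | 5 | 1 | 0 | 0 | 7 | 1 | 2 | 0 | 93.444 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |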


## Data Preparation
- Convert the data to NumPy arrays.
- Split the data into 5 stratified folds for use in K-fold cross-validation; stratification preserves the class distribution in every fold.
- Resample each training fold: SMOTE oversamples the positive examples and TomekLinks undersamples the negative examples close to or across the decision boundary.
- Scale the features with MinMaxScaler, which seems to give better results than RobustScaler and so is selected.
<br><br>
The output of the cell shows that the class imbalance is preserved in the test folds and that the features are scaled to the range 0-1.
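
As a rough illustration of the splitting and scaling steps, the sketch below uses scikit-learn's `StratifiedKFold` and `MinMaxScaler` on synthetic data. It is not the implementation of `dprep.split_data` or `dprep.scale_folds`, which is not shown here. The synthetic arrays, the random seeds and the choice to fit the scaler on the training part only are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced data (assumption: roughly 11% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.11).astype(int)

# Stratified splitting keeps the positive rate of every fold close to the overall rate.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_stats = []
for train_idx, test_idx in skf.split(X, y):
    # Fit the scaler on the training part only, then apply it to both parts.
    scaler = MinMaxScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])
    fold_stats.append(
        (y[train_idx].mean(), y[test_idx].mean(), X_train.min(), X_train.max())
    )

for i, (train_rate, test_rate, lo, hi) in enumerate(fold_stats):
    print(f"fold {i}: train positives {train_rate:.2%}, "
          f"test positives {test_rate:.2%}, train range {lo:.1f} to {hi:.1f}")
```

With a scaler fitted on the training part only, test values can fall slightly outside 0-1 whenever a test sample exceeds the training extremes.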

In [3]:
X, y = dprep.df_to_numpy(df)
folds = dprep.split_data(X, y)
folds = dprep.resample_folds(folds)
folds = dprep.scale_folds(folds)

dprep.skewedness(folds)

Test Dataset: fold 0 values range 0.0 to 1.0, with positive sample proportion 11.33%.
Train Dataset: fold 0 values range 0.0 to 1.0, with positive sample proportion 50.09%.
Test Dataset: fold 1 values range 0.0 to 1.0, with positive sample proportion 11.35%.
Train Dataset: fold 1 values range 0.0 to 1.0, with positive sample proportion 50.08%.
Test Dataset: fold 2 values range 0.0 to 1.0, with positive sample proportion 11.35%.
Train Dataset: fold 2 values range 0.0 to 1.0, with positive sample proportion 50.10%.
Test Dataset: fold 3 values range 0.0 to 1.0, with positive sample proportion 11.35%.
Train Dataset: fold 3 values range 0.0 to 1.0, with positive sample proportion 50.10%.
Test Dataset: fold 4 values range 0.0 to 1.0, with positive sample proportion 11.33%.
Train Dataset: fold 4 values range 0.0 to 1.0, with positive sample proportion 50.08%.
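
The roughly balanced training folds reported above are the result of oversampling. The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in NumPy as below. This is a simplified illustration and not the code behind `dprep.resample_folds`: the function name `smote_like_oversample` and its parameters are hypothetical, and the Tomek-link undersampling step is omitted.

```python
import numpy as np

def smote_like_oversample(X, y, k=5, seed=0):
    """Add synthetic class-1 samples until both classes have equal counts.

    Hypothetical helper for illustration; builds a full pairwise distance
    matrix, so it is only suitable for small minority classes.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)

    # Pairwise distances between minority samples; exclude self-matches.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest minority neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point a random fraction of the way to the neighbour.
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[nb] - X_min[base])

    X_out = np.vstack([X, X_new])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out

# Usage on synthetic data: 20 positives and 180 negatives.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.array([1] * 20 + [0] * 180)
X_res, y_res = smote_like_oversample(X, y)
print(X_res.shape, y_res.mean())
```

Resampling is applied to the training folds only, which is why the test folds in the output keep their original positive proportion.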


## Export Folds and NumPy Arrays

In [4]:
# Use a context manager so the file handle is closed after writing.
with open("../data/processed/folds.p", "wb") as f:
    pickle.dump(folds, f)
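
A later notebook can read the folds back with `pickle.load`. The round trip below is a minimal sketch: the placeholder `folds` object and the temporary path are assumptions made so the example runs on its own, since the real layout of `folds` is not shown here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real `folds` object.
folds = [{"fold": 0, "note": "placeholder"}]

# Write to a temporary location instead of ../data/processed/.
path = os.path.join(tempfile.mkdtemp(), "folds.p")
with open(path, "wb") as f:
    pickle.dump(folds, f)

# Load the object back; only unpickle files from a trusted source.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == folds)
```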