# Feature Engineering

## Introduction to Feature Engineering

### Purpose of feature engineering

The purpose of feature engineering in this project is to improve the model's predictive power by creating new features that capture relationships in the data that the original features do not expose on their own. Specifically, the aim is to improve the **ROC AUC** score, so that the model separates the positive and negative classes better despite the class imbalance in the dataset.

## Data Loading

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv("../data/bank-additional/bank-additional-full.csv",
                   sep=";")

# Drop duplicate rows
data.drop_duplicates(inplace=True)

# Reset index
data.reset_index(drop=True, inplace=True)

data.shape

(41176, 21)

## Performance before feature engineering and feature selection

In [2]:
# Splitting the original features (before feature engineering) into X and y
X = data.drop(['y', 'duration'], axis=1)
y = data['y']

from sklearn.preprocessing import LabelEncoder

# Label encode the target
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the data into train and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42)

print("\nX Train:", X_train.shape)
print("X Test:", X_test.shape)
print("Y Train:", y_train.shape)
print("Y Test:", y_test.shape)


X Train: (32940, 19)
X Test: (8236, 19)
Y Train: (32940,)
Y Test: (8236,)


In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
binary_features = X.select_dtypes(include=['int32']).columns

# Column Transformer to apply OneHotEncoder to categorical features 
# and StandardScaler to numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(
            handle_unknown='ignore', 
            sparse_output=False), categorical_features)
    ])

# Pipeline with preprocessor and Logistic Regression
pipe_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, 
                                      max_iter=10000,
                                      class_weight='balanced'))
])

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


scores = cross_val_score(pipe_lr,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.791 +/- 0.008


## Creating New Features

In [4]:
binned_continous_features = pd.DataFrame({
    'campaign_binned': pd.cut(
        data['campaign'], 
        bins=5, 
        labels=['0-12', '12-23', '23-34', '34-45', '45-56']).astype('str'),
    'previous_binned': pd.cut(
        data['previous'], 
        bins=5, 
        labels=['one', 'two', 'three', 'four', 'five']).astype('str'),
    'age_binned': pd.cut(
        data['age'], 
        bins=[15, 20, 30, 40, 50, 60, 70,  np.inf], 
        labels=['15-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70+']).astype('str'),
    'pdays_binned': pd.cut(
        data['pdays'], 
        bins=[-1, 5, 10, 15, 20, 25, 30, np.inf], 
        labels=['0-5', '5-10', '10-15', '15-20', '20-25', '25-30', 'never_contacted']).astype('str'),
})
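
With `bins=5`, `pd.cut` derives equal-width edges from the observed minimum and maximum, so hand-written labels such as `'0-12'` describe the bins only approximately. A minimal sketch, on a made-up toy series rather than the bank data, of how the actual edges could be inspected with `retbins=True`:

```python
import pandas as pd

# Toy stand-in for a count column such as `campaign` (values are illustrative)
toy = pd.Series([1, 2, 3, 10, 25, 40, 56])

# retbins=True returns the computed bin edges alongside the binned values
binned, edges = pd.cut(toy, bins=5, retbins=True)

# Five equal-width bins need six edges; the lowest edge is nudged slightly
# below the minimum so that the minimum value falls inside the first bin
print(edges)
print(binned)
```

Printing the edges before choosing labels makes it easier to keep the label text in line with what each bin really contains.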

In [5]:
data_continous_binned = pd.concat([data, binned_continous_features],
                                  axis=1)

In [6]:
interaction_features = pd.DataFrame({
    'marital_job_unknown_unknown': (
        (data['marital'] == 'unknown') & 
        (data['job'] == 'unknown')).astype(int),
    'edu_job_basic.4y_housemaid': (
        (data['education'] == 'basic.4y') & 
        (data['job'] == 'housemaid')).astype(int),
    'edu_job_illiterate_self-employed': (
        (data['education'] == 'illiterate') & 
        (data['job'] == 'self-employed')).astype(int),
    'edu_job_illiterate_retired': (
        (data['education'] == 'illiterate') & 
        (data['job'] == 'retired')).astype(int),
    'edu_job_unknown_student': (
        (data['education'] == 'unknown') & 
        (data['job'] == 'student')).astype(int),
    'edu_job_unknown_unknown': (
        (data['education'] == 'unknown') & 
        (data['job'] == 'unknown')).astype(int), 
    'month_job_dec_retired': (
        (data['month'] == 'dec') & 
        (data['job'] == 'retired')).astype(int),
    'month_job_oct_retired': (
        (data['month'] == 'oct') & 
        (data['job'] == 'retired')).astype(int),
    'month_job_dec_student': (
        (data['month'] == 'dec') & 
        (data['job'] == 'student')).astype(int),
    'month_job_sep_student': (
        (data['month'] == 'sep') & 
        (data['job'] == 'student')).astype(int),
    'default_job_yes_technician': (
        (data['default'] == 'yes') & 
        (data['job'] == 'technician')).astype(int),
    'default_job_yes_unemployed': (
        (data['default'] == 'yes') & 
        (data['job'] == 'unemployed')).astype(int),
    'default_edu_yes_professional': (
        (data['default'] == 'yes') & 
        (data['education'] == 'professional.course')).astype(int),
    'default_week_yes_tue': (
        (data['default'] == 'yes') &
        (data['day_of_week'] == 'tue')).astype(int),
    'default_month_yes_aug': (
        (data['default'] == 'yes') &
        (data['month'] == 'aug')).astype(int),
    'loan_housing_unknown_unknown': (
        (data['loan'] == 'unknown') & 
        (data['housing'] == 'unknown')).astype(int),
    'poutcome_job_success_student': (
        (data['poutcome'] == 'success') & 
        (data['job'] == 'student')).astype(int),
    'poutcome_month_success_dec': (
        (data['poutcome'] == 'success') & 
        (data['month'] == 'dec')).astype(int),
    'poutcome_month_success_mar': (
        (data['poutcome'] == 'success') & 
        (data['month'] == 'mar')).astype(int),
    'poutcome_month_success_oct': (
        (data['poutcome'] == 'success') & 
        (data['month'] == 'oct')).astype(int),
    'poutcome_month_success_sep': (
        (data['poutcome'] == 'success') & 
        (data['month'] == 'sep')).astype(int),
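    # NOTE: the next three features are built from the target `y`, so they
    # leak label information and are not available at prediction time.
    'y_month_yes_dec': (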
        (data['y'] == 'yes') & 
        (data['month'] == 'dec')).astype(int),
    'y_month_yes_mar': (
        (data['y'] == 'yes') & 
        (data['month'] == 'mar')).astype(int),
    'y_poutcome_yes_success': (
        (data['y'] == 'yes') & 
        (data['poutcome'] == 'success')).astype(int)
})
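
The indicator columns above all follow one pattern: 1 where two columns take specific levels at the same time, 0 otherwise. A sketch of a small helper that builds such columns from a list of level pairs follows; `make_interactions` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def make_interactions(df, pairs):
    """Build 0/1 indicator columns from (col_a, level_a, col_b, level_b) tuples."""
    out = {}
    for col_a, level_a, col_b, level_b in pairs:
        # Naming mirrors the notebook: <col_a>_<col_b>_<level_a>_<level_b>
        name = f"{col_a}_{col_b}_{level_a}_{level_b}"
        out[name] = ((df[col_a] == level_a) & (df[col_b] == level_b)).astype(int)
    return pd.DataFrame(out, index=df.index)


# Toy data, for illustration only
toy = pd.DataFrame({
    "month": ["dec", "oct", "dec", "sep"],
    "job": ["retired", "retired", "student", "student"],
})
pairs = [
    ("month", "dec", "job", "retired"),
    ("month", "sep", "job", "student"),
]
features = make_interactions(toy, pairs)
print(features)
```

Keeping the pairs in a plain list makes typos in column names less likely, because each name is generated from the same values used in the comparison.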

In [7]:
pdays_binned_interaction_features = pd.DataFrame({
    'pdays_binned_job_20-25_retired': (
        (data_continous_binned['pdays_binned'] == '20-25') & 
        (data_continous_binned['job'] == 'retired')).astype(int),
    'pdays_binned_job_10-15_student': (
        (data_continous_binned['pdays_binned'] == '10-15') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'pdays_binned_job_15-20_student': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'pdays_binned_job_5-10_student': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'pdays_binned_job_25-30_technician': (
        (data_continous_binned['pdays_binned'] == '25-30') & 
        (data_continous_binned['job'] == 'technician')).astype(int),
    'pdays_binned_marital_15-20_unknown': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['marital'] == 'unknown')).astype(int),
    'pdays_binned_edu_15-20_unknown': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['education'] == 'unknown')).astype(int),
    'pdays_binned_edu_25-30_professional': (
        (data_continous_binned['pdays_binned'] == '25-30') & 
        (data_continous_binned['education'] == 'professional.course')).astype(int),
    'pdays_binned_month_0-5_oct': (
        (data_continous_binned['pdays_binned'] == '0-5') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'pdays_binned_month_0-5_sep': (
        (data_continous_binned['pdays_binned'] == '0-5') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'pdays_binned_month_10-15_mar': (
        (data_continous_binned['pdays_binned'] == '10-15') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'pdays_binned_month_15-20_oct': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'pdays_binned_month_15-20_sep': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'pdays_binned_month_20-25_mar': (
        (data_continous_binned['pdays_binned'] == '20-25') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'pdays_binned_month_20-25_sep': (
        (data_continous_binned['pdays_binned'] == '20-25') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'pdays_binned_month_25-30_oct': (
        (data_continous_binned['pdays_binned'] == '25-30') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'pdays_binned_month_5-10_dec': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['month'] == 'dec')).astype(int),
    'pdays_binned_month_5-10_mar': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'pdays_binned_month_5-10_oct': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'pdays_binned_month_5-10_sep': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'pdays_binned_poutcome_0-5_success': (
        (data_continous_binned['pdays_binned'] == '0-5') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'pdays_binned_poutcome_10-15_success': (
        (data_continous_binned['pdays_binned'] == '10-15') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'pdays_binned_poutcome_15-20_failure': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['poutcome'] == 'failure')).astype(int),
    'pdays_binned_poutcome_15-20_success': (
        (data_continous_binned['pdays_binned'] == '15-20') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'pdays_binned_poutcome_20-25_failure': (
        (data_continous_binned['pdays_binned'] == '20-25') & 
        (data_continous_binned['poutcome'] == 'failure')).astype(int),
    'pdays_binned_poutcome_20-25_success': (
        (data_continous_binned['pdays_binned'] == '20-25') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'pdays_binned_poutcome_25-30_success': (
        (data_continous_binned['pdays_binned'] == '25-30') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'pdays_binned_poutcome_5-10_success': (
        (data_continous_binned['pdays_binned'] == '5-10') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
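    # NOTE: the remaining five features in this frame are built from the
    # target `y`, so they leak label information into the feature set.
    'pdays_binned_0-5_yes': (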
        (data_continous_binned['pdays_binned'] == '0-5') &
        (data['y'] == 'yes')).astype(int),
    'pdays_binned_10-15_yes': (
        (data_continous_binned['pdays_binned'] == '10-15') &
        (data['y'] == 'yes')).astype(int),
    'pdays_binned_20-25_yes': (
        (data_continous_binned['pdays_binned'] == '20-25') &
        (data['y'] == 'yes')).astype(int),
    'pdays_binned_25-30_yes': (
        (data_continous_binned['pdays_binned'] == '25-30') &
        (data['y'] == 'yes')).astype(int),
    'pdays_binned_5-10_yes': (
        (data_continous_binned['pdays_binned'] == '5-10') &
        (data['y'] == 'yes')).astype(int),
})

In [8]:
campaign_binned_interaction_feaures = pd.DataFrame({
    'campaign_binned_marital_12-23_unknown': (
        (data_continous_binned['campaign_binned'] == '12-23') & 
        (data_continous_binned['marital'] == 'unknown')).astype(int),
    'campaign_binned_default_45-56_unknown': (
        (data_continous_binned['campaign_binned'] == '45-56') & 
        (data_continous_binned['default'] == 'unknown')).astype(int),
    'campaign_binned_housing_45-56_unknown': (
        (data_continous_binned['campaign_binned'] == '45-56') & 
        (data_continous_binned['housing'] == 'unknown')).astype(int),
    'campaign_binned_loan_45-56_unknown': (
        (data_continous_binned['campaign_binned'] == '45-56') & 
        (data_continous_binned['loan'] == 'unknown')).astype(int),
    'campaign_binned_day_of_week_45-56_mon': (
        (data_continous_binned['campaign_binned'] == '45-56') & 
        (data_continous_binned['day_of_week'] == 'mon')).astype(int)
})

In [9]:
previous_binned_interaction_features = pd.DataFrame({
    'previous_binned_job_five_mngmnt': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['job'] == 'management')).astype(int),
    'previous_binned_job_five_retired': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['job'] == 'retired')).astype(int),
    'previous_binned_job_four_student': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'previous_binned_job_three_student': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'previous_binned_job_two_student': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'previous_binned_month_five_nov': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['month'] == 'nov')).astype(int),
    'previous_binned_month_four_sep': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'previous_binned_month_three_mar': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'previous_binned_month_three_oct': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'previous_binned_month_three_sep': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'previous_binned_month_two_dec': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['month'] == 'dec')).astype(int),
    'previous_binned_month_two_mar': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'previous_binned_month_two_oct': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'previous_binned_month_two_sep': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'previous_binned_poutcome_five_success': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'previous_binned_poutcome_four_success': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'previous_binned_poutcome_three_failure': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['poutcome'] == 'failure')).astype(int),
    'previous_binned_poutcome_three_success': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'previous_binned_poutcome_two_failure': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['poutcome'] == 'failure')).astype(int),
    'previous_binned_poutcome_two_success': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'previous_binned_age_binned_five_60-70': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['age_binned'] == '60-70')).astype(int),
    'previous_binned_age_binned_five_70+': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['age_binned'] == '70+')).astype(int),
    'previous_binned_age_binned_four_15-20': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['age_binned'] == '15-20')).astype(int),
    'previous_binned_age_binned_four_60-70': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['age_binned'] == '60-70')).astype(int),
    'previous_binned_age_binned_three_15-20': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['age_binned'] == '15-20')).astype(int),
    'previous_binned_age_binned_three_60-70': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['age_binned'] == '60-70')).astype(int),
    'previous_binned_age_binned_three_70+': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['age_binned'] == '70+')).astype(int),
    'previous_binned_age_binned_two_15-20': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['age_binned'] == '15-20')).astype(int),
    'previous_binned_age_binned_two_60-70': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['age_binned'] == '60-70')).astype(int),
    'previous_binned_age_binned_two_70+': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['age_binned'] == '70+')).astype(int),
    'previous_binned_pdays_binned_five_0-5': (
        (data_continous_binned['previous_binned'] == 'five') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'previous_binned_pdays_binned_four_0-5': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'previous_binned_pdays_binned_four_20-25': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['pdays_binned'] == '20-25')).astype(int),
    'previous_binned_pdays_binned_four_5-10': (
        (data_continous_binned['previous_binned'] == 'four') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
    'previous_binned_pdays_binned_three_0-5': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'previous_binned_pdays_binned_three_10-15': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['pdays_binned'] == '10-15')).astype(int),
    'previous_binned_pdays_binned_three_15-20': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['pdays_binned'] == '15-20')).astype(int),
    'previous_binned_pdays_binned_three_5-10': (
        (data_continous_binned['previous_binned'] == 'three') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
    'previous_binned_pdays_binned_two_0-5': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'previous_binned_pdays_binned_two_10-15': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['pdays_binned'] == '10-15')).astype(int),
    'previous_binned_pdays_binned_two_15-20': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['pdays_binned'] == '15-20')).astype(int),
    'previous_binned_pdays_binned_two_20-25': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['pdays_binned'] == '20-25')).astype(int),
    'previous_binned_pdays_binned_two_5-10': (
        (data_continous_binned['previous_binned'] == 'two') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
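    # NOTE: the remaining four features in this frame are built from the
    # target `y`, so they leak label information into the feature set.
    'previous_binned_y_5_yes': (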
        (data_continous_binned['previous_binned'] == 'five') &
        (data_continous_binned['y'] == 'yes')).astype(int),
    'previous_binned_y_4_yes': (
        (data_continous_binned['previous_binned'] == 'four') &
        (data_continous_binned['y'] == 'yes')).astype(int),
    'previous_binned_y_3_yes': (
        (data_continous_binned['previous_binned'] == 'three') &
        (data_continous_binned['y'] == 'yes')).astype(int),
    'previous_binned_y_2_yes': (
        (data_continous_binned['previous_binned'] == 'two') &
        (data_continous_binned['y'] == 'yes')).astype(int)
})

In [10]:
age_binned_interaction_features = pd.DataFrame({
    'age_binned_job_15-20_student': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'age_binned_job_20-30_student': (
        (data_continous_binned['age_binned'] == '20-30') & 
        (data_continous_binned['job'] == 'student')).astype(int),
    'age_binned_job_60-70_retired': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['job'] == 'retired')).astype(int),
    'age_binned_job_70+_retired': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['job'] == 'retired')).astype(int),
    'age_binned_edu_15-20_unknown': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['education'] == 'unknown')).astype(int),
    'age_binned_edu_70+_basic.4y': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['education'] == 'basic.4y')).astype(int),
    'age_binned_edu_70+_illiterate': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['education'] == 'illiterate')).astype(int),
    'age_binned_month_15-20_dec': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['month'] == 'dec')).astype(int),
    'age_binned_month_15-20_sep': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'age_binned_month_60-70_dec': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['month'] == 'dec')).astype(int),
    'age_binned_month_60-70_mar': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'age_binned_month_60-70_oct': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'age_binned_month_60-70_sep': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'age_binned_month_70+_dec': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['month'] == 'dec')).astype(int),
    'age_binned_month_70+_mar': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['month'] == 'mar')).astype(int),
    'age_binned_month_70+_oct': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['month'] == 'oct')).astype(int),
    'age_binned_month_70+_sep': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['month'] == 'sep')).astype(int),
    'age_binned_poutcome_15-20_success': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'age_binned_poutcome_60-70_success': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),
    'age_binned_poutcome_70+_success': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['poutcome'] == 'success')).astype(int),    
    'age_binned_pdays_binned_15-20_0-5': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'age_binned_pdays_binned_15-20_10-15': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['pdays_binned'] == '10-15')).astype(int),
    'age_binned_pdays_binned_15-20_15-20': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['pdays_binned'] == '15-20')).astype(int),
    'age_binned_pdays_binned_15-20_5-10': (
        (data_continous_binned['age_binned'] == '15-20') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
    'age_binned_pdays_binned_60-70_0-5': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'age_binned_pdays_binned_60-70_10-15': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['pdays_binned'] == '10-15')).astype(int),
    'age_binned_pdays_binned_60-70_15-20': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['pdays_binned'] == '15-20')).astype(int),
    'age_binned_pdays_binned_60-70_20-25': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['pdays_binned'] == '20-25')).astype(int),
    'age_binned_pdays_binned_60-70_5-10': (
        (data_continous_binned['age_binned'] == '60-70') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
    'age_binned_pdays_binned_70+_5-10': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['pdays_binned'] == '5-10')).astype(int),
    'age_binned_pdays_binned_70+_0-5': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['pdays_binned'] == '0-5')).astype(int),
    'age_binned_pdays_binned_70+_10-15': (
        (data_continous_binned['age_binned'] == '70+') & 
        (data_continous_binned['pdays_binned'] == '10-15')).astype(int),
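    # NOTE: this feature is built from the target `y`, so it leaks label
    # information into the feature set.
    'age_binned_y_70+_yes': (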
        (data_continous_binned['age_binned'] == '70+') &
        (data_continous_binned['y'] == 'yes')).astype(int),
})

In [11]:
enginnered_data = pd.concat([
    data,
    binned_continous_features,
    interaction_features, 
    pdays_binned_interaction_features, 
    campaign_binned_interaction_feaures, 
    previous_binned_interaction_features, 
    age_binned_interaction_features
    ], axis=1)

# Export combined data as csv file
# enginnered_data.to_csv(path_or_buf='../data/bank-additional/bank_engineered.csv',
#                        index=False)

In [12]:
print(enginnered_data.shape)

(41176, 167)
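
Several of the engineered columns above are defined with a condition on `y`, which means they encode the label itself. A rough check for this, sketched on toy data: for each indicator column, compute the share of positive targets among the rows where the indicator is 1. A share of exactly 0 or 1 is a prompt for inspection, not proof, because a rare combination can reach it by chance. The column names and values here are illustrative assumptions.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    "y": ["yes", "no", "yes", "no", "yes", "no"],
    "month": ["dec", "dec", "mar", "mar", "dec", "oct"],
})
# One indicator built from the target, one built from a feature only
toy["y_month_yes_dec"] = ((toy["y"] == "yes") & (toy["month"] == "dec")).astype(int)
toy["month_is_dec"] = (toy["month"] == "dec").astype(int)

target = toy["y"] == "yes"
indicator_cols = ["y_month_yes_dec", "month_is_dec"]

# Share of positive targets among the rows where each indicator fires
purity = {col: target[toy[col] == 1].mean() for col in indicator_cols}

# Indicators that fire only for one class deserve a closer look
suspects = [col for col, share in purity.items() if share in (0.0, 1.0)]
print(purity)
print(suspects)
```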


## Data Preprocessing

In [13]:
# Splitting data into X and y after feature engineering
X = enginnered_data.drop(['y', 'duration'], axis=1)
y = enginnered_data['y']

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# Label encode the target
y = label_encoder.fit_transform(y)

print(X.shape)
print(y.shape)

(41176, 165)
(41176,)


In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
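# NOTE: `.astype(int)` yields int32 only on some platforms (e.g. Windows with
# NumPy < 2); elsewhere the indicator columns are int64 and are picked up by
# `numerical_features` instead.
binary_features = X.select_dtypes(include=['int32']).columns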

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(
            handle_unknown='ignore', 
            sparse_output=False), categorical_features),
        ('bin', 'passthrough', binary_features)
    ])

# Fit and transform the full dataset (this runs before the train/test split)
X_transformed = preprocessor.fit_transform(X)

# Get feature names
num_features_transformed = numerical_features
ohe = preprocessor.named_transformers_['cat']
cat_features_transformed = ohe.get_feature_names_out(categorical_features)

# Combine all feature names
all_features = np.concatenate([num_features_transformed, 
                               cat_features_transformed, 
                               binary_features])

# Convert to DataFrame with column names
X_transformed_df = pd.DataFrame(X_transformed, columns=all_features)

In [15]:
# Split the data into train and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_transformed_df, y,
    test_size=0.2,
    stratify=y,
    random_state=42)

print("X Train:", X_train.shape)
print("X Test:", X_test.shape)
print("Y Train:", y_train.shape)
print("Y Test:", y_test.shape)

X Train: (32940, 228)
X Test: (8236, 228)
Y Train: (32940,)
Y Test: (8236,)
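
In cell 14 the `ColumnTransformer` is fitted on all rows before the split in cell 15, so the scaling statistics and one-hot categories are also learned from the test rows. A sketch of one alternative, on synthetic data with made-up column names: wrapping the preprocessor and the classifier in a single `Pipeline`, so that each cross-validation fold refits the preprocessing on its own training portion.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
n = 400
X_toy = pd.DataFrame({
    "num": rng.normal(size=n),
    "cat": rng.choice(["a", "b", "c"], size=n),
})
signal = X_toy["num"] + (X_toy["cat"] == "a") + rng.normal(scale=0.5, size=n)
y_toy = (signal > 0.5).astype(int)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["num"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
])
pipe = Pipeline([
    ("preprocessor", pre),
    ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# cross_val_score clones and refits the whole pipeline inside every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring="roc_auc")
print(f"CV ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

This mirrors the structure of `pipe_lr` in cell 3, where the preprocessing already lives inside the pipeline.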


## Feature Selection

In [16]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
lr = LogisticRegression(solver='liblinear',
                        penalty='l1',
                        max_iter=10000, 
                        class_weight='balanced', 
                        random_state=42)

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define the hyperparameters and their values to search
param_grid = {'C': np.logspace(-4, 4, 20)}

# Set up GridSearchCV
gs = GridSearchCV(estimator=lr,
                  param_grid=param_grid,
                  scoring='roc_auc',
                  cv=stratified_kfold,
                  n_jobs=-1)

gs = gs.fit(X_train, y_train)

best_model = gs.best_estimator_

# Get the coefficients of the best model
coefficients = best_model.coef_.ravel()
    
# Identify non-zero coefficients
non_zero_indices = np.where(coefficients != 0)[0]

print(f"Number of features selected: {len(non_zero_indices)}")
print(f"Best C parameter: {gs.best_params_['C']}")
print(f"Best mean CV ROC AUC: {gs.best_score_}")

Number of features selected: 76
Best C parameter: 0.615848211066026
Best mean CV ROC AUC: 0.8170033224101279
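
Picking the non-zero L1 coefficients by hand, as in cell 16, can also be expressed with scikit-learn's `SelectFromModel`. The sketch below uses synthetic data, and `C=0.1` is an arbitrary illustrative value, not the tuned value from the grid search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(
    n_samples=500, n_features=30, n_informative=5,
    n_redundant=0, random_state=42)

l1_lr = LogisticRegression(solver="liblinear", penalty="l1", C=0.1,
                           class_weight="balanced", random_state=42)

# For an L1-penalized estimator the default threshold is 1e-5, so features
# whose coefficients were shrunk to zero are dropped
selector = SelectFromModel(l1_lr).fit(X_toy, y_toy)
mask = selector.get_support()
coef = selector.estimator_.coef_.ravel()
X_selected = selector.transform(X_toy)

print(f"Kept {mask.sum()} of {mask.size} features")
```

Because the selector is a transformer, it can sit inside a `Pipeline`, which keeps the selection step within each cross-validation fold.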


In [18]:
X_imp = X_transformed_df.iloc[:, non_zero_indices]

imp_data = pd.concat([X_imp, data['y']], axis=1)

# imp_data.to_csv(path_or_buf='../data/bank-additional/bank_processed_data.csv',
#                 index=False)

## Performance after Feature Engineering and Feature Selection

In [19]:
# Split the data into train and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_imp, y,
    test_size=0.2,
    stratify=y,
    random_state=42)

print("X Train:", X_train.shape)
print("X Test:", X_test.shape)
print("Y Train:", y_train.shape)
print("Y Test:", y_test.shape)

X Train: (32940, 76)
X Test: (8236, 76)
Y Train: (32940,)
Y Test: (8236,)


In [21]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42,
                        class_weight='balanced',
                        max_iter=10000)

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


scores = cross_val_score(lr,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.817 +/- 0.008


## Insights and Justifications

### Rationale for chosen features

The chosen features were derived through a methodical process aimed at uncovering and leveraging the inherent relationships within the data:

- **Interaction Terms**: Interaction terms were created between pairs of categorical variables where a level of one variable was strongly associated with a level of another. A custom function, `categorical_levels_corr()`, developed during the data exploration phase, flags candidate interactions among pairs of features. It takes a threshold parameter, and a threshold of 3.99 was used to decide which interactions to add as new features. These terms capture the combined effect of two categories on the target variable, which may not be visible when the variables are considered separately.

- **Binning Continuous Features**: Continuous features were binned into categorical ones to reduce the impact of outliers and to let the model capture non-linear relationships. Binning also simplifies the model, making it more interpretable and robust.

- **L1 Regularization**: Feature engineering produced 165 candidate features (167 columns including `y` and `duration`), which grew to 228 columns after one-hot encoding the categorical features. L1 regularization was then applied to select the most important features by shrinking the coefficients of less important ones to zero, leaving 76 features. This reduced the dimensionality of the model and focused it on the features that contribute most to predictive performance.

- **Removal of Duration**: The duration feature was removed during this phase due to the following note provided in the data description:

    “duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.”

    Including duration in the model results in an unrealistically high ROC AUC score, as it directly correlates with the outcome and is not available before making a prediction. Therefore, it was discarded to ensure the development of a realistic predictive model.

### Impact of new features on model performance

Adding interaction terms and binned features, then selecting features with L1 regularization, raised the mean cross-validated ROC AUC from 0.791 ± 0.008 before feature engineering to 0.817 ± 0.008 afterwards, a gain of 0.026. This suggests that the engineered features helped the model capture more relevant patterns in the data and separate the classes better despite the class imbalance.
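
Both evaluations use a `StratifiedKFold` splitter with the same seed, so the two sets of scores can be compared fold by fold as well as through their means. A sketch of that comparison on synthetic data follows; the two feature sets are made up for illustration and do not correspond to the bank features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data; with shuffle=False the informative
# features occupy the first six columns
X_all, y_toy = make_classification(
    n_samples=1000, n_features=12, n_informative=6, n_redundant=0,
    weights=[0.9, 0.1], shuffle=False, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, class_weight="balanced")

# The same splitter and seed give identical folds for both feature sets
base = cross_val_score(model, X_all[:, :3], y_toy, cv=cv, scoring="roc_auc")
more = cross_val_score(model, X_all[:, :6], y_toy, cv=cv, scoring="roc_auc")

# Per-fold differences show whether the gain is consistent across folds
diff = more - base
print("Per-fold gain:", np.round(diff, 3))
print(f"Mean gain: {diff.mean():.3f} +/- {diff.std():.3f}")
```

A gain that is positive in every fold is stronger evidence than a difference in means of a similar size to the fold-to-fold spread.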