 # (Don't do this)

This notebook documents an attempt to select an optimal feature set manually. The attempt slightly reduced model performance with the XGBoost classifier.

In [1]:
import pandas as pd
import seaborn as sns
sns.set_theme(context='notebook', style='whitegrid')
from sklearn.preprocessing import OrdinalEncoder
from helper_functions.descriptive_statistics import descriptive_statistics
from helper_functions.utils import calculate_contingency_sparsity, hour_of_day_group, day_of_week

In [2]:
# Load the clean training and testing datasets
train_data = pd.read_csv('../data/1_processed/train_data.csv')
test_data = pd.read_csv('../data/1_processed/test.csv')

## Feature Engineering

Here, we take a deeper dive into feature analysis. We look more closely at the training dataset to find the optimal feature set for conversion prediction.

For our classification problem, we plan on using tree-based methods: XGBoost and Random Forest. For tree-based methods, feature independence and scaling are unnecessary, but targeted feature engineering, like grouping, encoding, or deriving new features, can still improve model performance and interpretability.

First, we notice that the SITE column is problematic. Its high cardinality can lead to overfitting and lower model interpretability. We will drop the SITE column from the training set.

Second, we check the contingency matrices between pairs of features. We are looking for sparse matrices. A sparse contingency matrix between two categorical features indicates that not all possible combinations of their unique values occur in the data. Combining these features will likely result in lower cardinality than their Cartesian product.

Third, we create two new, more informative features out of USER_HOUR_OF_WEEK.

Lastly, we encode the remaining non-numerical features. Since we are going to use tree-based methods for classification, we can use an OrdinalEncoder. Tree-based models decide splits based on thresholds rather than interpreting the encoded values as distances or magnitudes.
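
To make the encoding step concrete, here is a minimal sketch on toy data (the values are hypothetical and not taken from the real dataset). It shows that OrdinalEncoder assigns integer codes in sorted category order, and that a category missing from the training data falls back to the configured unknown value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values, not taken from the real dataset)
toy_train = pd.DataFrame({'OS_FAMILY_NAME': ['Windows', 'OS X', 'Linux', 'Windows']})
toy_test = pd.DataFrame({'OS_FAMILY_NAME': ['Linux', 'Android']})  # 'Android' is unseen

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(toy_train)

categories = encoder.categories_[0].tolist()
train_codes = encoder.transform(toy_train).ravel().tolist()
test_codes = encoder.transform(toy_test).ravel().tolist()

print(categories)   # categories in sorted order
print(train_codes)  # one code per training row
print(test_codes)   # the unseen category becomes -1.0
```

A tree can then isolate any contiguous range of codes with one or two threshold splits, which is why the arbitrary ordering is usually tolerable for these models.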

In [3]:
# Drop the SITE column from both the train and the test sets
train_data = train_data.drop(columns=['SITE'])
test_data = test_data.drop(columns=['SITE'])

In [4]:
# Calculate contingency-matrix sparsity for pairs of features in train_data
calculate_contingency_sparsity(train_data)

{('BROWSER_NAME', 'SUPPLY_VENDOR'): 59,
 ('AD_FORMAT', 'SUPPLY_VENDOR'): 57,
 ('SUPPLY_VENDOR', 'OS_FAMILY_NAME'): 36,
 ('BROWSER_NAME', 'OS_FAMILY_NAME'): 35,
 ('AD_FORMAT', 'BROWSER_NAME'): 28,
 ('AD_FORMAT', 'OS_FAMILY_NAME'): 11}

The contingency matrices between BROWSER_NAME and SUPPLY_VENDOR and between AD_FORMAT and SUPPLY_VENDOR are very sparse: about 60% of the possible value combinations don't occur in the data. That means we can benefit from combining each of those pairs into a single feature.
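
The implementation of `calculate_contingency_sparsity` is not shown in this notebook. As a rough sketch of how such a number could be obtained, the function below (a hypothetical stand-in, not the actual helper) builds a contingency table with `pd.crosstab` for every pair of columns and reports the percentage of cells that are zero.

```python
from itertools import combinations

import pandas as pd


def contingency_sparsity_sketch(df, columns):
    """Percentage of zero cells in the contingency table of each column pair.

    Hypothetical stand-in for the notebook's helper; the real one may differ.
    """
    result = {}
    for a, b in combinations(columns, 2):
        table = pd.crosstab(df[a], df[b])
        zero_share = (table.to_numpy() == 0).mean()
        result[(a, b)] = round(100 * zero_share)
    # Sort from most to least sparse
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))


# Toy data (hypothetical values)
toy = pd.DataFrame({
    'BROWSER_NAME': ['Chrome', 'Chrome', 'Safari', 'Firefox'],
    'SUPPLY_VENDOR': ['google', 'yieldmo', 'google', 'taboola'],
})
sparsity = contingency_sparsity_sketch(toy, ['BROWSER_NAME', 'SUPPLY_VENDOR'])
print(sparsity)
```

In the toy table, 5 of the 9 possible browser and vendor combinations never occur, so the pair is reported as 56% sparse.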

In [5]:
# Create interaction features
train_data['BROWSER_VENDOR']   = train_data['BROWSER_NAME'] + "_" + train_data['SUPPLY_VENDOR']
train_data['AD_FORMAT_VENDOR'] = train_data['AD_FORMAT']    + "_" + train_data['SUPPLY_VENDOR']

test_data['BROWSER_VENDOR']   = test_data['BROWSER_NAME'] + "_" + test_data['SUPPLY_VENDOR']
test_data['AD_FORMAT_VENDOR'] = test_data['AD_FORMAT']    + "_" + test_data['SUPPLY_VENDOR']

# Drop the initial features
train_data = train_data.drop(columns=['BROWSER_NAME', 'SUPPLY_VENDOR', 'AD_FORMAT'])
test_data = test_data.drop(columns=['BROWSER_NAME', 'SUPPLY_VENDOR', 'AD_FORMAT'])
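
A quick way to see why sparsity matters here is to compare the number of categories actually observed in a combined feature with the size of the full Cartesian product. The sketch below does this on toy data (hypothetical values, not the real dataset).

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'AD_FORMAT': ['300x50', '300x50', '160x600', '640x360'],
    'SUPPLY_VENDOR': ['google', 'google', 'xandr', 'yieldmo'],
})
toy['AD_FORMAT_VENDOR'] = toy['AD_FORMAT'] + '_' + toy['SUPPLY_VENDOR']

# Upper bound: every format paired with every vendor
cartesian_size = toy['AD_FORMAT'].nunique() * toy['SUPPLY_VENDOR'].nunique()
# What actually occurs in the data
observed_size = toy['AD_FORMAT_VENDOR'].nunique()

print(cartesian_size, observed_size)
```

Note that `+` on string columns propagates missing values, so a row with a missing component ends up with a missing combined value.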

In [6]:
test_data[:3]

Unnamed: 0,METRO,OS_FAMILY_NAME,USER_HOUR_OF_WEEK,CLASS,BROWSER_VENDOR,AD_FORMAT_VENDOR
0,528.0,Windows,131.0,1,Chrome_google,300x50_google
1,501.0,Windows,59.0,1,Chrome_Xandr – Monetize SSP (AppNexus),160x600_Xandr – Monetize SSP (AppNexus)
2,505.0,OS X,40.0,1,Chrome_yieldmo,640x360_yieldmo


Let's create two new features out of USER_HOUR_OF_WEEK. The new features, HOUR_OF_DAY and DAY_OF_WEEK, have smaller cardinality and are more interpretable than USER_HOUR_OF_WEEK.

In [7]:
# Create new features
train_data['HOUR_OF_DAY'] = train_data['USER_HOUR_OF_WEEK'].apply(hour_of_day_group)
train_data['DAY_OF_WEEK'] = train_data['USER_HOUR_OF_WEEK'].apply(day_of_week)

test_data['HOUR_OF_DAY'] = test_data['USER_HOUR_OF_WEEK'].apply(hour_of_day_group)
test_data['DAY_OF_WEEK'] = test_data['USER_HOUR_OF_WEEK'].apply(day_of_week)

# Drop the original feature
train_data = train_data.drop(columns=['USER_HOUR_OF_WEEK'])
test_data = test_data.drop(columns=['USER_HOUR_OF_WEEK'])
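
The helpers `hour_of_day_group` and `day_of_week` live in `helper_functions.utils` and are not shown here. The sketch below is one plausible reading, under two assumptions: USER_HOUR_OF_WEEK counts hours from 0 to 167, and hours are grouped into four 6-hour buckets numbered 1 to 4. Both function names are hypothetical. These assumptions agree with the three sample rows displayed in this notebook, but the real helpers may differ.

```python
import pandas as pd


def day_of_week_sketch(hour_of_week):
    # Assumption: 24 consecutive hours per day, days numbered from 0
    return hour_of_week // 24


def hour_of_day_group_sketch(hour_of_week):
    # Assumption: four 6-hour buckets, numbered 1 (hours 0-5) to 4 (hours 18-23)
    hour = int(hour_of_week % 24)
    return hour // 6 + 1


# The three USER_HOUR_OF_WEEK values shown in the test_data preview
toy = pd.DataFrame({'USER_HOUR_OF_WEEK': [131.0, 59.0, 40.0]})
toy['HOUR_OF_DAY'] = toy['USER_HOUR_OF_WEEK'].apply(hour_of_day_group_sketch)
toy['DAY_OF_WEEK'] = toy['USER_HOUR_OF_WEEK'].apply(day_of_week_sketch)
print(toy)
```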

## Data Encoding

Here, we encode the remaining non-numerical features. METRO is reviewed alongside them because it is categorical, even though its values are already numeric.

First, let's have a look at the descriptive statistics of those features. 

In [8]:
descriptive_statistics(
    train_data, 
    categorical_columns=train_data.select_dtypes('object').columns.to_list() + ['METRO']
    )['descriptive_statistics']

Unnamed: 0,Column,Type,Count,Unique,Most Frequent,Counts of Most Frequent,Least Frequent,Counts of Least Frequent,Base 2 Entropy
0,OS_FAMILY_NAME,Categorical,156138,6,Windows,74203,Other,235,1.65
1,BROWSER_VENDOR,Categorical,156138,328,Chrome_google,16483,Opera_adyoulike,1,5.22
2,AD_FORMAT_VENDOR,Categorical,156138,417,9544x9544_taboola,14382,300x250_ironsource,1,6.6
3,METRO,Categorical,156138,224,501.0,6997,76009.0,1,6.55
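
The Base 2 Entropy column presumably reports the Shannon entropy of each column's category frequencies, although the `descriptive_statistics` helper is not shown here. Under that assumption, a minimal sketch of the calculation (with a hypothetical function name) looks like this:

```python
import numpy as np
import pandas as pd


def base2_entropy_sketch(series):
    """Shannon entropy (in bits) of the category frequencies in a Series."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


# Toy data (hypothetical values)
uniform = pd.Series(['a', 'b', 'c', 'd'])   # four equally likely categories
skewed = pd.Series(['a'] * 7 + ['b'])       # one dominant category

uniform_entropy = base2_entropy_sketch(uniform)
skewed_entropy = base2_entropy_sketch(skewed)
print(uniform_entropy, skewed_entropy)
```

Entropy is highest when all categories are equally frequent, where it equals log2 of the number of categories, and it drops as a few categories come to dominate.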


BROWSER_VENDOR and AD_FORMAT_VENDOR have a relatively high number of unique values (328 and 417), but they are still good candidates for ordinal encoding given our use of tree-based methods.

The METRO feature is already stored as numeric codes, so it needs no further encoding.

In [9]:
# Columns to encode
columns_to_encode = ['OS_FAMILY_NAME', 'BROWSER_VENDOR', 'AD_FORMAT_VENDOR']

# Initialize the OrdinalEncoder with handling for unknown categories
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit the encoder on the training data
ordinal_encoder.fit(train_data[columns_to_encode])

# Transform both training and test data
train_data_encoded = train_data.copy()
test_data_encoded = test_data.copy()

train_data_encoded[columns_to_encode] = ordinal_encoder.transform(train_data[columns_to_encode])
test_data_encoded[columns_to_encode] = ordinal_encoder.transform(test_data[columns_to_encode])


In [10]:
test_data_encoded[:3]

Unnamed: 0,METRO,OS_FAMILY_NAME,CLASS,BROWSER_VENDOR,AD_FORMAT_VENDOR,HOUR_OF_DAY,DAY_OF_WEEK
0,528.0,4.0,1,39.0,122.0,2,5.0
1,501.0,4.0,1,24.0,16.0,2,2.0
2,505.0,2.0,1,68.0,336.0,3,1.0


In [11]:
# Save the encoded data sets
train_data_encoded.to_csv('../data/1_processed/train_data_engineered.csv', index=False)
test_data_encoded.to_csv('../data/1_processed/test_data_engineered.csv', index=False)