# Data Preparation

In this notebook, we create a clean, balanced training dataset and a testing dataset that emulates real-world data. We then perform light feature engineering and encode both datasets so that they are ready to be used as model input.

In [1]:
import pandas as pd
import seaborn as sns
sns.set_theme(context='notebook', style='whitegrid')
from sklearn.preprocessing import OrdinalEncoder

In [2]:
# Load the raw data
conversion_data = pd.read_csv('../data/0_raw/Conversion_data.csv')
nonconversion_data = pd.read_csv('../data/0_raw/nonconversion_data.csv')

## Data Cleaning

Let's create a clean, balanced training dataset by removing null values and combining the same number of conversion and non-conversion data points. The balanced dataset prevents the model from being biased towards the majority class (non-conversions).

For the testing dataset, we also remove the null values but keep the class imbalance to emulate real-world data.

In [3]:
# Clean column names by stripping leading and trailing non-alphanumeric characters
conversion_data.columns = conversion_data.columns.str.replace(r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$', '', regex=True)
nonconversion_data.columns = nonconversion_data.columns.str.replace(r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$', '', regex=True)

# Check that the column names are now the same in both datasets
(conversion_data.columns == nonconversion_data.columns).all()

np.True_
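To illustrate what the pattern above does, here is a minimal sketch on made-up headers (the column names and the stray characters are assumptions, not taken from the raw files). Because both alternatives are anchored, only leading and trailing characters are removed, so underscores inside a name survive.

```python
import pandas as pd

# Hypothetical raw headers with stray quotes, spaces, and brackets
raw = pd.DataFrame(columns=['"SITE"', ' AD_FORMAT ', '[USER_HOUR_OF_WEEK]'])

# Same pattern as in the cell above: strip non-alphanumerics
# at the start or at the end of each name only
pattern = r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$'
cleaned = raw.columns.str.replace(pattern, '', regex=True)

print(list(cleaned))  # ['SITE', 'AD_FORMAT', 'USER_HOUR_OF_WEEK']
```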

In [4]:
# Drop rows with null values
conversion_data = conversion_data.dropna().reset_index(drop=True)
nonconversion_data = nonconversion_data.dropna().reset_index(drop=True)

In [5]:
# Create a class column in the conversion and non-conversion datasets
# 1 means conversion 
# 0 means non conversion
conversion_data['CLASS'] = 1
nonconversion_data['CLASS'] = 0

In [6]:
# Shuffle the conversion data and perform an 80/20 train-test split
conversion_data = conversion_data.sample(frac=1, random_state=11).reset_index(drop=True)
n = int(0.8 * len(conversion_data))
conversion_train = conversion_data[:n]
conversion_test = conversion_data[n:]

# Shuffle the non-conversion data and select n rows for training
nonconversion_data = nonconversion_data.sample(frac=1, random_state=11).reset_index(drop=True)
nonconversion_train = nonconversion_data[:n]
# For the non-conversion test set, select three times as many rows as in conversion_test
# to preserve the initial 3:1 ratio of non-conversion to conversion data.
m = len(conversion_test)
nonconversion_test = nonconversion_data[n: n + 3*m]
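The slicing logic can be sanity-checked on toy data. The sketch below uses hypothetical frames (10 conversions, 40 non-conversions) and shortened variable names. It also makes explicit an assumption the slices rely on: the non-conversion data must hold at least `n + 3*m` rows, otherwise the test slice is silently shorter than intended.

```python
import pandas as pd

# Hypothetical toy data: 10 conversions, 40 non-conversions
conv = pd.DataFrame({'x': range(10), 'CLASS': 1})
nonconv = pd.DataFrame({'x': range(40), 'CLASS': 0})

n = int(0.8 * len(conv))                 # training rows per class
conv_train, conv_test = conv[:n], conv[n:]
m = len(conv_test)                       # conversion rows in the test set

nonconv_train = nonconv[:n]
nonconv_test = nonconv[n:n + 3 * m]      # disjoint from nonconv_train

# Slices silently truncate if there are too few non-conversion rows
assert len(nonconv) >= n + 3 * m

print(len(conv_train), len(nonconv_train), len(conv_test), len(nonconv_test))
# 8 8 2 6
```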

In [7]:
# Combine the conversion and non-conversion data for the train and test sets
train_data = pd.concat([conversion_train, nonconversion_train]).reset_index(drop=True)
test_data  = pd.concat([conversion_test, nonconversion_test]).reset_index(drop=True)

In [8]:
# Check that the training data is balanced
train_data.groupby('CLASS').count()

CLASS,SITE,AD_FORMAT,BROWSER_NAME,SUPPLY_VENDOR,METRO,OS_FAMILY_NAME,USER_HOUR_OF_WEEK
0,78069,78069,78069,78069,78069,78069,78069
1,78069,78069,78069,78069,78069,78069,78069


The resulting training set is balanced: each class contains 78,069 data points.
Since both classes are equally represented, no further rebalancing is needed before training the XGBoost and Random Forest models.
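Class balance can also be inspected more compactly with `value_counts`, which returns one count per class instead of one per column. The sketch below uses a hypothetical label column and is only meant to show the idea; the same call would work on `train_data['CLASS']` or `test_data['CLASS']`.

```python
import pandas as pd

# Hypothetical labels: 2 conversions, 4 non-conversions
toy = pd.DataFrame({'CLASS': [1, 1, 0, 0, 0, 0]})

counts = toy['CLASS'].value_counts()
ratio = counts.loc[0] / counts.loc[1]   # non-conversions per conversion

print(counts.to_dict())
print(ratio)
```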

In [9]:
# Save the clean data sets
train_data.to_csv('../data/1_processed/train_data.csv', index=False)
test_data.to_csv('../data/1_processed/test.csv', index=False)

## Feature Engineering (Light)

At first, we try only minimal feature manipulation to establish a baseline performance. Later, we will use more advanced feature selection techniques to see whether or not we can improve the model's performance.

We notice that the SITE column is problematic. Its high cardinality can lead to overfitting and lower model interpretability, so we drop it from both the training and test sets.
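One way to spot high-cardinality columns is to count the distinct values per column with `nunique`. This is a minimal sketch on a made-up frame (the values are assumptions); applied to `train_data` before the drop, the same call would show how SITE compares with the other features.

```python
import pandas as pd

# Hypothetical sample: every row has a different SITE,
# while BROWSER_NAME takes only two values
toy = pd.DataFrame({
    'SITE': ['a.com', 'b.com', 'c.com', 'd.com'],
    'BROWSER_NAME': ['Chrome', 'Chrome', 'Safari', 'Chrome'],
})

# Number of distinct values per column, highest first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```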

Lastly, we encode the categorical features. Since we are going to use tree-based methods for classification, we can use an OrdinalEncoder. Tree-based models decide splits based on thresholds rather than interpreting the encoded values as distances or magnitudes.

In [10]:
# Drop the SITE column from both the train and the test sets
train_data = train_data.drop(columns=['SITE'])
test_data = test_data.drop(columns=['SITE'])

In [11]:
train_data[:3]

,AD_FORMAT,BROWSER_NAME,SUPPLY_VENDOR,METRO,OS_FAMILY_NAME,USER_HOUR_OF_WEEK,CLASS
0,300x250,Chrome,Media.Net,770.0,Windows,35.0,1
1,728x90,Chrome,Xandr – Monetize SSP (AppNexus),519.0,OS X,13.0,1
2,300x250,Chrome,Index Exchange,505.0,Windows,45.0,1


In [12]:
# Columns to encode
columns_to_encode = train_data.select_dtypes('object').columns

# Initialize the OrdinalEncoder with handling for unknown categories
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Fit the encoder on the training data
ordinal_encoder.fit(train_data[columns_to_encode])

# Transform both training and test data
train_data_encoded = train_data.copy()
test_data_encoded = test_data.copy()

train_data_encoded[columns_to_encode] = ordinal_encoder.transform(train_data[columns_to_encode])
test_data_encoded[columns_to_encode] = ordinal_encoder.transform(test_data[columns_to_encode])
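The effect of `handle_unknown='use_encoded_value'` is easiest to see on a small example. In this sketch (toy browser names, not the real data), the encoder is fitted on three categories and then asked to transform a value it has never seen, which is mapped to `-1` instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; 'Opera' does not appear in training
train = pd.DataFrame({'BROWSER_NAME': ['Chrome', 'Firefox', 'Safari']})
test = pd.DataFrame({'BROWSER_NAME': ['Safari', 'Opera']})

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(train)  # categories are sorted: Chrome=0, Firefox=1, Safari=2

encoded = enc.transform(test)
print(encoded.ravel().tolist())  # [2.0, -1.0]
```

Fitting on the training data only, as the cell above does, keeps information about test-set categories out of the encoder.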


In [13]:
test_data_encoded[:3]

,AD_FORMAT,BROWSER_NAME,SUPPLY_VENDOR,METRO,OS_FAMILY_NAME,USER_HOUR_OF_WEEK,CLASS
0,3.0,0.0,45.0,528.0,4.0,131.0,1
1,1.0,0.0,26.0,501.0,4.0,59.0,1
2,7.0,0.0,78.0,505.0,2.0,40.0,1


In [14]:
# Save the encoded data sets
train_data_encoded.to_csv('../data/1_processed/train_data_encoded.csv', index=False)
test_data_encoded.to_csv('../data/1_processed/test_data_encoded.csv', index=False)