# CTR Prediction using Logistic Regression and XGBoost

The predictions are done on Simulated data and then on Sample Criteo Data. This notebook does not focus on EDA. It only focus on using the machine learning models.


Steps:

        1. Load and preprocess data
        2. Train a Logistic Regression model
        3. Train an XGBoost model
        4. Evaluate performance

## Load and Preprocess Data

#### Install dependancies

In [65]:
!pip install pandas numpy scikit-learn xgboost

[0m

For demonstration, we use a synthetic dataset with user, ad, and context features.

In [66]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score

## CTR Prediction with Simulated Dataset

In [67]:
# Simulated dataset 
np.random.seed(42)
data_size = 10000

df = pd.DataFrame({
    'user_age': np.random.randint(18, 65, data_size),
    'user_gender': np.random.choice([0, 1], data_size),  # 0: Male, 1: Female
    'ad_category': np.random.randint(0, 10, data_size),
    'ad_price': np.random.uniform(1, 100, data_size),
    'click': np.random.choice([0, 1], data_size, p=[0.8, 0.2])  # CTR is ~20%
})

In [68]:
# One-hot encoding categorical features
df = pd.get_dummies(df, columns=['ad_category'], drop_first=True)

In [69]:
# Features and target
X = df.drop(columns=['click'])
y = df['click']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [70]:
# Scale numerical features
scaler = StandardScaler()
X_train[['user_age', 'ad_price']] = scaler.fit_transform(X_train[['user_age', 'ad_price']])
X_test[['user_age', 'ad_price']] = scaler.transform(X_test[['user_age', 'ad_price']])

### Train a Logistic Regression Model

In [71]:
from sklearn.linear_model import LogisticRegression

# Train logistic regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predict
y_pred_prob = lr_model.predict_proba(X_test)[:, 1]
y_pred = (y_pred_prob > 0.5).astype(int)

# Evaluate
print(f'AUC: {roc_auc_score(y_test, y_pred):.4f}')
print(f'Log Loss: {log_loss(y_test, y_pred_prob):.4f}')
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')


AUC: 0.5000
Log Loss: 0.5044
Accuracy: 0.7970


### Train an XGBoost Model

In [72]:
import xgboost as xgb

# Train XGBoost classifier
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Predict
y_pred_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]
y_pred_xgb = (y_pred_prob_xgb > 0.5).astype(int)

# Evaluate
print(f'AUC: {roc_auc_score(y_test, y_pred_xgb):.4f}')
print(f'XGBoost Log Loss: {log_loss(y_test, y_pred_prob_xgb):.4f}')
print(f'XGBoost Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}')


AUC: 0.5057
XGBoost Log Loss: 0.5361
XGBoost Accuracy: 0.7885


## CTR Prediction with Criteo Data Sample

In [73]:
# Load dataset (Criteo Sample Data)
file = "criteo_sample.txt"
df = pd.read_csv(file)
df

Unnamed: 0,label,I1,I2,I3,I4,I5,I6,I7,I8,I9,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,0,,3,260.0,,17668.0,,,33.0,,...,e5ba7672,87c6f83c,,,0429f84b,,3a171ecb,c0d61a5c,,
1,0,,-1,19.0,35.0,30251.0,247.0,1.0,35.0,160.0,...,d4bb7bd8,6fc84bfb,,,5155d8a3,,be7c41b4,ded4aac9,,
2,0,0.0,0,2.0,12.0,2013.0,164.0,6.0,35.0,523.0,...,e5ba7672,675c9258,,,2e01979f,,bcdee96c,6d5d1302,,
3,0,,13,1.0,4.0,16836.0,200.0,5.0,4.0,29.0,...,e5ba7672,52e44668,,,e587c466,,32c7478e,3b183c5c,,
4,0,0.0,0,104.0,27.0,1990.0,142.0,4.0,32.0,37.0,...,e5ba7672,25c88e42,21ddcdc9,b1252a9d,0e8585d2,,32c7478e,0d4a6d1a,001f3601,92c878de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,0,,0,113.0,3.0,3036.0,575.0,2.0,3.0,214.0,...,07c540c4,9880032b,21ddcdc9,5840adea,34cc61bb,c9d4222a,32c7478e,e5ed7da2,ea9a246c,984e0db0
196,1,0.0,1,1.0,1.0,1607.0,12.0,1.0,12.0,15.0,...,1e88c74f,3972b4ed,,,d1aa4512,,32c7478e,9257f75f,,
197,1,1.0,0,6.0,3.0,0.0,0.0,19.0,3.0,3.0,...,3486227d,5aed7436,54591762,a458ea53,4a2c3526,,32c7478e,1793a828,e8b83407,1a02cbe1
198,0,0.0,22,6.0,22.0,203.0,153.0,80.0,18.0,508.0,...,3486227d,13145934,55dd3565,5840adea,bf647035,,32c7478e,1481ceb4,e8b83407,988b0775


In [74]:
# Rename columns
df.columns = ['click'] + [f'feature_{i}' for i in range(1, df.shape[1])]

# Handle missing values
df.fillna(-1, inplace=True)

In [75]:
# Preprocess data
categorical_features = pd.DataFrame(df.dtypes, columns = ['col_dtype'])
categorical_features = categorical_features.index[categorical_features["col_dtype"]=='object'].to_list()
numerical_features = [col for col in df.columns if (col not in categorical_features and col!='label')]

label = 'click'

# One-Hot Encode Categorical Features
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
X_cat = ohe.fit_transform(df[categorical_features].astype('str'))

# Standardize Numerical Features
scaler = StandardScaler()
X_num = scaler.fit_transform(df[numerical_features])


In [76]:
# Combine Features
X = np.hstack([X_num, X_cat])
y = df[label].values

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Logistic Regression Model with Criteo Data Sample 

In [77]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict_proba(X_test)[:, 1]

### Evaluate Logistic Regression

In [78]:
lr_auc = roc_auc_score(y_test, lr_preds)
lr_logloss = log_loss(y_test, lr_preds)
print(f"Logistic Regression AUC: {lr_auc:.4f}, Log Loss: {lr_logloss:.4f}")

Logistic Regression AUC: 1.0000, Log Loss: 0.0291


## XGBoost Model with Criteo Data Sample 

In [79]:
xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=6, use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict_proba(X_test)[:, 1]

### Evaluate XGBoost

In [80]:
xgb_auc = roc_auc_score(y_test, xgb_preds)
xgb_logloss = log_loss(y_test, xgb_preds)
print(f"XGBoost AUC: {xgb_auc:.4f}, Log Loss: {xgb_logloss:.4f}")

XGBoost AUC: 1.0000, Log Loss: 0.0131


Both Logistic Regression and XGBoost are achieving an AUC of 1.0000, which is suspiciously perfect. This suggests data leakage—meaning the models might be learning from features that directly or indirectly contain the target variable.

#### Which One Is Better?

AUC (Area Under Curve):

    Since both models have AUC = 1.0000, they are both distinguishing between classes perfectly.
    However, this is unrealistic in real-world datasets.

Log Loss (Lower is Better):

    Logistic Regression Log Loss: 0.0291
    XGBoost Log Loss: 0.0131 (lower = better)
    XGBoost performs slightly better in terms of probabilistic predictions.