# Data loading and exploratory data analysis

Load the Criteo uplift dataset, inspect its features and values, check whether the data is useful, and verify that it can be treated as a true randomized experiment suitable for uplift modeling.



### Data loading

In [1]:
import pandas as pd 
import numpy as np

DATA_PATH = "../data/raw/criteo-uplift-v2.1.csv"

df = pd.read_csv(DATA_PATH)

df.head()

| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | treatment | conversion | visit | exposure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.616365 | 10.059654 | 8.976429 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 | 1 | 0 | 0 | 0 |
| 1 | 12.616365 | 10.059654 | 9.002689 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 | 1 | 0 | 0 | 0 |
| 2 | 12.616365 | 10.059654 | 8.964775 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 | 1 | 0 | 0 | 0 |
| 3 | 12.616365 | 10.059654 | 9.002801 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 | 1 | 0 | 0 | 0 |
| 4 | 12.616365 | 10.059654 | 9.037999 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 | 1 | 0 | 0 | 0 |


Inspect the features, their ranges, and the meaning of each column.

In [2]:
# First description of the dataset features
print(f'Dataset shape: {df.shape}')

print('Description of the values:')
df.describe()


Dataset shape: (13979592, 16)
Description of the values:


| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | treatment | conversion | visit | exposure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 |
| mean | 19.6203 | 10.06998 | 8.446582 | 4.178923 | 10.33884 | 4.028513 | -4.155356 | 5.101765 | 3.933581 | 16.02764 | 5.333396 | -0.1709672 | 0.8500001 | 0.00291668 | 0.046992 | 0.03063122 |
| std | 5.377464 | 0.1047557 | 0.2993161 | 1.336645 | 0.3433081 | 0.4310974 | 4.577914 | 1.205248 | 0.05665958 | 7.018975 | 0.1682288 | 0.02283277 | 0.3570713 | 0.05392748 | 0.2116217 | 0.1723164 |
| min | 12.61636 | 10.05965 | 8.214383 | -8.398387 | 10.28053 | -9.011892 | -31.42978 | 4.833815 | 3.635107 | 13.19006 | 5.300375 | -1.383941 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12.61636 | 10.05965 | 8.214383 | 4.679882 | 10.28053 | 4.115453 | -6.699321 | 4.833815 | 3.910792 | 13.19006 | 5.300375 | -0.1686792 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | 21.92341 | 10.05965 | 8.214383 | 4.679882 | 10.28053 | 4.115453 | -2.411115 | 4.833815 | 3.971858 | 13.19006 | 5.300375 | -0.1686792 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 24.43646 | 10.05965 | 8.723335 | 4.679882 | 10.28053 | 4.115453 | 0.2944427 | 4.833815 | 3.971858 | 13.19006 | 5.300375 | -0.1686792 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 26.74526 | 16.34419 | 9.051962 | 4.679882 | 21.12351 | 4.115453 | 0.2944427 | 11.9984 | 3.971858 | 75.29502 | 6.473917 | -0.1686792 | 1.0 | 1.0 | 1.0 | 1.0 |


The linked dataset description says that this data was obtained from a "randomized control trial" that collected 13 million samples with a global treatment ratio of 84.6%, so the dataset is unbalanced between treatment and control.

A first look shows that the dataset has almost 14 million samples and 16 columns:
- f0 to f11 (float): described only as feature values
- treatment (numeric boolean): whether the user belongs to the treatment group (1) or the control group (0)
- visit (numeric boolean): whether the user visited the page (1) or not (0)
- conversion (numeric boolean): whether the user converted (1) or not (0)
- exposure (numeric boolean): whether the user was actually exposed to the treatment (1) or not (0)
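
Given the column types listed above, one option for a file of this size is to declare compact dtypes when reading it. The sketch below is not part of the original notebook: the choice of `float32` and `int8` is an assumption (note that `float32` keeps only about 7 significant digits), and a tiny in-memory CSV stands in for the real file so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Column names follow the list above; the dtype choices are an assumption.
FEATURE_COLS = [f"f{i}" for i in range(12)]
FLAG_COLS = ["treatment", "conversion", "visit", "exposure"]

DTYPES = {
    **{col: np.float32 for col in FEATURE_COLS},
    **{col: np.int8 for col in FLAG_COLS},
}

# Tiny in-memory stand-in for the real CSV (values are illustrative only).
sample_csv = io.StringIO(
    ",".join(FEATURE_COLS + FLAG_COLS) + "\n"
    + ",".join(["1.5"] * 12 + ["1", "0", "0", "0"]) + "\n"
    + ",".join(["2.5"] * 12 + ["0", "0", "1", "0"]) + "\n"
)

small_df = pd.read_csv(sample_csv, dtype=DTYPES)
print(small_df.dtypes.value_counts())
print(f"Memory: {small_df.memory_usage(deep=True).sum()} bytes")
```

The same `dtype` mapping could be passed to the `pd.read_csv(DATA_PATH)` call if memory becomes a constraint.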


A first comment about the feature values (f0, ..., f11): they look like numeric combinations of many underlying features (such as age, income, or interests), produced by some feature engineering process that Criteo applied for privacy and commercial reasons.

Another comment about exposure: it measures whether the user was actually exposed to the ad. Even a user in the treatment group may not be exposed, for several reasons (for example, ad blockers). In practice this variable is usually impossible to obtain, so it won't be used in the model; it could be measured here only because the experiment was controlled.

Continuing with the exploration, check the binary columns to quantify how unbalanced the experiment is and get some first impressions.

### Data understanding

In [4]:
# Check treatment / control split
treatment_counts = df['treatment'].value_counts(normalize=True)
print(f"Treatment split: {treatment_counts}")

# Check conversion rates by group
conversion_by_group = df.groupby('treatment')['conversion'].mean()
print(f"Conversion rate by group: {conversion_by_group}")

# Calculate the ratio of conversion between groups
ratio = np.round(conversion_by_group[1]/conversion_by_group[0], 4)
print(f'Ratio between groups: {ratio}')



Treatment split: treatment
1    0.85
0    0.15
Name: proportion, dtype: float64
Conversion rate by group: treatment
0    0.001938
1    0.003089
Name: conversion, dtype: float64
Ratio between groups: 1.5945


The results above confirm that the dataset is unbalanced, with 85% of the users treated and only 15% in the control group. However, this does not invalidate the experiment as long as the assignment was completely fair and random. Indeed, it is a good example of a trade-off in experiment design: the control group must be large enough to measure the effect, but small enough that withholding the treatment does not cost too many conversions.
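
One way to probe whether the assignment looks random is to compare feature distributions between the two groups. The sketch below is an illustration rather than a step from the original notebook: it computes the standardized mean difference (SMD) per feature on synthetic data, and the helper name `standardized_mean_difference` and the toy columns are assumptions. A commonly used rule of thumb treats absolute SMD values below 0.1 as a sign of good balance.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(data, treatment_col, feature_cols):
    """Return the SMD of each feature between treated (1) and control (0) rows."""
    treated = data.loc[data[treatment_col] == 1, feature_cols]
    control = data.loc[data[treatment_col] == 0, feature_cols]
    # Pool the two group variances so the difference is expressed in std units.
    pooled_std = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_std


# Synthetic stand-in: features are drawn independently of an 85/15 assignment.
rng = np.random.default_rng(0)
n = 20_000
toy = pd.DataFrame({
    "f0": rng.normal(20, 5, n),
    "f1": rng.normal(10, 0.1, n),
    "treatment": rng.binomial(1, 0.85, n),
})

smd = standardized_mean_difference(toy, "treatment", ["f0", "f1"])
print(smd)
```

On the real data, the same helper could be called with `df`, `'treatment'`, and the list of `f0` to `f11` column names.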

The ratio between groups shows that the treated group converts about 59% more often than the control group on average, which gives a first measurement of the treatment effect. Uplift modeling, however, works with the difference between the treated and non-treated groups:

$$
\text{uplift}(X) = P(Y = 1 \mid T = 1, X) - P(Y = 1 \mid T = 0, X)
$$

where:

- $Y \in \{0,1\}$ is the conversion outcome,
- $T \in \{0,1\}$ indicates assignment to the treatment group (the user was targeted by the advertisement),
- $X$ represents the user feature vector.

In other words: how much does the probability of conversion of a user (with specific characteristics) change if we treat them versus if we do not?

In [5]:
average_treatment_effect = conversion_by_group[1] - conversion_by_group[0]
print(f"Average treatment effect (absolute): {average_treatment_effect}")


Average treatment effect (absolute): 0.0011518730521316279


The ATE (average treatment effect) shows how many extra conversions we get, on average, because of the treatment. In this case the conversion probability increases by about 0.115 percentage points on average (from roughly 0.19% to 0.31%). This suggests a positive average causal effect of the treatment, and leads to a first conclusion:

**Treating everybody is better than treating nobody**
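
One way to check how solid this first conclusion is would be to put a confidence interval around the ATE. The sketch below uses the normal approximation for a difference of two proportions; it is not part of the original analysis. The inputs are the rounded rates printed above, and the group sizes are approximated from the 85/15 split, so the result is indicative only.

```python
import math


def ate_confidence_interval(p_treated, n_treated, p_control, n_control, z=1.96):
    """Normal-approximation CI for a difference of two proportions."""
    ate = p_treated - p_control
    std_err = math.sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )
    return ate, ate - z * std_err, ate + z * std_err


# Rounded figures from the outputs above; group sizes are approximated
# from the 85/15 split of 13,979,592 rows (an assumption, not exact counts).
n_total = 13_979_592
n_treated = round(0.85 * n_total)
n_control = n_total - n_treated

ate, low, high = ate_confidence_interval(0.003089, n_treated, 0.001938, n_control)
print(f"ATE: {ate:.6f}, 95% CI: [{low:.6f}, {high:.6f}]")
```

Under these assumptions the whole interval lies above zero, which is consistent with a positive average effect.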

However, uplift modeling addresses a different question:

**Should we treat everybody or should we focus on specific users?**

A positive ATE does not mean that everybody responds equally to the treatment, so it is necessary to understand how the effect differs across users in order to make more individual treatment decisions.
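
To make the idea of an unequal response concrete, the sketch below estimates the uplift separately for each segment of users as the treated conversion rate minus the control conversion rate. Everything here is illustrative: the data is synthetic, with a treatment that by construction only helps segment `"b"`, and the `segment` column and the helper `uplift_by_segment` are hypothetical names that do not exist in the Criteo dataset.

```python
import numpy as np
import pandas as pd


def uplift_by_segment(data, segment_col, treatment_col="treatment",
                      outcome_col="conversion"):
    """Conversion rate of treated minus control within each segment."""
    rates = (
        data.groupby([segment_col, treatment_col])[outcome_col]
        .mean()
        .unstack(treatment_col)  # columns become the treatment values 0 and 1
    )
    return rates[1] - rates[0]


# Synthetic data with a known effect: the treatment only helps segment "b".
rng = np.random.default_rng(42)
n = 200_000
segment = rng.choice(["a", "b"], size=n)
treatment = rng.binomial(1, 0.85, size=n)
base_rate = 0.02
effect = np.where((segment == "b") & (treatment == 1), 0.03, 0.0)
conversion = rng.binomial(1, base_rate + effect)

toy = pd.DataFrame({
    "segment": segment,
    "treatment": treatment,
    "conversion": conversion,
})
segment_uplift = uplift_by_segment(toy, "segment")
print(segment_uplift)
```

The overall ATE of this toy data is positive, yet only one segment benefits. That is the situation in which treating a subset of users can be preferable to treating everybody.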

### Some extra sanity checks

Verify that the dataset contains no "strange rows" such as:
- users that were not treated (treatment = 0) but were exposed (exposure = 1)
- users that converted (conversion = 1) without a visit (visit = 0)

In [5]:
visit_rate = df.groupby('treatment')['visit'].mean()
print(f"Visit rate by group: {visit_rate}")


invalid_exposure = df[(df['treatment'] == 0) & (df['exposure'] == 1)]
print(f"Treatment=0 & Exposure=1: {len(invalid_exposure)}")

invalid_conversion = df[(df['visit'] == 0) & (df['conversion'] == 1)]
print(f"Visit=0 & Conversion=1: {len(invalid_conversion)}")

Visit rate by group: treatment
0    0.038201
1    0.048543
Name: visit, dtype: float64
Treatment=0 & Exposure=1: 0
Visit=0 & Conversion=1: 0
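
The two row counts above can also be read from contingency tables, which show every combination of the flags at once. The sketch below uses `pd.crosstab` on a handful of made-up rows; the toy values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataframe (values are illustrative only).
toy = pd.DataFrame({
    "treatment":  [1, 1, 1, 0, 0, 1],
    "exposure":   [1, 0, 0, 0, 0, 1],
    "visit":      [1, 0, 1, 0, 1, 1],
    "conversion": [1, 0, 0, 0, 0, 0],
})

# Rows are the first flag, columns the second; each cell is a row count.
treatment_vs_exposure = pd.crosstab(toy["treatment"], toy["exposure"])
visit_vs_conversion = pd.crosstab(toy["visit"], toy["conversion"])
print(treatment_vs_exposure)
print(visit_vs_conversion)
```

A zero in the (treatment = 0, exposure = 1) cell and in the (visit = 0, conversion = 1) cell corresponds to the two checks performed above.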
