In [2]:
# Utilities
import os
import urllib.request
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline


### SIMULATING AN RCT DATA SET GENERATED FROM A DIRECT MARKETING CAMPAIGN / E-COMMERCE A/B TEST ###

In my experience, a very helpful thought experiment is to think about the data generating process (DGP) behind a given data set. Over the last few months I've been working with several randomized controlled trial (RCT) data sets and quasi-experimental data sets generated by different marketing and operational processes on an e-commerce platform.

Based on that experience, I believe I understand the underlying business process dynamics well enough to sketch a naive data generating process that emulates the data distributions typically observed in an RCT data set resulting from a marketing campaign.

##### 1. Let's define the MINIMAL OBSERVATION UNIT and start working our way from there! #####

For simplicity, let's assume our minimal observational unit is a single customer. In a real-world setting, one customer can receive several pushes and incentives in the form of marketing campaigns, so the one-time effect of a campaign is diluted and entangled with the effects of the other campaigns triggered by the overall treatment policy for that customer.

A direct approach to simulating the transactional behaviour of a set of customers is to model the purchase process as a mixture of a Bernoulli and a continuous random variable (log-normal, for instance). First, each customer is characterized by a purchase probability, and a Bernoulli draw decides whether that customer buys at all. Then the number of valid transactions can be modelled as a number of repeated Bernoulli trials (a Binomial rv); the code below simplifies this to a fixed N transactions per purchasing customer. Finally, the observed purchase amounts are modelled as the product of this Bernoulli rv and the continuous rv.


In [65]:
# Reproducibility
np.random.seed(1234)

# Number of Observations
SIZE = 10000

# Purchase behaviour: one Bernoulli draw per customer (1 = purchased)
n, p = 1, 0.2
purchase_prob = np.random.binomial(n, p, SIZE)

# Number of valid transactions (fixed at N for every purchasing customer)
N = 3
n_trx = purchase_prob * N

# Purchase Amounts
mu, sigma = 1, 1
purchase_amounts = np.random.lognormal(mu, sigma, SIZE)

# Purchase client vector (ie: total amount spent by a given customer)
total_spent_amount = n_trx * purchase_amounts
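
A quick sanity check on this mixture: if the purchase flag and the log-normal amount are independent, the expected total spend per customer is N * p * exp(mu + sigma^2 / 2), and a share of about 1 - p of customers should spend exactly zero. The sketch below compares that formula with a simulated mean. It uses `np.random.default_rng` and a larger sample than the notebook (both are assumptions made for this illustration), so its numbers will not match the cell above draw for draw.

```python
import numpy as np

# Assumption: a local Generator and a larger sample than the notebook's SIZE,
# used only to make the comparison with the closed-form mean less noisy.
rng = np.random.default_rng(1234)
size, p, n_valid, mu, sigma = 200_000, 0.2, 3, 1.0, 1.0

purchase_flag = rng.binomial(1, p, size)      # 1 = customer purchased
amounts = rng.lognormal(mu, sigma, size)      # amount per valid transaction
total_spent = n_valid * purchase_flag * amounts

# Closed-form mean of the zero-inflated log-normal mixture
expected_mean = n_valid * p * np.exp(mu + sigma**2 / 2)
simulated_mean = total_spent.mean()
share_zero = (total_spent == 0).mean()

print(f"expected mean spend:  {expected_mean:.3f}")
print(f"simulated mean spend: {simulated_mean:.3f}")
print(f"share of zero spend:  {share_zero:.3f}")
```

The large mass at zero is what makes this kind of data hard to analyse with tools that assume a symmetric, continuous outcome.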

Now let's simulate more exotic transactional features at the customer level, namely the number of units purchased. Let's assume this feature follows a Poisson distribution for every valid transaction generated by a given customer.

In [50]:
# Total number of units per customer
n_units = np.random.poisson(4, SIZE)
total_n_units = n_trx * n_units
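
Note that multiplying a single Poisson draw by `n_trx` is not the same as drawing the units of each transaction separately. If the transactions are independent, the sum of `n_trx` Poisson(lam) draws is itself Poisson(lam * n_trx), which has the same mean but a smaller variance. The sketch below contrasts the two options; the names `units_scaled` and `units_summed` and the sample size are assumptions made for this illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(1234)  # assumption: local Generator, not the global seed
size, p, n_valid, lam = 100_000, 0.2, 3, 4

n_trx = n_valid * rng.binomial(1, p, size)

# Option used in the notebook: one Poisson draw scaled by the transaction count
units_scaled = n_trx * rng.poisson(lam, size)

# Alternative: one independent Poisson draw per transaction, summed.
# A sum of k independent Poisson(lam) variables is Poisson(k * lam).
units_summed = rng.poisson(lam * n_trx)

buyers = n_trx > 0
print("mean (scaled, summed):", units_scaled[buyers].mean(), units_summed[buyers].mean())
print("var  (scaled, summed):", units_scaled[buyers].var(), units_summed[buyers].var())
```

Which option is more realistic depends on whether a customer's basket size is stable across their transactions (closer to the scaled version) or varies from one transaction to the next (closer to the summed version).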

Finally we can generate some customer-level descriptive features such as age and gender.

In [51]:
# Ages are drawn uniformly from 18..69; gender is a 0/1 flag
age = np.random.choice(a=range(18,70), size=SIZE)
gender = np.random.choice(a=range(2), size=SIZE)

Let's consolidate the DGP into a pandas DataFrame for readability.

In [55]:
trx_df = pd.DataFrame({'gender': gender,
                       'age': age,
                       'purchase_flag': purchase_prob, 
                       'total_purchase_amt':total_spent_amount, 
                       'total_n_trx':n_trx, 
                       'total_n_units':total_n_units})

##### 2. Now let's add the average treatment effect (ATE) as an increment observed between the Treatment/Control groups! #####

Typically, in an A/B test or RCT (such as one defined by a marketing campaign), one would expect to observe treatment effects on a set of transactional features. To encode this into our DGP we need to randomly assign customers to treatment/control groups and define an expected incremental ATE (average treatment effect). This is easy to do in a toy example like this one; in a real-world application, however, observed incremental effects are often diluted and it is hard to pick up the true signal amidst the noise inherent in a marketing campaign.

There are different approaches to simulating an incremental ATE. A direct (perhaps naive) approach is to simulate two groups of customers: one with a baseline set of simulation parameters, and the other with the incremental effect already encoded into its parameters.
For simplicity, let's assume the treatment policy only affects the purchase rate; for this toy example, the purchase rate increases from 20% to 22% post-treatment, i.e. an ATE of 2 percentage points.

In [61]:
# Reproducibility
np.random.seed(1234)

# Number of Observations
SIZE = 10000

# Purchase behaviour: one Bernoulli draw per customer, with the treated rate of 22%
n, p = 1, 0.22
purchase_prob = np.random.binomial(n, p, SIZE)

# Number of valid transactions (fixed at N for every purchasing customer)
N = 3
n_trx = purchase_prob * N

# Purchase Amounts
mu, sigma = 1, 1
purchase_amounts = np.random.lognormal(mu, sigma, SIZE)

# Purchase client vector (ie: total amount spent by a given customer)
total_spent_amount = n_trx * purchase_amounts

# Total number of units per customer
n_units = np.random.poisson(4, SIZE)
total_n_units = n_trx * n_units

age = np.random.choice(a=range(18,70), size=SIZE)
gender = np.random.choice(a=range(2), size=SIZE)

trx_df_gt = pd.DataFrame({'gender': gender,
                       'age': age,
                       'purchase_flag': purchase_prob, 
                       'total_purchase_amt':total_spent_amount, 
                       'total_n_trx':n_trx, 
                       'total_n_units':total_n_units})

Now let's compare both groups to check that the purchase rate does indeed differ between them.

In [62]:
trx_df_gt.purchase_flag.mean() - trx_df.purchase_flag.mean()

0.02049999999999999
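
A difference in sample means is only informative relative to its sampling noise. One common way to judge that is a two-proportion z-test, sketched below with the standard library only. The helper name `two_proportion_ztest` is hypothetical, and the counts passed to it (2,000 and 2,200 purchases out of 10,000 each) are the nominal 20% and 22% rates, not the counts produced by the cells above. The test also assumes the two groups are independent samples.

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions (b minus a)."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal tail
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Assumption: nominal counts for a 20% vs 22% purchase rate with 10,000 customers per group
z_stat, p_val = two_proportion_ztest(2000, 10_000, 2200, 10_000)
print(f"z = {z_stat:.2f}, p-value = {p_val:.5f}")
```

With smaller groups or a smaller lift, the same 2-point difference can easily be indistinguishable from noise, which is the dilution problem described earlier.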

Now let's wrap it all up in a convenient set of functions!

In [97]:
def _sim_control_group(base_dict, gt_gc_ratio, inc_ate = 0):
    # Reproducibility
    # NOTE: the seed is reset on every call, so every group is drawn from the same random stream
    np.random.seed(1234)
    
    ### BASELINE GROUP ###
    # sim_params: group size, purchase rate (plus any incremental effect), transactions per buyer
    SIZE = round(base_dict['sample_size'] * gt_gc_ratio)
    p = base_dict['purchase_rate'] + inc_ate
    N = base_dict['avg_valid_transactions']
    # NOTE: 'avg_purchase_amt' is used as the log-scale mean (mu) of the log-normal,
    # so the mean amount per transaction is exp(mu + sigma**2 / 2), not mu itself
    mu, sigma = base_dict['avg_purchase_amt'], 1
    m = base_dict['avg_n_units']
    
    # Purchase behaviour: one Bernoulli draw per customer (1 = purchased)
    purchase_prob = np.random.binomial(1, p, SIZE)
    # Number of valid transactions (fixed at N for every purchasing customer)
    n_trx = purchase_prob * N
    # Purchase Amounts
    purchase_amounts = np.random.lognormal(mu, sigma, SIZE)
    # Purchase client vector (ie: total amount spent by a given customer)
    total_spent_amount = n_trx * purchase_amounts

    # Total number of units per customer
    n_units = np.random.poisson(m, SIZE)
    total_n_units = n_trx * n_units
    
    # Descriptive features 
    age = np.random.choice(a=range(18,70), size=SIZE)
    gender = np.random.choice(a=range(2), size=SIZE)
    
    _df_group = pd.DataFrame({
                           'gender': gender,
                           'age': age,
                           'purchase_flag': purchase_prob, 
                           'total_purchase_amt':total_spent_amount, 
                           'total_n_trx':n_trx, 
                           'total_n_units':total_n_units
                            })
    # group flag (0.0 = control)
    _df_group['group'] = 0.0
    return _df_group

In [102]:
def _sim_treatment_group(base_dict, gt_gc_ratio, inc_ate = 0.02):
    
    _df_group = _sim_control_group(base_dict, gt_gc_ratio = 1-gt_gc_ratio, inc_ate = inc_ate)
    _df_group['group'] = 1.0
    return _df_group

In [103]:
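def emm_campaign_sim(base_dict, inc_ate, gt_gc_ratio):
    # gt_gc_ratio is the share of the sample assigned to control; the rest is treated
    _df_gc = _sim_control_group(base_dict, gt_gc_ratio, inc_ate = 0.0)
    # Pass the caller's inc_ate through instead of a hard-coded 0.02
    _df_gt = _sim_treatment_group(base_dict, gt_gc_ratio, inc_ate = inc_ate)
    
    return pd.concat([_df_gc, _df_gt], axis = 0)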

In [104]:
params_dict = {"sample_size":10000, "purchase_rate": 0.2, "avg_valid_transactions": 3, "avg_purchase_amt": 1.5, "avg_n_units":4}
emm_campaign_sim(base_dict = params_dict, inc_ate = 0.03, gt_gc_ratio = 0.3)

index,gender,age,purchase_flag,total_purchase_amt,total_n_trx,total_n_units,group
0,1,38,0,0.00000,0,0,0.0
1,1,33,0,0.00000,0,0,0.0
2,1,65,0,0.00000,0,0,0.0
3,1,50,0,0.00000,0,0,0.0
4,0,53,0,0.00000,0,0,0.0
...,...,...,...,...,...,...,...
6995,0,45,0,0.00000,0,0,1.0
6996,0,68,1,6.14398,3,12,1.0
6997,0,59,0,0.00000,0,0,1.0
6998,1,64,0,0.00000,0,0,1.0
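
Once both groups live in one frame with a `group` flag, the ATE on the purchase rate can be estimated as a difference in group means. The sketch below builds a small stand-in frame with the same `purchase_flag` and `group` columns (3,000 control and 7,000 treated rows at nominal 20% and 22% rates, which are assumptions for this illustration) rather than reusing the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)  # assumption: local Generator for this example
n_control, n_treatment = 3000, 7000

# Stand-in for the simulated campaign frame: only the columns needed here
sim_df = pd.DataFrame({
    "purchase_flag": np.concatenate([
        rng.binomial(1, 0.20, n_control),    # control purchase rate
        rng.binomial(1, 0.22, n_treatment),  # treated purchase rate
    ]),
    "group": np.concatenate([np.zeros(n_control), np.ones(n_treatment)]),
})

# Purchase rate and size per group, then the difference in means
summary = sim_df.groupby("group")["purchase_flag"].agg(["mean", "count"])
estimated_ate = summary.loc[1.0, "mean"] - summary.loc[0.0, "mean"]

print(summary)
print(f"estimated ATE on purchase rate: {estimated_ate:.4f}")
```

The estimate will generally differ from the 2-point effect that was encoded, because each group mean carries its own sampling error.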


As a final thought, it is quite straightforward to modify these simple functions to simulate incrementals in the other business metrics we've defined (total amount spent, total number of units purchased, total number of valid transactions, etc.).

Moreover, it is a very good exercise to modify them and observe how the incremental ATEs change when several business metrics move at the same time, which is less naive than assuming univariate changes due to treatment policies!
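
One possible way to set up that exercise is sketched below: a single group simulator that takes every metric's parameter explicitly, plus a dictionary of lifts applied to the treated group. The function `simulate_group`, the parameter names, the lift values and the group sizes are all hypothetical choices for this illustration; they are not the functions defined above.

```python
import numpy as np
import pandas as pd

def simulate_group(size, purchase_rate, log_amt_mean, avg_units, group, rng, n_valid=3):
    """Simulate one group; every metric parameter can carry its own lift."""
    flag = rng.binomial(1, purchase_rate, size)
    n_trx = n_valid * flag
    return pd.DataFrame({
        "purchase_flag": flag,
        "total_purchase_amt": n_trx * rng.lognormal(log_amt_mean, 1.0, size),
        "total_n_units": n_trx * rng.poisson(avg_units, size),
        "group": float(group),
    })

# Hypothetical baseline parameters and simultaneous lifts on three metrics
base = {"purchase_rate": 0.20, "log_amt_mean": 1.5, "avg_units": 4}
lifts = {"purchase_rate": 0.02, "log_amt_mean": 0.05, "avg_units": 0.5}
treated = {name: base[name] + lifts[name] for name in base}

# One shared Generator, so the two groups use different random draws
rng = np.random.default_rng(1234)
multi_df = pd.concat([
    simulate_group(30_000, group=0, rng=rng, **base),
    simulate_group(70_000, group=1, rng=rng, **treated),
], ignore_index=True)

means = multi_df.groupby("group").mean()
observed_lift = means.loc[1.0] - means.loc[0.0]
print(observed_lift)
```

Because total spend and total units are both multiplied by the purchase flag, a lift in the purchase rate alone already moves those metrics, so the observed increments are not simply the lifts that were plugged in.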