In this example, Amazon is running an experiment that gives treated users a one-day sale on Amazon Prime. The outcome of interest is revenue over the next year, but Amazon does not want to wait a full year to assess the experiment. Amazon therefore defines a reasonable surrogate: the purchase of Amazon Prime. Amazon has plenty of observational data linking Amazon Prime purchases to annual revenue.

In this module, we will see how Amazon can recover the true effect under certain assumptions, and how Amazon can be led astray under others.

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

We define parameters for the underlying data process. Treatment is a one-day sale on Amazon Prime, and it raises the surrogate (the purchase of Amazon Prime) by 0.1. The surrogate in turn raises the variable of interest, annual revenue, by $100 per unit. This link holds both in the experimental sample (where it remains unobserved by Amazon) and in the preceding observational sample (on which Amazon has lots of data).

In [2]:
# Define parameters
beta_treatment_purchase = 0.1
beta_purchase_revenue = 100
beta_purchase_revenue_obs = beta_purchase_revenue
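
In this linear setup, treatment reaches revenue only through the purchase surrogate, so the effect we should expect to recover is the product of the two coefficients. A quick check, as a sketch (the name `implied_effect` is ours and does not appear elsewhere in the notebook):

```python
# Parameters copied from the cell above
beta_treatment_purchase = 0.1
beta_purchase_revenue = 100

# Treatment shifts purchase by 0.1, and each unit of purchase is worth 100
# in revenue, so the implied effect on revenue is the product of the two.
implied_effect = beta_treatment_purchase * beta_purchase_revenue
print("Implied effect of treatment on revenue: " + str(round(implied_effect, 2)))
```

Any estimate from simulated data should land near this value, up to sampling noise.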

We next define the data-generating process. Let's assume that the data scientist designs the experiment to be exactly 20,000 treated and 20,000 control units. Moreover, the data scientist accesses a further 80,000 observational records linking Amazon Prime purchases to annual revenue.

In [13]:
# Define the data-generating process here
np.random.seed(4)
n_experiment = 40000
experiment_data = pd.concat(
    [pd.DataFrame(np.ones(round(n_experiment/2)), columns = ['treatment']), 
     pd.DataFrame(np.zeros(round(n_experiment/2)), columns = ['treatment'])])
# Surrogate: uniform noise plus the treatment effect on purchase
experiment_data['purchase'] = (np.random.uniform(0, 1, n_experiment) + 
                               experiment_data['treatment'] * 
                               beta_treatment_purchase)
# Long-term outcome: intercept, normal noise, and the surrogate's effect
experiment_data['revenue'] = (10 + np.random.normal(0, 10, n_experiment) + 
                              experiment_data['purchase'] * 
                              beta_purchase_revenue)
  
n_observation = 80000
observation_data = pd.DataFrame(np.random.uniform(0, 1, n_observation), 
                                columns = ['purchase'])
observation_data['revenue'] = (10 + np.random.normal(0, 10, n_observation) + 
                               observation_data['purchase'] * 
                               beta_purchase_revenue_obs)
experiment_data

index,treatment,purchase,revenue
0,1.0,1.067030,123.948144
1,1.0,0.647232,67.021942
2,1.0,1.072684,120.571953
3,1.0,0.814816,87.662237
4,1.0,0.797729,90.334871
...,...,...,...
19995,0.0,0.712083,85.499942
19996,0.0,0.982169,95.514399
19997,0.0,0.938426,97.757091
19998,0.0,0.144135,9.651022


First, we compute the true effect of the experiment. The data scientist will of course never be able to run this regression in practice, because they will not have the long-term response variable from the experiment when the decision has to be made.

In [4]:
# Function to implement regression, extract coefficient, and return it
def regression(f, df):
  result = sm.ols(formula = f, data = df).fit()
  # Position 0 is the intercept; position 1 is the slope on the regressor
  return result.params.iloc[1]

# Report the true effect of treatment on revenue from experimental data
truth = regression('revenue ~ treatment', experiment_data)
print("The true effect of the experiment on annual revenue is: " + 
            str(round(truth, 2)))

The true effect of the experiment on annual revenue is: 9.95


Second, we compute the estimated effect of the experiment. The data scientist will be able to run these regressions in practice, because they use two accessible datasets: the experiment, which gives the short-term effect of treatment on the surrogate, and the historical data, which give the link between the surrogate and the response variable.

Notice precisely what we do: we estimate the coefficient of treatment on the surrogate, and we multiply that by the coefficient of the surrogate on the long-term outcome.

In [5]:
# Report the surrogacy estimate: the effect of treatment on surrogates (from 
# experiment data) times the effect of surrogates on revenue (from the 
# observational data)
estimate = (regression('purchase ~ treatment', experiment_data) * 
            regression('revenue ~ purchase', observation_data))
print("The measured effect of the experiment on annual revenue is: " + 
            str(round(estimate, 2)))


The measured effect of the experiment on annual revenue is: 9.83


Finally, we bootstrap the estimate to get its standard error. Note that the treated and control groups have fixed sizes, so we resample with replacement within each of them separately. How do we implement the bootstrap with the observational sample?

In [14]:
# Implement bootstrap
i_experiment = np.where(experiment_data['treatment'] == 1)[0]
j_experiment = np.where(experiment_data['treatment'] == 0)[0]
bootstrap = []

# Note that 100 bootstrap iterations is very low, but we do this in the 
# interest of time. You should use at least 2,000 iterations where possible.
for i in range(100):
  experiment_index = np.random.choice(i_experiment, len(i_experiment), 
                                      replace = True)
  experiment_index = np.append(experiment_index, np.random.choice(
                             j_experiment, len(j_experiment), replace = True))
  observation_index = np.random.choice(range(n_observation), n_observation, 
                                       replace = True)
  temporary = (regression('purchase ~ treatment', 
                          experiment_data.iloc[experiment_index,]) * 
               regression('revenue ~ purchase', 
                          observation_data.iloc[observation_index,]))
  bootstrap.append(temporary)

print("The standard error of the estimate: " + 
            str(round(np.std(bootstrap), 2)))

The standard error of the estimate: 0.29
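
To judge how close the surrogate estimate is to the benchmark, it helps to express the gap in units of the bootstrap standard error. The sketch below simply reuses the three numbers printed above; the variable names are ours.

```python
# Values copied from the printed outputs above
truth = 9.95
estimate = 9.83
standard_error = 0.29

# Gap between the benchmark and the surrogate estimate, in absolute terms
# and in units of the bootstrap standard error
gap = abs(truth - estimate)
gap_in_se = gap / standard_error
print("Gap: " + str(round(gap, 2)))
print("Gap in standard errors: " + str(round(gap_in_se, 2)))
```

A gap well below one standard error is consistent with sampling noise alone.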


This shows that surrogates, under the right conditions, can be used successfully! We recover an estimated value that is very close to the true value.

However, these conditions may not always hold. In this next section, we consider different conditions.

First, let's make the model richer. There is an additional variable present: shopping taste, which is typically impossible to measure. Let's now consider three variants of the original scenario:

1.   The latent variable (taste) affects the response variable
2.   The latent variable (taste) is both affected by the treatment and affects the response variable
3.   The observational data estimates a different link between the surrogate variable and the true response variable than the one that holds in the experiment

Rather than going step by step through the same process again, we wrap all the lines above in a single function and change only the parameters governing the data-generating process.

In [7]:
def surrogates(model_type = 0, taste_visible = False):

  # Create different model parameterizations
  if model_type == 0:
    beta_treatment_purchase = 0.1
    beta_treatment_taste = 0
    beta_purchase_revenue = 100
    beta_taste_revenue = 0
    beta_purchase_revenue_obs = beta_purchase_revenue
    beta_taste_revenue_obs = beta_taste_revenue
   
  if model_type == 1:
    beta_treatment_purchase = 0.1
    beta_treatment_taste = 0
    beta_purchase_revenue = 100
    beta_taste_revenue = 20
    beta_purchase_revenue_obs = beta_purchase_revenue
    beta_taste_revenue_obs = beta_taste_revenue

  if model_type == 2:
    beta_treatment_purchase = 0.1
    beta_treatment_taste = 0.2
    beta_purchase_revenue = 100
    beta_taste_revenue = 20
    beta_purchase_revenue_obs = beta_purchase_revenue
    beta_taste_revenue_obs = beta_taste_revenue

  if model_type == 3:
    beta_treatment_purchase = 0.1
    beta_treatment_taste = 0
    beta_purchase_revenue = 100
    beta_taste_revenue = 0
    beta_purchase_revenue_obs = 200
    beta_taste_revenue_obs = beta_taste_revenue

  np.random.seed(4)

  # Create experiment data
  n_experiment = 40000
  experiment_data = pd.concat(
      [pd.DataFrame(np.ones(round(n_experiment/2)), columns = ['treatment']), 
       pd.DataFrame(np.zeros(round(n_experiment/2)), columns = ['treatment'])])
  experiment_data['purchase'] = (np.random.uniform(0, 1, n_experiment) + 
                                 experiment_data['treatment'] * 
                                 beta_treatment_purchase)
  experiment_data['taste'] = (np.random.uniform(0, 10, n_experiment) + 
                              experiment_data['treatment'] * 
                              beta_treatment_taste)
  experiment_data['revenue'] = (10 + np.random.normal(0, 10, n_experiment) + 
                                experiment_data['purchase'] * 
                                beta_purchase_revenue + 
                                experiment_data['taste'] * 
                                beta_taste_revenue)
  
  # Create separate observed data
  n_observation = 80000
  observation_data = pd.DataFrame(np.random.uniform(0, 1, n_observation), 
                                  columns = ['purchase'])
  observation_data['taste'] = np.random.uniform(0, 10, n_observation)
  observation_data['revenue'] = (10 + np.random.normal(0, 10, n_observation) + 
                                 observation_data['purchase'] * 
                                 beta_purchase_revenue_obs + 
                                 observation_data['taste'] * 
                                 beta_taste_revenue_obs)

  # Function to implement regression, extract coefficient, and return it
  def regression(f, df):
    result = sm.ols(formula = f, data = df).fit()
    # Position 0 is the intercept; position 1 is the slope on the regressor
    return result.params.iloc[1]

  # Report the true effect of treatment on revenue from experimental data
  truth = regression('revenue ~ treatment', experiment_data)

  # Report the surrogacy estimate: the effect of treatment on surrogates (from 
  # experiment data) times the effect of surrogates on revenue (from the 
  # observational data)
  estimate = (regression('purchase ~ treatment', experiment_data) * 
           regression('revenue ~ purchase', observation_data))
  if taste_visible:
    estimate = (estimate + regression('taste ~ treatment', experiment_data) *
                regression('revenue ~ taste', observation_data))

  # Implement bootstrap
  i_experiment = np.where(experiment_data['treatment'] == 1)[0]
  j_experiment = np.where(experiment_data['treatment'] == 0)[0]
  bootstrap = []

  # Note that this is a low number of bootstrap iterations, but this allows
  # us to get answers quickly
  for i in range(100):
    experiment_index = np.random.choice(i_experiment, len(i_experiment), 
                                        replace = True)
    experiment_index = np.append(experiment_index, np.random.choice(
                               j_experiment, len(j_experiment), replace = True))
    observation_index = np.random.choice(range(n_observation), n_observation, 
                                         replace = True)
    temporary = (regression('purchase ~ treatment', 
                            experiment_data.iloc[experiment_index,]) * 
                 regression('revenue ~ purchase', 
                            observation_data.iloc[observation_index,]))
    if taste_visible:
      temporary = (temporary + 
                   regression('taste ~ treatment', 
                              experiment_data.iloc[experiment_index,]) *
                   regression('revenue ~ taste', 
                              observation_data.iloc[observation_index,]))
    
    bootstrap.append(temporary)

  # Depict output
  print("The true effect of the experiment on annual revenue is: " + 
        str(round(truth, 2)))
  print("The measured effect of the experiment on annual revenue is: " + 
        str(round(estimate, 2)))
  print("The standard error of the estimate: " + 
        str(round(np.std(bootstrap), 2)))
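
The chain of `if` blocks in the function above leaves the coefficients undefined when `model_type` is outside 0 to 3, so the call then fails with an unbound-variable error. One possible alternative, shown only as a sketch, keeps the scenarios in a lookup table. The names `MODEL_PARAMS` and `get_params` are hypothetical; the coefficient values are copied from the function above.

```python
# Hypothetical lookup table: only the coefficients that differ by scenario
MODEL_PARAMS = {
    0: {"beta_treatment_taste": 0, "beta_taste_revenue": 0,
        "beta_purchase_revenue_obs": 100},
    1: {"beta_treatment_taste": 0, "beta_taste_revenue": 20,
        "beta_purchase_revenue_obs": 100},
    2: {"beta_treatment_taste": 0.2, "beta_taste_revenue": 20,
        "beta_purchase_revenue_obs": 100},
    3: {"beta_treatment_taste": 0, "beta_taste_revenue": 0,
        "beta_purchase_revenue_obs": 200},
}

def get_params(model_type):
  # Fail early with a clear message on an unknown scenario
  if model_type not in MODEL_PARAMS:
    raise ValueError("Unknown model_type: " + str(model_type))
  # Coefficients shared by every scenario
  params = {"beta_treatment_purchase": 0.1, "beta_purchase_revenue": 100}
  params.update(MODEL_PARAMS[model_type])
  # The observational taste link always mirrors the experimental one
  params["beta_taste_revenue_obs"] = params["beta_taste_revenue"]
  return params

print(get_params(2))
```

With this layout, adding a fourth scenario is a single new dictionary entry.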

In [8]:
surrogates(0)

The true effect of the experiment on annual revenue is: 9.81
The measured effect of the experiment on annual revenue is: 9.84
The standard error of the estimate: 0.29


First, consider the case where shopping taste affects only long-term revenue. Is it a true surrogate? How does its presence affect our ability to estimate the true treatment effect?

In [9]:
surrogates(1)

The true effect of the experiment on annual revenue is: 9.58
The measured effect of the experiment on annual revenue is: 9.75
The standard error of the estimate: 0.3
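
One way to read these numbers: taste is not moved by treatment here, so it sits on no path from treatment to revenue, and the surrogate estimate stays near 10. What taste does add is outcome noise, which makes the experimental benchmark itself less precise. The back-of-the-envelope sketch below computes the standard error of the difference in mean revenue between arms, using the variances implied by the data-generating process above. The helper name `se_diff_in_means` is ours.

```python
import math

def se_diff_in_means(beta_taste_revenue, n_per_arm=20000):
  # Within-arm variance of revenue, term by term
  noise_var = 10 ** 2                                    # normal(0, 10) noise
  purchase_var = 100 ** 2 * (1 / 12)                     # 100 * uniform(0, 1)
  taste_var = beta_taste_revenue ** 2 * (10 ** 2 / 12)   # beta * uniform(0, 10)
  total_var = noise_var + purchase_var + taste_var
  # Two independent arms of equal size
  return math.sqrt(2 * total_var / n_per_arm)

print("Benchmark SE without taste: " + str(round(se_diff_in_means(0), 2)))
print("Benchmark SE with taste:    " + str(round(se_diff_in_means(20), 2)))
```

With taste in the outcome, the benchmark's standard error roughly doubles, so a printed true effect of 9.58 is still compatible with an underlying effect of 10.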


Next, consider the case where shopping taste both affects the long-term revenue and is somehow affected by treatment (perhaps the one-day sale made the treated group more excited about online purchases in general). Is it a true surrogate? How does its presence affect our ability to estimate the true treatment effect?

In [10]:
surrogates(2)

The true effect of the experiment on annual revenue is: 13.58
The measured effect of the experiment on annual revenue is: 9.75
The standard error of the estimate: 0.3
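
The gap between the two numbers can be traced to the paths from treatment to revenue. As a sketch using the coefficients of model type 2 (variable names for the path effects are ours):

```python
# Coefficients for model_type == 2, copied from the function above
beta_treatment_purchase = 0.1
beta_purchase_revenue = 100
beta_treatment_taste = 0.2
beta_taste_revenue = 20

# Path 1: treatment -> purchase -> revenue (captured by the surrogate)
effect_via_purchase = beta_treatment_purchase * beta_purchase_revenue
# Path 2: treatment -> taste -> revenue (missed when taste is unmeasured)
effect_via_taste = beta_treatment_taste * beta_taste_revenue

total_effect = effect_via_purchase + effect_via_taste
print("Effect through purchase: " + str(round(effect_via_purchase, 2)))
print("Effect through taste:    " + str(round(effect_via_taste, 2)))
print("Total effect:            " + str(round(total_effect, 2)))
```

The surrogate estimate only sees the first path, which matches the shortfall in the output above (13.58 versus 9.75) up to sampling noise.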


Note that these problems would go away if we could somehow measure taste. There is no issue with having multiple surrogate variables, as long as you can measure and analyze them.

In [11]:
surrogates(2, taste_visible = True)

The true effect of the experiment on annual revenue is: 13.58
The measured effect of the experiment on annual revenue is: 13.52
The standard error of the estimate: 0.66
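
A caveat on the approach: the function estimates each surrogate's link to revenue with a separate one-variable regression. That works in this simulation because purchase and taste are drawn independently. If the surrogates were correlated in the observational data, each separate slope would absorb part of the other's effect, and a joint regression would be needed. The sketch below assumes a made-up correlation; the 0.05 coefficient, the seed, and the helper `ols_slopes` are illustrative and not part of the notebook, and it uses NumPy least squares rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80000
taste = rng.uniform(0, 10, n)
# Assumption for illustration only: purchase is correlated with taste
purchase = rng.uniform(0, 1, n) + 0.05 * taste
revenue = 10 + rng.normal(0, 10, n) + 100 * purchase + 20 * taste

def ols_slopes(y, *regressors):
  # Least-squares fit with an intercept; return the slopes only
  X = np.column_stack([np.ones(len(y))] + list(regressors))
  coef, *_ = np.linalg.lstsq(X, y, rcond=None)
  return coef[1:]

# Separate one-variable regressions, as in the function above
separate_purchase = ols_slopes(revenue, purchase)[0]
separate_taste = ols_slopes(revenue, taste)[0]
# Joint regression on both surrogates at once
joint_purchase, joint_taste = ols_slopes(revenue, purchase, taste)

print("Separate slopes:", round(separate_purchase, 1), round(separate_taste, 1))
print("Joint slopes:   ", round(joint_purchase, 1), round(joint_taste, 1))
```

Under this assumed correlation, the joint slopes land near the true 100 and 20, while the separate slopes overstate both links.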


Finally, let's remove taste again, but now assume that the observational link between the surrogate (a Prime purchase) and the long-term outcome (revenue) no longer matches the true link in the experiment. How does this affect our ability to estimate the true treatment effect?

In [12]:
surrogates(3)

The true effect of the experiment on annual revenue is: 9.81
The measured effect of the experiment on annual revenue is: 19.67
The standard error of the estimate: 0.58
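
The size of this error follows directly from the mismatch between the two links. A short sketch with the coefficients of model type 3 (the names `true_effect`, `surrogate_estimate`, and `relative_bias` are ours):

```python
# Coefficients for model_type == 3, copied from the function above
beta_treatment_purchase = 0.1
beta_purchase_revenue = 100       # link in the experimental population
beta_purchase_revenue_obs = 200   # link in the observational data

true_effect = beta_treatment_purchase * beta_purchase_revenue
surrogate_estimate = beta_treatment_purchase * beta_purchase_revenue_obs

# The estimate is off by the ratio of the two links
relative_bias = surrogate_estimate / true_effect
print("True effect:        " + str(round(true_effect, 2)))
print("Surrogate estimate: " + str(round(surrogate_estimate, 2)))
print("Ratio:              " + str(round(relative_bias, 2)))
```

Note that the bootstrap standard error gives no warning here: it measures sampling noise around the biased estimate, not the distance to the truth.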
