# Randomized Controlled Trials

We'll be looking at data from an online retailer based in the United Kingdom. Our
goal is to estimate the causal effect of switching the user interface to dark
mode on the probability that a user purchases an item.

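We will fit the following linear probability model:

$$ E(Y_i \mid X_i, D_i) = X_i^T \gamma + \tau D_i $$

where $Y_i$ indicates whether user $i$ made a purchase, $X_i$ is a vector of controls, $D_i$ indicates $i$'s treatment status (1 if shown the dark interface), and $\tau$ is the treatment effect of interest.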
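
Because treatment is randomly assigned, the regression without any controls already targets $\tau$: with a single binary regressor, the OLS slope equals the difference in mean outcomes between the treated and control groups. A minimal sketch on simulated data follows; the sample size, seed, and purchase probabilities are made up for illustration and are not taken from the retailer's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical RCT: random assignment and a binary outcome
d = rng.integers(0, 2, size=n)                 # treatment indicator D_i
p = 0.10 + 0.05 * d                            # assumed purchase probabilities
y = rng.binomial(1, p).astype(float)           # outcome Y_i

# Difference in mean outcomes, treated minus control
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# OLS of y on a constant and d, solved by least squares
X = np.column_stack([np.ones(n), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
tau_hat = beta[1]                              # coefficient on d

print(diff_in_means, tau_hat)
```

The two printed numbers agree up to floating-point error. Adding controls $X_i$ does not change what is being estimated in a randomized experiment, but it can reduce the variance of the estimate.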

---

## Imports

In [1]:
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Exploratory Data Analysis

Load data

In [2]:
PATH = os.path.join('..', 'data', 'raw.csv')
df = pd.read_csv(PATH)

View data

In [None]:
df.head()

Check for nulls

In [None]:
df.isna().sum()

Check data types

In [None]:
df.dtypes

Clean data

In [None]:
# Rename columns
df.columns = ['id', 'dark', 'views', 'time', 'purchase', 'mobile', 'location']

# Recode labels as numeric strings (cast to int below) and merge location labels
df.replace(
    to_replace={
        'dark': {'A': '0', 'B': '1'},
        'mobile': {'Mobile': '1', 'Desktop': '0'},
        'purchase': {'No': '0', 'Yes': '1'},
        'location': {'Northern Ireland': 'Ireland'}
    },
    inplace=True
)

# Convert strings -> ints
df[['dark', 'mobile', 'purchase']] = df[['dark', 'mobile', 'purchase']].astype(int)

# Set `location` to lowercase
df['location'] = df['location'].str.lower()
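
One thing to watch with `replace` is that a label missing from the mapping is left untouched, so a typo in the raw data only surfaces later at `astype(int)`. As an alternative sketch, `Series.map` turns unmapped labels into `NaN`, which can be checked right away. The toy rows below are made up; only the column names and raw labels come from the cell above.

```python
import pandas as pd

# Toy rows using the raw labels from the cleaning step (values are made up)
toy = pd.DataFrame({
    'dark': ['A', 'B', 'B'],
    'mobile': ['Mobile', 'Desktop', 'Mobile'],
    'purchase': ['No', 'Yes', 'No'],
    'location': ['Northern Ireland', 'Scotland', 'Wales'],
})

# Series.map sends any label missing from the dict to NaN
toy['dark'] = toy['dark'].map({'A': 0, 'B': 1})
toy['mobile'] = toy['mobile'].map({'Desktop': 0, 'Mobile': 1})
toy['purchase'] = toy['purchase'].map({'No': 0, 'Yes': 1})

# Merge the location label, then lowercase
toy['location'] = (
    toy['location']
    .replace({'Northern Ireland': 'Ireland'})
    .str.lower()
)

# Fail early if an unexpected label slipped through
assert toy[['dark', 'mobile', 'purchase']].notna().all().all()
```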

Encode categorical variables as binary columns (one-hot encoding)

In [None]:
df.head()

In [None]:
# One-hot encoding
df = pd.get_dummies(
    data=df,
    prefix='',
    prefix_sep='',
    columns=['location'],
    dtype=int
)
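
With an empty `prefix` and `prefix_sep`, each new column is named after the category itself. Each row has exactly one 1 across the dummy columns, so together they sum to the constant column. That is why one category has to be left out of a regression that includes a constant. The sketch below uses a toy frame; the category name `england` is an assumption for illustration, since the fourth location is not named in this notebook.

```python
import pandas as pd

# Toy data; 'england' is an assumed category name
toy = pd.DataFrame(
    {'location': ['england', 'ireland', 'scotland', 'wales', 'england']}
)

dummies = pd.get_dummies(
    data=toy,
    prefix='',
    prefix_sep='',
    columns=['location'],
    dtype=int
)

# Exactly one dummy is switched on per row
row_sums = dummies.sum(axis=1)

# Dropping one category makes it the baseline that the
# remaining coefficients are measured against
regressors = dummies.drop(columns=['england'])

print(dummies.columns.tolist())
```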

Feature engineering

- Create interaction term
- Assign a constant

In [None]:
# Interaction
df['dark_mobile'] = df['dark'].multiply(df['mobile'])

# Constant
df['const'] = 1

## Fitting a linear model

Declare linear model

In [None]:
# Declare specification
spec = sm.OLS(
    endog=df['purchase'],
    exog=df[['const', 'ireland', 'scotland', 'wales', 'dark', 'dark_mobile']],
    hasconst=True
)

# Fit model
model = spec.fit()

# View results
model.summary()

- The interaction term is not statistically significant, so we drop it

Fitting a parsimonious model

In [None]:
# Declare model
spec2 = sm.OLS(
    endog=df['purchase'],
    exog=df[['const', 'ireland', 'scotland', 'wales', 'dark']],  # No interaction
    hasconst=True
)

# Fit model
model2 = spec2.fit()

# View results
model2.summary()