# AB Testing

In [1]:
import math
import numpy as np
import pandas as pd
import scipy.stats as st

from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.weightstats import ttest_ind

In [3]:
df = pd.read_csv("../../data/abtest/customer_service_ratings.csv")
control = df[df["type"]=="historical"]["rating"].values #historical
treatment = df[df["type"]=="treatment"]["rating"].values #treatment

In [9]:
expected_lift = 0.1 #expected difference we want to observe

#### Statistical significance

    We first use a two-sample t-test to check whether the mean rating differs between the control and treatment groups. 

In [4]:
teststatistic, p_value, d_freedom = ttest_ind(control, treatment)

In [5]:
teststatistic, p_value, d_freedom

(-1.0764719751813947, 0.2817671660562652, 5098.0)

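    The p-value is 0.28: if the null hypothesis (no difference between control and treatment) were true, we would see a test statistic at least this extreme about 28% of the time. This is well above the conventional 0.05 significance level, so we cannot reject the null hypothesis. 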
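
As a sanity check, the reported p-value can be reproduced from the test statistic and degrees of freedom printed above. This is a minimal sketch: the two numbers are copied from the cell output, and `significance_level` is an assumed threshold of 0.05, chosen to match the `alpha` used later in the power analysis.

```python
import scipy.stats as st

# Values copied from the ttest_ind output above
t_stat = -1.0764719751813947
dof = 5098.0

# Two-sided p-value: probability, under the null hypothesis,
# of observing a |t| at least as large as the one we got
p_two_sided = 2 * st.t.sf(abs(t_stat), dof)

# Assumed decision threshold (not stated in this cell of the notebook)
significance_level = 0.05
reject_null = p_two_sided < significance_level

print(f"p-value = {p_two_sided:.4f}, reject null: {reject_null}")
```

The result should agree with the p-value of roughly 0.28 returned by `ttest_ind`, which leads to not rejecting the null hypothesis at the 0.05 level.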

#### Practical significance
    We now check practical significance, that is, whether the difference between control and treatment is large enough to matter. We draw a 95% confidence interval around the treatment mean and check whether its lower bound exceeds the control average plus the expected lift of 0.1. 

In [10]:
# calculate the confidence interval around the mean
ci = 0.95
treatment_avg = np.mean(treatment)
control_avg = np.mean(control)
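# the confidence level is passed positionally: newer SciPy versions renamed
# this argument from `alpha` to `confidence`
lower, upper = st.t.interval(ci, df=len(treatment)-1, 
                             loc=treatment_avg, scale=st.sem(treatment)) 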

# check whether the lift is as high as expected
print(lower > control_avg+expected_lift)

False


    The lower bound of the 95% confidence interval does not exceed control_avg + expected_lift, so we cannot claim that the treatment delivers the expected lift. The improvement is therefore not practically significant at this sample size, and we need to run the A/B test longer. 
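
The check above builds the interval around the treatment mean only and treats the control average as a fixed number. A common alternative, which is not part of the original analysis, is to build the interval around the difference in means so that the uncertainty in both groups is accounted for. The sketch below uses a Welch-style interval; the function name `diff_confidence_interval`, the `demo_*` variables, and the synthetic ratings are all hypothetical and stand in for the real data.

```python
import numpy as np
import scipy.stats as st

def diff_confidence_interval(treatment, control, confidence=0.95):
    """Welch-style confidence interval for mean(treatment) - mean(control)."""
    t_arr = np.asarray(treatment, dtype=float)
    c_arr = np.asarray(control, dtype=float)
    # Squared standard error of each group mean
    var_t = np.var(t_arr, ddof=1) / len(t_arr)
    var_c = np.var(c_arr, ddof=1) / len(c_arr)
    se = np.sqrt(var_t + var_c)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (var_t + var_c) ** 2 / (
        var_t ** 2 / (len(t_arr) - 1) + var_c ** 2 / (len(c_arr) - 1)
    )
    diff = t_arr.mean() - c_arr.mean()
    margin = st.t.ppf(0.5 + confidence / 2, dof) * se
    return diff - margin, diff + margin

# Synthetic ratings for illustration only, NOT the data loaded from the CSV
rng = np.random.default_rng(42)
demo_control = rng.normal(3.0, 1.0, size=4000)
demo_treatment = rng.normal(3.05, 1.0, size=400)

lower_diff, upper_diff = diff_confidence_interval(demo_treatment, demo_control)
demo_lift = 0.1  # same expected lift as in the notebook
practically_significant = lower_diff > demo_lift
print(lower_diff, upper_diff, practically_significant)
```

With this variant, the lift counts as practically significant only when the whole interval for the difference lies above the expected lift.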

#### Run the test longer
    We now run a power analysis of the t-test to determine how many samples we need in the control and treatment groups to detect a lift of 0.1. 

In [11]:
estimated_std = np.std(control) #historical estimate

In [16]:
cohen_d = expected_lift / estimated_std
effect_size = cohen_d #effect_size is the magnitude of the effect we want to see. 

In [13]:
power = 0.8 # probability of detecting the effect if it really exists (true positive rate)
alpha = 0.05 # significance level: the false positive rate we accept (the p-value is compared against it)
ratio = 0.1 # nobs2 / nobs1: treatment is 10% the size of control, as we can only afford to expose about 10% of our users to the treatment
analysis = TTestIndPower()
t_power_test = analysis.solve_power(effect_size=effect_size, power=power, 
                                    nobs1=None, ratio=ratio, alpha=alpha)
sample_size_control = int(np.ceil(t_power_test))
sample_size_treatment = int(np.ceil(t_power_test * ratio))
sample_size_total = sample_size_control + sample_size_treatment
print(f'We must run an A/B test with at least {sample_size_total} customers, \
{sample_size_control} in control and {sample_size_treatment} in treatment')

We must run an A/B test with at least 8272 customers, 7520 in control and 752 in treatment
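
To see where a number like this comes from, the sample size can be approximated in closed form with a normal approximation instead of the t-distribution that `solve_power` uses. This is only a sketch: `approx_nobs1` is a hypothetical helper, and `demo_effect_size` is an illustrative Cohen's d, not the value computed from the ratings data. The `ratio` argument follows the same convention as above (`nobs2 / nobs1`).

```python
import math
from statistics import NormalDist

def approx_nobs1(effect_size, alpha=0.05, power=0.8, ratio=1.0):
    """Normal-approximation size of group 1 for a two-sided two-sample test.

    In standardized units the variance of the mean difference is
    1/nobs1 + 1/nobs2 = (1 + 1/ratio) / nobs1, with ratio = nobs2 / nobs1.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (1 + 1 / ratio) * ((z_alpha + z_power) / effect_size) ** 2

demo_effect_size = 0.1  # hypothetical Cohen's d, for illustration only
demo_ratio = 0.1        # treatment group is 10% the size of control

n_control = math.ceil(approx_nobs1(demo_effect_size, ratio=demo_ratio))
n_treatment = math.ceil(n_control * demo_ratio)
print(n_control, n_treatment)
```

The approximation tends to come out slightly below the t-based answer, and it makes the cost of an unbalanced design visible: the factor `1 + 1/ratio` is 2 for equal groups but 11 when the treatment group is a tenth of the control group.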


In [14]:
# the divisor assumes roughly 100 customers enter the test per day
print(f"The AB test should run for {math.ceil(sample_size_total/100)} days in total.")

The AB test should run for 83 days in total.


    From the analysis above we conclude that we need 7520 customers in control and 752 in treatment. Thus, the experiment should run for approximately 83 days. 
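
The duration estimate divides the total sample size by 100, which presumably reflects the number of customers available per day. A small sketch that makes this assumption explicit is shown below; `days_needed` and `customers_per_day` are hypothetical names, and the daily volume of 100 is inferred from the divisor in the cell above rather than stated in the notebook.

```python
import math

def days_needed(total_sample_size, customers_per_day):
    """Number of whole days required to collect the given sample size."""
    if customers_per_day <= 0:
        raise ValueError("customers_per_day must be positive")
    return math.ceil(total_sample_size / customers_per_day)

# 8272 is the total sample size reported above; 100 per day is an assumption
duration = days_needed(8272, customers_per_day=100)
print(f"The AB test should run for {duration} days in total.")
```

Keeping the daily volume as a named parameter makes it easy to rerun the estimate if traffic changes.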