# **Experimental Design**

In [5]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats

## **A/B Testing**

    Conducted using a Chi-square hypothesis test.
    -----------------------------------------------------------------------------------------------------------
    A sample size calculator requires three parameters:
    
        Baseline conversion rate (BCR): the current conversion rate (current_baseline).
        Minimum detectable effect (MDE), also called the minimum desired lift.
        Statistical significance threshold.
        
        mde = ((NCR - BCR) / BCR) * 100, where NCR is the new conversion rate (new_baseline)
    -----------------------------------------------------------------------------------------------------------   
    False positive rate: the probability of finding a significant difference when there is none.
    False negative rate: the probability of not finding a significant difference when there is one.
    -----------------------------------------------------------------------------------------------------------
    Most A/B test sample size calculators estimate the sample size needed for a 20% false negative rate (80% power), while the data scientist chooses the false positive rate they are comfortable with.
    -----------------------------------------------------------------------------------------------------------
    The lower the false positive rate, the larger the sample size will need to be!
    -----------------------------------------------------------------------------------------------------------
    Two important rules for making sure that A/B tests remain unbiased:
        - Don’t keep running the test past the predetermined sample size until “significant” results are found.
        - Don’t stop a test before reaching the predetermined sample size, just because your results reach significance early (unless there are ethical reasons that require you to stop, like a prescription drug trial).
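
The three inputs listed above are enough to approximate a sample size. The sketch below is not the calculator the notes refer to. It uses the textbook normal-approximation formula for comparing two proportions, and the function name `sample_size_per_group` and the default 80% power are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(bcr, mde_percent, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions.

    bcr is the baseline conversion rate, mde_percent the relative lift in
    percent (as in the mde formula above), alpha the significance threshold.
    Normal approximation without continuity correction (an assumption of
    this sketch, not a property of any particular calculator).
    """
    ncr = bcr * (1 + mde_percent / 100)  # new conversion rate implied by the lift
    if mde_percent == 0 or not 0 < bcr < 1 or not 0 < ncr < 1:
        raise ValueError("need a non-zero lift and rates strictly between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the false positive rate
    z_beta = NormalDist().inv_cdf(power)           # controls the false negative rate
    p_bar = (bcr + ncr) / 2                        # pooled rate under "no difference"

    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(bcr * (1 - bcr) + ncr * (1 - ncr))) ** 2
    return math.ceil(numerator / (ncr - bcr) ** 2)


# Same baseline and lift, two different significance thresholds.
n_05 = sample_size_per_group(bcr=0.5, mde_percent=30, alpha=0.05)
n_01 = sample_size_per_group(bcr=0.5, mde_percent=30, alpha=0.01)
print(n_05, n_01)
```

The stricter threshold (`alpha=0.01`) asks for more observations per group than `alpha=0.05`, in line with the note above about lower false positive rates.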

#### **Import Data**

In [6]:
web_purchase = pd.read_csv('web_version.txt', engine='python', sep=',')
web_purchase

Unnamed: 0,Web_Version,Purchased
0,A,no
1,A,no
2,A,yes
3,A,yes
4,A,yes
...,...,...
95,B,yes
96,B,yes
97,B,yes
98,B,yes


In [9]:
# Perform Chi-square test.
ab_contingency = pd.crosstab(web_purchase.Web_Version, web_purchase.Purchased)

chi2, pval, dof, expected = stats.chi2_contingency(ab_contingency)

print(ab_contingency)
print('--------------------------------------')
print(pval, ', Significance:', pval < 0.05)

Purchased    no  yes
Web_Version         
A            24   26
B            15   35
--------------------------------------
0.10096676200907678 , Significance: False
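
To make the p-value above less of a black box, the sketch below recomputes it from the printed contingency table using only the standard library. It assumes `stats.chi2_contingency` applied Yates' continuity correction, which is SciPy's default for a 2x2 table. The helper name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(table, correction=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). With correction=True each deviation is
    shrunk by up to 0.5 (Yates' continuity correction).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if version and purchase were independent.
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - expected)
            if correction:
                diff -= min(0.5, diff)
            statistic += diff ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from the crosstab above: rows A and B, columns no and yes.
observed = [[24, 26], [15, 35]]
stat_corrected, p_corrected = chi2_2x2(observed)
stat_plain, p_plain = chi2_2x2(observed, correction=False)
print(p_corrected, p_plain)
```

The corrected p-value should land close to the 0.101 printed above. The uncorrected one is smaller, which is why it matters to know which variant a library applies.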


#### **Simulate A/B Data for a Chi-square test**

In [11]:
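# Assumed parameters.
sample_size = 500        # total across both groups
lift = 0.3               # relative lift of the new version
BCR = 0.5                # baseline conversion rate
NCR = (1 + lift) * BCR   # new conversion rate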


# Simulate random process with different probabilities for control and new.
sample_control = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[BCR, 1-BCR])
sample_new = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[NCR, 1-NCR])


# Develop a dataframe for the Chi-square test.
group = ['control']*int(sample_size/2) + ['new']*int(sample_size/2)
outcome = list(sample_control) + list(sample_new)
sim_data = {"Label": group, "Outcome": outcome}
sim_data = pd.DataFrame(sim_data)
sim_data

Unnamed: 0,Label,Outcome
0,control,no
1,control,yes
2,control,yes
3,control,yes
4,control,yes
...,...,...
495,new,yes
496,new,yes
497,new,yes
498,new,yes


In [12]:
# Perform Chi-square test.
ab_contingency = pd.crosstab(sim_data.Label, sim_data.Outcome)

chi2, pval, dof, expected = stats.chi2_contingency(ab_contingency)

print(ab_contingency)
print('--------------------------------------')
print(pval, ', Significance:', pval < 0.05)

Outcome   no  yes
Label            
control  122  128
new       85  165
--------------------------------------
0.0010806159373268863 , Significance: True


#### **Estimating the Power**

In [28]:
def estimate_power(significance_threshold, sample_size, lift, BCR, NCR, n):
    """Estimate power as the share of n simulated experiments that reach significance.

    sample_size is the total across both groups. lift is not used directly
    because NCR already encodes it.
    """
    group_size = sample_size // 2
    # Group labels are identical in every simulated experiment.
    group = ['control'] * group_size + ['new'] * group_size
    significant = 0

    for _ in range(n):
        # Simulate outcomes with different conversion probabilities for control and new.
        sample_control = np.random.choice(['yes', 'no'], size=group_size, p=[BCR, 1 - BCR])
        sample_new = np.random.choice(['yes', 'no'], size=group_size, p=[NCR, 1 - NCR])
        outcome = np.concatenate([sample_control, sample_new])

        # Run the Chi-square test on the 2x2 contingency table.
        ab_contingency = pd.crosstab(np.array(group), outcome)
        chi2, pval, dof, expected = stats.chi2_contingency(ab_contingency)

        # Count the experiments that reach significance.
        if pval < significance_threshold:
            significant += 1

    # Proportion of significant results.
    return significant / n

In [43]:
# Assumed parameters.
significance_threshold = 0.01
sample_size = 350  # total across both groups
lift = 0.4
BCR = 0.5
NCR = (1 + lift) * BCR
n = 500  # number of simulated experiments to run

power_of_test = estimate_power(significance_threshold, sample_size, lift, BCR, NCR, n)
power_of_test     # same as the true positive rate

0.894
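
Because the estimate comes from a finite number of simulated experiments, it carries sampling error of its own. The sketch below gives a rough sense of that uncertainty. It assumes the simulated experiments are independent and uses the normal (Wald) approximation to the binomial. The helper `power_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def power_interval(power_estimate, n_experiments, confidence=0.95):
    """Approximate confidence interval for a simulated power estimate.

    Treats the count of significant experiments as binomial and applies
    the normal approximation, which is an assumption of this sketch.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    std_error = math.sqrt(power_estimate * (1 - power_estimate) / n_experiments)
    lower = max(0.0, power_estimate - z * std_error)
    upper = min(1.0, power_estimate + z * std_error)
    return lower, upper


# Values from the run above: 0.894 from n = 500 simulated experiments.
low, high = power_interval(0.894, 500)
print(f"{low:.3f} to {high:.3f}")
```

With 500 experiments the interval spans roughly 0.87 to 0.92, so a rerun can return a value a couple of points away from 0.894. Raising `n` narrows it.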

    Typically, A/B tests select sample sizes so as to obtain at least 80% power.
    
    To obtain this, experiments are designed for a power of 80%/(1 - significance_threshold).
    This accounts for the false positives that would occur even with a lift of 0, since those false positives are counted among the true positives.
    
    Caveats:
        Increasing the sample size increases the power of the test (the probability of detecting a difference if there is one); however, larger sample sizes require more time and resources.
        
        Increasing the significance threshold also increases the power of the test; however, it simultaneously increases the false positive rate (the probability of detecting a difference when there isn’t one).
        
        Increasing the minimum detectable effect/lift allows a smaller sample size without decreasing power.
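
The three caveats can be checked numerically without rerunning the simulation. The sketch below uses a normal approximation to the power of a two-sided two-proportion test. It is a stand-in for the simulated Chi-square power, not the same calculation: it applies no continuity correction and ignores the tiny chance of a significant result in the wrong direction. The function name `approx_power` is an assumption.

```python
import math
from statistics import NormalDist


def approx_power(sample_size, lift, bcr, alpha):
    """Approximate power of a two-sided two-proportion test.

    sample_size is the total across both groups, split evenly, matching
    the convention used in the simulation.
    """
    n = sample_size // 2                           # observations per group
    ncr = (1 + lift) * bcr                         # new conversion rate
    p_bar = (bcr + ncr) / 2                        # pooled rate under "no difference"
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the threshold
    se_null = math.sqrt(2 * p_bar * (1 - p_bar))
    se_alt = math.sqrt(bcr * (1 - bcr) + ncr * (1 - ncr))
    z = (abs(ncr - bcr) * math.sqrt(n) - z_alpha * se_null) / se_alt
    return NormalDist().cdf(z)


# Baseline setting, then one change per caveat.
base = approx_power(sample_size=350, lift=0.4, bcr=0.5, alpha=0.01)
bigger_sample = approx_power(sample_size=700, lift=0.4, bcr=0.5, alpha=0.01)
looser_threshold = approx_power(sample_size=350, lift=0.4, bcr=0.5, alpha=0.05)
bigger_lift_smaller_sample = approx_power(sample_size=250, lift=0.5, bcr=0.5, alpha=0.01)

for label, value in [("base", base),
                     ("sample_size 700", bigger_sample),
                     ("alpha 0.05", looser_threshold),
                     ("lift 0.5, sample_size 250", bigger_lift_smaller_sample)]:
    print(f"{label}: {value:.3f}")
```

The base value comes out near 0.9, in the same range as the simulated estimate. The last row shows the third caveat: a larger lift keeps power at least as high even with 100 fewer observations.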