# Advanced Analytics for Organisational Impact: AB Testing

## 1. Design and set up an A/B test

#### *AB Testing* 

In this exercise, I perform A/B testing using Python on a real-life scenario (adapted from Fillinich 2020):

An expanding e-commerce company wants to test a new version of its home page. The company has put together a new layout or design that they hope will increase its conversion rate. The conversion rate is a key metric, as it refers to the percentage of visitors that take a desired action on the website. In this case, that desired action is clicking through to the product page.
The company's previous home page had a conversion rate of only 13%. The company hopes the new design will raise the conversion rate by at least two percentage points, so to meet the target the conversion rate needs to improve from 13% to at least 15%.

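Here, we use an A/B test to determine whether the new design changes the conversion rate, treating a shift of two percentage points in either direction as the effect we want to be able to detect.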

#### *Data Used* 

Information regarding the dataset used can be found below:

| Column name                 | Description                                                    |
|-----------------------------|----------------------------------------------------------------|
| user_id (integer)           | The identification number of the user (e.g. 851104).          |
| timestamp (Python object)   | When the user was in the session (e.g. 2017-01-21 22:11:48.556739). |
| group (Python object)       | Which sample group the user was part of (e.g. control or treatment). |
| landing_page (Python object)| Which version of the home page the user saw (e.g. old_page or new_page). |
| converted (integer)         | Whether the user performed a desired action or not (e.g. 0 or 1). |
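
The table lists timestamp as a generic Python object, which usually means the values are stored as plain strings. As a minimal sketch, assuming the timestamps follow the format shown in the examples, they could be converted to a proper datetime type before any date-based work. The two rows below are copied from examples in this notebook; the variable name `sample` is only illustrative.

```python
import pandas as pd

# Two illustrative rows, using example values that appear in this notebook.
sample = pd.DataFrame({
    "user_id": [851104, 804228],
    "timestamp": ["2017-01-21 22:11:48.556739", "2017-01-12 08:01:45.159739"],
    "group": ["control", "control"],
    "landing_page": ["old_page", "old_page"],
    "converted": [0, 0],
})

# Convert the text timestamps to datetime64 so that date parts can be extracted.
sample["timestamp"] = pd.to_datetime(sample["timestamp"])

print(sample.dtypes)
print(sample["timestamp"].dt.day.tolist())
```

The analysis below drops the timestamp column, so this conversion is optional and only matters if the session times are explored further.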



In [1]:
# Set working Directory
# import os
# os.chdir('/Volumes/GoogleDrive/My Drive/LSE Data Analytics/Course 3 - Advanced Analytics for Organisational Impact /LSE_DA301_Module_1_files/Data')

## 2. Conduct a power analysis

Power analysis helps you determine the sample size required for your A/B test to achieve a desired level of statistical power, which is the probability of correctly detecting an effect when it truly exists.

* __PA__: Population proportion that converts on the old landing page (control group)
* __PB__: Population proportion that converts on the new landing page (treatment group)

1. __Define The Hypotheses:__

* __Null hypothesis (H0):__ There is no difference between the conversion rates of the two groups, where PA = PB or PA - PB = 0
* __Alternative hypothesis (Ha):__ There is a difference, where PA ≠ PB or PA - PB ≠ 0

2. __Choose Significance Level (Alpha):__

* Usually set at 0.05. This is the accepted probability of a Type 1 error (rejecting a null hypothesis that is actually true), and it serves as the threshold for statistical significance.
* If the p-value calculated in the test is less than α, you reject the null hypothesis.

3. __Determine Effect Size:__

* The effect size represents the practical or meaningful difference you want to detect between the groups. 
* Here the effect size is based on a change of two percentage points in the conversion rate (13% versus 15%), in either direction.

4. __Select Desired Power Level (1 - β):__

* Choose the desired power level (1 - β), typically set at 0.80 or 0.90. 
* This is the probability of correctly rejecting the false null hypothesis i.e. the probability of correctly detecting an effect if it exists. Higher power levels will require larger sample sizes.

5. __Choose a Statistical Test:__

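* Determine which statistical test you'll use for your A/B test based on the nature of your data (e.g., t-test, z-test, chi-squared test, etc.).
* Here the power analysis uses the power calculation for a two-sample t-test (TTestIndPower), while the significance test in section 5 is a two-proportion z-test.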

6. __Perform Power Analysis:__

* Here we use a power analysis calculator to calculate the required sample size based on the chosen significance level, effect size, and desired power level. The power analysis gives you the minimum sample size needed to achieve your goals.

7. __Sample Size Adjustment:__

* Adjust the calculated sample size if needed for practical reasons or budget constraints. It's essential to balance the desired sample size with resource limitations.
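
Step 3 states the effect in raw conversion rates, but a power calculation needs a standardised effect size. For two proportions a common choice is Cohen's h, which, as far as I am aware, is the quantity that statsmodels' proportion_effectsize returns. The sketch below computes it by hand with the standard library; the function name `cohens_h` is my own and not part of any library.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between the arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Old (13%) and target (15%) conversion rates from the scenario.
h = cohens_h(0.13, 0.15)
print(f"Cohen's h: {h:.4f}")
```

The sign is negative because the first proportion is the smaller one; for a two-sided test only the magnitude matters.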


In [2]:
# Import statsmodels for statistical calculations and the
# TTestIndPower class for the power analysis.
import statsmodels.stats.api as sms
from statsmodels.stats.power import TTestIndPower

# Specify the three required parameters for the power analysis:
alpha = 0.05 
power = 0.80 
effect = sms.proportion_effectsize(0.13, 0.15)  # Old and new conversion rates.


# Perform power analysis by using the solve_power() function:
# Specify an instance of TTestIndPower.
analysis = TTestIndPower() 

# Calculate the sample size and list the parameters. 
# ratio=1.0 keeps the sample sizes equal in the A and B groups.
result = analysis.solve_power(effect, power=power, nobs1=None,
                              ratio=1.0 , alpha=alpha) 

# Print the output.
print('Sample Size: %.3f' % result)

Sample Size: 4720.435




__ratio=1.0__ ensures the sample sizes are equal in the A and B groups.

A sample size of 4720.435 is the minimum number of observations (n) needed in each group to meet the criteria we specified: a significance level of 0.05 and a power of 0.80 for the two-percentage-point effect size.

The minimum sample size required per group: 4721 (round the value up)
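
As a sanity check on the solver, the sample size can be approximated with the textbook normal-approximation formula for two equal-sized groups, n = 2 × ((z for α/2 + z for power) / h)², where h is Cohen's h. This is a sketch using a different (normal rather than t-based) approximation, so I would expect it to land within a few observations of the value above rather than match it exactly.

```python
import math
from statistics import NormalDist

alpha, power = 0.05, 0.80

# Cohen's h for the 13% versus 15% scenario (standardised effect size).
h = 2 * math.asin(math.sqrt(0.13)) - 2 * math.asin(math.sqrt(0.15))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Two-sided critical value.
z_power = NormalDist().inv_cdf(power)          # Quantile for the desired power.

# Normal-approximation sample size per group (equal group sizes).
n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
print(f"Approximate sample size per group: {n_per_group:.1f}")
```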

__What if we want to increase the power from 0.80 to 0.90? This would reduce the probability of a Type 2 error to 10%.__

A higher power is a benefit, but it comes at the cost of a larger minimum sample size.

In [3]:
# Import statsmodels for statistical calculations and the
# TTestIndPower class for the power analysis.
import statsmodels.stats.api as sms
from statsmodels.stats.power import TTestIndPower

# Specify the three required parameters for the power analysis:
alpha = 0.05 
power = 0.90 
effect = sms.proportion_effectsize(0.13, 0.15) 

# Perform power analysis by using the solve_power() function:
# Specify an instance of TTestIndPower.
analysis = TTestIndPower() 

# Calculate the sample size and list the parameters.
result = analysis.solve_power(effect, power=power, nobs1=None,
                              ratio=1.0, alpha=alpha) 

# Print the output.
print('Sample Size: %.3f' % result)

Sample Size: 6319.011
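
The trade-off between power and sample size can also be read in the other direction: given a sample size per group, how much power does the test have? The sketch below uses the normal approximation rather than the t-based solver above, and the helper name `approx_power` is hypothetical. It ignores the negligible probability of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist


def approx_power(n_per_group: int, p_old: float = 0.13, p_new: float = 0.15,
                 alpha: float = 0.05) -> float:
    """Approximate two-sided power for two equal-sized groups."""
    h = abs(2 * math.asin(math.sqrt(p_old)) - 2 * math.asin(math.sqrt(p_new)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(h * math.sqrt(n_per_group / 2) - z_alpha)


# Compare a small sample with the two sample sizes derived above (rounded up).
for n in (1000, 4721, 6320):
    print(f"n per group = {n}: power ~ {approx_power(n):.3f}")
```

Under this approximation the two rounded-up sample sizes should come out close to the 0.80 and 0.90 power levels they were solved for, while a much smaller sample falls well short.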


## 3. Preparing the data

In [4]:
# Install scipy.
!pip install scipy



In [5]:
# Import necessary libraries, packages and classes.
import pandas as pd
import math
import numpy as np
import statsmodels.stats.api as sms
import scipy.stats as st
import matplotlib as mpl
import matplotlib.pyplot as plt

In [6]:
# Read the CSV file (ab_data.csv).
df = pd.read_csv('ab_data.csv')

# View the DataFrame.
df.head()

|   | user_id | timestamp                  | group     | landing_page | converted |
|---|---------|----------------------------|-----------|--------------|-----------|
| 0 | 851104  | 2017-01-21 22:11:48.556739 | control   | old_page     | 0         |
| 1 | 804228  | 2017-01-12 08:01:45.159739 | control   | old_page     | 0         |
| 2 | 661590  | 2017-01-11 16:55:06.154213 | treatment | new_page     | 0         |
| 3 | 853541  | 2017-01-08 18:28:03.143765 | treatment | new_page     | 0         |
| 4 | 864975  | 2017-01-21 01:52:26.210827 | control   | old_page     | 1         |


In [7]:
# Check the metadata. 294478 entries.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [8]:
# Check for duplicates.
# Using Pandas's duplicated() function to check the user_id column. 
print(df[df.user_id.duplicated()])

        user_id                   timestamp      group landing_page  converted
2656     698120  2017-01-15 17:13:42.602796    control     old_page          0
2893     773192  2017-01-14 02:55:59.590927  treatment     new_page          0
7500     899953  2017-01-07 03:06:54.068237    control     new_page          0
8036     790934  2017-01-19 08:32:20.329057  treatment     new_page          0
10218    633793  2017-01-17 00:16:00.746561  treatment     old_page          0
...         ...                         ...        ...          ...        ...
294308   905197  2017-01-03 06:56:47.488231  treatment     new_page          0
294309   787083  2017-01-17 00:15:20.950723    control     old_page          0
294328   641570  2017-01-09 21:59:27.695711    control     old_page          0
294331   689637  2017-01-13 11:34:28.339532    control     new_page          0
294355   744456  2017-01-13 09:32:07.106794  treatment     new_page          0

[3894 rows x 5 columns]


In [9]:
# Drop duplicate users.
# Use drop_duplicates to return the DataFrame without repeated user_id rows
# (by default the first occurrence of each user_id is kept).
df2 = df.drop_duplicates(subset = 'user_id') 

# Check the metadata.
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290584 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       290584 non-null  int64 
 1   timestamp     290584 non-null  object
 2   group         290584 non-null  object
 3   landing_page  290584 non-null  object
 4   converted     290584 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 13.3+ MB
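
The default behaviour of drop_duplicates keeps the first row for each repeated user_id. A stricter alternative, which the notebook does not use, is to discard every user who appears more than once, since such a user may have been exposed to both pages. The toy DataFrame below is invented purely to show the difference between the two options.

```python
import pandas as pd

# Hypothetical data: user 2 appears twice, once in each group.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "group": ["control", "treatment", "control", "treatment"],
    "converted": [0, 1, 0, 0],
})

# Default keep='first': retain the first row seen for each user_id.
first_kept = toy.drop_duplicates(subset="user_id")

# keep=False: drop every row whose user_id occurs more than once.
none_kept = toy.drop_duplicates(subset="user_id", keep=False)

print(f"keep='first': {len(first_kept)} rows, keep=False: {len(none_kept)} rows")
```

Which option is appropriate depends on why the duplicates exist, which the dataset description does not say.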


In [10]:
# Remove unnecessary columns.
# Use drop() to remove irrelevant columns from the DataFrame.
# user_id and timestamp are not needed for the analysis; axis=1 refers to columns.
df3 = df2.drop(['user_id', 'timestamp'], axis=1)  

# Check the DataFrame.
df3.head()

|   | group     | landing_page | converted |
|---|-----------|--------------|-----------|
| 0 | control   | old_page     | 0         |
| 1 | control   | old_page     | 0         |
| 2 | treatment | new_page     | 0         |
| 3 | treatment | new_page     | 0         |
| 4 | control   | old_page     | 1         |


### Checking for errors (Cross-Tabulation)

People in the control group should have viewed the old page and people in the treatment group should have viewed the new page. A cross-tabulation is used to count the rows in this dataset where the group and the page do not match.

In [11]:
# Check for errors.
# Use crosstab to compute a simple cross-tabulation between two variables.
pd.crosstab(df3['group'], df3['landing_page'])

| group / landing_page | new_page | old_page |
|----------------------|----------|----------|
| control              | 1006     | 144226   |
| treatment            | 144314   | 1038     |


Everyone in the control group should have seen the old landing page:
* 1006 people in the control group saw the new page

Everyone in the treatment group should have seen the new page:
* 1038 people in the treatment group saw the old page
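
One way to flag these mismatched rows explicitly is to map each group to the page it should have seen and compare that with the page actually recorded. This is a sketch on a small invented DataFrame, not the notebook's own method; the names `expected_page` and `mismatch` are illustrative.

```python
import pandas as pd

# Hypothetical rows: the second and fourth rows are mismatched on purpose.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
})

# Page each group is supposed to see.
expected_page = toy["group"].map({"control": "old_page", "treatment": "new_page"})

# True where the recorded page differs from the expected page.
mismatch = toy["landing_page"] != expected_page
print(f"Mismatched rows: {mismatch.sum()}")

# Keep only the correctly assigned rows.
clean = toy[~mismatch]
```

Keeping the mapping in one dictionary makes the rule easy to extend if further groups or page variants are added.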

In [12]:
# Keep only the rows where the group matches the page shown
# (control with old_page, treatment with new_page).
df4 = df3[((df3.group == 'control') & (df3.landing_page == 'old_page')) | (
    (df3.group == 'treatment') & (df3.landing_page == 'new_page'))]

# Print the shape of the new final table.
print(df4.shape)
df4['group'].value_counts()

(288540, 3)


treatment    144314
control      144226
Name: group, dtype: int64

In [13]:
# Re-check/compute another simple cross-tabulation.
pd.crosstab(df4['group'], df4['landing_page'])

| group / landing_page | new_page | old_page |
|----------------------|----------|----------|
| control              | 0        | 144226   |
| treatment            | 144314   | 0        |


## 4. Perform random sampling with Pandas

In [14]:
# Obtain a simple random sample (every member of the population has an
# equal chance of being included) for the control and treatment groups with n = 4721;
# set the random_state seed to an arbitrary value of 22.
# Obtain a simple random sample for the control group.
control_sample = df4[df4['group'] == 'control'].sample(n=4721, 
                                                       random_state=22) 

# Obtain a simple random sample for the treatment group.
treatment_sample = df4[df4['group'] == 'treatment'].sample(n=4721,
                                                           random_state=22)
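
An alternative to sampling each group separately is to sample within groups in a single call. The sketch below assumes pandas 1.1 or later, where groupby objects provide a sample method, and uses an invented toy DataFrame. With the same seed it may not draw the same rows as the two separate calls above, so it illustrates the technique rather than reproducing this notebook's sample.

```python
import pandas as pd

# Hypothetical data with unequal group sizes.
toy = pd.DataFrame({
    "group": ["control"] * 50 + ["treatment"] * 60,
    "converted": [0, 1] * 55,
})

# Draw the same number of rows from each group in one call.
balanced = toy.groupby("group").sample(n=20, random_state=22)
print(balanced["group"].value_counts())
```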

In [15]:
# Join the two samples.  
ab_test = pd.concat([control_sample, treatment_sample], axis=0)  

# Reset the A/B index.
ab_test.reset_index(drop=True, inplace=True) 

# Print the sample table.
ab_test  

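|      | group     | landing_page | converted |
|------|-----------|--------------|-----------|
| 0    | control   | old_page     | 0         |
| 1    | control   | old_page     | 0         |
| 2    | control   | old_page     | 0         |
| 3    | control   | old_page     | 0         |
| 4    | control   | old_page     | 0         |
| ...  | ...       | ...          | ...       |
| 9437 | treatment | new_page     | 0         |
| 9438 | treatment | new_page     | 0         |
| 9439 | treatment | new_page     | 0         |
| 9440 | treatment | new_page     | 0         |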


## 5. Analyse the data

In [16]:
# Calculate basic statistics.
# Import library.
# SEM stands for standard error of the mean.
from scipy.stats import sem

# Group the ab_test data set by group and aggregate by converted.
conversion_rates = ab_test.groupby('group')['converted']

# Calculate the conversion rate (mean), standard deviation and
# standard error of the mean for each group.
conversion_rates = conversion_rates.agg([np.mean, np.std, sem])

# Assign names to the three columns.
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']  

# Round the output to 3 decimal places.
conversion_rates.style.format('{:.3f}')

| group     | conversion_rate | std_deviation | std_error |
|-----------|-----------------|---------------|-----------|
| control   | 0.124           | 0.33          | 0.005     |
| treatment | 0.115           | 0.319         | 0.005     |


For binary data, the sample mean is the sample proportion: adding up the 0s and 1s gives the total number of 1s (successes), and dividing that total by the sample size gives the sample mean, which equals the sample proportion.

- From the table above, 12.4% of those in the control group with the old landing page clicked through.
- For the treatment group, 11.5% clicked through.

Is the difference due to chance, or is it statistically significant? To determine this, we perform a formal hypothesis test.

__Note:__ As the sample size gets bigger, the estimate of the population parameters should improve and the standard error should get smaller.
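
Both points can be checked with a few lines of arithmetic. The sketch below first shows that the mean of a 0/1 column is the share of 1s, using a made-up list of outcomes, and then evaluates the usual standard error formula for a proportion, sqrt(p(1 - p)/n), at several sample sizes. It uses the rounded control rate of 0.124 from the table, so the figures are approximate.

```python
import math

# Hypothetical binary outcomes: 2 conversions out of 8 visitors.
converted = [0, 1, 0, 0, 1, 0, 0, 0]
proportion = sum(converted) / len(converted)  # Mean of 0/1 values = share of 1s.
print(f"Sample proportion: {proportion}")


def proportion_se(p: float, n: int) -> float:
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


p_control = 0.124  # Rounded control conversion rate from the table above.
for n in (1000, 4721, 20000):
    print(f"n = {n}: standard error ~ {proportion_se(p_control, n):.4f}")
```

At n = 4721 the formula gives a value that rounds to the 0.005 shown in the table, and the standard error shrinks in proportion to the square root of the sample size.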

In [17]:
# Calculate statistical significance.
# Import proportions_ztest and proportion_confint from statsmodels.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Create a subset of control and treatment results.
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

# Determine the count of the control_results and 
# treatment_result sub-data sets and store them in their respective variables.
n_con = control_results.count()
n_treat = treatment_results.count()

# Create a variable 'successes' that stores the number of conversions
# in each group in list format.
successes = [control_results.sum(), treatment_results.sum()]

# Create a variable 'nobs' which stores the values of 
# variables n_con and n_treat in list format. 
nobs = [n_con, n_treat] 

# Use the imported libraries to calculate the statistical values. 
z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes,
                                                                        nobs=nobs,
                                                                        alpha=0.05)

# Print the outputs (with lead-in text). The .3f indicates the number of decimal places.
print(f"Z test stat: {z_stat:.3f}")
print(f"P-value: {pval:.3f}")
print(f"Confidence Interval of 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]")
print(f"Confidence Interval of 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]")

Z test stat: 1.396
P-value: 0.163
Confidence Interval of 95% for control group: [0.115, 0.134]
Confidence Interval of 95% for treatment group: [0.106, 0.124]
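
To make the calculation behind these numbers visible, the sketch below implements a two-proportion z-test with a pooled standard error and a normal-approximation (Wald) confidence interval by hand. To my understanding these correspond to the default settings of the two statsmodels functions used above. The conversion counts of 586 and 542 are assumptions: the notebook does not print them, and I chose them because they round to the reported 12.4% and 11.5% rates. The function names are my own.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for equal proportions using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def wald_interval(x: int, n: int, alpha: float = 0.05):
    """Normal-approximation (Wald) confidence interval for one proportion."""
    p = x / n
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# ASSUMED counts (not printed in the notebook), consistent with the rounded rates.
x_con, x_treat, n = 586, 542, 4721

z, p_value = two_proportion_ztest(x_con, n, x_treat, n)
lower_con, upper_con = wald_interval(x_con, n)
lower_treat, upper_treat = wald_interval(x_treat, n)

print(f"z = {z:.3f}, p = {p_value:.3f}")
print(f"Control 95% CI: [{lower_con:.3f}, {upper_con:.3f}]")
print(f"Treatment 95% CI: [{lower_treat:.3f}, {upper_treat:.3f}]")
```

If the assumed counts are right, the hand calculation should agree with the library output to the displayed precision, which is a useful check that the test being run is the one intended.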


## 6. Conclusion

In a z-test, under the null hypothesis the test statistic approximately follows a standard normal distribution.

The p-value is the probability of observing the test statistic value or a more extreme value, conditional on the null hypothesis being true. As the test above is a two-sided z-test, the p-value is the probability of the statistic being greater than 1.396 or less than -1.396.

The standard normal distribution gives 0.081357 for the area in the lower tail below -1.396. As it is a two-sided test, the total p-value is the sum of the two tail areas, which are equal by the symmetry of the distribution: 2 × 0.081357 ≈ 0.163.
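
The tail areas can be reproduced with the standard library's NormalDist class. This is a small sketch that starts from the rounded test statistic of 1.396, so the last digits may differ slightly from a calculation based on the unrounded value.

```python
from statistics import NormalDist

z_stat = 1.396  # Test statistic reported above (rounded).
std_normal = NormalDist()  # Mean 0, standard deviation 1.

lower_tail = std_normal.cdf(-z_stat)     # P(Z < -1.396)
upper_tail = 1 - std_normal.cdf(z_stat)  # P(Z > 1.396)
p_value = lower_tail + upper_tail        # Two-sided p-value.

print(f"Lower tail: {lower_tail:.6f}")
print(f"Two-sided p-value: {p_value:.3f}")
```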

Our results gave $p=0.163$ (16.3%) against $\alpha=0.05$. Since $p > \alpha$, we cannot reject $H_0$. The $p$-value measures how compatible the data are with $H_0$: it is the probability of seeing a difference at least this large if the two pages truly converted at the same rate. A $p$-value of 0.163 means we have no statistically significant evidence that the conversion rates of the two landing pages differ, so there is no evidence that the new design performed better than the old one. The 95% confidence interval for the treatment group, [0.106, 0.124], does not reach the 13% baseline, never mind the desired 15%, which is consistent with the new design failing to deliver the targeted increase in the conversion rate.

This means one of two things has happened:
1. Correct non-rejection of the null hypothesis: the two landing pages are equally effective.
2. A Type 2 error (a false negative): the pages differ, but the test did not detect it.

The test was designed with 80% power for a two-percentage-point effect, so if the alternative hypothesis were true with an effect of that size, there would be a 20% probability of failing to reject the null hypothesis, i.e. of a Type 2 error.


Although the website redesign did not result in a higher conversion rate, the result of the test does not necessarily need to be seen as a failure. The non-rejection of $H_0$ simply means that there isn’t enough statistical evidence to reject it. In theory, we can repeat the test with a different sample size or a different significance level. If we continually get the same test results, it becomes less likely that the outcome is due to a Type 1 or Type 2 error.

The e-commerce company now knows that it may not be the home page design that is discouraging users from taking the desired action. There could also be invalid assumptions in the test itself. With this in mind, the company can reconsider how to improve the conversion rate, for example through changes in pricing or in the range of products on offer, or it can run more tests.