### LSE Data Analytics Online Career Accelerator 
# Course 301: Advanced Analytics for Organisational Impact

### 1.1.7 Practical activity: Conduct A/B testing in Python

### Scenario

An online bicycle store has changed its home page interface to encourage visitors to click through to its loyalty programme sign-up page. It hopes the new interface will encourage more visitors to access the loyalty programme page, to see what benefits the programme brings, and to sign up. The current click-through rate (CTR) is around 50% annually, and the company hopes the new design will push this to at least 55%.

This analysis uses the bike_shop.csv data set. Using your Python and data wrangling skills, you will run an A/B test on the data to measure the significance of the interface change based on CTR to the loyalty programme page.

In [1]:
# Import statsmodels for statistical calculations and the TTestIndPower class to calculate the parameters.
import statsmodels.stats.api as sms
from statsmodels.stats.power import TTestIndPower
# Specify the three required parameters for the power analysis:
alpha = 0.05
power = 0.80
effect = sms.proportion_effectsize(0.5, 0.55)
# Run the power analysis by using the solve_power() function:
# Create an instance of TTestIndPower.
analysis = TTestIndPower()
# Calculate the sample size per group (nobs1=None is the value solved for).
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

# Print the output.
print('Sample Size: %.3f' % result)



Sample Size: 1565.490
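
As a cross-check on the cell above, the effect size returned by `proportion_effectsize` is Cohen's h, which can be reproduced by hand. The sketch below also estimates the sample size with the usual normal approximation. That approximation is an assumption made here for illustration: `TTestIndPower` uses the t distribution, so its answer is slightly larger.

```python
import math
from statistics import NormalDist

p1, p2 = 0.50, 0.55          # current and hoped-for CTR from the scenario
alpha, power = 0.05, 0.80    # same parameters as the power analysis above

# Cohen's h: difference of the arcsine-transformed proportions.
h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Normal approximation to the per-group sample size for a two-sided test
# (illustrative only; not the exact calculation statsmodels performs).
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_power = NormalDist().inv_cdf(power)
n_approx = 2 * ((z_alpha + z_power) / h) ** 2

print(f'Effect size (Cohen h): {h:.4f}')
print(f'Approximate sample size per group: {n_approx:.1f}')
```

The sign of h is negative because the smaller proportion is passed first. This does not affect a two-sided power calculation, which depends only on the magnitude of the effect.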


In [2]:
# Install the relevant modules: 
!pip install scipy
# Import necessary libraries, packages, and classes 
import pandas as pd
import math
import numpy as np
import statsmodels.stats.api as sms
import scipy.stats as st
import matplotlib as mpl
import matplotlib.pyplot as plt 
# Read the CSV file.
df = pd.read_csv('bike_shop_28.02.csv')
# View the DataFrame
df.head()



Unnamed: 0,RecordID,IP Address,LoggedInFlag,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [3]:
# Check the metadata.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   RecordID       184588 non-null  int64 
 1   IP Address     184588 non-null  object
 2   LoggedInFlag   184588 non-null  int64 
 3   ServerID       184588 non-null  int64 
 4   VisitPageFlag  184588 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 7.0+ MB


In [4]:
# Create a new variable to store the cleaned data and rename the columns.
df_new = df.rename(columns={'IP Address': 'IPAddress', 'LoggedInFlag': 'LoyaltyPage'})
# View the DataFrame.
df_new.head()

Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [5]:
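# Drop duplicate values.
# Use drop_duplicates to return a DataFrame without the duplicated IP addresses.
# keep=False removes every row whose IPAddress appears more than once.
# Do not pass inplace=True here: it makes the method return None.
df_new = df_new.drop_duplicates(subset='IPAddress', keep=False)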
# Check the metadata.
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39608 entries, 7 to 184584
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   RecordID       39608 non-null  int64 
 1   IPAddress      39608 non-null  object
 2   LoyaltyPage    39608 non-null  int64 
 3   ServerID       39608 non-null  int64 
 4   VisitPageFlag  39608 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 1.8+ MB
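
The large drop in row count comes from `keep=False`, which discards every row that shares an IP address with another row rather than keeping one representative. A minimal sketch on hypothetical toy data (the values below are made up for illustration and are not from the bike shop data set):

```python
import pandas as pd

# Hypothetical toy data: address 'a' appears twice, 'b' and 'c' once each.
toy = pd.DataFrame({'IPAddress': ['a', 'a', 'b', 'c'],
                    'LoyaltyPage': [1, 0, 1, 0]})

# keep='first' retains one row per address.
first_kept = toy.drop_duplicates(subset='IPAddress', keep='first')

# keep=False discards every row whose address is repeated.
none_kept = toy.drop_duplicates(subset='IPAddress', keep=False)

print(first_kept['IPAddress'].tolist())
print(none_kept['IPAddress'].tolist())
```

Dropping all repeated addresses is the stricter choice: it avoids counting a visitor who may have appeared on more than one server, at the cost of a smaller data set.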


In [9]:
# Remove unnecessary columns.
# Use DataFrame.drop to remove irrelevant columns from the DataFrame.
# Specify that RecordID and VisitPageFlag are columns (i.e. axis=1).
df_final = df_new.drop(['RecordID', 'VisitPageFlag'], axis=1)
# Check the metadata.
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39608 entries, 7 to 184584
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   IPAddress    39608 non-null  object
 1   LoyaltyPage  39608 non-null  int64 
 2   ServerID     39608 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ MB


In [10]:
df_final.head()

Unnamed: 0,IPAddress,LoyaltyPage,ServerID
7,97.6.126.6,0,3
12,188.13.62.2,0,3
14,234.1.239.1,0,2
15,167.15.157.7,0,2
16,123.12.229.8,0,1


In [16]:
# Split the data set into ID1 as treatment and ID2 & ID3 as control groups.
df_final['Group'] = df_final['ServerID'].map({1:'Treatment', 2:'Control', 3:'Control'})
# View the DataFrame.
df_final.head()

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
7,97.6.126.6,0,3,Control
12,188.13.62.2,0,3,Control
14,234.1.239.1,0,2,Control
15,167.15.157.7,0,2,Control
16,123.12.229.8,0,1,Treatment


In [17]:
# Determine the sample sizes.
df_final['Group'].value_counts()

Control      26310
Treatment    13298
Name: Group, dtype: int64

In [20]:
# Obtain a simple random sample for the control and treatment groups with n = 1565.
# Set the random_state seed to an arbitrary value of 42.
# Obtain a simple random sample for the control group.
control_sample = df_final[df_final['Group'] == 'Control'].sample(n=1565, random_state=42)

# Obtain a simple random sample for the treatment group.
treatment_sample = df_final[df_final['Group'] == 'Treatment'].sample(n=1565, random_state=42)

In [22]:
# View the Control DataFrame.
control_sample.head()

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
53313,25.16.126.2,1,3,Control
52290,106.13.67.3,1,3,Control
104046,169.11.137.7,0,2,Control
171756,164.9.86.8,1,2,Control
2317,112.12.25.7,0,2,Control


In [23]:
# View the Treatment DataFrame.
treatment_sample.head()

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
173762,251.0.251.9,1,1,Treatment
150588,16.1.214.6,1,1,Treatment
72805,39.3.26.5,0,1,Treatment
112098,90.14.154.1,1,1,Treatment
32507,18.5.206.8,0,1,Treatment


## Perform the A/B test:

In [24]:
# Join the two samples.
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
# Reset the A/B index.
ab_test.reset_index(drop=True, inplace=True)
# View the combined A/B test DataFrame.
ab_test

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
0,25.16.126.2,1,3,Control
1,106.13.67.3,1,3,Control
2,169.11.137.7,0,2,Control
3,164.9.86.8,1,2,Control
4,112.12.25.7,0,2,Control
...,...,...,...,...
3125,32.16.5.8,0,1,Treatment
3126,187.4.117.9,1,1,Treatment
3127,134.0.112.5,1,1,Treatment
3128,7.3.242.7,0,1,Treatment


In [41]:
# Calculate basic statistics.
# Import the sem (standard error of the mean) function.
from scipy.stats import sem
# Group the ab_test data set by Group and select the LoyaltyPage column.
conversion_rates = ab_test.groupby('Group')['LoyaltyPage']
# Calculate the conversion rate (mean), sample standard deviation, and standard error per group.
conversion_rates = conversion_rates.agg(['mean', 'std', sem])
# Assign names to the three columns.
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']
# Round the results to three decimal places.
conversion_rates.style.format('{:.3f}')

Group,conversion_rate,std_deviation,std_error
Control,0.532,0.499,0.013
Treatment,0.483,0.500,0.013
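
Because `LoyaltyPage` is a 0/1 column, the standard deviation and standard error in the table follow directly from the conversion rate and the sample size. The sketch below reproduces them from the rounded rates shown above, so the last digits are approximate.

```python
import math

n = 1565  # sample size per group used in the notebook
rates = {'Control': 0.532, 'Treatment': 0.483}  # rounded rates from the table

stats = {}
for group, p in rates.items():
    # Sample standard deviation of a 0/1 variable (ddof=1).
    std = math.sqrt(p * (1 - p) * n / (n - 1))
    # Standard error of the mean.
    se = std / math.sqrt(n)
    stats[group] = (std, se)
    print(f'{group}: std = {std:.3f}, std_error = {se:.3f}')
```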


In [42]:
# Calculate the statistical significance.
# Import proportions_ztest and proportion_confint from statsmodels.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
# Create a subset of control and treatment results.
control_results = ab_test[ab_test['Group'] == 'Control']['LoyaltyPage']
treatment_results = ab_test[ab_test['Group'] == 'Treatment']['LoyaltyPage']
# Determine the count of the control_results and the treatment_results sub-datasets and store them in their respective variables.
n_con = control_results.count()
n_treat = treatment_results.count()
# Create a variable 'successes' that stores the number of click-throughs in each group as a list.
successes = [control_results.sum(), treatment_results.sum()]
# Create a variable 'nobs' which stores the values of n_con and n_treat as a list.
nobs = [n_con, n_treat]
# Use the imported functions to calculate the test statistic, p-value, and confidence intervals.
z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes,
                                                                        nobs=nobs,
                                                                        alpha=0.05)
# Print the outputs (with lead-in text).
print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {pval:.3f}')
print(f'Confidence Interval of 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'Confidence Interval of 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

Z test stat: 2.75
P-value: 0.006
Confidence Interval of 95% for control group: [0.508, 0.557]
Confidence Interval of 95% for treatment group: [0.458, 0.508]
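
To make the calculation behind these numbers explicit, here is a minimal hand-rolled two-proportion z-test with normal-approximation confidence intervals. The success counts (833 and 756) are assumptions back-calculated from the rounded conversion rates above, not values read from the data, so treat the result as an approximate check rather than a replacement for `proportions_ztest`.

```python
import math
from statistics import NormalDist

n_con = n_treat = 1565
x_con, x_treat = 833, 756  # assumed counts, inferred from the rounded rates

p_con = x_con / n_con
p_treat = x_treat / n_treat

# Pooled proportion and standard error under H0 (equal proportions).
p_pool = (x_con + x_treat) / (n_con + n_treat)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_con + 1 / n_treat))

# Two-sided z-test.
z_stat = (p_con - p_treat) / se_pool
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Normal-approximation (Wald) 95% confidence interval for one proportion.
z_crit = NormalDist().inv_cdf(0.975)

def wald_ci(p, n):
    half_width = z_crit * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

ci_con = wald_ci(p_con, n_con)
ci_treat = wald_ci(p_treat, n_treat)

print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {p_value:.3f}')
print(f'Control CI: [{ci_con[0]:.3f}, {ci_con[1]:.3f}]')
print(f'Treatment CI: [{ci_treat[0]:.3f}, {ci_treat[1]:.3f}]')
```

A positive z statistic here means the control proportion is the larger of the two, because the control group is listed first.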


## 7. Summarise results and explain your answers

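The change to the home page decreased the click-through rate to the loyalty programme page: in the samples, the CTR was 48.3% for the treatment group compared with 53.2% for the control group.

The `p`-value (0.006) is smaller than the alpha value of 0.05, so we reject $H_0$ that the two groups have the same CTR. The difference is statistically significant, but it is in the opposite direction to the increase the company hoped for.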