### LSE Data Analytics Online Career Accelerator 
# Course 301: Advanced Analytics for Organisational Impact

## Practical activity: Conduct A/B testing in Python

**This is the solution to the activity.**

An online bicycle store has changed its home page interface to encourage visitors to click through to its loyalty programme sign-up page. The company hopes the new interface will lead more visitors to open the loyalty programme page, see the benefits the programme offers, and then sign up. The current click-through rate is around 50% annually, and the company wants the new design to raise it to at least 55%.

This analysis uses the `new_bike_shop_AB.csv` data set. Using your Python and data wrangling skills, you will run an A/B test on the data to measure the significance of the interface change based on click-through rates to the loyalty programme page. 

## 1. Prepare your workstation

In [1]:
# Import the necessary libraries.
import statsmodels.stats.api as sms
from statsmodels.stats.power import TTestIndPower

## 2. Perform power analysis

In [2]:
# Perform the power analysis to determine sample size.
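# Effect size (Cohen's h) for a change from 50% to 55%.
effect = sms.proportion_effectsize(0.50, 0.55)

# Significance level and desired statistical power.
alpha = 0.05
power = 0.8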

analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)

print('Sample Size: %.3f' % result)

Sample Size: 1565.490
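
As a rough cross-check on the figure above, the sketch below recomputes the effect size and the per-group sample size with the standard library only. It assumes the normal-approximation formula n = 2 * ((z_alpha + z_power) / h)^2 rather than the t-distribution-based solver in `TTestIndPower`, so the result should be close to, but not identical to, 1565.490. The function names are illustrative.

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def approx_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size (normal approximation, equal groups)."""
    h = abs(cohens_h(p1, p2))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / h) ** 2


n = approx_sample_size(0.50, 0.55)
print(f'Effect size (h): {cohens_h(0.50, 0.55):.4f}')
print(f'Approximate sample size per group: {n:.1f}')
print(f'Rounded up: {math.ceil(n)}')
```

A required sample size is normally rounded up rather than truncated. Rounding 1565.490 up gives 1,566 per group, so the 1,565 rows sampled later in this notebook are marginally below that figure.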


## 3. Import data set

In [3]:
# Import the necessary libraries.
import pandas as pd
import math
import numpy as np
import statsmodels.stats.api as sms
import scipy.stats as st
import matplotlib as mpl
import matplotlib.pyplot as plt

In [4]:
# Read the data set with Pandas.
df = pd.read_csv('new_bike_shop_AB.csv')

# Print the DataFrame.
print(df.shape)
df.head()

(184588, 5)


Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [5]:
# View the DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   RecordID       184588 non-null  int64 
 1   IPAddress      184588 non-null  object
 2   LoyaltyPage    184588 non-null  int64 
 3   ServerID       184588 non-null  int64 
 4   VisitPageFlag  184588 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 7.0+ MB


## 4. Clean the data

In [6]:
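# Rename the columns. Keys that are not present in df are ignored,
# so this is a no-op if the columns already have these names.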
df_new = df.rename(columns={'IP Address': 'IPAddress', 'LoggedInFlag': 'LoyaltyPage'})

# View the DataFrame.
print(df_new.shape)
df_new.head()

(184588, 5)


Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [8]:
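# Drop every row whose IPAddress appears more than once.
# keep=False removes all copies, so only IP addresses seen exactly once remain.
df_new.drop_duplicates(subset='IPAddress', keep=False, inplace=True)


# Drop the columns that are not needed for the test.
df_final = df_new.drop(['RecordID', 'VisitPageFlag'], axis=1)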


# View the DataFrame.
print(df_final.shape)

df_final.info()

(39608, 3)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39608 entries, 7 to 184584
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   IPAddress    39608 non-null  object
 1   LoyaltyPage  39608 non-null  int64 
 2   ServerID     39608 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ MB
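
To make the effect of `keep=False` concrete, here is a small pure-Python sketch on made-up records (the IP addresses and values are hypothetical, not taken from the data set). It mirrors what `drop_duplicates(subset='IPAddress', keep=False)` does: every row whose IP address occurs more than once is removed, including the first occurrence.

```python
from collections import Counter

# Hypothetical visit records: (IPAddress, LoyaltyPage).
visits = [
    ('10.0.0.1', 1),
    ('10.0.0.2', 0),
    ('10.0.0.1', 0),  # same IP address as the first row
    ('10.0.0.3', 1),
]

# Count how many times each IP address appears.
ip_counts = Counter(ip for ip, _ in visits)

# Keep only rows whose IP address appears exactly once (the keep=False behaviour).
unique_visits = [row for row in visits if ip_counts[row[0]] == 1]

print(unique_visits)
```

This is why the row count falls from 184,588 to 39,608: repeat visitors are excluded entirely, rather than being reduced to one row each as `keep='first'` would do.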


## 5. Subset the DataFrame

In [9]:
# Label ServerID 1 as the treatment group and ServerID 2 and 3 as the control group.
df_final['Group'] = df_final['ServerID'].map({1:'Treatment', 2:'Control', 3:'Control'})

# View the DataFrame.
print(df_final.shape)
df_final.head()

(39608, 4)


Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
7,97.6.126.6,0,3,Control
12,188.13.62.2,0,3,Control
14,234.1.239.1,0,2,Control
15,167.15.157.7,0,2,Control
16,123.12.229.8,0,1,Treatment


In [10]:
# Count the values.
df_final["Group"].value_counts()

Control      26310
Treatment    13298
Name: Group, dtype: int64

In [12]:
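# Draw a random sample of 1,565 rows from each group (the size from the power analysis).
# random_state fixes the seed for reproducibility. Any value works, but a different
# seed selects different rows, so the results below will change slightly.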
c_sample = df_final[df_final['Group'] == 'Control'].sample(n=1565,
                                                           random_state=22) 

t_sample = df_final[df_final['Group'] == 'Treatment'].sample(n=1565,
                                                             random_state=22)

# View the DataFrames.
print(c_sample)
print(t_sample)

           IPAddress  LoyaltyPage  ServerID    Group
178875   110.13.32.9            0         3  Control
127932     11.1.11.1            1         2  Control
20425    234.4.250.3            0         2  Control
104611   244.12.33.1            1         2  Control
132873    232.4.61.9            0         2  Control
...              ...          ...       ...      ...
113936   111.11.27.7            1         3  Control
118109   251.0.237.2            1         3  Control
87173     192.2.62.4            1         3  Control
44491     65.12.18.6            1         2  Control
65759   210.16.220.4            1         3  Control

[1565 rows x 4 columns]
          IPAddress  LoyaltyPage  ServerID      Group
127674   79.7.253.6            1         1  Treatment
36272     13.16.6.9            1         1  Treatment
100570  213.10.22.5            0         1  Treatment
162988   232.4.29.4            1         1  Treatment
159937   75.1.232.4            0         1  Treatment
...            

## 6. Perform A/B testing

In [13]:
# Combine the control and treatment samples into one DataFrame.
ab_test = pd.concat([c_sample, t_sample], axis=0)

ab_test.reset_index(drop=True, inplace=True)

# View the output.
ab_test

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
0,110.13.32.9,0,3,Control
1,11.1.11.1,1,2,Control
2,234.4.250.3,0,2,Control
3,244.12.33.1,1,2,Control
4,232.4.61.9,0,2,Control
...,...,...,...,...
3125,83.12.121.1,0,1,Treatment
3126,158.5.109.2,1,1,Treatment
3127,28.2.49.7,0,1,Treatment
3128,81.8.204.9,1,1,Treatment


In [14]:
# Calculate the conversion rates.
conversion_rates = ab_test.groupby('Group')['LoyaltyPage']


# Standard deviation of the proportion (population formula, ddof=0).
STD_p = lambda x: np.std(x, ddof=0)
# Standard error of the proportion: standard deviation / sqrt(n).
SE_p = lambda x: st.sem(x, ddof=0)

# The mean of a 0/1 column is the proportion of visitors who clicked through.
conversion_rates = conversion_rates.agg(['mean', STD_p, SE_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']


conversion_rates.style.format('{:.3f}')

Group,conversion_rate,std_deviation,std_error
Control,0.513,0.5,0.013
Treatment,0.498,0.5,0.013
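
For a 0/1 outcome, the standard deviation and standard error in the table follow directly from the conversion rate. A minimal sketch, assuming the rounded control rate of 0.513 and the per-group sample size of 1,565 shown above (the helper name is illustrative):

```python
import math


def proportion_std_and_se(p, n):
    """Population standard deviation and standard error of a 0/1 variable."""
    std = math.sqrt(p * (1 - p))  # equals np.std(x, ddof=0) for 0/1 data
    se = std / math.sqrt(n)       # standard error of the proportion
    return std, se


std, se = proportion_std_and_se(0.513, 1565)
print(f'std_deviation: {std:.3f}, std_error: {se:.3f}')
```

Because both rates are close to 0.5, where p * (1 - p) is at its maximum, the two groups show almost identical standard deviations and standard errors.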


In [15]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Click-through outcomes (0 or 1) for each group.
control_results = ab_test[ab_test['Group'] == 'Control']['LoyaltyPage']
treatment_results = ab_test[ab_test['Group'] == 'Treatment']['LoyaltyPage']

# Number of observations in each group.
n_con = control_results.count()
n_treat = treatment_results.count()

# Number of click-throughs in each group.
successes = [control_results.sum(), treatment_results.sum()]

nobs = [n_con, n_treat]

# Two-sided z-test for the difference between the two proportions.
z_stat, pval = proportions_ztest(successes, nobs=nobs)

# 95% confidence interval for each group's click-through rate.
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes,
                                                                        nobs=nobs,
                                                                        alpha=0.05)

print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {pval:.3f}')
print(f'Confidence Interval of 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'Confidence Interval of 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

Z test stat: 0.86
P-value: 0.391
Confidence Interval of 95% for control group: [0.488, 0.538]
Confidence Interval of 95% for treatment group: [0.473, 0.523]
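
The z statistic and *p*-value can be reproduced by hand with the pooled two-proportion formula. The sketch below uses only the standard library. The click-through counts (803 and 779) are assumptions back-calculated from the rounded rates of 0.513 and 0.498, not values read from the data, so treat the output as approximate.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed counts: roughly 0.513 * 1565 and 0.498 * 1565.
z_stat, pval = two_proportion_ztest(803, 1565, 779, 1565)
print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {pval:.3f}')
```

With these assumed counts the result should land close to the reported z statistic of 0.86 and *p*-value of 0.391. Any small difference would come from the rounding of the rates.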


## 7. Summarise results and explain your answers

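In the samples, the click-through rate to the loyalty programme page was slightly lower for the new home page (49.8%) than for the existing one (51.3%), and the two 95% confidence intervals overlap.

The *p*-value of 0.391 is well above the significance level (alpha) of 0.05, so the null hypothesis of no difference between the two click-through rates cannot be rejected. The test provides no evidence that the new interface changed the click-through rate, and the treatment group's confidence interval of [0.473, 0.523] does not reach the 55% target.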