# Course 301: Advanced Analytics for Organisational Impact

## Practical activity: Conduct A/B testing in Python

An online bicycle store has changed its home page interface to encourage visitors to click through to its loyalty programme sign-up page. The store hopes the new interface will lead more visitors to open the loyalty programme page, see the benefits on offer, and then sign up. The current click-through rate (CTR) sits at around 50% annually, and the company hopes the new design will push this to at least 55%.

This analysis uses the `bike_shop.csv` data set. Using your Python and data wrangling skills, you will run an A/B test on the data to measure the significance of the interface change based on CTR to the loyalty programme page. 

## 1. Prepare your workstation

In [1]:
# Import the necessary libraries.
import statsmodels.stats.api as sms
from statsmodels.stats.power import TTestIndPower

## 2. Perform power analysis

In [2]:
# Perform the power analysis to determine sample size.
# Effect size for a change in CTR from 50% to 55%.
effect = sms.proportion_effectsize(0.50, 0.55)

# Significance level and desired statistical power.
alpha = 0.05
power = 0.8

analysis = TTestIndPower()
result = analysis.solve_power(effect, power=power,
                              nobs1=None, ratio=1.0,
                              alpha=alpha)

print('Sample Size: %.3f' % result)

Sample Size: 1565.490
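
The sample size above can be cross-checked by hand. The sketch below assumes the standard definition of Cohen's h for two proportions and the normal-approximation formula n = 2 * ((z_alpha + z_power) / h)^2 for a two-sided test. It uses only the Python standard library, and the helper names `cohens_h` and `approx_sample_size` are illustrative, not part of statsmodels. Because it relies on the normal rather than the t distribution, the result should land close to the figure reported by `TTestIndPower` without matching it exactly.

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def approx_sample_size(h, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / h) ** 2


# Same baseline and target CTR as the power analysis above.
h = cohens_h(0.50, 0.55)
n_per_group = approx_sample_size(h)

print(f'Effect size (h): {h:.4f}')
print(f'Approximate sample size per group: {n_per_group:.1f}')
print(f'Rounded up to whole visitors: {math.ceil(n_per_group)}')
```

The sign of h only reflects the order of the two proportions; the sample size depends on its magnitude.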


## 3. Import data set

In [3]:
# Import the necessary libraries.
import pandas as pd
import math
import numpy as np
import statsmodels.stats.api as sms
import scipy.stats as st
import matplotlib as mpl
import matplotlib.pyplot as plt

In [4]:
# Read the data set with Pandas.
df = pd.read_csv('bike_shop.csv')

# Print the DataFrame.
print(df.shape)
df.head()

(184588, 5)


Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [5]:
# View the DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   RecordID       184588 non-null  int64 
 1   IPAddress      184588 non-null  object
 2   LoyaltyPage    184588 non-null  int64 
 3   ServerID       184588 non-null  int64 
 4   VisitPageFlag  184588 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 7.0+ MB


## 4. Clean the data

In [6]:
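# Rename the columns. Names missing from the DataFrame are ignored,
# so this has no effect if the columns already use the new names.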
df_new = df.rename(columns={'IP Address': 'IPAddress',
                            'LoggedInFlag': 'LoyaltyPage'})

# View the DataFrame.
print(df_new.shape)
df_new.head()

(184588, 5)


Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [7]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184588 entries, 0 to 184587
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   RecordID       184588 non-null  int64 
 1   IPAddress      184588 non-null  object
 2   LoyaltyPage    184588 non-null  int64 
 3   ServerID       184588 non-null  int64 
 4   VisitPageFlag  184588 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 7.0+ MB


In [9]:
# View the rows whose IP address has already appeared earlier.
df_new[df_new.IPAddress.duplicated()]

Unnamed: 0,RecordID,IPAddress,LoyaltyPage,ServerID,VisitPageFlag
275,276,191.4.97.7,0,2,0
394,395,79.9.70.7,1,3,0
703,704,175.1.81.8,1,3,0
809,810,125.0.30.9,1,2,0
889,890,207.14.157.6,1,3,0
...,...,...,...,...,...
184582,184583,90.4.224.4,0,3,0
184583,184584,114.8.104.1,0,1,0
184585,184586,170.13.31.9,0,2,0
184586,184587,195.14.92.3,0,3,0


In [10]:
# Drop every row whose IP address appears more than once
# (keep = False removes all occurrences, not only the repeats).
df_new.drop_duplicates(subset ='IPAddress',
                       keep = False,
                       inplace = True)


# Drop the columns that are not needed for the test.
df_final = df_new.drop(['RecordID', 'VisitPageFlag'],
                       axis=1)


# View the DataFrame.
print(df_final.shape)
df_final.head()

(39608, 3)


Unnamed: 0,IPAddress,LoyaltyPage,ServerID
7,97.6.126.6,0,3
12,188.13.62.2,0,3
14,234.1.239.1,0,2
15,167.15.157.7,0,2
16,123.12.229.8,0,1
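
The sharp drop in row count follows from the `keep` argument of `drop_duplicates`. The sketch below uses a small made-up DataFrame (the IP addresses and values are hypothetical, not taken from `bike_shop.csv`) to contrast `keep='first'`, which retains one row per IP address, with `keep=False`, which discards every row belonging to a repeated IP address.

```python
import pandas as pd

# Toy data for illustration only: '1.1.1.1' appears twice.
toy = pd.DataFrame({'IPAddress': ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3'],
                    'LoyaltyPage': [1, 0, 0, 1]})

# Keep the first occurrence of each IP address.
keep_first = toy.drop_duplicates(subset='IPAddress', keep='first')

# Remove all rows for any IP address that occurs more than once.
keep_none = toy.drop_duplicates(subset='IPAddress', keep=False)

print(keep_first)
print(keep_none)
```

Dropping all occurrences is the stricter choice: it avoids counting a visitor who may have been served more than one version of the page, at the cost of a much smaller data set.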


In [11]:
# Check for duplicates again.
df_final[df_final.IPAddress.duplicated()]

Unnamed: 0,IPAddress,LoyaltyPage,ServerID


In [12]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39608 entries, 7 to 184584
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   IPAddress    39608 non-null  object
 1   LoyaltyPage  39608 non-null  int64 
 2   ServerID     39608 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ MB


## 5. Subset the DataFrame

In [13]:
# Assign each visitor to a group based on ServerID:
# ServerID 1 is the treatment group;
# ServerID 2 and 3 form the control group.
df_final['Group'] = df_final['ServerID'].map({1:'Treatment',
                                              2:'Control',
                                              3:'Control'})

# View the DataFrame.
print(df_final.shape)
df_final.head()

(39608, 4)


Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
7,97.6.126.6,0,3,Control
12,188.13.62.2,0,3,Control
14,234.1.239.1,0,2,Control
15,167.15.157.7,0,2,Control
16,123.12.229.8,0,1,Treatment


In [14]:
# Count the values.
df_final['Group'].value_counts()

Control      26310
Treatment    13298
Name: Group, dtype: int64

In [15]:
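# Draw a random sample of 1,565 visitors from each group,
# the per-group sample size suggested by the power analysis.
# Any random_state can be used; fixing it makes the sample reproducible.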
c_sample = df_final[df_final['Group'] == 'Control'].sample(n = 1565,
                                                           random_state = 42) 

t_sample = df_final[df_final['Group'] == 'Treatment'].sample(n = 1565,
                                                             random_state = 42)

# View the DataFrames.
print(c_sample.head())
t_sample.head()

           IPAddress  LoyaltyPage  ServerID    Group
53313    25.16.126.2            1         3  Control
52290    106.13.67.3            1         3  Control
104046  169.11.137.7            0         2  Control
171756    164.9.86.8            1         2  Control
2317     112.12.25.7            0         2  Control


Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
173762,251.0.251.9,1,1,Treatment
150588,16.1.214.6,1,1,Treatment
72805,39.3.26.5,0,1,Treatment
112098,90.14.154.1,1,1,Treatment
32507,18.5.206.8,0,1,Treatment


## 6. Perform A/B testing

In [16]:
# Perform A/B testing.
# Concatenate the control and treatment samples into one DataFrame.
ab_test = pd.concat([c_sample, t_sample], axis = 0)

ab_test.reset_index(drop = True, inplace = True)

# View the output.
ab_test.head()

Unnamed: 0,IPAddress,LoyaltyPage,ServerID,Group
0,25.16.126.2,1,3,Control
1,106.13.67.3,1,3,Control
2,169.11.137.7,0,2,Control
3,164.9.86.8,1,2,Control
4,112.12.25.7,0,2,Control


In [17]:
# Calculate the conversion rates.
conversion_rates = ab_test.groupby('Group')['LoyaltyPage']


# Standard deviation of the proportion.
STD_p = lambda x: np.std(x, ddof = 0)

# Standard error of the proportion.
SE_p = lambda x: st.sem(x, ddof = 0)

conversion_rates = conversion_rates.agg([np.mean, STD_p, SE_p])

conversion_rates.columns = ['conversion_rate',
                            'std_deviation',
                            'std_error']

# Convert output into a Pandas DataFrame.
cr = pd.DataFrame(conversion_rates)

# Round the output to 3 decimal places.
# cr.style.format('{:.3f}')

# View output.
cr

Group,conversion_rate,std_deviation,std_error
Control,0.532268,0.498958,0.012613
Treatment,0.483067,0.499713,0.012632
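
For a 0/1 outcome such as `LoyaltyPage`, the standard deviation and standard error follow directly from the conversion rate p: the standard deviation is sqrt(p(1 - p)) and the standard error is sqrt(p(1 - p) / n). The sketch below reproduces the table above on that basis. It assumes click counts of 833 (control) and 756 (treatment) out of 1,565 each, which are back-calculated from the reported conversion rates and not read from the data; the helper names are illustrative.

```python
import math


def proportion_std(p):
    """Population standard deviation of a 0/1 variable with mean p."""
    return math.sqrt(p * (1 - p))


def proportion_se(p, n):
    """Standard error of the proportion for a sample of size n."""
    return proportion_std(p) / math.sqrt(n)


n = 1565
# Assumption: click counts inferred from the reported conversion rates.
clicks = {'Control': 833, 'Treatment': 756}

for group, k in clicks.items():
    p = k / n
    print(f'{group}: rate={p:.6f}, '
          f'std={proportion_std(p):.6f}, '
          f'se={proportion_se(p, n):.6f}')
```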


**Calculate *p*-value and confidence intervals**

In [18]:
# Import packages for the z-test (n > 30).
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Create a subset of control and treatment results by
# subsetting the data set for control and treatment.
control_results = ab_test[ab_test['Group'] == 'Control']['LoyaltyPage']
treatment_results = ab_test[ab_test['Group'] == 'Treatment']['LoyaltyPage']

# Determine the count of the control_results and treatment_result subsets.
n_con = control_results.count()
n_treat = treatment_results.count()

# Store sum in list format.
successes = [control_results.sum(), treatment_results.sum()]

# Store count in list format.
nobs = [n_con, n_treat]

# Calculate statistical values.
z_stat, pval = proportions_ztest(successes, nobs = nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes,
                                                                        nobs = nobs,
                                                                        alpha = 0.05)
# View the outputs.
print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {pval:.3f}')
print(f'Confidence Interval of 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'Confidence Interval of 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

Z test stat: 2.75
P-value: 0.006
Confidence Interval of 95% for control group: [0.508, 0.557]
Confidence Interval of 95% for treatment group: [0.458, 0.508]
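
The figures above can be traced by hand. The sketch below is a standard-library version of a pooled two-proportion z-test and a normal-approximation (Wald) confidence interval, which is what the default settings of `proportions_ztest` and `proportion_confint` are understood to compute. The function names `two_proportion_ztest` and `wald_interval` are illustrative, and the click counts (833 and 756 out of 1,565) are an assumption, back-calculated from the conversion rates reported earlier.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(successes, nobs):
    """Two-sided z-test for equal proportions using the pooled variance."""
    p1 = successes[0] / nobs[0]
    p2 = successes[1] / nobs[1]
    pooled = sum(successes) / sum(nobs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs[0] + 1 / nobs[1]))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def wald_interval(k, n, alpha=0.05):
    """Normal-approximation confidence interval for one proportion."""
    p = k / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumption: counts inferred from the reported conversion rates.
successes = [833, 756]   # control, treatment
nobs = [1565, 1565]

z_stat, pval = two_proportion_ztest(successes, nobs)
ci_control = wald_interval(successes[0], nobs[0])
ci_treatment = wald_interval(successes[1], nobs[1])

print(f'Z test stat: {z_stat:.2f}')
print(f'P-value: {pval:.3f}')
print(f'Control 95% CI: [{ci_control[0]:.3f}, {ci_control[1]:.3f}]')
print(f'Treatment 95% CI: [{ci_treatment[0]:.3f}, {ci_treatment[1]:.3f}]')
```

The z statistic is positive because the control rate is entered first and is the higher of the two.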


## 7. Summarise results and explain your answers

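The change to the home page did not increase click-through to the loyalty programme page. In the sampled data, the CTR was 53.2% for the control group and 48.3% for the treatment group, a decrease of roughly 5 percentage points under the new interface.

The *p*-value (0.006) is smaller than the alpha value of 0.05, so we reject $H_0$, the null hypothesis that the two groups have the same CTR. The difference is statistically significant, but it runs in the opposite direction to the one the company hoped for: the new design falls short of the 55% target and performs worse than the existing one.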
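
To put a size on the change summarised here, a confidence interval for the difference between the two rates can help. The sketch below is one possible approach, not part of the original activity: it computes an unpooled normal-approximation interval for treatment minus control. The click counts (833 and 756 out of 1,565) are an assumption, back-calculated from the reported conversion rates, and `diff_interval` is an illustrative name.

```python
import math
from statistics import NormalDist


def diff_interval(k_treat, n_treat, k_con, n_con, alpha=0.05):
    """Wald interval for the difference in proportions (treatment - control)."""
    p_treat = k_treat / n_treat
    p_con = k_con / n_con
    # Unpooled standard error: each group contributes its own variance.
    se = math.sqrt(p_treat * (1 - p_treat) / n_treat
                   + p_con * (1 - p_con) / n_con)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_treat - p_con
    return diff, diff - z_crit * se, diff + z_crit * se


# Assumption: counts inferred from the reported conversion rates.
diff, lower, upper = diff_interval(756, 1565, 833, 1565)

print(f'Difference in CTR (treatment - control): {diff:.3f}')
print(f'95% confidence interval: [{lower:.3f}, {upper:.3f}]')
```

An interval that lies entirely below zero is consistent with rejecting $H_0$ and with the new interface lowering the CTR.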