# A/B testing landing page designs  

In [1]:
import pandas as pd
import numpy as np 
import math
import scipy.stats  # explicit submodule import; scipy.stats.fisher_exact is used below
import statsmodels.stats.api as sms
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

## Example description    

This example is based on the [Kaggle A/B testing dataset by Luyuan Zhang](https://www.kaggle.com/datasets/zhangluyuan/ab-testing?select=ab_data.csv).  
The dataset contains A/B test data for an experiment that compared an old vs new landing page design. The response variable is whether a user converted.  

## Experiment design  

We have an old and a new landing page design, and we want to know whether switching to the new design has a significant impact on the performance of our landing page.  

To measure this impact, we design an A/B test with the following setup:  
- Our **control** exposure is the old landing page design, and our **treatment** exposure is the new design.  
- We will randomly assign users to the treatment and control groups.  
- To control for any unintentional effects from users seeing the same design twice or accidentally seeing both designs, we are only interested in the first landing page visits in this experiment.  
- Our **response variable** is user conversion, a binary variable encoded as 0 = user did not convert and 1 = user converted.  
- Our **Null hypothesis**: The conversion rate is the same for the old and new landing page designs. 
  - In other words, the new design didn't significantly improve or worsen the performance of the landing page.  
- Our **Alternative hypothesis**: The new design's conversion rate is significantly different from that of the old design.  
  - In other words, we observe in the data significant evidence that the new design either improves or worsens the performance of the landing page.  
- We would like to have 95% confidence in our A/B test result, so we set our **level of statistical significance** (alpha) to 5%.  
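
Stated formally, with $p_{\text{old}}$ and $p_{\text{new}}$ denoting the true conversion rates of the old and new designs (notation introduced here for illustration only), the two-sided test above can be written as:

```latex
H_0 : p_{\text{new}} = p_{\text{old}}
\qquad \text{vs.} \qquad
H_1 : p_{\text{new}} \neq p_{\text{old}},
\qquad \alpha = 0.05
```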

In [2]:
ALPHA = 0.05

## Calculating sample size  

In practice, we usually need to select a sample size for the A/B test that is large enough to produce meaningful results and small enough to be feasible and cost effective. The bigger our data sample, the more precise our estimates will be.  

We will calculate the minimum sample size needed for our A/B test. To do this, we need to pick the **effect size**, which is the normalized difference between the estimators for the two samples (the mean conversion rates, in this case).  

Let's say we want to be able to detect a 1 percentage point difference between the old and new design conversion rates. In other words, if the average conversion rate for our old design is 12%, we want our experiment to be precise enough to distinguish an increase to 13% or a decrease to 11% in conversion rate for the new design.

In [3]:
MIN_CONVERSION_RATE_MEANS_DIFF = 0.01

According to [Hubspot](https://blog.hubspot.com/marketing/landing-page-stats), the average landing page conversion rate is about 10%. Let's use this rate as our assumed conversion rate for the control (old) design.

In [4]:
CONTROL_CONVERSION_RATE = 0.1

We can use these numbers to estimate our effect size.

In [5]:
# Cohen's h: difference between the arcsine-transformed proportions (first minus second)
ES = sms.proportion_effectsize(
    CONTROL_CONVERSION_RATE + MIN_CONVERSION_RATE_MEANS_DIFF, 
    CONTROL_CONVERSION_RATE) 
print(f'Effect size: {ES:.4f}')

Effect size: 0.0326
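
To make the effect size less of a black box, here is a minimal sketch that recomputes it by hand with the arcsine formula `ES = 2 * (arcsin(sqrt(prop1)) - arcsin(sqrt(prop2)))`. The helper name `cohens_h` is chosen for this illustration and is not part of statsmodels.

```python
import math


def cohens_h(prop1: float, prop2: float) -> float:
    """Difference between two arcsine-transformed proportions (Cohen's h)."""
    return 2 * (math.asin(math.sqrt(prop1)) - math.asin(math.sqrt(prop2)))


control_rate = 0.10    # assumed baseline, same value as CONTROL_CONVERSION_RATE
treatment_rate = 0.11  # baseline plus the 1 percentage point minimum difference

h = cohens_h(treatment_rate, control_rate)
print(f"Effect size computed by hand: {h:.4f}")
```

The result should agree with the value returned by `sms.proportion_effectsize` above.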


We are now ready to compute the minimum sample size needed for each treatment group in our experiment.

In [6]:
sample_size = math.ceil(
    sms.TTestIndPower().solve_power(
        effect_size=ES, 
        power=0.8, # standard setting
        alpha=ALPHA
    )
)
print(f'Minimum sample size: {sample_size:,} observations.')

Minimum sample size: 14,746 observations.
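
As a rough cross-check, the per-group sample size can also be approximated in closed form with the normal approximation `n = 2 * (z_(1 - alpha/2) + z_power)^2 / ES^2`. This is a sketch under that approximation, not the exact computation performed by `TTestIndPower` (which is based on the t distribution), so the result is expected to be close to, but not necessarily identical to, the number above. The function name `approx_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided two-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Effect size for detecting 11% vs. 10%, computed with the arcsine formula
effect_size = 2 * (math.asin(math.sqrt(0.11)) - math.asin(math.sqrt(0.10)))

n_approx = approx_sample_size(effect_size)
print(f"Approximate minimum sample size per group: {n_approx:,}")
```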


## Collect data  

Download the `ab_data.csv` dataset here: [Kaggle A/B testing dataset by Luyuan Zhang](https://www.kaggle.com/datasets/zhangluyuan/ab-testing?select=ab_data.csv).  

In [7]:
df = pd.read_csv('../data/ab_data.csv')

In [8]:
df.head()

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


Ensuring that the data is in the right format:

In [10]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [11]:
df['user_id'] = df['user_id'].astype('str')

### Descriptive statistics of raw data

In [12]:
# Note: `datetime_is_numeric` was removed in pandas 2.0; drop the argument on newer versions
df.describe(include='all', datetime_is_numeric=True)

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| count | 294478.0 | 294478 | 294478 | 294478 | 294478.0 |
| unique | 290584.0 |  | 2 | 2 |  |
| top | 805339.0 |  | treatment | old_page |  |
| freq | 2.0 |  | 147276 | 147239 |  |
| mean |  | 2017-01-13 13:40:10.474213376 |  |  | 0.119659 |
| min |  | 2017-01-02 13:42:05.378582 |  |  | 0.0 |
| 25% |  | 2017-01-08 02:06:48.649925120 |  |  | 0.0 |
| 50% |  | 2017-01-13 13:21:07.016475904 |  |  | 0.0 |
| 75% |  | 2017-01-19 01:43:51.611873792 |  |  | 0.0 |
| max |  | 2017-01-24 13:41:54.460509 |  |  | 1.0 |


In [13]:
df.groupby('landing_page')['group'].value_counts(dropna=False)

landing_page  group    
new_page      treatment    145311
              control        1928
old_page      control      145274
              treatment      1965
Name: group, dtype: int64

In [14]:
df.groupby(['landing_page', 'group'])['converted'].value_counts(dropna=False)

landing_page  group      converted
new_page      control    0              1694
                         1               234
              treatment  0            128047
                         1             17264
old_page      control    0            127785
                         1             17489
              treatment  0              1715
                         1               250
Name: converted, dtype: int64

In [15]:
df.duplicated('user_id').value_counts()

False    290584
True       3894
dtype: int64

In [16]:
df.duplicated('user_id').value_counts(normalize=True)

False    0.986777
True     0.013223
dtype: float64

**The descriptive statistics above tell us the following:**  
- The A/B test experiment ran in January 2017 and spanned 23 calendar days (January 2 to January 24).  
- Most users visited the landing page only once (98.7%), and a small subset of users (3,894 or 1.3%) visited the landing page twice during the A/B test experiment.  
- The old landing page design is most likely the intended control exposure in the experiment, and the new landing page is the treatment. However, the data contains control group users who were shown the new landing page and treatment group users who were shown the old landing page. These things can happen when an experiment is carried out, which is why exploratory data analysis (EDA) and data cleaning are crucial steps in any data analysis.  
- The average conversion rate in the raw dataset is about 12%. The corresponding standard deviation of about 32% looks large, but for a binary variable it is fully determined by the mean and carries no additional information.  
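
For a 0/1 variable, the standard deviation follows directly from the mean as `sqrt(p * (1 - p))`. The short calculation below illustrates this using the raw-data mean of `converted` from the table above; it is a worked check, not an additional analysis step.

```python
import math

p = 0.119659  # mean of `converted` in the raw dataset, taken from the table above

# Standard deviation of a Bernoulli (0/1) variable with success probability p
implied_std = math.sqrt(p * (1 - p))
print(f"Implied standard deviation: {implied_std:.3f}")
```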

## Data cleaning  

Based on the quick EDA above, we need to take the following data cleaning steps:  
- Remove the 2nd visit data rows for users who visited the landing page twice during the experiment. We want to keep their first visit data, since that is a valid exposure in our experiment.  
- Remove the observations where the control group user was shown the new page design or the treatment group user was shown the old page design. 
  - It is entirely possible that we could simply reassign all users who were shown the old page design to the control group and the rest to the treatment group. We could do that with confidence if we knew for a fact that no confounding factors played a role in the misassignment of users to treatments. Since we do not know that, it is safer to only keep the consistently assigned data for this analysis.  
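
The two cleaning steps can be sketched on a tiny made-up table before applying them to the real data. The rows below are hypothetical and exist only to show the mechanics: user `1` visits twice, and user `3` is a control user who was shown the new page.

```python
import pandas as pd

# Hypothetical toy data (not taken from ab_data.csv)
toy = pd.DataFrame({
    "user_id": ["1", "2", "1", "3", "4"],
    "timestamp": pd.to_datetime([
        "2017-01-05", "2017-01-03", "2017-01-02", "2017-01-04", "2017-01-06",
    ]),
    "group": ["control", "treatment", "control", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
    "converted": [1, 0, 0, 1, 1],
})

# Step 1: sort by time and keep only each user's earliest visit
first_visits = toy.sort_values("timestamp").drop_duplicates(subset="user_id", keep="first")

# Step 2: keep only rows where the assigned group matches the page shown
is_control = (first_visits["group"] == "control") & (first_visits["landing_page"] == "old_page")
is_treatment = (first_visits["group"] == "treatment") & (first_visits["landing_page"] == "new_page")
cleaned = first_visits[is_control | is_treatment].reset_index(drop=True)

print(cleaned)
```

Sorting before `drop_duplicates(keep="first")` matters: without it, "first" would mean first in file order rather than first in time.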

#### Sort by time

In [17]:
df.sort_values('timestamp', inplace=True)

#### Drop the 2nd visit data rows

First, let's take a look at what the duplicates look like.

In [18]:
df[df.duplicated(subset='user_id', keep=False)].sort_values(['user_id', 'timestamp'])

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 213114 | 630052 | 2017-01-07 12:25:54.089486 | treatment | old_page | 1 |
| 230259 | 630052 | 2017-01-17 01:16:05.208766 | treatment | new_page | 0 |
| 22513 | 630126 | 2017-01-14 13:35:54.778695 | treatment | old_page | 0 |
| 251762 | 630126 | 2017-01-19 17:16:00.280440 | treatment | new_page | 0 |
| 183371 | 630137 | 2017-01-20 02:08:49.893878 | control | old_page | 0 |
| ... | ... | ... | ... | ... | ... |
| 99479 | 945703 | 2017-01-18 06:39:31.294688 | control | old_page | 0 |
| 40370 | 945797 | 2017-01-11 03:04:49.433736 | control | new_page | 1 |
| 186960 | 945797 | 2017-01-13 17:23:21.750962 | control | old_page | 0 |
| 165143 | 945971 | 2017-01-16 10:09:18.383183 | control | old_page | 0 |


In [19]:
len(df)

294478

In [20]:
df.drop_duplicates(subset='user_id', keep='first', inplace=True)

In [21]:
len(df)

290584

#### Clean up control/treatment misassignments

In [22]:
control_group_filter = (df['landing_page']=='old_page') & (df['group']=='control')
treatment_group_filter = (df['landing_page']=='new_page') & (df['group']=='treatment')

df = df[control_group_filter | treatment_group_filter]

In [23]:
df.groupby(['landing_page', 'group'])['converted'].value_counts(dropna=False)

landing_page  group      converted
new_page      treatment  0            127186
                         1             17130
old_page      control    0            126944
                         1             17375
Name: converted, dtype: int64

In [24]:
len(df)

288635

### Descriptive stats of the cleaned up dataset

In [25]:
# Note: `datetime_is_numeric` was removed in pandas 2.0; drop the argument on newer versions
df.describe(include='all', datetime_is_numeric=True)

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| count | 288635.0 | 288635 | 288635 | 288635 | 288635.0 |
| unique | 288635.0 |  | 2 | 2 |  |
| top | 922696.0 |  | control | old_page |  |
| freq | 1.0 |  | 144319 | 144319 |  |
| mean |  | 2017-01-13 13:05:20.392675072 |  |  | 0.119545 |
| min |  | 2017-01-02 13:42:05.378582 |  |  | 0.0 |
| 25% |  | 2017-01-08 01:25:10.110609920 |  |  | 0.0 |
| 50% |  | 2017-01-13 12:24:48.310561024 |  |  | 0.0 |
| 75% |  | 2017-01-19 01:11:56.032314368 |  |  | 0.0 |
| max |  | 2017-01-24 13:41:54.460509 |  |  | 1.0 |


## Sampling

In [26]:
print(f"Total observations in control group: {len(df[df['group'] == 'control']):,}.")

Total observations in control group: 144,319.


In [27]:
print(f"Total observations in treatment group: {len(df[df['group'] == 'treatment']):,}.")

Total observations in treatment group: 144,316.


In [28]:
print(f'Our minimum sample size: {sample_size:,}')

Our minimum sample size: 14,746


Before sampling, it is a good idea to check how many observations are already available or easily feasible to obtain.  
In this example, the cleaned dataset contains nearly equal numbers of observations for the control and treatment groups, and the size of each subset is nearly 10 times our minimum sample size. This is great: not only do we have enough data to test for our selected effect size, but we can further improve our precision by taking the maximum sample available to us.   

How small of an effect size can we detect by doing so? Let's take a look.  

In [29]:
N = 144_316 # the new sample size

In [30]:
ES_N = sms.TTestIndPower().solve_power(
        nobs1=N, 
        power=0.8, # standard setting
        alpha=ALPHA
    )
ES_N

0.010429579633782445

In the `sms.proportion_effectsize` function, the effect size for the `normal` method is defined as:

       ES = 2 * (arcsin(sqrt(prop1)) - arcsin(sqrt(prop2)))  

We want to solve this for prop1, which in our case is the detectable treatment conversion rate:   

       prop1 = sin(ES/2 + arcsin(sqrt(prop2)))**2  
       
The effect size we'll use here is `ES_N`, and prop2 value is the `CONTROL_CONVERSION_RATE`.

In [31]:
prop1 = np.sin(ES_N/2 + np.arcsin(np.sqrt(CONTROL_CONVERSION_RATE)))**2
prop1

0.10315057219504112

In [32]:
prop1 - CONTROL_CONVERSION_RATE

0.0031505721950411125
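
As a sanity check on the algebra, the inversion can be round-tripped: converting the effect size to a detectable rate and back should recover the original effect size. The sketch below uses only the standard library; the helper names `cohens_h` and `detectable_rate` are introduced here for illustration.

```python
import math


def cohens_h(prop1: float, prop2: float) -> float:
    """Effect size: ES = 2 * (arcsin(sqrt(prop1)) - arcsin(sqrt(prop2)))."""
    return 2 * (math.asin(math.sqrt(prop1)) - math.asin(math.sqrt(prop2)))


def detectable_rate(effect_size: float, baseline: float) -> float:
    """Inverse of cohens_h in prop1: sin(ES/2 + arcsin(sqrt(prop2)))**2."""
    return math.sin(effect_size / 2 + math.asin(math.sqrt(baseline))) ** 2


es_n = 0.010429579633782445  # value of ES_N printed above
baseline = 0.10              # same value as CONTROL_CONVERSION_RATE

rate = detectable_rate(es_n, baseline)
round_trip = cohens_h(rate, baseline)

print(f"Detectable treatment rate: {rate:.5f}")
print(f"Round-trip error: {abs(round_trip - es_n):.2e}")
```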

This means that if we use our maximum available sample size `N` (=144,316), our A/B test will be precise enough to detect an increase in the conversion rate of just 0.32 of a percentage point, compared to our initial goal of 1 percentage point.  
This may seem like a small difference, but if the user base is large enough, it could represent thousands of users who would convert with the new design.  

We will take advantage of the maximum available sample size and proceed with an approximate sample size N (the control group size in the cleaned dataset has 3 extra observations, which is close enough).  

## Experiment conversion rates by treatment group

In [33]:
df.groupby('group')['converted'].describe()

| group | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| control | 144319.0 | 0.120393 | 0.325422 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| treatment | 144316.0 | 0.118698 | 0.323434 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


It looks like the treatment, i.e. the new design, is performing very slightly worse than the old design. This tiny difference could be due to chance, so we need to test whether it is statistically significant.  
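
To put a number on "very slightly worse", the observed gap can be expressed both in percentage points and relative to the control rate. This is a small worked calculation using the counts from the cleaned dataset; it describes the sample only and says nothing yet about significance.

```python
# Conversions and group sizes from the cleaned dataset
control_conversions, control_n = 17_375, 144_319
treatment_conversions, treatment_n = 17_130, 144_316

control_rate = control_conversions / control_n
treatment_rate = treatment_conversions / treatment_n

abs_diff = treatment_rate - control_rate  # difference in conversion rate
rel_diff = abs_diff / control_rate        # change relative to the control rate

print(f"Absolute difference: {abs_diff * 100:+.2f} percentage points")
print(f"Relative difference: {rel_diff:+.2%}")
```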

## Fisher's exact test

#### Contingency table

In [34]:
pd.DataFrame(df.groupby('group')['converted'].value_counts())

| group | converted | converted (count) |
|---|---|---|
| control | 0 | 126944 |
| control | 1 | 17375 |
| treatment | 0 | 127186 |
| treatment | 1 | 17130 |


In [35]:
contingency_table = pd.DataFrame(df.groupby('group')['converted'].value_counts())
contingency_table.rename(columns={'converted': 'obs_count'}, inplace=True)
contingency_table.reset_index(inplace=True)
contingency_table = contingency_table.pivot(
    columns='converted',
    index='group',
    values='obs_count'
)

contingency_table.columns = ['did not convert', 'converted']

In [36]:
contingency_table

| group | did not convert | converted |
|---|---|---|
| control | 126944 | 17375 |
| treatment | 127186 | 17130 |


In [37]:
contingency_table.to_numpy()

array([[126944,  17375],
       [127186,  17130]])

In [38]:
fisher_exact_result = scipy.stats.fisher_exact(
    contingency_table.to_numpy(), 
    alternative='two-sided'
)
fisher_exact_result

SignificanceResult(statistic=0.9840233852262144, pvalue=0.161545015393662)

The resulting Fisher's exact test p-value (about 0.16) is well above our chosen alpha=0.05, so this test gives us no grounds to reject the null hypothesis.
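
The `statistic` value in the result is the sample odds ratio of the 2x2 table. A quick by-hand computation, shown here only as an illustration, makes it easier to interpret: a value just below 1 reflects the slightly lower conversion odds in the treatment row.

```python
# Contingency table: rows = [control, treatment], columns = [did not convert, converted]
table = [
    [126_944, 17_375],
    [127_186, 17_130],
]

(a, b), (c, d) = table

# Sample odds ratio of a 2x2 table: (a * d) / (b * c)
odds_ratio = (a * d) / (b * c)
print(f"Odds ratio: {odds_ratio:.6f}")
```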

## Two-sample z-test for proportions  

Given our very large sample size, we could also leverage normal approximation and test our hypothesis using a simple two-sample z-test for proportions.  
We expect the p-value for this test to be roughly consistent with the one from Fisher's exact test. This approach also provides easy-to-interpret confidence intervals.

In [39]:
control_converted = df[df['group'] == 'control']['converted']
treatment_converted = df[df['group'] == 'treatment']['converted']

In [40]:
counts = [
    control_converted.sum(), 
    treatment_converted.sum()
]
nobs = [
    control_converted.count(), 
    treatment_converted.count()
]

prop_ztest_result = proportions_ztest(counts, nobs)
zstat, pval = prop_ztest_result

print(f'Z-statistic: {zstat:.3f}')
print(f'p-value: {pval:.3f}')

Z-statistic: 1.404
p-value: 0.160
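
For readers who want to see what the test computes, here is a minimal by-hand sketch of a two-sample z-test for proportions with a pooled standard error, using only the standard library. It is meant to illustrate the formula, assuming the pooled-variance form of the test, and is not a replacement for `proportions_ztest`.

```python
import math
from statistics import NormalDist

# Conversions and group sizes, ordered as [control, treatment]
counts = [17_375, 17_130]
nobs = [144_319, 144_316]

p_control = counts[0] / nobs[0]
p_treatment = counts[1] / nobs[1]

# Under the null hypothesis both groups share one rate, so pool the data
p_pooled = sum(counts) / sum(nobs)
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))

z = (p_control - p_treatment) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(f"Z-statistic: {z:.3f}")
print(f"p-value: {p_value:.3f}")
```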


#### Confidence intervals

In [41]:
confidence_intervals_bounds = proportion_confint(counts, nobs, alpha=ALPHA)
lower_CI_bounds = confidence_intervals_bounds[0]
upper_CI_bounds = confidence_intervals_bounds[1]

print('Control group 95% confidence interval: '
      + f'[{lower_CI_bounds[0]:.4f}, {upper_CI_bounds[0]:.4f}]'
     )
print('Treatment group 95% confidence interval: '
      + f'[{lower_CI_bounds[1]:.4f}, {upper_CI_bounds[1]:.4f}]'
     )

Control group 95% confidence interval: [0.1187, 0.1221]
Treatment group 95% confidence interval: [0.1170, 0.1204]
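
The per-group intervals above describe each conversion rate separately. A more direct view of the comparison is a confidence interval for the difference between the two rates. The sketch below uses a simple Wald interval with an unpooled standard error, which is one common choice among several and is an assumption of this example.

```python
import math
from statistics import NormalDist

counts = [17_375, 17_130]    # conversions: [control, treatment]
nobs = [144_319, 144_316]    # group sizes: [control, treatment]
alpha = 0.05

p_control = counts[0] / nobs[0]
p_treatment = counts[1] / nobs[1]
diff = p_control - p_treatment

# Unpooled standard error of the difference between two independent proportions
std_err = math.sqrt(
    p_control * (1 - p_control) / nobs[0]
    + p_treatment * (1 - p_treatment) / nobs[1]
)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

lower = diff - z_crit * std_err
upper = diff + z_crit * std_err
print(f"95% CI for the difference (control - treatment): [{lower:.5f}, {upper:.5f}]")
```

If the interval contains zero, it agrees with a two-sided test that does not reject the null hypothesis at the same alpha.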


## Conclusion  

Fisher's exact test p-value=0.16 is much higher than our chosen alpha=0.05. Therefore, we cannot reject the null hypothesis. **This means that there is no significant difference in terms of conversion rates between the new and old landing page designs in our data sample.**  

The two-sample z-test for proportions also yielded p-value=0.16, once again supporting that we cannot reject the null hypothesis. The confidence intervals for the control and treatment groups overlap substantially, which is consistent with the two conversion rates being very similar, although overlap between per-group intervals is only an informal check and not a test in itself.  

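**Since we cannot reject the null hypothesis, we have no evidence that the new landing page design improves or worsens the conversion rate of the landing page.** Note that failing to reject the null hypothesis does not prove that the two designs perform identically; it means that any true difference is too small for this experiment to distinguish from chance.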