<a href="https://colab.research.google.com/github/KelvinLam05/A-B-Testing/blob/main/A_B_Testing_with_Landing_Pages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

In this notebook, we will be working to understand the results of an A/B test run by an e-commerce website. The company has developed a new web page in order to try and increase the number of users who "convert," meaning the number of users who decide to pay for the company's product. Our goal is to work through this notebook to help the company understand if they should implement this new page, keep the old page, or perhaps run the experiment longer to make their decision.

**Attribute information**

* user_id : The user ID of each session

* timestamp : Timestamp for the session

* group : Which group the user was assigned to for that session {control, treatment}

* landing_page : Which design each user saw on that session {old_page, new_page}

* converted : Whether the session ended in a conversion or not (binary, 0 = not converted, 1 = converted)


**Scenario**

To make it a bit more realistic, here’s a potential scenario for our study:

*Let’s imagine we work on the product team at a medium-sized online e-commerce business. The UX designer worked really hard on a new version of the product page, with the hope that it will lead to a higher conversion rate. The product manager (PM) told us that the current conversion rate is about 12% on average throughout the year, and that the team would be happy with an increase of 0.35 percentage points, meaning that the new design will be considered a success if it raises the conversion rate to 12.35%.*

**Formulating a hypothesis**

First things first, we want to make sure we formulate a hypothesis at the start of our project. This will make sure our interpretation of the results is correct as well as rigorous.

$ H_{0}: p_{old} ≥ p_{new} $

$ H_{1}: p_{old} < p_{new} $

where $ p_{old} $ and $ p_{new} $ stand for the conversion rate of the old and new design, respectively. We’ll also set a confidence level of 95%:

$ α $ = 0.05 

The $ α $ value is a threshold we set in advance: if the p-value (the probability of observing a result at least as extreme as ours, assuming the null hypothesis is true) is lower than $ α $, we reject the null hypothesis. Since our $ α $ = 0.05 (a 5% probability), our confidence level (1 − $ α $) is 95%.


**Choosing the variables**

For our test we’ll need two groups:

* A control group - They'll be shown the old design

* A treatment (or experimental) group - They'll be shown the new design

This will be our *Independent Variable*. The reason we have two groups even though we know the baseline conversion rate is that we want to control for other variables that could have an effect on our results, such as seasonality: by having a control group we can directly compare their results to the treatment group, because the only systematic difference between the groups is the design of the product page, and we can therefore attribute any differences in results to the designs.

For our *Dependent Variable* (i.e. what we are trying to measure), we are interested in capturing the conversion rate. One way to code this is to record each user session with a binary variable:

* 0 - The user did not buy the product during this user session

* 1 - The user bought the product during this user session
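With this coding, the conversion rate of a group is simply the mean of its binary outcomes. A minimal sketch, using a made-up list of sessions (the `sessions` values are hypothetical and not taken from the dataset):

```python
# Hypothetical outcomes for eight sessions: 1 = bought, 0 = did not buy
sessions = [0, 1, 0, 0, 0, 1, 0, 0]

# The conversion rate is the share of sessions that ended in a purchase,
# which equals the mean of the 0/1 values
conversion_rate = sum(sessions) / len(sessions)
print(conversion_rate)
```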

**Collecting the data**

In [209]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [210]:
# Load dataset
df = pd.read_csv('/content/ab_data.csv')

In [211]:
# Examine the data
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [212]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


**Preparing the data**

In [213]:
df['group'].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [214]:
pd.crosstab(df['group'], df['landing_page'])

group,new_page,old_page
control,1928,145274
treatment,145311,1965


For the rows where the group and the landing page are misaligned (treatment with old_page, or control with new_page), we cannot be sure which page the user actually saw, so we drop these rows.
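The alignment check can also be expressed as a single boolean mask. The sketch below runs on a small made-up frame (the `toy` data and the names `is_treatment`, `is_new_page`, and `aligned` are hypothetical, not part of the notebook):

```python
import pandas as pd

# Hypothetical six-row sample with two misaligned rows (index 1 and 3)
toy = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment', 'control', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'old_page', 'new_page'],
})

is_treatment = toy['group'] == 'treatment'
is_new_page = toy['landing_page'] == 'new_page'

# A row is aligned when both flags agree:
# treatment with new_page, or control with old_page
aligned = is_treatment == is_new_page

toy_clean = toy[aligned]
n_mismatched = int((~aligned).sum())
print(len(toy_clean), n_mismatched)
```

Combining the conditions into one mask avoids indexing a filtered frame with a mask built on the unfiltered one.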

In [215]:
# Keep only control-group rows that were served the old page
df_old = df[(df['group'] == 'control') & (df['landing_page'] == 'old_page')]

In [216]:
# Keep only treatment-group rows that were served the new page
df_new = df[(df['group'] == 'treatment') & (df['landing_page'] == 'new_page')]

In [217]:
# Create a new dataset 
df = pd.concat([df_old, df_new])

In [218]:
df.tail()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
294462,677163,2017-01-03 19:41:51.902148,treatment,new_page,0
294465,925675,2017-01-07 20:38:26.346410,treatment,new_page,0
294468,643562,2017-01-02 19:20:05.460595,treatment,new_page,0
294472,822004,2017-01-04 03:36:46.071379,treatment,new_page,0
294477,715931,2017-01-16 12:40:24.467417,treatment,new_page,0


In [219]:
# Number of unique users
df['user_id'].nunique()

290584

In [220]:
len(df.index)

290585

There is one user_id repeated in df.

In [221]:
duplicates = df[df.duplicated(['user_id'], keep = False)]

In [222]:
duplicates.sort_values(['user_id'], ascending = False) 

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


In [223]:
# Remove one of the rows with a duplicate user_id
df.drop_duplicates(subset = 'user_id', keep = 'first', inplace = True)

In [224]:
df['user_id'].nunique()

290584

In [225]:
len(df.index)

290584

In [226]:
import scipy.stats as stats

In [227]:
# What is the probability of an individual converting regardless of the page they receive?
df['converted'].mean()

0.11959708724499628

In [228]:
# Given that an individual was in the control group, what is the probability they converted?
df[df['group'] == 'control']['converted'].mean()

0.1203863045004612

In [229]:
# Given that an individual was in the treatment group, what is the probability they converted?
df[df['group'] == 'treatment']['converted'].mean()

0.11880806551510564

There is not yet sufficient evidence to say that one page leads to more conversions. The conversion rate of the treatment group is approximately 11.9%, while that of the control group is approximately 12.0%. The two rates are very close, so from the raw numbers alone we cannot conclude that the new page has an effect in either direction. To decide, we need a formal hypothesis test.

**Choosing a sample size**

Determining the required sample size is one of the cornerstones of a successful A/B test. It depends on three factors:

* **Statistical power** (usually 1 − $ β $ = 0.8): The ability of the experiment to correctly identify a positive change, given that there is indeed one.

* **Significance level** ($ α $ = 0.05): The probability of wrongly identifying a positive change when there is actually none.

* **Minimum detectable effect** (MDE, or $ D_{min} $): The minimum change that the business would like to detect in this test.

An A/B test can go wrong when the sample size is not calculated beforehand, or when we 'peek' at the results while the number of observations is still below the minimum sample size and conclude prematurely because the result looks positive. If a running A/B test cannot attract enough users to reach the required sample size, the three factors above would have to be adjusted to lower the minimum sample size.
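To see why peeking is risky, the sketch below simulates A/A tests (both groups share the same true rate, so every rejection is a false positive) and compares testing once at the end with stopping at the first significant interim look. Everything here is an illustrative assumption: the function names, the number of looks, the group size, and the seed are ours, and the pooled z-test is a standard-library stand-in for the test used later in this notebook.

```python
import math
import random
from statistics import NormalDist


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test p-value for H1: p_a < p_b."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, nothing to reject
    z = (conv_a / n_a - conv_b / n_b) / se
    return NormalDist().cdf(z)


def simulate(n_sims=300, n_per_group=2000, looks=10, p=0.12, alpha=0.05, seed=0):
    """Return (false-positive rate testing once, false-positive rate when peeking)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        rejected_at_any_look = False
        for look in range(1, looks + 1):
            # Both groups draw from the same true rate p (an A/A test)
            conv_a += sum(rng.random() < p for _ in range(step))
            conv_b += sum(rng.random() < p for _ in range(step))
            n = look * step
            significant = one_sided_p_value(conv_a, n, conv_b, n) < alpha
            rejected_at_any_look = rejected_at_any_look or significant
        fixed_hits += significant            # decision at the final look only
        peeking_hits += rejected_at_any_look  # decision at the first significant look
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"single final test: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the single-test rate; in simulations of this kind it is typically well above the nominal $ α $.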

One evaluation metric we want to define is the increase in conversion rate. For this metric, we define $ D_{min} $: the minimum change that is practically significant to the business. In this case, it is 0.35 percentage points, which follows directly from the business objective of raising the current conversion rate from 12% to 12.35%. Therefore, we'll set $ D_{min} $ = 0.0035.



This online calculator does a great job of calculating sample size: https://www.evanmiller.org/ab-testing/sample-size.html

However, rather than relying on the calculator, we're going to compute the required sample size ourselves, following the same approach as Evan Miller's calculator:
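Written out, the calculation implemented by the functions in the next cells is the following, where $ p $ is the baseline conversion rate, $ d $ is the minimum detectable effect, and $ z_{q} $ is the $ q $-quantile of the standard normal distribution (the symbols are our own notation for the code's arguments):

$$ n = \frac{\left( z_{1-α/2} \sqrt{2p(1-p)} + z_{1-β} \sqrt{p(1-p) + (p+d)(1-(p+d))} \right)^{2}}{d^{2}} $$

Here $ n $ is the required number of sessions per group.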


In [230]:
import scipy.stats as ss
import math

In [231]:
# Function for getting the z-score (standard normal quantile) for a given cumulative probability
def get_z_score(alpha):
    
    return ss.norm.ppf(alpha)

In [232]:
# Calculating the minimum sample size for the a/b test
def get_sample_size(sds, alpha, beta, d):
    
    n = pow((get_z_score(1 - alpha / 2) * sds[0] + get_z_score(1 - beta) * sds[1]), 2) / pow(d, 2)
    
    return n

In [233]:
# Baseline + Expected change standard deviation calculations
def get_sds(p, d):
    
    sd1 = math.sqrt(2 * p * (1 - p))
    sd2 = math.sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))
    sds = [sd1, sd2]
    
    return sds

In [234]:
# Using Evan Miller's Calculator but deriving the values ourselves
round(get_sample_size(get_sds(0.12, 0.0035), 0.05, 0.2, 0.0035))

135830

Following the approach of Evan Miller's calculator, the minimum sample size per variation is 135,830. Given that we have 2 variations (control and treatment), the total minimum sample size is 271,660. Our cleaned dataset has 290,584 sessions in total (145,274 in control and 145,310 in treatment), so each group exceeds the minimum and the test has adequate power to detect the minimum effect at the chosen significance level.
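As a sanity check, the sample-size formula can be inverted to give the power achieved at a given group size. The sketch below uses only the standard library (`statistics.NormalDist` in place of `scipy.stats.norm`); the helper name `achieved_power` is ours, and 145,274 is the size of the smaller (control) group after cleaning.

```python
import math
from statistics import NormalDist

norm = NormalDist()


def achieved_power(n, p, d, alpha):
    """Power to detect an absolute lift d over baseline p with n sessions per group."""
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Solve the sample-size formula for the power quantile z_{1 - beta}
    z_beta = (d * math.sqrt(n) - z_alpha * sd_null) / sd_alt
    return norm.cdf(z_beta)


power_at_minimum = achieved_power(135830, 0.12, 0.0035, 0.05)
power_at_actual = achieved_power(145274, 0.12, 0.0035, 0.05)
print(round(power_at_minimum, 3), round(power_at_actual, 3))
```

Under these assumptions, the power at the minimum sample size comes back at roughly 0.8 by construction, and the actual group size gives somewhat more.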

**Testing the hypothesis**

We can use an existing package, statsmodels, to run a z-test for two proportions and obtain the test statistic and p-value.

In [235]:
from statsmodels.stats.proportion import proportions_ztest

In [236]:
# The number of successes (conversions) in each group
converted_old = len(df[(df.landing_page == 'old_page') & (df.converted == 1)])

In [237]:
converted_new = len(df[(df.landing_page == 'new_page') & (df.converted == 1)])

In [238]:
# The number of trials or observations
n_old = len(df[df.landing_page == 'old_page'])

In [239]:
n_new = len(df[df.landing_page == 'new_page'])

In [240]:
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest

In [241]:
# Conducting the z-test; with counts ordered [old, new], alternative = 'smaller' tests H1: p_old < p_new
stat, p_val = sm.stats.proportions_ztest([converted_old, converted_new], [n_old, n_new], alternative = 'smaller')

In [242]:
# Test statistic for the z-test
stat

1.3109241984234394

In [243]:
# P-value for the z-test
p_val

0.9050583127590245
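As a cross-check, the same statistic can be reproduced by hand with a pooled two-proportion z-test. This is a sketch under stated assumptions: the group sizes and conversion rates are copied from the outputs earlier in this notebook, the conversion counts are reconstructed by rounding rate times size, and the pooled-variance formula is assumed to match the default behaviour of `proportions_ztest`.

```python
import math
from statistics import NormalDist

# Group sizes and conversion rates as reported earlier in this notebook
n_old, n_new = 145274, 145310
rate_old, rate_new = 0.1203863045004612, 0.11880806551510564

# Reconstructed conversion counts (assumption: rate * size, rounded)
converted_old = round(rate_old * n_old)
converted_new = round(rate_new * n_new)

# Pooled conversion rate under the null hypothesis of equal rates
pooled = (converted_old + converted_new) / (n_old + n_new)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))

# Test statistic and lower-tail p-value for H1: p_old < p_new
z = (converted_old / n_old - converted_new / n_new) / se
p_value = NormalDist().cdf(z)
print(z, p_value)
```

The statistic is positive because the old page converted slightly better in the sample, which is why the lower-tail p-value is so large.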

**Drawing conclusions**

Since our **p-value = 0.905 is way above our $ α $ = 0.05** threshold, we fail to reject the null hypothesis. In other words, **the data give no evidence that the new design converts better than the old one**; if anything, the observed conversion rate of the new page (11.9%) is slightly below that of the old page (12.0%).
