<a href="https://colab.research.google.com/github/KelvinLam05/ab_testing/blob/main/ab_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

In this project, we will work to understand the results of an A/B test run by an e-commerce website. The company has developed a new web page to try to increase the number of users who "convert," meaning the number of users who decide to pay for the company's product. Our goal is to help the company decide whether to implement the new page, keep the old page, or perhaps run the experiment longer before making a decision.

**Formulating a hypothesis**

The A/B test is performed under the assumption that the old page converts at least as well as the new page, unless the data show that the new page has a higher conversion rate at a Type I error rate of 5%. Based on this assumption,

* for the null hypothesis, the probability that a user converts after landing on the old page is greater than or equal to the probability that a user converts after landing on the new page.

* for the alternative hypothesis, the probability that a user converts after landing on the new page is strictly greater than the probability that a user converts after landing on the old page.

<center>$ H_{0}: p_{old} \geq p_{new} $</center>

<center>$ H_{1}: p_{old} < p_{new} $</center>

**Choosing the variables**

For our test we’ll need two groups:

* A control group - They'll be shown the old design

* A treatment (or experimental) group - They'll be shown the new design

The group a user is assigned to will be our **independent variable**. We use two groups because we want to control for other variables that could affect our results, such as seasonality. With a control group we can compare its results directly to the treatment group: the only systematic difference between the groups is the design of the product page, so we can attribute any difference in results to the designs.

For our **dependent variable** (i.e. what we are trying to measure), we are interested in capturing the conversion rate. One way to encode this is to label each user session with a binary variable:

* 0 - The user did not buy the product during this user session

* 1 - The user bought the product during this user session
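With this 0/1 coding, the conversion rate of a group is simply the mean of the binary column. The sketch below illustrates this on a small made-up DataFrame; the `sessions` name and its rows are hypothetical and are not taken from the dataset used in this project.

```python
import pandas as pd

# Hypothetical sessions: 1 = bought the product, 0 = did not buy
sessions = pd.DataFrame({
    'group': ['control'] * 4 + ['treatment'] * 4,
    'converted': [0, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of sessions that converted
rates = sessions.groupby('group')['converted'].mean()
print(rates)
```

In this toy data the control group converts in 1 of 4 sessions and the treatment group in 2 of 4.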

**Collecting the data**

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

In [2]:
# Load dataset
df = pd.read_csv('/content/ab_data.csv')

In [3]:
# Examine the data
df.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [4]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


**Preparing the data**

In [5]:
df['group'].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [6]:
pd.crosstab(df['group'], df['landing_page'])

| group | new_page | old_page |
|---|---|---|
| control | 1928 | 145274 |
| treatment | 145311 | 1965 |


For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure if this row truly received the new or old page. 
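In the crosstab above, 1928 + 1965 = 3893 rows are mismatched. As a minimal sketch of how such rows could be flagged with a single boolean mask, the example below uses a small made-up DataFrame; the `toy` rows and the variable names are hypothetical and are not taken from `ab_data.csv`.

```python
import pandas as pd

# Toy data (hypothetical rows, not from ab_data.csv)
toy = pd.DataFrame({
    'group':        ['control', 'control', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'old_page'],
})

# A row is aligned when control saw old_page or treatment saw new_page
aligned = (
    ((toy['group'] == 'control') & (toy['landing_page'] == 'old_page'))
    | ((toy['group'] == 'treatment') & (toy['landing_page'] == 'new_page'))
)

n_mismatched = int((~aligned).sum())  # rows we cannot trust
toy_clean = toy[aligned]              # rows we keep

print(n_mismatched, len(toy_clean))
```

Combining the conditions with `&` and `|` inside one indexing operation avoids the reindexing warning that pandas raises when two boolean filters are chained.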

In [7]:
# Keep only the control-group rows that were shown the old page
df_old = df[(df['group'] == 'control') & (df['landing_page'] == 'old_page')]

In [8]:
# Keep only the treatment-group rows that were shown the new page
df_new = df[(df['group'] == 'treatment') & (df['landing_page'] == 'new_page')]

In [9]:
# Create a new dataset 
df = pd.concat([df_old, df_new])

In [10]:
df.tail()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 294462 | 677163 | 2017-01-03 19:41:51.902148 | treatment | new_page | 0 |
| 294465 | 925675 | 2017-01-07 20:38:26.346410 | treatment | new_page | 0 |
| 294468 | 643562 | 2017-01-02 19:20:05.460595 | treatment | new_page | 0 |
| 294472 | 822004 | 2017-01-04 03:36:46.071379 | treatment | new_page | 0 |
| 294477 | 715931 | 2017-01-16 12:40:24.467417 | treatment | new_page | 0 |


Next, we check the number of unique users in the dataset and compare it with the number of rows.

In [11]:
df['user_id'].nunique()

290584

In [12]:
len(df.index)

290585

The row count exceeds the number of unique users by one, so exactly one user_id is repeated in df.

In [13]:
duplicates = df[df.duplicated(['user_id'], keep = False)]

In [14]:
duplicates.sort_values(['user_id'], ascending = False) 

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


In [15]:
# Remove one of the rows with a duplicate user_id
df.drop_duplicates(subset = 'user_id', keep = 'first', inplace = True)

In [16]:
df['user_id'].nunique()

290584

In [17]:
len(df.index)

290584

In [18]:
import scipy.stats as stats

In [19]:
# Given that an individual was in the control group, what is the probability they converted?
df[df['group'] == 'control']['converted'].mean()

0.1203863045004612

In [20]:
# Given that an individual was in the treatment group, what is the probability they converted?
df[df['group'] == 'treatment']['converted'].mean()

0.11880806551510564

These raw rates alone do not provide sufficient evidence that one page leads to more conversions.

The conversion rate of the treatment group is approximately 11.9%, while the conversion rate of the control group is approximately 12.0%. The two rates are very close, so we cannot conclude from the point estimates alone that the new page has an effect in either direction. To decide whether the difference is statistically significant, we need a formal hypothesis test.

**Normality**

Each group contains far more than 30 observations (roughly 145,000 users per group), so by the Central Limit Theorem the sampling distribution of the difference between the two sample conversion rates is approximately normal. This justifies using a z-test.
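A common rule of thumb for the normal approximation of a proportion is that each group should have at least about 10 expected conversions and 10 expected non-conversions. The sketch below checks this using the group sizes and rates reported earlier in this notebook; the threshold of 10 and the helper name `expected_counts` are assumptions introduced here, not part of the original analysis.

```python
# Group sizes and conversion rates as reported earlier in this notebook.
# n_new assumes the 145311 aligned treatment rows minus the one duplicate user_id.
n_old, p_old = 145274, 0.1203863045004612
n_new, p_new = 145310, 0.11880806551510564

THRESHOLD = 10  # rule-of-thumb cutoff (an assumption, not a notebook value)


def expected_counts(n, p):
    """Return the expected numbers of conversions and non-conversions."""
    return n * p, n * (1 - p)


# The condition must hold for both outcomes in both groups
conditions_met = all(
    count >= THRESHOLD
    for n, p in [(n_old, p_old), (n_new, p_new)]
    for count in expected_counts(n, p)
)
print(conditions_met)
```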

**Testing the hypothesis**

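A two-proportion z-test is used to test for a difference between two population proportions. Here we use its one-sided form, because the alternative hypothesis is that the new page converts better than the old page.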
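For reference, the usual pooled form of the two-proportion z statistic is sketched below. The symbols $x_{new}$, $x_{old}$ (conversion counts) and $n_{new}$, $n_{old}$ (group sizes) are notation introduced here for illustration. The pooled standard error is assumed, which is the common textbook choice when testing whether two proportions are equal.

$$
\hat{p}_{new} = \frac{x_{new}}{n_{new}}, \qquad
\hat{p}_{old} = \frac{x_{old}}{n_{old}}, \qquad
\hat{p} = \frac{x_{new} + x_{old}}{n_{new} + n_{old}}
$$

$$
z = \frac{\hat{p}_{new} - \hat{p}_{old}}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_{new}} + \dfrac{1}{n_{old}}\right)}}
$$

Under the one-sided alternative that the new page converts better, the p-value is $P(Z \geq z)$ for a standard normal $Z$, so a negative $z$ always gives a p-value above 0.5.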

In [21]:
from statsmodels.stats.proportion import proportions_ztest

In [22]:
# The number of successes (conversions) on the old page
converted_old = len(df[(df.landing_page == 'old_page') & (df.converted == 1)])

In [23]:
# The number of successes (conversions) on the new page
converted_new = len(df[(df.landing_page == 'new_page') & (df.converted == 1)])

In [24]:
# The number of trials or observations
n_old = len(df[df.landing_page == 'old_page'])

In [25]:
n_new = len(df[df.landing_page == 'new_page'])

In [26]:
import statsmodels.api as sm

In [27]:
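# Perform a one-sided two-proportion z-test.
# alternative = 'larger' tests H1: p_new > p_old, because the new page is listed first.
stat, p_val = sm.stats.proportions_ztest([converted_new, converted_old], [n_new, n_old], alternative = 'larger')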

In [28]:
# Test statistic for the z-test
stat

-1.3109241984234394

In [29]:
# P-value for the z-test
p_val

0.9050583127590245
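As a cross-check, the same statistic can be recomputed by hand with only the standard library. This is a sketch under stated assumptions: the conversion counts are reconstructed by rounding rate × group size from the values reported earlier, the pooled standard error is used, and the variable names are introduced here for illustration.

```python
import math

# Group sizes and conversion rates as reported earlier in this notebook
n_old, n_new = 145274, 145310
rate_old, rate_new = 0.1203863045004612, 0.11880806551510564

# Reconstruct the conversion counts (assumption: rate * n rounds to the true count)
x_old = round(rate_old * n_old)
x_new = round(rate_new * n_new)

# Pooled proportion and standard error under H0 (equal conversion rates)
p_pool = (x_new + x_old) / (n_new + n_old)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))

# z statistic for "new minus old"
z = (x_new / n_new - x_old / n_old) / se

# One-sided p-value P(Z >= z) via the complementary error function
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(z, p_value)
```

The result should agree with the `stat` and `p_val` values above to several decimal places (z ≈ -1.31, p ≈ 0.905).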

**Conclusion**

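Since our p-value of 0.905 is well above $ α $ = 0.05, we fail to reject the null hypothesis: the data provide no evidence that the new page converts better than the old one. In fact, the observed conversion rate of the new page (about 11.9%) is slightly lower than that of the old page (about 12.0%). On this evidence, there is no statistical justification for replacing the old page with the new one.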