# A/B Testing

* For this project, we will work to understand the results of an <b>A/B test run by an e-commerce website.</b> The goal of this notebook is to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

random.seed(42)

import warnings
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv('ab_test.csv')

In [3]:
df.head(10)

Unnamed: 0,id,time,con_treat,page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1
5,936923,20:49.1,control,old_page,0
6,679687,26:46.9,treatment,new_page,1
7,719014,48:29.5,control,old_page,0
8,817355,58:09.0,treatment,new_page,1
9,839785,11:06.6,treatment,new_page,1


In [4]:
df.columns=['user_id', 'timestamp', 'group', 'landing_page', 'converted']
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [6]:
df.shape

(294478, 5)

In [7]:
df['user_id'].nunique()

290584

In [8]:
df.isna().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [9]:
n_treat=df[df['group']=='treatment'].shape[0]
n_new_page=df[df['landing_page']=='new_page'].shape[0]

diff=n_treat-n_new_page

pd.DataFrame({'No. of treatment people':[n_treat], 'No. of new page':[n_new_page],'Difference':[diff]})

Unnamed: 0,No. of treatment people,No. of new page,Difference
0,147276,147239,37


* There is a mismatch between the number of users assigned to the treatment group and the number of users who landed on the new page. This may indicate a problem with the data and needs further exploration.
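A cross-tabulation is one compact way to see where such a mismatch comes from. The sketch below uses a small made-up frame (the rows are illustrative and are not taken from `ab_test.csv`) with the same `group` and `landing_page` column names; the off-diagonal cells count the mismatched rows.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from ab_test.csv.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "new_page", "old_page"],
})

# Rows: assigned group, columns: page actually shown.
table = pd.crosstab(toy["group"], toy["landing_page"])
print(table)

# Mismatches are control/new_page plus treatment/old_page.
n_mismatch = table.loc["control", "new_page"] + table.loc["treatment", "old_page"]
print("Mismatched rows:", n_mismatch)
```

Applied to the notebook's data, the equivalent call would be `pd.crosstab(df['group'], df['landing_page'])`.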

In [10]:
df[(df['group']=='treatment')&(df['landing_page']=='old_page')]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
308,857184,34:59.8,treatment,old_page,0
327,686623,26:40.7,treatment,old_page,0
357,856078,29:30.4,treatment,old_page,0
685,666385,11:54.8,treatment,old_page,0
713,748761,47:44.4,treatment,old_page,0
...,...,...,...,...,...
293773,688144,34:50.5,treatment,old_page,1
293817,876037,15:09.0,treatment,old_page,1
293917,738357,37:55.7,treatment,old_page,0
294014,813406,25:33.2,treatment,old_page,0


In [11]:
df_mismatch=df[((df["group"]=='treatment') & (df['landing_page']=='old_page'))
               | ((df["group"]=='control') & (df['landing_page']=='new_page'))]

df_mismatch

Unnamed: 0,user_id,timestamp,group,landing_page,converted
22,767017,58:15.0,control,new_page,0
240,733976,11:16.4,control,new_page,0
308,857184,34:59.8,treatment,old_page,0
327,686623,26:40.7,treatment,old_page,0
357,856078,29:30.4,treatment,old_page,0
...,...,...,...,...,...
294014,813406,25:33.2,treatment,old_page,0
294200,928506,32:10.5,control,new_page,0
294252,892498,11:10.5,treatment,old_page,0
294253,886135,49:20.5,control,new_page,0


In [12]:
print("No. of mismatched rows: ", df_mismatch.shape[0])
print("Percent of mismatched rows: ", round(df_mismatch.shape[0]/df.shape[0]*100,2), "%.")

No. of mismatched rows:  3893
Percent of mismatched rows:  1.32 %.


In [13]:
df_correct=df[((df["group"]=='treatment') & (df['landing_page']=='new_page'))
               | ((df["group"]=='control') & (df['landing_page']=='old_page'))]

In [14]:
df_correct.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [15]:
df_correct.shape

(290585, 5)

In [16]:
df_correct=df_correct.drop_duplicates("user_id")

In [17]:
df_correct.shape

(290584, 5)

In [18]:
df_correct.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [19]:
print("Probability of conversion regardless of the page they receive ",df_correct['converted'].mean() * 100,"%.")

Probability of conversion regardless of the page they receive  11.959708724499627 %.


* Probability of conversion regardless of the page customers receive is around 11.96%.

In [20]:
df_correct.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290584 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       290584 non-null  int64 
 1   timestamp     290584 non-null  object
 2   group         290584 non-null  object
 3   landing_page  290584 non-null  object
 4   converted     290584 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 13.3+ MB


In [21]:
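df_correct['user_id']=df_correct['user_id'].astype('str')
# Select the numeric column explicitly: recent pandas versions raise a TypeError
# when mean() is applied to string columns such as timestamp and landing_page.
df_correct.groupby('group')[['converted']].mean()*100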

group,converted
control,12.03863
treatment,11.880807


* Probability of conversion if a customer was in the control group is 12.04%.
* Probability of conversion if a customer was in the treatment group is 11.88%.

In [22]:
(df_correct['landing_page'].value_counts()/df_correct.shape[0])*100

new_page    50.006194
old_page    49.993806
Name: landing_page, dtype: float64

* Probability that a customer received a new page is around 50%.

#### On the basis of Probability, we conclude that

1. The probability that an individual received the new page is 50%
2. The probability of an individual converting regardless of the page they receive is 11.96%
3. Given that an individual was in the control group, the probability they converted is 12.04%
4. Given that an individual was in the treatment group, the probability they converted is 11.88%

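Points 1 to 4 suggest that the conversion rates of the treatment and control groups differ only slightly (by about 0.16 percentage points, in favor of the control group). On these raw rates alone, the new page does not appear to lead to more conversions. Whether the difference is statistically significant is checked with a formal test in the next section.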
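The size of the gap between points 3 and 4 can be made explicit with a little arithmetic. The snippet below is a minimal sketch that takes the two group rates from the `groupby` output above; the variable names are illustrative.

```python
# Conversion rates (in %) as reported by the groupby output above.
control_rate = 12.03863
treatment_rate = 11.880807

# Absolute difference in percentage points (treatment minus control).
abs_diff_pp = treatment_rate - control_rate

# Relative change of the treatment rate with respect to the control rate, in %.
rel_change_pct = abs_diff_pp / control_rate * 100

print(f"Absolute difference: {abs_diff_pp:.3f} percentage points")
print(f"Relative change: {rel_change_pct:.2f} %")
```

A negative value means the treatment group converted slightly less often than the control group.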

### Let's conduct A/B testing :

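* H0 : The new page does not convert better than the old page (p_new <= p_old).
* H1 : The new page converts better than the old page (p_new > p_old).

Significance level (alpha) = 0.05. The test is one-sided, matching `alternative='smaller'` in the `proportions_ztest` call, where the old page is passed first.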

In [23]:
converted_old=df_correct[(df_correct['landing_page']=='old_page') & (df_correct['converted']==1)]['user_id'].nunique()

converted_new=df_correct[(df_correct['landing_page']=='new_page') & (df_correct['converted']==1)]['user_id'].nunique()

print(converted_old)
print(converted_new)

17489
17264


In [24]:
n_old=df_correct[df_correct['landing_page']=='old_page']['user_id'].nunique()
n_new=df_correct[df_correct['landing_page']=='new_page']['user_id'].nunique()

In [25]:
print(n_old)
print(n_new)

145274
145310


In [30]:
import statsmodels.api as sm

z_score,p_value=sm.stats.proportions_ztest(np.array([converted_old,converted_new]),np.array([n_old,n_new]),alternative='smaller')

In [31]:
z_score, p_value

(1.3109241984234394, 0.9050583127590245)

* Since the p-value (0.905) is greater than 0.05, we fail to reject the null hypothesis: the data provide no evidence that the new page converts better than the old page.

* Equivalently, the z-score of 1.31 does not fall in the rejection region of a one-sided test at the 5% level (z < -1.645), so the null hypothesis is not rejected.
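As a cross-check, the z-score and p-value can be reproduced with the standard library alone. This is a sketch that assumes the pooled-variance form of the two-proportion z-test, which, to my understanding, is the default behavior of `proportions_ztest`. The counts are the ones printed earlier in the notebook, and the helper names (`p_pool`, `std_err`, `z_crit`) are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts printed earlier in the notebook.
converted_old, n_old = 17489, 145274
converted_new, n_new = 17264, 145310

p_old = converted_old / n_old
p_new = converted_new / n_new

# Pooled conversion rate under the null hypothesis of equal proportions.
p_pool = (converted_old + converted_new) / (n_old + n_new)
std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))

# Same ordering as the call above: first sample (old) minus second sample (new).
z = (p_old - p_new) / std_err

# alternative='smaller' corresponds to the lower-tail probability P(Z <= z).
p_value = NormalDist().cdf(z)

# Lower-tail critical value for a one-sided test at alpha = 0.05.
z_crit = NormalDist().inv_cdf(0.05)

print(round(z, 4), round(p_value, 4), round(z_crit, 3))
```

Because `z` is well above `z_crit`, the statistic lies outside the lower-tail rejection region, which is consistent with the large p-value.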