You can find the dataset [here](https://www.kaggle.com/datasets/putdejudomthai/ecommerce-ab-testing-2022-dataset1/data) on Kaggle.

In [1]:
# Importing necessary libraries

import pandas as pd
from scipy.stats import shapiro, levene, mannwhitneyu
import warnings
# Note: this also hides SciPy's warning that Shapiro-Wilk p-values may be inaccurate for N > 5000
warnings.filterwarnings('ignore')

In [2]:
# Loading the dataset

df = pd.read_csv('ab_data.csv')

In [3]:
# EDA
# Displaying the first few rows of the dataset

print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
   user_id timestamp      group landing_page  converted
0   851104   11:48.6    control     old_page          0
1   804228   01:45.2    control     old_page          0
2   661590   55:06.2  treatment     new_page          0
3   853541   28:03.1  treatment     new_page          0
4   864975   52:26.2    control     old_page          1


In [4]:
# Displaying information about the dataset

print("\nInformation about the dataset:")
print(df.info())


Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
None


In [5]:
# Checking for missing values

print("\nNumber of missing values in each column:")
print(df.isnull().sum())


Number of missing values in each column:
user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64


In [6]:
# Dropping every row for any 'user_id' that appears more than once
# (keep=False removes all copies of a repeated user, not just the extra ones)

df = df.drop_duplicates(subset='user_id', keep=False)

In [7]:
# Displaying the shape of the dataset after removing duplicates

print("\nShape of the dataset after removing duplicates:")
print(df.shape)


Shape of the dataset after removing duplicates:
(286690, 5)
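
The `keep=False` argument used above discards all rows for a repeated `user_id`, whereas `keep='first'` would retain one row per user. The sketch below illustrates the difference on a small made-up frame; the `toy` data and the variable names are hypothetical and are not part of `ab_data.csv`.

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice, once on each page.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "landing_page": ["old_page", "old_page", "new_page", "new_page"],
    "converted": [0, 1, 0, 1],
})

# keep=False discards every row that belongs to a repeated user_id.
dropped_all = toy.drop_duplicates(subset="user_id", keep=False)

# keep="first" retains the first row seen for each user_id.
kept_first = toy.drop_duplicates(subset="user_id", keep="first")

# Count the distinct users that occur more than once.
repeated_rows = toy[toy["user_id"].duplicated(keep=False)]
n_repeated_users = int(repeated_rows["user_id"].nunique())

print(dropped_all["user_id"].tolist())
print(kept_first["user_id"].tolist())
print(n_repeated_users)
```

Dropping all copies is a defensible choice when a repeated user may have seen both pages, since it is then unclear which page drove the outcome. Running the same `duplicated(keep=False)` count on `df` before cell 6 would show how many users that choice removes.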


In [8]:
# Grouping data by 'group' and 'landing_page' and counting occurrences

landing_page_counts = df.groupby(['group', 'landing_page']).size()
print("\nCounts of landing page by group:")
print(landing_page_counts)


Counts of landing page by group:
group      landing_page
control    old_page        143293
treatment  new_page        143397
dtype: int64


In [9]:
# Calculating conversion rate by group and landing page

conversion_rate = df.groupby(['group', 'landing_page'])['converted'].mean()
print("\nConversion rate by group and landing page:")
print(conversion_rate)


Conversion rate by group and landing page:
group      landing_page
control    old_page        0.120173
treatment  new_page        0.118726
Name: converted, dtype: float64


In [10]:
# Calculating the percentage of each landing page type

landing_page_percentage = df['landing_page'].value_counts(normalize=True) * 100
print("\nPercentage of each landing page:")
print(landing_page_percentage)


Percentage of each landing page:
new_page    50.018138
old_page    49.981862
Name: landing_page, dtype: float64


In [11]:
# Checking for inconsistent group-landing_page combinations
# (this runs after duplicate removal, so it describes the cleaned data, not the raw file)

inconsistent_entries = df[((df['group'] == 'control') & (df['landing_page'] == 'new_page')) |
                          ((df['group'] == 'treatment') & (df['landing_page'] == 'old_page'))]
print("\nInconsistent entries:")
print(inconsistent_entries)


Inconsistent entries:
Empty DataFrame
Columns: [user_id, timestamp, group, landing_page, converted]
Index: []
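
A cross-tabulation is another way to view all four group/page combinations at once, which makes mismatches visible as non-zero off-diagonal cells. The sketch below uses a hypothetical `toy` frame rather than the real data, and the mismatch mask assumes that `group` takes only the values `control` and `treatment`.

```python
import pandas as pd

# Hypothetical toy data containing two mismatched rows.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "new_page", "old_page"],
})

# Rows are groups, columns are landing pages, cells are row counts.
combo_counts = pd.crosstab(toy["group"], toy["landing_page"])

# A row is consistent when "is treatment" and "saw new_page" agree.
mismatch_mask = (toy["group"] == "treatment") != (toy["landing_page"] == "new_page")
n_mismatched = int(mismatch_mask.sum())

print(combo_counts)
print("Mismatched rows:", n_mismatched)
```

Applying `pd.crosstab(df['group'], df['landing_page'])` to the raw frame, before any rows are dropped, would show whether the source file itself contains mismatches.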


In [12]:
# Performing A/B Testing
# Shapiro-Wilk test for the normality assumption
# Note: 'converted' is binary (0/1), so the test is expected to reject normality

shapiro_old_page = shapiro(df.loc[df["landing_page"] == "old_page", "converted"])
shapiro_new_page = shapiro(df.loc[df["landing_page"] == "new_page", "converted"])
print("\nShapiro-Wilk test for normality assumption:")
print("Old page - p-value:", shapiro_old_page.pvalue, "Test stat:", shapiro_old_page.statistic)
print("New page - p-value:", shapiro_new_page.pvalue, "Test stat:", shapiro_new_page.statistic)


Shapiro-Wilk test for normality assumption:
Old page - p-value: 0.0 Test stat: 0.3792334198951721
New page - p-value: 0.0 Test stat: 0.37685757875442505


In [13]:
# Levene's test for variance homogeneity

levene_test = levene(df.loc[df["landing_page"] == "new_page", "converted"],
                     df.loc[df["landing_page"] == "old_page", "converted"])
print("\nLevene's test for variance homogeneity:")
print("p-value:", levene_test.pvalue, "Test stat:", levene_test.statistic)


Levene's test for variance homogeneity:
p-value: 0.2322897281547632 Test stat: 1.4267917566652295


In [14]:
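# Performing the Mann-Whitney U test (non-parametric, used because normality was rejected)
# H0: 'converted' has the same distribution on the old and new pages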

test_stat, pvalue = mannwhitneyu(df.loc[df["landing_page"] == "new_page", "converted"],
                                  df.loc[df["landing_page"] == "old_page", "converted"])
print("\nMann-Whitney U test results:")
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))


Mann-Whitney U test results:
Test Stat = 10259026653.0000, p-value = 0.2323
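
Because `converted` is a 0/1 outcome, comparing the two pages amounts to comparing two proportions. A pooled two-proportion z-test is a common alternative to the Mann-Whitney U test for that case. The sketch below is not part of the original notebook: the helper `two_proportion_ztest` is a hypothetical name, it uses only the standard library, and the counts passed to it are made up for illustration.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equal proportions, using the pooled rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under H0 (equal rates).
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: P(|Z| > |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 of 1000 conversions versus 100 of 1000.
z, p = two_proportion_ztest(120, 1000, 100, 1000)
print("z = %.4f, p-value = %.4f" % (z, p))
```

For the notebook's data, the inputs could be taken from `df.groupby('landing_page')['converted'].agg(['sum', 'count'])`, where `sum` gives the conversions and `count` the number of users per page.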


In [15]:
# Interpreting the results
# The test compares conversion, not profit, and a two-sided test does not say which page is better

alpha = 0.05
if pvalue < alpha:
    print("\nReject the null hypothesis. There is a statistically significant difference between the old and new pages.")
    print("Compare the conversion rates to see which page performs better.")
else:
    print("\nFail to reject the null hypothesis. There is no statistically significant difference between the old and new pages.")
    print("The data do not show that the new page improves conversion.")


Fail to reject the null hypothesis. There is no statistically significant difference between the old and new pages.
The data do not show that the new page improves conversion.
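
A non-significant p-value does not show how large a difference in conversion rates is still plausible. A confidence interval for the difference can make that explicit. The sketch below computes a Wald interval with an unpooled standard error. The helper `diff_confidence_interval` is a hypothetical name, the counts are made up, and the normal approximation is an assumption that is reasonable only for large samples.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_new, n_new, conv_old, n_old, level=0.95):
    """Wald confidence interval for (new rate - old rate)."""
    p_new = conv_new / n_new
    p_old = conv_old / n_old
    diff = p_new - p_old
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical counts: 100 of 1000 on the new page, 120 of 1000 on the old page.
low, high = diff_confidence_interval(100, 1000, 120, 1000)
print("95%% CI for the difference: [%.4f, %.4f]" % (low, high))
```

An interval that contains zero is consistent with failing to reject the null hypothesis, and its width indicates how precisely the difference has been estimated.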
