# **Background**

In this project, a company developed a new landing page for its e-commerce website, aiming to increase the number of users who pay for its products. The company wanted to know whether the new page has a significant effect on users. If it does, the new page can be rolled out; if it does not, the old page should be kept.

To answer this, users were split into two groups of approximately equal size: a control group (A) and a treatment group (B). The treatment group was shown the new page, while the control group was shown the old one. Data was collected for both groups, and hypothesis testing was then applied to determine whether the difference between them is significant.

## About Dataset

* **user_id**: unique identifier of each user
* **timestamp**: time of the record
* **group**: experiment group (`control` or `treatment`)
* **landing_page**: page shown to the user (`old_page` or `new_page`)
* **converted**: sign-up status after viewing the page (0 = not converted, 1 = converted)

## AB Testing Steps 

1. **Create Hypothesis**
* H0: There is no statistically significant difference in conversion between the old and new pages.
* H1: There is a statistically significant difference in conversion between the old and new pages.

The purpose of this hypothesis is to determine whether a change (for example, a new page) provides significantly different results compared to the old page.

2. **Checking Statistical Assumptions**

**Normality Assumption**
* H0: Data is normally distributed.
* H1: Data is not normally distributed.

If the p-value < 0.05, we reject H0 and conclude that the data is not normally distributed (a non-parametric test is required). If the p-value ≥ 0.05, we fail to reject H0 and treat the data as normally distributed (a parametric test can be used).

**Variance Homogeneity**
* H0: The variances between groups are homogeneous (equal).
* H1: Variances between groups are not homogeneous (different).

**Note:**
* If the data meets both the normality and the homogeneous-variance assumptions, use the **t-test**.
* If the data does not meet the normality assumption (regardless of whether the variances are homogeneous), use the **Mann-Whitney U test** (a non-parametric test).
* If the data meets the normality assumption but the variances are not homogeneous, use **Welch's t-test**.

3. **Analyzing Results and Conclusions**

The conclusion is drawn from the p-value of the selected parametric or non-parametric test. This result answers the question: is there a significant difference between the new page and the old page?

**Note:** If the p-value of the statistical test is < 0.05, we reject H0 and conclude that there is a statistically significant difference between the old page and the new page. If the p-value is ≥ 0.05, we fail to reject H0: the data does not show a statistically significant difference between the two pages.
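
The test-selection rules and the p-value decision rule described in the steps above can be summarized in a small helper. This is a minimal sketch and not part of the original notebook: the function names `choose_test` and `decide` and the `alpha` parameter are illustrative assumptions.

```python
def choose_test(is_normal: bool, equal_variance: bool) -> str:
    """Return the name of the test suggested by the assumption checks."""
    if not is_normal:
        # Normality fails: use the non-parametric test regardless of variance
        return "Mann-Whitney U test"
    if equal_variance:
        # Both assumptions hold
        return "t-test"
    # Normal data with unequal variances
    return "Welch's t-test"


def decide(pvalue: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule at significance level alpha."""
    return "reject H0" if pvalue < alpha else "fail to reject H0"


print(choose_test(is_normal=False, equal_variance=True))
print(decide(0.03))
```

Keeping the two rules as separate functions mirrors the steps: the assumption checks pick the test, and only the p-value of that test drives the conclusion.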



# Import Libraries & Read Dataset

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [19]:
path = "/kaggle/input/ecommerce-ab-testing-2022-dataset1/ecommerce_ab_testing_2022_dataset1/ab_data.csv"
df = pd.read_csv(path)
df.head()

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [21]:
df.nunique()

user_id         290585
timestamp        35993
group                2
landing_page         2
converted            2
dtype: int64

In [22]:
df.shape

(294480, 5)

In [23]:
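# Count repeated user_id values (not displayed, since it is not the cell's last expression)
df['user_id'].duplicated().sum()
# keep=False drops every row of a repeated user_id, not only the extra copies
df = df.drop_duplicates(subset='user_id', keep=False)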

In [24]:
df.shape

(286690, 5)
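
The drop from 294,480 to 286,690 rows (7,790 rows) follows from `keep=False`, which removes every row belonging to a repeated `user_id` rather than keeping one copy. The toy frame below is a made-up illustration (the IDs are hypothetical and not taken from the dataset) that compares this with `keep='first'`.

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice, once on each page
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "landing_page": ["old_page", "old_page", "new_page", "new_page"],
})

# keep=False drops both rows of user 2
dropped_all = toy.drop_duplicates(subset="user_id", keep=False)

# keep='first' would retain the first row of user 2
kept_first = toy.drop_duplicates(subset="user_id", keep="first")

print(len(dropped_all), len(kept_first))
```

Dropping all copies is the stricter choice: a user who appears more than once is excluded entirely instead of being counted once.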

In [25]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [26]:
count_group_page = df.groupby('group')['landing_page'].value_counts()
count_group_page

group      landing_page
control    old_page        143293
treatment  new_page        143397
Name: count, dtype: int64

In [27]:
count_group_page / count_group_page.sum()

group      landing_page
control    old_page        0.499819
treatment  new_page        0.500181
Name: count, dtype: float64

In [28]:
df.describe()

| | user_id | converted |
|---|---|---|
| count | 286690.0 | 286690.0 |
| mean | 788036.057184 | 0.11945 |
| std | 91239.396095 | 0.324317 |
| min | 630000.0 | 0.0 |
| 25% | 709036.25 | 0.0 |
| 50% | 788059.5 | 0.0 |
| 75% | 866998.75 | 0.0 |
| max | 945999.0 | 1.0 |


In [29]:
df.groupby(['group', 'landing_page']).agg(conv_rate=('converted', 'mean'))

| group | landing_page | conv_rate |
|---|---|---|
| control | old_page | 0.120173 |
| treatment | new_page | 0.118726 |


In [30]:
print("Old Page")
print(df.loc[df['landing_page'] == 'old_page', 'converted'].value_counts())
print("New Page")
print(df.loc[df['landing_page'] == 'new_page', 'converted'].value_counts())

Old Page
converted
0    126073
1     17220
Name: count, dtype: int64
New Page
converted
0    126372
1     17025
Name: count, dtype: int64
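
The group conversion rates can be recomputed directly from the counts printed above, which also shows the size of the gap between the two pages. The sketch below uses only those counts; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
old_converted, old_not_converted = 17220, 126073
new_converted, new_not_converted = 17025, 126372

old_rate = old_converted / (old_converted + old_not_converted)
new_rate = new_converted / (new_converted + new_not_converted)

# Absolute difference in conversion rate (new minus old)
abs_diff = new_rate - old_rate
# Relative change with respect to the old page
rel_diff = abs_diff / old_rate

print(f"old: {old_rate:.6f}, new: {new_rate:.6f}")
print(f"absolute difference: {abs_diff:.6f}, relative: {rel_diff:.2%}")
```

The difference is negative: in this sample the new page converts slightly less often than the old one. Whether that gap is statistically significant is what the tests in the next section address.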


# AB Test

## Normality Assumption

**Normality Assumption**
* H0: Data is normally distributed.
* H1: Data is not normally distributed.

If the p-value < 0.05, we reject H0 and conclude that the data is not normally distributed (a non-parametric test is required). If the p-value ≥ 0.05, we fail to reject H0 and treat the data as normally distributed (a parametric test can be used).

In [31]:
# Shapiro-Wilk normality test on the old-page conversions
test_stat, pvalue = stats.shapiro(df.loc[df['landing_page'] == 'old_page', 'converted'])
if pvalue < 0.05:
    print("Data is not normally distributed.")
else:
    print("Data is normally distributed.")
print("p-value:", pvalue)
print("test_stat:", test_stat)

Data is not normally distributed.
p-value: 9.036752175762015e-178
test_stat: 0.37923878278163115




In [32]:
# Shapiro-Wilk normality test on the new-page conversions
test_stat, pvalue = stats.shapiro(df.loc[df['landing_page'] == 'new_page', 'converted'])
if pvalue < 0.05:
    print("Data is not normally distributed.")
else:
    print("Data is normally distributed.")
print("p-value:", pvalue)
print("test_stat:", test_stat)

Data is not normally distributed.
p-value: 6.263942656903057e-178
test_stat: 0.37673520650313475




Both groups fail the normality test, so a **non-parametric** test is required.
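
This outcome is expected rather than surprising: `converted` only takes the values 0 and 1, so it behaves as a Bernoulli variable, which cannot be normally distributed. As a quick check under that assumption, the standard deviation of a 0/1 variable with mean `p` is `sqrt(p * (1 - p))`. The sketch below applies this to the overall mean reported by `df.describe()`.

```python
import math

# Overall conversion rate (mean of `converted`) reported by df.describe()
p = 0.11945

# Standard deviation of a Bernoulli(p) variable
bernoulli_std = math.sqrt(p * (1 - p))

# Compare with the std of 0.324317 reported by df.describe()
print(round(bernoulli_std, 4))
```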

## Variance Homogeneity

**Variance Homogeneity**
* H0: The variances between groups are homogeneous (equal).
* H1: Variances between groups are not homogeneous (different).

In [33]:
# Levene's test for equal variances between old-page and new-page conversions
test_stat, pvalue = stats.levene(df.loc[df['landing_page'] == 'old_page', 'converted'],
                                 df.loc[df['landing_page'] == 'new_page', 'converted'])
if pvalue < 0.05:
    print("Variances are not homogeneous")
else:
    print("Variances are homogeneous")
print("p-value:", pvalue)
print("test_stat:", test_stat)

Variances are homogeneous
p-value: 0.2322897281547632
test_stat: 1.4267917566652295


Because the data is **not normally distributed**, even though the **variances are homogeneous**, we use the non-parametric **Mann-Whitney U test**.

In [34]:
# Mann-Whitney U test comparing conversions on the old and new pages
test_stat, pvalue = stats.mannwhitneyu(df.loc[df['landing_page'] == 'old_page', 'converted'],
                                       df.loc[df['landing_page'] == 'new_page', 'converted'])

print("Mann-Whitney U Test: p-value=", pvalue)
print("Mann-Whitney U Test: test_stat=", test_stat)

Mann-Whitney U Test: p-value= 0.23228910319572493
Mann-Whitney U Test: test_stat= 10288759668.0


The p-value (about 0.23) is greater than 0.05, so we fail to reject H0, the null hypothesis. The new page does not show a statistically significant effect on conversion, so the old page can remain in use on the website.
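
Because `converted` is binary, a two-proportion z-test is a common alternative for comparing conversion rates. It is not the test used in this analysis. The sketch below is only a cross-check, computed from the conversion counts reported earlier with the standard pooled-variance formula; the function name `two_proportion_ztest` is an illustrative assumption.

```python
import math


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled conversion rate under H0 (both pages share one rate)
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    pvalue = math.erfc(abs(z) / math.sqrt(2))
    return z, pvalue


# Conversions and group sizes taken from the value_counts() output
z, pvalue = two_proportion_ztest(17220, 143293, 17025, 143397)
print(f"z = {z:.3f}, p-value = {pvalue:.3f}")
```

Under these assumptions the p-value is also well above 0.05, which agrees with the Mann-Whitney U result.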