# **Introduction**
A company has developed a new webpage intended to increase the number of paying users for its product, and it is running an AB test to measure the effect. Users are split into two roughly equal groups: the control group (A) sees the old page and the treatment group (B) sees the new page. Data is collected for both groups while the experiment runs, and hypothesis testing is then applied to determine whether the difference between them is statistically significant.

## **AB Test Steps:**
1. **Create the hypothesis:** The new page increases the number of paying users
2. **Assumption Check**
 - Normality Assumption
 - Variance Homogeneity
3. **Hypothesis Testing**
 - If the assumptions are met, use a parametric test (t-test)
 - Otherwise, use a non-parametric test (Mann-Whitney U, `mannwhitneyu`)
4. **Interpret the results based on the p-value**
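The steps above can be sketched as a single helper. This is a minimal illustration, not the notebook's own code: the function name `run_ab_test`, the `alpha` parameter, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu


def run_ab_test(control, treatment, alpha=0.05):
    """Hypothetical helper: choose a test from the assumption checks.

    Returns a (test_name, p_value) tuple.
    """
    # Step 2a: Shapiro-Wilk on each group; normality holds only if neither rejects.
    normal = all(shapiro(g).pvalue > alpha for g in (control, treatment))
    # Step 2b: Levene's test for homogeneity of variances.
    equal_var = levene(control, treatment).pvalue > alpha

    if normal and equal_var:
        # Step 3, assumptions met: parametric two-sample t-test.
        return "t-test", ttest_ind(control, treatment).pvalue
    # Step 3, assumptions not met: non-parametric Mann-Whitney U test.
    result = mannwhitneyu(control, treatment, alternative="two-sided")
    return "mannwhitneyu", result.pvalue


# Synthetic 0/1 conversions for illustration only (not the notebook's data).
rng = np.random.default_rng(0)
control = rng.binomial(1, 0.12, size=2000)
treatment = rng.binomial(1, 0.12, size=2000)

test_name, p_value = run_ab_test(control, treatment)
print(test_name, round(p_value, 4))
```

With a 0/1 outcome the Shapiro-Wilk check rejects normality, so the helper falls through to the Mann-Whitney U branch.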

 
## **About the Dataset** 
**user_id:** unique identifier of the user

**timestamp:** time of the record

**group:** treatment or control group

**landing_page:** old_page or new_page

**converted:** sign-up status after viewing the page (0 = not signed up, 1 = signed up)


## **AB Testing**

### **Create Hypothesis**

H0: There is no statistically significant difference between the old page and the new page

H1: There is a statistically significant difference between the old page and the new page

### **Assumption Check**

**Normality Assumption**

- H0: The data are normally distributed
- H1: The data are not normally distributed

If the p-value is less than 0.05, H0 is rejected and a non-parametric test (`mannwhitneyu`) will be used. Otherwise, a parametric test (t-test) will be used.

**Variance Homogeneity**

- H0: Variances are homogeneous
- H1: Variances are not homogeneous

### **Conclusion**
We will reject or fail to reject H0 based on the p-value obtained from the parametric or non-parametric test we perform. This result answers the question: is there a significant difference between the new page and the old page?

### **Hypothesis Testing**

H0: There is no statistically significant difference between the old page and the new page

H1: There is a statistically significant difference between the old page and the new page

In [1]:
import pandas as pd
import statsmodels.stats.api as sms
from scipy.stats import shapiro, levene, mannwhitneyu

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('/kaggle/input/ecommerce-ab-testing-2022-dataset1/ecommerce_ab_testing_2022_dataset1/ab_data.csv')

# **EDA (Exploratory Data Analysis)**

In [3]:
df.head()

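|   | user_id | timestamp | group | landing_page | converted |
|---|---------|-----------|-------|--------------|-----------|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |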


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [5]:
df.apply(lambda x: x.nunique())

user_id         290585
timestamp        35993
group                2
landing_page         2
converted            2
dtype: int64

In [6]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [7]:
print(df.shape)
df = df.drop_duplicates(subset= 'user_id', keep= False)
print(df.shape)

(294480, 5)
(286690, 5)
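Note that `keep=False` removes every row whose `user_id` is duplicated, not just the later copies. This matches the counts above: there are 294480 - 290585 = 3895 surplus rows, and 294480 - 286690 = 7790 = 2 × 3895 rows were dropped. A small sketch on a toy frame (the `toy` data is made up for illustration) shows the difference from `keep="first"`:

```python
import pandas as pd

# Toy data for illustration: user 2 appears twice.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "converted": [0, 1, 0, 1],
})

# keep="first" retains one row per user_id.
first_kept = toy.drop_duplicates(subset="user_id", keep="first")
# keep=False drops all rows that share a user_id with another row.
none_kept = toy.drop_duplicates(subset="user_id", keep=False)

print(first_kept["user_id"].tolist())
print(none_kept["user_id"].tolist())
```

Dropping both copies is the conservative choice when a user may have been exposed to both pages and it is unclear which exposure to trust.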


In [8]:
df.groupby(['group','landing_page']).agg({'landing_page': lambda x: x.value_counts()})

| group | landing_page | landing_page (row count) |
|-------|--------------|--------------------------|
| control | old_page | 143293 |
| treatment | new_page | 143397 |


In [9]:
df.groupby(['group','landing_page']).agg({'converted': 'mean'})

| group | landing_page | converted |
|-------|--------------|-----------|
| control | old_page | 0.120173 |
| treatment | new_page | 0.118726 |


In [10]:
pd.DataFrame(df.loc[:,'landing_page'].value_counts(normalize = True) * 100)

|   | landing_page |
|---|--------------|
| new_page | 50.018138 |
| old_page | 49.981862 |


In [11]:
df[((df['group'] == 'control') & (df['landing_page'] == 'new_page')) |((df['group'] == 'treatment') & (df['landing_page'] == 'old_page')) ]

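|   | user_id | timestamp | group | landing_page | converted |
|---|---------|-----------|-------|--------------|-----------|

The result is empty: no row pairs the control group with new_page or the treatment group with old_page. Every control user saw the old page and every treatment user saw the new page.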
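The same consistency check between `group` and `landing_page` can also be read off a contingency table. The sketch below uses a made-up `toy` frame with one deliberately mismatched row, so the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration: the last row is a treatment user on old_page.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "old_page", "new_page", "new_page", "old_page"],
})

# Rows are groups, columns are landing pages, cells are row counts.
table = pd.crosstab(toy["group"], toy["landing_page"])

# Mismatches sit in the off-diagonal cells.
mismatched = int(table.loc["control", "new_page"] + table.loc["treatment", "old_page"])

print(table)
print("mismatched rows:", mismatched)
```

On a clean dataset both off-diagonal cells are zero, which corresponds to the empty result of the filter in cell 11.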

# **AB Test**

### Normality Assumption

- H0: The data are normally distributed
- H1: The data are not normally distributed

If the p-value is less than 0.05, H0 is rejected and a non-parametric test (`mannwhitneyu`) will be used. Otherwise, a parametric test (t-test) will be used.

In [12]:
test_stat, pvalue = shapiro(df.loc[df["landing_page"] == "old_page", "converted"])
print("p-value:",pvalue)
print("test_stat:",test_stat)

p-value: 0.0
test_stat: 0.3792334198951721


In [13]:
test_stat, pvalue = shapiro(df.loc[df["landing_page"] == "new_page", "converted"])
print("p-value:",pvalue)
print("test_stat:",test_stat)

p-value: 0.0
test_stat: 0.37685757875442505


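Both p-values are below 0.05, so the normality assumption is not met and we will use a non-parametric test (`mannwhitneyu`). This is expected: `converted` only takes the values 0 and 1, so it cannot follow a normal distribution.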

### Variance Homogeneity

H0: Variances are homogeneous

H1: Variances are not homogeneous

In [14]:
test_stat, pvalue = levene(df.loc[df["landing_page"] == "new_page", "converted"],
                           df.loc[df["landing_page"] == "old_page", "converted"])
print("p-value:",pvalue)  
print("test_stat:",test_stat)

p-value: 0.2322897281547632
test_stat: 1.4267917566652295


The p-value (0.2323) is greater than 0.05, so we fail to reject H0: the variances are homogeneous.

### Hypothesis Testing

Because the normality assumption is not met, we use a non-parametric test: the Mann-Whitney U test (`mannwhitneyu`).

H0: There is no statistically significant difference between the old page and the new page

H1: There is a statistically significant difference between the old page and the new page

In [15]:
test_stat, pvalue = mannwhitneyu(df.loc[df["landing_page"] == "new_page", "converted"],
                                 df.loc[df["landing_page"] == "old_page", "converted"])

print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = 10259026653.0000, p-value = 0.2323
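The raw test statistic is hard to read on its own. One way to put it on a 0 to 1 scale is to divide it by the number of (new_page, old_page) pairs, which estimates how often a new-page user outscores an old-page user, with ties counted as one half. The sketch below assumes the reported statistic is the U value for the first sample passed to `mannwhitneyu` (new_page); the variable names are chosen for this example.

```python
# Values copied from the outputs above.
u_stat = 10259026653.0   # Mann-Whitney U from cell 15
n_new = 143397           # treatment / new_page rows from cell 8
n_old = 143293           # control / old_page rows from cell 8

# Share of all cross-group pairs "won" by new_page (ties count as half).
pairwise_share = u_stat / (n_new * n_old)

print(round(pairwise_share, 4))
```

A value of 0.5 would mean neither page tends to score higher; the result here is just under 0.5, which is consistent with the slightly lower conversion rate observed for the new page.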


p-value (0.2323) > 0.05

We fail to reject H0: there is no statistically significant difference in conversion between the new page and the old page, so this test provides no evidence that the new page increases the number of paying users.
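To put the conclusion in context, the size of the observed gap can be computed from the group means reported in the EDA section. This is plain arithmetic on those two numbers; the variable names are chosen for this example.

```python
# Conversion rates copied from the cell 9 output.
control_rate = 0.120173    # control / old_page
treatment_rate = 0.118726  # treatment / new_page

# Difference in percentage-point terms and relative to the control rate.
absolute_diff = treatment_rate - control_rate
relative_diff = absolute_diff / control_rate

print(f"absolute difference: {absolute_diff:+.6f}")
print(f"relative difference: {relative_diff:+.2%}")
```

The observed treatment rate is slightly below the control rate, and the test indicates that a gap of this size is compatible with chance.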