# A/B Test performed on "Grocery Website Data"

This exercise uses the dataset "Grocery website data for AB test", uploaded by Tetiana Klimonova on kaggle.com.  
[kaggle.com: Grocery website data for AB test](https://www.kaggle.com/datasets/tklimonova/grocery-website-data-for-ab-test)  

The goal is to perform an A/B test to expand my professional portfolio. The title of the dataset on kaggle.com suggests that this is also its intended use.  

A/B testing is a user experience research methodology. A/B tests consist of a randomized experiment that usually involves two variants, although the concept can also be extended to multiple variants of the same variable. It involves the application of statistical hypothesis testing, or "two-sample hypothesis testing" as it is called in the field of statistics. A/B testing is a way to compare multiple versions of a single variable, for example by testing a subject's response to variant A against variant B, and determining which of the variants is more effective.  

One way to test whether the proportion of positive outcomes differs between the levels of a categorical variable, i.e. whether the groups follow different underlying distributions, is the $\chi^2$ test of independence.  
[R. L. Ott, M. Longnecker, An Introduction to Statistical Methods & Data Analysis, 7th Edition, 2016, p. 508 ff.](https://www.filepicker.io/api/file/GdeoWEcATWi1pXV1DlPe#page=514)
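
For reference, the statistic behind this test can be written out as follows. This is the standard textbook form of Pearson's statistic for a contingency table with $r$ rows and $c$ columns; the symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count under independence), $R_i$ (row total), $C_j$ (column total) and $n$ (sample size) are notation introduced here for illustration.

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{n},
\qquad \mathrm{df} = (r - 1)(c - 1)
$$

The larger the statistic, the further the observed counts are from what independence would predict.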

In [1]:
# import section ------------------------------------------------------------------------------------------------------------- #
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


In [2]:
# loading and exploring of the data ------------------------------------------------------------------------------------------ #
# loading the csv-file and saving it to a Pandas DataFrame for ease of use --------------------------------------------------- #
all_data_df = pd.read_csv("grocerywebsiteabtestdata.csv")
all_data_df = all_data_df.set_axis(["record_id", "ip_address", "logged_id", "server_id", "loyalty_program"], axis=1)

all_data_df.head(10)

,record_id,ip_address,logged_id,server_id,loyalty_program
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0
5,6,23.5.199.2,1,3,0
6,7,195.12.126.2,1,1,0
7,8,97.6.126.6,0,3,1
8,9,93.10.165.4,1,1,0
9,10,180.3.76.4,1,1,0


The structure of the dataset and its description on kaggle suggest the following meaning for each column.

**RecordID:**  
 - unique identifier of the row of data (can be dropped, as it serves the same function as the index)
 
**IP Address:**
 - numerical Internet Protocol label assigned to the user visiting the website
 
**LoggedInFlag:**
 - whether or not the user logged into an account (renamed to logged_id above)
 
**ServerID:**
 - which server the user was routed through (probably splitting the users into test groups)
 
**Loyalty Program:**
 - whether or not the user clicked on the loyalty program page (probably the subject of the test)

**What to test?:**  
As no instructions on how to perform the A/B test are provided with the dataset, two variables could plausibly be tested against the loyalty program: “LoggedInFlag” and “ServerID”. Both would make sense, as there could be two ways to sign up for the loyalty program: an analog one in the store and an online one that requires being logged in.
I chose the traditional form of an A/B test and assumed that each server hosts a different version of the website, and that the goal of the test was to find out whether one of the designs makes people more likely to sign up for the loyalty program.

**Can IP-Addresses be used as a unique identifier for each user?**  
The first question regarding the IP addresses is whether each row represents an individual access or summarizes the behavior of an address.
If the rows are individual accesses, the next question is whether repeated accesses from the same address were always routed via the same server.


In [3]:
# checking if an ip-address accessed the page more than once ----------------------------------------------------------------- #

print("")
print("Number of data points:")
print(all_data_df["record_id"].nunique())

print("")
print("Number of unique IP-Addresses:")
print(all_data_df["ip_address"].nunique())



Number of data points:
184588

Number of unique IP-Addresses:
99516


In [4]:
# checking if an IP was routed via the same server in case of repeated access ------------------------------------------------ #
# counting the distinct server_id's seen per ip_address ---------------------------------------------------------------------- #
servers_per_ip = all_data_df.groupby("ip_address")["server_id"].nunique()

# every ip_address should map to exactly one server_id ----------------------------------------------------------------------- #
if (servers_per_ip == 1).all():
    print("IP's were routed via the same server in case of repeated access.")
else:
    print("At least one IP was routed via more than one server.")

IP's were routed via the same server in case of repeated access.


**Assumptions for a Chi-Square Test:**  
[Mary L. McHugh, 2013 Jun, The Chi-square test of independence, Biochem Med (Zagreb), 23 (2), 143-149](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/#:~:text=The%20assumptions%20of%20the%20Chi,the%20variables%20are%20mutually%20exclusive)

1. [x] The data in the cells should be counts of cases.
2. [x] The categories of the variables are mutually exclusive.
3. [x] Each subject may contribute data to one and only one cell in the χ2. (Repeated visits are aggregated to one row per IP address before testing, and each IP address was routed via a single server.)
4. [x] The population sample must be random. (It is assumed that the majority of visits are independent; checking the "neighborhood" of the IP addresses would bloat this exercise beyond reason.)
5. [x] There are 2 variables, and both are measured as categories. (split of server_id's not necessary)
6. [x] Expected values are at least 5. ( (row total * column total) / sample size >= 5 )
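
Assumption 6 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it copies the per-server counts from the contingency table shown later in this notebook and uses `scipy.stats.contingency.expected_freq` to compute (row total * column total) / sample size for every cell. The variable names `observed`, `expected` and `smallest_expected` are chosen here for illustration.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts: rows are server_id 1, 2, 3; columns are loyalty_program 0, 1.
# The numbers are copied from the contingency table shown later in this notebook.
observed = np.array([
    [29382, 3847],  # server_id 1
    [30054, 3051],  # server_id 2
    [30102, 3080],  # server_id 3
])

# Expected count per cell under independence: row total * column total / n
expected = expected_freq(observed)

# Assumption 6 holds if the smallest expected count is at least 5
smallest_expected = expected.min()
print(smallest_expected >= 5)
```

With tens of thousands of visitors per server, every expected count is far above the threshold of 5.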

**Hypothesis:**  
 - H0: The two variables (server_id, loyalty_program) are independent of each other.
 - H1: The two variables (server_id, loyalty_program) are dependent on each other.
 - Level of significance, alpha = 0.05
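
To make the decision rule concrete, the critical value that belongs to alpha = 0.05 can be looked up from the $\chi^2$ distribution. This is a small sketch under the assumption of a 2x2 table (two servers, two outcomes), which gives one degree of freedom; the names `alpha`, `dof` and `critical_value` are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05  # level of significance from the hypothesis above
dof = 1       # (2 - 1) * (2 - 1), assuming a 2x2 contingency table

# H0 is rejected when the test statistic exceeds this value,
# which is equivalent to the p-value being smaller than alpha
critical_value = chi2.ppf(1 - alpha, dof)
print(round(critical_value, 3))  # 3.841
```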

In [5]:
# aggregating repeated accesses to one row per ip_address: any login, first server_id, any loyalty click --------------------- #
data_df = all_data_df.groupby(["ip_address"], as_index=False).agg({"logged_id": "max", "server_id": "first", "loyalty_program": "max"})


data_df

,ip_address,logged_id,server_id,loyalty_program
0,0.0.108.2,0,1,0
1,0.0.109.6,1,1,0
2,0.0.111.8,0,3,0
3,0.0.160.9,1,2,0
4,0.0.163.1,0,2,0
...,...,...,...,...
99511,99.9.53.7,1,2,0
99512,99.9.65.2,0,2,0
99513,99.9.79.6,1,2,0
99514,99.9.86.3,0,1,1
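
What the aggregation does to repeated visits is easiest to see on a tiny example. The frame below is hypothetical data made up for illustration (IP "a" visits three times, IP "b" once); only the column names and the `agg` mapping are taken from the cell above.

```python
import pandas as pd

# Hypothetical mini log, not taken from the dataset
toy_df = pd.DataFrame({
    "ip_address":      ["a", "a", "b", "a"],
    "logged_id":       [0, 1, 0, 0],
    "server_id":       [2, 2, 1, 2],
    "loyalty_program": [0, 0, 0, 1],
})

# Same aggregation as above: "max" turns a 0/1 flag into "at least once",
# "first" keeps the server of the first recorded visit
toy_agg = toy_df.groupby("ip_address", as_index=False).agg(
    {"logged_id": "max", "server_id": "first", "loyalty_program": "max"}
)
print(toy_agg)
```

IP "a" ends up as a single row with `logged_id` 1 and `loyalty_program` 1, because it logged in and clicked on the loyalty program page in at least one of its visits.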


In [6]:
# df's with server_id 1,2 ---------------------------------------------------------------------------------------------------- #
data_12_df = data_df[data_df["server_id"].isin([1, 2])]

# contingency table for "loyalty_program" ------------------------------------------------------------------------------------ #
xtab_12 = pd.crosstab(data_12_df.server_id, data_12_df.loyalty_program)

xtab_12

loyalty_program,0,1
server_id,,
1,29382,3847
2,30054,3051


In [7]:
# df's with server_id 2,3 ---------------------------------------------------------------------------------------------------- #
data_23_df = data_df[data_df["server_id"].isin([2, 3])]

# contingency table for "loyalty_program" ------------------------------------------------------------------------------------ #
xtab_23 = pd.crosstab(data_23_df.server_id, data_23_df.loyalty_program)

xtab_23

loyalty_program,0,1
server_id,,
2,30054,3051
3,30102,3080


In [8]:
# df's with server_id 1,3 ---------------------------------------------------------------------------------------------------- #
data_13_df = data_df[data_df["server_id"].isin([1, 3])]

# contingency table for "loyalty_program" ------------------------------------------------------------------------------------ #
xtab_13 = pd.crosstab(data_13_df.server_id, data_13_df.loyalty_program)

xtab_13

loyalty_program,0,1
server_id,,
1,29382,3847
3,30102,3080


In [9]:
# chi2_contingency table ----------------------------------------------------------------------------------------------------- #
# H0: no relationship exists between the categorical variables in the population; they are independent ----------------------- #
# p-value: probability of obtaining test results at least as extreme as the ones observed, under the assumption that there --- #
# is no relationship between the variables (server_id, loyalty_program) ------------------------------------------------------ #
# note: for 2x2 tables (dof = 1) chi2_contingency applies Yates' continuity correction by default ---------------------------- #
chi2, pval_12, dof, expected = chi2_contingency(xtab_12)
print("")
print("Chi2:")
print(chi2)
print("")
print("P-value:")
print(pval_12)
print("")
print("Degrees of freedom:")
print(dof)
print("")
print("Expected:")
print(expected)


Chi2:
98.96815681716856

P-value:
2.5659464632786813e-23

Degrees of freedom:
1

Expected:
[[29773.55268791  3455.44731209]
 [29662.44731209  3442.55268791]]
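
One detail behind the statistic above: for tables with one degree of freedom, `chi2_contingency` applies Yates' continuity correction unless `correction=False` is passed. The sketch below reruns the test for servers 1 and 2 both ways, with the counts copied from the contingency table above; the names `observed_12`, `chi2_corrected` and `chi2_plain` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts for server_id 1 and 2, copied from the contingency table above
observed_12 = np.array([
    [29382, 3847],  # server_id 1
    [30054, 3051],  # server_id 2
])

# Default call: Yates' continuity correction is applied (dof == 1)
chi2_corrected, p_corrected, _, _ = chi2_contingency(observed_12)

# Uncorrected Pearson statistic for comparison
chi2_plain, p_plain, _, _ = chi2_contingency(observed_12, correction=False)

print(chi2_corrected, chi2_plain)
```

The correction shrinks the statistic slightly. With samples of this size the difference is negligible and does not change the decision.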


In [10]:
# chi2_contingency table ----------------------------------------------------------------------------------------------------- #
# H0: no relationship exists between the categorical variables in the population; they are independent ----------------------- #
# p-value: probability of obtaining test results at least as extreme as the ones observed, under the assumption that there --- #
# is no relationship between the variables (server_id, loyalty_program) ------------------------------------------------------ #
chi2, pval_23, dof, expected = chi2_contingency(xtab_23)
print("")
print("Chi2:")
print(chi2)
print("")
print("P-value:")
print(pval_23)
print("")
print("Degrees of freedom:")
print(dof)
print("")
print("Expected:")
print(expected)


Chi2:
0.07834334803362858

P-value:
0.7795551344810971

Degrees of freedom:
1

Expected:
[[30043.06093201  3061.93906799]
 [30112.93906799  3069.06093201]]


In [11]:
# chi2_contingency table ----------------------------------------------------------------------------------------------------- #
# H0: no relationship exists between the categorical variables in the population; they are independent ----------------------- #
# p-value: probability of obtaining test results at least as extreme as the ones observed, under the assumption that there --- #
# is no relationship between the variables (server_id, loyalty_program) ------------------------------------------------------ #
chi2, pval_13, dof, expected = chi2_contingency(xtab_13)
print("")
print("Chi2:")
print(chi2)
print("")
print("P-value:")
print(pval_13)
print("")
print("Degrees of freedom:")
print(dof)
print("")
print("Expected:")
print(expected)


Chi2:
93.36318595747174

P-value:
4.352720309635105e-22

Degrees of freedom:
1

Expected:
[[29763.04883227  3465.95116773]
 [29720.95116773  3461.04883227]]


**Server_ID 1 and 2**  
**P-value:** 2.5659464632786813e-23  
p < 0.05 => reject H0  
=> The two variables (server_id (1 and 2), loyalty_program) are dependent on each other.  

**Server_ID 2 and 3**  
**P-value:** 0.7795551344810971  
p > 0.05 => fail to reject H0  
=> There is no evidence that the two variables (server_id (2 and 3), loyalty_program) depend on each other.  

**Server_ID 1 and 3**  
**P-value:** 4.352720309635105e-22  
p < 0.05 => reject H0  
=> The two variables (server_id (1 and 3), loyalty_program) are dependent on each other.  
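
Because three pairwise tests are run at the same alpha, the chance of at least one false positive is higher than 0.05. A simple safeguard, sketched below, is the Bonferroni adjustment, which is an addition to the original analysis. The p-values are copied from the results above; the names `p_values` and `adjusted` are illustrative.

```python
# P-values of the three pairwise tests, copied from the results above
p_values = {
    "1 vs 2": 2.5659464632786813e-23,
    "2 vs 3": 0.7795551344810971,
    "1 vs 3": 4.352720309635105e-22,
}
alpha = 0.05
n_tests = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
adjusted = {pair: min(p * n_tests, 1.0) for pair, p in p_values.items()}

for pair, p_adj in adjusted.items():
    decision = "reject H0" if p_adj < alpha else "fail to reject H0"
    print(pair, p_adj, decision)
```

The decisions stay the same after the adjustment, since the two significant p-values are many orders of magnitude below alpha.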


In [12]:
# contingency table for "loyalty_program" ------------------------------------------------------------------------------------ #
xtab = pd.crosstab(data_df.server_id ,data_df.loyalty_program)

xtab

loyalty_program,0,1
server_id,,
1,29382,3847
2,30054,3051
3,30102,3080
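
As a complement to the pairwise comparisons, the full 3x2 table can also be tested in one step. The following sketch is an addition to the original analysis and uses the counts from the table above; the names `observed_all`, `chi2_all`, `p_all` and `dof_all` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts for all three servers, copied from the contingency table above
observed_all = np.array([
    [29382, 3847],  # server_id 1
    [30054, 3051],  # server_id 2
    [30102, 3080],  # server_id 3
])

# One overall test of independence; dof = (3 - 1) * (2 - 1) = 2,
# so no continuity correction is applied here
chi2_all, p_all, dof_all, expected_all = chi2_contingency(observed_all)
print(chi2_all, p_all, dof_all)
```

A significant overall result only says that at least one server differs. The pairwise tests are still needed to see which one.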


In [13]:
# calculating the share of IPs that opted for the loyalty program for each server_id, derived from xtab ---------------------- #
traffic_per_ip = xtab.sum(axis=1)
loyalty_ips = xtab[1]

sub_percs_df = pd.DataFrame({
    "traffic_per_ip": traffic_per_ip,
    "loyalty_ips": loyalty_ips,
    "loyalty_percs": loyalty_ips / traffic_per_ip
})
sub_percs_df.index = ["server_id_1", "server_id_2", "server_id_3"]

# difference between server 1 and the mean of servers 2 and 3 ---------------------------------------------------------------- #
percs_diff = sub_percs_df.loc["server_id_1", "loyalty_percs"] - sub_percs_df.loc[["server_id_2", "server_id_3"], "loyalty_percs"].mean()
percs_str = round(percs_diff * 100, 2)

sub_percs_df

,traffic_per_ip,loyalty_ips,loyalty_percs
server_id_1,33229,3847,0.115772
server_id_2,33105,3051,0.092161
server_id_3,33182,3080,0.092821


In [14]:
print("Server_id_1 had a loyalty program rate " + str(percs_str) + " percentage points higher \nthan the average of the other two servers; this difference is almost certainly not due to chance.")

Server_id_1 had a loyalty program rate 2.33 percentage points higher 
than the average of the other two servers; this difference is almost certainly not due to chance.
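
The 2.33 figure is an absolute difference between rates (percentage points). A relative comparison can be just as informative, so the sketch below computes both from the counts in the table above. The names `visitors`, `loyalty`, `baseline`, `absolute_lift_pp` and `relative_lift_pct` are illustrative and not part of the original analysis.

```python
# Counts per server_id, copied from the table above
visitors = {1: 33229, 2: 33105, 3: 33182}
loyalty = {1: 3847, 2: 3051, 3: 3080}

# Share of IPs per server that opted for the loyalty program
rates = {server: loyalty[server] / visitors[server] for server in visitors}

# Baseline: mean rate of servers 2 and 3, as in the calculation above
baseline = (rates[2] + rates[3]) / 2

# Absolute difference in percentage points and relative lift in percent
absolute_lift_pp = (rates[1] - baseline) * 100
relative_lift_pct = (rates[1] / baseline - 1) * 100

print(round(absolute_lift_pp, 2))  # 2.33
print(round(relative_lift_pct, 1))
```

Relative to the baseline, the rate on server 1 is roughly a quarter higher, which puts the absolute difference into perspective.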
