
# What is **A/B Testing?**

A/B testing is a general methodology used online when testing product changes and new features. You take two sets of users, show **one set of users the changed product (experiment group)** and show the **second set of users the original product** or set of features **(control group)**. You then compare the two groups to determine which version of the product is better.

A/B Testing is important because it:

— Removes the need for guessing or relying on intuition

— Provides accurate answers quickly

— Allows for rapid iteration on ideas

— Establishes causal relationships, not just correlations.

A/B testing **works** best when testing incremental changes, such as UX changes, new features, ranking and page load times. Generally there needs to be a material number of users to enable statistical inference, and the experiment needs to be completable within a reasonable amount of time.

A/B testing **doesn’t work** well when testing major changes, like new products, new branding or completely new user experiences. In these cases there may be novelty effects that drive higher than normal engagement or emotional responses that cause users to resist a change initially. The re-branding of Instagram, which initially caused a major negative emotional response, is a great example of a bad use case for A/B testing. Additionally, running A/B tests on products that are purchased infrequently and have long lead times is a bad idea: waiting to measure the repurchase rate of customers buying houses is going to blow out experiment timelines.

Causation applies to cases where action A causes outcome B. Correlation, on the other hand, is simply a relationship between two variables that does not by itself show that one causes the other.

# **Designing an A/B Test**

# 1. Research

Before diving into any experiment, do some research. Research can help you design an experiment, ensure you use best practices and may even unearth results of similar experiments. There are several useful data sources that may help design or support an experiment:

**> External Data and Research**: the adage “if you’ve thought of it, someone’s probably tried it” is true for experiments; doing some research may reveal the insights you need or at least provide ideas for experiment design

**> Historical Data**: retrospective analysis is cheap and easy and can provide insight into historical correlations (but not causation!)

**> User Experience Research**: actually getting to see how a user interacts with a product can be invaluable for setting up and testing an experiment

**> Focus Groups**: while not accessible for most analysts, focus groups can be great for testing major changes and new products

**> Surveys**: surveys can be effective for gathering information that is not revealed online

# 2. Choose Metrics
Metric choice is, unsurprisingly, fundamental to successful A/B testing. Many common experiments have clear, best-practice metrics. However, every company faces its own nuances and priorities that should be considered when choosing metrics, so developing them should draw on experience, domain knowledge and exploratory data analysis.

**My experience is that metrics work best when cascaded top-down** from a company or business unit strategy, following the hierarchical structure of the organization. This way there is a clear link between the goals of the experiment and the goals of the company. This approach means experiments can anchor to one or two clear metrics (e.g. customer conversions or number of active users) that are simple to report and understandable to all stakeholders.

Having one or two key metrics is usually not enough though. Most decision makers will want to understand what has driven change, and so key metrics should be supported by more detailed metrics that help explain results. One framework that can provide this explainability is tracking metrics across the user journey or customer funnel. Having metrics across these frameworks helps clearly identify where changes have occurred and where additional experimentation and attention could be focused.

Four categories of metrics should be considered when designing an experiment, namely:

**— Sums and counts** — e.g. how many cookies visit a webpage

**— Distributed metrics such as mean, median and quartiles** — e.g. what is the mean page load time

**— Probability and rates** — e.g. clickthrough rates

**— Ratios** — e.g. ratio of the probability of a revenue generating click to the probability of any click
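
The four categories above can be illustrated with a minimal sketch. The page-view log and its field names (`cookie`, `load_ms`, `clicked`, `revenue_click`) are made up for illustration and are not part of the dataset used later.

```python
from statistics import mean, median

# Hypothetical page-view log: one dict per page view (all values are made up).
page_views = [
    {"cookie": "a", "load_ms": 120, "clicked": True,  "revenue_click": False},
    {"cookie": "b", "load_ms": 340, "clicked": False, "revenue_click": False},
    {"cookie": "c", "load_ms": 200, "clicked": True,  "revenue_click": True},
    {"cookie": "a", "load_ms": 180, "clicked": False, "revenue_click": False},
]

# 1. Sums and counts: how many distinct cookies visited the page
unique_cookies = len({view["cookie"] for view in page_views})

# 2. Distribution metrics: mean and median page load time
load_times = [view["load_ms"] for view in page_views]
mean_load_ms = mean(load_times)
median_load_ms = median(load_times)

# 3. Probabilities and rates: click-through rate per page view
click_rate = sum(view["clicked"] for view in page_views) / len(page_views)

# 4. Ratios: P(revenue-generating click) / P(any click)
revenue_click_rate = sum(view["revenue_click"] for view in page_views) / len(page_views)
revenue_to_click_ratio = revenue_click_rate / click_rate

print(unique_cookies, mean_load_ms, median_load_ms, click_rate, revenue_to_click_ratio)
```

Note that the count in step 1 depends on the unit you choose (cookies here, page views in step 3), which is one reason the next section insists on defining each metric precisely.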

# 3. Define Metrics
Once you have decided on a set of metrics, you then need to specifically define the metrics. This step seems simple on the surface until you begin to dig into your metrics. For example, the number of active users is a core metric used by most technology companies — but what does this actually mean? When defining this metric some considerations would include:

**What is the time period for a user to be active?** Is someone active who used the product today, in the last 7 days, in the last month?

**What activity constitutes active?** Is it logging in, spending a certain amount of time using the product? What if the activity is only an automatic notification?

**Does how active the user is matter**, e.g. one user spends 10 seconds per day vs another user who spends 5 hours per day?

**Are we measuring paying and non-paying users differently?**

Many metrics have industry defined standards which can be leveraged for experiments, but these questions need to be fleshed out upfront so that everyone is clear on exactly what is being considered and how to implement the experiment.
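
To make the questions above concrete, here is a minimal sketch of an "active users" definition with the time window and the qualifying activity exposed as explicit parameters. The event log, the event names and the `active_users` helper are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_date, event_type). All values are made up.
events = [
    (1, date(2024, 1, 30), "login"),
    (2, date(2024, 1, 10), "login"),
    (3, date(2024, 1, 29), "notification"),  # automatic, not initiated by the user
    (1, date(2024, 1, 31), "purchase"),
]

def active_users(events, as_of, window_days, qualifying_events):
    """Return the set of users with at least one qualifying event in the window."""
    window_start = as_of - timedelta(days=window_days - 1)
    return {
        user_id
        for user_id, event_date, event_type in events
        if window_start <= event_date <= as_of and event_type in qualifying_events
    }

as_of = date(2024, 1, 31)
user_actions = {"login", "purchase"}  # assumption: notifications do not count as activity

weekly_active = active_users(events, as_of, window_days=7, qualifying_events=user_actions)
monthly_active = active_users(events, as_of, window_days=30, qualifying_events=user_actions)

print(len(weekly_active), len(monthly_active))
```

The same log gives a different number of active users depending on the window and on whether notifications count, which is exactly why the definition has to be agreed before the experiment starts.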


# 4. Determine Sample Size Required

Next we want to figure out how many data points will be required to run an experiment. There are four test **parameters** that need to be set to enable the calculation of a suitable sample size:

**— Baseline rate** — an estimate of the metric being analyzed before making any changes

**— Practical significance level** — the minimum change to the baseline rate that is useful to the business, for example an increase in the conversion rate of 0.001% may not be worth the effort required to make the change whereas a 2% change will be

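**— Confidence level** — one minus the significance level (alpha), where the significance level is the probability that the null hypothesis (experiment and control are the same) is rejected when it shouldn’t be; a 95% confidence level corresponds to a significance level of 5%

**— Sensitivity** — also called statistical power, the probability that the null hypothesis is rejected when it should be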

The baseline rate can be estimated using historical data, the practical significance level will depend on what makes sense to the business and the **confidence level** and **sensitivity** are generally set at **95%** and **80%** respectively but can be adjusted to suit different experiments or business needs.

Once these are set, the sample size required can be calculated statistically using a calculator such as this.
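
As a rough illustration of what such a calculator does, the sketch below uses a common normal-approximation formula for comparing two proportions. It is one of several formulas in use, so its result can differ slightly from any particular calculator or library. The function name and the input values are assumptions for illustration; `alpha` is one minus the confidence level and `power` is the sensitivity.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(baseline, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline + min_effect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # critical value for the sensitivity
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / min_effect ** 2)

# Illustrative inputs: 12% baseline rate, 1 percentage point minimum detectable change
n_per_group = sample_size_per_group(baseline=0.12, min_effect=0.01)
print(n_per_group)
```

Halving the practical significance level roughly quadruples the required sample, so this parameter usually has the biggest influence on how long an experiment needs to run.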

# 5. Finalize Experiment Design

Armed with well-defined metrics and sample size requirements, the final step in experiment design is to consider the practical elements of conducting an experiment. Some things to decide at this step are:

**— When the experiment should run** (e.g. every day or only business days) and how long the experiment should run (e.g. 1 day or 1 week)

**— How many users to expose to the experiment per day**, where exposing fewer users means longer experiment time periods

**— How to account for the learning effect** i.e. allow for users to learn the changes before measuring the impact of a change

**— What tools will be used to capture data**, such as Google Analytics
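
The trade-off between exposure and duration is simple arithmetic, sketched below. The helper name and all input numbers are hypothetical.

```python
from math import ceil

def experiment_days(sample_size_per_group, daily_users, exposure_fraction, groups=2):
    """Days needed to collect the required sample across all groups."""
    users_needed = sample_size_per_group * groups
    exposed_per_day = daily_users * exposure_fraction
    return ceil(users_needed / exposed_per_day)

# Hypothetical inputs: 17,000 users needed per group, 10,000 visitors per day
days_at_full_exposure = experiment_days(17_000, 10_000, exposure_fraction=1.0)
days_at_20_percent = experiment_days(17_000, 10_000, exposure_fraction=0.2)

print(days_at_full_exposure, days_at_20_percent)
```

In practice you would usually round the result up to whole weeks so that weekday and weekend behaviour are both represented.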

Before running any experiments you should also conduct some tests to make sure there are no bugs that will throw off experiment results, such as issues with certain browsers, devices or operating systems. It is also very important to ensure users are selected randomly when being assigned into the control and experiment groups.
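
One common way to get an assignment that behaves randomly but stays stable for each user is to hash the user ID. The sketch below is an illustration of that idea, not the method used to produce the dataset analysed later; the function name and the experiment name are hypothetical.

```python
import hashlib

def assign_group(user_id, experiment_name, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits of the hash to a number in [0, 1)
    bucket = int(digest[:8], 16) / 16 ** 8
    return "treatment" if bucket < treatment_share else "control"

# "landing_page_test" is a hypothetical experiment name
assignments = [assign_group(user_id, "landing_page_test") for user_id in range(10_000)]
treatment_fraction = assignments.count("treatment") / len(assignments)

print(round(treatment_fraction, 3))
```

Including the experiment name in the hash means the same user can land in different groups across different experiments, while always seeing the same version within one experiment.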


## **Implementing an A/B Test using Python**
The real difficulty in conducting A/B tests is designing the experiment and gathering the data. Once you have done these steps the analysis itself follows standard statistical significance tests. To demonstrate how to conduct an A/B test I’ve downloaded a public dataset from Kaggle available here, which tests the conversion rates of two groups (control and treatment) exposed to different landing pages. The dataset comprises roughly 294,000 rows. The Python code outlined below can be found on my Github here.

For this example, I’ll test the null hypothesis that the probability of conversion in the treatment group minus the probability of conversion in the control group equals zero. The alternative hypothesis is that the probability of conversion in the treatment group minus the probability of conversion in the control group does not equal zero.

# 1. Clean Data
As with any analysis, the first step is to clean up the data that has been collected. This step is important to ensure that the data aligns with the metric definitions determined during the experiment design.
You should start by running some sanity checks to make sure the experiment results reflect the initial design, e.g. did roughly an equal number of users see the old and new landing page, and are the conversion rates plausible?
I found one issue in this dataset, that some users in both groups were exposed to the wrong landing page. The control group should have been exposed to the old landing page while the treatment group should have been exposed to the new page. I dropped the almost 4,000 rows where this issue was present.

In [1]:
import numpy as np
import pandas as pd
import math
# data = pd.read_csv("/kaggle/input/ab-testing/ab_data.csv")
data = pd.read_csv("ab_data.csv")

In [2]:
data.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


# **Clean Data**

In [4]:
data.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [5]:
data.columns

Index(['user_id', 'timestamp', 'group', 'landing_page', 'converted'], dtype='object')

In [6]:
data['converted'].value_counts()

0    259241
1     35237
Name: converted, dtype: int64

In [7]:
data['group'].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [8]:
data['landing_page'].value_counts()

old_page    147239
new_page    147239
Name: landing_page, dtype: int64

In [9]:
data.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


*Some users in both groups were exposed to the wrong landing page. The control group should have been exposed to the old landing page while the treatment group should have been exposed to the new page. I dropped the almost 4,000 rows where this issue was present.*

**Control group users who were shown the new landing page**

In [10]:
mask = (data["group"] == "control") & (data["landing_page"] == "new_page")
mask.value_counts()

False    292550
True       1928
dtype: int64

In [11]:
datadrop = data[mask].index
datadrop.value_counts()

235519    1
25882     1
140628    1
40279     1
173400    1
         ..
129748    1
29397     1
209625    1
76506     1
184320    1
Length: 1928, dtype: int64

In [12]:
data = data.drop(datadrop)
# 294478 - 1928 = 292550 rows remain

**Treatment group users who were shown the old landing page**

In [13]:
mask2 = (data['group'] == 'treatment') & (data['landing_page'] == 'old_page')
index_mask2 = data[mask2].index
data = data.drop(index_mask2)

In [14]:
data['group'].value_counts()

treatment    145311
control      145274
Name: group, dtype: int64
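
As an alternative sketch, the two clean-up steps above can be expressed as a single filter, with a cross-tabulation to show the mismatches at a glance. The small `sample` frame below is made up so the example runs on its own; it only mirrors the column names of the dataset.

```python
import pandas as pd

# Small made-up sample with the same columns as the dataset
sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "group": ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 1],
})

# Cross-tabulate group against page: off-diagonal cells are mismatched rows
mismatch_table = pd.crosstab(sample["group"], sample["landing_page"])
print(mismatch_table)

# Keep only rows where each group saw the page it was meant to see
expected_page = sample["group"].map({"control": "old_page", "treatment": "new_page"})
cleaned = sample[sample["landing_page"] == expected_page]
print(cleaned)
```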

In [15]:
#Check how many duplicated users exist, if there is any
print(data["user_id"].count())
print(data["user_id"].nunique())

290585
290584


In [16]:
#drop duplicated users
data.drop_duplicates(subset ='user_id',keep ='first',inplace = True)

In [17]:
data.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


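What is count and sum?

The count of a dataset is the number of observations in the dataset. This applies to all variables.

The sum of a dataset is the addition of all values in the dataset. Because the converted column is coded as 0 or 1, its count is the number of users and its sum is the number of users who converted.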

In [18]:
data['converted'].count()

290584

In [19]:
# Count the users and the conversions in each group
# FOR CONTROL
control_only = (data["group"] == "control")    # boolean mask for the control group
conversions_control = data["converted"][control_only].sum()   # converted is 0/1, so the sum is the number of conversions
total_users_control = data["converted"][control_only].count()  # number of users in the control group
print(f"conversions_control is {conversions_control}")
print(f"total_users_control is {total_users_control}")
print()
# FOR TREATMENT
treatment_only = (data["group"] == "treatment")  # boolean mask for the treatment group
conversions_treatment = data["converted"][treatment_only].sum() # number of conversions in the treatment group
total_users_treatment = data["converted"][treatment_only].count() # number of users in the treatment group
print(f"conversions_treatment is {conversions_treatment}")
print(f"total_users_treatment is {total_users_treatment}")

conversions_control is 17489
total_users_control is 145274

conversions_treatment is 17264
total_users_treatment is 145310


In [20]:
# Originally:
# control users ==> old page and treatment users ==> new page
print("Split of control users who saw old page vs treatment users who saw new page: ", 
          round(total_users_control / data["converted"].count() * 100, 2), "% ",            # 145274 / 290584
          round((total_users_treatment / data["converted"].count()) * 100, 2), "%")         # 145310 / 290584


Split of control users who saw old page vs treatment users who saw new page:  49.99 %  50.01 %
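
The split looks balanced by eye. A more formal check, sketched below, is a chi-square goodness-of-fit test against the intended 50/50 split (often called a sample ratio mismatch check). This check is an addition for illustration and the helper name is hypothetical; the group sizes are the ones printed above.

```python
from math import erfc, sqrt

def sample_ratio_p_value(n_control, n_treatment, expected_control_share=0.5):
    """p-value of a chi-square goodness-of-fit test on the two group sizes."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi_square = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom
    return erfc(sqrt(chi_square / 2))

# Group sizes after cleaning, taken from the output above
srm_p_value = sample_ratio_p_value(145274, 145310)
print(round(srm_p_value, 3))
```

A very small p-value here would suggest the assignment or logging is broken and that the experiment results should not be trusted until the cause is found.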


In [21]:
# count number of users who converted in each group
print("Number of control users who converted on old page: ", conversions_control)
print("Percentage of control users who converted: ", round((conversions_control / total_users_control) * 100, 2), "%")

#mask = (df["group"] == "treatment")
print("Number of treatment users who converted on new page: ", conversions_treatment)
print("Percentage of treatment users who converted: ", round((conversions_treatment/ total_users_treatment) * 100, 2), "%")

Number of control users who converted on old page:  17489
Percentage of control users who converted:  12.04 %
Number of treatment users who converted on new page:  17264
Percentage of treatment users who converted:  11.88 %


# **Set Test Parameters**

# 2. Input Test Parameters and Check Sample Size is Large Enough
As mentioned in “Designing an A/B Test, Section 4: Determine Sample Size Required”, there are four test parameters to set:

**the baseline rate, practical significance level, confidence level and sensitivity**.
I used the **control group probability** as a proxy for the **baseline rate** and set the practical significance level, confidence level and sensitivity to 1%, 95% and 80% respectively.

Using these values I calculated the minimum sample size required for each test group to make sure there was sufficient data to draw statistically robust conclusions.

*What is the null hypothesis for a confidence interval?
Typically our null hypothesized value will be 0 (the point of no difference). If 0 falls inside the confidence interval, the data are consistent with there being no difference between the groups, which is typically the opposite of what we want to show.*

In [22]:
import math
import statsmodels.stats.api as sm
import scipy.stats as st



In [23]:
baseline = conversions_control / total_users_control
practical_significance = 0.01 # user defined
confidence_level = 0.05 # user defined; this is the significance level (alpha), i.e. 1 minus the 95% confidence level
sensitivity = 0.8 # user defined; by default we know it's normally 80%.

effect_size = sm.proportion_effectsize(baseline, baseline + practical_significance)
sample_size = sm.NormalIndPower().solve_power(effect_size = effect_size, power = sensitivity, 
                                               alpha = confidence_level, ratio=1.0)

print("Required sample size: ", round(sample_size), " per group")

Required sample size:  17209  per group


***The minimum sample size of 17,209 per group is well below the 145,274 users in the control group and 145,310 users in the treatment group, so there is sufficient data to conduct the hypothesis testing.***

# 3. Run A/B Test
Finally, we can run the A/B Test, or more specifically, test whether we can reject the null hypothesis (that the two groups have the same conversion rate) with 95% confidence. To do this we calculate a pooled probability and pooled standard error, a margin of error and the upper and lower bounds of the confidence interval. If these terms are unfamiliar, it’s worth working through the modules in the Udacity course about A/B Testing.
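
Written out, the quantities named above are as follows. The notation is chosen here for illustration: $X_c$ and $X_t$ are the numbers of conversions in the control and treatment groups, $N_c$ and $N_t$ are the numbers of users, and $\alpha$ is the significance level (0.05 for 95% confidence).

```latex
\begin{aligned}
\hat{p}_{\text{pool}} &= \frac{X_c + X_t}{N_c + N_t} \\
SE_{\text{pool}} &= \sqrt{\hat{p}_{\text{pool}}\,(1 - \hat{p}_{\text{pool}})\left(\frac{1}{N_c} + \frac{1}{N_t}\right)} \\
\hat{d} &= \frac{X_t}{N_t} - \frac{X_c}{N_c} \\
m &= z_{1-\alpha/2}\; SE_{\text{pool}} \\
\text{CI} &= \left[\hat{d} - m,\ \hat{d} + m\right]
\end{aligned}
```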

In [24]:
#Calculate pooled probability (from above...already)
# FOR CONTROL
control_only = (data["group"] == "control")    # for only control group
conversions_control = data["converted"][control_only].sum()   # converted 1 or not 0.. addition of count
total_users_control = data["converted"][control_only].count()  # converted = 0/1.. total count
# FOR TREATMENT
treatment_only = (data["group"] == "treatment")  # for only treatment group
conversions_treatment = data["converted"][treatment_only].sum() # addition of all counts
total_users_treatment = data["converted"][treatment_only].count() # total of all counts


# probability
prob_pooled = (conversions_control + conversions_treatment) / (total_users_control + total_users_treatment)

In [25]:
#Calculate pooled standard error and margin of error
se_pooled = math.sqrt(prob_pooled * (1 - prob_pooled) * (1 / total_users_control + 1 / total_users_treatment))
z_score = st.norm.ppf(1 - confidence_level / 2)
margin_of_error = se_pooled * z_score

#Calculate dhat, the estimated difference between probability of conversions in the experiment(treatment) and control groups
d_hat = (conversions_treatment / total_users_treatment) - (conversions_control / total_users_control)

#Test if we can reject the null hypothesis
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error

In [26]:
if practical_significance > lower_bound: 
    print("Do not/Failed to reject the null hypothesis")
else: 
    print("Reject null hypothesis")
    
print("The lower bound of the confidence interval is ", round(lower_bound * 100, 2), "%")
print("The upper bound of the confidence interval is ", round(upper_bound * 100, 2), "%")

Do not/Failed to reject the null hypothesis
The lower bound of the confidence interval is  -0.39 %
The upper bound of the confidence interval is  0.08 %


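In this example the confidence interval is between -0.39% and 0.08%. Because the interval contains zero, the difference between the two groups is not statistically significant at the 95% confidence level. The check above is stricter still: given that the minimum level defined in the practical significance level is 1%, it only rejects the null hypothesis if the lower bound of the confidence interval is above 1%. Either way, we cannot reject the null hypothesis, so we cannot conclude which landing page drives more conversions.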
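
As a cross-check on the interval, the same pooled standard error can be used to compute a z statistic and a two-sided p-value. This is a sketch added for illustration using only the standard library; the counts are the ones printed for the cleaned dataset above.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the cleaned dataset above
conversions_control, total_users_control = 17489, 145274
conversions_treatment, total_users_treatment = 17264, 145310

rate_control = conversions_control / total_users_control
rate_treatment = conversions_treatment / total_users_treatment
pooled_rate = (conversions_control + conversions_treatment) / (
    total_users_control + total_users_treatment
)
pooled_se = sqrt(
    pooled_rate * (1 - pooled_rate)
    * (1 / total_users_control + 1 / total_users_treatment)
)

# Two-sided test of the null hypothesis that the difference is zero
z_statistic = (rate_treatment - rate_control) / pooled_se
p_value = 2 * (1 - NormalDist().cdf(abs(z_statistic)))

print(round(z_statistic, 2), round(p_value, 3))
```

A p-value above 0.05 agrees with the confidence interval containing zero: both say the observed difference could plausibly be due to chance.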

# 4. Reducing random chance and managing small sample sizes
The above approach should provide a starting point for conducting A/B testing in Python. Be mindful, however, that as you run more experiments, including experiments in parallel, a test at the 95% confidence level will sometimes reject the null hypothesis due to random chance alone (about one time in twenty when there is no real difference). Another issue arises when you are unable to gather sufficient data points, such as on a low-traffic website.
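
To see how quickly chance rejections accumulate, the sketch below computes the probability of at least one false rejection across several tests, assuming the tests are independent. It also shows the Bonferroni adjustment, a standard correction that is not used elsewhere in this notebook and is included only as an illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

# Illustrative scenario: 20 experiments, each tested at the 5% significance level
chance_of_false_positive = family_wise_error_rate(alpha=0.05, n_tests=20)
adjusted_alpha = bonferroni_alpha(alpha=0.05, n_tests=20)

print(round(chance_of_false_positive, 3), adjusted_alpha)
```

The stricter per-test threshold reduces false rejections at the cost of sensitivity, so each experiment then needs a larger sample to detect the same change.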

One approach to these problems is bootstrapping, a technique that samples the original dataset with replacement to create new datasets. It assumes that the original dataset is a fairly good reflection of the population as a whole, so sampling with replacement roughly simulates random sampling from the population.

By bootstrapping the data, you can run the same test many times and see how stable the result is, which helps you judge whether a rejection of the null hypothesis was down to random chance. It can also help you estimate the variability of a metric when the sample is small, although resampling does not add new information and is not a substitute for collecting enough data.
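
The resampling idea can be sketched as a percentile bootstrap interval for the difference in conversion rates. The 0/1 outcomes below are synthetic, and the function name, the number of resamples and the seed are arbitrary choices made for illustration.

```python
import random

def bootstrap_diff_ci(control, treatment, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for treatment rate minus control rate."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes
        control_sample = rng.choices(control, k=len(control))
        treatment_sample = rng.choices(treatment, k=len(treatment))
        diffs.append(
            sum(treatment_sample) / len(treatment_sample)
            - sum(control_sample) / len(control_sample)
        )
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Synthetic 0/1 conversion outcomes for a low-traffic site (made-up data)
control = [1] * 60 + [0] * 440      # 12.0% conversion
treatment = [1] * 70 + [0] * 430    # 14.0% conversion

lower, upper = bootstrap_diff_ci(control, treatment)
print(round(lower, 3), round(upper, 3))
```

If the resulting interval contains zero, the resampled data do not support a difference between the groups, mirroring the confidence interval check used earlier.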

If you’d like to expand your understanding of A/B testing further, see my blog: How to run better and more intuitive A/B tests using Bayesian statistics.