## Introduction

A/B testing is a methodology for testing product changes. You split your users into two groups: a control group, which sees the existing version of the feature, and an experiment group, which sees the new version.

## Metrics 

Generally, a rate (such as click-through rate) is used when you want to measure the usability of the site, and a probability (such as click-through probability) when you want to measure the overall impact of a change.

## Binomial Distribution 

For a binomial proportion with success probability $p$, the mean of the observed proportion is $p$ and its standard deviation is $\sqrt{p(1-p)/N}$, where $N$ is the number of trials. A binomial distribution can be used when:

1. The outcomes are of 2 types
2. Each event is independent of the others
3. Each event has an identical distribution (i.e. $p$ is the same for all)
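
A small simulation can make the mean and standard deviation of a binomial proportion concrete. This is a minimal sketch using only the standard library; the click probability, sample size, and number of repetitions are hypothetical values chosen for illustration.

```python
import math
import random
import statistics

def simulate_proportions(p, n_trials, n_experiments, seed=0):
    """Run n_experiments binomial experiments and return each observed proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_experiments):
        # Each trial is an independent success/failure with the same p
        successes = sum(1 for _ in range(n_trials) if rng.random() < p)
        proportions.append(successes / n_trials)
    return proportions

p, n_trials = 0.1, 500  # hypothetical click probability and sample size
proportions = simulate_proportions(p, n_trials, n_experiments=2000)

analytical_sd = math.sqrt(p * (1 - p) / n_trials)
empirical_mean = statistics.mean(proportions)
empirical_sd = statistics.stdev(proportions)

print(f"mean: {empirical_mean:.4f} (expected {p})")
print(f"sd:   {empirical_sd:.4f} (expected {analytical_sd:.4f})")
```

The empirical values should land close to the analytical ones, with the gap shrinking as the number of repetitions grows.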

## Confidence Interval 

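A confidence interval is a range around the observed value that is expected to contain the true mean. If the experiment were repeated many times, a 95% confidence interval would be expected to cover the true mean in about 95% of the repetitions.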

For example, consider $\hat{p}$, the proportion of users that click, where $N$ is the number of users. The binomial distribution can be approximated by a normal distribution when $N\hat{p} > 5$ and $N(1-\hat{p}) > 5$. The margin of error is then given by

$$m = z \cdot SE = z \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$

For a 95% confidence interval, z = 1.96.
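
The margin-of-error formula above can be sketched as a small helper. The function name and the example counts (100 clicks out of 1000 users) are hypothetical and only serve to illustrate the calculation.

```python
import math

def proportion_confidence_interval(clicks, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = clicks / n
    # Rule-of-thumb check that the normal approximation is reasonable
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat - margin, p_hat + margin

# Hypothetical example: 100 clicks out of 1000 users
low, high = proportion_confidence_interval(100, 1000)
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Passing a different `z` (for example 2.58 for 99%) widens the interval accordingly.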

## Hypothesis Testing 

The null hypothesis states that any observed difference between the control and experiment groups is due to chance. If $p_{cont}$ and $p_{exp}$ are the control and experiment probabilities, then according to the null hypothesis

$$H_0: p_{exp} - p_{cont} = 0$$

The alternative hypothesis is that

$$H_1: p_{exp} - p_{cont} \neq 0$$

## Comparing two samples

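To compare two samples, we calculate the pooled standard error. For example, suppose $X_{cont}$ is the number of users that click in the control group and $N_{cont}$ is the total number of users in the control group. Let $X_{exp}$ and $N_{exp}$ be the corresponding values for the experiment group. The pooled probability and pooled standard error are given by

$$\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}$$

$$SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$$

The estimated difference is $\hat{d} = \hat{p}_{exp} - \hat{p}_{cont}$, and under the null hypothesis $H_0: d = 0$ we have $\hat{d} \sim N(0, SE_{pool})$.

If $\hat{d} > 1.96 \cdot SE_{pool}$ or $\hat{d} < -1.96 \cdot SE_{pool}$, we can reject the null hypothesis at the 95% confidence level and conclude that the difference is statistically significant.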
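
The pooled test above can be sketched as a function that also reports a z-score and a two-sided p-value. The p-value is an addition not mentioned in the text; it is computed from the standard normal distribution using `statistics.NormalDist`. The function name is hypothetical and the counts are example values.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_cont, n_cont, x_exp, n_exp):
    """Pooled two-proportion z-test; returns (d_hat, se_pool, z, p_value)."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    z = d_hat / se_pool
    # Two-sided p-value under the null hypothesis d = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return d_hat, se_pool, z, p_value

# Example counts: clicks and pageviews for control and experiment
d_hat, se_pool, z, p_value = two_proportion_z_test(974, 10072, 1242, 9886)
reject_null = abs(d_hat) > 1.96 * se_pool
print(f"d_hat={d_hat:.4f}, z={z:.2f}, p-value={p_value:.3g}, reject H0: {reject_null}")
```

Comparing `abs(d_hat)` with `1.96 * se_pool` is the same decision as checking whether the p-value is below 0.05.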

## Practical Significance 

Practical significance is the level of change that you would expect to see from a business standpoint for the change to be valuable. What is considered practically significant can vary by field.

## Size vs Power Trade-off

One of the design decisions is how many data points are needed to get a statistically significant result. This is governed by the statistical power of the experiment. Power trades off against size: the smaller the change you want to detect, or the more confidence you want to have in the result, the larger the experiment you have to run.

As you increase the number of samples, the confidence interval narrows around the mean.

$$\alpha = P(\text{reject null} \mid \text{null true})$$

$$\beta = P(\text{fail to reject null} \mid \text{null false})$$

$1 - \beta$ is referred to as the sensitivity of the experiment, or statistical power. People often choose high sensitivity, typically around 80%.
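
To illustrate the trade-off, here is a rough sketch of a sample-size calculation for comparing two proportions. The formula is a common textbook approximation and is an assumption of this example, not something taken from the text above; the function name, baseline probability, and minimum detectable change are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_baseline, d_min, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect an absolute change d_min."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    p_new = p_baseline + d_min
    # Standard deviation of the difference under the null and the alternative
    sd_null = math.sqrt(2 * p_baseline * (1 - p_baseline))
    sd_alt = math.sqrt(p_baseline * (1 - p_baseline) + p_new * (1 - p_new))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2
    return math.ceil(n)

# Hypothetical baseline of 10% with a minimum detectable change of 2 points
n_large_effect = sample_size_per_group(0.10, 0.02)
n_small_effect = sample_size_per_group(0.10, 0.01)
print(n_large_effect, n_small_effect)
```

Halving the detectable change roughly quadruples the required sample size, which is the inverse relationship between effect size and experiment size described above.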

## Analysis of an A/B Test

In [29]:
import math
N_cont = 10072  # Control samples (pageviews)
N_exp = 9886  # Test samples (pageviews)
X_cont = 974  # Control clicks
X_exp = 1242  # Exp. clicks
d_min=0.02    #minimum practical significance level

In [22]:
p_pool=(X_cont+X_exp)/(N_cont+N_exp)
SEpool=math.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))


p_exp=X_exp/N_exp
p_cont=X_cont/N_cont
d=p_exp - p_cont
m=1.96*SEpool

In [23]:
print(f"d is {d} and margin of error is {m}")

d is 0.028928474040117697 and margin of error is 0.008717973932038056


In [24]:
cf_min=d-m
cf_max=d+m

In [28]:
#confidence interval
cf_min,cf_max

(0.020210500108079642, 0.03764644797215575)

As the lower bound of the confidence interval is greater than both 0 and the minimum practical significance level (d_min = 0.02), we conclude that the increase in click-through probability is both statistically and practically significant. Hence, we would launch the change.
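
The decision rule used above can be sketched as a small helper. This is a simplified, hypothetical rule for a change that is expected to increase the metric; real launch decisions usually weigh other factors as well.

```python
def launch_decision(ci_low, ci_high, d_min):
    """Classify a confidence interval for the difference against d_min.

    Simplified rule, assuming a positive change is the desired outcome.
    """
    if ci_low > d_min:
        # Whole interval is above the practical significance level
        return "launch"
    if ci_high < d_min:
        # Even the most optimistic estimate is below the practical level
        return "do not launch"
    # Interval straddles d_min: more data is needed
    return "inconclusive"

# Interval from the analysis above, rounded
decision = launch_decision(0.0202, 0.0376, d_min=0.02)
print(decision)
```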

## Metrics 

Metrics are used for two types of checks:

1. Invariant checking: metrics that shouldn’t change between your test and control groups
2. Evaluation: high-level business metrics and measures of the user experience with the product

How do we go about defining a metric (for sanity checking)?

1. Start with a high-level concept of the metric (e.g. active users, CTR)
2. Fill in the details (e.g. how do you define user activity?)
3. Take a set of metrics and summarize them into a single metric (e.g. an overall evaluation criterion (OEC))

For evaluation, you can choose either one metric or a whole suite of metrics. If you have multiple metrics, you can combine them into one, such as an objective function or an Overall Evaluation Criterion (OEC), a term that Microsoft uses.

A final consideration is how generally applicable the metric is. If you are running a suite of A/B tests, it is preferable to have a metric that works across the entire suite.

A user funnel is the series of steps that users take through the site. It is called a funnel because every subsequent stage has fewer users than the stage above it. Each stage can be measured with a metric: a total count, a rate, or a probability (i.e. the probability that a unique user progresses to the next stage).

## Properties of Metrics

Characteristics of a metric:

Sensitivity and Robustness: Whether the metric is sensitive to changes you care about and robust to changes you don’t care about (e.g. the mean is sensitive to outliers, while the median is robust but not sensitive to changes in a small group of users). This can be measured by using prior experiments to see if the metric moves in a way that intuitively makes sense. An alternative is to run A/A tests to see if the metric picks up any spurious differences.

Distribution: Obtained from retrospective data, for example by plotting a histogram of the metric.
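
The difference between a sensitive and a robust metric can be seen with a tiny example. The load times below are made-up values; the point is only how the mean and the median react to a single outlier.

```python
import statistics

# Hypothetical page load times in seconds
load_times = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 1.2]
with_outlier = load_times + [30.0]  # one very slow page load

mean_shift = statistics.mean(with_outlier) - statistics.mean(load_times)
median_shift = statistics.median(with_outlier) - statistics.median(load_times)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The mean moves by several seconds while the median barely changes, which is why the median is the more robust of the two.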

##  Absolute vs Relative 

If you are running a lot of experiments, you may want to use the relative difference, i.e. the percentage change. Its main advantage is that you only have to choose one practical significance boundary: your metrics are probably changing over time as the system changes, and a relative boundary stays stable where an absolute one would have to be updated. The main disadvantage is variability: relative differences, being ratios, are not as well behaved as absolute differences.
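
As a quick illustration, the absolute and relative differences for one pair of click-through probabilities can be computed as follows. The counts are example values; the variable names are chosen for this sketch.

```python
# Example counts: clicks and pageviews for control and experiment
x_cont, n_cont = 974, 10072
x_exp, n_exp = 1242, 9886

p_cont = x_cont / n_cont
p_exp = x_exp / n_exp

absolute_diff = p_exp - p_cont           # in probability units
relative_diff = absolute_diff / p_cont   # fraction of the control value

print(f"absolute difference: {absolute_diff:.4f}")
print(f"relative difference: {relative_diff:.1%}")
```

The same absolute change corresponds to a different relative change whenever the baseline moves, which is why the choice matters when metrics drift over time.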

##  Calculating Variability 

We want to check the variability of a metric to later determine the sizing of the experiment and to analyze confidence intervals and draw conclusions. If we have a metric that varies a lot, then the practical significance level that we are looking for may not be feasible.

To calculate the confidence interval, you need

1. Variance (or standard deviation)
2. Distribution

In [40]:
import statistics
import numpy as np
vec=[87029, 113407, 84843, 104994, 99327, 92052, 60684]
stder=statistics.stdev(vec)/math.sqrt(len(vec))
mean=np.mean(vec)
conf95_min=mean-1.96*stder
conf95_max=mean+1.96*stder

In [41]:
#confidence interval 
conf95_min,conf95_max

(79157.54332794028, 104367.02810063114)

##  Non Parametric Methods 

Non-parametric methods are a way to analyze data without making an assumption about its distribution. At Google, it was observed that analytical estimates of variance were often underestimates, so they resorted to empirical measurements based on A/A tests to evaluate variance. If you see a lot of variability in a metric in an A/A test, it is probably too sensitive to be used. Rather than running many separate A/A tests, one option is to run one large A/A test and then bootstrap it to generate small groups and test the variability.

With A/A tests, we can

1. Compare the result to what you expect (sanity check)

2. Estimate the variance empirically and use your assumption about the distribution to calculate confidence

3. Directly estimate the confidence interval without making any assumption about the data
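
The bootstrap idea can be sketched as follows: resample two small groups from one large A/A data set many times, record the difference each time, and read an empirical interval off the sorted differences. The data (2000 binary outcomes with a 10% click rate), the group size, and the function name are all hypothetical.

```python
import random

def bootstrap_aa_interval(outcomes, group_size, n_resamples=1000, seed=0):
    """Empirical 95% interval for the difference between two A/A groups."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Draw two groups with replacement from the same population
        group_a = rng.choices(outcomes, k=group_size)
        group_b = rng.choices(outcomes, k=group_size)
        diffs.append(sum(group_b) / group_size - sum(group_a) / group_size)
    diffs.sort()
    # Approximate 2.5th and 97.5th percentiles of the differences
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return low, high

# Hypothetical A/A data: 1 = click, 0 = no click
outcomes = [1] * 200 + [0] * 1800
low, high = bootstrap_aa_interval(outcomes, group_size=500)
print(f"empirical 95% interval for the A/A difference: ({low:.4f}, {high:.4f})")
```

Because both groups come from the same population, the interval should straddle 0; its width is an empirical estimate of the metric's variability at that group size.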

##  Design Experiment 

Designing an experiment involves four decisions:

1. Choose the subject: What are the units in the population you are going to run the test on? (This is the unit of diversion.)

2. Choose the population: What population are you going to use (e.g. US only)?

3. Size

4. Duration

##  Unit of Diversion 

Commonly used units of diversion are:

User identifier (ID): Typically the username or email address used on the website. It is typically stable and unchanging. If the user ID is used as the unit of diversion, each user is assigned to either the test group or the control group. A user ID is personally identifiable.

Anonymous ID: This is usually an anonymous identifier such as a cookie. It changes with the browser or device, and users may refresh their cookies, in some cases every time they log in. It is harder to refresh a cookie in an app or on a phone than on a computer.

Event: An event is, for example, a page load. With event-based diversion, the assignment can change from one event to the next for the same user. This is typically used for changes that are not user-facing.

Less commonly used units of diversion are:

Device ID: Typically available for mobile devices. It is tied to a specific device and cannot be changed by the user.

IP address: The IP address is location-specific and may change as the user changes location. It is used, for example, when testing an infrastructure change for its impact on latency.
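
One common way to divert by an identifier is to hash it into a bucket, so that the same ID always lands in the same group. The sketch below is an illustration only: the function name, the experiment label, and the 50/50 split are assumptions, not something described in the text.

```python
import hashlib

def assign_group(unit_id: str, experiment: str = "exp-1") -> str:
    """Deterministically assign a unit of diversion to a group."""
    # Salt with the experiment name so different experiments split differently
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "experiment" if bucket < 50 else "control"

# The unit ID could be a user ID, a cookie, or a device ID
groups = [assign_group(f"user-{i}") for i in range(10000)]
share_experiment = groups.count("experiment") / len(groups)
print(f"share in experiment group: {share_experiment:.3f}")
```

For event-based diversion, the hashed value would be a per-event identifier instead, so the same user could see both variants.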

Variability is often higher when it is calculated empirically than when it is calculated analytically. This happens when the unit of analysis (i.e. the denominator of the metric) is different from the unit of diversion.

For example, if the unit of diversion is a query, then coverage (= # queries with ads / # queries) will have lower variability than when a cookie is used as the unit of diversion. This is because when a query is used, the unit of diversion matches the unit of analysis (the denominator of the metric, i.e. the query).

In [42]:
N_cont = 50000 + 6021
X_cont = 2500 + 302
N_exp = 50000 + 5979
X_exp = 2500 + 374

In [43]:
p_pool=(X_cont+X_exp)/(N_cont+N_exp)
SEpool=math.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))


p_exp=X_exp/N_exp
p_cont=X_cont/N_cont
d=p_exp - p_cont
m=1.96*SEpool

In [48]:
print(f"d is {d} and margin of error is {m}")

d is 0.0013237234004343165 and margin of error is 0.0025691881506085417


As d (which is p_exp - p_cont) is smaller than the margin of error, the confidence interval for d includes 0 and the overall difference is not statistically significant.

##  Analyzing Results 

### Sanity Checks 

One of the first things to do once you finish collecting experimental data is to check the invariants. This is done by calculating the values of one or more invariants for the test and control groups and checking whether the difference is statistically significant. For example, if the values of an invariant (say, the total number of cookies) are $x$ and $y$, calculate the standard error as $\sqrt{0.5 \cdot 0.5/(x+y)}$, since one would expect the same number of cookies in both groups. Then calculate the margin as $1.96 \cdot SE$. If the margin is greater than $|x/(x+y) - y/(x+y)|$, the difference in the invariant is not significant. However, if the difference is greater than the margin, it is significant and needs to be investigated further. Note that $x/(x+y) - y/(x+y)$ is twice the deviation of $x/(x+y)$ from 0.5, so this comparison is stricter than checking whether $x/(x+y)$ lies within $0.5 \pm$ margin.

An example is provided below:

In [52]:
control_event=[2451,2475,2394,2482,2374,1704,1468]
test_event=[2404,2507,2376,2444,2504,1612,1465]
sum_control=sum(control_event)
sum_test=sum(test_event)
se=math.sqrt((0.5*0.5)/(sum_control+sum_test))
margin=1.96*se

In [55]:
str(sum_control/(sum_control+sum_test) - sum_test/(sum_control+sum_test)) + " " + "is less than the margin which is " + str(margin) 

'0.0011741682974559797 is less than the margin which is 0.005596802740247507'
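
A common alternative formulation of this check builds a confidence interval around the expected fraction of 0.5 and tests whether the observed control fraction falls inside it. The sketch below applies that formulation to the same event counts; the function name is hypothetical.

```python
import math

def invariant_sanity_check(n_control, n_experiment, expected=0.5, z=1.96):
    """Check whether the control share of units is consistent with `expected`."""
    total = n_control + n_experiment
    se = math.sqrt(expected * (1 - expected) / total)
    margin = z * se
    observed = n_control / total
    passed = (expected - margin) <= observed <= (expected + margin)
    return passed, observed, margin

control_event = [2451, 2475, 2394, 2482, 2374, 1704, 1468]
test_event = [2404, 2507, 2376, 2444, 2504, 1612, 1465]

passed, observed, margin = invariant_sanity_check(sum(control_event), sum(test_event))
print(f"observed={observed:.4f}, interval=0.5 +/- {margin:.4f}, passed={passed}")
```

Setting `expected` to another value covers experiments where traffic is not split evenly.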

##  Analysis with a single metric 

In [61]:
# Daily clicks (X) and pageviews (N) for the control and experiment groups
X_cont = [196, 200, 200, 216, 212, 185, 225, 187, 205, 211, 192, 196, 223, 192]
N_cont = [2029, 1991, 1951, 1985, 1973, 2021, 2041, 1980, 1951, 1988, 1977, 2019, 2035, 2007]
X_exp = [179, 208, 205, 175, 191, 291, 278, 216, 225, 207, 205, 200, 297, 299]
N_exp = [1971, 2009, 2049, 2015, 2027, 1979, 1959, 2020, 2049, 2012, 2023, 1981, 1965, 1993]

N_cont_sum = sum(N_cont)
X_cont_sum = sum(X_cont)
N_exp_sum = sum(N_exp)
X_exp_sum = sum(X_exp)

p_cont=X_cont_sum/N_cont_sum
p_exp=X_exp_sum/N_exp_sum

# Empirical standard error and count provided
empirical_se = 0.0062
empirical_ct = 5000
# Rescale the empirical standard error to the actual group sizes
se = (math.sqrt(1/N_cont_sum + 1/N_exp_sum))*empirical_se/math.sqrt(1/empirical_ct + 1/empirical_ct)

d=p_exp - p_cont
margin=se*1.96

d_c95min = d - margin
d_c95max = d + margin

In [63]:
print(f"d is {d:.4f} and margin of error is {margin:.4f}")

d is 0.0116 and margin of error is 0.0051

The 95% confidence interval for d, roughly (0.0065, 0.0167), lies entirely above 0, so the difference is statistically significant.


## References 

https://www.udacity.com/course/ab-testing--ud257

https://towardsdatascience.com/a-summary-of-udacity-a-b-testing-course-9ecc32dedbb1

https://rpubs.com/superseer/ab_testing