## A/B test
---
**Elo notes**

In marketing and business intelligence, A/B testing is a term for a randomized experiment with two variants, A and B, which are the control and variation in the controlled experiment.

A/B testing is a form of statistical hypothesis testing with two variants, known in statistics as two-sample hypothesis testing. Other terms for the method include bucket testing and split-run testing. Strictly, these terms can cover more than two variants, and in practice the term A/B testing is also frequently used when more than two variants are tested.

In online settings, such as web design (especially user experience design), the goal of A/B testing is to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement).

Formally, the current web page is associated with the null hypothesis. A/B testing compares two versions of a single variable, typically by testing subjects' responses to variant A against variant B and determining which of the two is more effective.

As the name implies, two versions (A and B) are compared, which are identical except for one variation that might affect a user's behavior. Version A might be the currently used version (control), while version B is modified in some respect (treatment). 
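
One common way to split users between the two versions is to derive the assignment from a user identifier, so that a returning user always sees the same version. The following is a minimal sketch under that assumption; the function name `assign_variant` and the experiment label are hypothetical and not taken from any particular tool.

```python
import hashlib


def assign_variant(user_id, experiment="homepage-test"):
    """Return "A" (control) or "B" (treatment) for a user.

    Hypothetical helper: hashes the experiment label together with the
    user ID so that assignment is stable per user and roughly 50/50.
    """
    key = "{}:{}".format(experiment, user_id).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the parity of the hash value as a fair coin flip.
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Count how 10,000 made-up user IDs are split between the two versions.
counts = {"A": 0, "B": 0}
for uid in range(10000):
    counts[assign_variant(uid)] += 1
print(counts)
```

Including the experiment label in the hash means the same user can land in different arms of different experiments, which avoids correlating assignments across tests.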

For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements like copy text, layouts, images and colors, but not always. 

The much larger family of methods broadly referred to as multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. Simple A/B tests are not valid for observational, quasi-experimental, or other non-experimental situations, which are common with survey data, offline data, and other more complex phenomena.

A commonly cited benefit of A/B testing is that it can be run continuously on almost anything, especially since most marketing automation software now includes the ability to run A/B tests on an ongoing basis. This allows websites and other tools to be updated with existing resources to keep up with changing trends.

#### Test statistics

"Two-sample hypothesis tests" are appropriate for comparing two samples, where the samples are defined by the two arms of the experiment (control and treatment). Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. Student's t-tests are appropriate for comparing means under relaxed conditions, when less is assumed. Welch's t-test assumes the least and is therefore the most commonly used two-sample test when the mean of a metric is to be optimized. While the mean of the variable to be optimized is the most common choice of estimator, others are regularly used.
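
As a rough illustration of Welch's t-test, the sketch below compares the means of two small samples with `scipy.stats.ttest_ind` and `equal_var=False`, then recomputes the statistic by hand. The metric values are made up for the example and do not come from the dataset used later in these notes.

```python
import numpy as np
from scipy import stats

# Made-up per-user metric values (e.g., order value) for the two arms.
control = np.array([10.1, 9.8, 11.2, 10.5, 9.9, 10.7, 10.3, 9.6])
treatment = np.array([11.0, 12.4, 10.2, 13.1, 11.8, 9.7, 12.9, 11.5, 12.2, 10.8])

# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# The same statistic computed by hand, using sample variances (ddof=1).
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
t_manual = (treatment.mean() - control.mean()) / se

print("t = {:.3f}, p = {:.4f}".format(t_stat, p_value))
```

The hand computation shows that Welch's statistic is simply the difference in means divided by a standard error that keeps each sample's own variance, rather than pooling them.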

A company with a customer database of 2,000 people decides to create an email campaign with a discount code in order to generate sales through its website. It creates two versions of the email, each with a different call to action (the part of the copy that encourages customers to do something; in a sales campaign, to make a purchase) and a different identifying promotional code.

- To 1,000 people it sends the email with the call to action stating, "Offer ends this Saturday! Use code A1",
- and to another 1,000 people it sends the email with the call to action stating, "Offer ends soon! Use code B1".

All other elements of the emails' copy and layout are identical. The company then monitors which campaign has the higher success rate by analyzing the use of the promotional codes. The email using code A1 has a 5% response rate (50 of the 1,000 people emailed used the code to buy a product), and the email using code B1 has a 3% response rate (30 of the recipients used the code to buy a product). The company therefore determines that, in this instance, the first call to action is more effective and will use it in future sales. A more rigorous approach would apply a statistical test to determine whether the difference in response rates between A1 and B1 is statistically significant (that is, unlikely to be due to random chance alone, and therefore likely to be real and repeatable).[9]
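
The significance check mentioned for this example can be sketched as a two-proportion z-test with a pooled standard error. This is one reasonable choice of test rather than the only one, and the variable names are illustrative.

```python
from math import sqrt
from scipy.stats import norm

n_a, conv_a = 1000, 50   # email with code A1
n_b, conv_b = 1000, 30   # email with code B1

rate_a = conv_a / float(n_a)
rate_b = conv_b / float(n_b)

# Pooled response rate under the null hypothesis of no difference.
pooled = (conv_a + conv_b) / float(n_a + n_b)
se = sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))

z = (rate_a - rate_b) / se
p_value = 2 * norm.sf(abs(z))   # two-tailed p-value

print("z = {:.2f}, p = {:.3f}".format(z, p_value))
```

With these counts the statistic is about 2.28 and the two-tailed p-value about 0.022, so the difference would be judged significant at the 0.05 level but not at the 0.01 level.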

In the example above, the purpose of the test is to determine which is the more effective way to encourage customers to make a purchase. If, however, the aim of the test had been to see which email would generate the higher click-rate – that is, the number of people who actually click onto the website after receiving the email – then the results might have been different.

For example, suppose more of the customers receiving code B1 visited the website but, because its call to action did not state the end date of the promotion, many of them felt no urgency to make an immediate purchase. If the purpose of the test had been simply to see which email would bring more traffic to the website, the email containing code B1 might well have been more successful. An A/B test should therefore have a defined, measurable outcome, such as number of sales made, click-rate conversion, or number of people signing up or registering.

#### Segmentation and targeting

A/B tests most commonly apply the same variant (e.g., user interface element) with equal probability to all users. However, in some circumstances, responses to variants may be heterogeneous. That is, while variant A might have a higher response rate overall, variant B may have an even higher response rate within a specific segment of the customer base.

If segmented results are expected from the A/B test, the test should be designed at the outset to be evenly distributed across key customer attributes, such as gender. That is, the test should both (a) contain a representative sample of men and women, and (b) assign men and women randomly to each variant (A or B). Failure to do so could bias the experiment and lead to inaccurate conclusions.

This segmentation and targeting approach can be further generalized to include multiple customer attributes rather than a single customer attribute – for example, customers' age AND gender – to identify more nuanced patterns that may exist in the test results.
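
The heterogeneity described in this section can be illustrated with a small breakdown by segment. The counts below are invented so that variant A wins overall while variant B wins among women; the column names are assumptions made for the sketch.

```python
import pandas as pd

# Invented aggregate counts per variant and segment.
counts = pd.DataFrame({
    "variant": ["A", "A", "B", "B"],
    "segment": ["men", "women", "men", "women"],
    "users": [100, 100, 100, 100],
    "conversions": [20, 5, 5, 10],
})

# Overall response rate per variant.
overall = counts.groupby("variant")[["users", "conversions"]].sum()
overall["rate"] = overall["conversions"] / overall["users"]

# Response rate per segment and variant, laid out as a segment x variant table.
by_segment = counts.set_index(["segment", "variant"])
by_segment["rate"] = by_segment["conversions"] / by_segment["users"]
rates = by_segment["rate"].unstack("variant")

print(overall)
print(rates)
```

Here A converts 12.5% of users overall against 7.5% for B, yet among women B converts 10% against 5% for A. Adding a second attribute (for example, an age band) to the index would extend the same breakdown to multiple attributes.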




https://developer.amazon.com/public/apis/manage/ab-testing/doc/math-behind-ab-testing

In [58]:
from __future__ import division
import pandas as pd
from numpy import sqrt
import scipy.stats as scipys
import datetime 
import matplotlib.pyplot as plt
%matplotlib inline

In [59]:
def z_test(ctr_old, ctr_new, nobs_old, nobs_new, effect_size=0., two_tailed=True, alpha=.05):
    """Perform a z-test to compare two proportions (e.g., click-through rates (CTR)).

        Note: if you set two_tailed=False, z_test assumes H_A is that
        ctr_new - ctr_old exceeds effect_size, so the p-value is computed
        from the weight in the upper tail.

        Arguments:
            ctr_old (float):    baseline proportion (CTR)
            ctr_new (float):    new proportion
            nobs_old (int):     number of observations in baseline sample
            nobs_new (int):     number of observations in new sample
            effect_size (float):    difference in proportions assumed under the null hypothesis
            two_tailed (bool):  True to use a two-tailed test; False to use a one-sided test
                                whose alternative hypothesis is that the difference exceeds effect_size
            alpha (float):      significance level

        Returns:
            p-value, z-score, and whether to reject the null hypothesis
    """
    # Pooled proportion across both samples (the shared rate under the null hypothesis)
    p = (ctr_old * nobs_old + ctr_new * nobs_new) / (nobs_old + nobs_new)

    # Standard error of the difference between the two proportions
    se = sqrt(p * (1 - p) * (1. / nobs_old + 1. / nobs_new))

    z_score = (ctr_new - ctr_old - effect_size) / se

    if two_tailed:
        p_val = (1 - scipys.norm.cdf(abs(z_score))) * 2
    else:
        p_val = 1 - scipys.norm.cdf(z_score)

    reject_null = p_val < alpha
    # print() with a single argument behaves the same under Python 2 and 3
    print('p_val:{}, standard_dev:{}, z_score:{}, reject null:{}'.format(p_val, se, z_score, reject_null))

    return p_val, z_score, reject_null
    

In [8]:
df = pd.read_csv('data.csv')

In [9]:
df[:2]

|   | user_id | ts | ab | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 4040615247 | 1356998400 | treatment | new_page | 0 |
| 1 | 4365389205 | 1356998400 | treatment | new_page | 0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191148 entries, 0 to 191147
Data columns (total 5 columns):
user_id         191148 non-null int64
ts              191148 non-null float64
ab              191148 non-null object
landing_page    191148 non-null object
converted       191148 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 8.8+ MB


In [12]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| user_id | 191148 | 5007786000.0 | 2889032000.0 | 3416 | 2505443000.0 | 5005865728 | 7508326000.0 | 9999962347 |
| ts | 191148 | 1357042000.0 | 24931.74 | 1356998400 | 1357020000.0 | 1357041629 | 1357063000.0 | 1357084799 |
| converted | 191148 | 0.09980225 | 0.2997369 | 0 | 0.0 | 0 | 0.0 | 1 |


In [15]:
len(df.user_id.unique())

186388
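
The 186,388 unique IDs are fewer than the 191,148 rows, so some users appear in more than one row (4,760 rows are repeats). A sketch of how such rows might be inspected, using a made-up four-row frame with the same column names; the toy values and the choice to drop users seen in both groups are assumptions, not part of the analysis in this notebook.

```python
import pandas as pd

# Toy stand-in for df: user 2 appears in both groups.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "ab": ["control", "control", "treatment", "treatment"],
    "converted": [0, 1, 0, 1],
})

# Rows that repeat an earlier user_id.
n_repeat_rows = toy.user_id.duplicated().sum()

# Users that were exposed to both arms of the experiment.
groups_per_user = toy.groupby("user_id")["ab"].nunique()
mixed_users = groups_per_user[groups_per_user > 1].index

# One option: exclude those users before computing conversion rates.
clean = toy[~toy.user_id.isin(mixed_users)]
print(n_repeat_rows, list(mixed_users), len(clean))
```

Note that the cells below divide the sum of `converted` over all rows by the number of unique users, so repeated rows can inflate the rates slightly unless they are handled first.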

In [30]:
control_group = df[df['ab'] == 'control'].copy()

In [31]:
control_group[:2]

|   | user_id | ts | ab | landing_page | converted |
|---|---|---|---|---|---|
| 3 | 8122359922 | 1356998402 | control | old_page | 0 |
| 4 | 6077269891 | 1356998402 | control | old_page | 0 |


In [32]:
n_control = float(len(control_group.user_id.unique()))

In [33]:
treatment_group = df[df.ab == 'treatment'].copy() 

In [34]:
treatment_group[:2]

|   | user_id | ts | ab | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 4040615247 | 1356998400 | treatment | new_page | 0 |
| 1 | 4365389205 | 1356998400 | treatment | new_page | 0 |


In [35]:
n_treatment = float(len(treatment_group.user_id.unique()))

In [36]:
n_treatment

95574.0

In [39]:
conversion_rate_control = control_group.converted.sum() / n_control

In [40]:
conversion_rate_control

0.09964322681524874

In [41]:
conversion_rate_treatment = treatment_group.converted.sum() / n_treatment

In [42]:
conversion_rate_treatment

0.10492393328729571

In [60]:
z_test(conversion_rate_control, conversion_rate_treatment, n_control, n_treatment, two_tailed=False)

p_val:8.51269056524e-05, standard_dev:0.00140463028958, z_score:3.75949921572, reject null:True


(8.512690565243286e-05, 3.7594992157238489, True)

In [63]:
datetime.datetime.fromtimestamp(int(control_group.ts.iloc[0]))  # first control timestamp, converted to local time

datetime.datetime(2012, 12, 31, 16, 0, 2)

In [65]:
control_group['t_stamp'] = control_group.ts.apply(lambda x: datetime.datetime.fromtimestamp(int(x)))

In [67]:
treatment_group['t_stamp'] = treatment_group.ts.apply(lambda x: datetime.datetime.fromtimestamp(int(x)))
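
The row-by-row `apply` above could probably be replaced by a vectorized conversion. The sketch below uses `pd.to_datetime` with `unit="s"` on a two-value series standing in for the `ts` column. Unlike `datetime.fromtimestamp`, which uses the machine's local time zone, this interprets the values as UTC, so the result differs from the local-time output shown above.

```python
import pandas as pd

# Two Unix timestamps (seconds), standing in for the ts column.
ts = pd.Series([1356998400, 1356998402])

# Interpret the values as seconds since the Unix epoch, in UTC.
t_stamp = pd.to_datetime(ts, unit="s")
print(t_stamp.iloc[1])
```

For the second value this gives 2013-01-01 00:00:02 in UTC, which corresponds to the 2012-12-31 16:00:02 local time printed earlier on a machine eight hours behind UTC.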

In [69]:
a = 4. / sqrt(75)

In [70]:
a

0.46188021535170054

In [73]:
a*1.96

0.90528522208933304

In [74]:
b = 0.05/(1-0.05)

In [75]:
b

0.052631578947368425