# Frequentist AB testing - using Z test

A simple demonstration of the two-sample Z-test for AB testing. Reality is usually much more complex. :)

In [1]:
import sys
sys.path.append("./tools")
from z_test_ab_testing import *

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Let's say we want to test a new design of our website (e.g. an online shop).<br>
We aim to improve the conversion rate by at least **10%**. To test that, we will run a test in which a random **50%** of all sessions stay on the old version of the website (variant __A__) and the other 50% see the new version (variant __B__). Our current conversion rate is around **1/20** (1 conversion in 20 sessions).<br>
Before running the test, we want to calculate how many sessions we need to collect before we evaluate it.<br>
To calculate the minimal sample size we need several inputs:
 - current conversion rate = **1/20 = 5%**
 - uplift/MDE (minimum detectable effect) = **10%** relative (=> target conversion rate for the new website: 1.1 * 5% = 5.5%)
 - split = **50:50**
 - confidence level = **95%** (most typical value)
 - power = **80%** (most typical value)

In [2]:
z_test_sample_size(
    conv_r = 1/20,
    mde = 0.1,
    confidance=0.95,
    power=0.8,
    test_share_size=0.5
)

[31230.69, 31230.69]

=> The minimal sample size for our test is ~31K for each group.
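
The figure above can be cross-checked by hand. The sketch below assumes that `z_test_sample_size` applies the standard two-proportion formula with unpooled variances, a two-sided test and a 50:50 split; the helper's source is not shown here, so that is an assumption. The function name `sample_size_per_group` is hypothetical and only serves this illustration.

```python
from statistics import NormalDist


def sample_size_per_group(conv_r, mde, confidence=0.95, power=0.8):
    """Per-group sample size for a two-sided two-proportion Z-test (50:50 split)."""
    p_a = conv_r
    p_b = conv_r * (1 + mde)  # mde is treated as a relative uplift
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_power) ** 2 * variance_sum / (p_b - p_a) ** 2


n = sample_size_per_group(conv_r=1 / 20, mde=0.1)
print(round(n, 2))
```

With the inputs listed above, this should land very close to the ~31.2K per group reported by the helper.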

After 7 days of testing we got results:<br>
- number of sessions A = 31,500<br>
- number of sessions B = 32,000<br>
- number of conversions A = 1550<br>
- number of conversions B = 1750<br>

We already have our desired sample size, so we can evaluate the test.<br>
To test whether this difference is significant, we can perform a Z-test with the null __hypothesis__ that there is no difference in conversion rates between the two variants.

In [3]:
AB_z_test(
    totals_A = 31500
    ,successes_A = 1550
    ,totals_B = 32000
    ,successes_B = 1750
    ,confidance = 0.95 # confidence level (0.95 if not set)
    #,test_type = 'two-tailed' #('two-tailed' if not set)
)

winning: B (significant)
conversion rate A: 4.921%
conversion rate B: 5.469%
uplift: 11.139%
standard error A: 0.00122
standard error B: 0.00127
Z-score: 3.1127
Z-test p-value: 0.00185
Z-test power: 0.87549


We can see that the conversion rate for variant B is higher (5.47% vs 4.92%), an **11% uplift**.<br>
As the p-value < 0.01, we are __rejecting__ the null hypothesis: if the variants truly did not differ, a difference at least this large would show up in fewer than 1% of such tests. The power of our test is above 80%, which was also one of our goals, so we are happy and we can consider **B the winner**. :)
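
To make the reported numbers less of a black box, here is a minimal sketch of a two-tailed two-proportion Z-test with unpooled standard errors. It is an assumption that `AB_z_test` works this way (its source is not shown), and the function name `two_proportion_z_test` is hypothetical, but this calculation is consistent with the standard errors, Z-score, p-value and power printed above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(totals_a, successes_a, totals_b, successes_b, confidence=0.95):
    """Two-tailed, unpooled two-proportion Z-test (a sketch, not the helper's source)."""
    norm = NormalDist()
    rate_a = successes_a / totals_a
    rate_b = successes_b / totals_b
    # Standard error of each conversion rate (binomial proportion).
    se_a = sqrt(rate_a * (1 - rate_a) / totals_a)
    se_b = sqrt(rate_b * (1 - rate_b) / totals_b)
    z = (rate_b - rate_a) / sqrt(se_a ** 2 + se_b ** 2)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    z_crit = norm.inv_cdf(1 - (1 - confidence) / 2)
    # Approximate post-hoc power: chance of exceeding the critical value
    # if the true difference equals the observed one (far tail ignored).
    power = norm.cdf(abs(z) - z_crit)
    return {"uplift": rate_b / rate_a - 1, "z": z, "p_value": p_value, "power": power}


result = two_proportion_z_test(31500, 1550, 32000, 1750)
for name, value in result.items():
    print(f"{name}: {value:.5f}")
```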

# Bayesian AB testing

Another way to evaluate our test is the Bayesian approach, with which we can calculate the PBB (probability of being best) for each variant.

In [5]:
from bayes_ab_testing import *

In [6]:
pbb_conversion(
    totals = [31500,32000],
    successes = [1550,1750]
)

[0.0017, 0.9983]

=> Based on the data we have, the probability that variant B is better than A is **99.83%**.
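
One common way to obtain such a probability is to sample from each variant's Beta posterior and count how often each variant has the highest conversion rate. The sketch below assumes a uniform Beta(1, 1) prior and plain Monte Carlo sampling; whether `pbb_conversion` uses the same prior or number of draws is not shown here, and `pbb_monte_carlo` is a hypothetical name. Expect the estimate to differ slightly from the output above because of sampling noise.

```python
import random


def pbb_monte_carlo(totals, successes, n_draws=20000, seed=0):
    """Estimate each variant's probability of being best via posterior sampling."""
    rng = random.Random(seed)
    wins = [0] * len(totals)
    for _ in range(n_draws):
        # One draw per variant from its Beta(1 + successes, 1 + failures) posterior.
        draws = [
            rng.betavariate(1 + s, 1 + n - s)
            for n, s in zip(totals, successes)
        ]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]


probs = pbb_monte_carlo([31500, 32000], [1550, 1750])
print(probs)
```

Because exactly one variant wins each draw, the estimated probabilities always sum to 1, regardless of how many variants are passed in.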

_With this approach we can easily add more variants and always calculate PBBs that are comparable and sum up to 1. For example, if we had a 3rd variant with 30,000 sessions and 1,600 conversions:_

In [7]:
pbb_conversion(
    totals = [31500,32000,30000],
    successes = [1550,1750,1600]
)

[0.0004, 0.7789, 0.2207]

Let's get back to our original example with two variants.<br>
In reality, we might also want to base the decision on other facts. For instance, the new variant may produce more conversions while the total revenue is lower, because people tend to purchase fewer or cheaper products.

To demonstrate that, we will generate some fake data.

In [8]:
import datetime
import numpy as np  # imported explicitly instead of relying on the star imports above
import pandas as pd


try:
    df = pd.read_pickle("ab_test_data.pkl")
except FileNotFoundError:
    params = [
        {'variant' : 'A', 'source': 'mobile', 'impressions': round(31500 * 1/3), 'orders': round(1550 * 1/3), 'mean': 2},
        {'variant' : 'B', 'source': 'mobile', 'impressions': round(32000 * 1/3), 'orders': round(1750 * 1/3), 'mean': 2.3},
        {'variant' : 'A', 'source': 'desktop', 'impressions': round(31500 * 2/3), 'orders': round(1550 * 2/3), 'mean': 2.7},
        {'variant' : 'B', 'source': 'desktop', 'impressions': round(32000 * 2/3), 'orders': round(1750 * 2/3), 'mean': 2.9},
    ]
    data = []
    for p in params:
        dates = np.array([datetime.date(2019, 5, np.random.randint(1,7+1)) for i in range(p['impressions'])])
        orders = np.append(
            np.random.lognormal(mean=p['mean'], sigma=1.0, size=p['orders']),
            np.zeros(p['impressions'] - p['orders'])
        )
        for i in range(p['impressions']):
            data.append({
                'date': dates[i],
                'variant': p['variant'],
                'source': p['source'],
                'conversion': 1 if orders[i] > 0 else 0,
                'revenue': orders[i]
            })

    df = pd.DataFrame(data).sample(frac=1).reset_index(drop=True)
    df.to_pickle("ab_test_data.pkl")


In [9]:
df.head(10)

,conversion,date,revenue,source,variant
0,0,2019-05-03,0.0,mobile,B
1,0,2019-05-07,0.0,mobile,B
2,0,2019-05-04,0.0,mobile,B
3,0,2019-05-07,0.0,desktop,A
4,0,2019-05-03,0.0,desktop,B
5,0,2019-05-05,0.0,mobile,B
6,0,2019-05-06,0.0,desktop,B
7,1,2019-05-02,63.928773,mobile,B
8,0,2019-05-04,0.0,mobile,A
9,0,2019-05-05,0.0,desktop,B


In each row we have information about one session. As our conversion rate was around 5%, we can see that in most cases the revenue per session is 0.<br>
Total conversions and revenue per variant:

In [32]:
# Named aggregation: count sessions and sum conversions and revenue per variant
groupby_1 = df.groupby('variant').agg(
    sessions=('conversion', 'count'),
    conversions=('conversion', 'sum'),
    revenue=('revenue', 'sum'),
)
groupby_1['revenue_per_session'] = groupby_1['revenue'] / groupby_1['sessions']

groupby_1

variant,sessions,conversions,revenue,revenue_per_session
A,31500,1550,32801.475912,1.041317
B,32000,1750,44485.88274,1.390184


In [15]:
pbb_conversion([80000,80000,80000], [1600,1700,1650])

pbb_revenue(
    [4212694,4213358,4333878],
    [639,716,760],
    [2280,2585,2691],
    [8265,9466,9680]
)

[0.0225, 0.7882, 0.1893]

[0.0025, 0.7053, 0.2922]