# Using the ``Designer`` class to calculate A/B test parameters

This tutorial reviews *Ambrosia's* experiment design tools using the example of calculating the parameters of a hypothetical A/B test. Synthetic data on MTS KION user metrics is used for this purpose.

Before we look at the tools, here is a short list of questions and answers that cover some essentials of experiment design.

**Note:** This tutorial assumes fixed-horizon experiments. In this kind of experiment, decisions are based on the results obtained at the end of the planned test duration.

## What is needed before designing A/B test parameters?

Before the experiment, it is good to have:

- A formulated and fixed hypothesis
- One metric or a set of metrics that meet all the requirements of the task and on which the conclusion will be drawn
- A fixed plan for the decision-making process based on the results of the experiment

In addition, calculating the test parameters requires historical data on the selected metrics.

## What parameters does a typical A/B test have?

A typical A/B test uses a **statistical criterion** to test the hypothesis, and the experimental setup has four related parameters:

- **Type I error** (alpha) - the probability that the criterion falsely reports a change when there is no real change,\
alpha is also called the *significance level*, and 1 - alpha the *confidence level*
- **Type II error** (beta) - the probability that the criterion fails to detect a real change,\
1 - beta is called *statistical power*
- **Group sizes** - the number of objects in each experimental group, which is converted to the duration of the experiment using traffic
- **Minimal detectable effect** (MDE) - the smallest true effect of a change that can be detected with a given statistical power at a given significance level

**Note:** For tests with multiple groups or sets of metrics, the parameters of the experiment must be adjusted accordingly, for example by correcting for multiple comparisons.
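
As a rough illustration of how group sizes translate into test duration, the sketch below assumes that every eligible user enters the experiment exactly once and that daily traffic is constant. The function name and the traffic figure are hypothetical and are not part of *Ambrosia*.

```python
import math


def experiment_duration_days(group_size: int, n_groups: int, daily_new_users: int) -> int:
    """Days needed to fill all groups, assuming constant traffic and no repeat users."""
    if daily_new_users <= 0:
        raise ValueError("daily_new_users must be positive")
    total_needed = group_size * n_groups
    # Round up: a partially used day still has to be run.
    return math.ceil(total_needed / daily_new_users)


# Two groups of 7000 users with a hypothetical 1500 new users per day.
days = experiment_duration_days(group_size=7000, n_groups=2, daily_new_users=1500)
print(days)  # 14000 / 1500 = 9.33..., rounded up to 10
```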

## Why does one need to calculate A/B test parameters?

Calculating them in advance is necessary to obtain correct and expected results from the experiment.\
Nobody wants to run an experiment longer than necessary or get results with low statistical power.

Typically, researchers fix the type I error at some level (the industry default is 0.05) and try to maximize the statistical power of the test under the existing limitations of the business environment.

These limitations usually include:

- Test duration limits due to the risk that the implemented change has a negative impact
- Test duration limits due to the cost of the test
- Group size limits due to the available pool of objects or the traffic channel
- MDE limits due to its minimal reasonable size
- MDE limits due to the weak impact of the change on the tested metric
- MDE limits due to the development costs of the implemented change
- Costs of type I and type II errors, which constrain the levels at which these errors can be fixed


For example, the design problem may be stated as follows: \
What is the minimal effect on the metric that we can detect with given type I and type II errors if the size of our groups is fixed?
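
A minimal sketch of how this question can be answered, assuming the normal approximation for the difference of group means, two groups of equal size, and a two-sided test. The function name and the example mean and standard deviation are hypothetical; this is a textbook formula, not necessarily the exact computation *Ambrosia* performs.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(mean: float, std: float, group_size: int,
                 alpha: float = 0.05, beta: float = 0.2) -> float:
    """Relative MDE of a two-sided test with equal groups (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for the desired power
    standard_error = std * sqrt(2 / group_size)    # std of the difference of two means
    return (z_alpha + z_beta) * standard_error / mean


# Hypothetical metric: mean 100, standard deviation 300, 1000 users per group.
mde = relative_mde(mean=100, std=300, group_size=1000)
print(f"{mde:.1%}")  # about 37.6%
```

Because the standard error shrinks as the square root of the group size, quadrupling the groups only halves the detectable effect.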

## How can the parameters be calculated?

*Ambrosia* offers two approaches to calculating experiment parameters from historical metric data.

**Theoretical approach**

The first method is based on the analytical formula for the difference of normally distributed quantities.
This method is very fast because it requires only the mean and variance of the empirical distribution of the metric, and it is recommended as a first choice.

Don't worry if your metric is not normally distributed: for large enough groups, the central limit theorem (CLT) makes the group means approximately normal. However, to be sure the results are correct, check that the corresponding confidence intervals achieve their nominal coverage.

You can read more about this theoretical formula [here](https://habr.com/ru/company/ru_mts/blog/700992/) or in other sources.
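
For reference, a common textbook form of this normal-approximation formula is sketched below. It assumes two groups of equal size $n$, a common metric standard deviation $\sigma$, and an absolute effect $\delta$ (for a relative effect $\varepsilon$ on a metric with mean $\mu$, $\delta = \varepsilon\mu$). These symbols are introduced here for illustration, and *Ambrosia*'s implementation may differ in details.

$$
\begin{aligned}
n &= \frac{2\sigma^2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2}, \\
\delta_{\mathrm{MDE}} &= \left(z_{1-\alpha/2} + z_{1-\beta}\right)\sigma\sqrt{\frac{2}{n}}, \\
1 - \beta &\approx \Phi\!\left(\frac{\delta}{\sigma\sqrt{2/n}} - z_{1-\alpha/2}\right).
\end{aligned}
$$

Here $z_q$ is the $q$-quantile of the standard normal distribution and $\Phi$ is its cumulative distribution function; the power expression ignores the usually negligible opposite tail. One practical consequence is that the required group size grows as $1/\delta^2$, so halving the effect to be detected roughly quadruples the group size.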

**Empirical approach**

The theoretical approach is fast and convenient, but it does not take into account the specific criterion you use or all the features of the distribution of the metric values.

The empirical method calculates parameters by repeatedly sampling groups from the supplied historical data, modeling the effect on the test group, and applying the selected statistical test to a large number of group pairs. In this way, the statistical power is estimated empirically and the other parameters are matched to it.

This method is more computationally expensive and can give noisy parameter estimates when the number of sampled groups is small.

**Note:** The empirical approach is not suitable for binary metrics. For them, you can choose the ``binary`` method, which solves the inverse problem by constructing a large number of binary confidence intervals. \
This method has its own specifics; see the separate example on designing experiments with binary metrics.
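
The idea of the empirical approach for non-binary metrics can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not *Ambrosia*'s implementation: groups are sampled with replacement, the effect is modeled as a multiplier on group B, and a two-sided z-test stands in for the selected criterion. All names and the synthetic data are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def empirical_power(history, group_size, effect, alpha=0.05, n_trials=300, seed=0):
    """Share of simulated A/B pairs in which a two-sided z-test rejects H0.

    `effect` is a multiplier applied to group B (1.2 means +20%).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        # Sample both groups from the historical data with replacement.
        group_a = rng.choices(history, k=group_size)
        group_b = [x * effect for x in rng.choices(history, k=group_size)]
        mean_a, var_a = mean_and_variance(group_a)
        mean_b, var_b = mean_and_variance(group_b)
        z_stat = (mean_b - mean_a) / sqrt(var_a / group_size + var_b / group_size)
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_trials


# Synthetic, skewed "historical" metric (hypothetical data with mean near 100).
data_rng = random.Random(42)
history = [data_rng.expovariate(1 / 100) for _ in range(5000)]

power_no_effect = empirical_power(history, group_size=500, effect=1.0)
power_with_effect = empirical_power(history, group_size=500, effect=1.2)
print(power_no_effect, power_with_effect)
```

With no effect, the rejection rate should stay near the chosen alpha; with a real effect, it estimates the power. Both numbers are noisy for a small `n_trials`.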

## Now, let's start the tutorial

In [2]:
import numpy as np
import pandas as pd

import yaml

from ambrosia.designer import Designer, design, load_from_config

Load data

In [3]:
data = pd.read_csv('../tests/test_data/kion_data.csv', sep=';')

In [4]:
data.head()

| | profile_id | sum_dur | vod_cnt | ln_vod_cnt | bin_col |
|---|---|---|---|---|---|
| 0 | 99402893794 | 20104282 | 83 | 5.533356 | 1 |
| 1 | 878511937265 | 3986136 | 53 | 4.807294 | 1 |
| 2 | 998929369788 | 2063965 | 22 | 3.187069 | 1 |
| 3 | 265028786131 | 523539 | 14 | 2.679252 | 1 |
| 4 | 995182338752 | 1588224 | 19 | 4.177776 | 1 |


The ``Designer`` class is *Ambrosia*'s main tool for calculating experimental parameters. It has one main public method, ``run()``, which returns a table with the calculated test parameters.

Let's create an instance of the class and pass the constructor a dataframe with historical data on the metrics we will design for. In our case, this is the total content viewing duration per user, ``sum_dur``.

In [5]:
designer = Designer(dataframe=data, metrics='sum_dur')

In fact, the dataframe and metrics can also be passed later as arguments to the ``run()`` method. The same holds for most parameters related directly to the experiment (errors, effects, and so on): either pass them to the constructor during initialization, in which case they become attributes of the created instance, or pass them later when calling ``run()``. If a parameter is set in both places, the argument passed to the method takes precedence over the attribute value.

### Theoretical design

Now we will calculate the parameters of the experiment using the theoretical approach and a grid of the other, known parameters.

In [6]:
### Set parameters grid
effects = [1.05, 1.1, 1.2]  # Relative effects as multipliers: 1.05 means +5%
sizes = [1000, 3000, 7000]  # Size of each group
first_type_errors = [0.01, 0.05]
second_type_errors = [0.1, 0.2]
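
To make the size of such a grid concrete: each output table in this tutorial has one cell per combination of the supplied values. The snippet below only enumerates the combinations with the standard library; how ``Designer`` iterates over them internally is not shown here and is not assumed.

```python
from itertools import product

first_type_errors = [0.01, 0.05]
second_type_errors = [0.1, 0.2]
sizes = [1000, 3000, 7000]

# One MDE value is computed for every (size, alpha, beta) combination.
grid = list(product(sizes, first_type_errors, second_type_errors))
print(len(grid))  # 3 * 2 * 2 = 12
```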

Calculate MDE

In [7]:
designer.run(to_design='effect',
             method='theory',
             first_type_errors=first_type_errors,
             second_type_errors=second_type_errors,
             sizes=sizes)

| Group sizes \ Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| 1000 | 61.1% | 54.2% | 51.4% | 44.4% |
| 3000 | 35.3% | 31.3% | 29.6% | 25.6% |
| 7000 | 23.1% | 20.5% | 19.4% | 16.8% |


We will use these error rates below, so let's set them using setters.

In [8]:
designer.set_first_errors(first_type_errors)
designer.set_second_errors(second_type_errors)

Now calculate group sizes

In [9]:
designer.run(to_design='size', method='theory', effects=effects)

| Effect \ Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| 5.0% | 149323 | 117206 | 105448 | 78768 |
| 10.0% | 37332 | 29303 | 26363 | 19693 |
| 20.0% | 9335 | 7327 | 6592 | 4924 |


Finally calculate statistical power

In [10]:
designer.run(to_design='power', method='theory', effects=effects, sizes=sizes)

| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |


We can change the alternative hypothesis. By default it is ``"two-sided"``; now we want to test only for positive changes.

In [11]:
designer.run(to_design='power',
             method='theory',
             effects=effects,
             sizes=sizes,
             alternative='greater')

| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.2% | 3.8% | 6.8% |
| 0.01 | 10.0% | 4.5% | 10.9% | 25.6% |
| 0.01 | 20.0% | 14.4% | 44.4% | 84.5% |
| 0.05 | 5.0% | 9.2% | 13.6% | 20.9% |
| 0.05 | 10.0% | 15.5% | 29.1% | 51.0% |
| 0.05 | 20.0% | 35.1% | 70.6% | 95.5% |
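
Why the one-sided alternative gives higher power for positive effects can be seen from a normal-approximation sketch: the whole type I error budget is spent on one tail, so the critical value is lower. The function below is an illustration with hypothetical names and inputs, assuming equal group sizes; it is not *Ambrosia*'s code.

```python
from math import sqrt
from statistics import NormalDist


def theoretical_power(mean, std, group_size, effect, alpha=0.05, alternative="two-sided"):
    """Power of a z-test for a relative effect given as a multiplier (1.2 means +20%)."""
    norm = NormalDist()
    # Expected difference of means in units of its standard error.
    shift = mean * (effect - 1) / (std * sqrt(2 / group_size))
    if alternative == "two-sided":
        z_crit = norm.inv_cdf(1 - alpha / 2)
        # Probability of rejecting in either tail.
        return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
    if alternative == "greater":
        z_crit = norm.inv_cdf(1 - alpha)
        return norm.cdf(shift - z_crit)
    raise ValueError("alternative must be 'two-sided' or 'greater'")


# Hypothetical metric: mean 100, standard deviation 300, 7000 users per group.
two_sided = theoretical_power(100, 300, 7000, effect=1.2)
one_sided = theoretical_power(100, 300, 7000, effect=1.2, alternative="greater")
print(round(two_sided, 3), round(one_sided, 3))
```

The price of the one-sided test is that it cannot detect a negative change at all, so it fits only when a decrease and no change lead to the same decision.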


The ``groups_ratio`` parameter makes the group sizes unequal. The size of group B equals the size of group A multiplied by ``groups_ratio``. By default, it is ``1.0``.

Let's calculate the required size for a group A : group B proportion of 10 : 1. The resulting table shows the size of group A.

In [12]:
designer.run(to_design='size',
             method='theory',
             effects=effects,
             sizes=sizes,
             groups_ratio=0.1)

| Effect \ Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| 5.0% | 821269 | 644622 | 579958 | 433219 |
| 10.0% | 205320 | 161158 | 144991 | 108306 |
| 20.0% | 51333 | 40292 | 36249 | 27078 |
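
The cost of unequal groups can be estimated with a short calculation, assuming equal variances in both groups and the normal approximation. The variance of the difference of means is proportional to `1/n_A + 1/(r * n_A)`, where `r` is the ratio of B to A, versus `2/n` for equal groups. The function names below are hypothetical.

```python
def group_a_inflation(groups_ratio: float) -> float:
    """Size of group A relative to the equal-split group size, for the same power."""
    return (1 + 1 / groups_ratio) / 2


def total_inflation(groups_ratio: float) -> float:
    """Total sample (A + B) relative to the total sample of an equal split."""
    return group_a_inflation(groups_ratio) * (1 + groups_ratio) / 2


print(group_a_inflation(0.1))  # 5.5 times the equal-split size of group A
print(round(total_inflation(0.1), 3))  # about 3.025 times the total sample
```

This agrees with the tables above: for a 5% effect with errors (0.05; 0.2), the equal-split size of 78768 multiplied by 5.5 is 433224, close to the 433219 reported for group A.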


### Empirical design

Now we will change the design method to ``empiric`` and calculate group sizes by running many pseudo A/B tests on historical data.

By default, the ``Designer`` uses the two-sample independent t-test as the statistical criterion.

To limit the computational cost, we set the ``bs_samples`` parameter to a low value. This parameter determines how many pseudo A/B tests are run to evaluate one parameter value; higher values (at least 1000 is advisable) give more accurate estimates.

We will also use multiprocessing to speed up the calculations by setting ``n_jobs`` to ``4`` (the default is ``1``).

In [13]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             bs_samples=100,
             n_jobs=4)

| Effect \ Errors ($\alpha$, $\beta$) | (0.01, 0.1) | (0.01, 0.2) | (0.05, 0.1) | (0.05, 0.2) |
|---|---|---|---|---|
| 5.0% | 153569 | 137126 | 117300 | 73706 |
| 10.0% | 41096 | 34920 | 27711 | 21503 |
| 20.0% | 10299 | 8827 | 7639 | 5822 |
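
The noise introduced by a small ``bs_samples`` value can be gauged with a back-of-the-envelope calculation. It assumes that each pseudo A/B test is an independent trial that rejects with probability equal to the true power, so the estimated power has a binomial standard error. This is a simplification for intuition, with a hypothetical function name, not a statement about *Ambrosia*'s internals.

```python
from math import sqrt


def power_standard_error(true_power: float, bs_samples: int) -> float:
    """Standard error of a power estimate built from independent pseudo-tests."""
    return sqrt(true_power * (1 - true_power) / bs_samples)


for n_samples in (100, 1000, 10000):
    print(n_samples, round(power_standard_error(0.8, n_samples), 4))
# 100 0.04
# 1000 0.0126
# 10000 0.004
```

With 100 pseudo-tests, a true power of 80% is estimated only to within a few percentage points, which in turn makes the group sizes found by the search noticeably noisy.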


The statistical criterion can be changed using the ``criterion`` parameter.

In [14]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             criterion='mw',
             bs_samples=100,
             n_jobs=4)

| Effect \ Errors ($\alpha$, $\beta$) | (0.01, 0.1) | (0.01, 0.2) | (0.05, 0.1) | (0.05, 0.2) |
|---|---|---|---|---|
| 5.0% | 66810 | 58589 | 46579 | 39088 |
| 10.0% | 15249 | 12426 | 10748 | 10069 |
| 20.0% | 4247 | 3891 | 2979 | 2340 |


We can also use the bootstrap criterion, here to calculate statistical power.

In [25]:
designer.run(to_design='power',
             method='empiric',
             effects=effects,
             sizes=sizes,
             criterion='bootstrap',
             bs_samples=1000,
             n_jobs=4)

| $\alpha$ | Effect | Group sizes (1000, 1000) | Group sizes (3000, 3000) | Group sizes (7000, 7000) |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.2% | 3.9% | 5.1% |
| 0.01 | 10.0% | 4.5% | 7.8% | 19.7% |
| 0.01 | 20.0% | 9.9% | 31.5% | 70.8% |
| 0.05 | 5.0% | 7.6% | 9.8% | 13.9% |
| 0.05 | 10.0% | 10.6% | 19.7% | 37.5% |
| 0.05 | 20.0% | 24.4% | 57.1% | 84.7% |
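
For readers unfamiliar with bootstrap criteria, here is a minimal sketch of one common variant: a percentile bootstrap confidence interval for the difference of means, with the null hypothesis rejected when the interval excludes zero. It is an assumption that this resembles what ``criterion='bootstrap'`` does; *Ambrosia*'s implementation may use a different statistic or interval type. All names and data are hypothetical.

```python
import random


def bootstrap_mean_diff_test(group_a, group_b, alpha=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap CI for mean(B) - mean(A); reject H0 if it excludes 0."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently with replacement.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(resample_b) / len(resample_b) - sum(resample_a) / len(resample_a))
    diffs.sort()
    lower = diffs[round(n_boot * alpha / 2)]
    upper = diffs[round(n_boot * (1 - alpha / 2)) - 1]
    rejected = not (lower <= 0.0 <= upper)
    return lower, upper, rejected


# Hypothetical data: a control group and the same values shifted by +10.
data_rng = random.Random(1)
control = [data_rng.gauss(100, 10) for _ in range(200)]
treated = [x + 10 for x in control]

lower_shift, upper_shift, rejected_shift = bootstrap_mean_diff_test(control, treated)
lower_same, upper_same, rejected_same = bootstrap_mean_diff_test(control, control)
print(rejected_shift, rejected_same)
```

Each such test already involves `n_boot` resamples, which is why running it inside an empirical design loop is considerably more expensive than a t-test.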


*Ambrosia* implements a number of criteria, but keep in mind that each of them has its own prerequisites and tests its own null hypothesis.

The ``alternative`` and ``groups_ratio`` parameters are also available in the empirical approach.

In [27]:
designer.run(to_design='power',
             method='empiric',
             sizes=sizes,
             effects=effects,
             criterion='ttest',
             bs_samples=10000,
             alternative='greater',
             groups_ratio=2.0,
             n_jobs=4)

| $\alpha$ | Effect | Group sizes (1000, 2000) | Group sizes (3000, 6000) | Group sizes (7000, 14000) |
|---|---|---|---|---|
| 0.01 | 5.0% | 3.6% | 5.9% | 10.7% |
| 0.01 | 10.0% | 7.9% | 16.4% | 34.5% |
| 0.01 | 20.0% | 21.9% | 55.4% | 88.0% |
| 0.05 | 5.0% | 12.5% | 17.6% | 25.9% |
| 0.05 | 10.0% | 20.5% | 37.1% | 58.8% |
| 0.05 | 20.0% | 44.4% | 76.5% | 96.4% |


**Note:** The empirical approach consumes a significant amount of computing resources and memory, especially when calculations are made on large groups.

### Stand-alone design function

There is also a stand-alone ``design`` function that replicates the behavior of the ``Designer`` and can be used in the same way to calculate A/B test parameters.

Let's design test parameters for two metrics. The output is a dict of pandas tables keyed by metric name.

In [28]:
design_result = design(to_design='power',
                       dataframe=data,
                       metrics=['sum_dur', 'vod_cnt'],
                       method='theory',
                       first_type_errors=first_type_errors,
                       sizes=sizes,
                       effects=effects)

Theoretical design of power for ``sum_dur`` metric

In [29]:
design_result['sum_dur']

| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |


Theoretical design of power for ``vod_cnt`` metric

In [30]:
design_result['vod_cnt']

| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.3% | 5.6% | 14.4% |
| 0.01 | 10.0% | 7.6% | 27.5% | 67.2% |
| 0.01 | 20.0% | 38.5% | 91.6% | 100.0% |
| 0.05 | 5.0% | 8.8% | 16.7% | 32.7% |
| 0.05 | 10.0% | 20.8% | 50.7% | 85.6% |
| 0.05 | 20.0% | 62.7% | 97.7% | 100.0% |


### Storable configuration

A ``Designer`` instance can be saved to and created from a ``yaml`` config file. Attributes such as datasets are not serialized and must be set after the instance is loaded.

Let's create an instance with the preferred attributes.

In [31]:
store_path = '_examples_configs/designer_config.yaml'

In [32]:
storable_designer = Designer(effects=[1.05, 1.1, 1.2],
                             sizes=[1000, 3000, 7000],
                             first_type_errors=[0.01, 0.05],
                             metrics=['sum_dur', 'ln_vod_cnt'])

In [33]:
storable_designer.__getstate__()

{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}

Save the config in a file

In [34]:
with open(store_path, 'w') as outfile:
    yaml.dump(storable_designer, outfile, default_flow_style=True)

Load instance from a file and set data

In [35]:
loaded_designer = load_from_config(store_path)
loaded_designer.set_dataframe(data)

In [36]:
loaded_designer.__getstate__()

{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}
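
The pattern behind this behavior, where the configuration is stored but the data is not, can be illustrated with a small stand-in class. ``TinyDesigner`` is hypothetical and is not *Ambrosia*'s ``Designer``; JSON from the standard library is used here in place of YAML so that the sketch has no dependencies.

```python
import json


class TinyDesigner:
    """Hypothetical stand-in showing a config-only serializable state."""

    def __init__(self, effects=None, sizes=None, dataframe=None):
        self.effects = effects
        self.sizes = sizes
        self.dataframe = dataframe

    def __getstate__(self):
        # Only lightweight settings are exported; the data is left out on purpose.
        return {"effects": self.effects, "sizes": self.sizes}

    def set_dataframe(self, dataframe):
        self.dataframe = dataframe


rows = [1, 2, 3]  # placeholder for a real dataset
original = TinyDesigner(effects=[1.05, 1.1], sizes=[1000, 3000], dataframe=rows)

saved = json.dumps(original.__getstate__())
restored = TinyDesigner(**json.loads(saved))
data_after_load = restored.dataframe  # None: the data did not travel with the config

restored.set_dataframe(rows)  # the data has to be attached again after loading
print(saved)
```

Keeping datasets out of the stored state keeps config files small and shareable, at the cost of one extra step after loading.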

Design an experiment parameter, here the statistical power

In [37]:
design_results = loaded_designer.run('power')

In [38]:
design_results['sum_dur']

| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |


---

## Learn more

There are a few more examples of designing experiment parameters with *Ambrosia*.

Check:

* ``Designer`` class documentation
* An example of binary metrics experiment design
* An example of designing parameters using Spark DataFrame (currently has limited functionality)
* [Habr post about *Ambrosia*](https://habr.com/ru/company/ru_mts/blog/700992/)