# Example of the ``Tester`` class usage for evaluation of the effect

This tutorial shows how *Ambrosia* testing tools can be used to statistically evaluate the effects observed in experiments.

Usually, when we perform a statistical evaluation, the statistical criterion and the type I error threshold have already been chosen at the experiment design stage.

After the experiment, the experimenters compare the obtained p-value with the type I error threshold and report a point estimate of the effect together with its confidence interval.

Below we walk through all these steps using the ``Tester`` class on synthetically generated data.

## Let's start the tutorial

In [2]:
import pandas as pd
import numpy as np

from ambrosia.tester import Tester

Load data

In [3]:
data = pd.read_csv('../tests/test_data/watch_result_agg.csv')

The data contains users' content views aggregated over the experiment, and the users are split into two groups.

In [4]:
data.head()

|   | id | watched | group |
|---|----|---------|-------|
| 0 | 6 | 597.833362 | A |
| 1 | 11 | 549.314234 | A |
| 2 | 20 | 564.401942 | A |
| 3 | 21 | 248.735358 | A |
| 4 | 23 | 926.048946 | B |


Everything needed for effect estimation is inside the ``Tester`` class. It has one main public method, ``run()``, which returns a table with the p-value, the point estimate of the effect, and the confidence interval.

The ``Splitter`` class is *Ambrosia's* main tool for splitting objects into groups.

Let's create an instance of the class, passing to the constructor the experimental data, the name of the group column, and the arguments that we defined at the design stage.

In [5]:
tester = Tester(dataframe=data,
                column_groups='group',
                metrics='watched',
                first_type_errors=0.01)

Now we call the ``run()`` method to estimate the absolute uplift using the t-test criterion.

In [6]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='ttest')

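|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.01 | 2.2e-05 | 55.314679 | (14.578, 96.0514) | watched | A | B |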
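To make the decision rule concrete, here is a minimal sketch of an absolute-effect estimate with a p-value and a confidence interval. It uses a normal approximation from the Python standard library rather than Ambrosia's own t-test, and the function name ``absolute_effect`` and the synthetic samples are hypothetical, chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def absolute_effect(a, b, alpha=0.05):
    """Difference of means (b - a) with a normal-approximation p-value and CI.

    Illustrative only: this is not Ambrosia's implementation.
    """
    effect = mean(b) - mean(a)
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = effect / se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return pvalue, effect, (effect - q * se, effect + q * se)


# Hypothetical synthetic samples standing in for the two experiment groups.
random.seed(0)
group_a = [random.gauss(600, 250) for _ in range(500)]
group_b = [random.gauss(650, 250) for _ in range(500)]

alpha = 0.01  # type I error threshold fixed at the design stage
pvalue, effect, ci = absolute_effect(group_a, group_b, alpha=alpha)
reject_null = pvalue < alpha
print(pvalue, effect, ci, reject_null)
```

With this construction, the null hypothesis is rejected exactly when zero falls outside the confidence interval, which is why the p-value and the interval are usually read together.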


We can also estimate the relative effect.

In [7]:
tester.run(effect_type='relative',
           method='theory',
           criterion='ttest')

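|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.01 | 4e-05 | 0.079901 | (0.0299, 0.1303) | watched | A | B |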
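One common way to build an interval for a relative effect is the delta method applied to the ratio of two sample means. The sketch below assumes that approach and a normal approximation; it is not necessarily how Ambrosia computes the interval, and the name ``relative_effect`` and the synthetic data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def relative_effect(a, b, alpha=0.05):
    """Relative uplift mean(b) / mean(a) - 1 with a delta-method CI.

    Illustrative only: this is not Ambrosia's implementation.
    """
    mean_a, mean_b = mean(a), mean(b)
    var_mean_a = variance(a) / len(a)  # variance of the sample mean of a
    var_mean_b = variance(b) / len(b)  # variance of the sample mean of b
    effect = mean_b / mean_a - 1
    # First-order (delta method) variance of a ratio of independent means.
    se = math.sqrt(var_mean_b / mean_a**2 + mean_b**2 * var_mean_a / mean_a**4)
    q = NormalDist().inv_cdf(1 - alpha / 2)
    pvalue = 2 * (1 - NormalDist().cdf(abs(effect / se)))
    return pvalue, effect, (effect - q * se, effect + q * se)


# Hypothetical synthetic samples.
random.seed(1)
group_a = [random.gauss(600, 250) for _ in range(500)]
group_b = [random.gauss(650, 250) for _ in range(500)]

rel_pvalue, rel_effect, rel_ci = relative_effect(group_a, group_b, alpha=0.01)
print(rel_pvalue, rel_effect, rel_ci)
```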


Change the alternative from ``"two-sided"`` to ``"greater"``.

In [8]:
tester.run(
    effect_type='relative',
    method='theory',
    criterion='ttest',
    alternative='greater',
)

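|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.01 | 2e-05 | 0.079901 | (0.0347, inf) | watched | A | B |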
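The output above shows the typical pattern of a one-sided test with a positive effect: the p-value halves and the interval becomes one-sided with an infinite upper bound. The sketch below reproduces that pattern under a normal approximation; the helper ``z_test_summary`` and the input numbers are hypothetical and are not Ambrosia output.

```python
import math
from statistics import NormalDist


def z_test_summary(effect, se, alpha=0.05, alternative="two-sided"):
    """Return (pvalue, confidence_interval) for a normal-approximation test."""
    norm = NormalDist()
    z = effect / se
    if alternative == "two-sided":
        q = norm.inv_cdf(1 - alpha / 2)
        return 2 * (1 - norm.cdf(abs(z))), (effect - q * se, effect + q * se)
    if alternative == "greater":
        # All of alpha goes into one tail, so the lower bound moves up.
        q = norm.inv_cdf(1 - alpha)
        return 1 - norm.cdf(z), (effect - q * se, math.inf)
    raise ValueError(f"unsupported alternative: {alternative}")


# Hypothetical effect and standard error, for illustration only.
effect, se = 0.08, 0.02
p_two, ci_two = z_test_summary(effect, se, alpha=0.01)
p_greater, ci_greater = z_test_summary(effect, se, alpha=0.01, alternative="greater")
print(p_two, ci_two)
print(p_greater, ci_greater)
```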


Change the criterion to the Mann–Whitney test.

In [9]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='mw',
           metrics='watched',
           first_type_errors=0.01)

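|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.01 | 3.5e-05 | 43.598116 | (None, None) | watched | A | B |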
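The Mann–Whitney test works on ranks rather than on the raw values, which is one plausible reason why no confidence interval for the effect is reported above. Below is a minimal standard-library sketch of the U statistic with a normal-approximation p-value. It skips the tie correction in the variance, and the function name is hypothetical; in practice a library routine such as ``scipy.stats.mannwhitneyu`` would be the usual choice.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """U statistic of sample a and a two-sided normal-approximation p-value."""
    combined = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values and give each of them the mid-rank.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    rank_sum_a = sum(r for r, (_, label) in zip(ranks, combined) if label == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mu) / sigma
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, pvalue


u_low, p_low = mann_whitney_u([1, 2, 3], [4, 5, 6])
u_high, p_high = mann_whitney_u([4, 5, 6], [1, 2, 3])
u_tied, p_tied = mann_whitney_u([1, 2], [2, 3])
print(u_low, u_high, u_tied)
```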


Use the bootstrap criterion by changing ``method`` to ``"empiric"``.

In [10]:
tester.run(effect_type='absolute',
           method='empiric',
           metrics='watched',
           first_type_errors=0.01)

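|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.01 | 3.552714e-15 | 55.314679 | (21.2797, 88.1704) | watched | A | B |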
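The idea behind a bootstrap estimate can be sketched with a simple percentile bootstrap of the difference of means. This is an assumption about the general technique, not a description of what ``method='empiric'`` does internally; the function ``bootstrap_ci``, its parameters, and the synthetic samples are hypothetical.

```python
import random


def bootstrap_ci(a, b, alpha=0.05, n_resamples=1000, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a). Illustrative only."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_b) / len(b) - sum(resample_a) / len(a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    observed = sum(b) / len(b) - sum(a) / len(a)
    return observed, (lower, upper)


# Hypothetical synthetic samples.
random.seed(2)
group_a = [random.gauss(600, 250) for _ in range(300)]
group_b = [random.gauss(650, 250) for _ in range(300)]

boot_effect, boot_ci = bootstrap_ci(group_a, group_b, alpha=0.01)
print(boot_effect, boot_ci)
```

Because the interval comes from resampling rather than from a closed-form distribution, it generally differs slightly from the t-test interval for the same data, as the two outputs in this tutorial also show.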


To evaluate binary metrics, such as conversion, change ``method`` to ``"binary"``.
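For intuition about binary metrics, a classical two-proportion z-test is sketched below. It is a textbook test, not Ambrosia's ``"binary"`` implementation, which may use different interval constructions; the function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Difference of conversion rates (b - a) and a two-sided p-value."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis of equal conversion.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, pvalue


# Hypothetical counts: 100 of 1000 conversions in A, 130 of 1000 in B.
conv_effect, conv_pvalue = two_proportion_z_test(100, 1000, 130, 1000)
same_effect, same_pvalue = two_proportion_z_test(50, 1000, 50, 1000)
print(conv_effect, conv_pvalue)
```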

## Multiple hypothesis correction

``Tester`` can apply a multiple hypothesis correction (MHC) to p-values and confidence intervals. The total number of hypotheses equals the number of group pairs multiplied by the number of metrics passed.

By default the Bonferroni correction is applied, but it can be turned off by passing ``None`` as the ``correction_method`` argument.

Let's create several synthetic groups and look at the ``Tester`` behavior.

In [11]:
total_size = 1000
groups = ['A', 'B', 'C', 'D']

In [12]:
np.random.seed(42)
multi_groups_result = pd.DataFrame(np.random.normal(size=(total_size, 2)),
                                   columns=['metric_1', 'metric_2'])
multi_groups_result['groups'] = np.random.choice(groups, size=total_size)
multi_groups_result = multi_groups_result.sort_values('groups')

In [13]:
multi_tester = Tester(dataframe=multi_groups_result,
                      column_groups='groups',
                      metrics=['metric_1', 'metric_2'])

Here we have 6 unique pairs of groups and two metrics, that is, 12 hypotheses in total. With the Bonferroni correction, each p-value is multiplied by 12 (and capped at 1), and the confidence intervals are widened accordingly.
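The number of hypotheses can be checked with a few lines. The group and metric names come from the cells above; dividing the threshold by the number of hypotheses is the textbook form of the Bonferroni correction and is assumed here, not taken from Ambrosia's source.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]
metrics = ["metric_1", "metric_2"]

# Every unordered pair of groups is compared on every metric.
pairs = list(combinations(groups, 2))
n_hypotheses = len(pairs) * len(metrics)

alpha = 0.05
per_test_alpha = alpha / n_hypotheses  # Bonferroni-adjusted threshold

print(pairs)
print(n_hypotheses, per_test_alpha)
```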

In [14]:
multi_tester.run(method='theory')

|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.05 | 1.0 | -0.084442 | (-0.3213, 0.1524) | metric_1 | A | B |
| 1 | 0.05 | 1.0 | -0.102428 | (-0.3644, 0.1595) | metric_2 | A | B |
| 2 | 0.05 | 1.0 | 0.028641 | (-0.2191, 0.2764) | metric_1 | A | C |
| 3 | 0.05 | 1.0 | -0.142255 | (-0.4022, 0.1176) | metric_2 | A | C |
| 4 | 0.05 | 1.0 | 0.050312 | (-0.1946, 0.2952) | metric_1 | A | D |
| 5 | 0.05 | 1.0 | -0.063565 | (-0.3157, 0.1885) | metric_2 | A | D |
| 6 | 0.05 | 1.0 | 0.113082 | (-0.1351, 0.3613) | metric_1 | B | C |
| 7 | 0.05 | 1.0 | -0.039827 | (-0.3085, 0.2289) | metric_2 | B | C |
| 8 | 0.05 | 1.0 | 0.134753 | (-0.1107, 0.3802) | metric_1 | B | D |
| 9 | 0.05 | 1.0 | 0.038863 | (-0.2223, 0.3) | metric_2 | B | D |


When we turn the MHC off:

In [15]:
multi_tester.run(method='theory', correction_method=None)

|   | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|------------------|--------|--------|---------------------|-------------|---------------|---------------|
| 0 | 0.05 | 0.307529 | -0.084442 | (-0.2465, 0.0776) | metric_1 | A | B |
| 1 | 0.05 | 0.263036 | -0.102428 | (-0.2816, 0.0767) | metric_2 | A | B |
| 2 | 0.05 | 0.740575 | 0.028641 | (-0.1408, 0.1981) | metric_1 | A | C |
| 3 | 0.05 | 0.117435 | -0.142255 | (-0.32, 0.0355) | metric_2 | A | C |
| 4 | 0.05 | 0.556405 | 0.050312 | (-0.1172, 0.2179) | metric_1 | A | D |
| 5 | 0.05 | 0.470332 | -0.063565 | (-0.236, 0.1089) | metric_2 | A | D |
| 6 | 0.05 | 0.192379 | 0.113082 | (-0.0567, 0.2829) | metric_1 | B | C |
| 7 | 0.05 | 0.671231 | -0.039827 | (-0.2236, 0.144) | metric_2 | B | C |
| 8 | 0.05 | 0.116301 | 0.134753 | (-0.0331, 0.3026) | metric_1 | B | D |
| 9 | 0.05 | 0.670008 | 0.038863 | (-0.1398, 0.2175) | metric_2 | B | D |
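The two outputs can be related by hand. The sketch below takes the uncorrected p-values listed above, applies a Bonferroni adjustment for 12 hypotheses, and estimates how much the confidence intervals widen. It uses normal quantiles as an approximation to the t-distribution quantiles, so the widening factor is approximate.

```python
from statistics import NormalDist

# Uncorrected p-values copied from the output above.
raw_pvalues = [0.307529, 0.263036, 0.740575, 0.117435, 0.556405,
               0.470332, 0.192379, 0.671231, 0.116301, 0.670008]
n_hypotheses = 12
alpha = 0.05

# Bonferroni: multiply each p-value by the number of hypotheses, cap at 1.
adjusted_pvalues = [min(1.0, p * n_hypotheses) for p in raw_pvalues]

# Approximate ratio of corrected to uncorrected CI half-widths.
norm = NormalDist()
widening = norm.inv_cdf(1 - alpha / (2 * n_hypotheses)) / norm.inv_cdf(1 - alpha / 2)

print(adjusted_pvalues)
print(widening)
```

Even the smallest uncorrected p-value exceeds 1/12, so every adjusted p-value is capped at 1, which matches the corrected output.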


---

## Learn more

More information on evaluating experiment effects with *Ambrosia* is available.

See:

* ``Tester`` class documentation
* An example of making statistical inference and effect estimation on Spark data