# Using the ``Splitter`` class to solve the group splitting problem

In this tutorial we will use *Ambrosia* splitting tools to create groups using different strategies.

The group splitting problem usually arises in A/B testing, once the experiment parameters have been designed and we need to form experimental groups from the objects under study.

## Two different splitting paradigms

Approaches to splitting objects into groups fall into two paradigms: *batch* and *real-time* splitting.

In batch splitting, we precalculate the contents of the experimental groups in advance, for example from a common database of research objects. \
In real-time splitting, a tool assigns objects to groups as they arrive, although it may also rely on some precalculated information.

The rest of this tutorial covers the tools for batch splitting.

**Note:** *Ambrosia* currently supports only batch splitting. Real-time splitting tools are under development.

## Let's start the tutorial

In [1]:
import sys, os
sys.path.insert(1, os.path.realpath(os.path.pardir))

In [2]:
import pandas as pd
import numpy as np

import yaml

from ambrosia.splitter import Splitter, split, load_from_config



Generate synthetic data with a number of different columns.\
We will create 200000 objects, each with a unique id and some numerical features.

In [3]:
np.random.seed(42)

dataframe = pd.DataFrame({
    'm': np.zeros((200000, )),
    'a': np.random.normal(size=200000),
    'b': np.random.normal(size=200000)
})
dataframe['l'] = np.where(dataframe['a'] > 0, 1, 0)
dataframe['e'] = np.where(dataframe['b'] > 0, 1, 0)
dataframe['object_id'] = np.random.choice(dataframe.index,
                                          size=dataframe.shape[0],
                                          replace=False)
dataframe.head()

Unnamed: 0,m,a,b,l,e,object_id
0,0.0,0.496714,1.561841,1,1,63869
1,0.0,-0.138264,-0.094228,0,0,82374
2,0.0,0.647689,-1.329536,1,0,162918
3,0.0,1.52303,-1.388638,1,0,36327
4,0.0,-0.234153,-0.342651,0,0,91526


In [4]:
dataframe.shape

(200000, 6)

Now let's get acquainted with the ``Splitter`` class.

The ``Splitter`` class is *Ambrosia's* main tool for splitting objects into groups. It has one main public method, ``run()``, which returns a table with groups of the desired size.

Let's create an instance of the class and pass the generated ``dataframe`` to the constructor *(think of this data as an abstract user database)*. It will be used below to create groups with different methods. We also set ``id_column`` to ``"object_id"``, the column that contains unique object identifiers. If this column is not specified, the dataframe index is used as the identifier.

In [5]:
splitter = Splitter(dataframe=dataframe, id_column='object_id')

As with the ``Designer`` class, we can instead pass the dataframe and other parameters later as arguments to the ``run()`` method. The same holds for most parameters of the split (group size, stratification columns, and so on): either pass them to the constructor, in which case they become attributes of the instance, or pass them when calling ``run()``. If a parameter is set in both places, the argument of the method takes precedence over the attribute value.
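
The precedence rule can be pictured with a small stand-in class. ``MiniSplitter`` below is a hypothetical illustration written for this tutorial, not *Ambrosia* code; it only shows one common way a method argument can override a constructor attribute.

```python
from typing import Optional


class MiniSplitter:
    """Hypothetical stand-in, used only to illustrate argument precedence."""

    def __init__(self, groups_size: Optional[int] = None):
        self.groups_size = groups_size

    def run(self, groups_size: Optional[int] = None) -> int:
        # The method argument wins; the attribute is only a fallback.
        chosen = groups_size if groups_size is not None else self.groups_size
        if chosen is None:
            raise ValueError("groups_size must be set in the constructor or in run()")
        return chosen


mini = MiniSplitter(groups_size=2000)
print(mini.run())                 # falls back to the attribute
print(mini.run(groups_size=500))  # the argument takes precedence
```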

Now let's move on to review different ways to create groups that are implemented in the ``Splitter`` class.

## Split approaches

### Simple split

The first splitting strategy is called ``"simple"``. It is a very simple, non-deterministic way of creating groups: each execution produces a new result.
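
As a rough mental model, a simple split can be thought of as random sampling without replacement. The sketch below assumes exactly that and uses only the standard library; ``simple_split`` is a hypothetical helper, not the *Ambrosia* implementation.

```python
import random


def simple_split(ids, groups_size, groups_number=2, rng=None):
    """Draw disjoint random groups of equal size from ids."""
    rng = rng or random.Random()
    # Sample without replacement, then cut the sample into consecutive chunks.
    chosen = rng.sample(list(ids), groups_size * groups_number)
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    return {
        label: chosen[i * groups_size:(i + 1) * groups_size]
        for i, label in enumerate(labels)
    }


groups = simple_split(range(200_000), groups_size=2000)
print({label: len(members) for label, members in groups.items()})
```

Because no seed is fixed, running the sketch twice gives different groups, which mirrors the non-deterministic behavior described above.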

To create such a split, we call the ``run()`` method with the corresponding value of the ``method`` parameter. We will create groups of 2000 objects each.

In [6]:
splitter.run(method='simple', groups_size=2000)

Unnamed: 0,m,a,b,l,e,object_id,group
191060,0.0,-0.230298,1.253592,0,1,136859,A
121593,0.0,1.974664,-1.780258,1,0,164797,A
185512,0.0,-1.254767,-0.152099,0,0,49954,A
79803,0.0,-1.572960,-0.706893,0,0,154922,A
98956,0.0,0.714251,0.662607,1,1,99718,A
...,...,...,...,...,...,...,...
53739,0.0,0.070655,0.644952,1,1,62827,B
178405,0.0,-0.423988,-0.706336,0,0,103080,B
95002,0.0,-0.105022,0.714893,0,1,155745,B
166811,0.0,-1.459109,0.339358,0,1,157092,B


### Hash split 

The hash split strategy hashes object identifiers and distributes the resulting hash values into groups. \
This method produces a deterministic split. There is also no need to store tables with the assigned group labels, because the labels can be restored at any time by re-running the split.

To make the splits for each experiment unique, the ``"salt"`` parameter is used, which is appended to the end of the identifier of each object. The salt value can be, for example, the name of the experiment being performed.

You can read more about hash-based splitting on the web.
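
The idea can be sketched in a few lines. The example below is an assumption-laden illustration, not *Ambrosia's* actual algorithm: it appends the salt to each identifier, hashes the result with MD5 (an arbitrary choice), orders the identifiers by hash, and cuts consecutive groups from that order. ``hash_split`` is a hypothetical helper.

```python
import hashlib


def hash_split(ids, groups_size, salt, groups_number=2):
    """Deterministically pick groups by ordering ids on a salted hash."""
    def salted_hash(object_id):
        # The salt is appended to the identifier before hashing.
        return hashlib.md5(f"{object_id}{salt}".encode("utf-8")).hexdigest()

    ordered = sorted(ids, key=salted_hash)
    return {
        chr(ord("A") + i): ordered[i * groups_size:(i + 1) * groups_size]
        for i in range(groups_number)
    }


ids = range(10_000)
first = hash_split(ids, groups_size=500, salt="example_dummy_experiment_2023")
again = hash_split(ids, groups_size=500, salt="example_dummy_experiment_2023")
other = hash_split(ids, groups_size=500, salt="salt")
print(first == again)  # same salt: the groups are reproduced
print(first == other)  # different salt: the groups change
```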

Let's create a hash split and make sure the result is deterministic

In [7]:
groups_size = 5000
salt = 'example_dummy_experiment_2023'

Execute split with pre-defined salt value

In [8]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)

Unnamed: 0,m,a,b,l,e,object_id,group
14,0.0,-1.724918,-0.350186,0,0,90837,A
44,0.0,-1.478522,0.166608,0,1,123196,A
64,0.0,0.812526,0.914659,1,1,117133,A
65,0.0,1.356240,0.731410,1,1,144787,A
161,0.0,0.787085,-1.012367,1,0,186437,A
...,...,...,...,...,...,...,...
199760,0.0,0.172396,0.844596,1,1,166816,B
199783,0.0,-0.477993,-0.899310,0,0,134168,B
199867,0.0,-1.164759,-0.649031,0,0,41423,B
199868,0.0,0.162848,2.835048,1,1,33513,B


Then get the same groups for the same salt value

In [9]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)

Unnamed: 0,m,a,b,l,e,object_id,group
14,0.0,-1.724918,-0.350186,0,0,90837,A
44,0.0,-1.478522,0.166608,0,1,123196,A
64,0.0,0.812526,0.914659,1,1,117133,A
65,0.0,1.356240,0.731410,1,1,144787,A
161,0.0,0.787085,-1.012367,1,0,186437,A
...,...,...,...,...,...,...,...
199760,0.0,0.172396,0.844596,1,1,166816,B
199783,0.0,-0.477993,-0.899310,0,0,134168,B
199867,0.0,-1.164759,-0.649031,0,0,41423,B
199868,0.0,0.162848,2.835048,1,1,33513,B


The split result will be different if the salt is changed

In [10]:
splitter.run(method='hash', groups_size=groups_size, salt='salt')

Unnamed: 0,m,a,b,l,e,object_id,group
43,0.0,-0.301104,0.440295,0,1,139147,A
192,0.0,0.214094,0.021427,1,1,231,A
226,0.0,0.064280,1.553626,1,1,139761,A
235,0.0,0.633919,-1.277988,1,0,153281,A
285,0.0,-1.952088,1.610653,0,1,36040,A
...,...,...,...,...,...,...,...
199862,0.0,2.035899,0.452816,1,1,34064,B
199949,0.0,0.438721,-0.592572,1,0,99013,B
199970,0.0,0.868163,0.463027,1,1,53783,B
199991,0.0,0.383196,0.230814,1,1,199822,B


If no salt argument is passed, a random value will be generated during the split.

**The hash splitting method is fast and convenient, and is recommended as the default.**

### Metric split

For some tasks, it is very useful to find similar objects and distribute them into groups. For example, we can choose a random object in group A and from the general pool find the closest neighbor to it by some metric and send it to group B. This will make the groups more similar and increase the power of some statistical tests, which is especially valuable for small groups.

This approach is implemented in the ``"metric"`` split method. Using the ``fit_columns`` parameter, we specify a set of features; pairs of similar objects are then selected by minimizing the Euclidean distance over these features and distributed between the groups.
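
To make the pairing idea concrete, here is a brute-force sketch under simple assumptions: group A is drawn at random, and each of its members is matched greedily with the closest still-unassigned point. ``metric_pairs`` is a hypothetical helper; *Ambrosia* may search for neighbors differently.

```python
import math
import random


def metric_pairs(points, groups_size, rng=None):
    """Greedy nearest-neighbour pairing: returns (group_a, group_b) index lists."""
    rng = rng or random.Random()
    available = set(range(len(points)))
    group_a = rng.sample(sorted(available), groups_size)
    available -= set(group_a)
    group_b = []
    for idx in group_a:
        # Closest still-unassigned point by Euclidean distance.
        nearest = min(available, key=lambda j: math.dist(points[idx], points[j]))
        available.remove(nearest)
        group_b.append(nearest)
    return group_a, group_b


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2_000)]
group_a, group_b = metric_pairs(points, groups_size=50, rng=rng)
# Pairs share a position: group_a[i] is matched with group_b[i].
print(points[group_a[0]], points[group_b[0]])
```

The inner ``min`` scans the whole pool for every member of group A, which hints at why this kind of split is more expensive than the simple or hash split.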

We will create two groups using a metric split based on the two features ``a`` and ``b``. Note that the metric split needs substantial computational resources, since it searches for the nearest neighbors of as many points as there are objects in one group.

In [11]:
metric_split = splitter.run(method='metric', groups_size=groups_size, fit_columns=['a', 'b'])

In [12]:
metric_split

Unnamed: 0,m,a,b,l,e,object_id,group
199994,0.0,-0.590488,-0.518154,0,0,12123,A
80866,0.0,0.436653,-0.458537,1,0,11556,A
128000,0.0,0.448011,-0.555275,1,0,71871,A
95833,0.0,0.514975,1.088812,1,1,149913,A
41929,0.0,-1.537990,-0.270142,0,0,54406,A
...,...,...,...,...,...,...,...
191916,0.0,-1.306423,-0.777014,0,0,13089,B
57853,0.0,-0.490125,1.742080,0,1,159390,B
189321,0.0,-1.759917,0.181625,0,1,153730,B
92099,0.0,-0.972475,0.624865,0,1,13100,B


Currently, paired objects occupy the same positions within their group slices, and this is the only way to find them if you want to inspect pairs individually.

In [13]:
metric_split.query("group == 'A'").iloc[0]

m                 0.0
a           -0.590488
b           -0.518154
l                   0
e                   0
object_id       12123
group               A
Name: 199994, dtype: object

In [14]:
metric_split.query("group == 'B'").iloc[0]

m                 0.0
a            -0.59219
b           -0.519087
l                   0
e                   0
object_id      145596
group               B
Name: 111639, dtype: object

**Note:** Metric split creates pairs (or sets in the case of multiple groups) of dependent objects between groups. This leads to the need to **use paired statistical tests**.
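
For reference, the statistic of a paired t-test is computed from the within-pair differences rather than from the two samples separately. The sketch below uses made-up metric values purely for illustration; ``paired_t_statistic`` is a hypothetical helper, and in practice a statistics library would also supply the p-value.

```python
import math
import statistics


def paired_t_statistic(group_a, group_b):
    """t statistic of the paired test: mean difference over its standard error."""
    if len(group_a) != len(group_b):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(group_a, group_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Made-up metric values; position i in both lists belongs to the same pair.
metric_a = [10.1, 12.3, 9.8, 11.5, 10.9]
metric_b = [10.6, 12.9, 10.1, 12.2, 11.3]
print(round(paired_t_statistic(metric_a, metric_b), 2))
```

The statistic is compared against a t distribution with ``n - 1`` degrees of freedom, where ``n`` is the number of pairs.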

## Stratification

Groups can also be sampled with stratification.

Stratification makes groups more homogeneous and more similar to the general population from which they were sampled, and it reduces the variance of metrics in the groups. This is especially useful for small groups.
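
A minimal sketch of the idea, assuming proportional allocation: each stratum contributes to the sample in proportion to its share of the population. ``stratified_sample`` is a hypothetical helper, and the synthetic population below is unrelated to the tutorial's dataframe.

```python
import random
from collections import defaultdict


def stratified_sample(records, strat_key, sample_size, rng=None):
    """Sample so that each stratum keeps roughly its population share."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for record in records:
        strata[record[strat_key]].append(record)
    sample = []
    for members in strata.values():
        # Quota proportional to the stratum's share; rounding may shift the total slightly.
        quota = round(sample_size * len(members) / len(records))
        sample.extend(rng.sample(members, quota))
    return sample


rng = random.Random(42)
population = [{"object_id": i, "l": int(rng.random() < 0.3)} for i in range(10_000)]
sample = stratified_sample(population, "l", sample_size=500, rng=rng)
population_share = sum(r["l"] for r in population) / len(population)
sample_share = sum(r["l"] for r in sample) / len(sample)
print(f"population: {population_share:.3f}, sample: {sample_share:.3f}")
```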

To demonstrate, let's choose a binary column for stratification, pass it to the ``strat_columns`` parameter, and compare the feature shares with and without stratification.

In [15]:
groups_size = 500

In [16]:
stratified_split = splitter.run(method='simple', groups_size=groups_size, strat_columns=['l'])
non_stratified_split = splitter.run(method='simple', groups_size=groups_size)

In [17]:
print(f'Initial share of strata: {dataframe["l"].mean() * 100:.1f}%')

Initial share of strata: 50.1%


Share of strata inside the splits

In [18]:
print(
    f'Share of strata in groups with stratification: {stratified_split["l"].mean() * 100:.1f}%'
)
print(
    f'Share of strata in groups without stratification: {non_stratified_split["l"].mean() * 100:.1f}%'
)

Share of strata in groups with stratification: 50.1%
Share of strata in groups without stratification: 51.3%


Share of strata inside the groups

In [19]:
print('Share of strata in each group with stratification\n',
      np.round(stratified_split.groupby('group')['l'].mean(), 3))
print('\n\nShare of strata in each group without stratification\n',
      np.round(non_stratified_split.groupby('group')['l'].mean(), 3))

Share of strata in each group with stratification
 group
A    0.500
B    0.502
Name: l, dtype: float64


Share of strata in each group without stratification
 group
A    0.514
B    0.512
Name: l, dtype: float64


## Multigroup split

Often two experimental groups are not enough, for example when we want to test several new recommender system algorithms at once. In that scenario one may want an A/B/C/... split.

In *Ambrosia*, all the functions and methods above generalize to splits into several groups; the number of groups is controlled by the ``groups_number`` parameter.

Let's create 3 groups using metric split

In [20]:
metric_multisplit = splitter.run(method='metric',
                                 groups_size=groups_size,
                                 fit_columns=['a', 'b'],
                                 groups_number=3)

In [21]:
metric_multisplit.query("group == 'A'").iloc[0]

m                 0.0
a            0.612142
b            0.115751
l                   1
e                   1
object_id      182967
group               A
Name: 74537, dtype: object

In [22]:
metric_multisplit.query("group == 'B'").iloc[0]

m                 0.0
a            0.614548
b            0.116243
l                   1
e                   1
object_id      197300
group               B
Name: 118068, dtype: object

In [23]:
metric_multisplit.query("group == 'C'").iloc[0]

m                 0.0
a            0.613116
b            0.115762
l                   1
e                   1
object_id       77456
group               C
Name: 150076, dtype: object

And now create 10 groups using hash method

In [24]:
hash_multisplit = splitter.run(method='hash',
                               groups_size=1000,
                               groups_number=10)

In [25]:
hash_multisplit

Unnamed: 0,m,a,b,l,e,object_id,group
16,0.0,-1.012831,0.747465,0,1,155174,A
245,0.0,-0.334501,0.615637,0,1,199630,A
710,0.0,0.211017,1.701546,1,1,156984,A
920,0.0,1.073632,0.596244,1,1,78818,A
1159,0.0,-0.324831,0.547028,0,1,40862,A
...,...,...,...,...,...,...,...
198886,0.0,2.243574,0.782718,1,1,134662,J
199060,0.0,1.154370,0.258987,1,1,195179,J
199812,0.0,-2.566508,0.553087,0,1,165028,J
199946,0.0,0.934797,-1.368926,1,0,25702,J


In [26]:
hash_multisplit.group.value_counts()

A    1000
B    1000
C    1000
D    1000
E    1000
F    1000
G    1000
H    1000
I    1000
J    1000
Name: group, dtype: int64

## Splitting the full table

Sometimes one needs to divide an entire table into groups. At the moment, *Ambrosia* can split a data frame into 2 groups using the ``part_of_table`` parameter.

We will split the data frame with the hash method so that group A receives 1/3 of the table and group B the remaining 2/3.
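
One way to picture a fractional hash split is to map each salted hash to a number between 0 and 1 and compare it with the requested fraction. This is only a sketch under that assumption: ``fractional_hash_split`` is a hypothetical helper, MD5 is an arbitrary choice, and the resulting share is approximate, so *Ambrosia's* own implementation may well work differently.

```python
import hashlib


def fractional_hash_split(ids, part_of_table, salt):
    """Label every id 'A' or 'B'; about part_of_table of them land in 'A'."""
    labels = {}
    for object_id in ids:
        digest = hashlib.md5(f"{object_id}{salt}".encode("utf-8")).hexdigest()
        # Map the 128-bit hash to a number between 0 and 1.
        position = int(digest, 16) / 16 ** 32
        labels[object_id] = "A" if position < part_of_table else "B"
    return labels


labels = fractional_hash_split(range(30_000), part_of_table=1 / 3, salt="fractional_split")
share_a = sum(label == "A" for label in labels.values()) / len(labels)
print(f"share of group A: {share_a:.3f}")
```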

In [27]:
part_of_table = 1/3
fractional_hash_split = splitter.run(method='hash',
                                     part_of_table=part_of_table,
                                     salt='fractional_split')

In [28]:
fractional_hash_split

Unnamed: 0,m,a,b,l,e,object_id,group
1,0.0,-0.138264,-0.094228,0,0,82374,A
3,0.0,1.523030,-1.388638,1,0,36327,A
5,0.0,-0.234137,-1.580520,0,0,63304,A
6,0.0,1.579213,0.587148,1,1,187546,A
15,0.0,-0.562288,-1.362157,0,0,133839,A
...,...,...,...,...,...,...,...
199991,0.0,0.383196,0.230814,1,1,199822,B
199994,0.0,-0.590488,-0.518154,0,0,12123,B
199996,0.0,0.565654,-2.316381,1,0,147356,B
199998,0.0,0.855673,0.462531,1,1,132270,B


In [29]:
fractional_hash_split.group.value_counts(normalize=True)

B    0.666665
A    0.333335
Name: group, dtype: float64

## Creating a control group for an existing test group

Another scenario that sometimes occurs in A/B testing is the post-generation of a control group: the test group already exists, and the control group must be drawn from the pool of objects that were not affected by the treatment.

Although post-analysis of experiments and post-generation of samples call for care, *Ambrosia* can create a control group for an existing test group using all the methods above.

To do this, pass a list of identifiers from the test group to the ``test_group_ids`` parameter.
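
Conceptually, the test identifiers are removed from the pool and the control group is drawn from what remains. The sketch below assumes a plain random draw and a control group of the same size as the test group; ``control_for_test`` is a hypothetical helper, not the *Ambrosia* API.

```python
import random


def control_for_test(all_ids, test_ids, rng=None):
    """Sample a control group, as large as the test group, from untreated ids."""
    rng = rng or random.Random()
    test_set = set(test_ids)
    pool = [object_id for object_id in all_ids if object_id not in test_set]
    control = rng.sample(pool, len(test_set))
    # Control objects go to group A, the existing test objects stay in group B.
    return {"A": control, "B": list(test_ids)}


rng = random.Random(42)
all_ids = list(range(50_000))
test_ids = rng.sample(all_ids, 1_000)
groups = control_for_test(all_ids, test_ids, rng=rng)
print(len(groups["A"]), len(set(groups["A"]) & set(groups["B"])))
```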

In [30]:
np.random.seed(42)
groups_size = 10000
test_ids = np.random.choice(dataframe.object_id, size=groups_size, replace=False)

In [31]:
post_hash_split = splitter.run(method='hash',
                               groups_size=groups_size,
                               test_group_ids=test_ids,
                               salt='post-split')

In [32]:
post_hash_split

Unnamed: 0,m,a,b,l,e,object_id,group
24,0.0,-0.544383,0.242347,0,1,37916,A
94,0.0,-0.392108,0.026810,0,1,12345,A
136,0.0,-0.783253,-1.911507,0,0,94132,A
152,0.0,-0.680025,-0.649503,0,0,87931,A
155,0.0,-0.714351,0.266708,0,1,122498,A
...,...,...,...,...,...,...,...
199945,0.0,2.866851,-0.048362,1,0,146867,B
199963,0.0,-0.935605,0.169802,0,1,116839,B
199976,0.0,-0.811208,-0.931313,0,0,18652,B
199977,0.0,0.547009,0.330221,1,1,180593,B


Check that all objects with test ids are in group B and not in A

In [33]:
np.isin(test_ids, post_hash_split.query("group == 'A'").object_id).sum()

0

In [34]:
np.isin(test_ids, post_hash_split.query("group == 'B'").object_id).sum()

10000

## Storable configuration

Sometimes it is convenient to save the created class instance to a file, so that it can later be loaded and reused with preselected attributes. Attributes such as datasets are not serialized and must be set after the instance is loaded. \
The ``load_from_config`` function restores an instance directly from a ``.yaml`` file.
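
The fact that datasets are left out of the stored state can be illustrated with the ``__getstate__`` / ``__setstate__`` protocol. ``ConfigOnlySplitter`` below is a hypothetical stand-in, and JSON is used only to keep the sketch dependency-free; the tutorial itself stores the instance as YAML.

```python
import json


class ConfigOnlySplitter:
    """Hypothetical stand-in showing which attributes survive serialization."""

    def __init__(self, dataframe=None, id_column=None, groups_size=None):
        self.dataframe = dataframe
        self.id_column = id_column
        self.groups_size = groups_size

    def __getstate__(self):
        # Only lightweight parameters are stored; the data is left out.
        return {"id_column": self.id_column, "groups_size": self.groups_size}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.dataframe = None  # must be set again after loading


original = ConfigOnlySplitter(dataframe=[1, 2, 3], id_column="object_id", groups_size=322)
stored = json.dumps(original.__getstate__())

# Rebuild an instance from the stored parameters without calling __init__.
restored = ConfigOnlySplitter.__new__(ConfigOnlySplitter)
restored.__setstate__(json.loads(stored))
print(restored.__getstate__(), restored.dataframe)
```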

Let's create an instance with the parameters we want to save in a file

In [35]:
store_path = '_examples_configs/splitter_config.yaml'

In [36]:
storable_splitter = Splitter(id_column='object_id',
                             groups_size=322,
                             strat_columns=['l', 'e'])

In [37]:
storable_splitter.__getstate__()

{'id_column': 'object_id',
 'groups_size': 322,
 'fit_columns': None,
 'strat_columns': ['l', 'e']}

Save as the ``.yaml`` file

In [38]:
with open(store_path, "w") as outfile:
    yaml.dump(storable_splitter, outfile, default_flow_style=False)

Load from the file

In [39]:
loaded_splitter = load_from_config(store_path)

In [40]:
loaded_splitter.__getstate__()

{'id_column': 'object_id',
 'groups_size': 322,
 'fit_columns': None,
 'strat_columns': ['l', 'e']}

Set dataframe and make some split

In [41]:
loaded_splitter.set_dataframe(dataframe)

In [42]:
loaded_splitter.run(method='hash', salt='from_yaml')

Unnamed: 0,m,a,b,l,e,object_id,group
2198,0.0,-0.641487,-0.658645,0,0,20742,A
2642,0.0,-0.847634,-0.043161,0,0,120722,A
7596,0.0,-1.572989,-1.090749,0,0,24930,A
7688,0.0,-0.233898,-1.196993,0,0,5767,A
10737,0.0,-1.000271,-0.298924,0,0,187434,A
...,...,...,...,...,...,...,...
197562,0.0,2.832918,0.327783,1,1,41784,B
132857,0.0,0.286866,-0.218729,1,0,11805,A
36947,0.0,-0.893407,-0.847359,0,0,175180,A
27596,0.0,1.459311,1.463153,1,1,65061,B


## Stand-alone split function

*Ambrosia* also contains the ``split`` function, which replicates the behavior of the ``Splitter`` class and can be used for the same split tasks.

In [43]:
split(method='simple',
      dataframe=dataframe,
      id_column='object_id',
      groups_size=1000)

Unnamed: 0,m,a,b,l,e,object_id,group
196900,0.0,0.050390,0.819410,1,1,14646,A
43323,0.0,-3.393915,-0.281103,0,0,36721,A
171194,0.0,-0.177053,0.248580,0,1,156469,A
185518,0.0,0.757116,0.366774,1,1,53901,A
10289,0.0,-0.751167,2.058094,0,1,134394,A
...,...,...,...,...,...,...,...
69994,0.0,-1.329803,-0.023193,0,0,24351,B
114139,0.0,0.714640,1.330547,1,1,99421,B
142263,0.0,-0.590689,-0.951325,0,0,97706,B
86716,0.0,-0.601845,0.501994,0,1,83325,B


---

## Learn more

More information about group splitting with *Ambrosia* is available.

Check:

* ``Splitter`` class documentation
* An example of splitting groups from a Spark DataFrame (currently has limited functionality)