This notebook demonstrates the [ApplyTransformer](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/dsmatch/sklearnmodeling/models/applytransformer.py), which operates in parallel on split data, and the [GroupBySplitter](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/dsmatch/util/datasplitters.py), which ensures that data splits occur along semantic lines, as a Pandas `groupby` would draw them. That is, all members of a group are guaranteed to land in the same data chunk sent to a single processor, so the applied function can operate on cohesive sets rather than on individual, arbitrary rows, which is what the ApplyTransformer's default splitting scheme produces.

**Author:** Tom McTavish

**Date:** September 14, 2020

In [1]:
import pandas as pd
import numpy as np

from dhi.dsmatch.util.datasplitters import GroupBySplitter
from dhi.dsmatch.util.parallel import ParallelFileLogger
from dhi.dsmatch.sklearnmodeling.models.applytransformer import ApplyTransformer
from dhi.dsmatch.sklearnmodeling.functiontransformermapper import try_func

# Create a toy DataFrame.

In the DataFrame we create below, `a` counts down from `n_rows` to 1, so it is easy to tell apart from the index. `b` takes on one of 3 values in the range [0, 2], and `c` takes on one of 5 values in the range [0, 4].

In [2]:
grpby_cols = ['b', 'c']
n_rows = 20
n_splits = 4
df = pd.DataFrame(np.arange(n_rows*3).reshape(n_rows,3), columns=['a', 'b', 'c'])
df['a'] = np.arange(n_rows, 0, -1)
df['b'] = np.arange(n_rows) % 3
df['c'] = np.arange(n_rows, 0, -1) % 5
df

Unnamed: 0,a,b,c
0,20,0,0
1,19,1,4
2,18,2,3
3,17,0,2
4,16,1,1
5,15,2,0
6,14,0,4
7,13,1,3
8,12,2,2
9,11,0,1


# Split.

Given this dataframe, let's try 4 splits grouping by `b` and `c`.

In [3]:
print(f'Number of groups to split: {df.groupby(["b", "c"]).ngroups}')

Number of groups to split: 15


We are trying to fit 15 groups into 4 subframes. The cell below shows how this divides out: 3 frames have 4 groups and the last frame has 3 groups. Because the 20 rows fall into 15 groups, 5 of the groups contain two rows each.

In [4]:
n_splits = 4
splitter = GroupBySplitter(['b', 'c'])
for X in splitter.split(df, n_splits):
    display(X)

Unnamed: 0,a,b,c
0,20,0,0
1,5,0,0
4,18,2,3
5,3,2,3
14,11,0,1
18,7,1,2


Unnamed: 0,a,b,c
8,16,1,1
9,1,1,1
11,14,0,4
13,12,2,2
15,10,1,0


Unnamed: 0,a,b,c
2,19,1,4
3,4,1,4
10,15,2,0
17,8,0,3
19,6,2,1


Unnamed: 0,a,b,c
6,17,0,2
7,2,0,2
12,13,1,3
16,9,2,4
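To make the idea of group-aware splitting concrete, here is a minimal sketch written with plain pandas and NumPy. It is not the `GroupBySplitter` implementation: the helper name `split_by_groups` is hypothetical, and it assigns groups to chunks in sorted key order, so the chunk membership differs from the output above. The only property it illustrates is that rows sharing a `(b, c)` key always end up in the same chunk.

```python
import numpy as np
import pandas as pd

# Rebuild the toy DataFrame so the sketch is self-contained.
n_rows = 20
toy = pd.DataFrame({
    'a': np.arange(n_rows, 0, -1),
    'b': np.arange(n_rows) % 3,
    'c': np.arange(n_rows, 0, -1) % 5,
})


def split_by_groups(df, by, n_splits):
    """Yield up to `n_splits` sub-frames; rows with the same key stay together.

    Hypothetical helper for illustration only. Assumes `df` is non-empty.
    """
    # One integer label per row, identifying the group the row belongs to.
    group_ids = df.groupby(by).ngroup()
    n_groups = int(group_ids.max()) + 1
    # Divide the group labels (not the rows) into nearly equal parts.
    for ids in np.array_split(np.arange(n_groups), n_splits):
        if len(ids):
            yield df[group_ids.isin(ids)]


chunks = list(split_by_groups(toy, ['b', 'c'], 4))
groups_per_chunk = [chunk.groupby(['b', 'c']).ngroups for chunk in chunks]
```

Dividing the group labels rather than the rows is what keeps each group whole. With 15 groups and 4 splits, `groups_per_chunk` has the same 4, 4, 4, 3 shape as the split shown above.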


# Apply a transform.

Use the [`ApplyTransformer`](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/dsmatch/sklearnmodeling/models/applytransformer.py) to apply a function to the DataFrame, one subframe at a time. In this example, we group by `b` and `c` and take the mean of `a` within each group.

This also provides an example of the [`ParallelFileLogger`](https://bitbucket.org/dhigroupinc/dhi-match-datascience/src/master/dsmatch/util/parallel.py). We instantiate a `ParallelFileLogger` that writes to the text file `datasplit.log` and pass it to the applied function, which illustrates one way to log during parallel computations. With this DataFrame, `chunksize=5` and `min_chunksize=1` give us the same 4 subframes as shown above, each subframe on a different processor.
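For readers without access to `ParallelFileLogger`, the general pattern of logging from concurrent workers into one file can be sketched with the standard library alone. This is an assumption about the kind of mechanism involved, not a description of how `ParallelFileLogger` works. The logger name, the temporary log path, and the `work` function are all hypothetical, and the sketch uses threads for portability; with separate processes, a `multiprocessing.Queue` would be needed in place of `queue.Queue`.

```python
import logging
import logging.handlers
import os
import queue
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log path for this sketch (not the notebook's datasplit.log).
log_path = os.path.join(tempfile.mkdtemp(), 'datasplit_sketch.log')

# Workers only enqueue records; a single listener thread owns the file.
log_queue = queue.Queue()
file_handler = logging.FileHandler(log_path)
listener = logging.handlers.QueueListener(log_queue, file_handler)

logger = logging.getLogger('datasplit_sketch')
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))


def work(chunk_id):
    """Stand-in for a per-chunk function that logs what it received."""
    logger.info('processing chunk %d', chunk_id)
    return chunk_id * 2


listener.start()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(4)))
listener.stop()       # processes everything still queued, then joins the thread
file_handler.close()

with open(log_path) as fh:
    log_lines = fh.read().splitlines()
```

Routing every record through one queue means only the listener writes to the file, so messages from different workers cannot interleave within a line.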

In [5]:
import os
try:
    os.remove('datasplit.log')
except FileNotFoundError:
    pass

pfl = ParallelFileLogger(filename='datasplit.log')

def mean_a(df, pfl):
    """Get the mean of column `a` when we group by [`b`, `c`]."""
    by_grp = df.groupby(['b', 'c'])
    logger = pfl.get_logger()
    logger.info(f'Splitting df with {df.shape[0]} rows and {by_grp.ngroups} groups\n{df}\n')
    return by_grp['a'].mean()

txfmr = ApplyTransformer(try_func, mean_a, fkwargs=dict(pfl=pfl), 
                         use_tqdm=True, split_func=splitter.split, chunksize=5, min_chunksize=1)
df_ = txfmr.transform(df).to_frame('mean_a').reset_index()
df_


Unnamed: 0,b,c,mean_a
0,0,0,12.5
1,0,1,11.0
2,1,2,7.0
3,2,3,10.5
4,0,4,14.0
5,1,0,10.0
6,1,1,8.5
7,2,2,12.0
8,0,3,8.0
9,1,4,11.5


Above, we have a transformed DataFrame that contains the mean of `a` when grouped by `b` and `c`. As we see below, the original DataFrame is still intact.

In [6]:
df

Unnamed: 0,a,b,c
0,20,0,0
1,19,1,4
2,18,2,3
3,17,0,2
4,16,1,1
5,15,2,0
6,14,0,4
7,13,1,3
8,12,2,2
9,11,0,1
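The overall split, apply, and combine flow in this notebook can also be sketched without the `dhi.dsmatch` package. The sketch below is an illustration under stated assumptions, not the `ApplyTransformer` implementation: it uses a thread pool from `concurrent.futures` rather than separate processes, and it chunks by `b` alone, which is a simple way to keep every `(b, c)` group inside a single chunk. The names `mean_a_chunk`, `parallel_means`, and `serial_means` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Rebuild the toy DataFrame so the sketch is self-contained.
n_rows = 20
toy = pd.DataFrame({
    'a': np.arange(n_rows, 0, -1),
    'b': np.arange(n_rows) % 3,
    'c': np.arange(n_rows, 0, -1) % 5,
})


def mean_a_chunk(chunk):
    """Mean of `a` per (b, c) group within one chunk."""
    return chunk.groupby(['b', 'c'])['a'].mean()


# Split: one chunk per value of b, so no (b, c) group straddles two chunks.
chunks = [chunk for _, chunk in toy.groupby('b')]

# Apply: run the function on each chunk concurrently.
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    parts = list(pool.map(mean_a_chunk, chunks))

# Combine: stitch the per-chunk results back together.
parallel_means = pd.concat(parts).sort_index()

# Reference result computed in a single pass over the whole frame.
serial_means = toy.groupby(['b', 'c'])['a'].mean()
```

Because each group is wholly contained in one chunk, the combined result matches the single-pass `groupby`. If groups were cut across chunks, the per-chunk means would have to be re-aggregated with their counts instead of simply concatenated.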
