# *Ambrosia* ``Splitter`` class: Spark data support example

This example shows how the ``Splitter`` class works on Spark DataFrames, using synthetic data on MTS KION user metrics.

The functionality of the ``Splitter`` class on Spark data is currently limited compared to pandas data. \
See the main ``Splitter`` tutorial on pandas data for the full functionality and details of splitting experimental objects into groups.

**Note:** *Ambrosia* currently supports only batch splitting. Real-time splitting tools are under development.

In [4]:
import sys, os
sys.path.insert(1, os.path.realpath(os.path.pardir))

In [17]:
import os

import pandas as pd
import pyspark

from ambrosia.splitter import Splitter

Your CPU supports instructions that this binary was not compiled to use: AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib


Build a local Spark session

In [6]:
os.environ['SPARK_LOCAL_IP'] = '127.0.0.1'
spark = pyspark.sql.SparkSession.builder.master("local[1]").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/03 18:21:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/03 18:21:50 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Create a Spark DataFrame from the pandas data

In [16]:
kion_dataset = pd.read_csv("./../tests/test_data/kion_data.csv", sep=';')
sdf = spark.createDataFrame(kion_dataset)

In [18]:
kion_dataset.shape

(300000, 5)

In [19]:
sdf.printSchema()

root
 |-- profile_id: long (nullable = true)
 |-- sum_dur: long (nullable = true)
 |-- vod_cnt: long (nullable = true)
 |-- ln_vod_cnt: double (nullable = true)
 |-- bin_col: long (nullable = true)



### Spark hash group split

Unlike pandas data, only the ``"hash"`` method is implemented for Spark. \
This method creates groups deterministically: the same ``salt`` value always reproduces the same split.
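
To illustrate why a salted hash gives a reproducible split, here is a minimal pure-Python sketch. It is not *Ambrosia*'s implementation: the helper names ``salted_hash`` and ``hash_split``, the choice of SHA-256, and the sort-then-chunk assignment are assumptions made for illustration. Only the parameter names mirror those passed to ``splitter.run`` in this tutorial.

```python
import hashlib


def salted_hash(object_id, salt):
    """Map an id to a reproducible integer via SHA-256 of salt + id (assumed scheme)."""
    digest = hashlib.sha256(f"{salt}{object_id}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def hash_split(ids, groups_size, groups_number=2, salt=""):
    """Assign unique ids to fixed-size groups labelled A, B, C, ... by hash order."""
    needed = groups_size * groups_number
    if needed > len(ids):
        raise ValueError("not enough objects for the requested groups")
    # Sorting by the salted hash gives a pseudo-random but reproducible order.
    ordered = sorted(ids, key=lambda object_id: salted_hash(object_id, salt))
    labels = [chr(ord("A") + g) for g in range(groups_number)]
    # The first `groups_size` ids go to A, the next `groups_size` to B, and so on.
    return {
        object_id: labels[position // groups_size]
        for position, object_id in enumerate(ordered[:needed])
    }


ids = list(range(1, 10001))  # hypothetical ids, not the KION data
first = hash_split(ids, groups_size=1000, groups_number=2, salt="spark322")
second = hash_split(ids, groups_size=1000, groups_number=2, salt="spark322")
other = hash_split(ids, groups_size=1000, groups_number=2, salt="another_salt")

print(first == second)  # same salt, same split
print(first == other)   # a different salt reshuffles the objects
```

In this sketch the salt is the only source of variation, so rerunning with the same salt and the same ids reproduces the groups, while a new salt produces an independent split.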

Pass the data and the name of the column with unique object ids

In [23]:
splitter = Splitter(dataframe=sdf, id_column='profile_id')

Make a hash split into 2 groups of 1000 objects each with a specified salt value

In [28]:
hash_split = splitter.run(groups_size=1000, method='hash', salt='spark322')

23/03/03 18:32:40 WARN TaskSetManager: Stage 47 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:32:40 WARN TaskSetManager: Stage 50 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:32:41 WARN TaskSetManager: Stage 56 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:32:41 WARN TaskSetManager: Stage 59 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.


In [27]:
hash_split.toPandas()

23/03/03 18:32:31 WARN TaskSetManager: Stage 46 contains a task of very large size (8237 KiB). The maximum recommended task size is 1000 KiB.


                                                                                

| | profile_id | sum_dur | vod_cnt | ln_vod_cnt | bin_col | group |
|---|---|---|---|---|---|---|
| 0 | 559783878399 | 16243096 | 26 | 3.451662 | 1 | A |
| 1 | 807427182946 | 55078 | 3 | 0.909034 | 0 | A |
| 2 | 845784297949 | 31545 | 1 | 0.000000 | 0 | A |
| 3 | 41350284663 | 1878050 | 10 | 2.894374 | 0 | A |
| 4 | 5082903657 | 584191 | 1 | 0.475820 | 0 | A |
| ... | ... | ... | ... | ... | ... | ... |
| 1995 | 449871171656 | 5890763 | 29 | 3.699892 | 1 | B |
| 1996 | 25374705733 | 3964937 | 51 | 4.053246 | 1 | B |
| 1997 | 368955636652 | 27693 | 1 | 0.000000 | 0 | B |
| 1998 | 674408525538 | 7284 | 1 | 0.000000 | 0 | B |


Now make 5 different groups of 1000 objects each

In [34]:
hash_split_multi = splitter.run(groups_size=1000, groups_number=5, method='hash', salt='spark322')

23/03/03 18:41:16 WARN TaskSetManager: Stage 64 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:41:17 WARN TaskSetManager: Stage 67 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:41:17 WARN TaskSetManager: Stage 73 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:41:18 WARN TaskSetManager: Stage 76 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.


In [56]:
res0 = hash_split_multi.toPandas()

23/03/03 19:24:49 WARN TaskSetManager: Stage 132 contains a task of very large size (8237 KiB). The maximum recommended task size is 1000 KiB.


In [57]:
res0.groupby('group').agg({"bin_col": "value_counts"}) / 1000

| group | bin_col | bin_col |
|---|---|---|
| A | 0 | 0.615 |
| A | 1 | 0.385 |
| B | 0 | 0.593 |
| B | 1 | 0.407 |
| C | 0 | 0.598 |
| C | 1 | 0.402 |
| D | 0 | 0.611 |
| D | 1 | 0.389 |
| E | 0 | 0.611 |
| E | 1 | 0.389 |
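
The per-group shares of ``bin_col`` can also be computed without dividing by a hard-coded group size. The sketch below uses ``pd.crosstab`` with ``normalize="index"`` on a tiny hypothetical frame (the ``toy`` data is made up for illustration and is not the KION dataset).

```python
import pandas as pd

# Hypothetical split result: two groups of four objects each.
toy = pd.DataFrame(
    {
        "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "bin_col": [0, 0, 1, 1, 0, 1, 1, 1],
    }
)

# normalize="index" divides each row by its own total,
# so every row holds the shares of bin_col values within one group.
shares = pd.crosstab(toy["group"], toy["bin_col"], normalize="index")
print(shares)
```

Because each row is normalised by its own total, the same call should keep working when group sizes differ.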


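Repeat the split, now stratified by ``bin_col``, so that the shares of ``bin_col`` values are the same in every group

In [36]: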
strat_hash_split_multi = splitter.run(groups_size=1000, strat_columns=['bin_col'], groups_number=5, method='hash', salt='spark322')

23/03/03 18:49:09 WARN TaskSetManager: Stage 81 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:10 WARN TaskSetManager: Stage 84 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:10 WARN TaskSetManager: Stage 90 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:11 WARN TaskSetManager: Stage 96 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:11 WARN TaskSetManager: Stage 102 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:12 WARN TaskSetManager: Stage 108 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.
23/03/03 18:49:12 WARN TaskSetManager: Stage 114 contains a task of very large size (8236 KiB). The maximum recommended task size is 1000 KiB.


In [39]:
res = strat_hash_split_multi.toPandas()



In [55]:
res.groupby('group').agg({"bin_col": "value_counts"}) / 1000

| group | bin_col | bin_col |
|---|---|---|
| A | 0 | 0.609 |
| A | 1 | 0.391 |
| B | 0 | 0.609 |
| B | 1 | 0.391 |
| C | 0 | 0.609 |
| C | 1 | 0.391 |
| D | 0 | 0.609 |
| D | 1 | 0.391 |
| E | 0 | 0.609 |
| E | 1 | 0.391 |


In [43]:
res.group.value_counts()

A    1000
B    1000
C    1000
D    1000
E    1000
Name: group, dtype: int64
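
The identical shares in every stratified group are consistent with each group receiving a fixed quota per stratum. A minimal sketch of such a proportional allocation follows. It is an assumption about how the quotas could be derived, not a description of *Ambrosia*'s internals: the function name ``proportional_allocation``, the largest-remainder rounding, and the stratum counts are all hypothetical (the counts were chosen only to match the 0.609 / 0.391 shares shown above).

```python
def proportional_allocation(stratum_sizes, groups_size):
    """Split `groups_size` slots across strata in proportion to stratum sizes.

    Uses largest-remainder rounding so the quotas always sum to `groups_size`.
    """
    total = sum(stratum_sizes.values())
    exact = {s: groups_size * n / total for s, n in stratum_sizes.items()}
    quotas = {s: int(x) for s, x in exact.items()}  # round down first
    leftover = groups_size - sum(quotas.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - quotas[s], reverse=True)
    for s in by_remainder[:leftover]:
        quotas[s] += 1
    return quotas


# Hypothetical population counts per bin_col value (300000 objects in total).
stratum_sizes = {0: 182_700, 1: 117_300}
per_group = proportional_allocation(stratum_sizes, groups_size=1000)
print(per_group)  # {0: 609, 1: 391}
```

Under this scheme each group would then be filled by drawing the quota for every stratum separately, for example by applying the hash ordering within each stratum.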