## Data pre-processing
We follow the data-preprocessing steps described in the [REaLTabFormer](https://arxiv.org/pdf/2302.02041#page=12.08) paper.
> For the Rossmann dataset, we used 80% of the stores data and their associated sales records for our training data. We used the remaining stores as the test data. We also limit the data used in the experiments from 2015-06 onwards spanning 2 months of sales data per store.

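The date restriction in the quoted passage can be sketched as below. This is a minimal illustration, not the paper's code: the `Date` column name, the helper `limit_to_window`, and the reading of "2 months" as the half-open window from 2015-06-01 up to (but excluding) 2015-08-01 are all assumptions, and the toy rows are made up purely for demonstration.

```python
import pandas as pd


def limit_to_window(sales: pd.DataFrame, start: str = "2015-06-01", months: int = 2) -> pd.DataFrame:
    """Keep rows whose date falls in [start, start + months).

    Hypothetical helper: assumes the sales table has a `Date` column.
    """
    dates = pd.to_datetime(sales["Date"])
    begin = pd.Timestamp(start)
    end = begin + pd.DateOffset(months=months)  # exclusive upper bound
    return sales[(dates >= begin) & (dates < end)]


# Toy sales records (illustrative values only, not real Rossmann data)
toy_sales = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Date": ["2015-05-31", "2015-06-01", "2015-07-31", "2015-06-15", "2015-08-01"],
    "Sales": [100, 200, 300, 400, 500],
})

windowed = limit_to_window(toy_sales)
print(windowed)
```

Using a half-open interval avoids having to know the last day of the final month; if the paper's window is defined differently, only `start` and `months` would need to change.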

## Training a downstream model
We load the synthesized private data to train a downstream model. We want to evaluate:
- logistic detection (LD) performance for relational data
- $F_1$ and $R^2$ scores for non-relational data
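
To make the logistic detection idea concrete, here is a rough sketch of one common formulation: a logistic regression is trained to tell real rows from synthetic rows, and the cross-validated ROC AUC is mapped to a score in $[0, 1]$ where 1 means the two are indistinguishable. The function name `logistic_detection_score`, the restriction to numeric columns, the 3-fold setup, and the AUC-to-score mapping are assumptions made for illustration; they are not taken from this notebook and may differ from the metric implementation used in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def logistic_detection_score(real: pd.DataFrame, synthetic: pd.DataFrame, seed: int = 0) -> float:
    """Hypothetical helper: 1.0 means real and synthetic rows cannot be told apart.

    Assumes both frames share the same, purely numeric, columns.
    """
    features = pd.concat([real, synthetic], ignore_index=True)
    # Label real rows 0 and synthetic rows 1
    labels = np.concatenate([
        np.zeros(len(real), dtype=int),
        np.ones(len(synthetic), dtype=int),
    ])
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    aucs = cross_val_score(clf, features, labels, cv=cv, scoring="roc_auc")
    # Assumed mapping: clip each fold's AUC at 0.5, rescale to [0, 1], then invert
    penalties = np.maximum(0.5, aucs) * 2.0 - 1.0
    return float(1.0 - penalties.mean())


# Toy data: one synthetic set drawn from the same distribution, one clearly shifted
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 2)), columns=["a", "b"])
good_synth = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 2)), columns=["a", "b"])
bad_synth = pd.DataFrame(rng.normal(5.0, 1.0, size=(200, 2)), columns=["a", "b"])

score_good = logistic_detection_score(real, good_synth)
score_bad = logistic_detection_score(real, bad_synth)
print(f"same distribution: {score_good:.3f}, shifted distribution: {score_bad:.3f}")
```

For relational data, the same idea would be applied to a table that joins parent and child columns, so that the classifier also sees cross-table relationships.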

In [None]:
import pandas as pd

# load the data
all_table = pd.read_csv('./synthesized/all_table.csv')
parent_table = pd.read_csv('./synthesized/parent_table.csv')

# Sample 80% of the stores; their associated sales records form the training data
train_parent_table = parent_table.sample(frac=0.8, random_state=0)
test_parent_table = parent_table.drop(train_parent_table.index)

# Select the sales records that belong to the train stores and to the test stores
train_table = all_table[all_table['Store'].isin(train_parent_table['Store'])]
test_table = all_table[all_table['Store'].isin(test_parent_table['Store'])]

ratio = len(train_parent_table) / len(parent_table)
ratio_train = len(train_table) / len(all_table)
# Sanity check: both the store split and the record split should be close to 80%
assert 0.7 < ratio < 0.9
assert 0.7 < ratio_train < 0.9
print(f"ratio of train parent data to all data: {ratio}")
print(f"ratio of train data to all data: {ratio_train}")
