## Data pre-processing
We follow the data pre-processing steps described in the [REaLTabFormer](https://arxiv.org/pdf/2302.02041#page=12.08) paper:
> For the Rossmann dataset, we used 80% of the stores data and their associated sales records for our training data. We used the remaining stores as the test data. We also limit the data used in the experiments from 2015-06 onwards spanning 2 months of sales data per store.
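
The date cutoff in the quoted passage is applied later in this notebook by comparing the `Date` strings directly against `'2015-06-01'`. The sketch below, on a few made-up dates, illustrates why that is safe for zero-padded `YYYY-MM-DD` strings: lexicographic order and chronological order agree. The `dates` list and variable names are hypothetical and exist only for illustration.

```python
from datetime import date

# Made-up ISO-formatted dates, in the same YYYY-MM-DD shape as the Date column.
dates = ["2015-05-31", "2015-06-01", "2015-07-31", "2014-12-31"]
cutoff = "2015-06-01"

# Filter by plain string comparison.
by_string = [d for d in dates if d >= cutoff]

# Filter by parsing each string into a real date first.
by_date = [d for d in dates if date.fromisoformat(d) >= date.fromisoformat(cutoff)]

# For zero-padded ISO dates, both approaches select the same rows.
assert by_string == by_date
print(by_string)  # ['2015-06-01', '2015-07-31']
```

This equivalence would not hold for other formats such as `31/07/2015`, in which case parsing the column (for example with `pd.to_datetime`) before filtering would be the safer choice.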


In [2]:
import pandas as pd

In [3]:
parent_table = pd.read_csv('./raw/store.csv')
parent_table.head()

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,


In [4]:
train_table = pd.read_csv('./raw/train.csv')
test_table = pd.read_csv('./raw/test.csv')

# stack the train and test tables row-wise to get all the data
all_table = pd.concat([train_table, test_table])
all_table.head()


Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,Id
0,1,5,2015-07-31,5263.0,555.0,1.0,1,0,1,
1,2,5,2015-07-31,6064.0,625.0,1.0,1,0,1,
2,3,5,2015-07-31,8314.0,821.0,1.0,1,0,1,
3,4,5,2015-07-31,13995.0,1498.0,1.0,1,0,1,
4,5,5,2015-07-31,4822.0,559.0,1.0,1,0,1,
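
In the output above, `Id` is empty for these rows and `Sales` is shown as a float. That is consistent with how `pd.concat` aligns columns by name and fills cells that have no source value with NaN. A minimal sketch on toy frames (the frames `toy_train` and `toy_test` and their values are made up, and the assumption that the two raw files differ in their columns is inferred from the output, not stated in the notebook):

```python
import pandas as pd

# Toy stand-ins for the two raw tables; values are made up for illustration.
toy_train = pd.DataFrame(
    {"Store": [1, 2], "Date": ["2015-07-31", "2015-07-31"], "Sales": [5263, 6064]}
)
toy_test = pd.DataFrame(
    {"Id": [1, 2], "Store": [1, 3], "Date": ["2015-09-17", "2015-09-17"]}
)

# concat stacks rows and aligns columns by name;
# a column missing from one frame is filled with NaN for that frame's rows.
toy_all = pd.concat([toy_train, toy_test])

print(toy_all)
print(toy_all.dtypes)  # integer columns that gained NaN are upcast to float64
print(list(toy_all.index))  # original row labels are kept, so they repeat
```

Because the original row labels are kept, the index of the combined frame contains duplicates. Passing `ignore_index=True` to `pd.concat` would renumber the rows if unique labels were needed.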


In [15]:
# we limit the data used in the experiments from 2015-06 onwards spanning 2 months of sales data per store
all_table = all_table[all_table['Date'] >= '2015-06-01']


# we sample 80% of the stores for training; the remaining 20% of stores form the test set
train_parent_table = parent_table.sample(frac=0.8, random_state=0)
test_parent_table = parent_table.drop(train_parent_table.index)

# we select the sales records that belong to the train stores and to the test stores
train_table = all_table[all_table['Store'].isin(train_parent_table['Store'])]
test_table = all_table[all_table['Store'].isin(test_parent_table['Store'])]

ratio = len(train_parent_table) / len(parent_table)
ratio_train = len(train_table) / len(all_table)
assert 0.7 < ratio < 0.9
assert 0.7 < ratio_train < 0.9
print(f"ratio of train parent data to all data: {ratio}")
print(f"ratio of train data to all data: {ratio_train}")


ratio of train parent data to all data: 0.8
ratio of train data to all data: 0.8036076001576492
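
The store-level ratio is exactly 0.8, while the row-level ratio is only close to it, because the split is made over stores and stores need not contribute the same number of sales rows. The sketch below reproduces this on toy tables and adds two checks that may be worth running on the real data: no store appears on both sides, and every sales row lands on exactly one side. The tables `stores` and `sales` are made up for illustration.

```python
import pandas as pd

# Toy parent (stores) and child (sales) tables; all values are made up.
stores = pd.DataFrame({"Store": range(1, 11)})
sales = pd.DataFrame(
    {"Store": [s for s in range(1, 11) for _ in range(s)]}  # store s has s rows
)

# Split at the store level, mirroring the sample/drop pattern used above.
train_stores = stores.sample(frac=0.8, random_state=0)
test_stores = stores.drop(train_stores.index)

# Assign each sales row to the side its store belongs to.
train_sales = sales[sales["Store"].isin(train_stores["Store"])]
test_sales = sales[sales["Store"].isin(test_stores["Store"])]

# No store is shared between the two sides.
assert set(train_stores["Store"]).isdisjoint(test_stores["Store"])
# Every sales row belongs to exactly one side.
assert len(train_sales) + len(test_sales) == len(sales)

store_ratio = len(train_stores) / len(stores)
row_ratio = len(train_sales) / len(sales)
print(f"store-level ratio: {store_ratio}, row-level ratio: {row_ratio}")
```

Splitting by store rather than by row keeps all records of a given store on one side, which is presumably the point of the protocol quoted from the paper.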


In [16]:
# saving the processed data
train_table.to_csv('./preprocessed/train.csv', index=False)
test_table.to_csv('./preprocessed/test.csv', index=False)
train_parent_table.to_csv('./preprocessed/train_parent.csv', index=False)
test_parent_table.to_csv('./preprocessed/test_parent.csv', index=False)
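
The save step assumes that the `./preprocessed` directory already exists; `to_csv` does not create missing parent directories. A minimal sketch of creating the directory first and reading a file back as a sanity check, using a temporary directory and a made-up table (`toy`) so that it does not touch the real files:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table standing in for train_table; values are made up for illustration.
toy = pd.DataFrame(
    {"Store": [1, 2], "Date": ["2015-07-31", "2015-07-31"], "Sales": [5263.0, 6064.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "preprocessed"
    # Create the output directory if needed; no error if it already exists.
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / "train.csv"
    # index=False keeps the row labels out of the file.
    toy.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(path)

print(reloaded)
```

With `index=False`, the file contains only the data columns, so reading it back does not produce an extra `Unnamed: 0` column.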
