# Add new data to a dataset with splits

When new data arrives for a dataset that is already divided into several split tables, there are two ways to add it:

1. Merge all the data into one table, then create new splits from that table.
2. Split the incoming data into new tables, then join each of them with the corresponding existing split table.

Here we will show each of these in turn.

In [1]:
import tlc

from tools.split import split_table

PROJECT_NAME = "tutorials"
DATASET_NAME = "add_new_data_merge_first"

## 1. Merge-first strategy

![Merge First Strategy](../images/merge_first.png)

The merge-first strategy first merges in the new data, and then creates new splits for all the data.

In [2]:
original_train = tlc.Table.from_dict(data={"my_column": [1, 2, 3, 4, 5]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_train")
original_val = tlc.Table.from_dict(data={"my_column": [6, 7, 8, 9, 10]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_val")
original_test = tlc.Table.from_dict(data={"my_column": [11, 12, 13, 14, 15]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_test")

In [3]:
original_joined = tlc.Table.join_tables(
    tables=[original_train, original_val, original_test],
    project_name=PROJECT_NAME,
    dataset_name=DATASET_NAME,
    table_name="original_joined"
)

new = tlc.Table.from_dict(data={"my_column": [16, 17, 18, 19, 20]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="new")

In [4]:
all_joined = tlc.Table.join_tables(
    tables=[original_joined, new],
    project_name=PROJECT_NAME,
    dataset_name=DATASET_NAME,
    table_name="all_joined",
)

A random split is used here, but any splitting strategy would work. See `split-tables.ipynb` for a more complete example.
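
To make the idea of a proportional random split concrete, here is a minimal sketch on plain Python lists. The `random_split` function, its `seed` parameter, and its rounding rule are assumptions made for illustration; this is not the implementation of the `split_table` helper, and the split sizes it produces may differ from the helper's.

```python
import random


def random_split(rows, proportions, seed=0):
    """Hypothetical stand-in for a random split; not the tutorial's `split_table`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    names = list(proportions)
    result, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so that no row is dropped.
            end = len(shuffled)
        else:
            end = start + round(proportions[name] * len(shuffled))
        result[name] = sorted(shuffled[start:end])
        start = end
    return result


splits = random_split(range(1, 21), {"train": 0.34, "val": 0.33, "test": 0.33})
sizes = {name: len(rows) for name, rows in splits.items()}
```

Every row lands in exactly one split. Which rows end up where depends only on the shuffle, so a different seed gives a different assignment.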

In [5]:
new_tables = split_table(
    all_joined,
    splits={"train": 0.34, "val": 0.33, "test": 0.33},
)

new_train = new_tables["train"]
new_val = new_tables["val"]
new_test = new_tables["test"]

3lc: Creating transaction
3lc: Committing transaction
3lc: Creating transaction
3lc: Committing transaction
3lc: Creating transaction
3lc: Committing transaction


In [9]:
for split, table in new_tables.items():
    print(f"New {split} table: [" + ", ".join(str(row["my_column"]) for row in table) + "]")

New train table: [2, 7, 9, 11, 14, 18, 19, 20]
New val table: [3, 5, 6, 8, 10, 15]
New test table: [1, 4, 12, 13, 16, 17]
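
Because merge-first re-splits everything, rows from the original tables can change split. The sketch below checks this on plain Python sets, using the original values and the output printed above. It does not use the `tlc` API, and the variable names are illustrative.

```python
# Values of `my_column` in the original split tables.
original = {
    "train": {1, 2, 3, 4, 5},
    "val": {6, 7, 8, 9, 10},
    "test": {11, 12, 13, 14, 15},
}

# Values of `my_column` in the new split tables, copied from the printed output.
merged_first = {
    "train": {2, 7, 9, 11, 14, 18, 19, 20},
    "val": {3, 5, 6, 8, 10, 15},
    "test": {1, 4, 12, 13, 16, 17},
}

# Collect the original rows that ended up in a differently named split.
moved = {}
for old_name, old_rows in original.items():
    for new_name, new_rows in merged_first.items():
        overlap = sorted(old_rows & new_rows)
        if old_name != new_name and overlap:
            moved[(old_name, new_name)] = overlap

for (old_name, new_name), rows in moved.items():
    print(f"{old_name} -> {new_name}: {rows}")
```

In this run, rows 11 and 14 move from the original test split into the new train split. That may matter if results measured on the original test split are compared with models trained on the new train split.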


## 2. Split-first strategy

![Split First Strategy](../images/split_first.png)

The split-first strategy first splits the new data, and then merges each resulting split with the corresponding original split.
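
The effect of this ordering can be sketched on plain Python lists before turning to tables. The fixed 2/2/1 division of the new rows and the seed are assumptions chosen for illustration, not what `split_table` does.

```python
import random

original = {
    "train": [1, 2, 3, 4, 5],
    "val": [6, 7, 8, 9, 10],
    "test": [11, 12, 13, 14, 15],
}
new_rows = [16, 17, 18, 19, 20]

# Split only the new rows (illustrative: shuffle, then cut into 2/2/1).
shuffled = list(new_rows)
random.Random(0).shuffle(shuffled)
new_parts = {"train": shuffled[:2], "val": shuffled[2:4], "test": shuffled[4:]}

# Append each new part to the original split with the same name.
updated = {name: original[name] + sorted(new_parts[name]) for name in original}

for name, rows in updated.items():
    print(f"{name}: {rows}")
```

Only the new rows are assigned at random, so every original row stays in the split it already belonged to.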

In [11]:
DATASET_NAME = "add_new_data_split_first"

In [12]:
original_train = tlc.Table.from_dict(data={"my_column": [1, 2, 3, 4, 5]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_train")
original_val = tlc.Table.from_dict(data={"my_column": [6, 7, 8, 9, 10]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_val")
original_test = tlc.Table.from_dict(data={"my_column": [11, 12, 13, 14, 15]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="original_test")

new = tlc.Table.from_dict(data={"my_column": [16, 17, 18, 19, 20]}, project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="new")

In [13]:
new_tables_tmp = split_table(new, splits={"train": 0.34, "val": 0.33, "test": 0.33})

new_train_tmp = new_tables_tmp["train"]
new_val_tmp = new_tables_tmp["val"]
new_test_tmp = new_tables_tmp["test"]

3lc: Creating transaction
3lc: Committing transaction
3lc: Creating transaction
3lc: Committing transaction
3lc: Creating transaction
3lc: Committing transaction


In [14]:
new_train = tlc.Table.join_tables(tables=[original_train, new_train_tmp], project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="new_train")
new_val = tlc.Table.join_tables(tables=[original_val, new_val_tmp], project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="new_val")
new_test = tlc.Table.join_tables(tables=[original_test, new_test_tmp], project_name=PROJECT_NAME, dataset_name=DATASET_NAME, table_name="new_test")

In [15]:
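# Print the joined split-first tables, not `new_tables`, which still holds the merge-first splits.
split_first_tables = {"train": new_train, "val": new_val, "test": new_test}
for split, table in split_first_tables.items():
    print(f"New {split} table: [" + ", ".join(str(row["my_column"]) for row in table) + "]")

Each table contains the rows of its original split together with its share of the new rows (16 to 20). Which new rows land in which split depends on the random split.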
