# ID Grouping

This notebook explains how to take a large dataset in which IDs recur across datapoints and reorganize it into a new dataset where datapoints that share an ID are grouped together.<br><br>
Please note that, unlike the other notebooks, this notebook <b>cannot</b> be run on the datasets provided in the repository: they are too small and don't include recurrent IDs.
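
To check whether a dataset of your own contains recurrent IDs before grouping it, a count of how often each ID appears is enough. The snippet below is a minimal sketch using only the standard library; `recurrent_ids` and `sample_ids` are hypothetical names, and the list stands in for IDs that you would collect from your datapoints.

```python
from collections import Counter


def recurrent_ids(identity_ids):
    """Return a dict of the IDs that occur more than once, with their counts."""
    counts = Counter(identity_ids)
    return {identity: n for identity, n in counts.items() if n > 1}


# Hypothetical ID list; in practice these would be read from each datapoint
sample_ids = [7, 12, 7, 31, 12, 7]
repeated = recurrent_ids(sample_ids)
print(repeated)
```

An empty result means every ID is unique, in which case grouping would just create one folder per datapoint.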

## Imports

In [1]:
from os.path import join
from shutil import copytree
import pathlib
import datagen

## Directory reorganization

**BASE_DIR** is the directory where your dataset is located.<br>
The new dataset directory will be called **Grouped** (it will be created automatically under **BASE_DIR**).

In [2]:
BASE_DIR = '.'
SRC_DATASET = join(BASE_DIR, 'Data')
DST_DATASET = join(BASE_DIR, 'Grouped')
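
Because the copy step fails if the source folder is missing, it can help to validate the paths up front. The following is a sketch under assumptions, not part of the original notebook: `resolve_dataset_dirs` is a hypothetical helper, and the folder names default to the `Data` and `Grouped` names used above. The temporary directory only exists to make the example runnable on its own.

```python
import tempfile
from pathlib import Path


def resolve_dataset_dirs(base_dir, src_name='Data', dst_name='Grouped'):
    """Return (src, dst) paths, raising early if the source folder is missing."""
    src = Path(base_dir) / src_name
    dst = Path(base_dir) / dst_name
    if not src.is_dir():
        raise FileNotFoundError(f'Source dataset not found: {src}')
    return src, dst


# Demonstrate on a throwaway directory that contains an empty 'Data' folder
with tempfile.TemporaryDirectory() as base_dir:
    (Path(base_dir) / 'Data').mkdir()
    src, dst = resolve_dataset_dirs(base_dir)
    src_found = src.is_dir()
    dst_exists_yet = dst.exists()
```

The destination is deliberately not created here, since the grouping loop creates it on demand.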

We iterate over all the datapoints in the dataset, check their IDs, and copy them into the new dataset, under the relevant ID folders. <br>
This might take a few minutes depending on the dataset size.

In [3]:
ds = datagen.load(SRC_DATASET)

for i, dp in enumerate(ds):
    # Look up the datapoint ID
    identity = dp.actor_metadata.identity_id
    
    # Create a new folder for this ID if it doesn't exist already
    id_dst_path = join(DST_DATASET, str(identity))
    pathlib.Path(id_dst_path).mkdir(parents=True, exist_ok=True)
    
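    # Print the datapoint's scene path to track progress
    print(dp.scene_path)

    # Build the environment folder path, assuming the folders are named
    # environment_00001, environment_00002, ... in iteration order
    env_dir_name = 'environment_' + str(i + 1).zfill(5)
    env_path = join(SRC_DATASET, env_dir_name)

    # Copy the current environment folder to the new dataset
    copytree(env_path, join(id_dst_path, env_dir_name))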


The new dataset is now ready! You can find it in the **Grouped** folder under **BASE_DIR**.
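
To try the grouping logic without a real dataset, and to confirm the result afterwards, the same copy-and-group pattern can be exercised on a throwaway folder tree. This is a sketch using only the standard library: `group_by_identity`, `environments_per_identity`, and the `identity_by_env` mapping are hypothetical and replace the ID lookup that the notebook performs through `datagen`. The `dirs_exist_ok` argument requires Python 3.8 or later.

```python
import tempfile
from pathlib import Path
from shutil import copytree


def group_by_identity(src_dataset, dst_dataset, identity_by_env):
    """Copy each environment folder into dst_dataset/<identity>/<env name>."""
    src_dataset, dst_dataset = Path(src_dataset), Path(dst_dataset)
    for env_name, identity in identity_by_env.items():
        # Create a folder for this ID if it doesn't exist already
        id_dst_path = dst_dataset / str(identity)
        id_dst_path.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok (Python 3.8+) allows the copy to be re-run
        copytree(src_dataset / env_name, id_dst_path / env_name,
                 dirs_exist_ok=True)


def environments_per_identity(dst_dataset):
    """Map each ID folder name to the number of environment folders inside it."""
    return {
        id_dir.name: sum(1 for child in id_dir.iterdir() if child.is_dir())
        for id_dir in sorted(Path(dst_dataset).iterdir())
        if id_dir.is_dir()
    }


with tempfile.TemporaryDirectory() as base_dir:
    src = Path(base_dir) / 'Data'
    dst = Path(base_dir) / 'Grouped'

    # Hypothetical mapping from environment folder to ID (assumed values)
    identity_by_env = {
        'environment_00001': 101,
        'environment_00002': 202,
        'environment_00003': 101,
    }

    # Build empty environment folders to stand in for real datapoints
    for env_name in identity_by_env:
        (src / env_name).mkdir(parents=True)

    group_by_identity(src, dst, identity_by_env)
    summary = environments_per_identity(dst)

print(summary)
```

Running `environments_per_identity` on your own **Grouped** folder gives a quick per-ID count to compare against the number of datapoints in the source dataset.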