# ID Grouping

Say you have generated a few datasets with recurring IDs. <br>
For example, you could select a single ID category during dataset generation (e.g. North European + Male + Young) and generate two datasets with 300+ renders and different camera parameters.<br>
The first dataset is called **Far** (the camera is far from the subject) and the second one is called **Close** (the camera is close).<br>
Now say that you want to create a new dataset out of **Far** and **Close** that groups the IDs common to both datasets.

<img src="Images/ID_grouping1.png"/>
<br><br>(Markdown images don't show up on GitHub. To visualize the image, please clone the repository locally)

## Imports

In [6]:
import os
from os import listdir
from os.path import isdir, join
from shutil import copytree
import pathlib
from tqdm import tqdm

import sys
sys.path.append('../DataLoader')
from Dataset import Dataset

## Directory reorganization

**BASE_DIR** is the directory where the datasets are located.<br>
The new dataset directory will be called **Common IDs** (it will be created automatically under **BASE_DIR**).

In [7]:
BASE_DIR = '../ID_grouping_dataset'
DST_DATASET = join(BASE_DIR, 'Common IDs')

The paths to the datasets we want to group:

In [8]:
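# Every subdirectory of BASE_DIR is a dataset, except the output directory (present after a previous run)
dataset_paths = [join(BASE_DIR, name) for name in sorted(listdir(BASE_DIR)) if isdir(join(BASE_DIR, name)) and join(BASE_DIR, name) != DST_DATASET]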

Let's create a list of the IDs in each dataset and find the IDs that appear in all of them. <br>
This might take up to a few minutes, depending on the number of datasets and their size.

In [9]:
ids_in_datasets = []

for ds_dir in dataset_paths:
    print(f'Looking up IDs in dataset: {ds_dir}')
    ds = Dataset(ds_dir)
    # Collect the identity ID of every datapoint, in dataset order
    ids_in_datasets.append([dp.identity_id for dp in tqdm(ds)])

# Keep only the IDs that appear in every dataset
shared_ids = set(ids_in_datasets[0])
for id_list in ids_in_datasets[1:]:
    shared_ids = shared_ids.intersection(id_list)

Looking up IDs in dataset: ../ID_grouping_dataset/Far


100%|██████████| 40/40 [00:05<00:00,  7.20it/s]


Looking up IDs in dataset: ../ID_grouping_dataset/Near


100%|██████████| 40/40 [00:05<00:00,  6.84it/s]
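
The intersection step can be tried on toy data before running it on real datasets. The sketch below is only an illustration: the ID values are made up, and it assumes there is at least one dataset (with an empty list of datasets, `set.intersection` would raise a `TypeError`).

```python
# Hypothetical identity IDs, one list per dataset, in datapoint order.
ids_in_datasets = [
    ['id_07', 'id_12', 'id_31', 'id_44'],  # e.g. Far
    ['id_12', 'id_44', 'id_50'],           # e.g. Near
]

# Intersect all lists at once: an ID survives only if every dataset contains it.
shared_ids = set.intersection(*(set(ids) for ids in ids_in_datasets))

# Sets are unordered, so sort before printing or comparing.
print(sorted(shared_ids))
```

Because a set has no guaranteed order, the numbering of the output folders created later can differ between runs unless the shared IDs are sorted first.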


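For each of the shared IDs, we create a new subfolder under **Common IDs** that gets a copy of the matching datapoint from each of the original datasets. <br>
The source folder is derived from the position of the ID in the dataset: the code assumes that the n-th datapoint (counting from 1) is stored in the folder `environment_` followed by n, zero-padded to 5 digits.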

In [10]:
for i, identity_id in enumerate(shared_ids):
    # One numbered subfolder per shared ID
    id_dst_dir = join(DST_DATASET, str(i))
    pathlib.Path(id_dst_dir).mkdir(parents=True, exist_ok=True)

    for ds_dir, ds_ids in zip(dataset_paths, ids_in_datasets):
        # Position of the ID in the dataset (1-based) gives the environment folder name
        environment_idx = ds_ids.index(identity_id) + 1
        environment_dir_name = 'environment_' + str(environment_idx).zfill(5)
        environment_path = join(ds_dir, environment_dir_name)
        # copytree raises FileExistsError if the destination already exists
        copytree(environment_path, join(id_dst_dir, os.path.basename(ds_dir)))
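
To check the position-to-folder mapping without touching real data, the copy step can be reproduced on throwaway folders. This is a minimal sketch under stated assumptions: the helpers `environment_dir_name` and `copy_shared_ids`, the ID values, and the `identity.txt` marker file are all hypothetical, and the `environment_` naming (1-based, 5 digits) is taken from the cell above rather than from any dataset specification.

```python
import tempfile
from pathlib import Path
from shutil import copytree


def environment_dir_name(position):
    """Folder name for a 0-based datapoint position (assumes 1-based, 5-digit names)."""
    return 'environment_' + str(position + 1).zfill(5)


def copy_shared_ids(shared_ids, ids_per_dataset, base_dir, dst_dir):
    """Copy each shared ID's folder to dst_dir/<n>/<dataset name>."""
    for n, identity_id in enumerate(shared_ids):
        for name, ids in ids_per_dataset.items():
            src = base_dir / name / environment_dir_name(ids.index(identity_id))
            copytree(src, dst_dir / str(n) / name)  # creates missing parent folders


# Hypothetical IDs, listed in the same order as the environment folders.
toy_ids = {'Far': ['id_a', 'id_b', 'id_c'], 'Near': ['id_c', 'id_a', 'id_d']}
shared = sorted(set(toy_ids['Far']) & set(toy_ids['Near']))

with tempfile.TemporaryDirectory() as tmp:
    base, dst = Path(tmp) / 'datasets', Path(tmp) / 'Common IDs'
    # Build the fake datasets: one folder per datapoint holding a marker file.
    for name, ids in toy_ids.items():
        for position, identity_id in enumerate(ids):
            env = base / name / environment_dir_name(position)
            env.mkdir(parents=True)
            (env / 'identity.txt').write_text(identity_id)

    copy_shared_ids(shared, toy_ids, base, dst)
    # Map each copied marker file (path relative to dst) to the ID it contains.
    result = {p.relative_to(dst).as_posix(): p.read_text()
              for p in sorted(dst.rglob('identity.txt'))}

for path, identity_id in result.items():
    print(path, identity_id)
```

Each numbered folder should end up with one subfolder per dataset, and both subfolders should carry the same ID even though that ID sits at different positions in the two datasets.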

This is the final result you should get: <br>
<img src="Images/ID_grouping2.png"/>
<br><br>(Markdown images don't show up on GitHub. To visualize the image, please clone the repository locally)