# Data cleanup notebook
This notebook serves both as a cleanup tool for the original dataset from [Zenodo](https://zenodo.org/record/5092309#.ZGumqxlByDQ) and as documentation of what was changed, for anyone using the already cleaned dataset (currently most likely hosted privately in a Google Cloud Storage bucket).

If you use the notebook as a cleanup tool, download the original dataset from [Zenodo](https://zenodo.org/record/5092309#.ZGumqxlByDQ). Optionally, compute the MD5 checksum to verify that the data was downloaded without errors:
```bash
$ wget --output-document gwhd_2021.zip "https://zenodo.org/record/5092309/files/gwhd_2021.zip?download=1" && unzip gwhd_2021.zip
$ python3 data_integrity.py <Path to the dataset> <original dataset MD5>
```
Then export the environment variable **DATASET_ROOT_DIR** with the path where the dataset was placed, and run this notebook. Remember to launch the notebook from the same terminal in which **DATASET_ROOT_DIR** was exported!

If using this notebook as documentation, there is no need to run it as the data should already be cleaned up.
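
The internals of `data_integrity.py` are not shown in this notebook, so the following is only a minimal sketch of what a checksum comparison could look like. It assumes the checksum is taken over a single file (for example the downloaded archive); `md5_of_file` is a hypothetical helper, not part of the project.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large archive never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared with the expected checksum, for example `md5_of_file("gwhd_2021.zip") == expected_md5`, where `expected_md5` is whatever reference value you were given.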

In [1]:
import os

print("Checking environment variables...")
assert 'PROJ_PATH' in os.environ
assert 'YOLOV7_ROOT_DIR' in os.environ
assert 'DATASET_MD5' in os.environ
assert 'ORIGINAL_DATASET_MD5' in os.environ
assert 'DATASET_ROOT_DIR' in os.environ
assert 'DATA_BUCKET' in os.environ
print("Environment variables exist.")

DATASET_ROOT_DIR = os.environ['DATASET_ROOT_DIR']

Checking environment variables...
Environment variables exist.
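
A bare `assert` stops at the first missing variable and does not say which one it was. As a possible alternative, the sketch below collects every missing name in one pass. The variable names are copied from the cell above; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced only for illustration.

```python
import os

# Names copied from the check above; the helper itself is a hypothetical addition.
REQUIRED_VARS = (
    "PROJ_PATH",
    "YOLOV7_ROOT_DIR",
    "DATASET_MD5",
    "ORIGINAL_DATASET_MD5",
    "DATASET_ROOT_DIR",
    "DATA_BUCKET",
)


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are absent from the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in required if name not in environ]
```

Calling `missing_env_vars()` returns an empty list when everything is set, so the notebook could raise an error that lists the returned names whenever the list is non-empty.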


In [None]:
from tqdm import tqdm
from PIL import Image

"""
All images in the dataset have a .png extension, but some of them are actually .jpg files.
This is probably not a problem during training, but the default image viewer does not open
these files correctly.

Convert them all to actual .png files.
"""
images_dir = os.path.join(DATASET_ROOT_DIR, 'images')
for img_name in tqdm(os.listdir(images_dir)):
    # os.listdir returns bare file names, so build the full path before opening
    img_path = os.path.join(images_dir, img_name)
    try:
        img = Image.open(img_path)
        img.save(img_path, format='PNG')
    except OSError as e:
        print(f"Couldn't convert {img_name} due to {e}, skipping...")
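
To see which files are affected before converting anything, the real format can be read from the file header instead of the extension. This is a minimal sketch that relies only on the standard PNG and JPEG file signatures; `sniff_image_format` is a hypothetical helper and is not used elsewhere in this notebook.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # start-of-image marker followed by the next marker byte


def sniff_image_format(path):
    """Guess the real format from the file header, ignoring the extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(PNG_SIGNATURE):
        return "PNG"
    if header.startswith(JPEG_SIGNATURE):
        return "JPEG"
    return None  # neither PNG nor JPEG, or the file is too short to tell
```

Files in the images directory for which this returns `"JPEG"` would be the ones carrying a misleading `.png` extension.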

In [55]:
import pandas as pd

train_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_train.csv')
test_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_test.csv')
val_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_val.csv')

In [4]:
"""
Image b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png is corrupted, so we remove it.
Loading it raises an error saying that the file is truncated. It opens normally in the default image viewer, but not in ImageMagick.
"""
os.remove(os.path.join(DATASET_ROOT_DIR, 'images', 'b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png'))

FileNotFoundError: [Errno 2] No such file or directory: '/home/js/gwhd_2021/images/b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png'
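
One plausible reading of the error above is that the cell was run on a copy of the dataset from which the file had already been removed. If the cell should be safe to re-run, the removal can be made idempotent. The sketch below uses `pathlib` from the standard library and requires Python 3.8 or newer; `remove_if_present` is a hypothetical helper.

```python
from pathlib import Path


def remove_if_present(path):
    """Delete a file if it exists; return True when something was actually removed."""
    path = Path(path)
    existed = path.is_file()
    # missing_ok=True (Python 3.8+) suppresses FileNotFoundError on repeated runs.
    path.unlink(missing_ok=True)
    return existed
```

With this helper, a second run of the cell returns `False` instead of raising.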

In [56]:
"""
Check if there are duplicates in the subsets
"""
train_df_dups = train_df.loc[train_df['image_name'].duplicated(), 'image_name'].unique()
test_df_dups = test_df.loc[test_df['image_name'].duplicated(), 'image_name'].unique()
val_df_dups = val_df.loc[val_df['image_name'].duplicated(), 'image_name'].unique()
print(f"Train dataset duplicates:\n {train_df_dups}")
print(f"Test dataset duplicates:\n {test_df_dups}")
print(f"Val dataset duplicates:\n {val_df_dups}")

Train dataset duplicates:
 ['d88963636d49127bda0597ef73f1703e92d6f111caefc44902d5932b8cd3fa94.png'
 '1961bcf453d5b2206c428c1c14fe55d1f26f3c655db0a2b6a83094476e8edb5b.png']
Test dataset duplicates:
 ['da9846512ff19b8cd7278c8c973f75d36de8c4eb4e593b8285f6821aae1f4203.png']
Val dataset duplicates:
 []


In [85]:
"""
Drop these duplicates
"""
train_df.drop_duplicates(subset=['image_name'], inplace=True)
test_df.drop_duplicates(subset=['image_name'], inplace=True)
val_df.drop_duplicates(subset=['image_name'], inplace=True)
df = pd.concat([train_df, test_df, val_df])

In [86]:
"""
Check whether there is a leak between the train, test and val sets. At this point we are sure that
there are no duplicates within each subset.
"""
import numpy as np
leaks = np.array([])
if not df['image_name'].nunique() == df.shape[0]:
    # These are the files that caused errors during symlink creation in the YOLO format conversion
    leaks = df.loc[df.duplicated(subset=['image_name'])]['image_name']
    leaks = leaks.reset_index(drop=True)
print(f"Found {leaks.shape[0]} leaks in the dataset")

Found 0 leaks in the dataset
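
The check above reports how many names are shared, but not which subsets share them. If a leak ever did show up, a pairwise set intersection would point to the subsets involved. This is a minimal, standard-library-only sketch; `find_split_leaks` is a hypothetical helper that accepts any mapping from a subset name to an iterable of image names.

```python
from itertools import combinations


def find_split_leaks(splits):
    """Map each pair of split names to the sorted image names they share."""
    name_sets = {split: set(names) for split, names in splits.items()}
    leaks = {}
    # Compare every unordered pair of splits exactly once.
    for a, b in combinations(name_sets, 2):
        shared = name_sets[a] & name_sets[b]
        if shared:
            leaks[(a, b)] = sorted(shared)
    return leaks
```

In this notebook it could be called as `find_split_leaks({"train": train_df["image_name"], "test": test_df["image_name"], "val": val_df["image_name"]})`; an empty dictionary means no image appears in more than one subset.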


In [87]:
import numpy as np

"""
Check whether any image has duplicate bboxes. If it does, remove them.
"""
boxes_col = df.columns.get_loc('BoxesString')
for pos, boxes_string in enumerate(df['BoxesString'].tolist()):
    if boxes_string == 'no_box':
        continue
    bboxes = [[int(i) for i in bbox.split(' ')] for bbox in boxes_string.split(';')]
    bboxes = np.array(bboxes, dtype=int)
    # np.unique with axis=0 returns the distinct rows (sorted) and how often each occurs
    uniques, count = np.unique(bboxes, axis=0, return_counts=True)
    duplicate_bboxes = uniques[count > 1]
    if duplicate_bboxes.size > 0:
        print(f"Duplicate boxes found:\n{duplicate_bboxes}")
        new_bboxes_string = ';'.join(' '.join(unique_bbox.astype(str)) for unique_bbox in uniques)
        # Write back by position, because index labels repeat after pd.concat
        df.iloc[pos, boxes_col] = new_bboxes_string
        print(f"New duplicate-free BoxesString:\n{new_bboxes_string}")

Duplicates boxes:  [[770 629 813 690]]
New:  8 141 116 271;8 382 62 435;8 573 94 678;21 260 66 370;23 389 73 542;65 237 90 268;80 992 161 1024;110 298 161 390;136 733 195 810;191 39 235 113;209 256 349 356;237 371 269 419;264 360 376 420;291 148 350 216;298 17 397 72;315 489 459 574;375 234 435 289;378 755 468 897;404 704 488 829;420 357 525 403;504 398 566 446;564 610 608 806;571 323 677 378;582 163 728 237;593 249 662 320;615 10 673 95;625 653 671 718;697 434 808 486;756 741 782 774;763 782 813 832;763 915 821 1024;770 629 813 690;789 1 910 49;795 104 914 282;882 422 928 540;899 261 1006 311;930 149 977 205;930 412 980 474;941 807 977 849;959 474 1016 541
Duplicates boxes:  [[ 74 370 146 484]]
New:  0 62 52 138;0 250 80 358;0 756 40 824;74 370 146 484;78 802 248 892;82 0 166 62;104 966 194 1024;110 184 226 256;142 872 186 914;144 760 300 812;148 488 238 562;216 916 316 1024;222 896 346 976;224 424 258 450;278 0 446 34;300 366 384 468;426 738 500 832;444 56 612 186;452 208 582 286;474