# Data cleanup notebook
This notebook serves both as a cleanup tool for the original dataset from [Zenodo](https://zenodo.org/record/5092309#.ZGumqxlByDQ) and as documentation of what was changed, for anyone using the already cleaned-up dataset (most likely currently hosted privately on a Google Cloud Storage instance).

If you are using the notebook as a cleanup tool, download the original dataset from [Zenodo](https://zenodo.org/record/5092309#.ZGumqxlByDQ). You can optionally calculate the MD5 checksum to verify that the data was downloaded without errors:
```bash
$ wget --output-document gwhd_2021.zip "https://zenodo.org/record/5092309/files/gwhd_2021.zip?download=1" && unzip gwhd_2021.zip
$ python3 data_integrity.py <Path to the dataset> <original dataset MD5>
```
Then export the environment variable **DATASET_ROOT_DIR** with the path where the dataset was placed, and run this notebook. Remember to launch the notebook from the same terminal in which **DATASET_ROOT_DIR** was exported!
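
The optional checksum step can be illustrated with a minimal sketch. It is not the contents of `data_integrity.py` (that script is not shown here, and it may hash the extracted directory rather than a single file); the helper name `file_md5` is hypothetical, and the sketch assumes you want the MD5 of one file such as the downloaded archive.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; chunked reads keep memory
    usage flat even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_md5('gwhd_2021.zip')` with a published checksum would then be a plain string comparison.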

If you are using this notebook as documentation, there is no need to run it, as the data should already be cleaned up.

In [1]:
import os

print("Checking environment variables...")
assert 'PROJ_PATH' in os.environ
assert 'YOLOV7_ROOT_DIR' in os.environ
assert 'DATASET_MD5' in os.environ
assert 'ORIGINAL_DATASET_MD5' in os.environ
assert 'DATASET_ROOT_DIR' in os.environ
assert 'DATA_BUCKET' in os.environ
print("Environment variables exist.")

DATASET_ROOT_DIR = os.environ['DATASET_ROOT_DIR']

Checking environment variables...
Environment variables exist.


In [None]:
from tqdm import tqdm
from PIL import Image

"""
All images in the dataset have a .png extension, but some of them are actually JPEG files.
This is probably not a problem during training, but the default image viewer does not open these
files correctly.

Convert them all to actual .png files.
"""
images_dir = os.path.join(DATASET_ROOT_DIR, 'images')
for img_name in tqdm(os.listdir(images_dir)):
    # os.listdir returns bare file names, so build the full path before opening
    img_path = os.path.join(images_dir, img_name)
    try:
        with Image.open(img_path) as img:
            # Read the pixel data into memory before overwriting the source file
            img.load()
            img.save(img_path, format='PNG')
    except OSError as e:
        print(f"Couldn't convert {img_name} due to {e}, skipping...")
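
To see which files are affected before converting anything, one option is to inspect the file signature instead of trusting the extension. The sketch below is an illustration under assumptions, not part of the original cleanup: the helper names `sniff_format` and `list_mislabeled_pngs` are hypothetical, and only the PNG and JPEG signatures are recognised.

```python
import os

# Standard leading bytes of the two formats
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
JPEG_SIGNATURE = b'\xff\xd8\xff'


def sniff_format(path):
    """Guess the image format from the first bytes of the file."""
    with open(path, 'rb') as f:
        header = f.read(8)
    if header.startswith(PNG_SIGNATURE):
        return 'PNG'
    if header.startswith(JPEG_SIGNATURE):
        return 'JPEG'
    return 'unknown'


def list_mislabeled_pngs(images_dir):
    """Return sorted names of *.png files whose content is not PNG."""
    return sorted(
        name for name in os.listdir(images_dir)
        if name.lower().endswith('.png')
        and sniff_format(os.path.join(images_dir, name)) != 'PNG'
    )
```

Calling `list_mislabeled_pngs(images_dir)` before and after the conversion cell gives a quick check that the conversion did what was intended.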

In [2]:
import pandas as pd

train_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_train.csv')
test_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_test.csv')
val_df = pd.read_csv(f'{DATASET_ROOT_DIR}/competition_val.csv')
df = pd.concat([train_df, test_df, val_df])

In [4]:
"""
Image b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png is corrupted, so we remove it.
Opening it raises an error saying that the file is truncated. It opens normally in the default image viewer, but not in ImageMagick.
"""
os.remove(os.path.join(DATASET_ROOT_DIR, 'images', 'b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png'))

FileNotFoundError: [Errno 2] No such file or directory: '/home/js/gwhd_2021/images/b11b3c68d79f4025ff7f542587ab91a67dfe55be69d1fb63db4bcbcb108284a9.png'

In [5]:
"""
Check if there is a leak between the train, test and val sets
"""
if df['image_name'].nunique() != df.shape[0]:
    # These are the files that cause errors during symlink creation in the YOLO format conversion
    print(df.loc[df.duplicated(subset=['image_name'])]['image_name'])

# TODO: Fix these labels

2070    d88963636d49127bda0597ef73f1703e92d6f111caefc4...
2079    1961bcf453d5b2206c428c1c14fe55d1f26f3c655db0a2...
1038    da9846512ff19b8cd7278c8c973f75d36de8c4eb4e593b...
Name: image_name, dtype: object
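
The output above lists the duplicated names but not where each one occurs. A small sketch of how that could be reported follows; it is an illustration only, the helper name `find_leaks` is hypothetical, and it assumes each split can be passed as an iterable of image names.

```python
from collections import defaultdict


def find_leaks(splits):
    """Map every image name that occurs more than once to the splits it occurs in.

    `splits` is a dict of split name -> iterable of image names. A split
    is listed once per occurrence, so duplicates inside a single split
    are reported as well.
    """
    seen = defaultdict(list)
    for split_name, image_names in splits.items():
        for image_name in image_names:
            seen[image_name].append(split_name)
    return {name: where for name, where in seen.items() if len(where) > 1}
```

With the frames loaded earlier, this could be called as `find_leaks({'train': train_df['image_name'], 'test': test_df['image_name'], 'val': val_df['image_name']})`.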


In [6]:
import numpy as np

"""
Check if there are any duplicate bboxes in the dataset
"""
for _, row in df.iterrows():
    if row['BoxesString'] == 'no_box':
        continue
    # Parse the ';'-separated boxes into a 2-D float array, one row per box
    bboxes = [[float(i) for i in bbox.split(' ')] for bbox in row['BoxesString'].split(';')]
    bboxes = np.array(bboxes, dtype=np.float32)
    # Rows that occur more than once are duplicated boxes
    uniques, count = np.unique(bboxes, axis=0, return_counts=True)
    dup = uniques[count > 1]
    if dup.size > 0:
        print(row['image_name'], dup)

# TODO: Fix these boxes

NameError: name 'np' is not defined
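
One possible way to address the TODO above is to drop exact duplicate boxes while keeping the original order. This is only a sketch of an approach, not a fix that has been applied to the dataset: the helper name `dedupe_boxes` is hypothetical, and it assumes the `BoxesString` format used in the cell above (boxes separated by `;`, values separated by a single space, `no_box` for empty images).

```python
def dedupe_boxes(boxes_string):
    """Remove exact duplicate boxes from a BoxesString, keeping first occurrences."""
    if boxes_string == 'no_box':
        return boxes_string
    seen = set()
    kept = []
    for box in boxes_string.split(';'):
        # Compare parsed values so "1 2 3 4" and "1.0 2.0 3.0 4.0" count as equal
        key = tuple(float(v) for v in box.split(' '))
        if key not in seen:
            seen.add(key)
            kept.append(box)
    return ';'.join(kept)
```

Applied to a frame, this would look like `df['BoxesString'] = df['BoxesString'].apply(dedupe_boxes)`. Near-duplicate boxes that differ by a pixel or two are not caught and would need an overlap-based check instead.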