In [1]:
%load_ext autoreload
%autoreload 2

# Synthetic data generation

In this notebook, we call methods from *src/data/synthetic_data_generation.py* to create new images for our model to train on.

We first define the constants for our data directories:

In [2]:
from src.data.synthetic_data_generation import *
from pathlib import Path

# Images directories
SYNTH_DIR = Path('data/synthetic_train')
REFERENCE_DIR = Path('data/references')
DEFAULT_TRAIN_DIR = Path('data/train')
BACKGROUND_DIR = Path('data/backgrounds')
TRANSPARENCY_DIR = Path('data/alpha_references')

# CSV files
ORIGINAL_CSV = Path('data/train.csv')
SYNTH_CSV = Path('data/synthetic_train.csv')

# Image constants
IMG_SIZE = (300, 200)
NUM_SYNTH_IMAGES = 800 # the final count is 4x this after flipping (original, horizontal, vertical, both)

# Check that all the directories exist (including the transparent references used during generation)
for directory in [SYNTH_DIR, REFERENCE_DIR, DEFAULT_TRAIN_DIR, BACKGROUND_DIR, TRANSPARENCY_DIR]:
    if not directory.exists():
        raise FileNotFoundError(f"Directory {directory} does not exist.")
# Check that the original CSV file exists
if not ORIGINAL_CSV.exists():
    raise FileNotFoundError(f"CSV file {ORIGINAL_CSV} does not exist. It is needed to generate the synthetic data.")

## Resize our images

We start by taking all the reference and default train images and converting them to a lighter format, which makes later computations easier.

First, we remove everything that was in the synthetic folder:

In [8]:
clear_output_dir(SYNTH_DIR)

All images in data\synthetic_train have been deleted.
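
As a rough illustration of what such a cleanup step can look like, here is a minimal sketch. The function name `clear_images` and the list of image suffixes are assumptions made for this example; the actual `clear_output_dir` in *src/data/synthetic_data_generation.py* may behave differently.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to the formats actually in use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def clear_images(directory: Path) -> int:
    """Delete image files directly inside `directory` and return how many were removed."""
    removed = 0
    for path in directory.iterdir():
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            path.unlink()
            removed += 1
    return removed

# Demo on a throwaway directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.png").write_bytes(b"")
    (folder / "b.JPG").write_bytes(b"")
    (folder / "notes.txt").write_text("keep me")
    removed = clear_images(folder)
    remaining = sorted(p.name for p in folder.iterdir())
    print(removed, remaining)  # 2 ['notes.txt']
```

Filtering by suffix leaves non-image files in place, which is consistent with the message printed above mentioning only images.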


In [9]:
to_be_resized = [REFERENCE_DIR, DEFAULT_TRAIN_DIR]
resize_training_images(IMG_SIZE, to_be_resized, SYNTH_DIR)

Resizing images in data\references...


100%|██████████| 13/13 [00:03<00:00,  3.78it/s]


Resizing images in data\train...


100%|██████████| 90/90 [00:25<00:00,  3.54it/s]

DONE! Resized images saved in data\synthetic_train.





### Creating synthetic images
Now, we add hundreds of new images to the same folder. They are generated from the reference images with a transparent background (`data/alpha_references`). This step also writes our new CSV with the additional annotations. Noise and rotations are applied randomly to individual chocolates, and the backgrounds, fetched from the `data/backgrounds` directory, are used in equal proportion.
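
To make "equal proportion" and "randomly applied" concrete, the following sketch shows one way to split an image budget across backgrounds and to draw per-chocolate parameters. The helpers `split_evenly` and `sample_placement`, as well as the angle and noise ranges, are hypothetical and are not taken from `generate_synthetic_dataset`.

```python
import random

def split_evenly(num_images: int, num_backgrounds: int) -> list[int]:
    """Return the number of images per background; counts differ by at most one."""
    if num_backgrounds <= 0:
        raise ValueError("need at least one background")
    base, extra = divmod(num_images, num_backgrounds)
    # The first `extra` backgrounds receive one additional image.
    return [base + (1 if i < extra else 0) for i in range(num_backgrounds)]

def sample_placement(rng: random.Random, max_noise: float = 0.05) -> dict:
    """Draw a random rotation (degrees) and noise level for one chocolate (assumed ranges)."""
    return {"angle": rng.uniform(0.0, 360.0), "noise": rng.uniform(0.0, max_noise)}

print(split_evenly(800, 1))  # [800]
print(split_evenly(800, 3))  # [267, 267, 266]

rng = random.Random(0)  # seeded so a run can be reproduced
placement = sample_placement(rng)
```

With a single background, all 800 images go to it, which matches a run that reports one background out of one.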

In [10]:
generate_synthetic_dataset(NUM_SYNTH_IMAGES, BACKGROUND_DIR, TRANSPARENCY_DIR, SYNTH_DIR, ORIGINAL_CSV, SYNTH_CSV, IMG_SIZE)

Generating images for background  1  of  1


100%|██████████| 800/800 [03:41<00:00,  3.61it/s]


### Data flipping
We can now take all our synthetic and default images and apply three kinds of flips: vertical, horizontal, and both combined.
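
The three flips can be illustrated on a tiny grid of pixel values. This is only a sketch using plain nested lists; the helper names are made up for the example and the real `flip_images` presumably works on image files.

```python
def flip_horizontal(img):
    """Mirror left-right by reversing each row."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror top-bottom by reversing the order of the rows."""
    return [row[:] for row in img[::-1]]

def all_variants(img):
    """Return the original plus the three flipped versions."""
    return [
        img,
        flip_horizontal(img),
        flip_vertical(img),
        flip_horizontal(flip_vertical(img)),
    ]

tiny = [[1, 2, 3],
        [4, 5, 6]]
variants = all_variants(tiny)
print(variants[1])  # [[3, 2, 1], [6, 5, 4]]
print(variants[2])  # [[4, 5, 6], [1, 2, 3]]
print(variants[3])  # [[6, 5, 4], [3, 2, 1]]
print(800 * len(variants))  # 3200
```

Each image yields four variants in total, which is where the 4x factor on the 800 generated images comes from.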

In [11]:
# First, we merge the default CSV with the synthetic CSV and overwrite the synthetic CSV with the result.
# Note: re-running this cell merges the original rows into the synthetic CSV again.
# To start over, delete the synthetic CSV and regenerate it. Keep the original CSV: the generation step needs it.
merge_csv_files(SYNTH_CSV, ORIGINAL_CSV, SYNTH_CSV)

flip_images(SYNTH_CSV, SYNTH_DIR, SYNTH_CSV)

Merged CSV files saved to data\synthetic_train.csv.
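
If repeated runs of the merge are a concern, one option is to deduplicate on a key column so that merging twice gives the same file. The sketch below assumes both CSVs share a header and have a unique `id` column; that column name and the function `merge_csv_unique` are assumptions for illustration, not the behaviour of `merge_csv_files`.

```python
import csv
import tempfile
from pathlib import Path

def merge_csv_unique(first: Path, second: Path, out: Path, key: str = "id") -> int:
    """Merge two CSVs with the same header, keeping one row per key. Return rows written."""
    rows = {}
    fieldnames = None
    # Both inputs are read completely before writing, so `out` may equal an input.
    for path in (first, second):
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for row in reader:
                rows.setdefault(row[key], row)  # first occurrence wins
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows.values())
    return len(rows)

# Demo with throwaway files: the second merge does not add duplicates.
with tempfile.TemporaryDirectory() as tmp:
    synth = Path(tmp) / "synthetic.csv"
    orig = Path(tmp) / "train.csv"
    synth.write_text("id,count\nS1,3\nS2,5\n")
    orig.write_text("id,count\nT1,2\n")
    first_run = merge_csv_unique(synth, orig, synth)
    second_run = merge_csv_unique(synth, orig, synth)
    print(first_run, second_run)  # 3 3
```

This makes the merge safe to repeat, but it does not address flipping: applying the flips a second time to an already flipped CSV would still need its own guard.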
