# Creating the dataset: Data augmentation
In this notebook, I highlight some techniques for augmenting an existing dataset for deep learning purposes. In most cases, more data benefits model training and generalization, especially when the dataset is small.

Augmenting a dataset means artificially expanding it by applying classic image transforms, such as illumination changes or 2D rotations, translations and scalings. To a human observer, rotating or scaling an image can seem fairly trivial, but to a deep learning model the output is a wholly new, previously unseen datapoint. The greatest advantage? All these new datapoints are labeled, as they inherit their labels from previously labeled sources.
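
As a toy illustration of label inheritance, the sketch below flips a tiny "image" (a nested list of pixel values) horizontally. The helper `hflip`, the pixel values and the integer label are made up for this example and are not part of the mushroom dataset:

```python
def hflip(image):
    """Return a horizontally flipped copy of a 2D image given as a list of rows."""
    return [row[::-1] for row in image]


# A 2x3 stand-in image and a hypothetical integer label
image = [[0, 1, 2],
         [3, 4, 5]]
label = 3

flipped = hflip(image)

# The flipped image is a new datapoint, but it keeps the label of its source
augmented = [(image, label), (flipped, label)]
print(flipped)
```

The pixel values differ between the two entries, yet no extra labelling work was needed for the second one.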

## Defining paths, iterators and directories
In the `\data`-folder, we have an existing dataset consisting of 244 images of mushrooms commonly found in the Norwegian flora. These mushrooms are labelled in `data\image_labels.csv`, and the labels are further described in `data\label_description.csv`. Although the author would like to point out the painstakingly slow work of manually selecting and labelling all datapoints, 244 images are not currently sufficient to train a well-generalized deep learning model.

To augment the dataset, we first find an *absolute path* `BASE_DIR` to the current folder (as this can change from OS to OS) and create an iterator `raw_img_paths` of path objects matching a relative pattern. Here, the pattern matches all images ending with `.jpg` (on case-insensitive file systems such as Windows, this covers both `.jpg` and `.JPG`):

In [238]:
from pathlib import Path

# Define a BASE_DIR parameter for accuracy of loading files over multiple systems.
BASE_DIR = Path("00_Creating_Dataset.ipynb").parent.resolve()

# Define a list of image paths and a path to the labels, using a relative pattern:
RAW_LABEL_PATH = Path(f"{BASE_DIR}/data/raw_mushroom_imgs").rglob('*.csv')
raw_img_paths = Path(f"{BASE_DIR}/data/raw_mushroom_imgs").rglob('*.jpg')   # NOTE: Notation diff. due to contents
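
Because glob matching follows the platform's case rules, `'*.jpg'` may miss `.JPG` files on Linux. A minimal sketch of a platform-independent alternative is to walk the folder and compare lower-cased suffixes. The helper name `find_images` and the throwaway demo folder are illustrative assumptions, not part of the notebook:

```python
import tempfile
from pathlib import Path


def find_images(folder, suffixes=(".jpg",)):
    """Return sorted file paths under `folder` whose suffix matches, ignoring case."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Demo on a temporary folder with mixed-case extensions (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.JPG", "labels.csv"):
        (Path(tmp) / name).touch()
    found_names = [p.name for p in find_images(tmp)]

print(found_names)
```

Returning a sorted list rather than a generator also makes the result reusable, since a generator from `rglob` is exhausted after one pass.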

We also define a target folder, where the augmented data will be saved. Here, the folder is emptied right after it is created, so that images from earlier runs do not accumulate:

In [239]:
import os

# Define a path to the target output directory
TARGET_DIR = Path(f"{BASE_DIR}/data/mushroom_imgs")
# Make a directory if there is no directory
TARGET_DIR.mkdir(parents = True, exist_ok = True)

# Empty the directory using another relative pattern
target_paths = Path(f"{TARGET_DIR}").rglob('*.jpg')
for path in target_paths:
    os.remove(path)

# Define a path for the output csv-file containing the labels of the augmented data
TARGET_LABEL_PATH = Path(f"{BASE_DIR}/data/mushroom_imgs/img_labels.csv")

## Augmenting the data

### Defining the data augmentation pipeline
The data augmentation is done with the `torchvision.transforms` module, whose `Compose` class lets us define a pipeline of transforms to apply to an image. These transforms can alter the spectral or spatial components of an image, and can be tuned to produce randomized results. Naturally, the more randomized transforms one stacks atop one another, the less likely it is for any two images to turn out the same.

In [240]:
import torch
from torchvision import transforms

# Define the composite transformation making up our pipeline
augmentation_pipeline = transforms.Compose([
    transforms.ToPILImage(),    # converts to pillow-image
    transforms.Resize((360, 360)),  # resizes the image to (360, 360)
    transforms.RandomCrop((224, 224)),  # randomly crops a (224, 224) area out of the image
    transforms.RandomRotation((0, 360)),    # randomly rotates the image between 0->360 degrees
    transforms.RandomHorizontalFlip(),  # randomly flips the image horizontally
    transforms.ColorJitter(brightness=0.5, saturation=0.3, contrast=0.5),   # randomly shifts illumination params
])

### Generate the augmented dataset
With the augmentation pipeline, we just need a strategy to correctly name images, save their labels and write everything to the proper location. First, we load in the labels of the raw dataset as `raw_labels` and create a new pandas dataframe with the same columns for augmented labels: `aug_labels`:

In [241]:
import pandas as pd

# read the labels.csv file into a pandas DF
raw_labels = pd.read_csv(list(RAW_LABEL_PATH)[0])
# define a new pandas DF for augmented labels
aug_labels = pd.DataFrame(columns = raw_labels.columns)

We can then generate the full, augmented dataset through the following steps for each image: 
1. Load the image into memory (`plt.imread`, but PIL, OpenCV etc. all work).
2. Extract the image name from the image path `path`, dropping the folder and the file extension.
3. Generate `NUM_AUGMENTATIONS` augmented images by:
    1. Creating an augmented instance of the image by running it through the defined `augmentation_pipeline`.
    2. Saving the augmented image to `data/mushroom_imgs/{filename}` in the target folder.
    3. Recording the label of the augmented image in `aug_labels`.

Finally, save the augmented labels `aug_labels` to disk using `pd.DataFrame.to_csv(path, index = False, ...)`
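
The naming and labelling scheme in these steps can be sketched in isolation, without touching any images. This is a simplified illustration using only the standard library. The helper `augmented_rows`, the image names and the label values are hypothetical, and the augmentation count is kept small for readability:

```python
import csv
import io

NUM_AUGMENTATIONS = 3  # kept small for the demo


def augmented_rows(raw_rows, num_augmentations):
    """Yield (augmented_name, label) pairs, one per augmented copy of each image."""
    for image_name, label in raw_rows:
        for j in range(num_augmentations):
            # each copy gets the source name plus a running suffix, and the source label
            yield f"{image_name}_{j}", label


# Hypothetical raw labels: (image name without extension, integer label)
raw_rows = [("img_001", 0), ("img_002", 4)]
rows = list(augmented_rows(raw_rows, NUM_AUGMENTATIONS))

# Write the rows as CSV text with an 'image,label' header
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["image", "label"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Every raw image contributes exactly `NUM_AUGMENTATIONS` rows, so the augmented label file grows by the same factor as the image folder.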

In [242]:
import matplotlib.pyplot as plt

# Define hyperparameter for the amount of augmentations to make
NUM_AUGMENTATIONS = 10

# Iterate through all images, making copies of them
for path in raw_img_paths:
    # load the image at the path as an ndarray (using plt.imread)
    img = plt.imread(str(path))
    # fetch the name of the image without folder or extension (works on any OS)
    img_name = path.stem

    # look up the label of the raw image; .iloc[0] picks the single matching row
    img_label = int(raw_labels.loc[raw_labels['image'] == img_name, 'label'].iloc[0])

    # Augment 'img' NUM_AUGMENTATIONS times, save it to disk and record the new label
    for j in range(NUM_AUGMENTATIONS):
        # augment the image using the augmentation pipeline
        img_aug = augmentation_pipeline(img)

        # Define the new image name
        img_aug_name = f"{img_name}_{j}"

        # Generate a new path for the image and save it to disk
        img_path = TARGET_DIR / f"{img_aug_name}.jpg"
        img_aug.save(img_path)

        # Record the image label in the new csv file
        label = pd.DataFrame([[img_aug_name, img_label]], columns = raw_labels.columns)
        aug_labels = pd.concat([label, aug_labels])
    

# Save the augmented image labels to file
aug_labels.reset_index(inplace = True)
aug_labels.to_csv(TARGET_LABEL_PATH, columns = ['image', 'label'], index = False)
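
After writing the files, a quick consistency check can confirm that every labelled name has a matching image on disk. The sketch below assumes a label file with an `image` column holding names without extension, as above. The helper `missing_images` and the demo files are hypothetical:

```python
import csv
import tempfile
from pathlib import Path


def missing_images(label_csv, img_dir):
    """Return labelled image names that have no matching .jpg file in `img_dir`."""
    with open(label_csv, newline="") as f:
        names = [row["image"] for row in csv.DictReader(f)]
    return [n for n in names if not (Path(img_dir) / f"{n}.jpg").is_file()]


# Demo: two labelled names, only one of which exists on disk
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "img_001_0.jpg").touch()
    label_csv = tmp / "img_labels.csv"
    label_csv.write_text("image,label\nimg_001_0,0\nimg_001_1,0\n")
    missing = missing_images(label_csv, tmp)

print(missing)
```

In the notebook, the same check could be pointed at `TARGET_LABEL_PATH` and `TARGET_DIR`, where an empty result would indicate that labels and images are in sync.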


