<a href="https://colab.research.google.com/github/Sicily-F/cagedbirdID/blob/main/7_Data_pre_processing_image_augmentation_as_a_method_of_class_balancing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Offline data augmentation using the imgaug package

In [None]:
from imgaug import augmenters as iaa
import imgaug as ia  # needed for ia.imshow, for example
import os, random
import PIL
import shutil
from PIL import Image
import cv2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import glob
from pathlib import Path
import io
import imageio
from os import listdir
from skimage import img_as_ubyte  # to address conversion error


Data augmentation is a useful way to increase the size of your dataset, whether to balance the size of classes or to enlarge the training set more generally. Here, this offline form of augmentation was used to balance the number of photos in each class, so that every species class had an equal number of photos before training.

Before starting the augmentation process, it is worth checking which file types are in your dataset and how your files are named. Unsupported file types or spaces in file names, for example, can interrupt image augmentation. Because the process halts at whichever image caused the problem, recovering can be lengthy: you have to delete the newly augmented files and then re-run the function.


Another thing to be aware of is the desktop.ini file
(https://www.addictivetips.com/windows-tips/hide-desktop-ini-files-windows-10/). This is a text file that specifies how a file system folder is displayed. If it is visible in the folder containing your images for augmentation, not all of the photos may be processed, and the code below might stop at a certain point.
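
A pre-flight check along these lines can flag problem files before augmentation starts. This is a minimal sketch and not part of the original workflow: the function name `find_problem_files` and the contents of `SUPPORTED_EXTENSIONS` are assumptions, so adjust the extensions to whatever your image loader accepts.

```python
import os

# Assumed set of image extensions; adjust to match what your loader supports
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_problem_files(folder):
    """Return (unsupported, with_spaces): file names likely to interrupt augmentation."""
    unsupported, with_spaces = [], []
    for name in sorted(os.listdir(folder)):
        if not os.path.isfile(os.path.join(folder, name)):
            continue  # ignore sub-folders
        ext = os.path.splitext(name)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            unsupported.append(name)  # e.g. desktop.ini
        if " " in name:
            with_spaces.append(name)
    return unsupported, with_spaces
```

Calling `find_problem_files` on the image folder returns two lists. If both are empty, no file has an unexpected extension or a space in its name.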

In [None]:
# This code renames all the photos in a folder. Below is the path where the photos of a species called the Japanese grosbeak were stored
# When running for loops on a whole folder of images, it is best to have filenames that make sense to you, with no spaces or unusual characters
collection = "F:/all_species_augmented_croppe/to_augment/grosbeak"
# Rename each file in the grosbeak folder to grosbeak followed by sequential numbers starting at 0
# Note: this only changes the file name and extension; it does not convert the image format
for i, filename in enumerate(os.listdir(collection)):
    os.rename(os.path.join(collection, filename), os.path.join(collection, "grosbeak" + str(i) + ".jpg"))
    

Another thing to note is that, due to various bugs and issues in initial experimentation, this code *only* reliably works for one folder at a time. We tried out different folder structures and found that, depending on your RAM and processor speed, a folder containing a single sub-folder was the most manageable computationally. This aligns with our methods, since we ran the MegaDetector (see file no. 6) on each species. We intend to update this in the future.



In [None]:
# Note the path might look like this initially if you copied and pasted the folder: 'F:/git/to-aug/bw_leafbird'
# It should actually point to the parent folder, path = 'F:\\git\\to-aug', so that the loop works on the only subfolder in the directory, bw_leafbird
path = 'F:\\git\\to-aug'


In [None]:
# The code below is adapted heavily from 'A simple and common augmentation sequence': https://imgaug.readthedocs.io/en/latest/source/examples_basics.html

# Using a seed makes the randomization reproducible
ia.seed(2)

# A minimal sequence: flip, crop and rotate
# Note: this is overwritten by the fuller sequence defined next
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.Affine(rotate=(-25, 25))
], random_order=True)


seq = iaa.Sequential([
    iaa.Fliplr(0.5), # horizontal flips
    iaa.Crop(percent=(0, 0.1)), # random crops
    # Small gaussian blur with random sigma between 0 and 0.5.
    # But we only blur about 50% of all images.
    iaa.Sometimes(
        0.5,
        iaa.GaussianBlur(sigma=(0, 0.5))
    ),
    # Strengthen or weaken the contrast in each image.
    iaa.LinearContrast((0.75, 1.5)),
    # Add gaussian noise.
    # For 50% of all images, we sample the noise once per pixel.
    # For the other 50% of all images, we sample the noise per pixel AND
    # channel. This can change the color (not only brightness) of the
    # pixels.
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05*255), per_channel=0.5),
    # Make some images brighter and some darker.
    # In 20% of all cases, we sample the multiplier once per channel,
    # which can end up changing the color of the images.
    iaa.Multiply((0.8, 1.2), per_channel=0.2),
    # Apply affine transformations to each image.
    # Scale/zoom them, translate/move them, rotate them and shear them.
    iaa.Affine(
        scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},
        rotate=(-25, 25),
        shear=(-8, 8)
    )
], random_order=True) 

In [None]:
for folder in os.listdir(path):
    folder_path = os.path.join(path, folder)
    i = 0  # Counter used to number the augmented output files
    for fname in os.listdir(folder_path):
        # cv2.imread returns None for files it cannot decode (e.g. desktop.ini), so skip those
        img = cv2.imread(os.path.join(folder_path, fname))
        if img is None:
            continue
        # OpenCV loads images as BGR, so convert to RGB for correct colours when displaying
        # See https://stackoverflow.com/questions/39316447/opencv-giving-wrong-color-to-colored-images-on-loading
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        print('Original:')
        ia.imshow(img)
        img_aug = seq.augment_image(img)
        print('Augmented:')
        ia.imshow(img_aug)
        # Convert back to BGR before saving, since cv2.imwrite expects BGR, and write the file into the same folder
        cv2.imwrite(os.path.join(folder_path, folder + "%06d.png" % (i,)), cv2.cvtColor(img_aug, cv2.COLOR_RGB2BGR))
        i += 1

# If you have any issues with this code, check your error messages against: https://stackoverflow.com/questions/32225348/os-listdir-showing-file-actually-not-in-the-folder-on-python-and-windows
# For issues relating to image loading, see: https://stackoverflow.com/questions/39316447/opencv-giving-wrong-color-to-colored-images-on-loading

        

## The value of class balancing
Because one round of augmentation was performed per photo, we had to standardise the number of photos per class after augmentation.

There are other, possibly more parsimonious, ways to do this, but this was the quick, hacky solution I chose.

Let's say we want a range of 210-260 photos per species. If we double a ground-truth dataset of 150 photos through augmentation, we end up with 300 photos post-augmentation. Using the code below, we can randomly remove 50 files to get back down to 250.

For the example code below, we opened the folder locally in File Explorer to see the number of photos. For the class below, bw_leafbird (blue-winged leafbird), there are 348 photos post-augmentation, so we randomly remove 98 photos to leave a final 250.

In [None]:
folder = 'F:/git/to-aug/bw_leafbird'
files = os.listdir(folder)  # Get filenames in current folder
files = random.sample(files, 98)  # Pick 98 random files to delete (348 - 250)
for file in files:  # Go over each file name to be deleted
    f = os.path.join(folder, file)  # Create valid path to file
    os.remove(f)  # Remove the file   
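
Rather than counting files by hand and hard-coding the number to remove, the count can be derived from a target size. The following is a minimal sketch under stated assumptions: `trim_to_target` and its `seed` parameter are hypothetical names introduced here, and every regular file in the folder (including any stray non-image file) is counted.

```python
import os
import random


def trim_to_target(folder, target, seed=None):
    """Randomly delete files so that `folder` holds at most `target` files.

    Returns the number of files removed.
    """
    files = sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
    )
    n_remove = max(0, len(files) - target)  # e.g. 348 - 250 = 98
    rng = random.Random(seed)  # seeded so the selection is reproducible
    for name in rng.sample(files, n_remove):
        os.remove(os.path.join(folder, name))
    return n_remove
```

For the blue-winged leafbird example, `trim_to_target('F:/git/to-aug/bw_leafbird', 250)` would remove 98 of the 348 files. Deletion is permanent, so it may be worth trying this on a copy of the folder first.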
 

Since the photos were randomly removed, the file numbers are no longer sequential. You can renumber the files consecutively using the code below.

In [None]:
# This is a bit hacky: renaming straight to bw_leafbird0.jpg, bw_leafbird1.jpg, ... can fail if a target name already exists,
# so first rename everything to a different prefix (bw_leafbirds) and then rename back to bw_leafbird

# Step 1: rename each file in the bw_leafbird folder to bw_leafbirds followed by sequential numbers starting at 0
collection = "F:/git/to-aug/bw_leafbird"
for i, filename in enumerate(os.listdir(collection)):
    os.rename(os.path.join(collection, filename), os.path.join(collection, "bw_leafbirds" + str(i) + ".jpg"))

# Step 2: rename back to bw_leafbird followed by sequential numbers starting at 0
for i, filename in enumerate(os.listdir(collection)):
    os.rename(os.path.join(collection, filename), os.path.join(collection, "bw_leafbird" + str(i) + ".jpg"))
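
The two-step rename above can also be wrapped in a single function that passes through temporary names. This is only a sketch of one possible approach, not the method used in the project: `rename_sequentially` is a hypothetical helper, and unlike the cell above it keeps each file's original extension instead of forcing `.jpg`.

```python
import os
import uuid


def rename_sequentially(folder, prefix):
    """Rename every file in `folder` to <prefix><i><original extension>."""
    names = sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
    )
    tag = uuid.uuid4().hex  # unique marker so temporary names cannot clash
    temp_names = []
    # Pass 1: move every file to a temporary name
    for i, name in enumerate(names):
        ext = os.path.splitext(name)[1]
        temp = f"{tag}_{i}{ext}"
        os.rename(os.path.join(folder, name), os.path.join(folder, temp))
        temp_names.append(temp)
    # Pass 2: move from the temporary names to the final sequential names
    for i, temp in enumerate(temp_names):
        ext = os.path.splitext(temp)[1]
        os.rename(os.path.join(folder, temp), os.path.join(folder, f"{prefix}{i}{ext}"))
```

Because no final name exists while pass 1 runs, and no temporary name matches a final name, the function can be run once even when the folder already contains files such as `bw_leafbird0.jpg`.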
    
    