# Step 1 - Image Preprocessing

## Preprocessing data

In this notebook, I have documented how the data for the CNN Picture Categorizer is preprocessed.

The functions and parameters described in this document all belong to the **ImagePreprocessor** class in *image_preprocessor.py*.

The Image Preprocessor is a class with the following important parameters:
- **root_dir *(string)*** *the root directory where the images are stored*
- **originals_dir *(string)*** *where the images of the original dataset are stored*
- **training_dir *(string)*** *where the training images are stored*
- **validation_dir *(string)*** *where the validation images are stored*
- **test_dir *(string)*** *where the test images are stored*
- **target_imagesize *(tuple)*** *a tuple describing the final size of the images*
- **test_size *(float)*** *the fraction of all images (between 0 and 1) assigned to the test set*
- **validation_size *(float)*** *the fraction of all images (between 0 and 1) assigned to the validation set*
- **training_count *(integer)*** *how many images there are in the training set*
- **validation_count *(integer)*** *how many images there are in the validation set*
- **test_count *(integer)*** *how many images there are in the test set*
- **categories *(string array)*** *an array with the names of all the categories (classes)*

The Image Preprocessor exposes the following *public* functions:
- **initialize** *splits the original dataset into training, validation and test sets and records their respective sizes*
- **file_to_tensor(*file*)** *takes a single image location and returns its matrix representation*
- **files_to_tensors(*files*)** *takes a list of image locations and returns their matrix representations stacked into a single tensor*

In [1]:
import os
import numpy as np
from glob import glob
from sklearn.datasets import load_files
from keras.utils import np_utils
from keras.preprocessing import image
from keras.preprocessing.image import img_to_array, load_img
import re
import random
import shutil

Using TensorFlow backend.


## 1: Initialize parameters

The cell below sets the default parameter values that the later functions refer to. After instantiating the ImagePreprocessor class, these values can be changed before calling the *initialize()* function.

In [2]:
# set the default values for the parameters
root_dir = os.path.join('..', 'application', 'images')
originals_dir = os.path.join(root_dir, 'original')
training_dir = os.path.join(root_dir, 'train')
val_dir = os.path.join(root_dir, 'validation')
test_dir = os.path.join(root_dir, 'test')

target_imagesize = (256, 256)

# the sizes of the test and validation sets as compared to the total amount of images
test_size = 0.2
validation_size = 0.2

# if True, data in the training, test and validation directories is deleted before the data in the originals directory is split
clear_existing_data = False

random_seed = 7

# the amount of images in the training, validation and test sets
training_count = 0
validation_count = 0
test_count = 0

# list of categories
categories = []

## 1: Split the dataset

A model will be trained with the training set and validated with the validation set. Whenever the performance on the validation set improves, the weights of the model are saved (overwriting the previously best performing weights). Several models will be trained in this fashion.

After all models have been trained, we choose the one that performs best on data it hasn't seen before: the test set.

This is why, in order to train our model and to test how well the model is performing, the dataset must be split up randomly into training, validation and test sets.

The __split_dataset function will go through the images in the original dataset and split them into training, validation and test sets according to the *test_size* and *validation_size* parameters set above.

Both __split_dataset and its helper __empty_directory (which deletes all files in a given directory) are private: they are only called from inside the class, by the public *initialize()* function.
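
As a rough illustration of the split arithmetic, the sketch below computes the set sizes for one category and slices a shuffled list in the same order as the loop in __split_dataset (test first, then validation, then training). The helper names `split_counts` and `partition_files` are hypothetical and are not part of *image_preprocessor.py*, and the sketch uses a local `random.Random` instance instead of seeding the global generator.

```python
import random


def split_counts(n_files, test_size=0.2, validation_size=0.2):
    """Return (train, validation, test) counts for one category."""
    test_count = round(n_files * test_size)
    validation_count = round(n_files * validation_size)
    # whatever is left after rounding goes to the training set
    train_count = n_files - test_count - validation_count
    return train_count, validation_count, test_count


def partition_files(files, test_size=0.2, validation_size=0.2, seed=7):
    """Shuffle a copy of `files` and slice it into train/validation/test lists."""
    _, validation_count, test_count = split_counts(
        len(files), test_size, validation_size
    )
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_count]
    validation = shuffled[test_count:test_count + validation_count]
    train = shuffled[test_count + validation_count:]
    return train, validation, test


names = ["img_%02d.jpg" % i for i in range(10)]
train, validation, test = partition_files(names)
print(len(train), len(validation), len(test))  # 6 2 2
```

Because the counts are rounded per category, the overall proportions can deviate slightly from *test_size* and *validation_size* when categories are small.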

In [3]:
# function to remove all files in given directory
def __empty_directory(path):
    for file in os.listdir(path):
        os.remove(os.path.join(path, file))
    return

In [4]:
# splits images in original directory into training, test and validation directories
def __split_dataset():
    random.seed(random_seed)
    
    for category in os.listdir(originals_dir):
        # make a new directory where they don't exist and empty existing directories
        for p in [re.split(r'[\\/]', training_dir)[-1], re.split(r'[\\/]', val_dir)[-1], re.split(r'[\\/]', test_dir)[-1]]:
            if not os.path.exists(os.path.join(root_dir, p, category)):
                os.makedirs(os.path.join(root_dir, p, category))
            if clear_existing_data == True:
                __empty_directory(os.path.join(root_dir, p, category))
            
        # collect all the files in the originals directory
        files = []
        for file in os.listdir(os.path.join(originals_dir, category)):
            files.append(file)
        
        # calculate the training, validation and test set sizes
        test_count = round(len(files) * test_size)
        validation_count = round(len(files) * validation_size)
        # every remaining file goes to the training set
        train_count = len(files) - test_count - validation_count
        
        # randomly shuffle the array of files
        random.shuffle(files)
        
        for i, file in enumerate(files):
            location = None
            if i < test_count:
                location = test_dir
            elif i < test_count + validation_count:
                location = val_dir
            else:
                location = training_dir
                
            shutil.copyfile(os.path.join(originals_dir, category, file), os.path.join(location, category, file))
    return

## 2: Getting additional class parameters

The code that instantiates the *ImagePreprocessor* class described in this document may need to know how large each dataset is and what the categories are called.

The *training_count*, *validation_count*, *test_count* and *categories* variables are not assigned any values by default. After the data has been split, we can count how many images are in each of the datasets and what categories are used.

In [5]:
# returns an array with the category names
def __get_categories():        
    return [item[len(originals_dir)+1:] for item in sorted(glob(os.path.join(originals_dir, "*")))]

# returns the sizes of the training, validation and test sets
def __get_dataset_sizes():
    # count every file below each directory, including the category subdirectories
    train_count = sum(len(files) for r, d, files in os.walk(training_dir))
    val_count = sum(len(files) for r, d, files in os.walk(val_dir))
    test_count = sum(len(files) for r, d, files in os.walk(test_dir))
    
    return train_count, val_count, test_count
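
The counting approach can be tried on a throwaway directory tree. The sketch below is only an illustration under assumed names: `count_files` and `list_categories` are hypothetical helpers, and the "cats" and "dogs" folders are made-up example categories created inside a temporary directory.

```python
import os
import tempfile


def count_files(directory):
    """Count all files below `directory`, including those in subdirectories."""
    return sum(len(files) for _, _, files in os.walk(directory))


def list_categories(directory):
    """Return the sorted names of the immediate subdirectories."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )


# Build a temporary tree: two categories holding 3 and 2 empty files.
with tempfile.TemporaryDirectory() as root:
    for category, n_files in [("cats", 3), ("dogs", 2)]:
        os.makedirs(os.path.join(root, category))
        for i in range(n_files):
            open(os.path.join(root, category, "img_%d.jpg" % i), "w").close()

    total = count_files(root)
    names = list_categories(root)

print(total, names)  # 5 ['cats', 'dogs']
```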

## 3: Initialize the class

Now that all private functions are in place, we can define the public *initialize* function, which calls them in the correct order.

The *initialize* function is the only function that needs to be called after instantiating the ImagePreprocessor class. It splits the dataset and then stores the set sizes and the category names.

In [6]:
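def initialize(self):
    # in the class these are attributes on self; in this notebook they are module-level variables
    global training_count, validation_count, test_count, categories

    __split_dataset()
    
    training_count, validation_count, test_count = __get_dataset_sizes()
    categories = __get_categories()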

## 4: Convert images into tensors

The code section below converts the images into tensors: arrays of numeric values representing the intensity of each pixel. Each image is then divided by its own maximum value, so its values lie between 0 and 1 rather than between 0 and 255.

This normalization makes it easier for the model to train, as all pixel values are now expressed relative to the brightest pixel in each particular image. The brightest pixel in a dark image will have the value 1, even though it may not have been 255 before the normalization.

In [7]:
# takes a single image filepath and returns a 4D tensor of shape (1, height, width, channels)
def file_to_tensor(self, file):
    img = image.load_img(file, target_size=target_imagesize)
    x = image.img_to_array(img)
    peak = x.max()
    if peak > 0: # avoid dividing by zero for an all-black image
        x *= (1.0/peak) # set the range of the tensor values between 0 and 1
    return np.expand_dims(x, axis=0)

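# takes a list of image filepaths and returns one stacked 4D tensor
def files_to_tensors(self, files):
    # file_to_tensor is defined with a self parameter, so pass it along
    list_of_tensors = [file_to_tensor(self, file) for file in files]
    return np.vstack(list_of_tensors)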
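
To see what the per-image scaling and the stacking do without loading any files, the following NumPy-only sketch may help. It is an illustration rather than the class's code: `scale_by_image_max` is a hypothetical helper and the tiny 2x2 "image" is made up.

```python
import numpy as np


def scale_by_image_max(x):
    """Scale an array so its largest value becomes 1.0 (all-zero input is returned as is)."""
    x = x.astype("float32")
    peak = x.max()
    return x / peak if peak > 0 else x


# A "dark" 2x2 RGB image whose brightest value is 100, not 255.
dark = np.full((2, 2, 3), 50, dtype="uint8")
dark[0, 0, 0] = 100

per_image = scale_by_image_max(dark)    # brightest pixel becomes 1.0
fixed = dark.astype("float32") / 255.0  # brightest pixel stays below 1.0

# Add a batch axis to each image, then stack along it:
# three (1, H, W, C) tensors become one (3, H, W, C) tensor.
batch = np.vstack([np.expand_dims(per_image, axis=0)] * 3)
print(per_image.max(), batch.shape)  # 1.0 (3, 2, 2, 3)
```

Dividing by a fixed 255 would instead keep the brightness differences between images, which is the trade-off to keep in mind when choosing between the two.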

## 5: Load datasets

The function below takes a directory and returns a list of image locations, along with a list of one-hot encoded targets.

In [8]:
# load file locations and labels
def load_dataset(path):
    data = load_files(path)
    files = np.array(data['filenames'])
    targets = np_utils.to_categorical(np.array(data['target']), max(data['target'])+1)
    return files, targets
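
For readers unfamiliar with one-hot encoding, the sketch below shows the idea on three integer labels. It uses `np.eye` as a stand-in for the Keras call above, and the `one_hot` helper is hypothetical, so treat it as an illustration of the output format only.

```python
import numpy as np


def one_hot(labels, num_classes=None):
    """Turn integer class labels into a (n_samples, num_classes) one-hot matrix."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # same convention as above: highest label index + 1
        num_classes = int(labels.max()) + 1
    # row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]


targets = one_hot([0, 2, 1])
print(targets)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```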