# Create data generators

This notebook shows how to create data generators.

As shown in [other notebooks](./1a_mapillary-dataset-presentation.ipynb), datasets are easy to generate. Generating them is an offline preprocessing step, independent of any machine learning algorithm, that guarantees data persistence on the file system.

However, running a neural network requires reading these data on the fly, so as to save memory. This notebook bridges the gap between the `Dataset` structure and a data structure that a model can consume directly.

For the sake of clarity, we take a `Mapillary` dataset as an example.

## Introduction

Before creating the generator, we do the usual imports.

In [16]:
import os

In [17]:
from deeposlandia import dataset, generator, utils

Some global variables are created as well.

In [23]:
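DATAPATH = "../data"
DATASET = "mapillary"  # alternatives: "shapes", "aerial"
MODEL = "semantic_segmentation"  # alternative: "feature_detection"
IMG_SIZE = 128  # side length, in pixels, of the square preprocessed images
BATCH_SIZE = 10  # number of images per generated batch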

Then we prepare the required folders, so that the rest of the notebook can focus on the data creation commands.

In [24]:
input_folder = utils.prepare_input_folder(DATAPATH, DATASET)
input_data = os.path.join(input_folder, "training")
input_config = os.path.join(input_folder, "config_aggregate.json")
preprocess_folder = utils.prepare_preprocessed_folder(DATAPATH, DATASET, IMG_SIZE, "aggregated")
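
To make the folder preparation more concrete, here is a minimal sketch of what such a helper might do. It is not the `deeposlandia` implementation: the function name `prepare_folders` is hypothetical, and the layout is only inferred from the path printed later in this notebook (`../data/mapillary/preprocessed/128_aggregated/training.json`) and from the `"training"` and `"training_config"` keys used below.

```python
import os
import tempfile


def prepare_folders(datapath, dataset, image_size, aggregate_value):
    """Hypothetical helper: create the preprocessed tree and return its paths.

    The layout <datapath>/<dataset>/preprocessed/<size>_<aggregate_value>
    is an assumption inferred from the notebook output, not a documented API.
    """
    root = os.path.join(
        datapath, dataset, "preprocessed", f"{image_size}_{aggregate_value}"
    )
    training = os.path.join(root, "training")
    # exist_ok=True makes the call safe to repeat across notebook runs
    os.makedirs(training, exist_ok=True)
    return {
        "training": training,
        "training_config": os.path.join(root, "training.json"),
    }


# Work in a temporary directory so that the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    folders = prepare_folders(tmp, "mapillary", 128, "aggregated")
    created = os.path.isdir(folders["training"])
    relative_config = os.path.relpath(folders["training_config"], tmp)
```

Returning a dictionary of paths lets later cells refer to each location by name instead of rebuilding the paths by hand.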

## Build a `Dataset`

As a first step, we need to create a preprocessed dataset from the raw data. This is done with the `Dataset` class (and its subclasses).

First we instantiate an empty dataset.

In [25]:
d = dataset.MapillaryDataset(IMG_SIZE, input_config)

To populate it, we pass it some input images, which are preprocessed during the operation. The preprocessed images are saved to the file system.

In [26]:
d.populate(preprocess_folder["training"],
           input_data,
           nb_images=100,
           aggregate=True)

Last but not least, the dataset information must be saved as well. It is stored in a `.json` file, which is indispensable when building the data generator, since the dataset and the neural network are created in different modules.

In [27]:
d.save(preprocess_folder["training_config"])

2018-08-22 15:01:49,563 :: INFO :: dataset :: save : The dataset has been saved into ../data/mapillary/preprocessed/128_aggregated/training.json


## Create a generator

Once the dataset is created, the only remaining prerequisite is to recover the preprocessed dataset information stored previously.

In [11]:
config = utils.read_config(preprocess_folder["training_config"])
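
The save-then-read round trip between the two modules can be sketched with the standard `json` module. This is an illustration under assumptions, not the library code: `write_dataset_config` and `load_dataset_config` are hypothetical stand-ins for `d.save` and `utils.read_config`, and the label entries are made up. Only the `"labels"` key is known from this notebook.

```python
import json
import os
import tempfile


def write_dataset_config(path, image_size, labels):
    """Hypothetical stand-in for `d.save`: dump dataset metadata as JSON."""
    with open(path, "w") as fobj:
        json.dump({"image_size": image_size, "labels": labels}, fobj)


def load_dataset_config(path):
    """Hypothetical stand-in for `utils.read_config`: load the JSON back."""
    with open(path) as fobj:
        return json.load(fobj)


# Made-up label descriptions; the real file structure may differ
labels = [{"id": 0, "name": "background"}, {"id": 1, "name": "road"}]

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "training.json")
    write_dataset_config(config_path, 128, labels)
    restored = load_dataset_config(config_path)
```

Because the file is plain JSON, the module that builds the neural network only needs the file path, not the `Dataset` object itself.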

Now the generator can be created from some dataset and path parameters. The `create_generator` function is based on the [Keras](https://keras.io/preprocessing/image/) `ImageDataGenerator` class, and more specifically on its `flow_from_directory` method.

In [12]:
train_generator = generator.create_generator(DATASET,
                                             MODEL,
                                             preprocess_folder["training"],
                                             IMG_SIZE,
                                             BATCH_SIZE,
                                             config["labels"])

Found 100 images belonging to 1 classes.
Found 100 images belonging to 1 classes.
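
The two `Found 100 images` messages suggest that two directory flows are built, presumably one for the images and one for the labels. How two such flows might be combined into a single generator of tuples can be sketched in plain Python. This is an assumption about the mechanism, not the actual `create_generator` code; the file names and the helpers `batches` and `paired_generator` are hypothetical.

```python
import itertools


def batches(items, batch_size):
    """Yield successive batches from `items`, looping forever as Keras flows do."""
    cycle = itertools.cycle(items)
    while True:
        yield [next(cycle) for _ in range(batch_size)]


def paired_generator(image_batches, label_batches):
    """Yield (images, labels) tuples by advancing both generators in lockstep."""
    for images, labels in zip(image_batches, label_batches):
        yield images, labels


# Made-up file names standing in for the preprocessed images and labels
image_names = [f"img_{i:03d}.jpg" for i in range(100)]
label_names = [f"img_{i:03d}.png" for i in range(100)]

pairs = paired_generator(batches(image_names, 10), batches(label_names, 10))
first_images, first_labels = next(pairs)
```

With real Keras flows, passing the same `seed` to both `flow_from_directory` calls keeps images and labels aligned when shuffling is enabled.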


As a result, one gets a Python generator that yields tuples of images and labels. This can be verified by extracting the first item of the generator:

In [13]:
item = next(train_generator)
item[0].shape, item[1].shape

((10, 224, 224, 3), (10, 224, 224, 13))

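The first element of the tuple contains 10 three-channel images, whilst the second element contains 10 label arrays with 13 channels (`13` being the number of aggregated labels in the dataset). With `IMG_SIZE = 128`, the expected shapes are `(10, 128, 128, 3)` and `(10, 128, 128, 13)`; the `224` in the output above presumably comes from an earlier run with a different image size.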
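
The trailing dimension of `13` in the label array is what one would get from a one-hot encoding of per-pixel label identifiers. The sketch below illustrates that encoding with NumPy on random data. Whether `deeposlandia` encodes labels exactly this way is an assumption; the function `one_hot_labels` is hypothetical.

```python
import numpy as np


def one_hot_labels(label_ids, nb_labels):
    """Turn an integer label map (batch, h, w) into (batch, h, w, nb_labels)."""
    # Indexing the identity matrix picks one unit row per pixel
    return np.eye(nb_labels, dtype=np.uint8)[label_ids]


rng = np.random.default_rng(0)
# Random label identifiers standing in for a batch of 10 label images
label_ids = rng.integers(0, 13, size=(10, 128, 128))
encoded = one_hot_labels(label_ids, 13)
print(encoded.shape)
```

Each pixel then carries exactly one `1` along the last axis, at the index of its label.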