# Progressively Load Images

It is possible to write code to manually load image data and return data ready for modeling.

This would include walking the directory structure for a dataset, loading image data, and returning the input (pixel arrays) and output (class integer).
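As a rough illustration of what such manual code involves, the sketch below walks a dataset directory and pairs each file with an integer class label. It assumes one subdirectory per class; the function name `list_labeled_files` is hypothetical, and decoding the files into pixel arrays is deliberately left out.

```python
import os

def list_labeled_files(root):
    """Return (path, class_index) pairs for a root with one subdirectory per class."""
    # Sort class names so the integer assigned to each class is reproducible.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    class_index = {name: i for i, name in enumerate(class_names)}
    samples = []
    for name in class_names:
        class_dir = os.path.join(root, name)
        # Walk the class directory, including any nested folders.
        for dirpath, _, filenames in os.walk(class_dir):
            for filename in sorted(filenames):
                samples.append((os.path.join(dirpath, filename), class_index[name]))
    return samples, class_index
```

A real loader would still need to read each file, resize it, and stack the results into arrays, which is the part that becomes expensive for large datasets.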

Thankfully, we don't need to write this code. Instead, we can use the [ImageDataGenerator class](https://keras.io/preprocessing/image/) provided by Keras.

The main benefit of using this class to load the data is that images are loaded for a single dataset in batches, so it works equally well for small datasets and for very large image datasets with thousands or millions of images.

Instead of loading all images into memory, it will load just enough images into memory for the current and perhaps the next few mini-batches when training and evaluating a deep learning model. I refer to this as progressive loading, as the dataset is progressively loaded from file, retrieving just enough data for what is needed immediately.
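The idea of progressive loading can be sketched with a plain Python generator. This is only an illustration of the concept, not how Keras implements it; `iter_batches` and `fake_load` are hypothetical names, and `fake_load` stands in for reading and decoding an image file.

```python
def iter_batches(paths, batch_size, load):
    """Yield lists of loaded items, loading only one batch at a time."""
    for start in range(0, len(paths), batch_size):
        # Only the files for the current batch are loaded here.
        yield [load(path) for path in paths[start:start + batch_size]]

loaded = []

def fake_load(path):
    loaded.append(path)  # record which files have been read so far
    return path.upper()  # stand-in for decoding an image

batches = iter_batches(["a.png", "b.png", "c.png"], batch_size=2, load=fake_load)
first = next(batches)
print(first, loaded)  # ['A.PNG', 'B.PNG'] ['a.png', 'b.png']
```

After the first batch is requested, only the first two files have been touched; the third is read when the next batch is needed.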

Two additional benefits of using the *ImageDataGenerator* class are that it can automatically scale the pixel values of images and that it can automatically generate augmented versions of images. We will leave these topics for discussion in another tutorial and instead focus on how to use the *ImageDataGenerator* class to load image data from file.

The pattern for using the *ImageDataGenerator* class is as follows:

1. Construct and configure an instance of the *ImageDataGenerator* class.
2. Retrieve an iterator by calling the `flow_from_directory()` function.
3. Use the iterator in the training or evaluation of a model.



Let’s take a closer look at each step.

The constructor for the *ImageDataGenerator* contains many arguments to specify how to manipulate the image data after it is loaded, including pixel scaling and [data augmentation](https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/). We do not need any of these features at this stage, so configuring the *ImageDataGenerator* is easy.

In [4]:
from keras.preprocessing.image import ImageDataGenerator

# create a data generator
datagen = ImageDataGenerator()

Next, an iterator is required to progressively load images for a single dataset.

This requires calling the `flow_from_directory()` function and specifying the dataset directory, such as the train, test, or validation directory.

The function also allows you to configure more details related to the loading of images. Of note is the `target_size` argument that allows you to load all images to a specific size, which is often required when modeling. The function defaults to square images with the size (256, 256).

The function also allows you to specify the type of classification task via the `class_mode` argument, specifically whether it is binary classification (`binary`) or multi-class classification (`categorical`).
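The practical difference between the two modes is the form of the labels the iterator yields: per the Keras documentation, `binary` gives one 0/1 value per image and `categorical` gives a one-hot vector per image. The plain-Python sketch below mimics those two encodings for illustration only; `encode_labels` is a hypothetical helper, not part of Keras.

```python
def encode_labels(class_indices, num_classes, class_mode):
    """Encode integer class indices in the style of the two class modes."""
    if class_mode == "binary":
        # One 0/1 value per image; only meaningful with exactly two classes.
        return [float(i) for i in class_indices]
    if class_mode == "categorical":
        # One one-hot vector per image, with a 1 in the position of its class.
        return [
            [1.0 if j == i else 0.0 for j in range(num_classes)]
            for i in class_indices
        ]
    raise ValueError("unsupported class_mode: %r" % class_mode)

print(encode_labels([0, 1, 1], 2, "binary"))    # [0.0, 1.0, 1.0]
print(encode_labels([0, 2], 3, "categorical"))  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

The label form should match the output layer and loss of the model, for example a single sigmoid output for `binary` and a softmax over all classes for `categorical`.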

The default `batch_size` is 32, which means that 32 randomly selected images from across the classes in the dataset will be returned in each batch when training. Larger or smaller batches may be desired. You may also want to return batches in a deterministic order when evaluating a model, which you can do by setting `shuffle` to `False`.
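To make the effect of `batch_size` and `shuffle` concrete, here is a minimal sketch of how one epoch of image indices could be split into batches. It is an illustration under simplifying assumptions, not the Keras implementation; `batch_indices` and its `seed` parameter are hypothetical.

```python
import random

def batch_indices(num_images, batch_size, shuffle, seed=None):
    """Return one epoch of batches as lists of image indices."""
    order = list(range(num_images))
    if shuffle:
        # Draw a random order; pass a seed only to make runs repeatable.
        random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, num_images, batch_size)]

print(batch_indices(5, batch_size=2, shuffle=False))  # [[0, 1], [2, 3], [4]]
```

With shuffling disabled, the batches always follow file order, which makes it straightforward to line up predictions with the files they came from.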

There are many other options, and I encourage you to review the [API documentation](https://keras.io/preprocessing/image/).

We can use the same *ImageDataGenerator* to prepare separate iterators for separate dataset directories. This is useful if we would like the same pixel scaling applied to multiple datasets (e.g. train, test, etc.).

In [None]:
# Unzip a local copy of the test, validate, and train files


In [5]:
# load and iterate training dataset
train_it = datagen.flow_from_directory('/home/mkehoe/dev/NASA_Thermal_Earth_Obs/data/train/', class_mode='categorical', batch_size=32)
# load and iterate validation dataset
val_it = datagen.flow_from_directory('/home/mkehoe/dev/NASA_Thermal_Earth_Obs/data/validation/', class_mode='categorical', batch_size=32)
# load and iterate test dataset
test_it = datagen.flow_from_directory('/home/mkehoe/dev/NASA_Thermal_Earth_Obs/data/test/', class_mode='categorical', batch_size=32)

Once the iterators have been prepared, we can use them when fitting and evaluating a deep learning model.

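For example, fitting a model with a data generator can be achieved by calling the `fit_generator()` function on the model and passing the training iterator (`train_it`). The validation iterator (`val_it`) can be specified when calling this function via the `validation_data` argument. Note that since TensorFlow 2.1, `fit()`, `evaluate()`, and `predict()` accept iterators directly, and the `*_generator()` functions used in this tutorial are deprecated.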

The `steps_per_epoch` argument must be specified for the training iterator in order to define how many batches of images define a single epoch.

For example, if you have 1,000 images in the training dataset (across all classes) and a batch size of 64, then `steps_per_epoch` would be 16, that is, 1000/64 rounded up.

Similarly, if a validation iterator is supplied, then the `validation_steps` argument must also be specified to indicate the number of batches in the validation dataset defining one epoch.
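The step counts for both arguments follow from the same arithmetic: divide the number of images by the batch size and round up so that the final, partial batch is included. A small sketch, where `steps_for` is a hypothetical helper name:

```python
import math

def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once (last batch may be partial)."""
    return math.ceil(num_images / batch_size)

print(steps_for(1000, 64))  # 16
```

Rounding down instead would give 15 steps and skip the last 40 images in each epoch.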

In [None]:
# define model
model = ...
# fit model
model.fit_generator(train_it, steps_per_epoch=16, validation_data=val_it, validation_steps=8)

Once the model is fit, it can be evaluated on a test dataset using the `evaluate_generator()` function and passing in the test iterator (`test_it`). The `steps` argument defines the number of batches of samples to step through when evaluating the model before stopping.

In [None]:
# evaluate model
loss = model.evaluate_generator(test_it, steps=24)

Finally, if you want to use your fit model for making predictions on a very large dataset, you can create an iterator for that dataset as well (e.g. `predict_it`) and call the `predict_generator()` function on the model.

In [None]:
# make a prediction
yhat = model.predict_generator(predict_it, steps=24)

Let's use our small dataset defined in the previous section to demonstrate how to define an *ImageDataGenerator* instance and prepare the dataset iterators.

A complete example is listed below.

In [None]:
# example of progressively loading images from file
from keras.preprocessing.image import ImageDataGenerator
# create generator
datagen = ImageDataGenerator()
# prepare an iterator for each dataset
train_it = datagen.flow_from_directory('data/train/', class_mode='binary')
val_it = datagen.flow_from_directory('data/validation/', class_mode='binary')
test_it = datagen.flow_from_directory('data/test/', class_mode='binary')
# confirm the iterator works
batchX, batchy = next(train_it)
print('Batch shape=%s, min=%.3f, max=%.3f' % (batchX.shape, batchX.min(), batchX.max()))

Running the example first creates an instance of the *ImageDataGenerator* with all default configuration.

Next, three iterators are created, one for each of the train, validation, and test binary classification datasets. As each iterator is created, we can see debug messages reporting the number of images and classes discovered and prepared.

Finally, we test out the train iterator that would be used to fit a model. The first batch of images is retrieved and we can confirm that the batch contains two images, as only two images were available. We can also confirm that the images were loaded and forced to the square dimensions of 256 rows and 256 columns of pixels and the pixel data was not scaled and remains in the range [0, 255].