# Image Data Augmentation with Keras

In this tutorial we'll look at a common scenario when working with convolutional neural networks (CNNs): training an image-classification model with very little data. "Few" samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we'll classify images as dogs or cats using a dataset of 5,000 pictures (2,500 cats, 2,500 dogs). We'll use 2,000 pictures for training, 1,000 for validation, and 2,000 for testing.

## Preparation

This section sets up our environment, mounts Google Drive, and connects to Kaggle.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Choose your directory where you would like to save the dataset
%cd "/content/drive/MyDrive/..."

In [None]:
# Go to kaggle.com, account, get API key and upload it
from google.colab import files
files.upload()

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Check kaggle is working
!kaggle datasets list

In [None]:
# Download dogs vs cats dataset
# Might need to accept competition terms first
# https://www.kaggle.com/competitions/dogs-vs-cats/data
!kaggle competitions download -c dogs-vs-cats

In [None]:
!unzip -qq dogs-vs-cats.zip

In [None]:
!unzip -qq train.zip

Now that we have downloaded and unzipped the data, we will take a small sample and divide it into training, validation, and test sets using the following structure:

```
cats_vs_dogs_small/
...train/
......cat/         
......dog/         
...validation/
......cat/         
......dog/         
...test/
......cat/         
......dog/
```

Notice that we only take a small sample of the original dataset, both to simulate training on a smaller dataset and to speed up training. Many real-world image datasets follow a similar organization scheme, with a separate folder for each class.
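Once the split has been created, a quick sanity check is to count the files in each folder. The helper below is only a sketch: `count_images` is a hypothetical name, and it assumes the `subset/category/file.jpg` layout shown above.

```python
import pathlib
from collections import Counter


def count_images(base_dir):
    """Count .jpg files per (subset, category) pair.

    Assumes the layout base_dir/<subset>/<category>/<file>.jpg.
    """
    base_dir = pathlib.Path(base_dir)
    counts = Counter()
    for path in base_dir.glob("*/*/*.jpg"):
        subset, category = path.parts[-3], path.parts[-2]
        counts[(subset, category)] += 1
    return dict(counts)
```

With the split used in this tutorial, `count_images("cats_vs_dogs_small")` should report 1,000 images per class for `train`, 500 for `validation`, and 1,000 for `test`.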

In [None]:
import os, shutil, pathlib

original_dir = pathlib.Path("train")
new_base_dir = pathlib.Path("cats_vs_dogs_small")

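def make_subset(subset_name, start_index, end_index):
    """Copy images start_index..end_index-1 of each class into a subset folder."""
    for category in ("cat", "dog"):
        # Avoid shadowing the built-in dir(); exist_ok lets the cell be re-run
        subset_dir = new_base_dir / subset_name / category
        os.makedirs(subset_dir, exist_ok=True)
        fnames = [f"{category}.{i}.jpg"
                  for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname,
                            dst=subset_dir / fname)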

make_subset("train", start_index=0, end_index=1000)
make_subset("validation", start_index=1000, end_index=1500)
make_subset("test", start_index=1500, end_index=2500)

## Building the model

We are now ready to build a CNN. We will opt for a "classic" architecture consisting of 3 x 3 convolution layers with ReLU activation functions, interspersed with 2 x 2 max-pooling layers.

The output will consist of a single dense layer of one unit with a sigmoid activation function, since we only have two possible classes (cats vs dogs).
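To make the role of the single sigmoid unit concrete, here is a minimal pure-Python sketch of how a raw score becomes a probability, a class label, and a binary cross-entropy loss value. The function names and the 0.5 threshold are illustrative assumptions. The label mapping is also an assumption: it relies on class folders being indexed in alphabetical order, so `cat` would be 0 and `dog` would be 1.

```python
import math


def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def predict_label(p, threshold=0.5):
    """Turn a probability into a hard 0/1 label (threshold is an assumption)."""
    return 1 if p >= threshold else 0


p = sigmoid(2.0)  # a confident score for class 1
print(predict_label(p), round(binary_crossentropy(1, p), 3))
```

A confident correct prediction gives a small loss, while a confident wrong one is penalized heavily, which is why this loss pairs naturally with a sigmoid output.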

Notice the first two layers of the model. `Input` lets us specify the dimensions of the sample images (180 x 180, with 3 channels). We're also using a `Rescaling` layer to map pixel values from the [0, 255] range to the [0, 1] range.
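It can help to trace how the spatial dimensions shrink through the network. The sketch below assumes the Keras defaults (`padding="valid"` and stride 1 for `Conv2D`, stride equal to `pool_size` for `MaxPooling2D`) and a stack of five convolutions with a pooling layer after each one except the last. The helper names are hypothetical.

```python
def conv_out(size, kernel_size=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel_size + 1


def pool_out(size, pool_size=2):
    """Output size of max pooling with stride equal to pool_size."""
    return size // pool_size


def trace_shapes(input_size=180, filters=(32, 64, 128, 256, 256)):
    """Return (layer, height, width, channels) after each conv/pool layer."""
    shapes = []
    size = input_size
    for i, n_filters in enumerate(filters):
        size = conv_out(size)
        shapes.append(("conv", size, size, n_filters))
        if i < len(filters) - 1:  # no pooling after the last convolution
            size = pool_out(size)
            shapes.append(("pool", size, size, n_filters))
    return shapes


shapes = trace_shapes()
for shape in shapes:
    print(shape)

_, h, w, c = shapes[-1]
print("flattened features:", h * w * c)  # 7 * 7 * 256 = 12544
```

These numbers can be compared against the output shapes reported by `model.summary()`.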

In [None]:
from tensorflow import keras
from keras import layers

model = keras.Sequential([keras.Input(shape=(180, 180, 3)),
                          layers.Rescaling(1./255),
                          layers.Conv2D(filters=32, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=64, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=128, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=256, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=256, kernel_size=3, activation="relu"),
                          layers.Flatten(),
                          layers.Dense(1, activation="sigmoid")])

model.summary()

In [None]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

We will now use the [`image_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory) function to easily load our sample images into a `Dataset`. The function will automatically:
* Read the picture files.
* Decode the JPEG content to RGB grids of pixels.
* Convert these into floating-point tensors.
* Assign a label based on parent folder.
* Resize them to a shared size (we'll use 180 × 180).
* Pack them into batches (we'll use batches of 32 images).

Loading data this way is fundamental when working with large amounts of data, since we generally won't be able to fit all of it into memory. Instead of loading everything up front, the dataset loads batches into memory on demand.
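The on-demand idea can be illustrated with a plain Python generator. This is only a conceptual sketch, not how `image_dataset_from_directory` is implemented. The `batched` helper and the file names are made up for illustration.

```python
def batched(items, batch_size):
    """Lazily group items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Hypothetical file names standing in for 2,000 training images
fake_paths = (f"img_{i}.jpg" for i in range(2000))

batch_sizes = [len(batch) for batch in batched(fake_paths, 32)]
print(len(batch_sizes), batch_sizes[-1])  # 63 batches, the last one holds 16
```

Only one batch exists in memory at a time. With 2,000 training images and a batch size of 32, an epoch has 63 steps, and the last batch is smaller than the rest.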

In [None]:
from keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32)

validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32)

test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32)

In [None]:
for data_batch, labels_batch in train_dataset:
  print("data batch shape:", data_batch.shape)
  print("labels batch shape:", labels_batch.shape)
  break

We are now ready to start training our model. We'll add a callback that saves the best model seen so far, based on the validation loss value.
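The "save best only" behavior can be pictured as keeping track of the lowest validation loss seen so far and writing a new file only when it improves. The sketch below mimics that logic in plain Python. The function name and the loss values are made up for illustration and are not results from this model.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a 'best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved


# Illustrative validation losses (not real training output)
example_val_losses = [0.69, 0.66, 0.70, 0.61, 0.63, 0.65]
print(epochs_that_save(example_val_losses))  # [1, 2, 4]
```

In this example the file on disk after training would hold the model from epoch 4, even though training continued for two more epochs.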

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint(
    filepath="convnet_from_scratch.keras", save_best_only=True, monitor="val_loss")]

history = model.fit(train_dataset, epochs=11,
                    validation_data=validation_dataset, callbacks=callbacks)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16, 4))

plt.subplot(121)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(122)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

After only a few epochs we can readily observe the beginning of overfitting (notice the widening gap between the training and validation curves). This is to be expected given the relatively small number of training images.

If we evaluate the model on the test set, the results are, as expected, not great.

In [None]:
test_model = keras.models.load_model("convnet_from_scratch.keras")
test_loss, test_acc = test_model.evaluate(test_dataset)
print(f"Test accuracy: {test_acc:.3f}")

We can try to fight overfitting using some of the previously discussed techniques (like dropout layers or L2 regularization). However, another powerful tool at our disposal is data augmentation.

Overfitting is caused by having too few samples to learn from, rendering you unable to train a model that can generalize to new data. Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that, at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data so it can generalize better.
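As a toy illustration of a random transformation, the sketch below applies a random horizontal flip to a tiny "image" stored as a nested list. It is a conceptual stand-in rather than the Keras implementation. The function name and the probability `p=0.5` are assumptions.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Return a copy of image, mirrored left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(0)  # seeded so the example is repeatable

# Each call may return a different variant of the same source image
variants = [random_horizontal_flip(image, rng) for _ in range(4)]
for variant in variants:
    print(variant)
```

Because the transformation is sampled again on every call, the same source image can yield different inputs from one epoch to the next.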

We'll create a data augmentation model that randomly flips the images horizontally, rotates them by at most $10\% \times 2\pi$ in either direction, and zooms by at most $20\%$. You can find a list of all available [image augmentation layers here](https://keras.io/api/layers/preprocessing_layers/image_augmentation/).
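The rotation factor is a fraction of a full turn, so it can be easier to reason about in degrees. The hypothetical helpers below do that arithmetic, assuming a single float factor is interpreted symmetrically (both directions for rotation, both in and out for zoom).

```python
import math


def rotation_range_degrees(factor):
    """Range of rotation angles, in degrees, for a factor given as a fraction of a full turn."""
    max_angle = factor * 360.0
    return (-max_angle, max_angle)


def rotation_range_radians(factor):
    """Same range expressed in radians (factor * 2 * pi)."""
    max_angle = factor * 2.0 * math.pi
    return (-max_angle, max_angle)


def zoom_range_percent(factor):
    """Range of zoom amounts, as percentages, for a symmetric zoom factor."""
    return (-factor * 100.0, factor * 100.0)


low, high = rotation_range_degrees(0.1)
print(f"rotation: {low:.0f} to {high:.0f} degrees")

low, high = zoom_range_percent(0.2)
print(f"zoom: {low:.0f}% to {high:.0f}%")
```

A factor of 0.1 therefore corresponds to rotations of up to 36 degrees in either direction.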

In [None]:
data_augmentation = keras.Sequential([layers.RandomFlip("horizontal"),
                                      layers.RandomRotation(0.1),
                                      layers.RandomZoom(0.2)], name='data_augmentation')

In [None]:
plt.figure(figsize=(10, 10))
for images, _ in train_dataset.take(1):
    for i in range(9):
        # Each call samples new random transformations for the whole batch
        augmented_images = data_augmentation(images, training=True)
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(augmented_images[0].numpy().astype("uint8"))
        plt.axis("off")

We can now add these augmentation layers to our previous model and try again. Overfitting should now be significantly lower, so we can train for an extended number of epochs.

In [None]:
model = keras.Sequential([keras.Input(shape=(180, 180, 3)),
                          data_augmentation,
                          layers.Rescaling(1./255),
                          layers.Conv2D(filters=32, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=64, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=128, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=256, kernel_size=3, activation="relu"),
                          layers.MaxPooling2D(pool_size=2),
                          layers.Conv2D(filters=256, kernel_size=3, activation="relu"),
                          layers.Flatten(),
                          layers.Dense(1, activation="sigmoid")])

model.summary()

In [None]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

**Notice** our `ModelCheckpoint` callback will, once again, save the best model seen so far. This will allow us to roll back to a previous version if we notice overfitting, or to pick up where we left off in a future session.

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint(
    filepath="convnet_from_scratch_with_augmentation.keras",
    save_best_only=True, monitor="val_loss")]

history = model.fit(train_dataset, epochs=30,
                    validation_data=validation_dataset, callbacks=callbacks)

In [None]:
plt.figure(figsize=(16, 4))

plt.subplot(121)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(122)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Training and validation scores should now stay closer together, which suggests that augmentation has reduced overfitting. Similarly, the test score should now be higher than before.

In [None]:
test_model = keras.models.load_model(
    "convnet_from_scratch_with_augmentation.keras")
test_loss, test_acc = test_model.evaluate(test_dataset)
print(f"Test accuracy: {test_acc:.3f}")

## A Note on Data Augmentation with ImageDataGenerator

In the past, the Keras `ImageDataGenerator` class was the suggested method for generating batches of tensor image data with real-time data augmentation. However, as of TF 2.9.0, `ImageDataGenerator` has been marked as deprecated, so using it in new code is not advisable. The cell below is kept only as a reference for reading older code.

In [None]:
datagen = keras.preprocessing.image.ImageDataGenerator(rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    validation_split=0.2)

train_generator = datagen.flow_from_directory(new_base_dir / "train",
                            target_size=(180, 180), batch_size=32)

for X_batch, y_batch in train_generator:
    for i in range(0, 6):
        plt.subplot(2,3,i+1)
        plt.imshow(X_batch[i]/255)
        plt.axis('off')
    break

*Parts of this tutorial have been adapted from [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) by Aurélien Géron.*