[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AMLA-UBC/100-Exploring-the-World-of-Modern-Machine-Learning/blob/main/Applying_Data_Argumentation_Exercise.ipynb)

## Load and Preprocess [Fashion MNIST](https://www.kaggle.com/datasets/zalando-research/fashionmnist)

Convolutional layers in Keras expect a 4D array whose dimensions are, by default, the number of examples, the image height, the image width, and the number of channels. Fashion MNIST images load as 28x28 grayscale arrays without a channel axis, which is why we reshape the data to add one.
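
As a minimal sketch of that reshape, the NumPy snippet below uses a zero-filled placeholder array (not the real dataset) with the same per-image shape as Fashion MNIST and appends the trailing channel axis used by Keras's default channels-last layout. The array name `images` and the batch size of 5 are illustrative assumptions.

```python
import numpy as np

# Placeholder batch: 5 grayscale images of 28x28 pixels (zeros stand in for real data)
images = np.zeros((5, 28, 28), dtype="float32")

# Append a channel axis so the shape becomes (samples, height, width, channels)
images_4d = images.reshape(-1, 28, 28, 1)

# Equivalent alternative using np.newaxis
images_4d_alt = images[..., np.newaxis]

print(images.shape)     # (5, 28, 28)
print(images_4d.shape)  # (5, 28, 28, 1)
```

Both forms only add an axis of length 1, so no pixel values change.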

First, let's see how well our CNN performs without data augmentation, using the test set as our validation set.

In [1]:
!pip install -q tensorflow

In [None]:
from tensorflow import keras
import numpy as np

# Load the Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Scale the data to the range [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Reshape the data to be used in a Conv2D layer
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

# Build a simple CNN
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

# Compile the model
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# Train the model
model.fit(x_train, y_train, batch_size=32,
          epochs=3,
          validation_data=(x_test, y_test))

## ImageDataGenerator
`ImageDataGenerator` applies random augmentations to the training images on the fly. This effectively enlarges the dataset and makes the model more robust, which reduces the risk of overfitting.

1. `rotation_range` controls the degree of random rotation applied to the images. It is set to 30, meaning each image is rotated by a random angle between -30 and 30 degrees.

2. `width_shift_range` and `height_shift_range` dictate the horizontal and vertical random shifts applied to the images, respectively. With a value of 0.2, the images will be shifted randomly by a fraction that ranges from -0.2 to 0.2 of their total width or height.

3. `zoom_range` sets the extent of random zooming to be applied to the images. It is set at 0.2, meaning the images will undergo random zooming, either inward or outward, by a factor that lies between 0.8 and 1.2.

4. `horizontal_flip` controls the random horizontal flipping of images. With a value of True, the images will have a 50% chance of being flipped horizontally.

5. `fill_mode` defines the strategy for filling in any newly created pixels after image augmentation. The value is set to 'nearest', meaning any new pixels will be filled with the value of the closest existing pixel.
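
To make the parameter ranges above concrete, here is a hand-rolled NumPy sketch. It is not Keras's implementation: the helper names `sample_params` and `shift_nearest` are hypothetical, the rotation angle and zoom factor are only sampled (applying them would need interpolation), and only the shift with nearest-pixel filling and the horizontal flip are actually applied.

```python
import numpy as np

def sample_params(rng, height, width):
    """Draw one random set of augmentation parameters (illustrative helper)."""
    return {
        "angle_deg": rng.uniform(-30, 30),                  # rotation_range=30
        "dx": int(round(rng.uniform(-0.2, 0.2) * width)),   # width_shift_range=0.2
        "dy": int(round(rng.uniform(-0.2, 0.2) * height)),  # height_shift_range=0.2
        "zoom": rng.uniform(0.8, 1.2),                      # zoom_range=0.2
        "flip": bool(rng.random() < 0.5),                   # horizontal_flip=True
    }

def shift_nearest(img, dx, dy):
    """Shift a 2D image by (dx, dy) pixels, filling gaps with the nearest edge pixel."""
    h, w = img.shape
    pad_y, pad_x = abs(dy), abs(dx)
    padded = np.pad(img, ((pad_y, pad_y), (pad_x, pad_x)), mode="edge")
    # Positive dx moves content to the right, positive dy moves it down
    top, left = pad_y - dy, pad_x - dx
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
img = np.arange(16, dtype="float32").reshape(4, 4)  # tiny stand-in image

params = sample_params(rng, *img.shape)
augmented = shift_nearest(img, params["dx"], params["dy"])
if params["flip"]:
    augmented = augmented[:, ::-1]  # mirror left-right
```

Each call to `sample_params` yields a different combination, which is why the model rarely sees exactly the same image twice during training.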

In [None]:
# Load the Fashion MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Scale the data to the range [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Perform data augmentation
datagen = keras.preprocessing.image.ImageDataGenerator(
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

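# Add the grayscale channel dimension to both splits
x_train = x_train[..., np.newaxis]
x_test = x_test[..., np.newaxis]

# Fit the data generator on the training data
# (only needed for feature-wise statistics such as featurewise_center; harmless here)
datagen.fit(x_train)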

# Build a simple CNN
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(datagen.flow(x_train, y_train, batch_size=32),
          epochs=3,
          validation_data=(x_test, y_test))

## Questions

Which method achieved higher accuracy? Why might that be the case? Hint: review our model architecture and dataset.

## Practice

Let's use the STL10 dataset from `tensorflow_datasets`, which contains 96x96 colour images in 10 classes, with 500 labelled training images per class (5,000 in total). Your goal for this exercise is to apply data augmentation to the STL10 dataset by filling in the `...` placeholders, and to practise researching on your own and reading documentation.

An example of what you may want to google is "randomly rotating images in tensorflow". An example of what you may want to ask ChatGPT is "list the TensorFlow 2 functions and explanations that allow me to apply data augmentation".

In [10]:
!pip install -q tensorflow_datasets

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator

# Download the STL10 dataset, which may take a few minutes
train = tfds.load("stl10", split="train[:70%]", as_supervised=True)
test = tfds.load("stl10", split="train[70%:]", as_supervised=True)

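# Apply transformations to x and y, where x represents the training images and y represents the training labels
train = train.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train = train.map(lambda x, y: (..., y))
train = train.map(lambda x, y: (..., y))
train = train.batch(32).prefetch(tf.data.AUTOTUNE)

# Scale and batch the validation split the same way, without augmentation
test = test.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
test = test.batch(32).prefetch(tf.data.AUTOTUNE)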

# Use the pretrained ResNet50 model as our base
resnet = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=(96, 96, 3))
model = tf.keras.Sequential([
    resnet,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train,
          epochs=1,
          validation_data=test)

## Further Analysis

Using data augmentation on large, well-curated datasets such as Fashion MNIST is usually unnecessary. In addition, our basic CNN was too simple to learn to classify randomly augmented images within only three epochs.

Data augmentation is mainly needed when we have limited labelled data, either overall or for a particular class. When the number of samples per class in the training set is imbalanced, the model may not generalize well to new data, and augmentation can help by creating additional samples for the under-represented classes. [Google SafeSearch Mini V2](https://huggingface.co/FredZhang7/google-safesearch-mini-v2) shows an example of the effects of class imbalance. However, it's worth noting that open-source auto-labelling models are already available on HuggingFace to caption different types of images, videos, and audio; [BLIP](https://huggingface.co/docs/transformers/main/model_doc/blip) is one example.
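
The balancing idea can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names `extra_samples_needed` and `balance_with_flips` are hypothetical, the images are zero-filled placeholders, and a horizontal flip stands in for whatever mix of augmentations a real pipeline would use.

```python
import numpy as np

def extra_samples_needed(labels, num_classes):
    """Return, per class, how many extra samples would match the largest class."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts.max() - counts

def balance_with_flips(images, labels, num_classes, rng):
    """Oversample minority classes by appending horizontally flipped copies."""
    deficits = extra_samples_needed(labels, num_classes)
    new_images, new_labels = [images], [labels]
    for cls, deficit in enumerate(deficits):
        idx = np.flatnonzero(labels == cls)
        if deficit == 0 or idx.size == 0:
            continue  # class is already full, or has nothing to augment
        chosen = rng.choice(idx, size=deficit, replace=True)
        new_images.append(images[chosen][:, :, ::-1])  # flip along the width axis
        new_labels.append(np.full(deficit, cls, dtype=labels.dtype))
    return np.concatenate(new_images), np.concatenate(new_labels)

# Toy imbalanced dataset: 4 samples of class 0, 2 of class 1, 1 of class 2
labels = np.array([0, 0, 0, 0, 1, 1, 2])
images = np.zeros((7, 4, 4), dtype="float32")  # placeholder images

rng = np.random.default_rng(0)
balanced_images, balanced_labels = balance_with_flips(images, labels, 3, rng)
print(np.bincount(balanced_labels))  # [4 4 4]
```

A single flip yields at most one new variant per original image, so heavily under-represented classes would need several different random transformations to avoid near-duplicates.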

Data augmentation is also a powerful tool in object detection and segmentation tasks because the model needs to learn to detect objects in different orientations, scales, and positions.
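
One practical consequence for detection is that the labels must be transformed together with the image. The sketch below is an assumption-laden illustration, not a library API: the helper `flip_image_and_boxes` is hypothetical, and boxes are assumed to be `[x_min, y_min, x_max, y_max]` in pixel coordinates with an exclusive `x_max`.

```python
import numpy as np

def flip_image_and_boxes(image, boxes):
    """Flip an image left-right and mirror its bounding boxes to match."""
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    flipped_boxes = boxes.copy()
    # Mirroring swaps the roles of the left and right edges
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes

# Toy 8x10 image with one object occupying rows 2..5 and columns 1..3
image = np.zeros((8, 10), dtype="float32")
image[2:6, 1:4] = 1.0
boxes = np.array([[1, 2, 4, 6]])

flipped_image, flipped_boxes = flip_image_and_boxes(image, boxes)
print(flipped_boxes)  # [[6 2 9 6]]
```

The same principle applies to segmentation masks, which would be flipped with the same slicing as the image.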