# Experiment on Data Augmentation
In the previous notebook, you performed image classification on the small **Dogs vs Cats** dataset. Because we used only a small subset of the dataset (2,000 images for training, 1,000 for validation, and 1,000 for testing), we observed significant overfitting.

Now, in this example, we will address the overfitting problem with **data augmentation**.


## Setting-Up 1: Mount Google Drive to the notebook
You can easily load data from Google Drive by mounting it to the notebook with the following code.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Setting-Up 2: One click to enable FREE GPU
Don't forget to enable GPU in your Colab notebook before training your model.

In Google Colab, it is very easy to do so.

From the menu bar, click: Runtime ⇨ Change runtime type

Choose 'GPU' under Hardware accelerator.
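
After the runtime restarts, a quick check such as the sketch below can confirm that TensorFlow actually sees the GPU. It assumes TensorFlow is installed, which is the default in Colab.

```python
import tensorflow as tf

# List the GPUs visible to TensorFlow; an empty list means training will run on the CPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"GPU available: {[gpu.name for gpu in gpus]}")
else:
    print("No GPU found. Check Runtime > Change runtime type.")
```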

## Dataset
Before you start this notebook, make sure the small dataset `dogs-vs-cats-small/`, which was generated in `dogs_vs_cats.ipynb`, is saved in your Google Drive. We will not repeat the "Downloading data ==> Creating small dataset" process in this notebook.

For efficient deep learning training in Google Colab, it is strongly recommended to copy datasets from Google Drive to Colab’s local storage (`/content`) before training.
The benefits of using a local directory are:
- Faster data access → Avoids slow I/O through the Google Drive mount.
- Better GPU performance → Reduces bottlenecks in loading images/batches.
- More stable training → Avoids interruptions caused by disconnections from Google Drive.

You can first copy the dataset to the local directory with `rsync` (recommended) or `cp` (slower):

`!rsync -avh "/content/drive/MyDrive/dataset/" "/content/dataset/"`

Please note: The `/content` directory is temporary and will be deleted when the session resets.
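
If you prefer to stay in Python rather than call a shell command, the same idea can be sketched with the standard library. The helper name `copy_dataset` is hypothetical, and the paths in the usage comment are the placeholder paths from the `rsync` example above.

```python
import os
import shutil

def copy_dataset(src_dir, dst_dir):
    """Copy src_dir to dst_dir unless dst_dir already exists; return True if a copy was made."""
    if os.path.exists(dst_dir):
        return False  # already copied during this session
    shutil.copytree(src_dir, dst_dir)  # copies all subfolders and files
    return True

# Placeholder paths; replace the first one with the dataset path in your Drive.
# copy_dataset("/content/drive/MyDrive/dataset", "/content/dataset")
```

Unlike `rsync`, this sketch does not resume a partial copy; it simply skips the work when the destination folder is already present.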

In [None]:
import os
import shutil

# Colab's local storage to store the dataset for faster training
local_dataset_dir = "/content/dogs-vs-cats-small"
if not os.path.exists(local_dataset_dir): # always check if the directory exists
  # Copy the dataset (preserves all subfolders and files).
  # Change the first directory to the dataset path in your Drive.
  print("Copying dataset to Colab's local storage...")
  !rsync -avh "/content/drive/MyDrive/Colab Notebooks/data/dogs-vs-cats-small/" "/content/dogs-vs-cats-small/"
  print("Copy done.")

train_dir = os.path.join(local_dataset_dir,'train')
val_dir   = os.path.join(local_dataset_dir,'val')
test_dir  = os.path.join(local_dataset_dir,'test')
if not os.path.exists(train_dir):
  print(train_dir +' does not exist.')
if not os.path.exists(val_dir):
  print(val_dir +' does not exist.')
if not os.path.exists(test_dir):
  print(test_dir +' does not exist.')

n_train_per_class = 1000
n_val_per_class = 500
n_test_per_class = 500
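
The existence checks above do not confirm that the copy is complete. A minimal sketch of a per-class file count follows; it assumes each split folder contains one subfolder per class (for example one for cats and one for dogs), which is not shown in this notebook, and the helper name `count_images_per_class` is hypothetical.

```python
import os

def count_images_per_class(split_dir):
    """Return {class_name: number_of_files} for each class subfolder of split_dir."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = len(os.listdir(class_dir))
    return counts

# Example usage with the variables defined above (uncomment in the notebook):
# for name, split_dir, expected in [("train", train_dir, n_train_per_class),
#                                   ("val", val_dir, n_val_per_class),
#                                   ("test", test_dir, n_test_per_class)]:
#     print(name, count_images_per_class(split_dir), "expected per class:", expected)
```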

## STEP 1: Data preprocessing
Now we need to do some preprocessing before feeding the data into the network.

Roughly, the preprocessing consists of the following steps:
1. Read the image files and decode them into RGB grids of pixels.
2. Rescale the pixel values (integers between 0 and 255) to the [0, 1] interval to improve the training stability of the network.
3. Apply data augmentation.

There are two ways to apply augmentation to training images using the random transform layers from Keras `layers` (for example `RandomFlip` and `RandomRotation`).

**Option 1**: Make the preprocessing layers part of your model. This adds the augmentation layers directly to the model as its first layers.

**Option 2**: Apply the preprocessing layers to your dataset. This applies data augmentation to the entire training set using `Dataset.map`.

You may refer to the following link for both options: https://www.tensorflow.org/tutorials/images/data_augmentation

Note: In either option, no data augmentation is applied to the validation or test samples.
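
The two options can be sketched as follows. This is an illustration rather than the required solution: the particular transforms and their factors, the 180×180 image size, and the random tensors standing in for real images are all assumptions. In the notebook you would load the real data instead, for example with `keras.utils.image_dataset_from_directory(train_dir, ...)`.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Random transforms; the specific layers and factors are illustrative choices.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.2),
])

# Stand-in for real data: 8 random RGB images of 180x180 with values in [0, 255].
images = tf.random.uniform((8, 180, 180, 3), maxval=255.0)
labels = tf.constant([0, 1, 0, 1, 0, 1, 0, 1])

# Option 1: use the augmentation block as the first layers of the model.
# Keras applies these random layers only during training, not at inference.
inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = layers.Rescaling(1.0 / 255)(x)  # rescale pixel values to [0, 1]
# ... the convolutional layers would follow here ...

# Option 2: apply augmentation to the training dataset only, using Dataset.map.
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)
augmented_train_ds = train_ds.map(
    lambda img, lbl: (data_augmentation(img, training=True), lbl),
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)

for batch_images, batch_labels in augmented_train_ds.take(1):
    print(batch_images.shape, batch_labels.shape)
```

With Option 2, the validation and test datasets are simply left unmapped, so they pass through without augmentation.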

In [None]:
#
# Add your code here
#

## STEP 2: Build the CNN network
Use the same small convnet as in `dogs_vs_cats.ipynb`. **Don't add any other regularization** techniques (such as dropout), so that any performance change in this experiment can be attributed purely to data augmentation.
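
The architecture from `dogs_vs_cats.ipynb` is not reproduced in this notebook, so the sketch below uses a stand-in convnet purely to show where the augmentation layers go under Option 1. The layer sizes, the 180×180 input shape, and the helper name `build_model` are assumptions; substitute your own layers from the previous notebook.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(180, 180, 3)):
    """Stand-in small convnet with augmentation as its first layers and no dropout."""
    inputs = keras.Input(shape=input_shape)
    # Augmentation layers come first and are active only during training.
    x = layers.RandomFlip("horizontal")(inputs)
    x = layers.RandomRotation(0.1)(x)
    x = layers.RandomZoom(0.2)(x)
    # Rescale pixel values from [0, 255] to [0, 1].
    x = layers.Rescaling(1.0 / 255)(x)
    # Placeholder convolutional stack; replace with the convnet from the earlier notebook.
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    # A single sigmoid unit outputs the probability of the positive class.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)

model = build_model()
```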


In [None]:
#
# Add your code here
#
model.summary()
from tensorflow.keras.utils import plot_model
plot_model(model, show_shapes=True, dpi=100) # visualize the CNN architecture

## STEP 3: Compile the model
The typical loss function for a binary classification problem is the binary cross-entropy loss function.
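
To make the loss concrete, the sketch below computes binary cross-entropy by hand for four made-up predictions, compares it with Keras's built-in loss, and then compiles a tiny placeholder model. The `rmsprop` optimizer and the name `demo_model` are assumptions for illustration; in the notebook you would call `compile` on your own `model`.

```python
import math
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([[1.0], [0.0], [1.0], [0.0]], dtype="float32")
y_pred = np.array([[0.9], [0.2], [0.6], [0.4]], dtype="float32")

# Binary cross-entropy by hand: mean of -(y*log(p) + (1-y)*log(1-p)).
manual_bce = sum(
    -(t * math.log(p) + (1 - t) * math.log(1 - p))
    for t, p in zip(y_true.ravel(), y_pred.ravel())
) / len(y_true)

keras_bce = float(keras.losses.BinaryCrossentropy()(y_true, y_pred))
print(f"manual: {manual_bce:.4f}, keras: {keras_bce:.4f}")

# Tiny placeholder model, only so that compile() has something to act on.
demo_model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(1, activation="sigmoid"),
])
demo_model.compile(loss="binary_crossentropy",
                   optimizer="rmsprop",
                   metrics=["accuracy"])
```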

In [None]:
#
# Add your code here
#

## STEP 4: Train the model and draw learning curves
Let's train the model. You may need more epochs in this training, say `epochs=120`.

It is good practice to always save your model after training with `model.save(model_folder + '/dogs_cats_small_data_augment.keras')`.
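
A minimal sketch of the train-then-save pattern follows. To keep it runnable anywhere, it uses random arrays as stand-in data, a tiny placeholder model named `demo_model`, only 2 epochs, and a temporary directory as `model_folder`; all of these are assumptions. In the notebook you would fit your own `model` on the real training and validation datasets.

```python
import os
import tempfile
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data so the sketch runs anywhere; use your real datasets instead.
rng = np.random.default_rng(0)
x_train = rng.random((16, 32, 32, 3)).astype("float32")
y_train = rng.integers(0, 2, size=(16, 1)).astype("float32")
x_val = rng.random((8, 32, 32, 3)).astype("float32")
y_val = rng.integers(0, 2, size=(8, 1)).astype("float32")

# Tiny placeholder model, not the convnet from this exercise.
demo_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
demo_model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# The notebook suggests epochs=120 for the real run; 2 keeps this demo fast.
history = demo_model.fit(x_train, y_train,
                         validation_data=(x_val, y_val),
                         epochs=2, batch_size=8, verbose=0)

# Save in the native .keras format; model_folder is a temporary directory here.
model_folder = tempfile.mkdtemp()
model_path = os.path.join(model_folder, "dogs_cats_small_data_augment.keras")
demo_model.save(model_path)

# history.history maps each metric name to a list with one value per epoch.
print(sorted(history.history.keys()))
```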

After training, also plot the loss and accuracy of the model over the training and validation sets.
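
One possible way to draw the learning curves is sketched below. The helper name `plot_learning_curves` is hypothetical, and the numbers in `example_history` are made up so that the block runs on its own; in the notebook you would pass `history.history` returned by `model.fit`.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(history_dict):
    """Plot training vs. validation loss and accuracy side by side; return the figure."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))

    ax_loss.plot(epochs, history_dict["loss"], "bo-", label="Training loss")
    ax_loss.plot(epochs, history_dict["val_loss"], "r^-", label="Validation loss")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    ax_acc.plot(epochs, history_dict["accuracy"], "bo-", label="Training accuracy")
    ax_acc.plot(epochs, history_dict["val_accuracy"], "r^-", label="Validation accuracy")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()
    return fig

# Made-up values for illustration only; they are not results from this experiment.
example_history = {
    "loss": [0.69, 0.62, 0.55],
    "val_loss": [0.68, 0.64, 0.60],
    "accuracy": [0.52, 0.65, 0.72],
    "val_accuracy": [0.55, 0.63, 0.68],
}
fig = plot_learning_curves(example_history)
```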

In [None]:
#
# Add your code here
#

**Q:** According to your learning curve, is the overfitting problem solved with data augmentation? Is there any accuracy improvement?  
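
To back up your answer with numbers, a small helper like the sketch below can report the best validation epoch and the gap between training and validation accuracy at that epoch. A large positive gap suggests overfitting. The helper name `summarize_history` is hypothetical and the example values are made up; pass `history.history` in the notebook.

```python
def summarize_history(history_dict):
    """Return the best validation epoch and the train/validation accuracy gap at that epoch."""
    val_acc = history_dict["val_accuracy"]
    # Index of the epoch with the highest validation accuracy (first one on ties).
    best = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {
        "best_epoch": best + 1,  # epochs are reported 1-based
        "val_accuracy": val_acc[best],
        "accuracy_gap": history_dict["accuracy"][best] - val_acc[best],
    }

# Made-up values for illustration only; they are not results from this experiment.
summary = summarize_history({
    "accuracy": [0.60, 0.70, 0.80],
    "val_accuracy": [0.58, 0.69, 0.66],
})
print(summary)
```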

**Optional:** After successfully completing data augmentation, you can fine-tune the network’s parameters or incorporate additional regularization techniques, such as dropout, early stopping, weight regularization, or batch normalization, to further improve accuracy.


