
In [1]:
import os
import torch
import torchvision
import tarfile
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
import torchvision.transforms as tt
from torch.utils.data import random_split
from torchvision.utils import make_grid
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.rcParams['figure.facecolor'] = '#ffffff'

In [2]:
project_name='05b-cifar10-resnet'

In [3]:
from torchvision.datasets.utils import download_url


# Download the dataset
dataset_url = "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz"
download_url(dataset_url, '.')

# Extract from archive
with tarfile.open('./cifar10.tgz', 'r:gz') as tar:
    tar.extractall(path='./data')

# Look into the data directory
data_dir = './data/cifar10'
print(os.listdir(data_dir))
classes = os.listdir(data_dir + "/train")
print(classes)

Downloading https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz to ./cifar10.tgz


100%|██████████| 135107811/135107811 [00:02<00:00, 61743765.72it/s]


['test', 'train']
['ship', 'automobile', 'truck', 'dog', 'deer', 'bird', 'horse', 'frog', 'airplane', 'cat']


**Training set, validation set, and test set**

1. The **training set** is the largest portion of the dataset. It is used to train the machine learning model. During training, the model learns the underlying patterns and relationships in the data by adjusting its parameters to minimize error. The error is based on the differences between the predicted output and the actual output (ground truth).  

2. The **validation set** is used during the training phase to evaluate the model's performance and to **tune hyperparameters** (settings that are not learned by the model but are chosen before training starts).

  + **Prevent overfitting**: Overfitting occurs when a model learns the training data too well, including its noise and outliers, which negatively impacts its performance on new, unseen data. The validation set helps monitor this by providing an independent set of data that the model has not seen during training.

  + **Model selection**: By comparing the performance of different models or configurations on the validation set, we can choose the **best model architecture or set of hyperparameters**.

3. The **test set** is used to evaluate the final model's performance after training and hyperparameter tuning are complete. This set is used only once, to assess how well the **model generalizes to new, unseen data**. It provides an unbiased estimate of the model's real-world performance. The test set **should be kept separate and not used during the training or validation processes** to ensure an accurate measure of the model's predictive power.
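
As a rough illustration of the three-way split described above, here is a minimal sketch using `random_split`. The synthetic tensors, the 80/10/10 split sizes, and the seed are assumptions chosen for the example; they are not the split used for CIFAR10 in this notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for a labeled image dataset (assumption):
# 100 RGB images of 32x32 pixels with labels from 10 classes.
images = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

# Hypothetical split sizes: 80% train, 10% validation, 10% test.
train_size, val_size, test_size = 80, 10, 10

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(42)
train_ds, val_ds, test_ds = random_split(
    dataset, [train_size, val_size, test_size], generator=generator
)

print(len(train_ds), len(val_ds), len(test_ds))  # 80 10 10
```

Each returned subset holds a disjoint set of indices into the original dataset, so no sample appears in more than one split.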

**Note, however, that the following paragraph uses the test set as the validation set.**

We can create training and validation datasets using the `ImageFolder` class from `torchvision`. In addition to the `ToTensor` transform, we'll also apply some other transforms to the images. There are a few important changes we'll make while creating PyTorch datasets for training and validation:

1. **Use test set for validation**: Instead of setting aside a fraction (e.g. 10%) of the data from the training set for validation, we'll simply use the test set as our validation set. This gives us a little more data to train with. In general, once you have picked the best model architecture and hyperparameters using a fixed validation set, it is a good idea to retrain the same model on the entire dataset to give it a small final boost in performance.
2. **Channel-wise data normalization**: We will normalize the image tensors by subtracting the mean and dividing by the standard deviation of each channel. As a result, the data in each channel has a mean of 0 and a standard deviation of 1. Normalizing the data prevents the values from any one channel from disproportionately affecting the losses and gradients during training simply by having a higher or wider range of values than the others.

<img src="https://i.imgur.com/LYxXBVg.png" width="360">


3. **Randomized data augmentations**: We will apply randomly chosen transformations while loading images from the training dataset. Specifically, we will pad each image by 4 pixels, take a random crop of size 32 x 32 pixels, and then flip the image horizontally with a 50% probability. Since the transformations are applied randomly and dynamically each time a particular image is loaded, the model sees slightly different images in each epoch of training, which allows it to generalize better.

![data-augmentation](https://imgaug.readthedocs.io/en/latest/_images/cropandpad_percent.jpg)