# Homework 2, *part 2*
### (60 points total)

In this part, you will build a convolutional neural network (CNN) to solve (yet another) image classification problem: the Tiny ImageNet dataset (200 classes, 100K training images, 10K validation images). Try to achieve as high accuracy as possible.

**Unlike part 1**, you are now free to use the full power of PyTorch and its subpackages.

## Deliverables

* This file.
* A "checkpoint file" `"checkpoint.pth"` that contains your CNN's weights (you get them from `model.state_dict()`). Obtain it with `torch.save(..., "checkpoint.pth")`. When grading, we will load it to evaluate your accuracy.
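
A minimal sketch of the checkpoint round trip described in the list above. The tiny `torch.nn.Linear` model and the temporary directory are placeholders for illustration only; in the notebook the model would be your CNN and the path `"checkpoint.pth"`.

```python
import os
import tempfile

import torch

# Placeholder model, standing in for the actual CNN.
model = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")

    # Save only the weights (the state dict), not the whole module object.
    torch.save(model.state_dict(), path)

    # Load into a freshly constructed model of the same architecture.
    restored = torch.nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

# Both models should now hold identical parameters.
weights_match = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

Loading with `map_location="cpu"` lets the checkpoint be read on a machine without a GPU, even if it was saved from one.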

**Should you decide to put your `"checkpoint.pth"` on Google Drive, update (edit) the following cell with the link to it:**

### [Dear TAs, I've put my "checkpoint.pth" on Google Drive, download it here](https://drive.google.com/open?id=18dh9YnzftE950KKDyXAwGA7cS0XLHsNt)

## Grading

* 9 points for reproducible training code and a filled report below.
* 11 points for building a network that gets above 25% accuracy.
* 4 points for using an **interactive** (please don't reinvent the wheel with `plt.plot`) tool for viewing progress, for example Tensorboard ([with this library](https://github.com/lanpa/tensorboardX) and [an extra hack for Colab](https://stackoverflow.com/a/57791702)). In this notebook, insert screenshots of accuracy and loss plots (training and validation) over iterations/epochs/time.
* 6 points for beating each of these accuracy milestones on the private **test** set:
  * 30%
  * 34%
  * 38%
  * 42%
  * 46%
  * 50%
  
*Private test set* means that you won't be able to evaluate your model on it. Rather, after you submit code and checkpoint, we will load your model and evaluate it on that test set ourselves, reporting your accuracy in a comment to the grade.
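
The point arithmetic of the grading list can be sketched as follows. `total_points` is a hypothetical helper, not part of the assignment. The text does not say on which split the 25% threshold is checked, so this sketch assumes one accuracy value for both that threshold and the milestones.

```python
# Accuracy milestones (in percent) on the private test set, 6 points each.
MILESTONES = (30, 34, 38, 42, 46, 50)


def total_points(accuracy, has_report=True, has_plots=True):
    """Hypothetical helper: add up the items of the grading list."""
    points = 0
    points += 9 if has_report else 0     # reproducible code and filled report
    points += 11 if accuracy > 25 else 0  # network above 25% accuracy
    points += 4 if has_plots else 0      # interactive progress plots
    # "Beating" a milestone is read here as strictly exceeding it.
    points += 6 * sum(accuracy > milestone for milestone in MILESTONES)
    return points


print(total_points(51.0))  # every item earned
print(total_points(40.0))  # three milestones beaten
```

With every item earned the sum is 9 + 11 + 4 + 6 × 6 = 60, which matches the total stated at the top.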

Note that there is an important formatting requirement, see below near "`DO_TRAIN = True`".

## Restrictions

* No pretrained networks.
* Don't enlarge images (e.g. don't resize them to $224 \times 224$ or $256 \times 256$).

## Tips

* **One change at a time**: never test several new things at once (unless you are super confident). Train a model, introduce one change, train again.
* Google a lot: try to reinvent as few wheels as possible (unlike in part 1 of this assignment).
* Use GPU.
* Use regularization: L2, batch normalization, dropout, data augmentation...
* Pay much attention to accuracy and loss graphs (e.g. in Tensorboard). Track failures early, stop bad experiments early.

In [1]:
# Detect if we are in Google Colaboratory
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

from pathlib import Path
# Determine the locations of auxiliary libraries and datasets.
# `AUX_DATA_ROOT` is where 'notmnist.py', 'animation.py' and 'tiny-imagenet-2020.zip' are.
if IN_COLAB:
    google.colab.drive.mount("/content/drive")
    
    # Change this if you created the shortcut in a different location
    AUX_DATA_ROOT = Path("/content/drive/My Drive/Deep Learning 2020 -- Home Assignment 2")
    
    assert AUX_DATA_ROOT.is_dir(), "Have you forgotten to 'Add a shortcut to Drive'?"
else:
    AUX_DATA_ROOT = Path(".")

The cell below puts training and validation images in `./tiny-imagenet-200/train` and `./tiny-imagenet-200/val`:

In [2]:
# Extract the dataset into the current directory
if not Path("tiny-imagenet-200/train/class_000/00000.jpg").is_file():
    import zipfile
    with zipfile.ZipFile(AUX_DATA_ROOT / 'tiny-imagenet-2020.zip', 'r') as archive:
        archive.extractall()

**You are required** to format your notebook cells so that `Run All` on a fresh notebook:
* trains your model from scratch, if `DO_TRAIN is True`;
* loads your trained model from `"./checkpoint.pth"`, then **computes** and prints its validation accuracy, if `DO_TRAIN is False`.

In [3]:
DO_TRAIN = True

## Train the model

In [4]:
import datetime
from IPython import display
import os
import matplotlib.pyplot as plt

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import torchvision
from torchvision import transforms

from tqdm import tqdm

In [5]:
# set up device
use_cuda = torch.cuda.is_available()

print("torch", torch.__version__)
if use_cuda:
    device = torch.device("cuda")
    dtype = torch.cuda.FloatTensor
    print("Using GPU")
else:
    dtype = torch.FloatTensor
    device = torch.device("cpu")
    print("Not using GPU")
    
# load data
transform = {
    'train': transforms.Compose(
        [
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(30),
            transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.05, hue=0.05),
            transforms.ToTensor(),
        ]),
    'test': transforms.Compose(
        [
            transforms.ToTensor(),
        ])
}

train_dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/train', transform=transform['train'])
test_dataset  = torchvision.datasets.ImageFolder('tiny-imagenet-200/val', transform=transform['test'])

class_names = train_dataset.classes
print(f'{len(train_dataset)} training images')
print(f'{len( test_dataset)} validation images')
image, label = train_dataset[50]

BATCH_SIZE = 512
NUM_WORKERS = 10

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, 
                              shuffle=True, pin_memory=True)

test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, 
                            shuffle=False, pin_memory=True)

print(f"Train/test dataloaders have {len(train_dataloader)} and {len(test_dataloader)} batches")

torch 1.5.0
Using GPU
100000 training images
10000 validation images
Train/test dataloaders have 196 and 20 batches
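
The batch counts printed above follow directly from the dataset sizes and `BATCH_SIZE`. A small pure-Python check, using the 400 epochs passed to `train` later in the notebook:

```python
import math

BATCH_SIZE = 512
N_TRAIN, N_VAL = 100_000, 10_000
N_EPOCHS = 400

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches is the ceiling of dataset_size / batch_size.
train_batches = math.ceil(N_TRAIN / BATCH_SIZE)
val_batches = math.ceil(N_VAL / BATCH_SIZE)
print(train_batches, val_batches)

# Size of the final, incomplete batch in each loader.
last_train_batch = N_TRAIN % BATCH_SIZE
last_val_batch = N_VAL % BATCH_SIZE
print(last_train_batch, last_val_batch)

# Total number of optimizer steps over the whole run.
total_steps = train_batches * N_EPOCHS
print(total_steps)
```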


In [6]:
class TonyNet(torch.nn.Module):
    def __init__(self, vgg):
        super(TonyNet, self).__init__()
        # Keep the first 24 modules of VGG16's `features`:
        # 10 conv + ReLU pairs and 4 max-pools, ending with 512 channels.
        self.features = torch.nn.Sequential(*list(vgg.features)[:24])
        self.avg_pool = torch.nn.AdaptiveAvgPool2d(output_size=(7, 7))
        self.classifier = torch.nn.Sequential(torch.nn.Linear(7 * 7 * 512, 4096),
                                              torch.nn.ReLU(inplace=True),
                                              torch.nn.Dropout(),
                                              torch.nn.Linear(4096, 4096),
                                              torch.nn.ReLU(inplace=True),
                                              torch.nn.Dropout(),
                                              torch.nn.Linear(4096, 200))
        
        
    def forward(self, x):
        feats = self.features(x)
        pooled = self.avg_pool(feats)
        fltnd = torch.flatten(pooled, 1)
        return self.classifier(fltnd)
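
To see why the first `Linear` layer takes `7 * 7 * 512` inputs, here is a sketch that traces tensor shapes through the truncated feature extractor. The layer list is rebuilt by hand from the standard VGG16 layout so that the sketch needs only `torch`. The 64×64 input size is an assumption (the usual Tiny ImageNet resolution), while 32×32 corresponds to the `RandomCrop(32, ...)` used for training.

```python
import torch

# Channel layout of the first 24 modules of VGG16's `features`
# ("M" marks a 2x2 max-pool), rebuilt by hand for illustration.
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M"]

layers, in_channels = [], 3
for item in CFG:
    if item == "M":
        layers.append(torch.nn.MaxPool2d(kernel_size=2, stride=2))
    else:
        layers.append(torch.nn.Conv2d(in_channels, item, kernel_size=3, padding=1))
        layers.append(torch.nn.ReLU(inplace=True))
        in_channels = item
features = torch.nn.Sequential(*layers)
avg_pool = torch.nn.AdaptiveAvgPool2d(output_size=(7, 7))

shapes = {}
with torch.no_grad():
    for side in (32, 64):  # training crop size, assumed full image size
        feats = features(torch.zeros(1, 3, side, side))
        pooled = avg_pool(feats)
        shapes[side] = (tuple(feats.shape), tuple(pooled.shape))
        print(side, shapes[side])
```

Four max-pools shrink each side by a factor of 16, and the adaptive pooling then maps whatever is left onto a fixed 7×7 grid, which is why inputs of different sizes can share the same classifier.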

In [7]:
def train(model, train_dataloader, val_dataloader, opt, criterion, n_epochs=100, chckpnt_path='./checkpoint.pth'):
    date = datetime.datetime.now().strftime("%b-%d-%Y-%H:%M:%S")
    writer_train = SummaryWriter(f'runs/{date}/train')
    writer_test = SummaryWriter(f'runs/{date}/test')
    scheduler = ReduceLROnPlateau(opt, factor=0.5)
    best_acc = 0

    for i in range(n_epochs):
        model.train()
        correct, total = 0, 0
        for j, (images, labels) in enumerate(tqdm(train_dataloader)):
            probs = model(images.to(device))
            with torch.no_grad():
                labels = labels.to(device)
                predictions = probs.max(1)[1]

                total += len(labels)
                correct += (predictions == labels).sum().item()

            loss = criterion(probs, labels)
            writer_train.add_scalar('Loss', loss.item(), i * len(train_dataloader) + j)

            # Use the optimizer passed in as `opt`, not a global variable.
            opt.zero_grad()
            loss.backward()
            opt.step()
        train_acc = correct / total
        writer_train.add_scalar('Accuracy', train_acc, i)

        model.eval()
        correct, total = 0, 0
        for j, (images, labels) in enumerate(tqdm(val_dataloader)):
            with torch.no_grad():
                probs = model(images.to(device))
                labels = labels.to(device)
                predictions = probs.max(1)[1]
                total += len(labels)
                correct += (predictions == labels).sum().item()
                val_loss = criterion(probs, labels)
                # Scheduler disabled: with it the model stayed at random predictions (see P.P.S.).
                # scheduler.step(val_loss)
                writer_test.add_scalar('Loss', val_loss.item(),
                                       (i * len(val_dataloader) + j) * len(train_dataloader) / len(val_dataloader))
        val_acc = correct / total
        writer_test.add_scalar('Accuracy', val_acc, i)
        display.clear_output(True)
        print(f'Epoch number: {i}')
        print(f'Train accuracy: {train_acc}')
        print(f'Validation accuracy: {val_acc}')
        if val_acc > best_acc:
            torch.save(model.state_dict(), chckpnt_path)
            best_acc = val_acc
        
    return train_acc, val_acc
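
A note on the names in the loop above: the variable `probs` actually holds raw logits, which is what `CrossEntropyLoss` expects, and `probs.max(1)[1]` is the per-row argmax. A small self-contained sketch with made-up numbers (the tensors below are illustrative, not model outputs):

```python
import torch

# Made-up logits for a batch of 3 samples and 4 classes.
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0],
                       [0.1, 0.2, 3.0, -2.0],
                       [-1.0, 4.0, 0.0, 1.0]])
labels = torch.tensor([0, 2, 3])

# `max(1)[1]` returns the index of the largest logit in each row, like argmax.
predictions = logits.max(1)[1]
same_as_argmax = torch.equal(predictions, logits.argmax(dim=1))

# CrossEntropyLoss applies log-softmax internally, so it takes logits directly.
loss = torch.nn.CrossEntropyLoss()(logits, labels)
manual = -torch.log_softmax(logits, dim=1)[torch.arange(3), labels].mean()

# Top-1 accuracy, accumulated the same way as in the training loop.
accuracy = (predictions == labels).sum().item() / len(labels)
print(predictions.tolist(), same_as_argmax, accuracy)
```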

In [None]:
vgg16 = torchvision.models.vgg16()
if DO_TRAIN:
    # Your code here (train your model)
    # etc.
    tony_net = TonyNet(vgg16)
    tony_net.to(device)
    learning_rate = 3e-4
    optimizer = torch.optim.Adam(tony_net.parameters(), lr=learning_rate)
    criterion = torch.nn.CrossEntropyLoss()
    train_acc, test_acc = train(tony_net, train_dataloader, test_dataloader, optimizer, criterion, n_epochs=400)

 10%|█         | 20/196 [01:07<01:53,  1.55it/s] 

### I swear the code above works :). It just didn't keep the resulting accuracy in the output, but the model was saved and is tested below, giving the needed results. The training took ~6 hours, so I didn't rerun it just to get rid of this error message, which doesn't even spoil the workflow.

## Load and evaluate the model

In [None]:
# Your code here (load the model from "./checkpoint.pth")
# Please use `torch.load("checkpoint.pth", map_location='cpu')`
tony_net_loaded = TonyNet(vgg16)
tony_net_loaded.load_state_dict(torch.load('checkpoint.pth', map_location='cpu'))
tony_net_loaded.to(device)
tony_net_loaded.eval()
correct, total = 0, 0
# No loss is computed here: `criterion` only exists when DO_TRAIN is True.
with torch.no_grad():
    for images, labels in tqdm(test_dataloader):
        probs = tony_net_loaded(images.to(device))
        labels = labels.to(device)
        predictions = probs.argmax(dim=1)
        total += len(labels)
        correct += (predictions == labels).sum().item()

In [None]:
val_accuracy = 100 * correct / total # Your code here
assert 0 <= val_accuracy <= 100
print("Validation accuracy: %.2f%%" % val_accuracy)

### Accuracy scores (test is above train)
![Accuracy](https://drive.google.com/uc?id=1_-hWsllmYH6hH6zAMnlV3KeoaIe5MFuI)
### Losses (test is below train)
![Loss](https://drive.google.com/uc?id=1RQzzDWJgW3LTAAiWM9IrSsiax4czoZMF)

# Report

Below, please mention:

* A brief history of tweaks and improvements.
* Which network architectures have you tried? What is the final one and why?
* What is the training method (batch size, optimization algorithm, number of iterations, ...) and why?
* Which techniques have you tried to prevent overfitting? What were their effects? Which of them worked well?
* Any other insights you learned.

For example, start with:

"I have analyzed these and those conference papers/sources/blog posts. \
I tried this and that to adapt them to my problem. \
The conclusions this task taught me are ..."

From the lectures and some googling, I decided that I had three main options to try: ResNet-like architectures, DenseNet (in a sense, an improvement on ResNet) and VGG-like architectures. From my coursemates I heard a lot about their results with models with residual connections, and many of them reached good accuracy (~30%) within a few epochs (10 or even fewer). I was more interested in training something similar to VGG, because doing the same thing as everyone else was not interesting enough for me.

At first I just downloaded the model and tried to train it as is (changing only the last FC layer, because the number of classes is different here), but I quickly stopped these attempts. The main reason was that an epoch took 100+ seconds on the GPU and the accuracy didn't change much. Actually, it not only didn't change, it stayed near $1 / 200$, which means random predictions and no learning at all. So I built a model, TonyNet, which takes the first 16 layers of the `features` part of VGG16, and trained that. The motivation was to keep the structure of an architecture that is known to perform well on ImageNet, while slightly reducing the number of parameters to speed up training. From the very beginning I used several augmentation techniques (random cropping, horizontal flips, rotation and color jittering) and the Adam optimizer, because I knew that training would take long (mostly because of the absence of skip connections) and that I wouldn't be able to run every experiment by adding one trick at a time.

After several attempts I was still getting random predictions, so I decided to get rid of weight decay (I had used it from the beginning to avoid possible overfitting). Finally I managed to find a learning rate that actually made the accuracy increase; it turned out that the learning rate was the initial problem. Funnily enough, with an LR of `1e-3` the model doesn't learn at all, while `1e-4` does the job. After several experiments I chose an LR in between that still gave results (`3e-4` in the code above). Then I had the idea that 16 layers of convolutions and poolings would not be enough, so I took the first 24 layers of VGG and left everything else as is (adjusting the dimensions of the first and last FC layers accordingly, of course).
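
To put rough numbers on the parameter reduction mentioned above, here is a pure-Python count. It assumes the standard VGG16 layout (thirteen 3×3 convolutions, ten of which fall within the first 24 modules) and the classifier sizes from the `TonyNet` code; `conv_params` and `linear_params` are hypothetical helper names.

```python
# Output channels of the 3x3 convolutions in VGG16's `features`, in order.
VGG16_CONVS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]


def conv_params(channels, in_channels=3, kernel=3):
    """Weights plus biases of a chain of kernel x kernel convolutions."""
    total = 0
    for out_channels in channels:
        total += kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total


def linear_params(sizes):
    """Weights plus biases of a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


full_features = conv_params(VGG16_CONVS)         # all 13 convolutions
prefix_features = conv_params(VGG16_CONVS[:10])  # convs within the first 24 modules
classifier = linear_params([7 * 7 * 512, 4096, 4096, 200])

print(full_features, prefix_features, classifier)
```

Under these assumptions the truncation roughly halves the convolutional parameters, while the fully connected head still holds most of the weights.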

I took a batch size of 512. Just because. What kind of reasons should I have?.. It has to fit on the GPU and must not be so small that the gradients oscillate a lot. I optimized with Adam, because it is the standard (read "most popular") optimization algorithm. RMSProp would probably be a close alternative. About the number of iterations... I first trained for 200 epochs, but the model didn't overfit and kept learning, so I went for 400 and finally reached a plateau.

As for overfitting, I didn't need to handle it explicitly, because I didn't see any :). Actually, there are several reasons for that: I used data augmentation (that's why the train score is always lower than the test one) and I kept the dropout layers from VGG, which also counteract overfitting.

P.S. The task statement asks to refer to papers or blogs. As I said in the beginning, I was mainly guided by the lecture materials and some random googling of the best-performing architectures on Tiny ImageNet, so I don't see the need to cite anything here.

P.P.S. Ah, yeah, and you can see that I tried to use a scheduler to reduce the learning rate. I even left a model to train overnight with the scheduler, but it didn't learn anything and stayed at random predictions :c. I just didn't have time to set the scheduler up properly.
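
For reference, a minimal sketch of how `ReduceLROnPlateau` is usually driven: one `step` call per epoch with an epoch-level validation loss. This is not a diagnosis of the failed run. It is only a guess that the placement of the commented-out call, inside the validation batch loop, may have mattered, since it would step the scheduler once per batch. The dummy parameter, the constant loss and `patience=2` are illustrative assumptions (the library default for `patience` is 10).

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Dummy parameter and optimizer; only the learning-rate bookkeeping matters here.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=3e-4)
# patience=2 is an illustrative value, chosen to keep the example short.
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

lrs = []
for epoch in range(6):
    # Pretend the epoch-level validation loss has stopped improving.
    mean_val_loss = 1.0
    optimizer.step()               # stands in for one epoch of training
    scheduler.step(mean_val_loss)  # one scheduler step per epoch
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

In this sketch the learning rate is halved once the loss has failed to improve for more than `patience` consecutive epochs.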