# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve Tiny ImageNet image classification.

This notebook is intended as a sequel to seminar 3, please give it a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network that has as high __accuracy__ as you can push it.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it while you iterate.
 
## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 25% (50% points)
    * 30% (60% points)
    * 32.5% (70% points)
    * 35% (80% points)
    * 37.5% (90% points)
    * 40% (full points)
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 40%.
 * In other words, base milestones must be beaten without pre-trained nets (and such net must be present in the anytask atttachments). After that, you can use whatever you want.
* you __can't__ do anything with validation data apart from running the evaluation procedure. Please, split train images on train and validation parts

## Tips on what can be done:


 * __Network size__
   * MOAR neurons, 
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it could didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others - over 500.
   * Way to go: stop when validation score is 10 iterations past maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * Converge faster and sometimes reach better optima
     * It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch will do the rest
       * Can be done manually or like [this](https://discuss.pytorch.org/t/simple-l2-regularization/139/2).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take long without GPU. That's okay.
     * If you are CPU-only, we still recomment that you try a simple convolutional architecture
     * a perfect option is if you can set it up to run at nighttime and check it up at the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely an overkill.
     * __To reduce computation__ time by a factor in exchange for some accuracy drop, try using __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the default (stride=1) one.
 
   
### Data augmemntation
   * getting 5x as large dataset for free is a great 
     * Zoom-in+slice = move
     * Rotate+zoom(to remove black stripes)
     * Add Noize (gaussian or bernoulli)
   * Simple way to do that (if you have PIL/Image): 
     * ```from scipy.misc import imrotate,imresize```
     * and a few slicing
     * Other cool libraries: cv2, skimake, PIL/Pillow
   * A more advanced way is to use torchvision transforms:
    ```
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    trainset = torchvision.datasets.ImageFolder(root=path_to_tiny_imagenet, train=True, download=True, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

    ```
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
   


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import torch, torchvision
import torch.nn as nn
from torchvision import transforms
import torch.nn.functional as F
from torch.autograd import Variable
import PIL

torch.manual_seed(239)
np.random.seed(239)
torch.cuda.set_device(0)

In [3]:
transforms_simple = transform_train = transforms.Compose([
   torchvision.transforms.ToTensor(),
   transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

In [4]:
transform_train = transforms.Compose([
   torchvision.transforms.RandomHorizontalFlip(),
   torchvision.transforms.RandomRotation(20, resample=PIL.Image.BILINEAR),
   torchvision.transforms.ToTensor(),
   transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

In [5]:
from tiny_img import download_tinyImg200
data_path = '.'
download_tinyImg200(data_path)
dataset = torchvision.datasets.ImageFolder('tiny-imagenet-200/train', transform=transform_train)
print(len(dataset))
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [80000, 20000])

100000


In [6]:
# feel free to copypaste code from seminar03 as a basic template for training

In [7]:
batch_size = 50
train_batch_gen = torch.utils.data.DataLoader(train_dataset, 
                                              batch_size=batch_size,
                                              shuffle=True,
                                              num_workers=4)

In [8]:
val_batch_gen = torch.utils.data.DataLoader(val_dataset, 
                                              batch_size=batch_size,
                                              shuffle=True,
                                              num_workers=4)

In [9]:
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

In [10]:
def build_sequential_nn():
    fst_nn = nn.Sequential()

    fst_nn.add_module('conv1_1', nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn1_1', nn.BatchNorm2d(num_features=32))
    fst_nn.add_module('lrelu1_1', nn.LeakyReLU())

    fst_nn.add_module('conv1_2', nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, padding=1, bias=False))    
    fst_nn.add_module('bn1_2', nn.BatchNorm2d(num_features=32))
    fst_nn.add_module('lrelu1_2', nn.LeakyReLU())
    
    fst_nn.add_module('maxpool1', nn.MaxPool2d(2))
    fst_nn.add_module('dropout1', nn.Dropout(p=0.3))
    
    ########################################################################
    
    fst_nn.add_module('conv2_1', nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn2_1', nn.BatchNorm2d(num_features=64))
    fst_nn.add_module('lrelu2_1', nn.LeakyReLU())
    
    fst_nn.add_module('conv2_2', nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=2, dilation=2, bias=False))
    fst_nn.add_module('bn2_2', nn.BatchNorm2d(num_features=64))
    fst_nn.add_module('lrelu2_2', nn.LeakyReLU())
    
    fst_nn.add_module('maxpool2', nn.MaxPool2d(2))
    fst_nn.add_module('dropout2', nn.Dropout(p=0.3))
    
    ########################################################################
    
    fst_nn.add_module('conv3_1', nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn3_1', nn.BatchNorm2d(num_features=128))
    fst_nn.add_module('lrelu3_1', nn.LeakyReLU())
    
    fst_nn.add_module('conv3_2', nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn3_2', nn.BatchNorm2d(num_features=128))
    fst_nn.add_module('lrelu3_2', nn.LeakyReLU())
    
    fst_nn.add_module('maxpool3', nn.MaxPool2d(2))
    fst_nn.add_module('dropout3', nn.Dropout(p=0.3))
    
    ########################################################################
    
    fst_nn.add_module('conv4_1', nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn4_1', nn.BatchNorm2d(num_features=256))
    fst_nn.add_module('lrelu4_1', nn.LeakyReLU())
    
    fst_nn.add_module('conv4_2', nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, padding=1, bias=False))
    fst_nn.add_module('bn4_2', nn.BatchNorm2d(num_features=512))
    fst_nn.add_module('lrelu4_2', nn.LeakyReLU())
    
    fst_nn.add_module('dropout4', nn.Dropout(p=0.3))

    ########################################################################
    
    fst_nn.add_module('flatten', Flatten())
    fst_nn.add_module('dense1', nn.Linear(in_features=32768, out_features=2048))
    fst_nn.add_module('relu_dense', nn.ReLU())
    fst_nn.add_module('dense2', nn.Linear(in_features=2048, out_features=1024))
    fst_nn.add_module('relu_dense2', nn.ReLU())
    fst_nn.add_module('dropout_dense2', nn.Dropout(p=0.3))
    fst_nn.add_module('dense2_logits', nn.Linear(in_features=1024, out_features=200))
    
    return fst_nn

In [11]:
fst_nn = build_sequential_nn()

In [12]:
#fst_nn.load_state_dict(torch.load('fst_nn_v2.pt'))

In [13]:
def compute_loss(X_batch, y_batch, model):
    X_batch = Variable(torch.FloatTensor(X_batch)).cuda()
    y_batch = Variable(torch.LongTensor(y_batch)).cuda()
    logits = model.cuda()(X_batch)
    return F.cross_entropy(logits, y_batch).mean()

In [14]:
opt = torch.optim.Adam(fst_nn.parameters(), lr=1e-4)

train_loss = []
val_accuracy = []

In [15]:
import numpy as np
num_epochs = 50 # total amount of full passes over training data

import time
max_acc = 0.

for epoch in range(num_epochs):
    start_time = time.time()
    fst_nn.train(True) # enable dropout / batch_norm training behavior
    for (X_batch, y_batch) in train_batch_gen:
        # train on batch
        loss = compute_loss(X_batch, y_batch, fst_nn)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.cpu().data.numpy())
    
    fst_nn.train(False) # disable dropout / use averages for batch_norm
    for X_batch, y_batch in val_batch_gen:
        logits = fst_nn(Variable(torch.FloatTensor(X_batch)).cuda())
        y_pred = logits.max(1)[1].data
        val_accuracy.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))

    val_acc = np.mean(val_accuracy[-len(val_dataset) // batch_size :]) * 100
    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(train_dataset) // batch_size :])))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        val_acc))
    if val_acc > max_acc:
        torch.save(fst_nn.state_dict(), 'fst_nn_v22.pt')
        max_acc = val_acc
        print('Model saved')
        
    # lr decay
    if epoch + 1 % 10 == 0:
        for g in opt.param_groups:
            g['lr'] = max(g['lr']/5, 1e-7)

Epoch 1 of 50 took 138.362s
  training loss (in-iteration): 	4.756572
  validation accuracy: 			5.21 %
Model saved
Epoch 2 of 50 took 137.237s
  training loss (in-iteration): 	4.080317
  validation accuracy: 			8.02 %
Model saved
Epoch 3 of 50 took 137.226s
  training loss (in-iteration): 	3.743934
  validation accuracy: 			14.03 %
Model saved
Epoch 4 of 50 took 137.203s
  training loss (in-iteration): 	3.524559
  validation accuracy: 			15.46 %
Model saved
Epoch 5 of 50 took 137.250s
  training loss (in-iteration): 	3.354646
  validation accuracy: 			20.08 %
Model saved
Epoch 6 of 50 took 137.306s
  training loss (in-iteration): 	3.214243
  validation accuracy: 			21.54 %
Model saved
Epoch 7 of 50 took 137.594s
  training loss (in-iteration): 	3.093372
  validation accuracy: 			22.32 %
Model saved
Epoch 8 of 50 took 138.513s
  training loss (in-iteration): 	2.986943
  validation accuracy: 			25.78 %
Model saved
Epoch 9 of 50 took 137.611s
  training loss (in-iteration): 	2.881769
  va

When everything is done, please calculate accuracy on `tiny-imagenet-200/val`

In [16]:
test_data = torchvision.datasets.ImageFolder('tiny-imagenet-200/val_reordered/', transform=transforms_simple)

In [17]:
test_batch_gen = torch.utils.data.DataLoader(test_data, 
                                              batch_size=batch_size,
                                              shuffle=False,
                                              num_workers=4)

In [18]:
test_acc = []
test_net = build_sequential_nn()
test_net.load_state_dict(torch.load('fst_nn_v22.pt'))
test_net.eval()
test_net.cuda()
for X_batch, y_batch in test_batch_gen:
        logits = test_net(Variable(torch.FloatTensor(X_batch)).cuda())
        y_pred = logits.max(1)[1].data
        test_acc.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))

In [19]:
test_accuracy = np.mean(test_acc)
test_accuracy

0.35739999999999994

In [20]:
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

if test_accuracy * 100 > 40:
    print("Achievement unlocked: 110lvl Warlock!")
elif test_accuracy * 100 > 35:
    print("Achievement unlocked: 80lvl Warlock!")
elif test_accuracy * 100 > 30:
    print("Achievement unlocked: 70lvl Warlock!")
elif test_accuracy * 100 > 25:
    print("Achievement unlocked: 60lvl Warlock!")
else:
    print("We need more magic! Follow instructons below")

Final results:
  test accuracy:		35.74 %
Achievement unlocked: 80lvl Warlock!


In [22]:
train_acc = []
for X_batch, y_batch in train_batch_gen:
        logits = test_net(Variable(torch.FloatTensor(X_batch)).cuda())
        y_pred = logits.max(1)[1].data
        train_acc.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))
train_accuracy = np.mean(train_acc) * 100
train_accuracy

74.83125000000001

```

```

```

```

```

```


# Report

All creative approaches are highly welcome, but at the very least it would be great to mention
* the idea;
* brief history of tweaks and improvements;
* what is the final architecture and why?
* what is the training method and, again, why?
* Any regularizations and other techniques applied and their effects;


There is no need to write strict mathematical proofs (unless you want to).
 * "I tried this, this and this, and the second one turned out to be better. And i just didn't like the name of that one" - OK, but can be better
 * "I have analized these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
 * "I took that code that demo without understanding it, but i'll never confess that and instead i'll make up some pseudoscientific explaination" - __not_ok__

### Hi, my name is `Sergey Gorbatyuk`, and here's my story

A long time ago in a galaxy far far away, when it was still more than an hour before the deadline, i got an idea: actually just build a conv simple sequentioal conv net, and if i have more time, try to build resnet.

##### I gonna build a neural network, that
First replicates architecture form the seminar. Then I gonna try some things like more packs of conv-pool-relu-bn-dropout,
normalizing inputs and early stopping.

How could i be so naive?!

##### One day, with no signs of warning,
This thing has finally converged and
i got score like 0.5%. That gave me a thought that this thing learned like nothing, and after some time i realized that the problem is 
that default learning rate in Adam is too big to train this one. Moreover, I understood that 3 conv layers with other stuff is not enough 
cause receptive field is still very small. I trained it with less lr and more convs, and got significially better score like 25%.

##### Finally, after ~30  iterations, np.float('+inf') mugs of coffee
I got the final score of 35% on test set. Not the best one, but i believed that i would reach the desired 40% with convs only. Actually this day I also realized that the deadline was on past midnight, not on the coming one, and got upset even more:(

The final architecture was 4x2 packs of conv3-bn-LReLU separated by maxpool2-dropout. After flattening there were three dense layers with ReLU and dropouts. The training procedure was optimized by Adam optimizer and here I confess that I did not try another one, because I knew that Adam performs better in average, and did not want to spend precious time making experiments with that. I intermeshed lr decay manually, because for some reason it is quite hard to do in pytorch, and I did data augumentation.

That, having wasted ____ [minutes, hours or days] of my life training, got

* accuracy on training: 74,83%
* accuracy on validation: 38,67%
* accuracy on test: 35,74%

I regret that spent not enough time doing that, and had no chance to try resnet. But still, i am sending this one, but will try to do something better ;)