## Write and train a dog breed classifier from scratch

---

This notebook presents the results of experiments on training and improving a neural network from scratch. This part describes the first 25 experiments and reports their results.

These 25 experiments used two custom implementations of AlexNet and two models from torchvision. See the [models description](#models) for details.

Use the links below to navigate between the experiment descriptions:

* [Experiment 1](#exp1): model `Base`
* [Experiment 2](#exp2): model `Base` 
* [Experiment 3](#exp3): model `Base` 
* [Experiment 4](#exp4): model `Base` 
* [Experiment 5](#exp5): model `Base` 
* [Experiment 6](#exp6): model `Base` 
* [Experiment 7](#exp7): model `Base` 
* [Experiment 8](#exp8): model `Base` 
* [Experiment 9](#exp9): model `Base` 
* [Experiment 10](#exp10): model `Base` 
* [Experiment 11](#exp11): model `Base` 
* [Experiment 12](#exp12): model `Base` 
* [Experiment 13](#exp13): model `AlexNet` (from scratch)
* [Experiment 14](#exp14): model `AlexNet` (from scratch)
* [Experiment 15](#exp15): model `AlexNet` (from scratch), best result for this model from scratch
* [Experiment 16](#exp16): model `Base`, best result for this model
* [Experiment 17](#exp17): model `AlexNet` (transfer learning), best result for this model with transfer learning
* [Experiment 18](#exp18): identical to experiment 17
* [Experiment 19](#exp19): model `AlexNet` (transfer learning)
* [Experiment 20](#exp20): model `AlexNet` (transfer learning), best result for this model with transfer learning
* [Experiment 21](#exp21): model `vgg16` (from scratch), best result among all 25 experiments
* [Experiment 22](#exp22): model `vgg16` (transfer learning), failed to train
* [Experiment 23](#exp23): model `vgg16` (transfer learning), failed to train
* [Experiment 24](#exp24): model `vgg16` (transfer learning), failed to train
* [Experiment 25](#exp25): model `Base_1` (from scratch), best result for this model from scratch, but worse than model `Base`

---

<a id='models'></a>
## Models


1. `Base`: a custom AlexNet implementation following the [original paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), but without Local Response Normalization.

Model architecture:

In [23]:
from models.model_utils import init_model
model_base = init_model("Base", 133, pretrained=False)
print(model_base)

BasicCNN(
  (conv1): Conv2d(3, 48, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=4608, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


2. `Base_1`: a custom AlexNet implementation with wider convolutional layers (more filters).

Model architecture:

In [24]:
model_base_1 = init_model("Base_1", 133, pretrained=False)
print(model_base_1)

BasicCNN_v1(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 156, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(156, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6912, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
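
The `in_features` of `fc1` in both custom models can be checked by hand. The sketch below assumes a 224x224 input and that, as in AlexNet, the pooling layer is applied after `conv1`, `conv2`, and `conv5`. Neither detail is visible in the printed modules, so both are assumptions, and the helper names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pool layer (floor mode, dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size, last_channels, pool_kernel, pool_stride=2):
    """Trace the spatial size through five conv layers and three pools."""
    s = out_size(input_size, kernel=11, stride=4, padding=2)  # conv1
    s = out_size(s, pool_kernel, pool_stride)                 # pool (assumed position)
    s = out_size(s, kernel=5, padding=2)                      # conv2
    s = out_size(s, pool_kernel, pool_stride)                 # pool (assumed position)
    for _ in range(3):                                        # conv3, conv4, conv5
        s = out_size(s, kernel=3, padding=1)
    s = out_size(s, pool_kernel, pool_stride)                 # pool (assumed position)
    return last_channels * s * s


base = flattened_features(224, last_channels=128, pool_kernel=2)
base_1 = flattened_features(224, last_channels=192, pool_kernel=3)
print(base, base_1)
```

Under these assumptions both values agree with the printed `in_features` of `fc1` (4608 for `Base`, 6912 for `Base_1`).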


3. `AlexNet`: the AlexNet model from the torchvision library

Model architecture:

In [25]:
import torchvision.models as torch_models
model_alexnet = torch_models.alexnet()
print(model_alexnet)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

4. `vgg16`: the VGG16 model from the torchvision library

Model architecture:

In [26]:
import torchvision.models as torch_models
model_vgg16 = torch_models.vgg16()
print(model_vgg16)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

# Experiments description

For all experiments, the following parameters did not change:
* SGD used a momentum of 0.9
* the fully connected layers of `Base` and `Base_1` used dropout with p=0.5
* the learning rate scheduler used the same parameters in all experiments:
  * scheduler patience = 3
  * scheduler factor = 0.5
  * scheduler cooldown = 2
* whenever Color Jitter was used, it had the following parameters:
  * brightness = 0.4
  * contrast = 0.4
  * saturation = 0.4
  * hue = 0.2
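
The patience, factor, and cooldown settings suggest a plateau-based scheduler such as PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`, although the text does not name it. The following is a simplified pure-Python sketch of that rule, assuming the monitored quantity is the validation loss. The class name `PlateauSchedule` and the loss values are hypothetical.

```python
class PlateauSchedule:
    """Multiply the learning rate by `factor` when the loss stops improving.

    Simplified: any strict decrease counts as an improvement, whereas the
    PyTorch scheduler also applies a small improvement threshold.
    """

    def __init__(self, lr, factor=0.5, patience=3, cooldown=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.cooldown = cooldown
        self.best = float("inf")
        self.bad_epochs = 0
        self.cooldown_left = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.cooldown_left > 0:
            # Epochs inside the cooldown window never count as bad.
            self.cooldown_left -= 1
            self.bad_epochs = 0
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.cooldown_left = self.cooldown
            self.bad_epochs = 0
        return self.lr


schedule = PlateauSchedule(lr=0.01)
losses = [4.8, 4.7] + [4.7] * 10  # made-up validation losses
history = [schedule.step(loss) for loss in losses]
print(history)
```

With a flat loss, the first reduction happens once more than three consecutive epochs show no improvement, and the cooldown delays the next reduction by two further epochs.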

<a id='exp1'></a>
## Experiment 1

Details: 
* model: `Base`
* batch size: 32
* early stopping: 10
* lr: 0.01
* augmentation: no
* optimizer: SGD
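
The early stopping value is presumably a patience: training stops once the validation loss has not improved for 10 consecutive epochs, and the best epoch is the one reported in the results. A minimal sketch under that assumption, with a hypothetical function name and made-up losses:

```python
def best_epoch_with_early_stopping(val_losses, patience=10):
    """Return (best_epoch, stop_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses)


val_losses = [5.0, 4.8, 4.6] + [4.7] * 20  # made-up validation losses
best, stopped = best_epoch_with_early_stopping(val_losses)
print(best, stopped)
```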

Results:
* best epoch: 40
* test loss: 4.56
* test accuracy: 3.35% (28/836)
* train time (minutes): 28.1
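
To put these numbers in context, it helps to compare them with chance level. With 133 classes (the output size of `fc3`) and assuming roughly balanced classes and a natural-log cross-entropy loss, a uniform guess gives an accuracy of 1/133 and a loss of ln(133). The sketch below computes both, along with the accuracy from the reported counts.

```python
import math

NUM_CLASSES = 133          # output size of the final layer
correct, total = 28, 836   # counts reported for experiment 1

chance_accuracy = 100.0 / NUM_CLASSES   # percent
chance_loss = math.log(NUM_CLASSES)     # cross-entropy of a uniform prediction
accuracy = 100.0 * correct / total      # percent

print(f"chance accuracy: {chance_accuracy:.2f}%")
print(f"chance loss:     {chance_loss:.2f}")
print(f"exp 1 accuracy:  {accuracy:.2f}%")
```

Under these assumptions, a test loss of 4.56 is only slightly below the uniform-guess value of about 4.89.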

<a id='exp2'></a>
## Experiment 2

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* batch size: 32 -> 64

Results:
* best epoch: 35
* test loss: 4.646664
* test accuracy: 1.79% (15/836)
* train time (minutes): 24.83

<a id='exp3'></a>
## Experiment 3

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* batch size: 32 -> 128

Results:
* best epoch: 38
* test loss: 4.660923
* test accuracy: 2.27% (19/836)
* train time (minutes): 15.72

<a id='exp4'></a>
## Experiment 4

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* batch size: 32 -> 16

Results:
* best epoch: 18
* test loss: 4.598885
* test accuracy: 2.15% (18/836)
* train time (minutes): 8.03

<a id='exp5'></a>
## Experiment 5

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* optimizer: SGD -> Adam

Results:
* best epoch: 7
* test loss: 4.907243
* test accuracy: 1.20% (10/836)
* train time (minutes): 4.65

<a id='exp6'></a>
## Experiment 6

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* optimizer: SGD -> Adam
* lr: 0.01 -> 0.00001

Results:
* best epoch: 61
* test loss: 4.586305
* test accuracy: 3.71% (31/836)
* train time (minutes): 41.85

<a id='exp7'></a>
## Experiment 7

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* optimizer: SGD -> Adam
* lr: 0.01 -> 0.00001
* batch size: 32 -> 256

Results:
* best epoch: 65
* test loss: 4.668120
* test accuracy: 3.35% (28/836)
* train time (minutes): 28.38

<a id='exp8'></a>
## Experiment 8

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* optimizer: SGD -> Adam
* batch size: 32 -> 256
* lr: 0.01 -> 0.001

Results:
* best epoch: 57
* test loss: 4.890260
* test accuracy: 0.72% (6/836)
* train time (minutes): 24.61

<a id='exp9'></a>
## Experiment 9

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* batch size: 32 -> 256
* lr: 0.01 -> 0.1

Results:
* best epoch: 65
* test loss: 4.598946
* test accuracy: 3.11% (26/836)
* train time (minutes): 27.72

<a id='exp10'></a>
## Experiment 10

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset

Results:
* best epoch: 41
* test loss: 4.554648
* test accuracy: 3.83% (32/836)
* train time (minutes): 27.55
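
Using the mean and std of the current training dataset presumably means computing per-channel statistics over the training images and normalizing every input with them, in place of whatever constants the parent experiment used. The pure-Python sketch below illustrates the idea on tiny made-up images; the function names and data are hypothetical.

```python
import math


def channel_stats(images):
    """Per-channel mean and population std over all pixels of all images.

    `images` is a list of images, each image is a list of pixels, and each
    pixel is an (r, g, b) tuple scaled to [0, 1].
    """
    n_channels = 3
    sums = [0.0] * n_channels
    sq_sums = [0.0] * n_channels
    count = 0
    for image in images:
        for pixel in image:
            for c in range(n_channels):
                sums[c] += pixel[c]
                sq_sums[c] += pixel[c] ** 2
            count += 1
    means = [s / count for s in sums]
    stds = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(sq_sums, means)
    ]
    return means, stds


def normalize(pixel, means, stds):
    """Shift and scale one pixel channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(pixel, means, stds))


# Two made-up 2-pixel "images" standing in for the training set.
train_images = [
    [(0.0, 0.2, 0.4), (1.0, 0.4, 0.4)],
    [(0.0, 0.6, 0.8), (1.0, 0.8, 0.8)],
]
means, stds = channel_stats(train_images)
print(means, stds)
print(normalize((1.0, 0.5, 0.6), means, stds))
```

The statistics come from the training split only, and the same values are then applied to the validation and test images.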

<a id='exp11'></a>
## Experiment 11

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005

Results:
* best epoch: 49
* test loss: 4.550658
* test accuracy: 4.31% (36/836)
* train time (minutes): 18.43
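
Weight decay acts as an L2 penalty that shrinks each weight toward zero at every step. The sketch below applies the update rule that PyTorch's `torch.optim.SGD` documents for momentum and `weight_decay` (without dampening or Nesterov) to a single scalar weight, using the values from this experiment: lr 0.01, momentum 0.9, weight decay 0.0005. The function name and the starting weight are hypothetical.

```python
def sgd_step(weight, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay for a scalar weight."""
    grad = grad + weight_decay * weight      # weight decay folded into the gradient
    velocity = momentum * velocity + grad    # momentum buffer
    weight = weight - lr * velocity
    return weight, velocity


w, v = 1.0, 0.0
for _ in range(3):
    # A zero data gradient isolates the effect of weight decay.
    w, v = sgd_step(w, grad=0.0, velocity=v)
print(w)
```

Even with a zero data gradient the weight slowly decreases, which is the regularizing effect added in this experiment.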

<a id='exp12'></a>
## Experiment 12

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop

Results:
* best epoch: 98
* test loss: 4.404118
* test accuracy: 6.34% (53/836)
* train time (minutes): 35.08

<a id='exp13'></a>
## Experiment 13

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* model: Base -> AlexNet

Results:
* best epoch: 85
* test loss: 1.574913
* test accuracy: 62.44% (522/836)
* train time (minutes): 39.45

<a id='exp14'></a>
## Experiment 14

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* model: Base -> AlexNet

Results:
* best epoch: 17
* test loss: 3.64
* test accuracy: 13.88% (116/836)
* train time (minutes): 9.15

<a id='exp15'></a>
## Experiment 15

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet

Results:
* best epoch: 17
* test loss: 3.64
* test accuracy: 13.88% (116/836)
* train time (minutes): 9.15

<a id='exp16'></a>
## Experiment 16

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter

Results:
* best epoch: 144
* test loss: 4.382009
* test accuracy: 7.89% (66/836)
* train time (minutes): 51.76

<a id='exp17'></a>
## Experiment 17

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last FC layer


Results:
* best epoch: 42
* test loss: 1.062519
* test accuracy: 74.88% (626/836)
* train time (minutes): 17.32

<a id='exp18'></a>
## Experiment 18

Identical to [exp 17](#exp17).


<a id='exp19'></a>
## Experiment 19

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last 2 FC layers

Results:
* best epoch: 53
* test loss: 1.061143
* test accuracy: 72.13% (603/836)
* train time (minutes): 22.88

<a id='exp20'></a>
## Experiment 20

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last 3 FC layers

Results:
* best epoch: 42
* test loss: 1.062519
* test accuracy: 74.88% (626/836)
* train time (minutes): 18.58

<a id='exp21'></a>
## Experiment 21

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 (from scratch)

Results:
* best epoch: 110
* test loss: 0.992273
* test accuracy: 76.44% (639/836)
* train time (minutes): 248.18

<a id='exp22'></a>
## Experiment 22

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last FC layer

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 2.36

<a id='exp23'></a>
## Experiment 23

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last 2 FC layers

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 1.47

<a id='exp24'></a>
## Experiment 24

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last 3 FC layers

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 2.0

<a id='exp25'></a>
## Experiment 25

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* normalize with the mean and std of the current training dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> Base_1 (from scratch)

Results:
* best epoch: 124
* test loss: 4.401157
* test accuracy: 5.26% (44/836)
* train time (minutes): 47.38