## Write and train a dog breed classifier from scratch and using transfer learning

---

This notebook presents the results of experiments on training and improving a neural network from scratch. This part describes experiments 26-51 and their results.

In this series of experiments, the models were improved (in particular, bugs in the custom AlexNet and VGG implementations were fixed), and AlexNet batch norm and VGG batch norm (PyTorch's implementations) were trained both from scratch and with transfer learning. See the [model descriptions](#models) for details.

Use the links below to navigate between the experiment descriptions:

* [Experiment 26](#exp26): model `Base`
* [Experiment 27](#exp27): model `Base` 
* [Experiment 28](#exp28): model `Base` 
* [Experiment 29](#exp29): model `Base` 
* [Experiment 30](#exp30): model `Base_2`
* [Experiment 31](#exp31): model `Base_2`
* [Experiment 32](#exp32): model `Base_2`
* [Experiment 33](#exp33): model `AlexNet` (from scratch)
* [Experiment 34](#exp34): model `Base_fix`
* [Experiment 35](#exp35): model `Base_1_fix`
* [Experiment 36](#exp36): model `vgg16` (from scratch)
* [Experiment 37](#exp37): model `vgg16` (from scratch)
* [Experiment 38](#exp38): model `AlexNet` (transfer learning)
* [Experiment 39](#exp39): model `AlexNet` (transfer learning)
* [Experiment 40](#exp40): model `AlexNet` (transfer learning)
* [Experiment 41](#exp41): model `Base`, best result for this model
* [Experiment 42](#exp42): model `AlexNet` (transfer learning), best result for this model with transfer learning
* [Experiment 43](#exp43): equal to experiment 17
* [Experiment 44](#exp44): model `AlexNet` (transfer learning)
* [Experiment 45](#exp45): model `AlexNet` (transfer learning), best result for this model with transfer learning
* [Experiment 46](#exp46): model `vgg16` (from scratch), best result among all 25 experiments
* [Experiment 47](#exp47): model `vgg16` (transfer learning), failed to train
* [Experiment 48](#exp48): model `vgg16` (transfer learning), failed to train
* [Experiment 49](#exp49): model `vgg16` (transfer learning), failed to train
* [Experiment 50](#exp50): model `Base_1` (from scratch), best result for this model from scratch, but worse than model `Base`
* [Experiment 51](#exp51): model `Base_1` (from scratch), best result for this model from scratch, but worse than model `Base`


[Conclusions](#conclusions) are here.

---

<a id='models'></a>
## Models


1. `Base`: custom AlexNet implementation following the [original paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), but without Local Response Normalization.

Model architecture:

In [1]:
from models.model_utils import init_model
model_base = init_model("Base", 133, pretrained=False)
print(model_base)

BasicCNN(
  (conv1): Conv2d(3, 48, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=4608, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


1.1. `Base_fix`: same as `Base`, but without the sigmoid function after the last fully connected layer. Applying a sigmoid there was a bug, which has been fixed and tested.

Model architecture:

In [2]:
from models.model_utils import init_model
model_base_fix = init_model("Base_fix", 133, pretrained=False)
print(model_base_fix)

BasicCNN(
  (conv1): Conv2d(3, 48, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=4608, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
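
Why the sigmoid matters can be seen with a little arithmetic. The sketch below assumes the models were trained with a softmax cross-entropy loss (the notebook does not state the loss function) and that the sigmoid output was passed to it as logits. A sigmoid squashes every logit into (0, 1), so the softmax can never become confident, which puts a floor under the loss. The function name is illustrative only.

```python
import math

NUM_CLASSES = 133  # number of dog breeds, taken from fc3's out_features


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single sample, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Best case with a sigmoid in front of the loss: the correct class saturates
# at 1 and every other class saturates at 0.
best_sigmoid_logits = [0.0] * NUM_CLASSES
best_sigmoid_logits[0] = 1.0
loss_floor = cross_entropy(best_sigmoid_logits, target=0)

# Reference point: a model that gives every class the same probability.
uniform_loss = cross_entropy([0.0] * NUM_CLASSES, target=0)

print(f"loss floor with sigmoid outputs: {loss_floor:.4f}")
print(f"loss of a uniform prediction:    {uniform_loss:.4f}")
```

Under these assumptions the loss cannot drop below roughly 3.90, and the uniform-prediction value ln(133) ≈ 4.8903 coincides with the test loss reported for experiment 26.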


2. `Base_1`: custom AlexNet implementation with wider convolutional layers (more filters).

Model architecture:

In [24]:
model_base_1 = init_model("Base_1", 133, pretrained=False)
print(model_base_1)

BasicCNN_v1(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 156, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(156, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6912, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
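
The `in_features` of `fc1` in the `Base` and `Base_1` printouts can be checked with the standard output-size formula for convolution and pooling layers. The sketch below assumes a 224x224 input and a pooling layer after `conv1`, `conv2` and `conv5`, as in AlexNet. The printed module list does not show the forward pass, so both are assumptions, and the helper names are illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(convs, pool_kernel, pool_after, input_size=224):
    """Return channels * height * width after the last layer.

    convs: list of (out_channels, kernel, stride, padding), in order.
    pool_after: 1-based indices of conv layers followed by max pooling (stride 2).
    """
    size, channels = input_size, None
    for index, (channels, kernel, stride, padding) in enumerate(convs, start=1):
        size = out_size(size, kernel, stride, padding)
        if index in pool_after:
            size = out_size(size, pool_kernel, stride=2)
    return channels * size * size


# (out_channels, kernel, stride, padding) copied from the printed architectures.
base = [(48, 11, 4, 2), (128, 5, 1, 2), (192, 3, 1, 1), (192, 3, 1, 1), (128, 3, 1, 1)]
base_1 = [(64, 11, 4, 2), (156, 5, 1, 2), (256, 3, 1, 1), (256, 3, 1, 1), (192, 3, 1, 1)]

print(flattened_features(base, pool_kernel=2, pool_after={1, 2, 5}))    # Base
print(flattened_features(base_1, pool_kernel=3, pool_after={1, 2, 5}))  # Base_1
```

With these assumptions the two results are 4608 and 6912, which agree with the `fc1` input sizes printed for `Base` and `Base_1`.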


2.1. `Base_1_fix`: same as `Base_1`, but without the sigmoid function after the last fully connected layer. Applying a sigmoid there was a bug, which has been fixed and tested.

Model architecture:

In [3]:
model_base_1_fix = init_model("Base_1_fix", 133, pretrained=False)
print(model_base_1_fix)

BasicCNN_v1(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 156, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(156, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6912, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


3. `AlexNet`: the AlexNet model from the torchvision library

Model architecture:

In [25]:
import torchvision.models as torch_models
model_alexnet = torch_models.alexnet()
print(model_alexnet)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

4. `vgg16`: the VGG16 model from the torchvision library

Model architecture:

In [26]:
import torchvision.models as torch_models
model_vgg16 = torch_models.vgg16()
print(model_vgg16)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

# Experiments description

For all experiments, the following parameters did not change:
* SGD used a momentum of 0.9
* `Base`, `Base_fix`, `Base_1` and `Base_1_fix` used dropout=0.5 for the fully connected layers
* a scheduler with the following parameters was used in all experiments:
  * scheduler patience = 3
  * scheduler factor = 0.5
  * scheduler cooldown = 2
* if Color Jitter was used, the following parameters were used for it:
  * brightness = 0.4
  * contrast = 0.4
  * saturation = 0.4
  * hue = 0.2
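
The scheduler parameters above (patience, factor, cooldown) correspond to the arguments of a reduce-on-plateau scheduler such as `torch.optim.lr_scheduler.ReduceLROnPlateau`, although the notebook does not name the class. The sketch below is a simplified pure-Python re-implementation for illustration, not PyTorch's code: it ignores the improvement threshold and the minimum learning rate, and the class name is made up.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` when the loss stops improving."""

    def __init__(self, lr, patience=3, factor=0.5, cooldown=2):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.cooldown = cooldown
        self.best = float("inf")
        self.bad_epochs = 0
        self.cooldown_left = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1

        if self.cooldown_left > 0:
            # Epochs right after a reduction do not count as bad epochs.
            self.cooldown_left -= 1
            self.bad_epochs = 0

        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.cooldown_left = self.cooldown
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.01)
# A loss that improves once and then stays flat for ten more epochs.
history = [scheduler.step(1.0) for _ in range(11)]
print(history)
```

In this sketch, with patience 3, the rate is halved on the fourth consecutive epoch without improvement, and the two cooldown epochs after each reduction are not counted as bad epochs.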

<a id='exp26'></a>
## Experiment 26

Details: 
* model: `Base`
* batch size: 32
* early stopping: 10
* lr: 0.01
* weight decay: 0.0005
* mean: 0.4864 0.4560 0.3918
* std: 0.2602 0.2536 0.2562
* augmentation: random horizontal flip, random resized crop and ColorJitter
* optimizer: SGD
* weight initialization: ones

Results:
* best epoch: 1
* test loss: 4.890350
* test accuracy: 0.96% (8/836)
* train time (minutes): 0.33
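
The `mean` and `std` entries in the details are per-channel statistics used to normalize the input images. A minimal sketch of how such values can be computed and applied is shown below. It uses a tiny made-up list of RGB pixels scaled to [0, 1] instead of the real training split, and the function names are illustrative.

```python
import math


def channel_stats(pixels):
    """Per-channel mean and (population) std over a list of (r, g, b) pixels."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    stds = [
        math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
        for c in range(3)
    ]
    return means, stds


def normalize(pixel, means, stds):
    """Shift and scale one pixel so each channel has zero mean and unit std."""
    return [(pixel[c] - means[c]) / stds[c] for c in range(3)]


# Toy data standing in for all pixels of the training images.
pixels = [(0.2, 0.4, 0.1), (0.6, 0.4, 0.3), (0.4, 0.8, 0.5), (0.8, 0.0, 0.7)]
means, stds = channel_stats(pixels)
normalized = [normalize(p, means, stds) for p in pixels]

print("mean:", means)
print("std: ", stds)
```

Statistics computed this way describe the dataset at hand, whereas values such as 0.485, 0.456, 0.406 describe ImageNet and are the usual choice when pretrained weights are loaded.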

<a id='exp27'></a>
## Experiment 27

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: uniform

Results:
* best epoch: 6
* test loss: 5.030281
* test accuracy: 0.72% (6/836)
* train time (minutes): 2.15

<a id='exp28'></a>
## Experiment 28

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: general rule

Results:
* best epoch: 122
* test loss: 4.396873
* test accuracy: 7.54% (63/836)
* train time (minutes): 43.48

<a id='exp29'></a>
## Experiment 29

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: pytorch default initialization

Results:
* best epoch: 77
* test loss: 4.477621
* test accuracy: 5.14% (43/836)
* train time (minutes): 119.08
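
Experiments 26-29 differ only in weight initialization. The sketch below contrasts three schemes for a single fully connected layer in plain Python. The exact definitions are assumptions, since the notebook does not spell them out: `uniform` is taken as U(0, 1) and `general rule` as U(-1/sqrt(n), 1/sqrt(n)), with n the number of inputs to the layer. Function names are illustrative.

```python
import math
import random


def init_weights(scheme, fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix for the given scheme."""
    if scheme == "ones":
        return [[1.0] * fan_in for _ in range(fan_out)]
    if scheme == "uniform":  # assumed to mean U(0, 1)
        low, high = 0.0, 1.0
    elif scheme == "general_rule":  # assumed to mean U(-1/sqrt(n), 1/sqrt(n))
        bound = 1.0 / math.sqrt(fan_in)
        low, high = -bound, bound
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[rng.uniform(low, high) for _ in range(fan_in)] for _ in range(fan_out)]


def layer_output(weights, inputs):
    """Pre-activation output of a linear layer without bias."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


rng = random.Random(0)
fan_in, fan_out = 2048, 4
inputs = [rng.random() for _ in range(fan_in)]

outputs = {
    scheme: layer_output(init_weights(scheme, fan_in, fan_out, rng), inputs)
    for scheme in ("ones", "uniform", "general_rule")
}
for scheme, values in outputs.items():
    print(scheme, [round(v, 2) for v in values])
```

With all-ones weights every unit computes the same value, so the units start out indistinguishable. The zero-centered, fan-in-scaled range keeps the pre-activations small, whereas the two non-centered schemes produce outputs that grow with the number of inputs.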

<a id='exp30'></a>
## Experiment 30

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_2
* augmentation: none

Results:
* best epoch: 17
* test loss: 3.462288
* test accuracy: 16.99% (142/836)
* train time (minutes): 8.68

<a id='exp31'></a>
## Experiment 31

Parent experiment: [exp 30](#exp30)

Differences from parent experiment: 
* ColorJitter: on -> off

Results:
* best epoch: 109
* test loss: 1.588668
* test accuracy: 61.24% (512/836)
* train time (minutes): 48.93

<a id='exp32'></a>
## Experiment 32

Parent experiment: [exp 31](#exp31)

Differences from parent experiment: 
* color jitter: off -> on

Results:
* best epoch: 123
* test loss: 1.315878
* test accuracy: 66.15% (553/836)
* train time (minutes): 192.85

<a id='exp33'></a>
## Experiment 33

Parent experiment: [exp 32](#exp32)

Differences from parent experiment: 
* model: Base_2 -> AlexNet

Results:
* best epoch: 165
* test loss: 1.382329
* test accuracy: 66.51% (556/836)
* train time (minutes): 262.42

<a id='exp34'></a>
## Experiment 34

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_fix

Results:
* best epoch: 145
* test loss: 1.287084
* test accuracy: 63.64% (532/836)
* train time (minutes): 223.72

<a id='exp35'></a>
## Experiment 35

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_1_fix

Results:
* best epoch: 137
* test loss: 1.334149
* test accuracy: 64.95% (543/836)
* train time (minutes): 214.98

<a id='exp36'></a>
## Experiment 36

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> vgg16

Results:
* best epoch: 99
* test loss: 0.943133
* test accuracy: 76.56% (640/836)
* train time (minutes): 324.48

<a id='exp37'></a>
## Experiment 37

Parent experiment: [exp 36](#exp36)

Differences from parent experiment: 
* ColorJitter: on -> off

Results:
* best epoch: 105
* test loss: 0.903384
* test accuracy: 77.51% (648/836)
* train time (minutes): 238.15

<a id='exp38'></a>
## Experiment 38

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* pretrained: false -> true
* model: Base -> AlexNet
* mean: 0.485, 0.456, 0.406
* std: 0.229, 0.224, 0.225
* number of unfrozen FC layers: 1

Results:
* best epoch: 75
* test loss: 1.022241
* test accuracy: 73.56% (615/836)
* train time (minutes): 223.25

<a id='exp39'></a>
## Experiment 39

Parent experiment: [exp 38](#exp38)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 2

Results:
* best epoch: 81
* test loss: 0.990969
* test accuracy: 71.65% (599/836)
* train time (minutes): 241.1

<a id='exp40'></a>
## Experiment 40

Parent experiment: [exp 38](#exp38)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 3

Results:
* best epoch: 17
* test loss: 3.64
* test accuracy: 13.88% (116/836)
* train time (minutes): 9.15
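
To see how much of the network each setting trains, the sketch below counts the trainable parameters when the last one, two, or three fully connected layers of torchvision's AlexNet are unfrozen. It assumes the final layer was replaced by a 4096-to-133 linear layer for the 133 breeds (the notebook does not show this step). The other two layer shapes are those of torchvision's AlexNet classifier.

```python
# (in_features, out_features) of the classifier's linear layers, first to last.
# The last entry assumes the 1000-class head was replaced by a 133-class one.
FC_LAYERS = [(9216, 4096), (4096, 4096), (4096, 133)]


def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def trainable_params(num_unfrozen):
    """Parameters that receive gradients when the last `num_unfrozen` FC layers are trainable."""
    if not 1 <= num_unfrozen <= len(FC_LAYERS):
        raise ValueError("num_unfrozen must be between 1 and 3")
    return sum(linear_params(i, o) for i, o in FC_LAYERS[-num_unfrozen:])


for k in (1, 2, 3):
    print(f"unfrozen FC layers: {k}, trainable parameters: {trainable_params(k):,}")
```

Under this assumption, going from one to three unfrozen layers raises the trainable parameter count from about 0.5 million to about 55 million.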

<a id='exp16'></a>
## Experiment 16

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* use mean and std for current train dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter

Results:
* best epoch: 144
* test loss: 4.382009
* test accuracy: 7.89% (66/836)
* train time (minutes): 51.76

<a id='exp17'></a>
## Experiment 17

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last FC layer


Results:
* best epoch: 42
* test loss: 1.062519
* test accuracy: 74.88% (626/836)
* train time (minutes): 17.32

<a id='exp18'></a>
## Experiment 18

Identical to [exp 17](#exp17).


<a id='exp19'></a>
## Experiment 19

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last 2 FC layers

Results:
* best epoch: 53
* test loss: 1.061143
* test accuracy: 72.13% (603/836)
* train time (minutes): 22.88

<a id='exp20'></a>
## Experiment 20

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> AlexNet 
* use pretrained weights, unfreeze last 3 FC layers

Results:
* best epoch: 42
* test loss: 1.062519
* test accuracy: 74.88% (626/836)
* train time (minutes): 18.58

<a id='exp21'></a>
## Experiment 21

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* use mean and std for current train dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 (from scratch)

Results:
* best epoch: 110
* test loss: 0.992273
* test accuracy: 76.44% (639/836)
* train time (minutes): 248.18

<a id='exp22'></a>
## Experiment 22

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last FC layer

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 2.36

<a id='exp23'></a>
## Experiment 23

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last 2 FC layers

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 1.47

<a id='exp24'></a>
## Experiment 24

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> vgg16 
* use pretrained weights, unfreeze last 3 FC layers

Results:
* best epoch: 1
* test loss: 2.082360
* test accuracy: 52.63% (440/836)
* train time (minutes): 2.0

<a id='exp25'></a>
## Experiment 25

Parent experiment: [exp 1](#exp1)

Differences from parent experiment: 
* use mean and std for current train dataset
* add weight decay 0.0005
* add augmentation: RandomHorizontalFlip and RandomResizedCrop
* add augmentation: Color Jitter
* model: Base -> Base_1 (from scratch)

Results:
* best epoch: 124
* test loss: 4.401157
* test accuracy: 5.26% (44/836)
* train time (minutes): 47.38

<a id='conclusions'></a>
## Conclusions for the series of experiments 1-25

Improvements that have been made:
* For the `Base` model:
  * The SGD optimizer trains faster than Adam and performs slightly better: [exp 1](#exp1) vs [exp 6](#exp6)
  * With Adam, it is better to start with a small learning rate (0.00001): [exp 6](#exp6) vs [exp 5](#exp5)
  * When training from scratch, it is better to compute the actual mean and std from the train split than to use the ImageNet values: [exp 10](#exp10) vs [exp 9](#exp9)
  * Using a weight decay of 0.00005 gives a better result than using none: [exp 11](#exp11) vs [exp 10](#exp10)
  * Horizontal flip and random resized crop augmentation reduces overfitting and gives a better result: [exp 12](#exp12) vs [exp 11](#exp11)
  * Adding ColorJitter augmentation gives some further improvement: [exp 16](#exp16) vs [exp 12](#exp12)
* The `AlexNet` model from torchvision gives a 10x better result than my custom `Base` when trained from scratch: [exp 13](#exp13) vs [exp 12](#exp12)
* Transfer learning with `AlexNet` gives a somewhat better result than training from scratch and trains ~5x faster: [exp 17](#exp17) vs [exp 15](#exp15)
* Training vgg16 from scratch gives a slightly better result than AlexNet with transfer learning: [exp 21](#exp21) vs [exp 17](#exp17)
* However, transfer learning for vgg16 failed: [exp 22](#exp22), [exp 23](#exp23), [exp 24](#exp24).

# What's next

1. Investigate why vgg transfer learning failed
2. Investigate why my custom AlexNet implementation is 10x worse than PyTorch's (maybe the key is the adaptive average pooling?)
3. Try transfer learning with the convolutional layers unfrozen too (in these experiments only the fully connected layers were unfrozen)
4. Try training state-of-the-art architectures (from scratch and with transfer learning)
5. Reach 90% test accuracy