## Write and train a dog breed classifier from scratch and using transfer learning

---

This notebook presents the results of experiments on training and improving the neural network from scratch. This part describes experiments 26-51 and reports their results.

In this series of experiments, the models were improved (in particular, bugs in the custom AlexNet and VGG implementations were fixed), and the AlexNet, VGG16 and VGG16 with batch normalization models (PyTorch's implementations) were trained from scratch and with transfer learning. See the [models description](#models) for details.

Use the links below to navigate between the experiment descriptions:

* [Experiment 26](#exp26): model `Base`, scratch
* [Experiment 27](#exp27): model `Base`, scratch 
* [Experiment 28](#exp28): model `Base`, scratch 
* [Experiment 29](#exp29): model `Base`, scratch
* [Experiment 30](#exp30): model `Base_2`, scratch 
* [Experiment 31](#exp31): model `Base_2`, scratch
* [Experiment 32](#exp32): model `Base_2`, scratch
* [Experiment 33](#exp33): model `AlexNet`, scratch 
* [Experiment 34](#exp34): model `Base_fix`, scratch 
* [Experiment 35](#exp35): model `Base_1_fix`, scratch 
* [Experiment 36](#exp36): model `vgg16`, scratch 
* [Experiment 37](#exp37): model `vgg16`, scratch 
* [Experiment 38](#exp38): model `AlexNet`, pretrained
* [Experiment 39](#exp39): model `AlexNet`, pretrained
* [Experiment 40](#exp40): model `AlexNet`, pretrained
* [Experiment 41](#exp41): model `AlexNet`, pretrained
* [Experiment 42](#exp42): model `AlexNet`, pretrained
* [Experiment 43](#exp43): model `vgg16`, pretrained
* [Experiment 44](#exp44): model `vgg16`, pretrained
* [Experiment 45](#exp45): model `vgg16`, pretrained
* [Experiment 46](#exp46): model `vgg16_bn`, pretrained
* [Experiment 47](#exp47): model `vgg16_bn`, scratch
* [Experiment 48](#exp48): model `vgg16_bn`, scratch
* [Experiment 49](#exp49): model `vgg16_bn`, scratch
* [Experiment 50](#exp50): model `vgg16_bn`, pretrained
* [Experiment 51](#exp51): model `vgg16_bn`, scratch


[Conclusions](#conclusions) are here.

---

<a id='models'></a>
## Models


1. `Base`: custom AlexNet implementation following the [original paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), but without Local Response Normalization.

Model architecture:

In [1]:
from models.model_utils import init_model
model_base = init_model("Base", 133, pretrained=False)
print(model_base)

BasicCNN(
  (conv1): Conv2d(3, 48, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=4608, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


1.1. `Base_fix`: same as `Base`, but without the sigmoid function after the last fully connected layer. Applying sigmoid there was a bug, which has been fixed and tested.

Model architecture:

In [2]:
from models.model_utils import init_model
model_base_fix = init_model("Base_fix", 133, pretrained=False)
print(model_base_fix)

BasicCNN(
  (conv1): Conv2d(3, 48, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(192, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=4608, out_features=2048, bias=True)
  (fc2): Linear(in_features=2048, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
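
Why a sigmoid after the last layer hurts can be illustrated with a small calculation. The sketch below assumes the training loss is softmax cross-entropy applied on top of the sigmoid outputs (the notebook does not state the loss function); the helper names are hypothetical. Because sigmoid confines every score to (0, 1), even a perfect prediction cannot push the loss towards zero.

```python
import math

NUM_CLASSES = 133


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy(scores, target):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[target]


# A very confident, correct prediction for class 0
raw_logits = [-20.0] * NUM_CLASSES
raw_logits[0] = 20.0
squashed = [sigmoid(x) for x in raw_logits]  # scores after the extra sigmoid

loss_raw = cross_entropy(raw_logits, 0)
loss_squashed = cross_entropy(squashed, 0)
# Best case with scores confined to (0, 1): true class -> 1, all others -> 0
loss_floor = math.log(1 + (NUM_CLASSES - 1) / math.e)

print(f"loss on raw logits:       {loss_raw:.4f}")
print(f"loss on sigmoid outputs:  {loss_squashed:.4f}")
print(f"lower bound with sigmoid: {loss_floor:.4f}")
```

Under this assumption the loss cannot fall below roughly 3.9 for 133 classes, however well the network separates them.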


2. `Base_1`: custom AlexNet implementation with wider convolutional layers (more filters).

Model architecture:

In [24]:
model_base_1 = init_model("Base_1", 133, pretrained=False)
print(model_base_1)

BasicCNN_v1(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 156, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(156, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6912, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
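
The `fc1` input sizes printed above (4608 for `Base`, 6912 for `Base_1`) can be cross-checked with the standard output-size formula for convolution and pooling layers. This is a sketch under two assumptions that the notebook does not state: a 224x224 input image, and max pooling applied after `conv1`, `conv2` and `conv5` (as in AlexNet). The function names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def fc1_in_features(conv5_channels, pool_kernel, input_size=224):
    """Flattened feature count after conv5, assuming pools after conv1, conv2 and conv5."""
    s = out_size(input_size, kernel=11, stride=4, padding=2)  # conv1
    s = out_size(s, kernel=pool_kernel, stride=2)             # pool
    s = out_size(s, kernel=5, padding=2)                      # conv2
    s = out_size(s, kernel=pool_kernel, stride=2)             # pool
    for _ in range(3):                                        # conv3..conv5 keep the size
        s = out_size(s, kernel=3, padding=1)
    s = out_size(s, kernel=pool_kernel, stride=2)             # pool
    return conv5_channels * s * s


print(fc1_in_features(128, pool_kernel=2))  # Base:   conv5 has 128 channels, 2x2 pooling
print(fc1_in_features(192, pool_kernel=3))  # Base_1: conv5 has 192 channels, 3x3 pooling
```

With these assumptions both models end with a 6x6 feature map, which reproduces the `in_features` values shown in the printed architectures.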


2.1. `Base_1_fix`: same as `Base_1`, but without the sigmoid function after the last fully connected layer. Applying sigmoid there was a bug, which has been fixed and tested.

Model architecture:

In [3]:
model_base_1_fix = init_model("Base_1_fix", 133, pretrained=False)
print(model_base_1_fix)

BasicCNN_v1(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 156, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(156, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=6912, out_features=4096, bias=True)
  (fc2): Linear(in_features=4096, out_features=2048, bias=True)
  (fc3): Linear(in_features=2048, out_features=133, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


3. `AlexNet`: model from torchvision library

Model architecture:

In [25]:
import torchvision.models as torch_models
model_alexnet = torch_models.alexnet()
print(model_alexnet)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

4. `vgg16`: model VGG16 from torchvision library

Model architecture:

In [26]:
import torchvision.models as torch_models
model_vgg16 = torch_models.vgg16()
print(model_vgg16)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

# Experiments description

For all experiments, the following parameters did not change:
* for SGD, a momentum of 0.9 was used
* for `Base`, `Base_fix`, `Base_1` and `Base_1_fix`, dropout=0.5 was used for the fully connected layers
* a scheduler with the following parameters was used in all experiments:
  * scheduler patience = 3
  * scheduler factor = 0.5
  * scheduler cooldown = 2
* if Color Jitter was used, the following parameters were used for it:
  * brightness = 0.4
  * contrast = 0.4
  * saturation = 0.4
  * hue = 0.2
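
To make the scheduler settings concrete, here is a simplified pure-Python sketch of a reduce-on-plateau rule. It assumes the scheduler behaves like PyTorch's `ReduceLROnPlateau` monitoring validation loss (`patience`, `factor` and `cooldown` are that scheduler's parameter names, but the notebook does not name the class). The `PlateauScheduler` class is hypothetical and omits details such as the improvement threshold and a minimum learning rate.

```python
class PlateauScheduler:
    """Minimal reduce-on-plateau rule: scale lr by `factor` after too many bad epochs."""

    def __init__(self, lr, factor=0.5, patience=3, cooldown=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.cooldown = cooldown
        self.best = float("inf")
        self.bad_epochs = 0
        self.cooldown_left = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.cooldown_left > 0:
            # During cooldown, epochs without improvement are not counted
            self.cooldown_left -= 1
            self.bad_epochs = 0
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.cooldown_left = self.cooldown
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.01)
losses = [4.9, 4.8] + [4.85] * 12  # improvement stalls after the second epoch
history = [scheduler.step(loss) for loss in losses]
print(history)
```

In this toy run the learning rate is halved once the loss has failed to improve for more than three epochs, and the two cooldown epochs delay the next reduction.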

<a id='exp26'></a>
## Experiment 26

Details: 
* model: `Base`
* batch size: 32
* early stopping: 10
* lr: 0.01
* weight decay: 0.0005
* mean: 0.4864 0.4560 0.3918
* std: 0.2602 0.2536 0.2562
* augmentation: random horizontal flip, random resized crop and ColorJitter
* optimizer: SGD
* weight initialization: ones

Results:
* best epoch: 1
* test loss: 4.890350
* test accuracy: 0.96% (8/836)
* train time (minutes): 0.33

<a id='exp27'></a>
## Experiment 27

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: uniform

Results:
* best epoch: 6
* test loss: 5.030281
* test accuracy: 0.72% (6/836)
* train time (minutes): 2.15

<a id='exp28'></a>
## Experiment 28

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: general rule

Results:
* best epoch: 122
* test loss: 4.396873
* test accuracy: 7.54% (63/836)
* train time (minutes): 43.48
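
The notebook does not define the "general rule" initialization. One common convention with that name draws each weight from a range of ±1/sqrt(n), where n is the number of inputs to the layer. The sketch below illustrates that convention with a uniform distribution; both the rule and the choice of distribution are assumptions, and the function name is hypothetical.

```python
import math
import random


def general_rule_uniform(n_inputs, n_outputs, seed=0):
    """Sample an n_outputs x n_inputs weight matrix from U(-y, y), y = 1/sqrt(n_inputs)."""
    rng = random.Random(seed)
    y = 1.0 / math.sqrt(n_inputs)
    return [[rng.uniform(-y, y) for _ in range(n_inputs)] for _ in range(n_outputs)]


# Example shape: the last fully connected layer of `Base` (2048 inputs, 133 outputs)
weights = general_rule_uniform(2048, 133)
bound = 1.0 / math.sqrt(2048)
largest = max(abs(w) for row in weights for w in row)
print(f"bound: {bound:.4f}, largest |w|: {largest:.4f}")
```

Scaling the range with the fan-in keeps the initial activations from growing with layer width, which is the usual motivation for rules of this kind.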

<a id='exp29'></a>
## Experiment 29

Parent experiment: [exp 26](#exp26)

Differences from parent experiment: 
* weights initialization: pytorch default initialization

Results:
* best epoch: 77
* test loss: 4.477621
* test accuracy: 5.14% (43/836)
* train time (minutes): 119.08

<a id='exp30'></a>
## Experiment 30

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_2
* augmentation: no

Results:
* best epoch: 17
* test loss: 3.462288
* test accuracy: 16.99% (142/836)
* train time (minutes): 8.68

<a id='exp31'></a>
## Experiment 31

Parent experiment: [exp 30](#exp30)

Differences from parent experiment: 
* ColorJitter: on -> off

Results:
* best epoch: 109
* test loss: 1.588668
* test accuracy: 61.24% (512/836)
* train time (minutes): 48.93

<a id='exp32'></a>
## Experiment 32

Parent experiment: [exp 31](#exp31)

Differences from parent experiment: 
* color jitter: off -> on

Results:
* best epoch: 123
* test loss: 1.315878
* test accuracy: 66.15% (553/836)
* train time (minutes): 192.85

<a id='exp33'></a>
## Experiment 33

Parent experiment: [exp 32](#exp32)

Differences from parent experiment: 
* model: Base_2 -> AlexNet

Results:
* best epoch: 165
* test loss: 1.382329
* test accuracy: 66.51% (556/836)
* train time (minutes): 262.42

<a id='exp34'></a>
## Experiment 34

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_fix

Results:
* best epoch: 145
* test loss: 1.287084
* test accuracy: 63.64% (532/836)
* train time (minutes): 223.72

<a id='exp35'></a>
## Experiment 35

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> Base_1_fix

Results:
* best epoch: 137
* test loss: 1.334149
* test accuracy: 64.95% (543/836)
* train time (minutes): 214.98

<a id='exp36'></a>
## Experiment 36

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> vgg16

Results:
* best epoch: 99
* test loss: 0.943133
* test accuracy: 76.56% (640/836)
* train time (minutes): 324.48

<a id='exp37'></a>
## Experiment 37

Parent experiment: [exp 36](#exp36)

Differences from parent experiment: 
* ColorJitter: on -> off

Results:
* best epoch: 105
* test loss: 0.903384
* test accuracy: 77.51% (648/836)
* train time (minutes): 238.15

<a id='exp38'></a>
## Experiment 38

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* pretrained: false -> true
* model: Base -> AlexNet
* mean: 0.485, 0.456, 0.406
* std: 0.229, 0.224, 0.225
* number of unfrozen FC layers: 1

Results:
* best epoch: 75
* test loss: 1.022241
* test accuracy: 73.56% (615/836)
* train time (minutes): 223.25

<a id='exp39'></a>
## Experiment 39

Parent experiment: [exp 38](#exp38)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 2

Results:
* best epoch: 81
* test loss: 0.990969
* test accuracy: 71.65% (599/836)
* train time (minutes): 241.1

<a id='exp40'></a>
## Experiment 40

Parent experiment: [exp 38](#exp38)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 3

Results:
* best epoch: 90
* test loss: 0.986147
* test accuracy: 70.45% (589/836)
* train time (minutes): 266.15
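
The "number of unfrozen FC layers" setting used in experiments 38-40 can be sketched as follows. This is an illustration, not the project's actual code: `unfreeze_last_fc` is a hypothetical helper, and the small `nn.Sequential` only mimics the layer pattern of torchvision's AlexNet classifier (Dropout, Linear, ReLU, Dropout, Linear, ReLU, Linear) with reduced sizes. In a real transfer-learning setup the convolutional `features` block would be frozen in the same way.

```python
import torch.nn as nn


def unfreeze_last_fc(classifier, n_layers):
    """Freeze all parameters, then unfreeze the last `n_layers` Linear layers."""
    for param in classifier.parameters():
        param.requires_grad = False
    linear_layers = [m for m in classifier if isinstance(m, nn.Linear)]
    if n_layers > 0:
        for layer in linear_layers[-n_layers:]:
            for param in layer.parameters():
                param.requires_grad = True
    return classifier


# Small stand-in for a pretrained classifier head with 133 output classes
classifier = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(64, 32), nn.ReLU(inplace=True),
    nn.Dropout(0.5), nn.Linear(32, 32), nn.ReLU(inplace=True),
    nn.Linear(32, 133),
)
unfreeze_last_fc(classifier, n_layers=1)
trainable = [name for name, p in classifier.named_parameters() if p.requires_grad]
print(trainable)
```

Only parameters with `requires_grad=True` receive gradient updates, so passing `n_layers=1`, `2` or `3` corresponds to the three settings compared above.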

<a id='exp41'></a>
## Experiment 41

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* pretrained model
* Model: Base -> AlexNet
* augmentation: ColorJitter: true -> false
* number of unfrozen FC layers: 3

Results:
* best epoch: 87
* test loss: 0.935989
* test accuracy: 74.40% (622/836)
* train time (minutes): 68.8

<a id='exp42'></a>
## Experiment 42

Parent experiment: [exp 41](#exp41)

Differences from parent experiment: 
* augmentation: horizontal flip: true -> false

Results:
* best epoch: 92
* test loss: 0.971809
* test accuracy: 73.33% (613/836)
* train time (minutes): 75.72

<a id='exp43'></a>
## Experiment 43

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* pretrained: true
* model: vgg16
* use color jitter: false
* use random horizontal flip: false
* number of unfrozen FC layers: 1
* mean: 0.485, 0.456, 0.406
* std: 0.229, 0.224, 0.225
* lr: 0.001

Results:
* best epoch: 36
* test loss: 0.360861
* test accuracy: 88.88% (743/836)
* train time (minutes): 75.72


<a id='exp44'></a>
## Experiment 44

Parent experiment: [exp 43](#exp43)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 2

Results:
* best epoch: 44
* test loss: 0.377006
* test accuracy: 89.83% (751/836)
* train time (minutes): 86.2

<a id='exp45'></a>
## Experiment 45

Parent experiment: [exp 43](#exp43)

Differences from parent experiment: 
* number of unfrozen FC layers: 1 -> 3

Results:
* best epoch: 41
* test loss: 0.381830
* test accuracy: 88.40% (739/836)
* train time (minutes): 60.16

<a id='exp46'></a>
## Experiment 46

Parent experiment: [exp 43](#exp43)

Differences from parent experiment: 
* model: vgg16 -> vgg16_bn
* number of unfrozen FC layers: 1

Results:
* best epoch: 36
* test loss: 0.358328
* test accuracy: 89.00% (744/836)
* train time (minutes): 29.87

<a id='exp47'></a>
## Experiment 47

Parent experiment: [exp 29](#exp29)

Differences from parent experiment: 
* model: Base -> vgg16_bn
* augmentation: use color jitter: false
* augmentation: use random horizontal flip: false
* mean: 0.485, 0.456, 0.406 (ImageNet)
* std: 0.229, 0.224, 0.225 (ImageNet)
* lr: 0.001

Results:
* best epoch: 105
* test loss: 1.872399
* test accuracy: 50.00% (418/836)
* train time (minutes): 330.0

<a id='exp48'></a>
## Experiment 48

Parent experiment: [exp 47](#exp47)

Differences from parent experiment: 
* augmentation color jitter: false -> true
* lr: 0.001

Results:
* best epoch: 99
* test loss: 1.479383
* test accuracy: 58.49% (489/836)
* train time (minutes): 320.03

<a id='exp49'></a>
## Experiment 49

Parent experiment: [exp 48](#exp48)

Differences from parent experiment: 
* augmentation: use random horizontal flip: false -> true

Results:
* best epoch: 123
* test loss: 1.136795
* test accuracy: 68.54% (573/836)
* train time (minutes): 218.6

<a id='exp50'></a>
## Experiment 50

Parent experiment: [exp 43](#exp43)

Differences from parent experiment: 

* model: vgg16 -> vgg16_bn
* augmentation: color jitter: false -> true
* augmentation: random horizontal flip: false -> true
* number of unfrozen FC layers: 1

Results:
* best epoch: 50
* test loss: 0.393089
* test accuracy: 88.16% (737/836)
* train time (minutes): 88.5

<a id='exp51'></a>
## Experiment 51

Parent experiment: [exp 49](#exp49)

Differences from parent experiment: 

* mean: 0.4864 0.4560 0.3918 (custom)
* std: 0.2602 0.2536 0.2562 (custom)

Results:
* best epoch: 99
* test loss: 0.845104
* test accuracy: 75.48% (631/836)
* train time (minutes): 167.37
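
Experiment 51 replaces the ImageNet statistics with custom per-channel mean and std values. The sketch below shows one way such values can be computed and then applied; it assumes images are given as lists of RGB pixels scaled to [0, 1], and the function names and the tiny example data are hypothetical. How the notebook's own values were obtained is not stated.

```python
import math


def channel_stats(images):
    """Per-channel mean and std over images given as lists of (r, g, b) pixels in [0, 1]."""
    n = 0
    sums = [0.0, 0.0, 0.0]
    sq_sums = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            n += 1
            for c in range(3):
                sums[c] += pixel[c]
                sq_sums[c] += pixel[c] ** 2
    mean = [s / n for s in sums]
    # Population std from E[x^2] - E[x]^2, clamped at zero against rounding errors
    std = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(sq_sums, mean)]
    return mean, std


def normalize(pixel, mean, std):
    """Apply (x - mean) / std per channel, the operation torchvision's Normalize performs."""
    return tuple((x - m) / s for x, m, s in zip(pixel, mean, std))


# Two tiny made-up "images" of two pixels each
images = [[(0.2, 0.4, 0.6), (0.4, 0.6, 0.8)], [(0.6, 0.4, 0.2), (0.8, 0.6, 0.4)]]
mean, std = channel_stats(images)
print(mean, std)
print(normalize((0.5, 0.5, 0.5), mean, std))
```

For a real dataset the same accumulation would run over all training images, typically after resizing and conversion to tensors.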

<a id='conclusions'></a>
## CONCLUSIONS for the series of experiments 26-51

#### Table with best results for each model  

Model | Type (scratch/pretrained) | Test accuracy, % | Test loss | Time, minutes | Experiment | Comments
----- | ------------------------- | ---------------- | --------- | ------------- | ---------- | --------
`Base_fix` | scratch | 63.64 | 1.287084 | 223.72 | [exp 34](#exp34) | -
`Base_1_fix` | scratch | 64.95 | 1.334149 | 214.98 | [exp 35](#exp35) | -
`Base_2` | scratch | 66.15 | 1.315878 | 192.85 | [exp 32](#exp32) | -
`AlexNet` | scratch | 66.51 | 1.382329 | 262.42 | [exp 33](#exp33) | -
`AlexNet` | pretrained | 74.40 | 0.935989 | 68.8 | [exp 41](#exp41) | ColorJitter off<br/> horizontal flip on<br/> number of unfrozen FC layers: 3
`vgg16_bn` | scratch | 75.48 | 0.845104 | 167.37 | [exp 51](#exp51) | ColorJitter on<br/> horizontal flip on
`vgg16` | scratch | 77.51 | 0.903384 | 238.15 | [exp 37](#exp37) | ColorJitter off<br/> horizontal flip on
`vgg16_bn` | pretrained | 89.00 | 0.358328 | 29.87 | [exp 46](#exp46) | ColorJitter off<br/> horizontal flip off<br/> number of unfrozen FC layers: 1
`vgg16` | pretrained | 89.83 | 0.377006 | 86.2 | [exp 44](#exp44) | ColorJitter off<br/> horizontal flip off<br/> number of unfrozen FC layers: 2


Conclusions:
* As usual, deeper models show better results
* For the same architecture, transfer learning gives much better results than training from scratch
* `vgg16_bn` shows worse results than `vgg16` (this is strange: why didn't adding batch normalization improve the result?)
* Batch normalization makes the training process 2x-3x faster, both from scratch and with transfer learning
* It is better to switch off the ColorJitter (and sometimes even the RandomHorizontalFlip) augmentation when using transfer learning
* When training from scratch, the ColorJitter and RandomHorizontalFlip augmentations dramatically improve the result
* The custom weight initialization technique (the so-called "general rule") works worse than PyTorch's default weight initialization
* It is not clear how many of the last FC layers to unfreeze when using transfer learning (1, 2 or 3), so it is better to try all options



# What's next

1. Repeat [exp 41](#exp41), but with the number of unfrozen FC layers set to 1 (should improve the result)
1. Add more modern architectures and compare results
1. Implement and try unfreezing convolutional layers