# CNN Architectures and How to Use Them with fastai

__Update__: 

* Added [an example](cnn_archs_more.ipynb) for how to use the [pytorchcv](https://github.com/osmr/imgclsmob/tree/master/pytorch) models. `pytorchcv` provides a much more comprehensive list of models.

* [Torchvision models](https://pytorch.org/docs/stable/torchvision/index.html) are all included in fastai v1 in [this PR](https://github.com/fastai/fastai/pull/1523), except for "inception_v3" (check the "Inception" section to see why it's left out and a workaround). 

* Most of the [Cadene pretrained models](https://github.com/Cadene/pretrained-models.pytorch) are included in fastai v1 in [this PR](https://github.com/fastai/fastai/pull/1753). You can directly import the models, just like ResNets. See [this file](PR_test.ipynb) for some examples.

* EfficientNet is added.

In an image classification project, I tried various CNN architectures and compared their performance with the help of the great deep learning library [fastai](https://docs.fast.ai/). Fastai doesn't ship many architectures itself, but its flexible API lets us use pretrained models from [torchvision](https://pytorch.org/docs/stable/torchvision/index.html) or [Cadene](https://github.com/Cadene/pretrained-models.pytorch) for transfer learning with a little customization. Since I have already done this customization to make the models work, I'd like to share it in this notebook. 

We'll use the pretrained models from torchvision and Cadene:  

In [1]:
from torchvision.models import *
import pretrainedmodels

from fastai.vision import *
from fastai.vision.models import *
from fastai.vision.learner import model_meta

from utils import *
import sys

The fastai version used here is:

In [20]:
__version__

'1.0.53.dev0'

Here are all models provided by torchvision: 

In [12]:
[k for k,v in sys.modules['torchvision.models'].__dict__.items() if callable(v)]

['alexnet',
 'AlexNet',
 'ResNet',
 'resnet18',
 'resnet34',
 'resnet50',
 'resnet101',
 'resnet152',
 'VGG',
 'vgg11',
 'vgg11_bn',
 'vgg13',
 'vgg13_bn',
 'vgg16',
 'vgg16_bn',
 'vgg19_bn',
 'vgg19',
 'SqueezeNet',
 'squeezenet1_0',
 'squeezenet1_1',
 'Inception3',
 'inception_v3',
 'DenseNet',
 'densenet121',
 'densenet169',
 'densenet201',
 'densenet161']

Here are all models provided by Cadene: 

In [176]:
pretrainedmodels.model_names

['fbresnet152',
 'bninception',
 'resnext101_32x4d',
 'resnext101_64x4d',
 'inceptionv4',
 'inceptionresnetv2',
 'alexnet',
 'densenet121',
 'densenet169',
 'densenet201',
 'densenet161',
 'resnet18',
 'resnet34',
 'resnet50',
 'resnet101',
 'resnet152',
 'inceptionv3',
 'squeezenet1_0',
 'squeezenet1_1',
 'vgg11',
 'vgg11_bn',
 'vgg13',
 'vgg13_bn',
 'vgg16',
 'vgg16_bn',
 'vgg19_bn',
 'vgg19',
 'nasnetamobile',
 'nasnetalarge',
 'dpn68',
 'dpn68b',
 'dpn92',
 'dpn98',
 'dpn131',
 'dpn107',
 'xception',
 'senet154',
 'se_resnet50',
 'se_resnet101',
 'se_resnet152',
 'se_resnext50_32x4d',
 'se_resnext101_32x4d',
 'cafferesnet101',
 'pnasnet5large',
 'polynet']

## ResNet

We first look at ResNet, which is proposed in "[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)". In the torchvision implementation, the ResNet-34 model has the following structure:

In [2]:
arch_summary(resnet34)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 15  layers (total: 19)
(5) Sequential  : 22  layers (total: 41)
(6) Sequential  : 32  layers (total: 73)
(7) Sequential  : 17  layers (total: 90)
(8) AvgPool2d   : 1   layers (total: 91)
(9) Linear      : 1   layers (total: 92)


`arch_summary` prints the module name of each _direct_ child and counts how many layers (submodules) it contains. E.g., (0) contains only one module (itself), while (4) contains 15 submodules. The direct children give us the big picture of the structure. 
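
`arch_summary` comes from this notebook's `utils` module, which isn't reproduced here. The following is only a sketch of the counting logic, assuming a "layer" means a leaf module (one with no children of its own). `Node` is a hypothetical stand-in for `nn.Module` so the snippet runs without PyTorch, and `count_leaves` / `summarize_children` are illustrative names, not the real helper:

```python
class Node:
    """Hypothetical stand-in for nn.Module: a name plus child nodes."""
    def __init__(self, name, *kids):
        self.name, self._kids = name, list(kids)

    def children(self):
        return iter(self._kids)


def count_leaves(module):
    """Count modules without children; a childless module counts as 1."""
    kids = list(module.children())
    return 1 if not kids else sum(count_leaves(k) for k in kids)


def summarize_children(model):
    """Return (index, name, leaf count, running total) for each direct child."""
    rows, total = [], 0
    for i, child in enumerate(model.children()):
        n = count_leaves(child)
        total += n
        # Fall back to the class name for objects without a `name` attribute.
        name = getattr(child, "name", type(child).__name__)
        rows.append((i, name, n, total))
    return rows


def basic_block():
    """Toy block with five leaves: conv, bn, relu, conv, bn."""
    return Node("BasicBlock", Node("Conv2d"), Node("BatchNorm2d"), Node("ReLU"),
                Node("Conv2d"), Node("BatchNorm2d"))


toy = Node("Toy",
           Node("Conv2d"),
           Node("Sequential", basic_block(), basic_block(), basic_block()),
           Node("Linear"))

for row in summarize_children(toy):
    print("(%d) %-12s: %-3d layers (total: %d)" % row)
```

Three toy blocks of five leaves each give 15 for the `Sequential` row, which is the same kind of count shown for (4) above.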

We can see that after the `Conv2d`, `BatchNorm2d`, `ReLU` and `MaxPool2d` layers, we have four `Sequential` stages, each made up of `BasicBlock` modules. Finally, we have an `AvgPool2d` layer and a `Linear` layer.

For transfer learning, we cut out layers (8) and (9) and add our own custom head depending on the problem.

One of the greatest techniques fastai uses heavily is [discriminative layer training](https://docs.fast.ai/basic_train.html#Discriminative-layer-training), which applies different learning rates to different layers. The idea is that what the first few layers learn is more task-independent, so we don't want to move them too far from the pretrained weights, while the last few layers, which we added ourselves, need more training. 
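
To make the idea concrete: in fastai v1 you usually request discriminative rates by passing something like `slice(lr_min, lr_max)` as the learning rate, and to my understanding the rates are then spread geometrically across the layer groups. A pure-Python sketch of that spreading, where `spread_lrs` is a hypothetical helper name and not part of fastai:

```python
def spread_lrs(lr_min, lr_max, n_groups):
    """Spread learning rates geometrically from lr_min (first group) to lr_max (last group)."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]


# Three layer groups: early body, late body, custom head.
lrs = spread_lrs(1e-5, 1e-3, 3)
print(lrs)  # roughly 1e-5, 1e-4, 1e-3
```

The earliest group gets the smallest rate, so the pretrained low-level filters move the least.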

To group the layers for discriminative learning rates, we can set (0) ~ (5) as group 1, set (6) and (7) as group 2, and our own custom head as group 3. In fastai, the cut and split is defined as:
```
{'cut':-2, 'split':_resnet_split }
```

The "cut" part means we'll cut out the last two layers from the original model, then add a custom head. Now the new model has two parts: a "body" and a "head". The "split" part:

```
def _resnet_split(m:nn.Module): return (m[0][6],m[1])
```

This means we split the model at (6) of the "body" (`m[0][6]`) and again at the start of the "head" (`m[1]`). We end up with three layer groups (the "body" cut in two, plus the "head"), so we can apply three different learning rates during training.
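
The index arithmetic behind the cut and split can be illustrated with plain Python lists. This is only an analogy for how `cut=-2` and a split at index 6 partition the children printed by `arch_summary`; the `"<custom head>"` entry is a placeholder for the layers fastai appends:

```python
# Direct children of torchvision's resnet34, as listed by arch_summary above.
children = ["Conv2d", "BatchNorm2d", "ReLU", "MaxPool2d",
            "Sequential", "Sequential", "Sequential", "Sequential",
            "AvgPool2d", "Linear"]

cut, split_idx = -2, 6
body = children[:cut]        # drops AvgPool2d and Linear
head = ["<custom head>"]     # placeholder for the new head

# Group 1: body before the split, group 2: body from the split on, group 3: head.
groups = [body[:split_idx], body[split_idx:], head]
for i, g in enumerate(groups, 1):
    print(f"Group {i}: {g}")
```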

Since we're only testing the models and don't care about the data, we can mock the data up using `FakeData`. Now we can create a learner: 

In [3]:
learn = cnn_learner(FakeData(), resnet34, pretrained=False)

When you train for real you'll want to set `pretrained=True` to use the pretrained weights, but here we're only testing the model structure, so the weights don't matter. 

To check that the cut and split work as expected, we extract the groups: 

In [4]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', 'Sequential', 'Sequential']
Group 2: ['Sequential', 'Sequential']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


It's split as we expected. 

There are quite a few versions of ResNet. Let's now take a look at ResNet-50: 

In [15]:
arch_summary(resnet50)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 23  layers (total: 27)
(5) Sequential  : 30  layers (total: 57)
(6) Sequential  : 44  layers (total: 101)
(7) Sequential  : 23  layers (total: 124)
(8) AvgPool2d   : 1   layers (total: 125)
(9) Linear      : 1   layers (total: 126)


We can see it has the same structure as ResNet-34, except that the four stages are built from `Bottleneck` blocks instead of `BasicBlock`. The same goes for ResNet-101 and ResNet-152:

In [16]:
arch_summary(resnet101)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 23  layers (total: 27)
(5) Sequential  : 30  layers (total: 57)
(6) Sequential  : 163 layers (total: 220)
(7) Sequential  : 23  layers (total: 243)
(8) AvgPool2d   : 1   layers (total: 244)
(9) Linear      : 1   layers (total: 245)


In [17]:
arch_summary(resnet152)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 23  layers (total: 27)
(5) Sequential  : 58  layers (total: 85)
(6) Sequential  : 254 layers (total: 339)
(7) Sequential  : 23  layers (total: 362)
(8) AvgPool2d   : 1   layers (total: 363)
(9) Linear      : 1   layers (total: 364)


This means we can use the same code in fastai for all of these architectures.

## ResNeXt

ResNeXt is proposed in "[Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431)". We first take a look at Cadene's implementation of "resnext101_32x4d". We wrap it to match the torchvision model API:

In [5]:
def resnext101_32x4d(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.resnext101_32x4d(pretrained=pretrained)
    all_layers = list(model.children())
    return nn.Sequential(*all_layers[0], *all_layers[1:])
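
Judging from the expression `nn.Sequential(*all_layers[0], *all_layers[1:])`, the first child of Cadene's model is itself a container holding the convolutional layers. Unpacking it puts every layer at the top level, so integer `cut` and split indices behave as they do for torchvision's ResNet. The unpacking can be illustrated with plain lists (the layer names are placeholders, not read from the real model):

```python
# First child is a nested container; the remaining children are single layers.
all_layers = [
    ["Conv2d", "BatchNorm2d", "ReLU", "MaxPool2d",
     "Sequential", "Sequential", "Sequential", "Sequential"],
    "AvgPool2d",
    "Linear",
]

# Same unpacking pattern as in the wrapper: expand the first child, keep the rest.
flat = [*all_layers[0], *all_layers[1:]]
print(len(flat), flat[-2:])
```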

In [14]:
arch_summary(resnext101_32x4d)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 34  layers (total: 38)
(5) Sequential  : 45  layers (total: 83)
(6) Sequential  : 254 layers (total: 337)
(7) Sequential  : 34  layers (total: 371)
(8) AvgPool2d   : 1   layers (total: 372)
(9) Linear      : 1   layers (total: 373)


After the `Conv2d`, `BatchNorm2d`, `ReLU` and `MaxPool2d` layers, we have four blocks. Finally, we have an `AvgPool2d` layer and `Linear` layer as usual. 

For transfer learning, we cut out layers (8) and (9) and add our own custom head depending on the problem.  

To group the layers for discriminative learning rates, we can set (0) ~ (5) as group 1, set (6) and (7) as group 2, and our own custom head as group 3. In fastai, we can specify the cut and split directly when creating the learner: 

In [6]:
learn = cnn_learner(FakeData(), resnext101_32x4d, pretrained=False,
                    cut=-2, split_on=lambda m: (m[0][6], m[1]))

To check that the cut and split work as expected, we extract the groups: 

In [7]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', 'Sequential', 'Sequential']
Group 2: ['Sequential', 'Sequential']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


We can see the original model's last two layers are removed, and the remaining layers are split into groups 1 and 2. Group 3 is the custom head added by fastai. So the customization works as expected.  

Alternatively, you can add the cut and split to the model metadata before calling `cnn_learner`: 

In [154]:
_resnext_meta = {'cut': -2, 'split': lambda m: (m[0][6], m[1]) }

In [155]:
model_meta[resnext101_32x4d] = _resnext_meta
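
Registering in `model_meta` works because the cut and split are looked up by the architecture function itself, with a fallback to defaults when the function isn't registered. A simplified sketch of that lookup pattern follows; `lookup_meta`, `DEFAULT_META` and the toy constructors are hypothetical names for illustration, and the local `model_meta` dict only stands in for fastai's registry:

```python
DEFAULT_META = {"cut": None, "split": None}   # hypothetical fallback values
model_meta = {}                                # stand-in for fastai's registry


def my_arch(pretrained=False):
    """Stand-in for a wrapped model constructor."""
    return "model"


def other_arch(pretrained=False):
    """Another constructor that is never registered."""
    return "model"


def lookup_meta(arch):
    """Return the registered cut/split for `arch`, or the defaults."""
    return model_meta.get(arch, DEFAULT_META)


model_meta[my_arch] = {"cut": -2, "split": lambda m: (m[0][6], m[1])}

print(lookup_meta(my_arch)["cut"])      # registered value
print(lookup_meta(other_arch)["cut"])   # falls back to the default
```

Because the key is the function object, the registered function must be the exact wrapper later passed to the learner.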

"resnext101_64x4d" has similar structures:

In [3]:
def resnext101_64x4d(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.resnext101_64x4d(pretrained=pretrained)
    all_layers = list(model.children())
    return nn.Sequential(*all_layers[0], *all_layers[1:])

In [4]:
arch_summary(resnext101_64x4d)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) Sequential  : 34  layers (total: 38)
(5) Sequential  : 45  layers (total: 83)
(6) Sequential  : 254 layers (total: 337)
(7) Sequential  : 34  layers (total: 371)
(8) AvgPool2d   : 1   layers (total: 372)
(9) Linear      : 1   layers (total: 373)


So we can use the same cut and split to create a learner. 

## SENet

SENet is proposed in "[Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf)". We first take a look at Cadene's implementation of "se_resnet50". We wrap it to match the torchvision model API:

In [15]:
def se_resnet50(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.se_resnet50(pretrained=pretrained)
    return model

In [16]:
arch_summary(se_resnet50)

(0) Sequential  : 4   layers (total: 4)
(1) Sequential  : 38  layers (total: 42)
(2) Sequential  : 50  layers (total: 92)
(3) Sequential  : 74  layers (total: 166)
(4) Sequential  : 38  layers (total: 204)
(5) AvgPool2d   : 1   layers (total: 205)
(6) Linear      : 1   layers (total: 206)


It contains a four-layer stem followed by four `Sequential` stages built from `SEResNetBottleneck` blocks, then an `AvgPool2d` layer and a `Linear` layer. 

For transfer learning, we cut out the last two layers and add our own custom head depending on the problem.  

To group the layers for discriminative learning rates, we can set (0) ~ (2) as group 1, set (3) and (4) as group 2, and our own custom head as group 3. In fastai, we can specify the cut and split directly when creating the learner: 

In [126]:
learn = cnn_learner(FakeData(), se_resnet50, pretrained=False,
                    cut=-2, split_on=lambda m: (m[0][3], m[1]))

To check that the cut and split work as expected, we extract the groups: 

In [133]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['Sequential', 'Sequential', 'Sequential']
Group 2: ['Sequential', 'Sequential']
Group 3: ['AdaptiveConcatPool2d', 'Lambda', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected.

Alternatively, you can add the cut and split to the model metadata before calling `cnn_learner`: 

In [124]:
_se_resnet_meta = {'cut': -2, 'split': lambda m: (m[0][3], m[1]) }

In [125]:
model_meta[se_resnet50] = _se_resnet_meta

What about the other SENets? They all have the same structure. The structure of "se_resnet101": 

In [6]:
arch_summary(lambda _: pretrainedmodels.se_resnet101(pretrained=None))

(0) Sequential  : 4   layers (total: 4)
(1) Sequential  : 38  layers (total: 42)
(2) Sequential  : 50  layers (total: 92)
(3) Sequential  : 278 layers (total: 370)
(4) Sequential  : 38  layers (total: 408)
(5) AvgPool2d   : 1   layers (total: 409)
(6) Linear      : 1   layers (total: 410)


The structure of "se_resnext50_32x4d":

In [7]:
arch_summary(lambda _: pretrainedmodels.se_resnext50_32x4d(pretrained=None))

(0) Sequential  : 4   layers (total: 4)
(1) Sequential  : 38  layers (total: 42)
(2) Sequential  : 50  layers (total: 92)
(3) Sequential  : 74  layers (total: 166)
(4) Sequential  : 38  layers (total: 204)
(5) AvgPool2d   : 1   layers (total: 205)
(6) Linear      : 1   layers (total: 206)


So the same cut and split can be used for the above models. 

"senet154" has one more dropout layer:

In [13]:
arch_summary(lambda _: pretrainedmodels.senet154(pretrained=None))

(0) Sequential  : 10  layers (total: 10)
(1) Sequential  : 38  layers (total: 48)
(2) Sequential  : 98  layers (total: 146)
(3) Sequential  : 434 layers (total: 580)
(4) Sequential  : 38  layers (total: 618)
(5) AvgPool2d   : 1   layers (total: 619)
(6) Dropout     : 1   layers (total: 620)
(7) Linear      : 1   layers (total: 621)


We can use the same split but set `cut=-3`. 

## DenseNet

DenseNet is proposed in "[Densely Connected Convolutional Networks](https://arxiv.org/pdf/1608.06993.pdf)". 

We first look at the torchvision implementation of "densenet121". It has the following structure:

In [21]:
arch_summary(densenet121)

(0) Sequential  : 365 layers (total: 365)
(1) Linear      : 1   layers (total: 366)


We expand and look into the first module: 

In [20]:
arch_summary(lambda _: next(densenet121(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) _DenseBlock : 36  layers (total: 40)
(5) _Transition : 4   layers (total: 44)
(6) _DenseBlock : 72  layers (total: 116)
(7) _Transition : 4   layers (total: 120)
(8) _DenseBlock : 144 layers (total: 264)
(9) _Transition : 4   layers (total: 268)
(10) _DenseBlock : 96  layers (total: 364)
(11) BatchNorm2d : 1   layers (total: 365)


After the usual `Conv2d`, `BatchNorm2d`, `ReLU` and `MaxPool2d` layers, we get alternating `_DenseBlock` and `_Transition` blocks, ending with a final `BatchNorm2d`. 

For transfer learning, we can cut out the last `Linear` layer and add our own custom head depending on the problem.  

To group the layers for discriminative learning rates, we can set (0) ~ (6) as group 1, set (7) ~ (11) as group 2, and our own custom head as group 3. In fastai, the cut and split are specified by:

```
{'cut':-1, 'split':_densenet_split}
```

where:
```
def _densenet_split(m:nn.Module): return (m[0][0][7],m[1])
```
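
The extra `[0]` in `m[0][0][7]` is needed because, with `cut=-1`, the body still wraps the original `Sequential` of features rather than the individual layers. A list-based analogy of that nesting (plain lists standing in for `nn.Sequential`, with `"<custom head>"` as a placeholder):

```python
features = ["Conv2d", "BatchNorm2d", "ReLU", "MaxPool2d",
            "_DenseBlock", "_Transition", "_DenseBlock", "_Transition",
            "_DenseBlock", "_Transition", "_DenseBlock", "BatchNorm2d"]
original = [features, "Linear"]   # the two direct children of densenet121

body = original[:-1]              # cut=-1 keeps only the features container
head = ["<custom head>"]
m = [body, head]                  # m[0] is the body, m[1] is the head

split_at = 7
print(m[0][0][split_at])          # first layer of group 2

groups = [m[0][0][:split_at], m[0][0][split_at:], m[1]]
for i, g in enumerate(groups, 1):
    print(f"Group {i}: {g}")
```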

Now we can create a learner: 

In [174]:
learn = cnn_learner(FakeData(), densenet121, pretrained=False)

To check that the cut and split work as expected, we extract the groups: 

In [175]:
get_groups(nn.Sequential(*learn.model[0][0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', '_DenseBlock', '_Transition', '_DenseBlock']
Group 2: ['_Transition', '_DenseBlock', '_Transition', '_DenseBlock', 'BatchNorm2d']
Group 3: ['AdaptiveConcatPool2d', 'Lambda', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected.

Other DenseNets have a similar structure, e.g.:

In [9]:
arch_summary(densenet201)

(0) Sequential  : 605 layers (total: 605)
(1) Linear      : 1   layers (total: 606)


In [11]:
arch_summary(lambda _: next(densenet201(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) MaxPool2d   : 1   layers (total: 4)
(4) _DenseBlock : 36  layers (total: 40)
(5) _Transition : 4   layers (total: 44)
(6) _DenseBlock : 72  layers (total: 116)
(7) _Transition : 4   layers (total: 120)
(8) _DenseBlock : 288 layers (total: 408)
(9) _Transition : 4   layers (total: 412)
(10) _DenseBlock : 192 layers (total: 604)
(11) BatchNorm2d : 1   layers (total: 605)


## Inception

We first look at Inception-v4, proposed in "[Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf)".

We wrap the Cadene implementation into the torchvision model API:

In [4]:
def inceptionv4(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.inceptionv4(pretrained=pretrained)
    all_layers = list(model.children())
    return nn.Sequential(*all_layers[0], *all_layers[1:])

It has the following structure: 

In [22]:
arch_summary(inceptionv4)

(0) BasicConv2d : 3   layers (total: 3)
(1) BasicConv2d : 3   layers (total: 6)
(2) BasicConv2d : 3   layers (total: 9)
(3) Mixed_3a    : 4   layers (total: 13)
(4) Mixed_4a    : 18  layers (total: 31)
(5) Mixed_5a    : 4   layers (total: 35)
(6) Inception_A : 22  layers (total: 57)
(7) Inception_A : 22  layers (total: 79)
(8) Inception_A : 22  layers (total: 101)
(9) Inception_A : 22  layers (total: 123)
(10) Reduction_A : 13  layers (total: 136)
(11) Inception_B : 31  layers (total: 167)
(12) Inception_B : 31  layers (total: 198)
(13) Inception_B : 31  layers (total: 229)
(14) Inception_B : 31  layers (total: 260)
(15) Inception_B : 31  layers (total: 291)
(16) Inception_B : 31  layers (total: 322)
(17) Inception_B : 31  layers (total: 353)
(18) Reduction_B : 19  layers (total: 372)
(19) Inception_C : 31  layers (total: 403)
(20) Inception_C : 31  layers (total: 434)
(21) Inception_C : 31  layers (total: 465)
(22) AvgPool2d   : 1   layers (total: 466)
(23) Linear      : 1   layers (total: 467)

For transfer learning, we can cut out the last two layers and add our own custom head depending on the problem.  

To group the layers for discriminative learning rates, we can set (0) ~ (10) as group 1, set (11) ~ (21) as group 2, and our own custom head as group 3. Now we specify the cut and split and create the learner:

In [5]:
learn = cnn_learner(FakeData(), inceptionv4, pretrained=False,
                    cut=-2, split_on=lambda m: (m[0][11], m[1]))

To check that the cut and split work as expected, we extract the groups: 

In [6]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['BasicConv2d', 'BasicConv2d', 'BasicConv2d', 'Mixed_3a', 'Mixed_4a', 'Mixed_5a', 'Inception_A', 'Inception_A', 'Inception_A', 'Inception_A', 'Reduction_A']
Group 2: ['Inception_B', 'Inception_B', 'Inception_B', 'Inception_B', 'Inception_B', 'Inception_B', 'Inception_B', 'Reduction_B', 'Inception_C', 'Inception_C', 'Inception_C']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected.

Alternatively, you can add the cut and split to the model metadata before calling `cnn_learner`: 

In [4]:
_inception_4_meta = { 'cut': -2, 'split': lambda m: (m[0][11], m[1]) }

In [5]:
model_meta[inceptionv4] = _inception_4_meta

Cadene's Inception ResNet V2 has a similar implementation. Define:

In [2]:
def inceptionresnetv2(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.inceptionresnetv2(pretrained=pretrained)
    return nn.Sequential(*model.children())

In [3]:
arch_summary(inceptionresnetv2)

(0) BasicConv2d : 3   layers (total: 3)
(1) BasicConv2d : 3   layers (total: 6)
(2) BasicConv2d : 3   layers (total: 9)
(3) MaxPool2d   : 1   layers (total: 10)
(4) BasicConv2d : 3   layers (total: 13)
(5) BasicConv2d : 3   layers (total: 16)
(6) MaxPool2d   : 1   layers (total: 17)
(7) Mixed_5b    : 22  layers (total: 39)
(8) Sequential  : 200 layers (total: 239)
(9) Mixed_6a    : 13  layers (total: 252)
(10) Sequential  : 280 layers (total: 532)
(11) Mixed_7a    : 22  layers (total: 554)
(12) Sequential  : 126 layers (total: 680)
(13) Block8      : 13  layers (total: 693)
(14) BasicConv2d : 3   layers (total: 696)
(15) AvgPool2d   : 1   layers (total: 697)
(16) Linear      : 1   layers (total: 698)


In [4]:
learn = cnn_learner(FakeData(), inceptionresnetv2, pretrained=False,
                    cut=-2, split_on=lambda m: (m[0][9], m[1]))

In [5]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['BasicConv2d', 'BasicConv2d', 'BasicConv2d', 'MaxPool2d', 'BasicConv2d', 'BasicConv2d', 'MaxPool2d', 'Mixed_5b', 'Sequential']
Group 2: ['Mixed_6a', 'Sequential', 'Mixed_7a', 'Sequential', 'Block8', 'BasicConv2d']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


Although we used them with almost the same code, the underlying structures differ completely between versions of Inception, as the model was redesigned in each generation. Torchvision's implementation of Inception V3 has the following structure: 

In [12]:
arch_summary(inception_v3)

(0) BasicConv2d : 2   layers (total: 2)
(1) BasicConv2d : 2   layers (total: 4)
(2) BasicConv2d : 2   layers (total: 6)
(3) BasicConv2d : 2   layers (total: 8)
(4) BasicConv2d : 2   layers (total: 10)
(5) InceptionA  : 14  layers (total: 24)
(6) InceptionA  : 14  layers (total: 38)
(7) InceptionA  : 14  layers (total: 52)
(8) InceptionB  : 8   layers (total: 60)
(9) InceptionC  : 20  layers (total: 80)
(10) InceptionC  : 20  layers (total: 100)
(11) InceptionC  : 20  layers (total: 120)
(12) InceptionC  : 20  layers (total: 140)
(13) InceptionAux: 5   layers (total: 145)
(14) InceptionD  : 12  layers (total: 157)
(15) InceptionE  : 18  layers (total: 175)
(16) InceptionE  : 18  layers (total: 193)
(17) Linear      : 1   layers (total: 194)


__Note__: some layers are not shown here, e.g., there's a max pooling layer between (2) and (3). This is because the torchvision implementation calls `F.max_pool2d` inside `forward`, and functional calls are not registered as children. So we can't cut and split this model the way we did for the others. 
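
To see why a functional call is invisible to `children()`, here is a heavily simplified sketch. `TinyModule` is a hypothetical stand-in that mimics how `nn.Module` records submodules assigned as attributes; it is not PyTorch code, and `max_pool` plays the role of `F.max_pool2d`:

```python
class TinyModule:
    """Hypothetical, simplified stand-in for nn.Module's child registration."""
    def __init__(self):
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name, value):
        # Remember attributes that are themselves modules.
        if isinstance(value, TinyModule):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def children(self):
        return list(self._children.values())


class Conv(TinyModule):
    def forward(self, x):
        return [v * 2 for v in x]


def max_pool(x):
    """Plain function: it is never assigned to the module, so it is never registered."""
    return [max(x)]


class Net(TinyModule):
    def __init__(self):
        super().__init__()
        self.conv1 = Conv()
        self.conv2 = Conv()

    def forward(self, x):
        x = self.conv1.forward(x)
        x = max_pool(x)               # only exists inside forward
        return self.conv2.forward(x)


net = Net()
print([type(c).__name__ for c in net.children()])   # ['Conv', 'Conv']
print(net.forward([1, 3, 2]))                       # [12]

# Rebuilding the model from its children silently drops the pooling step.
rebuilt = [1, 3, 2]
for child in net.children():
    rebuilt = child.forward(rebuilt)
print(rebuilt)                                      # [4, 12, 8]
```

The rebuilt chain gives a different result from the original `forward`, which is the same problem a children-based cut would run into with Inception V3.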

If you'd really like to use this model, here's a workaround. Because of these `F.max_pool2d` calls, we must pass the model to fastai as a whole; at the same time, we want to drop everything from `F.avg_pool2d` onward. How do you use a model as a whole while keeping only part of it? The only way I can come up with to resolve this contradiction is to monkey patch the `forward` method, copying it without the layers from `F.avg_pool2d` onward:

In [2]:
def forward_inception_v3(self, x):
    x = self.Conv2d_1a_3x3(x)
    x = self.Conv2d_2a_3x3(x)
    x = self.Conv2d_2b_3x3(x)
    x = F.max_pool2d(x, kernel_size=3, stride=2)
    x = self.Conv2d_3b_1x1(x)
    x = self.Conv2d_4a_3x3(x)
    x = F.max_pool2d(x, kernel_size=3, stride=2)
    x = self.Mixed_5b(x)
    x = self.Mixed_5c(x)
    x = self.Mixed_5d(x)
    x = self.Mixed_6a(x)
    x = self.Mixed_6b(x)
    x = self.Mixed_6c(x)        
    x = self.Mixed_6d(x)        
    x = self.Mixed_6e(x)        
    if self.training and self.aux_logits:
        aux = self.AuxLogits(x)
    x = self.Mixed_7a(x)        
    x = self.Mixed_7b(x)        
    x = self.Mixed_7c(x)
    return x

And override it:

In [3]:
Inception3.forward = forward_inception_v3
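
Assigning to the class attribute replaces the method for every instance, including ones that already exist. A minimal pure-Python illustration of that behaviour, with a toy `Model` class standing in for `Inception3`:

```python
class Model:
    def forward(self, x):
        features = x * 2        # stand-in for the convolutional body
        return features + 1     # stand-in for the classification layers


def forward_body_only(self, x):
    """Replacement forward that stops before the classification layers."""
    return x * 2


existing = Model()
original_forward = Model.forward      # keep a reference in case it is needed later
Model.forward = forward_body_only     # patch the class, not a single instance

print(existing.forward(3))   # 6: the existing instance uses the patched method
print(Model().forward(3))    # 6: so do new instances
```

The patch is global for the running process, so keeping a reference to the original function is a cheap way to restore it later.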

Wrap the model into a `nn.Sequential` as a whole:

In [4]:
def inception_v3_cut(pretrained=False):    
    model = inception_v3(pretrained)
    return nn.Sequential(model)

Add the `cut` and `split` to `model_meta`: 

In [5]:
model_meta[inception_v3_cut] =  { 'cut': noop, 
                                  'split': lambda m: (list(m[0][0].children())[9], m[1]) }

Note: it _won't_ work if you specify `cut` and `split` directly in `cnn_learner`, because the `cut` will be overridden by the `_default_meta`. 

Now we can create the learner:

In [6]:
learn = cnn_learner(FakeData(), inception_v3_cut, pretrained=False)

In [7]:
get_groups(nn.Sequential(*learn.model[0][0].children(), *learn.model[1]), learn.layer_groups)

Group 1: ['BasicConv2d', 'BasicConv2d', 'BasicConv2d', 'BasicConv2d', 'BasicConv2d', 'InceptionA', 'InceptionA', 'InceptionA', 'InceptionB']
Group 2: ['InceptionC', 'InceptionC', 'InceptionC', 'InceptionC', 'InceptionAux', 'InceptionD', 'InceptionE', 'InceptionE', 'Linear']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


It works! 

## Wide ResNet

Wide ResNet (WRN) is proposed in "[Wide Residual Networks](https://arxiv.org/pdf/1605.07146.pdf)". 

We take a look at fastai's implementation: 

In [29]:
arch_summary(lambda _: wrn_22())

(0) Sequential  : 54  layers (total: 54)


We look into this `Sequential` module:

In [30]:
arch_summary(lambda _: next(wrn_22().children()))

(0) Conv2d      : 1   layers (total: 1)
(1) BasicBlock  : 6   layers (total: 7)
(2) BasicBlock  : 5   layers (total: 12)
(3) BasicBlock  : 5   layers (total: 17)
(4) BasicBlock  : 6   layers (total: 23)
(5) BasicBlock  : 5   layers (total: 28)
(6) BasicBlock  : 5   layers (total: 33)
(7) BasicBlock  : 6   layers (total: 39)
(8) BasicBlock  : 5   layers (total: 44)
(9) BasicBlock  : 5   layers (total: 49)
(10) BatchNorm2d : 1   layers (total: 50)
(11) ReLU        : 1   layers (total: 51)
(12) AdaptiveAvgPool2d: 1   layers (total: 52)
(13) Flatten     : 1   layers (total: 53)
(14) Linear      : 1   layers (total: 54)


Fastai doesn't expose `wrn_22` through the torchvision model API, probably because there are no pretrained weights for WRN. Without a pretrained model it's not very useful for transfer learning, since we'd have to train it from scratch. But for completeness, we wrap it and test our cut and split:

In [47]:
def w_rn_22(pretrained=False):
    return nn.Sequential(*list(next(wrn_22().children())))

In [48]:
_wrn_22_meta = { 'cut': None, 'split': lambda m: (m[0][5], m[1]) }

In [49]:
model_meta[w_rn_22] = _wrn_22_meta

Now we can create a learner: 

In [50]:
learn = cnn_learner(FakeData(), w_rn_22, pretrained=False)

To check that the cut and split work as expected, we extract the groups: 

In [51]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BasicBlock', 'BasicBlock', 'BasicBlock', 'BasicBlock']
Group 2: ['BasicBlock', 'BasicBlock', 'BasicBlock', 'BasicBlock', 'BasicBlock', 'BatchNorm2d', 'ReLU', 'AdaptiveAvgPool2d', 'Flatten', 'Linear']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


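The split works as expected. Note, however, that Group 2 above still lists the original `AdaptiveAvgPool2d`, `Flatten` and `Linear` layers, which suggests that `'cut': None` did not strip the original head; check this before training. 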

## SqueezeNet

SqueezeNet is proposed in "[SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size](https://arxiv.org/abs/1602.07360)". 

We take a look at the torchvision implementation: 

In [53]:
arch_summary(squeezenet1_0)

(0) Sequential  : 53  layers (total: 53)
(1) Sequential  : 4   layers (total: 57)


The first `Sequential` module contains:

In [54]:
arch_summary(lambda _: next(squeezenet1_0(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) ReLU        : 1   layers (total: 2)
(2) MaxPool2d   : 1   layers (total: 3)
(3) Fire        : 6   layers (total: 9)
(4) Fire        : 6   layers (total: 15)
(5) Fire        : 6   layers (total: 21)
(6) MaxPool2d   : 1   layers (total: 22)
(7) Fire        : 6   layers (total: 28)
(8) Fire        : 6   layers (total: 34)
(9) Fire        : 6   layers (total: 40)
(10) Fire        : 6   layers (total: 46)
(11) MaxPool2d   : 1   layers (total: 47)
(12) Fire        : 6   layers (total: 53)


The second `Sequential` module contains:

In [55]:
arch_summary(lambda _: list(squeezenet1_0(False).children())[1])

(0) Dropout     : 1   layers (total: 1)
(1) Conv2d      : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) AvgPool2d   : 1   layers (total: 4)


So we want to cut the second `Sequential` module out. We can split the first `Sequential` module at layer (7) (_note_: this overrides the split currently defined in the fastai library):

In [56]:
_squeezenet_meta = { 'cut': -1, 'split': lambda m: (m[0][0][7], m[1]) }

In [57]:
model_meta[squeezenet1_0] = _squeezenet_meta

Now we can create a learner: 

In [58]:
learn = cnn_learner(FakeData(), squeezenet1_0, pretrained=False)

To check that the cut and split work as expected, we extract the groups: 

In [59]:
get_groups(nn.Sequential(*learn.model[0][0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'ReLU', 'MaxPool2d', 'Fire', 'Fire', 'Fire', 'MaxPool2d']
Group 2: ['Fire', 'Fire', 'Fire', 'Fire', 'MaxPool2d', 'Fire']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected. 

## Xception

Xception is proposed in "[Xception: Deep Learning with Depthwise Separable Convolutions](https://arxiv.org/pdf/1610.02357.pdf)". We wrap the Cadene implementation in the torchvision model API:

In [2]:
def xception(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.xception(pretrained=pretrained)
    return nn.Sequential(*list(model.children()))

In [17]:
arch_summary(xception)

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) Conv2d      : 1   layers (total: 4)
(4) BatchNorm2d : 1   layers (total: 5)
(5) Block       : 11  layers (total: 16)
(6) Block       : 12  layers (total: 28)
(7) Block       : 12  layers (total: 40)
(8) Block       : 12  layers (total: 52)
(9) Block       : 12  layers (total: 64)
(10) Block       : 12  layers (total: 76)
(11) Block       : 12  layers (total: 88)
(12) Block       : 12  layers (total: 100)
(13) Block       : 12  layers (total: 112)
(14) Block       : 12  layers (total: 124)
(15) Block       : 12  layers (total: 136)
(16) Block       : 12  layers (total: 148)
(17) SeparableConv2d: 2   layers (total: 150)
(18) BatchNorm2d : 1   layers (total: 151)
(19) SeparableConv2d: 2   layers (total: 153)
(20) BatchNorm2d : 1   layers (total: 154)
(21) Linear      : 1   layers (total: 155)


We can cut out the last layer, and split at (11):

In [10]:
learn = cnn_learner(FakeData(), xception, pretrained=False,
                    cut=-1, split_on=lambda m: (m[0][11], m[1]))

Alternatively, you can add the cut and split to the model metadata before calling `cnn_learner`: 

In [5]:
_xception_meta = { 'cut': -1, 'split': lambda m: (m[0][11], m[1]) }

In [6]:
model_meta[xception] = _xception_meta

To check that the cut and split work as expected, we extract the groups: 

In [11]:
get_groups(nn.Sequential(*learn.model[0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'Block', 'Block', 'Block', 'Block', 'Block', 'Block']
Group 2: ['Block', 'Block', 'Block', 'Block', 'Block', 'Block', 'SeparableConv2d', 'BatchNorm2d', 'SeparableConv2d', 'BatchNorm2d']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected. 
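To see what `cut` and the split index amount to, here is a minimal sketch in plain PyTorch. It is not fastai's implementation: `toy_arch`, `cut_body` and `split_groups` are hypothetical names, and the toy network only stands in for a flattened pretrained model.

```python
import torch
from torch import nn


def toy_arch():
    # Hypothetical stand-in for a flattened pretrained model.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
        nn.Conv2d(8, 16, kernel_size=3, padding=1),
        nn.BatchNorm2d(16),
        nn.Linear(16, 10),  # original classifier, to be cut off
    )


def cut_body(model, cut):
    # Keep every top-level child before index `cut`.
    return nn.Sequential(*list(model.children())[:cut])


def split_groups(body, idx):
    # Everything before `idx` goes to group 1, the rest to group 2.
    layers = list(body.children())
    return layers[:idx], layers[idx:]


body = cut_body(toy_arch(), cut=-1)
group1, group2 = split_groups(body, idx=3)

print([type(layer).__name__ for layer in group1])
print([type(layer).__name__ for layer in group2])
print(body(torch.randn(1, 3, 8, 8)).shape)
```

With `cut=-1` only the final `Linear` is dropped, and the layer at the split index becomes the first layer of the second group.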

## DPN

Dual Path Networks (DPN) were proposed in "[Dual Path Networks](https://arxiv.org/abs/1707.01629)". We wrap the Cadene implementation so that it follows the torchvision model API:

In [7]:
def dpn92(pretrained=False):
    pretrained = 'imagenet+5k' if pretrained else None
    model = pretrainedmodels.dpn92(pretrained=pretrained)
    return nn.Sequential(*list(model.children()))

In [8]:
arch_summary(dpn92)

(0) Sequential  : 288 layers (total: 288)
(1) Conv2d      : 1   layers (total: 289)


The first `Sequential` contains:

In [9]:
arch_summary(lambda _: next(dpn92(False).children()))

(0) InputBlock  : 4   layers (total: 4)
(1) DualPathBlock: 12  layers (total: 16)
(2) DualPathBlock: 9   layers (total: 25)
(3) DualPathBlock: 9   layers (total: 34)
(4) DualPathBlock: 12  layers (total: 46)
(5) DualPathBlock: 9   layers (total: 55)
(6) DualPathBlock: 9   layers (total: 64)
(7) DualPathBlock: 9   layers (total: 73)
(8) DualPathBlock: 12  layers (total: 85)
(9) DualPathBlock: 9   layers (total: 94)
(10) DualPathBlock: 9   layers (total: 103)
(11) DualPathBlock: 9   layers (total: 112)
(12) DualPathBlock: 9   layers (total: 121)
(13) DualPathBlock: 9   layers (total: 130)
(14) DualPathBlock: 9   layers (total: 139)
(15) DualPathBlock: 9   layers (total: 148)
(16) DualPathBlock: 9   layers (total: 157)
(17) DualPathBlock: 9   layers (total: 166)
(18) DualPathBlock: 9   layers (total: 175)
(19) DualPathBlock: 9   layers (total: 184)
(20) DualPathBlock: 9   layers (total: 193)
(21) DualPathBlock: 9   layers (total: 202)
(22) DualPathBlock: 9   layers (total: 211)
(23) DualP

There are 30 `DualPathBlock`s, so we can split them into two groups of 15 by splitting at index (16):

In [11]:
learn = cnn_learner(FakeData(), dpn92, pretrained=False,
                    cut=-1, split_on=lambda m: (m[0][0][16], m[1]))

To check that the cut and split work as expected, we extract the layer groups:

In [12]:
get_groups(nn.Sequential(*learn.model[0][0], *learn.model[1]), learn.layer_groups)

Group 1: ['InputBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock']
Group 2: ['DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'DualPathBlock', 'CatBnAct']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The customization works as expected. 

Alternatively, you can add the cut and split to the model metadata before calling `cnn_learner`:

In [124]:
_dpn_meta = {'cut': -1, 'split': lambda m: (m[0][0][16], m[1]) }

In [125]:
model_meta[dpn92] = _dpn_meta

## NASNetAMobile

NASNet was proposed in "[Learning Transferable Architectures for Scalable Image Recognition](https://arxiv.org/abs/1707.07012)". We wrap the Cadene implementation so that it follows the torchvision model API:

In [2]:
def identity(x): return x

def nasnetamobile(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.nasnetamobile(pretrained=pretrained, num_classes=1000)  
    model.logits = identity
    return nn.Sequential(model)

Note that `logits` is replaced by an identity function because fastai will later supply its own head in place of this part, which contains the last few layers. Also, if you use a lambda function for the identity, the model won't be picklable.
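The picklability point can be checked with the standard library alone. The sketch below is independent of any model; `can_pickle` is a hypothetical helper written for this illustration.

```python
import pickle


def identity(x):
    # Module-level function: pickle stores it by reference to its name.
    return x


def can_pickle(obj):
    # Return True if `obj` can be serialized by pickle, False otherwise.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print(can_pickle(identity))     # named function
print(can_pickle(lambda x: x))  # anonymous function
```

A lambda has no importable name for pickle to refer to, which is why a named function is used for `identity`.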

In [9]:
arch_summary(lambda _: nasnetamobile(False)[0])

(0) Sequential  : 2   layers (total: 2)
(1) CellStem0   : 47  layers (total: 49)
(2) CellStem1   : 57  layers (total: 106)
(3) FirstCell   : 53  layers (total: 159)
(4) NormalCell  : 49  layers (total: 208)
(5) NormalCell  : 49  layers (total: 257)
(6) NormalCell  : 49  layers (total: 306)
(7) ReductionCell0: 58  layers (total: 364)
(8) FirstCell   : 53  layers (total: 417)
(9) NormalCell  : 49  layers (total: 466)
(10) NormalCell  : 49  layers (total: 515)
(11) NormalCell  : 49  layers (total: 564)
(12) ReductionCell1: 53  layers (total: 617)
(13) FirstCell   : 53  layers (total: 670)
(14) NormalCell  : 49  layers (total: 719)
(15) NormalCell  : 49  layers (total: 768)
(16) NormalCell  : 49  layers (total: 817)
(17) ReLU        : 1   layers (total: 818)
(18) AvgPool2d   : 1   layers (total: 819)
(19) Dropout     : 1   layers (total: 820)
(20) Linear      : 1   layers (total: 821)
(21) Identity    : 1   layers (total: 822)


We need to set `cut` to `noop` because we have already replaced the last few layers (we can't cut directly, for reasons explained in the note at the end of this section). For the learning-rate groups, layer (8) is a good point to split. So we set:

In [3]:
model_meta[nasnetamobile] =  { 'cut': noop, 
                               'split': lambda m: (list(m[0][0].children())[8], m[1]) }

Now we can create the learner:

In [4]:
learn = cnn_learner(FakeData(), nasnetamobile, pretrained=False)

In [5]:
get_groups(nn.Sequential(*learn.model[0][0].children(), *learn.model[1]), learn.layer_groups)

Group 1: ['Sequential', 'CellStem0', 'CellStem1', 'FirstCell', 'NormalCell', 'NormalCell', 'NormalCell', 'ReductionCell0']
Group 2: ['FirstCell', 'NormalCell', 'NormalCell', 'NormalCell', 'ReductionCell1', 'FirstCell', 'NormalCell', 'NormalCell', 'NormalCell', 'ReLU', 'AvgPool2d', 'Dropout', 'Linear']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The groups for different learning rates are indeed split at layer (8). 

__NOTE__: Why replace `logits` rather than just set `cut=-1` as usual? With `cut=-1`, `cnn_learner` converts the model into:

`nn.Sequential(*list(model.children())[:cut])`

This new model doesn't work, because `nn.Sequential` passes a single value from one module to the next, so each module's `forward` must take exactly one argument. In this case a few layers, such as `CellStem1`, take two arguments in `forward`. So we can't flatten the model; we must use it as a whole.
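The failure is easy to reproduce on a toy example. In the sketch below, `TwoInputCell` and `WholeNet` are hypothetical classes that only imitate the situation: a cell that needs two inputs, inside a network whose hand-written `forward` provides them.

```python
import torch
from torch import nn


class TwoInputCell(nn.Module):
    # Hypothetical cell that, like `CellStem1`, needs two inputs.
    def forward(self, x, x_prev):
        return x + x_prev


class WholeNet(nn.Module):
    # The hand-written forward routes two tensors to the cell.
    def __init__(self):
        super().__init__()
        self.stem = nn.Identity()
        self.cell = TwoInputCell()

    def forward(self, x):
        x_stem = self.stem(x)
        return self.cell(x_stem, x)


net = WholeNet()
x = torch.ones(2)
print(net(x))  # the model as a whole works

# Flattening the children into nn.Sequential loses the custom routing.
flattened = nn.Sequential(*list(net.children()))
try:
    flattened(x)
    sequential_works = True
except TypeError as err:
    sequential_works = False
    print("nn.Sequential failed:", err)
```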

If we use the model as a whole, how can we cut the last few layers? My solution is to replace `logits` with an identity function, which seems to work fine.
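Here is a small sketch of that trick on a toy model. `ToyNet` is hypothetical. It only assumes the same layout as the wrapped model, where `forward` calls a `features` method followed by a `logits` method.

```python
import torch
from torch import nn


def identity(x):
    return x


class ToyNet(nn.Module):
    # Hypothetical model using a features/logits layout.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.last_linear = nn.Linear(4, 10)

    def features(self, x):
        return self.conv(x)

    def logits(self, feats):
        pooled = feats.mean(dim=(2, 3))  # global average pooling
        return self.last_linear(pooled)

    def forward(self, x):
        return self.logits(self.features(x))


model = ToyNet()
x = torch.randn(1, 3, 8, 8)

before = model(x).shape   # class scores from the original head
model.logits = identity   # instance attribute shadows the method
after = model(x).shape    # raw feature map, ready for a new head

print(before, after)
```

After the replacement the model returns the feature map instead of class scores, so a new head can be attached on top.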

Even though we use the model as a whole, we can still set different learning rates. In `split`, `m[0][0]` is the `NASNetAMobile` object (it's wrapped by two `nn.Sequential`s, each with a single layer). It can't be indexed directly, so we need to make a list of its children. This is fine here because `split` is only used to assign learning-rate groups; it never calls `forward`.
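The indexing detail can be illustrated with a toy module. `Body` below is a hypothetical non-`Sequential` network wrapped the same way, in two single-layer `nn.Sequential` containers.

```python
import torch
from torch import nn


class Body(nn.Module):
    # Hypothetical body that is a plain nn.Module, not an nn.Sequential.
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.cell_a = nn.Conv2d(4, 4, kernel_size=3, padding=1)
        self.cell_b = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return self.cell_b(self.cell_a(self.stem(x)))


# Same nesting as in the text: m[0][0] is the wrapped model itself.
m = nn.Sequential(nn.Sequential(Body()), nn.Sequential(nn.ReLU()))
body = m[0][0]

try:
    body[1]  # a plain nn.Module has no __getitem__
    indexable = True
except TypeError:
    indexable = False

# Children come back in registration order, so they can be indexed as a list.
children = list(body.children())
split_point = children[2]

print(indexable, split_point is body.cell_b)
```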

A final note: this model doesn't seem to work for small images (smaller than 64x64). The original model was apparently designed for ImageNet-sized inputs; although we replaced the head with one that uses adaptive pooling, some parts of the model body still fail on small images.

## PNASNet-5

PNASNet-5 was proposed in "[Progressive Neural Architecture Search](https://arxiv.org/abs/1712.00559)". We wrap the Cadene implementation so that it follows the torchvision model API:

In [15]:
def identity(x): return x

def pnasnet5large(pretrained=False):    
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.pnasnet5large(pretrained=pretrained, num_classes=1000) 
    model.logits = identity
    return nn.Sequential(model)

Note that `logits` is replaced by an identity function; see the "NASNetAMobile" section for the explanation. As above, if you use a lambda function for the identity, the model won't be picklable.

In [16]:
arch_summary(lambda _: pnasnet5large(False)[0])

(0) Sequential  : 2   layers (total: 2)
(1) CellStem0   : 59  layers (total: 61)
(2) Cell        : 64  layers (total: 125)
(3) Cell        : 61  layers (total: 186)
(4) Cell        : 57  layers (total: 243)
(5) Cell        : 57  layers (total: 300)
(6) Cell        : 57  layers (total: 357)
(7) Cell        : 68  layers (total: 425)
(8) Cell        : 61  layers (total: 486)
(9) Cell        : 57  layers (total: 543)
(10) Cell        : 57  layers (total: 600)
(11) Cell        : 60  layers (total: 660)
(12) Cell        : 61  layers (total: 721)
(13) Cell        : 57  layers (total: 778)
(14) Cell        : 57  layers (total: 835)
(15) ReLU        : 1   layers (total: 836)
(16) AvgPool2d   : 1   layers (total: 837)
(17) Dropout     : 1   layers (total: 838)
(18) Linear      : 1   layers (total: 839)


We need to set the `cut` to `noop` as we have already replaced the last few layers (we can't cut directly, see the "NASNetAMobile" section for more explanations). For learning rates, layer (8) is a good point to split. So we set:

In [17]:
model_meta[pnasnet5large] =  { 'cut': noop, 
                               'split': lambda m: (list(m[0][0].children())[8], m[1]) }

Now we can create the learner:

In [18]:
learn = cnn_learner(FakeData(), pnasnet5large, pretrained=False)

In [19]:
get_groups(nn.Sequential(*learn.model[0][0].children(), *learn.model[1]), learn.layer_groups)

Group 1: ['Sequential', 'CellStem0', 'Cell', 'Cell', 'Cell', 'Cell', 'Cell', 'Cell']
Group 2: ['Cell', 'Cell', 'Cell', 'Cell', 'Cell', 'Cell', 'Cell', 'ReLU', 'AvgPool2d', 'Dropout', 'Linear']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


The groups for different learning rates are indeed split at layer (8).

Again, we did something special here: we replaced `logits` rather than just setting `cut=-1`. See the "NASNetAMobile" section for the explanation. For the same reason, this model doesn't work for small images (smaller than 64x64).

## VGG

VGG was proposed in "[Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)". In the torchvision implementation, the VGG-16 model has the following structure:

In [14]:
arch_summary(vgg16_bn)

(0) Sequential  : 44  layers (total: 44)
(1) Sequential  : 7   layers (total: 51)


The first `Sequential` has:

In [15]:
arch_summary(lambda _: next(vgg16_bn(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) Conv2d      : 1   layers (total: 4)
(4) BatchNorm2d : 1   layers (total: 5)
(5) ReLU        : 1   layers (total: 6)
(6) MaxPool2d   : 1   layers (total: 7)
(7) Conv2d      : 1   layers (total: 8)
(8) BatchNorm2d : 1   layers (total: 9)
(9) ReLU        : 1   layers (total: 10)
(10) Conv2d      : 1   layers (total: 11)
(11) BatchNorm2d : 1   layers (total: 12)
(12) ReLU        : 1   layers (total: 13)
(13) MaxPool2d   : 1   layers (total: 14)
(14) Conv2d      : 1   layers (total: 15)
(15) BatchNorm2d : 1   layers (total: 16)
(16) ReLU        : 1   layers (total: 17)
(17) Conv2d      : 1   layers (total: 18)
(18) BatchNorm2d : 1   layers (total: 19)
(19) ReLU        : 1   layers (total: 20)
(20) Conv2d      : 1   layers (total: 21)
(21) BatchNorm2d : 1   layers (total: 22)
(22) ReLU        : 1   layers (total: 23)
(23) MaxPool2d   : 1   layers (total: 24)
(24) Conv2d

The second `Sequential` has:

In [16]:
arch_summary(lambda _: list(vgg16_bn(False).children())[1])

(0) Linear      : 1   layers (total: 1)
(1) ReLU        : 1   layers (total: 2)
(2) Dropout     : 1   layers (total: 3)
(3) Linear      : 1   layers (total: 4)
(4) ReLU        : 1   layers (total: 5)
(5) Dropout     : 1   layers (total: 6)
(6) Linear      : 1   layers (total: 7)


So we'll cut the second `Sequential` out, and split the first `Sequential` at (22):

```
{'cut':-1, 'split':_vgg_split}
```

where:
```
def _vgg_split(m:nn.Module): return (m[0][0][22], m[1])
```

Now we can create a learner: 

In [22]:
learn = cnn_learner(FakeData(), vgg16_bn, pretrained=False)

To check that the cut and split work as expected, we extract the layer groups:

In [23]:
get_groups(nn.Sequential(*learn.model[0][0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d']
Group 2: ['ReLU', 'MaxPool2d', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'Conv2d', 'BatchNorm2d', 'ReLU', 'MaxPool2d']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']


VGG-19 has a similar structure:

In [24]:
arch_summary(vgg19_bn)

(0) Sequential  : 53  layers (total: 53)
(1) Sequential  : 7   layers (total: 60)


In [25]:
arch_summary(lambda _: next(vgg19_bn(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) Conv2d      : 1   layers (total: 4)
(4) BatchNorm2d : 1   layers (total: 5)
(5) ReLU        : 1   layers (total: 6)
(6) MaxPool2d   : 1   layers (total: 7)
(7) Conv2d      : 1   layers (total: 8)
(8) BatchNorm2d : 1   layers (total: 9)
(9) ReLU        : 1   layers (total: 10)
(10) Conv2d      : 1   layers (total: 11)
(11) BatchNorm2d : 1   layers (total: 12)
(12) ReLU        : 1   layers (total: 13)
(13) MaxPool2d   : 1   layers (total: 14)
(14) Conv2d      : 1   layers (total: 15)
(15) BatchNorm2d : 1   layers (total: 16)
(16) ReLU        : 1   layers (total: 17)
(17) Conv2d      : 1   layers (total: 18)
(18) BatchNorm2d : 1   layers (total: 19)
(19) ReLU        : 1   layers (total: 20)
(20) Conv2d      : 1   layers (total: 21)
(21) BatchNorm2d : 1   layers (total: 22)
(22) ReLU        : 1   layers (total: 23)
(23) Conv2d      : 1   layers (total: 24)
(24) BatchN

The first `Sequential` has more layers than in VGG-16, but we can still split it at (22); the exact split point doesn't matter that much.

In [26]:
arch_summary(lambda _: list(vgg19_bn(False).children())[1])

(0) Linear      : 1   layers (total: 1)
(1) ReLU        : 1   layers (total: 2)
(2) Dropout     : 1   layers (total: 3)
(3) Linear      : 1   layers (total: 4)
(4) ReLU        : 1   layers (total: 5)
(5) Dropout     : 1   layers (total: 6)
(6) Linear      : 1   layers (total: 7)


## AlexNet

AlexNet was proposed in "[One weird trick for parallelizing convolutional neural networks](https://arxiv.org/abs/1404.5997)". In the torchvision implementation, the model has the following structure:

In [2]:
arch_summary(alexnet)

(0) Sequential  : 13  layers (total: 13)
(1) Sequential  : 7   layers (total: 20)


The first `Sequential` has:

In [3]:
arch_summary(lambda _: next(alexnet(False).children()))

(0) Conv2d      : 1   layers (total: 1)
(1) ReLU        : 1   layers (total: 2)
(2) MaxPool2d   : 1   layers (total: 3)
(3) Conv2d      : 1   layers (total: 4)
(4) ReLU        : 1   layers (total: 5)
(5) MaxPool2d   : 1   layers (total: 6)
(6) Conv2d      : 1   layers (total: 7)
(7) ReLU        : 1   layers (total: 8)
(8) Conv2d      : 1   layers (total: 9)
(9) ReLU        : 1   layers (total: 10)
(10) Conv2d      : 1   layers (total: 11)
(11) ReLU        : 1   layers (total: 12)
(12) MaxPool2d   : 1   layers (total: 13)


The second `Sequential` has:

In [4]:
arch_summary(lambda _: list(alexnet(False).children())[1])

(0) Dropout     : 1   layers (total: 1)
(1) Linear      : 1   layers (total: 2)
(2) ReLU        : 1   layers (total: 3)
(3) Dropout     : 1   layers (total: 4)
(4) Linear      : 1   layers (total: 5)
(5) ReLU        : 1   layers (total: 6)
(6) Linear      : 1   layers (total: 7)


So we'll cut the second `Sequential` out, and split the first `Sequential` at (6):

```
{'cut':-1, 'split':_alexnet_split}
```

where:
```
def _alexnet_split(m:nn.Module): return (m[0][0][6], m[1])
```

Now we can create a learner: 

In [5]:
learn = cnn_learner(FakeData(), alexnet, pretrained=False)

To check that the cut and split work as expected, we extract the layer groups:

In [6]:
get_groups(nn.Sequential(*learn.model[0][0], *learn.model[1]), learn.layer_groups)

Group 1: ['Conv2d', 'ReLU', 'MaxPool2d', 'Conv2d', 'ReLU', 'MaxPool2d']
Group 2: ['Conv2d', 'ReLU', 'Conv2d', 'ReLU', 'Conv2d', 'ReLU', 'MaxPool2d']
Group 3: ['AdaptiveConcatPool2d', 'Flatten', 'BatchNorm1d', 'Dropout', 'Linear', 'ReLU', 'BatchNorm1d', 'Dropout', 'Linear']
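The three groups exist so that each can be trained with its own learning rate. As a rough illustration in plain PyTorch (not fastai's own code), the sketch below gives three hypothetical groups their own rate through optimizer parameter groups. The layer sizes and learning rates are arbitrary.

```python
import torch
from torch import nn

# Hypothetical layer groups: early layers, later layers, and the new head.
group1 = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
group2 = nn.Sequential(nn.Conv2d(8, 16, kernel_size=3), nn.ReLU())
head = nn.Linear(16, 2)

# Smaller rates for pretrained layers, a larger one for the fresh head.
lrs = [1e-5, 1e-4, 1e-3]

optimizer = torch.optim.SGD(
    [
        {"params": group.parameters(), "lr": lr}
        for group, lr in zip([group1, group2, head], lrs)
    ],
    lr=1e-3,  # default, overridden by each group's own "lr"
    momentum=0.9,
)

print([pg["lr"] for pg in optimizer.param_groups])
```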


## EfficientNet

EfficientNet was proposed in "[EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946)". This [PyTorch implementation](https://github.com/lukemelas/EfficientNet-PyTorch) is used here.

Import the library:

In [2]:
from efficientnet_pytorch import EfficientNet

Take a look at the basic structures:

In [None]:
m = EfficientNet.from_pretrained('efficientnet-b0')

In [9]:
arch_summary(lambda _: m)

(0) Conv2dSamePadding: 1   layers (total: 1)
(1) BatchNorm2d : 1   layers (total: 2)
(2) ModuleList  : 126 layers (total: 128)
(3) Conv2dSamePadding: 1   layers (total: 129)
(4) BatchNorm2d : 1   layers (total: 130)
(5) Linear      : 1   layers (total: 131)


In [10]:
arch_summary(lambda _: list(m.children())[2])

(0) MBConvBlock : 6   layers (total: 6)
(1) MBConvBlock : 8   layers (total: 14)
(2) MBConvBlock : 8   layers (total: 22)
(3) MBConvBlock : 8   layers (total: 30)
(4) MBConvBlock : 8   layers (total: 38)
(5) MBConvBlock : 8   layers (total: 46)
(6) MBConvBlock : 8   layers (total: 54)
(7) MBConvBlock : 8   layers (total: 62)
(8) MBConvBlock : 8   layers (total: 70)
(9) MBConvBlock : 8   layers (total: 78)
(10) MBConvBlock : 8   layers (total: 86)
(11) MBConvBlock : 8   layers (total: 94)
(12) MBConvBlock : 8   layers (total: 102)
(13) MBConvBlock : 8   layers (total: 110)
(14) MBConvBlock : 8   layers (total: 118)
(15) MBConvBlock : 8   layers (total: 126)


Wrap it in a function:

In [3]:
# "pretrained" is hardcoded to adapt to the PyTorch model function
def efficient_net_b0(pretrained=True):
    model = EfficientNet.from_pretrained('efficientnet-b0')
    return nn.Sequential(model)

Note that it's not very easy to initialize a non-pretrained instance (and we'll want to use the pretrained weights anyway), so the `pretrained` parameter is ignored and the pretrained weights are always loaded. The `efficient_net_b0` wrapper adds nothing to the model itself; it only makes the model easier to feed to fastai's `cnn_learner`.

We need to set `cut` to `noop`; see the "NASNetAMobile" section for the explanation. For learning rates, I divided the 16 `MBConvBlock` blocks into two parts (blocks 0 to 6 and blocks 7 to 15).

In [4]:
model_meta[efficient_net_b0] =  { 'cut': noop, 
                               'split': lambda m: (list(m[0][0].children())[2][7], m[1]) }

In the original model, the final output size is 1000:

In [5]:
output_size = list(efficient_net_b0()[0].children())[-1].out_features

Loaded pretrained weights for efficientnet-b0


In [19]:
output_size

1000

The model already applies adaptive pooling internally, so we don't want to add it again. Here I'll just add a custom head that adapts the output size to our problem (two classes in this case):

In [15]:
data = FakeData()
custom_head = nn.Linear(output_size, data.c)

In [16]:
learn = cnn_learner(data, efficient_net_b0, custom_head = custom_head)

Loaded pretrained weights for efficientnet-b0


Check the learning rate groups:

In [17]:
get_groups(nn.Sequential(*list(learn.model[0][0].children())[:2], 
                         *list(learn.model[0][0].children())[2],
                         *list(learn.model[0][0].children())[3:],
                         learn.model[1]), 
           learn.layer_groups)

Group 1: ['Conv2dSamePadding', 'BatchNorm2d', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock']
Group 2: ['MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'MBConvBlock', 'Conv2dSamePadding', 'BatchNorm2d', 'Linear']
Group 3: ['Linear']


It works as expected. Note that we can add more layers to group 3 by putting them into the `custom_head`.
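As one possible example of a larger head, the sketch below stacks a few layers on top of the 1000-dimensional output. The hidden size of 256 and the dropout rate are arbitrary choices for illustration, and `n_classes = 2` simply mirrors the two-class example above.

```python
import torch
from torch import nn

output_size = 1000  # output size of the original model's final Linear layer
n_classes = 2       # assumption: two classes, as in the example above

# Hypothetical head: the layer choices are illustrative, not tuned.
custom_head = nn.Sequential(
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(output_size, 256),
    nn.ReLU(),
    nn.Linear(256, n_classes),
)

# Quick shape check with random inputs standing in for the body's output.
custom_head.eval()
with torch.no_grad():
    out = custom_head(torch.randn(4, output_size))

print(out.shape)
```

A head like this would be passed to `cnn_learner` through the same `custom_head` argument used above.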

Test with a random image:

In [18]:
with torch.no_grad():
    learn.model.eval()
    print(learn.model(torch.randn(1,3,96,96)))

tensor([[-2.5571,  0.0241]])


## _To Be Continued..._