
EfficientNet

CNN models improve their ability to classify images in one of three ways: by increasing the depth of the
network, by increasing the resolution of the input images to capture finer details, or by increasing the
width of the network, that is, the number of channels. For instance, the ResNet family, from ResNet-18 to
ResNet-152, scales up by increasing depth.

There is a limit to how far each of these factors can be pushed, and the computational cost grows along
the way. To overcome these challenges, researchers introduced the concept of compound scaling, which scales
all three factors moderately and leads us to EfficientNet.

EfficientNet scales all three factors (depth, width, and resolution), but by how much should each be scaled?
We could scale each factor equally, but this would not work well if the task requires fine-grained
estimation and therefore needs more depth.

Complex CNN architectures are built from multiple conv blocks, and each block must be consistent with the
previous and next blocks, so the layers within a block are scaled evenly.

EfficientNet-B0 Architecture

* Basic ConvNet Block (AlexNet)
* Inverted Residual (MobileNetV2)
* Squeeze and Excitation Block (Squeeze and Excitation Network)

EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all
dimensions of depth/width/resolution using a compound coefficient. Unlike conventional practice, which
scales these factors arbitrarily, the EfficientNet scaling method scales network width, depth, and resolution
with a set of fixed scaling coefficients. For example, if we want to use 2^N times more computational resources,
we can simply increase the network depth by alpha^N, the width by beta^N, and the image size by gamma^N, where
alpha, beta, and gamma are constant coefficients determined by a small grid search on the original small model.
EfficientNet uses a compound coefficient phi to uniformly scale network width, depth, and resolution in a
principled way.
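
To make the arithmetic concrete, here is a minimal sketch of compound scaling in plain Python. It assumes the constants reported in the EfficientNet paper (alpha = 1.2, beta = 1.1, gamma = 1.15); gamma is not used anywhere else in this notebook, and the helper names `compound_scale` and `flops_multiplier` are made up for illustration. The sketch also relies on the paper's approximation that FLOPs grow linearly with depth and quadratically with width and resolution, so alpha * beta^2 * gamma^2 is chosen to be roughly 2.

```python
# Constants assumed from the EfficientNet paper's grid search on the B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(
        f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
        f"resolution x{resolution:.2f}, FLOPs x{flops_multiplier(phi):.2f}"
    )
```

With these constants, each unit increase of phi multiplies the estimated FLOPs by about 1.92, which is close to the factor of 2 described above.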

The compound scaling method is justified by the intuition that if the input image is bigger, then the network
needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on
the bigger image.

The base EfficientNet-B0 network is based on the inverted bottleneck residual blocks of MobileNetV2, in addition
to squeeze-and-excitation blocks.

EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%),
and 3 other transfer learning datasets, with an order of magnitude fewer parameters.

Interesting Stuff:

The most interesting part of EfficientNet-B0 is that the baseline architecture is designed by Neural
Architecture Search (NAS). NAS is a wide topic that cannot feasibly be covered here. We can simply
think of it as searching through the architecture space for an underlying base architecture, such as ResNet
or any other architecture for that matter. On top of that, we can use a grid search to find the scale
factors for depth, width, and resolution. Combining NAS with compound scaling leads us to EfficientNet.
Models are evaluated by comparing accuracy against the number of FLOPs (floating-point operations), which
measures computational cost rather than speed.

Recommended Reading for NAS: https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html

In [158]:
import torch
import torch.nn as nn
import math
import torch.nn.functional as F

In [159]:
base_model = [
    # expand_ratio, channels, repeats, stride, kernel_size
    [1, 16, 1, 1, 3],
    [6, 24, 2, 2, 3],
    [6, 40, 2, 2, 5],
    [6, 80, 3, 2, 3],
    [6, 112, 3, 1, 5],
    [6, 192, 4, 2, 5],
    [6, 320, 1, 1, 3],
]

phi_values = {
    # tuple of: (phi_value, resolution, drop_rate)
    "b0": (0, 224, 0.2),  # depth_factor = alpha ** phi, width_factor = beta ** phi
    "b1": (0.5, 240, 0.2),
    "b2": (1, 260, 0.3),
    "b3": (2, 300, 0.3),
    "b4": (3, 380, 0.4),
    "b5": (4, 456, 0.4),
    "b6": (5, 528, 0.5),
    "b7": (6, 600, 0.5),
}

The `swish` activation function is available in PyTorch as `nn.SiLU`.
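
As a quick sanity check of that equivalence, the sketch below compares a hand-written swish, `x * sigmoid(x)`, with `nn.SiLU`. The `swish` helper is defined here only for illustration.

```python
import torch
import torch.nn as nn


def swish(x):
    # Swish with beta = 1: the input gated by its own sigmoid.
    return x * torch.sigmoid(x)


x = torch.linspace(-5.0, 5.0, steps=11)
matches = torch.allclose(swish(x), nn.SiLU()(x), atol=1e-6)
print(matches)
```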

In [160]:
class CNNBlock(nn.Module):
    def __init__(self,in_channels,out_channels,kernel_size,stride,padding,groups = 1):
        super(CNNBlock,self).__init__()
        self.cnn = nn.Conv2d(in_channels,out_channels,kernel_size,stride,padding,groups = groups,bias = False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.silu = nn.SiLU()

    def forward(self,x):
        return self.silu(self.bn(self.cnn(x)))
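
The `groups` argument is what turns `CNNBlock` into a depthwise convolution when `groups` equals the number of channels. The sketch below, using plain `nn.Conv2d` layers and a hypothetical `count_params` helper, compares the weight count of a regular and a depthwise 3x3 convolution on 32 channels.

```python
import torch
import torch.nn as nn


def count_params(module):
    # Total number of learnable values in the module.
    return sum(p.numel() for p in module.parameters())


channels = 32
regular = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(
    channels, channels, kernel_size=3, padding=1, groups=channels, bias=False
)

x = torch.randn(1, channels, 56, 56)
same_shape = regular(x).shape == depthwise(x).shape

print(same_shape)
print(count_params(regular))    # 32 * 32 * 3 * 3 weights
print(count_params(depthwise))  # 32 * 1 * 3 * 3 weights
```

Both layers produce the same output shape, but the depthwise variant filters each channel separately and therefore needs far fewer weights.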

In [161]:
class SqueezeAndExcitation(nn.Module):
    def __init__(self,in_channels,reduced_dim):
        super(SqueezeAndExcitation,self).__init__()

        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), # C x H x W -> C x 1 x 1
            nn.Conv2d(in_channels,reduced_dim,1),
            nn.SiLU(),
            nn.Conv2d(reduced_dim,in_channels,1),
            nn.Sigmoid()
        )
    def forward(self,x):
        return x * self.se(x)
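
To see what the block does to tensor shapes, here is a rough step-by-step sketch of the same squeeze-and-excitation computation written with standalone layers. The sizes (32 channels reduced to 8) and the variable names are chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

channels, reduced = 32, 8
squeeze = nn.AdaptiveAvgPool2d(1)                           # N x C x H x W -> N x C x 1 x 1
reduce_conv = nn.Conv2d(channels, reduced, kernel_size=1)   # bottleneck
expand_conv = nn.Conv2d(reduced, channels, kernel_size=1)   # back to C channels

x = torch.randn(2, channels, 16, 16)
gates = torch.sigmoid(expand_conv(F.silu(reduce_conv(squeeze(x)))))
out = x * gates  # one gate per channel, broadcast over H and W

n_params = sum(p.numel() for m in (reduce_conv, expand_conv) for p in m.parameters())
print(gates.shape, out.shape, n_params)
```

Each channel gets a single gate value between 0 and 1, so the block reweights channels without changing the spatial size of the feature map.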

In [162]:
class InvertedResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        expand_ratio,
        reduction=4,  # squeeze excitation
        survival_prob=0.8,  # for stochastic depth
    ):
        super(InvertedResidualBlock, self).__init__()
        self.survival_prob = survival_prob  # use the argument rather than a hardcoded value
        self.use_residual = in_channels == out_channels and stride == 1
        hidden_dim = in_channels * expand_ratio
        self.expand = in_channels != hidden_dim
        reduced_dim = int(in_channels / reduction)

        if self.expand:
            self.expand_conv = CNNBlock(
                in_channels,
                hidden_dim,
                kernel_size=1,
                stride=1,
                padding=0,
            )

        self.conv = nn.Sequential(
            CNNBlock(
                hidden_dim,
                hidden_dim,
                kernel_size,
                stride,
                padding,
                groups=hidden_dim,
            ),
            SqueezeAndExcitation(hidden_dim, reduced_dim),
            nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def stochastic_depth(self, x):
        if not self.training:
            return x

        binary_tensor = (
            torch.rand(x.shape[0], 1, 1, 1, device=x.device) < self.survival_prob
        )
        return torch.div(x, self.survival_prob) * binary_tensor

    def forward(self, inputs):
        x = self.expand_conv(inputs) if self.expand else inputs

        if self.use_residual:
            return self.stochastic_depth(self.conv(x)) + inputs
        else:
            return self.conv(x)
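
The `stochastic_depth` method drops the whole residual branch for a random subset of samples during training and rescales the survivors so the expected value is unchanged. The standalone function below is a minimal sketch of the same idea, written outside the class for illustration; the `training` flag stands in for `self.training`.

```python
import torch


def stochastic_depth(x, survival_prob, training=True):
    # At inference time the branch is always kept and left unscaled.
    if not training:
        return x
    # One keep/drop decision per sample, broadcast over C, H, W.
    keep = torch.rand(x.shape[0], 1, 1, 1, device=x.device) < survival_prob
    # Dividing by survival_prob keeps the expected output equal to x.
    return x / survival_prob * keep


torch.manual_seed(0)
x = torch.ones(10000, 1, 1, 1)
out = stochastic_depth(x, survival_prob=0.8)

kept_fraction = (out != 0).float().mean().item()
mean_value = out.mean().item()
print(f"kept fraction: {kept_fraction:.3f}, mean output: {mean_value:.3f}")
```

Roughly 80% of the samples survive, each scaled by 1 / 0.8, so the mean output stays close to the input value of 1.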

In [163]:
class EfficientNet(nn.Module):
    def __init__(self,version,n_classes):
        super(EfficientNet,self).__init__()
        width_factor,depth_factor,dropout_rate = self.calculate_factors(version)
        print('factors',(width_factor,depth_factor,dropout_rate ))
        last_channel = math.ceil(1280 * width_factor)
        self.pool  = nn.AdaptiveAvgPool2d(1)
        self.features = self.create_features(width_factor,depth_factor,last_channel)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout_rate),
            nn.Linear(last_channel,n_classes)
        )
    def calculate_factors(self,version,alpha = 1.2,beta = 1.1):
        phi,res,drop_rate = phi_values[version]
        depth_factor = alpha ** phi
        width_factor = beta ** phi
        return width_factor,depth_factor ,drop_rate
    def create_features(self,width_factor,depth_factor,last_channels):
        channels = int(32*width_factor)

        features = [CNNBlock(3, channels, 3, stride=2, padding=1)]
        in_channels = channels
        for expand_ratio, channels, repeats, stride, kernel_size in base_model:
            out_channels =4* math.ceil(int(channels * width_factor)/4)
            layers_repeats = math.ceil(repeats * depth_factor)
            for layer in range(layers_repeats):
                features.append(
                    InvertedResidualBlock(in_channels,
                 out_channels,
                 kernel_size=kernel_size,
                 stride = stride if layer == 0 else 1,
                 padding = kernel_size // 2, # if k=1:pad=0, k=3:pad=1, k=5:pad=2
                 expand_ratio = expand_ratio)
                )
                in_channels = out_channels
        features.append(CNNBlock(in_channels,last_channels,kernel_size =1,stride = 1,padding = 0))
        return nn.Sequential(*features)
    def forward(self,x):
        out = self.pool(self.features(x))
        out = out.view(out.size(0),-1)
        out = self.classifier(out)
        return out
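
To see how `create_features` turns the scaling factors into concrete stage sizes, the sketch below repeats its rounding rules in plain Python. It assumes the same `alpha = 1.2` and `beta = 1.1` defaults as `calculate_factors`; `BASE_STAGES` and `scaled_stages` are illustrative names, with the stage list reduced to the channel and repeat columns of `base_model`.

```python
import math

ALPHA, BETA = 1.2, 1.1  # defaults of calculate_factors in this notebook

# (channels, repeats) columns copied from base_model
BASE_STAGES = [(16, 1), (24, 2), (40, 2), (80, 3), (112, 3), (192, 4), (320, 1)]


def scaled_stages(phi):
    """Return (out_channels, repeats) per stage for a given phi."""
    depth_factor, width_factor = ALPHA ** phi, BETA ** phi
    stages = []
    for channels, repeats in BASE_STAGES:
        # Channels are rounded up to a multiple of 4 so that the
        # squeeze-and-excitation reduction (reduction=4) divides evenly.
        out_channels = 4 * math.ceil(int(channels * width_factor) / 4)
        # Repeats are rounded up so that no stage ever loses a layer.
        stages.append((out_channels, math.ceil(repeats * depth_factor)))
    return stages


for name, phi in [("b0", 0), ("b2", 1)]:
    print(name, scaled_stages(phi))
```

With phi = 0 the stages are identical to the baseline, and larger values of phi grow both the channel counts and the number of repeated blocks.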

In [164]:
device = "cuda" if torch.cuda.is_available() else "cpu"
version = "b0"
phi, res, drop_rate = phi_values[version]
num_examples, num_classes = 4, 10
x = torch.randn((num_examples, 3, res, res)).to(device)
model = EfficientNet(
        version=version,
        n_classes=num_classes,
    ).to(device)

# print(model(x).shape)

factors (1.0, 1.0, 0.2)


In [165]:
from torchsummary import summary

summary(model,(3,224,224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 32, 112, 112]             864
       BatchNorm2d-2         [-1, 32, 112, 112]              64
              SiLU-3         [-1, 32, 112, 112]               0
          CNNBlock-4         [-1, 32, 112, 112]               0
            Conv2d-5         [-1, 32, 112, 112]             288
       BatchNorm2d-6         [-1, 32, 112, 112]              64
              SiLU-7         [-1, 32, 112, 112]               0
          CNNBlock-8         [-1, 32, 112, 112]               0
 AdaptiveAvgPool2d-9             [-1, 32, 1, 1]               0
           Conv2d-10              [-1, 8, 1, 1]             264
             SiLU-11              [-1, 8, 1, 1]               0
           Conv2d-12             [-1, 32, 1, 1]             288
          Sigmoid-13             [-1, 32, 1, 1]               0
SqueezeAndExcitation-14         [-1, 32