## Chapter 8: Modern CNNs
1. **AlexNet**: the first successful deep convolutional network; it uses dropout, ReLU, and pooling
2. **VGG**: stacks multiple 3 * 3 conv layers (two stacked 3 * 3 convs cover the same 5 * 5 input region as a single 5 * 5 conv, but need fewer weights: 2 * 3 * 3 = 18 < 25 = 5 * 5)
3. **NiN**: addresses 2 problems (1. the fully connected layers at the end use a lot of memory; 2. fully connected layers cannot be inserted between the conv layers to increase the degree of nonlinearity, because that would destroy the spatial information)
   - use 1 * 1 conv layers to add local nonlinearities across the channel activations
   - use global average pooling to integrate across all locations in the last representation layer. (This must be combined with the added nonlinearities.)
4. **GoogLeNet**: Inception block with parallel conv branches at multiple scales, whose outputs are concatenated along the channel dimension
5. **Batch Normalization**:
   - $BN(\mathbf x) = \mathbf{\gamma} \odot \frac{\mathbf x - \mathbf{\mu_B}}{\mathbf{\sigma_B}} + \mathbf \beta$, $\mathbf{\mu_B} = \frac{1}{|B|}\sum_{\mathbf x \in B} \mathbf x$,
     $\mathbf{\sigma}^2_B = \frac{1}{|B|} \sum_{\mathbf x \in B} (\mathbf x - \mathbf{\mu_B})^2 + \epsilon$ (we divide by the standard deviation $\mathbf{\sigma_B}$, not by the variance)
   - On a linear layer with input [N, D], the statistics are computed over N, separately for each of the D features (different features are never mixed). On a conv layer with input [N, C, H, W], they are computed over N, H, and W, separately for each of the C channels (different channels are never mixed).
     - For example, for input x of shape [N, C, H, W], take the slice x[:, 0, :, :], compute its mean mu and standard deviation std, and apply (x[:, 0, :, :] - mu) / std; here mu and std are scalars.
   - At test time, we use the global (whole dataset) mean and variance instead of the minibatch mean and variance. So, just like dropout, BN behaves differently in training and in testing.
   - So BN also acts as a source of noise (minibatch statistics != true mean and variance); see Teye et al. (2018) and Luo et al. (2018).
   - This is why it works best for batch sizes of about 50 ~ 100: with larger batches the noise is too small to regularize, and with smaller batches it is too large.
   - Moving global mean and variance: at test time there is no minibatch, so we use global estimates that are stored during training.
     - It is an exponentially weighted moving average, so the most recent batches have the highest weight.
     - $\mu_m = \mu_m (1 - \tau) + \mu \tau$, $\Sigma_m = \Sigma_m (1 - \tau) + \Sigma \tau$, where $\tau$ is called the momentum term. (The `batch_norm` function later in these notes uses the opposite convention: its `momentum` argument weights the old moving value.)
6. **Layer Normalization**: often used in NLP
   - For input of shape [N, A, B] (A and B are typically seq_len and hidden_size), each of the N samples is normalized with its own statistics, so samples are never mixed and nothing depends on the batch.
7. **ResNet**: residual block; the input x is passed along a shortcut branch and added just before the final activation function (this is the original paper's design; a later version changes the order to BN -> activation -> conv)
   - So that the shortcut x has the correct shape for the addition, we can apply a 1 * 1 conv to it when needed
   - **Idea**: nested function classes. A shallower net (like ResNet-20) is a subclass of a deeper net (like ResNet-50), because if the layers after the 20th layer of ResNet-50 compute f(x) = x, it is the same as ResNet-20. So f' (the best function ResNet-50 can reach for given data) is at least as good as f (the best for ResNet-20 on the same data).
   - <p align="center">
       <img alt="Residual Block" src="https://d2l.ai/_images/resnet-block.svg" style="background-color: white; display: inline-block;">
       Residual Block
   </p>
   - **ResNeXt**: use g groups of 3 * 3 conv layers between two 1 * 1 convs with $b$ and $c_o$ output channels, so the cost of the 3 * 3 conv drops from $\mathcal O(c_i c_o)$ to $\mathcal O(g \cdot c_i / g \cdot c_o / g) = \mathcal O(c_i c_o / g)$
     - This is a **bottleneck** architecture if $b < c_i$
        <br>
   - <img alt="ResNeXt Block" src="https://d2l.ai/_images/resnext-block.svg" style="background-color: white; display: inline-block;">
       ResNeXt Block
8. **DenseNet**: instead of adding x, we concatenate x repeatedly.
   - For example (\<channel\> indicates the number of channels): x\<c_1\> -> f_1(x)\<c_2\> gives [x, f_1(x)]\<c_1 + c_2\> -> f_2([x, f_1(x)])\<c_3\> gives [x, f_1(x), f_2([x, f_1(x)])]\<c_1 + c_2 + c_3\>
   - Stacking many of these layers makes the channel dimension too large, so we need a layer to reduce it. The **transition** layer uses a 1 * 1 conv to reduce the channels and average pooling to halve H and W.
9. **RegNet**:
   - AnyNet: network with **stem** -> **body** -> **head**.
   - Distribution of networks: $F(e, Z) = \frac{1}{n}\sum_{i=1}^{n} \mathbf 1(e_i < e)$ is the empirical CDF of the errors $e_i$ of a sample $Z$ of $n$ networks drawn from $p$, the distribution over network architectures; we use it to approximate $F(e, p)$. If $F(e, Z_1) \ge F(e, Z_2)$ for all $e$ (a larger share of the networks in $Z_1$ stays below each error level), then we say $Z_1$ is better, i.e. its design parameters are better.
   - For RegNet, they find that we can use the same bottleneck ratio k (the paper says k = 1, i.e. no bottleneck, is best) and the same group width g for all ResNeXt blocks with no harm, and that the network depth d and width c should increase across the stages. The width is kept changing linearly, $c_j \approx c_0 + c_a j$ with slope $c_a$.
   - Neural architecture search (NAS): within a given search space, use RL (NASNet), evolutionary algorithms (AmoebaNet), gradient-based methods (DARTS), or weight sharing (ENAS) to find the model. But it takes too much computation.
   - <img src="https://d2l.ai/_images/anynet.svg" style="background-color: white; display: inline-block;"> AnyNet Structure
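
The empirical CDF from the RegNet notes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the error rates are made-up numbers, and `empirical_cdf` is a hypothetical helper name, not something from the paper or from d2l.

```python
def empirical_cdf(errors, e):
    """Fraction of sampled networks whose error is strictly below e."""
    return sum(err < e for err in errors) / len(errors)

# Made-up error rates for two hypothetical samples of networks
errors_z1 = [0.10, 0.12, 0.15, 0.20]
errors_z2 = [0.14, 0.18, 0.22, 0.30]
thresholds = [0.13, 0.16, 0.21, 0.31]

cdf_z1 = [empirical_cdf(errors_z1, e) for e in thresholds]
cdf_z2 = [empirical_cdf(errors_z2, e) for e in thresholds]
print(cdf_z1)  # [0.5, 0.75, 1.0, 1.0]
print(cdf_z2)  # [0.0, 0.25, 0.5, 1.0]
```

At every threshold the first sample has at least as large a share of networks below that error, so its CDF lies on or above the second one.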

In [1]:
import torch
from torch import nn

In [2]:
def layer_summary(net, X_shape):
    """Print the output shape of each top-level layer of `net`.

    Adapted from the LeNet section of d2l."""
    X = torch.randn(*X_shape)
    for layer in net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

In [3]:
def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    return nn.Sequential(*layers)
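
The VGG argument about stacked 3 * 3 convs can be checked with a little arithmetic. The sketch below assumes stride-1 convs without bias and the same number of input and output channels; the helper names are hypothetical.

```python
def conv_weights(kernel_size, channels):
    """Weights of one k x k conv with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convs."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64
two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)
print(stacked_receptive_field(3, 2))  # 5, same as one 5 x 5 conv
print(two_3x3, one_5x5)               # 73728 102400
```

The ratio 73728 / 102400 is exactly 18 / 25, independent of the channel count.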

In [4]:
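def init_cnn(module):
    """Initialize weights for CNNs with Xavier uniform initialization.

    Adapted from the LeNet section of d2l."""
    # Exact type checks: lazy layers that are not materialized yet
    # (nn.LazyLinear, nn.LazyConv2d) are skipped
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        nn.init.xavier_uniform_(module.weight)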

In [6]:
class VGG(nn.Module):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
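        # The lazy layers are not materialized at this point, so the type
        # checks in init_cnn match no module here; re-apply it after a first
        # forward pass to actually initialize the weights
        self.net.apply(init_cnn)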

In [8]:
VGG_11 = VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)))

In [17]:
layer_summary(VGG_11.net, (32, 3, 224, 224))

Sequential output shape:	 torch.Size([32, 64, 112, 112])
Sequential output shape:	 torch.Size([32, 128, 56, 56])
Sequential output shape:	 torch.Size([32, 256, 28, 28])
Sequential output shape:	 torch.Size([32, 512, 14, 14])
Sequential output shape:	 torch.Size([32, 512, 7, 7])
Flatten output shape:	 torch.Size([32, 25088])
Linear output shape:	 torch.Size([32, 4096])
ReLU output shape:	 torch.Size([32, 4096])
Dropout output shape:	 torch.Size([32, 4096])
Linear output shape:	 torch.Size([32, 4096])
ReLU output shape:	 torch.Size([32, 4096])
Dropout output shape:	 torch.Size([32, 4096])
Linear output shape:	 torch.Size([32, 10])


In [10]:
def nin_block(out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())

In [18]:
nin_block_ = nin_block(5, (3,3), 1, 1)
layer_summary(nin_block_, (32, 3, 224, 224))

Conv2d output shape:	 torch.Size([32, 5, 224, 224])
ReLU output shape:	 torch.Size([32, 5, 224, 224])
Conv2d output shape:	 torch.Size([32, 5, 224, 224])
ReLU output shape:	 torch.Size([32, 5, 224, 224])
Conv2d output shape:	 torch.Size([32, 5, 224, 224])
ReLU output shape:	 torch.Size([32, 5, 224, 224])
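
To see why a 1 * 1 conv acts like a small fully connected layer applied at every pixel, the following sketch compares `nn.Conv2d` with `kernel_size=1` against the same weights applied across the channel dimension by hand. The tensor sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 5, kernel_size=1)
X = torch.randn(2, 3, 4, 4)
Y_conv = conv(X)

# Reuse the conv weights as a [5, 3] matrix that mixes channels at each pixel
W = conv.weight.view(5, 3)
Y_fc = torch.einsum('oc,nchw->nohw', W, X) + conv.bias.view(1, 5, 1, 1)
print(torch.allclose(Y_conv, Y_fc, atol=1e-5))

# Global average pooling is just the mean over the spatial dimensions
gap = nn.AdaptiveAvgPool2d((1, 1))(Y_conv).flatten(1)
print(torch.allclose(gap, Y_conv.mean(dim=(2, 3)), atol=1e-5))
```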


In [12]:
class NiN(nn.Module):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())

In [19]:
nin = NiN()
layer_summary(nin.net, (32, 3, 224, 224))

Sequential output shape:	 torch.Size([32, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([32, 96, 26, 26])
Sequential output shape:	 torch.Size([32, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([32, 256, 12, 12])
Sequential output shape:	 torch.Size([32, 384, 12, 12])
MaxPool2d output shape:	 torch.Size([32, 384, 5, 5])
Dropout output shape:	 torch.Size([32, 384, 5, 5])
Sequential output shape:	 torch.Size([32, 10, 5, 5])
AdaptiveAvgPool2d output shape:	 torch.Size([32, 10, 1, 1])
Flatten output shape:	 torch.Size([32, 10])
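
The spatial sizes printed above follow from the usual output-size formula floor((n + 2p - k) / s) + 1. A minimal check in plain Python, where `out_size` is a hypothetical helper and the (kernel, stride, padding) triples are read off the `NiN` definition:

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Output height/width of a conv or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride, padding) of each size-changing layer in NiN;
# the 1 x 1 convs inside each nin_block keep the size unchanged
layers = [(11, 4, 0), (3, 2, 0), (5, 1, 2), (3, 2, 0),
          (3, 1, 1), (3, 2, 0), (3, 1, 1)]
sizes = []
n = 224
for k, s, p in layers:
    n = out_size(n, k, s, p)
    sizes.append(n)
print(sizes)  # [54, 26, 26, 12, 12, 5, 5]
```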


In [3]:
import torch.nn.functional as F
class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

In [6]:
incep = Inception(8, (16, 32), (32, 64), 128)
incep.eval()._modules, incep(torch.randn(32, 3, 224, 224)).shape

({'b1_1': Conv2d(3, 8, kernel_size=(1, 1), stride=(1, 1)),
  'b2_1': Conv2d(3, 16, kernel_size=(1, 1), stride=(1, 1)),
  'b2_2': Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
  'b3_1': Conv2d(3, 32, kernel_size=(1, 1), stride=(1, 1)),
  'b3_2': Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)),
  'b4_1': MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False),
  'b4_2': Conv2d(3, 128, kernel_size=(1, 1), stride=(1, 1))},
 torch.Size([32, 232, 224, 224]))

In [7]:
class GoogleNet(nn.Module):
    def __init__(self, lr=0.1, num_classes=10):
        super(GoogleNet, self).__init__()
        self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                                 self.b5(), nn.LazyLinear(num_classes))
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    def b2(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
            nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    def b3(self):
        return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                             Inception(128, (128, 192), (32, 96), 64),
                             nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    def b4(self):
        return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                             Inception(160, (112, 224), (24, 64), 64),
                             Inception(128, (128, 256), (24, 64), 64),
                             Inception(112, (144, 288), (32, 64), 64),
                             Inception(256, (160, 320), (32, 128), 128),
                             nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    def b5(self):
        return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                             Inception(384, (192, 384), (48, 128), 128),
                             nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())

In [8]:
g = GoogleNet()
layer_summary(g.net, (32, 3, 224, 224))

Sequential output shape:	 torch.Size([32, 64, 56, 56])
Sequential output shape:	 torch.Size([32, 192, 28, 28])
Sequential output shape:	 torch.Size([32, 480, 14, 14])
Sequential output shape:	 torch.Size([32, 832, 7, 7])
Sequential output shape:	 torch.Size([32, 1024])
Linear output shape:	 torch.Size([32, 10])
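
Because the four Inception branches are concatenated along the channel dimension, the output channel count is the sum of each branch's last conv. A small sketch (the helper name is hypothetical) reproduces the channel counts seen in the outputs above:

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branches of an Inception block."""
    return c1 + c2[1] + c3[1] + c4

print(inception_out_channels(8, (16, 32), (32, 64), 128))       # 232
print(inception_out_channels(128, (128, 192), (32, 96), 64))    # 480
print(inception_out_channels(256, (160, 320), (32, 128), 128))  # 832
print(inception_out_channels(384, (192, 384), (48, 128), 128))  # 1024
```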


In [23]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell training mode from prediction mode
    if not torch.is_grad_enabled():
        print("In testing:")
        # In prediction mode, use the mean and variance from the moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        print("In training:")
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance over the
            # batch, separately for each feature
            print("For full connect layer:")
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
            print("mean, var are: ", mean, var)
        else:
            print("For conv layer:")
            # 2D conv layer: compute the mean and variance per channel (axis=1).
            # Keep the shape of X so that broadcasting works afterwards
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
            print("mean, var are: ", mean, var)
            print("X - mean", X - mean)
        # In training mode, standardize with the current mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance. Here `momentum`
        # weights the old value (nn.BatchNorm2d uses it to weight the new one)
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

In [67]:
x_3d = torch.zeros(2, 2, 2)
x_3d[1, :, :] = 1
x_4d = torch.zeros(2, 2, 2, 2)
x_4d[0, 1, :, :] = 1
x_4d[1, 0, :, :] = 2
x_4d[1, 1, :, :] = 3
print("x_3d is :\n", x_3d)
print("x_4d is :\n", x_4d)

x_3d is :
 tensor([[[0., 0.],
         [0., 0.]],

        [[1., 1.],
         [1., 1.]]])
x_4d is :
 tensor([[[[0., 0.],
          [0., 0.]],

         [[1., 1.],
          [1., 1.]]],


        [[[2., 2.],
          [2., 2.]],

         [[3., 3.],
          [3., 3.]]]])


In [83]:
print("This is BN: across the channel: no between channel calculations")
print(batch_norm(x_4d, 1, 0, 0, 1, 1e-6, 0.1)[0][:, 0, :, :])
(x_4d[:, 0, :, :] - (x_4d[:, 0, :, :]).mean()) / torch.std(x_4d[:, 0, :, :], unbiased=False)

This is BN: across the channel
In training:
For conv layer:
mean, var are:  tensor([[[[1.]],

         [[2.]]]]) tensor([[[[1.]],

         [[1.]]]])
X - mean tensor([[[[-1., -1.],
          [-1., -1.]],

         [[-1., -1.],
          [-1., -1.]]],


        [[[ 1.,  1.],
          [ 1.,  1.]],

         [[ 1.,  1.],
          [ 1.,  1.]]]])
tensor([[[-1.0000, -1.0000],
         [-1.0000, -1.0000]],

        [[ 1.0000,  1.0000],
         [ 1.0000,  1.0000]]])


tensor([[[-1., -1.],
         [-1., -1.]],

        [[ 1.,  1.],
         [ 1.,  1.]]])
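
As a cross-check, the per-channel computation can be compared with PyTorch's own `nn.BatchNorm2d` in training mode. This is a sketch on random data (shapes chosen arbitrarily); it relies on the default initialization, where the scale is 1 and the shift is 0.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(8, 3, 5, 5)

bn = nn.BatchNorm2d(3, eps=1e-5, momentum=0.1)
bn.train()
Y_ref = bn(X)

# Manual version: statistics over N, H, W, separately for each channel
mean = X.mean(dim=(0, 2, 3), keepdim=True)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
Y_manual = (X - mean) / torch.sqrt(var + 1e-5)
print(torch.allclose(Y_ref, Y_manual, atol=1e-5))

# nn.BatchNorm2d updates running_mean as (1 - momentum) * old + momentum * new,
# starting from zeros
expected_running_mean = 0.1 * mean.flatten()
print(torch.allclose(bn.running_mean, expected_running_mean, atol=1e-6))
```

Note that in `nn.BatchNorm2d` the `momentum` argument weights the new batch statistic.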

In [108]:
print((x_4d - x_4d.mean(dim=(1, 2, 3), keepdim=True)) / x_4d.std(dim=(1, 2, 3), keepdim=True, unbiased=False))  # Manual layer normalization over (C, H, W), without eps
print(nn.LayerNorm((2, 2, 2))(x_4d))

tensor([[[[-1., -1.],
          [-1., -1.]],

         [[ 1.,  1.],
          [ 1.,  1.]]],


        [[[-1., -1.],
          [-1., -1.]],

         [[ 1.,  1.],
          [ 1.,  1.]]]])
tensor([[[[-1.0000, -1.0000],
          [-1.0000, -1.0000]],

         [[ 1.0000,  1.0000],
          [ 1.0000,  1.0000]]],


        [[[-1.0000, -1.0000],
          [-1.0000, -1.0000]],

         [[ 1.0000,  1.0000],
          [ 1.0000,  1.0000]]]], grad_fn=<NativeLayerNormBackward0>)
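
For the NLP-style [N, seq_len, hidden_size] case, one common setup (assumed here, not taken from the notes) normalizes over the last dimension only. The sketch also shows that the result for one sample does not depend on the rest of the batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(2, 4, 6)  # [N, seq_len, hidden_size]
ln = nn.LayerNorm(6)      # normalize over hidden_size
Y = ln(X)

# Manual version with the same eps as the module
mean = X.mean(dim=-1, keepdim=True)
var = ((X - mean) ** 2).mean(dim=-1, keepdim=True)
Y_manual = (X - mean) / torch.sqrt(var + ln.eps)
print(torch.allclose(Y, Y_manual, atol=1e-5))

# Normalizing the first sample alone gives the same values
Y_single = ln(X[:1])
print(torch.allclose(Y[:1], Y_single, atol=1e-6))
```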


In [12]:
class Residual(nn.Module):  #@save
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)
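
The nested-function idea rests on a residual block being able to represent the identity. A minimal sketch, using a simplified block without batch normalization (`TinyResidual` is a hypothetical name, not the `Residual` class above): when the last conv outputs zero, the block passes a non-negative input through unchanged.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyResidual(nn.Module):
    """Simplified residual block: two 3 x 3 convs plus a shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, X):
        Y = self.conv2(F.relu(self.conv1(X)))
        return F.relu(Y + X)

torch.manual_seed(0)
blk = TinyResidual(4)
# Zero the last conv so the residual branch contributes nothing
nn.init.zeros_(blk.conv2.weight)
nn.init.zeros_(blk.conv2.bias)

X = torch.rand(2, 4, 8, 8)  # non-negative, so the final ReLU leaves it as is
out = blk(X)
print(torch.allclose(out, X))
```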

In [14]:
class ResNet(nn.Module):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(self.b1())
        for i, b in enumerate(arch):
            self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
        self.net.add_module('last', nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
            nn.LazyLinear(num_classes)))
        
    def block(self, num_residuals, num_channels, first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                blk.append(Residual(num_channels))
        return nn.Sequential(*blk)
    
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [16]:
resnet = ResNet(((2, 64), (2, 128), (2, 256), (2, 512)))
layer_summary(resnet.net, (32, 3, 224, 224))

Sequential output shape:	 torch.Size([32, 64, 56, 56])
Sequential output shape:	 torch.Size([32, 64, 56, 56])
Sequential output shape:	 torch.Size([32, 128, 28, 28])
Sequential output shape:	 torch.Size([32, 256, 14, 14])
Sequential output shape:	 torch.Size([32, 512, 7, 7])
Sequential output shape:	 torch.Size([32, 10])


In [None]:
class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
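        # Note: despite its name, `groups` is used here as the number of
        # channels per group; the conv gets bot_channels // groups groups
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)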
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)
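
The O(c_i c_o / g) saving of a grouped convolution can be seen directly in the weight count. A small sketch with arbitrary sizes (64 channels, 4 groups), using the `groups` argument of `nn.Conv2d`:

```python
import torch
from torch import nn

dense_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
grouped_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4,
                         bias=False)

n_dense = dense_conv.weight.numel()      # 64 * 64 * 3 * 3
n_grouped = grouped_conv.weight.numel()  # 64 * (64 / 4) * 3 * 3
print(n_dense, n_grouped)  # 36864 9216

# Both layers still map [N, 64, H, W] to [N, 64, H, W]
X = torch.randn(1, 64, 8, 8)
print(dense_conv(X).shape == grouped_conv(X).shape)
```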

In [17]:
"""
Dense Net
"""
def conv_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = torch.cat((X, Y), dim=1)
        return X

def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

class DenseNet(nn.Module):
    def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
        super().__init__()
        # self.save_hyperparameters()
        self.net = nn.Sequential(self.b1())
        for i, num_convs in enumerate(arch):
            self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                              growth_rate))
            # The number of output channels in the previous dense block
            num_channels += num_convs * growth_rate
            # A transition layer that halves the number of channels is added
            # between the dense blocks
            if i != len(arch) - 1:
                num_channels //= 2
                self.net.add_module(f'tran_blk{i+1}', transition_block(
                    num_channels))
        self.net.add_module('last', nn.Sequential(
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
            nn.LazyLinear(num_classes)))
        # self.net.apply(d2l.init_cnn)
        
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [18]:
dense = DenseNet()
layer_summary(dense.net, (32, 3, 224, 224))
"""
[32, 64, 56, 56] -> the 64 channels become 192 = 64 + 32 (growth_rate) * 4 (num_convs)
Then a transition layer halves both: 192 -> 192 / 2 = 96 channels, 56 -> 56 / 2 = 28 pixels
"""

Sequential output shape:	 torch.Size([32, 64, 56, 56])
DenseBlock output shape:	 torch.Size([32, 192, 56, 56])
Sequential output shape:	 torch.Size([32, 96, 28, 28])
DenseBlock output shape:	 torch.Size([32, 224, 28, 28])
Sequential output shape:	 torch.Size([32, 112, 14, 14])
DenseBlock output shape:	 torch.Size([32, 240, 14, 14])
Sequential output shape:	 torch.Size([32, 120, 7, 7])
DenseBlock output shape:	 torch.Size([32, 248, 7, 7])
Sequential output shape:	 torch.Size([32, 10])
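
The channel counts in this summary can be reproduced without building the network. The sketch below mirrors the bookkeeping in `DenseNet.__init__` (each dense block adds `num_convs * growth_rate` channels, each transition halves them); `densenet_channels` is a hypothetical helper name.

```python
def densenet_channels(num_channels=64, growth_rate=32, arch=(4, 4, 4, 4)):
    """Channel count after each dense block and each transition layer."""
    trace = []
    for i, num_convs in enumerate(arch):
        num_channels += num_convs * growth_rate  # dense block concatenates
        trace.append(num_channels)
        if i != len(arch) - 1:
            num_channels //= 2                   # transition halves channels
            trace.append(num_channels)
    return trace

trace = densenet_channels()
print(trace)  # [192, 96, 224, 112, 240, 120, 248]
```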
