[**Network in Network (NiN) (Lin et al., 2013)**](https://arxiv.org/abs/1312.4400) replaces the linear filter of a standard convolution with a small multilayer perceptron that slides over the input patch by patch. The resulting mlpconv layers compute a more abstract, non-linear function of each local patch.

![](imgs/mlpconv_layer.png)

[MLPConv Layer](https://arxiv.org/abs/1312.4400)

![](imgs/nin.png)

[NiN Architecture](https://arxiv.org/abs/1312.4400)

The Network in Network (NiN) architecture embeds a micro-network within each convolutional layer. A standard convolutional filter applies a single linear transformation to each patch before the activation. An mlpconv layer instead follows the spatial convolution with a small multilayer perceptron shared across all spatial locations, which is equivalent to stacking 1×1 convolutions with non-linear activations. Each layer can therefore compute a more abstract function of its local patch than a single linear filter can, and the paper reports state-of-the-art results at the time on CIFAR-10 and CIFAR-100. In the version implemented here, each NiN block is a standard convolution followed by two 1×1 convolution layers. The 1×1 layers have no spatial extent, so they add representational capacity at a small cost in parameters.
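The idea that a 1×1 convolution behaves like a small fully connected layer applied at every spatial location can be checked directly. The sketch below is illustrative only: the channel counts and tensor sizes are arbitrary choices, not values from the paper. It copies the weights of a 1×1 `nn.Conv2d` into an `nn.Linear` and compares the two outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 1x1 convolution mapping 8 channels to 4, and a linear layer with the same weights
conv1x1 = nn.Conv2d(8, 4, kernel_size=1)
linear = nn.Linear(8, 4)
with torch.no_grad():
    linear.weight.copy_(conv1x1.weight.view(4, 8))  # (4, 8, 1, 1) -> (4, 8)
    linear.bias.copy_(conv1x1.bias)

x = torch.randn(2, 8, 5, 5)  # (batch, channels, height, width)

out_conv = conv1x1(x)
# Move channels last so the linear layer acts on each pixel's channel vector,
# then move them back to the (batch, channels, height, width) layout
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(out_conv, out_linear, atol=1e-5)
print(same)
```

Stacking two such layers with ReLU in between therefore gives a two-layer MLP over the channels of each pixel, with weights shared across all positions.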

The implementation below stacks four mlpconv blocks, with max pooling after each of the first three and global average pooling after the last:

1. Block 1:
    - Conv (11x11, 96 filters, stride 4), ReLU
    - 1x1 Conv (96 filters), ReLU
    - 1x1 Conv (96 filters), ReLU
    - MaxPool (3x3, stride 2)
2. Block 2:
    - Conv (5x5, 256 filters, padding 2), ReLU
    - 1x1 Conv (256 filters), ReLU
    - 1x1 Conv (256 filters), ReLU
    - MaxPool (3x3, stride 2)
3. Block 3:
    - Conv (3x3, 384 filters, padding 1), ReLU
    - 1x1 Conv (384 filters), ReLU
    - 1x1 Conv (384 filters), ReLU
    - MaxPool (3x3, stride 2)
4. Block 4 (Output):
    - Conv (3x3, num_classes filters, padding 1), ReLU
    - 1x1 Conv (num_classes filters), ReLU
    - 1x1 Conv (num_classes filters), ReLU
    - Global Average Pooling (reduces spatial dims to 1x1)
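The spatial size after each layer in the list above follows from the usual output-size rule for convolution and pooling, `floor((size + 2 * padding - kernel) / stride) + 1`. The short sketch below traces a 224×224 input through the spatial layers; the helper name `conv_out` is made up for this illustration. The 1×1 convolutions are omitted because they leave the spatial size unchanged.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of each spatial layer, in order
layers = [
    (11, 4, 0),  # block 1 conv
    (3, 2, 0),   # max pool
    (5, 1, 2),   # block 2 conv
    (3, 2, 0),   # max pool
    (3, 1, 1),   # block 3 conv
    (3, 2, 0),   # max pool
    (3, 1, 1),   # block 4 conv
]

size = 224
sizes = []
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # prints [54, 26, 26, 12, 12, 5, 5]
```

The final block therefore produces one 5×5 map per class, which global average pooling reduces to a single value.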

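The input is a 224x224x3 RGB image. NiN rests on two ideas. The first is the mlpconv block, which acts like a tiny neural network applied at every spatial location. The second is global average pooling in place of fully connected layers: the last block outputs one feature map per class, and averaging each map over its spatial dimensions gives the class scores directly, with no additional weights.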
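As a small illustration of global average pooling, the sketch below feeds a random tensor, standing in for the output of the last block, through `nn.AdaptiveAvgPool2d((1, 1))` and `nn.Flatten()` and compares the result with a plain mean over height and width. The batch size and map sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the last block's output: 4 images, 10 class maps of size 5x5
class_maps = torch.randn(4, 10, 5, 5)

gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
scores = gap(class_maps)               # one score per class: shape (4, 10)
manual = class_maps.mean(dim=(2, 3))   # average each map over height and width

print(scores.shape)
print(torch.allclose(scores, manual, atol=1e-6))

# The pooling head has no learnable weights and accepts any spatial size
n_params = sum(p.numel() for p in gap.parameters())
other = gap(torch.randn(4, 10, 7, 3))
print(n_params, other.shape)
```

Because the head is just an average, the same network can in principle be applied to inputs of other resolutions, as long as the feature maps do not shrink to nothing before the last block.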

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import utils

In [10]:
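def mlpconv_block(in_channels, out_channels, kernel_size, stride=1, padding=0):
    """One NiN block: a spatial convolution followed by two 1x1 convolutions."""
    return nn.Sequential(
        # Spatial convolution over kernel_size x kernel_size patches
        nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding), nn.ReLU(),
        # Two 1x1 convolutions: a per-pixel MLP across channels
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU()
    )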

In [11]:
class NiN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.net = nn.Sequential(
            mlpconv_block(3, 96, kernel_size=11, stride=4, padding=0),
            nn.MaxPool2d(3, stride=2),

            mlpconv_block(96, 256, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(3, stride=2),

            mlpconv_block(256, 384, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(3, stride=2),

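            # Final block emits one feature map per class
            mlpconv_block(384, num_classes, kernel_size=3, stride=1, padding=1),

            # Global average pooling turns each class map into a single score
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten()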
        )

    def forward(self, x):
        return self.net(x)
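To get a feel for the parameter efficiency mentioned earlier, the parameters of the model above can be counted by hand. The sketch below uses the standard count for a convolution (weights plus biases). The fully connected head it compares against is a hypothetical alternative chosen for illustration, not something from NiN or this notebook, and its 5×5 input size assumes a 224×224 image.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a Conv2d(c_in, c_out, kernel_size=k)."""
    return c_in * c_out * k * k + c_out

def block_params(c_in, c_out, k):
    """One spatial convolution followed by two 1x1 convolutions."""
    return conv_params(c_in, c_out, k) + 2 * conv_params(c_out, c_out, 1)

num_classes = 10
# (in_channels, out_channels, kernel_size) of the four mlpconv blocks
blocks = [(3, 96, 11), (96, 256, 5), (256, 384, 3), (384, num_classes, 3)]
per_block = [block_params(*b) for b in blocks]
total = sum(per_block)

# Hypothetical alternative head (not part of NiN): flatten the 384 x 5 x 5
# features after the last pooling layer into one 4096-unit dense layer
fc_head = 384 * 5 * 5 * 4096 + 4096

print(per_block, total)
print(fc_head)
```

Under these assumptions the whole network has about 2.0 million parameters, while the single hypothetical dense layer alone would need about 39 million.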

In [12]:
utils.layer_summary(NiN(num_classes=10), (1, 3, 224, 224))

Input shape: (1, 3, 224, 224)
----------------------------------------
Conv2d          output shape: (1, 96, 54, 54)
ReLU            output shape: (1, 96, 54, 54)
Conv2d          output shape: (1, 96, 54, 54)
ReLU            output shape: (1, 96, 54, 54)
Conv2d          output shape: (1, 96, 54, 54)
ReLU            output shape: (1, 96, 54, 54)
MaxPool2d       output shape: (1, 96, 26, 26)
Conv2d          output shape: (1, 256, 26, 26)
ReLU            output shape: (1, 256, 26, 26)
Conv2d          output shape: (1, 256, 26, 26)
ReLU            output shape: (1, 256, 26, 26)
Conv2d          output shape: (1, 256, 26, 26)
ReLU            output shape: (1, 256, 26, 26)
MaxPool2d       output shape: (1, 256, 12, 12)
Conv2d          output shape: (1, 384, 12, 12)
ReLU            output shape: (1, 384, 12, 12)
Conv2d          output shape: (1, 384, 12, 12)
ReLU            output shape: (1, 384, 12, 12)
Conv2d          output shape: (1, 384, 12, 12)
ReLU            output shape: (1, 384, 12, 
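The implementation of `utils.layer_summary` is not shown in this notebook. A helper with similar output could be sketched with forward hooks, as below. The function name `layer_summary_sketch` and the small demo model are assumptions for illustration, and the real helper may work differently.

```python
import torch
import torch.nn as nn

def layer_summary_sketch(model, input_shape):
    """Print the output shape of every leaf module, in execution order."""
    rows = []

    def hook(module, inputs, output):
        rows.append((type(module).__name__, tuple(output.shape)))

    # Leaf modules are those without children (Conv2d, ReLU, MaxPool2d, ...)
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if len(list(m.children())) == 0]
    with torch.no_grad():
        model(torch.zeros(*input_shape))  # dummy forward pass to trigger the hooks
    for h in handles:
        h.remove()

    print(f"Input shape: {tuple(input_shape)}")
    print("-" * 40)
    for name, shape in rows:
        print(f"{name:<15} output shape: {shape}")
    return rows

# Demo on a small stand-in model
demo = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
rows = layer_summary_sketch(demo, (1, 3, 32, 32))
```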

In [5]:
data = utils.CIFAR10DataLoader(batch_size=64, resize=(224, 224))
train_loader = data.get_train_loader()
test_loader = data.get_test_loader()

In [6]:
# First run: PyTorch's default initialization, optimized with Adam
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = NiN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 10
for epoch in range(epochs):
    train_loss, train_acc = utils.train_step(train_loader, model, criterion, optimizer, device)
    test_loss, test_acc = utils.eval_step(test_loader, model, criterion, device)
    print(f"Epoch {epoch + 1:>{len(str(epochs))}}/{epochs} | "
          f"Train Loss: {train_loss:.4f} | "
          f"Test Loss: {test_loss:.4f} | "
          f"Test Acc: {test_acc:.4f}")

Epoch  1/10 | Train Loss: 2.1806 | Test Loss: 2.0644 | Test Acc: 0.2335
Epoch  2/10 | Train Loss: 1.8938 | Test Loss: 1.7691 | Test Acc: 0.3151
Epoch  3/10 | Train Loss: 1.7286 | Test Loss: 1.6825 | Test Acc: 0.3555
Epoch  4/10 | Train Loss: 1.6316 | Test Loss: 1.5785 | Test Acc: 0.4017
Epoch  5/10 | Train Loss: 1.5501 | Test Loss: 1.4932 | Test Acc: 0.4368
Epoch  6/10 | Train Loss: 1.4553 | Test Loss: 1.4203 | Test Acc: 0.4672
Epoch  7/10 | Train Loss: 1.3829 | Test Loss: 1.3675 | Test Acc: 0.4898
Epoch  8/10 | Train Loss: 1.3296 | Test Loss: 1.3534 | Test Acc: 0.5107
Epoch  9/10 | Train Loss: 1.2736 | Test Loss: 1.2767 | Test Acc: 0.5331
Epoch 10/10 | Train Loss: 1.2243 | Test Loss: 1.2136 | Test Acc: 0.5578
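The helpers `utils.train_step` and `utils.eval_step` are not shown in this notebook. From the way they are called, each takes a data loader, the model, and the loss, and returns a loss and an accuracy. The sketch below is one plausible version under that assumption. The `_sketch` names, the averaging scheme, and the toy data are all illustrative, and the real helpers may differ.

```python
import torch
import torch.nn as nn

def train_step_sketch(loader, model, criterion, optimizer, device):
    """One pass over the training data; returns (mean loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model(inputs)
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
        # Weight the batch loss by batch size so the result is a per-sample mean
        total_loss += loss.item() * targets.size(0)
        correct += (logits.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    return total_loss / seen, correct / seen

@torch.no_grad()
def eval_step_sketch(loader, model, criterion, device):
    """One pass over held-out data without gradient tracking or weight updates."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        total_loss += criterion(logits, targets).item() * targets.size(0)
        correct += (logits.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    return total_loss / seen, correct / seen

# Toy usage: a list of random (inputs, targets) batches stands in for a DataLoader
torch.manual_seed(0)
batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]
toy = nn.Linear(4, 3)
opt = torch.optim.SGD(toy.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
cpu = torch.device("cpu")
train_loss, train_acc = train_step_sketch(batches, toy, loss_fn, opt, cpu)
test_loss, test_acc = eval_step_sketch(batches, toy, loss_fn, cpu)
```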


In [None]:
# Second run: weights initialized via utils.init_kaiming, optimized with SGD + momentum
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = NiN(num_classes=10)
model.apply(utils.init_kaiming).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

epochs = 10
for epoch in range(epochs):
    train_loss, train_acc = utils.train_step(train_loader, model, criterion, optimizer, device)
    test_loss, test_acc = utils.eval_step(test_loader, model, criterion, device)
    print(f"Epoch {epoch + 1:>{len(str(epochs))}}/{epochs} | "
          f"Train Loss: {train_loss:.4f} | "
          f"Test Loss: {test_loss:.4f} | "
          f"Test Acc: {test_acc:.4f}")

Epoch  1/10 | Train Loss: 2.1375 | Test Loss: 2.0493 | Test Acc: 0.2458
Epoch  2/10 | Train Loss: 1.9544 | Test Loss: 1.8500 | Test Acc: 0.2598
Epoch  3/10 | Train Loss: 1.7363 | Test Loss: 1.5415 | Test Acc: 0.4289
Epoch  4/10 | Train Loss: 1.5001 | Test Loss: 1.3435 | Test Acc: 0.5149
Epoch  5/10 | Train Loss: 1.2337 | Test Loss: 1.2072 | Test Acc: 0.5745
Epoch  6/10 | Train Loss: 1.0496 | Test Loss: 1.1423 | Test Acc: 0.6149
Epoch  7/10 | Train Loss: 0.9228 | Test Loss: 0.9431 | Test Acc: 0.6691
Epoch  8/10 | Train Loss: 0.8217 | Test Loss: 0.8786 | Test Acc: 0.6922
Epoch  9/10 | Train Loss: 0.7569 | Test Loss: 0.9303 | Test Acc: 0.6959
Epoch 10/10 | Train Loss: 0.6889 | Test Loss: 0.7740 | Test Acc: 0.7353
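The second run reaches a test accuracy of 0.7353 after 10 epochs, compared with 0.5578 for the first. It changes both the initialization and the optimizer, so the gain cannot be attributed to either change alone. The body of `utils.init_kaiming` is not shown. A function of that kind, suitable for `model.apply`, might look like the sketch below, which assumes Kaiming (He) normal initialization for the weights and zeros for the biases. The name `init_kaiming_sketch` and the small demo network are illustrative.

```python
import torch
import torch.nn as nn

def init_kaiming_sketch(module):
    """Kaiming (He) normal init for conv and linear weights; biases set to zero."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
    nn.Conv2d(16, 8, kernel_size=1), nn.ReLU(),
)
net.apply(init_kaiming_sketch)  # apply() calls the function on every submodule

first_conv = net[0]
weight_std = first_conv.weight.std().item()
print(weight_std)
```

With the default `fan_in` mode, the weights of the first convolution are drawn with standard deviation `sqrt(2 / (3 * 3 * 3))`, roughly 0.27, which is chosen to keep activation variance stable through ReLU layers.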
