In [1]:
import torch
import torchvision
import torch.nn as nn

Imagine a depthwise max pooling layer: instead of pooling across the spatial dimensions, we pool across the depth (channel) dimension. This can allow the CNN to learn to be invariant to various features by keeping only the strongest response in each group of feature maps. For example, it could learn multiple filters, each detecting a different rotation of the same pattern, and the depthwise max pooling layer would ensure that the output is the same regardless of the rotation. The CNN could similarly learn to be invariant to anything: thickness, brightness, skew, color, etc.

In [13]:
a = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

In [20]:
groups = 2
c = torch.chunk(a, chunks=groups, dim=0)

In [31]:
c_ = tuple(torch.max(x, dim=0, keepdim=True).values for x in c)

In [32]:
for i in c:
    print(i)

tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[ 7,  8,  9],
        [10, 11, 12]])


In [33]:
for i in c_:
    print(i)

tensor([[4, 5, 6]])
tensor([[10, 11, 12]])


In [34]:
torch.cat(c_,dim=0)

tensor([[ 4,  5,  6],
        [10, 11, 12]])

So, logically, depthwise pooling involves splitting the tensor into groups along the channel dimension, taking the max within each group, and then concatenating the results:

In [56]:
class DepthPool(nn.Module):
    def __init__(self, pool_size=2):
        super().__init__()
        self.pool_size = pool_size

    def forward(self, inputs):
        # inputs are assumed to have shape N x C x H x W
        # spatial step: a regular max pool with the same pool size (the stride defaults
        # to the kernel size), so H and W shrink by a factor of pool_size
        inputs = nn.functional.max_pool2d(inputs, kernel_size=self.pool_size)
        # depthwise step: instead of sliding a window across each feature map,
        # use the pool size to reduce the number of channels
        in_channels = inputs.shape[1]
        groups = in_channels // self.pool_size
        # torch.chunk returns at most `groups` chunks; when C is not divisible by
        # pool_size the chunk sizes differ from pool_size (7 channels -> 3, 3 and 1)
        chunks = torch.chunk(inputs, chunks=groups, dim=1)
        # take the max over the channel dimension within each chunk
        chunks = tuple(torch.max(chunk, dim=1, keepdim=True).values for chunk in chunks)
        # concatenate the per-group maxima back into one tensor
        return torch.cat(chunks, dim=1)

In [57]:
x = torch.randn(1, 7, 32, 32)

In [58]:
pool = DepthPool(2)

In [59]:
pool(x).shape

torch.Size([1, 3, 16, 16])

So that's depthwise pooling implemented.
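For comparison, the grouping step can also be sketched without a Python loop by reshaping the channel axis into (groups, pool_size) and reducing over the new axis. This is only a sketch under stated assumptions: the function name `depth_max_pool` is hypothetical, it requires the channel count to be divisible by the pool size, and unlike the `DepthPool` class it pools over depth only and leaves H and W untouched.

```python
import torch


def depth_max_pool(inputs: torch.Tensor, pool_size: int = 2) -> torch.Tensor:
    """Max over consecutive groups of `pool_size` channels (hypothetical helper)."""
    n, c, h, w = inputs.shape
    if c % pool_size != 0:
        raise ValueError("number of channels must be divisible by pool_size")
    # split the channel axis into (groups, pool_size) and reduce over pool_size
    grouped = inputs.reshape(n, c // pool_size, pool_size, h, w)
    return grouped.amax(dim=2)


x = torch.randn(1, 8, 32, 32)
out = depth_max_pool(x, pool_size=2)
print(out.shape)  # torch.Size([1, 4, 32, 32])
```

Each output channel is the elementwise maximum of one pair of neighbouring input channels, which is the same result the chunk-and-concatenate approach gives when the channels divide evenly.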

Another common pooling layer in modern conv net architectures is global average pooling. All it does is compute the mean of each entire feature map (it is like an average pooling layer whose kernel has the same spatial dimensions as the inputs). This means that it outputs a single number per feature map and per instance.

In [62]:
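class GlobalAveragePooling(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        # inputs are assumed to have shape N x C x H x W
        n, c, h, w = inputs.shape
        # use the full (h, w) extent as the kernel so non-square inputs also reduce to 1 x 1
        return nn.functional.avg_pool2d(inputs, kernel_size=(h, w))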

In [63]:
x = torch.randn(1, 7, 32, 32)
global_pool = GlobalAveragePooling()

In [None]:
global_pool(x).shape

torch.Size([1, 7, 1, 1])
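As a quick cross-check, the same reduction can be written as a plain mean over the two spatial axes, or with PyTorch's built-in `nn.AdaptiveAvgPool2d(1)`. The sketch below compares the two on a non-square input; the input shape and variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 7, 32, 48)  # non-square feature maps

# mean over the H and W axes, keeping them as size-1 dimensions
via_mean = x.mean(dim=(2, 3), keepdim=True)

# built-in layer that pools any input down to a 1 x 1 spatial size
via_adaptive = nn.AdaptiveAvgPool2d(1)(x)

print(via_mean.shape)  # torch.Size([2, 7, 1, 1])
print(torch.allclose(via_mean, via_adaptive, atol=1e-6))
```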


**CNN Architectures:**

A common mistake is to use convolution kernels that are too large. For example, instead of using a conv layer with a 5x5 kernel, stack two layers with 3x3 kernels: the stack covers the same 5x5 receptive field with fewer parameters and adds an extra nonlinearity in between. One exception is the first conv layer, which can typically have a larger kernel.
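The trade-off can be checked by counting parameters. The sketch below assumes 64 input and output channels and no padding, both arbitrary choices for illustration; `count_params` is a hypothetical helper.

```python
import torch
import torch.nn as nn

channels = 64  # assumed channel count, chosen only for illustration

single_5x5 = nn.Conv2d(channels, channels, kernel_size=5)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3),
    nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3),
)


def count_params(module: nn.Module) -> int:
    """Total number of weights and biases in a module."""
    return sum(p.numel() for p in module.parameters())


x = torch.randn(1, channels, 32, 32)
print(count_params(single_5x5))   # 102464
print(count_params(stacked_3x3))  # 73856
# without padding, both reduce 32 x 32 to 28 x 28 (same receptive field)
print(single_5x5(x).shape, stacked_3x3(x).shape)
```

Under these assumptions the two stacked 3x3 layers need roughly 28% fewer parameters than the single 5x5 layer while producing an output of the same spatial size.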

**LeNet-5**

**AlexNet:**
- First to stack conv layers directly on top of one another, instead of placing a pooling layer immediately after each conv layer.
- To reduce overfitting, the authors used dropout (with a 50% dropout rate) on the FC layers, and data augmentation.
- Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making it a regularization technique. The generated instances should be as realistic as possible: ideally, given an image from the augmented training set, a human should not be able to tell whether it was augmented or not.
- AlexNet also uses a competitive normalization step immediately after the ReLU step of the first two conv layers, called local response normalization (LRN): the most strongly activated neurons inhibit other neurons located at the same position in neighbouring feature maps. LRN encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization.
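The inhibition effect of LRN can be seen on a toy tensor with PyTorch's `nn.LocalResponseNorm`. This is a sketch, not a replica of AlexNet: the hyperparameter values follow those reported in the AlexNet paper, but PyTorch's formula scales `alpha` by the window size, so the numbers will not match the paper's formulation exactly. The tensors `quiet` and `loud` are made up for illustration.

```python
import torch
import torch.nn as nn

# window of 5 neighbouring feature maps, alpha=1e-4, beta=0.75, k=2
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

# toy activations: 1 instance, 6 feature maps, a single spatial position
quiet = torch.ones(1, 6, 1, 1)
loud = quiet.clone()
loud[0, 2] = 10.0  # feature map 2 fires strongly at this position

out_quiet = lrn(quiet)
out_loud = lrn(loud)

# maps whose window includes map 2 are scaled down slightly more in `loud`
print(out_quiet[0, :, 0, 0])
print(out_loud[0, :, 0, 0])
```

With such a small `alpha` the inhibition is subtle: feature maps 0 to 4 have map 2 inside their window and come out slightly smaller in `out_loud`, while map 5 is unaffected.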