# Understanding Layers in CNN

This notebook demonstrates how the basic building blocks of CNNs work:

- Convolution Layer
- Pooling Layer
    - Pooling layers are crucial for reducing the dimensionality of feature maps while retaining essential information. They help control overfitting, reduce computation, and provide translational invariance. Choosing the appropriate pooling strategy depends on the specific needs of the model architecture and the nature of the input data.


In [1]:
import torch
import torch.nn as nn
import numpy as np

## Convolutional Module

https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html

A convolution layer is implemented with the `nn.Conv2d` module. Its key input arguments are:

- **in_channels**: Number of input channels
  - For an RGB image, the number of `in_channels` is 3.
  - For an intermediate layer, it is equal to the number of kernels (or filters or feature maps) in the previous layer.

- **out_channels**: Indicates the number of filters (kernels) in the convolution layer.
  - Each filter produces one feature map, so this is also the number of channels in the output.
  
- **kernel_size**: Indicates the size of convolution filters or kernels.
  - Specified as a single integer for square kernels (e.g., `3` for a 3x3 filter) or a tuple for non-square kernels (e.g., `(3, 5)` for a 3x5 filter).

- **padding**: Indicates the width of the border added around the input (zeros by default).
  - Padding is used to control the spatial dimensions of the output.
  - The amount of padding often depends on the size of the kernel:
    - To keep the output the same size as the input (when stride is 1), set `padding = (kernel_size - 1) / 2`. This works for odd kernel sizes (e.g., `padding=1` for a 3x3 kernel).

- **stride**: This defines the step size with which the convolutional filter moves across the input. The stride can be specified as a single integer for a uniform stride in all directions or as a tuple for different strides in height and width. A larger stride reduces the spatial dimensions of the output feature map.


- **dilation**: The dilation parameter controls the spacing between elements of the kernel. It can be used to expand the field of view of the kernel without increasing the number of parameters. Dilation is specified as a single integer or a tuple, indicating the dilation rate along height and width. A dilation rate of 1 means no dilation.


- **groups**: This parameter controls the connections between inputs and outputs. Both `in_channels` and `out_channels` must be divisible by `groups`. Setting `groups=1` means a standard convolution, where each input channel is convolved with every filter. If `groups` equals the number of `in_channels`, it performs a depthwise convolution, where each input channel is convolved with its own set of filters.


- **bias**: This is a boolean parameter that specifies whether to include a learnable bias in the convolution operation. By default, it is set to `True`, meaning that a bias will be added to the output of each filter.


- **padding_mode**: This determines the type of padding to use. The default is `'zeros'`, meaning zero-padding, but it can also be set to `'reflect'`, `'replicate'`, or `'circular'` for other types of padding.
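
The effect of `stride`, `dilation`, and `groups` is easiest to see from tensor shapes. The following is a minimal sketch; the channel counts and the 16x16 input are arbitrary illustrative choices, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 4, 16, 16)  # (batch, channels, height, width); sizes are arbitrary

# Standard convolution: every filter spans all 4 input channels
standard = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1)

# Stride 2 moves the kernel two pixels at a time, shrinking the output
strided = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1, stride=2)

# Dilation 2 spreads the 3x3 kernel over a 5x5 area; padding=2 keeps the size
dilated = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=2, dilation=2)

# groups=in_channels: each filter sees only one input channel (depthwise)
depthwise = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, padding=1, groups=4)

shapes = {}
for name, layer in [("standard", standard), ("strided", strided),
                    ("dilated", dilated), ("depthwise", depthwise)]:
    shapes[name] = (tuple(layer(x).shape), tuple(layer.weight.shape))
    print(f"{name:10s} output {shapes[name][0]}  weight {shapes[name][1]}")
```

Note how the depthwise layer's weight has only one input channel per filter, which is where its parameter savings come from.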


### Relation between Convolution parameters

For a given spatial dimension (either height $H$ or width $W$), and assuming no dilation (`dilation=1`):

$O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1$

Where:
- $ O $ is the output dimension (either height or width).
- $ I $ is the input dimension (either height or width).
- $ P $ is the padding applied to the input.
- $ K$ is the kernel size.
- $ S $ is the stride of the convolution.
- $\left\lfloor \cdot \right\rfloor$ denotes the floor operation, which rounds down to the nearest integer.

### Example
- Input size $ I = 32 $ (e.g., a 32x32 image),
- Kernel size $ K = 3 $,
- Padding $ P = 1 $,
- Stride $ S = 1 $.

Plug in these values into the formula:

$ O = \left\lfloor \frac{32 + 2 \times 1 - 3}{1} \right\rfloor + 1 = \left\lfloor \frac{32 + 2 - 3}{1} \right\rfloor + 1 = \left\lfloor \frac{31}{1} \right\rfloor + 1 = 32 $
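
The formula can be checked numerically. The helper below, `conv_output_size`, is a hypothetical name introduced for this sketch; its optional dilation argument `d` extends the formula above following the shape rule in the `nn.Conv2d` documentation, and reduces to it when `d=1`.

```python
import torch
import torch.nn as nn

def conv_output_size(i, k, p=0, s=1, d=1):
    """Output length along one spatial dimension.

    With d=1 this equals floor((i + 2p - k) / s) + 1.
    """
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

# The worked example: I=32, K=3, P=1, S=1
print(conv_output_size(32, 3, p=1, s=1))

# Cross-check the helper against nn.Conv2d for a few (kernel, padding, stride) settings
for k, p, s in [(3, 1, 1), (3, 0, 1), (5, 2, 2), (3, 1, 2)]:
    layer = nn.Conv2d(1, 1, kernel_size=k, padding=p, stride=s)
    out = layer(torch.rand(1, 1, 32, 32))
    assert out.shape[-1] == conv_output_size(32, k, p, s)
```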



In [2]:
# Three kernels, each spanning 2 input channels with a 3x3 window
conv_filters = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3, padding=1)

# The weight tensor has shape (out_channels, in_channels, kernel_height, kernel_width):
#   out_channels:  3 filters, one per output feature map
#   in_channels:   2 channels from the input
#   kernel_height: 3 pixels
#   kernel_width:  3 pixels
print(f'Size of convolution filters: {conv_filters.weight.size()}')
print("Conv Filter Weights")
print(conv_filters.weight)

Size of convolution filters: torch.Size([3, 2, 3, 3])
Conv Filter Weights
Parameter containing:
tensor([[[[-0.0868,  0.1588,  0.0101],
          [ 0.2002, -0.2348, -0.2257],
          [-0.0876,  0.1261, -0.0050]],

         [[-0.1209, -0.1802, -0.0792],
          [-0.0016,  0.0676,  0.1228],
          [-0.1594,  0.1482, -0.1149]]],


        [[[-0.1057, -0.0372, -0.2201],
          [ 0.1505, -0.2089,  0.0772],
          [-0.0564,  0.1741, -0.1507]],

         [[-0.0651,  0.0247, -0.0753],
          [ 0.1960, -0.2213, -0.1049],
          [ 0.0806, -0.1306,  0.0799]]],


        [[[ 0.2138, -0.1937, -0.1818],
          [-0.1340, -0.0774,  0.0700],
          [ 0.0918, -0.0159,  0.1855]],

         [[-0.2290,  0.2192,  0.1438],
          [ 0.1562,  0.0672, -0.0123],
          [ 0.2122,  0.2040,  0.0350]]]], requires_grad=True)


#### Make a batch of 3 inputs, each 2 x 3 x 4 (two channels of 3 x 4 pixels)

In [3]:
batch_size = 3
input_channels = 2
input_height = 3
input_width = 4
data = torch.rand(batch_size, input_channels, input_height, input_width)
print(f"Dimensions of the data: {data.size()}")
print(data)


Dimensions of the data: torch.Size([3, 2, 3, 4])
tensor([[[[0.0249, 0.5162, 0.2239, 0.7541],
          [0.0098, 0.8837, 0.9687, 0.8592],
          [0.2849, 0.7420, 0.5915, 0.4588]],

         [[0.3153, 0.5290, 0.3056, 0.7309],
          [0.6363, 0.2823, 0.6907, 0.2244],
          [0.6115, 0.1865, 0.8038, 0.5614]]],


        [[[0.1366, 0.5261, 0.0554, 0.1270],
          [0.1539, 0.0288, 0.0954, 0.7523],
          [0.9725, 0.9499, 0.0919, 0.4983]],

         [[0.5989, 0.8211, 0.5218, 0.6614],
          [0.9070, 0.1353, 0.0649, 0.9614],
          [0.4433, 0.0429, 0.6795, 0.5759]]],


        [[[0.6359, 0.9382, 0.9903, 0.0515],
          [0.7929, 0.1298, 0.1630, 0.3775],
          [0.4837, 0.8303, 0.8130, 0.8790]],

         [[0.2229, 0.0761, 0.9712, 0.8842],
          [0.7572, 0.1363, 0.7073, 0.9360],
          [0.6933, 0.9016, 0.2981, 0.4730]]]])


#### Feed it to the convolutional layer

In [4]:
output_conv_filters = conv_filters(data)
print(f"Output of convolution filters: {output_conv_filters.size()}")
print(output_conv_filters)

Output of convolution filters: torch.Size([3, 3, 3, 4])
tensor([[[[-0.0919, -0.2418, -0.0524, -0.2513],
          [-0.2267, -0.6080, -0.4316, -0.2155],
          [-0.4112, -0.2882, -0.1761, -0.1388]],

         [[-0.1101,  0.0753,  0.1609,  0.0681],
          [-0.2055,  0.0656, -0.3154,  0.2089],
          [-0.1850, -0.2332, -0.3409,  0.0252]],

         [[ 0.1683,  0.2528,  0.3102,  0.0920],
          [ 0.2273,  0.1964,  0.1184,  0.0666],
          [-0.0978, -0.4698, -0.2568, -0.2039]]],


        [[[ 0.0141, -0.2470, -0.0524,  0.1292],
          [-0.0466, -0.3942, -0.5064, -0.3424],
          [-0.6714, -0.2250, -0.0258, -0.2452]],

         [[-0.1193,  0.0110,  0.2068,  0.1154],
          [-0.2686,  0.4150, -0.2015, -0.1374],
          [-0.0597,  0.0348, -0.1194,  0.0694]],

         [[ 0.0650,  0.1568,  0.1215,  0.1312],
          [ 0.2453,  0.2027,  0.3173,  0.0946],
          [ 0.0197, -0.4765, -0.2659, -0.0196]]],


        [[[-0.2553, -0.5427, -0.0139,  0.1890],
          [-0.14

### Simulate a convolutional layer with NumPy

Re-implementing the layer with explicit loops helps us understand how a convolutional layer computes its output.

In [5]:
# Define the convolutional layer
conv_filters = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)

# Create a random input tensor with shape (batch_size, channels, height, width).
# The height and width of the input must be at least the kernel size, and the
# number of channels must match in_channels of the convolutional layer.
input_tensor = torch.randn(1, 1, 5, 5)

# Apply the convolutional layer
output_tensor = conv_filters(input_tensor)

# Print the input tensor
print("Input Tensor:")
print(input_tensor)

# Print the output tensor
print("\nOutput Tensor:")
print(output_tensor)

# Extract convolution parameters
weight = conv_filters.weight.detach().numpy()
bias = conv_filters.bias.detach().numpy()

print("\nConvolution Weights:")
print(weight)

print("\nConvolution Bias:")
print(bias)

# Define a function to perform convolution using NumPy
def numpy_conv2d(input_data, weight, bias, stride=1, padding=0):
    # Assuming input_data shape is (batch_size, channels, height, width)
    batch_size, in_channels, in_height, in_width = input_data.shape
    out_channels, _, kernel_height, kernel_width = weight.shape

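    # Zero-pad height and width so that the padding argument is actually applied
    if padding > 0:
        pad_width = ((0, 0), (0, 0), (padding, padding), (padding, padding))
        input_data = np.pad(input_data, pad_width, mode="constant")

    # Output dimensions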
    out_height = (in_height - kernel_height + 2 * padding) // stride + 1
    out_width = (in_width - kernel_width + 2 * padding) // stride + 1
    
    # Initialize the output
    output_data = np.zeros((batch_size, out_channels, out_height, out_width))

    # Perform convolution
    for b in range(batch_size):
        for oc in range(out_channels):
            for oh in range(out_height):
                for ow in range(out_width):
                    # Calculate corresponding input region and apply convolution
                    h_start = oh * stride
                    h_end = h_start + kernel_height
                    w_start = ow * stride
                    w_end = w_start + kernel_width

                    region = input_data[b, :, h_start:h_end, w_start:w_end]
                    output_data[b, oc, oh, ow] = np.sum(region * weight[oc]) + bias[oc]

    return output_data

# Convert input_tensor to NumPy array
input_data_numpy = input_tensor.detach().numpy()

# Perform the convolution using NumPy
output_numpy = numpy_conv2d(input_data_numpy, weight, bias)

print("\nOutput from NumPy convolution:")
print(output_numpy)

Input Tensor:
tensor([[[[-0.3633, -1.1081, -1.9856,  1.5418, -0.8696],
          [ 1.1699,  0.8141,  1.3254, -1.2161, -0.1563],
          [ 0.6774,  1.5959,  0.6463,  1.6820,  0.5404],
          [-1.1969,  0.8465, -1.6798,  0.0784, -0.6348],
          [ 1.2818, -0.0113,  0.2723, -1.4679,  0.7478]]]])

Output Tensor:
tensor([[[[ 0.2038,  0.9574,  0.1992],
          [ 0.5795, -0.8319, -0.0182],
          [-0.5165,  0.8156, -1.2960]]]], grad_fn=<ConvolutionBackward0>)

Convolution Weights:
[[[[-0.02017804 -0.18675594  0.3170899 ]
   [ 0.23428199 -0.04005798  0.03704413]
   [ 0.14230311  0.31781924 -0.16872013]]]]

Convolution Bias:
[-0.16598439]

Output from NumPy convolution:
[[[[ 0.20382187  0.95738828  0.19921386]
   [ 0.57949209 -0.83186579 -0.01816052]
   [-0.51645017  0.81558001 -1.29595339]]]]
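
Comparing printed values by eye works for one small case, but a programmatic check covers more settings. The sketch below is a compact, self-contained variant of the loop-based approach above (`conv2d_reference` is a name made up for this example) and compares it with `torch.nn.functional.conv2d` for a few stride and padding choices. It assumes zero padding and no dilation or groups.

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv2d_reference(x, w, b, stride=1, padding=0):
    """Loop-based 2D convolution on NumPy arrays (zero padding only)."""
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    n, _, h, w_in = x.shape
    oc, _, kh, kw = w.shape
    oh = (h - kh) // stride + 1
    ow = (w_in - kw) // stride + 1
    out = np.zeros((n, oc, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, :, i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Sum over channels and kernel positions for every (sample, filter) pair
            out[:, :, i, j] = np.einsum("nchw,ochw->no", patch, w) + b
    return out

torch.manual_seed(0)
x = torch.randn(2, 3, 7, 7)
w = torch.randn(4, 3, 3, 3)
b = torch.randn(4)

results = []
for stride, padding in [(1, 0), (1, 1), (2, 1)]:
    expected = F.conv2d(x, w, b, stride=stride, padding=padding).numpy()
    actual = conv2d_reference(x.numpy(), w.numpy(), b.numpy(), stride, padding)
    results.append(np.allclose(actual, expected, atol=1e-4))
    print(f"stride={stride}, padding={padding}: match={results[-1]}")
```

The tolerance is needed because PyTorch computes in float32 while the NumPy accumulator here is float64.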


## MaxPool2d Layer

https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html

The `nn.MaxPool2d` layer is used for downsampling operations in convolutional neural networks. It reduces the spatial dimensions (height and width) of the input while retaining the most salient features. 

Here are the key parameters and their explanations:

- **kernel_size**: Specifies the size of the window to take a max over.
  - If a single integer is provided, the window will be square (e.g., `2` for a 2x2 window).
  - A tuple can be used for non-square windows (e.g., `(2, 3)` for a window 2 pixels high and 3 pixels wide).

- **stride**: Specifies the step size with which the pooling window moves across the input.
  - If not specified, it defaults to the value of `kernel_size`.
  - A larger stride results in more aggressive downsampling.

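- **padding**: Adds implicit padding to the input on both sides. For max pooling the padded values are negative infinity rather than zero, so they never win the max, and the padding can be at most half the kernel size.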
  - This can help control the dimensions of the output.
  - Typically used less frequently in pooling layers compared to convolutional layers.

- **dilation**: Controls the spacing between elements in the pooling window.
  - Default is `1`, meaning standard max pooling.
  - Larger values can result in a wider effective pooling window.

- **return_indices**: If `True`, returns the indices of the max values along with the outputs.
  - Useful for operations that require the location of maximum values, such as `nn.MaxUnpool2d`.

- **ceil_mode**: If `True`, uses the ceiling function to compute the output shape instead of the floor function.
  - Can affect the output size when the input size is not perfectly divisible by the kernel size.

For each region defined by the `kernel_size`, the `nn.MaxPool2d` layer selects the maximum value and discards the rest, effectively reducing the size of the input feature map. 

Think of this as feature selection.

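By reducing spatial dimensions, pooling layers decrease the computation in later layers. Pooling itself has no learnable parameters, but smaller feature maps mean fewer parameters in any fully connected layers that follow, which helps control overfitting.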
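
The window-wise maximum is easy to verify by hand on a tiny input. This sketch uses a 4x4 map filled with 0..15 (an arbitrary choice that makes the result readable) and also illustrates `return_indices` and `ceil_mode` on a 5x5 input.

```python
import torch
import torch.nn as nn

# A 4x4 single-channel input with values 0..15, shaped (batch, channels, H, W)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
values, indices = pool(x)

print(x[0, 0])
print(values[0, 0])   # maximum of each non-overlapping 2x2 window
print(indices[0, 0])  # flat positions of those maxima within the 4x4 map

# With a 5x5 input, a 2x2 window with stride 2 does not tile the input exactly:
# floor mode drops the last row and column, ceil mode keeps a partial window.
y = torch.rand(1, 1, 5, 5)
floor_shape = tuple(nn.MaxPool2d(kernel_size=2, stride=2)(y).shape)
ceil_shape = tuple(nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)(y).shape)
print(floor_shape, ceil_shape)
```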

#### Relation between input and output sizes

Given an input of size `(H_in, W_in)`, the output dimensions `(H_out, W_out)` can be calculated using the following formulas:

   - $ H_{out} = \left\lfloor \frac{H_{in} + 2 \times \text{padding}[0] - \text{kernel\_size}[0]}{\text{stride}[0]} \right\rfloor + 1 $

   - $ W_{out} = \left\lfloor \frac{W_{in} + 2 \times \text{padding}[1] - \text{kernel\_size}[1]}{\text{stride}[1]} \right\rfloor + 1 $

These formulas assume `dilation=1` and `ceil_mode=False`. With `ceil_mode=True`, the floor is replaced by a ceiling, which can add one extra output row or column when the pooling window does not tile the input exactly.



#### Example

Consider an example with the following parameters:

- `kernel_size` = 2
- `stride` = 2
- Input size: `(1, 3, 32, 32)`, where 1 is the batch size, 3 is the depth (number of channels), and the last two integers are the height and width

When applied:

- The input dimensions of `(32, 32)` are reduced to `(16, 16)`, assuming no padding and a stride equal to the kernel size. Max pooling does not change the number of channels.
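
The example can be reproduced directly. In this sketch, `pool_output_size` is a hypothetical helper that implements the output-size formula for one dimension, assuming `dilation=1` and `ceil_mode=False`.

```python
import torch
import torch.nn as nn

def pool_output_size(size, kernel_size, stride=None, padding=0):
    """Output length along one spatial dimension (floor mode, no dilation)."""
    stride = kernel_size if stride is None else stride  # mirrors the PyTorch default
    return (size + 2 * padding - kernel_size) // stride + 1

x = torch.rand(1, 3, 32, 32)
out = nn.MaxPool2d(kernel_size=2, stride=2)(x)

print(tuple(out.shape))
print(pool_output_size(32, kernel_size=2, stride=2))
```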


In [6]:
max_pool_2D = nn.MaxPool2d(kernel_size=2, stride=2)

In [7]:
output_max_pool = max_pool_2D(output_conv_filters)
print(f"Size of output of convolution filters: {output_conv_filters.size()}")
print(f"Size of output of maxpool: {output_max_pool.size()}")
print("Input of maxpool: \n")
print(output_conv_filters)
print("Output of max pool: \n")
print(output_max_pool)

Size of output of convolution filters: torch.Size([3, 3, 3, 4])
Size of output of maxpool: torch.Size([3, 3, 1, 2])
Input of maxpool: 

tensor([[[[-0.0919, -0.2418, -0.0524, -0.2513],
          [-0.2267, -0.6080, -0.4316, -0.2155],
          [-0.4112, -0.2882, -0.1761, -0.1388]],

         [[-0.1101,  0.0753,  0.1609,  0.0681],
          [-0.2055,  0.0656, -0.3154,  0.2089],
          [-0.1850, -0.2332, -0.3409,  0.0252]],

         [[ 0.1683,  0.2528,  0.3102,  0.0920],
          [ 0.2273,  0.1964,  0.1184,  0.0666],
          [-0.0978, -0.4698, -0.2568, -0.2039]]],


        [[[ 0.0141, -0.2470, -0.0524,  0.1292],
          [-0.0466, -0.3942, -0.5064, -0.3424],
          [-0.6714, -0.2250, -0.0258, -0.2452]],

         [[-0.1193,  0.0110,  0.2068,  0.1154],
          [-0.2686,  0.4150, -0.2015, -0.1374],
          [-0.0597,  0.0348, -0.1194,  0.0694]],

         [[ 0.0650,  0.1568,  0.1215,  0.1312],
          [ 0.2453,  0.2027,  0.3173,  0.0946],
          [ 0.0197, -0.4765, -0.2659

### Other Pooling Layers in PyTorch

#### 1. Average Pooling (`nn.AvgPool2d`)

Instead of taking the maximum value, average pooling computes the average of all values in the defined window. It aggregates features more smoothly than max pooling, which aggressively selects only the strongest response.

- **Parameters**:
  - **`kernel_size`**: The size of the window over which the average is computed.
  - **`stride`**: Determines the step size of the window. Defaults to `kernel_size` if not specified.
  - **`padding`**: Adds zero-padding around the input.
  - **`count_include_pad`**: When set to `True`, includes padded zeros in the averaging calculation.
  - **`ceil_mode`**: Uses the ceiling function to determine output size if set to `True`.
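
The effect of `count_include_pad` shows up only at the borders. A minimal sketch, using an all-ones 4x4 input (chosen so that any value below 1 in the output must come from padded zeros):

```python
import torch
import torch.nn as nn

x = torch.ones(1, 1, 4, 4)

# Same window, stride, and padding; only the treatment of padded zeros differs
with_pad = nn.AvgPool2d(kernel_size=2, stride=2, padding=1, count_include_pad=True)
without_pad = nn.AvgPool2d(kernel_size=2, stride=2, padding=1, count_include_pad=False)

out_with = with_pad(x)
out_without = without_pad(x)

print(out_with[0, 0])     # border averages are pulled toward zero
print(out_without[0, 0])  # padded positions are left out of the average
```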

#### 2. Adaptive Pooling (`nn.AdaptiveMaxPool2d`, `nn.AdaptiveAvgPool2d`)

Adaptive pooling layers output a fixed size, regardless of the input size. They are particularly useful for creating networks that can handle variable input sizes by producing consistent output dimensions. 

They are commonly used in classification networks, where a fixed-size output is required before the fully connected layers.

- **`nn.AdaptiveMaxPool2d`**: Similar to max pooling, but adapts output size.
- **`nn.AdaptiveAvgPool2d`**: Similar to average pooling, but adapts output size.


Both layers take the target output size `(height, width)`. The layer then works out how to divide the input into pooling regions to produce that size.
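
A short sketch of the fixed-output behaviour: the same adaptive layer is applied to inputs of different spatial sizes (the sizes below are arbitrary examples).

```python
import torch
import torch.nn as nn

adaptive = nn.AdaptiveAvgPool2d(output_size=(4, 4))

output_shapes = []
for h, w in [(32, 32), (17, 23), (8, 64)]:
    x = torch.rand(1, 3, h, w)
    output_shapes.append(tuple(adaptive(x).shape))
    print((h, w), "->", output_shapes[-1])
```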


#### 3. Global Pooling

Global pooling layers reduce each feature map to a single value, typically by taking the average or maximum across all spatial dimensions. PyTorch does not provide this as a separate class, but it can be implemented using adaptive pooling with an output size of 1:
- `nn.AdaptiveAvgPool2d(output_size=(1, 1))` for global average pooling.
- `nn.AdaptiveMaxPool2d(output_size=(1, 1))` for global max pooling.

Global average pooling is used in architectures like ResNet to reduce feature maps to a vector before feeding them into the final classification layer. It reduces the model's sensitivity to spatial translations of the input.
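
As a sanity check, global pooling through the adaptive layers should match a plain reduction over the height and width axes. A minimal sketch with an arbitrary `(2, 8, 7, 7)` input:

```python
import torch
import torch.nn as nn

x = torch.rand(2, 8, 7, 7)  # (batch, channels, height, width); sizes are arbitrary

gap = nn.AdaptiveAvgPool2d(output_size=(1, 1))(x)  # global average pooling
gmp = nn.AdaptiveMaxPool2d(output_size=(1, 1))(x)  # global max pooling

# Equivalent reductions over the two spatial axes
mean_ref = x.mean(dim=(2, 3), keepdim=True)
max_ref = x.amax(dim=(2, 3), keepdim=True)
print(torch.allclose(gap, mean_ref), torch.equal(gmp, max_ref))

# Flatten to one vector per sample, ready for a linear classifier
features = torch.flatten(gap, start_dim=1)
print(tuple(features.shape))
```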

#### 4. Fractional Max Pooling (`nn.FractionalMaxPool2d`)

Similar to max pooling, but the pooling windows are placed with pseudo-random step sizes chosen to reach a target output size, so windows may overlap and the downsampling factor can be a non-integer. This is useful when pooling with an integer stride would be too coarse or too fine, providing a middle ground.

- **Parameters**:
  - **`kernel_size`**: The size of the pooling region.
  - **`output_size`** or **`output_ratio`**: Specifies the desired output size, or the output size as a fraction of the input size (each ratio must lie between 0 and 1), allowing more flexible control over downsampling.
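
A minimal sketch of both ways to specify the target, using an arbitrary 32x32 input. The values inside the output vary from run to run because the window placement is random, so only the shapes are printed.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 32, 32)

# Target given as a fraction of the input size (0.75 of 32 along each axis)
by_ratio = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.75, 0.75))

# Target given as an explicit output size
by_size = nn.FractionalMaxPool2d(kernel_size=2, output_size=(20, 20))

ratio_shape = tuple(by_ratio(x).shape)
size_shape = tuple(by_size(x).shape)
print(ratio_shape, size_shape)
```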



