# Deep Learning: Max and Mean Pooling

## 🎯 Objective
Convolutional layers detect features, but **Pooling layers** summarize them. In this notebook, we explore the mechanics of **Max Pooling** and **Average Pooling**. We will see how these layers downsample feature maps to reduce computational cost and introduce a small amount of translation invariance (a feature that shifts within one pooling window produces the same output). We will also look at an unusual case: using 3D pooling on 2D images to pool across channels, before building a complete mini-CNN.

## 📚 Key Concepts
* **Max Pooling:** Selects the maximum value within a window. Useful for preserving dominant features (like strong edges) while discarding noise.
* **Average Pooling:** Calculates the mean value within a window. Useful for smoothing features.
* **Spatial vs. Volumetric:** `MaxPool2d` pools over Height/Width. `MaxPool3d` pools over Depth/Height/Width. When a 4D image tensor `(N, C, H, W)` is passed to `MaxPool3d`, PyTorch reads it as an unbatched volume `(C, D, H, W)`, so the image **Channels** land on the "Depth" axis and are pooled too, which compresses the number of feature maps.
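To make the difference between the two operations concrete, here is a minimal sketch on a hand-written 4x4 input. The tensor values and the names `x`, `max_out`, and `avg_out` are illustrative choices, not part of the notebook; the 2x2 window is used only to keep the arithmetic easy to check by hand.

```python
import torch
import torch.nn as nn

# a 1x1x4x4 "image" (batch, channel, height, width) with hand-picked values
x = torch.tensor([[1., 2., 0., 0.],
                  [3., 4., 0., 8.],
                  [5., 5., 1., 1.],
                  [5., 5., 1., 9.]]).reshape(1, 1, 4, 4)

# non-overlapping 2x2 windows: the image splits into four quadrants
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

max_out = max_pool(x)  # largest value per quadrant: [[4, 8], [5, 9]]
avg_out = avg_pool(x)  # mean value per quadrant:    [[2.5, 2], [5, 3]]

print(max_out.squeeze())
print(avg_out.squeeze())
```

In the top-right quadrant the single large value 8 survives max pooling untouched, while average pooling dilutes it to 2.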

## 1. Import Libraries

We import PyTorch and its neural network module.

In [1]:
# import libraries
import torch
import torch.nn as nn

## 2. Creating Pooling Instances

We create instances of pooling layers. 
* **poolSize (3):** The window size (3x3 pixels).
* **stride (3):** The window moves 3 pixels at a time. Because stride equals pool size, the windows do not overlap.
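The effect of window size and stride on the output size can be checked with a small helper. This is a sketch of the standard output-size formula for pooling with `dilation=1` and `ceil_mode=False` (the defaults used here); the function name `pooled_size` is hypothetical and not a PyTorch API.

```python
def pooled_size(n, kernel, stride, padding=0):
    """Output length along one axis (assumes dilation=1, ceil_mode=False)."""
    return (n + 2 * padding - kernel) // stride + 1

# stride equal to the kernel size: non-overlapping windows
non_overlapping = pooled_size(30, kernel=3, stride=3)  # 10

# stride of 1: heavily overlapping windows, far less downsampling
overlapping = pooled_size(30, kernel=3, stride=1)      # 28

print(non_overlapping, overlapping)
```

With a 30-pixel axis, non-overlapping 3-wide windows give 10 outputs, which matches the 30 to 10 reduction seen when the layers are applied to data.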

In [2]:
# create a pool class instance with parameters

# parameters
poolSize = 3
stride   = 3

# create the instance
p2 = nn.MaxPool2d(poolSize,stride=stride)
p3 = nn.MaxPool3d(poolSize,stride=stride)

# let's have a look at them
print(p2)
print(p3)

MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
MaxPool3d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)


## 3. Applying Pooling to Data

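We create two "images" as 4D tensors of shape (batch, channels, height, width): a 2D image with 1 channel and a 3D image with 3 channels. Applying both pooling layers to them shows how PyTorch handles the dimensions.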

### The "Gotcha" with MaxPool3d
* **`p2(img3)`:** `MaxPool2d` applies the 3x3 reduction spatially (Height, Width) for **each channel independently**. The channel count remains 3.
* **`p3(img3)`:** `MaxPool3d` pools over a volume of `(Depth, Height, Width)`. Given the 4D tensor `img3`, it reads the 3 color channels as "Depth" (and the batch axis of size 1 as the channel axis). The 3x3x3 window therefore reduces the channel dimension as well (3 channels $\to$ 1 channel), taking a single maximum across R, G, and B.
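One way to check this reading of `MaxPool3d` is to reproduce its result by hand: pool each channel with `MaxPool2d`, then take the maximum across the channel axis. The sketch below assumes a PyTorch version that accepts 4D input to `MaxPool3d` (as the notebook does); the names `per_channel`, `across_channels`, and `volumetric` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
img3 = torch.randn(1, 3, 30, 30)  # (batch, channels, height, width)

p2 = nn.MaxPool2d(3, stride=3)
p3 = nn.MaxPool3d(3, stride=3)

# spatial pooling only: each channel is reduced independently
per_channel = p2(img3)                                         # (1, 3, 10, 10)

# then collapse the three channels by taking their maximum
across_channels = per_channel.max(dim=1, keepdim=True).values  # (1, 1, 10, 10)

# volumetric pooling does both steps at once
volumetric = p3(img3)                                          # (1, 1, 10, 10)

same = torch.equal(volumetric, across_channels)
print(per_channel.shape, volumetric.shape, same)
```

The two results agree because the 3x3x3 window spans exactly the three channels, so the volumetric maximum is the maximum of the three per-channel maxima.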

In [3]:
# Create image and apply maxpooling

# create a 2D and a 3D image
img2 = torch.randn(1,1,30,30)
img3 = torch.randn(1,3,30,30)


# all combinations of image and maxpool dimensionality
img2Pool2 = p2(img2)
print(f'2D image, 2D maxpool: {img2Pool2.shape}\n' )

# this combination raises an error: the depth axis has size 1, smaller than the 3-wide kernel
# img2Pool3 = p3(img2)
# print(f'2D image, 3D maxpool: {img2Pool3.shape}\n' )

img3Pool2 = p2(img3)
print(f'3D image, 2D maxpool: {img3Pool2.shape}\n' )

img3Pool3 = p3(img3)
print(f'3D image, 3D maxpool: {img3Pool3.shape}\n' )

2D image, 2D maxpool: torch.Size([1, 1, 10, 10])

3D image, 2D maxpool: torch.Size([1, 3, 10, 10])

3D image, 3D maxpool: torch.Size([1, 1, 10, 10])
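The commented-out combination (2D image, 3D maxpool) is missing from the output because it fails. A minimal sketch of why, assuming the same kernel size and stride as above: the single channel is read as a depth axis of size 1, which cannot hold a 3-deep window. The flag `failed` is an illustrative name, and the exact error text depends on the PyTorch version, so it is printed rather than assumed.

```python
import torch
import torch.nn as nn

p3 = nn.MaxPool3d(3, stride=3)
img2 = torch.randn(1, 1, 30, 30)  # read by MaxPool3d as (C=1, D=1, H=30, W=30)

try:
    p3(img2)
    failed = False
except RuntimeError as err:
    # depth of 1 is smaller than the kernel depth of 3
    failed = True
    print(f'MaxPool3d rejected the input: {err}')
```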



## 4. Building a Simple CNN

Now we combine convolution, activation, and pooling into a small network `littlenet`. 

### Understanding the Tensor Shape Transformation
1.  **Input:** (3, 128, 128)
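2.  **Conv2d:** 10 filters, kernel 5, stride 3, padding 2. Output spatial dim $= \lfloor (128 + 2 \cdot 2 - 5)/3 \rfloor + 1 = 43$. Output shape: **(10, 43, 43)**.
3.  **AvgPool3d:** Kernel 3, Stride 3. Pools over (Depth 10, Height 43, Width 43).
    * Depth: $\lfloor (10 - 3)/3 \rfloor + 1 = 3$ channels (the 10th feature map falls outside the last full window and is dropped).
    * Spatial: $\lfloor (43 - 3)/3 \rfloor + 1 = 14$ pixels (the last row and column are dropped).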
    * Result: **(3, 14, 14)**.
4.  **Flatten:** $3 \times 14 \times 14 = 588$ units.
5.  **Linear:** Maps 588 $\to$ 1 output.

In [4]:
littlenet = nn.Sequential(

    ## the conv-pool block
    nn.Conv2d(3,10,5,3,2), # convolution: 3 in, 10 out channels; kernel 5, stride 3, padding 2
    nn.ReLU(),             # activation function
    nn.AvgPool3d(3,3),     # average-pool over (channels, height, width); kernel 3, stride 3

    ## the FFN block
    nn.Flatten(),          # vectorize to get from image to linear
    nn.Linear(588,1),      # FC linear layer
    nn.Sigmoid()           # output activation
  )

In [5]:
# test with a bit of data
img = torch.rand(1,3,128,128)
littlenet(img)

tensor([[0.5359]], grad_fn=<SigmoidBackward0>)
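To confirm the shape walkthrough, the layers can be applied one at a time while recording the output shape after each. The sketch below rebuilds the same layer stack under the name `net` so that it runs on its own; `net` and `shapes` are illustrative names, and in the notebook one could loop over `littlenet` directly.

```python
import torch
import torch.nn as nn

# same layers as littlenet, rebuilt here so the example is self-contained
net = nn.Sequential(
    nn.Conv2d(3, 10, 5, 3, 2),
    nn.ReLU(),
    nn.AvgPool3d(3, 3),
    nn.Flatten(),
    nn.Linear(588, 1),
    nn.Sigmoid(),
)

x = torch.rand(1, 3, 128, 128)
shapes = []

# nn.Sequential is iterable, so each layer can be applied in turn
for layer in net:
    x = layer(x)
    shapes.append((type(layer).__name__, tuple(x.shape)))
    print(f'{type(layer).__name__:<10} -> {tuple(x.shape)}')
```

The recorded shapes include a leading batch axis of size 1, so (10, 43, 43) in the walkthrough appears as (1, 10, 43, 43) here.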