### Convolutional Neural Network (CNN)

Imagine this scenario: you have 10k training images, each RGB and 200 * 200 pixels. For a normal (fully connected) NN the input would be $3(RGB)*200*200=120,000$ input nodes, and every neuron in the first hidden layer would need 120,000 weights. That is too much, both in computation and in number of parameters.

Instead of connecting every pixel to every neuron, CNNs use convolutional filters (kernels).
- A filter is a small matrix (typically 3x3 or 5x5) that slides across the image
- At each position, it computes a dot product between its weights and the local patch of the image
- The result forms one pixel in the feature map (activation map)
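
The sliding step above can be sketched in plain Python. This is a minimal illustration, not how a real framework implements it: it handles a single channel, no padding, and the function name `conv2d_single` and the hand-picked edge kernel are assumptions made for the example (in a real CNN the kernel weights are learned).

```python
def conv2d_single(image, kernel, stride=1):
    """Slide a square kernel over a 2D image (lists of lists), no padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the kernel weights and the local patch.
            value = sum(
                kernel[a][b] * image[top + a][left + b]
                for a in range(k)
                for b in range(k)
            )
            row.append(value)
        feature_map.append(row)
    return feature_map


# A 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Hypothetical kernel that responds where brightness increases left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_single(image, vertical_edge)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The feature map is 0 over the flat dark region and 3 wherever the patch covers the dark-to-bright transition, which is what "detecting an edge" means here.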

This way, one filter learns to detect one type of feature (e.g. an edge or a texture). Multiple filters produce multiple feature maps, each spotting a different pattern.

If the input image has size H * W, with filter size K, stride S, and padding P, then the output feature map size is given by:
$$
\begin{aligned}
H_{out} &= \left\lfloor \frac{H-K+2P}{S} \right\rfloor+1 \\
W_{out} &= \left\lfloor \frac{W-K+2P}{S} \right\rfloor+1
\end{aligned}
$$

$K$: kernel size

$S$: stride, how many pixels the filter moves at each step

$P$: padding, an extra border of zeros added around the image, so that pixels at the border are not lost and the output does not shrink
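
As a quick check of the formula above, here is a small helper that evaluates it for one dimension. The function name `conv_output_size` is just an illustrative choice; the same function works for height and width.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one dimension: floor((size - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1


# 200-pixel side, 3x3 kernel, stride 1:
print(conv_output_size(200, kernel=3, stride=1, padding=0))  # 198, shrinks by 2
print(conv_output_size(200, kernel=3, stride=1, padding=1))  # 200, size preserved
# The same formula describes a 2x2 window moved with stride 2:
print(conv_output_size(200, kernel=2, stride=2, padding=0))  # 100
```

Note how padding 1 with a 3x3 kernel and stride 1 keeps the 200-pixel side unchanged, while no padding loses one pixel on each border.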

If the input is RGB, then each filter has depth 3 (3 * K * K weights). Applying 32 filters results in an output of 32 feature maps.

So with stride 1 and padding that preserves the size (e.g. K = 3, P = 1), the final result will be 32 * 200 * 200.
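
To see why this is still cheap in terms of parameters, the sketch below counts the weights of such a layer. It assumes 3x3 kernels and one bias per filter (neither is fixed by the text above), and compares against a hypothetical fully connected layer with 32 units on the same 120,000 inputs. The helper names are made up for the example.

```python
def conv_params(in_channels, kernel, filters):
    # Each filter holds in_channels * kernel * kernel weights plus one bias (assumed).
    return filters * (in_channels * kernel * kernel + 1)


def dense_params(inputs, units):
    # Each unit holds one weight per input plus one bias (assumed).
    return units * (inputs + 1)


conv = conv_params(in_channels=3, kernel=3, filters=32)
dense = dense_params(inputs=3 * 200 * 200, units=32)
print(conv)   # 896
print(dense)  # 3840032
```

The convolutional layer reuses the same few weights at every position, so its parameter count does not depend on the image size at all.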

But wait, shouldn't the size be shrinking? A 32 * 200 * 200 output has 1,280,000 values, far more than the 120,000 input values.

To reduce size and keep important info, we use pooling:
- Max pooling: takes the maximum value in each patch

So a 2x2 max pooling with stride 2 halves the width and height to 100x100 while keeping the 32 feature maps, so the output still holds $32*100*100 = 320,000$ values.
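
Here is a minimal pure-Python sketch of max pooling on a single feature map, to make the "maximum of each patch" idea concrete. The function name `max_pool2d` and the sample values are illustrative assumptions.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size patch of a 2D feature map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

A 4x4 map becomes 2x2: each output value is the strongest response in its 2x2 patch, so width and height are halved while the most prominent activations survive.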

So we need to find a balance between the number of feature maps and how much we scale the original image down.

#### Balance between depth and resolution

Convolutions usually increase the number of feature maps (depth),  
while pooling reduces their width and height (resolution).  
A good CNN architecture carefully balances these two effects,  
so the network extracts rich features **without exploding in size**.

**Example pipeline:**
- Input: (3×200×200)  
- Conv (32 filters, 3×3, stride=1, padding=1): (32×200×200)  
- Max Pooling (2×2, stride=2): (32×100×100)  
- Conv (64 filters, 3×3, stride=1, padding=1): (64×100×100)  
- Max Pooling (2×2, stride=2): (64×50×50)  
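
The shapes in this pipeline can be traced with a few lines of Python. This sketch assumes both convolutions use 3x3 kernels with stride 1 and padding 1, and both pooling layers use 2x2 windows with stride 2; the layer tuples and the `out_size` helper are made up for the illustration.

```python
def out_size(size, kernel, stride, padding):
    return (size - kernel + 2 * padding) // stride + 1


# (kind, filters, kernel, stride, padding); pooling keeps the channel count.
layers = [
    ("conv", 32, 3, 1, 1),
    ("pool", None, 2, 2, 0),
    ("conv", 64, 3, 1, 1),
    ("pool", None, 2, 2, 0),
]

channels, height, width = 3, 200, 200
shapes = [(channels, height, width)]
for kind, filters, kernel, stride, padding in layers:
    if kind == "conv":
        channels = filters  # one feature map per filter
    height = out_size(height, kernel, stride, padding)
    width = out_size(width, kernel, stride, padding)
    shapes.append((channels, height, width))

for c, h, w in shapes:
    print(f"({c}x{h}x{w}) -> {c * h * w:,} values")
```

Under these assumptions the value counts go 120,000, 1,280,000, 320,000, 640,000, 160,000: depth doubles at each convolution while each pooling step cuts the spatial area by four.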