# W04 · Convolutional Neural Networks Basics


<img src = "cnn.png" width = 640 height = 480>

<img src = "dialiated.png" width = 640 height = 480>

<img src = "stride.png" width = 768 height = 320>

## Convolutional Layer Example

**Given**

- Input image: $64 \times 64 \times 3$ (RGB)

**Conv1**

- 32 filters
- Filter/kernel size: $3 \times 3$
- Stride: 2
- Padding: valid (no padding)

**Rules**

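- Output shape, for input size $I_h \times I_w$, filter size $F_h \times F_w$, padding $P$ per side, and stride $S$: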

  $$
  \text{Output shape} = \left\lfloor \frac{I_h + 2P - F_h}{S} + 1 \right\rfloor \times \left\lfloor \frac{I_w + 2P - F_w}{S} + 1 \right\rfloor \times N_{\text{filters}}
  $$

- Number of parameters, where $C_{\text{in}}$ is the number of input channels and the $+1$ is each filter's bias:

  $$
  \text{Params} = (F_w \times F_h \times C_{\text{in}} + 1) \times N_{\text{filters}}
  $$

**Solution**

- Output shape:

  $$
  \left\lfloor \frac{64 - 3}{2} + 1 \right\rfloor \times \left\lfloor \frac{64 - 3}{2} + 1 \right\rfloor \times 32 = 31 \times 31 \times 32
  $$

- Number of parameters: $(3 \times 3 \times 3 + 1) \times 32 = 28 \times 32 = 896$
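
The two rules are easy to check in code. Below is a minimal sketch in plain Python; the function names `conv_output_shape` and `conv_params` are made up for illustration and do not come from any library.

```python
def conv_output_shape(in_h, in_w, f_h, f_w, stride, padding, n_filters):
    """Return (height, width, channels) of a conv layer's output.

    Integer floor division implements the floor in the rule:
    floor(x / S + 1) == x // S + 1 for a non-negative integer x.
    """
    out_h = (in_h + 2 * padding - f_h) // stride + 1
    out_w = (in_w + 2 * padding - f_w) // stride + 1
    return out_h, out_w, n_filters


def conv_params(f_h, f_w, c_in, n_filters):
    """Weights of every filter plus one bias per filter."""
    return (f_h * f_w * c_in + 1) * n_filters


# Conv1 from the example: 64x64x3 input, 32 filters of 3x3, stride 2, valid padding.
shape = conv_output_shape(64, 64, 3, 3, stride=2, padding=0, n_filters=32)
params = conv_params(3, 3, c_in=3, n_filters=32)
print(shape)   # (31, 31, 32)
print(params)
```

Note that the parameter count depends only on the filter size, the input channels, and the number of filters; the input's height and width never enter it.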

---

### Big Example

**Given**

- Input shape: $256 \times 256 \times 3$

**Conv1**

- 64 filters
- Filter shape: $3 \times 3$
- Stride: 2
- Padding: 0 (valid)

**Conv2**

- 128 filters
- Filter shape: $3 \times 3$
- Stride: 1
- Padding: same, where $P = \frac{F - 1}{2} = 1$ (this holds for odd $F$ at stride 1)

**Max pooling**

- Window: $2 \times 2$
- Stride: 2

**Conv3**

- 256 filters
- Filter shape: $3 \times 3$
- Stride: 1
- Padding: same

**Flatten & fully connected**

- Flatten: reshape the Conv3 output tensor into a single 1-D feature vector (no parameters)
- FC1: 500 neurons
- FC2: 10 neurons

**Rules**

- Output shape:

  $$
  \text{Output shape} = \left\lfloor \frac{I_h + 2P - F_h}{S} + 1 \right\rfloor \times \left\lfloor \frac{I_w + 2P - F_w}{S} + 1 \right\rfloor \times N_{\text{filters}}
  $$

- Number of parameters in a convolutional layer:

  $$
  \text{Params}_{\text{conv}} = (F_w \times F_h \times C_{\text{in}} + 1) \times N_{\text{filters}}
  $$

- Number of parameters in a fully connected layer:

  $$
  \text{Params}_{\text{fc}} = (N_{\text{inputs}} + 1) \times N_{\text{units}}
  $$

**Solution**

- Conv1 output shape:

  $$
  \left\lfloor \frac{256 - 3}{2} + 1 \right\rfloor \times \left\lfloor \frac{256 - 3}{2} + 1 \right\rfloor \times 64 = 127 \times 127 \times 64
  $$

- Conv1 parameters: $(3 \times 3 \times 3 + 1) \times 64 = 1{,}792$
- Conv2 output shape: $127 \times 127 \times 128$ (same padding preserves width and height)
- Conv2 parameters: $(3 \times 3 \times 64 + 1) \times 128 = 73{,}856$
- Max pooling output shape:

  $$
  \left\lfloor \frac{127 - 2}{2} + 1 \right\rfloor \times \left\lfloor \frac{127 - 2}{2} + 1 \right\rfloor \times 128 = 63 \times 63 \times 128
  $$

- Max pooling parameters: none
- Conv3 output shape: $63 \times 63 \times 256$ (same padding)
- Conv3 parameters: $(3 \times 3 \times 128 + 1) \times 256 = 295{,}168$
- Flatten output shape: $63 \times 63 \times 256 = 1{,}016{,}064$ features
- FC1 parameters: $(1{,}016{,}064 + 1) \times 500 = 508{,}032{,}500$
- FC2 parameters: $(500 + 1) \times 10 = 5{,}010$
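
The layer-by-layer bookkeeping above can be reproduced with a short script. This is only a sketch under the rules stated in this section; the helpers `conv2d`, `max_pool`, and `dense` are hypothetical names, not framework calls, and they track shapes and parameter counts only.

```python
def conv2d(shape, n_filters, f, stride=1, padding=0):
    """Shape and parameter count of a square-kernel conv layer."""
    h, w, c_in = shape
    out = ((h + 2 * padding - f) // stride + 1,
           (w + 2 * padding - f) // stride + 1,
           n_filters)
    return out, (f * f * c_in + 1) * n_filters


def max_pool(shape, window, stride):
    """Pooling shrinks height and width but has no parameters."""
    h, w, c = shape
    return ((h - window) // stride + 1, (w - window) // stride + 1, c), 0


def dense(n_inputs, n_units):
    """Fully connected layer: one weight per input plus a bias, per unit."""
    return n_units, (n_inputs + 1) * n_units


report = []  # (layer name, output shape or size, parameter count)
shape = (256, 256, 3)

shape, p = conv2d(shape, 64, f=3, stride=2, padding=0)
report.append(("Conv1", shape, p))
shape, p = conv2d(shape, 128, f=3, stride=1, padding=1)  # same padding
report.append(("Conv2", shape, p))
shape, p = max_pool(shape, window=2, stride=2)
report.append(("MaxPool", shape, p))
shape, p = conv2d(shape, 256, f=3, stride=1, padding=1)  # same padding
report.append(("Conv3", shape, p))

n = shape[0] * shape[1] * shape[2]  # flatten
report.append(("Flatten", n, 0))
n, p = dense(n, 500)
report.append(("FC1", n, p))
n, p = dense(n, 10)
report.append(("FC2", n, p))

total = sum(p for _, _, p in report)
for name, out, p in report:
    print(f"{name:8s} {out!s:18s} {p:>13,}")
print(f"total parameters: {total:,}")  # total parameters: 508,408,326
```

Almost all of the parameters sit in FC1, because flattening a $63 \times 63 \times 256$ tensor produces more than a million inputs for each of its 500 neurons.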

---

### Dilation rule

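For convolutions with dilation $D$, the kernel taps are spaced $D$ pixels apart, so the kernel's effective size grows from $F$ to $F_{\text{eff}} = D \cdot (F - 1) + 1$ without adding parameters. The generalized output-size formula for one spatial dimension becomes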

$$
O = \left\lfloor \frac{I + 2P - D \cdot (F - 1) - 1}{S} + 1 \right\rfloor,
$$

where $I$ is the input length, $F$ the filter size, $P$ the padding on one side, and $S$ the stride. Setting $D = 1$ recovers the standard convolution rule.
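
As a quick illustration of the dilation rule, here is a small sketch for one spatial dimension. The function name `conv_out_len` and the input length of 32 are arbitrary choices for the example.

```python
def conv_out_len(i, f, s=1, p=0, d=1):
    """Output length along one spatial dimension.

    i: input length, f: filter size, s: stride,
    p: padding on one side, d: dilation.
    """
    f_eff = d * (f - 1) + 1            # effective kernel size
    return (i + 2 * p - f_eff) // s + 1


print(conv_out_len(32, 3))             # 30: d = 1 is the standard rule
print(conv_out_len(32, 3, d=2))        # 28: effective kernel is 5
print(conv_out_len(32, 3, p=2, d=2))   # 32: p = 2 restores the input size
```

With stride 1 and an odd effective kernel, padding $P = \frac{D \cdot (F - 1)}{2}$ on each side keeps the output the same length as the input.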

---

### Big Example 2: mixed padding, strides, and dilation

**Given**

- Input shape: $128 \times 128 \times 3$

**ConvA**

- 32 filters
- Filter shape: $3 \times 3$
- Stride: 1
- Padding: same ($P = 1$)
- Dilation: 1

**ConvB**

- 64 filters
- Filter shape: $5 \times 5$
- Stride: 2
- Padding: valid ($P = 0$)
- Dilation: 1

**ConvC**

- 128 filters
- Filter shape: $3 \times 3$
- Stride: 1
- Padding: same ($P = 2$ to counter dilation)
- Dilation: 2

**Max pooling**

- Window: $2 \times 2$
- Stride: 2

**ConvD**

- 256 filters
- Filter shape: $3 \times 3$
- Stride: 1
- Padding: valid ($P = 0$)
- Dilation: 3

**Global average pooling & fully connected**

- Global average pooling collapses $H \times W$ to $1 \times 1$
- FC (classifier) layer: 10 neurons

**Layer-by-layer solution**

- ConvA output shape: $128 \times 128 \times 32$ (same padding keeps the spatial size)
- ConvA parameters: $(3 \times 3 \times 3 + 1) \times 32 = 896$
- ConvB output shape:

  $$
  \left\lfloor \frac{128 - 5}{2} + 1 \right\rfloor \times \left\lfloor \frac{128 - 5}{2} + 1 \right\rfloor \times 64 = 62 \times 62 \times 64
  $$

- ConvB parameters: $(5 \times 5 \times 32 + 1) \times 64 = 51{,}264$
- ConvC output shape: $62 \times 62 \times 128$. Dilation $D = 2$ makes the effective kernel $5 \times 5$, and padding $P = 2$ offsets it: $62 + 2 \cdot 2 - 2 \cdot (3 - 1) - 1 + 1 = 62$
- ConvC parameters: $(3 \times 3 \times 64 + 1) \times 128 = 73{,}856$
- Max pooling output shape:

  $$
  \left\lfloor \frac{62 - 2}{2} + 1 \right\rfloor \times \left\lfloor \frac{62 - 2}{2} + 1 \right\rfloor \times 128 = 31 \times 31 \times 128
  $$

- ConvD output shape (dilation $D = 3$ gives an effective $7 \times 7$ kernel with valid padding):

  $$
  (31 - 7 + 1) \times (31 - 7 + 1) \times 256 = 25 \times 25 \times 256
  $$

- ConvD parameters: $(3 \times 3 \times 128 + 1) \times 256 = 295{,}168$
- Global average pooling output shape: $1 \times 1 \times 256$
- FC parameters: $(256 + 1) \times 10 = 2{,}570$
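
The same kind of check works for this network once the helper accepts a dilation argument. The sketch below assumes the dilation rule given earlier; `conv2d` and the layer table are illustrative names, not part of any framework.

```python
def conv2d(shape, n_filters, f, stride=1, padding=0, dilation=1):
    """Shape and parameter count of a square-kernel, possibly dilated conv."""
    h, w, c_in = shape
    f_eff = dilation * (f - 1) + 1                 # effective kernel size
    out = ((h + 2 * padding - f_eff) // stride + 1,
           (w + 2 * padding - f_eff) // stride + 1,
           n_filters)
    # Dilation spreads the taps out but adds no weights.
    return out, (f * f * c_in + 1) * n_filters


layers = [
    ("ConvA", dict(n_filters=32, f=3, stride=1, padding=1, dilation=1)),
    ("ConvB", dict(n_filters=64, f=5, stride=2, padding=0, dilation=1)),
    ("ConvC", dict(n_filters=128, f=3, stride=1, padding=2, dilation=2)),
    ("MaxPool", None),                             # 2x2 window, stride 2
    ("ConvD", dict(n_filters=256, f=3, stride=1, padding=0, dilation=3)),
]

shape = (128, 128, 3)
trace = {}  # layer name -> (output shape, parameter count)
for name, cfg in layers:
    if cfg is None:
        h, w, c = shape
        shape, params = ((h - 2) // 2 + 1, (w - 2) // 2 + 1, c), 0
    else:
        shape, params = conv2d(shape, **cfg)
    trace[name] = (shape, params)

# Global average pooling keeps the channels and collapses H x W to 1 x 1.
shape = (1, 1, shape[2])
trace["GAP"] = (shape, 0)
trace["FC"] = ((10,), (shape[2] + 1) * 10)

total = sum(p for _, p in trace.values())
for name, (out, p) in trace.items():
    print(f"{name:8s} {out!s:18s} {p:>9,}")
print(f"total parameters: {total:,}")  # total parameters: 423,754
```

Because global average pooling hands the classifier only 256 inputs, the FC layer needs 2,570 parameters, compared with 508,032,500 for FC1 after the flatten in the first big example.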

