<a href="https://colab.research.google.com/github/DukeFens/QuickDraw-by-Scratch/blob/main/QuickDrawScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We start by building the convolution layer, but first we go through the math behind it. Suppose the input is a 28x28 image and the filter is 3x3:

$$
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,28} \\
x_{21} & x_{22} & \cdots & x_{2,28} \\
\vdots & \vdots & \ddots & \vdots \\
x_{28,1} & x_{28,2} & \cdots & x_{28,28}
\end{bmatrix} \; , \; \; K =
\begin{bmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33}
\end{bmatrix}
$$

We convolve $X$ with $K$: we slide the kernel over $X$, and for each patch position $(m, n)$ we compute:
\begin{align*}
a_{m,n} &= \sum_{i=1}^{3} \sum_{j=1}^{3} x_{m+i-1,\, n+j-1} \, w_{i,j}
\end{align*}
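The sliding sum above can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's actual implementation: the function name `correlate2d_valid` and the random test data are assumptions, the code uses 0-based indices, and no padding is applied yet, so a 28x28 input gives a 26x26 output.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Slide kernel k over x (stride 1, no padding) and sum the elementwise products."""
    f = k.shape[0]                      # filter size, 3 for a 3x3 kernel
    out_h = x.shape[0] - f + 1
    out_w = x.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            patch = x[m:m + f, n:n + f]  # the f x f window whose top-left corner is (m, n)
            out[m, n] = np.sum(patch * k)
    return out

rng = np.random.default_rng(0)           # hypothetical random data, for illustration only
X = rng.random((28, 28))
K = rng.random((3, 3))
A = correlate2d_valid(X, K)
print(A.shape)  # (26, 26)
```

The explicit double loop is slow but mirrors the formula directly, which makes it easy to check against the math.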

To keep the output the same size as the input, we pad $X$ with zeros around its border before the convolution. We also add a bias $b$ to every entry, so $Z$ becomes:
$$
Z = X * K + b = \begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,28} \\
\vdots & \vdots & \ddots & \vdots \\
a_{28,1} & a_{28,2} & \cdots & a_{28,28}
\end{bmatrix} + b
$$

We then apply the ReLU function to get the final feature map. In general:

$$
A = \mathrm{ReLU}(Z) = \mathrm{ReLU}(X * K + b) \text{, with stride 1 and padding 1.}
$$
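As a rough sketch of this step, the snippet below pads the input with one ring of zeros, applies the sliding sum, adds a scalar bias, and clips negatives with ReLU. The function name `conv2d_relu`, the bias value, and the random data are assumptions made for illustration.

```python
import numpy as np

def conv2d_relu(x, k, b, pad=1):
    """Zero-pad x, slide k with stride 1, add bias b, then apply ReLU."""
    x_pad = np.pad(x, pad, mode="constant", constant_values=0.0)
    f = k.shape[0]
    out_h = x_pad.shape[0] - f + 1
    out_w = x_pad.shape[1] - f + 1
    z = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            z[m, n] = np.sum(x_pad[m:m + f, n:n + f] * k) + b
    return np.maximum(z, 0.0)            # ReLU: negative entries become 0

rng = np.random.default_rng(0)           # hypothetical data, for illustration only
X = rng.normal(size=(28, 28))
K = rng.normal(size=(3, 3))
A = conv2d_relu(X, K, b=0.1)
print(A.shape)  # (28, 28)
```

With padding 1 and a 3x3 filter, the output keeps the 28x28 shape of the input.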

We keep the strongest responses by applying 2x2 pooling to $A$ with stride 2, which gives the pooled feature map:
$$
P = \begin{bmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,14} \\
\vdots & \vdots & \ddots & \vdots \\
p_{14,1} & p_{14,2} & \cdots & p_{14,14}
\end{bmatrix}
$$
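The text does not say which pooling operation is used, so the sketch below assumes max pooling, a common choice. The function name `max_pool_2x2` is hypothetical, and the small 4x4 input is only there to make the result easy to verify by hand.

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = a.shape
    # Split the map into non-overlapping 2x2 blocks, then take the max of each block.
    blocks = a.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

A_small = np.arange(16).reshape(4, 4)
P_small = max_pool_2x2(A_small)
print(P_small.tolist())  # [[5, 7], [13, 15]]

A = np.zeros((28, 28))
print(max_pool_2x2(A).shape)  # (14, 14)
```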

That completes convolution layer 1. The overall network design is:

$$
X \xrightarrow{\text{conv}} Z^{[1]} \xrightarrow{\text{ReLU}} A^{[1]} \xrightarrow{\text{pooling}} P^{[1]} \xrightarrow{\text{conv}} Z^{[2]} \xrightarrow{\text{ReLU}} A^{[2]} \xrightarrow{\text{pooling}} P^{[2]}
$$

We have 2 convolution layers. Each convolution or pooling step changes the spatial size according to:
$$
m = \left\lfloor \frac{n+2p-f}{s} \right\rfloor + 1
$$
where:
  - $m$: Output size
  - $n$: Input size
  - $p$: Padding size
  - $f$: Filter size or pooling size
  - $s$: Stride size
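To check the shapes through the network, the small helper below evaluates the standard output-size relation $m = \lfloor (n + 2p - f)/s \rfloor + 1$ for each step. The function name `output_size` is an assumption, and pooling is treated as a filter of size 2 with stride 2 and no padding.

```python
def output_size(n, f, p=0, s=1):
    """Spatial output size for input size n, filter size f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

size = 28
size = output_size(size, f=3, p=1, s=1)   # conv 1: 28
conv1 = size
size = output_size(size, f=2, p=0, s=2)   # pooling 1: 14
pool1 = size
size = output_size(size, f=3, p=1, s=1)   # conv 2: 14
conv2 = size
size = output_size(size, f=2, p=0, s=2)   # pooling 2: 7
pool2 = size
print(conv1, pool1, conv2, pool2)  # 28 14 14 7
```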

Applying the same steps in layer 2 gives the pooled feature map:
$$
P^{[2]} = \begin{bmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,7} \\
\vdots & \vdots & \ddots & \vdots \\
p_{7,1} & p_{7,2} & \cdots & p_{7,7}
\end{bmatrix}
$$

Now we generalize: the image has 1 channel (grayscale), and we use **32 filters** in layer 1 and **64 filters** in layer 2 to detect more features. The general form is then:

$$
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,28} \\
x_{21} & x_{22} & \cdots & x_{2,28} \\
\vdots & \vdots & \ddots & \vdots \\
x_{28,1} & x_{28,2} & \cdots & x_{28,28}
\end{bmatrix} \; , \; \; K_1 =
\begin{bmatrix}
k_1 & k_2 & \cdots & k_{32}
\end{bmatrix}, K_2 =
\begin{bmatrix}
k_1 & k_2 & \cdots & k_{64}
\end{bmatrix}
$$

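where each $k_i$ in $K_1$ is a 3x3 kernel (in a standard convolution layer, each $k_i$ in $K_2$ holds one such 3x3 kernel per input channel, 32 in total, because its input $P^{[1]}$ has 32 channels):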
$$
k_{i}=\begin{pmatrix}
w_{i11} & w_{i12} & w_{i13} \\
w_{i21} & w_{i22} & w_{i23} \\
w_{i31} & w_{i32} & w_{i33}
\end{pmatrix}
$$

Then, for each filter $k_i$ in $K_1$, we compute:

$$
\begin{align*}
z_i &= X * k_i + b_1 \\
a_i &= \mathrm{ReLU}(z_i) \\
p_i &= \mathrm{pooling}(a_i)
\end{align*}
$$

Therefore $P^{[1]}$ has shape (32, 14, 14), because the channel dimension is replaced by the number of filters. By the same calculation, $P^{[2]}$ has shape (64, 7, 7).
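The shapes can be verified with a small multi-filter sketch. It is not the notebook's implementation: the function name `conv_relu_pool`, the weight layout `(filters, input channels, 3, 3)`, one bias per filter, max pooling, and the random weights are all assumptions chosen to follow the usual convention.

```python
import numpy as np

def conv_relu_pool(x, kernels, bias):
    """x: (C_in, H, W); kernels: (C_out, C_in, 3, 3); bias: (C_out,).

    3x3 convolution with padding 1 and stride 1, then ReLU, then 2x2 max pooling.
    """
    c_out = kernels.shape[0]
    _, h, w = x.shape
    x_pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))       # zero-pad height and width only
    z = np.zeros((c_out, h, w))
    for m in range(h):
        for n in range(w):
            patch = x_pad[:, m:m + 3, n:n + 3]         # (C_in, 3, 3)
            # Contract channel and spatial axes: one value per filter.
            z[:, m, n] = np.tensordot(kernels, patch, axes=3) + bias
    a = np.maximum(z, 0.0)
    return a.reshape(c_out, h // 2, 2, w // 2, 2).max(axis=(2, 4))

rng = np.random.default_rng(0)                         # hypothetical weights and input
X = rng.random((1, 28, 28))                            # 1 grayscale channel
K1, b1 = rng.normal(size=(32, 1, 3, 3)), np.zeros(32)
K2, b2 = rng.normal(size=(64, 32, 3, 3)), np.zeros(64)

P1 = conv_relu_pool(X, K1, b1)
P2 = conv_relu_pool(P1, K2, b2)
print(P1.shape, P2.shape)  # (32, 14, 14) (64, 7, 7)
```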


We flatten $P^{[2]}$ to get the input of the fully connected layer, so each example becomes a vector of length **64x7x7=3136**. Suppose we have $n$ examples and 128 perceptrons in dense layer 1, then:
$$
X_{FC} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,3136} \\
x_{21} & x_{22} & \cdots & x_{2,3136} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,3136}
\end{bmatrix} \; , \; \; W^{[1]} =
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1,128} \\
w_{21} & w_{22} & \cdots & w_{2,128} \\
\vdots & \vdots & \ddots & \vdots \\
w_{3136,1} & w_{3136,2} & \cdots & w_{3136,128}
\end{bmatrix}
$$

$$
\begin{align*}
\therefore Z^{[1]} &= X_{FC} W^{[1]} + b^{[1]} \\
\therefore A^{[1]} &= \mathrm{ReLU}(Z^{[1]})
\end{align*}
$$
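A minimal sketch of the flatten step and dense layer 1 is shown below. The batch size of 4, the random stand-in for the pooled feature maps, and the weight initialization scale are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # hypothetical number of examples
P2 = rng.random((n, 64, 7, 7))               # stand-in for the pooled maps of layer 2

X_fc = P2.reshape(n, -1)                     # flatten each example: (n, 3136)
W1 = rng.normal(scale=0.01, size=(3136, 128))
b1 = np.zeros(128)

Z1 = X_fc @ W1 + b1                          # (n, 3136) @ (3136, 128) -> (n, 128)
A1 = np.maximum(Z1, 0.0)                     # ReLU
print(X_fc.shape, A1.shape)  # (4, 3136) (4, 128)
```

The bias `b1` has shape `(128,)` and is broadcast across the `n` rows.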

For dense layer 2, we use the softmax function:

$$
\mathrm{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$
- $K$: number of classes

Then:
$$
A^{[2]} = \mathrm{Softmax}(Z^{[2]}) = \mathrm{Softmax}(A^{[1]}W^{[2]} + b^{[2]})
$$
where:
- $W^{[2]}$: weights of 5 perceptrons, one for each of the 5 output classes.
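The output layer can be sketched as follows: the ReLU output $A^{[1]}$ of dense layer 1 is multiplied by $W^{[2]}$, the bias is added, and softmax turns each row into class probabilities. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not mentioned in the text. The random inputs and weights are assumptions.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow without changing the result."""
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 4                                        # hypothetical number of examples
A1 = rng.random((n, 128))                    # stand-in for the output of dense layer 1
W2 = rng.normal(scale=0.01, size=(128, 5))   # 5 perceptrons, one per class
b2 = np.zeros(5)

Z2 = A1 @ W2 + b2                            # (n, 5)
A2 = softmax(Z2)                             # each row sums to 1
pred = A2.argmax(axis=1)                     # predicted class index per example
print(A2.shape)  # (4, 5)
```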
