# Convolutional Neural Network

A convolutional neural network (CNN) extracts features through convolutional layers and pooling layers, so it can preserve more of the local features and spatial information in the input.

## Convolutional Layer
In a convolutional layer, a window slides across the data, which is stored as a matrix. This window, called the __Kernel__, carries a weight on each of its units. A layer usually has several kernels, each producing a new matrix (a convolution); stacked together, these matrices form the well-known __Tensor__. The kernels in one layer usually share the same shape.  
Here we should point out that the convolution of two matrices is
$$
    [A*K]_{ij} = \sum\limits_{m=0}^{p-1}\sum\limits_{n=0}^{q-1} A_{i+m,j+n}K_{p-m,q-n}
$$
where $K$ is a $p\times q$ kernel indexed from 1: we flip $K$ along both axes, multiply it element-wise with the matching window of $A$, and sum the products.
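
The definition above can be sketched in NumPy as follows. This is a minimal illustration, not an optimized implementation: the function name `conv2d_valid` and the sample matrices are assumptions made for this example, and the output only covers positions where the kernel fits entirely inside $A$.

```python
import numpy as np

def conv2d_valid(A, K):
    """True 2-D convolution: flip K, then slide it over A (hypothetical helper)."""
    p, q = K.shape
    out_h = A.shape[0] - p + 1
    out_w = A.shape[1] - q + 1
    K_flipped = K[::-1, ::-1]  # flip along both axes, as the formula requires
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = A[i:i + p, j:j + q]
            # element-wise multiplication followed by a sum
            out[i, j] = np.sum(window * K_flipped)
    return out

# Sample data chosen only for illustration.
A = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
Z = conv2d_valid(A, K)
print(Z.shape)   # (3, 3)
print(Z[0, 0])   # 16.0
```

The top-left entry is $0\cdot 4 + 1\cdot 3 + 4\cdot 2 + 5\cdot 1 = 16$, which shows the flipped kernel at work.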

## Pooling Layer
This layer reduces the spatial size while retaining the main features: a cat is still a cat if we shift the picture a few pixels to the right. The classical pooling layers are the max pooling layer and the average pooling layer.
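
As a rough sketch of both pooling types, assuming non-overlapping square windows whose size divides the input evenly (the helper name `pool2d` and its parameters are made up for this example):

```python
import numpy as np

def pool2d(A, size, mode="max"):
    """Pool a 2-D array over non-overlapping size x size windows (hypothetical helper)."""
    h, w = A.shape
    if h % size != 0 or w % size != 0:
        raise ValueError("input shape must be divisible by the window size")
    # blocks[bi, r, bj, c] == A[bi * size + r, bj * size + c]
    blocks = A.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")

A = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(A, 2, "max"))    # [[ 5.  7.] [13. 15.]]
print(pool2d(A, 2, "mean"))   # [[ 2.5  4.5] [10.5 12.5]]
```

Both calls reduce the $4\times 4$ input to $2\times 2$, keeping one summary value per window.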

## Fully Connected Layer
After the convolutional and pooling layers have processed the input, we apply a fully connected layer, which is the same layer we studied earlier for ordinary neural networks. 

## Propagation and Backpropagation

A CNN is also a feedforward NN, so each epoch consists of two main parts: forward propagation and backward propagation. We choose a specific network here, whose structure is:
<img src="pic/CNN.png" />


### Propagation
We mainly analyse two layers here:
 - Convolutional Layer  
     Let $*$ denote the convolution operator. We can write the convolution as a product over tensors
     $$
         \mathrm{vec}(A*K) \triangleq [K_T^{(1)}, \cdots, K_T^{(n)}]A \triangleq [K_T^{(1)}A, \cdots, K_T^{(n)}A]
     $$
     Here $K_T^{(i)}$ is $K$ padded with zeros to the shape of $A$ and placed at the $i$-th window position; for example, $K_T^{(1)}$ looks like $\left[\begin{matrix}K & O\\ O & O\end{matrix}\right]$. Each product $K_T^{(i)}A$ is the sum of the element-wise products of the two matrices.

 - Pooling Layer  
 A pooling layer can be regarded as a special convolutional layer with a stride. We usually use _max pooling_ and _average pooling_. Formally, we have
$$
    a = z = A * K
$$
without a bias $b$ or an activation function.
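
The zero-padded kernels $K_T^{(i)}$ from the convolutional layer above can be sketched as rows of one matrix, so that the whole layer becomes a single matrix-vector product. This is only an illustration under assumptions: the helper name `padded_kernels` is made up, and the kernel is placed as is (no flip), matching the "sum of element-wise products" description; for a true convolution one would pass the flipped kernel instead.

```python
import numpy as np

def padded_kernels(K, shape):
    """Stack vec(K_T^{(i)}) for every window position i (hypothetical helper)."""
    p, q = K.shape
    H, W = shape
    rows = []
    for i in range(H - p + 1):
        for j in range(W - q + 1):
            K_T = np.zeros(shape)
            K_T[i:i + p, j:j + q] = K   # K sits at this window, zeros elsewhere
            rows.append(K_T.ravel())
    return np.stack(rows)

A = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])

M = padded_kernels(K, A.shape)   # one row per window position
z = (M @ A.ravel()).reshape(3, 3)
print(M.shape)    # (9, 16)
print(z[0, 0])    # 34.0
```

Each row of `M` dotted with the flattened input gives one output entry, which is why the sum of element-wise products can be written as a matrix product.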

### Backpropagation

- Convolutional Layer  
In the earlier analysis of an ordinary fully connected NN, we described the derivatives in matrix form without much trouble. This is because the weights $w_{ij}$ in each layer are independent of each other, so the chain-rule paths do not overlap. In a CNN, however, the same kernel is applied at every window position, so the derivative with respect to $K$ is a sum over all of those positions
$$
\begin{aligned}
    \frac{\partial L}{\partial K_{ij}} &= \sum\limits_k \frac{\partial L}{\partial z_k}\frac{\partial z_k}{\partial K_{ij}} = \mathrm{dot}\left(\frac{\partial L}{\partial z}, \frac{\partial z}{\partial K_{ij}}\right) \\
    \frac{\partial L}{\partial\,\mathrm{flip}(K)} &= A[*]\frac{\partial L}{\partial z} \\
    \nabla_K L &= \mathrm{flip}\left(A[*]\frac{\partial L}{\partial z}\right) = \mathrm{flip}(A) * \frac{\partial L}{\partial z}
\end{aligned}
$$
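where $[*]$ slides its right operand over its left operand and sums the element-wise products without flipping (cross-correlation), and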
$$
    z = A*K + b,\ a = \sigma(z)
$$
then
$$
    \frac{\partial L}{\partial b} = \mathrm{dot}\left(\frac{\partial L}{\partial z}, J\right)
$$
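where $J$ is the matrix of ones with the same shape as $z$, so the dot product simply sums all entries of $\frac{\partial L}{\partial z}$.  
To backpropagate the derivative to the previous layer, we also need $\frac{\partial L}{\partial A}$. In conclusion, we have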
$$
\begin{aligned}
    \frac{\partial L}{\partial A} &= 
        \left[\begin{matrix}
            O & O & O \\
            O & \frac{\partial L}{\partial z} & O \\
            O & O & O
        \end{matrix}\right] [*] K \\
        &\triangleq \frac{\partial L}{\partial z} \left<*\right> K
\end{aligned}
$$

- Pooling Layer  
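 Because a pooling layer has a structure similar to an ordinary convolutional layer, a general pooling layer can reuse the same backpropagation. In practice only a few kinds of pooling layer are used, so we work out the details of backpropagation for max pooling and average pooling.
    - max pooling  
    Assume that
    $$
        [A*K]_{ij} = A_{p_iq_j}
    $$
    that is, $A_{p_iq_j}$ is the maximum of the window that produces $z_{ij}$. We do not need to compute $\frac{\partial L}{\partial K}$, which is meaningless here, but only $\frac{\partial L}{\partial A}$. So
    $$
    \begin{aligned}
        \nabla_A L &= \nabla_{z} L\nabla_A {z} \\
        \nabla_{A_{p_iq_j}} L &= \nabla_{z_{ij}} L 
    \end{aligned}
    $$
    and the gradient is zero at every position that is not the maximum of its window (assuming non-overlapping windows).
    - average pooling  
    Because of the averaging, $K=\frac{1}{n}J$, where $n$ is the number of entries in the window. With this fixed kernel, $\frac{\partial L}{\partial A}$ is exactly what we get in the convolutional layer.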
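
The two pooling cases can be sketched in NumPy as below. This is a minimal illustration assuming non-overlapping square windows; the function names `max_pool_backward` and `avg_pool_backward` are made up for this example, and when a window contains ties, `np.argmax` routes the gradient to the first maximum only.

```python
import numpy as np

def max_pool_backward(A, dz, size):
    """Route each dz[i, j] to the position of the maximum in its window."""
    dA = np.zeros_like(A, dtype=float)
    for i in range(dz.shape[0]):
        for j in range(dz.shape[1]):
            window = A[i * size:(i + 1) * size, j * size:(j + 1) * size]
            # position (row, column) of the maximum inside this window
            r, c = np.unravel_index(np.argmax(window), window.shape)
            dA[i * size + r, j * size + c] = dz[i, j]
    return dA

def avg_pool_backward(dz, size):
    """Spread each dz[i, j] evenly over its size x size window."""
    return np.kron(dz, np.ones((size, size))) / (size * size)

A = np.arange(16, dtype=float).reshape(4, 4)
dz = np.array([[1.0, 2.0], [3.0, 4.0]])   # stands in for dL/dz

dA_max = max_pool_backward(A, dz, 2)
dA_avg = avg_pool_backward(dz, 2)
print(dA_max[1, 1], dA_max[3, 3])   # 1.0 4.0
print(dA_avg[0, 0], dA_avg[3, 3])   # 0.25 1.0
```

In both cases the entries of the result add up to the same total as `dz`, since pooling only redistributes the incoming gradient.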

In our code implementation, _dot()_ multiplies two matrices element-wise and sums the products, so the kernel slides over the input without being flipped and we do not need to write a separate convolution function. We therefore adjust the formulas here to match the code.

__In forward propagation__  
$$
z = A*K + b,\ a = \sigma(z)
$$
where $*$ now stands for the unflipped sliding _dot()_ described above rather than the true convolution.

__In backward propagation__
$$
\begin{aligned}
\nabla_K L &= A*\nabla_z L\\
\nabla_b L &= \mathrm{dot}\left(\frac{\partial L}{\partial z}, J\right) \\
\nabla_A L &= \left[\begin{matrix}
        O & O & O \\
        O & \frac{\partial L}{\partial z} & O \\
        O & O & O
    \end{matrix}\right] * \mathrm{flip}(K)
\end{aligned}
$$
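
A minimal NumPy sketch of these formulas, followed by a finite-difference check, might look like the following. Everything here is an assumption made for illustration: the names `corr2d`, `conv_backward`, and `numerical_grad` are hypothetical, the activation is left out, and the loss is taken as $L = \sum_{ij} G_{ij} z_{ij}$ so that the random matrix `G` plays the role of $\frac{\partial L}{\partial z}$.

```python
import numpy as np

def corr2d(A, K):
    """Slide K over A without flipping; sum the element-wise products."""
    p, q = K.shape
    out = np.empty((A.shape[0] - p + 1, A.shape[1] - q + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i + p, j:j + q] * K)
    return out

def conv_backward(A, K, dz):
    """Gradients of L with respect to K, b and A, given dz = dL/dz."""
    p, q = K.shape
    dK = corr2d(A, dz)                       # A * dL/dz
    db = np.sum(dz)                          # dot(dL/dz, J)
    dz_padded = np.pad(dz, ((p - 1, p - 1), (q - 1, q - 1)), mode="constant")
    dA = corr2d(dz_padded, K[::-1, ::-1])    # padded dL/dz * flip(K)
    return dK, db, dA

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences of the scalar f() with respect to X."""
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        old = X[idx]
        X[idx] = old + eps
        plus = f()
        X[idx] = old - eps
        minus = f()
        X[idx] = old                         # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))
G = rng.standard_normal((3, 3))   # stands in for dL/dz
b = 0.5

def loss():
    return np.sum((corr2d(A, K) + b) * G)

dK, db, dA = conv_backward(A, K, G)
print(np.allclose(dK, numerical_grad(loss, K), atol=1e-5))
print(np.allclose(dA, numerical_grad(loss, A), atol=1e-5))
```

Comparing the analytic gradients with finite differences is a cheap way to catch a missing flip or a wrong padding width before the layer is used in training.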