# Introduction to Deep Learning and Image Classification
In this primer, we will go through the basics of deep learning and neural networks, with a focus on the Convolutional Neural Network, which is typically used for image classification tasks. We will investigate how these techniques can be viewed in terms of the core notations introduced in the Machine Learning primer, as well as what distinguishes them from classical machine learning models and leads them to dominate the field of modern AI.

Similar to the previous primer, here we will only cover the basic concepts in deep learning that you need to know for Project 5, where you will train and deploy a simple neural network on the cloud. You will learn more about the underlying theories and the use cases of different network architectures in a dedicated machine learning / deep learning course.

## 1. Introduction to Deep Learning
Deep learning is a subfield of machine learning that uses multiple-layered networks to automatically extract high-level features from the data and perform predictions. The term "deep" refers to the fact that the network structures being used typically have several layers that attempt to mimic the hierarchical structures of the brain.

A typical deep learning workflow consists of the following steps:

1. Selecting a framework for development
2. Splitting the dataset into training and testing sets
3. Designing the initial network model
4. Training the network and picking hyperparameters
5. Testing the model
6. Saving the model parameters and architecture

![deep_learning_flow](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p5-dl-cv-primer/flow.png)
**Figure 1**: Illustration of a deep learning workflow. Source: https://training.ti.com/sites/default/files/docs/introduction-to-deep-learning.pdf

Deep learning has become very popular thanks to the availability of powerful computing units (in particular Graphics Processing Units, or GPUs) and large datasets. At the same time, the rise of deep learning frameworks has contributed greatly to the field's wider adoption, as machine learning practitioners can now build a wide range of network structures using high-level code. Below is a comparison among the most popular frameworks. For a more detailed comparison, you can refer to [this article](https://towardsdatascience.com/top-10-best-deep-learning-frameworks-in-2019-5ccb90ea6de).

![image.png](https://brianhhu.github.io/img/dl_frameworks.jpg)

**Figure 2**: Deep learning framework comparison. Source: https://brianhhu.github.io/img/dl_frameworks.jpg
  
Let's start by investigating the core component of deep learning, which is the neural network.

## 2. Neural Network
### 2.1. Definition
While nowadays "deep learning" and "neural network" are often used interchangeably, there is a minor difference between them; in particular, not every neural network has to be deep. In general, the term "neural network" refers to a class of machine learning algorithms where:

1. The **input data** is $X \in \mathbb{R}^{n \times d_1 \times d_2 \times \ldots \times d_m}$. Here we account for the fact that each input data point is not just a 1D vector, but can itself be a multi-dimensional matrix (for example, an image file is a 2D or 3D bitmap).

1. The **output** is $Y \in \mathcal Y^{n \times d}$, where $\mathcal Y$ can be $\mathbb{R}$ in regression and some label set $C$ in classification. Note that each output $y$ can also be a vector, instead of just a scalar.

1. The **parameters** are a collection of weight vectors $w_j$ and bias terms $b_j$. 

1. The **hypothesis function** is a non-linear function that involves compositions of linear operators and elementwise non-linear functions. An example would be $h(x) = \frac{1}{1 + \exp(-\theta^T x)}$, which can be expressed as $h(x) = \sigma(l(x))$, where $l(x) = \theta^Tx$ is the linear component, and $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the nonlinear component.

1. The **loss function** is similar to those in typical regression and classification scenarios, e.g., mean squared error for regression and hinge loss for classification.


![neuralnet](https://miro.medium.com/max/12840/1*v88ySSMr7JLaIBjwr4chTw.jpeg)

**Figure 3**: A neural network is a graphical representation of the class of hypothesis functions that involve compositions of linear operators and elementwise non-linear functions. Source: [towardsdatascience.com](https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f)

### 2.2. Artificial Neural Network
Artificial neural networks differ from other (classical) machine learning algorithms in how they handle the input features. While in the latter cases (e.g., with logistic regression or SVM), we have to specify the features ourselves, which come from either attributes in the dataset or feature engineering methods like TF-IDF, a neural network can *learn its own features* from an initial set of input features. It does so through a three-layer architecture as follows:

![ann](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p5-dl-cv-primer/ann.png)
**Figure 4**: Structure of an artificial neural network. Source: [towardsdatascience.com](https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6).

Here a 3D data point $x = (x_1, x_2, x_3)$, shown in the *input layer*, is transformed into a *hidden layer* vector $z = (z_1, z_2, z_3, z_4)$, which is then used to predict the *output layer* $y$. Therefore, in a sense $z$ contains the actual features that we use for prediction. At each layer other than the input layer, there is also an associated *activation function* and a *bias vector*, whose length is equal to the number of nodes in the layer.

Each edge in the graph denotes a scalar *weight* term; in this case, $u_{ij}$ denotes the weight of $x_i$ in the computation of $z_j$; similarly, $v_{jk}$ denotes the weight of $z_j$ in the computation of $y_k$. Let $d$ be the number of input nodes and $m$ the number of hidden nodes (here $d = 3$ and $m = 4$). To compute $z$ from $x$, we multiply $x$ with the associated weights and then apply the activation function at the hidden layer:
\begin{equation}
z_j = L_1 \left(b_j + \sum_{i=1}^d u_{ij} x_i \right). \label{x_to_z}
\end{equation}
Similarly, to compute $y$ from $z$, we multiply $z$ with the associated weights and then apply the activation function at the output layer:
\begin{equation}
y_k = L_2 \left(b_k' + \sum_{j=1}^m v_{jk} z_j  \right). \label{z_to_y}
\end{equation}

Note that this can also be rewritten as
\begin{equation}
y_k = L_2 \left( b_k' + \sum_{j=1}^m v_{jk} L_1\left(b_j  + \sum_{i=1}^d x_i u_{ij}\right) \right), \label{x_to_y}
\end{equation}
which is a direct mapping from $x$ to $y$. As we can see, this is a composition of linear operators and non-linear functions, just like our description of the hypothesis function class in the previous section.
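The two-step mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `forward`, the choice of sigmoid for both $L_1$ and $L_2$, and the random weight values are all assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, b, V, b_prime, L1=sigmoid, L2=sigmoid):
    """Map x -> z -> y for a network with one hidden layer.

    U[i, j] is the weight of x_i in z_j; V[j, k] is the weight of z_j in y_k.
    """
    z = L1(b + x @ U)          # z_j = L1(b_j + sum_i u_ij * x_i)
    y = L2(b_prime + z @ V)    # y_k = L2(b'_k + sum_j v_jk * z_j)
    return z, y

# Illustrative sizes matching Figure 4: d = 3 inputs, m = 4 hidden nodes, 1 output.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
U, b = rng.normal(size=(3, 4)), rng.normal(size=4)
V, b_prime = rng.normal(size=(4, 1)), rng.normal(size=1)

z, y = forward(x, U, b, V, b_prime)
print(z.shape, y.shape)  # (4,) (1,)
```

Note that the bias vectors have one entry per node in their layer, as described earlier.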

The remaining question is what the activation functions $L_1$ and $L_2$ may look like. Some popular choices include:

* Sigmoid: $f(x) = \dfrac{1}{1 + \exp(-x)}$.
* Hyperbolic tangent: $f(x) = \dfrac{e^{2x} - 1}{e^{2x} + 1}$.
* Rectified linear unit: $f(x) = \max \{x, 0\}$.
* Softmax (used at the output layer in multi-class classification): $f(x)_k = \dfrac{\exp(x_k)}{\sum_l \exp(x_l)}$.
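As a quick sanity check, the four activation functions can be transcribed directly into plain Python. This is only a sketch: the function names are our own, and the max-subtraction inside `softmax` is a common numerical-stability trick that is not part of the formula above (it does not change the result).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Direct transcription of the formula; math.tanh is the safer choice for large |x|.
    return (math.exp(2 * x) - 1.0) / (math.exp(2 * x) + 1.0)

def relu(x):
    return max(x, 0.0)

def softmax(xs):
    # Takes a whole vector, since each output depends on every input entry.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))          # 0.5
print(relu(-2.0))            # 0.0
print(softmax([1.0, 1.0]))   # [0.5, 0.5]
```

Sigmoid, tanh and ReLU act on one entry at a time, whereas softmax normalizes a full vector so that its outputs sum to 1.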

The hyperparameters of a neural network therefore include (1) the number of nodes in the hidden layer, and (2) the choice of activation functions, in addition to other standard components (number of training iterations, learning rate $\alpha$, regularizer $\lambda$).

Despite their recent resurgence, ANNs were already studied extensively in the last century. For example, researchers have proven that every bounded continuous function can be approximated with small error by a network with one hidden layer (e.g., [Hornik, 1991](https://www.sciencedirect.com/science/article/abs/pii/089360809190009T)). Furthermore, any function can be approximated to arbitrary accuracy by a network with two hidden layers ([Cybenko, 1989](https://link.springer.com/article/10.1007%2FBF02551274)). However, these theoretical results are not very relevant in practice, since they may require an exponential number of nodes in the hidden layers. You can read more about the history of neural networks [here](http://news.mit.edu/2017/explained-neural-networks-deep-learning-0414).

More recently, the machine learning community has explored the idea of increasing the number of layers, instead of the number of nodes in each layer, with the goal of permitting practical applications while retaining theoretical guarantees under mild conditions. This leads us to our next section.

![deeper](https://i.kym-cdn.com/photos/images/newsfeed/000/531/557/a88.jpg)

### 2.3. Deep Neural Network
While there is no formal definition for what qualifies as "deep", typically deep neural networks have at least two hidden layers (to distinguish from the traditional ANN model with only one hidden layer). The standard network structure follows the same pattern as that of ANN, where every node in layer $l$ is connected to every node in layer $l+1$ via a weight term, and every layer other than the input layer has an activation function.

To train a (deep) neural network, we perform the following steps:

1. Randomly initialize the weight and bias terms.
1. **Feedforward**: compute the predicted output $\hat y$ based on input $x$. This involves iterating through each layer and carrying out the computations similar to those in Section 2.2.
1. **Backpropagation**: compute the loss value $l(\hat y, y)$ and update all the weight / bias terms based on the corresponding gradients:
$$w := w - \alpha \cdot \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial w} l(h(x^{(i)}), y^{(i)}).$$

There are some important points to note about this process:

* The specific update formula will depend on the activation function at each layer. We will not go into details here.
* Unlike in classical machine learning techniques where we have a single weight vector $\theta \in \mathbb{R}^d$ to update, due to the dense network structure and (possibly) many layers here, the number of weight terms can be very large. This means a neural network can take very long to train.
* One popular method to alleviate this training time issue is the following: instead of updating each weight term based on the sum of gradients over all the training data points (*batch gradient descent*), we pick only a small random collection of samples $\{i_1, i_2, \ldots, i_k\}$ (a minibatch) and update based on their gradients (*stochastic gradient descent*):
$$w := w - \alpha \cdot \frac{1}{k} \sum_{j=1}^k \frac{\partial}{\partial w} l(h \left(x^{(i_j)} \right), y^{(i_j)}), \quad k \ll n.$$
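To make the minibatch update concrete, here is a minimal sketch on a much simpler model than a neural network. We assume a linear hypothesis $h(x) = w^T x$ with the squared loss $l = \frac{1}{2}(h(x) - y)^2$, purely because its gradient $(h(x) - y)\,x$ is easy to write by hand; the function name `minibatch_sgd`, the synthetic data and the hyperparameter values are all illustrative.

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, k=8, steps=500, seed=0):
    """Fit h(x) = w . x with squared loss, using minibatches of size k."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=k, replace=False)  # sample indices i_1, ..., i_k
        residual = X[idx] @ w - y[idx]              # h(x) - y on the minibatch
        grad = X[idx].T @ residual / k              # average gradient over the k samples
        w = w - alpha * grad                        # w := w - alpha * (1/k) * sum of gradients
    return w

# Noise-free synthetic data generated from known weights, so the fit can be checked.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = minibatch_sgd(X, y)
print(np.round(w, 3))
```

Each step touches only $k = 8$ of the $n = 200$ data points, which is where the savings come from. In a neural network the gradient is obtained by backpropagation rather than by a closed-form expression, but the update rule is the same.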

There are a number of factors that motivate the use of deep networks. It is often mentioned that this structure better simulates the interconnected layers of neurons in our brain, although the connection between neural networks and biology is superficial at best. Perhaps the strongest reason is simply that deep neural networks work very well in practice. Understanding why they do is an [ongoing research effort](https://en.wikipedia.org/wiki/Explainable_artificial_intelligence). For example, an interesting paper ([Lin et al., 2017](https://link.springer.com/article/10.1007/s10955-017-1836-5)) suggests that the class of functions of practical interest to us happens to be one that deep neural networks can model very well, with considerably fewer parameters than the theoretical results imply. A short summary of the paper was shared by one of the authors [here](https://qr.ae/TUGPZ8).

Other than the standard fully-connected network architecture described above, there are other flavors of deep neural network that are popular in practice. The two most popular architectures are *convolutional neural network* for image data, and *recurrent neural network* for sequential data. In Project 5, you will build a convolutional neural network for image classification, so the rest of the primer will focus on this topic. 

### 2.4. Convolutional Neural Network
#### 2.4.1. Motivation
Imagine we have some input images as training data to our neural network. From the computer's view, an image is just a big matrix with a number (or tuple of numbers) at every pixel. For example, a 256 x 256 RGB image has 65536 pixels, each represented by three values of red, green and blue, for a total of $196608 \approx 2 \times 10^5$ dimensions. This means that, even if the first hidden layer only has 5 nodes, we would need about $10^6$ weights ($196608 \times 5 = 983040$), and a hidden layer typically needs a lot more than 5 nodes. Having too many weight parameters poses several issues:

1. It is not computationally practical to train all of them.
1. There is the possibility of overfitting.
1. Important image structures are ignored in this setting, if we just flatten an input 3D matrix of size $256 \times 256 \times 3$ to a $196608$-dimensional vector. For example, an image of a cat, when scaled to half of its height and width, is still an image of a cat (i.e., the label of an image is invariant to scale). Therefore, treating each pixel as an independent input dimension doesn't seem like a good idea.
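The parameter count quoted above is simple arithmetic, which we can spell out as follows (the variable names are ours):

```python
width, height, channels = 256, 256, 3
hidden_nodes = 5

input_dims = width * height * channels   # length of the flattened input vector
weights = input_dims * hidden_nodes      # one weight per (input dimension, hidden node) pair

print(input_dims, weights)  # 196608 983040
```

This counts only the weights between the input layer and a 5-node hidden layer; the bias terms and any further layers add to the total.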

To deal with these issues, the convolutional neural network (CNN) architecture operates as follows.

#### 2.4.2. Mechanism
We begin by noting which properties of the standard fully-connected neural networks are preserved in CNN. First, the feedforward and backpropagation procedures are the same. To get the predicted output from an input, we go through each layer, multiplying the nodes in that layer with the corresponding weight matrix and applying a layer-specific activation function. We then compute the gradients of the loss function, propagating them backward from the output layer toward the input layer, and update the weights accordingly. Second, the notions of hypothesis and loss functions are also unchanged. The hypothesis function is still a composition of linear operators and common non-linear activation functions; similarly, we don't introduce any new loss function only for CNN.

What's different about CNN is that it explicitly assumes each input is a multi-dimensional matrix (e.g., a 3D matrix representing an image), and it introduces new types of layers to take advantage of this structure. In particular, here are the different types of layers used in a CNN:

1. The **input layer** accepts a 3D matrix of size $W_1 \times H_1 \times D_1$.
1. The **convolutional layer** accepts a 3D matrix of size $W_1 \times H_1 \times D_1$ and has four hyperparameters: the number of filters $K$, the spatial extent $F$, the stride $S$ and the amount of zero padding $P$. It then outputs a 3D matrix of size $W_2 \times H_2 \times D_2$ where
$$ W_2 = \frac{W_1 - F + 2P}{S} + 1, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1, \quad D_2 = K.$$
1. The **pooling layer** accepts a 3D matrix of size $W_1 \times H_1 \times D_1$ and has two hyperparameters: the spatial extent $F$ and the stride $S$. It then outputs a 3D matrix of size $W_2 \times H_2 \times D_2$ where
$$W_2 = \frac{W_1 - F}{S} + 1, \quad H_2 = \frac{H_1 - F}{S} + 1, \quad D_2 = D_1.$$

1. The **fully connected layer** is identical to a fully connected network layer. It accepts a $k$-dimensional vector and outputs an $l$-dimensional vector, where $l$ is the number of nodes in this layer (if the input is a 3D matrix, this matrix is flattened to become a vector).
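The two output-size formulas above are easy to get wrong by one, so it can help to wrap them in small helper functions. The sketch below uses hypothetical names (`conv_output_shape`, `pool_output_shape`) and illustrative input sizes. Raising an error when the division is not exact is a choice made for this sketch; some libraries round down instead.

```python
def conv_output_shape(w1, h1, k, f, s, p):
    """Output shape (W2, H2, D2) of a convolutional layer."""
    w_steps, w_rem = divmod(w1 - f + 2 * p, s)
    h_steps, h_rem = divmod(h1 - f + 2 * p, s)
    if w_rem or h_rem:
        raise ValueError("filter positions do not fit the padded input evenly")
    return w_steps + 1, h_steps + 1, k    # depth equals the number of filters

def pool_output_shape(w1, h1, d1, f, s):
    """Output shape (W2, H2, D2) of a pooling layer."""
    w_steps, w_rem = divmod(w1 - f, s)
    h_steps, h_rem = divmod(h1 - f, s)
    if w_rem or h_rem:
        raise ValueError("pooling windows do not fit the input evenly")
    return w_steps + 1, h_steps + 1, d1   # pooling keeps the depth unchanged

print(conv_output_shape(5, 5, k=2, f=3, s=2, p=1))  # (3, 3, 2)
print(pool_output_shape(3, 3, 2, f=2, s=1))         # (2, 2, 2)
```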

Next, we will describe what each of these layer types does in the context of a worked example.

#### 2.4.3. Worked example
Assume our CNN architecture consists of one layer of each type as follows:

1. Input layer that accepts a matrix with size $W_1 = 5, H_1 = 5, D_1 = 3$.
1. Convolutional layer with $K = 2, F = 3, S = 2, P = 1$.
1. ReLU layer
1. Pooling layer with $F = 2, S = 1$.
1. Fully connected layer (output layer) with two nodes
1. Softmax layer.

Note that instead of describing the second layer as "a convolutional layer with the ReLU activation function", we now have a separate ReLU layer as our third layer. This is just a notational change, to be more consistent with the PyTorch API that you will use later.

We also consider the input 3D matrix `X`:
\begin{align*}
\texttt{X[:,:,0]} & = \begin{pmatrix}
    0 & 1 & 1 & 2 & 1 \\
    0 & 2 & 2 & 0 & 2 \\
    0 & 2 & 0 & 0 & 1 \\
    2 & 1 & 2 & 0 & 0 \\
    1 & 0 & 1 & 1 & 0 \\
\end{pmatrix}, & \texttt{X[:,:,1]} = \begin{pmatrix}
    0 & 0 & 0 & 0 & 2 \\
    1 & 0 & 2 & 1 & 2 \\
    1 & 1 & 0 & 1 & 2 \\
    0 & 0 & 1 & 0 & 1 \\
    0 & 1 & 0 & 1 & 0 \\
\end{pmatrix}, \\\\
\texttt{X[:,:,2]} & = \begin{pmatrix}
    2 & 0 & 2 & 1 & 0 \\
    0 & 1 & 0 & 2 & 2 \\
    2 & 2 & 0 & 1 & 2 \\
    0 & 1 & 2 & 1 & 1 \\
    0 & 1 & 0 & 0 & 2
\end{pmatrix}.
\end{align*}

The feedforward procedure will then be performed as follows.

**(1) Input layer**:

The input layer simply takes this input matrix $X$ and passes it to the next layer.

**(2) Convolutional layer**:

Based on the hyperparameter settings $K = 2, F = 3, S = 2, P = 1$, this layer will have $K = 2$ filters, each of which is a weight matrix with size $F \times F \times 3$ (the depth 3 here is the same as the input depth), which is $3 \times 3 \times 3$. Typically the weights are initialized randomly. In this example, let's say they are

\begin{align*}
\texttt{w0[:,:,0]} & = \begin{pmatrix}
    -1 & -1 & 0 \\
    0 & -1 & -1 \\
    1 & -1 & -1
\end{pmatrix}, & \texttt{w0[:,:,1]} & = \begin{pmatrix}
    0 & -1 & 0 \\
    -1 & 0 & 1 \\
    -1 & -1 & 0
\end{pmatrix}, & \texttt{w0[:,:,2]} & = \begin{pmatrix}
    0 & -1 & -1 \\
    1 & 1 & 0 \\
    0 & 1 & 1
\end{pmatrix} \\\\
\texttt{w1[:,:,0]} & = \begin{pmatrix}
    1 & -1 & 0 \\
    -1 & 1 & 0 \\
    0 & 1 & -1
\end{pmatrix}, & \texttt{w1[:,:,1]} & = \begin{pmatrix}
    0 & 0 & -1 \\
    0 & 0 & 0 \\
    -1 & 1 & 0
\end{pmatrix}, & \texttt{w1[:,:,2]} & = \begin{pmatrix}
    -1 & 1 & -1 \\
    -1 & -1 & 1 \\
    0 & 0 & 1
\end{pmatrix}
\end{align*}

Note also that each filter (3D weight matrix) also has an associated bias term, let's say `b0 = 1` and `b1 = 0`. These weights will interact with the input matrix `X` as follows:

* Each 2D slice of `X` (namely, `X[:,:,0]`, `X[:,:,1]` and `X[:,:,2]`) is padded with $P = 1$ layer of 0s both row-wise and column-wise, producing

\begin{align*}
\texttt{X[:,:,0]} & = \begin{pmatrix}
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\
    \color{blue}{0} & 0 & 1 & 1 & 2 & 1 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 2 & 2 & 0 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 2 & 0 & 0 & 1 & \color{blue}{0} \\
    \color{blue}{0} & 2 & 1 & 2 & 0 & 0 & \color{blue}{0} \\
    \color{blue}{0} & 1 & 0 & 1 & 1 & 0 & \color{blue}{0} \\
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0}
\end{pmatrix}, & \texttt{X[:,:,1]} = \begin{pmatrix}
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\
    \color{blue}{0} & 0 & 0 & 0 & 0 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 1 & 0 & 2 & 1 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 1 & 1 & 0 & 1 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 0 & 1 & 0 & 1 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 1 & 0 & 1 & 0 & \color{blue}{0} \\
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\
\end{pmatrix}, \\\\
\texttt{X[:,:,2]} & = \begin{pmatrix}
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\
    \color{blue}{0} & 2 & 0 & 2 & 1 & 0 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 1 & 0 & 2 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 2 & 2 & 0 & 1 & 2 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 1 & 2 & 1 & 1 & \color{blue}{0} \\
    \color{blue}{0} & 0 & 1 & 0 & 0 & 2 & \color{blue}{0} \\
    \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\
\end{pmatrix}.
\end{align*}

We see that each slice is now a $7 \times 7$ matrix, where the first row, last row, first column and last column (marked with blue) are all padded 0s.

* The size of the output matrix from this convolutional layer is determined as
$$W_2 = \frac{W_1 - F + 2P}{S} + 1 = 3, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1 = 3, \quad D_2 = K = 2.$$
In other words, the output is a matrix `M` with size $3 \times 3 \times 2$.

* `M` is computed as follows:
\begin{align*}
\texttt{M[:,:,0]} & = \texttt{b0} + \begin{pmatrix}
    \texttt{(w0 * X[0:3,0:3,:]).sum()} & \texttt{(w0 * X[0:3,2:5,:]).sum()} & \texttt{(w0 * X[0:3,4:7,:]).sum()} \\
    \texttt{(w0 * X[2:5,0:3,:]).sum()} & \texttt{(w0 * X[2:5,2:5,:]).sum()} & \texttt{(w0 * X[2:5,4:7,:]).sum()} \\
    \texttt{(w0 * X[4:7,0:3,:]).sum()} & \texttt{(w0 * X[4:7,2:5,:]).sum()} & \texttt{(w0 * X[4:7,4:7,:]).sum()} 
\end{pmatrix} \\\\
\texttt{M[:,:,1]} & = \texttt{b1} + \begin{pmatrix}
    \texttt{(w1 * X[0:3,0:3,:]).sum()} & \texttt{(w1 * X[0:3,2:5,:]).sum()} & \texttt{(w1 * X[0:3,4:7,:]).sum()} \\
    \texttt{(w1 * X[2:5,0:3,:]).sum()} & \texttt{(w1 * X[2:5,2:5,:]).sum()} & \texttt{(w1 * X[2:5,4:7,:]).sum()} \\
    \texttt{(w1 * X[4:7,0:3,:]).sum()} & \texttt{(w1 * X[4:7,2:5,:]).sum()} & \texttt{(w1 * X[4:7,4:7,:]).sum()} 
\end{pmatrix},
\end{align*}

where `X` denotes the padded input, `w0 * X[0:3,0:3,:]` is an element-wise multiplication between two 3D matrices, and `sum()` computes the sum of all matrix elements, returning a single scalar value. More generally, we can derive the general formula for `M` as

$$\texttt{M[i,j,k]} = \texttt{bk + (wk * X[S*i:S*i+F , S*j:S*j+F , :]).sum()}.$$

For our specific example, with $S = 2$ and $F = 3$ we would have
$$\texttt{M[i,j,k]} = \texttt{bk + (wk * X[2*i:2*i+3 , 2*j:2*j+3 , :]).sum()},$$
which gives us

$$\texttt{M[:,:,0]} = \begin{pmatrix}
    0 & 0 & -2 \\
    -2 & -4 & -4 \\
    -2 & -7 & 0\\
\end{pmatrix}, \quad \texttt{M[:,:,1]} = \begin{pmatrix}
    -2 & 5 & 1 \\
    1 & -3 & -3 \\
    -1 & -1 & -3
\end{pmatrix}.$$
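The convolution above can be reproduced with a short NumPy sketch. The helper name `conv_forward` is ours, and the explicit triple loop is written for readability rather than speed; it follows the general formula for `M[i,j,k]` term by term, using the values of `X`, `w0`, `w1`, `b0` and `b1` from this example.

```python
import numpy as np

# Stack the 2D slices along the last axis so that X[:, :, c] is the c-th slice.
X = np.stack([
    np.array([[0, 1, 1, 2, 1], [0, 2, 2, 0, 2], [0, 2, 0, 0, 1], [2, 1, 2, 0, 0], [1, 0, 1, 1, 0]]),
    np.array([[0, 0, 0, 0, 2], [1, 0, 2, 1, 2], [1, 1, 0, 1, 2], [0, 0, 1, 0, 1], [0, 1, 0, 1, 0]]),
    np.array([[2, 0, 2, 1, 0], [0, 1, 0, 2, 2], [2, 2, 0, 1, 2], [0, 1, 2, 1, 1], [0, 1, 0, 0, 2]]),
], axis=-1)

w0 = np.stack([
    np.array([[-1, -1, 0], [0, -1, -1], [1, -1, -1]]),
    np.array([[0, -1, 0], [-1, 0, 1], [-1, -1, 0]]),
    np.array([[0, -1, -1], [1, 1, 0], [0, 1, 1]]),
], axis=-1)
w1 = np.stack([
    np.array([[1, -1, 0], [-1, 1, 0], [0, 1, -1]]),
    np.array([[0, 0, -1], [0, 0, 0], [-1, 1, 0]]),
    np.array([[-1, 1, -1], [-1, -1, 1], [0, 0, 1]]),
], axis=-1)
b0, b1 = 1, 0

def conv_forward(X, filters, biases, F, S, P):
    Xp = np.pad(X, ((P, P), (P, P), (0, 0)))        # zero-pad rows and columns, not depth
    rows = (X.shape[0] - F + 2 * P) // S + 1
    cols = (X.shape[1] - F + 2 * P) // S + 1
    M = np.zeros((rows, cols, len(filters)), dtype=X.dtype)
    for k, (w, b) in enumerate(zip(filters, biases)):
        for i in range(rows):
            for j in range(cols):
                window = Xp[S * i:S * i + F, S * j:S * j + F, :]
                M[i, j, k] = b + (w * window).sum()  # element-wise product, then sum
    return M

M = conv_forward(X, [w0, w1], [b0, b1], F=3, S=2, P=1)
print(M[:, :, 0])
print(M[:, :, 1])
```

The two printed slices should match `M[:,:,0]` and `M[:,:,1]` given above.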

**(3) ReLU layer**:

The ReLU layer takes the output matrix `M` from the convolutional layer as input and applies the ReLU function $f(x) = \max \{x, 0\}$ to every element in it, yielding the output matrix `M_relu` with the same shape:

$$\texttt{M_relu[:,:,0]} = \begin{pmatrix}
    0 & 0 & 0 \\
    0 & 0 & 0 \\
    0 & 0 & 0\\
\end{pmatrix}, \quad \texttt{M_relu[:,:,1]} = \begin{pmatrix}
    0 & 5 & 1 \\
    1 & 0 & 0 \\
    0 & 0 & 0
\end{pmatrix}.$$

**(4) Pooling layer**:

Based on the hyperparameter setting $F = 2,\ S = 1$, the pooling layer accepts the output matrix `M_relu` from the ReLU layer and outputs a 3D matrix of size

$$W_2 = \frac{W_1 - F}{S} + 1 = 2, \quad H_2 = \frac{H_1 - F}{S} + 1 = 2, \quad D_2 = D_1 = 2.$$

More specifically, this output matrix is computed as

$$\texttt{M_pooling[i,j,k]} = p(\texttt{M_relu[S*i:S*i+F , S*j:S*j+F , k]}),$$

for some pooling function $p$, which maps a 2D matrix to a scalar value. The most common choice of $p$ is $p(X) = \max_{i,j} X_{i,j}$, i.e., taking the maximum entry in $X$ (this is called maxpooling).

In our context, the output matrix will be

$$\texttt{M_pooling[:,:,0]} = \begin{pmatrix}
    0 & 0 \\
    0 & 0
\end{pmatrix}, \quad \texttt{M_pooling[:,:,1]} = \begin{pmatrix}
    5 & 5 \\
    1 & 0
\end{pmatrix}.$$
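The ReLU and maxpooling steps can be checked with the following NumPy sketch, which starts from the convolution output `M` of this example. The helper name `max_pool` is an assumption of this sketch; note that pooling is applied to each depth slice `k` separately.

```python
import numpy as np

# Output of the convolutional layer, stacked so that M[:, :, k] is the k-th slice.
M = np.stack([
    np.array([[0, 0, -2], [-2, -4, -4], [-2, -7, 0]]),
    np.array([[-2, 5, 1], [1, -3, -3], [-1, -1, -3]]),
], axis=-1)

M_relu = np.maximum(M, 0)   # element-wise max{x, 0}

def max_pool(A, F, S):
    rows = (A.shape[0] - F) // S + 1
    cols = (A.shape[1] - F) // S + 1
    out = np.zeros((rows, cols, A.shape[2]), dtype=A.dtype)
    for k in range(A.shape[2]):
        for i in range(rows):
            for j in range(cols):
                out[i, j, k] = A[S * i:S * i + F, S * j:S * j + F, k].max()
    return out

M_pooling = max_pool(M_relu, F=2, S=1)
print(M_pooling[:, :, 0])
print(M_pooling[:, :, 1])
```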

**(5) Fully connected layer**:

The fully connected layer first flattens the output matrix from the prior maxpool layer into a 1D vector:

$$x = \begin{pmatrix} 0 & 0 & 0 & 0 & 5 & 5 & 1 & 0 \end{pmatrix}^T.$$

As this layer has two nodes, the fully-connected weight matrix will have size $2 \times 8$. Again this matrix should be randomly initialized; here we will assume it is

$$W = \begin{pmatrix}
    1 & 3 & 2 & -2 & 0 & 2 & 1 & -1 \\
    3 & -1 & 2 & -2 & -1 & 2 & -1 & -2
\end{pmatrix}.$$

The output from this layer will then be
$$Wx = \begin{pmatrix} 11 \\ 4 \end{pmatrix}.$$

**(6) Softmax layer**:

Finally, we apply the softmax activation function to this output vector:
$$\text{softmax}(Wx) = \begin{pmatrix} e^{11} / (e^{11} + e^{4}) \\ e^{4} / (e^{11} + e^{4}) \end{pmatrix} = \begin{pmatrix} 0.9991 \\ 0.0009 \end{pmatrix}.$$

Therefore, if this is a binary classification problem, we would pick the first class since it has the higher hypothesis value.
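The last two layers can be sketched in NumPy as follows. The flattening order (all entries of slice 0 first, then slice 1) is chosen to match the vector $x$ shown above, and the max-subtraction in `softmax` is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

M_pooling = np.stack([
    np.array([[0, 0], [0, 0]]),
    np.array([[5, 5], [1, 0]]),
], axis=-1)

# Flatten slice by slice: move the depth axis to the front, then read row by row.
x = M_pooling.transpose(2, 0, 1).ravel()

W = np.array([
    [1, 3, 2, -2, 0, 2, 1, -1],
    [3, -1, 2, -2, -1, 2, -1, -2],
])
scores = W @ x              # one score per output node

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

probs = softmax(scores)
predicted_class = int(probs.argmax())   # index 0 is the first class

print(x, scores, probs.round(4), predicted_class)
```

As in the earlier layers of this example, the fully connected layer here has no bias term.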

This concludes the feedforward procedure. The backpropagation computations are quite complicated even with these small matrices, so we will not cover them here. Fortunately, in practice you will rarely need to implement backpropagation yourself, as many deep learning libraries allow you to just specify the neural network forward pass and then automatically perform backpropagation.

#### 2.4.4. Intuition behind CNN
Now that we have gone through a detailed worked example, let's take a step back and discuss the intuitions behind the above operations. Recall that our original goals are to (1) reduce the number of weight parameters, and (2) take advantage of invariance properties of images. What CNN does can be described at a high level as follows:

**(1) Convolutional layer**:

In this layer, we have a number of weight filters that "scan" through the input matrix in an attempt to extract useful features. The convolution operation, where we multiply a weight filter with a slice of the input, preserves the spatial relationship between pixels by learning image features using small squares of input data. It's interesting to note that different filter values can help extract different kinds of features. This is why we typically have more than one filter (i.e., $K > 1$), to detect several features from each image part. As an example, see the following illustration:

![cnn_filter](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-05-at-11-03-00-pm.png)

**Figure 5**: Effects of different filters on the convolved image. Source: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/.

The good news is that you don't have to think about which kinds of features to detect and what the corresponding filter values are; the CNN learns them through training. This setting also helps reduce the number of weight parameters. Instead of having a weight value for every pair of input and output pixels, as is the case with a fully-connected neural network, we now have a small weight filter that slides across sections of the input, which means that several input pixels are mapped to the same weight.

**(2) ReLU layer**:

The ReLU layer simply replaces every negative entry in the output matrix from a convolutional layer with zero. This plays the role of the non-linear activation function in a CNN's hypothesis function. You can see an example of its operation here:

![relu](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-6-18-19-pm.png?w=1496)

**Figure 6**: Input and output of a ReLU layer. Source: http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf

**(3) Pooling layer**:

The pooling layer's goal is to reduce the size of the input matrix. It does so by sliding across sections of the input, similar to a convolutional layer, but selecting only a single scalar value from each section, which is usually the max value. This has the effect of retaining the most important feature information.

**(4) Fully connected layer**:

The convolutional and pooling layers have helped us identify the most salient image features, i.e., those that can be used for the actual classification task. At this point, we can simply take the output from the last pooling layer as our new features, then build a separate logistic regression or SVM classifier to make predictions based on these features. However, it's a bit cumbersome to suddenly switch to a different model like that; instead, we note that these classifiers can also be replicated by a fully connected layer with the appropriate activation function (e.g., sigmoid activation for logistic regression). Therefore, we can simply append a fully connected layer and an output layer to our existing network architecture, so that a single feedforward pass will give us the actual hypothesis values, as we have seen from the worked example above.

In practice, you would rarely need to design your own CNN architecture; instead, simply look at which is the most commonly used architecture and its [performance benchmark](https://paperswithcode.com/sota/image-classification-on-imagenet). In many cases, there will be a pre-trained model available for you to download and use on your dataset.

#### 2.4.5. Closing remarks on CNN
Needless to say, CNN is a very involved technique that requires some deep learning (no pun intended) to master. The most important takeaway message here is that a CNN attempts to identify the most salient features of the images in each class, so that it can match a new image against these identified features to predict the corresponding label. As an example, how a trained CNN represents a cat image may be visualized as follows:

![cat_cnn](https://i0.wp.com/sefiks.com/wp-content/uploads/2017/11/cnn-filter.png)

**Figure 7**: Example of identified features in a cat image. Source: https://sefiks.com/2017/11/03/a-gentle-introduction-to-convolutional-neural-networks/

In this case, the important features appear to be the pointy ears, the :3 mouth and the tail shape, which may help distinguish a cat image from, say, a bird image, where these features should not be present. The convolutional layers are mainly responsible for this feature identification task, with the help of ReLU and maxpooling. Then, the final fully connected layer performs the actual prediction.

The CNN architecture aims at taking advantage of image properties to learn more efficiently and effectively, but it also has weaknesses. First, CNN identifies patterns but doesn't care about how the patterns are spatially laid out; for example, when training a CNN to detect human faces, one could [swap the eyes and mouth](https://sefiks.com/2017/11/03/a-gentle-introduction-to-convolutional-neural-networks/) of a face and get the same hypothesis value as the original face (a new kind of neural network architecture called [capsule network](https://towardsdatascience.com/capsule-networks-the-new-deep-learning-network-bd917e6818e8) was proposed to address this issue). Second, CNN largely relies on the actual pixel values to make its prediction, leading to cases where a modified image that is seemingly indistinguishable from the original one (to the human eye) is incorrectly predicted:

![pandas](https://miro.medium.com/max/4000/1*PmCgcjO3sr3CPPaCpy5Fgw.png)
**Figure 8**: Breaking CNN with adversarial attack. Source: https://towardsdatascience.com/breaking-neural-networks-with-adversarial-attacks-f4290a9a45aa

Building networks that are robust against this kind of manipulation is an active area of research and remains a crucial milestone before computer vision can have its own impact on our lives.