# Reading Activity 25 - Deep Neural Networks Continued

## Objectives

+ Derive a loss function for binary classification
+ Understand the mathematics of convolutional layers
+ Understand how regularization parameters help us avoid overfitting
+ Understand the Bayesian interpretation of regularization parameters
+ Use data augmentation to exploit known symmetries, e.g., translation and rotation invariance, of your dataset
+ Use early stopping to avoid overfitting
+ Tune the network hyper-parameters using grid search and Bayesian global optimization
+ Learn about the state-of-the-art in regularization techniques

## References

+ Chapters 7, 9, and 11 of https://www.deeplearningbook.org/
+ These notes.

These notes are not exhaustive. They merely provide a summary. Please consult the book chapters for the complete details.

## Loss functions for classification

Take some features $\mathbf{x}_{1:n}$ and some discrete targets $y_{1:n}$.
Because the targets are discrete, we have a classification problem.
What loss function should we use?
Let's examine two cases: binary and multiclass classification.

### Binary classification

In binary classification, we use a DNN $f(\mathbf{x};\theta)$ with parameters $\theta$ to model the probability that $y$ takes the value $1$:
$$
p(y=1|\mathbf{x},\theta) = \operatorname{sigm}(f(\mathbf{x};\theta)) = \frac{\exp\{f(\mathbf{x};\theta)\}}{1 + \exp\{f(\mathbf{x};\theta)\}}.
$$
Remember that the sigmoid function takes the scalar $f(\mathbf{x};\theta)$ and maps it to the interval $(0,1)$, so that we can interpret the result as a probability.
Since the probabilities of the two outcomes must sum to one, we get that:
$$
p(y=0|\mathbf{x},\theta) = 1 - p(y=1|\mathbf{x},\theta) = 1 - \operatorname{sigm}(f(\mathbf{x};\theta)).
$$
So, for an arbitrary $y$ (either 0 or 1), we can write:
$$
p(y|\mathbf{x},\theta) = \left[\operatorname{sigm}(f(\mathbf{x};\theta))\right]^y
\left[1-\operatorname{sigm}(f(\mathbf{x};\theta))\right]^{1-y}.
$$
This is a nice trick because it activates the right term based on what $y$ is.
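
To see the trick in action, here is a minimal sketch in plain Python. The function names `sigm` and `bernoulli_likelihood` and the value `z = 0.8` are made up for illustration; `z` stands in for the network output $f(\mathbf{x};\theta)$.

```python
import math


def sigm(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_likelihood(y, z):
    """p(y | x, theta) = sigm(z)^y * (1 - sigm(z))^(1 - y), where z = f(x; theta)."""
    p = sigm(z)
    return p ** y * (1.0 - p) ** (1 - y)


z = 0.8  # hypothetical network output f(x; theta)
p1 = bernoulli_likelihood(1, z)  # the exponent y = 1 keeps only sigm(z)
p0 = bernoulli_likelihood(0, z)  # the exponent 1 - y = 1 keeps only 1 - sigm(z)
```

Setting `y = 1` switches off the second factor and setting `y = 0` switches off the first, so `p1 + p0` equals one.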

Now that we have specified the likelihood of a single observation, the likelihood of the entire dataset is:
$$
p(y_{1:n}|\mathbf{x}_{1:n},\theta) = \prod_{i=1}^n p(y_i|\mathbf{x}_i,\theta).
$$
We are almost done.
The idea is to train the network by maximizing the log likelihood, which is the same as minimizing the following loss function:
$$
\begin{split}
L(\theta) &= -\log p(y_{1:n}|\mathbf{x}_{1:n},\theta)\\
&= -\sum_{i=1}^n \left\{y_i \log \operatorname{sigm}(f(\mathbf{x}_i;\theta))
+ (1-y_i)\log \left[1-\operatorname{sigm}(f(\mathbf{x}_i;\theta))\right]
\right\}.
\end{split}
$$
This loss function is known as the *cross entropy* loss.
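
The loss above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a reference implementation: the names `log_sigm` and `binary_cross_entropy` and the toy values are hypothetical, and `logits` stands in for the network outputs $f(\mathbf{x}_i;\theta)$. The sketch works with $\log \operatorname{sigm}$ directly and uses the identity $1 - \operatorname{sigm}(z) = \operatorname{sigm}(-z)$, which avoids taking the logarithm of a number that has rounded to zero.

```python
import math


def log_sigm(z):
    """log sigm(z) = -log(1 + exp(-z)), arranged so that exp never overflows."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def binary_cross_entropy(logits, targets):
    """L = -sum_i [ y_i log sigm(z_i) + (1 - y_i) log(1 - sigm(z_i)) ]."""
    total = 0.0
    for z, y in zip(logits, targets):
        # 1 - sigm(z) = sigm(-z), so the second log is log_sigm(-z).
        total -= y * log_sigm(z) + (1 - y) * log_sigm(-z)
    return total


logits = [2.0, -1.0, 0.5]  # hypothetical values of f(x_i; theta)
targets = [1, 0, 1]        # observed labels y_i
loss = binary_cross_entropy(logits, targets)
```

Deep learning libraries typically offer a version of this loss that takes the raw network outputs rather than probabilities, for the same numerical reason.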

### Multiclass classification

Now assume that $y$ can take $K$ different values ranging from $0$ to $K-1$.
We need to model the probability that $y=k$ given $\mathbf{x}$.
To do this, we introduce a DNN $\mathbf{f}(\mathbf{x};\theta)$ with parameters $\theta$ and $K$ outputs:
$$
\mathbf{f}(\mathbf{x};\theta) = \left(f_0(\mathbf{x};\theta),\dots,f_{K-1}(\mathbf{x};\theta)\right).
$$
However, $\mathbf{f}(\mathbf{x};\theta)$ is just a bunch of $K$ scalars.
We need to turn it into a $K$-dimensional probability vector.
To achieve this, we define:
$$
p(y=k|\mathbf{x},\theta) = \operatorname{softmax}_k(\mathbf{f}(\mathbf{x};\theta))
:= \frac{\exp\left\{f_k(\mathbf{x};\theta)\right\}}{\sum_{k'=0}^{K-1}\exp\left\{f_{k'}(\mathbf{x};\theta)\right\}}.
$$
So, the role of the softmax is dual: first, it turns the scalars into positive numbers and, second, it normalizes them so that they sum to one.
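
Both roles are visible in a short sketch of the softmax in plain Python. The function name `softmax` and the input values are illustrative. Subtracting the largest entry before exponentiating is an added numerical safeguard that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(f):
    """Map K real scalars f_0, ..., f_{K-1} to a probability vector."""
    m = max(f)                           # shift for numerical stability
    exps = [math.exp(v - m) for v in f]  # step 1: make every entry positive
    total = sum(exps)
    return [e / total for e in exps]     # step 2: normalize to sum to one


probs = softmax([1.0, 2.0, 3.0])  # hypothetical network outputs for K = 3
```

Adding the same constant to every output does not change the probabilities, so only the differences between the $f_k$'s matter.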

For an arbitrary $y$, we can just write:
$$
p(y|\mathbf{x},\theta) = \operatorname{softmax}_y(\mathbf{f}(\mathbf{x};\theta)).
$$
So, the likelihood of the dataset is:
$$
p(y_{1:n}|\mathbf{x}_{1:n},\theta) = \prod_{i=1}^n p(y_i|\mathbf{x}_i,\theta).
$$
Therefore, the loss function we should be minimizing is:
$$
\begin{split}
L(\theta) &= -\log p(y_{1:n}|\mathbf{x}_{1:n},\theta)\\
&= -\sum_{i=1}^n \log \left[\operatorname{softmax}_{y_i}(\mathbf{f}(\mathbf{x}_i;\theta))\right].
\end{split}
$$
This is also called the *cross entropy* loss, but for multiclass classification.
Sometimes, it is called the *softmax cross entropy* loss.
Now all these names are not really important.
What is important is to understand how these losses are derived from maximum likelihood.
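
As a rough illustration, the multiclass loss can be sketched in plain Python as follows. The names `log_softmax` and `softmax_cross_entropy` and all numerical values are hypothetical. The sketch computes the log of the softmax directly using the log-sum-exp trick, an implementation choice that avoids overflow. The last lines check that with $K=2$ and outputs $(0, z)$ the loss reduces to the binary cross entropy with $f = z$, since $\operatorname{softmax}_1((0, z)) = \operatorname{sigm}(z)$.

```python
import math


def log_softmax(f):
    """log softmax_k(f) = f_k - log sum_k' exp(f_k'), via the log-sum-exp trick."""
    m = max(f)
    lse = m + math.log(sum(math.exp(v - m) for v in f))
    return [v - lse for v in f]


def softmax_cross_entropy(outputs, labels):
    """L = -sum_i log softmax_{y_i}(f(x_i; theta))."""
    return -sum(log_softmax(f)[y] for f, y in zip(outputs, labels))


outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # hypothetical f(x_i; theta), K = 3
labels = [0, 2]                                # observed classes y_i
loss = softmax_cross_entropy(outputs, labels)

# Special case K = 2 with outputs (0, z): softmax_1 equals sigm(z).
z = 0.8
two_class = softmax_cross_entropy([[0.0, z]], [1])
binary = math.log1p(math.exp(-z))  # -log sigm(z), the binary loss for y = 1
```

Here `two_class` and `binary` agree up to rounding, which shows that the binary loss is a special case of the multiclass one.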

## Regularization 

DNNs are extremely flexible. If 