# 1. Recap of Lecture 9.

- Activation functions are nonlinear.
    - **Question:** Why are the activation functions nonlinear?
    - **Answer:**
        - The purpose of the activation function is to introduce non-linearity into the network. In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables.
        - Non-linear means that the output cannot be reproduced from a linear combination of the inputs. (This is not the same as an output that plots as a straight line; the word for that is affine.)
        - Another way to think of it: without a non-linear activation function, a NN, no matter how many layers it has, would behave just like a single-layer perceptron, because composing linear layers just gives another linear function (see the definition just above).
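The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the layer sizes, the random weights `W1`, `W2` and the biases `b1`, `b2` are assumptions made up for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hypothetical layer 1
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # hypothetical layer 2
x = rng.normal(size=3)

# Two stacked layers with no activation function in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapses = np.allclose(two_layers, one_layer)
print(collapses)
```

Whatever the depth, repeating this merge step reduces the stack to one weight matrix and one bias, which is why a non-linearity between layers is needed.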

- Backpropagation assumes that we can calculate the partial derivatives with respect to each of the network parameters by passing backwards through the net:

![alt text](https://i.ibb.co/6sH2TjR/Screen-Shot-2020-11-10-at-14-38-24.png)

The partial derivatives are needed in order to apply the gradient descent method and find some optimal values:

![alt text](https://i.ibb.co/FqqQ0g2/Screen-Shot-2020-11-10-at-14-41-35.png)

The right picture above corresponds to the `Loss vs. n_epochs graph`. It doesn't look smooth because, most of the time, we apply Stochastic Gradient Descent and randomly draw subsets (batches) of the original data set, which in turn makes the gradient noisy.

Even Gradient Descent requires hyperparameter selection (for instance, the learning rate, as can be seen in the leftmost picture). There are methods that allow SGD to be used more effectively in terms of convergence.

Among them:
- Momentum.
- Adagrad.
- Adadelta.
- RMSprop.
- Adam.
- Even other NNs.

![alt text](https://2.bp.blogspot.com/-q6l20Vs4P_w/VPmIC7sEhnI/AAAAAAAACC4/g3UOUX2r_yA/s1600/s25RsOr%2B-%2BImgur.gif)

Let's discuss them.

---

# 2. Classical SGD.

![alt text](https://i.ibb.co/nwTmT2s/Screen-Shot-2020-11-10-at-14-53-12.png)
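As a rough illustration, here is a minimal minibatch SGD loop. It assumes the update in the figure is the usual $x_{t+1} = x_t - \alpha \nabla f_{\text{batch}}(x_t)$; the toy linear-regression data, the helper `grad_mse`, and the values of `alpha` and `batch_size` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                      # noiseless targets, for illustration only

def grad_mse(w, Xb, yb):
    # Gradient of mean((Xb @ w - yb)**2) with respect to w.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(2)
alpha, batch_size = 0.1, 32
for step in range(200):
    # Draw a random minibatch; its gradient is a noisy estimate of the full one.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w = w - alpha * grad_mse(w, X[idx], y[idx])
```

Only the current parameters and one minibatch are needed per step, which is where the small memory footprint comes from.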

---

## 2.1. SGD: Advantages.

1) We have to remember what the output was for each layer (when doing the forward pass): $O(1)$.

2) We have to remember the input to each layer in order to calculate the derivative w.r.t. the input: $O(1)$.

$\implies$ at each iteration it requires just $O(1)$ of memory.

## 2.2. SGD: Drawbacks.

Because of its stochastic nature:

1) On average the loss function will decrease towards the minimum, but for consecutive steps it will bounce up and down.

2) Once it's close to the minimum, it doesn't converge to it fast.

3) At saddle points or local minima the gradient $= 0$, so SGD won't go any further in search of the global minimum.

How to deal with this $\implies$

# 3. SGD with momentum.

1) Each time we update, the new point $x_{t+1}$ is a little further along than the point $x_{t+1}$ we'd get with classical SGD. This helps the descent avoid getting stuck at a local minimum or saddle point.



![alt text](https://i.ibb.co/fpnyjpS/Screen-Shot-2020-11-10-at-15-25-28.png)

![alt text](https://ruder.io/content/images/2016/09/saddle_point_evaluation_optimizers.gif)
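A minimal sketch of gradient descent with momentum, assuming an update of the form $v_{t+1} = \rho v_t + \nabla f(x_t)$, $x_{t+1} = x_t - \alpha v_{t+1}$ (the exact form in the figure may differ). The quadratic test function and the values of `alpha` and `rho` are chosen only for illustration.

```python
import numpy as np

A = np.diag([1.0, 25.0])           # ill-conditioned quadratic f(x) = 0.5 * x^T A x

def grad_f(x):
    return A @ x

def momentum_descent(x0, alpha=0.02, rho=0.9, n_steps=300):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)           # accumulated gradients, same shape as x
    for _ in range(n_steps):
        v = rho * v + grad_f(x)    # keep a decaying sum of past gradients
        x = x - alpha * v          # step along the accumulated direction
    return x

x_final = momentum_descent([5.0, 5.0])
```

The extra state is the vector `v`, which has one entry per parameter.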

---

# 4. SGD: Nesterov's Method.

- Classical Gradient Descent:
    - I'm at position $x_t$.
    - The antigradient here is $-\nabla f(x_t)$ and points to some direction from $x_t$.
    - Let's take a step towards that direction $\implies x_{t+1} = x_t + \alpha (-\nabla f(x_t)) = x_t - \alpha\nabla f(x_t)$.
    
    
- Gradient Descent with momentum:
    - I'm at position $x_t$.
    - The momentum $- v_{t+1}$ (the accumulation of antigradients up to the current point) points to some direction from $x_t$.
    - Let's take a step towards that direction $\implies x_{t+1} = x_t + \alpha (-v_{t+1})= x_t - \alpha v_{t+1}$.


- Gradient Descent Nesterov's Method:
    - I'm at position $x_t$.
    - The momentum $+ v_{t}$ (the accumulation of **<font color=blue>look-ahead</font>** antigradients up to the **previous** points; check the example below to see why) points in some direction from $x_t$.
    - Let's see which direction the antigradient takes at that look-ahead point: $-\nabla f(x_t + \rho v_t)$ (having looked into the future, we now know in which direction we can descend towards the minimum).
    - Instead of stepping to $\tilde{x}_{t+1} = x_t + \rho v_t$, computing the gradient there and descending from that point ($\tilde{x}_{t+2} = \tilde{x}_{t+1} - \alpha \nabla f(\tilde{x}_{t+1})$), we can descend in the same direction straight from $x_t$ by taking a step along $v_{t+1} \overset{\Delta}{=}-\nabla f(x_t + \rho v_t) + v_t$ from $x_t$. That is, $x_{t+1} = x_t + v_{t+1}$. Admittedly, in this case $\tilde{x}_{t+2} \neq x_{t+1}$, because of the different learning rates we use.

Example Gradient Descent Nesterov's Method:

Let $v_0 = 0$.

- $v_1 = \rho v_0 - \alpha \nabla f(x_0 + \rho v_0) = -\alpha \nabla f(x_0)$.
- $x_1 = x_0 + v_1 = x_0 -\alpha \nabla f(x_0)$.


- $v_2 = \rho v_1 - \alpha \nabla f(x_1+ \rho v_1) = -\rho \alpha \nabla f(x_0) - \alpha \nabla f(x_1 - \rho\alpha \nabla f(x_0))$.
- $x_2 = x_1 + v_2$.
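The two steps of the example can be reproduced numerically. This is a sketch using the update from the example, $v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$ and $x_{t+1} = x_t + v_{t+1}$; the function $f(x) = \tfrac{1}{2}\|x\|^2$, the starting point and the values of `alpha` and `rho` are assumptions picked for illustration.

```python
import numpy as np

def grad_f(x):
    return x                        # gradient of f(x) = 0.5 * ||x||^2

def nesterov_step(x, v, alpha, rho):
    # Evaluate the gradient at the look-ahead point x + rho * v.
    v_next = rho * v - alpha * grad_f(x + rho * v)
    return x + v_next, v_next

alpha, rho = 0.1, 0.9
x0, v0 = np.array([1.0, -2.0]), np.zeros(2)

x1, v1 = nesterov_step(x0, v0, alpha, rho)
x2, v2 = nesterov_step(x1, v1, alpha, rho)
```

With $v_0 = 0$ the first step is a plain gradient step; from the second step on, the gradient is taken at the shifted point rather than at $x_t$.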

![alt text](https://i.ibb.co/G5JBqTz/Screen-Shot-2020-11-10-at-17-14-52.png)

![alt text](https://ruder.io/content/images/2016/09/contours_evaluation_optimizers.gif)

---

## 4.1. Nesterov and Momentum Drawback.

1) We have to remember what the output was for each layer (when doing the forward pass):  $O(1)$.

2) We have to remember the input to each layer in order to calculate the derivative w.r.t. the input:  $O(1)$.

$\implies$ at each iteration it requires just $O(1)$ of memory $+\ O(|v_t|)$ to store a tensor (vector) $v_t$ for the accumulated gradients. $v_t$ has the same dimensions as the parameter space ($|v_t| =$ number of the network's parameters).

---

# 5. AdaGrad: SGD + steepness cache.

- Key idea: the steepness is different for each dimension. We should account for it.
- $(\text{cache}_t)_i = (s_t)_i$ - characterizes how steep the loss function is along the $i$-th dimension.

![alt text](https://i.ibb.co/GnnbmgT/Screen-Shot-2020-11-10-at-19-18-37.png)

![alt text](https://i.ibb.co/gSzk98x/Screen-Shot-2020-11-10-at-19-21-48.png)

- $\otimes$ stands for elementwise multiplication.
- $\oslash$ for elementwise division.

Suppose $\nabla_i f(x_t) \in (0,1)$. Then $(\tilde{s}_t)_i \overset{\Delta}{=} \nabla_i f(x_t) \otimes \nabla_i f(x_t) = \nabla_i f(x_t)^2 \in (0,1)$ is even smaller, so $\frac{1}{(\tilde{s}_t)_i} > 1$ and $\frac{1}{(\tilde{s}_t)_i}\nabla_i f(x_t) > \nabla_i f(x_t)$. This means that in $(x_{t+1})_i = (x_t)_i - \alpha \cdot \frac{1}{(\tilde{s}_t)_i}\nabla_i f(x_t)$ we take a big step from $(x_t)_i$ to escape the plateau along the $i$-th axis.

- In other words, it's like a classical gradient descent step, but accounting for the steepness of the surface along each dimension.
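A minimal AdaGrad sketch. It assumes the commonly used form of the update, with the accumulated cache under a square root and a small `eps` in the denominator; the figures above may use slightly different notation. The test function (a quadratic that is 100 times steeper along the second axis) and the value of `alpha` are illustrative choices.

```python
import numpy as np

def adagrad(grad_f, x0, alpha=0.5, eps=1e-8, n_steps=500):
    x = np.array(x0, dtype=float)
    cache = np.zeros_like(x)                     # one steepness entry per dimension
    for _ in range(n_steps):
        g = grad_f(x)
        cache += g * g                           # elementwise square, accumulated over all steps
        x -= alpha * g / (np.sqrt(cache) + eps)  # elementwise division
    return x, cache

steepness = np.array([1.0, 100.0])               # f(x) = 0.5 * sum(steepness * x**2)
x_final, cache = adagrad(lambda x: steepness * x, x0=[1.0, 1.0])
```

Because each coordinate is divided by its own accumulated gradient magnitude, the steep and the flat directions make comparable progress per step here.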

---

## 5.1. AdaGrad Drawbacks.

- If $\big|\nabla_i f(x_t) \big|$ is big $\implies \nabla_i f(x_t)^2$ is even bigger $\implies \frac{\alpha}{(\tilde{s}_t)_i}$ gives an even smaller learning rate.
- That said, we have that:
    - 1) It may kill the gradients along the dimensions having steepest descent.
    - 2) It may explode the gradients along the dimensions corresponding to the plateaus.
    
    
- Note: reducing the learning rate is good, if we are trying to optimize a convex function.

How to deal with this:

- Accumulate the cache not over all iterations, but only over the last $k$ ones.
- $\color{blue}{(*)}$ The problem with this is that the cache will change irregularly over time, since with each iteration we are deleting some squared gradients and adding new ones.

How to deal with this $\implies$

----


# 6. RMSProp: AdaGrad with exponential smoothing.

- **Exponential smoothing** is a rule-of-thumb technique for smoothing time series data using the exponential window function. Whereas in the simple moving average the past observations are weighted equally, exponential smoothing assigns exponentially decreasing weights over time. It is an easily learned and easily applied procedure for making some determination based on prior assumptions by the user, such as seasonality, and it is often used for analysis of time-series data.
- That is, it allows us to smoothly forget the cache of early steps (not like in $\color{blue}{(*)}$).
- $\beta \in (0, 1)$.

![alt text](https://i.ibb.co/FwdKC1w/Screen-Shot-2020-11-10-at-19-57-47.png)
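One RMSProp step can be sketched as follows, assuming the usual form $\text{cache}_{t+1} = \beta\,\text{cache}_t + (1-\beta)\nabla f(x_t)^2$ followed by an AdaGrad-style division. The function name `rmsprop_step` and the default values are illustrative, not taken from the lecture.

```python
import numpy as np

def rmsprop_step(x, cache, g, alpha=0.01, beta=0.9, eps=1e-8):
    # Exponentially smoothed squared gradient: old steps are forgotten gradually.
    cache = beta * cache + (1 - beta) * g * g
    x = x - alpha * g / (np.sqrt(cache) + eps)
    return x, cache

x = np.array([1.0, 1.0])
cache = np.zeros(2)
g = np.array([2.0, -4.0])          # gradients of very different magnitude
x_new, cache_new = rmsprop_step(x, cache, g)
step_sizes = np.abs(x_new - x)
```

On this first step both coordinates move by the same amount, $\alpha/\sqrt{1-\beta}$, regardless of the gradient magnitude.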

---

## 6.1. Memory Usage.

$O(1 + |\text{cache}_t|)$:

- $1$ - this is what we remember in the forward pass (the activation on each layer).

---

# 7. Adam (Adaptive moment): Momentum + RMSProp.

- We perform exponential smoothing on both the momentum and $\text{cache}_t$.

![alt text](https://i.ibb.co/KwFtsn7/Screen-Shot-2020-11-11-at-11-11-04.png)

<h1><center>$\Downarrow$</center></h1>


![alt text](https://i.ibb.co/XxS119B/Screen-Shot-2020-11-11-at-11-09-56.png)

---

## 7.1. Memory Usage.

$O(1 + |v_t| + |\text{cache}_t|)$.



### But this is not quite Adam.

[Source](https://cs231n.github.io/neural-networks-3/)

![alt text](https://i.ibb.co/6X0vNXp/Screen-Shot-2020-11-11-at-11-17-15.png)

\begin{align*}
	v_{t+1} &= \gamma v_t + (1-\gamma)\nabla f(x_t)\\
	u_{t+1} &= \frac{v_{t+1}}{1-\gamma^{t+1}}\\
	\text{cache}_{t+1} &= \beta \text{cache}_t + (1-\beta)(\nabla f(x_t))^2\\
	w_{t+1} &= \frac{\text{cache}_{t+1}}{1-\beta^{t+1}}\\
	x_{t+1} &= x_t - \alpha\frac{u_{t+1}}{\sqrt{w_{t+1}} + \varepsilon}
\end{align*}
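The equations above translate almost line by line into code. The sketch below uses a 1-based step counter in the bias correction so that the denominators are never zero; the function name `adam` and the default hyperparameter values are assumptions for illustration.

```python
import numpy as np

def adam(grad_f, x0, alpha=0.1, gamma=0.9, beta=0.999, eps=1e-8, n_steps=1000):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)            # smoothed gradient (momentum)
    cache = np.zeros_like(x)        # smoothed squared gradient
    for step in range(1, n_steps + 1):
        g = grad_f(x)
        v = gamma * v + (1 - gamma) * g
        cache = beta * cache + (1 - beta) * g * g
        u = v / (1 - gamma ** step)      # bias-corrected momentum
        w = cache / (1 - beta ** step)   # bias-corrected cache
        x = x - alpha * u / (np.sqrt(w) + eps)
    return x

steepness = np.array([1.0, 100.0])
x_after_one_step = adam(lambda x: steepness * x, [1.0, -1.0], n_steps=1)
```

Thanks to the bias correction, the very first step has magnitude close to `alpha` in every coordinate instead of being shrunk by the zero-initialized averages.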

![alt text](https://i.ibb.co/jTQC858/ezgif-com-gif-maker.gif)

---

# 8. Learning Rate Decay.

- The learning rate $\alpha$ is a hyperparameter: we have to find its optimal value for optimizing the loss function. This is hard.
- **Question:** What to do?
- **Answer:** Learning rate decay. That is, 
    - 1) Fix $\alpha$.
    - 2) Optimize the loss function. We'll reach some plateau.
    - 3) On the plateau, reduce $\alpha$. This lets the optimization method move off the plateau.
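Two simple decay rules can be sketched as plain functions. The names `step_decay` and `reduce_on_plateau`, as well as the parameters `drop`, `every`, `factor`, `patience` and `min_delta`, are hypothetical and only illustrate the idea; real libraries ship their own schedulers.

```python
def step_decay(alpha0, epoch, drop=0.5, every=10):
    # Multiply the learning rate by `drop` once every `every` epochs.
    return alpha0 * drop ** (epoch // every)

def reduce_on_plateau(losses, alpha, factor=0.1, patience=3, min_delta=1e-4):
    # Reduce alpha if the last `patience` epochs did not improve on the earlier best loss.
    if len(losses) <= patience:
        return alpha
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    if best_before - recent_best < min_delta:
        return alpha * factor
    return alpha

still_improving = reduce_on_plateau([1.0, 0.8, 0.6, 0.4, 0.2], alpha=0.1)
on_plateau = reduce_on_plateau([1.0, 0.5, 0.3, 0.3, 0.3, 0.3], alpha=0.1)
```

The first rule follows a fixed schedule, while the second reacts to the observed loss curve, which matches the three steps listed above.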
    
[Source](http://www.bdhammel.com/learning-rates/)
![alt text](http://www.bdhammel.com/assets/learning-rate/lr-types.png)

[Source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture7.pdf)
![alt text](https://i.ibb.co/Ms4kX6r/Screen-Shot-2020-11-11-at-11-54-52.png)
![alt text](https://i.ibb.co/r24BngL/Screen-Shot-2020-11-11-at-11-54-45.png)

---

# 9. Data Normalization.

![alt text](https://i.ibb.co/hdjNC6X/Screen-Shot-2020-11-11-at-12-03-07.png)

# 10. Batch Normalization.

![alt text](https://i.ibb.co/ZKyW67L/Screen-Shot-2020-11-11-at-12-08-57.png)

- If we normalize the input before layer $i$, then at the output it may be the case that the data (the input for the $(i+1)$-th layer) is no longer centered/normalized (for example, if we used sigmoid or ReLU as the activation function).
- **Question:** How to deal with this?
- **Answer:** We should introduce normalization between each pair of layers. This is called _Batch Normalization_.

![alt text](https://i.ibb.co/qF1RY0Y/Screen-Shot-2020-11-11-at-12-18-02.png)
![alt text](https://i.ibb.co/cNBKkpX/Screen-Shot-2020-11-11-at-12-19-18.png)


## 10.1. Batch Normalization Workflow.

<h2><center>$\text{Layer}_i \Longrightarrow \text{Batch_Normalization} \Longrightarrow \text{Activation_Function}_i \Longrightarrow \text{Layer}_{i+1}$</center></h2>


## 10.2. What is this? ![alt text](https://i.ibb.co/PMbv0Wf/Screen-Shot-2020-11-11-at-12-33-47.png)

---

#### Reason 1:

- Suppose after the layer and before the activation function we perform normalization: we get $\hat{x}_i$.
- After performing normalization, the values of $\hat{x}_i$ are pretty close to zero.
- If the activation function happens to be sigmoid or $\tanh$, then, with that range of values for $\hat{x}_i$, the function will behave as a linear one:

![alt text](https://i.ibb.co/sb459zw/Screen-Shot-2020-11-11-at-12-51-18.png)
![alt text](https://i.ibb.co/GpqW9J9/Screen-Shot-2020-11-11-at-12-51-57.png)
![alt text](https://i.ibb.co/xjftsVg/Screen-Shot-2020-11-11-at-12-51-23.png)

- This means that we lose the nonlinearity that the activation function must provide.
- **Question:** How to deal with this?
- **Answer:** Let's scale and shift the data, which is what that line in the code does.

---

#### Reason 2:

From the original [paper](https://arxiv.org/pdf/1502.03167.pdf):
![alt text](https://i.ibb.co/k9CSr1R/Screen-Shot-2020-11-11-at-13-02-11.png)
![alt text](https://i.ibb.co/J56wgSp/Screen-Shot-2020-11-11-at-13-02-17.png)


## 10.3. Batch Normalization on the test set.

- Problem:
    - When performing batch normalization we do so by processing one mini batch at a time.
    - But at test time we may need to process the examples one at a time.
    - How do we deal with this?
    
    
- Solution:
    - We need a different way to come up with $\mu_j$ and $\sigma_j^2$.
    - We do so, by estimating them.
    - **Question:** How?
    - **Answer**: We estimate them from the training statistics, using exponentially weighted averages across minibatches.
        - Let $X_1, X_2, X_3, \ldots$ be minibatches.
        - When training on the minibatch $X_t$, after the $j$-th layer we perform batch normalization, which gives us $\mu_{tj}$ and $\sigma^2_{tj}$.
        - Therefore we estimate $\mu_j$ after the $j$-th layer by exponential smoothing (exponentially weighted average). After the $t$-th trained batch the estimate $\mu_j^{(t)}$ is:
            - $\mu_j^{(1)} = \mu_{1j}$.
            - $\mu_j^{(t)} = \beta\mu_{tj} + (1-\beta)\mu_j^{(t-1)}$.
            
        - The same for $\sigma_j^2$:
            - $(\sigma^2_j)^{(1)} = \sigma^2_{1j}$.
            - $(\sigma^2_j)^{(t)} = \beta\sigma^2_{tj} + (1-\beta)(\sigma^2_j)^{(t-1)}$.
        - Finally, at test time after the $j$-th layer we use the final estimates $\mu_j$ and $\sigma^2_j$ to compute $$\hat{x}_i = \frac{x_i - \mu_j}{\sqrt{\sigma^2_j + \varepsilon}}$$
        $$y_i = \gamma \hat{x}_i + \theta$$
        ($\gamma$ and $\theta$ are learned during training).
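The train/test difference can be sketched as a small forward-only layer. The class name `BatchNorm1D`, the default `beta=0.1` and the toy data are assumptions for illustration; learning $\gamma$ and $\theta$ by backpropagation is left out.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, n_features, beta=0.1, eps=1e-5):
        self.gamma = np.ones(n_features)     # scale (would be learned during training)
        self.theta = np.zeros(n_features)    # shift (would be learned during training)
        self.beta, self.eps = beta, eps
        self.running_mu = None
        self.running_var = None

    def forward(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)   # statistics of this minibatch
            if self.running_mu is None:               # the first minibatch initializes the estimates
                self.running_mu, self.running_var = mu, var
            else:                                     # exponentially weighted average
                self.running_mu = self.beta * mu + (1 - self.beta) * self.running_mu
                self.running_var = self.beta * var + (1 - self.beta) * self.running_var
        else:
            mu, var = self.running_mu, self.running_var   # estimates from training
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.theta

rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    out_train = bn.forward(batch, training=True)

# At test time a single example can be processed using the stored estimates.
single = bn.forward(np.array([[5.0, 5.0, 5.0]]), training=False)
```

During training each batch is normalized with its own statistics, while at test time the layer is a fixed affine transformation.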
        

---

# 11. Regularization.

- It constrains the solution space from which we draw the optimal solution.

---

## 11.1. Method 1: Directly on the loss function.

- This is done by including a regularization term in the loss function.
- Usually we perform $L_1$, $L_2$ and Elastic Net regularization.
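As a sketch, the regularization term is simply added to the data loss. The mean-squared-error loss and the names `lam_l1` and `lam_l2` are illustrative assumptions; setting both coefficients to non-zero values gives an Elastic Net style penalty.

```python
import numpy as np

def mse_with_penalty(w, X, y, lam_l1=0.0, lam_l2=0.0):
    data_loss = np.mean((X @ w - y) ** 2)
    # L1 penalizes the absolute values of the weights, L2 their squares.
    penalty = lam_l1 * np.sum(np.abs(w)) + lam_l2 * np.sum(w ** 2)
    return data_loss + penalty

w = np.array([1.0, -2.0])
X = np.eye(2)
y = np.array([1.0, -2.0])          # w fits the data exactly, so only the penalty remains
loss = mse_with_penalty(w, X, y, lam_l1=0.1, lam_l2=0.01)
```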

---

## 11.2. Method 2: Dropout.

- This is a technique that, at each iteration, randomly turns off some of the neurons.
- **Question:** What does turning off neurons mean?
- **Answer:** The network structure remains unchanged. We only ignore the output of the neuron during the forward pass and ignore its gradient during the backward pass. This is done by multiplying the neuron's output by $0$ or $1$ (by a boolean mask). This boolean mask is randomly generated.

The dropout probability controls how many 0s and 1s the boolean mask contains, and hence which neurons are switched off.

- When testing, all the neurons are used.

![alt text](https://i.ibb.co/rFjzts6/Screen-Shot-2020-11-11-at-13-12-22.png)


- Dropout with Batch Normalization doesn't work that well. This is known as variance shift. Read more [here](https://openaccess.thecvf.com/content_CVPR_2019/papers/Li_Understanding_the_Disharmony_Between_Dropout_and_Batch_Normalization_by_Variance_CVPR_2019_paper.pdf).
- One solution is to normalize the test set (multiply by the probability of dropout, in order to decrease the variance to the previous level).

### 11.2.1. Dropout Workflow.

<h2><center>$\text{Layer}_i \Longrightarrow \text{Activation_Function}_i \Longrightarrow  \text{Dropout} \Longrightarrow \text{Layer}_{i+1}$</center></h2>

### 11.2.2. Dropout on the test set.

Taken from [here](http://cs231n.stanford.edu/slides/2016/winter1516_lecture6.pdf) (slides 58-62)

- We'll want to integrate all the subnetworks generated during dropout, much like during ensembling (with hard-voting, for example).
- One way to do this is to do many forward passes with different dropout masks and average all the predictions.
- But we can in fact do this with a single forward pass! (approximately), leaving all input neurons turned on (no dropout).
- **Question:** How to do this?
- **Answer:**

Assume that dropout is applied at the training stage (that is, $p \in (0, 1)$). Then, as said above, during training the output $x$ of a neuron is multiplied by $0$ or $1$ (giving $0$ or $x$, respectively). If the probability of outputting $x$ for this neuron is $p$, then the expected output of this neuron is $\mathbb{E}[x] = px + (1-p)\cdot 0 = px$.


**<font color=blue>Approach 1: Vanilla Dropout.</font>**

- Here we want $\text{prediction}_{\text{test}}(a) = \text{output}_{\text{drop}}(a)$

On the one hand, if we don't use any dropout we have

$$\text{output}_{\text{all}}(a) = w_0x + w_1y$$

On the other hand,

\begin{align*}
\text{output}_{\text{drop}}(a) &= \mathbb{E}[a]\\
	&= w_0\mathbb{E}[x] + w_1\mathbb{E}[y]\\
	&= w_0(px) + w_1(py)\\
	&= p(w_0x + w_1y)\\
	&= p\cdot \text{output}_{\text{all}}(a)
\end{align*}

This means that to get $\text{prediction}_{\text{test}}(a) = \text{output}_{\text{drop}}(a)$ we can make predictions using all the inputs in the forward pass and then scale the activations by $p$:

$$
\text{prediction}_{\text{test}}(a) =p\cdot \text{output}_{\text{all}}(a)
$$

**<font color=blue>Approach 2: Inverted Dropout.</font>**

- We want $\text{prediction}_{\text{test}}(a) = \text{output}_{\text{all}}(a)$

Here we divide the activations during the dropout (training) stage by $p$, and leave the activations as they are at the test stage. This will give:

$$
\text{prediction}_{\text{test}}(a) = \text{output}_{\text{all}}(a) = \frac{1}{p} \cdot \text{output}_{\text{drop}}(a)
$$

For implementations of vanilla and inverted dropout see [here](https://cs231n.github.io/neural-networks-2/#reg).
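The two approaches can also be sketched in a few lines. Here `p` is the probability of keeping a neuron, as in the derivation above; the function names and the all-ones input are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                    # probability of keeping a neuron

def vanilla_dropout(x, training):
    if training:
        mask = rng.random(x.shape) < p     # boolean mask: True keeps the neuron
        return x * mask
    return x * p                           # scale the activations at test time

def inverted_dropout(x, training):
    if training:
        mask = rng.random(x.shape) < p
        return x * mask / p                # scale the activations at training time
    return x                               # test time: leave the activations as they are

x = np.ones(100_000)
vanilla_train_mean = vanilla_dropout(x, training=True).mean()
inverted_train_mean = inverted_dropout(x, training=True).mean()
```

In both variants the average training-time activation matches the test-time activation; the only difference is on which side the factor $p$ is applied.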

---

#### 11.2.2.1. Dropout on the test set: Variance Increase.


Without dropout (as we do during testing) the variance increases $\implies$ the assumption that train and test are from the same distribution is no longer true $\implies$ the model is no longer reliable.

![alt text](https://i.ibb.co/9y3RnkR/Screen-Shot-2020-11-14-at-15-00-37-1.png)
![alt text](https://i.ibb.co/c875mdn/Screen-Shot-2020-11-14-at-15-01-48.png)


---

## 11.3. Method 3: Data Augmentation.

- This is a technique for generating more instances from the observations we already have in the data set.

![alt text](https://i.ibb.co/wYSBP5k/Screen-Shot-2020-11-11-at-13-45-43.png)
![alt text](https://i.ibb.co/zNxWY0n/Screen-Shot-2020-11-11-at-13-45-50.png)

---

# 12. Weight Normalization.

- **Question:** Why is it not advisable to initialize all the neurons' weights to zero?
- **Answer:**
    - First, doing this means that, for each layer, all the features have value 0.
    - Second, if the neurons start with the same weights, then all the neurons will follow the same gradient and will always end up doing the same thing as one another. (See more [here](https://www.deeplearning.ai/ai-notes/initialization/) and [here](https://stats.stackexchange.com/questions/27112/danger-of-setting-all-initial-weights-to-zero-in-backpropagation)).
    - **<font color=red>Figure out mathematical explanation for this.</font>**
    - As a result, a layer of $10$ neurons ends up working like a layer of one neuron repeated $10$ times.
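The symmetry can be observed numerically on a tiny network. This is only an illustrative sketch: the architecture (one $\tanh$ hidden layer, squared-error loss, no biases), the constant initial value $0.5$ and the random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1 = np.full((3, 4), 0.5)      # every hidden neuron starts with the same weights
W2 = np.full((4, 1), 0.5)

# Forward pass: x -> tanh(x W1) -> (.) W2, with mean squared-error loss.
h = np.tanh(X @ W1)
pred = h @ W2

# Backward pass (manual backpropagation).
d_pred = 2.0 * (pred - y) / len(X)
grad_W2 = h.T @ d_pred
d_h = d_pred @ W2.T
grad_W1 = X.T @ (d_h * (1.0 - h ** 2))

# Each column of grad_W1 belongs to one hidden neuron.
same_gradients = np.allclose(grad_W1, grad_W1[:, :1])
```

Since every hidden neuron receives the same gradient, a gradient step keeps their weights identical, so the neurons never differentiate from one another.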

----

# 13. Advice for Optimization.

- Adam is a great basic choice.
- Even for Adam/RMSProp learning rate matters.
- Use learning rate decay.
- Monitor your model quality.