In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch; import torch.nn as nn; import torch.nn.functional as F

## 7.1 Error Surfaces

Consider a change in the weights from $\bf w$ to $\mathbf{w} + \delta \mathbf{w}$. To first order, such a small change produces a change in the error function $E(\bf w)$ given by the inner product of $\delta \mathbf{w}$ with the gradient of $E$ with respect to $\bf w$. That is:
$$\delta E \simeq \delta \mathbf{w}^\intercal \nabla E(\bf w)$$
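
As a quick numerical sanity check of this first-order relation, here is a minimal sketch. The error function `error` and its gradient `grad_error` are hypothetical choices made purely for illustration; they are not from the text.

```python
import numpy as np

def error(w):
    # Hypothetical smooth error function, chosen only for illustration
    return np.sum(w ** 4) + np.sum(w ** 2)

def grad_error(w):
    # Analytic gradient of the hypothetical error function above
    return 4 * w ** 3 + 2 * w

w = np.array([0.5, -1.0, 2.0])
delta_w = 1e-4 * np.array([1.0, 2.0, -1.0])  # a small perturbation

# Actual change in the error vs. the first-order prediction
actual_change = error(w + delta_w) - error(w)
first_order = delta_w @ grad_error(w)

print(actual_change, first_order)
```

The two numbers should agree up to a term that is second order in $\delta \mathbf{w}$, so the agreement improves as the perturbation shrinks.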

For smooth and convex $E(\bf w)$, minima will occur where: $$ \nabla E(\mathbf{w}) = \mathbf{0}$$
In principle then, we aim to find minima by iteratively moving the parameters (e.g. weights) in the direction of $-\nabla E(\bf w)$.\
 Well, really, a point where the gradient vanishes may be a minimum, a maximum, or a saddle point. And indeed, we are typically concerned with high dimensional spaces and error functions with highly nonlinear dependencies on network parameters, so it will often be the case that many minima, maxima, and saddle points exist. Moreover, for any given minimum we may generally find many equivalent minima within the parameter space.

While we can rarely hope to find the global minimum, we can get very good results by simply finding sufficiently good local minima.

### Local Quadratic Approximation

This section motivates gradient descent by discussing an approximation to the Newton-Raphson optimization algorithm which I've written about [here](https://github.com/BenAF002/data_science/blob/main/Notes/maths_notes/Newton_optimizer.ipynb)

Consider a point in the weight space $\hat{\bf w}$. The Second-Order Taylor Series expansion (recall that Newton-Raphson only uses expansions of second-order) of $E(\bf w)$ around this point is:
$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + \bf (w - \hat{w})^\intercal b + \frac{1}{2}(w - \hat{w})^\intercal H(w - \hat{\mathbf{w}})$$
Where $\bf b$ is defined as the gradient of $E$ w.r.t. $\bf w$ evaluated at $\hat{\bf w}$ and $\bf H$ is the Hessian matrix:
$$\mathbf{b} \equiv \nabla E|_{\mathbf{w} = \hat{\mathbf{w}}} \\  \\ \mathbf{H}(\hat{\mathbf{w}}) = \nabla \nabla E(\bf w)|_{w=\hat{w}}$$

The local approximation of the gradient from the Taylor Series expansion is:
$$\nabla E(\bf w) \simeq b + H(w - \hat{w})$$
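
Setting this approximate gradient to zero and solving for $\bf w$ gives the Newton-Raphson step $\mathbf{w} = \hat{\mathbf{w}} - \mathbf{H}^{-1}\mathbf{b}$. The sketch below tries this on a hypothetical, exactly quadratic error $E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\intercal \mathbf{A} \mathbf{w} - \mathbf{c}^\intercal \mathbf{w}$ (the matrix `A` and vector `c` are made up for illustration), where the expansion is exact and so a single step should land on the minimum.

```python
import numpy as np

# Hypothetical quadratic error: E(w) = 0.5 * w^T A w - c^T w
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric, positive definite
c = np.array([1.0, -1.0])

def grad(w):
    # Gradient of the quadratic error above
    return A @ w - c

w_hat = np.array([5.0, -3.0])  # arbitrary expansion point
b = grad(w_hat)                # gradient evaluated at w_hat
H = A                          # Hessian of a quadratic is constant

# Solve b + H (w - w_hat) = 0 for w, without forming H^{-1} explicitly
w_star = w_hat - np.linalg.solve(H, b)

print(w_star, grad(w_star))
```

For a non-quadratic error the step is only as good as the local approximation, which is part of why it gets iterated.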

The remainder of the discussion is a little convoluted and, I feel, unnecessary. What's important to glean from it is:
> A necessary and sufficient condition for $\bf w^*$ to be a local minimum is that $\nabla E(\mathbf{w}^*) = \mathbf{0}$ *and* the Hessian $\bf H$ evaluated at $\bf w^*$ is positive definite (i.e. $\bf x^\intercal Hx > 0$ for all $\bf x \neq 0$, or equivalently, all of the eigenvalues of $\bf H$ are positive)

**Aside**:\
The fact that we may determine positive definiteness from the eigenvalues of the Hessian is deducible as follows:
- The Hessian $\bf H$ is a real, symmetric matrix; thus it always has real eigenvalues and a complete set of eigenvectors $\{\mathbf{u}_i\}$, which can be chosen to be orthonormal
- Because the eigenvectors of the Hessian form a complete set, any arbitrary vector $\bf v$ in the weight space can be written as:
$$\mathbf{v} = \sum_i c_i \mathbf{u}_i$$
- $\bf H$ is positive definite if and only if $\bf v^\intercal H v > 0$ for all $\bf v \neq 0$
    - Since $\mathbf{H}\mathbf{u}_i = \lambda_i \mathbf{u}_i$ and $\mathbf{u}_i^\intercal \mathbf{u}_j = \delta_{ij}$, we have $\mathbf{v}^\intercal \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i$
    - This sum is positive for every choice of coefficients $c_i$ that are not all zero if and only if every $\lambda_i > 0$

So, the Hessian is positive definite if and only if all of its eigenvalues $\lambda_i$ are positive.
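
The eigenvalue test is easy to try numerically. Below is a minimal sketch using `np.linalg.eigvalsh` and `np.linalg.eigh`, which are intended for symmetric matrices; the helper `is_positive_definite`, its tolerance, and the two example matrices are my own illustrative choices.

```python
import numpy as np

def is_positive_definite(H, tol=1e-12):
    # A symmetric matrix is positive definite iff all eigenvalues are > 0
    eigenvalues = np.linalg.eigvalsh(H)
    return bool(np.all(eigenvalues > tol))

H_min = np.array([[2.0, 0.5],
                  [0.5, 1.0]])     # both eigenvalues positive
H_saddle = np.array([[1.0, 0.0],
                     [0.0, -1.0]])  # one positive, one negative eigenvalue

print(is_positive_definite(H_min), is_positive_definite(H_saddle))

# Check v^T H v = sum_i c_i^2 lambda_i with orthonormal eigenvectors
eigenvalues, U = np.linalg.eigh(H_min)  # columns of U are the u_i
v = np.array([0.3, -1.2])
c = U.T @ v                             # coefficients c_i = u_i^T v
quad_direct = v @ H_min @ v
quad_eigen = np.sum(c ** 2 * eigenvalues)
print(quad_direct, quad_eigen)
```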

## 7.2 Gradient Descent Optimization

The basic approach:
$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau - 1)} + \Delta\mathbf{w}^{(\tau - 1)}$$

There is a great deal of nuance involved in both the selection of the weight vector update $\Delta\mathbf{w}^{(\tau)}$ and the weight initializations $\mathbf{w}^{(0)}$, as both of these things can have a very large impact on the solution found.

### Batch Gradient Descent

The simplest approach to updating with gradient information is to choose the weight update such that there is a small step in the direction of the negative gradient (of the error function w.r.t. the parameters).
$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau - 1)} - \eta \nabla E(\mathbf{w}^{(\tau - 1)})$$
Where $\eta > 0$ is a tunable hyperparameter called the *learning rate*.
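
Here is a minimal sketch of this update rule on a made-up least-squares problem. The synthetic data, the choice of a mean (rather than summed) squared error, and the values of `eta` and the step count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([2.0, -3.0])
t = X @ w_true + 0.1 * rng.normal(size=N)

def grad_error(w):
    # Gradient of E(w) = 0.5 * mean((X w - t)^2), using the full dataset
    return X.T @ (X @ w - t) / N

eta = 0.1          # learning rate
w = np.zeros(2)    # initial weights w^(0)
for tau in range(500):
    # w^(tau) = w^(tau-1) - eta * grad E(w^(tau-1))
    w = w - eta * grad_error(w)

print(w)
```

I average over the data points so that a sensible `eta` does not depend on the dataset size; with a summed error the same learning rate would need to shrink as $N$ grows.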

Typically, we define the error function over the entire training set (e.g., often we sum over $N$). So, this approach to iterative parameter updating requires evaluating with the entire training dataset. Techniques that use the whole data set at once are called ***Batch Methods***.\
(This is a little annoying bc I usually think of "batches" as minibatches and not the full training dataset)...

### Stochastic Gradient Descent

Using the entire training dataset at once can be very inefficient when the training dataset is large. So, we can improve efficiency by splitting up the dataset into *minibatches* and training over those instead. At the most granular level, we could have $N$ minibatches, each of size 1, such that each individual observation is treated as a minibatch. Then,
$$E(\mathbf{w}) = \sum_{n=1}^NE_n(\mathbf{w}) \\ \\ \mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau -1)} - \eta \nabla E_n(\mathbf{w}^{(\tau - 1)})$$

This is the crux of SGD. An ***Epoch*** is then a complete pass through the training data.
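
A minimal sketch of single-observation SGD with epochs, again on a made-up least-squares problem. The data, the per-example error $E_n = \frac{1}{2}(\mathbf{x}_n^\intercal \mathbf{w} - t_n)^2$, the reshuffling at each epoch, and the values of `eta` and `n_epochs` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
N = 500
X = rng.normal(size=(N, 2))
w_true = np.array([1.5, -0.5])
t = X @ w_true + 0.05 * rng.normal(size=N)

def grad_En(w, n):
    # Gradient of the single-example error E_n = 0.5 * (x_n^T w - t_n)^2
    return (X[n] @ w - t[n]) * X[n]

eta = 0.01
w = np.zeros(2)
n_epochs = 20
for epoch in range(n_epochs):
    # One epoch = one complete pass through the data, in a fresh random order
    for n in rng.permutation(N):
        w = w - eta * grad_En(w, n)

print(w)
```

Each update touches only one observation, so an epoch here performs $N$ cheap updates rather than one expensive one.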

There are other benefits to SGD besides reducing the computational cost of each iteration. One is that it reduces the risk of getting stuck on a poor local minimum or saddle point, because stationary points w.r.t. the entire training dataset will generally not be stationary points w.r.t. smaller subsets of the training data. Another is that it can help improve the speed of parameter optimization. Consider the gradient of the error function w.r.t. the entire training dataset. The partial derivative with respect to each parameter (e.g. each weight in $\bf w$) represents a sort of average relationship between that parameter and the error function over every input value. So, different relationships that may be observed at different points in the input space become muted or offset one another in the gradient. Thus, the update at each step does not change the parameter value by as much as it otherwise could when using smaller minibatches (I think)...

### Mini-Batches

This second benefit that I noted can prove to be a downside when we train with very few observations at each iteration. The gradient of the error function computed from a single data point is a very noisy estimate of the gradient computed on the full data set, and too much noise can cause parameters to vary too much. An intermediate approach is to use *mini-batches* of size greater than 1. 
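
To get a feel for how noisy these estimates are, the sketch below measures how far minibatch gradients land from the full-data gradient at a fixed $\bf w$, for a few batch sizes. The data, the helper `minibatch_grad_spread`, and the batch sizes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
N = 1000
X = rng.normal(size=(N, 2))
t = X @ np.array([1.0, -2.0]) + 0.5 * rng.normal(size=N)
w = np.zeros(2)  # fixed point at which gradients are compared

# Per-example gradients of E_n = 0.5 * (x_n^T w - t_n)^2, shape (N, 2)
per_example_grads = (X @ w - t)[:, None] * X
full_grad = per_example_grads.mean(axis=0)

def minibatch_grad_spread(batch_size, n_trials=500):
    # Average distance between a minibatch gradient and the full gradient
    errs = []
    for _ in range(n_trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        g = per_example_grads[idx].mean(axis=0)
        errs.append(np.linalg.norm(g - full_grad))
    return float(np.mean(errs))

spread = {B: minibatch_grad_spread(B) for B in (1, 10, 100)}
print(spread)
```

The spread should shrink as the batch size grows, which is the trade-off between cheap, noisy steps and expensive, accurate ones.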

### Parameter Initialization

The solution obtained through gradient descent is heavily dependent upon parameter initialization.

One key consideration is the concept of *symmetry breaking*. We don't want all parameters initialized to the same constant (e.g. all 0), because then the hidden units will all compute the same output values, receive the same updates, and be completely redundant. Similarly, we don't want systematic trends in the parameter initializations for the same reason - i.e. to avoid redundant parameters that *arbitrarily* produce similar outputs and therefore *arbitrarily* move together when updated through gradient descent.

So, we want to initialize parameters randomly.
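
The symmetry problem can be seen directly in the gradients. The sketch below hand-codes backprop for a tiny hypothetical network (one tanh hidden layer, linear output, squared error; all sizes and values are made up) and compares a constant initialization against a random one.

```python
import numpy as np

def hidden_weight_grad(W1, w2, x, t):
    # Network: y = w2 . tanh(W1 x), error E = 0.5 * (y - t)^2
    h = np.tanh(W1 @ x)                            # hidden activations, shape (H,)
    y = w2 @ h                                     # scalar output
    delta_out = y - t                              # dE/dy
    delta_hidden = delta_out * w2 * (1 - h ** 2)   # dE/d(pre-activation)
    return np.outer(delta_hidden, x)               # dE/dW1, one row per hidden unit

x = np.array([0.5, -1.0, 2.0])
t = 1.0

# Constant initialization: every hidden unit starts out identical
W1_const = np.full((4, 3), 0.1)
w2_const = np.full(4, 0.1)
g_const = hidden_weight_grad(W1_const, w2_const, x, t)

# Random initialization breaks the symmetry
rng = np.random.default_rng(0)
W1_rand = rng.normal(scale=0.1, size=(4, 3))
w2_rand = rng.normal(scale=0.1, size=4)
g_rand = hidden_weight_grad(W1_rand, w2_rand, x, t)

print(g_const)
print(g_rand)
```

With the constant initialization every row of the gradient is the same, so the hidden units stay identical after any number of updates; with the random one the rows differ.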

Additionally, if we are using ReLU activations, then we should be careful to ensure that most initial pre-activations (at least in the early layers of the network) are positive so that we don't prematurely kill neurons. One way to do this without systematically biasing the weight initializations is to initialize the bias parameters as small positive values.
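
A rough sketch of the effect of a small positive bias on first-layer ReLU units. The layer sizes, the standard normal inputs, the weight scale of 0.1, and the bias value of 0.1 are all assumptions for illustration, not recommendations from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first layer: D inputs, H hidden units, N input samples
D, H, N = 10, 256, 1000
X = rng.normal(size=(N, D))
W = rng.normal(scale=0.1, size=(D, H))  # zero-mean random weights

def active_fraction(bias):
    # Fraction of (sample, unit) pairs with a positive pre-activation,
    # i.e. where a ReLU would pass a nonzero gradient
    pre_activations = X @ W + bias
    return float(np.mean(pre_activations > 0))

frac_zero_bias = active_fraction(0.0)
frac_small_bias = active_fraction(0.1)
print(frac_zero_bias, frac_small_bias)
```

With zero bias roughly half of the pre-activations are positive; the small positive bias shifts that fraction upward without changing the weight distribution itself.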

## 7.3 Convergence