In [2]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

## 7.1 Error Surfaces

Consider a change in the weights from $\bf w$ to $\mathbf{w} + \delta \mathbf{w}$. To first order, such a change produces a change in the error function $E(\bf w)$ equal to the inner product of $\delta \mathbf{w}$ with the gradient of $E$ with respect to $\bf w$. That is:
$$\delta E \simeq \delta \mathbf{w}^\intercal \nabla E(\bf w)$$
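
A quick numerical check of this first-order relation may help. The sketch below assumes a toy error function `E` and its analytic gradient `grad_E`; both are hypothetical choices made only for illustration and are not taken from the text.

```python
import numpy as np

def E(w):
    # Hypothetical smooth error function, chosen only for illustration
    return np.sum(w ** 4) + np.sum(w ** 2)

def grad_E(w):
    # Analytic gradient of the toy E above
    return 4 * w ** 3 + 2 * w

w = np.array([0.5, -1.0, 2.0])
delta_w = 1e-4 * np.array([1.0, -2.0, 0.5])  # a small weight change

actual_change = E(w + delta_w) - E(w)   # true delta E
first_order = delta_w @ grad_E(w)       # delta w^T grad E(w)

print(actual_change, first_order)
```

The two printed values should agree closely, and the gap shrinks quadratically as `delta_w` is scaled down, since the neglected term is second order in $\delta \mathbf{w}$.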

For smooth and convex $E(\bf w)$, minima will occur where: $$ \nabla E(\mathbf{w}) = \mathbf{0}$$
In principle then, we aim to find minima by iteratively moving the parameters (e.g. weights) in the direction of $-\nabla E(\bf w)$.\
Well, really, a point where the gradient vanishes may be a minimum, a maximum, or a saddle point. And indeed, we are typically concerned with high dimensional spaces and error functions with highly nonlinear dependencies on network parameters, so it will often be the case that many minima, maxima, and saddle points exist. Moreover, for any given minimum we may generally find many equivalent minima within the parameter space.
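
To make this concrete, here is a minimal sketch of iterating in the direction of $-\nabla E$ on a one-dimensional double-well function. The function `E`, the helper `descend`, the learning rate, and the step count are all assumptions picked for illustration.

```python
def E(w):
    # Hypothetical double-well error with equivalent minima at w = -1 and w = +1
    return (w ** 2 - 1.0) ** 2

def grad_E(w):
    # Derivative of the toy E above
    return 4.0 * w * (w ** 2 - 1.0)

def descend(w, lr=0.05, steps=200):
    # Repeatedly step against the gradient
    for _ in range(steps):
        w = w - lr * grad_E(w)
    return w

w_left = descend(-0.5)   # starts left of the origin
w_right = descend(0.5)   # starts right of the origin
w_stuck = descend(0.0)   # starts exactly at the local maximum

print(w_left, w_right, w_stuck)
```

The two nonzero starting points end up in different but equally good minima, while the start at $w = 0$ never moves: the gradient vanishes there even though that point is a maximum.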

While we can rarely hope to find the global minimum, we can get very good results by simply finding a sufficiently good local minimum.

### Local Quadratic Approximation

This section motivates gradient descent by discussing an approximation to the Newton-Raphson optimization algorithm, which I've written about [here](https://github.com/BenAF002/data_science/blob/main/Notes/maths_notes/Newton_optimizer.ipynb).

Consider a point in the weight space $\hat{\bf w}$. The second-order Taylor series expansion of $E(\bf w)$ around this point (recall that Newton-Raphson uses only a second-order expansion) is:
$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^\intercal \mathbf{b} + \frac{1}{2}(\mathbf{w} - \hat{\mathbf{w}})^\intercal \mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})$$
Where $\bf b$ is defined as the gradient of $E$ w.r.t. $\bf w$ evaluated at $\hat{\bf w}$ and $\bf H$ is the Hessian matrix:
$$\mathbf{b} \equiv \nabla E|_{\mathbf{w} = \hat{\mathbf{w}}}, \qquad \mathbf{H}(\hat{\mathbf{w}}) = \nabla \nabla E(\mathbf{w})|_{\mathbf{w}=\hat{\mathbf{w}}}$$

The local approximation of the gradient from the Taylor Series expansion is:
$$\nabla E(\mathbf{w}) \simeq \mathbf{b} + \mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})$$
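
As a sanity check on the local quadratic approximation, the sketch below computes $\bf b$ and $\bf H$ with PyTorch autograd and compares the approximate error and gradient to the true values at a nearby point. The error function `E`, the expansion point `w_hat`, and the offset are hypothetical choices for illustration.

```python
import torch
from torch.autograd.functional import hessian

def E(w):
    # Hypothetical non-quadratic error function, chosen only for illustration
    return torch.sum(w ** 4) + w[0] * w[1]

w_hat = torch.tensor([1.0, -0.5], dtype=torch.float64)

# b: gradient of E evaluated at w_hat
w_req = w_hat.clone().requires_grad_(True)
b, = torch.autograd.grad(E(w_req), w_req)

# H: Hessian of E evaluated at w_hat
H = hessian(E, w_hat)

# A nearby point at which to test the approximation
w = w_hat + torch.tensor([0.01, -0.02], dtype=torch.float64)
d = w - w_hat

E_quad = E(w_hat) + d @ b + 0.5 * d @ H @ d   # second-order Taylor estimate of E(w)
grad_lin = b + H @ d                          # local approximation of the gradient

# True gradient at w, for comparison
w_req2 = w.clone().requires_grad_(True)
grad_true, = torch.autograd.grad(E(w_req2), w_req2)

print(E(w).item(), E_quad.item())
print(grad_true, grad_lin)
```

Both approximations are only local: the discrepancy grows as `w` moves away from `w_hat`.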

The remainder of the discussion is a little convoluted and, I feel, unnecessary. What's important to glean from it is:
> A necessary and sufficient condition for $\mathbf{w}^*$ to be a local minimum is that $\nabla E(\mathbf{w}^*) = \mathbf{0}$ *and* the Hessian $\bf H$ evaluated at $\mathbf{w}^*$ is positive definite (i.e. $\mathbf{x}^\intercal \mathbf{H} \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$ or, equivalently, all of the eigenvalues of $\bf H$ are positive)
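
The eigenvalue test can be sketched in a few lines. The helper `classify` below is a hypothetical name, and the Hessians are those of simple toy functions at their stationary point $\mathbf{w} = \mathbf{0}$; none of this comes from the text.

```python
import numpy as np

def classify(H):
    # Classify a stationary point from the eigenvalues of its (symmetric) Hessian
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "local minimum"      # positive definite
    if np.all(eig < 0):
        return "local maximum"      # negative definite
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"       # indefinite
    return "inconclusive"           # some zero eigenvalues, no sign change

H_bowl = np.array([[2.0, 0.0], [0.0, 2.0]])      # E = w1^2 + w2^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # E = w1^2 - w2^2
H_dome = np.array([[-2.0, 0.0], [0.0, -2.0]])    # E = -(w1^2 + w2^2)

for H in (H_bowl, H_saddle, H_dome):
    print(classify(H))
```

All three toy functions have a vanishing gradient at the origin, so only the Hessian distinguishes them.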

**Aside**:\
The fact that we may determine positive definiteness from the eigenvalues of the Hessian is deducible as follows:
- The Hessian $\bf H$ is a square, symmetric matrix; thus it always has real eigenvalues and a complete set of eigenvectors $\{\mathbf{u}_i\}$, which can be chosen to be orthonormal
- Because the eigenvectors form a complete set, any vector $\bf v$ in the weight space may be written as:
$$\mathbf{v} = \sum_i c_i \mathbf{u}_i$$
- $\bf H$ is positive definite if and only if $\mathbf{v}^\intercal \mathbf{H} \mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$
    - Substituting the expansion and using $\mathbf{H}\mathbf{u}_i = \lambda_i \mathbf{u}_i$ with orthonormal $\mathbf{u}_i$ gives $\mathbf{v}^\intercal \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i$, which must be positive for every choice of coefficients $c_i$ that are not all zero

So the Hessian is positive definite if and only if all of its eigenvalues $\lambda_i$ are positive.
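
The identity $\mathbf{v}^\intercal \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i$ is easy to confirm numerically. The sketch below assumes a randomly generated symmetric matrix standing in for a Hessian; the construction `A @ A.T + 4 * I` is just a convenient way to get a positive definite example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a symmetric positive definite matrix to stand in for a Hessian (assumption)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of U)
lam, U = np.linalg.eigh(H)

v = rng.normal(size=4)   # an arbitrary nonzero vector
c = U.T @ v              # coefficients c_i of v in the eigenvector basis

quad_form = v @ H @ v              # v^T H v computed directly
eig_sum = np.sum(c ** 2 * lam)     # sum_i c_i^2 lambda_i

print(quad_form, eig_sum)
```

Since every `c ** 2` term is non-negative, the sign of the quadratic form is controlled entirely by the signs of the eigenvalues.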

## 7.2 Gradient Descent Optimization