### Descent methods

The notebook about approximate line search goes into some of the basics, and might be useful to read before expanding on this more.

A descent method iterates, taking steps that generally get shorter, to minimise (or maximise) an objective function over its decision variables. E.g. in $ \mathbb{R}^2 $ you will have one independent and one dependent variable, so this looks like a standard function $ y = f(x) $. In $ \mathbb{R}^3 $ you will have two independent variables.

Then the descent method, given 1 independent variable, produces a sequence of points that looks like this:

$ \{ x_n \} = x_1, x_2, x_3, \dots $

Then the next point for descent should be:

$ f(x_{n+1}) < f(x_n) $. The strict inequality means that the descent method will never choose a new point that is at the same level as the point it currently has.
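As a minimal sketch of the descent condition, the loop below only accepts a new point when $ f(x_{n+1}) < f(x_n) $ holds. The objective `f`, its derivative `df`, the fixed `step` and the starting point are all assumptions made up for illustration, not something from the course material.

```python
def f(x):
    # Hypothetical objective for illustration: minimum at x = 2.
    return (x - 2.0) ** 2


def df(x):
    # Derivative of the hypothetical objective.
    return 2.0 * (x - 2.0)


def descend(x, step=0.25, iterations=10):
    """Return the iterates x_1, x_2, ... of a fixed-step descent."""
    points = [x]
    for _ in range(iterations):
        candidate = x - step * df(x)
        # Enforce the strict descent condition f(x_{n+1}) < f(x_n).
        if not f(candidate) < f(x):
            break
        x = candidate
        points.append(x)
    return points


points = descend(5.0)
values = [f(p) for p in points]
```

Each entry of `values` is strictly smaller than the one before it, which is exactly the condition above.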

#### Level sets

This is similar to the idea of a level curve on a map. It is the set of points (pairs, triples, etc.) that all return the same function value.

It's equivalent to cutting through the graph of the function with a flat plane at a constant function value, which gives a set with one dimension fewer than the graph.

If you were to look down onto the level set, it would look like this:

![level set](Screenshot_2023-08-07_17-21-11.png)
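A small sketch of the idea, assuming the illustrative function $ f(x, y) = x^2 + y^2 $ (my choice, not from the notes): its level set at value $ c $ is a circle of radius $ \sqrt{c} $, so every point generated on that circle should return the same function value.

```python
import math


def f(x, y):
    # Hypothetical function for illustration; its level sets are circles.
    return x ** 2 + y ** 2


def level_set_points(c, count=8):
    """Return `count` points on the level set f(x, y) = c."""
    r = math.sqrt(c)
    return [
        (r * math.cos(2 * math.pi * k / count),
         r * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


points = level_set_points(4.0)
values = [f(x, y) for x, y in points]
```

All entries of `values` agree with $ c = 4 $ up to floating point rounding.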

#### Choosing the gradient vector

The gradient vector is the direction that maximises the directional derivative.

Suppose $ \hat v $ is the direction vector, with $ ||\hat v|| = 1 $.

The directional derivative (this is covered in /vectorCalculus) is:

$ \nabla f(\bar x, \bar y) \cdot \hat v = || \nabla f(\bar x, \bar y) ||\cdot ||\hat v|| \cdot \cos \theta $

And since $ ||\hat v|| = 1 $, we can maximise the directional derivative by choosing $ \cos \theta = 1 $, and this washes out as:

$  \nabla f(\bar x, \bar y) \cdot \hat v = || \nabla f(\bar x, \bar y) ||\cdot ||\hat v|| \cdot \cos \theta = || \nabla f(\bar x, \bar y) || \cdot 1 \cdot 1 $

So it's just this, where $ \theta = 0 $:

$  \nabla f(\bar x, \bar y) \cdot \hat v = || \nabla f(\bar x, \bar y) || $

Since $ \theta = 0 $ means $ \hat v $ points the same way as the gradient, the maximising direction is the normalised gradient:

$ \hat v = \large \frac{\nabla f(\bar x, \bar y)}{ || \nabla f(\bar x, \bar y) || } $
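The result can be checked numerically with a rough sketch: sweep unit vectors around the circle and see which one gives the largest directional derivative. The function $ f(x, y) = x^2 + 3y^2 $, the point $ (1, 1) $ and the sweep resolution are assumptions chosen only for illustration.

```python
import math


def grad_f(x, y):
    # Gradient of the hypothetical function f(x, y) = x^2 + 3y^2.
    return (2.0 * x, 6.0 * y)


def directional_derivative(g, theta):
    # Dot product of the gradient with the unit vector (cos theta, sin theta).
    return g[0] * math.cos(theta) + g[1] * math.sin(theta)


g = grad_f(1.0, 1.0)
norm_g = math.hypot(g[0], g[1])

# Try 3600 evenly spaced unit directions and keep the best one.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_theta = max(angles, key=lambda t: directional_derivative(g, t))
best_v = (math.cos(best_theta), math.sin(best_theta))

# The direction predicted by the formula: the normalised gradient.
unit_grad = (g[0] / norm_g, g[1] / norm_g)
```

Up to the resolution of the sweep, `best_v` matches `unit_grad`, and the best directional derivative matches `norm_g`.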

#### Finding the gradient of a level curve

We can do this by parameterising the level curve as $ (x(t), y(t)) $, so that $ f(x, y) \to f(x(t), y(t))  $

Then you can use the chain rule with partial derivatives:

![partials in chain](Screenshot_2023-08-07_22-23-23.png)

![dot prod form](Screenshot_2023-08-07_22-25-43.png)

The function is constant along a level curve, so its derivative with respect to $ t $ is 0. Since the dot product is equal to 0, we know the two parts of the RHS are orthogonal.

This part: $ \frac{dx(\bar t)}{dt}i + \frac{dy(\bar t)}{dt}j $ is tangent to the level curve, so that means the gradient is orthogonal to the curve.
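Here is a quick numerical check of that orthogonality, as a sketch under assumptions: the function $ f(x, y) = x^2 + 4y^2 $ and the parameterisation of its level curve (an ellipse) are my own illustrative choices.

```python
import math


def grad_f(x, y):
    # Gradient of the hypothetical function f(x, y) = x^2 + 4y^2.
    return (2.0 * x, 8.0 * y)


def curve(t, c=4.0):
    # Parameterisation (x(t), y(t)) of the level curve f(x, y) = c.
    r = math.sqrt(c)
    return (r * math.cos(t), 0.5 * r * math.sin(t))


def tangent(t, c=4.0):
    # Derivative (dx/dt, dy/dt) of the parameterisation above.
    r = math.sqrt(c)
    return (-r * math.sin(t), 0.5 * r * math.cos(t))


dots = []
for k in range(12):
    t = 2 * math.pi * k / 12
    x, y = curve(t)
    gx, gy = grad_f(x, y)
    tx, ty = tangent(t)
    # Dot product of the gradient with the tangent at this point.
    dots.append(gx * tx + gy * ty)
```

Every entry of `dots` is zero up to rounding, so the gradient is orthogonal to the tangent at each sampled point.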

If we're doing a minimisation problem, then at the point $ x_n $ the direction of descent is:

$ d_n = - \nabla f(x_n) $

Where $ n $ is the index of steps taken.

Recall that the step length, $ t $, is getting smaller. It is characterised by the Armijo-Goldstein condition. All together it looks like this:

![descent](Screenshot_2023-08-07_22-37-56.png)

The variable $ t $ is the exact step size to get to the next level curve.
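Putting the pieces together, the following is a minimal sketch of steepest descent with a backtracking step length. It uses only the Armijo sufficient-decrease test, which is one common way of choosing $ t $ and not necessarily the exact form used in the course. The objective, the constants `beta` and `c`, the tolerance and the starting point are all assumptions for illustration.

```python
import math


def f(p):
    # Hypothetical objective f(x, y) = x^2 + 4y^2, minimum at the origin.
    x, y = p
    return x ** 2 + 4.0 * y ** 2


def grad_f(p):
    x, y = p
    return (2.0 * x, 8.0 * y)


def backtracking_step(p, d, g, t=1.0, beta=0.5, c=1e-4):
    """Shrink t until the Armijo sufficient-decrease test passes."""
    slope = g[0] * d[0] + g[1] * d[1]  # negative for a descent direction
    while f((p[0] + t * d[0], p[1] + t * d[1])) > f(p) + c * t * slope:
        t *= beta
    return t


def steepest_descent(p, tol=1e-8, max_iter=500):
    """Iterate x_{n+1} = x_n + t * d_n with d_n = -grad f(x_n)."""
    for n in range(max_iter):
        g = grad_f(p)
        if math.hypot(g[0], g[1]) < tol:
            return p, n
        d = (-g[0], -g[1])
        t = backtracking_step(p, d, g)
        p = (p[0] + t * d[0], p[1] + t * d[1])
    return p, max_iter


p_min, iterations = steepest_descent((3.0, 1.0))
```

Starting each search from $ t = 1 $ and halving is a simple design choice; it guarantees a decrease at every step but does not try to find the exact best step size.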

#### Descent direction matrix

I have skipped over a lot of theory here to get assessments done, but we can find the direction:

$ d = -\left(\nabla^2 f(x)\right)^{-1} \cdot \nabla f(x) $

Where $ \nabla^2 f(x) $ is the Hessian of the function and it is positive definite.

Then:

$ \nabla f(x) \cdot d = -\nabla f(x)^T \cdot \left(\nabla^2 f(x)\right)^{-1} \cdot \nabla f(x) < 0 $

The inverse of a positive definite matrix is also positive definite, so this is negative whenever $ \nabla f(x) \neq 0 $, which makes $ d $ a descent direction.

Suppose instead of the inverse Hessian we choose another positive definite matrix, called $ D_k $, so that the direction is $ d_k = -D_k \nabla f(x_k) $.

Then if:

$ D_k = I $, the identity matrix, then you get the Steepest Descent method.

If:

$ D_k = \nabla^2 f(x_k)^{-1} $ then this is Newton's Method.

![example with newton's method](Screenshot_2023-08-13_22-52-12.png)

![Screenshot_2023-08-13_22-59-54.png](Screenshot_2023-08-13_22-59-54.png)

![local min reached](Screenshot_2023-08-13_23-02-08.png)
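The two choices of $ D_k $ described above can be compared with a short sketch. The quadratic $ f(x, y) = x^2 + xy + 2y^2 $, its constant Hessian, the starting point and the helper names are assumptions picked for illustration. For a quadratic like this, a full Newton step lands on the minimiser in one go.

```python
def grad_f(p):
    # Gradient of the hypothetical function f(x, y) = x^2 + x*y + 2y^2.
    x, y = p
    return (2.0 * x + y, x + 4.0 * y)


# Hessian of the function above (constant, and positive definite).
HESSIAN = ((2.0, 1.0), (1.0, 4.0))
IDENTITY = ((1.0, 0.0), (0.0, 1.0))


def inverse_2x2(m):
    """Invert a 2x2 matrix given as a tuple of rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def direction(D, g):
    """Return d = -D g for a 2x2 matrix D and gradient g."""
    return (
        -(D[0][0] * g[0] + D[0][1] * g[1]),
        -(D[1][0] * g[0] + D[1][1] * g[1]),
    )


p = (3.0, 1.0)
g = grad_f(p)

d_steepest = direction(IDENTITY, g)            # D_k = I
d_newton = direction(inverse_2x2(HESSIAN), g)  # D_k = inverse Hessian

# A full Newton step (t = 1) from p.
p_newton = (p[0] + d_newton[0], p[1] + d_newton[1])

# Gradient dotted with the Newton direction; negative means descent.
slope_newton = g[0] * d_newton[0] + g[1] * d_newton[1]
```

Here `d_steepest` is just the negative gradient, while `p_newton` sits at the origin (up to rounding), which is the minimiser of this particular quadratic.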