# Week 2

## Variables, constants &

This section is about how to think of partial differentiation as just a simple extension of the single-variable method that we derived.

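For example, we have the expression for the mass $m$ of a hollow cylindrical can with radius $r$, height $h$, wall thickness $t$ and density $\rho$ (the surface area of the two end caps and the side wall, multiplied by $t\rho$):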

$$m = 2\pi r^2t\rho + 2\pi rht\rho$$

So:

$$\frac{\partial{m}}{\partial{h}} = 2\pi rt\rho$$

$$\frac{\partial{m}}{\partial{r}} = 4\pi rt\rho + 2\pi ht\rho$$

$$\frac{\partial{m}}{\partial{t}} = 2\pi r^2\rho + 2\pi rh\rho$$

$$\frac{\partial{m}}{\partial{\rho}} = 2\pi r^2t + 2\pi rht$$

Partial differentiation is essentially just taking a multidimensional problem and pretending that it's a standard 1D problem, considering each variable separately while holding the others constant.
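To make the "pretend it's 1D" idea concrete, here is a minimal sketch that varies $r$ alone in the expression for $m$ above and compares the resulting slope with the analytic $\frac{\partial m}{\partial r}$. The sample values and the step size are arbitrary assumptions chosen only for illustration.

```python
import math

def m(r, h, t, rho):
    """The quantity m from the expression above."""
    return 2 * math.pi * r**2 * t * rho + 2 * math.pi * r * h * t * rho

def dm_dr(r, h, t, rho):
    """Analytic partial derivative with respect to r (h, t, rho held constant)."""
    return 4 * math.pi * r * t * rho + 2 * math.pi * h * t * rho

# Hypothetical sample values, chosen only for illustration.
r, h, t, rho = 0.03, 0.12, 0.001, 2700.0
step = 1e-6

# Vary r alone and leave everything else fixed, exactly as in a 1D problem.
numeric = (m(r + step, h, t, rho) - m(r, h, t, rho)) / step
analytic = dm_dr(r, h, t, rho)
print(numeric, analytic)
```

The two printed numbers should agree closely; the small remaining gap comes from the finite step size.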

## Differentiate with respect to anything

Consider a function:

$$f(x,y,z)=\sin(x)e^{yz^2}$$

We're now just going to work through and find the derivatives with respect to each of these three variables.

$$\frac{\partial{f}}{\partial{x}} = \cos(x)e^{yz^2}$$

$$\frac{\partial{f}}{\partial{y}} = \sin(x)e^{yz^2}z^2$$

$$\frac{\partial{f}}{\partial{z}} = \sin(x)e^{yz^2}2yz$$

**Total Derivative**

Imagine that the variables x, y, and z were actually all themselves functions of a single other parameter t, where $x = t-1$, $y = t^2$, $z = \frac{1}{t}$.

And what we're looking for is the derivative of f with respect to t.

So:

$$f(t) = \sin(t-1)e^{t^2\left(\frac{1}{t}\right)^2}=\sin(t-1)e$$

$$\frac{df(t)}{dt} = \cos(t-1)e$$

However, in a more complicated scenario with many variables, the expression we need to differentiate might become unmanageably complex, and perhaps we won't have a nice analytical expression at all. The alternative approach is to once again use the logic of the chain rule to solve this problem.

So, the chain rule in this function would look like:

$$\frac{df(x,y,z)}{dt} = \frac{\partial{f}}{\partial{x}}\frac{dx}{dt} + \frac{\partial{f}}{\partial{y}}\frac{dy}{dt} + \frac{\partial{f}}{\partial{z}}\frac{dz}{dt}$$
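As a sanity check, the chain rule should reproduce the result obtained by direct substitution. With $x = t-1$, $y = t^2$, $z = \frac{1}{t}$, the $y$ term contributes $+\frac{2}{t}\sin(t-1)e$ and the $z$ term contributes $-\frac{2}{t}\sin(t-1)e$, so they cancel and only $\cos(t-1)e$ survives. The sketch below checks this numerically; the function names and the sample value of `t` are assumptions made for illustration.

```python
import math

def partials(x, y, z):
    """Partial derivatives of f(x, y, z) = sin(x) * exp(y * z**2)."""
    e = math.exp(y * z**2)
    return (math.cos(x) * e,
            math.sin(x) * e * z**2,
            math.sin(x) * e * 2 * y * z)

def total_derivative(t):
    """df/dt via the chain rule with x = t - 1, y = t**2, z = 1 / t."""
    x, y, z = t - 1, t**2, 1 / t
    df_dx, df_dy, df_dz = partials(x, y, z)
    dx_dt, dy_dt, dz_dt = 1.0, 2 * t, -1 / t**2
    return df_dx * dx_dt + df_dy * dy_dt + df_dz * dz_dt

t = 2.0  # arbitrary sample value; t must be nonzero because z = 1 / t
chain = total_derivative(t)
direct = math.cos(t - 1) * math.e  # result of substituting first, then differentiating
print(chain, direct)
```

The benefit of the chain-rule route is that each piece (`partials` and the three single-variable derivatives) stays simple even when the substituted expression would not.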

## Jacobian: vectors of derivatives

Imagine we have a function like:

$$f(x,y,z)=x^2y+3z$$

To build the Jacobian, we just find each of the partial derivatives of the function one by one. 

So:
$$\frac{\partial{f}}{\partial{x}} = 2xy$$
$$\frac{\partial{f}}{\partial{y}} = x^2$$
$$\frac{\partial{f}}{\partial{z}} = 3$$

Now bringing all of those together, we just end up with a Jacobian:
$$J=[2xy,x^2,3]$$

We now have an algebraic expression for a vector which when we give it a specific x, y, z coordinate, will return a vector pointing in the direction of steepest slope of this function. 

For example, at the point $(0,0,0)$ we will have $J(0,0,0)=[0,0,3]$, since the third entry is the constant 3 everywhere.
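Written as code, the Jacobian is simply a function that takes a coordinate and returns the vector of partial derivatives. This is a minimal sketch; the name `jacobian` and the second sample point are assumptions for illustration.

```python
def jacobian(x, y, z):
    """Jacobian (row vector of partials) of f(x, y, z) = x**2 * y + 3 * z."""
    return [2 * x * y, x**2, 3]

print(jacobian(0, 0, 0))  # evaluate at the origin
print(jacobian(1, 2, 0))  # evaluate at another, arbitrary point
```

Note that the last entry never depends on the coordinate, because $f$ is linear in $z$.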

**Jacobian Applied**

For example, we have $f(x,y)=e^{-(x^2+y^2)}$

$$J=[-2xe^{-(x^2+y^2)},-2ye^{-(x^2+y^2)}]$$

So:

$$J(-1,1)=[0.27,-0.27]$$
$$J(2,2)=[-0.001,-0.001]$$
$$J(0,0)=[0,0]$$

If we draw this Jacobian vector field, it becomes clear that the origin (0,0) must be the maximum of this system, because every vector points towards it.
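The three evaluations above can be reproduced with a short sketch like the following, which assumes nothing beyond the formula for $J$ of $f(x,y)=e^{-(x^2+y^2)}$; the function name is an illustrative choice.

```python
import math

def jacobian(x, y):
    """Jacobian of f(x, y) = exp(-(x**2 + y**2))."""
    g = math.exp(-(x**2 + y**2))
    return [-2 * x * g, -2 * y * g]

for point in [(-1, 1), (2, 2), (0, 0)]:
    jx, jy = jacobian(*point)
    print(point, round(jx, 3), round(jy, 3))
```

Each vector is a positive multiple of $(-x, -y)$, so it points from the evaluation point back towards the origin, and its length shrinks rapidly far away from it.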

Another example:

Imagine we have:

$$u(x,y)=x-2y$$
$$v(x,y)=3y-2x$$

So:

$$J_u=\left[\frac{\partial{u}}{\partial{x}},\ \frac{\partial{u}}{\partial{y}}\right]$$
$$J_v=\left[\frac{\partial{v}}{\partial{x}},\ \frac{\partial{v}}{\partial{y}}\right]$$
$$J=\begin{bmatrix}\frac{\partial{u}}{\partial{x}}&\frac{\partial{u}}{\partial{y}}\\\frac{\partial{v}}{\partial{x}}&\frac{\partial{v}}{\partial{y}}\end{bmatrix}=\begin{bmatrix}1&-2\\-2&3\end{bmatrix}$$

Every entry of this matrix is a constant, so the gradient must be constant everywhere. Also, this matrix is just the linear transformation from xy space to uv space.
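To see the "linear transformation" claim in action, the sketch below multiplies a point by $J$ and compares the result with evaluating $u$ and $v$ directly. The helper name `to_uv` and the sample point are assumptions for illustration.

```python
# Jacobian of u = x - 2y, v = 3y - 2x (constant, so it is the same at every point).
J = [[1, -2],
     [-2, 3]]

def to_uv(x, y):
    """Map a point from xy space to uv space by multiplying by J."""
    return (J[0][0] * x + J[0][1] * y,
            J[1][0] * x + J[1][1] * y)

x, y = 2, 5  # arbitrary sample point
print(to_uv(x, y))                 # via the Jacobian matrix
print((x - 2 * y, 3 * y - 2 * x))  # via the definitions of u and v
```

The two printed pairs coincide because $u$ and $v$ are linear in $x$ and $y$ with no constant offset.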

We've now seen that the Jacobian describes the gradient of a multivariable system. And if you calculate it for a scalar-valued multivariable function, you get a row vector pointing in the direction of greatest slope, with a length proportional to the local steepness.

## The Sand Pit Game

**Optimization**

Examples of mathematical optimisation in action in the real world include the planning of routes through busy cities, the scheduling of production in a factory, or a strategy for selecting stocks when trading. If we go back to the simplest function we saw in the last section and want to **find the location of the maximum**, we can solve the system analytically by first building the **Jacobian** and then **finding the values of x and y which make it equal to 0**. However, when the function gets a bit more complicated, finding the maximum or minimum can get a bit tricky.

## The Hessian

We can introduce an additional concept, which relates to multivariate systems called the Hessian. 

In many ways, the Hessian can be thought of as a simple extension of the Jacobian vector. For the Jacobian, we collected together all of the first order derivatives of a function into a vector. Now, we're going to collect all of the second order derivatives together into a matrix, which for a function of n variables, would look like this.

$$H=\begin{bmatrix}\frac{\partial^2 f}{\partial x_1^2}&\frac{\partial^2 f}{\partial x_1\partial x_2}&\cdots&\frac{\partial^2 f}{\partial x_1\partial x_n}\\\frac{\partial^2 f}{\partial x_2\partial x_1}&\frac{\partial^2 f}{\partial x_2^2}&\cdots&\frac{\partial^2 f}{\partial x_2\partial x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\partial^2 f}{\partial x_n\partial x_1}&\frac{\partial^2 f}{\partial x_n\partial x_2}&\cdots&\frac{\partial^2 f}{\partial x_n^2}\end{bmatrix}$$

**Example**

Imagine we have a function like:
$$f(x,y,z)=x^2yz$$

$$J=[2xyz,x^2z,x^2y]$$

So, differentiating each entry of $J$ with respect to $x$, $y$ and $z$ in turn:

$$H=\begin{bmatrix}2yz&2xz&2xy\\2xz&0&x^2\\2xy&x^2&0\end{bmatrix}$$

So one thing to notice here is that our **Hessian matrix is symmetrical across the leading diagonal**. This will always be true if the function's second partial derivatives are continuous, which in practice means the function is smooth, with no sudden step changes.
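One way to check a hand-derived Hessian is to differentiate the Jacobian numerically and compare. The sketch below does this for $f(x,y,z)=x^2yz$; the function names, the step size and the sample point are assumptions chosen for illustration.

```python
def jacobian(x, y, z):
    """Jacobian of f(x, y, z) = x**2 * y * z."""
    return [2 * x * y * z, x**2 * z, x**2 * y]

def hessian(x, y, z):
    """Analytic Hessian of f: row i differentiates J[i] by x, y, z."""
    return [[2 * y * z, 2 * x * z, 2 * x * y],
            [2 * x * z, 0, x**2],
            [2 * x * y, x**2, 0]]

def numeric_hessian(x, y, z, step=1e-6):
    """Finite-difference Hessian built by stepping each input of the Jacobian."""
    point = [x, y, z]
    base = jacobian(*point)
    rows = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        shifted = list(point)
        shifted[j] += step          # small step in the j-th variable only
        moved = jacobian(*shifted)
        for i in range(3):
            rows[i][j] = (moved[i] - base[i]) / step
    return rows

H = hessian(1.0, 2.0, 3.0)
print(H)
print(numeric_hessian(1.0, 2.0, 3.0))
# Symmetry across the leading diagonal: H[i][j] equals H[j][i].
print(all(H[i][j] == H[j][i] for i in range(3) for j in range(3)))
```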

**Using the Hessian to determine whether a point with zero gradient is a minimum or a maximum**

For example, we have a function like $f(x,y)=x^2+y^2$

So $J=[2x,2y]$, and we now know that the gradient at the point (0,0) equals 0.

And $H=\begin{bmatrix}2&0\\0&2\end{bmatrix}$, $\begin{vmatrix}H\end{vmatrix}=4$

The power of the Hessian is, firstly, that if its determinant is positive, we know we are dealing with either a maximum or a minimum. (In this case we know $\begin{vmatrix}H\end{vmatrix}=4>0$)

Secondly, we then just look at the first term, which is sitting at the top left-hand corner of the Hessian. If this guy is also positive, we know we've got a minimum, as in this particular case. Whereas, if it's negative, we've got a maximum. (In this case we know 2 in $H=\begin{bmatrix}2&0\\0&2\end{bmatrix}$ is positive, so we've got a minimum.) 
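The two-step test can be sketched as a small function for the two-variable case. The "saddle point" and "inconclusive" branches go slightly beyond what is stated above: a negative determinant is the standard signature of a saddle in 2D, and a zero determinant leaves the test undecided. The function name is an assumption for illustration.

```python
def classify(hessian):
    """Classify a zero-gradient point of a two-variable function from its 2x2 Hessian."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    if det > 0:
        # Positive determinant: a maximum or a minimum; the top-left term decides which.
        return "minimum" if a > 0 else "maximum"
    if det < 0:
        return "saddle point"
    return "inconclusive"

print(classify([[2, 0], [0, 2]]))    # Hessian of x**2 + y**2 at (0, 0)
print(classify([[-2, 0], [0, -2]]))  # Hessian of -(x**2 + y**2) at (0, 0)
```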

## Reality is hard

Firstly, for many applications of optimisation, such as the training of **neural networks**, you are going to be dealing with a lot more than two dimensions, potentially hundreds or thousands. This means that we can no longer draw a nice surface and climb its mountains. All the same maths still applies, but we now have to use our 2D intuition to guide us and enable us to trust the maths.

Secondly, as we've mentioned briefly before, even if you do just have a 2D problem, very often you might not have a nice analytical function to describe it and calculating each point could be very expensive.

**If, as I said a minute ago, we don't even have the function that we're trying to optimise, how on earth are we supposed to build a Jacobian out of the partial derivatives?**

**The answer is: "Rise and Run"**

Taking a small step in x allows us to calculate an approximate partial derivative in x, and a small step in y gives an approximate partial derivative in y.

And our Jacobian expression would be:
$$J=\begin{bmatrix}\frac{f(x+\Delta{x},y)-f(x,y)}{\Delta{x}},\frac{f(x,y+\Delta{y})-f(x,y)}{\Delta{y}}\end{bmatrix}$$
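A minimal sketch of this rise-over-run Jacobian is shown below. It only needs to be able to evaluate $f$ at a point, not to know its formula. The function name, the default step sizes and the test function (the Gaussian bump from earlier) are assumptions for illustration.

```python
import math

def numerical_jacobian(f, x, y, dx=1e-6, dy=1e-6):
    """Approximate [df/dx, df/dy] at (x, y) using one small step in each variable."""
    base = f(x, y)
    return [(f(x + dx, y) - base) / dx,
            (f(x, y + dy) - base) / dy]

def f(x, y):
    """Example function: exp(-(x**2 + y**2))."""
    return math.exp(-(x**2 + y**2))

approx = numerical_jacobian(f, -1.0, 1.0)
exact = [2 * math.exp(-2), -2 * math.exp(-2)]  # analytic Jacobian at (-1, 1)
print(approx)
print(exact)
```

Each partial derivative costs one extra function evaluation, which matters when evaluating $f$ is expensive.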

We then meet two problems:

Firstly, **how big should our little step be**? Well, this has to be a balance, because if it's too big you'll make a bad approximation, for reasons that I hope will be obvious by this point. But if it's too small, then we might run into some numerical issues. Just remember, when your computer calculates the value of the function at a point, it only stores it to a certain number of significant figures. So if your point is too close, your computer might not register any change at all.

Second, as we mentioned earlier, **what happens if your data is a bit noisy**? To deal with this case, many different approaches have been developed. But perhaps the simplest is just to calculate the gradient using a few different step sizes and take some kind of average.
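Both problems can be seen in a small experiment. The sketch below uses a one-variable function to keep things short: it first shows how the estimate degrades when the step is too large or too small, then averages estimates from a few step sizes on artificially noisy data. The noise level, step sizes and the plain mean are assumptions for illustration, not a recommended recipe.

```python
import math
import random

def forward_difference(f, x, step):
    """Rise over run: slope estimate from one small step in x."""
    return (f(x + step) - f(x)) / step

# Problem 1: step size. The exact slope of sin at x = 1 is cos(1).
x = 1.0
exact = math.cos(x)
for step in (1e-1, 1e-6, 1e-17):
    estimate = forward_difference(math.sin, x, step)
    print(f"step={step:g}  estimate={estimate:.6f}  error={abs(estimate - exact):.2e}")
# With step=1e-17, x + step rounds back to x, so no change is registered at all.

# Problem 2: noisy data. Average the estimates from a few step sizes.
random.seed(0)  # fixed seed so the run is repeatable

def noisy_sin(x):
    """sin(x) plus small Gaussian noise (a made-up stand-in for noisy data)."""
    return math.sin(x) + random.gauss(0.0, 1e-4)

def averaged_gradient(f, x, steps):
    """Mean of the rise-over-run estimates taken with several step sizes."""
    estimates = [forward_difference(f, x, s) for s in steps]
    return sum(estimates) / len(estimates)

averaged = averaged_gradient(noisy_sin, x, [0.01, 0.02, 0.04])
print(averaged, exact)
```

Larger steps dilute the noise but bias the estimate, so the choice of step sizes to average over is itself a balance.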