# Nonlinear Optimization

State estimation in SLAM depends on the following two equations:

- $x_k = f(x_{k-1}, u_k) + w_k$
- $z_{k,j} = h(y_j, x_k) + v_{k,j}$

Here $x_k$ is the camera pose with six degrees of freedom at time $k$, and $z_{k,j}$ is the image position of the observed point $y_j$.
The $u_k$ is the input data at time $k$.
The input data can be a set of 2D points that are known to belong to the same object in a sequence of images.
It can be a mixed set of 3D and 2D points, or a set of 3D points (see Camera-Motion notebook for the related discussion).  

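The pose $x_k$ can be expressed by a rotation $R_k$ and a translation $t_k$, which map the point $y_j$ into the camera frame as $R_k y_j + t_k$.
Under this expression, $x_k$ and $z_{k,j}$ are related by the following equation:
$$s z_{k,j} = K(R_k y_j + t_k) = KT y_j$$
where $T \in SE(3)$ combines $R_k$ and $t_k$ (with $y_j$ and $z_{k,j}$ written in homogeneous coordinates), $s$ is the depth of the point in the camera frame, and $K$ is the intrinsic camera matrix.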
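
As a minimal sketch of the projection equation above, the following NumPy snippet computes $z_{k,j}$ and $s$ for one point. The intrinsic values in `K`, the identity pose, and the function name `project` are hypothetical choices made only for illustration.

```python
import numpy as np

def project(K, R, t, y):
    """Project a 3D point y into pixel coordinates.

    Implements s * z = K (R y + t) and returns (z, s), where z is the
    2D pixel position and s is the depth of the point in the camera frame.
    """
    p_cam = R @ y + t            # point expressed in the camera frame
    s = p_cam[2]                 # depth along the optical axis
    if s <= 0:
        raise ValueError("point is not in front of the camera")
    uvw = K @ p_cam              # homogeneous pixel coordinates, scaled by s
    return uvw[:2] / s, s

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # assumed pose: no rotation
t = np.zeros(3)                  # assumed pose: no translation
y = np.array([0.2, -0.1, 2.0])   # point 2 units in front of the camera

z, s = project(K, R, t, y)
print(z, s)
```

With this assumed pose the point lands at pixel $(370, 215)$ with depth $s = 2$, which can be checked by hand from the equation.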

The $w_k$ and $v_{k,j}$ are noise terms. They are usually assumed to be Gaussian with zero mean.
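
To make the role of the two noise terms concrete, here is a toy one-dimensional simulation of the motion and observation equations. Everything in it is an assumption for illustration: the models `f` and `h`, the noise levels `sigma_w` and `sigma_v`, and the reading of $u_k$ as a commanded displacement (the text above uses point sets as input instead).

```python
import numpy as np

def f(x_prev, u):
    """Hypothetical 1D motion model: move by the displacement u."""
    return x_prev + u

def h(y, x):
    """Hypothetical 1D observation model: offset of point y from pose x."""
    return y - x

def simulate(x0, controls, y, sigma_w=0.05, sigma_v=0.1, seed=42):
    """Generate noisy poses x_k and measurements z_k for one point y."""
    rng = np.random.default_rng(seed)
    poses, measurements = [], []
    x = x0
    for u in controls:
        x = f(x, u) + rng.normal(0.0, sigma_w)   # x_k = f(x_{k-1}, u_k) + w_k
        z = h(y, x) + rng.normal(0.0, sigma_v)   # z_k = h(y, x_k) + v_k
        poses.append(x)
        measurements.append(z)
    return np.array(poses), np.array(measurements)

poses, measurements = simulate(x0=0.0, controls=[1.0] * 5, y=10.0)
print(poses)
print(measurements)
```

The estimation problem discussed next runs in the opposite direction: only the measurements and inputs are known, and the poses and the point have to be recovered.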

Given the presence of noise, our problem can be formulated as a conditional probability distribution $P(x, y | z, u)$: given the pixel position $z$ and input $u$, what is the probability of the pose being $x$ and the observation being $y$?

The values of $x$ and $y$ that maximize this distribution are the most probable estimates given the data.
From Bayes' rule, $\arg\max P(x, y | z, u) = \arg\max P(z, u | x, y) P(x, y)$, because the denominator $P(z, u)$ does not depend on $x$ or $y$. Since we may not know the prior $P(x,y)$, we can drop it, which turns the maximum a posteriori (MAP) problem into a maximum likelihood estimation (MLE) problem: $$\arg\max P(x, y| z, u) = \arg\max P(z, u | x, y)$$

Under the zero-mean Gaussian noise assumption, maximizing the likelihood means minimizing the squared magnitude of the following error terms:
- $e_{u, k} = x_k - f(x_{k-1}, u_k)$
- $e_{z, j, k} = z_{k, j} - h(y_j, x_k)$
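
The link between the likelihood and these error terms can be sketched for a single observation. The step below assumes $v_{k,j} \sim \mathcal{N}(0, Q_{k,j})$; the covariance symbol $Q_{k,j}$ and the measurement dimension $n$ are introduced here only for illustration.
$$P(z_{k,j} | x_k, y_j) = \mathcal{N}(h(y_j, x_k), Q_{k,j})$$
$$-\ln P(z_{k,j} | x_k, y_j) = \frac{1}{2} e_{z,j,k}^T Q_{k,j}^{-1} e_{z,j,k} + \frac{1}{2} \ln\left((2\pi)^n \det Q_{k,j}\right)$$
The last term does not depend on $x_k$ or $y_j$, so maximizing the likelihood is the same as minimizing the quadratic form in $e_{z,j,k}$. The same argument applies to $e_{u,k}$ with the covariance of $w_k$.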

How do we minimize these terms? Assuming that $e_{u,k}$ defines a loss function $f_{loss}$, we are looking for the arguments that give the smallest output of the function $e_{u, k} = f_{loss}(x_k, x_{k-1}, u_k)$.

Formally, starting from a current estimate $a$, we look for an increment $\nabla a$ (often written $\Delta a$ elsewhere) that decreases the loss: $f_{loss}(a + \nabla a) < f_{loss}(a)$. 

This is a well-known problem: to obtain such a decrease, one moves in the direction opposite to the gradient, $\nabla a = -J(a)$, where $J$ is the Jacobian matrix containing the first-order derivatives of the $f_{loss}(a)$ function. 
The drawback is that this update has to be repeated iteratively until the loss stops decreasing.
The iterative updates require a step size (called the learning rate in machine learning), and the procedure can get stuck in a local minimum and fail to reach the global minimum.
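
The iterative update and its sensitivity to the starting point can be sketched in a few lines. The scalar loss `f`, its derivative `df`, and the step size of `0.01` are hypothetical choices for illustration, not part of the SLAM problem above.

```python
def gradient_descent(grad, a0, step_size=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function of one variable given its derivative.

    Repeats a <- a - step_size * grad(a) until the update is smaller
    than tol, and returns the final a with the number of iterations.
    """
    a = a0
    for i in range(max_iter):
        delta = -step_size * grad(a)   # move against the gradient
        a += delta
        if abs(delta) < tol:
            return a, i + 1
    return a, max_iter

# Hypothetical loss with two minima of different depth.
def f(a):
    return a**4 - 3 * a**2 + a

def df(a):
    return 4 * a**3 - 6 * a + 1

a_left, _ = gradient_descent(df, a0=-2.0, step_size=0.01)
a_right, _ = gradient_descent(df, a0=2.0, step_size=0.01)
print(a_left, f(a_left))
print(a_right, f(a_right))
```

Starting from $a = -2$ the iteration settles near $a \approx -1.30$, the deeper minimum, while starting from $a = 2$ it settles near $a \approx 1.13$, a shallower local minimum. The result depends on both the starting point and the chosen step size.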
Fortunately, there is a way to obtain the increment directly, without hand-tuning a step size.

We are looking for the $\nabla a$ in $f_{loss}(a + \nabla a)$. One way to find it is to expand $f_{loss}(a + \nabla a)$ around $a$ using a Taylor series:
$$f_{loss}(a + \nabla a) = f_{loss}(a) + \nabla a f'_{loss}(a) + \frac{(\nabla a)^2 f''_{loss}(a)}{2!} + \dots$$

More succinctly $$f_{loss}(a + \nabla a) \simeq f_{loss}(a) + \nabla a J(a) + \frac{(\nabla a)^2 H(a)}{2}$$ 
where $J$ represents the Jacobian (first-order derivative) matrix, and $H$ represents the Hessian (second-order derivative) matrix.
Now we want to minimize the right side of this equation with respect to $\nabla a$. 
Let's call it $k(\nabla a)$ and take its derivative with respect to $\nabla a$:
$$k(\nabla a) = f_{loss}(a) + \nabla a J(a) + \frac{(\nabla a)^2 H(a)}{2}$$
$$k'(\nabla a) = 0 + J(a) + \frac{2 * \nabla a H(a)}{2}$$
$$k'(\nabla a) = J(a) + \nabla a H(a)$$

The minimum/maximum value of this function can be found by setting the derivative to 0:
$$k'(\nabla a) = J(a) + \nabla a H(a) = 0$$
hence
$$\nabla a H(a) = -J(a)$$

Notice that this has the classic $Ax = b$ form, with $A = H(a)$, $x = \nabla a$, and $b = -J(a)$, and can be solved for $x$ using LU decomposition.
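
A minimal sketch of the resulting iteration (commonly known as Newton's method) is shown below. Each step solves the linear system $H(a) \nabla a = -J(a)$ with `np.linalg.solve`, which relies on an LU factorization internally. The two-variable test function $2\cosh(a_0) + (a_1 - a_0)^2$, its derivatives, and the name `newton_minimize` are hypothetical choices for illustration.

```python
import numpy as np

def newton_minimize(grad, hess, a0, tol=1e-10, max_iter=20):
    """Iterate a <- a + step, where step solves H(a) step = -J(a)."""
    a = np.asarray(a0, dtype=float)
    for i in range(max_iter):
        # Linear solve in Ax = b form: A = H(a), x = step, b = -J(a).
        step = np.linalg.solve(hess(a), -grad(a))
        a = a + step
        if np.linalg.norm(step) < tol:
            return a, i + 1
    return a, max_iter

# Hypothetical convex loss: 2*cosh(a0) + (a1 - a0)**2, minimum at (0, 0).
def grad(a):
    return np.array([2.0 * np.sinh(a[0]) - 2.0 * (a[1] - a[0]),
                     2.0 * (a[1] - a[0])])

def hess(a):
    return np.array([[2.0 * np.cosh(a[0]) + 2.0, -2.0],
                     [-2.0, 2.0]])

a_star, n_iter = newton_minimize(grad, hess, a0=[1.0, -1.0])
print(a_star, n_iter)
```

No step size appears anywhere: the Hessian sets both the direction and the length of each update. This sketch assumes the Hessian is invertible at every iterate, which holds for the chosen test function but not in general.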

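The problem is that computing $H$ takes a lot of time, so in practice one approximates it using the Jacobian: $$J(a)J^T(a) \nabla a = -J(a)f_{loss}(a)$$
Here $f_{loss}$ plays the role of the residual whose squared norm $\frac{1}{2}\|f_{loss}(a)\|^2$ is minimized, and $J(a)J^T(a)$ takes the place of $H(a)$.
This is called the normal equation or *Gauss-Newton* equation.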
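
As a sketch of how the Gauss-Newton equation is used, the snippet below fits the two parameters of $\exp(a_0 x + a_1)$ to noisy samples. Note the convention: the code stores the Jacobian as an $m \times n$ matrix with one row per residual, so the same normal equation reads $J^T J \nabla a = -J^T r$. The model, the synthetic data, the noise level, and the initial guess are all hypothetical.

```python
import numpy as np

def gauss_newton(residual, jacobian, a0, tol=1e-8, max_iter=50):
    """Minimize 0.5 * ||residual(a)||^2 with Gauss-Newton steps."""
    a = np.asarray(a0, dtype=float)
    for i in range(max_iter):
        r = residual(a)                  # shape (m,)
        J = jacobian(a)                  # shape (m, n), one row per residual
        # Normal equation: (J^T J) step = -J^T r, with J^T J standing in for H.
        step = np.linalg.solve(J.T @ J, -J.T @ r)
        a = a + step
        if np.linalg.norm(step) < tol:
            return a, i + 1
    return a, max_iter

# Hypothetical data: samples of exp(1.0 * x + 0.5) with small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
a_true = np.array([1.0, 0.5])
y = np.exp(a_true[0] * x + a_true[1]) + rng.normal(0.0, 0.01, size=x.shape)

def residual(a):
    return np.exp(a[0] * x + a[1]) - y

def jacobian(a):
    e = np.exp(a[0] * x + a[1])
    return np.column_stack((x * e, e))   # d r / d a0, d r / d a1

a_hat, n_iter = gauss_newton(residual, jacobian, a0=[0.8, 0.3])
print(a_hat, n_iter)
```

Only first derivatives of the residual are needed, which is the practical advantage over computing $H$. The approximation works best when the residuals are small near the solution and the initial guess is reasonably close, as assumed here.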