# Camera Motion

Camera motion estimation refers to recovering the change in camera pose, an element of $SE(3)$ (see Lie-Group-Algebra), from the available information.
There are 2 main approaches:

- Feature matching based methods
- Optical Flow based methods

We'll start with feature matching based methods.

## Feature Based Camera Motion Estimation


There are 3 types of information that can be used for estimating camera motion in feature based methods:

- Monocular camera case: We try to estimate the camera motion using sets of 2D points.
- Binocular camera or RGB-D camera case: We try to estimate the camera motion using sets of 3D points.
- Mixed case: We have a set of 3D points and their 2D projections. We try to estimate the camera pose using these two elements.

### Monocular camera

Estimating camera motion from a monocular camera requires some knowledge of epipolar geometry.

Let's define the problem. Given a spatial position $P = [x, y, z]^T$ and a sequence of images $I_1, I_2, \dots, I_i, \dots, I_n$,
the pixel position of $P$ in $I_i$ is $p_i$.
We know that $s_i * p_i = K * (R(i) * P + t(i))$, where $R(i)$ is the rotation matrix and $t(i)$ is the translation vector at the $i$th point in time, $K$ is the intrinsic camera matrix, and $s_i$ is the depth of $P$ in the $i$th camera frame.

Now, we know that in homogeneous coordinates a vector represents the same point after being multiplied by a nonzero constant, so in a homogeneous coordinate system $s_i v_i$ and $v_i$ are interchangeable. This type of equality is called *equal up to scale*.
We say, for example, that $s_i v_i$ is equal up to scale to $v_i$. Why are both considered equal?
Because they describe the same point: take $A = [x=2 * 2, y=2 * 3, z=2 * 4, w=2 * 1]$. When we go from 4D to 3D by dividing by $w$, we get $A / w = [x=2, y=3, z=4]$.
If we hadn't multiplied by $2$, we would still get the same result.

Let's express the equality as $s_i v_i \simeq v_i$.
The relationship is $$p_i \simeq K * (R(i) * P + t(i))$$

Now, let's remove the intrinsic properties of the camera from the equation: $$x_i = K^{-1} p_i$$
Here the sequential nature of the images matters, because it binds $x_i$ to $x_{i-1}$ through a rotation and a translation.
Remember that the spatial position of $P$ stays the same; the only thing that changes is the camera pose, which is described by a rotation and a translation in world coordinates.
Hence $$x_i \simeq R_{i}x_{i-1} + t_{i}$$
where $R_i$ and $t_i$ now denote the motion between frames $i-1$ and $i$.

If we multiply both sides from the left with $x_i^T t_i\hat{}$, we get an interesting result:
$$x_i^T t_i\hat{} x_i \simeq x_i^T t_i\hat{} R_i x_{i-1} + x_i^T t_i\hat{} t_i$$

The transposition in $x_i^T$ is necessary to make the matrix multiplication work: each term becomes a $1 \times 3$ row times a $3 \times 3$ matrix times a $3 \times 1$ column, that is, a scalar.

Now $t_i\hat{} x_i$ is simply the cross product of $t_i$ and $x_i$ (since $t_i\hat{}$ is the skew-symmetric matrix of $t_i$).

Given that the cross product produces a vector that is orthogonal to both of its operands, the inner product of $x_i$ with the resulting vector is necessarily 0: $$0 \simeq x_i^T t_i\hat{} R_i x_{i-1} + x_i^T t_i\hat{} t_i$$

Here $t_i\hat{} t_i$ is also 0, because it is the cross product of $t_i$ with itself. The snippet below checks this numerically for the vector $[1, 2, 3]^T$:

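```python
import numpy as np

# t plays the role of t_i; t_hat is its skew-symmetric matrix
t = np.array([1, 2, 3], dtype=float)
t_hat = np.array([[0, -3, 2],
                  [3, 0, -1],
                  [-2, 1, 0]], dtype=float)

# t_hat @ t is the cross product of t with itself
print(t_hat.dot(t))
```

```
[0. 0. 0.]
```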


Then we are left with the following equation $$0 \simeq x_i^T t_i\hat{}R_i x_{i-1} + 0$$

Now, since the left side is zero, multiplying the equation by any nonzero scalar changes nothing, so there is no need to keep the $\simeq$ relation.
Hence we end up with: $$x_i^T t_i\hat{}R_i x_{i-1} = 0$$

This constraint is known as the epipolar constraint. 
If we add the pixel coordinates back to the equation: $$p_i^T K^{-T}t_i\hat{}R_i K^{-1} p_{i-1} = 0$$

The matrix $t_i\hat{}R_i$ is called the *essential* matrix: $E = t_i\hat{}R_i$.

Under the epipolar constraint the problem looks like the following: 
$x_i = [u_i, v_i, 1]^T$, $x_{i-1} = [u_{i-1}, v_{i-1}, 1]^T$
$$[u_i, v_i, 1] \begin{pmatrix} e_1 & e_2 & e_3 \\ e_4 & e_5 & e_6 \\ e_7 & e_8 & e_9 \end{pmatrix} [u_{i-1}, v_{i-1}, 1]^T = 0$$

If we actually do the multiplication, we end up with:
$$[u_i, v_i, 1] \begin{pmatrix} u_{i-1} * e_1 + v_{i-1} * e_2 + 1 * e_3 \\
   u_{i-1} * e_4 + v_{i-1} * e_5 + 1 * e_6 \\
   u_{i-1} * e_7 + v_{i-1} * e_8 + 1 * e_9 \end{pmatrix} = 0$$

$$u_i(u_{i-1} * e_1 + v_{i-1} * e_2 + 1 * e_3)
  + v_i(u_{i-1} * e_4 + v_{i-1} * e_5 + 1 * e_6)
  + u_{i-1} * e_7 + v_{i-1} * e_8 + 1 * e_9 = 0$$

$$u_i * u_{i-1} * e_1 + u_i * v_{i-1} * e_2 + u_i * e_3
   + v_i * u_{i-1} * e_4 + v_i * v_{i-1} * e_5 + v_i * e_6
   + u_{i-1} * e_7 + v_{i-1} * e_8 + 1 * e_9 = 0$$

$$[u_i * u_{i-1}, u_i * v_{i-1}, u_i, v_i * u_{i-1}, v_i * v_{i-1},
   v_i, u_{i-1}, v_{i-1}, 1] \cdot
   [e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9] = 0$$
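
Each matched pair therefore gives one linear equation in the nine entries of $E$. Stacking eight or more such rows and taking the null vector of the stacked matrix recovers $E$ up to scale (this linear solve is commonly called the eight-point algorithm). Below is a minimal sketch of that idea, assuming noise-free normalized coordinates. The names `estimate_essential` and `synthetic_pair` are made up for this illustration, and the sketch does not enforce the internal constraints of an essential matrix.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix of v, so that skew(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def estimate_essential(x_prev, x_curr):
    """Solve x_curr^T E x_prev = 0 for E (up to scale) from n >= 8 pairs.

    x_prev, x_curr: (n, 3) arrays of normalized coordinates [u, v, 1].
    """
    u1, v1 = x_prev[:, 0], x_prev[:, 1]
    u2, v2 = x_curr[:, 0], x_curr[:, 1]
    # One row per pair, in the same order as the dot product above
    A = np.column_stack([u2 * u1, u2 * v1, u2,
                         v2 * u1, v2 * v1, v2,
                         u1, v1, np.ones(len(u1))])
    # The right singular vector of the smallest singular value spans the null space
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)

def synthetic_pair(n=20, seed=0):
    """Hypothetical test data: n points seen before and after a known motion."""
    rng = np.random.default_rng(seed)
    angle = 0.1
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    t = np.array([0.5, 0.1, 0.2])
    P = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 8.0], size=(n, 3))
    P2 = P @ R.T + t
    return R, t, P / P[:, 2:3], P2 / P2[:, 2:3]

R, t, x_prev, x_curr = synthetic_pair()
E = estimate_essential(x_prev, x_curr)
# Every pair should satisfy the epipolar constraint with the estimated E
residuals = np.einsum("ni,ij,nj->n", x_curr, E, x_prev)
print(np.abs(residuals).max())
```

The estimate can only match $t_i\hat{}R_i$ up to a scale factor and a sign, which is why a comparison against the true matrix has to normalize both first.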


### Mixed case: Perspective n Point method

Suppose we have obtained the 3D positions of a set of points, using an RGB-D camera or a binocular camera, and we also know their 2D projections in the image.
The Perspective-n-Point (PnP) method concerns how to estimate the camera's pose (rotation and translation in world coordinates) from this information.

There are essentially two methods: direct linear transformation (DLT) and bundle adjustment (BA). Let's see DLT first.

#### Direct Linear Transformation (DLT)

We have a point in 3D space, $P = [x,y,z]^T$, written in homogeneous coordinates as $P = [x, y, z, 1]^T$.
The projection of $P$ is $p = [u, v, 1]^T$.
From 3D math we know that:
$$s [u, v, 1]^T = \begin{pmatrix} t_1 & t_2 & t_3 & t_4 \\ t_5 & t_6 & t_7 & t_8 \\ t_9 & t_{10} & t_{11} & t_{12} \end{pmatrix} [x, y, z, 1]^T$$

This matrix has 12 unknowns. Eliminating $s$ with the last row leaves two linear equations per matched pair, so 6 pairs of matching points are enough to solve for it, for example using QR decomposition.
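
To make the counting concrete: the last row gives $s = t_9 x + t_{10} y + t_{11} z + t_{12}$, and substituting it into the first two rows yields two equations that are linear in $t_1, \dots, t_{12}$. The sketch below stacks these equations and solves them. Note that it uses an SVD null-space solve rather than the QR decomposition mentioned above, and that `dlt_pose` is a name chosen for this illustration, not taken from any library. The result is only defined up to scale and is not guaranteed to contain a valid rotation.

```python
import numpy as np

def dlt_pose(points_3d, points_2d):
    """Estimate the 3x4 matrix [t_1 ... t_12] up to scale from n >= 6 pairs.

    points_3d: (n, 3) array of [x, y, z]; points_2d: (n, 2) array of [u, v].
    """
    rows = []
    for (x, y, z), (u, v) in zip(points_3d, points_2d):
        Ph = [x, y, z, 1.0]
        # row1 . Ph - u * (row3 . Ph) = 0
        rows.append(Ph + [0.0] * 4 + [-u * c for c in Ph])
        # row2 . Ph - v * (row3 . Ph) = 0
        rows.append([0.0] * 4 + Ph + [-v * c for c in Ph])
    A = np.array(rows)
    # Null vector of the stacked system (smallest singular value)
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

# Hypothetical data: a known [R | t] and a handful of points in front of the camera
rng = np.random.default_rng(1)
c, s = np.cos(0.3), np.sin(0.3)
T_true = np.array([[c, -s, 0.0, 0.1],
                   [s, c, 0.0, -0.2],
                   [0.0, 0.0, 1.0, 4.0]])
pts = rng.uniform(-1.0, 1.0, size=(8, 3))
proj = (T_true @ np.hstack([pts, np.ones((8, 1))]).T).T
uv = proj[:, :2] / proj[:, 2:3]

T_est = dlt_pose(pts, uv)
# Fix the free scale by matching the last entry before comparing
print(T_est * (T_true[2, 3] / T_est[2, 3]))
```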

#### Bundle Adjustment (BA)

We have a point in 3D space, $P = [x,y,z]^T$, written in homogeneous coordinates as $P = [x, y, z, 1]^T$.
The projection of $P$ is $p = [u, v, 1]^T$.
From 3D math we know that:
$$s [u, v, 1]^T = K T [x, y, z, 1]^T$$
where $K$ is the intrinsic camera matrix, $T$ is a member of $SE(3)$, and $s$ is there to counter the effect of perspective projection.
As you may remember, the projection of $KTP$ to $p$ happens after dividing all of its members by the depth, so if we want to make this relationship explicit, we can also write the equation above as:
$$(KTP)_z [u, v, 1]^T = K T P$$

The difference between measured $p$ from feature matching and the projected $p$ obtained from the formula above is called the reprojection error.

If we can minimize the reprojection error $w$, we can get:

- Camera pose (camera location in world coordinate)
- Spatial position of feature points

The general strategy for finding a minimum of a function $f_{loss}$ usually involves either taking a small step against its derivative ($f_{loss}(\alpha + \nabla \alpha) < f_{loss}(\alpha)$ for $\nabla \alpha = -\eta f'_{loss}(\alpha)$ with a small enough step size $\eta > 0$) or solving for the point where the derivative is equal to 0 ($f'_{loss} = 0$).

Remember that a derivative measures how fast the output of a function changes with respect to one of its variables.
Now in our cost $||w||^2 = ||m_p - (K T P)/ (KTP)_z||^2$, where $m_p$ is the measured $p$, the two variables of interest are $T$, which is related to the camera pose, and $P$, which is related to the spatial position of the point detected by feature matching.

These two are the contributing factors to the error $w = [w_1, w_2, 0]^T$ (the last component vanishes because both $m_p$ and the projection end with 1), so it makes sense to differentiate $w$ with respect to these variables to see how $w$ changes when we change one of them.

Formally, we are describing $w$ as the following function: $w = m_p - (K \cdot T \cdot P)/ s$ where $K$ and $m_p$ are constants.
We are then looking for two partial derivatives:
- $\frac{\partial w}{\partial T}$
- $\frac{\partial w}{\partial P}$

Remember that $T \in SE(3)$, so one way of taking its derivative is to use a left perturbation.
Hence our first expression $\frac{\partial w}{\partial T}$ transforms into
$\frac{\partial w}{\partial d}$, where $d$ represents the left perturbation of $T$.
We are now looking for how $w$ changes with respect to $d$.

Let $w = f_1(T) = k(g(T))$ where $g(T) = T \cdot P = P'$ and 
$k(P') = m_p - (K \cdot P') / s$, then
$$f'_1(T) = \frac{\partial w}{\partial d} = \lim_{d \to 0} \frac{f_1(d \star T) - f_1(T)}{d}$$
From the chain rule: $f'_1(T) = k'(g(T)) g'(T)$
Then the above equation is simply:
$$\frac{\partial w}{\partial d} = \frac{\partial w}{\partial P'} \frac{\partial P'}{\partial d}$$

The first part of this equation, $\frac{\partial w}{\partial P'}$, is relatively simple.
Let $P' = [x', y', z']^T$ then 
$$s * [u, v, 1]^T = K \cdot P' = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \cdot [x', y', z']^T$$
From this equation one should note that $s = z'$: once the matrix multiplication is done, the last row consists only of $z'$, so the coefficient $s$ must be equal to the depth $z'$ of the point.

From here it should be easy to see:
- $s * u = f_x x' + c_x z'$, hence $u = f_x \frac{x'}{z'} + c_x$
- $s * v = f_y y' + c_y z'$, hence $v = f_y \frac{y'}{z'} + c_y$

With $u$ and $v$ defined, we can express $\frac{\partial w}{\partial P'}$ analytically:
$$\frac{\partial w}{\partial P'} = - \begin{pmatrix}
\frac{\partial u}{\partial x'} & \frac{\partial u}{\partial y'} & \frac{\partial u}{\partial z'} \\
\frac{\partial v}{\partial x'} & \frac{\partial v}{\partial y'} & \frac{\partial v}{\partial z'}
\end{pmatrix}$$

This is simply a question of filling out the coefficients:
- $\frac{\partial u}{\partial x'} = f_x / z'$
- $\frac{\partial u}{\partial y'} = 0$
- $\frac{\partial u}{\partial z'} = \frac{-f_x x'}{z'^2}$: This comes from the power rule: $f(x) = x^r$, $f'(x) = r * x^{r-1}$. Applied to our case $f_x x' z'^{-1}$, it produces $-1 * f_x x' z'^{-2}$
- $\frac{\partial v}{\partial x'} = 0$
- $\frac{\partial v}{\partial y'} = f_y / z'$
- $\frac{\partial v}{\partial z'} = \frac{-f_y y'}{z'^2}$: This comes from the power rule as above

The minus sign in front of the matrix comes from the minus sign in the $w$: $m_p - (K \cdot P') / s$.

If we regroup: $$\frac{\partial w}{\partial P'} = - \begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}\end{pmatrix}$$

Now that we have the first part of $\frac{\partial w}{\partial d}$, we can start dealing with the second part, $\frac{\partial P'}{\partial d}$.

Remember that $P' = T \cdot P$ and that $T \in SE(3)$.
As one might recall from the Lie-Group-Algebra notebook, the derivative with respect to an element of $SE(3)$ can be expressed through a left perturbation $d$ with:

$$\frac{\partial(T \cdot P)}{\partial d} = \lim_{\delta \to 0} \frac{\exp(\gamma(\delta)) \exp(\gamma(\xi))P - \exp(\gamma(\xi))P}{\delta}$$
where $\xi$ is the Lie algebra element of $T$ and $\delta$ is the Lie algebra element of $d$.

We also know from the notebook that this expression reduces to:
$$\frac{\partial(T \cdot P)}{\partial d} = \begin{pmatrix} I & \lambda(-(R\cdot P+t)) \\ 0 & 0 \end{pmatrix}$$
This is a $4 \times 6$ matrix where $I \in \mathbb{R}^{3 \times 3}$ and $\lambda(-(R \cdot P+t)) \in \mathbb{R}^{3 \times 3}$ is the skew-symmetric matrix of $-(R \cdot P + t)$.

The problem is that we can't multiply the 2 by 3 matrix $\frac{\partial w}{\partial P'}$ with the 4 by 6 matrix $\frac{\partial P'}{\partial d}$, due to the mismatch of dimensions.
However, upon a closer look, we see that the last row of $\frac{\partial P'}{\partial d}$ is composed of zeros.

At this point we have 2 options. We can either pad the first matrix with a column of zeros or remove the last row from the second matrix. In both cases the end result is the same: the multiplication produces a 2 by 6 matrix. Let's remove the last row from $\frac{\partial P'}{\partial d}$ and write the multiplication.
Remember that $R \cdot P + t = P'$

$$\frac{\partial w}{\partial d} = \frac{\partial w}{\partial P'} \frac{\partial P'}{\partial d} = -
\begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}
\end{pmatrix} \cdot \begin{pmatrix}
1 & 0 & 0 & 0 & z' & -y' \\
0 & 1 & 0 & -z' & 0 & x' \\
0 & 0 & 1 & y' & -x' & 0 \end{pmatrix}$$
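
A quick way to gain confidence in this $2 \times 6$ jacobian is to compare it against finite differences. The sketch below does that under a few assumptions: the perturbation vector is ordered as translation first, then rotation (matching the column order above), the exponential map is approximated with a truncated power series (adequate for tiny perturbations), and all function names are made up for this illustration.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def se3_exp(d, terms=20):
    """exp of the twist d = [rho, phi], via a truncated power series."""
    A = np.zeros((4, 4))
    A[:3, :3] = hat(d[3:])
    A[:3, 3] = d[:3]
    result, term = np.eye(4), np.eye(4)
    for n in range(1, terms):
        term = term @ A / n
        result = result + term
    return result

def residual(m_p, K, T, P):
    """w = m_p - projection of T * P, keeping only the two pixel components."""
    x, y, z = (T @ np.append(P, 1.0))[:3]
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return m_p - np.array([u, v])

def pose_jacobian(K, T, P):
    """Analytic dw/dd: -(2x3 projection part) times (3x6 perturbation part)."""
    x, y, z = (T @ np.append(P, 1.0))[:3]
    fx, fy = K[0, 0], K[1, 1]
    dw_dP = -np.array([[fx / z, 0.0, -fx * x / z**2],
                       [0.0, fy / z, -fy * y / z**2]])
    dP_dd = np.hstack([np.eye(3), -hat([x, y, z])])
    return dw_dP @ dP_dd

def numeric_pose_jacobian(m_p, K, T, P, eps=1e-6):
    """Central differences over the six components of the left perturbation."""
    J = np.zeros((2, 6))
    for k in range(6):
        d = np.zeros(6)
        d[k] = eps
        w_plus = residual(m_p, K, se3_exp(d) @ T, P)
        w_minus = residual(m_p, K, se3_exp(-d) @ T, P)
        J[:, k] = (w_plus - w_minus) / (2 * eps)
    return J

# Hypothetical camera, pose, point and measurement
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = se3_exp(np.array([0.1, -0.2, 0.3, 0.05, -0.1, 0.2]))
P = np.array([0.4, -0.3, 5.0])
m_p = np.array([330.0, 250.0])
print(np.abs(pose_jacobian(K, T, P) - numeric_pose_jacobian(m_p, K, T, P)).max())
```

The printed value is the largest disagreement between the analytic and the numeric jacobian; it should be tiny compared to the entries themselves.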

We have taken care of the derivative of the reprojection error with respect to the camera pose, but haven't touched on its derivative with respect to the spatial position.

This requires us to compute $\frac{\partial w}{\partial P}$, where $w$ is the error term and $P$ is the spatial position obtained through feature matching.

Let $w = f_2(P) = k(q(P))$ where $q(P) = T \cdot P = P'$ and
$k(P') = m_p - (K \cdot P') / s$, then
$$f'_2(P) = \frac{\partial w}{\partial P} = \lim_{h \to 0} \frac{f_2(P + h) - f_2(P)}{h}$$
From the chain rule: $f'_2(P) = k'(q(P)) q'(P)$
Then the above equation is simply:
$$\frac{\partial w}{\partial P} = \frac{\partial w}{\partial P'} \frac{\partial P'}{\partial P}$$

We already know the first part $\frac{\partial w}{\partial P'} = -\begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}\end{pmatrix}$.
We just need to find $\frac{\partial P'}{\partial P}$.
The equation $P' = R \cdot P + t$ is linear in $P$, and $t$ does not depend on $P$, so only $R$ survives the differentiation:
- $\frac{\partial P'}{\partial P} = R$

Hence the expression $\frac{\partial w}{\partial P}$ reduces to:
$$\frac{\partial w}{\partial P} = - \begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}\end{pmatrix} \cdot R$$
This results in a 2 by 3 matrix.

### RGB-D camera case: 3D - 3D Iterative Closest Point (ICP)

We have two sets of 3d points:
- $P = [p_1, p_2, \dots, p_i, \dots, p_n]$
- $P' = [p'_1, p'_2, \dots, p'_i, \dots, p'_n]$

We assume that these points are matched through features of images (as in the case of RGB-D images).
We are looking for the transformation $T$ such that:
$$P = [T \cdot p'_1, T \cdot p'_2, \dots, T \cdot p'_i, \dots, T \cdot p'_n]$$

Notice that we don't use the intrinsic camera matrix anywhere here.
Apart from that, the optimization has the same structure as in PnP.

The objective function for the minimization is:
$\sum_{i=1}^n ||err_i||^2$ with $err_i = p_i - T \cdot p'_i$

Remember from the discussion in PnP that $T \in SE(3)$.
As one might recall from the Lie-Group-Algebra notebook, the derivative with respect to an element of $SE(3)$ can be expressed through a left perturbation $d$ with:

$$\frac{\partial(T \cdot P)}{\partial d} = \lim_{\delta \to 0} \frac{\exp(\gamma(\delta)) \exp(\gamma(\xi))P - \exp(\gamma(\xi))P}{\delta}$$
where $\xi$ is the Lie algebra element of $T$, $\delta$ is the Lie algebra element of $d$, and $P$ stands for the point being transformed, here $p'_i$.

We also know from the notebook that this expression reduces to:
$$\frac{\partial(T \cdot P)}{\partial d} = \begin{pmatrix} I & \lambda(-(R\cdot P+t)) \\ 0 & 0 \end{pmatrix}$$
This is a $4 \times 6$ matrix where $I \in \mathbb{R}^{3 \times 3}$ and $\lambda(-(R \cdot P+t)) \in \mathbb{R}^{3 \times 3}$

Given that the coefficient of $T \cdot p'_i$ is negative in $err_i = p_i - T \cdot p'_i$, the derivative $\frac{\partial err_i}{\partial d}$ is:
$$\frac{\partial err_i}{\partial d} = -\begin{pmatrix} I & \lambda(-(R\cdot p'_i+t)) \\ 0 & 0 \end{pmatrix}$$
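
With this jacobian in hand, the alignment can be solved iteratively: accumulate $H = \sum J^T J$ and $b = -\sum J^T err_i$ over all pairs, solve $H d = b$, and apply $\exp(d)$ on the left of the current estimate. The following is a minimal Gauss-Newton sketch of that loop, assuming already matched, noise-free points and a perturbation ordered as translation first, then rotation. `align_gauss_newton` and the helper names are hypothetical.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def se3_exp(d, terms=30):
    """exp of the twist d = [rho, phi], via a truncated power series."""
    A = np.zeros((4, 4))
    A[:3, :3] = hat(d[3:])
    A[:3, 3] = d[:3]
    result, term = np.eye(4), np.eye(4)
    for n in range(1, terms):
        term = term @ A / n
        result = result + term
    return result

def align_gauss_newton(p, p_prime, iterations=10):
    """Find T such that p_i is close to T * p'_i, starting from the identity."""
    T = np.eye(4)
    for _ in range(iterations):
        H = np.zeros((6, 6))
        b = np.zeros(6)
        for target, source in zip(p, p_prime):
            q = (T @ np.append(source, 1.0))[:3]   # T * p'_i
            err = target - q
            # Top three rows of the 4x6 matrix above, with the leading minus sign
            J = -np.hstack([np.eye(3), -hat(q)])
            H += J.T @ J
            b += -J.T @ err
        d = np.linalg.solve(H, b)
        T = se3_exp(d) @ T                         # left perturbation update
        if np.linalg.norm(d) < 1e-10:
            break
    return T

# Hypothetical data: points moved by a known transformation
rng = np.random.default_rng(2)
T_true = se3_exp(np.array([0.3, -0.2, 0.5, 0.1, 0.2, -0.3]))
p_prime = rng.uniform(-1.0, 1.0, size=(10, 3))
p = (T_true @ np.hstack([p_prime, np.ones((10, 1))]).T).T[:, :3]

T_est = align_gauss_newton(p, p_prime)
print(np.abs(T_est - T_true).max())
```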

## Pixel Brightness based Camera Motion Estimation

Feature based methods are nice, but it takes a lot of time to calculate descriptors and match them.
Plus, features are sometimes hard to find.
For example, when the camera is facing a white wall, it is likely that there are not enough features to provide an accurate enough estimation.

Another possible solution is to use pixel intensity levels to estimate the camera pose matrix.
Assuming the brightness level of a pixel doesn't change due to motion (this is entirely false for certain surfaces such as metals), we can use the intensity of a projected pixel as a target variable for the pose matrix that is being estimated.

Assuming that we are in a completely still environment where lighting conditions never change, we can suppose that a pixel's brightness stays the same as long as it belongs to the same location.
This assumption can be formally stated as: $I_1(p_1) = I_2(p_2)$ where $p_1 = \frac{K \cdot P}{Z_1}$ and $p_2 = \frac{K \cdot (T \cdot P )}{Z_2}$ and where $K$ is the intrinsic camera matrix and $P$ is the 3D point seen at the pixel, expressed in the first camera's coordinates.
$I_1$ is the first image.
$I_2$ is the second image. 
$T$ is the usual transformation matrix $T \in SE(3)$.
$Z_1$ is the z coordinate resulting from $K \cdot P$ which is needed for perspective transformation and $Z_2$ is the z coordinate resulting from $K \cdot (T \cdot P)$.

Notice that the constant pixel brightness assumption is a very strong one, and it most definitely does not apply to certain surfaces such as metals or mirrors, but it is a necessary assumption for the algorithm.

We treat this version of the pose estimation problem as a non linear optimization problem where we try to minimize the **photometric** error.
The photometric error is defined as
$\sum_{i=1}^k r_i(P_i, T)^2$ where
$$r_i(P_i, T) = I_{1}(\frac{K \cdot P_i}{Z_1}) - I_{2}(\frac{K \cdot (T \cdot P_i )}{Z_2})$$
where $I_1$ represents the first image, $I_2$ represents the second image, and $Z_1$, $Z_2$ are the depths of $P_i$ in the two camera frames.
As in the case of reprojection error, we can minimize this error function by taking its derivative.
Decomposing $r_i(P_i, T)$ as $r_i(P_i, T) = I_1(p_i) - I_2(g(h(P_i, T)))$
where $h(P_i, T) = T \cdot P_i = u$ and $g(u) = \frac{K \cdot u}{Z_2}$, the derivative is $r_i'(P_i, T) = - I_2'(g(h(P_i, T))) g'(h(P_i, T)) h'(P_i, T)$ according to the chain rule.

Given that $P_i$ is constant during the minimization algorithm, we can rewrite the $r_i'(P_i, T)$ as
$$r_i'(T) = - I_2'(g(h(T))) g'(h(T)) h'(T)$$

The $g'(h(T))$ is simply $$\begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}\end{pmatrix}$$ as we had seen in the Mixed case section.

The $h'(T)$ is simply $$\begin{pmatrix} I & \lambda(-(R\cdot P+t)) \\ 0 & 0 \end{pmatrix}$$ as we had seen in the Mixed case section.
This is a $4 \times 6$ matrix where $I \in \mathbb{R}^{3 \times 3}$ and $\lambda(-(R \cdot P+t)) \in \mathbb{R}^{3 \times 3}$

As in the Mixed case, we remove the last row, so that the product of $g'$ and $h'$ is a $2 \times 6$ matrix.

The only new term is then the gradient of the second image at the projected point: $I_2'(g(h(T)))$.
This is the row vector of intensity changes along the x and y directions of the image, which can be approximated with pixel differences.
Formally: $grad(I) = \begin{pmatrix} G_x(I) & G_y(I) \end{pmatrix}$. The chain rule needs this $1 \times 2$ vector rather than its magnitude $\sqrt{G_x(I)^2 + G_y(I)^2}$, otherwise the dimensions of the product would not yield one row per residual.

This covers all the terms required for finding the first order derivative of the residual $r_i$ with respect to the left perturbation of $T$, that is the $1 \times 6$ jacobian of the residual:
$$J(r_i) = - \begin{pmatrix} G_x(I_2)_i & G_y(I_2)_i \end{pmatrix} \cdot \begin{pmatrix}
\frac{f_x}{z'} & 0 & \frac{-f_x x'}{z'^2} \\
0 & \frac{f_y}{z'} & \frac{-f_y y'}{z'^2}\end{pmatrix} \cdot \begin{pmatrix} I & \lambda(-(R\cdot P+t)) \end{pmatrix}$$
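
The three factors can be put together in a few lines. The sketch below is an illustration only: it assumes the image gradient is taken with central pixel differences at the nearest integer pixel (real implementations usually interpolate), that the perturbation is ordered as translation first, then rotation, and that the point is already expressed in the second camera's frame. The function names are made up.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def image_gradient(img, u, v):
    """Central differences at integer pixel (u = column, v = row)."""
    gx = 0.5 * (img[v, u + 1] - img[v, u - 1])
    gy = 0.5 * (img[v + 1, u] - img[v - 1, u])
    return np.array([gx, gy])

def photometric_jacobian(img2, K, P_cam2):
    """1x6 jacobian of one residual; P_cam2 = T * P_i = [x', y', z']."""
    x, y, z = P_cam2
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pixel where the point lands in the second image
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    grad = image_gradient(img2, u, v)                      # 1x2
    d_proj = np.array([[fx / z, 0.0, -fx * x / z**2],
                       [0.0, fy / z, -fy * y / z**2]])     # 2x3
    d_pert = np.hstack([np.eye(3), -hat(P_cam2)])          # 3x6
    return -grad @ d_proj @ d_pert

# Hypothetical second image: an intensity ramp with gradient [2, 3]
cols, rows = np.meshgrid(np.arange(64), np.arange(48))
img2 = 2.0 * cols + 3.0 * rows
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
P_cam2 = np.array([0.2, -0.1, 2.0])

print(photometric_jacobian(img2, K, P_cam2))
```

The result has six entries, one per component of the left perturbation, which is what the Gauss-Newton step expects for a scalar residual.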

From the non linear optimization notebook, we know that:

$$-k'(\alpha^s) = k''(\alpha^s) \nabla \alpha$$
Here $k'(\alpha^s)$ is the first order derivative (gradient) of $k$ evaluated at $\alpha^s$, and
$k''(\alpha^s)$ is the Hessian matrix of $k$ evaluated at $\alpha^s$.
Hence the final form of the equation:
$$-g(\alpha^s) = H(\alpha^s) \nabla \alpha$$

The $k$ in our problem is the photometric error $\sum_i r_i(P_i, T)^2$.
The $\alpha^s$ is the parameter vector at the $s$th iteration.
The parameter vector in our case is the Lie algebra element of the estimated pose matrix $T$.
The $\nabla \alpha$ is the shift vector, which belongs to the same domain as the parameter vector.

# Bundle Adjustment and Sparsity of Hessian

Here is the projection process of a point $p$:

1. Transform $p = [X, Y, Z]$ in world space to camera space using transform matrix $T$: $P' = T \cdot p = [X', Y', Z']$

2. Transform $P'$ to normalized coordinates with perspective projection: $P_c = P' / Z' = [u, v, 1]$

3. Apply the distortion model. For example radial distortion, with $r^2 = u^2 + v^2$:
- $u' = u (1 + k_1 r^2 + k_2 r^4)$
- $v' = v (1 + k_1 r^2 + k_2 r^4)$

4. Compute pixel coordinate $m = [u_s, v_s]$ using intrinsic camera properties:
- $u_s = f_x u' + c_x$
- $v_s = f_y v' + c_y$

We can define this process as a single function $m = h(K, d, T, p)$, where $d$ collects the distortion parameters. For multiple points and poses, the process would look something like:
$$m_i = h(K, d, T_j, p_i)$$ where $i \in \{0, \dots, N\}$ and $j \in \{0, \dots, M\}$.

Given that the intrinsic camera properties and lens distortion don't change, we can simplify this equation like: $m_i = k(T_j, p_i)$.
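
The four steps translate almost line by line into code. Here is a small sketch of such a projection function, assuming the two-coefficient radial model from step 3; the name `project` and the way the parameters are passed are choices made for this example.

```python
import numpy as np

def project(K, dist, T, p):
    """Steps 1-4: world point p -> pixel [u_s, v_s].

    K: 3x3 intrinsic matrix, dist: (k_1, k_2), T: 4x4 pose, p: [X, Y, Z].
    """
    k1, k2 = dist
    # 1. World space to camera space
    Xc, Yc, Zc = (T @ np.append(p, 1.0))[:3]
    # 2. Normalized coordinates
    u, v = Xc / Zc, Yc / Zc
    # 3. Radial distortion
    r2 = u * u + v * v
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    u_d, v_d = u * scale, v * scale
    # 4. Pixel coordinates
    return np.array([K[0, 0] * u_d + K[0, 2], K[1, 1] * v_d + K[1, 2]])

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
point = np.array([1.0, 2.0, 10.0])
print(project(K, (0.0, 0.0), np.eye(4), point))   # no distortion
print(project(K, (0.1, 0.0), np.eye(4), point))   # mild radial distortion
```

With $K$ and the distortion coefficients held fixed, `project` plays the role of $k(T_j, p_i)$.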

The cost function associated with a pixel is $err(T_j, p_i) = m_i - k(T_j, p_i)$, where $m_i$ now stands for the measured pixel position and $k(T_j, p_i)$ for its prediction. The overall cost function is:
$$\sum_{j=1}^M \sum_{i=1}^N ||err(T_j, p_i)||^2$$

We had seen in the non linear optimization notebook that in order to minimize a residual function of the type $k(\alpha) = err(\alpha)^2$, we use the following expression originating from the Taylor expansion of $k(\alpha)$:
$$k(\alpha^s + \nabla \alpha) \approx k(\alpha^s) + k'(\alpha^s)\nabla \alpha + \frac{k''(\alpha^s)(\nabla \alpha)^2}{2!}$$

From here we set the derivative of $k(\alpha^s + \nabla \alpha)$ with respect to $\nabla \alpha$ to 0 and obtain the following expression:
$$\frac{\partial k(\alpha^s + \nabla \alpha)}{\partial \nabla \alpha} = 0 =
0 + k'(\alpha^s)+ \frac{2 * k''(\alpha^s) \nabla \alpha}{2}$$

The equation simplifies as:
$$0 = k'(\alpha^s) + k''(\alpha^s) \nabla \alpha$$
and consequently:
$$-k'(\alpha^s) = k''(\alpha^s) \nabla \alpha$$

Now the bridge between $err(T_j, p_i)$ and $k$ is to realize that $\alpha$ is just a vector incorporating $T_j$ and $p_i$: $\alpha = [T_j, p_i]$, hence $\alpha \in R^9$ (remember that the Lie algebra element of $T$ lives in $\mathbb{se}(3)$, which is parameterized by $R^6$, and $p_i \in R^3$).

Consequently $\nabla \alpha \in R^9$ as well.
The jacobian matrix resulting from $k'(\alpha^s) = J(\alpha^s)$ also has a peculiar structure. Assuming $\mathbf{T} = \{ T_0, \dots, T_m \}$ and $\mathbf{p} = \{p_0, \dots, p_n\}$

$$J(\alpha^s = [\mathbf{T}, \mathbf{p}]) = \begin{pmatrix}
\frac{\partial k_x}{\partial T_0} & \dots & \frac{\partial k_x}{\partial T_m} & \dots & \frac{\partial k_x}{\partial p_0} & \dots & \frac{\partial k_x}{\partial p_n} \\
\frac{\partial k_y}{\partial T_0} & \dots & \frac{\partial k_y}{\partial T_m} & \dots & \frac{\partial k_y}{\partial p_0} & \dots & \frac{\partial k_y}{\partial p_n}\end{pmatrix}$$
where, for a single pose and a single point, $k(\alpha^s)= err(T, p): R^9 \to R^2$

It should be fairly apparent at this point that the jacobian can be cleanly divided into 2 sections: 
$$J(\alpha^s) = \begin{pmatrix} E & F \end{pmatrix}$$
where $E = \begin{pmatrix}
\frac{\partial k_x}{\partial T_0} & \dots & \frac{\partial k_x}{\partial T_m} \\
\frac{\partial k_y}{\partial T_0} & \dots & \frac{\partial k_y}{\partial T_m}\end{pmatrix}$ and $F = \begin{pmatrix}
\frac{\partial k_x}{\partial p_0} & \dots & \frac{\partial k_x}{\partial p_n} \\
\frac{\partial k_y}{\partial p_0} & \dots & \frac{\partial k_y}{\partial p_n}\end{pmatrix}$

Now in practice, this matrix is incrementally built as one can recall from the applications:
```c++
  const int nb_iter = 10;
  for (int i = 0; i < nb_iter; ++i) {
    // H of H_f(\alpha^k) \nabla \alpha
    // which is approximated as (J_f^T(\alpha^k)
    // J_f(\alpha^k)) = H
    Eigen::Matrix<double, 6, 6> H =
        Eigen::Matrix<double, 6, 6>::Zero();

    // -J_f^T(\alpha^k) f(\alpha^k) = b
    Vector6d b = Vector6d::Zero();

    error = 0.0;

    for (std::size_t k = 0; k < points_3d.size(); ++k) {
      //
      Eigen::Vector3d P_ = pose * points_3d[k];
      Eigen::Vector2d error_k =
          f(points_2d[k], P_);        // f(y_k, p_k);
      error += error_k.squaredNorm(); // reprojection error
      //
      /**
          j_f is the 2x6 jacobian matrix of the error
         function f derived for the left perturbation of the
         transformation matrix T (see camera motion ipynb
         for explanation of its contents).
         */
      Eigen::Matrix<double, 2, 3> df_p =
          get_f_delta_over_proj(P_);
      Eigen::Matrix<double, 3, 6> df_d =
          get_proj_delta_over_perturbation(P_);
      Eigen::Matrix<double, 2, 6> J_f = df_p * df_d;

      // using J_f we can approximate the H and grad_f
      H += J_f.transpose() * J_f;
      b += (-J_f.transpose()) * error_k;
    }

    // all the terms are accumulated now solve for \nabla
    // \alpha
    Vector6d nabla_alpha = solve_nabla_alpha(H, b);

    // test terminal conditions
    // numerical problems
    if (std::isnan(nabla_alpha[0])) {
      std::cout << "result is nan" << std::endl;
      break;
    }
    //
    // gradient problems
    if ((i > 0) && (error >= lastError)) {
      std::cout << "error: " << error
                << ", last error: " << lastError
                << ", iteration: " << i << std::endl;
      break;
    }
    //
    // perturb the pose matrix towards minimizing the
    // reprojection error
    pose = Sophus::SE3d::exp(nabla_alpha) * pose;
    lastError = error;
    // std::setprecision (from <iomanip>) sets the precision
    // without printing the previous one
    std::cout << "iteration " << i
              << " cost=" << std::setprecision(12)
              << error << std::endl;
    if (nabla_alpha.norm() < 0.000001) {
      // converge: see why in camera motion ipynb
      break;
    }
  }
```

Notice the line `H += J_f.transpose() * J_f;`.
Each point contributes its own jacobian, which is accumulated into the Hessian through the Gauss-Newton approximation $H \approx J^T J$.

Now theoretically, the residual of an observation depends only on its own pose $T_j$ and its own point $p_i$, so when we compute $k'(p_i, T_j)$ (which is a matrix), the derivatives with respect to all the other poses and landmarks/points are 0.
This makes $k'(\alpha^s)$ a very sparse matrix.
Our jacobian had the form $J = [E, F]$, so the Hessian, approximated as $H = J^T J$, has the form:
$H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = \begin{pmatrix} E^T E & E^T F \\ F^T E & F^T F \end{pmatrix}$
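
The block structure is easy to see on a toy problem. The sketch below builds a jacobian with the right sparsity pattern for a handful of poses and points and then forms $J^T J$. The nonzero blocks are filled with random placeholder numbers, since only the pattern matters here; `ba_jacobian` is a hypothetical helper.

```python
import numpy as np

def ba_jacobian(n_poses, n_points, observations, rng):
    """Stack one 2-row block per observation (pose j sees point i).

    Columns: 6 per pose first (the E part), then 3 per point (the F part).
    """
    J = np.zeros((2 * len(observations), 6 * n_poses + 3 * n_points))
    for row, (j, i) in enumerate(observations):
        r = 2 * row
        # Derivative with respect to the observing pose only
        J[r:r + 2, 6 * j:6 * j + 6] = rng.normal(size=(2, 6))
        # Derivative with respect to the observed point only
        c = 6 * n_poses + 3 * i
        J[r:r + 2, c:c + 3] = rng.normal(size=(2, 3))
    return J

rng = np.random.default_rng(0)
n_poses, n_points = 2, 3
observations = [(j, i) for j in range(n_poses) for i in range(n_points)]
J = ba_jacobian(n_poses, n_points, observations, rng)
H = J.T @ J

# Print the nonzero pattern of H: 1 where an entry is nonzero, 0 otherwise
print((np.abs(H) > 1e-12).astype(int))
```

In the printed pattern, $H_{11}$ (pose-pose) and $H_{22}$ (point-point) only have nonzero blocks on their diagonals, while $H_{12}$ and $H_{21}$ are filled wherever a pose observes a point.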


# Pose Graph Optimization

State estimation is a costly process, depending on how far back you want to remember and what you remember about the previous states.
When you have real-time constraints, you can't keep considering all the previous landmarks as we had seen in the Bundle Adjustment and Sparsity section.
There are mainly 2 ways to deal with this problem:
1. Keep N previous keyframes
2. Leave landmarks alone and optimize only for pose

The second approach is called optimizing a pose graph. 
Its main idea is to use the landmarks as constraints for optimizing the poses, without considering their residual function.

Vertices of the pose graph optimization problem are camera poses. Edges of the pose graph optimization problem are relative motion between two vertices.

Let $\mathbb{T} = \{T_1, T_2, \dots, T_i, \dots, T_n\}$ be the set of poses which are effectively vertices of the pose graph.
The relative motion from $T_i$ to $T_j$ would be $\delta T_{ij}$, or $T_{ij}$ for short.
Since $T$ is an element of $SE(3)$, whose group operation is matrix multiplication,
this relative motion can be defined as
$T_{ij} = T^{-1}_i T_j$

One can see that
$$T_i T_{ij} = T_i T^{-1}_i T_j$$
$$T_i T_{ij} = I T_j$$ where $I$ represents the identity matrix. 
From here it is easy to obtain $T_i T_{ij} = T_j$

Now assuming that we have a means to estimate $T_{ij}$ either through IMU or using optical flow, or some other way, we can use it as our error term, as in 
$$err_{ij} = T^{-1}_{ij} T^{-1}_i T_j$$
In theory the $T^{-1}_i T_j$ part equals $T_{ij}$, so the whole expression reduces to $I$.
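
A small numeric example may help here. The sketch below builds two poses, computes their relative motion, and evaluates the error term for a consistent and for a slightly wrong measurement. Poses are plain $4 \times 4$ matrices, and the helper names are invented for the example.

```python
import numpy as np

def make_pose(yaw, t):
    """4x4 pose from a rotation about the z axis and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """T_ij = T_i^{-1} T_j"""
    return np.linalg.inv(T_i) @ T_j

def pose_graph_error(T_ij_measured, T_i, T_j):
    """err_ij = T_ij^{-1} T_i^{-1} T_j"""
    return np.linalg.inv(T_ij_measured) @ np.linalg.inv(T_i) @ T_j

T_i = make_pose(0.2, [1.0, 0.0, 0.0])
T_j = make_pose(0.5, [2.0, 0.5, 0.0])
T_ij = relative_pose(T_i, T_j)

# A measurement that agrees with the poses gives the identity
print(np.round(pose_graph_error(T_ij, T_i, T_j), 6))

# A measurement that is off by a small translation does not
T_ij_noisy = T_ij @ make_pose(0.0, [0.05, 0.0, 0.0])
print(np.round(pose_graph_error(T_ij_noisy, T_i, T_j), 6))
```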

Similar to previous optimization problems, the way to minimize the error is to take the derivative of the residual/error function with respect to $T_i$ and to $T_j$ and set it to 0.

We know from the Lie-Group-Algebra notebook that we can take the derivative on $SE(3)$ through a left perturbation.
Applying a perturbation $\Delta_{T_i}$ on the left of the $T_i^{-1}$ factor, the perturbed error, whose limit for a vanishing perturbation gives $\frac{\partial err_{ij}}{\partial T_i}$, would go something like:
$T_{ij}^{-1} \Delta_{T_i} T_i^{-1} T_j$
where $\Delta_{T_i}$ is the perturbation term.

The key here is to first transform the $T$ in $SE(3)$ group to their Lie algebra $\mathbb{se}(3)$ form. 
In our equations let $\zeta$ represent the Lie algebra $\mathbb{se}(3)$ of $T$ in $SE(3)$.
The relationship between $\zeta$ and $T$ is
$T = \exp(\gamma(\zeta)) = \begin{pmatrix}
\exp(\lambda(\phi)) & J \rho \\ 0 & 1 
\end{pmatrix}$
Please see the Lie-Group-Algebra notebook for $\gamma$, $\lambda$, $\rho$, $\phi$, $J$.

The switch to the algebra is necessary for differentiation.
In this form the error function can be written as
$err_{ij} = \exp(\gamma(\zeta)_{ij})^{-1} \exp(\gamma(\zeta)_i)^{-1} \exp(\gamma(\zeta)_j)$

Since $\gamma(\zeta)$ is a matrix, each inverse reduces to negating the exponent, as in:
$\exp(\gamma(\zeta)_{ij})^{-1} = \exp(-\gamma(\zeta)_{ij})$ (Baker, 2006: p. 46)

Then the perturbed error can be written as:
$$T_{ij}^{-1} \Delta_{T_i} T_i^{-1} T_j =
    \exp(-\gamma(\zeta)_{ij})
    \exp(\gamma(\zeta)_{\Delta_{T_i}})
    \exp(-\gamma(\zeta)_{i})
    \exp(\gamma(\zeta)_j)$$

# References

[1] A. Baker, Matrix groups: an introduction to Lie group theory, 3. print. in Springer undergraduate mathematics series. London Berlin Heidelberg: Springer, 2006.
