# Camera Motion

Camera motion estimation means recovering the change in camera pose, an element of SE(3) (see Lie-Group-Algebra), from available information.
Three types of information can be used for estimating camera motion:

- Monocular camera case: We try to estimate the camera motion using sets of 2D points.
- Binocular camera or RGB-D camera case: We try to estimate the camera motion using sets of 3D points.
- Mixed case: We have a set of 3D points and their 2D projections. We try to estimate the camera pose using these two elements.

## Monocular camera

Estimating camera motion from a monocular camera requires some knowledge of epipolar geometry.

Let's define the problem. Given a spatial position $P = [x, y, z]^T$ and a sequence of images $I_1, I_2, ..., I_i, ..., I_n$,
the pixel position of $P$ in $I_i$ is $p_i$.
We know that $s_i p_i = K (R(i) P + t(i))$, where $R(i)$ is the rotation matrix and $t(i)$ is the translation vector at the $i$th point in time (written $R_i$ and $t_i$ below), $K$ is the intrinsic camera matrix, and $s_i$ is the depth of $P$ in the $i$th camera frame.
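As a small numerical illustration of this projection equation, here is a sketch with NumPy. The values of `K`, `R`, `t` and `P` are made-up assumptions, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical pose at time i: no rotation, small shift along x
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

# Spatial position P
P = np.array([0.5, -0.2, 4.0])

sp = K @ (R @ P + t)  # this is s_i * p_i
s = sp[2]             # the scale is the depth of P in the camera frame
p = sp / s            # homogeneous pixel position, approximately [395, 215, 1]

print(s, p)
```

Dividing by the last component is what turns the scaled vector into an actual pixel position.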

Now, in homogeneous coordinates a vector still represents the same point when it is multiplied by a non-zero constant, so in a homogeneous coordinate system $s_i v_i$ and $v_i$ are treated as equal. This type of equality is called *equal up to scale*.
We say, for example, that $s_i v_i$ is equal up to scale to $v_i$. Why are both considered equal?
They still express the same point. Take $A = [x = 2 \cdot 2, y = 2 \cdot 3, z = 2 \cdot 4, w = 2 \cdot 1]$: when we go from 4D to 3D with $A / w$, we get $[x = 2, y = 3, z = 4]$.
If we had not multiplied by $2$, we would still get the same result.
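The same example can be checked in a few lines. This is only a sketch; the helper name `dehomogenize` is an assumption introduced for illustration.

```python
import numpy as np

def dehomogenize(v):
    """Divide by the last coordinate and drop it (4D -> 3D, or 3D -> 2D)."""
    v = np.asarray(v, dtype=float)
    return v[:-1] / v[-1]

a = np.array([2.0, 3.0, 4.0, 1.0])
b = 2.0 * a  # the same point, multiplied by a non-zero constant

# Both vectors map to the same 3D point, so they are equal up to scale
print(np.allclose(dehomogenize(a), dehomogenize(b)))
```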

Let's express the equality as $s_i v_i \simeq v_i$.
The relationship then becomes $$p_i \simeq K (R(i) P + t(i))$$

Now, let's remove the intrinsic properties of the camera from the equation: $$x_i = K^{-1} p_i$$
Here, the sequential nature of the images matters, because it binds $x_{i-1}$ to $x_i$ through rotation and translation.
Remember that the spatial position of $P$ is the same; the only thing that changes is the camera pose, and the camera pose is described by a rotation and a translation in world coordinates.
Hence, with $R_i$ and $t_i$ now describing the motion from frame $i-1$ to frame $i$: $$x_i \simeq R_{i}x_{i-1} + t_{i}$$

If we left-multiply both sides by $x_i^T t_i\hat{}$, where $t_i\hat{}$ denotes the skew-symmetric matrix of $t_i$, we get an interesting result:
$$x_i^T t_i\hat{} x_i \simeq x_i^T t_i\hat{} R_i x_{i-1} + x_i^T t_i\hat{} t_i$$

The transposition in $x_i^T$ is actually necessary to make the matrix multiplication work on both sides.

Now $t_i\hat{} x_i$ is simply the cross product of $t_i$ and $x_i$ (since $t_i\hat{}$ is the skew-symmetric matrix of $t_i$).

Given that the cross product produces a vector that is orthogonal to both of its operands, the inner product of $x_i$ with the resulting vector is necessarily 0, so the left side vanishes: $$0 \simeq x_i^T t_i\hat{} R_i x_{i-1} + x_i^T t_i\hat{} t_i$$
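Both claims about the skew-symmetric matrix are easy to verify numerically. The sketch below assumes a helper called `hat` (a name introduced here, not taken from any library) and arbitrary example vectors.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ x equals np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

t = np.array([0.3, -0.1, 0.2])  # arbitrary example translation
x = np.array([0.5, 0.4, 1.0])   # arbitrary normalized image point

# The matrix product reproduces the cross product
same_as_cross = np.allclose(hat(t) @ x, np.cross(t, x))

# x is orthogonal to t x x, so x^T t^ x vanishes
orthogonal = np.isclose(x @ hat(t) @ x, 0.0)

print(same_as_cross, orthogonal)
```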

Here the $t_i\hat{} t_i$ is also 0, as can be seen from the snippet below:

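```python
import numpy as np

# Example translation vector t and its skew-symmetric matrix t_hat
t = np.array([1, 2, 3], dtype=float)
t_hat = np.array([[0, -3, 2],
                  [3, 0, -1],
                  [-2, 1, 0]], dtype=float)

# t_hat @ t is the cross product of t with itself, which is zero
print(t_hat.dot(t))
```

```
[0. 0. 0.]
```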


Then we are left with the following equation $$0 \simeq x_i^T t_i\hat{}R_i x_{i-1} + 0$$

Now, since the left side is zero and multiplying zero by any non-zero scalar changes nothing, there is no need to keep the $\simeq$ relation in the equation.
Hence we end up with: $$x_i^T t_i\hat{}R_i x_{i-1} = 0$$

This constraint is known as the epipolar constraint. 
If we add the pixel coordinates back to the equation, using $x_i = K^{-1} p_i$: $$p_i^T K^{-T}t_i\hat{}R_i K^{-1} p_{i-1} = 0$$

The matrix $t_i\hat{}R_i$ is called the *essential* matrix: $E = t_i\hat{}R_i$.
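The epipolar constraint can be checked on synthetic data. The following is a sketch under assumptions: the motion `R`, `t`, the intrinsics `K` and the 3D point are invented values, and `hat` and `rot_z` are helper names introduced only for this example.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rot_z(theta):
    """Rotation by theta radians around the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical motion from frame i-1 to frame i, and hypothetical intrinsics
R = rot_z(0.1)
t = np.array([0.2, 0.0, 0.05])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# The same spatial point expressed in both camera frames
P_prev = np.array([0.5, -0.3, 4.0])
P_cur = R @ P_prev + t

# Normalized image coordinates [u, v, 1]
x_prev = P_prev / P_prev[2]
x_cur = P_cur / P_cur[2]

# Essential matrix and the constraint in normalized coordinates
E = hat(t) @ R
residual_normalized = x_cur @ E @ x_prev

# The same constraint written with pixel coordinates
p_prev = K @ x_prev
p_cur = K @ x_cur
K_inv = np.linalg.inv(K)
residual_pixels = p_cur @ K_inv.T @ E @ K_inv @ p_prev

print(residual_normalized, residual_pixels)
```

Both residuals should be zero up to floating point rounding, whatever point is chosen in front of both cameras.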

Written out entry by entry, with $x_i = [u_i, v_i, 1]^T$ and $x_{i-1} = [u_{i-1}, v_{i-1}, 1]^T$, the epipolar constraint looks like the following:
$$[u_i, v_i, 1] \begin{pmatrix} e_1 & e_2 & e_3 \\ e_4 & e_5 & e_6 \\ e_7 & e_8 & e_9 \end{pmatrix} [u_{i-1}, v_{i-1}, 1]^T = 0$$  

If we actually do the multiplication, we end up with:
$$[u_i, v_i, 1] \begin{pmatrix} u_{i-1} e_1 + v_{i-1} e_2 + e_3 \\ u_{i-1} e_4 + v_{i-1} e_5 + e_6 \\ u_{i-1} e_7 + v_{i-1} e_8 + e_9 \end{pmatrix} = 0$$

$$u_i (u_{i-1} e_1 + v_{i-1} e_2 + e_3)
  + v_i (u_{i-1} e_4 + v_{i-1} e_5 + e_6)
  + u_{i-1} e_7 + v_{i-1} e_8 + e_9 = 0$$

$$u_i u_{i-1} e_1 + u_i v_{i-1} e_2 + u_i e_3
   + v_i u_{i-1} e_4 + v_i v_{i-1} e_5 + v_i e_6
   + u_{i-1} e_7 + v_{i-1} e_8 + e_9 = 0$$

$$[u_i u_{i-1}, u_i v_{i-1}, u_i, v_i u_{i-1}, v_i v_{i-1},
   v_i, u_{i-1}, v_{i-1}, 1] \cdot
   [e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9] = 0$$

Each pair of matched points therefore gives one linear equation in the nine entries of $E$.
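Since $E$ is only defined up to scale, stacking this dot product for at least eight matched pairs is enough to solve for the nine entries linearly; this is commonly known as the eight-point algorithm. The sketch below is a minimal version under assumptions: the data is synthetic and noise free, the null vector is taken from an SVD, and no step enforces the internal structure of an essential matrix. The function name `essential_from_matches` is introduced here for illustration.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_from_matches(x_cur, x_prev):
    """Linear estimate of E from N >= 8 matches.

    x_cur, x_prev: arrays of shape (N, 3) holding [u, v, 1] rows.
    """
    u1, v1 = x_cur[:, 0], x_cur[:, 1]
    u0, v0 = x_prev[:, 0], x_prev[:, 1]
    # One row per match, in the same order as the dot product in the text
    A = np.column_stack([u1 * u0, u1 * v0, u1,
                         v1 * u0, v1 * v0, v1,
                         u0, v0, np.ones_like(u0)])
    # The right singular vector of the smallest singular value solves A e = 0
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

# Synthetic scene: hypothetical motion and random points in front of the camera
rng = np.random.default_rng(0)
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.3, 0.05, 0.1])

P_prev = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(12, 3))
P_cur = P_prev @ R.T + t
x_prev = P_prev / P_prev[:, 2:3]
x_cur = P_cur / P_cur[:, 2:3]

E_est = essential_from_matches(x_cur, x_prev)
E_true = hat(t) @ R

# Compare up to scale and sign: the cosine between the two matrices should be 1
cosine = abs(np.sum(E_est * E_true)) / (np.linalg.norm(E_est) * np.linalg.norm(E_true))
print(cosine)
```

With noisy matches the estimate would only be approximate, and in practice the points are usually normalized before building the linear system.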


## Mixed case: Perspective n Point method

Using an RGB-D camera or a binocular camera, we have obtained the 3D positions of points in camera space, and consequently we also know their 2D projections.
The Perspective-n-Point (PnP) method concerns how to estimate the camera's pose (rotation and translation in world coordinates) from this information.

There are essentially two methods: direct linear transformation (DLT) and bundle adjustment (BA). Let's look at DLT first.

### Direct Linear Transformation (DLT)

We have a point in 3D space, $P = [x, y, z]^T$, written in homogeneous coordinates as $P = [x, y, z, 1]^T$.
The projection of $P$ is $p = [u, v, 1]^T$.
From 3D math we know that:
$$s [u, v, 1]^T = \begin{pmatrix} t_1 & t_2 & t_3 & t_4 \\ t_5 & t_6 & t_7 & t_8 \\ t_9 & t_{10} & t_{11} & t_{12} \end{pmatrix} [x, y, z, 1]^T$$

This matrix has 12 unknowns. Eliminating $s$ with the third row leaves two linear equations per pair of matching points, so given 6 pairs we have 12 equations, and the system can be solved linearly, for example using QR decomposition.
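To make the linear system concrete, here is a sketch of DLT on synthetic data. Note the assumptions: it uses an SVD to find the null vector instead of the QR decomposition mentioned above, it works with normalized image coordinates (so the 3x4 matrix is $[R | t]$ without $K$), and the function name `dlt_pose` and all numeric values are invented for the example.

```python
import numpy as np

def dlt_pose(points_3d, points_2d):
    """Linear estimate of the 3x4 matrix from N >= 6 pairs.

    points_3d: (N, 3) array of [x, y, z]
    points_2d: (N, 2) array of [u, v] in normalized image coordinates
    """
    n = points_3d.shape[0]
    P_h = np.hstack([points_3d, np.ones((n, 1))])  # homogeneous 3D points
    u = points_2d[:, 0:1]
    v = points_2d[:, 1:2]
    zeros = np.zeros_like(P_h)
    # Eliminating s: row1 . P - u * (row3 . P) = 0 and row2 . P - v * (row3 . P) = 0
    rows_u = np.hstack([P_h, zeros, -u * P_h])
    rows_v = np.hstack([zeros, P_h, -v * P_h])
    A = np.vstack([rows_u, rows_v])
    _, _, Vt = np.linalg.svd(A)
    T = Vt[-1].reshape(3, 4)
    # Fix the free scale: the rotation part of the third row has unit norm
    T = T / np.linalg.norm(T[2, :3])
    # Fix the sign: the first point must lie in front of the camera
    if (T @ P_h[0])[2] < 0:
        T = -T
    return T

# Synthetic test: hypothetical pose and 6 random points in front of the camera
rng = np.random.default_rng(0)
theta = 0.2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, -0.2, 0.5])
T_true = np.hstack([R, t[:, None]])

points = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(6, 3))
cam = points @ R.T + t
observations = cam[:, :2] / cam[:, 2:3]

T_est = dlt_pose(points, observations)
print(np.allclose(T_est, T_true, atol=1e-6))
```

With noisy observations the left 3x3 block of the result is generally not an exact rotation matrix, which is one reason a refinement step such as BA is useful afterwards.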

### Bundle Adjustment (BA)

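We have a point in 3D space, $P = [x, y, z]^T$, written in homogeneous coordinates as $P = [x, y, z, 1]^T$.
The projection of $P$ is $p = [u, v, 1]^T$.
From 3D math we know that:
$$s [u, v, 1]^T = K T [x, y, z, 1]^T$$
where $K$ is the intrinsic camera matrix and $T$ is the camera pose, an element of SE(3) (see Lie-Group-Algebra). Only the first three components of $T [x, y, z, 1]^T$ are passed on to $K$.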

During the calculation of $T$, we assume that, given some initial value for $T$, there will be an error between the
projected $p$ and the observed $p$.
This error comes from the fact that we don't actually know what $T$ is.
The main idea behind BA is to minimize this error using the least squares method to obtain a good enough $T$.

Formally, we try to minimize $$f_{loss}(T) = \frac{1}{2} \sum_{i=1}^{2} \left\| p_i - \frac{1}{s} (KTP)_i \right\|^2$$ where $p_i$ is either $u$ or $v$ and $(KTP)_i$ is the corresponding component of the projection.
Notice that neither $K$ nor $P$ changes as we change $T$.
Hence we try to find $T$ from the available $p$s by minimizing the least squares loss function above.
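The loss for a single point can be written down directly. This is a sketch under assumptions: $T$ is stored as a 4x4 matrix, the observation is the pixel pair $[u, v]$, and the function name `reprojection_loss` as well as all numeric values are made up for illustration.

```python
import numpy as np

def reprojection_loss(T, K, P, p_obs):
    """Half the squared distance between the observed and the projected pixel.

    T: 4x4 pose matrix, K: 3x3 intrinsics, P: (3,) point, p_obs: (2,) pixel [u, v].
    """
    P_h = np.append(P, 1.0)    # homogeneous 3D point
    sp = K @ (T @ P_h)[:3]     # s * [u, v, 1]
    p_proj = sp[:2] / sp[2]    # divide by s
    e = p_obs - p_proj         # reprojection error
    return 0.5 * float(e @ e)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical true pose: identity rotation with a small translation
T_true = np.eye(4)
T_true[:3, 3] = [0.1, -0.05, 0.2]

P = np.array([0.4, 0.3, 5.0])

# Simulate the observation with the true pose
sp = K @ (T_true @ np.append(P, 1.0))[:3]
p_obs = sp[:2] / sp[2]

loss_true = reprojection_loss(T_true, K, P, p_obs)      # zero at the true pose
loss_guess = reprojection_loss(np.eye(4), K, P, p_obs)  # positive for a wrong pose

print(loss_true, loss_guess)
```

With several points, the total loss would simply be the sum of this quantity over all of them.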

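The usual way to decrease a function, that is, to reach $f_{loss}(T + \Delta T) < f_{loss}(T)$, is to choose the update $\Delta T = -J(T)$: take the opposite of the first order derivative and add it to the argument, usually scaled by a small step size.

The Non-linear-Optimization notebook expands upon this idea.
The Gauss-Newton equation adapts to our problem as follows:
$$J(T)J^T(T) \Delta T = -J(T) e(T)$$
where $e(T) = p - \frac{1}{s} KTP$ is the reprojection error whose squared norm appears in $f_{loss}$, and $J(T)$ is the derivative of $e(T)$ with respect to $T$, arranged so that $J^T(T)$ has one row per component of $e(T)$.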
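To see the Gauss-Newton iteration at work, here is a sketch that refines a pose from several 3D-2D pairs by repeatedly solving $J J^T \Delta = -J e$, with $e$ the stacked reprojection errors. Several choices are assumptions made for brevity: the derivative is approximated by finite differences instead of an analytic Jacobian, the update is a 6-vector (a small rotation vector applied on the left of $R$, plus a translation increment), and all names and numeric values are invented.

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(w):
    """Rodrigues formula: rotation matrix of the rotation vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)
    a = hat(w / theta)
    return np.eye(3) + np.sin(theta) * a + (1.0 - np.cos(theta)) * (a @ a)

def residuals(R, t, K, points, pixels):
    """Stacked errors p - (1/s) K (R P + t) for all points, shape (2N,)."""
    cam = points @ R.T + t
    proj = cam @ K.T
    return (pixels - proj[:, :2] / proj[:, 2:3]).ravel()

def perturb(R, t, delta):
    """Apply an update delta = [rotation vector (3), translation increment (3)]."""
    return exp_so3(delta[:3]) @ R, t + delta[3:]

def gauss_newton_pose(R, t, K, points, pixels, iterations=20, eps=1e-6):
    for _ in range(iterations):
        e = residuals(R, t, K, points, pixels)
        # Finite-difference derivative of e, shape (2N, 6); this plays the role of J^T
        Jt = np.zeros((e.size, 6))
        for k in range(6):
            d = np.zeros(6)
            d[k] = eps
            Jt[:, k] = (residuals(*perturb(R, t, d), K, points, pixels) - e) / eps
        # Gauss-Newton step: (J J^T) delta = -J e
        delta = np.linalg.solve(Jt.T @ Jt, -Jt.T @ e)
        R, t = perturb(R, t, delta)
        if np.linalg.norm(delta) < 1e-10:
            break
    return R, t

# Synthetic test with a hypothetical true pose
rng = np.random.default_rng(1)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R_true = exp_so3(np.array([0.05, -0.1, 0.02]))
t_true = np.array([0.2, -0.1, 0.3])

points = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(10, 3))
cam = points @ R_true.T + t_true
proj = cam @ K.T
pixels = proj[:, :2] / proj[:, 2:3]

# Start from the identity pose and refine
R_est, t_est = gauss_newton_pose(np.eye(3), np.zeros(3), K, points, pixels)
print(np.allclose(R_est, R_true, atol=1e-6), np.allclose(t_est, t_true, atol=1e-6))
```

Plain Gauss-Newton has no safeguard against bad steps, so with a poor initial value or noisy data a damped variant may behave better.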