# Chapter 1 - Pinhole Camera Model

A pinhole camera is a linear camera model: a camera with a single small aperture and, in its ideal form, no lens. Light rays pass through the aperture and project an inverted image onto the opposite side of the camera. It is convenient to think of a virtual image plane placed in front of the camera, which contains the upright image of the scene.

<img src="imgs/simplemodel.png" style="background : white">

A point $P$ in the 3-D world is imaged by the sensor as shown in the following figure, which uses the virtual image plane as the image plane.

<img src="imgs/pinhole_camera_model.png" style="background : white">

Where:

- $[X_w \ Y_w \ Z_w]$ are the coordinates of the point in the world reference system.
- $[X_c \ Y_c \ Z_c]$ are the coordinates of the point in the camera reference system.
- $[x \ y]$ are the coordinates of the point in the sensor plane, in [mm].
- $[u \ v]$ are the coordinates of the point in the image plane, in [pixels]. These are our primary source of information about the world, and we will use them to recover, at least, $[X_c \ Y_c \ Z_c]$.


## From 3-D to 2-D

### Transformation from World to Camera Coordinate Frame

The transformation from the <u>world coordinate</u> frame to the <u>camera coordinate</u> frame is a 3D to 3D rigid transformation. It is defined by the extrinsic parameters of the camera: the rotation matrix ($R$) and the translation vector ($t$). The rotation matrix encodes the orientation of the camera relative to the world frame, while the translation vector is the position of the world origin expressed in the camera frame (setting $X_w = Y_w = Z_w = 0$ in the equation below leaves only $t$). It is given by:

$$
\begin{bmatrix}
X_c \\\
Y_c \\\
Z_c
\end{bmatrix} = R_{3\times 3}
\begin{bmatrix}
X_w \\\
Y_w \\\
Z_w
\end{bmatrix} +
\begin{bmatrix}
t_x \\\
t_y \\\
t_z
\end{bmatrix}
$$

In homogeneous coordinates this becomes:

$$
\begin{bmatrix}
X_c \\
Y_c \\
Z_c \\
1
\end{bmatrix} = 
\underbrace{
    \begin{bmatrix}
        R_{3 \times 3} & t_{3\times 1} \\
        0_{1\times 3} & 1 
    \end{bmatrix}
}_{Extrinsic \ parameters \ T_{4\times 4}}
\begin{bmatrix}
X_w \\
Y_w \\
Z_w \\
1
\end{bmatrix}   \ \ \ (1)
$$
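
As a minimal sketch of equation (1), the snippet below builds $T$ from a rotation and a translation and checks that the homogeneous and non-homogeneous forms agree. The helper name `make_extrinsic` and the numeric pose are illustrative assumptions, not values from this chapter.

```python
import numpy as np

def make_extrinsic(R, t):
    """Stack a 3x3 rotation R and a 3-vector t into the 4x4 matrix T of equation (1)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical pose: 90 degree rotation about the Z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
T = make_extrinsic(R, t)

P_w = np.array([1.0, 0.0, 0.0])             # point in the world frame
P_c_direct = R @ P_w + t                    # non-homogeneous form
P_c_homog = (T @ np.append(P_w, 1.0))[:3]   # homogeneous form, equation (1)
print(P_c_direct, P_c_homog)                # both are [1. 3. 3.]
```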

### Transformation from camera coordinate frame to sensor plane

By similar triangles, the perspective projection from the <u>camera reference</u> system onto the <u>sensor plane</u>, for a focal length $f$, is:

$$
\frac{x}{X_c}=\frac{f}{Z_c} \rightarrow x = \frac{X_cf}{Z_c} \\
\frac{y}{Y_c}=\frac{f}{Z_c} \rightarrow y = \frac{Y_cf}{Z_c} 
$$
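
The similar-triangles relation can be sketched in a few lines. The function name `project_to_sensor`, the focal length, and the point below are hypothetical values chosen only for illustration. All lengths are assumed to be in millimetres.

```python
def project_to_sensor(P_c, f):
    """Perspective projection of a camera-frame point onto the sensor plane (units of f)."""
    X_c, Y_c, Z_c = P_c
    if Z_c <= 0:
        raise ValueError("the point must be in front of the camera (Z_c > 0)")
    return f * X_c / Z_c, f * Y_c / Z_c

# Hypothetical values: f = 4 mm, point 2000 mm in front of the camera.
x, y = project_to_sensor((500.0, 250.0, 2000.0), f=4.0)
print(x, y)  # 1.0 0.5
```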

### Transformation from sensor plane to image plane

$$
u = f_x\frac{X_c}{Z_c} + c_x\\
v = f_y\frac{Y_c}{Z_c} + c_y\\
$$

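Where $f_x = fm_x$ and $f_y = fm_y$; $m_x$ and $m_y$ are the pixel densities (pixels/mm) in the $x$ and $y$ directions, and $(c_x, c_y)$ is the principal point in pixels. These expressions follow from $u = m_x x + c_x$ and $v = m_y y + c_y$ after substituting $x$ and $y$ from the previous section.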
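
A small numeric sketch of this step, assuming a hypothetical sensor with 200 pixels/mm in both directions and a principal point at (320, 240). The function name `sensor_to_pixel` is illustrative.

```python
def sensor_to_pixel(x, y, m_x, m_y, c_x, c_y):
    """Map sensor-plane coordinates (mm) to image-plane coordinates (pixels)."""
    return m_x * x + c_x, m_y * y + c_y

# Hypothetical sensor parameters (assumptions, not from the text).
m_x, m_y, c_x, c_y = 200.0, 200.0, 320.0, 240.0
u, v = sensor_to_pixel(1.0, 0.5, m_x, m_y, c_x, c_y)
print(u, v)  # 520.0 340.0

# The same pixel computed directly from the camera-frame point, with f_x = f * m_x.
f = 4.0
X_c, Y_c, Z_c = 500.0, 250.0, 2000.0
u_direct = (f * m_x) * X_c / Z_c + c_x
v_direct = (f * m_y) * Y_c / Z_c + c_y
print(u_direct, v_direct)  # 520.0 340.0
```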

In homogeneous coordinates this becomes:

$$
\begin{bmatrix}
    u \\\
    v \\\
    1
\end{bmatrix} \equiv Z_c
\begin{bmatrix}
    u \\\
    v \\\
    1
\end{bmatrix} =
\begin{bmatrix}
    f_xX_c + c_xZ_c \\\
    f_yY_c + c_yZ_c \\\
    Z_c
\end{bmatrix} =
\underbrace{
    \begin{bmatrix}
        f_x & 0 & c_x & 0 \\\
        0 & f_y & c_y & 0 \\\
        0 & 0 & 1 & 0
    \end{bmatrix}
}_{Intrinsic \ parameters \ [K_{3\times 3}|0]}
\begin{bmatrix}
    X_c \\\
    Y_c \\\
    Z_c \\\
    1
\end{bmatrix}   \ \ \ (2)
$$
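
Equation (2) can be sketched with NumPy as follows. The intrinsic values are hypothetical, and the division by the third component is the step that removes the $Z_c$ scale factor.

```python
import numpy as np

# Hypothetical intrinsics (assumptions for illustration only).
f_x, f_y, c_x, c_y = 800.0, 800.0, 320.0, 240.0
K0 = np.array([[f_x, 0.0, c_x, 0.0],
               [0.0, f_y, c_y, 0.0],
               [0.0, 0.0, 1.0, 0.0]])    # the 3x4 matrix [K | 0]

P_c = np.array([0.5, 0.25, 2.0, 1.0])    # homogeneous point in the camera frame
uvw = K0 @ P_c                           # equals Z_c * [u, v, 1]
u, v = uvw[:2] / uvw[2]                  # divide by Z_c to get pixel coordinates
print(uvw, (u, v))
```

Here the third component of `uvw` is $Z_c = 2$, so the pixel coordinates come out as $(520, 340)$.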

The image-plane coordinates of a world point are obtained by combining equations (1) and (2):

$$Z_c
\begin{bmatrix}
u \\
v \\
1
\end{bmatrix}=
\underbrace{
        \underbrace{
            \begin{bmatrix}
                f_x & 0 & c_x & 0 \\
                0 & f_y & c_y & 0 \\
                0 & 0 & 1 & 0
            \end{bmatrix}
        }_{Intrinsic \ parameters \ [K_{3\times 3}|0]}
        \underbrace{
            \begin{bmatrix}
                R_{3 \times 3} & t_{3\times 1} \\
                0_{1 \times 3} & 1 
            \end{bmatrix}
        }_{Extrinsic \ parameters \ T_{4\times 4}}
}_{P_{3\times 4}}
\begin{bmatrix}
X_w \\
Y_w \\
Z_w \\
1
\end{bmatrix}  \ \ \ (3)
$$

The matrix $P_{3\times 4}$ is the <u>projection matrix</u>, called the <u>camera matrix</u> in older computer vision contexts. The intrinsic parameters $(f_x,f_y,c_x,c_y)$ make up the intrinsic matrix $K$; they can be <u>provided by the manufacturer</u> or found through the [calibration process from 3-D pattern](<Theory 1.2-CameraCalibration3DPattern.ipynb>), and are assumed to be known. The matrix $T$ is the so-called <u>extrinsic matrix</u>, or <u>camera pose</u>, in homogeneous form.

Good estimates of $K$ and $R$ can also be obtained through the [calibration process from 2-D pattern](<Theory 1.1-CameraCalibration2DPattern.ipynb>).
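
As a sketch of equation (3), the code below composes a projection matrix and projects a world point to pixels. The helper names `projection_matrix` and `project`, and all numeric values, are assumptions made for illustration. It relies on the identity $[K|0]\,T = K\,[R|t]$, which follows from multiplying the blocks.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 matrix P = [K | 0] @ T, which simplifies to K @ [R | t]."""
    return K @ np.hstack([R, np.reshape(t, (3, 1))])

def project(P, P_w):
    """Project a 3-D world point to pixel coordinates (u, v) with equation (3)."""
    uvw = P @ np.append(P_w, 1.0)   # Z_c * [u, v, 1]
    return uvw[:2] / uvw[2]

# Hypothetical camera: identity rotation, world origin 2 units in front of the camera.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

P = projection_matrix(K, R, t)
uv = project(P, np.array([0.5, 0.25, 0.0]))
print(uv)  # [520. 340.]
```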

### Important notes:

- Scaling the projection matrix $P$ by a nonzero factor amounts to scaling the world and the camera simultaneously, which does not change the image.
- $kP$ and $P$ (with $k \neq 0$) produce the same $\begin{bmatrix}    u \\\    v \\\    1\end{bmatrix}$
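
A short worked check of these notes, writing $\mathbf{X}$ for the homogeneous world point $[X_w \ Y_w \ Z_w \ 1]^T$ (a shorthand introduced only for this sketch). For any $k \neq 0$, equation (3) gives:

$$
(kP)\,\mathbf{X} = k\,(P\,\mathbf{X}) = kZ_c
\begin{bmatrix}
u \\
v \\
1
\end{bmatrix}
$$

Dividing by the third component, $kZ_c$, cancels $k$ and leaves the same $(u, v)$.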

Exercise 1 ) Given the camera matrix $C = \begin{bmatrix} 512 & -800 & 0 & 800 \\\ 512 & 0 & -800 & 1600 \\\ 1 & 0 & 0 & 0\end{bmatrix}$, compute the image-plane coordinates of the world point $[4, 0, 0]$.

### Camera pose ($T$) after displacement 

We can use equation (1) to transform a point expressed in the world frame into the camera frame before any displacement:

$$
\begin{bmatrix}
X_c \\
Y_c \\
Z_c \\
1
\end{bmatrix} = 
_{c_1}{\left[T\right]}_{w}
\begin{bmatrix}
X_w \\
Y_w \\
Z_w \\
1
\end{bmatrix} 
$$

- $_{c_1}{\left[T\right]}_{w}$ is the camera pose for position 1, before any displacement, also called camera 1.
- $_{c_2}{\left[T\right]}_{w}$ is the camera pose for position 2, after some displacement, also called camera 2.
- $_{c_3}{\left[T\right]}_{w}$ is the camera pose for position 3, after more displacement, also called camera 3.

To transform a 3D point expressed in the camera 2 frame to the camera 1 frame:

$$
_{c_1}{\left[T\right]}_{c_2} = _{c_1}{\left[T\right]}_{w} \cdot _{w}{\left[T\right]}_{c_2} = 
_{c_1}{\left[T\right]}_{w} \cdot {(_{c_2}{\left[T\right]}_{w})}^{-1} = 
\begin{bmatrix}
_{c_1}{\left[R\right]}_{w}  & _{c_1}{\left[t\right]}_{w}  \\
0_{1 \times 3} & 1
\end{bmatrix}  \cdot
\begin{bmatrix}
{(_{c_2}{\left[R\right]}_{w})}^T  & - {(_{c_2}{\left[R\right]}_{w})}^T \cdot  _{c_2}{\left[t\right]}_{w} \\
0_{1 \times 3} & 1
\end{bmatrix} =
$$

$$
\begin{bmatrix}
_{c_1}{\left[R\right]}_{w} \cdot {(_{c_2}{\left[R\right]}_{w})}^T  & - _{c_1}{\left[R\right]}_{w} \cdot {(_{c_2}{\left[R\right]}_{w})}^T \cdot  _{c_2}{\left[t\right]}_{w} + _{c_1}{\left[t\right]}_{w} \\
0_{1 \times 3} & 1
\end{bmatrix} =
\begin{bmatrix}
_{c_1}{\left[R\right]}_{c_2}  & _{c_1}{\left[t\right]}_{c_2}  \\
0_{1 \times 3} & 1
\end{bmatrix} 
$$

Thus:

$$
_{c_1}{\left[R\right]}_{c_2} = _{c_1}{\left[R\right]}_{w} \cdot {(_{c_2}{\left[R\right]}_{w})}^T  
$$
$$
_{c_1}{\left[t\right]}_{c_2} = - \underbrace{_{c_1}{\left[R\right]}_{w} \cdot {(_{c_2}{\left[R\right]}_{w})}^T}_{_{c_1}{\left[R\right]}_{c_2}} \cdot  _{c_2}{\left[t\right]}_{w} + _{c_1}{\left[t\right]}_{w}
$$
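
The closed form above can be checked numerically. The sketch below uses two hypothetical poses (rotations about the Z axis with arbitrary translations) and compares the closed-form $_{c_1}{\left[R\right]}_{c_2}$ and $_{c_1}{\left[t\right]}_{c_2}$ against a direct matrix inversion. The helper names `rot_z`, `pose`, and `relative_pose` are illustrative.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the Z axis."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def pose(R, t):
    """Assemble a 4x4 homogeneous pose from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_c1_w, T_c2_w):
    """Closed form for c1_T_c2: R = R1 R2^T and t = -R1 R2^T t2 + t1."""
    R1, t1 = T_c1_w[:3, :3], T_c1_w[:3, 3]
    R2, t2 = T_c2_w[:3, :3], T_c2_w[:3, 3]
    R_12 = R1 @ R2.T
    t_12 = -(R_12 @ t2) + t1
    return R_12, t_12

# Hypothetical poses for camera 1 and camera 2 (assumptions for illustration).
T_c1_w = pose(rot_z(30.0), [1.0, 2.0, 3.0])
T_c2_w = pose(rot_z(-45.0), [-2.0, 0.5, 4.0])

R_12, t_12 = relative_pose(T_c1_w, T_c2_w)
T_12 = T_c1_w @ np.linalg.inv(T_c2_w)      # direct computation of c1_T_c2

# A world point seen from both cameras: c1_T_c2 must map P_c2 onto P_c1.
P_w = np.array([0.3, -1.2, 5.0, 1.0])
P_c1 = T_c1_w @ P_w
P_c2 = T_c2_w @ P_w
print(np.allclose(T_12[:3, :3], R_12), np.allclose(T_12[:3, 3], t_12))
print(np.allclose(R_12 @ P_c2[:3] + t_12, P_c1[:3]))
```

Using the transpose instead of a general matrix inverse is valid only because rotation matrices are orthogonal.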



Exercise 2 ) Given the camera pose at positions 1 and 2: 

$$T_1 = \begin{bmatrix} 1 & 0 & 0 & 10 \\\ 0 & 1 & 0 & 20 \\\ 0 & 0 & 1 & 30 \\\ 0 & 0 & 0 & 1\end{bmatrix}$$
$$T_2 = \begin{bmatrix} -1 & 0 & 0 & 10 \\\ 0 & -1 & 0 & 10 \\\ 0 & 0 & 1 & 10 \\\ 0 & 0 & 0 & 1\end{bmatrix}$$

Compute the displacement between these positions.


### Solutions

#### Exercise 1 )

In [1]:
import numpy as np

C = np.array([[512, -800, 0,    800],
              [512, 0,    -800, 1600],
              [1,   0,    0,    0]])
Pw = np.array([4,0,0]) # Point in world reference

# Solution
def warp(C, Pw):
    Pw_homogeneus = np.hstack( (Pw, [1]) )  # array([4, 0, 0, 1])
    Pi_homogeneus = C @ Pw_homogeneus # Point in image reference = array([2848, 3648,    4])
    Pi_homogeneus = Pi_homogeneus/Pi_homogeneus[2] # array([712., 912.,   1.])
    u , v = Pi_homogeneus[0], Pi_homogeneus[1] # (712.0, 912.0)
    return u, v

print(f"Image plane point {warp(C, Pw)}")
# Scale C by a nonzero factor: the answer stays the same
C = 4*C
print(f"Image plane point {warp(C, Pw)}")

Image plane point (712.0, 912.0)
Image plane point (712.0, 912.0)


#### Exercise 2 )

In [21]:
import numpy as np

Tworld_to_1 = np.array([[1, 0, 0, 10],
                        [0, 1, 0, 20],
                        [0, 0, 1, 30],
                        [0, 0, 0, 1]])
Tworld_to_2 = np.array([[ -1, 0 , 0, 10],
                        [ 0 , -1, 0, 10],
                        [ 0 , 0 , 1, 10],
                        [ 0 , 0 , 0, 1]])

# Solution
def displacement(Tworld_to_1,Tworld_to_2):
    T1_to_world = np.linalg.inv(Tworld_to_1)
    T1_to_2 = Tworld_to_2 @ T1_to_world  # maps points from the camera 1 frame to the camera 2 frame
    R1_to_2 = T1_to_2[0:3,0:3]
    t1_to_2 = T1_to_2[0:3,3]
    return R1_to_2, t1_to_2
R, t = displacement(Tworld_to_1,Tworld_to_2)
print(f"Camera displacement:\n Rotation:\n{R}\n translation:\n {t}")

Camera displacement:
 Rotation:
[[-1.  0.  0.]
 [ 0. -1.  0.]
 [ 0.  0.  1.]]
 translation:
 [ 20.  30. -20.]
