# Computer Vision: Camera Models
## 1. Introduction

The camera is a fundamental tool in computer vision, allowing us to record and utilize images for various applications. To understand how cameras work, we need to model their behavior. 

## 2. Pinhole Cameras

### Figure 1: A simple working camera model - the pinhole camera model
![Figure 1](images/figure1_camera.png)

Let's begin by designing a simple camera system. Imagine placing a barrier with a small aperture between a 3D object and a photographic film or sensor. As shown in Figure 1, each point on the 3D object emits rays of light. The barrier allows only a few of these rays to pass through the aperture, creating a one-to-one mapping between 3D points and the film. This basic model is known as the pinhole camera model.

### Figure 2: Formal construction of the pinhole camera model
![Figure 2](images/figure2_camera.png)

A more formal depiction of the pinhole camera model is shown in Figure 2. The film is the image or retinal plane, the aperture is the pinhole O, and the distance between the image plane and pinhole O is the focal length f. The projection of the object on the image plane and the image in the virtual image plane are identical up to a scale transformation.

To formalize the projection, let $P = [x \ y \ z]^T$ be a 3D point. P is projected onto the image plane Π', resulting in point $P' = [x' \ y']^T$. The pinhole itself can be projected onto the image plane, creating point C'.

We define a camera reference system i j k centered at the pinhole O, where k is perpendicular to the image plane. The line from C' to O is the optical axis of the camera.

Using similar triangles, we find that the relationship between 3D point P and image plane point P' is:
$$P' = [x' \ y']^T = [fx \ fy]^T / z$$
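The projection above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full camera model: the function name `project_pinhole` and the sample values are assumptions, and the points are taken to be expressed in the camera reference system with $z > 0$.

```python
import numpy as np

def project_pinhole(points, f):
    """Project N x 3 camera-frame points with x' = f x / z, y' = f y / z."""
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    if np.any(z <= 0):
        raise ValueError("points must lie in front of the camera (z > 0)")
    return np.stack([f * x / z, f * y / z], axis=1)

# Two points on the same ray through the pinhole map to the same image point.
pts = np.array([[1.0, 2.0, 4.0],
                [2.0, 4.0, 8.0]])
image_pts = project_pinhole(pts, f=2.0)
print(image_pts)
```

Both sample points land on the same image location, which is why depth cannot be recovered from a single projected point.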

### Aperture Size

In the pinhole model, we assume the aperture is a single point. However, real-world apertures are not infinitely small. Varying the aperture size affects the image.

### Figure 3: Effects of aperture size
![Figure 3](images/figure3_camera.png)

Increasing the aperture size allows more light rays, causing blurring due to rays from multiple 3D points affecting each film point. A smaller aperture results in crisper but darker images. The pinhole model presents a trade-off between image sharpness and brightness.

This leads to the question: can we design cameras that capture both crisp and bright images?


## 3. Cameras and Lenses

Modern cameras address the trade-off between crispness and brightness using lenses, which focus or disperse light. Replacing the pinhole with a properly placed and sized lens ensures that all light rays emitted by a point P converge to a single point P' in the image plane (Figure 4). However, this property holds only for points at a specific distance from the lens. Points that are closer or farther away project to a blurred spot rather than a single point, so objects at those distances appear out of focus.

### Figure 4: Lens model setup
![Figure 4](images/figure4_camera.png)

Lenses also focus light rays parallel to the optical axis into a focal point (Figure 5). The distance between the focal point and the lens center is the focal length f. Light rays passing through the lens center are not deviated. This lens-based model, known as the paraxial refraction model, relates 3D points to their corresponding image plane points.

### Figure 5: Lens focusing
![Figure 5](images/figure5_camera.png)

Radial distortion is a common aberration in this model: the image magnification changes with the distance from the optical axis. In pincushion distortion the magnification increases with that distance, while in barrel distortion it decreases. The distortion arises because different portions of the lens have different focal lengths.

### Figure 6: Radial distortion effects
![Figure 6](images/figure6_camera.png)

## 4. Going to Digital Image Space

In this section, we'll explore the parameters we need to consider when modeling the projection from 3D space to digital images. While the results will be derived using the pinhole model, they apply to the paraxial refraction model as well.

### Homogeneous Coordinates

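To address the nonlinearity of the projection from 3D point P to 2D point P' (caused by the division by $z$), we introduce homogeneous coordinates. Extending 3D Euclidean coordinates to 4D homogeneous coordinates by appending a "1", and representing the image point $P'$ as a homogeneous 3-vector, lets us write the mapping in linear matrix-vector form:

$$ P' = K \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} $$

Here $K$ is the 3x3 matrix of intrinsic parameters and $\begin{bmatrix} I & 0 \end{bmatrix}$ is the 3x3 identity with a zero column appended. The Euclidean image coordinates are recovered by dividing the first two components of $P'$ by the third.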

### The Camera Matrix Model and Intrinsic Parameters

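The camera matrix model captures the essential parameters affecting the mapping from a 3D point to image coordinates. The matrix $K$ consists of the intrinsic parameters. The offsets $c_x$ and $c_y$ account for the translation between image plane coordinates, whose origin is at the image center, and digital image coordinates, whose origin is typically at a corner of the image. The factors $k$ and $l$ convert units of length on the image plane into pixels along each axis; combined with the focal length they give $\alpha = fk$ and $\beta = fl$. $K$ also accounts for skewness, represented by the angle $\theta$ between the two image axes ($\theta = 90^\circ$ for an unskewed sensor):

$$ K = \begin{bmatrix} \alpha & -\alpha \cot\theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix} $$

$K$ therefore has 5 degrees of freedom. Lens distortion is not captured by $K$; it is treated separately in Section 6.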
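As an illustration, the sketch below builds an intrinsic matrix and projects a camera-frame point to pixel coordinates. It assumes the common parameterization with $\alpha = fk$, $\beta = fl$, and skew angle $\theta$ (with $\theta = \pi/2$ meaning no skew); the function names and numeric values are hypothetical.

```python
import numpy as np

def intrinsic_matrix(f, k, l, cx, cy, theta=np.pi / 2):
    """Assumed form: alpha = f*k, beta = f*l, skew angle theta (pi/2 = no skew)."""
    alpha, beta = f * k, f * l
    return np.array([
        [alpha, -alpha / np.tan(theta), cx],
        [0.0, beta / np.sin(theta), cy],
        [0.0, 0.0, 1.0],
    ])

def project(K, P):
    """Map a 3D camera-frame point to pixel coordinates via K [I 0] P_h."""
    P_h = np.append(np.asarray(P, dtype=float), 1.0)  # homogeneous 4-vector
    I0 = np.hstack([np.eye(3), np.zeros((3, 1))])     # 3 x 4 matrix [I 0]
    p_h = K @ I0 @ P_h                                # homogeneous image point
    return p_h[:2] / p_h[2]                           # divide by the third component

# Hypothetical camera: 50 mm focal length, 10000 pixels per metre, 640x480 image.
K = intrinsic_matrix(f=0.05, k=10000.0, l=10000.0, cx=320.0, cy=240.0)
pixel = project(K, [0.1, -0.2, 2.0])
print(pixel)
```

The division in the last step of `project` is where the nonlinearity of perspective projection reappears; everything before it is a plain matrix product.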

#### Figure 7: Camera matrix model
![Figure 7](images/figure7_camera.png)

### Extrinsic Parameters

While the intrinsic parameters handle the transformation within the camera reference system, the extrinsic parameters account for transformations between different reference systems. These parameters include a rotation matrix $ R $ and a translation vector $ T $ that map a 3D point $ P_w $ in a world reference system to its corresponding camera coordinates $ P $.
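A minimal sketch of this change of reference system follows, assuming the convention $P = R P_w + T$ (equivalently $P = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} P_w$ in homogeneous coordinates). The function name and the example pose are hypothetical.

```python
import numpy as np

def world_to_camera(P_w, R, T):
    """Map a world point to camera coordinates: P = R P_w + T."""
    return R @ np.asarray(P_w, dtype=float) + T

# Hypothetical extrinsics: a 90-degree rotation about the z axis
# followed by a shift of 5 units along z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([0.0, 0.0, 5.0])
P = world_to_camera([1.0, 0.0, 0.0], R, T)

# The same mapping as a single 4 x 4 matrix acting on homogeneous coordinates.
E = np.eye(4)
E[:3, :3] = R
E[:3, 3] = T
P_h = E @ np.array([1.0, 0.0, 0.0, 1.0])
print(P, P_h)
```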

#### Figure 7.1: Extrinsic parameters
![Figure 7.1](images/intrinsic_extrinsic.png)  
Source: [mphy0026](https://mphy0026.readthedocs.io/en/latest/calibration/camera_calibration.html)

### Full Projection Matrix $ M $

Combining the intrinsic and extrinsic parameters, we arrive at the full 3x4 projection matrix $M = K \begin{bmatrix} R & T \end{bmatrix}$, which maps a homogeneous world point $P_w$ to homogeneous image coordinates $P' = M P_w$. $M$ has 11 degrees of freedom: 5 from the intrinsic matrix $K$, 3 from the rotation $R$, and 3 from the translation $T$.
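The composition can be sketched as follows, assuming $M = K \begin{bmatrix} R & T \end{bmatrix}$. The helper names and the camera values are hypothetical. The last lines also show why a 3x4 matrix with 12 entries has only 11 degrees of freedom: scaling $M$ does not change the projected pixel.

```python
import numpy as np

def projection_matrix(K, R, T):
    """Assemble the 3 x 4 matrix M = K [R T]."""
    return K @ np.hstack([R, np.reshape(T, (3, 1))])

def project_world_point(M, P_w):
    """Project a 3D world point to pixel coordinates."""
    p_h = M @ np.append(np.asarray(P_w, dtype=float), 1.0)
    return p_h[:2] / p_h[2]

# Hypothetical camera looking down the world z axis from 4 units away.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
T = np.array([0.0, 0.0, 4.0])
M = projection_matrix(K, R, T)

pixel = project_world_point(M, [0.4, -0.2, 0.0])
pixel_scaled = project_world_point(3.0 * M, [0.4, -0.2, 0.0])  # same pixel
print(pixel, pixel_scaled)
```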

## 5. Camera Calibration

To accurately understand the transformation from the 3D world to digital images, we need prior knowledge of the camera's intrinsic parameters. When we don't have these parameters for an arbitrary camera, we can deduce them from images. This problem of estimating intrinsic and extrinsic camera parameters is known as camera calibration.

### Calibration Rig and Correspondences

Camera calibration involves solving for the intrinsic matrix $ K $ and extrinsic parameters $ R $ and $ T $ in the projection $ p_i = M P_i = K \begin{bmatrix} R & T \end{bmatrix} P_i $. A calibration rig, like the one depicted in Figure 7, helps define a world reference frame and known world points $ P_1, ..., P_n $. From images taken by the camera, corresponding points $ p_1, ..., p_n $ are obtained.

### Linear System of Equations

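A linear system of equations is formulated using these correspondences. Writing the rows of $ M $ as $ m_1, m_2, m_3 $, each correspondence $ p_i = [u_i \ v_i]^T $ satisfies $ u_i = \frac{m_1 P_i}{m_3 P_i} $ and $ v_i = \frac{m_2 P_i}{m_3 P_i} $, which gives two equations that are linear in the entries of $ M $:

$$ u_i (m_3 P_i) - m_1 P_i = 0, \qquad v_i (m_3 P_i) - m_2 P_i = 0 $$

Since $ M $ has 11 unknowns, at least 6 correspondences are required. In practice more are used, resulting in a linear system with more equations than unknowns. Stacking the $ 2n $ equations, with $ \mathbf{m} $ the 12-vector of the entries of $ M $ and $ P $ the $ 2n \times 12 $ matrix of coefficients, we can express the system in matrix-vector form:

$$ P \mathbf{m} = \mathbf{0} $$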

### Minimization for Overdetermined System

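For systems with more equations than unknowns there is generally no exact nonzero solution, so we instead minimize $ \| P \mathbf{m} \|^2 $ subject to the constraint $ \| \mathbf{m} \|^2 = 1 $, which rules out the trivial solution $ \mathbf{m} = \mathbf{0} $. Singular Value Decomposition (SVD) is employed for this purpose. The minimizer is the last column of the matrix $ V $ from the SVD of matrix $ P $, that is, the right singular vector associated with the smallest singular value. Reshaping $ \mathbf{m} $ into three rows of four entries gives $ M $ up to scale.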
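The procedure can be sketched with NumPy as below. This is an unnormalized, noise-free illustration under the assumptions stated in the comments, not a production calibration routine. The function name `calibrate_dlt` and the synthetic camera are hypothetical, and the coefficient matrix is called `A` in the code to avoid clashing with the point names.

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Estimate the 3 x 4 projection matrix M (up to scale) from n >= 6 correspondences."""
    world_pts = np.asarray(world_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    n = world_pts.shape[0]
    if n < 6:
        raise ValueError("at least 6 correspondences are needed for the 11 unknowns")
    P_h = np.hstack([world_pts, np.ones((n, 1))])  # homogeneous world points, n x 4
    zeros = np.zeros(4)
    rows = []
    for P_i, (u, v) in zip(P_h, image_pts):
        rows.append(np.concatenate([P_i, zeros, -u * P_i]))  # m1.P_i - u_i (m3.P_i) = 0
        rows.append(np.concatenate([zeros, P_i, -v * P_i]))  # m2.P_i - v_i (m3.P_i) = 0
    A = np.vstack(rows)                                      # 2n x 12 coefficient matrix
    _, _, Vt = np.linalg.svd(A)
    m = Vt[-1]  # right singular vector of the smallest singular value, unit norm
    return m.reshape(3, 4)

# Synthetic check with a hypothetical camera and random non-coplanar points.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
M_true = K @ np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
world = rng.uniform(-1.0, 1.0, size=(8, 3))
proj = np.hstack([world, np.ones((8, 1))]) @ M_true.T
image = proj[:, :2] / proj[:, 2:3]

M_est = calibrate_dlt(world, image)
M_est = M_est * (M_true[2, 3] / M_est[2, 3])  # remove the unknown scale before comparing
print(np.abs(M_est - M_true).max())
```

With real, noisy measurements the recovered matrix only approximately reproduces the observations, and normalizing the point coordinates before building the system is a common way to improve conditioning.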

### Finding Intrinsic and Extrinsic Parameters

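After obtaining the camera matrix $ M $ through SVD, we know it only up to scale, so we express the true values of $ M $ in terms of a scaling parameter $ \rho $. Writing the rows of $ R $ as $ \mathbf{r}_1^T, \mathbf{r}_2^T, \mathbf{r}_3^T $ and $ T = [t_x \ t_y \ t_z]^T $, the extrinsic and intrinsic parameters are related to the solved camera matrix $ M $ by:

$$ M = \frac{1}{\rho} \begin{bmatrix} \alpha \mathbf{r}_1^T - \alpha \cot(\theta) \mathbf{r}_2^T + c_x \mathbf{r}_3^T & \alpha t_x - \alpha \cot(\theta) t_y + c_x t_z \\ \frac{\beta}{\sin(\theta)} \mathbf{r}_2^T + c_y \mathbf{r}_3^T & \frac{\beta}{\sin(\theta)} t_y + c_y t_z \\ \mathbf{r}_3^T & t_z \end{bmatrix} $$

Matching the entries of the estimated $ M $ against this form yields $ \rho $ first, because $ \mathbf{r}_3 $ has unit norm, and then the remaining parameters.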
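One way to read the parameters off an estimated $M$ is sketched below. It assumes the standard parameterization $\rho M = K \begin{bmatrix} R & T \end{bmatrix}$ with $K = \begin{bmatrix} \alpha & -\alpha\cot\theta & c_x \\ 0 & \beta/\sin\theta & c_y \\ 0 & 0 & 1 \end{bmatrix}$, and writes $M = [A \ \ b]$ with $a_1, a_2, a_3$ the rows of $A$. The function name and the rule used to pick the sign of $\rho$ (the scene lies in front of the camera, $t_z > 0$) are assumptions of this sketch.

```python
import numpy as np

def decompose_projection(M):
    """Recover K, R, T from M = (1/rho) K [R T], assuming t_z > 0."""
    A, b = M[:, :3], M[:, 3]
    a1, a2, a3 = A
    rho = 1.0 / np.linalg.norm(a3)        # because |r3| = 1
    if rho * b[2] < 0:                    # assumption: pick the sign giving t_z > 0
        rho = -rho
    cx = rho**2 * (a1 @ a3)
    cy = rho**2 * (a2 @ a3)
    c13, c23 = np.cross(a1, a3), np.cross(a2, a3)
    cos_theta = -(c13 @ c23) / (np.linalg.norm(c13) * np.linalg.norm(c23))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    alpha = rho**2 * np.linalg.norm(c13) * np.sin(theta)
    beta = rho**2 * np.linalg.norm(c23) * np.sin(theta)
    r1 = c23 / np.linalg.norm(c23)
    r3 = rho * a3
    r2 = np.cross(r3, r1)
    K = np.array([[alpha, -alpha / np.tan(theta), cx],
                  [0.0, beta / np.sin(theta), cy],
                  [0.0, 0.0, 1.0]])
    R = np.vstack([r1, r2, r3])
    T = rho * np.linalg.solve(K, b)       # T = rho K^{-1} b
    return K, R, T

# Round trip on a hypothetical camera with slight skew and an arbitrary scale.
theta_true = 1.5
K_true = np.array([[500.0, -500.0 / np.tan(theta_true), 320.0],
                   [0.0, 480.0 / np.sin(theta_true), 240.0],
                   [0.0, 0.0, 1.0]])
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
T_true = np.array([0.1, -0.2, 5.0])
M = -2.5 * K_true @ np.hstack([R_true, T_true[:, None]])

K, R, T = decompose_projection(M)
print(np.round(K, 3))
```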

### Degenerate Configurations

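It's important to note that not all sets of correspondences lead to a solvable system. In degenerate configurations the system does not determine $ M $ uniquely. The simplest example is when all the points $ P_i $ lie on a single plane. More generally, degenerate configurations occur when the points lie on the intersection curve of two quadric surfaces.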



## 6. Handling Distortion in Camera Calibration

Up until now, we've been working with ideal lenses that don't introduce distortion. However, real lenses can lead to deviations from rectilinear projection, necessitating more advanced calibration methods. In this section, we provide a brief introduction to handling distortions.

### Isotropic Radial Distortion

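Real-world lens distortions are often radially symmetric due to the lens's physical symmetry. We model radial distortion with an isotropic transformation that scales both image coordinates of a point by the same factor $ 1/\lambda_i $, which depends on the distance from the image center. This transformation isn't linear, but because the same factor applies to both coordinates, it cancels in the ratio between $ u_i $ and $ v_i $:

$$ \frac{u_i}{v_i} = \frac{\frac{1}{\lambda_i} \frac{m_1 P_i}{m_3 P_i}}{\frac{1}{\lambda_i} \frac{m_2 P_i}{m_3 P_i}} = \frac{m_1 P_i}{m_2 P_i} $$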

### Nonlinear Optimization for Distortion

To handle radial distortion, we need to solve a system of nonlinear equations. This nonlinearity arises from the distortion model. While we can't solve this system using the linear methods discussed earlier, we can use nonlinear optimization techniques.

The ratio between $ u_i $ and $ v_i $, however, yields one equation per correspondence that is linear in $ m_1 $ and $ m_2 $, namely $ v_i (m_1 P_i) - u_i (m_2 P_i) = 0 $. Assuming $ n $ correspondences, we set up a system of equations in matrix form:

$$ \begin{bmatrix}
v_1 P_1^T & -u_1 P_1^T \\
\vdots & \vdots \\
v_n P_n^T & -u_n P_n^T \\
\end{bmatrix} \begin{bmatrix}
m_1^T \\
m_2^T
\end{bmatrix} = \begin{bmatrix}
0 \\
\vdots \\
0
\end{bmatrix} $$
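As a rough sketch, a homogeneous system built from the ratio constraint $v_i (m_1 P_i) - u_i (m_2 P_i) = 0$ can be solved for $m_1$ and $m_2$ with SVD even when the observations are radially distorted. The function name, the synthetic camera, and the particular distortion factor used to generate test data are assumptions for illustration; image coordinates are assumed to be measured from the distortion center.

```python
import numpy as np

def estimate_m1_m2(world_pts, image_pts):
    """Solve v_i (m1.P_i) - u_i (m2.P_i) = 0 for m1, m2 (up to a common scale)."""
    world_pts = np.asarray(world_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    n = world_pts.shape[0]
    if n < 7:
        raise ValueError("at least 7 correspondences are needed for 8 unknowns up to scale")
    P_h = np.hstack([world_pts, np.ones((n, 1))])  # n x 4 homogeneous points
    u, v = image_pts[:, 0:1], image_pts[:, 1:2]
    L = np.hstack([v * P_h, -u * P_h])             # n x 8, one row per correspondence
    _, _, Vt = np.linalg.svd(L)
    sol = Vt[-1]                                   # unit-norm minimizer of ||L n||
    return sol[:4], sol[4:]

# Synthetic data: ideal projections shrunk by a hypothetical radial factor 1/lambda_i.
rng = np.random.default_rng(1)
M_true = np.array([[500.0, 0.0, 0.0, 50.0],
                   [0.0, 500.0, 0.0, -50.0],
                   [0.0, 0.0, 1.0, 5.0]])
world = rng.uniform(-1.0, 1.0, size=(12, 3))
proj = np.hstack([world, np.ones((12, 1))]) @ M_true.T
ideal = proj[:, :2] / proj[:, 2:3]
lam = 1.0 + 1e-6 * np.sum(ideal**2, axis=1, keepdims=True)
distorted = ideal / lam

m1, m2 = estimate_m1_m2(world, distorted)
scale = M_true[0, 0] / m1[0]                       # fix the unknown common scale
print(np.round(scale * m1, 3), np.round(scale * m2, 3))
```

The distortion factor never enters `estimate_m1_m2`, which reflects the fact that it cancels in the ratio of the two coordinates.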

### Simplifying the Problem

Because the ratio between $ u_i $ and $ v_i $ is unaffected by isotropic radial distortion, $ m_1 $ and $ m_2 $ can be estimated from this homogeneous system with SVD, as in the distortion-free case. Only $ m_3 $ and the distortion parameters then remain to be found by nonlinear optimization, which makes the problem considerably more manageable than the original distortion problem.


## 7. Appendix A: Rigid Transformations

This appendix provides an overview of the basic transformations commonly used in computer graphics and computer vision. Strictly speaking, rigid transformations consist of rotations and translations, which preserve distances; scaling is closely related and fits into the same matrix framework, but it is not rigid. In this section, we'll cover rotations and translations in the context of 3D space, as they are essential concepts in this field.

### Rotations in 3D Space

Rotating a point in 3D space can be achieved by rotating around each of the three coordinate axes separately. Common convention dictates that rotations occur in a counter-clockwise direction. One way to represent rotations is using Euler angles, which describe how much a point rotates around each degree of freedom. However, using Euler angles can lead to singularities known as gimbal lock.

To avoid gimbal lock and provide a more general representation, we use rotation matrices. A rotation matrix is a square, orthogonal matrix with a determinant of one. Given a rotation matrix $ R $ and a vector $ v $, the resulting vector $ v' $ can be computed as:

$$ v' = Rv $$

Rotation matrices allow us to represent rotations around the $ x $, $ y $, and $ z $ axes as follows:

$$ R_x(\alpha) = \begin{bmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
0 & \sin(\alpha) & \cos(\alpha)
\end{bmatrix} $$

$$ R_y(\beta) = \begin{bmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
-\sin(\beta) & 0 & \cos(\beta)
\end{bmatrix} $$

$$ R_z(\gamma) = \begin{bmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
0 & 0 & 1
\end{bmatrix} $$

These rotations are often combined by matrix multiplication, where the order matters.
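The three matrices above translate directly into code. The sketch below (function names are hypothetical) also checks the defining properties of a rotation matrix and shows that swapping the order of two rotations changes the result.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

v = np.array([1.0, 0.0, 0.0])
quarter = np.pi / 2

# Rotate about x first, then about z (the rightmost matrix is applied first).
x_then_z = rot_z(quarter) @ rot_x(quarter) @ v
# Rotate about z first, then about x.
z_then_x = rot_x(quarter) @ rot_z(quarter) @ v
print(np.round(x_then_z, 6), np.round(z_then_x, 6))

# Any product of rotations is orthogonal with determinant one.
R = rot_z(0.3) @ rot_y(-0.7) @ rot_x(1.1)
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```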

### Translations in 3D Space

Translations describe the movement or displacement in a particular direction. In 3D space, a translation is represented by a vector $ t $ with three components: $ t_x $, $ t_y $, and $ t_z $. If a point $ P $ is translated by $ t $ to a new point $ P' $, the operation can be written as:

$$ P' = P + t $$

In matrix form, translations can be represented using homogeneous coordinates. The translation matrix is given by:

$$ T = \begin{bmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z \\
0 & 0 & 0 & 1
\end{bmatrix} $$

Using this matrix, the translation operation becomes $ P' = TP $, where $ P' $ is the translated point.
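A minimal sketch of the homogeneous translation, with a hypothetical helper name and sample values:

```python
import numpy as np

def translation_matrix(t):
    """Build the 4 x 4 homogeneous translation matrix for t = (t_x, t_y, t_z)."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

P = np.array([1.0, 2.0, 3.0, 1.0])          # the point (1, 2, 3) in homogeneous form
T = translation_matrix([0.5, -1.0, 2.0])
P_new = T @ P                               # same result as P + t, with a trailing 1
print(P_new)
```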

### Combining Transformations

Transformations like rotation, translation, and scaling can be combined by matrix multiplication to create more complex transformations. By using homogeneous coordinates, these transformations can be efficiently represented and applied using matrix-vector multiplication.

It's important to note that these types of transformations are examples of affine transformations. Projective transformations occur when the final row of the transformation matrix is not $[0 \, 0 \, 0 \, 1]$.
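To illustrate, the sketch below composes a rotation and a translation as 4x4 matrices in both orders. The helper name `homogeneous` and the sample values are assumptions. Both products keep the final row $[0 \, 0 \, 0 \, 1]$, so they remain affine, but they move the same point to different places.

```python
import numpy as np

def homogeneous(R=None, t=None):
    """Embed an optional 3 x 3 rotation and 3-vector translation in a 4 x 4 matrix."""
    H = np.eye(4)
    if R is not None:
        H[:3, :3] = R
    if t is not None:
        H[:3, 3] = t
    return H

# 90-degree rotation about z and a unit shift along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
rotate = homogeneous(R=Rz)
translate = homogeneous(t=[1.0, 0.0, 0.0])

P = np.array([1.0, 0.0, 0.0, 1.0])
rotate_then_translate = translate @ rotate @ P   # R P + t
translate_then_rotate = rotate @ translate @ P   # R (P + t)
print(rotate_then_translate, translate_then_rotate)
```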

This concludes the overview of basic rigid transformations in 3D space.


## 8. Appendix B: Different Camera Models

This appendix introduces two alternative camera models: the weak perspective model and the orthographic projection model. These models provide simplified representations of the camera's projection process under certain assumptions.

### Weak Perspective Model

The weak perspective model simplifies the camera projection process by using orthogonal projection onto a reference plane and then projecting from the reference plane to the image plane using a projective transformation. This model is suitable when deviations in depth from the reference plane are small compared to the camera's distance. The steps involved in the weak perspective model are as follows:

1. Given a reference plane Π at a distance $ z_o $ from the camera center, points $ P, Q, R $ are first orthogonally projected onto the reference plane, resulting in points $ P', Q', R' $.
2. These points $ P', Q', R' $ are then projected to the image plane using a projective transformation, resulting in points $ p', q', r' $.

![Figure 8: The weak perspective model: orthogonal projection onto reference plane](images/figure8_camera.png)

In this model, because the depth deviations are small, the projection onto the reference plane reduces the transformation to a simple, constant magnification. The magnification is equal to the focal length $ f' $ divided by $ z_o $, resulting in:

$$x' = \frac{f'x}{z_o}, \quad y' = \frac{f'y}{z_o} $$

The projection matrix $ M $ is simplified as well:

$$ M = \begin{bmatrix}
A & b \\
0 & 1
\end{bmatrix} $$
![Figure 9: The weak perspective model: projection onto the image plane](images/figure9_camera.png)
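The approximation can be compared numerically with full perspective projection. In this sketch the function names, the focal length, and the sample points are assumptions; the points are chosen so that their depth varies by about 1% around the reference plane at $z_o$.

```python
import numpy as np

def perspective(points, f):
    """Full perspective projection: x' = f x / z, y' = f y / z."""
    return f * points[:, :2] / points[:, 2:3]

def weak_perspective(points, f, z0):
    """Weak perspective: constant magnification f / z0 for every point."""
    return (f / z0) * points[:, :2]

points = np.array([[0.5, 0.2, 9.9],
                   [-0.3, 0.4, 10.1],
                   [0.1, -0.1, 10.0]])
f, z0 = 1.0, 10.0

diff = np.abs(perspective(points, f) - weak_perspective(points, f, z0))
print(diff.max())
```

The two projections agree exactly for the point lying on the reference plane, and the discrepancy for the others grows with their depth offset from $z_o$.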

### Orthographic Projection Model

The orthographic projection model takes this simplification further. In this model, the optical center is at infinity, and the projection rays are perpendicular to the retinal plane. This results in a projection that ignores depth, making the orthographic model suitable for cases where depth information is not critical. The projections in the orthographic model are defined as:

$$ x' = x, \quad y' = y $$

This model is often used in fields like architecture and industrial design.

![Figure 10: The orthographic projection model](images/figure10_camera.png)

The weak perspective model and the orthographic projection model provide simpler mathematical representations of the camera projection process, sacrificing precision for ease of computation. These models are particularly effective when objects are small and distant from the camera.


## Conclusion

In this comprehensive guide to camera projection models and calibration, we've explored the fundamental principles that underlie the transformation of the three-dimensional world into two-dimensional images. From understanding the basics of pinhole cameras to diving into more complex lens-based models, we've covered a wide range of concepts that form the foundation of computer vision, photography, and graphics.

We started by delving into the pinhole camera model, which laid the groundwork for understanding how light rays travel from the three-dimensional space to the two-dimensional image plane. This simple model highlighted the trade-offs between brightness and crispness, leading us to consider the importance of aperture size and its impact on image quality.

As we progressed, we introduced lenses as powerful tools to mitigate the conflict between brightness and crispness. Lenses, with their ability to focus or disperse light, paved the way for more sophisticated models that could account for various complexities in real-world camera systems. We explored the paraxial refraction model, which allowed us to relate 3D points to their corresponding points in the image plane, and discussed aberrations like radial distortion that arise from using lenses.

The transition from 3D space to digital image space brought us to camera calibration, a crucial step to precisely determine intrinsic and extrinsic camera parameters. We learned how to derive the camera matrix model, which encapsulates essential camera characteristics, and how to solve for intrinsic and extrinsic parameters using calibration rigs and known correspondences. We also touched on the challenges of handling distortion in camera calibration and how to address them using non-linear optimization techniques.

In the appendices, we reviewed rigid transformations and explored two alternative camera models: the weak perspective model and the orthographic projection model. These simplified models provided insights into scenarios where computational simplicity outweighs the need for high precision, such as distant objects or certain design applications.

As we conclude this guide, we've gained a deep understanding of camera projection models and calibration processes. These concepts serve as the backbone for various applications, from computer graphics and virtual reality to robotics and medical imaging. Armed with this knowledge, we can appreciate the intricacies of how cameras capture the world around us and use it to create innovative solutions in various domains.

### References
Figures 1, 2, 3, 4, 5, 6, 7, 8, 9, 10: Stanford CS231A notes on camera models