Seminar by Nikita Ligostaev

# Augmented reality on your knees 👨‍💻

---

## Projective transformation theory

### What are homogeneous coordinates?

Most of the time when working with 3D, we are thinking in terms of Euclidean geometry – that is, coordinates in three-dimensional space ($x, y$, and $z$). However, there are certain situations where it is useful to think in terms of projective geometry instead. Projective geometry has an extra dimension, called $w$, in addition to the $x, y$, and $z$ dimensions. This four-dimensional space is called `projective space`, and coordinates in projective space are called `homogeneous coordinates`.

**The main convenience is that everything can be expressed by a linear operation.**

In homogeneous coordinates, a point in three-dimensional space is represented by four numbers $(x,y,z,w)^*$, where $w$ is a weighting factor. In Cartesian coordinates it corresponds to a point:
$$\begin{gather}  
\left(\frac {x}{w}, \frac {y}{w}, \frac {z}{w}\right) \notag{}
\end{gather}$$

\* - Quaternions look a lot like homogeneous coordinates. Both are 4D vectors, commonly depicted as $(x,y,z,w)$. However, quaternions and homogeneous coordinates are different concepts, with different uses.
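The conversion between the two representations can be sketched in a few lines. This is a minimal illustration that assumes NumPy is available; the helper names `to_homogeneous` and `to_cartesian` are made up for this example.

```python
import numpy as np

def to_homogeneous(point, w=1.0):
    """Scale the Cartesian coordinates by w and append w as the last component."""
    point = np.asarray(point, dtype=float)
    return np.append(point * w, w)

def to_cartesian(h):
    """Divide by the last component w (must be non-zero) and drop it."""
    h = np.asarray(h, dtype=float)
    return h[:-1] / h[-1]

# Two different homogeneous vectors that describe the same Cartesian point.
p = to_cartesian([2.0, 4.0, 6.0, 2.0])   # (x/w, y/w, z/w) with w = 2
q = to_cartesian([1.0, 2.0, 3.0, 1.0])   # same point with w = 1
```

Both `p` and `q` evaluate to the point $(1, 2, 3)$, which shows that homogeneous coordinates are only defined up to a common scale factor.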

**2D analogy**

Imagine a projector that is projecting a 2D image onto a screen. It’s easy to identify the $x_s$ and $y_s$ dimensions of the projected image. Moreover, if you step back from the 2D image and look at the projector and the screen, you can see the $w$ dimension.

**The $w$ dimension is the distance from the projector to the screen.**

<div style="text-align: center;">
    <img src="./utils/2d_analogy.png" width="60%" height="60%">
</div>

So what does the $w$ dimension do, exactly? If you move the projector closer to the screen, the whole 2D image becomes smaller. If you move the projector away from the screen, the 2D image becomes larger. 

**As you can see, the value of $w$ affects the size (scale) of the image.**

<div style="text-align: center;">
    <img src="./utils/scale.png" width="100%" height="100%">
</div>

**Applying to 3D**

There is no such thing as a 3D projector (yet), so it’s harder to imagine projective geometry in 3D, but the $w$ value works exactly the same as it does in 2D. When $w$ increases, the coordinate expands (scales up). When $w$ decreases, the coordinate shrinks (scales down). **The $w$ is basically a scaling transformation for the 3D coordinate.**

The usual advice for 3D programming beginners is to always set $w=1$. Scaling a coordinate by 1 neither shrinks nor grows it, so when $w=1$ the $x, y$, and $z$ values are unaffected.
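The claim that everything becomes a linear operation can be illustrated with translation, which is not a matrix product in plain 3D coordinates but is one in homogeneous coordinates. The sketch below assumes NumPy; `translation_matrix` is a hypothetical helper written for this example.

```python
import numpy as np

def translation_matrix(tx, ty, tz):
    """4x4 matrix that translates a homogeneous point by (tx, ty, tz)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

point = np.array([1.0, 2.0, 3.0, 1.0])          # Cartesian (1, 2, 3) with w = 1
moved = translation_matrix(10.0, 0.0, -1.0) @ point
```

Here `moved` is $(11, 2, 2, 1)$: the point was shifted by a single matrix multiplication, and $w$ stayed equal to 1.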

## Step-by-step projective transformation

$$\begin{gather}    
    \text{Image in pixels} \quad \xleftarrow{\text{Intrinsics } \mathbf{K}} \quad \text{Camera} \quad \xleftarrow{\text{Extrinsics } [\mathbf{R}|t]} \quad \text{World point} \notag{}
\end{gather}$$

<div style="text-align: center;">
    <img src="./utils/pinhole.png" width="100%" height="100%">
</div>


$$\begin{gather}  
    \mathbb{R}^3 \rightarrow \mathbb{R}^2 \notag{}\\
    p \sim P\\
    \frac{x_s}{x_c} = \frac{y_s}{y_c} = \frac{f}{z_c} = w
\end{gather}$$

**Transformation from camera coordinates to image coordinates in the pinhole camera model**

From the similar triangles $Ocp$ and $OP^{\prime}P$, projected onto the $X_c Z_c$ and $Y_c Z_c$ planes (first and second line, respectively):
$$\begin{gather}    
    \frac{z_c}{f} = \frac{x_c}{x_s} \quad \Rightarrow \quad x_s = f\frac{x_c}{z_c} \tag{3}\\
    \frac{z_c}{f} = \frac{y_c}{y_s} \quad \Rightarrow \quad y_s = f\frac{y_c}{z_c} \notag{}\\
\end{gather}$$

The principle of triangle similarity relates homogeneous and Cartesian coordinates:
$$\begin{gather}  
    \begin{bmatrix}
        x_s\\
        y_s\\
        1
    \end{bmatrix} = \frac{1}{z_c}
    \begin{bmatrix}
        f & 0 & 0\\
        0 & f & 0\\
        0 & 0 & 1
    \end{bmatrix} 
    \begin{bmatrix}
        x_c\\
        y_c\\
        z_c
    \end{bmatrix} \tag{4}
\end{gather}$$

$$\begin{gather} 
    \begin{bmatrix}
        x_s\\
        y_s\\
        1
    \end{bmatrix}
    \sim \begin{bmatrix}
        f & 0 & 0\\
        0 & f & 0\\
        0 & 0 & 1
    \end{bmatrix} 
    \begin{bmatrix}
        x_c\\
        y_c\\
        z_c
    \end{bmatrix} \tag{5}
\end{gather}$$
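Equations (3)-(5) can be checked numerically. The sketch below assumes NumPy; the focal length and the camera-space point are arbitrary illustrative values, not measurements.

```python
import numpy as np

f = 2.0                                  # focal length (arbitrary units, illustrative)
P_c = np.array([1.0, 0.5, 4.0])          # point (x_c, y_c, z_c) in camera coordinates

M = np.diag([f, f, 1.0])                 # the matrix from equations (4) and (5)
p_h = M @ P_c                            # homogeneous image point (f*x_c, f*y_c, z_c)
p_s = p_h / p_h[2]                       # divide by z_c to get (x_s, y_s, 1)

# The same result directly from equation (3).
x_s = f * P_c[0] / P_c[2]
y_s = f * P_c[1] / P_c[2]
```

The division by the last component is exactly the step that the $\sim$ sign in (5) leaves implicit.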

Ideally, the optical axis of the camera ($Z_c$) passes through the origin of the image coordinates. In practice the intersection, called the principal point $(c_x, c_y)$, is offset from the origin, for example if the sensor has shifted or the lens is distorted. A slight skew $\gamma$ between the $x$ and $y$ axes of the camera sensor is also possible. This extends $\mathbf{K}$ as follows; equations (8) and (9) assume camera coordinates normalized by depth ($z_c = 1$):
$$\begin{gather} 
    \mathbf{K} = \begin{bmatrix}
        f & 0 & 0\\
        0 & f & 0\\
        0 & 0 & 1
    \end{bmatrix} \quad \Rightarrow \quad \mathbf{K} = \begin{bmatrix}
        f_x & \gamma & c_x\\
        0 & f_y & c_y\\
        0 & 0 & 1
    \end{bmatrix} \tag{7}\\
    \begin{bmatrix}
        x_s\\
        y_s\\
        1
    \end{bmatrix} = 
    \begin{bmatrix}
        f_x & \gamma & c_x\\
        0 & f_y & c_y\\
        0 & 0 & 1
    \end{bmatrix} 
    \begin{bmatrix}
        x_c\\
        y_c\\
        1
    \end{bmatrix} \tag{8}\\
    x_s = f_x x_c + \gamma y_c + c_x \tag{9}\\
    y_s = f_y y_c + c_y \notag{}
\end{gather}$$
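A small numerical check of equation (9), assuming NumPy and camera coordinates that are already normalized by depth ($z_c = 1$). The intrinsic values are invented for illustration and do not come from a real calibration.

```python
import numpy as np

# Illustrative intrinsics: focal lengths in pixels, small skew, principal point.
f_x, f_y, gamma, c_x, c_y = 800.0, 810.0, 0.5, 320.0, 240.0
K = np.array([[f_x, gamma, c_x],
              [0.0, f_y,   c_y],
              [0.0, 0.0,   1.0]])

x_c, y_c = 0.1, -0.2                          # normalized camera coordinates
x_s, y_s, w = K @ np.array([x_c, y_c, 1.0])   # matrix form

# Scalar form of equation (9) for comparison.
x_s_scalar = f_x * x_c + gamma * y_c + c_x
y_s_scalar = f_y * y_c + c_y
```

Because the input has a third component of 1, the output also has $w = 1$ and no further division is needed.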

**Camera intrinsics** [(demo)](https://ksimek.github.io/2013/08/13/intrinsic/)
$$\begin{align}  
    \mathbf{K} = \begin{bmatrix}
        f_x & \gamma & c_x\\
        0 & f_y & c_y \\
        0 & 0 & 1
    \end{bmatrix} = 
    \underbrace{\begin{bmatrix}
        1 & 0 & c_x\\
        0 & 1 & c_y \\
        0 & 0 & 1
    \end{bmatrix}}_{\text{2D translation}}
    \underbrace{\begin{bmatrix}
        f_x & 0 & 0\\
        0 & f_y & 0 \\
        0 & 0 & 1
    \end{bmatrix}}_{\text{2D scaling}} 
    \underbrace{\begin{bmatrix}
        1 & \frac{\gamma}{f_x} & 0\\
        0 & 1 & 0 \\
        0 & 0 & 1
    \end{bmatrix}}_{\text{2D shear}} \tag{10}
\end{align}$$
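The factorization of $\mathbf{K}$ into translation, scaling, and shear can be verified by multiplying the three factors, using a pure scaling matrix $\mathrm{diag}(f_x, f_y, 1)$. The sketch assumes NumPy, and the numbers are illustrative only.

```python
import numpy as np

f_x, f_y, gamma, c_x, c_y = 800.0, 810.0, 0.5, 320.0, 240.0

translation = np.array([[1.0, 0.0, c_x],
                        [0.0, 1.0, c_y],
                        [0.0, 0.0, 1.0]])
scaling = np.diag([f_x, f_y, 1.0])            # no offsets in the scaling factor
shear = np.array([[1.0, gamma / f_x, 0.0],
                  [0.0, 1.0,         0.0],
                  [0.0, 0.0,         1.0]])

K_product = translation @ scaling @ shear
K_direct = np.array([[f_x, gamma, c_x],
                     [0.0, f_y,   c_y],
                     [0.0, 0.0,   1.0]])
```

The order matters: the shear is applied first, then the scaling, then the translation to the principal point.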

**Camera extrinsics** [(demo)](https://ksimek.github.io/2012/08/22/extrinsic/)
$$\begin{gather}      
    \begin{bmatrix}
        \hat{x_c}\\
        \hat{y_c}\\
        \hat{z_c}
    \end{bmatrix} = \begin{bmatrix}
        \mathbf{R} | t 
    \end{bmatrix} \begin{bmatrix}
        x_w\\
        y_w\\
        z_w\\
        1
    \end{bmatrix} = \begin{bmatrix}
        r_{11} & r_{12} & r_{13} & t_1\\
        r_{21} & r_{22} & r_{23} & t_2\\
        r_{31} & r_{32} & r_{33} & t_3
    \end{bmatrix} \begin{bmatrix}
        x_w\\
        y_w\\
        z_w\\
        1
    \end{bmatrix} = \begin{bmatrix}
        r_{11} x_w + r_{12} y_w + r_{13} z_w + t_1\\
        r_{21} x_w + r_{22} y_w + r_{23} z_w + t_2\\
        r_{31} x_w + r_{32} y_w + r_{33} z_w + t_3
    \end{bmatrix} = \mathbf{R} \begin{bmatrix}
        x_w\\
        y_w\\
        z_w
    \end{bmatrix} + t \tag{11}\\
    x_c = \frac{\hat{x_c}}{\hat{z_c}} \tag{12}\\
    y_c = \frac{\hat{y_c}}{\hat{z_c}} \notag{}
\end{gather}$$
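Equations (11) and (12) in code, as a sketch that assumes NumPy. The rotation (90 degrees about the $Z$ axis) and the translation are arbitrary example values.

```python
import numpy as np

# Example extrinsics: rotate 90 degrees about Z, then shift 5 units along Z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.0, 0.0, 5.0])
Rt = np.hstack([R, t.reshape(3, 1)])          # the 3x4 matrix [R | t]

P_w = np.array([1.0, 2.0, 3.0, 1.0])          # homogeneous world point
P_c_hat = Rt @ P_w                            # equation (11)

# Equation (12): normalize by depth.
x_c = P_c_hat[0] / P_c_hat[2]
y_c = P_c_hat[1] / P_c_hat[2]
```

The product `Rt @ P_w` gives the same result as `R @ P_w[:3] + t`, which is the last equality in (11).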

**3D world to 2D screen mapping**
$$\begin{gather}      
    \begin{bmatrix}
        \hat{x_s}\\
        \hat{y_s}\\
        \hat{z_s}
    \end{bmatrix} = \mathbf{K} \begin{bmatrix}
        \mathbf{R} | t 
    \end{bmatrix} \begin{bmatrix}
        x_w\\
        y_w\\
        z_w\\
        1
    \end{bmatrix} = \begin{bmatrix}
        f_x & \gamma & c_x\\
        0 & f_y & c_y \\
        0 & 0 & 1
    \end{bmatrix} \begin{bmatrix}
        \hat{x_c}\\
        \hat{y_c}\\
        \hat{z_c}
    \end{bmatrix} = \begin{bmatrix}
        f_x \hat{x_c} + \gamma \hat{y_c} + c_x \hat{z_c}\\
        f_y \hat{y_c} + c_y \hat{z_c} \\
        \hat{z_c}
    \end{bmatrix} = 
    \hat{z_c} \begin{bmatrix}
        \frac{f_x \hat{x_c}}{\hat{z_c}} + \frac{\gamma \hat{y_c}}{\hat{z_c}} + c_x \\
        \frac{f_y \hat{y_c}}{\hat{z_c}} + c_y \\
        1
    \end{bmatrix} \tag{13}\\
    x_s = \frac{\hat{x_s}}{\hat{z_s}} = \frac{f_x \hat{x_c} + \gamma \hat{y_c} + c_x \hat{z_c}}{\hat{z_c}} \tag{14}\\
    y_s = \frac{\hat{y_s}}{\hat{z_s}} = \frac{f_y \hat{y_c} + c_y \hat{z_c}}{\hat{z_c}} \notag{}
\end{gather}$$

Finally:
$$\begin{gather}  
    x_s = \frac{f_x \hat{x_c} + \gamma \hat{y_c} + c_x \hat{z_c}}{\hat{z_c}} = f_x x_c + \gamma y_c + c_x \tag{15}\\
    y_s = \frac{f_y \hat{y_c} + c_y \hat{z_c}}{\hat{z_c}} = f_y y_c + c_y \notag{}
\end{gather}$$
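The whole chain from a world point to pixel coordinates fits into one short function. This is a sketch under the assumption that NumPy is available; `project` is a hypothetical name, and the camera parameters are made up. OpenCV's `cv2.projectPoints` performs this kind of mapping as well (with a rotation vector instead of a matrix and with optional distortion), but it is not used here.

```python
import numpy as np

def project(K, R, t, P_w):
    """Map a 3D world point to pixel coordinates, following (11)-(15)."""
    P_c_hat = R @ np.asarray(P_w, dtype=float) + t     # (11): world -> camera
    x_c, y_c = P_c_hat[:2] / P_c_hat[2]                # (12): normalize by depth
    x_s, y_s, _ = K @ np.array([x_c, y_c, 1.0])        # (15): camera -> pixels
    return np.array([x_s, y_s])

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                       # camera axes aligned with world axes
t = np.array([0.0, 0.0, 5.0])       # world origin is 5 units in front of the camera

pixel = project(K, R, t, [0.5, -0.25, 0.0])
center = project(K, R, t, [0.0, 0.0, 0.0])
```

A point on the optical axis lands on the principal point, which is a convenient sanity check for any implementation.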

**Lens/camera distortion**

Radial distortion:
$$\begin{gather}  
    r^2=x_c^2+y_c^2  \tag{16}\\
    L_r(x_c,y_c)=\left(1+k_1r^2+k_2r^4+k_3r^6\right)
    \begin{bmatrix} 
    x_c\\
    y_c
    \end{bmatrix} \tag{17}
\end{gather}$$

Tangential distortion:
$$\begin{gather}
    L_t(x_c,y_c)=
    \begin{bmatrix} 
    2p_1x_cy_c+p_2(r^2+2x_c^2)\\
    p_1(r^2+2y_c^2)+2p_2x_cy_c
    \end{bmatrix} \tag{18}\\
\end{gather}$$

Combined lens distortion:
$$\begin{gather} 
    L(x_c,y_c)=L_r(x_c,y_c)+L_t(x_c,y_c) \tag{19}
\end{gather}$$

$$\begin{gather}
    \begin{bmatrix} 
    x_d\\
    y_d
    \end{bmatrix} = L(x_c,y_c) \tag{20}
\end{gather}$$

Substituting the distorted coordinates (20) into (15) gives the pixel coordinates with lens distortion taken into account:
$$\begin{gather} 
    x_s = f_x x_d + \gamma y_d + c_x \tag{21}\\
    y_s = f_y y_d + c_y \notag{}
\end{gather}$$ 
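The distortion model (16)-(21) in plain Python, as a sketch. It assumes the radial factor multiplies the point $(x_c, y_c)$ and that the coordinates are normalized; the function names and coefficient values are illustrative only.

```python
def distort(x_c, y_c, k=(0.0, 0.0, 0.0), p=(0.0, 0.0)):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion, eqs. (16)-(20)."""
    k1, k2, k3 = k
    p1, p2 = p
    r2 = x_c ** 2 + y_c ** 2                                  # (16)
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3      # scalar factor in (17)
    x_t = 2.0 * p1 * x_c * y_c + p2 * (r2 + 2.0 * x_c ** 2)   # (18), first row
    y_t = p1 * (r2 + 2.0 * y_c ** 2) + 2.0 * p2 * x_c * y_c   # (18), second row
    return radial * x_c + x_t, radial * y_c + y_t             # (19), (20)

def to_pixels(x_d, y_d, f_x, f_y, gamma, c_x, c_y):
    """Equation (21): distorted normalized coordinates to pixel coordinates."""
    return f_x * x_d + gamma * y_d + c_x, f_y * y_d + c_y

undistorted = distort(0.3, -0.4)                       # all coefficients zero
radial_only = distort(1.0, 0.0, k=(0.1, 0.0, 0.0))
tangential_only = distort(1.0, 1.0, p=(0.01, 0.02))
pixel = to_pixels(*radial_only, 800.0, 800.0, 0.0, 320.0, 240.0)
```

With all coefficients set to zero the point is returned unchanged, so the model reduces to the ideal pinhole camera.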

---

# Problem 1. Camera calibration (15 pts.)

<div style="text-align: center;">
    <img src="./utils/zhang.png" alt="Computer" width="100" height="100">
</div>

### Task 1. Prepare data

1. Open the pattern [generator](https://calib.io/pages/camera-calibration-pattern-generator) on your phone.
2. Generate a checkerboard pattern that fits on your phone's screen (e.g. 5 rows and 8 columns) **(8 pts.)**
3. Record the calibration pattern shown on your phone with a web camera **(2 pts.)**

**Rules for recording:** 10-20 images are enough; none of them should be blurry; the whole pattern has to be visible in every frame; try to cover the entire image area across the set of images.
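As preparation for the calibration step, the 3D coordinates of the checkerboard corners can be generated ahead of time. The sketch below assumes NumPy, assumes that the generator's rows and columns count squares (so a 5 x 8 board has 4 x 7 inner corners, which is what OpenCV's chessboard detection expects as the pattern size), and uses a made-up square size. It is not the content of `2_camera_calibration.py`.

```python
import numpy as np

rows_squares, cols_squares = 5, 8          # pattern as generated (assumed to count squares)
square_size = 0.01                         # side of one square in metres (hypothetical)

# Inner corners: one fewer than the number of squares along each side.
inner_rows, inner_cols = rows_squares - 1, cols_squares - 1

# One 3D point per inner corner, all lying in the plane z = 0 of the board.
object_points = np.zeros((inner_rows * inner_cols, 3), np.float32)
grid = np.mgrid[0:inner_cols, 0:inner_rows].T.reshape(-1, 2)   # x changes fastest
object_points[:, :2] = grid * square_size
```

The same array is typically reused for every calibration image, because the board itself does not change between shots.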

### Task 2. Do calibration

1. Fill in the blanks in `2_camera_calibration.py` and run the calibration **(5 pts.)**.
2. You are breathtaking!

<div style="text-align: center;">
    <img src="./utils/keanu.png" width="500" height="300">
</div>

# Problem 2. ArUco marker detection (15 pts.)

<div style="text-align: center;">
    <img src="./utils/4x4_1000-20.svg" alt="Computer" width="100" height="100">
</div>

### Task 1. Prepare marker

1. Open the ArUco [generator](https://chev.me/arucogen/?ysclid=m7ksx70pa2128997178) on your phone.
2. Choose any marker you like from the standard dictionaries (e.g. 4x4, with ID 50).
3. Keep the phone with the marker on screen close to the camera. 

### Task 2. Detect marker

1. Detect the ArUco marker using `aruco.detectMarkers` in `3_detect_aruco.py` **(10 pts.)**.
2. Visualize the marker bounding box using `cv2.line` **(5 pts.)**.
3. You are breathtaking!
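For the bounding box, the four detected corners only need to be turned into four line segments. The sketch below assumes NumPy and assumes that one marker's corners arrive as an array of shape `(1, 4, 2)` in order around the marker, which is how `aruco.detectMarkers` is commonly described; `box_segments` is a hypothetical helper and the corner values are invented.

```python
import numpy as np

def box_segments(marker_corners):
    """Return the 4 edges of a marker as pairs of integer pixel points."""
    pts = np.asarray(marker_corners, dtype=float).reshape(4, 2)
    pts = [tuple(int(round(float(v))) for v in pt) for pt in pts]
    # Connect each corner to the next one, and the last back to the first.
    return [(pts[i], pts[(i + 1) % 4]) for i in range(4)]

# Invented corners of one marker, in the assumed (1, 4, 2) layout.
corners = np.array([[[10.2, 20.7], [110.0, 20.0], [110.0, 120.0], [10.0, 120.0]]])
segments = box_segments(corners)

# Each pair can then be passed to cv2.line(frame, start, end, color, thickness).
```

Rounding to integers is done because the drawing functions work with integer pixel positions.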

<div style="text-align: center;">
    <img src="./utils/keanu.png" width="500" height="300">
</div>

# Problem 3. Simple AR (projective transformation) (20 pts.)

<div style="text-align: center;">
    <img src="./utils/cube.gif" width="200" height="150">
</div>

Combine the previous results to project a cube model onto the ArUco marker in `4_cube_aruco.py`.
1. Add aruco marker detection using `aruco.detectMarkers` **(2 pts.)**.
2. Estimate pose of the marker using `cv2.aruco.estimatePoseSingleMarkers` **(5 pts.)**.
3. Find projective transformation for each point of the cube model using `cv2.projectPoints` **(10 pts.)**.
4. Visualize cube on marker using `cv2.line` **(3 pts.)**.
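One possible way to describe the cube model is a list of 8 vertices plus 12 index pairs for the edges. The sketch assumes NumPy and assumes that the marker coordinate system has its origin at the marker centre with $Z$ pointing out of the marker plane; `cube_model` and the marker size are hypothetical.

```python
import numpy as np

def cube_model(side):
    """Vertices (8x3) and edges (12 index pairs) of a cube standing on the plane z = 0."""
    h = side / 2.0
    base = np.array([[-h, -h, 0.0], [h, -h, 0.0], [h, h, 0.0], [-h, h, 0.0]],
                    dtype=np.float32)
    top = base + np.array([0.0, 0.0, side], dtype=np.float32)
    vertices = np.vstack([base, top])
    edges = ([(i, (i + 1) % 4) for i in range(4)]            # bottom face
             + [(i + 4, (i + 1) % 4 + 4) for i in range(4)]  # top face
             + [(i, i + 4) for i in range(4)])               # vertical edges
    return vertices, edges

marker_length = 0.05                      # marker side in metres (hypothetical)
vertices, edges = cube_model(marker_length)
```

After the vertices are projected into the image, each edge `(i, j)` corresponds to one `cv2.line` call between projected points `i` and `j`.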

# Problem 4. Camera feed on plane (25 pts.)

<div style="text-align: center;">
    <img src="./utils/mx-brio.png" width="270" height="200">
</div>

Use the same template as in `4_cube_aruco.py`, but now project the real-time feed from the web camera onto some plane (e.g. one perpendicular to the marker plane). 
1. Define a plane perpendicular to the marker as a 3D array of corner points.
2. Project the plane's corners into the image with `cv2.projectPoints` **(7 pts.)**.
3. Get the transformation matrix with `cv2.getPerspectiveTransform` **(10 pts.)**.
4. Get the transformed image with `cv2.warpPerspective` and visualize it with `cv2.addWeighted` **(8 pts.)**.
5. Name your script `5_AR_video_plane.py`.
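To see what kind of matrix step 3 produces, the perspective transform between four point pairs can be computed by hand. This is a NumPy sketch of the underlying linear system, not a replacement for `cv2.getPerspectiveTransform`; the function names and the point values are invented for illustration.

```python
import numpy as np

def homography_from_4_points(src, dst):
    """Solve for the 3x3 matrix H (with H[2, 2] = 1) that maps src[i] to dst[i]."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.extend([u, v])
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, point):
    """Map a 2D point through H using homogeneous coordinates."""
    p = H @ np.array([point[0], point[1], 1.0])
    return p[:2] / p[2]

# Corners of a 640x480 frame and an invented quadrilateral in the output image.
src = [(0.0, 0.0), (640.0, 0.0), (640.0, 480.0), (0.0, 480.0)]
dst = [(100.0, 50.0), (400.0, 80.0), (380.0, 300.0), (120.0, 320.0)]
H = homography_from_4_points(src, dst)
```

The division by the third component in `apply_homography` is the same homogeneous normalization used throughout the theory section.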

---

**In place of a conclusion**: you are now ready for the simple assignment that is available [here](./HW.ipynb).