In [None]:
# Importing Libraries
import imageio.v3 as iio # pip install imageio[ffmpeg]
import matplotlib.pyplot as plt
import cv2
import numpy as np
import open3d as o3d # pip install open3d

## 2. Frame Extraction

In [3]:
def load_video_frames(video_path, frame_interval=10, display_frames=True):
    # Load frames from a video file at specified intervals.

    frame_count = 0
    frames = []  # List to store frames for visualization

    try:
        print(f"Opening video file: {video_path}")
        
        # Iterate through frames in the video file
        for frame in iio.imiter(video_path):
            if frame_count % frame_interval == 0:
                frames.append(frame)  # Store frame in list
                
                # Display frame if display_frames is True
                if display_frames:
                    plt.imshow(frame)
                    plt.title(f'Frame {frame_count}')
                    plt.axis('off')
                    plt.show()
                
            frame_count += 1

    except FileNotFoundError:
        print(f"Error: Video file not found at {video_path}")
        return []  # Same return type as the success path (a list of frames)
    except Exception as e:
        print(f"An error occurred while processing the video: {e}")
        return []

    print("\nFinished processing video.")
    print(f"Total frames iterated: {frame_count}.")
    
    return frames

frames = load_video_frames('../videos/bedroomvideo.mp4', frame_interval=10, display_frames=False)

Opening video file: ../videos/bedroomvideo.mp4

Finished processing video.
Total frames iterated: 237.


## 5. Essential / Fundamental Matrix Computation

Both matrices describe the epipolar geometry between two images of the same scene taken from different viewpoints. Epipolar geometry defines the constraint that corresponding points must satisfy: given a point x in the first image, its corresponding point x' in the second image must lie on a specific line, the epipolar line (epiline) of x. Both matrices capture this relationship; they differ in whether the camera calibration is known.

The Fundamental Matrix acts as a bridge between the 2D pixel coordinates of corresponding points in two different images, capturing the geometric constraints imposed by the 3D scene and the camera positions, even when you don't know the camera's exact internal details. It's a fundamental tool for tasks like finding correct feature matches, estimating camera motion, and ultimately reconstructing the 3D scene.

- Fundamental Matrix (F):
  - **What it is**: A 3x3 matrix that relates corresponding points between two images in pixel coordinates.
  - **Information encoded**: It contains information about the camera's relative rotation and translation (extrinsic parameters) and the intrinsic parameters (like focal length, principal point) of both cameras.
  - **Equation**: It satisfies the epipolar constraint: x'^T * F * x = 0, where x and x' are the homogeneous coordinates of the matching points in pixels.
  - **When to use**: Use the Fundamental Matrix when the cameras are uncalibrated, meaning you don't know their intrinsic parameters.
  - **Computation**: Requires at least 8 pairs of corresponding points (8-point algorithm) or 7 pairs (7-point algorithm). OpenCV's cv2.findFundamentalMat also supports robust methods such as RANSAC and LMedS, which take many more points and tolerate outliers among them.

If you are recording video with a phone and have not performed a specific camera calibration procedure to find its intrinsic matrix (K), then using the Fundamental Matrix (F) is the recommended approach. You treat the camera as uncalibrated.

- Essential Matrix (E):
  - **What it is**: A 3x3 matrix that relates corresponding points between two images in normalized image coordinates (independent of camera intrinsics).
  - **Information encoded**: It contains only information about the camera's relative rotation (R) and translation (t), up to a scale factor. It does not include camera intrinsic information.
  - **Equation**: It satisfies the epipolar constraint in normalized coordinates: x_norm'^T * E * x_norm = 0.
  - **When to use**: Use the Essential Matrix when the cameras are calibrated, meaning you know their intrinsic parameters (focal length, principal point - often represented in a camera matrix K).
  - **Computation**: Requires at least 5 pairs of corresponding points (using the 5-point algorithm), though robust methods in OpenCV use more points.
  - **Relation to F**: E = K'^T * F * K, where K and K' are the camera intrinsic matrices for the two views. If the camera is the same for both views, K' = K.
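The two epipolar constraints and the relation E = K'^T * F * K can be checked numerically. The following is a minimal sketch on synthetic data: the intrinsic matrix, rotation, translation, and 3D points are made-up values chosen for illustration, not taken from the video, and `skew` is a hypothetical helper.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Assumed synthetic setup: the same camera K is used for both views (K' = K).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.1])

# Essential matrix from the relative pose, and the matching fundamental matrix.
E = skew(t) @ R
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv

# Project a few 3D points (expressed in the camera-1 frame) into both views.
X = np.array([[0.2, -0.1, 4.0], [-0.5, 0.3, 6.0], [1.0, 0.8, 5.0]])
x1 = (K @ X.T).T                       # homogeneous pixels in view 1
x2 = (K @ (R @ X.T + t[:, None])).T    # homogeneous pixels in view 2
x1 /= x1[:, 2:3]
x2 /= x2[:, 2:3]

# Epipolar constraint in pixel coordinates: x'^T F x for every pair.
residual_F = np.einsum('ij,jk,ik->i', x2, F, x1)

# Same constraint in normalized coordinates: x_norm'^T E x_norm.
n1 = (K_inv @ x1.T).T
n2 = (K_inv @ x2.T).T
residual_E = np.einsum('ij,jk,ik->i', n2, E, n1)

# Relation between the two matrices: E = K'^T F K.
E_from_F = K.T @ F @ K

print("max |x'^T F x|:", np.abs(residual_F).max())
print("max |x_norm'^T E x_norm|:", np.abs(residual_E).max())
print("E recovered from F:", np.allclose(E_from_F, E))
```

With noise-free points both residuals are zero up to floating-point error. On real matches they are not, which is why the error is measured in practice as a distance to the epipolar line.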

In [None]:
# Compute the fundamental matrix from matched keypoints
def compute_fundamental_matrix(matches, keypoints1, keypoints2, method = cv2.FM_RANSAC, ransac_threshold = 3.0, confidence = 0.99):
  # Extract coordinates of matched keypoints:
  # queryIdx indexes keypoints1, trainIdx indexes keypoints2
  points1 = np.float32([keypoints1[m.queryIdx].pt for m in matches])
  points2 = np.float32([keypoints2[m.trainIdx].pt for m in matches])
  
  # Compute the fundamental matrix
  fundamental_matrix, inlier_mask = cv2.findFundamentalMat(points1, points2, method = method, ransacReprojThreshold = ransac_threshold, confidence = confidence)
  
  # Reject failed estimates (None, or a result that is not a single 3x3 matrix)
  if fundamental_matrix is None or fundamental_matrix.shape != (3,3):
    raise ValueError("Failed to compute a valid fundamental matrix")
  
  return fundamental_matrix, inlier_mask

# Visualise epipolar lines to verify the fundamental matrix
def visualise_epipolar_lines(img1, img2, points1, points2, fundamental_matrix, sample_size = 20):
  # Sample points if there are too many
  if len(points1) > sample_size:
    indices = np.random.choice(len(points1), sample_size, replace = False)
    pts1 = points1[indices]
    pts2 = points2[indices]
  else:
    pts1 = points1
    pts2 = points2
    
  # Create a figure to display epipolar lines
  _, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 8))
    
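  # Display the first image
  # Note: both cvtColor calls in this function assume BGR input (e.g. from cv2.imread);
  # frames loaded with imageio are already RGB and would not need the conversion
  ax1.imshow(cv2.cvtColor(img1, cv2.COLOR_BGR2RGB))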
  ax1.set_title('Epipolar Lines on Image 1')
  ax1.axis('off')
    
  # Display the second image
  ax2.imshow(cv2.cvtColor(img2, cv2.COLOR_BGR2RGB))
  ax2.set_title('Epipolar Lines on Image 2')
  ax2.axis('off')
    
  # Draw epipolar lines on both images
  for i in range(len(pts1)):
    # Draw points
    ax1.plot(pts1[i, 0], pts1[i, 1], 'ro', markersize = 6)
    ax2.plot(pts2[i, 0], pts2[i, 1], 'ro', markersize = 6)
        
    # Compute epipolar line in second image for point in first image
    line2 = cv2.computeCorrespondEpilines(pts1[i].reshape(-1, 1, 2), 1, fundamental_matrix)
    line2 = line2.reshape(-1)
        
    # Draw epipolar line
    x0, y0 = 0, int(-line2[2] / line2[1])
    x1, y1 = img2.shape[1], int(-(line2[2] + line2[0] * img2.shape[1]) / line2[1])
    ax2.plot([x0, x1], [y0, y1], 'g-')
        
    # Compute epipolar line in first image for point in second image
    line1 = cv2.computeCorrespondEpilines(pts2[i].reshape(-1, 1, 2), 2, fundamental_matrix)
    line1 = line1.reshape(-1)
        
    # Draw epipolar line
    x0, y0 = 0, int(-line1[2] / line1[1])
    x1, y1 = img1.shape[1], int(-(line1[2] + line1[0] * img1.shape[1]) / line1[1])
    ax1.plot([x0, x1], [y0, y1], 'g-')
    
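  plt.tight_layout()
  # Grab the figure before plt.show(): afterwards the figure may already be closed,
  # in which case plt.gcf() would return a new, empty figure
  fig = plt.gcf()
  plt.show()
  
  return fig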

# Calculate the epipolar geometry error to evaluate the quality of the fundamental matrix
def epipolar_error(points1, points2, fundamental_matrix):
  # Convert each point to homogeneous coordinates
  homogeneous_points1 = np.hstack((points1, np.ones((points1.shape[0],1))))
  homogeneous_points2 = np.hstack((points2, np.ones((points2.shape[0],1))))
  
  # Calculate epipolar lines for points in image 1
  lines2 = np.dot(homogeneous_points1, fundamental_matrix.T)
  # Normalise lines
  norms2 = np.sqrt(lines2[:, 0]**2 + lines2[:, 1]**2)
  lines2 = lines2 / norms2.reshape(-1,1)
  # Calculate the distance from points in image 2 to their corresponding epipolar lines
  dist2 = np.abs(np.sum(lines2 * homogeneous_points2, axis = 1))
  
  # Calculate epipolar lines for points in image 2
  lines1 = np.dot(homogeneous_points2, fundamental_matrix)
  # Normalise lines
  norms1 = np.sqrt(lines1[:, 0]**2 + lines1[:, 1]**2)
  lines1 = lines1 / norms1.reshape(-1,1)
  # Calculate the distance from points in image 1 to their corresponding epipolar lines
  dist1 = np.abs(np.sum(lines1 * homogeneous_points1, axis = 1))
  
  metrics = {
    "mean_error": (np.mean(dist1) + np.mean(dist2)) / 2,
    "max_error": max(np.max(dist1), np.max(dist2)),
    "std_error": (np.std(dist1) + np.std(dist2)) / 2
  }
  
  return metrics

# Main function for the fundamental matrix computation step
def process_fundamental_matrix(imgs, matches, keypoints1, keypoints2, visualise = True):
  # Compute fundamental matrix
  print("Computing fundamental matrix...")
  F, inlier_mask = compute_fundamental_matrix(matches, keypoints1, keypoints2)
    
  # Filter matches based on inlier mask
  inlier_matches = [m for i, m in enumerate(matches) if inlier_mask[i]]
  print(f"Inlier matches: {len(inlier_matches)} out of {len(matches)} ({len(inlier_matches) / len(matches) * 100:.2f}%)")
    
  # Extract coordinates of inlier keypoints
  inlier_points1 = np.float32([keypoints1[m.queryIdx].pt for m in inlier_matches])
  inlier_points2 = np.float32([keypoints2[m.trainIdx].pt for m in inlier_matches])
    
  # Calculate error metrics
  error_metrics = epipolar_error(inlier_points1, inlier_points2, F)
  print(f"Mean epipolar error: {error_metrics['mean_error']:.4f} pixels")
        
  # Visualize epipolar lines if requested
  if visualise and len(imgs) >= 2:
      visualise_epipolar_lines(imgs[0], imgs[1], inlier_points1, inlier_points2, F)
    
  # Prepare results
  results = {
      "fundamental_matrix": F,
      "inlier_mask": inlier_mask,
      "inlier_matches": inlier_matches,
      "inlier_points1": inlier_points1,
      "inlier_points2": inlier_points2,
      "error_metrics": error_metrics,
  }
    
  return results

## 6. Camera Pose Estimation

The goal of this stage is to determine the relative motion between the two camera views where you matched features. This motion is described by:

- **Rotation (R)**: A 3x3 matrix describing how the camera orientation changed between the two shots.
- **Translation (t)**: A 3x1 vector describing how the camera position changed between the two shots. Note that the translation vector t can only be determined up to a scale factor. This means you know the direction of motion but not the absolute distance moved.
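The scale ambiguity in t can be illustrated with a small synthetic check: if the scene and the translation are both scaled by the same factor, the two images do not change, so no algorithm working from the images alone can recover that factor. This is a sketch only. The matrix `K`, the pose, the points, and the `project` helper are all assumed values and names used for illustration.

```python
import numpy as np

def project(K, R, t, X):
    """Project Nx3 points X (camera-1 frame) into a camera with pose (R, t)."""
    x = (K @ (R @ X.T + t.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

# Assumed intrinsics and relative pose (illustrative values).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(8.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.3, 0.0, 0.05])
X = np.array([[0.1, 0.2, 3.0], [-0.4, 0.1, 5.0], [0.6, -0.3, 4.0]])

s = 4.0  # make the scene and the camera motion four times larger

# View 1 sits at the origin: pose (I, 0) in both cases.
view1_original = project(K, np.eye(3), np.zeros(3), X)
view1_scaled = project(K, np.eye(3), np.zeros(3), s * X)

# View 2 uses the relative pose, with the translation scaled along with the scene.
view2_original = project(K, R, t, X)
view2_scaled = project(K, R, s * t, s * X)

same_images = (np.allclose(view1_original, view1_scaled)
               and np.allclose(view2_original, view2_scaled))
print("Images identical after scaling scene and translation:", same_images)
```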

- Input Matrix: cv2.recoverPose technically requires the Essential Matrix (E) as its main input, not the Fundamental Matrix (F). It also needs the corresponding inlier points from the previous stage (pts1_inliers, pts2_inliers) and the camera intrinsic matrix (K).
- Handling the Uncalibrated Case (Starting with F): Since you computed F (because the phone camera was treated as uncalibrated), you need a way to get E before calling recoverPose. The relationship is E = K^T * F * K, but K is unknown.
  - The Workaround: Assume a plausible K matrix. A common approach is:
    - Set the principal point (cx, cy) to the image center (e.g., width/2, height/2).
    - Estimate or guess the focal length (fx, fy). Sometimes fx = fy = image_width is used as a starting guess, or a typical value for phone cameras (e.g., 500-1000 pixels) might be assumed.
  - Compute E: Calculate E = K_assumed.T @ F @ K_assumed.
  - Use recoverPose: Call cv2.recoverPose using this computed E, the same assumed K, and your inlier points.
  - Important: You must document this assumption about K in your report. The recovered pose is only as accurate as the assumed K, and the translation t is still known only up to scale (cv2.recoverPose returns it as a unit-length direction).
- Chirality Problem: Mathematically, decomposing the E matrix yields four possible solutions for the rotation (R) and translation (t). However, only one of these solutions is physically correct – the one where the reconstructed 3D points lie in front of both cameras.
  - How recoverPose Solves It: The cv2.recoverPose function handles this automatically! It takes your inlier points (pts1_inliers, pts2_inliers) and the K matrix, triangulates the points for each of the four possible (R, t) combinations, and counts how many points end up in front of both camera views. It then returns the R and t corresponding to the hypothesis with the most positive depth points, effectively resolving the chirality ambiguity. Your report should explain this concept.
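The assumed-K workaround and the chirality test can be sketched in plain NumPy to show what happens conceptually. This is not OpenCV's implementation: `assumed_intrinsics`, `candidate_poses`, `count_in_front`, and `recover_pose_sketch` are hypothetical helpers, and the data is synthetic. In the notebook itself, cv2.recoverPose would be the function to call.

```python
import numpy as np

def assumed_intrinsics(width, height, focal=None):
    """Guess K: principal point at the image centre, fx = fy = focal (default: width)."""
    f = float(width if focal is None else focal)
    return np.array([[f, 0.0, width / 2.0], [0.0, f, height / 2.0], [0.0, 0.0, 1.0]])

def candidate_poses(E):
    """Return the four (R, t) decompositions of E; each t has unit length."""
    U, _, Vt = np.linalg.svd(E)
    U = U if np.linalg.det(U) > 0 else -U      # keep proper rotations (det = +1)
    Vt = Vt if np.linalg.det(Vt) > 0 else -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    return [(R, s * U[:, 2]) for R in (U @ W @ Vt, U @ W.T @ Vt) for s in (1.0, -1.0)]

def count_in_front(R, t, n1, n2):
    """Triangulate normalized points; count those with positive depth in both views."""
    P1 = np.hstack((np.eye(3), np.zeros((3, 1))))
    P2 = np.hstack((R, t.reshape(3, 1)))
    count = 0
    for a, b in zip(n1, n2):
        A = np.vstack((a[0] * P1[2] - P1[0], a[1] * P1[2] - P1[1],
                       b[0] * P2[2] - P2[0], b[1] * P2[2] - P2[1]))
        X = np.linalg.svd(A)[2][-1]            # null vector of A (homogeneous point)
        X = X[:3] / X[3]
        count += int(X[2] > 0 and (R @ X + t)[2] > 0)
    return count

def recover_pose_sketch(F, K, pts1, pts2):
    """Pick the (R, t) candidate that puts the most points in front of both cameras."""
    E = K.T @ F @ K                            # same camera assumed for both views
    K_inv = np.linalg.inv(K)
    n1 = (K_inv @ np.column_stack((pts1, np.ones(len(pts1)))).T).T
    n2 = (K_inv @ np.column_stack((pts2, np.ones(len(pts2)))).T).T
    return max(candidate_poses(E), key=lambda pose: count_in_front(*pose, n1, n2))

# Synthetic check with a known pose (all values are illustrative assumptions).
K = assumed_intrinsics(640, 480)
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([1.0, 0.0, 0.0])
t_cross = np.array([[0.0, -t_true[2], t_true[1]],
                    [t_true[2], 0.0, -t_true[0]],
                    [-t_true[1], t_true[0], 0.0]])
K_inv = np.linalg.inv(K)
F = K_inv.T @ t_cross @ R_true @ K_inv

X = np.array([[0.2, -0.1, 4.0], [-0.5, 0.3, 6.0], [1.0, 0.8, 5.0],
              [-0.3, -0.6, 7.0], [0.7, 0.2, 8.0]])
h1 = (K @ X.T).T
h2 = (K @ (R_true @ X.T + t_true[:, None])).T
pts1 = h1[:, :2] / h1[:, 2:3]
pts2 = h2[:, :2] / h2[:, 2:3]

R_est, t_est = recover_pose_sketch(F, K, pts1, pts2)
print("Rotation recovered:", np.allclose(R_est, R_true, atol=1e-6))
print("Translation direction recovered:", np.allclose(t_est, t_true, atol=1e-6))
```

Here the assumed K equals the K used to generate the data, so the pose is recovered exactly. With a guessed focal length on real footage, the recovered pose will be distorted by however far the guess is from the true intrinsics.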

## 7. 3D Point Triangulation and Scene Visualisation

- Concept: Triangulation is the process of determining the 3D coordinates of a point in space. You can do this if you have observed the projection of that point in at least two images taken from different, known camera viewpoints. Given the 2D coordinates of the point in each image (pts1, pts2) and the camera poses (relative rotation R and translation t between the views, plus the camera intrinsics K), you can find the intersection of the "rays" that go from each camera center through its corresponding 2D image point. This intersection point is the estimated 3D location of the point.
- Projection Matrices (P1, P2): To use cv2.triangulatePoints, you need the 3x4 projection matrix for each camera view. This matrix maps 3D world points (in homogeneous coordinates) to 2D image points (in homogeneous coordinates).
  - Camera 1 (Reference): We typically assume the first camera is at the origin of the world coordinate system. Its projection matrix P1 is formed using the intrinsic matrix K and a standard pose [I | 0] (Identity rotation, Zero translation): P1 = K @ np.hstack((np.eye(3), np.zeros((3, 1))))
  - Camera 2 (Relative Pose): The second camera's projection matrix P2 is formed using the same K (assuming the same camera or the assumed K from Stage 6) and the relative rotation R and translation t calculated in Stage 6: P2 = K @ np.hstack((R, t))
- cv2.triangulatePoints: This OpenCV function takes the two projection matrices (P1, P2) and the corresponding 2D points from both images (pts1, pts2) as input.
  - Input Format: Note that cv2.triangulatePoints expects the 2D points as 2xN arrays (2 rows, N columns). An (N, 2) array can simply be transposed; an (N, 1, 2) array must first be reshaped to (N, 2).
  - Output Format: It returns the 3D points in homogeneous coordinates as a 4xN array (4 rows, N columns). To get the standard 3D Cartesian coordinates, you need to divide the first three rows by the fourth row.
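The projection matrices and the homogeneous-to-Cartesian step can be sketched as follows. To keep the example dependent on NumPy only, it uses a hypothetical `triangulate_dlt` helper (linear DLT triangulation) as a stand-in for cv2.triangulatePoints, following the same 2xN input and 4xN output convention described above. The camera, pose, and points are synthetic assumptions.

```python
import numpy as np

def triangulate_dlt(P1, P2, pts1, pts2):
    """Linear (DLT) triangulation. pts1, pts2: 2xN arrays. Returns 4xN homogeneous points."""
    n = pts1.shape[1]
    points_4d = np.zeros((4, n))
    for i in range(n):
        # Each view contributes two linear equations in the homogeneous 3D point.
        A = np.vstack((pts1[0, i] * P1[2] - P1[0],
                       pts1[1, i] * P1[2] - P1[1],
                       pts2[0, i] * P2[2] - P2[0],
                       pts2[1, i] * P2[2] - P2[1]))
        points_4d[:, i] = np.linalg.svd(A)[2][-1]   # null vector of A
    return points_4d

# Assumed intrinsics and relative pose (illustrative values, t as a 3x1 column).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(6.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([[0.4], [0.0], [0.05]])

# Projection matrices: camera 1 at the origin, camera 2 at the relative pose.
P1 = K @ np.hstack((np.eye(3), np.zeros((3, 1))))
P2 = K @ np.hstack((R, t))

# Synthetic ground-truth points and their (N, 2) pixel projections.
X_true = np.array([[0.2, -0.1, 4.0], [-0.5, 0.3, 6.0], [1.0, 0.8, 5.0]])
X_h = np.column_stack((X_true, np.ones(len(X_true))))
h1 = (P1 @ X_h.T).T
h2 = (P2 @ X_h.T).T
pts1 = h1[:, :2] / h1[:, 2:3]
pts2 = h2[:, :2] / h2[:, 2:3]

# Transpose (N, 2) -> (2, N) on input; divide by the fourth row on output.
points_4d = triangulate_dlt(P1, P2, pts1.T, pts2.T)
points_3d = (points_4d[:3] / points_4d[3]).T

print("Recovered 3D points match:", np.allclose(points_3d, X_true, atol=1e-5))
```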

Explanation: Scene Visualisation

- Goal: To display the calculated 3D points to visually inspect the reconstructed scene structure.
- Tools: Your project allows libraries like Matplotlib, Open3D, or CloudCompare.
- Interactivity: The visualization must be interactive, allowing you to rotate, pan, and zoom the 3D view.
  - Matplotlib (mplot3d): Can create basic 3D scatter plots. Interactivity (rotation/zoom) depends on the backend: in the classic Jupyter Notebook the %matplotlib notebook backend works, while JupyterLab typically needs %matplotlib widget (provided by the ipympl package).
  - Open3D: A library specifically designed for 3D data processing and visualization. It generally provides more robust and feature-rich interactive visualization windows suitable for point clouds. This is likely the better choice to meet the "interactive" requirement robustly.
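A minimal sketch of the Open3D route is shown below. The cleaning step is an assumption added for illustration, not a project requirement: triangulated points can be non-finite, behind the camera, or extremely far away, and a few such points can make the interactive view hard to navigate. `clean_points` and `show_point_cloud` are hypothetical helper names, and the 95th-percentile depth cutoff is an arbitrary choice.

```python
import numpy as np

def clean_points(points_3d, max_depth_percentile=95.0):
    """Drop non-finite points, points behind camera 1, and the farthest outliers."""
    pts = np.asarray(points_3d, dtype=np.float64)
    pts = pts[np.all(np.isfinite(pts), axis=1)]   # remove NaN / inf rows
    pts = pts[pts[:, 2] > 0]                      # keep points in front of camera 1
    if len(pts) == 0:
        return pts
    depth_limit = np.percentile(pts[:, 2], max_depth_percentile)
    return pts[pts[:, 2] <= depth_limit]

def show_point_cloud(points_3d):
    """Open an interactive Open3D window (rotate, pan, and zoom with the mouse)."""
    import open3d as o3d  # imported lazily so clean_points works without Open3D
    cloud = o3d.geometry.PointCloud()
    cloud.points = o3d.utility.Vector3dVector(clean_points(points_3d))
    o3d.visualization.draw_geometries([cloud])

# Example with made-up points: one NaN, one behind the camera, one far outlier.
example_points = np.array([[0.0, 0.0, 1.0],
                           [0.1, 0.2, 2.0],
                           [np.nan, 0.0, 1.0],
                           [0.0, 0.0, -1.0],
                           [0.0, 0.0, 1000.0]])
cleaned = clean_points(example_points)
print("Points kept:", len(cleaned))
# show_point_cloud(example_points)  # uncomment to open the interactive viewer
```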