# Image And Video Analysis

## Project 2 - Object Tracking

### Umar Faruk Abdullahi & Raixa Madueno

## Introduction to the Real World Problem

In the field of Human-Computer Interaction (HCI), the quest for more intuitive and seamless methods of computer control has been ongoing for a long time. From the early days of keyboards to the advent of the graphical user interface (GUI) which introduced the mouse and the more recent introduction of touch screens, each development has aimed to make our interactions with digital devices more natural. Alongside these developments, research has also increasingly focused on harnessing the human body itself as a controller. Gaming devices like the Nintendo Wii, Switch, and systems such as Kinect have demonstrated the viability of interfaces that respond to body motion, eye tracking, and gesture recognition.

Despite these advances, one key element of human expression remains challenging - handwriting. At its core, handwriting can be considered as the art of drawing meaningful patterns through a series of deliberate strokes. Traditionally, digital handwriting has relied on dedicated hardware such as digital pens or styluses used on specialized surfaces. Although effective, these solutions come with their own set of limitations: they require additional devices, can be expensive, and may not always capture the natural fluidity of pen-on-paper writing. Touch screens offer an alternative but still necessitate direct contact, which might not be ideal in every scenario—especially where hygiene or ease-of-use is a priority.

Our project presents an approach that relies only on camera input to enable handwriting through real-time finger tracking. Using advances in computer vision and machine learning, we capture the motion of a user’s finger and convert these movements into words.

The benefits of our approach include:

1. By relying on camera technology that is available on most consumer devices, our system broadens access and reduces costs.
2. The approach captures the organic, dynamic nature of handwriting, offering a more intuitive user experience than traditional stylus-based systems.

### Applications

1. **Enhancing Accessibility**: For individuals with certain physical limitations or those who find traditional input devices hard to use, interacting with digital devices can be a barrier. Handwriting is a natural form of communication and note-taking, but devices like keyboards and touchscreens don't always support this mode effectively.
2. **Gaming**: This can be extended to handwriting-based gameplay, gesture-driven interactions, or even creative applications where players draw symbols or patterns to trigger in-game actions. This opens up new possibilities for educational games, virtual reality (VR) experiences, and creative applications that merge digital and physical interaction.
3. **Rehabilitation**: Handwriting exercises are a common part of therapy for individuals recovering from strokes or motor impairments. The system can be used to assist in rehabilitation by allowing users to practice writing in a low-stress, engaging digital environment.

### Approach 

To solve this problem, we rely on object tracking. Object tracking is a computer vision task that involves locating and following one or more objects over time within a video sequence. The primary goal is to determine the trajectory of an object as it moves through a scene while maintaining its identity across frames. To achieve this, we follow the steps below:

1. **Detection:** The first task is to identify the object of interest to track, in our case the tip of the index finger. This is achieved using the [MediaPipe library](https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker#get_started), an open-source tool that supports hand tracking and gesture recognition.
2. **Tracking:** This is the most important part. After detecting the point of interest, we use a tracking algorithm to follow it as it moves across frames, accumulating its trajectory.
3. **Correction:** As the finger moves across frames, we correct the tracking error by supplying the measured location of the finger at each subsequent frame. This correction can be built into the tracking algorithm or performed as a separate step. We build it into the tracking algorithm.
4. **Handwriting Recognition:** After each word is written, we run the captured canvas through an optical character recognition (OCR) model to extract the written text. After trials with several alternatives, we chose the [Google Gemini Model](https://ai.google.dev/gemini-api/docs/vision?lang=python) for the OCR.

The diagram below illustrates this approach:

![Diagram of Selected Approach](./img/hand_copy.png)
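
The control flow behind these steps can be sketched as a small two-state machine. This is only an illustration under stated assumptions: the `State` enum, the `next_state` function, and its per-frame flags are hypothetical names and do not appear in the project code, which keeps this logic in a `tracking_started` flag inside the main loop.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting"    # no fingertip detected yet
    TRACKING = "tracking"  # fingertip is being followed


def next_state(state, pointing_detected, fist_detected, tracking_ok):
    """Return (new_state, run_ocr) for one frame.

    All flags are hypothetical per-frame inputs used for illustration.
    """
    if state is State.WAITING:
        # Detection: a pointing gesture starts a new stroke.
        if pointing_detected:
            return State.TRACKING, False
        return State.WAITING, False
    if not tracking_ok:
        # Tracking lost: go back to waiting without running OCR.
        return State.WAITING, False
    if fist_detected:
        # Closed fist: the word is finished, so the canvas goes to OCR.
        return State.WAITING, True
    return State.TRACKING, False


state = State.WAITING
events = [
    (True, False, True),   # pointing gesture seen
    (False, False, True),  # fingertip still tracked
    (False, True, True),   # closed fist ends the word
]
for pointing, fist, ok in events:
    state, run_ocr = next_state(state, pointing, fist, ok)
    print(state.name, run_ocr)
```

Keeping the transitions in one pure function makes the gesture logic easy to test without a camera.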

### Choice of Tracking Algorithm

To track the fingertip, we require an object tracking algorithm that is well suited for point tracking. Among the numerous available tracking algorithms, the two best suited to point tracking are optical flow and the Kalman Filter:

1. **Optical Flow:** Optical flow estimates motion by analyzing the changes in pixel intensities across adjacent frames. For scenes with a static background, as in our case, this method is well suited to focusing only on the moving object, i.e. the fingertip. However, the chosen Lucas-Kanade method inherently relies on three assumptions:
   - only a small change occurs between consecutive frames (small motion),
   - intensities do not change much (brightness constancy),
   - the pixels around the tracked point move together (spatial coherence).
2. **Kalman Filter:** The Kalman Filter estimates the object's state (position and velocity) while accounting for uncertainty in both the motion model and the measurements. This method suits our problem, where the focus is on the location of the fingertip and how fast it is being displaced from its previous location (velocity). It also produces a smoother trajectory.
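
To make the Lucas-Kanade assumptions concrete, here is a minimal single-window sketch in NumPy. It is not the pyramidal implementation behind `cv2.calcOpticalFlowPyrLK`; the synthetic frames and the helper names `gaussian_blob` and `lucas_kanade_step` are assumptions made for illustration.

```python
import numpy as np


def gaussian_blob(size, cx, cy, sigma=5.0):
    """Synthetic grayscale frame: one smooth bright blob on a dark background."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float64)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def lucas_kanade_step(frame0, frame1):
    """Estimate one (u, v) displacement for the whole window by least squares."""
    iy, ix = np.gradient(frame0)   # spatial gradients (rows are y, columns are x)
    it = frame1 - frame0           # temporal gradient between the two frames
    # Normal equations of the constraint ix*u + iy*v + it = 0 (brightness
    # constancy), summed over the window (spatial coherence).
    a = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    return np.linalg.solve(a, b)


# The blob moves by a small, sub-pixel amount between frames (small motion).
true_shift = (0.5, -0.3)
frame0 = gaussian_blob(41, 20.0, 20.0)
frame1 = gaussian_blob(41, 20.0 + true_shift[0], 20.0 + true_shift[1])

u, v = lucas_kanade_step(frame0, frame1)
print(f"estimated shift: ({u:.2f}, {v:.2f}), true shift: {true_shift}")
```

The estimate stays close to the true shift only while the motion is small relative to the blob; larger displacements break the linearization, which is why practical implementations add image pyramids.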

**Our Approach:** Considering the strengths and weaknesses of both tracking algorithms, our approach uses a combined algorithm. To compensate for the limitations of the optical flow method, the Kalman Filter provides predictions that help reduce potential errors around brightness constancy and spatial coherence. Additionally, since the Kalman Filter relies on accurate measurements to update its state after prediction, we use the optical flow algorithm to provide these updated locations.
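
The predict/update cycle that this combination relies on can be sketched with a constant-velocity model in NumPy. This is not the `src.kalman_filter` implementation: the matrices, the noise values, and the synthetic measurements (which stand in for optical flow output) are all assumptions chosen for illustration.

```python
import numpy as np

dt = 0.1  # assumed time step between frames

# State is [x, y, vx, vy]; constant-velocity motion model.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # only position is measured
Q = np.eye(4) * 1.0    # process noise covariance (illustrative value)
R = np.eye(2) * 0.01   # measurement noise covariance (illustrative value)


def predict(x, P):
    """Project the state and its covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z):
    """Correct the prediction with a position measurement z = [x, y]."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P


x = np.zeros(4)   # start at the origin with zero velocity
P = np.eye(4)
# Stand-in for optical flow measurements: a point moving along a straight line.
measurements = [np.array([10.0 + 2 * k, 20.0 + 1 * k]) for k in range(30)]
for z in measurements:
    x, P = predict(x, P)    # where the filter expects the fingertip
    x, P = update(x, P, z)  # where the measurement says it is

print("final estimate:", np.round(x[:2], 2), "last measurement:", measurements[-1])
```

In the combined tracker, the prediction is what remains available when a measurement is missing, and the update is where the optical flow position enters.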

## Tutorial

Below, we provide a step-by-step tutorial of our implementation:

In [1]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv

from IPython.display import clear_output, Image, display
from PIL import Image as PILImage

from src.utils import (
    display_jupyter,
    display_cv2,
    create_video_saver,
    setup_gesture_recognizer,
    detect_keypoint,
    detect_closed_fist,
    setup_text_detector,
    detect_text_from_canvas,
    draw_recognized_text,
)

from src.kalman_filter import KalmanFilter

In the first cell, we import all the necessary libraries and methods we will be using for our project. A short description of unfamiliar imports is provided below:
1. `src.utils`: This module contains utility functions that are required for our application but are not part of the main logic and therefore not included in the notebook. The description of every function is available in the definition. Below we provide the use of each and link to the method:
   1. [display_jupyter](./src/utils.py#L14): Uses the `ipython` module to display frames right in the notebook without using a different window.
   2. [display_cv2](./src/utils.py): Displays the original and processed frames side-by-side in an OpenCV window.
   3. [setup_gesture_recognizer](./src/utils.py): Initializes and returns the MediaPipe gesture recognizer object to be used later.
   4. [detect_keypoint](./src/utils.py): Processes the input frame to detect the pointing gesture and get index fingertip co-ordinates.
   5. [detect_closed_fist](./src/utils.py): Processes the input frame to detect closed fist gesture. We use the `closed fist` gesture to signal when the user is done writing a word and trigger the OCR model.
   6. [setup_text_detector](./src/utils.py): Initializes and returns the Gemini GenAI client for OCR.
   7. [detect_text_from_canvas](./src/utils.py): Gets the text from the input canvas using the text detector.
   8. [draw_recognized_text](./src/utils.py): Draws the recognized text as subtitle on the output frame.
2. `src.kalman_filter`: This module contains the definition of the 2D `KalmanFilter` used in the project. It includes the initialization, prediction and update steps. The implementation was adapted from [https://machinelearningspace.com/2d-object-tracking-using-kalman-filter/](https://machinelearningspace.com/2d-object-tracking-using-kalman-filter/)

## Initialization

This section defines the video sources and configures the parameters used in the tracking setup. We perform the following actions:

1. We load the environment variables using the `dotenv` library. This loads the `.env` file into the process to give access to the required `GEMINI_API_KEY`. The key must be present for the text detector to work. Get your key from [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey) and paste it into the `.env` file: `GEMINI_API_KEY=<your_key>`
2. We set up the Gemini client as the `text_detector`.
3. The source of the video is defined for the `cv2.VideoCapture` method. To use your webcam, specify the value `0` as the parameter. Otherwise, specify the path of your video.
4. The video frame dimensions are extracted and state variables are stored for use.
   1. `initial_point`: To track the first point.
   2. `tracking_started`: Boolean flag that records the tracking state. Used for control flow.
   3. `points`: An array to accumulate the points of the fingertip as it moves. Used for drawing the trajectory on the canvas.
   4. `recognized_text`: Variable to hold the recognized text from the `text_detector` we initialized earlier.

In [2]:
### Get the genAI recognizer

# Load environment variables
load_dotenv()

# Get the GEMINI_API_KEY
gemini_api_key = os.getenv("GEMINI_API_KEY")
if gemini_api_key is None:
    raise ValueError("GEMINI_API_KEY is not set")

text_detector = setup_text_detector(gemini_api_key)

### Get the video source

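# Video source: 0 selects the default webcam
# Pass video_path to cv2.VideoCapture instead to read from a file
video_path = "hand_tracking.mp4"
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise IOError("Cannot open video source")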

# Frame dimensions
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
frame_size = (frame_width, frame_height)
# Some webcams report 0 FPS; fall back to 30 to avoid dividing by zero later
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30

# Canvas to draw the gestures
canvas = np.zeros((frame_height, frame_width, 3), dtype=np.uint8)

### State variables
initial_point = None
tracking_started = False
points = []
recognized_text = ""
kalman_acc_error = 0.0
mean_shift_acc_error = 0.0

## Tracking

In this section, we perform the tracking task based on the defined algorithms above.

### Tracking with Kalman Filters

The code below can be summarised in the following steps:

1. Initialization:

   - A video writer is created to save the output using the `create_video_saver()` method. This method takes in the name of the output file, size, frames per second and the output codec. By default, it saves the video in `30 FPS` using the `X264` mp4 codec.
   - We initialize the gesture recognition system using `setup_gesture_recognizer()` for detecting the start and stop gestures.
   - Then we define the parameters for the Lucas-Kanade algorithm in `lk_params`: `winSize` is the size of the search window, `maxLevel` is the number of pyramid levels, and `criteria` defines the termination criteria for the iterative algorithm.
   - Finally, the Kalman Filter is initialized with an initial speed of 1: `kf = KalmanFilter(0.1, 1, 1, 1, 0.1, 0.1)`

2. Main Loop: This loop continuously reads frames from the camera with `cap.read()`. `frame = cv2.flip(frame, 1)` mirrors the video horizontally for a more natural experience. We then set `output_frame` as a copy of the current frame so that the original frame can be used for processing while the output frame is drawn on.

   Within the main loop, we perform the following:

   1. Tracking Initialization:
      - `detect_keypoint()` uses the gesture recognizer to detect the `Pointing_Up` hand gesture to initiate tracking.
      - `p0` stores the initial keypoint (hand position) as a NumPy array with the correct shape for `calcOpticalFlowPyrLK`.
      - `old_frame` and `old_gray` store the previous frame in color and grayscale, respectively. The grayscale version is used for optical flow calculations.
   
   2. Tracking (Lucas-Kanade and Kalman Filter):

      - We convert the frame to grayscale and store it as `frame_gray`.
      - Then we run the Kalman Filter prediction step with `kf.predict()`.
      - `cv2.calcOpticalFlowPyrLK()` then calculates the optical flow between the previous and current frames. `p1` contains the new hand position. `st` is a status array indicating whether the flow was found for each point.
      - We check `st[0][0]` to see if the tracking was successful and then run `kf.update()` to update the Kalman filter with the measurement. When MediaPipe also detects the fingertip in the current frame, that detection is used as the measurement; otherwise the optical flow position is used.
      - Then, the predicted and updated positions are drawn on the frame.
      - We also check for a "closed fist" gesture to stop tracking. After detecting the `closed_fist` gesture, we pass the canvas to the `text_detector` to get the text currently written on it and store the result. Additionally, the optical flow and Kalman Filter are reset to track the next written characters.
      - If tracking is lost (`st[0][0] == 0`), only the Kalman filter's prediction is used, and the system will attempt to reacquire tracking.
   
   3. Drawing and Display:

      - Finally, the canvas and the `recognized_text` are drawn on the `output_frame`.
      - The frame is also saved through the video writer created at the start of the process, using the `save_kalman` method.
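
The choice of which position to trust in each frame, as described in the tracking step above, can be isolated in a small helper. This is a sketch only: `select_measurement` and its arguments are hypothetical names that do not appear in the notebook, where the same decision is made inline.

```python
def select_measurement(prediction, flow_point, flow_ok, detected_point):
    """Pick the point used for this frame and say where it came from.

    prediction     -- (x, y) from the Kalman prediction step
    flow_point     -- (x, y) from optical flow, meaningful only if flow_ok
    flow_ok        -- True when optical flow found the point (status flag is 1)
    detected_point -- (x, y) from the hand detector, or None if no detection
    """
    if not flow_ok:
        # Tracking lost: fall back to the prediction alone, with no update.
        return prediction, "prediction"
    if detected_point is not None:
        # A fresh detection is preferred as the measurement for the update.
        return detected_point, "detector"
    # Otherwise the optical flow estimate is the measurement for the update.
    return flow_point, "optical_flow"


# Example frames (all coordinates are made up for illustration).
print(select_measurement((100, 100), (103, 98), True, (104, 99)))
print(select_measurement((100, 100), (103, 98), True, None))
print(select_measurement((100, 100), (0, 0), False, None))
```

Returning the source label alongside the point makes it straightforward to log how often the tracker falls back to prediction only.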

In [None]:
# Tracking using Lucas-Kanade Optical Flow and Kalman Filter
try:
    
    # Create the video saver
    # If the video source is the camera, save the output as camera_output.mp4
    # Otherwise, save the output as the input video name with _output.mp4
    # Note: 'AVFOUNDATION' is the macOS camera backend; other platforms report a different name
    if cap.getBackendName() == 'AVFOUNDATION':
        save_kalman = create_video_saver("output/camera_output.mp4", frame_size, fps)
    else:
        save_kalman = create_video_saver(f"output/{video_path.replace('.mp4', '_output.mp4')}", frame_size, fps)
    
    gesture_recognizer = setup_gesture_recognizer()
    
    tracker = None
    old_frame = None
    old_gray = None
    
    # Lucas-Kanade parameters
    # The window size is the size of the search window
    lk_params = dict(winSize=(15, 15),
                    maxLevel=2,
                    criteria=(cv2.TERM_CRITERIA_EPS |
                            cv2.TERM_CRITERIA_COUNT, 10, 0.03))
    
    # Define the Kalman Filter object
    kf = KalmanFilter(0.1, 1, 1, 1, 0.1, 0.1)
    
    while True:
        ret, frame = cap.read()
        if not ret:
            cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
            continue
        
        # The frame is flipped to match the video source
        # Makes it easier to follow along
        frame = cv2.flip(frame, 1)
        output_frame = frame.copy()
        
        # If tracking is not started, detect the initial point
        # and start tracking
        if not tracking_started:
            
            # Get the initial point using the MediaPipe Hand Tracker
            initial_point = detect_keypoint(frame, gesture_recognizer)
            
            if initial_point is not None:
                tracking_started = True
                print("Initial point detected! Starting tracking...")
                
                current_point = initial_point
                points = [initial_point]
                
                # Initialize the initial point for the Lucas-Kanade tracker
                p0 = np.array([[current_point[0], current_point[1]]], dtype=np.float32).reshape(-1, 1, 2)
            
            # Set the old frame and old gray frame
            old_frame = frame
            old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
        
        # If tracking is started, use the Lucas-Kanade tracker
        # to track the point and update the Kalman Filter
        if tracking_started:
            
            frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            
            # Predict the next state using the Kalman Filter
            (kalman_pred_x, kalman_pred_y) = kf.predict()
            
            # We then calculate the optical flow using the Lucas-Kanade method
            p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None, **lk_params)
            
            new_point = detect_keypoint(frame, gesture_recognizer)
            
            # If optical flow found the point (status 1), update the Kalman Filter and keep tracking
            if st[0][0] == 1:
                measurement = p1.reshape(-1, 2)
                x_meas, y_meas = measurement[0]
                
                # Prefer the MediaPipe detection as the measurement when available;
                # otherwise fall back to the optical flow estimate
                if new_point is not None:
                    (kalman_updated_x, kalman_updated_y) = kf.update([[new_point[0]], [new_point[1]]])
                    
                    tracking_error = np.linalg.norm(
                        np.array([new_point[0], new_point[1]]) -
                        np.array([kalman_updated_x, kalman_updated_y])
                    )
                    kalman_acc_error += tracking_error
                else:
                    (kalman_updated_x, kalman_updated_y) = kf.update([[x_meas], [y_meas]])
                
                # Append the new point to the points list
                points.append((int(kalman_updated_x), int(kalman_updated_y)))
                
                if len(points) > 1:
                    cv2.line(canvas, points[-2], points[-1], (0, 0, 255), 3)
                
                output_frame = cv2.addWeighted(output_frame, 1.0, canvas, 1.0, 0)
                
                # Draw tracking visualization
                cv2.circle(output_frame, (int(kalman_pred_x), int(kalman_pred_y)), 5, (0, 255, 255), -1)
                cv2.circle(output_frame, (int(kalman_updated_x), int(kalman_updated_y)), 5, (0, 255, 0), -1)
                cv2.rectangle(output_frame, 
                            (int(kalman_updated_x - 15), int(kalman_updated_y - 15)), 
                            (int(kalman_updated_x + 15), int(kalman_updated_y + 15)),
                            (255, 0, 0), 2)
                cv2.putText(output_frame, "Tracking Active", (10, 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
                
                # Check for closed fist gesture
                # If detected, end the tracking and recognize the text
                if detect_closed_fist(frame, gesture_recognizer):
                    tracking_started = False
                    print("Closed fist detected - ending tracking")
                    
                    if len(points) > 1:
                        
                        # Recognize the text from the canvas using the GEMINI Model 
                        detected_text = detect_text_from_canvas(canvas, text_detector)
                        recognized_text += f"{detected_text} "
                        print(f"Recognized text: {recognized_text}")

                        canvas = np.zeros_like(frame)
                        p0 = None
                        points = []
                        kf = KalmanFilter(0.1, 1, 1, 1, 0.1, 0.1)
                
                # Update the previous frame and previous points
                p0 = p1.reshape(-1, 1, 2)
                old_frame = frame
                old_gray = frame_gray
                
            # If the point is not detected, use the Kalman Filter to predict the next state
            # The Kalman Filter will be used to guess the next point with no measurement updates
            # Then, we restart the tracking process
            else:
                p0 = np.array([[kalman_pred_x, kalman_pred_y]], dtype=np.float32).reshape(-1, 1, 2)
                output_frame = cv2.addWeighted(output_frame, 1.0, canvas, 1.0, 0)
                
                # Draw tracking visualization
                cv2.circle(output_frame, (int(kalman_pred_x), int(kalman_pred_y)), 5, (0, 255, 255), -1)
                cv2.putText(output_frame, "Tracking Lost - Kalman Guesses only", (10, 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
                
                # Reset the tracking process
                tracking_started = False
                points = []
                canvas = np.zeros_like(frame)
        else:
            output_frame = cv2.addWeighted(output_frame, 1.0, canvas, 1.0, 0)
            cv2.putText(output_frame, "Waiting for pointing gesture...", (10, 30),
                       cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        
        # Draw recognized text as subtitle
        if recognized_text:
            draw_recognized_text(output_frame, recognized_text)
        
        # Display the output frame
        display_cv2(output_frame)
        
        # Save the output frame
        save_kalman(output_frame)
        
        key = cv2.waitKey(int(1000/fps)) & 0xFF
        if key == ord('q'):
            break
        
except KeyboardInterrupt:
    print("Interrupted by user")
finally:
    # We then release the video capture and destroy all windows
    gesture_recognizer.close()
    print("Kalman Accumulated Tracking Error:", kalman_acc_error)
    
    cv2.destroyAllWindows()
    cv2.waitKey(1)
    cap.release()
    save_kalman.release()

## Challenges Faced

1. Initially, we explored traditional computer vision methods like contour detection, image segmentation, and color profiling for hand and finger detection.  However, these approaches presented significant challenges.  Accurately isolating the hand contour proved difficult, and creating a robust color profile that could accommodate varying skin tones and lighting conditions was problematic. Consequently, we adopted the MediaPipe library for initial finger detection.

2. Tuning the Kalman Filter parameters was challenging. The **process noise covariance Q** represents the uncertainty in how the state evolves over time: if it is too high, the filter follows every noisy measurement and the track becomes jittery; if it is too low, the filter leans on its motion model and the track becomes sluggish. We set the parameter `std_acc`, which scales `Q`, to **1** for both the **x** and **y** directions. The `std_meas_x` and `std_meas_y` parameters, which determine the **measurement noise covariance R**, were set low **(0.1)** to prevent laggy tracking.
   
3. Choosing the appropriate size for the tracking window is crucial. A window that is too small might lose track of the hand during fast movements, while a window that is too large might capture background clutter and reduce the tracking accuracy.

4. If the drawing canvas is not perfectly clean, the OCR system misinterprets other artifacts as part of the written text. Also, each step in the pipeline (hand detection, tracking, gesture recognition, OCR) introduces some latency. This impacts the user experience of the system especially during the text recognition stage.

5. We also explored tracking using the Mean Shift approach documented in [Appendix A](#appendix):
   - Mean Shift works by iteratively moving the tracking window towards the region of maximum density using the color histogram extracted in the initialization step.
   - It does not have an internal state and therefore calculates no estimate of the next object location. This makes it difficult to track objects that move irregularly, such as a finger during handwriting. If the pointing finger moves quickly, Mean Shift cannot estimate its velocity or predict its next location.
   - Since Mean Shift relies purely on pixel intensity/color similarity, regions similar to the finger, such as other fingers and parts of the hand, degrade tracking because of their resemblance.

   On the other hand, the Kalman Filter maintains an internal state and estimates the object location using the state transition model. For a small point such as our index fingertip keypoint, the Kalman Filter is the more appropriate choice since it is more robust to noise: it incorporates both process and measurement noise when calculating the next position of the object.
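
The tuning trade-off between `Q` and `R` mentioned in item 2 can be illustrated with a scalar Kalman filter. This is a simplified sketch, not the project's 2D filter: the function `steady_state_gain` and the sample values are assumptions for illustration. The gain is the fraction of each new measurement that the filter accepts, so a gain near 1 follows measurements closely (responsive but jittery) and a gain near 0 mostly follows the model (smooth but sluggish).

```python
def steady_state_gain(q, r, steps=200):
    """Iterate the scalar Kalman recursion and return the converged gain."""
    p = 1.0  # initial estimate variance
    k = 0.0
    for _ in range(steps):
        p = p + q            # predict: uncertainty grows by the process noise
        k = p / (p + r)      # gain: share of the measurement that is trusted
        p = (1.0 - k) * p    # update: uncertainty shrinks after measuring
    return k


for q, r in [(0.01, 1.0), (1.0, 1.0), (1.0, 0.01)]:
    print(f"Q={q:<5} R={r:<5} gain={steady_state_gain(q, r):.3f}")
```

Raising `Q` or lowering `R` pushes the gain towards 1, which is consistent with choosing a low measurement noise to avoid lag.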

## Room for Improvement

1. **Extended Kalman Filter:** The current implementation based on the Kalman Filter relies on a linear Gaussian process and measurement model. However, finger tracking is inherently non-linear as one can change the speed with which their hands move, or write certain characters with more speed than others. An extended Kalman Filter (EKF) can improve performance by using a non-linear model of both the state transition and measurement. This will require modelling the acceleration of the system and incorporating the finger joint kinematics (how the fingers move). 
 
2. **Reduction of Latency:** Text recognition currently runs synchronously, which adds latency, causes a slight delay, and affects the user experience. Asynchronous programming techniques would allow requests to be processed in the background while the user continues using the system. Although this increases the system complexity, it is vital for a smoother workflow.

3. **Mean Shift Approach:** The mean shift experiment performed poorly. Mean shift is susceptible to drift, especially if there are objects with similar color distributions in the scene: the tracking window might gradually shift away from the hand and onto a distractor. Currently, because other parts of the hand resemble the finger and its surrounding areas, the mean shift algorithm struggles to track only the fingertip and jumps from finger to finger. Experimenting with different window sizes, termination criteria, additional frame preprocessing steps, or color ranges might help improve the performance.
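
One possible shape for the latency reduction in item 2 is to hand the OCR request to a worker thread and poll for the result from the frame loop. The sketch below uses only the standard library; `slow_ocr` is a hypothetical stand-in for the blocking recognition call, and the sleep durations are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_ocr(canvas_id):
    """Hypothetical stand-in for a blocking OCR request."""
    time.sleep(0.2)  # simulate network and model latency
    return f"text-for-{canvas_id}"


executor = ThreadPoolExecutor(max_workers=1)
pending = executor.submit(slow_ocr, "canvas-1")  # returns immediately

frames_processed = 0
recognized_text = ""
while not recognized_text:
    # The frame loop keeps running while the request is in flight.
    frames_processed += 1
    time.sleep(0.01)  # stand-in for reading and drawing one frame
    if pending.done():
        recognized_text = pending.result()

executor.shutdown()
print(f"processed {frames_processed} frames before OCR returned: {recognized_text}")
```

In a real integration the canvas would need to be copied before submission, since the main loop clears it for the next word.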

## Appendix



### A. Mean Shift Approach

Ensure you have set up the environment from the [Initialization](#initialization) step before proceeding.

Here's the breakdown of the approach:

1. Initialization:

    - `gesture_recognizer` is initialized to detect start and stop gestures.
    - `term_criteria` sets the termination criteria for the Mean Shift algorithm.

2. Main Loop: This loop continuously processes frames from the camera. Flipping the frame horizontally `cv2.flip` provides a more natural user experience.

    a. Tracking Initialization:

    - We wait for a `Pointing_Up` gesture using `detect_keypoint`.
    - Upon detection, a tracking window `track_window` is defined around the detected point.
    - A region of interest (ROI) is then extracted from the initial frame based on the `track_window`.
    - The ROI is then converted to the HSV color space to generate the color histogram.
    - Finally, the color histogram `roi_hist` is calculated for the ROI and normalized. This histogram represents the target color distribution to be tracked.
    
    b. Tracking (Mean Shift):

    - The current frame is converted to HSV color space.
    - `cv2.calcBackProject()` then calculates the back projection of the HSV frame using the `roi_hist`. This creates a probability map where brighter pixels indicate a higher likelihood of matching the target color.
    - We then use `cv2.meanShift()` to find the new location of the target based on the back projection. It updates the `track_window`.
    - The new hand position `(current_point)` is calculated from the updated `track_window`.
    - If tracking is valid, the current point is added to the points list, and a line is drawn on the canvas.
    - A "closed fist" gesture is checked to stop tracking. If detected, we take the canvas content and process it using the `detect_text_from_canvas()` to extract and store the written text. The canvas is then cleared.
    
    c. Drawing and Display:

    - The canvas is combined with the current frame using `cv2.addWeighted()`.
    - Tracking visualizations (circle and rectangle) are drawn on the frame.
    - The recognized text is displayed as a subtitle.
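
The core mean shift iteration described above can be sketched in NumPy on a synthetic back projection. This is an illustrative simplification and not the `cv2.meanShift` implementation; the `mean_shift` helper, its defaults, and the synthetic probability map are assumptions.

```python
import numpy as np


def mean_shift(prob, window, max_iter=10, eps=1.0):
    """Move an (x, y, w, h) window towards the local centroid of `prob`."""
    x, y, w, h = window
    for _ in range(max_iter):
        patch = prob[y:y + h, x:x + w]
        total = patch.sum()
        if total == 0:
            break  # nothing to follow inside the window
        ys, xs = np.mgrid[0:h, 0:w]
        # Centroid of the probability mass, relative to the window origin.
        cx = (xs * patch).sum() / total
        cy = (ys * patch).sum() / total
        # Shift so the window centre lands on the centroid.
        dx = int(round(cx - (w - 1) / 2))
        dy = int(round(cy - (h - 1) / 2))
        # Keep the window inside the map.
        x = min(max(x + dx, 0), prob.shape[1] - w)
        y = min(max(y + dy, 0), prob.shape[0] - h)
        if abs(dx) < eps and abs(dy) < eps:
            break  # converged: the shift is below the threshold
    return x, y, w, h


# Synthetic back projection: a bright 10x10 target on an empty map.
prob = np.zeros((100, 100))
prob[50:60, 60:70] = 1.0

# Start with a window that only partially overlaps the target.
window = mean_shift(prob, (48, 40, 20, 20))
print("window after mean shift:", window)
```

Because each step depends only on the mass currently inside the window, any similarly colored region that enters the window pulls it away from the target, which matches the drift discussed under Room for Improvement.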

In [None]:
try:
    gesture_recongizer = setup_gesture_recognizer()
    
    # Mean Shift tracking parameters
    term_criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    
    while True:
        ret, frame = cap.read()
        if not ret:
            cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
            continue
        
        # Flip the frame to enable mirror view
        frame = cv2.flip(frame, 1)
        output_frame = frame.copy()
        
        if not tracking_started:
            
            # Detect the initial point using the MediaPipe library
            initial_point = detect_keypoint(frame, gesture_recongizer)
            
            if initial_point is not None:
                tracking_started = True
                print("Initial point detected! Starting tracking...")
                
                # Create initial window for mean shift
                # The window size is fixed at 50x50 pixels
                box_size = 50
                track_window = (
                    initial_point[0] - box_size//2,  
                    initial_point[1] - box_size//2, 
                    box_size,  
                    box_size 
                )
                
                # Set up the Region of Interest for tracking
                x, y, w, h = track_window
                roi = frame[y:y+h, x:x+w]
                
                hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
                
                # Create mask and normalized histogram
                # The mask keeps pixels with enough saturation and brightness,
                # since hue is unreliable for dark or washed-out pixels
                mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
                roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
                cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
                
                current_point = initial_point
                points = [initial_point]
        
        if tracking_started:
            
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            
            # Calculate back projection
            # This is the probability of each pixel belonging to the hand
            dst = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
            
            # Then apply meanshift to get the new location
            ret, track_window = cv2.meanShift(dst, track_window, term_criteria)
            
            # Get the new position from the returned track_window
            x, y, w, h = track_window
            current_point = (int(x + w//2), int(y + h//2))
            
            # Check if tracking is still valid
            # Here, we check if the bounding box is within the frame
            # Bounding box outside the frame means tracking is lost because the hand is out of view
            if w > 0 and h > 0 and x >= 0 and y >= 0 and x + w <= frame.shape[1] and y + h <= frame.shape[0]:
                points.append(current_point)
                
                # Draw the line on canvas
                if len(points) > 1:
                    cv2.line(canvas, points[-2], points[-1], (0, 0, 255), 3, cv2.LINE_AA)
                
                # Combine canvas with output frame for visualization
                output_frame = cv2.addWeighted(output_frame, 1.0, canvas, 1.0, 0)
                
                # Draw tracking visualization
                cv2.circle(output_frame, current_point, 5, (0, 255, 0), -1)
                cv2.rectangle(output_frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
                cv2.putText(output_frame, "Tracking Active", (10, 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
                
                detected_point = detect_keypoint(frame, gesture_recongizer)
        
                if detected_point is not None:
                    # Computing the Euclidean distance between the mean shift estimate and the detection.
                    tracking_error = np.linalg.norm(
                        np.array([detected_point[0], detected_point[1]]) -
                        np.array([current_point[0], current_point[1]])
                    )
                    
                    mean_shift_acc_error += tracking_error
                
                # Check for closed fist gesture
                # If a closed fist is detected, we end that tracking session
                # and perform OCR on the canvas using the text detector Gemini API
                if detect_closed_fist(frame, gesture_recongizer):
                    tracking_started = False
                    print("Closed fist detected - ending tracking")
                    
                    # Perform OCR on the canvas
                    if len(points) > 1:  # Only if something was drawn
                        text = detect_text_from_canvas(canvas, text_detector)
                        recognized_text += f"{text} "
                        print(f"Recognized text: {recognized_text}")
                        
                        canvas = np.zeros_like(frame)
            else:
                cv2.putText(output_frame, "Tracking Lost", (10, 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
                
                # Reset tracking variables
                tracking_started = False
                points = []
                canvas = np.zeros_like(frame)
        else:
            # Combine existing canvas with output frame
            output_frame = cv2.addWeighted(output_frame, 1.0, canvas, 0.5, 0)
            cv2.putText(output_frame, "Waiting for pointing gesture...", (10, 30),
                       cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        
        # Display the last recognized text as subtitle if it exists
        if recognized_text:
            draw_recognized_text(output_frame, recognized_text)
        
        display_cv2(output_frame)
        key = cv2.waitKey(int(1000/fps)) & 0xFF
        if key == ord('q'):
            break
        
except KeyboardInterrupt:
    print("Interrupted by user")
finally:
    gesture_recongizer.close()
    print("MeanShift Accumulated Tracking Error:", mean_shift_acc_error)
    
    cv2.destroyAllWindows()
    cv2.waitKey(1)
    cap.release()