# Task 1

In [None]:
# Importing Libraries
import torch
import torchvision
from torchvision.transforms import functional as F
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment
from IPython.display import Video, FileLink, display
import torch.nn as nn

The `KalmanFilter` class tracks an object by estimating its position over time while accounting for motion and measurement uncertainty. The state vector \( x = [p_x, p_y, v_x, v_y]^T \) holds the object's position and velocity. The constructor initializes the matrices that describe the system. The state transition matrix \( A \) predicts the next state under a constant-velocity model, and the observation matrix \( H \) maps the state to the measurement space (position only). The process noise covariance matrix \( Q \) accounts for uncertainty in the motion model, the measurement noise covariance matrix \( R \) reflects the noise in the observations, and the error covariance matrix \( P \) captures the estimated uncertainty of the state.

The `predict` method uses the state transition matrix \( A \) to forecast the object's next state, updating the state \( x \) and increasing the uncertainty in \( P \) by incorporating the process noise \( Q \). This step provides a prediction of where the object is expected to be. The `update` method then corrects this prediction using the new measurement \( z \). It calculates the innovation covariance \( S \), which represents the total uncertainty in the predicted measurement, and computes the Kalman gain \( K \), a factor that determines how much the prediction should be adjusted based on the observed measurement. The state \( x \) is updated by adding a weighted difference between the observed and predicted positions, and the error covariance \( P \) is adjusted to reflect the reduced uncertainty after incorporating the measurement. This iterative process of prediction and update allows the Kalman Filter to track objects effectively by dynamically balancing between the model's prediction and real-world observations.

Kalman Filter Equation:

Prediction Step:

State Prediction:
$\mathbf{x}_{k|k-1} = \mathbf{A} \mathbf{x}_{k-1|k-1} + \mathbf{B} \mathbf{u}_k$
- $\mathbf{x}_{k|k-1}$: Predicted state estimate at time $k$
- $\mathbf{A}$: State transition matrix
- $\mathbf{x}_{k-1|k-1}$: Previous state estimate at time $k-1$
- $\mathbf{B}$: Control input matrix (omitted in the implementation in this notebook, which has no control input)
- $\mathbf{u}_k$: Control vector (likewise omitted)

Covariance Prediction:
$\mathbf{P}_{k|k-1} = \mathbf{A} \mathbf{P}_{k-1|k-1} \mathbf{A}^T + \mathbf{Q}$
- $\mathbf{P}_{k|k-1}$: Predicted covariance matrix at time $k$
- $\mathbf{P}_{k-1|k-1}$: Previous covariance matrix
- $\mathbf{Q}$: Process noise covariance matrix

Update Step:

Kalman Gain Calculation:
$\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}^T \left( \mathbf{H} \mathbf{P}_{k|k-1} \mathbf{H}^T + \mathbf{R} \right)^{-1}$
- $\mathbf{K}_k$: Kalman Gain
- $\mathbf{H}$: Observation matrix
- $\mathbf{R}$: Measurement noise covariance matrix

State Update:
$\mathbf{x}_{k|k} = \mathbf{x}_{k|k-1} + \mathbf{K}_k \left( \mathbf{z}_k - \mathbf{H} \mathbf{x}_{k|k-1} \right)$
- $\mathbf{x}_{k|k}$: Updated state estimate
- $\mathbf{z}_k$: Measurement vector at time $k$

Covariance Update:
$\mathbf{P}_{k|k} = \left( \mathbf{I} - \mathbf{K}_k \mathbf{H} \right) \mathbf{P}_{k|k-1}$
- $\mathbf{P}_{k|k}$: Updated covariance matrix
- $\mathbf{I}$: Identity matrix
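
To make the equations concrete, here is a minimal NumPy sketch of one predict/update cycle for a constant-velocity model. The initial state, the measurement, and the noise levels are illustrative assumptions rather than values taken from the notebook, and the control term $\mathbf{B}\mathbf{u}_k$ is left out.

```python
import numpy as np

dt = 1.0
# Constant-velocity model: state is [px, py, vx, vy]
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # only position is observed
Q = np.eye(4)           # assumed process noise
R = np.eye(2) * 0.01    # assumed measurement noise

x = np.array([[0.0], [0.0], [1.0], [2.0]])  # assumed start: origin, velocity (1, 2)
P = np.eye(4)

# Prediction step
x_pred = A @ x                 # position advances by velocity * dt -> [1, 2, 1, 2]
P_pred = A @ P @ A.T + Q       # uncertainty grows

# Update step with an assumed measurement near the prediction
z = np.array([[1.2], [1.9]])
S = H @ P_pred @ H.T + R                 # innovation covariance
K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
x_new = x_pred + K @ (z - H @ x_pred)    # corrected state
P_new = (np.eye(4) - K @ H) @ P_pred     # uncertainty shrinks

print("predicted position:", x_pred[:2].ravel())
print("updated position:  ", x_new[:2].ravel())
```

Because the assumed measurement noise is small relative to the predicted uncertainty, the gain is close to 1 and the updated position lands near the measurement while staying between it and the prediction.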



In [None]:
# Kalman Filter class for tracking
class KalmanFilter:
    def __init__(self, dt=1, std_acc=1, x_std_meas=0.1, y_std_meas=0.1):
        self.dt = dt
        self.A = np.eye(4) + np.array([[0, 0, dt, 0], [0, 0, 0, dt], [0, 0, 0, 0], [0, 0, 0, 0]])
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])
        self.Q = np.eye(4) * std_acc ** 2
        self.R = np.diag([x_std_meas ** 2, y_std_meas ** 2])
        self.P = np.eye(4)
        self.x = np.zeros((4, 1))

    def predict(self):
        self.x = np.dot(self.A, self.x)
        self.P = np.dot(np.dot(self.A, self.P), self.A.T) + self.Q
        return self.x[:2].flatten()

    def update(self, z):
        z = z.reshape(2, 1)  # Ensure z is a 2D column vector
        S = np.dot(np.dot(self.H, self.P), self.H.T) + self.R
        K = np.dot(np.dot(self.P, self.H.T), np.linalg.inv(S))
        self.x += np.dot(K, (z - np.dot(self.H, self.x)))
        self.P = np.dot(np.eye(4) - np.dot(K, self.H), self.P)

The `Tracker` class is designed to manage the tracking of individual objects detected using the Faster R-CNN object detection model, without utilizing appearance features. Each tracker is responsible for following one specific object in the video. The constructor of the class initializes the tracker's unique identifier \( \text{id} \), the bounding box \( \text{bbox} \) representing the object's location, and an instance of the `KalmanFilter` class used for estimating and predicting the object's position. The initial position of the Kalman Filter state \( x \) is set to the coordinates of the top-left corner of the bounding box. Additionally, the tracker keeps track of the age of the object (how many frames it has been tracked) and counts how many consecutive frames the object has been invisible, which helps manage the tracker when the object temporarily goes out of view.

The `predict` method updates the age and increases the count of consecutive frames where the object has not been detected, then uses the Kalman Filter to predict the next position of the object based on its previous motion. The `update` method adjusts the tracker's state using a new bounding box provided by the Faster R-CNN detection. It updates the bounding box and applies the Kalman Filter to incorporate the new observation, correcting the predicted position to align with the actual detected position. The `consecutive_invisible_count` is reset, as the tracker has successfully matched a new detection. This class allows for efficient and continuous tracking of objects in a video stream, dynamically updating their estimated positions while handling scenarios where detections may be temporarily lost.
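
The bookkeeping around `consecutive_invisible_count` can be sketched without any detector. The `Track` class and `step` function below are hypothetical stand-ins written for illustration (they are not the notebook's `Tracker`), and the threshold of 2 frames is an assumed value for `max_invisible_frames`. The order of operations (predict, filter, then update matched tracks) mirrors the tracking loop used later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical stand-in that keeps only the invisibility counter."""
    id: int
    consecutive_invisible_count: int = 0

    def predict(self):
        # Assume the object is missed until a detection says otherwise
        self.consecutive_invisible_count += 1

    def update(self):
        # A matched detection resets the counter
        self.consecutive_invisible_count = 0


def step(tracks, detected_ids, max_invisible_frames=2):
    """Advance one frame: predict all, drop stale tracks, update matched ones."""
    alive = []
    for t in tracks:
        t.predict()
        if t.consecutive_invisible_count <= max_invisible_frames:
            alive.append(t)
    for t in alive:
        if t.id in detected_ids:
            t.update()
    return alive


tracks = [Track(0), Track(1)]
history = []
# Track 1 is detected only in the first frame
for detected in [{0, 1}, {0}, {0}, {0}, {0}]:
    tracks = step(tracks, detected)
    history.append([t.id for t in tracks])

print(history)  # [[0, 1], [0, 1], [0, 1], [0], [0]]
```

Track 1 survives two missed frames and is dropped on the third, which is the grace period that lets a briefly hidden object keep its ID.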

In [None]:
# Tracker class for Faster R-CNN (without appearance features)
class Tracker:
    def __init__(self, id, bbox):
        self.id = id
        self.kf = KalmanFilter()
        self.kf.x[:2] = np.array([[bbox[0]], [bbox[1]]])
        self.bbox = bbox
        self.age = 0
        self.visible_count = 0
        self.consecutive_invisible_count = 0

    def predict(self):
        self.age += 1
        self.consecutive_invisible_count += 1
        return self.kf.predict()

    def update(self, bbox):
        self.bbox = bbox
        self.kf.update(np.array(bbox[:2]))
        self.visible_count += 1  # Count frames in which this object was matched
        self.consecutive_invisible_count = 0  # Reset when updated

The `iou` function computes the Intersection over Union (IoU) between two bounding boxes, which is a measure of overlap used to evaluate the similarity between predicted and ground-truth bounding boxes in object detection tasks. The function takes two bounding boxes as inputs, each defined by four coordinates: the top-left corner (\( x1, y1 \)) and the bottom-right corner (\( x2, y2 \)) for `box1`, and similarly for `box2`. The function first determines the coordinates of the intersection area between the two boxes. The intersection is calculated by taking the maximum of the top-left coordinates and the minimum of the bottom-right coordinates, ensuring that only the overlapping region is considered.

The area of the intersection is computed as the product of the width and height of the overlapping region, each clamped to zero so that non-overlapping boxes yield an intersection of 0. The code adds 1 to every side length, which treats the coordinates as inclusive pixel indices. The areas of both bounding boxes are calculated the same way. The IoU is then the ratio of the intersection area to the union area, where the union is the sum of the two box areas minus the intersection (to avoid counting it twice). The resulting value ranges from 0 to 1, with 1 meaning identical boxes and 0 meaning no overlap. The metric is widely used to compare detections with ground-truth labels; in this notebook it scores how well a new detection overlaps a tracker's last known box.
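
A short worked example helps check the arithmetic. The function below restates the same formula under the hypothetical name `iou_inclusive`, and the boxes are made-up values chosen so the result is easy to verify by hand.

```python
def iou_inclusive(box1, box2):
    """IoU with the inclusive-pixel (+1) convention described above."""
    x1, y1, x2, y2 = box1
    x1_p, y1_p, x2_p, y2_p = box2
    # Corners of the overlapping rectangle
    xi1, yi1 = max(x1, x1_p), max(y1, y1_p)
    xi2, yi2 = min(x2, x2_p), min(y2, y2_p)
    inter = max(0, xi2 - xi1 + 1) * max(0, yi2 - yi1 + 1)
    area1 = (x2 - x1 + 1) * (y2 - y1 + 1)
    area2 = (x2_p - x1_p + 1) * (y2_p - y1_p + 1)
    return inter / (area1 + area2 - inter)


a = (0, 0, 9, 9)        # 10 x 10 box
b = (5, 5, 14, 14)      # 10 x 10 box shifted by 5 pixels in x and y
c = (20, 20, 29, 29)    # 10 x 10 box far away

overlap = iou_inclusive(a, b)    # 25 / (100 + 100 - 25) = 1/7
disjoint = iou_inclusive(a, c)   # 0.0
identical = iou_inclusive(a, a)  # 1.0
print(overlap, disjoint, identical)
```

With a matching threshold of 0.3, the shifted pair (IoU of about 0.14) would not count as a match.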

In [None]:
# Helper function to compute Intersection over Union (IoU)
def iou(box1, box2):
    x1, y1, x2, y2 = box1
    x1_p, y1_p, x2_p, y2_p = box2
    xi1, yi1 = max(x1, x1_p), max(y1, y1_p)
    xi2, yi2 = min(x2, x2_p), min(y2, y2_p)
    inter_area = max(0, xi2 - xi1 + 1) * max(0, yi2 - yi1 + 1)
    box1_area = (x2 - x1 + 1) * (y2 - y1 + 1)
    box2_area = (x2_p - x1_p + 1) * (y2_p - y1_p + 1)
    union_area = box1_area + box2_area - inter_area
    return inter_area / union_area

The `detect_and_track` function performs frame-by-frame object detection and tracking on a video file. It uses a pre-trained object detection model, such as Faster R-CNN, to identify objects in each frame and then applies a tracking algorithm to maintain the identity of these objects across frames. The function begins by opening the input video from the given path and setting up the output video writer to save the processed frames. It also moves the model to the specified device (GPU or CPU) and sets it to evaluation mode.

The main loop runs as long as the video capture is open, processing one frame at a time. Each frame is resized to the capture's own width and height (normally a no-op that keeps the frame size consistent with the writer), converted from BGR to RGB, and transformed into a tensor for model input. The detection model then predicts bounding boxes, confidence scores, and labels for the objects in the frame. To suppress weak or tiny detections, only boxes with a confidence score greater than 0.7 and an area greater than 500 square pixels are kept.
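
The filtering rule can be sketched on its own with NumPy. The boxes and scores below are invented for illustration; only the two thresholds (0.7 and 500) come from the function described here.

```python
import numpy as np

# Boxes are [x1, y1, x2, y2]; values are made up for this example
boxes = np.array([
    [10.0, 10.0, 60.0, 60.0],    # 50 x 50 = 2500 px^2, confident -> kept
    [0.0, 0.0, 10.0, 10.0],      # 10 x 10 = 100 px^2, too small -> dropped
    [5.0, 5.0, 100.0, 100.0],    # large, but low confidence -> dropped
])
scores = np.array([0.90, 0.95, 0.50])

areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
keep = (scores > 0.7) & (areas > 500)   # both conditions must hold
detections = boxes[keep]

print(keep)        # [ True False False]
print(detections)
```

A boolean mask like this is equivalent to the list comprehension in the function, and it makes it easy to apply the same mask to `scores` or `labels` if they are needed later.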

The function then advances each existing tracker with its Kalman Filter and sets aside trackers that have gone unmatched for more than `max_invisible_frames` frames. It uses the Hungarian algorithm to associate new detections with the remaining trackers based on the Intersection over Union (IoU) metric, computed between each detection and the tracker's most recently matched bounding box (the Kalman prediction itself is not used in the association). If the IoU of an assigned pair is above a threshold (0.3 in this case), the tracker is updated with the new detection. Detections that are not matched to any tracker initialize new trackers, each with a unique ID.

Finally, the function draws bounding boxes around tracked objects on the frame, along with their IDs, and writes the processed frame to the output video. The process continues until all frames have been processed, after which the video capture and writer are released, and a message indicates that the video processing is complete. This approach effectively tracks multiple objects in a video, maintaining their identities and visualizing the tracking results.



### Hungarian Algorithm
The Hungarian algorithm is used to associate detected bounding boxes from the object detection model (e.g., Faster R-CNN) with the existing trackers that maintain object identities across frames.

### Steps of the Hungarian Algorithm
1. **Cost Matrix Creation**:
   - The algorithm starts by creating a cost matrix, where each entry represents the "cost" of assigning a detection to a tracker. In this implementation, the cost is derived from the Intersection over Union (IoU) between bounding boxes:
     - A higher IoU means a better match, so the cost is the negative IoU: `linear_sum_assignment(-iou_matrix)` minimizes the total cost, which maximizes the total IoU.
  
2. **Minimizing Total Cost**:
   - The Hungarian algorithm then works to minimize the total assignment cost by finding the optimal assignment of detections to trackers. It does this in a systematic way:
     - It ensures that each detection is assigned to only one tracker, and each tracker is assigned to only one detection, in a way that the overall cost is minimized.

3. **Association Step**:
   - The algorithm outputs pairs of matched indices, indicating which detection should be matched with which tracker.
   - If a detection cannot be assigned to any existing tracker (because the cost is too high, i.e., the IoU is too low), a new tracker is created.
   - Similarly, if a tracker does not get a match, it might indicate that the object is temporarily not detected or has left the scene.
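
The steps above can be sketched with SciPy's `linear_sum_assignment` on a small IoU matrix. The matrix values are made up for illustration (3 detections, 2 trackers); the 0.3 threshold matches the one used in this notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows are detections, columns are trackers; entries are assumed IoU values
iou_matrix = np.array([
    [0.80, 0.10],
    [0.20, 0.05],
    [0.00, 0.25],
])
iou_threshold = 0.3

# Minimizing negative IoU is the same as maximizing total IoU
row_ind, col_ind = linear_sum_assignment(-iou_matrix)

matches = []
for r, c in zip(row_ind, col_ind):
    if iou_matrix[r, c] > iou_threshold:   # reject weak assignments
        matches.append((int(r), int(c)))

matched_dets = {r for r, _ in matches}
matched_trks = {c for _, c in matches}
unmatched_detections = [d for d in range(iou_matrix.shape[0]) if d not in matched_dets]
unmatched_trackers = [t for t in range(iou_matrix.shape[1]) if t not in matched_trks]

print(matches)               # [(0, 0)]
print(unmatched_detections)  # [1, 2]
print(unmatched_trackers)    # [1]
```

The solver pairs detection 2 with tracker 1 because that maximizes the total IoU, but the pair is rejected by the threshold, so detection 2 would start a new tracker and tracker 1 would count one more invisible frame.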

### The Role of the Hungarian Algorithm:
- **Efficient Matching**: It efficiently assigns detected objects to existing trackers, ensuring that each detection is paired with the best possible tracker.
- **Maintain Object Identity**: By using the optimal assignment, the algorithm helps maintain the identity of objects across video frames, even as they move.
- **Handle Multiple Objects**: The Hungarian algorithm is particularly effective in multi-object tracking (MOT) scenarios, where there are many objects that need to be tracked simultaneously.

### Significance in Object Tracking:
The Hungarian algorithm is central to tracking multiple objects accurately in a video. It helps the tracking system:
- **Reduce Mismatches**: By optimally matching detections to trackers, it reduces the chances of incorrectly assigning an identity to an object.
- **Handle Occlusions and Reappearances**: Trackers that miss a detection are kept for up to `max_invisible_frames` frames, so an object that is briefly occluded can be reassociated when it reappears, provided its new box still overlaps its last known box.



In [None]:
# Object detection and tracking function
def detect_and_track(video_path, output_video_path, model, device, max_invisible_frames=10):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    out = cv2.VideoWriter(output_video_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))
    model.to(device).eval()

    trackers = []
    next_id = 0
    frame_count = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame = cv2.resize(frame, (frame_width, frame_height))
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image_tensor = F.to_tensor(rgb_frame).unsqueeze(0).to(device)

        with torch.no_grad():
            outputs = model(image_tensor)[0]

        boxes = outputs['boxes'].cpu().numpy()
        scores = outputs['scores'].cpu().numpy()
        labels = outputs['labels'].cpu().numpy()

        # Filter detections based on confidence and size
        detections = [
            box for i, box in enumerate(boxes)
            if scores[i] > 0.7 and (box[2] - box[0]) * (box[3] - box[1]) > 500
        ]

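        # Advance every tracker by one frame and keep only those seen recently
        updated_trackers = []
        for tracker in trackers:
            tracker.predict()
            if tracker.consecutive_invisible_count <= max_invisible_frames:
                updated_trackers.append(tracker)
        trackers = list(updated_trackers)  # Drop stale trackers so the list does not grow without bound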

        # Use Hungarian algorithm to associate detections with trackers
        if len(detections) > 0 and len(updated_trackers) > 0:
            iou_matrix = np.zeros((len(detections), len(updated_trackers)))
            for d, det in enumerate(detections):
                for t, trk in enumerate(updated_trackers):
                    iou_matrix[d, t] = iou(det, trk.bbox)
            row_indices, col_indices = linear_sum_assignment(-iou_matrix)

            matched_indices = []
            for r, c in zip(row_indices, col_indices):
                if iou_matrix[r, c] > 0.3:  # IoU threshold
                    updated_trackers[c].update(detections[r])
                    matched_indices.append(r)

            unmatched_detections = [i for i in range(len(detections)) if i not in matched_indices]
        else:
            unmatched_detections = list(range(len(detections)))

        # Add new trackers for unmatched detections
        for i in unmatched_detections:
            trackers.append(Tracker(next_id, detections[i]))
            next_id += 1

        # Draw bounding boxes
        for tracker in updated_trackers:
            x1, y1, x2, y2 = map(int, tracker.bbox)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {tracker.id}", (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        out.write(frame)
        frame_count += 1

    cap.release()
    out.release()
    print("Video processing complete and saved.")

This code snippet sets up and executes the object detection and tracking process using a pre-trained Faster R-CNN model. It begins by determining the device on which to run the model, either a GPU (if available) or the CPU, using PyTorch's `torch.device`. The Faster R-CNN model, which is part of the `torchvision` library and is pre-trained on the COCO dataset, is then loaded and switched to evaluation mode by calling `model.eval()`. In evaluation mode the detector returns predictions (boxes, labels, and scores) rather than training losses, and any training-only layer behavior is turned off. Gradient computation is disabled separately, by the `torch.no_grad()` context inside `detect_and_track`.

Next, the `detect_and_track` function is called to process the video. The input video, specified by `video_path` (here, "soccer.mp4" in the `/content` directory), is analyzed frame by frame. The function applies the Faster R-CNN model to detect objects and uses a tracking algorithm to maintain the identities of these objects across frames. The processed video, with bounding boxes and object IDs, is saved to `output_video_path` ("colab_output_video.mp4").

Finally, the code uses Jupyter's display functionalities to show the processed video directly in the notebook. The `display(Video(...))` command embeds the video, allowing for immediate viewing, while `FileLink(...)` provides a clickable link for downloading the output video. This approach enables users to visualize and download the results of the object detection and tracking in an interactive and accessible manner.

In [None]:
# Load the model and set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Run the detection and tracking
video_path = '/content/soccer.mp4'
output_video_path = 'colab_output_video.mp4'
detect_and_track(video_path, output_video_path, model, device)

# Display the output video
display(Video("colab_output_video.mp4", embed=True))
display(FileLink("colab_output_video.mp4"))


Video processing complete and saved.


# Task 2

The `FeatureExtractor` class is a simple neural network module designed to extract feature representations from images using a pre-trained convolutional neural network (CNN). It leverages a ResNet-18 model, which is a popular deep learning architecture known for its efficiency and effectiveness in learning image features. In the constructor (`__init__` method), the `FeatureExtractor` class inherits from `nn.Module`, making it compatible with PyTorch's neural network framework. The `torchvision.models.resnet18(pretrained=True)` function loads a ResNet-18 model that has been pre-trained on the ImageNet dataset, allowing it to extract high-quality features from images.

The model's last fully connected layer, which is used for classification, is removed by converting the model's layers into a sequential container and excluding the final layer. This is done because the purpose of this class is to generate feature embeddings, not to classify images. The `forward` method defines how the input data flows through the network. It takes an input tensor `x`, passes it through the truncated ResNet-18 (which ends with global average pooling), and then flattens the output with `view(x.size(0), -1)` so that each image in the batch yields one feature vector (512 values for ResNet-18). This feature vector can then be used for tasks such as object tracking or similarity calculations, where a compact and meaningful representation of the image is required.

In [None]:
# Define a simple feature extractor using a pre-trained CNN
class FeatureExtractor(nn.Module):
    def __init__(self):
        super(FeatureExtractor, self).__init__()
        self.model = torchvision.models.resnet18(pretrained=True)
        self.model = nn.Sequential(*list(self.model.children())[:-1])  # Remove the last layer

    def forward(self, x):
        x = self.model(x)
        return x.view(x.size(0), -1)  # Flatten the output

This code snippet initializes the `FeatureExtractor` class and prepares it for use in a deep learning pipeline. First, it sets up the device for computation using PyTorch's `torch.device` function. If a CUDA-enabled GPU is available, the code will use the GPU to speed up the computations; otherwise, it defaults to using the CPU. This dynamic device selection ensures that the feature extraction process is optimized for the available hardware.

Next, an instance of the `FeatureExtractor` class is created, which loads a pre-trained ResNet-18 model (with its last layer removed) to serve as the feature extractor. The `to(device)` method moves the feature extractor to the selected device (either GPU or CPU), so that all operations performed by the model run on that hardware. Finally, `feature_extractor.eval()` puts the model into evaluation mode. This matters for inference because some layers behave differently during training: batch normalization switches from per-batch statistics to its stored running statistics, and dropout (not present in ResNet-18) is disabled. Evaluation mode therefore makes the extracted features deterministic for a given input.

In [None]:
# Initialize the feature extractor
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
feature_extractor = FeatureExtractor().to(device)
feature_extractor.eval()


FeatureExtractor(
  (model): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stat

The `Tracker` class follows the Deep-SORT design and maintains the state of an individual object being tracked across frames in a video. When a new `Tracker` instance is created, it is initialized with a unique identifier `id`, the object's initial bounding box `bbox`, and its `appearance_feature`, which captures the visual characteristics of the object. The appearance feature is intended to help distinguish the object from others; in this notebook it is stored and refreshed on every update, while the association step matches on IoU only. The `KalmanFilter` instance `kf` is used to predict the future position of the object based on its current state and motion, which helps handle uncertainty and noise in the object's movement.

The constructor also sets the initial position of the Kalman Filter's state vector to the top-left corner of the bounding box and initializes several attributes to keep track of the tracker's status. The `age` attribute counts how many frames the object has been tracked, `visible_count` keeps track of how many times the object has been detected, and `consecutive_invisible_count` records how many consecutive frames the object has not been detected.

The `predict` method advances the tracker's state, increasing the `age` and `consecutive_invisible_count` to reflect that a new frame has been processed. It then uses the Kalman Filter to predict the object's next position. The `update` method adjusts the tracker's state with new information from the latest detection. It updates the bounding box and appearance feature with the new data and resets the `consecutive_invisible_count` since the object was successfully detected. The class therefore stores both motion and appearance information for each object; the appearance feature is available for identity matching, although the tracking function in this notebook associates detections by IoU alone.

In [None]:
# Tracker class for Deep-SORT
class Tracker:
    def __init__(self, id, bbox, appearance_feature):
        self.id = id
        self.kf = KalmanFilter()
        self.kf.x[:2] = np.array([[bbox[0]], [bbox[1]]])
        self.bbox = bbox
        self.appearance_feature = appearance_feature
        self.age = 0
        self.visible_count = 0
        self.consecutive_invisible_count = 0

    def predict(self):
        self.age += 1
        self.consecutive_invisible_count += 1
        return self.kf.predict()

    def update(self, bbox, appearance_feature):
        self.bbox = bbox
        self.appearance_feature = appearance_feature
        self.kf.update(np.array(bbox[:2]))
        self.visible_count += 1  # Count frames in which this object was matched
        self.consecutive_invisible_count = 0

The `extract_features` function is a utility designed to extract appearance features from a set of image regions, or crops, that correspond to the bounding boxes of detected objects in a video frame. The function takes two inputs: `frame`, which is the current image frame from the video, and `boxes`, a list of bounding boxes indicating the regions of interest. For each bounding box in `boxes`, the function extracts the corresponding region from the `frame` using slicing, effectively cropping the image to include only the object of interest. These cropped images are then resized to a fixed size of 128x128 pixels to ensure a consistent input size for the feature extraction model, making the process more efficient and compatible with deep learning models.

The cropped and resized images are converted into tensors using the `F.to_tensor` method, which transforms the images into a format suitable for processing with PyTorch. Each tensor is unsqueezed to add a batch dimension and transferred to the specified device (GPU or CPU) for faster computation. These processed image tensors are stored in a list called `crops`.

If there are any crops, the function concatenates them along the batch dimension, creating a single tensor containing all the cropped images. It then uses the `feature_extractor` model, running in inference mode (with `torch.no_grad()` to disable gradient computation), to extract high-level appearance features from these image regions. The resulting features are converted to a NumPy array and returned. If there are no crops (e.g., if no objects were detected in the frame), the function returns an empty list. These extracted features are crucial for distinguishing and matching objects over multiple frames, aiding in accurate object tracking.
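
One common way to compare such feature vectors is cosine distance, sketched below with NumPy. This is an assumption about how the features could be used for matching, not something the notebook's tracking function does, and the three-element vectors are toy stand-ins for real 512-dimensional embeddings. The helper name `cosine_distance` is hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))


# Toy embeddings (assumed values)
track_feature = np.array([1.0, 0.0, 1.0])
same_object = np.array([2.0, 0.0, 2.0])    # same direction, different scale
other_object = np.array([0.0, 1.0, 0.0])   # orthogonal direction

d_same = cosine_distance(track_feature, same_object)
d_other = cosine_distance(track_feature, other_object)
print(d_same, d_other)
```

A small distance suggests the same object, so a value like this could be combined with IoU in the cost matrix when appearance-based association is wanted.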

In [None]:
# Helper function to extract appearance features
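def extract_features(frame, boxes):
    crops = []
    h, w = frame.shape[:2]
    for box in boxes:
        x1, y1, x2, y2 = map(int, box)
        # Clamp to the frame and keep at least one pixel so cv2.resize never receives an empty crop
        x1, y1 = max(0, min(x1, w - 1)), max(0, min(y1, h - 1))
        x2, y2 = max(x1 + 1, min(x2, w)), max(y1 + 1, min(y2, h))
        crop = frame[y1:y2, x1:x2]  # Crop the image (OpenCV frames are BGR)
        crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)  # The ImageNet-trained ResNet expects RGB input
        crop = cv2.resize(crop, (128, 128))  # Resize to a fixed size
        crop = F.to_tensor(crop).unsqueeze(0).to(device)
        crops.append(crop)

    if crops:
        crops = torch.cat(crops, dim=0)  # Stack crops into one batch
        with torch.no_grad():
            features = feature_extractor(crops).cpu().numpy()
        return features
    else:
        return []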

The `detect_and_track_deep_sort` function is a key component of the Deep-SORT tracking system, designed to perform object detection and then track these objects across frames in a video. It uses a pre-trained object detection model (such as Faster R-CNN) to identify objects and assigns unique IDs to each tracked object, ensuring that these objects are followed consistently throughout the video.

The function starts by opening the input video using OpenCV’s `VideoCapture` and setting up a `VideoWriter` to save the processed video frames to an output file. The video frame width, height, and frames per second (fps) are obtained from the input video. The detection model is transferred to the specified device (either GPU or CPU) and set to evaluation mode for efficient inference.

The main loop processes each frame of the video until there are no more frames to read. Each frame is converted from BGR (used by OpenCV) to RGB format and then transformed into a tensor suitable for input to the object detection model. The model predicts bounding boxes and confidence scores for detected objects in the frame. The detections are filtered based on a confidence threshold (greater than 0.7), and the corresponding appearance features of the detected objects are extracted using a helper function.

The function then advances the existing trackers, using a Kalman Filter to predict the positions of tracked objects. Trackers that have not been matched for more than a specified number of frames (`max_invisible_frames`) are no longer considered. The function uses a simple greedy Intersection over Union (IoU) test to match new detections with existing trackers: a detection is assigned to the first tracker whose most recently matched bounding box overlaps it with an IoU above 0.3, and that tracker is updated with the new detection and appearance feature. The Kalman prediction and the appearance features are not used in this association step. Unmatched detections are used to initialize new trackers, each assigned a unique ID.

The function then draws bounding boxes around each tracked object, along with the unique ID, on the current video frame. The processed frame is written to the output video file. Once all frames have been processed, the video capture and writer are released, and a message is printed to indicate that the video with tracked objects has been saved. This approach allows for efficient and continuous tracking of multiple objects in a video, handling scenarios where objects may enter or leave the frame.

In [None]:
# Deep-SORT Tracking Function
def detect_and_track_deep_sort(video_path, output_video_path, model, device, max_invisible_frames=10):
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    out = cv2.VideoWriter(output_video_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))
    model.to(device).eval()

    trackers = []
    next_id = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image_tensor = F.to_tensor(rgb_frame).unsqueeze(0).to(device)

        # Perform Faster R-CNN object detection
        with torch.no_grad():
            outputs = model(image_tensor)[0]

        boxes = outputs['boxes'].cpu().numpy()
        scores = outputs['scores'].cpu().numpy()

        # Filter detections based on confidence
        detections = [box for i, box in enumerate(boxes) if scores[i] > 0.7]
        appearance_features = extract_features(frame, detections)

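        # Advance every tracker by one frame and keep only those seen recently
        updated_trackers = []
        for tracker in trackers:
            tracker.predict()
            if tracker.consecutive_invisible_count <= max_invisible_frames:
                updated_trackers.append(tracker)
        trackers = list(updated_trackers)  # Drop stale trackers so the list does not grow without bound

        # Associate detections to trackers (greedy IoU matching; each tracker is matched at most once per frame)
        unmatched_detections = []
        matched_tracker_ids = set()
        for det, feature in zip(detections, appearance_features):
            matched = False
            for tracker in updated_trackers:
                if tracker.id in matched_tracker_ids:
                    continue  # Already claimed by an earlier detection in this frame
                if iou(det, tracker.bbox) > 0.3:  # Simple IoU matching
                    tracker.update(det, feature)
                    matched_tracker_ids.add(tracker.id)
                    matched = True
                    break
            if not matched:
                unmatched_detections.append((det, feature))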

        # Add new trackers for unmatched detections
        for det, feature in unmatched_detections:
            trackers.append(Tracker(next_id, det, feature))
            next_id += 1

        # Draw bounding boxes
        for tracker in updated_trackers:
            x1, y1, x2, y2 = map(int, tracker.bbox)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {tracker.id}", (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

        out.write(frame)

    cap.release()
    out.release()
    print("Deep-SORT Tracking video saved.")

This code snippet initializes and uses a pre-trained object detection model, specifically Faster R-CNN with a ResNet-50 backbone and a Feature Pyramid Network (FPN) from the `torchvision` library. The model is loaded with pre-trained weights, meaning it has already been trained on the COCO dataset and can detect the object categories in that dataset. The `model.eval()` line sets the model to evaluation mode, in which the detector returns predictions instead of training losses and any training-only layer behavior is turned off.

Next, the `detect_and_track_deep_sort` function is called to process a video for object detection and tracking. The `video_path` specifies the location of the input video file ("soccer.mp4" in this case, located in the `/content` directory), and `deep_sort_output_path` defines the name of the output video file that will be generated. The function uses the Faster R-CNN model to detect objects frame by frame and then applies the simplified Deep-SORT-style association defined in this notebook (IoU matching, with an appearance feature stored per tracker) to assign and maintain unique IDs for the detected objects throughout the video.

Finally, the code uses Jupyter’s display features to show the processed video directly within the notebook. The `display(Video(...))` function embeds the video in the notebook, allowing for immediate playback, while `FileLink(...)` creates a downloadable link for the output video. This setup allows users to view and download the video that showcases the detected and tracked objects, complete with bounding boxes and unique IDs.

In [None]:
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Run Deep-SORT Tracking
video_path = '/content/soccer.mp4'  # Update with your video path
deep_sort_output_path = 'deep_sort_output_video.mp4'
detect_and_track_deep_sort(video_path, deep_sort_output_path, model, device)

# Display the Deep-SORT video
display(Video(deep_sort_output_path, embed=True))
display(FileLink(deep_sort_output_path))


Deep-SORT Tracking video saved.


Link for the output
https://drive.google.com/drive/folders/1IX8i5NPcfNK1TcmbGlfLtQ8Fa8wbfP9K?usp=sharing