# Project 2: Vehicle Tracking 

## Introduction
#### Problem Description

In this project, we address the problem of multi-vehicle tracking in a traffic video sequence. The objective is not only to detect moving vehicles in each frame, but also to maintain a consistent identity (ID) for each vehicle over time.

Vehicle tracking is a fundamental problem in computer vision with applications in:

- Traffic monitoring

- Smart city infrastructure

- Autonomous driving systems

- Surveillance and safety analysis

The main challenge is that detection alone is not sufficient. Vehicles must be tracked across frames even when:

- Detection is noisy

- Objects are partially occluded

- Lighting conditions vary

- Vehicles temporarily disappear

#### Video link

https://drive.google.com/file/d/1bQPa8Yl_ufnt9xxYUh2o_OM3TCJ8KGTB/view?usp=sharing


In [9]:
import os
from pathlib import Path
import cv2
import numpy as np
import kagglehub

path = kagglehub.dataset_download("trainingdatapro/cars-video-object-tracking")
print("Path to dataset files:", path)

DATASET_DIR = Path(path)

Path to dataset files: /Users/Marta/.cache/kagglehub/datasets/trainingdatapro/cars-video-object-tracking/versions/3


In [10]:
IMAGE_DIR = DATASET_DIR / "images"

assert IMAGE_DIR.exists(), f"Missing {IMAGE_DIR}"
def sorted_images(folder: Path):
    exts = (".png", ".jpg", ".jpeg", ".bmp")
    files = [p for p in folder.iterdir() if p.suffix.lower() in exts]
    files.sort(key=lambda p: p.name)
    return files

IMAGE_FILES = sorted_images(IMAGE_DIR)

print("Frames:", len(IMAGE_FILES))


Frames: 301


## Approach Overview

We implement a tracking-by-detection pipeline, composed of two main stages:

### Detection stage
Vehicles are extracted from each frame using background subtraction (MOG2) and morphological filtering.

### Tracking stage
A Kalman Filter is used to:

- Predict vehicle motion

- Smooth noisy detections

- Maintain stable vehicle identities over time

This combination allows us to handle imperfect segmentation while keeping trajectories consistent.

## Detection

#### Why Background Subtraction?

We chose MOG2 (Mixture of Gaussians) because:

- It is computationally efficient.

- It adapts to gradual background changes.

- It works well for fixed-camera traffic scenarios.

Since the camera is static, moving vehicles appear as foreground regions, making background subtraction a suitable approach.

This section extracts moving objects from video frames using background subtraction and morphological cleaning. We use the MOG2 algorithm to separate foreground (moving objects) from the static background. To remove noise, we threshold the mask and apply morphological opening and closing. We then extract individual blobs from the cleaned mask as external contours.

Because the mask is imperfect due to various factors like vehicle color, distance to camera, and lighting conditions, a single car sometimes appears divided into multiple disconnected blobs. That's why we cluster and group the blobs that are likely to be part of the same vehicle, using proximity-based clustering to merge fragments while avoiding over-merging distant vehicles.

### Functions

Each function is explained below.

#### 1. Foreground Extraction

Each frame is processed using the MOG2 background subtractor to obtain a foreground mask. In this binary mask:

- Moving objects appear in white

- Static background appears in black

Since the camera is fixed, moving vehicles are detected as foreground regions.

However, the raw mask produced by MOG2 is not perfect. It typically contains:

- Shadows classified as foreground

- Reflections and glare

- Small noisy regions

- Fragmented vehicle shapes

Therefore, additional refinement steps are required before extracting reliable detections.

#### 2. Mask Cleaning

To improve the quality of the foreground mask, we apply several processing steps.

##### Thresholding

We first apply a binary threshold:

- Pixel values above 200 → set to 255

- All other values → set to 0

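This removes weak shadow responses (with `detectShadows=True`, MOG2 labels shadow pixels with the gray value 127 by default, well below the threshold) and ensures a clean binary mask.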

##### Morphological Operations

Two morphological operations are applied sequentially:

###### Opening (elliptical kernel 3×3, 2 iterations)

- Removes small isolated noise

- Eliminates tiny false positives

- Preserves main vehicle structures

###### Closing (elliptical kernel 2×2, 2 iterations)

- Fills small gaps inside blobs

- Connects slightly fragmented regions

We experimented with different kernel sizes during development. Larger kernels over-smoothed the mask and sometimes merged nearby vehicles. A small closing kernel (2×2) provided the best balance between noise removal and shape preservation.

These parameters were selected empirically based on visual stability across frames.

The result is a cleaner mask where each vehicle is more likely to appear as a single connected region.

#### 3. Perspective-Aware Area Filtering

Due to perspective projection:

- Vehicles closer to the camera (bottom of the image) appear larger

- Vehicles farther from the camera (top of the image) appear smaller

To model this effect, we analysed the bounding box annotations of the dataset and studied how their pixel area grows along the vertical axis.  
By plotting area against vertical position, we obtained a clear increasing trend and fitted two curves that estimate the expected minimum and maximum vehicle area (in pixels²) as a function of the normalized vertical position t = y_bottom / image_height:

- min_area(t) = 2000 + 44749.12 × t²
- max_area(t) = 2000 + 108157.55 × t²

For each detected blob:

- The bounding box is computed
- The bottom coordinate of the box (y_bottom) estimates its distance to the camera
- The blob area is checked against this allowed range

If the blob area is outside the interval, it is rejected.

During testing, the fitted limits proved too strict, because background subtraction often produces fragmented or slightly merged blobs.  
To make the model tolerant to imperfect segmentation, we multiply the lower bound by kmin and the upper bound by kmax:

- kmin = 0.3
- kmax = 1.3

These widen the acceptable area range and prevent rejecting valid vehicles due to:

- Fragmentation
- Slight merging
- Lighting distortions

Note: This approach is camera-dependent because the area model is tuned specifically to this viewpoint and scene geometry.
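To make the widened range concrete, the sketch below evaluates it at a few vertical positions. It restates the two fitted curves from the notebook's `allowed_area_range` so that it runs on its own; the frame height of 1080 px, the sample `y_bottom` values, and the helper `passes_area_filter` are illustrative assumptions, not values or names taken from the dataset or the notebook.

```python
def allowed_area_range(y_bottom: float, img_h: float) -> tuple[float, float]:
    """Same curves as the notebook's allowed_area_range, restated to run standalone."""
    t = y_bottom / img_h          # 0 at the top of the frame, 1 at the bottom
    kmin, kmax = 0.3, 1.3         # tolerance factors described above
    min_area = (2000.00 + 44749.12 * t * t) * kmin
    max_area = (2000.00 + 108157.55 * t * t) * kmax
    return min_area, max_area


IMG_H = 1080  # assumed frame height, for illustration only


def passes_area_filter(area: float, y_bottom: float, img_h: float = IMG_H) -> bool:
    """Hypothetical helper: True if the blob area lies inside the allowed range."""
    lo, hi = allowed_area_range(y_bottom, img_h)
    return lo <= area <= hi


for y_bottom in (270, 540, 1080):
    lo, hi = allowed_area_range(y_bottom, IMG_H)
    print(f"y_bottom={y_bottom:4d}: accept areas in [{lo:8.0f}, {hi:8.0f}] px^2")
```

Under these assumptions, a blob whose box ends halfway down the frame is accepted between roughly 4,000 and 38,000 px², while the same blob area near the bottom edge would need to exceed about 14,000 px².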

#### 4. Blob Clustering

Even after cleaning and filtering, background subtraction may produce multiple blobs for a single vehicle due to:

- Broken masks

- Gaps caused by reflections

- Partial segmentation

If left uncorrected, this would cause:

- Duplicate track IDs

- Jittery centroid measurements

To address this, we implemented proximity-based clustering using a Union-Find (Disjoint Set) structure.

##### Clustering Policy

Each detection has a centroid.

If two centroids are closer than a predefined distance threshold (40 pixels), they are grouped.

Clustering is transitive:

If A is close to B, and B is close to C, then all belong to the same cluster.
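The transitive behaviour can be illustrated with a small standalone sketch. It is not the notebook's function: `cluster_indices` is a hypothetical helper that works on bare centroid tuples and returns index groups, and the four sample points are made up for the example.

```python
def cluster_indices(points, threshold):
    """Group point indices linked by a chain of pairwise distances
    below `threshold` (single-linkage clustering via union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (xi, yi), (xj, yj) = points[i], points[j]
            if ((xj - xi) ** 2 + (yj - yi) ** 2) ** 0.5 < threshold:
                parent[find(j)] = find(i)   # union the two sets

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# A-B and B-C are 30 px apart, A-C is 60 px apart, D is far away.
centroids = [(0, 0), (30, 0), (60, 0), (200, 0)]
print(cluster_indices(centroids, threshold=40))   # [[0, 1, 2], [3]]
```

A and C end up in the same cluster even though they are 60 px apart, because B links them. This is also why a threshold that is too large can chain several nearby vehicles into one detection.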

##### Merging Strategy

For each cluster, the final detection is computed as:

- Area → sum of all member areas

- Centroid → area-weighted average (larger fragments influence more)

- Bounding box → tight bounding box enclosing all fragments

This ensures that the merged detection accurately represents the full vehicle.
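As a worked example of these three rules, the sketch below merges two made-up fragments of one vehicle. `merge_fragments` is a hypothetical standalone helper that follows the same rules as the merge step in `cluster_blobs_by_proximity`, except that it keeps the centroid as floats instead of truncating to integers.

```python
def merge_fragments(frags):
    """Merge detection dicts with 'centroid' (x, y), 'bbox' (x, y, w, h) and 'area'."""
    total_area = sum(f["area"] for f in frags)
    # Area-weighted centroid: larger fragments pull the result towards them.
    cx = sum(f["centroid"][0] * f["area"] for f in frags) / total_area
    cy = sum(f["centroid"][1] * f["area"] for f in frags) / total_area
    # Tight box around all fragment boxes.
    x_min = min(f["bbox"][0] for f in frags)
    y_min = min(f["bbox"][1] for f in frags)
    x_max = max(f["bbox"][0] + f["bbox"][2] for f in frags)
    y_max = max(f["bbox"][1] + f["bbox"][3] for f in frags)
    return {
        "centroid": (cx, cy),
        "bbox": (x_min, y_min, x_max - x_min, y_max - y_min),
        "area": total_area,
    }


# Illustrative fragments: a large front part and a small rear part.
front = {"centroid": (100, 200), "bbox": (80, 180, 40, 40), "area": 300.0}
rear = {"centroid": (130, 200), "bbox": (120, 185, 20, 30), "area": 100.0}

merged = merge_fragments([front, rear])
print(merged["centroid"])  # (107.5, 200.0)
print(merged["bbox"])      # (80, 180, 60, 40)
print(merged["area"])      # 400.0
```

The merged centroid sits three times closer to the large fragment than to the small one, which keeps the measurement stable when a small piece of the mask flickers in and out.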

##### Trade-Off

The distance threshold controls a balance:

- A larger threshold repairs more fragmentation but risks merging adjacent vehicles.

- A smaller threshold avoids merging different vehicles but keeps some fragmentation.

Through experimentation, a threshold of 40 pixels provided stable behavior in this scene.

In [None]:
def clean_mask(mask: np.ndarray) -> np.ndarray:
    
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    k_open = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    k_close = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2)) 
    # note: this is a small value but works better for us than bigger values

    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k_open, iterations=2)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, k_close, iterations=2)
    
    return mask

def cluster_blobs_by_proximity(dets: list, distance_threshold: float = 40.0) -> list:
    """Merge detections whose centroids are closer than distance_threshold (transitively)."""
    if len(dets) <= 1:
        return dets
    n = len(dets)
    clusters = list(range(n))
    def find_root(i):
        if clusters[i] != i:
            clusters[i] = find_root(clusters[i])
        return clusters[i]
    def union(i, j):
        root_i, root_j = find_root(i), find_root(j)
        if root_i != root_j:
            clusters[root_j] = root_i
    for i in range(n):
        for j in range(i + 1, n):
            cx_i, cy_i = dets[i]["centroid"]
            cx_j, cy_j = dets[j]["centroid"]
            d = ((cx_j - cx_i)**2 + (cy_j - cy_i)**2) ** 0.5
            if d < distance_threshold:
                union(i, j)
    # Merge clusters
    cluster_map = {}
    for i in range(n):
        root = find_root(i)
        if root not in cluster_map:
            cluster_map[root] = []
        cluster_map[root].append(i)
    merged = []
    for cluster_indices in cluster_map.values():
        cluster_dets = [dets[i] for i in cluster_indices]
        if len(cluster_dets) == 1:
            merged.append(cluster_dets[0])
        else:
            total_area = sum(d["area"] for d in cluster_dets)
            merged_cx = sum(d["centroid"][0] * d["area"] for d in cluster_dets) / total_area
            merged_cy = sum(d["centroid"][1] * d["area"] for d in cluster_dets) / total_area
            all_xs = [d["bbox"][0] for d in cluster_dets] + [d["bbox"][0] + d["bbox"][2] for d in cluster_dets]
            all_ys = [d["bbox"][1] for d in cluster_dets] + [d["bbox"][1] + d["bbox"][3] for d in cluster_dets]
            x_min, x_max = min(all_xs), max(all_xs)
            y_min, y_max = min(all_ys), max(all_ys)
            merged.append({
                "centroid": (int(merged_cx), int(merged_cy)),
                "bbox": (x_min, y_min, x_max - x_min, y_max - y_min),
                "area": total_area
            })
    return merged

def allowed_area_range(y_bottom, img_h):
    t = y_bottom / img_h
    kmin = 0.3  # Adjusted to allow more variance
    kmax = 1.3
    min_area = (2000.00 + 44749.12 * t * t) * kmin
    max_area = (2000.00 + 108157.55 * t * t) * kmax
    return min_area, max_area

def detect_blobs(mask: np.ndarray):
    H, W = mask.shape[:2]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    dets = []
    for c in contours:
        area = cv2.contourArea(c)
        if area <= 0:
            continue
        x, y, w, h = cv2.boundingRect(c)
        y_bottom = y + h  
        minA, maxA = allowed_area_range(y_bottom, H)
        if area <  minA or area > maxA:
            continue
        M = cv2.moments(c)
        if M["m00"] == 0:
            continue
        cx = int(M["m10"] / M["m00"])
        cy = int(M["m01"] / M["m00"])
        dets.append({"centroid": (cx, cy), "bbox": (x, y, w, h), "area": area})
    # Cluster and merge nearby fragmented blobs
    dets = cluster_blobs_by_proximity(dets, distance_threshold=40.0)
    return dets

## Tracking

Next, we need to track each detected car. We chose the Kalman filter because it is fast, extends naturally to multiple objects (one filter per vehicle), and predicts the probable next position of each car from its own motion model. Different cars move at different speeds, and each filter estimates the velocity of its own vehicle from the sequence of detections.

The Kalman filter maintains a state vector [x, y, vx, vy] representing position and velocity. The transition matrix defines how the state evolves between frames (it predicts where the car will be based on its current velocity). The measurement matrix maps the detected position (which we observe) to the state space. The process noise covariance controls how much we trust the motion model (lower values = trust motion more), while the measurement noise covariance controls how much we trust the detections (lower values = trust detections more). By balancing these, the filter smooths noisy detections while allowing the object to change speed.
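Written out, the constant-velocity model described above takes the following form. The matrices are the ones set in `init_kalman`; the symbols $q$ and $r$ are shorthand introduced here for `process_var` and `meas_var`, and the zero-mean Gaussian noise terms are the standard Kalman filter assumption rather than something verified on this dataset.

$$
\mathbf{x}_k = F\,\mathbf{x}_{k-1} + \mathbf{w}_k,
\qquad
\mathbf{z}_k = H\,\mathbf{x}_k + \mathbf{v}_k
$$

$$
\mathbf{x}_k = \begin{bmatrix} x_k \\ y_k \\ v_{x,k} \\ v_{y,k} \end{bmatrix},
\quad
F = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\quad
H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
$$

$$
\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, Q),\; Q = q\,I_4,
\qquad
\mathbf{v}_k \sim \mathcal{N}(\mathbf{0}, R),\; R = r\,I_2
$$

With $\Delta t = 1$ frame, the first two rows of $F$ state that the predicted position is the previous position plus the previous velocity, while the last two rows keep the velocity unchanged. $H$ simply picks the position out of the state, since the detector measures only the centroid.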

Each detection is attached to a Track object, which aims to maintain a persistent identity throughout the video. Every track has a unique ID and its own Kalman filter. The predict function estimates where the vehicle should be in the current frame using the motion model, before we see any new detections. The update function corrects the prediction using the actual detected position, allowing the filter to learn and adjust if the vehicle's motion changes.


##### Track Lifecycle and State Management

In addition to the Kalman filter, each Track object maintains internal counters to control its stability and lifetime:

- hits → total number of successful matches

- consecutive_hits → number of consecutive frames with successful detection

- missed → number of consecutive frames without a detection

A track is considered confirmed only after it has been matched for a minimum number of consecutive frames (MIN_HITS). This prevents short-lived noise detections from immediately becoming stable vehicle identities.

If a track does not receive a detection in a given frame, it is not immediately removed. Instead:

- The missed counter is incremented.

- The track continues relying on its predicted position.

A track is deleted only if:

- missed > MAX_AGE

This allows the system to tolerate temporary occlusions, glare, or segmentation failures while preventing stale tracks from persisting indefinitely.
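The counter logic can be isolated in a few lines. The sketch below is a simplified stand-in for the notebook's `Track` class with the Kalman filter left out; the class name `TrackCounters` and the methods `on_match`, `on_miss`, and `is_stale` are hypothetical names chosen for this example, while the values of `MIN_HITS` and `MAX_AGE` are the ones used in the configuration cell.

```python
MIN_HITS = 3    # consecutive matches needed for confirmation
MAX_AGE = 30    # consecutive misses tolerated before deletion


class TrackCounters:
    """Bookkeeping part of a track only; the Kalman filter is omitted."""

    def __init__(self):
        # A track is created from a detection, so it starts with one hit.
        self.hits = 1
        self.consecutive_hits = 1
        self.missed = 0

    def on_match(self):
        self.hits += 1
        self.consecutive_hits += 1
        self.missed = 0

    def on_miss(self):
        self.missed += 1
        self.consecutive_hits = 0

    def is_confirmed(self) -> bool:
        return self.consecutive_hits >= MIN_HITS

    def is_stale(self) -> bool:
        return self.missed > MAX_AGE


t = TrackCounters()          # frame 1: created from an unmatched detection
t.on_match()                 # frame 2
t.on_match()                 # frame 3
print(t.is_confirmed())      # True: three consecutive hits

t.on_miss()                  # frame 4: no detection
print(t.is_confirmed(), t.is_stale())   # False False

for _ in range(MAX_AGE):     # 30 more misses, 31 in total
    t.on_miss()
print(t.is_stale())          # True
```

One consequence worth noting: because a miss resets `consecutive_hits`, a confirmed track that loses its detection for a single frame counts as unconfirmed again until it has been matched `MIN_HITS` times in a row.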

##### Adaptive Association Strategy

Associating detections to existing tracks is one of the most critical parts of the tracking system.

Instead of using a fixed distance threshold, we implement an adaptive gating strategy based on bounding box size:

- Larger vehicles (closer to the camera) are allowed to move more pixels between frames.

- Smaller vehicles (farther from the camera) are constrained by a stricter motion limit.

The allowed matching distance is computed as:

- 20 + 0.8 × bbox_height

This makes the association perspective-aware and reduces identity switching across lanes.
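For a sense of scale, the sketch below evaluates the gate for two boxes. The function body matches the notebook's `match_distance_for_bbox`; the two bounding boxes are made-up examples of a distant and a nearby vehicle.

```python
def match_distance_for_bbox(bbox):
    """Adaptive gate in pixels: 20 + 0.8 * bbox height."""
    _, _, _, h = bbox
    return 20 + 0.8 * h


far_car = (400, 120, 45, 30)     # (x, y, w, h), small box near the top of the frame
near_car = (350, 600, 220, 150)  # large box near the bottom of the frame

for name, bbox in (("far", far_car), ("near", near_car)):
    print(f"{name}: gate = {match_distance_for_bbox(bbox):.0f} px")
```

The distant vehicle may be matched within about 44 px of its predicted position, the nearby one within about 140 px, so a fixed threshold would be either too loose for the former or too tight for the latter.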

##### Two-Pass Matching Procedure

To further improve robustness, association is performed in two passes.

###### First Pass: Confirmed Tracks

- Confirmed tracks are matched first.

- This prevents stable identities from being “stolen” by newly appearing detections.

- The closest detection within the adaptive gate is selected.

Additionally, a validation step is performed using the Kalman prediction:

- The Euclidean distance between the predicted position and the detection must remain plausible.

- A slightly relaxed threshold (1.2 × the adaptive gate) is used for this validation.

This prevents large, unrealistic corrections caused by mask artifacts or sudden segmentation errors.

###### Second Pass: Unconfirmed Tracks

After confirmed tracks are processed:

- Unconfirmed tracks are matched.

- A stricter distance threshold (0.7 × the adaptive gate) is applied.

This reduces the risk of unstable or newly created tracks capturing incorrect detections.

##### Creation of New Tracks

If a detection cannot be matched to any existing track:

- The system checks whether it is close to a predicted track position.

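- The suppression radius increases with the number of missed frames: it starts at 40 px (unconfirmed track) or 50 px (confirmed track), grows by 20 px per missed frame, and is capped at 120 px.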

This duplicate suppression mechanism prevents creating multiple IDs for the same vehicle if it temporarily disappears and reappears.

Only detections sufficiently far from all predicted track positions result in the creation of a new Track object with a new unique ID.
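The suppression rule can be sketched as two small functions. The constants are the ones used in the main loop; the function names `suppression_radius` and `is_duplicate` are hypothetical and were introduced only for this example.

```python
def suppression_radius(confirmed: bool, missed: int) -> int:
    """Radius (px) around a predicted track position inside which
    an unmatched detection is treated as a duplicate."""
    base = 50 if confirmed else 40
    return min(120, base + 20 * missed)   # linear growth, capped at 120 px


def is_duplicate(det_xy, track_pred_xy, confirmed: bool, missed: int) -> bool:
    dx = det_xy[0] - track_pred_xy[0]
    dy = det_xy[1] - track_pred_xy[1]
    return (dx * dx + dy * dy) ** 0.5 < suppression_radius(confirmed, missed)


print(suppression_radius(confirmed=True, missed=0))    # 50
print(suppression_radius(confirmed=False, missed=0))   # 40
print(suppression_radius(confirmed=False, missed=2))   # 80
print(suppression_radius(confirmed=False, missed=5))   # 120 (capped)

# A detection 50 px from a track that missed one frame is suppressed,
# but the same detection next to a fresh, unconfirmed track is not.
print(is_duplicate((100, 100), (130, 140), confirmed=False, missed=1))  # True
print(is_duplicate((100, 100), (130, 140), confirmed=False, missed=0))  # False
```

Note that in the notebook a missed frame resets `consecutive_hits`, so any track with `missed > 0` is evaluated as unconfirmed and uses the 40 px base; the 50 px base only applies to confirmed tracks that were matched in the current frame.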

In [12]:
def init_kalman(dt: float = 1.0, process_var=None, meas_var=None) -> cv2.KalmanFilter:
    """
    State: [x, y, vx, vy]^T
    Measurement: [x, y]^T
    process_var, meas_var: empirical, in pixel units.
    """
    kf = cv2.KalmanFilter(4, 2)

    kf.transitionMatrix = np.array([
        [1, 0, dt, 0 ],
        [0, 1, 0 , dt],
        [0, 0, 1 , 0 ],
        [0, 0, 0 , 1 ],
    ], dtype=np.float32)

    kf.measurementMatrix = np.array([
        [1, 0, 0, 0],
        [0, 1, 0, 0],
    ], dtype=np.float32)

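    # These are the two main tuning knobs. If the caller leaves them as None,
    # fall back to the values used in this notebook (PROCESS_VAR, MEAS_VAR).
    process_var = 0.2 if process_var is None else process_var
    meas_var = 1.0 if meas_var is None else meas_var
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * process_var
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * meas_var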

    # Start uncertain so it can adapt quickly.
    kf.errorCovPost = np.eye(4, dtype=np.float32) * 500.0
    kf.statePost = np.zeros((4, 1), dtype=np.float32)
    return kf

def color_from_id(track_id: int) -> tuple[int, int, int]:
    """Return a deterministic BGR color for a track ID."""
    rng = np.random.default_rng(track_id)  # stable seed per id
    # Keep colors away from the extremes: each channel is drawn from [40, 220)
    c = rng.integers(40, 220, size=3, dtype=np.int32)
    return (int(c[0]), int(c[1]), int(c[2]))  # B, G, R

class Track:
    """
    One vehicle hypothesis + identity.
    We keep:
      - Kalman filter
      - hits: how many times we matched a detection (confidence)
      - consecutive_hits: current run of matched frames; the track counts as
        confirmed once this reaches MIN_HITS (see is_confirmed)
      - missed: how many consecutive frames we failed to match (death timer)
    """
    def __init__(self, track_id: int, init_xy: tuple[int,int], init_bbox, dt=1.0, process_var=None, meas_var=None):
        self.id = track_id
        self.kf = init_kalman(dt=dt, process_var=process_var, meas_var=meas_var)
        self.color = color_from_id(track_id)

        x, y = init_xy
        self.kf.statePost = np.array([[x], [y], [0], [0]], dtype=np.float32)

        self.hits = 1
        self.missed = 0
        self.bbox = init_bbox
        self.history = [init_xy]
        self.consecutive_hits = 1  # Count consecutive frames with detections
        self.last_pred = (x, y)  # Initialize with first position

    def predict(self) -> tuple[float,float]:
        """
        Predict where the vehicle should be in the current frame (before seeing detections).
        Stores prediction in self.last_pred for association gating.
        """
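        pred = self.kf.predict()
        # pred is a 4x1 array; index both axes to get plain scalars
        # (calling float() on a 1-element array is deprecated in recent NumPy).
        self.last_pred = (float(pred[0, 0]), float(pred[1, 0]))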
        return self.last_pred

    def update(self, xy: tuple[int,int], bbox):
        """
        Correct the predicted state using the detection measurement.
        """
        cx, cy = xy
        z = np.array([[cx], [cy]], dtype=np.float32)
        self.kf.correct(z)
        self.hits += 1
        self.consecutive_hits += 1  # Increment consecutive hits
        self.missed = 0
        self.bbox = bbox
        self.history.append(xy)

    def mark_missed(self):
        """
        No detection matched this track this frame
        """
        self.missed += 1
        self.consecutive_hits = 0  # Reset consecutive hits

    def is_confirmed(self, min_hits: int = 3) -> bool:
        """Check if track is stable and should be displayed"""
        return self.consecutive_hits >= min_hits


In [13]:
# adaptive association with measurement validation

def match_distance_for_bbox(bbox):
    """
    Adaptive association gate, in pixels.
    The bbox height serves as a proxy for distance to the camera (perspective effect):
    closer objects appear larger and can move more pixels between frames.
    """
    _, _, _, h = bbox
    return 20 + 0.8 * h

def euclidean_distance(track, detection):
    """
    Compute Euclidean distance between track prediction and detection.
    Uses predicted state (last_pred) for gating
    """
    pred_x, pred_y = track.last_pred
    det_x, det_y = detection["centroid"]
    d = ((pred_x - det_x)**2 + (pred_y - det_y)**2) ** 0.5
    return d


def associate_detections_to_tracks(dets, tracks, min_confirmed_hits=3):
    """
    For each track, pick the nearest detection,
    but only if it lies within a size-dependent distance.
    
    Prioritize confirmed tracks to prevent ghost IDs from stealing detections.
    Validate measurements: reject detections that would cause large corrections
    to avoid mask artifacts corrupting track state.
    
    OPTIMIZED: Use sets instead of list.remove() to avoid O(n) operations.
    """

    matches = []
    unmatched_dets = set(range(len(dets)))
    unmatched_tracks = set(range(len(tracks)))

    # First pass: match CONFIRMED tracks (higher priority)
    for ti, t in enumerate(tracks):
        if not t.is_confirmed(min_confirmed_hits) or not unmatched_dets:
            continue

        #tx, ty = t.history[-1]
        tx, ty = getattr(t, "last_pred", t.history[-1])
        best_di = None
        best_d = float("inf")

        for di in unmatched_dets:
            cx, cy = dets[di]["centroid"]
            d = ((tx - cx)**2 + (ty - cy)**2) ** 0.5
            allowed = match_distance_for_bbox(dets[di]["bbox"])
            

            if d < allowed and d < best_d:
                best_d = d
                best_di = di

        if best_di is not None:
            # Additional validation: check that this detection is plausible
            # using Kalman filter's predicted state 
            euclid_d = euclidean_distance(t, dets[best_di])
            pred_allowed = match_distance_for_bbox(dets[best_di]["bbox"]) * 1.2  # 20% relaxation for KF
            if euclid_d < pred_allowed:
                matches.append((ti, best_di))
                unmatched_dets.discard(best_di)
                unmatched_tracks.discard(ti)

    # Second pass: match UNCONFIRMED tracks (lower priority, stricter matching)
    for ti, t in enumerate(tracks):
        if t.is_confirmed(min_confirmed_hits) or ti not in unmatched_tracks or not unmatched_dets:
            continue

        #tx, ty = t.history[-1]
        tx, ty = getattr(t, "last_pred", t.history[-1])
        best_di = None
        best_d = float("inf")

        for di in unmatched_dets:
            cx, cy = dets[di]["centroid"]
            d = ((tx - cx)**2 + (ty - cy)**2) ** 0.5
            # Stricter gate for unconfirmed tracks (avoid spurious matches)
            allowed = match_distance_for_bbox(dets[di]["bbox"]) * 0.7
            

            if d < allowed and d < best_d:
                best_d = d
                best_di = di

        if best_di is not None:
            matches.append((ti, best_di))
            unmatched_dets.discard(best_di)
            unmatched_tracks.discard(ti)

    return matches, list(unmatched_tracks), list(unmatched_dets)


In [14]:
#helper for visualization 
def draw_bbox(img, bbox, label, color, thickness=2):
    x, y, w, h = bbox
    x2, y2 = x + w, y + h

    # optional "shadow" outline for contrast
    cv2.rectangle(img, (x, y), (x2, y2), (0, 0, 0), thickness + 2, lineType=cv2.LINE_AA)
    cv2.rectangle(img, (x, y), (x2, y2), color, thickness, lineType=cv2.LINE_AA)

    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.55
    text_thickness = 2
    (tw, th), baseline = cv2.getTextSize(label, font, font_scale, text_thickness)

    y_text_top = y - (th + baseline + 6)
    if y_text_top < 0:
        y_text_top = y + 2

    x_text = x
    y_text = y_text_top + th + 3

    cv2.rectangle(img, (x_text, y_text_top), (x_text + tw + 8, y_text_top + th + baseline + 6),
                  (0, 0, 0), -1, lineType=cv2.LINE_AA)
    cv2.rectangle(img, (x_text + 1, y_text_top + 1), (x_text + tw + 7, y_text_top + th + baseline + 5),
                  color, -1, lineType=cv2.LINE_AA)

    cv2.putText(img, label, (x_text + 4, y_text),
                font, font_scale, (0, 0, 0), text_thickness, lineType=cv2.LINE_AA)

## Algorithm

The system follows a tracking-by-detection approach: first it detects moving vehicles in each frame, and then it keeps a persistent identity for each vehicle over time using a Kalman filter motion model.

The tracker behaviour is controlled by a few parameters that determine how tolerant the system is to missing detections and motion uncertainty.

In [15]:
SAVE_VIDEO = True
VIDEO_NAME = "vehicle_tracking_debug.mp4"
VIDEO_FPS = 20
video_writer = None
SHOW_EVERY = 1  
SCALE = 0.4      # scale for display screen

DT = 1.0
MAX_AGE  = 30       # frames allowed to miss before deleting (allows recovery from mask gaps)
MIN_HITS = 3         # show track after this many consecutive matches
PROCESS_VAR = 0.2   
MEAS_VAR    = 1.0
# background subtractor object
bg = cv2.createBackgroundSubtractorMOG2(history=400, varThreshold=15, detectShadows=True)

tracks = []
next_id = 1


`DT`  
Represents the time step between frames in the motion model. In this project everything is measured per frame, so it is set to 1.0 and velocities are expressed in pixels per frame. It does not change the behaviour; it only keeps the motion equations consistent.

`MAX_AGE`  
Maximum number of consecutive frames a track can remain unmatched before being deleted. This allows a vehicle to temporarily disappear (for example due to glare or segmentation errors) and still keep its identity when it reappears. Larger values make the tracker more tolerant but may keep dead tracks alive longer.

`MIN_HITS`  
Number of consecutive successful matches required before a track is considered reliable. This prevents unstable short detections from immediately becoming tracked vehicles and reduces ID flickering.

`PROCESS_VAR`  
Indicates how uncertain the vehicle motion is assumed to be.  
If it is large, the tracker assumes vehicles may change speed or direction and therefore relies more on the new detections.  
If it is small, the tracker assumes motion is smooth and relies more on its predicted trajectory.

`MEAS_VAR`  
Indicates how noisy the detections are expected to be.  
If it is large, the tracker considers the detections unreliable and follows the predicted trajectory more closely.  
If it is small, the tracker follows the detections more strictly.

In this project the detections come from background subtraction, which is noisy (shadows, glare and fragmentation).  
Therefore the tracker is configured to trust the Kalman prediction more than the raw detections (`MEAS_VAR` = 1.0 is larger than `PROCESS_VAR` = 0.2).
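The effect of the two variances can be seen in a deliberately simplified setting. The sketch below is not the notebook's 4-state filter: it assumes a 1-D random-walk model with a single position state, and `steady_state_gain` is a hypothetical helper that iterates the scalar Kalman recursion until the gain settles. The gain is the fraction of the gap between prediction and detection that the filter closes at each correction.

```python
def steady_state_gain(process_var: float, meas_var: float, steps: int = 200) -> float:
    """Kalman gain of a 1-D random-walk filter after `steps` predict/correct cycles."""
    p = 500.0   # large initial uncertainty, similar in spirit to errorCovPost
    k = 0.0
    for _ in range(steps):
        p_pred = p + process_var            # predict: uncertainty grows by Q
        k = p_pred / (p_pred + meas_var)    # gain: weight given to the measurement
        p = (1.0 - k) * p_pred              # correct: uncertainty shrinks
    return k


notebook_setting = steady_state_gain(process_var=0.2, meas_var=1.0)
swapped_setting = steady_state_gain(process_var=1.0, meas_var=0.2)

print(f"Q=0.2, R=1.0 -> gain {notebook_setting:.2f}")
print(f"Q=1.0, R=0.2 -> gain {swapped_setting:.2f}")
```

In this simplified model the notebook's values give a gain of roughly 0.36, so each correction moves the estimate about a third of the way towards the detection. Swapping the two values raises the gain to roughly 0.85, which would make the track follow noisy centroids almost directly.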


### Main loop explanation

For every frame of the video, the algorithm starts by extracting motion using a background subtraction model (MOG2). This produces a binary mask of moving regions. Because this raw mask contains noise, shadows and fragmented shapes, it is cleaned with morphological filtering so that each vehicle ideally becomes a single blob.

From this cleaned mask, blobs are extracted and converted into detections. Each detection contains a centroid (the measurement used by the tracker) and a bounding box (used for validation and visualization). At this stage, the algorithm does not yet know which vehicle is which; it only knows where motion exists in the current frame.

Next comes the prediction stage. Every tracked vehicle already has an associated Kalman filter that stores its estimated position and velocity. Before looking at the new detections, the tracker predicts where each vehicle should appear in the current frame. This prediction allows the system to bridge short detection failures (for example glare, shadows or imperfect segmentation).

After prediction, detections are matched to tracks. The association is done in two passes: confirmed tracks are matched first (to protect stable identities), and unconfirmed tracks are matched afterwards with stricter conditions. A detection is only assigned to a track if it is spatially close enough to the predicted position. This distance gating prevents a vehicle from suddenly jumping to another lane or swapping identity with another car.

When a match is found, the Kalman filter is corrected using the detected centroid. This step combines the prediction and the measurement to obtain a smoother and more stable estimate of the vehicle trajectory. If a track does not receive a detection in the current frame, it is not immediately deleted; instead, it is marked as “missed”. This allows the tracker to survive short occlusions or difficult lighting conditions.

If a detection cannot be matched to any existing track, the system decides whether it represents a new vehicle or a temporarily lost one. To avoid creating duplicate identities, the detection is compared against all predicted track positions and is discarded if it falls within the suppression radius of any of them. The longer a track has been missing, the larger this radius becomes. This adaptive suppression avoids assigning a second ID to a vehicle that disappears for several frames: the original track keeps coasting on its prediction and can be re-matched once a later detection falls inside its association gate.

Only detections that are sufficiently far from all existing tracks create a new track with a new identifier. A track is considered reliable only after it has been successfully matched for several consecutive frames. Finally, tracks that remain unmatched for too long are removed from the system.

Overall, the algorithm maintains stable vehicle identities by combining three ideas: motion detection to obtain measurements, a Kalman filter to predict motion over time, and adaptive association rules that tolerate temporary detection failures while avoiding duplicated IDs.



In [16]:
# For every frame, the Kalman filter predicts where each car should be (the predicted position).
# New detections (blobs) are then matched to existing tracks by comparing them with those predicted positions.
# If a detection matches a track, the Kalman filter updates (corrects) its state using the detected position.


for i, img_path in enumerate(IMAGE_FILES):
    frame = cv2.imread(str(img_path))
    if frame is None:
        continue

    H, W = frame.shape[:2]

    # 1) Foreground mask (motion)
    fg = bg.apply(frame)

    # 2) Clean the mask
    fg_clean = clean_mask(fg)

    # 3) Detections = blobs (with merging of fragments)
    dets = detect_blobs(fg_clean)

    # 4) Predict all tracks (KF motion model)
    for t in tracks:
        t.predict()

    # 5) Associate with two pass approach (confirmed tracks first)
    matches, unmatched_tracks, unmatched_dets = associate_detections_to_tracks(
        dets, tracks, min_confirmed_hits=MIN_HITS
    )

    # 6) Update matched tracks (KF correction step)
    for ti, di in matches:
        cx, cy = dets[di]["centroid"]
        bbox = dets[di]["bbox"]
        tracks[ti].update((cx, cy), bbox)

    # 7) Mark unmatched tracks as missed; stale ones are deleted in step 9
    for ti in unmatched_tracks:
        tracks[ti].mark_missed()

    # 8) Create new tracks for unmatched detections
    for di in unmatched_dets:
        cx, cy = dets[di]["centroid"]
        bbox = dets[di]["bbox"]

        # duplicate suppression: don't create new ID if near any existing track 
        duplicate = False
        for t in tracks:
            #tx, ty = t.history[-1]
            tx, ty = getattr(t, "last_pred", t.history[-1]) # use prediction 
            dist = ((tx - cx)**2 + (ty - cy)**2) ** 0.5
            base = 40 if not t.is_confirmed(MIN_HITS) else 50
            # If the track has missed frames, enlarge the suppression radius
            # Linear growth with a cap to avoid suppressing truly new vehicles
            threshold = min(120, base + 20 * t.missed)
            if dist < threshold:
                duplicate = True
                break
        if duplicate:
            continue

        # Allow new track creation 

        tracks.append(Track(
            track_id=next_id,
            init_xy=(cx, cy),
            init_bbox=bbox,
            dt=DT,
            process_var=PROCESS_VAR,
            meas_var=MEAS_VAR
        ))
        next_id += 1

    # 9) Delete only truly stale tracks (high MAX_AGE tolerance)
    tracks = [t for t in tracks if t.missed <= MAX_AGE]

    # 10) Visualize + record 
    if i % SHOW_EVERY == 0:
        vis = frame.copy()

        # Draw only confirmed tracks
        for t in tracks:
            if not t.is_confirmed(MIN_HITS):
                continue
            if t.missed > MAX_AGE:
                continue
            label = f"vehicle_{t.id}"
            draw_bbox(vis, t.bbox, label, t.color, thickness=2)

        mask_vis = cv2.cvtColor(fg_clean, cv2.COLOR_GRAY2BGR)

        vis_small  = cv2.resize(vis, (int(W * SCALE), int(H * SCALE)))
        mask_small = cv2.resize(mask_vis, (int(W * SCALE), int(H * SCALE)))

        stacked = np.vstack([mask_small, vis_small])

        # init writer 
        if SAVE_VIDEO and video_writer is None:
            hh, ww = stacked.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            video_writer = cv2.VideoWriter(VIDEO_NAME, fourcc, VIDEO_FPS, (ww, hh))
            print("Recording video to:", VIDEO_NAME)

        if SAVE_VIDEO and video_writer is not None:
            video_writer.write(stacked)

        cv2.imshow("vehicle tracking", stacked)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

# Cleanup
if video_writer is not None:
    video_writer.release()
    print("Saved:", VIDEO_NAME)

cv2.destroyAllWindows()



Recording video to: vehicle_tracking_debug.mp4


Saved: vehicle_tracking_debug.mp4


## Challenges and Possible Improvements
### Challenges

During the implementation and testing of the tracking system, several practical challenges were encountered.

#### 1. Multiple IDs per Vehicle (Duplicate Tracks)

One of the main issues was the creation of multiple track IDs for the same vehicle.

This typically occurred when:

- Background subtraction fragmented a single vehicle into multiple blobs.

- A vehicle temporarily disappeared and reappeared.

- The association gate was too strict.

Although blob clustering and duplicate suppression reduced this issue, there is always a trade-off:

- If the matching threshold is too strict → more duplicate IDs appear.

- If the threshold is too relaxed → different vehicles may merge or swap identities.

Finding the correct balance required empirical tuning.
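
To make the trade-off concrete, the suppression radius used in the tracking loop can be isolated into a small function. The sketch below mirrors the constants that appear in the loop (a base of 40 or 50 pixels, 20 pixels per missed frame, capped at 120); the function name `suppression_radius` is introduced here only for illustration.

```python
def suppression_radius(missed: int, confirmed: bool,
                       step: int = 20, cap: int = 120) -> int:
    """Radius (pixels) inside which a new detection counts as a duplicate."""
    base = 50 if confirmed else 40          # confirmed tracks claim more space
    return min(cap, base + step * missed)   # linear growth with a hard cap


# Radius of a confirmed track as it accumulates missed frames
for missed in range(5):
    print(missed, suppression_radius(missed, confirmed=True))
```

A confirmed track that has missed four or more frames already reaches the 120-pixel cap, which is why a larger cap quickly starts to suppress genuinely new vehicles in neighboring lanes.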

#### 2. ID Switching / ID Jumping

When vehicles move close to each other (e.g., adjacent lanes), association becomes ambiguous.

Even with adaptive distance gating, the tracker may:

- Assign a detection to the wrong track.

- Cause identity switching between vehicles.

This is a common limitation of motion-only tracking systems, since no appearance information is used to distinguish similar vehicles.

#### 3. Missed Detections

The system sometimes failed to detect vehicles due to:

- Strong glare regions on the road.

- Sudden illumination changes.

- Imperfect segmentation.

- Vehicles blending with the background.

When detections are missed:

- The Kalman filter predicts the position.

- If the object reappears within MAX_AGE, the identity is preserved.

- Otherwise, the track is deleted and the vehicle receives a new ID when it is detected again.

There is a trade-off between:

- Increasing MAX_AGE → better occlusion tolerance but risk of keeping dead tracks.

- Decreasing MAX_AGE → cleaner tracking but more ID fragmentation.
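
The effect of MAX_AGE can be illustrated with a minimal simulation of the missed-frame counter. This is a sketch that assumes the counter increases by one for every frame without a matched detection and that a track is kept while `missed <= MAX_AGE`, as in the deletion step of the loop; the helper `survives_gap` is hypothetical.

```python
def survives_gap(gap: int, max_age: int) -> bool:
    """Return True if a track outlives `gap` consecutive unmatched frames."""
    missed = 0
    for _ in range(gap):
        missed += 1              # no detection was matched in this frame
        if missed > max_age:     # the track would be deleted here
            return False
    return True


print(survives_gap(gap=5, max_age=5))   # the identity is preserved
print(survives_gap(gap=6, max_age=5))   # the track is dropped before reappearance
```

A detection gap of exactly MAX_AGE frames is the longest occlusion that keeps the original ID under this rule.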

#### 4. Strong Perspective Dependency

The area filtering model is tuned specifically to the camera perspective of this video.

The expected vehicle size is computed as a function of vertical image position.

While this improves detection quality in this scene, it introduces a limitation:

- The system would not generalize well to a different camera angle.

- The area model would need to be re-estimated.

This makes the detector scene-dependent.
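
Re-estimating the area model for a new scene could be as simple as fitting a curve to a few hand-measured samples. The sketch below assumes a linear relationship between the vertical position and the expected blob area, which may differ from the model used in this notebook; the sample values and the helpers `expected_area` and `area_is_plausible` are made up for illustration.

```python
import numpy as np

# Hypothetical (y, area) samples measured on a few vehicles in a new scene
ys = np.array([100.0, 200.0, 300.0, 400.0])
areas = np.array([600.0, 1100.0, 1600.0, 2100.0])

# Least-squares line: area ~ slope * y + intercept
slope, intercept = np.polyfit(ys, areas, deg=1)


def expected_area(y: float) -> float:
    """Expected vehicle area (pixels^2) at image row y."""
    return slope * y + intercept


def area_is_plausible(y: float, area: float, tol: float = 0.5) -> bool:
    """Accept a blob whose area is within +/- tol of the expected area."""
    exp = expected_area(y)
    return (1 - tol) * exp <= area <= (1 + tol) * exp


print(round(expected_area(250.0)))
print(area_is_plausible(250.0, 1400.0), area_is_plausible(250.0, 300.0))
```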

#### 5. Lighting Artifacts

In this particular video, a bright region on the road caused frequent tracking failures.

High-intensity areas affected:

- Foreground segmentation quality.

- Blob consistency.

- Stability of centroids.

Background subtraction methods are inherently sensitive to lighting variations and reflections.

#### 6. Vehicle Color Similarity

Vehicles with colors similar to the road surface were more difficult to segment.

For example:

- A car whose color blended with lane markings was poorly detected in early frames.

- This caused unstable tracking initialization.

Since MOG2 flags a pixel as foreground only when its value deviates sufficiently from the learned background model, low-contrast objects are harder to detect reliably.

#### 7. MOG2 Warm-Up Effect

The MOG2 background subtractor learns the background model online.

At the beginning of the video:

- The model is not yet stable.

- Moving vehicles may be partially absorbed into the background model.

- Foreground masks are inconsistent.

Because the video is relatively short, this stabilization period has a noticeable impact on early detections.
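
One way to limit the impact of the warm-up period is to keep updating the background model on every frame while discarding the masks produced during the first frames. The sketch below is not part of the pipeline above; it assumes any subtractor object that exposes an `apply(frame)` method, as the subtractor returned by `cv2.createBackgroundSubtractorMOG2()` does, and uses a hypothetical stand-in class so that the example runs without a video.

```python
def detections_after_warmup(frames, subtractor, warmup_frames):
    """Yield (index, mask) pairs, skipping masks produced during warm-up."""
    for i, frame in enumerate(frames):
        mask = subtractor.apply(frame)   # always update the background model
        if i < warmup_frames:
            continue                     # model still unstable: discard the mask
        yield i, mask


class CountingSubtractor:
    """Hypothetical stand-in with the same apply() method as a MOG2 subtractor."""

    def __init__(self):
        self.calls = 0

    def apply(self, frame):
        self.calls += 1
        return frame


sub = CountingSubtractor()
kept = [i for i, _ in detections_after_warmup(range(10), sub, warmup_frames=4)]
print(kept, sub.calls)
```

The cost of this choice is that vehicles visible only during the warm-up frames are never tracked, which matters in a sequence of a few hundred frames.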

### Possible Improvements

Several modifications could improve robustness and generalization.

#### 1. Use an Appearance Model

Currently, tracking relies only on motion information.

Adding appearance features (such as color histograms or embeddings) could:

- Reduce ID switching.

- Improve discrimination between nearby vehicles.

- Increase robustness in crowded scenarios.
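
As a rough illustration of a color-histogram appearance feature, the sketch below compares image patches with histogram intersection. It is an assumption about how such a feature could look, not something the current tracker does; the functions `color_histogram` and `histogram_intersection` and the synthetic patches are hypothetical.

```python
import numpy as np


def color_histogram(patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Per-channel histogram of a BGR uint8 patch, concatenated and normalized."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / max(h.sum(), 1.0)


def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return float(np.minimum(h1, h2).sum())


# Two synthetic patches: a red vehicle and a blue vehicle (BGR channel order)
red = np.zeros((20, 30, 3), np.uint8)
red[..., 2] = 200
blue = np.zeros((20, 30, 3), np.uint8)
blue[..., 0] = 200

h_red, h_blue = color_histogram(red), color_histogram(blue)
print(histogram_intersection(h_red, h_red))    # same appearance
print(histogram_intersection(h_red, h_blue))   # different appearance
```

Such a similarity score could be combined with the distance gate so that a nearby detection with a very different color is not assigned to the track.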

#### 2. Use the Hungarian Algorithm for Optimal Assignment

The current association method uses greedy nearest-neighbor matching.

Replacing it with the Hungarian algorithm would:

- Provide a globally optimal matching, i.e. the assignment with the minimum total cost.

- Reduce ambiguous assignments.

- Improve stability when multiple vehicles are close together.
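
The difference between greedy and optimal assignment shows up even with two tracks and two detections. The sketch below uses one common form of greedy matching (take the cheapest remaining pair first), which may differ in detail from the loop above, and finds the optimum by brute force over permutations rather than with the Hungarian algorithm itself; `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time. The cost values are made up.

```python
from itertools import permutations


def greedy_assign(cost):
    """Repeatedly take the cheapest pair whose track and detection are free."""
    pairs = sorted((c, i, j) for i, row in enumerate(cost)
                   for j, c in enumerate(row))
    used_rows, used_cols, out = set(), set(), []
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            out.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(out)


def optimal_assign(cost):
    """Brute-force minimum-cost assignment (square matrix, tiny sizes only)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(enumerate(best))


def total_cost(cost, pairs):
    return sum(cost[i][j] for i, j in pairs)


# Rows are tracks, columns are detections, entries are distances in pixels
cost = [[10, 12],
        [11, 40]]

print(greedy_assign(cost), total_cost(cost, greedy_assign(cost)))
print(optimal_assign(cost), total_cost(cost, optimal_assign(cost)))
```

Greedy matching locks in the 10-pixel pair first and forces the second track onto a detection 40 pixels away, while the optimal assignment accepts two slightly longer matches for a much lower total.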

#### 3. Replace Background Subtraction with a Deep Detector

Instead of relying on MOG2, a deep learning detector (e.g., YOLO) could be used.

Advantages:

- More robust to lighting variations.

- Less sensitive to shadows and glare.

- Better generalization across scenes.

This would significantly improve detection reliability.

#### 4. Shadow Removal Techniques

Explicit shadow removal could reduce false positives and blob fragmentation.

Techniques such as:

- HSV-based shadow filtering

- Illumination-invariant representations

could improve mask quality.
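
A lightweight starting point is the shadow labeling built into MOG2. When the subtractor is created with `detectShadows=True`, OpenCV marks shadow pixels with the value 127 by default and foreground pixels with 255, so shadows can be dropped with a simple comparison. The sketch below works on a synthetic mask and assumes shadow detection is enabled, which may not match the configuration used in this notebook; `drop_shadows` is a hypothetical helper.

```python
import numpy as np

SHADOW_VALUE = 127   # OpenCV's default label for shadow pixels in MOG2 masks


def drop_shadows(fg_mask: np.ndarray) -> np.ndarray:
    """Keep only confident foreground (255); zero out shadow and background."""
    return np.where(fg_mask == 255, 255, 0).astype(np.uint8)


# Synthetic mask row: background, shadow, foreground
mask = np.array([[0, SHADOW_VALUE, 255]], dtype=np.uint8)
print(drop_shadows(mask))
```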

#### 5. Adaptive Parameter Tuning

Instead of fixed thresholds:

- The gating distance could adapt dynamically based on motion variance.

- MAX_AGE could adjust based on object speed.

- Noise parameters could be updated online.

This would make the system more flexible in varying conditions.
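
A standard way to make the gate adapt to motion uncertainty is to measure the distance between a detection and the predicted position in units of the Kalman innovation covariance (Mahalanobis distance). The sketch below assumes a 2D position measurement with innovation covariance `S = H P Hᵀ + R`; the function names and the example covariances are hypothetical, and the threshold 9.21 is the 99% quantile of the chi-square distribution with 2 degrees of freedom.

```python
import numpy as np

CHI2_99_2DOF = 9.21   # 99% quantile of chi-square with 2 degrees of freedom


def mahalanobis_sq(z, z_pred, S) -> float:
    """Squared Mahalanobis distance between measurement and prediction."""
    y = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)  # innovation
    return float(y @ np.linalg.solve(S, y))


def in_gate(z, z_pred, S, gate: float = CHI2_99_2DOF) -> bool:
    """Accept the detection if it is statistically compatible with the track."""
    return mahalanobis_sq(z, z_pred, S) <= gate


detection, prediction = (12.0, 0.0), (0.0, 0.0)
S_confident = np.diag([4.0, 4.0])     # track updated recently (std of 2 px)
S_uncertain = np.diag([25.0, 25.0])   # track coasting on predictions (std of 5 px)

print(in_gate(detection, prediction, S_confident))
print(in_gate(detection, prediction, S_uncertain))
```

Because the predicted covariance grows while a track receives no updates, the same 12-pixel offset is rejected for a recently updated track and accepted for a coasting one, without a hand-tuned radius per missed frame.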

#### 6. Re-Identification Module

For long occlusions, a re-identification mechanism could be added.

This would allow:

- Recovering identities after long disappearances.

- Maintaining consistency even across partial scene exits.