# Advanced Object Tracking & Counting with DETR and SORT

## Project Overview

This project demonstrates an advanced computer vision pipeline that goes beyond simple detection to perform robust **object tracking and cumulative counting**. The solution leverages the official **Facebook (Meta AI) DETR model** for high-accuracy object detection and the **SORT (Simple Online and Realtime Tracking) algorithm** to assign unique IDs to objects and track them across video frames.

The primary application shown here is traffic analysis. A virtual counting zone (a narrow vertical band at the center of the frame) is defined, and the cumulative total for each object category (`person`, `bicycle`, `car`, `motorcycle`) is incremented only the first time a tracked object's center enters this zone. This yields a count of distinct objects rather than a per-frame tally. The project combines a modern detection model with practical, application-level counting logic.

## 1. Setup and Library Installation

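We install the required libraries first. SORT itself is not installed from a package: the reference implementation (`sort.py`) is downloaded and patched in the next cell, and `filterpy` and `scikit-image` are installed here because that file depends on them.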

In [1]:
# Install Libraries
!pip install transformers timm opencv-python filterpy scikit-image -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m178.0/178.0 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m83.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m71.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m36.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m28.7 MB/s

In [2]:
# Download AND Patch the SORT code

# Step 1: Download the original sort.py file, forcing the output name to be 'sort.py'
!wget https://raw.githubusercontent.com/abewley/sort/master/sort.py -O sort.py

# Step 2: Automatically edit the file to comment out the problematic line
# This 'sed' command finds the line "matplotlib.use('TkAgg')" and adds a '#' at the beginning, disabling it.
!sed -i "s/matplotlib.use('TkAgg')/# matplotlib.use('TkAgg')/" sort.py

print("File 'sort.py' downloaded and patched successfully.")

--2025-09-13 14:54:08--  https://raw.githubusercontent.com/abewley/sort/master/sort.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11739 (11K) [text/plain]
Saving to: ‘sort.py’


2025-09-13 14:54:08 (15.0 MB/s) - ‘sort.py’ saved [11739/11739]

File 'sort.py' downloaded and patched successfully.


In [3]:
# Imports

import matplotlib
matplotlib.use('Agg') # Best practice to set the backend early

import torch
from transformers import AutoImageProcessor, AutoModelForObjectDetection
import cv2
from PIL import Image
import numpy as np
import os
import base64
from IPython.display import HTML, display

# Import the Sort class from our now-patched 'sort.py' file
from sort import Sort

print("All libraries imported successfully.")

All libraries imported successfully.


## 2. Initialize Model and Tracker

Here, we load the official `facebook/detr-resnet-50` model from the Hugging Face Hub. We then initialize the SORT tracker and create data structures to hold our cumulative counts (`total_counts`) and the IDs of objects that have already been counted (`counted_ids`).
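
To illustrate how these two structures work together, here is a minimal sketch of zone-based counting with de-duplication by track ID. The helper name `count_if_in_zone` and the zone bounds below are hypothetical and exist only for this illustration; the notebook performs the same check inline in its main loop.

```python
def count_if_in_zone(center_x, obj_id, label, zone, total_counts, counted_ids):
    """Increment the count for `label` at most once per track ID.

    Hypothetical helper for illustration. Returns True if the object was counted.
    """
    zone_start, zone_end = zone
    # Skip untracked categories and IDs that were already counted
    if label not in total_counts or obj_id in counted_ids:
        return False
    if zone_start < center_x < zone_end:
        total_counts[label] += 1
        counted_ids.add(obj_id)
        return True
    return False


total_counts = {'person': 0, 'bicycle': 0, 'car': 0, 'motorcycle': 0}
counted_ids = set()
zone = (940, 980)  # assumed bounds: a 40-pixel zone centered in a 1920-pixel-wide frame

# The same car (track ID 7) sits inside the zone for two consecutive frames
first = count_if_in_zone(950, 7, 'car', zone, total_counts, counted_ids)
second = count_if_in_zone(965, 7, 'car', zone, total_counts, counted_ids)
print(first, second, total_counts['car'])
```

The second call is rejected because ID 7 is already in `counted_ids`, so an object that lingers in the zone for many frames still contributes exactly one count.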

In [4]:
# Load the stable DETR model from Facebook
model_checkpoint = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(model_checkpoint)
model = AutoModelForObjectDetection.from_pretrained(model_checkpoint)

# Move model to GPU for faster inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Model '{model_checkpoint}' loaded successfully on {device}.")

# Initialize the SORT tracker
tracker = Sort()

# Initialize variables for cumulative counting
total_counts = {
    'person': 0,
    'bicycle': 0,
    'car': 0,
    'motorcycle': 0
}
# A set to store the IDs of objects that have already been counted
counted_ids = set()

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DetrForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model 'facebook/detr-resnet-50' loaded successfully on cuda.


## 3. Process Video with Detection, Tracking, and Counting

This is the main logic loop. For each frame in the input video, we perform the following steps:
1.  **Detect:** Get object detections using the DETR model.
2.  **Format:** Convert the detections into the format required by the SORT tracker (`[x1, y1, x2, y2, score]`).
3.  **Track:** Update the tracker with the new detections to receive back bounding boxes with unique, persistent IDs. SORT does not return class labels, so each tracked box is matched back to the detection it overlaps most (highest IoU).
4.  **Count:** Check whether any tracked object's center lies inside the virtual counting zone. If it does, and its ID hasn't been counted yet, we increment the total for its category and record its ID.
5.  **Visualize:** Draw the bounding boxes, object IDs, counting zone, and the cumulative count overlay on the frame.
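
Because SORT returns only box coordinates and a track ID, the class label has to be recovered by matching each tracked box to the detection it overlaps most. The following is a minimal sketch of that matching with made-up boxes. The helper name `best_label_for` is hypothetical, and this `iou` uses plain continuous-coordinate areas rather than the `+ 1` pixel convention used in the notebook cell.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x_a = max(box_a[0], box_b[0])
    y_a = max(box_a[1], box_b[1])
    x_b = min(box_a[2], box_b[2])
    y_b = min(box_a[3], box_b[3])
    inter = max(0.0, x_b - x_a) * max(0.0, y_b - y_a)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_label_for(tracked_box, detections):
    """Return the label of the detection with the highest IoU, or None if nothing overlaps."""
    best_iou, best_label = 0.0, None
    for det in detections:
        score = iou(tracked_box, det['box'])
        if score > best_iou:
            best_iou, best_label = score, det['label']
    return best_label


# Made-up detections for a single frame
detections = [
    {'box': [100, 100, 200, 200], 'label': 'car'},
    {'box': [400, 120, 450, 260], 'label': 'person'},
]

label_a = best_label_for([105, 98, 204, 199], detections)   # overlaps the car box
label_b = best_label_for([700, 700, 750, 750], detections)  # overlaps nothing
print(label_a, label_b)
```

A tracked box that overlaps no current detection gets no label, which is why the main loop skips counting and drawing when the label is `None`.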

**⚠️ IMPORTANT:** You must change the `input_video_path` variable to the path of your video file.

In [5]:
# Final Video Processing Cell with Counting Zone Logic

def iou(boxA, boxB):
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)
    iou_score = interArea / float(boxAArea + boxBArea - interArea)
    return iou_score

# -------------------------------------------------------------------
# CHANGE THIS PATH to the path of your uploaded video file!
input_video_path = '/kaggle/input/rf-detr-vid-sample/5402016-hd_1920_1080_30fps.mp4'
# -------------------------------------------------------------------
output_video_path = '/kaggle/working/output_video_final_zone.mp4'

cap = cv2.VideoCapture(input_video_path)
if not cap.isOpened():
    print(f"Error: Could not open video file at {input_video_path}")
else:
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    out = cv2.VideoWriter(output_video_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (frame_width, frame_height))

    ZONE_WIDTH = 40
    ZONE_START_X = int(frame_width / 2) - int(ZONE_WIDTH / 2)
    ZONE_END_X = int(frame_width / 2) + int(ZONE_WIDTH / 2)

    print(f"Processing video with counting zone... Output will be saved to {output_video_path}")
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        inputs = image_processor(images=pil_image, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        target_sizes = torch.tensor([pil_image.size[::-1]])
        results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]

        detections_for_sort = []
        original_detections = []
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
            label_name = model.config.id2label[label.item()]
            if label_name in total_counts:
                box_list = box.tolist()
                detections_for_sort.append([box_list[0], box_list[1], box_list[2], box_list[3], score.item()])
                original_detections.append({'box': box_list, 'label': label_name})

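        # 3. Update the tracker. SORT expects update() to be called on every frame,
        # so pass an empty (0, 5) array when there are no detections.
        if len(detections_for_sort) > 0:
            tracked_objects_raw = tracker.update(np.array(detections_for_sort))
        else:
            tracked_objects_raw = tracker.update(np.empty((0, 5)))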

        for obj in tracked_objects_raw:
            x1, y1, x2, y2, obj_id = [int(val) for val in obj]
            center_x = int((x1 + x2) / 2)

            # Re-associate label using IoU
            best_iou = 0
            best_label = None
            for det in original_detections:
                iou_score = iou([x1, y1, x2, y2], det['box'])
                if iou_score > best_iou:
                    best_iou = iou_score
                    best_label = det['label']

            # Count the object if its center enters the zone and it hasn't been counted before
            if best_label and obj_id not in counted_ids:
                if center_x > ZONE_START_X and center_x < ZONE_END_X:
                    total_counts[best_label] += 1
                    counted_ids.add(obj_id)

            # Draw bounding box and ID
            if best_label:
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, f'{best_label} ID: {obj_id}', (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)

        # Draw the counting zone as a semi-transparent blue band
        overlay = frame.copy()
        cv2.rectangle(overlay, (ZONE_START_X, 0), (ZONE_END_X, frame_height), (255, 0, 0), -1)
        # Blend the overlay with the original frame
        alpha = 0.2  # Overlay opacity
        frame = cv2.addWeighted(overlay, alpha, frame, 1 - alpha, 0)

        # Display the cumulative total counts
        y_offset = 30
        for obj_name, count in total_counts.items():
            text = f'Total {obj_name.capitalize()}: {count}'
            cv2.putText(frame, text, (15, y_offset), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 0), 5)
            cv2.putText(frame, text, (15, y_offset), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
            y_offset += 30

        out.write(frame)

    cap.release()
    out.release()
    print(f"\nVideo processing complete! Output saved to: {output_video_path}")

Processing video with counting zone... Output will be saved to /kaggle/working/output_video_final_zone.mp4

Video processing complete! Output saved to: /kaggle/working/output_video_final_zone.mp4
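
A fixed-width zone can miss an object whose center moves more than `ZONE_WIDTH` pixels between consecutive frames, because the center may never fall inside the zone. One possible alternative, sketched below under the assumption that the previous center of each track is remembered, is to test whether the center moved from one side of a vertical line to the other. The names `crossed_line`, `last_center_x`, and `LINE_X` are hypothetical and are not part of the notebook.

```python
LINE_X = 960  # assumed line position: the middle of a 1920-pixel-wide frame


def crossed_line(prev_x, curr_x, line_x=LINE_X):
    """Return True if the center moved from one side of the line to the other."""
    if prev_x is None:
        return False  # first sighting of this track: nothing to compare against
    return (prev_x < line_x) != (curr_x < line_x)


last_center_x = {}   # track ID -> center x in the previous frame
counted_ids = set()
count = 0

# (track ID, center x) observations over three frames; track 3 jumps 110 pixels
# between the first two frames and would skip a 40-pixel zone entirely
frames = [[(3, 900)], [(3, 1010)], [(3, 1100)]]
for observations in frames:
    for obj_id, center_x in observations:
        if obj_id not in counted_ids and crossed_line(last_center_x.get(obj_id), center_x):
            count += 1
            counted_ids.add(obj_id)
        last_center_x[obj_id] = center_x

print(count)
```

This trades the zone's simplicity for a small amount of per-track state. Whether it is worthwhile depends on the frame rate and on how fast objects move in the footage.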


## 4. Final Visualization

This final cell checks that the processed video exists, reports its size, and displays a clickable download link in the notebook output.

In [6]:
# More reliable visualization code (displays a download link)

import os
from IPython.display import FileLink, display

# Path to the output video
output_video_path = '/kaggle/working/output_video_final_zone.mp4'

# Check whether the file exists
if os.path.exists(output_video_path):
    # Get the file size in megabytes
    file_size_mb = os.path.getsize(output_video_path) / (1024 * 1024)

    print(f"Video file found! Size: {file_size_mb:.2f} MB")
    print("Click the link below to download your processed video:")

    # Display a clickable download link
    display(FileLink(output_video_path))
else:
    print(f"Output video file not found at path: {output_video_path}")

Video file found! Size: 57.01 MB
Click the link below to download your processed video:


In [7]:
# !rm -f /kaggle/working/output_video_final.mp4