# OBJECT TRACKING | YOLO v10 | Deep SORT

# Content
## 1. What is DeepSORT?
## 2. How DeepSORT works
## 3. Inference using the DeepSORT algorithm
## 4. Streamlit application to track objects in videos

# 

# DeepSort

### DeepSORT is a computer vision tracking algorithm that tracks objects and assigns each tracked object a unique ID. It is an extension of SORT (Simple Online and Realtime Tracking): DeepSORT brings deep learning into the SORT algorithm by adding an appearance descriptor, which reduces identity switches and makes tracking more robust.


OR

### DeepSORT can also be defined as a tracking algorithm that tracks objects based not only on their velocity and motion but also on their appearance.

# DeepSort Working

### Working

#### SORT is an approach to object tracking that uses a Kalman filter and the Hungarian algorithm to track objects. SORT consists of four components:
 1. Detection
 2. Estimation
 3. Data Association
 4. Creation and Deletion of Track Identities

### Detection: 
* As the first step, objects must be detected with YOLOv10 (the object detection model) so that they can be tracked. These detections are then passed to the next step.
### Estimation: 
* In this step, the detections from the current frame are propagated forward to estimate each target's position in the next frame, using a Gaussian distribution and a constant-velocity model. The estimation is done with the Kalman filter.
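
The estimation step can be illustrated with a small NumPy sketch. This is a simplified illustration, not the filter used inside the `deep_sort` package: it assumes the state holds only the box center and its velocity (`[cx, cy, vx, vy]`), and the noise values `Q` and `R` are arbitrary placeholders chosen for the example.

```python
import numpy as np

dt = 1.0  # assumption: one time unit between consecutive frames

# Constant-velocity motion model for the state [cx, cy, vx, vy]
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
# Only the box center (cx, cy) is observed by the detector
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01  # process noise (illustrative value)
R = np.eye(2) * 1.0   # measurement noise (illustrative value)


def predict(x, P):
    """Propagate the state and its Gaussian uncertainty one frame ahead."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z):
    """Correct the prediction with a measured box center z."""
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P


# A target that starts at (100, 50) with unknown velocity
x = np.array([100.0, 50.0, 0.0, 0.0])
P = np.eye(4) * 10.0

# Detections from three consecutive frames: moving right and slightly down
for z in ([102.0, 51.0], [104.0, 52.0], [106.0, 53.0]):
    x, P = predict(x, P)
    x, P = update(x, P, np.array(z))

# Estimated position of the target in the next frame
predicted, _ = predict(x, P)
print(predicted[:2])
```

After a few frames the filter has picked up a positive velocity in both axes, so the predicted center lies ahead of the last corrected position.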
### Data Association: 
* We now have the predicted bounding boxes of the existing targets and the newly detected bounding boxes. A cost matrix is computed as the intersection-over-union (IoU) distance between each detection and every predicted bounding box, and the assignment is solved with the Hungarian algorithm.
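
As a rough sketch of this step, the IoU distance matrix can be built by hand and passed to SciPy's `linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm addresses. The helper names `iou` and `associate` are made up for this example, SciPy is assumed to be installed, and the `0.8` threshold simply mirrors the `max_iou_distance` value used later in this notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, max_iou_distance=0.8):
    """Return (track_index, detection_index) pairs with low IoU distance."""
    if not predicted or not detections:
        return []
    # Cost = IoU distance (1 - IoU) between every prediction and detection
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    # Discard assignments whose boxes barely overlap
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= max_iou_distance]


# Kalman-predicted boxes for two existing targets
predicted = [(10, 10, 50, 50), (100, 100, 150, 150)]
# Detections in the new frame, listed in a different order
detections = [(102, 98, 152, 149), (12, 11, 52, 49)]

matches = associate(predicted, detections)
print(matches)
```

Predictions left without a match correspond to targets that were missed in this frame, and unmatched detections are candidates for new tracks.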
### Creation and deletion of track IDs: 
+ When an object enters or leaves the frame, a unique track ID is created or destroyed accordingly.
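
The track lifecycle can be sketched in plain Python. The `Track` and `TrackManager` classes below are hypothetical and written only for illustration; they are not the classes from the `deep_sort` package. The parameter names `n_init` and `max_age` are borrowed from the `DeepSort(...)` call used later in this notebook, and the rules shown (confirm after `n_init` hits, drop after more than `max_age` missed frames) are an assumption about how such parameters are typically interpreted.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    hits: int = 1               # frames in which the track was matched
    time_since_update: int = 0  # consecutive frames without a match
    confirmed: bool = False


class TrackManager:
    """Hypothetical bookkeeping for creating and deleting track IDs."""

    def __init__(self, n_init=5, max_age=70):
        self.n_init = n_init
        self.max_age = max_age
        self.tracks = []
        self._ids = count(1)  # source of unique IDs

    def step(self, matched_ids, n_unmatched_detections):
        for t in self.tracks:
            if t.track_id in matched_ids:
                t.hits += 1
                t.time_since_update = 0
                if t.hits >= self.n_init:
                    t.confirmed = True
            else:
                t.time_since_update += 1
        # Tentative tracks die on their first miss; confirmed tracks
        # survive up to max_age missed frames
        self.tracks = [
            t for t in self.tracks
            if (t.confirmed and t.time_since_update <= self.max_age)
            or (not t.confirmed and t.time_since_update == 0)
        ]
        # Every unmatched detection starts a new tentative track
        for _ in range(n_unmatched_detections):
            self.tracks.append(Track(track_id=next(self._ids)))


manager = TrackManager(n_init=3, max_age=2)
manager.step(set(), 1)   # a new object appears: track 1 is created
manager.step({1}, 0)     # matched again
manager.step({1}, 0)     # third hit: the track becomes confirmed
print(manager.tracks)
```

Keeping a confirmed track alive for a few missed frames is what lets an object keep its ID through a short occlusion.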

#

# *PROBLEM*
- However, the SORT algorithm has two problems:
1. It tracks poorly under occlusion and across different viewpoints.
2. Despite the effectiveness of the Kalman filter, it produces a relatively high number of ID switches.

# *SOLUTION* 
- These issues stem from the association metric used, which relies on IoU alone.

- DeepSORT therefore adds a second distance metric based on the appearance of the object: an appearance feature vector (the deep appearance descriptor).
- DeepSORT uses a better association metric that combines both motion and appearance descriptors.
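
One way to picture the combined metric is a weighted sum of a motion cost and an appearance cost. The sketch below is an illustration under stated assumptions, not the code of the `deep_sort` package: the function names, the weight `lam`, the two-dimensional feature vectors, and the placeholder motion cost are all invented for the example. Appearance is compared with cosine distance against a small gallery of past features kept per track.

```python
import numpy as np


def l2_normalize(x):
    """Scale feature vectors to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def appearance_cost(galleries, det_features):
    """Smallest cosine distance between each track's gallery and each detection."""
    dets = l2_normalize(np.asarray(det_features, dtype=float))
    cost = np.zeros((len(galleries), len(dets)))
    for i, gallery in enumerate(galleries):
        g = l2_normalize(np.asarray(gallery, dtype=float))
        cost[i] = (1.0 - g @ dets.T).min(axis=0)
    return cost


def combined_cost(motion_cost, app_cost, lam=0.5):
    """Weighted sum of motion and appearance costs (lam is an assumed weight)."""
    return lam * motion_cost + (1.0 - lam) * app_cost


# Stored appearance features per track (toy 2-D descriptors)
galleries = [
    np.array([[1.0, 0.0], [0.9, 0.1]]),  # track 0
    np.array([[0.0, 1.0]]),              # track 1
]
# Appearance features extracted from the two new detections
det_features = [[0.1, 0.9], [1.0, 0.05]]

# Placeholder motion cost: both detections are equally plausible for both
# tracks, as can happen when two objects cross paths
motion_cost = np.full((2, 2), 0.5)

cost = combined_cost(motion_cost, appearance_cost(galleries, det_features), lam=0.2)
print(cost.argmin(axis=1))
```

With motion alone the two detections are indistinguishable here; the appearance term is what separates them, which is the mechanism that reduces ID switches.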

#

# INFERENCING

In [None]:
# ! pip install ultralytics
# ! pip install supervision

In [1]:
import cv2
from ultralytics import YOLOv10
import wget
import numpy as np
from deep_sort.deep_sort import DeepSort
import time

import datetime

In [2]:
wget.download('https://github.com/THU-MIG/yolov10/releases/download/v1.1/yolov10x.pt')


In [4]:
model = YOLOv10('yolov10x.pt')

deep_sort_weights = 'deep_sort/deep/checkpoint/ckpt.t7'


In [3]:
def draw_label(image, text, top_left, bottom_right, color, font_color, font_scale=0.6, font_thickness=2):
    # Calculate text size
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
    
    # Create a filled rectangle for text background
    text_background_top_left = (top_left[0]+18, top_left[1] - text_size[1] - 10)
    text_background_bottom_right = (top_left[0] + text_size[0] + 25, top_left[1])
    
    cv2.rectangle(image, text_background_top_left, text_background_bottom_right, color, cv2.FILLED)
    
    # Add text on top of the rectangle
    text_position = (top_left[0] + 18, top_left[1] - 5)
    cv2.putText(image, text, text_position, font, font_scale, font_color, font_thickness)

def draw_rounded_rectangle(image, top_left, bottom_right, color, thickness, radius):
    tl = (top_left[0] + radius, top_left[1] + radius)
    tr = (bottom_right[0] - radius, top_left[1] + radius)
    bl = (top_left[0] + radius, bottom_right[1] - radius)
    br = (bottom_right[0] - radius, bottom_right[1] - radius)
    
    # Approximate rounded corners: two overlapping rectangle outlines
    # plus a circle at each corner
    cv2.rectangle(image, (tl[0], top_left[1]), (tr[0], bottom_right[1]), color, thickness, cv2.LINE_AA)
    cv2.rectangle(image, (top_left[0], tl[1]), (bottom_right[0], bl[1]), color, thickness, cv2.LINE_AA)
    cv2.circle(image, tl, radius, color, thickness)
    cv2.circle(image, tr, radius, color, thickness)
    cv2.circle(image, bl, radius, color, thickness)
    cv2.circle(image, br, radius, color, thickness)


def draw_text(image, text, position, background_color, font_color, font_scale=0.5, font_thickness=1):
    # Calculate text size
    font = cv2.FONT_HERSHEY_SIMPLEX
    text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
    
    # Create a filled rectangle for text background
    text_background_top_left = (position[0] - 5, position[1] + 5)
    text_background_bottom_right = (position[0] + text_size[0] + 5, position[1] - text_size[1] - 5)
    
    cv2.rectangle(image, text_background_top_left, text_background_bottom_right, background_color, cv2.FILLED)
    
    # Add text on top of the rectangle
    text_position = (position[0], position[1] - 5)
    cv2.putText(image, text, text_position, font, font_scale, font_color, font_thickness)

def get_box_details(boxes):
    cls = boxes.cls.tolist()  # Convert tensor to list
    xyxy = boxes.xyxy
    conf = boxes.conf
    xywh = boxes.xywh

    return cls, xyxy, conf, xywh


In [5]:
cap = cv2.VideoCapture('4159610-hd_1920_1080_24fps.mp4')
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)

tracker = DeepSort(model_path=deep_sort_weights, max_age=70, n_init=5, max_iou_distance=0.8)

details = []
prev_details = {}
frames = []
unique_track_ids = set()
frame_no = 0

i = 0
counter, fps, elapsed = 0, 0, 0
start_time = time.perf_counter()

while True:
    ret, frame = cap.read()

    if ret:
        print('here')
        og_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = og_frame.copy()

        results = model(frame)

        bboxes_xywh = []
        confs = []

        class_names = list(model.names.values())
        cls, xyxy, conf, xywh = get_box_details(results[0].boxes) # type: ignore

        for c, b, co in zip(cls, xywh, conf.cpu().numpy()):
            if class_names[int(c)] == 'car' and co >=0.4:
                bboxes_xywh.append(b.cpu().numpy())
                confs.append(co)

        bboxes_xywh = np.array(bboxes_xywh, dtype=float)

        tracks = tracker.update(bboxes_xywh, confs, og_frame)
        
        ids = []
        for track in tracker.tracker.tracks:
            track_id = track.track_id
            hits = track.hits
            x1, y1, x2, y2 = track.to_tlbr()  # Get bounding box coordinates in (x1, y1, x2, y2) format
            w = x2 - x1  # Calculate width
            h = y2 - y1  # Calculate height

            # og_frame is in RGB order at this point, so colors are (R, G, B)
            red_color = (255, 0, 0)
            blue_color = (0, 0, 255)
            green_color = (0, 255, 0)

            # Determine color based on track_id
            color_id = track_id % 3
            if color_id == 0:
                color = red_color
            elif color_id == 1:
                color = blue_color
            else:
                color = green_color

            draw_rounded_rectangle(og_frame, (int(x1), int(y1)), (int(x1 + w), int(y1 + h)), color, 1, 15) # type: ignore

            text_color = (255, 255, 255)  # White color for text
            draw_label(og_frame, f"car-{track_id}", (int(x1), int(y1)), (int(x1 + w), int(y1 + h)), color, text_color) # type: ignore
            
            if track_id not in prev_details:
                prev_details[track_id] = [time.time(), color]           

            # Add the track_id to the set of unique track IDs
            unique_track_ids.add(track_id)
            ids.append(track_id)
    
        # Tracks seen in earlier frames but absent from this one are finished
        ids_done = set(prev_details) - set(ids)
        
        # Update the object count based on the number of unique track IDs
        object_counts = len(unique_track_ids)

        for done_id in ids_done:
            first_seen, track_color = prev_details.pop(done_id)
            details.append(['car', done_id, time.time() - first_seen, track_color, frame_no-1])
                           
        # Update FPS and place on frame
        current_time = time.perf_counter()
        elapsed = (current_time - start_time)
        counter += 1
        if elapsed > 1:
            fps = counter / elapsed
            counter = 0
            start_time = current_time

        # Convert back to BGR, the channel order cv2.imshow expects
        og_frame = cv2.cvtColor(og_frame, cv2.COLOR_RGB2BGR)
        og_frame = cv2.resize(og_frame, (700, 600))

        font_color = (255, 255, 255)  # White font

        # Position to draw the text (bottom-left corner)
        position = (10, 30)
        background_color = (0, 0, 0)
        timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        text = f'Frame: {frame_no} | Time: {timestamp} | Count: {object_counts}'
        # Draw the text on the image
        draw_text(og_frame, text, position, background_color, font_color)

        frame_no += 1

        # Write the frame to the output video file
        # out.write(cv2.cvtColor(og_frame, cv2.COLOR_RGB2BGR))

        # Show the frame
        cv2.imshow("Video", og_frame)
        # cv2.waitKey(0)
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
    else:
        # End of the video (or a failed read): stop instead of looping forever
        break

cap.release()
# out.release()
cv2.destroyAllWindows()


here

0: 384x640 1 car, 397.8ms
Speed: 249.4ms preprocess, 397.8ms inference, 74.0ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 236.3ms
Speed: 22.4ms preprocess, 236.3ms inference, 19.8ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 262.4ms
Speed: 13.8ms preprocess, 262.4ms inference, 7.1ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 182.5ms
Speed: 13.8ms preprocess, 182.5ms inference, 7.5ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 116.1ms
Speed: 9.4ms preprocess, 116.1ms inference, 4.0ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 113.0ms
Speed: 10.1ms preprocess, 113.0ms inference, 6.4ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 102.1ms
Speed: 7.8ms preprocess, 102.1ms inference, 7.8ms postprocess per image at shape (1, 3, 384, 640)
here

0: 384x640 1 car, 106.7ms
Speed: 12.3ms preprocess, 106.7ms inference, 11.8m

In [4]:
import cv2
cap = cv2.VideoCapture('854671-hd_1920_1080_25fps.mp4')
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
output_path = 'output.mp4'
out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

while True:
    ret, frame = cap.read()
    if not ret:
        break

    out.write(frame)
    
cap.release()
out.release()
cv2.destroyAllWindows()