# **Car Detection and Audio Matching from Video**

**This notebook processes a video of moving cars by:**


1- Detecting vehicles using YOLOv8

2- Capturing the best frame for each car

3- Extracting and segmenting the associated audio

4- Saving paired data (image and sound) for each detected car

It produces a clean dataset of car images and their corresponding engine sounds, enabling further classification or analysis.

In [None]:
!pip install ultralytics


In [None]:
import cv2
from ultralytics import YOLO
import os
from datetime import datetime
from moviepy.editor import VideoFileClip, AudioFileClip
from scipy import signal
import librosa
import soundfile as sf
import numpy as np
import matplotlib.pyplot as plt
import shutil
import json
from pathlib import Path

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


# Extract best image (YOLOv8)

This function processes a video to detect and track cars with a YOLO model. It draws bounding boxes and timestamps on each frame, writes the annotated video, and saves the highest-confidence crop of each detected car.

**Main Steps**

1- Load video and YOLO model

2- Loop over video frames

3- Run YOLO detection at intervals

4- Track cars across frames

5- Save best detections

6- Draw bounding boxes

7- Finish processing
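
Steps 4 and 5 can be illustrated in isolation. The sketch below is a simplified stand-in for the matching rule, not the full function: a detection is matched to a tracked car when its position lies within a fixed pixel distance of that car's last known position, and the stored detection is replaced only when the new confidence is higher. The names `match_detection`, `update_best`, and `max_shift` are hypothetical and chosen for illustration only.

```python
def match_detection(center_x, top_y, tracked, max_shift=100):
    """Return the id of the first tracked car within max_shift pixels, else None."""
    for car_id, car in tracked.items():
        close_x = abs(center_x - car["prev_x"]) < max_shift
        close_y = abs(top_y - car["prev_y"]) < max_shift
        if close_x and close_y:
            return car_id
    return None


def update_best(car, conf, bbox):
    """Keep whichever detection of this car has the higher confidence."""
    if conf > car["best_conf"]:
        car["best_conf"] = conf
        car["best_bbox"] = bbox
    return car


# One car tracked so far (illustrative values).
tracked = {
    0: {"prev_x": 320, "prev_y": 200, "best_conf": 0.62,
        "best_bbox": (270, 200, 370, 300)},
}

near = match_detection(350, 210, tracked)  # 30 px right, 10 px down: same car
far = match_detection(600, 210, tracked)   # 280 px away: treated as a new car

update_best(tracked[0], 0.81, (300, 210, 400, 310))  # higher confidence wins
update_best(tracked[0], 0.55, (310, 212, 410, 312))  # lower confidence is ignored

print(near, far, tracked[0]["best_conf"])
```

A fixed pixel threshold is simple but can confuse two cars that pass close together; a dedicated tracker would be more robust in dense traffic.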

In [None]:
def process_video_with_car_capture(
    input_video_path,
    output_video_path,
    left_line_x,
    right_line_x,
    output_image_dir='car_captures',
    model_name='yolov8n.pt',
    detection_interval=5,
    min_confidence=0.5,
    min_car_width=100,
    min_car_height=100
):

    model = YOLO(model_name)
    cap = cv2.VideoCapture(input_video_path)

    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    os.makedirs(output_image_dir, exist_ok=True)

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, fps, (frame_width, frame_height))

    tracked_cars = {}
    car_id_counter = 0
    frame_count = 0
    detection_timestamps = []
    images = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        frame_count += 1
        current_time = frame_count / fps
        frame_copy = frame.copy()

        cv2.line(frame_copy, (left_line_x, 0), (left_line_x, frame_height), (255, 0, 0), 2)
        cv2.line(frame_copy, (right_line_x, 0), (right_line_x, frame_height), (255, 0, 0), 2)

        time_str = f"Time: {current_time:.2f}s"
        cv2.putText(frame_copy, time_str, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)

        if frame_count % detection_interval == 0:
            results = model(frame)
            current_frame_cars = {}

            for result in results:
                for box in result.boxes:
                    if int(box.cls) == 2:  # class 2 is "car" in the COCO labels used by yolov8n.pt
                        x1, y1, x2, y2 = map(int, box.xyxy[0])
                        conf = float(box.conf[0])
                        center_x = (x1 + x2) // 2

                        if (conf < min_confidence or
                            (x2 - x1) < min_car_width or
                            (y2 - y1) < min_car_height):
                            continue

                        if not (left_line_x <= center_x <= right_line_x):
                            continue

                        matched_id = None
                        for car_id, car_data in tracked_cars.items():
                            if abs(center_x - car_data['prev_x']) < 100 and abs(y1 - car_data['prev_y']) < 100:
                                matched_id = car_id
                                break

                        if matched_id is None:
                            car_id = car_id_counter
                            car_id_counter += 1
                            direction = None
                            best_conf = conf
                            best_frame = frame[y1:y2, x1:x2]
                            best_bbox = (x1, y1, x2, y2)
                            detection_time = current_time
                            print(f"New car {car_id} detected at {detection_time:.2f}s")
                            detection_timestamps.append({
                                'car_id': car_id,
                                'detection_time': detection_time,
                                'direction': direction,
                                'confidence': best_conf
                            })
                        else:
                            car_data = tracked_cars[matched_id]
                            direction = car_data['direction']
                            if conf > car_data['best_conf']:
                                best_conf = conf
                                best_frame = frame[y1:y2, x1:x2]
                                best_bbox = (x1, y1, x2, y2)
                            else:
                                best_conf = car_data['best_conf']
                                best_frame = car_data['best_frame']
                                best_bbox = car_data['best_bbox']
                            car_id = matched_id
                            detection_time = car_data.get('detection_time', current_time)

                        if direction is None and car_id in tracked_cars:
                            direction = 'right' if center_x > tracked_cars[car_id]['prev_x'] else 'left'

                        current_frame_cars[car_id] = {
                            'prev_x': center_x,
                            'prev_y': y1,
                            'direction': direction,
                            'best_conf': best_conf,
                            'best_frame': best_frame,
                            'best_bbox': best_bbox,
                            'detection_time': detection_time,
                            'frames_since_capture': 0 if matched_id is None else tracked_cars[matched_id]['frames_since_capture'] + 1
                        }

            for car_id, car_data in tracked_cars.items():
                if (car_id not in current_frame_cars and
                    car_data['best_frame'] is not None and
                    car_data['best_frame'].size > 0 and
                    car_data['best_bbox'][2] - car_data['best_bbox'][0] >= min_car_width and
                    car_data['best_bbox'][3] - car_data['best_bbox'][1] >= min_car_height):

                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
                    direction = car_data['direction'] if car_data['direction'] else 'unknown'
                    filename = f"{output_image_dir}/car_{car_id}_{direction}_{car_data['best_conf']:.2f}_{car_data['detection_time']:.2f}s.jpg"
                    cv2.imwrite(filename, car_data['best_frame'])
                    images.append(filename)

            tracked_cars = current_frame_cars

        for car_id, car_data in tracked_cars.items():
            x1, y1, x2, y2 = car_data['best_bbox']
            color = (0, 255, 0)
            cv2.rectangle(frame_copy, (x1, y1), (x2, y2), color, 2)
            label = f"Car {car_id} ({car_data['best_conf']:.2f}) @ {car_data['detection_time']:.2f}s"
            if car_data['direction']:
                label += f" {car_data['direction']}"
            cv2.putText(frame_copy, label, (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

        out.write(frame_copy)

    for car_id, car_data in tracked_cars.items():
        if (car_data['best_frame'] is not None and
            car_data['best_frame'].size > 0 and
            car_data['best_bbox'][2] - car_data['best_bbox'][0] >= min_car_width and
            car_data['best_bbox'][3] - car_data['best_bbox'][1] >= min_car_height):

            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
            direction = car_data['direction'] if car_data['direction'] else 'unknown'
            filename = f"{output_image_dir}/car_{car_id}_{direction}_{car_data['best_conf']:.2f}_{car_data['detection_time']:.2f}s.jpg"
            cv2.imwrite(filename, car_data['best_frame'])
            images.append(filename)

    cap.release()
    out.release()

    print("\nDetection Timestamps:")
    for ts in detection_timestamps:
        print(f"Car {ts['car_id']}: {ts['detection_time']:.2f}s (Confidence: {ts['confidence']:.2f}, Direction: {ts['direction'] or 'unknown'})")

    print(f"\nSaved qualifying car images to: {output_image_dir}")
    print(f"Processed video saved to: {output_video_path}")
    return images, detection_timestamps




These functions extract the audio track from a video file and slice it into short segments centered on given timestamps (onsets).

**Main Steps**

1- Load a video and save its audio track as a .wav file

2- Extract a segment of audio centered at center_sec and save it to dest_path
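
The slicing in step 2 comes down to index arithmetic. As a minimal sketch (the helper name `segment_bounds` is hypothetical), the window covers `range_sec` seconds on each side of the center, so a full segment lasts `2 * range_sec` seconds, and the window is clipped at the start and end of the recording:

```python
def segment_bounds(center_sec, range_sec, sample_rate, n_samples):
    """Return (start, end) sample indices of a window centered at center_sec."""
    center = int(center_sec * sample_rate)
    half = int(range_sec * sample_rate)
    start = max(0, center - half)        # clip at the start of the recording
    end = min(n_samples, center + half)  # clip at the end of the recording
    return start, end


sr = 44100   # sample rate in Hz
n = 10 * sr  # a 10-second recording

full = segment_bounds(5.0, 1.0, sr, n)     # window fits entirely
clipped = segment_bounds(0.5, 1.0, sr, n)  # window starts before t = 0

for start, end in (full, clipped):
    print(start, end, (end - start) / sr, "s")
```

A detection in the first or last `range_sec` seconds of a video therefore yields a shorter segment, which matters if a later model expects fixed-length input.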



In [None]:
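def extract_audio(video_path, audio_output_path):
    # Save the video's audio track as a 44.1 kHz WAV file.
    video = VideoFileClip(video_path)
    try:
        video.audio.write_audiofile(audio_output_path, fps=44100)
    finally:
        video.close()  # release the reader even if writing fails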

def extract_audio_segment(center_sec, range_sec, sample_rate, mono_audio, dest_path):
    center_sample = int(center_sec * sample_rate)
    range_samples = int(range_sec * sample_rate)
    start_sample = max(0, center_sample - range_samples)
    end_sample = min(len(mono_audio), center_sample + range_samples)
    audio_segment = mono_audio[start_sample:end_sample]
    sf.write(dest_path, audio_segment, sample_rate)


def extract_segments(audio_path, onsets, output_dir, source_name, range_sec=1):
    os.makedirs(output_dir, exist_ok=True)
    y, sr = librosa.load(audio_path, sr=None, mono=True)

    for i, center_sec in enumerate(onsets):
        output_path = os.path.join(
            output_dir,
            f"{source_name}_segment{i+1}_{center_sec:.2f}s.wav"
        )
        # The window is centered 0.12 s after the detection time.
        extract_audio_segment(center_sec + 0.12, range_sec, sr, y, output_path)

    print(f"Saved {len(onsets)} segments to {output_dir}")
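
Because the image and audio file names both end with the detection time, the two outputs can be paired afterwards. The sketch below assumes image names ending in `_<time>s.jpg` and audio names ending in `_<time>s.wav`, which is the pattern the functions in this notebook write. The helper `pair_by_time` and the example file names are hypothetical.

```python
import re

# Matches a trailing "_<seconds with two decimals>s.jpg" or "...s.wav".
TIME_SUFFIX = re.compile(r"_(\d+\.\d{2})s\.(?:jpg|wav)$")


def index_by_time(names):
    """Map the time string in each file name to that file name."""
    indexed = {}
    for name in names:
        match = TIME_SUFFIX.search(name)
        if match:
            indexed[match.group(1)] = name
    return indexed


def pair_by_time(image_names, audio_names):
    """Return {time: (image, audio)} for times present in both lists."""
    images = index_by_time(image_names)
    audio = index_by_time(audio_names)
    return {t: (images[t], audio[t]) for t in images if t in audio}


images = ["car_0_right_0.87_0.30s.jpg", "car_1_unknown_0.91_0.90s.jpg"]
audio = ["clip_segment1_0.30s.wav", "clip_segment2_0.90s.wav",
         "clip_segment3_1.50s.wav"]

pairs = pair_by_time(images, audio)
for t, (img, wav) in pairs.items():
    print(t, img, wav)
```

Keying on the time string alone would collide if two cars were first detected in the same frame; including the car id in the audio file name would remove that ambiguity.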

In [None]:
def process_all_videos_in_folder(
    input_folder,
    output_base_dir="processed_videos",
    model_name='yolov8n.pt',
    detection_interval=5,
    min_confidence=0.5,
    min_car_width=100,
    min_car_height=100,
    audio_segment_range=1.0
):
    os.makedirs(output_base_dir, exist_ok=True)

    video_files = [
        f for f in os.listdir(input_folder)
        if f.lower().endswith(('.mp4', '.avi', '.mov', '.mkv'))
    ]

    print(f"Found {len(video_files)} videos to process")

    for video_file in video_files:
        video_name = os.path.splitext(video_file)[0]
        video_output_dir = os.path.join(output_base_dir, video_name)
        car_captures_dir = os.path.join(video_output_dir, "car_captures")
        audio_segments_dir = os.path.join(video_output_dir, "audio_segments")

        os.makedirs(car_captures_dir, exist_ok=True)
        os.makedirs(audio_segments_dir, exist_ok=True)

        input_video_path = os.path.join(input_folder, video_file)

        print(f"\nProcessing {video_file}...")

        output_video_path = os.path.join(video_output_dir, f"processed_{video_file}")

        cap = cv2.VideoCapture(input_video_path)
        frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        cap.release()

        # Detection zone: from one quarter to one half of the frame width.
        left_line_x = frame_width // 4
        right_line_x = 2 * frame_width // 4

        car_images, detection_timestamps = process_video_with_car_capture(
            input_video_path=input_video_path,
            output_video_path=output_video_path,
            left_line_x=left_line_x,
            right_line_x=right_line_x,
            output_image_dir=car_captures_dir,
            model_name=model_name,
            detection_interval=detection_interval,
            min_confidence=min_confidence,
            min_car_width=min_car_width,
            min_car_height=min_car_height
        )

        if detection_timestamps:
            temp_audio_path = os.path.join(video_output_dir, "temp_audio.wav")
            extract_audio(input_video_path, temp_audio_path)

            detection_times = [ts['detection_time'] for ts in detection_timestamps]

            extract_segments(
                audio_path=temp_audio_path,
                onsets=detection_times,
                output_dir=audio_segments_dir,
                source_name=video_name,
                range_sec=audio_segment_range
            )

            os.remove(temp_audio_path)

            print(f"Processed {len(detection_times)} audio segments")
        else:
            print("No cars detected, skipping audio extraction")

        print(f"Finished processing {video_file}")

    print("\nAll videos processed successfully!")

if __name__ == "__main__":
    input_videos_folder = "/content/drive/MyDrive/Gas Emission Estimation Project/Videos/NewVidoes/30 degree videos"
    output_base_directory = "/content/Processed_Videos"

    process_all_videos_in_folder(
        input_folder=input_videos_folder,
        output_base_dir=output_base_directory,
        model_name='yolov8n.pt',
        detection_interval=3,
        min_confidence=0.6,
        min_car_width=100,
        min_car_height=100,
        audio_segment_range=1.0
    )

Found 23 videos to process

Processing record_20250710_132725.mp4...
...

MoviePy - Done.




Saved 6 segments to /content/Processed_Videos/record_20250715_164216/audio_segments
Processed 6 audio segments
Finished processing record_20250715_164216.mp4

Processing record_20250715_164343.mp4...

...

MoviePy - Done.
Saved 4 segments to /content/Processed_Videos/record_20250715_164343/audio_segments
Processed 4 audio segments
Finished processing record_20250715_164343.mp4

Processing record_20250715_164329.mp4...





New car 0 detected at 0.30s
...

MoviePy - Done.
Saved 1 segments to /content/Processed_Videos/record_20250715_164329/audio_segments
Processed 1 audio segments
Finished processing record_20250715_164329.mp4

Processing record_20250715_164315.mp4...





...

MoviePy - Done.
Saved 1 segments to /content/Processed_Videos/record_20250715_164315/audio_segments
Processed 1 audio segments
Finished processing record_20250715_164315.mp4

Processing record_20250715_164301.mp4...





...

MoviePy - Done.
Saved 5 segments to /content/Processed_Videos/record_20250715_164301/audio_segments
Processed 5 audio segments
Finished processing record_20250715_164301.mp4

Processing record_20250715_164247.mp4...





New car 0 detected at 0.90s
...

MoviePy - Done.
Saved 3 segments to /content/Processed_Videos/record_20250715_164247/audio_segments
Processed 3 audio segments
Finished processing record_20250715_164247.mp4

Processing record_20250715_164425.mp4...





New car 0 detected at 0.30s
New car 1 detected at 0.90s
New car 2 detected at 1.50s
...

MoviePy - Done.




Saved 8 segments to /content/Processed_Videos/record_20250715_164425/audio_segments
Processed 8 audio segments
Finished processing record_20250715_164425.mp4

Processing record_20250715_164357.mp4...

...

MoviePy - Done.
Saved 4 segments to /content/Processed_Videos/record_20250715_164357/audio_segments
Processed 4 audio segments
Finished processing record_20250715_164357.mp4

Processing record_20250715_164410.mp4...





...

MoviePy - Done.
Saved 3 segments to /content/Processed_Videos/record_20250715_164410/audio_segments
Processed 3 audio segments
Finished processing record_20250715_164410.mp4

Processing record_20250715_164438.mp4...





...

MoviePy - Done.
Saved 2 segments to /content/Processed_Videos/record_20250715_164438/audio_segments
Processed 2 audio segments
Finished processing record_20250715_164438.mp4

Processing record_20250715_164520.mp4...





...

MoviePy - Done.
Saved 3 segments to /content/Processed_Videos/record_20250715_164520/audio_segments
Processed 3 audio segments
Finished processing record_20250715_164520.mp4

Processing record_20250715_164506.mp4...





New car 0 detected at 0.60s
New car 1 detected at 1.20s
...

MoviePy - Done.
Saved 6 segments to /content/Processed_Videos/record_20250715_164506/audio_segments
Processed 6 audio segments
Finished processing record_20250715_164506.mp4

Processing record_20250715_164452.mp4...





New car 0 detected at 1.80s
...

MoviePy - Done.
Saved 8 segments to /content/Processed_Videos/record_20250715_164452/audio_segments
Processed 8 audio segments
Finished processing record_20250715_164452.mp4

Processing record_20250715_164615.mp4...





0: 384x640 1 car, 133.9ms
Speed: 4.5ms preprocess, 133.9ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 140.7ms
Speed: 5.0ms preprocess, 140.7ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 128.7ms
Speed: 3.1ms preprocess, 128.7ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 126.1ms
Speed: 3.2ms preprocess, 126.1ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 147.8ms
Speed: 5.5ms preprocess, 147.8ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 128.4ms
Speed: 3.3ms preprocess, 128.4ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 128.1ms
Speed: 2.8ms preprocess, 128.1ms inference, 0.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 141.7ms
Speed: 4.8ms preprocess, 141

MoviePy - Done.
Saved 4 segments to /content/Processed_Videos/record_20250715_164547/audio_segments
Processed 4 audio segments
Finished processing record_20250715_164547.mp4

Processing record_20250715_164533.mp4...

0: 384x640 1 person, 162.5ms
Speed: 4.3ms preprocess, 162.5ms inference, 2.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 141.1ms
Speed: 3.9ms preprocess, 141.1ms inference, 1.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 161.1ms
Speed: 6.0ms preprocess, 161.1ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 cars, 1 motorcycle, 161.9ms
Speed: 4.5ms preprocess, 161.9ms inference, 1.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 2 cars, 1 motorcycle, 137.3ms
Speed: 2.8ms preprocess, 137.3ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 1 car, 1 motorcycle, 140.6ms
Speed: 5.1ms preprocess, 140.6ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 2 cars, 1 motorcycle, 148.5ms
Speed: 4.1ms preprocess, 148.5ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 car, 156.8ms

MoviePy - Done.
Saved 6 segments to /content/Processed_Videos/record_20250715_164533/audio_segments
Processed 6 audio segments
Finished processing record_20250715_164533.mp4

Processing record_20250715_180435.mp4...

0: 384x640 (no detections), 139.7ms
Speed: 5.3ms preprocess, 139.7ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 130.3ms
Speed: 3.7ms preprocess, 130.3ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 133.0ms
Speed: 4.9ms preprocess, 133.0ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 150.6ms
Speed: 3.9ms preprocess, 150.6ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 143.2ms
Speed: 5.0ms preprocess, 143.2ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 144.4ms
Speed: 5.2ms preprocess, 144.4ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 136.6ms
Speed: 5.5ms preprocess, 136.6ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 149.1ms
Speed: 4.5ms prepr

MoviePy - Done.
Saved 4 segments to /content/Processed_Videos/record_20250715_180420/audio_segments
Processed 4 audio segments
Finished processing record_20250715_180420.mp4

Processing record_20250715_180510.mp4...

0: 384x640 3 cars, 138.5ms
Speed: 4.1ms preprocess, 138.5ms inference, 1.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 3 cars, 137.6ms
Speed: 4.6ms preprocess, 137.6ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 car, 163.3ms
Speed: 4.5ms preprocess, 163.3ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 car, 147.5ms
Speed: 4.5ms preprocess, 147.5ms inference, 2.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 cars, 143.4ms
Speed: 4.4ms preprocess, 143.4ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 cars, 1 truck, 150.1ms
Speed: 6.1ms preprocess, 150.1ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 cars, 154.8ms
Speed: 3.3ms preprocess, 154.8ms inference, 1.2ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 cars, 146.9ms
Speed: 3.9ms preprocess, 146.9ms inference, 2.0ms postprocess per image at shape (1

MoviePy - Done.
Saved 1 segments to /content/Processed_Videos/record_20250715_180510/audio_segments
Processed 1 audio segments
Finished processing record_20250715_180510.mp4

Processing record_20250715_180523.mp4...

0: 384x640 (no detections), 225.1ms
Speed: 4.7ms preprocess, 225.1ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 209.8ms
Speed: 4.4ms preprocess, 209.8ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 215.2ms
Speed: 4.8ms preprocess, 215.2ms inference, 1.4ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 214.9ms
Speed: 9.2ms preprocess, 214.9ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 223.9ms
Speed: 11.2ms preprocess, 223.9ms inference, 0.8ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 136.6ms
Speed: 5.5ms preprocess, 136.6ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 129.0ms
Speed: 2.7ms preprocess, 129.0ms inference, 0.9ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 (no detections), 142.4ms
Speed: 4.3ms prep

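The pairing cell below matches images and audio segments purely by sorted position, which is reliable only when every car capture has exactly one audio segment. As a hedged alternative, the sketch below pairs each image with the audio file whose timestamp is closest, within a tolerance. The names `stem_seconds`, `match_by_timestamp`, the `max_gap` parameter, and the example file names are illustrative assumptions and not part of the notebook; only the `_<seconds>s` file-name suffix is taken from the cell below.

```python
from pathlib import Path


def stem_seconds(path):
    """Parse the trailing '<seconds>s' token of a file stem into a float."""
    return float(Path(path).stem.split("_")[-1].rstrip("s"))


def match_by_timestamp(car_files, audio_files, max_gap=1.0):
    """Greedily pair each image with the closest unused audio file.

    Hypothetical helper: max_gap (seconds) is an assumed tolerance.
    Images with no audio file within max_gap are left unpaired.
    """
    remaining = sorted(audio_files, key=stem_seconds)
    pairs = []
    for car in sorted(car_files, key=stem_seconds):
        if not remaining:
            break
        t = stem_seconds(car)
        # Closest remaining audio segment in time
        best = min(remaining, key=lambda a: abs(stem_seconds(a) - t))
        if abs(stem_seconds(best) - t) <= max_gap:
            pairs.append((car, best))
            remaining.remove(best)
    return pairs


# Illustrative names only; real files come from car_captures/ and audio_segments/
cars = ["car_0_1.80s.jpg", "car_1_7.25s.jpg", "car_2_12.00s.jpg"]
audio = ["segment_0_1.75s.wav", "segment_1_12.10s.wav"]
pairs = match_by_timestamp(cars, audio)
print(pairs)
```

In this example, position-based pairing would attach the 7.25 s image to the 12.10 s audio segment, whereas the timestamp-based match leaves that image unpaired and pairs the 12.00 s image with it instead.
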
In [None]:
def _timestamp_token(path):
    """Return the '<seconds>' token of a file stem that ends in '_<seconds>s'."""
    return path.stem.split('_')[-1].rstrip('s')


def create_pairs_for_videos(processed_dir="processed_videos"):
    """Pair car captures with audio segments, in timestamp order, for every video folder."""
    processed_path = Path(processed_dir)

    for video_dir in processed_path.iterdir():
        if not video_dir.is_dir():
            continue

        print(f"\nCreating pairs for {video_dir.name}...")

        car_dir = video_dir / "car_captures"
        audio_dir = video_dir / "audio_segments"
        pairs_dir = video_dir / "pairs"

        if not car_dir.exists():
            print(f"No car captures found in {video_dir.name}")
            continue

        pairs_dir.mkdir(exist_ok=True)
        metadata = []

        # Sort both lists by the timestamp encoded at the end of each file name
        car_files = sorted(
            car_dir.glob("*.jpg"),
            key=lambda p: float(_timestamp_token(p))
        )
        audio_files = sorted(
            audio_dir.glob("*.wav"),
            key=lambda p: float(_timestamp_token(p))
        )

        # zip() stops at the shorter list, so report files that will stay unpaired
        if len(car_files) != len(audio_files):
            print(f"Warning: {len(car_files)} images but {len(audio_files)} "
                  f"audio segments in {video_dir.name}; extras are skipped")

        # Pairing is by sorted position: the i-th image goes with the i-th segment
        for i, (car_file, audio_file) in enumerate(zip(car_files, audio_files), start=1):
            timestamp = _timestamp_token(car_file)
            pair_folder = pairs_dir / f"pair_{i}_{timestamp}s"
            pair_folder.mkdir(exist_ok=True)

            # Copy under fixed names so every pair folder has the same layout
            shutil.copy2(car_file, pair_folder / "car.jpg")
            shutil.copy2(audio_file, pair_folder / "audio.wav")

            metadata.append({
                "pair_id": i,
                "timestamp": float(timestamp),
                "car_original": car_file.name,
                "audio_original": audio_file.name
            })

        # Record which original files ended up in each pair
        with open(pairs_dir / "pairs_metadata.json", 'w') as f:
            json.dump(metadata, f, indent=2)

        print(f"Created {len(metadata)} pairs in {pairs_dir}")

if __name__ == "__main__":
    create_pairs_for_videos("/content/Processed_Videos")


Creating pairs for record_20250715_164301...
Created 5 pairs in /content/Processed_Videos/record_20250715_164301/pairs

Creating pairs for record_20250715_164216...
Created 6 pairs in /content/Processed_Videos/record_20250715_164216/pairs

Creating pairs for record_20250715_164425...
Created 8 pairs in /content/Processed_Videos/record_20250715_164425/pairs

Creating pairs for record_20250715_164315...
Created 1 pairs in /content/Processed_Videos/record_20250715_164315/pairs

Creating pairs for record_20250715_164410...
Created 3 pairs in /content/Processed_Videos/record_20250715_164410/pairs

Creating pairs for record_20250715_164452...
Created 8 pairs in /content/Processed_Videos/record_20250715_164452/pairs

Creating pairs for record_20250715_164506...
Created 6 pairs in /content/Processed_Videos/record_20250715_164506/pairs

Creating pairs for record_20250715_164329...
Created 1 pairs in /content/Processed_Videos/record_20250715_164329/pairs

Creating pairs for record_20250715_1643