# Dataset CSV generator for video detector models

This notebook processes annotated video frames and creates a CSV dataset file containing image paths and metadata for detector model training.

**Features:**
- Collects all annotated image paths
- Extracts image dimensions and file metadata
- Reads YOLO format labels and extracts annotation details
- Generates comprehensive dataset CSV with metadata
- Validates data integrity and provides statistics

## 1. Import Required Libraries

In [1]:
import pandas as pd
import os
from pathlib import Path
from PIL import Image
import csv
from collections import defaultdict
import warnings

warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Define Dataset Directory Structure

In [2]:
# Define paths
BASE_DIR = Path(r"F:\DeTect_TaiwanBirds_VideoDetector")
OUTPUT_DIR = Path("G:/2025-05-14_videos_annotated")  # Change this to your output directory from annotation tool
IMAGES_DIR = OUTPUT_DIR / "images"
LABELS_DIR = OUTPUT_DIR / "labels"
CSV_OUTPUT_PATH = BASE_DIR / "dataset" / "csvs" / "annotations.csv"

# Verify directories exist
print(f"Base Directory: {BASE_DIR}")
print(f"Images Directory: {IMAGES_DIR}")
print(f"Labels Directory: {LABELS_DIR}")
print(f"Output CSV Path: {CSV_OUTPUT_PATH}")
print()

# Check if directories exist
if IMAGES_DIR.exists():
    print(f"✓ Images directory found: {IMAGES_DIR}")
    image_count = len(list(IMAGES_DIR.glob("*.jpg"))) + len(list(IMAGES_DIR.glob("*.png")))
    print(f"  Found {image_count} image files")
else:
    print(f"✗ Images directory not found: {IMAGES_DIR}")
    print("  Please update OUTPUT_DIR to point to your annotation output folder")

if LABELS_DIR.exists():
    print(f"✓ Labels directory found: {LABELS_DIR}")
    label_count = len(list(LABELS_DIR.glob("*.txt")))
    print(f"  Found {label_count} label files")
else:
    print(f"✗ Labels directory not found: {LABELS_DIR}")

Base Directory: F:\DeTect_TaiwanBirds_VideoDetector
Images Directory: G:\2025-05-14_videos_annotated\images
Labels Directory: G:\2025-05-14_videos_annotated\labels
Output CSV Path: F:\DeTect_TaiwanBirds_VideoDetector\dataset\csvs\annotations.csv

✓ Images directory found: G:\2025-05-14_videos_annotated\images
  Found 3713 image files
✓ Labels directory found: G:\2025-05-14_videos_annotated\labels
  Found 481 label files
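
The counts above differ (3713 images versus 481 label files), which is expected when unannotated frames have no label file. A minimal sketch for listing which stems are unpaired in either direction is shown here; the helper name `pair_images_with_labels` and the demo file names are hypothetical and not part of the notebook.

```python
from pathlib import Path
import tempfile

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def pair_images_with_labels(images_dir, labels_dir):
    """Return (image stems without a label file, label stems without an image)."""
    image_stems = {p.stem for p in Path(images_dir).iterdir()
                   if p.suffix.lower() in IMAGE_SUFFIXES}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)

# Demonstration on a throwaway directory tree (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp, "images")
    labels = Path(tmp, "labels")
    images.mkdir()
    labels.mkdir()
    for name in ("clip_0.jpg", "clip_25.JPG", "clip_50.png"):
        (images / name).touch()
    for name in ("clip_0.txt", "orphan_7.txt"):
        (labels / name).touch()
    unlabeled, orphaned = pair_images_with_labels(images, labels)

print(unlabeled)  # images that would be treated as background
print(orphaned)   # label files with no matching image
```

Orphaned label files are usually the more worrying case, since they suggest an image was renamed or deleted after annotation.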


## 3. Load Class Mapping

In [3]:
# Load class mapping from CSV
class_mapping = {}
classes_csv = OUTPUT_DIR / "classes.csv"

if classes_csv.exists():
    # newline='' is the csv module's recommended way to open files for reading
    with open(classes_csv, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # Skip header (tolerates an empty file)
        for row in reader:
            if len(row) >= 2:
                class_id = int(row[0])
                class_name = row[1]
                class_mapping[class_id] = class_name
    print(f"✓ Loaded {len(class_mapping)} classes:")
    for class_id, class_name in sorted(class_mapping.items()):
        print(f"  {class_id}: {class_name}")
else:
    print(f"✗ Classes CSV not found at {classes_csv}")
    print("  Using default class mapping")
    class_mapping = {0: "Bat", 1: "Bird", 2: "Insect", 3: "Drone", 4: "Plane", 5: "Other"}

✓ Loaded 6 classes:
  0: Bat
  1: Bird
  2: Insect
  3: Drone
  4: Plane
  5: Other
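
The cell above reads the mapping by column position. If the classes file has a named header, reading by column name is a little more robust to reordered columns. The sketch below assumes header fields called `class_id` and `class_name`; those names are a guess, since the notebook does not show the header of `classes.csv`.

```python
import csv
import io

def load_class_mapping(handle, id_field="class_id", name_field="class_name"):
    """Build {class_id: class_name} from a CSV with a header row.

    The default field names are assumptions for illustration only.
    """
    mapping = {}
    for row in csv.DictReader(handle):
        raw_id = (row.get(id_field) or "").strip()
        if not raw_id:
            continue  # skip blank or incomplete rows
        mapping[int(raw_id)] = (row.get(name_field) or "").strip()
    return mapping

# In-memory stand-in for a classes file.
sample = io.StringIO("class_id,class_name\n0,Bat\n1,Bird\n2,Insect\n")
class_map = load_class_mapping(sample)
print(class_map)
```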


## 4. Collect Image Paths and Extract Metadata

In [4]:
def extract_label_info(label_path):
    """Extract information from YOLO format label file"""
    num_objects = 0
    classes_present = set()
    annotations = []
    
    if label_path.exists():
        with open(label_path, 'r') as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 5:
                    class_id = int(parts[0])
                    x_center = float(parts[1])
                    y_center = float(parts[2])
                    width = float(parts[3])
                    height = float(parts[4])
                    
                    num_objects += 1
                    classes_present.add(class_mapping.get(class_id, f"Unknown_{class_id}"))
                    annotations.append({
                        'class_id': class_id,
                        'class_name': class_mapping.get(class_id, f"Unknown_{class_id}"),
                        'x_center': x_center,
                        'y_center': y_center,
                        'width': width,
                        'height': height
                    })
    
    return num_objects, list(classes_present), annotations

# Collect all dataset information
dataset_rows = []
image_extensions = {'.jpg', '.jpeg', '.png'}

print("Scanning images directory...")
# Compare suffixes case-insensitively so mixed-case names such as ".Jpg" are also picked up
image_files = sorted([f for f in IMAGES_DIR.iterdir() if f.suffix.lower() in image_extensions])

for idx, img_path in enumerate(image_files):
    if (idx + 1) % 50 == 0:
        print(f"  Processed {idx + 1}/{len(image_files)} images...")
    
    # Get image metadata
    try:
        # Use a context manager so each file handle is closed promptly
        with Image.open(img_path) as img:
            img_width, img_height = img.size
            img_format = img.format
        file_size_kb = img_path.stat().st_size / 1024
    except Exception as e:
        print(f"  Warning: Could not read image {img_path}: {e}")
        continue
    
    # Get corresponding label information
    label_path = LABELS_DIR / (img_path.stem + ".txt")
    num_targets, classes_present, annotations = extract_label_info(label_path)
    
    # Create relative path for CSV (falls back to the absolute path when the image is not under BASE_DIR)
    try:
        rel_path = img_path.relative_to(BASE_DIR)
    except ValueError:
        rel_path = img_path
    
    # Extract video name and frame number from filename (assuming format: videoname_framenumber.jpg)
    filename_parts = img_path.stem.rsplit('_', 1)
    if len(filename_parts) == 2:
        video_name = filename_parts[0]
        try:
            frame_number = int(filename_parts[1])
        except ValueError:
            video_name = img_path.stem
            frame_number = -1
    else:
        video_name = img_path.stem
        frame_number = -1
    
    # Create row for dataset
    row = {
        'image_path': str(rel_path),
        'image_width': img_width,
        'image_height': img_height,
        'image_format': img_format,
        # 'file_size_kb': round(file_size_kb, 2),
        'num_targets': num_targets,
        'classes': ';'.join(sorted(classes_present)) if classes_present else 'Background',
        'has_annotations': 'Yes' if num_targets > 0 else 'No',
        # Build video_path by stripping the annotated images directory prefix from the image path string
        'video_path': str(img_path).replace("_videos_annotated\\images", "").replace("_videos_annotated/images", "").replace(".jpg", ".mp4").replace(".jpeg", ".mp4").replace(".png", ".mp4"),
        'video_name': video_name,
        'frame_number': frame_number
    }
    
    dataset_rows.append(row)

print(f"\n✓ Collected metadata for {len(dataset_rows)} images")

Scanning images directory...
  Processed 50/3713 images...
  Processed 100/3713 images...
  Processed 150/3713 images...
  Processed 200/3713 images...
  Processed 250/3713 images...
  Processed 300/3713 images...
  Processed 350/3713 images...
  Processed 400/3713 images...
  Processed 450/3713 images...
  Processed 500/3713 images...
  Processed 550/3713 images...
  Processed 600/3713 images...
  Processed 650/3713 images...
  Processed 700/3713 images...
  Processed 750/3713 images...
  Processed 800/3713 images...
  Processed 850/3713 images...
  Processed 900/3713 images...
  Processed 950/3713 images...
  Processed 1000/3713 images...
  Processed 1050/3713 images...
  Processed 1100/3713 images...
  Processed 1150/3713 images...
  Processed 1200/3713 images...
  Processed 1250/3713 images...
  Processed 1300/3713 images...
  Processed 1350/3713 images...
  Processed 1400/3713 images...
  Processed 1450/3713 images...
  Processed 1500/3713 images...
  Processed 1550/3713 images...
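
`extract_label_info` keeps each box as the four values stored in the label file. In the YOLO text format these are the box center and size, normalized to the image width and height. A minimal sketch of converting one such box to pixel corner coordinates follows; the function name `yolo_to_pixel_box` is hypothetical and does not appear in the notebook.

```python
def yolo_to_pixel_box(x_center, y_center, width, height, img_width, img_height):
    """Convert a normalized YOLO box to (x_min, y_min, x_max, y_max) in pixels."""
    box_w = width * img_width
    box_h = height * img_height
    x_min = x_center * img_width - box_w / 2
    y_min = y_center * img_height - box_h / 2
    return (x_min, y_min, x_min + box_w, y_min + box_h)

# A box centered in a 1920x1080 frame, a quarter of the width and half the height.
box = yolo_to_pixel_box(0.5, 0.5, 0.25, 0.5, 1920, 1080)
print(box)  # (720.0, 270.0, 1200.0, 810.0)
```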

## 5. Create and Explore DataFrame

In [5]:
# Create DataFrame
df = pd.DataFrame(dataset_rows)

print("DataFrame shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
df.head(10)

DataFrame shape: (3713, 10)

Column names:
['image_path', 'image_width', 'image_height', 'image_format', 'num_targets', 'classes', 'has_annotations', 'video_path', 'video_name', 'frame_number']


| | image_path | image_width | image_height | image_format | num_targets | classes | has_annotations | video_path | video_name | frame_number |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 0 |
| 1 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 124 |
| 2 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 149 |
| 3 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 16 |
| 4 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 174 |
| 5 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 199 |
| 6 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 224 |
| 7 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 24 |
| 8 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 48 |
| 9 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 48 |
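
The `video_path` column is built with a chain of string replacements, which also rewrites any other occurrence of those substrings in the path. A path-based alternative might look like the sketch below. It assumes the source video sits in a sibling folder named like the annotated folder minus a `_videos_annotated` suffix, and that the file is named `<video_name>.mp4`. Both are guesses about the storage layout, and unlike the chained replace this version drops the trailing frame number. The function name `derive_video_path` and the example file name are hypothetical.

```python
from pathlib import PureWindowsPath

def derive_video_path(image_path, suffix="_videos_annotated", video_ext=".mp4"):
    """Guess the source video path for an extracted frame (layout is an assumption)."""
    img = PureWindowsPath(image_path)
    annotated_dir = img.parent.parent  # e.g. G:\<date>_videos_annotated
    folder = annotated_dir.name
    if folder.endswith(suffix):
        folder = folder[: -len(suffix)]
    # Drop a trailing "_<frame number>" from the file stem, if present.
    stem, sep, frame = img.stem.rpartition("_")
    video_name = stem if sep and frame.isdigit() else img.stem
    return annotated_dir.with_name(folder) / (video_name + video_ext)

example = derive_video_path(r"G:\2025-05-14_videos_annotated\images\clip_a_124.jpg")
print(example)
```

`PureWindowsPath` is used so the example behaves the same on any operating system; in the notebook itself a plain `Path` would do on Windows.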


## 6. Dataset Statistics and Validation

In [6]:
print("="*60)
print("DATASET STATISTICS")
print("="*60)
print(f"\nTotal images: {len(df)}")
print(f"\nImages with annotations: {(df['has_annotations'] == 'Yes').sum()}")
print(f"Images without annotations: {(df['has_annotations'] == 'No').sum()}")

print(f"\nAnnotation coverage: {(df['has_annotations'] == 'Yes').sum() / len(df) * 100:.1f}%")

print(f"\nImage dimensions:")
print(f"  Width  - Min: {df['image_width'].min()}, Max: {df['image_width'].max()}, Mean: {df['image_width'].mean():.0f}")
print(f"  Height - Min: {df['image_height'].min()}, Max: {df['image_height'].max()}, Mean: {df['image_height'].mean():.0f}")

print(f"\nObject count statistics:")
print(f"  Total targets: {df['num_targets'].sum()}")
print(f"  Mean targets per image: {df['num_targets'].mean():.2f}")
print(f"  Max targets in single image: {df['num_targets'].max()}")
print(f"  Images with targets distribution:")
for count in sorted(df['num_targets'].unique()):
    freq = (df['num_targets'] == count).sum()
    print(f"    {count} object(s): {freq} images")

print(f"\nUnique videos: {df['video_name'].nunique()}")
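# Note: this counts distinct values of the 'classes' column (class combinations, plus 'Background'), not individual classes
print(f"Unique classes: {df['classes'].nunique()}")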

# Validate paths
print(f"\n\nValidating image paths...")
missing_count = 0
for idx, row in df.iterrows():
    # An absolute image_path replaces BASE_DIR when joined, so both relative and absolute entries resolve
    img_full_path = BASE_DIR / row['image_path']
    if not img_full_path.exists():
        print(f"  ✗ Missing: {row['image_path']}")
        missing_count += 1

if missing_count == 0:
    print(f"  ✓ All {len(df)} image paths are valid")
else:
    print(f"  ✗ Found {missing_count} missing image files")

DATASET STATISTICS

Total images: 3713

Images with annotations: 481
Images without annotations: 3232

Annotation coverage: 13.0%

Image dimensions:
  Width  - Min: 1920, Max: 1920, Mean: 1920
  Height - Min: 1080, Max: 1080, Mean: 1080

Object count statistics:
  Total targets: 560
  Mean targets per image: 0.15
  Max targets in single image: 3
  Images with targets distribution:
    0 object(s): 3232 images
    1 object(s): 422 images
    2 object(s): 39 images
    3 object(s): 20 images

Unique videos: 301
Unique classes: 5


Validating image paths...
  ✓ All 3713 image paths are valid
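
The validation above checks that image files exist, but the label parser silently skips any line that does not have exactly five fields. A minimal sketch of a per-line sanity check that reports such problems instead is given here. The function name `validate_yolo_line` and its rules (class id within range, coordinates within [0, 1], positive box size) are illustrative assumptions, not something the notebook enforces.

```python
def validate_yolo_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(value) for value in parts[1:]]
    except ValueError:
        return ["non-numeric field"]
    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside 0..{num_classes - 1}")
    if any(not 0.0 <= value <= 1.0 for value in coords):
        problems.append("coordinate outside [0, 1]")
    if coords[2] <= 0 or coords[3] <= 0:
        problems.append("non-positive box size")
    return problems

print(validate_yolo_line("1 0.5 0.5 0.1 0.2", num_classes=6))
print(validate_yolo_line("9 0.5 0.5 0.1 0.2", num_classes=6))
```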


## 7. Class Distribution Analysis

In [7]:
# Analyze class distribution
print("Class Distribution:")
print("-" * 60)

class_counts = defaultdict(int)
images_with_class = defaultdict(int)

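for classes_str in df['classes']:
    # Note: unannotated rows carry 'Background' (not 'none'), so they pass this check
    # and 'Background' is tallied as a class in the counts below
    if classes_str != 'none':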
        classes_list = classes_str.split(';')
        for class_name in classes_list:
            images_with_class[class_name] += 1

# Count total objects by class
for idx, row in df.iterrows():
    if row['num_targets'] > 0:
        label_path = LABELS_DIR / (Path(row['image_path']).stem + ".txt")
        if label_path.exists():
            with open(label_path, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) == 5:
                        class_id = int(parts[0])
                        class_name = class_mapping.get(class_id, f"Unknown_{class_id}")
                        class_counts[class_name] += 1

print(f"\nClasses by number of images containing them:")
for class_name in sorted(images_with_class.keys()):
    count = images_with_class[class_name]
    total_objects = class_counts[class_name]
    print(f"  {class_name:15s}: {count:4d} images, {total_objects:5d} total objects")

print(f"\nTotal unique classes: {len(images_with_class)}")

Class Distribution:
------------------------------------------------------------

Classes by number of images containing them:
  Background     : 3232 images,     0 total objects
  Bird           :  446 images,   519 total objects
  Insect         :   24 images,    24 total objects
  Plane          :   12 images,    17 total objects

Total unique classes: 4
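
The per-class image counts can also be derived directly from the `classes` column without an explicit loop. The sketch below runs on a small made-up DataFrame rather than the real dataset, and it excludes the `Background` placeholder, which the cell above counts as a class.

```python
import pandas as pd

# Made-up rows using the same ';'-joined format as the 'classes' column.
demo = pd.DataFrame(
    {"classes": ["Background", "Bird", "Bird;Insect", "Background", "Plane"]}
)

per_class = (
    demo.loc[demo["classes"] != "Background", "classes"]
    .str.split(";")   # "Bird;Insect" -> ["Bird", "Insect"]
    .explode()        # one row per (image, class) pair
    .value_counts()   # number of images containing each class
)
print(per_class)
```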


## 8. Export to CSV

In [8]:
# Export to CSV
print("\n" + "="*60)
print("EXPORTING TO CSV")
print("="*60)

# Sort by image_path for consistency
df_sorted = df.sort_values('image_path').reset_index(drop=True)

# Save to CSV
try:
    # Create the output folder if it does not exist yet
    CSV_OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    df_sorted.to_csv(CSV_OUTPUT_PATH, index=False, encoding='utf-8')
    print(f"\n✓ Successfully exported dataset to:")
    print(f"  {CSV_OUTPUT_PATH}")
    print(f"\nFile size: {CSV_OUTPUT_PATH.stat().st_size / 1024:.1f} KB")
    print(f"Rows: {len(df_sorted)}")
    print(f"Columns: {len(df_sorted.columns)}")
except Exception as e:
    print(f"\n✗ Error exporting to CSV: {e}")

# Display sample of CSV
print("\nSample CSV content (first 5 rows):")
print("-" * 60)
sample_df = pd.read_csv(CSV_OUTPUT_PATH).head(5)
sample_df


EXPORTING TO CSV

✓ Successfully exported dataset to:
  F:\DeTect_TaiwanBirds_VideoDetector\dataset\csvs\annotations.csv

File size: 814.3 KB
Rows: 3713
Columns: 10

Sample CSV content (first 5 rows):
------------------------------------------------------------


| | image_path | image_width | image_height | image_format | num_targets | classes | has_annotations | video_path | video_name | frame_number |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 0 |
| 1 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 124 |
| 2 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 149 |
| 3 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 0 | Background | No | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 16 |
| 4 | G:\2025-05-14_videos_annotated\images\A03_0000... | 1920 | 1080 | JPEG | 2 | Bird | Yes | G:\2025-05-14\A03_0000cfa6-1efa-350e-a754-5e23... | A03_0000cfa6-1efa-350e-a754-5e23cfe9afaf | 174 |
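
Sorting by `image_path` orders frames as text, which is why frame 124 appears before frame 16 in the sample. If frame order matters downstream, sorting on `video_name` and the integer `frame_number` is one option. The sketch below uses made-up rows to show the difference.

```python
import pandas as pd

# Made-up rows mirroring three of the CSV columns.
demo = pd.DataFrame({
    "image_path": ["clip_a_124.jpg", "clip_a_16.jpg", "clip_a_0.jpg", "clip_b_8.jpg"],
    "video_name": ["clip_a", "clip_a", "clip_a", "clip_b"],
    "frame_number": [124, 16, 0, 8],
})

# Text sort on the path: "124" sorts before "16".
by_path = demo.sort_values("image_path")["frame_number"].tolist()

# Numeric sort within each video.
by_frame = demo.sort_values(["video_name", "frame_number"])["frame_number"].tolist()

print(by_path)   # [0, 124, 16, 8]
print(by_frame)  # [0, 16, 124, 8]
```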


## 9. Summary Report

In [9]:
print("\n" + "="*60)
print("DATASET GENERATION COMPLETE")
print("="*60)

print(f"""
📊 SUMMARY REPORT
{"-"*60}

Dataset Location:
  📁 {CSV_OUTPUT_PATH}

Dataset Overview:
  • Total images:           {len(df):,}
  • Annotated images:       {(df['has_annotations'] == 'Yes').sum():,} ({(df['has_annotations'] == 'Yes').sum() / len(df) * 100:.1f}%)
  • Unannotated images:     {(df['has_annotations'] == 'No').sum():,}
  • Total targets:          {df['num_targets'].sum():,}
  • Unique classes:         {len(images_with_class)}
  • Unique videos:          {df['video_name'].nunique()}

Image Specifications:
  • Format(s):              {', '.join(df['image_format'].unique())}
  • Average size:           {df['image_width'].mean():.0f}×{df['image_height'].mean():.0f} px

CSV Columns:
  {', '.join(df.columns.tolist())}

Next Steps:
  1. Use this CSV for training data preparation
  2. Split dataset into train/validation/test sets
  3. Preprocess images if needed
  4. Configure your detector model with the metadata

Note: Image paths are relative to the base directory
{"-"*60}
""")

print("✓ Dataset CSV successfully created!")


DATASET GENERATION COMPLETE

📊 SUMMARY REPORT
------------------------------------------------------------

Dataset Location:
  📁 F:\DeTect_TaiwanBirds_VideoDetector\dataset\csvs\annotations.csv

Dataset Overview:
  • Total images:           3,713
  • Annotated images:       481 (13.0%)
  • Unannotated images:     3,232
  • Total targets:          560
  • Unique classes:         4
  • Unique videos:          301

Image Specifications:
  • Format(s):              JPEG
  • Average size:           1920×1080 px

CSV Columns:
  image_path, image_width, image_height, image_format, num_targets, classes, has_annotations, video_path, video_name, frame_number

Next Steps:
  1. Use this CSV for training data preparation
  2. Split dataset into train/validation/test sets
  3. Preprocess images if needed
  4. Configure your detector model with the metadata

Note: Image paths are relative to the base directory
------------------------------------------------------------

✓ 
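
For the train/validation/test split listed under Next Steps, frames from the same video tend to look alike, so one common precaution is to split by `video_name` rather than by row. A minimal sketch under that assumption follows; the function name `split_by_video`, the 70/15/15 fractions, and the seed are all illustrative choices, not values from the notebook.

```python
import random

def split_by_video(video_names, fractions=(0.7, 0.15, 0.15), seed=0):
    """Label each row 'train', 'val', or 'test' so all frames of a video share a split."""
    videos = sorted(set(video_names))
    random.Random(seed).shuffle(videos)  # deterministic shuffle for a given seed
    n_train = round(len(videos) * fractions[0])
    n_val = round(len(videos) * fractions[1])
    assignment = {}
    for i, video in enumerate(videos):
        if i < n_train:
            assignment[video] = "train"
        elif i < n_train + n_val:
            assignment[video] = "val"
        else:
            assignment[video] = "test"
    return [assignment[name] for name in video_names]

# 100 made-up rows drawn from 20 made-up videos (5 frames each).
names = [f"clip_{i % 20}" for i in range(100)]
splits = split_by_video(names)
print({label: splits.count(label) for label in ("train", "val", "test")})
```

With a DataFrame like the one in this notebook, the result could be attached as a new column, for example `df['split'] = split_by_video(df['video_name'].tolist())`.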