# 03 — Dataset Generation (YOLO Format)

**Goal:**  
Generate a YOLOv8-compatible dataset for fine-tuning using selected classes from EPIC-KITCHENS.

This notebook:
1. Loads the main EPIC-KITCHENS annotation CSV  
2. Filters relevant classes  
3. Parses bounding boxes  
4. Creates a balanced subset  
5. Splits into `train` and `val`  
6. Writes YOLO `.txt` labels and dataset YAML file  

**Based on:** `train_dataset_img.py`

In [12]:
%load_ext autoreload
%autoreload 2

import os
import ast
import shutil
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
from sklearn.model_selection import train_test_split
import sys

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from config import DATA_ROOT, FRAMES_ROOT, ANNOTATION_CSV, CLASS_MAPPING,CLASSES, MAX_PER_CLASS, TRAIN_SPLIT, IMG_EXT, IMG_WIDTH, IMG_HEIGHT, OUTPUT_DIR

CLASSES = ["bread", "knife"]

print(f"Configuration loaded:\nDATA_ROOT: {DATA_ROOT}\nCLASSES: {CLASS_MAPPING}\nOUTPUT_DIR: {OUTPUT_DIR}")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Configuration loaded:
DATA_ROOT: ../annotations
CLASSES: {'bread': ['bread', 'bread package', 'bread packaging'], 'knife': ['knife', 'mezzaluna knife', 'mincing knife']}
OUTPUT_DIR: ./data/epic_train_subset


## Step 1 — Load and Filter the Annotations
We start by loading the CSV annotation file and filtering out any irrelevant or invalid entries.

In [13]:
df = pd.read_csv(ANNOTATION_CSV)
print(f"Loaded {len(df)} total annotations")

# Keep only classes of interest
target_nouns = sum(CLASS_MAPPING.values(), [])
df = df[df["noun"].isin(target_nouns)]
print(f"Filtered {len(df)} annotations for selected classes: {', '.join(target_nouns)}")

Loaded 389811 total annotations
Filtered 10811 annotations for selected classes: bread, bread package, bread packaging, knife, mezzaluna knife, mincing knife


In [14]:
def map_to_base(noun):
    for base, synonyms in CLASS_MAPPING.items():
        if noun in synonyms:
            return base
    return None

df["base_class"] = df["noun"].apply(map_to_base)
print(f"Found {len(df)} annotations for selected classes: {', '.join(df['noun'].unique())}")


Found 10811 annotations for selected classes: knife, bread, bread packaging, mincing knife, mezzaluna knife, bread package


## Step 2 — Parse and Validate Bounding Boxes

Bounding boxes are stored as text (e.g. `'[(top, left, height, width)]'`).  
We parse them safely and filter out invalid or empty entries.

In [15]:
def parse_bboxes(bbox_str):
    """Parse YOLO-style bounding boxes stored as string tuples."""
    try:
        bboxes = ast.literal_eval(bbox_str)
        if not isinstance(bboxes, list) or len(bboxes) == 0:
            return []
        valid_bboxes = [box for box in bboxes if isinstance(box, (list, tuple)) and len(box) == 4]
        return valid_bboxes
    except Exception:
        return []

df["parsed_bboxes"] = df["bounding_boxes"].apply(parse_bboxes)
df = df[df["parsed_bboxes"].map(lambda x: isinstance(x, list) and len(x) > 0)]
print(f" Remaining rows with valid bounding boxes: {len(df)}")

 Remaining rows with valid bounding boxes: 7983


## Step 3 — Balance the Dataset per Class

Limit the number of samples per class (`MAX_PER_CLASS`) to keep the dataset balanced and lightweight.

In [16]:
subset = []
for c in CLASSES:
    class_df = df[df["base_class"] == c]
    subset.append(class_df.sample(min(len(class_df), MAX_PER_CLASS), random_state=42))
df = pd.concat(subset).reset_index(drop=True)
print(f" Balanced subset created: {len(df)} samples total ({', '.join(CLASSES)})")

 Balanced subset created: 2000 samples total (bread, knife)


## Step 4 — Split into Train and Validation Sets
We divide the dataset into 90% training and 10% validation using stratified sampling.

In [17]:
VAL_SPLIT = 1 - TRAIN_SPLIT
train_df, val_df = train_test_split(df, test_size=VAL_SPLIT, stratify=df["base_class"], random_state=42)
print(f"Split complete → Train: {len(train_df)}  |  Val: {len(val_df)}")

Split complete → Train: 1800  |  Val: 200


##  Step 5 — Create Output Folder Structure

In [18]:
for split in ["train", "val"]:
    os.makedirs(f"{OUTPUT_DIR}/{split}/images", exist_ok=True)
    os.makedirs(f"{OUTPUT_DIR}/{split}/labels", exist_ok=True)

print(f" Folder structure created under: {OUTPUT_DIR}")

 Folder structure created under: ./data/epic_train_subset


## Step 6 — Generate YOLO Labels and Copy Images
Each image is copied to its split folder, and bounding boxes are saved in YOLO format:

In [19]:
class_to_id = {cls: i for i, cls in enumerate(CLASSES)}

def process_split(split_name, split_df):
    print(f"\n Processing {split_name.upper()} ({len(split_df)} samples)")
    stats_requested = defaultdict(lambda: defaultdict(int))
    stats_copied = defaultdict(lambda: defaultdict(int))

    for _, row in tqdm(split_df.iterrows(), total=len(split_df)):
        participant = row["participant_id"]
        video_id = row["video_id"]
        frame = int(row["frame"])
        base_class = row["noun"]
        valid_bboxes = row["parsed_bboxes"]

        stats_requested[base_class][video_id] += 1

        # Build source image path
        frame_name = f"{frame:010d}{IMG_EXT}"
        src_img = os.path.join(FRAMES_ROOT, participant, "object_detection_images", video_id, frame_name)
        if not os.path.exists(src_img):
            continue

        # Output paths
        img_name = f"{participant}_{video_id}_{frame}{IMG_EXT}"
        dst_img = os.path.join(OUTPUT_DIR, f"{split_name}/images", img_name)
        label_path = os.path.join(OUTPUT_DIR, f"{split_name}/labels", img_name.replace(IMG_EXT, ".txt"))

        shutil.copy(src_img, dst_img)
        stats_copied[base_class][video_id] += 1

        # Write YOLO label
        lines = []
        for box in valid_bboxes:
            if not isinstance(box, (list, tuple)) or len(box) != 4:
                continue
            top, left, height, width = map(float, box)
            xc = left + width / 2
            yc = top + height / 2
            xc_n, yc_n, wn, hn = xc / IMG_WIDTH, yc / IMG_HEIGHT, width / IMG_WIDTH, height / IMG_HEIGHT
            cls_id = class_to_id[base_class]
            lines.append(f"{cls_id} {xc_n:.6f} {yc_n:.6f} {wn:.6f} {hn:.6f}\n")

        if len(lines) > 0:
            with open(label_path, "w") as f:
                f.writelines(lines)

    # Summary per class
    print(f"\n === {split_name.upper()} SUMMARY ===")
    for cls in CLASSES:
        total_req = sum(stats_requested[cls].values())
        total_cop = sum(stats_copied[cls].values())
        if total_req == 0:
            continue
        print(f"Class '{cls}': {total_cop}/{total_req} images copied ({(total_cop/total_req)*100:.1f}% success)")

In [20]:
process_split("train", train_df)
process_split("val", val_df)


 Processing TRAIN (1800 samples)


  0%|          | 0/1800 [00:00<?, ?it/s]

100%|██████████| 1800/1800 [00:00<00:00, 12173.91it/s]



 === TRAIN SUMMARY ===
Class 'bread': 32/853 images copied (3.8% success)
Class 'knife': 132/871 images copied (15.2% success)

 Processing VAL (200 samples)


100%|██████████| 200/200 [00:00<00:00, 10960.77it/s]


 === VAL SUMMARY ===
Class 'bread': 1/96 images copied (1.0% success)
Class 'knife': 19/94 images copied (20.2% success)





## Step 7 — Generate YOLO Dataset YAML
The YAML file defines dataset structure and class list for training.

In [21]:
yaml_content = f"""
path: {OUTPUT_DIR}
train: train/images
val: val/images
nc: {len(CLASSES)}
names: {CLASSES}
"""

yaml_path = os.path.join(OUTPUT_DIR, "dataset.yaml")
with open(yaml_path, "w") as f:
    f.write(yaml_content)

print(f"YOLO dataset YAML created at: {yaml_path}")

YOLO dataset YAML created at: ./data/epic_train_subset/dataset.yaml


## Final Summary

- **Format:** YOLOv8-compatible (images + labels + dataset.yaml)

You can now train the model using:

```bash
yolo detect train data={OUTPUT_DIR}/dataset.yaml model=yolov8s.pt epochs=100 patience=10 imgsz=640 name=epic_full_epoch