# 01 — EPIC-KITCHENS Dataset Exploration

**Objective:**  
Identify segments of interest in the **EPIC-KITCHENS-100** dataset where kitchen-related objects appear for prolonged durations.

This notebook:
- Loads configuration parameters from `config.py`.
- Reads the action annotation file (`EPIC_100_train.csv`).
- Merges nearby temporal segments of the same object within each video.
- Filters out merged segments shorter than a minimum duration (`MIN_DURATION_SECONDS`).
- Exports a final list of relevant clips and videos for fine-tuning.

**Based on script:** `get-timed-label-video.py`

In [30]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import os, sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from config import DATA_ROOT, ANNOTATION_CSV, CLASSES

# === Temporal configuration ===
FPS = 60                       # assumed frame rate for all videos (verify per video before relying on durations)
MIN_DURATION_SECONDS = 20      # minimum merged-segment duration to keep (seconds)
MAX_GAP_FRAMES = 15            # maximum gap between segments that are still merged (frames; 0.25 s at 60 fps)
OUTPUT_DIR = "./data/configs"
os.makedirs(OUTPUT_DIR, exist_ok=True)

ANNOTATIONS_FILE = os.path.join(DATA_ROOT, "EPIC_100_train.csv")

print("Annotation file:", ANNOTATIONS_FILE)
print(f"FPS: {FPS} | Minimum duration: {MIN_DURATION_SECONDS}s | Gap tolerance: {MAX_GAP_FRAMES} frames")
print(f"Target classes: {', '.join(CLASSES)}")

Annotation file: ../annotations/EPIC_100_train.csv
FPS: 60 | Minimum duration: 20s | Gap tolerance: 15 frames
Target classes: bread, knife, cheese, ham, tomato, cucumber, carrot, butter
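
For reference, the printed values above are consistent with a `config.py` along the following lines. This is a hypothetical reconstruction, not the actual file: `DATA_ROOT` and `CLASSES` are inferred from the cell output, while the value of `ANNOTATION_CSV` is never shown in this notebook and is only a guess.

```python
# config.py (hypothetical reconstruction; the real file is not shown in this notebook)
import os

# Inferred from the printed path "../annotations/EPIC_100_train.csv"
DATA_ROOT = "../annotations"

# Assumption: the notebook imports ANNOTATION_CSV but never prints its value
ANNOTATION_CSV = os.path.join(DATA_ROOT, "EPIC_100_train.csv")

# Inferred from the printed "Target classes" line
CLASSES = ["bread", "knife", "cheese", "ham", "tomato", "cucumber", "carrot", "butter"]
```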


In [31]:
df = pd.read_csv(ANNOTATIONS_FILE)
print(f"Total annotation rows loaded: {len(df):,}")

# Filter by relevant kitchen-related classes defined in config.py
df = df[df["noun"].isin(CLASSES)].copy()
print(f"Annotations after class filtering: {len(df):,}")
df.head()

Total annotation rows loaded: 67,217
Annotations after class filtering: 3,714


| | narration_id | participant_id | video_id | narration_timestamp | start_timestamp | stop_timestamp | start_frame | stop_frame | narration | verb | verb_class | noun | noun_class | all_nouns | all_noun_classes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | P01_01_128 | P01 | P01_01 | 00:09:20.670 | 00:09:22.08 | 00:09:22.81 | 33724 | 33768 | take carrots | take | 0 | carrot | 41 | ['carrot'] | [41] |
| 33 | P01_01_129 | P01 | P01_01 | 00:09:23.250 | 00:09:24.25 | 00:09:28.84 | 33855 | 34130 | cut carrots | cut | 7 | carrot | 41 | ['carrot'] | [41] |
| 36 | P01_01_131 | P01 | P01_01 | 00:09:34.890 | 00:09:34.42 | 00:09:46.92 | 34465 | 35215 | grate carrots | grate | 77 | carrot | 41 | ['carrot'] | [41] |
| 37 | P01_01_132 | P01 | P01_01 | 00:09:49.030 | 00:09:47.02 | 00:09:55.58 | 35221 | 35734 | still grating carrots | grate | 77 | carrot | 41 | ['carrot'] | [41] |
| 38 | P01_01_133 | P01 | P01_01 | 00:10:13.600 | 00:09:50.29 | 00:10:00.72 | 35417 | 36043 | still grating carrot | grate | 77 | carrot | 41 | ['carrot'] | [41] |
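
Because every later duration is derived from frame indices and a single `FPS` constant, it may be worth checking that constant against the annotations themselves. The sketch below takes the five rows previewed above and estimates the frame rate as `start_frame` divided by the elapsed seconds in `start_timestamp`. The helper name `estimate_fps` is illustrative and not part of the original script.

```python
import pandas as pd

def estimate_fps(frames: pd.Series, timestamps: pd.Series) -> pd.Series:
    """Estimate frames per second as frame index divided by elapsed seconds."""
    seconds = pd.to_timedelta(timestamps).dt.total_seconds()
    return frames / seconds

# The five rows from the preview above (video P01_01, noun "carrot")
sample = pd.DataFrame({
    "start_timestamp": ["00:09:22.08", "00:09:24.25", "00:09:34.42",
                        "00:09:47.02", "00:09:50.29"],
    "start_frame": [33724, 33855, 34465, 35221, 35417],
})
sample["fps_estimate"] = estimate_fps(sample["start_frame"], sample["start_timestamp"])
print(sample[["start_frame", "fps_estimate"]])
```

Estimates close to 60 support the `FPS = 60` setting for this video. On the full table, the same function could be applied to `df["start_frame"]` and `df["start_timestamp"]` and summarized per `video_id`; rows that start at timestamp zero would need to be excluded to avoid dividing by zero.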


The EPIC-KITCHENS annotations often split one continuous interaction with an object into several short segments within the same video.  
To recover temporal continuity, we **merge segments** of the same object that overlap or are separated by at most `MAX_GAP_FRAMES` frames.  
This lets us focus on longer, uninterrupted intervals of object presence.
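
To make the merge rule concrete, here is a minimal, self-contained sketch applied to the frame ranges of the five carrot annotations previewed earlier. The function name `merge_intervals` is illustrative and is not taken from `get-timed-label-video.py`; it assumes plain `(start, stop)` tuples and a gap test of `<=`.

```python
def merge_intervals(intervals, max_gap):
    """Merge (start, stop) frame intervals whose gap is at most max_gap frames."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Small gap or overlap: extend the current interval
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            # Large gap (or first interval): open a new interval
            merged.append([start, stop])
    return [tuple(pair) for pair in merged]

# Frame ranges of the five carrot annotations in video P01_01
carrot = [(33724, 33768), (33855, 34130), (34465, 35215),
          (35221, 35734), (35417, 36043)]
print(merge_intervals(carrot, max_gap=15))
# [(33724, 33768), (33855, 34130), (34465, 36043)]
```

The last three annotations collapse into one interval: the gap between frames 35215 and 35221 is 6 frames, and the fifth annotation overlaps the fourth. At 60 fps that merged interval lasts (36043 - 34465) / 60 = 26.3 seconds.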

In [32]:
df["duration_frames"] = df["stop_frame"] - df["start_frame"]

merged = []
for (video_id, noun), group in df.groupby(["video_id", "noun"]):
    group = group.sort_values("start_frame")
    start, stop = None, None
    for _, row in group.iterrows():
        if start is None:
            # Start of a new sequence
            start, stop = row.start_frame, row.stop_frame
        elif row.start_frame - stop <= MAX_GAP_FRAMES:
            # Extend current sequence if the gap is small
            stop = max(stop, row.stop_frame)
        else:
            # Save the completed sequence
            merged.append([video_id, noun, start, stop])
            start, stop = row.start_frame, row.stop_frame
    merged.append([video_id, noun, start, stop])

merged_df = pd.DataFrame(merged, columns=["video_id", "noun", "start_frame", "stop_frame"])
merged_df["duration_sec"] = (merged_df["stop_frame"] - merged_df["start_frame"]) / FPS

print(f"Total merged temporal segments: {len(merged_df)}")
merged_df.head()

Total merged temporal segments: 2571


| | video_id | noun | start_frame | stop_frame | duration_sec |
|---|---|---|---|---|---|
| 0 | P01_01 | carrot | 1468 | 1676 | 3.466667 |
| 1 | P01_01 | carrot | 4457 | 4967 | 8.5 |
| 2 | P01_01 | carrot | 33724 | 33768 | 0.733333 |
| 3 | P01_01 | carrot | 33855 | 34130 | 4.583333 |
| 4 | P01_01 | carrot | 34465 | 36043 | 26.3 |
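
The row-by-row loop is easy to follow, but `iterrows` can become slow on larger tables. One possible vectorized alternative is sketched below, under the assumption that every row satisfies `stop_frame >= start_frame`. It flags the first row of each new segment using a running maximum of `stop_frame`, then numbers the segments with a cumulative sum. The function name `merge_segments_vectorized` is hypothetical.

```python
import pandas as pd

def merge_segments_vectorized(df: pd.DataFrame, max_gap: int) -> pd.DataFrame:
    """Merge per-(video_id, noun) segments separated by at most max_gap frames."""
    keys = ["video_id", "noun"]
    df = df.sort_values(keys + ["start_frame"]).reset_index(drop=True)
    # Furthest stop_frame seen so far within each group...
    running_stop = df.groupby(keys)["stop_frame"].cummax()
    # ...shifted down one row, so each row sees the stop of everything before it
    prev_stop = running_stop.groupby([df[k] for k in keys]).shift()
    # A new segment starts at the first row of a group or after a large gap
    is_new = prev_stop.isna() | (df["start_frame"] - prev_stop > max_gap)
    df["segment_id"] = is_new.cumsum()
    return (
        df.groupby(keys + ["segment_id"], as_index=False)
          .agg(start_frame=("start_frame", "min"), stop_frame=("stop_frame", "max"))
          .drop(columns="segment_id")
    )

# Toy input: the five carrot annotations from video P01_01
toy = pd.DataFrame({
    "video_id": ["P01_01"] * 5,
    "noun": ["carrot"] * 5,
    "start_frame": [33724, 33855, 34465, 35221, 35417],
    "stop_frame": [33768, 34130, 35215, 35734, 36043],
})
result = merge_segments_vectorized(toy, max_gap=15)
print(result)
```

On this toy input the function returns the same three carrot segments as rows 2 to 4 of the table above. Whether it is actually faster on the full annotation file has not been measured here.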


In [33]:
filtered = merged_df[merged_df["duration_sec"] >= MIN_DURATION_SECONDS].reset_index(drop=True)
print(f"Segments with duration ≥ {MIN_DURATION_SECONDS}s: {len(filtered)}")
filtered.head()

Segments with duration ≥ 20s: 81


| | video_id | noun | start_frame | stop_frame | duration_sec |
|---|---|---|---|---|---|
| 0 | P01_01 | carrot | 34465 | 36043 | 26.3 |
| 1 | P01_01 | carrot | 36131 | 38339 | 36.8 |
| 2 | P01_105 | carrot | 15088 | 16801 | 28.55 |
| 3 | P01_17 | tomato | 32356 | 34516 | 36.0 |
| 4 | P02_03 | cheese | 60051 | 61615 | 26.066667 |
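
Tools that cut clips from video files often expect timestamps rather than frame indices. The following is a minimal conversion sketch, assuming the constant 60 fps used in this notebook; the helper `frame_to_timestamp` is illustrative and not part of the original script.

```python
def frame_to_timestamp(frame: int, fps: float = 60.0) -> str:
    """Convert a frame index to an HH:MM:SS.ss string, rounded to centiseconds."""
    total_cs = round(frame / fps * 100)      # elapsed time in centiseconds
    seconds, cs = divmod(total_cs, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{cs:02d}"

# First filtered segment: P01_01, carrot, frames 34465 to 36043
print(frame_to_timestamp(34465), frame_to_timestamp(36043))
# 00:09:34.42 00:10:00.72
```

For this segment the two strings agree with the `start_timestamp` of annotation `P01_01_131` and the `stop_timestamp` of annotation `P01_01_133`, the first and last rows merged into it.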


We now export the filtered list of clips and videos for further use.  
This output will later be used to generate the training dataset for YOLO fine-tuning.

In [34]:
# Save merged and filtered clips
filtered_path = os.path.join(OUTPUT_DIR, "epic_filtered_clips_merged.csv")
filtered.to_csv(filtered_path, index=False)

# Save unique video IDs
video_list = filtered["video_id"].unique().tolist()
videos_path = os.path.join(OUTPUT_DIR, "epic_filtered_videos.csv")
pd.Series(video_list).to_csv(videos_path, index=False, header=False)

# Also save as plain text for simpler downstream parsing
with open(os.path.join(OUTPUT_DIR, "epic_filtered_videos.txt"), "w") as f:
    f.write(",".join(video_list))

print(f"Saved merged clips to: {filtered_path}")
print(f"Unique videos: {len(video_list)} — list saved to {videos_path}")

Saved merged clips to: ./data/configs/epic_filtered_clips_merged.csv
Unique videos: 49 — list saved to ./data/configs/epic_filtered_videos.csv


### Expected Output

**Generated Files:**
1. `data/configs/epic_filtered_clips_merged.csv` — detailed list of merged clips, containing:
   - `video_id`
   - `noun`
   - `start_frame`
   - `stop_frame`
   - `duration_sec`
2. `data/configs/epic_filtered_videos.csv` — unique video IDs relevant for fine-tuning, one per line with no header row.
3. `data/configs/epic_filtered_videos.txt` — the same video IDs on a single comma-separated line.
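
A later notebook could read these files back roughly as follows. This is a sketch that assumes the files keep the layout written by the export cell; `load_filtered_outputs` is a hypothetical helper name.

```python
import os
import pandas as pd

def load_filtered_outputs(output_dir: str = "./data/configs"):
    """Load the merged clip table and the video ID lists written by this notebook."""
    clips = pd.read_csv(os.path.join(output_dir, "epic_filtered_clips_merged.csv"))
    # The .csv list has no header row and one video ID per line
    videos = pd.read_csv(
        os.path.join(output_dir, "epic_filtered_videos.csv"), header=None
    )[0].tolist()
    # The .txt file holds the same IDs on a single comma-separated line
    with open(os.path.join(output_dir, "epic_filtered_videos.txt")) as f:
        videos_txt = f.read().strip().split(",")
    return clips, videos, videos_txt
```

Comparing `videos` with `videos_txt` after loading is a cheap way to confirm that the two list files are still in sync.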

**Example Output (illustrative rows showing the column layout):**

| video_id | noun | start_frame | stop_frame | duration_sec |
|-----------|------|-------------|-------------|---------------|
| P01_05 | knife | 1200 | 2400 | 20.0 |
| P02_11 | bread | 3210 | 4410 | 20.0 |

Each row is a continuous interval in which one target object is the annotated noun of consecutive actions (separated by at most `MAX_GAP_FRAMES` frames) that together last at least `MIN_DURATION_SECONDS` seconds.  
These intervals serve as input for dataset generation and model fine-tuning in subsequent notebooks.