# 📋 01: Data Preparation

**Purpose:** Generate manifests, extract hybrid ROI crops, create tar archives, and audit quality.

**Sections:**
1. Inline Setup (run if starting fresh)
2. Manifest & Split Generation
3. Hybrid ROI Extraction (face, face_hands)
4. **Create Tar Archives (ONE-TIME)** for fast data loading in future sessions
5. Quality Audit & Thesis Figures

**Prerequisites:** Original images exist on Google Drive at `DRIVE_DATA_ROOT/auc.distracted.driver.dataset_v2/`


## 🔧 Section 1: Inline Setup

Run these cells if starting this notebook fresh (not coming from 00_setup.ipynb).


In [None]:
# --- INLINE SETUP (run if starting fresh) ---
import os, subprocess, sys

# Config
REPO_URL       = "https://github.com/ClaudiaCPach/CNNs-distracted-driving"
REPO_DIRNAME   = "CNNs-distracted-driving"
BRANCH         = "main"
PROJECT_ROOT   = f"/content/{REPO_DIRNAME}"
DRIVE_PATH     = "/content/drive/MyDrive/TFM"
DRIVE_DATA_ROOT = f"{DRIVE_PATH}/data"
FAST_DATA      = "/content/data"
DATASET_ROOT   = DRIVE_DATA_ROOT
OUT_ROOT       = f"{DRIVE_PATH}/outputs"
CKPT_ROOT      = f"{DRIVE_PATH}/checkpoints"

# Mount Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

# Clone/update repo
def sh(cmd):
    print(f"$ {cmd}")
    rc = subprocess.call(cmd, shell=True, executable="/bin/bash")
    if rc != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if os.path.isdir(PROJECT_ROOT):
    sh(f"cd {PROJECT_ROOT} && git pull --rebase origin {BRANCH}")
else:
    sh(f"git clone --branch {BRANCH} {REPO_URL} {PROJECT_ROOT}")

# Install
sh(f"pip install -q -e {PROJECT_ROOT}")

# Set env vars
os.environ["DRIVE_PATH"] = DRIVE_PATH
os.environ["DATASET_ROOT"] = DATASET_ROOT
os.environ["OUT_ROOT"] = OUT_ROOT
os.environ["CKPT_ROOT"] = CKPT_ROOT
os.environ["FAST_DATA"] = FAST_DATA

sys.path.insert(0, PROJECT_ROOT)
sys.path.insert(0, os.path.join(PROJECT_ROOT, "src"))

print("✅ Inline setup complete")


## 📋 Section 2: Manifest & Split Generation

Generate `manifest.csv` and the train/val/test split CSVs. **Run once**: the results persist on Drive.
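
Once the split CSVs exist, it can be worth confirming that no image appears in more than one split. The sketch below is a minimal, standard-library check on toy in-memory CSVs. It assumes each split CSV has a `path` column; `read_paths` and `find_split_overlaps` are hypothetical helpers for illustration, not part of `ddriver`.

```python
import csv
import io
from itertools import combinations


def read_paths(csv_file):
    """Return the set of values in the `path` column of an open CSV file."""
    return {row["path"] for row in csv.DictReader(csv_file)}


def find_split_overlaps(split_paths):
    """Map each pair of split names to the set of paths they share (if any)."""
    overlaps = {}
    for (name_a, paths_a), (name_b, paths_b) in combinations(split_paths.items(), 2):
        shared = paths_a & paths_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy CSV contents standing in for train.csv / val.csv / test.csv.
toy_csvs = {
    "train": "path,class_id\nc0/a.jpg,0\nc1/b.jpg,1\n",
    "val": "path,class_id\nc0/c.jpg,0\n",
    "test": "path,class_id\nc1/b.jpg,1\n",  # deliberately duplicated from train
}
split_paths = {name: read_paths(io.StringIO(text)) for name, text in toy_csvs.items()}
overlaps = find_split_overlaps(split_paths)
print(overlaps)
```

On the real files, the same helpers could be fed `open(train_path)` and friends; an empty result would indicate the splits are disjoint by path.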


In [None]:
# Run the manifest generator
import subprocess
import sys

sys.path.insert(0, PROJECT_ROOT)

manifest_cmd = f"cd {PROJECT_ROOT} && python -m ddriver.data.manifest --write-split-lists"

print("🔨 Generating manifest and split CSVs...")
print(f"Running: {manifest_cmd}\n")

result = subprocess.run(manifest_cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Warnings/Errors:", result.stderr)

if result.returncode == 0:
    print("\n✅ Manifest and split CSVs generated successfully!")
    print(f"   Manifest: {os.environ['OUT_ROOT']}/manifests/manifest.csv")
    print(f"   Train: {os.environ['OUT_ROOT']}/splits/train.csv")
    print(f"   Val: {os.environ['OUT_ROOT']}/splits/val.csv")
    print(f"   Test: {os.environ['OUT_ROOT']}/splits/test.csv")
else:
    print(f"\n❌ Error (exit code {result.returncode})")
    raise RuntimeError("Manifest generation failed")


In [None]:
# Verify CSVs were created
import pandas as pd
from pathlib import Path

manifest_path = Path(os.environ['OUT_ROOT']) / "manifests" / "manifest.csv"
train_path = Path(os.environ['OUT_ROOT']) / "splits" / "train.csv"
val_path = Path(os.environ['OUT_ROOT']) / "splits" / "val.csv"
test_path = Path(os.environ['OUT_ROOT']) / "splits" / "test.csv"

print("📊 Checking CSV files...\n")
for name, path in [("Manifest", manifest_path), ("Train", train_path), ("Val", val_path), ("Test", test_path)]:
    if path.exists():
        df = pd.read_csv(path)
        print(f"✅ {name}: {len(df)} rows, columns: {list(df.columns)}")
    else:
        print(f"❌ {name}: File not found at {path}")

if manifest_path.exists():
    print("\n📄 Sample from manifest (first 3 rows):")
    sample = pd.read_csv(manifest_path).head(3)
    print(sample[['path', 'class_id', 'driver_id', 'camera', 'split']].to_string())


In [None]:
# Create a tiny balanced subset for quick testing (20 images per class)
import pandas as pd
from pathlib import Path
from ddriver import config

train_csv = Path(config.OUT_ROOT) / "splits" / "train.csv"
train_small_csv = Path(config.OUT_ROOT) / "splits" / "train_small.csv"

print(f"Reading {train_csv}...")
df = pd.read_csv(train_csv)

small = df.groupby("class_id").head(20)

print(f"Original train.csv: {len(df)} images")
print(f"Small subset: {len(small)} images (~{len(small) // small['class_id'].nunique()} per class)")
print(f"\nClass distribution:")
print(small["class_id"].value_counts().sort_index())

small.to_csv(train_small_csv, index=False)
print(f"\n✅ Saved to {train_small_csv}")


## 🔀 Section 3: Hybrid ROI Extraction (InsightFace + MediaPipe Hands)

Extract face and face+hands crops using the hybrid pipeline. **Run once per variant**: the results persist on Drive.
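
The extraction command in this section passes `--pad-frac`, `--min-area-frac`, `--max-area-frac`, and `--min-aspect`. One plausible reading of those flags is sketched below: pad the detected box by a fraction of its size, clip it to the image, then accept it only if its area fraction and aspect ratio fall within bounds. This is an illustrative assumption, not the actual logic of `hybrid_extract`; `pad_and_check_box` is a hypothetical helper.

```python
def pad_and_check_box(box, img_w, img_h, pad_frac=0.35,
                      min_area_frac=0.01, max_area_frac=0.40, min_aspect=0.08):
    """Pad an (x1, y1, x2, y2) box, clip it to the image, and test it.

    Returns (padded_box, area_frac, aspect, ok), where `ok` says whether the
    padded box passes the area-fraction and aspect-ratio thresholds.
    """
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * pad_frac  # horizontal padding added on each side
    pad_y = (y2 - y1) * pad_frac  # vertical padding added on each side
    x1 = max(0.0, x1 - pad_x)
    y1 = max(0.0, y1 - pad_y)
    x2 = min(float(img_w), x2 + pad_x)
    y2 = min(float(img_h), y2 + pad_y)
    w, h = x2 - x1, y2 - y1
    area_frac = (w * h) / (img_w * img_h)  # share of the frame covered by the ROI
    aspect = w / h if h > 0 else 0.0       # width / height
    ok = (min_area_frac <= area_frac <= max_area_frac) and aspect >= min_aspect
    return (x1, y1, x2, y2), area_frac, aspect, ok


# A 200x200 detection in a 1920x1080 frame passes the checks after padding.
padded, area_frac, aspect, ok = pad_and_check_box((400, 100, 600, 300), img_w=1920, img_h=1080)
print(padded, round(area_frac, 4), aspect, ok)

# A box covering the whole frame exceeds max_area_frac and is rejected.
_, _, _, ok_full = pad_and_check_box((0, 0, 1920, 1080), img_w=1920, img_h=1080)
print(ok_full)
```

Under this reading, a rejected box is what would trigger the fallback to the full frame recorded in the detection metadata.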


In [None]:
# Install extraction dependencies
!pip -q install insightface onnxruntime mediapipe


In [None]:
# 🔀 Hybrid ROI Extraction (FACE variant)
import subprocess
from pathlib import Path

VARIANT = "face"  # <<<< CHANGE TO "face_hands" for second run

# Output location (Drive for persistence)
HYBRID_OUTPUT_ROOT = Path(OUT_ROOT) / "hybrid"

manifest_csv = Path(OUT_ROOT) / "manifests" / "manifest.csv"
splits_root = Path(OUT_ROOT) / "splits"

# Auto-detect local vs Drive images
LOCAL_DATASET_ROOT = Path("/content/data/auc.distracted.driver.dataset_v2")
DRIVE_DATASET_ROOT = Path(DATASET_ROOT)

if LOCAL_DATASET_ROOT.exists() and any(LOCAL_DATASET_ROOT.iterdir()):
    EFFECTIVE_DATASET_ROOT = LOCAL_DATASET_ROOT
    print(f"🚀 Using local images from {LOCAL_DATASET_ROOT}")
else:
    EFFECTIVE_DATASET_ROOT = DRIVE_DATASET_ROOT
    print(f"📁 Using images from Drive: {DRIVE_DATASET_ROOT}")

# Test mode options
TEST_MODE = False  # Set True for quick test
LIMIT = None  # Set to e.g. 50 for debugging

sample_flag = ""
limit_flag = f"--limit {LIMIT}" if LIMIT else ""

extract_cmd = f"""
cd {PROJECT_ROOT}
python -m src.ddriver.data.hybrid_extract \
  --manifest {manifest_csv} \
  --splits-root {splits_root} \
  --dataset-root {EFFECTIVE_DATASET_ROOT} \
  --output-root {HYBRID_OUTPUT_ROOT} \
  --variant {VARIANT} \
  --min-face-conf 0.4 \
  --min-detection-area-frac 0.005 \
  --min-area-frac 0.01 \
  --min-aspect 0.08 \
  --pad-frac 0.35 \
  --max-area-frac 0.40 \
  {limit_flag} \
  --overwrite
"""

print(f"Running Hybrid extraction for variant: {VARIANT}")
print(extract_cmd)
proc = subprocess.Popen(extract_cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in proc.stdout:
    print(line, end="")
proc.wait()
if proc.returncode != 0:
    raise RuntimeError("Hybrid extraction failed.")


In [None]:
# 🔁 Regenerate Hybrid CSVs (manifest + splits) for the extracted variant
from pathlib import Path
import pandas as pd

# VARIANT should match what you just extracted
VARIANT = VARIANT if 'VARIANT' in globals() else 'face'
HYBRID_OUTPUT_ROOT = HYBRID_OUTPUT_ROOT if 'HYBRID_OUTPUT_ROOT' in globals() else Path(OUT_ROOT) / 'hybrid'

manifest_csv = Path(OUT_ROOT) / 'manifests' / 'manifest.csv'
splits_root = Path(OUT_ROOT) / 'splits'
crop_root = Path(HYBRID_OUTPUT_ROOT) / VARIANT
meta_csv = Path(HYBRID_OUTPUT_ROOT) / f'detection_metadata_{VARIANT}.csv'

def _extract_class(path_str):
    for part in Path(path_str).parts:
        if len(part) == 2 and part.startswith('c') and part[1].isdigit():
            return part
    return None

def _extract_camera(path_str):
    for part in Path(path_str).parts:
        if part.lower().startswith('camera'):
            return part
    return None

def _extract_filename(path_str):
    return Path(path_str).name

def _coerce_class_id(value):
    if pd.isna(value):
        return None
    value_str = str(value)
    if len(value_str) == 2 and value_str.startswith('c') and value_str[1].isdigit():
        return value_str
    if value_str.isdigit():
        return f'c{int(value_str)}'
    return None

def _normalize_camera(cam):
    if cam is None or pd.isna(cam):
        return None
    cam_str = str(cam).lower().replace(' ', '')
    if cam_str in ['camera1', 'cam1']:
        return 'cam1'
    if cam_str in ['camera2', 'cam2']:
        return 'cam2'
    return cam_str

print(f'📂 Loading original manifest: {manifest_csv}')
orig_df = pd.read_csv(manifest_csv)
orig_df = orig_df.rename(columns={'path': 'original_path'})
orig_df['_filename'] = orig_df['original_path'].astype(str).map(_extract_filename)
orig_df['_class'] = orig_df['original_path'].astype(str).map(_extract_class)
orig_df['_camera'] = orig_df['original_path'].astype(str).map(_extract_camera).map(_normalize_camera)

print(f'🔍 Scanning crops: {crop_root}')
crop_paths = list(crop_root.rglob('*.jpg'))
crop_df = pd.DataFrame({'crop_path': [str(p) for p in crop_paths]})
crop_df['path'] = crop_df['crop_path'].map(lambda p: str(Path(p).relative_to(HYBRID_OUTPUT_ROOT)))
crop_df['_filename'] = crop_df['crop_path'].map(_extract_filename)
crop_df['_class'] = crop_df['crop_path'].map(_extract_class)
crop_df['_camera'] = crop_df['crop_path'].map(_extract_camera).map(_normalize_camera)

fallback_paths = set()
if meta_csv.exists():
    meta_df = pd.read_csv(meta_csv)
    meta_df = meta_df[meta_df['cropped_path'].astype(str).str.len() > 0]
    meta_df['path'] = meta_df['cropped_path'].astype(str)
    meta_df['_class'] = meta_df['class_id'].map(_coerce_class_id)
    meta_df['_camera'] = meta_df['camera'].map(_normalize_camera)
    crop_df = crop_df.merge(
        meta_df[['path', '_class', '_camera', 'fallback_to_full']],
        on='path', how='left', suffixes=('', '_meta'),
    )
    crop_df['_class'] = crop_df['_class'].fillna(crop_df['_class_meta'])
    crop_df['_camera'] = crop_df['_camera'].fillna(crop_df['_camera_meta'])
    crop_df = crop_df.drop(columns=['_class_meta', '_camera_meta'], errors='ignore')
    fallback_paths = set(meta_df.loc[meta_df['fallback_to_full'] == True, 'path'].dropna().astype(str))
    print(f'🚫 Excluding {len(fallback_paths)} fallback crops from splits')

crop_df_all = crop_df.copy()
crop_df = crop_df[~crop_df['path'].isin(fallback_paths)]

merged = crop_df_all.merge(orig_df, on=['_filename', '_class', '_camera'], how='left')
manifest_out = merged.drop(columns=['crop_path', '_filename', '_class', '_camera'], errors='ignore')
manifest_out_path = Path(HYBRID_OUTPUT_ROOT) / f'manifest_{VARIANT}.csv'
manifest_out.to_csv(manifest_out_path, index=False)
print(f'✅ Wrote manifest: {manifest_out_path}')

for split_name in ['train', 'val', 'test']:
    split_path = splits_root / f'{split_name}.csv'
    split_df = pd.read_csv(split_path)
    split_df['path'] = split_df['path'].astype(str)
    split_df['_filename'] = split_df['path'].map(_extract_filename)
    split_df['_class'] = split_df['path'].map(_extract_class)
    split_df['_camera'] = split_df['path'].map(_extract_camera).map(_normalize_camera)

    split_merged = split_df.merge(crop_df, on=['_filename', '_class', '_camera'], how='inner')
    split_merged['original_path'] = split_merged['path_x']
    split_merged['path'] = split_merged['path_y']
    cols_to_drop = ['path_x', 'path_y', '_filename', '_class', '_camera', 'crop_path', 'fallback_to_full']
    split_merged = split_merged.drop(columns=[c for c in cols_to_drop if c in split_merged.columns])

    out_split = Path(HYBRID_OUTPUT_ROOT) / f'{split_name}_{VARIANT}.csv'
    split_merged.to_csv(out_split, index=False)
    print(f'✅ Wrote split: {out_split} ({len(split_merged)} rows)')


## 🎯 Section 3b: Generate Control Splits (5-Run Plan)

Generate filtered split CSVs for the experimental control runs. This creates full-frame splits filtered to the same images that have face/face+hands crops available.

**Run once** after extracting both the face and face_hands variants. The results persist on Drive.
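
Conceptually, a control split keeps only the full-frame rows whose image also has a crop in the relevant variant, so that full-frame and ROI runs are compared on the same images. The sketch below shows that idea with plain sets on toy data. It assumes images can be matched through a shared ID (the `image_id` key here is hypothetical), and `filter_split_to_ids` is an illustrative helper, not the `ddriver.data.id_sets` API.

```python
def filter_split_to_ids(split_rows, allowed_ids, id_key="image_id"):
    """Keep only the rows whose ID is in `allowed_ids`, preserving row order."""
    return [row for row in split_rows if row[id_key] in allowed_ids]


# Toy full-frame training split.
full_train = [
    {"image_id": "img_001", "class_id": 0},
    {"image_id": "img_002", "class_id": 1},
    {"image_id": "img_003", "class_id": 2},
    {"image_id": "img_004", "class_id": 3},
    {"image_id": "img_005", "class_id": 4},
]

# Toy ID sets: images for which each variant produced a usable crop.
ids_face = {"img_001", "img_002", "img_003"}
ids_face_hands = {"img_002", "img_003", "img_004"}
ids_both = ids_face & ids_face_hands  # images with crops in both variants

control_face = filter_split_to_ids(full_train, ids_face)
control_both = filter_split_to_ids(full_train, ids_both)

print([row["image_id"] for row in control_face])
print([row["image_id"] for row in control_both])
```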


In [None]:
# 🎯 Generate Control Splits for 5-Run Experimental Plan
from pathlib import Path
from ddriver.data.id_sets import (
    extract_id_sets,
    generate_control_splits,
    save_id_sets,
    print_id_set_summary,
)

# Paths to manifests
manifest_full = Path(OUT_ROOT) / "manifests" / "manifest.csv"
manifest_face = Path(OUT_ROOT) / "hybrid" / "manifest_face.csv"
manifest_fh = Path(OUT_ROOT) / "hybrid" / "manifest_face_hands.csv"

# Verify manifests exist
missing = []
for name, path in [("Full-frame", manifest_full), ("Face", manifest_face), ("Face+Hands", manifest_fh)]:
    if not path.exists():
        missing.append(f"{name}: {path}")

if missing:
    print("⚠️  Missing manifests (run hybrid extraction first):")
    for m in missing:
        print(f"   - {m}")
    raise FileNotFoundError("Run hybrid extraction for both face and face_hands variants first.")

# Extract ID sets from manifests
print("🔍 Extracting ID sets from manifests...")
id_sets = extract_id_sets(
    manifest_full=manifest_full,
    manifest_face=manifest_face,
    manifest_fh=manifest_fh,
)
print_id_set_summary(id_sets)

# Save ID sets for reference/auditing
id_sets_dir = Path(OUT_ROOT) / "splits" / "id_sets"
print(f"\n💾 Saving ID sets to {id_sets_dir}...")
save_id_sets(id_sets, id_sets_dir)

# Generate control split CSVs
splits_root = Path(OUT_ROOT) / "splits"
control_output = Path(OUT_ROOT) / "splits" / "control"

print("\n🔧 Generating control splits...")
results = generate_control_splits(
    splits_root=splits_root,
    id_sets=id_sets,
    output_root=control_output,
    generate_both=True,  # Also generate S_both splits
)

print(f"\n✅ Control splits saved to: {control_output}")
print("\n📁 Generated files:")
for subset_name, splits in results.items():
    for split_name, path in splits.items():
        print(f"   {subset_name}/{split_name}: {path.name}")


## 📦 Section 4: Create Tar Archives (ONE-TIME)

Create tar archives of hybrid crops for **fast loading in future sessions**.

Copying ~13,000 small files one-by-one over the Drive FUSE mount takes 2+ hours.
A single tar archive can be copied in ~5 minutes and then extracted on the local disk, which avoids the per-file overhead of the mount.

**Run these cells ONCE** after extracting hybrid crops. The archives persist on Drive.
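
For reference, the pack-once, unpack-locally pattern can be sketched with the standard-library `tarfile` module. This is a minimal illustration on temporary files, not the implementation of `create_tar_archive`; `make_tar` and `extract_tar` are hypothetical helpers, and the archive is written uncompressed on the assumption that JPEG data gains little from gzip.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tar(source_dir, tar_path):
    """Write an uncompressed tar of `source_dir`; return the number of files added."""
    source_dir = Path(source_dir)
    count = 0
    with tarfile.open(tar_path, "w") as tar:  # mode "w" means no compression
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # Store paths relative to the parent so the archive contains "face/...".
                arcname = file_path.relative_to(source_dir.parent).as_posix()
                tar.add(file_path, arcname=arcname)
                count += 1
    return count


def extract_tar(tar_path, dest_dir):
    """Extract a trusted tar archive into `dest_dir`."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    crop_dir = tmp / "hybrid" / "face" / "c0"
    crop_dir.mkdir(parents=True)
    (crop_dir / "img_001.jpg").write_bytes(b"fake-jpeg-bytes")
    (crop_dir / "img_002.jpg").write_bytes(b"more-fake-bytes")

    tar_path = tmp / "hybrid_face.tar"
    n_added = make_tar(tmp / "hybrid" / "face", tar_path)

    local_root = tmp / "local"
    extract_tar(tar_path, local_root)
    restored = sorted(p.relative_to(local_root).as_posix() for p in local_root.rglob("*.jpg"))

print(n_added, restored)
```

In a later session, only the single `.tar` file would need to cross the Drive mount before being unpacked under `/content`.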


In [None]:
# 📦 Create Tar Archives for Hybrid Crops (ONE-TIME)
# Run this ONCE after extracting hybrid crops. Archives persist on Drive.

from pathlib import Path
from ddriver.data.fastcopy import create_tar_archive

DRIVE_ROOT = Path(OUT_ROOT) / "hybrid"
TAR_OUTPUT_DIR = DRIVE_ROOT  # Save tars alongside hybrid folder

# Create archives for both variants
for variant in ["face", "face_hands"]:
    source_dir = DRIVE_ROOT / variant
    tar_path = TAR_OUTPUT_DIR / f"hybrid_{variant}.tar"
    
    if not source_dir.exists():
        print(f"⚠️  {variant}: Source not found at {source_dir}")
        continue

    if tar_path.exists():
        size_mb = tar_path.stat().st_size / (1024 * 1024)
        print(f"✅ {variant}: Archive already exists ({size_mb:.1f} MB)")
        continue

    print(f"\n{'='*60}")
    print(f"📦 Creating tar archive for: {variant}")
    print(f"{'='*60}")

    result = create_tar_archive(
        source_dir=source_dir,
        tar_path=tar_path,
        use_gzip=False,  # Faster extraction, JPEG already compressed
        verbose=True,
    )

    print(f"   ✅ Created: {tar_path}")

print("\n" + "="*60)
print("📋 Archive Summary")
print("="*60)
for variant in ["face", "face_hands"]:
    tar_path = TAR_OUTPUT_DIR / f"hybrid_{variant}.tar"
    if tar_path.exists():
        size_mb = tar_path.stat().st_size / (1024 * 1024)
        print(f"   ✅ hybrid_{variant}.tar: {size_mb:.1f} MB")
    else:
        print(f"   ❌ hybrid_{variant}.tar: Not created")

print("\n💡 These archives will be used in 02_training.ipynb for fast data loading.")


## 📦 Section 4b: Create Full-Frame Compressed Tar (ONE-TIME)

Create a tar archive of **compressed full-frame images** for fast loading.

This involves:
1. Compressing images to 320px (shorter side) at JPEG quality 80
2. Creating a tar archive on local disk
3. Copying the archive to Drive

**Run ONCE**: the archive persists on Drive for all future sessions.
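
To make the "320px shorter side" target concrete, the sketch below computes the output size for a given input while preserving aspect ratio. It is an assumption about what such a resize typically does, not the behaviour of `CompressionSpec`; in particular, leaving already-small images untouched is a choice made here for illustration, and `short_side_size` is a hypothetical helper.

```python
def short_side_size(width, height, target_short_side=320):
    """Return (new_width, new_height) so the shorter side equals the target.

    Images whose shorter side is already at or below the target are returned
    unchanged (no upscaling), which is an assumption of this sketch.
    """
    short = min(width, height)
    if short <= target_short_side:
        return width, height
    scale = target_short_side / short
    return round(width * scale), round(height * scale)


landscape = short_side_size(1920, 1080)  # shorter side is the height
portrait = short_side_size(1080, 1920)   # shorter side is the width
small = short_side_size(300, 200)        # already below the target
print(landscape, portrait, small)
```

With Pillow, a size computed this way could be passed to `Image.resize` and the result saved as JPEG with `quality=80`.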


In [None]:
# 📦 Create Full-Frame Compressed Tar Archive (ONE-TIME)
# This cell compresses full-frame images and creates a tar for fast loading.

from pathlib import Path
import shutil
from ddriver.data.fastcopy import (
    CompressionSpec, 
    copy_splits_with_compression,
    create_tar_archive,
)

# Check if tar already exists
FULL_TAR_PATH = Path(DRIVE_DATA_ROOT) / "full_compressed.tar"

if FULL_TAR_PATH.exists():
    size_mb = FULL_TAR_PATH.stat().st_size / (1024 * 1024)
    print(f"✅ Full-frame tar already exists: {FULL_TAR_PATH}")
    print(f"   Size: {size_mb:.1f} MB")
    print("   Skipping creation. Delete the tar file if you want to recreate it.")
else:
    print("📦 Creating compressed full-frame tar archive...")
    print("   This is a ONE-TIME operation (may take 30-60 minutes).")
    print()
    
    # Step 1: Compress images to /content (temporary)
    SRC_ROOT = Path(DRIVE_DATA_ROOT) / "auc.distracted.driver.dataset_v2"
    DST_ROOT = Path("/content/data/full_compressed")
    
    split_csvs = {
        "train": Path(OUT_ROOT) / "splits" / "train.csv",
        "val": Path(OUT_ROOT) / "splits" / "val.csv",
        "test": Path(OUT_ROOT) / "splits" / "test.csv",
    }
    
    compression_spec = CompressionSpec(target_short_side=320, jpeg_quality=80)
    
    print("Step 1/3: Compressing images...")
    summary = copy_splits_with_compression(
        split_csvs=split_csvs,
        src_root=SRC_ROOT,
        dst_root=DST_ROOT,
        compression=compression_spec,
        skip_existing=False,
    )
    print(f"   ✅ Compressed {summary['processed']} images to {DST_ROOT}")
    
    # Step 2: Create tar archive
    print("\nStep 2/3: Creating tar archive...")
    TEMP_TAR = Path("/content/full_compressed.tar")
    result = create_tar_archive(
        source_dir=DST_ROOT,
        tar_path=TEMP_TAR,
        use_gzip=False,
        verbose=True,
    )
    
    # Step 3: Copy tar to Drive
    print("\nStep 3/3: Copying tar to Drive...")
    shutil.copy2(TEMP_TAR, FULL_TAR_PATH)
    size_mb = FULL_TAR_PATH.stat().st_size / (1024 * 1024)
    print(f"   ✅ Saved to {FULL_TAR_PATH} ({size_mb:.1f} MB)")

    # Cleanup
    TEMP_TAR.unlink()
    shutil.rmtree(DST_ROOT, ignore_errors=True)
    print("\n✅ Full-frame tar archive created!")
    print("   Future sessions will load data in ~5 minutes instead of 2+ hours.")


## 🔍 Section 5: Quality Audit & Thesis Figures

Analyze detection quality and generate figures for your thesis.
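
The per-camera and per-class breakdowns in this section boil down to a grouped mean of the `fallback_to_full` flag. The sketch below shows that computation on toy rows using only the standard library, so the arithmetic is easy to follow. The rows are made up for illustration, and `fallback_rate_by` is a hypothetical helper rather than anything provided by `ddriver`.

```python
from collections import defaultdict


def fallback_rate_by(rows, key):
    """Return {group: fraction of rows with fallback_to_full set} grouped by `key`."""
    totals = defaultdict(int)
    fallbacks = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        fallbacks[row[key]] += int(bool(row["fallback_to_full"]))
    return {group: fallbacks[group] / totals[group] for group in totals}


# Toy metadata rows (not real results).
toy_rows = [
    {"camera": "cam1", "fallback_to_full": False},
    {"camera": "cam1", "fallback_to_full": False},
    {"camera": "cam1", "fallback_to_full": True},
    {"camera": "cam1", "fallback_to_full": False},
    {"camera": "cam2", "fallback_to_full": True},
    {"camera": "cam2", "fallback_to_full": False},
]
rates = fallback_rate_by(toy_rows, "camera")
print(rates)
```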


In [None]:
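# 🔍 Hybrid Crop Quality Audit
import os
import pandas as pd
import numpy as np
from pathlib import Path

VARIANT = "face"  # must match the variant you extracted

# Prefer a local copy of the hybrid outputs when HYBRID_ROOT_LOCAL is set and exists.
# Check the string first: Path("") resolves to the current directory, which always exists.
_local_root = os.environ.get("HYBRID_ROOT_LOCAL", "")
hybrid_root = Path(_local_root) if _local_root and Path(_local_root).exists() else Path(OUT_ROOT) / "hybrid"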

metadata_csv = hybrid_root / f"detection_metadata_{VARIANT}.csv"
if not metadata_csv.exists():
    raise FileNotFoundError(f"Detection metadata not found: {metadata_csv}")

print(f"📁 Loading metadata from: {metadata_csv}")
df = pd.read_csv(metadata_csv)

n_total = len(df)
n_fallback = df["fallback_to_full"].sum()
n_skipped = df["skipped"].sum() if "skipped" in df.columns else 0
n_saved = n_total - n_skipped
n_face = (df["face_count"] > 0).sum()
n_left_hand = df["left_hand_detected"].sum()
n_right_hand = df["right_hand_detected"].sum()
n_both_hands = ((df["left_hand_detected"]) & (df["right_hand_detected"])).sum()
n_any_hands = ((df["left_hand_detected"]) | (df["right_hand_detected"])).sum()

print("=" * 60)
print("📊 HYBRID DETECTION SUMMARY")
print("=" * 60)
print(f"Total images processed: {n_total}")
print(f"   Images SAVED: {n_saved} ({100*n_saved/n_total:.1f}%)")
print(f"\n🎯 Detection rates:")
print(f"   Face detected: {n_face} ({100*n_face/n_total:.1f}%)")
print(f"   Left hand: {n_left_hand} ({100*n_left_hand/n_total:.1f}%)")
print(f"   Right hand: {n_right_hand} ({100*n_right_hand/n_total:.1f}%)")
print(f"   Both hands: {n_both_hands} ({100*n_both_hands/n_total:.1f}%)")
print(f"\n⚠️  Fallback to full frame: {n_fallback} ({100*n_fallback/n_total:.1f}%)")


In [None]:
# 📋 Breakdown by Camera and Class
print("\n📋 BREAKDOWN BY CAMERA")
print("-" * 80)
camera_stats = df.groupby("camera").agg({
    "fallback_to_full": ["sum", "mean"],
    "roi_area_frac": "mean",
    "face_count": lambda x: (x > 0).mean(),
}).round(3)
camera_stats.columns = ["fallback_count", "fallback_pct", "mean_roi_area", "face_rate"]
camera_stats["fallback_pct"] = (camera_stats["fallback_pct"] * 100).round(1)
camera_stats["face_rate"] = (camera_stats["face_rate"] * 100).round(1)
print(camera_stats.to_string())

print("\n📋 BREAKDOWN BY CLASS")
print("-" * 80)
class_stats = df.groupby("class_id").agg({
    "fallback_to_full": ["sum", "mean"],
    "roi_area_frac": "mean",
    "face_count": lambda x: (x > 0).mean(),
}).round(3)
class_stats.columns = ["fallback_count", "fallback_pct", "mean_roi_area", "face_rate"]
class_stats["fallback_pct"] = (class_stats["fallback_pct"] * 100).round(1)
class_stats["face_rate"] = (class_stats["face_rate"] * 100).round(1)
print(class_stats.to_string())


In [None]:
# 📊 ROI Quality Distribution Histograms (thesis figure)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path

plt.rcParams.update({"font.size": 12})

meta_face = pd.read_csv(Path(OUT_ROOT) / "hybrid/detection_metadata_face.csv")
meta_fh = pd.read_csv(Path(OUT_ROOT) / "hybrid/detection_metadata_face_hands.csv")

face_valid = meta_face[meta_face["fallback_to_full"] == False].copy()
fh_valid = meta_fh[meta_fh["fallback_to_full"] == False].copy()

lh_conf = fh_valid["left_hand_confidence"].fillna(0).clip(lower=0)
rh_conf = fh_valid["right_hand_confidence"].fillna(0).clip(lower=0)
fh_valid["any_hand_conf"] = np.maximum(lh_conf, rh_conf)

fig, axes = plt.subplots(2, 3, figsize=(15, 9.5))

# Row 1: Face-only
axes[0, 0].hist(face_valid["roi_area_frac"], bins=50, alpha=0.7, color="coral", edgecolor="black")
axes[0, 0].set_xlabel("ROI Area Fraction")
axes[0, 0].set_title("Face-Only: ROI Area Distribution")
axes[0, 0].axvline(face_valid["roi_area_frac"].median(), color="red", linestyle="--",
                   label=f"Median: {face_valid['roi_area_frac'].median():.3f}")
axes[0, 0].legend()

axes[0, 1].hist(face_valid["roi_aspect"], bins=50, alpha=0.7, color="coral", edgecolor="black")
axes[0, 1].set_xlabel("ROI Aspect Ratio (W/H)")
axes[0, 1].set_title("Face-Only: Aspect Ratio Distribution")

axes[0, 2].hist(face_valid["face_confidence"], bins=50, alpha=0.7, color="coral", edgecolor="black")
axes[0, 2].set_xlabel("Face Detection Confidence")
axes[0, 2].set_title("Face-Only: Detection Confidence")

# Row 2: Face+Hands
axes[1, 0].hist(fh_valid["roi_area_frac"], bins=50, alpha=0.7, edgecolor="black")
axes[1, 0].set_xlabel("ROI Area Fraction")
axes[1, 0].set_title("Face+Hands: ROI Area Distribution")

axes[1, 1].hist(fh_valid["roi_aspect"], bins=50, alpha=0.7, edgecolor="black")
axes[1, 1].set_xlabel("ROI Aspect Ratio (W/H)")
axes[1, 1].set_title("Face+Hands: Aspect Ratio Distribution")

axes[1, 2].hist(fh_valid["any_hand_conf"], bins=50, alpha=0.7, edgecolor="black")
axes[1, 2].set_xlabel("Any-hand Confidence")
axes[1, 2].set_title("Face+Hands: Hand Detection Confidence")

plt.tight_layout()
metrics_dir = Path(OUT_ROOT) / "metrics"
metrics_dir.mkdir(parents=True, exist_ok=True)  # savefig does not create missing directories
fig_path = metrics_dir / "crop_quality_distributions.png"
plt.savefig(fig_path, dpi=150, bbox_inches="tight")
plt.show()
print(f"✅ Saved: {fig_path}")


## ✅ Data Preparation Complete!

**What was created (persists on Drive):**
- `OUT_ROOT/manifests/manifest.csv`: Full image manifest
- `OUT_ROOT/splits/train.csv`, `val.csv`, `test.csv`: Split CSVs
- `OUT_ROOT/splits/control/` and `OUT_ROOT/splits/id_sets/`: Control splits and ID sets (Section 3b)
- `OUT_ROOT/hybrid/face/`: Face-only crops
- `OUT_ROOT/hybrid/face_hands/`: Face+hands crops
- `OUT_ROOT/hybrid/*.csv`: Hybrid manifests and splits
- `OUT_ROOT/metrics/`: Quality audit figures

**Tar Archives (for fast loading):**
- `DRIVE_DATA_ROOT/full_compressed.tar`: Compressed full-frame images
- `OUT_ROOT/hybrid/hybrid_face.tar`: Face-only crops archive
- `OUT_ROOT/hybrid/hybrid_face_hands.tar`: Face+hands crops archive

**⚡ Performance Note:**
Future sessions will copy tar archives (~5 min) instead of individual files (2+ hours).

**Next steps:**
- Run **02_training.ipynb** to train models on these crops
