# Notebook 1: The Data Processing and Curation Pipeline

**Objective:** To transform a raw, noisy collection of webcam images into a clean, structured, and analysis-ready dataset for training our neural networks.

This notebook documents the critical pre-processing steps that were engineered to solve the primary challenge identified in early experiments: **background noise**. The initial models were learning to associate backgrounds with gestures rather than the hand shapes themselves. This pipeline systematically eliminates that noise.

The process involves several automated and human-in-the-loop stages:
1.  **Data Collection:** Capturing raw images using an interactive tool.
2.  **Automated Cropping:** Using MediaPipe to detect and isolate hands.
3.  **Manual Review:** A fallback tool for images where auto-cropping fails.
4.  **Final Dataset Build:** Consolidating all clean images.
5.  **Data Splitting:** Dividing the final dataset into `train`, `validation`, and `test` sets for robust model evaluation.

### Setup: Imports and Configuration

First, we import the necessary libraries and define the paths for our various data directories. This ensures our pipeline is organized and reproducible.

In [None]:
import os
import cv2
import shutil
import random
import logging
import mediapipe as mp
from pathlib import Path

# Configure logging for clear output
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)

# Define project paths relative to the project root
PROJECT_ROOT = Path.cwd().parent # Assumes notebook is in 'notebooks' dir
RAW_DATASET_DIR = PROJECT_ROOT / "dataset"
CROPPED_DIR = PROJECT_ROOT / "dataset_cropped"
REVIEW_DIR = PROJECT_ROOT / "dataset_review"
FINAL_DIR = PROJECT_ROOT / "dataset_final"
SPLIT_DATA_DIR = PROJECT_ROOT / "data"

print(f"Project Root: {PROJECT_ROOT}")

### Step 1: Data Collection (Conceptual)

The first step is to create our own dataset. The brief requires this, and it gives us full control over the data's quality and variability. The script `src/data_collection.py` provides an interactive OpenCV window to capture images from a webcam. 

**Key Features:**
- **Live Camera Feed:** Shows what the camera sees.
- **Class Switching:** Use keyboard keys (`r`, `p`, `s`, `n`) to switch the category for the next capture.
- **Countdown Capture:** Press `c` to start a 3-second countdown, allowing time to position the hand correctly.
- **Organized Saving:** Images are automatically saved into the correct subfolder (e.g., `dataset/rock/`).

*This script is interactive and designed to be run from the command line, so we will not execute it here. The output is the raw `dataset/` directory.*
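
The class-switching and organized-saving behavior described above can be sketched without a camera. This is a minimal illustration, not the contents of `src/data_collection.py`: the helper names, the filename pattern, and the `paper`/`scissors` class names (inferred from the `p` and `s` keys) are assumptions.

```python
from pathlib import Path

# Assumed key-to-class mapping: only `rock` and `none` are named in this notebook;
# `paper` and `scissors` are inferred from the `p` and `s` keys.
KEY_TO_CLASS = {"r": "rock", "p": "paper", "s": "scissors", "n": "none"}


def select_class(key, current_class):
    """Return the class selected by `key`, or keep the current one for other keys."""
    return KEY_TO_CLASS.get(key, current_class)


def next_capture_path(dataset_dir, class_name, index):
    """Build the destination path for a capture, e.g. dataset/rock/rock_0007.png."""
    class_dir = Path(dataset_dir) / class_name
    class_dir.mkdir(parents=True, exist_ok=True)  # create the class subfolder on demand
    return class_dir / f"{class_name}_{index:04d}.png"
```

In an OpenCV capture loop, the key would typically come from `cv2.waitKey(1) & 0xFF`, and the returned path would be passed to `cv2.imwrite`.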

### Step 2: Automated Hand Cropping

This is the core of our solution to the background noise problem. We use Google's **MediaPipe** library, which provides pretrained models for detecting human body landmarks, including 21 landmarks per hand.

The following function, taken directly from `src/utils/auto_crop.py`, performs these actions:
1. Initializes MediaPipe's hand detection model.
2. Iterates through every image in our raw `dataset/` directory.
3. For each image, it attempts to detect a hand.
4. If a hand is found, it calculates a bounding box around the hand landmarks.
5. It adds a small amount of `PADDING` to the bounding box to ensure the whole hand is included.
6. The image is cropped to this bounding box and saved in the `dataset_cropped/` directory, preserving the class subfolder structure.
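
Steps 4 and 5 can be isolated as a small pure function, which makes the arithmetic easy to check without running MediaPipe. This is a sketch: the name `padded_bbox` and the sample points are hypothetical, and it assumes landmarks arrive as `(x, y)` pairs normalized by image width and height, as MediaPipe reports them.

```python
def padded_bbox(norm_points, width, height, padding=16):
    """Return (x_min, y_min, x_max, y_max) in pixels for normalized (x, y) points."""
    xs = [x * width for x, _ in norm_points]
    ys = [y * height for _, y in norm_points]
    # Expand the tight box by `padding`, then clamp it to the image bounds
    x_min = max(0, int(min(xs)) - padding)
    y_min = max(0, int(min(ys)) - padding)
    x_max = min(width, int(max(xs)) + padding)
    y_max = min(height, int(max(ys)) + padding)
    return x_min, y_min, x_max, y_max


# Three made-up landmarks on a 640x480 frame
points = [(0.25, 0.5), (0.5, 0.25), (0.75, 0.75)]
print(padded_bbox(points, width=640, height=480))  # (144, 104, 496, 376)
```

The clamping matters because a hand near the frame edge would otherwise produce a box that extends past the image.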

In [None]:
# This code is from src/utils/auto_crop.py

def run_auto_crop(source_dir, dest_dir, padding=16):
    """Finds hands in the original dataset, crops them with padding, and saves them."""
    
    logger.info("Starting Automatic Hand Cropping Pipeline")
    
    # Initialize MediaPipe Hands
    mp_hands = mp.solutions.hands
    hands = mp_hands.Hands(static_image_mode=True, max_num_hands=1, min_detection_confidence=0.5)
    
    dest_dir.mkdir(parents=True, exist_ok=True)
    
    image_count = 0
    cropped_count = 0

    # Iterate through all class folders in the source directory
    for class_path in [p for p in source_dir.iterdir() if p.is_dir()]:
        dest_class_path = dest_dir / class_path.name
        dest_class_path.mkdir(parents=True, exist_ok=True)
        logger.info(f"Processing class: {class_path.name}")
        
        # Iterate through all images in the class folder
        for image_path in class_path.glob("*.png"):
            image_count += 1
            image = cv2.imread(str(image_path))

            if image is None:
                logger.warning(f"Could not read {image_path.name}, skipping.")
                continue

            # Process the image to find hands
            results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            
            if results.multi_hand_landmarks:
                hand_landmarks = results.multi_hand_landmarks[0]
                h, w, _ = image.shape
                x_coords = [lm.x * w for lm in hand_landmarks.landmark]
                y_coords = [lm.y * h for lm in hand_landmarks.landmark]
                
                x_min, x_max = int(min(x_coords)), int(max(x_coords))
                y_min, y_max = int(min(y_coords)), int(max(y_coords))
                    
                # Apply padding and crop
                x_min = max(0, x_min - padding)
                y_min = max(0, y_min - padding)
                x_max = min(w, x_max + padding)
                y_max = min(h, y_max + padding)
                cropped_image = image[y_min:y_max, x_min:x_max]
                
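                # Count a crop only if it is non-empty and was actually written to disk
                if cropped_image.size > 0 and cv2.imwrite(str(dest_class_path / image_path.name), cropped_image):
                    cropped_count += 1
                else:
                    logger.warning(f"Could not save a crop for {image_path.name}.")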
            else:
                logger.warning(f"No hand detected in {image_path.name}, skipping.")
                
    hands.close()
    logger.info("Cropping Pipeline Complete")
    logger.info(f"Total images processed: {image_count}")
    logger.info(f"Successfully cropped and saved: {cropped_count}")

# Run the function
# run_auto_crop(RAW_DATASET_DIR, CROPPED_DIR)

### Step 3: Manual Review (Conceptual)

Not every image is perfect. Sometimes MediaPipe fails to detect a hand due to poor lighting, awkward angles, or motion blur. The `src/utils/review.py` script is another interactive tool that creates a **human-in-the-loop** workflow to handle these failures.

It presents the user with any image that MediaPipe couldn't process and provides options:
- **Manually Crop:** Draw a box around the hand to crop it.
- **Keep Original:** If the original image is fine as-is (e.g., for the 'none' class).
- **Discard & Replace:** Remove the bad image and replace it with a copy of the last known good image to maintain dataset balance.

*Like the data collection script, this is an interactive tool not meant for execution within a notebook.*
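
The file handling behind two of these options can be sketched as follows. This is an illustration under assumptions, not the contents of `src/utils/review.py`: the function name, the decision strings, and the choice to file a replacement copy under `good_crop/` are hypothetical. The manual-crop branch is omitted; OpenCV's `cv2.selectROI` is one way to let a user draw the box.

```python
import shutil
from pathlib import Path


def apply_review_decision(image_path, class_name, decision, review_dir, last_good=None):
    """File one reviewed image according to the reviewer's decision.

    "keep" copies the original into keep_original/<class>/.
    "discard" copies `last_good` (if any) into good_crop/<class>/ under the
    discarded image's name, so the class keeps the same number of images.
    Returns the path written, or None if nothing was written.
    """
    image_path = Path(image_path)
    if decision == "keep":
        dest_dir = Path(review_dir) / "keep_original" / class_name
        source = image_path
    elif decision == "discard":
        if last_good is None:
            return None  # no good image seen yet, so nothing to substitute
        dest_dir = Path(review_dir) / "good_crop" / class_name
        source = Path(last_good)
    else:
        raise ValueError(f"Unknown decision: {decision!r}")

    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / image_path.name
    shutil.copy(source, dest)
    return dest
```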

### Step 4: Building the Final Dataset

After the automated and manual review steps, our clean images are located in different folders (e.g., `dataset_review/good_crop`, `dataset_review/keep_original`).

The `src/utils/build_final_dataset.py` script consolidates all these approved images into a single, clean `dataset_final/` directory. This becomes the master source for our training data.
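
Because `shutil.copy` overwrites an existing destination file, two source folders that contain the same class and filename would silently collapse into one image in `dataset_final/`. A pre-check along these lines can flag that before consolidating. It is a sketch, and `find_name_collisions` is a hypothetical helper, not part of the project scripts.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(source_dirs, pattern="*.png"):
    """Map 'class/filename' to the source folders holding it, for names seen more than once."""
    seen = defaultdict(list)
    for source in map(Path, source_dirs):
        if not source.is_dir():
            continue  # a missing source folder simply contributes nothing
        for class_dir in sorted(p for p in source.iterdir() if p.is_dir()):
            for image_path in sorted(class_dir.glob(pattern)):
                seen[f"{class_dir.name}/{image_path.name}"].append(source.name)
    return {key: sources for key, sources in seen.items() if len(sources) > 1}
```

An empty result means every approved image will survive the copy; a non-empty one lists the files that would need renaming first.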

In [None]:
# This code is from src/utils/build_final_dataset.py

def run_build(review_dir, final_dir):
    """Combines reviewed images into a final dataset directory."""
    logger.info("Building Final Dataset")
    
    # Start from an empty output folder so stale images never leak into a rebuild
    if final_dir.exists():
        shutil.rmtree(final_dir)
    final_dir.mkdir(parents=True)

    sources_to_combine = [
        review_dir / "good_crop",
        review_dir / "keep_original"
    ]

    total_copied = 0
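    for source_base in sources_to_combine:
        # A review category may be empty or absent (e.g. nothing was kept as-is)
        if not source_base.is_dir():
            logger.warning(f"Source folder {source_base} not found, skipping.")
            continue
        logger.info(f"Copying from {source_base.name}...")
        for class_path in [p for p in source_base.iterdir() if p.is_dir()]: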
            dest_class_path = final_dir / class_path.name
            dest_class_path.mkdir(exist_ok=True)
            
            count = 0
            for image_path in class_path.glob("*.png"):
                shutil.copy(str(image_path), dest_class_path)
                count += 1
            total_copied += count
            
    logger.info(f"Final Dataset Build Complete. Total images: {total_copied}")

# Run the function
# run_build(REVIEW_DIR, FINAL_DIR)

### Step 5: Splitting the Dataset for Training

The final step in data preparation is to split our `dataset_final` into training, validation, and test sets. This is crucial for properly training a model and evaluating its ability to generalize to new, unseen data.

The `src/utils/prepare_dataset.py` script performs this split:
- It shuffles the images within each class to ensure randomness.
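- It uses a fixed random seed (`RANDOM_SEED` in the script, the `seed` argument in the function below) to make the split **reproducible**.
- It splits the data according to the defined ratios (e.g., 70% train, 15% validation, 15% test). Split sizes are rounded down, so any leftover images go to the test set.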
- It copies the files into a new `data/` directory with `train/`, `validation/`, and `test/` subfolders.
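
The ratio arithmetic is easiest to see with concrete numbers. The sketch below mirrors the truncating `int(...)` calculation used for the split points; `split_sizes` is a hypothetical helper written only for this illustration.

```python
def split_sizes(num_images, ratios):
    """Return (train, validation, test) counts using truncating split points."""
    train_end = int(ratios["train"] * num_images)
    validation_end = train_end + int(ratios["validation"] * num_images)
    # The test set takes everything after the validation slice, including remainders
    return train_end, validation_end - train_end, num_images - validation_end


ratios = {"train": 0.7, "validation": 0.15, "test": 0.15}
print(split_sizes(101, ratios))  # (70, 15, 16)
print(split_sizes(7, ratios))    # (4, 1, 2)
```

With very small classes the test share can drift well above its nominal 15%, as the 7-image case shows, so it is worth checking class sizes before splitting.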

In [None]:
# This code is from src/utils/prepare_dataset.py

def run_split(source_dir, dest_dir, ratios=None, seed=123):
    """Splits the final dataset into train/validation/test sets."""
    # Avoid a mutable default argument; fall back to the 70/15/15 split
    if ratios is None:
        ratios = {"train": 0.7, "validation": 0.15, "test": 0.15}
    logger.info("Starting Dataset Split")
    random.seed(seed)

    if dest_dir.exists():
        shutil.rmtree(dest_dir)

    # Create destination directories
    for split in ratios.keys():
        for class_name in [p.name for p in source_dir.iterdir() if p.is_dir()]:
            (dest_dir / split / class_name).mkdir(parents=True, exist_ok=True)

    # Process each class
    for class_path in [p for p in source_dir.iterdir() if p.is_dir()]:
        image_files = list(class_path.glob("*.png"))
        random.shuffle(image_files)
        num_images = len(image_files)

        # Calculate split points
        train_end = int(ratios["train"] * num_images)
        validation_end = train_end + int(ratios["validation"] * num_images)
        
        splits = {
            "train": image_files[:train_end],
            "validation": image_files[train_end:validation_end],
            "test": image_files[validation_end:],
        }

        # Copy files
        for split_name, files in splits.items():
            dest_split_dir = dest_dir / split_name / class_path.name
            for file_path in files:
                shutil.copy(file_path, dest_split_dir)
                
    logger.info("Dataset Split Complete")

# Run the function
# run_split(FINAL_DIR, SPLIT_DATA_DIR)

### Conclusion

At the end of this pipeline, the raw webcam captures have been cropped to the hand region, reviewed, consolidated, and split into `train/`, `validation/`, and `test/` sets under the `data/` directory. This dataset is the input to the model development and training phase, which is covered in the next notebook.
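
Before moving on, a quick count per split and class is a cheap way to confirm the output looks as expected. This is a sketch assuming the `data/<split>/<class>/*.png` layout described above; `count_images` is a hypothetical helper.

```python
from pathlib import Path


def count_images(data_dir, pattern="*.png"):
    """Return {split: {class_name: image_count}} for a data/<split>/<class>/ layout."""
    counts = {}
    for split_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for _ in class_dir.glob(pattern))
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts
```

Calling `count_images(SPLIT_DATA_DIR)` after the split makes class imbalance or an empty split visible at a glance.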