# 1. Data Exploration & Preparation

This notebook covers the initial exploration and preparation of the Rock-Paper-Scissors dataset. The goal is to understand the data's structure, verify its integrity, and prepare it for the model training pipeline by splitting it into training, validation, and test sets.

**Key Steps:**
1.  Verify the integrity of the raw images.
2.  Visualize sample images from each class.
3.  Execute the script to split the final, clean dataset (`dataset_final`) into `train`, `validation`, and `test` directories.

In [None]:
import os
import random
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from pathlib import Path

# --- Configuration ---
# Assumes the notebook is run from the project's notebooks/ directory,
# so the project root is one level above the current working directory.
PROJECT_ROOT = Path.cwd().parent
DATASET_DIR = PROJECT_ROOT / "dataset_final"  # We explore the FINAL, CLEAN dataset
CLASSES = ['rock', 'paper', 'scissors', 'none']

## 1.1. Verifying and Visualizing the Dataset

Before any training, we need to confirm that the dataset is correctly structured and that the images load. We count the images in each class to spot missing directories or class imbalance, then display one random sample per class to get a feel for the data's quality and variety.

In [None]:
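# Count images in each class
print("Image count per class in the final dataset:")
for class_name in CLASSES:
    class_path = DATASET_DIR / class_name
    if class_path.is_dir():
        # Only .png files are counted
        num_files = len(list(class_path.glob('*.png')))
        print(f"- {class_name}: {num_files} images")
    else:
        # Report a missing class directory instead of skipping it silently
        print(f"- {class_name}: directory not found at {class_path}")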

# Visualize some sample images
plt.figure(figsize=(12, 8))
for i, class_name in enumerate(CLASSES):
    class_path = DATASET_DIR / class_name
    if class_path.is_dir():
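        # Pick a random image from the class directory; skip the class if it
        # has no .png files, since random.choice raises IndexError on an empty list
        image_paths = list(class_path.glob('*.png'))
        if not image_paths:
            continue
        random_image = random.choice(image_paths)
        img = mpimg.imread(random_image)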
        
        plt.subplot(2, 2, i + 1)
        plt.imshow(img)
        plt.title(f"Class: {class_name}\nShape: {img.shape}")
        plt.axis('off')

plt.suptitle("Sample Images from the V2 (Cropped) Dataset", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
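
Counting files confirms that they exist, not that they are valid images. A lightweight integrity check, sketched below using only the standard library, flags any `.png` file that is empty or does not start with the 8-byte PNG signature. The helper name `find_suspect_pngs` is hypothetical and not part of the project. The check only inspects the file header, so a file that passes can still be truncated or corrupted further in.

```python
from pathlib import Path

# The 8-byte signature that every valid PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_suspect_pngs(class_dir):
    """Return the .png files in class_dir that are empty or lack the PNG signature.

    Hypothetical helper for illustration; it checks the header only and does
    not decode the image data.
    """
    suspects = []
    for image_path in sorted(Path(class_dir).glob("*.png")):
        with open(image_path, "rb") as handle:
            header = handle.read(len(PNG_SIGNATURE))
        if header != PNG_SIGNATURE:
            suspects.append(image_path)
    return suspects
```

In this notebook it could be called once per class, for example `find_suspect_pngs(DATASET_DIR / class_name)` inside the loop over `CLASSES`. An empty list means no file failed the header check.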

## 1.2. Splitting the Dataset

The dataset is split into `train`, `validation`, and `test` sets to ensure a robust evaluation of the models. This process is handled by the `src/utils/prepare_dataset.py` script, which is executed via `run.py`.

The script takes all images from `dataset_final` and distributes them according to a 70/15/15 ratio for training, validation, and testing, respectively. This ensures that the model is trained on a majority of the data, evaluated on a separate validation set during training, and finally tested on a completely unseen test set.
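
To make the 70/15/15 ratio concrete, here is a minimal sketch of such a split. It is not the code in `src/utils/prepare_dataset.py`, whose implementation is not shown here. The function name `split_items`, the fixed seed, and the use of integer percentages are all assumptions made for illustration.

```python
import random


def split_items(items, percents=(70, 15, 15), seed=42):
    """Shuffle items and split them into train/validation/test lists.

    Hypothetical helper: percents are whole numbers that must sum to 100,
    and the seed (an assumption) makes the shuffle reproducible.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100

    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        # The test set takes the remainder, so no item is dropped.
        "test": shuffled[n_train + n_val:],
    }
```

With 100 file names this gives 70 training, 15 validation, and 15 test items. Applying the split to each class directory separately, rather than to the pooled file list, would keep the class proportions the same in all three sets.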

The split can be run from the command line: