# 1. Data Exploration & Dataset Verification

**Author:** Tim Chinye
**Date:** [Current Date]

## 1.1. Introduction

This notebook serves as the initial step in our project pipeline: understanding and verifying the custom dataset we've collected. Before we can train any models, it is crucial to:

1.  **Confirm the dataset structure** and ensure all class directories exist.
2.  **Quantify the data** by counting the number of images in each class to check for balance.
3.  **Visually inspect** sample images to verify their quality, diversity, and correctness.
4.  **Programmatically verify** that all image files are valid and not corrupt.

This exploratory data analysis (EDA) ensures we are building our models on a high-quality foundation.

## 1.2. Importing Libraries and Defining Paths

In [None]:
import os
import random
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image

# Define the root path to the original, unprocessed dataset
# Assuming the notebook is run from the project's root directory
DATASET_ROOT = Path("./dataset")
CLASSES = ['rock', 'paper', 'scissors', 'none']

# Verify that the root directory exists
if not DATASET_ROOT.exists():
    print(f"FATAL: Dataset directory not found at '{DATASET_ROOT.resolve()}'")
    print("Please ensure you have run the data collection script or placed the dataset in the correct location.")
else:
    print(f"Dataset directory found at: '{DATASET_ROOT.resolve()}'")
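The check above only covers the root directory. As a minimal sketch, the per-class directories could be checked in the same step. `find_missing_class_dirs` is a hypothetical helper name introduced here for illustration; it is not part of the original notebook.

```python
from pathlib import Path


def find_missing_class_dirs(dataset_root, classes):
    """Return the class names that have no directory under dataset_root."""
    root = Path(dataset_root)
    # A class counts as present only if its path exists and is a directory
    return [name for name in classes if not (root / name).is_dir()]
```

Calling `find_missing_class_dirs(DATASET_ROOT, CLASSES)` returns an empty list when every class directory is present, so the result can be used directly in an `if` statement.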

## 1.3. Quantifying the Dataset

Next, we will iterate through each class directory and count the number of images. This lets us check that the dataset is roughly balanced; a heavily imbalanced dataset can bias the model towards the majority class during training.

In [None]:
image_counts = {}
total_images = 0

for class_name in CLASSES:
    class_dir = DATASET_ROOT / class_name
    if class_dir.is_dir():
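        # Count only image files so stray files (e.g. .DS_Store) do not inflate the totals
        count = len([f for f in class_dir.iterdir()
                     if f.is_file() and f.suffix.lower() in ('.png', '.jpg', '.jpeg')])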
        image_counts[class_name] = count
        total_images += count
    else:
        image_counts[class_name] = 0
        print(f"Warning: Directory for class '{class_name}' not found.")

# Print the counts
print("--- Image Counts per Class ---")
for class_name, count in image_counts.items():
    print(f"- {class_name.capitalize():<10}: {count} images")
print("------------------------------")
print(f"Total Images: {total_images}")

# Optional: Plotting the distribution as a bar chart
plt.figure(figsize=(8, 5))
plt.bar(image_counts.keys(), image_counts.values(), color=['#FF6347', '#4682B4', '#32CD32', '#6A5ACD'])
plt.title('Image Distribution Across Classes')
plt.xlabel('Class')
plt.ylabel('Number of Images')
for i, count in enumerate(image_counts.values()):
    plt.text(i, count + 5, str(count), ha='center') # Add count labels on top of bars
plt.show()
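The bar chart gives a visual impression of balance; a single number can make the same check easier to compare between runs. The sketch below computes the ratio of the largest class to the smallest. `imbalance_ratio` is a hypothetical helper, and the counts in the usage note are made-up values for illustration, not measurements from this dataset.

```python
import math


def imbalance_ratio(counts):
    """Return largest class count divided by smallest class count.

    A value of 1.0 means perfectly balanced classes. If any class is
    empty, the ratio is reported as infinity.
    """
    if not counts:
        raise ValueError("counts must contain at least one class")
    smallest = min(counts.values())
    largest = max(counts.values())
    if smallest == 0:
        return math.inf
    return largest / smallest
```

For example, `imbalance_ratio({'rock': 500, 'paper': 480, 'scissors': 510, 'none': 250})` divides 510 by 250. What ratio is acceptable depends on the model and the task, so any cut-off applied to this value is a project decision rather than a fixed rule.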

## 1.4. Visual Inspection of Sample Images

A numerical count is useful, but we must also visually inspect the data. This helps us confirm that the images are high-quality, varied, and correctly labeled. We will display a random selection of 4 images from each class.

In [None]:
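# Set up the plotting grid: one row per class, four samples per row
fig, axes = plt.subplots(len(CLASSES), 4, figsize=(12, 12))
fig.suptitle('Random Image Samples from Each Class', fontsize=16)

# Hide every axis up front so cells without an image stay blank
for ax in axes.flat:
    ax.axis('off')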

for i, class_name in enumerate(CLASSES):
    class_dir = DATASET_ROOT / class_name
    if not class_dir.is_dir():
        continue
    
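    # Match the same extensions that the verification step accepts, not only .png
    image_files = [f for f in class_dir.iterdir()
                   if f.is_file() and f.suffix.lower() in ('.png', '.jpg', '.jpeg')]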
    if len(image_files) < 4:
        print(f"Warning: Not enough images in '{class_name}' to display 4 samples.")
        sample_images = image_files
    else:
        sample_images = random.sample(image_files, 4)
    
    for j, img_path in enumerate(sample_images):
        img = mpimg.imread(img_path)
        ax = axes[i, j]
        ax.imshow(img)
        ax.set_title(class_name)
        ax.axis('off')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
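Because the samples are drawn at random, each run of the cell above shows different images. If the same grid needs to be reproduced later (for example, to point a collaborator at a suspicious image), the selection could be seeded. This is a sketch under that assumption; `sample_paths` and its `seed` parameter are hypothetical names, not part of the original notebook.

```python
import random


def sample_paths(paths, k, seed=None):
    """Pick up to k items from paths; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    ordered = sorted(paths)    # stable order so the seed alone decides the result
    return rng.sample(ordered, min(k, len(ordered)))
```

Sorting first matters because directory listing order is not guaranteed to be the same across machines; without it, the same seed could select different files.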

## 1.5. Programmatic Image File Verification

Finally, we will run a script to programmatically open and verify every image file in the dataset. A single corrupt image can halt the training process hours in. This step is a crucial safeguard to ensure a smooth model training pipeline.

The following code will iterate through all images and report any that cannot be opened or verified by the PIL (Pillow) library.

In [None]:
corrupt_files = []

print("--- Verifying all images in the dataset... ---")
for class_name in CLASSES:
    class_dir = DATASET_ROOT / class_name
    if not class_dir.is_dir():
        continue
    
    print(f"Checking class: {class_name}...")
    for img_path in class_dir.iterdir():
        if img_path.is_file() and img_path.suffix.lower() in ['.png', '.jpg', '.jpeg']:
            try:
                # The context manager closes the file handle even if verify() raises
                with Image.open(img_path) as img:
                    img.verify()  # Verify that it is a valid image
            except (IOError, SyntaxError) as e:
                print(f"  -> CORRUPT FILE DETECTED: {img_path.name} | Reason: {e}")
                corrupt_files.append(str(img_path))

print("\n--- Verification Complete ---")
if not corrupt_files:
    print("✅ Success! All image files are valid and can be opened.")
else:
    print(f"🚨 Warning! Found {len(corrupt_files)} corrupt file(s):")
    for f in corrupt_files:
        print(f"  - {f}")
    print("\nThese files should be deleted before proceeding to model training.")
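One caveat: according to the Pillow documentation, `Image.verify()` checks a file without decoding the image data, so a file whose pixel data is cut off partway may still pass. A stricter (and slower) check is to force a full decode with `Image.load()`. The sketch below assumes Pillow's `ImageFile.LOAD_TRUNCATED_IMAGES` setting is left at its default of `False`; `find_undecodable_images` is a hypothetical helper name.

```python
from PIL import Image


def find_undecodable_images(paths):
    """Return (path, reason) pairs for files that fail a full pixel decode."""
    failures = []
    for path in paths:
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode of the pixel data
        except Exception as exc:
            # Broad on purpose: different decoders raise different error types
            failures.append((str(path), str(exc)))
    return failures
```

Passing every image path in the dataset to this function yields a list comparable to `corrupt_files` above, with the reason kept alongside each path. Whether the extra decoding time is worthwhile depends on the dataset size.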