# Rock–Paper–Scissors Dataset Preprocessing

This notebook scans the original dataset folders, validates images, and creates reproducible **train**, **validation**, and **test** splits.  
It also generates a configuration JSON (`preprocess.json`) with resizing, normalization, and augmentation parameters used by all CNN models.


## 1) Imports & Global Config
Import the required libraries and define global constants such as the random seed. The dataset path is set in the next section.


In [None]:
# --- Import libraries and set global parameters (paths, random seed) ---

from pathlib import Path
import pandas as pd
import numpy as np
from PIL import Image, UnidentifiedImageError
from sklearn.model_selection import StratifiedShuffleSplit
import json
import os

SEED = 42
np.random.seed(SEED)

## 2) Dataset Scanning & Validation
Walk through the dataset directory, check for valid image files, collect metadata (path, label, width, height),  
and handle any unreadable or corrupted files.
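
The extension filter used by the scan can be illustrated in isolation. The following is a minimal, standard-library-only sketch: the helper names `list_candidate_images` and `demo_candidate_scan` and the throwaway folder layout are assumptions made for illustration, and the sketch covers only the case-insensitive suffix match, not the PIL validation step.

```python
import tempfile
from pathlib import Path

# Same suffixes the notebook accepts, compared in lowercase.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def list_candidate_images(root, exts=IMAGE_EXTS):
    """Hypothetical helper: sorted files under `root` whose suffix is in `exts`."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in exts
    )


def demo_candidate_scan():
    """Build a throwaway folder tree and return the file names that pass the filter."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "rock").mkdir()
        (root / "rock" / "a.PNG").write_bytes(b"")      # upper-case suffix still matches
        (root / "rock" / "notes.txt").write_bytes(b"")  # skipped: not an image suffix
        (root / "paper").mkdir()
        (root / "paper" / "b.jpeg").write_bytes(b"")
        return [p.name for p in list_candidate_images(root)]


print(demo_candidate_scan())
```

Sorting the paths keeps the scan order independent of the filesystem, which matters when a seeded split is later applied to the resulting table.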


In [None]:
Dataset_Dir = Path(r"Dataset/")
assert Dataset_Dir.exists(), f"Dataset directory not found: {Dataset_Dir}"

In [None]:
records = []
corrupt = []
EXTS = [".png", ".jpg", ".jpeg", ".bmp", ".gif"]

# Sort the class folders so the scan order does not depend on the filesystem.
class_dirs = sorted(p for p in Dataset_Dir.iterdir() if p.is_dir())
class_names = [p.name for p in class_dirs]
print("Found class folders:", class_names)

In [None]:
# --- Iterate through all subfolders, check image validity, and record metadata ---

for cdir in class_dirs:
    label = cdir.name
    # Sort paths so the manifest order (and therefore the seeded split) is deterministic.
    for img_path in sorted(cdir.rglob("*")):
        if not img_path.is_file() or img_path.suffix.lower() not in EXTS:
            continue
        try:
            with Image.open(img_path) as im:
                # convert() forces a full decode, so truncated files fail here.
                w, h = im.convert("RGB").size
            records.append({
                "filepath": str(img_path.resolve()),
                "label": label,
                "width": w,
                "height": h,
            })
        except (UnidentifiedImageError, OSError):
            corrupt.append(str(img_path))

In [None]:
df = pd.DataFrame(records)
print(f"\nScanned {len(df)} images\nCorrupt/unreadable: {len(corrupt)}")

In [None]:
print("\nCounts per class:")
print(df.groupby("label").size().sort_index())

In [None]:
print("\nUnique (width, height) pairs found:")
print(df[["width", "height"]].drop_duplicates().to_string(index=False))

## 3) Create Manifest File
Combine all scanned images into a single dataframe and export it to `rps_outputs/manifest.csv` for reproducibility.


In [None]:
out_dir = Path("rps_outputs")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_csv(out_dir / "manifest.csv", index=False)
if corrupt:
    with open(out_dir / "corrupt_files.txt", "w") as f:
        f.write("\n".join(corrupt))
print("\nSaved:", out_dir / "manifest.csv")

In [None]:
try:
    df
except NameError:
    # Fall back to the saved manifest when `df` is not in memory.
    df = pd.read_csv(out_dir / "manifest.csv")

X = df["filepath"].values
y = df["label"].values

## 4) Stratified Train/Val/Test Split
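Use `StratifiedShuffleSplit` to preserve class proportions across the **train**, **validation**, and **test** splits.  
With `TEST_FRAC = 0.15` and `VAL_FRAC = 0.15`, the result is roughly a 70/15/15 split; both passes use the fixed seed (42) for reproducibility.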
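
Because the second split is drawn from the pool that remains after the test set is removed, the validation fraction has to be rescaled relative to that pool. The arithmetic can be sketched as follows; the function names `second_split_fraction` and `overall_fractions` are hypothetical and exist only for this illustration.

```python
TEST_FRAC = 0.15
VAL_FRAC = 0.15


def second_split_fraction(val_frac, test_frac):
    """Share of the train+val pool to hold out so that val is `val_frac` of the full dataset."""
    return val_frac / (1.0 - test_frac)


def overall_fractions(val_frac, test_frac):
    """Return (train, val, test) as fractions of the full dataset."""
    pool = 1.0 - test_frac                               # what is left after the test split
    val = pool * second_split_fraction(val_frac, test_frac)
    train = pool - val
    return train, val, test_frac


print(round(second_split_fraction(VAL_FRAC, TEST_FRAC), 4))
print([round(x, 2) for x in overall_fractions(VAL_FRAC, TEST_FRAC)])
```

Passing `VAL_FRAC` directly to the second split would instead hold out 15% of the remaining 85%, which is only 12.75% of the full dataset.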


In [None]:
# --- Perform 2-level stratified split: (1) train+val vs test, (2) train vs val ---

TEST_FRAC = 0.15
VAL_FRAC = 0.15

sss1 = StratifiedShuffleSplit(n_splits=1, test_size=TEST_FRAC, random_state=SEED)

trainval_idx, test_idx = next(sss1.split(X, y))
X_trainval, y_trainval = X[trainval_idx], y[trainval_idx]
X_test, y_test = X[test_idx], y[test_idx]

In [None]:
# VAL_FRAC is a fraction of the full dataset, so rescale it to the train+val pool.
val_size = VAL_FRAC / (1.0 - TEST_FRAC)

sss2 = StratifiedShuffleSplit(n_splits=1, test_size=val_size, random_state=SEED)

train_idx, val_idx = next(sss2.split(X_trainval, y_trainval))
X_train, y_train = X_trainval[train_idx], y_trainval[train_idx]
X_val, y_val = X_trainval[val_idx], y_trainval[val_idx]

In [None]:
train_df = pd.DataFrame({"filepath": X_train, "label": y_train})
val_df = pd.DataFrame({"filepath": X_val, "label": y_val})
test_df = pd.DataFrame({"filepath": X_test, "label": y_test})

## 5) Export CSV Splits
Save the generated splits as CSV files (`train.csv`, `val.csv`, `test.csv`) in `rps_outputs/`.


In [None]:
train_df.to_csv(out_dir / "train.csv", index=False)
val_df.to_csv(out_dir / "val.csv", index=False)
test_df.to_csv(out_dir / "test.csv", index=False)

print("Saved splits to:", out_dir.resolve())
print("Sizes -> train:", len(train_df), " val:", len(val_df), " test:", len(test_df))

## 6) Sanity Checks
Print per-class counts for all splits and verify there’s no overlap between them.
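
When an overlap check fails, it helps to know which files are shared and between which splits. The sketch below is one possible way to report that using plain Python sets; `find_overlaps` and the example file names are hypothetical and not part of the notebook.

```python
from itertools import combinations


def find_overlaps(splits):
    """Return {(name_a, name_b): shared_items} for every pair of splits that shares items."""
    overlaps = {}
    for (name_a, items_a), (name_b, items_b) in combinations(splits.items(), 2):
        shared = set(items_a) & set(items_b)
        if shared:
            overlaps[(name_a, name_b)] = sorted(shared)
    return overlaps


# Illustrative input: "b.png" is deliberately placed in both train and test.
example = {
    "train": ["a.png", "b.png"],
    "val": ["c.png"],
    "test": ["b.png", "d.png"],
}
print(find_overlaps(example))
```

With the notebook's data, the same helper could be called on the `filepath` columns of the three split dataframes; an empty dictionary means the splits are disjoint.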


In [None]:
def show_counts(d, name):
    c = d.groupby("label").size().sort_index()
    print(f"\n{name} per-class counts:\n{c}")

show_counts(train_df, "Train")
show_counts(val_df, "Val")
show_counts(test_df, "Test")

In [None]:
set_train = set(train_df.filepath)
set_val = set(val_df.filepath)
set_test = set(test_df.filepath)

assert set_train.isdisjoint(set_val) and set_train.isdisjoint(set_test) and set_val.isdisjoint(set_test), "Overlap detected"
print("\nNo overlap across splits")

## 7) Preprocessing Configuration JSON
Define and store preprocessing parameters (resize, normalize, augment) in `rps_outputs/preprocess.json`.  
This ensures **consistency** between preprocessing and model training.
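
The configuration only names the resize mode (`"pad"`); it does not define how a consumer should apply it. The sketch below shows one plausible interpretation, assuming that `"pad"` means scaling the image to fit inside the target while keeping its aspect ratio, then centring it on a padded canvas. The function `letterbox_geometry` is hypothetical, and the model notebooks may implement the mode differently.

```python
def letterbox_geometry(src_w, src_h, dst_w=128, dst_h=128):
    """Compute the resized size and per-side padding for an aspect-preserving fit.

    Assumption: "pad" mode scales the image to fit inside (dst_w, dst_h)
    and fills the remaining border with pad_color.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, max(1, round(src_w * scale)))
    new_h = min(dst_h, max(1, round(src_h * scale)))
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    pad_right = dst_w - new_w - pad_left
    pad_bottom = dst_h - new_h - pad_top
    return {
        "resized": (new_w, new_h),
        "padding": (pad_left, pad_top, pad_right, pad_bottom),  # left, top, right, bottom
    }


print(letterbox_geometry(300, 200))
```

Under this interpretation a 300×200 image is scaled to 128×85 and padded above and below, so the hand shape is not stretched.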


In [None]:
# --- Save preprocessing settings (resize, normalization, augmentation) for later model use ---

os.makedirs("rps_outputs", exist_ok=True)

PREPROC = {
    "seed": SEED,
    "img_size": 128,
    "resize": {
        "mode": "pad",
        "width": 128,
        "height": 128,
        "pad_color": [0, 0, 0]
    },
    "normalize": {
        "type": "rescale",
        "scale": 1/255.0
    },
    "augment": {
        "flip_horizontal": True,
        "rotation": 0.08,
        "zoom": 0.10,
        "contrast": 0.10
    }
}

with open("rps_outputs/preprocess.json", "w") as f:
    json.dump(PREPROC, f, indent=2)

print("Wrote rps_outputs/preprocess.json")
PREPROC

---

## Outputs
- `rps_outputs/manifest.csv` — Full dataset index  
- `rps_outputs/train.csv`, `val.csv`, `test.csv` — Clean, stratified splits  
- `rps_outputs/preprocess.json` — Shared configuration file  

**Next step:** Run the model notebooks (`Model_A.ipynb` → `Model_D.ipynb`) using these CSVs.
