# EdgeWash Kaggle Notebook 

This notebook downloads the Kaggle hand-wash dataset subset, preprocesses frames, optionally computes optical flow, and trains the EdgeWash CNN classifiers. Each step is heavily commented so you can adapt hyperparameters or swap architectures quickly in a Kaggle environment.

## 1. Environment setup

We install Python dependencies listed in `requirements.txt` and make sure `ffmpeg` is available for video processing. Kaggle images already ship with CUDA-enabled TensorFlow, so the install is fast.

*Tip:* Re-run this cell if you change dependencies.
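
Because `which ffmpeg || true` never fails the cell, a missing binary is easy to overlook. As a minimal sketch, the same check can be made explicit from Python with the standard library; the helper name `tool_available` is our own and not part of the repository.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


# Warn early instead of failing later during video decoding.
if not tool_available("ffmpeg"):
    print("ffmpeg not found on PATH; video processing may fail later.")
```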

In [None]:
%%bash
pip install -q -r requirements.txt
which ffmpeg || true

## 2. Define paths and hyperparameters

We collect every important configurable value in one place. Setting environment variables keeps the training code aligned with the repository scripts (e.g., `classify_dataset.py`).

* `USE_OPTICAL_FLOW`: toggle the two-stream model (RGB + optical flow).
* `HANDWASH_NUM_EPOCHS`, `HANDWASH_NUM_LAYERS`, etc.: exported as the `HANDWASH_*` environment variables read by the training helpers.
* `DATA_ROOT`: where we download and preprocess the dataset (defaults to `/kaggle/working/edgewash_data`).

You can edit values in the dictionary before running the cell.
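
Only keys that start with `HANDWASH_` are exported; `BATCH_SIZE`, `IMG_HEIGHT`, and the other keys stay notebook-local. Environment values are always strings, so a consumer has to cast them back. The sketch below shows one way a downstream script might do that. The helper `read_setting` and the fallback value `20` are hypothetical and only illustrate the pattern; the repository scripts may read their settings differently.

```python
import os


def read_setting(name, default, cast=int):
    """Read an environment variable, falling back to `default` when it is unset."""
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)


# Mimic the notebook's export loop for a single key.
os.environ["HANDWASH_NUM_EPOCHS"] = str(10)

num_epochs = read_setting("HANDWASH_NUM_EPOCHS", default=20)  # 20 is a made-up fallback
print("epochs:", num_epochs)
```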

In [None]:
import os, json, pathlib

CONFIG = {
    "DATA_ROOT": "/kaggle/working/edgewash_data",
    "USE_OPTICAL_FLOW": False,  # set True to train the merged two-stream network
    "HANDWASH_NN": "MobileNetV2",  # options: MobileNetV2, InceptionV3, Xception
    "HANDWASH_NUM_LAYERS": 0,
    "HANDWASH_NUM_EPOCHS": 10,
    "HANDWASH_NUM_FRAMES": 5,  # only used for the time-distributed GRU model
    "HANDWASH_EXTRA_LAYERS": 0,
    "BATCH_SIZE": 32,
    "IMG_HEIGHT": 240,
    "IMG_WIDTH": 320,
}

# Export environment variables so downstream scripts pick them up
for key, value in CONFIG.items():
    if key.startswith("HANDWASH_"):
        os.environ[key] = str(value)

root = pathlib.Path(CONFIG["DATA_ROOT"])
root.mkdir(parents=True, exist_ok=True)
print(json.dumps(CONFIG, indent=2))
print("Environment variables applied.")


## 3. Download the Kaggle hand-wash subset

The repository ships a helper script (`dataset-kaggle/get-and-preprocess-dataset.sh`) that fetches a reorganized 7-class subset of the public Kaggle hand-wash dataset. The cell below mirrors that logic with inline Python so the notebook stays self-contained.

Artifacts created:
* `kaggle-dataset-6classes.tar` — downloaded archive
* `kaggle-dataset-6classes/` — raw videos sorted into 7 class folders

Run the cell once; it skips work if files already exist.
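
One caveat with an `exists()` check is that an interrupted download leaves a partial archive that later runs treat as complete. A common guard, sketched here under our own naming (`download_once` and the `.part` suffix are assumptions, not repository code), is to download to a temporary name and rename only on success.

```python
import os
import pathlib
import urllib.request


def download_once(url, dest, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest` unless `dest` already exists.

    Data is written to a sibling ".part" file first and renamed only after
    the fetch returns, so an interrupted run never leaves a file at `dest`.
    Returns True if a download happened, False if it was skipped.
    """
    dest = pathlib.Path(dest)
    if dest.exists():
        return False
    partial = dest.with_name(dest.name + ".part")
    fetch(url, partial)
    os.replace(partial, dest)  # rename within the same directory
    return True
```

The `fetch` parameter exists only so the function can be exercised without network access; in the notebook it would default to `urllib.request.urlretrieve`.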

In [None]:
import pathlib, tarfile, urllib.request

data_root = pathlib.Path(CONFIG["DATA_ROOT"]).resolve()
raw_tar = data_root / "kaggle-dataset-6classes.tar"
raw_dir = data_root / "kaggle-dataset-6classes"

url = "https://github.com/atiselsts/data/raw/master/kaggle-dataset-6classes.tar"
if not raw_tar.exists():
    print("Downloading dataset archive...")
    urllib.request.urlretrieve(url, raw_tar)
else:
    print("Archive already present:", raw_tar)

if not raw_dir.exists():
    print("Extracting archive...")
    with tarfile.open(raw_tar, "r") as tar:
        tar.extractall(data_root)
else:
    print("Extracted directory already exists:", raw_dir)

print("Contents:", list(raw_dir.iterdir())[:3])


## 4. Frame extraction and train/validation/test split

The repository's `dataset-kaggle/separate-frames.py` script assigns each video to the `trainval` or `test` partition with a seeded random draw (about 70/30 per class) and extracts every video frame. We reuse the same logic here, saving both full videos and per-frame JPEGs.

Outputs (under `kaggle-dataset-6classes-preprocessed/`):
* `videos/trainval` and `videos/test` — original clips split by partition
* `frames/trainval` and `frames/test` — every decoded frame with a class label directory

If you already preprocessed once, the cell will skip the heavy work.
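
The split is decided per video, not per frame, so frames from one clip never end up in both partitions. Because each clip gets an independent random draw, the 70/30 ratio is only approximate. The sketch below isolates that assignment step. The function `assign_subsets` and the clip names are illustrative; sorting the filenames is our addition to make the result independent of directory listing order.

```python
import random


def assign_subsets(filenames, test_fraction=0.3, seed=123):
    """Map each filename to "test" or "trainval" with an independent random draw."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return {
        name: "test" if rng.random() < test_fraction else "trainval"
        for name in sorted(filenames)  # fixed order gives a reproducible assignment
    }


clips = [f"clip_{i:03d}.mp4" for i in range(200)]  # made-up filenames
split = assign_subsets(clips)
n_test = sum(1 for subset in split.values() if subset == "test")
print(f"{n_test} of {len(clips)} clips landed in test")
```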

In [None]:
import os, random, cv2, shutil, pathlib

random.seed(123)
input_dir = raw_dir
out_root = data_root / "kaggle-dataset-6classes-preprocessed"
videos_dir = out_root / "videos"
frames_dir = out_root / "frames"

if out_root.exists():
    print("Preprocessed data already exists at", out_root)
else:
    print("Creating frame and video splits...")
    for subset in ["trainval", "test"]:
        for base in [videos_dir, frames_dir]:
            for cls in range(7):
                (base / subset / str(cls)).mkdir(parents=True, exist_ok=True)

    for class_dir in sorted(os.listdir(input_dir)):
        src_cls_path = input_dir / class_dir
        if not src_cls_path.is_dir():
            continue
        for filename in os.listdir(src_cls_path):
            if not filename.endswith(".mp4"):
                continue
            subset = "test" if random.random() < 0.3 else "trainval"
            src = src_cls_path / filename
            video_target = videos_dir / subset / class_dir / filename
            shutil.copy2(src, video_target)

            cap = cv2.VideoCapture(str(src))
            success, frame = cap.read()
            frame_num = 0
            while success:
                frame_name = f"frame_{frame_num}_{os.path.splitext(filename)[0]}.jpg"
                frame_path = frames_dir / subset / class_dir / frame_name
                cv2.imwrite(str(frame_path), frame)
                success, frame = cap.read()
                frame_num += 1
            cap.release()
    print("Finished preprocessing!")

print("Trainval frame examples:", len(list((frames_dir/"trainval").glob("*/*.jpg"))))
print("Test frame examples:", len(list((frames_dir/"test").glob("*/*.jpg"))))


## 5. Build TensorFlow datasets and normalize inputs

We now create TensorFlow `tf.data.Dataset` objects for training, validation, and testing. The helper `dataset_utilities.get_datasets` mirrors the repository code: it performs an 80/20 split of the `trainval` frames, sets labels, and computes class weights to handle imbalance.
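
To make the class-weight idea concrete, here is a minimal sketch of inverse-frequency weighting, one common scheme for imbalanced classes. We have not confirmed that `get_datasets` uses this exact formula; the function name and the frame counts are made up for illustration.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count) for each class."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


frame_counts = {0: 100, 1: 50}  # made-up counts: class 1 is under-represented
weights_example = balanced_class_weights(frame_counts)
print(weights_example)  # {0: 0.75, 1: 1.5}
```

Rarer classes receive larger weights, so their errors contribute more to the loss.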

After loading, we map a normalization step that matches the MobileNet preprocessing used in the models. We also save a combined normalized dataset to disk (`tf.data.experimental.save`) so later runs can reload without repeating preprocessing.
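
For reference, the Keras `preprocess_input` functions for MobileNetV2, InceptionV3, and Xception rescale 8-bit pixel values from [0, 255] to [-1, 1]. Assuming `get_preprocessing_function` returns that kind of scaling (the repository helper may differ), the per-pixel arithmetic is:

```python
def scale_to_unit_range(pixel_value):
    """Map an 8-bit intensity in [0, 255] to the range [-1, 1]."""
    return pixel_value / 127.5 - 1.0


print(scale_to_unit_range(0), scale_to_unit_range(127.5), scale_to_unit_range(255))
```

Since the datasets saved to disk are already normalized, they should not be passed through the preprocessing function a second time after reloading.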

In [None]:
import tensorflow as tf
from dataset_utilities import get_datasets
from classify_dataset import get_preprocessing_function

frames_trainval = frames_dir / "trainval"
frames_test = frames_dir / "test"

train_ds, val_ds, test_ds, weights = get_datasets(str(frames_trainval), str(frames_test), batch_size=CONFIG["BATCH_SIZE"])

preprocess_fn = get_preprocessing_function()

def normalize_batch(images, labels):
    return preprocess_fn(images), labels

train_ds_norm = train_ds.map(normalize_batch)
val_ds_norm = val_ds.map(normalize_batch)
test_ds_norm = test_ds.map(normalize_batch)

norm_root = data_root / "normalized_dataset"
if not norm_root.exists():
    norm_root.mkdir(parents=True)
    # Pass plain strings; not every TensorFlow version accepts pathlib.Path here
    tf.data.experimental.save(train_ds_norm, str(norm_root/"train"))
    tf.data.experimental.save(val_ds_norm, str(norm_root/"val"))
    tf.data.experimental.save(test_ds_norm, str(norm_root/"test"))
    with open(norm_root/"README.txt", "w") as f:
        f.write("Normalized datasets saved with MobileNet preprocessing and categorical labels.\n")
    print("Normalized dataset saved to", norm_root)
else:
    print("Normalized dataset already saved; skipping save step.")

print("Class weights:", weights)


## 6. (Optional) Compute optical flow for the merged two-stream network

Set `CONFIG['USE_OPTICAL_FLOW'] = True` if you want to train the RGB + optical flow model. This step computes Farnebäck dense optical flow between frames spaced by ~1/3 second (matching `calculate-optical-flow.py`).

Outputs live in `kaggle-dataset-6classes-preprocessed/of/trainval` and `/test` alongside the RGB frames.
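
The ~1/3 second spacing translates to a frame offset that depends on the clip's frame rate: 10 frames at 30 fps. As a small sketch (the helper `frame_step_for` is our own name), the offset could be derived from the frame rate instead of being hard-coded; with OpenCV the rate is available through `cap.get(cv2.CAP_PROP_FPS)`.

```python
def frame_step_for(fps, gap_seconds=1 / 3):
    """Number of frames separating the two images of each optical-flow pair."""
    return max(1, round(fps * gap_seconds))  # never return 0, even for very low frame rates


for fps in (25, 30, 60):
    print(fps, "fps ->", frame_step_for(fps), "frames")
```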

In [None]:
import numpy as np

if CONFIG["USE_OPTICAL_FLOW"]:
    print("Optical flow enabled; proceed with extraction below.")
else:
    print("Optical flow disabled; skip to training.")


### Optical flow helper (embedded)

The repository script references a global `input_dir`; the helper below keeps everything scoped inside the notebook and mirrors the same Farnebäck settings.
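
The helper stores each flow field as a color image: direction becomes hue, magnitude becomes brightness, and saturation is fixed at its maximum. The sketch below reproduces that encoding for a single flow vector in plain Python. It is a simplification: `flow_vector_to_hsv` is our own name, and it scales magnitude by a given maximum, whereas the notebook's `cv.normalize(..., cv.NORM_MINMAX)` call also subtracts the per-image minimum.

```python
import math


def flow_vector_to_hsv(dx, dy, max_magnitude):
    """Encode one flow vector as an OpenCV-style 8-bit HSV triple."""
    angle_deg = math.degrees(math.atan2(dy, dx)) % 360.0  # direction in [0, 360)
    magnitude = math.hypot(dx, dy)
    hue = angle_deg / 2.0  # OpenCV stores 8-bit hue as degrees / 2
    saturation = 255.0     # always fully saturated
    value = 255.0 * magnitude / max_magnitude if max_magnitude > 0 else 0.0
    return hue, saturation, value


print(flow_vector_to_hsv(1.0, 0.0, 1.0))   # motion along +x
print(flow_vector_to_hsv(-1.0, 0.0, 2.0))  # motion along -x at half the maximum speed
```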

In [None]:
import cv2 as cv
from tqdm import tqdm

def compute_optical_flow(dataset_root, frame_step=10):
    import pathlib, numpy as np
    videos_path = pathlib.Path(dataset_root)/"videos"
    output_root = pathlib.Path(dataset_root)/"of"
    output_root.mkdir(exist_ok=True)
    for subset in ["trainval", "test"]:
        subset_in = videos_path/subset
        subset_out = output_root/subset
        for cls_dir in subset_in.iterdir():
            if not cls_dir.is_dir():
                continue
            (subset_out/cls_dir.name).mkdir(parents=True, exist_ok=True)
            for video in tqdm(list(cls_dir.glob("*.mp4")), desc=f"{subset}-{cls_dir.name}"):
                cap = cv.VideoCapture(str(video))
                frames = []
                ret, frame = cap.read()
                while ret:
                    frames.append(cv.cvtColor(frame, cv.COLOR_BGR2GRAY))
                    ret, frame = cap.read()
                cap.release()
                for idx in range(len(frames)-frame_step):
                    flow = cv.calcOpticalFlowFarneback(frames[idx], frames[idx+frame_step], None, 0.5, 3, 15, 3, 5, 1.2, 0)
                    mag, ang = cv.cartToPolar(flow[...,0], flow[...,1])
                    # 8-bit HSV image: for uint8 input OpenCV expects H in [0, 180) and S, V in [0, 255]
                    mask = np.zeros((frames[idx].shape[0], frames[idx].shape[1], 3), dtype=np.uint8)
                    mask[...,0] = ang*180/np.pi/2
                    mask[...,1] = 255
                    mask[...,2] = cv.normalize(mag, None, 0, 255, cv.NORM_MINMAX)
                    rgb = cv.cvtColor(mask, cv.COLOR_HSV2BGR)
                    out_name = subset_out/cls_dir.name/f"frame_{idx}_{video.stem}.jpg"
                    cv.imwrite(str(out_name), rgb)
    print("Optical flow extraction complete at", output_root)

if CONFIG["USE_OPTICAL_FLOW"]:
    compute_optical_flow(out_root)


## 7. Train the classifier

We reuse `classify_dataset.evaluate` to stay faithful to the repository logic. Choose one of the three architectures by importing the right training script: 

* Single-frame baseline (used below)
* Time-distributed GRU (`kaggle-classify-videos.py`)
* Two-stream merged network (`kaggle-classify-merged-network.py`, requires optical flow)

Training artifacts saved:
* `kaggle-single-framefinal-model/` — SavedModel directory
* `results-<name>.txt` — metrics and F1 scores
* `accuracy-<name>.pdf` — accuracy plot
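
Note that the SavedModel directory name has no separator between the model name and `final-model`. The sketch below collects the artifact names listed above in one place so that later cells can check for them. The helper `expected_artifacts` is hypothetical, and the patterns are taken from this notebook, not verified against `classify_dataset.py`.

```python
import pathlib


def expected_artifacts(model_name, base_dir="."):
    """Return the paths this notebook expects training to produce."""
    base = pathlib.Path(base_dir)
    return {
        "saved_model": base / f"{model_name}final-model",  # no separator before "final-model"
        "results": base / f"results-{model_name}.txt",
        "accuracy_plot": base / f"accuracy-{model_name}.pdf",
    }


for kind, path in expected_artifacts("kaggle-single-frame").items():
    print(kind, "->", path, "(exists)" if path.exists() else "(missing)")
```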

In [None]:
from classify_dataset import evaluate

model_name = "kaggle-single-frame" if not CONFIG["USE_OPTICAL_FLOW"] else "kaggle-merged"

evaluate(model_name, train_ds_norm, val_ds_norm, test_ds_norm, weights_dict=weights)


## 8. Evaluate and visualize

After training, we can inspect the saved accuracy plot and load the metrics log. We also print a few sample predictions to verify the label ordering.
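
Comparing predictions with labels comes down to taking the index of the largest entry in each row, both for the predicted probabilities and for the one-hot labels. A dependency-free sketch of that comparison, with made-up numbers and our own helper names, looks like this:

```python
def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose predicted class matches the labeled class."""
    predicted = [argmax(row) for row in probabilities]
    actual = [argmax(row) for row in one_hot_labels]
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]  # made-up model outputs
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]                   # made-up one-hot labels
print("top-1 accuracy:", top1_accuracy(probs, labels))
```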

In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import pathlib

results_file = pathlib.Path(f"results-{model_name}.txt")
print(results_file.read_text())

acc_plot = pathlib.Path(f"accuracy-{model_name}.pdf")
print("Accuracy plot saved to", acc_plot)

sample_images, sample_labels = next(iter(test_ds_norm.take(1)))
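from classify_dataset import MobileNetPreprocessingLayer  # custom layer class needed to deserialize the model
loaded_model = tf.keras.models.load_model(f"{model_name}final-model", custom_objects={"MobileNetPreprocessingLayer": MobileNetPreprocessingLayer})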
preds = loaded_model.predict(sample_images)
print("Predicted class indices (first 10):", np.argmax(preds, axis=1)[:10])
print("True class indices (first 10):", np.argmax(sample_labels.numpy(), axis=1)[:10])


## 9. Export for inference (SavedModel + TFLite)

Models trained in Kaggle notebooks are often deployed to mobile or other lightweight environments. The following cell converts the SavedModel to a TensorFlow Lite file that remains compatible with the MobileNet preprocessing layer defined in `classify_dataset.py`.

In [None]:
import tensorflow as tf
from classify_dataset import MobileNetPreprocessingLayer

saved_dir = f"{model_name}final-model"
model = tf.keras.models.load_model(saved_dir, custom_objects={"MobileNetPreprocessingLayer": MobileNetPreprocessingLayer})
converter = tf.lite.TFLiteConverter.from_saved_model(saved_dir)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
with open(f"{model_name}.tflite", "wb") as f:
    f.write(tflite_model)
print("Saved TFLite model:", f"{model_name}.tflite")


## 10. TensorBoard (optional)

You can monitor training live by setting `HANDWASH_TENSORBOARD_LOGDIR` before training and launching TensorBoard inside the notebook. Uncomment the block below to enable.

In [None]:
# import os
# logdir = str(pathlib.Path(CONFIG['DATA_ROOT']) / 'logs')
# os.environ['HANDWASH_TENSORBOARD_LOGDIR'] = logdir
# %load_ext tensorboard
# IPython expands Python variables in magics, not shell environment variables
# %tensorboard --logdir {logdir}
