# Comprehensive Handwashing Detection Training Pipeline

**End-to-end training pipeline built from modular Python components**

This notebook demonstrates:
1. Dataset download (Kaggle WHO6)
2. Data preprocessing (frame extraction)
3. Model training (MobileNetV2)
4. Evaluation and visualization
5. Model comparison

**Runtime**: GPU (recommended for training)

**Expected Duration**: 2-3 hours for complete pipeline

**Author**: Generated with AdaL (https://github.com/sylphai/adal-cli)

**Date**: 2025-12-31

## 1. Setup & Dependencies

In [None]:
# Check if running on Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running on Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally")

In [None]:
# Mount Google Drive (Colab only)
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Set working directory
    import os
    WORK_DIR = '/content/drive/MyDrive/handwash_training'
    os.makedirs(WORK_DIR, exist_ok=True)
    %cd {WORK_DIR}
else:
    WORK_DIR = '.'

# Report the working directory in both environments
print(f"Working directory: {WORK_DIR}")

In [None]:
# Install dependencies
!pip install -q tensorflow==2.15.0
!pip install -q scikit-learn pandas numpy opencv-python-headless
!pip install -q matplotlib seaborn tqdm requests

print("Dependencies installed!")

In [None]:
# Verify GPU availability
import tensorflow as tf

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")
print(f"GPU devices: {tf.config.list_physical_devices('GPU')}")

In [None]:
# Import standard libraries
import sys
import json
import logging
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds
np.random.seed(42)
tf.random.set_seed(42)

print("Libraries imported successfully!")

## 2. Clone Training Modules

Clone the repository that contains the Python training modules (`config`, `train`, `evaluate`, and the others imported below).

In [None]:
# Clone repository (if not already cloned)
REPO_URL = "https://github.com/AliNikkhah2001/edgeWash.git"
REPO_DIR = Path("edgeWash")

if not REPO_DIR.exists():
    print(f"Cloning repository from {REPO_URL}...")
    !git clone {REPO_URL}
else:
    print(f"Repository already exists: {REPO_DIR}")
    print("Pulling latest changes...")
    !cd {REPO_DIR} && git pull

# Add training modules to Python path
training_dir = REPO_DIR / "training"
if str(training_dir) not in sys.path:
    sys.path.insert(0, str(training_dir))

print(f"Training modules path: {training_dir}")

In [None]:
# Import training modules
import config
import download_datasets
import preprocess_data
import data_generators
import models
import train as train_module
import evaluate

print("Training modules imported successfully!")

## 3. Configuration

View and customize training hyperparameters.

In [None]:
# Display configuration
print("=" * 80)
print("TRAINING CONFIGURATION")
print("=" * 80)

print(f"\nImage size: {config.IMG_SIZE}")
print(f"Sequence length: {config.SEQUENCE_LENGTH}")
print(f"Number of classes: {config.NUM_CLASSES}")
print(f"Class names: {config.CLASS_NAMES}")

print(f"\nBatch size: {config.BATCH_SIZE}")
print(f"Epochs: {config.EPOCHS}")
print(f"Learning rate: {config.LEARNING_RATE}")
print(f"Early stopping patience: {config.PATIENCE}")

print(f"\nData split:")
print(f"  Train: {config.TRAIN_RATIO*100:.0f}%")
print(f"  Val:   {config.VAL_RATIO*100:.0f}%")
print(f"  Test:  {config.TEST_RATIO*100:.0f}%")

print(f"\nAugmentation:")
for key, value in config.AUGMENTATION_CONFIG.items():
    print(f"  {key}: {value}")

print(f"\nModel architectures available:")
for model_name, model_config in config.MODEL_CONFIGS.items():
    print(f"  - {model_name}: {model_config['name']}")
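
The three split ratios printed above are expected to add up to 100%. A minimal sketch of a sanity check that could be run before preprocessing; the helper name `check_split_ratios` and the literal ratios in the example call are hypothetical, and in the notebook the arguments would be `config.TRAIN_RATIO`, `config.VAL_RATIO`, and `config.TEST_RATIO`:

```python
import math


def check_split_ratios(train, val, test, tol=1e-9):
    """Raise if any ratio is outside [0, 1] or the three do not sum to 1."""
    ratios = {"train": train, "val": val, "test": test}
    for name, value in ratios.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} ratio {value} is outside [0, 1]")
    total = train + val + test
    # Compare with a tolerance because the ratios are floats.
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"split ratios sum to {total}, expected 1.0")
    return True


# Illustrative values only; substitute the values from config.
check_split_ratios(0.7, 0.15, 0.15)
```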

## 4. Dataset Download

Download the Kaggle WHO6 dataset (~1 GB), which is enough for a quick start.

For the full pipeline, also download PSKUS (18 GB) and METC (2 GB); see the commented-out cell below.

In [None]:
# Download Kaggle WHO6 dataset
print("Downloading Kaggle WHO6 dataset...")
success = download_datasets.download_kaggle_dataset()

if success:
    print("\n✓ Kaggle dataset ready!")
else:
    print("\n✗ Kaggle dataset download failed!")

In [None]:
# Optional: Download PSKUS and METC datasets (large, requires zenodo-get)
# Uncomment to download:

# # Install zenodo-get
# !pip install zenodo-get

# # Download PSKUS (18 GB, ~30-60 minutes)
# print("Downloading PSKUS Hospital dataset (18 GB)...")
# download_datasets.download_pskus_dataset()

# # Download METC (2 GB, ~5-10 minutes)
# print("Downloading METC Lab dataset (2 GB)...")
# download_datasets.download_metc_dataset()

In [None]:
# Verify datasets
status = download_datasets.verify_datasets()

print("\n" + "=" * 80)
print("DATASET VERIFICATION")
print("=" * 80)

for dataset_name, info in status.items():
    status_icon = "✓" if info['exists'] else "✗"
    print(f"{status_icon} {info['name']}: {info['num_files']} files")
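
The loop above reads `exists`, `name`, and `num_files` from each entry. The internals of `download_datasets.verify_datasets()` are not shown in this notebook; as a rough sketch of how a status dictionary with that shape could be built, assuming each dataset lives in its own directory (the helper name `summarize_dataset_dirs` and the directory layout are hypothetical):

```python
from pathlib import Path


def summarize_dataset_dirs(dataset_dirs):
    """Build a status dict shaped like the one the verification loop reads.

    `dataset_dirs` maps a short key to a (display name, directory) pair.
    Illustrative stand-in only, not the repository's implementation.
    """
    status = {}
    for key, (name, directory) in dataset_dirs.items():
        directory = Path(directory)
        exists = directory.is_dir()
        # Count regular files recursively; a missing directory counts as zero.
        num_files = (
            sum(1 for path in directory.rglob("*") if path.is_file())
            if exists
            else 0
        )
        status[key] = {"name": name, "exists": exists, "num_files": num_files}
    return status
```

Counting files is only a coarse check: it confirms that something was downloaded, not that every video is complete or readable.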

## 5. Data Preprocessing

Extract frames from videos and create train/val/test splits.

In [None]:
# Preprocess Kaggle dataset
print("Preprocessing Kaggle dataset...")
print("This may take 5-10 minutes...\n")

result = preprocess_data.preprocess_all_datasets(
    use_kaggle=True,
    use_pskus=False,  # Set True if PSKUS downloaded
    use_metc=False    # Set True if METC downloaded
)

if result:
    print("\n✓ Preprocessing complete!")
    print(f"\nProcessed files:")
    for key, path in result.items():
        print(f"  {key}: {path}")
else:
    print("\n✗ Preprocessing failed!")
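
How `preprocess_data` assigns frames to splits is not shown in this notebook. Consecutive frames from one video are nearly identical, so a common precaution is to split by video ID rather than by frame, which keeps any single video out of more than one split. A minimal sketch under that assumption (the function name and default ratios are hypothetical):

```python
import random


def split_video_ids(video_ids, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Assign whole videos to train/val/test so no video spans two splits."""
    unique_ids = sorted(set(video_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)            # local RNG; global random state is untouched
    rng.shuffle(unique_ids)

    n_train = int(len(unique_ids) * train_ratio)
    n_val = int(len(unique_ids) * val_ratio)
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder goes to test
    }
```

Each frame would then inherit the split of its `video_id`, for example with `df[df['video_id'].isin(splits['train'])]`.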

## 6. Data Exploration

Visualize dataset statistics and sample frames.

In [None]:
# Load preprocessed data
train_df = pd.read_csv(config.PROCESSED_DIR / 'train.csv')
val_df = pd.read_csv(config.PROCESSED_DIR / 'val.csv')
test_df = pd.read_csv(config.PROCESSED_DIR / 'test.csv')

print("Dataset sizes:")
print(f"  Train: {len(train_df)} frames ({train_df['video_id'].nunique()} videos)")
print(f"  Val:   {len(val_df)} frames ({val_df['video_id'].nunique()} videos)")
print(f"  Test:  {len(test_df)} frames ({test_df['video_id'].nunique()} videos)")

In [None]:
# Class distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (df, split_name) in enumerate([(train_df, 'Train'), (val_df, 'Val'), (test_df, 'Test')]):
    class_counts = df['class_name'].value_counts()
    
    axes[idx].bar(range(len(class_counts)), class_counts.values)
    axes[idx].set_title(f'{split_name} Set - Class Distribution', fontsize=12)
    axes[idx].set_xlabel('Class', fontsize=10)
    axes[idx].set_ylabel('Number of Frames', fontsize=10)
    axes[idx].set_xticks(range(len(class_counts)))
    axes[idx].set_xticklabels([cn.split('_')[-1] for cn in class_counts.index], rotation=45, ha='right')
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('class_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Visualize sample frames
import cv2

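# Size the grid from the number of classes instead of hard-coding 2x4
n_cols = 4
n_rows = (config.NUM_CLASSES + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
axes = np.atleast_1d(axes).flatten()

combined_df = pd.concat([train_df, val_df, test_df])

for class_id in range(config.NUM_CLASSES):
    axes[class_id].set_title(config.CLASS_NAMES[class_id], fontsize=10)
    axes[class_id].axis('off')

    # Get sample frame for this class (skip classes with no frames)
    class_rows = combined_df[combined_df['class_id'] == class_id]
    if class_rows.empty:
        continue
    frame_path = class_rows.sample(1).iloc[0]['frame_path']

    # Load and display frame; cv2.imread returns None for unreadable files
    img = cv2.imread(str(frame_path))
    if img is None:
        print(f"Could not read frame: {frame_path}")
        continue
    axes[class_id].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

# Remove unused subplots
for ax in axes[config.NUM_CLASSES:]:
    fig.delaxes(ax)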

plt.tight_layout()
plt.savefig('sample_frames.png', dpi=150, bbox_inches='tight')
plt.show()

## 7. Model Training

Train the MobileNetV2 model, a frame-based classifier.

In [None]:
# Training configuration
MODEL_TYPE = 'mobilenetv2'
EPOCHS = 20  # Reduce for quick testing (use 50 for full training)
BATCH_SIZE = 32

print(f"Training {MODEL_TYPE} for {EPOCHS} epochs...")

In [None]:
# Train model
result = train_module.train_model(
    model_type=MODEL_TYPE,
    train_csv=config.PROCESSED_DIR / 'train.csv',
    val_csv=config.PROCESSED_DIR / 'val.csv',
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    learning_rate=config.LEARNING_RATE
)

print("\n✓ Training complete!")
print(f"Final model saved: {result['final_model_path']}")

## 8. Training Visualization

Plot training curves (loss, accuracy).

In [None]:
# Plot training history
history = result['history']

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Loss
axes[0].plot(history['loss'], label='Train Loss')
axes[0].plot(history['val_loss'], label='Val Loss')
axes[0].set_title('Model Loss', fontsize=14)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].legend()
axes[0].grid(alpha=0.3)

# Accuracy
axes[1].plot(history['accuracy'], label='Train Accuracy')
axes[1].plot(history['val_accuracy'], label='Val Accuracy')
axes[1].set_title('Model Accuracy', fontsize=14)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('training_curves.png', dpi=150, bbox_inches='tight')
plt.show()

# Best epoch
best_epoch = result['best_epoch']
print(f"\nBest epoch: {best_epoch + 1}")
print(f"Best val_accuracy: {history['val_accuracy'][best_epoch]:.4f}")
print(f"Best val_loss: {history['val_loss'][best_epoch]:.4f}")
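
`result['best_epoch']` is supplied by `train_module`; how it is chosen is not shown in this notebook. If only the history were available, a zero-based best epoch could be recovered from it directly. A minimal sketch, assuming "best" means the lowest validation loss (the helper name and the example numbers are made up for illustration):

```python
def best_epoch_from_history(history, metric="val_loss", mode="min"):
    """Return the zero-based index of the best epoch for `metric`."""
    values = history[metric]
    if not values:
        raise ValueError(f"history[{metric!r}] is empty")
    pick = min if mode == "min" else max
    # Ties resolve to the earliest epoch because min/max keep the first match.
    return pick(range(len(values)), key=values.__getitem__)


# Made-up history, not real training output.
example_history = {
    "val_loss": [0.9, 0.6, 0.7],
    "val_accuracy": [0.60, 0.80, 0.78],
}
print(best_epoch_from_history(example_history) + 1)  # one-based, as printed above
```

Passing `metric="val_accuracy", mode="max"` selects by accuracy instead; the two criteria do not always agree on the same epoch.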

## 9. Model Evaluation

Evaluate trained model on test set.

In [None]:
# Evaluate model
eval_results = evaluate.evaluate_model(
    model_path=result['final_model_path'],
    test_csv=config.PROCESSED_DIR / 'test.csv',
    model_type=MODEL_TYPE,
    batch_size=BATCH_SIZE,
    save_results=True
)

print("\n✓ Evaluation complete!")

In [None]:
# Display metrics
print("\n" + "=" * 80)
print("EVALUATION METRICS")
print("=" * 80)

print(f"\nOverall Metrics:")
print(f"  Accuracy:       {eval_results['accuracy']:.4f}")
print(f"  Top-2 Accuracy: {eval_results['top2_accuracy']:.4f}")
print(f"  Precision:      {eval_results['precision']:.4f}")
print(f"  Recall:         {eval_results['recall']:.4f}")
print(f"  F1-Score:       {eval_results['f1_score']:.4f}")

print(f"\nPer-Class Metrics:")
for class_name in config.CLASS_NAMES:
    metrics = eval_results['per_class_metrics'][class_name]
    print(f"  {class_name}:")
    print(f"    Precision: {metrics['precision']:.4f}")
    print(f"    Recall:    {metrics['recall']:.4f}")
    print(f"    F1-Score:  {metrics['f1-score']:.4f}")
    print(f"    Support:   {int(metrics['support'])}")
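
Top-2 accuracy counts a prediction as correct when the true class is among the two highest-scoring classes, so it can never be lower than plain accuracy. The sketch below illustrates that definition on made-up probabilities; it is not the code used by `evaluate.evaluate_model`, and the function name is hypothetical:

```python
def top_k_accuracy(y_true, y_prob, k=2):
    """Fraction of samples whose true label is among the k highest scores."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    hits = 0
    for label, probs in zip(y_true, y_prob):
        # Indices of the k largest probabilities for this sample.
        top_k = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        hits += label in top_k
    return hits / len(y_true)


# Made-up scores for three samples over three classes.
y_true = [0, 2, 1]
y_prob = [
    [0.6, 0.3, 0.1],  # true class ranked 1st
    [0.5, 0.1, 0.4],  # true class ranked 2nd
    [0.7, 0.1, 0.2],  # true class ranked 3rd
]
print(f"top-1: {top_k_accuracy(y_true, y_prob, k=1):.4f}")
print(f"top-2: {top_k_accuracy(y_true, y_prob, k=2):.4f}")
```

A large gap between the two numbers usually points to a pair of classes that the model confuses with each other.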

In [None]:
# Display confusion matrix
from IPython.display import Image, display

cm_path = config.RESULTS_DIR / MODEL_TYPE / 'confusion_matrix.png'
if cm_path.exists():
    display(Image(filename=str(cm_path)))
else:
    # Plot confusion matrix inline
    evaluate.plot_confusion_matrix(
        eval_results['confusion_matrix'],
        config.CLASS_NAMES,
        save_path=None,
        normalize=True
    )
    plt.show()
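
A normalized confusion matrix is conventionally divided row-wise, so each row (true class) sums to 1 and the diagonal reads as per-class recall. Whether `evaluate.plot_confusion_matrix` normalizes by row is an assumption here; the sketch shows that convention on made-up counts, with a hypothetical function name:

```python
def normalize_rows(matrix):
    """Divide each row by its sum; rows with no samples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Made-up counts: rows are true classes, columns are predicted classes.
counts = [
    [8, 2, 0],  # true class 0: 8 correct, 2 predicted as class 1
    [1, 9, 0],
    [0, 0, 0],  # a class absent from the test set
]
for row in normalize_rows(counts):
    print([round(value, 2) for value in row])
```

Guarding against empty rows matters when a class has no test samples; dividing by zero there would otherwise fail or produce NaN values in the plot.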

## 10. TensorBoard

Launch TensorBoard to view training logs.

In [None]:
# Load TensorBoard extension (Jupyter/Colab)
%load_ext tensorboard

In [None]:
# Launch TensorBoard
%tensorboard --logdir {config.LOGS_DIR}

## 11. Optional: Train Additional Models

Train LSTM or GRU models for temporal modeling (requires sequence data).

In [None]:
# Uncomment to train LSTM model

# lstm_result = train_module.train_model(
#     model_type='lstm',
#     train_csv=config.PROCESSED_DIR / 'train.csv',
#     val_csv=config.PROCESSED_DIR / 'val.csv',
#     batch_size=16,  # Reduce batch size for sequence models
#     epochs=20,
#     learning_rate=config.LEARNING_RATE
# )

# print("\n✓ LSTM training complete!")

In [None]:
# Uncomment to train GRU model

# gru_result = train_module.train_model(
#     model_type='gru',
#     train_csv=config.PROCESSED_DIR / 'train.csv',
#     val_csv=config.PROCESSED_DIR / 'val.csv',
#     batch_size=16,
#     epochs=20,
#     learning_rate=config.LEARNING_RATE
# )

# print("\n✓ GRU training complete!")
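
The LSTM and GRU variants consume ordered frame sequences rather than single frames. How `data_generators` builds them is not shown in this notebook; as a rough sketch, fixed-length windows could be cut from each video's frame list as follows. The function name, the `stride` parameter, and the example paths are hypothetical, and `sequence_length` plays the role of `config.SEQUENCE_LENGTH`:

```python
def make_sequences(frame_paths, sequence_length, stride=1):
    """Cut an ordered list of frame paths into fixed-length windows.

    Trailing frames that do not fill a complete window are dropped.
    """
    if sequence_length <= 0 or stride <= 0:
        raise ValueError("sequence_length and stride must be positive")
    last_start = len(frame_paths) - sequence_length
    return [
        frame_paths[start:start + sequence_length]
        for start in range(0, last_start + 1, stride)
    ]


# Hypothetical frame paths for a single video.
frames = [f"video_01/frame_{i:04d}.jpg" for i in range(10)]
windows = make_sequences(frames, sequence_length=4, stride=3)
print(len(windows))    # windows start at frames 0, 3, and 6
print(windows[-1][0])
```

Windows should be built per video so that no sequence mixes frames from two recordings. Each window holds `sequence_length` frames, which is one reason the commented-out cells above use a smaller batch size.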

## 12. Model Comparison

Compare multiple models (if trained).

In [None]:
# Example: Compare MobileNetV2, LSTM, GRU
# Uncomment if you trained multiple models

# model_results = {
#     'MobileNetV2': eval_results,
#     'LSTM': evaluate.evaluate_model(
#         model_path=str(config.MODELS_DIR / 'lstm_final.keras'),
#         test_csv=config.PROCESSED_DIR / 'test.csv',
#         model_type='lstm',
#         batch_size=16,
#         save_results=True
#     ),
#     'GRU': evaluate.evaluate_model(
#         model_path=str(config.MODELS_DIR / 'gru_final.keras'),
#         test_csv=config.PROCESSED_DIR / 'test.csv',
#         model_type='gru',
#         batch_size=16,
#         save_results=True
#     )
# }

# # Create comparison plot
# evaluate.compare_models(
#     model_results,
#     save_path=config.RESULTS_DIR / 'model_comparison.png'
# )

# display(Image(filename=str(config.RESULTS_DIR / 'model_comparison.png')))
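
`evaluate.compare_models` produces a plot. For a quick textual comparison, the same kind of results dictionary can also be tabulated directly. A small sketch, assuming each entry exposes the `accuracy` and `f1_score` keys used earlier in this notebook; the helper name is hypothetical and the numbers are placeholders, not measured results:

```python
def format_comparison(model_results, metrics=("accuracy", "f1_score")):
    """Return a plain-text table sorted by the first metric, descending."""
    header = f"{'Model':<14}" + "".join(f"{metric:>12}" for metric in metrics)
    ranked = sorted(
        model_results.items(),
        key=lambda item: item[1][metrics[0]],
        reverse=True,
    )
    lines = [header]
    for name, scores in ranked:
        cells = "".join(f"{scores[metric]:>12.4f}" for metric in metrics)
        lines.append(f"{name:<14}" + cells)
    return "\n".join(lines)


# Placeholder values for illustration only.
placeholder_results = {
    "model_a": {"accuracy": 0.50, "f1_score": 0.25},
    "model_b": {"accuracy": 0.75, "f1_score": 0.50},
}
print(format_comparison(placeholder_results))
```

In the notebook, the `model_results` dictionary from the commented-out cell above could be passed in unchanged once the additional models have been evaluated.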

## 13. Summary & Next Steps

Training pipeline complete!

In [None]:
print("=" * 80)
print("TRAINING PIPELINE COMPLETE")
print("=" * 80)

print(f"\nTrained model: {MODEL_TYPE}")
print(f"Model saved: {result['final_model_path']}")
print(f"\nTest Accuracy: {eval_results['accuracy']:.4f}")
print(f"Test F1-Score: {eval_results['f1_score']:.4f}")

print(f"\nResults saved to:")
print(f"  - Confusion matrix: {config.RESULTS_DIR / MODEL_TYPE / 'confusion_matrix.png'}")
print(f"  - Classification report: {config.RESULTS_DIR / MODEL_TYPE / 'classification_report.txt'}")
print(f"  - Metrics CSV: {config.RESULTS_DIR / MODEL_TYPE / 'metrics.csv'}")

print(f"\nTensorBoard logs: {config.LOGS_DIR}")
print(f"Checkpoints: {config.CHECKPOINTS_DIR}")

print("\nNext steps:")
print("  1. Fine-tune model with more epochs (50+)")
print("  2. Train temporal models (LSTM/GRU) for sequence modeling")
print("  3. Download larger datasets (PSKUS, METC) for better accuracy")
print("  4. Experiment with different augmentation strategies")
print("  5. Export model to TFLite for mobile deployment")