# Sign Language Recognition - Training Notebook

This notebook automatically trains the GRU model for sign language recognition.

## Quick Start:
1. **Runtime → Change runtime type → Select GPU**
2. **Runtime → Run all**

Or click the "Open in Colab" badge in the GitHub repository!

---

**Note:** Make sure your data is available in the `Data/` directory. If you need to upload data, see `docs/COLAB_UPLOAD_GUIDE.md`.


In [None]:
# Clone repository - delete any existing copy and clone fresh
import os
import shutil
import subprocess

# Go to /content first
os.chdir('/content')
print(f"Current directory: {os.getcwd()}")

# Delete existing SignLanguage-Recognition directory if it exists
if os.path.exists('SignLanguage-Recognition'):
    print("🗑️  Deleting existing SignLanguage-Recognition directory...")
    shutil.rmtree('SignLanguage-Recognition')
    print("✅ Deleted existing directory")

# Clone fresh; check=True raises if git fails, so success is never reported falsely
print("📥 Cloning repository...")
subprocess.run(['git', 'clone', 'https://github.com/MAya0M/SignLanguage-Recognition.git'], check=True)
print("✅ Repository cloned")

# Change to the directory
os.chdir('/content/SignLanguage-Recognition')
print(f"✅ Now in: {os.getcwd()}")


In [None]:
# Check GPU availability
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("GPU Available:", tf.config.list_physical_devices('GPU'))


## Install Dependencies


In [None]:
# Install required packages
%pip install -q tensorflow numpy pandas scikit-learn opencv-python mediapipe tqdm



## Re-extract Keypoints


In [None]:
# Step 0: Re-extract keypoints with MINIMAL normalization (only translate)
# This preserves size/rotation differences which help distinguish classes
# ⚠️  This will OVERWRITE all existing keypoint files!
print("⚠️  WARNING: This will re-extract ALL keypoints with minimal normalization")
print("   This may take a while...")
!python scripts/re_extract_with_minimal_normalization.py
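
To make "minimal normalization (only translate)" concrete, here is a rough sketch of what translation-only normalization could look like. It is not the code in `scripts/re_extract_with_minimal_normalization.py`, which is not shown here. The function name `translate_only`, the array layout `(frames, landmarks, coords)`, and the choice of landmark 0 (the wrist in MediaPipe Hands) as the origin are all assumptions made for illustration.

```python
import numpy as np

def translate_only(keypoints: np.ndarray, ref_index: int = 0) -> np.ndarray:
    """Shift every frame so the reference landmark sits at the origin.

    keypoints: array of shape (frames, landmarks, coords)  [assumed layout]
    ref_index: landmark used as the origin (0 is the wrist in MediaPipe Hands)
    No scaling or rotation is applied, so hand size and orientation survive.
    """
    origin = keypoints[:, ref_index:ref_index + 1, :]  # shape (frames, 1, coords)
    return keypoints - origin

# Two frames, three landmarks, 2-D coordinates (made-up values).
demo = np.array([
    [[1.0, 1.0], [2.0, 3.0], [4.0, 1.0]],
    [[5.0, 5.0], [6.0, 7.0], [8.0, 5.0]],
])
shifted = translate_only(demo)
print(shifted[0])
```

Both demo frames show the same hand shape at different positions, so after the shift they become identical while the distances between landmarks are unchanged.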


## 🔍 CRITICAL: Check Why the Model Is Stuck at 12.5% (Random Chance)

**Run this BEFORE training!** This will diagnose why the model is not learning.
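
As a quick sanity check on the 12.5% figure: it equals 1/8, which is the accuracy of guessing uniformly among eight equally frequent classes. That the dataset has eight classes is inferred from this number, not stated elsewhere in the notebook. The sketch below computes both the uniform-guess and the majority-class baselines from a list of labels; the helper `chance_baselines` and the label names are hypothetical.

```python
from collections import Counter

def chance_baselines(labels):
    """Return (uniform_chance, majority_chance) for a list of class labels."""
    counts = Counter(labels)
    uniform = 1.0 / len(counts)                    # guess each class equally often
    majority = max(counts.values()) / len(labels)  # always guess the largest class
    return uniform, majority

# Hypothetical balanced dataset: 8 classes, 10 samples each.
labels = [f"sign_{i}" for i in range(8) for _ in range(10)]
uniform, majority = chance_baselines(labels)
print(f"uniform guess: {uniform:.3f}, majority class: {majority:.3f}")
```

If validation accuracy sits at either baseline for many epochs, the model is likely ignoring its input, which is what the diagnostic scripts below are meant to investigate.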


In [None]:
# Step 1: Check if data from different classes is actually different
!python scripts/fix_normalization.py


In [None]:
# Step 2: Deep debug - check everything
!python scripts/debug_model_training.py



## Verify Data

Make sure your data is in the `Data/` directory. If not, upload it using one of the methods in `docs/COLAB_UPLOAD_GUIDE.md`.


In [None]:
# Verify data exists and check dataset size
import os
import pandas as pd
from pathlib import Path

data_dir = Path('Data')
if data_dir.exists():
    print("✅ Data directory found")
    csv_path = data_dir / 'Labels' / 'dataset.csv'
    keypoints_dir = data_dir / 'Keypoints' / 'rawVideos'

    if csv_path.exists():
        print(f"✅ CSV file: {csv_path}")
        # Check dataset size
        df = pd.read_csv(csv_path)
        print(f"\n📊 Dataset Statistics:")
        print(f"   Total samples: {len(df)}")
        print(f"   Expected: 226 samples (with new videos)")

        if len(df) < 200:
            print(f"\n⚠️  WARNING: Only {len(df)} samples found!")
            print(f"   Expected 226 samples. Make sure CSV is updated with new videos.")
            print(f"   Run: !python scripts/create_dataset_csv.py")
        else:
            print(f"   ✅ CSV has correct number of samples!")

        print(f"\n   Per label:")
        for label, count in df.groupby('label').size().sort_values(ascending=False).items():
            status = "✅" if count >= 25 else "⚠️"
            print(f"      {status} {label:12s}: {count:3d} samples")

        print(f"\n   Per split:")
        for split, count in df.groupby('split').size().items():
            print(f"      {split:6s}: {count:3d} samples")
    else:
        print(f"❌ CSV file not found: {csv_path}")

    if keypoints_dir.exists():
        print(f"\n✅ Keypoints directory: {keypoints_dir}")
        # Count keypoint files
        npy_files = list(keypoints_dir.rglob("*.npy"))
        print(f"   Found {len(npy_files)} keypoint files")
        if len(npy_files) < 200:
            print(f"   ⚠️  Expected ~226 keypoint files")
    else:
        print(f"❌ Keypoints directory not found: {keypoints_dir}")
else:
    print("❌ Data directory not found")
    print("Please upload data first! See docs/COLAB_UPLOAD_GUIDE.md")

# Run detailed dataset check
print("\n" + "="*60)
print("Running detailed dataset verification...")
print("="*60)
!python scripts/check_dataset.py


## Train Model


In [None]:
# Train the model with optimized parameters for better learning
# FIXED settings for small dataset:
# - Higher learning rate (0.002) - faster learning for small dataset
# - Smaller batch size (8) - better gradient updates with small dataset
# - Smaller model (128 units, 2 layers) - better for small dataset, prevents overfitting
# - Less dropout (0.2) - allows more learning with small dataset
# - More epochs (200) - give model time to learn
# - Normalization DISABLED - keypoints already normalized in extraction

!python scripts/train_model.py --csv Data/Labels/dataset.csv --keypoints-dir Data/Keypoints/rawVideos --output-dir models --batch-size 8 --epochs 200 --gru-units 128 --num-gru-layers 2 --dropout 0.2 --learning-rate 0.002 --patience 50


In [None]:
# Alternative configuration: a larger model trained with a lower learning rate.
# Run this cell instead of the previous training cell, not in addition to it.
# Settings:
# - Learning rate 0.0005 (lower than the 0.002 used in the previous cell)
# - Batch size 16 (still small, for frequent gradient updates)
# - More GRU units (256) and layers (3) for more capacity
# - Moderate dropout (0.3), reduced to allow more learning
# - Patience 50, which is tripled internally (150 epochs) so the model can keep learning much longer
# - More epochs (200) to give the model more time to converge
# - Early stopping monitors val_accuracy with a very small min_delta to catch any improvement
# - Batch normalization removed from the dense layers (it can interfere with small datasets)

!python scripts/train_model.py --csv Data/Labels/dataset.csv --keypoints-dir Data/Keypoints/rawVideos --output-dir models --batch-size 16 --epochs 200 --gru-units 256 --num-gru-layers 3 --dropout 0.3 --learning-rate 0.0005 --patience 50
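
The `--patience` and `min_delta` settings mentioned above interact in a way that is easy to misread, so here is a small pure-Python sketch of patience-based early stopping on `val_accuracy`. It illustrates the general technique only. It is not taken from `scripts/train_model.py`, and the function name `early_stop_epoch` and the accuracy history are made up.

```python
def early_stop_epoch(val_accuracies, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without
    val_accuracy beating the best value so far by more than `min_delta`.
    """
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best + min_delta:
            best = acc      # real improvement: remember it and reset the counter
            waited = 0
        else:
            waited += 1     # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Made-up history: stuck at chance, one jump, then flat.
history = [0.125, 0.125, 0.20, 0.20, 0.20, 0.20]
print(early_stop_epoch(history, patience=3))  # stops at epoch 6
print(early_stop_epoch(history, patience=5))  # None: never waits long enough
```

A larger patience lets a model that plateaus at chance keep training, at the cost of longer runs when it never improves.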


## Evaluate Model Quality

After training, check the prediction quality:


In [None]:
# Evaluate the latest trained model
import glob
from pathlib import Path

# Run directories are sorted by name, so the last match is treated as the latest
models = sorted(glob.glob('models/run_*/best_model.keras'))
if models:
    latest_model = models[-1]
    # Print the run directory name; the file name is always best_model.keras
    print(f"📊 Evaluating: {Path(latest_model).parent.name}")
    print("=" * 60)

    # Run evaluation
    !python scripts/evaluate_model.py --model {latest_model} --csv Data/Labels/dataset.csv --keypoints-dir Data/Keypoints/rawVideos
else:
    print("❌ No models found - train the model first!")
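
Overall accuracy can hide a model that predicts only one or two classes, so it may help to look at accuracy per class as well. The sketch below shows one way to compute it. What `scripts/evaluate_model.py` actually reports is not shown in this notebook, and the helper `per_class_accuracy` and the sign labels are hypothetical.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}

# Made-up predictions from a model that mostly answers "hello".
y_true = ["hello", "hello", "thanks", "thanks", "yes", "yes"]
y_pred = ["hello", "hello", "hello", "thanks", "hello", "hello"]
scores = per_class_accuracy(y_true, y_pred)
for label, acc in scores.items():
    print(f"{label:8s}: {acc:.2f}")
```

Here the overall accuracy is 50%, yet one class is never recognized, which is the kind of failure a single aggregate number would not reveal.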


## Download Model (Optional)

To download your trained model to your computer (or, optionally, save it to Google Drive):


In [None]:
# Download model to your computer
from google.colab import files
import shutil
import glob
from pathlib import Path

models_dir = sorted(glob.glob('models/run_*'))
if models_dir:
    latest_run = models_dir[-1]  # Latest run
    print(f"📦 Preparing model: {Path(latest_run).name}")

    # Create a zip file
    zip_name = f"{Path(latest_run).name}"
    shutil.make_archive(zip_name, 'zip', latest_run)

    # Download
    print(f"⬇️ Downloading {zip_name}.zip...")
    files.download(f'{zip_name}.zip')
    print("✅ Model downloaded! Extract and add to your GitHub repo.")
else:
    print("❌ No models found - train the model first!")

# Alternative: Save to Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# dest = f'/content/drive/MyDrive/{Path(latest_run).name}'
# shutil.copytree(latest_run, dest, dirs_exist_ok=True)
# print(f"✅ Model saved to Google Drive: {Path(latest_run).name}")
