# WLASL Dataset - Large-scale Video Dataset for Sign Language Recognition

**Dataset:** WLASL (Word-Level American Sign Language), introduced by Li et al. (WACV 2020)  
**Size:** 2,700 videos of word-level ASL signs (whole words, not fingerspelled letters)  
**Classes:** 2,000+ unique words  
**Format:** MP4 videos with variable length (1-10 seconds)  
**Challenge:** Temporal modeling, variable sequence lengths, complete word recognition  
**Goal:** Train an LSTM/GRU model for word-level sign language recognition

**Best Run On:** Google Colab with GPU (video processing is intensive)  
**Training Time:** 1-2 hours for temporal model training

**Key Steps:**
1. Download WLASL videos
2. Extract frames from videos
3. Use MediaPipe to extract hand keypoints (21 landmarks per hand)
4. Create sequence dataset (frame sequences with labels)
5. Train LSTM model on keypoint sequences
6. Export for production
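
The dataset-building code later in this notebook expects one folder per word, each holding that word's `.mp4` clips. A minimal sketch of planning that layout follows. It assumes the WLASL metadata is a JSON list of entries, each with a `gloss` string and an `instances` list whose items carry a `video_id`, and that downloaded clips are named `<video_id>.mp4`. The helper `plan_word_folders` and the `raw_videos` directory are hypothetical, so check the shape against your copy of the metadata.

```python
from pathlib import Path

def plan_word_folders(entries, raw_dir, out_dir):
    """Map each downloaded clip to out_dir/<gloss>/<video_id>.mp4.

    Assumed metadata shape (verify against your copy of the dataset):
    [{"gloss": "book", "instances": [{"video_id": "00001"}, ...]}, ...]
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    plan = []
    for entry in entries:
        gloss = entry["gloss"]
        for inst in entry.get("instances", []):
            src = raw_dir / f"{inst['video_id']}.mp4"
            dst = out_dir / gloss / src.name
            plan.append((src, dst))
    return plan

# Tiny in-memory example standing in for the parsed JSON file.
sample = [
    {"gloss": "book", "instances": [{"video_id": "00001"}, {"video_id": "00002"}]},
    {"gloss": "drink", "instances": [{"video_id": "00003"}]},
]
for src, dst in plan_word_folders(sample, "raw_videos", "external_data/wlasl_videos"):
    print(src.as_posix(), "->", dst.as_posix())
```

To apply the plan, create `dst.parent` and copy each `src`, skipping any clip that was not downloaded (`src.exists()` is false).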

In [None]:
# Setup with MediaPipe
import subprocess, sys

packages = [
    'tensorflow', 'kaggle', 'opencv-python', 'mediapipe', 'numpy', 'pandas',
    'matplotlib', 'scikit-learn', 'tensorflowjs'
]
for pkg in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])

from pathlib import Path
for d in ['external_data', 'datasets', 'models', 'output']:
    Path(d).mkdir(parents=True, exist_ok=True)

print("✓ Setup complete with MediaPipe")

In [None]:
# Extract hand keypoints using MediaPipe
import cv2
import mediapipe as mp
import numpy as np
from pathlib import Path

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

def extract_hand_keypoints_from_video(video_path, max_frames=30):
    """Extract hand keypoints from video using MediaPipe."""
    cap = cv2.VideoCapture(str(video_path))
    keypoints_sequence = []
    
    with mp_hands.Hands(static_image_mode=False, max_num_hands=2) as hands:
        frame_count = 0
        while cap.isOpened() and frame_count < max_frames:
            ret, frame = cap.read()
            if not ret:
                break
            
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = hands.process(frame_rgb)
            
            # Extract landmarks: 21 points per hand, (x, y) only; z is dropped.
            # Layout: 2 hands × 21 points × 2 coords = 84 values, hand 0 first.
            # Frames with no detected hand stay all-zero.
            frame_keypoints = np.zeros((84,), dtype=np.float32)
            
            if results.multi_hand_landmarks:
                for hand_idx, landmarks in enumerate(results.multi_hand_landmarks):
                    if hand_idx >= 2:
                        break
                    offset = hand_idx * 42  # 21 points × 2 coords per hand
                    for i, lm in enumerate(landmarks.landmark):
                        frame_keypoints[offset + i*2] = lm.x
                        frame_keypoints[offset + i*2 + 1] = lm.y
            
            keypoints_sequence.append(frame_keypoints)
            frame_count += 1
    
    cap.release()
    return np.array(keypoints_sequence) if keypoints_sequence else None

print("✓ Keypoint extraction function ready")
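
The extraction function keeps only the first `max_frames` frames, so longer clips are cut off after about a second of footage at typical frame rates. One alternative, which is not what this notebook does, is to sample frame indices evenly across the whole clip. A minimal sketch follows; `uniform_frame_indices` is a hypothetical helper. The total frame count would come from something like `cap.get(cv2.CAP_PROP_FRAME_COUNT)`, which can be inaccurate for some files.

```python
def uniform_frame_indices(total_frames, max_frames=30):
    """Return up to max_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame and let padding fill the rest.
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

print(uniform_frame_indices(10, max_frames=4))
print(len(uniform_frame_indices(300)))
```

In the capture loop, a frame would be passed to MediaPipe only when its running index is in the returned list.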

In [None]:
# Create sequence dataset for temporal modeling
import numpy as np
from pathlib import Path
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_sequence_dataset(video_dir, max_frames=30):
    """Create padded sequence dataset from videos."""
    sequences = []
    labels = []
    label_dict = {}
    
    video_dir = Path(video_dir)
    label_idx = 0
    
    # Sort so that label indices are reproducible across runs
    for word_dir in sorted(video_dir.iterdir()):
        if not word_dir.is_dir():
            continue
        
        word = word_dir.name
        label_dict[label_idx] = word
        
        for video_file in word_dir.glob('*.mp4'):
            try:
                keypoints = extract_hand_keypoints_from_video(str(video_file), max_frames)
                if keypoints is not None:
                    sequences.append(keypoints)
                    labels.append(label_idx)
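            except Exception as exc:
                # Skip unreadable videos, but report them instead of failing silently
                print(f"Skipping {video_file.name}: {exc}")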
        
        label_idx += 1
    
    # Pad sequences to uniform length
    X = pad_sequences(sequences, maxlen=max_frames, padding='post', dtype='float32')
    y = np.array(labels)
    
    print(f"✓ Dataset created: {len(X)} sequences, {len(label_dict)} classes")
    return X, y, label_dict

print("✓ Sequence dataset creation ready")
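
`pad_sequences(..., padding='post')` appends all-zero frames until every clip has `max_frames` time steps. The pure-Python sketch below mimics that behaviour on nested lists to make the resulting shape concrete. `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post(sequences, maxlen, feature_dim):
    """Zero-pad each sequence of frames at the end, up to maxlen frames.

    Longer sequences keep their last maxlen frames, mirroring the Keras
    default truncating='pre'.
    """
    padded = []
    for seq in sequences:
        start = max(0, len(seq) - maxlen)
        kept = [list(frame) for frame in seq[start:]]
        kept += [[0.0] * feature_dim for _ in range(maxlen - len(kept))]
        padded.append(kept)
    return padded

clips = [
    [[0.1, 0.2], [0.3, 0.4]],                          # 2 frames
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # 4 frames
]
batch = pad_post(clips, maxlen=3, feature_dim=2)
print(batch[0])  # two real frames followed by one zero frame
print(batch[1])  # last three frames of the longer clip
```

Because a frame with no detected hand is also stored as all zeros, padded frames and missed detections look identical to the model.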

In [None]:
# Train LSTM temporal model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout, Bidirectional
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Build the model (no data is loaded in this cell)
print("📝 Creating LSTM model for temporal sequence data...")

# Model parameters
MAX_FRAMES = 30
FEATURE_DIM = 84  # 2 hands × 21 points × 2 coords (x, y)
EPOCHS = 20
BATCH = 32

# Number of output classes. Placeholder value: set this to len(label_dict)
# once the dataset has been built.
NUM_CLASSES = 2000

# Build LSTM model
model = Sequential([
    Bidirectional(LSTM(128, return_sequences=True), input_shape=(MAX_FRAMES, FEATURE_DIM)),
    Dropout(0.3),
    Bidirectional(LSTM(64, return_sequences=False)),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(NUM_CLASSES, activation='softmax')  # One probability per ASL word
])

model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy',  # Matches the one-hot labels from to_categorical
    metrics=['accuracy']
)

print("✓ LSTM model built")
print("  Architecture: Bidirectional LSTM → Dense layers")
print(f"  Input: ({MAX_FRAMES}, {FEATURE_DIM}) - sequences of hand keypoints")
print("  Output: Class prediction for ASL word")

model.summary()
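
As a sanity check on what `model.summary()` reports, the weight count of an LSTM layer can be worked out by hand. Each of the four gates has an input kernel, a recurrent kernel, and a bias, and a bidirectional wrapper doubles the total. The sketch below does that arithmetic. It assumes the standard Keras LSTM with default settings and 84 input features per frame (2 hands × 21 points × 2 coords); `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units, bidirectional=False):
    """Trainable weights in a standard LSTM layer.

    4 gates × (input kernel + recurrent kernel + bias).
    """
    per_direction = 4 * (units * (input_dim + units) + units)
    return 2 * per_direction if bidirectional else per_direction

feature_dim = 84  # assumption: 2 hands × 21 points × 2 coords
first = lstm_params(input_dim=feature_dim, units=128, bidirectional=True)
# The second layer sees the concatenated forward and backward outputs:
# 2 × 128 = 256 features per time step.
second = lstm_params(input_dim=256, units=64, bidirectional=True)
print(first, second)
```

Pass the `FEATURE_DIM` actually used in the notebook as `input_dim` when comparing against the summary.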

In [None]:
# Training template (requires actual WLASL data)
print("🚀 To train with real WLASL data:")
print("""
# After creating sequences with create_sequence_dataset():
X, y, label_dict = create_sequence_dataset('external_data/wlasl_videos')

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Convert labels to categorical (for multi-class)
y_train_cat = to_categorical(y_train, num_classes=len(label_dict))
y_test_cat = to_categorical(y_test, num_classes=len(label_dict))

# Train model
history = model.fit(
    X_train, y_train_cat,
    epochs=EPOCHS,
    batch_size=BATCH,
    validation_split=0.15,
    verbose=1
)

# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test_cat)
print(f'Test Accuracy: {test_acc:.4f}')

# Save
model.save('models/wlasl_lstm.h5')
""")

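After training, each prediction is a vector of class probabilities, and `label_dict` maps the winning index back to a word. A minimal sketch of that lookup with top-k candidates follows, assuming a softmax output with one entry per class. `top_k_words` and the sample values are made up for illustration.

```python
def top_k_words(probs, label_dict, k=3):
    """Return the k most likely (word, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(label_dict[i], probs[i]) for i in ranked[:k]]

label_dict = {0: "book", 1: "drink", 2: "computer", 3: "go"}
probs = [0.10, 0.55, 0.05, 0.30]  # one row of model output (illustrative)
print(top_k_words(probs, label_dict, k=2))
```

With a Keras model, `probs` would be one row of `model.predict(X_test)`.
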
## Summary: Four Dataset Notebooks

### 📊 Dataset Comparison

| Dataset | Format | Size | Classes | Model | Best For |
|---------|--------|------|---------|-------|----------|
| **ASL Alphabet** | Images (160×160) | 87K | 26 | MobileNetV2 | Letter recognition |
| **Sign MNIST** | CSV (28×28) | 27K | 24 | Simple CNN | Quick training |
| **HaGRID** | Images (variable) | 500K | 18 | EfficientNet | Real-world robustness |
| **WLASL** | Videos (MP4) | 2.7K | 2000+ | LSTM | Word-level recognition |

### 🎯 Use Cases

1. **ASL Alphabet** → Real-time letter spelling (interactive games, education)
2. **Sign MNIST** → Lightweight deployment (mobile, web)
3. **HaGRID** → Production gesture recognition (various backgrounds)
4. **WLASL** → Complete word understanding (conversation support)

### 🚀 Next Steps

1. Choose dataset based on use case
2. Download and preprocess data
3. Run training notebook
4. Export to TFJS or ONNX
5. Deploy in your SamvadSetu application