# WhisperX Batch Processing for Silksong Sessions

**🎯 Purpose:** Process multiple audio sessions with WhisperX for word-level transcription

**📊 Status:** Your test run was successful! Results:
- ✅ 26 segments transcribed
- ✅ 734 words with timestamps
- ✅ 18.3% gesture keyword coverage
- ✅ High confidence scores (0.8+ average)

**⚡ Performance:** ~2-3 minutes per 10-minute audio file on Tesla T4

---

## Quick Setup

1. **Upload all your session folders** to Google Drive:
   ```
   My Drive/silksong_data/
   ├── 20251017_125600_session/audio_16k.wav
   ├── 20251017_130000_session/audio_16k.wav
   ├── 20251017_131500_session/audio_16k.wav
   └── [more sessions]/audio_16k.wav
   ```

2. **Update the session list** in the batch processing cell below

3. **Run all cells** and let it process overnight
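
Before an overnight run, it can help to confirm that every session folder really contains `audio_16k.wav`. The sketch below is one way to check this with the standard library. `find_sessions` is a hypothetical helper (not part of WhisperX), and the demo builds a throwaway folder tree rather than touching Google Drive.

```python
import tempfile
from pathlib import Path


def find_sessions(base_dir, audio_name="audio_16k.wav"):
    """Return (ready, missing): session folder names with and without the audio file."""
    ready, missing = [], []
    for folder in sorted(p for p in Path(base_dir).iterdir() if p.is_dir()):
        if (folder / audio_name).is_file():
            ready.append(folder.name)
        else:
            missing.append(folder.name)
    return ready, missing


# Demo on a temporary tree that mimics the Drive layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "silksong_data"
    (base / "20251017_125600_session").mkdir(parents=True)
    (base / "20251017_125600_session" / "audio_16k.wav").touch()
    (base / "20251017_130000_session").mkdir()  # audio not uploaded yet

    ready, missing = find_sessions(base)
    print("Ready:  ", ready)
    print("Missing:", missing)
```

On Colab, the same call would take `/content/drive/My Drive/silksong_data` once Drive is mounted.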

---

# WhisperX Transcription for Silksong Gesture Controller

**Project:** Hollow Knight: Silksong Gesture Recognition

**Purpose:** Transcribe audio recordings with word-level timestamps using WhisperX large-v3

**Hardware:** GPU-accelerated (CUDA)

---

## Setup Instructions

1. **Enable GPU:** Runtime → Change runtime type → GPU
2. **Upload audio files** to Google Drive: `My Drive/silksong_data/[session_name]/audio_16k.wav`
3. **Run all cells** in order

---

## 1. Verify GPU Access

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

## 2. Install WhisperX and Dependencies

In [None]:
# Install WhisperX
!pip install -q git+https://github.com/m-bain/whisperx.git

# Install PyTorch with CUDA support
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

print("✅ Installation complete!")

## 3. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Verify mount
import os
base_path = '/content/drive/My Drive/silksong_data'

if os.path.exists(base_path):
    print(f"✅ Google Drive mounted successfully!")
    print(f"\nAvailable sessions:")
    sessions = [d for d in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, d))]
    for session in sessions:
        print(f"  - {session}")
else:
    print(f"⚠️  Path not found: {base_path}")
    print(f"   Please create the folder structure in Google Drive first.")

## 4. Define Custom Prompt for Gesture Commands

In [None]:
# Custom prompt optimized for Silksong gesture recognition
CUSTOM_PROMPT = (
    "The following is a transcription of a person playing the video game "
    "Hollow Knight: Silksong. They are speaking their character's actions out loud. "
    "The key commands are: jump, punch, attack, turn, walk, walking, walk start, "
    "idle, rest, stop, noise. The speaker might say phrases like 'I'm gonna jump here', "
    "'punch punch', 'let me walk over there', 'okay, now idle', or 'that was noise'."
)

print("Custom prompt configured:")
print(f"  {CUSTOM_PROMPT[:100]}...")
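
The prompt and the keyword list used later for analysis are maintained separately, so they can drift apart. A small check along these lines lists any keyword the prompt never mentions. The prompt text and keywords are copied from this notebook; `keywords_missing_from_prompt` is a hypothetical helper added for illustration.

```python
import re

CUSTOM_PROMPT = (
    "The following is a transcription of a person playing the video game "
    "Hollow Knight: Silksong. They are speaking their character's actions out loud. "
    "The key commands are: jump, punch, attack, turn, walk, walking, walk start, "
    "idle, rest, stop, noise. The speaker might say phrases like 'I'm gonna jump here', "
    "'punch punch', 'let me walk over there', 'okay, now idle', or 'that was noise'."
)
GESTURE_KEYWORDS = ['jump', 'punch', 'attack', 'turn', 'walk', 'walking',
                    'idle', 'rest', 'stop', 'noise']


def keywords_missing_from_prompt(prompt, keywords):
    """Return the keywords that never appear as a whole word in the prompt."""
    prompt_words = set(re.findall(r"[a-z]+", prompt.lower()))
    return [k for k in keywords if k not in prompt_words]


missing_keywords = keywords_missing_from_prompt(CUSTOM_PROMPT, GESTURE_KEYWORDS)
print("Keywords missing from prompt:", missing_keywords)
```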

## 5. Single Session Transcription

In [None]:
# CONFIGURE THIS: Set your session name
SESSION_NAME = "20251017_125600_session"  # Change this to your session

# Paths
audio_path = f"/content/drive/My Drive/silksong_data/{SESSION_NAME}/audio_16k.wav"
output_dir = f"/content/drive/My Drive/silksong_data/{SESSION_NAME}/"

# Verify audio file exists
if not os.path.exists(audio_path):
    print(f"❌ ERROR: Audio file not found: {audio_path}")
    print(f"   Please upload audio_16k.wav to this location in Google Drive.")
else:
    print(f"✅ Audio file found: {audio_path}")

    # Get file size
    size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    print(f"   Size: {size_mb:.2f} MB")

    # Get duration (approximate)
    import librosa
    duration = librosa.get_duration(path=audio_path)
    print(f"   Duration: {duration/60:.2f} minutes")
    print(f"   Estimated transcription time: {duration/60 * 0.3:.1f} minutes")

In [None]:
# Run WhisperX transcription
import whisperx
import torch

print("🚀 Starting WhisperX transcription...\n")

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "float32"

print(f"Device: {device}")
print(f"Compute type: {compute_type}\n")

# Load model
print("Loading WhisperX model (large-v3)...")
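model = whisperx.load_model(
    "large-v3",
    device=device,
    compute_type=compute_type,
    language="en",
    # In current WhisperX releases the prompt is passed through asr_options at
    # load time; model.transcribe() does not accept an initial_prompt argument.
    asr_options={"initial_prompt": CUSTOM_PROMPT},
)
print("✅ Model loaded\n")

# Load audio
print("Loading audio...")
audio = whisperx.load_audio(audio_path)
print("✅ Audio loaded\n")

# Transcribe (the custom prompt was configured when the model was loaded)
print("Transcribing with custom gesture prompt...")
result = model.transcribe(audio, batch_size=16)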
print(f"✅ Transcription complete!")
print(f"   Segments: {len(result['segments'])}")
print(f"   Language: {result.get('language', 'unknown')}\n")

# Apply forced alignment for word-level timestamps
print("Applying forced alignment for word-level timestamps...")
model_a, metadata = whisperx.load_align_model(
    language_code="en",
    device=device
)

result = whisperx.align(
    result["segments"],
    model_a,
    metadata,
    audio,
    device,
    return_char_alignments=False
)

# Count words
word_count = sum(len(seg.get('words', [])) for seg in result.get('segments', []))
print(f"✅ Alignment complete!")
print(f"   Words with timestamps: {word_count}\n")
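
WhisperX's README lists, as a limitation, that words containing characters outside the alignment model's dictionary (numerals, for instance) are not given a timing. The count above includes such words. Assuming the aligned `result` has the shape used in this notebook, a sketch like the following separates timed from untimed words; `split_words_by_timing` is a hypothetical helper and the sample data is hand-written, not real output.

```python
def split_words_by_timing(result):
    """Return (timed, untimed) word entries from a WhisperX-style aligned result."""
    timed, untimed = [], []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            if "start" in word and "end" in word:
                timed.append(word)
            else:
                untimed.append(word)
    return timed, untimed


# Hand-written sample in the same shape as the notebook's `result` (not real output).
sample_result = {
    "segments": [
        {
            "text": " jump 2 times",
            "words": [
                {"word": "jump", "start": 1.02, "end": 1.31, "score": 0.93},
                {"word": "2"},  # no start/end/score: could not be aligned
                {"word": "times", "start": 1.40, "end": 1.75, "score": 0.88},
            ],
        }
    ]
}

timed, untimed = split_words_by_timing(sample_result)
print(f"Timed: {len(timed)}, untimed: {len(untimed)}")
```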

In [None]:
# Save results
import json

output_json = os.path.join(output_dir, "whisperx_output.json")
output_txt = os.path.join(output_dir, "whisperx_output_summary.txt")

# Save JSON
with open(output_json, 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
print(f"✅ Saved JSON: {output_json}")

# Save summary
with open(output_txt, 'w', encoding='utf-8') as f:
    f.write("WhisperX Transcription Summary\n")
    f.write("=" * 70 + "\n\n")
    f.write(f"Session: {SESSION_NAME}\n")
    f.write(f"Segments: {len(result.get('segments', []))}\n")
    f.write(f"Words: {word_count}\n\n")

    # First 50 words
    f.write("First 50 words with timestamps:\n")
    f.write("-" * 70 + "\n")

    all_words = []
    for seg in result.get('segments', []):
        all_words.extend(seg.get('words', []))

    for i, word_info in enumerate(all_words[:50], 1):
        word = word_info.get('word', '')
        start = word_info.get('start', 0)
        end = word_info.get('end', 0)
        score = word_info.get('score', 0)
        f.write(f"{i:3d}. {start:7.2f}s-{end:7.2f}s | {word:20s} | conf: {score:.3f}\n")

print(f"✅ Saved summary: {output_txt}")
print(f"\n🎉 Transcription complete! Results saved to Google Drive.")

## 6. Preview Results

In [None]:
# Display first few segments
print("First 5 segments with word-level timestamps:\n")
print("=" * 70)

for i, segment in enumerate(result['segments'][:5], 1):
    print(f"\n[Segment {i}] {segment['start']:.2f}s - {segment['end']:.2f}s")
    print(f"Text: {segment['text']}")

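    if 'words' in segment:
        print("Words:")
        for word_info in segment['words']:
            word = word_info['word']
            conf = word_info.get('score', 0)
            # Words WhisperX could not align (e.g. digits) may lack start/end
            if 'start' in word_info and 'end' in word_info:
                print(f"  {word_info['start']:6.2f}s-{word_info['end']:6.2f}s: '{word}' (conf: {conf:.3f})")
            else:
                print(f"  (no timestamp): '{word}'")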

print("\n" + "=" * 70)
print(f"Total segments: {len(result['segments'])}")
print(f"Total words: {word_count}")
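
The status summary quotes an average confidence above 0.8, but none of the cells here computes that figure. One way it could be derived is sketched below, assuming each aligned word may carry a `score` field as in the preview above. `mean_confidence` is a hypothetical helper and the sample values are made up.

```python
def mean_confidence(result):
    """Average the per-word alignment scores, ignoring words without a score."""
    scores = [
        word["score"]
        for segment in result.get("segments", [])
        for word in segment.get("words", [])
        if "score" in word
    ]
    return sum(scores) / len(scores) if scores else None


# Made-up sample in the same shape as the notebook's `result`.
sample_result = {
    "segments": [
        {"words": [
            {"word": "jump", "start": 1.0, "end": 1.3, "score": 0.9},
            {"word": "now", "start": 1.4, "end": 1.6, "score": 0.8},
            {"word": "3"},  # unaligned word without a score
        ]}
    ]
}

average = mean_confidence(sample_result)
print(f"Mean word confidence: {average:.2f}")
```

Returning `None` for an empty transcript keeps "no data" distinct from a genuine score of zero.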

## 7. Gesture Keyword Analysis

In [None]:
# Count gesture keywords in transcription
from collections import Counter

gesture_keywords = ['jump', 'punch', 'attack', 'turn', 'walk', 'walking', 'idle', 'rest', 'stop', 'noise']

# Extract all words
all_words = []
for seg in result['segments']:
    for word_info in seg.get('words', []):
        # Strip attached punctuation so tokens like "jump," still match "jump"
        all_words.append(word_info['word'].lower().strip(" \t\n.,!?;:'\"-"))

# Count gesture keywords
gesture_counts = Counter()
for word in all_words:
    if word in gesture_keywords:
        gesture_counts[word] += 1

# Display results
print("Gesture Keyword Frequency Analysis")
print("=" * 70)
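gesture_total = sum(gesture_counts.values())
# Guard against an empty transcript to avoid dividing by zero
coverage = gesture_total / len(all_words) * 100 if all_words else 0.0
print(f"\nTotal words transcribed: {len(all_words)}")
print(f"Gesture keywords found: {gesture_total}")
print(f"Coverage: {coverage:.1f}%\n")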

print("Gesture breakdown:")
for keyword in sorted(gesture_keywords):
    count = gesture_counts.get(keyword, 0)
    if count > 0:
        bar = '█' * int(count / 2)
        print(f"  {keyword:10s}: {count:3d} {bar}")
    else:
        print(f"  {keyword:10s}: {count:3d}")

print("\n" + "=" * 70)
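
The prompt lists `walk start` as a command, but a single-token lookup cannot match a two-word phrase. The sketch below shows one way to count such phrases, assuming the words arrive as a flat list of tokens that may still carry punctuation. `normalize` and `count_phrase` are hypothetical helpers, and the token list is invented for the demo.

```python
import string


def normalize(token):
    """Lowercase a token and strip surrounding punctuation and whitespace."""
    return token.lower().strip(string.punctuation + string.whitespace)


def count_phrase(tokens, phrase):
    """Count how often the word sequence `phrase` occurs in `tokens`."""
    words = [normalize(t) for t in tokens]
    target = [normalize(p) for p in phrase]
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)


# Invented tokens for the demo.
tokens = ["Okay,", "walk", "start.", "Now", "walk", "over", "there,", "Walk", "Start!"]
phrase_hits = count_phrase(tokens, ["walk", "start"])
print(f"'walk start' occurrences: {phrase_hits}")
```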

## 8. Batch Processing (Optional)

In [None]:
# 🎯 CONFIGURE THIS: Add ALL your session folder names here
SESSIONS_TO_PROCESS = [
    "20251017_125600_session",
    "20251017_130000_session",
    "20251017_131500_session",
    "20251017_132000_session",
    "20251017_133000_session",
    # Add all your actual session folder names here
    # Check your Google Drive: My Drive/silksong_data/ for the exact names
]

# 📁 Auto-discover sessions (uncomment to use this instead)
# import os
# base_path = "/content/drive/My Drive/silksong_data/"
# SESSIONS_TO_PROCESS = [d for d in os.listdir(base_path)
#                       if os.path.isdir(os.path.join(base_path, d))
#                       and os.path.exists(os.path.join(base_path, d, "audio_16k.wav"))]

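# Imports repeated here so the pre-flight check does not depend on earlier cells.
# model, model_a, metadata, and device must still be loaded by the cells in section 5.
import json
import os

import librosa

base_path = "/content/drive/My Drive/silksong_data/"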

print("🚀 BATCH PROCESSING SILKSONG SESSIONS")
print("=" * 80)

# 🔍 Pre-flight check: Verify which sessions exist
existing_sessions = []
missing_sessions = []
total_duration = 0

print("📋 Session inventory:")
for session in SESSIONS_TO_PROCESS:
    audio_path = os.path.join(base_path, session, "audio_16k.wav")
    if os.path.exists(audio_path):
        existing_sessions.append(session)
        # Get file info
        size_mb = os.path.getsize(audio_path) / (1024 * 1024)
        duration = librosa.get_duration(path=audio_path)
        total_duration += duration
        print(f"   ✅ {session:25s} | {size_mb:5.1f}MB | {duration/60:5.1f}min")
    else:
        missing_sessions.append(session)
        print(f"   ❌ {session:25s} | FILE NOT FOUND")

print(f"\n📊 Summary:")
print(f"   ✅ Found: {len(existing_sessions)} sessions ({total_duration/60:.1f} minutes total)")
print(f"   ❌ Missing: {len(missing_sessions)} sessions")
print(f"   ⏱️  Estimated processing time: {total_duration/60 * 0.3:.1f} minutes")

if missing_sessions:
    print(f"\n⚠️  Missing audio files - please upload to Google Drive:")
    for session in missing_sessions:
        print(f"      📁 My Drive/silksong_data/{session}/audio_16k.wav")
    print(f"\n   💡 Tip: You can continue with the {len(existing_sessions)} found sessions")

if len(existing_sessions) == 0:
    print(f"\n🚨 No sessions found! Please check your Google Drive folder structure.")
    print(f"   Expected: My Drive/silksong_data/[session_name]/audio_16k.wav")
else:
    print(f"\n🎯 Ready to process {len(existing_sessions)} sessions!")

print("=" * 80)

# 🚀 START BATCH PROCESSING
if len(existing_sessions) > 0:
    print(f"\n🚀 STARTING BATCH TRANSCRIPTION...")
    print("=" * 80)

    successful_sessions = []
    failed_sessions = []

    for i, session in enumerate(existing_sessions, 1):
        print(f"\n📍 [{i:2d}/{len(existing_sessions)}] Processing: {session}")
        print("-" * 70)

        audio_path = os.path.join(base_path, session, "audio_16k.wav")
        output_dir = os.path.join(base_path, session)
        output_json = os.path.join(output_dir, "whisperx_output.json")

        # Check if already processed
        if os.path.exists(output_json):
            print(f"   ⚡ Already processed - skipping (delete {session}/whisperx_output.json to reprocess)")
            successful_sessions.append(session)
            continue

        try:
            # 🎵 Load and transcribe
            print(f"   🎵 Loading audio...")
            audio = whisperx.load_audio(audio_path)

            print(f"   🤖 Transcribing with large-v3...")
            result = model.transcribe(audio, batch_size=16)

            # 🎯 Apply alignment
            print(f"   🎯 Applying forced alignment...")
            result = whisperx.align(
                result["segments"],
                model_a,
                metadata,
                audio,
                device,
                return_char_alignments=False
            )

            # 💾 Save results
            print(f"   💾 Saving results...")
            with open(output_json, 'w', encoding='utf-8') as f:
                json.dump(result, f, indent=2, ensure_ascii=False)

            # 📊 Summary
            word_count = sum(len(seg.get('words', [])) for seg in result.get('segments', []))
            segment_count = len(result['segments'])

            print(f"   ✅ SUCCESS: {segment_count} segments, {word_count} words")
            successful_sessions.append(session)

        except Exception as e:
            print(f"   ❌ ERROR: {str(e)}")
            failed_sessions.append((session, str(e)))
            continue

    # 🎉 Final Summary
    print(f"\n" + "=" * 80)
    print(f"🎉 BATCH PROCESSING COMPLETE!")
    print(f"=" * 80)
    print(f"✅ Successful: {len(successful_sessions)} sessions")
    print(f"❌ Failed: {len(failed_sessions)} sessions")

    if successful_sessions:
        print(f"\n✅ Successfully processed:")
        for session in successful_sessions:
            print(f"   - {session}")

    if failed_sessions:
        print(f"\n❌ Failed sessions:")
        for session, error in failed_sessions:
            print(f"   - {session}: {error}")

    print(f"\n📁 Results saved to Google Drive: My Drive/silksong_data/[session]/whisperx_output.json")
    print(f"💡 Next step: Download these files and run label alignment locally")

else:
    print(f"\n⏸️  Batch processing skipped - no valid sessions found")
    print(f"   Please upload audio files and try again")
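
The skip check treats any existing `whisperx_output.json` as finished work, so a file left half-written by an interrupted run would be skipped on the next pass. One possible safeguard is to write to a temporary file and rename it into place only after the write succeeds. The sketch below uses only the standard library; `save_json_atomic` is a hypothetical helper, and whether the rename is truly atomic on a mounted Google Drive folder is an assumption that has not been verified here.

```python
import json
import os
import tempfile


def save_json_atomic(data, path):
    """Write JSON to a temp file in the target folder, then rename it into place."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        # The final file name only appears once the full write has finished.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway folder rather than on Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "whisperx_output.json")
    save_json_atomic({"segments": []}, out_path)
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]
    print(reloaded, leftovers)
```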

## 9. Download Results (Optional)

In [None]:
# 📥 Download ALL processed results
import os
import zipfile

from google.colab import files

print("📥 DOWNLOADING ALL TRANSCRIPTION RESULTS")
print("=" * 60)

base_path = "/content/drive/My Drive/silksong_data/"

# Find all sessions with whisperx_output.json files
processed_sessions = []
for item in os.listdir(base_path):
    session_path = os.path.join(base_path, item)
    json_path = os.path.join(session_path, "whisperx_output.json")
    if os.path.isdir(session_path) and os.path.exists(json_path):
        processed_sessions.append(item)

if not processed_sessions:
    print("❌ No processed sessions found!")
    print("   Run the batch processing first")
else:
    print(f"📊 Found {len(processed_sessions)} processed sessions:")
    for session in processed_sessions:
        print(f"   ✅ {session}")

    # Option 1: Download individual files
    print(f"\n📥 Option 1: Download individual files")
    for session in processed_sessions[:3]:  # Limit to first 3 to avoid too many downloads
        json_path = os.path.join(base_path, session, "whisperx_output.json")
        print(f"   📁 Downloading {session}/whisperx_output.json...")
        files.download(json_path)

    if len(processed_sessions) > 3:
        print(f"   💡 Only downloaded first 3 files. Uncomment below for more.")

    # Option 2: Create and download ZIP file
    print(f"\n📥 Option 2: Download as ZIP file")
    zip_path = "/content/silksong_transcriptions.zip"

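    # ZIP_DEFLATED compresses the JSON; the default mode only stores files uncompressed
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zipf: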
        for session in processed_sessions:
            json_path = os.path.join(base_path, session, "whisperx_output.json")
            summary_path = os.path.join(base_path, session, "whisperx_output_summary.txt")

            # Add JSON file
            if os.path.exists(json_path):
                zipf.write(json_path, f"{session}/whisperx_output.json")

            # Add summary if exists
            if os.path.exists(summary_path):
                zipf.write(summary_path, f"{session}/whisperx_output_summary.txt")

    print(f"   🗜️  Created ZIP with {len(processed_sessions)} sessions")
    files.download(zip_path)

    print(f"\n✅ Download complete!")
    print(f"📁 Extract the ZIP file to your local project:")
    print(f"   data/continuous/[session_name]/whisperx_output.json")

# Uncomment to download ALL individual files:
# for session in processed_sessions:
#     json_path = os.path.join(base_path, session, "whisperx_output.json")
#     print(f"Downloading {session}...")
#     files.download(json_path)
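
Before downloading, it may be worth confirming that the archive contains what is expected. The sketch below builds a compressed ZIP and then re-opens it to run an integrity check and list its entries, using only the standard `zipfile` module. `build_zip` is a hypothetical helper, and the demo uses a temporary folder with placeholder session names.

```python
import os
import tempfile
import zipfile


def build_zip(base_dir, sessions, zip_path, filename="whisperx_output.json"):
    """Zip <session>/<filename> for each session that has it; return the archived names."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for session in sessions:
            src = os.path.join(base_dir, session, filename)
            if os.path.exists(src):
                zf.write(src, f"{session}/{filename}")
    # Re-open the archive to verify it before handing it to the browser download.
    with zipfile.ZipFile(zip_path) as zf:
        if zf.testzip() is not None:
            raise ValueError("ZIP archive failed its integrity check")
        return zf.namelist()


# Demo with placeholder sessions in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "session_a"))
    os.makedirs(os.path.join(tmp, "session_b"))  # no transcription yet
    with open(os.path.join(tmp, "session_a", "whisperx_output.json"), "w", encoding="utf-8") as f:
        f.write('{"segments": []}')

    archived = build_zip(tmp, ["session_a", "session_b"], os.path.join(tmp, "out.zip"))
    print(archived)
```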

# 📊 Batch Analysis & Results

## Quick Stats for All Sessions

Run this after batch processing to get an overview of all your transcribed sessions.

In [None]:
# 📊 Analyze all processed sessions
import json
import os
from collections import Counter

import pandas as pd

base_path = "/content/drive/My Drive/silksong_data/"
gesture_keywords = ['jump', 'punch', 'attack', 'turn', 'walk', 'walking', 'idle', 'rest', 'stop', 'noise']

print("📊 SILKSONG SESSION ANALYSIS")
print("=" * 80)

# Find all processed sessions
processed_sessions = []
for item in os.listdir(base_path):
    session_path = os.path.join(base_path, item)
    json_path = os.path.join(session_path, "whisperx_output.json")
    if os.path.isdir(session_path) and os.path.exists(json_path):
        processed_sessions.append(item)

if not processed_sessions:
    print("❌ No processed sessions found! Run batch processing first.")
else:
    # Analyze each session
    session_stats = []
    total_gestures = Counter()

    for session in sorted(processed_sessions):
        json_path = os.path.join(base_path, session, "whisperx_output.json")

        with open(json_path, 'r', encoding='utf-8') as f:
            result = json.load(f)

        # Count words and gestures
        all_words = []
        for seg in result['segments']:
            for word_info in seg.get('words', []):
                # Strip attached punctuation so tokens like "jump," still match "jump"
                all_words.append(word_info['word'].lower().strip(" \t\n.,!?;:'\"-"))

        # Count gesture keywords
        gesture_counts = Counter()
        for word in all_words:
            if word in gesture_keywords:
                gesture_counts[word] += 1
                total_gestures[word] += 1

        # Session stats
        stats = {
            'session': session,
            'segments': len(result['segments']),
            'total_words': len(all_words),
            'gesture_words': sum(gesture_counts.values()),
            'coverage': sum(gesture_counts.values()) / len(all_words) * 100 if all_words else 0
        }

        # Add individual gesture counts
        for gesture in gesture_keywords:
            stats[gesture] = gesture_counts.get(gesture, 0)

        session_stats.append(stats)

    # Create DataFrame for easy viewing
    df = pd.DataFrame(session_stats)

    print(f"📈 SUMMARY: {len(processed_sessions)} sessions processed")
    print("-" * 80)

    # Overall stats
    total_segments = df['segments'].sum()
    total_words = df['total_words'].sum()
    total_gesture_words = df['gesture_words'].sum()

    print(f"🎯 Total segments: {total_segments:,}")
    print(f"🎯 Total words: {total_words:,}")
    print(f"🎯 Total gesture words: {total_gesture_words:,}")
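    # Guard against empty transcripts to avoid dividing by zero
    overall_coverage = total_gesture_words / total_words * 100 if total_words else 0.0
    print(f"🎯 Overall gesture coverage: {overall_coverage:.1f}%")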

    print(f"\n📊 Per-session breakdown:")
    print("-" * 80)
    print(f"{'Session':25s} | {'Segments':8s} | {'Words':6s} | {'Gestures':8s} | {'Coverage':8s}")
    print("-" * 80)

    for _, row in df.iterrows():
        print(f"{row['session']:25s} | {row['segments']:8d} | {row['total_words']:6d} | {row['gesture_words']:8d} | {row['coverage']:7.1f}%")

    print(f"\n🎮 Gesture frequency across all sessions:")
    print("-" * 50)

    for gesture in sorted(gesture_keywords):
        count = total_gestures.get(gesture, 0)
        if count > 0:
            bar = '█' * min(count // 5, 50)  # Scale bar
            print(f"  {gesture:10s}: {count:4d} {bar}")
        else:
            print(f"  {gesture:10s}: {count:4d}")

    print(f"\n🔥 Top sessions by gesture activity:")
    print("-" * 50)
    top_sessions = df.nlargest(5, 'gesture_words')[['session', 'gesture_words', 'coverage']]

    for _, row in top_sessions.iterrows():
        print(f"  {row['session']:25s}: {row['gesture_words']:3d} gestures ({row['coverage']:.1f}%)")

    print(f"\n💡 Next steps:")
    print(f"   1. Download all JSON files using the cell above")
    print(f"   2. Move to your local project: data/continuous/[session]/whisperx_output.json")
    print(f"   3. Run label alignment: python align_voice_labels.py --session [session]")
    print(f"   4. Train your CNN/LSTM model with the labeled data")

print("=" * 80)

---

## 🏠 Next Steps: Local Integration

After downloading your transcription results:

### 1. Move Files to Local Project

```bash
# Extract downloaded ZIP to your project
unzip ~/Downloads/silksong_transcriptions.zip -d /tmp/transcriptions

# Copy each session's results (creating the target folder if it does not exist)
for session in /tmp/transcriptions/*/; do
    session_name=$(basename "$session")
    mkdir -p "data/continuous/$session_name"
    cp "${session}whisperx_output.json" \
       "data/continuous/$session_name/whisperx_output.json"
done
```

### 2. Run Label Alignment

```bash
# Process each session locally
python align_voice_labels.py \
  --session 20251017_125600_session \
  --whisper data/continuous/20251017_125600_session/whisperx_output.json

# Or batch process all sessions
for session_dir in data/continuous/*/; do
    session_name=$(basename "$session_dir")
    if [ -f "$session_dir/whisperx_output.json" ]; then
        python align_voice_labels.py \
          --session "$session_name" \
          --whisper "$session_dir/whisperx_output.json"
    fi
done
```

### 3. Continue to Phase IV: Model Training

Your gesture data is now ready for CNN/LSTM training! 🎉

---

## 📋 Checklist

- [ ] All sessions transcribed with WhisperX
- [ ] Results downloaded to local machine
- [ ] Label alignment completed
- [ ] Ready for CNN/LSTM training
- [ ] Gesture recognition model deployment

**You've successfully completed Phase V! 🚀**