Real-Time Transcription System with whisper_streaming #335

@ProjectLiminality

Overview

A minimalist Python CLI tool for real-time speech transcription that integrates with the InterBrain Obsidian plugin. It uses whisper_streaming with the LocalAgreement-2 policy to append timestamped, duplicate-free transcripts to markdown files for post-call summarization and semantic search integration.

Vision

Enable effortless voice-to-text capture during conversations, meetings, and ideation sessions. Transcripts appear seamlessly in markdown files with timestamps, ready for semantic search indexing and DreamNode integration. No UI complexity - just terminal commands orchestrated by Obsidian.

Architectural Decision: Deferred Marketplace Compatibility

Current Implementation Strategy: Build as a self-contained vertical-slice feature, with the Python scripts included in src/features/realtime-transcription/scripts/.

Rationale:

  • Faster Development: No external repo coordination needed during initial development
  • Easy Testing: Scripts co-located with TypeScript code for rapid iteration
  • Future-Proof: Minimal refactoring cost to split later (~2-3 hours)
  • Cross-Platform: Works on macOS, Windows, Linux with Python 3.8+

Path to Marketplace Compatibility (Optional Future):

If we later decide to pursue Obsidian marketplace listing:

  1. Extract scripts/ directory to separate interbrain-transcription-extension repository
  2. Update transcription-service.ts path resolution to check vault .interbrain/extensions/ first
  3. Add installation command for extension setup (git clone + pip install instructions)
  4. Submit core plugin (without Python scripts) to marketplace

What Stays The Same (No Rewrite):

  • ✅ All TypeScript code (commands, services, store)
  • ✅ Python script logic
  • ✅ Process spawning (just path argument changes)
  • ✅ Feature architecture (self-contained vertical slice)

What Changes (Minimal Refactoring):

  • 📦 Move Python scripts to separate repo
  • 🔧 Update path resolution logic (~50 lines)
  • 📝 Add installation command (~100 lines)

Current Focus: Build a robust, cross-platform transcription feature. Defer packaging decisions until the functionality is proven.

Technical Architecture

Core Components

1. Python CLI Script: interbrain-transcribe.py

  • Uses the whisper_streaming library from UFAL; its LocalAgreement policy comes from the winning CUNI-KIT system at IWSLT 2022
  • LocalAgreement-2 policy prevents duplicates and handles retroactive corrections
  • Captures microphone audio via sounddevice library
  • Appends timestamped, finalized transcripts to markdown file
  • Runs as background process managed by Obsidian plugin

2. Obsidian Plugin Integration

  • Two command palette commands: Start/Stop Real-Time Transcription
  • Uses Node.js child_process.spawn() for process management
  • Real-time stdout/stderr monitoring via event listeners
  • Process reference stored in Zustand for lifecycle management
  • Automatic cleanup on plugin unload

Why This Approach?

Decision: Python CLI over Tauri

  • Simplicity: ~150 lines Python vs ~2000+ lines Rust/Tauri/UI
  • Development Speed: 2-3 days vs 2-3 weeks
  • Proven Technology: whisper_streaming reports 3.3s latency on unsegmented long-form speech
  • Zero UI Overhead: Terminal-based, managed by Obsidian
  • Easy Maintenance: Pure Python, no build system

Decision: whisper_streaming over raw whisper.cpp

  • LocalAgreement-2 Built-In: Prevents duplicates without manual implementation
  • Seamless Retroactive Corrections: Like macOS native dictation
  • Production-Ready: Used in live multilingual conference transcription
  • Self-Adaptive Latency: Adjusts based on source complexity

Decision: Whisper over Nvidia Parakeet

  • Accuracy: Whisper remains gold standard (Parakeet trades accuracy for speed)
  • Mature Ecosystem: Extensive documentation, large community
  • Apple Silicon Support: Metal acceleration, 30x realtime speeds
  • Streaming Support: whisper_streaming built specifically for it

Implementation Specification

Python CLI Script

Command-Line Interface:

# Start transcription
python3 interbrain-transcribe.py --output "/path/to/transcript.md"

# With optional parameters
python3 interbrain-transcribe.py \
  --output "/path/to/transcript.md" \
  --device "MacBook Pro Microphone" \
  --model "small.en"

Arguments:

  • --output (required): Path to output markdown file (absolute or relative)
  • --device (optional): Microphone device ID/name for selection
  • --model (optional): Whisper model size (tiny, base, small.en, medium, large)
  • --language (optional): Language code (default: auto-detect)
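
The flags above map naturally onto argparse. The following is a minimal sketch, not the final script: the function name build_parser, the default model of small.en (taken from the plugin's spawn call), and restricting --model to the listed sizes are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags described in the Arguments list."""
    parser = argparse.ArgumentParser(
        prog="interbrain-transcribe.py",
        description="Append real-time transcripts to a markdown file.",
    )
    parser.add_argument(
        "--output", required=True,
        help="Path to the output markdown file (absolute or relative)",
    )
    parser.add_argument(
        "--device", default=None,
        help="Microphone device ID or name (default: system default)",
    )
    parser.add_argument(
        "--model", default="small.en",  # assumed default, matches the plugin call
        choices=["tiny", "base", "small.en", "medium", "large"],
        help="Whisper model size",
    )
    parser.add_argument(
        "--language", default=None,
        help="Language code (default: auto-detect)",
    )
    return parser
```

The script's entry point would then call build_parser().parse_args() and read args.output, args.device, args.model and args.language.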

Output Format:

[2025-10-03 14:32:15] Hello my name is David I'm working on InterBrain transcription system.

[2025-10-03 14:32:47] The whisper streaming library uses local agreement to prevent duplicates.

[2025-10-03 14:33:12] This is a seamless experience just like macOS native dictation.

Design Decisions:

  • Timestamped entries: [YYYY-MM-DD HH:MM:SS] format (ISO 8601 date and time, separated by a space instead of "T") for easy parsing
  • Blank line separator: Natural reading flow, easy semantic segmentation
  • Text extraction: Simple regex \[.*?\] (.*) to strip timestamps for search indexing
  • No speaker diarization: Deferred to future enhancement (Phase 2)
  • Always append: Never overwrite existing content
  • Create if missing: Automatically create output file and parent directories
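
The design decisions above can be sketched in a few lines of standard-library Python. The helper names (format_entry, append_entry, strip_timestamp) are hypothetical and only illustrate the append-only, create-if-missing, and regex-extraction behavior.

```python
import re
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
# Same pattern as in the design notes: timestamp in brackets, then the text.
ENTRY_PATTERN = re.compile(r"\[.*?\] (.*)")


def format_entry(text: str, when: datetime) -> str:
    """Return one transcript entry followed by a blank separator line."""
    return f"[{when.strftime(TIMESTAMP_FORMAT)}] {text.strip()}\n\n"


def append_entry(path, text: str, when=None) -> None:
    """Append an entry, creating the file and parent directories if missing."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" never truncates, so existing content is preserved.
    with target.open("a", encoding="utf-8") as handle:
        handle.write(format_entry(text, when or datetime.now()))


def strip_timestamp(line: str):
    """Return the text of an entry without its timestamp, or None if no match."""
    match = ENTRY_PATTERN.match(line)
    return match.group(1) if match else None
```

Passing the timestamp as a parameter keeps the formatting logic easy to unit test; the real script would simply omit it and let datetime.now() apply.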

Terminal Output (for Obsidian monitoring):

🎙️  Starting transcription to: /Users/david/transcript.md
📝 Model: small.en
🔴 Recording... (Ctrl+C to stop)

✅ Hello my name is David
✅ I'm working on InterBrain transcription system
⚠️  Audio buffer overrun (dropped 0.05s)
✅ This is a test of the transcription system

⏹️  Transcription stopped

Error Handling:

  • Microphone access denied: ❌ Microphone permission denied. Grant access in System Settings → Privacy & Security → Microphone.
  • Output file unwritable: ❌ Cannot write to [path]. Check file permissions.
  • Model download needed: 📥 Downloading whisper model 'small.en' (first run, ~500MB)...
  • Audio device not found: ❌ Device 'X' not found. Available devices:\n - MacBook Pro Microphone\n - External USB Mic
  • whisper_streaming import error: ❌ whisper_streaming not installed. Run: pip install whisper-streaming
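
As an illustration of the "device not found" case, the sketch below resolves a --device value against a list of device names and raises an error carrying the message format shown above. DeviceNotFoundError and resolve_device are hypothetical names, and the list of names is assumed to come from the audio library (for example from sounddevice's device query) rather than being built here.

```python
from typing import Sequence


class DeviceNotFoundError(Exception):
    """Raised when the requested microphone cannot be matched."""


def resolve_device(requested: str, available: Sequence[str]) -> int:
    """Return the index of the requested device, given by ID or exact name."""
    # Accept a numeric ID if it is within range.
    if requested.isdigit() and int(requested) < len(available):
        return int(requested)
    # Otherwise fall back to an exact name match.
    for index, name in enumerate(available):
        if name == requested:
            return index
    listing = "\n".join(f" - {name}" for name in available)
    raise DeviceNotFoundError(
        f"❌ Device '{requested}' not found. Available devices:\n{listing}"
    )
```

The script would catch DeviceNotFoundError at the top level, write the message to stderr, and exit with a non-zero code so the plugin's stderr listener can surface it.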

Obsidian Plugin Integration

New Commands File: src/features/realtime-transcription/commands/transcription-commands.ts

Command 1: Start Real-Time Transcription

plugin.addCommand({
  id: 'start-realtime-transcription',
  name: 'Start Real-Time Transcription',
  hotkeys: [{ modifiers: ['Ctrl', 'Shift'], key: 't' }],
  callback: async () => {
    const store = useInterBrainStore.getState();
    
    // Check if already running
    if (store.transcriptionProcess) {
      uiService.showWarning('Transcription already running');
      return;
    }
    
    // Validate active file is markdown
    const activeFile = plugin.app.workspace.getActiveFile();
    if (!activeFile || !activeFile.path.endsWith('.md')) {
      uiService.showError('Please open a markdown file for transcript output');
      return;
    }
    
    // Get full file path
    const transcriptPath = vaultService.getFullPath(activeFile.path);
    
    // Spawn Python process
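    const { spawn } = require('child_process');
    // manifest.dir is a vault-relative path, so resolve it to an absolute path first
    const scriptPath = require('path').join(
      vaultService.getFullPath(plugin.manifest.dir),
      'features',
      'realtime-transcription',
      'scripts',
      'interbrain-transcribe.py'
    );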
    
    const process = spawn('python3', [
      scriptPath,
      '--output', transcriptPath,
      '--model', 'small.en'
    ]);
    
    // Monitor stdout for status updates
    process.stdout.on('data', (data: Buffer) => {
      const output = data.toString().trim();
      console.log(`[Transcription] ${output}`);
    });
    
    // Monitor stderr for errors
    process.stderr.on('data', (data: Buffer) => {
      const error = data.toString().trim();
      console.error(`[Transcription Error] ${error}`);
      
      if (error.includes('permission denied')) {
        uiService.showError('Microphone permission denied');
      } else {
        uiService.showError('Transcription error - see console');
      }
    });
    
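    // Handle spawn failures (e.g. python3 not on PATH); without a listener
    // the 'error' event surfaces as an uncaught exception
    process.on('error', (err: Error) => {
      console.error(`[Transcription Error] ${err.message}`);
      store.setTranscriptionProcess(null);
      uiService.showError('Could not start transcription - is python3 installed?');
    });
    
    // Handle process exit; code is null when the process was ended by a signal
    process.on('close', (code: number | null, signal: string | null) => {
      console.log(`Transcription process exited (code: ${code}, signal: ${signal})`);
      store.setTranscriptionProcess(null);
      
      if (code === 0 || signal === 'SIGTERM') {
        uiService.showSuccess('Transcription stopped');
      } else {
        uiService.showError(`Transcription failed (exit code: ${code})`);
      }
    });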
    
    // Store process reference
    store.setTranscriptionProcess(process);
    uiService.showSuccess('🎙️ Transcription started');
  }
});

Command 2: Stop Real-Time Transcription

plugin.addCommand({
  id: 'stop-realtime-transcription',
  name: 'Stop Real-Time Transcription',
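  // Same key as start. Two commands sharing a hotkey do not toggle on their own;
  // one-key start/stop would need a single toggle command.
  hotkeys: [{ modifiers: ['Ctrl', 'Shift'], key: 't' }],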
  callback: () => {
    const store = useInterBrainStore.getState();
    const process = store.transcriptionProcess;
    
    if (!process) {
      uiService.showWarning('No active transcription');
      return;
    }
    
    console.log('Stopping transcription process...');
    
    // Send SIGTERM for graceful shutdown
    process.kill('SIGTERM');
    
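    // Force kill after 5 seconds if still running. exitCode stays null when the
    // process is ended by a signal, so check signalCode as well.
    setTimeout(() => {
      if (process.exitCode === null && process.signalCode === null) {
        console.warn('Force killing transcription process');
        process.kill('SIGKILL');
      }
    }, 5000);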
    
    uiService.showInfo('Stopping transcription...');
  }
});

Zustand Store Extension:

interface InterBrainStore {
  // ... existing state ...
  transcriptionProcess: ChildProcess | null;
  setTranscriptionProcess: (process: ChildProcess | null) => void;
}

Plugin Cleanup (main.ts):

onunload() {
  // Kill transcription process on plugin unload
  const store = useInterBrainStore.getState();
  if (store.transcriptionProcess) {
    console.log('Cleaning up transcription process...');
    store.transcriptionProcess.kill('SIGTERM');
  }
}

Dependencies

Python Dependencies (requirements.txt):

whisper-streaming>=0.1.0
sounddevice>=0.4.6
numpy>=1.24.0
faster-whisper>=0.10.0  # Backend for whisper_streaming

Installation:

pip install -r src/features/realtime-transcription/scripts/requirements.txt

First-Time Model Download:
The whisper model will auto-download on first run (~500MB for small.en). User sees progress:

📥 Downloading whisper model 'small.en' (first run)...
⬇️  Progress: 45% (225MB / 500MB)
✅ Model downloaded successfully

Implementation Plan

Phase 1: Python CLI Script (~1 day)

Tasks:

  1. Create src/features/realtime-transcription/scripts/interbrain-transcribe.py with basic structure
  2. Integrate whisper_streaming library with LocalAgreement-2
  3. Implement audio capture with sounddevice (16kHz mono)
  4. Add file writing with [YYYY-MM-DD HH:MM:SS] timestamps
  5. Implement command-line argument parsing
  6. Add graceful error handling with user-friendly messages
  7. Test LocalAgreement-2 duplicate prevention manually

Acceptance Criteria:

  • Script launches and displays startup message
  • Captures microphone audio successfully
  • Transcribes speech with <5s latency
  • Appends to markdown file with correct timestamp format
  • No duplicate sentences in output (LocalAgreement working)
  • Stops gracefully on Ctrl+C (SIGINT) and on SIGTERM sent by the plugin
  • Error messages are clear and actionable
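
One way to meet the graceful-stop criterion is a shared stop flag set by the signal handlers. This is a sketch under assumptions: the names stop_requested, request_stop, install_signal_handlers and run are hypothetical, and process_chunk stands in for "read audio, transcribe, append finalized text".

```python
import signal
import threading

# Set by the signal handler, polled by the main loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the main loop to finish the current chunk and exit."""
    stop_requested.set()


def install_signal_handlers() -> None:
    """Route Ctrl+C (SIGINT) and the plugin's SIGTERM to the same handler."""
    signal.signal(signal.SIGINT, request_stop)
    signal.signal(signal.SIGTERM, request_stop)


def run(process_chunk, on_stop) -> None:
    """Call process_chunk until a stop is requested, then run cleanup once."""
    while not stop_requested.is_set():
        process_chunk()
    # Cleanup, e.g. flush pending text and close the audio stream.
    on_stop()
```

The script would call install_signal_handlers() once at startup, from the main thread, before entering run(). Note that on Windows, Node's kill() terminates the child abruptly, so the handler is not expected to run there when the plugin stops the process.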

Phase 2: Obsidian Plugin Integration (~1 day)

Tasks:

  1. Create src/features/realtime-transcription/commands/transcription-commands.ts
  2. Add two command palette commands (start/stop)
  3. Extend Zustand store with transcriptionProcess state
  4. Implement process spawning with stdout/stderr monitoring
  5. Add process cleanup in plugin onunload
  6. Register commands in main.ts
  7. Test full start/stop workflow

Acceptance Criteria:

  • "Start Real-Time Transcription" command appears in palette
  • "Stop Real-Time Transcription" command appears in palette
  • Commands have keyboard shortcuts (Ctrl+Shift+T toggles)
  • Active markdown file path is correctly passed to Python
  • Process starts and stdout appears in DevTools console
  • Process stops cleanly when commanded
  • Process is killed when plugin unloads
  • Prevents starting multiple transcription instances

Phase 3: Testing & Documentation (~1 day)

Tasks:

  1. Test on macOS Apple Silicon (primary target)
  2. Test error scenarios (no mic, permission denied, file locked)
  3. Validate transcription accuracy with real speech
  4. Test rapid start/stop cycles
  5. Verify no memory leaks or zombie processes
  6. Write README with installation/usage instructions
  7. Update CLAUDE.md with transcription workflow

Acceptance Criteria:

  • Works reliably on Apple Silicon Mac
  • Handles microphone permission errors gracefully
  • Handles file write errors gracefully
  • No zombie processes left after stopping
  • Memory usage stable (<200MB for Python process)
  • CPU usage reasonable (<50% on M1)
  • README includes clear setup instructions
  • CLAUDE.md documents integration points

Phase 4: Polish & Future Enhancements (Optional)

Potential Future Improvements:

  • Multiple language support: CLI flag for language selection
  • Speaker diarization: Integrate pyannote.audio for multi-speaker transcription
  • Custom model paths: Allow local whisper model file selection
  • Pause/resume: Two-way stdin communication for pause control
  • Menu bar indicator: Optional Python rumps app for visual status
  • Auto-indexing: Trigger semantic search indexing after transcription
  • Session metadata: Include session start/end markers in transcript

File Structure

InterBrain/
├── src/
│   └── features/
│       └── realtime-transcription/      # Self-contained feature
│           ├── README.md                # Feature documentation
│           ├── index.ts                 # Feature exports
│           ├── commands/
│           │   └── transcription-commands.ts
│           ├── services/
│           │   └── transcription-service.ts
│           ├── store/
│           │   └── transcription-store.ts
│           ├── scripts/                 # Python scripts (co-located)
│           │   ├── interbrain-transcribe.py
│           │   └── requirements.txt
│           ├── types/
│           │   └── transcription-types.ts
│           └── tests/
│               └── transcription-service.test.ts
└── CLAUDE.md                            # Updated with integration notes

Testing Strategy

Unit Tests (Python)

  • File writing and timestamp formatting
  • Path validation (absolute/relative)
  • Error message generation
  • Command-line argument parsing

Integration Tests (Python + Audio)

  • whisper_streaming integration
  • Audio capture from mock device
  • LocalAgreement-2 behavior verification
  • Process signal handling (SIGTERM, SIGINT)

Manual Testing (End-to-End)

  • Real dictation sessions with natural speech patterns
  • Various microphone devices
  • File permission edge cases
  • Rapid start/stop cycles
  • Plugin reload scenarios

Performance Testing

  • CPU usage monitoring during transcription
  • Memory leak detection (run for 30+ minutes)
  • Latency measurement (speech to file write)
  • Concurrent Obsidian usage (no UI lag)

Success Criteria

MVP Definition of Done

  • Python script transcribes speech to markdown with timestamps
  • LocalAgreement-2 prevents all duplicate sentences
  • Obsidian commands start/stop transcription successfully
  • Real-time stdout monitoring works in DevTools console
  • Process cleanup works on plugin unload
  • Error handling provides clear, actionable messages
  • Works reliably on Apple Silicon Macs
  • Latency consistently <5 seconds
  • Documentation covers installation and usage

Quality Gates

  • Zero lint errors/warnings
  • TypeScript strict mode compliance
  • All manual test scenarios pass
  • No memory leaks after 30-minute session
  • CPU usage <50% during active transcription
  • Process cleanup leaves no zombies

Dependencies & Prerequisites

System Requirements

  • macOS: 10.15+ (Apple Silicon preferred)
  • Windows: 10+ (with Python 3.8+)
  • Linux: Any modern distro (with Python 3.8+)
  • Python: 3.8+ (with pip)
  • Microphone: Any USB or built-in mic
  • Disk Space: ~1GB (for whisper models)

Python Environment

# Install dependencies
pip install -r src/features/realtime-transcription/scripts/requirements.txt

# First run downloads model (~500MB)
python3 src/features/realtime-transcription/scripts/interbrain-transcribe.py --output test.md

Obsidian Plugin

  • Node.js child_process module (built-in)
  • Zustand store access
  • Command palette registration

Future Integration Points

Semantic Search (Epic 5)

  • Auto-index transcripts after session ends
  • Extract text without timestamps for embedding
  • Link transcript nodes to DreamNodes

Copilot Mode (Future Epic)

  • Real-time transcription during DreamSong creation
  • Voice-driven canvas node creation
  • Spoken relationship mapping

DreamWeaving (Epic 6)

  • Transcribe video calls for DreamSong content
  • Auto-generate DreamTalk summaries from transcripts
  • Voice annotation for canvas nodes

Technical References

Research Links

Related Issues

Notes

Why Not Tauri?

We originally considered a Tauri system tray app, but it would add unnecessary complexity:

  • No UI actually needed (Obsidian commands suffice)
  • Terminal output via stdout is perfect for monitoring
  • Python subprocess avoids Rust/IPC overhead
  • Much faster development cycle

Why LocalAgreement-2?

Prevents the duplicate sentence problem that plagues naive streaming approaches:

  • Waits for 2 consecutive chunks to agree on prefix before committing
  • Handles retroactive corrections seamlessly (like macOS dictation)
  • Proven in competition: used by the winning system in the IWSLT 2022 simultaneous speech translation task
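
The core idea can be shown with word lists. The sketch below is a simplified illustration of the policy, not whisper_streaming's actual implementation (which also tracks timestamps and trims the audio buffer). The class name LocalAgreement2 and its methods are hypothetical.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Commit a word only once two consecutive hypotheses agree on it."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already written out
        self.previous: List[str] = []   # hypothesis from the previous chunk

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the latest full hypothesis and return newly committed words."""
        agreed = common_prefix(self.previous, hypothesis)
        self.previous = hypothesis
        # Only words beyond what was already committed are new.
        new_words = agreed[len(self.committed):]
        self.committed.extend(new_words)
        return new_words
```

If the first hypothesis is "hello my main" and the second is "hello my name is", only "hello my" is committed: the misheard "main" never reaches the file because the two hypotheses disagree on it. Each committed word is returned exactly once, which is what keeps the transcript free of duplicates.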

Performance Expectations

  • Latency: 3.3 seconds (whisper_streaming benchmark)
  • Accuracy: Whisper-grade (best in class)
  • CPU: <50% on Apple Silicon M1
  • Memory: ~150-200MB for Python process
  • Disk I/O: Negligible (appending text to file)
