Real-Time Transcription System with whisper_streaming #335

## Overview

A minimalist Python CLI tool for real-time speech transcription that integrates seamlessly with the InterBrain Obsidian plugin. Uses whisper_streaming with LocalAgreement-2 policy to append timestamped, duplicate-free transcripts to markdown files for post-call summarization and semantic search integration.

## Vision

Enable effortless voice-to-text capture during conversations, meetings, and ideation sessions. Transcripts appear seamlessly in markdown files with timestamps, ready for semantic search indexing and DreamNode integration. No UI complexity - just terminal commands orchestrated by Obsidian.

## Architectural Decision: Deferred Marketplace Compatibility

**Current Implementation Strategy**: Build as self-contained vertical slice feature with Python scripts included in `src/features/realtime-transcription/scripts/`.

**Rationale**:
- **Faster Development**: No external repo coordination needed during initial development
- **Easy Testing**: Scripts co-located with TypeScript code for rapid iteration
- **Future-Proof**: Minimal refactoring cost to split later (~2-3 hours)
- **Cross-Platform**: Works on macOS, Windows, Linux with Python 3.8+

**Path to Marketplace Compatibility (Optional Future)**:

If we later decide to pursue Obsidian marketplace listing:
1. Extract `scripts/` directory to separate `interbrain-transcription-extension` repository
2. Update `transcription-service.ts` path resolution to check vault `.interbrain/extensions/` first
3. Add installation command for extension setup (git clone + pip install instructions)
4. Submit core plugin (without Python scripts) to marketplace

**What Stays The Same** (No Rewrite):
- ✅ All TypeScript code (commands, services, store)
- ✅ Python script logic
- ✅ Process spawning (just path argument changes)
- ✅ Feature architecture (self-contained vertical slice)

**What Changes** (Minimal Refactoring):
- 📦 Move Python scripts to separate repo
- 🔧 Update path resolution logic (~50 lines)
- 📝 Add installation command (~100 lines)

**Current Focus**: Build robust, cross-platform transcription feature. Defer packaging decisions until functionality proven.

## Technical Architecture

### Core Components

**1. Python CLI Script: `interbrain-transcribe.py`**
- Uses `whisper_streaming` library (UFAL); its LocalAgreement policy comes from the winning IWSLT 2022 simultaneous speech translation system
- LocalAgreement-2 policy prevents duplicates and handles retroactive corrections
- Captures microphone audio via `sounddevice` library
- Appends timestamped, finalized transcripts to markdown file
- Runs as background process managed by Obsidian plugin

**2. Obsidian Plugin Integration**
- Two command palette commands: Start/Stop Real-Time Transcription
- Uses Node.js `child_process.spawn()` for process management
- Real-time stdout/stderr monitoring via event listeners
- Process reference stored in Zustand for lifecycle management
- Automatic cleanup on plugin unload

### Why This Approach?

**Decision: Python CLI over Tauri**
- **Simplicity**: ~150 lines Python vs ~2000+ lines Rust/Tauri/UI
- **Development Speed**: 2-3 days vs 2-3 weeks
- **Proven Technology**: whisper_streaming reports ~3.3s latency on its long-form benchmark (measured on GPU; local results will vary)
- **Zero UI Overhead**: Terminal-based, managed by Obsidian
- **Easy Maintenance**: Pure Python, no build system

**Decision: whisper_streaming over raw whisper.cpp**
- **LocalAgreement-2 Built-In**: Prevents duplicates without manual implementation
- **Seamless Retroactive Corrections**: Like macOS native dictation
- **Production-Ready**: Used in live multilingual conference transcription
- **Self-Adaptive Latency**: Adjusts based on source complexity

**Decision: whisper.cpp over Nvidia Parakeet**
- **Accuracy**: Whisper remains gold standard (Parakeet trades accuracy for speed)
- **Mature Ecosystem**: Extensive documentation, large community
- **Apple Silicon Support**: Metal acceleration, 30x realtime speeds
- **Streaming Support**: whisper_streaming built specifically for it

## Implementation Specification

### Python CLI Script

**Command-Line Interface:**
```bash
# Start transcription
python3 interbrain-transcribe.py --output "/path/to/transcript.md"

# With optional parameters
python3 interbrain-transcribe.py \
  --output "/path/to/transcript.md" \
  --device "MacBook Pro Microphone" \
  --model "small.en"
```

**Arguments:**
- `--output` (required): Path to output markdown file (absolute or relative)
- `--device` (optional): Microphone device ID/name for selection
- `--model` (optional): Whisper model size (tiny, base, small.en, medium, large)
- `--language` (optional): Language code (default: auto-detect)
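
The four flags above map directly onto `argparse`. The following is a minimal sketch, not the final script: `build_parser` is a hypothetical helper name, and the `small.en` default for `--model` is an assumption taken from the example output rather than a stated requirement.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags listed above."""
    parser = argparse.ArgumentParser(
        prog="interbrain-transcribe.py",
        description="Append real-time transcripts to a markdown file.",
    )
    parser.add_argument(
        "--output", required=True,
        help="Path to output markdown file (absolute or relative)",
    )
    parser.add_argument(
        "--device", default=None,
        help="Microphone device ID or name (default: system default input)",
    )
    parser.add_argument(
        "--model", default="small.en",  # assumed default, see note above
        help="Whisper model size: tiny, base, small.en, medium, large",
    )
    parser.add_argument(
        "--language", default=None,
        help="Language code (default: auto-detect)",
    )
    return parser
```

The script's entry point would then call `build_parser().parse_args()`; `argparse` prints a usage message and exits with code 2 when `--output` is missing.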

**Output Format:**
```markdown
[2025-10-03 14:32:15] Hello my name is David I'm working on InterBrain transcription system.

[2025-10-03 14:32:47] The whisper streaming library uses local agreement to prevent duplicates.

[2025-10-03 14:33:12] This is a seamless experience just like macOS native dictation.
```

**Design Decisions:**
- **Timestamped entries**: ISO 8601-style format `[YYYY-MM-DD HH:MM:SS]` (space instead of the `T` separator) for easy parsing
- **Blank line separator**: Natural reading flow, easy semantic segmentation
- **Text extraction**: Simple regex `\[.*?\] (.*)` to strip timestamps for search indexing
- **No speaker diarization**: Deferred to future enhancement (Phase 2)
- **Always append**: Never overwrite existing content
- **Create if missing**: Automatically create output file and parent directories
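
These decisions can be sketched with the standard library alone. The helper names `append_entry` and `strip_timestamps` are hypothetical, and the sketch assumes the target file, if it already exists, ends with a newline.

```python
import re
from datetime import datetime
from pathlib import Path

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
# Same pattern as in the list above, anchored per line.
ENTRY_PATTERN = re.compile(r"^\[.*?\] (.*)$", re.MULTILINE)


def append_entry(output_path, text, now=None):
    """Append one finalized transcript entry, creating file and parents if missing."""
    path = Path(output_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime(TIMESTAMP_FORMAT)
    # Mode "a" never truncates, so existing note content is preserved.
    with path.open("a", encoding="utf-8") as handle:
        # Trailing blank line is the entry separator.
        handle.write(f"[{stamp}] {text.strip()}\n\n")


def strip_timestamps(markdown):
    """Return the text of each timestamped entry without its prefix."""
    return ENTRY_PATTERN.findall(markdown)
```

Passing `now` explicitly keeps the function testable; in the running script it would be left at its default so each entry gets the wall-clock time at which it was finalized.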

**Terminal Output (for Obsidian monitoring):**
```
🎙️  Starting transcription to: /Users/david/transcript.md
📝 Model: small.en
🔴 Recording... (Ctrl+C to stop)

✅ Hello my name is David
✅ I'm working on InterBrain transcription system
⚠️  Audio buffer overrun (dropped 0.05s)
✅ This is a test of the transcription system

⏹️  Transcription stopped
```
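
One detail matters for this kind of monitoring: when Python's stdout is a pipe rather than a terminal, it is block-buffered by default, so status lines may reach the plugin late unless each one is flushed. A minimal sketch of a status helper follows; `report` and `report_error` are hypothetical names.

```python
import sys


def report(symbol, message, stream=None):
    """Write one status line and flush so a parent process sees it immediately."""
    if stream is None:
        stream = sys.stdout
    print(f"{symbol} {message}", file=stream, flush=True)


def report_error(message, stream=None):
    """Write an error line to stderr (or the given stream) and flush."""
    report("❌", message, stream=sys.stderr if stream is None else stream)
```

Running the script with `python3 -u` or setting `PYTHONUNBUFFERED=1` is an alternative that avoids per-call flushing.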

**Error Handling:**
- **Microphone access denied**: `❌ Microphone permission denied. Grant access in System Settings → Privacy & Security → Microphone.`
- **Output file unwritable**: `❌ Cannot write to [path]. Check file permissions.`
- **Model download needed**: `📥 Downloading whisper model 'small.en' (first run, ~500MB)...`
- **Audio device not found**: `❌ Device 'X' not found. Available devices:\n  - MacBook Pro Microphone\n  - External USB Mic`
- **whisper_streaming import error**: `❌ whisper_streaming not installed. Run: pip install whisper-streaming`
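
Two of these messages can be produced without touching the audio stack. The sketch below uses hypothetical helper names. It assumes the list of available device names has already been obtained elsewhere (for example from `sounddevice`).

```python
from pathlib import Path


def device_not_found_message(requested, available):
    """Format the 'device not found' message with the available device names."""
    lines = [f"❌ Device '{requested}' not found. Available devices:"]
    lines += [f"  - {name}" for name in available]
    return "\n".join(lines)


def check_output_writable(output_path):
    """Return an error message if the path cannot be appended to, else None."""
    path = Path(output_path)
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        # Opening in append mode creates the file but never truncates it.
        with path.open("a", encoding="utf-8"):
            pass
    except OSError:
        return f"❌ Cannot write to {path}. Check file permissions."
    return None
```

Checking writability once at startup lets the script fail before the model is loaded, rather than on the first finalized sentence.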

### Obsidian Plugin Integration

**New Commands File: `src/features/realtime-transcription/commands/transcription-commands.ts`**

**Command 1: Start Real-Time Transcription**
```typescript
plugin.addCommand({
  id: 'start-realtime-transcription',
  name: 'Start Real-Time Transcription',
  hotkeys: [{ modifiers: ['Ctrl', 'Shift'], key: 't' }],
  callback: async () => {
    const store = useInterBrainStore.getState();
    
    // Check if already running
    if (store.transcriptionProcess) {
      uiService.showWarning('Transcription already running');
      return;
    }
    
    // Validate active file is markdown
    const activeFile = plugin.app.workspace.getActiveFile();
    if (!activeFile || !activeFile.path.endsWith('.md')) {
      uiService.showError('Please open a markdown file for transcript output');
      return;
    }
    
    // Get full file path
    const transcriptPath = vaultService.getFullPath(activeFile.path);
    
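    // Spawn Python process
    const { spawn } = require('child_process');
    // manifest.dir is relative to the vault root, so resolve it to an absolute path first
    const scriptPath = require('path').join(
      vaultService.getFullPath(plugin.manifest.dir),
      'features',
      'realtime-transcription',
      'scripts',
      'interbrain-transcribe.py'
    );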
    
    const process = spawn('python3', [
      scriptPath,
      '--output', transcriptPath,
      '--model', 'small.en'
    ]);
    
    // Monitor stdout for status updates
    process.stdout.on('data', (data: Buffer) => {
      const output = data.toString().trim();
      console.log(`[Transcription] ${output}`);
    });
    
    // Monitor stderr for errors
    process.stderr.on('data', (data: Buffer) => {
      const error = data.toString().trim();
      console.error(`[Transcription Error] ${error}`);
      
      if (error.includes('permission denied')) {
        uiService.showError('Microphone permission denied');
      } else {
        uiService.showError('Transcription error - see console');
      }
    });
    
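    // Handle spawn failures (e.g. python3 not found on PATH)
    process.on('error', (err: Error) => {
      console.error(`[Transcription Error] ${err.message}`);
      store.setTranscriptionProcess(null);
      uiService.showError('Could not start transcription - is python3 installed?');
    });
    
    // Handle process exit; code is null when the process was terminated by a signal
    process.on('close', (code: number | null, signal: string | null) => {
      console.log(`Transcription process exited (code: ${code}, signal: ${signal})`);
      store.setTranscriptionProcess(null);
      
      // A SIGTERM exit is the result of the stop command, not a failure
      if (code === 0 || signal === 'SIGTERM') {
        uiService.showSuccess('Transcription stopped');
      } else {
        uiService.showError(`Transcription failed (exit code: ${code})`);
      }
    });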
    
    // Store process reference
    store.setTranscriptionProcess(process);
    uiService.showSuccess('🎙️ Transcription started');
  }
});
```

**Command 2: Stop Real-Time Transcription**
```typescript
plugin.addCommand({
  id: 'stop-realtime-transcription',
  name: 'Stop Real-Time Transcription',
  // No default hotkey here: Obsidian binds a hotkey to one command, so sharing Ctrl+Shift+T with the start command would conflict instead of toggling
  callback: () => {
    const store = useInterBrainStore.getState();
    const process = store.transcriptionProcess;
    
    if (!process) {
      uiService.showWarning('No active transcription');
      return;
    }
    
    console.log('Stopping transcription process...');
    
    // Send SIGTERM for graceful shutdown
    process.kill('SIGTERM');
    
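    // Force kill after 5 seconds if still running
    // (exitCode stays null after a signal kill, so signalCode must be checked too)
    setTimeout(() => {
      if (process.exitCode === null && process.signalCode === null) {
        console.warn('Force killing transcription process');
        process.kill('SIGKILL');
      }
    }, 5000);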
    
    uiService.showInfo('Stopping transcription...');
  }
});
```

**Zustand Store Extension:**
```typescript
interface InterBrainStore {
  // ... existing state ...
  transcriptionProcess: ChildProcess | null;
  setTranscriptionProcess: (process: ChildProcess | null) => void;
}
```

**Plugin Cleanup (main.ts):**
```typescript
onunload() {
  // Kill transcription process on plugin unload
  const store = useInterBrainStore.getState();
  if (store.transcriptionProcess) {
    console.log('Cleaning up transcription process...');
    store.transcriptionProcess.kill('SIGTERM');
  }
}
```

### Dependencies

**Python Dependencies (`requirements.txt`):**
```
whisper-streaming>=0.1.0
sounddevice>=0.4.6
numpy>=1.24.0
faster-whisper>=0.10.0  # Backend for whisper_streaming
```

**Installation:**
```bash
pip install -r src/features/realtime-transcription/scripts/requirements.txt
```

**First-Time Model Download:**
The whisper model will auto-download on first run (~500MB for small.en). User sees progress:
```
📥 Downloading whisper model 'small.en' (first run)...
⬇️  Progress: 45% (225MB / 500MB)
✅ Model downloaded successfully
```

## Implementation Plan

### Phase 1: Python CLI Script (~1 day)

**Tasks:**
1. Create `src/features/realtime-transcription/scripts/interbrain-transcribe.py` with basic structure
2. Integrate whisper_streaming library with LocalAgreement-2
3. Implement audio capture with sounddevice (16kHz mono)
4. Add file writing with ISO 8601 timestamps
5. Implement command-line argument parsing
6. Add graceful error handling with user-friendly messages
7. Test LocalAgreement-2 duplicate prevention manually

**Acceptance Criteria:**
- [ ] Script launches and displays startup message
- [ ] Captures microphone audio successfully
- [ ] Transcribes speech with <5s latency
- [ ] Appends to markdown file with correct timestamp format
- [ ] No duplicate sentences in output (LocalAgreement working)
- [ ] Stops gracefully on Ctrl+C (SIGINT) and on SIGTERM
- [ ] Error messages are clear and actionable

### Phase 2: Obsidian Plugin Integration (~1 day)

**Tasks:**
1. Create `src/features/realtime-transcription/commands/transcription-commands.ts`
2. Add two command palette commands (start/stop)
3. Extend Zustand store with transcriptionProcess state
4. Implement process spawning with stdout/stderr monitoring
5. Add process cleanup in plugin onunload
6. Register commands in main.ts
7. Test full start/stop workflow

**Acceptance Criteria:**
- [ ] "Start Real-Time Transcription" command appears in palette
- [ ] "Stop Real-Time Transcription" command appears in palette
- [ ] Start command has a default keyboard shortcut (Ctrl+Shift+T)
- [ ] Active markdown file path is correctly passed to Python
- [ ] Process starts and stdout appears in DevTools console
- [ ] Process stops cleanly when commanded
- [ ] Process is killed when plugin unloads
- [ ] Prevents starting multiple transcription instances

### Phase 3: Testing & Documentation (~1 day)

**Tasks:**
1. Test on macOS Apple Silicon (primary target)
2. Test error scenarios (no mic, permission denied, file locked)
3. Validate transcription accuracy with real speech
4. Test rapid start/stop cycles
5. Verify no memory leaks or zombie processes
6. Write README with installation/usage instructions
7. Update CLAUDE.md with transcription workflow

**Acceptance Criteria:**
- [ ] Works reliably on Apple Silicon Mac
- [ ] Handles microphone permission errors gracefully
- [ ] Handles file write errors gracefully
- [ ] No zombie processes left after stopping
- [ ] Memory usage stable (<200MB for Python process)
- [ ] CPU usage reasonable (<50% on M1)
- [ ] README includes clear setup instructions
- [ ] CLAUDE.md documents integration points

### Phase 4: Polish & Future Enhancements (Optional)

**Potential Future Improvements:**
- **Multiple language support**: CLI flag for language selection
- **Speaker diarization**: Integrate pyannote.audio for multi-speaker transcription
- **Custom model paths**: Allow local whisper model file selection
- **Pause/resume**: Two-way stdin communication for pause control
- **Menu bar indicator**: Optional Python `rumps` app for visual status
- **Auto-indexing**: Trigger semantic search indexing after transcription
- **Session metadata**: Include session start/end markers in transcript

## File Structure

```
InterBrain/
├── src/
│   └── features/
│       └── realtime-transcription/      # Self-contained feature
│           ├── README.md                # Feature documentation
│           ├── index.ts                 # Feature exports
│           ├── commands/
│           │   └── transcription-commands.ts
│           ├── services/
│           │   └── transcription-service.ts
│           ├── store/
│           │   └── transcription-store.ts
│           ├── scripts/                 # Python scripts (co-located)
│           │   ├── interbrain-transcribe.py
│           │   └── requirements.txt
│           ├── types/
│           │   └── transcription-types.ts
│           └── tests/
│               └── transcription-service.test.ts
└── CLAUDE.md                            # Updated with integration notes
```

## Testing Strategy

### Unit Tests (Python)
- File writing and timestamp formatting
- Path validation (absolute/relative)
- Error message generation
- Command-line argument parsing

### Integration Tests (Python + Audio)
- whisper_streaming integration
- Audio capture from mock device
- LocalAgreement-2 behavior verification
- Process signal handling (SIGTERM, SIGINT)

### Manual Testing (End-to-End)
- Real dictation sessions with natural speech patterns
- Various microphone devices
- File permission edge cases
- Rapid start/stop cycles
- Plugin reload scenarios

### Performance Testing
- CPU usage monitoring during transcription
- Memory leak detection (run for 30+ minutes)
- Latency measurement (speech to file write)
- Concurrent Obsidian usage (no UI lag)

## Success Criteria

### MVP Definition of Done
- [ ] Python script transcribes speech to markdown with timestamps
- [ ] LocalAgreement-2 prevents all duplicate sentences
- [ ] Obsidian commands start/stop transcription successfully
- [ ] Real-time stdout monitoring works in DevTools console
- [ ] Process cleanup works on plugin unload
- [ ] Error handling provides clear, actionable messages
- [ ] Works reliably on Apple Silicon Macs
- [ ] Latency consistently <5 seconds
- [ ] Documentation covers installation and usage

### Quality Gates
- Zero lint errors/warnings
- TypeScript strict mode compliance
- All manual test scenarios pass
- No memory leaks after 30-minute session
- CPU usage <50% during active transcription
- Process cleanup leaves no zombies

## Dependencies & Prerequisites

### System Requirements
- **macOS**: 10.15+ (Apple Silicon preferred)
- **Windows**: 10+ (with Python 3.8+)
- **Linux**: Any modern distro (with Python 3.8+)
- **Python**: 3.8+ (with pip)
- **Microphone**: Any USB or built-in mic
- **Disk Space**: ~1GB (for whisper models)

### Python Environment
```bash
# Install dependencies
pip install -r src/features/realtime-transcription/scripts/requirements.txt

# First run downloads model (~500MB)
python3 src/features/realtime-transcription/scripts/interbrain-transcribe.py --output test.md
```

### Obsidian Plugin
- Node.js `child_process` module (built-in)
- Zustand store access
- Command palette registration

## Future Integration Points

### Semantic Search (Epic 5)
- Auto-index transcripts after session ends
- Extract text without timestamps for embedding
- Link transcript nodes to DreamNodes

### Copilot Mode (Future Epic)
- Real-time transcription during DreamSong creation
- Voice-driven canvas node creation
- Spoken relationship mapping

### DreamWeaving (Epic 6)
- Transcribe video calls for DreamSong content
- Auto-generate DreamTalk summaries from transcripts
- Voice annotation for canvas nodes

## Technical References

### Research Links
- [whisper_streaming (UFAL)](https://github.com/ufal/whisper_streaming) - LocalAgreement-2 implementation
- [whisper.cpp](https://github.com/ggml-org/whisper.cpp) - C++ backend with Metal support
- [LocalAgreement Paper](https://arxiv.org/html/2307.14743) - Academic foundation for streaming policy

### Related Issues
- Epic 7: #334 - Conversational Copilot System (parent epic)
- Epic 5: #256 - Semantic Search System (transcript indexing)
- Epic 6: #259 - DreamWeaving Operations (video transcription)

## Notes

### Why Not Tauri?
Originally considered Tauri system tray app but realized unnecessary complexity:
- No UI actually needed (Obsidian commands suffice)
- Terminal output via stdout is perfect for monitoring
- Python subprocess avoids Rust/IPC overhead
- Much faster development cycle

### Why LocalAgreement-2?
Prevents the duplicate sentence problem that plagues naive streaming approaches:
- Waits for 2 consecutive chunks to agree on prefix before committing
- Handles retroactive corrections seamlessly (like macOS dictation)
- LocalAgreement was used by the winning system in the IWSLT 2022 simultaneous speech translation task
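
The commit rule can be illustrated on plain word lists. This is a toy re-implementation for explanation only, not the code of `whisper_streaming`. The class name `LocalAgreement2` is hypothetical, and the sketch assumes each new hypothesis begins with the words already committed (the real library also handles audio buffer trimming and word timestamps).

```python
class LocalAgreement2:
    """Commit a word only once two consecutive hypotheses agree on it."""

    def __init__(self):
        self.committed = []  # words already written out; never revised
        self.previous = []   # unconfirmed tail of the previous hypothesis

    def update(self, hypothesis):
        """Take the newest full hypothesis and return the newly confirmed words."""
        # Only look at words beyond what has already been committed.
        tail = hypothesis[len(self.committed):]
        confirmed = []
        # Longest common prefix of the previous and current unconfirmed tails.
        for old, new in zip(self.previous, tail):
            if old != new:
                break
            confirmed.append(new)
        self.committed.extend(confirmed)
        # Whatever is still unconfirmed waits for the next update.
        self.previous = tail[len(confirmed):]
        return confirmed
```

For example, if successive hypotheses end in "is Dave" and then "is David and", only "is" is confirmed on the second update. "David" is written once a third hypothesis repeats it, which is how a correction never shows up as a duplicate in the file.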

### Performance Expectations
- **Latency**: ~3.3 seconds as reported for whisper_streaming (measured on GPU; to be verified on target hardware)
- **Accuracy**: Whisper-grade (best in class)
- **CPU**: <50% on Apple Silicon M1
- **Memory**: ~150-200MB for Python process
- **Disk I/O**: Negligible (appending text to file)
