Fix widget routing infinite loop and room switching #254
Merged
Conversation
- RustWorkerStorageAdapter sends include_data=true to the Rust worker
- Rust returns full record data inline, eliminating k IPC round trips
- Supports both TEXT and BLOB embedding storage formats
- Performance: 63ms for 3,422 vectors (was ~500ms)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
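As a sketch of what the round-trip elimination buys (the SearchHit shape and helper below are hypothetical, not the adapter's real API): with include_data=true each hit arrives already hydrated, so a result set needs zero follow-up fetches instead of k.

```typescript
// Illustrative sketch - names are hypothetical, not the real adapter API.
interface SearchHit {
  id: string;
  score: number;
  record?: Record<string, unknown>; // present when include_data=true
}

// Count how many follow-up IPC fetches a result set would still need.
function roundTripsNeeded(hits: SearchHit[]): number {
  return hits.filter((h) => h.record === undefined).length;
}
```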
TrainingBuffer now logs to persona-specific cognition.log files instead of console.log:
- Add BufferLogger type and setLogger() for dynamic logger updates
- getTrainingBuffer() accepts an optional logger parameter
- PersonaMessageEvaluator passes the persona's log function to the buffer
- Add emoji prefixes for log visibility (📥 Added, 🔥 Training, etc.)
Result: each persona's signal detection and buffer accumulation logs appear in their own .continuum/personas/<name>/logs/cognition.log
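A minimal sketch of the logger-injection pattern, assuming simplified signatures (the real BufferLogger and TrainingBuffer may differ):

```typescript
// Sketch: a buffer that defaults to console.log but accepts a
// persona-specific logger, swappable at any time via setLogger().
type BufferLogger = (message: string) => void;

class TrainingBufferSketch {
  private log: BufferLogger;

  constructor(logger: BufferLogger = console.log) {
    this.log = logger;
  }

  // Swap in a persona-specific logger after construction.
  setLogger(logger: BufferLogger): void {
    this.log = logger;
  }

  add(trait: string, example: string): void {
    this.log(`📥 Added ${trait} example: ${example}`);
  }
}
```

Each persona passes a closure over its own cognition.log writer, so buffer activity lands in the right file without the buffer knowing about files at all.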
TypeScript fixes:
- Fix MILESTONE_DEPENDENCIES type to Record<SystemMilestone, ...>
- Add missing BUILD_START milestone entry
- Fix serverProcess null vs undefined (use ?? undefined)
- Fix cli.ts ParsedValue type and positional-arg handling
- Add 'blob' to the DataSchemaTypes fieldType union
- Fix SystemMetricsCollector readonly property with MutableSystemReadySignal
Blob storage for large RAG contexts:
- Add BlobStorage with content-addressable SHA256 storage
- Add @BlobField and @JsonField(blobThreshold) decorators
- CoordinationDecisionEntity supports ragContextRef for blobs
- Rewrite the migration script to use Commands.execute (no direct DB access)
Signal-driven micro-training foundation:
- Add SignalDetector for correction/approval pattern matching
- Add TrainingBuffer for per-trait signal accumulation
- PersonaTaskExecutor handles pre-collected training examples
- PersonaResponseGenerator uses trained Ollama models
- Add FineTuningAdapterFactory for provider selection
Test file cleanup:
- Remove hardcoded /Volumes/FlashGordon paths
- Use process.cwd() and the FINE_TUNING_DATASET_PATH env var
RTOS-inspired adaptive backpressure that prevents queue flooding:
BackpressureService (new):
- Queries OllamaAdapter.getQueueStats() for actual queue load
- Provides shouldProceed(priority) - callers check before making requests
- Priority levels: critical, high, normal, low, background
- Load thresholds: critical=2.0, high=1.5, normal=1.0, low=0.7, background=0.3
- 100ms cache TTL for efficient repeated checks
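The priority/threshold check above can be sketched as follows (the thresholds are the ones listed; the queue-stats plumbing and cache are omitted):

```typescript
// Sketch of BackpressureService's core decision: a request proceeds only
// while load is below its priority's threshold, so background work
// yields first as the queue fills.
type Priority = "critical" | "high" | "normal" | "low" | "background";

const LOAD_THRESHOLDS: Record<Priority, number> = {
  critical: 2.0,
  high: 1.5,
  normal: 1.0,
  low: 0.7,
  background: 0.3,
};

function shouldProceed(priority: Priority, load: number): boolean {
  return load < LOAD_THRESHOLDS[priority];
}
```

At load=1.00 this blocks 'low' and 'background' work while still letting 'normal' chat traffic through, matching the log line quoted below.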
Hippocampus integration:
- Check isHighLoad() before consolidation attempt
- If load > 1.0, skip entire tick (don't even group thoughts)
- Logs deferred consolidation every 30 ticks for visibility
SemanticCompressionAdapter integration:
- Check shouldProceed('low') before LLM synthesis
- Use fallback memory (no LLM call) when under load
- Check shouldProceed('background') before embedding generation
- Skip embedding when not idle (will retry later)
- Track skipped operations in consolidation metadata
Key insight from user: "hard coding to a different value is no better"
This is NOT a hardcoded throttle - it queries actual queue load and
adapts in real-time. When queue clears, traffic resumes automatically.
Result: 11 personas no longer flood Ollama with background memory
synthesis while chat responses are flowing. Backpressure logs show:
"Blocking low operation (load=1.00, threshold=0.7)"
Rust Embedding Worker:
- Add fastembed-based ONNX embedding generation (~5ms vs ~80ms Ollama)
- Create RustEmbeddingClient.ts for TypeScript integration
- Wire into VectorSearchAdapterBase with NO FALLBACK (fail fast)
- Register the worker via worker.config.ts and workers-config.json
- Performance: 100 embeddings in 89ms, vector search 70ms total
Post-inference skip bug fix (PersonaMessageEvaluator.ts):
- The original trigger message was slipping through the timestamp filter
- It matched against itself at 100% similarity, blocking all responses
- Fix: exclude the original message by ID AND the original sender's messages
- AIs now respond correctly instead of all going silent
Verified: AI team responding, Rust embeddings working, TypeScript compiles.
CRITICAL FIX: Multiple code paths were bypassing the fast Rust embedding worker (~5ms) and going through the slow EmbeddingService → DataDaemon path.
Files fixed:
- RustWorkerStorageAdapter: use RustEmbeddingClient.embed() directly
- MemoryConsolidationWorker: replace fake 128d hash embeddings with real 384d
- MemoryConsolidationSubprocess: same fix
- PersonaTaskExecutor: direct RustEmbeddingClient usage
- SemanticCompressionAdapter: direct RustEmbeddingClient usage
- AiDetectSemanticLoopServerCommand: direct RustEmbeddingClient + inline cosine
Also increased the vector search timeout from 30s to 60s for large corpora (3K+ vectors).
Verified: vector searches completing without timeouts, AIs responding correctly.
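The "inline cosine" mentioned above is the standard formula, dot(a, b) / (|a|·|b|); a sketch over plain number arrays:

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; 1 means identical direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```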
Architecture cleanup after 10+ failed attempts at a full Rust data layer:
- Add RustVectorSearchClient for vector search only (44ms for 2K vectors)
- VectorSearchAdapterBase tries Rust first, falls back to TypeScript
- DatabaseHandleRegistry uses the TypeScript SqliteStorageAdapter exclusively
- Remove unused @matteo.collina/sqlite-pool dependency
Key insight: the sqlite3 npm package already runs async on a thread pool. The Rust CRUD layer had concurrency bugs (single pendingResponse). Vector search works because it's one request at a time.
Performance:
- Vector search: 44ms for 2,222 vectors (Rust reads from SQLite directly)
- Embeddings: ~5ms via the Rust embedding worker
- CRUD: TypeScript sqlite3 (async, reliable)
New command to analyze Rust data-daemon-worker timing metrics:
- Reads /tmp/jtag-data-daemon-timing.jsonl (nanosecond precision)
- Calculates percentiles (P50/P95/P99) per request type
- Shows the breakdown: socket_read, parse, execute, serialize, socket_write
- Generates recommendations for high latency or variance
- Supports time-window filtering (--windowMinutes=5)
- Can clear timing data (--clear)
Also:
- clean:logs now also clears /tmp/jtag-*-timing.jsonl files
Current performance (after CRUD moved to TypeScript):
- vector/search: P50=108ms (was 1.3s avg - 12x faster)
- ping: 0.16ms
- 100% success rate
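The per-request-type percentile breakdown could be computed along these lines (the .jsonl field names here are assumptions, not the worker's actual schema):

```typescript
// Group timing records by request type and compute nearest-rank percentiles.
interface TimingRecord {
  requestType: string;
  totalMs: number;
}

function percentile(sorted: number[], p: number): number {
  // Nearest-rank percentile over an ascending-sorted array.
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function summarize(
  records: TimingRecord[]
): Map<string, { p50: number; p95: number; p99: number }> {
  const byType = new Map<string, number[]>();
  for (const r of records) {
    const arr = byType.get(r.requestType) ?? [];
    arr.push(r.totalMs);
    byType.set(r.requestType, arr);
  }
  const out = new Map<string, { p50: number; p95: number; p99: number }>();
  for (const [type, values] of byType) {
    values.sort((a, b) => a - b);
    out.set(type, {
      p50: percentile(values, 50),
      p95: percentile(values, 95),
      p99: percentile(values, 99),
    });
  }
  return out;
}
```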
Replace blocking sync fs calls with async equivalents:
- fs.existsSync() → fsPromises.access()
- fs.readFileSync() → fsPromises.readFile()
- fs.writeFileSync() → fsPromises.writeFile()
- fs.mkdirSync() → fsPromises.mkdir() (lazy in persist)
This prevents event-loop blocking during memory consolidation, even though it's a background worker operation.
ORM Interface (DataStorageAdapter):
- Add JoinSpec interface for declarative join specification
- Add StorageQueryWithJoin extending StorageQuery
- Add queryWithJoin() method with default throw implementation
RustWorkerStorageAdapter:
- Implement queryWithJoin() that builds SQL with JOINs
- Add rawQuery() and rawQueryCamelCase() for direct SQL access
- JOINed data is nested under alias keys in results
Rust data-daemon-worker:
- Add data/query command for raw SELECT queries
- Security: Only SELECT allowed, rejects modification keywords
- Enables single-query JOINs eliminating N+1 patterns
Example usage:
```typescript
const result = await adapter.queryWithJoin({
  collection: 'chatMessages',
  filter: { roomId: 'room-123' },
  joins: [{
    collection: 'users',
    alias: 'sender',
    localField: 'senderId',
    foreignField: 'id',
    type: 'left',
    select: ['displayName', 'userType']
  }],
  limit: 50
});
// Result: { ...message, sender: { displayName: 'Joel' } }
```
Bug: ChatRAGBuilder.extractFromComposition() only extracted systemPromptSection from the 'widget-context' source, silently ignoring the 'global-awareness' source. Cross-context events were found but never injected into the LLM prompt.
Fix:
- Extract globalAwareness from the 'global-awareness' source
- Inject it into the system prompt alongside widgetContext
- Add a hasGlobalAwareness metadata flag
- Add a room-name cache in PersonaUser for human-readable contextNames
Result: AIs now recall info from other rooms. DeepSeek correctly recalled "DELTA_PHOENIX_99" from academy when asked in general.
Vector search fix (RustVectorSearchClient, VectorSearchAdapterBase):
- RustVectorSearchClient now supports per-database handles (a Map, not a singleton)
- ensureHandle() takes an optional dbPath and caches handles per database
- search() takes an optional dbPath parameter and routes to the correct adapter
- VectorSearchAdapterBase accepts dbPath in its constructor
- SqliteVectorSearchManager passes dbPath to VectorSearchAdapterBase
- SqliteStorageAdapter stores this.dbPath and passes it to the vector manager
This fixes semantic memory recall for per-persona databases. Previously, RustVectorSearchClient always opened the main database, causing hangs when searching per-persona longterm.db files. Now each database gets its own adapter handle in Rust.
Timeline ORM migration (PersonaTimeline, TimelineEventEntity):
- Complete rewrite from JSON file to SQLite ORM storage
- Uses Commands.execute with dbHandle for per-persona databases
- TimelineEventEntity with @Index and @timestamp decorators
- Added to EntityRegistry for proper table creation
- Fixed DataListResult usage (.items, not .data)
Tested: AI personas responding with semantic memory recall working.
Three critical fixes for TypeScript/Rust database coexistence:
1. Path-based adapter caching (AdapterRegistry)
- Cache adapters by database path in a path_cache HashMap
- Reuse existing connections instead of opening new ones
- Prevents duplicate connections to the same database
2. Serialized opens (open_lock mutex)
- All adapter opens serialize through open_lock
- Prevents concurrent pragma-configuration race conditions
- Eliminates "database is locked" errors during startup
3. Multi-writer mode (CRITICAL FIX)
- Changed get_sqlite_pragmas(storage_type, false) → true
- Was: PRAGMA locking_mode=EXCLUSIVE (blocks TypeScript)
- Now: standard WAL mode without the exclusive lock
- TypeScript (better-sqlite3) and Rust (rusqlite) can coexist
Result: zero "database is locked" errors in the timing logs after the fix.
Verified: vector search succeeding consistently.
Critical fix for per-persona database access.
In multi_writer mode (TypeScript + Rust accessing the same database):
- SKIP the journal_mode pragma (TypeScript already set it)
- SKIP the locking_mode pragma (it would conflict)
- SKIP checkpoint operations (they require exclusive access)
- Only set safe pragmas: synchronous, temp_store, busy_timeout
Root cause: setting PRAGMA journal_mode=WAL requires exclusive access. When TypeScript (better-sqlite3) has the database open, Rust's attempt to change journal_mode fails with "database is locked". The previous commit fixed the internal-SSD case but still failed on persona databases because we were still trying to set journal_mode.
Result: all persona databases now open successfully. Verified with vector search returning 3,602 memories from the helper persona.
OllamaAdapter:
- Add getModelContextWindow() to query /api/show for the real context size
- Cache results per model (context windows don't change)
- Pass num_ctx in all requests to use the full context window
- Updates both the /api/generate and /api/chat endpoints
- llama3.2:3b now uses its 128K context instead of the default 4096
AICapabilityRegistry:
- Add getModelInfo() to look up a model by name across providers
- Add getContextWindow() for easy context-window lookup
- Single source of truth for model capabilities
PersonaResponseGenerator:
- Remove the hardcoded context-window mapping (30 lines deleted)
- Use AICapabilityRegistry.getContextWindow() instead
- Automatically gets the correct context for any registered model
RAGInspectServerCommand:
- Fix a null-safety bug: params.contextId?.slice() with a fallback
This fixes the "Ollama circuit breaker open" issue - models were rejecting large prompts because num_ctx defaulted to 4096.
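The cache-then-query pattern can be sketched as follows; /api/show is Ollama's real model-info endpoint, but the fetch wrapper is left abstract here because the response field for context length varies by model architecture:

```typescript
// Per-model context-window cache: query once, reuse forever,
// since a model's context window doesn't change.
const contextWindowCache = new Map<string, number>();

async function getModelContextWindow(
  model: string,
  fetchContextSize: (model: string) => Promise<number> // wraps POST /api/show
): Promise<number> {
  const cached = contextWindowCache.get(model);
  if (cached !== undefined) return cached;
  const size = await fetchContextSize(model);
  contextWindowCache.set(model, size);
  return size;
}
```

The cached value is then passed as num_ctx on every generate/chat request so the model accepts prompts up to its real window.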
Three key changes for uniform genome training across all local models:
1. FineTuningAdapterFactory
- Add 'peft' and 'local' as providers routing to PEFTLoRAAdapter
- Change return type to BaseLoRATrainer (common interface)
- Update default fallback from Ollama to PEFT
2. PEFTLoRAAdapter
- Add Ollama → HuggingFace model name mapping
- Maps llama3.2:3b → meta-llama/Llama-3.2-3B-Instruct, etc.
- Remove model validation (PEFT supports all transformers models)
- Log model remapping when it occurs
3. PersonaTaskExecutor
- Route local providers (ollama, local, peft) to PEFT adapter
- PEFT is preferred because it:
- Supports any HuggingFace model (not just Ollama)
- Enables multi-adapter composition (genome vision)
- Works cross-platform (MPS/CUDA/CPU)
- Doesn't require external binaries (llama.cpp finetune)
Result: Signal-driven micro-training now routes through PEFT for all
local models, enabling the genome vision of trait-specific adapters
that can be composed at inference time.
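The Ollama → HuggingFace remapping can be sketched as a lookup with passthrough; only the llama3.2:3b entry is stated above, so the table shape and function name are illustrative:

```typescript
// Map Ollama model names to HuggingFace repo ids; unknown names pass
// through unchanged since PEFT accepts any transformers model.
const OLLAMA_TO_HF: Record<string, string> = {
  "llama3.2:3b": "meta-llama/Llama-3.2-3B-Instruct",
};

function mapToHuggingFace(ollamaName: string): string {
  const mapped = OLLAMA_TO_HF[ollamaName];
  if (mapped !== undefined) {
    console.log(`Remapping ${ollamaName} → ${mapped}`); // log when remapping occurs
    return mapped;
  }
  return ollamaName;
}
```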
…ence
Replaces Ollama for local models with key advantages:
- Unix socket IPC (no HTTP overhead)
- Multi-adapter LoRA composition (genome vision)
- Metal acceleration on Apple Silicon
- Low-level control over memory/timing/adapters
New components:
1. Rust inference-worker (workers/inference/)
- Trait-driven architecture (ModelProvider, AdapterManager, TextGenerator)
- HuggingFace Hub model downloads
- JTAG protocol compatible
- LoggerClient integration for structured logging
- Metal acceleration enabled
2. InferenceWorkerClient (TypeScript)
- Singleton with connection reuse
- Auto-reconnect with exponential backoff (3 retries)
- Request queue for serialization
3. CandleAdapter
- Extends BaseAIProviderAdapter (circuit breaker, timeout, logging)
- Priority 105 (above Ollama's 100) when the worker is available
- Skill/adapter management for the LoRA genome vision
4. Worker registration
- Enabled in workers-config.json
- Added to WorkerRegistry
- Registered in AIProviderDaemonServer
Note: the Rust worker returns placeholder text for now. Actual Candle inference implementation is the next phase.
Inference worker (workers/inference/src/main.rs):
- Implement Phi model loading from HuggingFace Hub
- Add autoregressive text generation with proper KV-cache handling
- Clear the cache before generation; pass single tokens on subsequent forwards
- Memory-efficient: preallocated Box<Vec>, fixed stack buffers
- No raw unwrap() - all errors handled via map_err and ?
- Metal acceleration on Apple Silicon (~14-17 tok/s)
- Document unsafe mmap usage (required by the Candle API)
Data daemon (workers/data-daemon/src/main.rs):
- Fix the unused has_wal_artifacts warning - WAL artifacts are now logged
- Remove the dead register() method (open_adapter() is the proper path)
Cargo.toml:
- Update hf-hub from 0.3 to 0.4 (fixes URL parsing)
Tested: TypeScript InferenceWorkerClient successfully generates text through CandleAdapter, which is registered with priority 105.
Core philosophy: never use a resource unless necessary.
Documents:
- The CBARFrame pattern (compute once, attach forever)
- Resource hierarchy (GPU > CPU > Disk > Compute time)
- Adaptive priority scheduling (AI-assisted)
- Zero-copy patterns (handles not data, attach not return)
- Bottleneck elimination strategies
- Multi-modal pipeline design for video/audio/images
- The Rust ownership model as an enforcement mechanism
- Future Sora-like video generation architecture
These principles come from shipping 60fps computer vision on iPhone 7s - the same discipline scales to AI workloads.
JSON serialization of f32 arrays is wasteful. This adds a binary protocol that eliminates serialization overhead for data payloads.
Binary protocol (workers/shared/binary_protocol.rs):
- JSON header (newline-terminated) + raw binary payload
- Header contains: type, length, dtype, shape, batchSize
- Supports f32, f16, u8, i16 data types
- Includes FrameFormat for future video/Sora support
- Zero-copy patterns where possible
Rust embedding-worker:
- Control messages (ping, model/list) still use JSON
- embedding/generate now uses a binary response
- Single allocation to flatten embeddings, then reinterpret as bytes
- Logs a [BINARY] marker for visibility
TypeScript RustEmbeddingClient:
- Buffer-based data handling (not string)
- Separate handlers for JSON vs binary responses
- Header parsing, then an exact-length binary read
- Float32Array view over an aligned buffer
Benchmark (100 embeddings × 384 dims):
- Binary: 150 KB (1.3ms/embedding, 127ms total)
- JSON would be: 786 KB
- Savings: 80.9% bandwidth, zero parse overhead
- Throughput: 1.2 MB/s
Designed for reuse with video frames, audio, tensors - any large data that shouldn't be JSON-serialized.
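The client-side parse (JSON header, newline, exact-length payload) can be sketched as follows; the header fields follow the commit text, though the real header may carry more:

```typescript
// Parse one binary message: a newline-terminated JSON header followed
// by exactly `length` bytes of raw payload, viewed as f32 vectors.
interface BinaryHeader {
  type: string;
  length: number;  // payload size in bytes
  dtype: string;   // "f32", "f16", "u8", "i16"
  shape: number[]; // e.g. [100, 384]
}

function parseBinaryMessage(buf: Buffer): { header: BinaryHeader; vectors: Float32Array } {
  const nl = buf.indexOf(0x0a); // first newline terminates the JSON header
  const header = JSON.parse(buf.subarray(0, nl).toString("utf8")) as BinaryHeader;
  const payload = buf.subarray(nl + 1, nl + 1 + header.length);
  // Copy into a fresh, aligned ArrayBuffer before viewing as Float32Array
  // (a Buffer slice may not start at a 4-byte boundary).
  const aligned = new Uint8Array(payload).buffer;
  return { header, vectors: new Float32Array(aligned) };
}
```

The 150 KB figure checks out: 100 embeddings × 384 dims × 4 bytes per f32 = 153,600 bytes of payload.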
Inference worker multi-model support:
- Add Llama, Mistral, and Qwen2 architectures alongside Phi
- Each architecture has its own forward() signature handling:
  - Phi: internal KV cache
  - Llama: external Cache object, seqlen offset
  - Mistral/Qwen2: internal cache, seqlen offset
- Architecture detection from model_id patterns and config.json
- Tested: TinyLlama 19 tok/s, Qwen2-0.5B 26 tok/s, Phi-1.5 5.2 tok/s
Sharded weight support:
- Add a download_weights() helper for single- and multi-file models
- Parse model.safetensors.index.json for shard discovery
- Load multiple safetensors files into a VarBuilder
- Enables Mistral 7B and other large sharded models
Fix workspace binary paths:
- workers-config.json: use workers/target/release/ (the workspace target), not workers/<name>/target/release/ (per-crate targets don't exist)
- start-workers.sh: single workspace build instead of per-worker builds
The KV cache was persisting between generations, causing shape mismatches like "cannot broadcast [8, 8] to [1, 32, 8, 74]" on subsequent requests.
Root cause: LlamaCache has no reset() or clear() method - the cache was accumulating position indices from previous generations.
Fix:
- Store LlamaModelConfig in the ModelArchitecture::Llama tuple
- Recreate the cache before each generation using LlamaCache::new()
- This properly resets all KV state between requests
Tested: 4 consecutive generations all produce coherent output.
Replace all 'any' types with proper interfaces for type safety:
BaseOpenAICompatibleAdapter:
- Add OpenAIUsage, OpenAIModelData, OpenAIImageData, OpenAIEmbeddingData
- Add OpenAIChatCompletionResponse, OpenAIModelsResponse
- Add OpenAIImageResponse, OpenAIEmbeddingsResponse
- Update calculateCost(usage: OpenAIUsage | undefined)
- Update parseModelInfo(modelData: OpenAIModelData)
- Update all makeRequest<any> to proper response types
- Update all .map() callbacks with proper types
PricingManager:
- Add StaticPricingConfig interface
- Change staticPricing from 'any' to StaticPricingConfig | null
AnthropicAdapter:
- Add AnthropicUsage interface (input_tokens, output_tokens, cache fields)
- Update calculateCost to use AnthropicUsage | undefined
Other files:
- BaseAIProviderAdapter: ...args: any[] → ...args: unknown[]
- BaseLocalAdapter: parseModelsResponse(data: any) → data: unknown
- LlamaCppAdapter: layer: any → { mediaType?: string; digest?: string }
- AdapterHealthMonitor: Add AdapterUnhealthyEvent interface
Architecture review confirms "extreme adapter extensibility":
- Capability-based optional methods (only implement what provider supports)
- Template Method pattern (base handles timeout/circuit breaker)
- Uniform request/response types regardless of underlying API
- Model tier abstraction for semantic model selection
Rust inference-worker (main.rs):
- Add generate/binary command for binary prompt transmission
- Binary prompts eliminate JSON escaping issues (newlines, quotes, etc.)
- Binary response: JSON header + raw UTF-8 bytes
- Add BinaryTextHeader struct with promptTokens, generatedTokens
- handle_client detects binary command and reads prompt payload
- Add write_binary_text() and read_exact_bytes() helpers
- GenerateBinary variant handled separately from JSON path
TypeScript InferenceWorkerClient:
- Fix command format: remove JTAG wrapper, send simple JSON
- Add binary protocol support with dataBuffer and binaryHeader state
- Add generateBinary() method for prompts with special characters
- generateText() now uses binary protocol internally (safer default)
- handleData() parses both JSON and binary responses
- Update all methods to use new response format (result not payload)
- Remove unused JTAGRequest/JTAGResponse types
Wire format:
Request: {"command":"generate/binary",...}\n<prompt_bytes>
Response: {"type":"binary","length":N,...}\n<response_bytes>
Tested: Multi-line prompts work, 13.1 tok/s on Qwen2-0.5B-Instruct.
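The request side of this wire format can be sketched as follows; only the command string appears above, so the promptLength and maxTokens field names are assumptions:

```typescript
// Frame a generate/binary request: a compact JSON command line, a
// newline, then the prompt as raw UTF-8 bytes - so newlines and quotes
// in the prompt never touch the JSON layer.
function frameBinaryGenerate(prompt: string, maxTokens: number): Buffer {
  const promptBytes = Buffer.from(prompt, "utf8");
  const command = JSON.stringify({
    command: "generate/binary",
    promptLength: promptBytes.length, // lets the worker read the exact payload
    maxTokens,
  });
  return Buffer.concat([Buffer.from(command + "\n", "utf8"), promptBytes]);
}
```

Because JSON.stringify never emits raw newlines, the first newline in the frame always terminates the header, no matter what the prompt contains.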
Three key integrations for the genome vision:
1. Local provider aliasing (AIProviderDaemon.ts)
- Route 'ollama'/'local'/'llamacpp' → 'candle' when available
- Enables transparent Candle drop-in for local inference
- Candle has higher priority (105 vs 100) for better routing
2. PersonaGenome ↔ AIProviderAdapter wiring
- Add setAIProvider() to PersonaGenome and LoRAAdapter
- PersonaUser.initialize() wires the genome to CandleAdapter
- LoRAAdapter.load() delegates to provider.applySkill()
- Enables real GPU adapter loading (not just state tracking)
3. Rust GPU allocator module (workers/shared/gpu_allocator.rs)
- Dedicated module for centralized GPU memory management
- LRU + priority eviction algorithm
- Thread-safe with Mutex<HashMap>
- Global singleton via lazy_static
- Proper modular architecture (not crammed into main.rs)
Architecture: GPU memory is now managed at the Rust layer. PersonaGenome tracks logical state; Rust owns the actual allocation.
Expose the Rust GPU allocator through inference worker commands.
Commands added:
- gpu/status: get total/allocated/available memory and pressure
- gpu/allocate: request GPU memory for an adapter
- gpu/release: release a GPU memory allocation
- gpu/stress-test: throw many allocations at the allocator
Stress test results (1000 operations):
- 11ms total (84,000 ops/second)
- Proper eviction suggestions when memory is full
- The LRU + priority algorithm correctly identifies victims
- Clean state after stress-test cleanup
The allocator uses a #[path] include from workers/shared/ to avoid needing a separate lib crate. This follows the modular pattern requested - GPU management lives in its own focused module.
GPU allocator now tracks allocation types and load times for smart paging.
Allocation types:
- Model: base models (~7 seconds to load, keep resident)
- Adapter: LoRA adapters (~200ms to load, can page freely)
- Embedding: embedding models (~500ms, medium priority)
- Other: fallback category
Smart eviction algorithm:
- score = (age × type_weight) / (priority × 10 × reload_cost)
- Adapters get 2x eviction weight (cheap to reload)
- Models get 0.3x weight (expensive, avoid evicting)
- Load time penalizes eviction (don't evict slow-to-load items)
New commands:
- gpu/paging-stats: get a breakdown by type (count, MB, avg load time)
- gpu/allocate now accepts load_time_ms and alloc_type params
Tested: eviction correctly suggests adapters (180ms) over models (7500ms) when memory pressure requires freeing space.
This enables running 50+ personas with shared base models, paging adapters in and out as needed. Foundation for multi-node GPU sharing.
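The eviction score can be sketched directly from the formula above (the Rust module is the real implementation; here reload_cost is assumed to be the load time in milliseconds):

```typescript
// score = (age × type_weight) / (priority × 10 × reload_cost)
// Higher score ⇒ better eviction victim.
type AllocType = "model" | "adapter" | "embedding" | "other";

const TYPE_WEIGHT: Record<AllocType, number> = {
  adapter: 2.0,   // cheap to reload: evict first
  model: 0.3,     // expensive: avoid evicting
  embedding: 1.0,
  other: 1.0,
};

function evictionScore(
  ageMs: number,
  type: AllocType,
  priority: number,
  loadTimeMs: number
): number {
  return (ageMs * TYPE_WEIGHT[type]) / (priority * 10 * loadTimeMs);
}
```

With equal age and priority, a 180ms adapter scores far above a 7500ms model, reproducing the tested behavior.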
Rust gRPC worker (inference-grpc/src/main.rs):
- Switch from Qwen2-1.5B to unsloth/Llama-3.2-3B-Instruct
- Use candle_transformers::models::llama with an external Cache
- Handle Llama's EOS tokens (128009)
- Load via Llama::load() with LlamaConfig → Config conversion
TypeScript adapter (CandleGrpcAdapter.ts):
- Add the Llama 3.2 chat template with special tokens: <|begin_of_text|><|start_header_id|>role<|end_header_id|>
- Change the queue guard from reject to wait (90s timeout)
- Allows patient queuing for serial local inference
Result: the helper AI now produces coherent responses matching Ollama llama3.2:3b quality. The foundation is ready for LoRA training since we're using the HuggingFace model format directly.
Infrastructure for model hot-swapping and LoRA adapter paging.
Proto extensions:
- LoadModel/UnloadModel/ListModels for model hot-swap
- LoadAdapter/UnloadAdapter/ListAdapters for LoRA management
- Status for server health and request stats
Rust server (inference-grpc):
- Model loading factored into load_model_by_id()
- LoadModel spawns a blocking task for async model loading
- Server stats tracking (requests pending/completed)
- Adapter metadata tracking (actual LoRA loading is TODO)
TypeScript client:
- All new RPC methods exposed with proper types
- ModelInfo, AdapterInfo, ServerStatus interfaces
Tested: Status, ListModels, LoadAdapter, UnloadAdapter all working. The helper AI continues to generate coherently through CandleGrpcAdapter.
Documentation:
- Add LORA-LAB-ARCHITECTURE.md with a comprehensive adapter-economy design
- Platform spectrum: M1 8GB (floor) → RTX 5090 (ceiling)
- User tiers: consumers 80%, creators 15%, power users 5%
- Multi-modal outputs: text, vision, diffusion, audio, video
- Cross-domain adapters: code, design, legal, medical, etc.
- Free-first philosophy with optional cloud acceleration
- 11 implementation phases mapped out
Rust inference worker refactoring:
- Extract grpc.rs for the service implementation
- Extract model.rs for model loading + generation
- Extract lora.rs for LoRA adapter parsing
- Clean main.rs entry point (60 lines)
- Add safetensors parsing for LoRA A/B matrices
- Support F32, F16, BF16 dtype conversion
- Add the half crate for f16/bf16 handling
Note: weight merging (W' = W + scale × B @ A) is pending Phase 2.5
The name mapping wasn't correctly handling the double "model." prefix in PEFT-format adapter names like:
  base_model.model.model.layers.X.Y
Before: mapped to "layers.X.Y.weight" (stripped too much)
After: maps to "model.layers.X.Y.weight" (correct)
Result: 196/196 LoRA layers now merge successfully with the base model.
Changes:
- lora.rs: fix map_lora_name_to_model_name() to handle the double prefix
- lora.rs: update test cases for the new expected format
- model.rs: add failed-merge count logging
- Add lora-adapter-test.ts for integration testing
Tested with the public Jiten1024/llama-3.2-3b-int-finetune-jav-rank-1-alpha-32 adapter - generation produces visibly different output.
- MILESTONE 1 complete: 196/196 LoRA layers merged successfully
- Tested with Jiten1024/llama-3.2-3b-int-finetune-jav-rank-1-alpha-32
- Added Phase 2.5 design: a LoRANameMapper trait for model-agnostic mapping
- The current hardcoded Llama/PEFT mapping works; the trait is needed for multi-arch support
Formula: W' = W + Σ(scale_i × B_i @ A_i)
Rust:
- model.rs: add GenomeAdapter struct and rebuild_with_stacked_lora()
- grpc.rs: implement the ApplyGenome RPC handler
- proto: add ApplyGenomeRequest/Response messages
TypeScript:
- InferenceGrpcClient: add applyGenome() method
Test (genome-stacking-test.ts):
- Loads two public adapters (rank-1 and rank-64)
- Applies a genome with both at scale=0.5 each
- Validates that the stacked output differs from single-adapter output
Result: 392 layers merged from 2 adapters in ~9 seconds. The stacked output shows a vocabulary blend from both adapters.
Genome stacking validated with 2 adapters, 392 layers merged. The formula W' = W + Σ(scale_i × B_i @ A_i) is working correctly.
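A toy instance of the stacking formula with plain nested arrays (the worker operates on safetensors tensors; this only checks the arithmetic):

```typescript
// W' = W + Σ(scale_i × B_i @ A_i) for tiny dense matrices.
type Matrix = number[][];

// b is (out × r), a is (r × in): the low-rank update B @ A.
function matmul(b: Matrix, a: Matrix): Matrix {
  return b.map((row) =>
    a[0].map((_, j) => row.reduce((sum, v, k) => sum + v * a[k][j], 0))
  );
}

function applyGenome(
  w: Matrix,
  adapters: { scale: number; b: Matrix; a: Matrix }[]
): Matrix {
  return adapters.reduce((acc, { scale, b, a }) => {
    const delta = matmul(b, a);
    return acc.map((row, i) => row.map((v, j) => v + scale * delta[i][j]));
  }, w);
}
```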
Milestone 3.6 - adapter search command:
- adapter/search command via CommandGenerator
- Search HuggingFace Hub (PEFT/LoRA filter)
- Search the local registry (~/.continuum/adapters/installed/)
- Filter by base model; sort by downloads/likes/recent
- Base-model extraction from both cardData and tags
Milestone 4 - local registry:
- manifest.json with metadata (repo_id, base_model, rank, alpha)
- Installed adapters marked in search results
- Handle both snake_case (Rust) and camelCase field names
Milestone 5 - provider abstraction (design validated):
- IAdapterProvider interface for all backends
- LocalAdapterProvider wraps InferenceGrpcClient
- TogetherAdapterProvider validates the cloud LoRA pattern
- AdapterProviderRegistry for federated search
- Architecture proven: local ↔ cloud ↔ third-party APIs
Rust (adapter_registry.rs):
- HuggingFace Hub download via the hf-hub crate
- Copy to the local registry with manifest generation
- DownloadAdapter gRPC RPC
Documentation:
- Updated LORA-LAB-ARCHITECTURE.md with the current state
- Added a Multi-Provider Adapter Abstraction section
- Added the Autonomous AI Self-Improvement vision
- Code-location table for quick reference
Adapter compatibility system (CandleGrpcAdapter):
- Read manifest.json for proper scale calculation (α/r)
- Legal adapter: α=16, r=32 → scale=0.5 (was hardcoded to 1.0)
- Detect incompatible adapters (quantized 4bit/bnb, wrong base model)
- Garbage-output detection with pattern matching
- Auto-blocklist bad adapters; fall back to safe mode
- Methods: enterSafeMode(), exitSafeMode(), unblockAdapter()
RTOS fair scheduling (InferenceCoordinator):
- Reserve 1 of 5 slots for local-inference (prevents cloud starvation)
- Auto-thin queues when depth exceeds the limit
- Newest-first priority for fresher messages
- Card-dealing fairness (one response per persona per message)
Other fixes:
- Relax the semantic-loop threshold (BLOCK: 0.85 → 0.95)
- Add adapter/adopt, adapter/try, and persona/genome commands
- Add an adapter compatibility test suite
Result: the helper AI now responds coherently with the Legal LoRA adapter.
Quantized inference (Q4_K_M):
- 2.3s load time (vs 14s for BF16), 6x faster startup
- ~2GB memory (vs ~6GB for BF16), 3x smaller footprint
- Same inference speed (~45 tok/sec on Metal)

LoRA compatibility via auto-switch:
- Defaults to quantized mode for fast startup
- Auto-switches to BF16 when LoRA adapters are requested
- Transparent to callers - just works

Configuration via ~/.continuum/config.env:
- INFERENCE_MODE=auto (default) - quantized, auto-switch for LoRA
- INFERENCE_MODE=quantized - force quantized (no LoRA)
- INFERENCE_MODE=bf16 - force BF16 (full LoRA from start)

Device priority: CUDA (RTX 5090) > Metal (M-series) > CPU

Files:
- quantized_model.rs: GGUF loading, quantized generation
- main.rs: Mode selection from config, dual model support
- grpc.rs: Auto-switch logic in load_adapter/download_adapter
- model.rs: CUDA/Metal device selection

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
BRAIN-HUD-DESIGN.md:
- One sci-fi brain visualization as central HUD
- Six regions mapping to cognitive domains:
  - Hippocampus → Memory (RAG, semantic)
  - Genome → Adapters (LoRA stack, scales)
  - Motor Cortex → Tools (actions, usage)
  - Prefrontal → Logs (activity stream)
  - Limbic → State (energy, mood)
  - CNS → Performance (latency, connections)
- Tap region to expand detail view
- Real-time updates via event subscriptions
- Mobile-first with vertical stacking
- Future: Three.js 3D version

GENOME-BUILDER-DESIGN.md:
- Now references unified Brain HUD
- Adapter cards, search, try/adopt workflow
- Game mechanics (skill trees, loadouts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Personas can be configured with:
- Voice (TTS): OpenAI, ElevenLabs, local (Coqui/Piper), custom clones
- Avatar: static, animated 2D (Live2D), video diffusion, 3D (Three.js)
- Lip sync from audio stream
- Expression mapping from persona mood/state

Layout shows avatar above brain HUD with speech bubble and volume control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Motor Cortex - Outputs:
- Text (chat)
- Speech (TTS)
- Singing (music generation)
- Video (video diffusion)
- Game actions (controls)
- Output modality registry with enable/disable

Sensory Cortex - Inputs:
- Vision (images, video, screenshots)
- Audio (speech transcription, sound description)
- Game state observation
- All inputs convert to text/embeddings for LLM

Multimodal context building:
- RAG builder incorporates all sensory inputs
- [VISION], [AUDIO], [GAME], [MEMORY] sections in context

Updated layout with 7 brain regions total.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Personas have their own digital presence:
- Profile: avatar, banner, bio (self-generated)
- Content: blogs, image galleries, playlists
- Social: Twitter, Bluesky, Mastodon accounts
- Preferences: theme, timezone, interests

Content creation via tools:
- image/generate for avatars, blog images
- blog/post for publishing articles
- social/tweet for external posting
- user/preferences for self-configuration

Social graph:
- Personas follow each other and humans
- Collaborative content (co-authoring)
- Permission levels (approval workflow, rate limits)

Added Social section to Brain HUD detail views.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Shows how brain regions map to a real-world voice AI product:
- Sensory Cortex: STT (Whisper/Deepgram) converts phone audio to text
- Motor Cortex (Voice): TTS (ElevenLabs) converts response to speech
- Motor Cortex (Actions): Calendar booking, CRM lookup, SMS, transfers
- Hippocampus: Business knowledge (FAQs, services, hours, patients)
- Genome: LoRA trained on THEIR call transcripts = brand voice
- Prefrontal: Conversation state, decision logging
- Limbic: Sentiment detection → escalation triggers
- CNS: <200ms latency critical for natural voice

Includes:
- Voice pipeline diagram (STT → LLM+LoRA → TTS)
- LoRA training example from call transcripts
- Business admin dashboard mockup (Brain HUD as SaaS UI)

Reference: docs/examples/ENTERPRISE-IVR.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Self-improving AI teammates that learn and integrate:

Continuous Learning:
- Historical IVR transcripts for baseline training
- Live call handling with real-time learning
- Customer feedback loops (ratings, corrections)
- Supervisor coaching and human manager guidance

Real-Time Monitoring:
- Calls are rooms - observable like any chat
- Supervisors can monitor multiple calls
- Human managers have dashboard visibility

Intervention Capabilities:
- Side-channel DM to persona (invisible to customer)
- Whisper mode (coach while customer can't hear)
- Full takeover (human assumes call)

Supervision Hierarchy:
- Human Manager → Supervisor Personas → Frontline Personas
- AI supervisors handle routine escalations
- Humans for complex issues and training

External Integrations:
- Slack, email, SMS for team coordination
- CRM, calendar for business context
- Ecosystem diagram showing persona as central node

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set-theory based room uniqueId: {A, B} == {B, A}
- Deterministic uniqueId from sorted participant set (SHA256 hash)
- Finds existing room if same participants already have DM
- Creates new room if none exists
- Works with 2 (classic DM) or 3+ participants (group DM)
- Name is optional, can be set later
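The set-based derivation above can be sketched in a few lines (the function name is illustrative, not the project's actual API): sorting the participants before hashing is what makes the ID order-independent.

```typescript
import { createHash } from 'node:crypto';

// Sketch: deterministic room uniqueId from the participant SET.
// Sorting first means {A, B} and {B, A} hash to the same value,
// and the same logic works for 2 or 3+ participants.
function roomUniqueId(participants: string[]): string {
  const canonical = [...participants].sort().join('|');
  return createHash('sha256').update(canonical).digest('hex');
}
```

Because the ID is a pure function of the member set, "find existing room, else create" reduces to a single lookup by uniqueId.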
Usage:
./jtag collaboration/dm --participants="helper"
./jtag collaboration/dm --participants='["helper", "teacher"]'
./jtag collaboration/dm --participants="helper" --name="Project Chat"
Recipe (dm.json):
- Private room settings (not public, requires invite)
- One-on-one conversation pipeline
- No response gating (always respond when addressed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Test scripts for expert LoRA adapters:
- sql-adapter-test.ts: Tests SujanKarki SQL adapter
- legal-adapter-test.ts: Tests sartajbhuvaji Legal adapter

Results:
- Infrastructure works correctly (loading, merging, generation)
- Legal adapter produces excellent domain-specific output
- SQL adapter quality is poor (repetitive output after correct SQL)
- Adapter quality varies by source - need to vet before production use

Usage:
npx tsx tests/sql-adapter-test.ts
npx tsx tests/legal-adapter-test.ts

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Shared code must be environment-agnostic (browser + Node.js).
Using Node.js `crypto` module in shared DmCommand.ts broke browser loading.
Fix: Replace SHA256 hash with simple deterministic ID from sorted participant
short IDs. Still guarantees set equality: {A,B} == {B,A}
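A minimal sketch of the crypto-free replacement (names are hypothetical): the ID is built only from string operations, so it runs identically in the browser and Node.js, and sorting still guarantees set equality.

```typescript
// Hypothetical sketch: environment-agnostic DM room id with no Node-only imports.
// Takes the last 6 chars of each participant UUID, sorts them, and joins.
function dmUniqueId(participantIds: string[]): string {
  const shortIds = participantIds
    .map((id) => id.slice(-6)) // short id: last 6 chars of the UUID
    .sort();                   // canonical order → {A,B} == {B,A}
  return `dm-${shortIds.join('-')}`;
}
```

The trade-off versus SHA256 is a higher (still tiny) collision risk from 6-char suffixes, in exchange for code that shared browser/Node modules can safely import.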
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PersonaUser was only subscribing to data:rooms:updated events, missing
data:rooms:created events. When a new DM room was created with a persona as a
member, the persona never received the event and didn't subscribe to chat
messages for that room.

Fix: subscribeToRoomUpdates() now subscribes to BOTH events:
- data:rooms:updated (existing room membership changes)
- data:rooms:created (new rooms with persona as member)

Tested: DM with Helper AI + Legal LoRA adapter - persona now responds to
messages in newly created DM rooms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
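The shape of the fix can be sketched like this (the event-bus class and handler names are illustrative stand-ins for the project's actual event system):

```typescript
// Illustrative event bus; only the two event names come from the fix above.
type RoomEvent = { roomId: string; memberIds: string[] };
type Handler = (e: RoomEvent) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();
  on(event: string, h: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(h);
    this.handlers.set(event, list);
  }
  emit(event: string, e: RoomEvent): void {
    for (const h of this.handlers.get(event) ?? []) h(e);
  }
}

// The same membership check is registered for BOTH events, so a persona
// picks up rooms it is added to AND rooms created with it as a member.
function subscribeToRoomUpdates(bus: EventBus, personaId: string, joined: Set<string>): void {
  const onRoom = (e: RoomEvent) => {
    if (e.memberIds.includes(personaId)) joined.add(e.roomId);
  };
  bus.on('data:rooms:updated', onRoom);
  bus.on('data:rooms:created', onRoom); // previously missing
}
```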
When user UUIDs change (e.g., after data reseed), the uniqueId-based lookup
fails because uniqueId uses the last 6 chars of UUIDs.

Two-phase lookup now:
1. Fast path: Try uniqueId (deterministic, handles most cases)
2. Fallback: Search direct/private rooms and match by member set

When the fallback finds a match, it updates the room's uniqueId to the current
format for future fast lookups. This prevents duplicate DM rooms after reseed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
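The two-phase lookup can be sketched as follows (types and function names are hypothetical; only the fast-path/fallback/self-heal structure comes from the change):

```typescript
interface Room {
  id: string;
  uniqueId: string;
  memberIds: string[];
  type: 'dm' | 'public';
}

// Order-independent member-set comparison.
function sameMemberSet(a: string[], b: string[]): boolean {
  return a.length === b.length &&
    [...a].sort().join('|') === [...b].sort().join('|');
}

// Phase 1: fast uniqueId match. Phase 2: member-set fallback that also
// "heals" the stale uniqueId so the next lookup takes the fast path.
function findDmRoom(rooms: Room[], uniqueId: string, memberIds: string[]): Room | undefined {
  const fast = rooms.find((r) => r.uniqueId === uniqueId);
  if (fast) return fast;

  const fallback = rooms.find(
    (r) => r.type === 'dm' && sameMemberSet(r.memberIds, memberIds),
  );
  if (fallback) fallback.uniqueId = uniqueId; // upgrade to current format
  return fallback;
}
```

The self-healing write in the fallback is what prevents duplicate DM rooms: once upgraded, every subsequent lookup hits phase 1.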
DM command now correctly uses persona's callerId:
- Priority 1: params.callerId from persona tool context
- Priority 2: params.personaId alternative
- Priority 3: UserIdentityResolver fallback (human/CLI)

This fixes the bug where Teacher AI calling the DM tool created
"Claude Code & Claude Code" instead of "Teacher AI & Claude Code".

Also adds MarkdownToolAdapter to parse tool calls that local models
(llama3.2-3b, etc.) produce in backtick format:

`tool: collaboration/dm`
`participants=helper`

This format is now parsed alongside the standard XML formats.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
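The three-level priority chain is just nullish coalescing; a minimal sketch (the resolver callback stands in for UserIdentityResolver, whose real API is not shown here):

```typescript
interface DmParams {
  callerId?: string;  // persona tool context (priority 1)
  personaId?: string; // alternative (priority 2)
}

// Falls through callerId → personaId → identity-resolver fallback.
function resolveCallerId(params: DmParams, resolveIdentity: () => string): string {
  return params.callerId ?? params.personaId ?? resolveIdentity();
}
```

Using `??` rather than `||` matters only if an empty string should count as "present"; here IDs are never empty, so either works, but `??` states the intent ("missing", not "falsy").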
Three critical fixes for widget routing:

1. Infinite loop prevention (MainWidget.ts)
- Add guard to switchContentView() tracking currentViewType/entityId
- Skip re-render when already showing the same content
- Prevents cascading events from triggering infinite widget recreation

2. Remove duplicate event emission (RoomListWidget.ts)
- Remove direct content:opened emit from selectRoom()
- The collaboration/content/open command already emits with proper contentItemId
- Reduces event noise and potential race conditions

3. Critical pageState fix (MainWidget.ts)
- ROOM_SELECTED handler now calls pageState.setContent() BEFORE switchContentView()
- ChatWidget reads from pageState as priority 1, was loading stale room
- Now room switching correctly updates both tab AND chat content

Result: Clicking rooms in sidebar now properly switches tabs and content
without infinite loops or stale data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull request overview
This PR introduces a comprehensive LoRA adapter system with inference coordination, spanning local inference infrastructure, adapter management, provider abstraction, and extensive documentation. The changes enable personas to discover, try, and adopt adapters autonomously while preventing inference queue flooding through coordinated slot management.
Key changes:
- Added InferenceCoordinator for rate-limiting AI inference requests across multiple personas
- Implemented federated adapter provider system (Local, Together.ai) with unified search
- Created adapter commands for search, try, and adopt workflows
- Added comprehensive architecture documentation for LoRA lab, training, and persona consciousness
- Updated configuration to use HF_TOKEN instead of HUGGINGFACE_API_KEY
Reviewed changes
Copilot reviewed 119 out of 230 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| AIDecisionService.ts | Added InferenceCoordinator integration to prevent thundering herd during gating/redundancy checks |
| system/adapters/* | New provider abstraction layer for cross-platform adapter management |
| shared/version.ts | Version bump to 1.0.6890 |
| WorkerRegistry.ts | Added embedding and inference workers to registry |
| generated-command-constants.ts | Added adapter and inference command constants |
| server/generated.ts | Registered new adapter and inference commands |
| scripts/* | New setup and benchmarking scripts for Rust/inference |
| signaling/* | Type safety improvements with readonly arrays |
| package.json | Added gRPC dependencies and updated worker lifecycle scripts |
| generator/specs/inference-generate.json | New inference command specification |
| generated-command-schemas.json | Schema updates for new commands |
| ExampleConfigServer.ts | Return type improvements for port getters |
| docs/* | Extensive new architecture documentation |
| daemons/data-daemon/shared/DataDaemon.ts | Added queryWithJoin for optimized related data loading |
| BaseLocalAdapter.ts | Type safety improvement for parseModelsResponse |
Files not reviewed (1)
- src/debug/jtag/package-lock.json: Language not supported
Summary
Changes
- Removed duplicate content:opened event emission (command already emits it)

Test plan
🤖 Generated with Claude Code