feat(monitoring): Add comprehensive agent memory monitoring and crash insights (#52)
Merged
khaliqgant merged 12 commits into main on Jan 3, 2026
… insights

This adds a complete monitoring system for tracking agent memory usage, detecting crashes, and providing proactive alerting:

- Memory Monitor Service (memory-monitor.ts):
  - Detailed memory metrics (RSS, heap, CPU)
  - Memory trend analysis (growing/stable/shrinking)
  - High watermark and average tracking
  - Configurable thresholds for warnings/critical alerts
  - Proactive alerts before OOM
- Crash Insights Service (crash-insights.ts):
  - Crash history with analysis
  - Memory state capture at crash time
  - Root cause analysis (OOM, memory leak, spike)
  - Recommendations for prevention
  - Pattern detection across crashes
- Cloud API (monitoring.ts):
  - POST /api/monitoring/metrics - Report agent metrics
  - POST /api/monitoring/crash - Report crashes
  - POST /api/monitoring/alert - Report alerts
  - GET /api/monitoring/overview - Dashboard overview
  - GET /api/monitoring/insights - Health insights
- Database Schema:
  - agent_metrics table for metrics storage
  - agent_crashes table for crash history
  - memory_alerts table for alert tracking
- CLI Commands:
  - agent-relay metrics - View memory metrics
  - agent-relay health - View crash insights and health score
  - agent-relay profile - Run agent with profiling enabled
- Dashboard API:
  - GET /api/metrics/agents - Agent memory metrics
  - GET /api/metrics/health - System health status
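The trend analysis described above (growing/stable/shrinking) could be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: the names `MemorySample` and `analyzeTrend` and the half-window comparison are assumptions.

```typescript
// Hypothetical sketch of memory trend classification over a sample window.
// Names and the tolerance-band heuristic are assumptions, not the real API.

interface MemorySample {
  timestamp: number; // ms since epoch
  rssBytes: number;  // resident set size at sample time
}

type MemoryTrend = "growing" | "stable" | "shrinking";

// Compare the average RSS of the first and second halves of the window,
// with a tolerance band so small fluctuations don't flip the trend.
function analyzeTrend(samples: MemorySample[], tolerance = 0.05): MemoryTrend {
  if (samples.length < 4) return "stable"; // not enough data to judge
  const mid = Math.floor(samples.length / 2);
  const avg = (xs: MemorySample[]) =>
    xs.reduce((sum, s) => sum + s.rssBytes, 0) / xs.length;
  const first = avg(samples.slice(0, mid));
  const second = avg(samples.slice(mid));
  const delta = (second - first) / first;
  if (delta > tolerance) return "growing";
  if (delta < -tolerance) return "shrinking";
  return "stable";
}
```

A steadily climbing RSS series would classify as `"growing"` and feed the proactive pre-OOM alerting; a flat series stays `"stable"`.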
…oring for cloud

- Add Agent Memory & Resources section to metrics dashboard
- Display memory usage bars with alert level indicators (healthy/warning/critical/OOM)
- Show CPU usage, memory trends, and peak memory per agent
- Add system memory indicator showing total/free memory
- Auto-enable memory monitoring when RELAY_CLOUD_ENABLED=true
- Support RELAY_MEMORY_MONITORING=true for explicit enablement
- Memory panel fetches from /api/metrics/agents endpoint
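The enablement rule above (auto-enable in cloud mode, explicit opt-in otherwise) amounts to a small guard like the following. The function name is an assumption; the environment variable names are taken from the commit.

```typescript
// Minimal sketch of the enablement check: memory monitoring turns on
// automatically when the daemon runs in cloud mode, or via explicit
// opt-in for local runs. isMemoryMonitoringEnabled is a hypothetical name.

function isMemoryMonitoringEnabled(env: Record<string, string | undefined>): boolean {
  // Auto-enable whenever cloud mode is on.
  if (env.RELAY_CLOUD_ENABLED === "true") return true;
  // Explicit opt-in for standalone/local daemons.
  return env.RELAY_MEMORY_MONITORING === "true";
}
```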
…h insights
- Add 31 tests for AgentMemoryMonitor covering registration, lifecycle,
metrics, crash context, trend analysis, alerts, and watermarks
- Add 35 tests for CrashInsightsService covering crash recording,
history, statistics, insights, analysis, and persistence
- Fix dynamic import in dashboard-server to use static import
- Fix WorkerInfo type usage (spawnedAt instead of startedAt)
- Convert require('os') to static import in memory-monitor
All 66 tests pass.
- Add docker-compose.test.yml for full cloud simulation
- Add daemon simulator that reports metrics, crashes, and alerts
- Add integration tests for monitoring API endpoints
- Add test helper API routes for creating test users/daemons
- Add QA runner script (scripts/run-cloud-qa.sh)
- Add comprehensive local testing documentation

This enables full end-to-end testing of:
- Agent memory monitoring
- Crash reporting and insights
- Alert system
- Multi-daemon scenarios

Usage:
./scripts/run-cloud-qa.sh          # Full QA suite
./scripts/run-cloud-qa.sh --quick  # Smoke test
npm run test:integration           # Integration tests only
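A daemon simulator like the one above would periodically build and POST a metrics payload to the cloud API's `/api/monitoring/metrics` endpoint (the path is from this PR). The payload shape and the `buildMetricsReport` helper below are assumptions for illustration only.

```typescript
// Hypothetical shape of a simulator's metrics report. The real payload
// schema is defined in the PR's cloud API and may differ.

interface AgentMetricsReport {
  daemonId: string;
  agentId: string;
  rssBytes: number;
  heapUsedBytes: number;
  cpuPercent: number;
  reportedAt: string; // ISO timestamp
}

function buildMetricsReport(
  daemonId: string,
  agentId: string,
  rssBytes: number,
  heapUsedBytes: number,
  cpuPercent: number,
): AgentMetricsReport {
  return { daemonId, agentId, rssBytes, heapUsedBytes, cpuPercent, reportedAt: new Date().toISOString() };
}

// Example send (left unexecuted; host and port are placeholders):
// await fetch("http://localhost:3000/api/monitoring/metrics", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildMetricsReport("daemon-1", "agent-1", 120_000_000, 40_000_000, 12)),
// });
```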
- Add scripts/manual-qa.sh for easy local QA setup
- Add npm run qa / npm run qa:stop commands
- Script starts infrastructure, cloud server, creates test data, and starts daemon simulators automatically

Usage:
npm run qa       # Start everything for manual testing
npm run qa:stop  # Stop all services
Implements a complete auto-scaling system with:

- ScalingPolicyService: Configurable policies per plan tier (free/pro/team/enterprise) with thresholds for memory, CPU, agent count, and trend analysis
- AutoScaler: Leader-elected scaling coordinator using Redis pub/sub for cross-server communication
- CapacityManager: Real-time workspace capacity tracking with placement recommendations and capacity forecasting
- ScalingOrchestrator: Integration layer connecting auto-scaler, capacity manager, and workspace provisioner

Key features:
- Plan-based scaling thresholds and cooldowns
- Memory pressure and trend-based scaling triggers
- Agent count-based scaling (90% threshold)
- Cross-server coordination via Redis pub/sub
- Distributed locking for scaling operations
- Capacity forecasting with 15/60 min projections
- Optimal agent placement recommendations

The provisioner uses an adapter pattern (ComputeProvisioner interface) supporting Fly.io, Railway, and Docker, and is easily extensible to Kubernetes.

Includes 21 unit tests for ScalingPolicyService.
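Plan-based thresholds with cooldowns, as described above, might look something like this. The tier names and the 90% agent-count threshold come from the PR; the memory thresholds, cooldown durations, and the `shouldScaleUp` helper are illustrative assumptions.

```typescript
// Sketch of per-plan-tier scaling thresholds. Numeric values are
// invented for illustration; only the tier names and the 90% agent
// threshold appear in the PR.

type PlanTier = "free" | "pro" | "team" | "enterprise";

interface ScalingThresholds {
  memoryPercent: number; // scale when memory usage exceeds this
  agentFraction: number; // scale when agents reach this fraction of the limit
  cooldownMs: number;    // minimum gap between scaling actions
}

const THRESHOLDS: Record<PlanTier, ScalingThresholds> = {
  free:       { memoryPercent: 90, agentFraction: 0.9, cooldownMs: 10 * 60_000 },
  pro:        { memoryPercent: 85, agentFraction: 0.9, cooldownMs: 5 * 60_000 },
  team:       { memoryPercent: 80, agentFraction: 0.9, cooldownMs: 3 * 60_000 },
  enterprise: { memoryPercent: 75, agentFraction: 0.9, cooldownMs: 60_000 },
};

// Decide whether a workspace should scale up, honoring the tier cooldown.
function shouldScaleUp(
  tier: PlanTier,
  memoryPercent: number,
  agentCount: number,
  agentLimit: number,
  msSinceLastAction: number,
): boolean {
  const t = THRESHOLDS[tier];
  if (msSinceLastAction < t.cooldownMs) return false; // still cooling down
  return memoryPercent >= t.memoryPercent || agentCount / agentLimit >= t.agentFraction;
}
```

The cooldown check runs first so that a burst of high-memory samples cannot trigger repeated scale-ups back to back.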
Implement vertical scaling as a higher-priority alternative to horizontal scaling (adding workspaces). This is more efficient because it scales resources within an existing workspace before provisioning new ones.

Changes:
- Add new scaling action types: resize_up, resize_down, increase_agent_limit, migrate_agents
- Add ResourceTier interface with small/medium/large/xlarge configurations
- Implement resize methods in FlyProvisioner (resize, updateAgentLimit, getCurrentTier)
- Add handlers in ScalingOrchestrator for vertical scaling operations
- Add in-workspace scaling policies with higher priority (135-150) than horizontal scaling (80-100)
- Add updateConfig method to WorkspaceQueries for config updates
- Add resourceTier field to WorkspaceConfig schema

Policy priority order:
1. agent-limit-increase (150) - Increase max agents in single workspace
2. workspace-resize-up (140) - Vertical scale when memory is high
3. cpu-pressure-resize (135) - Resize when CPU consistently high
4. memory-pressure-scale-up (100) - Add workspace when memory high
5. agent-count-scale-up (80) - Add workspace when agents high
6. workspace-resize-down (45) - Reduce resources when underutilized
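The priority ordering above reduces to picking the highest-priority policy whose condition matches. The policy names and priority numbers below come from the commit; the match conditions and the `selectPolicy` helper are assumptions for illustration.

```typescript
// Sketch of highest-priority-wins policy selection. Names and priorities
// mirror the commit's list; the match predicates are invented.

interface WorkspaceMetrics {
  memoryPercent: number;
  cpuPercent: number;
  agentFraction: number; // agents / agent limit
}

interface ScalingPolicy {
  name: string;
  priority: number;
  matches: (m: WorkspaceMetrics) => boolean;
}

const POLICIES: ScalingPolicy[] = [
  { name: "agent-limit-increase",     priority: 150, matches: m => m.agentFraction >= 0.95 },
  { name: "workspace-resize-up",      priority: 140, matches: m => m.memoryPercent >= 85 },
  { name: "cpu-pressure-resize",      priority: 135, matches: m => m.cpuPercent >= 80 },
  { name: "memory-pressure-scale-up", priority: 100, matches: m => m.memoryPercent >= 90 },
  { name: "agent-count-scale-up",     priority: 80,  matches: m => m.agentFraction >= 0.9 },
  { name: "workspace-resize-down",    priority: 45,  matches: m => m.memoryPercent < 30 },
];

// Return the highest-priority matching policy, or null if none match.
function selectPolicy(metrics: WorkspaceMetrics): ScalingPolicy | null {
  return [...POLICIES]
    .sort((a, b) => b.priority - a.priority)
    .find(p => p.matches(metrics)) ?? null;
}
```

Because the vertical policies sit at 135-150, a workspace under memory pressure resizes in place (priority 140) before the horizontal memory policy (priority 100) ever fires.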
Previously, the workspace limit check would block ALL scaling when at max workspaces. This prevented in-workspace scaling (resize, agent limit increase) from working even though those actions don't require adding new workspaces. Now the limit check only blocks scale_up actions, allowing vertical scaling to continue working when horizontal scaling is blocked.
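The fix described above boils down to scoping the limit check to the one action that actually adds a workspace. The action type names come from this PR's commits; the guard function itself is a hypothetical sketch.

```typescript
// Sketch of the narrowed workspace-limit guard: only horizontal scale_up
// is blocked at the limit; vertical (in-workspace) actions always pass.

type ScalingAction =
  | "scale_up"
  | "scale_down"
  | "resize_up"
  | "resize_down"
  | "increase_agent_limit"
  | "migrate_agents";

function isActionAllowed(
  action: ScalingAction,
  workspaceCount: number,
  maxWorkspaces: number,
): boolean {
  // Only adding a new workspace is constrained by the workspace limit.
  if (action === "scale_up") return workspaceCount < maxWorkspaces;
  return true;
}
```

With the previous behavior, every action would have been rejected at `workspaceCount === maxWorkspaces`; here `resize_up` and `increase_agent_limit` still go through.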
Resolved conflicts:
- src/cloud/provisioner/index.ts: kept HEAD (unused variable removed)
- src/cloud/server.ts: combined routers from both branches
- src/resiliency/index.ts: combined exports from both branches

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>