feat(monitoring): Add comprehensive agent memory monitoring and crash insights (#52)
Merged
khaliqgant merged 12 commits into main on Jan 3, 2026
… insights

This adds a complete monitoring system for tracking agent memory usage, detecting crashes, and providing proactive alerting:

- Memory Monitor Service (memory-monitor.ts):
  - Detailed memory metrics (RSS, heap, CPU)
  - Memory trend analysis (growing/stable/shrinking)
  - High watermark and average tracking
  - Configurable thresholds for warnings/critical alerts
  - Proactive alerts before OOM
- Crash Insights Service (crash-insights.ts):
  - Crash history with analysis
  - Memory state capture at crash time
  - Root cause analysis (OOM, memory leak, spike)
  - Recommendations for prevention
  - Pattern detection across crashes
- Cloud API (monitoring.ts):
  - POST /api/monitoring/metrics - Report agent metrics
  - POST /api/monitoring/crash - Report crashes
  - POST /api/monitoring/alert - Report alerts
  - GET /api/monitoring/overview - Dashboard overview
  - GET /api/monitoring/insights - Health insights
- Database Schema:
  - agent_metrics table for metrics storage
  - agent_crashes table for crash history
  - memory_alerts table for alert tracking
- CLI Commands:
  - agent-relay metrics - View memory metrics
  - agent-relay health - View crash insights and health score
  - agent-relay profile - Run agent with profiling enabled
- Dashboard API:
  - GET /api/metrics/agents - Agent memory metrics
  - GET /api/metrics/health - System health status
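The trend analysis described above (growing/stable/shrinking) could be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: the names `MemorySample` and `analyzeTrend` and the half-window comparison are assumptions.

```typescript
// Hypothetical sketch of memory trend classification over a sample window.
// Names and the tolerance-band heuristic are assumptions, not the real API.

interface MemorySample {
  timestamp: number; // ms since epoch
  rssBytes: number;  // resident set size at sample time
}

type MemoryTrend = "growing" | "stable" | "shrinking";

// Compare the average RSS of the first and second halves of the window,
// with a tolerance band so small fluctuations don't flip the trend.
function analyzeTrend(samples: MemorySample[], tolerance = 0.05): MemoryTrend {
  if (samples.length < 4) return "stable"; // not enough data to judge
  const mid = Math.floor(samples.length / 2);
  const avg = (xs: MemorySample[]) =>
    xs.reduce((sum, s) => sum + s.rssBytes, 0) / xs.length;
  const first = avg(samples.slice(0, mid));
  const second = avg(samples.slice(mid));
  const delta = (second - first) / first;
  if (delta > tolerance) return "growing";
  if (delta < -tolerance) return "shrinking";
  return "stable";
}
```

A steadily climbing RSS series would classify as `"growing"` and feed the proactive pre-OOM alerting; a flat series stays `"stable"`.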
…oring for cloud

- Add Agent Memory & Resources section to metrics dashboard
- Display memory usage bars with alert level indicators (healthy/warning/critical/OOM)
- Show CPU usage, memory trends, and peak memory per agent
- Add system memory indicator showing total/free memory
- Auto-enable memory monitoring when RELAY_CLOUD_ENABLED=true
- Support RELAY_MEMORY_MONITORING=true for explicit enablement
- Memory panel fetches from /api/metrics/agents endpoint
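The enablement rule above (auto-enable in cloud mode, explicit opt-in otherwise) amounts to a small guard like the following. The function name is an assumption; the environment variable names are taken from the commit.

```typescript
// Minimal sketch of the enablement check: memory monitoring turns on
// automatically when the daemon runs in cloud mode, or via explicit
// opt-in for local runs. isMemoryMonitoringEnabled is a hypothetical name.

function isMemoryMonitoringEnabled(env: Record<string, string | undefined>): boolean {
  // Auto-enable whenever cloud mode is on.
  if (env.RELAY_CLOUD_ENABLED === "true") return true;
  // Explicit opt-in for standalone/local daemons.
  return env.RELAY_MEMORY_MONITORING === "true";
}
```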
…h insights
- Add 31 tests for AgentMemoryMonitor covering registration, lifecycle,
metrics, crash context, trend analysis, alerts, and watermarks
- Add 35 tests for CrashInsightsService covering crash recording,
history, statistics, insights, analysis, and persistence
- Fix dynamic import in dashboard-server to use static import
- Fix WorkerInfo type usage (spawnedAt instead of startedAt)
- Convert require('os') to static import in memory-monitor
All 66 tests pass.
- Add docker-compose.test.yml for full cloud simulation
- Add daemon simulator that reports metrics, crashes, and alerts
- Add integration tests for monitoring API endpoints
- Add test helper API routes for creating test users/daemons
- Add QA runner script (scripts/run-cloud-qa.sh)
- Add comprehensive local testing documentation

This enables full end-to-end testing of:
- Agent memory monitoring
- Crash reporting and insights
- Alert system
- Multi-daemon scenarios

Usage:
./scripts/run-cloud-qa.sh          # Full QA suite
./scripts/run-cloud-qa.sh --quick  # Smoke test
npm run test:integration           # Integration tests only
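A daemon simulator like the one above would periodically build and POST a metrics payload to the cloud API's `/api/monitoring/metrics` endpoint (the path is from this PR). The payload shape and the `buildMetricsReport` helper below are assumptions for illustration only.

```typescript
// Hypothetical shape of a simulator's metrics report. The real payload
// schema is defined in the PR's cloud API and may differ.

interface AgentMetricsReport {
  daemonId: string;
  agentId: string;
  rssBytes: number;
  heapUsedBytes: number;
  cpuPercent: number;
  reportedAt: string; // ISO timestamp
}

function buildMetricsReport(
  daemonId: string,
  agentId: string,
  rssBytes: number,
  heapUsedBytes: number,
  cpuPercent: number,
): AgentMetricsReport {
  return { daemonId, agentId, rssBytes, heapUsedBytes, cpuPercent, reportedAt: new Date().toISOString() };
}

// Example send (left unexecuted; host and port are placeholders):
// await fetch("http://localhost:3000/api/monitoring/metrics", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildMetricsReport("daemon-1", "agent-1", 120_000_000, 40_000_000, 12)),
// });
```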
- Add scripts/manual-qa.sh for easy local QA setup
- Add npm run qa / npm run qa:stop commands
- Script starts infrastructure, cloud server, creates test data, and starts daemon simulators automatically

Usage:
npm run qa       # Start everything for manual testing
npm run qa:stop  # Stop all services
Implements a complete auto-scaling system with:

- ScalingPolicyService: Configurable policies per plan tier (free/pro/team/enterprise) with thresholds for memory, CPU, agent count, and trend analysis
- AutoScaler: Leader-elected scaling coordinator using Redis pub/sub for cross-server communication
- CapacityManager: Real-time workspace capacity tracking with placement recommendations and capacity forecasting
- ScalingOrchestrator: Integration layer connecting auto-scaler, capacity manager, and workspace provisioner

Key features:
- Plan-based scaling thresholds and cooldowns
- Memory pressure and trend-based scaling triggers
- Agent count-based scaling (90% threshold)
- Cross-server coordination via Redis pub/sub
- Distributed locking for scaling operations
- Capacity forecasting with 15/60 min projections
- Optimal agent placement recommendations

The provisioner uses an adapter pattern (ComputeProvisioner interface) supporting Fly.io, Railway, and Docker, and is easily extensible to Kubernetes.

Includes 21 unit tests for ScalingPolicyService.
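Plan-based thresholds with cooldowns, as described above, might look something like this. The tier names and the 90% agent-count threshold come from the PR; the memory thresholds, cooldown durations, and the `shouldScaleUp` helper are illustrative assumptions.

```typescript
// Sketch of per-plan-tier scaling thresholds. Numeric values are
// invented for illustration; only the tier names and the 90% agent
// threshold appear in the PR.

type PlanTier = "free" | "pro" | "team" | "enterprise";

interface ScalingThresholds {
  memoryPercent: number; // scale when memory usage exceeds this
  agentFraction: number; // scale when agents reach this fraction of the limit
  cooldownMs: number;    // minimum gap between scaling actions
}

const THRESHOLDS: Record<PlanTier, ScalingThresholds> = {
  free:       { memoryPercent: 90, agentFraction: 0.9, cooldownMs: 10 * 60_000 },
  pro:        { memoryPercent: 85, agentFraction: 0.9, cooldownMs: 5 * 60_000 },
  team:       { memoryPercent: 80, agentFraction: 0.9, cooldownMs: 3 * 60_000 },
  enterprise: { memoryPercent: 75, agentFraction: 0.9, cooldownMs: 60_000 },
};

// Decide whether a workspace should scale up, honoring the tier cooldown.
function shouldScaleUp(
  tier: PlanTier,
  memoryPercent: number,
  agentCount: number,
  agentLimit: number,
  msSinceLastAction: number,
): boolean {
  const t = THRESHOLDS[tier];
  if (msSinceLastAction < t.cooldownMs) return false; // still cooling down
  return memoryPercent >= t.memoryPercent || agentCount / agentLimit >= t.agentFraction;
}
```

The cooldown check runs first so that a burst of high-memory samples cannot trigger repeated scale-ups back to back.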
Implement vertical scaling as a higher-priority alternative to horizontal scaling (adding workspaces). This is more efficient because it scales resources within an existing workspace before provisioning new ones.

Changes:
- Add new scaling action types: resize_up, resize_down, increase_agent_limit, migrate_agents
- Add ResourceTier interface with small/medium/large/xlarge configurations
- Implement resize methods in FlyProvisioner (resize, updateAgentLimit, getCurrentTier)
- Add handlers in ScalingOrchestrator for vertical scaling operations
- Add in-workspace scaling policies with higher priority (135-150) than horizontal scaling (80-100)
- Add updateConfig method to WorkspaceQueries for config updates
- Add resourceTier field to WorkspaceConfig schema

Policy priority order:
1. agent-limit-increase (150) - Increase max agents in single workspace
2. workspace-resize-up (140) - Vertical scale when memory is high
3. cpu-pressure-resize (135) - Resize when CPU consistently high
4. memory-pressure-scale-up (100) - Add workspace when memory high
5. agent-count-scale-up (80) - Add workspace when agents high
6. workspace-resize-down (45) - Reduce resources when underutilized
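The priority ordering above reduces to picking the highest-priority policy whose condition matches. The policy names and priority numbers below come from the commit; the match conditions and the `selectPolicy` helper are assumptions for illustration.

```typescript
// Sketch of highest-priority-wins policy selection. Names and priorities
// mirror the commit's list; the match predicates are invented.

interface WorkspaceMetrics {
  memoryPercent: number;
  cpuPercent: number;
  agentFraction: number; // agents / agent limit
}

interface ScalingPolicy {
  name: string;
  priority: number;
  matches: (m: WorkspaceMetrics) => boolean;
}

const POLICIES: ScalingPolicy[] = [
  { name: "agent-limit-increase",     priority: 150, matches: m => m.agentFraction >= 0.95 },
  { name: "workspace-resize-up",      priority: 140, matches: m => m.memoryPercent >= 85 },
  { name: "cpu-pressure-resize",      priority: 135, matches: m => m.cpuPercent >= 80 },
  { name: "memory-pressure-scale-up", priority: 100, matches: m => m.memoryPercent >= 90 },
  { name: "agent-count-scale-up",     priority: 80,  matches: m => m.agentFraction >= 0.9 },
  { name: "workspace-resize-down",    priority: 45,  matches: m => m.memoryPercent < 30 },
];

// Return the highest-priority matching policy, or null if none match.
function selectPolicy(metrics: WorkspaceMetrics): ScalingPolicy | null {
  return [...POLICIES]
    .sort((a, b) => b.priority - a.priority)
    .find(p => p.matches(metrics)) ?? null;
}
```

Because the vertical policies sit at 135-150, a workspace under memory pressure resizes in place (priority 140) before the horizontal memory policy (priority 100) ever fires.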
Previously, the workspace limit check would block ALL scaling when at max workspaces. This prevented in-workspace scaling (resize, agent limit increase) from working even though those actions don't require adding new workspaces. Now the limit check only blocks scale_up actions, allowing vertical scaling to continue working when horizontal scaling is blocked.
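The fix described above boils down to scoping the limit check to the one action that actually adds a workspace. The action type names come from this PR's commits; the guard function itself is a hypothetical sketch.

```typescript
// Sketch of the narrowed workspace-limit guard: only horizontal scale_up
// is blocked at the limit; vertical (in-workspace) actions always pass.

type ScalingAction =
  | "scale_up"
  | "scale_down"
  | "resize_up"
  | "resize_down"
  | "increase_agent_limit"
  | "migrate_agents";

function isActionAllowed(
  action: ScalingAction,
  workspaceCount: number,
  maxWorkspaces: number,
): boolean {
  // Only adding a new workspace is constrained by the workspace limit.
  if (action === "scale_up") return workspaceCount < maxWorkspaces;
  return true;
}
```

With the previous behavior, every action would have been rejected at `workspaceCount === maxWorkspaces`; here `resize_up` and `increase_agent_limit` still go through.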
Resolved conflicts:
- src/cloud/provisioner/index.ts: kept HEAD (unused variable removed)
- src/cloud/server.ts: combined routers from both branches
- src/resiliency/index.ts: combined exports from both branches

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>