
feat(monitoring): Add comprehensive agent memory monitoring and crash insights#52

Merged
khaliqgant merged 12 commits into main from
claude/agent-memory-monitoring-roI9F
on Jan 3, 2026

Conversation

@khaliqgant
Member

This adds a complete monitoring system for tracking agent memory usage,
detecting crashes, and providing proactive alerting:

  • Memory Monitor Service (memory-monitor.ts):

    • Detailed memory metrics (RSS, heap, CPU)
    • Memory trend analysis (growing/stable/shrinking)
    • High watermark and average tracking
    • Configurable thresholds for warnings/critical alerts
    • Proactive alerts before OOM
  • Crash Insights Service (crash-insights.ts):

    • Crash history with analysis
    • Memory state capture at crash time
    • Root cause analysis (OOM, memory leak, spike)
    • Recommendations for prevention
    • Pattern detection across crashes
  • Cloud API (monitoring.ts):

    • POST /api/monitoring/metrics - Report agent metrics
    • POST /api/monitoring/crash - Report crashes
    • POST /api/monitoring/alert - Report alerts
    • GET /api/monitoring/overview - Dashboard overview
    • GET /api/monitoring/insights - Health insights
  • Database Schema:

    • agent_metrics table for metrics storage
    • agent_crashes table for crash history
    • memory_alerts table for alert tracking
  • CLI Commands:

    • agent-relay metrics - View memory metrics
    • agent-relay health - View crash insights and health score
    • agent-relay profile - Run agent with profiling enabled
  • Dashboard API:

    • GET /api/metrics/agents - Agent memory metrics
    • GET /api/metrics/health - System health status
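The trend analysis and watermark tracking described above could be sketched roughly as follows. This is a minimal illustration, not the actual memory-monitor.ts API: `MemorySample`, `classifyTrend`, `summarize`, and the 2% stability tolerance are all assumptions.

```typescript
// Hypothetical sketch of memory trend classification; the real
// service's sampling and thresholds may differ.
type MemoryTrend = "growing" | "stable" | "shrinking";

interface MemorySample {
  timestamp: number; // ms since epoch
  rssBytes: number;  // resident set size at sample time
}

// Classify the trend from the first and last samples; relative change
// below tolerancePct counts as stable.
function classifyTrend(samples: MemorySample[], tolerancePct = 2): MemoryTrend {
  if (samples.length < 2) return "stable";
  const first = samples[0].rssBytes;
  const last = samples[samples.length - 1].rssBytes;
  const changePct = ((last - first) / first) * 100;
  if (changePct > tolerancePct) return "growing";
  if (changePct < -tolerancePct) return "shrinking";
  return "stable";
}

// High watermark and running average, as tracked per agent.
function summarize(samples: MemorySample[]) {
  const values = samples.map((s) => s.rssBytes);
  return {
    highWatermark: Math.max(...values),
    average: values.reduce((a, b) => a + b, 0) / values.length,
    trend: classifyTrend(samples),
  };
}

const samples: MemorySample[] = [
  { timestamp: 0, rssBytes: 100_000_000 },
  { timestamp: 5_000, rssBytes: 110_000_000 },
  { timestamp: 10_000, rssBytes: 125_000_000 },
];
console.log(summarize(samples).trend); // "growing"
```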

claude and others added 12 commits on January 2, 2026 at 17:50
… insights

This adds a complete monitoring system for tracking agent memory usage,
detecting crashes, and providing proactive alerting:

- Memory Monitor Service (memory-monitor.ts):
  - Detailed memory metrics (RSS, heap, CPU)
  - Memory trend analysis (growing/stable/shrinking)
  - High watermark and average tracking
  - Configurable thresholds for warnings/critical alerts
  - Proactive alerts before OOM

- Crash Insights Service (crash-insights.ts):
  - Crash history with analysis
  - Memory state capture at crash time
  - Root cause analysis (OOM, memory leak, spike)
  - Recommendations for prevention
  - Pattern detection across crashes

- Cloud API (monitoring.ts):
  - POST /api/monitoring/metrics - Report agent metrics
  - POST /api/monitoring/crash - Report crashes
  - POST /api/monitoring/alert - Report alerts
  - GET /api/monitoring/overview - Dashboard overview
  - GET /api/monitoring/insights - Health insights

- Database Schema:
  - agent_metrics table for metrics storage
  - agent_crashes table for crash history
  - memory_alerts table for alert tracking

- CLI Commands:
  - agent-relay metrics - View memory metrics
  - agent-relay health - View crash insights and health score
  - agent-relay profile - Run agent with profiling enabled

- Dashboard API:
  - GET /api/metrics/agents - Agent memory metrics
  - GET /api/metrics/health - System health status
…oring for cloud

- Add Agent Memory & Resources section to metrics dashboard
- Display memory usage bars with alert level indicators (healthy/warning/critical/OOM)
- Show CPU usage, memory trends, and peak memory per agent
- Add system memory indicator showing total/free memory
- Auto-enable memory monitoring when RELAY_CLOUD_ENABLED=true
- Support RELAY_MEMORY_MONITORING=true for explicit enablement
- Memory panel fetches from /api/metrics/agents endpoint
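The alert level indicators on the memory usage bars could map utilization to levels along these lines. The function name and the 75%/90% cutoffs are assumptions for illustration; only the four level names come from the commit above.

```typescript
// Hypothetical mapping from memory utilization to the dashboard's
// alert levels; thresholds are illustrative, not the shipped values.
type AlertLevel = "healthy" | "warning" | "critical" | "oom";

function alertLevel(usedBytes: number, limitBytes: number): AlertLevel {
  const pct = (usedBytes / limitBytes) * 100;
  if (pct >= 100) return "oom";      // at or past the memory limit
  if (pct >= 90) return "critical";  // imminent OOM risk
  if (pct >= 75) return "warning";   // elevated usage
  return "healthy";
}

// 400 MiB of a 512 MiB limit is ~78% utilization.
console.log(alertLevel(400 * 1024 * 1024, 512 * 1024 * 1024)); // "warning"
```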
…h insights

- Add 31 tests for AgentMemoryMonitor covering registration, lifecycle,
  metrics, crash context, trend analysis, alerts, and watermarks
- Add 35 tests for CrashInsightsService covering crash recording,
  history, statistics, insights, analysis, and persistence
- Fix dynamic import in dashboard-server to use static import
- Fix WorkerInfo type usage (spawnedAt instead of startedAt)
- Convert require('os') to static import in memory-monitor

All 66 tests pass.
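The `require('os')` conversion mentioned above is the standard change from a CommonJS call to a static ES module import:

```typescript
// Before: lazy CommonJS require, which hides the dependency and
// breaks under ESM.
// const os = require("os");

// After: static import, resolved at module load time.
import * as os from "os";

console.log(os.totalmem() > 0); // true on any running system
```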
- Add docker-compose.test.yml for full cloud simulation
- Add daemon simulator that reports metrics, crashes, and alerts
- Add integration tests for monitoring API endpoints
- Add test helper API routes for creating test users/daemons
- Add QA runner script (scripts/run-cloud-qa.sh)
- Add comprehensive local testing documentation

This enables full end-to-end testing of:
- Agent memory monitoring
- Crash reporting and insights
- Alert system
- Multi-daemon scenarios

Usage:
  ./scripts/run-cloud-qa.sh         # Full QA suite
  ./scripts/run-cloud-qa.sh --quick # Smoke test
  npm run test:integration          # Integration tests only
- Add scripts/manual-qa.sh for easy local QA setup
- Add npm run qa / npm run qa:stop commands
- Script starts infrastructure, cloud server, creates test data,
  and starts daemon simulators automatically

Usage:
  npm run qa       # Start everything for manual testing
  npm run qa:stop  # Stop all services
Implements a complete auto-scaling system with:
- ScalingPolicyService: Configurable policies per plan tier (free/pro/team/enterprise)
  with thresholds for memory, CPU, agent count, and trend analysis
- AutoScaler: Leader-elected scaling coordinator using Redis pub/sub
  for cross-server communication
- CapacityManager: Real-time workspace capacity tracking with
  placement recommendations and capacity forecasting
- ScalingOrchestrator: Integration layer connecting auto-scaler,
  capacity manager, and workspace provisioner

Key features:
- Plan-based scaling thresholds and cooldowns
- Memory pressure and trend-based scaling triggers
- Agent count-based scaling (90% threshold)
- Cross-server coordination via Redis pub/sub
- Distributed locking for scaling operations
- Capacity forecasting with 15/60 min projections
- Optimal agent placement recommendations

The provisioner uses an adapter pattern (ComputeProvisioner interface)
supporting Fly.io, Railway, and Docker - easily extensible to Kubernetes.

Includes 21 unit tests for ScalingPolicyService.
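Plan-based thresholds with cooldowns, as described above, could look roughly like this. The threshold values, cooldown durations, and `shouldScaleUp` signature are assumptions, not ScalingPolicyService's actual configuration.

```typescript
// Hypothetical per-tier scaling thresholds; real values may differ.
type PlanTier = "free" | "pro" | "team" | "enterprise";

interface ScalingThresholds {
  memoryPct: number;     // scale up at or above this memory utilization
  agentCountPct: number; // scale up at or above this fraction of max agents
  cooldownMs: number;    // minimum time between scaling actions
}

const policies: Record<PlanTier, ScalingThresholds> = {
  free:       { memoryPct: 90, agentCountPct: 90, cooldownMs: 10 * 60_000 },
  pro:        { memoryPct: 85, agentCountPct: 90, cooldownMs: 5 * 60_000 },
  team:       { memoryPct: 80, agentCountPct: 90, cooldownMs: 3 * 60_000 },
  enterprise: { memoryPct: 75, agentCountPct: 90, cooldownMs: 60_000 },
};

function shouldScaleUp(
  tier: PlanTier,
  memoryPct: number,
  agentPct: number,
  msSinceLastAction: number,
): boolean {
  const t = policies[tier];
  if (msSinceLastAction < t.cooldownMs) return false; // still cooling down
  return memoryPct >= t.memoryPct || agentPct >= t.agentCountPct;
}

console.log(shouldScaleUp("pro", 88, 50, 600_000)); // true: memory over threshold
console.log(shouldScaleUp("pro", 88, 50, 10_000));  // false: within cooldown
```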
Implement vertical scaling as a higher-priority alternative to horizontal
scaling (adding workspaces). This is more efficient as it scales resources
within an existing workspace before provisioning new ones.

Changes:
- Add new scaling action types: resize_up, resize_down, increase_agent_limit, migrate_agents
- Add ResourceTier interface with small/medium/large/xlarge configurations
- Implement resize methods in FlyProvisioner (resize, updateAgentLimit, getCurrentTier)
- Add handlers in ScalingOrchestrator for vertical scaling operations
- Add in-workspace scaling policies with higher priority (135-150) than
  horizontal scaling (80-100)
- Add updateConfig method to WorkspaceQueries for config updates
- Add resourceTier field to WorkspaceConfig schema

Policy priority order:
1. agent-limit-increase (150) - Increase max agents in single workspace
2. workspace-resize-up (140) - Vertical scale when memory is high
3. cpu-pressure-resize (135) - Resize when CPU consistently high
4. memory-pressure-scale-up (100) - Add workspace when memory high
5. agent-count-scale-up (80) - Add workspace when agents high
6. workspace-resize-down (45) - Reduce resources when underutilized
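The priority ordering above can be sketched as a highest-priority-match selection. The policy names and priorities mirror the list, but the match conditions and `selectAction` helper are illustrative assumptions:

```typescript
// Sketch of priority-ordered policy selection; match conditions are
// assumed, only the names and priorities come from the list above.
interface ScalingPolicy {
  name: string;
  priority: number;
  matches(memPct: number, cpuPct: number, agentPct: number): boolean;
}

const scalingPolicies: ScalingPolicy[] = [
  { name: "agent-limit-increase",     priority: 150, matches: (_m, _c, a) => a >= 90 },
  { name: "workspace-resize-up",      priority: 140, matches: (m) => m >= 85 },
  { name: "cpu-pressure-resize",      priority: 135, matches: (_m, c) => c >= 85 },
  { name: "memory-pressure-scale-up", priority: 100, matches: (m) => m >= 85 },
  { name: "agent-count-scale-up",     priority: 80,  matches: (_m, _c, a) => a >= 90 },
  { name: "workspace-resize-down",    priority: 45,  matches: (m, c, a) => m < 30 && c < 30 && a < 30 },
];

// The highest-priority matching policy wins, so vertical scaling
// (resize) is attempted before adding a new workspace.
function selectAction(memPct: number, cpuPct: number, agentPct: number): string | null {
  const match = scalingPolicies
    .filter((p) => p.matches(memPct, cpuPct, agentPct))
    .sort((a, b) => b.priority - a.priority)[0];
  return match ? match.name : null;
}

// High memory matches both resize-up (140) and scale-up (100);
// the resize wins on priority.
console.log(selectAction(90, 40, 50)); // "workspace-resize-up"
```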
Previously, the workspace limit check would block ALL scaling when at
max workspaces. This prevented in-workspace scaling (resize, agent
limit increase) from working even though those actions don't require
adding new workspaces.

Now the limit check only blocks scale_up actions, allowing vertical
scaling to continue working when horizontal scaling is blocked.
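The fixed guard reduces to checking the action type before applying the workspace limit. The action names come from the commits above; the function itself is a minimal sketch, not the shipped code:

```typescript
// Illustrative version of the fixed limit check: only horizontal
// scale_up is blocked at max workspaces; in-workspace (vertical)
// actions remain allowed.
type ScalingAction =
  | "scale_up"
  | "scale_down"
  | "resize_up"
  | "resize_down"
  | "increase_agent_limit"
  | "migrate_agents";

function isBlockedByWorkspaceLimit(
  action: ScalingAction,
  workspaceCount: number,
  maxWorkspaces: number,
): boolean {
  return action === "scale_up" && workspaceCount >= maxWorkspaces;
}

console.log(isBlockedByWorkspaceLimit("scale_up", 4, 4));  // true: at the limit
console.log(isBlockedByWorkspaceLimit("resize_up", 4, 4)); // false: vertical scaling proceeds
```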
Resolved conflicts:
- src/cloud/provisioner/index.ts: kept HEAD (unused variable removed)
- src/cloud/server.ts: combined routers from both branches
- src/resiliency/index.ts: combined exports from both branches

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@khaliqgant khaliqgant merged commit 152638f into main Jan 3, 2026
6 checks passed
@khaliqgant khaliqgant deleted the claude/agent-memory-monitoring-roI9F branch January 3, 2026 16:54
