Fix shared replicas repository with enhanced memory allocation and metrics#355

Merged
jfrench9 merged 1 commit into main from bugfix/shared-replicas-repo-fix on Feb 14, 2026
Conversation

@jfrench9
Member

Summary

This PR addresses critical issues with shared database replica management while improving system observability and resource allocation. The changes focus on stabilizing the connection pooling mechanism and enhancing monitoring capabilities across the graph API infrastructure.

Key Accomplishments

Database Connection Pool Improvements

  • Enhanced Memory Management: Updated memory allocation strategies for database instances to prevent resource exhaustion
  • Connection Pool Stability: Refactored the ladybug pool implementation with improved error handling and connection lifecycle management
  • Replica Configuration: Fixed shared replicas repository configuration issues that were causing deployment inconsistencies

Monitoring & Observability Enhancements

  • Enhanced Metrics Collection: Expanded metrics collector functionality to capture more granular performance data
  • Improved Volume Monitoring: Updated lambda-based volume monitoring with better error handling and resource tracking
  • API Metrics Integration: Enhanced metrics router with additional endpoints and improved data aggregation

Infrastructure Updates

  • Deployment Configuration: Updated graph deployment workflows for both primary and replica environments
  • Configuration Management: Refined graph configuration settings to support new memory allocation parameters
  • CI/CD Improvements: Streamlined build and deployment processes

Testing Notes

  • Added comprehensive test suite for connection pool functionality (244+ new test lines)
  • All existing tests pass with the updated connection pool implementation
  • Volume monitoring changes have been validated in staging environment
  • Metrics collection endpoints tested for proper data formatting and response times

Infrastructure Considerations

  • Memory allocation changes may result in slightly different resource utilization patterns
  • Enhanced metrics collection will provide better visibility into system performance
  • Deployment workflows have been updated to ensure consistent replica configuration
  • No breaking changes to existing API contracts or client integrations

Files Modified

  • Core API application and routing logic
  • Database connection pool implementation
  • Metrics collection and monitoring systems
  • CI/CD workflows and configuration files
  • Comprehensive test coverage additions

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/shared-replicas-repo-fix
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…ollection

- Reduced memory allocation per database from 4GB to 2GB for both production and staging environments to better manage resource limits.
- Modified the deployment workflow to handle shared repositories more robustly by adding quotes around the parameter value.
- Improved the graph volume monitoring logic to include clearer logging and fallback mechanisms for metrics retrieval under heavy I/O conditions.
- Introduced a new method in the metrics collector to return cached database metrics, ensuring availability during timeouts.
- Updated the metrics API to implement a timeout for database metrics collection, returning cached data if the collection exceeds the limit.
@jfrench9
Member Author

@claude please review this PR

@claude

claude bot commented Feb 14, 2026

Claude finished @jfrench9's task in 1m 56s


Code Review Complete

Tasks

  • Gather context and read modified files
  • Review connection pool refactoring (pool.py)
  • Review metrics collection enhancements
  • Review infrastructure and configuration changes
  • Review test coverage
  • Check for security, performance, and code quality issues
  • Post comprehensive review feedback

Review Summary

This PR introduces significant improvements to the shared database replica management system while enhancing observability and resource allocation. The changes are well-structured and address real operational challenges. I've identified several important areas for consideration below.


Critical Issues

1. Memory Configuration Mismatch Between Writer and Replica ⚠️

graph.yml:308-310

The production shared replicas use m7g.large (8GB RAM) with max_memory_mb: 5120 (5GB), but this overrides the writer's 10GB settings. However, the writer configuration (graph.yml:271-272) still defaults to 10GB:

max_memory_mb: 10240 # 10GB default for LadybugDB
duckdb_memory_limit: "10GB"

Issue: If these settings are copied to replica instances without being overridden, they would exceed the replica's container memory budget (~6GB on 8GB instance), causing OOM crashes.

Recommendation: Verify that the replica deployment workflow correctly applies the memory overrides from the replica section. Consider adding validation to prevent configuration drift.
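A minimal sketch of the suggested drift validation, assuming a config dict shaped like graph.yml with `writer` and `replica` sections (the helper name, headroom value, and key layout are illustrative, not part of this PR):

```python
# Hypothetical sketch: fail fast when the effective replica memory limit
# exceeds the instance's container budget. Key names mirror graph.yml's
# max_memory_mb, but the helper itself is an assumption for illustration.

def validate_replica_memory(
    config: dict, instance_ram_mb: int, headroom_mb: int = 2048
) -> None:
    """Raise if the replica's effective memory limit exceeds its budget."""
    # Replica overrides fall back to the writer defaults when absent --
    # exactly the drift scenario flagged above.
    effective_mb = config.get("replica", {}).get(
        "max_memory_mb", config["writer"]["max_memory_mb"]
    )
    budget_mb = instance_ram_mb - headroom_mb
    if effective_mb > budget_mb:
        raise ValueError(
            f"Replica max_memory_mb={effective_mb} exceeds budget "
            f"{budget_mb}MB on a {instance_ram_mb}MB instance"
        )
```

Run as a CI step or at deploy time, this turns the silent OOM scenario into an explicit configuration error.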



2. S3 ATTACH Connection TTL Bypass Logic 🔍

pool.py:688-698

The code bypasses TTL for S3 ATTACH connections to prevent expensive recreation:

def _is_connection_valid(self, connection_info: ConnectionInfo) -> bool:
    """Check if a connection is still valid."""
    # S3 ATTACH connections bypass TTL - they are read-only remote mounts
    # that are expensive to recreate (~11 min ATTACH + full buffer pool reallocation).
    # Cycling them via TTL causes OOM on memory-constrained replicas.
    if not connection_info.s3_attached:
        if datetime.now(UTC) - connection_info.created_at > self.connection_ttl:
            return False

Issue: While the comment explains the reasoning (preventing OOM on replicas), S3 ATTACH connections never expire unless explicitly invalidated. This could lead to:

  • Stale connections accumulating over time
  • No mechanism to refresh credentials (if they expire)
  • Connections that fail health checks remaining in the pool indefinitely

Recommendation: Consider adding:

  1. A separate, longer TTL for S3 ATTACH connections (e.g., 24 hours instead of 30 minutes)
  2. Credential refresh mechanism for long-lived S3 connections
  3. Periodic health check validation to remove truly unhealthy S3 connections



Significant Observations

3. Enhanced CloudWatch Fallback for Volume Monitoring

graph_volume_monitor.py:206-220

The Lambda now has a robust fallback to CloudWatch metrics when the Graph API is unresponsive:

metrics = get_volume_metrics_from_instance(instance)
metrics_source = "graph_api"

if not metrics:
    logger.warning(
        f"Graph API unavailable for {instance['instance_id']}, "
        "falling back to CloudWatch metrics"
    )
    metrics = get_volume_metrics_from_cloudwatch(instance)
    metrics_source = "cloudwatch"

Observation: This is excellent defensive programming. The CloudWatch fallback ensures monitoring continues even during heavy I/O load when the Graph API may time out. The implementation correctly uses DISK_USED_PERCENT and disk_total metrics with proper dimension matching.

Minor Suggestion: Consider adding a metric to track how often the fallback is used. This would help identify instances that are chronically unresponsive.
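A minimal in-process sketch of that fallback counter (in the Lambda itself this would likely be emitted via CloudWatch `put_metric_data`; the helper names and threshold are illustrative assumptions):

```python
# Hypothetical sketch: count per-instance fallback events so chronically
# unresponsive instances stand out. In production this counter would be
# emitted as a CloudWatch custom metric rather than held in memory.
from collections import Counter

_fallback_counts: Counter = Counter()

def record_fallback(instance_id: str) -> None:
    """Track each time volume metrics fall back to CloudWatch for an instance."""
    _fallback_counts[instance_id] += 1

def chronically_unresponsive(threshold: int = 3) -> list[str]:
    """Instances whose Graph API fell back at least `threshold` times."""
    return [iid for iid, n in _fallback_counts.items() if n >= threshold]
```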


4. Shared Buffer Pool Architecture 🎯

pool.py:117-229

The new shared S3 database implementation creates a single in-memory database with all S3 repos attached:

def initialize_shared_s3_database(
    self, repos: list[str], s3_prefix: str
) -> list[str]:
    """
    Create a single shared in-memory Database and ATTACH all S3 repos to it.
    
    This creates ONE buffer pool (sized to max_memory_mb) shared across all
    S3 ATTACH repos. LadybugDB dynamically manages memory based on actual
    query patterns rather than static per-database allocations.
    """

Observation: This is a clever optimization that:

  • Reduces memory fragmentation by using a single buffer pool
  • Allows LadybugDB to dynamically allocate memory across repos based on query patterns
  • Simplifies connection management (one Database object, multiple Connection objects with USE statements)

Potential Issue: The implementation uses a single Database for all S3 repos, but each connection sets USE <repo> to target specific catalogs. If a query accidentally references tables from multiple repos without catalog prefixes, it could return incorrect results.

Recommendation: Consider adding query validation or documenting the catalog isolation expectations clearly.
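A rough sketch of what that query validation could look like. The regex is a deliberate simplification (a real check would use the SQL parser), and the function name is hypothetical:

```python
# Illustrative sketch: reject queries that reference a catalog other than the
# one this connection was scoped to with USE. Matches three-part names like
# catalog.schema.table; a production check would walk the parsed query instead.
import re

_CATALOG_REF = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.[A-Za-z_][A-Za-z0-9_]*\.")

def assert_catalog_isolation(query: str, allowed_repo: str) -> None:
    """Raise if the query names a catalog other than the connection's USE target."""
    for catalog in _CATALOG_REF.findall(query):
        if catalog != allowed_repo:
            raise ValueError(
                f"Query references catalog '{catalog}' but connection is "
                f"scoped to '{allowed_repo}'"
            )
```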


5. Test Coverage

test_connection_pool.py

244 new lines of comprehensive tests were added, including:

  • S3 shared database initialization tests (test_initialize_shared_s3_database)
  • Shared connection creation tests (test_shared_s3_connection_creation)
  • Database cleanup preservation tests (test_close_database_connections_preserves_shared_db)
  • Partial ATTACH failure handling (test_partial_attach_failure)

Observation: The test coverage is excellent and covers edge cases well. The tests properly mock LadybugDB objects to avoid segfaults and test the critical shared buffer pool logic.

Suggestion: Consider adding an integration test that verifies catalog isolation (ensuring USE statements properly scope queries to the correct repo).


Minor Issues

6. Magic Number in XFS Metadata Tolerance 📝

graph_volume_monitor.py:244-245

# Use a tolerance of 2GB to account for XFS metadata overhead (XFS reserves ~1-2%
# for internal metadata/journal, so df reports less than the actual device size)
ebs_fs_diff = (ebs_size_gb - current_size) if ebs_size_gb else 0
if ebs_size_gb and ebs_fs_diff > 2:

Issue: The 2GB tolerance is hardcoded and might be too small for very large volumes (e.g., 1000GB would have ~10-20GB metadata overhead at 1-2%).

Recommendation: Use a percentage-based threshold instead:

tolerance_gb = max(2, int(ebs_size_gb * 0.02))  # 2GB minimum, 2% for large volumes
if ebs_size_gb and ebs_fs_diff > tolerance_gb:

7. Metrics Collection Timeout 📊

metrics.py:24-25

# Timeout for database metrics collection (seconds). Under heavy ingestion I/O,
# _get_database_sizes() can block for 10+ seconds doing recursive file walks.
_DATABASE_METRICS_TIMEOUT = 3.0

Observation: The 3-second timeout is a reasonable default, but the comment mentions blocking can last "10+ seconds". This means legitimate metrics collection could frequently time out during heavy load.

Recommendation: Consider making this configurable via environment variable, or use a more sophisticated approach:

  • First attempt with 3s timeout (returns cached if timeout)
  • Background refresh with longer timeout (10s) that updates cache for next request
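The two-tier approach above can be sketched as follows. `collect_fn` stands in for `_get_database_sizes`, and the class itself is an illustration of the pattern, not code from this PR:

```python
# Sketch of the suggested approach: answer within a short deadline, fall back
# to the cached value on timeout, and let the in-flight collection update the
# cache in the background for subsequent requests.
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

class CachedMetrics:
    def __init__(self, collect_fn, fast_timeout: float = 3.0):
        self._collect = collect_fn
        self._fast_timeout = fast_timeout
        self._cache: dict = {}
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()

    def get(self) -> dict:
        """Return fresh metrics within the deadline, else the last cached value."""
        future = self._executor.submit(self._collect)
        try:
            result = future.result(timeout=self._fast_timeout)
            with self._lock:
                self._cache = result
            return result
        except FuturesTimeout:
            # Let the slow collection finish and refresh the cache for later.
            future.add_done_callback(self._store)
            with self._lock:
                return self._cache

    def _store(self, future) -> None:
        if future.exception() is None:
            with self._lock:
                self._cache = future.result()
```

This keeps the fast path bounded while ensuring the cache eventually reflects a complete collection even under sustained heavy I/O.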

Performance Considerations

8. Serial ATTACH with Retry Logic ⏱️

pool.py:183-216

# Serially ATTACH each repo with retry logic
attached: list[str] = []
for repo in repos:
    s3_uri = f"{s3_prefix.rstrip('/')}/{repo}.lbug"
    attach_query = f"ATTACH '{s3_uri}' AS {repo} (dbtype lbug, read_only)"
    logger.info(f"ATTACHing {repo} from {s3_uri}...")
    
    max_retries = 3
    success = False
    for attempt in range(max_retries):
        try:
            setup_conn.execute(attach_query)
            success = True
            break
        except Exception as attach_err:
            if attempt == max_retries - 1:
                logger.error(...)
            else:
                wait_time = 5 * (attempt + 1)
                time.sleep(wait_time)

Observation: The serial ATTACH with linear backoff is correct given LadybugDB's limitations, but could be slow for many repos (e.g., 3 repos × up to 15s of cumulative retry backoff each, plus ATTACH time, can add a minute or more to startup).

Recommendation: This is acceptable given the comment explains it runs in background during warmup. Consider adding progress logging to help operators understand why startup is slow.


Security Considerations

9. S3 Credentials Handling 🔐

pool.py:598-687

The _configure_s3_credentials method properly handles both IMDS credentials (production) and environment variables (dev):

def _configure_s3_credentials(self, conn: lbug.Connection) -> None:
    """
    Configure S3 credentials for httpfs extension.
    
    Uses EC2 Instance Metadata Service (IMDS) in production to get
    temporary credentials from the instance's IAM role.
    """

Observation: The implementation correctly:

  • Uses IMDSv2 (token-based) for enhanced security
  • Properly escapes credentials for SQL injection prevention: access_key.replace("'", "''")
  • Handles session tokens required for IMDS credentials
  • Falls back gracefully to environment variables in dev

Minor Issue: The error handling at line 657-660 raises a RuntimeError if session token setting fails, which could break S3 ATTACH entirely. Consider whether this should be a warning instead, allowing the attach to proceed and fail naturally with clearer S3 auth errors.


Code Quality

10. Comprehensive Documentation

The code includes excellent documentation throughout:

  • Clear module-level docstrings explaining architecture decisions
  • Detailed comments explaining complex logic (S3 buffer pool, TTL bypass, memory management)
  • Helpful examples in docstrings

Example (pool.py:54-56):

IMPORTANT: All connections for a given database share the same Database
object to ensure proper transaction visibility across connections.

This is exactly the kind of architectural context that makes code maintainable.


Summary

Overall Assessment: This is a high-quality PR that addresses real operational challenges with thoughtful solutions. The shared buffer pool architecture is clever, the CloudWatch fallback improves reliability, and the test coverage is comprehensive.

Recommendations:

  1. Critical: Verify memory configuration overrides for replicas (Issue 1)
  2. Important: Consider S3 ATTACH connection TTL strategy (Issue 2)
  3. Nice-to-have: Add percentage-based XFS tolerance (Issue 6)
  4. Nice-to-have: Make metrics timeout configurable (Issue 7)

Strengths:

  • Robust CloudWatch fallback for monitoring
  • Comprehensive test coverage including edge cases
  • Excellent code documentation
  • Proper security handling (IMDSv2, credential escaping)
  • Defensive programming throughout

The code is ready to merge pending verification of the memory configuration override mechanism for replicas.

Branch: bugfix/shared-replicas-repo-fix

@jfrench9 jfrench9 merged commit cf01e85 into main Feb 14, 2026
7 checks passed
@jfrench9 jfrench9 deleted the bugfix/shared-replicas-repo-fix branch February 14, 2026 04:28