Fix shared replicas repository with enhanced memory allocation and metrics#355

Merged
jfrench9 merged 1 commit into main from bugfix/shared-replicas-repo-fix on Feb 14, 2026
Conversation

@jfrench9
Member

Summary

This PR addresses critical issues with shared database replica management while improving system observability and resource allocation. The changes focus on stabilizing the connection pooling mechanism and enhancing monitoring capabilities across the graph API infrastructure.

Key Accomplishments

Database Connection Pool Improvements

  • Enhanced Memory Management: Updated memory allocation strategies for database instances to prevent resource exhaustion
  • Connection Pool Stability: Refactored the ladybug pool implementation with improved error handling and connection lifecycle management
  • Replica Configuration: Fixed shared replicas repository configuration issues that were causing deployment inconsistencies

Monitoring & Observability Enhancements

  • Enhanced Metrics Collection: Expanded metrics collector functionality to capture more granular performance data
  • Improved Volume Monitoring: Updated lambda-based volume monitoring with better error handling and resource tracking
  • API Metrics Integration: Enhanced metrics router with additional endpoints and improved data aggregation

Infrastructure Updates

  • Deployment Configuration: Updated graph deployment workflows for both primary and replica environments
  • Configuration Management: Refined graph configuration settings to support new memory allocation parameters
  • CI/CD Improvements: Streamlined build and deployment processes

Testing Notes

  • Added comprehensive test suite for connection pool functionality (244+ new test lines)
  • All existing tests pass with the updated connection pool implementation
  • Volume monitoring changes have been validated in staging environment
  • Metrics collection endpoints tested for proper data formatting and response times

Infrastructure Considerations

  • Memory allocation changes may result in slightly different resource utilization patterns
  • Enhanced metrics collection will provide better visibility into system performance
  • Deployment workflows have been updated to ensure consistent replica configuration
  • No breaking changes to existing API contracts or client integrations

Files Modified

  • Core API application and routing logic
  • Database connection pool implementation
  • Metrics collection and monitoring systems
  • CI/CD workflows and configuration files
  • Comprehensive test coverage additions

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/shared-replicas-repo-fix
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…ollection

- Reduced memory allocation per database from 4GB to 2GB for both production and staging environments to better manage resource limits.
- Modified the deployment workflow to handle shared repositories more robustly by adding quotes around the parameter value.
- Improved the graph volume monitoring logic to include clearer logging and fallback mechanisms for metrics retrieval under heavy I/O conditions.
- Introduced a new method in the metrics collector to return cached database metrics, ensuring availability during timeouts.
- Updated the metrics API to implement a timeout for database metrics collection, returning cached data if the collection exceeds the limit.
@jfrench9
Member Author

@claude please review this PR

@claude

claude bot commented Feb 14, 2026

Claude finished @jfrench9's task in 1m 56s


Code Review Complete

Tasks

  • Gather context and read modified files
  • Review connection pool refactoring (pool.py)
  • Review metrics collection enhancements
  • Review infrastructure and configuration changes
  • Review test coverage
  • Check for security, performance, and code quality issues
  • Post comprehensive review feedback

Review Summary

This PR introduces significant improvements to the shared database replica management system while enhancing observability and resource allocation. The changes are well-structured and address real operational challenges. I've identified several important areas for consideration below.


Critical Issues

1. Memory Configuration Mismatch Between Writer and Replica ⚠️

graph.yml:308-310

The production shared replicas use m7g.large (8GB RAM) with max_memory_mb: 5120 (5GB), but this overrides the writer's 10GB settings. However, the writer configuration (graph.yml:271-272) still defaults to 10GB:

max_memory_mb: 10240 # 10GB default for LadybugDB
duckdb_memory_limit: "10GB"

Issue: If these settings are copied to replica instances without being overridden, they would exceed the replica's container memory budget (~6GB on 8GB instance), causing OOM crashes.

Recommendation: Verify that the replica deployment workflow correctly applies the memory overrides from the replica section. Consider adding validation to prevent configuration drift.
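A minimal sketch of the suggested drift validation, assuming a config dict shaped like graph.yml with `writer` and `replica` sections (the helper name, headroom value, and key layout are illustrative, not part of this PR):

```python
# Hypothetical sketch: fail fast when the effective replica memory limit
# exceeds the instance's container budget. Key names mirror graph.yml's
# max_memory_mb, but the helper itself is an assumption for illustration.

def validate_replica_memory(
    config: dict, instance_ram_mb: int, headroom_mb: int = 2048
) -> None:
    """Raise if the replica's effective memory limit exceeds its budget."""
    # Replica overrides fall back to the writer defaults when absent --
    # exactly the drift scenario flagged above.
    effective_mb = config.get("replica", {}).get(
        "max_memory_mb", config["writer"]["max_memory_mb"]
    )
    budget_mb = instance_ram_mb - headroom_mb
    if effective_mb > budget_mb:
        raise ValueError(
            f"Replica max_memory_mb={effective_mb} exceeds budget "
            f"{budget_mb}MB on a {instance_ram_mb}MB instance"
        )
```

Run as a CI step or at deploy time, this turns the silent OOM scenario into an explicit configuration error.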



2. S3 ATTACH Connection TTL Bypass Logic 🔍

pool.py:688-698

The code bypasses TTL for S3 ATTACH connections to prevent expensive recreation:

def _is_connection_valid(self, connection_info: ConnectionInfo) -> bool:
    """Check if a connection is still valid."""
    # S3 ATTACH connections bypass TTL - they are read-only remote mounts
    # that are expensive to recreate (~11 min ATTACH + full buffer pool reallocation).
    # Cycling them via TTL causes OOM on memory-constrained replicas.
    if not connection_info.s3_attached:
        if datetime.now(UTC) - connection_info.created_at > self.connection_ttl:
            return False

Issue: While the comment explains the reasoning (preventing OOM on replicas), S3 ATTACH connections never expire unless explicitly invalidated. This could lead to:

  • Stale connections accumulating over time
  • No mechanism to refresh credentials (if they expire)
  • Connections that fail health checks remaining in the pool indefinitely

Recommendation: Consider adding:

  1. A separate, longer TTL for S3 ATTACH connections (e.g., 24 hours instead of 30 minutes)
  2. Credential refresh mechanism for long-lived S3 connections
  3. Periodic health check validation to remove truly unhealthy S3 connections



Significant Observations

3. Enhanced CloudWatch Fallback for Volume Monitoring

graph_volume_monitor.py:206-220

The Lambda now has a robust fallback to CloudWatch metrics when the Graph API is unresponsive:

metrics = get_volume_metrics_from_instance(instance)
metrics_source = "graph_api"

if not metrics:
    logger.warning(
        f"Graph API unavailable for {instance['instance_id']}, "
        "falling back to CloudWatch metrics"
    )
    metrics = get_volume_metrics_from_cloudwatch(instance)
    metrics_source = "cloudwatch"

Observation: This is excellent defensive programming. The CloudWatch fallback ensures monitoring continues even during heavy I/O load when the Graph API may time out. The implementation correctly uses DISK_USED_PERCENT and disk_total metrics with proper dimension matching.

Minor Suggestion: Consider adding a metric to track how often the fallback is used. This would help identify instances that are chronically unresponsive.
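A minimal in-process sketch of that fallback counter (in the Lambda itself this would likely be emitted via CloudWatch `put_metric_data`; the helper names and threshold are illustrative assumptions):

```python
# Hypothetical sketch: count per-instance fallback events so chronically
# unresponsive instances stand out. In production this counter would be
# emitted as a CloudWatch custom metric rather than held in memory.
from collections import Counter

_fallback_counts: Counter = Counter()

def record_fallback(instance_id: str) -> None:
    """Track each time volume metrics fall back to CloudWatch for an instance."""
    _fallback_counts[instance_id] += 1

def chronically_unresponsive(threshold: int = 3) -> list[str]:
    """Instances whose Graph API fell back at least `threshold` times."""
    return [iid for iid, n in _fallback_counts.items() if n >= threshold]
```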


4. Shared Buffer Pool Architecture 🎯

pool.py:117-229

The new shared S3 database implementation creates a single in-memory database with all S3 repos attached:

def initialize_shared_s3_database(
    self, repos: list[str], s3_prefix: str
) -> list[str]:
    """
    Create a single shared in-memory Database and ATTACH all S3 repos to it.
    
    This creates ONE buffer pool (sized to max_memory_mb) shared across all
    S3 ATTACH repos. LadybugDB dynamically manages memory based on actual
    query patterns rather than static per-database allocations.
    """

Observation: This is a clever optimization that:

  • Reduces memory fragmentation by using a single buffer pool
  • Allows LadybugDB to dynamically allocate memory across repos based on query patterns
  • Simplifies connection management (one Database object, multiple Connection objects with USE statements)

Potential Issue: The implementation uses a single Database for all S3 repos, but each connection sets USE <repo> to target specific catalogs. If a query accidentally references tables from multiple repos without catalog prefixes, it could return incorrect results.

Recommendation: Consider adding query validation or documenting the catalog isolation expectations clearly.
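A rough sketch of what that query validation could look like. The regex is a deliberate simplification (a real check would use the SQL parser), and the function name is hypothetical:

```python
# Illustrative sketch: reject queries that reference a catalog other than the
# one this connection was scoped to with USE. Matches three-part names like
# catalog.schema.table; a production check would walk the parsed query instead.
import re

_CATALOG_REF = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.[A-Za-z_][A-Za-z0-9_]*\.")

def assert_catalog_isolation(query: str, allowed_repo: str) -> None:
    """Raise if the query names a catalog other than the connection's USE target."""
    for catalog in _CATALOG_REF.findall(query):
        if catalog != allowed_repo:
            raise ValueError(
                f"Query references catalog '{catalog}' but connection is "
                f"scoped to '{allowed_repo}'"
            )
```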


5. Test Coverage

test_connection_pool.py

244 new lines of comprehensive tests were added, including:

  • S3 shared database initialization tests (test_initialize_shared_s3_database)
  • Shared connection creation tests (test_shared_s3_connection_creation)
  • Database cleanup preservation tests (test_close_database_connections_preserves_shared_db)
  • Partial ATTACH failure handling (test_partial_attach_failure)

Observation: The test coverage is excellent and covers edge cases well. The tests properly mock LadybugDB objects to avoid segfaults and test the critical shared buffer pool logic.

Suggestion: Consider adding an integration test that verifies catalog isolation (ensuring USE statements properly scope queries to the correct repo).


Minor Issues

6. Magic Number in XFS Metadata Tolerance 📝

graph_volume_monitor.py:244-245

# Use a tolerance of 2GB to account for XFS metadata overhead (XFS reserves ~1-2%
# for internal metadata/journal, so df reports less than the actual device size)
ebs_fs_diff = (ebs_size_gb - current_size) if ebs_size_gb else 0
if ebs_size_gb and ebs_fs_diff > 2:

Issue: The 2GB tolerance is hardcoded and might be too small for very large volumes (e.g., 1000GB would have ~10-20GB metadata overhead at 1-2%).

Recommendation: Use a percentage-based threshold instead:

tolerance_gb = max(2, int(ebs_size_gb * 0.02))  # 2GB minimum, 2% for large volumes
if ebs_size_gb and ebs_fs_diff > tolerance_gb:

7. Metrics Collection Timeout 📊

metrics.py:24-25

# Timeout for database metrics collection (seconds). Under heavy ingestion I/O,
# _get_database_sizes() can block for 10+ seconds doing recursive file walks.
_DATABASE_METRICS_TIMEOUT = 3.0

Observation: The 3-second timeout is a reasonable default, but the comment mentions blocking can last "10+ seconds". This means legitimate metrics collection could frequently time out during heavy load.

Recommendation: Consider making this configurable via environment variable, or use a more sophisticated approach:

  • First attempt with 3s timeout (returns cached if timeout)
  • Background refresh with longer timeout (10s) that updates cache for next request
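The two-tier approach above can be sketched as follows. `collect_fn` stands in for `_get_database_sizes`, and the class itself is an illustration of the pattern, not code from this PR:

```python
# Sketch of the suggested approach: answer within a short deadline, fall back
# to the cached value on timeout, and let the in-flight collection update the
# cache in the background for subsequent requests.
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

class CachedMetrics:
    def __init__(self, collect_fn, fast_timeout: float = 3.0):
        self._collect = collect_fn
        self._fast_timeout = fast_timeout
        self._cache: dict = {}
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()

    def get(self) -> dict:
        """Return fresh metrics within the deadline, else the last cached value."""
        future = self._executor.submit(self._collect)
        try:
            result = future.result(timeout=self._fast_timeout)
            with self._lock:
                self._cache = result
            return result
        except FuturesTimeout:
            # Let the slow collection finish and refresh the cache for later.
            future.add_done_callback(self._store)
            with self._lock:
                return self._cache

    def _store(self, future) -> None:
        if future.exception() is None:
            with self._lock:
                self._cache = future.result()
```

This keeps the fast path bounded while ensuring the cache eventually reflects a complete collection even under sustained heavy I/O.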

Performance Considerations

8. Serial ATTACH with Retry Logic ⏱️

pool.py:183-216

# Serially ATTACH each repo with retry logic
attached: list[str] = []
for repo in repos:
    s3_uri = f"{s3_prefix.rstrip('/')}/{repo}.lbug"
    attach_query = f"ATTACH '{s3_uri}' AS {repo} (dbtype lbug, read_only)"
    logger.info(f"ATTACHing {repo} from {s3_uri}...")
    
    max_retries = 3
    success = False
    for attempt in range(max_retries):
        try:
            setup_conn.execute(attach_query)
            success = True
            break
        except Exception as attach_err:
            if attempt == max_retries - 1:
                logger.error(...)
            else:
                wait_time = 5 * (attempt + 1)
                time.sleep(wait_time)

Observation: The serial ATTACH with linear backoff is correct given LadybugDB's limitations, but could be slow for many repos (e.g., 3 repos × up to 15s of cumulative retry backoff each, plus ATTACH time, can add a minute or more to startup).

Recommendation: This is acceptable given the comment explains it runs in background during warmup. Consider adding progress logging to help operators understand why startup is slow.


Security Considerations

9. S3 Credentials Handling 🔐

pool.py:598-687

The _configure_s3_credentials method properly handles both IMDS credentials (production) and environment variables (dev):

def _configure_s3_credentials(self, conn: lbug.Connection) -> None:
    """
    Configure S3 credentials for httpfs extension.
    
    Uses EC2 Instance Metadata Service (IMDS) in production to get
    temporary credentials from the instance's IAM role.
    """

Observation: The implementation correctly:

  • Uses IMDSv2 (token-based) for enhanced security
  • Properly escapes credentials for SQL injection prevention: access_key.replace("'", "''")
  • Handles session tokens required for IMDS credentials
  • Falls back gracefully to environment variables in dev

Minor Issue: The error handling at line 657-660 raises a RuntimeError if session token setting fails, which could break S3 ATTACH entirely. Consider whether this should be a warning instead, allowing the attach to proceed and fail naturally with clearer S3 auth errors.


Code Quality

10. Comprehensive Documentation

The code includes excellent documentation throughout:

  • Clear module-level docstrings explaining architecture decisions
  • Detailed comments explaining complex logic (S3 buffer pool, TTL bypass, memory management)
  • Helpful examples in docstrings

Example (pool.py:54-56):

IMPORTANT: All connections for a given database share the same Database
object to ensure proper transaction visibility across connections.

This is exactly the kind of architectural context that makes code maintainable.


Summary

Overall Assessment: This is a high-quality PR that addresses real operational challenges with thoughtful solutions. The shared buffer pool architecture is clever, the CloudWatch fallback improves reliability, and the test coverage is comprehensive.

Recommendations:

  1. Critical: Verify memory configuration overrides for replicas (Issue 1)
  2. Important: Consider S3 ATTACH connection TTL strategy (Issue 2)
  3. Nice-to-have: Add percentage-based XFS tolerance (Issue 6)
  4. Nice-to-have: Make metrics timeout configurable (Issue 7)

Strengths:

  • Robust CloudWatch fallback for monitoring
  • Comprehensive test coverage including edge cases
  • Excellent code documentation
  • Proper security handling (IMDSv2, credential escaping)
  • Defensive programming throughout

The code is ready to merge pending verification of the memory configuration override mechanism for replicas.

Branch: bugfix/shared-replicas-repo-fix

@jfrench9 jfrench9 merged commit cf01e85 into main Feb 14, 2026
7 checks passed
@jfrench9 jfrench9 deleted the bugfix/shared-replicas-repo-fix branch February 14, 2026 04:28