Skip to content

Fix worker monitor Valkey connection with Secrets Manager auth#6

Merged
jfrench9 merged 2 commits intomainfrom
bugfix/worker-monitor-valkey-conn
Oct 7, 2025
Merged

Fix worker monitor Valkey connection with Secrets Manager auth#6
jfrench9 merged 2 commits intomainfrom
bugfix/worker-monitor-valkey-conn

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Oct 7, 2025

Summary

This bugfix resolves connection issues in the worker monitor by implementing proper Valkey authentication through AWS Secrets Manager and enhancing Redis connection handling mechanisms.

Key Changes

  • Authentication Enhancement: Added secure retrieval of Valkey authentication tokens from AWS Secrets Manager
  • Connection Reliability: Improved Redis/Valkey connection handling with better error management and retry logic
  • Infrastructure Updates: Updated CloudFormation template to support the new authentication requirements

Technical Improvements

  • Enhanced connection pooling and management for Redis/Valkey instances
  • Added proper exception handling for authentication failures
  • Implemented secure credential retrieval pattern using AWS Secrets Manager
  • Strengthened connection resilience with improved retry mechanisms

Infrastructure Considerations

  • CloudFormation template includes new permissions for Secrets Manager access
  • Worker monitor now requires appropriate IAM roles for secret retrieval
  • No breaking changes to existing API or configuration interfaces

Testing Notes

  • Verify worker monitor can successfully authenticate with Valkey instances
  • Confirm connection stability under various network conditions
  • Test failover scenarios and connection recovery
  • Validate that secrets are properly retrieved and cached

This fix addresses critical connectivity issues that were preventing the worker monitor from maintaining stable connections to the Valkey backend.


🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/worker-monitor-valkey-conn
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…s connection handling

- Implemented a function to retrieve the Valkey auth token from AWS Secrets Manager with caching.
- Updated Redis connection logic to include the auth token and handle connection timeouts and errors more gracefully.
- Modified CloudFormation template to allow access to the Secrets Manager for the Valkey auth token.
- Enhanced error handling in various functions to improve robustness and logging for Redis operations.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Oct 7, 2025

@claude please review this PR

@claude
Copy link
Copy Markdown

claude bot commented Oct 7, 2025

Claude finished @jfrench9's task —— View job


PR Review

Reviewing changes for Valkey authentication and connection handling improvements.

Review Tasks

  • Read and analyze worker_monitor.py changes
  • Review CloudFormation template updates
  • Check for security issues and best practices
  • Verify error handling and connection resilience
  • Provide comprehensive feedback

Code Review Summary

This PR successfully addresses the Valkey authentication issue and significantly improves connection reliability. The implementation is solid with good error handling patterns. I have a few suggestions for improvement:

🟢 Strengths

  1. Proper AWS Secrets Manager Integration: The get_valkey_auth_token() function correctly retrieves credentials with caching (lines 92-110)
  2. TLS Security: Correctly uses rediss:// scheme and properly documents the ElastiCache self-signed certificate limitation (lines 147-167)
  3. URL Encoding: Properly encodes auth token with quote() to handle special characters (line 151)
  4. Enhanced Error Handling: Comprehensive timeout and connection error handling throughout (lines 468-476, 565-570, 577-584, 600-610, 709-714)
  5. Graceful Degradation: Circuit-breaker pattern for consecutive errors prevents cascading failures (lines 339-356)
  6. IAM Permissions: CloudFormation template correctly scopes Secrets Manager permissions with wildcards for rotation (line 86)

🟡 Suggestions for Improvement

1. Token Caching Security Concern (High Priority)

The global _valkey_auth_token cache never refreshes, even when secrets are rotated:

# bin/lambda/worker_monitor.py:92-110
def get_valkey_auth_token() -> Optional[str]:
  global _valkey_auth_token
  
  if _valkey_auth_token:  # ⚠️ Never refreshes!
    return _valkey_auth_token

Recommendation: Add TTL-based cache expiration or catch auth failures and retry with fresh token:

import time

_valkey_auth_token = None
_token_cache_time = None
TOKEN_CACHE_TTL = 3600  # 1 hour

def get_valkey_auth_token() -> Optional[str]:
  global _valkey_auth_token, _token_cache_time
  
  # Refresh if expired or missing
  if _valkey_auth_token and _token_cache_time:
    if time.time() - _token_cache_time < TOKEN_CACHE_TTL:
      return _valkey_auth_token
  
  # ... existing retrieval logic ...
  _token_cache_time = time.time()
  return _valkey_auth_token

2. Connection Retry Logic Inconsistency (Medium Priority)

The handler uses exponential backoff (line 194) but get_valkey_endpoint() uses fixed delays (line 323):

# Line 194 - exponential
sleep(RETRY_DELAY * (attempt + 1))

# Line 323 - fixed
sleep(RETRY_DELAY * (attempt + 1))  # Actually same, my mistake!

Wait, both use exponential backoff. However, connection timeout is very aggressive:

# Line 160-162
socket_connect_timeout=2,
socket_timeout=2,
retry_on_timeout=False,

Recommendation: Consider slightly longer timeouts (3-5s) for Lambda cold starts or network hiccups, especially since you disabled automatic retries.

3. Connection Pool Validation (Low Priority)

Connection pool validation only pings, doesn't verify database selection:

# Line 132
_redis_connections[database].ping()

Recommendation: While ping() is generally sufficient, consider catching redis.ConnectionError specifically to handle connection pool exhaustion gracefully.

4. Secret Name Hardcoding (Low Priority)

Secret name pattern is hardcoded in Python and CloudFormation separately:

# worker_monitor.py:99
secret_name = f"robosystems/{ENVIRONMENT}/valkey"

# worker-monitor.yaml:86
Resource: !Sub "...robosystems/${Environment}/valkey-*"

Recommendation: Consider documenting this naming convention in both places or using an environment variable for the secret name to maintain single source of truth.

5. Logging Sensitive Information (Security Note)

Line 105 logs successful retrieval:

logger.info(f"Retrieved Valkey auth token from {secret_name}")

Recommendation: This is fine as-is (doesn't log the token itself), but consider adding a note that the token value is intentionally excluded for security.

🔴 Issues to Address

Missing Error Handling in Connection Pool (High Priority)

If get_valkey_auth_token() returns None, a confusing RuntimeError is raised:

# Line 143-145
auth_token = get_valkey_auth_token()
if not auth_token:
  raise RuntimeError("Failed to get Valkey auth token")

The original get_valkey_auth_token() already logs the specific error (line 108), but this generic message loses context.

Recommendation: Either:

  1. Re-raise the original exception with context, or
  2. Make the error message more specific: RuntimeError(f"Failed to retrieve Valkey auth token from Secrets Manager: {secret_name}")

📋 Testing Recommendations

Based on the PR description, ensure you test:

  1. Secret Rotation: Verify Lambda picks up new tokens after rotation (relates to caching concern Inital open source release of robosystems #1)
  2. Network Failures: Test behavior when Secrets Manager or ElastiCache is unreachable
  3. Concurrent Invocations: Verify connection pooling works correctly with Lambda's concurrency model
  4. Cold Start Performance: Measure impact of TLS handshake and Secrets Manager call on cold start time
  5. IAM Permissions: Verify the wildcard suffix (-*) correctly matches rotated secrets

Summary

Overall Assessment: ✅ Approve with minor suggestions

This is a well-implemented fix that properly addresses the Valkey authentication issue. The code demonstrates good understanding of:

  • AWS Secrets Manager integration patterns
  • TLS connection requirements for ElastiCache
  • Defensive error handling and graceful degradation
  • Lambda function optimization (connection pooling, caching)

The main concern is the indefinite token caching which could cause issues during secret rotation. Consider implementing TTL-based cache expiration before merging to production.

Estimated Risk: Low (primarily a missing refresh mechanism for cached secrets)


Job Run | Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Comment on lines +99 to +105
secret_name = f"robosystems/{ENVIRONMENT}/valkey"

try:
response = secretsmanager.get_secret_value(SecretId=secret_name)
secret_data = json.loads(response["SecretString"])
_valkey_auth_token = secret_data.get("AUTH_TOKEN")
logger.info(f"Retrieved Valkey auth token from {secret_name}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Fetch Valkey token under wrong key from Secrets Manager

The Valkey secret created by CloudFormation stores the token under the key VALKEY_AUTH_TOKEN (and other parts of the codebase look up that key), but this function retrieves secret_data.get("AUTH_TOKEN"). That lookup always returns None, so _valkey_auth_token is never populated and get_redis_connection immediately raises RuntimeError("Failed to get Valkey auth token"), preventing the Lambda from ever connecting to Valkey and breaking all monitoring actions. The key name should match the secret structure (VALKEY_AUTH_TOKEN).

Useful? React with 👍 / 👎.

…ed error handling

- Implemented TTL-based caching for Valkey auth token to reduce unnecessary calls to Secrets Manager.
- Updated logging to provide clearer information on token retrieval and errors.
- Enhanced error messages for better debugging when the VALKEY_AUTH_TOKEN key is missing.
@jfrench9 jfrench9 merged commit 4cf2c05 into main Oct 7, 2025
2 checks passed
@jfrench9 jfrench9 deleted the bugfix/worker-monitor-valkey-conn branch October 19, 2025 02:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant