Skip to content

Add RDS Proxy support with configurable database endpoint routing#620

Merged
jfrench9 merged 5 commits intomainfrom
feature/rds-proxy-flag
Apr 15, 2026
Merged

Add RDS Proxy support with configurable database endpoint routing#620
jfrench9 merged 5 commits intomainfrom
feature/rds-proxy-flag

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

This PR introduces configurable RDS Proxy support across the infrastructure stack, allowing deployments to optionally route database connections through an RDS Proxy rather than directly to the PostgreSQL instance. The changes span CloudFormation templates, CI/CD workflows, the secret rotation Lambda, and project setup tooling.

Key Accomplishments

Infrastructure (CloudFormation)

  • Refactored the PostgreSQL CloudFormation template to support an optional RDS Proxy resource, controlled by a feature flag parameter
  • Streamlined and simplified the overall template structure, resulting in a net reduction in template complexity despite adding new proxy resources
  • Added conditional logic to output either the RDS Proxy endpoint or the direct RDS endpoint based on configuration

CI/CD Workflows

  • Updated deployment workflows (staging, prod, and deploy-postgres) to propagate the RDS Proxy flag and correctly resolve the database endpoint based on whether the proxy is enabled
  • Ensured that downstream services receive the appropriate database endpoint (proxy vs. direct) depending on the environment configuration

Secret Rotation Lambda

  • Enhanced the PostgreSQL secret rotation Lambda to be RDS Proxy-aware, adjusting its connection and rotation behavior to work correctly in both proxy and non-proxy configurations
  • Refactored configuration handling within the rotation function for improved clarity and maintainability

Setup & Documentation

  • Updated GitHub Actions setup tooling to include the new RDS Proxy flag variable
  • Revised setup documentation to reflect the new configuration option

Breaking Changes

  • Environment variable changes: Deployments now require the RDS Proxy flag to be defined in environment/workflow configuration. Existing environments will need this variable added (can default to disabled for backward compatibility).
  • The database endpoint resolution logic has changed — any hardcoded references to the direct RDS endpoint in consuming services should be verified against the new dynamic endpoint output.

Testing Notes

  • Verify deployments with the RDS Proxy flag enabled and disabled to confirm both paths resolve the correct database endpoint
  • Test secret rotation in both proxy and non-proxy configurations to ensure credentials rotate successfully and connections remain valid
  • Validate that application database connectivity works end-to-end through the proxy when enabled
  • Confirm CloudFormation stack updates apply cleanly for existing stacks (no resource replacement surprises)

Infrastructure Considerations

  • RDS Proxy incurs additional AWS costs when enabled — ensure this is accounted for in environment budgets
  • Proxy connections have different timeout and connection pooling characteristics compared to direct RDS connections; application connection pool settings may need tuning
  • The proxy adds a layer of IAM authentication and TLS requirements — ensure security group and IAM policies are correctly configured in target environments
  • Rolling this out incrementally (e.g., staging first with proxy enabled, prod with proxy disabled) is recommended to validate behavior before full adoption

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/rds-proxy-flag
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Removed the database engine configuration from workflows and justfile, simplifying the setup process.
- Introduced new parameters for RDS Proxy configuration in CloudFormation templates and workflows, allowing for connection pooling.
- Updated the PostgreSQL password rotation script to support RDS Proxy by writing both legacy and proxy-compatible keys.
- Enhanced documentation to reflect changes in database configuration and RDS Proxy settings.
- Modified the database endpoint settings in both production and staging workflows to utilize RDS Proxy when enabled, falling back to the direct endpoint otherwise.
- Added comments to clarify the behavior of the new configuration, particularly regarding the handling of empty proxy outputs and implications for Dagster daemon connections.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 2m 22s —— View job


PR Review: Add RDS Proxy support with configurable database endpoint routing

  • Read changed files and gather context
  • Review CloudFormation template changes
  • Review CI/CD workflow changes
  • Review Lambda rotation script changes
  • Review setup tooling changes
  • Compile and post final review

Overall this is a well-structured PR. The Aurora removal is the right call (cleaning up dead code) and the RDS Proxy implementation follows AWS best practices. The dual-key secret pattern in the rotation Lambda is particularly clever for zero-downtime proxy enablement. I have a few concerns worth addressing before merging.


cloudformation/postgres.yaml

Hardcoded postgres16 parameter group family (lines 395, 749)

Family: postgres16

The old FindInMap was more robust — if AllowedValues for PostgresEngineVersion ever gains 17.x, the parameter group family will silently be wrong. Either keep the mapping (even a minimal one) or add a comment noting this must be updated alongside AllowedValues.

Fix this →

AllowMajorVersionUpgrade: true on the RDS instance (line ~484)

This allows AWS to apply a major PostgreSQL version upgrade during the maintenance window without operator approval. For production this is dangerous — a 16 → 17 upgrade could break application compatibility silently. Recommend setting this to false and only changing it manually when you're ready to upgrade.

Fix this →

IAM auth disabled on the proxy (cloudformation/postgres.yaml, RDSProxy.Auth)

Auth:
  - AuthScheme: SECRETS
    IAMAuth: DISABLED

IAMAuth: DISABLED means any client with a valid password (from within the VPC) can connect to the proxy without AWS credentials. IAMAuth: REQUIRED would add a defense-in-depth layer so compromised DB credentials alone are insufficient. If you intentionally keep it disabled (e.g., to avoid IAM token complexity in the app), a comment explaining the trade-off would help future readers.

Hardcoded proxy connection pool values

MaxIdleConnectionsPercent: 50
ConnectionBorrowTimeout: 120

ConnectionBorrowTimeout: 120 means applications wait up to 2 minutes for a connection before getting an error — this is unusually long and could cause cascading request queuing under load. Most app-side connection pools have much shorter timeouts (5–30s). MaxIdleConnectionsPercent: 50 is also aggressive, keeping half the max connection budget as idle connections. Consider making these parameters (or at minimum default to lower values like 30s / 10%).

Fix this →

No dedicated security group for proxy → RDS traffic

Both the proxy and the RDS instance share RDSSecurityGroup. The recommended topology is:

  • App SG → Proxy SG (port 5432)
  • Proxy SG → RDS SG (port 5432)

With the current setup, application traffic could theoretically reach RDS directly (bypassing the proxy) if the same SG is used everywhere. This is optional hardening, but worth tracking as a follow-up.


.github/workflows/ (prod.yml, staging.yml)

GHA || coalescing on empty string — works, but fragile (prod.yml line ~548, staging.yml line ~565)

database_endpoint: ${{ needs.deploy-postgres.outputs.database_proxy_endpoint || needs.deploy-postgres.outputs.database_endpoint }}

This relies on GHA treating an empty string as falsy in ||. That's currently true, but it's an undocumented behavior edge. A more explicit approach would be to emit the chosen endpoint as a named output in deploy-postgres.yml (effective_database_endpoint) that contains the resolution logic in bash — making it explicit and testable. The current approach is fine for now but consider this for a follow-up.


bin/lambda/postgres_rotation.py

describe_db_instances() has no pagination (line 64)

response = rds_client.describe_db_instances()

The AWS API returns at most 100 instances per call. If the account has >100 RDS instances, the target may not be in the first page and get_database_connection_info raises ValueError. Pre-existing issue, but worth fixing while touching this file:

paginator = rds_client.get_paginator("describe_db_instances")
for page in paginator.paginate():
    for instance in page["DBInstances"]:
        ...

Fix this →

Dual-key secret pattern is well-executed — writing username/password alongside POSTGRES_USER/POSTGRES_PASSWORD on every rotation is the right approach for zero-downtime proxy enablement. The chicken-and-egg caveat is clearly documented both in the template comment and the docstring. Good.


justfile

--yes flag removed from gh variable delete (line 273)

-    @gh variable delete {{name}} --yes
+    @gh variable delete {{name}}

gh variable delete does not prompt for confirmation — the --yes flag was likely a no-op or invalid. If this removal silences a CLI warning, it's a good fix. If --yes was suppressing a real prompt in older versions, removing it would break just gha-delete in non-interactive CI contexts. Worth confirming intentional.


Minor / Non-blocking

  • SetSecret / TestSecret connect directly to RDS, not through the proxy — this is correct behavior for secret rotation (you want to set the password on the actual instance). Just note it in a comment so future readers don't try to "fix" it.
  • The DatabaseEngine output hardcoded to "postgres" with the backwards-compat comment is appropriate.
  • The PostgresHealthCompositeAlarmRDS logical ID retention comment is good — avoids a destructive rename on existing stacks.

Summary

Severity Finding
Should fix AllowMajorVersionUpgrade: true — automatic major version upgrades are risky in prod
Should fix Hardcoded postgres16 in parameter group family — will silently break when v17 is added
Consider ConnectionBorrowTimeout: 120 is very long; consider making it a parameter
Consider IAMAuth: DISABLED — document the trade-off or upgrade to REQUIRED
Nitpick describe_db_instances() needs pagination for large accounts
Nitpick --yes removal from just gha-delete — confirm intentional
Branch: feature/rds-proxy-flag

- Added `rds_proxy_connection_borrow_timeout` parameter to workflows for production, staging, and deployment configurations, allowing customization of the timeout duration for pooled connections.
- Updated CloudFormation template to include the new parameter, with a default value of 120 seconds.
- Enhanced documentation to reflect the new configuration option for RDS Proxy connection pooling.
@jfrench9 jfrench9 merged commit 4e992b1 into main Apr 15, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/rds-proxy-flag branch April 15, 2026 07:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant