Refactor Lambda deployment to use container images and streamline workflows by jfrench9 · Pull Request #96 · RoboFinSystems/robosystems

jfrench9 · 2025-12-23T05:33:08Z

Summary

This PR modernizes our Lambda deployment architecture by introducing container image support and significantly streamlining our CI/CD workflows. The changes reduce complexity while improving maintainability and deployment reliability.

Key Accomplishments

Lambda Container Image Support

Added Docker container support for Lambda functions with new Dockerfile configuration
Modernized packaging approach to replace traditional ZIP-based deployments
Enhanced build process to support containerized Lambda deployments

Workflow Optimization

Consolidated and simplified deployment workflows across all environments
Enhanced build workflow with improved caching and efficiency
Streamlined PostgreSQL deployment by merging IAM configuration into main deployment
Reduced workflow complexity while maintaining deployment reliability

Infrastructure Updates

Updated CloudFormation templates for graph infrastructure, volumes, PostgreSQL, and Valkey
Improved resource configuration and parameter management
Enhanced stack configuration management for better environment consistency

Dependency Management

Updated project dependencies and lock file for improved compatibility
Added new tooling dependencies to support container-based deployments

Breaking Changes

Deployment Method: Lambda functions now use container images instead of ZIP packages
Workflow Structure: PostgreSQL IAM deployment has been merged into the main PostgreSQL workflow
Build Process: Updated build and packaging procedures require container runtime

Testing Notes

Verify container image builds successfully in CI environment
Validate Lambda function deployments work correctly with new container approach
Test all deployment workflows in staging environment before production rollout
Confirm CloudFormation stack updates apply cleanly without resource recreation

Infrastructure Considerations

Container images will be stored in ECR and may impact storage costs
Initial deployments may take longer due to container image pulls
Ensure adequate IAM permissions for ECR operations in deployment environments
Monitor Lambda cold start times as container images may affect startup performance

🤖 Generated with Claude Code

Branch Info:

Source: refactor/lambda-packaging
Target: main
Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Introduced Dockerfile.lambda for building a container image for Lambda functions, utilizing the AWS Lambda Python base image.
- Updated pyproject.toml to define dependency groups for Lambda functions.
- Modified GitHub Actions workflows to deploy Lambda functions using the new container image approach, replacing S3 bucket references with ECR image URIs.
- Refactored CloudFormation templates to accommodate the new Lambda deployment method, including permissions for ECR access.
- Removed legacy Lambda packaging scripts and updated related documentation to reflect the transition to container-based deployment.

jfrench9 · 2025-12-23T05:33:10Z

@claude please review this PR

claude · 2025-12-23T05:33:26Z

Claude finished @jfrench9's task in 3m 11s

PR Review Progress

Read repository guidelines (CLAUDE.md)
Analyze PR diff and changed files
Review Dockerfile.lambda (new Lambda container configuration)
Review workflow changes (build.yml, staging.yml, prod.yml)
Review CloudFormation template updates
Review packaging script simplification
Check for security concerns
Verify breaking changes are properly documented
Post comprehensive review feedback

Code Review

This is a well-structured refactoring that modernizes Lambda deployments from ZIP files to container images. The implementation is solid overall, but there are several areas that need attention before merging.

Critical Issues

1. Missing Lambda dependency group validation (pyproject.toml)

The pyproject.toml adds a new [dependency-groups].lambda section with specific dependencies, but there's no validation that these dependencies match what the Lambda functions actually need.

Issue: The dependency group includes redis>=6.2.0,<7.0, but valkey_rotation.py was previously packaged with redis==5.0.1 (see bin/tools/package-scripts.sh:77). This version mismatch could cause runtime failures.

Recommendation:

Verify each Lambda function's actual dependencies match the dependency group
The redis version change from 5.0.1 to >=6.2.0,<7.0 is a breaking change that should be documented
Test Valkey rotation with the new redis version to ensure compatibility

Location: pyproject.toml:144-149
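
The mismatch described above can be checked by testing the old pin against the new range. The sketch below is a simplified, stdlib-only comparison for plain X.Y.Z versions, not a full PEP 440 implementation. The helper names parse_version and satisfies are hypothetical and are not part of the repository.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text, width=3):
    """Turn '6.2' or '6.2.0' into a zero-padded tuple such as (6, 2, 0)."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, spec):
    """Check a plain version against a comma-separated spec like '>=6.2.0,<7.0'."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                if not compare(candidate, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# The previously packaged pin falls outside the new dependency-group range.
print(satisfies("5.0.1", ">=6.2.0,<7.0"))
```

For anything beyond a quick check, a dedicated version-parsing library is the safer choice, since this sketch ignores pre-releases and other PEP 440 forms.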

2. Incomplete workflow file rename (.github/workflows/deploy-postgres.yml)

The workflow was renamed from deploy-postgres-iam.yml to deploy-postgres.yml, but only the new file is visible in the diff, and git reports it as RENAMED with SHA: unknown, so the deletion of the old file cannot be confirmed from the diff alone.

Issue: Need to verify:

The old deploy-postgres-iam.yml file is actually deleted
All references to the old workflow name are updated (checked staging.yml and prod.yml - looks good)
The stack name in stacks.yml still references "PostgresIAM" but comments say "Legacy name kept for backward compatibility" - this is acceptable

Recommendation: Verify the old workflow file was properly deleted in the PR.

Location: .github/workflows/deploy-postgres.yml

3. ARM64-only deployment lacks fallback (Dockerfile.lambda, build.yml)

The Lambda container image is built exclusively for ARM64 (Graviton), which is great for cost savings. However, there's no fallback or documentation about what happens if ARM64 becomes unavailable in a region.

Issue: The related cold start concern is also unaddressed:

The PR description mentions "Monitor Lambda cold start times as container images may affect startup performance"
No baseline metrics or monitoring setup is included in the PR
CloudWatch alarms in CloudFormation templates weren't updated to include container-specific metrics

Recommendation:

Add CloudWatch metrics for Lambda cold start duration in CloudFormation templates
Document expected cold start times in the PR description or infrastructure docs
Consider adding a note about ARM64 availability by region

Location: Dockerfile.lambda:6, .github/workflows/build.yml:239-242
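
One possible way to get cold start data out of the functions themselves is a module-level flag, since module scope runs once per Lambda execution environment. This is a minimal sketch and not code from the PR: the log field names are assumptions, and turning the log line into a CloudWatch metric (for example with a metric filter) would still need to be configured separately.

```python
import json

# Module scope executes once per execution environment, so this flag is
# True only for the first invocation after a cold start.
_COLD_START = True


def lambda_handler(event, context):
    global _COLD_START
    was_cold = _COLD_START
    _COLD_START = False

    # Structured log line; the field names here are illustrative assumptions.
    print(json.dumps({"metric": "cold_start", "value": 1 if was_cold else 0}))

    return {"cold_start": was_cold}
```

Calling the handler twice in the same process reports a cold start only on the first call, which mirrors how a warm execution environment behaves.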

High Priority Issues

4. ECR repository name is hardcoded in IAM policies (cloudformation/*.yaml)

All CloudFormation templates add ECR permissions with a wildcard for ecr:GetAuthorizationToken, which is correct since it doesn't support resource-level permissions. However, the ecr:BatchGetImage and ecr:GetDownloadUrlForLayer permissions are scoped to the robosystems repository.

Issue: This is correct, but the ARN uses a hardcoded repository name:

Resource: !Sub "arn:${AWS::Partition}:ecr:${AWS::Region}:${AWS::AccountId}:repository/robosystems"

Recommendation: Verify the ECR repository is actually named "robosystems" in all environments. If it varies by environment, this needs to be parameterized.

Location: All cloudformation/*.yaml files (postgres.yaml:276, valkey.yaml:387, graph-infra.yaml:464, graph-volumes.yaml:188, etc.)

5. Lambda timeout unchanged despite container overhead (cloudformation/*.yaml)

Container-based Lambdas typically have longer cold start times than ZIP-based deployments, especially for the first invocation or after scaling. However, all Lambda timeout values remain unchanged at 300 seconds (5 minutes).

Issue: The PR description mentions "Monitor Lambda cold start times as container images may affect startup performance" but doesn't adjust timeouts as a precaution.

Recommendation:

Consider increasing timeouts slightly for initial deployment (e.g., 360s)
Monitor actual execution times and adjust down if unnecessary
Document baseline timeout metrics before and after migration

Location: All Lambda function definitions in CloudFormation templates

6. Manifest generation removed but no migration path documented (bin/tools/package-scripts.sh)

The packaging script previously generated a manifest file (lambda-manifest-${ENVIRONMENT}.json) that tracked Lambda S3 keys and hashes. This has been completely removed.

Issue:

Any automation or monitoring that relied on this manifest will break
The PR doesn't document whether this manifest was actually used anywhere
No deprecation notice or migration guide

Recommendation:

Document whether the manifest was used by any other systems
Add a note about this removal in the PR description under "Breaking Changes"
Consider adding a comment in package-scripts.sh explaining why the manifest was removed

Location: bin/tools/package-scripts.sh:63-90 (removed code)

Medium Priority Issues

7. Docker buildx cache strategy could be optimized (.github/workflows/build.yml)

The Lambda image build uses cache-to type=registry,mode=max which caches all layers, but the main application image build (earlier in the file) might benefit from the same caching strategy for consistency.

Recommendation: Review if the main image build should also use mode=max for consistency, or document why different caching strategies are appropriate for each image type.

Location: .github/workflows/build.yml:241-242

8. Missing rollback documentation

The PR description mentions this is a breaking change but doesn't provide a rollback procedure.

Issue: If the container deployment fails in production, teams need to know:

Can they roll back to the previous ZIP-based version?
Are there stack parameter changes that prevent rollback?
What's the disaster recovery procedure?

Recommendation: Add a "Rollback Procedure" section to the PR description or link to runbook documentation.

9. uv export command uses --frozen but no lock file validation (Dockerfile.lambda)

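The Dockerfile runs uv export --only-group lambda --no-hashes --frozen, which exports from the existing uv.lock as-is and does not check that it is up to date.

Issue:

If uv.lock is out of sync with pyproject.toml, --frozen does not fail the build; the image is built from stale pins without warning (--locked is the flag that errors on an out-of-date lock file)
No CI validation step ensures the lock file is current before building
The build workflow doesn't run uv lock --check or similar validation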

Recommendation:

Add a step in the build workflow to validate uv.lock is up-to-date before building
Consider running uv lock --check in the test workflow as well

Location: Dockerfile.lambda:19, .github/workflows/build.yml:177-251

Minor Issues / Suggestions

10. Inconsistent Lambda handler naming convention (Dockerfile.lambda)

The Lambda functions use two different naming patterns:

postgres_rotation.lambda_handler (module.function format)
File is named postgres_rotation.py with function lambda_handler

vs. the Dockerfile's default:

graph_volume_manager.lambda_handler

Observation: This is actually consistent (all use module.lambda_handler), but the Dockerfile comment could be clearer.

Recommendation: The default CMD in Dockerfile.lambda could include a comment explaining that each CloudFormation template overrides this via ImageConfig.Command.

Location: Dockerfile.lambda:33-35

11. Build workflow output doesn't validate both images built successfully

The build workflow now outputs two images (image and lambda_image), but the success criteria only validates the main image was pushed. The Lambda image validation happens inside the build step but doesn't propagate to job outputs.

Recommendation: Add a validation step at the end of the lambda job to confirm both the environment tag and version tag were pushed successfully.

Location: .github/workflows/build.yml:249-252

12. Dependency version ranges could be tighter (pyproject.toml)

The Lambda dependency group uses wide version ranges:

boto3>=1.39.0,<2.0 - major version range
redis>=6.2.0,<7.0 - major version change from previous 5.0.1

Recommendation:

Consider tighter ranges for Lambda functions to prevent unexpected runtime changes
Document why redis was upgraded from 5.x to 6.x (API changes? Bug fixes?)

Location: pyproject.toml:146-147

13. CloudFormation template parameter descriptions could mention container deployment

Several templates still have parameter descriptions that reference "S3 bucket" or "deployment package" without mentioning the new container-based approach.

Example: valkey.yaml:93 - "ECR image URI for Lambda functions (container-based deployment)" - This is good!

Recommendation: Ensure all parameter descriptions are updated to reflect container deployment (spot check shows they are mostly updated, good work).

Security Review

✅ No major security concerns identified

ECR permissions are appropriately scoped to the specific repository
IAM roles follow least-privilege principle
Secrets Manager integration unchanged (good)
Container images use official AWS Lambda base images
No hardcoded credentials or secrets

Minor security note: The ARM64 base image public.ecr.aws/lambda/python:3.13-arm64 should be pinned to a specific SHA256 digest in production for supply chain security, but using tags is acceptable for development velocity.

Breaking Changes Verification

The PR description lists breaking changes, which are accurate:

✅ Deployment Method: Correctly documented
✅ Workflow Structure: Correctly documented
⚠️ Build Process: Documented, but should mention the new lambda dependency group in pyproject.toml
❌ Missing: Redis version upgrade from 5.0.1 to 6.2.0+ for Valkey rotation Lambda

Testing Recommendations

The PR description includes good testing notes. I'd add:

Validate each Lambda function individually in staging:
- postgres-init (database creation)
- postgres-rotation (password rotation)
- valkey-rotation (auth token rotation) - especially test with new redis library version
- graph-api-rotation
- graph-volume-manager
- graph-volume-monitor
- graph-volume-detachment
Cold start metrics: Capture baseline metrics before and after deployment:
- Time to first execution
- Memory usage
- Duration metrics
Rollback test: Verify you can roll back to previous infrastructure if needed

Summary

Overall Assessment: This is a well-executed refactoring with a clear architecture improvement. The code quality is high, and the approach is sound. However, there are several issues that should be addressed before merging:

Must Fix Before Merge:

1. Verify redis library version upgrade compatibility (Critical Issue #1)
2. Confirm old workflow file was deleted (Critical Issue #2)
3. Document ECR repository naming convention (High Priority #4)
4. Add rollback procedure documentation (Medium Priority #8)

Should Fix Before Merge:
5. Add uv.lock validation to build workflow (Medium Priority #9)
6. Document manifest removal as breaking change (High Priority #6)

Nice to Have:
7. Add CloudWatch metrics for cold start monitoring (Critical Issue #3)
8. Tighter dependency version ranges with changelog (Minor Issue #12)

Approval Status: Request Changes - The redis version upgrade and missing documentation are blocking issues that should be resolved before merge.

CI Status: 2 workflow runs currently in progress

Test Pull Request - in progress
PR #96 - in progress

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-12-23T05:35:42Z

cloudformation/graph-volumes.yaml

+      ImageConfig:
+        Command:
+          - graph_volume_monitor.lambda_handler


Point volume monitor to the actual handler function

The Lambda container entrypoint is set to graph_volume_monitor.lambda_handler, but bin/lambda/graph_volume_monitor.py only defines def handler(...) (line 48) and does not export lambda_handler. With container images, Lambda resolves the handler from ImageConfig.Command, so this will raise a “handler not found” error on every invocation and break scheduled monitoring/alarm-driven expansions. Use graph_volume_monitor.handler or add an alias in the module so the configured handler exists.


chatgpt-codex-connector · 2025-12-23T05:35:42Z

cloudformation/graph-volumes.yaml

+      ImageConfig:
+        Command:
+          - graph_volume_detachment.lambda_handler


Point volume detachment to the actual handler function

The Lambda container entrypoint is set to graph_volume_detachment.lambda_handler, but bin/lambda/graph_volume_detachment.py defines def handler(...) (line 31) and has no lambda_handler. With container-based Lambda, this mismatch causes cold start failures and prevents the ASG lifecycle hook from detaching volumes. Update the command to graph_volume_detachment.handler or export a lambda_handler alias in the module.
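
The alias option mentioned in both comments amounts to one extra line per module. The sketch below shows its shape only: the handler body is a placeholder, since the real contents of graph_volume_monitor.py and graph_volume_detachment.py are not shown in this review.

```python
def handler(event, context):
    """Existing entry point; the body is a placeholder for illustration."""
    return {"statusCode": 200}


# Alias so that an ImageConfig.Command of "<module>.lambda_handler" resolves
# to the same function without renaming the original entry point.
lambda_handler = handler
```

Keeping the original name and adding the alias avoids touching any other caller that still imports handler directly.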


.github/workflows/build.yml

+    runs-on: ubuntu-latest
+    outputs:
+      lambda_image: ${{ steps.build-lambda.outputs.lambda_image }}
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          repository: ${{ github.repository }}
+          ref: ${{ github.ref }}
+          token: ${{ secrets.ACTIONS_TOKEN }}
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+        with:
+          driver-opts: |
+            image=moby/buildkit:master
+            network=host
+
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: ${{ inputs.aws_region }}
+
+      - name: Login to Amazon ECR
+        id: login-ecr
+        uses: aws-actions/amazon-ecr-login@v2
+        with:
+          mask-password: "true"
+
+      - name: Build and push Lambda container image
+        id: build-lambda
+        env:
+          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
+          ECR_REPOSITORY: ${{ inputs.ecr_repository }}
+        run: |
+          # Determine version tag (same logic as main build)
+          if [[ "$GITHUB_REF" == refs/tags/v*.*.* ]]; then
+            VERSION_TAG=${GITHUB_REF#refs/tags/}
+          elif [[ "$GITHUB_REF" == refs/heads/release/* ]]; then
+            BRANCH_VERSION=${GITHUB_REF#refs/heads/release/}
+            VERSION_TAG="v${BRANCH_VERSION}-hotfix.${GITHUB_SHA:0:8}"
+          else
+            VERSION_TAG="git-${GITHUB_SHA:0:8}"
+          fi
+
+          # Lambda image tags (ARM64 only - for Graviton cost savings)
+          LAMBDA_TAG="lambda-${{ inputs.environment }}"
+          LAMBDA_VERSION_TAG="lambda-$VERSION_TAG"
+
+          echo "🐑 Building Lambda container image: $LAMBDA_TAG"
+
+          # Build Lambda image (ARM64 only)
+          docker buildx build \
+            --platform linux/arm64 \
+            --push \
+            --cache-from type=registry,ref=$ECR_REGISTRY/$ECR_REPOSITORY:lambda-cache \
+            --cache-to type=registry,ref=$ECR_REGISTRY/$ECR_REPOSITORY:lambda-cache,mode=max \
+            -t $ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_TAG \
+            -t $ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_VERSION_TAG \
+            -f Dockerfile.lambda .
+
+          # Verify the image was pushed
+          aws ecr describe-images --repository-name $ECR_REPOSITORY --image-ids imageTag=$LAMBDA_TAG || exit 1
+
+          LAMBDA_IMAGE="$ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_TAG"
+          echo "lambda_image=$LAMBDA_IMAGE" >> $GITHUB_OUTPUT
+          echo "✅ Lambda image pushed: $LAMBDA_IMAGE"
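
The tag-derivation branch in the step above is easy to get subtly wrong, and a small port makes it testable outside CI. The following is a sketch that mirrors the shell logic as written in the diff; the function names derive_version_tag and lambda_tags are hypothetical and do not exist in the repository.

```python
from fnmatch import fnmatchcase


def derive_version_tag(github_ref, github_sha):
    """Mirror the shell branch that picks VERSION_TAG from the Git ref."""
    short_sha = github_sha[:8]
    if fnmatchcase(github_ref, "refs/tags/v*.*.*"):
        # Release tag: strip the refs/tags/ prefix and use the tag itself.
        return github_ref[len("refs/tags/"):]
    if github_ref.startswith("refs/heads/release/"):
        # Release branch: version from the branch name plus a hotfix suffix.
        branch_version = github_ref[len("refs/heads/release/"):]
        return f"v{branch_version}-hotfix.{short_sha}"
    # Any other ref: fall back to a short commit hash.
    return f"git-{short_sha}"


def lambda_tags(environment, github_ref, github_sha):
    """Return the (environment tag, version tag) pair used for the Lambda image."""
    version_tag = derive_version_tag(github_ref, github_sha)
    return f"lambda-{environment}", f"lambda-{version_tag}"


print(lambda_tags("staging", "refs/heads/main", "0123456789abcdef"))
```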


…cripts
- Introduced a new alias `lambda_handler` for the existing `handler` function in both `graph_volume_detachment.py` and `graph_volume_monitor.py`.
- This change facilitates Lambda container deployment using CloudFormation ImageConfig.Command.

chatgpt-codex-connector bot reviewed Dec 23, 2025


github-advanced-security AI found potential problems Dec 23, 2025


jfrench9 merged commit 34697f8 into main Dec 23, 2025
5 checks passed

jfrench9 deleted the refactor/lambda-packaging branch December 23, 2025 18:54

jfrench9 merged 2 commits into main from refactor/lambda-packaging
