Refactor Lambda deployment to use container images and streamline workflows #96

Merged
jfrench9 merged 2 commits into main from refactor/lambda-packaging
Dec 23, 2025

@jfrench9
Member

Summary

This PR modernizes our Lambda deployment architecture by introducing container image support and significantly streamlining our CI/CD workflows. The changes reduce complexity while improving maintainability and deployment reliability.

Key Accomplishments

Lambda Container Image Support

  • Added Docker container support for Lambda functions with new Dockerfile configuration
  • Modernized packaging approach to replace traditional ZIP-based deployments
  • Enhanced build process to support containerized Lambda deployments

Workflow Optimization

  • Consolidated and simplified deployment workflows across all environments
  • Enhanced build workflow with improved caching and efficiency
  • Streamlined PostgreSQL deployment by merging IAM configuration into main deployment
  • Reduced workflow complexity while maintaining deployment reliability

Infrastructure Updates

  • Updated CloudFormation templates for graph infrastructure, volumes, PostgreSQL, and Valkey
  • Improved resource configuration and parameter management
  • Enhanced stack configuration management for better environment consistency

Dependency Management

  • Updated project dependencies and lock file for improved compatibility
  • Added new tooling dependencies to support container-based deployments

Breaking Changes

  • Deployment Method: Lambda functions now use container images instead of ZIP packages
  • Workflow Structure: PostgreSQL IAM deployment has been merged into the main PostgreSQL workflow
  • Build Process: Updated build and packaging procedures require container runtime

Testing Notes

  • Verify container image builds successfully in CI environment
  • Validate Lambda function deployments work correctly with new container approach
  • Test all deployment workflows in staging environment before production rollout
  • Confirm CloudFormation stack updates apply cleanly without resource recreation

Infrastructure Considerations

  • Container images will be stored in ECR and may impact storage costs
  • Initial deployments may take longer due to container image pulls
  • Ensure adequate IAM permissions for ECR operations in deployment environments
  • Monitor Lambda cold start times as container images may affect startup performance

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/lambda-packaging
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Introduced Dockerfile.lambda for building a container image for Lambda functions, utilizing the AWS Lambda Python base image.
- Updated pyproject.toml to define dependency groups for Lambda functions.
- Modified GitHub Actions workflows to deploy Lambda functions using the new container image approach, replacing S3 bucket references with ECR image URIs.
- Refactored CloudFormation templates to accommodate the new Lambda deployment method, including permissions for ECR access.
- Removed legacy Lambda packaging scripts and updated related documentation to reflect the transition to container-based deployment.
@jfrench9
Member Author

@claude please review this PR

@claude

claude bot commented Dec 23, 2025

Claude finished @jfrench9's task in 3m 11s


PR Review Progress

  • Read repository guidelines (CLAUDE.md)
  • Analyze PR diff and changed files
  • Review Dockerfile.lambda (new Lambda container configuration)
  • Review workflow changes (build.yml, staging.yml, prod.yml)
  • Review CloudFormation template updates
  • Review packaging script simplification
  • Check for security concerns
  • Verify breaking changes are properly documented
  • Post comprehensive review feedback

Code Review

This is a well-structured refactoring that modernizes Lambda deployments from ZIP files to container images. The implementation is solid overall, but there are several areas that need attention before merging.

Critical Issues

1. Missing Lambda dependency group validation (pyproject.toml)

The pyproject.toml adds a new [dependency-groups].lambda section with specific dependencies, but there's no validation that these dependencies match what the Lambda functions actually need.

Issue: The dependency group includes redis>=6.2.0,<7.0, but valkey_rotation.py was previously packaged with redis==5.0.1 (see bin/tools/package-scripts.sh:77). This version mismatch could cause runtime failures.

Recommendation:

  • Verify each Lambda function's actual dependencies match the dependency group
  • The redis version change from 5.0.1 to >=6.2.0,<7.0 is a breaking change that should be documented
  • Test Valkey rotation with the new redis version to ensure compatibility

Location: pyproject.toml:144-149
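
The mismatch described in this issue can be checked mechanically. The following is a minimal sketch, not part of the PR: the helper names `parse_version` and `satisfies` are hypothetical, and it only understands plain dotted numeric versions. A real check would more likely rely on the `packaging` library.

```python
# Minimal sketch: does a pinned version satisfy a ">=X,<Y" style range?
# Helper names are hypothetical; only plain dotted numeric versions are handled.
import operator

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text: str) -> tuple:
    """Turn '6.2.0' into (6, 2, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def _pad(left: tuple, right: tuple) -> tuple:
    """Zero-pad both tuples to the same length so '6.2' compares equal to '6.2.0'."""
    width = max(len(left), len(right))
    return (
        left + (0,) * (width - len(left)),
        right + (0,) * (width - len(right)),
    )


def satisfies(version: str, spec: str) -> bool:
    """Return True if `version` meets every comma-separated clause in `spec`."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Two-character operators are tried before one-character ones.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not _OPS[symbol](*_pad(current, bound)):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


if __name__ == "__main__":
    # Versions quoted in the review: the old pin against the new range.
    print(satisfies("5.0.1", ">=6.2.0,<7.0"))
    print(satisfies("6.2.0", ">=6.2.0,<7.0"))
```

Run against the values quoted above, the old `redis==5.0.1` pin falls outside the new range, which is why the upgrade counts as a behavior change for the Valkey rotation function.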

2. Incomplete workflow file rename (.github/workflows/deploy-postgres.yml)

The workflow was renamed from deploy-postgres-iam.yml to deploy-postgres.yml, but I can only see the new file in the diff. This appears to be a file rename, but git is showing it as RENAMED with SHA: unknown.

Issue: Need to verify:

  • The old deploy-postgres-iam.yml file is actually deleted
  • All references to the old workflow name are updated (checked staging.yml and prod.yml - looks good)
  • The stack name in stacks.yml still references "PostgresIAM" but comments say "Legacy name kept for backward compatibility" - this is acceptable

Recommendation: Verify the old workflow file was properly deleted in the PR.

Location: .github/workflows/deploy-postgres.yml

3. ARM64-only deployment lacks fallback (Dockerfile.lambda, build.yml)

The Lambda container image is built exclusively for ARM64 (Graviton), which is great for cost savings. However, there's no fallback or documentation about what happens if ARM64 becomes unavailable in a region.

Issue:

  • The PR description mentions "Monitor Lambda cold start times as container images may affect startup performance"
  • No baseline metrics or monitoring setup is included in the PR
  • CloudWatch alarms in CloudFormation templates weren't updated to include container-specific metrics

Recommendation:

  • Add CloudWatch metrics for Lambda cold start duration in CloudFormation templates
  • Document expected cold start times in the PR description or infrastructure docs
  • Consider adding a note about ARM64 availability by region

Location: Dockerfile.lambda:6, .github/workflows/build.yml:239-242

High Priority Issues

4. ECR image pull permissions are broad (cloudformation/*.yaml)

All CloudFormation templates add ECR permissions with a wildcard for ecr:GetAuthorizationToken, which is correct since it doesn't support resource-level permissions. However, the ecr:BatchGetImage and ecr:GetDownloadUrlForLayer permissions are scoped to the robosystems repository.

Issue: This is correct, but the ARN uses a hardcoded repository name:

Resource: !Sub "arn:${AWS::Partition}:ecr:${AWS::Region}:${AWS::AccountId}:repository/robosystems"

Recommendation: Verify the ECR repository is actually named "robosystems" in all environments. If it varies by environment, this needs to be parameterized.

Location: All cloudformation/*.yaml files (postgres.yaml:276, valkey.yaml:387, graph-infra.yaml:464, graph-volumes.yaml:188, etc.)

5. Lambda timeout unchanged despite container overhead (cloudformation/*.yaml)

Container-based Lambdas typically have longer cold start times than ZIP-based deployments, especially for the first invocation or after scaling. However, all Lambda timeout values remain unchanged at 300 seconds (5 minutes).

Issue: The PR description mentions "Monitor Lambda cold start times as container images may affect startup performance" but doesn't adjust timeouts as a precaution.

Recommendation:

  • Consider increasing timeouts slightly for initial deployment (e.g., 360s)
  • Monitor actual execution times and adjust down if unnecessary
  • Document baseline timeout metrics before and after migration

Location: All Lambda function definitions in CloudFormation templates

6. Manifest generation removed but no migration path documented (bin/tools/package-scripts.sh)

The packaging script previously generated a manifest file (lambda-manifest-${ENVIRONMENT}.json) that tracked Lambda S3 keys and hashes. This has been completely removed.

Issue:

  • Any automation or monitoring that relied on this manifest will break
  • The PR doesn't document whether this manifest was actually used anywhere
  • No deprecation notice or migration guide

Recommendation:

  • Document whether the manifest was used by any other systems
  • Add a note about this removal in the PR description under "Breaking Changes"
  • Consider adding a comment in package-scripts.sh explaining why the manifest was removed

Location: bin/tools/package-scripts.sh:63-90 (removed code)

Medium Priority Issues

7. Docker buildx cache strategy could be optimized (.github/workflows/build.yml)

The Lambda image build uses cache-to type=registry,mode=max which caches all layers, but the main application image build (earlier in the file) might benefit from the same caching strategy for consistency.

Recommendation: Review if the main image build should also use mode=max for consistency, or document why different caching strategies are appropriate for each image type.

Location: .github/workflows/build.yml:241-242

8. Missing rollback documentation

The PR description mentions this is a breaking change but doesn't provide a rollback procedure.

Issue: If the container deployment fails in production, teams need to know:

  • Can they roll back to the previous ZIP-based version?
  • Are there stack parameter changes that prevent rollback?
  • What's the disaster recovery procedure?

Recommendation: Add a "Rollback Procedure" section to the PR description or link to runbook documentation.

9. uv export command uses --frozen but no lock file validation (Dockerfile.lambda)

The Dockerfile runs uv export --only-group lambda --no-hashes --frozen which requires an up-to-date lock file.

Issue:

  • If uv.lock is out of sync with pyproject.toml, the build will fail
  • No CI validation step ensures the lock file is current before building
  • The build workflow doesn't run uv lock --check or similar validation

Recommendation:

  • Add a step in the build workflow to validate uv.lock is up-to-date before building
  • Consider running uv lock --check in the test workflow as well

Location: Dockerfile.lambda:19, .github/workflows/build.yml:177-251

Minor Issues / Suggestions

10. Inconsistent Lambda handler naming convention (Dockerfile.lambda)

The Lambda functions use two different naming patterns:

  • postgres_rotation.lambda_handler (module.function format)
  • File is named postgres_rotation.py with function lambda_handler

vs. the Dockerfile's default:

  • graph_volume_manager.lambda_handler

Observation: This is actually consistent (all use module.lambda_handler), but the Dockerfile comment could be clearer.

Recommendation: The default CMD in Dockerfile.lambda could include a comment explaining that each CloudFormation template overrides this via ImageConfig.Command.

Location: Dockerfile.lambda:33-35

11. Build workflow output doesn't validate both images built successfully

The build workflow now outputs two images (image and lambda_image), but the success criteria only validates the main image was pushed. The Lambda image validation happens inside the build step but doesn't propagate to job outputs.

Recommendation: Add a validation step at the end of the lambda job to confirm both the environment tag and version tag were pushed successfully.

Location: .github/workflows/build.yml:249-252

12. Dependency version ranges could be tighter (pyproject.toml)

The Lambda dependency group uses wide version ranges:

  • boto3>=1.39.0,<2.0 - major version range
  • redis>=6.2.0,<7.0 - major version change from previous 5.0.1

Recommendation:

  • Consider tighter ranges for Lambda functions to prevent unexpected runtime changes
  • Document why redis was upgraded from 5.x to 6.x (API changes? Bug fixes?)

Location: pyproject.toml:146-147

13. CloudFormation template parameter descriptions could mention container deployment

Several templates still have parameter descriptions that reference "S3 bucket" or "deployment package" without mentioning the new container-based approach.

Example: valkey.yaml:93 - "ECR image URI for Lambda functions (container-based deployment)" - This is good!

Recommendation: Ensure all parameter descriptions are updated to reflect container deployment (spot check shows they are mostly updated, good work).

Security Review

No major security concerns identified

  • ECR permissions are appropriately scoped to the specific repository
  • IAM roles follow least-privilege principle
  • Secrets Manager integration unchanged (good)
  • Container images use official AWS Lambda base images
  • No hardcoded credentials or secrets

Minor security note: The ARM64 base image public.ecr.aws/lambda/python:3.13-arm64 should be pinned to a specific SHA256 digest in production for supply chain security, but using tags is acceptable for development velocity.

Breaking Changes Verification

The PR description lists breaking changes, which are accurate:

Deployment Method: Correctly documented
Workflow Structure: Correctly documented
⚠️ Build Process: Documented, but should mention the new lambda dependency group in pyproject.toml
Missing: Redis version upgrade from 5.0.1 to 6.2.0+ for Valkey rotation Lambda

Testing Recommendations

The PR description includes good testing notes. I'd add:

  1. Validate each Lambda function individually in staging:

    • postgres-init (database creation)
    • postgres-rotation (password rotation)
    • valkey-rotation (auth token rotation) - especially test with new redis library version
    • graph-api-rotation
    • graph-volume-manager
    • graph-volume-monitor
    • graph-volume-detachment
  2. Cold start metrics: Capture baseline metrics before and after deployment:

    • Time to first execution
    • Memory usage
    • Duration metrics
  3. Rollback test: Verify you can roll back to previous infrastructure if needed
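
One way to capture the cold start baseline is to summarize the `Init Duration` field that Lambda appends to `REPORT` log lines on cold starts. The sketch below assumes the log lines have already been exported from CloudWatch Logs as plain strings. The function name `cold_start_summary` and the sample lines are illustrative and not taken from this PR.

```python
# Minimal sketch: summarize cold starts from Lambda REPORT log lines.
# Assumes lines were already exported as text; sample data below is made up.
import re
from statistics import mean

# Lambda includes "Init Duration: <n> ms" in REPORT lines only for cold starts.
INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_summary(log_lines):
    """Count invocations and cold starts, and report average/max init time in ms."""
    init_ms = []
    invocations = 0
    for line in log_lines:
        if not line.startswith("REPORT"):
            continue  # skip START/END and application log lines
        invocations += 1
        match = INIT_RE.search(line)
        if match:
            init_ms.append(float(match.group(1)))
    return {
        "invocations": invocations,
        "cold_starts": len(init_ms),
        "avg_init_ms": mean(init_ms) if init_ms else None,
        "max_init_ms": max(init_ms) if init_ms else None,
    }


if __name__ == "__main__":
    # Illustrative lines only; real values must come from the deployed functions.
    sample = [
        "START RequestId: a Version: $LATEST",
        "REPORT RequestId: a\tDuration: 10.00 ms\tBilled Duration: 10 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 40 MB\tInit Duration: 400.00 ms\t",
        "REPORT RequestId: b\tDuration: 5.00 ms\tBilled Duration: 5 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 41 MB\t",
    ]
    print(cold_start_summary(sample))
```

Running the same summary on logs from before and after the migration gives a like-for-like comparison of init times between the ZIP and container deployments.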

Summary

Overall Assessment: This is a well-executed refactoring with a clear architecture improvement. The code quality is high, and the approach is sound. However, there are several issues that should be addressed before merging:

Must Fix Before Merge:

  1. Verify redis library version upgrade compatibility (Critical Issue #1)
  2. Confirm old workflow file was deleted (Critical Issue #2)
  3. Document ECR repository naming convention (High Priority #4)
  4. Add rollback procedure documentation (Medium Priority #8)

Should Fix Before Merge:

  5. Add uv.lock validation to build workflow (Medium Priority #9)
  6. Document manifest removal as breaking change (High Priority #6)

Nice to Have:

  7. Add CloudWatch metrics for cold start monitoring (Critical Issue #3)
  8. Tighter dependency version ranges with changelog (Minor Issue #12)

Approval Status: Request Changes - The redis version upgrade and missing documentation are blocking issues that should be resolved before merge.


CI Status: 2 workflow runs currently in progress


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +416 to +418
ImageConfig:
  Command:
    - graph_volume_monitor.lambda_handler

P1: Point volume monitor to the actual handler function

The Lambda container entrypoint is set to graph_volume_monitor.lambda_handler, but bin/lambda/graph_volume_monitor.py only defines def handler(...) (line 48) and does not export lambda_handler. With container images, Lambda resolves the handler from ImageConfig.Command, so this will raise a “handler not found” error on every invocation and break scheduled monitoring/alarm-driven expansions. Use graph_volume_monitor.handler or add an alias in the module so the configured handler exists.
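
A mismatch like this can be caught before deployment by resolving each configured handler string against the module it names. The sketch below is an approximation of that lookup, not the Lambda runtime's own code. `resolve_handler` and the stand-in module `demo_volume_monitor` are hypothetical names used only for illustration.

```python
# Minimal sketch: check that a "module.function" handler string resolves to a callable.
# This approximates the lookup; it is not the Lambda runtime's implementation.
import importlib
import sys
import types


def resolve_handler(handler: str):
    """Import the module part and return the named function, or raise if missing."""
    module_name, _, function_name = handler.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {handler!r}")
    module = importlib.import_module(module_name)
    target = getattr(module, function_name, None)
    if not callable(target):
        raise AttributeError(f"{module_name} has no callable {function_name!r}")
    return target


if __name__ == "__main__":
    # Hypothetical stand-in for a module that only defines `handler`.
    demo = types.ModuleType("demo_volume_monitor")
    demo.handler = lambda event, context: "ok"
    sys.modules["demo_volume_monitor"] = demo

    print(resolve_handler("demo_volume_monitor.handler")({}, None))
    try:
        resolve_handler("demo_volume_monitor.lambda_handler")
    except AttributeError as exc:
        print(f"would fail at invocation time: {exc}")
```

A check of this kind could run in CI over every `ImageConfig.Command` value, with the function directory on the import path, so that a missing handler fails the build instead of the first invocation.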

Comment on lines +839 to +841
ImageConfig:
  Command:
    - graph_volume_detachment.lambda_handler

P1: Point volume detachment to the actual handler function

The Lambda container entrypoint is set to graph_volume_detachment.lambda_handler, but bin/lambda/graph_volume_detachment.py defines def handler(...) (line 31) and has no lambda_handler. With container-based Lambda, this mismatch causes cold start failures and prevents the ASG lifecycle hook from detaching volumes. Update the command to graph_volume_detachment.handler or export a lambda_handler alias in the module.

Comment on lines +183 to +254
runs-on: ubuntu-latest
outputs:
  lambda_image: ${{ steps.build-lambda.outputs.lambda_image }}
steps:
  - name: Checkout
    uses: actions/checkout@v4
    with:
      repository: ${{ github.repository }}
      ref: ${{ github.ref }}
      token: ${{ secrets.ACTIONS_TOKEN }}

  - name: Set up QEMU
    uses: docker/setup-qemu-action@v3

  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3
    with:
      driver-opts: |
        image=moby/buildkit:master
        network=host

  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
      aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      aws-region: ${{ inputs.aws_region }}

  - name: Login to Amazon ECR
    id: login-ecr
    uses: aws-actions/amazon-ecr-login@v2
    with:
      mask-password: "true"

  - name: Build and push Lambda container image
    id: build-lambda
    env:
      ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
      ECR_REPOSITORY: ${{ inputs.ecr_repository }}
    run: |
      # Determine version tag (same logic as main build)
      if [[ "$GITHUB_REF" == refs/tags/v*.*.* ]]; then
        VERSION_TAG=${GITHUB_REF#refs/tags/}
      elif [[ "$GITHUB_REF" == refs/heads/release/* ]]; then
        BRANCH_VERSION=${GITHUB_REF#refs/heads/release/}
        VERSION_TAG="v${BRANCH_VERSION}-hotfix.${GITHUB_SHA:0:8}"
      else
        VERSION_TAG="git-${GITHUB_SHA:0:8}"
      fi

      # Lambda image tags (ARM64 only - for Graviton cost savings)
      LAMBDA_TAG="lambda-${{ inputs.environment }}"
      LAMBDA_VERSION_TAG="lambda-$VERSION_TAG"

      echo "🐑 Building Lambda container image: $LAMBDA_TAG"

      # Build Lambda image (ARM64 only)
      docker buildx build \
        --platform linux/arm64 \
        --push \
        --cache-from type=registry,ref=$ECR_REGISTRY/$ECR_REPOSITORY:lambda-cache \
        --cache-to type=registry,ref=$ECR_REGISTRY/$ECR_REPOSITORY:lambda-cache,mode=max \
        -t $ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_TAG \
        -t $ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_VERSION_TAG \
        -f Dockerfile.lambda .

      # Verify the image was pushed
      aws ecr describe-images --repository-name $ECR_REPOSITORY --image-ids imageTag=$LAMBDA_TAG || exit 1

      LAMBDA_IMAGE="$ECR_REGISTRY/$ECR_REPOSITORY:$LAMBDA_TAG"
      echo "lambda_image=$LAMBDA_IMAGE" >> $GITHUB_OUTPUT
      echo "✅ Lambda image pushed: $LAMBDA_IMAGE"
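
The tag derivation in the step above is easier to reason about (and to unit test) when mirrored outside the shell. The following sketch restates the same three branches in Python. The function names `version_tag` and `lambda_tags` and the refs, SHA, and environment used in the demo are assumptions for illustration only.

```python
# Minimal sketch mirroring the workflow's version-tag branches; names are hypothetical.
from fnmatch import fnmatchcase


def version_tag(github_ref: str, github_sha: str) -> str:
    """Derive the version tag from the ref, using the first 8 characters of the SHA."""
    short_sha = github_sha[:8]
    if fnmatchcase(github_ref, "refs/tags/v*.*.*"):
        # Release tag: strip the refs/tags/ prefix and use the tag itself.
        return github_ref[len("refs/tags/"):]
    if github_ref.startswith("refs/heads/release/"):
        # Release branch: treated as a hotfix build of that branch version.
        branch_version = github_ref[len("refs/heads/release/"):]
        return f"v{branch_version}-hotfix.{short_sha}"
    # Anything else falls back to a git-<sha> tag.
    return f"git-{short_sha}"


def lambda_tags(environment: str, github_ref: str, github_sha: str) -> tuple:
    """Return the (environment tag, version tag) pair applied to the Lambda image."""
    return (
        f"lambda-{environment}",
        f"lambda-{version_tag(github_ref, github_sha)}",
    )


if __name__ == "__main__":
    sha = "abcdef1234567890"  # made-up SHA for the demo
    for ref in ("refs/tags/v1.2.3", "refs/heads/release/1.4", "refs/heads/main"):
        print(ref, "->", lambda_tags("staging", ref, sha))
```

Because the environment tag (`lambda-<environment>`) is reused on every build while the version tag is unique per ref and commit, only the version tag identifies a specific build.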

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
…cripts

- Introduced a new alias `lambda_handler` for the existing `handler` function in both `graph_volume_detachment.py` and `graph_volume_monitor.py`.
- This change facilitates Lambda container deployment using CloudFormation ImageConfig.Command.
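
The alias described in this commit is a one-line pattern. The sketch below shows its general shape with a placeholder body, since the actual function bodies in `graph_volume_detachment.py` and `graph_volume_monitor.py` do not appear on this page.

```python
# Minimal sketch of the alias pattern; the handler body here is a placeholder.
def handler(event, context):
    """Stand-in for the existing entry point."""
    return {"status": "ok", "received": event}


# Alias so a handler string ending in `.lambda_handler` resolves to the same function.
lambda_handler = handler


if __name__ == "__main__":
    print(lambda_handler({"source": "demo"}, None))
```

Both names point at the same function object, so existing callers of `handler` are unaffected.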
@jfrench9 jfrench9 merged commit 34697f8 into main Dec 23, 2025
5 checks passed
@jfrench9 jfrench9 deleted the refactor/lambda-packaging branch December 23, 2025 18:54
2 participants