This project provisions a robust, production-ready AWS infrastructure for running Nextflow workflows using AWS Batch, following the GA4GH WES API standard. It is inspired by the Amazon Genomics CLI (AGC) but is actively maintained and modernized for cost-effectiveness, scalability, and maintainability.
- End-to-End Infrastructure: Provisions VPC, networking, AWS Batch compute environments, job queues, Lambda-based WES API adapter, and supporting resources.
- WES API Support: Submit and manage workflows using the GA4GH Workflow Execution Service API.
- Nextflow Engine: Dockerized Nextflow engine for reproducible, portable workflow execution.
- Customizable: Easily adapt compute environments, job queues, and storage to your needs via CDK configuration.
- Cost-Effective: Designed for efficient resource usage and minimal operational overhead.
- Extensible: Add support for additional workflow engines or custom orchestration logic.
- Users submit workflows via the WES API endpoint (Lambda).
- Lambda triggers AWS Batch jobs, which run Nextflow containers in a managed compute environment.
- S3 Buckets are used for input, output, and artifact storage.
- IAM Roles/Policies ensure secure, least-privilege access.
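The request flow above can be sketched with a GA4GH WES `RunWorkflow` call. The endpoint URL, API key header, and the exact `workflow_type`/`workflow_type_version` values accepted by the Lambda adapter are assumptions here — check your deployment outputs and adapter code for the real values:

```shell
# Hypothetical values -- taken from the CDK deployment outputs in practice.
WES_ENDPOINT="https://example.execute-api.us-east-1.amazonaws.com/prod"
API_KEY="your-api-key"

# Submit a Nextflow workflow run (GA4GH WES v1 RunWorkflow, multipart form).
curl -X POST "${WES_ENDPOINT}/ga4gh/wes/v1/runs" \
  -H "x-api-key: ${API_KEY}" \
  -F workflow_url="https://github.com/nextflow-io/hello" \
  -F workflow_type="NEXTFLOW" \
  -F workflow_params='{}'

# The response contains a run_id; poll its status with:
#   curl -H "x-api-key: ${API_KEY}" "${WES_ENDPOINT}/ga4gh/wes/v1/runs/<run_id>"
```

The `/ga4gh/wes/v1/runs` path and multipart fields come from the WES API standard; everything else is deployment-specific.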
- Go to `nextflow-engine/`.
- Use `buildspec.yml` or the provided Dockerfile to build the Nextflow image.
- Push the image to your AWS ECR repository.
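For a manual build outside CodeBuild, the standard ECR workflow looks like the following. The account ID, region, and repository name are placeholders — substitute your own:

```shell
# Hypothetical identifiers -- replace with your account, region, and repo.
AWS_ACCOUNT_ID=123456789012
AWS_REGION=us-east-1
REPO=nextflow-engine
REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Authenticate Docker with ECR, then build, tag, and push the engine image.
aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$REGISTRY"

docker build -t "$REPO" .
docker tag "${REPO}:latest" "${REGISTRY}/${REPO}:latest"
docker push "${REGISTRY}/${REPO}:latest"
```

This assumes the ECR repository already exists (`aws ecr create-repository --repository-name nextflow-engine` if not).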
- Go to `wes_adapter/`.
- Run `make` or use the provided scripts to build `wes_adapter.zip`.
- Update the CDK asset reference if you customize the Lambda.
- Go to `nextflow-cdk/`.
- Install dependencies: `npm install`
- Configure your AWS credentials and environment variables (see below).
- Deploy: `npx cdk deploy`
You can configure deployment parameters using a .env file in the nextflow-cdk/ directory. Example:
PROJECT_NAME=your-project-name
USER_ID=your-user-id
USER_EMAIL=your-email@example.com
OUTPUT_BUCKET_NAME=your-output-bucket
ARTIFACT_BUCKET_NAME=your-artifact-bucket
READ_BUCKET_ARNS=arn:aws:s3:::your-read-bucket
READ_WRITE_BUCKET_ARNS=arn:aws:s3:::your-readwrite-bucket-1,arn:aws:s3:::your-readwrite-bucket-2
BUCKET_NAME_1=your-bucket-1
BUCKET_NAME_2=your-bucket-2

Repository layout:

- nextflow-cdk/ – AWS CDK code for infrastructure provisioning
- nextflow-engine/ – Dockerization and build scripts for the Nextflow engine
- wes_adapter/ – Source code for the WES Adapter Lambda function
- job-orchestrator/ – (Optional) Additional orchestration logic
- docs/ – Architecture diagrams and documentation
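If the CDK app reads these values from the process environment rather than loading the `.env` file itself (an assumption — check the app's entry point), you can export them into your shell before deploying:

```shell
# `set -a` marks every variable defined afterwards for export, so sourcing
# the .env file puts its values into the environment that `cdk` inherits.
# Assumes simple KEY=value lines without spaces or shell metacharacters.
set -a
source .env
set +a
npx cdk deploy
```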
- Submit Workflows: Use the WES API endpoint output by the CDK deployment to submit Nextflow workflows.
- Monitor Jobs: Track job status in AWS Batch and CloudWatch Logs.
- Customize: Edit CDK code or environment variables to adjust compute resources, storage, or permissions.
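Monitoring from the CLI might look like the following sketch. The job queue name is a placeholder, and `/aws/batch/job` is AWS Batch's default log group — your CDK configuration may use different names:

```shell
# List currently running jobs on the queue (queue name is an assumption).
aws batch list-jobs --job-queue nextflow-job-queue --job-status RUNNING

# Follow the Batch jobs' CloudWatch logs (default Batch log group assumed).
aws logs tail /aws/batch/job --follow
```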
A ready-to-use Postman collection is provided in the docs/ directory as WES REST API.postman_collection.json.
- Import this collection into Postman.
- Update the environment variables (such as the API endpoint and API key) as needed.
- Use the pre-configured requests to submit, monitor, and manage workflows via the WES API after deployment.
- Multiple Environments: Deploy multiple stacks (e.g., dev, staging, prod) by instantiating the CDK stack with different parameters.
- Cross-Stack References: Export/import resources (e.g., S3 bucket names) between stacks using `CfnOutput` and `Fn.importValue`.
- Tag Propagation: Tags set in CDK are propagated to AWS Batch jobs and other resources for cost tracking and organization.
- Security: IAM roles and policies are configured for least-privilege access. Review and adjust as needed for your organization.
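The `CfnOutput` values (such as the WES endpoint) and cross-stack exports can be inspected from the CLI. The stack name below is hypothetical — use the name printed by `npx cdk deploy`:

```shell
# Read CfnOutput values from a deployed stack (stack name is an assumption).
aws cloudformation describe-stacks \
  --stack-name NextflowCdkStack \
  --query "Stacks[0].Outputs" --output table

# List the exports available to Fn.importValue in other stacks.
aws cloudformation list-exports --output table
```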
- Nextflow Documentation
- AWS Batch Documentation
- WES API Standard
- Amazon Genomics CLI (AGC) - Archived
- AWS CDK Documentation
For detailed setup, customization, and troubleshooting, see the docs/ directory and comments in the CDK source files. Contributions and issues are welcome!
