Secure, multi-tenant, asynchronous code execution platform with recruiter-facing UI and API.
Distributed code execution platform with queue-based job scheduling, isolated sandboxes, resource limits, and execution observability.
- Architecture: `docs/architecture.md`
- Security model: `docs/security.md`
- Scaling model: `docs/scaling.md`
- System design: `docs/system-design.md`
- ADR: `docs/adr/0001-why-fargate-over-ec2.md`
- Demo script: `demo/DEMO_SCRIPT.md`
- "Architected a highly elastic worker pool utilizing AWS Fargate Spot instances, reducing distributed compute costs by 70% for asynchronous payload processing."
- "Engineered zero-data-loss failover using Dead Letter Queues (DLQ) and exponential backoff, successfully recovering 100% of inflight jobs during simulated Redis network partitions."
- "Tuned Node.js V8 garbage collection and libuv thread-pool sizing to prevent memory leaks and event-loop starvation during sustained 10k+ req/min payload spikes."
`services/api`
- Auth (`api_key`, `jwt`, or `hybrid`) + tenant isolation
- JWT/Cognito verification via JWKS + issuer/audience checks
- Per-tenant rate limiting with Redis counters
- Quotas (`maxConcurrentJobs`, `maxDailyJobs`)
- Job submit + polling (`POST /v1/jobs`, `GET /v1/jobs/:jobId`)
- Job history (`GET /v1/jobs`)
- Cost visibility (`GET /v1/costs`)
- Tenant audit feed (`GET /v1/audit`)
- Execution analysis (`POST /v1/jobs/:jobId/analyze`)
- AI-backed analysis mode (`AI_PROVIDER=openai`) with retries/backoff and safe heuristic fallback
- Recruiter UI at `/`
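The per-tenant rate limiting can be illustrated as a fixed-window counter. In Redis this would be `INCR` plus `EXPIRE`; here an in-memory `Map` stands in for Redis so the sketch runs standalone. The function name and key shape are illustrative assumptions, not code from this repo:

```javascript
// In-memory stand-in for Redis: key -> { count }
const windows = new Map();

// Fixed-window limiter: one counter per tenant per window.
// Redis equivalent per call: INCR key; EXPIRE key windowSeconds.
function allowRequest(tenantId, nowMs, limit = 240, windowSeconds = 60) {
  const windowStart = Math.floor(nowMs / (windowSeconds * 1000));
  const key = `ratelimit:${tenantId}:${windowStart}`; // same shape as a Redis key
  const entry = windows.get(key) ?? { count: 0 };
  entry.count += 1;
  windows.set(key, entry);
  return entry.count <= limit;
}
```

Counters from an expired window are simply never read again, which is why the Redis version pairs the `INCR` with an `EXPIRE` to reclaim memory.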
`services/worker`
- BullMQ queue consumer
- Async dispatch to local Docker runner or ECS/Fargate runner tasks
- Queue-depth CloudWatch metric publishing (`CCEE/PendingJobsCount`)
- Retry + exponential backoff for transient failures
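The retry policy can be sketched as a capped exponential backoff schedule applied only to transient failures. `backoffDelayMs`, the defaults, and the error-code list are illustrative assumptions, not the repo's actual implementation:

```javascript
// Capped exponential backoff: attempt 1 -> baseMs, attempt 2 -> 2*baseMs, ...
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  const delay = baseMs * 2 ** (attempt - 1);
  return Math.min(delay, maxMs);
}

// Only retry errors that are likely to clear on their own (e.g. a Redis
// blip); anything else should fail fast and land in the DLQ.
function isTransient(err) {
  return ['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED'].includes(err.code);
}
```

With these defaults the first four delays are 500, 1000, 2000, and 4000 ms, capping at 30 s so a long outage does not produce hour-long waits.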
`services/runner`
- Sandboxed execution runtime for `javascript`, `python`, `java`
- Java path compiles and runs (`javac`, then `java`)
`packages/common`
- Shared schemas and key conventions

`infra/terraform`
- ECS cluster/services, ALB, ElastiCache Redis, IAM, security groups
- CPU and memory limits on job containers (`--cpus`, `--memory` locally; ECS task/container limits in cloud)
- Wall-clock timeout enforcement with hard kill (`SIGKILL`) in runner
- Process/file limits via `prlimit` (`--cpu`, `--nproc`, `--fsize`)
- Filesystem isolation:
  - container root filesystem set read-only in local runner path
  - writable area limited to tmpfs mounts (`/tmp`, `/workspace`)
  - per-job ephemeral working directory is created and deleted
- Privilege reduction:
  - run as non-root user
  - `no-new-privileges`
  - all Linux capabilities dropped
- Network isolation for local sandbox (`--network none`)
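Taken together, the hardening flags above compose into a single `docker run` argv for the local backend. A minimal sketch of such a builder, assuming hypothetical option names (`buildSandboxArgs`, `maxProcesses`) that are not from this repo:

```javascript
// Compose the local sandbox's docker run argv from a job's limits.
// Flag values mirror the isolation controls listed above.
function buildSandboxArgs(job) {
  return [
    'run', '--rm',
    '--network', 'none',                        // no outbound network from user code
    '--read-only',                              // root filesystem read-only
    '--tmpfs', '/tmp', '--tmpfs', '/workspace', // only writable areas
    '--user', 'nobody',                         // non-root
    '--cap-drop', 'ALL',                        // drop all Linux capabilities
    '--security-opt', 'no-new-privileges',
    '--cpus', String(job.cpuMillicores / 1000),
    '--memory', `${job.memoryMb}m`,
    '--pids-limit', String(job.maxProcesses ?? 64), // bound fork bombs
    job.image,
  ];
}

const args = buildSandboxArgs({
  cpuMillicores: 256, memoryMb: 256, image: 'ccee-runner:latest',
});
```

The same limits map onto ECS task/container definitions in the cloud path, where `--cpus`/`--memory` become task-level CPU and memory settings.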
- Tenant auth boundary: every read/write requires a valid API key and tenant match; cross-tenant job fetches return `404`.
- Quota boundary: atomic quota reservation blocks bursts (`maxConcurrentJobs`, `maxDailyJobs`).
- Rate-limit boundary: per-tenant request ceilings in Redis absorb abusive API polling patterns.
- Submit burst boundary: separate per-tenant submit limiter (`SUBMIT_RATE_LIMIT_PER_MINUTE`) blocks enqueue floods.
- Payload boundary: API rejects oversized `sourceCode`/`stdin` before enqueue (`MAX_SOURCE_CODE_BYTES`, `MAX_STDIN_BYTES`).
- Resource boundary: CPU/memory/pid/file/time limits bound compute abuse and fork bombs.
- Data boundary: outputs are truncated (`MAX_STDIO_BYTES`) to block log-exhaustion abuse.
- Audit boundary: auth failures, retries, state transitions, and abuse rejections are appended to an audit stream.
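The payload boundary is the simplest to illustrate: measure the submitted fields in bytes (not characters, so multi-byte UTF-8 counts correctly) before anything is enqueued. The function name and return shape are assumptions; the limit names and the `submission_rejected_size` audit event come from the list above:

```javascript
const MAX_SOURCE_CODE_BYTES = 100_000;
const MAX_STDIN_BYTES = 100_000;

// Reject oversized sourceCode/stdin before enqueue; the audit event name
// matches the abuse-rejection events described above.
function checkPayload({ sourceCode = '', stdin = '' }) {
  if (Buffer.byteLength(sourceCode, 'utf8') > MAX_SOURCE_CODE_BYTES) {
    return { ok: false, audit: 'submission_rejected_size', field: 'sourceCode' };
  }
  if (Buffer.byteLength(stdin, 'utf8') > MAX_STDIN_BYTES) {
    return { ok: false, audit: 'submission_rejected_size', field: 'stdin' };
  }
  return { ok: true };
}
```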
- Client submits a job to the API.
- API validates the request, reserves tenant quota, persists job metadata/history, and enqueues the BullMQ job.
- Worker consumes the queue job and marks it running.
- Worker executes locally or dispatches an ECS task (`EXECUTION_BACKEND=ecs`).
- Runner persists the result; the API polling endpoint exposes state transitions until a terminal state.
```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Redis
    participant QueueWorker as Worker
    participant Runner as ECS Runner Task
    participant CW as CloudWatch
    Client->>API: POST /v1/jobs
    API->>Redis: auth/rate/submit/quota checks + persist queued job
    API->>Redis: enqueue BullMQ job
    QueueWorker->>Redis: consume queue + mark running/dispatched
    QueueWorker->>Runner: RunTask (isolated execution)
    Runner->>Redis: persist result + audit events
    QueueWorker->>CW: publish QueueDepth metric
    Client->>API: GET /v1/jobs/:id
    API->>Redis: fetch job state/result
```
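The lifecycle the polling endpoint exposes can be sketched as a small state machine. The state names and `transition` helper are illustrative assumptions based on the flow above, not code from this repo:

```javascript
// Allowed job state transitions; terminal states have no outgoing edges.
const TRANSITIONS = {
  queued: ['running', 'failed'],
  running: ['succeeded', 'failed', 'timed_out'],
  succeeded: [],
  failed: [],
  timed_out: [],
};

// Move a job to the next state, rejecting illegal edges so a retry or
// duplicate queue delivery cannot corrupt a terminal job record.
function transition(job, next) {
  if (!TRANSITIONS[job.state].includes(next)) {
    throw new Error(`illegal transition ${job.state} -> ${next}`);
  }
  return { ...job, state: next };
}

// Clients stop polling once the job reaches a terminal state.
function isTerminal(state) {
  return TRANSITIONS[state].length === 0;
}
```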
- Untrusted code runs in isolated runner containers/tasks with non-root users, dropped Linux capabilities, and `no-new-privileges`.
- Runtime limits are enforced for CPU, memory, process count, file size, and wall-clock timeout, with a hard kill on timeout.
- The local backend also disables container networking (`--network none`) to block outbound access from user code.
- Workers publish queue depth (`waiting`) to the CloudWatch metric namespace `CCEE` (metric `PendingJobsCount` by default).
- Terraform provisions ECS Application Auto Scaling with target tracking:
  - scale out when queue depth exceeds the target
  - scale in toward zero as the queue drains
- Worker ECS service runs in private subnets with `assign_public_ip = false`; runner tasks remain ephemeral per execution.
- DLQ recovery: `REDIS_URL=... npm run queue:recover` replays dead-lettered jobs into the main queue.
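The target-tracking behavior amounts to steering the worker count toward queue depth divided by the per-worker target, clamped to the configured capacity range. A back-of-envelope approximation of what the AWS-managed policy converges to (not code from this repo, and the real policy also applies cooldowns):

```javascript
// Approximate the steady-state desired count ECS target tracking settles
// on for a given PendingJobsCount, clamped to [min, max] capacity.
function desiredWorkerCount(queueDepth, { target = 10, min = 0, max = 20 } = {}) {
  const desired = Math.ceil(queueDepth / target);
  return Math.min(max, Math.max(min, desired));
}
```

For example, with a target of 10 jobs per worker, a backlog of 25 jobs converges toward 3 workers, and an empty queue drains back to the minimum (zero here), which is what enables scale-to-zero.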
Failed code execution payloads are automatically routed to the Redis Dead Letter Queue and replayed daily via an EventBridge cron schedule at 02:00 (EventBridge cron expressions are evaluated in UTC).
Manual Override (Break-Glass Command): To manually trigger the DLQ recovery script outside the scheduled maintenance window, run the following transient Fargate task via the AWS CLI:
```sh
CLUSTER=ccee-cluster
TASK_DEF=ccee-dlq-replay
SUBNETS=$(terraform -chdir=infra/terraform output -json private_subnet_ids | python -c 'import json,sys;print(",".join(json.load(sys.stdin)))')
SG=$(aws ec2 describe-security-groups --filters Name=group-name,Values=ccee-worker-sg --query 'SecurityGroups[0].GroupId' --output text)
aws ecs run-task \
  --cluster "$CLUSTER" \
  --task-definition "$TASK_DEF" \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SG],assignPublicIp=DISABLED}" \
  --overrides '{"containerOverrides": [{"name": "dlq-replay", "command": ["node", "scripts/replay-dlq.mjs"]}]}' \
  --count 1
```

- Separate API controls for read traffic and submit traffic:
  - request rate limit (`RATE_LIMIT_REQUESTS_PER_MINUTE`)
  - submit burst limit (`SUBMIT_RATE_LIMIT_PER_MINUTE`)
- Submission payload size guardrails reject oversized source/stdin (`MAX_SOURCE_CODE_BYTES`, `MAX_STDIN_BYTES`).
- Quota controls and audit events capture denied attempts (`submission_rejected_size`, `submission_rejected_burst`, `submission_rejected_quota`).
Execution analysis supports two modes:
- `AI_PROVIDER=none` (default): deterministic heuristic analysis
- `AI_PROVIDER=openai`: calls the OpenAI model and stores provider-tagged analysis (`analysisProvider=openai`), with automatic fallback to heuristic mode on timeouts/API errors

Relevant env vars:
- `AI_PROVIDER` (`none` or `openai`)
- `OPENAI_API_KEY` (required when `AI_PROVIDER=openai`)
- `OPENAI_MODEL` (default `gpt-4.1-mini`)
- `AI_ANALYSIS_TIMEOUT_MS` (default `10000`)
- `AI_ANALYSIS_RETRIES` (default `2`)
- `AI_ANALYSIS_RETRY_BACKOFF_MS` (default `500`)
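The timeout/retry/fallback path can be sketched as a generic wrapper: race the provider call against a timeout, back off between bounded retries, and fall through to the heuristic analyzer. The helper names are assumptions; the defaults mirror the env vars above:

```javascript
// Try the AI provider with a timeout and bounded retries; if every
// attempt fails, return the deterministic heuristic result instead.
async function analyzeWithFallback(callProvider, heuristic, {
  retries = 2, backoffMs = 500, timeoutMs = 10000,
} = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const timeout = new Promise((_, reject) => {
        const t = setTimeout(() => reject(new Error('analysis timeout')), timeoutMs);
        t.unref(); // don't hold the process open after a fast success
      });
      const result = await Promise.race([callProvider(), timeout]);
      return { ...result, analysisProvider: 'openai' };
    } catch {
      if (attempt < retries) {
        // exponential backoff between attempts: backoffMs, 2x, 4x, ...
        await new Promise((res) => setTimeout(res, backoffMs * 2 ** attempt));
      }
    }
  }
  return { ...heuristic(), analysisProvider: 'heuristic' };
}
```

Tagging the result with its provider is what lets clients distinguish a model-backed analysis from the heuristic fallback after the fact.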
- `AUTH_MODE` (`api_key`, `jwt`, `hybrid`)
- `JWT_JWKS_URL`, `JWT_ISSUER`, `JWT_AUDIENCE` for JWT validation
- `JWT_TENANT_CLAIM` (default `custom:tenant_id`)
- `JWT_SUBJECT_CLAIM` (default `sub`)
- `TENANT_POLICIES_JSON` (tenant quotas for JWT mode)
- `TENANT_API_KEYS_JSON` (API key map + quotas)
- `RATE_LIMIT_REQUESTS_PER_MINUTE` (default `240`)
- `RATE_LIMIT_WINDOW_SECONDS` (default `60`)
- `SUBMIT_RATE_LIMIT_PER_MINUTE` (default `60`)
- `MAX_SOURCE_CODE_BYTES` (default `100000`)
- `MAX_STDIN_BYTES` (default `100000`)
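Once a JWT's signature, issuer, and audience have been verified via JWKS, the tenant id is read from a configurable claim (`JWT_TENANT_CLAIM`, default `custom:tenant_id`). A sketch of that last step, assuming verification already happened upstream (the function name is hypothetical):

```javascript
// Extract the tenant id from an ALREADY-VERIFIED JWT by decoding its
// payload segment and reading the configured claim.
function tenantFromJwtPayload(token, claim = 'custom:tenant_id') {
  const payloadB64 = token.split('.')[1]; // header.payload.signature
  const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
  return payload[claim] ?? null;
}
```

Decoding alone proves nothing about authenticity; this helper only makes sense after JWKS signature verification plus issuer/audience checks have passed.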
Worker metric config:
- `QUEUE_DEPTH_METRIC_NAMESPACE` (default `CCEE`)
- `QUEUE_DEPTH_METRIC_NAME` (default `PendingJobsCount`)
- `QUEUE_DEPTH_PUBLISH_INTERVAL_MS` (default `30000`)
- `QUEUE_DEPTH_METRIC_SERVICE_NAME` (default `worker`)
Prerequisites:
- Node.js 20+
- Docker Desktop or Colima
Example with Colima:

```sh
brew install docker docker-compose colima
colima start --cpu 2 --memory 4 --disk 20
```

Start:

```sh
cp .env.example .env
# Optional: enable model-backed analysis
# AI_PROVIDER=openai
# OPENAI_API_KEY=sk-...
./scripts/local-up.sh
```

Open UI:

```sh
open http://localhost:8080/
```

```sh
# 1) Submit
JOB_ID=$(curl -sS -X POST http://localhost:8080/v1/jobs \
  -H 'x-api-key: dev-local-key' \
  -H 'content-type: application/json' \
  -d '{
    "language": "java",
    "sourceCode": "public class Main { public static void main(String[] args) { System.out.println(\"hello\"); } }",
    "timeoutMs": 3000,
    "memoryMb": 256,
    "cpuMillicores": 256
  }' | jq -r .jobId)

# 2) Poll
curl -sS "http://localhost:8080/v1/jobs/${JOB_ID}" -H 'x-api-key: dev-local-key' | jq .

# 3) History
curl -sS "http://localhost:8080/v1/jobs?limit=10" -H 'x-api-key: dev-local-key' | jq .

# 4) Audit
curl -sS "http://localhost:8080/v1/audit?limit=10" -H 'x-api-key: dev-local-key' | jq .

# 5) Analysis
curl -sS -X POST "http://localhost:8080/v1/jobs/${JOB_ID}/analyze" -H 'x-api-key: dev-local-key' | jq .
```

Stop:

```sh
./scripts/local-down.sh
```

Endpoints:
- `GET /health`
- `GET /` (frontend)
- `GET /v1/quotas`
- `GET /v1/costs?days=7`
- `POST /v1/jobs`
- `GET /v1/jobs/:jobId`
- `GET /v1/jobs?limit=20`
- `POST /executions` (alias)
- `GET /executions/:id` (alias)
- `GET /executions/:id/logs`
- `GET /executions?limit=20` (alias)
- `GET /v1/audit?limit=20`
- `POST /v1/jobs/:jobId/analyze`
- `POST /executions/:id/analyze` (alias)
All `/v1/*` endpoints are protected by the configured auth mode:
- `AUTH_MODE=api_key`: requires `x-api-key`.
- `AUTH_MODE=jwt`: requires `Authorization: Bearer <JWT>`.
- `AUTH_MODE=hybrid`: accepts a JWT when provided, otherwise an API key.
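The mode selection above amounts to choosing a credential source per request. A minimal sketch, assuming lowercased header keys and a hypothetical `resolveCredential` helper (not code from this repo):

```javascript
// Pick the credential for a request based on AUTH_MODE.
// `headers` is assumed to use lowercased keys, as Node's http module does.
function resolveCredential(mode, headers) {
  const bearer = (headers['authorization'] ?? '').match(/^Bearer (.+)$/);
  const apiKey = headers['x-api-key'];
  if (mode === 'jwt') return bearer ? { kind: 'jwt', token: bearer[1] } : null;
  if (mode === 'api_key') return apiKey ? { kind: 'api_key', key: apiKey } : null;
  // hybrid: prefer a JWT when present, otherwise fall back to the API key
  if (bearer) return { kind: 'jwt', token: bearer[1] };
  return apiKey ? { kind: 'api_key', key: apiKey } : null;
}
```

A `null` result maps to a `401` and an auth-failure audit event before any tenant lookup happens.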
The Terraform module provisions:
- ALB + API ECS service
- Worker ECS service with `EXECUTION_BACKEND=ecs`
- ECS Application Auto Scaling target tracking on queue depth for the worker service
- Runner task definition
- ElastiCache Redis (TLS)
- IAM + SG boundaries for worker/runner/API/Redis
- Optional VPC, subnets, NAT, and RDS PostgreSQL
Key required vars:
- `api_image`
- `worker_image`
- `runner_image`
Network vars:
- `create_vpc` (default `false`)
- `vpc_id`, `public_subnet_ids`, `private_subnet_ids` (required when `create_vpc=false`)
- `vpc_cidr`, `public_subnet_cidrs`, `private_subnet_cidrs`, `availability_zones` (used when `create_vpc=true`)
AI-specific vars (optional):
- `ai_provider` (`none` or `openai`)
- `openai_api_key`
- `openai_model`
- `ai_analysis_timeout_ms`
Auth/rate-limit vars (optional):
- `auth_mode` (`api_key`, `jwt`, `hybrid`)
- `jwt_jwks_url`, `jwt_issuer`, `jwt_audience`
- `jwt_tenant_claim`, `jwt_subject_claim`
- `tenant_policies_json`
- `rate_limit_requests_per_minute`
- `rate_limit_window_seconds`
- `submit_rate_limit_per_minute`
- `max_source_code_bytes`
- `max_stdin_bytes`
- `worker_min_capacity`, `worker_max_capacity`
- `worker_queue_depth_target`
- `queue_depth_metric_namespace`, `queue_depth_metric_name`, `queue_depth_publish_interval_ms`
- `worker_scale_to_zero_cooldown_seconds`
- `worker_empty_queue_evaluation_periods`
- `worker_empty_queue_period_seconds`
- `worker_scale_from_zero_cooldown_seconds`
- `worker_nonempty_queue_evaluation_periods`
- `worker_nonempty_queue_period_seconds`
- `worker_nonempty_queue_threshold`
RDS vars (optional):
- `enable_rds` (default `false`)
- `rds_db_name`, `rds_username`, `rds_password`
- `rds_instance_class`, `rds_allocated_storage_gb`, `rds_multi_az`
Example:

```sh
cd infra/terraform
terraform init
terraform apply \
  -var 'api_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-api:latest' \
  -var 'worker_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-worker:latest' \
  -var 'runner_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-runner:latest'
```

To create a fresh VPC + subnets:

```sh
terraform apply \
  -var 'create_vpc=true' \
  -var 'availability_zones=["us-east-1a","us-east-1b"]' \
  -var 'api_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-api:latest' \
  -var 'worker_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-worker:latest' \
  -var 'runner_image=<account>.dkr.ecr.<region>.amazonaws.com/ccee-runner:latest'
```

For production hardening:
- set `redis_auth_token`
- inject tenant API keys and secrets from a secret manager, not plain env vars
- keep API/worker in private subnets behind proper ingress controls