ForgeQueue is a distributed job processing system that allows users to submit long-running tasks—such as report generation, file processing, or data analysis—without waiting for immediate results. When a task is submitted, the system returns a job ID and processes it asynchronously in the background.
Users can query a job's status at any time: QUEUED, PROCESSING, COMPLETED, or DEAD_LETTER. The system includes automatic retry with backoff for transient failures, dead-letter handling for persistent failures, rate limiting to prevent abuse, and idempotent submission to avoid duplicate processing.
ForgeQueue provides reliable, trackable, and horizontally scalable background execution through safe multi-worker coordination.
This project was inspired by observing submission queues on competitive programming platforms like Codeforces during high-traffic contests.
Submissions often remain in an “in queue” state before evaluation, highlighting real-world issues such as queue backlogs, fairness, and increased processing latency under load.
This led to a deeper exploration of how distributed systems manage asynchronous workloads, handle concurrency, and maintain fault tolerance — ultimately resulting in the design of ForgeQueue.
- Java 17
- Spring Boot 3.5
- Spring Cloud Gateway
- Hibernate / JPA
- PostgreSQL 15
- Redis 7
- Atomic row-level leasing (SELECT FOR UPDATE SKIP LOCKED)
- 30s visibility timeout
- Exponential backoff (5–300s)
- ±20% retry jitter
- Dead-letter handling
- Resilience4j (Circuit Breaker + Short-Term Retry)
- Docker (multi-stage builds)
- Docker Compose
- GitHub Actions (CI/CD)
- Docker Hub (image registry)
- Testcontainers (integration testing)
- OpenAPI (Swagger), aggregated via Gateway
- Immediate job acceptance with jobId (non-blocking API)
- Duplicate-safe via composite constraint (user_id, idempotency_key)
- Transaction-safe under concurrent submissions
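The duplicate-safe submission described above can be illustrated with a minimal in-memory sketch. It assumes a map standing in for the table that carries the composite (user_id, idempotency_key) constraint; the class and method names are hypothetical and not taken from the ForgeQueue codebase.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a table with a unique (user_id, idempotency_key) constraint. */
final class IdempotentSubmitSketch {
    // Composite key mirroring the (user_id, idempotency_key) constraint.
    record SubmissionKey(String userId, String idempotencyKey) {}

    private final Map<SubmissionKey, UUID> jobIdsByKey = new ConcurrentHashMap<>();

    /** Returns the existing jobId for a repeated key, otherwise registers a new one. */
    UUID submit(String userId, String idempotencyKey) {
        // computeIfAbsent is atomic per key, so concurrent duplicates get the same jobId.
        return jobIdsByKey.computeIfAbsent(
                new SubmissionKey(userId, idempotencyKey), key -> UUID.randomUUID());
    }
}
```

In a database-backed version, the unique constraint would play the role of `computeIfAbsent`: the second insert fails and the existing row's jobId is returned instead.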
- Atomic job leasing using SELECT ... FOR UPDATE SKIP LOCKED
- Ensures a job is processed by only one worker
- Enables safe multi-worker execution without coordination
- Horizontally scalable: stateless workers scale independently
- No central coordinator required; the system scales linearly as more workers are added
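One way the leasing step could look over plain JDBC is sketched below. The table name `jobs`, the `id` column, and the descending priority order are assumptions made for illustration; the remaining column names and the 30-second lease come from this README. Reclaiming jobs whose lease has expired is left out to keep the query short.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class LeaseQuerySketch {
    // Claims up to N runnable jobs; rows locked by other workers are skipped, not waited on.
    static final String LEASE_SQL = """
            UPDATE jobs
               SET status = 'PROCESSING',
                   lease_expires_at = now() + interval '30 seconds'
             WHERE id IN (
                   SELECT id
                     FROM jobs
                    WHERE status = 'QUEUED'
                      AND next_run_at <= now()
                    ORDER BY priority DESC, next_run_at
                    LIMIT ?
                      FOR UPDATE SKIP LOCKED)
            RETURNING id
            """;

    /** Leases a batch and returns the claimed job ids as strings (id type is an assumption). */
    static List<String> leaseBatch(Connection conn, int batchSize) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(LEASE_SQL)) {
            ps.setInt(1, batchSize);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> ids = new ArrayList<>();
                while (rs.next()) {
                    ids.add(rs.getString(1));
                }
                return ids;
            }
        }
    }
}
```

Because locked rows are skipped rather than waited on, two workers running this statement at the same time receive disjoint sets of jobs without talking to each other.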
- Jobs in PROCESSING are leased with expiry (lease_expires_at)
- If a worker crashes, lease expires and job becomes eligible again
- Guarantees no job remains permanently stuck
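The lease-expiry rule above can be written as a small predicate. This is a sketch under the assumption that the check is expressed in Java rather than in SQL; the class and method names are hypothetical, while the 30-second timeout and the PROCESSING status come from this README.

```java
import java.time.Duration;
import java.time.Instant;

final class LeaseSketch {
    static final Duration VISIBILITY_TIMEOUT = Duration.ofSeconds(30);

    /** Lease expiry stamped when a worker claims a job. */
    static Instant leaseExpiry(Instant claimedAt) {
        return claimedAt.plus(VISIBILITY_TIMEOUT);
    }

    /** A PROCESSING job becomes eligible again once its lease has expired. */
    static boolean isReclaimable(String status, Instant leaseExpiresAt, Instant now) {
        return "PROCESSING".equals(status)
                && leaseExpiresAt != null
                && now.isAfter(leaseExpiresAt);
    }
}
```

A crashed worker never gets to mark its job finished, so the stored expiry simply passes and the job shows up for the next poller.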
- Millisecond-level retries
- Circuit breaker protection
- Protects downstream systems
- Exponential backoff (5–300 seconds)
- ±20% jitter to prevent retry storms
- Dead-letter transition after max attempts
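A minimal sketch of the retry delay calculation follows. It assumes the delay doubles on each attempt and that jitter is applied after the 300-second cap; the README only states the 5–300 second range and the ±20% jitter, so the growth factor and the names here are illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

final class BackoffSketch {
    static final long BASE_SECONDS = 5;
    static final long MAX_SECONDS = 300;
    static final double JITTER = 0.20;

    /** Un-jittered delay for a 1-based attempt: 5, 10, 20, ... capped at 300 seconds. */
    static long baseDelaySeconds(int attempt) {
        // Clamp the exponent so the shift cannot overflow for very large attempt counts.
        int exponent = Math.min(Math.max(attempt - 1, 0), 30);
        return Math.min(MAX_SECONDS, BASE_SECONDS << exponent);
    }

    /** Base delay scaled by a random factor in [0.8, 1.2). */
    static double jitteredDelaySeconds(int attempt) {
        double factor = 1.0 + ThreadLocalRandom.current().nextDouble(-JITTER, JITTER);
        return baseDelaySeconds(attempt) * factor;
    }
}
```

With these assumptions the cap is reached on the seventh attempt (5 × 2⁶ = 320, clamped to 300), and the jitter spreads retries that failed at the same moment across a 40% window.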
- Redis-backed RedisRateLimiter (token bucket)
- Header-based per-user limiting, with fallback to IP address
- Enforced consistently across instances using Redis
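The token bucket algorithm behind this kind of rate limiting can be sketched in a few lines. This is a single-process illustration of the algorithm only, not the Redis-backed implementation that Spring Cloud Gateway provides; the class name and the explicit clock parameter are assumptions chosen to keep the example deterministic.

```java
final class TokenBucketSketch {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Refills based on elapsed time, then spends one token if available. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // caller would respond with HTTP 429
    }
}
```

Capacity bounds the burst size and the refill rate bounds the sustained request rate; keeping the bucket state in Redis is what lets several gateway instances share one limit.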
- Limits number of active jobs per user
- Redis TTL-based counters prevent stale locks on crashes
- Ensures no single user can saturate worker capacity
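The per-user concurrency cap with expiring counters can be illustrated with an in-memory sketch. It assumes a map in place of Redis and an explicit clock in place of key TTLs; the class name, method names, and the choice to refresh the TTL on each start are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ActiveJobLimiterSketch {
    private record Counter(int count, long expiresAtMillis) {}

    private final int maxActive;
    private final long ttlMillis;
    private final Map<String, Counter> counters = new HashMap<>();

    ActiveJobLimiterSketch(int maxActive, long ttlMillis) {
        this.maxActive = maxActive;
        this.ttlMillis = ttlMillis;
    }

    /** Reserves a slot for the user, treating an expired counter as zero. */
    synchronized boolean tryStart(String userId, long nowMillis) {
        Counter c = counters.get(userId);
        int current = (c == null || nowMillis >= c.expiresAtMillis()) ? 0 : c.count();
        if (current >= maxActive) {
            return false;
        }
        counters.put(userId, new Counter(current + 1, nowMillis + ttlMillis));
        return true;
    }

    /** Releases a slot when a job finishes; expired counters are simply dropped. */
    synchronized void finish(String userId, long nowMillis) {
        Counter c = counters.get(userId);
        if (c == null || nowMillis >= c.expiresAtMillis() || c.count() <= 1) {
            counters.remove(userId);
        } else {
            counters.put(userId, new Counter(c.count() - 1, c.expiresAtMillis()));
        }
    }
}
```

The expiry is what covers the crash case: if a worker dies before calling `finish`, the counter lapses on its own instead of blocking the user indefinitely.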
System correctness is validated against real infrastructure using Testcontainers.
Validated properties:
- Safe concurrent job leasing using SELECT FOR UPDATE SKIP LOCKED
- No duplicate job leasing across concurrent worker instances
- Automatic job retry after worker crash (visibility timeout recovery)
- Idempotent job submission under concurrent retries
All tests run against real PostgreSQL and Redis containers and execute automatically in CI.
On success:
- Stores result_payload
- Stores completed_at
On failure:
- Stores last_error_message
- Stores last_error_stacktrace
- Stores failed_at on dead-letter transition
Provides traceable execution history and failure diagnostics.
- Indexed polling on (status, next_run_at, priority)
- Indexed lease expiry lookup
- Short polling interval with bounded batch size
- Optimized for high-concurrency worker coordination
Reduces lock contention and prevents sequential scan degradation.
- Multi-stage Docker builds
- Separate images for Core and Gateway
- Docker Compose orchestration
- Healthchecks enabled
Images are automatically published to Docker Hub on merge to main.
On every push and pull request:
- Builds multi-module Maven project
- Runs integration tests (Testcontainers)
- Validates distributed behavior
- Builds Docker images
On merge to main branch:
- Logs into Docker Hub via GitHub Secrets
- Builds production images
- Tags images as latest
- Pushes images automatically to Docker Hub
- docker.io/<username>/forgequeue-core:latest
- docker.io/<username>/forgequeue-gateway:latest
mvn clean install
docker compose up --build
Core:
http://localhost:8080
Gateway (Public API + Swagger):
http://localhost:8081/swagger-ui.html
- AWS Architecture Blog: Exponential Backoff and Jitter
  https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- AWS Builders Library: Timeouts, Retries and Backoff with Jitter
  https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- Resilience4j Documentation (Retry & Circuit Breaker)
  https://resilience4j.readme.io/docs
- PostgreSQL Documentation: Row-level locking (FOR UPDATE SKIP LOCKED)
  https://www.postgresql.org/docs/current/sql-select.html
- RabbitMQ Documentation: Dead Letter Queues
  https://www.rabbitmq.com/dlx.html
- Spring Framework: Scheduling
  https://docs.spring.io/spring-framework/reference/integration/scheduling.html
- Spring Data JPA Documentation
  https://docs.spring.io/spring-data/jpa/docs/current/reference/html/
- Spring Cloud Gateway: Redis Rate Limiter
  https://docs.spring.io/spring-cloud-gateway/docs/current/reference/html/#redis-rate-limiter
- Testcontainers: integration testing library for managing containerized dependencies during tests
- Tier-Based Scheduling – Priority handling for paid users with fairness control
- Job Notifications – Email/webhook callbacks with retries
- Audit Logging – Track job state transitions
- Metrics & Monitoring – Throughput, failures, alerting (e.g., Prometheus)
- Worker Heartbeat – Extend leases for long-running jobs
Performed load testing using Apache JMeter on the POST /api/jobs endpoint to validate system stability under concurrent traffic.
- Simulated approximately 200 requests/sec in a local single-node environment, observing controlled degradation via rate limiting
- Rate limiting behavior (HTTP 429) verified under burst traffic scenarios
- Stable database connection pool utilization observed (HikariCP)
- No crashes, deadlocks, or job duplication detected during test runs
Testing conducted on a single-machine setup; distributed load generation and multi-node validation planned for future scaling evaluation.
ForgeQueue is designed as a backend systems showcase, demonstrating distributed coordination, concurrency control, resilience engineering, and production-grade architectural decisions.
Built for learning, system design mastery, and real-world backend engineering.