A durable job queue for Go, backed by Postgres. Single binary, no broker, crash-safe via `SELECT … FOR UPDATE SKIP LOCKED`.
Built as a focused MVP to demonstrate the SQL-design half of backend engineering — the sibling project to collab-board, which covers the concurrency / real-time half.
Local stack via Docker Compose (Postgres + migrate + worker + producer):

```sh
docker compose up --build
```

You'll see the producer enqueueing every 2s and the worker logging each job (with a 10% synthetic failure rate to exercise the retry path). Visit http://localhost:8080/healthz and http://localhost:8080/stats.
Tear down with `docker compose down -v` (the `-v` drops the Postgres volume).
Without Docker:

```sh
export DATABASE_URL=postgres://tinybus:tinybus@localhost:5432/tinybus?sslmode=disable
make migrate
make worker   # in one terminal
make enqueue  # in another, repeatedly
make stats
```

The single design decision worth reading:
A worker claims a job in one round-trip:
```sql
WITH next AS (
  SELECT id FROM jobs
  WHERE queue = $1
    AND locked_at IS NULL
    AND dead_at IS NULL
    AND run_at <= now()
  ORDER BY run_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
UPDATE jobs
SET locked_at = now(),
    locked_by = $2,
    attempts  = attempts + 1
WHERE id = (SELECT id FROM next)
RETURNING id, queue, payload, attempts, max_attempts, created_at, run_at;
```

Three things to notice:
- `FOR UPDATE SKIP LOCKED` is what makes this work under load. Two workers competing for the same row: one wins, the other skips it and takes the next eligible row. No global lock, no broker, no coordinator.
- CTE + `UPDATE` in a single statement closes the race window. A `SELECT` followed by a separate `UPDATE` lets two workers see the same row before either claims it. The CTE acquires the row lock and the outer `UPDATE` mutates the state, both inside the same statement and the same round-trip.
- `attempts` is incremented at claim time, not failure time. So a worker crash after claim still counts as an attempt — a poisoned job that crashes its handler still hits `max_attempts` and ends up dead, instead of running forever.
That's the whole game. Postgres' row-level lock is the broker.
Workers don't heartbeat. Instead, a sweeper goroutine periodically clears `locked_at` from any in-flight job whose lock is older than the configured lease:

```sql
UPDATE jobs
SET locked_at = NULL, locked_by = NULL, last_error = ...
WHERE locked_at IS NOT NULL
  AND dead_at IS NULL
  AND locked_at < now() - make_interval(secs => $1);
```

This makes tinybus at-least-once, not exactly-once. A handler that runs longer than the lease will be re-run by another worker. Set the lease above your p99 handler runtime, or split long handlers into smaller jobs.
The trade-off: heartbeating from inside the handler conflates "is the worker alive?" with "is the handler making progress?", and adds a connection per in-flight job. River (Go) and Oban (Elixir) both make the same lock-expiry choice for the same reason.
Failed jobs retry with exponential backoff plus equal jitter:

```
d    = 1s * 2^(attempts-1), capped at 5m
half = d / 2
out  = half + rand[0, half]
```

Equal jitter (vs full jitter's `rand[0, d]`) guarantees a delay of at least `d/2` before each retry, while still spreading retries across the worker pool — avoids the dogpile where all retries fire at the same instant after a transient outage.
```
producers ──▶ INSERT ──▶ ┌──────────┐ ◀── UPDATE ── workers
                         │   jobs   │   (CTE+SKIP LOCKED)
                         └──────────┘
                              ▲
                              │ UPDATE locked_at = NULL
                              │ WHERE locked_at < now() - lease
                              │
                      sweeper goroutine
                    (one per Process call)
```
See images/architecture.svg for the rendered version embedded in the
portfolio entry.
| Column | Type | Notes |
|---|---|---|
| `id` | `BIGSERIAL` | PK |
| `queue` | `TEXT` | logical queue name |
| `payload` | `BYTEA` | opaque, set by producer |
| `attempts` | `INT` | incremented at claim time |
| `max_attempts` | `INT` | default 5 |
| `last_error` | `TEXT` | nullable |
| `created_at` | `TIMESTAMPTZ` | |
| `run_at` | `TIMESTAMPTZ` | when the job becomes eligible |
| `locked_at` | `TIMESTAMPTZ` | NULL = ready, set on claim |
| `locked_by` | `TEXT` | worker id; useful for forensics |
| `dead_at` | `TIMESTAMPTZ` | NULL = alive, set on terminal failure |
State is implicit, not a status column:
| State | Predicate |
|---|---|
| ready | `locked_at IS NULL AND dead_at IS NULL AND run_at <= now()` |
| delayed | `locked_at IS NULL AND dead_at IS NULL AND run_at > now()` |
| in-flight | `locked_at IS NOT NULL AND dead_at IS NULL` |
| dead | `dead_at IS NOT NULL` |
| completed | row deleted |
Three partial indexes back the hot paths:
- `idx_jobs_ready` — `(queue, run_at) WHERE locked_at IS NULL AND dead_at IS NULL` — the claim query
- `idx_jobs_dead` — `(queue, dead_at DESC) WHERE dead_at IS NOT NULL` — dead-letter inspection
- `idx_jobs_in_flight` — `(locked_at) WHERE locked_at IS NOT NULL AND dead_at IS NULL` — the sweeper
Partial indexes contain only the rows that match their predicate, so a table with millions of historical rows keeps roughly the same claim latency as one with a hundred ready rows.
```sh
tinybus migrate <up|down>
tinybus enqueue --queue=X --payload=Y [--run-in=DUR] [--max-attempts=N]
tinybus worker --queue=X [--concurrency=N] [--http-addr=:8080] [--lease=5m] [--poll=1s] [--fail-pct=N]
tinybus producer --queue=X --interval=2s --payload=...
tinybus stats
```
Reads `DATABASE_URL` from the environment. If `PORT` is set and `--http-addr` is unset, the worker listens on `:$PORT` for `/healthz` and `/stats` (Railway-friendly).
```go
import "github.com/Isidorsson/tinybus/pkg/tinybus"

q, err := tinybus.New(ctx,
	tinybus.WithDSN(os.Getenv("DATABASE_URL")),
	tinybus.WithConcurrency(4),
	tinybus.WithLeaseDuration(2*time.Minute),
)
if err != nil {
	return err
}
defer q.Close()

// Producer
id, err := q.Enqueue(ctx, "email", []byte(`{"to":"a@b.com"}`),
	tinybus.RunIn(30*time.Second),
	tinybus.MaxAttempts(10),
)

// Worker
err = q.Process(ctx, "email", func(ctx context.Context, job tinybus.Job) error {
	return sendEmail(job.Payload)
})
```

```sh
go test -race ./...                    # unit tests (no Docker)
go test -race -tags integration ./...  # integration tests (need Docker)
```

The integration tests use testcontainers-go to spin up a real Postgres for each test. Coverage:
- Enqueue → claim → complete (happy path)
- Failed handler → retry with backoff → eventual dead state
- Concurrent workers don't double-process (50 jobs, 4 workers, exactly-once)
- Hung handler → lock-expiry → reclaim
- Migrations up + down
The repo includes railway.json. Workflow:

- Create a Railway project, attach the Postgres plugin (it injects `DATABASE_URL` automatically).
- Connect this repo. Railway uses the Dockerfile.
- The default startCommand is `tinybus worker --queue=default`. The worker listens on `$PORT` for `/healthz`, which Railway probes.
- For schema migrations on first deploy, set the `RAILWAY_RUN_UID` predeploy command to `/tinybus migrate up` — or run it manually once via `railway run -- /tinybus migrate up`.
For multi-service setups (separate worker and producer services),
duplicate the service in Railway and override startCommand with the
desired subcommand.
```sh
docker build -t tinybus:dev .
docker run --rm -e DATABASE_URL=$DATABASE_URL tinybus:dev migrate up
docker run --rm -e DATABASE_URL=$DATABASE_URL tinybus:dev worker --queue=default
```

```
tinybus/
├── cmd/tinybus/          # CLI entrypoint, dispatches to internal/cli
├── internal/cli/         # subcommand implementations
├── pkg/tinybus/          # public library
│   ├── tinybus.go        # Queue, Job, Stats, Handler, New, Close
│   ├── options.go        # functional options
│   ├── errors.go         # sentinel errors
│   ├── enqueue.go        # INSERT … RETURNING id
│   ├── claim.go          # CTE + FOR UPDATE SKIP LOCKED + UPDATE
│   ├── process.go        # worker loop, sweeper, retry/dead
│   ├── recover.go        # lease-expiry sweeper
│   ├── backoff.go        # equal-jitter exponential
│   ├── stats.go          # GROUP BY with FILTER
│   ├── migrate.go        # embed.FS + ledger-tracked runner
│   └── migrations/       # *.up.sql / *.down.sql
├── Dockerfile            # multi-stage → distroless:nonroot
├── docker-compose.yml    # postgres + migrate + worker + producer
├── railway.json          # Railway deploy config
└── Makefile
```
- `github.com/jackc/pgx/v5` — Postgres driver

That's it. Test-only:

- `github.com/testcontainers/testcontainers-go` — gated behind `//go:build integration`

Everything else is the standard library — `net/http`, `log/slog`, `embed`, `context`, `sync`, `math/rand/v2`.
- No admin web UI. The CLI + JSON `/stats` are enough.
- No multi-tenant isolation. One ledger, one namespace.
- No cron / scheduled jobs. `RunAt` lets you schedule one-off delays; recurring schedules are a separate problem.
- No job priorities. A `priority` column would be a one-line schema change but isn't in v1.
- No worker pool autoscaling. Run more workers, or set `--concurrency` higher.
- No tracing. Could pair with an OTel exporter later — `slog` is the only observability layer in v1.