
Jetmon 2 - Rewrite of core services into Go#60

Open
chrisbliss18 wants to merge 20 commits into master from refactor/jetmon2

Conversation

@chrisbliss18
Contributor

Work in progress rewrite of the core services into Go.

Why Go

The current architecture uses forked Node.js processes (8–16MB RSS each at startup, 53MB limit before recycling) as workers, plus a compiled C++ addon to escape Node's event loop for blocking network I/O. Go eliminates both constraints:

  • Goroutines start at ~4KB of stack and grow on demand, making 50,000 concurrent checks on a single host practical without the memory overhead of forked processes or libuv thread pools
  • net/http and crypto/tls are first-class stdlib packages — no native addon, no node-gyp, no compilation step during deployment
  • net/http/httptrace provides DNS, TCP, TLS, and TTFB timing hooks as separate measurements within each check, for free
  • Single static binary deployment with no runtime dependencies, no node_modules, and no addon rebuild on Node.js version upgrades
  • Built-in profiling via pprof, race detector via go test -race, and a mature testing ecosystem
  • Graceful goroutine lifecycle management replaces the fragile worker spawn/recycle/evaporate lifecycle

The Veriflier is rewritten in Go as well, replacing the Qt C++ dependency with a lightweight Go HTTP service. The protocol between Monitor and Verifliers moves from custom HTTPS to gRPC, providing type-safe contracts, built-in retries, and bidirectional streaming for future use.

Benefits of the Rewrite

Memory

The current architecture forks Node.js worker processes that start at 8–16MB RSS and are recycled once they reach 53MB. With a typical deployment of 8–16 workers, the process tree consumes 240–850MB of resident memory just for worker overhead, before any check data is counted. The master process, SSL server, and associated IPC buffers add further overhead.

Jetmon 2 runs as a single process. Go goroutines start at 4KB of stack and grow on demand. A pool of 1,000 concurrent goroutines costs roughly 4MB of stack. Total process RSS for an equivalent workload is estimated at 50–150MB — a 75–90% reduction in memory consumption per host.
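The stack-cost claim can be sanity-checked with the runtime's own counters. This is an illustrative sketch (the helper `stackGrowth` is not Jetmon 2 code); exact numbers vary by Go version, since the initial goroutine stack size is an implementation detail:

```go
package main

import (
	"fmt"
	"runtime"
)

// stackGrowth parks n goroutines and returns how many bytes of
// stack memory (runtime.MemStats.StackInuse) they added.
func stackGrowth(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	release := make(chan struct{})
	ready := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			ready <- struct{}{} // signal that this goroutine is live
			<-release           // then park, holding only its small stack
		}()
	}
	for i := 0; i < n; i++ {
		<-ready // wait until all n goroutines exist before measuring
	}
	runtime.ReadMemStats(&after)
	close(release)
	return after.StackInuse - before.StackInuse
}

func main() {
	grew := stackGrowth(1000)
	fmt.Printf("1000 parked goroutines added ~%d KB of stack\n", grew/1024)
}
```

On a typical build this lands in the low single-digit megabytes for 1,000 goroutines, versus tens of megabytes per forked Node.js worker.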

Concurrent Checks

Current concurrency is bounded by the number of worker processes. Each worker is a single-threaded Node.js process; even with the C++ addon offloading blocking I/O to a thread pool, practical concurrency per host is in the low hundreds. Scaling beyond that requires adding more hosts and manually partitioning bucket ranges.

Go's goroutine scheduler makes 10,000+ concurrent in-flight checks on a single host practical with no additional configuration. At a conservative network timeout of 10 seconds and average site response time of 200ms, a pool of 1,000 goroutines sustains approximately 5,000 check completions per second. This represents an estimated 10–50× increase in concurrent checks per host, meaning significantly fewer hosts are required to cover the same fleet.
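The bounded-pool shape described above can be sketched in a few lines. The function `runPool` and the simulated 2ms check are illustrative assumptions, not the actual Jetmon 2 checker:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runPool processes every job with at most n checks in flight,
// mirroring a bounded goroutine pool, and returns the completion count.
func runPool(n int, jobs []string, check func(string)) int64 {
	var completed int64
	ch := make(chan string)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			for site := range ch {
				check(site)
				atomic.AddInt64(&completed, 1)
			}
		}()
	}
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	wg.Wait()
	return completed
}

func main() {
	jobs := make([]string, 1000)
	for i := range jobs {
		jobs[i] = fmt.Sprintf("site-%d.example", i) // hypothetical hostnames
	}
	start := time.Now()
	// Simulated 2ms check: 100 workers drain 1,000 jobs in roughly 20ms of wall time.
	n := runPool(100, jobs, func(string) { time.Sleep(2 * time.Millisecond) })
	fmt.Printf("%d checks in %v\n", n, time.Since(start))
}
```

Scaling the same arithmetic to 1,000 goroutines at a 200ms average check time yields the ~5,000 completions per second estimated above.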

Throughput

The current architecture crosses a process boundary on every unit of work: the master dispatches via IPC, the worker receives, processes, and replies via IPC, and the master aggregates. Each crossing involves serialisation, a context switch, and V8 event loop scheduling on both ends.

Jetmon 2 replaces all IPC with Go channel sends, which are in-process and orders of magnitude cheaper. V8 GC pauses, which can delay check scheduling and RTT measurement in the current system, are eliminated. Estimated throughput improvement: 3–10× more sites checked per second per host under equivalent conditions.
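The master/worker round-trip collapses into two channel sends: jobs out, results back, all inside one process. A minimal sketch, assuming a hypothetical `result` type and placeholder check outcome:

```go
package main

import (
	"fmt"
	"sync"
)

type result struct {
	site string
	up   bool
}

// aggregate dispatches nJobs over a channel to nWorkers goroutines and
// fans results back in — the in-process replacement for IPC round-trips.
func aggregate(nWorkers, nJobs int) int {
	jobs := make(chan string)
	results := make(chan result)

	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for site := range jobs {
				results <- result{site: site, up: true} // placeholder outcome
			}
		}()
	}
	// Close results once every worker has drained its jobs.
	go func() { wg.Wait(); close(results) }()
	go func() {
		for i := 0; i < nJobs; i++ {
			jobs <- fmt.Sprintf("site-%d", i)
		}
		close(jobs)
	}()

	count := 0
	for range results {
		count++
	}
	return count
}

func main() {
	fmt.Println("aggregated", aggregate(4, 8), "results") // prints: aggregated 8 results
}
```

No serialisation, no context switch between processes, and no V8 scheduling on either end of the hand-off.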

Check Scheduling Accuracy

The current system uses setTimeout and setInterval for round scheduling. These are subject to V8 event loop delay — a busy event loop can delay a scheduled callback by tens to hundreds of milliseconds, introducing jitter into check timing and RTT measurements.

Go's time.Ticker fires with OS-level timer precision. RTT measurements from net/http/httptrace are taken inside the HTTP stack with no event loop between the measurement point and the timer, making them more accurate and consistent.

Deployment Speed

Current deployment requires npm install, a node-gyp rebuild of the native C++ addon (which must match the installed Node.js version), and a coordinated process restart. A failed addon compilation blocks deployment entirely.

Jetmon 2 deploys as a single static binary with no runtime dependencies. Deployment is: copy binary, systemctl restart jetmon2. Total deployment time drops from several minutes to under 30 seconds. There is no compilation step on the target host and no dependency on a matching Node.js version.

Mean Time to Recovery

A worker process crash in the current system requires the master to detect the exit, spawn a replacement, and wait for the new process to initialise — a sequence that takes several seconds and leaves that worker's in-flight checks unresolved.

In Jetmon 2, a panicking goroutine is recovered by a deferred handler, the result is counted as an error, and a replacement goroutine is immediately spawned from the pool — recovery is in the low milliseconds. For a full process crash, systemd restarts the binary; with Go's fast startup, the process is accepting work again in under 2 seconds.
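The recover-and-count-as-error path can be sketched as follows. The helper `safeCheck` and the hostnames are hypothetical, standing in for the real checker:

```go
package main

import (
	"fmt"
	"sync"
)

// safeCheck runs one check, converting a panic into an error result so a
// single bad check never takes down the process.
func safeCheck(site string, check func(string) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("check of %s panicked: %v", site, r)
		}
	}()
	return check(site)
}

func main() {
	sites := []string{"ok.example", "bad.example"} // hypothetical hosts
	errs := make(chan error, len(sites))
	var wg sync.WaitGroup
	for _, s := range sites {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			errs <- safeCheck(s, func(site string) error {
				if site == "bad.example" {
					panic("simulated crash") // stands in for an unexpected bug
				}
				return nil
			})
		}(s)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			fmt.Println("recovered:", err)
		}
	}
}
```

The panicking check is recorded as an error and the rest of the round continues; there is no process to respawn and no in-flight work lost.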

Operational Complexity

The current system requires managing Node.js version compatibility, native addon compilation, npm dependency trees, and the fragile worker spawn/recycle lifecycle. The node_modules directory and compiled .node addon must be present and consistent on every host.

Jetmon 2 eliminates all of this. There is one artifact to manage: the Go binary. It carries its own runtime, has no external dependencies, and produces a reproducible build from go build. The node-gyp, npm, and Node.js version management concerns disappear entirely.

Chris Jean and others added 13 commits April 19, 2026 16:37
…orld ReadMemStats

- refreshVeriflierClients now diffs addr|token fingerprints and skips
  rebuilding when the verifier list is unchanged, preserving TCP
  connection pools between rounds
- Remove runtime.ReadMemStats stop-the-world call — it was logging but
  taking no action; memory metrics are already covered by EmitMemStats
- Remove unused statusDown constant; the DB transition path goes directly
  from statusRunning to statusConfirmedDown
- Add comment to per-round ClaimBuckets call explaining the rebalancing intent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes cleanup ordering deadlock in pool tests (LIFO cleanup, close channel
before Drain). Adds tests for wpcom circuit breaker, veriflier transport,
checker.Check paths, config hot-reload, dashboard SSE, audit helpers,
orchestrator memory pressure, retry queue, and pure utility functions.
EVENTS.md: event-sourced architecture — lifecycle, idempotency,
resolution reasons, causal links, and site-row projection.

TAXONOMY.md: five-layer test taxonomy (Reachability → Transport →
Infrastructure → Application → Content + Reverse checks), site/endpoint/
check data model, multi-state vocabulary, event schema, scope matrix,
signal processing, and versioned implementation roadmap.

ROADMAP.md: deferred public REST API — query and manage endpoints,
auth, pagination, and uptime-bench integration context.

AGENTS.md: architectural decision log covering event sourcing, severity
vs. state separation, Seems Down lifecycle, in-place event updates,
idempotent event identity, resolution reasons, causal vs. rollup links,
and Unknown-is-not-downtime invariant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
