A low-overhead call-path profiler for production C++: it tells you which paths through your code actually run hot and where the time goes per segment — at a cost you can leave switched on. Different from sampling (which finds hot functions, not hot paths) and from heavyweight instrumentation like Callgrind (which is orders of magnitude slower).
Built for the HFT / low-latency case: you have an event loop and a handful of hot paths through a parser / book builder / strategy, and you want a global view of the real path distribution and per-stage latency from a live process.
- A per-thread running hash encodes the active call path. At each gate entry the function's salt is mixed in; on exit it is restored. Two ops on the hot path.
- A per-thread shadow stack records each frame's entry TSC and whether it ever called a child.
- On a leaf return (a frame that called nobody) the running hash is the id of a complete root→leaf path. We fold it into a fixed-size table, bump a counter, and add each segment's duration to a per-segment histogram. No allocation on the hot path, no offline decode.
- The aggregate lives in a POD slab that can be placed in shared memory, so an external `pathprof-top` renders the live profile without touching the traced process.
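
The four bullets above can be sketched in a few dozen lines. This is a minimal illustration, not the project's runtime: the `sketch` namespace, the `3 * hash + salt` mix (the function used in the probabilistic calling context paper), the table size, and the `leaf_ticks` field (a stand-in for the per-segment histograms) are all assumptions made for the example. It uses `thread_local` for portability, where the real runtime uses `__thread`.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace sketch {

// One slot of the fixed-size path table. count == 0 marks an empty slot.
struct PathSlot {
    std::uint64_t id;          // running hash at the leaf return
    std::uint64_t count;       // how often this root-to-leaf path completed
    std::uint64_t leaf_ticks;  // summed leaf duration (stand-in for histograms)
};

// Plain data only, so the whole aggregate could be mapped into shared memory.
struct Slab {
    static constexpr std::size_t kSlots = 64;  // power of two for cheap masking
    PathSlot slots[kSlots];
};
static_assert(std::is_trivially_copyable<Slab>::value, "slab must stay POD");

// Shadow-stack frame: what to restore on exit, plus timing and leaf tracking.
struct Frame {
    std::uint64_t saved_hash;
    std::uint64_t entry_ticks;
    bool called_child;
};

constexpr std::size_t kMaxDepth = 32;

struct Context {
    std::uint64_t hash = 0;
    std::size_t depth = 0;
    Frame stack[kMaxDepth];
};

static Slab g_slab{};
static thread_local Context t_ctx;

// Fold a completed path into the table: linear probing, no allocation.
inline bool record_path(std::uint64_t id, std::uint64_t ticks) {
    for (std::size_t i = 0; i < Slab::kSlots; ++i) {
        PathSlot& s = g_slab.slots[(id + i) & (Slab::kSlots - 1)];
        if (s.count == 0 || s.id == id) {
            s.id = id;
            s.count += 1;
            s.leaf_ticks += ticks;
            return true;
        }
    }
    return false;  // table full: this sketch simply drops the sample
}

// Look a path id up again; returns 0 if it was never recorded.
inline std::uint64_t count_of(std::uint64_t id) {
    for (std::size_t i = 0; i < Slab::kSlots; ++i) {
        const PathSlot& s = g_slab.slots[(id + i) & (Slab::kSlots - 1)];
        if (s.count == 0) return 0;
        if (s.id == id) return s.count;
    }
    return 0;
}

// Start of an event: reset the running hash and the shadow stack.
inline void on_root() { t_ctx.hash = 0; t_ctx.depth = 0; }

inline void enter(std::uint64_t salt, std::uint64_t now) {
    Context& c = t_ctx;
    if (c.depth > 0 && c.depth <= kMaxDepth) c.stack[c.depth - 1].called_child = true;
    if (c.depth >= kMaxDepth) { ++c.depth; return; }  // too deep: count it, do not track it
    c.stack[c.depth++] = Frame{c.hash, now, false};
    c.hash = c.hash * 3 + salt;                        // mix this gate's salt in
}

inline void leave(std::uint64_t now) {
    Context& c = t_ctx;
    if (c.depth == 0) return;
    if (c.depth > kMaxDepth) { --c.depth; return; }    // untracked deep frame
    const Frame f = c.stack[--c.depth];
    if (!f.called_child) record_path(c.hash, now - f.entry_ticks);  // leaf return
    c.hash = f.saved_hash;                             // restore the caller's hash
}

}  // namespace sketch
```

With salts 11 and 22, entering both gates and leaving both records exactly one path with id `(0 * 3 + 11) * 3 + 22`, and the outer frame records nothing because it called a child.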
This is the probabilistic calling context idea (Bond & McKinley, OOPSLA 2007)
extended with per-segment timing and an online top-N path table. See
docs/THEORY.md for the lineage (Ball–Larus, CCT, precise/adaptive
encoding, XRay, -finstrument-functions/uftrace, Intel PT).
The runtime (enter/leave) is independent of how gates get inserted. Four
front-ends, none requiring a forked compiler:
| Front-end | Toolchain | Gating | Inlined | Runtime toggle | Notes |
|---|---|---|---|---|---|
| RAII `PP_GATE("x")` | any C++ | you place the macro | yes | no | most portable; pick gates deliberately |
| cyg `-finstrument-functions` | GCC/Clang | per-TU + exclude lists | no | no | auto; function address is a free salt; broad blast radius |
| XRay `[[clang::xray_always_instrument]]` | Clang | attribute | no (sled) | yes | ship dark, `__xray_patch()` live; sees TAIL events |
| plugin `[[clang::annotate("pathprof")]]` | Clang + `.so` | attribute | yes | no | inlined like RAII, no source edits; version-coupled to LLVM |
All four are verified to discover bit-identical paths and counts on the same workload (test/verify.sh).
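
To make the split between runtime and front-end concrete, here is a rough sketch of how a cyg-style front-end could feed a running hash. `__cyg_profile_func_enter` and `__cyg_profile_func_exit` are the hooks GCC and Clang call under `-finstrument-functions`. Everything in the `sketch_cyg` namespace (the state layout, the depth limit, the `3 * hash + salt` mix) is a hypothetical stand-in for the real runtime.

```cpp
#include <cstdint>

// The hooks themselves must not be instrumented, or they would recurse.
#define SKETCH_NO_INSTRUMENT __attribute__((no_instrument_function))

namespace sketch_cyg {

constexpr int kMaxDepth = 64;

struct State {
    std::uint64_t hash;              // running call-path hash
    std::uint64_t saved[kMaxDepth];  // hash to restore at each depth
    int depth;
};

// GNU __thread (zero-initialized POD), following the TLS note in this README.
static __thread State t_state;

}  // namespace sketch_cyg

// Called by the compiler at every instrumented function entry.
extern "C" SKETCH_NO_INSTRUMENT void __cyg_profile_func_enter(void* fn, void* /*call_site*/) {
    sketch_cyg::State& s = sketch_cyg::t_state;
    if (s.depth < sketch_cyg::kMaxDepth) {
        s.saved[s.depth] = s.hash;
        // The function address is the "free salt": no registry, no source edits.
        s.hash = s.hash * 3 + reinterpret_cast<std::uintptr_t>(fn);
    }
    ++s.depth;
}

// Called by the compiler at every instrumented function exit.
extern "C" SKETCH_NO_INSTRUMENT void __cyg_profile_func_exit(void* /*fn*/, void* /*call_site*/) {
    sketch_cyg::State& s = sketch_cyg::t_state;
    if (s.depth == 0) return;
    --s.depth;
    if (s.depth < sketch_cyg::kMaxDepth) s.hash = s.saved[s.depth];
}
```

An RAII or XRay front-end would drive the same two operations from a constructor/destructor pair or from a patched sled handler; only the source of the salt changes.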
~3.1 gates per event. Run it yourself with bench/bench.sh.
| mode | cyc/pkt | note |
|---|---|---|
| baseline | ~39 | uninstrumented |
| XRay, unpatched | ~39 | sleds are nops when off (zero overhead) |
| RAII, count-only | ~119 | hash + shadow stack, no rdtsc |
| RAII, timed | ~180 | + rdtsc per gate |
| cyg, timed | ~176 | stock -finstrument-functions |
| plugin, timed | ~187 | Clang pass plugin (inlined, attribute-gated) |
| XRay, timed | ~316 | patchable sleds + indirect handler |
The rdtsc is the tax, not the hash: count-only is ~26 cyc/gate; the rdtsc pair roughly doubles it. Every per-segment histogram floors at ~20–24 cycles; that floor is back-to-back rdtsc latency, so for sub-100-cycle segments rdtsc measures itself. (Count-only plus sampled timing is the route to a gate that is both timed and around a dozen cycles.)
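
The per-gate figures follow from the table by subtracting the baseline and dividing by the ~3.1 gates per event. A small sketch of that arithmetic, using only the numbers quoted above (the `sketch_cost` names are made up for the example, and it assumes one event per packet):

```cpp
namespace sketch_cost {

constexpr double kBaselineCycles = 39.0;  // uninstrumented cyc/pkt from the table
constexpr double kGatesPerEvent = 3.1;    // average gates crossed per event

// Overhead attributable to a single gate for a given mode.
constexpr double per_gate(double cycles_per_packet) {
    return (cycles_per_packet - kBaselineCycles) / kGatesPerEvent;
}

// RAII, count-only: (119 - 39) / 3.1 is about 25.8 cycles per gate.
static_assert(per_gate(119.0) > 25.0 && per_gate(119.0) < 27.0, "count-only");

// RAII, timed: (180 - 39) / 3.1 is about 45.5 cycles per gate.
static_assert(per_gate(180.0) > 45.0 && per_gate(180.0) < 46.0, "timed");

}  // namespace sketch_cost
```

The gap between the two, roughly 20 cycles per gate, lines up with the ~20–24 cycle histogram floor attributed to back-to-back rdtsc.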
Tag the hot functions with PATHPROF, reset the hash once per event, print at
the end. That is the whole API.
```cpp
#include "pathprof/annotate.h"

PATHPROF void parse(...) { /* ... */ }   // tag each function on the path

for (;;) {
    pathprof::on_root();                 // reset the running hash per event
    parse(next_event());
}
pathprof::summary(20);                   // top paths + per-segment histograms
```

```sh
cd examples/itch-sim && ./build.sh && ./sim 2000000   # the worked example (clang)
bash bench/bench.sh                                   # 4-way overhead table
bash test/verify.sh                                   # cross-mechanism correctness gate
```

The `PATHPROF` attribute uses the Clang pass plugin. If you are not on Clang, or
want a gate that inlines with zero toolchain support, the RAII front-end is one
line: `#include "pathprof/raii.h"` then `PP_GATE("parse");` at the top of each
function. Same runtime, same output. The other front-ends and the live
shared-memory readout (`tools/pathprof-top`) are documented below.
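
For readers wondering what a one-line RAII gate amounts to, here is a hedged sketch of the pattern. It is not the contents of `pathprof/raii.h`: `SKETCH_GATE`, `Gate`, and the counting `enter`/`leave` stubs are hypothetical names, and the stub runtime only counts calls so the example stays self-contained.

```cpp
#include <cstdint>

namespace sketch_raii {

// Stub runtime: just enough state to observe that the gate fired.
struct Trace {
    int enters;
    int leaves;
    std::uint64_t last_salt;
};
static Trace g_trace{};

inline void enter(std::uint64_t salt) { ++g_trace.enters; g_trace.last_salt = salt; }
inline void leave() { ++g_trace.leaves; }

// Constructor runs at scope entry, destructor at every exit (return or throw).
class Gate {
public:
    // The address of the name literal serves as the salt, so no registry is needed.
    explicit Gate(const char* name) { enter(reinterpret_cast<std::uintptr_t>(name)); }
    ~Gate() { leave(); }
    Gate(const Gate&) = delete;
    Gate& operator=(const Gate&) = delete;
};

}  // namespace sketch_raii

// Give each gate object a unique name so several can share a scope.
#define SKETCH_CAT2(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT2(a, b)
#define SKETCH_GATE(name) ::sketch_raii::Gate SKETCH_CAT(sketch_gate_, __LINE__){name}

// Usage: one line at the top of the function being traced.
inline int parse_demo(int x) {
    SKETCH_GATE("parse");
    return x + 1;
}
```

Because the destructor has to run after the body, the compiler cannot turn the last call in a gated function into a tail call.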
- rdtsc tax: see above. `-UPATHPROF_TIMING` gives the count-only build.
- cyg blast radius: `-finstrument-functions` instruments every function in the TU; scope it with exclude lists / `no_instrument_function`, and `reset()` after warmup. The measured region must call no other instrumented function.
- TCO: at `-O2` each stage tail-calls the next. The RAII/cyg/plugin gates run code on exit, which defeats the tail call and preserves nesting for free. XRay patches the optimized binary, so recover nesting with `-fno-optimize-sibling-calls` (or handle TAIL events).
- Path-id stability: keys are addresses (name-literal / function), so the raw 64-bit path hash is a within-run dedup id; it shifts with ASLR run to run. Paths and counts are fully deterministic. Content-based keys would give a portable id at the cost of a name registry.
- Multi-thread: context is per-thread (`__thread`, initial-exec model). v1 shares one slab; per-thread slabs merged at read is the scaling path.
- TLS: uses GNU `__thread` (not `thread_local`) to avoid the `_ZTW` wrapper that the cyg front-end would otherwise instrument into infinite recursion.
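
As an illustration of the content-based keys mentioned under path-id stability, a gate could hash the name's characters instead of using the literal's address. The sketch below uses 64-bit FNV-1a, which is a choice made for this example and not something the project ships; `sketch_key::fnv1a` is a hypothetical helper.

```cpp
#include <cstdint>

namespace sketch_key {

// 64-bit FNV-1a over a NUL-terminated name. The result depends only on the
// characters, so it is the same across runs, builds, and ASLR layouts.
constexpr std::uint64_t fnv1a(const char* s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV offset basis
    while (*s != '\0') {
        h ^= static_cast<unsigned char>(*s++);    // fold in one byte
        h *= 0x100000001b3ULL;                    // FNV prime
    }
    return h;
}

// Evaluated at compile time for literals, so the hot path pays nothing extra.
static_assert(fnv1a("") == 0xcbf29ce484222325ULL, "empty string is the offset basis");
static_assert(fnv1a("parse") != fnv1a("book"), "distinct names, distinct salts");

}  // namespace sketch_key
```

The salt is now portable, but mapping a path id back to readable names requires keeping the names somewhere, which is the registry cost noted above.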
Working: all four front-ends, shared-memory readout, the ITCH example, bench and the cross-mechanism correctness gate. Numbers here are from a non-isolated laptop; treat them as qualitative. Not yet: HdrHistogram-grade buckets, count+sampled-time mode, per-thread slab merge, a packaged CMake config export.
MIT. See LICENSE.