Merged
29 commits
02e65e7
docs: add lazy DFA design spec (R1+R2)
jbachorik May 28, 2026
6eed1d2
feat: add NfaStep functional interface
jbachorik May 29, 2026
761ab88
feat: add StateSetKey for NFA state-set interning
jbachorik May 29, 2026
978e7ea
feat: add LazyDFACache with cap/freeze/fallback semantics
jbachorik May 29, 2026
852f0f1
feat: add LAZY_DFA strategy and routing to PatternAnalyzer
jbachorik May 29, 2026
c8d93e0
feat: add LazyDFABytecodeGenerator and wire LAZY_DFA into RuntimeComp…
jbachorik May 29, 2026
7e5dd2d
feat: add LazyDFABenchmark with hit/miss/frozen variants
jbachorik May 29, 2026
aba760e
fix: add missing backref routing test and clarify VarHandle comment
jbachorik May 29, 2026
aa74643
perf: eliminate CHECKCAST on hot path by using int[][] for asciiTables
jbachorik May 29, 2026
4c223e8
feat: add JDK baseline methods to LazyDFABenchmark
jbachorik May 29, 2026
d1056c6
fix: address Copilot review — \b anchor gap, cache DEAD transitions, …
jbachorik May 29, 2026
116ac57
perf: inline LazyDFACache hot loop into generated matches(); fix froz…
jbachorik May 29, 2026
990627f
fix: use import for Arrays, fix missIndex overflow in LazyDFABenchmark
jbachorik May 29, 2026
2c697ec
fix: update test/benchmark patterns for jb/logs-backend DFA_TABLE opt…
jbachorik May 29, 2026
e9ece84
bench: add hardMissPath/jdkHardMissBaseline for fair miss-path compar…
jbachorik May 29, 2026
edd1831
refactor: use BytecodeUtil.pushInt; document sentinel constant cross-…
jbachorik May 29, 2026
1cbf3a2
fix: add LAZY_DFA to annotation processor; upgrade VarHandle fence to…
jbachorik May 29, 2026
491f606
fix: add missing bounded methods; fix matchBounded offsets; add match…
jbachorik May 29, 2026
3e59ed5
fix: correct MatchResultImpl array layout in match/matchBounded metho…
jbachorik May 29, 2026
4535f75
fix: use delegating matches() in AOT path to avoid package-private ac…
jbachorik May 29, 2026
fbd4907
test: add positive accept case; fix cache-sharing test to use distinc…
jbachorik May 29, 2026
db6a897
fix: proper release/acquire semantics for ASCII table publication via…
jbachorik May 29, 2026
3e51418
fix: INT_ARRAY_VH release/acquire, frozen benchmark corpus, test cove…
jbachorik May 29, 2026
e7cc8dd
docs: attribute lazy-DFA technique to dangermike/glob_perf
jbachorik May 29, 2026
9ea1457
chore: remove transient task plans from committed tree
jbachorik May 29, 2026
ecde02a
docs: correct missPath benchmark description — measures cached-DEAD n…
jbachorik May 29, 2026
838cc71
fix: avoid O(n²) substring copies in matchBounded; add capturing-grou…
jbachorik May 29, 2026
950457a
fix: address sphinx review — bounded coverage, assertion fix, sentine…
jbachorik May 29, 2026
e04ed3f
perf: skip INT_ARRAY_VH.getAcquire on x86/TSO; use plain IALOAD in ho…
jbachorik May 29, 2026
177 changes: 177 additions & 0 deletions doc/plans/glob-perf-nfa-improvements.md
@@ -0,0 +1,177 @@
# glob_perf — NFA improvement opportunities

Investigation roadmap drawn from
[DataDog/experimental/users/dangermike/glob_perf](https://github.com/DataDog/experimental/tree/main/users/dangermike/glob_perf).
glob_perf is a cross-language benchmark of multi-pattern glob matching
(Go/Java/Rust, 64 – 65 536 patterns). reggie is single-pattern compile-to-bytecode,
so trie/multi-pattern machinery does **not** port; the NFA-execution
micro-optimizations do.

This is a **research roadmap with pointers**, not a build plan. Designs
are sketched; implementation should be brainstormed/planned per item
before code is written.

## Headline result from glob_perf

`lazydfa` (NFA-trie + lazily materialized DFA cache) is the consistent
winner across languages — beats both pure NFA (`fasttrie`) and Intel
Hyperscan once warm. Java 256-pattern `static/randomkey`: lazydfa
214 ns/op vs fasttrie 300, hyperscan 5101. Hyperscan loses on build
time (~13 s at 16 k patterns) — informative non-goal for reggie's
compile-once model.

## glob_perf source pointers

(GitHub paths under `users/dangermike/glob_perf/`)

| File | What to read it for |
|---|---|
| `README.md` | Methodology, hit/miss split, headline numbers. |
| `README_rules.md` | Glob grammar — irrelevant for porting. |
| `java/src/main/java/.../LazyDFA.java` | Reference impl of R1+R2+R5 below. |
| `java/src/main/java/.../FastTrie.java` | SoA `int[]` keys + binary search reference (R4). |
| `docs/` | Cross-language perf write-up. |

## Recommendations

### R1 — Lazy DFA layer over NFA

**Problem.** `OPTIMIZED_NFA` (chosen by `PatternAnalyzer.java:726, 741`)
recomputes `closure(stateSet, c)` for every input character. Eager
subset construction (`automaton/SubsetConstructor.java`) is too
expensive for many of these patterns — that's why we route to NFA in
the first place.

**Idea.** Intern NFA state-sets to DFA states **on first encounter**,
cap the cache (e.g. 4 k states), fall back to plain NFA stepping when
the cache fills.

**Pointers.**
- New generator next to
`reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/codegen/NFABytecodeGenerator.java`
- Strategy switch in
`reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/analysis/PatternAnalyzer.java`
— gate by state count and lookaround presence.
- Reference: `glob_perf/.../LazyDFA.java` lines around `217-224` for
the state-set interning key (sorted IDs packed into 2-chars-per-int
string + `putIfAbsent`).

**Caveats.**
- Cache must be bounded; glob_perf documents a 300× hit-path regression
at 65 k patterns when the cache stops fitting.
- reggie matchers may be single-threaded — confirm in
`reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java`
before adopting glob_perf's `AtomicReferenceArray`. Plain `int[]` /
`Object[]` recovers a documented 2-3× Java-vs-Go gap.
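
A minimal sketch of the interning step, assuming a single-threaded matcher. The class name `StateSetInterner`, the `FROZEN` sentinel, and the cap handling are hypothetical and are not taken from reggie or glob_perf; only the two-chars-per-int key packing follows the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: interns NFA state-sets as DFA state ids, with a hard cap. */
final class StateSetInterner {
    /** Hypothetical sentinel: the cache is full, so the caller falls back to NFA stepping. */
    static final int FROZEN = -1;

    private final Map<String, Integer> ids = new HashMap<>();
    private final int cap;

    StateSetInterner(int cap) {
        this.cap = cap;
    }

    /** Packs the sorted NFA state ids into a String key, two chars per int. */
    static String key(int[] states) {
        int[] sorted = states.clone();
        Arrays.sort(sorted);
        char[] packed = new char[sorted.length * 2];
        for (int i = 0; i < sorted.length; i++) {
            packed[2 * i] = (char) (sorted[i] >>> 16); // high 16 bits
            packed[2 * i + 1] = (char) sorted[i];      // low 16 bits
        }
        return new String(packed);
    }

    /** Returns the DFA id for this state-set, assigning one on first encounter, or FROZEN at the cap. */
    int intern(int[] states) {
        String k = key(states);
        Integer existing = ids.get(k);
        if (existing != null) {
            return existing;
        }
        if (ids.size() >= cap) {
            return FROZEN;
        }
        int id = ids.size();
        ids.put(k, id);
        return id;
    }

    int size() {
        return ids.size();
    }
}
```

Already-interned sets keep resolving after the cap is reached; only new sets get `FROZEN`, which keeps the cache bounded without invalidating cached states.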

### R2 — Per-state ASCII transition table

**Idea.** Each cached DFA state gets an `int[128]` of target state IDs,
indexed directly by `c` behind a `c < 128` guard (no masking needed once
the guard has passed). Two sentinels:
- `-1` (or `null` in `Object[]` form) = uncached → compute and fill.
- A `DEAD` constant = computed-dead → reject without retry.

**Pointers.**
- Emit in the new lazy-DFA generator (paired with R1).
- Retrofittable to `DFASwitchBytecodeGenerator.java` (states 50–300)
if profiling shows the switch dispatch is slower than the table.
- Reference: `LazyDFA.java` lines around `31, 321-326`.
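
The two-sentinel lookup can be sketched as follows. This is an illustration under assumptions: the class `AsciiTableSketch`, the numeric value of `DEAD`, and the `slowStep` callback standing in for the NFA step are all hypothetical, and the sketch is single-threaded.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

/** Hypothetical sketch of a per-state ASCII transition table with two sentinels. */
final class AsciiTableSketch {
    static final int UNCACHED = -1; // not computed yet: compute and fill
    static final int DEAD = -2;     // computed and dead: reject without retry (value is an assumption)

    private final int[][] tables;            // one lazily allocated int[128] per DFA state
    private final IntBinaryOperator slowStep; // (state, c) -> next state id or DEAD

    AsciiTableSketch(int maxStates, IntBinaryOperator slowStep) {
        this.tables = new int[maxStates][];
        this.slowStep = slowStep;
    }

    int step(int state, char c) {
        if (c >= 128) {
            // Non-ASCII input is never cached in the table.
            return slowStep.applyAsInt(state, c);
        }
        int[] table = tables[state];
        if (table == null) {
            table = new int[128];
            Arrays.fill(table, UNCACHED);
            tables[state] = table;
        }
        int next = table[c];
        if (next == UNCACHED) {
            next = slowStep.applyAsInt(state, c);
            table[c] = next; // caches DEAD results too
        }
        return next;
    }
}
```

Caching `DEAD` as a distinct value means a rejected character costs one array load on every later encounter instead of a repeated slow-path computation.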

### R3 — Inline `long[]` bitset ops in generated bytecode

**Idea.** Verify generated NFA bytecode uses inlined `LOR`/`LAND`/
`LSHL` for state-set updates on small-state patterns rather than
`INVOKEVIRTUAL StateSet.add`.

**Pointers.**
- `NFABytecodeGenerator.java` already selects between primitive `long`
/ dual-long / `BitSet` / `SparseSet` paths — check the small-state
branch emits direct bitwise ops, not method calls.
- Reference: `LazyDFA.java` lines around `363-365` (`bset/bclr/bhas`).
- Runtime helpers:
`reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/SparseSet.java`,
`StateSet.java`.
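
For reference, these are the source-level operations that the inlined bytecode would correspond to. The helper names mirror `bset/bclr/bhas` from the reference above, but the bodies are an assumption about their shape, valid only for patterns with at most 64 NFA states.

```java
/** Hypothetical sketch: state-set operations on a single long (states 0..63). */
final class LongBits {
    /** Adds a state: compiles to LSHL + LOR. */
    static long bset(long set, int state) {
        return set | (1L << state);
    }

    /** Removes a state: shift, bitwise complement, LAND. */
    static long bclr(long set, int state) {
        return set & ~(1L << state);
    }

    /** Tests membership: LSHL + LAND and a compare against zero. */
    static boolean bhas(long set, int state) {
        return (set & (1L << state)) != 0L;
    }
}
```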

### R4 — Struct-of-arrays automaton representation

**Idea.** Store NFA/DFA transitions as parallel `int[] keys` +
`int[] targets`, binary-searched. ~16 keys per cache line vs ~4 for
interleaved `(char,int)` records. glob_perf's `FastTrie` Java impl
wins precisely on this.

**Pointers.**
- `reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/NFA.java`
- `reggie-codegen/src/main/java/com/datadoghq/reggie/codegen/automaton/DFA.java`
- Reference: `FastTrie.java` lines around `54-55, 183-194`.

**Caveat.** This is a compile-time representation change. Audit all
readers (subset constructor, bytecode generators) for assumptions
about edge layout before refactoring.
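
A minimal sketch of the layout, assuming each state's edges occupy a contiguous, key-sorted range. The class `SoaTransitions`, the `offsets` index array, and the `NO_EDGE` sentinel are hypothetical and do not reflect the current `NFA.java` / `DFA.java` structure.

```java
import java.util.Arrays;

/** Hypothetical sketch: transitions as parallel int[] keys / int[] targets, binary-searched. */
final class SoaTransitions {
    static final int NO_EDGE = -1;

    private final int[] offsets; // edges of state s live in [offsets[s], offsets[s + 1])
    private final int[] keys;    // input characters, sorted within each state's range
    private final int[] targets; // target state ids, parallel to keys

    SoaTransitions(int[] offsets, int[] keys, int[] targets) {
        this.offsets = offsets;
        this.keys = keys;
        this.targets = targets;
    }

    /** Returns the target state for (state, c), or NO_EDGE if there is no such edge. */
    int next(int state, char c) {
        int i = Arrays.binarySearch(keys, offsets[state], offsets[state + 1], c);
        return i >= 0 ? targets[i] : NO_EDGE;
    }
}
```

The search touches only the `keys` array until it finds a hit, which is where the cache-line density argument applies.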

### R5 — Per-matcher thread-local scratch

**Idea.** Reuse `nextBuf` / bitset buffers across `matches()` calls so
steady-state allocs/op drop to 0.

**Pointers.**
- Base class:
`reggie-runtime/src/main/java/com/datadoghq/reggie/runtime/ReggieMatcher.java`
- Hook: generated `<init>` allocates a `Scratch` once; `matches()`
resets, never reallocates.
- Reference: `LazyDFA.java` lines around `43-51, 89` (`ThreadLocal<NfaScratch>`).
- **Confirm thread-model first** — if matchers are not shared across
threads, plain instance fields beat `ThreadLocal`.
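
The two scratch-ownership options can be sketched side by side. All names here (`Scratch`, `ConfinedScratch`, `SharedScratch`, `acquire`) are hypothetical; which variant applies depends on the outcome of the thread-model check.

```java
import java.util.Arrays;

/** Hypothetical scratch buffers, allocated once and reset per call. */
final class Scratch {
    final long[] current;
    final long[] next;

    Scratch(int words) {
        this.current = new long[words];
        this.next = new long[words];
    }

    void reset() {
        Arrays.fill(current, 0L);
        Arrays.fill(next, 0L);
    }
}

/** Variant A: matcher confined to one thread, so a plain instance field is enough. */
final class ConfinedScratch {
    private final Scratch scratch;

    ConfinedScratch(int words) {
        this.scratch = new Scratch(words);
    }

    Scratch acquire() {
        scratch.reset(); // reset, never reallocate
        return scratch;
    }
}

/** Variant B: matcher shared across threads, so each thread gets its own Scratch. */
final class SharedScratch {
    private final ThreadLocal<Scratch> scratch;

    SharedScratch(int words) {
        this.scratch = ThreadLocal.withInitial(() -> new Scratch(words));
    }

    Scratch acquire() {
        Scratch s = scratch.get();
        s.reset();
        return s;
    }
}
```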

### R6 — Literal fast-path at matcher entry

**Idea.** For `prefix(.*)` shaped patterns, run a tight `charAt` loop
over the literal prefix before any NFA setup.

**Pointers.**
- Existing related generators:
`FixedSequenceBytecodeGenerator.java`,
`LinearPatternBytecodeGenerator.java`.
- Analyzer hook: `PatternAnalyzer.java` — verify the
`literal-prefix + free-tail` shape is detected and routed before
falling into the NFA strategies.
- Reference: `LazyDFA.java` `matchLiteral` around lines `347-358`.
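
A sketch of the prefix check itself, written as a standalone helper. The name `hasPrefix` is hypothetical, and the sketch only covers the early-reject half: whether the free tail still needs checking (for example `.` not matching line terminators without DOTALL) is left to the generator.

```java
/** Hypothetical sketch: tight charAt loop over a literal prefix, run before any NFA setup. */
final class LiteralPrefix {
    static boolean hasPrefix(CharSequence input, String literal) {
        int n = literal.length();
        if (input.length() < n) {
            return false; // too short to contain the literal
        }
        for (int i = 0; i < n; i++) {
            if (input.charAt(i) != literal.charAt(i)) {
                return false; // early reject, no state-set allocated
            }
        }
        return true;
    }
}
```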

### R7 — Benchmark methodology import

**Idea.** Adopt glob_perf's **hit-vs-miss corpus split** and
**bounded-cache worst-case reporting** in `reggie-benchmark`. Today
benchmarks report blended ns/op; the split would have caught
lazydfa's 300× hit-path regression at scale, and the analogous shape
of regression could occur in reggie if R1 ships unbounded.

**Pointers.**
- `reggie-benchmark/` JMH suites.
- Add: explicit `*_hit` / `*_miss` variants for any new lazy-DFA
benchmark. Assert DFA cache size stays bounded.
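
One way to build the split corpora, sketched without JMH so it stays self-contained. `CorpusSplit` is a hypothetical helper; the oracle predicate is assumed to be a trusted reference matcher such as `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: partitions a benchmark corpus into hit and miss inputs. */
final class CorpusSplit {
    final List<String> hits = new ArrayList<>();
    final List<String> misses = new ArrayList<>();

    static CorpusSplit of(List<String> corpus, Predicate<String> oracle) {
        CorpusSplit split = new CorpusSplit();
        for (String s : corpus) {
            (oracle.test(s) ? split.hits : split.misses).add(s);
        }
        return split;
    }
}
```

Each list would then feed its own `*_hit` or `*_miss` benchmark method, so the two paths are never averaged into one ns/op figure.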

## Suggested ordering

1. **R5** (thread-model audit + ThreadLocal/instance scratch) —
cheapest, unlocks zero-alloc baseline that the rest measures against.
2. **R3** (inline bitset ops audit) — verification, not redesign;
small risk, possibly small win.
3. **R7** (benchmark split) — required infrastructure before R1.
4. **R1 + R2** (lazy DFA + ASCII table) — the headline change.
Highest expected payoff, highest implementation effort, requires
R7 to keep honest.
5. **R6** (literal fast path) — independent; ship whenever convenient.
6. **R4** (SoA automaton) — broad refactor; defer until R1 lands and
indicates the cache-locality win is worth the disruption.

## Watch-outs (consolidated)

- **Lazy DFA cache must be bounded.** glob_perf's own data is the
cautionary tale.
- **Confirm matcher thread model** before adopting `AtomicReferenceArray`
or `ThreadLocal`: plain arrays/fields are 2-3× faster if matchers
are single-threaded.
- **Hyperscan-style SIMD is not transferable** to one-shot compilation
(13 s build time at 16 k patterns documented). Out of scope.
- glob_perf optimizes the **multi-pattern problem**; the trie/`hasWild`/
star-collapse code is irrelevant to reggie's single-pattern compile-
to-bytecode model. Do **not** port that machinery.
@@ -0,0 +1,156 @@
# Spec: LazyDFA PR #67 — five-item fix batch

**Date:** 2026-05-29
**Branch:** feat-lazy-dfa-r1-r2
**Items:** 3325667007, 3325673306, 3325673350, 3325673394, 3325673423

---

## Item 3325667007 — Acquire/release semantics for int[] element writes in LazyDFACache

### What must change and why

`LazyDFACache.cacheEntry` has two branches. The first branch (table is null) already
uses `TABLES_VH.setRelease` to publish the newly-allocated `int[128]`, establishing
happens-before with the `TABLES_VH.getAcquire` call in the hot loop. The second branch
(table already exists) writes `table[c] = value` as a plain array store. On weakly-ordered
platforms (ARM/RISC-V), a reader thread can see the new DFA state id in `table[c]` before
the per-state data (`nfaStateSets[newId]`, `accepting[newId]`) written by
`computeIfAbsent` on the writer thread has been committed to memory. This can produce
null reads or silent wrong results in `lookupOrCompute` when the reader tries to dereference
`nfaStateSets[state]`.

### Correct behaviour

Element writes to an existing ASCII table must use release semantics, and the corresponding
reads in the hot loop (both the runtime `matches()` method and the generated inlined version)
must use acquire semantics.

Changes:

1. **`LazyDFACache`**: add a second `static final VarHandle INT_ARRAY_VH` for `int[].class`
next to the existing `TABLES_VH`. In `cacheEntry`'s else-branch replace the plain
`table[c] = value` with `INT_ARRAY_VH.setRelease(table, c, value)`. In the
`matches()` hot loop replace `table[c]` with
`(int) INT_ARRAY_VH.getAcquire(table, c)`.

2. **`LazyDFABytecodeGenerator.generateMatchesMethod`**: the inlined hot loop currently
reads `table[c]` with a plain `IALOAD`. Replace it with a VarHandle invocation:
`GETSTATIC LazyDFACache.INT_ARRAY_VH`, push `table` (ALOAD 7) and `c` (ILOAD 6),
then `INVOKEVIRTUAL VarHandle.getAcquire "([II)I"`.
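
The source-level shape of the element accesses can be sketched as below. The wrapper class `ReleaseAcquireTable` and its method names are hypothetical; `INT_ARRAY_VH` follows the name used in this spec, and `MethodHandles.arrayElementVarHandle` is the standard JDK factory for array-element handles.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.Arrays;

/** Hypothetical sketch: release/acquire access to int[] elements via a VarHandle. */
final class ReleaseAcquireTable {
    // Package-private so that same-package generated code could GETSTATIC it.
    static final VarHandle INT_ARRAY_VH = MethodHandles.arrayElementVarHandle(int[].class);

    static final int UNCACHED = -1;

    static int[] newTable() {
        int[] table = new int[128];
        Arrays.fill(table, UNCACHED);
        return table;
    }

    /** Writer side: writes made before this store are visible to a reader that acquires the value. */
    static void publish(int[] table, int c, int value) {
        INT_ARRAY_VH.setRelease(table, c, value);
    }

    /** Reader side: pairs with publish(); replaces a plain table[c] load. */
    static int read(int[] table, int c) {
        return (int) INT_ARRAY_VH.getAcquire(table, c);
    }
}
```

The `(int)` cast on `getAcquire` matters: `VarHandle` access methods are signature-polymorphic, and the cast selects the `(int[], int) -> int` call shape that corresponds to the `([II)I` descriptor above.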

### Constraints

- `INT_ARRAY_VH` must be `static final` and initialised in the existing `static {}` block
alongside `TABLES_VH`.
- On x86/TSO, `setRelease`/`getAcquire` for `int[]` elements need no fence instruction
and are expected to JIT-compile to a plain store/load, so the added cost should be negligible.
- The writer thread's own read `table = asciiTables[state]` inside `cacheEntry` to check
for null does NOT need acquire semantics because it is on the writer thread.
- `INT_ARRAY_VH` must be `package-private` (`static final VarHandle INT_ARRAY_VH`) not
private, so the generated hot-loop bytecode (same package) can `GETSTATIC` it.

---

## Item 3325673306 — FrozenState benchmark corpus must exercise the frozen path

### What must change and why

`FrozenState.setup()` uses a 36-character alphabet
(`abcdefghijklmnopqrstuvwxyz0123456789`) to fill the cache. Most generated strings contain
non-`[ab]` characters that hit DEAD on the first step and add only 1-2 DFA states. With
10 000 iterations the cache is unlikely to reach 4096 states, so `frozenPath` measures
normal cached-DEAD-rejection rather than the post-freeze NFA fallback path that the
benchmark claims to measure.

### Correct behaviour

Change the warm-up alphabet from 36 chars to `"ab"` only. With the pattern
`(?:a+b+|b+a+){75}` and only `a`/`b` inputs, every character forces a genuine NFA-derived
DFA transition, ensuring state explosion fills the cap quickly. No assertion is required;
the change is sufficient because the reachable DFA state-sets from random `a`/`b` inputs
for this pattern exceed 4096.

The comment above the warm-up loop should describe the intent clearly.
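
A sketch of a warm-up input generator restricted to the `"ab"` alphabet. The class `AbCorpus`, the method name, and the use of a seeded `java.util.Random` are assumptions for illustration; the actual `FrozenState.setup()` code is not shown here.

```java
import java.util.Random;

/** Hypothetical sketch: random warm-up strings drawn only from the alphabet "ab". */
final class AbCorpus {
    static String randomAb(Random rnd, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            // Only 'a' or 'b': no character falls outside the pattern's alphabet,
            // so no input is rejected merely for containing a foreign character.
            out[i] = rnd.nextBoolean() ? 'a' : 'b';
        }
        return new String(out);
    }
}
```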

---

## Item 3325673350 — Processor test for LAZY_DFA end-to-end path

### What must change and why

No existing test in `reggie-processor` exercises the LAZY_DFA code path emitted by
`ReggieMatcherBytecodeGenerator`. The delegating `matches()` path (which avoids
package-private access) and the `findMatchFrom` delegation are entirely untested in the
processor module.

### Correct behaviour

Add a `testLazyDfaStrategy` test to
`reggie-processor/.../processor/ReggieMatcherBytecodeGeneratorTest.java`.

Requirements:
- Compile the pattern `(?:a+b+|b+a+){75}` via the existing `compile()` helper. This
pattern is known to route to `LAZY_DFA`.
- Assert `matches()` accepts `"ab".repeat(75)` and rejects `"ab".repeat(74) + "b"`.
- Assert `find()` returns `true` for `"xx" + "ab".repeat(75) + "yy"` and `false` for `"xx"`.
- The test must use only public `ReggieMatcher` API via reflection (the same pattern as
every other test in that file).
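
The expected outcomes can be cross-checked against `java.util.regex`, used here purely as an oracle. The class `JdkOracle` is hypothetical and is not the processor test itself, which goes through the `compile()` helper and reflection.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: JDK regex as a reference for the expected test outcomes. */
final class JdkOracle {
    static final Pattern PATTERN = Pattern.compile("(?:a+b+|b+a+){75}");

    /** Whole-input match, the JDK counterpart of matches(). */
    static boolean matches(String input) {
        return PATTERN.matcher(input).matches();
    }

    /** Substring search, the JDK counterpart of find(). */
    static boolean find(String input) {
        return PATTERN.matcher(input).find();
    }
}
```

The reject case works because `"ab".repeat(74) + "b"` is 149 characters long, while 75 groups of at least two characters each need at least 150.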

---

## Item 3325673394 — LazyDFABytecodeGeneratorTest missing coverage for match/matchBounded/findMatchFrom

### What must change and why

`LazyDFABytecodeGeneratorTest` only calls `matches()`. The methods `match()`,
`matchBounded()`, and `findMatchFrom()` generated by `LazyDFABytecodeGenerator` are not
exercised.

### Correct behaviour

Add three tests to `LazyDFABytecodeGeneratorTest` (runtime module):

1. `testMatchMethod`: call `match("ab".repeat(75))`, assert result is non-null,
`result.start(0) == 0`, `result.end(0) == 150`.
2. `testMatchBoundedMethod`: call `matchBounded("xx" + "ab".repeat(75), 2, 152)` and
assert a non-null result and the expected offsets. Do not use an input such as
`"xxab".repeat(38) + "xx"` with bounds `[2, 78)`: a 76-character region cannot match,
because the pattern requires 75 groups of at least two characters each.
3. `testFindMatchFromMethod`: call `findMatchFrom("xx" + "ab".repeat(75) + "yy", 0)`,
assert non-null result, `result.start(0) == 2`, `result.end(0) == 152`.

Use `RuntimeCompiler.compile(LARGE_NFA_PATTERN)` to get the matcher, then reflectively
invoke the methods.

For `match`, `matchBounded`, and `findMatchFrom`, use the `MatchResult` interface type
and call `start(int)` / `end(int)` via reflection.
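
The offsets listed above agree with what `java.util.regex` reports for the same inputs, as this sketch shows. `JdkOffsetOracle` is a hypothetical helper. The bounded variant uses `Matcher.region`, which reports absolute offsets; whether reggie's `matchBounded` reports absolute or region-relative offsets is not stated here, so treat that part as JDK behaviour only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: JDK regex as a reference for group-0 offsets. */
final class JdkOffsetOracle {
    static final Pattern P = Pattern.compile("(?:a+b+|b+a+){75}");

    /** Returns {start, end} of a whole-input match, or null. */
    static int[] matchSpan(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? new int[] {m.start(0), m.end(0)} : null;
    }

    /** Returns {start, end} of the first match at or after 'from', or null. */
    static int[] findSpanFrom(String input, int from) {
        Matcher m = P.matcher(input);
        return m.find(from) ? new int[] {m.start(0), m.end(0)} : null;
    }

    /** Returns {start, end} of a match covering the region [start, end), or null. */
    static int[] boundedSpan(String input, int start, int end) {
        Matcher m = P.matcher(input).region(start, end);
        return m.matches() ? new int[] {m.start(0), m.end(0)} : null;
    }
}
```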

---

## Item 3325673423 — Spec doc: fix incorrect "c & 0x7F" description

### What must change and why

`docs/superpowers/specs/2026-05-28-lazy-dfa-design.md` line 18 says the ASCII transition
table is "indexed by `c & 0x7F`". The actual code uses a `c < 128` guard; characters
with `c >= 128` bypass the table entirely and fall through to the NFA step. Masking
non-ASCII chars with `0x7F` would alias them to unrelated ASCII transitions, which is
wrong.

### Correct behaviour

Replace every occurrence of "indexed by `c & 0x7F`" in the spec with accurate language:
the table covers ASCII characters (`c < 128`) only; non-ASCII characters (`c >= 128`)
bypass the table and fall through to the NFA step.
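
A small illustration of the aliasing that masking would cause, using hypothetical helper names. `'\u00e9'` (é, code point 0xE9) masks to 0x69, which is the ASCII letter `i`.

```java
/** Hypothetical sketch: why a c < 128 guard is not interchangeable with a 0x7F mask. */
final class AsciiGuard {
    /** What a bare mask would produce as the table index. */
    static int maskedIndex(char c) {
        return c & 0x7F;
    }

    /** The guard described in this item: only ASCII input may index the table. */
    static boolean usesTable(char c) {
        return c < 128;
    }
}
```

With the mask alone, an input containing `é` would follow the cached transition for `i`; with the guard, it skips the table and takes the NFA step.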

---

## Cross-cutting constraints

- All changes are in these modules: `reggie-runtime`, `reggie-codegen`, `reggie-benchmark`,
`reggie-processor`, and one doc file.
- No new external dependencies.
- Run `./gradlew spotlessApply` before committing.
- Build: `./gradlew :reggie-runtime:compileJava :reggie-codegen:compileJava --no-daemon`
- Test: `./gradlew :reggie-runtime:test :reggie-codegen:test --no-daemon`