perf(detectors): quick-reject pre-screen on auth detectors (-31% detector CPU) #111
Merged
…ctor CPU)

Profiling on a 30K-file polyglot fixture (kept at ~/projects/polyglot-bench: spring-petclinic-microservices, airflow, istio, eShop, angular/components, nuxt, actix/examples, ktor-samples, nlohmann/json, play-samples, PSScriptAnalyzer, terraform-aws-eks; 14 distinct languages) showed the three cross-cutting auth detectors burning 55% of all detector CPU because they ran the lines × patterns double loop on every supported-language file, even files with zero auth keywords.

Fix: per-detector PRE_SCREEN Pattern with all distinctive literal substrings of the underlying patterns. One regex pass over file content; if no keyword is present, the file cannot match, so the detector short-circuits before the line loop.

Measured impact (JFR ExecutionSample, JDK 25, polyglot fixture):
  CertificateAuthDetector: 244 → 147 samples (-39.8%, -0.97s CPU)
  SessionHeaderAuthDetector: 206 → 43 samples (-79.1%, -1.63s CPU)
  LdapAuthDetector: 47 → 25 samples (-46.8%, -0.22s CPU)
  Auth subtotal: 497 → 215 samples (-56.7%, -2.82s)
  All detectors total: 902 → 624 samples (-30.8%, -2.78s)

Detection semantics unchanged: pre-screen rejects only files where no underlying pattern can match (keyword absent). Tests covering keyword-bearing fixtures pass through pre-screen and run the existing logic byte-for-byte.

Tests: 3689 / 0 failures / 0 errors / 32 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Three cross-cutting auth detectors (CertificateAuthDetector, SessionHeaderAuthDetector, LdapAuthDetector) burn 55% of all detector CPU on real-world polyglot scans because they run a lines × patterns double loop on every supported-language file, even files with zero auth keywords.
This PR adds a per-detector PRE_SCREEN Pattern: one regex pass over file content; if no distinctive literal substring of any underlying pattern is present, the file cannot match, so the detector short-circuits before the line loop.
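The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class name PreScreenSketch, the three line patterns, and the detect method are hypothetical stand-ins, and the real detectors return a DetectorResult rather than a list of line numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for one auth detector; names are illustrative, not the project's. */
final class PreScreenSketch {

    // Assumed per-line patterns, invented for this sketch.
    static final List<Pattern> LINE_PATTERNS = List.of(
            Pattern.compile("\\bLdapContext\\b"),
            Pattern.compile("ldap://[\\w.-]+"),
            Pattern.compile("\\bInitialDirContext\\s*\\("));

    // One alternation holding a literal that every match of each pattern above must contain.
    static final Pattern PRE_SCREEN = Pattern.compile("LdapContext|ldap://|InitialDirContext");

    /** Returns 1-based numbers of matching lines; empty when the pre-screen rejects the file. */
    static List<Integer> detect(String content) {
        // Single pass over the whole file: no keyword means no line can match.
        if (!PRE_SCREEN.matcher(content).find()) {
            return List.of();
        }
        List<Integer> hits = new ArrayList<>();
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            for (Pattern p : LINE_PATTERNS) {
                if (p.matcher(lines[i]).find()) {
                    hits.add(i + 1);
                    break; // one hit per line is enough for this sketch
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(detect("int x = 1;\nreturn x;"));
        System.out.println(detect("// setup\nvar ctx = new InitialDirContext(env);"));
    }
}
```

One design point worth keeping in mind with this shape: the pre-screen has to be at least as permissive as the patterns it guards. If any underlying pattern were compiled with Pattern.CASE_INSENSITIVE, for example, the pre-screen would need the same flag, otherwise it could reject a file that the line loop would have matched.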
Measured impact
JFR ExecutionSample profile, JDK 25 Temurin, on a kept 30K-file polyglot fixture (12 repos under ~/projects/polyglot-bench/: spring-petclinic-ms, airflow, istio, eShop, angular/components, nuxt, actix/examples, ktor-samples, nlohmann/json, play-samples, PSScriptAnalyzer, terraform-aws-eks; 14 distinct languages active including Python, TS, Java, Go, C#, Rust, Kotlin, Scala, etc.):
CertificateAuthDetector: 244 → 147 samples (-39.8%, -0.97s CPU)
SessionHeaderAuthDetector: 206 → 43 samples (-79.1%, -1.63s CPU)
LdapAuthDetector: 47 → 25 samples (-46.8%, -0.22s CPU)
(Each sample ≈ 10ms at JFR's profile setting.)
Why this is safe
PRE_SCREEN is constructed as a regex alternation of every distinctive literal substring drawn from the existing patterns in ALL_PATTERNS / LANGUAGE_PATTERNS.
Files that don't contain any of those substrings cannot match any underlying pattern by construction, so the early return DetectorResult.empty() is identical in observable behavior to running the existing line loop and emitting zero nodes.
Detection semantics unchanged for files that DO contain at least one keyword:
pre-screen passes, the existing line × patterns logic runs unmodified, same
nodes emitted with the same IDs/labels/properties/line numbers.
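The equivalence argument above can also be checked mechanically. The sketch below is a hypothetical harness, not part of the PR: PreScreenEquivalence, its two patterns, and the sample inputs are invented for illustration. It runs the same lines × patterns loop with and without the pre-screen and reports whether the two paths agree on every sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical equivalence check; patterns and inputs are illustrative only. */
final class PreScreenEquivalence {

    // Assumed line patterns, invented for this sketch.
    static final List<Pattern> LINE_PATTERNS = List.of(
            Pattern.compile("X-Session-Id\\s*:"),
            Pattern.compile("\\bsetClientAuth\\s*\\("));

    // Alternation of a literal that each pattern above requires.
    static final Pattern PRE_SCREEN = Pattern.compile("X-Session-Id|setClientAuth");

    /** Reference behaviour: the plain lines x patterns loop with no pre-screen. */
    static List<Integer> unscreened(String content) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = content.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            for (Pattern p : LINE_PATTERNS) {
                if (p.matcher(lines[i]).find()) {
                    hits.add(i + 1);
                    break;
                }
            }
        }
        return hits;
    }

    /** Optimized behaviour: reject early, otherwise fall through to the same loop. */
    static List<Integer> screened(String content) {
        return PRE_SCREEN.matcher(content).find() ? unscreened(content) : List.of();
    }

    /** True when both paths return the same line numbers for every sample. */
    static boolean equivalentOn(List<String> samples) {
        for (String s : samples) {
            if (!screened(s).equals(unscreened(s))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "fn main() {}",                               // no keyword: rejected early
                "headers.put(\"X-Session-Id: \" + id);",      // keyword and match
                "a\nserver.setClientAuth(true);\nb",          // match on line 2
                "setClientAuth mentioned in a comment only"); // keyword, but no pattern match
        System.out.println(equivalentOn(samples));
    }
}
```

The last sample covers the case where the pre-screen passes but no line pattern matches; the file simply takes the slow path and still yields an empty result, so a pre-screen that is too permissive costs time but not correctness.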
Tests
3689 / 0 failures / 0 errors / 32 skipped — same as baseline. All 65 auth
detector tests pass without modification (they all use keyword-bearing
fixtures, which pre-screen lets through). The "no match on plain code"
negative tests still pass — pre-screen rejects (faster path), result is the
same empty DetectorResult.
What's NOT in this PR
(where the bottleneck is the tree walk, not regex) or already use single-pass Matcher.find(). Pre-screen's gain is small on those and the regression risk on AST code paths isn't justified.
PRE_SCREEN keeps blast radius minimal and the optimization explicit at each call site. If a pattern emerges across many regex detectors, a follow-up PR can hoist it to AbstractRegexDetector.
Test plan
mvn test -Dtest='*Auth*Test,AuthDetectorsCoverageTest': 65/65 pass
mvn test (full suite): 3689/0/0/32 skipped
🤖 Generated with Claude Code