fix(yara): use content hash for rule cache invalidation#156
Conversation
|
just added on minor linting fix. |
rng1995
left a comment
There was a problem hiding this comment.
Independent review (deferring merge to @keshprad)
Took a look as part of a review sweep. The fix is correct and well-scoped: hashing actual file content (p.read_bytes()) instead of st_size closes the stale-cache hole where a same-length rule edit kept serving previously compiled rules. The TestContentHashInvalidation cases (same-size/different-content, identical-content stability, recompile-after-edit) cover it well.
Not stacking an approval or merging on top of your existing approval — leaving the merge to you per your note about verifying unit tests, linting, and static-analysis on internal CI first. LGTM from my side.
rng1995
left a comment
There was a problem hiding this comment.
Independent confirmation — deferring the merge to @keshprad per their stated plan
Not stacking another approval (this is already approved by @keshprad). Posting a comment-level note only.
The fix is correct and well-scoped: _content_hash() now hashes file content via p.read_bytes() instead of str(p.stat().st_size), so same-length rule edits correctly invalidate the compiled-rule cache:
h.update(str(p).encode())
h.update(p.read_bytes())The trade-off (reading small rule files vs. stat-only) is negligible relative to YARA compilation, as the PR notes.
Test coverage is solid — TestContentHashInvalidation covers same-size/different-content -> different hash, identical content -> stable hash, and _load_rules recompiling after a same-length edit.
Deferring the merge to @keshprad, who said they'll merge after double-checking unit tests, linting, and static-analysis on the internal CI (and has already pushed a follow-up lint fix). No action needed from me — just an independent +1 on correctness.
_content_hash() previously hashed file paths and sizes only. If a rule file was edited without changing its byte count, the cache would serve stale compiled rules. Switch to hashing actual file content via read_bytes() so any edit invalidates the cache. Fixes NVIDIA#152
b6b1e86 to
6bd9c2f
Compare
Summary
The YARA rule cache (
_content_hash) invalidated based on file paths and sizes only. If a rule file is edited without changing its byte count (e.g., changing a pattern string to a same-length alternative), the module continues serving stale compiled rules from the previous version.Changes
static_yara.py:_content_hash()now hashes actual file content (read_bytes()) instead of justst_size.test_static_yara.py: 3 new tests inTestContentHashInvalidation— validates that same-size edits produce different hashes, identical content produces stable hashes, and_load_rulesrecompiles after a same-size edit.Before / After
Trade-off
Reading file content is slightly slower than stat-only (~microseconds per rule file). Given that YARA rule directories typically contain <50 small files and compilation itself is the expensive step, this is negligible.
Testing
Fixes #152