Skip to content

fix(yara): use content hash for rule cache invalidation#156

Merged
rng1995 merged 1 commit into
NVIDIA:mainfrom
mimran-khan:fix/yara-cache-content-hash
Jun 23, 2026
Merged

fix(yara): use content hash for rule cache invalidation#156
rng1995 merged 1 commit into
NVIDIA:mainfrom
mimran-khan:fix/yara-cache-content-hash

Conversation

@mimran-khan

Copy link
Copy Markdown
Contributor

Summary

The YARA rule cache (_content_hash) invalidated based on file paths and sizes only. If a rule file is edited without changing its byte count (e.g., changing a pattern string to a same-length alternative), the module continues serving stale compiled rules from the previous version.

Changes

  • static_yara.py: _content_hash() now hashes actual file content (read_bytes()) instead of just st_size.
  • test_static_yara.py: 3 new tests in TestContentHashInvalidation — validates that same-size edits produce different hashes, identical content produces stable hashes, and _load_rules recompiles after a same-size edit.

Before / After

# Before — size-only, misses same-length edits
h.update(str(p.stat().st_size).encode())

# After — content-based, catches all edits
h.update(p.read_bytes())

Trade-off

Reading file content is slightly slower than stat-only (~microseconds per rule file). Given that YARA rule directories typically contain <50 small files and compilation itself is the expensive step, this is negligible.

Testing

  • 3 new tests pass
  • Full existing YARA test suite continues to pass

Fixes #152

@keshprad keshprad left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. thanks!

I'll merge later today after double checking all the unit tests, linting issues, static analysis reports on our internal CI

@keshprad

Copy link
Copy Markdown
Member

just added on minor linting fix.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent review (deferring merge to @keshprad)

Took a look as part of a review sweep. The fix is correct and well-scoped: hashing actual file content (p.read_bytes()) instead of st_size closes the stale-cache hole where a same-length rule edit kept serving previously compiled rules. The TestContentHashInvalidation cases (same-size/different-content, identical-content stability, recompile-after-edit) cover it well.

Not stacking an approval or merging on top of your existing approval — leaving the merge to you per your note about verifying unit tests, linting, and static-analysis on internal CI first. LGTM from my side.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent confirmation — deferring the merge to @keshprad per their stated plan

Not stacking another approval (this is already approved by @keshprad). Posting a comment-level note only.

The fix is correct and well-scoped: _content_hash() now hashes file content via p.read_bytes() instead of str(p.stat().st_size), so same-length rule edits correctly invalidate the compiled-rule cache:

h.update(str(p).encode())
h.update(p.read_bytes())

The trade-off (reading small rule files vs. stat-only) is negligible relative to YARA compilation, as the PR notes.

Test coverage is solid — TestContentHashInvalidation covers same-size/different-content -> different hash, identical content -> stable hash, and _load_rules recompiling after a same-length edit.

Deferring the merge to @keshprad, who said they'll merge after double-checking unit tests, linting, and static-analysis on the internal CI (and has already pushed a follow-up lint fix). No action needed from me — just an independent +1 on correctness.

_content_hash() previously hashed file paths and sizes only. If a rule
file was edited without changing its byte count, the cache would serve
stale compiled rules. Switch to hashing actual file content via
read_bytes() so any edit invalidates the cache.

Fixes NVIDIA#152
@mimran-khan mimran-khan force-pushed the fix/yara-cache-content-hash branch from b6b1e86 to 6bd9c2f Compare June 23, 2026 09:42
@rng1995 rng1995 merged commit 3a6578f into NVIDIA:main Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] YARA rule cache uses file size instead of content hash — stale rules served after edits

3 participants