Skip to content

fix(scoring): prevent risk score saturation via per-rule diminishing returns#139

Merged
rng1995 merged 1 commit into
NVIDIA:mainfrom
mimran-khan:fix/scoring-saturation-134
Jun 22, 2026
Merged

fix(scoring): prevent risk score saturation via per-rule diminishing returns#139
rng1995 merged 1 commit into
NVIDIA:mainfrom
mimran-khan:fix/scoring-saturation-134

Conversation

@mimran-khan

Copy link
Copy Markdown
Contributor

Summary

Fixes #134 — Risk score saturates to 100 on any multi-file skill due to unbounded additive scoring.

The previous scoring formula treated every finding as a fully independent contribution regardless of whether the same rule_id had already been counted. A skill using subprocess across 10 files would hit 100/100 CRITICAL from repeated TM1 matches alone, making the score meaningless for real-world multi-file skills.

Changes

src/skillspector/nodes/report.py

  • Replaced the linear additive scorer with per-rule diminishing returns:
    • 1st occurrence of a rule: 100% of base points
    • 2nd occurrence: 50%
    • 3rd occurrence: 25%
    • 4th+ occurrences: 0 points (capped)
  • Applied confidence weighting: each finding's contribution is scaled by its confidence field (0.0–1.0), so low-confidence pattern matches contribute proportionally less
  • Zero/negative confidence findings are skipped entirely (do not consume weight slots)
  • Out-of-range confidence values are clamped to [0.0, 1.0]
  • Same base severity points (CRITICAL=50, HIGH=25, MEDIUM=10, LOW=5), same 1.3x executable multiplier, same band thresholds

tests/nodes/test_report.py

  • Expanded test coverage from 9 tests to 38 tests
  • Added dedicated test classes: basic scoring, diminishing returns, executable multiplier, edge cases, band boundaries, real-world scenarios, and report node integration
  • Edge cases: zero confidence, negative confidence, confidence >1.0, empty rule_id, same rule with mixed severities, exact band boundaries

Behavioral Impact

Scenario Old Score New Score
TM1 × 10 files (MEDIUM, confidence 0.5) 100 (CRITICAL) 8 (LOW)
4 different CRITICAL rules (confidence 0.9) 100 (CRITICAL) 100 (CRITICAL)
1 CRITICAL (confidence 1.0) 50 (MEDIUM) 50 (MEDIUM)
2 HIGH different rules (confidence 1.0), executable 65 (HIGH) 65 (HIGH)

Diverse genuine vulnerabilities still accumulate to high scores. Repeated same-rule matches no longer dominate.

Test Plan

  • All 38 report tests pass
  • Full suite: 705 passed, 0 failed, 12 skipped
  • Ruff lint: all checks passed
  • Ruff format: clean
  • No new mypy errors introduced (8 pre-existing in unrelated SARIF code)

…returns

The additive scoring formula allowed repeated pattern matches of the same
rule_id to inflate the risk score unboundedly, causing any multi-file skill
to trivially saturate at 100/100 CRITICAL regardless of actual risk.

This commit introduces per-rule diminishing returns: the first occurrence
scores full points, the second half, the third a quarter, and subsequent
occurrences are ignored. Confidence weighting is also applied so low-confidence
matches contribute proportionally less.

Fixes NVIDIA#134

Signed-off-by: Mohammed Imran Khan <mohammed_imran.khan@outlook.com>
@mimran-khan

Copy link
Copy Markdown
Contributor Author

PS: These issues were found while running bunch of skill evals using SkillSpector and hence raising all the issues so that i can use these with correct reference to Upstream code

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: per-rule diminishing returns for risk scoring

Thanks for tackling score saturation. The core logic is sound and the test suite is excellent — diminishing weights are indexed safely (count is always 0/1/2 because the >= _MAX_OCCURRENCES_PER_RULE guard continues first), the cap at 3 occurrences is clear, the severity bands are unchanged, and the int() truncation matches the asserted values. Bucketing unknown rule_id as UNKNOWN and clamping confidence are nice touches.

Approving, with a couple of non-blocking items I'd like addressed:

1. Undocumented confidence scaling (please document)

Beyond diminishing returns, this PR also makes scoring confidence-weighted (score += base_points * weight * confidence) and skips findings with confidence <= 0. That is a meaningful change to the scoring contract, but it is not mentioned in the title or the function docstring (which only documents diminishing returns, base points, and the executable multiplier). Please add the confidence factor to the docstring and the PR description so the semantics are explicit. (Good news: this only affects the score, not the findings list — zero-confidence findings are still emitted in SARIF — so it doesn't suppress reporting.)

2. Order-dependency for same rule_id, mixed severities (minor)

Weights are applied in finding-iteration order, so for the same rule_id with differing severities the result depends on ordering: a LOW occurrence appearing before a CRITICAL of the same rule gives the CRITICAL the reduced weight. rule_id usually maps to one severity so impact is low, but consider sorting by severity within a rule bucket before applying weights.

Inter-PR note

PR #142 (dedup before scoring) targets the same saturation problem via a different mechanism and edits report.py from the same base. Once both land, dedup runs first, so for identical repeated patterns these diminishing weights will rarely trigger — partial redundancy worth confirming. Separately, #140 (doc-path 0.3x confidence) and #143 (0.4 confidence threshold) now compound with this confidence-weighted scoring; please sanity-check end-to-end that genuine high-severity findings still produce appropriate scores.

@mimran-khan

Copy link
Copy Markdown
Contributor Author

Thanks for the approval and the helpful feedback! Both non-blocking items addressed:

  1. Confidence scaling documented — docstring now explicitly mentions that each finding's contribution is scaled by confidence (clamped [0,1]) and that zero-confidence findings are skipped without consuming a weight slot.

  2. Severity sort within rule bucket — findings are now sorted by (rule_id, severity DESC) before applying diminishing weights, so the highest-severity occurrence of each rule always gets the full weight regardless of input order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Risk score saturates to 100 on any multi-file skill due to unbounded additive scoring

2 participants