fix(scoring): prevent risk score saturation via per-rule diminishing returns by mimran-khan · Pull Request #139 · NVIDIA/SkillSpector

mimran-khan · 2026-06-22T17:31:06Z

Summary

Fixes #134 — Risk score saturates to 100 on any multi-file skill due to unbounded additive scoring.

The previous scoring formula treated every finding as a fully independent contribution regardless of whether the same rule_id had already been counted. A skill using subprocess across 10 files would hit 100/100 CRITICAL from repeated TM1 matches alone, making the score meaningless for real-world multi-file skills.

Changes

src/skillspector/nodes/report.py

Replaced the linear additive scorer with per-rule diminishing returns:
- 1st occurrence of a rule: 100% of base points
- 2nd occurrence: 50%
- 3rd occurrence: 25%
- 4th+ occurrences: 0 points (capped)
Applied confidence weighting: each finding's contribution is scaled by its confidence field (0.0–1.0), so low-confidence pattern matches contribute proportionally less
Zero/negative confidence findings are skipped entirely (do not consume weight slots)
Out-of-range confidence values are clamped to [0.0, 1.0]
Same base severity points (CRITICAL=50, HIGH=25, MEDIUM=10, LOW=5), same 1.3x executable multiplier, same band thresholds

tests/nodes/test_report.py

Expanded test coverage from 9 tests to 38 tests
Added dedicated test classes: basic scoring, diminishing returns, executable multiplier, edge cases, band boundaries, real-world scenarios, and report node integration
Edge cases: zero confidence, negative confidence, confidence >1.0, empty rule_id, same rule with mixed severities, exact band boundaries

Behavioral Impact

Scenario	Old Score	New Score
TM1 × 10 files (MEDIUM, confidence 0.5)	100 (CRITICAL)	8 (LOW)
4 different CRITICAL rules (confidence 0.9)	100 (CRITICAL)	100 (CRITICAL)
1 CRITICAL (confidence 1.0)	50 (MEDIUM)	50 (MEDIUM)
2 HIGH different rules (confidence 1.0), executable	65 (HIGH)	65 (HIGH)

Diverse genuine vulnerabilities still accumulate to high scores. Repeated same-rule matches no longer dominate.

Test Plan

All 38 report tests pass
Full suite: 705 passed, 0 failed, 12 skipped
Ruff lint: all checks passed
Ruff format: clean
No new mypy errors introduced (8 pre-existing in unrelated SARIF code)

…returns The additive scoring formula allowed repeated pattern matches of the same rule_id to inflate the risk score unboundedly, causing any multi-file skill to trivially saturate at 100/100 CRITICAL regardless of actual risk. This commit introduces per-rule diminishing returns: the first occurrence scores full points, the second half, the third a quarter, and subsequent occurrences are ignored. Confidence weighting is also applied so low-confidence matches contribute proportionally less. Fixes NVIDIA#134 Signed-off-by: Mohammed Imran Khan <mohammed_imran.khan@outlook.com>

mimran-khan · 2026-06-22T19:15:04Z

PS: These issues were found while running bunch of skill evals using SkillSpector and hence raising all the issues so that i can use these with correct reference to Upstream code

rng1995

Review: per-rule diminishing returns for risk scoring

Thanks for tackling score saturation. The core logic is sound and the test suite is excellent — diminishing weights are indexed safely (count is always 0/1/2 because the >= _MAX_OCCURRENCES_PER_RULE guard continues first), the cap at 3 occurrences is clear, the severity bands are unchanged, and the int() truncation matches the asserted values. Bucketing unknown rule_id as UNKNOWN and clamping confidence are nice touches.

Approving, with a couple of non-blocking items I'd like addressed:

1. Undocumented confidence scaling (please document)

Beyond diminishing returns, this PR also makes scoring confidence-weighted (score += base_points * weight * confidence) and skips findings with confidence <= 0. That is a meaningful change to the scoring contract, but it is not mentioned in the title or the function docstring (which only documents diminishing returns, base points, and the executable multiplier). Please add the confidence factor to the docstring and the PR description so the semantics are explicit. (Good news: this only affects the score, not the findings list — zero-confidence findings are still emitted in SARIF — so it doesn't suppress reporting.)

2. Order-dependency for same `rule_id`, mixed severities (minor)

Weights are applied in finding-iteration order, so for the same rule_id with differing severities the result depends on ordering: a LOW occurrence appearing before a CRITICAL of the same rule gives the CRITICAL the reduced weight. rule_id usually maps to one severity so impact is low, but consider sorting by severity within a rule bucket before applying weights.

Inter-PR note

PR #142 (dedup before scoring) targets the same saturation problem via a different mechanism and edits report.py from the same base. Once both land, dedup runs first, so for identical repeated patterns these diminishing weights will rarely trigger — partial redundancy worth confirming. Separately, #140 (doc-path 0.3x confidence) and #143 (0.4 confidence threshold) now compound with this confidence-weighted scoring; please sanity-check end-to-end that genuine high-severity findings still produce appropriate scores.

mimran-khan · 2026-06-22T19:56:29Z

Thanks for the approval and the helpful feedback! Both non-blocking items addressed:

Confidence scaling documented — docstring now explicitly mentions that each finding's contribution is scaled by confidence (clamped [0,1]) and that zero-confidence findings are skipped without consuming a weight slot.
Severity sort within rule bucket — findings are now sorted by (rule_id, severity DESC) before applying diminishing weights, so the highest-severity occurrence of each rule always gets the full weight regardless of input order.

rng1995 approved these changes Jun 22, 2026

View reviewed changes

This was referenced Jun 22, 2026

fix(static-patterns): filter false positives from documentation and code examples #140

Open

fix(findings): deduplicate cross-analyzer findings before scoring #142

Open

fix(meta-analyzer): add heuristic fallback filter for --no-llm mode #143

Open

rng1995 merged commit 2186e45 into NVIDIA:main Jun 22, 2026

mimran-khan mentioned this pull request Jun 22, 2026

fix(scoring): document confidence scaling, sort by severity within rule bucket #153

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scoring): prevent risk score saturation via per-rule diminishing returns#139

fix(scoring): prevent risk score saturation via per-rule diminishing returns#139
rng1995 merged 1 commit into
NVIDIA:mainfrom
mimran-khan:fix/scoring-saturation-134

mimran-khan commented Jun 22, 2026

Uh oh!

mimran-khan commented Jun 22, 2026

Uh oh!

rng1995 left a comment

Uh oh!

mimran-khan commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mimran-khan commented Jun 22, 2026

Summary

Changes

Behavioral Impact

Test Plan

Uh oh!

mimran-khan commented Jun 22, 2026

Uh oh!

rng1995 left a comment

Choose a reason for hiding this comment

Review: per-rule diminishing returns for risk scoring

1. Undocumented confidence scaling (please document)

2. Order-dependency for same rule_id, mixed severities (minor)

Inter-PR note

Uh oh!

mimran-khan commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

2. Order-dependency for same `rule_id`, mixed severities (minor)