
Add needle in haystack test #121

Merged
alessiodevoto merged 9 commits into main from aledev/needle
Aug 22, 2025

Conversation

@alessiodevoto
Collaborator

PR description

This small PR adds the standard NIAH test to the benchmarks. This test allows stress-testing the model at different context lengths and needle depths.
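The NIAH setup can be sketched roughly as follows. This is an illustrative assumption, not the PR's actual implementation: the helper name `insert_needle` and the sentence-boundary snapping are hypothetical.

```python
# Minimal NIAH sample-creation sketch (illustrative, not kvpress's actual code):
# place a "needle" sentence at a relative depth inside a long filler "haystack".

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at relative `depth` in [0, 1] of the haystack,
    snapping to the nearest preceding sentence boundary to keep the text fluent."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    boundary = haystack.rfind(". ", 0, pos)  # end of the previous sentence, if any
    if boundary != -1:
        pos = boundary + 2
    return haystack[:pos] + needle + " " + haystack[pos:]

haystack = "Filler sentence. " * 100
needle = "The secret number is 42."
sample = insert_needle(haystack, needle, depth=0.5)
```

The benchmark then asks the model to retrieve the needle and sweeps `depth` and the haystack length over a grid.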

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style; on errors, try to fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a 1 liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
Comment thread on evaluation/benchmarks/needle_in_haystack/__init__.py
Collaborator

@maxjeblick left a comment


Thanks a lot for the PR!

I have one question/comment:
In your opinion, would it make sense to create the needle datasets beforehand (like ruler, etc.), upload them and reuse them?

Uploading the dataset beforehand wouldn't allow on-the-fly benchmarking with arbitrary context_length/needle_depth combinations.
IMO, for kvpress, having ~40 = 5 * 8 combinations of context_length/needle_depth should probably be enough. I'm not concerned about having a common tokenizer for the dataset, but please feel free to chime in on this.

The change would improve the code quality of the PR, as evaluate.py is now also responsible for dataset creation. With precomputed datasets, no changes to evaluate.py are needed.

Collaborator

@maxjeblick left a comment


Alternatively, one could

  • add a create-dataset script in the needle_in_a_haystack folder
  • encode the needle_depth as dataset = needle_in_haystack_needle_depth
  • in load_dataset, add an if-else block that loads the needle dataset if dataset starts with needle_in_haystack
  • remove the _insert_needle_in_haystack method

This way,

  • the code changes in evaluate.py stay slim
  • no new parameter is introduced
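The dispatch suggested above could look roughly like this. All helper names here (`build_needle_dataset`, `load_standard_dataset`) are hypothetical stand-ins, not functions in the repository:

```python
# Sketch of dispatching on the dataset name, with the needle depth encoded in
# the name itself (e.g. "needle_in_haystack_25" -> depth 25%). Illustrative only.

NEEDLE_PREFIX = "needle_in_haystack"

def build_needle_dataset(depth: float) -> dict:
    # Stand-in for generating (or loading) the needle dataset at a given depth.
    return {"name": "needle", "depth": depth}

def load_standard_dataset(name: str) -> dict:
    # Stand-in for the existing dataset-loading path.
    return {"name": name}

def load_dataset(dataset: str) -> dict:
    if dataset.startswith(NEEDLE_PREFIX):
        # Parse the depth suffix; default to 0 when no suffix is given.
        depth = int(dataset.removeprefix(NEEDLE_PREFIX).strip("_") or 0)
        return build_needle_dataset(depth=depth / 100)
    return load_standard_dataset(dataset)
```

With this shape, evaluate.py keeps calling a single `load_dataset(name)` and needs no new parameter.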

@alessiodevoto
Collaborator Author

I see your point, and I also thought of this, but decided to do it this way for 2 main reasons:

  • the tokenizer: it might affect results quite a lot, so we would need at least 3 models * 10 depths (standard benchmarking) * n > 5 context lengths (I think we should offer more than only 5). Also, it is not really standard to ship a pre-tokenized dataset.
  • this test is used to see at which point the model stops working by gradually increasing the context length, so it is important that the user can control this in a fine-grained way.
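The tokenizer point can be illustrated with a toy sketch: the same text occupies a different number of tokens under different tokenizers, so hitting an exact context_length requires building the context on the fly against the target model's own tokenizer. The `build_context` helper and the whitespace `tokenize` stand-in below are assumptions for illustration, not kvpress code:

```python
# Toy illustration: trim a filler haystack so that its token count, under a
# given tokenizer, matches the requested context length exactly. A real run
# would pass the model's tokenizer; a whitespace split stands in here so the
# sketch is self-contained.

def build_context(filler: str, tokenize, context_length: int) -> str:
    tokens = []
    while len(tokens) < context_length:
        tokens.extend(tokenize(filler))
    return " ".join(tokens[:context_length])

tokenize = str.split  # stand-in for a real tokenizer
ctx = build_context("the quick brown fox", tokenize, context_length=10)
```

Swapping in a different tokenizer changes how much filler text is needed, which is why a single precomputed dataset cannot serve all models at exact context lengths.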

@maxjeblick
Collaborator

Ok, makes sense.
Then I'd suggest refactoring the code and moving dataset creation to a separate file, as described above.

@alessiodevoto
Collaborator Author

On second thought, it might make sense to move insert_needle inside the needle_in_haystack directory, as it is specific to that benchmark, wdyt? This way the eval code stays (almost) the same.

@alessiodevoto
Collaborator Author

Done :)

@alessiodevoto
Collaborator Author

Thanks @maxjeblick for your feedback, updates:

  • moved insert_needle into the dataset directory (so evaluate stays slim)
  • refactored evaluation as discussed; tested on ruler and it works fine!

Collaborator

@maxjeblick left a comment


Thanks a lot, LGTM!

@alessiodevoto
Collaborator Author

alessiodevoto commented Aug 22, 2025

Sorry @maxjeblick had to fix a typo, should be ok now 😄

@alessiodevoto alessiodevoto merged commit ab418ac into main Aug 22, 2025
3 checks passed
@alessiodevoto alessiodevoto deleted the aledev/needle branch August 22, 2025 14:24