
Add needle in haystack test #121

Merged
alessiodevoto merged 9 commits into main from aledev/needle
Aug 22, 2025

Conversation

@alessiodevoto
Collaborator

PR description

This small PR adds the standard NIAH test to the benchmarks. This test allows stress-testing the model at different context lengths and needle depths.
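The NIAH setup can be sketched roughly as follows. This is an illustrative assumption, not the PR's actual implementation: the helper name `insert_needle` and the sentence-boundary snapping are hypothetical.

```python
# Minimal NIAH sample-creation sketch (illustrative, not kvpress's actual code):
# place a "needle" sentence at a relative depth inside a long filler "haystack".

def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at relative `depth` in [0, 1] of the haystack,
    snapping to the nearest preceding sentence boundary to keep the text fluent."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    boundary = haystack.rfind(". ", 0, pos)  # end of the previous sentence, if any
    if boundary != -1:
        pos = boundary + 2
    return haystack[:pos] + needle + " " + haystack[pos:]

haystack = "Filler sentence. " * 100
needle = "The secret number is 42."
sample = insert_needle(haystack, needle, depth=0.5)
```

The benchmark then asks the model to retrieve the needle and sweeps `depth` and the haystack length over a grid.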

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style; on errors, try to fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a 1 liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
Comment thread on evaluation/benchmarks/needle_in_haystack/__init__.py
Collaborator

@maxjeblick left a comment


Thanks a lot for the PR!

I have one question/comment:
In your opinion, would it make sense to create the needle datasets beforehand (like ruler, etc.), upload them and reuse them?

Uploading the dataset beforehand wouldn't allow on-the-fly benchmarking with arbitrary context_length/needle_depth combinations.
IMO, for kvpress, having ~40 = 5 * 8 combinations of context_length/needle_depth should probably be enough. I'm not concerned about having a common tokenizer for the dataset, but please feel free to chime in on this.

The change would improve the code quality of the PR, as evaluate.py is now also responsible for dataset creation. With precomputed datasets, no changes to evaluate.py are needed.

Collaborator

@maxjeblick left a comment


Alternatively, one could

  • add a create-dataset script in the needle_in_a_haystack folder
  • encode the needle_depth as dataset = needle_in_haystack_needle_depth
  • in load_dataset, add an if-else block that loads the needle dataset if dataset starts with needle_in_haystack
  • remove the _insert_needle_in_haystack method

This way,

  • the code changes in evaluate.py stay slim
  • no new parameter is introduced
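The dispatch suggested above could look roughly like this. All helper names here (`build_needle_dataset`, `load_standard_dataset`) are hypothetical stand-ins, not functions in the repository:

```python
# Sketch of dispatching on the dataset name, with the needle depth encoded in
# the name itself (e.g. "needle_in_haystack_25" -> depth 25%). Illustrative only.

NEEDLE_PREFIX = "needle_in_haystack"

def build_needle_dataset(depth: float) -> dict:
    # Stand-in for generating (or loading) the needle dataset at a given depth.
    return {"name": "needle", "depth": depth}

def load_standard_dataset(name: str) -> dict:
    # Stand-in for the existing dataset-loading path.
    return {"name": name}

def load_dataset(dataset: str) -> dict:
    if dataset.startswith(NEEDLE_PREFIX):
        # Parse the depth suffix; default to 0 when no suffix is given.
        depth = int(dataset.removeprefix(NEEDLE_PREFIX).strip("_") or 0)
        return build_needle_dataset(depth=depth / 100)
    return load_standard_dataset(dataset)
```

With this shape, evaluate.py keeps calling a single `load_dataset(name)` and needs no new parameter.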

@alessiodevoto
Collaborator Author

I see your point, and I also thought of this, but decided to do it this way for 2 main reasons:

  • the tokenizer: it might affect results quite a lot, so we would need at least 3 models * 10 depths (standard benchmarking) * n > 5 context lengths (I think we should offer more than only 5). Also, it is not really standard to ship a pre-tokenized dataset.
  • this test is used to see at which point the model stops working by gradually increasing the context length, so it is important that the user can control this in a fine-grained way.
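The tokenizer point can be illustrated with a toy sketch: the same text occupies a different number of tokens under different tokenizers, so hitting an exact context_length requires building the context on the fly against the target model's own tokenizer. The `build_context` helper and the whitespace `tokenize` stand-in below are assumptions for illustration, not kvpress code:

```python
# Toy illustration: trim a filler haystack so that its token count, under a
# given tokenizer, matches the requested context length exactly. A real run
# would pass the model's tokenizer; a whitespace split stands in here so the
# sketch is self-contained.

def build_context(filler: str, tokenize, context_length: int) -> str:
    tokens = []
    while len(tokens) < context_length:
        tokens.extend(tokenize(filler))
    return " ".join(tokens[:context_length])

tokenize = str.split  # stand-in for a real tokenizer
ctx = build_context("the quick brown fox", tokenize, context_length=10)
```

Swapping in a different tokenizer changes how much filler text is needed, which is why a single precomputed dataset cannot serve all models at exact context lengths.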

@maxjeblick
Collaborator

Ok, makes sense.
Then I'd suggest refactoring the code and moving dataset creation to a separate file, as described above.

@alessiodevoto
Collaborator Author

On second thought, it might make sense to move insert_needle inside the needle_in_haystack directory, as it is specific to that benchmark, wdyt? This way the eval code stays (almost) the same.

@alessiodevoto
Collaborator Author

Done :)

@alessiodevoto
Collaborator Author

Thanks @maxjeblick for your feedback, updates:

  • moved insert_needle into the dataset directory (so evaluate stays slim)
  • refactored evaluation as discussed; tested on ruler and it works fine!

Collaborator

@maxjeblick left a comment


Thanks a lot, LGTM!

@alessiodevoto
Collaborator Author

alessiodevoto commented Aug 22, 2025

Sorry @maxjeblick had to fix a typo, should be ok now 😄

@alessiodevoto alessiodevoto merged commit ab418ac into main Aug 22, 2025
3 checks passed
@alessiodevoto alessiodevoto deleted the aledev/needle branch August 22, 2025 14:24