
Add decoding time compression#138

Closed
alessiodevoto wants to merge 181 commits into main from aledev/decoding_press

Conversation

@alessiodevoto
Collaborator

PR description

(Not ready to merge)
This PR introduces decoding-time compression (#55) and includes significant contributions from @maxjeblick (Thanks Max! 🙏).

The main additions are two presses, DecodingPress and PrefillDecodingPress, which perform decoding-time compression. Apart from the standard code review, some things to discuss:

  • Where to put the documentation for decoding-time compression in the README.md (right now I left the original comments in the generation.md file, but they have to be moved). We need to be extra careful to make sure it is not confusing.
  • Right now the evaluation code needs to be refactored to support decoding-time compression evaluation. We will need to add benchmarks and change the eval loop slightly. We can address this in a future PR.
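To make the feature concrete, here is a toy, framework-free sketch of what decoding-time compression does: during generation, new entries keep arriving in the cache, and the cache is pruned back to a fixed budget on the fly. All names and the scoring scheme below are made up for illustration; the real presses operate on attention key/value tensors, not score lists.

```python
# Illustrative sketch only: a toy model of decoding-time compression.
# compress / decode_with_compression / max_capacity are hypothetical names,
# not the kvpress API.

def compress(cache: list[float], max_capacity: int) -> list[float]:
    """Keep the max_capacity entries with the highest scores (order preserved)."""
    if len(cache) <= max_capacity:
        return cache
    threshold = sorted(cache, reverse=True)[max_capacity - 1]
    kept, out = 0, []
    for score in cache:
        if score >= threshold and kept < max_capacity:
            out.append(score)
            kept += 1
    return out

def decode_with_compression(scores: list[float], max_capacity: int) -> list[float]:
    """Append one 'KV entry' per generated token, compressing whenever the cache overflows."""
    cache: list[float] = []
    for s in scores:
        cache.append(s)
        cache = compress(cache, max_capacity)
    return cache

final = decode_with_compression([0.1, 0.9, 0.2, 0.8, 0.3, 0.7], max_capacity=3)
print(final)  # [0.9, 0.8, 0.7]
```

The key difference from prefill-only compression is that pruning happens repeatedly inside the generation loop, so the cache never grows beyond the budget plus the interval between compression steps.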

Checklist

  • Tests are working (make test)
  • Code is formatted correctly (make style, on errors try fix with make format)
  • Copyright header is included
  • All commits are signed-off using git commit -s
  • (new press) mypress_press.py is in the presses directory
  • (new press) MyPress is in __init__.py
  • (new press) README.md is updated with a one-liner about the new press in the Available presses section
  • (new press) New press is in the default_presses list in tests/default_presses.py
  • (new press) A docstring is provided that follows the same structure as the existing ones

maxjeblick added 30 commits July 3, 2025 16:42
Signed-off-by: Maximilian Jeblick <maximilianjeblick@gmail.com>
Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@alessiodevoto
Collaborator Author

/ok to test a43fc19

@alessiodevoto
Collaborator Author

Is this normal, @maxjeblick?

tests/integration/test_ruler.py ssssssssssssssssssssssssssssssssssssssss [ 17%]

Comment thread kvpress/pipeline.py
logger.debug(f"Compressed Context Length: {cache.get_seq_length()}")

# Greedy decoding for each question
answers = []
Collaborator

Here, we don't exit the context manager after prefilling. This may break kvzip press.

Collaborator

@maxjeblick maxjeblick Sep 30, 2025

A straightforward solution might be to have two context managers:

should_perform_prefill_compression = press is not None and not isinstance(press, (DecodingPress, PrefillDecodingPress))
with press(self.model) if should_perform_prefill_compression else contextlib.nullcontext():

(and a subsequent with block for answer generation). This way, we ensure exactly the same control flow in case no decoding press is used.
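A minimal, runnable sketch of this selection logic, using empty stand-in press classes (BasePress, DecodingPress, PrefillDecodingPress here are stubs, not the real kvpress classes):

```python
# Sketch of the suggested prefill-context selection, with stand-in classes.
import contextlib

class BasePress:
    def __call__(self, model):
        # The real press returns a context manager that installs forward hooks.
        return contextlib.nullcontext()

class DecodingPress(BasePress):
    pass

class PrefillDecodingPress(BasePress):
    pass

def should_compress_at_prefill(press) -> bool:
    """Prefill compression applies only to presses that are not decoding presses."""
    return press is not None and not isinstance(press, (DecodingPress, PrefillDecodingPress))

def run_prefill(press, model):
    # Enter the press context only for prefill-time presses; a decoding press
    # instead stays active through the subsequent generation block.
    ctx = press(model) if should_compress_at_prefill(press) else contextlib.nullcontext()
    with ctx:
        pass  # the prefill forward pass would run here

print(should_compress_at_prefill(BasePress()))      # True
print(should_compress_at_prefill(DecodingPress()))  # False
```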

Collaborator Author

As @maxjeblick pointed out, KVZip is not supported because it is not a scorer press.

Collaborator Author

Ok sorry @maxjeblick, I missed the problem. I tried the two context managers approach, wdyt?

Collaborator

That looks good!

Comment thread evaluation/evaluate_decoding.py Outdated
Collaborator

@maxjeblick maxjeblick left a comment

Thanks a lot for working on decoding press and adapting the code!

Some comments:

  • IMO, decoding press can be refactored a bit; in particular, cache handling (quantized/non-quantized) now appears in various presses and can be factored out.
  • The notebook probably needs to be rerun to produce correct output (max generation length is too low). The generated text output could also be formatted more nicely for display.
  • The PR changes the pipeline logic: the with press context manager now exits AFTER generation, not before. This will most likely cause kvzip press to no longer work (it relies on the context manager exiting after prefilling).

Comment thread kvpress/presses/decoding_press.py Outdated
Target number of tokens to keep after compression.
hidden_states_buffer_size : int, default=128
Maximum number of hidden states to keep before compression. Larger values use more GPU memory.
Note: Some presses don't need buffered hidden states and can set this to 0 to use only the
Collaborator

typo

Comment thread kvpress/presses/decoding_press.py
Comment thread kvpress/presses/decoding_press.py
Comment thread kvpress/presses/decoding_press.py Outdated
)

cache_layer = cache.layers[module.layer_idx]
if isinstance(cache, QuantizedCache):
Collaborator

This is a candidate for refactoring, as it also appears in the base presses.

Collaborator Author

Moved to utils

return output
# print(f"Adding hidden states to buffer: {hidden_states.shape}")
# Add current hidden states to buffer for this layer
self.hidden_states_buffer[layer_idx].append(hidden_states.detach().clone())
Collaborator

The hidden states buffer might be longer than hidden_states_buffer_size.
IMO, the code makes sense; we may need to adapt the docstring to be more explicit.

Collaborator

We should already use torch.cat, s.t. self.hidden_states_buffer[layer_idx] is always a tensor; that makes it easier to use.

Collaborator Author

We say that in the docstring: "Buffered hidden states from recent decoding steps (shape: [batch, buffer_len, hidden_dim])"

Collaborator Author

We should already use torch.cat

Maybe it is cleaner like this? Handling the first torch.cat when the buffer is empty would require extra complexity and less readable code.
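The trade-off being discussed can be shown with a toy version of the buffer, using Python lists in place of tensors (list concatenation stands in for torch.cat; the class and method names are hypothetical, not the kvpress API):

```python
# Toy version of the buffered-hidden-states discussion.

class HiddenStatesBuffer:
    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self.chunks: list[list[float]] = []  # one chunk per decoding step

    def append(self, hidden_states: list[float]) -> None:
        # Keeping a list of chunks avoids special-casing the first append
        # (the eager-concatenation alternative needs an "empty buffer" branch).
        self.chunks.append(hidden_states)

    def recent(self) -> list[float]:
        # Flatten on read and keep only the most recent buffer_size entries;
        # the stored chunks may therefore temporarily exceed buffer_size.
        merged: list[float] = []
        for chunk in self.chunks:
            merged.extend(chunk)
        return merged[-self.buffer_size:]

buf = HiddenStatesBuffer(buffer_size=3)
buf.append([1.0, 2.0])
buf.append([3.0, 4.0])
print(buf.recent())  # [2.0, 3.0, 4.0]
```

Eagerly concatenating on every append keeps the buffer a single tensor (easier for consumers), while the list-of-chunks version keeps the append path simpler; either works, as long as the docstring states which invariant holds.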

logger.debug(f"Applied decoding compression: " f"keys.shape: {keys.shape}, values.shape: {values.shape}")

# Update cache with compressed keys and values
if isinstance(cache, QuantizedCache):
Collaborator

Again: Could become a dedicated util function.

Comment thread kvpress/presses/prefill_decoding_press.py
Comment thread pyproject.toml Outdated
[project]
name = "kvpress"
version = "0.3.0"
version = "0.3.1"
Collaborator

I wouldn't update the version.
Decoding press will become version 1.0.0. With this PR we are adding the functionality, but we won't cut a release yet, so we can test this feature more thoroughly.

Comment thread kvpress/presses/generation/README.md Outdated
@@ -0,0 +1,111 @@
# Generation Presses (Experimental)
Collaborator

We can move this to the main readme in a new dropdown section.

Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
@alessiodevoto
Collaborator Author

alessiodevoto commented Oct 1, 2025

/ok to test a0053c9

1 similar comment

Comment thread kvpress/presses/base_press.py Outdated
@@ -132,13 +134,7 @@ def forward_hook(self, module: nn.Module, input: list[torch.Tensor], kwargs: dic

cache_layer = cache.layers[module.layer_idx]
if isinstance(cache, QuantizedCache):
Collaborator

We could also extract keys, values = extract_key_values(cache_layer).
WDYT?

Collaborator Author

Do you mean just

def extract_key_values(layer): return layer.keys, layer.values

Maybe a bit overkill? In the end we are just accessing two fields.

Collaborator

Maybe a bit overkill in the end we are just accessing two fields ?

The same if-else block is present in several parts of the code.
To me, it thus makes sense to extract the whole block.

Collaborator Author

I see, makes sense!
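The agreed-upon helper could look roughly like this, using stand-in cache-layer classes (the real transformers/kvpress cache classes differ; the names below are illustrative):

```python
# Hedged sketch of an extract_key_values helper centralizing the
# quantized/non-quantized if-else that recurs across presses.

class CacheLayer:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values

class QuantizedCacheLayer:
    def __init__(self, keys, values):
        # A quantized layer stores compressed keys/values; dequantize() restores
        # them (identity here, purely for illustration).
        self._qkeys, self._qvalues = keys, values

    def dequantize(self):
        return self._qkeys, self._qvalues

def extract_key_values(cache_layer):
    """One home for the quantized/non-quantized branch, instead of copies in each press."""
    if isinstance(cache_layer, QuantizedCacheLayer):
        return cache_layer.dequantize()
    return cache_layer.keys, cache_layer.values

print(extract_key_values(CacheLayer([1], [2])))  # ([1], [2])
```

Extracting the whole branch (not just the two-field access) is what pays off: every call site shrinks to one line and the quantization handling can change in a single place.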

Comment thread kvpress/presses/decoding_press.py
Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
@alessiodevoto
Collaborator Author

/ok to test a2982f9

Comment thread kvpress/pipeline.py Outdated

with press(self.model) if press is not None else contextlib.nullcontext():
# We only perform prefill compression if the press is not a decoding or prefill decoding press
perform_prefill_compression = press is not None and not isinstance(press, (DecodingPress, PrefillDecodingPress))
Collaborator

PrefillDecodingPress needs to be excluded.

Collaborator Author

Right, my bad 🫠

Signed-off-by: alessiodevoto <devoto.alessio@gmail.com>
@alessiodevoto
Collaborator Author

@maxjeblick could you also check the README, to see if it is clear enough?

@alessiodevoto
Collaborator Author

/ok to test 31cf83c

@maxjeblick
Collaborator

Moved PR to #139 due to DCO issues.
@alessiodevoto I added you as co-author in b16417b which contains the content of this PR.

@maxjeblick maxjeblick closed this Oct 13, 2025