Add return_cache option to TransformerBridge.generate #1337

Merged
jlarson4 merged 3 commits into TransformerLensOrg:dev from RecreationalMath:return-cache-in-generate
May 28, 2026

@RecreationalMath
Contributor

Description

Adds an opt-in return_cache flag to TransformerBridge.generate(). When return_cache=True, generate returns (output, cache) where cache is a standard ActivationCache over the full prompt+generated sequence, identical to run_with_cache(output). This resolves the gap in #697, where run_with_cache only covers the prompt and generate returns no activations. A names_filter argument lets callers scope the cache, and a device argument offloads the returned cache to another device (e.g. CPU); the cache over prompt+max_new_tokens can be large, so the docstring notes the memory cost.
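
A minimal usage sketch of the flag described above. The helper name `generate_with_cache` is hypothetical; the keyword names `return_cache`, `names_filter`, `device`, and `max_new_tokens` are the ones named in this description, and the helper only assumes that `model` exposes a `generate` method accepting them. It is an illustration of the call shape, not the implementation in this PR.

```python
from typing import Any, Optional, Tuple


def generate_with_cache(
    model: Any,
    prompt: str,
    max_new_tokens: int = 20,
    names_filter: Any = None,
    device: Optional[str] = "cpu",
) -> Tuple[Any, Any]:
    """Hypothetical helper: call generate() with the opt-in cache flag.

    `model` is assumed to be a TransformerBridge-like object whose
    generate() accepts the keyword arguments described in this PR.
    """
    # With return_cache=True, generate is described as returning a pair
    # (output, cache) instead of the output alone.
    output, cache = model.generate(
        prompt,
        max_new_tokens=max_new_tokens,
        return_cache=True,
        names_filter=names_filter,  # scope which hook points are cached
        device=device,  # offload the returned cache, e.g. to CPU
    )
    return output, cache
```

With a real bridge instance the call would look like `output, cache = generate_with_cache(bridge, "Hello")`. Since the cache spans prompt + generated tokens, passing a `names_filter` and a CPU `device` is the cheaper default for long generations.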

Semantics are "recompute one clean forward over the generated sequence," so the cache is consistent with the rest of TransformerLens, includes attention patterns and all hook points, and avoids the cached-eager-attention path behind #1322. For a causal LM this is numerically identical to capturing during generation (verified), without the ragged per-step shapes. This PR covers single-sequence decoder-only text; encoder-decoder / SSM / multimodal / batched / inputs_embeds raise a clear error pointing to the run_with_cache-on-output workaround. Capturing during generation (for active-hook/steering scenarios) remains available via with model.hooks(...) around generate and can be added later as an explicit opt-in.
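
The equivalence claim above can be illustrated with a toy causal "model" in plain Python. Everything here (`activation`, `forward`, `next_token`, `generate_capturing`) is a hypothetical stand-in, not TransformerLens code; the only property it shares with a causal LM is that the activation at a position depends on that position's prefix alone. Under that assumption, step-wise capture and one clean recompute agree on every position both of them cover.

```python
from typing import List, Tuple


def activation(prefix: List[int]) -> int:
    """Toy 'hidden state' for the last position; depends only on the prefix."""
    h = 0
    for tok in prefix:
        h = (h * 31 + tok) % 997
    return h


def forward(tokens: List[int]) -> List[int]:
    """One full forward pass: one activation per position, causal by construction."""
    return [activation(tokens[: i + 1]) for i in range(len(tokens))]


def next_token(h: int) -> int:
    """Greedy toy 'unembedding': map the last activation to a token id in [0, 50)."""
    return h % 50


def generate_capturing(prompt: List[int], max_new_tokens: int) -> Tuple[List[int], List[int]]:
    """Generate step by step, capturing activations as they are produced."""
    tokens = list(prompt)
    captured = forward(tokens)  # prompt positions, computed in the first step
    for step in range(max_new_tokens):
        tokens.append(next_token(captured[-1]))
        # In this toy loop the final sampled token is never fed back,
        # so the step-wise capture ends one position short.
        if step < max_new_tokens - 1:
            captured.append(activation(tokens))
    return tokens, captured


tokens, during = generate_capturing([3, 14, 15], max_new_tokens=5)
recomputed = forward(tokens)  # one clean pass over prompt + generated tokens

# Same values wherever both exist; the recompute also covers the last token.
assert during == recomputed[:-1]
assert len(recomputed) == len(tokens)
```

The recompute returns a single rectangular result over the whole sequence, whereas the step-wise capture arrives as one prompt-sized chunk followed by single-position chunks, which is the ragged shape problem the recompute avoids.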

Fixes #697

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

…ge.generate

generate(return_cache=True) now also returns an ActivationCache for the full prompt + generated sequence, identical to run_with_cache(output), via one clean recompute forward over the output. Adds names_filter and device passthroughs to scope and offload the cache. Supported for single-sequence decoder-only text generation; encoder-decoder, SSM, multimodal, batched, and inputs_embeds inputs raise a clear NotImplementedError pointing to run_with_cache. Device offload moves cache_dict directly to avoid ActivationCache.to's spurious move_model DeprecationWarning.
@RecreationalMath
Contributor Author

RecreationalMath commented May 27, 2026

Heads-up on a small follow-up, not a blocker for this PR.

The device= offload here moves the cache tensors directly (cache.cache_dict = {k: v.to(device) ...}) rather than passing device= into run_with_cache. That is deliberate, as run_with_cache(device=) currently moves the whole model to the cache device and never restores it (filed as #1336). The direct move is correct and side-effect-free, but it offloads the cache after it is built, so it does not reduce peak memory.

Once #1336 is fixed, this can be simplified to a run_with_cache(output_tokens, names_filter=names_filter, device=device) passthrough, which offloads at capture time and therefore lowers peak memory for large caches.
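
The peak-memory difference between the two strategies can be sketched with toy accounting. The function names and sizes below are hypothetical, and the model ignores weights and the forward pass's own working memory; it only counts how many cached units sit on the accelerator at once.

```python
from typing import List


def peak_offload_after_build(sizes: List[int]) -> int:
    """Keep every activation on the accelerator until the cache is complete."""
    resident = 0
    peak = 0
    for size in sizes:
        resident += size
        peak = max(peak, resident)
    # The whole cache is moved only now, after the peak was already reached.
    return peak


def peak_offload_at_capture(sizes: List[int]) -> int:
    """Move each activation off the accelerator as soon as it is captured."""
    resident = 0
    peak = 0
    for size in sizes:
        resident += size
        peak = max(peak, resident)
        resident -= size  # offloaded immediately, so it stops counting
    return peak


sizes = [4, 4, 16, 4]  # hypothetical per-hook activation sizes
assert peak_offload_after_build(sizes) == 28  # sum of all activations
assert peak_offload_at_capture(sizes) == 16  # largest single activation
```

In this accounting the post-hoc move peaks at the full cache size, while capture-time offload peaks at the largest single activation, which is why the passthrough is the better long-term shape for large caches.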

I'd recommend keeping this as a self-contained version and switching to the passthrough in a small follow-up PR once #1336 lands. But if you prefer to hold this PR until #1336 is in and use the passthrough here directly, let me know @jlarson4

@jlarson4 jlarson4 merged commit f676d8a into TransformerLensOrg:dev May 28, 2026
24 checks passed
@jlarson4
Collaborator

I agree with your assessment here, and think it is fine to merge this as-is with the temporary solution. Typically, I'd ask you to add a note to #1336 that it should update this as a side-effect of the solution, but since you're the one handling that issue, I'll trust that you take care of it.

Thank you for the thorough investigation of both this issue and the new one you discovered! Great work
