CASK

Core-Aware Selective KV Compression for Reasoning Traces

CASK treats reasoning-time KV compression as a behavior-preserving selective consolidation problem rather than a pure scoring problem. The paper-facing method is cask: protect a small core of reasoning states, selectively consolidate redundant scratch states, and use a two-stage policy for prompt-heavy runs.

At A Glance

Item	Current answer
Main method	`cask`
Main baseline	`triattention`
Main metric family	teacher-forced reference fidelity
Headline claim	CASK improves the minimum usable budget frontier rather than trying to be the most aggressive compressor at every setting
Prompt-heavy policy	Stage 1 prefix eviction, then Stage 2 decode consolidation
Primary runner	`python scripts/cli.py run-one ... --method cask`
Main replay harness	`scripts/replay_reference_fidelity.py`
Command provenance	`artifacts/COMMAND_MAP.md`
Figure assets	`docs/assets/`
Paper source	`paper/`

Method Summary

Stage	What happens	Why it exists
Stage 1: prefix compression	TriAttention-style eviction plus a small coverage reserve	Prevent prompt-heavy runs from exhausting the budget before decode starts
Stage 2: decode compression	Split decode states into `protected core` and `mergeable scratch`, then consolidate scratch only	Preserve answer-critical states while compressing redundant reasoning work

Safety Hook

CASK is built around two paper-facing diagnostics: lost representative mass and kappa-dispersion. If a scratch group preserves most of the horizon-weighted mass and its members stay close to the representative under the calibrated kappa geometry, then replacing that group with one representative changes the next-token score only by a small additive amount. This is the main reason the repo reports fidelity and saved ratio together rather than treating compression as a pure ranking problem.

Mode Map

Mode	Purpose	Status
`fullkv`	Uncompressed reference run	primary reference
`triattention`	Original eviction baseline	primary baseline
`cask`	CASK mainline implementation	primary paper method
`expectedattention`	Closest prior-style comparison kept in-tree	optional baseline
`horizonkv`	Legacy internal alias for archived Phase 1 scorer work	not paper-facing

If you are running the current paper candidate, use --method cask.

Benchmark Snapshot

The tables below report both fidelity and terminal saved ratio so the quality-memory tradeoff is visible at a glance.

H100 Reasoning Replay Gate

Slice	Budget	Tri Top-1	CASK Top-1	Tri Mean NLL	CASK Mean NLL	Tri Saved Ratio	CASK Saved Ratio
`AIME24 ref6`	`256`	`86.10`	`88.43`	`0.463`	`0.359`	`65.31%`	`65.28%`
`AIME24 ref6`	`384`	`88.25`	`90.69`	`0.383`	`0.268`	`61.55%`	`61.55%`
`AIME24 ref6`	`512`	`89.42`	`91.68`	`0.333`	`0.233`	`43.57%`	`43.52%`
`AIME25 ref6`	`256`	`85.71`	`86.77`	`0.500`	`0.504`	`63.37%`	`55.87%`
`AIME25 ref6`	`384`	`89.13`	`90.32`	`0.357`	`0.313`	`59.52%`	`63.26%`
`AIME25 ref6`	`512`	`89.94`	`91.68`	`0.321`	`0.254`	`44.76%`	`37.25%`

Detail: H100 reasoning replay package

Prompt-Heavy Replay Highlights

Dataset	Budget	Tri Top-1	CASK Top-1	Tri Mean NLL	CASK Mean NLL	CASK Saved Ratio	Read
`qasper`	`256`	`67.19`	`71.09`	`1.315`	`1.247`	`90.90%`	same-budget replay win
`multi_news`	`384`	`53.71`	`61.33`	`2.052`	`1.540`	`84.07%`	decode-active replay win
`hotpotqa`	`384`	`81.25`	`96.88`	`1.344`	`0.110`	`96.49%`	strongest same-budget witness
`2wikimqa`	`384`	`59.38`	`56.25`	`3.415`	`2.397`	`94.41%`	retained boundary

Detail: Prompt-heavy replay readout

Output-Level Bridge Highlights

Task	Comparison	Lexical / Sequence	Semantic Sim.	Official Metric	CASK Terminal Saved Ratio	Read
`qasper`	`CASK @ 256` vs `TriAttention @ 512`	`0.238 > 0.173`	`0.791 > 0.678`	`12.77 > 11.94`	`90.90%`	clean budget crossing on all three axes
`multi_news`	`CASK @ 384` vs `TriAttention @ 384`	`0.169 > 0.000`	`0.952 > 0.452`	`15.16 > 0.00`	`84.07%`	strongest decode-active output bridge
`hotpotqa`	`CASK @ 256` vs `TriAttention @ 256`	`1.000 = 1.000`	`1.000 = 1.000`	`27.27 = 27.27`	`97.57%`	non-regression parity

Detail: H100 actual-output bridge package

Current Read

Axis	Current read
Reasoning replay	CASK beats TriAttention at the same budget on the tracked H100 gate and shows partial crossing
Prompt-heavy replay	CASK has a strong same-budget replay package, with `multi_news` and `vcsum` as the current replay-level decode-active witnesses
Output-level bridge	`multi_news` remains the strongest decode-active output bridge; `vcsum` is a lexical-vs-semantic boundary rather than a clean headline win
Savings interpretation	The claim is not "always compress more"; it is "keep full-KV behavior alive at a lower usable budget"
Claim boundary	Active decode regime and `prefix_budget_exhausted` regime must be separated explicitly

Artifact Index

Start here: artifacts/README.md

Command trace: artifacts/COMMAND_MAP.md

Which package should you open?

If you want to know...	Open this	Why this is the right package
whether CASK wins the main reasoning replay gate	H100 reasoning replay gate	contains the `AIME24` / `AIME25` synchronized replay tables and crossing read
whether replay gains show up in actual generation	H100 actual-output bridge	contains the tracked `qasper`, `multi_news`, and `hotpotqa` output bridge rows
how to read the full prompt-heavy story	H100 prompt-heavy follow-up	separates replay-level decode-active wins from output-level bridge rows and `prefix_budget_exhausted` boundaries
which figures are current and ready to paste into the draft	Figure asset pack	bundles the synchronized PNG/PDF versions of the reasoning gate, prompt-heavy, bridge, and method-overview figures

Package Roles

Package	Use it for	Do not use it for
H100 reasoning replay gate	main reasoning replay headline	final benchmark-accuracy headline
H100 actual-output bridge	showing replay-to-output linkage	broad decode-stage generalization claims by itself
H100 prompt-heavy follow-up	regime separation and prompt-heavy narrative	pretending every prompt-heavy task is decode-active

Installation

Python 3.10+ is required.

git clone https://github.com/Skyline-23/CASK.git
cd CASK
pip install -e .

For Linux benchmarking and vLLM runtime work, install the matching CUDA, FlashAttention, and vLLM stack separately. The HuggingFace replay harnesses also run on a single 16 GB consumer GPU with sdpa.

Quick Start

Task	Command
FullKV reference	`python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method fullkv`
TriAttention baseline	`python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method triattention --budget 104 --stats-path cask/calibration/for_aime24_experiment/qwen3_8b.pt`
CASK mainline	`python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method cask --budget 104 --stats-path cask/calibration/for_aime24_experiment/qwen3_8b.pt`
Teacher-forced replay	`python scripts/replay_reference_fidelity.py --reference ... --model-path ... --method cask --budget 104 --triattention-stats-file ...`

Replay example:

python scripts/replay_reference_fidelity.py \
    --reference experiments/outputs/math500/Qwen3-8B/sample1/fullkv/fullkv_selection_geometry_reference \
    --model-path experiments/models/Qwen3-8B \
    --method cask \
    --budget 104 \
    --triattention-stats-file cask/calibration/for_aime24_experiment/qwen3_8b.pt \
    --attn-implementation sdpa \
    --count-prompt-tokens true \
    --slack-budget-trigger true \
    --json-output experiments/reports/geometry_teacher_forced_fidelity_vs_fullkv_sm104_v2.json

Repository Layout

Path	Purpose
`scripts/cli.py`	high-level experiment wrapper
`scripts/worker.py`	HuggingFace execution path
`scripts/replay_reference_fidelity.py`	teacher-forced replay harness
`scripts/run_replay_fidelity_frontier.py`	batch replay frontier launcher
`scripts/run_promptheavy_pack.py`	generic prompt-heavy fullkv + replay package planner/launcher
`scripts/run_replay_queue.ps1`	PowerShell queue wrapper for replay configs
`scripts/run_longbench_suite.py`	LongBench generation harness
`scripts/compare_kv_fidelity.py`	output-level comparison helper
`scripts/build_promptheavy_saved_ratio_audit.py`	package prompt-heavy replay summaries
`scripts/build_actual_bridge_artifacts.py`	package actual-output bridge summaries
`scripts/sync_reasoning_gate.py`	sync replay reports into the tracked reasoning-gate summaries
`scripts/build_paper_figures.py`	generate the current paper-facing figure pack under `docs/assets/`
`scripts/refresh_paper_figures.ps1`	optional sync + figure refresh wrapper
`paper/`	NeurIPS paper source, bibliography, and style files
`cask/methods/triattention.py`	TriAttention baseline implementation
`cask/methods/cask.py`	CASK implementation
`artifacts/`	tracked paper-facing summaries
`docs/assets/`	paper-facing rendered figures

Provenance

This codebase started from the TriAttention implementation and now serves as the active research repository for CASK. Some internal names remain for compatibility with existing tooling, but the paper-facing method is CASK.

License

Apache 2.0. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CASK

At A Glance

Method Summary

Safety Hook

Mode Map

Benchmark Snapshot

H100 Reasoning Replay Gate

Prompt-Heavy Replay Highlights

Output-Level Bridge Highlights

Current Read

Artifact Index

Which package should you open?

Package Roles

Installation

Quick Start

Repository Layout

Provenance

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
artifacts		artifacts
cask		cask
data		data
docs		docs
experiments		experiments
paper		paper
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

CASK

At A Glance

Method Summary

Safety Hook

Mode Map

Benchmark Snapshot

H100 Reasoning Replay Gate

Prompt-Heavy Replay Highlights

Output-Level Bridge Highlights

Current Read

Artifact Index

Which package should you open?

Package Roles

Installation

Quick Start

Repository Layout

Provenance

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages