MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Official PyTorch implementation of "MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering" [CVPR'26].

💡 Overview

Responding efficiently and accurately to user questions over long, live video streams (third-person or first-person view) remains challenging, especially when the questions concern fine-grained details from the distant past. Existing sparse-sampling and sliding-window approaches often trade off visual detail for efficiency. A video KV-cache memory is a good alternative, but per-frame caching neglects information granularity and introduces heavy redundancy. We therefore propose MuKV, a multi-grained KV-cache compression approach designed to improve streaming VideoQA. We highlight the following:

Multi-Grained Context: Represent past videos in hierarchically compressed KV tokens at segment, frame, and patch levels.
Redundancy Minimization: Adaptively trim irrelevant tokens using token attention importance and frequency signal.
Efficiency and Accuracy: Significantly improves QA accuracy without sacrificing offline memory or online QA efficiency. The advantage grows as video length increases.
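
To make the redundancy-minimization idea concrete, here is a minimal sketch of importance-based token trimming in plain Python. It is an illustration under stated assumptions, not the MuKV implementation: the names `select_top_tokens`, `compress_cache`, and `keep_ratio` are hypothetical, the importance scores are assumed to be precomputed per-token attention weights, and the frequency signal is left out because this README does not define it.

```python
from typing import List, Sequence, Tuple


def select_top_tokens(importance: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token (hypothetical policy for this sketch).
    num_keep = max(1, int(len(importance) * keep_ratio))
    # Rank token positions by score, highest first; ties keep the earlier token.
    ranked = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    # Restore temporal order so the retained cache entries stay in sequence.
    return sorted(ranked[:num_keep])


def compress_cache(
    keys: Sequence, values: Sequence, importance: Sequence[float], keep_ratio: float
) -> Tuple[list, list]:
    """Drop the key/value entries whose tokens were not selected."""
    kept = select_top_tokens(importance, keep_ratio)
    return [keys[i] for i in kept], [values[i] for i in kept]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.4]
    keys = [f"k{i}" for i in range(6)]
    values = [f"v{i}" for i in range(6)]
    print(compress_cache(keys, values, scores, keep_ratio=0.5))
```

In a multi-grained setting one could imagine calling such a routine with a different `keep_ratio` at the segment, frame, and patch levels; the actual selection rule lives in the model/ directory.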

Figure 1: Comparison with ReKV under different online inference token counts and video lengths.

🚀 Getting Started

1. Environment Setup

We provide a bash script that automatically sets up the exact dependencies in an isolated conda environment.

# It will create a conda env named 'mukv', install torch, flash-attn, transformers, etc.
bash prepare.sh

Activate the environment before proceeding:

conda activate mukv

2. Model Preparation

The core scripts are adapted to run across several large vision-language models (e.g., LLaVA-OneVision).

We support the official LLaVA-OneVision weights on Hugging Face:

0.5B Model: https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf

By default, the code points to the 0.5B model. The transformers library downloads the weights automatically the first time you run the server. You can also point to a pre-downloaded local copy with the --model_path argument.

3. Data Preparation

We conduct experiments primarily on RVS-Ego and RVS-Movie (MovieNet).

RVS-Ego & RVS-Movie: We follow the original Real-Time VideoQA benchmarks. Annotations and instructions can be obtained from the RVS Dataset Hugging Face repository.

Structure the annotations (.json/.csv) and video tensors (.npy/.mp4) inside the data/ directory exactly as shown below:

MuKV/
├── scripts/                # Execution Logic
├── model/                  # MuKV Implementation
├── assets/                 # Readme Images
├── data/
│   ├── rvs/
│   │   ├── ego/
│   │   │   ├── ego4d_oe.json
│   │   │   └── videos_npy_2fps/ (or videos/)
│   │   └── movie/
│   │       ├── movienet_oe.json
│   │       └── videos_npy/ (or videos/)
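
Because the scripts expect this exact layout, a quick check before launching an evaluation can save a failed run. The following is a minimal sketch, not part of the repository: `find_layout_problems` and `EXPECTED` are hypothetical names, and the only paths it knows about are the ones shown in the tree above.

```python
from pathlib import Path
from typing import List

# Annotation files and video folders taken from the tree above.
# Each dataset needs its annotation file plus at least one of the video folders.
EXPECTED = {
    "ego": ("ego4d_oe.json", ("videos_npy_2fps", "videos")),
    "movie": ("movienet_oe.json", ("videos_npy", "videos")),
}


def find_layout_problems(root: Path) -> List[str]:
    """Return readable problems; an empty list means the layout looks complete."""
    problems = []
    for dataset, (annotation, video_dirs) in EXPECTED.items():
        base = root / "data" / "rvs" / dataset
        if not (base / annotation).is_file():
            problems.append(f"missing annotation file: {base / annotation}")
        if not any((base / name).is_dir() for name in video_dirs):
            options = " or ".join(video_dirs)
            problems.append(f"missing video folder under {base}: expected {options}")
    return problems


if __name__ == "__main__":
    # Run from the root MuKV/ directory.
    for problem in find_layout_problems(Path(".")):
        print(problem)
```

The check only confirms that the files and folders exist; it does not validate the annotation contents or the video files themselves.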

⚡ Inference & Evaluation

We expose the entry points as simple run_mukv_<dataset>.py scripts inside the scripts/ folder. Run all Python commands from the root MuKV/ directory.

Evaluate on RVS-Ego (Open-Ended)

python scripts/run_mukv_rvs_ego.py \
    --model_path "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" \
    --anno_path "data/rvs/ego/ego4d_oe.json" \
    --video_format "mp4" \
    --enable_compression true \
    --enable_rerank true

Evaluate on RVS-Movie (Open-Ended)

python scripts/run_mukv_rvs_movie.py \
    --model_path "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" \
    --anno_path "data/rvs/movie/movienet_oe.json" \
    --enable_compression true

Logs, prediction CSVs, and inference-time memory statistics snapshots are collected automatically under the generated results/mukv/ directory.
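
If you prefer to drive the evaluations from Python, for example to sweep over several settings, the commands above can be assembled programmatically. This is a small sketch rather than a utility shipped with the repository: `build_command` is a hypothetical helper, and it uses only the script path and flags that appear in the RVS-Ego command above.

```python
import shlex
import sys
from typing import Dict, List, Union

Option = Union[str, bool]


def build_command(script: str, options: Dict[str, Option]) -> List[str]:
    """Turn an options dict into the argument list for one evaluation script."""
    command = [sys.executable, script]
    for name, value in options.items():
        # The commands above pass booleans as the literal strings "true"/"false".
        text = str(value).lower() if isinstance(value, bool) else str(value)
        command += [f"--{name}", text]
    return command


ego_command = build_command(
    "scripts/run_mukv_rvs_ego.py",
    {
        "model_path": "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
        "anno_path": "data/rvs/ego/ego4d_oe.json",
        "video_format": "mp4",
        "enable_compression": True,
        "enable_rerank": True,
    },
)

if __name__ == "__main__":
    # Print the command for inspection instead of launching it.
    print(shlex.join(ego_command))
```

To launch the run, pass the list to `subprocess.run(ego_command, check=True)` from the root MuKV/ directory, with the mukv conda environment active.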

Running via Shell Scripts

To run a preconfigured end-to-end evaluation without copying command-line arguments by hand, execute the ready-made shell scripts in scripts/sh/:

bash scripts/sh/run_mukv_rvs_ego.sh

🙏 Acknowledgements

Our method builds on the foundation laid by LLaVA-OneVision. We thank the authors for their open-source contributions.
