An efficient, training-free system that accelerates LLM inference by using entropy information from the context.
Experiments on the Llama, Qwen, and Pangu model series show that our method achieves high end-to-end speedup while maintaining generation quality. We currently release an implementation that supports the openPangu model series, including openPangu-Embedded-1B-v1.1 and openPangu-Embedded-7B-v1.1.
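This README does not spell out how the entropy signal is computed or used. As a rough illustration only, and not the released implementation, the sketch below computes the Shannon entropy of a next-token distribution from raw logits, which is one common way to quantify how uncertain a model is at a given position. The function names `softmax` and `shannon_entropy` are hypothetical and do not come from this repository.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits.

    Assumption: entropy is taken over a next-token distribution; the
    actual signal used by this project may be defined differently.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # A flat distribution is maximally uncertain; a peaked one is not.
    print(shannon_entropy([0.0, 0.0, 0.0, 0.0]))
    print(shannon_entropy([100.0, 0.0, 0.0]))
```

Low entropy suggests the model is confident at that position and high entropy suggests it is not; how such a signal translates into a speedup is specific to the method and is not shown here.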
pip install -r requirements.txt

To run the experiments on the LongBench dataset, use the following scripts:
# Run experiment
bash experiment/LongBench/evaluate_longbench.sh
# Run evaluation
python experiment/LongBench/eval.py