## What's Changed
A new release, and one that once again took longer than planned. It brings plenty of new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2-quantized models across multiple GPUs. This should be the fastest multi-GPU option for ExllamaV2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- `min_tokens` sampling param: You can now set a minimum number of tokens to generate.
- Fused MoE for AWQ and GPTQ quants: The AWQ and GPTQ kernels have been updated with optimized fused MoE code and should be significantly faster now.
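The idea behind a `min_tokens` constraint can be sketched in a few lines: the EOS token is simply masked out of the logits until the minimum length is reached. This is an illustrative sketch, not Aphrodite's actual implementation; the function name and EOS id are made up for the example.

```python
EOS_TOKEN_ID = 2  # illustrative EOS token id


def apply_min_tokens(logits: list[float], num_generated: int,
                     min_tokens: int) -> list[float]:
    """Suppress the EOS logit while fewer than min_tokens have been generated."""
    if num_generated < min_tokens:
        logits = logits.copy()
        logits[EOS_TOKEN_ID] = float("-inf")
    return logits
```

Once `num_generated` reaches `min_tokens`, the logits pass through untouched and the model is free to stop.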
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run Aphrodite on CPU-only systems! An AVX512-capable CPU is required for now.
- Speculative Decoding: Finally here! You can either use a separate draft model, or use built-in prompt lookup decoding with an n-gram model.
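Prompt lookup decoding needs no draft model: it proposes draft tokens by matching the most recent n-gram against an earlier occurrence in the sequence and copying what followed it. A minimal sketch of that matching step (the function name and signature are illustrative, not Aphrodite's API):

```python
def prompt_lookup(tokens: list[int], ngram_size: int,
                  num_draft: int) -> list[int]:
    """Propose up to num_draft draft tokens by finding an earlier
    occurrence of the trailing n-gram and copying its continuation."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan earlier positions, most recent match first.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            continuation = tokens[start + ngram_size:
                                  start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []
```

The proposed tokens are then verified in a single forward pass of the target model, which accepts the longest matching prefix.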
- Chunked Prefill: Before this, Aphrodite would process a prompt in a single chunk of up to the model's full context length. You can now enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit. Does not currently work with context shift or FP8 KV cache.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and reuse them.
- FP8 E4M3 KV Cache: ROCm only for now; support will be extended to NVIDIA soon. E4M3 has higher quality than E5M2, but does not improve throughput.
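The chunking step behind chunked prefill is simple to picture: the prompt is split into fixed-size pieces so prefill work can be interleaved with decode steps instead of monopolizing a batch. A rough sketch under the default chunk size of 768 (the function name is illustrative, not Aphrodite's internal API):

```python
def split_prefill(prompt_tokens: list[int],
                  chunk_size: int = 768) -> list[list[int]]:
    """Split a prompt into fixed-size chunks for incremental prefill."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]
```

A 2000-token prompt, for instance, becomes two full 768-token chunks plus a 464-token remainder.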
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt longer than 1024 tokens.
- Support for Llava vision models: Currently, Llava 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT-4V-compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: Support has been added for EETQ, a state-of-the-art 8-bit quantization method.
- Arbitrary GGUF model support: Previously only Llama-architecture GGUF models were supported; now any GGUF model can be loaded. You will still need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply run `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion is still needed.
- NVIDIA P100/GP100 support: Support has been restored.
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3