## What's Changed
A new release, and one that once again took longer than planned. It brings plenty of new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2-quantized models across multiple GPUs. This should be the fastest multi-GPU option for ExllamaV2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- `min_tokens` sampling param: You can now set a minimum number of tokens to generate.
- Fused MoE for AWQ and GPTQ quants: The AWQ and GPTQ kernels have been updated with optimized fused MoE code and should be significantly faster now.
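The idea behind a `min_tokens` constraint can be sketched in a few lines: the EOS token is simply masked out of the logits until the minimum length is reached. This is an illustrative sketch, not Aphrodite's actual implementation; the function name and EOS id are made up for the example.

```python
EOS_TOKEN_ID = 2  # illustrative EOS token id


def apply_min_tokens(logits: list[float], num_generated: int,
                     min_tokens: int) -> list[float]:
    """Suppress the EOS logit while fewer than min_tokens have been generated."""
    if num_generated < min_tokens:
        logits = logits.copy()
        logits[EOS_TOKEN_ID] = float("-inf")
    return logits
```

Once `num_generated` reaches `min_tokens`, the logits pass through untouched and the model is free to stop.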
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run Aphrodite on CPU-only systems! An AVX512-capable CPU is required for now.
- Speculative Decoding: Finally here! You can either use a separate draft model, or use built-in prompt lookup decoding with an n-gram model.
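Prompt lookup decoding needs no draft model: it proposes draft tokens by matching the most recent n-gram against an earlier occurrence in the sequence and copying what followed it. A minimal sketch of that matching step (the function name and signature are illustrative, not Aphrodite's API):

```python
def prompt_lookup(tokens: list[int], ngram_size: int,
                  num_draft: int) -> list[int]:
    """Propose up to num_draft draft tokens by finding an earlier
    occurrence of the trailing n-gram and copying its continuation."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan earlier positions, most recent match first.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            continuation = tokens[start + ngram_size:
                                  start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []
```

The proposed tokens are then verified in a single forward pass of the target model, which accepts the longest matching prefix.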
- Chunked Prefill: Before this, Aphrodite would process a prompt in a single chunk of up to the model's full context length. You can now enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit. Does not currently work with context shift or FP8 KV cache.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and reuse them.
- FP8 E4M3 KV Cache: ROCm only for now; support will be extended to NVIDIA soon. E4M3 has higher quality than E5M2, but does not improve throughput.
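The chunking step behind chunked prefill is simple to picture: the prompt is split into fixed-size pieces so prefill work can be interleaved with decode steps instead of monopolizing a batch. A rough sketch under the default chunk size of 768 (the function name is illustrative, not Aphrodite's internal API):

```python
def split_prefill(prompt_tokens: list[int],
                  chunk_size: int = 768) -> list[list[int]]:
    """Split a prompt into fixed-size chunks for incremental prefill."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]
```

A 2000-token prompt, for instance, becomes two full 768-token chunks plus a 464-token remainder.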
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt longer than 1024 tokens.
- Support for Llava vision models: Currently, Llava 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT-4V-compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: Support has been added for EETQ, a state-of-the-art 8-bit quantization method.
- Arbitrary GGUF model support: Previously only Llama-architecture GGUF models were supported; now any GGUF model can be loaded. You will still need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply run `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion is still needed.
- NVIDIA P100/GP100 support: Support has been restored.
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3