
v0.5.3


What's Changed

A new release, and one that took too long again. We do have some cool new features to show for it, though.

  • ExLlamaV2 tensor parallel: You can now run ExLlamaV2-quantized models on multiple GPUs. This should be the fastest multi-GPU experience with ExLlamaV2 models.
  • Support for Command-R+
  • Support for DBRX
  • Support for Llama-3
  • Support for Qwen 2 MoE
  • min_tokens sampling param: You can now set a minimum number of tokens to generate (a usage sketch follows this list).
  • Fused MoE for AWQ and GPTQ quants: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
  • CMake build system: Slightly faster, much cleaner builds.
  • CPU support: You can now run Aphrodite on CPU-only systems! An AVX512-compatible CPU is required for now.
  • Speculative Decoding: Speculative decoding is finally here! You can either use a draft model or use prompt-lookup decoding with the built-in ngram model (a sketch follows this list).
  • Chunked Prefill: Previously, Aphrodite processed prompts in chunks equal to the model's context length. You can now enable this option (via --enable-chunked-prefill) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit (a sketch follows this list). It does not currently work with context shift or the FP8 KV cache.
  • Context Shift reworked: Context shift finally works. Enable it with --context-shift, and Aphrodite will cache processed prompts and reuse them.
  • FP8 E4M3 KV Cache: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
  • Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass truncate_prompt_tokens=1024 to truncate any prompt longer than 1024 tokens (a sketch follows this list).
  • Support for LLaVA vision models: Currently, LLaVA 1.5 is supported. The next release should add 1.6, along with a proper GPT-4V-compatible API.
  • LM Format Enforcer: You can now use LMFE for guided generations.
  • EETQ Quantization: Support has been added for EETQ, a state-of-the-art 8-bit quantization method.
  • Arbitrary GGUF model support: GGUF support was previously limited to Llama models; now any GGUF model is supported. You will still need to convert the model beforehand, however.
  • Aphrodite CLI app: You no longer have to type python -m aphrodite.... Simply type aphrodite run meta-llama/Meta-Llama-3-8B to get started. Pass extra flags as normal.
  • Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
  • NVIDIA P100/GP100 support: Support has been restored.
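
Below are short usage sketches for some of the items above. First, the min_tokens sampling parameter, shown as a minimal sketch assuming the Python API exposes vLLM-style LLM and SamplingParams entry points (the model name is just an example):

```python
from aphrodite import LLM, SamplingParams

# min_tokens suppresses EOS until at least that many tokens are generated;
# max_tokens still caps the total output length.
params = SamplingParams(min_tokens=32, max_tokens=256)

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
outputs = llm.generate(["Write a short story about a lighthouse."], params)
print(outputs[0].outputs[0].text)
```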
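
Next, speculative decoding with a draft model. This is a hedged sketch: the keyword names follow vLLM-style engine arguments and the draft-model choice is hypothetical, so check the CLI help for the exact flags:

```python
from aphrodite import LLM, SamplingParams

# Draft-model speculation: a small draft model proposes several tokens per
# step and the target model verifies them in a single forward pass.
# The speculative_* keyword names are assumptions for Aphrodite's API.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    speculative_model="JackFram/llama-68m",  # hypothetical draft model
    num_speculative_tokens=5,                # tokens proposed per step
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```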
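
For chunked prefill, the release names the --enable-chunked-prefill server flag; the Python keyword below mirrors that flag name and is an assumption:

```python
from aphrodite import LLM

# Prefill long prompts in fixed-size chunks (768 tokens by default, per the
# notes above) instead of processing the whole prompt at once.
llm = LLM(model="meta-llama/Meta-Llama-3-8B", enable_chunked_prefill=True)
```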
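
Finally, prompt auto-truncation through the OpenAI-compatible API server. The host, port, and model name below are placeholders; truncate_prompt_tokens is the parameter named above:

```python
import requests

# Prompts longer than 1024 tokens are left-truncated by the server,
# keeping only the last 1024 prompt tokens.
resp = requests.post(
    "http://localhost:2242/v1/completions",  # placeholder host/port
    json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "A very long prompt ...",
        "max_tokens": 64,
        "truncate_prompt_tokens": 1024,
    },
)
print(resp.json()["choices"][0]["text"])
```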

Thanks to all the new contributors!

Full Changelog: v0.5.2...v0.5.3