
llamafile v0.7

@jart released this 31 Mar 04:21 · 87 commits to main since this release · c7780c4

llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

This release improves the performance and accuracy of both CPU and GPU computation, and includes a security fix.

  • tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops (see the sketch after this list). This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support is now faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
  • Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to F16, BF16, Q8_0, Q4_0, and F32 weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere from 30% to 500% faster than llama.cpp upstream.
  • Support for the bf16 data type (the Google Brain floating point format) has been introduced for CPU only; a conversion sketch follows this list.
  • Support for AVX512 has been introduced. Owners of CPUs like Zen 4 can expect to see 10x faster prompt evaluation times.
  • If you want to use llamafile-0.7's [...] --recompile --gpu amd support on Windows, this release requires version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
  • This release includes a security fix for CVE-2024-23496 (see #294).
  • This release is synced with llama.cpp 2024-03-22 upstream.
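
For readers unfamiliar with the technique named in the first bullet, the sketch below shows the idea behind a Kahan-compensated dot product, the building block of a matvec. This is an illustration only, not llamafile's actual tinyBLAS kernel; the function name `kahan_dot` is hypothetical.

```cpp
#include <cstddef>

// Kahan-compensated dot product: carries the rounding error of each
// addition into the next step, so long sums lose far less precision
// than naive accumulation. (Note: aggressive -ffast-math settings can
// optimize the compensation away.)
float kahan_dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;  // running total
    float c = 0.0f;    // compensation: low-order bits lost so far
    for (size_t i = 0; i < n; ++i) {
        float y = a[i] * b[i] - c;  // apply the carried correction
        float t = sum + y;          // big + small: low bits of y are lost here...
        c = (t - sum) - y;          // ...recover them algebraically
        sum = t;                    // carry the correction into the next step
    }
    return sum;
}
```

A matvec is then just one such dot product per output row, which is why compensating the accumulation brings the results closer to cuBLAS.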
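And for the bf16 bullet: a bf16 value is simply the upper 16 bits of an IEEE-754 float32 (same sign and 8-bit exponent, mantissa truncated to 7 bits). The helpers below are a minimal sketch of the format, not llamafile's actual conversion code.

```cpp
#include <cstdint>
#include <cstring>

// float32 -> bf16 by truncation: keep the top 16 bits.
// (Real implementations usually round to nearest even and special-case NaN.)
uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> float32: restore the discarded low mantissa bits as zero.
float bf16_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```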