
llamafile v0.7

@jart released this 31 Mar 04:21 · 87 commits to main since this release · c7780c4

llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

This release improves the performance and accuracy of both CPU and GPU computation, and includes a security fix.

  • tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops (see the sketch after this list). This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support is now faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
  • Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to F16, BF16, Q8_0, Q4_0, and F32 weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere from 30% to 500% faster than llama.cpp upstream.
  • Support for the bf16 data type (the Google Brain floating point format) has been introduced for CPU only; a conversion sketch follows this list.
  • Support for AVX512 has been introduced. Owners of CPUs like Zen 4 can expect to see 10x faster prompt evaluation times.
  • If you want to use llamafile-0.7's [...] --recompile --gpu amd support on Windows, this release requires version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
  • This release includes a security fix for CVE-2024-23496 (see #294).
  • This release is synced with llama.cpp 2024-03-22 upstream.
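
For readers unfamiliar with the technique named in the first bullet, the sketch below shows the idea behind a Kahan-compensated dot product, the building block of a matvec. This is an illustration only, not llamafile's actual tinyBLAS kernel; the function name `kahan_dot` is hypothetical.

```cpp
#include <cstddef>

// Kahan-compensated dot product: carries the rounding error of each
// addition into the next step, so long sums lose far less precision
// than naive accumulation. (Note: aggressive -ffast-math settings can
// optimize the compensation away.)
float kahan_dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;  // running total
    float c = 0.0f;    // compensation: low-order bits lost so far
    for (size_t i = 0; i < n; ++i) {
        float y = a[i] * b[i] - c;  // apply the carried correction
        float t = sum + y;          // big + small: low bits of y are lost here...
        c = (t - sum) - y;          // ...recover them algebraically
        sum = t;                    // carry the correction into the next step
    }
    return sum;
}
```

A matvec is then just one such dot product per output row, which is why compensating the accumulation brings the results closer to cuBLAS.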
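And for the bf16 bullet: a bf16 value is simply the upper 16 bits of an IEEE-754 float32 (same sign and 8-bit exponent, mantissa truncated to 7 bits). The helpers below are a minimal sketch of the format, not llamafile's actual conversion code.

```cpp
#include <cstdint>
#include <cstring>

// float32 -> bf16 by truncation: keep the top 16 bits.
// (Real implementations usually round to nearest even and special-case NaN.)
uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> float32: restore the discarded low mantissa bits as zero.
float bf16_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```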