llamafile v0.8.7

@jart jart released this 24 Jun 15:00
· 81 commits to main since this release
b2f587c

This release includes important performance enhancements for quants.

  • 293a528 Performance improvements on Arm for legacy and k-quants (#453)
  • c38feb4 Optimized matrix multiplications for i-quants on __aarch64__ (#464)

This release also fixes bugs. For example, it introduces a brand new memory
manager, which we believe supports platforms such as Android whose virtual
address space has fewer than 47 bits. It also restores our prebuilt Windows
AMD GPU support, thanks to tinyBLAS.
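
To make the address-space point concrete, here is a small arithmetic sketch. It is not llamafile's memory manager; it only shows how much room a given number of virtual address bits provides, and why an address that is valid with 47 bits may not exist on a smaller platform. The 39-bit figure and the fixed address are assumptions chosen for illustration; the note above says only "fewer than 47 bits".

```python
def address_space_bytes(va_bits: int) -> int:
    """Number of bytes addressable with va_bits of virtual address."""
    return 1 << va_bits


def address_fits(addr: int, va_bits: int) -> bool:
    """True if addr lies inside a va_bits-wide virtual address space."""
    return 0 <= addr < address_space_bytes(va_bits)


# Hypothetical high address that a program might assume it can map.
# It is below 2**47 but far above 2**39.
HYPOTHETICAL_FIXED_ADDR = 0x7000_0000_0000

if __name__ == "__main__":
    # 47 bits is the threshold named above; 39 is an assumed smaller layout.
    for bits in (47, 39):
        size_gib = address_space_bytes(bits) // 2**30
        ok = address_fits(HYPOTHETICAL_FIXED_ADDR, bits)
        print(f"{bits}-bit space: {size_gib} GiB, fixed address fits: {ok}")
```

Code that lets the operating system choose where mappings go, rather than assuming particular high addresses are available, avoids this class of failure.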

In future releases, we plan to introduce a new server for llamafile. The
new server is being designed for performance and production-worthiness.
It's not included in this release, since it currently supports only a
tokenization endpoint. However, that endpoint can handle 2 million
requests per second, whereas the most we've ever seen from the current
server is a few thousand.

  • e0656ea Introduce new llamafile server