Releases: Mozilla-Ocho/llamafile
llamafile v0.8.1
- Support for Phi-3 Mini 4k has been introduced
- A bug causing GPU module crashes on some systems has been resolved
- Support for Command-R Plus has now been vetted with proper 64-bit indexing
- We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
- We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, a libre math library that only depends on the graphics driver being installed. Since it's slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDKs are installed. You can control this behavior using `--nocompile` or `--recompile`. Yes, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
- An assertion error has been fixed that happened when using `llamafile-quantize` to create K quants from an F32 GGUF file
- A new `llamafile-tokenize` command line tool has been introduced. For example, if you want to count how many "tokens" are in a text file, you can say `cat file.txt | llamafile-tokenize -m model.llamafile | wc -l`, since it prints each token on its own line.
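As a sketch of that pipeline (`file.txt` and `model.llamafile` are placeholder names for your own text file and model):

```shell
# llamafile-tokenize prints one token per line, so piping its output
# into `wc -l` yields a token count (placeholder filenames):
#
#   cat file.txt | llamafile-tokenize -m model.llamafile | wc -l
#
# The counting step itself is plain shell; every line is one token:
printf 'Hello\n world\n!\n' | wc -l    # three lines -> 3
```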
llamafile v0.8
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability: a single executable runs on the stock installs of six OSes without needing to be installed. For some use cases, like CPU prompt evaluation, llamafile goes 2x faster than llama.cpp and 25x faster than ollama. It has a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release further improves performance and introduces support for new models.
- Support for LLaMA3 is now available
- Support for Grok has been introduced
- Support for Mixtral 8x22b has been introduced
- Support for Command-R models has been introduced
- MoE models (e.g. Mixtral, Grok) now go 2-5x faster on CPU 4db03a1
- F16 is now 20% faster on Raspberry Pi 5 (TinyLLaMA 1.1b prompt eval improved 62 -> 75 tok/sec)
- F16 is now 30% faster on Skylake (TinyLLaMA 1.1b prompt eval improved 171 -> 219 tok/sec)
- F16 is now 60% faster on Apple M2 (Mistral 7b prompt eval improved 79 -> 128 tok/sec)
- Add ability to override chat template in web gui when creating llamafiles da5cbe4
- Improve markdown and syntax highlighting in server (#88)
- CPU feature detection has been improved
Downloads
You can download prebuilt llamafiles from:
- https://huggingface.co/jartine (llamafiles quantized and compiled by us)
- https://huggingface.co/models?library=llamafile (llamafiles built by our user community)
Errata
- The new web gui chat template override feature isn't working as intended. If you want to use LLaMA3 8B then you need to manually copy and paste the chat templates from our README into the llamafile web GUI.
- The `llamafile-quantize` program may fail with an assertion error when K-quantizing weights from an F32 converted file. You can work around this by asking llama.cpp's `convert.py` script to output an FP16 GGUF file, and then running `llamafile-quantize` on that instead.
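A minimal sketch of that workaround, assuming a llama.cpp checkout. The filenames and the Q5_K_M quantization type are illustrative, and `--outtype f16` is convert.py's flag for emitting FP16 output (check `convert.py --help` in your checkout):

```shell
# Step 1: convert the original model to an FP16 GGUF instead of F32.
python convert.py --outtype f16 --outfile model-f16.gguf /path/to/model

# Step 2: K-quantize from the FP16 file, which sidesteps the assertion error.
llamafile-quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M
```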
llamafile v0.7.4
llamafile v0.7.3
llamafile v0.7.2
llamafile v0.7.1
This release fixes bugs in the 0.7.0 release.
- Fix 2 embeddings-related issues in server.cpp (#324)
- Detect search query to start webchat (#333)
- Use LLAMAFILE_GPU_ERROR value -2 instead of -1 (#291)
- Fix --silent-prompt flag regression #328
- Clamp out of range values in K quantizer ef0307e
- Update to latest q5_k quantization code a8b0b15
- Change the file format magic number for the bf16 file format recently introduced in 0.7.0. This is a breaking change. It's due to a numbering conflict with the upstream project. We're still waiting on a permanent assignment for bfloat16, so this could potentially change again. Follow ggerganov/llama.cpp#6412 for updates.
Mixtral 8x22b and Grok support are not available in this release, but they are available if you build llamafile from source on the main branch at HEAD. We're currently dealing with an AMD Windows GPU support regression there. Once it's resolved, a 0.8 release will ship.
llamafile v0.7
llamafile lets you distribute and run LLMs with a single file
This release improves the performance and accuracy of both CPU and GPU computations in addition to security.
- tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops. This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support will now be faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
- Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to `F16`, `BF16`, `Q8_0`, `Q4_0`, and `F32` weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere between 30% to 500% faster than llama.cpp upstream.
- Support for the bf16 data type (the Google Brain floating point format) has been introduced, for CPU only.
- Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times.
- If you want to run `llamafile-0.7 [...] --recompile --gpu amd` on Windows, this release requires that you use version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
- This release includes a security fix for CVE-2024-23496 (see #294).
- This release is synced with llama.cpp 2024-03-22 upstream.
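A sketch of the AMD recompile invocation above (`model.llamafile` is a placeholder for your own model file):

```shell
# Force llamafile to rebuild its native GPU module for AMD.
# On Windows this requires ROCm HIP SDK 5.7+ to be installed.
llamafile-0.7 -m model.llamafile --recompile --gpu amd
```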
llamafile v0.6.2
llamafile lets you distribute and run LLMs with a single file
This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.
- dfd3335 Synchronize with llama.cpp 2024-01-27
- c008e43 Synchronize with llama.cpp 2024-01-26
- e34b35c Make GPU auto configuration more resilient
- 79b88f8 Sanitize -ngl flag on Apple Metal
There's a known issue with support for splitting onto multiple AMD GPUs,
which currently doesn't work. This is an upstream issue we're working to
solve. The workaround is to set `export HIP_VISIBLE_DEVICES=0` in your
environment when running llamafile, so it'll only see the first GPU.
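For example (the model filename is illustrative):

```shell
# Only expose the first AMD GPU to llamafile, working around the
# multi-GPU splitting issue described above.
export HIP_VISIBLE_DEVICES=0
./llamafile-0.6.2 -m model.llamafile
```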
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/wizardcoder-13b-python
- https://hf.co/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download `llamafile-0.6.2` and simply say `./llamafile-0.6.2 -m old.llamafile` to run your old weights.
llamafile v0.6.1
llamafile lets you distribute and run LLMs with a single file
This release fixes a crash that can happen on Apple Metal GPUs.
- 9c85d9c Fix free() related crash in ggml-metal.m
Windows users will see better performance with tinyBLAS. Please note we
still recommend installing the CUDA SDK (NVIDIA), or HIP/ROCm SDK (AMD)
for maximum performance and accuracy if you're in their support vector.
- df0b3ff Use thread-local register file for matmul speedups (#205)
- 4892494 Change BM/BN/BK to template parameters (#203)
- ed05ba9 Reduce server memory use on Windows
This release also synchronizes with llama.cpp upstream (as of Jan 9th)
along with other improvements.
- 133b05e Sync with llama.cpp upstream
- 67d97b5 Use hipcc on $PATH if it exists
- 15e2339 Do better job reporting AMD hipBLAS errors
- c617679 Don't crash when --image argument is invalid
- 3e8aa78 Clarify install/gpu docs/behavior per feedback
- eb4989a Fix typo in OpenAI API
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/wizardcoder-13b-python
- https://hf.co/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download `llamafile-0.6.1` and simply say `./llamafile-0.6.1 -m old.llamafile` to run your old weights.
llamafile v0.6
llamafile lets you distribute and run LLMs with a single file
This release features significant improvements to GPU support.
- 4616816 Introduce support for multiple GPUs
- 6559da6 Introduce AMD GPU support for Linux
- 20d5f46 Make CLIP GPU acceleration work on UNIX / Windows
The llamafile server is now more reliable. Invalid JSON won't crash the
server. Opening a browser tab won't prevent the server from starting.
- 3384234 Upgrade to cosmocc 3.2.4
- 585c2d8 Make browser tab launching more reliable
- 7a5ec37 Show IP addresses when binding to 0.0.0.0
- d39ec38 Enable setting thread affinity on NUMA systems
You can now say `llamafile -m foo.llamafile` to load a model from a
llamafile without having to execute it or extract the gguf file.
- bb136e1 Support opening weights from llamafiles
The documentation has been improved (but still a work in progress).
- 7ad00db Add more content to manual
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on raspberry pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models:
- https://hf.co/jartine/wizardcoder-13b-python
- https://hf.co/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment) You can also download `llamafile-0.6` and simply say `./llamafile-0.6 -m old.llamafile` to run your old weights.