llama.cpp

Inference of Meta's LLaMA model (and others) in pure C/C++

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance.

- Plain C/C++ implementation without any dependencies
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
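
To make the memory saving from integer quantization concrete, here is a minimal sketch of block-wise 4-bit quantization. It is a simplified illustration, not the actual ggml quantization formats: the block size of 32, the min/scale scheme, and the function names are all assumptions made for this example.

```python
# Simplified block-wise quantization sketch (illustrative only; this is NOT
# the layout ggml uses for its Q4_K or other quantization types).
BLOCK_SIZE = 32  # assumed number of weights per block


def quantize_block(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with one scale and one minimum per block."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant blocks
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return scale, lo, codes


def dequantize_block(scale, lo, codes):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, code_bits=4, overhead_bits=32):
    """Average storage per weight, assuming scale and minimum are each stored as 16-bit floats."""
    return (block_size * code_bits + overhead_bits) / block_size


if __name__ == "__main__":
    weights = [i / 10 for i in range(BLOCK_SIZE)]
    scale, lo, codes = quantize_block(weights)
    restored = dequantize_block(scale, lo, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max reconstruction error: {worst:.4f} (bounded by scale/2 = {scale / 2:.4f})")
    print(f"storage: {bits_per_weight()} bits per weight vs 16 for fp16")
```

Under these assumptions each weight costs 5 bits on average instead of 16, at the price of a rounding error of at most half the block's scale.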

Build

Use make to build the project:

make

CUDA

This provides GPU acceleration using the CUDA cores of your NVIDIA GPU. Make sure the CUDA toolkit is installed. You can install it from your Linux distro's package manager (e.g. apt install nvidia-cuda-toolkit) or download it from NVIDIA's CUDA Toolkit page.

Make sure that the nvcc compiler is in your PATH:

export PATH=/usr/local/cuda/bin:$PATH

Use the flag LLAMA_CUDA=1 to enable CUDA support:

LLAMA_CUDA=1 make

Usage

Download a model

After building the project, download a model in GGUF format. For example, you can download the Mistral-7B-Instruct-v0.2-GGUF model published by TheBloke on Hugging Face.

# Download the 4-bit quantized model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
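
A truncated download or an HTML error page saved under the model's filename will only fail later, at load time. As a quick sanity check, the sketch below reads the first bytes of the file and verifies the GGUF magic. It assumes the usual little-endian layout (4-byte magic "GGUF" followed by a 32-bit version); the function names are hypothetical and not part of llama.cpp.

```python
import struct


def parse_gguf_header(data):
    """Return the GGUF version from the first 8 bytes, or raise ValueError if the magic is wrong."""
    if len(data) < 8 or data[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic or file too short)")
    (version,) = struct.unpack("<I", data[4:8])  # assumes a little-endian file
    return version


def read_gguf_version(path):
    """Read only the header bytes from disk and parse them."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(8))
```

For example, read_gguf_version("mistral-7b-instruct-v0.2.Q4_K_M.gguf") should return a small integer for a valid file and raise ValueError otherwise.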

Run the model

After downloading the model, you can run inference using the following command:

./main -ngl 35 -m mistral-7b-instruct-v0.2.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Who won the 2018 world cup?"
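
The command above packs several options into one line. The sketch below rebuilds the same invocation from named parameters, mainly to annotate what each flag does. The helper name build_main_command and its defaults are assumptions for illustration; only the flags themselves come from the command above.

```python
import shlex


def build_main_command(model, prompt, n_gpu_layers=35, ctx_size=32768,
                       temp=0.7, repeat_penalty=1.1, n_predict=-1):
    """Assemble the ./main argument list used in this README."""
    return [
        "./main",
        "-ngl", str(n_gpu_layers),                 # layers offloaded to the GPU
        "-m", model,                               # path to the GGUF model file
        "--color",                                 # colorize prompt vs. generated text
        "-c", str(ctx_size),                       # context size in tokens
        "--temp", str(temp),                       # sampling temperature
        "--repeat_penalty", str(repeat_penalty),   # penalize repeated tokens
        "-n", str(n_predict),                      # tokens to generate; -1 means no fixed limit
        "-p", prompt,                              # the prompt text
    ]


if __name__ == "__main__":
    cmd = build_main_command("mistral-7b-instruct-v0.2.Q4_K_M.gguf",
                             "Who won the 2018 world cup?")
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list straight to subprocess.run avoids shell quoting problems when the prompt contains quotes or other special characters.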

Experiments

GPU acceleration

We vary the n_gpu_layers parameter (-ngl) to see how offloading layers to the GPU affects inference time. We use the Mistral-7B-Instruct-v0.2-GGUF model on an NVIDIA RTX 2080 GPU with a context size of 32768. The prompt is "What is the capital of the United States?". We use the seed 470 for reproducibility.

# Quick run:
./main -ngl 35 -c 32768 -s 470 -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is the capital of the United States?"

# Script to run the ngl experiment
python scripts/run_ngl.py

# Script to plot the results
python scripts/show_ngl.py
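
The contents of scripts/run_ngl.py are not shown here, so the following is only a sketch of how such a sweep could work: run ./main once per -ngl value and pull the generation speed out of the timing summary. The regular expression assumes the "llama_print_timings: eval time = ... tokens per second" line format, and the function names and the default layer counts are hypothetical.

```python
import re
import subprocess

# Assumed shape of the generation timing line printed by ./main on exit:
#   llama_print_timings:        eval time = ... ms / ... runs   ( ... ms per token, ... tokens per second)
# The "prompt eval time" line is deliberately not matched.
EVAL_LINE = re.compile(
    r"^llama_print_timings:\s+eval time\s*=.*?"
    r"\(\s*[\d.]+ ms per token,\s*([\d.]+) tokens per second\)",
    re.MULTILINE,
)


def parse_eval_tokens_per_second(log_text):
    """Return generation speed in tokens/s, or None if no timing line is found."""
    match = EVAL_LINE.search(log_text)
    return float(match.group(1)) if match else None


def run_once(n_gpu_layers, model="mistral-7b-instruct-v0.2.Q4_K_M.gguf"):
    """Run one inference with the README's settings and return tokens/s."""
    cmd = ["./main", "-ngl", str(n_gpu_layers), "-c", "32768", "-s", "470",
           "-m", model, "-p", "What is the capital of the United States?"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Search both streams, since timing output may go to stderr.
    return parse_eval_tokens_per_second(result.stderr + result.stdout)


def sweep(layer_counts=range(0, 36, 5)):
    """Map each -ngl value to its measured tokens/s."""
    return {ngl: run_once(ngl) for ngl in layer_counts}
```

Calling sweep() from the build directory returns a dictionary that can be written to disk and plotted.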

Multi-threading

We vary the number of threads (-t) to see how multi-threading affects inference time. We use the Mistral-7B-Instruct-v0.2-GGUF model on an Intel i7-8700K CPU with a context size of 32768. The prompt is "What is the capital of the United States?". We use the seed 470 for reproducibility.

# Quick run:
./main -t 1 -c 32768 -s 470 -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is the capital of the United States?"

# Script to run the threads experiment
python scripts/threads.py

# Script to plot the results
python scripts/show_threads.py
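
When reading the thread results, raw tokens/s can hide how quickly the returns diminish. One way to summarize a sweep is parallel efficiency: the speedup over the single-thread run divided by the thread count. The sketch below assumes the measurements are already collected in a dictionary mapping thread count to tokens/s; both helper names are hypothetical and not taken from scripts/threads.py.

```python
import os


def thread_counts(max_threads=None):
    """Thread counts to sweep: 1 up to max_threads (default: logical CPUs reported by the OS)."""
    limit = max_threads or os.cpu_count() or 1
    return list(range(1, limit + 1))


def parallel_efficiency(tokens_per_second):
    """Given {threads: tokens/s} including a 1-thread entry, return {threads: efficiency}.

    Efficiency is (speedup over 1 thread) / threads; 1.0 means perfect scaling.
    """
    base = tokens_per_second[1]
    return {t: (tps / base) / t for t, tps in sorted(tokens_per_second.items())}
```

An efficiency that drops sharply past the physical core count would suggest that the extra hardware threads add little for this workload.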

CPU-only vs GPU acceleration

# CPU-only (replace N with the number of threads to use)
./main -t N -c 32768 -s 470 -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is the capital of the United States?"

# GPU acceleration
./main -ngl 35 -c 32768 -s 470 -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is the capital of the United States?"

Experiment 1
- CPU hardware: Intel i7-8700K CPU @ 3.70GHz with 6 cores and 2 threads per core
- GPU hardware: 2 NVIDIA RTX 2080 with 8GB of VRAM
- CPU inference: 8.47 tokens/s
- GPU inference: 81.89 tokens/s
- Speedup: 9.67x

Experiment 2
- CPU hardware: Intel Xeon CPU @ 2.20GHz with 1 core and 2 threads per core
- GPU hardware: NVIDIA Tesla T4 with 16GB of VRAM
- CPU inference: 1.57 tokens/s
- GPU inference: 37.62 tokens/s
- Speedup: 23.96x
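
The speedup figures are the ratio of GPU to CPU throughput. A short check of the arithmetic, using only the numbers reported above (the function name is just for illustration):

```python
def speedup(cpu_tokens_per_s, gpu_tokens_per_s):
    """Speedup of GPU inference relative to CPU inference."""
    return gpu_tokens_per_s / cpu_tokens_per_s


if __name__ == "__main__":
    print(f"Experiment 1: {speedup(8.47, 81.89):.2f}x")   # prints 9.67x
    print(f"Experiment 2: {speedup(1.57, 37.62):.2f}x")   # prints 23.96x
```

The larger ratio in Experiment 2 comes mostly from the much slower CPU baseline (1.57 vs 8.47 tokens/s), since the GPU throughput there is lower in absolute terms.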

