This repository contains a benchmark script for llama.cpp using the llama-cpp-python API. This document describes how to install a Llama model and run the benchmark script against it. The instructions assume a Debian-based Linux distribution (such as Ubuntu).
Run the following command:
sudo vim setup.sh
Paste the following script:
#!/bin/bash
# Step 1: Install python
sudo apt-get update
sudo apt-get install python3.10
sudo apt install python3-pip
# Step 2: Install CUDA keyring and CUDA
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo apt install -y ninja-build
export PATH=/usr/local/cuda/bin:$PATH  # make nvcc visible to the build steps below (sourcing ~/.bashrc has no lasting effect inside a script)
# Step 3: Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUBLAS=1
# Step 4: Install git-lfs
sudo apt-get install git-lfs
# Step 5: Clone & configure llama benchmarking repo
cd ..
git clone https://github.com/NPCAI-Studio/llama-cpp-benchmark.git
cd llama-cpp-benchmark
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -r requirements.txt
# Step 6: Install a GGML model from HuggingFace
mkdir models
cd models
git lfs install
sudo apt install wget
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin
# Step 7: Convert GGML model to GGUF
cd ..
cd ..
cd llama.cpp
pip3 install -r requirements.txt
python3 convert-llama-ggmlv3-to-gguf.py --input "../llama-cpp-benchmark/models/llama-2-13b-chat.ggmlv3.q4_0.bin" --output "../llama-cpp-benchmark/models/gguf_model.bin" --name llama-gguf --desc "GGUF converted model"
# Step 8: Run the benchmark
echo ""
echo ""
echo "Setup completed succussfully. Run the benchmark with the following command:"
echo ""
echo "python3 npcai_benchmark.py --model_path models/gguf_model.bin --n_gpu_layers 48"
echo ""
echo ""
Save and close the file.
Run the following command:
sudo chmod u+x setup.sh
Run the following command:
sudo ./setup.sh
Run the following commands:
sudo apt-get update
sudo apt-get install python3.10
sudo apt install python3-pip
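To confirm both are available, you can print their versions (output varies by distribution):
python3 --version
pip3 --version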
Run the following command:
sudo apt-get install git
This step is only required if you plan to run the model with GPU offloading, and it assumes the machine has NVIDIA GPUs with working drivers. Run the following commands:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
sudo apt install -y ninja-build
export PATH=/usr/local/cuda/bin:$PATH
To make this persist across sessions, add the same export line to ~/.bashrc.
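As a quick sanity check, you can confirm the toolkit and driver are visible (output varies by CUDA version and GPU):
nvcc --version
nvidia-smi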
Run the following commands:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
If building to run solely on CPU, run this command:
make
If building to run with GPU offloading, instead run this command:
make LLAMA_CUBLAS=1
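As a smoke test, the build should produce a main binary in the repository root; assuming it did, printing its help text confirms the build works:
./main -h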
Run the following command:
sudo apt-get install git-lfs
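You can confirm the extension installed by printing its version:
git lfs version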
Run the following commands (assuming the cwd is llama.cpp from step #3):
cd ..
git clone https://github.com/NPCAI-Studio/llama-cpp-benchmark.git
cd llama-cpp-benchmark
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -r requirements.txt
Find the binary you want to install on HuggingFace. Copy the link and run the following commands (assuming the cwd is llama-cpp-benchmark from step #5):
mkdir models
cd models
git lfs install
sudo apt install wget
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin
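The same pattern works for other quantizations in that repository; for example, assuming the q4_K_M file exists under the same path (check the repo's file list first):
wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin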
Update 8.28: As of this date, the latest commit of llama.cpp requires the GGUF model format instead of GGML. TheBloke on HuggingFace has not yet uploaded GGUF versions of these models, but llama.cpp includes a conversion script. To convert a downloaded GGML model to GGUF, run the following commands (assuming you are in the directory from the previous step):
cd ..
cd ..
cd llama.cpp
pip3 install -r requirements.txt
python3 convert-llama-ggmlv3-to-gguf.py --input "../llama-cpp-benchmark/models/llama-2-13b-chat.ggmlv3.q4_0.bin" --output "../llama-cpp-benchmark/models/gguf_model.bin" --name <model_name> --desc <model_desc>
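For example, with the placeholders filled in (the name and description are free-form labels stored in the GGUF metadata; these values are just illustrative):
python3 convert-llama-ggmlv3-to-gguf.py --input "../llama-cpp-benchmark/models/llama-2-13b-chat.ggmlv3.q4_0.bin" --output "../llama-cpp-benchmark/models/gguf_model.bin" --name llama-2-13b-chat --desc "Llama 2 13B chat q4_0, converted to GGUF"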
Navigate back to the llama-cpp-benchmark repository root directory. Run the following command:
python3 npcai_benchmark.py --model_path "models/<model_name>"
If you wish to offload layers to the GPU, instead run the following command:
python3 npcai_benchmark.py --model_path "models/<model_name>" --n_gpu_layers -1
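Passing -1 offloads every layer. If VRAM is limited, you can instead offload only part of the model by passing an explicit layer count; the value 20 below is purely illustrative, so tune it to your GPU:
python3 npcai_benchmark.py --model_path "models/<model_name>" --n_gpu_layers 20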