<a href="https://colab.research.google.com/github/0xVolt/llama-gpu-chain/blob/main/llama_cpp_gpu_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run GPU Inference of LLaMA Models with cuBLAS and `llama-cpp-python`

## Import and download dependencies

In [None]:
!pip install huggingface_hub transformers sentencepiece
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.24.tar.gz (8.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.8/8.8 MB[0m [31m57.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... [?25l[?25hdone
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.24-cp310-cp310-manylinux_2_35_

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Download model from `huggingface_hub`

In [5]:
checkpoint = "TheBloke/CodeLlama-13B-Instruct-GGUF"
fileName = r"codellama-13b-instruct.Q4_K_M.gguf"

In [6]:
modelPath = hf_hub_download(
    repo_id=checkpoint,
    filename=fileName
)

codellama-13b-instruct.Q4_K_M.gguf:   0%|          | 0.00/7.87G [00:00<?, ?B/s]
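
`hf_hub_download` returns the local path of the cached file. Since the model is several gigabytes, an interrupted or failed download is worth ruling out before loading. The sketch below is one cheap way to do that: GGUF files begin with the four ASCII bytes `GGUF`, so reading the first four bytes catches obviously wrong files. The helper name `looks_like_gguf` is hypothetical, and the check does not validate the rest of the file.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes.

    Hypothetical helper: it only inspects the header, not the tensor data.
    """
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


# Demonstration on two throwaway files instead of the real 7.87 GB model.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as f:
        f.write(GGUF_MAGIC + b"\x03\x00\x00\x00")  # magic plus a few extra bytes
    with open(bad, "wb") as f:
        f.write(b"<html>not a model</html>")
    good_ok = looks_like_gguf(good)
    bad_ok = looks_like_gguf(bad)

print(good_ok, bad_ok)
```

In the notebook, the same check would be `looks_like_gguf(modelPath)` right after the download cell.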

## Load downloaded model

In [7]:
llamaCppLLM = Llama(
    model_path=modelPath,
    n_threads=2,
    n_batch=512,
    n_gpu_layers=32
)

AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
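
The `n_gpu_layers=32` setting above only pays off if the offloaded layers fit in VRAM, so it can help to check how much memory the GPU has. The following is a minimal sketch that asks `nvidia-smi` for the total and used memory in MiB and parses its CSV output. The function names are hypothetical, and the code assumes `nvidia-smi` is on the `PATH`; if it is not, `query_gpus` returns an empty list.

```python
import shutil
import subprocess

# nvidia-smi query for GPU name plus total and used memory, in MiB, without headers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines such as 'Tesla T4, 15360, 3' into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        try:
            # Split from the right so a comma inside the GPU name is harmless.
            name, total, used = [part.strip() for part in line.rsplit(",", 2)]
            gpus.append({
                "name": name,
                "total_mib": int(total),
                "free_mib": int(total) - int(used),
            })
        except ValueError:
            # Skip lines that do not have the expected three numeric-friendly fields.
            continue
    return gpus


def query_gpus():
    """Return parsed GPU memory info, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


print(query_gpus())
```

If the free memory is well below the size of the model file, lowering `n_gpu_layers` keeps more layers on the CPU at the cost of speed.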


#### Things to note:

1. `n_threads` - the number of CPU threads used for inference. Keep it at or below the number of CPU cores available.
2. `n_batch` - the number of prompt tokens processed per batch. It needs to be between 1 and `n_ctx`, i.e., the number of tokens in the context window. Larger batches use more memory, so consider this parameter when tweaking the code to optimize GPU usage. **Look at how much VRAM you have!**
3. `n_gpu_layers` - the number of model layers offloaded to the GPU. Change this according to the GPU you're using and how much VRAM it has.
4. If the load output shows `BLAS = 1`, `llama-cpp-python` was built with a BLAS backend. In this case, we're using `cuBLAS` to run inference on an Nvidia GPU.
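
The constraints in the first two notes can be checked before constructing `Llama`. This is a small sketch with a hypothetical helper, `check_llama_params`, that compares `n_threads` against the logical CPU count and checks that `n_batch` lies between 1 and `n_ctx`. The default of 512 for `n_ctx` is an assumption based on the `llama-cpp-python` default when no `n_ctx` is passed, as in the cell above; adjust it if you set `n_ctx` explicitly.

```python
import os


def check_llama_params(n_threads, n_batch, n_ctx=512):
    """Return a list of human-readable problems; an empty list means no issues found.

    Hypothetical helper; n_ctx=512 is assumed to match the library default.
    """
    problems = []
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None

    if n_threads < 1:
        problems.append("n_threads must be at least 1")
    elif n_threads > cpu_count:
        problems.append(
            f"n_threads={n_threads} exceeds the {cpu_count} logical CPUs available"
        )

    if not 1 <= n_batch <= n_ctx:
        problems.append(f"n_batch={n_batch} should be between 1 and n_ctx={n_ctx}")

    return problems


# An oversized batch is flagged; a single thread is always within the CPU count.
print(check_llama_params(n_threads=1, n_batch=512))
print(check_llama_params(n_threads=1, n_batch=1024))
```

Running the helper with the values from the load cell (`n_threads=2`, `n_batch=512`) shows whether the current runtime has enough CPU threads for that setting.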

*Note: *