<a href="https://colab.research.google.com/github/0xVolt/llama-gpu-chain/blob/main/gpu_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLaMa GPU inference with `llama-cpp` and `cuBLAS`

### References

1. [YouTube video on GPU inference](https://www.youtube.com/watch?v=iLBekSpVFq4)
2. [GitHub repository with code for GPU inference](https://github.com/MuhammadMoinFaisal/LargeLanguageModelsProjects/blob/main/Run%20Llama2%20Google%20Colab/Llama_2_updated.ipynb)

## Import and download dependencies

In [None]:
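# Hugging Face client and tokenizer dependencies
%pip install huggingface_hub transformers sentencepiece
# Build llama-cpp-python from source with cuBLAS (CUDA) support enabled
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --verbose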

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.26.tar.gz (8.8 MB)
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting scikit-build-core[pyproject]>=0.5.1
    Downloading scikit_build_core-0.7.0-py3-none-any.whl (136 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 136.6/136.6 kB 1.5 MB/s eta 0:00:00
  Collecting exceptiongroup (from 

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import json

## Download model from `huggingface_hub`

In [None]:
checkpoint = "TheBloke/CodeLlama-13B-Instruct-GGUF"
fileName = r"codellama-13b-instruct.Q4_K_M.gguf"

In [None]:
modelPath = hf_hub_download(
    repo_id=checkpoint,
    filename=fileName
)

The model file was downloaded to the following local path:

In [None]:
modelPath

## Load downloaded model

In [None]:
llm = Llama(
    model_path=modelPath,
    n_threads=2,
    n_batch=512,
    n_gpu_layers=28,
    n_ctx=3584,
    verbose=True
)

#### Things to note:

1. `n_threads` - the number of CPU threads used for inference; keep it at or below the number of CPU cores available.
2. `n_batch` - the number of prompt tokens processed per batch. It needs to be between 1 and `n_ctx`, i.e., the size of the context window in tokens. Larger batches use more memory, so consider this param when tweaking the code to optimize GPU usage. **Look at how much VRAM you have!**
3. `n_gpu_layers` - the number of model layers offloaded to the GPU. Change this according to the GPU you're using and how much VRAM it has.
4. If the load log shows `BLAS = 1`, `llama-cpp` has been set up properly with a GPU backend. In this case, we're using `cuBLAS` to run inference on an Nvidia GPU.
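The notes above can be turned into a rough starting point for these values. The sketch below is a heuristic and not part of `llama-cpp`: `suggest_gpu_layers` is a hypothetical helper that assumes all layers are about the same size and that some VRAM has to stay free for the KV cache and scratch buffers. The numbers in the example (a 7.9 GB file, 40 layers, 15 GB of VRAM) are illustrative assumptions; check the model load log for the real values.

```python
import os


def suggest_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers fit in VRAM (hypothetical heuristic).

    Assumes every layer takes roughly model_size_gb / n_layers and that
    reserve_gb must stay free for the KV cache and scratch buffers.
    """
    if n_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Illustrative numbers only: 7.9 GB file, 40 layers, 15 GB of VRAM.
layers = suggest_gpu_layers(model_size_gb=7.9, n_layers=40, vram_gb=15.0)
threads = os.cpu_count() or 1  # upper bound for n_threads
print(f"n_gpu_layers={layers}, n_threads<={threads}")
```

Treat the result as a first guess: if loading runs out of memory, lower `n_gpu_layers` or `n_batch` and try again.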

In [None]:
# with open(r"./test/testScript1.py", "r") as file:
#     function = file.read()

function = """
def checkGPU(tensorflow):
    if tensorflow == True:
        import tensorflow as tf
        print("Number of GPUs available with tensorflow:", len(tf.config.list_physical_devices('GPU')))
    else:
        import torch
        print('Checking if the GPU is available with PyTorch:', torch.cuda.is_available())
"""

function

In [None]:
prompt = f'''SYSTEM: You are a helpful, respectful and honest assistant. With every line of code that you read, try to understand it and explain how it works. Split the documentation into fields such as function name, function description, arguments, return values and line-by-line explanation. Output should be in markdown syntax.


USER: Write this function's documentation:\n{function}

ASSISTANT:
'''

In [None]:
prompt
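To document more than one function, the prompt layout used above can be wrapped in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and it only reproduces the SYSTEM/USER/ASSISTANT layout from this notebook, which may not be the template the model was trained on.

```python
def build_prompt(system, source_code):
    """Assemble a SYSTEM/USER/ASSISTANT prompt like the one above."""
    return (
        f"SYSTEM: {system.strip()}\n\n"
        f"USER: Write this function's documentation:\n{source_code.strip()}\n\n"
        "ASSISTANT:\n"
    )


# Small stand-in function, used only to show the resulting layout.
example = build_prompt(
    system="You are a helpful, respectful and honest assistant.",
    source_code="def add(a, b):\n    return a + b",
)
print(example)
```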

In [None]:
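response = llm(
    prompt=prompt,
    max_tokens=3584,     # upper bound on the number of generated tokens
    temperature=0.4,     # lower values make the output more deterministic
    top_p=0.95,          # nucleus sampling threshold
    top_k=150,           # sample only from the 150 most likely tokens
    repeat_penalty=1.2,  # discourage repeated tokens
    echo=True            # include the prompt in the returned text
)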
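The prompt and the completion share the same context window, so a `max_tokens` equal to `n_ctx` cannot be used in full once the prompt takes up space. The sketch below shows one way to compute a safe budget. `clamp_max_tokens` is a hypothetical helper, and the prompt length of 300 tokens is a made-up value; how the library itself handles an oversized request may depend on the installed version.

```python
def clamp_max_tokens(n_prompt_tokens, n_ctx, requested):
    """Return a max_tokens value that keeps prompt + completion within n_ctx."""
    available = n_ctx - n_prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone fills the context window")
    return min(requested, available)


# With a loaded model, the prompt length could be measured with:
#   n_prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))
# Here 300 is a made-up prompt length for illustration.
budget = clamp_max_tokens(n_prompt_tokens=300, n_ctx=3584, requested=3584)
print(budget)  # 3284
```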

In [None]:
print(json.dumps(response, indent=2))

In [None]:
print(response["choices"][0]["text"])
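Because the call uses `echo=True`, the returned text begins with the prompt itself. A minimal sketch for keeping only the generated part follows. `extract_completion` is a hypothetical helper, and `fake_response` is a hand-written stand-in that mimics the completion-style dictionary printed above, not real model output.

```python
def extract_completion(response, prompt):
    """Return only the generated text from a completion-style response dict."""
    text = response["choices"][0]["text"]
    # With echo=True the returned text starts with the prompt; drop it.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Hand-written stand-in for a real response, for illustration only.
fake_prompt = "USER: hi\n\nASSISTANT:\n"
fake_response = {
    "choices": [{"text": fake_prompt + "Hello!", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 2, "total_tokens": 10},
}
print(extract_completion(fake_response, fake_prompt))  # Hello!
```

With the real objects from this notebook, the call would be `extract_completion(response, prompt)`.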