<a href="https://colab.research.google.com/github/HSV-AI/presentations/blob/master/2024/240919_Llama_CPP_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Logo](https://camo.githubusercontent.com/455da7518417340e112a473e3bdd91dae3dc8fda296d247ad3f3bc95cced8738/68747470733a2f2f6873762e61692f77702d636f6e74656e742f75706c6f6164732f323032322f30332f6c6f676f5f7631315f323032322e706e67)

# Welcome
- Vision
- Mission
- How to Connect - [Signup](https://hsv.ai/subscribe/)




# Large Language Models

We don't have enough time to cover everything about these. The rule of thumb is that the larger the model (more parameters) the more powerful it will be. Another thing to keep in mind is that larger models require more effort to train, and more powerful hardware to run.

Luckily for us, we can use models that have been pre-trained, so we never pay the training cost ourselves. The remaining question is where to find enough GPU horsepower to run them.

# CPU Execution

From [Wikipedia](https://en.wikipedia.org/wiki/Llama.cpp)

Towards the end of September 2022, Georgi Gerganov started work on the GGML library, a C library implementing tensor algebra. Gerganov developed the library with a focus on strict memory management and multi-threading. The creation of GGML was inspired by Fabrice Bellard's work on LibNC.

llama.cpp began development in March 2023 by Georgi Gerganov as an implementation of the Llama inference code in pure C/C++ with no dependencies. This improved performance on computers without a GPU or other dedicated hardware. As of July 2024 it has 61 thousand stars on GitHub. Before llama.cpp, Gerganov worked on a similar library called whisper.cpp, which implemented Whisper, a speech-to-text model by OpenAI. llama.cpp gained traction with users who lacked specialized hardware, as it could run on just a CPU, including on Android devices.



# Quantization

You can think of an LLM as a multi-dimensional array of 16-bit floating-point weights, which are loaded into a neural network and used to calculate outputs from a feed-forward network. So when you see a new model released with a stated parameter count, you can do the math: 1 parameter == 1 weight == 16 bits == 2 bytes. Or just double the number of parameters (in billions), and that's roughly how much memory in GB your GPU will need to run the model.
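The rule of thumb above is easy to check with a few lines of arithmetic. This is a rough sketch that counts the weights only (2 bytes each) and ignores activations and any runtime overhead; the helper name `fp16_memory_gb` is made up for illustration.

```python
def fp16_memory_gb(n_params: float) -> float:
    """Approximate memory for the weights alone, at 16 bits (2 bytes) each."""
    bytes_per_param = 2
    return n_params * bytes_per_param / 1e9  # decimal gigabytes


# Parameter counts are the nominal sizes from the model names.
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{fp16_memory_gb(n_params):.0f} GB")
```

For a 13B model this comes out at about 26 GB, which is why an unquantized copy will not fit on a typical single consumer or Colab GPU.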

Quantization is an approach to convert a model's 16bit weights into a smaller size. Some approaches reduce the size differently based on what part of the neural network will be affected. There are ways to measure the accuracy loss as well.
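To make the idea concrete, here is a toy sketch of one simple scheme, symmetric "absmax" quantization, written in plain Python. It is only an illustration: the weight values are made up, and the schemes used in real GGUF files (such as Q5_K_M) are block-wise and more elaborate than this.

```python
def quantize(weights, bits):
    """Map floats to signed integers that fit in `bits` bits (symmetric absmax)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 15 for 5 bits
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.60]  # made-up example values

for bits in (8, 5):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q} max abs error={max_err:.4f}")
```

The maximum absolute error printed here is one crude way to measure the accuracy loss; with this scheme it can never exceed half of the scale factor, so fewer bits (a larger scale) means a larger worst-case error.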

The best overview of Quantization that I could find is [from symbl.ai](https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/)

Today, we will be using the llama2-13b-chat model, quantized to 5 bits per weight and, since it should also fit on our GPU, to 8 bits per weight.


# Llama-cpp-python

The llama-cpp-python library is a Python wrapper around the popular llama.cpp project. It provides a module you can import directly, as well as the ability to run llama.cpp behind a web server that exposes an OpenAI-compatible API.

Full documentation on the latest stable release is here: [https://llama-cpp-python.readthedocs.io/en/stable/](https://llama-cpp-python.readthedocs.io/en/stable/)
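The rest of this notebook uses the direct import, but for reference, here is a minimal sketch of talking to the web server with only the standard library. It assumes the server has been started separately (for example with `python -m llama_cpp.server --model <path-to-gguf>`) and is listening on `localhost:8000`; the address and the helper names are assumptions for illustration.

```python
import json
import urllib.request

# Assumed address of a locally running llama-cpp-python server.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text completion request."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 100) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Calling `complete("Hello")` only works while the server is up; `build_completion_request` can be inspected without one.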

First we need to check the version of CUDA that is installed, as well as the RAM available on the GPU.

In [1]:
# First let's check the CUDA version so we can set up llama.cpp correctly
!nvcc --version
!nvidia-smi

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Thu Sep 19 02:25:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                      

# Precompiled Binary

For this example, we're using a precompiled wheel from https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels


In [2]:
#Install llama-cpp-python, cuda-enabled package
!python -m pip install llama-cpp-python==0.2.83 --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122


Looking in indexes: https://pypi.org/simple, https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122
Collecting llama-cpp-python==0.2.83
  Downloading llama_cpp_python-0.2.83.tar.gz (49.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.4/49.4 MB 16.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.83)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.7 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created

In [4]:
# Install llama-cpp-python with cuBLAS, compatible with CUDA 12.2 (the CUDA version reported above).
# %env sets each variable for the whole notebook session, so the pip command below can see it.
%env LLAMA_CUBLAS=1
%env CMAKE_ARGS=-DLLAMA_CUBLAS=on
%env FORCE_CMAKE=1

#Install llama-cpp-python, cuda-enabled package
!python -m pip install llama-cpp-python==0.2.7 --prefer-binary --extra-index-url=https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122

#Install pytorch-related, cuda-enabled package
!pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121

Looking in indexes: https://pypi.org/simple, https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels/AVX2/cu122
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.3.0
  Downloading https://download.pytorch.org/whl/cu121/torch-2.3.0%2Bcu121-cp310-cp310-linux_x86_64.whl (781.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 781.0/781.0 MB 1.2 MB/s eta 0:00:00
Collecting torchvision==0.18.0
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.18.0%2Bcu121-cp310-cp310-linux_x86_64.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 97.3 MB/s eta 0:00:00
Collecting torchaudio==2.3.0
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.3.0%2Bcu121-cp310-cp310-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 66.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nv

In [6]:
#Working for GPU
%%writefile gpu_requirements.txt
annotated-types==0.7.0
anyio==4.4.0
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.7
colorama==0.4.6
diskcache==5.6.3
dnspython==2.6.1
email_validator==2.1.1
exceptiongroup==1.2.1
filelock==3.13.1
fsspec==2024.6.0
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.3
idna==3.4
Jinja2==3.1.4
llama_cpp_python==0.2.7+cu122
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.4
orjson==3.10.3
packaging==24.0
pillow==10.2.0
pydantic==2.7.3
pydantic_core==2.18.4
Pygments==2.18.0
python-dotenv==1.0.1
python-multipart==0.0.9
PyYAML==6.0.1
requests==2.28.1
rich==13.7.1
shellingham==1.5.4
sniffio==1.3.1
starlette==0.37.2
sympy==1.12
torch==2.3.0+cu121
torchaudio==2.3.0+cu121
torchvision==0.18.0+cu121
tqdm==4.66.4
typer==0.12.3
typing_extensions==4.12.1
ujson==5.10.0
watchfiles==0.22.0

Writing gpu_requirements.txt


In [7]:
!pip install -r gpu_requirements.txt  # It's normal to see incompatibility errors; the most important packages have been installed correctly


Collecting anyio==4.4.0 (from -r gpu_requirements.txt (line 2))
  Downloading anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting certifi==2022.12.7 (from -r gpu_requirements.txt (line 3))
  Downloading certifi-2022.12.7-py3-none-any.whl.metadata (2.9 kB)
Collecting charset-normalizer==2.1.1 (from -r gpu_requirements.txt (line 4))
  Downloading charset_normalizer-2.1.1-py3-none-any.whl.metadata (11 kB)
Collecting colorama==0.4.6 (from -r gpu_requirements.txt (line 6))
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting dnspython==2.6.1 (from -r gpu_requirements.txt (line 8))
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Collecting email_validator==2.1.1 (from -r gpu_requirements.txt (line 9))
  Downloading email_validator-2.1.1-py3-none-any.whl.metadata (26 kB)
Collecting exceptiongroup==1.2.1 (from -r gpu_requirements.txt (line 10))
  Downloading exceptiongroup-1.2.1-py3-none-any.whl.metadata (6.6 kB)
Collecting filelock==3.13.1 (from

# Let's Find a Model

## Types of "Small" Models

With models smaller than ~20B, you will often see multiple variations of the same model.

Often these will be:
- Base model - these models operate by predicting the next word following the prompt.
- Instruct model - these models start with the base model, but are further fine-tuned to follow instructions.
- Chat model - these also start with the base model, but are optimized for dialogue, making them more suitable for interactive tasks such as question answering and multi-turn conversations.

## Hugging Face

You can think of Hugging Face as the GitHub of machine learning models.

TheBloke's repositories are a good place to start.

https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF

We're going to pull the 5-bit quantized model as well as the 8-bit quantized model since we *should* have enough RAM to load it.
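A quick back-of-the-envelope check supports that "*should*". The sketch below uses the nominal bits per weight from the file names and the 15360 MiB reported by `nvidia-smi` earlier; real GGUF files are somewhat larger than this estimate because they also store per-block scale factors, and the context cache needs memory too. The helper name `weights_gib` is made up for illustration.

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


T4_VRAM_GIB = 15360 / 1024  # 15360 MiB on the Tesla T4 shown by nvidia-smi
N_PARAMS = 13e9             # nominal size of a "13B" model

for label, bits in [("Q5_K_M", 5), ("Q8_0", 8), ("fp16", 16)]:
    size = weights_gib(N_PARAMS, bits)
    verdict = "fits" if size < T4_VRAM_GIB else "does not fit"
    print(f"{label:7s} ~{size:5.1f} GiB -> {verdict} in {T4_VRAM_GIB:.0f} GiB")
```

By this estimate both quantized files leave some headroom on the T4, while the unquantized 16-bit weights would not fit at all.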

In [None]:
!wget -O llama-2-13b-chat.Q5_K_M.gguf https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf?download=true

--2024-07-28 14:16:11--  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf?download=true
Resolving huggingface.co (huggingface.co)... 3.163.189.90, 3.163.189.74, 3.163.189.37, ...
Connecting to huggingface.co (huggingface.co)|3.163.189.90|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/8d/b1/8db1d1f73b4caa58e947ccbfe2fb27ac5e495c2ad8457ad299d15987aee3b520/ef36e090240040f97325758c1ad8e23f3801466a8eece3a9eac2d22d942f548a?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-13b-chat.Q5_K_M.gguf%3B+filename%3D%22llama-2-13b-chat.Q5_K_M.gguf%22%3B&Expires=1722435371&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMjQzNTM3MX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy84ZC9iMS84ZGIxZDFmNzNiNGNhYTU4ZTk0N2NjYmZlMmZiMjdhYzVlNDk1YzJhZDg0NTdhZDI5OWQxNTk4N2FlZTNiNTIwL2VmMzZlMDkwMjQwMDQwZjk3MzI1NzU4Yz

In [3]:
# This took 17 minutes to complete
!wget -O llama-2-13b-chat.Q8_0.gguf https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q8_0.gguf?download=true

--2024-09-19 02:28:37--  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q8_0.gguf?download=true
Resolving huggingface.co (huggingface.co)... 3.165.102.128, 3.165.102.6, 3.165.102.58, ...
Connecting to huggingface.co (huggingface.co)|3.165.102.128|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/8d/b1/8db1d1f73b4caa58e947ccbfe2fb27ac5e495c2ad8457ad299d15987aee3b520/9f4d06112114dd1b48023305578ad52b690d3aee42181631a2bddbe856f75ae6?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-13b-chat.Q8_0.gguf%3B+filename%3D%22llama-2-13b-chat.Q8_0.gguf%22%3B&Expires=1726972117&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyNjk3MjExN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy84ZC9iMS84ZGIxZDFmNzNiNGNhYTU4ZTk0N2NjYmZlMmZiMjdhYzVlNDk1YzJhZDg0NTdhZDI5OWQxNTk4N2FlZTNiNTIwLzlmNGQwNjExMjExNGRkMWI0ODAyMzMwNTU3OGF

# Llama-cpp-python API

As llama-cpp-python is a wrapper around the native llama.cpp code, the parameters used for instantiation and calling come primarily from the base library.

The full list of parameters can be found here: [https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.__init__](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.__init__)

The main parameters that we need to care about to get started are:
- model_path -  Path to the model.
- n_gpu_layers - Number of layers to offload to GPU. If -1, all layers are offloaded.
- n_ctx - Text context, 0 = from model


In [1]:
from llama_cpp import Llama

# LLAMA_2_13B = "llama-2-13b-chat.Q5_K_M.gguf"
LLAMA_2_13B = "llama-2-13b-chat.Q8_0.gguf"


# Using the defaults here, except for offloading all layers (n_gpu_layers=-1) to the GPU available on this machine.
llm = Llama(model_path=LLAMA_2_13B, n_gpu_layers=-1, verbose=True)


AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


In [2]:
# This is a basic prompt to show building the context of the RAG and asking the question
PROMPT = """[INST] <<SYS>>You are a helpful, respectful and honest assistant. Always answer
as helpfully as possible, while being safe. Your answers should not include any harmful,
unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your
responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of
answering something not correct. If you don't know the answer to a question, please don't share
false information.<</SYS>>

Generate the next agent response by answering the question. Answer it as succinctly as possible.
You are provided several documents with titles. If the answer comes from different documents
please mention all possibilities in your answer and use the titles to separate between topics
or domains. If you cannot answer the question from the given documents, please state that you
do not have an answer.

CONTEXT:

Blog Post page 1: In this part, we will wrap the Transformer model with HuggingFace pipeline so that we can pass
the rules to the Transformer model. To craft and pass the rules to the Transformer model,
we can use the LangChain Prompt Template. In this prompt template, we can tell how the
LLM should behave. This is shown in the pre_prompt variable. Next, we give some information
or context to the LLM to refine our prediction. For example, we can tell the LLM that we are
dicussing Apple product such as iphone, ipad and mac book, instead of discussing Apple as a
fruit. With the context, we can pass in the relevant question as shown in the prompt variable.

Question:

What type of model do we pass the rules to?
[/INST]"""
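The `[INST]`, `<<SYS>>` and `<</SYS>>` markers in the prompt are the Llama 2 chat format. When prompts get assembled from several pieces, a small helper keeps the markers in the right place. This is a sketch: the function name is made up, and the newlines around the system block follow Meta's published template rather than the exact spacing used in the cell above.

```python
def llama2_chat_prompt(system: str, user: str) -> str:
    """Wrap a system prompt and a single user turn in Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = llama2_chat_prompt(
    system="Answer as succinctly as possible.",
    user="What type of model do we pass the rules to?",
)
print(prompt)
```

For a RAG-style prompt like the one above, the retrieved context and the question would both go into the `user` argument.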



In [3]:

# Since the default of llama-cpp-python uses a 512 token context length, we need
# to check and see how much our prompt uses.
input_length = len(llm.tokenize(bytes(PROMPT, "utf-8")))

print(f'Context length of prompt: {input_length}')

Context length of prompt: 409
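The prompt and the generated tokens share the same context window, so it is worth checking the budget before generating. A minimal sketch, assuming the 512-token default mentioned above (the helper name is hypothetical):

```python
DEFAULT_N_CTX = 512  # default context length noted in the cell above


def fits_in_context(prompt_tokens: int, max_tokens: int, n_ctx: int = DEFAULT_N_CTX) -> bool:
    """True if the prompt plus the requested completion fit in the context window."""
    return prompt_tokens + max_tokens <= n_ctx


print(fits_in_context(409, 100))  # 409 + 100 = 509 tokens
print(fits_in_context(409, 200))  # 409 + 200 = 609 tokens
```

With a 409-token prompt there is room for at most 103 new tokens at the default setting; for longer prompts or answers, pass a larger `n_ctx` when creating the `Llama` object.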


In [4]:
# To generate the answer, simply call the llm
output = llm(
    PROMPT, # Prompt
    max_tokens=100 # Stop after at most 100 generated tokens
)

# The output structure has a lot of additional stuff in it:
print('\nTotal output structure:')
print(output)

print('\nOutput answer:')
print(output['choices'][0]['text'])



Total output structure:
{'id': 'cmpl-c5f9d872-ae29-4165-ae91-964982d76d67', 'object': 'text_completion', 'created': 1726716432, 'model': 'llama-2-13b-chat.Q8_0.gguf', 'choices': [{'text': '  Based on the information provided in the blog post page 1, we pass the rules to a Transformer model. Specifically, we use the HuggingFace pipeline to wrap the Transformer model and pass the rules through the LangChain Prompt Template.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 409, 'completion_tokens': 52, 'total_tokens': 461}}

Output answer:
  Based on the information provided in the blog post page 1, we pass the rules to a Transformer model. Specifically, we use the HuggingFace pipeline to wrap the Transformer model and pass the rules through the LangChain Prompt Template.
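Beyond the answer text, the `finish_reason` and `usage` fields are the most useful parts of that structure. The sketch below works on a trimmed copy of the output shown above (the answer text is cut to its first sentence) so it runs without a model; the `summarize` helper is made up for illustration.

```python
# Trimmed copy of the output structure printed above.
output = {
    "choices": [{
        "text": "  Based on the information provided in the blog post page 1, "
                "we pass the rules to a Transformer model.",
        "index": 0,
        "logprobs": None,
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 409, "completion_tokens": 52, "total_tokens": 461},
}


def summarize(output: dict) -> dict:
    """Pull out the answer, whether it was cut off, and the token count."""
    choice = output["choices"][0]
    return {
        "answer": choice["text"].strip(),
        # "length" means generation hit max_tokens or the context limit.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": output["usage"]["total_tokens"],
    }


print(summarize(output))
```

A `finish_reason` of `stop` means the model ended the answer on its own, which is what we see here.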


In [5]:
PROMPT = """[INST] <<SYS>>You are President Trump. Give all answers in the form of Shakespeare.<</SYS>>

Tell me about your wall. How is it going?
[/INST]"""

# To generate the answer, simply call the llm
output = llm(
    PROMPT, # Prompt
    max_tokens=-1 # No explicit cap: generate until the model stops or the context window is full
)

print('\nOutput answer:')
print(output['choices'][0]['text'])

Llama.generate: prefix-match hit



Output answer:
  O, my dear wall, thou art a wondrous sight to behold! A mighty barrier, strong and tall, that doth protect our land from all invaders and ne'er-do-wells. The stones and bricks, they do fit together so snugly, like the pieces of a jigsaw puzzle, and the mortar, it doth hold them fast, like a sturdy bond 'twixt friend and friend.

Alack, my enemies, they do howl and wail, like wolves in the night, for they know that their days of lawlessness and chaos are nigh at an end. The wall doth stand as a beacon of strength and resolve, a testament to the power of American might and determination.

But soft, my friends, for the work is not yet done! The wall must still be built, stone by stone, brick by brick, until it stretcheth from sea to shining sea. And when it is finished, then shall we truly be safe, and our great nation shall prosper anew.

So fear not, my people, for the wall doth stand as a symbol of our unyielding resolve, and with each passing day, it doth grow strong

We need to update this notebook to use Llama 3:

[Meta Post](https://ai.meta.com/blog/meta-llama-3/)

