# Introduction

In other notebooks we demonstrated how to use the **Llama 2** model for various tasks: testing it on math problems, creating a sequential task chain (where the output of one task is used as a parameter in the input of the next), and building a Retrieval Augmented Generation system with Llama 2 as the LLM, ChromaDB as the vector database, and Langchain as the task-chaining framework.

In this notebook we experiment with llama.cpp. This library helps us run Llama and other models on lower-performance (consumer) hardware. Its tooling converts and quantizes the Llama model to the GGUF format.

# Installation


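We start by installing the llama-cpp-python bindings (built with cuBLAS support) and cloning the llama.cpp repository, which provides the conversion script.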

In [1]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.7.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 18.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.6 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.7-cp310-cp310-ma

Then we convert our model to the GGUF format used by **llama.cpp**, quantizing the weights to 8 bits (`q8_0`).

In [2]:
!python llama.cpp/convert.py /kaggle/input/llama-2/pytorch/7b-chat-hf/1 \
  --outfile llama-7b.gguf \
  --outtype q8_0

Loading model file /kaggle/input/llama-2/pytorch/7b-chat-hf/1/model-00001-of-00002.safetensors
Loading model file /kaggle/input/llama-2/pytorch/7b-chat-hf/1/model-00001-of-00002.safetensors
Loading model file /kaggle/input/llama-2/pytorch/7b-chat-hf/1/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=2048, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, f_rope_freq_base=None, f_rope_scale=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('/kaggle/input/llama-2/pytorch/7b-chat-hf/1'))
Loading vocab file '/kaggle/input/llama-2/pytorch/7b-chat-hf/1/tokenizer.model', type 'spm'
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
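
To make the `q8_0` output type more concrete, here is a simplified, pure-Python sketch of 8-bit block quantization in the spirit of that format: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. It is an illustration under those assumptions, not the code llama.cpp runs; the function names are hypothetical, and the rounding of the scale to a 16-bit float is ignored.

```python
BLOCK_SIZE = 32  # assumption: weights are quantized in blocks of 32


def quantize_q8_0_block(values):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        # An all-zero block needs no scaling.
        return 0.0, [0] * BLOCK_SIZE
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Toy weights in [-0.32, 0.31]; real model weights would come from the checkpoint.
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]

scale, quants = quantize_q8_0_block(weights)
restored = dequantize_q8_0_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: one 2-byte scale plus 32 one-byte integers per block.
bytes_per_block = 2 + BLOCK_SIZE
bits_per_weight = 8 * bytes_per_block / BLOCK_SIZE

print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}")
print(f"bits per weight = {bits_per_weight}")
```

At 8.5 bits per weight, a model with roughly 7 billion weights needs on the order of 7 GB, about half the size of the same weights stored as 16-bit floats.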

# Import packages

In [3]:
from llama_cpp import Llama

# Test the model


Let's quickly test the model. We first initialize it.

In [4]:
llm = Llama(model_path="/kaggle/working/llama-7b.gguf")

ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
  Device 1: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from /kaggle/working/llama-7b.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q8_0     [  409
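
The constructor call above relies on default settings. As a hedged sketch, llama-cpp-python's `Llama` also accepts `n_gpu_layers` (how many model layers to offload to the GPU) and `n_ctx` (the context window size). The values below are assumptions taken from the conversion log (`n_layer=32`, `n_ctx=2048`), not settings used in this notebook, and `llama_kwargs` and `load_llama` are hypothetical helper names. The import is deferred so the snippet can be read and run without loading the model.

```python
def llama_kwargs(model_path, n_gpu_layers=32, n_ctx=2048):
    """Collect constructor arguments; the defaults here are assumptions."""
    return {
        "model_path": model_path,
        "n_gpu_layers": n_gpu_layers,  # layers to offload to the GPU
        "n_ctx": n_ctx,                # context window, in tokens
    }


def load_llama(**kwargs):
    """Create the model; imports llama_cpp only when actually called."""
    from llama_cpp import Llama
    return Llama(**kwargs)


kwargs = llama_kwargs("/kaggle/working/llama-7b.gguf")
print(kwargs)

# Uncomment to load the model with these settings:
# llm = load_llama(**kwargs)
```

Whether offloading helps depends on the build and on available GPU memory, so it is worth comparing the `llama_print_timings` lines with and without it.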

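Let's define a question. We cap the completion at 38 tokens, stop generation at the next `Q:` or at a newline, and echo the prompt back in the returned text.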

In [5]:
output = llm("Q: Name three capital cities in Europe? A: ", max_tokens=38, stop=["Q:", "\n"], echo=True)


llama_print_timings:        load time =  6884.73 ms
llama_print_timings:      sample time =    19.03 ms /    19 runs   (    1.00 ms per token,   998.32 tokens per second)
llama_print_timings: prompt eval time =  6884.61 ms /    13 tokens (  529.59 ms per token,     1.89 tokens per second)
llama_print_timings:        eval time = 15813.53 ms /    18 runs   (  878.53 ms per token,     1.14 tokens per second)
llama_print_timings:       total time = 22793.10 ms
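
The timing lines report elapsed milliseconds and token counts for each phase; throughput is simply tokens divided by seconds. A small sketch that recomputes the figures from the log above (the function name is hypothetical):

```python
def tokens_per_second(elapsed_ms, n_tokens):
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values copied from the llama_print_timings lines above.
prompt_tps = tokens_per_second(6884.61, 13)   # prompt eval phase
eval_tps = tokens_per_second(15813.53, 18)    # generation phase

print(f"prompt eval: {prompt_tps:.2f} tokens/s")
print(f"generation:  {eval_tps:.2f} tokens/s")
```

The generation figure is the one that dominates total time for longer answers, since it is paid once per generated token.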


And now let's see the output.

In [6]:
output

{'id': 'cmpl-88e5b028-5cf1-430b-bcc3-8b0a06eb77fd',
 'object': 'text_completion',
 'created': 1695933869,
 'model': '/kaggle/working/llama-7b.gguf',
 'choices': [{'text': 'Q: Name three capital cities in Europe? A: 1. Berlin, Germany 2. Paris, France 3. London, United Kingdom',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 13, 'completion_tokens': 19, 'total_tokens': 32}}
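
Because we passed `echo=True`, the returned text begins with the prompt. A minimal sketch of pulling out only the answer and the token counts, using a dictionary copied from the output above rather than a live model call (`extract_answer` is a hypothetical helper name):

```python
prompt = "Q: Name three capital cities in Europe? A: "

# Copied from the output above, trimmed to the fields used here.
output = {
    "choices": [{
        "text": "Q: Name three capital cities in Europe? A: 1. Berlin, Germany 2. Paris, France 3. London, United Kingdom",
        "index": 0,
        "logprobs": None,
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 13, "completion_tokens": 19, "total_tokens": 32},
}


def extract_answer(result, prompt_text):
    """Return the first completion without the echoed prompt."""
    text = result["choices"][0]["text"]
    if text.startswith(prompt_text):
        text = text[len(prompt_text):]
    return text.strip()


answer = extract_answer(output, prompt)
usage = output["usage"]

print(answer)
print(usage["completion_tokens"], "completion tokens, finish reason:",
      output["choices"][0]["finish_reason"])
```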

Next, let's ask a math question.

In [7]:
output = llm("If a circle has the radius 3, what is its area?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =  6884.73 ms
llama_print_timings:      sample time =   112.04 ms /   106 runs   (    1.06 ms per token,   946.09 tokens per second)
llama_print_timings: prompt eval time =  7585.80 ms /    14 tokens (  541.84 ms per token,     1.85 tokens per second)
llama_print_timings:        eval time = 92415.07 ms /   105 runs   (  880.14 ms per token,     1.14 tokens per second)
llama_print_timings:       total time = 100555.42 ms


Let's check the answer.

In [8]:
print(output['choices'][0]['text'])



Answer: The area of a circle is given by the formula A = πr^2, where r is the radius of the circle. In this case, the radius of the circle is 3, so the area of the circle is:

A = π(3)^2
= 3.14 (9)
= 29.06

Therefore, the area of the circle with a radius of 3 is approximately 29.06.
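
The model picks the right formula, but its arithmetic is off: 3.14 × 9 is 28.26, not 29.06. A quick check of the expected value:

```python
import math

radius = 3

# Area using the full-precision value of pi.
exact_area = math.pi * radius ** 2

# Area using the same rounded value of pi the model used.
rounded_pi_area = 3.14 * radius ** 2

print(f"pi * r^2   = {exact_area:.2f}")
print(f"3.14 * r^2 = {rounded_pi_area:.2f}")
```

Either way the area is approximately 28.3, so the model's final number should not be trusted without verification.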
