In [1]:
from IPython.display import clear_output, Markdown, display

In [2]:
# No need to run this on colab. These libraries come pre-installed on colab
# %pip install torch torchvision torchaudio

# Content:

In this demo, we will take a look at the llama-v2 language model.

To use llama-v2, we will use the llama-cpp-python library, which provides Python bindings for the llama.cpp library.

For this, we need to install the library and download the model weights file. The file can be downloaded from the Hugging Face [repo](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF) of [TheBloke](https://huggingface.co/TheBloke). Credit to him for quantizing the model, saving it in formats such as GGML and GGUF, and sharing it with the community. He has many other models on his profile that you can check out, including other versions of llama (such as the 70B-parameter model and Code Llama).

## Downloading model file

In [3]:
!wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

clear_output()
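
If `wget` is not available (for example when running outside Colab), the same file can be fetched from Python. The sketch below is only an illustration: `gguf_url` and `download_if_missing` are hypothetical helper names of our own, and the code assumes the `resolve/main` URL layout seen in the command above.

```python
import os
import urllib.request


def gguf_url(repo_id: str, filename: str) -> str:
    """Build a direct-download URL, assuming the resolve/main layout used above."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def download_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


MODEL_FILE = "llama-2-7b-chat.Q5_K_M.gguf"
MODEL_URL = gguf_url("TheBloke/Llama-2-7B-Chat-GGUF", MODEL_FILE)
# download_if_missing(MODEL_URL, MODEL_FILE)  # uncomment to fetch the (large) weights file
```

Skipping the download when the file already exists avoids re-fetching the weights every time the notebook is re-run.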

## Installing llama-cpp-python

llama-cpp-python can be installed with support for different hardware acceleration backends.

We will go with CUDA. Check out the [GitHub repo](https://github.com/abetlen/llama-cpp-python) for more options.

In [None]:
#!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.74  # This takes a few mins when building wheel. Be patient.

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.59.tar.gz (37.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.4/37.4 MB 9.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 6.5 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.59-cp310-cp310-manylinux_2_35_x86_64.whl size=38831179 sha256=1100ff5c2ff4803156a8073d02da4c102667cdd57aa36b92c59ebfaa2b17dd98
  Stored i

## Running Llama-v2

In [4]:
import json

from llama_cpp import Llama

In [5]:
model = Llama(
    "llama-2-7b-chat.Q5_K_M.gguf",
    n_gpu_layers=-1, # To use GPU
    n_ctx=2048, # default ctx window is 512

    # llama-cpp supports multiple variants of llama, and they can use different prompt formats.
    # Specifying the chat format isn't required, but it lets us use the create_chat_completion method.
    # Alternatively, we can structure the prompt ourselves and use the plain __call__.
    chat_format="llama-2",
)

# clear_output()

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 
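
The comment in the loading cell mentions structuring the prompt ourselves and calling the model directly. As a rough sketch, the `[INST]` / `<<SYS>>` layout that the hand-written prompts in this notebook use can be wrapped in a small helper. `build_llama2_prompt` is a hypothetical name of our own, not part of llama-cpp-python, and the whitespace simply mirrors the prompts written by hand in this notebook.

```python
def build_llama2_prompt(sys_msg: str, user_prompt: str) -> str:
    """Wrap a system and a user message in the [INST] / <<SYS>> layout."""
    return f"""
[INST] <<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""


example = build_llama2_prompt(
    "You are a Python coding instructor.",
    "Explain list comprehensions in one sentence.",
)
print(example)
```

Keeping the template in one function means a change to the layout only has to be made once.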

Let's try a code generation example.

Note that llama-v2 and llama-v2 chat aren't specifically trained for code generation, but they have some capability for it.

Models trained for code generation are what applications like GitHub Copilot or Codeium use.

Llama-2, being a text generation model, can be trained and used for that purpose as well.
There is even a version of Llama-2 called Code Llama, built for code generation.

The model files for Code Llama can also be found on TheBloke's Hugging Face profile.

In [6]:
sys_msg = "You are a Python coding instructor. Help your students with their code."
user_prompt = """
Write me a function to calculate the fibbonaci series uptil length N.
Do NOT use recursion. Instead, use dynamic programming."""

prompt = f"""
[INST] <<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""

In [7]:
output = model(
    prompt,
    max_tokens=None  # sets no length limit
)


llama_print_timings:        load time =     795.34 ms
llama_print_timings:      sample time =      74.72 ms /   502 runs   (    0.15 ms per token,  6718.78 tokens per second)
llama_print_timings: prompt eval time =     795.27 ms /    68 tokens (   11.70 ms per token,    85.51 tokens per second)
llama_print_timings:        eval time =    6237.09 ms /   501 runs   (   12.45 ms per token,    80.33 tokens per second)
llama_print_timings:       total time =    7781.36 ms /   569 tokens


In [8]:
output

{'id': 'cmpl-cdd3c728-c118-43b6-b18d-d82650dcda44',
 'object': 'text_completion',
 'created': 1741964642,
 'model': 'llama-2-7b-chat.Q5_K_M.gguf',
 'choices': [{'text': "Great! I'd be happy to help you with that. Here's a function in Python to calculate the Fibonacci series up to length N without using recursion:\n```python\ndef fib_dynamic(N):\n    # Create a 2D array to store the Fibonacci series\n    fib = [[0] * (N+1), [0]]\n    # Initialize the first two terms of the series\n    fib[0][0] = fib[1][0] = 1\n    # Calculate the next term in the series\n    for i in range(2, N+1):\n        fib[i][0] = fib[i-1][0] + fib[i-1][1]\n    return fib\n```\nHere's how the function works:\n1. We create a 2D array `fib` with two rows and `N+1` columns. The first row contains the initial values of 1 and 1, respectively. The second row is empty for now.\n2. We initialize the first term of the series by setting `fib[0][0] = fib[1][0] = 1`.\n3. Now comes the key part of the function - calculating th

In [9]:
output["choices"][0]

{'text': "Great! I'd be happy to help you with that. Here's a function in Python to calculate the Fibonacci series up to length N without using recursion:\n```python\ndef fib_dynamic(N):\n    # Create a 2D array to store the Fibonacci series\n    fib = [[0] * (N+1), [0]]\n    # Initialize the first two terms of the series\n    fib[0][0] = fib[1][0] = 1\n    # Calculate the next term in the series\n    for i in range(2, N+1):\n        fib[i][0] = fib[i-1][0] + fib[i-1][1]\n    return fib\n```\nHere's how the function works:\n1. We create a 2D array `fib` with two rows and `N+1` columns. The first row contains the initial values of 1 and 1, respectively. The second row is empty for now.\n2. We initialize the first term of the series by setting `fib[0][0] = fib[1][0] = 1`.\n3. Now comes the key part of the function - calculating the next term in the series. We use a loop that ranges from 2 to `N+1` to calculate each term. For each `i`, we use the previous terms' values to calculate the cu

In [10]:
# response
display(Markdown(output["choices"][0]["text"]))

Great! I'd be happy to help you with that. Here's a function in Python to calculate the Fibonacci series up to length N without using recursion:
```python
def fib_dynamic(N):
    # Create a 2D array to store the Fibonacci series
    fib = [[0] * (N+1), [0]]
    # Initialize the first two terms of the series
    fib[0][0] = fib[1][0] = 1
    # Calculate the next term in the series
    for i in range(2, N+1):
        fib[i][0] = fib[i-1][0] + fib[i-1][1]
    return fib
```
Here's how the function works:
1. We create a 2D array `fib` with two rows and `N+1` columns. The first row contains the initial values of 1 and 1, respectively. The second row is empty for now.
2. We initialize the first term of the series by setting `fib[0][0] = fib[1][0] = 1`.
3. Now comes the key part of the function - calculating the next term in the series. We use a loop that ranges from 2 to `N+1` to calculate each term. For each `i`, we use the previous terms' values to calculate the current term. Specifically, we add the previous term's value to the previous term's second value (i.e., `fib[i-1][1]`). This is the formula for the Fibonacci sequence: `F(n) = F(n-1) + F(n-2)`.
4. After the loop, we return the `fib` array to get the calculated values of the Fibonacci series up to length `N`.
Here's an example usage of the function:
```python
# Calculate the Fibonacci series up to length 5
fib = fib_dynamic(5)
print(fib)
# Output: [[0, 1, 1, 2, 3], [0]]
```
I hope this helps! Let me know if you have any questions or need further clarification.
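
Note that the generated function does not actually run for N >= 2: `fib` is a list with only two rows, so `fib[i][0]` raises an `IndexError` as soon as `i` reaches 2, which also means the "example output" in the response cannot be what this code prints. For comparison, a minimal bottom-up version could look like the sketch below. It assumes the series starts at 0, 1 and that N is the number of terms to return; `fib_series` is our own name, not part of the model's answer.

```python
def fib_series(n: int) -> list[int]:
    """Return the first n Fibonacci numbers, built iteratively without recursion."""
    series = []
    a, b = 0, 1  # the two most recent terms
    for _ in range(n):
        series.append(a)
        a, b = b, a + b  # slide the window one term forward
    return series


print(fib_series(5))  # [0, 1, 1, 2, 3]
```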

Let's try a translation example now.

Like before, llama-v2 has some translation capability but is not specifically trained for it.

The quality of the translations will vary depending on the languages involved.

The more of a language llama-v2 has seen during training, the better the results.

For example, the translation from English to Japanese is likely to be better than from English to Arabic.

Instead of using llama-2-chat, it would be better to use a fine-tuned seq2seq model for translation.

In [11]:
# Let's use create_chat_completion this time.
# Note that the chat format must be specified when loading the model for this to work correctly.


sys_msg = "You are state of the are languge translator who is assisting people with translations"
user_prompt = """
Translate this English to Japanese:
'I woke up really early in the morning and then went for a jog to freshen up my mind'
"""

In [12]:
output = model.create_chat_completion(
    messages = [
          {"role": "system", "content": sys_msg},
          {
              "role": "user",
              "content": user_prompt
          }
      ]
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     795.34 ms
llama_print_timings:      sample time =      36.24 ms /   235 runs   (    0.15 ms per token,  6484.55 tokens per second)
llama_print_timings: prompt eval time =     159.21 ms /    72 tokens (    2.21 ms per token,   452.25 tokens per second)
llama_print_timings:        eval time =    2912.69 ms /   234 runs   (   12.45 ms per token,    80.34 tokens per second)
llama_print_timings:       total time =    3396.83 ms /   306 tokens


In [13]:
output

{'id': 'chatcmpl-60d861f2-3525-4a73-89d6-9d976fd9b180',
 'object': 'chat.completion',
 'created': 1741964713,
 'model': 'llama-2-7b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': '  Sure, I\'d be happy to help! Here\'s the translation of "I woke up really early in the morning and then went for a jog to freshen up my mind" into Japanese:\n「朝に起きたのは非常に早く、そして脳を清算するためにランニングをしました」\nHere\'s a breakdown of the translation:\n* 朝 (asobu) - morning\n* に (ni) - at/in\n* 起きた (okiru) - woke up\n* 非常に (hikari ni) - really early\n* 早く (hayaku) - early\n* そして (soshite) - and then\n* 脳を清算する (kokoro o seisaneru) - to freshen up my mind\n* ランニング (running) - jogging\nI hope this helps! Let me know if you have any other questions.'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 73, 'completion_tokens': 234, 'total_tokens': 307}}

In [14]:
print(output['choices'][0]['message']['content'])

  Sure, I'd be happy to help! Here's the translation of "I woke up really early in the morning and then went for a jog to freshen up my mind" into Japanese:
「朝に起きたのは非常に早く、そして脳を清算するためにランニングをしました」
Here's a breakdown of the translation:
* 朝 (asobu) - morning
* に (ni) - at/in
* 起きた (okiru) - woke up
* 非常に (hikari ni) - really early
* 早く (hayaku) - early
* そして (soshite) - and then
* 脳を清算する (kokoro o seisaneru) - to freshen up my mind
* ランニング (running) - jogging
I hope this helps! Let me know if you have any other questions.
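
The two calls so far return differently shaped dictionaries: the plain `__call__` result stores the text under `choices[0]["text"]`, while `create_chat_completion` stores it under `choices[0]["message"]["content"]`. A small helper can hide that difference. This is a sketch based only on the two outputs shown above; `response_text` is a hypothetical name.

```python
def response_text(output: dict) -> str:
    """Return the generated text from either result shape shown above."""
    choice = output["choices"][0]
    if "message" in choice:  # 'chat.completion' shape
        return choice["message"]["content"]
    return choice["text"]  # 'text_completion' shape


# Tiny stand-ins for real model outputs, just to exercise the helper.
plain = {"object": "text_completion", "choices": [{"text": "hello"}]}
chat = {
    "object": "chat.completion",
    "choices": [{"message": {"role": "assistant", "content": "hi"}}],
}
print(response_text(plain), response_text(chat))
```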


For the next example, let's ask it to analyse a customer review and extract the negative parts.

For this example, we will also restrict it to producing JSON output, which is more practical when the output needs to be processed further.

In [15]:
sys_msg = """
You will be given reviews and your job is to analyse them, find and extract the negative parts.

1. Only output negative parts. If a review has no negative experiences in it, just output [].
2. One review can have multiple negative parts. find them all.
3. Do not alter parts of reviews or add something that isn't in the review.
4. Use the same keyword for the same issue across reviews. Do not use multiple keywords for the same issue.
5. Make sure the output is a valid JSON

Use this JSON format and dont say ANYTHING other than this JSON:

[
    {
        "keyword-for-the-problem": "Part of the review mentioning the problem"
    }
]

"""

user_prompt = """
Analyse this review: The food was decent, but nothing extraordinary. It felt a bit overpriced for what we got. Service was a bit slow, but the staff was polite.
"""

prompt = f"""
[INST] <<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""

In [16]:
output = model(
    prompt,
    max_tokens=None  # sets no length limit
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     795.34 ms
llama_print_timings:      sample time =       3.09 ms /    20 runs   (    0.15 ms per token,  6466.21 tokens per second)
llama_print_timings: prompt eval time =      77.15 ms /   231 tokens (    0.33 ms per token,  2994.21 tokens per second)
llama_print_timings:        eval time =     234.64 ms /    19 runs   (   12.35 ms per token,    80.98 tokens per second)
llama_print_timings:       total time =     337.67 ms /   250 tokens


In [17]:
output = json.loads(output["choices"][0]["text"])
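
Calling `json.loads` directly raises an error whenever the model wraps the JSON in extra text such as a greeting. A more tolerant parse could scan for the first position that decodes as JSON. The sketch below is one possible approach using the standard library's `JSONDecoder.raw_decode`; `extract_json` is a hypothetical helper name and the `reply` string is made up for illustration.

```python
import json


def extract_json(text: str):
    """Parse the first JSON object or array found in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "[{":
            try:
                value, _end = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from this position; keep scanning
    raise ValueError("no JSON value found in model output")


reply = 'Sure! Here is the result:\n[{"overpriced": "It felt a bit overpriced"}]\nHope this helps.'
print(extract_json(reply))
```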

In [18]:
output

{'keyword-for-the-problem': 'overpriced'}

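As will most likely also be the case when you re-run this, the output has some issues. In the run shown above, it is a single object rather than a list, it reuses the placeholder key from the format description, and it misses the slow service.

Let's try **one-shot prompting** by adding an example to the prompt.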

In [19]:
sys_msg = """
You will be given reviews and your job is to analyse them, find and extract the negative parts.

1. Only output negative parts. If a review has no negative experiences in it, just output [].
2. One review can have multiple negative parts. find them all.
3. Do not alter parts of reviews or add something that isn't in the review.
4. Use the same keyword for the same issue across reviews. Do not use multiple keywords for the same issue.
5. Make sure the output is a valid JSON

Use this JSON format and dont say ANYHING other than this JSON:

[
    {
        "keyword-for-the-problem": "Part of the review mentioning the problem"
    }
]

Here is an example:

Review: `Decent food, but nothing outstanding. The service was average, and the atmosphere was a bit lacking. It's an okay option if you're in the area.`

[
  {
    "boring-ambience": "atmosphere was a bit lacking"
  }
]

"""

user_prompt = """

Review: `The food was decent, but nothing extraordinary. It felt a bit overpriced for what we got. Service was a bit slow, but the staff was polite.`
"""

prompt = f"""
[INST]<<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""

output = model(
    prompt,
    max_tokens=None,  # sets no length limit
    stop=['<END>']
)

output = json.loads(output["choices"][0]["text"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     795.34 ms
llama_print_timings:      sample time =       3.49 ms /    24 runs   (    0.15 ms per token,  6868.92 tokens per second)
llama_print_timings: prompt eval time =     265.97 ms /   305 tokens (    0.87 ms per token,  1146.75 tokens per second)
llama_print_timings:        eval time =     284.08 ms /    23 runs   (   12.35 ms per token,    80.96 tokens per second)
llama_print_timings:       total time =     580.07 ms /   328 tokens


In [20]:
output

{'overpriced': 'It felt a bit overpriced for what we got'}
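
Even with the example in the prompt, the result in this run is a bare object rather than the requested list, and the slow service is still not picked up. When the output feeds into further processing, it can help to normalize and check it before use. The following is a minimal sketch under our own assumptions about what "valid" means here; `normalize_issues` and `excerpts_in_review` are hypothetical helpers, not part of any library.

```python
def normalize_issues(parsed) -> list[dict]:
    """Coerce parsed model output into a list of single-key {keyword: excerpt} dicts."""
    if isinstance(parsed, dict):
        parsed = [parsed]  # the model returned a bare object instead of a list
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list or object")
    issues = []
    for item in parsed:
        if not isinstance(item, dict):
            raise ValueError(f"unexpected item: {item!r}")
        # Split multi-key objects so that each issue stands alone.
        for keyword, excerpt in item.items():
            issues.append({keyword: str(excerpt)})
    return issues


def excerpts_in_review(issues: list[dict], review: str) -> bool:
    """Check rule 3 of the prompt: every excerpt must appear verbatim in the review."""
    return all(text in review for issue in issues for text in issue.values())


review = (
    "The food was decent, but nothing extraordinary. It felt a bit overpriced "
    "for what we got. Service was a bit slow, but the staff was polite."
)
issues = normalize_issues({"overpriced": "It felt a bit overpriced for what we got"})
print(issues, excerpts_in_review(issues, review))
```

A check like this catches format drift and altered excerpts, but it cannot detect issues the model failed to extract, such as the slow service here.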