In [None]:
from IPython.display import clear_output, Markdown, display

In [None]:
# No need to run this on colab. These libraries come pre-installed on colab
# %pip install torch torchvision torchaudio

# Content:

In this demo, we will take a look at the Llama 2 language model.

To run Llama 2, we will use the llama-cpp-python library, which provides Python bindings for the llama.cpp library.

For this, we need to install the library and download the model weights file. The file can be downloaded from the Hugging Face [repo](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF) of [TheBloke](https://huggingface.co/TheBloke). Credit to him for quantizing the model, saving it in formats such as GGML and GGUF, and sharing it with the community. He has many other models on his profile that you can check out, including other versions of Llama (such as the 70B-parameter model and Code Llama).

## Downloading model file

In [None]:
!wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf

clear_output()
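
Outside Colab, `wget` may not be available. The following is a minimal sketch of the same download in plain Python, assuming the repo and file name used above; `gguf_download_url` and `download_if_missing` are illustrative helper names, not part of any library.

```python
import os
from urllib.parse import quote
from urllib.request import urlretrieve


def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{quote(revision)}/{quote(filename)}"


def download_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless the file is already on disk."""
    if not os.path.exists(dest):
        urlretrieve(url, dest)  # the model file is several GB, so this takes a while
    return dest


url = gguf_download_url("TheBloke/Llama-2-7B-Chat-GGUF", "llama-2-7b-chat.Q5_K_M.gguf")
print(url)
# To actually fetch the file: download_if_missing(url, "llama-2-7b-chat.Q5_K_M.gguf")
```

Skipping the download when the file already exists avoids re-fetching the weights every time the notebook is re-run.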

## Installing llama-cpp-python

llama-cpp-python can be built with different hardware acceleration backends.

We will go with CUDA. Check out the [GitHub repo](https://github.com/abetlen/llama-cpp-python) for more options.

In [None]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.74  # This takes a few mins when building wheel. Be patient.

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.59.tar.gz (37.4 MB)
     37.4/37.4 MB 9.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     45.5/45.5 kB 6.5 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.59-cp310-cp310-manylinux_2_35_x86_64.whl size=38831179 sha256=1100ff5c2ff4803156a8073d02da4c102667cdd57aa36b92c59ebfaa2b17dd98
  Stored i

## Running Llama-v2

In [None]:
import json

from llama_cpp import Llama

In [None]:
model = Llama(
    "llama-2-7b-chat.Q5_K_M.gguf",
    n_gpu_layers=-1, # offload all layers to the GPU
    n_ctx=2048, # context window in tokens (the default is 512)

    # llama-cpp supports multiple variants of llama, and they can use different prompt formats.
    # Specifying the chat format isn't necessary, but it lets us use the create_chat_completion method.
    # Alternatively, we can structure the prompt ourselves and use the plain __call__.
    chat_format="llama-2",
)

# clear_output()
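
The prompt and the generated tokens share the `n_ctx` window, so it can help to check a prompt's token count before generating. The sketch below assumes the model object exposes a `tokenize` method that takes UTF-8 bytes, as `Llama` does; `WhitespaceTokenizer` is a hypothetical stand-in so the example runs without loading the weights, and `fits_in_context` is an illustrative helper name.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a loaded model: one token per whitespace-separated word."""

    def tokenize(self, text: bytes) -> list[int]:
        return list(range(len(text.split())))


def fits_in_context(model, prompt: str, max_new_tokens: int, n_ctx: int = 2048) -> bool:
    """Check that the prompt plus the requested completion fits in the context window."""
    n_prompt = len(model.tokenize(prompt.encode("utf-8")))
    return n_prompt + max_new_tokens <= n_ctx


stand_in = WhitespaceTokenizer()
print(fits_in_context(stand_in, "Write me a function", max_new_tokens=400))
```

With the real model, pass `model` instead of `stand_in`; the actual token count will differ from the word count used by the stand-in.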

Let's try a code generation example.

Note that Llama 2 and Llama 2 Chat aren't specifically trained for code generation, but they have some capability for it.

Models trained for code generation are what power applications like GitHub Copilot or Codeium.

Llama 2, being a text generation model, can be trained and used for that purpose as well.
There is even a version of Llama 2 called Code Llama, built for code generation.

The model files for Code Llama can also be found on TheBloke's Hugging Face profile.

In [None]:
sys_msg = "You are a Python coding instructor. Help your students with their code."
user_prompt = """
Write me a function to calculate the fibbonaci series uptil length N.
Do NOT use recursion. Instead, use dynamic programming."""

prompt = f"""
[INST] <<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""
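
Since the same `[INST]` / `<<SYS>>` layout is reused for later prompts, it can be wrapped in a small function. This is a sketch rather than an official template: `build_llama2_prompt` is an illustrative name, and the exact whitespace is an assumption that mirrors the f-string above.

```python
def build_llama2_prompt(sys_msg: str, user_prompt: str) -> str:
    """Wrap a system message and a user message in Llama 2 chat markers."""
    return (
        f"[INST] <<SYS>>\n{sys_msg.strip()}\n<</SYS>>\n\n"
        f"{user_prompt.strip()} [/INST]"
    )


example_prompt = build_llama2_prompt(
    "You are a Python coding instructor. Help your students with their code.",
    "Write me a function that reverses a string.",
)
print(example_prompt)
```

Stripping both messages keeps stray newlines from triple-quoted strings out of the final prompt.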

In [None]:
output = model(
    prompt,
    max_tokens=None  # sets no length limit
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     243.03 ms
llama_print_timings:      sample time =     248.98 ms /   400 runs   (    0.62 ms per token,  1606.55 tokens per second)
llama_print_timings: prompt eval time =     219.73 ms /    35 tokens (    6.28 ms per token,   159.28 tokens per second)
llama_print_timings:        eval time =   11043.40 ms /   399 runs   (   27.68 ms per token,    36.13 tokens per second)
llama_print_timings:       total time =   12900.07 ms /   434 tokens


In [None]:
output

{'id': 'cmpl-3eea1755-e55c-42b5-b988-d020dd5e3355',
 'object': 'text_completion',
 'created': 1712180566,
 'model': 'llama-2-7b-chat.Q5_K_M.gguf',
 'choices': [{'text': "Of course! Here is an example of how you could write a Python function to calculate the Fibonacci sequence up to a given length `N` using dynamic programming:\n```\ndef fibonacci(n):\n    # Initialize a list to store the Fibonacci numbers\n    fibs = [0, 1]\n    # Loop until the length of the list is equal to N\n    for _ in range(n-2):\n        # Add the previous two Fibonacci numbers to the current one\n        fibs.append(fibs[-1] + fibs[-2])\n    return fibs\n```\nExplanation:\n\nThe idea behind this function is to use dynamic programming to compute the `n`-th Fibonacci number, `Fib(n)`. We do this by storing the first `n` Fibonacci numbers in a list, `fibs`, and then computing each subsequent number as a sum of the previous two.\nThe function initializes `fibs` with the first two Fibonacci numbers, `0` and `1`. Th

In [None]:
output["choices"][0]

{'text': "Of course! Here is an example of how you could write a Python function to calculate the Fibonacci sequence up to a given length `N` using dynamic programming:\n```\ndef fibonacci(n):\n    # Initialize a list to store the Fibonacci numbers\n    fibs = [0, 1]\n    # Loop until the length of the list is equal to N\n    for _ in range(n-2):\n        # Add the previous two Fibonacci numbers to the current one\n        fibs.append(fibs[-1] + fibs[-2])\n    return fibs\n```\nExplanation:\n\nThe idea behind this function is to use dynamic programming to compute the `n`-th Fibonacci number, `Fib(n)`. We do this by storing the first `n` Fibonacci numbers in a list, `fibs`, and then computing each subsequent number as a sum of the previous two.\nThe function initializes `fibs` with the first two Fibonacci numbers, `0` and `1`. Then, it loops `n-2` times to compute the remaining `n`-th Fibonacci number. On each iteration, it adds the previous two Fibonacci numbers to the current one, whi

In [None]:
# response
display(Markdown(output["choices"][0]["text"]))

Of course! Here is an example of how you could write a Python function to calculate the Fibonacci sequence up to a given length `N` using dynamic programming:
```
def fibonacci(n):
    # Initialize a list to store the Fibonacci numbers
    fibs = [0, 1]
    # Loop until the length of the list is equal to N
    for _ in range(n-2):
        # Add the previous two Fibonacci numbers to the current one
        fibs.append(fibs[-1] + fibs[-2])
    return fibs
```
Explanation:

The idea behind this function is to use dynamic programming to compute the `n`-th Fibonacci number, `Fib(n)`. We do this by storing the first `n` Fibonacci numbers in a list, `fibs`, and then computing each subsequent number as a sum of the previous two.
The function initializes `fibs` with the first two Fibonacci numbers, `0` and `1`. Then, it loops `n-2` times to compute the remaining `n`-th Fibonacci number. On each iteration, it adds the previous two Fibonacci numbers to the current one, which gives us the next number in the sequence.
Here's an example of how you could use this function:
```
# Calculate the Fibonacci sequence up to length 5
print(fibonacci(5)) # [0, 1, 1, 2, 3]
```
This will output the first five Fibonacci numbers: `0`, `1`, `1`, `2`, and `3`.
I hope this helps! Let me know if you have any questions or need further clarification.
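
To use the generated code rather than just read it, the fenced blocks can be pulled out of the reply text. The sketch below uses a regular expression and a shortened stand-in for `output["choices"][0]["text"]`; `extract_code_blocks` is an illustrative helper, and running model-generated code with `exec` is only reasonable after reading it.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example can sit inside a fence


def extract_code_blocks(markdown_text: str) -> list[str]:
    """Return the contents of every fenced code block in a Markdown string."""
    pattern = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)
    return [block.strip() for block in pattern.findall(markdown_text)]


# A shortened stand-in for the model reply shown above
reply = (
    "Of course! Here is an example:\n"
    + FENCE + "\n"
    + "def fibonacci(n):\n"
    + "    fibs = [0, 1]\n"
    + "    for _ in range(n-2):\n"
    + "        fibs.append(fibs[-1] + fibs[-2])\n"
    + "    return fibs\n"
    + FENCE + "\n"
    + "I hope this helps!"
)

blocks = extract_code_blocks(reply)
namespace = {}
exec(blocks[0], namespace)  # only do this with code you have read and trust
print(namespace["fibonacci"](5))  # [0, 1, 1, 2, 3]
```

Running the extracted function is also a quick way to spot problems: for `n = 1` the generated function still returns two elements, `[0, 1]`.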

Let's try a translation example now.

As before, Llama 2 has some translation capability but is not specifically trained for it.

The quality of the translations will vary depending on the languages involved.

The more of a language Llama 2 has seen during training, the better the results tend to be.

For example, translation from English to Japanese will likely be better than from English to Arabic.

Instead of using llama-2-chat, it would be better to use a fine-tuned seq2seq model for translations.

In [None]:
# Let's use create_chat_completion this time.
# Note that the chat format must be specified for this to work correctly


sys_msg = "You are state of the are languge translator who is assisting people with translations"
user_prompt = """
Translate this English to Japanese:
'I woke up really early in the morning and then went for a jog to freshen up my mind'
"""

In [None]:
output = model.create_chat_completion(
    messages = [
          {"role": "system", "content": sys_msg},
          {
              "role": "user",
              "content": user_prompt
          }
      ]
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     243.03 ms
llama_print_timings:      sample time =     134.06 ms /   202 runs   (    0.66 ms per token,  1506.83 tokens per second)
llama_print_timings: prompt eval time =     239.54 ms /    63 tokens (    3.80 ms per token,   263.01 tokens per second)
llama_print_timings:        eval time =    5265.27 ms /   201 runs   (   26.20 ms per token,    38.17 tokens per second)
llama_print_timings:       total time =    6422.72 ms /   264 tokens


In [None]:
output

{'id': 'chatcmpl-85eeeeb4-9001-41bf-9e88-b3ae70787ce1',
 'object': 'chat.completion',
 'created': 1712184349,
 'model': 'llama-2-7b-chat.Q5_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': '  Sure, I\'d be happy to help! Here\'s the translation of "I woke up really early in the morning and then went for a jog to freshen up my mind" into Japanese:\n「朝のうちに寝ていたが、脳を清めるためにランニングをした」\nHere\'s a breakdown of the translation:\n* 朝 (asobu) - morning\n* うちに (uchi ni) - in the morning\n* 寝ていた (nite ita) - woke up\n* 脳を清める (kokoro o kyūmu) - to freshen up my mind\n* た (ta) - to\n* ランニング (ranningu) - jogging\nI hope this helps! Let me know if you have any other questions.'},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 73, 'completion_tokens': 201, 'total_tokens': 274}}
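
The two calls used so far return differently shaped dictionaries: the plain call puts the reply under `choices[0]["text"]`, while `create_chat_completion` puts it under `choices[0]["message"]["content"]`. A small sketch that handles both shapes, based only on the outputs shown in this notebook (`reply_text` is an illustrative helper name):

```python
def reply_text(output: dict) -> str:
    """Return the generated text from either a completion or a chat completion result."""
    choice = output["choices"][0]
    if "message" in choice:
        # Shape returned by create_chat_completion
        return choice["message"]["content"]
    # Shape returned by calling the model directly
    return choice["text"]


completion_like = {"object": "text_completion", "choices": [{"text": "Of course!"}]}
chat_like = {
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "Sure!"}}],
}
print(reply_text(completion_like))
print(reply_text(chat_like))
```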

In [None]:
print(output['choices'][0]['message']['content'])

  Sure, I'd be happy to help! Here's the translation of "I woke up really early in the morning and then went for a jog to freshen up my mind" into Japanese:
「朝のうちに寝ていたが、脳を清めるためにランニングをした」
Here's a breakdown of the translation:
* 朝 (asobu) - morning
* うちに (uchi ni) - in the morning
* 寝ていた (nite ita) - woke up
* 脳を清める (kokoro o kyūmu) - to freshen up my mind
* た (ta) - to
* ランニング (ranningu) - jogging
I hope this helps! Let me know if you have any other questions.

Note that this translation is not accurate: 寝ていた means "was sleeping" rather than "woke up", and several romanizations in the breakdown are wrong (朝 is read "asa", not "asobu"). This illustrates the quality caveat mentioned above.


For the next example, let's ask it to extract the negative parts of a customer review.

For this example, we will also constrain it to produce JSON output, which is more practical when the output needs further processing.

In [None]:
sys_msg = """
You will be given reviews and your job is to analyse them, find and extract the negative parts.

1. Only output negative parts. If a review has no negative experiences in it, just output [].
2. One review can have multiple negative parts. find them all.
3. Do not alter parts of reviews or add something that isn't in the review.
4. Use the same keyword for the same issue across reviews. Do not use multiple keywords for the same issue.
5. Make sure the output is a valid JSON

Use this JSON format and dont say ANYHING other than this JSON:

[
    {
        "keyword-for-the-problem": "Part of the review mentioning the problem"
    }
]

"""

user_prompt = """
Analyse this review: The food was decent, but nothing extraordinary. It felt a bit overpriced for what we got. Service was a bit slow, but the staff was polite.
"""

prompt = f"""
[INST] <<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""

In [None]:
output = model(
    prompt,
    max_tokens=None  # sets no length limit
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     553.30 ms
llama_print_timings:      sample time =      11.29 ms /    20 runs   (    0.56 ms per token,  1771.79 tokens per second)
llama_print_timings: prompt eval time =     348.96 ms /   226 tokens (    1.54 ms per token,   647.64 tokens per second)
llama_print_timings:        eval time =     478.23 ms /    19 runs   (   25.17 ms per token,    39.73 tokens per second)
llama_print_timings:       total time =     891.98 ms /   245 tokens


In [None]:
output = json.loads(output["choices"][0]["text"])

In [None]:
output

{'keyword-for-the-problem': 'overpriced'}

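As will most likely be the case when you re-run this, the output has some issues: here the model returned a single object instead of a list, echoed the placeholder key from the format description, and missed the slow service.

Let's try **one-shot prompting** by adding an example to the prompt.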
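
When the example is likely to change, the system message can be assembled in code instead of being written by hand. The following is a minimal sketch under that assumption; `one_shot_system_message` is an illustrative helper name and the instruction text is shortened. Serializing the example with `json.dumps` ensures the example shown to the model is itself valid JSON.

```python
import json


def one_shot_system_message(instructions: str, example_review: str, example_output: list) -> str:
    """Append one worked example (review plus expected JSON) to the instructions."""
    example_json = json.dumps(example_output, indent=2)
    return (
        f"{instructions.strip()}\n\n"
        "Here is an example:\n\n"
        f"Review: `{example_review}`\n\n"
        f"{example_json}\n"
    )


sys_msg_example = one_shot_system_message(
    "Extract the negative parts of each review as a JSON list.",
    "Decent food, but nothing outstanding. The atmosphere was a bit lacking.",
    [{"boring-ambience": "atmosphere was a bit lacking"}],
)
print(sys_msg_example)
```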

In [None]:
sys_msg = """
You will be given reviews and your job is to analyse them, find and extract the negative parts.

1. Only output negative parts. If a review has no negative experiences in it, just output [].
2. One review can have multiple negative parts. find them all.
3. Do not alter parts of reviews or add something that isn't in the review.
4. Use the same keyword for the same issue across reviews. Do not use multiple keywords for the same issue.
5. Make sure the output is a valid JSON

Use this JSON format and dont say ANYHING other than this JSON:

[
    {
        "keyword-for-the-problem": "Part of the review mentioning the problem"
    }
]

Here is an example:

Review: `Decent food, but nothing outstanding. The service was average, and the atmosphere was a bit lacking. It's an okay option if you're in the area.`

[
  {
    "boring-ambience": "atmosphere was a bit lacking"
  }
]

"""

user_prompt = """

Review: `The food was decent, but nothing extraordinary. It felt a bit overpriced for what we got. Service was a bit slow, but the staff was polite.`
"""

prompt = f"""
[INST]<<SYS>>
{sys_msg}
<</SYS>>
{user_prompt}[/INST]
"""

output = model(
    prompt,
    max_tokens=None,  # sets no length limit
    stop=['<END>']
)

output = json.loads(output["choices"][0]["text"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     553.30 ms
llama_print_timings:      sample time =      25.35 ms /    50 runs   (    0.51 ms per token,  1972.08 tokens per second)
llama_print_timings: prompt eval time =     216.36 ms /    40 tokens (    5.41 ms per token,   184.88 tokens per second)
llama_print_timings:        eval time =    1256.43 ms /    49 runs   (   25.64 ms per token,    39.00 tokens per second)
llama_print_timings:       total time =    1619.58 ms /    89 tokens


In [None]:
output

[{'overpriced': 'It felt a bit overpriced for what we got'},
 {'slow-service': 'Service was a bit slow'}]
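
Even with an example in the prompt, the model may wrap the JSON in extra text or return a single object, in which case a bare `json.loads` either fails or yields the wrong shape. A defensive parser along the following lines can help; it is a sketch, with `parse_issue_list` as an illustrative name and the expected shape (a list of objects) taken from the prompt above.

```python
import json


def parse_issue_list(raw: str) -> list[dict]:
    """Parse model output into a list of dicts, tolerating common formatting slips."""
    text = raw.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span if the model added surrounding text
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON array found in model output")
        data = json.loads(text[start:end + 1])
    if isinstance(data, dict):
        data = [data]  # a single object is treated as a one-item list
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        raise ValueError("expected a JSON list of objects")
    return data


print(parse_issue_list('{"keyword-for-the-problem": "overpriced"}'))
print(parse_issue_list('Sure! [{"slow-service": "Service was a bit slow"}] Hope this helps.'))
```

Raising `ValueError` on anything unexpected makes it easy to retry the generation instead of passing bad data downstream.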