# Llama 2 Chat and Conversations

We now know how to load llama2 as a local large language model, that fundamentally it takes a sequence of tokens and
uses its pretrained weights to generate a sequence of output tokens, and that this process is probabilistic in nature,
with some techniques available to help guide that randomness. Remember, these pretrained weights come from a large
corpus of data -- over 2 trillion tokens -- and how this data was fed into the model determines the weights,
representing, roughly, the relationship between tokens. It's important then to understand how that data was formatted
when it was fed into the model in order to generate reasonable prompts. The base model, which we've been using up until
this point, doesn't have any particular template, but the chat model does, because it went through a fine-tuning process
on conversations in a specific format.

From the llama 2 paper, the template is as follows:

`<s>[INST] <<SYS>>\n{your_system_message}\n<</SYS>>\n\n{user_message_1} [/INST] {llama_answer_1} </s>`

First, the sequence starts with the letter s enclosed in less than and greater than symbols, which marks the beginning
of a sequence. Then the string [INST] appears, indicating that an instruction is being given. You can see this is
mirrored later in the string with a [/INST], indicating the closing of the instruction. Next comes the SYS indicator
wrapped in ASCII guillemets, <<SYS>>, with a matching <</SYS>>, indicating that the prompt begins with a system message
to set the context. There is a variety of whitespace -- which is important because it was in the fine-tuning data -- and
a user message, which is the actual content of your prompt query. The model appends its response to this, closed by the
end of sequence marker, and future queries in the same conversation should be user messages wrapped in [INST] and
[/INST] tags.
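
To make the template concrete, here is a minimal sketch that assembles the string by hand. The function name `build_llama2_prompt` and the list-of-tuples input are my own choices for illustration, not part of llama.cpp. The handling of later turns (each one starting a fresh `<s>[INST]`) is an assumption based on the template above, and real formatters may differ slightly in whitespace.

```python
from typing import List, Optional, Tuple


def build_llama2_prompt(
    system_message: Optional[str],
    turns: List[Tuple[str, Optional[str]]],
) -> str:
    """Assemble a Llama 2 chat prompt from (user, assistant) turns.

    The assistant part of the last turn may be None, meaning the model
    is expected to generate it.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        # The system message, when present, is folded into the first user turn.
        if i == 0 and system_message is not None:
            user = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{user}"
        prompt = f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            # Completed turns are closed with the end-of-sequence marker.
            prompt += f" {assistant} </s>"
        parts.append(prompt)
    return "".join(parts)


# repr() shows the newlines as \n so the whitespace is visible
print(repr(build_llama2_prompt("Be brief.", [("Hi", None)])))
```

In practice you rarely build this string yourself, but seeing it spelled out makes it easier to recognise when a formatter is doing something unexpected.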

This can be a lot to keep in mind, but llama.cpp and its python bindings wrap this format for us. Let's take a look.


In [1]:
# Read in the path for the model file, note the new environment variable!
# Note that temperature is a sampling argument of create_chat_completion
# (where it defaults to 0.2), not of the Llama constructor, so it is not
# set here. Pass temperature=0 to create_chat_completion if you want more
# repeatable results.
import os
from llama_cpp import Llama
from llama_cpp.llama_types import ChatCompletionMessage

model: Llama = Llama(model_path=os.environ["LLAMA_13B_CHAT"], verbose=False)

In [2]:
# The high level type llama.cpp uses for this is a typed dictionary
# called ChatCompletionMessage. This dictionary will be converted at
# inference time to the underlying LLM format you need, based on the model type

# Also, since everything is a typed dictionary in the llama.cpp bindings,
# I'm going to bring in the json module to improve the display
import json

# This will be our conversation history, our messages
messages = []

# Let's first add a system message, setting the context for how the model should answer
message = ChatCompletionMessage()
message["role"] = "system"
message["content"] = (
    'You are an expert astronomer, answer each question which is given to you as factually as possible. If you do not know the answer say "unknown".'
)
messages.append(message)

# Now let's add a user query
message = ChatCompletionMessage()
message["role"] = "user"
message["content"] = "What is the fifth planet in the solar system?"
messages.append(message)

# Now let's see the result
result = model.create_chat_completion(messages=messages)
print(json.dumps(result, indent=4))

{
    "id": "chatcmpl-a93e53a1-698a-42c7-821b-ac7d3838806b",
    "object": "chat.completion",
    "created": 1718484598,
    "model": "/data/llama-2-13b-chat.Q5_K_M.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "  The fifth planet in the solar system is Earth."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 62,
        "completion_tokens": 11,
        "total_tokens": 73
    }
}


Ok! So we get back an ID for the chat completion that looks like it's built from a UUID, as well as which model was
used, and then the list of choices, which here is just a single assistant message saying that the fifth planet is
Earth.

Earth!? I know we got rid of Pluto since I was a kid, but did we add a few more? Well, let's give the model a pass on
accuracy for the moment.
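
Since the response is just a nested dictionary, pulling out the parts we care about is ordinary dictionary access. Here is a small sketch; `summarize_response` is a hypothetical helper of my own, and the `n_ctx` default of 512 assumes the model was loaded with the bindings' default context size. The example dictionary is a trimmed copy of the output above.

```python
from typing import Any, Dict


def summarize_response(result: Dict[str, Any], n_ctx: int = 512) -> Dict[str, Any]:
    """Pull the reply text and token accounting out of a chat completion."""
    choice = result["choices"][0]
    usage = result["usage"]
    return {
        # The model's text often has leading whitespace, so strip it
        "text": choice["message"]["content"].strip(),
        "finish_reason": choice["finish_reason"],
        "tokens_used": usage["total_tokens"],
        # Rough estimate of room left if the conversation continues
        "tokens_left": n_ctx - usage["total_tokens"],
    }


example = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "  The fifth planet in the solar system is Earth.",
            },
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 62, "completion_tokens": 11, "total_tokens": 73},
}

print(summarize_response(example))
```

A `finish_reason` of "stop" means the model ended its answer on its own, rather than being cut off by a token limit.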

How do we know what our message actually looked like, and what kind of roles are available to us?

Well, this comes down to abstractions -- see, llama.cpp isn't just for llama 2, and each model that's created has its
own chat format depending on how it was fine-tuned. Despite models seeming similar in how we use them, there is
potential for huge variation, and llama.cpp has been abstracted to work with a number of models, including other open
source models like Mistral. Not only does each model have its own vocabulary, model architecture, pre-trained weights,
and fine-tuned message format, some don't just take text as sequence input, and have multimodal capabilities, bringing
in images and video. It's a pretty exciting time!
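
One way to picture this abstraction is a lookup table from a format name to a function that turns the same list of messages into model-specific text. The sketch below is purely illustrative and is not how llama.cpp is implemented internally. The names `FORMATTERS`, `register`, "llama-2-like", and "chatml-like" are all made up for this example, and the ChatML-style layout is included only as a contrasting format.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Maps a format name to a function that renders messages as prompt text
FORMATTERS: Dict[str, Callable[[List[Message]], str]] = {}


def register(name: str):
    """Decorator that stores a formatter under the given name."""
    def decorator(func):
        FORMATTERS[name] = func
        return func
    return decorator


@register("llama-2-like")
def format_llama2_like(messages: List[Message]) -> str:
    system = [m["content"] for m in messages if m["role"] == "system"]
    prefix = f"<<SYS>>\n{system[0]}\n<</SYS>>\n\n" if system else ""
    out = ""
    for m in messages:
        if m["role"] == "user":
            out += f"<s>[INST] {prefix}{m['content']} [/INST]"
            prefix = ""  # the system block only accompanies the first user turn
        elif m["role"] == "assistant":
            out += f" {m['content']} </s>"
    return out


@register("chatml-like")
def format_chatml_like(messages: List[Message]) -> str:
    out = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # Leave an open assistant turn for the model to complete
    return out + "<|im_start|>assistant\n"


demo = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
for name, formatter in FORMATTERS.items():
    print(name, repr(formatter(demo)))
```

The calling code never changes; only the entry selected from the table does, which is the essence of what the bindings do for us.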

But back to llama 2 here. If we want to see how the message is being translated into text, we can take a look at the
chat formatters available.


In [3]:
# The chat format is automatically loaded when we load the model, we
# can see what the format is with this variable
model.chat_format  # this will be the str llama-2 in our example

# We can actually load and call the prompt formatting function
# ourselves if we would like to
from llama_cpp.llama_chat_format import format_llama2, ChatFormatterResponse

response: ChatFormatterResponse = format_llama2(messages=messages)
print(response.prompt)

<s>[INST] <<SYS>>
You are an expert astronomer, answer each question which is given to you as factually as possible. If you do not know the answer say "unknown".
<</SYS>> What is the fifth planet in the solar system? [/INST]


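Ok, so this looks very much like the format I introduced you to, but of course the newline characters are rendered as
line breaks in the document. Notice, too, that this formatter puts a single space after <</SYS>> rather than the two
newlines shown in the template.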


# Turning a Chat into a Conversation

One of the real powers of the chat module is that we can have multi-turn conversations. Large language models work with
a set context window, and we've already discussed that with llama 2 this context window is up to 4,096 tokens, but by
default llama.cpp creates a context window of 512 tokens to save some computational resources. The most straightforward
conversation for a llama 2 model, then, is just a series of messages that we append together into one larger sequence,
adding in the responses from the model turn by turn. Let's take a look at an example.


In [4]:
# I'll create a list here to hold the messages
messages = []

# Each message is a ChatCompletionMessage object, which again, is just a typed
# dictionary. The first message will be the system context.
message = ChatCompletionMessage()
message["role"] = "system"
message["content"] = (
    'You are an expert astronomer, answer each question which is given to you as factually as possible. If you do not know the answer say "unknown".'
)
messages.append(message)

# Now let's add a user query
message = ChatCompletionMessage()
message["role"] = "user"
message["content"] = "What is the fifth planet in the solar system?"
messages.append(message)

# Now let's get the result
result = model.create_chat_completion(messages=messages)

# We're not even going to look at the result this time; we'll assume it is
# wrong and ask a follow-up. First, let's add the reply to our list of
# messages. We get back a typed dict where the assistant's message sits
# inside the first choice
messages.append(result["choices"][0]["message"])

# Now let's ask for a clarification
message = ChatCompletionMessage()
message["role"] = "user"
message["content"] = (
    "Are you certain that is correct? I thought Earth was the third planet in our solar system. Which planet comes after Earth in distance from the Sun?"
)
messages.append(message)

# Let's see how we do
result = model.create_chat_completion(messages=messages)["choices"][0]["message"]
print(json.dumps(result, indent=4))

{
    "role": "assistant",
    "content": "  You are correct, Earth is the third planet in our solar system, not the fifth. The order of the planets in our solar system, starting from the Sun and moving outward, is:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n\nSo, there is no planet after Earth in terms of distance from the Sun."
}


Well, I'm a bit disappointed in this result. The model now correctly lists Jupiter as the fifth planet, but then
contradicts its own list by claiming that no planet comes after Earth. Keep in mind, though, that we're taking a smaller
model -- the 13 billion parameter model -- and then quantizing it heavily to run on the CPU in the Coursera web
environment for this class. A less compressed -- that is, less quantized -- model may perform better, as might a larger
model. Or we could try and tune our prompt a bit more through some prompt engineering strategies.
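
The pattern we just used (append a user message, call the model, append its reply) repeats on every turn, so it can be wrapped in a small helper. This is only a sketch: `ask` is a hypothetical function of my own, and `EchoModel` is a made-up stand-in so the example runs without loading any weights. With the real model you would pass the `Llama` object instead.

```python
from typing import Any, Dict, List


def ask(model: Any, messages: List[Dict[str, str]], question: str, **kwargs) -> str:
    """Run one conversation turn, updating the history in place."""
    messages.append({"role": "user", "content": question})
    result = model.create_chat_completion(messages=messages, **kwargs)
    reply = result["choices"][0]["message"]
    # Keep the assistant's reply so the next turn sees the full conversation
    messages.append(reply)
    return reply["content"]


class EchoModel:
    """Hypothetical stand-in for a loaded Llama model, for demonstration only."""

    def create_chat_completion(self, messages, **kwargs):
        last = messages[-1]["content"]
        return {
            "choices": [
                {"message": {"role": "assistant", "content": f"You said: {last}"}}
            ]
        }


history = [{"role": "system", "content": "You are an expert astronomer."}]
print(ask(EchoModel(), history, "What is the fifth planet?"))
print(len(history))  # system + user + assistant
```

Extra keyword arguments are forwarded to `create_chat_completion`, so with the real model something like `ask(model, messages, "...", temperature=0)` is one place to experiment with sampling settings. Keep in mind that the history grows with every turn and still has to fit inside the context window.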

And actually, that's kind of a nice way to end this module. We've gotten our hands dirty with llama 2 and llama.cpp using
python. We've explored how tokenization works, and the parameters by which we can tune how this probabilistic model
actually chooses a new output token. We've also seen that different models can have different kinds of interaction norms
depending on how they have been fine-tuned, with llama 2 chat being a good example of a pretty simple but important
template. And we've seen in this template that there are three kinds of roles that exist for messages -- a system
message, which is optional and sets the norm for how the model should respond, then any number of user and assistant
messages, representing turns in a dialog.

But now I'd like to learn from you: how might you tackle improving the response from llama 2 on this question? You don't
have to limit yourself to just the concepts we've talked about here so far -- feel free to bring in knowledge you might
have from other videos or readings you might have seen. And feel free to actually try and improve the results for this
query in the notebooks associated with the course.

In the next module, we'll look a bit more at how we can programmatically integrate with large language models, and I'll
give you a short practice assignment.
