# Querying Llama

Now that we have some knowledge of the fundamentals of how llama 2 represents data, both within the model
architecture itself and in the sequences of tokens coming in and out of the model, we can start to query the model.
I've used the word "query" here, but I don't mean it in the relational database sense; I mean it in the human language
sense: to ask a question. A database query is deterministic -- if the data hasn't changed and the query hasn't changed
then the same result is always returned. A language model, by contrast, is nondeterministic, so a more precise word to
describe our interaction with the model is "inference" or "prediction".


In [1]:
# Let's write our first llama.cpp based application.
# To create a new model to query we have to identify the quantized
# file which we want to use. In this course we have two models created
# for experimentation, one being the base 13B parameter model and the
# other being the chat-tuned 13B parameter model. We'll explore the
# base model first.

# Read in the path for the model file
import os

model_path: str = os.environ["LLAMA_13B"]

# Import the llama.cpp python bindings and load the model
from llama_cpp import Llama

model: Llama = Llama(model_path=model_path)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /data/llama-2-13b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              

Ok, all we've done is load the model and llama.cpp has given us a ton of debug information! This is worth taking a quick
look at, especially if you might end up using multiple kinds of models, either different architectures or different
quantization levels.

We see that this is a llama 2 model, that the context length is up to 4096 tokens -- and we'll talk about that in a
moment -- and some information about the special tokens which exist for the model. If we scroll down we can see the
number of layers in the model, which speaks to the internal data structure, and whether these have been configured to be
offloaded onto a GPU or not. One of the interesting options in llama.cpp is that you don't have to choose between the
CPU or the GPU; you can do a bit of both depending upon your hardware setup.
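
As a rough sketch of that CPU/GPU split: the llama-cpp-python `Llama` constructor accepts an `n_gpu_layers` argument that controls how many layers are offloaded to the GPU. The helper `offload_kwargs` and the 50% fraction below are hypothetical, chosen only for illustration, and the commented-out call assumes a llama.cpp build with GPU support.

```python
def offload_kwargs(total_layers: int, gpu_fraction: float) -> dict[str, int]:
    """Build constructor keyword arguments for a partial GPU offload.

    Hypothetical helper: picks how many of the model's layers to place
    on the GPU; the remaining layers stay on the CPU.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    return {"n_gpu_layers": int(total_layers * gpu_fraction)}


# The load log reports llama.block_count = 40 for this model
kwargs = offload_kwargs(total_layers=40, gpu_fraction=0.5)
print(kwargs)

# With a GPU-enabled build, the result could be passed along like so:
# model = Llama(model_path=model_path, **kwargs)
```

How many layers actually fit depends on the available GPU memory, so the fraction is something to tune by experiment rather than a recommended value.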

As we get to the bottom here we see the `llama_new_context_with_model` log lines, and this indicates the default seems
to be a context of 512 tokens.


In [2]:
# Before we go further I want to reduce the verbosity of the output
# a bit -- feel free to leave this set to True to see more of the
# nitty gritty details of what's happening!
model.verbose = False

# Now that we have a model let's actually do some inference! The
# method we use for this is called create_completion(), and by
# default we only need to pass it a prompt to complete and it
# returns to us a Completion object, which is a TypedDict.

from llama_cpp.llama_types import Completion

result: Completion = model.create_completion(prompt="The capital of Michigan is ")

# The Completion type has a choices key which shows us the list of
# responses the LLM generated, let's take a look
print(result["choices"])

[{'text': '178 miles from the Canadian border and 568 miles from New', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}]


Ok! So, if you are following along in the notebook you undoubtedly have a different response than I'm showing here.
That's what we mean by inference: the model is making a series of probabilistic choices based on the input sequence and
the weights to generate this output sequence. We can make the model behave more deterministically by setting its
`temperature`, a parameter typically set between zero and one, where values closer to zero cause the model to behave
more deterministically and values closer to one cause it to behave less deterministically and more creatively. Let's do
a little experiment.


In [3]:
# Let's try a few different temperature values
temps: list[float] = [0.0, 0.5, 1.0]

# Now, for each of these temperatures, let's do three completions
prompt: str = "The planets in the solar system include "
for temp in temps:
    for i in range(0, 3):
        result: Completion = model.create_completion(prompt=prompt, temperature=temp)
        print(f'temp={temp}, run={i}, result: {result["choices"][0]["text"]}')

temp=0.0, run=0, result: 8 planets, 5 dwarf planets and 123
temp=0.0, run=1, result: 8 planets, 5 dwarf planets and 123
temp=0.0, run=2, result: 8 planets, 5 dwarf planets and 123
temp=0.5, run=0, result: 8 planets and 1 dwarf planet. They are Mercury,
temp=0.5, run=1, result: 8 planets and one dwarf planet.
The inner planets are
temp=0.5, run=2, result: 8 planets, and Pluto is no longer considered a planet.
The
temp=1.0, run=0, result: 8 major planets. It is believed that there are billions of galaxies and
temp=1.0, run=1, result: 8 planets and some dwarf planets, comets, and astero
temp=1.0, run=2, result: 8 planets. But what are these planets called? You must have heard


Interesting! We can see that the low temperature returns really consistent results, and that the high temperature gives
different kinds of answers. Now, don't go reading into the tea leaves too much here, but play with this, experiment and
get a feeling for how the temperature affects your prompt completions. Underneath, the temperature is changing how the
next token is picked from a set of candidate tokens, so at higher temperatures repeated runs of the same prompt will
diverge from one another more quickly.
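
To make that a little more concrete, here is a minimal sketch of how temperature sampling is commonly implemented: the candidate scores (logits) are divided by the temperature before being turned into probabilities, and a temperature of zero is treated as always picking the top candidate. The function `sample_token` and the scores in `toy_logits` are made up for illustration; this is not llama.cpp's actual sampler, which also applies other filters.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token from toy logits using temperature scaling."""
    if temperature <= 0.0:
        # Treat zero as greedy decoding: always take the top-scoring token
        return max(logits, key=logits.get)
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax, subtracting the max for numerical stability
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Made-up scores for the token following "The planets in the solar system include "
toy_logits = {"8": 2.0, "eight": 1.5, "Mercury": 1.0, "many": 0.2}
rng = random.Random(0)
for temp in [0.0, 0.5, 1.0]:
    picks = [sample_token(toy_logits, temp, rng) for _ in range(5)]
    print(f"temp={temp}: {picks}")
```

At a temperature of zero every pick is the same token, while at higher temperatures the lower-scoring candidates get chosen more often.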

You've undoubtedly noticed that we're just getting short little responses here. Generating tokens is slow, and by
default the `create_completion()` method limits the number of tokens returned to just 16. We can set `max_tokens` to -1
to generate as many tokens as the context allows.


In [4]:
# Let's just do one run, and we'll leave the temperature at
# its default value, which is 0.8
result: Completion = model.create_completion(prompt=prompt, max_tokens=-1)
print(result["choices"][0]["text"])

149 moons, dwarf planets and other objects.
The planets are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.
Mercury is closest to the sun but it's size makes it appear smaller than our moon.
Venus is next closest but it has a dense atmosphere which makes it appear larger than it really is.
Earth has a thick atmosphere but it has a molten core that makes it appear larger than it really is.
Mars has a thin atmosphere which makes it appear smaller than it really is.
Jupiter has a thick atmosphere that makes it appear larger than it really is.
Saturn has a thick atmosphere that makes it appear larger than it really is.
Uranus has a thick atmosphere that makes it appear larger than it really is.
Neptune has a thick atmosphere that makes it appear larger than it really is.
Pluto has a thick atmosphere that makes it appear larger than it really is.


Ok, the first thing you'll notice, especially if you are following along in the lab workspaces with me, is that it's
slow. We're already running this model in a quantized form, so to speed things up further we need to either get better
hardware, or further reduce the model size, perhaps through additional quantization. You'll also likely notice -- and I
say likely because everything here is non-deterministic -- that the model still doesn't just "finish". It trails off
eventually, but not at 16 tokens. And this comes down to the context length.
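
One way to check why a completion stopped is the `finish_reason` field we saw in the choices printed earlier, where it was `'length'`. Here is a small sketch of interpreting it. The helper `describe_finish` is hypothetical, and `example_choice` is hand-written to mimic the shape of that earlier output rather than being a real model response.

```python
def describe_finish(choice: dict) -> str:
    """Explain why a completion choice stopped generating."""
    reason = choice.get("finish_reason")
    if reason == "length":
        return "hit the token limit (max_tokens or the context window)"
    if reason == "stop":
        return "the model emitted an end-of-sequence token or a stop string"
    return f"unrecognized finish_reason: {reason!r}"


# A hand-written choice with the same keys as the output shown earlier
example_choice = {"text": "...", "index": 0, "logprobs": None, "finish_reason": "length"}
print(describe_finish(example_choice))
```

With a real result you would pass in `result["choices"][0]` instead of the hand-written dictionary.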


# Context Length

Once you understand tokenization the context length is really simple: it's just the maximum length of the token sequence
that the model is trained on. In the case of the original llama model the context length was just over two thousand
tokens (2,048), which means the input data used for training the model was broken up into sequences of at most this
length. For llama 2 that was increased to over four thousand tokens (4,096).

You can think of the context length like the amount of "memory" that an LLM has for a given query. The bigger it is the
more you can put in the query and thus the more the LLM will be aware of when it returns a response to you. Training --
and inference -- with a large context length can increase the quality of the output, but it does so at the cost of
increased computation. When you create a new `Llama()` object in llama.cpp the default maximum context length is set to
512 tokens.
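
Since the prompt and the generated tokens share the same window, the arithmetic can be sketched like this. The helper `completion_budget` is hypothetical, and the prompt length of 12 tokens is an assumed number used only for illustration.

```python
def completion_budget(n_ctx: int, prompt_tokens: int) -> int:
    """Return how many tokens can still be generated within the context window."""
    if prompt_tokens > n_ctx:
        raise ValueError("prompt alone exceeds the context window")
    return n_ctx - prompt_tokens


# 12 is an assumed prompt length, used only for illustration
for n_ctx in (512, 4096):
    print(f"n_ctx={n_ctx}: room for {completion_budget(n_ctx, 12)} generated tokens")
```

With a loaded model you could measure the real numbers instead: `len(model.tokenize(prompt.encode("utf-8")))` gives the prompt length in tokens and `model.n_ctx()` gives the configured window.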


In [5]:
# Let's do one last demo, and here I want to show you how we don't have to
# wait for the whole query to finish, but can instead use the streaming features
# of llama.cpp to see tokens as they are completed.

# I'm going to create a new model with a nice large context size
model: Llama = Llama(model_path=model_path, verbose=False, n_ctx=4096)

# If we pass the stream=True parameter to create_completion() we will get back
# an iterator of CreateCompletionStreamResponse objects, which are just typed
# dictionaries similar to the Completion type
token_count: int = 0
for result in model.create_completion(
    prompt="Some fun things to do for vacation in the state of Michigan includes ",
    max_tokens=-1,
    stream=True,
):
    # I'm only going to print a newline every 50 tokens or so
    if token_count % 50 == 0:
        print("")
    token_count = token_count + 1
    print(result["choices"][0]["text"], end="")


8 miles of coastline, 40,000 lakes and rivers, 360 species of birds, 190 species of fish, 70 mammal species, 75 reptile species,
 more than 3,200 miles of shoreline, and 8 million acres of forest.
The largest city in Michigan is Detroit. Other major cities are Grand Rapids, Warren, Sterling Heights, Lansing,
 Ann Arbor and Flint.
Michigan is known for its four distinct seasons. There is something to do here year-round.
Michigan’s largest industry is automobile manufacturing. The Big Three – General Motors, Ford
 Motor Company and Chrysler – all have headquarters here.
Michigan is a popular vacation destination. More than 48 million people visit the state each year.
Michigan has more than 100 lighthouses. The
 oldest is the Huron Light Station on Lake Superior. The tallest is the White River Light Station on Lake Michigan.
Michigan is home to several world-renowned universities, including the University of Michigan and Michigan State University.

The state has more than 100 breweries. 

Ok, while you learn a little bit more about all the wonderful things you can do on vacation here in the State of
Michigan, I'm going to go grab a diet coke and refresh my voice -- I'll see you in the next video!
