# Controlling Model Output

## Stopping Token Generation

Controlling the output of a language model is a difficult task. As you saw in a previous lecture, we can use few-shot
prompting to help with some of the formatting requirements, but it can be a bit tricky to have the model stop after
just one item. Thankfully, llama.cpp has a couple of other mechanisms we can leverage to improve our control over the
model output. The first of these is stopping criteria.


In [1]:
# Let's load and configure our llama 2 model
import os
from llama_cpp import Llama

model: Llama = Llama(model_path=os.environ["LLAMA_13B"], verbose=False)

In [2]:
# The most blunt way to stop the LLM generation is to indicate
# the token or tokens you want to break on in a list when calling
# create_completion(). Here I'll ask for the continuation of this
# prompt, but I only want a single sentence so I'll stop on a
# period or other select punctuation.

token_count: int = 0
for result in model.create_completion(
    prompt="What is the sound of one hand clapping?",
    max_tokens=512,
    stream=True,
    stop=[".", "?", "!"],
):
    if token_count % 50 == 0:
        print("")
    token_count = token_count + 1
    print(result["choices"][0]["text"], end="")


 It depends on who you ask

Now, this form of stopping is actually done in the Python library itself, and not in the underlying llama.cpp. There is
also an option to set a callback function that decides when to stop, and this is through the `stopping_criteria`
parameter.
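
To make the list-based approach concrete, here is a minimal sketch of what stopping on a list of strings amounts to.
This is not the library's actual implementation, and `truncate_at_stop` is a hypothetical helper I made up for
illustration: it simply cuts the text at the earliest stop string it finds.

```python
from typing import List


def truncate_at_stop(text: str, stop: List[str]) -> str:
    """Cut text just before the earliest occurrence of any stop string."""
    # Collect the position of every stop string that actually appears
    positions = [text.find(s) for s in stop if s and s in text]
    # If none appear, the text is returned unchanged
    return text[: min(positions)] if positions else text


# A made-up continuation, just to exercise the helper
sample = " It depends on who you ask. Some would say silence!"
print(truncate_at_stop(sample, [".", "?", "!"]))
```

Notice that the stop string itself is not part of the result, which matches the output we saw above.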

I have a confession: this is a source of frustration that I have with the tools we are using. While the tools work
well, the docs sometimes lag, or are unclear. The newness of the tools -- and the lack of a clear standard or even
dominant tool for development in this space -- also makes it a bit of work sometimes to understand what's happening
under the hood.

If we go to the documentation on the `llama-cpp-python` bindings website, we see that the `stopping_criteria` is
actually a list of callable objects -- so functions -- which take in two parameters, both numpy arrays. The first
parameter is an array of integers and the second an array of single precision float values. The expectation is
that the callback function will return a boolean, presumably indicating whether we should stop processing or not.
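
Here is a small sketch of that callback shape, written with plain Python sequences so it runs without the model
loaded. The names `make_length_stopper` and `any_criterion_met` are hypothetical, and I'm assuming that generation
stops as soon as any one callback in the list returns `True`; in the real bindings the two arguments are numpy arrays.

```python
from typing import Callable, List, Sequence

# Hypothetical alias for the callback shape described above
StoppingCriterion = Callable[[Sequence[int], Sequence[float]], bool]


def make_length_stopper(prompt_length: int, max_new_tokens: int) -> StoppingCriterion:
    """Build a callback that stops once enough new tokens have been generated."""

    def should_stop(input_ids: Sequence[int], logits: Sequence[float]) -> bool:
        # input_ids holds the prompt tokens plus everything generated so far
        return len(input_ids) - prompt_length >= max_new_tokens

    return should_stop


def any_criterion_met(
    criteria: List[StoppingCriterion],
    input_ids: Sequence[int],
    logits: Sequence[float],
) -> bool:
    """Assumed combination rule: stop if any single callback says stop."""
    return any(criterion(input_ids, logits) for criterion in criteria)


criteria = [make_length_stopper(prompt_length=11, max_new_tokens=3)]
print(any_criterion_met(criteria, list(range(13)), [0.0]))  # two new tokens so far
print(any_criterion_met(criteria, list(range(14)), [0.0]))  # three new tokens so far
```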

I think this is a bit easier to understand if we just use the debugger to take a look.


In [3]:
# Let's define the function we are going to use as our
# stopping criteria. At the moment I'm just going to
# return false, and set a breakpoint on this line
def should_stop(input_ids, logits) -> bool:
    return False


# Now we'll do the rest as we did previously, this
# time setting up our stopping_criteria callback. I want
# to use a temperature of zero just for a demonstration
token_count: int = 0
for result in model.create_completion(
    prompt="What is the sound of one hand clapping?",
    max_tokens=512,
    stream=True,
    temperature=0,
    stopping_criteria=should_stop,
):
    if token_count % 50 == 0:
        print("")
    token_count = token_count + 1
    print(result["choices"][0]["text"], end="")



I’

Ok, to debug the cell in VS Code I'm going to set a breakpoint in the margin on line five. Then we can hit the drop
down on the left-hand side to run the cell in debug mode.

In the debug panel to the left we can see our local variables -- the `input_ids` parameter, and if we expand that we can
see it has 11 values in it, all numbers. Take a minute to think about this -- what do you think those numbers are?


It turns out that is our sequence of tokens. If we expand the globals section of the debugger we see that the llama
object is there as `model`. We looked at tokenization previously, and the model object has a handy function to
detokenize a sequence, converting it back into the text representation which is more familiar to us.

To do that we have a couple of options available to us, and I'm going to show you both just in case you are unfamiliar
with the VS Code debugging facilities. The first is to add a watch variable. This is some expression which will persist
across various debugging runs, and is a handy method to use when debugging loops. I'm going to add the watch expression
as `model.detokenize(input_ids)`. Once I do this it becomes clear that this is our prompt!

Another option is to use the debug console at the bottom of the window. This allows you to write arbitrary Python to
interact with the interpreter. If I put in the same code here, `model.detokenize(input_ids)`, we get the same result.


Now, this isn't at all unique to llama 2, but it's an important skill when you want to understand new and emerging
packages and the documentation is lacking. If you haven't seen this before, I encourage you to try it out, and perhaps
look up a few videos on debugging in VS Code. If you can master your toolkit, your ability to solve problems will
increase twofold.


Let's go back to our function. We're returning `False` here, which will indicate that the stream of tokens has not yet
met a stopping criteria. We can hit the debug continue button, or F5, a few times to go through a few breakpoint
iterations. As we do this, we can see that the sequence grows in size, and that the detokenization of it is giving us
the prompt plus the continuation.


Let's turn our attention to the logits. The logits array is huge, 32,000 items, so you can imagine this aligns with our
vocabulary. Each of the values in the logits array is a raw score for how likely a given token is to be the right one
for the next response. Don't let this scare you away, a logit is basically just a probability expressed on a scale that
ranges from negative infinity to positive infinity, and the higher the value, the more likely the token.
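
If you want to see how scores on that scale relate to ordinary probabilities, here is a minimal sketch using a softmax,
which is the standard way to normalize logits. The three-item `toy_logits` list is made up for illustration, and this
is plain Python rather than anything taken from the bindings.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    # Subtracting the maximum keeps exp() from overflowing on large scores
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 0.0, -3.0]  # made-up scores for a three-token vocabulary
probs = softmax(toy_logits)
for score, prob in zip(toy_logits, probs):
    print(f"logit {score:+.1f} -> probability {prob:.3f}")
```

The ordering is preserved: the largest logit ends up with the largest probability.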

Let's open the debug console and see what some of the best logits are.


I'm not going to go into the details about how this mapping is done, but what we want are the largest values. These are
our most likely candidates, and the values which are out towards negative infinity are our least likely candidates.

To do this I'm going to `import numpy as np` first, then I'm going to argsort the logits array, which gives us the token
ids ordered from lowest score to highest, then take the last five items -- which will be our five highest-scoring
tokens -- and detokenize them to see if it makes sense.

Keep in mind that argsort works in ascending order, so the most likely token is the very last one. After slicing off
the last five items we can reverse them using numpy slicing syntax, so the most likely token comes first in the string
output.

Here's the code for that: `model.detokenize(np.argsort(logits)[-5:][::-1])`
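
If you don't have the debugger open, here is a toy sketch of picking the top-scoring tokens, written in plain Python so
it runs on its own. The five-word `toy_vocab` and its scores are made up, and `top_k_token_ids` is a hypothetical
helper, not something from the bindings.

```python
from typing import List


def top_k_token_ids(logits: List[float], k: int) -> List[int]:
    """Return the ids of the k highest-scoring tokens, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


toy_vocab = ["the", "a", "silence", "clap", "!"]  # stand-in for a real vocabulary
toy_logits = [1.5, -0.2, 4.1, 3.3, -7.0]          # made-up scores, one per token

best = top_k_token_ids(toy_logits, 3)
print([toy_vocab[i] for i in best])  # ['silence', 'clap', 'the']
```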


## Summary

Ok, in this lecture I wanted to show you two different ways you can control the output of the LLM. The first is through
a sequence of stopping items provided as a list, and the second is through the `stopping_criteria`, where you can
provide multiple different callback functions which operate on the token sequence and the underlying logits of the
model responses. Both are useful, and you can imagine that with a function in the `stopping_criteria` you can do
anything you might want to do in Python, including validating that the response fits a specific regular expression
pattern, or even handing a partially completed response to another service, process, or person to validate.
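
As a rough sketch of the regular expression idea, here is one way such a callback could be built. Everything here is
an assumption for illustration: `make_regex_stopper` is a hypothetical factory, the `detokenize` argument stands in for
something like `model.detokenize` (which returns bytes), and the fake detokenizer just treats token ids as byte values
so the example runs without a model.

```python
import re
from typing import Callable, Sequence


def make_regex_stopper(
    detokenize: Callable[[list], bytes],
    pattern: str,
    prompt_length: int,
) -> Callable[[Sequence[int], Sequence[float]], bool]:
    """Build a callback that stops once the generated text matches a pattern."""
    compiled = re.compile(pattern)

    def should_stop(input_ids: Sequence[int], logits: Sequence[float]) -> bool:
        # Only look at what was generated after the prompt
        generated = detokenize(list(input_ids[prompt_length:]))
        text = generated.decode("utf-8", errors="ignore")
        return compiled.search(text) is not None

    return should_stop


def fake_detokenize(ids: list) -> bytes:
    """Toy stand-in for a real detokenizer: each id is one byte."""
    return bytes(ids)


# Stop as soon as three digits in a row have been generated
stopper = make_regex_stopper(fake_detokenize, r"\d{3}", prompt_length=2)
prompt_ids = [1, 2]
print(stopper(prompt_ids + list(b"ab12"), []))   # only two digits so far
print(stopper(prompt_ids + list(b"ab123"), []))  # three digits, pattern matches
```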

Along the way I demonstrated how you can use the VS Code environment to dig into the underlying operation of the LLM
and, frankly, I just showed a simple example! Once you have mastered debugging skills within the platform, you can
quickly pick up software packages like llama.cpp's Python bindings by just stepping through the code to see how it
works. I think this is an incredibly important developer skill that often we don't do a good job teaching, so I hope
you'll consider exploring this on your own to better understand how the llama.cpp package works.
