# The Art of Prompt Design: Prompt Boundaries and Token Healing

In this post, we'll discuss how the greedy tokenization methods used by language models can introduce unintended biases into your prompts. This is part of a series on the art of prompt design that demonstrates how to use [Guidance](https://github.com/microsoft/guidance) to control large language models (LLMs).

When building prompts for language models, it is crucial to understand how the model perceives the text. Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. The model learns the "meaning" of each token independently, just like how we learn the meaning of words in a language.

This impacts how language models see text and how we can prompt them, since every prompt must be encoded as a sequence of tokens. GPT-style models use tokenization methods like [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE) that map all input bytes to token ids in a greedy manner.

Standard greedy token mapping works well during training, but it can lead to subtle issues during prompting and inference. These issues arise because the greedy token boundaries often don't line up with the end of the prompt, especially when considering the generated tokens that will come next. While the end of a prompt will always align with a token boundary in practice, as the prompt is tokenized before being extended by the model, there may be instances where the first characters of the completion are part of a longer token that would span the prompt boundary. In such cases, the longer token cannot be used even though the model would expect it based on the training data.
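
To make the boundary problem concrete, here is a minimal sketch of a greedy tokenizer. It uses a simple longest-match rule as a stand-in for real BPE merging, and `TOY_VOCAB` and `greedy_tokenize` are hypothetical names invented for this illustration, not part of Guidance or any real tokenizer. It shows that tokenizing a prompt and its completion separately can give a different token sequence than tokenizing the full string at once.

```python
# Hypothetical toy vocabulary; real BPE vocabularies are learned from data.
TOY_VOCAB = ["http", "https", "://", ":", "/", "www", ".", "com", "example"]


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Split text left to right, always taking the longest matching vocab entry."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = max(
            (tok for tok in vocab if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos:]!r}")
        tokens.append(match)
        pos += len(match)
    return tokens


# Tokenizing the whole string uses the long "://" token ...
full = greedy_tokenize("http://www.example.com")
# ... but splitting at the prompt boundary forces a lone ":" token.
prompt = greedy_tokenize("http:")
completion = greedy_tokenize("//www.example.com")

print(full)                 # ['http', '://', 'www', '.', 'example', '.', 'com']
print(prompt + completion)  # ['http', ':', '/', '/', 'www', '.', 'example', '.', 'com']
```

Both sequences decode to the same string, but only the first matches the encoding a model would have seen during training.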

## An example prompt boundary problem
The inability to use tokens that span prompt boundaries can lead to subtle yet important biases in the model's output. Consider the following example where we are trying to generate an HTTP URL string:

In [44]:
import guidance

# we use StableLM for openness, but these issues impact all models
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('''The link is <a href="http:{{gen max_tokens=10 token_healing=False}}''')
program()

Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes) and instead creates an invalid URL string that has a space in the middle of it. This is surprising because language models are great at completing text, and knowing that `//` comes after `http:` is an extremely easy completion problem. So why is it failing? To understand why this happens, let's change our prompt boundary so that our prompt does not include the colon character:

In [45]:
guidance('''The link is <a href="http{{gen max_tokens=10 token_healing=False}}''')()

The problem is fixed! Now the language model generates a valid url string like we expect. And it is not just random coincidence that we fixed the problem, as demonstrated by resampling with a high temperature:

In [46]:
# without the colon we always get a valid link
program = guidance('''The link is <a href="http{{gen 'completions' max_tokens=10 token_healing=False temperature=1.0 n=5}}''')
program()["completions"]

['://archive.google.com/search?source',
 '://www.coombe.in">Link',
 '://www.realadventures.com/',
 '://my.gardenblog.com">my',
 '://www.youtube.com/v/d']

In [47]:
# with the colon we almost always get an invalid link
program = guidance('''The link is <a href="http:{{gen 'completions' max_tokens=10 token_healing=False temperature=1.0 n=5}}''')
program()["completions"]

['//www.elgato.com/hot',
 ' //www.flickr.com/photos/',
 '\\\\www.hackernews.in" title',
 '\nGoogle-Maps-PartOne-Use',
 ' //www.gohattoi.com"']

To understand why the colon matters so much, we need to look at the tokenized representation of the prompts. Below is the tokenization of the prompt that ends in a colon. The tokenization of the prompt that does not end in a colon is the same, but without the colon token at the end.

In [49]:
def print_tokens(tokens):
    print("len = " + str(len(tokens)))
    for i in tokens:
        print(str(i) + "\t`" + guidance.llm.decode([i]) + "`")

print_tokens(guidance.llm.encode('The link is <a href="http:'))

len = 9
510	`The`
3048	` link`
310	` is`
654	` <`
66	`a`
3860	` href`
568	`="`
2413	`http`
27	`:`


As humans, when we see the string `The link is <a href="http:` we know that the next characters are very likely to be `//` because we have seen lots of URL strings and we know that two slashes are part of the URL syntax. However, the language model does not see strings, it "sees" tokens. While the language model has also seen lots of URLs, they were encoded using a greedy tokenization strategy that does not produce the token `27` in that position:

In [50]:
print_tokens(guidance.llm.encode('The link is <a href="http://www.google.com/search?q'))

len = 18
510	`The`
3048	` link`
310	` is`
654	` <`
66	`a`
3860	` href`
568	`="`
2413	`http`
1358	`://`
2700	`www`
15	`.`
9906	`google`
15	`.`
681	`com`
16	`/`
8716	`search`
32	`?`
82	`q`


Because URLs are so common, the tokenizer has a special token that captures the colon and the following double slashes as a single token, `1358`. So when the language model sees the token `27` (a colon by itself) at the end of our original prompt, it has never seen it as part of a normal URL string. In fact, when the model sees `27` it can be sure that what comes next is very unlikely to be anything that could have been encoded together with the colon using a "longer token" (where by longer we mean the string represented by the token is longer). This is because in the model's training data those characters would have been encoded together with the colon, so the token `27` only appears when such longer tokens are not possible. When a model sees a token it knows two things: first, the learned embedding/meaning of that token, and second, that whatever comes after that token wasn't compressed together with it by the greedy tokenizer. It is easy to forget about this second piece of information, but it is very important for understanding how prompt boundaries work.
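
The second piece of information can be checked on a small scale. The sketch below is only an illustration under stated assumptions: the vocabulary, the corpus, and the helper names are all hypothetical, and a longest-match rule stands in for real BPE. It tokenizes a toy corpus and counts which character starts the token that follows a lone `:`.

```python
from collections import Counter

# Hypothetical toy vocabulary and corpus, used only for illustration.
TOY_VOCAB = ["http", "://", ":", "/", " ", "a", "b", "key", "value", "x"]
TOY_CORPUS = ["http://a/b", "key: value", "http://b", "a:b", "x: a"]


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Longest-match tokenizer standing in for a real BPE tokenizer."""
    tokens, pos = [], 0
    while pos < len(text):
        # take the longest vocabulary entry that matches at this position
        match = max((t for t in vocab if text.startswith(t, pos)), key=len)
        tokens.append(match)
        pos += len(match)
    return tokens


def next_char_counts(token, corpus=TOY_CORPUS):
    """Count the first character of whatever token follows `token` in the corpus."""
    counts = Counter()
    for text in corpus:
        toks = greedy_tokenize(text)
        for cur, nxt in zip(toks, toks[1:]):
            if cur == token:
                counts[nxt[0]] += 1
    return counts


followers = next_char_counts(":")
print(dict(followers))  # {' ': 2, 'b': 1}
```

Two of the five strings contain `://`, yet in the tokenized corpus the lone `:` token is never followed by a slash, because those colons were absorbed into the longer `://` token.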

To figure out what symbols we are accidentally avoiding by ending our prompt with a colon, we can search over the string representation of all the tokens in the model's vocabulary and look for ones that start with a colon. For reasons we will discuss next, `guidance` has built-in support for this:

In [41]:
print_tokens(guidance.llm.prefix_matches(":"))

len = 34
27	`:`
21610	`:/`
1358	`://`
1450	`::`
41210	`::::`
5136	`:"`
46064	`:")`
18031	`:"){`
49777	`:",`
27506	`:*`
6098	`:**`
48471	`:**]{}`
8048	`:\`
10477	`:(`
13522	`:=`
25942	`:=\`
18459	`:#`
19282	`:</`
21382	`:[`
22314	`:-`
42841	`:--`
22426	`:'`
23338	`:_`
25731	`:@"`
27976	`:%`
30337	`:``
34417	`:]`
35490	`:$`
47279	`:$$\`
37731	`:)`
41924	`:{`
46186	`:{\`
43118	`:.`
44662	`:&`


There are 34 different tokens that all start with a colon! This means that if we end our prompt with a colon we are accidentally telling the model that what comes next should not complete any of the 33 longer tokens in this list (the first entry is the lone colon itself). *This subtle and powerful bias can have all kinds of unintended consequences.* And it is not restricted to the colon character; it applies to any string that could potentially be extended to make a longer single token. In fact, even our "fixed" prompt that ends with "http" has a built-in bias as well: it communicates to the model that what comes after "http" must not be "s", otherwise "http" would not have been encoded as a separate token:

In [43]:
print_tokens(guidance.llm.prefix_matches("http"))

len = 2
2413	`http`
3614	`https`


Another example of this is the "[" character. Consider the following prompt and completion:

In [51]:
guidance('''An example ["like this"] and another example [{{gen max_tokens=10 token_healing=False}}''', caching=False)()

In [55]:
print_tokens(guidance.llm.encode('An example ["like this"] and another example ['))

len = 10
1145	`An`
1650	` example`
15640	` ["`
3022	`like`
436	` this`
9686	`"]`
285	` and`
1529	` another`
1650	` example`
544	` [`


Why is the second string not quoted? It is because by ending our prompt with the ` [` token we are telling the model that it should not generate completions that would complete any of the 26 longer tokens in the list below (the first entry is ` [` itself, and one of the others adds the quote character):

In [54]:
print_tokens(guidance.llm.prefix_matches(" ["))

len = 27
544	` [`
1008	` [@`
3921	` [*`
4299	` [**`
23734	` [****,`
8168	` []`
24345	` [],`
26991	` [];`
27501	` []{`
8605	` [[`
44965	` [[*`
14412	` ['`
15640	` ["`
16731	` [$`
20629	` [$\`
21810	` [(`
49824	` [(\[`
21938	` […]`
24430	` [\`
27075	` [^`
28591	` [-`
31789	` [...]`
33440	` [{`
42989	` [_`
43521	` [<`
44308	` [``
49193	` [#`


## Fixing unintended bias with "token healing"

So what can we do to avoid these unintended biases? One option is to only ever end our prompts with tokens that cannot be extended into longer tokens (for example, a role tag for chat-based models). But this can be very limiting, especially when we start to mix in the kind of rich structure that Guidance makes possible. So to address this problem, Guidance has a feature called "token healing" that automatically backs up the generation process by one token before the end of the prompt, then constrains the first token generated to have a prefix that matches the last token in the prompt. This allows the generated text string to have the token encoding that the model would expect based on its training data, not an unusual alternative encoding forced by the prompt boundary. Token healing allows you to express your prompts however you wish, without worrying about boundaries.
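
The two steps described above (back up one token, then constrain the first generated token to a matching prefix) can be sketched as follows. This is a simplified illustration, not Guidance's actual implementation: the vocabulary, the `TOY_SCORES` table standing in for a language model, and all function names are assumptions made for this example.

```python
# Hypothetical toy vocabulary for a longest-match tokenizer.
TOY_VOCAB = ["http", "https", "://", ":", "/", " ", "www"]

# Hypothetical next-token scores keyed by the last context token.
# A real system would query the language model here instead.
TOY_SCORES = {
    "http": {"://": 0.90, ":": 0.05, " ": 0.05},
    ":": {" ": 0.70, "/": 0.20, "www": 0.10},  # a lone colon rarely precedes a slash
}


def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Longest-match tokenizer standing in for a real BPE tokenizer."""
    tokens, pos = [], 0
    while pos < len(text):
        match = max((t for t in vocab if text.startswith(t, pos)), key=len)
        tokens.append(match)
        pos += len(match)
    return tokens


def next_token_scores(context):
    """Look up toy scores for the token that follows the given context."""
    return TOY_SCORES.get(context[-1], {}) if context else {}


def generate_first_token(prompt, token_healing):
    """Return the prompt text extended by one greedily chosen token."""
    context = greedy_tokenize(prompt)
    required_prefix = ""
    if token_healing:
        # Step 1: back up by one token and remember its text.
        required_prefix = context.pop()
    scores = next_token_scores(context)
    # Step 2: only allow tokens whose text starts with the removed token.
    allowed = {t: s for t, s in scores.items() if t.startswith(required_prefix)}
    if not allowed:
        raise ValueError("no candidate token matches the required prefix")
    first = max(allowed, key=allowed.get)
    return "".join(context) + first


print(repr(generate_first_token("http:", token_healing=False)))  # 'http: '
print(repr(generate_first_token("http:", token_healing=True)))   # 'http://'
```

With healing enabled, the colon is regenerated as part of the `://` token, so the model is scored in the context it saw during training. The sketch only picks a single token and ignores sampling, longer backtracking, and multi-token constraints.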

To see how this works, we will re-run all of the above examples, now with token healing turned on. Since token healing is on by default for Transformer models, this just means removing the `token_healing=False` flag from the `guidance` calls.

With token healing we can now generate valid URLs, even when the prompt ends with a colon:

In [57]:
guidance('''The link is <a href="http:{{gen max_tokens=10}}''')()

With token healing we now can also sometimes generate https URLs, even when the prompt ends with "http":

In [61]:
program = guidance('''The link is <a href="http{{gen 'completions' max_tokens=10 n=10 temperature=1}}''')
program()["completions"]

['://www.visitdeliveroo.com"',
 '://en.wikipedia.org/wiki/List',
 '://www.sourceforge.net/mailagent',
 's://www.google.com/analytics?',
 '://www.ihg.com/hotels',
 '://www.ihg.com/holiday',
 's://play.google.com/store/apps',
 '://www.youtube.com/v/s',
 '://www.bawitv.net/',
 '://ox-d.sos.ox.']

And finally, with token healing we now get quoted strings even when the prompt ends with a " [" token:

In [62]:
guidance('''An example ["like this"] and another example [{{gen max_tokens=10}}''', caching=False)()

## Conclusion

When you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret them, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways. To address this problem, either end your prompt with a non-extendable token or use something like Guidance's "token healing" feature, so you can express your prompts however you wish without worrying about token boundary artifacts.