# Tokenization

Up until now we've been talking about tokens and text as if they were the same thing. While they are similar, it's
important to note the differences, and I think we can dive into some llama code to actually explore this. Now, the key
issue is that inference in a neural network, which is the underpinning architecture of large language models, works by
applying mathematical operations to input sequences. In our case these operations have been learned through pretraining,
so the weights in our model get applied to some input sequence of numbers to generate a new output sequence of numbers.
So the role of tokenization here is to convert input text into numeric values, and to convert the numeric results of the
model back into text.
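
To make that round trip between text and numbers concrete, here is a minimal sketch using a tiny made-up vocabulary. The
names `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical and have nothing to do with llama's real tokenizer;
they only illustrate that a tokenizer is, at heart, a lookup in both directions.

```python
# Toy illustration only: this vocabulary is made up and is not llama2's.
toy_vocab = ["<unk>", "llama", "is", "an", "animal"]
toy_ids = {piece: index for index, piece in enumerate(toy_vocab)}


def toy_encode(text):
    """Turn whitespace-separated words into a list of integer ids."""
    return [toy_ids.get(word, toy_ids["<unk>"]) for word in text.split()]


def toy_decode(ids):
    """Turn a list of integer ids back into text."""
    return " ".join(toy_vocab[index] for index in ids)


ids = toy_encode("llama is an animal")
print(ids)              # [1, 2, 3, 4]
print(toy_decode(ids))  # llama is an animal
```

The model only ever sees the list of integers; the text on either side is the tokenizer's job.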


In [None]:
# Each language model has its own tokenization format, and we can load and play with
# llama2's tokenizer by loading it directly
from llama import Tokenizer

tok = Tokenizer("tokenizer.model")

# We can take a look at the size of the vocabulary, in the case of llama2 that vocabulary
# is 32,000 tokens large
tok.n_words


Now, we can represent a lot of things with 32,000 tokens, and one might think this is plenty and that the approach would
be to give every single character in the text its own token and use the model to predict the next character as output.
This doesn't work well. The approach taken instead is to tokenize common words in the corpus of training data -- and
this is why a diverse and representative set of training data is needed -- and add to it "backups" for those words we
don't see very often, by including tokens for individual characters.


In [None]:
# We can use the encode() function to generate a sequence of token numbers for
# a given sentence. There are a few special tokens we need to be aware of,
# specifically the bos (beginning of sequence) and eos (end of sequence) tokens.
# We can control whether we want these to appear in our sequence or not through
# the bos and eos parameters of encode().
tok.encode("Llama is an animal.", bos=True, eos=True)

In the case of this sentence, we can see that in addition to the bos and eos tokens, seven other tokens were generated.
Perhaps a bit mysterious, as there are only four words in the sentence. Let's unpack each token and see how the llama
tokenizer worked.


In [None]:
# I'll just iterate over each token and ask the tokenizer to decode() it.
for token in tok.encode("Llama is an animal.", bos=True, eos=True):
    print(f"Token {token} is {tok.decode(token)}")

We can see that the start and end tokens have no visible representation, and that seems reasonable. Surprisingly, the
word Llama isn't in the vocabulary, and instead the tokenizer has fallen back to spelling it out letter by letter until
it hits a segment which is in the vocabulary (in this case, "ama"). We see that the words "is", "an", and "animal" are
all in the vocabulary, as well as the full stop or period. Let's play with this and interrogate the tokenizer a little
more.
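
The fallback behavior we just saw can be imitated with a toy tokenizer. The sketch below assumes a hypothetical
vocabulary (`TOY_PIECES`) containing a few multi-letter pieces plus every individual letter, and it always grabs the
longest piece that matches at the current position. This greedy longest-match rule is a simplification chosen for
illustration, not the algorithm llama's tokenizer actually uses, and the ids are made up.

```python
import string

# Hypothetical vocabulary: a few multi-letter pieces plus single letters as "backups".
TOY_PIECES = ["animal", "ama", "is", "an"] + list(string.ascii_letters)
TOY_IDS = {piece: index for index, piece in enumerate(TOY_PIECES)}


def toy_tokenize(word):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    position = 0
    while position < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(word), position, -1):
            candidate = word[position:end]
            if candidate in TOY_IDS:
                pieces.append(candidate)
                position = end
                break
        else:
            # Nothing matched, not even a single character.
            raise ValueError(f"no piece covers {word[position]!r}")
    return pieces


for word in ["animal", "Llama"]:
    print(word, "->", toy_tokenize(word))
```

With this toy vocabulary, "animal" comes back as a single piece, while "Llama" falls back to "L", "l", and then "ama",
mirroring the shape of what the real tokenizer printed above.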


In [None]:
# Here's our original sentence
original = tok.encode("Llama is an animal.", bos=True, eos=True)
print(original)
print(f"{[tok.decode(token) for token in original]}")

# Let's see what changes if we make it all lower case
lower = tok.encode("llama is an animal.", bos=True, eos=True)
print(lower)
print(f"{[tok.decode(token) for token in lower]}")

# And let's throw in a misspelling
misspelt = tok.encode("llama is an aminal.", bos=True, eos=True)
print(misspelt)
print(f"{[tok.decode(token) for token in misspelt]}")

Ok! So, spelling matters, as does capitalization. We see that the word llama still doesn't appear directly in the
vocabulary, and that the lower case and sentence case versions of the word produce not only different token sequences
but sequences of different lengths. We also see that a misspelling of the word "animal" causes it to be split into two
tokens.


In [None]:
# It's also useful to take a look at numbers, given that tokens are represented
# as numbers and LLMs have historically been really bad at solving math problems!
# I'm not going to put in the BOS and EOS tokens anymore, but don't forget they
# are there for your prompts.
eq = tok.encode("11+23=", bos=False, eos=False)
print(eq)
print(f"{[tok.decode(token) for token in eq]}")

So, this can be a bit perplexing, as there is a token right at the beginning of the sequence -- 29871 -- which is
inserted by the tokenizer and represents a start-of-word whitespace. Ignoring that for the moment, it's important for
you to understand that the tokenizer has no symbolic understanding of numbers -- the number eleven is simply a one
followed by another one.
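
To see what that means in practice, consider a toy character-level encoding of the same equation. The ids in `DIGIT_IDS`
below are hypothetical, not llama's, but the point carries over: the sequence for "11" is just the id for "1" twice, and
nothing in it says the value is eleven.

```python
# Hypothetical ids for a tiny character vocabulary; llama2's real ids differ.
DIGIT_IDS = {char: index for index, char in enumerate("0123456789+=")}


def encode_characters(text):
    """Look up one id per character; no arithmetic meaning is attached."""
    return [DIGIT_IDS[char] for char in text]


print(encode_characters("11+23="))  # [1, 1, 10, 2, 3, 11]
print(encode_characters("11"))      # [1, 1]
```

As far as the lookup is concerned, `[1, 1]` is simply two symbols in a row, in exactly the same way that two letters
would be.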

Understanding this process at a high level is important, because our language model doesn't operate on words or
sentences or even text -- it operates on sequences of tokens. This matters when understanding more advanced concepts in
using an LLM, such as Retrieval Augmented Generation (RAG), where the LLM has access to a database of documents all
indexed by chunks. These chunks are sequences of tokens which are then expressed as vectors. The details aren't
important right now, but it is important to know that tokens (and sequences of them) are the fundamental unit of input
and output of a language model.

It also lets us talk about something else important: context length, and we'll do that by writing some prompts in the
next lecture.
