####  TOKENIZATION IN TRANSFORMERS - "HIDDEN" FEATURE ENGINEERING

One of the promises of deep learning is the ability to gobble up unstructured data and make sense of it without carefully crafted features. This is mostly true: we need far less feature engineering than in classical ML. BUT it's important to be aware of when and where feature engineering still happens, and the role it plays. One such place, which is easy to overlook, is the tokenization process of transformer models.

So what is a tokenizer, really? Let's start by thinking about what we actually do when we tokenize: we turn an object into a number. The object can be anything: a word, a product in a shopping cart, a pixel in an image, etc. Two important properties of these objects are as follows:

1. The object is a single entity with inherent properties, and can be identified as such.
2. The objects appear in a context, where they are connected to other objects, and their meaning is derived from the context in which they appear.

Let's take an example:
Consider a shopping cart with a few distinct objects: flour, eggs, milk, sugar, butter, chocolate.
Each of these objects has distinct qualities that make it recognizable and distinct from the others. One by one they can be eaten, or used in a wide variety of recipes. Together, in this specific context, they can be used to bake a cake.

These two properties matter because of how these numbers are further processed by a transformer model. When we input these numbers into a transformer, each one is mapped to a high-dimensional representation of the object, called an embedding. These embeddings are just vectors, one vector for each object we have defined. The nice thing about vectors is that they can capture both similarity and context, and they can be manipulated by a deep learning model, like a transformer, to represent the meaning of the object both in isolation and in the context of other objects.
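The lookup from token ids to vectors can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: the names `token_id`, `embedding_table`, `embed` and `cosine_similarity` are made up for this example, and the vectors are random rather than learned.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab = ["flour", "eggs", "milk", "sugar", "butter", "chocolate"]
token_id = {name: i for i, name in enumerate(vocab)}

embedding_dim = 4  # tiny on purpose; real models use far larger vectors
# One row per token id. In a real model these values are learned, not random.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(tokens):
    """Look up the vector for each token id."""
    return [embedding_table[t] for t in tokens]

def cosine_similarity(a, b):
    """Similarity of two vectors, between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cart = [token_id[name] for name in ["flour", "eggs", "milk"]]
vectors = embed(cart)
print("tokens:", cart)
print("shape:", len(vectors), "x", len(vectors[0]))
print("similarity(flour, eggs):", cosine_similarity(vectors[0], vectors[1]))
```

With random vectors the similarity score is meaningless; the point is only that once objects are vectors, "how similar are these two?" becomes a simple computation that training can shape.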

One famous and early use of tokenization into a vector embedding is the word2vec model. Simplified a bit, it worked by training a neural net to embed words into word embeddings, and then to predict the surrounding words in a window around a target word based on these word embeddings. This turned out to generate impressive representations of word semantics, and was an early example of self-supervised learning. A lot has happened in NLP since then, but the idea of representing words as vectors is still a core concept. In this notebook, we will look at how we can use tokenization to turn text into numbers, and how these numbers can be used as input to a transformer model.
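As a small aside, the "window around a target word" idea can be made concrete. The sketch below only builds the (target, context) training pairs of the kind the skip-gram variant of word2vec trains on; it does no training, and `skipgram_pairs` is a hypothetical helper written for this notebook, not part of any word2vec library.

```python
def skipgram_pairs(words, window=2):
    """Return (target, context) pairs for every word and its neighbours."""
    pairs = []
    for i, target in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, words[j]))
    return pairs

sentence = "we bake a cake with flour".split()
pairs = skipgram_pairs(sentence, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each pair is one training example: given the embedding of the target word, the network is asked to predict the context word. No labels are needed beyond the text itself, which is what makes the setup self-supervised.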

Let's start by looking at a simple tokenization method in Python:

In [9]:
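# Given a sentence, we can split the text into character tokens like this:
data = "Hello, world!"
characters = list(data)
# Build the vocabulary: one integer id per unique character.
# set() has no fixed order, so the exact ids can differ between runs.
char_to_id = {char: i for i, char in enumerate(set(characters))}
tokens = [char_to_id[char] for char in characters]


# Tokenize a new sentence with the same vocabulary.
# Characters that were never seen get the placeholder id -1.
new_sentence = "Goodbye World!"
characters = list(new_sentence)
tokens = [char_to_id.get(char, -1) for char in characters]

print("sentence:", new_sentence)
print("tokens:", tokens)


sentence: Goodbye World!
tokens: [-1, 0, 0, 5, -1, -1, 8, 4, -1, 0, 1, 2, 5, 6]


In [7]:
print(char_to_id)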

{'o': 0, 'r': 1, 'l': 2, 'H': 3, ' ': 4, 'd': 5, '!': 6, 'w': 7, 'e': 8, ',': 9}


We just tokenized our dataset! We mapped the characters to numbers and turned our sentence into a list of numbers. Great! However, this is not a very good tokenization method. We have a few problems. Let's think back to the points we made earlier:

Sure, each object here is a character, and therefore a distinct entity with a clear relation to the surrounding characters. However, the complexity of each object is limited. A single character is not a complex object; it carries only a small amount of information. If we encode only a small amount of information in each object, we shift the burden onto the model, which must infer meaning primarily from the relationships between objects. This is not optimal. For a transformer model, we are better off encoding more information in each object. How do we do this? Just add more information to each object: more characters! This moves us towards a more complex object, like a word. To capture the full complexity of such an object we might need a larger vector to represent it. The model can then encode more information about each object, which makes it easier to relate objects to each other.

Another problem with this method is the vocabulary size. Even if we cover the full ASCII range, we only have 128 different characters in our vocabulary. When we encode text with such a small vocabulary, our tokenized text becomes very long. Because of how the transformer architecture is designed, it is important that the tokenized text is not too long: in self-attention, every token looks at every other token in the sequence, so the cost grows quadratically with the number of tokens. In short, shorter sequences are better. Just look at this Python example for an illustration:


In [12]:
def tokenize_with_chars(sentence):
    # One token per character; the vocabulary is built from the sentence itself.
    characters = list(sentence)
    char_to_id = {char: i for i, char in enumerate(set(characters))}
    tokens = [char_to_id.get(char, -1) for char in characters]
    return tokens

def tokenize_with_words(sentence):
    # One token per whitespace-separated word (punctuation stays attached).
    words = sentence.split()
    word_to_id = {word: i for i, word in enumerate(set(words))}
    tokens = [word_to_id.get(word, -1) for word in words]
    return tokens

print("tokenize with chars:", tokenize_with_chars("Hello, world!"))
print("tokenize with words:", tokenize_with_words("Hello, world!"))


tokenize with chars: [3, 8, 2, 2, 0, 9, 4, 7, 0, 1, 2, 5, 6]
tokenize with words: [1, 0]
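To put rough numbers on the quadratic cost mentioned above, we can count how many token-to-token interactions full self-attention would need for the same text under each scheme. This is a back-of-the-envelope sketch: `attention_pairs` is a helper made up for this notebook, and it ignores everything about a real model except the n × n shape of the attention matrix.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token interactions in full self-attention."""
    return num_tokens * num_tokens

text = "Flour, eggs, milk, sugar, butter and chocolate make a cake."
char_tokens = list(text)
word_tokens = text.split()

for name, tokens in [("chars", char_tokens), ("words", word_tokens)]:
    print(f"{name}: {len(tokens)} tokens, {attention_pairs(len(tokens))} pairs")
```

The sequence length shrinks by the average word length, but the number of pairs shrinks by roughly the square of that factor.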




This example highlights the role of feature engineering in NLP. It is almost deceiving how such a small detail can have such a large impact on the performance of a model. We will now look at a more sophisticated tokenization method that addresses the problems we have identified.


In [13]:
from tiktoken import encoding_for_model
encoding = encoding_for_model("gpt-2")
print(encoding.encode("Hello, world!"))

[15496, 11, 995, 0]
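The GPT-2 encoding loaded above is a byte pair encoding (BPE) vocabulary. To give a feel for the idea, here is a toy sketch of how BPE-style merges can be learned: repeatedly find the most frequent pair of adjacent symbols and fuse it into a single new symbol. This is a simplification that works on characters of a tiny made-up corpus; it is not tiktoken's implementation, and `most_frequent_pair` and `merge_pair` are hypothetical helpers.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the top one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

corpus = ["low", "lower", "lowest", "low"]
sequences = [list(word) for word in corpus]  # start from single characters

for _ in range(2):  # learn two merges
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)
```

After two merges the frequent chunk "low" has become a single symbol, while the rarer endings are still spelled out character by character. Frequent strings get short encodings, and rare ones can still be represented.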



So what e