The data used by LLMs is split into smaller units (words or subwords), and each unit is encoded as a vector in a continuous space, a process referred to as embedding.
Some examples of techniques that rely on embeddings include:
- Retrieval-augmented generation (RAG).
- Word embedding (word2vec).

The embedding size varies depending on the specific foundation model. For instance, the smallest GPT-2 model uses an embedding size of 768 dimensions.
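As a rough illustration of what an embedding does, the sketch below maps token IDs to vectors with a plain lookup table. The names `embedding_table` and `embed`, and the toy sizes, are assumptions made for this example only. In PyTorch the same role is played by `torch.nn.Embedding`, whose weights are learned during training rather than left random.

```python
import random

random.seed(0)

vocab_size = 6     # toy value, real vocabularies are far larger
embedding_dim = 4  # toy value, the text above mentions 768 for GPT-2

# One row of random numbers per token ID (hypothetical, untrained weights)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([2, 5, 2])
print(len(vectors), "vectors of size", len(vectors[0]))
# The same token ID always maps to the same vector
print(vectors[0] == vectors[2])
```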

## Tokenizing

In [3]:
from importlib.metadata import version
import tiktoken
import torch

In [4]:
with open("./data/example_text.txt","r",encoding="utf-8") as f:
    raw = f.read()
    
print("Total characters:",len(raw))
print(raw[:100])

Total characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [8]:
import re
preprocessed = re.split(r'([,.?_!"()\']|--|\s)',raw)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:50])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself']
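To see what the split pattern does, it can help to run it on a short made-up sentence (the `sample` string below is not from the input text). Because the pattern is wrapped in a capturing group, `re.split` keeps the separators in the result, and the list comprehension then drops whitespace-only and empty entries.

```python
import re

sample = "Hello, world. Is this-- a test?"

# The capturing group makes re.split return the separators as well
pieces = re.split(r'([,.?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only separators
tokens = [item.strip() for item in pieces if item.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```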


List all unique tokens

In [64]:
all_tokens = sorted(set(preprocessed))
# Extend the vocabulary with two special tokens: "<|unk|>" for words that are not in
# the training vocabulary, and "<|endoftext|>" to separate unrelated text sections.
all_tokens.extend(["<|unk|>","<|endoftext|>"])

vocabulary_size = len(all_tokens)
print(f"Vocabulary size: {vocabulary_size}")
print("First words:",all_tokens[:50])

Vocabulary size: 1161
First words: ['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A', 'Ah', 'Among', 'And', 'Are', 'Arrt', 'As', 'At', 'Be', 'Begin', 'Burlington', 'But', 'By', 'Carlo', 'Carlo;', 'Chicago', 'Claude', 'Come', 'Croft', 'Destroyed', 'Devonshire', 'Don', 'Dubarry', 'Emperors', 'Florence', 'For', 'Gallery', 'Gideon', 'Gisburn', 'Gisburns', 'Grafton', 'Greek', 'Grindle', 'Grindle:', 'Grindles', 'HAD', 'Had', 'Hang', 'Has']


In [70]:
vocabulary = {token:integer for integer,token in enumerate(all_tokens)}

# Print the last entries of the vocabulary (including the unknown-word and end-of-text tokens)
for item in list(vocabulary.items())[-5:]:
    print(item)


('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|unk|>', 1159)
('<|endoftext|>', 1160)


Simple text tokenizer

In [71]:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace words that are not in the vocabulary with the "<|unk|>" token
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that " ".join inserted before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
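
The punctuation clean-up used in `decode` can be tried in isolation. The sketch below uses a made-up token list and shows one limitation: the pattern also removes the space before an opening quote or parenthesis, so the decoded text does not always reproduce the original spacing.

```python
import re

# Hypothetical tokens, as they would come out of the ID-to-string lookup
tokens = ["Hello", ",", "world", "!", '"', "quoted", '"']

joined = " ".join(tokens)
print(joined)
# Hello , world ! " quoted "

# Same substitution as in decode: drop whitespace before punctuation
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)
# Hello, world!" quoted"
```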

Instantiate the tokenizer and tokenize a sample text

In [79]:
tokenizer = Tokenizer(vocabulary) # Instantiate the Tokenizer object
# Append a word that is not in the vocabulary ('Hello' does not occur in the input text)
text = " ".join([raw[130:250],"Hello"])
# Append the end-of-text token to mark the end of the input (useful when concatenating multiple, unrelated texts)
text = " ".join([text,"<|endoftext|>"])

print(f"Input text: \n{text}")

ids = tokenizer.encode(text)
print(f"Tokenized text:\n{ids}")
print("Decoding token:")
print(tokenizer.decode(ids))


Input text: 
at, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on Hello <|endoftext|>
Tokenized text:
[185, 5, 579, 1013, 546, 738, 559, 504, 5, 541, 522, 377, 559, 766, 5, 676, 119, 862, 1129, 5, 162, 404, 557, 579, 119, 1093, 743, 1159, 1160]
Decoding token:
at, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on <|unk|> <|endoftext|>


Tokenization using Byte pair encoding

In [81]:
from importlib.metadata import version
import tiktoken
print("tiktoken ver=", version("tiktoken"))

tiktoken ver= 0.6.0


In [82]:
tokenizer = tiktoken.get_encoding("gpt2") # Instantiate the BPE (byte pair encoding) tokenizer

In [101]:
text = " ".join([raw[100:297],"<|endoftext|>",raw[355:550]]) # Concatenate two excerpts of the text, separated by the end-of-text token
text = " ".join([text,"AnUnknownWord"])
integers = tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print(f"Tokenized text:{integers}")
strings = tokenizer.decode(integers)
print(f"Decoded token:{strings}")

# The BPE tokenizer handles unknown words by breaking them down into subword units or, when needed, individual characters, so no "<|unk|>" token is required.

Tokenized text:[630, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686, 41976, 13, 357, 10915, 314, 2138, 1807, 340, 561, 220, 50256, 326, 373, 644, 262, 1466, 1444, 340, 13, 314, 460, 3285, 9074, 13, 46606, 536, 5469, 438, 14363, 938, 4842, 1650, 353, 438, 2934, 489, 3255, 465, 48422, 540, 450, 67, 3299, 13, 366, 5189, 1781, 340, 338, 1016, 284, 3758, 262, 1988, 286, 616, 4286, 705, 1014, 510, 26, 275, 1052, 20035, 26449]
Decoded token:reat surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would <|endoftext|> that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; b AnUnknownWord
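
The idea behind byte pair encoding can be sketched on a toy corpus: start from single characters, repeatedly merge the most frequent adjacent pair, and later reuse the learned merges to segment a word that was never seen. This is a simplified illustration only, not how `tiktoken` is implemented (the GPT-2 tokenizer works on bytes and ships with pre-trained merges). The corpus and the helper names `most_frequent_pair`, `merge_pair`, and `segment` are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def segment(word, merges):
    """Split a (possibly unseen) word by applying the learned merges in order."""
    words = {tuple(word): 1}
    for pair in merges:
        words = merge_pair(words, pair)
    return list(next(iter(words)))

# Toy corpus: word -> frequency (made-up numbers)
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
# "lowest" is not in the corpus, yet it can be built from learned subwords
print(segment("lowest", merges))
```

Any word can still be represented, because in the worst case it falls back to single characters. A real BPE vocabulary is built from tens of thousands of such merges.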


## Data sampling