## Understanding word embeddings

To process text, we need to represent words as continuous-valued vectors. This representation is called an embedding. It matters because neural networks operate on numbers, so text has to be converted into a numeric format they can process.

Word embeddings are the most common, but sentence and paragraph embeddings are also popular choices for retrieval-augmented generation (RAG).

Word2Vec is one of the earlier approaches to text embeddings. Its main idea is that words that appear in similar contexts tend to have similar meanings.
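
To make the idea of "similar contexts" concrete, here is a minimal sketch. It is not Word2Vec itself: it only counts which words appear next to each other in a tiny made-up corpus and compares the resulting count vectors with cosine similarity. The corpus, the window size of 1, and the helper name `cosine` are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration only)
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat eats fish",
    "the dog eats meat",
]

# Count the neighbours of each word within a window of 1 token
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                cooc[word][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine(cooc["cat"], cooc["dog"])
sim_cat_milk = cosine(cooc["cat"], cooc["milk"])
print(sim_cat_dog, sim_cat_milk)
```

In this toy corpus "cat" and "dog" have exactly the same neighbours, so their similarity is close to 1, while "cat" and "milk" share fewer neighbours and score lower.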

We can use pretrained models such as Word2Vec to generate embeddings. However, LLMs commonly produce their own embeddings, which are part of the input layer and are *updated during training*.
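
An embedding layer can be thought of as a lookup table with one row per token ID. The sketch below uses plain Python lists and random values instead of a real neural network layer; the vocabulary size, the embedding dimension, and the function name `embed` are assumptions for illustration. In an LLM these rows would be trainable weights that are adjusted during training.

```python
import random

random.seed(123)   # only to make the random values reproducible

vocab_size = 6     # assumed toy vocabulary size
embedding_dim = 3  # assumed toy embedding dimension

# One row of random weights per token ID; training would adjust these values
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up the embedding vector (table row) of each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([2, 3, 5, 2])
print(len(vectors), len(vectors[0]))  # 4 vectors with 3 values each
```

The same token ID always maps to the same row, which is why the first and last vectors in this example are identical.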

### Tokenizing text

The first step is to split the text into tokens. These tokens can be individual words or special characters, such as punctuation marks.

In [12]:
import re

In [13]:
with open("the_veredict.txt", "r", encoding= "utf-8") as f:
    text = f.read()
print("Total number of characters: ", len(text))
print(text[:100])

Total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [14]:
test = "Hello, world, This is a test."
result = re.split(r'(\s)', test) #Split on whitespace; the capturing group keeps the separators in the result
print(result)

['Hello,', ' ', 'world,', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']


In [15]:
result = re.split(r'([,.]|\s)', test) #Now also split on commas and periods
print(result)

['Hello', ',', '', ' ', 'world', ',', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [16]:
result = [item for item in result if item.strip()] #Remove empty strings and whitespace-only items
result

['Hello', ',', 'world', ',', 'This', 'is', 'a', 'test', '.']

In [26]:
def tokenize(text):
    return [item for item in re.split(r'([,.:;?_!"()\']|--|\s)', text) if item.strip()]
tokenize(test)

['Hello', ',', 'world', ',', 'This', 'is', 'a', 'test', '.']

In [28]:
processed = tokenize(text)

### Converting tokens into token IDs

We first need to build a vocabulary that maps each unique token to an integer ID.

In [31]:
all_words = sorted(set(processed)) #Sorting the words and removing duplicates
vocab_size = len(all_words)
print(vocab_size)

1130


In [33]:
all_words[:10], all_words[-10:]

(['!', '"', "'", '(', ')', ',', '--', '.', ':', ';'],
 ['would',
  'wouldn',
  'year',
  'years',
  'yellow',
  'yet',
  'you',
  'younger',
  'your',
  'yourself'])

In [36]:
vocab = {token:integer for integer, token in enumerate(all_words)}
vocab

{'!': 0,
 '"': 1,
 "'": 2,
 '(': 3,
 ')': 4,
 ',': 5,
 '--': 6,
 '.': 7,
 ':': 8,
 ';': 9,
 '?': 10,
 'A': 11,
 'Ah': 12,
 'Among': 13,
 'And': 14,
 'Are': 15,
 'Arrt': 16,
 'As': 17,
 'At': 18,
 'Be': 19,
 'Begin': 20,
 'Burlington': 21,
 'But': 22,
 'By': 23,
 'Carlo': 24,
 'Chicago': 25,
 'Claude': 26,
 'Come': 27,
 'Croft': 28,
 'Destroyed': 29,
 'Devonshire': 30,
 'Don': 31,
 'Dubarry': 32,
 'Emperors': 33,
 'Florence': 34,
 'For': 35,
 'Gallery': 36,
 'Gideon': 37,
 'Gisburn': 38,
 'Gisburns': 39,
 'Grafton': 40,
 'Greek': 41,
 'Grindle': 42,
 'Grindles': 43,
 'HAD': 44,
 'Had': 45,
 'Hang': 46,
 'Has': 47,
 'He': 48,
 'Her': 49,
 'Hermia': 50,
 'His': 51,
 'How': 52,
 'I': 53,
 'If': 54,
 'In': 55,
 'It': 56,
 'Jack': 57,
 'Jove': 58,
 'Just': 59,
 'Lord': 60,
 'Made': 61,
 'Miss': 62,
 'Money': 63,
 'Monte': 64,
 'Moon-dancers': 65,
 'Mr': 66,
 'Mrs': 67,
 'My': 68,
 'Never': 69,
 'No': 70,
 'Now': 71,
 'Nutley': 72,
 'Of': 73,
 'Oh': 74,
 'On': 75,
 'Once': 76,
 'Only': 77,
 '

In [45]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab #Vocabulary as a dictionary
        self.int_to_str = {v:k for k,v in self.str_to_int.items()} #Reverse vocabulary
    def encode(self, text): #Same splitting rule as in tokenize() above
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        #Converts tokens to IDs. 
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[id] for id in ids])
        return text
tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))

[53, 44, 149, 1003, 57, 38, 818, 115, 256, 486, 6, 1002, 115, 500, 435, 392, 6, 908, 585, 1077, 709, 508, 961, 1016, 663, 1016, 535, 987, 5, 568, 988, 538, 722, 549, 496, 5, 533, 514, 370, 549, 748, 5, 661, 115, 841, 1102, 5, 157, 397, 547, 568, 115, 1066, 727, 988, 84, 7, 3, 99, 53, 818, 1003, 585, 1120, 530, 208, 85, 734, 34, 7, 4, 1, 93, 538, 722, 549, 496, 1, 6, 987, 1077, 1089, 988, 1112, 242, 585, 7, 53, 244, 535, 67, 7, 37, 100, 6, 549, 602, 25, 897, 6, 326, 549, 1042, 116, 7, 1, 73, 297, 585, 2, 850, 498, 1016, 866, 988, 1059, 722, 697, 769, 2, 1083, 1051, 9, 239, 53, 359, 2, 970, 998, 722, 987, 5, 66, 7, 83, 6, 988, 646, 1016, 16, 584, 145, 53, 998, 722, 7, 1, 93, 1116, 5, 727, 67, 7, 100, 2, 850, 633, 5, 693, 586, 114, 847, 114, 177, 1002, 994, 1088, 827, 568, 156, 389, 1069, 722, 677, 7, 14, 585, 1077, 711, 731, 988, 67, 7, 101, 1097, 688, 7, 45, 711, 988, 410, 50, 28, 5, 180, 988, 602, 40, 36, 882, 5, 929, 663, 209, 38, 2, 850, 1, 65, 1, 1016, 856, 5, 1108, 976, 568, 539, 4
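
Note that `decode` joins the tokens with single spaces, so the decoded text ends up with a space in front of every punctuation mark. One possible refinement, sketched here as a standalone function (the name `join_tokens` is hypothetical and not part of the class above), removes the whitespace in front of common punctuation with `re.sub`:

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then drop the space before punctuation."""
    text = " ".join(tokens)
    # Remove whitespace that precedes , . : ; ? ! " ( ) or '
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

print(join_tokens(["Hello", ",", "world", ".", "Is", "this", "a", "test", "?"]))
```

This is only a heuristic: an opening quote or parenthesis is also glued to the preceding word, so the original spacing is not fully recovered.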

In [51]:
test = "Gabriel" #This should raise a KeyError
try:
    tokenizer.encode(test)
except KeyError:
    print("KeyError raised as expected for unknown token.")
else:
    print("No KeyError raised, test failed.")

KeyError raised as expected for unknown token.


### Special context tokens

We need to add two special tokens to the vocabulary: \<unk> for words that are not in the vocabulary and \<eos> for the end of a sentence.

In [55]:
all_tokens = sorted(set(processed))
all_tokens.extend(["<unk>", "<eos>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
vocab["<unk>"], vocab["<eos>"]

(1130, 1131)

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {v:k for k,v in self.str_to_int.items()}
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int
            else "<unk>" for item in preprocessed]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[id] for id in ids])
        return text
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))

[53, 44, 149, 1003, 57, 38, 818, 115, 256, 486, 6, 1002, 115, 500, 435, 392, 6, 908, 585, 1077, 709, 508, 961, 1016, 663, 1016, 535, 987, 5, 568, 988, 538, 722, 549, 496, 5, 533, 514, 370, 549, 748, 5, 661, 115, 841, 1102, 5, 157, 397, 547, 568, 115, 1066, 727, 988, 84, 7, 3, 99, 53, 818, 1003, 585, 1120, 530, 208, 85, 734, 34, 7, 4, 1, 93, 538, 722, 549, 496, 1, 6, 987, 1077, 1089, 988, 1112, 242, 585, 7, 53, 244, 535, 67, 7, 37, 100, 6, 549, 602, 25, 897, 6, 326, 549, 1042, 116, 7, 1, 73, 297, 585, 2, 850, 498, 1016, 866, 988, 1059, 722, 697, 769, 2, 1083, 1051, 9, 239, 53, 359, 2, 970, 998, 722, 987, 5, 66, 7, 83, 6, 988, 646, 1016, 16, 584, 145, 53, 998, 722, 7, 1, 93, 1116, 5, 727, 67, 7, 100, 2, 850, 633, 5, 693, 586, 114, 847, 114, 177, 1002, 994, 1088, 827, 568, 156, 389, 1069, 722, 677, 7, 14, 585, 1077, 711, 731, 988, 67, 7, 101, 1097, 688, 7, 45, 711, 988, 410, 50, 28, 5, 180, 988, 602, 40, 36, 882, 5, 929, 663, 209, 38, 2, 850, 1, 65, 1, 1016, 856, 5, 1108, 976, 568, 539, 4

In [64]:
test = "Gabriel"
ids = tokenizer.encode(test)
print(ids)
print("No KeyError raised.")

[1130]
No KeyError raised.


Commonly used special tokens:
- \<UNK>: unknown token, used for words that are not in the vocabulary.
- \<EOS>: end-of-sentence token, marks the end of a sentence.
- \<BOS>: beginning-of-sentence token, marks the start of a sentence.
- \<PAD>: padding token, inserted so that all texts in a batch have the same length.

For simplicity, GPT uses only a single special token, \<|endoftext|>.
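
As a rough illustration of how such tokens are used, the sketch below pads a batch of ID sequences to a common length and joins two unrelated texts with an end-of-text marker (GPT-2's tokenizer spells it `<|endoftext|>`). The token IDs, the pad ID, and the helper name `pad_batch` are made up for this example.

```python
# Made-up token ID sequences of different lengths (illustration only)
batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
PAD_ID = 0  # assumed ID reserved for the padding token

def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_batch(batch, PAD_ID)
print(padded)

# Unrelated documents can be concatenated with an end-of-text marker between them
docs = ["First document.", "Second document."]
joined = " <|endoftext|> ".join(docs)
print(joined)
```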

## Byte pair encoding

GPT uses a byte pair encoding (BPE) tokenizer, which breaks words down into subword units.

In [85]:
import tiktoken

In [89]:
tokenizer = tiktoken.get_encoding("gpt2")

In [90]:
tokenizer.n_vocab

50257

In [91]:
integers = tokenizer.encode("Hello world XD")
integers, tokenizer.decode(integers)

([15496, 995, 46537], 'Hello world XD')

In [92]:
integers = tokenizer.encode("uwu")
integers, tokenizer.decode(integers)

([84, 43812], 'uwu')

The BPE algorithm breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters. This lets it handle unknown words without needing an \<unk> token.

In [93]:
integers = tokenizer.encode("marichiweu")
integers, tokenizer.decode(integers)

([3876, 16590, 732, 84], 'marichiweu')
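
To give a feel for how such a subword vocabulary is built, here is a minimal sketch of the merge step behind byte pair encoding. It works on the characters of a tiny made-up word list rather than on bytes, and it is not the implementation used by tiktoken; the corpus, the number of merges, and the function names are assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of pair in symbols with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Made-up toy corpus; every word starts as a list of single characters
words = [list(w) for w in ["low", "low", "lower", "lowest", "newest", "widest"]]

merges = []
for _ in range(3):  # assumed number of merge steps
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = [merge_pair(symbols, pair) for symbols in words]

print(merges)
print(words)
```

After these merges the frequent word "low" has become a single symbol, while rarer words are still split into several pieces. A real BPE tokenizer repeats this process many thousands of times on a large corpus.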

### Data sampling with a sliding window

We need to generate the input-target pairs required for training an LLM. We will slide a window over the token IDs, where the targets are the inputs shifted by one position, so that next-token prediction provides the labels.

In [97]:
enc_text = tokenizer.encode(text)
print(len(enc_text))

5145


In [102]:
enc_sample = enc_text[50:]
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [106]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    next_word = enc_sample[i]
    print(tokenizer.decode(context), "--->", tokenizer.decode([next_word]))

 and --->  established
 and established --->  himself
 and established himself --->  in
 and established himself in --->  a


Let's use PyTorch to create an efficient data loader for our data.

In [108]:
import torch
from torch.utils.data import Dataset, DataLoader

In [117]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1:i+max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    #These two methods are required by a map-style PyTorch Dataset
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [118]:
dataset = GPTDatasetV1(text, tokenizer, 10, 1)
for row in dataset:
    print(row)

(tensor([   40,   367,  2885,  1464,  1807,  3619,   402,   271, 10899,  2138]), tensor([  367,  2885,  1464,  1807,  3619,   402,   271, 10899,  2138,   257]))
(tensor([  367,  2885,  1464,  1807,  3619,   402,   271, 10899,  2138,   257]), tensor([ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138,   257,  7026]))
(tensor([ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138,   257,  7026]), tensor([ 1464,  1807,  3619,   402,   271, 10899,  2138,   257,  7026, 15632]))
(tensor([ 1464,  1807,  3619,   402,   271, 10899,  2138,   257,  7026, 15632]), tensor([ 1807,  3619,   402,   271, 10899,  2138,   257,  7026, 15632,   438]))
(tensor([ 1807,  3619,   402,   271, 10899,  2138,   257,  7026, 15632,   438]), tensor([ 3619,   402,   271, 10899,  2138,   257,  7026, 15632,   438,  2016]))
(tensor([ 3619,   402,   271, 10899,  2138,   257,  7026, 15632,   438,  2016]), tensor([  402,   271, 10899,  2138,   257,  7026, 15632,   438,  2016,   257]))
(tensor([  402,   271, 10899,  213
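
The effect of `max_length` and `stride` is easier to see on a short list of integers. The sketch below mirrors the windowing loop of the dataset in plain Python, without PyTorch; the toy IDs and the function name `sliding_windows` are assumptions for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive windows overlap in all but one position
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
# stride=max_length: the inputs of consecutive windows do not overlap
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

A small stride produces many highly overlapping samples, while a stride equal to `max_length` uses each token as an input at most once.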

In [119]:
def create_dataloader_v1(txt, batch_size = 4, max_length = 256,
    stride = 128, shuffle = True, drop_last = True, num_workers = 0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size = batch_size,
        shuffle = shuffle,
        drop_last = drop_last,
        num_workers = num_workers
    )
    return dataloader

In [123]:
dataloader = create_dataloader_v1(text)
data_iter = iter(dataloader)
next(data_iter)

[tensor([[  262,  1633,   286,  ...,  1544,   373, 10574],
         [  383,  8631,  3872,  ...,  1813,   284,   423],
         [10197,   832,   262,  ...,  9074,    13,   402],
         [18560,   438,  7091,  ...,   338,  1804,   340]]),
 tensor([[ 1633,   286, 24380,  ...,   373, 10574,    26],
         [ 8631,  3872,   373,  ...,   284,   423,   520],
         [  832,   262, 46475,  ...,    13,   402,   271],
         [  438,  7091,   750,  ...,  1804,   340,   329]])]

In [124]:
next(data_iter)

[tensor([[  314,   550,  1775,  ...,   402,   271, 10899],
         [ 4150,     8,  3688,  ..., 14093,   656,   465],
         [12036,   683,     0,  ...,   284,   616,   835],
         [  423,  4750,   326,  ...,   262,  8216,    13]]),
 tensor([[  550,  1775,   683,  ...,   271, 10899,    11],
         [    8,  3688,   284,  ...,   656,   465,  1021],
         [  683,     0,  3226,  ...,   616,   835,   286],
         [ 4750,   326,  9074,  ...,  8216,    13,   314]])]
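
With the default settings, the number of batches per epoch can be worked out from values already shown: 5145 tokens, `max_length = 256`, `stride = 128`, `batch_size = 4`, and `drop_last = True`. The sketch below only redoes this arithmetic in plain Python and assumes the dataset sees the same 5145 token IDs as `enc_text`.

```python
num_tokens = 5145  # length of enc_text printed earlier
max_length = 256
stride = 128
batch_size = 4

# Same range as the loop in GPTDatasetV1.__init__
num_samples = len(range(0, num_tokens - max_length, stride))

# drop_last=True discards a final batch with fewer than batch_size samples
num_batches = num_samples // batch_size
dropped = num_samples % batch_size

print(num_samples, num_batches, dropped)
```

So each epoch yields 9 batches whose input and target tensors both have shape (4, 256), and 3 of the 39 samples are left out. With `shuffle = True`, which samples are dropped varies from epoch to epoch.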