In [308]:
# Import libraries
import re

Loading the data

In [309]:
with open("Data/the-verdict.txt", 'r',encoding='utf-8') as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))

Total number of characters: 20479


In [310]:
# Print the first 99 characters
print(raw_text[:99])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Tokenization

Using some simple example text, we can use the re.split command with the
following syntax to split a text on whitespace characters:

In [311]:
text = "hello, world. This, is a test"
result = re.split(r'(\s)',text)
print(result)

['hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test']
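The parentheses in the pattern matter here. As a minimal sketch (the shortened sample string is our own), this is how a capturing group changes what re.split returns:

```python
import re

sample = "hello, world."

# Without a capturing group, the separators are discarded.
without_group = re.split(r'\s', sample)

# With a capturing group, the separators are kept as list items.
with_group = re.split(r'(\s)', sample)

print(without_group)  # ['hello,', 'world.']
print(with_group)     # ['hello,', ' ', 'world.']
```

Keeping the separators is what later lets us treat punctuation marks as tokens of their own instead of losing them.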


Note that the simple tokenization scheme above mostly works for separating
the example text into individual words. However, some words are still
connected to punctuation characters that we want to have as separate list
entries. We also refrain from making all text lowercase because capitalization
helps LLMs distinguish between proper nouns and common nouns, understand
sentence structure, and learn to generate text with proper capitalization.

Let's modify the regular expression so that it splits on whitespace (\s),
commas, and periods ([,.]):

In [312]:
result = re.split(r'([,.]|\s)', text)
print(result)

['hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test']


A small remaining issue is that the list still includes whitespace characters
and empty strings. Optionally, we can remove these redundant entries safely as
follows:

In [313]:
result = [item for item in result if item.strip()]
print(result)

['hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test']


Handle other types of punctuation (,.:?_!"()'--):

In [314]:
text_2 = "Hello, word. Is this-- a test?"
result = re.split(r'([,.:?_!"()\']|--|\s)',text_2)
# removing white space 
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'word', '.', 'Is', 'this', '--', 'a', 'test', '?']


Now that we have a basic tokenizer working, let's apply it to Edith Wharton's
entire short story:

In [315]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)',raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4649


The above print statement outputs 4649, which is the number of tokens in this
text (without whitespace).

In [316]:
print(preprocessed[:10])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']


### Converting tokens into token IDs

In the previous section, we tokenized Edith Wharton's short story and
assigned it to a Python variable called preprocessed. Let's now create a list
of all unique tokens and sort them alphabetically to determine the vocabulary size:

In [317]:
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)

1159


all_words = sorted(list(set(preprocessed)))

This line of Python performs the following operations:

set(preprocessed):

preprocessed is the list of tokens created in the previous section.
set(...) builds a set from preprocessed. A set is an unordered collection of unique elements, so converting preprocessed to a set removes all duplicate tokens.
list(...):

This converts the set back into a list. Sets have no defined order, whereas lists preserve the order of their elements.
sorted(...):

This function sorts the list alphabetically. More precisely, strings are compared by Unicode code point, which is why punctuation and capitalized words appear before lowercase words.
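As a small illustration of these three steps, here is a sketch on a made-up token list (the list itself is an assumption for demonstration). It also shows that sorted accepts a set directly, so the intermediate list(...) call is optional:

```python
tokens = ["the", "cat", ",", "the", "dog", ",", "The"]

# set() drops duplicates; sorted() returns a new list ordered by code point.
unique_sorted = sorted(set(tokens))

print(unique_sorted)       # [',', 'The', 'cat', 'dog', 'the']
print(len(unique_sorted))  # 5
```

Because the comparison is case-sensitive, 'The' and 'the' count as two different vocabulary entries.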

After determining that the vocabulary size is 1,159 via the above code, we
create the vocabulary and print its first 52 entries for illustration purposes:

In [318]:
# Creating a vocabulary
vocab = {token:integer for integer, token in enumerate(all_words)}
for i,item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
('Her', 51)


As we can see, based on the output above, the dictionary contains individual
tokens associated with unique integer labels. Our next goal is to apply this
vocabulary to convert new text into token IDs.
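The vocabulary listing also contains entries such as 'Carlo;' and 'Grindle:'. These appear because the pattern applied to the full story does not split on ':' or ';'. The sketch below compares that pattern with a hypothetical wider variant on a made-up sample sentence; the helper name split_tokens is our own, and the numbers reported in this notebook were produced with the narrower pattern.

```python
import re

sample = "Monte Carlo; and Mrs. Grindle: yes"

pattern_used = r'([,.?_!"()\']|--|\s)'      # pattern applied to the full story above
pattern_wider = r'([,.:;?_!"()\']|--|\s)'   # hypothetical variant that also splits on ':' and ';'

def split_tokens(pattern, text):
    """Split text on the pattern and drop whitespace-only and empty items."""
    return [item.strip() for item in re.split(pattern, text) if item.strip()]

print(split_tokens(pattern_used, sample))
# ['Monte', 'Carlo;', 'and', 'Mrs', '.', 'Grindle:', 'yes']
print(split_tokens(pattern_wider, sample))
# ['Monte', 'Carlo', ';', 'and', 'Mrs', '.', 'Grindle', ':', 'yes']
```

Switching to the wider pattern would change both the token count and the vocabulary size, so the outputs shown in this notebook would no longer match.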

Let's implement a complete tokenizer class in Python with an encoder method
that splits text into tokens and carries out the string-to-integer mapping to
produce token IDs via the vocabulary. In addition, we implement a decoder
method that carries out the reverse integer-to-string mapping to convert the
token IDs back into text.

In [319]:
# Implementing a simple text tokenizer 
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encoder(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)',text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decoder(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])',r'\1',text)
        return text

Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class
and tokenize a passage from Edith Wharton's short story to try it out in
practice:

In [320]:
# Instantiate the tokenizer with the vocabulary
tokenizer = SimpleTokenizerV1(vocab=vocab)

In [321]:
# creating the test text 
text_3 = """It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it."""

In [322]:
# computing the token IDs 
ids = tokenizer.encoder(text_3)
print(ids)

[58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7, 1, 96, 615, 246, 745, 5, 1, 901, 298, 551, 6, 1, 246, 1013, 751, 363, 2, 995, 301, 5, 211, 541, 337, 596, 7]


Next, let's see if we can turn these token IDs back into text using the decoder
method:

In [323]:
print(tokenizer.decoder(ids))

It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride." The last but one," she corrected herself --" but the other doesn' t count, because he destroyed it.


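Based on the output above, we can see that the decoder method recovers the
words and punctuation of the original text. The spacing is not restored
exactly, though: tokens are joined with single spaces and only the whitespace
before punctuation is removed, so the output contains "It' s" and "doesn' t"
instead of "It's" and "doesn't".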
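The two steps inside the decoder method can be replayed on a handful of tokens. This is a minimal sketch with a hand-written token list (an assumption for illustration) that uses the same regular expression as the class above:

```python
import re

tokens = ["It", "'", "s", "the", "last", "he", "painted", ","]

# Step 1: join all tokens with single spaces.
joined = " ".join(tokens)

# Step 2: remove the whitespace that precedes punctuation.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # It ' s the last he painted ,
print(cleaned)  # It' s the last he painted,
```

The space after the apostrophe survives because the substitution only looks at whitespace in front of a punctuation character.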

So far, so good. We implemented a tokenizer capable of tokenizing and detokenizing text based on a snippet from the training set. Let's now apply it to
a new text sample that is not contained in the training set:

In [324]:
# text_4 = "Hello, do you like tea?"
# tokenizer.encoder(text=text_4)


Executing the commented-out call above raises a KeyError. The problem is that
the word "Hello" was not used in The Verdict short story. Hence, it is not
contained in the vocabulary. This highlights the need to consider large and
diverse training sets to extend the vocabulary when working on LLMs.
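The failure can be reproduced without the full story. The following sketch uses a tiny hand-made vocabulary (toy_vocab is an assumption for illustration) and the same dictionary lookup as the encoder method:

```python
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4, ",": 5}

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

missing = None
try:
    # Same lookup style as in SimpleTokenizerV1: plain dictionary indexing.
    ids = [toy_vocab[token] for token in tokens]
except KeyError as err:
    missing = err.args[0]
    print(f"Token not in vocabulary: {missing!r}")  # Token not in vocabulary: 'Hello'
```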

In the next section, we will test the tokenizer further on text that contains
unknown words, and we will also discuss additional special tokens that can
be used to provide further context for an LLM during training.

### Adding special context tokens

We can modify the tokenizer to use an <|unk|> token whenever it encounters a
word that is not part of the vocabulary. In addition, we add an <|endoftext|>
token between unrelated texts. For example, when training GPT-like LLMs on
multiple independent documents or books, it is common to insert a token
before each document or book that follows a previous text source, as
illustrated in figure 2.10. This helps the LLM understand that, although these
text sources are concatenated for training, they are in fact unrelated.

Let's now modify the vocabulary to include these two special tokens, <|unk|>
and <|endoftext|>, by adding these to the list of all unique words that we
created in the previous section:

In [325]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}


print(len(vocab.items()))

1161


Based on the output of the print statement above, the new vocabulary size is
1161 (the vocabulary size in the previous section was 1159).

As an additional quick check, let's print the last 5 entries of the updated
vocabulary:

In [326]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


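Based on the code output above, we can confirm that the two new special
tokens were indeed successfully incorporated into the vocabulary. Next, we
adjust the tokenizer from code listing 2.3 accordingly, as shown in listing 2.4.
The only change is that the encoder method now replaces unknown words with
<|unk|> before looking up the token IDs: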

In [327]:
# A simple text tokenizer that handles unknown words
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encoder(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)',text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decoder(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])',r'\1',text)
        return text

Let's now try this new tokenizer out in practice. For this, we will use a simple
text sample that we concatenate from two independent and unrelated
sentences:

In [328]:
text_5 = "Hello, do you like tea?"
text_6 = "In the sunlit terraces of the palace."
text_7 = "<|endoftext|> ".join((text_5,text_6))

print(text_7)

Hello, do you like tea?<|endoftext|> In the sunlit terraces of the palace.


Next, let's tokenize the sample text using the SimpleTokenizerV2 on the
vocab we previously created:

In [329]:
tokenizer_v2 = SimpleTokenizerV2(vocab=vocab)
print(tokenizer_v2.encoder(text_7))

[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]


Above, we can see that the list of token IDs contains 1159 for the
<|endoftext|> separator token as well as two 1160 tokens, which are used for
unknown words.

Let's de-tokenize the text for a quick sanity check:


In [330]:
print(tokenizer_v2.decoder(tokenizer_v2.encoder(text=text_7)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


Based on comparing the de-tokenized text above with the original input text,
we know that the training dataset, Edith Wharton's short story The Verdict,
did not contain the words "Hello" and "palace."
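The fallback to <|unk|> is lossy: once a word has been mapped to the unknown ID, the decoder cannot bring it back. A minimal sketch with a hand-made vocabulary (toy_vocab and its IDs are assumptions for illustration), using dict.get as a compact alternative to the conditional in SimpleTokenizerV2:

```python
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4, ",": 5,
             "<|endoftext|>": 6, "<|unk|>": 7}

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

# Unknown tokens fall back to the ID of <|unk|>.
unk_id = toy_vocab["<|unk|>"]
ids = [toy_vocab.get(token, unk_id) for token in tokens]
print(ids)  # [7, 5, 0, 1, 2, 3, 4]

# Mapping the IDs back shows that "Hello" is gone for good.
id_to_token = {i: t for t, i in toy_vocab.items()}
decoded = " ".join(id_to_token[i] for i in ids)
print(decoded)  # <|unk|> , do you like tea ?
```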

Note that the tokenizer used for GPT models does not use an <|unk|>
token for out-of-vocabulary words. Instead, GPT models use a byte pair
encoding tokenizer, which breaks down words into subword units, which we
will discuss in the next section.

### Byte pair encoding: using the tiktoken library

We implemented a simple tokenization scheme in the previous sections for
illustration purposes. This section covers a more sophisticated tokenization
scheme based on a concept called byte pair encoding (BPE). The BPE
tokenizer covered in this section was used to train LLMs such as GPT-2,
GPT-3, and the original model used in ChatGPT.

Since implementing BPE can be relatively complicated, we will use an
existing Python open-source library called tiktoken
(https://github.com/openai/tiktoken), which implements the BPE algorithm
very efficiently based on source code in Rust. Similar to other Python
libraries, we can install the tiktoken library via Python's pip installer from the
terminal (pip install tiktoken).

In [331]:
# version of tiktoken
from importlib.metadata import version
import tiktoken 
print("tiktoken version: ",version("tiktoken"))

tiktoken version:  0.8.0


Once installed, we can instantiate the BPE tokenizer from tiktoken as
follows:

In [332]:
bpe_tokenizer = tiktoken.get_encoding("gpt2")

The usage of this tokenizer is similar to the SimpleTokenizerV2 we implemented
previously: it provides an encode method, the counterpart of our encoder method:

In [333]:
integers = bpe_tokenizer.encode(text_7, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 262, 20562, 13]


We can then convert the token IDs back into text using the decode method,
similar to the decoder method of our SimpleTokenizerV2 earlier:

In [334]:
strings = bpe_tokenizer.decode(integers)
print(strings)

Hello, do you like tea?<|endoftext|> In the sunlit terraces of the palace.


We can make two noteworthy observations based on the token IDs and
decoded text above. First, the <|endoftext|> token is assigned a relatively
large token ID, namely, 50256. In fact, the BPE tokenizer, which was used to
train models such as GPT-2, GPT-3, and the original model used in
ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being
assigned the largest token ID.

Second, the BPE tokenizer above encodes and decodes unknown words, such as
"someunknownPlace", correctly. The BPE tokenizer can handle any unknown
word. How does it achieve this without using <|unk|> tokens?

The algorithm underlying BPE breaks down words that aren't in its
predefined vocabulary into smaller subword units or even individual
characters, enabling it to handle out-of-vocabulary words. So, thanks to the
BPE algorithm, if the tokenizer encounters an unfamiliar word during
tokenization, it can represent it as a sequence of subword tokens or
characters.
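BPE builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. To make this concrete, here is a toy character-level sketch. It is not the byte-level implementation or the pretrained merge table that tiktoken uses for GPT-2; the corpus, the helper names (most_frequent_pair, merge_pair, encode_word), and the number of merges are all illustrative assumptions.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Toy corpus: each word starts out as a list of single characters.
corpus = [list(word) for word in ["lower", "lowest", "low", "slow"]]

# "Training": learn two merge rules from the corpus.
merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = [merge_pair(seq, pair) for seq in corpus]

def encode_word(word, merges):
    """Apply the learned merges, in order, to a word that may be unseen."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

pieces = encode_word("lowly", merges)
print(pieces)  # ['low', 'l', 'y']
```

The word "lowly" never occurs in the toy corpus, yet it is encoded without any unknown token: the learned subword 'low' covers its beginning, and the rest falls back to single characters.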

#### Exercise 2.1

Try the BPE tokenizer from the tiktoken library on the unknown words
"Akwirw ier" and print the individual token IDs. Then, call the decode
function on each of the resulting integers in this list to reproduce the mapping
shown in Figure 2.11. Lastly, call the decode method on the token IDs to
check whether it can reconstruct the original input, "Akwirw ier".

In [335]:
text_exe = "Akwirw ier"

exe_tokenizer = tiktoken.get_encoding("gpt2")
print("====================== Encoder ===============================")
integer_exe = exe_tokenizer.encode(text=text_exe, allowed_special={"<|endoftext|>"})
print(integer_exe)
print("\n")
print("====================== Decoder ===============================")
strings_exe = exe_tokenizer.decode(integer_exe)
print(strings_exe)

[33901, 86, 343, 86, 220, 959]


Akwirw ier


### Data sampling with a sliding window

The previous section covered the tokenization steps and conversion from
string tokens into integer token IDs in great detail. The next step before we
can finally create the embeddings for the LLM is to generate the input-target
pairs required for training an LLM.


What do these input-target pairs look like? As we learned in chapter 1, LLMs
are pretrained by predicting the next word in a text, as depicted in figure 2.12.

In this section we implement a data loader that fetches the input-target pairs
depicted in Figure 2.12 from the training dataset using a sliding window
approach.

In [336]:
# To get started, we will first tokenize the whole The Verdict short story we
# worked with earlier using the BPE tokenizer introduced in the previous section

with open("Data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = bpe_tokenizer.encode(raw_text)
print(len(enc_text))

5145


Next, we remove the first 50 tokens from the dataset for demonstration
purposes as it results in a slightly more interesting text passage in the next
steps:

In [337]:
enc_sample = enc_text[50:]

One of the easiest and most intuitive ways to create the input-target pairs for
the next-word prediction task is to create two variables, x and y, where x
contains the input tokens and y contains the targets, which are the inputs
shifted by 1:

In [338]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y: {y}")

x: [290, 4920, 2241, 287]
y: [4920, 2241, 287, 257]


Processing the inputs along with the targets, which are the inputs shifted by
one position, we can then create the next-word prediction tasks depicted
earlier in figure 2.12, as follows:

In [339]:
for i in range(1,context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
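    print(context,"=======>", desired)

[290] =======> 4920
[290, 4920] =======> 2241
[290, 4920, 2241] =======> 287
[290, 4920, 2241, 287] =======> 257
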
Everything left of the arrow (=======>) refers to the input an LLM would
receive, and the token ID on the right side of the arrow represents the target
token ID that the LLM is supposed to predict.

For illustration purposes, let's repeat the previous code but convert the token
IDs into text

In [340]:
for i in range(1,context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(bpe_tokenizer.decode(context),"=======>",bpe_tokenizer.decode([desired]))



We've now created the input-target pairs that we can use for LLM training
in upcoming chapters.
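A single (x, y) pair of length 4 therefore packs four next-token prediction tasks: for every position i, y[i] is the target for the context x[:i + 1]. The sketch below spells this out with the x and y values printed earlier; the variable name tasks is our own.

```python
x = [290, 4920, 2241, 287]
y = [4920, 2241, 287, 257]

# For position i, the context is everything in x up to and including i,
# and the target is the token at the same position in y.
tasks = [(x[:i + 1], y[i]) for i in range(len(x))]

for context, target in tasks:
    print(context, "=======>", target)

print(len(tasks))  # 4
```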

There's only one more task before we can turn the tokens into embeddings, as
we mentioned at the beginning of this chapter: implementing an efficient data
loader that iterates over the input dataset and returns the inputs and targets as
PyTorch tensors, which can be thought of as multidimensional arrays.

In particular, we are interested in returning two tensors: an input tensor
containing the text that the LLM sees and a target tensor that includes the
targets for the LLM to predict.

For the efficient data loader implementation, we will use PyTorch's built-in
Dataset and DataLoader classes:

In [341]:
# A dataset for batched inputs and targets
import torch
from torch.utils.data import Dataset, DataLoader

In [342]:
class GPTDatasetV1(Dataset):
    def __init__(self,txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids)- max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The following code will use the GPTDatasetV1 to load the inputs in batches
via a PyTorch DataLoader:

In [343]:
# A DataLoader to generate batches of input-target pairs
def create_dataloader_V1(txt, batch_size=4, max_length=256, stride=128,shuffle=True,drop_last=True):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt=txt, tokenizer=tokenizer, max_length=max_length,stride=stride)
    dataloader = DataLoader(
        dataset= dataset, batch_size=batch_size, shuffle=shuffle,drop_last=drop_last
    )

    return dataloader

Let's test the dataloader with a batch size of 1 for an LLM with a context
size of 4 to develop an intuition of how the GPTDatasetV1 class from listing
2.5 and the create_dataloader_V1 function from listing 2.6 work together:

In [344]:
with open("Data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [345]:
dataloader = create_dataloader_V1(
    txt= raw_text, batch_size=1, max_length=4, stride=1, shuffle=False, drop_last=True
)

In [346]:
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Executing the preceding code prints the following:

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The first_batch variable contains two tensors: the first tensor stores the
input token IDs, and the second tensor stores the target token IDs. Since the
max_length is set to 4, each of the two tensors contains 4 token IDs. Note
that an input size of 4 is relatively small and only chosen for illustration
purposes. It is common to train LLMs with input sizes of at least 256.


To illustrate the meaning of stride=1, let's fetch another batch from this
dataset:

In [347]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


If we compare the first with the second batch, we can see that the second
batch's token IDs are shifted by one position compared to the first batch (for
example, the second ID in the first batch's input is 367, which is the first ID
of the second batch's input). The stride setting dictates the number of
positions the inputs shift across batches, emulating a sliding window
approach.
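The effect of stride can be explored without PyTorch. The following sketch copies the loop from GPTDatasetV1 into a plain function (sliding_windows is a name of our own) and applies it to a stand-in list of integers instead of real token IDs:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks the way the loop in GPTDatasetV1 builds them."""
    samples = []
    for i in range(0, len(token_ids) - max_length, stride):
        samples.append((token_ids[i:i + max_length],
                        token_ids[i + 1:i + max_length + 1]))
    return samples

toy_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive inputs overlap in all but one position.
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
print(len(overlapping))   # 6
print(overlapping[0])     # ([0, 1, 2, 3], [1, 2, 3, 4])
print(overlapping[1])     # ([1, 2, 3, 4], [2, 3, 4, 5])

# stride=max_length: the inputs do not overlap at all.
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)
print(disjoint)           # [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

A larger stride yields fewer samples from the same text, while a stride equal to max_length uses every token as an input exactly once, apart from leftover tokens at the end.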

#### Exercise 2.2

To develop more intuition for how the data loader works, try to run it with
different settings such as max_length=2 and stride=2 and max_length=8 and
stride=2.

In [348]:
dataloader = create_dataloader_V1(
    txt= raw_text, batch_size=8, max_length=4, stride=1, shuffle=False, drop_last=True
)

data_iter = iter(dataloader)
inputs , targets = next(data_iter)
print("Inputs: \n", inputs)
print("targets: \n", targets)

Inputs: 
 tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]])
targets: 
 tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])


### Creating token embeddings

The last step for preparing the input text for LLM training is to convert the
token IDs into embedding vectors.

Let's illustrate how the token ID to embedding vector conversion works with
a hands-on example. Suppose we have the following four input tokens with
IDs 2, 3, 5, and 1:

In [349]:
input_ids = torch.tensor([2,3,5,1])

For the sake of simplicity and illustration purposes, suppose we have a small
vocabulary of only 6 words (instead of the 50,257 words in the BPE
tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3,
the embedding size is 12,288 dimensions):

In [350]:
vocab_size = 6
output_dim = 3

Using the vocab_size and output_dim, we can instantiate an embedding
layer in PyTorch, setting the random seed to 123 for reproducibility purposes:

In [351]:
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(
    vocab_size,
    output_dim
)

print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


We can see that the weight matrix of the embedding layer contains small,
random values. These values are optimized during LLM training as part of
the LLM optimization itself, as we will see in upcoming chapters. Moreover,
we can see that the weight matrix has six rows and three columns. There is
one row for each of the six possible tokens in the vocabulary. And there is
one column for each of the three embedding dimensions.

Now that we have instantiated the embedding layer, let's apply it to a token ID
to obtain the embedding vector:

In [352]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


If we compare the embedding vector for token ID 3 to the previous
embedding matrix, we see that it is identical to the 4th row (Python starts
with a zero index, so it's the row corresponding to index 3). In other words,
the embedding layer is essentially a look-up operation that retrieves rows
from the embedding layer's weight matrix via a token ID.
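The look-up gives the same result as multiplying a one-hot vector by the weight matrix. The sketch below checks this in plain Python, without PyTorch, using the weight values printed above; the variable names are our own.

```python
# Weight matrix copied from the embedding_layer.weight output above
# (6 tokens x 3 embedding dimensions).
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

token_id = 3

# Look-up: select the row whose index equals the token ID.
lookup = weights[token_id]

# Equivalent matrix product: one-hot vector (1 x 6) times weights (6 x 3).
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
product = [sum(one_hot[i] * weights[i][j] for i in range(len(weights)))
           for j in range(len(weights[0]))]

print(lookup)             # [-0.4015, 0.9666, -1.1481]
print(product == lookup)  # True
```

The look-up skips all the multiplications by zero, which is why an embedding layer is the more efficient way to express the same operation.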

Previously, we have seen how to convert a single token ID into a
three-dimensional embedding vector. Let's now apply that to all four input IDs we
defined earlier (torch.tensor([2, 3, 5, 1])):

In [353]:
# Embed all four input token IDs in a single call
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


### Encoding word positions

Previously, we focused on very small embedding sizes in this chapter for
illustration purposes. We now consider more realistic and useful embedding
sizes and encode the input tokens into a 256-dimensional vector
representation. This is smaller than what the original GPT-3 model used (in
GPT-3, the embedding size is 12,288 dimensions) but still reasonable for
experimentation. Furthermore, we assume that the token IDs were created by
the BPE tokenizer that we used earlier, which has a vocabulary size
of 50,257:

In [354]:
output_dim = 256
vocab_size = 50257

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
token_embedding_layer

Embedding(50257, 256)

Let's first instantiate the data loader from section 2.6, Data sampling with a
sliding window:

In [355]:
max_length = 4 
dataloader = create_dataloader_V1(
    raw_text, batch_size=8, max_length=max_length, stride=max_length
)
iter_dataloader = iter(dataloader)
inputs_2, targets_2 = next(iter_dataloader)
print("Token ID: \n" , inputs_2)
print("\nInputs shape:\n", inputs_2.shape)


Token ID: 
 tensor([[24818,   417,    12, 12239],
        [  314,  3114,   379,   262],
        [ 2156,   286,  4116,    13],
        [  866,   262,  2119,    11],
        [ 3363,    11,   340,   373],
        [  198,     1,    40,  2900],
        [  465, 14475,    13,   198],
        [ 3081,   286,  2045,  1190]])

Inputs shape:
 torch.Size([8, 4])


As we can see, the token ID tensor is 8x4-dimensional, meaning that the data
batch consists of 8 text samples with 4 tokens each.

Let's now use the embedding layer to embed these token IDs into 256-
dimensional vectors:

In [356]:
token_embeddings = token_embedding_layer(inputs_2)
print(token_embeddings)

tensor([[[-0.0673, -0.3863, -0.5027,  ..., -1.0610,  0.5170,  0.4422],
         [-1.0366,  0.8779,  0.4560,  ..., -0.6743,  0.9195, -1.5858],
         [-0.4357,  0.5339,  0.2413,  ...,  0.5189, -1.9390,  0.8580],
         [ 1.1013, -0.2652,  0.7247,  ...,  0.3632, -0.5016,  0.1404]],

        [[ 0.5786, -1.8926, -1.7647,  ..., -0.1368,  0.3491,  2.1122],
         [-1.1059, -0.3257,  0.1568,  ...,  0.1386,  1.0505, -0.1206],
         [-1.3986,  0.7459, -0.0840,  ..., -1.9663,  0.8480, -0.8870],
         [-0.3962,  0.5593,  3.4120,  ...,  0.6146, -0.3981,  0.7999]],

        [[ 0.4188, -0.2066, -1.1127,  ..., -0.1625,  1.9322,  0.6564],
         [-0.5594,  1.2612, -0.9617,  ..., -0.8864, -0.0426, -2.1107],
         [ 0.3512,  0.0883, -0.3792,  ..., -0.0899, -0.0107,  0.8041],
         [-1.1577, -1.9382,  0.9027,  ..., -0.0718, -0.8468, -1.0623]],

        ...,

        [[-0.1437, -0.9782,  1.5918,  ...,  0.3068, -1.1135, -0.7202],
         [ 1.2277, -0.4297, -2.2121,  ..., -0.1640, -0.33

In [357]:
print(token_embeddings.shape)

torch.Size([8, 4, 256])


As we can tell based on the 8x4x256-dimensional tensor output, each token
ID is now embedded as a 256-dimensional vector.

For a GPT model's absolute embedding approach, we just need to create
another embedding layer that has the same dimension as the token_embedding_layer:

In [358]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


As we can see, the positional embedding tensor consists of four 256-dimensional
vectors. We can now add these directly to the token embeddings,
where PyTorch will add the 4x256-dimensional pos_embeddings tensor to
the 4x256-dimensional token embedding tensor of each of the 8 samples in the batch:

In [359]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
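The addition above relies on broadcasting: the same positional vectors are reused for every sample in the batch. The following sketch mimics this with nested Python lists instead of tensors, using tiny made-up shapes (2 samples, 3 positions, 2 dimensions) in place of 8, 4, and 256:

```python
# token_embeddings: shape (batch_size=2, context_length=3, dim=2)
token_embeddings = [[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
                    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]]

# pos_embeddings: shape (context_length=3, dim=2), shared by all samples
pos_embeddings = [[0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]

# For every sample, add the positional vector that belongs to each position.
input_embeddings = [
    [[tok + pos for tok, pos in zip(tok_vec, pos_vec)]
     for tok_vec, pos_vec in zip(sample, pos_embeddings)]
    for sample in token_embeddings
]

print(input_embeddings[0][0])  # [1.25, 1.25]
print(input_embeddings[1][0])  # [4.25, 4.25]
print(input_embeddings[0][2])  # [3.75, 3.75]
```

The first position of both samples receives the same offset of 0.25, which is exactly what PyTorch does when it adds a 4x256 tensor to an 8x4x256 tensor.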


The input_embeddings we created are the embedded input examples that can now be processed by the main LLM
modules, which we will begin implementing in chapter 3.