# POSITIONAL EMBEDDING (ENCODING WORD POSITIONS)

Previously in this chapter, we used very small embedding sizes for illustration purposes.

We now consider a more realistic and useful embedding size and encode each input token into a 256-dimensional vector representation.

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions for the largest model), but still reasonable for demonstration.

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we implemented earlier, which has a vocabulary size of 50,257 tokens.

In [27]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import importlib
import tiktoken

In [28]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
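
To make the sliding window concrete, the following minimal sketch applies the same indexing as GPTDatasetV1 to a toy list of integers standing in for token IDs. The helper name `sliding_window_pairs` and the toy values are illustrative assumptions and not part of the original code.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks using the same indexing as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        # The target is the input shifted one position to the right
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # 10 toy "token IDs": 10, 11, ..., 19
pairs = sliding_window_pairs(toy_ids, max_length=4, stride=4)
for x, y in pairs:
    print(x, "->", y)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

With stride equal to max_length, as in this sketch, consecutive input windows do not overlap. A smaller stride produces overlapping windows and therefore more training samples from the same text.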

In [29]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [30]:
vocab_size = 50257  # Vocabulary size of the tokenizer
output_dim = 256   # Dimensionality of the embedding vector

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
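
An embedding layer works as a trainable lookup table: each token ID selects one row of the layer's weight matrix. The following minimal sketch illustrates this with deliberately tiny, hypothetical sizes (a 6-token vocabulary and 3 dimensions) rather than the 50,257 x 256 layer defined above. The names `small_embedding`, `toy_batch`, and `toy_vectors` are assumptions made for this example.

```python
import torch

torch.manual_seed(123)
small_embedding = torch.nn.Embedding(6, 3)  # toy sizes (assumption), not 50,257 x 256

toy_batch = torch.tensor([[2, 3, 5, 1],
                          [0, 4, 2, 2]])  # 2 samples with 4 token IDs each
toy_vectors = small_embedding(toy_batch)

print(toy_vectors.shape)  # torch.Size([2, 4, 3])

# The lookup is equivalent to indexing rows of the weight matrix
print(torch.equal(toy_vectors, small_embedding.weight[toy_batch]))  # True
```

The output shape is the input shape with one extra trailing dimension for the embedding size, which is why a batch of 8 x 4 token IDs becomes an 8 x 4 x 256 tensor.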

Let's first instantiate the data loader, which samples the data with a sliding window:

In [33]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20480
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [34]:
max_length = 4  # Maximum sequence length
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False,
)
data_iter = iter(dataloader)
inputs, target = next(data_iter)

In [35]:
print('Token IDs:\n', inputs)
print('\nInput shape:\n', inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Input shape:
 torch.Size([8, 4])


As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.

Let's now use the embedding layer to embed these token IDs into 256-dimensional
vectors:

In [36]:
token_embeddings = token_embedding_layer(inputs)
print('\nToken Embeddings shape:\n', token_embeddings.shape)


Token Embeddings shape:
 torch.Size([8, 4, 256])


As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now
embedded as a 256-dimensional vector.

For the absolute position embedding approach used in GPT models, we just need to create another
embedding layer that has the same embedding dimension as the token_embedding_layer:

In [37]:
context_length = max_length  # Context length for the GPT model
position_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [39]:
position_embeddings = position_embedding_layer(torch.arange(max_length))
print('\nPosition Embeddings shape:\n', position_embeddings.shape)


Position Embeddings shape:
 torch.Size([4, 256])


As shown in the preceding code example, the input to the position_embedding_layer is usually a
placeholder vector torch.arange(context_length), which contains the sequence of
numbers 0, 1, ..., up to the maximum input length − 1.

The context_length is a variable that represents the supported input size of the LLM.
Here, we set it equal to max_length, the maximum length of the input sequences.
In practice, input text can be longer than the supported context length, in which case
we have to truncate the text.
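
Truncation is needed because position indices at or beyond the context length have no corresponding row in the position embedding table. The following minimal sketch shows one way to truncate, keeping the first tokens; keeping the last tokens instead is an equally valid choice depending on the application. The helper `truncate_to_context` and the `demo_` names are hypothetical and not part of the original code.

```python
import torch


def truncate_to_context(token_ids, context_length):
    """Keep at most the first `context_length` token IDs (hypothetical helper)."""
    return token_ids[:context_length]


demo_context_length = 4
demo_position_layer = torch.nn.Embedding(demo_context_length, 256)

long_ids = torch.tensor([40, 367, 2885, 1464, 1807, 3619])  # 6 tokens: too long
short_ids = truncate_to_context(long_ids, demo_context_length)

# Position indices 0 .. seq_len - 1 for the (possibly truncated) sequence
positions = torch.arange(short_ids.shape[0])
demo_pos_vectors = demo_position_layer(positions)

print(short_ids.tolist())      # [40, 367, 2885, 1464]
print(positions.tolist())      # [0, 1, 2, 3]
print(demo_pos_vectors.shape)  # torch.Size([4, 256])
```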

As we can see, the positional embedding tensor consists of four 256-dimensional vectors.
We can now add these directly to the token embeddings, where PyTorch will broadcast the
4x256-dimensional position_embeddings tensor and add it to the 4x256-dimensional token
embedding tensor of each of the 8 samples in the batch:

In [40]:
input_embeddings = token_embeddings + position_embeddings
print('\nInput Embeddings shape:\n', input_embeddings.shape)


Input Embeddings shape:
 torch.Size([8, 4, 256])
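
The addition above relies on PyTorch broadcasting. As a minimal sketch, the following code uses random stand-in tensors (`tok` and `pos` are assumptions for this example, not the actual embeddings) to check that the broadcast sum matches an explicit expansion of the position tensor over the batch dimension.

```python
import torch

torch.manual_seed(123)
tok = torch.randn(8, 4, 256)  # stand-in for token_embeddings
pos = torch.randn(4, 256)     # stand-in for position_embeddings

# pos is broadcast over the leading batch dimension
broadcast_sum = tok + pos

# Same computation with the batch dimension made explicit
explicit_sum = tok + pos.unsqueeze(0).expand(8, -1, -1)

print(broadcast_sum.shape)                       # torch.Size([8, 4, 256])
print(torch.equal(broadcast_sum, explicit_sum))  # True

# Every sample in the batch receives the same position vectors
print(torch.equal(broadcast_sum[3], tok[3] + pos))  # True
```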
