In [None]:
with open("/content/verdict_story.txt", "r") as data_file:
  raw_text = data_file.read()

print("Total number of characters: ", len(raw_text))
print(raw_text[0:101])

We use Python's regular expression library (re) to split the text into a list of tokens.

In [68]:
import re

text = 'Hello, world. This, is a test.'

result = re.split(r'(\s)', text)  # split on whitespace; the capturing group keeps the separators in the result

print(result)
# print(len(result))

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


Currently 'Hello,' is a single token, but we want two tokens: 'Hello' and ','.
We want commas, periods, and other punctuation marks to be separate tokens so that the LLM can learn them as distinct units.

In addition, we do not want whitespace characters to remain as separate tokens.

####Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods ([,.]):

In [69]:
text = 'Hello, world. This, is a test.'

result = re.split(r'([,.]|\s)', text)  # split text on whitespaces ',' '.'

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


We can see that the words and punctuation characters are now separate list entries, just as we wanted.
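The role of the parentheses in the pattern can be seen in a minimal sketch. With a capturing group, re.split keeps the separators in the result; without it, they are discarded. The variable names with_group and without_group are only for illustration.

```python
import re

text = 'Hello, world.'

# Capturing group: the matched separators are returned as list items.
with_group = re.split(r'([,.]|\s)', text)

# No capturing group: the separators are dropped from the result.
without_group = re.split(r'[,.]|\s', text)

print(with_group)     # ['Hello', ',', '', ' ', 'world', '.', '']
print(without_group)  # ['Hello', '', 'world', '']
```

The empty strings appear wherever two separators are adjacent (for example the comma followed by a space) or where the text ends with a separator.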

####A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant entries safely as follows:

In [70]:
result = [item for item in result if item.strip()]  # item.strip() is '' (falsy) for empty or whitespace-only items
result                                              # so only words and punctuation such as '.' or ',' are kept

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']

REMOVING WHITESPACES OR NOT

When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces.
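The trade-off can be made concrete with a small sketch on an indented code snippet. The sample string and the names with_ws and without_ws are assumptions chosen for illustration; the splitting pattern is the whitespace-only one used earlier.

```python
import re

code = "if x:\n    y = 1"

# Keep whitespace tokens (drop only the empty strings produced by re.split).
with_ws = [t for t in re.split(r'(\s)', code) if t != ""]

# Drop whitespace tokens, as this notebook does.
without_ws = [t for t in re.split(r'(\s)', code) if t.strip()]

print(with_ws)
print(without_ws)                 # ['if', 'x:', 'y', '=', '1']
print("".join(with_ws) == code)   # True: the original text can be rebuilt exactly
```

With whitespace kept, the newline and the four-space indentation survive as tokens; with whitespace removed, the token list is shorter but the layout of the code can no longer be recovered.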

In [None]:
# removing whitespace
[item for item in result if item.strip()]
# item.strip() returns a non-empty (truthy) string when the item contains non-whitespace characters
# and an empty (falsy) string when the item is empty or whitespace-only, so those items are dropped

The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters:

In [None]:
text = "Hello, World. Is this-- a test?"

result = re.split(r'([,.:;?_!()\']|--|\s)', text) # split on , . : ; ? _ ! ( ) ' as well as on -- and whitespace
print(result)

In [None]:
# drop the items that are empty or contain only whitespace
result = [item for item in result if item.strip()]
result

In [None]:
# So in two lines we have built a tokenizer

text = "Hello, World. Is this-- a test?"

result = re.split(r'([,.:;?_!()\']|--|\s)', text)
result = [item for item in result if item.strip()]
print(result)

####Now that we have a basic tokenizer working, let's apply it to Edith Wharton's text

In [74]:
preprocessed = re.split(r'([,.:;?_!()\']|--|\s)', raw_text)  # split on punctuation, '--', and whitespace
preprocessed = [item for item in preprocessed if item.strip()]  # drop empty and whitespace-only items

print(preprocessed[:30])  # print first 30 tokens to illustrate

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [75]:

print(len(preprocessed))

4606


###Step 2: Creating Token IDs

In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:


In [None]:
all_words = sorted(set(preprocessed))

vocab_size = len(all_words)
print(vocab_size)

After determining that the vocabulary size is 1,130 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes:

In [None]:
# mapping of words/tokens to token IDs
vocab = {token: integer for integer, token in enumerate(all_words)}
#dictionary comprehension

In [None]:
#1. enumerate(all_words)
# This converts your list of unique words into pairs:
# [(0, 'apple'), (1, 'boy'), (2, 'cat'), ...]

#2. Dictionary Comprehension {token: integer for ...}
# This takes each (integer, token) pair and creates a dictionary entry:
    # 'word' : id
    # {
    #     'apple': 0,
    #     'boy': 1,
    #     'cat': 2,
    #      ...
    # }


# 3. Equivalent loop for the dictionary comprehension
vocab = {}
for integer, token in enumerate(all_words):
  vocab[token] = integer


In [None]:
for i, item in enumerate(vocab.items()):
  print(item)
  if i>=50:
    break

As we can see from the output above, the dictionary contains individual tokens associated with unique integer labels.

Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
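As a minimal sketch of this idea, the forward and inverse mappings can be built on a toy vocabulary. The names toy_tokens, toy_vocab, and toy_inverse are hypothetical and are not used elsewhere in the notebook.

```python
# Toy vocabulary built the same way as vocab above: sorted unique tokens -> integer IDs.
toy_tokens = sorted({"the", "cat", "sat", ","})
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

# Inverse vocabulary: swap keys and values to map IDs back to tokens.
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

ids = [toy_vocab[t] for t in ["the", "cat", "sat"]]
print(ids)                              # [3, 1, 2]
print([toy_inverse[i] for i in ids])    # ['the', 'cat', 'sat']
```

The inversion is only safe because every token has a unique ID, so no two keys collapse onto the same value.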



Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.

In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Replace spaces before the specified punctuation

Create the tokenizer class:

In [None]:
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab    # vocab already maps str -> int, so we store it directly
    self.int_to_str = {i: s for s,i in vocab.items()}  # inverse mapping: int -> str

  def encode(self, text):
    preprocessed = re.split(r'([,.:;?_!()\']|--|\s)', text)  # split the text into tokens

    preprocessed = [
        item.strip() for item in preprocessed if item.strip()  # drop empty and whitespace-only items
    ]

    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])  # map IDs back to tokens and join them with spaces
    # Remove the spaces that appear before the specified punctuation characters
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text


The class takes the vocabulary as input, so the mapping from tokens to IDs is already available.

We only need to build the reverse mapping, from token IDs back to tokens, which the decode method uses.

The encode method converts text to tokens with the same re.split and item.strip() steps we used earlier.
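The punctuation clean-up in decode can be tried in isolation. The sketch below reuses the same re.sub pattern as SimpleTokenizerV1.decode on a hand-written token list; the token list and the second example string are assumptions for illustration.

```python
import re

tokens = ["Hello", ",", "world", ".", "Is", "this", "a", "test", "?"]

# Joining with spaces puts a space in front of every punctuation token.
joined = " ".join(tokens)

# Remove whitespace that directly precedes one of the listed punctuation characters.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # Hello , world . Is this a test ?
print(cleaned)  # Hello, world. Is this a test?

# The same rule also removes the space before an opening parenthesis.
print(re.sub(r'\s+([,.?!"()\'])', r'\1', "a ( b )"))  # a( b)
```

The last line shows a limitation of this simple rule: it cannot tell opening from closing characters, so decoded text is not always identical to the original.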


Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice:

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

# testing the encode method by passing the sample text
# text is from "training set"
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

The code above prints the token IDs. Next, let's see if we can turn these token IDs back into text using the decode method:

In [76]:
# test the decode method
text = tokenizer.decode(ids)
print(text)

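Note: a NameError: name 'ids' is not defined at this point means the previous encode cell was not run in the current session; run it first so that ids exists.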

The encode and decode methods above tokenize and de-tokenize the text correctly when given a snippet from the training set, so we now have a tokenizer that works in both directions on text it has seen.

Let's now apply it to a new text sample that is not contained in the training set:

In [None]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

This raises a KeyError because the word "Hello" does not appear in The Verdict short story.

Hence, it is not contained in the vocabulary.

This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
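One way to spot such out-of-vocabulary tokens before encoding is sketched here. The helper find_oov and the small toy_vocab are hypothetical; the splitting pattern is a variant of the one used in the tokenizer class.

```python
import re

# Hypothetical toy vocabulary that does not contain "Hello".
toy_vocab = {tok: i for i, tok in enumerate(sorted({"do", "you", "like", "tea", "?", ","}))}

def find_oov(text, vocab):
    """Return the tokens of `text` that are missing from `vocab`."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [t for t in tokens if t not in vocab]

missing = find_oov("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```

A check like this only reports the problem; the text still cannot be encoded unless the vocabulary or the tokenizer is changed.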

##ADDING SPECIAL CONTEXT TOKENS

In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

In this section, we will modify this tokenizer to handle unknown words.

In particular, we will modify the vocabulary and turn the tokenizer from the previous section, SimpleTokenizerV1, into a new SimpleTokenizerV2 that supports two new tokens, <|unk|> and <|endoftext|>.

We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary.

Furthermore, we add an <|endoftext|> token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.

Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:

In [None]:
# preprocessed

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)} #map token to token IDs

In [None]:
len(vocab.items())

Earlier the vocabulary size was 1158; after adding the two special tokens, the new vocabulary size is 1160.

As an additional quick check, let's print the last 5 entries of the updated vocabulary:

In [None]:
for token, integer in list(vocab.items())[-5:]:
  print(token, integer)

Now let's create a simple tokenizer class that can handle unknown words:

Step 1: Replace unknown words with <|unk|> tokens

Step 2: Replace spaces before the specified punctuation

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]  # remove whitespaces
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])  # map IDs back to tokens and join them with spaces
        # Remove the spaces that appear before the specified punctuation characters
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)

In [None]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

In [None]:
tokenizer.encode(text)

In [None]:
# de-tokenize
tokenizer.decode(tokenizer.encode(text))

Based on comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words "Hello" and "palace."
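The reason the round trip reveals the missing words can be sketched on a toy vocabulary: every unknown token is mapped to the same <|unk|> ID, so the original word cannot be recovered. The vocabulary and token list below are made up for illustration.

```python
# Hypothetical toy vocabulary that includes the two special tokens.
toy_vocab = {",": 0, "do": 1, "like": 2, "tea": 3, "you": 4, "?": 5,
             "<|unk|>": 6, "<|endoftext|>": 7}
toy_inverse = {i: t for t, i in toy_vocab.items()}
unk_id = toy_vocab["<|unk|>"]

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

# Unknown tokens fall back to the <|unk|> ID instead of raising a KeyError.
ids = [toy_vocab.get(t, unk_id) for t in tokens]
decoded = [toy_inverse[i] for i in ids]

print(ids)      # [6, 0, 1, 4, 2, 3, 5]
print(decoded)  # ['<|unk|>', ',', 'do', 'you', 'like', 'tea', '?']
```

The mapping is lossy by design: two different unknown words would both decode to <|unk|>.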

So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.

Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity.

The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units.
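The padding idea described for [PAD] above can be sketched in a few lines. The helper pad_batch and the choice of 0 as the padding ID are assumptions for illustration; real tokenizers reserve their own ID for this purpose.

```python
PAD_ID = 0  # hypothetical ID reserved for the padding token

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

After padding, all rows have the same length and can be stacked into a single rectangular tensor.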

##BYTE PAIR ENCODING

**BPE Tokenizer**

In [None]:
# the verdict story text data
with open("/content/Data.txt", "r") as data_file:
  raw_text = data_file.read()

print("Total number of characters: ", len(raw_text))
# print(raw_text[0:101])

In [None]:
!pip install tiktoken

In [None]:
import tiktoken
import importlib.metadata  # the metadata submodule must be imported explicitly

# check the version of tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
# # to get the tokenizer corresponding to a specific model in the OpenAI API:
# tokenizer = tiktoken.encoding_for_model("gpt-4o")

In [None]:
text = "Hello do you like tea ? <|endoftext|> In the sunlit terrace of someunknownPlace"

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

The code above prints the token IDs.

We can convert the token IDs back into text using the decode method, similar to our SimpleTokenizer:

In [None]:
string = tokenizer.decode([6])
print(string)

string = tokenizer.decode([8])
print(string)

string = tokenizer.decode([50])
print(string)

# decode the text above
string = tokenizer.decode(integers)
print(string)

Let's take another example to illustrate how BPE deals with unknown words:

In [None]:
integers = tokenizer.encode("htyhty ier")
print(integers)

string = tokenizer.decode(integers)
print(string)

In [None]:
integers = tokenizer.encode("babaYaga")
print(integers)

string = tokenizer.decode(integers)
string
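To give a rough feel for why unknown words such as these still get encoded, here is a toy sketch of a single BPE-style merge step: find the most frequent adjacent pair of symbols and fuse it into one symbol. This is a simplified illustration, not tiktoken's actual implementation; the helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("abababc")            # start from individual characters
pair = most_frequent_pair(symbols)
symbols = merge_pair(symbols, pair)

print(pair)     # ('a', 'b')
print(symbols)  # ['ab', 'ab', 'ab', 'c']
```

Repeating such merges builds a vocabulary of subword units, and any word that is not in the vocabulary as a whole can still be represented by smaller pieces.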

Lecture

##Create Input-Target pairs

We create input-target pairs using a sliding window approach.

First, we tokenize The Verdict short story (the text data) with the BPE tokenizer introduced earlier.

In [None]:
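import tiktoken

In [None]:
with open("/content/verdict_story.txt", 'r') as file:
  raw_text = file.read()

# print(raw_text)
# Re-create the GPT-2 BPE tokenizer so that a leftover SimpleTokenizer instance is not used by mistake
# (an earlier run of this cell ended with KeyError: '<|unk|>', which a tiktoken encoder would not raise)
tokenizer = tiktoken.get_encoding("gpt2")

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
print(max(enc_text))
print(enc_text)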

The first print statement above outputs 5147, the total number of tokens in the training set after applying the BPE encoder.

Next, we remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage.

We could also keep all of the tokens.

In [None]:
enc_sample = enc_text[50:]
len(enc_sample)

One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1.

The context size determines how many tokens are included in the input:

In [None]:
context_size = 4 #length of the input
# The context size of 4 means that the model is trained to look at a sequence of 4 words (or tokens)
# to predict the next word in the sequence.
# The input x is the first 4 tokens [1, 2, 3, 4], and the target y is the next 4 tokens [2, 3, 4, 5]

x = enc_sample[ : context_size]
y = enc_sample[1: context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

By processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [None]:
for i in range(1, context_size+1):
  context = enc_sample[ : i]
  desired = enc_sample[i]

  print(context,  "----->", desired)


Everything left of the arrow (----->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

For illustration purposes, let's repeat the previous code but convert the token IDs into text:

In [None]:
for i in range(1, context_size+1):
  context = enc_sample[ : i]
  desired = enc_sample[i]

  context = tokenizer.decode(context)  # decode
  desired = tokenizer.decode([desired]) #decode

  print(context,  "----->", desired)

We have now created the input-target pairs that the LLM uses for training in upcoming chapters.

There is only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.

Lecture dataset and dataloader

###IMPLEMENTING A DATA LOADER

For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes.

Step1: Tokenize the text

Step2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step3: Return the total number of rows in the dataset

Step4: Return a single row of input and output from the dataset.

In [None]:
def __len__(self):   # return the total number of rows in the dataset
  return len(self.input_ids)

In [None]:
# We have to define a method called __getitem__: given an index (idx),
# it returns that row of the input tensors and the matching row of the target tensors.

def __getitem__(self, idx):
  return self.input_ids[idx], self.target_ids[idx]


# This is needed because a DataLoader calls this method on the dataset
# to fetch the input-target pairs one after another;
# the method returns the input-target pair for the given idx.

# A DataLoader needs the dataset to be either map-style or iterable-style;
# we are using a map-style dataset.

In [None]:
# txt = input text
# tokenizer = tokenizer to tokenize the text into token ids
# max_length = context size
# stride = number of tokens the window slides between chunks (e.g. 1, or the context size for no overlap)


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import tiktoken

In [None]:
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The above GPTDataset class is based on the PyTorch Dataset class.

It defines how the individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on the max_length/context size) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.





The following code will use the GPTDataset to load the inputs in batches via a PyTorch DataLoader:

Step1: Initialize the tokenizer

Step2: Create the dataset instance

Step3: drop_last=True drops the last batch if it is shorter than the specified batch_size, to prevent loss spikes during training

Step4: num_workers sets the number of CPU processes to use for preprocessing
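The effect of drop_last on the number of batches can be worked out with simple arithmetic. The helper num_batches below is a hypothetical illustration of that arithmetic for a map-style dataset, not part of PyTorch.

```python
import math

def num_batches(num_samples, batch_size, drop_last):
    """Number of batches produced for a dataset of `num_samples` rows."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # Otherwise the final batch is kept, even if it is shorter.
    return math.ceil(num_samples / batch_size)

print(num_batches(10, 4, drop_last=True))   # 2 (the last 2 samples are dropped)
print(num_batches(10, 4, drop_last=False))  # 3 (the last batch has only 2 samples)
```

With drop_last=True every batch has exactly batch_size rows, at the cost of skipping up to batch_size - 1 samples per pass over the data.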

In [None]:
# create_dataloader builds the input-target pairs (via GPTDataset)
# and groups them into batches
# batch_size = number of samples per batch; num_workers = number of CPU processes that load data in parallel

def create_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):

  # Initialize the tokenizer
  tokenizer = tiktoken.get_encoding("gpt2")

  # create dataset instance
  dataset = GPTDataset(txt, tokenizer, max_length, stride)

  # create dataloader
  dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

  return dataloader

Let's now test the dataloader with a batch_size of 1 for an LLM with a context size of 4.

This will develop an intuition of how the GPTDataset class and the create_dataloader function work together:


In [None]:
with open("/content/verdict_story.txt", "r") as file:
  raw_text = file.read()

Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function:

In [None]:
dataloader = create_dataloader(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)

first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.

Since max_length is set to 4, each of the two tensors contains 4 token IDs.

A max_length (input size) of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.

To illustrate the meaning of stride=1, let's fetch another batch from this dataset:

In [None]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


The stride setting dictates the number of positions the inputs shift across the batches, emulating a sliding window approach.
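The effect of the stride can be reproduced without PyTorch. The sketch below uses the same loop bounds as GPTDataset on a toy list of token IDs; the function name sliding_windows and the toy IDs are assumptions for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; the target is the input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
# 6 [([0, 1, 2, 3], [1, 2, 3, 4]), ([1, 2, 3, 4], [2, 3, 4, 5])]
print(len(disjoint), disjoint)
# 2 [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

With stride=1 consecutive inputs share three of their four tokens; with stride equal to max_length the inputs do not overlap at all.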

Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes.

From experience with deep learning, we know that small batch sizes require less memory during training but lead to noisier model updates.

Just like in regular deep learning, the batch size is a trade-off and a hyperparameter to experiment with when training LLMs.

(The batch size is the number of samples the model processes before updating its parameters.
  If the batch size is too small, the parameters are updated frequently but the updates are noisy.
  If the batch size is very large, for example the entire dataset, the model must process all of that data before each update, which is slow and memory-intensive; batching is the compromise between these extremes.)

Before we move on to the two final sections of this chapter, which focus on creating embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1:

In [None]:
dataloader = create_dataloader(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


Note: In the code above, we increased the stride to 4. This utilizes the dataset fully (we don't skip a single word) while avoiding any overlap between the input rows, since more overlap could lead to increased overfitting.

Lecture

##Create TOKEN EMBEDDINGS

####Small hands on demo playing with embeddings

In [None]:
!pip install gensim



In [None]:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")  #download the model and return as object ready for use


In [None]:
# word2vec-google-news-300 is a model pretrained by Google; each word is mapped to a 300-dimensional vector

Example of a word as a vector

In [None]:
word_vectors = model
# word_vectors behaves like a dictionary: each word is mapped to a 300-dimensional vector

# Let's look at what the vector embedding of a word looks like
print(word_vectors['computers'])  # Example: accessing the vector for the word 'computers'

In [None]:
print(word_vectors['cat'].shape)  # each word is mapped to a 300-dimensional vector

####King + Woman - Man = ?

In [None]:
#Example of using most_similar
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=10))  # top 10 most similar words


In [None]:
# the same analogy with the plural forms; each result is a (word, similarity score) pair
print(word_vectors.most_similar(positive=['king', 'women'], negative=['men'], topn=10))

####Let's check the similarity between a few pairs of words

In [None]:
# Example of calculate similarity
print(word_vectors.similarity('women', 'men'))
print(word_vectors.similarity('king', 'queen'))
print(word_vectors.similarity('uncle', 'aunt'))
print(word_vectors.similarity('boy', 'girl'))
print(word_vectors.similarity('paper', 'water'))

In [None]:
# finding the similar words
print(word_vectors.most_similar('tower', topn=10))

print(word_vectors.most_similar('rock', topn=10))
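The similarity scores in this demo are cosine similarities between word vectors. The sketch below computes the same quantity by hand on made-up 3-dimensional vectors; the toy values are assumptions chosen so that the analogy works and are not taken from the word2vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (rough meaning of the axes: royalty, male, female).
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# king - man + woman, computed component by component.
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# The toy word whose vector points in the direction closest to the analogy vector.
best = max(toy, key=lambda word: cosine_similarity(analogy, toy[word]))
print(best)  # queen
```

Cosine similarity compares only the direction of two vectors, so it ranges from -1 to 1 regardless of the vectors' lengths.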

Now we are done with the demo.
Next: creating token embeddings.

Let's illustrate how the conversion from token IDs to embedding vectors works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1:

In [None]:
import torch
input_ids = torch.tensor([2, 3, 5, 1])

For the sake of simplicity and illustration, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions).

Using vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility:

In [None]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

The following print statements show the embedding layer and its underlying weight matrix:

In [None]:
print(embedding_layer)
print(embedding_layer.weight)

We can see that the weight matrix of the embedding layer contains small random values. These values are optimized during LLM training as part of the LLM optimization itself, as we will see in the upcoming chapters. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary, and one column for each of the three embedding dimensions.

After instantiating the embedding layer, let's apply it to a single token ID to obtain its embedding vector:

In [None]:
print(embedding_layer(torch.tensor(3)))

If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python uses zero-based indexing, so this is the row corresponding to index 3).

In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID.


Above we have seen how to convert a single token ID into a three-dimensional embedding vector. Let's now apply it to all of input_ids (torch.tensor([2, 3, 5, 1])).


In [None]:
print(embedding_layer(input_ids))

Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix.
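The look-up view of an embedding layer can be sketched in plain Python with nested lists. The weight values below are made up for readability (they are not the values printed by PyTorch above), and embed and one_hot_matmul are hypothetical helper names.

```python
# Made-up weight matrix: 6 rows (vocabulary size) x 3 columns (embedding size).
weight = [
    [0.1, 0.2, 0.3],
    [1.1, 1.2, 1.3],
    [2.1, 2.2, 2.3],
    [3.1, 3.2, 3.3],
    [4.1, 4.2, 4.3],
    [5.1, 5.2, 5.3],
]

def embed(token_ids):
    """Embedding as a look-up: pick the row whose index is the token ID."""
    return [weight[i] for i in token_ids]

def one_hot_matmul(token_id):
    """The same result as a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if row == token_id else 0.0 for row in range(len(weight))]
    return [
        sum(one_hot[row] * weight[row][col] for row in range(len(weight)))
        for col in range(len(weight[0]))
    ]

print(embed([2, 3, 5, 1]))
print(one_hot_matmul(3) == weight[3])  # True
```

Both routes give the same vector, which is why an embedding layer can be seen as a more efficient way of doing one-hot encoding followed by a matrix multiplication.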

Lecture

##POSITIONAL EMBEDDINGS(ENCODING WORD POSITIONS)

Previously, we focused on very small embedding sizes in this chapter for illustration purposes.

We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation.

This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation.

Furthermore, we assume that the token IDs were created by the BPE tokenizer that we used earlier, which has a vocabulary size of 50,257:

Questions to ask GPT:

What is the vector embedding size for GPT-2 or GPT-3?

What is the vocabulary size for GPT-2 pretraining?

In [None]:
import torch

In [None]:
vocab_size = 50257
output_dim = 256

# torch.nn.Embedding takes two arguments: torch.nn.Embedding(num_embeddings, embedding_dim)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

print(token_embedding_layer.weight)

Parameter containing:
tensor([[ 1.0194e+00,  1.2845e+00,  2.3843e-01,  ..., -2.5092e-01,
         -3.2715e-01,  5.1654e-04],
        [-1.8763e-01,  1.1083e+00, -1.4993e+00,  ..., -3.1205e-01,
          1.0904e+00, -1.2905e+00],
        [ 5.4586e-01,  3.6950e-01, -8.0691e-01,  ...,  4.5755e-01,
         -2.7557e-01,  4.2744e-01],
        ...,
        [ 2.9920e-01,  8.6643e-01, -5.8656e-01,  ...,  4.5017e-01,
         -5.2731e-02,  1.3625e+00],
        [ 7.5357e-01,  1.6395e-01,  1.0407e+00,  ..., -2.0124e+00,
          1.8168e+00, -2.2102e-01],
        [-2.1088e+00, -5.2424e-01,  1.5862e+00,  ...,  6.3221e-01,
          2.0748e+00,  8.5035e-01]], requires_grad=True)


Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dim vector.

If we have a batch size of 8 with 4 tokens each, the result will be an 8x4x256 tensor.

Let's instantiate the dataloader (Data sampling with a sliding window) first:

In [None]:
max_length = 4
dataloader = create_dataloader(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)

inputs, targets = next(data_iter)

In [None]:
print("Inputs shape:\n", inputs.shape)   # 2D tensor of shape 8*4
print("Token IDs: \n", inputs)

Inputs shape:
 torch.Size([8, 4])
Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])


As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.

Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:

token embeddings

In [None]:
token_embeddings = token_embedding_layer(inputs)
# print(token_embeddings)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


As we can tell based on the 8x4x256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.

For the GPT model, we use the absolute positional embedding approach:

In [65]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)  #4, 256

In [66]:
pos_embedding = pos_embedding_layer(torch.arange(max_length))
print(pos_embedding.shape)

torch.Size([4, 256])


As shown in the preceding code example, the input to the pos_embedding_layer is usually a placeholder vector torch.arange(context_length), which contains the sequence of numbers 0, 1, ..., up to the maximum input length minus 1 (here max_length - 1, so 0, 1, 2, 3: four positions in total).

As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will add the 4x256-dimensional pos_embedding tensor to each 4x256-dimensional token embedding tensor in each of the 8 samples in the batch:

In [67]:
input_embeddings = token_embeddings + pos_embedding
print(input_embeddings.shape)

torch.Size([8, 4, 256])


The input_embeddings we have created are the embedded input examples that can now be processed by the model.
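The broadcasting addition used to build input_embeddings can be written out with explicit loops to show what happens to each element. The sketch below uses tiny made-up integer values and nested lists instead of tensors; the sizes (2 samples, 4 positions, 3 dimensions) are assumptions chosen to keep the output readable.

```python
batch_size, seq_len, dim = 2, 4, 3

# Made-up token embeddings of shape [batch_size][seq_len][dim].
token_embeddings = [
    [[b * 10 + t for _ in range(dim)] for t in range(seq_len)]
    for b in range(batch_size)
]

# Made-up positional embeddings of shape [seq_len][dim], one vector per position.
pos_embeddings = [[100 * (t + 1) for _ in range(dim)] for t in range(seq_len)]

# The same positional vector is added at position t in every sample of the batch.
input_embeddings = [
    [
        [tok + pos for tok, pos in zip(token_embeddings[b][t], pos_embeddings[t])]
        for t in range(seq_len)
    ]
    for b in range(batch_size)
]

print(input_embeddings[0][0])  # [100, 100, 100]
print(input_embeddings[1][0])  # [110, 110, 110]
```

Both samples receive the same positional offset at position 0, so any remaining difference between them comes from the token embeddings alone.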