### First read the text

In [1]:
with open("../the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Use re for a simple split of the text

In [2]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'([,.]|\s)', text)
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
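
The capture group in the pattern is what keeps the punctuation in the result. A minimal sketch comparing both variants (the names `dropped` and `kept` are introduced here only for illustration):

```python
import re

text = "Hello, world. This, is a test."

# Without a capture group, re.split discards the separators.
dropped = [t for t in re.split(r'[,.]|\s', text) if t.strip()]

# With a capture group, the matched separators are returned as list items.
kept = [t for t in re.split(r'([,.]|\s)', text) if t.strip()]

print(dropped)
print(kept)
```

Only the second variant keeps punctuation as separate tokens, which is why the pattern is wrapped in parentheses.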


### Use tokenizer to process the whole article

In [3]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


In [4]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


### Build a dictionary for tokens

In [5]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


In [6]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break


('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


### Create a simple tokenizer for encoding and decoding

In [7]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Stores the vocabulary as an instance attribute for access in the encode and decode methods
        self.str_to_int = vocab
        #  Creates an inverse vocabulary that maps token IDs back to the original text tokens
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        # Same split pattern that was used to build the vocabulary (includes ':' and ';')
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        #  Converts token IDs back into text
        text = " ".join([self.int_to_str[i] for i in ids])
        # Removes spaces before the specified punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [8]:
# vocab is str to int dict
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," 
        Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [9]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [10]:

text = "Hello, do you like tea?"
# "Hello" is not in the vocabulary, so the call below would raise a KeyError
# print(tokenizer.encode(text))
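
The failure can be reproduced without the full vocabulary. A minimal sketch, assuming a hypothetical toy vocabulary and a made-up helper `encode_strict` that mirrors the plain dictionary lookup in `SimpleTokenizerV1.encode`:

```python
# Hypothetical toy vocabulary, much smaller than the one built above
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4, ",": 5}

def encode_strict(tokens, vocab):
    # Plain dict indexing: any token missing from vocab raises KeyError
    return [vocab[t] for t in tokens]

try:
    encode_strict(["Hello", ",", "do", "you", "like", "tea", "?"], toy_vocab)
    missing = None
except KeyError as err:
    # err.args[0] holds the key that was not found
    missing = err.args[0]

print("missing token:", missing)
```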

### Include "end of text" and "unk" tokens in the vocabulary

In [11]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
# the new size is 1132 rather than 1130
print(len(vocab))

1132


In [12]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


### SimpleTokenizerV2 replaces unknown words with <|unk|>

In [13]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int           
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)   
        return text

In [14]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [15]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
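
Decoding IDs produced this way cannot recover the unknown words, because every out-of-vocabulary word maps to the same ID. A small self-contained sketch with a hypothetical toy vocabulary; `dict.get` is used here as one possible way to express the fallback and is not the class's own code:

```python
# Hypothetical toy vocabulary with the two special tokens appended at the end
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3, "?": 4, ",": 5,
             "<|endoftext|>": 6, "<|unk|>": 7}
inverse = {i: s for s, i in toy_vocab.items()}
unk_id = toy_vocab["<|unk|>"]

tokens = ["Hello", ",", "do", "you", "like", "tea", "?"]

# Unknown tokens fall back to the <|unk|> ID instead of raising KeyError
ids = [toy_vocab.get(t, unk_id) for t in tokens]
decoded = " ".join(inverse[i] for i in ids)

print(ids)
print(decoded)
```

The decoded string contains `<|unk|>` where "Hello" used to be, so the round trip is lossy for words outside the vocabulary.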


### Use tiktoken to perform BPE

In [16]:
from importlib.metadata import version
import tiktoken
print("tiktoken version: ", version("tiktoken"))

tiktoken version:  0.9.0


In [17]:
tokenizer = tiktoken.get_encoding("gpt2")
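# Note: the two adjacent string literals are concatenated without a space,
# so the text contains "terracesof" (visible in the decoded output).
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)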
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [18]:
# convert the IDs back
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.
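
To give an intuition for how BPE arrives at subword units, here is a toy sketch of the merge loop. It rests on simplifying assumptions (a hypothetical three-word corpus, character-level symbols, two merges) and is not how tiktoken is implemented; the GPT-2 tokenizer works on bytes with a pre-trained merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words; return the most common pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings remain split into characters. An unseen word can always be represented this way, by falling back to smaller pieces.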


### Implement a data loader that fetches input–target pairs

In [19]:
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [20]:
# we remove the first 50 tokens from the dataset for demonstration purposes,
# as it results in a slightly more interesting text passage in the next steps

enc_sample = enc_text[50:]

# x contains the input tokens and y contains the targets
context_size = 4        
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [21]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [22]:
# make IDs into text
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


In [23]:
# implement the Dataset
import torch
from torch.utils.data import DataLoader, Dataset
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)

        for i in range(0, len(token_ids) - max_length, stride):    
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):   
        return len(self.input_ids)
    
    def __getitem__(self, idx):        
        return self.input_ids[idx], self.target_ids[idx]

In [24]:
# implement the Dataloader
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                        stride=128, shuffle=True, drop_last=True,
                        num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,    
        num_workers=num_workers    
    )
    return dataloader

In [25]:
# test our dataloader
# set stride == max_length so that consecutive input windows do not overlap
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
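
The effect of `stride` can be checked on a short list without PyTorch. A minimal sketch, assuming stand-in integer IDs and a hypothetical helper `sliding_windows` that uses the same loop bounds as `GPTDatasetV1`:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

token_ids = list(range(10))  # stand-in IDs 0..9, not real GPT-2 token IDs

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

With `stride=1` consecutive inputs share all but one token; with `stride=4` (equal to `max_length`) each token appears in at most one input window.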


### From IDs to embeddings

In [26]:
vocab_size = 6
output_dim = 3

In [27]:
# instantiate an embedding layer
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [28]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [29]:
# try 4 IDs
input_ids = torch.tensor([2, 3, 5, 1])
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
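
An embedding lookup gives the same result as multiplying a one-hot vector with the weight matrix. A plain-Python sketch using the weight values printed above (the helpers `one_hot` and `matvec` are made up for this example):

```python
# Weight matrix copied from the embedding_layer.weight printout (6 tokens x 3 dims)
weight = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def matvec(vec, matrix):
    """Multiply a row vector by a matrix (vec @ matrix)."""
    return [sum(v * row[c] for v, row in zip(vec, matrix))
            for c in range(len(matrix[0]))]

input_ids = [2, 3, 5, 1]

by_lookup = [weight[i] for i in input_ids]                             # row selection
by_matmul = [matvec(one_hot(i, len(weight)), weight) for i in input_ids]

print(by_lookup == by_matmul)
```

The rows in `by_lookup` are the same four rows shown in the output for IDs 2, 3, 5 and 1; the lookup simply avoids the wasteful multiplication by zeros.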


### Give position to embeddings and use a larger dimension

In [30]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [31]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [32]:
# turn IDs to embeddings
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
# 8 is the batch size, 4 is the number of tokens per sample, 256 is the embedding dimension of each token

torch.Size([8, 4, 256])


In [None]:
# create an embedding layer for absolute positional embeddings
context_length = max_length     # the input size of LLM
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
# the positional embedding layer embeds the position indices 0 .. context_length-1
print(pos_embeddings.shape)

torch.Size([4, 256])


In [35]:
# add position to embedded tokens
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
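
The addition above relies on broadcasting: the [4, 256] positional tensor is added to each of the 8 samples. A plain-Python sketch of the same operation, with small made-up sizes and values chosen so the result is easy to read:

```python
batch_size, context_length, dim = 2, 3, 2   # hypothetical small sizes

# token_embeddings[b][t] is the vector for token t of sample b
token_embeddings = [[[float(b * 10 + t)] * dim for t in range(context_length)]
                    for b in range(batch_size)]

# pos_embeddings[t] is one vector per position, shared by all samples
pos_embeddings = [[100.0 * (t + 1)] * dim for t in range(context_length)]

# The same positional vector is added at position t in every sample
input_embeddings = [
    [[tok + pos for tok, pos in zip(token_embeddings[b][t], pos_embeddings[t])]
     for t in range(context_length)]
    for b in range(batch_size)
]

print(input_embeddings[0])
print(input_embeddings[1])
```

Both samples receive identical positional offsets, so the result keeps the shape of the token embeddings (batch, tokens, dimension).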


### Summary
LLMs require textual data to be converted into numerical vectors, known as
embeddings, since they can’t process raw text. Embeddings transform discrete
data (like words or images) into continuous vector spaces, making them
compatible with neural network operations.

As the first step, raw text is broken into tokens, which can be words or characters.
Then, the tokens are converted into integer representations, termed token IDs.

Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance
the model’s understanding and to handle special cases, such as replacing
unknown words or marking the boundary between unrelated texts.

The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3
can efficiently handle unknown words by breaking them down into subword
units or individual characters.

We use a sliding window approach on tokenized data to generate input–target
pairs for LLM training.

Embedding layers in PyTorch function as a lookup operation, retrieving vectors
corresponding to token IDs. The resulting embedding vectors provide continuous
representations of tokens, which is crucial for training deep learning models
like LLMs.

While token embeddings provide consistent vector representations for each
token, they lack a sense of the token’s position in a sequence. To rectify this,
two main types of positional embeddings exist: absolute and relative. OpenAI’s
GPT models utilize absolute positional embeddings, which are added to the token
embedding vectors and are optimized during the model training.
