## 2.1 Tokenizing Text
* At this stage, the text is split into smaller units (tokens)

![](assets/tokenizeText.png)

In [1]:
import os
import urllib
import re

In [2]:
with open('the-verdict.txt', 'r', encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:999])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it's going to send the value of my picture 'way up; but I don't think of that, Mr. Rickham--the loss to Arrt is all I think of." The word, on Mrs. Thwing's lips, multiplied its _rs_ as though they were reflected in an endless vista of mirrors. And it was not only the Mrs. Thwings who mourned. Had not the exquisite Hermia Croft, at the last Grafton Gallery show, stopped me before Gisburn's "Moon-dancers" to say, with tears in her eyes: "We shall not look upon its li

* The goal is to tokenize this text and then embed it
* Build a simple tokenizer

In [5]:
preprocessed = re.split(r'([,.!?;_:"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print("preprocessed:", preprocessed[:30])

preprocessed: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
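
To see what the regular expression does, here is a minimal sketch on a made-up sentence (the sample string is an assumption for illustration and is not taken from `the-verdict.txt`). Because the pattern is wrapped in a capturing group, `re.split` keeps the separators in the result; the list comprehension then drops empty strings and whitespace-only pieces.

```python
import re

# Hypothetical sample sentence, chosen only to exercise each branch of the pattern
sample = "Hello, world. Is this-- a test?"

# Same pattern as above: split on punctuation, double dashes, or whitespace.
# The capturing group makes re.split return the separators as well.
pieces = re.split(r'([,.!?;_:"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Note that whitespace is discarded entirely, which is why the decoder later has to guess where spaces belong.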


* Let's count the tokens

In [6]:
print(len(preprocessed))

4690


## 2.3 Converting tokens into token IDs
* Build a vocabulary that contains every unique token

In [7]:
all_words = sorted(set(preprocessed))
print("Total number of unique words:", len(all_words))

Total number of unique words: 1130


In [11]:
vocab = {word: i for i, word in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    if i >= 20:
        break
    print(item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
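
The ordering in the listing above follows from `sorted`, which compares strings by code point: punctuation comes first, then uppercase letters, then lowercase letters. A minimal sketch with a hypothetical token list (not taken from the story) shows the same pattern, together with the inverse mapping that decoding needs:

```python
# Hypothetical token list, for illustration only
toy_tokens = ['the', 'cat', ',', 'The', 'cat', '.']

# Deduplicate, sort, and assign consecutive integer IDs
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)
# {',': 0, '.': 1, 'The': 2, 'cat': 3, 'the': 4}

# Inverse mapping from ID back to token string
toy_inverse = {i: tok for tok, i in toy_vocab.items()}
print(toy_inverse[3])
# cat
```

Because the mapping is case-sensitive, `'The'` and `'the'` receive different IDs.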


* Implement a tokenizer class with `encode` and `decode` methods

In [15]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.vocab = vocab
        self.int_to_str = {i: word for word, i in vocab.items()}
        self.unk_token = '<unk>'
        self.pad_token = '<pad>'
        self.unk_index = vocab.get(self.unk_token, -1)
        self.pad_index = vocab.get(self.pad_token, -1)

    def encode(self, text):
        tokens = re.split(r'([,.!?;_:"()\']|--|\s)', text)
        tokens = [token.strip() for token in tokens if token.strip()]
        return [self.vocab.get(token, self.unk_index) for token in tokens]
    
    def decode(self, token_ids):
        text = " ".join([self.int_to_str.get(token_id, self.unk_token) for token_id in token_ids])
        # Remove the whitespace in front of the listed punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
    
    def __len__(self):
        return len(self.vocab)
    def __getitem__(self, index):
        if index < 0 or index >= len(self.vocab):
            raise IndexError("Index out of range")
        return list(self.vocab.keys())[index]

* `encode` converts text into token IDs
* `decode` converts token IDs back into text

In [16]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


* Decode the token IDs back to text

In [18]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
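
The decoded string is not identical to the input (`" It' s` instead of `"It's`). A minimal sketch of the decoding step on its own shows why: tokens are joined with single spaces, and the regular expression only removes whitespace *before* punctuation, never after it. The helper name `join_tokens` is hypothetical.

```python
import re

def join_tokens(tokens):
    # Join with spaces, then delete the whitespace in front of punctuation
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

print(join_tokens(['Hello', ',', 'world', '.']))
# Hello, world.

print(join_tokens(['"', 'It', "'", 's', 'fine', '"']))
# " It' s fine"
```

The space after an opening quote or an apostrophe survives, so this simple tokenizer does not round-trip text exactly.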

## 2.4 Adding special context tokens
![](assets/special_tokens.png "special tokens")

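* Some common special tokens
  * `[BOS]`: beginning of a text
  * `[EOS]`: end of a text, used to separate two unrelated texts
  * `[UNK]`: stands for a word that is not in the vocabulary
  * `[PAD]`: placeholder used to extend shorter token sequences to the length of the longest sequence in a batch
* `<|endoftext|>` plays the same role as `[EOS]`. GPT-2 uses none of the tokens listed above; it relies only on `<|endoftext|>` to reduce complexity
* GPT-2 does not use `<|unk|>`; instead it uses *byte pair encoding (BPE)* to split words into subword units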
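
As an illustration of the padding idea described above, here is a minimal sketch that pads a batch of token-ID lists to a common length. The function name `pad_batch` and the padding ID `0` are assumptions made for this example; they are not part of the tokenizers in this notebook.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical batch of token-ID sequences with different lengths
batch = [[5, 8, 2], [7], [4, 1]]
padded = pad_batch(batch, pad_id=0)
print(padded)
# [[5, 8, 2], [7, 0, 0], [4, 1, 0]]
```

In practice the padded positions are usually masked out so they do not contribute to the loss.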

In [20]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(['<|endoftext|>', '<|unk|>'])
vocab = {word: i for i, word in enumerate(all_tokens)}

In [21]:
len(vocab)

1132

In [26]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: word for word, i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.!?;_:"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int.get(token, self.str_to_int['<|unk|>']) for token in preprocessed]
        return ids
    
    def decode(self, token_ids):
        text = " ".join([self.int_to_str.get(token_id, '<|unk|>') for token_id in token_ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [28]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [29]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [32]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 Byte pair encoding
* GPT-2 uses BPE as its tokenizer
* BPE can break a word that is not in the vocabulary into smaller subword units, down to individual characters if necessary
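
To give a feel for how a BPE vocabulary is built, the following is a simplified, character-level sketch of the merge step: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. It is not GPT-2's byte-level implementation and not how `tiktoken` works internally; the word frequencies and the helper names `most_frequent_pair` and `merge_pair` are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word as a tuple of characters, with a made-up count
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for _ in range(2):  # apply two merge steps
    words = merge_pair(words, most_frequent_pair(words))

print(list(words))
# After two merges, 'newest' is represented as ('n', 'e', 'w', 'est')
```

Frequent character sequences such as `est` become single symbols, while rare words stay split into smaller pieces. That is why a trained BPE tokenizer can encode any string without an unknown-word token.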

In [33]:
import importlib.metadata
import tiktoken

In [34]:
print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [35]:
tokenizer = tiktoken.get_encoding("gpt2")

In [36]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [37]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


* BPE splits unknown words into separate subword units

![](assets/subword.png "subwords")

In [38]:
test_text = "Akwirwier"
tokenIDs = tokenizer.encode(test_text)
print(tokenIDs)

[33901, 86, 343, 86, 959]


In [39]:
for i in tokenIDs:
    print(tokenizer.decode([i]))

Ak
w
ir
w
ier


In [40]:
test_strings = tokenizer.decode(tokenIDs)
print(test_strings)
print(test_strings == test_text)

Akwirwier
True


## 2.6 Data sampling with a sliding window

* Next we implement a simple data loader that iterates over the input dataset and returns inputs and targets

![](assets/text_sliding.png)

In [41]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.7.0+cu126


* Create a dataset and a dataloader that extract chunks from the input text dataset

In [48]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []
        
        # Tokenize the entire text
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to split the text into chunks of max_length tokens (overlapping when stride < max_length)
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunks = token_ids[i:i + max_length]
            target_chunks = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunks))
            self.target_ids.append(torch.tensor(target_chunks))
        
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]
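
The indexing used in the loop above can be checked on a plain list. In this minimal sketch the token IDs are just the integers 0 to 9 (a hypothetical stand-in for real token IDs), with `max_length=4` and `stride=2`:

```python
token_ids = list(range(10))  # stand-in for real token IDs
max_length = 4
stride = 2

pairs = []
for i in range(0, len(token_ids) - max_length, stride):
    inputs = token_ids[i:i + max_length]
    targets = token_ids[i + 1:i + max_length + 1]  # inputs shifted right by one
    pairs.append((inputs, targets))

for inputs, targets in pairs:
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Each target is the input shifted by one position, so position `k` of the target holds the token that follows position `k` of the input. With `stride < max_length` consecutive chunks overlap; with `stride == max_length` they do not.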

In [49]:
def create_dataloader_v1(text, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the GPT-2 BPE tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    # Create the dataset of input/target chunks
    dataset = GPTDatasetV1(text, tokenizer, max_length, stride)

    # Create the dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

* Create a dataloader with batch_size=1 and an LLM context size of 4

In [50]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [51]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, 
                                  stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print("Input IDs:", first_batch[0])
print("Target IDs:", first_batch[1])

Input IDs: tensor([[  40,  367, 2885, 1464]])
Target IDs: tensor([[ 367, 2885, 1464, 1807]])


In [52]:
input_str = tokenizer.decode(first_batch[0][0].tolist())
print("Input String:", input_str)
target_str = tokenizer.decode(first_batch[1][0].tolist())
print("Target String:", target_str)

Input String: I HAD always
Target String:  HAD always thought


## 2.7 Creating token embeddings

In [53]:
input_ids = torch.tensor([2, 3, 5, 1])

In [54]:
vocab_size = 6
out_dim = 3
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, out_dim)

* The weight matrix of embedding_layer has shape 6×3

In [55]:
print("embedding_layer.weights:", embedding_layer.weight)

embedding_layer.weights: Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
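
An embedding layer is a lookup table: applying it to token IDs selects the matching rows of the weight matrix, which gives the same result as multiplying one-hot vectors by that matrix. The sketch below illustrates this in plain Python rather than PyTorch, using the rounded weight values printed above and the IDs `[2, 3, 5, 1]` from `input_ids`. The helpers `one_hot` and `matvec` are hypothetical names for this example.

```python
# Weight values copied (rounded to 4 decimals) from the printed output above
weight = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
ids = [2, 3, 5, 1]

# Embedding lookup: pick one row per token ID
looked_up = [weight[i] for i in ids]

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def matvec(vec, matrix):
    # Row vector times matrix
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

# Same result via one-hot vectors and a matrix product
via_matmul = [matvec(one_hot(i, len(weight)), weight) for i in ids]

print(looked_up[0])             # [1.2753, -0.201, -0.1606]
print(via_matmul == looked_up)  # True
```

The lookup form is preferred in practice because it avoids building and multiplying large, mostly-zero one-hot vectors.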


## 2.8 Encoding word position
* The BPE vocabulary has 50,257 entries
* Each token is encoded as a 256-dimensional vector

In [56]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [57]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length,
                                  stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [58]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


* GPT-2 uses absolute position embeddings, so we create a second embedding layer

In [59]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [60]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [61]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


In [62]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
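
The addition above works because PyTorch broadcasts the `[4, 256]` position tensor across the batch dimension of the `[8, 4, 256]` token tensor, so every sequence in the batch receives the same position vectors. The following plain-Python sketch spells that out with hypothetical tiny shapes (batch of 2, 3 positions, 2 dimensions) and made-up values:

```python
# Token embeddings with shape [2, 3, 2]: batch, position, dimension
token_emb = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]
# Position embeddings with shape [3, 2]: one vector per position
pos_emb = [[100.0, 200.0], [300.0, 400.0], [500.0, 600.0]]

# Add the position vector for slot k to the token vector in slot k,
# reusing the same pos_emb for every sequence in the batch
input_emb = [
    [[t + p for t, p in zip(tok_vec, pos_vec)]
     for tok_vec, pos_vec in zip(seq, pos_emb)]
    for seq in token_emb
]

print(input_emb[0][0])  # [101.0, 202.0]
print(input_emb[1][0])  # [107.0, 208.0]
print(input_emb[1][2])  # [511.0, 612.0]
```

Position 0 in both sequences is shifted by the same vector `[100.0, 200.0]`, which is what lets the model tell identical tokens at different positions apart.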
