Using the-verdict.txt as an example, we walk through tokenization by hand. First, count the total number of characters:

In [1]:
with open('../../datasets/the_verdict/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()
print("Total number of characters:", len(raw_text)) # 20479

Total number of characters: 20479


A simple tokenizer that splits the text on punctuation, double dashes, and whitespace:

In [2]:
import re

def tokenize_scratch(text: str) -> list[str]:
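    # The capturing group makes re.split keep the delimiters as separate items
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Drop empty strings and whitespace-only items
    result = [item for item in result if item.strip()]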
    return result

tokenized_text = tokenize_scratch(raw_text)
print("Total number of tokens (from scratch):", len(tokenized_text))

Total number of tokens (from scratch): 4690
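
To make the splitting rule easier to see, here is a minimal sketch that applies the same regular expression to one short sentence. The sample sentence and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as in tokenize_scratch: the capturing group makes re.split
# return the delimiters (punctuation, "--", whitespace) as separate items.
PATTERN = r'([,.:;?_!"()\']|--|\s)'

sample = "Hello, world. Is this-- a test?"  # illustrative input
pieces = re.split(PATTERN, sample)

# Whitespace and empty strings are dropped; punctuation is kept as its own token.
tokens = [p for p in pieces if p.strip()]
print(tokens)
```

Punctuation marks survive as standalone tokens, while whitespace is discarded, which is why decoding later has to reinsert the spaces.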


To convert tokens into token IDs, we first need to build a vocabulary:

In [3]:
def build_vocab_scratch(tokenized_text: list[str]) -> tuple[dict[str, int], int]:
    all_tokens = sorted(set(tokenized_text))
    vocab_size = len(all_tokens)
    vocab = {token: idx for idx, token in enumerate(all_tokens)}
    return vocab, vocab_size

vocab, vocab_size = build_vocab_scratch(tokenized_text)
print("Total number of unique tokens (vocab size):", vocab_size)

Total number of unique tokens (vocab size): 1130


The tokenizer class below also builds an inverse vocabulary, which maps token IDs back to their tokens:

In [4]:
class SimpleTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
        
    def encode(self, text: str) -> list[int]:
        tokenized_text = tokenize_scratch(text)
        ids = [self.str_to_int[token] for token in tokenized_text]
        return ids
    
    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
    
tokenizer = SimpleTokenizer(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
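
The encode/decode round trip can be traced on a vocabulary small enough to read at a glance. The sketch below is self-contained and uses a toy vocabulary built from a single phrase; `tokenize`, `toy_vocab`, and `inverse` are hypothetical names introduced only for this illustration.

```python
import re

def tokenize(text):
    # Same splitting rule as tokenize_scratch
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p for p in parts if p.strip()]

sample = "the last he painted, you know"
tokens = tokenize(sample)

# IDs follow the sorted order of the unique tokens; "," sorts before letters.
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
inverse = {idx: tok for tok, idx in toy_vocab.items()}

ids = [toy_vocab[tok] for tok in tokens]

# Decoding joins tokens with spaces, then removes the space before punctuation.
decoded = " ".join(inverse[i] for i in ids)
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', decoded)

print(ids)
print(decoded)
print("Hello" in toy_vocab)  # a word outside the vocabulary cannot be encoded
```

A plain dictionary lookup raises `KeyError` for any token that is missing from the vocabulary, which is the limitation the special tokens in the next cell address.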


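Next, add support for two special tokens: <|unk|> for words that are not in the vocabulary, and <|endoftext|> to separate unrelated texts: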

In [5]:
def build_vocab_scratch(tokenized_text: list[str]) -> tuple[dict[str, int], int]:
    all_tokens = sorted(set(tokenized_text))
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])
    vocab_size = len(all_tokens)
    vocab = {token: idx for idx, token in enumerate(all_tokens)}
    return vocab, vocab_size

class SimpleTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.str_to_int = vocab
        self.int_to_str = {idx: token for token, idx in vocab.items()}
        
    def encode(self, text: str) -> list[int]:
        tokenized_text = tokenize_scratch(text)
        # Replace out-of-vocabulary tokens with the <|unk|> placeholder
        tokenized_text = [token if token in self.str_to_int else "<|unk|>" for token in tokenized_text]
        ids = [self.str_to_int[token] for token in tokenized_text]
        return ids
    
    def decode(self, ids: list[int]) -> str:
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
    
vocab, vocab_size = build_vocab_scratch(tokenized_text)
print("Total number of unique tokens (vocab size):", vocab_size)
tokenizer = SimpleTokenizer(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))

Total number of unique tokens (vocab size): 1132
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


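GPT uses byte pair encoding (BPE) for tokenization. Because BPE can fall back to smaller subword units, it can encode words it has never seen (such as someunknownPlace in the example below) without needing an <|unk|> token.

Implementing BPE is relatively involved, so we use the tiktoken library, which implements the algorithm efficiently with a core written in Rust: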

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace." 
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"}) 
print(ids)
print(tokenizer.decode(ids))

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.
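
To give a feel for what BPE does when it learns its merges, here is a toy, character-level sketch of the merge loop. It is not how tiktoken is implemented: the real GPT-2 tokenizer works on bytes, applies a pre-tokenization step, and ships with a fixed merge table of roughly 50,000 entries. The corpus, the function names `most_frequent_pair` and `merge_pair`, and the choice of three merges are all assumptions made for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order)
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters mapped to its frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Frequent character sequences end up as single symbols, while rare words remain split into smaller pieces. That is the same effect seen above, where someunknownPlace is encoded as several subword tokens instead of one unknown token.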


Next we implement a data loader that uses a sliding window to extract input-target pairs from the training data. First, tokenize the full text:

In [7]:
enc_text = tokenizer.encode(raw_text)
print("Total number of tokens (with gpt2 tokenizer):", len(enc_text)) # 5145

Total number of tokens (with gpt2 tokenizer): 5145


The BPE tokenizer's encode method performs tokenization and conversion to token IDs in a single step. The data loader is implemented as follows:

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPT2Dataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
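        # Tokenize the entire text once
        token_ids = tokenizer.encode(txt)
        
        # Slide a window of max_length tokens over the text, advancing by stride;
        # each target is its input shifted one position to the right
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))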
            
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
    
def create_GPT2_dataloader(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPT2Dataset(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )
    return dataloader
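
The effect of max_length and stride is easiest to see on a short list of integers. The sketch below reuses the windowing loop from GPT2Dataset but works on plain Python lists, so it runs without PyTorch or tiktoken; `sliding_windows` and `toy_ids` are hypothetical names for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is the input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: windows do not overlap
non_overlapping = sliding_windows(toy_ids, max_length=4, stride=4)
for x, y in non_overlapping:
    print(x, "->", y)

# stride == 1: consecutive windows share all but one token
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)
print(len(overlapping), "pairs with stride 1")
```

A smaller stride produces more, heavily overlapping training samples from the same text; setting the stride equal to max_length uses each token as an input at most once.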

The last step in preparing input text for LLM training is converting token IDs into embedding vectors:

In [9]:
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight.data)
print(embedding_layer(input_ids))

tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]])
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
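
The output above shows that an embedding layer is a lookup table: each token ID selects the row with that index from the weight matrix. The sketch below checks this with plain Python lists, using the weight values copied from the printed output, and shows that the lookup gives the same result as multiplying a one-hot vector by the matrix. The helpers `one_hot` and `row_times_matrix` are hypothetical and exist only for this illustration.

```python
# Weight matrix copied from the printed embedding_layer.weight above (6 x 3)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# Embedding lookup: pick the row whose index equals the token ID
lookup = [weights[i] for i in input_ids]

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

def row_times_matrix(row, matrix):
    # (1 x n) row vector times (n x m) matrix
    n_cols = len(matrix[0])
    return [sum(row[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

via_one_hot = [row_times_matrix(one_hot(i, len(weights)), weights) for i in input_ids]

print(lookup[0])              # row 2 of the weight matrix
print(via_one_hot == lookup)  # both approaches select the same rows
```

The lookup avoids the multiplications by zero, which is why embedding layers are implemented as indexing and not as a matrix product.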


In [10]:
vocab_size = 50257
hidden_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, hidden_dim)

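This layer converts every token in a batch into a 256-dimensional embedding vector. With a batch size of 8 and 4 tokens per sample, the result is an 8×4×256 three-dimensional tensor: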

In [12]:
max_length = 4
dataloader = create_GPT2_dataloader(
    raw_text, batch_size=8, max_length=max_length, 
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print("Embedding inputs shape:\n", token_embeddings.shape)

Inputs shape:
 torch.Size([8, 4])
Embedding inputs shape:
 torch.Size([8, 4, 256])


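For the absolute positional embeddings used by GPT models, we only need to create another embedding layer with the same embedding dimension as token_embedding_layer, with one row per position in the context: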

In [13]:
pos_embedding_layer = torch.nn.Embedding(max_length, hidden_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print("Position embedding shape:\n", pos_embeddings.shape) # (4, 256)

Position embedding shape:
 torch.Size([4, 256])


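Add the positional embeddings directly to the token embeddings to obtain the input embeddings that are fed into the LLM. The (4, 256) positional tensor is broadcast across the batch dimension: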

In [14]:
input_embeddings = token_embeddings + pos_embeddings
print("Input embeddings shape:\n", input_embeddings.shape) # (8, 4, 256)

Input embeddings shape:
 torch.Size([8, 4, 256])
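
The addition works even though the two tensors have different shapes, because the positional embeddings are broadcast over the batch dimension: every sequence in the batch receives the same position vectors. The sketch below spells this out with nested Python lists and small stand-in sizes (2, 3, 2 in place of 8, 4, 256); the values are arbitrary and chosen only so that the result is easy to check by hand.

```python
batch_size, seq_len, dim = 2, 3, 2  # small stand-ins for 8, 4, 256

# Token embeddings, shape (batch_size, seq_len, dim); values encode their own indices
token_emb = [
    [[float(b * 100 + t * 10 + d) for d in range(dim)] for t in range(seq_len)]
    for b in range(batch_size)
]

# Positional embeddings, shape (seq_len, dim); one vector per position, no batch axis
pos_emb = [[0.5 * (t + 1)] * dim for t in range(seq_len)]

# Broadcasting written out explicitly: pos_emb[t] is reused for every batch index b
input_emb = [
    [[token_emb[b][t][d] + pos_emb[t][d] for d in range(dim)] for t in range(seq_len)]
    for b in range(batch_size)
]

print(len(input_emb), len(input_emb[0]), len(input_emb[0][0]))  # shape is unchanged
print(input_emb[1][2])  # sequence 1, position 2
```

In PyTorch the same rule applies automatically when a (4, 256) tensor is added to an (8, 4, 256) tensor, so no explicit loop or copy is needed.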
