# Chapter 2: Working with Text
# 第二章：处理文本

- 如果文本顺序为先英文，后中文，则为原文翻译；先中文，后英文，则为译者后期注释。
- If English comes first and Chinese follows, the Chinese is a translation of the original text; if Chinese comes first and English follows, the passage is an annotation added later by the translator.

Packages that are being used in this notebook:

在这个笔记本中使用的软件包：

In [1]:
from importlib.metadata import version

import tiktoken
import torch

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.1.0
tiktoken version: 0.5.1


## 2.1 Understanding word embeddings
## 2.1 理解词嵌入

- No code in this section
- 本节无代码

## 2.2 Tokenizing text
## 2.2 文本分词

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story
- 加载我们想要使用的原始文本是 Edith Wharton 的短篇小说《The Verdict》，它是公共领域的作品。

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


- The goal is to tokenize and embed this text for an LLM
- 目标是为一个LLM对这个文本进行分词和嵌入处理。
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
- 让我们基于一些简单的示例文本开发一个简单的分词器，然后稍后可以将其应用到上面的文本中。
- The following regular expression will split on whitespaces
- 以下正则表达式将在空格上进行分割。

In [3]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
'''
\s 表示匹配任意空白字符（包括空格、制表符、换行符等）的字符类。
\s represents a character class that matches any whitespace character (including space, tab, newline, etc.).
() 表示捕获分组；当传给 re.split 的模式中包含捕获分组时，匹配到的分隔符会保留在结果列表中。
() is a capturing group; when the pattern passed to re.split contains one, the matched delimiters are kept in the result list.
'''
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
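
- As a small side-by-side sketch of the point about the capturing group (using only the standard `re` module and the same sample sentence), the first split drops the whitespace separators while the second keeps them as separate list items:

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)

# Because nothing was dropped, joining the pieces restores the input exactly.
print("".join(with_group) == text)
```

- Keeping the separators is what later allows punctuation such as commas and periods to survive as tokens of their own.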


- We don't only want to split on whitespaces but also commas and periods, so let's modify the regular expression to do that as well
- 我们不仅希望在空格上进行分割，还希望在逗号和句号上进行分割，所以让我们修改正则表达式以实现这一点。

In [4]:
result = re.split(r'([,.]|\s)', text)
'''
[,.] 表示匹配逗号或句号中的任何一个字符。
[,.] matches a single character that is either a comma or a period.
| 表示逻辑“或”操作符，即匹配其左侧或右侧的任意一个模式。
| is the alternation ("or") operator: it matches either the pattern to its left or the pattern to its right.
'''
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


- As we can see, this creates empty strings; let's remove them
- 正如我们所看到的，这会创建空字符串，让我们将它们移除。

In [5]:
# Strip whitespace from each item and then filter out any empty strings. # 去掉每个项目中的空白字符，然后过滤掉任何空字符串。
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


- This looks pretty good, but let's also handle other types of punctuation, such as question marks, quotation marks, and double dashes
- 这看起来很不错，但让我们也处理其他类型的标点符号，比如问号、引号和双破折号等。

In [6]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.?_!"()\']|--|\s)', text) # 注意单引号前需要转义字符。# Note that the escape character is needed before the single quote.
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


- This is pretty good, and we are now ready to apply this tokenization to the raw text
- 这很好，我们现在已经准备好将这种分词应用到原始文本中了。

In [7]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


- Let's calculate the total number of tokens
- 让我们计算总的分词标记的数量。

In [8]:
print(len(preprocessed))

4649


## 2.3 Converting tokens into token IDs
## 2.3 将分词标记（此后简称为标记）转换为标记ID

- From these tokens, we can now build a vocabulary that consists of all the unique tokens
- 从这些标记中，我们现在可以构建一个词汇表，其中包含所有唯一的标记。

In [9]:
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)

print(vocab_size)

1159


In [10]:
vocab = {token:integer for integer,token in enumerate(all_words)}

- Below are the first 51 entries in this vocabulary (the loop prints indices 0 through 50 before it breaks):
- 下面是词汇表中的前51个条目（循环在中断前会打印索引0到50）：

In [11]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)


- Let's now put it all together into a tokenizer class
- 将所有内容整合到一个分词器类中。

In [12]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks # 去掉指定标点符号前的空格。
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- 我们可以使用分词器将文本编码为整数（即分词过程）。
- These integers can then be embedded (later) as input for the LLM
- 然后，这些整数可以作为LLM的输入进行嵌入（稍后进行）。

In [13]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]


- We can decode the integers back into text
- 我们可以将整数解码回文本。

In [14]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [15]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
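
- Note that the decoded string is not identical to the original: a space appears after the opening quotation mark and after the apostrophe. The sketch below reproduces this with two standalone helper functions, `split_tokens` and `join_tokens` (hypothetical names that copy the regular expressions from `SimpleTokenizerV1`), so it runs without the vocabulary:

```python
import re

def split_tokens(text):
    # Same splitting pattern as SimpleTokenizerV1.encode
    parts = re.split(r'([,.?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

def join_tokens(tokens):
    # Same joining rule as SimpleTokenizerV1.decode:
    # join with spaces, then drop whitespace in front of punctuation
    text = " ".join(tokens)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

original = '"It\'s the last he painted, you know," Mrs. Gisburn said.'
tokens = split_tokens(original)
restored = join_tokens(tokens)

print(tokens)
print(restored)
print(restored == original)
```

- The decode step only removes whitespace that precedes a punctuation mark, so whitespace that follows one (as after the opening quote) stays. The token sequence itself is unaffected: splitting `restored` again yields the same tokens.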

## 2.4 Adding special context tokens
## 2.4 添加特殊的上下文标记

- Some tokenizers use special tokens to help the LLM with additional context
- 一些分词器使用特殊标记来帮助LLM获取额外的上下文信息。
- Some of these special tokens are
- 一些特殊标记包括:
  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[BOS]`（序列开始）标记文本的开头
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
  - `[EOS]`（序列结束）标记文本的结束（通常用于连接多个不相关的文本，例如两篇不同的维基百科文章或两本不同的书籍等）
  - `[PAD]` (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
  - `[PAD]`（填充）用于训练批量大小大于1的LLM（我们可能包含多个具有不同长度的文本；使用填充标记将较短的文本填充到最长长度，以使所有文本的长度相等）
  - `[UNK]` (unknown) represents words that are not included in the vocabulary
  - `[UNK]`（未知）用于表示词汇表（vocab）中未包含的词语
- Note that GPT-2 does not need any of the tokens mentioned above; it uses only an `<|endoftext|>` token to reduce complexity
- 注意，GPT-2 不需要上述提到的任何特殊标记，而只使用一个特殊标记`<|endoftext|>`来减少复杂性。
- The `<|endoftext|>` token is analogous to the `[EOS]` token mentioned above
- 这个特殊标记`<|endoftext|>`类似于上面提到的 `[EOS]` 标记。
- GPT also uses the `<|endoftext|>` token for padding (since we typically use a mask when training on batched inputs, we would not attend to padded tokens anyway, so it does not matter what these tokens are)
- GPT也使用了 `<|endoftext|>` 进行填充（因为通常在批量输入上训练时使用掩码，我们不会关注填充的标记，所以填充的标记是无关紧要的）。
- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer that breaks words down into subword units, which we will discuss in a later section
- GPT-2 不使用 `<UNK>` 标记来表示词汇表外的单词；相反，GPT-2 使用字节对编码（byte-pair encoding BPE）分词器，将单词分解为子词单元，我们将在后面的部分讨论这个。
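
- To make the padding idea concrete, here is a minimal sketch in plain Python. The function name `pad_batch` and the value `PAD_ID = 0` are assumptions for illustration only; they are not part of GPT-2 or of this chapter's code:

```python
PAD_ID = 0  # hypothetical ID reserved for the padding token

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad each sequence to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

batch = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]
padded, mask = pad_batch(batch)

print(padded)
print(mask)
```

- Because the mask already tells the model which positions to ignore, the concrete value used for padding does not matter, which is why reusing `<|endoftext|>` for padding works.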


- Let's see what happens if we tokenize the following text:
- 让我们看看如果我们对以下文本进行分词会发生什么：

In [16]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

- The above produces an error because the word "Hello" is not contained in the vocabulary
- 上面的操作会产生错误，因为单词 "Hello" 不包含在词汇表中。
- To deal with such cases, we can add special tokens like `"<|unk|>"` to the vocabulary to represent unknown words
- 为了处理这种情况，我们可以向词汇表中添加特殊标记，如 `"<|unk|>"`，用来表示未知的单词。
- Since we are already extending the vocabulary, let's add another token called `"<|endoftext|>"`, which is used in GPT-2 training to denote the end of a text (it is also used between concatenated texts, for example, if our training dataset consists of multiple articles, books, etc.)
- 既然我们已经在扩展词汇表，让我们再添加另一个标记叫做 `"<|endoftext|>"`，它在GPT-2训练中用于表示文本的结束（并且还用于连接的文本之间，例如如果我们的训练数据集包含多篇文章、书籍等）。

In [17]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [18]:
len(vocab.items())

1161

In [19]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token
- 我们还需要相应地调整分词器，以便它知道何时以及如何使用新的 `<|unk|>` 标记。

In [20]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int 
                        else "<|unk|>" for item in preprocessed]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks # 去掉指定标点符号前的空格。
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- Let's try to tokenize text with the modified tokenizer:
- 让我们尝试使用修改后的分词器对文本进行分词：

In [21]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [22]:
tokenizer.encode(text)

[1160,
 5,
 362,
 1155,
 642,
 1000,
 10,
 1159,
 57,
 1013,
 981,
 1009,
 738,
 1013,
 1160,
 7]

In [23]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 BytePair encoding
## 2.5 字节对编码

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- GPT-2 使用字节对编码（BytePair Encoding，BPE）作为其分词器。
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- 它允许模型将不在预定义词汇表中的单词分解为较小的子词单元甚至是单个字符，从而使其能够处理词汇表外的单词。
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- 例如，如果 GPT-2 的词汇表中没有单词 "unfamiliarword"，它可能会将其分词为 ["unfam", "iliar", "word"] 或其他一些子词分解，这取决于它训练过的 BPE 合并方式。
- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- 原始的BPE分词器可以在这里找到：[https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)。
- In this chapter, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance
- 在本章中，我们使用了来自OpenAI开源 [tiktoken](https://github.com/openai/tiktoken) 库的BPE分词器，该库使用Rust实现其核心算法以提高计算性能。
- I created a notebook in [./bytepair_encoder](./bytepair_encoder) that compares these two implementations side by side (tiktoken was about 5x faster on the sample text)
- 我创建了一个笔记本在 [./bytepair_encoder](./bytepair_encoder)，将这两种实现进行了并列比较（在样本文本上，tiktoken 大约快了5倍）。
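
- To give a feel for what "trained BPE merges" means, here is a heavily simplified sketch. It works on characters rather than bytes, uses a made-up three-word corpus, and breaks ties with an arbitrary rule, so it is not GPT-2's actual tokenizer; the function names `learn_merges` and `segment` are hypothetical:

```python
from collections import Counter

def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Arbitrary but deterministic tie-break: highest count, then largest pair
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_symbols(symbols, best)] += freq
        words = dict(new_words)
    return merges

def segment(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return list(symbols)

corpus = {"low": 5, "lower": 2, "lowest": 3}  # toy word frequencies
merges = learn_merges(corpus, num_merges=2)

print(merges)
print(segment("lowly", merges))
```

- The word "lowly" never appears in the toy corpus, yet it can still be represented: the learned merges produce the subword "low", and the remaining characters fall back to single symbols.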

In [24]:
# pip install tiktoken

In [25]:
import importlib.metadata  # import the submodule explicitly so importlib.metadata.version is available
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.5.1


In [26]:
tokenizer = tiktoken.get_encoding("gpt2")

In [27]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [28]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


- Experiments with unknown words:
- 未知单词的实验：

In [29]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

[33901, 86, 343, 86, 220, 959]


In [30]:
for i in integers:
    print(f"{i} -> {tokenizer.decode([i])}")

33901 -> Ak
86 -> w
343 -> ir
86 -> w
220 ->  
959 -> ier


In [31]:
strings = tokenizer.decode(integers)
print(strings)

Akwirw ier


## 2.6 Data sampling with a sliding window
## 2.6 使用滑动窗口进行数据采样

In [32]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


- For each text chunk, we want the inputs and targets
- 对于每个文本块，我们想要输入和目标。
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right
- 由于我们希望模型预测下一个单词，因此目标是输入向右移动一个位置。

In [33]:
enc_sample = enc_text[50:]

In [34]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


- One by one, the predictions would look as follows:
- 逐个进行预测的过程如下所示：

In [35]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [36]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


- We will take care of the next-word prediction in a later chapter, after we have covered the attention mechanism
- 我们将在后面的章节中介绍注意力机制后，再来处理下一个单词的预测。
- For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one
- 现在，我们实现一个简单的数据加载器，它会遍历输入数据集，并返回输入以及向右移动一个位置的目标。

- Install and import PyTorch (see Appendix A for installation tips)
- 安装并导入PyTorch（请参阅附录A获取安装提示）

In [37]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.1.0


- Create dataset and dataloader that extract chunks from the input text dataset
- 创建数据集（dataset）和数据加载器（dataloader），从输入文本数据集中提取文本块

In [38]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text # 对整个文本进行分词
        token_ids = tokenizer.encode(txt, allowed_special={'<|endoftext|>'})

        # Use a sliding window to chunk the book into overlapping sequences of max_length # 使用滑动窗口将书籍划分为最大长度的重叠序列
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [39]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True):
    # Initialize the tokenizer # 初始化分词器
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset # 创建数据集
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader # 创建数据加载器
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

    return dataloader

- Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4:
- 让我们使用批量大小为1的数据加载器，测试一个上下文大小为4的LLM。

In [40]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [41]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [42]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


- We can also create batched outputs
- 我们还可以创建批量输出。
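- Note that we increase the stride here so that the input windows don't overlap, since more overlap could lead to increased overfitting (with `max_length=4`, a stride of 5 even skips one token between consecutive input windows; a stride of 4 would make them exactly adjacent)
- 请注意，这里我们增加了步幅，使各个输入窗口之间没有重叠，因为更多的重叠可能会导致过拟合（当 `max_length=4` 时，步幅为5甚至会在相邻输入窗口之间跳过一个标记；步幅为4则使它们恰好首尾相接）。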

In [43]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=5, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 3619,   402,   271, 10899],
        [  257,  7026, 15632,   438],
        [  257,   922,  5891,  1576],
        [  568,   340,   373,   645],
        [ 5975,   284,   502,   284],
        [  326,    11,   287,   262],
        [  286,   465, 13476,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [  402,   271, 10899,  2138],
        [ 7026, 15632,   438,  2016],
        [  922,  5891,  1576,   438],
        [  340,   373,   645,  1049],
        [  284,   502,   284,  3285],
        [   11,   287,   262,  6001],
        [  465, 13476,    11,   339]])
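
- The effect of the stride is easier to see on a toy sequence. The sketch below mirrors the slicing loop of `GPTDatasetV1` with plain Python lists; `sliding_windows` is a hypothetical helper, and the integers 0 to 9 stand in for real token IDs:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same slicing as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # stand-in token IDs 0..9

for stride in (1, 4, 5):
    windows = sliding_windows(toy_ids, max_length=4, stride=stride)
    inputs = [x for x, _ in windows]
    print(f"stride={stride}: {inputs}")
```

- With a stride of 1, consecutive inputs share three of their four tokens; with a stride of 4 they are adjacent without overlap; with a stride of 5, token 4 never appears in any input window.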


## 2.7 Creating token embeddings
## 2.7 创建标记嵌入

- The data is already almost ready for an LLM
- 数据已经几乎准备好用于LLM了。
- But lastly let us embed the tokens in a continuous vector representation using an embedding layer
- 但最后让我们使用嵌入层将标记嵌入到连续的向量表示中。
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training
- 通常，这些嵌入层是LLM本身的一部分，并在模型训练期间进行更新（训练）。

- Suppose we have the following four input examples with input ids 5, 1, 3, and 2 (after tokenization):
- 假设我们有以下四个输入示例，它们的输入ID是5、1、3和2（在分词后）：

In [44]:
input_ids = torch.tensor([5, 1, 3, 2])

- For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:
- 为了简单起见，假设我们只有一个包含6个单词的小词汇表，并且我们想要创建大小为3的嵌入。

In [45]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- This would result in a 6x3 weight matrix:
- 这将生成一个6x3的权重矩阵：

In [46]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


- For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer, which is described in the supplementary code in [./embedding_vs_matmul](./embedding_vs_matmul)
- 对于熟悉一热编码（或称独热编码）的人来说，上面的嵌入层方法本质上只是实现了一热编码后接着跟随全连接层的矩阵乘法的一种更有效的方式，这在 [./embedding_vs_matmul](./embedding_vs_matmul) 的附加代码中有描述。
- Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation
- 因为嵌入层只是一种更有效的实现方式，等同于一热编码和矩阵乘法的方法，它可以被视为一个可以通过反向传播进行优化的神经网络层。
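
- The equivalence can be checked by hand on a tiny example. The sketch below deliberately uses plain Python lists instead of PyTorch tensors so that it has no dependencies; the 4x3 weight values and the helper names `one_hot` and `vec_matmul` are made up for illustration:

```python
# Hypothetical 4x3 weight matrix: 4 vocabulary entries, 3 embedding dimensions
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_matmul(vec, matrix):
    """Multiply a row vector by a matrix (both given as nested lists)."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(n_cols)]

token_id = 2

# Embedding-style lookup: simply pick the row with that index
lookup = weights[token_id]

# One-hot route: multiply the one-hot vector by the same weight matrix
via_matmul = vec_matmul(one_hot(token_id, len(weights)), weights)

print(lookup)
print(via_matmul)
```

- Both routes return the same row, but the lookup skips all the multiplications by zero, which is where the efficiency gain comes from.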

- To convert a token with id 3 into a 3-dimensional vector, we do the following:
- 要将ID为3的标记转换为一个3维向量，我们执行以下操作：

In [47]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- Note that the above is the 4th row in the `embedding_layer` weight matrix
- 请注意，上述操作相当于将 `embedding_layer` 权重矩阵的第4行取出来。
- To embed all four `input_ids` values above, we do
- 要嵌入上面的所有四个 `input_ids` 值，我们执行以下操作：

In [48]:
print(embedding_layer(input_ids))

tensor([[-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010],
        [-0.4015,  0.9666, -1.1481],
        [ 1.2753, -0.2010, -0.1606]], grad_fn=<EmbeddingBackward0>)


## 2.8 Encoding word positions
## 2.8 编码单词位置

- The BytePair encoder has a vocabulary size of 50,257:
- 字节对编码器的词汇表大小为50,257。
- Suppose we want to encode the input tokens into a 256-dimensional vector representation:
- 假设我们想要将输入标记编码成一个256维的向量表示。

In [49]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector
- 如果我们从数据加载器中采样数据，我们将每个批次中的标记嵌入到一个256维的向量中。
- If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:
- 如果我们的批量大小为8，每个样本有4个标记，则结果是一个8x4x256的张量：

In [50]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=5, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [51]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 3619,   402,   271, 10899],
        [  257,  7026, 15632,   438],
        [  257,   922,  5891,  1576],
        [  568,   340,   373,   645],
        [ 5975,   284,   502,   284],
        [  326,    11,   287,   262],
        [  286,   465, 13476,    11]])

Inputs shape:
 torch.Size([8, 4])


In [52]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


- GPT-2 uses absolute position embeddings, so we just create another embedding layer:
- GPT-2 使用绝对位置嵌入，因此我们只需创建另一个嵌入层：

In [53]:
block_size = max_length
pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)

In [54]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


- To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:
- 要创建LLM中使用的输入嵌入，我们只需将标记和位置嵌入相加即可：

In [55]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


- 绝对位置嵌入是 GPT-2 中用于处理位置信息的关键技术，它通过为序列中的每个位置分配一个唯一的嵌入向量，并将这些嵌入向量与词嵌入相结合，从而允许模型理解并利用序列中元素的位置信息。
- Absolute position embedding is a key technique used in GPT-2 to handle positional information. It assigns a unique embedding vector to each position in the sequence, and combines these embedding vectors with word embeddings, allowing the model to understand and utilize the positional information of elements in the sequence.
- 维度对齐：首先，PyTorch 会尝试对齐这两个张量的维度。由于 token_embeddings 是 3 维的，而 pos_embeddings 是 2 维的，pos_embeddings 会在其前面虚拟地添加一个维度。这样，pos_embeddings 临时变为 [1, 4, 256]。
- Dimension alignment: First, PyTorch aligns the dimensions of the two tensors. Since token_embeddings is 3-dimensional while pos_embeddings is 2-dimensional, PyTorch virtually adds a dimension in front of pos_embeddings, so pos_embeddings is temporarily treated as having shape [1, 4, 256].
- 广播：接下来，PyTorch 会将 pos_embeddings 在第一个维度（批次维度）上“广播”（复制）以匹配 token_embeddings 的形状。这样，pos_embeddings 从 [1, 4, 256] 被广播到 [8, 4, 256]。
- Broadcasting: Next, PyTorch broadcasts pos_embeddings along the first dimension (the batch dimension) to match the shape of token_embeddings. This way, pos_embeddings goes from [1, 4, 256] to [8, 4, 256].
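
- The broadcast addition can also be spelled out by hand. The following sketch uses nested Python lists with tiny, made-up sizes (batch of 2, 3 positions, 2 dimensions) rather than PyTorch tensors, purely to show which position vector is added to which token vector:

```python
batch_size, seq_len, dim = 2, 3, 2  # toy sizes, not the chapter's 8 x 4 x 256

# Token embeddings with shape [batch_size, seq_len, dim]; values encode their indices
token_emb = [
    [[float(b * 100 + t * 10 + d) for d in range(dim)] for t in range(seq_len)]
    for b in range(batch_size)
]

# Position embeddings with shape [seq_len, dim]; there is no batch dimension
pos_emb = [[0.5 * (t + 1)] * dim for t in range(seq_len)]

# Broadcasting written out: the same pos_emb[t] is reused for every batch item b
input_emb = [
    [[token_emb[b][t][d] + pos_emb[t][d] for d in range(dim)] for t in range(seq_len)]
    for b in range(batch_size)
]

print(input_emb[0][1])
print(input_emb[1][1])
```

- Position 1 receives the same offset in both batch items, which is exactly what adding a [4, 256] tensor to an [8, 4, 256] tensor does in the cell above.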

# Summary and takeaways
# 摘要和要点

**See the [./dataloader.ipynb](./dataloader.ipynb) code notebook**, which is a concise version of the data loader that we implemented in this chapter and will need for training the GPT model in upcoming chapters.

**请查看 [./dataloader.ipynb](./dataloader.ipynb) 代码笔记本**，这是本章中我们实现的数据加载器的简明版本，我们将在后续章节中使用它来训练GPT模型。

**See [./exercise-solutions.ipynb](./exercise-solutions.ipynb) for the exercise solutions.**

**查看 [./exercise-solutions.ipynb](./exercise-solutions.ipynb) 获取习题的解答。**