<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

- Install the additional package requirements for this bonus notebook by uncommenting and running the following cell:

In [1]:
# pip install -r requirements-extra.txt

# Comparing Various Byte Pair Encoding (BPE) Implementations

## Using BPE from `tiktoken`

In [9]:
from importlib.metadata import version

# Print the installed version of the tiktoken library
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.6.0


In [10]:
import tiktoken

# Load the GPT-2 tokenizer
tik_tokenizer = tiktoken.get_encoding("gpt2")

# Text to tokenize
text = "Hello, world. Is this-- a test?"

In [11]:
# Encode the text, allowing the "<|endoftext|>" special token
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

# Print the resulting list of token IDs
print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [12]:
# Decode the list of token IDs back into a string
strings = tik_tokenizer.decode(integers)

# Print the decoded string
print(strings)

Hello, world. Is this-- a test?


In [13]:
# Print the vocabulary size of the GPT-2 tokenizer
print(tik_tokenizer.n_vocab)

50257


## Using the original BPE implementation from GPT-2

In [14]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [15]:
download_vocab()

Fetching encoder.json: 1.04Mit [00:02, 509kit/s]                                                    
Fetching vocab.bpe: 457kit [00:01, 315kit/s]                                                        


In [16]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [17]:
integers = orig_tokenizer.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [18]:
strings = orig_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?
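
Both implementations above return the same token IDs for the sample text. When comparing more than one text or more than two tokenizers, that check could be automated. The following is a minimal sketch under stated assumptions: `compare_encoders` is a hypothetical helper (not part of `tiktoken` or `bpe_openai_gpt2`), and the two stand-in encoders exist only so the example runs without any tokenizer installed. In this notebook one would presumably pass `tik_tokenizer.encode` and `orig_tokenizer.encode` instead.

```python
from typing import Callable, Dict, List, Sequence


def compare_encoders(
    encoders: Dict[str, Callable[[str], Sequence[int]]],
    texts: Sequence[str],
) -> List[str]:
    """Return one message per disagreement between encoders.

    The first entry in `encoders` is treated as the reference.
    (Hypothetical helper for illustration only.)
    """
    names = list(encoders)
    if not names:
        return []
    ref_name = names[0]
    mismatches = []
    for text in texts:
        ref_ids = list(encoders[ref_name](text))
        for name in names[1:]:
            ids = list(encoders[name](text))
            if ids != ref_ids:
                mismatches.append(f"{name} != {ref_name} for {text!r}")
    return mismatches


# Stand-in encoders (assumptions, not real tokenizers): they agree on
# ASCII text and disagree on non-ASCII characters.
def byte_encoder(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def codepoint_encoder(text: str) -> List[int]:
    return [ord(ch) for ch in text]


report = compare_encoders(
    {"bytes": byte_encoder, "codepoints": codepoint_encoder},
    ["Hello, world. Is this-- a test?", "naïve"],
)
print(report)
```

An empty list means every encoder matched the reference on every text. Including a few non-ASCII or whitespace-heavy strings in `texts` is likely to be more informative than a single English sentence.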


## Using BPE via Hugging Face transformers

In [1]:
# !pip install transformers
import transformers

transformers.__version__

'4.45.1'

In [None]:
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
hf_tokenizer(strings)["input_ids"]

## A quick performance benchmark

In [19]:
with open('../01_main-chapter-code/the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [20]:
%timeit orig_tokenizer.encode(raw_text)

3.85 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
%timeit tik_tokenizer.encode(raw_text)

1.15 ms ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [18]:
%timeit hf_tokenizer(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


8.46 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
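
The warning above does not affect the timing itself; it only matters if the 5145 token IDs were later passed to a model with a 1024-token context. One possible way to handle that case is to split the IDs into fixed-size windows. This is a rough sketch, not something the tokenizer does for you: `split_into_windows` is a hypothetical helper, and the placeholder IDs merely match the sequence length reported in the warning.

```python
from typing import List, Sequence


def split_into_windows(
    token_ids: Sequence[int], max_length: int = 1024
) -> List[List[int]]:
    """Split token IDs into consecutive, non-overlapping windows.

    Every window has at most `max_length` IDs; only the last one
    may be shorter. (Hypothetical helper for illustration only.)
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        list(token_ids[i:i + max_length])
        for i in range(0, len(token_ids), max_length)
    ]


# Placeholder IDs with the length reported in the warning (5145 tokens)
dummy_ids = list(range(5145))
windows = split_into_windows(dummy_ids, max_length=1024)

print(len(windows), len(windows[-1]))  # prints: 6 25
```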


In [19]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

8.36 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
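
The `%timeit` magic used above is only available in IPython and Jupyter. To run a similar comparison from a plain Python script, the standard-library `timeit` module could be used. The sketch below makes some assumptions: `time_encoder` is a hypothetical helper, the stand-in encoder replaces the real tokenizers so that the example runs on its own, and it reports the sample standard deviation, so its figures will not match the `%timeit` output exactly.

```python
import statistics
import timeit
from typing import Callable, Tuple


def time_encoder(
    encode: Callable[[str], object],
    text: str,
    repeat: int = 7,
    number: int = 100,
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the per-call time in seconds.

    Runs `repeat` rounds of `number` calls each, similar in spirit to
    "7 runs, 100 loops each". (Hypothetical helper for illustration.)
    """
    totals = timeit.repeat(lambda: encode(text), repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in encoder (an assumption, not a real tokenizer)
def byte_encoder(text: str) -> list:
    return list(text.encode("utf-8"))


sample_text = "Hello, world. Is this-- a test?" * 100
mean_s, std_s = time_encoder(byte_encoder, sample_text)
print(f"{mean_s * 1e6:.2f} µs ± {std_s * 1e6:.2f} µs per call")
```

In this notebook, `byte_encoder` and `sample_text` would presumably be replaced by `tik_tokenizer.encode`, `orig_tokenizer.encode`, or a small wrapper around `hf_tokenizer`, together with `raw_text`.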
