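  - Step 1 – Set Up Playground: In bpe_demo.ipynb, add a
  "Toy corpus" cell holding a tiny corpus (e.g. ["low",
  "lower", "newest", "widest"]), then add a Markdown cell
  that states the goal of each function that follows.
  - Step 2 – tokenize_to_chars: Write the first code cell:
  implement and call tokenize_to_chars(corpus), print the
  result, and confirm it is a list of per-character (or
  per-byte) lists.
  - Step 3 – count_adjacent_pairs: In a new cell, implement
  the counting function, run it on the previous step's
  output, print the pair-frequency Counter, and check it
  against a hand count.
  - Step 4 – merge_pair: In its own cell, implement the
  logic that merges best_pair into every sequence. Call it
  with a fixed best_pair, show the tokens before and after,
  and confirm that overlapping matches are handled correctly.
  - Step 5 – argmax & loop glue: Write a cell that implements
  argmax (or uses Python built-ins) and chains the first
  three functions into one loop iteration. Run it by hand
  once or twice and watch the merges record.
  - Step 6 – build_vocab: Write the function that builds the
  final vocabulary and display its contents (printing the
  first few entries is enough).
  - Step 7 – Wrap train_bpe: Finally, combine the functions
  above into a complete train_bpe, run it on the toy corpus
  with a max_merges limit, and print merges and vocab. Only
  after the output matches expectations, consider scaling to
  a real corpus.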

### Flow ↔ Pseudocode Mapping

- `tokenize_to_chars` → "Initial tokenization: characters / bytes"
- `count_adjacent_pairs` → "Count the frequency of every adjacent token pair"
- `argmax(pair_counts)` → "Pick the most frequent pair"
- `merge_pair(tokens, best_pair)` → "Merge the pair → create a new token" and "Update the token sequences in the corpus"
- `merges.append(best_pair)` → "Record the merge step in merges"
- `if not pair_counts: break` and the end of the `for` loop → "Limit reached?"
- `build_vocab(tokens, merges)` → "Output merges + vocab"


### Step 1 · Toy Corpus Setup
We start with a tiny corpus so that every intermediate result is easy to inspect.

In [8]:
toy_corpus = ["low", "lower", "newest", "widest"]

print(toy_corpus)

['low', 'lower', 'newest', 'widest']


### Step 2 · tokenize_to_chars
Convert each text sample into a list of base tokens (characters or bytes).

In [9]:
def tokenize_to_chars(corpus: list[str]) -> list[list[str]]:
    """
    Break each piece of text into a list of single-character tokens.
    """
    tokenized = []
    for text in corpus:
        tokenized.append(list(text))
    return tokenized


toy_tokens = tokenize_to_chars(toy_corpus)
for sample, tokens in zip(toy_corpus, toy_tokens):
    print(sample, "->", tokens)

low -> ['l', 'o', 'w']
lower -> ['l', 'o', 'w', 'e', 'r']
newest -> ['n', 'e', 'w', 'e', 's', 't']
widest -> ['w', 'i', 'd', 'e', 's', 't']


The tokenized corpus now holds one character list per word; these lists feed the pair counter in the next step.
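
The step description mentions bytes as an alternative base unit. As a minimal sketch of that variant, assuming UTF-8 encoding, the hypothetical helper `tokenize_to_bytes` below (not part of the notebook) emits one single-byte token per encoded byte, so a non-ASCII character becomes several tokens.

```python
def tokenize_to_bytes(corpus: list[str]) -> list[list[bytes]]:
    """Hypothetical byte-level variant: one single-byte token per UTF-8 byte."""
    return [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]


byte_tokens = tokenize_to_bytes(["low", "é"])
print(byte_tokens)  # [[b'l', b'o', b'w'], [b'\xc3', b'\xa9']]
```

The rest of the notebook keeps character tokens; the later helpers are annotated for `str` tokens, so a byte-level run would need those annotations adjusted.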

### Step 3 · count_adjacent_pairs
Count how often each adjacent token pair appears across the corpus.

In [16]:
from collections import Counter
from collections.abc import Iterable


def count_adjacent_pairs(token_sequences: Iterable[list[str]]) -> Counter[tuple[str, str]]:
    """
    Iterate through each token sequence and count adjacent token pairs.
    """
    pair_counts: Counter[tuple[str, str]] = Counter()
    for tokens in token_sequences:
        tokens = list(tokens)  # in case input is a tuple
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pair_counts[pair] += 1
    return pair_counts


toy_pair_counts = count_adjacent_pairs(toy_tokens)
for pair, freq in toy_pair_counts.most_common():
    print(pair, ":", freq)

('l', 'o') : 2
('o', 'w') : 2
('w', 'e') : 2
('e', 's') : 2
('s', 't') : 2
('e', 'r') : 1
('n', 'e') : 1
('e', 'w') : 1
('w', 'i') : 1
('i', 'd') : 1
('d', 'e') : 1
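
To check the counts against a hand count, one option is to recount with an independent formulation and assert a few values worked out on paper. The sketch below rebuilds the toy tokens itself so it runs on its own; `check_counts` is a name introduced here for illustration.

```python
from collections import Counter

toy_tokens = [list(word) for word in ["low", "lower", "newest", "widest"]]

# Independent recount: pair each sequence with itself shifted by one position.
check_counts = Counter(
    pair for tokens in toy_tokens for pair in zip(tokens, tokens[1:])
)

# Hand-counted expectations for a few pairs.
assert check_counts[("l", "o")] == 2  # low, lower
assert check_counts[("e", "s")] == 2  # newest, widest
assert check_counts[("e", "r")] == 1  # lower only

# A sequence of length n contributes n - 1 pairs: 2 + 4 + 5 + 5 = 16.
assert sum(check_counts.values()) == sum(len(t) - 1 for t in toy_tokens) == 16
print("pair counts match the hand count")
```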


### Step 4 · merge_pair
Replace every occurrence of the chosen pair with a newly merged token.

In [11]:
def merge_pair(token_sequences: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace occurrences of `pair` with the
    concatenated token pair[0]+pair[1]."""
    merged_sequences: list[list[str]] = []
    merged_token = pair[0] + pair[1]

    for tokens in token_sequences:
        i = 0
        new_tokens: list[str] = []
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                new_tokens.append(merged_token)
                i += 2  # skip both tokens that were merged
            else:
                new_tokens.append(tokens[i])
                i += 1
        merged_sequences.append(new_tokens)

    return merged_sequences


demo_pair = ("l", "o")
merged_once = merge_pair(toy_tokens, demo_pair)
for before, after in zip(toy_tokens, merged_once):
    print(before, "→", after)

['l', 'o', 'w'] → ['lo', 'w']
['l', 'o', 'w', 'e', 'r'] → ['lo', 'w', 'e', 'r']
['n', 'e', 'w', 'e', 's', 't'] → ['n', 'e', 'w', 'e', 's', 't']
['w', 'i', 'd', 'e', 's', 't'] → ['w', 'i', 'd', 'e', 's', 't']
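
The toy corpus never contains overlapping matches, so the overlap case is worth testing separately. The sketch below uses a hypothetical single-sequence helper, `merge_in_sequence`, that follows the same left-to-right scan as `merge_pair`. With a repeated symbol, the scan consumes both tokens of a match before looking again, so `a a a` becomes `aa a` rather than double-counting the middle `a`.

```python
def merge_in_sequence(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Hypothetical helper: left-to-right, non-overlapping merge of one sequence."""
    merged: list[str] = []
    i = 0
    while i < len(tokens):
        # A slice shorter than 2 can never equal the pair, so the last token is safe.
        if tokens[i:i + 2] == list(pair):
            merged.append(pair[0] + pair[1])
            i += 2  # skip both tokens that were merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_in_sequence(["a", "a", "a"], ("a", "a")))       # ['aa', 'a']
print(merge_in_sequence(["a", "a", "a", "a"], ("a", "a")))  # ['aa', 'aa']
```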


### Step 5 · First Merge Loop
Tie together the helper functions to perform a single merge iteration.

In [17]:
from copy import deepcopy

working_tokens = deepcopy(toy_tokens)

pair_counts = count_adjacent_pairs(working_tokens)
best_pair = max(pair_counts, key=pair_counts.get)
print("Best pair:", best_pair, "frequency:", pair_counts[best_pair])

working_tokens = merge_pair(working_tokens, best_pair)
print("After first merge:")
for tokens in working_tokens:
    print(tokens)

Best pair: ('l', 'o') frequency: 2
After first merge:
['lo', 'w']
['lo', 'w', 'e', 'r']
['n', 'e', 'w', 'e', 's', 't']
['w', 'i', 'd', 'e', 's', 't']
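
Five pairs tie at frequency 2 in the toy corpus, and `max` returns the first maximal key in iteration order, which for a `Counter` is insertion order. That is why `('l', 'o')` wins here. The sketch below illustrates this and shows one assumed alternative, a hypothetical `pick_best_pair` that breaks ties lexicographically so the result does not depend on corpus order.

```python
from collections import Counter

# Deliberately insert ("o", "w") before ("l", "o") to expose the tie-breaking.
pair_counts = Counter({("o", "w"): 2, ("l", "o"): 2, ("e", "r"): 1})

# max() keeps the first maximal key it encounters.
first_seen = max(pair_counts, key=pair_counts.get)
print(first_seen)  # ('o', 'w')


def pick_best_pair(counts: Counter[tuple[str, str]]) -> tuple[str, str]:
    """Hypothetical tie-break: highest count first, then smallest pair."""
    return min(counts, key=lambda pair: (-counts[pair], pair))


print(pick_best_pair(pair_counts))  # ('l', 'o')
```

Either rule is workable for this demo; what matters is that the same rule is used every time so that the merges list is reproducible.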


### Step 6 · build_vocab
Collect the base symbols and newly merged symbols into a simple vocab map.

In [20]:
from collections import OrderedDict


def build_vocab(base_sequences: list[list[str]], merges: list[tuple[str, str]]):
    """
    Return an OrderedDict mapping token string -> integer id in the order they are introduced.
    """
    vocab = OrderedDict()

    # Seed with all base tokens (characters) in appearance order.
    for seq in base_sequences:
        for token in seq:
            if token not in vocab:
                vocab[token] = len(vocab)

    # Append newly created tokens following merge order.
    for pair in merges:
        merged_token = pair[0] + pair[1]
        if merged_token not in vocab:
            vocab[merged_token] = len(vocab)

    return vocab


demo_merges = [best_pair]  # reuse the best_pair from Step 5 demo
vocab_preview = build_vocab(toy_tokens, demo_merges)

for token, idx in vocab_preview.items():
    print(f"{idx:>2} : {token}")

 0 : l
 1 : o
 2 : w
 3 : e
 4 : r
 5 : n
 6 : s
 7 : t
 8 : i
 9 : d
10 : lo
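
Once tokens have ids, a token sequence can be turned into integers and back. The sketch below rebuilds the eleven-entry vocab shown above as a plain dict so it runs on its own; `id_to_token`, `ids`, and `decoded` are names introduced here for illustration.

```python
# Same token order as the vocab preview above.
vocab = {
    token: idx
    for idx, token in enumerate(
        ["l", "o", "w", "e", "r", "n", "s", "t", "i", "d", "lo"]
    )
}
id_to_token = {idx: token for token, idx in vocab.items()}

tokens = ["lo", "w", "e", "r"]  # "lower" after the first merge
ids = [vocab[token] for token in tokens]
decoded = "".join(id_to_token[idx] for idx in ids)

print(ids)      # [10, 2, 3, 4]
print(decoded)  # lower
```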


### Step 7 · train_bpe (full loop)
Combine all helper functions into the full training routine.

In [21]:
def train_bpe(corpus: list[str], max_merges: int):
    """
    Train a BPE tokenizer on the given corpus.
    Returns: (merges, vocab, final_tokens)
    """
    base_tokens = tokenize_to_chars(corpus)
    working_tokens = [seq[:] for seq in base_tokens]  # copy each sequence; tokens are immutable strings
    merges: list[tuple[str, str]] = []

    for _ in range(max_merges):
        pair_counts = count_adjacent_pairs(working_tokens)
        if not pair_counts:
            break

        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        working_tokens = merge_pair(working_tokens, best_pair)

    vocab = build_vocab(base_tokens, merges)
    return merges, vocab, working_tokens


# Train BPE on our toy corpus
merges, vocab, final_tokens = train_bpe(toy_corpus, max_merges=5)

print("Merges:")
for i, pair in enumerate(merges, 1):
    print(f"{i:>2}: {pair}")

print("\nFinal token sequences:")
for tokens in final_tokens:
    print(tokens)

print("\nVocab preview (first 12 entries):")
for i, (token, idx) in enumerate(vocab.items()):
    if i >= 12:
        break
    print(f"{idx:>2} : {token}")

Merges:
 1: ('l', 'o')
 2: ('lo', 'w')
 3: ('e', 's')
 4: ('es', 't')
 5: ('low', 'e')

Final token sequences:
['low']
['lowe', 'r']
['n', 'e', 'w', 'est']
['w', 'i', 'd', 'est']

Vocab preview (first 12 entries):
 0 : l
 1 : o
 2 : w
 3 : e
 4 : r
 5 : n
 6 : s
 7 : t
 8 : i
 9 : d
10 : lo
11 : low
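
One way to sanity-check the learned merges is to replay them, in order, on a word that was not in the corpus. The sketch below assumes this simple replay strategy and uses a hypothetical `apply_merges` helper. The merges list is copied from the output above so the block runs on its own.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Hypothetical encoder: replay the learned merges, in order, on one word."""
    tokens = list(word)
    for left, right in merges:
        merged: list[str] = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2  # consume both halves of the pair
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Merges as printed by train_bpe(toy_corpus, max_merges=5).
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t"), ("low", "e")]

print(apply_merges("lowest", merges))  # ['low', 'est']
print(apply_merges("slow", merges))    # ['s', 'low']
```

In `lowest`, the earlier merge `('e', 's')` uses up the `e`, so the later merge `('low', 'e')` never fires; merge order matters.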
