# Converting Tokenized Text to Token IDs

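## Steps

1. Build a vocabulary that contains every unique token.
2. Sort the unique tokens alphabetically. The length of the sorted list is the vocabulary size, and each token's position in it serves as its token ID.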
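
The two steps above can be sketched on a tiny token list. The `tokens` list and the name `toy_vocab` are made up for illustration and do not come from the source text:

```python
# Hypothetical, already-tokenized sample (not taken from the-verdict.txt)
tokens = ["the", "cat", "sat", ",", "the", "end", "."]

# Step 1: remove duplicates and sort to get a stable ordering
unique_tokens = sorted(set(tokens))

# Step 2: the vocabulary size is the number of unique tokens;
# each token's position in the sorted list becomes its ID
vocab_size = len(unique_tokens)
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}

print(vocab_size)
print(toy_vocab)
```

Punctuation sorts before letters here because sorting follows Unicode code point order, which is also why the real vocabulary below starts with `!`, `"`, and `'`.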

### Computing the vocabulary size

In [1]:
# Compute the vocabulary size (the number of unique tokens):

import re
file_path = '../../input/the-verdict.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()
    
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
# Strip whitespace from each item and keep only the non-empty ones
preprocessed = [item.strip() for item in preprocessed if item.strip()]

all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)

1159
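
To see what the split pattern does before deduplication, here is a small sketch on a made-up sentence (the `sample` string is an assumption for illustration). Because the pattern is wrapped in a capturing group, `re.split` keeps the delimiters in its result; the empty strings and whitespace items it also produces are dropped by the filter.

```python
import re

sample = "Hello, world. Is this-- a test?"  # hypothetical input

# Same pattern as the cell above: punctuation, double dashes, whitespace
pieces = re.split(r'([,.?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
sample_tokens = [item.strip() for item in pieces if item.strip()]
print(sample_tokens)
```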


In [2]:
# Print the first 52 vocabulary entries (IDs 0 through 51):

vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
('Her', 51)
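
The listing contains entries such as `'Carlo;'` and `'Grindle:'` alongside the standalone `':'` and `';'` tokens, because the split pattern used above does not include `:` or `;`. If one wanted those characters split off as separate tokens, a possible variant adds them to the character class. This is an assumption, not the pattern used in this notebook, and adopting it would change the vocabulary size and the token IDs shown here. The `sample` string and the `split_tokens` helper are hypothetical:

```python
import re

sample = "Monte Carlo; Mrs. Grindle: yes"  # hypothetical input

def split_tokens(pattern, text):
    """Split text with the given pattern and drop empty/whitespace items."""
    return [item.strip() for item in re.split(pattern, text) if item.strip()]

original_pattern = r'([,.?_!"()\']|--|\s)'    # pattern used in this notebook
extended_pattern = r'([,.:;?_!"()\']|--|\s)'  # assumed variant with ":" and ";"

print(split_tokens(original_pattern, sample))
print(split_tokens(extended_pattern, sample))
```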


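## Code - SimpleTokenizerV1

Implement a complete tokenizer class, SimpleTokenizerV1, with two methods:
* encode: converts text into token IDs
* decode: converts token IDs (the numbers an LLM outputs) back into text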

In [3]:
class SimpleTokenizerV1:
    def __init__(self, vocab):  # vocab: dict mapping token string -> integer ID
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        # Use the same split pattern as the vocabulary-building step (including "_")
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove extra spaces before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
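
As a minimal sketch of the two lookup tables the class builds, the snippet below uses a hypothetical three-entry vocabulary (`toy_vocab` is made up for illustration). It also shows that looking up a token missing from the vocabulary raises `KeyError`, which is what `encode` would do for any word not seen when the vocabulary was built.

```python
# Hypothetical vocabulary, not the one built from the-verdict.txt
toy_vocab = {"!": 0, "Hello": 1, "world": 2}

str_to_int = toy_vocab
# Invert the mapping so IDs can be turned back into token strings
int_to_str = {i: s for s, i in toy_vocab.items()}

ids = [str_to_int[t] for t in ["Hello", "world", "!"]]
print(ids)
print([int_to_str[i] for i in ids])

# A token outside the vocabulary cannot be encoded
try:
    str_to_int["unseen"]
except KeyError as err:
    print("not in vocabulary:", err)
```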

### Test

A quick test of SimpleTokenizerV1:

In [4]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn
said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

tokenizer.decode(ids)

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]


'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
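
The decoded string is not identical to the input: it begins with `" It' s` rather than `"It's`. The reason is that `decode` joins all tokens with single spaces and then only removes whitespace that comes before a punctuation mark. Spaces after an opening quote or after an apostrophe are left in place. A short sketch reproducing this on the first few token strings (the `tokens` list is written out by hand for illustration):

```python
import re

# First few tokens of the test sentence, written out by hand
tokens = ['"', 'It', "'", 's', 'the', 'last']

joined = " ".join(tokens)
# Same substitution as in decode(): drop whitespace before punctuation
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(repr(joined))
print(repr(cleaned))
```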