Special context tokens
> Handling out-of-vocabulary words and marking the start and end of specific passages

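Special tokens:
* <|unk|>: stands in for any word that is not in the vocabulary
* <|endoftext|>: marks the start and end of a specific passage, for example as a separator between two unrelated texts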
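
To make the two roles concrete before building the real vocabulary, here is a minimal sketch. It assumes a tiny hand-written vocabulary (`toy_vocab`) and a hypothetical helper `to_ids`; neither is part of the notebook.

```python
# Toy vocabulary, invented for illustration only (not the notebook's vocab).
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3, "<|endoftext|>": 4, "<|unk|>": 5}

def to_ids(words, vocab):
    """Map each word to its id, falling back to the <|unk|> id."""
    unk_id = vocab["<|unk|>"]
    return [vocab.get(word, unk_id) for word in words]

# "coffee" is not in the toy vocabulary, so it is replaced by <|unk|> (id 5).
# <|endoftext|> (id 4) separates the two short texts.
words = ["do", "you", "like", "coffee", "<|endoftext|>", "tea"]
ids = to_ids(words, toy_vocab)
print(ids)  # [0, 3, 1, 5, 4, 2]
```
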

In [5]:
# Add two special tokens to the vocabulary:

import re
file_path = '../input/the-verdict.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()
    
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab))


1161


In [6]:
# Quick sanity check: print the last 5 vocabulary entries:

for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)
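
The special tokens get the two highest ids because they are appended after the regular tokens have been sorted. A minimal sketch of that construction follows. It assumes a hypothetical helper `build_vocab` and a made-up token list. It also drops any special token that already occurs in the input so that ids stay contiguous, which is an extra safeguard not present in the cell above.

```python
def build_vocab(tokens, special_tokens=("<|endoftext|>", "<|unk|>")):
    """Sort the unique tokens, then append the special tokens at the end."""
    # Assumption: remove special tokens from the regular set to avoid duplicates.
    regular = sorted(set(tokens) - set(special_tokens))
    all_tokens = regular + list(special_tokens)
    return {token: idx for idx, token in enumerate(all_tokens)}

# Made-up token list for illustration.
sample_vocab = build_vocab(["the", "a", "the", "verdict", ","])
print(sample_vocab)

# The special tokens occupy the last two positions.
print(sample_vocab["<|endoftext|>"] == len(sample_vocab) - 2)  # True
print(sample_vocab["<|unk|>"] == len(sample_vocab) - 1)        # True
```
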


Extending the tokenizer from 2_token2token_id.ipynb with a fallback for words that are not in the vocabulary lets it handle unknown words.

In [9]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Use the same split pattern as the vocabulary-building cell (including "_")
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace every token that is not in the vocabulary with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>"
            for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the extra space in front of punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [11]:
# Test:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

print(tokenizer.decode(tokenizer.encode(text)))

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.
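
In the output above, "Hello" and "palace" both come back as <|unk|>, so the round trip is lossy: the original words cannot be recovered from the ids. To see which words will be lost before encoding, a small check such as the following may help. This is only a sketch: `find_unknown_words` and `toy_vocab` are hypothetical names, and the toy vocabulary stands in for the real one.

```python
import re

def find_unknown_words(text, vocab):
    """Return the tokens in `text` that are missing from `vocab`, in order."""
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [piece.strip() for piece in pieces if piece.strip()]
    return [token for token in tokens if token not in vocab]

# Toy vocabulary for illustration; the notebook's vocab would be passed instead.
toy_tokens = [",", ".", "?", "do", "like", "tea", "the", "you",
              "<|endoftext|>", "<|unk|>"]
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

missing = find_unknown_words("Hello, do you like tea?", toy_vocab)
print(missing)  # ['Hello']
```
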
