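- Load the text into memory as strings.
- Split the strings into tokens (such as words or characters).
- Build a vocabulary that maps each token to a numeric index.
- Convert the text into a sequence of numeric indices that a model can operate on.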
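
The four steps above can be sketched end to end on a small in-memory string. This is only a minimal illustration under assumptions: the sample text and the helper name `preprocess` are made up for this example, and the index assignment is a simplified frequency ordering with index 0 set aside for unknown tokens.

```python
import collections
import re

def preprocess(text):
    """Hypothetical helper: run the four preprocessing steps on one string."""
    # Step 1: the text is already in memory; split it into lines.
    lines = text.splitlines()
    # Keep only letters, turn everything else into a space, then lowercase.
    lines = [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]
    # Step 2: split each line into word tokens.
    tokens = [line.split() for line in lines]
    # Step 3: build a vocabulary; index 0 is set aside for unknown tokens.
    counter = collections.Counter(tok for line in tokens for tok in line)
    idx_to_token = ['<unk>'] + [tok for tok, _ in counter.most_common()]
    token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}
    # Step 4: convert the text into one flat sequence of indices.
    corpus = [token_to_idx[tok] for line in tokens for tok in line]
    return corpus, idx_to_token

# Made-up sample text, not the real dataset.
sample = "The Time Machine, by H. G. Wells\n\nThe time traveller smiled."
corpus, idx_to_token = preprocess(sample)
print(corpus)
print([idx_to_token[i] for i in corpus])
```

Frequent tokens such as `the` receive small indices, and decoding the index sequence recovers the cleaned, lowercased words.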

In [1]:
import collections
import re
from d2l import torch as d2l

Read the dataset and store it line by line in `lines`.

In [2]:
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

In [3]:
def read_time_machine():
    """Load The Time Machine as a list of cleaned, lowercased text lines."""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Replace every run of non-letter characters with one space, then lowercase.
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()
print(f'Total number of text lines: {len(lines)}')
test = lines[:10]
test

Downloading ..\data\timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
Total number of text lines: 3221


['the time machine by h g wells',
 '',
 '',
 '',
 '',
 'i',
 '',
 '',
 'the time traveller for so it will be convenient to speak of him',
 'was expounding a recondite matter to us his grey eyes shone and']

`tokenize` takes the list of text lines `lines` as input and splits each line into tokens, the basic units of text.
It returns a two-dimensional list of tokens, where each token is a string.

In [4]:
def tokenize(lines, token='word'):  # two tokenization modes: 'word' and 'char'
    if token == 'word':
        # Split each line into words; one list per line gives a 2-D list.
        return [line.split() for line in lines]
    elif token == 'char':
        # Split each line into individual characters.
        return [list(line) for line in lines]
    else:
        raise ValueError('Unknown token type: ' + token)

tokens = tokenize(lines, token='word')
tokens[:11]

[['the', 'time', 'machine', 'by', 'h', 'g', 'wells'],
 [],
 [],
 [],
 [],
 ['i'],
 [],
 [],
 ['the',
  'time',
  'traveller',
  'for',
  'so',
  'it',
  'will',
  'be',
  'convenient',
  'to',
  'speak',
  'of',
  'him'],
 ['was',
  'expounding',
  'a',
  'recondite',
  'matter',
  'to',
  'us',
  'his',
  'grey',
  'eyes',
  'shone',
  'and'],
 ['twinkled',
  'and',
  'his',
  'usually',
  'pale',
  'face',
  'was',
  'flushed',
  'and',
  'animated',
  'the']]
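
To make the two tokenization modes concrete, the following sketch applies both to two short lines. The sample lines are made up for illustration, and a local copy of `tokenize` (here raising `ValueError` on an unknown mode) is included so that the snippet runs on its own.

```python
def tokenize(lines, token='word'):
    """Split each line into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    if token == 'char':
        return [list(line) for line in lines]
    raise ValueError('Unknown token type: ' + token)

sample_lines = ['the time machine', '']  # made-up sample, not the real dataset

word_tokens = tokenize(sample_lines, token='word')
char_tokens = tokenize(sample_lines, token='char')

print(word_tokens)     # one list of words per line; an empty line gives []
print(char_tokens[0])  # characters of the first line, spaces included
```

Character tokens keep the set of distinct tokens very small (after the cleaning step, at most the 26 lowercase letters plus the space), at the cost of much longer sequences than word tokens.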

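Tokens are strings, but a model needs numeric input, so we build a vocabulary<br>
that maps string tokens to numeric indices starting from 0.<br>
First count the tokens; the resulting statistics are called the corpus. Then assign each token a numeric index according to its frequency.<br>
Rarely occurring tokens are dropped to reduce complexity.<br>
Any token that is absent from the corpus or has been dropped is mapped to a special unknown token `<unk>`.<br>
    An optional list holds the reserved tokens, for example the padding token `<pad>`, the beginning-of-sequence token `<bos>`, and the end-of-sequence token `<eos>`.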
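
The indexing rule described above can be illustrated on a toy token list. This is a minimal sketch under assumptions: the function name `build_index`, the sample tokens, and the choice `min_freq=2` are made up for this example.

```python
import collections

def build_index(tokens, min_freq=0, reserved_tokens=None):
    """Hypothetical helper: map tokens to indices by descending frequency."""
    reserved_tokens = reserved_tokens or []
    counter = collections.Counter(tokens)
    # Index 0 is the unknown token, followed by any reserved tokens.
    idx_to_token = ['<unk>'] + list(reserved_tokens)
    # Keep tokens that occur at least min_freq times, most frequent first.
    for token, freq in counter.most_common():
        if freq >= min_freq and token not in idx_to_token:
            idx_to_token.append(token)
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

toy_tokens = ['the', 'time', 'the', 'machine', 'time', 'the']  # made-up data
idx_to_token, token_to_idx = build_index(
    toy_tokens, min_freq=2, reserved_tokens=['<pad>'])

print(idx_to_token)
# A dropped or unseen token falls back to index 0, the unknown token.
print(token_to_idx.get('machine', 0), token_to_idx.get('wells', 0))
```

Here `machine` occurs only once, so it is dropped and looked up as `<unk>`, exactly like the never-seen `wells`.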


In [20]:
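def count_corpus(tokens):
    """Count token frequencies."""
    # tokens is either a 1-D list of tokens or a 2-D list of token lists.
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten a 2-D list of token lists into a 1-D list of tokens.
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)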

class Vocab:
    """Vocabulary that maps tokens to integer indices and back."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None) -> None:
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort the (token, frequency) pairs by frequency, most frequent first.
        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)

        # The unknown token '<unk>' has index 0, followed by the reserved tokens.
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}

        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1  # index of the new token

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token returns one index; a list or tuple returns a list of indices.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):
        # Index of the unknown token.
        return 0
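
A short usage example may help show how such a vocabulary is queried. To keep the snippet runnable on its own, it uses `MiniVocab`, a condensed stand-in written for this example (the name, the sample tokens, and the hard-coded unknown index 0 are assumptions), rather than the full class.

```python
import collections

class MiniVocab:
    """Condensed stand-in for a vocabulary class, for illustration only."""
    def __init__(self, tokens, min_freq=0):
        # Flatten the 2-D token list and count frequencies.
        counter = collections.Counter(t for line in tokens for t in line)
        self.idx_to_token = ['<unk>']  # index 0 is the unknown token
        for token, freq in counter.most_common():
            if freq >= min_freq:
                self.idx_to_token.append(token)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # Accept a single token or a (possibly nested) list of tokens.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, 0)
        return [self[token] for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[i] for i in indices]

tokens = [['the', 'time', 'machine'], [], ['the', 'time', 'traveller']]  # made-up sample
vocab = MiniVocab(tokens)

print(len(vocab))                    # number of distinct tokens plus '<unk>'
print(vocab['the'], vocab['wells'])  # an unseen token maps to index 0
print(vocab[tokens])                 # nested lists are converted recursively
print(vocab.to_tokens([1, 2, 3]))
```

Because `__getitem__` recurses into lists, the whole 2-D token list can be converted in one call, which is convenient when turning every text line into its index sequence.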