# Tokenizer

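The usual preprocessing steps are: tokenize the text, build a vocabulary mapping, convert the text to numbers using that vocabulary, and finally pad and truncate.
With transformers you no longer need to do any of this by hand. The library wraps all of these steps, so you only need to prepare your data and call the tokenizer.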
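
To make those manual steps concrete, here is a minimal pure-Python sketch of the pipeline the tokenizer replaces. It assumes plain whitespace tokenization and a toy vocabulary; the helper names (`simple_tokenize`, `build_vocab`, `encode`) are hypothetical and this is not how BERT's tokenizer works internally.

```python
# Toy version of the manual pipeline: tokenize -> build vocab -> map to ids -> pad/truncate.
# All names and ids here are illustrative assumptions, not part of transformers.

def simple_tokenize(text):
    """Lowercase and split on whitespace (real tokenizers use subword algorithms)."""
    return text.lower().split()

def build_vocab(corpus):
    """Assign an id to every distinct token; ids 0 and 1 are reserved."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in corpus:
        for token in simple_tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, max_length):
    """Map tokens to ids, truncate to max_length, then right-pad with [PAD]."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in simple_tokenize(text)]
    ids = ids[:max_length]
    ids += [vocab["[PAD]"]] * (max_length - len(ids))
    return ids

corpus = ["I love you", "You love NLP"]
vocab = build_vocab(corpus)
encoded = [encode(text, vocab, max_length=4) for text in corpus]
print(vocab)
print(encoded)
```

Every text comes out with the same length, which is what lets a batch be stacked into a single tensor.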

In [13]:
from transformers import AutoTokenizer
sentence = 'I love you.'
tokenizer = AutoTokenizer.from_pretrained('/root/autodl-tmp/bert-base-uncased')
tokenizer

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


BertTokenizerFast(name_or_path='/root/autodl-tmp/bert-base-uncased', vocab_size=30522, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

### The tokenizer's `tokenize` method splits text into tokens directly

In [30]:
tokens = tokenizer.tokenize(sentence)
tokens

['i', 'love', 'you', '.']
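
Every word in this sentence is in the vocabulary, so each one stays whole. BERT tokenizers use WordPiece, which splits unknown words into subword pieces by greedy longest-match. The sketch below illustrates that idea on a single word; the function `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration, not the library's code or its real vocabulary.

```python
# Greedy longest-match-first subword splitting, in the style of WordPiece.
# Illustrative only: toy vocabulary, no normalization, no punctuation handling.

def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"love", "token", "##izer", "##s"}
print(wordpiece("love", toy_vocab))
print(wordpiece("tokenizers", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```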

In [18]:
# Vocabulary size
len(tokenizer.vocab)

30522

### Converting between tokens and token ids

In [23]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[1045, 2293, 2017, 1012]

In [32]:
tokens = tokenizer.convert_ids_to_tokens(ids)
tokens

['i', 'love', 'you', '.']

In [27]:
ids = tokenizer.encode(sentence, add_special_tokens=True)
ids

[101, 1045, 2293, 2017, 1012, 102]

In [29]:
str_sen = tokenizer.decode(ids, skip_special_tokens=False)
str_sen

'[CLS] i love you. [SEP]'
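
The relationship between these calls can be sketched by hand: `encode` behaves like `tokenize` followed by `convert_tokens_to_ids`, plus the `[CLS]` and `[SEP]` ids at both ends, and `decode` reverses the mapping. The helpers `toy_encode` and `toy_decode` below are hypothetical, the vocabulary is only the handful of ids that appear in the outputs above, and the space cleanup is a crude stand-in for what the library does.

```python
# Toy vocabulary built from the ids shown in the outputs above (a tiny slice of the real one)
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "love": 2293, "you": 2017, ".": 1012}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}
SPECIAL = {"[CLS]", "[SEP]"}

def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to ids and optionally wrap them in [CLS] ... [SEP]."""
    ids = [toy_vocab[tok] for tok in tokens]
    if add_special_tokens:
        ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return ids

def toy_decode(ids, skip_special_tokens=False):
    """Map ids back to tokens and join them into a string."""
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    # Join with spaces, then drop the space before the period
    return " ".join(tokens).replace(" .", ".")

ids = toy_encode(["i", "love", "you", "."])
print(ids)
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```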

### Padding and truncation

In [38]:
# Truncation: max_length counts [CLS] and [SEP], so max_length=2 leaves no room for content tokens
ids = tokenizer.encode(sentence, max_length=2, truncation=True)
ids

[101, 102]

In [39]:
# Padding: right-pad with [PAD] (id 0) up to max_length
ids = tokenizer.encode(sentence, padding="max_length", max_length=15)
ids

[101, 1045, 2293, 2017, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [40]:
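# Build the attention mask by hand: 1 for real tokens, 0 for [PAD]
attention_mask = [1 if idx != tokenizer.pad_token_id else 0 for idx in ids]
# Single-sentence input, so every position belongs to segment 0
token_type_ids = [0] * len(ids)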
ids, attention_mask, token_type_ids

([101, 1045, 2293, 2017, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
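
The truncation and padding results above can be reproduced by hand. This is a minimal sketch, assuming the BERT special-token ids from the tokenizer printout earlier ([CLS]=101, [SEP]=102, [PAD]=0); `pad_or_truncate` is a hypothetical helper, not the library's implementation. It builds the mask from positions rather than by comparing ids against 0.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids taken from the tokenizer printout

def pad_or_truncate(content_ids, max_length):
    """Wrap content in [CLS]/[SEP], truncating content first, then right-pad."""
    room = max(max_length - 2, 0)  # slots left once the two special tokens are counted
    ids = [CLS_ID] + content_ids[:room] + [SEP_ID]
    n_pad = max(max_length - len(ids), 0)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    ids = ids + [PAD_ID] * n_pad
    return ids, attention_mask

content = [1045, 2293, 2017, 1012]  # 'i', 'love', 'you', '.'
print(pad_or_truncate(content, 2))
print(pad_or_truncate(content, 15))
```

With `max_length=2` only the two special tokens survive, which matches the `[101, 102]` output of the truncation cell.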

### Processing batch data

In [45]:
sens = ["According to Tony Estanguet, president of the Paris Olympics Organizing Committee, a special event with selected athletes in attendance will happen on the River Seine on Wednesday to mark the one-year countdown for the Paris Olympics.",
        "I love you.",
        "Two months out and lagging ticket sales enjoyed a boost from 100-day events and the torch relay."]
res = tokenizer(sens)
res

{'input_ids': [[101, 2429, 2000, 4116, 9765, 5654, 23361, 1010, 2343, 1997, 1996, 3000, 3783, 10863, 2837, 1010, 1037, 2569, 2724, 2007, 3479, 7576, 1999, 5270, 2097, 4148, 2006, 1996, 2314, 16470, 2006, 9317, 2000, 2928, 1996, 2028, 1011, 2095, 18144, 2005, 1996, 3000, 3783, 1012, 102], [101, 1045, 2293, 2017, 1012, 102], [101, 2048, 2706, 2041, 1998, 2474, 12588, 7281, 4341, 5632, 1037, 12992, 2013, 2531, 1011, 2154, 2824, 1998, 1996, 12723, 8846, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
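
Note that the three sequences above have different lengths (45, 6 and 23 ids), so they cannot be stacked into one tensor as they are. In transformers, passing `padding=True` to the tokenizer call pads each sequence to the longest one in the batch. The sketch below shows the same idea in plain Python; `pad_batch` is a hypothetical helper and the second sequence is a made-up short example.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 1045, 2293, 2017, 1012, 102],  # "I love you."
    [101, 2293, 102],                    # shorter, illustrative sequence
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```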

### When loading the tokenizer of a large model, pass the `trust_remote_code=True` argument

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm-6b', trust_remote_code=True)