## A Deeper Dive into Tokenizers


In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))
print(encoding)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [101, 1422, 1271, 1110, 156, 7777, 2497, 1394, 1105, 146, 1250, 1120, 20164, 10932, 10289, 1107, 6010, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


We choose the `whole_func_string` feature of the dataset to train the tokenizer on.

In [24]:
# Check whether this is a "fast" tokenizer, i.e. one backed by the Rust implementation rather than pure Python
tokenizer.is_fast

True

In [25]:
encoding.tokens()

['[CLS]',
 'My',
 'name',
 'is',
 'S',
 '##yl',
 '##va',
 '##in',
 'and',
 'I',
 'work',
 'at',
 'Hu',
 '##gging',
 'Face',
 'in',
 'Brooklyn',
 '.',
 '[SEP]']

In [26]:
encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

The notion of what a word is is complicated. For instance, does “I’ll” (a contraction of “I will”) count as one word or two? It depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers split only on spaces, so they treat it as one word. Others also split on punctuation, so they treat it as two words.
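To make the difference concrete, here is a toy sketch of two splitting rules written with plain Python. The function names `split_on_whitespace` and `split_on_whitespace_and_punctuation` are hypothetical and are not part of any tokenizer library; real pre-tokenizers apply their own, more elaborate rules.

```python
import re

def split_on_whitespace(text):
    # Rule 1 (illustrative): a word is whatever sits between runs of whitespace.
    return text.split()

def split_on_whitespace_and_punctuation(text):
    # Rule 2 (illustrative): runs of word characters form a piece, and each
    # punctuation character is kept as its own piece.
    return re.findall(r"\w+|[^\w\s]", text)

text = "I'll work"
print(split_on_whitespace(text))                  # ["I'll", 'work']
print(split_on_whitespace_and_punctuation(text))  # ['I', "'", 'll', 'work']
```

In this toy rule the apostrophe is kept as its own piece. Whether such a piece counts as a word is itself a choice the pre-tokenizer makes, which is why the answer varies from one tokenizer to another.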

To see an example, we will load a tokenizer from each of the `bert-base-cased` and `roberta-base` checkpoints and tokenize "81s" with them.

In [29]:
example = "81s"
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
bert = bert_tokenizer(example)
roberta = roberta_tokenizer(example)
print(bert.tokens(), roberta.tokens())
print(bert.word_ids(), roberta.word_ids())

['[CLS]', '81', '##s', '[SEP]'] ['<s>', '81', 's', '</s>']
[None, 0, 0, None] [None, 0, 1, None]


Therefore, the BERT tokenizer considers "81s" a single word, while the RoBERTa tokenizer considers it two words.
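One way to read the word IDs is to group the tokens that share an ID. The sketch below does this with hard-coded copies of the tokens and word IDs printed above, so it runs without downloading any checkpoint. The helper `group_tokens_by_word` is a hypothetical name introduced for illustration, not a method of the tokenizer.

```python
from itertools import groupby

def group_tokens_by_word(tokens, word_ids):
    # Pair each token with its word ID and drop special tokens (word ID None).
    pairs = [(wid, tok) for tok, wid in zip(tokens, word_ids) if wid is not None]
    # Consecutive tokens with the same word ID belong to the same word.
    return [[tok for _, tok in group] for _, group in groupby(pairs, key=lambda p: p[0])]

# Values copied from the outputs shown above.
bert_tokens, bert_word_ids = ["[CLS]", "81", "##s", "[SEP]"], [None, 0, 0, None]
roberta_tokens, roberta_word_ids = ["<s>", "81", "s", "</s>"], [None, 0, 1, None]

print(group_tokens_by_word(bert_tokens, bert_word_ids))        # [['81', '##s']]
print(group_tokens_by_word(roberta_tokens, roberta_word_ids))  # [['81'], ['s']]
```

The BERT output has one group containing both tokens, while the RoBERTa output has two groups of one token each, matching the one-word versus two-word reading.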