The output of a tokenizer isn’t a simple Python dictionary; what we get is actually a special BatchEncoding object. It’s a subclass of a dictionary, so it can be indexed like one, but it adds extra methods that are mostly used by fast tokenizers.

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [11]:
tokens = encoding.tokens()
tokens

['[CLS]',
 'My',
 'name',
 'is',
 'S',
 '##yl',
 '##va',
 '##in',
 'and',
 'I',
 'work',
 'at',
 'Hu',
 '##gging',
 'Face',
 'in',
 'Brooklyn',
 '.',
 '[SEP]']

In [18]:
word_ids = encoding.word_ids()
word_ids

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

In [27]:
tokens = encoding.tokens()
word_ids = encoding.word_ids()

# Group the tokens by the index of the word they come from.
# Special tokens have no source word, so they are grouped under None.
dict_token_id = {}
for word_id, token in zip(word_ids, tokens):
    if word_id in dict_token_id:
        dict_token_id[word_id] += ", " + token
    else:
        dict_token_id[word_id] = token

dict_token_id

{None: '[CLS], [SEP]',
 0: 'My',
 1: 'name',
 2: 'is',
 3: 'S, ##yl, ##va, ##in',
 4: 'and',
 5: 'I',
 6: 'work',
 7: 'at',
 8: 'Hu, ##gging',
 9: 'Face',
 10: 'in',
 11: 'Brooklyn',
 12: '.'}

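We can see that the tokenizer’s special tokens [CLS] and [SEP] are mapped to None, and then each token is mapped to the word it originates from. This is especially useful to determine if a token is at the start of a word or if two tokens are in the same word. What counts as a word depends on the tokenizer: in the BERT and RoBERTa comparison below, BERT treats "81s" as a single word, while RoBERTa splits it into two.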
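As a minimal sketch of the "start of a word" check described above, the snippet below works on the word IDs printed earlier, copied in as a plain list so it runs without loading a tokenizer. The helper name `word_start_flags` is hypothetical and not part of the transformers API.

```python
# Word IDs copied from the encoding.word_ids() output above.
# None marks the special tokens [CLS] and [SEP].
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def word_start_flags(word_ids):
    """Return one bool per token: True if the token begins a new word.

    Hypothetical helper for illustration only.
    """
    flags = []
    previous = None
    for word_id in word_ids:
        # A token starts a word when it belongs to a word (not None)
        # and that word differs from the previous token's word.
        flags.append(word_id is not None and word_id != previous)
        previous = word_id
    return flags


flags = word_start_flags(word_ids)
print(flags[4:8])  # tokens 'S', '##yl', '##va', '##in'
```

Two tokens belong to the same word exactly when their word IDs are equal and not None, so the same list also answers the second question.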

In [28]:
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer_roberta = AutoTokenizer.from_pretrained("roberta-base")
encode_bert = tokenizer_bert("81s")
encode_roberta = tokenizer_roberta("81s")

In [29]:
encode_bert.tokens()

['[CLS]', '81', '##s', '[SEP]']

In [33]:
encode_bert.word_ids()

[None, 0, 0, None]

In [30]:
encode_roberta.tokens()

['<s>', '81', 's', '</s>']

In [32]:
encode_roberta.word_ids()

[None, 0, 1, None]

In [34]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Sylvain'

In [41]:
start, end = encoding.token_to_chars(13)
example[start:end]

'gging'

In [42]:
encoding.token_to_chars(13)

CharSpan(start=35, end=40)

In [56]:
# Index of the word that contains character 20 of the example
word_index = encoding.char_to_word(20)
word_index

4

This is all powered by the fact that the fast tokenizer keeps track of the span of text each token comes from in a list of offsets.
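To make the role of the offsets concrete, here is a rough sketch of how the lookups above could be rebuilt from them. The `offsets` list is written out by hand for the example sentence so the snippet runs without downloading a model; in a real session it would be expected to come from `tokenizer(example, return_offsets_mapping=True)["offset_mapping"]`. The helpers `word_to_span` and `char_to_word_index` are hypothetical names and only approximate what the BatchEncoding methods do.

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Word IDs copied from the output above; None marks special tokens.
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

# Hand-written (start, end) character offsets, one pair per token.
# Assumption: special tokens are given the empty span (0, 0).
offsets = [
    (0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18),
    (19, 22), (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45),
    (46, 48), (49, 57), (57, 58), (0, 0),
]


def word_to_span(word_index, word_ids, offsets):
    """Return the (start, end) span covering every token of one word."""
    spans = [span for wid, span in zip(word_ids, offsets) if wid == word_index]
    if not spans:
        raise ValueError(f"no token belongs to word {word_index}")
    # Tokens are in text order: take the first start and the last end.
    return spans[0][0], spans[-1][1]


def char_to_word_index(char_index, word_ids, offsets):
    """Return the index of the word whose tokens cover a character, or None."""
    for wid, (start, end) in zip(word_ids, offsets):
        if wid is not None and start <= char_index < end:
            return wid
    return None


start, end = word_to_span(3, word_ids, offsets)
print(example[start:end])  # Sylvain

token_start, token_end = offsets[13]
print(example[token_start:token_end])  # gging

print(char_to_word_index(20, word_ids, offsets))  # 4
```

Characters that fall between tokens, such as the spaces, are covered by no span, so this sketch returns None for them.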