#### HF Tokenizers ####

In [1]:
from transformers import AutoTokenizer



##### Encoding #####

In [2]:
# This tokenizer is a subword tokenizer: it splits words until it obtains tokens that can be represented by its vocabulary.
# That's the case here with Transformer, which is split into two tokens: Trans and ##former.

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
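
The splitting above can be pictured as a greedy longest-match-first search over the vocabulary. The following is a minimal sketch of that idea under stated assumptions, not the library's actual implementation: the toy vocabulary `TOY_VOCAB` and the helpers `wordpiece_split` and `toy_tokenize` are hypothetical names introduced for illustration, and a real tokenizer also handles normalization, punctuation, and special tokens.

```python
# Toy vocabulary (an assumption for illustration; the real vocabulary is far larger).
TOY_VOCAB = {"Using", "a", "Trans", "##former", "network", "is", "simple", "[UNK]"}


def wordpiece_split(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Split on whitespace, then split each word into subword pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_split(word))
    return tokens


print(toy_tokenize("Using a Transformer network is simple"))
```

With this toy vocabulary, "Transformer" is not present as a whole word, so the search falls back to the longest prefix it can find ("Trans") and then continues on the remainder with the `##` prefix.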


In [3]:
# The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:

ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model:\n{ids}")

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model:
[7993, 170, 13809, 23763, 2443, 1110, 3014]
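
Conceptually, this conversion is a vocabulary lookup. The sketch below mimics it with a plain dictionary built from the tokens and IDs printed above. `TOY_VOCAB_IDS` and `toy_convert_tokens_to_ids` are hypothetical names, and the `[UNK]` ID of 100 is an assumption about this checkpoint rather than something shown in the output.

```python
# Token -> ID pairs copied from the notebook output above.
TOY_VOCAB_IDS = {
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
    "[UNK]": 100,  # assumed ID for the unknown token
}


def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB_IDS, unk="[UNK]"):
    """Look up each token; fall back to the unknown-token ID if it is missing."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
print(toy_convert_tokens_to_ids(tokens))
```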


##### Decoding #####

Decoding goes the other way: from vocabulary indices, we want to get a string.

This can be done with the decode() method as follows:

In [4]:
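# Note: 11303 and 1200 differ from the IDs printed in the Encoding section (13809, 23763);
# as the output shows, they decode to the lowercase "transformer".
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])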
print(decoded_string)

Using a transformer network is simple


Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
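
The grouping step can be sketched as follows: map each ID back to its token, then glue any piece that starts with `##` onto the previous piece. This is a simplified illustration, not the library's `decode()` implementation, which also deals with special tokens and spacing around punctuation. `TOY_ID_TO_TOKEN` and `toy_decode` are hypothetical names, and the mapping reuses the tokens and IDs printed in the Encoding section.

```python
# ID -> token pairs taken from the Encoding outputs earlier in the notebook.
TOY_ID_TO_TOKEN = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}


def toy_decode(ids, id_to_token=TOY_ID_TO_TOKEN):
    """Convert IDs to tokens and merge '##' continuation pieces into words."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append to the previous word
        else:
            words.append(token)  # start of a new word
    return " ".join(words)


print(toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014]))
```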