# Tokenizing the Text in your Dataset

A transformer model has no intrinsic knowledge of the words it processes. Instead, it operates only on the numeric token identifiers that a tokenizer produces for those words. In this recipe, we will learn how to transform the text in your dataset into a representation that models can use for downstream tasks.

### How to do it...

In this recipe, we continue from the previous example, which used the RottenTomatoes dataset and sampled a few sentences from it. We will then encode the sampled sentences into tokens and their respective representations.

Import the AutoTokenizer class from the transformers library:

In [1]:
from transformers import AutoTokenizer

Initialize a list of three sentences:

In [2]:
sentences = [
    "The first sentence, which is the longest one in the list.",
    "The second sentence is not that long.",
    "A very short sentence."
]

Initialize a tokenizer of the "bert-base-cased" type. This tokenizer is case-sensitive, which means that the words star and STAR will have different tokenized representations.

In [3]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
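
To make the idea of case sensitivity concrete, here is a minimal sketch in plain Python. It does not use the real tokenizer: the two vocabularies and their IDs are made up for illustration, and real BERT tokenizers additionally split words into WordPiece subwords, which this sketch ignores.

```python
# Toy vocabularies with made-up IDs (hypothetical; not the real BERT vocabularies).
CASED_VOCAB = {"[UNK]": 0, "star": 1, "STAR": 2}
UNCASED_VOCAB = {"[UNK]": 0, "star": 1}


def encode_cased(word: str) -> int:
    """Look the word up exactly as written, so case matters."""
    return CASED_VOCAB.get(word, CASED_VOCAB["[UNK]"])


def encode_uncased(word: str) -> int:
    """Lowercase the word before the lookup, so case is discarded."""
    return UNCASED_VOCAB.get(word.lower(), UNCASED_VOCAB["[UNK]"])


for word in ("star", "STAR"):
    print(word, "cased:", encode_cased(word), "uncased:", encode_uncased(word))
```

With the cased lookup, star and STAR receive different IDs, while the uncased lookup maps both to the same ID. A cased tokenizer is usually the better fit when capitalization carries meaning, for example in names.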

Tokenize all the sentences in the "sentences" list:

In [4]:
tokenized_input = tokenizer(sentences)
print(tokenized_input)

{'input_ids': [[101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102], [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102], [101, 138, 1304, 1603, 5650, 119, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}
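
The output holds three parallel lists per sentence: `input_ids` (the token identifiers), `token_type_ids` (all 0 here, since each input is a single sentence rather than a sentence pair), and `attention_mask` (1 for every real token). The three sentences produce 15, 10, and 7 IDs, so they cannot be stacked into one rectangular batch as they are. The sketch below pads them by hand in plain Python, using the IDs printed above. It assumes a padding ID of 0, which you can confirm with `tokenizer.pad_token_id`; in practice the library can do this for you with `tokenizer(sentences, padding=True)`.

```python
# input_ids copied from the output printed above.
input_ids = [
    [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102],
    [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102],
    [101, 138, 1304, 1603, 5650, 119, 102],
]

PAD_ID = 0  # assumption: the [PAD] token has ID 0 in this vocabulary

max_len = max(len(ids) for ids in input_ids)

# Right-pad every sequence to the length of the longest one.
padded_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in input_ids]

# Mark real tokens with 1 and padding positions with 0.
attention_mask = [
    [1] * len(ids) + [0] * (max_len - len(ids)) for ids in input_ids
]

for ids, mask in zip(padded_ids, attention_mask):
    print(len(ids), "positions,", sum(mask), "real tokens")
```

The attention mask is what lets the model ignore the padding positions, which is why it is returned alongside the IDs.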


We take the input IDs of the first sentence and convert them back into tokens:

In [5]:
tokens = tokenizer.convert_ids_to_tokens(
    tokenized_input["input_ids"][0]
)

print(tokens)

['[CLS]', 'The', 'first', 'sentence', ',', 'which', 'is', 'the', 'longest', 'one', 'in', 'the', 'list', '.', '[SEP]']
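
Note that the tokenizer added two special tokens, `[CLS]` at the start and `[SEP]` at the end, which correspond to the IDs 101 and 102 that open and close every sequence in the earlier output. As a small illustration that the same ID always stands for the same token, the sketch below builds a lookup table from the first sentence only (using the IDs and tokens printed above) and uses it to partially decode the second sentence. IDs that the first sentence never used are shown as `?`.

```python
# IDs and tokens of the first sentence, copied from the outputs above.
first_ids = [101, 1109, 1148, 5650, 117, 1134, 1110, 1103, 6119, 1141, 1107, 1103, 2190, 119, 102]
first_tokens = ['[CLS]', 'The', 'first', 'sentence', ',', 'which', 'is', 'the',
                'longest', 'one', 'in', 'the', 'list', '.', '[SEP]']

# Each ID maps to exactly one token, so a plain dict is enough.
id_to_token = dict(zip(first_ids, first_tokens))

# IDs of the second sentence, copied from the tokenizer output above.
second_ids = [101, 1109, 1248, 5650, 1110, 1136, 1115, 1263, 119, 102]

# Decode what we can; unseen IDs become "?".
partial = [id_to_token.get(token_id, "?") for token_id in second_ids]
print(partial)
```

To decode every ID, pass the list to `tokenizer.convert_ids_to_tokens` as in the cell above, since the tokenizer holds the full vocabulary.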
