# AutoTokenizer

In natural language processing (NLP), tokenization is a fundamental preprocessing step that breaks text into individual units called tokens; the component that performs it is called a tokenizer. Depending on the tokenization strategy, a token can be a single character, a subword, a word, or even a whole sentence.
The main goal of tokenization is to convert raw text into a sequence of tokens that can be further processed by NLP models. These tokens act as the basic building blocks for various NLP tasks, such as text classification, named entity recognition, machine translation, and text generation.
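Because models differ in architecture and training objective, each one expects the tokenization it was trained with. `AutoTokenizer.from_pretrained` loads the tokenizer that matches a given checkpoint name, such as `bert-base-uncased` in the cells below.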
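
To make the difference between tokenization strategies concrete, here is a minimal pure-Python sketch that splits the same sentence into characters and into words, then maps the word tokens to integer ids. The helper names (`char_tokenize`, `word_tokenize`, `build_vocab`) and the toy vocabulary are assumptions made for illustration; they are not part of `transformers`, and pretrained tokenizers such as BERT's use a fixed subword vocabulary instead.

```python
import re


def char_tokenize(text):
    """Split text into single characters (spaces included)."""
    return list(text)


def word_tokenize(text):
    """Split text into lowercase words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


sequence = "In a hole in the ground there lived a hobbit."

chars = char_tokenize(sequence)
words = word_tokenize(sequence)

# Toy vocabulary built from this one sentence (hypothetical, not BERT's).
vocab = build_vocab(words)
ids = [vocab[w] for w in words]

print(words)
print(ids)
print(len(chars), "character tokens vs", len(words), "word tokens")
```

Repeated words such as "in" and "a" receive the same id each time they occur, which is also visible in the `input_ids` produced by the pretrained tokenizer below.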

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "In a hole in the ground there lived a hobbit."
tokens = tokenizer(sequence)
print(tokens)

{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [2]:
print(tokenizer.decode(tokens['input_ids']))

[CLS] in a hole in the ground there lived a hobbit. [SEP]


# Padding, Truncation, and Return Tensors

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "In a hole in the ground there lived a hobbit."
encoded_input = tokenizer(sequence, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
print(type(encoded_input))

{'input_ids': tensor([[  101,  1999,  1037,  4920,  1999,  1996,  2598,  2045,  2973,  1037,
          7570, 10322,  4183,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
<class 'transformers.tokenization_utils_base.BatchEncoding'>


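With `return_tensors="pt"`, the tokenizer returns PyTorch tensors wrapped in a `BatchEncoding`, so the output can be passed directly to a PyTorch model. `padding=True` pads every sequence in a batch to the length of the longest one, and `truncation=True` cuts sequences that exceed the model's maximum input length. With a single short sentence, as here, neither option changes the result.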
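
To show what padding and truncation do when sequences differ in length, here is a minimal pure-Python sketch that operates on plain lists of ids. The function `pad_and_truncate` and its parameters `max_length` and `pad_id` are hypothetical names chosen for illustration, and the sketch is not how `transformers` implements these options. In particular, it truncates by simple slicing, whereas the BERT tokenizer keeps the closing `[SEP]` token when it truncates. The ids are borrowed from the output above, and `pad_id=0` assumes BERT's `[PAD]` id.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each id list to max_length, then right-pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (target - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three id sequences of different lengths (illustrative values).
batch = [
    [101, 1999, 1037, 4920, 102],
    [101, 1996, 2598, 102],
    [101, 1037, 7570, 10322, 4183, 1012, 102],
]

encoded = pad_and_truncate(batch, max_length=6)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

After this step every row has the same length, which is what allows a batch to be stacked into a single rectangular tensor.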