The `AutoTokenizer` automatically tokenizes the raw input and is the first step in the `pipeline`'s workflow.

We say "automatically" because `AutoTokenizer` determines which type of tokenizer to use for the selected model.
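Conceptually, this selection is a lookup: the checkpoint's configuration names a model type, and that type maps to a tokenizer class. The sketch below mimics the idea in plain Python. The `CHECKPOINT_CONFIGS` and `TOKENIZER_REGISTRY` dictionaries and the `resolve_tokenizer_class` helper are hypothetical stand-ins for illustration, not the library's actual internals, which are more involved.

```python
# Toy illustration of "auto" dispatch. All names here are hypothetical;
# the real AutoTokenizer logic in transformers is more involved.

# Stand-in for the configuration stored alongside each checkpoint.
CHECKPOINT_CONFIGS = {
    "bert-base-cased": {"model_type": "bert"},
    "gpt2": {"model_type": "gpt2"},
}

# Stand-in for a mapping from model type to tokenizer class name.
TOKENIZER_REGISTRY = {
    "bert": "BertTokenizerFast",
    "gpt2": "GPT2TokenizerFast",
}


def resolve_tokenizer_class(checkpoint: str) -> str:
    """Return the tokenizer class name registered for a checkpoint's model type."""
    model_type = CHECKPOINT_CONFIGS[checkpoint]["model_type"]
    return TOKENIZER_REGISTRY[model_type]


print(resolve_tokenizer_class("bert-base-cased"))
```

With the real library, `type(tokenizer).__name__` on the object returned by `AutoTokenizer.from_pretrained(...)` shows which class was actually chosen.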

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
rawtext = [
  "Hello, how are you?",
  "My name is Mobin, nice to meet you sir."
]
encoding = tokenizer(rawtext, return_tensors="pt", padding=True, truncation=True)
encoding

{'input_ids': tensor([[  101,  8667,   117,  1293,  1132,  1128,   136,   102,     0,     0,
             0,     0,     0,     0],
        [  101,  1422,  1271,  1110, 12556,  7939,   117,  3505,  1106,  2283,
          1128,  6442,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Here, `truncation=True` truncates an input if it exceeds the model's maximum sequence length (512 tokens for `bert-base-cased`).

Furthermore, `padding=True` appends `[PAD]` tokens (ID 0 in the output above) to the shorter inputs so that all inputs in the batch have the same length. The `attention_mask` then marks which positions hold real tokens worth the model's attention (1) and which hold padding to be ignored (0).
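To make the relationship between padding and the mask concrete, here is a minimal pure-Python sketch that rebuilds `input_ids` and `attention_mask` from the unpadded token IDs shown in the output above. The `pad_batch` helper and its `max_length` parameter are hypothetical names used for illustration. Its truncation is also a simplification: it just slices the list, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
from typing import List, Optional, Tuple

PAD_ID = 0  # ID of the padding token, as seen in the output above


def pad_batch(
    batch: List[List[int]], max_length: Optional[int] = None
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate each sequence to max_length (if given), then pad to the longest."""
    if max_length is not None:
        # Simplified truncation: plain slicing, special tokens are not preserved.
        batch = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return input_ids, attention_mask


# Token IDs copied from the two encoded sentences above, without padding.
batch = [
    [101, 8667, 117, 1293, 1132, 1128, 136, 102],
    [101, 1422, 1271, 1110, 12556, 7939, 117, 3505, 1106, 2283, 1128, 6442, 119, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
```

The first sentence has 8 tokens and the second has 14, so the first row receives 6 padding IDs and 6 zeros in its mask, matching the tensors printed by the tokenizer.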

Now let us see the tokens themselves:



In [13]:
tokens = tokenizer.tokenize(rawtext[1])
tokens

['My',
 'name',
 'is',
 'Mo',
 '##bin',
 ',',
 'nice',
 'to',
 'meet',
 'you',
 'sir',
 '.']

In [15]:
tokenizer.tokenize("سلام = salam = hello")

['س', '##ل', '##ا', '##م', '=', 'sa', '##lam', '=', 'hello']
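The `##` prefix marks a piece that continues the previous piece within the same word. BERT's tokenizer uses WordPiece, which splits each word greedily by taking the longest piece found in its vocabulary at every step. The sketch below is a simplified illustration of that greedy matching. The `wordpiece` function and `toy_vocab` are hypothetical, and the vocabulary was chosen only to reproduce the splits shown above; the real `bert-base-cased` vocabulary is far larger. The character-by-character split of "سلام" suggests that the real vocabulary contains single Persian letters but few longer Persian pieces.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word greedily, always taking the longest matching piece."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce the splits shown above.
toy_vocab = {"Mo", "##bin", "sa", "##lam", "hello", "س", "##ل", "##ا", "##م"}

print(wordpiece("Mobin", toy_vocab))
print(wordpiece("سلام", toy_vocab))
```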