# Tokenizer & Preprocessing the data

https://huggingface.co/transformers/preprocessing.html

In [1]:
from transformers import AutoTokenizer 
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

## Base use

The only method of `PreTrainedTokenizer` we need to remember is `__call__`: we call the tokenizer object directly on a sentence.

In [3]:
encoded_input = tokenizer('I am not alone')
encoded_input

{'input_ids': [101, 146, 1821, 1136, 2041, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

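A tokenizer can also decode a list of ids back into text. The decoded string includes the special tokens `[CLS]` and `[SEP]` that the tokenizer added during encoding.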

In [4]:
tokenizer.decode(encoded_input['input_ids'])

'[CLS] I am not alone [SEP]'

Rather than a single sentence, we may send a list of sentences to the tokenizer.

In [6]:
batch_sentences = [
    'I am the first sentence', 
    'As well as the last one', 
    'A book without word'
]

encoded_inputs = tokenizer(batch_sentences)
encoded_inputs

{'input_ids': [[101, 146, 1821, 1103, 1148, 5650, 102], [101, 1249, 1218, 1112, 1103, 1314, 1141, 102], [101, 138, 1520, 1443, 1937, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

The code cell above only gives a rough idea of how to handle a batch of sentences: each sentence is encoded on its own, so the resulting lists have different lengths.

However, if the purpose of sending several sentences to the tokenizer is to build a <b>batch</b> to feed the model, then we want: 

<ul>
    <li>To pad each sentence to the length of the longest sentence in the batch.</li>
    <li>To truncate each sentence to the maximum length the model can accept.</li>
    <li>To return tensors rather than Python lists.</li>
</ul>
    
Thus we use the following code cell. 

In [7]:
batch_sentences = [
    'I am the first sentence', 
    'As well as the last one', 
    'A book without word'
]

batch = tokenizer(
    batch_sentences, 
    padding = True, 
    truncation = True, 
    return_tensors = 'pt'
)

batch

{'input_ids': tensor([[ 101,  146, 1821, 1103, 1148, 5650,  102,    0],
        [ 101, 1249, 1218, 1112, 1103, 1314, 1141,  102],
        [ 101,  138, 1520, 1443, 1937,  102,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0]])}
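To make the padded output above more concrete, here is a minimal pure-Python sketch of what padding to the longest sentence amounts to. It is only an illustration, not the library's implementation: `pad_to_longest` is a hypothetical helper, the ids are copied from the unpadded output earlier, and the pad id `0` is read off the padded output.

```python
# Token ids copied from the unpadded output earlier in this notebook.
unpadded = [
    [101, 146, 1821, 1103, 1148, 5650, 102],
    [101, 1249, 1218, 1112, 1103, 1314, 1141, 102],
    [101, 138, 1520, 1443, 1937, 102],
]

PAD_ID = 0  # assumption: the pad id, as it appears in the padded output above


def pad_to_longest(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one
    and build the matching attention masks (1 = real token, 0 = padding)."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_to_longest(unpadded)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it has zeros exactly where `input_ids` was padded.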

## Preprocessing pairs of sentences

We skip this part, as we will not need it later.

## Everything you always wanted to know about padding and truncation

Here are the three arguments you need to know: `padding`, `truncation`, `max_length`. 

The online tutorial explains in full how to combine these three arguments and which strategies each of them accepts.
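As a rough illustration of how the three arguments interact, the sketch below mimics, in plain Python, what we would expect from a call such as `tokenizer(sentence, padding='max_length', truncation=True, max_length=6)` on a single sentence. `fit_to_max_length` is a hypothetical helper, not part of the library, and it assumes BERT-style special tokens and right-side truncation and padding.

```python
# Special-token ids as they appear in the outputs of this notebook.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def fit_to_max_length(ids, max_length):
    """Hypothetical helper: truncate or right-pad one encoded sentence.

    Assumes `ids` starts with [CLS], ends with [SEP], and max_length >= 2.
    """
    body = ids[1:-1]                  # drop [CLS] and [SEP]
    body = body[: max_length - 2]     # truncation: leave room for the two special tokens
    kept = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(kept)    # padding: fill the remainder with [PAD]
    input_ids = kept + [PAD_ID] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


long_ids = [101, 1249, 1218, 1112, 1103, 1314, 1141, 102]  # 'As well as the last one'
short_ids = [101, 138, 1520, 1443, 1937, 102]              # 'A book without word'

truncated = fit_to_max_length(long_ids, 6)   # too long: tokens are cut from the end
padded = fit_to_max_length(short_ids, 8)     # too short: [PAD] ids are appended
print(truncated)
print(padded)
```

In short, `max_length` sets the target length, `truncation` decides what happens to sentences that are longer, and `padding` decides what happens to sentences that are shorter.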

## Pre-tokenized inputs

In [None]:
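# is_split_into_words=True tells the tokenizer that the list is one sentence
# already split into words, not a batch of four one-word sentences.
encoded_input = tokenizer(['My', 'truck', 'is', 'loaded'], is_split_into_words=True)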