In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace Course for a very long time.",
    "I love and hate this so much."
]
inputs = tokenizer(
    raw_inputs,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

In [2]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2005,  1037,  2200,  2146,  2051,  1012,   102],
        [  101,  1045,  2293,  1998,  5223,  2023,  2061,  2172,  1012,   102,
             0,     0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


- `.from_pretrained()` : Downloads and caches the configuration and the vocabulary associated with a given checkpoint.
- `distilbert-base-uncased-finetuned-sst-2-english` : The checkpoint used by default for sentiment analysis in the pipeline.
- `padding=True` : The two sentences fed into the tokenizer have different lengths, so the shorter one must be padded before both can be stacked into a single tensor. With `padding=True`, this shows up as trailing 0s in `input_ids`.
- `truncation=True` : Ensures that any sentence longer than the maximum length the model can handle is truncated.
- `return_tensors="pt"` : Return PyTorch tensors.
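
To make the effect of `padding=True` and `truncation=True` concrete, here is a minimal pure-Python sketch of the same idea. `pad_and_truncate` is a hypothetical helper written for this note, not a function from `transformers`, and it simplifies truncation to a plain slice (the real tokenizer also takes care of special tokens). The token ids are copied from the output above.

```python
# Illustrative only: `pad_and_truncate` is a hypothetical helper,
# not part of the transformers library.
def pad_and_truncate(sequences, max_length=None, pad_id=0):
    """Return (input_ids, attention_mask) as lists of equal-length int lists."""
    # Truncation: cut every sequence down to max_length, if one is given.
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the length of the longest one.
    target = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # The mask is 1 for real tokens and 0 for padded positions.
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Token ids from the printed output above, without the padding zeros.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2005, 1037, 2200, 2146, 2051, 1012, 102]
second = [101, 1045, 2293, 1998, 5223, 2023, 2061, 2172, 1012, 102]

input_ids, attention_mask = pad_and_truncate([first, second])
print(input_ids[1])
print(attention_mask[1])
```

The second row comes out with eight trailing 0s in both lists, matching the shape of the tensors printed earlier.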

`attention_mask` in the inputs indicates where padding has been applied (1 marks a real token, 0 a padded position) so that the model doesn't pay attention to it.
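
One common way a mask like this is used is to exclude padded positions before the softmax that produces attention weights. The sketch below illustrates that general idea in plain Python; `masked_softmax` is a hypothetical helper, the scores are made-up numbers, and the actual DistilBERT implementation may differ in its details.

```python
import math


# Illustrative only: `masked_softmax` is a hypothetical helper and the
# scores below are invented values, not model output.
def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight wherever mask == 0.

    Assumes at least one position has mask == 1.
    """
    # Replace scores at padded positions with -inf so exp() maps them to 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]  # made-up attention scores for 4 positions
mask = [1, 1, 0, 0]            # last two positions are padding

weights = masked_softmax(scores, mask)
print(weights)
```

The padded positions receive a weight of exactly zero, even though one of them had the highest raw score, and the remaining weights still sum to 1.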