## 📘 tokenizer(): One Call for Everything

Calling the tokenizer object directly, as in `tokenizer(...)`, handles every preprocessing step in one go:

- Tokenizing text
- Converting tokens to IDs
- Padding and truncating
- Creating attention masks
- Preparing tensors to feed into the model

## ✅ Tokenizer Object

The tokenizer object bundles everything the checkpoint needs for preprocessing: how to split text into tokens, convert tokens to IDs, pad, truncate, and build attention masks.

When given one or more sentences, it returns a dictionary-like `BatchEncoding` object with `input_ids` and `attention_mask` keys.

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer)
print(type(tokenizer))

## 🧾 Single Sequence Example

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs)
print(model_inputs.keys())
print("input_ids:", model_inputs['input_ids'])
print("attention_mask:", model_inputs['attention_mask'])

## 📦 Multiple Sequences (Batch Input)

The tokenizer can process multiple sequences at once when they are passed as a list.

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)
print(model_inputs)
print("input_ids:", model_inputs['input_ids'])
print("attention_mask:", model_inputs['attention_mask'])

## 🧊 Padding Options

- `padding="longest"` (or `padding=True`) → pad up to the longest sequence in the batch
- `padding="max_length"` → pad up to the model's maximum length (e.g. 512)
- `padding="max_length", max_length=<int>` → pad up to a custom length

In [None]:
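# Pad every sequence to the length of the longest one in the batch
model_inputs = tokenizer(sequences, padding="longest")
print(model_inputs['input_ids'])

# Pad every sequence to the model's maximum length
model_inputs = tokenizer(sequences, padding="max_length")
print(model_inputs['input_ids'])

# Pad to 8 tokens; without truncation, longer sequences keep their full length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print(model_inputs['input_ids'])

# Pad to 8 tokens and cut longer sequences down to 8
model_inputs = tokenizer(sequences, padding="max_length", max_length=8, truncation=True)
print(model_inputs['input_ids'])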
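
To make the link between padding and the attention mask concrete, here is a minimal pure-Python sketch of what padding does conceptually. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is an assumption that holds for BERT-style uncased vocabularies (check `tokenizer.pad_token_id` for your checkpoint).

```python
# Conceptual sketch only: the real tokenizer does this internally.
PAD_ID = 0  # assumption: [PAD] has ID 0, as in BERT-style uncased vocabularies


def pad_batch(batch, max_length=None, pad_id=PAD_ID):
    """Pad each list of token IDs on the right and build matching attention masks."""
    # Default target: the longest sequence in the batch (like padding="longest")
    target = max_length if max_length is not None else max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # sequences longer than target are left as-is
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs, chosen only for illustration
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Passing `max_length=3` to this sketch leaves the five-token sequence untouched, which mirrors the tokenizer's behavior when `max_length` is set without `truncation=True`.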

## ✂️ Truncation Options

- `truncation=True` → truncate sequences longer than the model's maximum length (e.g. 512)
- `truncation=True, max_length=<int>` → truncate sequences longer than a custom length

In [None]:
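# Truncate sequences that exceed the model's maximum length
model_inputs = tokenizer(sequences, truncation=True)
print(model_inputs['input_ids'])

# Truncate sequences that exceed 8 tokens
model_inputs = tokenizer(sequences, truncation=True, max_length=8)
print(model_inputs['input_ids'])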
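
The effect of truncation on a single sequence can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: `truncate` is a hypothetical helper, the token IDs are made up, and the `[SEP]` ID of 102 is an assumption that holds for BERT-style uncased vocabularies (check `tokenizer.sep_token_id`). The point it shows is that a truncated sequence still ends with `[SEP]`.

```python
SEP_ID = 102  # assumption: [SEP] has ID 102, as in BERT-style uncased vocabularies


def truncate(ids, max_length, sep_id=SEP_ID):
    """Shorten a [CLS] ... [SEP] sequence to max_length tokens, keeping [SEP] last.

    Simplified sketch: assumes max_length >= 2 and a single sequence.
    """
    if len(ids) <= max_length:
        return ids  # short enough already, nothing to cut
    # Keep the first max_length - 1 tokens, then close the sequence with [SEP]
    return ids[: max_length - 1] + [sep_id]


# Hypothetical token IDs, chosen only for illustration
ids = [101, 7, 8, 9, 10, 11, 102]
print(truncate(ids, 4))   # [101, 7, 8, 102]
print(truncate(ids, 10))  # [101, 7, 8, 9, 10, 11, 102]
```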

## 🔹 Return Tensors for PyTorch / NumPy / TensorFlow

- `return_tensors="pt"` → PyTorch
- `return_tensors="np"` → NumPy
- `return_tensors="tf"` → TensorFlow

In [None]:
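# padding=True gives every row the same length, which tensors require
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print(model_inputs)

# Same batch, returned as NumPy arrays instead of PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(model_inputs)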
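
Tensors and arrays are rectangular, which is why the calls above combine `return_tensors` with `padding=True`: a batch whose rows differ in length cannot be converted. The following pure-Python sketch illustrates that constraint. `batch_shape` is a hypothetical helper and the token IDs are made up; the real conversion happens inside the tokenizer.

```python
def batch_shape(rows):
    """Return (batch_size, seq_len), or raise ValueError if rows differ in length."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(rows), lengths.pop())


# Hypothetical token IDs, chosen only for illustration
ragged = [[101, 7, 8, 9, 102], [101, 7, 102]]        # no padding applied
padded = [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]  # padded to the longest row

print(batch_shape(padded))  # (2, 5)
try:
    batch_shape(ragged)
except ValueError as err:
    print("cannot build a tensor:", err)
```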

## ✅ Special Tokens in Transformers

When called directly, as in `tokenizer(sequence)`, the tokenizer adds special tokens:

- `[CLS]`: classification token at the start
- `[SEP]`: separator token at the end

Models like BERT and DistilBERT were pretrained with these tokens, so they expect them at inference time as well.

In [None]:
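sequence = "I've been waiting for a HuggingFace course my whole life."
# Calling the tokenizer directly adds [CLS] and [SEP]
model_inputs = tokenizer(sequence)
print(model_inputs)

# The step-by-step route does not add special tokens
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Decoding shows the difference: only the first output contains [CLS] and [SEP]
print(tokenizer.decode(model_inputs['input_ids']))
print(tokenizer.decode(ids))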
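
The difference between the two ID lists comes down to one ID at each end. A minimal sketch, assuming the BERT uncased vocabulary where `[CLS]` is 101 and `[SEP]` is 102 (verify with `tokenizer.cls_token_id` and `tokenizer.sep_token_id`); `add_special_tokens` is a hypothetical helper and the content IDs are made up:

```python
CLS_ID = 101  # assumption: [CLS] ID in BERT-style uncased vocabularies
SEP_ID = 102  # assumption: [SEP] ID in BERT-style uncased vocabularies


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence of token IDs as [CLS] ids... [SEP]."""
    return [cls_id] + ids + [sep_id]


# Hypothetical IDs standing in for the output of convert_tokens_to_ids
plain_ids = [7, 8, 9]
full_ids = add_special_tokens(plain_ids)
print(full_ids)  # [101, 7, 8, 9, 102]

# Stripping the first and last ID recovers the plain sequence
print(full_ids[1:-1] == plain_ids)  # True
```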

## ✅ Wrapping up: From Tokenizer to Model

Now that we've seen all the individual steps the tokenizer object performs when applied to text, let's see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "So have I!"
]

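# Pad and truncate into a rectangular batch, returned as PyTorch tensors
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**tokens)
print(output.logits)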
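
The printed logits are raw scores, one row per sequence and one column per class. As a rough sketch of how such scores are usually read, the example below applies a softmax in plain Python. The logit values are hypothetical, not output from this model, and the label order in `ID2LABEL` is an assumption to confirm against `model.config.id2label`.

```python
import math

# Assumption: label order for this SST-2 checkpoint; confirm via model.config.id2label
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sequences, chosen only for illustration
logits = [[-1.5, 1.5], [2.0, -2.0]]
predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    predictions.append(ID2LABEL[best])
    print(ID2LABEL[best], [round(p, 3) for p in probs])
```

With real model output, the same step is usually done on the tensor itself, for example with `torch.softmax(output.logits, dim=-1)`.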