# Hugging Face Tokenizers 🔤

Tokenizers are the bridge between human-readable text and the numerical representations that machine learning models understand. This notebook explores how to use Hugging Face tokenizers effectively.

## What You'll Learn:
- Loading different tokenizers (DistilBERT, BERT)
- Understanding tokenization output (input_ids, attention_mask)
- Special tokens ([CLS], [SEP], [PAD])
- Padding and truncation strategies
- Complete inference pipeline from text to predictions

In [1]:
from transformers import DistilBertTokenizer, AutoTokenizer

### DistilBERT Tokenizer

In [2]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = DistilBertTokenizer.from_pretrained(model_name)

text = "Happiness lies within you"
output = tokenizer(text)
output

{'input_ids': [101, 8404, 3658, 2306, 2017, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}

The tokenizer returns a dictionary with:
- **input_ids**: Numerical IDs representing each token
- **attention_mask**: 1s for real tokens, 0s for padding

Notice the sequence starts with `101` ([CLS]) and ends with `102` ([SEP]).

### BERT Tokenizer

In [3]:
model_name = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)

`AutoTokenizer` is a convenience class that inspects the checkpoint and loads the matching tokenizer class, so the same call works across model families. It is the recommended approach because it keeps the tokenizer consistent with the checkpoint you load.

In [4]:
text = "Happiness lies within you"

output = tokenizer(text)
output

{'input_ids': [101, 8404, 3658, 2306, 2017, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [5]:
tokenizer.decode(output['input_ids'])

'[CLS] happiness lies within you [SEP]'

### Decoding: Converting IDs Back to Text

The `decode()` method converts token IDs back to human-readable text. Notice how special tokens [CLS] and [SEP] are included in the output.
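To make the decoding step concrete, here is a minimal pure-Python sketch of what it does conceptually. The ID-to-token table is copied from the outputs shown in this notebook; `toy_decode` is a hypothetical helper for illustration, not the library's implementation. The real `decode()` also accepts `skip_special_tokens=True`, which this sketch imitates.

```python
# ID-to-token pairs copied from the outputs shown in this notebook.
ID_TO_TOKEN = {
    101: "[CLS]", 8404: "happiness", 3658: "lies",
    2306: "within", 2017: "you", 102: "[SEP]",
}
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def toy_decode(ids, skip_special_tokens=False):
    """Hypothetical stand-in for decode(): look up each ID, then join the tokens."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Glue WordPiece continuation pieces ("##...") back onto the previous token.
    return " ".join(tokens).replace(" ##", "")

ids = [101, 8404, 3658, 2306, 2017, 102]
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

With the special tokens skipped, only the original (lowercased) sentence remains.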

In [6]:
tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
tokens

['[CLS]', 'happiness', 'lies', 'within', 'you', '[SEP]']

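`convert_ids_to_tokens()` shows the actual token strings, which is useful for understanding how the tokenizer splits words (subword tokenization). In this sentence every word is in the vocabulary, so nothing is split; rarer words are broken into several pieces, with continuation pieces marked by a `##` prefix.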
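Subword splitting is easier to see on a word that is not in the vocabulary. The sketch below implements the greedy longest-match-first strategy that WordPiece tokenizers use, on a tiny made-up vocabulary. `TOY_VOCAB`, `wordpiece`, and `tokenize` are all hypothetical names for illustration; the real tokenizer also handles punctuation, accents, and a much larger vocabulary.

```python
# Toy vocabulary: hypothetical, NOT the real bert-base-uncased vocab.
TOY_VOCAB = {"[UNK]", "happiness", "lies", "within", "you", "token", "##izer", "##s"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

print(tokenize("Happiness lies within you"))
print(tokenize("tokenizers"))
```

Words that are in the vocabulary pass through whole, while `tokenizers` is assembled from three smaller pieces.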

### Special Token IDs

In [7]:
tokenizer.cls_token_id

101

Each special token has a fixed ID in the vocabulary:
- **[CLS]** (101): Classification token, marks sequence start
- **[SEP]** (102): Separator token, marks sequence end
- **[PAD]** (0): Padding token, fills shorter sequences
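As a rough illustration of where these IDs end up, the sketch below wraps plain token IDs in [CLS] and [SEP] by hand. `add_special_tokens` is a hypothetical helper, not a library function. The optional second sequence shows the usual BERT layout for sentence pairs, which is an assumption beyond what this notebook runs; it is also why BERT tokenizers return `token_type_ids`.

```python
CLS_ID, SEP_ID = 101, 102  # IDs listed above for bert-base-uncased

def add_special_tokens(ids_a, ids_b=None):
    """Hypothetical helper: lay out one sequence, or a pair, BERT-style."""
    ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(ids)  # segment 0 covers [CLS], the first text, and its [SEP]
    if ids_b is not None:
        ids = ids + ids_b + [SEP_ID]
        token_type_ids = token_type_ids + [1] * (len(ids_b) + 1)  # segment 1 for the second text
    return ids, token_type_ids

happiness = [8404, 3658, 2306, 2017]   # "happiness lies within you"
nature = [1045, 2293, 3267]            # "i love nature"

print(add_special_tokens(happiness))
print(add_special_tokens(happiness, nature))
```

The single-sequence result matches the `input_ids` and `token_type_ids` printed earlier for "Happiness lies within you".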

In [8]:
tokenizer.sep_token_id

102

In [9]:
tokenizer.pad_token_id

0

In [10]:
texts = [
    "Happiness lies within you",
    "I love nature"
]

### Batch Tokenization

Tokenizers can process multiple texts at once. This is more efficient than tokenizing one by one.

In [11]:
tokenizer(texts)

{'input_ids': [[101, 8404, 3658, 2306, 2017, 102], [101, 1045, 2293, 3267, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

### Padding and Truncation

In [12]:
tokenizer(texts, padding=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306, 2017,  102],
        [ 101, 1045, 2293, 3267,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0]])}

**`padding=True`**: Pads all sequences to the length of the longest sequence in the batch.

**`return_tensors='pt'`**: Returns PyTorch tensors instead of Python lists (required for model input).
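The effect of `padding=True` can be reproduced by hand. The following is a minimal sketch using plain Python lists; `pad_batch` is a hypothetical helper, and it assumes right-side padding with the [PAD] ID of 0, as seen in the output above.

```python
PAD_ID = 0  # [PAD] ID reported by tokenizer.pad_token_id

def pad_batch(batch, pad_id=PAD_ID):
    """Hypothetical helper: right-pad every sequence to the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 8404, 3658, 2306, 2017, 102],  # "Happiness lies within you"
    [101, 1045, 2293, 3267, 102],        # "I love nature"
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Equal-length rows are what allow the batch to be stacked into a single rectangular tensor.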

In [13]:
tokenizer(texts, padding='max_length', max_length=5, truncation=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306,  102],
        [ 101, 1045, 2293, 3267,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}

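**`padding='max_length'`** + **`max_length=5`**: Forces all sequences to exactly 5 tokens, special tokens included.

**`truncation=True`**: Cuts sequences longer than `max_length`. Here the first text (6 tokens) is truncated: `you` (2017) is dropped while [SEP] (102) is kept at the end. The second text already has exactly 5 tokens, so it is unchanged.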
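A small sketch of the truncation behaviour seen in this output: content tokens are dropped from the right while the special tokens at both ends are preserved. `truncate` is a hypothetical helper that assumes a single sequence already wrapped in [CLS] and [SEP] and `max_length >= 2`; the library supports other strategies that this does not cover.

```python
def truncate(ids, max_length):
    """Hypothetical helper: trim content tokens from the right, keeping [CLS] and [SEP]."""
    if len(ids) <= max_length:
        return ids  # already short enough, nothing to cut
    cls_id, sep_id = ids[0], ids[-1]
    content = ids[1:-1]
    # Reserve two slots for the special tokens.
    return [cls_id] + content[: max_length - 2] + [sep_id]

first = [101, 8404, 3658, 2306, 2017, 102]  # 6 tokens
second = [101, 1045, 2293, 3267, 102]       # 5 tokens

print(truncate(first, 5))
print(truncate(second, 5))
```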

In [14]:
tokenizer(texts, padding='max_length', max_length=20, truncation=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306, 2017,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 1045, 2293, 3267,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

With `max_length=20`, sequences are padded (not truncated) to reach 20 tokens. The attention_mask shows which tokens are real (1) vs padding (0).

### Supplying Tokens to a Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "That phone case broke after 2 days of use", 
    "That herbal tea has helped me so much"
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
output

SequenceClassifierOutput(loss=None, logits=tensor([[ 4.0561, -3.2456],
        [-3.6340,  3.8584]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Now let's see the complete workflow: tokenize text, pass to model, and get predictions. We use a pre-trained sentiment analysis model.

In [16]:
import torch
import torch.nn.functional as F

probs = F.softmax(output.logits, dim=-1)
probs

tensor([[9.9933e-01, 6.7395e-04],
        [5.5700e-04, 9.9944e-01]], grad_fn=<SoftmaxBackward0>)

### Post-Processing: Logits to Probabilities

The model outputs raw **logits** (unnormalized scores). We apply **softmax** to convert them to probabilities that sum to 1.
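For reference, softmax is simple enough to compute by hand. The sketch below applies it to the logits printed above, using only the standard library. The logits are copied at the four-decimal precision shown, so the last digits can differ slightly from the tensor output.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the values are positive and sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above (one row per input sentence).
logits = [[4.0561, -3.2456], [-3.6340, 3.8584]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 6) for p in row], "sum =", round(sum(row), 6))
```

Softmax preserves the ordering of the logits, so the largest logit always becomes the largest probability.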

In [17]:
predicted_classes = torch.argmax(probs, dim=1).tolist()
predicted_classes

[0, 1]

`argmax` selects the class with highest probability:
- **0** = NEGATIVE
- **1** = POSITIVE
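The last step, turning class indices into readable labels, can be sketched in plain Python. `ID2LABEL` below is written out by hand from the list above, and `to_labels` is a hypothetical helper; with a loaded model, the same mapping is available as `model.config.id2label`.

```python
# Hand-written copy of the label mapping listed above.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def to_labels(probs):
    """Hypothetical helper: pick the most probable class per row and attach its label."""
    results = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        results.append({"label": ID2LABEL[best], "score": row[best]})
    return results

# Probabilities copied from the softmax output above.
probs = [[9.9933e-01, 6.7395e-04], [5.5700e-04, 9.9944e-01]]
results = to_labels(probs)
print(results)
```

The result has the same shape as the list of `label`/`score` dictionaries returned by a sentiment-analysis pipeline.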

input text ==> tokenizer ==> token IDs ==> model ==> logits ==> post-processing ==> predicted label

Previously, when we used the Hugging Face pipeline, we did all of this in a single line of code. The code above shows what the pipeline does internally.

In [18]:
from transformers import pipeline
pipe = pipeline("sentiment-analysis")
pipe("My dog is cute")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9997941851615906}]

### The Pipeline Shortcut

The **pipeline** does ALL of the above (tokenization → model → post-processing) in one line! It's the easiest way to use transformers for inference.