<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch6/Fast_tokenizers's_Special_Powers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ⚡ Fast tokenizers' special powers (PyTorch)
Fast tokenizers, the Rust-backed variant that `AutoTokenizer` loads by default when one is available, offer high-speed processing and two-way mappings between text and tokens.  
Let's explore how to inspect tokens, map them to words and characters, and replicate the output of the token-classification (NER) pipeline.


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Fast vs Slow Tokenizers: Why the Difference Matters

Fast tokenizers speed up batch tokenization and keep track of which span of the original text each token came from (the offset mapping).
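
To make the idea of an offset mapping concrete, here is a minimal pure-Python sketch. The `toy_tokenize` helper is hypothetical and only splits on whitespace; it is not how the Rust-backed tokenizers work internally, but it records the same kind of `(start, end)` character span for every token.

```python
import re


def toy_tokenize(text):
    """Hypothetical whitespace tokenizer that also records character offsets."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


sample = "Hugging Face in Brooklyn"
tokens, offsets = toy_tokenize(sample)

# Slicing the original text with an offset pair recovers the token exactly
for token, (start, end) in zip(tokens, offsets):
    assert sample[start:end] == token

print(list(zip(tokens, offsets)))
```

Because offsets index into the original string, they survive any normalization or subword splitting the tokenizer applies, which is what makes them useful later for locating entities.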


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)

print(type(encoding))  # BatchEncoding, a dictionary-like object
print(tokenizer.is_fast)  # True: we're using a fast tokenizer
print(encoding.is_fast)  # Also True!
print(encoding)

## 2️⃣ Exploring Token-Level Information

Fast tokenizers return more than token IDs. Let's look at the tokens for a sentence, the index of the word each token came from, and how a word maps back to a character span in the original text.
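
As a rough sketch of what a word-to-character lookup involves, the snippet below combines word IDs and offsets by hand. The `word_to_span` helper and the sample lists are hypothetical and written by hand for illustration; they are not output from a real tokenizer.

```python
def word_to_span(word_index, word_ids, offsets):
    """Return the (start, end) character span covered by all tokens of one word."""
    spans = [offsets[i] for i, w in enumerate(word_ids) if w == word_index]
    if not spans:
        raise ValueError(f"no token belongs to word {word_index}")
    # First token gives the start, last token gives the end
    return spans[0][0], spans[-1][1]


# Hand-written illustration: "Sylvain" split into four subword pieces
text = "is Sylvain"
word_ids = [None, 0, 1, 1, 1, 1, None]  # None marks special tokens
offsets = [(0, 0), (0, 2), (3, 4), (4, 6), (6, 8), (8, 10), (0, 0)]

start, end = word_to_span(1, word_ids, offsets)
print(text[start:end])
```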


In [None]:
# List tokens (subword units) produced by the tokenizer
print(encoding.tokens())

# Word IDs: which word in the original sentence does each token come from
print(encoding.word_ids())

# Example: for word index 3 (i.e., "Sylvain"), what is the character span?
start, end = encoding.word_to_chars(3)
print(start)
print(end)
print(example[start:end]) # Should print "Sylvain"

## 3️⃣ Try Comparing Tokenizers on a Tricky Input

What happens if you tokenize "81s" using BERT and RoBERTa?


In [None]:
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer_roberta = AutoTokenizer.from_pretrained("roberta-base")

text = "81s"
print(tokenizer_bert.tokenize(text))  # WordPiece marks continuation pieces with "##"
print(tokenizer_bert(text).tokens())  # Same tokens, plus the special tokens
print(tokenizer_roberta.tokenize(text))  # Byte-level BPE may split the text differently
print(tokenizer_roberta(text).tokens())


## 4️⃣ Inside the Token-Classification/NER Pipeline

Let's run NER with the Hugging Face `pipeline` to see how token-level and entity-level predictions are produced, then replicate the result manually.


In [None]:
from transformers import pipeline

token_classifier=pipeline("token-classification")
print(token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."))

token_classifier=pipeline("token-classification",aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

## 5️⃣ Get Predictions Manually: Model, Tokenizer, Input

Let's tokenize, run the model, and decode entity predictions ourselves.


In [None]:
from transformers import AutoModelForTokenClassification

model_checkpoint="dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer=AutoTokenizer.from_pretrained(model_checkpoint)
model=AutoModelForTokenClassification.from_pretrained(model_checkpoint)

inputs=tokenizer(example,return_tensors="pt")
print(inputs)
outputs=model(**inputs)
print(inputs["input_ids"].shape) # batch_size=1,seq_len=19
print(outputs.logits.shape) # (batch_size=1,seq_len=19,num_labels=9)

## 6️⃣ From Model Outputs to Entity Predictions

Convert the model's logits to probabilities with softmax, then use argmax to pick the most probable label for each token.
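
For intuition, here is a minimal pure-Python sketch of the same two steps on a single token. The logits are made up and the three labels are hypothetical; the notebook itself uses the PyTorch versions in the next cell.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one token over three hypothetical labels
token_logits = [2.0, 0.5, -1.0]
probs = softmax(token_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax

print(probs, best)
```

Softmax preserves the ordering of its inputs, so taking the argmax of the raw logits selects the same label as taking the argmax of the probabilities.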




In [None]:
import torch
probabilities=torch.nn.functional.softmax(outputs.logits,dim=-1)[0].tolist()
predictions=outputs.logits.argmax(dim=-1)[0].tolist()

print(predictions)  # Each entry is an index into model.config.id2label
print(model.config.id2label)

## 7️⃣ Display Entity Tokens Only

Extract tokens where the predicted label is not "O" (outside any entity).


In [None]:
results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )

print(results)

## 8️⃣ Add Character Spans With Offsets

For each entity token, include its start and end character positions in the original text.
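
The step can be sketched without a model. The `tag_entities` helper below is hypothetical, and the tokens, labels, scores, and offsets are hand-written sample values rather than real predictions.

```python
def tag_entities(tokens, labels, scores, offsets):
    """Keep tokens whose label is not "O" and attach their character span."""
    tagged = []
    for token, label, score, (start, end) in zip(tokens, labels, scores, offsets):
        if label != "O":
            tagged.append(
                {"entity": label, "score": score, "word": token, "start": start, "end": end}
            )
    return tagged


# Hand-written sample values for illustration only
text = "in Brooklyn"
tokens = ["[CLS]", "in", "Brooklyn", "[SEP]"]
labels = ["O", "O", "I-LOC", "O"]
scores = [0.99, 0.99, 0.97, 0.99]
offsets = [(0, 0), (0, 2), (3, 11), (0, 0)]

tagged = tag_entities(tokens, labels, scores, offsets)
print(tagged)
```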


In [None]:
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
print(inputs_with_offsets["offset_mapping"])
print(example[12:14])  # "yl": the characters at offsets (12, 14)

tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

# Start from an empty list so entries from the previous cell are not duplicated
results = []
for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )
print(results)

## 9️⃣ Group Entities (e.g. Multi-token "PER", "ORG", "LOC")

Offsets let us take each grouped entity directly as a slice of the original text, so there is no need to join subword tokens manually.
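
Here is a minimal sketch of that grouping logic on hand-written data. The `group_entities` helper and all sample values are hypothetical, and the rule used here (a non-"O" token opens a group, and following `I-` tokens of the same type extend it) is one simple strategy, not necessarily what the pipeline does in every case.

```python
def group_entities(text, labels, scores, offsets):
    """Merge consecutive tokens of one entity type into a single text span."""
    groups = []
    idx = 0
    while idx < len(labels):
        if labels[idx] == "O":
            idx += 1
            continue
        entity_type = labels[idx][2:]  # strip the "B-" or "I-" prefix
        start, end = offsets[idx]
        group_scores = [scores[idx]]
        idx += 1
        # Extend the span while the following tokens continue the same entity
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            end = offsets[idx][1]
            group_scores.append(scores[idx])
            idx += 1
        groups.append(
            {
                "entity_group": entity_type,
                "score": sum(group_scores) / len(group_scores),
                "word": text[start:end],
                "start": start,
                "end": end,
            }
        )
    return groups


# Hand-written sample values for illustration only
text = "at Hugging Face in Brooklyn"
labels = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "I-LOC"]
scores = [0.99, 0.9, 0.8, 1.0, 0.99, 0.95]
offsets = [(0, 2), (3, 5), (5, 10), (11, 15), (16, 18), (19, 27)]

groups = group_entities(text, labels, scores, offsets)
print(groups)
```

Slicing `text[start:end]` keeps the original spacing and casing, which joining subword strings would not guarantee.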


In [None]:
import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix to get the entity type
    label = label[2:]

    # The first token always opens the entity, whether it is tagged B- or I-
    start, end = offsets[idx]
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Extend the entity over the following tokens labeled I-<type>
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of the scores of all tokens in the grouped entity
    results.append(
        {
            "entity_group": label,
            "score": np.mean(all_scores).item(),
            "word": example[start:end],
            "start": start,
            "end": end,
        }
    )
    # No extra increment here: idx already points at the first unprocessed token

print(results)

## 🎯 Summary

Fast tokenizers make it efficient to map between token IDs and text spans, which simplifies entity extraction and labeling.  
These mappings matter for token classification, NER, question answering, and custom NLP tasks, in addition to the speedup on batched tokenization.
