<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch6/FastTokenizers_in_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Fast tokenizers in the QA pipeline (PyTorch)

Let's dig deep into how the Tokenizers library powers the QA pipeline—extracting answers using offset mapping, even with long contexts!

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Quick QA Using the Pipeline

You can extract answers in a single call with the QA pipeline. Let's see how it works.


In [None]:
from transformers import pipeline

question_answerer=pipeline("question-answering")
context="""
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question="Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question,context=context)


## 2️⃣ QA Works, Even With Long Contexts

The pipeline can find answers even at the end of *very* long documents, thanks to chunking and offset handling.


In [None]:
long_context = """
🤗 Transformers: State of the Art NLP
🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.
🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.
Why should I use transformers?
1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:
2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.
3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.
4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""

question_answerer(question=question,context=long_context)


## 3️⃣ Manual QA: Tokenizing and Model Forward Pass

Let's reproduce the pipeline steps in detail (tokenization, masking, logits).


In [None]:
from transformers import AutoTokenizer,AutoModelForQuestionAnswering

model_checkpoint="distilbert-base-cased-distilled-squad"
tokenizer=AutoTokenizer.from_pretrained(model_checkpoint)
model=AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs=tokenizer(question,context,return_tensors="pt")
outputs=model(**inputs)
start_logits=outputs.start_logits
end_logits=outputs.end_logits


print(start_logits.shape,end_logits.shape) #(1,num_tokens),(1,num_tokens)

## 4️⃣ Mask Out Irrelevant Tokens

We only want to predict answer spans that are in the context—not in the question or [SEP] tokens.


In [None]:
import torch

sequence_ids=inputs.sequence_ids()
mask=[i!=1 for i in sequence_ids] # mask every token outside the context (sequence id 1 = context)
mask[0]=False                     # unmask the [CLS] token
mask=torch.tensor(mask)[None]
start_logits[mask]=-10000
end_logits[mask]=-10000
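
To make the mask construction easier to follow, here is a minimal sketch on a hand-written list of sequence ids. The list `toy_sequence_ids` is an assumption made for illustration: it imitates the layout `[CLS] question [SEP] context [SEP]`, where `None` marks special tokens, `0` marks question tokens, and `1` marks context tokens.

```python
# Hypothetical sequence ids for: [CLS] q q [SEP] c c c [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

# True means "exclude this position from the answer search".
toy_mask = [sid != 1 for sid in toy_sequence_ids]

# Keep the first position ([CLS]) available, as the cell above does.
toy_mask[0] = False

print(toy_mask)
# → [False, True, True, True, False, False, False, True]
```

Only the `[CLS]` position and the three context positions remain unmasked.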

## 5️⃣ Convert Logits to Probabilities

Softmax yields estimated start and end token probabilities.


In [None]:
start_probabilities=torch.nn.functional.softmax(start_logits,dim=-1)[0]
end_probabilities=torch.nn.functional.softmax(end_logits,dim=-1)[0]
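
The effect of the `-10000` mask on the softmax can be checked with a small pure-Python sketch. The logits and the mask below are made-up values, and `softmax` is a hand-written helper rather than the PyTorch function used above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for five tokens; positions 1 and 2 stand in for question tokens.
toy_logits = [0.5, 4.0, 3.0, 2.0, 1.0]
toy_mask = [False, True, True, False, False]  # True = must not be predicted

# Replace masked logits with a large negative number, as in the cell above.
masked_logits = [-10000.0 if m else x for x, m in zip(toy_logits, toy_mask)]
probs = softmax(masked_logits)

print([round(p, 4) for p in probs])
```

The masked positions had the largest raw logits, yet they end up with zero probability, so the remaining probability mass is shared among the unmasked positions only.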

## 6️⃣ Find Best Start and End Indices

Search for the (start, end) pair (with start ≤ end) with the highest probability.


In [None]:
scores = start_probabilities[:, None] * end_probabilities[None, :]
scores = torch.triu(scores) # only keep scores for start <= end
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])
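
The same search can be written as an explicit loop, which makes the start ≤ end constraint visible. This is a sketch on made-up probabilities, not the notebook's implementation; the `best_span` helper and its optional `max_answer_len` cap are assumptions added for illustration.

```python
def best_span(start_probs, end_probs, max_answer_len=None):
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, -1.0)
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):
            # Optionally skip spans longer than max_answer_len tokens.
            if max_answer_len is not None and e - s + 1 > max_answer_len:
                break
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical probabilities over four tokens.
toy_start_probs = [0.1, 0.6, 0.2, 0.1]
toy_end_probs = [0.5, 0.1, 0.3, 0.1]

start, end, score = best_span(toy_start_probs, toy_end_probs)
print(start, end, round(score, 2))
```

Here the most likely start is token 1 and the most likely end is token 0, but that pair is invalid because the end precedes the start. The constrained search returns the span from token 1 to token 2 instead.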

## 7️⃣ Convert Token Indices to Character Spans (using Offsets)

Map token-level indices back to the actual answer in the context, using offset mappings.


In [None]:
inputs_with_offsets=tokenizer(question,context,return_offsets_mapping=True)
offsets=inputs_with_offsets["offset_mapping"]
start_char,_=offsets[start_index]
_,end_char=offsets[end_index]
answer=context[start_char:end_char]
result={
    "answer":answer,
    "start":start_char,
    "end":end_char,
    "score":scores[start_index,end_index].item(),

}
print(result)
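
To see what an offset mapping contains, here is a toy sketch that does not use the Tokenizers library. The `tokenize_with_offsets` helper is hypothetical: it splits on words and punctuation with a regular expression, whereas the real tokenizer produces subword tokens, but the `(start, end)` character offsets play the same role.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: return (token, (start_char, end_char)) pairs."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\w+|[^\w\s]", text)]


toy_text = "Backed by Jax, PyTorch and TensorFlow."
toy_tokens = tokenize_with_offsets(toy_text)
toy_offsets = [span for _, span in toy_tokens]

# Suppose a model selected tokens 2 to 6 ("Jax" through "TensorFlow") as the answer.
start_token, end_token = 2, 6
toy_start_char, _ = toy_offsets[start_token]
_, toy_end_char = toy_offsets[end_token]

print(toy_text[toy_start_char:toy_end_char])
# → Jax, PyTorch and TensorFlow
```

The answer is sliced from the original string, so spacing and punctuation are preserved exactly, which decoding the token ids would not guarantee.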

## 8️⃣ Handling Long Contexts: Stride Splitting

Let's see what happens when the tokenized question and context are longer than the 384-token maximum length used in the following cells.


In [None]:
inputs=tokenizer(question,long_context)
print(len(inputs["input_ids"])) # may be greater than 384

## 9️⃣ Truncate the Context ("only_second" strategy)

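With `truncation="only_second"`, the tokenizer truncates only the context (the second sequence) so that the input fits in `max_length`. The question is kept whole, but the end of the context is cut off, and the answer along with it.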


In [None]:
inputs=tokenizer(question,long_context,max_length=384,truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

## 🔟 Use Overflowing Tokens and Stride for Complete Coverage

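With `return_overflowing_tokens=True`, the tokenizer keeps the tokens that truncation would drop and returns them as additional chunks. The `stride` argument sets how many tokens consecutive chunks share, so an answer that straddles a boundary still appears whole in at least one chunk.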


In [None]:
sentence="This sentence is not too long but we are going to split it anyway."
inputs2=tokenizer(
    sentence,truncation=True,return_overflowing_tokens=True,max_length=6,stride=2
)

for ids in inputs2["input_ids"]:
  print(tokenizer.decode(ids))

print(inputs2.keys())
print(inputs2["overflow_to_sample_mapping"]) # shows which input each chunk comes from

In [None]:
sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])
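
The windowing logic can be sketched in plain Python. The `split_with_stride` helper is hypothetical and works on whitespace-separated words instead of real subword tokens. It uses a window of 4 because `max_length=6` above includes the two special tokens `[CLS]` and `[SEP]`, which leaves room for 4 content tokens per chunk.

```python
def split_with_stride(tokens, window, stride):
    """Split tokens into chunks of at most `window` items that overlap by `stride` items."""
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the input
        start += step
    return chunks


toy_tokens = "This sentence is not too long but we are going to split it anyway .".split()
toy_chunks = split_with_stride(toy_tokens, window=4, stride=2)

for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last two tokens of the previous one, and the final chunk may be shorter than the window.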

## 1️⃣1️⃣ Apply to a Real Long QA Example

Let’s split the long context and process all chunks—keep stride and offset mappings for answer extraction.


In [None]:
inputs=tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

_=inputs.pop("overflow_to_sample_mapping")
offsets=inputs.pop("offset_mapping")
inputs=inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape) # e.g. (n_chunks, max_len)

## 1️⃣2️⃣ Model Outputs for Each Chunk

The model processes all chunks in a single batch. Before computing probabilities, mask the padding tokens as well as every token outside the context.


In [None]:
outputs=model(**inputs)
start_logits=outputs.start_logits
end_logits=outputs.end_logits
print(start_logits.shape,end_logits.shape)

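sequence_ids=inputs.sequence_ids()  # sequence ids of the first chunk
mask=[i!=1 for i in sequence_ids]   # mask everything except context tokens
mask[0]=False                       # unmask the [CLS] token
# also mask [PAD] tokens, which appear in chunks shorter than the longest one
mask=torch.logical_or(torch.tensor(mask)[None],inputs["attention_mask"]==0)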
start_logits[mask]=-10000
end_logits[mask]=-10000

## 1️⃣3️⃣ Collect Best Candidate Answers Across All Chunks

Collect the best (start, end, score) tuple from each chunk, then convert the token indices to character spans.


In [None]:
candidates = []
start_probs = torch.nn.functional.softmax(start_logits, dim=-1)
end_probs = torch.nn.functional.softmax(end_logits, dim=-1)
for start_prob, end_prob in zip(start_probs, end_probs):
    scores = start_prob[:, None] * end_prob[None, :]
    idx = torch.triu(scores).argmax().item()
    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))
print(candidates)

# Now map each candidate to its character span in the context

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
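
The loop above prints one candidate per chunk. A natural last step is to keep the one with the highest score. The sketch below assumes result dictionaries shaped like the ones printed above; `pick_best` is a hypothetical helper and the entries in `toy_results` are made-up placeholders, not real model output.

```python
def pick_best(results):
    """Return the candidate with the highest score, or None if the list is empty."""
    return max(results, key=lambda r: r["score"], default=None)


# Placeholder candidates, one per chunk (values invented for illustration).
toy_results = [
    {"answer": "chunk one guess", "start": 0, "end": 15, "score": 0.30},
    {"answer": "chunk two guess", "start": 120, "end": 135, "score": 0.90},
]

best = pick_best(toy_results)
print(best["answer"], best["score"])
```

Scores from different chunks are compared directly here, which is a simplification: each chunk's probabilities come from its own softmax.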

# ✅ Summary

Fast tokenizers let the QA pipeline handle contexts longer than the maximum input length: overflowing tokens and stride split the context into overlapping chunks, and offset mapping converts the predicted token span back into characters of the original text.
