# Lab 04: Transformers 101

**AI Demystified: Decoding Models, Compute, and Connectivity**

Welcome! This lab gives a gentle, hands-on tour of the Hugging Face **Transformers** library.

**Goals**

**You will see:**
- How tokenization turns text into tokens and integer IDs
- What GPT‑2 expects as model inputs (tensor shapes)
- Token-level embeddings from GPT‑2 (last hidden state)
- Short demos of summarization, sentiment classification, named entity recognition, and translation

---

## Step 0: Tokenization concepts

- **Tokenization** splits text into small units (tokens). Models work on token IDs, not raw text.
- GPT‑2 uses a **byte-level BPE** tokenizer. It marks a **leading space** before a token with the special character **`Ġ`** (U+0120).
  - Example: `" power"` → token string `"Ġpower"` (space + "power").
  - This keeps whitespace information without using a separate "space" token.
- When you **decode** IDs back to text, the `Ġ` markers disappear and you recover the original spacing.
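
The `Ġ` marker is not arbitrary: GPT‑2's byte-level tokenizer remaps every byte to a printable character before applying BPE, and the space byte (32) lands on U+0120. The sketch below rebuilds that byte-to-character table with only the standard library, following the `bytes_to_unicode` helper from the original GPT‑2 release. The helpers `to_visible` and `from_visible` are illustrative names, not part of any library, and no BPE merging is performed here.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (space, control bytes, ...) is moved to 256, 257, ...
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(piece):
    """Render a text piece the way it appears in GPT-2 token strings."""
    return "".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))


def from_visible(token):
    """Invert to_visible: recover the original text, spaces included."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(hex(ord(BYTE_TO_CHAR[ord(" ")])))  # 0x120
print(to_visible(" power"))              # Ġpower
print(repr(from_visible("Ġpower")))      # ' power'
```

Because the mapping is a one-to-one table over all 256 byte values, decoding can always undo it, which is why the original spacing comes back intact.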

> Below, we’ll print tokens and IDs, then verify we can decode back to the original sentence.

## Step 1: Install dependencies

In [None]:
%%capture
!pip -q install "transformers>=4.41" torch sentencepiece --extra-index-url https://download.pytorch.org/whl/cpu

## Step 2: Imports

In [None]:
# Core Transformers imports. torch is the tensor engine used under the hood.
from transformers import AutoTokenizer, AutoModel, pipeline
import torch

## Step 3: Tokenization (GPT‑2)
We’ll tokenize a short sentence, view tokens and IDs, then do a decode round‑trip.

In [None]:
# Load the GPT‑2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
# A small example sentence.
text = "Transformers are powerful models for language tasks."

In [None]:
tokens = tokenizer.tokenize(text)

In [None]:
# Token strings (note the 'Ġ' marker indicating a leading space before tokens after the first).
print(tokens)

In [None]:
print(len(tokens))

In [None]:
# Token IDs are the integer form models actually consume.
ids = tokenizer.encode(text)

In [None]:
print(ids)

In [None]:
print(len(ids))

In [None]:
# Decode round‑trip: IDs → text (Ġ markers are not visible in decoded text).
decoded = tokenizer.decode(ids)

In [None]:
print(decoded)

## Step 4: Build model‑ready tensors
We create the **PyTorch tensors** (`input_ids`, `attention_mask`) that GPT‑2 expects. We only inspect shapes here.

In [None]:
# Calling the tokenizer directly returns a batch of PyTorch tensors.
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"].shape)       # [batch, seq_len]
print(inputs["attention_mask"].shape)  # same shape as input_ids
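
With a single sentence the attention mask is all ones, so its role is easy to miss. The pure-Python sketch below shows what the two inputs look like once a batch holds sequences of different lengths: shorter rows are padded, and the mask marks the padding with zeros. The token IDs and `PAD_ID` are made-up values for illustration (GPT‑2's tokenizer does not define a padding token by default), and `pad_batch` is a hypothetical helper, not a Transformers function.

```python
PAD_ID = 0  # assumption: placeholder padding ID chosen for this sketch


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad ID lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_batch([[11, 12, 13, 14], [21, 22]])
for row_ids, row_mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(row_ids, row_mask)

# Shape in [batch, seq_len] form, matching how the tensors report it.
print((len(toy_batch["input_ids"]), len(toy_batch["input_ids"][0])))  # (2, 4)
```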

## Step 5: GPT‑2 embeddings (last hidden state)
We run the base GPT‑2 model to get **contextual token embeddings**. Each token gets a vector influenced by its context.

In [None]:
# Load the base GPT‑2 transformer (no language model head).
model = AutoModel.from_pretrained("gpt2")

In [None]:
# Inference only: no gradient tracking needed.
with torch.no_grad():
    outputs = model(**inputs)

In [None]:
# Token-level embeddings after the final transformer layer.
last_hidden = outputs.last_hidden_state  # [batch, seq_len, hidden_size]
print(last_hidden.shape)

In [None]:
# Peek at the first 10 dimensions of the first token's vector.
print(last_hidden[0, 0, :10])

In [None]:
# Input embedding matrix (lookup table BEFORE transformer layers).
E = model.get_input_embeddings().weight  # [vocab_size, hidden_size]
print(E.shape)
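
The lookup table `E` and `last_hidden_state` differ in one important way: a lookup returns the same vector for a token ID wherever it appears, while the transformer layers mix in context. The toy sketch below makes that concrete with a 4-row table and a running causal average standing in for the transformer. The average is only an illustrative assumption and is far simpler than GPT‑2's attention layers; all names and numbers are invented for the example.

```python
# Toy stand-in for a [vocab_size, hidden_size] embedding matrix (4 x 2 here).
toy_table = [
    [1.0, 0.0],   # id 0
    [0.0, 1.0],   # id 1
    [1.0, 1.0],   # id 2
    [0.5, -0.5],  # id 3
]


def lookup(token_ids):
    """Static embeddings: one table row per ID, independent of position."""
    return [toy_table[i] for i in token_ids]


def causal_mean(vectors):
    """Toy context mixing: each position averages itself and all earlier ones."""
    mixed = []
    running = [0.0] * len(vectors[0])
    for count, vec in enumerate(vectors, start=1):
        running = [r + v for r, v in zip(running, vec)]
        mixed.append([r / count for r in running])
    return mixed


toy_ids = [2, 0, 2]  # the same ID appears at positions 0 and 2
static_vecs = lookup(toy_ids)
context_vecs = causal_mean(static_vecs)

print(static_vecs[0] == static_vecs[2])    # True: the lookup ignores context
print(context_vecs[0] == context_vecs[2])  # False: context changes the vector
```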

## Step 6: Summarization (compact model)
We use a small summarization model for quick demos. This is independent of GPT‑2.

In [None]:
# A short paragraph to summarize.
long_text = (
    "Cisco HyperFabric integrates servers, networking, and GPUs for high-performance AI workloads. "
    "It uses RoCEv2 with PFC/ECN to maintain low loss and predictable latency across a leaf-spine fabric, "
    "improving collective operations and training throughput."
)

In [None]:
# DistilBART CNN is smaller than bart-large and runs quickly on CPU.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

In [None]:
summary = summarizer(long_text, max_length=40, min_length=12, do_sample=False)

In [None]:
print(summary)

In [None]:
# CLEAN OUTPUT
print(summary[0]["summary_text"])

## Step 7: Text classification (sentiment)
A quick look at a ready‑to‑use sentiment pipeline.

In [None]:
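# Name the model explicitly (the SST-2 DistilBERT checkpoint) instead of relying on the pipeline default.
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")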

In [None]:
print(classifier("I love working with AI infrastructure!"))

In [None]:
print(classifier("This restaurant service is frustrating and unreliable."))
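
The `score` in each result is a probability. In a typical sequence-classification setup the model emits one raw logit per label, a softmax turns the logits into probabilities, and the pipeline reports the most likely label. A minimal sketch of that last step, assuming two labels and made-up logits (the numbers are not real model output):

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


toy_labels = ["NEGATIVE", "POSITIVE"]
toy_logits = [-2.0, 3.0]  # assumption: invented values, not real model output

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
toy_result = {"label": toy_labels[best], "score": probs[best]}
print(toy_result)
```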

## Step 8: Named Entity Recognition

In [None]:
# Build the NER pipeline; aggregation_strategy="simple" merges subword pieces into whole entities.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

In [None]:
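# Use a separate variable so the Step 3 sentence stored in `text` is not overwritten.
ner_text = "Barack Obama was born in Hawaii and served as the President of the United States."

In [None]:
# RAW OUTPUT (full dicts with scores and character spans)
entities = ner(ner_text)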

In [None]:
print(entities)

In [None]:
# CLEAN OUTPUT (just 'word → label')
for e in entities:
    print(f"- {e['word']} → {e['entity_group']}")
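
`aggregation_strategy="simple"` is what turns per-token predictions into whole entities. The model tags tokens with `B-` (begin) and `I-` (inside) labels, and adjacent tokens of the same entity type are merged. The sketch below is a simplified pure-Python version of that idea, not the library's implementation: it ignores scores, character offsets, and WordPiece `##` handling, and the tagged tokens are hand-written for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into entity groups."""
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":  # "outside": not part of any entity
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == label:
            current["word"] += " " + word  # continue the open entity
        else:
            current = {"entity_group": label, "word": word}  # start a new entity
            groups.append(current)
    return groups


# Assumption: hand-written tags in the CoNLL-style scheme, not real model output.
toy_tagged = [
    ("Barack", "B-PER"), ("Obama", "I-PER"), ("was", "O"), ("born", "O"),
    ("in", "O"), ("Hawaii", "B-LOC"), ("the", "O"),
    ("United", "B-LOC"), ("States", "I-LOC"),
]

toy_groups = group_entities(toy_tagged)
for g in toy_groups:
    print(f"- {g['word']} → {g['entity_group']}")
```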

## Step 9: Machine Translation (EN→DE and EN→HI)

In [None]:
src_text = "Artificial Intelligence is transforming the world."

In [None]:
# A) English → German, using a compact MarianMT model from Helsinki-NLP
translator_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

In [None]:
out_de = translator_de(src_text)

In [None]:
print(out_de)

In [None]:
# CLEAN OUTPUT
print(out_de[0]["translation_text"])

In [None]:
# B) English → Hindi, using the matching MarianMT model from Helsinki-NLP (also lightweight)
translator_hi = pipeline("translation_en_to_hi", model="Helsinki-NLP/opus-mt-en-hi")

In [None]:
out_hi = translator_hi(src_text)

In [None]:
print(out_hi)

In [None]:
# CLEAN OUTPUT
print(out_hi[0]["translation_text"])

---
### ✅ Wrap‑up
- `Ġ` marks a **leading space** for GPT‑2’s byte‑level BPE tokenizer (you’ll see it only in token strings).
- Models consume **IDs** and **tensors**; we printed their shapes to make this concrete.
- GPT‑2’s `last_hidden_state` provides **contextual token embeddings**.
- Pipelines turn tasks like **summarization**, **sentiment analysis**, **NER**, and **translation** into one‑liners for demos.