# Week 1 Lab: Exploring Tokenization

This notebook mirrors the lecture storyline:

1. Break text into whitespace tokens and build a vocabulary.
2. Turn documents into bag-of-words vectors.
3. Compare modern subword tokenizers on real data.
4. Generate dense embeddings to reason about word neighbours.
5. (Optional) Call a hosted model on Hugging Face Inference for experimentation.

The lab is self-contained—run the cells in order and record your observations in the reflection prompts at the end.

## 0. Setup

Create a dedicated Python environment for this lab session, register it as a Jupyter kernel, and install the required packages:

```bash
python3 -m venv venvLLMDS
source venvLLMDS/bin/activate  # Windows PowerShell: .\venvLLMDS\Scripts\Activate.ps1
pip install --upgrade pip
pip install ipykernel
ipython kernel install --user --name=venvLLMDS
pip install --upgrade datasets transformers torch huggingface-hub pandas plotly python-dotenv
```

Make sure your VS Code / Jupyter session uses the newly created `venvLLMDS` kernel before running the cells—the rest of the notebook assumes that environment.

### Hugging Face token (optional)

1. Create a free account at [huggingface.co](https://huggingface.co/).
2. Generate a new token under **Settings → Access Tokens** (select the default `read` scope).
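3. Store it in a local `.env` file, and keep that file out of version control:
   - On macOS/Linux: `echo 'HF_TOKEN=hf_your_token_here' >> .env`
   - On Windows PowerShell: `Add-Content -Path .env -Value 'HF_TOKEN=hf_your_token_here'`
4. Restart the notebook, or run `load_dotenv()` (from the `python-dotenv` package), so that `HF_TOKEN` is set when the imports cell reads it and the token is available for the optional API cell.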

If you skip this step the notebook will still run; only the hosted inference demo is disabled.
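
If `python-dotenv` is not available, the `.env` file from step 3 can also be read with the standard library. The sketch below is a minimal stand-in, not a replacement for `load_dotenv()`: the helper name `load_env_file` is hypothetical, and it assumes plain `KEY=value` lines with optional quotes and `#` comments rather than the full dotenv syntax.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=value pairs from a file into os.environ (hypothetical helper)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # nothing to load; the optional demo simply stays disabled
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip("'\"")
        # Variables already set in the shell take precedence over the file
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file()` before the imports cell would make `os.getenv('HF_TOKEN')` return the stored token.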

In [None]:
from __future__ import annotations

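import os
import re
from collections import Counter
from typing import Iterable

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import torch
from datasets import load_dataset
from IPython.display import HTML, display
from transformers import AutoModel, AutoTokenizer

try:
    from huggingface_hub import InferenceClient
except ImportError:  # pragma: no cover - optional dependency
    InferenceClient = None

try:
    # Optional: read HF_TOKEN from a local .env file (requires python-dotenv)
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:  # pragma: no cover - optional dependency
    pass

# Accept either environment variable name for the Hugging Face token
HF_TOKEN = os.getenv('HF_TOKEN') or os.getenv('HUGGINGFACEHUB_API_TOKEN')
# Run models on the GPU when one is available
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')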


## 1. Whitespace tokenization → vocabulary

The lecture introduced tokenization by simply splitting text on whitespace. Recreate that pipeline for two sentences and inspect the intermediate artifacts.

Before counting tokens, we define two short example sentences: one about robots, one about drones. Lowercasing plus `split()` mimics the lecture’s whitespace tokenizer and lets us build a small table that maps each position to its token. Punctuation stays attached to the preceding word, so `smart` and `smart.` end up as different tokens.

In [None]:
sentences = {
    "robotics": "This is a smart robot exploring language.",
    "drones": "My agile drone is also very smart."
}

whitespace_tokens = {
    name: sentence.lower().split()
    for name, sentence in sentences.items()
}

token_table = pd.DataFrame(
    [(name, i, token) for name, tokens in whitespace_tokens.items() for i, token in enumerate(tokens)],
    columns=["sentence", "position", "token"]
)

token_table

To sanity-check the tokenization, the next cell arranges both token lists side-by-side. Empty cells simply indicate that one sentence ended sooner; this makes it easy to scan where the vocab overlaps or diverges.

In [None]:
# Quick side-by-side look at whitespace tokenization
comparison = (
    pd.DataFrame.from_dict(whitespace_tokens, orient='index')
    .T.rename_axis('position')
    .reset_index()
    .fillna('')  # blank cell = that sentence has no token at this position
)
comparison

In [None]:
# Build a vocabulary and basic statistics
vocabulary = sorted({token for tokens in whitespace_tokens.values() for token in tokens})
vocab_counts = Counter(token for tokens in whitespace_tokens.values() for token in tokens)

print(f"Vocabulary size: {len(vocabulary)}")
pd.DataFrame({"token": vocabulary, "frequency": [vocab_counts[tok] for tok in vocabulary]})

## 2. Manual bag-of-words vectors

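Bag-of-words encodes each document by counting how often each term of the shared vocabulary occurs in it, ignoring word order. This recreates the slide that mapped the toy vocabulary to count vectors; here every entry is 0 or 1 because no token repeats within a sentence.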

In [None]:
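def bag_of_words(tokens: Iterable[str], vocab: list[str]) -> np.ndarray:
    """Count how often each vocabulary term occurs in `tokens` (0 if absent)."""
    counts = Counter(tokens)
    # One entry per vocabulary term, in vocabulary order
    return np.array([counts.get(term, 0) for term in vocab], dtype=np.int32)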

bow_vectors = {
    name: bag_of_words(tokens, vocabulary)
    for name, tokens in whitespace_tokens.items()
}

bow_df = pd.DataFrame(bow_vectors, index=vocabulary)
bow_df

### Exercise

- Add a third sentence and observe how the vocabulary and bag-of-words representations change.
- Compute cosine similarity between the bag-of-words vectors to quantify overlap.
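
As a starting point for the second exercise, here is a minimal standard-library sketch of cosine similarity on bag-of-words counts. The function name `bow_cosine` and the two example sentences are made up for illustration; returning 0.0 for an empty document is a convention chosen here, not something the lecture prescribes.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between the count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both documents contribute to the dot product
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


doc_a = "the robot sees the drone".split()
doc_b = "the drone sees a bird".split()
print(round(bow_cosine(doc_a, doc_b), 3))  # 0.676
```

The same function can be applied to the token lists in `whitespace_tokens` once a third sentence has been added.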

## 3. Subword tokenizers on a dataset

Whitespace tokenization breaks down for larger corpora: the vocabulary keeps growing, and any word not seen before has no entry. Use Hugging Face tokenizers to compare token counts on a small sample. This mirrors the lecture’s motivation for subword models.
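
To build intuition before loading real tokenizers, the sketch below splits single words with a greedy longest-match rule in the spirit of WordPiece, marking continuation pieces with `##`. It is a simplified illustration under stated assumptions: `toy_wordpiece` and `toy_vocab` are hypothetical names, the vocabulary is made up, and real tokenizers add normalization, punctuation handling, and a learned vocabulary.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up vocabulary for illustration only
toy_vocab = {"robot", "drone", "##s", "explor", "##ing", "token", "##ization"}

for word in ["robots", "exploring", "tokenization", "zebra"]:
    print(word, "->", toy_wordpiece(word, toy_vocab))
```

A whitespace vocabulary containing only `robot` would treat `robots` as unknown, while the subword split reuses `robot` and adds the shared suffix `##s`.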

In [None]:
model_names = ["bert-base-uncased", "bert-base-cased", "gpt2"]
sentences = {
    "robotics": "This is a smart robot exploring language.",
    "drones": "My agile drone is also very smart."
}

for model_name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"\n=== Tokenization with {model_name} ===")
    
    for label, sentence in sentences.items():
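        tokens = tokenizer.tokenize(sentence)
        # encode() also adds special tokens where the model defines them
        # (e.g. [CLS]/[SEP] for BERT), so the ID list can be longer than the token list
        token_ids = tokenizer.encode(sentence)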
        print(f"\nSentence ({label}): {sentence}")
        print("Tokens:", tokens)
        print("Token IDs:", token_ids)

In [None]:
# 1) Data: a few longer texts to make tokenization visible
ds = load_dataset("ag_news", split="train[:200]")
texts = [row["text"] for row in ds]

examples = {
    "short":  texts[5],
    "medium": " ".join(texts[20:22]),
    "long":   " ".join(texts[50:60]),
}

# 2) Tokenizers to compare
model_names = [
    "Xenova/gpt-4",        # byte-level BPE (Ġ = leading space)
    "bert-base-uncased",   # WordPiece (## = continuation)
    "distilroberta-base",  # byte-level BPE, same tokenizer as RoBERTa (Ġ = leading space)
]

tokenizers = {m: AutoTokenizer.from_pretrained(m, use_fast=True) for m in model_names}

for model_name, tok in tokenizers.items():
    print(f"{model_name} → vocab size: {len(tok)}")

# 3) Helpers to prettify and color tokens
def prettify_tokens(model_name, toks):
    # Byte-level BPE tokenizers (GPT and RoBERTa families) mark a leading space with Ġ
    if "gpt" in model_name.lower() or "roberta" in model_name.lower():
        # Show word boundaries explicitly; do not alter token boundaries
        return [t.replace("Ġ", "␠").replace("▁", "␠") for t in toks]
    # Keep BERT's '##' to illustrate subword continuation
    return toks

from html import escape

def colorize_tokens(tokens):
    colors = ["#eef6ff", "#FFD4CC", "#FABF8F", "#FFFE85", "#DCCBFE"]  # pastel colors
    spans = []
    for i, t in enumerate(tokens):
        # Escape each token so characters such as < and & are shown literally
        spans.append(
            f'<span style="background:{colors[i % len(colors)]}; padding:2px; margin:1px; '
            f'border-radius:3px; font-family:monospace;">{escape(t)}</span>'
        )
    return " ".join(spans)


def show_tokenization(model_name, text):
    tok = tokenizers[model_name]
    ids = tok.encode(text, add_special_tokens=False)
    toks = tok.convert_ids_to_tokens(ids)
    pretty = prettify_tokens(model_name, toks)
    html = (
        f"<h4>{model_name}</h4>"
        f"<div style='font-family:system-ui; margin-bottom:6px;'><b>Tokens:</b></div>"
        f"<div>{colorize_tokens(pretty)}</div>"
        f"<div style='margin-top:6px; font-family:monospace;'>count = {len(toks)}</div>"
    )
    return html, toks, ids

# 4) Display: original text + per-model token view
for label, text in examples.items():
    # Escape the raw text so it is displayed exactly as the tokenizers see it
    display(HTML(f"<h3>Example: {label}</h3><p style='line-height:1.4'>{escape(text)}</p>"))
    for m in model_names:
        html, toks, ids = show_tokenization(m, text)
        display(HTML(html))

Next, we look in more detail at how the number of tokens per text depends on the tokenizer.

In [None]:
# 1) Load 1000 short news articles (text field)
texts = [row["text"] for row in load_dataset("ag_news", split="train[:1000]")]

# 2) Choose tokenizers to compare
model_names = ["bert-base-uncased", "gpt2", "distilroberta-base"]

# 3) Compute token lengths per text per model
#    (counts include any special tokens the tokenizer adds, e.g. [CLS]/[SEP] for BERT)
records = []
for model in model_names:
    tok = AutoTokenizer.from_pretrained(model)
    for text in texts:
        records.append({"model": model, "tokens": len(tok(text).input_ids)})

# 4) Build dataframe for plotting
df = pd.DataFrame(records)

# 5) Plot histogram overlay
fig = px.histogram(
    df,
    x="tokens",
    color="model",
    barmode="overlay",
    nbins=40,
    title="Token count distribution across models"
)
fig.update_layout(bargap=0.05)
fig.show()

### Embedding overview and similarity

We obtain sentence embeddings by mean pooling the token embeddings from a small transformer. To compare meanings, we use cosine similarity, the cosine of the angle between two vectors, which is scale-invariant: s(u, v) = (u·v) / (‖u‖ ‖v‖). We L2-normalize the embeddings before comparison, so every norm equals 1 and cosine similarity reduces to a plain dot product.
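
The two properties named above can be checked numerically. The snippet below is a small NumPy sketch with made-up vectors `u` and `v`; it only illustrates the formula and is not part of the embedding pipeline.

```python
import numpy as np

# Made-up example vectors
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])


def cosine(a, b):
    """s(a, b) = (a·b) / (‖a‖ ‖b‖)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Scale invariance: stretching or shrinking a vector leaves the angle unchanged
assert np.isclose(cosine(u, v), cosine(10 * u, 0.5 * v))

# After L2 normalization both norms are 1, so cosine is just a dot product
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
assert np.isclose(cosine(u, v), u_hat @ v_hat)

print(round(cosine(u, v), 2))  # 0.96
```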

### Offline fallback (no model download)

If network/model download is unavailable, you can still practice the concept with a tiny toy example. Pick 6–8 words and make up 2-D vectors (e.g., `[[1, 0], [0.9, 0.1], [0, 1], …]`) that reflect rough similarity. Compute cosine similarities and nearest neighbours with NumPy. This mirrors the slide intuition: smaller angles → higher similarity.
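
One possible version of this offline exercise is sketched below. The words, the 2-D coordinates, and the helper name `nearest_neighbours` are all invented for illustration; the vectors are hand-picked, not learned embeddings.

```python
import numpy as np

# Made-up 2-D vectors: animals point roughly along x, machines along y
toy_words = ["cat", "dog", "puppy", "robot", "drone", "apple"]
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.7, 0.7],
])

# L2-normalize the rows so the matrix product gives pairwise cosine similarities
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T


def nearest_neighbours(word, k=2):
    """Return the k most similar other words, most similar first."""
    i = toy_words.index(word)
    order = np.argsort(-similarity[i])  # indices sorted by descending similarity
    return [toy_words[j] for j in order if j != i][:k]


print(nearest_neighbours("cat"))    # ['dog', 'puppy']
print(nearest_neighbours("robot"))  # ['drone', 'apple']
```

Changing a coordinate and rerunning shows directly how a smaller angle moves a word up the neighbour list.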

## 4. Dense embeddings and neighbours

Use a sentence-transformer to obtain normalized embeddings and inspect cosine similarities, replicating the lecture’s intuition about “word neighbours.”

In [None]:
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
embedding_model = AutoModel.from_pretrained(embedding_model_name).to(DEVICE)

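def encode_texts(texts: list[str]) -> torch.Tensor:
    encoded = embedding_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        model_output = embedding_model(**encoded)
    # Mask-aware mean pooling: average over real tokens only, not padding positions
    token_embeddings = model_output.last_hidden_state
    mask = encoded["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # L2-normalize so that cosine similarity reduces to a dot product
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu()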

words = ["cats", "dog", "puppy", "houses", "apple", "robot", "drone"]
embeddings = encode_texts(words)

cos = torch.matmul(embeddings, embeddings.T)

similarity_df = pd.DataFrame(cos.numpy(), index=words, columns=words)
similarity_df

### Discussion prompt

Which pairs cluster together? Compare your findings with the “embedding neighbours” slide. How does this change if you swap in domain-specific words?

## 5. Optional: Hosted generation via Hugging Face Inference

This mirrors the lecture’s “API call” segment without requiring OpenAI. If you set `HF_TOKEN`, the cell below sends a short prompt to the free-tier Inference API.

> The free tier is rate-limited. Keep prompts short and cache responses for your report.
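
One way to follow the caching advice is a small JSON file keyed by a hash of the prompt. The sketch below is an assumption-laden illustration: `cached_generate`, its parameters, and the file name `hf_cache.json` are hypothetical, and `generate_fn` stands for any function that takes a prompt string and returns the generated text.

```python
import hashlib
import json
from pathlib import Path


def cached_generate(prompt, generate_fn, cache_path="hf_cache.json"):
    """Return the stored response for `prompt`; call `generate_fn` only on a miss."""
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in cache:
        # Cache miss: make the (rate-limited) call once and persist the result
        cache[key] = {"prompt": prompt, "response": generate_fn(prompt)}
        path.write_text(json.dumps(cache, indent=2, ensure_ascii=False), encoding="utf-8")
    return cache[key]["response"]
```

With sampling enabled, a cached answer freezes one particular sample, which is usually what a report needs; delete the cache file to draw a fresh response.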

In [None]:
if HF_TOKEN and InferenceClient is not None:
    client = InferenceClient(model='mistralai/Mistral-7B-Instruct-v0.2', token=HF_TOKEN)
    prompt = 'Summarise why subword tokenization is helpful for transformer models.'
    try:
        response = client.text_generation(prompt, max_new_tokens=80, temperature=0.6)
        print(response)
    except Exception as err:  # e.g. rate limit reached or model currently unavailable
        print(f'Inference call failed: {err}')
elif InferenceClient is None:
    print('Install huggingface-hub to enable the Inference API example.')
else:
    print('Skipping call: set HF_TOKEN to enable the Inference API example.')

## 6. Reflection and deliverables

- Compare whitespace and subword token counts—when does each strategy make sense?
- How did embedding similarities align with your intuition from the lecture slide?
- Include screenshots or tables of your experiments and discuss any surprises.
- Make sure you understand every term, concept, function, and package used in the lab and can explain each one.

**Deliverables**

- Short write-up summarising your findings (1 page).
- CSV or markdown table logging tokenizer statistics.
- Notes from at least one optional experiment (longer prompt, different model, custom vocabulary, etc.).
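
For the tokenizer statistics deliverable, a per-model summary can be written to CSV with the standard library alone. The sketch below is one possible layout: the function names, column names, and file name are hypothetical, and the numbers in `demo` are placeholders for illustration, not measured results.

```python
import csv
import statistics


def summarise_token_counts(lengths_by_model):
    """Build one summary row per model from lists of per-text token counts."""
    rows = []
    for model, lengths in lengths_by_model.items():
        rows.append({
            "model": model,
            "n_texts": len(lengths),
            "mean_tokens": round(statistics.mean(lengths), 2),
            "median_tokens": statistics.median(lengths),
            "max_tokens": max(lengths),
        })
    return rows


def write_stats_csv(rows, path="tokenizer_stats.csv"):
    """Write the summary rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Placeholder numbers for illustration only, not measured results
demo = {"tokenizer_a": [52, 61, 48], "tokenizer_b": [58, 70, 55]}
rows = summarise_token_counts(demo)
print(rows)
```

Inside the notebook, `df.groupby("model")["tokens"].describe()` gives a comparable table directly from the histogram dataframe.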