<a href="https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day4/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Transformers

Gage DeZoort

Wintersession 2025

*Adapted from a helpful conversation with ChatGPT.*


## 0. Imports

In [None]:
%matplotlib inline

!pip install datasets -q



The goal of this tutorial is to train a sequence-to-sequence



## 1. The Learning Task





Given a word or sequence of words, how likely is some subsequent word? This is a fundamental language modeling task: assigning a probability to each word that could follow some input sequence.


As an example, let's consider the following input sequence:

*I need to take my dog to the vet because he is*

What's the next word? *Hungry*? *Healthy*? *Sick*?

You get the picture.
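
To make "assigning a probability" concrete, here is a minimal sketch that turns scores for the three candidate words into a probability distribution with a softmax. The scores are invented for illustration; a real language model would compute one score for every token in its vocabulary.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - shift) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical scores (logits) for candidate next words after
# "I need to take my dog to the vet because he is" (made up for illustration)
candidate_scores = {"sick": 3.0, "hungry": 1.0, "healthy": 0.5}

next_word_probs = softmax(candidate_scores)
for word, prob in sorted(next_word_probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:>8}: {prob:.3f}")
```

The word with the largest score always receives the largest probability, and the probabilities sum to one.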

### 1.1 Tokenization

Machines need to analyze *tokenized* data. Tokens can be words, phrases, characters, etc. Each token has a corresponding integer `ID` that is stored in a lookup table (the vocabulary).

We're going to use the tokenizer from a model called *BERT* (Bidirectional Encoder Representations from Transformers). BERT is a transformer model whose tokenizer splits the input text into words and punctuation, ignoring whitespace. It also splits complicated words into subwords. See below how the string `"deeeep"`, which does not appear in the English language, is split into three tokens, `['dee', '##ee', '##p']`. The latter two tokens, marked with the `##` prefix, are called *subwords*.

Google's proprietary WordPiece algorithm is used to build BERT's vocabulary of subwords. The vocabulary is built iteratively from an initial vocabulary of single-character tokens: frequent character pairs are merged into new subwords until the vocabulary reaches roughly 30,000 tokens.
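
Once a vocabulary exists, splitting a word into subwords is commonly described as a greedy longest-match-first lookup. The sketch below illustrates that idea only; it does not build a vocabulary, and `toy_vocab` and its IDs are hypothetical, not BERT's real ones.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary mapping tokens to IDs (not BERT's real IDs)
toy_vocab = {"[UNK]": 0, "dee": 1, "##ee": 2, "##p": 3, "run": 4, "##ner": 5}

pieces = wordpiece_tokenize("deeeep", toy_vocab)
ids = [toy_vocab[p] for p in pieces]
print(pieces, ids)
```

With this toy vocabulary, `"deeeep"` is split the same way as in the text above, and each piece maps to its ID through a plain dictionary lookup.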





In [None]:
from transformers import AutoTokenizer

# Choose a pre-trained model tokenizer (e.g., BERT)
model_name = "bert-base-uncased" # ~110M parameters, not case-sensitive
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example: Tokenizing text
text = "Transformers are a type of deeeep learning model used for NLP tasks. Epehmeral. Anachronism."
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Converting tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Decoding token IDs back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)

### 1.2 Sequence Data


To create a coherent learning task, we need to take sequences of tokens and batch them into inputs with corresponding targets. Sequences are batched into uniform-length chunks. For example, consider two words written as sequences of tokens:

Sequence #1: `["run", "##ner"]`

Sequence #2: `["d", "##run", "##k", "##en"]`

Our model will expect fixed-size sequences at input, say of size `max_length=3`. Sequence #1 is shorter than `max_length`, so we have to *pad* it with some default value. In BERT, this default value is `[PAD]`. Sequence #2, on the other hand, is longer than `max_length`, so we have to *truncate* it.

In [None]:
# Padding and truncation

sequence = tokenizer(text, padding="max_length", truncation=True, max_length=10)
print("Encoded Sequence:", sequence)
tokenizer.decode(sequence["input_ids"])

Here, `input_ids` are what the BERT transformer will actually process, `token_type_ids` demarcate segments (for next-sentence prediction), and `attention_mask` marks real tokens with 1 and padding with 0. Note that BERT's tokenizer has added a few special tokens: `[CLS]` is a classification token marking the start of the sequence, and `[SEP]` is the separator token marking the end.
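
The padding and truncation logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the helper name `pad_or_truncate` is made up, and it ignores the `[CLS]` and `[SEP]` special tokens.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Return (tokens, attention_mask), both exactly max_length long."""
    kept = tokens[:max_length]                      # truncate if too long
    n_pad = max_length - len(kept)                  # number of pads needed
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return kept + [pad_token] * n_pad, attention_mask

seq1 = ["run", "##ner"]
seq2 = ["d", "##run", "##k", "##en"]

for seq in (seq1, seq2):
    padded, mask = pad_or_truncate(seq, max_length=3)
    print(padded, mask)
```

The first sequence gains one `[PAD]` with a 0 in its mask, while the second loses its last token and keeps a mask of all ones.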

## 2. Transformer Models

BERT is a pre-trained transformer model available for generic use cases. It takes as input the `sequence` data type we generated above and outputs embeddings for each token.

In [None]:
# Load BERT and run a forward pass to get token embeddings
import torch
from transformers import AutoModel

# Load a pre-trained model
model = AutoModel.from_pretrained(model_name)

# Example input
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

# Forward pass through the model
outputs = model(**inputs)

# The model outputs embeddings
print("Last hidden state shape:", outputs.last_hidden_state.shape)

So we see that each of the 12 tokens (the 9 words plus the period, `[CLS]`, and `[SEP]`) gets a 768-dimensional output embedding. That is too many dimensions to inspect directly, so we'll have to use some specialized tools to get a closer look.

### 2.1 Attention is All You Need

Transformers use attention modules, which quantify how strongly each token in a sequence attends to (draws information from) every other token. Let's take a closer look at how attention works.
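
As a rough sketch of what a single attention head computes, the NumPy snippet below implements scaled dot-product attention, softmax(QKᵀ/√d_k)V. The sizes are toy values and the random matrices only stand in for learned weights; BERT stacks many such heads and layers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # [seq_len, seq_len]
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                    # toy sizes, far smaller than BERT's 768
X = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings

# Random projection matrices stand in for learned query/key/value weights
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

print(attn.round(2))   # row i: how much token i attends to each token
print(output.shape)    # (4, 8)
```

Each row of `attn` is a probability distribution over the tokens in the sequence, which is exactly what the attention heatmaps later in this section display.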

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Input text
text = "Transformers are powerful and versatile models."

# Tokenize and extract embeddings
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# Extract hidden states (last layer embeddings)
token_embeddings = outputs.last_hidden_state.squeeze(0)  # Shape: [sequence_length, hidden_size]
print(token_embeddings.shape)

Since the embeddings are so high-dimensional, we use a dimensionality reduction technique called principal component analysis (PCA) to visualize them. PCA identifies mutually orthogonal directions ($< 768$ of them!) of large variance in the data and returns the projection of the data onto this new, lower-dimensional basis.
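
For intuition, PCA can be sketched with a singular value decomposition in NumPy. This is an illustrative stand-in rather than scikit-learn's implementation: the helper name `pca_project` is made up, and random data replaces the real token embeddings.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are orthonormal directions, sorted by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 768))  # random stand-in for 9 token embeddings
projected, components, variance = pca_project(X, n_components=2)

print(projected.shape)                                    # (9, 2)
print(np.allclose(components @ components.T, np.eye(2)))  # True: orthonormal
```

The first component captures at least as much variance as the second, which is why the first two components are the natural axes for a 2D plot.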

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA to reduce dimensions to 2D
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(token_embeddings.detach().numpy())

# Visualize the reduced embeddings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
plt.figure(figsize=(10, 7))
for i, token in enumerate(tokens):
    plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
    plt.text(reduced_embeddings[i, 0] + 0.01, reduced_embeddings[i, 1] + 0.01, token, fontsize=12)
plt.title("2D Visualization of Token Embeddings")
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.grid()
plt.show()

We see that the tokens `["are", "versatile", "and", "powerful", "models"]` all have very similar embeddings. The sentence start and end tokens, along with `"transformers"` and the punctuation `"."`, are embedded elsewhere.

In [None]:
# Extract attention weights: a tuple with one tensor per layer,
# each of shape [batch_size, num_heads, seq_len, seq_len]
attention_weights = outputs.attentions

# Example: visualize attention from the last layer, averaged across heads
import seaborn as sns
import numpy as np

attention_last_layer = torch.mean(attention_weights[-1][0], dim=0).detach().numpy()  # Shape: [seq_len, seq_len]

plt.figure(figsize=(10, 8))
sns.heatmap(attention_last_layer, annot=True, fmt=".2f", xticklabels=tokens, yticklabels=tokens, cmap="viridis")
plt.title("Attention Weights for the Last Layer (Averaged Across Heads)")
plt.xlabel("Key Tokens")
plt.ylabel("Query Tokens")
plt.show()

You may notice that `[CLS]` and `[SEP]` get the strongest attention weights. The `[CLS]` embedding is typically sent to a downstream classification module to analyze the sentiment or meaning of the sequence provided. It may also be used to compare two sequences, e.g. via cosine similarity. `[SEP]` is usually used in sentence-pair analysis; e.g. it can store information about how different two sentences are.

In [None]:
# Aggregate attention across heads for every layer except the last,
# which is already plotted above
for layer_idx in range(len(outputs.attentions) - 1):
    layer_attention = torch.mean(outputs.attentions[layer_idx][0], dim=0).detach().numpy()
    sns.heatmap(layer_attention, xticklabels=tokens, yticklabels=tokens, cmap="viridis")
    plt.title(f"Layer {layer_idx + 1} Attention (Averaged Across Heads)")
    plt.xlabel("Key Tokens")
    plt.ylabel("Query Tokens")
    plt.show()

### 2.2 Sentence Similarity

Let's drill down on the embedding stored in `[CLS]` by evaluating several sentences that have (potentially) similar semantic structure.

In [None]:
from torch.nn import CosineSimilarity

s1 = "Transformers are powerful and versatile models."
s2 = "Language models like transformers have diverse applications."
s3 = "Political polarization keeps us divided and blind to issues that really matter."

# Tokenize each sentence and extract its [CLS] embedding
cls = []
for s in [s1, s2, s3]:
  inputs = tokenizer(s, return_tensors="pt")
  with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
  token_embeddings = outputs.last_hidden_state.squeeze(0)  # Shape: [sequence_length, hidden_size]
  cls.append(token_embeddings[0])  # [CLS] is always the first token

# Cosine similarity between every pair of sentences
cos_sim = CosineSimilarity(dim=-1)
for i in range(3):
  for j in range(3):
    print(i, j, cos_sim(cls[i], cls[j]).item())
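
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on toy 3-dimensional vectors, which stand in for the 768-dimensional `[CLS]` embeddings.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|), ranging from -1 to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for [CLS] embeddings
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, different length
c = [-3.0, 0.0, 1.0]   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # close to 0: orthogonal
```

Because the measure ignores vector length, it compares only the direction of two embeddings, and any vector has similarity 1 with itself.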

## 3. Fine-tuning

We've got pre-trained models like BERT available to us. These models have been trained on massive corpora and have excellent general language capabilities. Fine-tuning is the process of further training a pre-trained model on a smaller, task-specific dataset, which is much more efficient than training a language model from scratch.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

!pip install evaluate
import evaluate

We're going to spin up DistilBERT, a smaller version of BERT, to fine-tune.

In [None]:
# Load tokenizer and model
model_name = "distilbert-base-uncased" # "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Binary classification

The [IMDb dataset](https://huggingface.co/datasets/stanfordnlp/imdb) contains 50k movie reviews formatted as input sequences for downstream sentiment analysis. For example, what (0 or 1) do you think the training label would be for this review?

*National Treasure is about as over-rated and over-hyped as they come. Nicholas Cage is in no way a believable action hero, and this film is no "Indiana Jones". People who have compared this movie to the Indian Jones classic trilogy have seriously fallen off their rocker...*

In [None]:
# Load IMDb dataset
dataset = load_dataset("imdb")

In [None]:
# Take a small fraction of the dataset (e.g., 10%)
fraction = 0.1
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(int(len(dataset["train"]) * fraction)))
small_test_dataset = dataset["test"].shuffle(seed=42).select(range(int(len(dataset["test"]) * fraction)))

# Verify the size
print(f"Train size: {len(small_train_dataset)}, Test size: {len(small_test_dataset)}")

# Tokenize the smaller datasets
def preprocess_data(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

small_train_dataset = small_train_dataset.map(preprocess_data, batched=True)
small_test_dataset = small_test_dataset.map(preprocess_data, batched=True)

# Convert to PyTorch format
small_train_dataset = small_train_dataset.rename_column("label", "labels")
small_test_dataset = small_test_dataset.rename_column("label", "labels")

small_train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
small_test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

In [None]:
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Note: there is no need to move the datasets to the GPU by hand.
# `Dataset.map` writes its results back to Arrow tables on the CPU, so
# mapping `.to(device)` over the dataset has no lasting effect, and the
# `Trainer` below moves each batch to the model's device automatically.

In [None]:
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
print("Model device:", next(model.parameters()).device)  # This should print "cuda" if using GPU

In [None]:
accuracy = evaluate.load("accuracy")

def compute_metrics(p):
    predictions, labels = p  # predictions are the model's raw logits
    preds = predictions.argmax(axis=-1)  # Pick the class with the largest logit
    return accuracy.compute(predictions=preds, references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    logging_steps=10,
    fp16=torch.cuda.is_available(),  # Enable mixed precision if on GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

In [None]:
results = trainer.evaluate()
print(results)