# Lab 3: LLM/Transformers and Language Representations

This lab explores language model internals using Gemma-2-2b-it, covering text generation, attention mechanisms, hidden representations, and model interpretability through various exercises.

### Model access and prerequisites

This lab uses the off-the-shelf open-source LLM "Gemma-2-2b-it". The model is hosted on Hugging Face, a popular platform for sharing machine learning models. Create an account there, go to "https://huggingface.co/google/gemma-2-2b-it", and accept the terms of use. Then follow Instruction.md to set your Hugging Face token as an environment variable.

## 1. Environment setup and model authentication

Set up Hugging Face access and prepare the computing environment.

### Hugging Face authentication

Load the API key from the environment and authenticate.

In [None]:
# Load Hugging Face API key from environment (do NOT hardcode your token here).
import os
from huggingface_hub import login
from dotenv import load_dotenv

# Load .env file (if present)
load_dotenv()
hf_key = os.environ.get("HUGGINGFACE_API_KEY")
if hf_key:
    login(hf_key)
else:
    raise EnvironmentError("HUGGINGFACE_API_KEY not found. Copy .env.template to .env and add your token. See Instruction.md")

### PyTorch and CUDA configuration

Configure PyTorch settings to avoid compilation issues and optimize CUDA performance.

In [None]:
# Configure PyTorch to avoid Windows CUDA compilation issues
import torch
import os

# Disable torch compilation that can cause issues on Windows
torch._dynamo.config.suppress_errors = True
torch._dynamo.config.disable = True
os.environ["TORCH_COMPILE_DISABLE"] = "1"

# Enable TF32 matmuls and cuDNN autotuning: faster CUDA kernels at slightly reduced precision
if torch.cuda.is_available():
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True

## 2. Model loading and helper functions

Load the Gemma model, tokenizer, and set up utility functions for text generation and coherence scoring.

### Load Gemma model and tokenizer

Let's load the model and tokenizer from Hugging Face and move the model to our GPU. We will use the `transformers` library for this purpose. Note that this may take a while, as the model is quite large.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
torch.set_float32_matmul_precision('high')

model_id = "google/gemma-2-2b-it"
dtype = torch.float16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=dtype,
    attn_implementation="eager",  # Force eager attention to enable output_attentions
)

### Coherence scoring model

Next we download the ms-marco scoring model (`cross-encoder/ms-marco-MiniLM-L-6-v2`). This cross-encoder rates how relevant an answer is to a question, and we use that rating as the coherence score. It is a small model, so it should load quickly.
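The value returned by `calculate_coherence` is the raw logit of the scoring model, so it is unbounded and can be negative. If a 0 to 1 scale is easier to compare across answers, one option is to pass the logit through a sigmoid. This is a minimal sketch under that assumption, not something the lab requires; `logit_to_probability` is a hypothetical helper and the inputs are made-up numbers.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Squash a raw cross-encoder logit into the (0, 1) range with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw scores, not real model outputs.
for raw in (-4.0, 0.0, 6.5):
    print(f"logit={raw:+.1f} -> {logit_to_probability(raw):.3f}")
```

The sigmoid is monotonic, so it never changes which of two answers scores higher; it only rescales the numbers.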

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

SCORING_MODEL = AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')
SCORING_TOKENIZER = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')


def calculate_coherence(question, answer, scoring_model=SCORING_MODEL, tokenizer=SCORING_TOKENIZER):
    """Return the scoring model's raw relevance logit for a (question, answer) pair."""
    # Encode question and answer together as one sentence pair
    features = tokenizer([question], [answer], padding=True, truncation=True, return_tensors="pt")
    scoring_model.eval()
    with torch.no_grad():
        scores = scoring_model(**features).logits.squeeze().item()
    return scores

### Text generation helper function

Define a utility function for generating text from prompts with configurable sampling.

In [None]:
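def generate_text_from_prompt(prompt, tokenizer, model, do_sample=False):
    """
    Generate the model's output for a prompt.

    Args:
        prompt (str): the prompt passed to the model
        tokenizer: the tokenizer used to encode the input and decode the output
        model: the model used to generate the output
        do_sample (bool): if False, decode greedily (deterministic); if True, sample

    Returns:
        str: the decoded text (the prompt followed by the model's response)
    """
    # Chat-templated prompts already start with the BOS token, so avoid adding a second one
    bos = tokenizer.bos_token
    add_special_tokens = not (bos and prompt.startswith(bos))

    # Tokenize the prompt and move it to the model's device instead of hardcoding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=add_special_tokens)
    input_ids = inputs.input_ids.to(model.device)

    # Generate response
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=128,        # adjust as needed
            do_sample=do_sample,       # False = greedy decoding, True = sampling
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    if output_ids is not None and len(output_ids) > 0:
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return "Empty Response"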

## 3. Text generation and coherence analysis

Explore text generation with and without chat templates, and analyze response quality.

### Exercise 1: Text generation and coherence scoring

Let's try the Gemma model with a simple prompt and calculate the coherence score of the response. A higher coherence score means that the response is more relevant to the question.

**Exercise 1:** Generate text from the prompt and calculate the coherence score

**Exercise 1b:** Switch `do_sample` to `True` and generate text again. What is the coherence score now? Why is it different?
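As background for Exercise 1b, the sketch below contrasts the two decoding modes on a toy next-token distribution. It is a simplified illustration, not the `generate` implementation: the vocabulary and probabilities are made up, and `greedy_pick` and `sample_pick` are hypothetical helpers.

```python
import random

# Hypothetical next-token distribution (made-up numbers, not Gemma outputs).
vocab = ["Supervised", "In", "The", "Machine"]
probs = [0.6, 0.2, 0.15, 0.05]


def greedy_pick(probs):
    """Return the index of the most likely token (roughly what do_sample=False does)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_pick(probs, rng):
    """Draw one index at random, weighted by probability (roughly what do_sample=True does)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print("greedy :", [vocab[greedy_pick(probs)] for _ in range(5)])
print("sampled:", [vocab[sample_pick(probs, rng)] for _ in range(5)])
```

Greedy decoding returns the same token on every call, while sampling can pick a less likely token, and one different early token changes everything generated after it.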

In [None]:
# With chat template
question = "Please tell me about the key differences between supervised learning and unsupervised learning. Answer in 200 words."
chat = [
    {"role": "user", "content": question},
]

prompt_with_template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
response_with_template = generate_text_from_prompt(prompt_with_template, tokenizer, model, do_sample=False)

# extract the real output from the model
response_with_template = response_with_template.split('model\n')[-1].strip('\n').strip()

print("========== Output ==========\n", response_with_template)
score = calculate_coherence(question, response_with_template)
print(f"========== Coherence Score : {score:.4f}  ==========")

### Comparison: With vs without chat template

Compare text generation using chat templates versus plain text prompts.
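To make the comparison concrete, here is a hand-written approximation of the string that Gemma's chat template produces. It is a sketch for intuition only: `format_gemma_prompt` is a hypothetical helper, and the authoritative format is whatever `tokenizer.apply_chat_template` returns, so print that to check.

```python
def format_gemma_prompt(messages, add_generation_prompt=True):
    """Approximate the string Gemma's chat template produces (hand-written sketch)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma's template names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation continues as the assistant.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example_chat = [{"role": "user", "content": "What is supervised learning?"}]
print(format_gemma_prompt(example_chat))
```

The trailing `model` turn marker is why the earlier cell splits the decoded text on `'model\n'` to isolate the answer. A plain-text prompt has no such markers, so the model simply continues the text.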

In [None]:
# Without chat template (directly using plain text)
response_without_template = generate_text_from_prompt(question, tokenizer, model)

# extract the real output from the model by dropping the echoed prompt
# (splitting on the question's last word breaks if the answer repeats that word)
response_without_template = response_without_template.removeprefix(question).strip()
print("========== Output ==========\n", response_without_template)
score = calculate_coherence(question, response_without_template)
print(f"========== Coherence Score : {score:.4f}  ==========")

## 4. Output probability analysis and uncertainty

Examine model output probabilities, entropy, and uncertainty through attention and logit analysis.

### Exercise 2: Output probabilities and entropy analysis

Next we will inspect the attention layers and output logits/probabilities of the model. This will help us understand how the model processes the input and generates the output.
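The step from logits to probabilities is a softmax, optionally with a temperature. The sketch below shows it in plain Python on made-up logits; `softmax` here is a hypothetical stand-alone helper, not the PyTorch function used in the lab code.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up values for three candidate tokens
for t in (1.0, 0.3):
    print(t, [round(p, 3) for p in softmax(toy_logits, temperature=t)])
```

With a temperature below 1, probability mass moves toward the top token, so sampling becomes closer to greedy decoding.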

**Exercise 2:** Fill in 3 consecutive user questions in the user_inputs list below. The model will generate a response and we plot the output probabilities for the top 10 tokens. 

**Exercise 2a:** Which prompt produced the lowest uncertainty? Measure this by calculating the truncated entropy over the top-10 output logits.

**Exercise 2b:** Now that we have some insight into the model's behaviour, let's try to generate a response with high uncertainty in the first token. By changing only the user inputs, can you get a response whose first token has a low probability (< 0.5)?
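For intuition about truncated entropy, the following plain-Python sketch keeps the k largest logits, renormalizes them with a softmax, and computes the entropy in nats. The logits are made up and `topk_entropy` is a hypothetical helper written for this illustration.

```python
import math


def topk_entropy(logits, k):
    """Entropy (in nats) of the softmax renormalized over the k largest logits."""
    top = sorted(logits, reverse=True)[:k]
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Skip zero probabilities: their contribution to the entropy is 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates
unsure = [1.0, 1.0, 1.0, 1.0]      # all tokens equally likely
print(topk_entropy(confident, k=4))  # close to 0
print(topk_entropy(unsure, k=4))     # the maximum for k=4, ln(4)
```

Entropy ranges from 0 (one token takes all the mass) to ln(k) (uniform over the k tokens), which gives a scale for comparing prompts.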

In [None]:
import torch
import json
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings(
    "ignore",
    message="Glyph .* missing from font"
)                # suppress those emoji‐font warnings

# ————— Student fills in their chat here ————— # TODO By student.
user_inputs = [
]
# ——————————————————————————————————————————
def entropy_topk(logits, T=1.0, k=40): ## TODO By student.
    # Work in float32: in float16 the 1e-12 clamp underflows to 0, so log(0) can yield NaN
    z = logits.float() / T
    topz, topi = torch.topk(z, k)
    p = torch.softmax(topz, dim=-1)      # renormalized on top-k set
    return float(-(p * (p.clamp_min(1e-12)).log()).sum())

model = model.to(device)
chat_history = []

for turn, user_msg in enumerate(user_inputs, start=1):
    # — 1) Append user message to history
    chat_history.append({"role": "user", "content": user_msg})
    
    # — 2) Show raw history + formatted prompt
    formatted = tokenizer.apply_chat_template(
        chat_history,
        tokenize=False,
        add_generation_prompt=True
    )
    print(f"\n--- Turn {turn} ---")
    print("Chat History:")
    print(json.dumps(chat_history, indent=2))
    print("\nFormatted Prompt:")
    print(formatted)
    
    # — 3) Tokenize & get logits → probs
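    # The formatted prompt already contains <bos>, so do not let the tokenizer add a second one
    inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(device)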
    with torch.no_grad():
        out = model(**inputs)
    last_logits = out.logits[:, -1, :]
    probs = torch.nn.functional.softmax(last_logits, dim=-1)

    # — 4) Top-10 tokens
    top_k = 10
    top_p, top_i = torch.topk(probs, top_k, dim=-1)
    top_p = top_p.cpu().squeeze().numpy()
    top_i = top_i.cpu().squeeze().numpy()
    top_toks = [tokenizer.decode([i]) for i in top_i]
    print(f"Entropy of top-{top_k} tokens: {entropy_topk(last_logits, k=top_k)}")
    
    # — 5) Plot immediately for this turn
    plt.figure(figsize=(6, 3))
    sns.barplot(x=top_p, y=top_toks, dodge=False)
    plt.title(f"Turn {turn}: Top-10 Next-Token Probs")
    plt.xlabel("Probability")
    plt.ylabel("Token")
    plt.legend([],[], frameon=False)   # hide redundant legend
    plt.tight_layout()
    plt.show()

    # — 6) Generate reply & append to history
    gen = model.generate(
        **inputs,
        max_new_tokens=32,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature = 0.3
    )
    reply = tokenizer.decode(
        gen[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()
    chat_history.append({"role": "assistant", "content": reply})

    # — 7) Print assistant’s reply immediately
    print(f"\nAssistant (Turn {turn}):\n{reply}\n")


## 5. Sentence embeddings and layer analysis

Explore sentence representations across different model layers and visualize semantic clustering.

### Exercise 3: Sentence embeddings and t-SNE visualization

Model sentence/word representations are high-dimensional vectors that capture the meaning of the sentence/word. We can visualize these representations using t-SNE, which reduces the dimensionality of the vectors to 2D for visualization purposes.

**Exercise 3:** Run the script below to visualize the sentence embeddings of the sentences. Why are the sentences clustered together like this?

**Exercise 3a:** There is a conjecture that LLMs capture word-focused/lexical information in early layers, semantic information in middle layers, and sentence-level next token information in later layers. Find out how many layers the model has and print it out.

**Exercise 3b:** Compare the t-SNE cluster of early, middle, and late layers. Do you see any differences in the clustering of the sentences? Provide explanations based on the above conjecture.

**Exercise 3c:** Change the sentence for Orange (telecom) to align better with Microsoft (company) and Apple (company) based on your insights.
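The script below builds one vector per sentence by averaging the token vectors while ignoring padding. The same idea in plain Python, on made-up two-dimensional vectors, looks like this; `masked_mean` is a hypothetical helper for illustration only.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-padding row, like clamp_min(1) in the lab code.
    return [t / max(count, 1) for t in total]


# Toy sentence: two real tokens followed by one padding token.
vectors = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(vectors, mask))  # [2.0, 4.0]
```

Without the mask, the padding vector would drag the average far away from the real tokens, and short sentences would be affected most.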

In [None]:
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


# Sentences with different meanings of words
sentences = [
    "I ate a fresh apple.",  # Apple (fruit)
    "Apple released the new iPhone.",  # Apple (company)
    "I peeled an orange and ate it.",  # Orange (fruit)
    "The Orange network has great coverage.",  # Orange (telecom)
    "Microsoft announced a new update.",  # Microsoft (company)
    "Banana is my favorite fruit.",  # Banana (fruit)
    "My company eats fruits for lunch.",  # Company (business)
]

# Tokenize and move to device
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
inputs = inputs.to(device)

# Get hidden states
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Compute sentence-level embeddings (mean pooling with attention mask to ignore padding)
# Cast to float32 first so the sum over tokens cannot overflow in float16
hidden_states = outputs.hidden_states[-1].float()    # [B, T, D]
mask = inputs.attention_mask.unsqueeze(-1).float()   # [B, T, 1]
sum_vec = (hidden_states * mask).sum(dim=1)          # [B, D]
len_vec = mask.sum(dim=1).clamp_min(1)               # [B, 1]
sentence_embeddings = (sum_vec / len_vec).cpu().numpy()

# One label per sentence, in the same order as `sentences`
word_labels = [
    "Apple (fruit)", "Apple (company)",
    "Orange (fruit)", "Orange (telecom)",
    "Microsoft (company)", "Banana (fruit)",
    "Company (business)",
]

# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, perplexity=2, random_state=42)
embeddings_2d = tsne.fit_transform(sentence_embeddings)

# Plot the embeddings
plt.figure(figsize=(8, 6))
colors = ["red", "blue", "orange", "purple", "green", "brown", "pink", "cyan", "magenta", "yellow"]
for i, label in enumerate(word_labels):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], color=colors[i], s=100)
    plt.text(embeddings_2d[i, 0] + 0.1, embeddings_2d[i, 1] + 0.1, label, fontsize=12, color=colors[i])

plt.xlabel("t-SNE Dim 1")
plt.ylabel("t-SNE Dim 2")
plt.title("t-SNE Visualization of Sentence Embeddings")
plt.show()

## 6. Attention mechanism analysis

Inspect and interpret attention patterns during text generation across different layers and heads.

### Exercise 4: Attention pattern inspection during generation

Inspecting and interpreting attention during generation.
 
Have a look at the code and try to understand what the different components do.
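One part of the code that is easy to misread is the bookkeeping around the KV cache: the first forward pass processes the whole prompt, and every later pass processes a single new token. The sketch below simulates only the positions and row counts, with no model involved; `simulate_decoding` is a hypothetical helper and the prompt length of 2 is an assumed example value.

```python
def simulate_decoding(prompt_len, num_steps):
    """Track which positions are fed to the model at each step of cached decoding."""
    cache_position = list(range(prompt_len))  # step 0 feeds the whole prompt
    rows = 0                                  # query rows collected for the heatmap
    history = []
    for _ in range(num_steps):
        rows += len(cache_position)
        history.append(list(cache_position))
        # Later steps feed only the newest token, at the next position.
        cache_position = [cache_position[-1] + 1]
    return rows, history


rows, history = simulate_decoding(prompt_len=2, num_steps=20)
print("query rows collected:", rows)
print("positions per step  :", history[:3], "...", history[-1])
```

The number of collected rows equals `prompt_len + num_steps - 1`, which matches how `total_tokens` is computed in the cell below: the last generated token is never fed back into the model, so it has no query row.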

**Exercise 4:** Fill in the code to extract the logits and attentions at each step.

**Exercise 4a:** Plot the attention heatmap for the first layer (`layer_idx = 0`) and head 7. Pick two noteworthy query tokens and explain the attention patterns for each.

**Exercise 4b:** Try out a different layer/head and explain one attention pattern you observe. How does it differ from the previous layer/head?
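When reading the heatmaps, it helps to know what a single head computes. The sketch below builds causal scaled dot-product attention weights for three made-up token vectors in plain Python. It is a textbook simplification and leaves out Gemma-specific details; `causal_attention_weights` is a hypothetical helper.

```python
import math


def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights where token i only sees tokens 0..i."""
    dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys[: i + 1]]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions get weight 0, which is why the heatmap is lower-triangular.
        weights.append([e / total for e in exps] + [0.0] * (len(keys) - i - 1))
    return weights


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d vectors for three tokens
for row in causal_attention_weights(toy, toy):
    print([round(w, 2) for w in row])
```

Each row is one query token and sums to 1, so a bright cell shows where that token places most of its attention among the tokens it is allowed to see.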

In [None]:
# Import necessary libraries
import torch
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Input prompt for text generation
prompt = "Google"
input_ids = tokenizer(prompt, return_tensors="pt")  # Tokenize the input prompt (returns ids and attention mask)
next_token_id = input_ids.input_ids.to(model.device)  # Move input token ids to the model's device
attention_mask = input_ids.attention_mask.to(model.device)  # Move attention mask to the model's device
cache_position = torch.arange(attention_mask.shape[1], device=model.device)  # Position for the KV cache

# Set the number of tokens to generate and other parameters
generation_tokens = 20  # Limit for visualization (number of tokens to generate)
total_tokens = generation_tokens + next_token_id.size(1) - 1  # Total tokens to handle
layer_idx = 10  # Specify the layer index for attention visualization
head_idx = 7  # Specify the attention head index to visualize

# KV cache setup for caching key/values across time steps
from transformers.cache_utils import HybridCache
kv_cache = HybridCache(config=model.config, max_batch_size=1, max_cache_len=total_tokens, device=model.device, dtype=model.dtype)

generated_tokens = []  # List to store generated tokens
attentions = None  # Placeholder to store attention weights

num_new_tokens = 0  # Counter for the number of new tokens generated
model.eval()  # Set the model to evaluation mode


# Generate tokens and collect attention weights for visualization
for num_new_tokens in range(generation_tokens):
    with torch.no_grad():
        outputs = model(
            next_token_id,
            attention_mask=attention_mask,
            cache_position=cache_position,
            use_cache=True,
            past_key_values=kv_cache,
            output_attentions=True
        )

    ## TODO Student
    # Get the logits for the last generated token from outputs
    logits = ...  # replace ... with your code

    ## TODO Student
    # Extract the attention scores from the model's outputs
    attention_scores = ...  # replace ... with your code
    
    # Extract attention weights for the specified layer and head: [query_len, key_len]
    step_attention = attention_scores[layer_idx][0][head_idx].detach().cpu().numpy()

    # Stack the query rows from every step into one matrix
    if num_new_tokens == 0:
        attentions = step_attention
    else:
        attentions = np.append(attentions, step_attention, axis=0)

    # Greedy next‐token selection (you could swap in sampling if you like)
    next_token_id = logits.argmax(dim=-1)           # shape: [1]
    generated_tokens.append(next_token_id.item())

    # Update masks, cache, and positions for the next step
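    # Extend the mask by one position, keeping its dtype and device
    new_mask = torch.ones((1, 1), dtype=attention_mask.dtype, device=attention_mask.device)
    attention_mask = torch.cat([attention_mask, new_mask], dim=-1)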
    next_token_id = next_token_id.unsqueeze(0)
    kv_cache = outputs.past_key_values
    cache_position = cache_position[-1:] + 1
    

# Decode the generated tokens into human-readable text
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
full_text = prompt + generated_text  # Combine the prompt with the generated text


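# Label the axes with the tokens the model actually processed: the prompt ids
# (including <bos>) plus every generated token except the last, which was never fed back in.
# Re-tokenizing full_text would drop <bos> and shift every label by one row.
all_ids = input_ids.input_ids[0].tolist() + generated_tokens[:-1]
tokens = tokenizer.convert_ids_to_tokens(all_ids)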

# Function to plot a heatmap of attention weights
def plot_attention(attn_matrix, tokens, title="Attention Heatmap"):
    plt.figure(figsize=(10, 8))  # Set the figure size
    sns.heatmap(attn_matrix, xticklabels=tokens, yticklabels=tokens, cmap="viridis", annot=False)  # Plot the attention matrix as a heatmap
    plt.xlabel("Key Tokens")
    plt.ylabel("Query Tokens")
    plt.title(title)
    plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility
    plt.yticks(rotation=0)  # Rotate y-axis labels
    plt.show()

print(f"Generated Text: {full_text}")
# Plot the attention heatmap over all processed tokens for the chosen layer and head
plot_attention(attentions, tokens, title=f"Attention Weights, Layer {layer_idx}, Head {head_idx}")