# LLM Project - Complete Tutorial

This notebook provides a comprehensive tutorial on using T5 models, based on `solution-01-t5.ipynb`.

**What you'll learn:**
- Tokenization: Converting text to numbers and back
- Model architecture: Encoder-decoder structure, shared embeddings
- Encoder outputs: Understanding how input is processed
- Manual generation: Step-by-step autoregressive token generation
- Optimized generation: Using model.generate() with caching
- Embeddings visualization: PCA and cosine similarity

**Prerequisites:** None! This notebook installs everything you need.


In [1]:
## Step 1: Setup and Installation

First, we'll clone the repository and install all dependencies.


In [None]:
# Clone the repository
!git clone https://github.com/SabraHashemi/llm-project.git
%cd llm-project


Install required packages for transformers, visualization, and machine learning.


In [None]:
%pip install -q transformers torch matplotlib scikit-learn numpy python-dateutil
print("✅ Dependencies installed!")


In [None]:
## Step 2: Import Modules

Import the necessary libraries and our custom tokenizer/model loader modules.



In [None]:
import sys
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Add project root to Python path
sys.path.insert(0, '.')

# Import our custom modules
from llm_tokenizers import BaseTokenizerWrapper
from llm_models import Seq2SeqModelLoader

print("✅ All modules imported successfully!")


In [None]:
## Step 3: Initialize Tokenizer and Model

Load the T5-small tokenizer and model. This will download the models on first run (~240MB).


In [None]:
# Initialize tokenizer
print("Loading tokenizer...")
tokenizer = BaseTokenizerWrapper("t5-small")
print(f"✅ Tokenizer loaded! Vocabulary size: {tokenizer.vocab_size:,} tokens")

# Initialize model
print("\nLoading model (this may take a moment on first run)...")
model = Seq2SeqModelLoader("t5-small")
print("✅ Model loaded!")
print("\n📊 Model Configuration:")
print(f"   - Hidden size (d_model): {model.hidden_size}")
print(f"   - Number of layers: {model.num_layers}")
print(f"   - Number of attention heads: {model.num_heads}")
print(f"   - Feed-forward dimension (d_ff): {model.model.config.d_ff}")  # read from the config rather than assuming 4x d_model
print(f"   - Vocabulary size: {model.vocab_size:,}")


## Step 4: Understanding Model Architecture

**T5 Architecture Overview:**
- **Encoder**: Processes input text (bidirectional self-attention + feed-forward)
- **Decoder**: Generates output text (self-attention + cross-attention + feed-forward)
- **Shared Embeddings**: Encoder and decoder share the same token embedding layer
- **Language Model Head (lm_head)**: Maps decoder output to vocabulary probabilities

**What is tokenization?**
- **Encoding**: Converts human-readable text → token IDs (numbers)
- **Decoding**: Converts token IDs (numbers) → human-readable text

This is the fundamental conversion between text and the numerical representation models use.
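
To make the round trip concrete, here is a minimal sketch using a made-up word-level vocabulary. The names `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical and exist only for illustration; the real T5 tokenizer uses a SentencePiece subword vocabulary rather than whole words.

```python
# Toy word-level vocabulary (hypothetical; not the real T5 vocabulary)
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "hello": 3, "world": 4}
toy_id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_encode(text):
    """Map whitespace-separated words to IDs and append the end-of-sequence ID."""
    ids = [toy_vocab.get(word, toy_vocab["<unk>"]) for word in text.lower().split()]
    return ids + [toy_vocab["</s>"]]

def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to words, optionally dropping special tokens."""
    special = {"<pad>", "</s>"}
    tokens = [toy_id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special]
    return " ".join(tokens)

ids = toy_encode("Hello world")
print(ids)              # [3, 4, 1]
print(toy_decode(ids))  # hello world
```

Words missing from the toy vocabulary map to `<unk>`; subword tokenizers largely avoid this by splitting unknown words into smaller known pieces.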


In [None]:
# Inspect model architecture
print("🔍 Model Structure:")
print(f"   - Encoder: {len(model.model.encoder.block)} layers")
print(f"   - Decoder: {len(model.model.decoder.block)} layers")
print(f"   - Shared embeddings: {model.model.shared}")
print(f"   - Language model head: {model.model.lm_head}")

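# Verify shared embeddings by comparing the underlying weight storage
# (this holds whether the modules are the same object or only tied through their weights)
shared_ptr = model.model.shared.weight.data_ptr()
is_shared = (shared_ptr == model.model.encoder.embed_tokens.weight.data_ptr() and
             shared_ptr == model.model.decoder.embed_tokens.weight.data_ptr())
print(f"\n✅ Shared embeddings: {is_shared}")
print("   → Encoder and decoder use the same embedding weights!")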

# Inspect encoder block structure
print(f"\n📦 Encoder Block Structure (first layer):")
encoder_block = model.model.encoder.block[0]
print(f"   1. Self-Attention: {type(encoder_block.layer[0]).__name__}")
print(f"   2. Feed-Forward: {type(encoder_block.layer[1]).__name__}")

# Inspect decoder block structure
print(f"\n📦 Decoder Block Structure (first layer):")
decoder_block = model.model.decoder.block[0]
print(f"   1. Self-Attention: {type(decoder_block.layer[0]).__name__}")
print(f"   2. Cross-Attention: {type(decoder_block.layer[1]).__name__}")
print(f"   3. Feed-Forward: {type(decoder_block.layer[2]).__name__}")
print("\n💡 Key difference: Decoder has cross-attention to look at encoder outputs!")


In [None]:
# Example: Encode text to token IDs
sentence = "hello, this is a sentence!"
tokens = tokenizer.encode(sentence)

print(f"Original text: '{sentence}'")
print(f"Token IDs: {tokens['input_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print(f"\nDecoded back: '{tokenizer.decode(tokens['input_ids'])}'")
print("\n💡 Note: The tokenizer automatically added </s> (end-of-sequence token)")


In [None]:
# See actual tokens (not just IDs)
sentence = "hello, this is a sentence!"
tokens_list = tokenizer.tokenize(sentence)
print(f"Original: '{sentence}'")
print(f"Tokens: {tokens_list}")
print("\n💡 The '▁' prefix indicates a word starting after a space!")

# Compare tokenization with/without spaces
print("\n📝 Space matters in tokenization:")
print(f"  'hello,world' → {tokenizer.tokenize('hello,world')}")
print(f"  'hello, world' → {tokenizer.tokenize('hello, world')}")
print("\n💡 Notice how 'world' is tokenized differently!")


### Special Tokens

Each model uses special tokens with specific meanings:
- **EOS** (`</s>`): End of sequence
- **PAD** (`<pad>`): Padding token (also used as decoder start in T5)
- **BOS**: Beginning of sequence (T5 doesn't use this)


### Batch Encoding with Padding

**Why padding?**
When processing multiple sentences of different lengths, we need to pad shorter sentences to match the longest one. This allows us to stack them into a tensor.

**Attention masks:** Tell the model which tokens are real (1) and which are padding (0).
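
The padding idea can be sketched in a few lines of plain Python. The helper `pad_batch` below is hypothetical (it is not part of this project or of `transformers`), and it assumes right-padding with ID 0, which matches T5's `<pad>` ID.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every ID list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # real tokens, then padding
        masks.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return padded, masks

padded_ids, attention_masks = pad_batch([[5, 6, 7, 1], [8, 1]])
print(padded_ids)       # [[5, 6, 7, 1], [8, 1, 0, 0]]
print(attention_masks)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because every row now has the same length, the batch can be stacked into a single rectangular tensor.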


In [None]:
# Batch encoding example
sentences = [
    "this is the first sentence",
    "instead, this is the second sequence!"
]

# Without padding
tokens_no_pad = tokenizer.encode(sentences)
print("Without padding:")
for i, sent in enumerate(sentences):
    print(f"  {i+1}. Length: {len(tokens_no_pad['input_ids'][i])} tokens")

# With padding
tokens_padded = tokenizer.encode(sentences, padding=True)
print("\nWith padding:")
for i, (ids, mask) in enumerate(zip(tokens_padded['input_ids'], tokens_padded['attention_mask'])):
    print(f"  {i+1}. IDs: {ids}")
    print(f"     Mask: {mask}")
    print(f"     → Mask 0 = padding (model ignores these)")

print("\n💡 All sentences now have the same length!")
print("💡 Attention mask prevents the model from attending to padding tokens")


In [None]:
# Get special tokens
special = tokenizer.get_special_tokens()
print("Special tokens for T5:")
print(f"  EOS token: '{special['eos_token']}' (ID: {special['eos_token_id']})")
print(f"  PAD token: '{special['pad_token']}' (ID: {special['pad_token_id']})")
print(f"  BOS token: {special['bos_token']} (T5 doesn't use BOS)")


## Step 5: Encoder Outputs

**What does the encoder do?**
The encoder processes the input text and creates a rich representation that the decoder uses to generate output.

**Key outputs:**
- `encoder_last_hidden_state`: Hidden states from the final encoder layer, one vector per input token
- This representation stays constant during generation (encoder runs once)
- Decoder uses this via cross-attention to generate relevant output

**How T5 works:**
1. **Encoder**: Processes the input text (e.g., "translate english to german: hello")
2. **Decoder**: Generates output tokens one by one
3. **First step**: Decoder starts with `<pad>` token (T5's special start token)
4. **Output**: Model produces logits (unnormalized scores that softmax converts to probabilities) for each possible next token
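
The last step can be illustrated with a small sketch: the model emits one raw score (logit) per vocabulary token, and a softmax turns those scores into probabilities. The four-entry `toy_logits` list is made up for illustration; a real T5-small output has one entry per vocabulary token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax(toy_logits)

# Greedy decoding picks the index with the highest probability,
# which is also the index with the highest logit.
greedy_choice = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs])
print(greedy_choice)  # 0
```

Since softmax preserves ordering, taking `argmax` directly on the logits gives the same token as taking it on the probabilities.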


In [None]:
# Prepare input and run encoder
input_sentence = "translate english to german: hello, how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

print(f"Input: '{input_sentence}'")
print(f"Token IDs: {tokens['input_ids'].tolist()[0]}")
print(f"Sequence length: {tokens['input_ids'].shape[1]} tokens\n")

# Forward pass to get encoder outputs
with torch.no_grad():
    output = model(**tokens, decoder_input_ids=decoder_input_ids)

# Inspect encoder outputs
encoder_hidden = output.encoder_last_hidden_state
print(f"📤 Encoder Output:")
print(f"   Shape: {encoder_hidden.shape}")
print(f"   → [batch=1, sequence_length={encoder_hidden.shape[1]}, hidden_size={encoder_hidden.shape[2]}]")
print(f"\n💡 This representation captures the meaning of the entire input sentence!")
print(f"💡 The decoder will use this via cross-attention to generate the translation.")


In [None]:
# Prepare input
input_sentence = "translate english to german: hello, how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

print(f"Input: '{input_sentence}'")
print(f"Encoder input shape: {tokens['input_ids'].shape}")
print(f"Decoder input (starting token): {decoder_input_ids.tolist()}")

# Forward pass
with torch.no_grad():
    output = model(**tokens, decoder_input_ids=decoder_input_ids)

print(f"\n✅ Forward pass completed!")
print(f"Output logits shape: {output.logits.shape}")
print(f"   → Shape means: [batch=1, sequence=1, vocab={model.vocab_size:,}]")
print(f"   → For each position, we have {model.vocab_size:,} logits (one per token)")
print(f"\nOutput contains:")
print(f"  - logits: Unnormalized scores for the next token")
print(f"  - past_key_values: Cached attention states")
print(f"  - encoder_last_hidden_state: Final encoder representation")


In [None]:
# Manual step-by-step generation
input_sentence = "translate english to german: hello, how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

print(f"Input: '{input_sentence}'\n")
print("🔄 Manual Generation Process:")
print("=" * 50)

max_length = 10
i = 0

while i < max_length and decoder_input_ids[0, -1].item() != tokenizer.tokenizer.eos_token_id:
    # Forward pass
    with torch.no_grad():
        output = model(**tokens, decoder_input_ids=decoder_input_ids)
    
    # Get the most likely next token (greedy decoding)
    next_token_logits = output.logits[0, -1, :]  # Last position, all vocab
    next_token_id = next_token_logits.argmax().item()
    
    # Decode current sequence
    current_text = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
    next_token_text = tokenizer.decode([next_token_id], skip_special_tokens=True)
    
    print(f"Step {i+1}: '{current_text}' → next token: '{next_token_text}' (ID: {next_token_id})")
    
    # Add predicted token to decoder input
    decoder_input_ids = torch.cat([decoder_input_ids, torch.tensor([[next_token_id]])], dim=1)
    
    # Check if we hit end token
    if next_token_id == tokenizer.tokenizer.eos_token_id:
        print(f"\n✅ Generation complete! Final output: '{tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)}'")
        break
    
    i += 1

print("\n💡 Key insights:")
print("   - Each step predicts ONE token at a time")
print("   - The model uses ALL previous tokens + encoder output")
print("   - Generation stops when the </s> token is predicted or max_length is reached")


## Step 6: Text Generation

**Autoregressive Generation:**
Models generate text token by token. Each new token depends on all previous tokens.

**Greedy Decoding:** Always picks the most likely next token (highest probability).

The `model.generate()` method does this automatically with optimizations!
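
The greedy loop itself can be sketched without a real model. In the example below, `toy_next_token_scores` is a hypothetical stand-in for a forward pass (it simply follows a fixed script), and the IDs 0 and 1 mirror T5's `<pad>` start token and `</s>` end token.

```python
PAD_ID, EOS_ID = 0, 1  # decoder start token and end-of-sequence token

def toy_next_token_scores(decoder_ids):
    """Hypothetical stand-in for a model: scores over a 5-token vocabulary."""
    script = [3, 4, 2, EOS_ID]        # the sequence this toy "model" wants to emit
    step = len(decoder_ids) - 1       # number of tokens generated so far
    target = script[min(step, len(script) - 1)]
    return [1.0 if tok == target else 0.0 for tok in range(5)]

def greedy_generate(max_new_tokens=10):
    decoder_ids = [PAD_ID]            # generation starts from the start token
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(decoder_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        decoder_ids.append(next_id)   # feed the prediction back in
        if next_id == EOS_ID:
            break
    return decoder_ids

print(greedy_generate())                  # [0, 3, 4, 2, 1]
print(greedy_generate(max_new_tokens=2))  # [0, 3, 4]
```

The second call shows the other stopping condition: when the token budget runs out before the end token appears, the output is simply truncated.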


In [None]:
# Optimized generation using model.generate()
input_sentence = "translate english to german: hello, how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")

print(f"Input:  '{input_sentence}'")

with torch.no_grad():
    generated_ids = model.generate(**tokens, max_length=20)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(f"Output: '{generated_text}'")
print(f"\n✅ Generated in one call!")
print(f"💡 model.generate() is optimized with caching and is much faster than manual loops.")
print(f"💡 The encoder runs once, and its output is reused for all decoder steps.")


In [None]:
# Generate text using model.generate()
# Note: We need to prepare tokens again for generation
input_sentence = "translate english to german: hello, how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(**tokens, max_length=20)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(f"Input:  '{input_sentence}'")
print(f"Output: '{generated_text}'")
print(f"\n💡 The model translated English to German!")
print(f"💡 model.generate() is optimized and faster than manual generation")


## Step 7: Token Embeddings Visualization

**What are embeddings?**
- Each word/token is represented as a high-dimensional vector (512 dimensions for T5-small)
- These vectors capture semantic meaning
- Similar words have similar embeddings

**Visualization techniques:**
1. **PCA (Principal Component Analysis)**: Reduces 512D → 2D for visualization
2. **Cosine Similarity**: Measures how similar two word embeddings are (ranges from -1 to 1)


In [None]:
# Words to visualize
words = [
    "chair",
    "table",
    "plate",
    "knife",
    "spoon",
    "horse",
    "goat",
    "sheep",
    "cat",
    "dog",
]

print(f"📝 Analyzing embeddings for {len(words)} words:")
for i, word in enumerate(words, 1):
    print(f"   {i}. {word}")

# Get token IDs for first token of each word
word_tokens = tokenizer.encode(words, return_tensors="pt", padding=True)["input_ids"][:, 0]
print(f"\nToken IDs: {word_tokens.tolist()}")

# Extract embeddings from the model's shared embedding layer
with torch.no_grad():
    token_embeddings = model.model.shared(word_tokens).cpu().detach().numpy()

print(f"✅ Embeddings extracted!")
print(f"   Shape: {token_embeddings.shape}")
print(f"   → Each word is a {token_embeddings.shape[1]}-dimensional vector")


### PCA Visualization (2D Projection)

PCA reduces high-dimensional embeddings to 2D so we can visualize them. Words that are semantically similar should appear close together!


In [None]:
# Apply PCA to reduce dimensions from 512D to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(token_embeddings)

print(f"Explained variance per component: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")
print("→ This shows how much information is preserved in 2D")

# Create PCA plot
plt.figure(figsize=(12, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], s=200, alpha=0.7, edgecolors='black', linewidth=1.5)

# Add labels with better positioning
for i, word in enumerate(words):
    plt.annotate(word, (X_pca[i, 0], X_pca[i, 1]), 
                xytext=(5, 5), textcoords='offset points', 
                fontsize=11, fontweight='bold')

plt.xlabel('First Principal Component', fontsize=13)
plt.ylabel('Second Principal Component', fontsize=13)
plt.title('Token Embeddings - PCA Visualization (2D Projection)\nWords close together are semantically similar', 
          fontsize=15, fontweight='bold')
plt.grid(True, alpha=0.3, linestyle='--')
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   - Household-object words (chair, table, plate, knife, spoon) should cluster together")
print("   - Animal words (horse, goat, sheep, cat, dog) should cluster together")
print("   - This shows the model learned semantic relationships!")


## Step 8: Cross-Attention Visualization (Optional)

**What is cross-attention?**
Cross-attention shows what parts of the input the decoder focuses on when generating each output token.

**Key insights:**
- Early layers focus on task identification ("translate", "german")
- Later layers focus on content words ("hello")
- This reveals how the model connects input to output!
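
As a rough illustration of what a cross-attention weight is, the sketch below computes attention weights for one decoder query over three encoder positions using the textbook scaled dot-product form. The vectors are made up, `cross_attention_weights` is a hypothetical helper, and T5's own implementation differs in details (for example, it does not apply the 1/sqrt(d) scaling), so treat this as a conceptual sketch only.

```python
import math

def cross_attention_weights(query, keys):
    """Softmax over scaled dot products between one query and each key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 2-dimensional vectors: three encoder positions, one decoder query
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [2.0, 0.0]

weights = cross_attention_weights(decoder_query, encoder_keys)
print([round(w, 3) for w in weights])
```

The weights sum to 1 across encoder positions, which is why each row of a cross-attention plot can be read as a distribution over input tokens.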

**Note:** This requires reloading the model with `output_attentions=True`.


In [None]:
# Reload model with attention outputs enabled
print("Reloading model with attention outputs...")
model_with_attn = Seq2SeqModelLoader("t5-small", output_attentions=True)

# Prepare input
input_sentence = "translate english to german: hello how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

# Forward pass with attention
with torch.no_grad():
    output = model_with_attn(**tokens, decoder_input_ids=decoder_input_ids)

print(f"\n✅ Model outputs now include attention weights!")
print(f"Available keys: {list(output.keys())}")
print(f"\nCross-attention layers: {len(output.cross_attentions)}")
print(f"Shape of first cross-attention: {output.cross_attentions[0].shape}")
print(f"   → [batch=1, heads={output.cross_attentions[0].shape[1]}, decoder_pos=1, encoder_pos={output.cross_attentions[0].shape[3]}]")


In [None]:
# Visualize cross-attention for first layer (averaged across heads)
first_layer_attn = output.cross_attentions[0][0, :, 0].detach().cpu().numpy()  # [heads, encoder_pos]
avg_attn = first_layer_attn.mean(axis=0)  # Average across heads

# Get input tokens for labels
input_tokens = tokenizer.tokenize(input_sentence) + ["</s>"]

# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(len(input_tokens)), avg_attn)
plt.yticks(range(len(input_tokens)), input_tokens)
plt.xlabel('Attention Weight (Average across heads)')
plt.title('Cross-Attention: First Layer\n(What the decoder focuses on when generating first token)')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Higher bars = more attention")
print("💡 Early layers typically focus on task words ('translate', 'german')")


### Vocabulary Exploration

The tokenizer has a vocabulary mapping tokens to IDs. Let's explore it!


In [None]:
# Get vocabulary
vocabulary = tokenizer.get_vocab()
reverse_vocab = {v: k for k, v in vocabulary.items()}

print(f"📚 Vocabulary size: {len(vocabulary):,} tokens")
print(f"   → 32,000 base tokens + 100 special tokens (<extra_id_0> to <extra_id_99>)")

# Show some random tokens
import random
random_tokens = random.sample(list(vocabulary.keys()), 10)
print(f"\n🔀 Random tokens from vocabulary:")
for token in random_tokens:
    print(f"   '{token}' → ID: {vocabulary[token]}")

# Check special tokens
print(f"\n🎯 Special token IDs:")
print(f"   EOS (</s>): {vocabulary.get('</s>', 'N/A')}")
print(f"   PAD (<pad>): {vocabulary.get('<pad>', 'N/A')}")
print(f"   Extra ID 0: {vocabulary.get('<extra_id_0>', 'N/A')}")
print(f"   Extra ID 1: {vocabulary.get('<extra_id_1>', 'N/A')}")


### Cross-Attention Across All Layers

Let's visualize cross-attention across all decoder layers to see how attention patterns change from early to late layers.

**Note:** This requires reloading the model with `output_attentions=True`.


In [None]:
# Reload model with attention outputs enabled
from transformers import AutoModelForSeq2SeqLM

model_with_attn = AutoModelForSeq2SeqLM.from_pretrained("t5-small", output_attentions=True)
model_with_attn.eval()

# Prepare input
input_sentence = "translate english to german: hello how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

# Forward pass to get attention
with torch.no_grad():
    output = model_with_attn(**tokens, decoder_input_ids=decoder_input_ids)

# Visualize cross-attention across all decoder layers
num_layers = len(output.cross_attentions)
fig, axes = plt.subplots(1, min(num_layers, 10), figsize=(14, 3))

if num_layers == 1:
    axes = [axes]  # Make it iterable if only one layer

axes[0].set_ylabel("Attention head", fontsize=11)

input_tokens = tokenizer.tokenize(input_sentence) + ["</s>"]

for i in range(min(num_layers, 10)):
    # Get attention for layer i: [batch=1, heads, decoder_pos=1, encoder_pos]
    layer_attn = output.cross_attentions[i][0, :, 0].detach().cpu().numpy()
    
    # Plot heatmap
    im = axes[i].imshow(layer_attn, aspect='auto', cmap='viridis')
    axes[i].set_xticks(range(len(input_tokens)))
    axes[i].set_xticklabels(input_tokens, rotation=90, fontsize=9)
    axes[i].set_yticks([])
    axes[i].set_title(f"Layer {i+1}", fontsize=10, fontweight='bold')

plt.suptitle('Cross-Attention Across Decoder Layers\n(Each layer shows attention from all heads)', 
             fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   - Early layers: Often focus on task words ('translate', 'german')")
print("   - Middle layers: Transition to content words")
print("   - Late layers: Often focus on actual content ('hello', 'how', 'are')")
print("   - This shows how the model processes information hierarchically!")


### Attention Throughout Generation

Now let's see how attention shifts as the model generates each token. This shows how the decoder focuses on different parts of the input as it generates different parts of the output!


In [None]:
# Generate the full sequence step-by-step and collect attention weights
input_sentence = "translate english to german: hello how are you?"
tokens = tokenizer.encode(input_sentence, return_tensors="pt")
decoder_input_ids = torch.tensor([[tokenizer.tokenizer.pad_token_id]])

attns = []

max_length = 20
i = 0

print("🔄 Generating sequence and collecting attention weights...")
print("=" * 60)

while i < max_length and decoder_input_ids[0, -1].item() != tokenizer.tokenizer.eos_token_id:
    # Forward pass
    with torch.no_grad():
        step_output = model_with_attn(**tokens, decoder_input_ids=decoder_input_ids)
    
    # Get predicted token (greedy decoding)
    next_token_logits = step_output.logits[0, -1, :]
    next_token_id = next_token_logits.argmax().item()
    
    # Store attention weights for this step
    attns.append(step_output.cross_attentions)
    
    # Decode current sequence
    current_text = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
    next_token_text = tokenizer.decode([next_token_id], skip_special_tokens=True)
    
    print(f"Step {i+1}: '{current_text}' → '{next_token_text}'")
    
    # Add predicted token
    decoder_input_ids = torch.cat([decoder_input_ids, torch.tensor([[next_token_id]])], dim=1)
    
    if next_token_id == tokenizer.tokenizer.eos_token_id:
        break
    
    i += 1

print(f"\n✅ Generated {decoder_input_ids.shape[1] - 1} tokens")  # exclude the start token
print(f"✅ Collected attention weights for each generation step")


In [None]:
# Create heatmap showing attention throughout generation
# Average attention across all layers and heads, for the last decoder position at each step
attention_matrix = torch.stack([
    torch.stack(a).mean(dim=(0, 1, 2))[-1]  # Average across layers, batch, and heads; keep last decoder pos
    for a in attns
]).detach().cpu().numpy()

input_tokens = tokenizer.tokenize(input_sentence) + ["</s>"]

# Get the generated tokens (everything after the start token), one per attention row
output_tokens = tokenizer.tokenizer.convert_ids_to_tokens(decoder_input_ids[0, 1:].tolist())

fig, ax = plt.subplots(figsize=(12, 8))
im = ax.imshow(attention_matrix, cmap='YlOrRd', aspect='auto', interpolation='nearest')

# Set labels
ax.set_xticks(range(len(input_tokens)))
ax.set_xticklabels(input_tokens, rotation=45, ha='right', fontsize=10)
ax.set_yticks(range(len(output_tokens)))
ax.set_yticklabels(output_tokens, fontsize=10)

ax.set_xlabel('Input Token', fontsize=12, fontweight='bold')
ax.set_ylabel('Generated Token', fontsize=12, fontweight='bold')
ax.set_title('Attention Throughout Generation\n(How decoder focuses on input as it generates output)', 
             fontsize=14, fontweight='bold', pad=15)

# Add colorbar
cbar = plt.colorbar(im, ax=ax, label='Attention Weight', shrink=0.8)
cbar.ax.tick_params(labelsize=10)

plt.tight_layout()
plt.show()

print("\n💡 Key insights:")
print("   - When generating 'Hallo', attention focuses on 'hello'")
print("   - As generation progresses, attention shifts to other input words")
print("   - This shows the decoder dynamically attends to different parts of the input!")
print("   - The model learns to align input and output words semantically!")


### Cosine Similarity Matrix

Cosine similarity measures how similar two word embeddings are:
- **1.0** = Identical (same word)
- **0.8-1.0** = Very similar (e.g., "cat" and "dog")
- **0.5-0.8** = Somewhat related
- **0.0-0.5** = Unrelated
- **Negative** = Opposites or very different
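
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below uses made-up 2-dimensional vectors and a hypothetical helper named `cosine`; the notebook itself relies on scikit-learn's `cosine_similarity` for the real embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction: approximately 1
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: approximately -1
```

Because the vectors are normalized by their lengths, only direction matters: scaling an embedding does not change its similarity to any other embedding.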


In [None]:
# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(token_embeddings)

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"→ Shows pairwise similarity between all {len(words)} words\n")

# Create heatmap with better styling
fig, ax = plt.subplots(figsize=(12, 10))
cax = ax.imshow(similarity_matrix, cmap="RdYlBu_r", vmin=-1, vmax=1, aspect='auto')
cbar = fig.colorbar(cax, ax=ax, label='Cosine Similarity', shrink=0.8)
cbar.ax.tick_params(labelsize=11)

ax.set_xticks(range(len(words)))
ax.set_yticks(range(len(words)))
ax.set_xticklabels(words, rotation=45, ha='right', fontsize=11)
ax.set_yticklabels(words, fontsize=11)
ax.set_title('Cosine Similarity Matrix\n(Red = similar, Blue = different)', 
             fontsize=15, fontweight='bold', pad=20)

# Add similarity values to the plot (only show if similarity > 0.3 for readability)
for i in range(len(words)):
    for j in range(len(words)):
        sim_val = similarity_matrix[i, j]
        # Use white text for dark backgrounds, black for light
        text_color = 'white' if abs(sim_val) > 0.5 else 'black'
        if abs(sim_val) > 0.3 or i == j:  # Show diagonal and significant similarities
            ax.text(j, i, f'{sim_val:.2f}',
                   ha="center", va="center", 
                   color=text_color, fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

# Print most similar pairs
print("\n💡 Most similar word pairs:")
similarities = []
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        similarities.append((words[i], words[j], similarity_matrix[i, j]))

similarities.sort(key=lambda x: x[2], reverse=True)
for i, (word1, word2, sim) in enumerate(similarities[:5], 1):
    print(f"   {i}. '{word1}' ↔ '{word2}': {sim:.3f}")


## Summary

**What you learned:**
1. ✅ **Tokenization**: Converting text to numbers (token IDs) and back
2. ✅ **Model Architecture**: Encoder-decoder structure, shared embeddings, attention layers
3. ✅ **Encoder Outputs**: How the encoder processes input and creates representations
4. ✅ **Manual Generation**: Step-by-step autoregressive token generation process
5. ✅ **Optimized Generation**: Using `model.generate()` with caching for efficiency
6. ✅ **Embeddings**: How words are represented as high-dimensional vectors
7. ✅ **Visualization**: PCA and cosine similarity for understanding semantic relationships

**Key Concepts:**
- **Encoder**: Processes input once, creates rich representation (bidirectional)
- **Decoder**: Generates output token-by-token (autoregressive) using cross-attention
- **Shared Embeddings**: Encoder and decoder share the same token embedding weights
- **Autoregressive**: Each token depends on all previous tokens
- **Caching**: Encoder outputs are cached during generation for efficiency

**Next steps:**
- Try different input prompts and tasks (summarization, question answering)
- Experiment with different T5 model sizes (t5-base, t5-large)
- Explore attention mechanisms in `solution-02-attention.ipynb`
- Modify the code to experiment with different models (BERT, GPT-2, etc.)
- Study cross-attention weights to see what the model focuses on

**Resources:**
- Full tutorial: `labs/solution-01-t5.ipynb`
- Test suite: `test_t5_notebook.py`
- More examples: `examples/` folder
