# Lab 4: Transformers and Modern NLP

In this lab, we'll explore the transformer architecture that revolutionized NLP and learn to use state-of-the-art pre-trained models like BERT and GPT.

## Learning Objectives

By the end of this lab, you will:
- Understand the complete transformer architecture
- Implement transformer encoder and decoder blocks
- Learn about BERT and GPT architectures
- Use the Hugging Face Transformers library
- Fine-tune pre-trained models for specific tasks
- Perform text classification, generation, and Q&A
- Evaluate models with perplexity, BLEU, and ROUGE

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")

## Part 1: Building a Complete Transformer

The **Transformer** architecture ("Attention Is All You Need", 2017) relies entirely on attention mechanisms, with no recurrence or convolution.

### Key Components:
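1. **Multi-Head Attention**: Several attention heads computed in parallel, each over its own learned projection of the input
2. **Feed-Forward Networks**: Position-wise fully connected layers
3. **Layer Normalization**: Normalizes each position's features to stabilize training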
4. **Residual Connections**: Skip connections for gradient flow
5. **Positional Encoding**: Inject sequence order information

### Architecture:
```
Encoder:                    Decoder:
Input Embedding             Output Embedding
+ Positional Encoding       + Positional Encoding
↓                           ↓
Multi-Head Self-Attention   Masked Multi-Head Self-Attention
+ Residual + Norm           + Residual + Norm
↓                           ↓
Feed Forward                Cross-Attention (to encoder)
+ Residual + Norm           + Residual + Norm
                            ↓
                            Feed Forward
                            + Residual + Norm
```
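
Each "+ Residual + Norm" step in the diagram wraps a sub-layer as `LayerNorm(x + Sublayer(x))`. Here is a minimal NumPy sketch of that wrapper, assuming a layer norm without learned scale and shift; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return 0.5 * np.tanh(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))  # (batch, seq_len, d_model)
out = residual_block(x, toy_sublayer)
print(out.shape)  # (2, 5, 8)
```

Because the output shape matches the input shape, these blocks can be stacked any number of times, which is what the encoder and decoder stacks in this lab do.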

In [None]:
def positional_encoding(position, d_model):
    """
    Create positional encoding matrix.
    
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)
    )
    
    # Apply sin to even indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # Apply cos to odd indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    
    return tf.cast(pos_encoding, dtype=tf.float32)

# Visualize positional encoding
pos_enc = positional_encoding(50, 128)

plt.figure(figsize=(12, 6))
plt.pcolormesh(pos_enc[0], cmap='RdBu')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.colorbar()
plt.title('Positional Encoding (50 positions, 128 dimensions)')
plt.tight_layout()
plt.show()

print("Positional encoding allows the model to use sequence order!")

In [None]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Calculate scaled dot-product attention.
    
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    # Scale by sqrt(d_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(d_k)
    
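    # Apply mask (if provided): positions where mask == 1 are blocked by
    # pushing their logits toward -inf, so softmax gives them ~0 weight
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)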
    
    # Softmax over last axis
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights


class MultiHeadAttention(layers.Layer):
    """Multi-head attention layer."""
    
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % num_heads == 0
        
        self.depth = d_model // num_heads
        
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        
        self.dense = layers.Dense(d_model)
    
    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask=None):
        # Note the argument order: value, key, query
        batch_size = tf.shape(q)[0]
        
        # Linear projections
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        
        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        
        # Scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask
        )
        
        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))
        
        # Final linear projection
        output = self.dense(concat_attention)
        
        return output, attention_weights


def point_wise_feed_forward_network(d_model, dff):
    """Position-wise feed-forward network."""
    return keras.Sequential([
        layers.Dense(dff, activation='relu'),
        layers.Dense(d_model)
    ])

print("Transformer components defined!")

In [None]:
class EncoderLayer(layers.Layer):
    """Single transformer encoder layer."""
    
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)
    
    def call(self, x, training, mask=None):
        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection
        
        # Feed forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection
        
        return out2


class Encoder(layers.Layer):
    """Transformer encoder (stack of encoder layers)."""
    
    def __init__(self, num_layers, d_model, num_heads, dff, 
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super(Encoder, self).__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers
        
        self.embedding = layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, dropout_rate) 
                           for _ in range(num_layers)]
        
        self.dropout = layers.Dropout(dropout_rate)
    
    def call(self, x, training, mask=None):
        seq_len = tf.shape(x)[1]
        
        # Embedding + positional encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        
        x = self.dropout(x, training=training)
        
        # Pass through encoder layers
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
        
        return x

print("Encoder defined!")

In [None]:
class DecoderLayer(layers.Layer):
    """Single transformer decoder layer."""
    
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(DecoderLayer, self).__init__()
        
        self.mha1 = MultiHeadAttention(d_model, num_heads)  # Masked self-attention
        self.mha2 = MultiHeadAttention(d_model, num_heads)  # Cross-attention
        
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)
        self.dropout3 = layers.Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, 
             look_ahead_mask=None, padding_mask=None):
        # Masked multi-head attention (self-attention)
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)
        
        # Multi-head attention (cross-attention to encoder output)
        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask
        )
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)
        
        # Feed forward
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)
        
        return out3, attn_weights_block1, attn_weights_block2


class Decoder(layers.Layer):
    """Transformer decoder (stack of decoder layers)."""
    
    def __init__(self, num_layers, d_model, num_heads, dff, 
                 target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super(Decoder, self).__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers
        
        self.embedding = layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate)
                           for _ in range(num_layers)]
        
        self.dropout = layers.Dropout(dropout_rate)
    
    def call(self, x, enc_output, training, 
             look_ahead_mask=None, padding_mask=None):
        seq_len = tf.shape(x)[1]
        attention_weights = {}
        
        # Embedding + positional encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        
        x = self.dropout(x, training=training)
        
        # Pass through decoder layers
        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](
                x, enc_output, training, look_ahead_mask, padding_mask
            )
            
            attention_weights[f'decoder_layer{i+1}_block1'] = block1
            attention_weights[f'decoder_layer{i+1}_block2'] = block2
        
        return x, attention_weights

print("Decoder defined!")

In [None]:
class Transformer(keras.Model):
    """Complete transformer model."""
    
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, 
                 pe_input, pe_target, dropout_rate=0.1):
        super(Transformer, self).__init__()
        
        self.encoder = Encoder(num_layers, d_model, num_heads, dff, 
                               input_vocab_size, pe_input, dropout_rate)
        
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, 
                               target_vocab_size, pe_target, dropout_rate)
        
        self.final_layer = layers.Dense(target_vocab_size)
    
    def call(self, inputs, training):
        inp, tar = inputs
        
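        # Padding masks are omitted in this lab for simplicity, so padded
        # positions are attended to like real tokens
        enc_padding_mask = None
        look_ahead_mask = self.create_look_ahead_mask(tf.shape(tar)[1])
        dec_padding_mask = None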
        
        # Encoder
        enc_output = self.encoder(inp, training, enc_padding_mask)
        
        # Decoder
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask
        )
        
        # Final linear layer
        final_output = self.final_layer(dec_output)
        
        return final_output, attention_weights
    
    def create_look_ahead_mask(self, size):
        """Create mask to prevent attending to future tokens."""
        mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
        return mask

print("Complete Transformer defined!")
print("\nArchitecture Summary:")
print("- Encoder: Multi-head self-attention + Feed-forward")
print("- Decoder: Masked self-attention + Cross-attention + Feed-forward")
print("- Residual connections and layer normalization throughout")
print("- Positional encoding for sequence order")

## Part 2: BERT - Bidirectional Encoder Representations

**BERT** (2018) uses only the encoder part of the transformer and is pre-trained with two objectives:

### Pre-training Tasks:
1. **Masked Language Modeling (MLM)**: Predict masked tokens
   - Input: "The [MASK] is on the table"
   - Target: Predict "cat"

2. **Next Sentence Prediction (NSP)**: Predict if sentence B follows A
   - Input: "[CLS] The cat sat. [SEP] It was happy. [SEP]"
   - Target: IsNext / NotNext
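
To make the MLM objective concrete, here is a small sketch of how training inputs can be corrupted. It assumes the recipe from the original BERT paper (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged) and uses word-level tokens with a hypothetical toy vocabulary; `mask_tokens` is an illustrative helper, not a library function.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with the mask token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            corrupted.append(tok)
    return corrupted, labels

vocab = ["the", "cat", "dog", "is", "on", "table", "mat"]  # hypothetical toy vocabulary
tokens = "the cat is on the table".split()
# mask_prob is raised to 0.5 only so that a six-token example shows some masking
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

In a real pipeline the special tokens (`[CLS]`, `[SEP]`, `[PAD]`) would be excluded from masking.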

### Key Features:
- **Bidirectional**: Sees context from both directions
- **Pre-trained**: On massive text corpora (Wikipedia, BookCorpus)
- **Fine-tunable**: For downstream tasks (classification, NER, Q&A)

### Special Tokens:
- `[CLS]`: Classification token (first token)
- `[SEP]`: Separator between sentences
- `[MASK]`: Masked token for MLM
- `[PAD]`: Padding token
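
The sketch below shows how these special tokens fit together for a sentence pair. It is a simplified illustration that assumes word-level tokens (real BERT uses WordPiece subwords); `build_bert_input` is a hypothetical helper.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=12):
    """Assemble tokens, segment ids, and attention mask for one example."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)           # segment 0 covers [CLS], sentence A, first [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment 1 covers sentence B and its [SEP]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate the inputs first")
    attention_mask = [1] * len(tokens)        # 1 = real token, 0 = padding
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask

tokens, segment_ids, attention_mask = build_bert_input(
    ["the", "cat", "sat"], ["it", "was", "happy"], max_len=12
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

The attention mask is what lets the model ignore `[PAD]` positions, and the hidden state at `[CLS]` is the one typically fed to a classification head.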

In [None]:
# Simulate masked language modeling
class SimpleBERT:
    """Simplified BERT-style model for demonstration."""
    
    def __init__(self, vocab_size, d_model=128, num_heads=4, num_layers=2):
        self.vocab_size = vocab_size
        self.d_model = d_model
        
        # Build encoder-only model
        self.encoder = Encoder(
            num_layers=num_layers,
            d_model=d_model,
            num_heads=num_heads,
            dff=d_model * 4,
            input_vocab_size=vocab_size,
            maximum_position_encoding=512,
            dropout_rate=0.1
        )
        
        # MLM head
        self.mlm_head = keras.Sequential([
            layers.Dense(d_model, activation='gelu'),
            layers.LayerNormalization(epsilon=1e-6),
            layers.Dense(vocab_size)
        ])
    
    def create_model(self):
        """Create Keras model for BERT."""
        inputs = layers.Input(shape=(None,), dtype=tf.int32)
        
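        # Encoder (training=None lets Keras decide, so dropout is active
        # only during training and not at inference)
        encoder_output = self.encoder(inputs, training=None)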
        
        # MLM predictions
        mlm_output = self.mlm_head(encoder_output)
        
        model = keras.Model(inputs=inputs, outputs=mlm_output)
        return model

# Create toy BERT model
bert = SimpleBERT(vocab_size=1000, d_model=128, num_heads=4, num_layers=2)
bert_model = bert.create_model()

print("BERT Model Architecture:")
bert_model.summary()  # summary() prints the table itself and returns None

print("\nBERT Use Cases:")
print("- Text Classification (sentiment, topic)")
print("- Named Entity Recognition (NER)")
print("- Question Answering")
print("- Sentence Similarity")
print("- Text Embeddings")

## Part 3: GPT - Generative Pre-trained Transformer

**GPT** uses only the decoder stack of the transformer, without the cross-attention sub-layer, and applies **causal** (autoregressive) self-attention.

### Pre-training:
- **Causal Language Modeling**: Predict next token
- Input: "The cat sat on"
- Target: Predict "the"
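
In practice every position in a sequence provides a training example: the targets are the inputs shifted left by one token. Here is a minimal sketch using word-level tokens; `make_causal_lm_example` is a hypothetical helper.

```python
def make_causal_lm_example(token_seq):
    """Inputs are all tokens but the last; targets are the sequence shifted left by one."""
    inputs = token_seq[:-1]
    targets = token_seq[1:]
    return inputs, targets

tokens = ["The", "cat", "sat", "on", "the"]
inputs, targets = make_causal_lm_example(tokens)

# Each prefix of the input predicts the token that follows it
for i, target in enumerate(targets):
    context = " ".join(inputs[: i + 1])
    print(f"{context!r} -> {target!r}")
```

The causal mask is what makes this efficient: all of these prefix predictions are computed in a single forward pass without any position seeing its own target.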

### Key Differences from BERT:
- **Unidirectional**: Only sees previous tokens (left-to-right)
- **Generative**: Can generate coherent text
- **Autoregressive**: Generates one token at a time
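
The autoregressive loop itself is simple: score the next token, append the best one, repeat. The sketch below uses greedy decoding with a hypothetical bigram table standing in for the model. A real GPT conditions on the whole prefix rather than only the last token, and usually samples instead of always taking the top-scoring token.

```python
# Hypothetical toy "model": next-token scores given only the last token
TOY_BIGRAMS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "mat": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def toy_next_token_scores(tokens):
    """Stand-in for a model forward pass over the current prefix."""
    return TOY_BIGRAMS[tokens[-1]]

def generate_greedy(next_token_scores, prompt, max_new_tokens=10, eos="</s>"):
    """Append the highest-scoring next token until EOS or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        next_tok = max(scores, key=scores.get)
        if next_tok == eos:
            break
        tokens.append(next_tok)
    return tokens

generated = generate_greedy(toy_next_token_scores, ["<s>"])
print(generated)  # ['<s>', 'the', 'cat', 'sat']
```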

### Evolution:
- **GPT** (2018): 117M parameters
- **GPT-2** (2019): 1.5B parameters
- **GPT-3** (2020): 175B parameters, few-shot learning
- **GPT-4** (2023): Multimodal, improved reasoning

In [None]:
class SimpleGPT:
    """Simplified GPT-style model for demonstration."""
    
    def __init__(self, vocab_size, d_model=128, num_heads=4, num_layers=2):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_heads = num_heads
    
    def create_model(self, max_length=128):
        """Create Keras model for GPT."""
        inputs = layers.Input(shape=(None,), dtype=tf.int32)
        
        # Embedding
        x = layers.Embedding(self.vocab_size, self.d_model)(inputs)
        
        # Positional encoding
        pos_enc = positional_encoding(max_length, self.d_model)
        seq_len = tf.shape(x)[1]
        x += pos_enc[:, :seq_len, :]
        
        # Decoder layers (with causal masking)
        for _ in range(self.num_layers):
            # Multi-head causal self-attention
            attn_output = layers.MultiHeadAttention(
                num_heads=self.num_heads, 
                key_dim=self.d_model // self.num_heads,
                dropout=0.1
            )(x, x, use_causal_mask=True)
            
            x = layers.LayerNormalization(epsilon=1e-6)(x + attn_output)
            
            # Feed-forward
            ffn_output = layers.Dense(self.d_model * 4, activation='gelu')(x)
            ffn_output = layers.Dense(self.d_model)(ffn_output)
            x = layers.LayerNormalization(epsilon=1e-6)(x + ffn_output)
        
        # Output layer
        outputs = layers.Dense(self.vocab_size)(x)
        
        model = keras.Model(inputs=inputs, outputs=outputs)
        return model

# Create toy GPT model
gpt = SimpleGPT(vocab_size=1000, d_model=128, num_heads=4, num_layers=2)
gpt_model = gpt.create_model()

print("GPT Model Architecture:")
gpt_model.summary()  # summary() prints the table itself and returns None

print("\nGPT Use Cases:")
print("- Text Generation (stories, articles, code)")
print("- Text Completion")
print("- Conversational AI (ChatGPT)")
print("- Few-shot Learning (with prompting)")
print("- Code Generation (GitHub Copilot)")

## Part 4: Using Hugging Face Transformers

**Hugging Face** provides easy access to thousands of pre-trained models.

### Installation:
```bash
pip install transformers tokenizers datasets
```

### Key Components:
- **Models**: Pre-trained transformer models
- **Tokenizers**: Convert text to tokens
- **Pipelines**: High-level API for common tasks
- **Datasets**: Curated datasets for NLP
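
To see what a tokenizer does under the hood, here is a minimal sketch of greedy longest-match-first subword splitting in the style of BERT's WordPiece. The vocabulary is hypothetical, and real tokenizers also handle normalization, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"[UNK]", "the", "trans", "##form", "##er", "##s", "token", "##ize"}
text = "the transformers tokenize"
tokens = [p for w in text.split() for p in wordpiece_tokenize(w, vocab)]
token_to_id = {tok: i for i, tok in enumerate(sorted(vocab))}
ids = [token_to_id[t] for t in tokens]
print(tokens)
print(ids)
```

Splitting rare words into known pieces keeps the vocabulary small while still representing words never seen during training.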

In [None]:
# Install if needed (uncomment):
# !pip install transformers tokenizers datasets -q

try:
    from transformers import pipeline, AutoTokenizer, AutoModel
    
    print("Transformers library installed!")
    print("\nAvailable Pipeline Tasks:")
    print("- text-classification: Sentiment, topic classification")
    print("- ner: Named Entity Recognition")
    print("- question-answering: Answer questions from context")
    print("- summarization: Text summarization")
    print("- translation: Machine translation")
    print("- text-generation: Generate text")
    print("- fill-mask: Fill masked tokens (BERT-style)")
    print("- zero-shot-classification: Classify without training")
    
except ImportError:
    print("Please install transformers: pip install transformers tokenizers")

In [None]:
# Example 1: Sentiment Analysis
try:
    print("Loading sentiment analysis pipeline...")
    sentiment_analyzer = pipeline("sentiment-analysis")
    
    texts = [
        "I love this movie! It's absolutely fantastic.",
        "This is the worst product I've ever bought.",
        "The weather is okay today, nothing special."
    ]
    
    print("\nSentiment Analysis Results:\n")
    for text in texts:
        result = sentiment_analyzer(text)[0]
        print(f"Text: {text}")
        print(f"Sentiment: {result['label']}, Score: {result['score']:.3f}\n")
        
except Exception as e:
    print(f"Error: {e}")
    print("Note: First run may download models (can be slow)")

In [None]:
# Example 2: Named Entity Recognition (NER)
try:
    print("Loading NER pipeline...")
    # aggregation_strategy replaces the deprecated grouped_entities flag
    ner = pipeline("ner", aggregation_strategy="simple")
    
    text = """Apple Inc. was founded by Steve Jobs in Cupertino, California. 
    The company is now led by Tim Cook."""
    
    entities = ner(text)
    
    print("\nNamed Entity Recognition:\n")
    print(f"Text: {text}\n")
    print("Entities found:")
    for entity in entities:
        print(f"  {entity['word']:20s} -> {entity['entity_group']:10s} (score: {entity['score']:.3f})")
        
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Example 3: Question Answering
try:
    print("Loading Q&A pipeline...")
    qa = pipeline("question-answering")
    
    context = """
    The transformer architecture was introduced in the paper 'Attention is All You Need' 
    in 2017. It uses self-attention mechanisms to process sequences in parallel, 
    unlike RNNs which process sequentially. This makes transformers much faster to train 
    and better at capturing long-range dependencies.
    """
    
    questions = [
        "When was the transformer architecture introduced?",
        "What mechanism do transformers use?",
        "Why are transformers faster than RNNs?"
    ]
    
    print("\nQuestion Answering:\n")
    for question in questions:
        result = qa(question=question, context=context)
        print(f"Q: {question}")
        print(f"A: {result['answer']} (score: {result['score']:.3f})\n")
        
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Example 4: Text Generation
try:
    print("Loading text generation pipeline...")
    generator = pipeline("text-generation", model="gpt2")
    
    prompts = [
        "Artificial intelligence is",
        "The future of machine learning",
        "In a world where robots"
    ]
    
    print("\nText Generation (GPT-2):\n")
    for prompt in prompts:
        result = generator(prompt, max_length=50, num_return_sequences=1)[0]
        print(f"Prompt: {prompt}")
        print(f"Generated: {result['generated_text']}\n")
        
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Example 5: Fill Mask (BERT-style)
try:
    print("Loading fill-mask pipeline...")
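    # Pin a BERT checkpoint: the pipeline's default model uses a different
    # mask token, so sentences containing [MASK] would fail
    unmasker = pipeline("fill-mask", model="bert-base-uncased")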
    
    sentences = [
        "The [MASK] is shining brightly today.",
        "Python is a popular [MASK] language.",
        "Transformers use [MASK] mechanisms."
    ]
    
    print("\nFill Mask (BERT):\n")
    for sentence in sentences:
        results = unmasker(sentence, top_k=3)
        print(f"Sentence: {sentence}")
        print("Predictions:")
        for i, result in enumerate(results, 1):
            print(f"  {i}. {result['token_str']:15s} (score: {result['score']:.3f})")
        print()
        
except Exception as e:
    print(f"Error: {e}")

## Part 5: Fine-tuning BERT for Text Classification

Fine-tuning adapts a pre-trained model to a specific task with your data.

### Steps:
1. Load pre-trained model
2. Add task-specific head (e.g., classification layer)
3. Train on your dataset with a small learning rate
4. Evaluate on validation set

### Tips:
- Use a small learning rate (1e-5 to 5e-5)
- Train for only a few epochs (2-4 is usually sufficient)
- Monitor for overfitting
- Use learning rate warmup
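
Learning rate warmup is easy to picture with a small schedule function. This sketch assumes linear warmup followed by linear decay to zero, a common choice for fine-tuning; the function name and default values are illustrative.

```python
def lr_with_warmup(step, total_steps, warmup_steps=10, peak_lr=2e-5):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

total_steps = 100
for step in [0, 5, 10, 55, 100]:
    print(f"step {step:3d}: lr = {lr_with_warmup(step, total_steps):.2e}")
```

Starting from a tiny learning rate keeps the randomly initialized task head from pushing large, noisy gradients into the pre-trained weights during the first updates.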

In [None]:
# Simulate fine-tuning BERT for sentiment classification
try:
    from transformers import BertTokenizer, TFBertForSequenceClassification
    
    # Sample dataset
    train_texts = [
        "I love this product!",
        "This is terrible.",
        "Amazing quality!",
        "Worst purchase ever.",
        "Highly recommended!",
        "Complete waste of money."
    ]
    train_labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative
    
    # Load tokenizer and model
    print("Loading BERT model...")
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = TFBertForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=2
    )
    
    # Tokenize
    encodings = tokenizer(
        train_texts,
        truncation=True,
        padding=True,
        max_length=128,
        return_tensors='tf'
    )
    
    # Compile
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=2e-5),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )
    
    print("\nFine-tuning BERT...")
    # Note: This is just demonstration; real fine-tuning needs more data
    history = model.fit(
        dict(encodings),
        np.array(train_labels),
        epochs=3,
        batch_size=2,
        verbose=0
    )
    
    print(f"\nFinal training accuracy: {history.history['accuracy'][-1]:.3f}")
    
    # Test predictions
    test_texts = ["This is fantastic!", "I hate this."]
    test_encodings = tokenizer(
        test_texts,
        truncation=True,
        padding=True,
        return_tensors='tf'
    )
    
    predictions = model.predict(dict(test_encodings), verbose=0)
    predicted_classes = tf.argmax(predictions.logits, axis=1)
    
    print("\nTest Predictions:")
    for text, pred in zip(test_texts, predicted_classes):
        sentiment = "Positive" if pred == 1 else "Negative"
        print(f"'{text}' -> {sentiment}")
        
except Exception as e:
    print(f"Error: {e}")
    print("\nNote: This example requires transformers with TensorFlow.")
    print("Install with: pip install transformers[tf]")

## Part 6: Evaluation Metrics for Language Models

Different tasks require different evaluation metrics.

### Classification Metrics:
- **Accuracy**: Fraction of correct predictions
- **Precision**: TP / (TP + FP)
- **Recall**: TP / (TP + FN)
- **F1-Score**: Harmonic mean of precision and recall
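
The formulas above can be checked on a small example. This is a plain-Python sketch for binary labels; `precision_recall_f1` is a hypothetical helper and the labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.625 precision=0.600 recall=0.750 f1=0.667
```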

### Language Model Metrics:
- **Perplexity**: exp(average cross-entropy loss)
  - Lower is better
  - Measures how "surprised" the model is
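
A useful way to read perplexity: a model that guesses uniformly among k tokens has perplexity k. The sketch below computes perplexity from the probabilities a model assigned to the correct tokens; the probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Uniform guessing over a 1000-token vocabulary gives perplexity 1000
uniform = perplexity([1 / 1000] * 20)

# A more confident model: average loss of 1.5 * ln(2), so perplexity 2**1.5
confident = perplexity([0.5, 0.25, 0.5, 0.25])

print(f"uniform: {uniform:.1f}, confident: {confident:.3f}")
```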

### Translation Metrics:
- **BLEU** (Bilingual Evaluation Understudy):
  - Compares n-gram overlap with reference translations
  - Score: 0-1, often reported as 0-100 (higher is better)
  - BLEU-4 (n-grams up to length 4) is the most common variant

### Summarization Metrics:
- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation):
  - ROUGE-N: N-gram overlap
  - ROUGE-L: Longest common subsequence
  - Higher is better
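
Here is a simplified sketch of ROUGE-N recall and ROUGE-L, assuming whitespace tokenization, a single reference, and an equally weighted F1 for ROUGE-L. Official implementations add options such as stemming, so scores may differ.

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum((ref_ngrams & cand_ngrams).values())  # clipped matches
    total = sum(ref_ngrams.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    """F1 of LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The cat is sitting on the mat"
candidate = "The cat sits on the mat"
r1 = rouge_n_recall(reference, candidate, n=1)
r2 = rouge_n_recall(reference, candidate, n=2)
rl = rouge_l_f1(reference, candidate)
print(f"ROUGE-1 recall: {r1:.3f}, ROUGE-2 recall: {r2:.3f}, ROUGE-L F1: {rl:.3f}")
# ROUGE-1 recall: 0.714, ROUGE-2 recall: 0.500, ROUGE-L F1: 0.769
```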

In [None]:
# Calculate perplexity
def calculate_perplexity(loss):
    """Calculate perplexity from cross-entropy loss."""
    return np.exp(loss)

# Example losses and perplexities
losses = [0.5, 1.0, 2.0, 3.0, 4.0]
perplexities = [calculate_perplexity(loss) for loss in losses]

print("Perplexity Examples:\n")
print("Cross-Entropy Loss | Perplexity")
print("-" * 35)
for loss, ppl in zip(losses, perplexities):
    print(f"{loss:18.1f} | {ppl:10.2f}")

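print("\nInterpretation:")
print("- Lower perplexity = better model")
print("- Perplexity 1.0 is the minimum (every token predicted with certainty)")
print("- Perplexity k: as uncertain as a uniform choice among k tokens")
print("- Only compare perplexities measured with the same tokenizer and test set")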

In [None]:
# Simple BLEU score implementation
from collections import Counter

def compute_bleu(reference, candidate, n=4):
    """
    Compute BLEU score (simplified version).
    
    Args:
        reference: Reference translation (string)
        candidate: Candidate translation (string)
        n: Maximum n-gram size (default: 4)
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    scores = []
    
    for i in range(1, n + 1):
        # Create n-grams
        ref_ngrams = [tuple(ref_tokens[j:j+i]) for j in range(len(ref_tokens) - i + 1)]
        cand_ngrams = [tuple(cand_tokens[j:j+i]) for j in range(len(cand_tokens) - i + 1)]
        
        if len(cand_ngrams) == 0:
            scores.append(0)
            continue
        
        # Count matches
        ref_counts = Counter(ref_ngrams)
        cand_counts = Counter(cand_ngrams)
        
        matches = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        
        scores.append(matches / total if total > 0 else 0)
    
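    # Geometric mean of n-gram precisions. There is no smoothing here, so a
    # single zero precision (common for short sentences) makes BLEU zero.
    if min(scores) > 0:
        bleu = np.exp(np.mean([np.log(s) for s in scores]))
    else:
        bleu = 0
    
    # Brevity penalty (an empty candidate scores 0, avoiding division by zero)
    if len(cand_tokens) == 0:
        return 0.0
    bp = 1.0 if len(cand_tokens) >= len(ref_tokens) else np.exp(1 - len(ref_tokens) / len(cand_tokens))
    
    return bp * bleu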

# Test BLEU
reference = "The cat is sitting on the mat"
candidates = [
    "The cat is sitting on the mat",  # Perfect match
    "The cat sits on the mat",        # Close, but no matching 4-gram, so BLEU is 0
    "A cat is on a mat",               # Moderate, also 0 without smoothing
    "Dog runs in park",                 # Poor
]

print("BLEU Score Examples:\n")
print(f"Reference: {reference}\n")
for candidate in candidates:
    bleu = compute_bleu(reference, candidate)
    print(f"Candidate: {candidate}")
    print(f"BLEU: {bleu:.3f}\n")

## Part 7: Practical Considerations

### Model Selection:

**Use BERT when:**
- Classification tasks (sentiment, topic)
- Understanding tasks (Q&A, NER)
- You need bidirectional context
- You have labeled data for fine-tuning

**Use GPT when:**
- Text generation tasks
- Conversational AI
- Few-shot learning with prompts
- Creative writing

**Use T5 when:**
- You want a unified text-to-text approach
- Translation, summarization
- You need flexibility across tasks

### Computational Considerations:
- **Small models** (BERT-base, DistilBERT): Laptop-friendly
- **Large models** (BERT-large, GPT-3): Need GPUs
- **Inference optimization**: Quantization, distillation, ONNX
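
A quick back-of-the-envelope estimate shows why model size and numeric precision matter. This sketch counts memory for the weights only, using approximate parameter counts; activations, optimizer state, and framework overhead come on top.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="float32"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Approximate parameter counts
models = {"DistilBERT": 66e6, "BERT-base": 110e6, "BERT-large": 340e6, "GPT-3": 175e9}

for name, n in models.items():
    sizes = ", ".join(f"{dt}: {weight_memory_gb(n, dt):.2f} GB" for dt in BYTES_PER_PARAM)
    print(f"{name:12s} {sizes}")
```

BERT-base needs roughly 0.44 GB in float32, while GPT-3 needs about 350 GB even in float16. This is also why quantization to int8 is attractive for inference.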

### Best Practices:
1. Start with pre-trained models
2. Use appropriate tokenizer for your model
3. Fine-tune with small learning rates
4. Monitor validation metrics
5. Use data augmentation if data is limited
6. Consider domain adaptation
7. Evaluate on diverse test sets
8. Check for biases in predictions

In [None]:
# Model size comparison
import pandas as pd

models_data = {
    'Model': ['DistilBERT', 'BERT-base', 'BERT-large', 'GPT-2 (small)', 'GPT-3', 'T5-small', 'T5-base'],
    'Parameters': ['66M', '110M', '340M', '124M', '175B', '60M', '220M'],
    'Layers': [6, 12, 24, 12, 96, 6, 12],
    'Hidden Size': [768, 768, 1024, 768, 12288, 512, 768],
    'Use Case': [
        'Fast classification',
        'General NLP',
        'High accuracy needed',
        'Text generation',
        'Few-shot learning',
        'Small tasks',
        'General seq2seq'
    ]
}

df = pd.DataFrame(models_data)
print("Popular Transformer Models:\n")
print(df.to_string(index=False))

print("\n" + "=" * 70)
print("Model Selection Guide:")
print("=" * 70)
print("Laptop/CPU: DistilBERT, BERT-base, T5-small")
print("Single GPU: BERT-large, GPT-2, T5-base")
print("Multiple GPUs: GPT-3, Large T5 variants")
print("=" * 70)

## Key Takeaways

1. **Transformers** rely on attention instead of recurrence
2. **Multi-head attention** runs several attention heads in parallel, each over a different learned projection
3. **Positional encoding** injects sequence order information
4. **BERT** is bidirectional, great for understanding tasks
5. **GPT** is autoregressive, excellent for generation
6. **Pre-training + Fine-tuning** is the dominant paradigm
7. **Hugging Face** provides easy access to state-of-the-art models
8. **Transfer learning** enables training with less data
9. **Perplexity, BLEU, ROUGE** are key evaluation metrics
10. **Model selection** depends on task, data, and resources

## Exercises

1. **Complete Transformer**: Train full transformer on translation
2. **BERT Fine-tuning**: Fine-tune BERT on custom classification task
3. **GPT Text Generation**: Generate creative text with GPT-2
4. **Multi-task Learning**: Train T5 on multiple tasks
5. **Model Comparison**: Compare BERT vs. GPT on same task
6. **Attention Visualization**: Visualize attention patterns
7. **Custom Tokenizer**: Build domain-specific tokenizer
8. **Model Distillation**: Compress BERT to smaller model

## Next Steps

Congratulations! You've completed the Introduction to AI course.

### Further Learning:
- **Advanced Transformers**: Vision Transformers, multimodal models
- **Efficient Models**: Reformer, Linformer, Performer
- **Prompt Engineering**: Optimizing prompts for LLMs
- **RLHF**: Reinforcement Learning from Human Feedback
- **RAG**: Retrieval-Augmented Generation
- **LangChain**: Building LLM applications

### Resources:
- **Hugging Face Course**: https://huggingface.co/course
- **Transformer Paper**: "Attention Is All You Need"
- **BERT Paper**: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- **GPT-3 Paper**: "Language Models are Few-Shot Learners"

Great work completing this course! 🎉