# 🤖 Transformers Review: Architecture & HuggingFace

**Objective:** Understand the Transformer architecture and use HuggingFace for NLP & Vision

**Contents:**
- Transformer architecture fundamentals
- Self-attention mechanism
- HuggingFace Transformers library
- Pre-trained models (BERT, GPT, ViT)
- Pipeline API & inference
- Fine-tuning patterns
- Vision Transformers (ViT, CLIP)

**Level:** Intermediate to Advanced

---

In [None]:
# Installation (if needed)
# !pip install transformers torch torchvision datasets pillow

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from transformers import __version__ as transformers_version

print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers_version}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---

# 1. Transformer Architecture Fundamentals

## 1.1 Core Components

### Transformer = Encoder + Decoder (original)

```
Input → Embedding → Positional Encoding →
  ↓
Encoder (N layers):
  - Multi-Head Self-Attention
  - Add & Norm
  - Feed-Forward Network
  - Add & Norm
  ↓
Decoder (N layers):
  - Masked Multi-Head Self-Attention
  - Add & Norm
  - Cross-Attention (to encoder)
  - Add & Norm
  - Feed-Forward Network
  - Add & Norm
  ↓
Output → Linear → Softmax
```

### Key Innovations
1. **Self-Attention**: All positions attend to all positions
2. **Parallel Processing**: All positions are processed at once within a layer, with no step-by-step recurrence (unlike RNNs)
3. **Positional Encoding**: Without recurrence, token order must be injected explicitly

### Modern Variants
- **Encoder-only**: BERT (bidirectional, good for understanding)
- **Decoder-only**: GPT (unidirectional, good for generation)
- **Encoder-Decoder**: T5, BART (seq2seq tasks)
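
In practice, the difference between the bidirectional (encoder-only) and unidirectional (decoder-only) variants comes down to the attention mask. A minimal sketch, where `seq_len` is an assumed toy value:

```python
import torch

seq_len = 4  # toy sequence length, chosen only for illustration

# Encoder-style (bidirectional): every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (causal): position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask.int())
```

Row `i` of the mask lists which positions token `i` is allowed to see; a `False` entry is typically turned into a very negative score before the softmax.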

## 1.2 Self-Attention Mechanism

### Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ = Query (what I'm looking for)
- $K$ = Key (what I have)
- $V$ = Value (what I return)
- $d_k$ = dimension of keys (scaling factor)

### Intuition
1. Compute similarity between Query and all Keys (dot product)
2. Divide by $\sqrt{d_k}$ (keeps scores from growing with the key dimension and saturating the softmax)
3. Softmax to get attention weights
4. Weighted sum of Values
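
Step 2 is easier to appreciate with numbers. The sketch below assumes query and key components are independent with unit variance (an idealization) and uses an illustrative `d_k = 512`; under that assumption the raw dot products have a standard deviation near $\sqrt{d_k}$, and dividing brings it back to roughly 1.

```python
import torch

torch.manual_seed(0)
d_k = 512  # assumed key dimension for the demo

# Random queries/keys with unit-variance components.
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)   # unscaled dot products
scaled = raw / d_k ** 0.5   # scaled as in the attention formula

print(f"std of raw scores:    {raw.std().item():.2f}")     # close to sqrt(d_k)
print(f"std of scaled scores: {scaled.std().item():.2f}")  # close to 1
```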

In [None]:
# Implement Scaled Dot-Product Attention

class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention
    """
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q: Query (batch, seq_len, d_k)
            K: Key (batch, seq_len, d_k)
            V: Value (batch, seq_len, d_v)
            mask: Optional mask (batch, seq_len, seq_len)
        
        Returns:
            output: (batch, seq_len, d_v)
            attention_weights: (batch, seq_len, seq_len)
        """
        d_k = Q.size(-1)
        
        # Compute attention scores: Q @ K^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(d_k)
        
        # Apply mask (if provided)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights

# Test
batch_size, seq_len, d_model = 2, 4, 8

Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

attention = ScaledDotProductAttention()
attention.eval()  # disable dropout; otherwise the rows of `weights` no longer sum to 1
output, weights = attention(Q, K, V)

print(f"Q shape: {Q.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights (first sample):")
print(weights[0].detach().numpy())
print(f"\nWeights sum to 1: {weights[0].sum(dim=-1)}")
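
As a cross-check on a hand-written implementation, the same formula can be compared against PyTorch's fused `F.scaled_dot_product_attention` (available in PyTorch 2.0 and later). This is a sketch with assumed toy shapes and no dropout or mask:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# (batch, heads, seq_len, d_k); toy sizes chosen for illustration
Q = torch.randn(2, 2, 4, 8)
K = torch.randn(2, 2, 4, 8)
V = torch.randn(2, 2, 4, 8)

# Manual computation, following the formula step by step.
scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
weights = F.softmax(scores, dim=-1)
manual = weights @ V

# Built-in version; dropout_p defaults to 0.0 and no mask is applied.
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-5))
```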

## 1.3 Multi-Head Attention

### Why Multi-Head?

Instead of single attention:
- Learn **multiple attention patterns** in parallel
- Different heads can focus on different aspects (e.g., syntax, semantics)

### Formula

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

where each head:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

In [None]:
# Multi-Head Attention Implementation

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention
    """
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(dropout)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, Q, K, V, mask=None):
        """
        Args:
            Q, K, V: (batch, seq_len, d_model)
        Returns:
            output: (batch, seq_len, d_model)
        """
        batch_size = Q.size(0)
        
        # Linear projections and split into multiple heads
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
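        # Apply attention; a (batch, seq_len, seq_len) mask needs a head axis to broadcast
        if mask is not None and mask.dim() == 3:
            mask = mask.unsqueeze(1)
        output, attention_weights = self.attention(Q, K, V, mask)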
        
        # Concatenate heads
        # (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Final linear projection
        output = self.W_o(output)
        output = self.dropout(output)
        
        return output, attention_weights

# Test
d_model, num_heads = 512, 8
mha = MultiHeadAttention(d_model, num_heads)

X = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
output, weights = mha(X, X, X)

print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\n✅ Multi-head attention with {num_heads} heads")
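
For comparison, PyTorch ships a built-in `nn.MultiheadAttention`. The sketch below uses the same sizes as the test above; note one difference in what it returns: by default the built-in module averages the attention weights over heads, whereas the class above returns them per head.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 512, 8

# batch_first=True makes inputs (batch, seq_len, d_model).
builtin_mha = nn.MultiheadAttention(d_model, num_heads, dropout=0.0, batch_first=True)
builtin_mha.eval()

X = torch.randn(2, 10, d_model)
with torch.no_grad():
    out, attn = builtin_mha(X, X, X)  # self-attention: query = key = value

print(out.shape)   # torch.Size([2, 10, 512])
print(attn.shape)  # torch.Size([2, 10, 10]), averaged over heads by default
```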

## 1.4 Positional Encoding

### Problem
Self-attention is **permutation-equivariant**: shuffling the input tokens just shuffles the outputs the same way, so on its own it carries no position information!

### Solution
Add positional encoding to input embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

### Why Sinusoidal?
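- May extrapolate to sequences longer than those seen during training
- For a fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, so relative positions follow a consistent pattern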
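
The "consistent pattern" for relative positions can be checked numerically: for each frequency, shifting the position by an offset `k` amounts to a 2x2 rotation of the (sin, cos) pair that depends only on `k`. A small sketch with assumed toy values for `d_model`, `pos`, and `k`:

```python
import torch

d_model = 8      # toy embedding size (assumed)
pos, k = 5, 3    # an arbitrary position and offset

# One frequency per (sin, cos) pair, as in the formulas above.
i = torch.arange(0, d_model, 2, dtype=torch.float64)
omega = 1.0 / (10000.0 ** (i / d_model))

def pe_pairs(p):
    """Encoding of position p as (d_model/2, 2) rows of [sin, cos]."""
    return torch.stack([torch.sin(p * omega), torch.cos(p * omega)], dim=-1)

# Rotation matrices that depend only on the offset k, not on pos.
c, s = torch.cos(k * omega), torch.sin(k * omega)
rot = torch.stack([torch.stack([c, s], dim=-1),
                   torch.stack([-s, c], dim=-1)], dim=-2)  # (d_model/2, 2, 2)

shifted = torch.einsum("fij,fj->fi", rot, pe_pairs(pos))
print(torch.allclose(shifted, pe_pairs(pos + k)))
```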

In [None]:
# Positional Encoding

class PositionalEncoding(nn.Module):
    """
    Sinusoidal Positional Encoding
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as buffer (not a parameter)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: (batch, seq_len, d_model)
        """
        return x + self.pe[:, :x.size(1), :]

# Visualize positional encoding
d_model = 128
pe = PositionalEncoding(d_model, max_len=100)

# Get encoding
encoding = pe.pe[0].numpy()  # (max_len, d_model)

plt.figure(figsize=(12, 6))
plt.imshow(encoding.T, cmap='RdBu', aspect='auto')
plt.colorbar()
plt.xlabel('Position', fontsize=12)
plt.ylabel('Dimension', fontsize=12)
plt.title('Positional Encoding Visualization', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("📊 Positional encoding pattern:")
print("   - Alternating sin/cos functions")
print("   - Different frequencies for different dimensions")
print("   - Captures relative position information")
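
A common alternative to the fixed sinusoidal table is a learned position embedding. The class below is a hypothetical minimal sketch (the name `LearnedPositionalEmbedding` and the default `max_len=512` are assumptions for illustration); unlike the sinusoidal version, it cannot handle sequences longer than `max_len`.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Learned position table; max_len caps the usable sequence length."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)  # (seq_len, d_model) broadcasts over batch

x = torch.zeros(2, 10, 64)
out = LearnedPositionalEmbedding(d_model=64)(x)
print(out.shape)
```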

---

# 2. HuggingFace Transformers Library

## 2.1 Core Concepts

### Three Main Components
1. **Models**: Pre-trained architectures (BERT, GPT, T5, ViT, etc.)
2. **Tokenizers**: Convert text to tokens
3. **Pipelines**: High-level API for inference

### Model Hub
- https://huggingface.co/models
- 100,000+ pre-trained models
- Easy to load and fine-tune

In [None]:
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import pipeline

print("✅ HuggingFace Transformers imported successfully")
print("\n💡 Key classes:")
print("   AutoTokenizer: Automatically load correct tokenizer")
print("   AutoModel: Load base model")
print("   AutoModelFor*: Task-specific models")
print("   pipeline: High-level inference API")

## 2.2 Pipeline API (Quickest Way)

### Common Tasks

In [None]:
# Sentiment Analysis
print("1️⃣ Sentiment Analysis")
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie! It's amazing.")
print(f"   Result: {result}")

# Multiple texts
texts = [
    "This is great!",
    "I hate this.",
    "It's okay, nothing special."
]
results = classifier(texts)
print(f"\n   Batch results:")
for text, result in zip(texts, results):
    print(f"   '{text}' → {result['label']} ({result['score']:.3f})")

In [None]:
# Named Entity Recognition (NER)
print("\n2️⃣ Named Entity Recognition")
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
text = "Apple Inc. is located in Cupertino, California. Tim Cook is the CEO."
entities = ner(text)

print(f"   Text: {text}")
print(f"\n   Entities found:")
for entity in entities:
    print(f"   - {entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")

In [None]:
# Question Answering
print("\n3️⃣ Question Answering")
qa = pipeline("question-answering")

context = """The Transformer is a deep learning model introduced in 2017, 
used primarily in the field of natural language processing (NLP). 
It was proposed in the paper 'Attention Is All You Need' by Vaswani et al."""

question = "When was the Transformer introduced?"
answer = qa(question=question, context=context)

print(f"   Question: {question}")
print(f"   Answer: {answer['answer']} (score: {answer['score']:.3f})")

In [None]:
# Text Generation
print("\n4️⃣ Text Generation")
generator = pipeline("text-generation", model="gpt2")

prompt = "The future of artificial intelligence is"
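# do_sample=True is stated explicitly: temperature and multiple return sequences only make sense when sampling
output = generator(prompt, max_length=50, num_return_sequences=2, do_sample=True, temperature=0.7)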

print(f"   Prompt: {prompt}")
print(f"\n   Generated texts:")
for i, result in enumerate(output, 1):
    print(f"   {i}. {result['generated_text']}")
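
The `temperature` argument used above rescales the next-token logits before sampling. A minimal sketch of that step, using made-up logits over a 5-token vocabulary (`sample_next_token` is a hypothetical helper, not part of the library):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical next-token logits over a 5-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])

def sample_next_token(logits, temperature=1.0):
    """Sample one token id from softmax(logits / temperature)."""
    probs = F.softmax(logits / temperature, dim=-1)
    token_id = torch.multinomial(probs, num_samples=1).item()
    return token_id, probs

for t in (0.7, 1.0, 1.5):
    _, probs = sample_next_token(logits, temperature=t)
    print(f"T={t}: top-token prob = {probs.max().item():.3f}")
```

Lower temperatures concentrate probability on the highest-scoring token (more conservative text), while higher temperatures flatten the distribution (more varied text).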

## 2.3 Manual Model Usage (More Control)

In [None]:
# Load model and tokenizer manually
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

print(f"✅ Loaded {model_name}")
print(f"   Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"   Hidden size: {model.config.hidden_size}")
print(f"   Num layers: {model.config.num_hidden_layers}")
print(f"   Num attention heads: {model.config.num_attention_heads}")

In [None]:
# Tokenization
text = "Hello, how are you doing today?"

# Encode text
inputs = tokenizer(text, return_tensors="pt")

print(f"Text: {text}")
print(f"\nTokenized:")
print(f"   Input IDs: {inputs['input_ids']}")
print(f"   Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
print(f"   Attention mask: {inputs['attention_mask']}")

# Decode back
decoded = tokenizer.decode(inputs['input_ids'][0])
print(f"\nDecoded: {decoded}")
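
When several texts of different lengths are tokenized together (for example with `padding=True`), shorter sequences are padded and the attention mask marks which positions are real. The sketch below reproduces that idea by hand with made-up token ids, so it runs without downloading a tokenizer:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token ids for three sentences of different lengths.
# (101/102 mimic BERT's [CLS]/[SEP] ids; the other ids are made up.)
sequences = [
    torch.tensor([101, 7592, 102]),
    torch.tensor([101, 2129, 2024, 2017, 102]),
    torch.tensor([101, 102]),
]
pad_id = 0  # BERT's [PAD] id; other tokenizers may use a different value

# Pad every sequence to the length of the longest one in the batch.
input_ids = pad_sequence(sequences, batch_first=True, padding_value=pad_id)

# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = (input_ids != pad_id).long()

print(input_ids)
print(attention_mask)
```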

In [None]:
# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings
last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
pooler_output = outputs.pooler_output  # (batch, hidden_size) - [CLS] state passed through a dense layer + tanh

print(f"Output shapes:")
print(f"   Last hidden state: {last_hidden_state.shape}")
print(f"   Pooler output: {pooler_output.shape}")

# Use the raw [CLS] hidden state as a simple sentence embedding
sentence_embedding = last_hidden_state[:, 0, :]  # Not identical to pooler_output (no dense + tanh applied)
print(f"   Sentence embedding: {sentence_embedding.shape}")
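
Another frequently used sentence embedding is the mean of the token vectors, taking care to skip padded positions. The sketch below uses random stand-in tensors with BERT-base-like shapes rather than real model outputs; `mean_pool` is a hypothetical helper name.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid divide-by-zero
    return summed / counts

# Stand-in tensors (random values, not real model outputs).
hidden = torch.randn(2, 6, 768)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])

embeddings = mean_pool(hidden, mask)
print(embeddings.shape)
```

With a real model, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs['attention_mask']`.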

## 2.4 Task-Specific Models

In [None]:
# Sequence Classification (e.g., sentiment analysis)
model_cls = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
tokenizer_cls = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

text = "I love this product! It's amazing!"
inputs = tokenizer_cls(text, return_tensors="pt")

with torch.no_grad():
    outputs = model_cls(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
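    prediction = torch.argmax(probs, dim=-1).item()  # plain int: safe for list indexing and f-string formatting

labels = [model_cls.config.id2label[i] for i in range(model_cls.config.num_labels)]  # label names from the model config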
print(f"Text: {text}")
print(f"Prediction: {labels[prediction]} (prob: {probs[0][prediction]:.3f})")
print(f"Probabilities: NEGATIVE={probs[0][0]:.3f}, POSITIVE={probs[0][1]:.3f}")

---

# 3. Vision Transformers (ViT)

## 3.1 ViT Architecture

### Key Idea
Apply Transformer directly to **image patches**:

1. Split image into fixed-size patches (e.g., 16x16 pixels)
2. Flatten each patch into a vector, giving a sequence of patches
3. Linear projection to get patch embeddings
4. Add positional embeddings
5. Apply Transformer encoder
6. Use [CLS] token for classification

```
Image (224x224x3) → Patches (14x14 patches of 16x16)
  → Flatten (196 patches of 768-dim)
  → Add [CLS] token (197 tokens)
  → Transformer Encoder
  → Classification Head (on [CLS])
```
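
The shape bookkeeping in the diagram can be traced in a few lines of PyTorch. This is a sketch on a random stand-in image, not the library's implementation (real implementations commonly use a strided convolution for the patch projection); the parameter names are illustrative. Note that the flattened patch size 3 * 16 * 16 = 768 only coincidentally equals the ViT-Base hidden size.

```python
import torch
import torch.nn as nn

patch_size, hidden_size = 16, 768
image = torch.randn(1, 3, 224, 224)  # random stand-in for a preprocessed image

# Cut into non-overlapping 16x16 patches: (1, 3, 14, 14, 16, 16)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# Flatten each patch: (1, 196, 3*16*16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linear projection to the model width.
projection = nn.Linear(3 * patch_size * patch_size, hidden_size)
tokens = projection(patches)

# Prepend a learnable [CLS] token, then add learnable position embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
pos_embed = nn.Parameter(torch.zeros(1, tokens.size(1) + 1, hidden_size))
tokens = torch.cat([cls_token.expand(tokens.size(0), -1, -1), tokens], dim=1) + pos_embed

print(patches.shape, tokens.shape)
```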

In [None]:
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Load ViT model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model_vit = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

print(f"✅ Loaded ViT model")
print(f"   Patch size: {model_vit.config.patch_size}")
print(f"   Image size: {model_vit.config.image_size}")
print(f"   Hidden size: {model_vit.config.hidden_size}")
print(f"   Num patches: {(model_vit.config.image_size // model_vit.config.patch_size) ** 2}")

In [None]:
# Load a sample image (or create a synthetic one)
# For the demo, create a solid red image
image = Image.new('RGB', (224, 224), color='red')

# In practice:
# image = Image.open('path/to/image.jpg')

# Process image
inputs = processor(images=image, return_tensors="pt")

print(f"Processed image shape: {inputs['pixel_values'].shape}")
print(f"   (batch, channels, height, width)")

# Inference
with torch.no_grad():
    outputs = model_vit(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()

print(f"\nPredicted class ID: {predicted_class}")
print(f"Predicted label: {model_vit.config.id2label[predicted_class]}")

## 3.2 CLIP (Contrastive Language-Image Pre-training)

### Key Idea
- Joint vision-language model
- Image encoder + Text encoder
- Learn shared embedding space
- Zero-shot classification
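
The shared embedding space is learned with a contrastive objective: matching image-text pairs should score higher than mismatched ones. A minimal sketch with random stand-in embeddings (the sizes and the `temperature` value are illustrative assumptions; in CLIP the scale is learned):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim = 4, 32  # toy sizes (assumed)

# Random stand-ins for the outputs of the image and text encoders, L2-normalized.
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch, dim), dim=-1)

temperature = 0.07  # illustrative value
logits_per_image = image_emb @ text_emb.T / temperature  # scaled cosine similarities, (batch, batch)

# Matching pairs sit on the diagonal, so the target class for row i is i.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits_per_image, targets) +
        F.cross_entropy(logits_per_image.T, targets)) / 2

print(logits_per_image.shape, loss.item())
```

Zero-shot classification reuses the same similarity matrix at inference time: one image row is compared against a text embedding per candidate label, and a softmax over that row gives label probabilities.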

In [None]:
from transformers import CLIPProcessor, CLIPModel

# Load CLIP
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor_clip = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

print("✅ Loaded CLIP model")

# Zero-shot image classification
image = Image.new('RGB', (224, 224), color='blue')  # Synthetic image
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor_clip(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model_clip(**inputs)
    logits_per_image = outputs.logits_per_image  # Image-text similarity
    probs = logits_per_image.softmax(dim=1)

print(f"\nZero-shot classification:")
for text, prob in zip(texts, probs[0]):
    print(f"   '{text}': {prob.item():.3f}")

print(f"\n💡 CLIP can classify images with arbitrary text labels!")

---

# 4. Fine-tuning Patterns

## 4.1 Basic Fine-tuning Setup

In [None]:
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

# Example: Fine-tune for sequence classification
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

print("✅ Training configuration:")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Epochs: {training_args.num_train_epochs}")

# Data collator (handles padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,  # Your dataset
#     eval_dataset=eval_dataset,
#     data_collator=data_collator,
# )

# Train
# trainer.train()

print("\n💡 Fine-tuning pattern:")
print("   1. Load pre-trained model")
print("   2. Prepare dataset (tokenize)")
print("   3. Configure TrainingArguments")
print("   4. Create Trainer")
print("   5. trainer.train()")
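
By default the evaluation step only reports the loss. To track a task metric, `Trainer` accepts a `compute_metrics` callable that receives the evaluation logits and labels. A minimal accuracy sketch, checked here on made-up arrays:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair passed in at evaluation time."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Quick check with made-up logits for a 2-class problem.
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.5]])
fake_labels = np.array([0, 1, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75}
```

It would be passed alongside the other arguments, as in `Trainer(..., compute_metrics=compute_metrics)`.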

## 4.2 Common Fine-tuning Strategies

### 1. Full Fine-tuning
- Update all parameters
- Requires more GPU memory
- Usually the strongest results, given enough labeled data

### 2. Freeze Base, Train Head
```python
# Freeze base model
for param in model.base_model.parameters():
    param.requires_grad = False

# Only train classification head
```

### 3. Gradual Unfreezing
- Start with frozen base
- Gradually unfreeze layers
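
One way to sketch gradual unfreezing, using a toy stack of linear layers as a stand-in for a pre-trained encoder (`unfreeze_top`, `count_trainable`, and the one-block-per-epoch schedule are illustrative assumptions):

```python
import torch.nn as nn

# Toy stand-in for a pre-trained encoder (4 blocks) plus a classification head.
encoder = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
head = nn.Linear(16, 2)

def count_trainable(*modules):
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Start with the whole encoder frozen; only the head trains.
for p in encoder.parameters():
    p.requires_grad = False

def unfreeze_top(n):
    """Make the last n encoder blocks trainable (blocks closest to the head first)."""
    for block in encoder[len(encoder) - n:]:
        for p in block.parameters():
            p.requires_grad = True

for epoch in range(3):
    unfreeze_top(epoch)  # epoch 0: head only, epoch 1: +1 block, epoch 2: +2 blocks
    print(f"epoch {epoch}: {count_trainable(encoder, head)} trainable parameters")
    # ... run one epoch of training here ...
```

In a real training loop the optimizer also has to cover the newly unfrozen parameters, for example by constructing it over all parameters up front.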

### 4. LoRA (Low-Rank Adaptation)
- Add small trainable matrices
- Freeze original weights
- Very parameter-efficient

```python
from peft import LoraConfig, get_peft_model

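config = LoraConfig(
    task_type="SEQ_CLS",  # sequence classification
    r=16,  # Rank of the update matrices
    lora_alpha=32,  # Scaling factor (effective scale is lora_alpha / r)
    # Module names are model-specific: BERT uses "query"/"value",
    # DistilBERT (the model loaded in 4.1) uses "q_lin"/"v_lin"
    target_modules=["q_lin", "v_lin"],
    lora_dropout=0.1,
)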

model = get_peft_model(model, config)
```
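
To make "small trainable matrices" concrete, here is a from-scratch sketch of a single LoRA-wrapped linear layer. It is an illustration of the idea, not the `peft` implementation; the class name `LoRALinear` and the 0.01 init scale are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # trainable: 24,576 / 615,168
```

Because `B` starts at zero, the wrapped layer initially reproduces the frozen layer exactly, and only the two low-rank matrices receive gradients.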

---

# 🎯 Key Takeaways

## Transformer Architecture

### Core Concepts
1. **Self-Attention**: All positions attend to all positions
   - Query, Key, Value mechanism
   - Scores divided by $\sqrt{d_k}$ before the softmax
   - Multi-head for different patterns

2. **Positional Encoding**: Inject position information
   - Sinusoidal functions
   - Learnable alternative

3. **Parallel Processing**: No sequential bottleneck
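   - Faster to train than RNNs, since positions are processed in parallel
   - Any two positions are connected in one attention step, which helps with long-range dependencies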

## HuggingFace Ecosystem

### Quick Start
```python
# Pipeline API (easiest)
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this!")
```

### Manual Control
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize
text = "Hello, how are you doing today?"
inputs = tokenizer(text, return_tensors="pt")

# Inference
outputs = model(**inputs)
```

### Fine-tuning
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
```

## Vision Transformers

### ViT (Vision Transformer)
- Split image into patches
- Apply standard Transformer
- Competitive with CNNs when pre-trained on large datasets

### CLIP (Contrastive Learning)
- Joint vision-language model
- Zero-shot classification
- Image-text similarity

## Best Practices

1. **Start with Pipeline API** for quick experiments
2. **Use Auto classes** (AutoTokenizer, AutoModel) for flexibility
3. **Fine-tune from pre-trained** (don't train from scratch)
4. **Monitor GPU memory** (use smaller models or gradient checkpointing)
5. **Use mixed precision** (fp16) for faster training
6. **Consider LoRA** for parameter-efficient fine-tuning

---

**Next:** Docker for ML deployment