
## **Transformers**  


---

### **A Brief History of NLP Advancements**  
1. **Rule-Based Era** (1950s–1990s): Handcrafted grammar rules (e.g., ELIZA)  
2. **Statistical NLP** (1990s–2010s): Hidden Markov Models, TF-IDF  
3. **Neural Networks** (2010s–2017): Word2Vec embeddings, RNNs, LSTMs, GRUs (sequential processing bottlenecks)  
4. **Transformer Revolution** (2017–Present): Parallel processing, self-attention, scalability  

---

### **Why Transformers Revolutionized NLP**  
- **Parallelization**: Process entire sequences at once (vs. sequential RNNs)  
- **Long-Range Dependencies**: Direct token-to-token attention across arbitrary distances  
- **Transfer Learning**: Pretrain on massive corpora, fine-tune for downstream tasks  
- **State-of-the-Art Performance**: Dominance in GLUE, SQuAD, and other benchmarks  

---

### **Core Idea of Transformers**  
Transformers (Vaswani et al., 2017) revolutionized sequence modeling by replacing recurrence (RNNs) and convolution (CNNs) with **self-attention**. This allows:  
- **Parallel processing** of entire sequences (no sequential dependency).  
- **Global context capture**: Directly model relationships between any two tokens in a sequence, regardless of distance.  
- **Scalability**: Any two tokens are linked by a single attention step, which makes long-range dependencies easier to learn (critical for text, time series, or genomic data).  

---

### **Self-Attention: The Engine of Transformers**  

#### **Key Concepts**  
- **Queries (Q), Keys (K), Values (V)**: Learnable representations of the input.  
  - **Q** asks: "What am I looking for?"  
  - **K** answers: "What do I contain?"  
  - **V** provides: "The actual information to aggregate."  

- **Scaled Dot-Product Attention**:  
  $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$  
  - **Softmax**: Turns each query's compatibility scores into weights over the values that sum to 1.  
  - **Scaling by $\sqrt{d_k}$**: Dot products grow in magnitude with $d_k$; without scaling, softmax saturates and its gradients become vanishingly small.  
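
The role of the $\sqrt{d_k}$ factor can be checked numerically. The sketch below is an illustration, not part of the original notebook. It assumes query and key entries are independent standard-normal values, and `dot_product_std` is a hypothetical helper name. Under that assumption the standard deviation of $q \cdot k$ should grow roughly like $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ should bring it back to about 1.

```python
import math
import random
import statistics

def dot_product_std(d_k, trials=1000, seed=0):
    """Empirical std of q.k for vectors with i.i.d. standard-normal entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)  # what the attention formula feeds into softmax
    print(f"d_k={d_k:4d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(d_k))={scaled:.2f}")
```

Large unscaled scores would push softmax toward a near one-hot output, which is the saturation the scaling is meant to avoid.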

#### **Intuition**  
- Each token generates a **query** to "ask" about other tokens.  
- The **keys** determine how much each token "responds" to the query.  
- The **values** are summed based on these weights, creating a context-aware representation.  
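
The three steps above can be sketched for a single head on a tiny input. This is a minimal illustration, assuming Q = K = V = x (no learned projections) and random stand-in embeddings; `scaled_dot_product_attention` here is a local helper written for this example, not a library call.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v for 2-D tensors of shape (seq_len, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / d_k ** 0.5            # (seq_len, seq_len) query-key compatibility
    weights = torch.softmax(scores, dim=-1)  # each row is a distribution over tokens
    return weights @ v, weights              # weighted sum of values, plus the weights

torch.manual_seed(0)
x = torch.randn(3, 4)  # 3 tokens, 4 features each (random stand-ins for embeddings)

# Assumption for brevity: queries, keys, and values are all the raw input
out, weights = scaled_dot_product_attention(x, x, x)
print(weights.sum(dim=-1))  # every row sums to 1
print(out.shape)            # torch.Size([3, 4])
```

Each output row is a mixture of all value rows, which is what makes the representation context-aware.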

---

### **Multi-Head Attention**  
Extends self-attention by running multiple attention operations in parallel:  
1. **Split** Q, K, V into $h$ heads (e.g., $h=8$), each of dimension $d_k = d_{\text{model}}/h$.  
2. **Process independently**: Each head learns different relationship types (e.g., syntactic, semantic).  
3. **Concatenate** outputs and project back to original dimension.  

**Why It Works**:  
- Captures diverse interaction patterns (e.g., long vs. short-range, hierarchical).  
- Analogous to CNN filters learning edges vs. textures, but for sequence relationships.  
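
The split, process, and concatenate steps are mostly tensor reshaping. The following shape walkthrough is a sketch under simplifying assumptions: the sizes are arbitrary, and Q = K = V = x with no learned projections, so it shows the data flow rather than a trained layer.

```python
import torch

batch, seq_len, d_model, h = 2, 5, 16, 4
head_dim = d_model // h  # 4 features per head

x = torch.randn(batch, seq_len, d_model)

# 1. Split: (batch, seq, d_model) -> (batch, h, seq, head_dim)
heads = x.view(batch, seq_len, h, head_dim).transpose(1, 2)

# 2. Process independently: one attention map per head (Q = K = V here for brevity)
scores = heads @ heads.transpose(-2, -1) / head_dim ** 0.5  # (batch, h, seq, seq)
per_head = torch.softmax(scores, dim=-1) @ heads            # (batch, h, seq, head_dim)

# 3. Concatenate: back to (batch, seq, d_model); a final linear projection would follow
merged = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)

print(scores.shape)  # torch.Size([2, 4, 5, 5])
print(merged.shape)  # torch.Size([2, 5, 16])
```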

---

### **Positional Encoding**  
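**Problem**: Self-attention has no built-in notion of order; without extra information it treats a sequence as an unordered set of tokens.  
**Solution**: Inject positional information using sinusoidal functions of the position $pos$ and the dimension index $i$, where $d$ is the embedding size:  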
$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$$  
$$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$$  
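- **Learned Alternatives**: Trainable position embeddings (as used in BERT and GPT).  
- **Key Property**: For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$ (by the angle-addition identities), which makes relative positions easy to attend to.  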
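
The sinusoidal formulas can be evaluated directly. The sketch below uses only the standard library. `sinusoidal_encoding` and `dot` are hypothetical helper names, and `d_model = 8` is an arbitrary small size chosen for readability. It also checks one consequence of the trigonometric structure: the dot product of two encodings depends only on the distance between the positions.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Encoding of position `pos` as a list of length d_model (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 8
print([round(v, 3) for v in sinusoidal_encoding(0, d_model)])
# [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Both pairs are 4 positions apart, so their dot products agree (up to rounding)
near = dot(sinusoidal_encoding(3, d_model), sinusoidal_encoding(7, d_model))
far = dot(sinusoidal_encoding(10, d_model), sinusoidal_encoding(14, d_model))
print(abs(near - far) < 1e-9)  # True
```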

---

### **Transformer vs. RNNs/CNNs**  

| **Aspect**          | **RNNs**               | **CNNs**               | **Transformers**       |  
|----------------------|------------------------|------------------------|------------------------|  
| **Computation**      | Sequential             | Local windows          | Fully parallel         |  
| **Context Range**    | Limited by hidden state | Kernel size            | Global (all tokens)    |  
| **Memory**           | O(n)                   | O(kn)                  | O(n²)                  |  
| **Use Case**         | Short sequences        | Local patterns         | Long-range dependencies|  
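
To make the O(n²) entry concrete, the attention-weight tensor for one sequence can be sized with simple arithmetic. This is a back-of-the-envelope sketch, assuming 8 heads, 32-bit floats, a single layer, and a batch of one; `attention_matrix_mib` is a hypothetical helper, and real implementations may avoid materializing the full matrix.

```python
def attention_matrix_mib(seq_len, heads=8, bytes_per_value=4):
    """Size in MiB of a (heads, seq_len, seq_len) attention-weight tensor."""
    return heads * seq_len * seq_len * bytes_per_value / 2**20

for n in (512, 2048, 8192):
    print(f"n={n:5d}: {attention_matrix_mib(n):8.1f} MiB")
# n=  512:      8.0 MiB
# n= 2048:    128.0 MiB
# n= 8192:   2048.0 MiB
```

Making the sequence 4 times longer multiplies this tensor by 16.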

---



In [1]:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.head_dim = embed_size // heads

        # Projections for Q, K, V
        self.to_qkv = nn.Linear(embed_size, 3 * embed_size)
        self.scale = self.head_dim ** -0.5  # 1/sqrt(d_k)
        
        # Final output projection
        self.to_out = nn.Linear(embed_size, embed_size)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)  # Split into Q, K, V
        q, k, v = map(lambda t: t.view(batch_size, seq_len, self.heads, self.head_dim).transpose(1, 2), qkv)

        # Attention scores
        # @ is matrix multiplication; * scales every score by 1/sqrt(d_k)
        scores = (q @ k.transpose(-2, -1)) * self.scale  # (batch, heads, seq_len, seq_len)
        if mask is not None:  # Block positions where mask == 0 (e.g., future tokens when decoding)
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = torch.softmax(scores, dim=-1)
        context = attn_weights @ v  # Aggregate values

        # Reassemble and project
        context = context.transpose(1, 2).reshape(batch_size, seq_len, self.embed_size)
        return self.to_out(context)

# Example usage
embed_size = 512
heads = 8
x = torch.randn(2, 10, embed_size)  # (batch, sequence length, features)
sa = SelfAttention(embed_size, heads)
output = sa(x)
print(output.shape) 

torch.Size([2, 10, 512])
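
The `mask` argument of `forward` is not exercised in the example above. The sketch below shows one common choice, a causal (lower-triangular) mask, applied with the same `masked_fill` rule. It is self-contained and uses random stand-in scores rather than the `SelfAttention` class.

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)               # stand-in attention scores
scores = scores.masked_fill(causal_mask == 0, -1e9)  # same fill rule as in forward()
weights = torch.softmax(scores, dim=-1)

print(causal_mask)
print(weights)  # entries above the diagonal are numerically zero
```

A mask of shape `(seq_len, seq_len)` broadcasts over the batch and head dimensions, so it should be usable as `sa(x, mask=causal_mask)` when `seq_len` matches the input.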


In [4]:
import torch
import torch.nn as nn

# Longer sentence
sentence = "The black cat sleeps on the carpet while the dog barks in the garden"
# Dict keys are unique, so the repeated "the" collapses into one entry (12 distinct words)
vocab = {word: idx for idx, word in enumerate(sentence.split())}

embed_size = 8
heads = 2

# Random (untrained) word embeddings, one row per distinct word
word_embeddings = torch.randn(len(vocab), embed_size)
x = word_embeddings.unsqueeze(0)  # add a batch dimension: (1, num_words, embed_size)

sa = SelfAttention(embed_size, heads)
with torch.no_grad():
    qkv = sa.to_qkv(x).chunk(3, dim=-1)
    q, k, v = map(lambda t: t.view(1, len(vocab), heads, embed_size//heads).transpose(1, 2), qkv)
    scores = (q @ k.transpose(-2, -1)) * sa.scale
    attention_weights = torch.softmax(scores, dim=-1)

# Show strongest relationships
for i, word1 in enumerate(vocab):
    print(f"\nWord '{word1}' pays most attention to:")
    weights = attention_weights[0, 0, i]  # batch 0, head 0, query word i
    # Get top 3 weights
    top_indices = weights.topk(3).indices
    for idx in top_indices:
        word2 = list(vocab.keys())[idx]
        weight = weights[idx].item()
        print(f"- {word2}: {weight:.2f}")


Word 'The' pays most attention to:
- in: 0.21
- sleeps: 0.12
- carpet: 0.11

Word 'black' pays most attention to:
- black: 0.11
- garden: 0.10
- cat: 0.10

Word 'cat' pays most attention to:
- barks: 0.20
- black: 0.12
- dog: 0.11

Word 'sleeps' pays most attention to:
- in: 0.20
- sleeps: 0.17
- while: 0.10

Word 'on' pays most attention to:
- cat: 0.20
- on: 0.10
- while: 0.10

Word 'the' pays most attention to:
- cat: 0.12
- black: 0.11
- The: 0.10

Word 'carpet' pays most attention to:
- sleeps: 0.13
- cat: 0.13
- in: 0.12

Word 'while' pays most attention to:
- cat: 0.16
- while: 0.10
- sleeps: 0.09

Word 'dog' pays most attention to:
- cat: 0.18
- in: 0.15
- sleeps: 0.10

Word 'barks' pays most attention to:
- cat: 0.27
- on: 0.09
- garden: 0.08

Word 'in' pays most attention to:
- sleeps: 0.14
- The: 0.11
- while: 0.10

Word 'garden' pays most attention to:
- cat: 0.13
- garden: 0.09
- on: 0.09


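**Note:** A word can show up in its own top-3 list because in self-attention every token attends to all tokens in the sequence, including itself. For example:

Word 'black' pays most attention to:
- black: 0.11  # weight the word places on itself

When reading this output, keep in mind:
1. A token's own position is part of its context, so a nonzero self-weight is expected
2. The weights for each word are softmax probabilities: they sum to 1 across all words
3. The embeddings (`torch.randn`) and the `SelfAttention` projections are random and untrained, so the patterns aren't semantically meaningful
4. Only the first of the two heads is printed (`attention_weights[0, 0, i]`)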



### **Why Attention Works: The Big Picture**  
1. **Dynamic Weighting**: Unlike fixed convolution kernels, attention weights are computed from the input, so the model adaptively focuses on relevant tokens.  
2. **Interpretability**: Attention maps can indicate which tokens influence a prediction (e.g., subject-verb agreement in NLP), though they are not a complete explanation.  
3. **Flexibility**: Handles variable-length inputs and cross-modal data (text + images).  

---

### **Limitations & Extensions**  
- **Quadratic Complexity**: Compute and memory grow as O(n²) in sequence length, which becomes impractical for very long sequences (e.g., genome data).  
  - **Fix**: Sparse attention (Longformer, BigBird).  
- **Weak Inductive Bias**: Typically requires more training data than RNNs/CNNs.  
  - **Fix**: Hybrid architectures (Conformer).  
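
One ingredient of sparse attention is a sliding window, where each token attends only to nearby positions. The sketch below counts how many query-key pairs survive under that restriction. It is a simplification: models such as Longformer and BigBird combine local windows with other patterns (for example global tokens), and `sliding_window_pairs` is a hypothetical helper written for this illustration.

```python
def sliding_window_pairs(seq_len, window):
    """Count (query, key) pairs when each token sees keys at most `window` positions away."""
    return sum(
        min(seq_len - 1, i + window) - max(0, i - window) + 1
        for i in range(seq_len)
    )

n, w = 4096, 128
full = n * n
sparse = sliding_window_pairs(n, w)
print(f"full: {full:,} pairs, windowed: {sparse:,} pairs ({sparse / full:.1%} of full)")
```

With a fixed window the pair count grows linearly in sequence length instead of quadratically.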

---

Transformers' generality makes them applicable to **any data with sequential or relational structure**, from text to financial time series, because dependencies between elements are learned through attention rather than fixed by the architecture.

## **Key Transformer Architectures**  
### **BERT (Bidirectional Encoder)**  
- **Pretraining**:  
  - **Masked Language Modeling (MLM)**: Predict masked tokens (e.g., `"The [MASK] sat on the mat"`)  
  - **Next Sentence Prediction (NSP)**: Classify if two sentences are consecutive  
- **Bidirectionality**: Context from left and right (critical for tasks like NER)  
- **Applications**:  
  - Sentiment analysis (economic indicator prediction from news)  
  - Document classification (regulatory compliance checks)  
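
The masking step of MLM can be sketched in a few lines. This is a simplified illustration: every selected token is replaced with `[MASK]`, whereas BERT's actual procedure also sometimes substitutes a random token or leaves the token unchanged. `mask_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict these originals
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)  # position -> original token, for the masked positions only
```

The loss is computed only at the masked positions, using context from both sides of each gap.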

---

### **GPT Series (Autoregressive Decoders)**  
- **Unidirectional**: Predict next token using left-to-right context  
- **Generative Strength**: ChatGPT, code generation, creative writing  
- **Economics Applications**:  
  - Financial report generation  
  - Synthetic data creation for policy simulations  
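
Autoregressive generation is a loop: predict the next token, append it, and repeat. The sketch below shows that loop with greedy decoding. The "model" is a made-up lookup table (`NEXT_TOKEN_PROBS`) that conditions only on the last token, which is an assumption for illustration; a real GPT conditions on the entire left context.

```python
# Hypothetical toy "language model": next-token probabilities given the last token
NEXT_TOKEN_PROBS = {
    "<s>": {"rates": 0.6, "inflation": 0.4},
    "rates": {"rose": 0.7, "fell": 0.3},
    "inflation": {"rose": 0.5, "fell": 0.5},
    "rose": {"</s>": 1.0},
    "fell": {"</s>": 1.0},
}

def greedy_generate(max_len=10):
    """Generate left to right, always choosing the most probable next token."""
    tokens = ["<s>"]
    while len(tokens) < max_len:
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)  # greedy choice
        if next_token == "</s>":                # end-of-sequence marker
            break
        tokens.append(next_token)               # output becomes part of the context
    return tokens[1:]

print(greedy_generate())  # ['rates', 'rose']
```

Sampling from `probs` instead of taking the maximum would give more varied output.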

---

### **BART (Seq2Seq Architecture)**  
- **Denoising Pretraining**: Corrupt text (e.g., masking, permutation), then reconstruct  
- **Encoder-Decoder**: Combines a BERT-style bidirectional encoder with a GPT-style autoregressive decoder  
- **Applications**:  
  - Summarizing economic research papers  
  - Generating executive briefs from lengthy reports  

---

### **T5 (Text-to-Text Framework)**  
- **Unified Framework**: All tasks formatted as `"input: text → output: text"`  
  - Example: `"Translate English to German: Hello → Hallo"`  
- **Applications**:  
  - Multilingual economic survey translation  
  - Converting unstructured data (emails) to structured tables  

---

### **Efficient Models**  
| **Model**     | **Key Innovation**                                                  | **Use Case**                |  
|---------------|---------------------------------------------------------------------|-----------------------------|  
| **DistilBERT**| Knowledge distillation (40% smaller, 60% faster)                    | Real-time sentiment analysis|  
| **ALBERT**    | Cross-layer parameter sharing (~89% fewer params than BERT-base)    | Low-resource language tasks |  
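
Knowledge distillation trains a small student to match a larger teacher's output distribution. The sketch below shows the soft-target loss in the style of Hinton et al. (2015). It is only one component (DistilBERT's training combines it with other loss terms), and the logits and the temperature value are made up for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2

teacher_logits = torch.tensor([[4.0, 1.0, 0.5]])
close_student = torch.tensor([[3.5, 1.2, 0.4]])  # roughly agrees with the teacher
far_student = torch.tensor([[0.5, 1.0, 4.0]])    # prefers a different class

print(distillation_loss(close_student, teacher_logits).item())
print(distillation_loss(far_student, teacher_logits).item())  # larger loss
```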

---




## Fine-Tuning BERT for a Specific Task

Fine-tuning adapts a large pre-trained model such as BERT (Devlin et al., 2018) to a particular downstream task, for example text classification. The code below shows a basic setup using Hugging Face's Transformers library, first for binary and then for multiclass (three-label) sentiment classification:



In [1]:
#!pip install transformers accelerate --upgrade

In [8]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128, device='cpu'):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.device = device
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten().to(self.device),
            'attention_mask': encoding['attention_mask'].flatten().to(self.device),
            'label': torch.tensor(label, dtype=torch.long, device=self.device)
        }

def train_sentiment_model(data, device='cuda' if torch.cuda.is_available() else 'cpu'):
    # Initialize BERT
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    model = model.to(device)
    
    # Create dataset
    dataset = SentimentDataset(
        data['text'].values, 
        data['label'].values, 
        tokenizer,
        device=device
    )
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    
    # Training setup
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    
    # Training loop
    num_epochs = 10
    print(f"Starting training on device: {device}")
    
    for epoch in range(num_epochs):
        model.train()
        progress_bar = tqdm(dataloader, desc=f'Epoch {epoch + 1}/{num_epochs}')
        
        for batch in progress_bar:
            optimizer.zero_grad()
            
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['label']
            )
            
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            
            progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    print("Training completed!")
    return model, tokenizer

# Sample data (a toy set for illustration; far too small to train a reliable classifier)
data = {
    'text': [
        'This movie was great!',
        'Terrible waste of time',
        'I loved this film',
        'Worst movie ever',
        'Highly recommended',
    ],
    'label': [1, 0, 1, 0, 1]  # 1 for positive, 0 for negative
}
df = pd.DataFrame(data)

# Train model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model, tokenizer = train_sentiment_model(df, device)

# Test prediction
def predict_sentiment(text, model, tokenizer, device):
    model.eval()
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    return {
        'positive': prediction[0][1].item(),
        'negative': prediction[0][0].item()
    }

# Test the model
test_text = "This is an amazing movie!"
prediction = predict_sentiment(test_text, model, tokenizer, device)
print(f"\nTest prediction for '{test_text}':")
print(f"Positive: {prediction['positive']:.2%}")
print(f"Negative: {prediction['negative']:.2%}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training on device: cuda


Training completed!

Test prediction for 'This is an amazing movie!':
Positive: 88.86%
Negative: 11.14%


In [12]:
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128, device='cpu'):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.device = device
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_len,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten().to(self.device),
            'attention_mask': encoding['attention_mask'].flatten().to(self.device),
            'label': torch.tensor(label, dtype=torch.long, device=self.device)
        }

def train_sentiment_model(data, device='cuda' if torch.cuda.is_available() else 'cpu'):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
    model = model.to(device)
    
    dataset = SentimentDataset(
        data['text'].values, 
        data['label'].values, 
        tokenizer,
        device=device
    )
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    
    num_epochs = 10
    print(f"Starting training on device: {device}")
    
    for epoch in range(num_epochs):
        model.train()
        progress_bar = tqdm(dataloader, desc=f'Epoch {epoch + 1}/{num_epochs}')
        
        for batch in progress_bar:
            optimizer.zero_grad()
            
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['label']
            )
            
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            
            progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    print("Training completed!")
    return model, tokenizer

# Sample data with neutral class
data = {
    'text': [
        'This movie was great!',
        'Terrible waste of time',
        'The film was okay, nothing special',
        'I loved this film',
        'Worst movie ever',
        'It was decent, had its moments',
        'Highly recommended',
        'Average movie, wouldn\'t watch again',
    ],
    'label': [2, 0, 1, 2, 0, 1, 2, 1]  # 0: negative, 1: neutral, 2: positive
}
df = pd.DataFrame(data)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model, tokenizer = train_sentiment_model(df, device)

def predict_sentiment(text, model, tokenizer, device):
    model.eval()
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    return {
        'negative': prediction[0][0].item(),
        'neutral': prediction[0][1].item(),
        'positive': prediction[0][2].item()
    }

# Test prediction
test_text = "This movie was interesting but had some flaws"
prediction = predict_sentiment(test_text, model, tokenizer, device)
print(f"\nTest prediction for '{test_text}':")
print(f"Negative: {prediction['negative']:.2%}")
print(f"Neutral: {prediction['neutral']:.2%}")
print(f"Positive: {prediction['positive']:.2%}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training on device: cuda


Epoch 1/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.29it/s, loss=0.9519]
Epoch 2/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.78it/s, loss=0.9562]
Epoch 3/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.58it/s, loss=0.8871]
Epoch 4/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.98it/s, loss=0.8717]
Epoch 5/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.55it/s, loss=1.0387]
Epoch 6/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.46it/s, loss=0.5864]
Epoch 7/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.32it/s, loss=0.6808]
Epoch 8/10: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.18it/s, loss=0.3695]
Epoch 9/10: 100%|███████████████████████

Training completed!

Test prediction for 'This movie was interesting but had some flaws':
Negative: 11.53%
Neutral: 47.35%
Positive: 41.12%
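
The cells above report training loss and a single test sentence. To judge a fine-tuned classifier, predictions on held-out examples are usually compared with their labels. The sketch below shows that step with made-up logits standing in for `model(**inputs).logits`; `accuracy_from_logits` is a hypothetical helper, and the numbers are illustrative, not results from the notebook.

```python
import torch

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = logits.argmax(dim=-1)
    return (predictions == labels).float().mean().item()

# Made-up logits for 4 held-out examples, 3 classes (0: negative, 1: neutral, 2: positive)
logits = torch.tensor([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.4, 2.2],
    [1.0, 0.9, 0.8],
])
labels = torch.tensor([0, 1, 2, 1])

print(accuracy_from_logits(logits, labels))  # 0.75 (3 of 4 correct)
```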





### **Swapping Models (BERT → DistilBERT)**  
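Try it yourself: only the tokenizer class, the model class, and the checkpoint name change; the rest of the training code stays the same.
```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Only change these lines:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=3,  # match the number of classes in your data (3 in the example above)
)
```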