<a href="https://colab.research.google.com/github/JordanDCunha/Hands-On-Machine-Learning-with-Scikit-Learn-and-PyTorch/blob/main/Chapter15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Attention Is All You Need: The Original Transformer Architecture

The original 2017 Transformer architecture is represented in Figure 15-3.  
The left part of the figure represents the **encoder**, the right part represents the **decoder**.


## Encoder Overview

The encoder’s role is to gradually transform the inputs (e.g., sequences of English tokens) until each token’s representation perfectly captures the meaning of that token in the context of the sentence.

The encoder’s output is a sequence of **contextualized token embeddings**.


Apart from the embedding layer, every layer in the encoder:

- Takes input of shape  
  **[batch size, max English sequence length, embedding size]**
- Returns output of the **same shape**

Token representations are gradually transformed, hence the name *Transformer*.


### Example

If you feed the sentence **“I like soccer”** into the encoder:

- The word **“like”** starts with a vague meaning
- After encoding, it captures:
  - Correct meaning (verb: *to enjoy*)
  - Grammatical role
  - Contextual information needed for translation


## Decoder Overview

The decoder takes:

- The encoder’s outputs
- The translated sentence so far

Its goal is to predict the **next token** in the translation.


### Decoding Example

Source sentence: **‚ÄúI like soccer‚Äù**

Decoder outputs step-by-step:
1. `me`
2. `me gusta`
3. `me gusta el`
4. `me gusta el fútbol`

Since the output does not yet end with an end-of-sentence token (`</s>`), the decoder is called once more to predict it.


The decoder input sequence becomes:

`<s> me gusta el fútbol`

Each token representation is transformed so that:
- `<s>` → `me`
- `me` → `gusta`
- `gusta` → `el`
- `el` → `fútbol`
- `fútbol` → `</s>`
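During training, this pairing of each input token with the token that follows it can be produced by shifting the target sequence by one position. A minimal sketch in plain Python, assuming a toy whitespace tokenization rather than a real tokenizer:

```python
# Target sentence with start and end markers (toy whitespace tokenization).
tokens = ["<s>", "me", "gusta", "el", "fútbol", "</s>"]

decoder_inputs = tokens[:-1]   # everything except the final </s>
decoder_targets = tokens[1:]   # the same sequence shifted left by one

for inp, tgt in zip(decoder_inputs, decoder_targets):
    print(f"{inp:>7} -> {tgt}")
```

Each row printed is one (input token, expected next token) pair, so a single forward pass yields a training signal at every position.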


## Transformer Stack Structure

- Encoder blocks are stacked **N times** (N = 6 in the paper)
- Decoder blocks are also stacked **N times**
- The **final encoder output** is fed into **every decoder block**


### Components You Already Know

- Embedding layers
- Skip connections + LayerNorm
- Feedforward networks (2 Linear layers with ReLU)
- Final Linear output layer

❗ All of these process each token **independently** of the others; the attention layers described next are what let tokens exchange information.


## New Components

### 1. Encoder Self-Attention

Each token attends to **all tokens in the sentence**, including itself.

This allows:
- Contextual understanding
- Disambiguation of words like “like”


### 2. Decoder Masked Self-Attention

Each token can only attend to:
- Previous tokens
- Itself

This prevents the model from **cheating during training**.
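To make the masking rule concrete, the sketch below lists which positions each target token may attend to. It is plain Python for illustration only; note that here `True` means "visible", whereas PyTorch attention masks use `True` to mean "blocked".

```python
tokens = ["<s>", "me", "gusta", "el"]

# visible[i][j] is True when position i may attend to position j (j <= i).
visible = [[j <= i for j in range(len(tokens))] for i in range(len(tokens))]

for i, row in enumerate(visible):
    allowed = [tok for tok, ok in zip(tokens, row) if ok]
    print(f"{tokens[i]:>5} attends to {allowed}")
```

The first token sees only itself, while the last token sees the whole prefix, which mirrors what is available at inference time.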


### 3. Encoder–Decoder Cross-Attention

The decoder attends to the encoder’s outputs.

Example:
- While generating **“fútbol”**, the decoder attends strongly to **“soccer”**


## Positional Encodings

Attention layers are **position-agnostic**: without extra information, they treat the input as an unordered set of tokens.

To inject order information:
- Add a positional encoding to each token embedding


### Trainable Positional Embeddings (Common Approach)

- Use a trainable matrix
- Add it to token embeddings
- Apply dropout


In [None]:
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    def __init__(self, max_length, embed_dim, dropout=0.1):
        super().__init__()
        self.pos_embed = nn.Parameter(
            torch.randn(max_length, embed_dim) * 0.02
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, X):
        return self.dropout(X + self.pos_embed[:X.size(1)])


**Note:**  
- Input shape: `[batch, sequence length, embedding size]`
- Positional embedding shape: `[sequence length, embedding size]`
- Broadcasting handles the addition


## Multi-Head Attention (MHA)

Based on **scaled dot-product attention**.

Each head learns to focus on **different aspects** of token representations.


### Why Multiple Heads?

Different heads can specialize:
- Grammar
- Semantics
- Tense
- Long-range dependencies


## Scaled Dot-Product Attention

Given:
- Queries (Q)
- Keys (K)
- Values (V)

We compute:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$
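The formula can be checked numerically with a few lines of PyTorch. This is a minimal sketch, assuming `d` is the size of the last dimension of the queries and keys; the comparison with `F.scaled_dot_product_attention` assumes PyTorch 2.0 or later.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for inputs of shape [..., length, d]."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # [..., Lq, Lk]
    weights = scores.softmax(dim=-1)                 # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.randn(2, 3, 4)  # batch of 2, 3 queries, d = 4
K = torch.randn(2, 5, 4)  # 5 keys
V = torch.randn(2, 5, 4)  # 5 values

output, weights = scaled_dot_product_attention(Q, K, V)
reference = F.scaled_dot_product_attention(Q, K, V)  # built-in equivalent
print(output.shape, weights.shape, torch.allclose(output, reference, atol=1e-5))
```

Each query produces one row of attention weights over the keys, and the output is the corresponding weighted average of the values.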



## Multi-Head Attention Implementation


In [None]:
class MultiheadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        self.h = num_heads
        self.d = embed_dim // num_heads

        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, X):
        return X.view(X.size(0), X.size(1), self.h, self.d).transpose(1, 2)

    def forward(self, query, key, value):
        q = self.split_heads(self.q_proj(query))
        k = self.split_heads(self.k_proj(key))
        v = self.split_heads(self.v_proj(value))

        scores = q @ k.transpose(2, 3) / self.d ** 0.5
        weights = scores.softmax(dim=-1)
        Z = self.dropout(weights) @ v

        Z = Z.transpose(1, 2)  # [batch, Lq, heads, d]
        Z = Z.reshape(Z.size(0), Z.size(1), self.h * self.d)  # merge the heads
        return self.out_proj(Z), weights


## Masking Support


In [None]:
def forward(self, query, key, value, attn_mask=None, key_padding_mask=None):
    ...
    if attn_mask is not None:
        scores = scores.masked_fill(attn_mask, -torch.inf)

    if key_padding_mask is not None:
        mask = key_padding_mask.unsqueeze(1).unsqueeze(2)
        scores = scores.masked_fill(mask, -torch.inf)
    ...
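The effect of the two masks can be seen on a tiny example. The sketch below is an illustration rather than the full `forward` method: it assumes boolean masks where `True` means "do not attend", and it starts from all-equal scores so the resulting weights are easy to read.

```python
import torch

scores = torch.zeros(1, 1, 3, 3)  # [batch, heads, Lq, Lk], all-equal scores

# Causal mask: True above the diagonal marks "future" keys to block.
attn_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

# Padding mask: True marks padding; here the last key of the only sequence.
key_padding_mask = torch.tensor([[False, False, True]])  # [batch, Lk]

scores = scores.masked_fill(attn_mask, -torch.inf)
# Reshape [batch, Lk] -> [batch, 1, 1, Lk] so it broadcasts over heads and queries.
scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -torch.inf)

weights = scores.softmax(dim=-1)
print(weights[0, 0])
```

Positions filled with `-inf` receive a weight of exactly zero after the softmax, and the remaining weights in each row still sum to 1.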


## Encoder Block


In [None]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        attn, _ = self.self_attn(src, src, src,
                                 attn_mask=src_mask,
                                 key_padding_mask=src_key_padding_mask)
        Z = self.norm1(src + self.dropout(attn))
        ff = self.dropout(self.linear2(self.dropout(self.linear1(Z).relu())))
        return self.norm2(Z + ff)


## Decoder Block


In [None]:
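class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # Masked self-attention over the target, then cross-attention over the encoder output
        self.self_attn = MultiheadAttention(d_model, nhead, dropout)
        self.multihead_attn = MultiheadAttention(d_model, nhead, dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
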
    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None,
                tgt_key_padding_mask=None, memory_key_padding_mask=None):
        attn1, _ = self.self_attn(tgt, tgt, tgt,
                                  attn_mask=tgt_mask,
                                  key_padding_mask=tgt_key_padding_mask)
        Z = self.norm1(tgt + self.dropout(attn1))

        attn2, _ = self.multihead_attn(Z, memory, memory,
                                       attn_mask=memory_mask,
                                       key_padding_mask=memory_key_padding_mask)
        Z = self.norm2(Z + self.dropout(attn2))

        ff = self.dropout(self.linear2(self.dropout(self.linear1(Z).relu())))
        return self.norm3(Z + ff)


## PyTorch Built-in Transformer Modules

- `nn.TransformerEncoderLayer`
- `nn.TransformerDecoderLayer`
- `nn.TransformerEncoder`
- `nn.TransformerDecoder`
- `nn.Transformer`


These are highly optimized and support:
- `batch_first=True`
- Causal masking
- GELU activation
- Performance optimizations
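As a quick illustration of these options, the sketch below stacks two built-in encoder layers and runs a forward pass with a causal mask. The sizes are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64,
    dropout=0.1, activation="gelu", batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout for a deterministic forward pass

X = torch.randn(8, 10, 32)  # [batch, sequence length, embedding size]
# Boolean causal mask: True above the diagonal blocks attention to later positions.
causal_mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)

with torch.no_grad():
    out = encoder(X, mask=causal_mask)
print(out.shape)
```

With `batch_first=True`, inputs and outputs use the `[batch, sequence length, embedding size]` layout used throughout this chapter.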


## Final Notes

You now know how to build a full Transformer **from scratch**.

Remaining steps:
- Add final Linear layer
- Train with `nn.CrossEntropyLoss`
- Use autoregressive decoding

Next up: **Using Transformers for machine translation** 🚀


# Building an English-to-Spanish Transformer

It’s time to build our Neural Machine Translation (NMT) Transformer model.  
We will reuse the `PositionalEmbedding` module and rely on PyTorch’s built-in
`nn.Transformer`, which is well-optimized and faster than a custom implementation.

The model will:
- Embed source (English) and target (Spanish) tokens
- Add positional embeddings
- Use an encoder‚Äìdecoder Transformer
- Apply causal masking in the decoder
- Output logits suitable for `nn.CrossEntropyLoss`


In [None]:
import torch
import torch.nn as nn



In [None]:
class NmtTransformer(nn.Module):
    def __init__(self, vocab_size, max_length, embed_dim=512, pad_id=0,
                 num_heads=8, num_layers=6, dropout=0.1):
        super().__init__()

        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.pos_embed = PositionalEmbedding(max_length, embed_dim, dropout)

        self.transformer = nn.Transformer(
            d_model=embed_dim,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True
        )

        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, pair):
        # Embed tokens and add positional encodings
        src_embeds = self.pos_embed(self.embed(pair.src_token_ids))
        tgt_embeds = self.pos_embed(self.embed(pair.tgt_token_ids))

        # Invert masks: True means "ignore"
        src_pad_mask = ~pair.src_mask.bool()
        tgt_pad_mask = ~pair.tgt_mask.bool()

        # Create causal mask for decoder self-attention
        seq_len = pair.tgt_token_ids.size(1)
        full_mask = torch.full((seq_len, seq_len), True, device=tgt_pad_mask.device)
        causal_mask = torch.triu(full_mask, diagonal=1)

        # Transformer forward pass
        out = self.transformer(
            src=src_embeds,
            tgt=tgt_embeds,
            src_key_padding_mask=src_pad_mask,
            memory_key_padding_mask=src_pad_mask,
            tgt_key_padding_mask=tgt_pad_mask,
            tgt_mask=causal_mask,
            tgt_is_causal=True
        )

        # Project to vocabulary size
        logits = self.output(out)

        # Rearrange for CrossEntropyLoss: (B, vocab_size, seq_len)
        return logits.permute(0, 2, 1)


## Understanding the Forward Pass

1. Tokens are embedded and enriched with positional encodings.
2. Padding masks are inverted because PyTorch expects `True` for tokens to ignore.
3. A causal (upper-triangular) mask prevents the decoder from seeing future tokens.
4. The Transformer processes source and target sequences.
5. A final linear layer produces logits over the vocabulary.
6. Output dimensions are rearranged for `nn.CrossEntropyLoss`.
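The last step is easiest to see with concrete shapes. In this sketch the logits are random and the targets are made up; using `ignore_index` to skip padding positions is an assumption for the example, not something the model above configures.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size, pad_id = 2, 4, 10, 0

torch.manual_seed(0)
logits = torch.randn(batch_size, seq_len, vocab_size)  # [batch, length, vocab]
targets = torch.tensor([[5, 3, 7, 0],                  # 0 = padding
                        [2, 9, 0, 0]])

# CrossEntropyLoss expects the class dimension second: [batch, classes, length].
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)
loss = loss_fn(logits.permute(0, 2, 1), targets)
print(loss.item())
```

Flattening the logits to `[batch * length, vocab]` and the targets to `[batch * length]` would give the same loss; the permutation simply avoids the reshape.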


## Training a Smaller Transformer

To speed up training and reduce overfitting, we can shrink the model:
- Embedding size: 128
- Attention heads: 4
- Encoder layers: 2
- Decoder layers: 2


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

nmt_tr_model = NmtTransformer(
    vocab_size=vocab_size,
    max_length=max_length,
    embed_dim=128,
    pad_id=0,
    num_heads=4,
    num_layers=2,
    dropout=0.1
).to(device)

The model can now be trained exactly like the RNN encoder–decoder
from Chapter 14, using `nn.CrossEntropyLoss` and teacher forcing.


## Evaluation Example

After training for around 20 epochs, even this small Transformer
can produce high-quality translations.



## Cleaning Up GPU Memory

Before moving on to other models, free GPU memory:

- Delete unused variables
- Run Python garbage collection
- Clear CUDA cache if using a GPU


In [None]:
import gc

del nmt_tr_model
gc.collect()

if torch.cuda.is_available():
    torch.cuda.empty_cache()


# Encoder-Only Transformers for Natural Language Understanding

When Google released the BERT model in 2018, it proved that an encoder-only
Transformer can handle a wide variety of natural language understanding (NLU)
tasks, including:

- Sentence classification
- Token classification
- Multiple-choice question answering
- Extractive question answering

BERT also demonstrated the power of self-supervised pretraining on large corpora
followed by fine-tuning on relatively small task-specific datasets.

In this section, we will:
- Examine BERT’s architecture
- Understand its pretraining objectives
- See how to pretrain and fine-tune encoder-only models


## WARNING: Encoder-Only vs Decoder Models

Encoder-only models are generally **not used for text generation** tasks such as:
- Autocompletion
- Translation
- Summarization
- Chatbots

This is because encoders are **bidirectional** and must recompute attention over
the entire sequence whenever a new token is added.

Decoders, by contrast, are **causal** and can cache previous states, making them
much faster for generation.

BERT stands for **Bidirectional Encoder Representations from Transformers**;
the “B” highlights that the encoder is bidirectional.


## BERT’s Architecture

BERT is almost identical to the original Transformer encoder, with three key
differences:

### 1. Scale
- BERT-base: 12 layers, 12 heads, 768 hidden units
- BERT-large: 24 layers, 16 heads, 1,024 hidden units
- Maximum input length: 512 tokens
- Uses trainable positional embeddings

### 2. Pre-Layer Normalization (Pre-LN)
Layer normalization is applied **before** each sublayer instead of after.
This stabilizes training and reduces sensitivity to initialization.

PyTorch supports this via `norm_first=True`.

### 3. Segment Embeddings
BERT supports **two input segments**, useful for sentence-pair tasks.
- `[SEP]` token separates segments
- Segment embeddings (0 or 1) are added to token embeddings
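The layout of a two-segment input can be sketched in plain Python. The helper name `build_pair_input` is hypothetical, and the tokens are whitespace-split words rather than real WordPiece tokens.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for a sentence pair, BERT-style."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    ["she", "is", "shopping"], ["she", "bought", "shoes"]
)
for token, segment in zip(tokens, segment_ids):
    print(f"{token:>9}  segment {segment}")
```

Each segment ID selects one of two learned vectors, which is added to the token and positional embeddings at that position.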


## BERT Pretraining Objectives

BERT uses two self-supervised objectives during pretraining.


### Masked Language Modeling (MLM)

- Each token has a **15% probability** of being selected
- Of selected tokens:
  - 80% replaced with `[MASK]`
  - 10% replaced with a random token
  - 10% left unchanged
- Loss is computed **only on selected tokens**

This forces the model to learn deep bidirectional context.
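The selection and replacement rules can be sketched in plain Python. This is an illustration under simplifying assumptions (a toy five-word vocabulary and string tokens), not the procedure used by any particular library; `mask_tokens` is a hypothetical helper.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "tree", "blue", "runs"]  # toy vocabulary for random replacements

def mask_tokens(tokens, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where no loss is computed."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)   # not selected: unchanged, no loss
            labels.append(None)
            continue
        labels.append(token)          # selected: the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(VOCAB))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels

rng = random.Random(42)
tokens = "the quick brown fox jumps over the lazy dog".split() * 1000
corrupted, labels = mask_tokens(tokens, rng)
selected = sum(label is not None for label in labels)
print(f"selected: {selected / len(tokens):.1%}")
```

Leaving some selected tokens unchanged or replacing them with random tokens means the model cannot assume that only `[MASK]` positions need predicting.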

### Next Sentence Prediction (NSP)

The model predicts whether two sentences are consecutive.

Implementation:
- Add a `[CLS]` token at position 0
- Use the `[CLS]` embedding for binary classification

Later research showed NSP adds little benefit, so it was dropped in many models
(e.g., RoBERTa).


## Creating a BERT Model from Scratch (Using Hugging Face)

Instead of manually building BERT with `nn.TransformerEncoder`, we can use the
Transformers library to quickly define and train a model.


In [None]:
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast


In [None]:
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

config = BertConfig(
    vocab_size=bert_tokenizer.vocab_size,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=128
)

bert = BertForMaskedLM(config)


## Loading and Tokenizing a Dataset

We use the WikiText dataset here for demonstration purposes.


In [None]:
from datasets import load_dataset


In [None]:
def tokenize(example, tokenizer=bert_tokenizer):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )

mlm_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
mlm_dataset = mlm_dataset.map(tokenize, batched=True)


## Data Collator for MLM

The data collator dynamically applies masking during training.



In [None]:
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling


In [None]:
args = TrainingArguments(
    output_dir="./my_bert",
    num_train_epochs=5,
    per_device_train_batch_size=16
)

mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer,
    mlm=True,
    mlm_probability=0.15
)

trainer = Trainer(
    model=bert,
    args=args,
    train_dataset=mlm_dataset,
    data_collator=mlm_collator
)

trainer_output = trainer.train()


## Testing the Pretrained Model

After pretraining, we can test the model using the fill-mask pipeline.


In [None]:
from transformers import pipeline
import torch

torch.manual_seed(42)

fill_mask = pipeline("fill-mask", model=bert, tokenizer=bert_tokenizer)
predictions = fill_mask("The capital of [MASK] is Rome.")
predictions[0]

The model performs poorly because it is tiny and was trained only briefly on a small dataset.
By contrast, the original BERT was pretrained for **days on TPUs**.

This is why most users fine-tune pretrained checkpoints instead of training
from scratch.


## Fine-Tuning BERT

BERT can be fine-tuned for many tasks with minimal architectural changes:


### Sentence Classification
- Use `[CLS]` embedding
- Add a classification head
- Optimize cross-entropy loss


### Token Classification (e.g., NER)
- Apply classification head to each token
- Common in legal, medical, and financial NLP


### Sentence Pair Tasks
- Natural Language Inference (NLI)
- Paraphrase detection
- Question–answer matching


### Multiple-Choice Question Answering
- Run BERT once per candidate answer
- Use softmax over answer scores


### Extractive Question Answering
- Predict start and end token indices
- Use two logits per token


## Sentence Embeddings and SBERT

Standard BERT is inefficient for semantic similarity at scale.
Sentence-BERT (SBERT) solves this by producing fixed sentence embeddings.


In [None]:
from sentence_transformers import SentenceTransformer


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "She's shopping",
    "She bought some shoes",
    "She's working"
]

embeddings = model.encode(sentences, convert_to_tensor=True)
similarities = model.similarity(embeddings, embeddings)
similarities


Sentence embeddings enable:
- Semantic search
- Document clustering
- Reranking search results
- Fast similarity comparisons


## Other Encoder-Only Models

- **RoBERTa**: Better training, dynamic masking, no NSP
- **DistilBERT**: Smaller, faster, distilled from BERT
- **ALBERT**: Parameter sharing and factorized embeddings
- **ELECTRA**: Replaced token detection (RTD)
- **DeBERTa**: Disentangled attention with relative positions


## Final Thoughts

Encoder-only models remain extremely valuable:
- Smaller and faster than decoders
- Easy to fine-tune
- Ideal for NLU tasks

Next up: **Decoder-only models like GPT** 🚀


# Decoder-Only Transformers

While Google was working on the first encoder-only model (BERT), OpenAI researchers
took a different route and built the first decoder-only model: **GPT**.

GPT stands for *Generative Pre-Training*. These models predict the **next token**
given all previous tokens, making them ideal for text generation.

Decoder-only models are the foundation of modern systems such as ChatGPT and
Claude.


## GPT-1 Overview

GPT-1 was released in June 2018 and pretrained on approximately 7,000 books.

Training objective:
- Predict the next token for every position in the sequence

This allows the model to generate text one token at a time.


## Text Generation with Decoder-Only Models

Given an input such as:

"Happy birthday"

The model predicts the next token:
"to"

This token is appended to the input, and the process repeats.
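The append-and-repeat loop can be sketched without a real model. Here a hypothetical lookup table stands in for the language model, and `<eos>` is an assumed stop marker used only in this toy example.

```python
# Hypothetical stand-in for a language model: maps the last token to the next one.
NEXT_TOKEN = {"Happy": "birthday", "birthday": "to", "to": "you", "you": "<eos>"}

def toy_next_token(tokens):
    return NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate_greedy(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)  # one "forward pass" per new token
        if next_token == "<eos>":
            break
        tokens.append(next_token)            # feed the prediction back in
    return tokens

print(" ".join(generate_greedy(["Happy", "birthday"])))  # Happy birthday to you
```

A real model conditions on the whole prefix rather than only the last token, but the control flow is the same.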


## Strengths and Weaknesses of Decoder-Only Models

Decoder-only models excel at:
- Text generation
- Code generation
- Question answering
- Chatbots
- Reasoning (to some extent)

They are less efficient for:
- Classification
- Tasks requiring full bidirectional context

Encoder-only models are often faster and smaller for NLU tasks.


## WARNING: Inference Cost

Decoder-only models are **autoregressive**:
- One forward pass per generated token

Encoder-only models process the input once.

Caching past key/value states significantly improves decoder performance.


## GPT-1 Architecture and Pretraining

Pretraining details:
- Sequences of exactly 512 tokens
- No padding tokens
- No start-of-sequence or end-of-sequence tokens
- Next-token prediction for all positions

This ensures uniform training data across token positions.


## GPT-1 Architecture Differences from the Original Transformer

1. No cross-attention layers
   - Only masked self-attention + feedforward layers

2. Larger scale
   - 12 decoder layers
   - 12 attention heads
   - 768-dimensional embeddings
   - 117 million parameters



## GPT-1 Fine-Tuning Tasks

GPT-1 was fine-tuned for multiple tasks with minimal architectural changes.


### Text Classification

- Add a classification head on top of the **last token**
- Use cross-entropy loss


### Sentence Pair Tasks

- Concatenate sentences using a delimiter token (`$`)
- Add a classification head on the last token


### Semantic Similarity

- Run the model twice:
  - Sentence A $ Sentence B
  - Sentence B $ Sentence A
- Sum final token embeddings
- Feed to a regression head


### Multiple Choice Question Answering

- Run model once per answer choice
- Score each using last token
- Apply softmax across answers


## GPT-2 and Zero-Shot Learning

GPT-2 was released in February 2019 and scaled up dramatically.

Largest version:
- 48 layers
- 25 attention heads
- 1,600-dimensional embeddings
- 1.5B parameters
- Context length: 1,024 tokens


## WebText Dataset

GPT-2 was trained on **WebText**, a curated dataset of ~8M high-quality web pages
linked from popular Reddit posts.

This improved data quality compared to Common Crawl.


## Zero-Shot Learning (ZSL)

GPT-2 performs many tasks **without fine-tuning**, simply by prompt formatting.

Examples:
- Question answering using "Q: ... A:"
- Summarization using "TL;DR:"
- Translation using examples in the prompt


## Scaling Laws

Zero-shot performance improves roughly log-linearly with model size.

Bigger models → better generalization


## GPT-3 and In-Context Learning

GPT-3 (2020):
- ~175B parameters
- ~570GB of training data

Key insight:
- Models can learn tasks from **examples in the prompt**


## In-Context Learning (ICL)

ICL means:
- No gradient updates
- No fine-tuning
- Task examples provided directly in the prompt

Includes:
- One-shot learning (OSL)
- Few-shot learning (FSL)
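The difference between zero-shot, one-shot, and few-shot prompting is only in how many solved examples precede the query. A minimal sketch, where `build_few_shot_prompt` and the template wording are illustrative choices:

```python
def build_few_shot_prompt(examples, query,
                          template="Capital city of {country} = {capital}"):
    """Format solved examples followed by an unsolved query for the model to complete."""
    lines = [template.format(country=c, capital=cap) for c, cap in examples]
    # Leave the answer slot empty so the model fills it in.
    lines.append(template.format(country=query, capital="").rstrip())
    return "\n".join(lines)

zero_shot = build_few_shot_prompt([], "Mexico")
one_shot = build_few_shot_prompt([("France", "Paris")], "Mexico")
few_shot = build_few_shot_prompt([("France", "Paris"), ("Japan", "Tokyo")], "Mexico")
print(few_shot)
```

No model weights change in any of these cases; the examples only shape what the model is likely to generate next.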


## Using GPT-2 with Hugging Face Transformers


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM


In [None]:
model_id = "gpt2"

gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id)
gpt2 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto"
)


## Text Generation Helper Function


In [None]:
def generate(model, tokenizer, prompt, max_new_tokens=50, **generate_kwargs):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


## Greedy Decoding Example


In [None]:
prompt = "Scientists found a talking unicorn today. Here's the full story:"
generate(gpt2, gpt2_tokenizer, prompt)


Greedy decoding often causes repetition.
To fix this, we enable sampling.


In [None]:
import torch

torch.manual_seed(42)
generate(gpt2, gpt2_tokenizer, prompt, do_sample=True)


## Sampling Controls

Common generation parameters:
- temperature
- top_k
- top_p (nucleus sampling)
- num_beams
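To show what temperature and nucleus sampling do to a next-token distribution, here is a plain-Python sketch on four made-up logits. It illustrates the idea only and is not the implementation used by `generate()`.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized distribution

logits = [2.0, 1.0, 0.5, -1.0]                  # made-up scores for 4 tokens
probs = softmax(logits)
print(probs)
print(softmax(logits, temperature=0.5))         # lower temperature: sharper distribution
print(top_p_filter(probs, top_p=0.8))           # only the most likely tokens survive
```

Lower temperatures and smaller `top_p` values both concentrate probability on the most likely tokens, trading diversity for coherence.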


In [None]:
torch.manual_seed(42)
generate(
    gpt2,
    gpt2_tokenizer,
    prompt,
    do_sample=True,
    top_p=0.6
)


## Question Answering with In-Context Learning


In [None]:
DEFAULT_TEMPLATE = (
    "Capital city of France = Paris\n"
    "Capital city of {country} ="
)

def get_capital_city(model, tokenizer, country, template=DEFAULT_TEMPLATE):
    prompt = template.format(country=country)
    extended = generate(model, tokenizer, prompt, max_new_tokens=10)
    answer = extended[len(prompt):]
    return answer.strip().splitlines()[0].strip()


In [None]:
get_capital_city(gpt2, gpt2_tokenizer, "United Kingdom")


In [None]:
get_capital_city(gpt2, gpt2_tokenizer, "Mexico")


## Limitations of GPT-2

- Common factual errors
- Biases from web data
- Hallucinations
- Performance improves with model size


## Larger Decoder Models: Mistral-7B

Mistral-7B:
- 7 billion parameters
- Apache 2.0 license
- Advanced attention mechanisms
- Runs on Colab GPUs


## Loading Mistral-7B


In [None]:
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM


In [None]:
model_id = "mistralai/Mistral-7B-v0.3"

mistral7b_tokenizer = AutoTokenizer.from_pretrained(model_id)
mistral7b = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto"
)


## Final Notes

Decoder-only models:
- Scale extremely well
- Enable zero-shot and few-shot learning
- Power modern LLM systems

Next: **Chat fine-tuning and instruction tuning**


# Decoder-Only Transformers

While Google was working on the first encoder-only model (i.e., BERT), Alec Radford and other OpenAI researchers were taking a different route: they built the first decoder-only model, named GPT. This model paved the way for today’s most impressive models, including most of the ones used in famous chatbots like ChatGPT or Claude.

The GPT model (now known as GPT-1) was released in June 2018. GPT stands for **Generative Pre-Training**: it was pretrained on a dataset of about 7,000 books and learned to predict the next token, so it can be used to generate text one token at a time, just like the original Transformer’s decoder.

For example, if you feed it *“Happy birthday”*, it will predict *“birthday to”*, so you can append *“to”* to the input and repeat the process.


### What Decoder-Only Models Are Good At

Decoder-only models excel at:

- Text generation and auto-completion  
- Code generation  
- Question answering (free-form answers)  
- Math and logical reasoning (to some extent)  
- Chatbots  

They can also be used for summarization or translation, but encoder–decoder models are often better at these tasks because the encoder provides a stronger understanding of the source text.

Decoder-only models can perform text classification, but encoder-only models are usually faster and more parameter-efficient for this task.


### Warning: Autoregressive Inference

At inference time:

- **Encoder-only models** process the input once.
- **Decoder-only models** generate one token at a time.

This makes generation sequential and slower, but decoder-only models benefit heavily from **key–value caching**, which greatly speeds up inference.


## GPT-1 Architecture and Generative Pretraining

During pretraining, GPT-1:

- Used batches of 64 sequences
- Each sequence was exactly 512 tokens long
- Predicted the next token for *every* input token
- Used no padding and no special tokens (no BOS/EOS)

Compared to BERT, GPT-1’s training was simpler and provided equal data exposure to all token positions.


### Key Architectural Differences vs Transformer Decoder

GPT-1 differs from the original Transformer decoder in two major ways:

1. **No cross-attention**  
   - There is no encoder output to attend to.
   - Each block contains:
     - Masked multi-head self-attention
     - A feedforward network
     - Skip connections and layer normalization

2. **Much larger model**
   - 12 decoder layers (instead of 6)
   - Embedding size: 768
   - 12 attention heads
   - ~117 million parameters


### Important PyTorch Warning

You **cannot** use `nn.TransformerDecoder` to build a decoder-only model because it always includes cross-attention layers.

Instead:
- Use `nn.TransformerEncoder`
- Always apply a **causal (triangular) mask**
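A minimal sketch of this recipe is shown below. The class name `TinyDecoderOnly` and all sizes are hypothetical, chosen only to keep the example small; it uses learned positional embeddings and a boolean causal mask.

```python
import torch
import torch.nn as nn

class TinyDecoderOnly(nn.Module):
    """Hypothetical minimal decoder-only model: encoder layers plus a causal mask."""
    def __init__(self, vocab_size=100, max_length=32, embed_dim=32,
                 num_heads=4, num_layers=2):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_length, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward=64,
                                           dropout=0.0, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        X = self.tok_embed(token_ids) + self.pos_embed(positions)
        # True above the diagonal blocks attention to future positions.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=token_ids.device),
            diagonal=1,
        )
        return self.output(self.blocks(X, mask=causal_mask))  # [batch, length, vocab]

torch.manual_seed(0)
model = TinyDecoderOnly()
token_ids = torch.randint(0, 100, (2, 8))
logits = model(token_ids)
print(logits.shape)
```

Because of the causal mask, changing a later token leaves the logits at earlier positions unchanged, which is the property that makes next-token training and autoregressive generation consistent.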


## GPT-2 and Zero-Shot Learning

GPT-2 was released in February 2019 and scaled the GPT-1 architecture dramatically.

The largest GPT-2 model:
- 48 decoder layers
- 25 attention heads
- Embedding size of 1600
- Context window of 1024 tokens
- Over 1.5 billion parameters

It was trained on **WebText**, a curated dataset of high-quality web pages.


### Zero-Shot Learning (ZSL)

GPT-2 could perform many tasks *without fine-tuning*:

- **Question answering**
  - Prompt: `What is the capital of New Zealand? A:`
- **Summarization**
  - Prompt ends with: `TL;DR:`
- **Translation**
  - Provide a few examples, then a new sentence

Performance improved predictably with model size (log-linear scaling).


## Using GPT-2 to Generate Text

We can use Hugging Face’s Transformers library to load GPT-2.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id)
gpt2 = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", dtype="auto"
)


### Why These Arguments Matter

- `device_map="auto"` automatically places the model on the best device (GPU if available).
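- `dtype="auto"` loads the weights in the data type recorded in the checkpoint (e.g., float16 or bfloat16 when the model was saved that way) instead of always upcasting to float32, which can save memory and improve speed.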



By default, `generate()` uses greedy decoding, which often causes repetition.
For creative text generation, enable sampling.


In [None]:
prompt = "Scientists found a talking unicorn today. Here's the full story:"
generate(gpt2, gpt2_tokenizer, prompt)


In [None]:
import torch

torch.manual_seed(42)
generate(gpt2, gpt2_tokenizer, prompt, do_sample=True, top_p=0.6)


## Using GPT-2 for Question Answering (In-Context Learning)


In [None]:
DEFAULT_TEMPLATE = "Capital city of France = Paris\nCapital city of {country} ="

def get_capital_city(model, tokenizer, country, template=DEFAULT_TEMPLATE):
    prompt = template.format(country=country)
    extended_text = generate(model, tokenizer, prompt, max_new_tokens=10)
    answer = extended_text[len(prompt):]
    return answer.strip().splitlines()[0].strip()


In [None]:
get_capital_city(gpt2, gpt2_tokenizer, "United Kingdom")
get_capital_city(gpt2, gpt2_tokenizer, "Mexico")


# Turning a Large Language Model into a Chatbot

To build a chatbot, you need more than a base model. For example, let’s try asking Mistral-7B for something:


In [None]:
prompt = "List some places I should visit in Paris."
generate(mistral7b, mistral7b_tokenizer, prompt)


The model does not answer the question; it simply *completes* it. This is expected behavior for a base language model.

To make the model conversational, we can apply **prompt engineering**, which consists of carefully crafting the prompt so the model behaves like a helpful chatbot.


## Prompt Engineering with Role Tags

We can introduce the model to a fictional chatbot persona and explicitly mark who is speaking.


In [None]:
bob_introduction = """
Bob is an amazing chatbot. It knows everything and it's incredibly helpful.
"""


In [None]:
full_prompt = f"{bob_introduction}Me: {prompt}\nBob:"


In [None]:
extended_text = generate(
    mistral7b,
    mistral7b_tokenizer,
    full_prompt,
    max_new_tokens=100
)

answer = extended_text[len(full_prompt):].strip()
print(answer)


The model now answers correctly, but it continues generating the conversation.

To fix this, we can truncate the answer at the point where the conversation returns to “Me:”.


In [None]:
answer.split("\nMe: ")[0]


Now suppose we want to continue the same conversation. We must keep the entire conversation context and append new turns to it.


In [None]:
class BobTheChatbot:
    def __init__(self, model, tokenizer, introduction=bob_introduction,
                 max_answer_length=10_000):
        self.model = model
        self.tokenizer = tokenizer
        self.context = introduction
        self.max_answer_length = max_answer_length

    def chat(self, prompt):
        self.context += "\nMe: " + prompt + "\nBob:"
        context = self.context
        start_index = len(context)

        while True:
            extended = generate(
                self.model,
                self.tokenizer,
                context,
                max_new_tokens=100
            )
            answer = extended[start_index:]

            if ("\nMe: " in answer or
                extended == context or
                len(answer) >= self.max_answer_length):
                break

            context = extended

        answer = answer.split("\nMe: ")[0]
        self.context += answer
        return answer.strip()


In [None]:
bob = BobTheChatbot(mistral7b, mistral7b_tokenizer)

bob.chat("List some places I should visit in Paris.")
bob.chat("Tell me more about the first place.")
bob.chat("And Rome?")


We now have a working chatbot in about 20 lines of code.

However, several problems remain:
- The chatbot can repeat itself.
- Its answers are often shallow.
- It may produce unsafe or illegal advice.

Prompt engineering helps, but it is not sufficient.


## Prompt Engineering Techniques

Popular prompt engineering techniques include:

- Rephrasing instructions
- Adding examples (few-shot prompting)
- Assigning a role or personality
- Specifying output format and style
- Prompt chaining (multi-step prompts)
- Chain-of-thought (CoT) prompting
- Tree-of-thoughts (ToT)
- Self-critique and refinement


## Hallucinations and Retrieval-Augmented Generation (RAG)

LLMs can hallucinate facts. To reduce this:
- Retrieve relevant information from trusted sources
- Inject this information into the prompt
- Let the model answer using this context

This approach is called **Retrieval-Augmented Generation (RAG)**.
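
The three steps above can be sketched with a toy retriever. This is only an illustration under strong assumptions: the documents are made up, `score`, `retrieve`, and `build_rag_prompt` are hypothetical helpers, and word overlap stands in for the embedding-based search a real RAG system would typically use.

```python
import re

def words(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))

def score(query, document):
    """Count how many distinct query words appear in the document."""
    return len(words(query) & words(document))

def retrieve(query, documents, k=1):
    """Return the k documents that share the most words with the query."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]

def build_rag_prompt(query, documents, k=1):
    """Inject the retrieved documents into the prompt, before the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}\nAnswer:")

# Made-up knowledge base, for illustration only
docs = [
    "The museum opens at 9 am on weekdays.",
    "The cafe serves breakfast until 11 am.",
    "The gift shop accepts cash and cards.",
]
query = "When does the museum open?"
rag_prompt = build_rag_prompt(query, docs)
print(rag_prompt)
```

The resulting prompt could then be passed to a generation function; the model is expected to ground its answer in the injected context rather than in its own memory.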


## Fine-Tuning a Model for Chatting

Building a reliable chatbot typically requires two fine-tuning steps:

1. **Supervised Fine-Tuning (SFT)**
2. **Fine-Tuning with Human Preferences (RLHF or DPO)**

This process turns a base model into an instruction-following, conversational model.


### Supervised Fine-Tuning (SFT)

The model is trained on curated instruction–response pairs.

Loss masking is often used so that:
- The loss is computed only on answer tokens
- The model focuses on improving responses
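
Loss masking can be sketched as follows. The token IDs and log-probabilities are made up, `build_labels` and `masked_mean_nll` are hypothetical helpers, and the one-position shift between inputs and targets is omitted for simplicity. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is commonly used to mark positions to skip.

```python
import math

IGNORE_INDEX = -100  # label value skipped by default in PyTorch's cross-entropy

def build_labels(prompt_ids, answer_ids, ignore_index=IGNORE_INDEX):
    """Return (input_ids, labels) where the prompt positions are masked out."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [ignore_index] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

def masked_mean_nll(token_log_probas, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over the unmasked (answer) positions."""
    kept = [-lp for lp, label in zip(token_log_probas, labels)
            if label != ignore_index]
    return sum(kept) / len(kept)

# Made-up token IDs: 3 prompt tokens followed by 2 answer tokens
input_ids, labels = build_labels([5, 6, 7], [8, 9])
print(labels)

# Made-up log-probabilities: poor on the prompt, better on the answer
log_probas = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_mean_nll(log_probas, labels)
print(loss)  # depends only on the two answer tokens
```

Because the prompt positions are ignored, the model is not penalized for failing to predict the user's instruction, only for its own response.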


### Reinforcement Learning from Human Feedback (RLHF)

RLHF:
- Trains a reward model from human rankings
- Uses PPO to optimize the model
- Prevents excessive drift from the base model

This approach is powerful, but it is complex to implement and training can be unstable.
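
To give a feel for the first step, here is a sketch of one common way to train a reward model from pairwise rankings: a loss that pushes the reward of the preferred answer above the reward of the rejected one. The text above does not specify the loss, so treat this pairwise form as an assumption; the reward values are made up.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Small when the chosen answer gets a much higher reward than the
    rejected one, large when the ranking is inverted."""
    return -log_sigmoid(reward_chosen - reward_rejected)

# Made-up reward model outputs for three (chosen, rejected) pairs
good_ranking = pairwise_reward_loss(2.0, 0.0)
no_preference = pairwise_reward_loss(0.0, 0.0)
bad_ranking = pairwise_reward_loss(0.0, 2.0)
print(good_ranking, no_preference, bad_ranking)
```

Once trained, the reward model scores the chatbot's answers, and PPO updates the chatbot to increase those scores.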


## Direct Preference Optimization (DPO)

DPO is a simpler alternative to RLHF:
- No reward model
- No reinforcement learning
- More stable and data-efficient

Each training sample includes:
- A prompt
- A preferred answer
- A rejected answer
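
For a single training sample, the DPO loss is usually written as follows. The notation is introduced here for illustration: $x$ is the prompt, $y_c$ the preferred (chosen) answer, $y_r$ the rejected answer, $\pi_\theta$ the model being trained, $\pi_{\text{ref}}$ a frozen reference model (typically the SFT model), $\sigma$ the sigmoid function, and $\beta$ a hyperparameter controlling how strongly the preference is enforced.

$$
\mathcal{L}_{\text{DPO}} = -\log \sigma\Big(\beta \big[\big(\log \pi_\theta(y_c \mid x) - \log \pi_{\text{ref}}(y_c \mid x)\big) - \big(\log \pi_\theta(y_r \mid x) - \log \pi_{\text{ref}}(y_r \mid x)\big)\big]\Big)
$$

The loss decreases when the trained model raises the probability of the preferred answer, relative to the reference model, more than it raises the probability of the rejected answer.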


In [None]:
prompt = "The capital of Argentina is "
full_input = [prompt + "Buenos Aires", prompt + "Madrid"]

mistral7b_tokenizer.pad_token = mistral7b_tokenizer.eos_token
encodings = mistral7b_tokenizer(
    full_input, return_tensors="pt", padding=True
).to(device)

logits = mistral7b(**encodings).logits


In [None]:
import torch.nn.functional as F
import torch

next_token_ids = encodings.input_ids[:, 1:]
next_token_log_probas = -F.cross_entropy(
    logits[:, :-1].permute(0, 2, 1),
    next_token_ids,
    reduction="none"
)


In [None]:
padding_mask = encodings.attention_mask[:, :-1]
log_probas_sum = (next_token_log_probas * padding_mask).sum(dim=1)
log_probas_sum


In [None]:
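def sum_of_log_probas(model, tokenizer, full_input):
    # Wraps the previous three cells; assumes tokenizer.pad_token is set
    encodings = tokenizer(full_input, return_tensors="pt", padding=True).to(model.device)
    logits = model(**encodings).logits
    next_token_ids = encodings.input_ids[:, 1:]
    log_probas = -F.cross_entropy(logits[:, :-1].permute(0, 2, 1),
                                  next_token_ids, reduction="none")
    padding_mask = encodings.attention_mask[:, :-1]
    return (log_probas * padding_mask).sum(dim=1)

def dpo_loss(model, ref_model, tokenizer, full_input_c, full_input_r, beta=0.1):
    # c = chosen (preferred) answers, r = rejected answers
    p_c = sum_of_log_probas(model, tokenizer, full_input_c)
    p_r = sum_of_log_probas(model, tokenizer, full_input_r)

    with torch.no_grad():  # the reference model is frozen
        p_ref_c = sum_of_log_probas(ref_model, tokenizer, full_input_c)
        p_ref_r = sum_of_log_probas(ref_model, tokenizer, full_input_r)

    return -F.logsigmoid(
        beta * ((p_c - p_ref_c) - (p_r - p_ref_r))
    ).mean()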
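
As a sanity check on the loss, the same arithmetic can be run in plain Python on made-up log-probabilities, without loading any model. `dpo_loss_from_log_probas` is a hypothetical scalar helper written for this illustration; it handles one pair at a time instead of a batch.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss_from_log_probas(p_c, p_r, p_ref_c, p_ref_r, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed log-probabilities
    under the trained model (p_c, p_r) and the reference model (p_ref_c, p_ref_r)."""
    margin = (p_c - p_ref_c) - (p_r - p_ref_r)
    return -log_sigmoid(beta * margin)

# Made-up log-probabilities, for illustration only
at_start = dpo_loss_from_log_probas(-12.0, -15.0, -12.0, -15.0)  # model == reference
improved = dpo_loss_from_log_probas(-10.0, -18.0, -12.0, -15.0)  # chosen up, rejected down
degraded = dpo_loss_from_log_probas(-14.0, -13.0, -12.0, -15.0)  # chosen down, rejected up
print(at_start, improved, degraded)
```

At the start of training the model equals the reference model, so the margin is zero and the loss is log 2 (about 0.693). It drops as the model shifts probability toward the chosen answer and rises if it shifts toward the rejected one.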


## Fine-Tuning with the TRL Library

The Hugging Face TRL library supports:
- SFT
- RLHF
- DPO

We will use:
- Alpaca dataset for SFT
- Anthropic HH-RLHF dataset for DPO


In [None]:
from datasets import load_dataset

sft_dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(sft_dataset[1]["text"])


In [None]:
from trl import SFTTrainer, SFTConfig

sft_model_dir = "./my_gpt2_sft_alpaca"

training_args = SFTConfig(
    output_dir=sft_model_dir,
    max_length=512,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    save_steps=50,
    logging_steps=10,
    learning_rate=5e-5,
)

sft_trainer = SFTTrainer(
    "gpt2",
    train_dataset=sft_dataset,
    args=training_args,
)

sft_trainer.train()
sft_trainer.model.save_pretrained(sft_model_dir)


In [None]:
pref_dataset = load_dataset("Anthropic/hh-rlhf", split="train")
pref_dataset[2].keys()
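
Each DPO sample needs a prompt, a preferred answer, and a rejected answer. The sketch below assumes each record stores two full transcripts under `chosen` and `rejected` keys that share the same conversation up to the last assistant turn; the sample is made up to mimic that layout, and `split_last_turn` is a hypothetical helper. TRL may already perform a similar split internally, so this is only meant to show the data layout.

```python
def split_last_turn(dialogue, marker="\n\nAssistant:"):
    """Split a transcript into (prompt, final answer) at the last assistant marker."""
    cut = dialogue.rfind(marker)
    if cut == -1:
        raise ValueError("no assistant turn found")
    cut += len(marker)
    return dialogue[:cut], dialogue[cut:].strip()

# Made-up sample mimicking a "chosen"/"rejected" record
sample = {
    "chosen": "\n\nHuman: What is the capital of Argentina?\n\nAssistant: Buenos Aires.",
    "rejected": "\n\nHuman: What is the capital of Argentina?\n\nAssistant: Madrid.",
}
prompt_c, answer_c = split_last_turn(sample["chosen"])
prompt_r, answer_r = split_last_turn(sample["rejected"])
print(repr(prompt_c))
print(answer_c, "|", answer_r)
```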


In [None]:
from trl import DPOTrainer, DPOConfig

dpo_model_dir = "./my_gpt2_sft_alpaca_dpo_hh_rlhf"

training_args = DPOConfig(
    output_dir=dpo_model_dir,
    max_length=512,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    save_steps=50,
    logging_steps=10,
    learning_rate=2e-5,
)

gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

dpo_trainer = DPOTrainer(
    sft_model_dir,
    args=training_args,
    train_dataset=pref_dataset,
    processing_class=gpt2_tokenizer,
)

dpo_trainer.train()
dpo_trainer.model.save_pretrained(dpo_model_dir)


You now understand the full pipeline:

- Transformer architecture
- Pretraining with next-token prediction
- Supervised fine-tuning (SFT)
- Preference alignment with DPO
- Deployment in a tool-augmented chatbot system

This is, in broad strokes, how modern chatbots are built.


# Encoder–Decoder Models

In this chapter, other than the original Transformer architecture, we have focused solely on encoder-only and decoder-only models. This might have given you the impression that encoder–decoder models are over, but for some problems they are still very relevant, especially for tasks like **translation** or **summarization**.

Since the encoder is **bidirectional**, it can encode the source text and output excellent contextual embeddings, which the decoder can then use to produce a better output than a decoder-only model would (at least for models of a similar size).

---

## T5: Text-to-Text Transfer Transformer

The **T5 model** (released by Google in 2019) is a particularly influential encoder–decoder model. It was the first to frame *all* NLP tasks as **text-to-text** problems.

Examples:

- **Translation**  
  Input:  
  `translate English to Spanish: I like soccer`  
  Output:  
  `me gusta el fútbol`

- **Summarization**  
  Input:  
  `summarize: <paragraph>`  
  Output:  
  `<summary>`

- **Classification**  
  Input:  
  `classify: <text>`  
  Output:  
  `<class name>`

For **zero-shot classification**, the possible classes can simply be listed in the prompt.

This unified text-to-text approach makes T5:
- Very easy to pretrain on diverse tasks
- Very easy to use at inference time
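
Because every task becomes a plain string, a single formatting function can serve all of them. The sketch below is illustrative only: `to_text_to_text` is a hypothetical helper, and the way the candidate classes are listed in the prompt is an assumed format, not one prescribed by T5.

```python
def to_text_to_text(task_prefix, text, classes=None):
    """Format any task as a single input string, in the T5 style."""
    if classes:
        # Zero-shot classification: list the candidate classes in the prompt
        task_prefix = f"{task_prefix} ({', '.join(classes)})"
    return f"{task_prefix}: {text}"

inputs = [
    to_text_to_text("translate English to Spanish", "I like soccer"),
    to_text_to_text("summarize", "<paragraph>"),
    to_text_to_text("classify", "<text>", classes=["sports", "politics"]),
]
for line in inputs:
    print(line)
```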

T5 was pretrained using a **masked span corruption objective**, similar to MLM, but masking one or more **contiguous spans** instead of individual tokens.
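
Span corruption can be sketched as follows: each masked span is replaced by a sentinel token in the input, and the target lists each sentinel followed by the tokens it hides. The `<extra_id_N>` sentinel names follow the T5 tokenizer's convention; `corrupt_spans` is a hypothetical helper, and the spans are chosen by hand here, whereas during pretraining they are sampled at random.

```python
def corrupt_spans(tokens, spans):
    """Mask the given spans. `spans` is a sorted list of non-overlapping
    (start, end) index pairs, with `end` exclusive.
    Returns the corrupted input text and the target text."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # replace the span with one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the target reveals the hidden span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # a final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(corrupted)
print(target)
```

The encoder reads the corrupted input, and the decoder is trained to produce the target, so a single sentinel can stand for a span of any length.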

---

## Variants of T5

Google released several variants of T5:

### mT5 (2020)
A multilingual T5 supporting **over 100 languages**.  
Well-suited for:
- Translation
- Cross-lingual tasks (e.g., asking questions in English about Spanish text)

### ByT5 (2021)
A **byte-level** variant of T5 that removes the need for tokenization entirely (not even BPE).  
This approach has not become widely adopted, as tokenizers are generally more efficient.
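
A quick way to see what byte-level input looks like is to encode a string to UTF-8 and list the byte values. This is only a sketch of the idea: the actual ByT5 vocabulary also reserves a few IDs for special tokens, which is omitted here.

```python
text = "fútbol"
byte_ids = list(text.encode("utf-8"))      # one integer (0-255) per byte
decoded = bytes(byte_ids).decode("utf-8")  # the mapping is lossless

print(len(text), "characters ->", len(byte_ids), "bytes")
print(byte_ids)
print(decoded)
```

Non-ASCII characters such as `ú` take more than one byte, and every word becomes several IDs, so byte-level sequences are noticeably longer than subword sequences. This is one reason tokenizers remain more efficient.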

### FLAN-T5 (2022)
An **instruction-tuned** version of T5 with excellent:
- Zero-shot learning (ZSL)
- Few-shot learning (FSL)

### UL2 (2022)
Pretrained using **multiple objectives**, including:
- Masked span denoising (like T5)
- Standard next-token prediction
- Masked token prediction

### FLAN-UL2 (2023)
An improved version of UL2 using **instruction tuning**.

---

## Encoder–Decoder Models from Meta

Meta released several encoder–decoder models, starting with **BART** in 2020.

BART was pretrained using a **denoising objective**:
- The input text is corrupted (masked, deleted, shuffled, modified, or inserted tokens)
- The model must reconstruct the original text
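
A few of these corruptions can be sketched in plain Python. The helpers below (`mask_tokens`, `delete_tokens`, `permute_sentences`) are hypothetical and simplified, and the corruption rates are arbitrary; they only show how (corrupted input, original text) training pairs could be produced.

```python
import random

def mask_tokens(tokens, rate, rng, mask="<mask>"):
    """Token masking: replace each token with a mask token with probability `rate`."""
    return [mask if rng.random() < rate else t for t in tokens]

def delete_tokens(tokens, rate, rng):
    """Token deletion: drop each token with probability `rate`."""
    return [t for t in tokens if rng.random() >= rate]

def permute_sentences(sentences, rng):
    """Sentence permutation: return the sentences in a shuffled order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(42)  # seeded so the demo is repeatable
original = "I like soccer and I play every weekend".split()
print(" ".join(mask_tokens(original, 0.3, rng)))
print(" ".join(delete_tokens(original, 0.3, rng)))
print(permute_sentences(["I like soccer.", "I play on weekends.", "It is fun."], rng))
```

In each case the training target is the original, uncorrupted text. Deletion is harder than masking because the model must also work out where tokens are missing.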

BART is particularly effective for:
- Text generation
- Summarization

A multilingual variant called **mBART** is also available.

---

## Beyond NLP

Encoder–decoder architectures are also common outside NLP:

- **Vision models**, especially for:
  - Object detection
  - Image segmentation
- **Multimodal models**, combining text, vision, audio, or other modalities

This brings us to the next chapter, where we will explore **vision transformers and multimodal transformers**.

It’s time for transformers to open their eyes! 👀
