<a href="https://colab.research.google.com/github/JordanDCunha/Hands-On-Machine-Learning-with-Scikit-Learn-and-PyTorch/blob/main/Chapter14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating Shakespearean Text Using a Character RNN

In a famous 2015 blog post titled *“The Unreasonable Effectiveness of Recurrent Neural Networks”*, Andrej Karpathy showed how to train an RNN to predict the next character in a sentence.

This char-RNN can then be used to generate novel text, one character at a time.


### Example of Generated Text

After training on all of Shakespeare’s works, a char-RNN generated text such as:

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain’d into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.


Not exactly a masterpiece, but still impressive: the model learned words, grammar, punctuation, and structure purely by predicting the next character.

This is our first example of a **language model**.


## Creating the Training Dataset

First, we download a subset of Shakespeare’s works (about 25%) from Andrej Karpathy’s char-RNN project.


In [None]:
from pathlib import Path
import urllib.request

def download_shakespeare_text():
    path = Path("datasets/shakespeare/shakespeare.txt")
    if not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        url = "https://homl.info/shakespeare"
        urllib.request.urlretrieve(url, path)
    return path.read_text()

shakespeare_text = download_shakespeare_text()


Let’s inspect the beginning of the text.


In [None]:
print(shakespeare_text[:80])


Neural networks work with numbers, not text.  
We must encode text into numbers by splitting it into **tokens**.

Here, we use **characters** as tokens.


In [None]:
vocab = sorted(set(shakespeare_text.lower()))
"".join(vocab)


Now we assign an integer ID to each character.


In [None]:
char_to_id = {char: index for index, char in enumerate(vocab)}
id_to_char = {index: char for index, char in enumerate(vocab)}

char_to_id["a"], id_to_char[13]


### Encoding and Decoding Text


In [None]:
import torch

def encode_text(text):
    return torch.tensor([char_to_id[char] for char in text.lower()])

def decode_text(char_ids):
    return "".join([id_to_char[char_id.item()] for char_id in char_ids])


Let’s test these helper functions.


In [None]:
encoded = encode_text("Hello, world!")
encoded, decode_text(encoded)


## Preparing the Dataset

We turn the long character sequence into many overlapping windows.

Inputs:
- A window of characters

Targets:
- The same window shifted one character into the future
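To make the windowing concrete, here is a minimal sketch on a short string, using plain Python slicing instead of tensors. The `make_windows` helper is hypothetical and exists only for illustration; it mirrors the indexing described above, where each target is its input window shifted one character ahead.

```python
def make_windows(text, window_length):
    """Return (input, target) pairs; each target is its input shifted by one character."""
    pairs = []
    # The last usable start index leaves room for the one-character shift.
    for start in range(len(text) - window_length):
        end = start + window_length
        pairs.append((text[start:end], text[start + 1:end + 1]))
    return pairs

pairs = make_windows("to be or not", window_length=4)
print(pairs[0])    # ('to b', 'o be')
print(len(pairs))  # 8
```

A 12-character text with a window of 4 yields 8 overlapping pairs, which is why consecutive samples are highly redundant.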


In [None]:
from torch.utils.data import Dataset, DataLoader

class CharDataset(Dataset):
    def __init__(self, text, window_length):
        self.encoded_text = encode_text(text)
        self.window_length = window_length

    def __len__(self):
        return len(self.encoded_text) - self.window_length

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError("dataset index out of range")
        end = idx + self.window_length
        window = self.encoded_text[idx:end]
        target = self.encoded_text[idx + 1:end + 1]
        return window, target


### Creating the Data Loaders


In [None]:
window_length = 50
batch_size = 512

train_set = CharDataset(shakespeare_text[:1_000_000], window_length)
valid_set = CharDataset(shakespeare_text[1_000_000:1_060_000], window_length)
test_set  = CharDataset(shakespeare_text[1_060_000:], window_length)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)
test_loader  = DataLoader(test_set, batch_size=batch_size)


Each batch contains shuffled windows and their shifted targets.

The window length limits how far back the model can learn dependencies.


## Why We Need Embeddings

Token IDs are arbitrary numbers: nearby IDs are not necessarily similar.

One-hot encoding fixes this but scales poorly for large vocabularies.

**Embeddings** solve this by learning dense vector representations.


### What Is an Embedding?

An embedding maps each category to a trainable dense vector.

Embeddings are learned during training and capture semantic relationships.


### Example: Using nn.Embedding


In [None]:
import torch.nn as nn

torch.manual_seed(42)
embed = nn.Embedding(5, 3)
embed(torch.tensor([[3, 2], [0, 2]]))


Embedding layers are equivalent to one-hot encoding followed by a linear layer,
but far more efficient.
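The equivalence can be checked by hand on a tiny example. The sketch below uses plain Python lists and a made-up 4×3 embedding matrix (all values are assumptions), and it treats the linear layer as having no bias term. Multiplying a one-hot vector by the matrix gives the same result as reading one row, but it touches every row to do so.

```python
# Hypothetical 4-token vocabulary with 3-dimensional embeddings (arbitrary values).
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_then_linear(token_id):
    """One-hot vector times the weight matrix (a bias-free linear layer)."""
    vec = one_hot(token_id, len(embedding_matrix))
    n_dims = len(embedding_matrix[0])
    return [
        sum(vec[row] * embedding_matrix[row][col] for row in range(len(vec)))
        for col in range(n_dims)
    ]

def embedding_lookup(token_id):
    """Direct row lookup: what an embedding layer does."""
    return embedding_matrix[token_id]

print(one_hot_then_linear(2) == embedding_lookup(2))  # True
```

The lookup costs one row read regardless of vocabulary size, whereas the one-hot product grows with the number of tokens.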


## Building the Char-RNN Model

We use:
- An embedding layer
- A two-layer GRU
- A linear output layer


In [None]:
class ShakespeareModel(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=10, hidden_dim=128,
                 dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers,
                          batch_first=True, dropout=dropout)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        embeddings = self.embed(X)
        outputs, _states = self.gru(embeddings)
        return self.output(outputs).permute(0, 2, 1)


In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU when one is available

torch.manual_seed(42)
model = ShakespeareModel(len(vocab)).to(device)


The output is permuted because nn.CrossEntropyLoss expects
the class dimension to be second.


## Predicting the Next Character


In [None]:
model.eval()

text = "To be or not to b"
encoded_text = encode_text(text).unsqueeze(0).to(device)

with torch.no_grad():
    Y_logits = model(encoded_text)
    predicted_char_id = Y_logits[0, :, -1].argmax().item()
    predicted_char = id_to_char[predicted_char_id]

predicted_char


## Generating New Text

Greedy decoding often leads to repetition.

Instead, we **sample** characters using predicted probabilities.


In [None]:
import torch.nn.functional as F

def next_char(model, text, temperature=1):
    encoded_text = encode_text(text).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(encoded_text)[0, :, -1]
        probs = F.softmax(logits / temperature, dim=-1)
        char_id = torch.multinomial(probs, num_samples=1).item()
    return id_to_char[char_id]
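The difference between greedy decoding and sampling can be seen without a trained model. This sketch uses a made-up probability table in place of the softmax output (the characters and values are assumptions) and the standard library's `random.choices` rather than `torch.multinomial`.

```python
import random

# Made-up next-character probabilities, standing in for a model's softmax output.
probs = {"e": 0.6, "a": 0.3, "u": 0.1}

# Greedy decoding: always take the single most likely character.
greedy_choice = max(probs, key=probs.get)

# Sampling: draw characters in proportion to their probabilities.
rng = random.Random(42)  # seeded so the run is repeatable
samples = rng.choices(list(probs), weights=list(probs.values()), k=1000)

print(greedy_choice)
print({char: samples.count(char) for char in probs})
```

Greedy decoding returns the same character every time, while sampling occasionally picks the less likely ones, which is what breaks repetitive loops.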


In [None]:
def extend_text(model, text, n_chars=80, temperature=1):
    for _ in range(n_chars):
        text += next_char(model, text, temperature)
    return text


### Effect of Temperature


In [None]:
print(extend_text(model, "To be or not to b", temperature=0.01))


In [None]:
print(extend_text(model, "To be or not to b", temperature=0.4))


In [None]:
print(extend_text(model, "To be or not to b", temperature=100))


Low temperature → repetitive but coherent  
Medium temperature → best balance  
High temperature → chaos
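To see why temperature has this effect, the following sketch applies a temperature-scaled softmax to three made-up logits in plain Python. The function name and the logit values are assumptions chosen for illustration; the scaling itself matches the `logits / temperature` step used in `next_char()`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
for temperature in (0.01, 1.0, 100.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a very low temperature almost all probability mass lands on the top logit, so sampling behaves like greedy decoding. At a very high temperature the distribution is nearly uniform, so the samples are close to random.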


## Final Notes

- Char-RNNs learn surprisingly rich representations
- They are limited by window length
- Sampling strategy strongly affects output quality

Next up: **sentiment analysis**.


# Sentiment Analysis Using Hugging Face Libraries

One of the most common applications of NLP is text classification, especially sentiment analysis.

If image classification on the MNIST dataset is the “Hello, world!” of computer vision, then sentiment analysis on the IMDb reviews dataset is the “Hello, world!” of natural language processing.

The IMDb dataset consists of 50,000 movie reviews in English (25,000 for training, 25,000 for testing), each labeled as negative (0) or positive (1).

It is simple enough to run on a laptop but challenging enough to be interesting.


## Loading the IMDb Dataset

We will use the Hugging Face Datasets library to download IMDb.


In [None]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
split = imdb_dataset["train"].train_test_split(train_size=0.8, seed=42)
imdb_train_set, imdb_valid_set = split["train"], split["test"]
imdb_test_set = imdb_dataset["test"]


## Inspecting the Dataset


In [None]:
imdb_train_set[1]["text"], imdb_train_set[1]["label"]


In [None]:
imdb_train_set[16]["text"], imdb_train_set[16]["label"]


The first review is clearly positive.  
The second contains mixed sentiment but ends negatively.

This is a nontrivial classification task.


## Tokenization with Subword Models

A simple character RNN would struggle here.

We need a better tokenization strategy that can handle rare words and morphology.


## Byte Pair Encoding (BPE)

BPE starts with characters and repeatedly merges the most frequent adjacent pairs until a target vocabulary size is reached.
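The merge loop is easier to follow on a toy corpus. The sketch below is a simplified pure-Python illustration, not the `tokenizers` implementation: the word frequencies are made up, the helper names are hypothetical, and it stops after a fixed number of merges instead of a target vocabulary size.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pair_counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return pair_counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a made-up frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # three merges, for illustration only
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After the first two merges the frequent suffix "est" has become a single symbol, which is how common subwords end up as individual tokens.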


In [None]:
import tokenizers

bpe_model = tokenizers.models.BPE(unk_token="<unk>")
bpe_tokenizer = tokenizers.Tokenizer(bpe_model)
bpe_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()

special_tokens = ["<pad>", "<unk>"]
bpe_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=special_tokens
)

train_reviews = [review["text"].lower() for review in imdb_train_set]
bpe_tokenizer.train_from_iterator(train_reviews, bpe_trainer)


## Using the BPE Tokenizer


In [None]:
some_review = "what an awesome movie!"
bpe_encoding = bpe_tokenizer.encode(some_review)
bpe_encoding.tokens, bpe_encoding.ids


Frequent words become single tokens, while rare words are split into several subword tokens.

Unknown characters (like emojis) are replaced by `<unk>`.


In [None]:
bpe_tokenizer.decode(bpe_encoding.ids)


## Padding and Truncation


In [None]:
import torch

bpe_tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
bpe_tokenizer.enable_truncation(max_length=500)

bpe_encodings = bpe_tokenizer.encode_batch(train_reviews[:3])
bpe_batch_ids = torch.tensor([e.ids for e in bpe_encodings])
bpe_batch_ids


Padding ensures equal-length sequences.

Attention masks mark real tokens with 1 and padding tokens with 0.


In [None]:
attention_mask = torch.tensor([e.attention_mask for e in bpe_encodings])
lengths = attention_mask.sum(dim=-1)
attention_mask, lengths
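What padding and the attention mask do can be reproduced by hand. This is a minimal pure-Python sketch with a hypothetical `pad_batch` helper and made-up token IDs; it pads on the right with ID 0 and uses the 1/0 mask convention shown above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

token_ids = [[5, 6, 7], [8], [9, 10]]  # made-up token IDs of different lengths
padded, mask = pad_batch(token_ids)
lengths = [sum(row) for row in mask]   # summing the mask recovers the true lengths

print(padded)   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)     # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
print(lengths)  # [3, 1, 2]
```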


## Byte-Level BPE (BBPE)

Whitespace tokenization loses space information.

ByteLevel pre-tokenization preserves spaces and supports all Unicode bytes.


## WordPiece Tokenization

WordPiece is similar to BPE but uses a likelihood-based scoring function.

It often produces shorter sequences and more meaningful splits.


## Unigram Language Model Tokenization

Unigram LM starts with a large vocabulary and removes tokens that reduce likelihood the least.

It works well for languages without spaces.


## Reusing Pretrained Tokenizers


In [None]:
import transformers

gpt2_tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
gpt2_encoding = gpt2_tokenizer(
    train_reviews[:3],
    truncation=True,
    max_length=500
)


In [None]:
gpt2_token_ids = gpt2_encoding["input_ids"][0][:10]
gpt2_tokenizer.decode(gpt2_token_ids)


In [None]:
bert_tokenizer = transformers.AutoTokenizer.from_pretrained(
    "bert-base-uncased"
)

bert_encoding = bert_tokenizer(
    train_reviews[:3],
    padding=True,
    truncation=True,
    max_length=500,
    return_tensors="pt"
)

bert_encoding["input_ids"], bert_encoding["attention_mask"]


## DataLoader Tokenization with collate_fn


In [None]:
from torch.utils.data import DataLoader

def collate_fn(batch, tokenizer=bert_tokenizer):
    reviews = [r["text"] for r in batch]
    labels = [[r["label"]] for r in batch]
    encodings = tokenizer(
        reviews,
        padding=True,
        truncation=True,
        max_length=200,
        return_tensors="pt"
    )
    labels = torch.tensor(labels, dtype=torch.float32)
    return encodings, labels

batch_size = 256

imdb_train_loader = DataLoader(
    imdb_train_set,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_fn
)


## Sentiment Analysis Model (GRU)


In [None]:
import torch.nn as nn

class SentimentAnalysisModel(nn.Module):
    def __init__(self, vocab_size, n_layers=2, embed_dim=128,
                 hidden_dim=64, pad_id=0, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.gru = nn.GRU(
            embed_dim,
            hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            dropout=dropout
        )
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, encodings):
        embeddings = self.embed(encodings["input_ids"])
        _, hidden_states = self.gru(embeddings)
        return self.output(hidden_states[-1])


## Packed Sequences to Ignore Padding


In [None]:
from torch.nn.utils.rnn import pack_padded_sequence

def forward(self, encodings):
    embeddings = self.embed(encodings["input_ids"])
    lengths = encodings["attention_mask"].sum(dim=1)
    packed = pack_padded_sequence(
        embeddings,
        lengths.cpu(),
        batch_first=True,
        enforce_sorted=False
    )
    _, hidden_states = self.gru(packed)
    return self.output(hidden_states[-1])


## Bidirectional GRU

Bidirectional RNNs read sequences forward and backward.

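The forward and backward states are concatenated, so the input size of the output layer must be doubled. In `forward()`, the final states of both directions (`hidden_states[-2]` and `hidden_states[-1]`) should be concatenated before the output layer, since `hidden_states[-1]` alone holds only the backward direction.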


In [None]:
self.gru = nn.GRU(
    embed_dim,
    hidden_dim,
    num_layers=n_layers,
    batch_first=True,
    dropout=dropout,
    bidirectional=True
)

self.output = nn.Linear(2 * hidden_dim, 1)


## Reusing Pretrained BERT Embeddings


In [None]:
bert_model = transformers.AutoModel.from_pretrained("bert-base-uncased")

class SentimentAnalysisModelPreEmbeds(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_dim=64):
        super().__init__()
        weights = pretrained_embeddings.weight.data
        self.embed = nn.Embedding.from_pretrained(weights, freeze=True)
        embed_dim = weights.shape[-1]
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, 1)


## Using Full BERT for Classification


In [None]:
class SentimentAnalysisModelBert(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.bert = transformers.AutoModel.from_pretrained(
            "bert-base-uncased"
        )
        embed_dim = self.bert.config.hidden_size
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, encodings):
        x = self.bert(**encodings).last_hidden_state
        lengths = encodings["attention_mask"].sum(dim=1)
        packed = pack_padded_sequence(
            x, lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )
        _, hidden_states = self.gru(packed)
        return self.output(hidden_states[-1])


## BertForSequenceClassification


In [None]:
from transformers import BertForSequenceClassification

bert_for_binary_clf = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    dtype=torch.float16
)


## Trainer API


In [None]:
def tokenize_batch(batch):
    return bert_tokenizer(
        batch["text"],
        truncation=True,
        max_length=200
    )

tok_imdb_train_set = imdb_train_set.map(tokenize_batch, batched=True)
tok_imdb_valid_set = imdb_valid_set.map(tokenize_batch, batched=True)


In [None]:
def compute_accuracy(pred):
    return {
        "accuracy": (pred.label_ids == pred.predictions.argmax(-1)).mean()
    }


In [None]:
from transformers import TrainingArguments

train_args = TrainingArguments(
    output_dir="my_imdb_model",
    num_train_epochs=2,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)


In [None]:
from transformers import Trainer, DataCollatorWithPadding

trainer = Trainer(
    model=bert_for_binary_clf,
    args=train_args,
    train_dataset=tok_imdb_train_set,
    eval_dataset=tok_imdb_valid_set,
    compute_metrics=compute_accuracy,
    data_collator=DataCollatorWithPadding(bert_tokenizer)
)

trainer.train()


## Pipelines API


In [None]:
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    truncation=True,
    max_length=512
)

classifier(train_reviews[:10])


## Final Notes

- Pretrained transformers dominate modern NLP
- Tokenization matters deeply
- Pipelines give instant results
- Fine-tuning gives best performance
- Bias must always be evaluated

Next: **Neural Machine Translation** 🚀


# An Encoder-Decoder Network for Neural Machine Translation

Let’s begin with a relatively simple sequence-to-sequence NMT model that will translate English text to Spanish (see Figure 14-5).


## Model Architecture

English texts are fed as inputs to the encoder, and the decoder outputs the Spanish translations.

The Spanish translations are also used as inputs to the decoder during training, but shifted back by one step. In other words, during training the decoder is given as input the token that it should have output at the previous step, regardless of what it actually output.

This is called **teacher forcing**, a technique that significantly speeds up training and improves the model’s performance.

For the very first token, the decoder is given the start-of-sequence (SoS) token (`"<s>"`), and the decoder is expected to end the text with an end-of-sequence (EoS) token (`"</s>"`).
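The one-step shift behind teacher forcing is just two slices of the same target sequence. The sketch below uses a hand-written token list instead of real token IDs (the tokenization shown is an assumption) to show which token the decoder receives at each step and which token it is expected to output.

```python
# Target sentence wrapped in the SoS and EoS tokens (whole words, for readability).
target_tokens = ["<s>", "Me", "gusta", "el", "fútbol", "</s>"]

decoder_inputs = target_tokens[:-1]  # everything except the final "</s>"
labels = target_tokens[1:]           # everything except the initial "<s>"

for step, (given, expected) in enumerate(zip(decoder_inputs, labels)):
    print(f"step {step}: decoder is given {given!r}, should output {expected!r}")
```

At every step the input is the correct previous token, regardless of what the model actually predicted at the previous step.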


Each token is initially represented by its ID (e.g., 4553 for the token “soccer”). An `nn.Embedding` layer returns the token embedding, which is fed to the encoder and decoder.

At each step, the decoder’s dense output layer (`nn.Linear`) outputs a logit score for each token in the output vocabulary (Spanish).

Passing these logits through softmax gives a probability distribution over all tokens. This is a standard classification task, and the model is trained using `nn.CrossEntropyLoss`.


### TIP

In a 2015 paper, Samy Bengio et al. proposed gradually switching from feeding the decoder the previous target token to feeding it the previous output token during training.


## Loading the Dataset

We will use the Tatoeba Challenge dataset via the Hugging Face Datasets library.

The training set is large, so we use the validation set for training and split it into training and validation subsets.


In [None]:
from datasets import load_dataset

nmt_original_valid_set, nmt_test_set = load_dataset(
    path="ageron/tatoeba_mt_train",
    name="eng-spa",
    split=["validation", "test"]
)

split = nmt_original_valid_set.train_test_split(train_size=0.8, seed=42)
nmt_train_set, nmt_valid_set = split["train"], split["test"]


Each sample contains an English sentence and its Spanish translation.


In [None]:
nmt_train_set[0]


## Training a Shared BPE Tokenizer

Since English and Spanish share many words and subwords, we use a single tokenizer.

We train a BPE tokenizer on both English and Spanish text.


In [None]:
import tokenizers

def train_eng_spa():
    for pair in nmt_train_set:
        yield pair["source_text"]
        yield pair["target_text"]

max_length = 256
vocab_size = 10_000

nmt_tokenizer_model = tokenizers.models.BPE(unk_token="<unk>")
nmt_tokenizer = tokenizers.Tokenizer(nmt_tokenizer_model)

nmt_tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
nmt_tokenizer.enable_truncation(max_length=max_length)
nmt_tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()

nmt_tokenizer_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=vocab_size,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)

nmt_tokenizer.train_from_iterator(train_eng_spa(), nmt_tokenizer_trainer)


## Testing the Tokenizer


In [None]:
nmt_tokenizer.encode("I like soccer").ids


In [None]:
nmt_tokenizer.encode("<s> Me gusta el fútbol").ids


## Utility Class for Tokenized Pairs

We store token IDs and attention masks for both source and target sequences.


In [None]:
import torch
from collections import namedtuple

fields = ["src_token_ids", "src_mask", "tgt_token_ids", "tgt_mask"]

class NmtPair(namedtuple("NmtPairBase", fields)):
    def to(self, device):
        return NmtPair(
            self.src_token_ids.to(device),
            self.src_mask.to(device),
            self.tgt_token_ids.to(device),
            self.tgt_mask.to(device)
        )


In [None]:
from torch.utils.data import DataLoader

def nmt_collate_fn(batch):
    src_texts = [pair["source_text"] for pair in batch]
    tgt_texts = [f"<s> {pair['target_text']} </s>" for pair in batch]

    src_encodings = nmt_tokenizer.encode_batch(src_texts)
    tgt_encodings = nmt_tokenizer.encode_batch(tgt_texts)

    src_token_ids = torch.tensor([enc.ids for enc in src_encodings])
    tgt_token_ids = torch.tensor([enc.ids for enc in tgt_encodings])

    src_mask = torch.tensor([enc.attention_mask for enc in src_encodings])
    tgt_mask = torch.tensor([enc.attention_mask for enc in tgt_encodings])

    inputs = NmtPair(
        src_token_ids,
        src_mask,
        tgt_token_ids[:, :-1],
        tgt_mask[:, :-1]
    )

    labels = tgt_token_ids[:, 1:]
    return inputs, labels

batch_size = 32

nmt_train_loader = DataLoader(
    nmt_train_set,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=nmt_collate_fn
)

nmt_valid_loader = DataLoader(
    nmt_valid_set,
    batch_size=batch_size,
    collate_fn=nmt_collate_fn
)

nmt_test_loader = DataLoader(
    nmt_test_set,
    batch_size=batch_size,
    collate_fn=nmt_collate_fn
)


## Encoder-Decoder GRU Model


In [None]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class NmtModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, pad_id=0,
                 hidden_dim=512, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.encoder = nn.GRU(
            embed_dim, hidden_dim,
            num_layers=n_layers,
            batch_first=True
        )
        self.decoder = nn.GRU(
            embed_dim, hidden_dim,
            num_layers=n_layers,
            batch_first=True
        )
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pair):
        src_embeddings = self.embed(pair.src_token_ids)
        tgt_embeddings = self.embed(pair.tgt_token_ids)

        src_lengths = pair.src_mask.sum(dim=1)
        src_packed = pack_padded_sequence(
            src_embeddings,
            src_lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        )

        _, hidden_states = self.encoder(src_packed)
        outputs, _ = self.decoder(tgt_embeddings, hidden_states)

        return self.output(outputs).permute(0, 2, 1)


In [None]:
torch.manual_seed(42)
vocab_size = nmt_tokenizer.get_vocab_size()
nmt_model = NmtModel(vocab_size).to(device)


## Loss Function

Padding tokens should be ignored.


In [None]:
xentropy = nn.CrossEntropyLoss(ignore_index=0)
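To illustrate what ignoring an index means, here is a rough pure-Python sketch of a mean cross-entropy that skips padded positions. The helper name and the logits are made up, and it assumes at least one non-padding target; it mirrors the default mean reduction, which averages only over the targets that are not ignored.

```python
import math

def cross_entropy_ignoring_pad(logits_per_step, target_ids, pad_id=0):
    """Mean cross-entropy over the steps whose target is not the padding ID."""
    total, count = 0.0, 0
    for logits, target in zip(logits_per_step, target_ids):
        if target == pad_id:
            continue  # padded positions contribute neither loss nor count
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[target]  # negative log-probability of the target
        count += 1
    return total / count  # assumes at least one non-padding target

# Made-up logits over a 3-token vocabulary; the last step's target is padding (ID 0).
logits_per_step = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.0, 1.0]]
target_ids = [1, 2, 0]

with_padding = cross_entropy_ignoring_pad(logits_per_step, target_ids)
without_padding = cross_entropy_ignoring_pad(logits_per_step[:2], target_ids[:2])
print(with_padding, without_padding)
```

Both calls return the same value, so appending padded steps to a sequence does not change its loss.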


## Translation Helper Function


In [None]:
def translate(model, src_text, max_length=20, eos_id=3):
    tgt_text = ""
    for index in range(max_length):
        batch, _ = nmt_collate_fn([{
            "source_text": src_text,
            "target_text": tgt_text
        }])

        with torch.no_grad():
            logits = model(batch.to(device))
            token_ids = logits.argmax(dim=1)
            next_token_id = token_ids[0, index]

        next_token = nmt_tokenizer.id_to_token(next_token_id)
        tgt_text += " " + next_token

        if next_token_id == eos_id:
            break

    return tgt_text


## Testing the Translator


In [None]:
nmt_model.eval()
translate(nmt_model, "I like soccer.")


## Model Optimizations

- Sampled softmax
- Adaptive softmax (`nn.AdaptiveLogSoftmaxWithLoss`)
- Weight tying between embedding and output layers


## Limitations

The model struggles with long sentences and loses details.


In [None]:
longer_text = "I like to play soccer with my friends."
translate(nmt_model, longer_text)


## Possible Improvements

- Larger dataset
- More GRU layers
- Bidirectional encoder
- Beam search
- Attention mechanisms (next section!)


# Beam Search

To translate an English text to Spanish, we call our model several times, producing one word at a time. Unfortunately, this means that when the model makes one mistake, it is stuck with it for the rest of the translation, which can cause more errors, making the translation worse and worse.


For example, suppose we want to translate “I like soccer”, and the model correctly starts with “Me”, but then predicts “gustan” (plural) instead of “gusta” (singular). This mistake is understandable, since “Me gustan” is the correct way to start translating “I like” in many cases.

Once the model has made this mistake, it is stuck with “gustan”. It then reasonably adds “los”, which is the plural for “the”. But since the model never saw “los fútbol” in the training data (soccer is singular, not plural), the model tries to find something reasonable to add, and given the context it adds “jugadores”, which means “players”.

So “I like soccer” gets translated to “I like the players”. One error caused a chain of errors.


## Why Beam Search?

How can we give the model a chance to go back and fix mistakes it made earlier?

One of the most common solutions is **beam search**: it keeps track of a short list of the *k* most promising output sequences (say, the top three), and at each decoder step it tries to extend each of them by one word, keeping only the *k* most likely sequences.

The parameter *k* is called the **beam width**.


## Example: Beam Width = 3

Suppose you translate the sentence “I like soccer” using beam search with a beam width of three (see Figure 14-7).

At the first decoder step, the model outputs an estimated probability for each possible first word. Suppose the top three words are:

- “Me” (75%)
- “a” (3%)
- “como” (1%)

These become our initial beam.


Next, the model predicts the next word for each of these sentences.

For “Me”, it might output:
- “gustan” (36%)
- “gusta” (32%)
- “encanta” (16%)

These are **conditional probabilities**, given that the sentence starts with “Me”.

Since the vocabulary may contain 10,000 tokens, this results in 10,000 candidate continuations per beam entry.


We now compute the probability of each two-word sentence by multiplying probabilities.

For example:
- P("Me") = 75%
- P("gustan" | "Me") = 36%

So:
- P("Me gustan") = 0.75 × 0.36 = 0.27 (27%)


After computing probabilities for all 30,000 two-word sequences (3 × 10,000), we keep only the top three:

- “Me gustan” (27%)
- “Me gusta” (24%)
- “Me encanta” (12%)

Even though “Me gustan” is currently the best, “Me gusta” is still alive.


## Continuing the Search

Repeating the process, the top candidates may become:

- “Me gustan los” (10%)
- “Me gusta el” (8%)
- “Me gusta mucho” (2%)

At the next step:
- “Me gusta el fútbol” (6%)
- “Me gusta mucho el” (1%)
- “Me gusta el deporte” (0.2%)

Notice that “Me gustan” has now been eliminated, and the correct translation is winning.
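The walkthrough above can be replayed with a toy stand-in for the model. In the sketch below, `TOY_MODEL` is a hand-written table of next-token probabilities (the values are assumptions chosen to roughly match the percentages in the example), and `beam_search_toy` is a hypothetical helper, not the notebook's `beam_search()` function. It sums log-probabilities instead of multiplying probabilities, which gives the same ranking while avoiding very small numbers.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"Me": 0.75, "a": 0.03, "como": 0.01},
    ("Me",): {"gustan": 0.36, "gusta": 0.32, "encanta": 0.16},
    ("Me", "gustan"): {"los": 0.37},
    ("Me", "gusta"): {"el": 0.33, "mucho": 0.08},
    ("Me", "gustan", "los"): {"jugadores": 0.05},
    ("Me", "gusta", "el"): {"fútbol": 0.75},
    ("Me", "gusta", "mucho"): {"el": 0.5},
}

def beam_search_toy(model, beam_width, n_steps):
    """Keep the `beam_width` most likely sequences after each decoding step."""
    beams = [((), 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(n_steps):
        candidates = []
        for tokens, log_prob in beams:
            # Prefixes missing from the toy table have no continuation and drop out.
            for token, prob in model.get(tokens, {}).items():
                candidates.append((tokens + (token,), log_prob + math.log(prob)))
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

for width in (1, 3):
    tokens, log_prob = beam_search_toy(TOY_MODEL, beam_width=width, n_steps=4)[0]
    print(width, " ".join(tokens), round(math.exp(log_prob), 4))
```

With a beam width of 1 the search is greedy and gets stuck on “Me gustan los jugadores”, while a beam width of 3 keeps “Me gusta” alive long enough for “Me gusta el fútbol” to win.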


## Key Takeaway

Beam search improves translation quality **without any extra training**, simply by using the model more intelligently at inference time.


## Practical Implementations

The notebook for this chapter contains a very simple `beam_search()` function.

In practice, you will usually want to use the implementation provided by the `GenerationMixin` class in the Transformers library.


This is where the text generation models from the Transformers library get their `generate()` method.

It supports:
- `num_beams` for beam width
- `do_sample` for probabilistic sampling
- combinations of multiple decoding strategies


For more details, see:
https://homl.info/hfgen


## Example Output with Beam Search


In [None]:
beam_search(nmt_model, longer_text, beam_width=3)


This produces a correct translation:


' Me gusta jugar al fútbol con mis amigos . </s>'


## Limitations on Long Sentences

Unfortunately, the model still struggles with long sentences.


In [None]:
longest_text = "I like to play soccer with my friends at the beach."
beam_search(nmt_model, longest_text, beam_width=3)


This produces:


' Me gusta jugar con jugar con los jug adores de la playa . </s>'


Which translates to:

“I like to play with play with the players of the beach”.

The core issue remains the limited short-term memory of RNNs.


## What’s Next?

Attention mechanisms are the game-changing innovation that addressed this problem.


# Attention Mechanisms

Consider the path from the word “soccer” to its translation “fútbol” back in Figure 14-5: it is quite long! This means that a representation of this word (along with all the other words) needs to be carried over many steps before it is actually used. Can’t we make this path shorter?


This was the core idea in a landmark 2014 paper by Dzmitry Bahdanau et al., where the authors introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step.


For example, at the time step where the decoder needs to output the word “fútbol”, it will focus its attention on the word “soccer”. This means that the path from an input word to its translation is now much shorter, so the short-term memory limitations of RNNs have much less impact.


## Encoder-Decoder with Attention

Figure 14-8 shows our encoder-decoder model with an added attention mechanism.


- On the left, you have the encoder and the decoder (the encoder is bidirectional).
- Instead of sending only the encoder’s final hidden state to the decoder, we now send **all encoder outputs**.


Since the decoder cannot deal with all encoder outputs at once, they are aggregated.

At each time step, the decoder computes a **weighted sum** of all encoder outputs. This determines which words the decoder focuses on.


The weight α(t,i) is the weight of the *i*th encoder output at the *t*th decoder time step.

If α(3,2) is much larger than α(3,0) and α(3,1), then the decoder focuses mostly on encoder output #2 (e.g., the word “soccer”).


The rest of the decoder works as before: at each time step, it receives:
- the attention-weighted sum of encoder outputs described above
- the hidden state from the previous step
- the previous target word (or previous output at inference time)


## Alignment Model (Attention Layer)

The attention weights α(t,i) are generated by a small neural network called an **alignment model**.


This model:
- Takes each encoder output
- Takes the decoder’s previous hidden state
- Outputs a score (energy) measuring how well they align


For example, after outputting “me gusta el”, the decoder expects a noun. The encoder output corresponding to “soccer” aligns best, so it gets a high score.


All scores go through a softmax layer to produce attention weights that sum to 1.


This attention mechanism is called **Bahdanau attention** (or additive / concatenative attention).
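The alignment model can be sketched as a small PyTorch module. This is one common concatenative formulation, not necessarily the exact layer from the paper. The class name `AdditiveAttention` and the sizes `enc_dim`, `dec_dim`, and `attn_dim` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of a concatenative alignment model (names are illustrative)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, encoder_outputs, prev_hidden):
        # encoder_outputs: [B, L, enc_dim]; prev_hidden: [B, dec_dim]
        L = encoder_outputs.size(1)
        # Repeat the decoder state so it is paired with every encoder output
        hidden = prev_hidden.unsqueeze(1).expand(-1, L, -1)    # [B, L, dec_dim]
        pairs = torch.cat((encoder_outputs, hidden), dim=-1)   # [B, L, enc_dim + dec_dim]
        # One score (energy) per encoder output
        energies = self.score(torch.tanh(self.proj(pairs))).squeeze(-1)  # [B, L]
        # Softmax turns the scores into weights that sum to 1
        weights = torch.softmax(energies, dim=-1)              # [B, L]
        # Weighted sum of the encoder outputs
        context = (weights.unsqueeze(-1) * encoder_outputs).sum(dim=1)   # [B, enc_dim]
        return context, weights

torch.manual_seed(0)
attn = AdditiveAttention(enc_dim=8, dec_dim=6, attn_dim=5)
context, weights = attn(torch.randn(2, 3, 8), torch.randn(2, 6))
print(context.shape, weights.sum(dim=-1))
```

The alignment model is trained jointly with the rest of the network, since every operation here is differentiable.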


## Attention as Differentiable Memory Retrieval

Attention can be viewed as a differentiable memory lookup mechanism.


Suppose the encoder learned:
{"subject": "I", "verb": "like", "noun": "soccer"}

The decoder wants to retrieve the noun.


The decoder does not use symbolic keys like “noun”. Instead, it uses learned vector representations.

It computes similarity scores between the query and each key, applies softmax, and retrieves a weighted sum of values.


If the “noun” representation matches best, its weight will be near 1, and the retrieved vector will be close to “soccer”.
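The lookup can be illustrated with a toy example. The key, value, and query vectors below are hand-picked stand-ins for learned representations, so this is only a sketch of the mechanism.

```python
import torch

# Hypothetical "learned" representations, hand-picked for illustration
keys = torch.tensor([[1.0, 0.0, 0.0],    # stands for "subject"
                     [0.0, 1.0, 0.0],    # stands for "verb"
                     [0.0, 0.0, 1.0]])   # stands for "noun"
values = torch.tensor([[1.0, 1.0],       # stands for "I"
                       [2.0, 2.0],       # stands for "like"
                       [3.0, 3.0]])      # stands for "soccer"

query = torch.tensor([0.0, 0.0, 10.0])   # a query that resembles the "noun" key

scores = keys @ query                    # [3] similarity of the query to each key
weights = torch.softmax(scores, dim=0)   # [3] weights that sum to 1
retrieved = weights @ values             # [2] weighted sum of the values
print(weights, retrieved)
```

The weight on the third key is very close to 1, so `retrieved` is almost exactly the “soccer” value. Unlike a dictionary lookup, the result changes smoothly with the query, which is what makes it differentiable.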


In modern terminology:
- **Query** → decoder hidden states
- **Key** → encoder outputs (for scoring)
- **Value** → encoder outputs (for weighted sum)


NOTE

If the input sentence is n words long and the output sentence is about as long, attention requires computing about n² weights, which becomes expensive for very long sequences.


## Luong (Multiplicative) Attention

In 2015, Minh-Thang Luong et al. proposed **dot-product attention**, which computes similarity using a dot product.


This is faster and often more effective. It requires the encoder and decoder vectors to have the same dimensionality.


Luong attention uses the decoder’s **current hidden state**, concatenates the attention vector with it, and uses this to predict the next token.


The dot-product variants outperformed additive attention, so Bahdanau attention is less common today.


## Implementing Luong Attention


In [None]:
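import torch

def attention(query, key, value):  # note: dq == dk and Lk == Lv
    # Dot-product score between every query and every key
    scores = query @ key.transpose(1, 2)  # [B,Lq,dq] @ [B,dk,Lk] = [B, Lq, Lk]
    # Normalize over the key axis so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)  # [B, Lq, Lk]
    return weights @ value  # [B, Lq, Lk] @ [B, Lv, dv] = [B, Lq, dv]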
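As a quick sanity check of the shapes in the comments, the function can be called on random tensors. The sizes below are arbitrary, and `attention()` is repeated only so the snippet runs on its own.

```python
import torch

def attention(query, key, value):
    scores = query @ key.transpose(1, 2)     # [B, Lq, Lk]
    weights = torch.softmax(scores, dim=-1)  # [B, Lq, Lk]
    return weights @ value                   # [B, Lq, dv]

B, Lq, Lk, d, dv = 2, 5, 7, 16, 32           # arbitrary sizes for illustration
query = torch.randn(B, Lq, d)
key = torch.randn(B, Lk, d)                  # same feature size as the query
value = torch.randn(B, Lk, dv)               # same length as the key

out = attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 32])
```

There is one output vector per query, and its size matches the value vectors rather than the keys.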


This follows Equation 14-2:
1. Compute attention scores
2. Apply softmax
3. Compute weighted sum of values
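
In matrix notation, the three steps performed by the `attention()` function can be written as follows. This restates the code for a single batch element, with $Q$, $K$, and $V$ standing for the query, key, and value matrices. It is not a reproduction of the book's numbered equation.

$$
\begin{aligned}
S &= Q K^\top && \text{(scores, shape } L_q \times L_k\text{)} \\
A &= \operatorname{softmax}(S) && \text{(applied row-wise, so each row sums to 1)} \\
\text{output} &= A V && \text{(shape } L_q \times d_v\text{)}
\end{aligned}
$$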

TIP

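You can replace the `@` operator with `torch.bmm()`, which performs the same batch matrix multiplication explicitly. For 3D tensors the two are equivalent, so any speed difference is usually small.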
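Both spellings compute the same batched product on 3D tensors. A quick check on random tensors (the sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 5, 16)   # [B, Lq, d]
b = torch.randn(2, 16, 7)   # [B, d, Lk]

# Compare the operator form with the explicit batch matrix multiplication
same = torch.allclose(a @ b, torch.bmm(a, b), atol=1e-5)
print(same)
```

Note that `torch.bmm()` only accepts 3D tensors, whereas `@` also broadcasts over extra leading dimensions.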


## Updating the NMT Model


The output layer must accept concatenated vectors:


In [None]:
self.output = nn.Linear(2 * hidden_dim, vocab_size)


### Updated `forward()` Method


In [None]:
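def forward(self, pair):
    # Embed the source and target token IDs
    src_embeddings = self.embed(pair.src_token_ids)
    tgt_embeddings = self.embed(pair.tgt_token_ids)
    # Length of each source sequence, computed from its mask
    src_lengths = pair.src_mask.sum(dim=1)
    src_packed = pack_padded_sequence(
        src_embeddings, lengths=src_lengths.cpu(),
        batch_first=True, enforce_sorted=False)

    # Keep the encoder outputs this time: attention needs them
    encoder_outputs_packed, hidden_states = self.encoder(src_packed)
    decoder_outputs, _ = self.decoder(tgt_embeddings, hidden_states)

    # Unpack the encoder outputs into a regular padded tensor
    encoder_outputs, _ = pad_packed_sequence(
        encoder_outputs_packed, batch_first=True)

    # Decoder outputs are the queries; encoder outputs are keys and values
    attn_output = attention(
        query=decoder_outputs,
        key=encoder_outputs,
        value=encoder_outputs)

    # Concatenate along the feature axis, doubling the feature size
    combined_output = torch.cat(
        (attn_output, decoder_outputs), dim=-1)

    # Move the vocabulary dimension to position 1
    return self.output(combined_output).permute(0, 2, 1)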


### Explanation

- Encoder outputs are no longer discarded
- Packed sequences are unpacked before attention
- Decoder outputs act as queries
- Attention output is concatenated with decoder output


WARNING

Padding tokens are not masked. The model learns to ignore them, but masking is preferable.
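
A minimal sketch of what masking could look like, assuming a boolean mask that is `True` for real tokens and `False` for padding. The function name `masked_attention` and the `key_mask` argument are hypothetical additions, not part of the model above.

```python
import torch

def masked_attention(query, key, value, key_mask):
    # key_mask: [B, Lk] boolean, True for real tokens, False for padding
    scores = query @ key.transpose(1, 2)                  # [B, Lq, Lk]
    # Set the scores of padded keys to -inf so softmax gives them weight 0
    scores = scores.masked_fill(~key_mask[:, None, :], float("-inf"))
    weights = torch.softmax(scores, dim=-1)               # [B, Lq, Lk]
    return weights @ value, weights

torch.manual_seed(0)
query = torch.randn(2, 4, 8)
key = value = torch.randn(2, 5, 8)
key_mask = torch.tensor([[True, True, True, False, False],   # last 2 are padding
                         [True, True, True, True, True]])
out, weights = masked_attention(query, key, value, key_mask)
print(weights[0, 0])  # the last two entries are exactly 0
```

This sketch assumes every sequence has at least one real token; a row that is entirely masked would produce NaN values after the softmax.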




In [None]:


Output:


' Me gusta jugar fútbol con mis amigos en la playa . </s>'


## Final Notes

Attention mechanisms were so powerful that researchers removed recurrent layers entirely.

This led to the Transformer architecture and the paper:
**“Attention Is All You Need.”**
