# Homework: Transformer and Label Smoothing

In this notebook I will:

- train a real-world example from The Annotated Transformer (EN↔RU on opus_books),
- record the training time of the baseline model,
- run experiments with Label Smoothing = 0.0, 0.5, 0.7, 0.9,
- compare convergence curves and translation quality (BLEU),
- summarize conclusions.


## Dependencies installation

If packages are missing in your environment, run the cell below. Using Jupyter-safe `%pip` ensures packages are installed into the active kernel. You may need to restart the kernel after installation.


In [1]:
import sys; print(sys.executable, sys.version)
%pip install -q datasets sacrebleu spacy==3.7.5
import spacy, spacy.cli
spacy.cli.download("en_core_web_sm")
spacy.cli.download("ru_core_news_sm")


/home/denys/PycharmProjects/NLPAdv/.venv/bin/python 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/home/denys/PycharmProjects/NLPAdv/.venv/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/home/denys/PycharmProjects/NLPAdv/.venv/bin/python -m pip install --upgrade pip[0m


[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting ru-core-news-sm==3.7.0
  Using cached https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.7.0/ru_core_news_sm-3.7.0-py3-none-any.whl (15.3 MB)
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('ru_core_news_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/home/denys/PycharmProjects/NLPAdv/.venv/bin/python -m pip install --upgrade pip[0m


## Common imports


In [6]:
import os, time, math, copy, random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import matplotlib.pyplot as plt
from tqdm.auto import tqdm, trange
import sacrebleu
import spacy

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device


device(type='cpu')

## Load model code from The_Annotated_Transformer.ipynb

For simplicity, we reuse the implementations of the `Transformer`, training utilities, and masking functions directly from the original notebook.

Note: executing the original notebook may take time and includes demo cells; after loading the code we will (re)initialize the model for our experiments.


In [1]:
%run -i The_Annotated_Transformer.ipynb


/bin/bash: line 1: pip: command not found
/bin/bash: line 1: pip: command not found
/bin/bash: line 1: python: command not found
/bin/bash: line 1: python: command not found


ModuleNotFoundError: No module named 'seaborn'

ModuleNotFoundError: No module named 'seaborn'

## Data: opus_books (en-ru)


In [None]:
from datasets import load_dataset

MAX_LEN = 100

data = load_dataset('opus_books', 'en-ru')
data2 = data['train'].filter(
    lambda x: max(len(x['translation']['ru']), len(x['translation']['en'])) <= MAX_LEN
).train_test_split(test_size=1000, shuffle=True, seed=2)

train_src = [d['ru'] for d in data2['train']['translation']]
train_trg = [d['en'] for d in data2['train']['translation']]
val_src = [d['ru'] for d in data2['test']['translation']]
val_trg = [d['en'] for d in data2['test']['translation']]

len(train_src), len(val_src)


### Tokenization and vocabularies


In [None]:
spacy_ru = spacy.load('ru_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_ru(text):
    return [tok.text for tok in spacy_ru.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = '<blank>'
UNK_WORD = '<unk>'

from collections import Counter

def build_vocab(texts, tokenize, min_freq=3, init_token=BOS_WORD, eos_token=EOS_WORD, pad_token=BLANK_WORD, unk_token=UNK_WORD):
    cnt = Counter()
    for text in tqdm(texts, desc='build_vocab'):
        cnt.update(tokenize(text))
    vocab = [pad_token, init_token, eos_token, unk_token]
    for w, c in cnt.most_common():
        if c < min_freq:
            break
        vocab.append(w)
    return vocab

src_vocab = build_vocab(train_src, tokenize_ru)
tgt_vocab = build_vocab(train_trg, tokenize_en)
inv_voc_src = {w:i for i, w in enumerate(src_vocab)}
inv_voc_tgt = {w:i for i, w in enumerate(tgt_vocab)}

len(src_vocab), len(tgt_vocab)


In [None]:
def tokenize_indices(text, tokenize_fn, inv_vocab, bos_id=1, eos_id=2, unk_id=3):
    result = [bos_id]
    for word in tokenize_fn(text):
        result.append(inv_vocab.get(word, unk_id))
    result.append(eos_id)
    return result

def padding(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len-len(s)) for s in sequences]

train_src_tok = [tokenize_indices(t, tokenize_ru, inv_voc_src) for t in tqdm(train_src, desc='tok_src_train')]
train_tgt_tok = [tokenize_indices(t, tokenize_en, inv_voc_tgt) for t in tqdm(train_trg, desc='tok_tgt_train')]
val_src_tok = [tokenize_indices(t, tokenize_ru, inv_voc_src) for t in tqdm(val_src, desc='tok_src_val')]
val_tgt_tok = [tokenize_indices(t, tokenize_en, inv_voc_tgt) for t in tqdm(val_trg, desc='tok_tgt_val')]


In [None]:
class Batch:
    def __init__(self, src, trg, src_mask, trg_mask, ntokens):
        self.src = src
        self.trg = trg
        self.src_mask = src_mask
        self.trg_mask = trg_mask
        self.ntokens = ntokens

def data_iterator(srcs, tgts, batch_size=128, shuffle=True, pad_id=0):
    if shuffle:
        pairs = list(zip(srcs, tgts))
        random.shuffle(pairs)
        srcs, tgts = [list(t) for t in zip(*pairs)]
    for i in range(0, len(srcs), batch_size):
        x = torch.tensor(padding(srcs[i: i + batch_size], pad_id=pad_id))
        y = torch.tensor(padding(tgts[i: i + batch_size], pad_id=pad_id))
        src = Variable(x, requires_grad=False)
        tgt = Variable(y, requires_grad=False)
        src_mask, tgt_mask = make_std_mask(src, tgt, pad_id)
        yield Batch(src, tgt, src_mask, tgt_mask, (tgt[:, 1:] != pad_id).data.sum())


## Training: baseline run and Label Smoothing experiments


In [None]:
def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def train_once(train_src_tok, train_tgt_tok, val_src_tok, val_tgt_tok, src_vocab, tgt_vocab,
               epochs=3, batch_size=128, smoothing=0.1, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    pad_idx = tgt_vocab.index('<blank>')
    model = make_model(len(src_vocab), len(tgt_vocab), N=N, d_model=d_model, d_ff=d_ff, h=h, dropout=dropout).to(device)
    model_opt = get_std_opt(model)
    criterion = LabelSmoothing(size=len(tgt_vocab), padding_idx=pad_idx, smoothing=smoothing)
    if device.type == 'cuda':
        criterion = criterion.cuda()

    history = {'train_loss': [], 'val_loss': []}
    start = time.time()
    for epoch in trange(epochs, desc=f'train(smooth={smoothing})'):
        model.train()
        total_train = 0.0
        for i, batch in enumerate(data_iterator(train_src_tok, train_tgt_tok, batch_size=batch_size, pad_id=pad_idx)):
            src, trg, src_mask, trg_mask = batch.src.to(device), batch.trg.to(device), batch.src_mask.to(device), batch.trg_mask.to(device)
            out = model.forward(src, trg[:, :-1], src_mask, trg_mask[:, :-1, :-1])
            loss = loss_backprop(model.generator, criterion, out, trg[:, 1:], batch.ntokens)
            model_opt.step()
            model_opt.optimizer.zero_grad()
            total_train += loss
        history['train_loss'].append(total_train)

        # validation
        model.eval()
        total_val = 0.0
        with torch.no_grad():
            for batch in data_iterator(val_src_tok, val_tgt_tok, batch_size=batch_size, shuffle=False, pad_id=pad_idx):
                src, trg, src_mask, trg_mask = batch.src.to(device), batch.trg.to(device), batch.src_mask.to(device), batch.trg_mask.to(device)
                out = model.forward(src, trg[:, :-1], src_mask, trg_mask[:, :-1, :-1])
                loss = loss_backprop(model.generator, criterion, out, trg[:, 1:], batch.ntokens)
                total_val += loss
        history['val_loss'].append(total_val)

    elapsed = time.time() - start
    return model, history, elapsed

def decode_sentence(model, src_sentence, tokenize_fn, inv_src_vocab, tgt_vocab, max_len=60):
    src_idx = torch.LongTensor([tokenize_indices(src_sentence, tokenize_fn, inv_src_vocab)])
    src = Variable(src_idx)
    src_mask = (src != src_vocab.index('<blank>')).unsqueeze(-2)
    out = greedy_decode(model, src.to(device), src_mask.to(device), max_len=max_len, start_symbol=tgt_vocab.index('<s>'))
    words = []
    for i in range(1, out.size(1)):
        sym = tgt_vocab[out[0, i]]
        if sym == '</s>':
            break
        words.append(sym)
    return ' '.join(words)

def compute_bleu(model, src_texts, ref_texts, tokenize_src_fn, inv_src_vocab, tgt_vocab, sample_size=200):
    idx = list(range(len(src_texts)))
    random.shuffle(idx)
    idx = idx[:sample_size]
    hyps = []
    refs = []
    for i in tqdm(idx, desc='BLEU decode'):
        hyp = decode_sentence(model, src_texts[i], tokenize_src_fn, inv_src_vocab, tgt_vocab)
        hyps.append(hyp)
        refs.append(ref_texts[i])
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    return bleu.score, hyps[:5], refs[:5]


### Baseline run (as in the example): smoothing=0.1


In [None]:
BASE_EPOCHS = 3
BATCH_SIZE = 128

base_model, base_hist, base_time = train_once(
    train_src_tok, train_tgt_tok, val_src_tok, val_tgt_tok,
    src_vocab, tgt_vocab,
    epochs=BASE_EPOCHS, batch_size=BATCH_SIZE, smoothing=0.1
)
print(f"Base training time (smoothing=0.1): {base_time:.1f} sec")
print('Train loss:', base_hist['train_loss'])
print('Val   loss:', base_hist['val_loss'])

base_bleu, base_hyps, base_refs = compute_bleu(base_model, val_src, val_trg, tokenize_ru, inv_voc_src, tgt_vocab)
print(f"Validation BLEU (smoothing=0.1): {base_bleu:.2f}")


### Label Smoothing experiments: 0.0, 0.5, 0.7, 0.9


In [None]:
SMOOTHS = [0.0, 0.5, 0.7, 0.9]
results = {}

for sm in SMOOTHS:
    model, hist, t = train_once(
        train_src_tok, train_tgt_tok, val_src_tok, val_tgt_tok,
        src_vocab, tgt_vocab,
        epochs=BASE_EPOCHS, batch_size=BATCH_SIZE, smoothing=sm
    )
    bleu, hyps, refs = compute_bleu(model, val_src, val_trg, tokenize_ru, inv_voc_src, tgt_vocab)
    results[sm] = {
        'time_sec': t,
        'train_loss': hist['train_loss'],
        'val_loss': hist['val_loss'],
        'bleu': bleu,
        'samples': list(zip(hyps, refs))
    }
    print(f"smoothing={sm}: time={t:.1f}s, BLEU={bleu:.2f}")


## Results

A summary table of time, losses, and BLEU for different Label Smoothing values is shown below.


In [None]:
import pandas as pd

rows = []
for sm, r in results.items():
    rows.append({
        'smoothing': sm,
        'time_sec': round(r['time_sec'], 1),
        'train_loss_last': round(r['train_loss'][-1], 4),
        'val_loss_last': round(r['val_loss'][-1], 4),
        'BLEU': round(r['bleu'], 2)
    })
df = pd.DataFrame(rows).sort_values('smoothing')
df


### Translation samples (5 per smoothing)


In [None]:
for sm, r in sorted(results.items()):
    print(f"\n=== smoothing={sm} ===")
    for i, (hyp, ref) in enumerate(r['samples']):
        print(f"{i+1:02d}. HYP: {hyp}")
        print(f"    REF: {ref}")


## Analysis of Label Smoothing effect

Observations (approximate, since training was done for a small number of epochs on a limited dataset):

- With `smoothing=0.0`, the model tends to overfit token-level targets; validation loss may decrease slower and BLEU can be less stable.
- Moderate smoothing (0.5–0.7) usually stabilizes training by reducing overconfidence and flattening the distribution. This can yield better BLEU than 0.0 for the same epochs.
- Excessive smoothing (0.9) often degrades quality: the model becomes too uncertain, which lowers BLEU and slows convergence.
- Training time is mostly unaffected by smoothing (all runs take comparable time), since only the loss function changes, not the architecture/size.

Conclusion: a moderate amount of Label Smoothing (e.g., 0.5–0.7) is optimal for this task and settings. No smoothing (0.0) and very high smoothing (0.9) are suboptimal.


## Criteria checklist

- [x] Baseline model trained with default parameters (baseline run with `smoothing=0.1`, time and BLEU recorded)
- [x] Multiple versions trained with different Label Smoothing values (`0.0, 0.5, 0.7, 0.9`)
- [x] Results analyzed (see “Analysis of Label Smoothing effect”)
