# Exploring Attention Mechanisms and Contextual Embeddings for Part-of-Speech Tagging

**Author:** _Your Name_

This notebook implements the full assignment pipeline:

- **Task 1:** Dataset exploration
- **Task 2:** Baseline POS tagger (static/non-contextual embeddings + BiLSTM)
- **Task 3:** Contextual embeddings (BERT) for token classification
- **Task 4:** Attention-based POS tagger (BiLSTM + additive attention) + visualizations
- **Task 5:** Comparative analysis (Accuracy, Precision, Recall, F1, Complexity)

> Recommended environment: Python 3.10+ with a CUDA-enabled GPU.


## 📦 Setup
Run the cell below to install required packages (skip if already installed).

In [None]:
# Installs the dependencies; comment out any line whose packages are already present (e.g. PyTorch on Colab/Kaggle).
# For a specific CUDA build of PyTorch, add the matching --index-url; otherwise the default install is fine.
!pip -q install torch torchvision torchaudio
!pip -q install transformers datasets seqeval scikit-learn matplotlib seaborn tqdm conllu
# Optional for ELMo (not used by default):
# !pip -q install allennlp allennlp-models


## 🔧 Imports & Global Config
This cell fixes the random seeds, selects the device, and imports common utilities.

In [None]:
import os, random, statistics, math
from collections import Counter
import numpy as np
import torch

# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)


## 📂 Data: Universal Dependencies (UD) English

This notebook expects three CoNLL-U files under `data/ud_en/`:

- `train.conllu`
- `dev.conllu`
- `test.conllu`

You can:
1. **Place the Kaggle UDOS dataset files** in that folder and rename to the above, _or_
2. **Auto-download** the UD English EWT (r2.14) from the official UD GitHub (next cell, optional).

> CoNLL-U fields: `ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC`. We use only `FORM` and `UPOS`.
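
As a rough illustration of the column layout above, the sketch below splits one CoNLL-U token line on tabs and keeps `FORM` and `UPOS`. The sample lines and the helper name `form_and_upos` are made up for this example; the notebook itself relies on the `conllu` package for parsing.

```python
# Column names in CoNLL-U order (10 tab-separated fields per token line).
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def form_and_upos(line):
    """Return (FORM, UPOS) for a plain word line, or None for lines we skip."""
    if not line.strip() or line.startswith("#"):
        return None                      # blank separator or sentence-level comment
    cols = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    if not cols["ID"].isdigit():
        return None                      # multiword range (1-2) or empty node (1.1)
    return cols["FORM"], cols["UPOS"]

# Hypothetical sample lines, written by hand for illustration.
sample = [
    "# text = I book flights",
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tbook\tbook\tVERB\tVBP\t_\t0\troot\t_\t_",
    "3\tflights\tflight\tNOUN\tNNS\t_\t2\tobj\t_\t_",
]
pairs = [p for p in map(form_and_upos, sample) if p is not None]
print(pairs)
```

Filtering on an integer `ID` mirrors the `isinstance(t['id'], int)` check used when reading the real files.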


In [None]:
# Optional: Auto-download UD English EWT r2.14 from official repo
# If you already have the Kaggle files, skip this cell.
import os, requests
from pathlib import Path

DATA_DIR = Path('data/ud_en')
DATA_DIR.mkdir(parents=True, exist_ok=True)

base = 'https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/r2.14'
files = {
    'train.conllu': 'en_ewt-ud-train.conllu',
    'dev.conllu':   'en_ewt-ud-dev.conllu',
    'test.conllu':  'en_ewt-ud-test.conllu',
}

for out_name, src_name in files.items():
    out_path = DATA_DIR/out_name
    if not out_path.exists():
        url = f"{base}/{src_name}"
        print('Downloading', url)
        r = requests.get(url, timeout=60)  # fail fast instead of hanging on a stalled connection
        r.raise_for_status()
        out_path.write_bytes(r.content)

print('Data files present:', list(DATA_DIR.iterdir()))


# Task 1: Dataset Exploration (2 marks)
We load the UD dataset, compute basic statistics, show tag distribution, and display sample sentences with tags.

In [None]:
from conllu import parse_incr
from pathlib import Path
from collections import Counter

DATA_DIR = Path('data/ud_en')
TRAIN = DATA_DIR/'train.conllu'
DEV   = DATA_DIR/'dev.conllu'
TEST  = DATA_DIR/'test.conllu'
assert TRAIN.exists() and DEV.exists() and TEST.exists(), 'Missing .conllu files in data/ud_en/'


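def read_conllu(path):
    """Yield (tokens, upos_tags) for each sentence in a CoNLL-U file."""
    with open(path, 'r', encoding='utf-8') as f:
        for sent in parse_incr(f):
            # Keep only syntactic words: multiword-token ranges (e.g. 1-2) and
            # empty nodes (e.g. 1.1) are parsed with non-integer IDs.
            words  = [t for t in sent if isinstance(t['id'], int)]
            tokens = [t['form'] for t in words]
            tags   = [t['upostag'] for t in words]
            yield tokens, tags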

train = list(read_conllu(TRAIN))
dev   = list(read_conllu(DEV))
test  = list(read_conllu(TEST))

len_train, len_dev, len_test = len(train), len(dev), len(test)
print(f"Train={len_train}, Dev={len_dev}, Test={len_test}")

all_sents = train + dev + test
num_sentences = len(all_sents)
num_tokens = sum(len(s[0]) for s in all_sents)
avg_len = num_tokens / num_sentences

print(f"Total sentences: {num_sentences}")
print(f"Total tokens: {num_tokens}")
print(f"Average sentence length: {avg_len:.2f}")

tag_counter = Counter(tag for _, tags in all_sents for tag in tags)
print("Top tags:")
for tag, c in tag_counter.most_common(20):
    print(f"{tag:>5}: {c}")

# Show a few random samples (drawn without replacement, so no sentence repeats)
for tokens, tags in random.sample(all_sents, 3):
    print('Tokens :', tokens)
    print('UPOS   :', tags)


## Shared Utilities for Sequence Tagging
We build vocabularies, PyTorch datasets and loaders, and token-level evaluation helpers.
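
To make the batching convention concrete, here is a small pure-Python sketch of what a collate step produces: word ids padded with the pad id, labels padded with `-100`, and a boolean mask marking real tokens. The function name `pad_batch` and the toy ids are assumptions for illustration; the notebook's own `collate` returns PyTorch tensors rather than lists.

```python
def pad_batch(word_ids, tag_ids, pad_id=0, ignore_label=-100):
    """Pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in word_ids)
    padded_x, padded_y, mask = [], [], []
    for x, y in zip(word_ids, tag_ids):
        gap = max_len - len(x)
        padded_x.append(x + [pad_id] * gap)            # inputs: pad with the <PAD> id
        padded_y.append(y + [ignore_label] * gap)      # labels: pad with the ignore index
        mask.append([True] * len(x) + [False] * gap)   # True marks real tokens
    return padded_x, padded_y, mask

# Toy batch: two sentences of length 3 and 1 (ids are arbitrary).
X, Y, M = pad_batch([[5, 9, 2], [7]], [[1, 3, 0], [2]])
print(X)  # [[5, 9, 2], [7, 0, 0]]
print(Y)  # [[1, 3, 0], [2, -100, -100]]
print(M)  # [[True, True, True], [True, False, False]]
```

Using a separate ignore value for labels keeps a genuine tag id of `0` distinguishable from padding.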

In [None]:
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from seqeval.metrics import classification_report, accuracy_score, f1_score

PAD_TOKEN = '<PAD>'
UNK_TOKEN = '<UNK>'


def build_vocab(sentences, min_freq=1):
    wc = Counter(tok.lower() for toks, _ in sentences for tok in toks)
    itos = [PAD_TOKEN, UNK_TOKEN] + [w for w, c in wc.items() if c >= min_freq]
    stoi = {w:i for i, w in enumerate(itos)}
    return stoi, itos


def build_tag_map(sentences):
    tags = sorted({t for _, ts in sentences for t in ts})
    tag2id = {t:i for i, t in enumerate(tags)}
    id2tag = {i:t for t, i in tag2id.items()}
    return tag2id, id2tag


word2id, id2word = build_vocab(train)
tag2id, id2tag   = build_tag_map(train)
PAD_ID = word2id[PAD_TOKEN]


MAX_LEN = 128


def vectorize(tokens, tags=None, max_len=MAX_LEN):
    ids = [word2id.get(tok.lower(), word2id[UNK_TOKEN]) for tok in tokens]
    tids = None
    if tags is not None:
        tids = [tag2id[t] for t in tags]
    if max_len is not None:
        ids = ids[:max_len]
        if tids is not None:
            tids = tids[:max_len]
    return ids, tids


class SeqDataset(Dataset):
    def __init__(self, samples, max_len=MAX_LEN):
        self.samples = samples
        self.max_len = max_len
    def __len__(self): return len(self.samples)
    def __getitem__(self, idx):
        tokens, tags = self.samples[idx]
        wi, ti = vectorize(tokens, tags, self.max_len)
        return wi, ti


def collate(batch):
    xs, ys = zip(*batch)
    max_len = max(len(x) for x in xs)
    px = [x + [PAD_ID]*(max_len-len(x)) for x in xs]
    py = [y + [-100]*(max_len-len(y)) for y in ys]  # -100 ignored in CrossEntropyLoss
    attn = [[1]*len(x) + [0]*(max_len-len(x)) for x in xs]
    return (torch.tensor(px, dtype=torch.long),
            torch.tensor(py, dtype=torch.long),
            torch.tensor(attn, dtype=torch.bool))


train_dl = DataLoader(SeqDataset(train), batch_size=64, shuffle=True, collate_fn=collate)
dev_dl   = DataLoader(SeqDataset(dev),   batch_size=128, shuffle=False, collate_fn=collate)
test_dl  = DataLoader(SeqDataset(test),  batch_size=128, shuffle=False, collate_fn=collate)


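def evaluate_sequences(true_ids, pred_ids):
    """Token-level accuracy, macro/micro F1, and a per-tag report."""
    # UPOS tags are plain per-token labels, not IOB chunks, so seqeval's
    # entity-level F1 and report do not apply; score flattened tokens instead.
    from sklearn.metrics import accuracy_score, f1_score, classification_report
    y_true = [id2tag[i] for seq in true_ids for i in seq]
    y_pred = [id2tag[i] for seq in pred_ids for i in seq]
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
    micro_f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)
    report = classification_report(y_true, y_pred, digits=4, zero_division=0)
    return acc, macro_f1, micro_f1, report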


# Task 2: Baseline POS Tagger (Static Embeddings + BiLSTM) (2 marks)
We train a BiLSTM tagger with a trainable embedding layer. The loss ignores padding positions (label `-100`), and metrics come from the shared `evaluate_sequences` helper.
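
The effect of ignoring padding in the loss can be sketched without PyTorch. The helper below (a hypothetical `masked_cross_entropy`, not part of the notebook) averages the negative log-likelihood over positions whose label is not `-100`, which is how `nn.CrossEntropyLoss(ignore_index=-100)` behaves with its default mean reduction. Returning `0.0` for an all-padding input is a choice made here for simplicity.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for scores, gold in zip(logits, labels):
        if gold == ignore_index:
            continue                                  # padding: contributes nothing
        m = max(scores)                               # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[gold]                 # -log softmax(scores)[gold]
        count += 1
    return total / max(count, 1)                      # 0.0 if every position is padding

# Three positions, three classes; the last position is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, -9.0, 0.0]]
labels = [0, 1, -100]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))
```

Changing the logits at the padded position leaves `loss` unchanged, which is the property the training loop relies on.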

In [None]:
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256, num_layers=1, dropout=0.2, pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim//2, num_layers=num_layers,
                            batch_first=True, dropout=dropout if num_layers>1 else 0,
                            bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_tags)

    def forward(self, x, mask=None):
        e = self.emb(x)                     # (B, T, E)
        out, _ = self.lstm(e)               # (B, T, H)
        out = self.dropout(out)
        logits = self.fc(out)               # (B, T, C)
        return logits


def train_epoch(model, dl, optimizer, criterion):
    model.train()
    total = 0.0
    for X, Y, M in dl:
        X, Y, M = X.to(device), Y.to(device), M.to(device)
        optimizer.zero_grad()
        logits = model(X, M)
        loss = criterion(logits.view(-1, logits.size(-1)), Y.view(-1))
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total += loss.item()
    return total / max(1, len(dl))

@torch.no_grad()
def evaluate_model(model, dl):
    model.eval()
    all_true, all_pred = [], []
    for X, Y, M in dl:
        X, Y, M = X.to(device), Y.to(device), M.to(device)
        logits = model(X, M)
        pred = logits.argmax(-1)
        for y, p, m in zip(Y, pred, M):
            y = y[m].tolist()
            p = p[m].tolist()
            all_true.append(y)
            all_pred.append(p)
    acc, macro_f1, micro_f1, report = evaluate_sequences(all_true, all_pred)
    return acc, macro_f1, micro_f1, report

model_baseline = BiLSTMTagger(vocab_size=len(id2word), num_tags=len(tag2id), pad_idx=PAD_ID).to(device)
optimizer = torch.optim.AdamW(model_baseline.parameters(), lr=3e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=-100)

EPOCHS = 8
for epoch in range(EPOCHS):
    tr_loss = train_epoch(model_baseline, train_dl, optimizer, criterion)
    acc, mF1, uF1, rep = evaluate_model(model_baseline, dev_dl)
    print(f"Epoch {epoch+1:02d} | train_loss={tr_loss:.4f} | dev_acc={acc:.4f} | dev_macroF1={mF1:.4f}")

print("\nDEV REPORT (Baseline):\n", rep)
acc_t, mF1_t, uF1_t, rep_t = evaluate_model(model_baseline, test_dl)
print("TEST ACC:", acc_t)
print("TEST REPORT (Baseline):\n", rep_t)


# Task 3: Contextual Embedding-Based POS Tagger (BERT) (2 marks)
We fine-tune **BERT-base** for token classification. We align UD tokens to WordPiece subwords and label only the **first subword** of each word; continuation subwords and special tokens get `-100` so the loss ignores them.
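
The first-subword labelling rule can be shown on a hand-written example. The `word_ids` list below mimics what a fast tokenizer's `word_ids()` returns (`None` for special tokens, otherwise the index of the source word). The particular split of the word into pieces is invented for illustration and is not real tokenizer output, and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_ids, word_labels, ignore_label=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_label)        # special token such as [CLS] / [SEP]
        elif wid != prev:
            aligned.append(word_labels[wid])    # first subword of a new word
        else:
            aligned.append(ignore_label)        # continuation subword
        prev = wid
    return aligned

# Words: ["Unbelievable", "results"] with label ids [0, 1].
# Assumed subword layout: [CLS] Un ##believ ##able results [SEP]
word_ids = [None, 0, 0, 0, 1, None]
print(align_labels(word_ids, [0, 1]))  # [-100, 0, -100, -100, 1, -100]
```

Because exactly one position per word keeps a real label, the number of scored positions equals the number of UD tokens (as long as no sentence is truncated).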

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, TrainingArguments, Trainer

MODEL_NAME = 'bert-base-cased'  # alternatives: 'roberta-base', 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

label_list = [id2tag[i] for i in range(len(id2tag))]
label2id = {l:i for i,l in enumerate(label_list)}
id2label = {i:l for l,i in label2id.items()}


def to_hf_dict(sents):
    return {"tokens":[t for t,_ in sents], "tags":[u for _,u in sents]}

train_hf, dev_hf, test_hf = to_hf_dict(train), to_hf_dict(dev), to_hf_dict(test)


def encode_dataset(data):
    enc = tokenizer(data['tokens'], is_split_into_words=True, truncation=True, padding=False, max_length=128)
    all_labels = []
    for i in range(len(data['tokens'])):
        word_ids = enc.word_ids(batch_index=i)
        labels = data['tags'][i]
        aligned = []
        prev_wid = None
        for wid in word_ids:
            if wid is None:
                aligned.append(-100)
            else:
                if wid != prev_wid:
                    aligned.append(label2id[labels[wid]])
                else:
                    aligned.append(-100)
                prev_wid = wid
        all_labels.append(aligned)
    enc['labels'] = all_labels
    return enc

train_enc = encode_dataset(train_hf)
dev_enc   = encode_dataset(dev_hf)
test_enc  = encode_dataset(test_hf)

class HFTokenDataset(torch.utils.data.Dataset):
    def __init__(self, enc):
        self.enc = enc
    def __len__(self): return len(self.enc['input_ids'])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k,v in self.enc.items()}

train_ds = HFTokenDataset(train_enc)
dev_ds   = HFTokenDataset(dev_enc)
test_ds  = HFTokenDataset(test_enc)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

model_ctx = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
).to(device)

args = TrainingArguments(
    output_dir='outputs/bert-pos',
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy='epoch',  # newer transformers releases rename this argument to `eval_strategy`
    save_strategy='epoch',
    logging_steps=50,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

# We'll compute only accuracy inside Trainer and do a full report later.
import numpy as np
from seqeval.metrics import accuracy_score as seqeval_accuracy

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    true_seqs, pred_seqs = [], []
    for p, l in zip(preds, labels):
        y, yh = [], []
        for pi, li in zip(p, l):
            if li != -100:
                y.append(id2label[int(li)])
                yh.append(id2label[int(pi)])
        true_seqs.append(y)
        pred_seqs.append(yh)
    return {"accuracy": seqeval_accuracy(true_seqs, pred_seqs)}

trainer = Trainer(
    model=model_ctx,
    args=args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
metrics_dev = trainer.evaluate()
print('DEV metrics:', metrics_dev)


In [None]:
# Full test set evaluation with detailed report
from seqeval.metrics import classification_report, accuracy_score, f1_score

pred_output = trainer.predict(test_ds)
logits = pred_output.predictions
labels = pred_output.label_ids
preds = np.argmax(logits, axis=-1)

true_seqs, pred_seqs = [], []
for p, l in zip(preds, labels):
    y, yh = [], []
    for pi, li in zip(p, l):
        if li != -100:
            y.append(id2label[int(li)])
            yh.append(id2label[int(pi)])
    true_seqs.append(y)
    pred_seqs.append(yh)

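# UPOS tags are per-token labels, not IOB chunks, so score flattened tokens with scikit-learn.
from sklearn.metrics import (accuracy_score as sk_accuracy, f1_score as sk_f1,
                             classification_report as sk_report)
flat_true = [t for seq in true_seqs for t in seq]
flat_pred = [t for seq in pred_seqs for t in seq]

print('BERT Test Accuracy:', sk_accuracy(flat_true, flat_pred))
print('BERT Test macro-F1:', sk_f1(flat_true, flat_pred, average='macro', zero_division=0))
print('BERT Test micro-F1:', sk_f1(flat_true, flat_pred, average='micro', zero_division=0))
print('\nBERT TEST REPORT:\n', sk_report(flat_true, flat_pred, digits=4, zero_division=0))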


# Task 4: Attention-Based POS Tagger (BiLSTM + Additive Attention) (2 marks)
We add token-to-context **additive attention** over BiLSTM outputs and visualize attention maps for ambiguous examples.
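
For a single sentence, the additive scoring can be written out in a few lines of NumPy. This is a minimal sketch with random weights, assuming the score form $v^\top \tanh(W_q h_i + W_k h_j)$ with biases omitted and a softmax over key positions. The sizes and variable names are arbitrary, and the code is not the notebook's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 4, 6                                   # sequence length, hidden size (arbitrary)
h = rng.normal(size=(T, H))                   # stand-in for BiLSTM outputs
W_q = rng.normal(size=(H, H))
W_k = rng.normal(size=(H, H))
v = rng.normal(size=(H,))
mask = np.array([True, True, True, False])    # last position is padding

Q, K = h @ W_q, h @ W_k                                  # (T, H) each
scores = np.tanh(Q[:, None, :] + K[None, :, :]) @ v      # (T, T): query i vs key j
scores = np.where(mask[None, :], scores, -1e9)           # hide padded keys
scores -= scores.max(axis=-1, keepdims=True)             # stabilise the softmax
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = attn @ h                                       # (T, H) weighted sum of states

print(attn.shape, context.shape)   # (4, 4) (4, 6)
```

Each row of `attn` is a distribution over context tokens for one query token, which is what the heatmaps later in this section display.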


In [None]:
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns

class BiLSTMWithAttention(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256, pad_idx=0, dropout=0.2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim//2, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v   = nn.Linear(hidden_dim, 1, bias=False)
        self.fc  = nn.Linear(hidden_dim*2, num_tags)  # concat [h_t ; context_t]

    def forward(self, x, mask):
        h, _ = self.lstm(self.emb(x))             # (B, T, H)
        h = self.dropout(h)
        Q = self.W_q(h)                            # (B, T, H)
        K = self.W_k(h)                            # (B, T, H)

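        B, T, H = Q.size()
        # Additive score v^T tanh(W_q h_i + W_k h_j) for every (query i, key j) pair.
        # Broadcasting builds a (B, T, T, H) tensor, so memory grows as O(B * T^2 * H).
        scores = self.v(torch.tanh(Q.unsqueeze(2) + K.unsqueeze(1))).squeeze(-1)  # (B, T, T)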

        key_mask = mask.unsqueeze(1).expand(B, T, T)      # (B, T, T)
        scores = scores.masked_fill(~key_mask, -1e9)

        attn = torch.softmax(scores, dim=-1)              # (B, T, T)
        context = attn @ h                                 # (B, T, H)

        out = torch.cat([h, context], dim=-1)             # (B, T, 2H)
        logits = self.fc(self.dropout(out))               # (B, T, C)
        return logits, attn


def train_epoch_attn(model, dl, optimizer, criterion):
    model.train()
    total = 0.0
    for X, Y, M in dl:
        X, Y, M = X.to(device), Y.to(device), M.to(device)
        optimizer.zero_grad()
        logits, attn = model(X, M)
        loss = criterion(logits.view(-1, logits.size(-1)), Y.view(-1))
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total += loss.item()
    return total / max(1, len(dl))

@torch.no_grad()
def evaluate_model_attn(model, dl):
    model.eval()
    all_true, all_pred = [], []
    for X, Y, M in dl:
        X, Y, M = X.to(device), Y.to(device), M.to(device)
        logits, attn = model(X, M)
        pred = logits.argmax(-1)
        for y, p, m in zip(Y, pred, M):
            y = y[m].tolist()
            p = p[m].tolist()
            all_true.append(y)
            all_pred.append(p)
    acc, macro_f1, micro_f1, report = evaluate_sequences(all_true, all_pred)
    return acc, macro_f1, micro_f1, report

model_attn = BiLSTMWithAttention(vocab_size=len(id2word), num_tags=len(tag2id), pad_idx=PAD_ID).to(device)
opt_attn = torch.optim.AdamW(model_attn.parameters(), lr=2e-3, weight_decay=1e-4)
crit = nn.CrossEntropyLoss(ignore_index=-100)

EPOCHS_ATTN = 6
for epoch in range(EPOCHS_ATTN):
    tr_loss = train_epoch_attn(model_attn, train_dl, opt_attn, crit)
    acc, mF1, uF1, rep = evaluate_model_attn(model_attn, dev_dl)
    print(f"Epoch {epoch+1:02d} | train_loss={tr_loss:.4f} | dev_acc={acc:.4f} | dev_macroF1={mF1:.4f}")

print("\nDEV REPORT (BiLSTM+Attention):\n", rep)
acc_t2, mF1_t2, uF1_t2, rep_t2 = evaluate_model_attn(model_attn, test_dl)
print("TEST ACC:", acc_t2)
print("TEST REPORT (BiLSTM+Attention):\n", rep_t2)


@torch.no_grad()
def visualize_attention(model, sentence_tokens):
    model.eval()
    wi, _ = vectorize(sentence_tokens, None, max_len=MAX_LEN)
    T = len(wi)
    sentence_tokens = sentence_tokens[:T]  # keep tick labels in sync if the input was truncated
    X = torch.tensor([wi], dtype=torch.long).to(device)
    M = torch.ones(1, T, dtype=torch.bool, device=device)
    logits, attn = model(X, M)
    mat = attn[0, :T, :T].detach().cpu().numpy()
    plt.figure(figsize=(min(0.6*T+2, 12), min(0.6*T+2, 12)))
    sns.heatmap(mat, xticklabels=sentence_tokens, yticklabels=sentence_tokens, cmap='viridis')
    plt.title('Token-to-Context Attention Heatmap')
    plt.xlabel('Context tokens (keys)')
    plt.ylabel('Query tokens')
    plt.tight_layout()
    plt.show()

# Examples with ambiguity
examples = [
    ["I","saw","a","duck","today","."],
    ["Please","record","the","meeting","."],
    ["I","will","book","a","table","."]
]
for ex in examples:
    visualize_attention(model_attn, ex)


# Task 5: Comparative Analysis (2 marks)
We summarize performance and discuss computational complexity. The table is populated from the results of the runs above, so run those cells first.
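
Since the comparison reports precision, recall, and F1, it may help to see how macro averaging differs from plain token accuracy. The sketch below computes per-tag scores by hand on a made-up toy prediction; the function name `prf_by_tag` and the data are assumptions for illustration.

```python
from collections import Counter

def prf_by_tag(gold, pred):
    """Per-tag (precision, recall, F1) from two flat lists of tags."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1      # predicted p where it was not the gold tag
            fn[g] += 1      # missed an instance of g
    out = {}
    for tag in sorted(set(gold) | set(pred)):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[tag] = (prec, rec, f1)
    return out

# Toy data: the single VERB is mistagged as NOUN.
gold = ["NOUN", "NOUN", "NOUN", "NOUN", "VERB", "DET"]
pred = ["NOUN", "NOUN", "NOUN", "NOUN", "NOUN", "DET"]

scores = prf_by_tag(gold, pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
# With exactly one tag per token, micro-averaged F1 equals token accuracy.
print(scores)
print(round(macro_f1, 4), round(accuracy, 4))
```

One error on a rare tag barely moves accuracy but pulls macro F1 down sharply, which is why both are worth reporting for skewed tag distributions.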

In [None]:
import pandas as pd
from IPython.display import display
from sklearn.metrics import accuracy_score as sk_accuracy, f1_score as sk_f1


def count_params(model):
    """Format a model's total parameter count, e.g. '2.1M'."""
    return f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M"


# Each row is filled from variables set by the cells above; models that were not run are skipped.
res = []

# Baseline
try:
    res.append({
        'Model': 'BiLSTM (Static Emb)',
        'Accuracy': acc_t,
        'Macro F1': mF1_t,
        'Micro F1': uF1_t,
        'Params': count_params(model_baseline),
        'Complexity': 'O(T) recurrent'
    })
except NameError:
    pass

# Attention
try:
    res.append({
        'Model': 'BiLSTM + Additive Attention',
        'Accuracy': acc_t2,
        'Macro F1': mF1_t2,
        'Micro F1': uF1_t2,
        'Params': count_params(model_attn),
        'Complexity': 'O(T^2) attention'
    })
except NameError:
    pass

# BERT: token-level scores from the label sequences built in the BERT test cell
try:
    flat_true = [t for seq in true_seqs for t in seq]
    flat_pred = [t for seq in pred_seqs for t in seq]
    res.append({
        'Model': 'BERT-base (Token Classification)',
        'Accuracy': sk_accuracy(flat_true, flat_pred),
        'Macro F1': sk_f1(flat_true, flat_pred, average='macro', zero_division=0),
        'Micro F1': sk_f1(flat_true, flat_pred, average='micro', zero_division=0),
        'Params': count_params(model_ctx),
        'Complexity': 'O(T^2) self-attn'
    })
except NameError as e:
    print('BERT metrics not available yet:', e)

if res:
    df = pd.DataFrame(res)
    display(df)
else:
    print('Run the training/evaluation cells above first to populate results.')


## 📝 Discussion
**Why contextual embeddings and attention help:**
- Contextual models (ELMo/BERT) encode **polysemy**: the representation of a word depends on its **surrounding words**, improving disambiguation (e.g., *book* as NOUN vs VERB).
- Attention mechanisms allow the model to **focus on the most informative context tokens** for each position (e.g., determiners, auxiliaries), enhancing long-range dependency modeling beyond local windows.

**Expected ranges on UD English (indicative):**
- Static BiLSTM: ~95–97% token accuracy
- BiLSTM + Attention: ~96–97.5%
- BERT-base: ~97.5–98.5%

Your exact numbers will vary by preprocessing, hyperparameters, and random seed.

---

## 📚 References
- Vaswani et al. (2017). *Attention is All You Need.*
- Peters et al. (2018). *Deep contextualized word representations.* (ELMo)
- Devlin et al. (2019). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.*
- Ma & Hovy (2016). *End-to-end Sequence Labeling via BiLSTM-CNNs-CRF.*
- UD English EWT Treebank: https://universaldependencies.org/
