# Longformer Model for Extended Context (Optional)

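BERT and RoBERTa accept at most 512 tokens, so longer articles are truncated and any content past that limit is lost. **Longformer** can handle up to 4096 tokens by combining sliding‑window (local) attention with a few global‑attention tokens, which makes it a better fit for full‑length articles.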
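
To make the sparse‑attention idea concrete, here is a minimal pure‑Python sketch of a Longformer‑style attention pattern. The function names, the toy sequence length, and the toy window size are all assumptions for illustration; this is not the library's implementation, only a picture of which token pairs are allowed to attend to each other.

```python
def longformer_style_mask(seq_len, window, global_positions=(0,)):
    """Return a seq_len x seq_len matrix of booleans.

    mask[i][j] is True when query position i may attend to key position j.
    Hypothetical helper for illustration only.
    """
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Local attention: only neighbours within half a window on each side
            local = abs(i - j) <= half
            # Global attention: global tokens attend everywhere and are attended by everyone
            is_global = i in global_set or j in global_set
            row.append(local or is_global)
        mask.append(row)
    return mask


def count_attended(mask):
    """Count the (query, key) pairs that are allowed to interact."""
    return sum(sum(row) for row in mask)


# Toy sizes so the pattern is easy to inspect
seq_len, window = 16, 4
mask = longformer_style_mask(seq_len, window)
sparse_pairs = count_attended(mask)
full_pairs = seq_len * seq_len  # what full self-attention would compute
print(f"sparse pattern: {sparse_pairs} pairs, full attention: {full_pairs} pairs")
```

With a fixed window, the number of attended pairs grows roughly linearly with sequence length instead of quadratically, which is what makes 4096‑token inputs tractable.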

Training Longformer end‑to‑end is resource‑intensive, so here I provide the code scaffold without executing a full training run.


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import yaml

# Load config to get max_length for longformer
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Initialize Longformer tokenizer & model
tokenizer_long = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
# The 3-class classification head is newly initialized and needs fine-tuning
model_long = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=3
)
model_long.eval()  # inference mode: disables dropout

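# Demonstrate tokenization on a small sample
# (train_df is assumed to have been loaded in an earlier cell)
sample_texts = train_df['text'].head(2).tolist()
tokens = tokenizer_long(
    sample_texts,
    padding=True,      # pad to the longest sequence in this batch
    truncation=True,   # cut anything beyond max_length
    max_length=config['training']['max_length']['longformer']
)
# With padding=True, input_ids includes pad tokens; attention_mask counts the real ones
print("Padded length for first sample:", len(tokens['input_ids'][0]))
print("Non-padding tokens in first sample:", sum(tokens['attention_mask'][0]))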
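
Longformer's forward pass also accepts a `global_attention_mask` that marks which tokens get global attention. The sketch below builds one for a batch shaped like `tokens['input_ids']`, putting global attention on the first token only. The helper name and the token IDs are made up for illustration, and plain lists are used so the snippet runs without downloading the model.

```python
def build_global_attention_mask(input_ids):
    """Return one mask row per sequence: 1 at position 0, 0 everywhere else.

    Hypothetical helper; position 0 holds the <s> (CLS-like) token
    for RoBERTa-style tokenizers.
    """
    return [[1 if pos == 0 else 0 for pos in range(len(ids))] for ids in input_ids]


# Made-up padded token IDs standing in for tokens["input_ids"]
fake_input_ids = [
    [0, 713, 16, 2],
    [0, 100, 2, 1],
]
global_attention_mask = build_global_attention_mask(fake_input_ids)
print(global_attention_mask)
```

In a real run, the mask would be converted to a tensor and passed alongside `input_ids` and `attention_mask` when calling `model_long`. As far as I know, the sequence‑classification model applies this same first‑token default when no mask is given, so building it explicitly mainly matters when other tokens should be global as well.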
