# Longformer Model for Extended Context (Optional)

When articles exceed 512 tokens, BERT and RoBERTa truncate the input, which can drop important content. **Longformer** handles up to 4096 tokens using sparse attention (a local sliding window plus a few global tokens), which makes it well suited to full‑length articles.
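
To make the sparse‑attention idea concrete, here is a minimal sketch in plain Python (not Longformer's actual implementation) that counts how many query‑key pairs get scored under a sliding window plus global tokens, compared with full attention. The function name `attention_pairs` is hypothetical, and the window of 512 with a single global token at position 0 is an assumption; the real window size should be checked against the checkpoint's `attention_window` config value.

```python
def attention_pairs(seq_len: int, window: int, n_global: int = 0) -> int:
    """Count (query, key) pairs scored under sliding-window + global attention.

    Assumptions for this sketch: the first `n_global` positions are global
    tokens (they attend everywhere and are visible to every position); all
    other positions see `window // 2` neighbours on each side.
    """
    half = window // 2
    total = 0
    for i in range(seq_len):
        if i < n_global:
            # A global token attends to the whole sequence.
            total += seq_len
            continue
        lo = max(0, i - half)
        hi = min(seq_len - 1, i + half)
        local = hi - lo + 1
        # Global tokens sit at the start, so only those left of the
        # local window add extra keys for this query.
        globals_outside = min(n_global, lo)
        total += local + globals_outside
    return total


n = 4096
sparse = attention_pairs(n, window=512, n_global=1)
print(f"full: {n * n:,}  sparse: {sparse:,}  ratio: {n * n / sparse:.1f}x")
```

Full attention grows with the square of the sequence length, while the windowed count grows roughly linearly, which is what makes 4096‑token inputs feasible at all.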

Training Longformer end‑to‑end is resource‑intensive, so here I provide the code scaffold without executing a full training run.


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import yaml

# Load config to get max_length for longformer
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Initialize Longformer tokenizer & model
tokenizer_long = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model_long     = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=3
)
model_long.eval()

# Demonstrate tokenization on a small sample
sample_texts = train_df['text'].head(2).tolist()
tokens = tokenizer_long(
    sample_texts,
    padding=True,
    truncation=True,
    max_length=config['training']['max_length']['longformer']
)
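# padding=True pads every row to the longest sample, so count real tokens via the attention mask
print("Token count for first sample (excluding padding):", sum(tokens['attention_mask'][0]))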


## Training Longformer (Not Executed)

Normally, I would now fine‑tune Longformer similarly to BERT/RoBERTa:
1. Tokenize the full train/val sets with `max_length=4096`.  
2. Create `Dataset` objects with labels.  
3. Set up `TrainingArguments` (reduce `per_device_train_batch_size` to 4 due to memory).  
4. Instantiate `CustomTrainer` with `model_long`, metrics, and callbacks.  
5. Call `trainer_long.train()`.

However, training on 4,096‑token sequences is very slow (roughly 3-4× RoBERTa's per‑epoch time). To conserve resources, I am **not** running a full Longformer training here.

```python
# (Pseudo‑code – do not run)
# train_enc_long = tokenizer_long(train_texts, padding=True, truncation=True, max_length=4096)
# val_enc_long   = tokenizer_long(val_texts, padding=True, truncation=True, max_length=4096)
# train_dataset_long = Dataset.from_dict({...})
# training_args_long = TrainingArguments(output_dir=..., num_train_epochs=3, per_device_train_batch_size=4, ...)
# trainer_long = CustomTrainer(model_long, args=training_args_long, ...)
# trainer_long.train()
```

Longformer may improve recall on very long articles, but our dataset’s median length (~250 words) is well within 512 tokens, so most articles fit without truncation and RoBERTa suffices for deployment.
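
The length argument above can be checked with a small helper. This is a sketch under stated assumptions: `truncation_report` and `estimate_tokens` are hypothetical names, and the 1.3 tokens‑per‑word ratio is a rough rule of thumb for English subword tokenizers, not something measured on this dataset. The toy strings are synthetic and only show the call shape.

```python
from statistics import median
from typing import Callable, Iterable

TOKENS_PER_WORD = 1.3  # assumption: rough subword ratio, not measured here


def estimate_tokens(text: str) -> int:
    """Approximate the token count from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def truncation_report(
    texts: Iterable[str],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> dict:
    """Return the median token count and the share of texts over the limit."""
    counts = [count_tokens(t) for t in texts]
    if not counts:
        raise ValueError("texts is empty")
    over = sum(1 for n in counts if n > max_tokens)
    return {
        "median_tokens": median(counts),
        "share_truncated": over / len(counts),
    }


# Toy usage with synthetic strings (not real data).
toy = ["word " * 250, "word " * 1000]
print(truncation_report(toy))
```

For an exact figure, the estimator can be swapped for the real tokenizer, for example `count_tokens=lambda t: len(tokenizer_long(t)["input_ids"])`. If `share_truncated` turned out to be large, that would be the signal to revisit the Longformer run.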

## Conclusion

I provided the full scaffold to fine‑tune Longformer for extended contexts, but did not execute a complete training run, in order to conserve resources. Given our dataset’s median length (~250 words), a 512‑token model already captures most content. My final chosen model remains **RoBERTa‑base**, which I fine‑tuned and saved for deployment.
