### Framing classifier using BERT - V2 - Longformer Generalist

Retains all features from the V1 notebook. Drops the long-document policy in favor of Longformer. Adds the title (and topic) to the model input alongside the article text. Introduces a more streamlined way of storing results, with one folder per run.


In [5]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from dotenv import load_dotenv
import os
import transformers
load_dotenv()  # looks for .env in current directory or parent
print(torch.__version__)
print(torch.cuda.is_available())


2.6.0+cu124
True


### Sample the data, NO Topic Filter

In [6]:
# Connect to server 
import psycopg2
conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT")
)
cur = conn.cursor()

# key: set the seed
cur.execute("SELECT setseed(0.42)")

# Do our join in the database - NOTE: no topic filter is applied in this query
cur.execute("""
           SELECT a.text_generic_frame, a.gpt_topic, a.political_leaning, a.title,  
           b.maintext
           FROM mm_framing_full a
           JOIN newsarticles b ON a.url = b.url
           ORDER BY RANDOM()
            LIMIT 75000
            """)

result= cur.fetchall()

print(cur.description)

cur.close()
conn.close()

df = pd.DataFrame(result, columns=["text_generic_frame", "gpt_topic", "political_leaning", "title", "article_text"])

del result

df.head()

(Column(name='text_generic_frame', type_code=1009), Column(name='gpt_topic', type_code=25), Column(name='political_leaning', type_code=25), Column(name='title', type_code=25), Column(name='maintext', type_code=25))


Unnamed: 0,text_generic_frame,gpt_topic,political_leaning,title,article_text
0,"[Health and safety, Quality of life, Other]",Sports,left_lean,"James Harden scores his 25,000th point, leads ...","James Harden scored his 25,000th career point ..."
1,[Political],Politics,left_lean,Look ahead to the biggest political stories of...,Political reporter Jack Fink and CBS News Texa...
2,"[Capacity and resources, Policy prescription a...",Business & Economy,left_lean,UK housing market fears as Marshalls and Purpl...,A profit warning from the UK driveways to roof...
3,"[Political, Policy prescription and evaluation]",Politics,left_lean,"Cook County State’s Attorney's race, Bring Chi...",The State's Attorney's race is still up in the...
4,"[Cultural identity, Fairness and equality, Mor...",Education,left,Florida Teacher Investigated After Showing Dis...,LOADINGERROR LOADING\nA Florida elementary sch...


### Initial Data Filtering

Note that between these notebooks, I changed the gpt_topic and generic frame columns directly in the database so I don't have to keep fixing them.

In [7]:
# Create word count column
df['num_words'] = df['article_text'].str.split().str.len()

print(f"Original rows: {len(df)}")

# Filter based on length
df_filtered = df[(df['num_words'] > 100)]
df = df_filtered.dropna()
df = df.reset_index(drop=True)

# Note: this count is taken after the length filter but before dropna()
print(f"Filtered rows: {len(df_filtered)}")

del df_filtered

Original rows: 75000
Filtered rows: 56447


In [8]:
# Keep rows only where the list is NOT exactly ['Other']
df = df[df['text_generic_frame'].apply(lambda x: x != ['Other'])]
print(f"Rows remaining: {len(df)}")

Rows remaining: 55616


In [9]:
df.head()

Unnamed: 0,text_generic_frame,gpt_topic,political_leaning,title,article_text,num_words
0,"[Health and safety, Quality of life, Other]",Sports,left_lean,"James Harden scores his 25,000th point, leads ...","James Harden scored his 25,000th career point ...",747
1,"[Capacity and resources, Policy prescription a...",Business & Economy,left_lean,UK housing market fears as Marshalls and Purpl...,A profit warning from the UK driveways to roof...,536
2,"[Cultural identity, Fairness and equality, Mor...",Education,left,Florida Teacher Investigated After Showing Dis...,LOADINGERROR LOADING\nA Florida elementary sch...,464
3,"[Crime and punishment, Public opinion, Securit...",Crime & Safety,left_lean,Police: Two armed minors attempt to rob man si...,BALTIMORE - A man was sitting in his car Thurs...,258
4,"[Economic, Capacity and resources, Public opin...",Social Issues,left_lean,150-year-old Florida Keys lighthouse illuminat...,"ISLAMORADA, Fla. (AP) — A 150-year-old beacon ...",278


In [10]:
# Engineer the text column

# Adding the title
df['article_text'] = df['title'] + "\n" + df['article_text']

# Adding the topic at the very start
df['article_text'] = "TOPIC:" + df['gpt_topic'] + "\n" + df['article_text']
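
To make the resulting input layout concrete, here is a minimal sketch of the same concatenation applied to one record. The helper name `build_input` and the sample strings are illustrative assumptions and do not appear in the notebook.

```python
def build_input(topic: str, title: str, body: str) -> str:
    """Mirror the cell above: topic tag first, then title, then article body."""
    return "TOPIC:" + topic + "\n" + title + "\n" + body


# Hypothetical record, not taken from the dataset
example = build_input("Health", "Sample headline", "Sample body text.")
print(example)
```

In pandas, string addition propagates missing values, so this layout relies on the earlier `dropna()` having removed rows with a null title, topic, or text.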



### Tokenization + Adaptively setting max_length

In [11]:

from transformers import LongformerTokenizerFast
import numpy as np

# 1. Initialize Tokenizer
model_name = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(model_name)

# 2. Measure Lengths
print("Measuring token lengths...")
texts = df['article_text'].tolist()

# Tokenize just to count (no padding/truncation yet).
# A single call on the full list lets the fast tokenizer batch the work internally.
encodings = tokenizer(texts, add_special_tokens=True, return_attention_mask=False)
token_lens = [len(x) for x in encodings['input_ids']]

# 3. Statistics
token_lens = np.array(token_lens)
p95 = np.percentile(token_lens, 95)
p99 = np.percentile(token_lens, 99)

print(f"Mean Length: {np.mean(token_lens):.1f}")
print(f"95th Percentile: {p95:.1f} tokens")
print(f"99th Percentile: {p99:.1f} tokens")
print(f"Max Length found: {np.max(token_lens)} tokens")

# Based on these stats, max_length=2048 covers at least 99% of articles (p99 is about 1915 tokens) while keeping compute manageable

Measuring token lengths...
Mean Length: 740.5
95th Percentile: 1452.0 tokens
99th Percentile: 1914.8 tokens
Max Length found: 21541 tokens
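
One way to sanity-check a candidate `max_length` is to measure what share of documents it would truncate. The sketch below is a plain-Python illustration; `truncated_fraction` and the sample lengths are hypothetical and are not the notebook's data.

```python
def truncated_fraction(token_lens, max_len):
    """Share of documents whose token count exceeds max_len (these would be truncated)."""
    if not token_lens:
        raise ValueError("token_lens must not be empty")
    return sum(1 for n in token_lens if n > max_len) / len(token_lens)


# Hypothetical token counts, chosen only to exercise the function
sample_lens = [300, 740, 1452, 1915, 2048, 21541]

for candidate in (512, 1024, 2048, 4096):
    share = truncated_fraction(sample_lens, candidate)
    print(f"max_len={candidate}: {share:.1%} truncated")
```

Applied to the real `token_lens` array, the same check should report at most roughly 1% truncation at 2048, since the 99th percentile above is about 1915 tokens.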


### Dataset Creation

In [12]:
# load the binarizer
import joblib
mlb = joblib.load('encoders/mlb_15_classes.pkl')

labels_matrix = mlb.transform(df['text_generic_frame'])

print(labels_matrix.shape)

(55616, 15)


In [13]:
# Improvement over the previous approach: metadata stays accessible because we store self.df
# We also save memory by leaving padding to the collate function instead of __getitem__
import torch
from torch.utils.data import Dataset, DataLoader

class NewsArticleDataset(Dataset):
    def __init__(self, df, tokenizer, labels_matrix, max_len=2048):
        self.df = df
        self.tokenizer = tokenizer
        self.max_len = max_len
        # We store the pre-computed matrix you made with MLB
        self.labels = labels_matrix 
        
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        text = str(row['article_text'])
        
        # TOKENIZATION
        # Truncate, but DO NOT PAD here. 
        # The collator handles padding to save memory.
        encoding = self.tokenizer(
            text,
            max_length=self.max_len,
            truncation=True,
            padding=False, 
            add_special_tokens=True 
        )
        
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        
        # CREATE GLOBAL ATTENTION MASK
        # 0 = Local Attention, 1 = Global Attention
        # We strictly set the [CLS] token (index 0) to Global Attention.
        global_attention_mask = [0] * len(input_ids)
        global_attention_mask[0] = 1 
        
        # GET LABELS
        # Directly access the row from your matrix
        labels_vec = self.labels[idx]

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'global_attention_mask': global_attention_mask,
            'labels': torch.tensor(labels_vec, dtype=torch.float),
            
            # METADATA PASS-THROUGH
            # Note: 'url' is not among the columns selected above, so it falls back to ''
            'metadata': {
                'url': row.get('url', ''),
                'title': row.get('title', ''),
                'gpt_topic': row.get('gpt_topic', ''),
                'num_words': row.get('num_words', 0)
            }
        }


In [14]:
# Collate function, which is critical to running this much larger model

def longformer_collate_fn(batch):
    """
    Custom collator to handle dynamic padding and 512-window alignment.
    """
    # 1. Determine the maximum length in this specific batch
    max_len = max(len(item['input_ids']) for item in batch)
    
    # 2. Round up to nearest multiple of 512 (Longformer Window Size)
    # This aligns memory for the sliding window attention mechanism
    window_size = 512
    padded_len = ((max_len + window_size - 1) // window_size) * window_size
    
    # Prepare batch lists
    input_ids_batch = []
    attention_mask_batch = []
    global_attention_mask_batch = []
    labels_batch = []
    metadata_batch = []
    
    pad_token_id = tokenizer.pad_token_id
    
    for item in batch:
        # Calculate padding needed for this sequence
        curr_len = len(item['input_ids'])
        pad_len = padded_len - curr_len
        
        # Pad Input IDs
        ids = item['input_ids'] + [pad_token_id] * pad_len
        
        # Pad Attention Mask (0 for padded tokens)
        mask = item['attention_mask'] + [0] * pad_len
        
        # Pad Global Attention Mask (0 for padded tokens)
        global_mask = item['global_attention_mask'] + [0] * pad_len
        
        input_ids_batch.append(ids)
        attention_mask_batch.append(mask)
        global_attention_mask_batch.append(global_mask)
        labels_batch.append(item['labels'])
        metadata_batch.append(item['metadata'])

    return {
        'input_ids': torch.tensor(input_ids_batch, dtype=torch.long),
        'attention_mask': torch.tensor(attention_mask_batch, dtype=torch.long),
        'global_attention_mask': torch.tensor(global_attention_mask_batch, dtype=torch.long),
        'labels': torch.stack(labels_batch),
        'metadata': metadata_batch # Returns a list of dicts
    }
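
The rounding step in the collator is a ceiling division followed by a multiplication. The sketch below isolates that arithmetic so it can be checked on a few lengths; `padded_length` is a hypothetical helper name, and the window size of 512 is the value used in the collator above.

```python
def padded_length(max_len_in_batch: int, window_size: int = 512) -> int:
    """Round a length up to the next multiple of window_size (ceiling division)."""
    return ((max_len_in_batch + window_size - 1) // window_size) * window_size


for n in (1, 512, 513, 1452, 2048):
    print(n, "->", padded_length(n))
```

A batch whose longest article has 1452 tokens is padded to 1536 rather than to the dataset-wide maximum of 2048, which is where the dynamic padding saves memory.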

In [15]:
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
from torch.utils.data import Subset, DataLoader
import numpy as np

# 1. SETUP SPLITTERS -----------------------------------------------------------
# We need indices to split.
N = len(labels_matrix)
X_indices = np.zeros(N) # Dummy features just to satisfy the splitter API

# A. Split Train (80%) vs Temp (20%)
msss1 = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
train_idx, temp_idx = next(iter(msss1.split(X_indices, labels_matrix)))

# B. Split Temp into Val (10%) and Test (10%)
# We split the *temp* indices in half
temp_labels = labels_matrix[temp_idx]
temp_dummy_X = np.zeros(len(temp_idx))

msss2 = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
relative_val_idx, relative_test_idx = next(iter(msss2.split(temp_dummy_X, temp_labels)))

# Map relative indices back to original dataframe indices
val_idx = temp_idx[relative_val_idx]
test_idx = temp_idx[relative_test_idx]

print(f"Splits created:")
print(f"Train: {len(train_idx)} | Val: {len(val_idx)} | Test: {len(test_idx)}")


# 2. INSTANTIATE DATASETS ------------------------------------------------------
# We create ONE full dataset, then subset it using the indices above.
full_dataset = NewsArticleDataset(
    df, 
    tokenizer, 
    labels_matrix, 
    max_len=2048
)

train_dataset = Subset(full_dataset, train_idx)
val_dataset   = Subset(full_dataset, val_idx)
test_dataset  = Subset(full_dataset, test_idx)


# 3. CREATE DATALOADERS --------------------------------------------------------
# Optimizations for RTX 4070 Ti Super
BATCH_SIZE = 4 

train_loader = DataLoader(
    train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True, # Shuffle ONLY training
    collate_fn=longformer_collate_fn,
    num_workers=0
)

val_loader = DataLoader(
    val_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=False, 
    collate_fn=longformer_collate_fn,
    num_workers=0
)

test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=False, 
    collate_fn=longformer_collate_fn,
    num_workers=0
)

print(f"Loaders ready. Train batches: {len(train_loader)}")

Splits created:
Train: 44492 | Val: 5539 | Test: 5585
Loaders ready. Train batches: 11123


In [16]:
# Let's test on a batch as a sanity check
# grab a batch using iterator next()
batch = next(iter(train_loader))

print(batch['metadata'][1])
# so now we have our useful metadata in our batches

{'url': '', 'title': "'Real Housewives' star Brandi Glanville hospitalized following collapse, son called 911", 'gpt_topic': 'Health', 'num_words': np.int64(351)}


### Pre-training Loading

In [14]:
# for a clean state, hard reset the GPU state
import torch, gc
gc.collect()
torch.cuda.empty_cache() 

In [2]:
import os
import json
import torch
from datetime import datetime
# 1. THE DEEP IMPORT (Go straight to the source file)
try:
    from transformers.models.longformer.modeling_longformer import LongformerForSequenceClassification
except ImportError:
    # If this fails, we try the old directory structure (sometimes happens in older/conda envs)
    from transformers.models.longformer import LongformerForSequenceClassification

# 2. MODEL INITIALIZATION -----------------------------------------------------
print("Loading Longformer (this may take a moment)...")

model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=15, 
    problem_type="multi_label_classification",
    ignore_mismatched_sizes=True 
)

# ENABLE Gradient Checkpointing - which saves a huge amount of V-Ram
model.gradient_checkpointing_enable()

model.to('cuda')
print("Model loaded and moved to GPU.")

# 3. OPTIMIZER (no LR scheduler is used in this run) ---------------------------
# Hyperparameters
LR = 3e-5
EPOCHS = 4
# set the number of mini batches you run before updating gradients
ACCUMULATION_STEPS = 4  # Effective Batch Size = 4 (physical) * 4 (accum) = 16

optimizer = torch.optim.AdamW(
    model.parameters(), 
    lr=LR 
)

# 4. THE "LAB MANAGER" (EXPERIMENT TRACKER) -----------------------------------
class ExperimentTracker:
    def __init__(self, run_name, base_dir="saved_models/framing_training_runs_longformer"):
        # Create a unique timestamped folder for this run
        timestamp = datetime.now().strftime("%Y%m%d_%H%M")
        self.run_dir = os.path.join(base_dir, f"{timestamp}_{run_name}")
        os.makedirs(self.run_dir, exist_ok=True)
        print(f" Experiment initialized. Saving to: {self.run_dir}")
        
        # Initialize a log dictionary
        self.history = {
            "config": {
                "model": "longformer-base-4096",
                "max_len": 2048,
                "batch_size": 4, 
                "accum_steps": ACCUMULATION_STEPS,
                "lr": LR
            },
            "epochs": []
        }
        
    def log_epoch(self, epoch_data):
        """Append epoch results to history and save immediately."""
        self.history["epochs"].append(epoch_data)
        self.save_history()
        
    def save_history(self):
        with open(os.path.join(self.run_dir, "metrics.json"), "w") as f:
            json.dump(self.history, f, indent=4)
            
    def save_model(self, model, name="model_state.bin"):
        torch.save(model.state_dict(), os.path.join(self.run_dir, name))
        print(f"Model saved: {name}")

    def save_report(self, df_report, name="classification_report.csv"):
        df_report.to_csv(os.path.join(self.run_dir, name))


Loading Longformer (this may take a moment)...


Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at allenai/longformer-base-4096 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded and moved to GPU.
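
As a rough check on what the accumulation setting implies, the sketch below computes the effective batch size and the number of optimizer updates per epoch. It assumes the training loop performs one extra update for any leftover batches at the end of an epoch, and it uses the train batch count reported by the DataLoader cell; `optimizer_steps_per_epoch` is a hypothetical helper.

```python
import math


def optimizer_steps_per_epoch(num_batches: int, accumulation_steps: int) -> int:
    """Full accumulation windows, plus one extra step if batches are left over."""
    return math.ceil(num_batches / accumulation_steps)


physical_batch = 4       # BATCH_SIZE in the DataLoader cell
accumulation_steps = 4   # ACCUMULATION_STEPS above
train_batches = 11123    # "Train batches" printed by the DataLoader cell

effective_batch = physical_batch * accumulation_steps
steps = optimizer_steps_per_epoch(train_batches, accumulation_steps)
print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps)
```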


In [20]:


# DEFINE weighted loss

num_positives = torch.tensor(labels_matrix.sum(axis=0), dtype=torch.float)
num_negatives = len(labels_matrix) - num_positives

# Calculate ratio: if we have 10x more negatives, we boost positives by 10x
pos_weight = (num_negatives / (num_positives + 1e-5)).to('cuda')

print(f"Calculated positive weights for {len(pos_weight)} classes.")

# 2. Define the Criterion
# We pass this into the training function
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)


Calculated positive weights for 15 classes.
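
To see what the ratio produces, here is a minimal plain-Python sketch of the same negatives-to-positives calculation on a toy multi-hot matrix. The function name `positive_weights` and the toy labels are assumptions for illustration; the notebook computes the same quantity with tensors.

```python
def positive_weights(labels, eps=1e-5):
    """Per-class negatives/positives ratio for a multi-hot label matrix (list of rows)."""
    n_rows = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        positives = sum(row[c] for row in labels)
        negatives = n_rows - positives
        weights.append(negatives / (positives + eps))
    return weights


# Toy labels: class 0 is common (3 of 4 rows), class 1 is rare (1 of 4 rows)
toy_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
toy_weights = positive_weights(toy_labels)
print([round(w, 3) for w in toy_weights])
```

A weight above 1 boosts the positive term of the loss for a rare class, while a weight below 1 shrinks it for a class that is positive in more than half of the rows.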


### Training Run

In [21]:
# Initialize and name the run
tracker = ExperimentTracker(run_name="longformer_topic_expert_v1")

 Experiment initialized. Saving to: saved_models/framing_training_runs_longformer\20260121_0143_longformer_topic_expert_v1


In [23]:
from tqdm.auto import tqdm
from sklearn.metrics import f1_score

# 3. TRAINING ENGINE ----------------------------------------------------------
def train_engine(model, train_loader, val_loader, optimizer, tracker, criterion):
    scaler = torch.amp.GradScaler('cuda') # Mixed Precision (Newer PyTorch syntax)
    
    for epoch in range(EPOCHS):
        print(f"\n======== EPOCH {epoch+1}/{EPOCHS} ========")
        
        # --- TRAINING PHASE ---
        model.train()
        train_loss = 0
        optimizer.zero_grad()
        
        loop = tqdm(train_loader, leave=True)
        for step, batch in enumerate(loop):
            # Move batch to device
            input_ids = batch['input_ids'].to('cuda')
            attention_mask = batch['attention_mask'].to('cuda')
            global_attention_mask = batch['global_attention_mask'].to('cuda')
            labels = batch['labels'].to('cuda')
            
            # Forward Pass (Mixed Precision)
            with torch.amp.autocast('cuda'):
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    global_attention_mask=global_attention_mask,
                    # Passing labels makes the model also compute its built-in unweighted loss; we ignore it
                    labels=labels 
                )
                
                # CUSTOM WEIGHTED LOSS
                loss = criterion(outputs.logits, labels)
                loss = loss / ACCUMULATION_STEPS
            
            # Backward Pass
            scaler.scale(loss).backward()
            
            if (step + 1) % ACCUMULATION_STEPS == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
            
            train_loss += loss.item() * ACCUMULATION_STEPS
            loop.set_postfix(loss=loss.item() * ACCUMULATION_STEPS)
            
        if len(train_loader) % ACCUMULATION_STEPS != 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
    
        avg_train_loss = train_loss / len(train_loader)
        
        # --- VALIDATION PHASE ---
        model.eval()
        val_loss = 0
        all_preds = []
        all_labels = []
        
        print("Running Validation...")
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to('cuda')
                attention_mask = batch['attention_mask'].to('cuda')
                global_attention_mask = batch['global_attention_mask'].to('cuda')
                labels = batch['labels'].to('cuda')
                
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    global_attention_mask=global_attention_mask,
                    labels=labels
                )
                
                loss = criterion(outputs.logits, labels)
                val_loss += loss.item()
                
                probs = torch.sigmoid(outputs.logits)
                all_preds.append(probs.cpu().numpy())
                all_labels.append(labels.cpu().numpy())
        
        avg_val_loss = val_loss / len(val_loader)
        
        # --- METRICS ---
        all_preds_np = np.concatenate(all_preds)
        all_labels_np = np.concatenate(all_labels)

        # 1. Standard Monitor (Micro @ 0.5) - Good for overall accuracy
        temp_preds = (all_preds_np > 0.5).astype(int)
        val_f1_micro = f1_score(all_labels_np, temp_preds, average='micro')

        # 2. Minority Class Monitor (Macro @ 0.5) - Good for rare frames
        # This warns you if the model is just predicting the dominant classes
        val_f1_macro = f1_score(all_labels_np, temp_preds, average='macro')

        print(f" Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f} | Micro-F1: {val_f1_micro:.4f} | Macro-F1: {val_f1_macro:.4f}")

        tracker.log_epoch({
            "epoch": epoch + 1,
            "train_loss": avg_train_loss,
            "val_loss": avg_val_loss,
            "val_f1_micro": val_f1_micro,
            "val_f1_macro": val_f1_macro 
        })
        
        # Save per-epoch model
        tracker.save_model(model, name=f"model_ep{epoch+1}.bin")

    print("Training Complete.")
    tracker.save_model(model, name="final_model.bin")
    
    return all_preds, all_labels


In [24]:

# 4. EXECUTE TRAINING ---------------------------------------------------------
print("Starting Training Run...")
val_preds, val_targets = train_engine(model, train_loader, val_loader, optimizer, tracker, criterion)

Starting Training Run...




Running Validation...
 Train Loss: 0.6245 | Val Loss: 0.5768 | Micro-F1: 0.7085 | Macro-F1: 0.6850
Model saved: model_ep1.bin




Running Validation...
 Train Loss: 0.5407 | Val Loss: 0.5848 | Micro-F1: 0.7188 | Macro-F1: 0.6944
Model saved: model_ep2.bin




Running Validation...
 Train Loss: 0.4918 | Val Loss: 0.5670 | Micro-F1: 0.7292 | Macro-F1: 0.7071
Model saved: model_ep3.bin




Running Validation...
 Train Loss: 0.4492 | Val Loss: 0.5862 | Micro-F1: 0.7270 | Macro-F1: 0.7063
Model saved: model_ep4.bin
Training Complete.
Model saved: final_model.bin
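
The two F1 monitors logged each epoch weight classes differently: micro-F1 pools true and false positives across all classes, while macro-F1 averages the per-class scores. The sketch below follows the standard definitions in plain Python on a toy example; the function names and data are illustrative, and the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-hot label rows."""
    n_classes = len(y_true[0])
    counts = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / n_classes
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return micro, macro


# Toy case: the model never predicts the rare class 1
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Ignoring the rare class barely dents the micro score but halves the macro score here, which is why the macro monitor is the one to watch for rare frames.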


### Post-Training Evaluation: Load Best Epoch and Optimize Thresholds

Based on metrics.json, Epoch 3 achieved the best validation performance:
- Micro F1: 0.7292
- Macro F1: 0.7071

We now load this checkpoint and run per-class threshold optimization.
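
The choice of epoch 3 can be reproduced from the logged history. The sketch below hard-codes the four epoch records printed during training instead of reading `metrics.json`, and the variable names are illustrative assumptions.

```python
# Epoch records copied from the training output above
history = {
    "epochs": [
        {"epoch": 1, "val_loss": 0.5768, "val_f1_micro": 0.7085, "val_f1_macro": 0.6850},
        {"epoch": 2, "val_loss": 0.5848, "val_f1_micro": 0.7188, "val_f1_macro": 0.6944},
        {"epoch": 3, "val_loss": 0.5670, "val_f1_micro": 0.7292, "val_f1_macro": 0.7071},
        {"epoch": 4, "val_loss": 0.5862, "val_f1_micro": 0.7270, "val_f1_macro": 0.7063},
    ]
}

# Pick the epoch with the highest validation micro-F1
best_by_f1 = max(history["epochs"], key=lambda e: e["val_f1_micro"])
# Cross-check against the epoch with the lowest validation loss
best_by_loss = min(history["epochs"], key=lambda e: e["val_loss"])

print("best by micro-F1:", best_by_f1["epoch"])
print("best by val loss:", best_by_loss["epoch"])
```

Both criteria agree on the same epoch in this run, so the choice does not depend on which metric is used for selection.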

In [3]:
# Load best epoch model (Epoch 3)
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

# Path to the best model checkpoint
best_model_path = "saved_models/framing_training_runs_longformer/20260121_0143_longformer_topic_expert_v1/model_ep3.bin"

# Load weights into the model
model.load_state_dict(torch.load(best_model_path, map_location='cuda'))
model.eval()
print(f"Loaded best model from: {best_model_path}")

Loaded best model from: saved_models/framing_training_runs_longformer/20260121_0143_longformer_topic_expert_v1/model_ep3.bin


In [17]:
# Define official labels for reporting
official_labels = [
    "Economic", "Capacity and resources", "Morality", "Fairness and equality",
    "Legality, constitutionality and jurisprudence", "Policy prescription and evaluation",
    "Crime and punishment", "Security and defense", "Health and safety",
    "Quality of life", "Cultural identity", "Public opinion", "Political",
    "External regulation and reputation", "Other"
]

# Get raw probabilities (sigmoids) from the validation set for threshold optimization
from sklearn.metrics import f1_score, classification_report
import numpy as np

model.eval()
val_probs = []
val_labels = []

print("Collecting validation set probabilities...")
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        global_attention_mask = batch['global_attention_mask'].to('cuda')
        labels = batch['labels'].to('cuda')
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask
        )
        probs = torch.sigmoid(outputs.logits)
        
        val_probs.append(probs.cpu().numpy())
        val_labels.append(labels.cpu().numpy())

val_probs = np.vstack(val_probs)
val_labels = np.vstack(val_labels)

print(f"Validation set shape: {val_probs.shape}")

Collecting validation set probabilities...
Validation set shape: (5539, 15)


In [18]:
# Find the optimal threshold for EACH class via grid search
# We test thresholds from 0.10 to 0.90 in steps of 0.05 and keep the one that maximizes F1 for that class
n_classes = len(official_labels)
best_thresholds = np.full(n_classes, 0.5)  # Start with default

print("\nFinding optimal thresholds per class...")
for i in range(n_classes):
    best_score = 0
    best_thresh = 0.5
    
    # Get just the column for this class
    y_true = val_labels[:, i]
    y_score = val_probs[:, i]
    
    # Grid search thresholds
    for thresh in np.arange(0.1, 0.95, 0.05):  # test over increments of 0.05
        y_pred = (y_score > thresh).astype(int)
        score = f1_score(y_true, y_pred)
        
        if score > best_score:
            best_score = score
            best_thresh = thresh
            
    best_thresholds[i] = best_thresh
    class_name = official_labels[i]
    print(f"  {class_name:<45} Best: {best_thresh:.2f} (Val F1: {best_score:.3f})")


Finding optimal thresholds per class...
  Economic                                      Best: 0.35 (Val F1: 0.803)
  Capacity and resources                        Best: 0.85 (Val F1: 0.619)
  Morality                                      Best: 0.85 (Val F1: 0.678)
  Fairness and equality                         Best: 0.55 (Val F1: 0.714)
  Legality, constitutionality and jurisprudence Best: 0.40 (Val F1: 0.801)
  Policy prescription and evaluation            Best: 0.40 (Val F1: 0.765)
  Crime and punishment                          Best: 0.45 (Val F1: 0.799)
  Security and defense                          Best: 0.60 (Val F1: 0.761)
  Health and safety                             Best: 0.60 (Val F1: 0.800)
  Quality of life                               Best: 0.40 (Val F1: 0.807)
  Cultural identity                             Best: 0.70 (Val F1: 0.747)
  Public opinion                                Best: 0.45 (Val F1: 0.676)
  Political                                     Best: 0.50 
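
For readers who want to trace the grid search by hand, the sketch below runs the same procedure for a single class on six toy scores in plain Python. The helper names, the toy data, and the rounded grid are illustrative assumptions; ties keep the lowest threshold because the comparison is strict, as in the cell above.

```python
def f1_binary(y_true, y_pred):
    """Binary F1 from 0/1 lists, defined as 0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_score, grid):
    """Return (threshold, f1) maximizing F1 over the grid; first maximum wins."""
    best_t, best_f1 = 0.5, 0.0
    for t in grid:
        preds = [1 if s > t else 0 for s in y_score]
        score = f1_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


grid = [round(0.10 + 0.05 * k, 2) for k in range(17)]  # 0.10, 0.15, ..., 0.90

# Toy validation scores for one class
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.92, 0.81, 0.33, 0.41, 0.22, 0.05]

thresh, score = best_threshold(y_true, y_score, grid)
print(f"best threshold={thresh:.2f}, F1={score:.3f}")
```

Lowering the threshold below 0.5 recovers the positive scored at 0.33 at the cost of one false positive, which is the kind of trade-off the per-class search is looking for.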

In [19]:
# Apply thresholds to TEST set
print("\nCollecting test set probabilities...")
test_probs = []
test_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        global_attention_mask = batch['global_attention_mask'].to('cuda')
        labels = batch['labels'].to('cuda')
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask
        )
        probs = torch.sigmoid(outputs.logits)
        
        test_probs.append(probs.cpu().numpy())
        test_labels.append(labels.cpu().numpy())

test_probs = np.vstack(test_probs)
test_labels = np.vstack(test_labels)

print(f"Test set shape: {test_probs.shape}")


Collecting test set probabilities...
Test set shape: (5585, 15)


### Test Set Evaluation: Default Threshold (0.5)

In [20]:
# Default threshold (0.5) evaluation
test_preds_default = (test_probs > 0.5).astype(int)

report_default = classification_report(test_labels, test_preds_default, target_names=official_labels, output_dict=True)
df_report_default = pd.DataFrame(report_default).transpose()

# Calculate counts per label for reference
all_clean_items = []
for row in test_labels:
    for i, val in enumerate(row):
        if val == 1:
            all_clean_items.append(official_labels[i])
counts = pd.Series(all_clean_items).value_counts()

out_default = (
    df_report_default
      .sort_values("f1-score", ascending=False)
      .assign(count=lambda d: d.index.map(counts).astype("Int64"))
      .reset_index(names="label")
)

print("=== Test Set Performance (Default Threshold = 0.5) ===\n")
out_default

=== Test Set Performance (Default Threshold = 0.5) ===



| | label | precision | recall | f1-score | support | count |
|---|---|---|---|---|---|---|
| 0 | Crime and punishment | 0.80534 | 0.820175 | 0.81269 | 2280.0 | 2280.0 |
| 1 | Quality of life | 0.804201 | 0.816998 | 0.810549 | 2765.0 | 2765.0 |
| 2 | Legality, constitutionality and jurisprudence | 0.822264 | 0.793321 | 0.807533 | 2216.0 | 2216.0 |
| 3 | Economic | 0.821265 | 0.770753 | 0.795207 | 2325.0 | 2325.0 |
| 4 | Health and safety | 0.726567 | 0.822348 | 0.771496 | 1593.0 | 1593.0 |
| 5 | Political | 0.738046 | 0.80792 | 0.771404 | 2197.0 | 2197.0 |
| 6 | Security and defense | 0.691226 | 0.851108 | 0.76288 | 1444.0 | 1444.0 |
| 7 | Policy prescription and evaluation | 0.761341 | 0.763449 | 0.762394 | 2528.0 | 2528.0 |
| 8 | weighted avg | 0.713084 | 0.801541 | 0.74923 | 25960.0 | |
| 9 | micro avg | 0.690516 | 0.801541 | 0.741898 | 25960.0 | |


### Test Set Evaluation: Optimized Thresholds

In [21]:
# Apply optimized per-class thresholds
test_preds_optimized = np.zeros_like(test_probs)
for i in range(n_classes):
    test_preds_optimized[:, i] = (test_probs[:, i] > best_thresholds[i]).astype(int)

report_optimized = classification_report(test_labels, test_preds_optimized, target_names=official_labels, output_dict=True)
df_report_optimized = pd.DataFrame(report_optimized).transpose()

out_optimized = (
    df_report_optimized
      .sort_values("f1-score", ascending=False)
      .assign(count=lambda d: d.index.map(counts).astype("Int64"))
      .reset_index(names="label")
)

print("=== Test Set Performance (Optimized Thresholds) ===\n")
out_optimized

=== Test Set Performance (Optimized Thresholds) ===



| | label | precision | recall | f1-score | support | count |
|---|---|---|---|---|---|---|
| 0 | Crime and punishment | 0.785127 | 0.838158 | 0.810776 | 2280.0 | 2280.0 |
| 1 | Legality, constitutionality and jurisprudence | 0.780243 | 0.841155 | 0.809555 | 2216.0 | 2216.0 |
| 2 | Quality of life | 0.757738 | 0.867631 | 0.80897 | 2765.0 | 2765.0 |
| 3 | Economic | 0.770196 | 0.844731 | 0.805744 | 2325.0 | 2325.0 |
| 4 | Health and safety | 0.768248 | 0.792844 | 0.780352 | 1593.0 | 1593.0 |
| 5 | Political | 0.738046 | 0.80792 | 0.771404 | 2197.0 | 2197.0 |
| 6 | Security and defense | 0.732958 | 0.811634 | 0.770292 | 1444.0 | 1444.0 |
| 7 | Policy prescription and evaluation | 0.715311 | 0.827927 | 0.76751 | 2528.0 | 2528.0 |
| 8 | weighted avg | 0.720976 | 0.794646 | 0.754876 | 25960.0 | |
| 9 | micro avg | 0.718655 | 0.794646 | 0.754743 | 25960.0 | |


### Comparison with Previous Training Runs

| Run | Model | Data | Micro F1 (Optimized) | Macro F1 (Optimized) |
|-----|-------|------|---------------------|---------------------|
| Run 1 | RoBERTa-base (Head+Tail) | All Topics | 0.731 | 0.706 |
| Run 2 | RoBERTa-base (Weighted Loss) | All Topics | 0.729 | 0.706 |
| Run 3 | RoBERTa-base (Politics Expert) | Politics Only | 0.758 | 0.686 |
| **Longformer** | Longformer-base-4096 | All Topics + Topic Injection | 0.755 | 0.733 |

In [22]:
# Summary metrics comparison
micro_f1_default = f1_score(test_labels, test_preds_default, average='micro')
macro_f1_default = f1_score(test_labels, test_preds_default, average='macro')

micro_f1_optimized = f1_score(test_labels, test_preds_optimized, average='micro')
macro_f1_optimized = f1_score(test_labels, test_preds_optimized, average='macro')

print("=" * 60)
print("LONGFORMER TEST SET SUMMARY")
print("=" * 60)
print(f"\nDefault Threshold (0.5):")
print(f"  Micro F1: {micro_f1_default:.4f}")
print(f"  Macro F1: {macro_f1_default:.4f}")
print(f"\nOptimized Thresholds:")
print(f"  Micro F1: {micro_f1_optimized:.4f}")
print(f"  Macro F1: {macro_f1_optimized:.4f}")
print(f"\nImprovement from threshold optimization:")
print(f"  Micro F1: +{(micro_f1_optimized - micro_f1_default)*100:.2f}%")
print(f"  Macro F1: +{(macro_f1_optimized - macro_f1_default)*100:.2f}%")
print("=" * 60)

# Comparison with prior runs
print("\n" + "=" * 60)
print("CROSS-RUN COMPARISON (Test Set, Optimized Thresholds)")
print("=" * 60)
comparison_data = {
    'Run': ['Run 1 (RoBERTa)', 'Run 2 (Weighted)', 'Run 3 (Politics)', 'Longformer'],
    'Data': ['All Topics', 'All Topics', 'Politics Only', 'All Topics + Topic'],
    'Micro F1': [0.731, 0.729, 0.758, micro_f1_optimized],
    'Macro F1': [0.706, 0.706, 0.686, macro_f1_optimized]
}
df_comparison = pd.DataFrame(comparison_data)
df_comparison

LONGFORMER TEST SET SUMMARY

Default Threshold (0.5):
  Micro F1: 0.7419
  Macro F1: 0.7218

Optimized Thresholds:
  Micro F1: 0.7547
  Macro F1: 0.7329

Improvement from threshold optimization:
  Micro F1: +1.28%
  Macro F1: +1.11%

CROSS-RUN COMPARISON (Test Set, Optimized Thresholds)


Unnamed: 0,Run,Data,Micro F1,Macro F1
0,Run 1 (RoBERTa),All Topics,0.731,0.706
1,Run 2 (Weighted),All Topics,0.729,0.706
2,Run 3 (Politics),Politics Only,0.758,0.686
3,Longformer,All Topics + Topic,0.754743,0.732892


In [None]:
# Save optimized thresholds and classification report
import json

# Both artifacts are written to this run's folder
run_dir = "saved_models/framing_training_runs_longformer/20260121_0143_longformer_topic_expert_v1"

# Save thresholds as a {label: threshold} mapping
threshold_dict = dict(zip(official_labels, best_thresholds.tolist()))
threshold_save_path = f"{run_dir}/class_thresholds_optimized.json"

with open(threshold_save_path, 'w') as f:
    json.dump(threshold_dict, f, indent=4)
print(f"Saved optimized thresholds to: {threshold_save_path}")

# Save classification report
report_save_path = f"{run_dir}/classification_report_optimized.csv"
out_optimized.to_csv(report_save_path, index=False)
print(f"Saved classification report to: {report_save_path}")

# Display final thresholds
print("\nOptimized Thresholds per Class:")
for label, thresh in threshold_dict.items():
    print(f"  {label:<45}: {thresh:.2f}")

Saved optimized thresholds to: saved_models/framing_training_runs_longformer/20260121_0143_longformer_topic_expert_v1/class_thresholds_optimized.json
Saved classification report to: saved_models/framing_training_runs_longformer/20260121_0143_longformer_topic_expert_v1/classification_report_optimized.csv

Optimized Thresholds per Class:
  Economic                                     : 0.35
  Capacity and resources                       : 0.85
  Morality                                     : 0.85
  Fairness and equality                        : 0.55
  Legality, constitutionality and jurisprudence: 0.40
  Policy prescription and evaluation           : 0.40
  Crime and punishment                         : 0.45
  Security and defense                         : 0.60
  Health and safety                            : 0.60
  Quality of life                              : 0.40
  Cultural identity                            : 0.70
  Public opinion                               : 0.45
  Political                                    :

### Longformer Run 1 - Evaluation Notes

**Configuration:**
- Model: `allenai/longformer-base-4096`
- Max Length: 2048 tokens
- Batch Size: 4 (effective 16 with gradient accumulation)
- Training Data: All topics with `TOPIC:` prefix injection
- Best Epoch: 3 (based on validation Micro F1)
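
As a rough illustration of the `TOPIC:` prefix injection and the title + text input described above, the sketch below joins the three fields into one string before tokenization. Only the `TOPIC:` prefix comes from the configuration; the `build_input` helper, the `TITLE:` tag, and the ` | ` separator are assumptions made for this example, not the notebook's actual format.

```python
def build_input(topic, title, body):
    """Join topic, title and article body into a single model input string.

    "TOPIC:" follows the configuration note; the "TITLE:" tag and the " | "
    separator are illustrative assumptions.
    """
    parts = [f"TOPIC: {topic.strip()}"]
    if title and title.strip():
        # Skip the title segment entirely when the article has no title
        parts.append(f"TITLE: {title.strip()}")
    parts.append(body.strip())
    return " | ".join(parts)


example = build_input("Politics", "Look ahead", "Political reporter Jack Fink ...")
print(example)
# TOPIC: Politics | TITLE: Look ahead | Political reporter Jack Fink ...
```

Placing the topic first means it survives truncation at the 2048-token limit, which cuts from the end of the sequence under the usual right-side truncation.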

**Key Observations:**
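- Against the RoBERTa baselines (Runs 1-2), Longformer raises test Micro F1 from 0.731/0.729 to 0.755 and Macro F1 from 0.706 to 0.733 (optimized thresholds). This run lengthens the context and adds topic injection at the same time, so the gain cannot be attributed to full document context alone.
- Topic injection aims to provide a domain signal similar to Run 3's politics-only approach. The all-topics Longformer comes close to Run 3 on Micro F1 (0.755 vs. 0.758) and exceeds it on Macro F1 (0.733 vs. 0.686), although Run 3 used politics-only data, so the two results are not directly comparable.
- Per-class threshold optimization continues to help over the default 0.5 threshold: +1.28 points Micro F1 and +1.11 points Macro F1 on the test set.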
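
The per-class threshold optimization mentioned above can be sketched as an independent grid search per label. This is a minimal version written with the standard library only; the notebook's own search is not shown in this section, so the 0.05 grid (consistent with the reported thresholds, which are all multiples of 0.05), the use of validation data, and the names `f1_binary` and `search_thresholds` are assumptions.

```python
def f1_binary(y_true, y_pred):
    """F1 for one class, given 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_thresholds(val_labels, val_probs, grid=None):
    """Pick, for each class, the grid threshold with the best validation F1.

    val_labels / val_probs: lists of rows with one column per class.
    Ties keep the lowest threshold.
    """
    if grid is None:
        grid = [k / 100 for k in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    best = []
    for c in range(len(val_labels[0])):
        y_true = [row[c] for row in val_labels]
        scores = [row[c] for row in val_probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            preds = [1 if s >= t else 0 for s in scores]
            f1 = f1_binary(y_true, preds)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        best.append(best_t)
    return best


# Toy validation set with two classes
val_labels = [[1, 0], [1, 0], [0, 1], [0, 0]]
val_probs = [[0.42, 0.22], [0.38, 0.08], [0.12, 0.91], [0.31, 0.72]]
print(search_thresholds(val_labels, val_probs))  # [0.35, 0.75]
```

Tuning on validation data and reporting on the test set keeps the threshold gains from being inflated by selection on the evaluation data.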

**Dataset Quality Ceiling:**
Per CLAUDE.md analysis, the Mistral-generated labels have inherent noise (gold-standard F1 ~0.50). Models are effectively learning to replicate Mistral's labeling behavior rather than true frame detection. Best performance expected on "reliable" classes: Legality, Crime, Political.

### Action Plan: Optimizations from Wu et al. (2023)

**1. Data Strategy: Include "Negative" Samples**

* **Context:** Subtask 3 (Persuasion)
* **Action:** Ensure your dataloader includes articles/segments where Mistral detected *no* frames (assigning them a `[0,0,...0]` label vector).
* **Reasoning:** The paper found that including unlabelled paragraphs led to "significant improvement" compared to baselines that discarded them.
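
A minimal sketch of what keeping negatives means at the label-encoding step: rows with an empty or missing frame list map to an all-zero vector instead of being filtered out. The three-label `LABELS` list is shortened for illustration and `to_multi_hot` is a hypothetical helper, not a function from this notebook.

```python
LABELS = ["Economic", "Morality", "Political"]  # shortened, illustrative label set


def to_multi_hot(frames, labels=LABELS):
    """Encode a list of frame names as a 0/1 vector; empty or None gives all zeros."""
    present = set(frames or [])
    return [1 if name in present else 0 for name in labels]


rows = [["Political"], [], None, ["Economic", "Political"]]

# Keep every row, including the ones with no detected frame
vectors = [to_multi_hot(r) for r in rows]
print(vectors)  # [[0, 0, 1], [0, 0, 0], [0, 0, 0], [1, 0, 1]]

# A filter such as this one would silently drop the two negative samples
only_labelled = [r for r in rows if r]
print(len(rows), len(only_labelled))  # 4 2
```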



**2. Pre-training: Implement TAPT (Task-Adaptive Pre-training)**

* **Context:** Subtask 2 (Framing)
* **Action:** Before fine-tuning for classification, train the Longformer on the *raw text* of the dataset using the Masked Language Modeling (MLM) objective (the paper used 60 epochs).
* **Reasoning:** This adapts the model's embeddings to the specific vocabulary of political framing (e.g., "Rights" in gun control vs. healthcare) prior to learning the labels.
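
To make the MLM objective concrete, the sketch below implements only its masking step, using the standard BERT-style scheme: 15% of tokens are selected, and of those 80% become the mask token, 10% a random token, and 10% stay unchanged. It is a standard-library illustration, not the training code; `mask_tokens` and the toy token ids are assumptions, and in practice a library collator such as Hugging Face's `DataCollatorForLanguageModeling` would likely handle this.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels hold the original token at selected positions and -100 elsewhere
    (-100 is the default ignore_index of PyTorch's cross-entropy loss).
    """
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never select special tokens; select other tokens with probability mlm_prob
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Toy ids: 0 and 2 stand in for special tokens, 4 for the mask token
ids = [0] + list(range(10, 60)) + [2]
masked, targets = mask_tokens(ids, mask_id=4, vocab_size=100,
                              special_ids={0, 2}, rng=random.Random(0))
print(sum(t != -100 for t in targets), "of", len(ids), "tokens selected")
```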

**3. Imbalance Handling: Switch to Oversampling**

* **Context:** Subtask 1 (Genre) & Subtask 2 (Framing)
* **Action:** Remove `pos_weight`. Instead, perform **Random Oversampling** (duplicate rows of rare classes).
* **Reasoning:** The paper noted that class weights improved rare classes at the "expense of more frequent classes". Given the noisy Mistral labels, aggressive loss weighting may force the model to overfit hallucinated labels; oversampling was found to be slightly advantageous.
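
Random oversampling in a multi-label setting could look roughly like the sketch below: rows carrying a rare class are duplicated until that class reaches a minimum positive count. The `oversample_rare` helper and the `min_count` parameter are assumptions for illustration; the paper's exact procedure is not reproduced here.

```python
import random


def oversample_rare(rows, min_count, seed=0):
    """Duplicate rows until every class has at least min_count positives.

    rows: list of (text, multi_hot) pairs. Classes with no positive row are
    left alone because there is nothing to copy.
    """
    rng = random.Random(seed)
    out = list(rows)
    for c in range(len(rows[0][1])):
        donors = [r for r in rows if r[1][c] == 1]
        if not donors:
            continue
        # Count over `out` so duplicates added for earlier classes are included
        count = sum(r[1][c] for r in out)
        while count < min_count:
            out.append(rng.choice(donors))
            count += 1
    rng.shuffle(out)
    return out


train = [("a", [1, 0]), ("b", [1, 0]), ("c", [1, 0]), ("d", [0, 1])]
balanced = oversample_rare(train, min_count=3)
print(len(balanced), [sum(r[1][c] for r in balanced) for c in range(2)])  # 6 [3, 3]
```

Because articles carry several frames, duplicating a row for one rare class also adds positives for any co-occurring classes, so class counts are worth re-checking afterwards. Oversampling should be applied to the training split only.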



**4. Architecture: Consider RoBERTa-MUPPET**

* **Context:** Subtask 2 (Framing)
* **Action:** If Longformer performance stalls, benchmark against `RoBERTa-MUPPET-Large`.
* **Reasoning:** This model achieved 1st place in the English monolingual framing task, explicitly outperforming standard baselines.