To do:
- Read up on learning rate and gradient norm clipping.
    - The learning rate is set to 1e-4, following the "Embedding sweep" section of the mmBERT paper.
    - Gradient norm clipping is kept at the default, which caps the gradient norm at 1.0.

- Hyperparameter tuning (alpha, learning rate, batch size and so on; not yet sure how to approach this).
    - There is precedent for skipping hyperparameter tuning from the author of the original NLI model that DEBATE is based on: Due to computational constraints and the points from this paper, no hyperparameter tuning was performed in this case. Model tuning is also not the primary focus of this paper; it simply serves as a tool for the actual inquiry into blame in the Danish Parliament.
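
Since gradient norm clipping is on the reading list, here is a minimal sketch of what clipping at a maximum norm of 1.0 does to a set of gradients. It uses plain Python lists instead of tensors, and the function name `clip_by_global_norm` is hypothetical; it only illustrates the idea behind `max_grad_norm=1.0` and is not the Trainer's actual implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # small gradients are left untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # same direction as before, rescaled to a norm of about 1.0
```

Clipping changes only the length of the update, not its direction, which is why it is usually treated as a safeguard against occasional large gradients and not as a parameter to tune.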



In [None]:
%pip install -r "requirements_bert.txt"

In [1]:
import torch
import transformers
import bitsandbytes
import accelerate
import datasets
import numpy as np
import pandas as pd
import keras
import json
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from sklearn.model_selection import train_test_split
from datasets import Dataset
from keras.losses import binary_crossentropy
from sklearn.metrics import accuracy_score, f1_score, average_precision_score, recall_score



In [2]:
model_name = "jhu-clsp/mmBERT-base"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at jhu-clsp/mmBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
lora_config = LoraConfig(
    r=16,  # Low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # Fine-tuning all linear (classification, attention... layers)
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 3,416,096 || all params: 310,947,874 || trainable%: 1.0986
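
As a rough check on the numbers above: a LoRA adapter on a linear layer adds two low-rank matrices, so the parameter count per layer is `r * (d_in + d_out)`. The sketch below uses a hypothetical 768 x 768 layer (the layer size is an assumption, not read from the model) and then recomputes the trainable percentage from the two totals printed above.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Hypothetical 768 x 768 linear layer with r=16 as in the config above
print(lora_param_count(768, 768, r=16))  # 24576

# Trainable share, from the totals reported by print_trainable_parameters()
trainable, total = 3_416_096, 310_947_874
print(round(100 * trainable / total, 4))  # 1.0986
```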


In [10]:
training_args = TrainingArguments(
    report_to='wandb',
    output_dir='./test_tune_results',
    optim="paged_adamw_8bit",
    learning_rate=1e-4,  # Taken from the "Embedding sweep" of learning rates in the mmBERT paper, where 1e-4 performed best
    num_train_epochs=3,
    per_device_train_batch_size=256,  # Batch size of 256 to balance generalization and training efficiency
    gradient_accumulation_steps=1,  # No accumulation: the full batch fits in memory, so accumulating would only slow training
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    bf16=True,  # Enable mixed precision
    fp16=False,
    dataloader_pin_memory=True,
    dataloader_num_workers=8,
    remove_unused_columns=True, # Avoiding manual handling of residual text columns
    max_grad_norm=1.0,
    disable_tqdm=False,
)

In [11]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,  # Truncate/pad to 512 tokens to cut computation massively compared to the model's 8,192-token maximum
    )

In [14]:
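def weighted_bincrossentropy(true, pred, weight_zero=1.0, weight_one=1.0):
    """
    Calculates weighted binary cross-entropy. The weights are fixed to represent class imbalance in the dataset.

    For example, if there are 10x as many positive examples as negative examples
    and you set weight_zero = 1.0, weight_one = 0.1, then false positives
    will be penalized 10 times as much as false negatives.
    """
    true = np.asarray(true, dtype=np.float64)
    # Clip to avoid log(0); 1e-7 matches the default Keras epsilon
    pred = np.clip(np.asarray(pred, dtype=np.float64), 1e-7, 1.0 - 1e-7)

    # Per-example binary cross-entropy. Keras' binary_crossentropy averages over the
    # last axis, so for 1-D inputs it returns a single scalar and the weights below
    # would only rescale the mean instead of weighting each example.
    bin_crossentropy = -(true * np.log(pred) + (1.0 - true) * np.log(1.0 - pred))

    # apply the per-example weights
    weights = true * weight_one + (1.0 - true) * weight_zero
    # weights /= (weight_one + weight_zero)  # Normalizing to be more consistent with regular BCE for comparison
    weighted_bin_crossentropy = weights * bin_crossentropy

    return float(np.mean(weighted_bin_crossentropy))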
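
The docstring's example can be checked on individual predictions. The sketch below is a plain-Python illustration with a hypothetical helper `weighted_bce_per_example` and made-up probabilities; it applies the same weighting formula to one false positive and one equally confident false negative.

```python
import math

def weighted_bce_per_example(true, pred, weight_zero=1.0, weight_one=1.0):
    """Weighted binary cross-entropy for a single (label, probability) pair."""
    bce = -(true * math.log(pred) + (1 - true) * math.log(1 - pred))
    weight = true * weight_one + (1 - true) * weight_zero
    return weight * bce

# Two equally confident mistakes (hypothetical values)
false_positive = weighted_bce_per_example(0, 0.9, weight_zero=1.0, weight_one=0.1)
false_negative = weighted_bce_per_example(1, 0.1, weight_zero=1.0, weight_one=0.1)
print(false_positive / false_negative)  # ratio of the two penalties, about 10
```

The ratio only appears when the weights are applied to each example before averaging; applying them to an already averaged loss would rescale the mean without distinguishing the two kinds of error.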

In [15]:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # From logits to probabilities (softmax; subtracting the row max avoids overflow in exp)
    shifted = predictions - predictions.max(axis=1, keepdims=True)
    probs_2d = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    probs = probs_2d[:, 1]  # probability of the positive class

    weighted_bce = weighted_bincrossentropy(labels, probs)
    keras_bce = binary_crossentropy(labels, probs)
    keras_bce = float(np.mean(keras_bce.numpy()))  # Converting from a Keras EagerTensor to a float

    # Wrapping all metrics in float() for JSON serialization during model evaluation
    return {
        'keras_BCE': keras_bce,
        'weighted BCE': float(weighted_bce),
        'recall': float(recall_score(labels, probs.round())),
        'precision': float(average_precision_score(labels, probs)),  # NB: average precision (area under the PR curve), not precision at the 0.5 threshold
        'accuracy': float(accuracy_score(labels, probs.round())),  # accuracy, recall and F1 need hard 0/1 predictions, hence the rounding
        'f1': float(f1_score(labels, probs.round(), average='macro'))  # macro F1 is better suited to imbalanced datasets
    }
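
`compute_metrics` turns the two logits per example into probabilities with a softmax. A minimal plain-Python sketch of a numerically stable softmax follows; the logit values are hypothetical and chosen so that a naive `exp` would overflow.

```python
import math

def softmax(logits):
    """Softmax over one row of logits, shifted by the row maximum for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [1000.0, 1001.0]         # hypothetical logits; math.exp(1000.0) alone would overflow
probs = softmax(row)
print(probs[1])                # probability of the positive class
print(softmax([0.0, 1.0])[1])  # same value: only the difference between the logits matters
```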

In [18]:
# Custom trainer class (weighted)
from collections import Counter

# Class weights are computed from the labels of the training subset (stored in test_dataframe)
labels = test_dataframe['label'].tolist()
class_counts = Counter(labels)
total = sum(class_counts.values())

# Higher weight = more emphasis
weights = [total/class_counts[0], total/class_counts[1]]
class_weights = torch.tensor(weights, dtype=torch.float)

# Define a custom Trainer that uses the weighted loss
import torch.nn as nn

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        
        # Define weighted loss
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        
        return (loss, outputs) if return_outputs else loss
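
To make the class-weight formula concrete, here is a small worked example on hypothetical labels (80% class 0, 20% class 1). It uses the same `total / count` rule as the cell above, without torch.

```python
from collections import Counter

labels = [0] * 8 + [1] * 2   # hypothetical labels: 80% class 0, 20% class 1
class_counts = Counter(labels)
total = sum(class_counts.values())

# total / count, so the rarer class gets the larger weight
weights = [total / class_counts[0], total / class_counts[1]]
print(weights)  # [1.25, 5.0]
```

With the default mean reduction, `nn.CrossEntropyLoss` divides by the sum of the weights of the targets in the batch, so it is the ratio between the two weights (here 1:4) that shapes the loss, not their absolute size.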

In [16]:
val_dataframe = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/Model_data/validation_set.json")

val_dataframe = val_dataframe[['text', 'label']]

val_dataset = Dataset.from_pandas(val_dataframe)

tokenized_val = val_dataset.map(tokenize_function, batched=True, num_proc=16)

Map (num_proc=16): 100%|██████████| 258/258 [00:00<00:00, 427.20 examples/s]


In [17]:
dataframe_5 = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/preproc_subset_5_cleaned_training_data.json")

test_dataframe = dataframe_5[['text', 'label']]

test_dataframe = test_dataframe[0:10000]

test_dataset = Dataset.from_pandas(test_dataframe)

tokenized_test = test_dataset.map(tokenize_function, batched=True, num_proc=16)

Map (num_proc=16): 100%|██████████| 3550/3550 [00:00<00:00, 6324.45 examples/s]


In [None]:
dataframe = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/preprocessed_data_for_training.json")

dataframe = dataframe[['text', 'label']]

dataset = Dataset.from_pandas(dataframe)

tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=16)

Map (num_proc=8): 100%|██████████| 388027/388027 [00:17<00:00, 22653.17 examples/s]


In [20]:
'''
Look into learning rates; the model is currently overfitting quite drastically ("small" test set).
Normalize the weighted BCE or not?
Look into regularization, dropout and early stopping to avoid overfitting.
'''

trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_test,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)


trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ucloud/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mrune-trust[0m ([33mrune-trust-aarhus-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin




Epoch  Training Loss  Validation Loss  Keras BCE  Weighted BCE  Recall    Precision  Accuracy  F1
1      0.7321         0.693938         0.692687   0.692687      0.886364  0.518162   0.550388  0.54906
2      0.5984         0.610281         0.608975   0.608975      0.943182  0.712803   0.666667  0.666486
3      0.539          0.521573         0.521386   0.521386      0.806818  0.747577   0.771318  0.759583




TrainOutput(global_step=42, training_loss=0.6665794622330439, metrics={'train_runtime': 240.7389, 'train_samples_per_second': 44.239, 'train_steps_per_second': 0.174, 'total_flos': 3740834789683200.0, 'train_loss': 0.6665794622330439, 'epoch': 3.0})
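
The reported `global_step=42` and `train_samples_per_second` can be reproduced from the settings above. The short calculation below assumes a single device and `gradient_accumulation_steps=1`; the example count is taken from the Map progress bar of the training subset.

```python
import math

num_examples = 3550        # from the Map progress bar for the training subset
batch_size = 256           # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
train_runtime = 240.7389   # seconds, from TrainOutput

# Assumption: one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 14 42

samples_per_second = num_examples * num_epochs / train_runtime
print(round(samples_per_second, 3))  # 44.239
```

With only 14 optimizer steps per epoch, each epoch's evaluation reflects very few updates, which is worth keeping in mind when reading the per-epoch metrics.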

In [30]:
FINE_TUNED_MODEL_NAME = "mmBlameBERT-pol-DA"

merged_model = model.merge_and_unload()    # PEFT: incorporates LoRA into base weights
merged_dir = f"/work/RuneEgeskovTrust#9638/Bachelor/{FINE_TUNED_MODEL_NAME}-merged"
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
print("✓ Merged model saved to:", merged_dir)

✓ Merged model saved to: /work/RuneEgeskovTrust#9638/Bachelor/mmBlameBERT-pol-DA-merged


In [21]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.5215731859207153, 'eval_keras_BCE': 0.5213857889175415, 'eval_weighted BCE': 0.5213857293128967, 'eval_recall': 0.8068181818181818, 'eval_precision': 0.7475772782251469, 'eval_accuracy': 0.7713178294573644, 'eval_f1': 0.7595830371949774, 'eval_runtime': 2.2799, 'eval_samples_per_second': 113.164, 'eval_steps_per_second': 14.474, 'epoch': 3.0}


In [29]:

with open("/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/EvalResultFullData.txt", "w") as f:
    f.write(str(eval_results))

In [3]:
def preprocess_json(input_path, output_path=None):
    """
    Preprocesses a JSON file by filtering out entries based on the 'text' key.
    
    Criteria for deletion:
      - 'text' is missing or empty
      - 'text' length is <= 3
      - 'text' contains '(' or ')'
    
    Parameters:
        input_path (str): Path to the input JSON file.
        output_path (str, optional): If provided, saves the filtered JSON here.
    
    Returns:
        list: The filtered list of JSON entries.
    """
    # Load JSON file
    with open(input_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Filter entries
    filtered_data = [
        entry for entry in data
        if 'text' in entry
        and entry['text']
        and len(entry['text']) > 3
        and '(' not in entry['text']
        and ')' not in entry['text']
    ]

    # Optionally save to a new file
    if output_path:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(filtered_data, f, ensure_ascii=False, indent=4)

    return filtered_data
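
To illustrate the filter criteria, the sketch below applies the same conditions to a few hypothetical entries. The helper name `keep_entry` and the sample data are made up for illustration; the conditions mirror the list comprehension in `preprocess_json`.

```python
def keep_entry(entry):
    """Mirror of the filter in preprocess_json: keep entries with usable text."""
    text = entry.get('text')
    return bool(text) and len(text) > 3 and '(' not in text and ')' not in text

# Hypothetical entries for illustration
sample = [
    {'text': 'Det er en helt almindelig sætning.', 'label': 0},
    {'text': 'Ja.', 'label': 0},          # too short (length <= 3)
    {'text': '(Munterhed)', 'label': 0},  # contains parentheses
    {'text': '', 'label': 1},             # empty text
    {'label': 1},                         # missing 'text'
]
kept = [entry for entry in sample if keep_entry(entry)]
print(len(kept))  # 1
```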

In [24]:
#del model
torch.cuda.empty_cache()

In [8]:
preprocess_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/subset_1_2_3_4_5_cleaned_training_data.json",
 "/work/RuneEgeskovTrust#9638/Bachelor/training_data/preproc_subset_1_2_3_4_5_cleaned_training_data.json")

[{'paragraph': '28438',
  'sentence_nr': '10',
  'text': 'Og jeg synes sådan set,  det er lidt beskæmmende, at hr. Arne Melchior i dag med den  måde, han har optrådt på i denne debat, igen har båret ved  til det bål, at det er nok lidt i orden, for de er jo lidt  ufine i kanten.',
  'speaker': 'Frank Aaen',
  'party': 'EL',
  'preceding_sentence': ', skete der både her og i lande, som jeg tidligere  havde håb og tiltro til, en uhørt registrering af folk, og  det er det, vi skal have undersøgt.',
  'succeeding_sent': '',
  'current_speaker_in_government': False,
  'parties_in_government': ['S', 'RV'],
  'date': '1999-05-25 00:00:00',
  'label': 1},
 {'paragraph': '35815',
  'sentence_nr': '12',
  'text': 'Oven i købet sidder man år efter år  og beslutter at bruge nogle af de menneskers penge.',
  'speaker': 'Frank Aaen',
  'party': 'EL',
  'preceding_sentence': 'Vi fik heller ikke nogen bekræftelse på, at Venstre er  med til systematisk at tage penge fra samtlige landets  pensionister, 