To do:
- ADD WEIGHTED LOSS FUNCTION FOR HITS ON 1 FOR POSITIVE LABELLING AS THIS IS WAY MORE RARE = HIGHER "REWARD"
- Have a look at learning rate and gradient norm clipping which I need to read up on.
    - In addition to this, look at the implications of gradient accumulation steps
    - Much of this pipeline was constrained due to computational restrictions which I think was caused by errors and not actual training process.
- Hyperparameter tuning (Alpha, learning rate, batch size so on - not sure how to figure this out)
    - There is precedence for no hyperparameter tuning from the author of the OG NLI model that DEBATE is based on = Due to computational restrains and the points from this paper, no hyperparameter tuning was performed in this case. The model tuning in itself is also not the primary focus in this paper, but simply serves as a tool for the actual inquiry into blame in the Danish Parliament



In [None]:
%pip install -r "requirements_bert.txt"

In [21]:
import torch
import transformers
import bitsandbytes
import accelerate
import datasets
import numpy as np
import pandas as pd
import keras
import json
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer, BitsAndBytesConfig
from sklearn.model_selection import train_test_split
from datasets import Dataset
from keras.losses import binary_crossentropy
from sklearn.metrics import accuracy_score, f1_score, average_precision_score, recall_score

In [23]:
del model
torch.cuda.empty_cache()

In [24]:
model_name = "jhu-clsp/mmBERT-base"

quantization_config = BitsAndBytesConfig(
                                        load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.bfloat16,
                                         bnb_4bit_quant_type="nf4",
                                         bnb_4bit_use_double_quant=True,
                                         )

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at jhu-clsp/mmBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
lora_config = LoraConfig(
    r=16,  # Low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # Fine-tuning all linear (classification, attention... layers)
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 3,416,096 || all params: 310,947,874 || trainable%: 1.0986


In [26]:
'''
Consider the batch size, could be increased for efficiency purposes.
ADD WEIGHTED LOSS FUNCTION FOR HITS ON 1 FOR POSITIVE LABELLING AS THIS IS WAY MORE RARE = HIGHER "REWARD"
Have a look at learning rate and gradient norm clipping which I need to read up on.
    In addition to this, look at the implications of gradient accumulation steps
    Much of this pipeline was constrained due to computational restrictions which I think was caused by errors and not actual training process.
Early stopping: load_best_model_at_end=True
'''

training_args = TrainingArguments(
    output_dir='./full_tune_results',
    optim="paged_adamw_8bit",
    learning_rate=2e-4, # Learning rate copied from mmBERT paper (8e-4) as they found this to perform best
    num_train_epochs=3,
    per_device_train_batch_size=256, # Batching at 256 to balance generalization and efficient training
    gradient_accumulation_steps=1,  # Gradient of 1 as full batch fits in memory, accumulation then only slows

    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    bf16=True,  # Enable mixed precision
    fp16=False,
    dataloader_pin_memory=True,
    dataloader_num_workers=8,
    remove_unused_columns=True, # Avoiding manual handling of residual text columns
    max_grad_norm=1.0,

    disable_tqdm=False,
)

In [11]:
def preprocess_json(input_path, output_path=None):
    """
    Preprocesses a JSON file by filtering out entries based on the 'text' key.
    
    Criteria for deletion:
      - 'text' is missing or empty
      - 'text' length is <= 3
      - 'text' contains '(' or ')'
    
    Parameters:
        input_path (str): Path to the input JSON file.
        output_path (str, optional): If provided, saves the filtered JSON here.
    
    Returns:
        list: The filtered list of JSON entries.
    """
    # Load JSON file
    with open(input_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Filter entries
    filtered_data = [
        entry for entry in data
        if 'text' in entry
        and entry['text']
        and len(entry['text']) > 3
        and '(' not in entry['text']
        and ')' not in entry['text']
    ]

    # Optionally save to a new file
    if output_path:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(filtered_data, f, ensure_ascii=False, indent=4)

    return filtered_data

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], 
    padding="max_length", 
    truncation=True,
    max_length=392, # Padding to 392 to massively cut down on computation compared to base 8,192 tokens. 
    )

In [6]:
def weighted_bincrossentropy(true, pred, weight_zero = 99.0, weight_one = 1):
    """
    Calculates weighted binary cross entropy. The weights are fixed.
        
    This can be useful for unbalanced catagories.
    
    Adjust the weights here depending on what is required.
    
    For example if there are 10x as many positive classes as negative classes,
        if you adjust weight_zero = 1.0, weight_one = 0.1, then false positives 
        will be penalize 10 times as much as false negatives.

    """
  
    # calculate the binary cross entropy
    bin_crossentropy = binary_crossentropy(true, pred)
    
    # apply the weights
    weights = true * weight_one + (1. - true) * weight_zero
    weights /= (weight_one + weight_zero) # Normalizing to be more consistent with regular BCE for comparison 
    weighted_bin_crossentropy = weights * bin_crossentropy 

    return np.mean(weighted_bin_crossentropy)

In [30]:

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    #From logits to probabilities
    probs_2d = np.exp(predictions) / np.exp(predictions).sum(axis=1, keepdims=True)
    probs = probs_2d[:, 1]  # positive class extraction
    
    weigthted_bce = weighted_bincrossentropy(labels, probs)
    keras_bce = binary_crossentropy(labels, probs)
    keras_bce = float(np.mean(keras_bce.numpy()))  # Converting from keras eagertensor to float value

    recall = recall_score(labels, probs.round())
    
    # Wrapping all metrics to floats for json serialization during model eval
    return {
        'keras_BCE': keras_bce,
        'weighted BCE (STD)': weigthted_bce, # Normalized to be interpretable compared to regular BCE
        'recall': recall,
        'precision': float(average_precision_score(labels, probs)),
        'accuracy': float(accuracy_score(labels, probs.round())), # Need rounding for these two computations (integer required)
        'f1': float(f1_score(labels, probs.round(), average='macro')) # macro f1 is better for imbalanced dataset
    }

In [None]:
# Training data
preprocess_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/cleaned_training_data_3_4_5_temps.json", "/work/RuneEgeskovTrust#9638/Bachelor/training_data/preprocessed_data_for_training.json")
#Validation data
preprocess_json("/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/Model_data/validation_set.json", "/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/Model_data/validation_set.json")

In [8]:

dataframe = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/preprocessed_data_for_training.json")


In [28]:
# Custom trainer class (weigthed)
from collections import Counter

labels = test_dataframe['label'].tolist()
class_counts = Counter(labels)
total = sum(class_counts.values())

# Higher weight = more emphasis
weights = [total/class_counts[0], total/class_counts[1]]
class_weights = torch.tensor(weights, dtype=torch.float)

#define custom trainer that uses weigted loss
import torch.nn as nn

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        
        # Define weighted loss
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        
        return (loss, outputs) if return_outputs else loss

In [None]:
weights

In [10]:
val_dataframe = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/Model_data/validation_set.json")

val_dataframe = val_dataframe[['text', 'label']]

val_dataset = Dataset.from_pandas(val_dataframe)

tokenized_val = val_dataset.map(tokenize_function, batched=True, num_proc=16)

Map (num_proc=16): 100%|██████████| 174/174 [00:00<00:00, 267.76 examples/s]


In [11]:
test_dataframe = pd.read_json("/work/RuneEgeskovTrust#9638/Bachelor/training_data/cleaned_training_data.json")

test_dataframe = test_dataframe[['text', 'label']]

test_dataframe = test_dataframe[0:10000]

test_dataset = Dataset.from_pandas(test_dataframe)

tokenized_test = test_dataset.map(tokenize_function)

Map: 100%|██████████| 10000/10000 [00:02<00:00, 4129.18 examples/s]


In [None]:
#val_dataframe = val_dataframe[['preceding_sentence', 'text', 'succeeding_sent', 'label']]

dataframe = dataframe[['text', 'label']]

#val_dataset = Dataset.from_pandas(val_dataframe)

dataset = Dataset.from_pandas(dataframe)

#tokenized_val = val_dataset.map(tokenize_function)

# I suspect num_proc can be increased after having identified the padding problem
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=16)

Map (num_proc=8): 100%|██████████| 388027/388027 [00:17<00:00, 22653.17 examples/s]


In [31]:
'''
Look into learning rates, model is currently overfitting quite drastically ("small" test-set)
Normalizing weigthed BCE or no?
Look into regularization, dropout and early stopping to avoid overfitting
'''

trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_test,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)


trainer.train()

  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss,Keras Bce,Weighted bce (std),Recall,Precision,Accuracy,F1
1,0.0743,0.383976,0.432287,0.289185,0.894737,0.82869,0.83908,0.828088
2,0.148,0.416972,0.464462,0.310709,0.894737,0.826676,0.827586,0.816919
3,0.1133,0.488464,0.513181,0.3433,0.894737,0.84444,0.844828,0.833705


  return fn(*args, **kwargs)
  return fn(*args, **kwargs)


TrainOutput(global_step=120, training_loss=0.22682574565211933, metrics={'train_runtime': 413.8243, 'train_samples_per_second': 72.495, 'train_steps_per_second': 0.29, 'total_flos': 8067821509440000.0, 'train_loss': 0.22682574565211933, 'epoch': 3.0})

In [30]:
FINE_TUNED_MODEL_NAME = "mmBlameBERT-pol-DA"

merged_model = model.merge_and_unload()    # PEFT: incorporates LoRA into base weights
merged_dir = f"/work/RuneEgeskovTrust#9638/Bachelor/{FINE_TUNED_MODEL_NAME}-merged"
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)
print("✓ Merged model saved to:", merged_dir)

✓ Merged model saved to: /work/RuneEgeskovTrust#9638/Bachelor/mmBlameBERT-pol-DA-merged


In [28]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.46705466508865356, 'eval_keras_BCE': 0.4670414924621582, 'eval_weigthed BCE': 0.31243467330932617, 'eval_precision': 0.652620435757883, 'eval_accuracy': 0.7586206896551724, 'eval_f1': 0.7180555555555556, 'eval_runtime': 3.2533, 'eval_samples_per_second': 53.485, 'eval_steps_per_second': 6.762, 'epoch': 3.0}


In [29]:

with open("/work/RuneEgeskovTrust#9638/Bachelor/Bachelor_project/EvalResultFullData.txt", "w") as f:
    f.write(str(eval_results))

In [None]:
'''
We would expect to see a gradual decrease in both training and validation loss.
If either om them split too far from eachother that indicates issues with the training process.
The process itself should be pretty smooth with no dips either up or down.
'''

In [None]:
# OG function without bce
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return {
        'precision': average_precision_score(labels, predictions),
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='macro') # Macro is better suited for imbalanced data
    }