# Model Training (BERT → RoBERTa Evolution)

In this notebook, I fine‑tune transformer models on my dataset. I start with **BERT‑base**, then move to **RoBERTa‑base**, and compare their validation performance to select a final model. I also set up file logging and compute class weights to handle class imbalance.


In [2]:
# Move up one level to the project root so that relative paths (e.g. config.yaml) and local imports (utils) resolve
import os
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))
print("new cwd:", os.getcwd())


new cwd: c:\Testing\Final_Year_Project\AI-Text-Detection-Tool
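
Re-running the cell above moves the working directory up one more level each time, which can silently break later relative paths. A minimal sketch of a re-runnable alternative follows. It assumes the project root is the nearest parent directory containing `config.yaml`; the helper name `find_project_root` is hypothetical and not part of this project.

```python
import os
from pathlib import Path


def find_project_root(start: Path, marker: str = "config.yaml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Assumption: the project root is identified by the presence of `marker`.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(
        f"No {marker!r} found in {start} or any of its parent directories"
    )


# Usage in the notebook (safe to re-run, the cwd stays at the project root):
# os.chdir(find_project_root(Path.cwd()))
# print("new cwd:", os.getcwd())
```

Because the search starts from the current directory itself, running it a second time from the project root returns the same directory instead of climbing further.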


In [None]:
import torch
import numpy as np
import logging
import yaml
import pandas as pd
from transformers import TrainingArguments, EarlyStoppingCallback
from utils import model_utils

# Load configuration
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Load splits
train_df = pd.read_parquet(config['paths']['train_data'])
val_df   = pd.read_parquet(config['paths']['val_data'])
test_df  = pd.read_parquet(config['paths']['test_data'])
print(f"Data sizes → Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# Set up logging to file
logging.basicConfig(
    filename=config['paths']['log_file'],
    level=logging.INFO,
    format="%(asctime)s %(levelname)s: %(message)s"
)
logging.info("Started training pipeline (BERT → RoBERTa)")

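# Compute inverse-frequency class weights for the loss function.
# Each weight is mean(counts) / count, so a perfectly balanced dataset gives 1.0 per class.
labels, counts = np.unique(train_df['label'], return_counts=True)
class_weights = (1.0 / counts) * np.mean(counts)
# Reorder the weights into class-index order [0, 1, 2] using config['model']['label_mapping'].
# Every label value in train_df must be a key of that mapping, otherwise the lookup raises KeyError.
weight_list = [0.0] * len(class_weights)
for lab, wt in zip(labels, class_weights):
    idx = config['model']['label_mapping'][lab]
    weight_list[idx] = float(wt)
print("Class weights:", weight_list)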


Data sizes → Train: 309520, Val: 38690, Test: 38691


KeyError: 'human_written'
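
The `KeyError` above most likely means that `'human_written'` occurs in `train_df['label']` but is not a key of `config['model']['label_mapping']`. I have not confirmed the cause; the mapping may use different label names, or it may be stored as index → name rather than name → index. The sketch below computes the same weights but checks the mapping first and reports every missing label at once. The function name `compute_class_weights` and the label names `ai_generated` and `ai_paraphrased` are hypothetical; only `human_written` comes from the traceback.

```python
import numpy as np


def compute_class_weights(labels, label_mapping):
    """Return inverse-frequency class weights ordered by class index.

    Assumption: `label_mapping` maps each raw label value to an integer
    class index in 0..n_classes-1 (the real config.yaml may differ).
    """
    values, counts = np.unique(np.asarray(labels), return_counts=True)

    # Fail early with a readable message instead of a bare KeyError.
    missing = [v for v in values.tolist() if v not in label_mapping]
    if missing:
        raise KeyError(
            f"Labels missing from label_mapping: {missing}; "
            f"mapping keys are {sorted(map(str, label_mapping))}"
        )

    # mean(counts) / count: equals 1.0 for every class when the data is balanced.
    weights = counts.mean() / counts

    # Classes present in the mapping but absent from the data keep weight 0.0.
    ordered = [0.0] * len(label_mapping)
    for value, weight in zip(values.tolist(), weights):
        ordered[label_mapping[value]] = float(weight)
    return ordered


# Hypothetical mapping and toy labels, for illustration only.
example_mapping = {"human_written": 0, "ai_generated": 1, "ai_paraphrased": 2}
example_labels = ["human_written"] * 6 + ["ai_generated"] * 3 + ["ai_paraphrased"] * 3
example_weights = compute_class_weights(example_labels, example_mapping)
```

In the notebook, the call would be `compute_class_weights(train_df['label'], config['model']['label_mapping'])`. Printing `train_df['label'].unique()` next to the mapping's keys should show which side needs to change.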