# News Article Classification using Transformer Models

This notebook presents the design, implementation, and evaluation of a transformer-based
text classification system for automatically categorising news articles into four classes:
World, Sports, Business, and Sci/Tech.

The system is developed using the AG News dataset and the Hugging Face Transformers
library. Two pretrained transformer models are considered:

- DistilBERT (lightweight, efficient)
- RoBERTa (larger, more expressive)

For each model, performance is evaluated before fine-tuning (baseline) and after
fine-tuning on the AG News dataset, using both full fine-tuning and LoRA. Ensembles
of the fine-tuned models are also evaluated. Quantitative metrics and error analysis
are used to assess model behaviour and improvements.

The notebook is organised into sections that follow the project workflow:
data preparation, modelling, evaluation, and analysis.

## 1. Environment

This section sets up the Python environment used throughout the notebook.
It imports all required libraries, configures reproducibility, and verifies
that GPU acceleration is available for model training.

In [None]:
"""
Environment setup and verification.

This cell:
- Imports all core libraries used in the project
- Sets random seeds for reproducibility
- Verifies availability of a CUDA-enabled GPU
"""

# Core numerical and data libraries
import numpy as np
import pandas as pd

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Hugging Face datasets and transformers
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    set_seed
)

# Evaluation and metrics
import evaluate
from sklearn.metrics import confusion_matrix, classification_report

# PyTorch
import torch

# -------------------------------------------------------------------
# Reproducibility
# -------------------------------------------------------------------
set_seed(42)
np.random.seed(42)

# -------------------------------------------------------------------
# GPU verification
# -------------------------------------------------------------------
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    print(f"GPU available: {device_name}")
else:
    raise RuntimeError("CUDA-compatible GPU not detected. Check PyTorch installation.")

## 2. Dataset Loading

This section loads the AG News dataset from the Hugging Face Datasets library.
The dataset consists of news article titles and short descriptions, each labelled
into one of four categories: World, Sports, Business, or Sci/Tech.

The dataset structure and a representative training example are inspected to
verify correct loading.

In [None]:
"""
Load and inspect the AG News dataset.

This cell:
- Loads the AG News dataset from Hugging Face
- Displays the dataset structure
- Prints a representative training example
"""

from datasets import load_dataset

# Load the AG News dataset
dataset = load_dataset("ag_news")

# Display dataset structure
print(dataset)

# Display a representative training example
print("\nExample training sample:")
print(dataset["train"][0])

## 3. Dataset Overview and Class Distribution

This section provides an overview of the label structure of the AG News dataset
and examines the distribution of samples across the four news categories in both
the training and test splits.

Understanding class balance is important to ensure that model evaluation metrics
are meaningful and not biased toward overrepresented categories.

In [None]:
"""
Dataset overview and class distribution analysis.

This cell:
- Maps numerical labels to human-readable class names
- Computes class counts for training and test splits
- Visualises class distributions using bar charts
"""

import pandas as pd
import matplotlib.pyplot as plt

# Convert each split to a pandas DataFrame (columns: text, label)
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

label_names = ["World", "Sports", "Business", "Sci/Tech"]

train_df["label_name"] = train_df["label"].map(lambda x: label_names[x])
test_df["label_name"] = test_df["label"].map(lambda x: label_names[x])

train_counts = train_df["label_name"].value_counts().sort_index()
test_counts = test_df["label_name"].value_counts().sort_index()

print("Training set class distribution:")
print(train_counts)
print("\nTest set class distribution:")
print(test_counts)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.patch.set_facecolor("#ffffff")

colors = ["#0173B2", "#DE8F05"]

for ax, counts, title, color in zip(
    axes,
    [train_counts, test_counts],
    ["Training Set", "Test Set"],
    colors
):
    bars = ax.bar(counts.index, counts.values, color=color, alpha=0.8, edgecolor="none")
    
    for bar in bars:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2, height,
            f"{int(height)}",
            ha="center", va="bottom", fontsize=12, fontweight="150"
        )
    
    ax.set_title(f"{title} Class Distribution", fontsize=16, fontweight="200", pad=15)
    ax.set_xlabel("Category", fontsize=14, fontweight="150")
    ax.set_ylabel("Number of Articles", fontsize=14, fontweight="150")
    ax.set_axisbelow(True)
    ax.tick_params(axis="both", labelsize=12)
    ax.grid(False)
    
    for spine in ax.spines.values():
        spine.set_edgecolor("#e0e0e0")
        spine.set_linewidth(1.5)

plt.tight_layout()
plt.subplots_adjust(wspace=0.2)
plt.show()

## 4. Text Length Analysis

This section analyses the length of news article texts in the training data.
The goal is to understand typical input sizes and determine an appropriate
maximum sequence length for transformer tokenisation.

Summary statistics and a histogram of text lengths are used to support this decision.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", context="talk")

train_df["text_length"] = train_df["text"].apply(lambda x: len(x.split()))

print("Text length statistics (training set):")
print(train_df["text_length"].describe())

fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_facecolor("#ffffff")

sns.histplot(
    train_df["text_length"],
    bins=50,
    kde=True,
    stat="count",
    color="#0173B2",
    alpha=0.75,
    edgecolor="black",
    linewidth=0.5,
    ax=ax
)

mean_val = train_df["text_length"].mean()
ax.axvline(
    mean_val,
    color="#DE8F05",
    linestyle="--",
    linewidth=2,
    label=f"Mean = {mean_val:.1f} words",
    zorder=5
)

ax.set_title("Distribution of News Article Text Lengths", fontsize=15, fontweight="200", pad=18)
ax.set_xlabel("Number of Words per Article", fontsize=12, fontweight="150")
ax.set_ylabel("Number of Articles", fontsize=12, fontweight="150")
ax.legend(fontsize=11, loc="upper right", framealpha=0.95)
ax.grid(True, alpha=0.25, linestyle="--", linewidth=0.8)
ax.set_axisbelow(True)

for spine in ax.spines.values():
    spine.set_edgecolor("#e0e0e0")
    spine.set_linewidth(1.5)

plt.tick_params(axis="both", labelsize=10)
plt.tight_layout()
plt.show()

## 5. Subsampling and Dataset Preparation

To ensure efficient training within local hardware constraints while preserving
class balance, a smaller stratified subset of the dataset is created.

This section constructs:
- A balanced training subset of 8,000 samples (2,000 per class)
- A balanced test subset of 2,000 samples (500 per class)

The subsets are then converted back into Hugging Face Dataset objects for
subsequent tokenisation and model training.

In [None]:
"""
Balanced subsampling and dataset preparation.

This cell:
- Creates stratified training and test subsets
- Verifies class balance
- Converts subsets back to Hugging Face Dataset format
"""

from datasets import Dataset, DatasetDict

# Stratified subsampling
train_subset = train_df.groupby("label_name", group_keys=False).sample(
    n=2000, random_state=42
)

test_subset = test_df.groupby("label_name", group_keys=False).sample(
    n=500, random_state=42
)

# Verify class balance
print("Subsampled training set distribution:")
print(train_subset["label_name"].value_counts())

print("\nSubsampled test set distribution:")
print(test_subset["label_name"].value_counts())

# Convert back to Hugging Face Dataset format
train_hf = Dataset.from_pandas(train_subset.reset_index(drop=True))
test_hf = Dataset.from_pandas(test_subset.reset_index(drop=True))

mini_dataset = DatasetDict({
    "train": train_hf,
    "test": test_hf
})

mini_dataset

## 6. Tokenisation Strategy

Tokenisation is a critical step in transformer-based text classification, as it
determines how raw text is converted into numerical representations suitable for
model input. The choice of maximum sequence length directly affects both model
performance and computational efficiency.

In this project, tokenisation decisions are informed by empirical analysis rather
than arbitrary defaults. First, a token length truncation analysis is conducted
for each selected model to quantify how many samples would be affected by common
maximum sequence lengths. Based on these findings, an appropriate maximum length
is selected and applied consistently during dataset preparation.

This section is divided into two subsections:

- **Section 6.1:** Token length truncation analysis for DistilBERT and RoBERTa  
- **Section 6.2:** Tokenisation of the dataset using the selected maximum length

### 6.1 Token Length Truncation Analysis

Before tokenising the dataset for model training, it is important to analyse how
different maximum sequence lengths affect truncation. Transformer models have a
fixed maximum input length, and overly small limits may truncate important
information, while overly large limits increase computational cost unnecessarily.

This section analyses token length distributions using the tokenisers associated
with both selected models:
- DistilBERT
- RoBERTa

For each model, the proportion of samples exceeding common maximum lengths
(100, 128, and 256 tokens) is computed and reported in tabular form. This analysis
informs the choice of an appropriate maximum sequence length for training.
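
As a minimal, self-contained sketch of the calculation described above: given a list of token lengths, the share of samples that a limit would truncate is the fraction of lengths exceeding it. The helper name `truncation_share` and the toy lengths are illustrative assumptions, not part of the notebook's pipeline; real lengths come from the model tokenisers.

```python
def truncation_share(token_lengths, max_length):
    """Return the fraction of sequences longer than max_length tokens."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    exceeding = sum(1 for n in token_lengths if n > max_length)
    return exceeding / len(token_lengths)


# Hypothetical token lengths for eight articles (illustrative values only).
toy_lengths = [42, 57, 63, 88, 101, 120, 130, 260]

for limit in (100, 128, 256):
    share = truncation_share(toy_lengths, limit)
    print(f"max_length={limit}: {share:.1%} of samples truncated")
```

A limit is usually acceptable when this share is small, since only those samples lose text at the end.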

In [None]:
"""
Token length truncation analysis for DistilBERT and RoBERTa.

This cell:
- Loads the tokenisers for both models
- Computes true token lengths without truncation
- Calculates how many samples exceed common max lengths
- Presents the results in a clear table
"""

from transformers import AutoTokenizer
import numpy as np
import pandas as pd

# Models to analyse
models = {
    "DistilBERT": "distilbert-base-uncased",
    "RoBERTa": "roberta-base"
}

results = []

# Analyse token lengths for each model
for model_name, model_checkpoint in models.items():
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

    # Tokenise all texts in one batched call, without truncation,
    # to get true token lengths (including special tokens)
    encodings = tokenizer(train_df["text"].tolist(), add_special_tokens=True)
    token_lengths = [len(ids) for ids in encodings["input_ids"]]

    token_lengths = np.array(token_lengths)
    total_samples = len(token_lengths)

    # Truncation counts
    trunc_100 = np.sum(token_lengths > 100)
    trunc_128 = np.sum(token_lengths > 128)
    trunc_256 = np.sum(token_lengths > 256)

    results.append({
        "Model": model_name,
        "Total Samples": total_samples,
        "Samples > 100 tokens": trunc_100,
        "Percent > 100 tokens": trunc_100 / total_samples,
        "Samples > 128 tokens": trunc_128,
        "Percent > 128 tokens": trunc_128 / total_samples,
        "Samples > 256 tokens": trunc_256,
        "Percent > 256 tokens": trunc_256 / total_samples,
    })

# Create results table
truncation_df = pd.DataFrame(results)

# Format percentages for readability
truncation_df["Percent > 100 tokens"] = truncation_df["Percent > 100 tokens"].map(lambda x: f"{x:.2%}")
truncation_df["Percent > 128 tokens"] = truncation_df["Percent > 128 tokens"].map(lambda x: f"{x:.2%}")
truncation_df["Percent > 256 tokens"] = truncation_df["Percent > 256 tokens"].map(lambda x: f"{x:.2%}")

truncation_df

### 6.2 Dataset Tokenisation

In this subsection, the balanced dataset is tokenised using the pretrained
tokenisers corresponding to the selected transformer models: DistilBERT and
RoBERTa. Based on the truncation analysis in Section 6.1, a maximum sequence
length of 128 tokens is used for both models.

Tokenisation converts raw text into numerical inputs required by transformer
models, including input IDs and attention masks. The tokenised datasets are
prepared in a format compatible with PyTorch and the Hugging Face Trainer API.
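
To make the roles of input IDs and attention masks concrete, here is a minimal sketch of fixed-length encoding with a toy whitespace vocabulary. The function `encode_fixed_length`, the vocabulary, and the special-token IDs are hypothetical and far simpler than the subword tokenisers used by DistilBERT and RoBERTa; only the padding and truncation logic is the point.

```python
PAD_ID, UNK_ID = 0, 1  # hypothetical special-token IDs for this toy example


def encode_fixed_length(text, vocab, max_length):
    """Map words to IDs, then truncate or pad to exactly max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                      # truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [PAD_ID] * n_pad,    # padding
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


toy_vocab = {"stocks": 2, "rally": 3, "after": 4, "earnings": 5}

short_enc = encode_fixed_length("Stocks rally", toy_vocab, max_length=4)
long_enc = encode_fixed_length(
    "Stocks rally after strong earnings", toy_vocab, max_length=4
)
print(short_enc)
print(long_enc)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore padded positions.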

In [None]:
"""
Dataset tokenisation for DistilBERT and RoBERTa.

This cell:
- Loads the appropriate tokenisers for each model
- Tokenises the balanced dataset using a fixed maximum sequence length
- Prepares tokenised datasets for PyTorch-based training
"""

from transformers import AutoTokenizer

MAX_LENGTH = 128

# Model checkpoints
model_checkpoints = {
    "distilbert": "distilbert-base-uncased",
    "roberta": "roberta-base"
}

tokenized_datasets = {}

for model_name, checkpoint in model_checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize_function(batch):
        """
        Tokenise a batch of text examples.

        Args:
            batch (dict): A batch containing the 'text' field.

        Returns:
            dict: Tokenised outputs including input_ids and attention_mask.
        """
        return tokenizer(
            batch["text"],
            padding="max_length",
            truncation=True,
            max_length=MAX_LENGTH
        )

    # Apply tokenisation
    tokenized_dataset = mini_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text", "label_name"]
    )

    # Set format for PyTorch
    tokenized_dataset.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "label"]
    )

    tokenized_datasets[model_name] = tokenized_dataset

    
# Inspect tokenised dataset structures for verification
print("DistilBERT tokenised dataset:")
print(tokenized_datasets["distilbert"])

print("\nRoBERTa tokenised dataset:")
print(tokenized_datasets["roberta"])


In [None]:
"""
Create a visual of the end-to-end workflow.
"""


import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Set up the figure and axes
fig, ax = plt.subplots(figsize=(14, 3))
fig.patch.set_facecolor("#f4f4f9")
ax.set_xlim(0, 14)
ax.set_ylim(0, 2)

# Define color palette for each step
COLORS = ["#003f5c", "#58508d", "#bc5090", "#ff6361", "#ffa600"]
TEXT_COLOR = "white"
ARROW_COLOR = "#555555"

# Pipeline labels
labels = [
    "Raw Text",
    "Tokeniser",
    "Transformer\nEncoder",
    "Classification\nHead",
    "Predicted\nCategory"
]

# Box positions on x-axis
positions = [1, 4, 7, 10, 13]

# Draw boxes
for x, label, color in zip(positions, labels, COLORS):
    box = patches.FancyBboxPatch(
        (x - 1, 0.7), 2, 0.8,
        boxstyle="round,pad=0.1,rounding_size=0.2",
        linewidth=0,
        facecolor=color
    )
    ax.add_patch(box)
    ax.text(x, 1.1, label, ha='center', va='center', fontsize=11, color=TEXT_COLOR, fontweight='250')

# Draw arrows between boxes
for i in range(len(positions) - 1):
    ax.annotate(
        "", 
        xy=(positions[i+1]-1, 1.1), 
        xytext=(positions[i]+1, 1.1),
        arrowprops=dict(arrowstyle="-|>", lw=2.5, color=ARROW_COLOR)
    )

ax.axis("off")
plt.title("Transformer-Based News Classification Pipeline", fontsize=14, pad=20, fontweight='400', color="#333333")

# Display the plot
plt.show()

## 7. Baseline Model Evaluation (Pre-Fine-Tuning)

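Before fine-tuning, each pretrained transformer model is evaluated on the test
dataset to establish a baseline performance. This baseline reflects how well
the model performs on the news classification task without any task-specific
training. Because the classification head added on top of each pretrained encoder
is randomly initialised, baseline accuracy is expected to be close to chance
(about 25% for four balanced classes).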

Evaluating the baseline is important for quantifying the improvement gained
through fine-tuning and for understanding the extent to which pretrained
language representations alone capture task-relevant information.

In this section, both DistilBERT and RoBERTa are evaluated using the same test
set and the same evaluation metrics to ensure a fair comparison.
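
Macro averaging computes each metric per class and then takes the unweighted mean, so every category counts equally. The following is a minimal pure-Python sketch of macro F1 on made-up labels; the function name `macro_f1` and the toy data are assumptions for illustration, and the notebook itself relies on scikit-learn for the actual metrics.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (0.0 when a class is never seen)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels: 0=World, 1=Sports, 2=Business, 3=Sci/Tech
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 3, 3, 3]

print(f"macro F1 = {macro_f1(y_true, y_pred, num_classes=4):.3f}")
```

In this toy case a single Business article misread as Sci/Tech lowers the F1 of both classes, which the macro average reflects equally.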

In [None]:
"""
Baseline (pre-fine-tuning) evaluation for DistilBERT and RoBERTa.

This cell:
- Loads pretrained models without fine-tuning
- Evaluates them on the test split
- Computes accuracy, macro-averaged precision, recall, and F1-score
"""

from transformers import AutoModelForSequenceClassification, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for classification.

    Args:
        eval_pred (tuple): Tuple containing logits and true labels.

    Returns:
        dict: Dictionary of evaluation metrics.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    accuracy = accuracy_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

baseline_results = []

for model_name, checkpoint in model_checkpoints.items():
    # Load pretrained model (no fine-tuning)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=4
    )

    trainer = Trainer(
        model=model,
        tokenizer=AutoTokenizer.from_pretrained(checkpoint),
        compute_metrics=compute_metrics
    )

    # Evaluate on test set
    metrics = trainer.evaluate(tokenized_datasets[model_name]["test"])

    # Store results
    metrics["model"] = model_name
    baseline_results.append(metrics)

# Convert results to DataFrame for clarity
baseline_df = pd.DataFrame(baseline_results).set_index("model")

baseline_df

## 8. Supervised Fine-Tuning Strategies

In this section, supervised fine-tuning is performed using two different strategies
for each selected transformer model:

1. **Full Fine-Tuning**  
   All model parameters are updated during training.

2. **Low-Rank Adaptation (LoRA)**  
   The base model parameters are frozen, and only a small number of trainable
   low-rank adapter parameters are introduced and trained.

Both strategies are applied to DistilBERT and RoBERTa using identical training
hyperparameters. This controlled setup enables a fair comparison between:

- Pretrained baseline performance (Section 7)
- Full supervised fine-tuning
- Parameter-efficient fine-tuning via LoRA

The best-performing checkpoint from each fine-tuning strategy is saved for
subsequent evaluation and comparison.
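
To give a sense of scale for the LoRA strategy described above: for a single weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead of the full matrix. The sketch below counts trainable parameters for one 768 x 768 projection (768 is the hidden size of both base checkpoints) at rank 8. The helper `lora_param_counts` is hypothetical, biases are ignored, and the figures cover a single adapted matrix rather than a whole model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out              # every entry of W is updated
    lora = rank * (d_in + d_out)     # entries of A (rank x d_in) and B (d_out x rank)
    return full, lora


full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (r=8):       {lora:,} parameters ({lora / full:.2%} of full)")
```

The count grows linearly with the rank, so the rank is the main lever for trading adapter capacity against training cost.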

### 8.1 DistilBERT Fine-Tuning

This subsection fine-tunes DistilBERT using full fine-tuning and LoRA-based
parameter-efficient fine-tuning. Training metrics and best checkpoints are saved
for subsequent evaluation and comparison.

In [None]:
"""
Supervised fine-tuning of DistilBERT with and without LoRA.

This cell:
- Performs full fine-tuning of DistilBERT
- Performs LoRA-based fine-tuning of DistilBERT
- Saves best checkpoints and training metrics to disk
"""

import json
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from peft import LoraConfig, get_peft_model


def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for classification.

    Args:
        eval_pred (tuple): Tuple containing logits and true labels.

    Returns:
        dict: Dictionary of evaluation metrics.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    accuracy = accuracy_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

# --------------------------------------------------
# training hyperparameters
# --------------------------------------------------
common_args = dict(
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    fp16=False,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1_macro",
    greater_is_better=True,
    save_total_limit=1,        
    save_only_model=True,       
    logging_strategy="epoch",  
    report_to="none"
)

# --------------------------------------------------
# Full fine-tuning
# --------------------------------------------------
distilbert_full = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4
)

full_args = TrainingArguments(
    output_dir="models/distilbert_full",
    **common_args
)

trainer_full = Trainer(
    model=distilbert_full,
    args=full_args,
    train_dataset=tokenized_datasets["distilbert"]["train"],
    eval_dataset=tokenized_datasets["distilbert"]["test"],
    tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased"),
    compute_metrics=compute_metrics
)

train_result_full = trainer_full.train()
trainer_full.save_model("models/distilbert_full/best_model")

# Save training metrics
with open("models/distilbert_full/train_metrics.json", "w") as f:
    json.dump(train_result_full.metrics, f, indent=2)

# -----------------------------------------------------
# LoRA fine-tuning (DistilBERT-specific configuration)
# ----------------------------------------------------
distilbert_lora = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_lin", "v_lin"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

distilbert_lora = get_peft_model(distilbert_lora, lora_config)

lora_args = TrainingArguments(
    output_dir="models/distilbert_lora",
    **common_args
)

trainer_lora = Trainer(
    model=distilbert_lora,
    args=lora_args,
    train_dataset=tokenized_datasets["distilbert"]["train"],
    eval_dataset=tokenized_datasets["distilbert"]["test"],
    tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased"),
    compute_metrics=compute_metrics
)

train_result_lora = trainer_lora.train()
trainer_lora.save_model("models/distilbert_lora/best_model")

# Save training metrics
with open("models/distilbert_lora/train_metrics.json", "w") as f:
    json.dump(train_result_lora.metrics, f, indent=2)

### 8.2 RoBERTa Fine-Tuning

This subsection fine-tunes RoBERTa using full fine-tuning and LoRA-based
parameter-efficient fine-tuning. Training metrics and best checkpoints are saved
for direct comparison with DistilBERT and baseline results.

In [None]:
"""
Supervised fine-tuning of RoBERTa with and without LoRA.

This cell:
- Performs full fine-tuning of RoBERTa
- Performs LoRA-based fine-tuning of RoBERTa
- Saves best checkpoints and training metrics to disk
"""

import json
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from peft import LoraConfig, get_peft_model


def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for classification.

    Args:
        eval_pred (tuple): Tuple containing logits and true labels.

    Returns:
        dict: Dictionary of evaluation metrics.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    accuracy = accuracy_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

# --------------------------------------------------
# training hyperparameters
# --------------------------------------------------
common_args = dict(
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    bf16=True,
    fp16=False,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1_macro",
    greater_is_better=True,
    save_total_limit=1,        
    save_only_model=True,       
    logging_strategy="epoch",  
    report_to="none"
)

# --------------------------------------------------
# Full fine-tuning
# --------------------------------------------------
roberta_full = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=4
)

full_args = TrainingArguments(
    output_dir="models/roberta_full",
    **common_args
)

trainer_full = Trainer(
    model=roberta_full,
    args=full_args,
    train_dataset=tokenized_datasets["roberta"]["train"],
    eval_dataset=tokenized_datasets["roberta"]["test"],
    tokenizer=AutoTokenizer.from_pretrained("roberta-base"),
    compute_metrics=compute_metrics
)

train_result_full = trainer_full.train()
trainer_full.save_model("models/roberta_full/best_model")

with open("models/roberta_full/train_metrics.json", "w") as f:
    json.dump(train_result_full.metrics, f, indent=2)

# --------------------------------------------------
# LoRA fine-tuning (RoBERTa-specific configuration)
# --------------------------------------------------
roberta_lora = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=4
)

lora_config_roberta = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

roberta_lora = get_peft_model(roberta_lora, lora_config_roberta)

lora_args = TrainingArguments(
    output_dir="models/roberta_lora",
    **common_args
)

trainer_lora = Trainer(
    model=roberta_lora,
    args=lora_args,
    train_dataset=tokenized_datasets["roberta"]["train"],
    eval_dataset=tokenized_datasets["roberta"]["test"],
    tokenizer=AutoTokenizer.from_pretrained("roberta-base"),
    compute_metrics=compute_metrics
)

train_result_lora = trainer_lora.train()
trainer_lora.save_model("models/roberta_lora/best_model")

with open("models/roberta_lora/train_metrics.json", "w") as f:
    json.dump(train_result_lora.metrics, f, indent=2)

## 9. Ensemble Model Evaluation

While individual fine-tuned models already demonstrate substantial performance
improvements over the pretrained baseline, ensemble methods can further improve
robustness and generalisation by combining complementary model predictions.

In this section, probability-level ensembling is applied to the four fine-tuned
models produced in Section 8:

- DistilBERT (full fine-tuning)
- DistilBERT (LoRA)
- RoBERTa (full fine-tuning)
- RoBERTa (LoRA)

Predicted class probabilities are combined as weighted averages, and
ensemble performance is evaluated using accuracy, macro-averaged precision,
recall, and F1-score. The objective is to assess whether ensembles outperform
individual models and to identify which model combinations contribute most
effectively to performance gains.
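
A minimal sketch of probability-level ensembling, assuming each model has already produced one probability vector per sample. The function `weighted_ensemble` and the toy probabilities are illustrative assumptions; weights are normalised to sum to one before averaging.

```python
def weighted_ensemble(prob_sets, weights):
    """Average per-model class probabilities with normalised weights.

    prob_sets: one list of per-sample probability vectors for each model.
    Returns the predicted class index for every sample.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    preds = []
    for sample_probs in zip(*prob_sets):        # same sample across models
        num_classes = len(sample_probs[0])
        combined = [
            sum(w * p[c] for w, p in zip(norm, sample_probs))
            for c in range(num_classes)
        ]
        preds.append(combined.index(max(combined)))
    return preds


# Toy probabilities for two samples over four classes (made-up values).
model_a = [[0.60, 0.20, 0.10, 0.10], [0.10, 0.20, 0.40, 0.30]]
model_b = [[0.50, 0.30, 0.10, 0.10], [0.10, 0.10, 0.20, 0.60]]

equal_preds = weighted_ensemble([model_a, model_b], weights=[1, 1])
skewed_preds = weighted_ensemble([model_a, model_b], weights=[9, 1])
print(equal_preds, skewed_preds)
```

In this toy case the two models disagree on the second sample, and the chosen weights decide which prediction wins. This is why the weight search matters when the models make different errors.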

In [None]:
"""
Ensemble evaluation using probability-level model combination.

This cell:
- Loads the best fine-tuned checkpoints from disk
- Computes class probability distributions on the test set
- Evaluates all model combinations using weighted averaging
- Identifies the best-performing ensemble based on macro F1-score
"""

import torch
import numpy as np
import itertools
import pandas as pd
from torch.nn.functional import softmax
from peft import PeftModel
from transformers import AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# --------------------------------------------------
# Device configuration
# --------------------------------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --------------------------------------------------
# Load best fine-tuned models
# --------------------------------------------------
ensemble_models = {
    "DistilBERT_Full": "models/distilbert_full/best_model",
    "DistilBERT_LoRA": "models/distilbert_lora/best_model",
    "RoBERTa_Full": "models/roberta_full/best_model",
    "RoBERTa_LoRA": "models/roberta_lora/best_model",
}

loaded_models = {}

for name, path in ensemble_models.items():

    if "LoRA" in name:
        # Load base model first
        if "DistilBERT" in name:
            base_checkpoint = "distilbert-base-uncased"
        else:
            base_checkpoint = "roberta-base"

        base_model = AutoModelForSequenceClassification.from_pretrained(
            base_checkpoint,
            num_labels=4
        )

        model = PeftModel.from_pretrained(base_model, path)

    else:
        # Full fine-tuned model
        model = AutoModelForSequenceClassification.from_pretrained(path)

    model.to(device)
    model.eval()
    loaded_models[name] = model

# --------------------------------------------------
# Compute probability distributions for each model
# --------------------------------------------------
def get_model_probabilities(model, dataset, batch_size=32):
    """
    Compute predicted class probabilities for an entire dataset.

    Args:
        model (PreTrainedModel): Fine-tuned classification model.
        dataset (datasets.Dataset): Tokenised dataset split.
        batch_size (int): Batch size for inference.

    Returns:
        np.ndarray: Array of shape (N, num_classes) with class probabilities.
    """
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=False, pin_memory=True
    )

    all_probs = []

    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            logits = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            ).logits

            probs = softmax(logits, dim=-1).cpu().numpy()
            all_probs.append(probs)

    return np.vstack(all_probs)

# Compute probabilities for each model
model_probabilities = {
    name: get_model_probabilities(model, tokenized_datasets[
        "distilbert" if "DistilBERT" in name else "roberta"
    ]["test"])
    for name, model in loaded_models.items()
}

true_labels = np.array(mini_dataset["test"]["label"])

# --------------------------------------------------
# Ensemble evaluation
# --------------------------------------------------
def evaluate_ensemble(model_subset, weights):
    """
    Evaluate a weighted ensemble of models.

    Args:
        model_subset (tuple[str]): Selected model names.
        weights (np.ndarray): Normalised ensemble weights.

    Returns:
        dict: Evaluation metrics.
    """
    combined_prob = np.zeros_like(model_probabilities[model_subset[0]])

    for model_name, w in zip(model_subset, weights):
        combined_prob += w * model_probabilities[model_name]

    preds = combined_prob.argmax(axis=1)

    return {
        "accuracy": accuracy_score(true_labels, preds),
        "precision_macro": precision_score(true_labels, preds, average="macro"),
        "recall_macro": recall_score(true_labels, preds, average="macro"),
        "f1_macro": f1_score(true_labels, preds, average="macro"),
    }

# --------------------------------------------------
# Search over ensemble combinations
# --------------------------------------------------
model_names = list(model_probabilities.keys())

combinations_list = (
    list(itertools.combinations(model_names, 2)) +
    list(itertools.combinations(model_names, 3)) +
    [tuple(model_names)]
)

weight_grid = np.arange(0.1, 1.01, 0.1)

ensemble_results = []

for subset in combinations_list:
    best_f1 = -1
    best_weights = None
    best_metrics = None

    for weights in itertools.product(weight_grid, repeat=len(subset)):
        weights = np.array(weights)
        if weights.sum() == 0 or weights.min() < 0.05:
            continue

        weights = weights / weights.sum()
        metrics = evaluate_ensemble(subset, weights)

        if metrics["f1_macro"] > best_f1:
            best_f1 = metrics["f1_macro"]
            best_weights = weights
            best_metrics = metrics

    ensemble_results.append({
        "models": subset,
        "weights": best_weights,
        **best_metrics
    })

ensemble_df = pd.DataFrame(ensemble_results)
ensemble_df.sort_values("f1_macro", ascending=False)

In [None]:
"""
Persist ensemble evaluation results to disk.

This cell:
- Saves the full ensemble results table
- Saves the best-performing ensemble configuration
"""

from pathlib import Path

# Ensure results directory exists
results_dir = Path("results")
results_dir.mkdir(parents=True, exist_ok=True)

# Save full ensemble results
ensemble_df.to_csv(results_dir / "ensemble_results.csv", index=False)
ensemble_df.to_json(
    results_dir / "ensemble_results.json",
    orient="records",
    indent=2
)

# Extract and save best ensemble only
best_ensemble = ensemble_df.sort_values(
    "f1_macro", ascending=False
).iloc[0]

best_ensemble.to_json(
    results_dir / "best_ensemble.json",
    indent=2
)

print("Ensemble results saved to disk.")

## 10. Fine-Tuned Model Evaluation and Error Analysis

This section evaluates and compares the performance of all fine-tuned models
and the best-performing ensemble. Quantitative metrics and diagnostic analyses
are used to assess predictive quality, robustness, and error characteristics.

The analysis compares results at three levels:
- Individual fine-tuned models
- The best-performing single model
- Performance gains achieved through ensembling

### 10.1 Evaluation of Fine-Tuned Models

This subsection evaluates all fine-tuned models using their respective best
checkpoints saved during training. Performance is assessed on the test set
using consistent evaluation metrics, enabling direct comparison between:

- DistilBERT with full fine-tuning
- DistilBERT with LoRA fine-tuning
- RoBERTa with full fine-tuning
- RoBERTa with LoRA fine-tuning

These results form the basis for selecting the best-performing single model
in the subsequent subsection.
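
As a rough illustration of what macro averaging means in the four-class setting, the sketch below computes macro F1 in plain Python on invented toy labels. The function name `macro_f1` and the toy data are assumptions made for illustration only; the reported figures in this notebook come from scikit-learn, not from this helper. Undefined precision or recall is treated as 0, mirroring `zero_division=0`.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    f1_scores = []
    for c in range(num_classes):
        # Per-class true positives, false positives, and false negatives
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    # Every class contributes equally, regardless of its support
    return sum(f1_scores) / num_classes


# Invented toy labels (0=World, 1=Sports, 2=Business, 3=Sci/Tech)
toy_true = [0, 0, 1, 1, 2, 2, 3, 3]
toy_pred = [0, 0, 1, 2, 2, 2, 3, 1]
toy_macro_f1 = macro_f1(toy_true, toy_pred, num_classes=4)
print(f"Macro F1 on toy data: {toy_macro_f1:.4f}")
```

Because each class carries equal weight, a model that performs poorly on a single category is penalised even when overall accuracy remains high.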

In [None]:
"""
Evaluation of best fine-tuned checkpoints.

This cell:
- Loads best saved checkpoints for all fine-tuned models
- Evaluates them on the test set
- Produces a unified comparison table of evaluation metrics
"""

from transformers import AutoModelForSequenceClassification, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from peft import PeftModel
import pandas as pd


def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for classification.

    Args:
        eval_pred (tuple): Tuple containing logits and true labels.

    Returns:
        dict: Dictionary of evaluation metrics.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0
    )
    accuracy = accuracy_score(labels, predictions)

    return {
        "accuracy": accuracy,
        "precision_macro": precision,
        "recall_macro": recall,
        "f1_macro": f1,
    }

fine_tuned_checkpoints = {
    "DistilBERT_Full": {
        "path": "models/distilbert_full/best_model",
        "base": "distilbert-base-uncased",
        "type": "full",
        "dataset": "distilbert",
    },
    "DistilBERT_LoRA": {
        "path": "models/distilbert_lora/best_model",
        "base": "distilbert-base-uncased",
        "type": "lora",
        "dataset": "distilbert",
    },
    "RoBERTa_Full": {
        "path": "models/roberta_full/best_model",
        "base": "roberta-base",
        "type": "full",
        "dataset": "roberta",
    },
    "RoBERTa_LoRA": {
        "path": "models/roberta_lora/best_model",
        "base": "roberta-base",
        "type": "lora",
        "dataset": "roberta",
    },
}

evaluation_rows = []

for model_name, cfg in fine_tuned_checkpoints.items():

    if cfg["type"] == "lora":
        base_model = AutoModelForSequenceClassification.from_pretrained(
            cfg["base"],
            num_labels=4
        )
        model = PeftModel.from_pretrained(base_model, cfg["path"])
    else:
        model = AutoModelForSequenceClassification.from_pretrained(cfg["path"])

    trainer = Trainer(
        model=model,
        compute_metrics=compute_metrics
    )

    metrics = trainer.evaluate(
        tokenized_datasets[cfg["dataset"]]["test"]
    )

    evaluation_rows.append({
        "Model": model_name,
        "eval_accuracy": metrics["eval_accuracy"],
        "eval_precision_macro": metrics["eval_precision_macro"],
        "eval_recall_macro": metrics["eval_recall_macro"],
        "eval_f1_macro": metrics["eval_f1_macro"],
    })

fine_tuned_eval_df = (
    pd.DataFrame(evaluation_rows)
    .set_index("Model")
    .sort_values("eval_f1_macro", ascending=False)
)

fine_tuned_eval_df

### 10.2 Ensemble Performance Evaluation

This subsection evaluates the performance of ensemble models constructed from
the fine-tuned transformers. The ensemble combines model predictions at the
probability level using weighted averaging.

Performance is reported using the same evaluation metrics as individual models
to allow direct comparison between single-model fine-tuning and ensemble-based
approaches.
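
The weighted averaging step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `weighted_average_probs`, `argmax`, and the toy probability tables are hypothetical and invented here, whereas the notebook's own implementation operates on NumPy arrays of model outputs.

```python
def weighted_average_probs(prob_tables, weights):
    """Combine per-model class probabilities using normalised weights."""
    total = sum(weights)
    norm_weights = [w / total for w in weights]  # weights now sum to 1

    n_samples = len(prob_tables[0])
    n_classes = len(prob_tables[0][0])
    combined = [[0.0] * n_classes for _ in range(n_samples)]

    for table, w in zip(prob_tables, norm_weights):
        for i, row in enumerate(table):
            for j, p in enumerate(row):
                combined[i][j] += w * p
    return combined


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Invented probabilities for two samples from two hypothetical models
probs_a = [[0.6, 0.3, 0.05, 0.05], [0.2, 0.5, 0.2, 0.1]]
probs_b = [[0.2, 0.7, 0.05, 0.05], [0.1, 0.2, 0.6, 0.1]]

combined = weighted_average_probs([probs_a, probs_b], weights=[3, 1])
toy_preds = [argmax(row) for row in combined]
print(toy_preds)
```

In this toy case the second model alone would predict classes 1 and 2, but the larger weight on the first model pulls both ensemble predictions towards its choices.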

In [None]:
"""
Evaluation of ensemble model performance.

This cell:
- Loads previously computed ensemble results from disk
- Identifies and presents the best-performing ensemble configuration
"""

import pandas as pd

# Load ensemble evaluation results
ensemble_results = pd.read_csv("results/ensemble_results.csv")

# Sort by macro F1-score
ensemble_results_sorted = ensemble_results.sort_values(
    "f1_macro", ascending=False
)

# Select and display the best-performing ensemble configuration
best_ensemble = ensemble_results_sorted.iloc[0]
best_ensemble

### 10.3 Error Analysis and Confusion Matrix Evaluation

This subsection performs detailed error analysis for all fine-tuned models and
the ensemble approach. Confusion matrices are used to visualise class-level
performance and systematic misclassification patterns.
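
To make the reading of these matrices concrete, the following sketch builds a confusion matrix and its row-normalised form in plain Python. The helper names and toy labels are assumptions introduced for illustration; the cells below use `sklearn.metrics.confusion_matrix` instead.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def row_normalise(cm):
    """Divide each row by its total so entries are fractions of the true class."""
    normalised = []
    for row in cm:
        total = sum(row)
        normalised.append([c / total if total else 0.0 for c in row])
    return normalised


# Invented toy labels (0=World, 1=Sports, 2=Business, 3=Sci/Tech)
toy_true = [0, 0, 0, 1, 1, 2, 2, 3]
toy_pred = [0, 0, 2, 1, 1, 2, 3, 3]

toy_cm = confusion_counts(toy_true, toy_pred, num_classes=4)
toy_cm_norm = row_normalise(toy_cm)

for counts in toy_cm:
    print(counts)
```

Each diagonal entry of the normalised matrix is the recall of that class, and each off-diagonal entry is the share of that class sent to a specific wrong category.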

In [None]:
"""
Generate predictions on the test set for all fine-tuned models and the ensemble.

This cell:
- Runs inference for each fine-tuned model
- Stores true labels and predicted labels
- Prepares predictions for confusion matrix analysis
"""

import numpy as np
import torch
from torch.nn.functional import softmax


# True labels
y_true = np.array(mini_dataset["test"]["label"])

model_predictions = {}

# Generate predictions for individual models
for model_name, model in loaded_models.items():
    model.eval()  # disable dropout so that predictions are deterministic

    dataset_key = "distilbert" if "DistilBERT" in model_name else "roberta"
    dataset = tokenized_datasets[dataset_key]["test"]

    loader = torch.utils.data.DataLoader(
        dataset, batch_size=32, shuffle=False
    )

    preds = []

    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            logits = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            ).logits

            preds.append(torch.argmax(logits, dim=-1).cpu().numpy())

    model_predictions[model_name] = np.concatenate(preds)

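# Ensemble predictions (best ensemble configuration).
# The CSV round-trip stores the model tuple as a string, so it is parsed
# back into a Python object rather than split by hand.
import ast

best_ensemble_config = list(ast.literal_eval(best_ensemble["models"]))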

best_weights = np.fromstring(
    best_ensemble["weights"].strip("[]"),
    sep=" "
)

ensemble_probs = np.zeros_like(
    model_probabilities[best_ensemble_config[0]]
)

for model_name, w in zip(best_ensemble_config, best_weights):
    ensemble_probs += w * model_probabilities[model_name]

model_predictions["Ensemble"] = ensemble_probs.argmax(axis=1)

In [None]:
"""
Plot confusion matrices for DistilBERT and RoBERTa fine-tuning strategies.

This cell:
- Plots Full vs LoRA confusion matrices side by side
- Saves figures as PNG files
"""


import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import confusion_matrix
from pathlib import Path

# Configuration
label_order = ["World", "Sports", "Business", "Sci/Tech"]
results_dir = Path("results/confusion_matrices")
results_dir.mkdir(parents=True, exist_ok=True)

# Set clean style
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Arial', 'DejaVu Sans']

def plot_clean_confusion_matrix(cm, ax, title, cmap='Blues'):
    """
    Plot a confusion matrix with counts and percentages.
    """
    # Calculate percentages
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    
    # Create annotations combining counts and percentages
    annotations = np.empty_like(cm, dtype=object)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            count = cm[i, j]
            percent = cm_percent[i, j]
            annotations[i, j] = f'{count}\n({percent:.1f}%)'
    
    # Plot heatmap
    sns.heatmap(
        cm,
        annot=annotations,
        fmt='',
        cmap=cmap,
        xticklabels=label_order,
        yticklabels=label_order,
        ax=ax,
        cbar_kws={'label': 'Count'},
        linewidths=1.5,
        linecolor='white',
        square=True,
        vmin=0,
        annot_kws={'fontsize': 11, 'weight': 'bold'},
        cbar=True
    )
    
    # Title styling
    ax.set_title(title, fontsize=15, fontweight='300', pad=12)
    
    # Axis labels
    ax.set_xlabel('Predicted Label', fontsize=12, fontweight='300', labelpad=8)
    ax.set_ylabel('True Label', fontsize=12, fontweight='300', labelpad=8)
    
    # Tick labels
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0, ha='center', fontsize=10)
    ax.set_yticklabels(ax.get_yticklabels(), rotation=0, va='center', fontsize=10)
    
    # Colorbar styling
    cbar = ax.collections[0].colorbar
    cbar.ax.tick_params(labelsize=9)
    cbar.set_label('Count', fontsize=11, fontweight='300')

# --------------------------------------------------
# DistilBERT confusion matrices
# --------------------------------------------------
y_pred_full = model_predictions["DistilBERT_Full"]
y_pred_lora = model_predictions["DistilBERT_LoRA"]

cm_full = confusion_matrix(y_true, y_pred_full)
cm_lora = confusion_matrix(y_true, y_pred_lora)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('DistilBERT Fine-Tuning Comparison', 
             fontsize=20, fontweight='500', y=1.01)

plot_clean_confusion_matrix(cm_full, axes[0], 
                           'Full Fine-Tuning', cmap='Blues')
plot_clean_confusion_matrix(cm_lora, axes[1], 
                           'LoRA Fine-Tuning', cmap='Oranges')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.subplots_adjust(wspace=0.2)
plt.savefig(results_dir / "distilbert_confusion_enhanced.png", 
            dpi=600, bbox_inches='tight', facecolor='white')
plt.show()

# --------------------------------------------------
# RoBERTa confusion matrices
# --------------------------------------------------
y_pred_full = model_predictions["RoBERTa_Full"]
y_pred_lora = model_predictions["RoBERTa_LoRA"]

cm_full = confusion_matrix(y_true, y_pred_full)
cm_lora = confusion_matrix(y_true, y_pred_lora)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('RoBERTa Fine-Tuning Comparison', 
             fontsize=20, fontweight='500', y=1.01)

plot_clean_confusion_matrix(cm_full, axes[0], 
                           'Full Fine-Tuning', cmap='Blues')
plot_clean_confusion_matrix(cm_lora, axes[1], 
                           'LoRA Fine-Tuning', cmap='Oranges')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.subplots_adjust(wspace=0.2)
plt.savefig(results_dir / "roberta_confusion_enhanced.png", 
            dpi=600, bbox_inches='tight', facecolor='white')
plt.show()

In [None]:
"""
Plot confusion matrix for the ensemble model and summarise error patterns.

This cell:
- Generates and saves the ensemble confusion matrix with clean styling
- Computes per-class error rates for qualitative analysis
"""
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from pathlib import Path

# Configuration
label_order = ["World", "Sports", "Business", "Sci/Tech"]
results_dir = Path("results/confusion_matrices")
results_dir.mkdir(parents=True, exist_ok=True)

# Set clean style
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = ['Arial', 'DejaVu Sans']

# Ensemble confusion matrix
cm_ensemble = confusion_matrix(y_true, model_predictions["Ensemble"])

# Calculate percentages
cm_percent = cm_ensemble.astype('float') / cm_ensemble.sum(axis=1)[:, np.newaxis] * 100

# Create annotations combining counts and percentages
annotations = np.empty_like(cm_ensemble, dtype=object)
for i in range(cm_ensemble.shape[0]):
    for j in range(cm_ensemble.shape[1]):
        count = cm_ensemble[i, j]
        percent = cm_percent[i, j]
        annotations[i, j] = f'{count}\n({percent:.1f}%)'

# Plot
fig, ax = plt.subplots(figsize=(8, 7))

sns.heatmap(
    cm_ensemble,
    annot=annotations,
    fmt='',
    cmap='Blues',
    xticklabels=label_order,
    yticklabels=label_order,
    ax=ax,
    cbar_kws={'label': 'Count'},
    linewidths=1.5,
    linecolor='white',
    square=True,
    vmin=0,
    annot_kws={'fontsize': 12, 'weight': 'bold'}
)

# Title and labels
ax.set_title('Confusion Matrix - Ensemble Model', 
             fontsize=20, fontweight='500', pad=15, y=1.01)
ax.set_xlabel('Predicted Label', fontsize=12, fontweight='300', labelpad=10)
ax.set_ylabel('True Label', fontsize=12, fontweight='300', labelpad=10)

# Tick labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, ha='center', fontsize=11)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, va='center', fontsize=11)

# Colorbar styling
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=10)
cbar.set_label('Count', fontsize=11, fontweight='300')

plt.tight_layout()
plt.subplots_adjust(wspace=0.2)
plt.savefig(results_dir / "ensemble_confusion_enhanced.png", 
            dpi=600, bbox_inches='tight', facecolor='white')
plt.show()

# --------------------------------------------------
# Per-class error analysis
# --------------------------------------------------
ensemble_errors = pd.DataFrame(
    cm_ensemble,
    index=label_order,
    columns=label_order
)

# Calculate error rates
error_rates = 1 - np.diag(cm_ensemble) / cm_ensemble.sum(axis=1)
error_summary = pd.Series(error_rates, index=label_order, name="Error Rate")

# Display summary
print("\n" + "="*50)
print("ENSEMBLE MODEL - ERROR ANALYSIS")
print("="*50)
print(f"\nOverall Accuracy: {(np.diag(cm_ensemble).sum() / cm_ensemble.sum()):.2%}")
print("\nPer-Class Error Rates:")
print("-"*50)
for label in label_order:
    error_rate = error_summary[label]
    accuracy = 1 - error_rate
    print(f"{label:12s}: {error_rate:6.2%} error  |  {accuracy:6.2%} accuracy")
print("="*50)

print("\n✓ Ensemble confusion matrix saved to:")
print(f"  {results_dir / 'ensemble_confusion_enhanced.png'}")

# Display error summary for further analysis
error_summary

### 10.4 Misclassification Case Analysis

This subsection performs a qualitative analysis of misclassified test examples
across all fine-tuned models and the ensemble. The objective is to identify
systematic error patterns, ambiguous cases, and domain overlaps that contribute
to incorrect predictions.

For each model, a subset of misclassified examples is extracted along with
the true label and predicted label. These cases are then organised into tables
to support structured comparison and discussion in the written report.
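
The extraction step amounts to pairing each test text with its true and predicted label and keeping only the mismatches. A minimal sketch, assuming a hypothetical helper `collect_misclassified` and placeholder texts invented for illustration (they are not drawn from AG News):

```python
label_names = ["World", "Sports", "Business", "Sci/Tech"]


def collect_misclassified(texts, y_true, y_pred, model_name):
    """Return one record per sample whose prediction differs from its label."""
    rows = []
    for text, t, p in zip(texts, y_true, y_pred):
        if t != p:
            rows.append({
                "model": model_name,
                "text": text,
                "true_label": label_names[t],
                "predicted_label": label_names[p],
            })
    return rows


# Placeholder texts and labels, invented for illustration only
toy_texts = [
    "placeholder headline A",
    "placeholder headline B",
    "placeholder headline C",
]
toy_rows = collect_misclassified(
    toy_texts, y_true=[0, 2, 3], y_pred=[0, 3, 3], model_name="ToyModel"
)
print(toy_rows)
```

Keeping the model name in every record lets the rows from all models be stacked into a single table and filtered later.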

In [None]:
"""
Extract misclassified examples for all fine-tuned models and the ensemble.

This cell:
- Reloads all fine-tuned models explicitly
- Generates predictions on the test set
- Extracts misclassified samples for each model
"""

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

label_names = ["World", "Sports", "Business", "Sci/Tech"]
y_true = np.array(mini_dataset["test"]["label"])

def predict_labels(model, dataset, batch_size=32):
    """
    Generate predicted labels for a dataset.
    """
    loader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=False
    )

    preds = []

    model.eval()
    with torch.no_grad():
        for batch in loader:
            logits = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device)
            ).logits
            preds.append(torch.argmax(logits, dim=-1).cpu().numpy())

    return np.concatenate(preds)

# Explicit model registry
model_registry = {
    "DistilBERT_Full": {
        "type": "full",
        "path": "models/distilbert_full/best_model",
        "base": "distilbert-base-uncased",
        "dataset": "distilbert",
    },
    "DistilBERT_LoRA": {
        "type": "lora",
        "path": "models/distilbert_lora/best_model",
        "base": "distilbert-base-uncased",
        "dataset": "distilbert",
    },
    "RoBERTa_Full": {
        "type": "full",
        "path": "models/roberta_full/best_model",
        "base": "roberta-base",
        "dataset": "roberta",
    },
    "RoBERTa_LoRA": {
        "type": "lora",
        "path": "models/roberta_lora/best_model",
        "base": "roberta-base",
        "dataset": "roberta",
    },
}

misclassified_rows = []

for model_name, cfg in model_registry.items():

    # Load model
    if cfg["type"] == "lora":
        base_model = AutoModelForSequenceClassification.from_pretrained(
            cfg["base"], num_labels=4
        )
        model = PeftModel.from_pretrained(base_model, cfg["path"])
    else:
        model = AutoModelForSequenceClassification.from_pretrained(cfg["path"])

    model.to(device)

    # Predict
    y_pred = predict_labels(
        model,
        tokenized_datasets[cfg["dataset"]]["test"]
    )

    # Collect misclassifications
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        if t != p:
            misclassified_rows.append({
                "model": model_name,
                "text": mini_dataset["test"][idx]["text"],
                "true_label": label_names[t],
                "predicted_label": label_names[p],
            })

# Add ensemble misclassifications
ensemble_preds = model_predictions["Ensemble"]

for idx, (t, p) in enumerate(zip(y_true, ensemble_preds)):
    if t != p:
        misclassified_rows.append({
            "model": "Ensemble",
            "text": mini_dataset["test"][idx]["text"],
            "true_label": label_names[t],
            "predicted_label": label_names[p],
        })

misclassified_df = pd.DataFrame(misclassified_rows)

misclassified_df["model"].value_counts()

In [None]:
"""
Select representative misclassification examples for qualitative analysis.

This cell:
- Creates one table per model
- Samples a fixed number of misclassified cases for each model
- Stores results in a dictionary of DataFrames
"""

# Number of examples per model to inspect
N_SAMPLES = 10

sampled_errors_per_model = {}

for model_name in misclassified_df["model"].unique():
    model_errors = misclassified_df[misclassified_df["model"] == model_name]

    sampled_errors = model_errors.sample(
        n=min(N_SAMPLES, len(model_errors)),
        random_state=42
    )

    sampled_errors_per_model[model_name] = sampled_errors

    print(f"\nMisclassified examples for {model_name}:")
    display(sampled_errors)


### 10.5 Comparative Error Patterns and Observations

This subsection analyses misclassification patterns across all fine-tuned models
and the ensemble to identify systematic errors and model-specific behaviour.
Rather than focusing on individual examples, the analysis aggregates errors to
reveal consistent class confusions, overlap in failure cases, and the extent to
which the ensemble mitigates errors made by single models.

The results in this subsection support higher-level observations about model
biases, domain overlap between news categories, and the benefits and limitations
of ensembling.
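
Both aggregate views reduce to comparisons of per-sample error flags. The sketch below illustrates the idea in plain Python; `error_overlap`, `corrected_by_ensemble`, and all toy predictions are assumptions made for this example. Unlike the cells that follow, the overlap count here covers individual models only.

```python
from collections import Counter


def error_overlap(individual_preds, y_true):
    """Count, for each sample, how many individual models predicted it wrongly."""
    wrong_counts = [
        sum(preds[i] != y_true[i] for preds in individual_preds.values())
        for i in range(len(y_true))
    ]
    return Counter(wrong_counts)


def corrected_by_ensemble(model_preds, ensemble_preds, y_true):
    """Return (errors made by the model, errors the ensemble gets right)."""
    errors = [i for i, (t, p) in enumerate(zip(y_true, model_preds)) if t != p]
    corrected = [i for i in errors if ensemble_preds[i] == y_true[i]]
    return len(errors), len(corrected)


# Invented toy predictions for two hypothetical models and an ensemble
toy_true = [0, 1, 2, 3, 0]
toy_models = {
    "model_a": [0, 1, 2, 0, 1],
    "model_b": [0, 2, 2, 0, 0],
}
toy_ensemble = [0, 1, 2, 0, 0]

overlap = error_overlap(toy_models, toy_true)
stats_a = corrected_by_ensemble(toy_models["model_a"], toy_ensemble, toy_true)
print(dict(overlap), stats_a)
```

Samples that every model gets wrong are the ones weighted averaging is least likely to fix, which is why the overlap distribution is examined alongside the correction rate.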

In [None]:
"""
Aggregate class-to-class confusion patterns for each model.

This cell:
- Computes normalised confusion matrices
- Converts them into long-form tables
- Highlights dominant misclassification directions
"""


from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np

label_names = ["World", "Sports", "Business", "Sci/Tech"]

per_model_confusions = {}

for model_name, preds in model_predictions.items():
    cm = confusion_matrix(
        y_true,
        preds,
        labels=range(len(label_names))
    )

    # Normalise by true class counts
    cm_norm = cm / cm.sum(axis=1, keepdims=True)

    rows = []
    for i, true_label in enumerate(label_names):
        for j, pred_label in enumerate(label_names):
            if i != j:
                rows.append({
                    "true_label": true_label,
                    "predicted_label": pred_label,
                    "error_rate": cm_norm[i, j]
                })

    df = (
        pd.DataFrame(rows)
        .sort_values("error_rate", ascending=False)
        .reset_index(drop=True)
    )

    per_model_confusions[model_name] = df

# Display tables separately for each model
for model_name, df in per_model_confusions.items():
    print(f"\nModel: {model_name}")
    display(df.head(5))

In [None]:
"""
Analyse overlap in misclassified samples between models.

This cell:
- Computes which test samples are misclassified by each model
- Quantifies shared and unique error cases
- Evaluates whether the ensemble reduces shared errors
"""

error_flags = {}

for model_name, preds in model_predictions.items():
    error_flags[model_name] = (preds != y_true)

error_flag_df = pd.DataFrame(error_flags)

# Count number of models misclassifying each sample
error_flag_df["num_models_misclassified"] = error_flag_df.sum(axis=1)

# Summary statistics
overlap_summary = (
    error_flag_df["num_models_misclassified"]
    .value_counts()
    .sort_index()
    .rename("num_samples")
)

display(
    overlap_summary
    .rename_axis("Number of models that misclassified the sample")
    .reset_index(name="Number of test samples")
    .style.hide(axis="index")
)

In [None]:
"""
Assess how often the ensemble corrects errors made by individual models.

This cell:
- Identifies samples misclassified by single models
- Checks whether the ensemble predicts them correctly
- Quantifies corrective impact of ensembling
"""

ensemble_correct = model_predictions["Ensemble"] == y_true

correction_stats = []

for model_name in model_predictions:
    if model_name == "Ensemble":
        continue

    model_wrong = model_predictions[model_name] != y_true
    corrected_by_ensemble = model_wrong & ensemble_correct

    correction_stats.append({
        "model": model_name,
        "errors_by_model": model_wrong.sum(),
        "errors_corrected_by_ensemble": corrected_by_ensemble.sum(),
        "percent_corrected": corrected_by_ensemble.sum() / model_wrong.sum()
        if model_wrong.sum() > 0 else 0.0
    })

ensemble_correction_df = pd.DataFrame(correction_stats)

ensemble_correction_df

## 11. Training Dynamics, Efficiency, and Representation Analysis

This section analyses how different fine-tuning strategies behave during training,
how computationally efficient they are, and how learned representations differ
across models. The analysis is divided into four parts:

1. Training loss curves  
2. Evaluation loss curves  
3. Training efficiency comparison  
4. Embedding space visualisation  

Together, these analyses provide deeper insight into model convergence,
generalisation behaviour, efficiency trade-offs, and representational learning.
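
The loss curves below are read from the `log_history` list in each checkpoint's `trainer_state.json`. The sketch that follows shows how training and evaluation entries can be separated. It assumes the entry layout that the plotting cells rely on (an `epoch` key plus either `loss` or `eval_loss`); the numbers are invented, and `split_log_history` is a hypothetical helper.

```python
import json


def split_log_history(state):
    """Separate (epoch, loss) pairs for training and evaluation entries."""
    train = [
        (entry["epoch"], entry["loss"])
        for entry in state["log_history"]
        if "loss" in entry and "epoch" in entry
    ]
    evals = [
        (entry["epoch"], entry["eval_loss"])
        for entry in state["log_history"]
        if "eval_loss" in entry and "epoch" in entry
    ]
    return train, evals


# Invented log contents that mimic the assumed structure
toy_state = json.loads("""
{
  "log_history": [
    {"epoch": 1.0, "loss": 0.9, "step": 100},
    {"epoch": 1.0, "eval_loss": 0.5, "step": 100},
    {"epoch": 2.0, "loss": 0.4, "step": 200},
    {"epoch": 2.0, "eval_loss": 0.45, "step": 200},
    {"epoch": 2.0, "train_loss": 0.65, "train_runtime": 12.3}
  ]
}
""")

toy_train, toy_eval = split_log_history(toy_state)
print(toy_train)
print(toy_eval)
```

Entries without a `loss` or `eval_loss` key, such as the summary record in the toy data, are skipped by both filters.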

In [None]:
"""
Plot training loss curves for all fine-tuned models.

This cell:
- Loads Trainer log history from disk
- Extracts training loss per epoch
- Plots loss curves for comparison
"""

import json
import matplotlib.pyplot as plt
from pathlib import Path

def load_training_loss(model_dir):
    """
    Load training loss history from a Trainer checkpoint directory.

    Args:
        model_dir (str): Path to model directory.

    Returns:
        list[tuple]: (epoch, loss) pairs.
    """
    state_path = Path(model_dir) / "trainer_state.json"

    with open(state_path, "r") as f:
        state = json.load(f)

    return [
        (log["epoch"], log["loss"])
        for log in state["log_history"]
        if "loss" in log and "epoch" in log
    ]

# --------------------------------------------------
# Model paths
# --------------------------------------------------
model_dirs = {
    "DistilBERT Full": "models/distilbert_full/checkpoint-1000",
    "DistilBERT LoRA": "models/distilbert_lora/checkpoint-1000",
    "RoBERTa Full": "models/roberta_full/checkpoint-1000",
    "RoBERTa LoRA": "models/roberta_lora/checkpoint-1000",
}

# --------------------------------------------------
# Figure with two grids
# --------------------------------------------------
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

# DistilBERT panel
for name, style in {
    "DistilBERT Full": {"color": "#DE8F05", "linestyle": "-", "marker": "o"},
    "DistilBERT LoRA": {"color": "#0173B2", "linestyle": "--", "marker": "s"},
}.items():
    epochs, losses = zip(*load_training_loss(model_dirs[name]))
    axes[0].plot(epochs, losses, label=name, linewidth=2.5, markersize=7, **style)

axes[0].set_title("DistilBERT Training Loss", fontsize=15, fontweight="500", y=1.05)
axes[0].set_xlabel("Epoch", fontsize=12)
axes[0].set_ylabel("Training Loss", fontsize=12)
axes[0].grid(True, linestyle="--", alpha=0.3)
axes[0].tick_params(axis="y", labelleft=True)
axes[0].legend()

# RoBERTa panel
for name, style in {
    "RoBERTa Full": {"color": "#DE8F05", "linestyle": "-", "marker": "o"},
    "RoBERTa LoRA": {"color": "#0173B2", "linestyle": "--", "marker": "s"},
}.items():
    epochs, losses = zip(*load_training_loss(model_dirs[name]))
    axes[1].plot(epochs, losses, label=name, linewidth=2.5, markersize=7, **style)

axes[1].set_title("RoBERTa Training Loss", fontsize=15, fontweight="500", y=1.05)
axes[1].set_xlabel("Epoch", fontsize=12)
axes[1].set_ylabel("Training Loss", fontsize=12)
axes[1].grid(True, linestyle="--", alpha=0.3)
axes[1].tick_params(axis="y", labelleft=True)
axes[1].legend()

plt.tight_layout()
plt.subplots_adjust(wspace=0.2)
plt.show()

In [None]:
"""
Plot evaluation loss curves for all fine-tuned models.

This cell:
- Loads Trainer evaluation logs from disk
- Extracts evaluation loss per epoch
- Plots evaluation loss curves for comparison
"""

import json
import matplotlib.pyplot as plt
from pathlib import Path

def load_eval_loss(model_dir):
    """
    Load evaluation loss history from a Trainer checkpoint directory.

    Args:
        model_dir (str): Path to model directory.

    Returns:
        list[tuple]: (epoch, eval_loss) pairs.
    """
    state_path = Path(model_dir) / "trainer_state.json"

    with open(state_path, "r") as f:
        state = json.load(f)

    return [
        (log["epoch"], log["eval_loss"])
        for log in state["log_history"]
        if "eval_loss" in log and "epoch" in log
    ]

# --------------------------------------------------
# Model paths
# --------------------------------------------------
model_dirs = {
    "DistilBERT Full": "models/distilbert_full/checkpoint-1000",
    "DistilBERT LoRA": "models/distilbert_lora/checkpoint-1000",
    "RoBERTa Full": "models/roberta_full/checkpoint-1000",
    "RoBERTa LoRA": "models/roberta_lora/checkpoint-1000",
}

# --------------------------------------------------
# Figure with two grids
# --------------------------------------------------
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

# DistilBERT panel
for name, style in {
    "DistilBERT Full": {"color": "#DE8F05", "linestyle": "-", "marker": "o"},
    "DistilBERT LoRA": {"color": "#0173B2", "linestyle": "--", "marker": "s"},
}.items():
    epochs, losses = zip(*load_eval_loss(model_dirs[name]))
    axes[0].plot(epochs, losses, label=name, linewidth=2.5, markersize=7, **style)

axes[0].set_title("DistilBERT Evaluation Loss", fontsize=15, fontweight="500", y=1.05)
axes[0].set_xlabel("Epoch", fontsize=12)
axes[0].set_ylabel("Evaluation Loss", fontsize=12)
axes[0].grid(True, linestyle="--", alpha=0.3)
axes[0].tick_params(axis="y", labelleft=True)
axes[0].legend()

# RoBERTa panel
for name, style in {
    "RoBERTa Full": {"color": "#DE8F05", "linestyle": "-", "marker": "o"},
    "RoBERTa LoRA": {"color": "#0173B2", "linestyle": "--", "marker": "s"},
}.items():
    epochs, losses = zip(*load_eval_loss(model_dirs[name]))
    axes[1].plot(epochs, losses, label=name, linewidth=2.5, markersize=7, **style)

axes[1].set_title("RoBERTa Evaluation Loss", fontsize=15, fontweight="500", y=1.05)
axes[1].set_xlabel("Epoch", fontsize=12)
axes[1].set_ylabel("Evaluation Loss", fontsize=12)
axes[1].grid(True, linestyle="--", alpha=0.3)
axes[1].tick_params(axis="y", labelleft=True)
axes[1].legend()

plt.tight_layout()
plt.subplots_adjust(wspace=0.2)
plt.show()

In [None]:
"""
Compare training efficiency across fine-tuning strategies.

This cell:
- Loads saved training metrics
- Compares runtime, throughput, and FLOPs
"""

import json
import pandas as pd

metric_files = {
    "DistilBERT Full": "models/distilbert_full/train_metrics.json",
    "DistilBERT LoRA": "models/distilbert_lora/train_metrics.json",
    "RoBERTa Full": "models/roberta_full/train_metrics.json",
    "RoBERTa LoRA": "models/roberta_lora/train_metrics.json",
}

rows = []

for name, path in metric_files.items():
    with open(path, "r") as f:
        m = json.load(f)

    rows.append({
        "Model": name,
        "Train Runtime (s)": m.get("train_runtime"),
        "Samples / Second": m.get("train_samples_per_second"),
        "Total FLOPs": m.get("total_flos"),
        "Train Loss": m.get("train_loss"),
    })

efficiency_df = pd.DataFrame(rows).set_index("Model")
efficiency_df

In [None]:
"""
Embedding space comparison for DistilBERT using PCA.

This cell:
- Extracts CLS embeddings for DistilBERT Baseline, Full Fine-Tuning, and LoRA
- Fits PCA once on the baseline embeddings
- Projects all embeddings into the same PCA space
- Plots three subplots for direct comparison
"""

from transformers import AutoModelForSequenceClassification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import torch


baseline_distilbert = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4
).to(device)

# --------------------------------------------------
# Helper: extract CLS embeddings
# --------------------------------------------------
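def extract_embeddings(model, dataset, max_samples=400):
    """Return last-layer CLS embeddings and labels for up to `max_samples` examples."""
    model.eval()  # disable dropout so that embeddings are deterministic
    n_samples = min(max_samples, len(dataset))  # guard against small datasets
    loader = torch.utils.data.DataLoader(
        dataset.select(range(n_samples)),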
        batch_size=32,
        shuffle=False
    )

    embeddings = []
    labels = []

    with torch.no_grad():
        for batch in loader:
            outputs = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                output_hidden_states=True
            )
            cls = outputs.hidden_states[-1][:, 0, :].cpu().numpy()
            embeddings.append(cls)
            labels.extend(batch["label"].numpy())

    return np.vstack(embeddings), np.array(labels)

# --------------------------------------------------
# Extract embeddings
# --------------------------------------------------
emb_base, y = extract_embeddings(
    baseline_distilbert,
    tokenized_datasets["distilbert"]["test"]
)

emb_full, _ = extract_embeddings(
    loaded_models["DistilBERT_Full"],
    tokenized_datasets["distilbert"]["test"]
)

emb_lora, _ = extract_embeddings(
    loaded_models["DistilBERT_LoRA"],
    tokenized_datasets["distilbert"]["test"]
)

# --------------------------------------------------
# Fit PCA on baseline embeddings only
# --------------------------------------------------
scaler = StandardScaler()
emb_base_scaled = scaler.fit_transform(emb_base)

pca = PCA(n_components=2, random_state=42)
emb_base_2d = pca.fit_transform(emb_base_scaled)

# Project others into same PCA space
emb_full_2d = pca.transform(scaler.transform(emb_full))
emb_lora_2d = pca.transform(scaler.transform(emb_lora))

# --------------------------------------------------
# Plot styling
# --------------------------------------------------
from matplotlib.colors import ListedColormap

colors = ["#0173B2", "#DE8F05", "#CC78BC", "#CA9161"]
cmap = ListedColormap(colors)

fig, axes = plt.subplots(1, 3, figsize=(20, 6.5), sharex=True, sharey=True)
fig.patch.set_facecolor("#ffffff")

titles = [
    "Baseline (No Fine-Tuning)",
    "Full Fine-Tuning",
    "LoRA Fine-Tuning"
]

data = [emb_base_2d, emb_full_2d, emb_lora_2d]

for ax, emb_2d, title in zip(axes, data, titles):
    ax.scatter(
        emb_2d[:, 0], emb_2d[:, 1],
        c=y, cmap=cmap, s=100, alpha=0.7,
        edgecolors="white", linewidth=0.8,
        rasterized=True
    )
    
    ax.set_title(title, fontsize=18, fontweight="300", pad=15)
    ax.set_xlabel("PCA Component 1", fontsize=16, fontweight="150", labelpad=14)
    ax.set_ylabel("PCA Component 2", fontsize=16, fontweight="150")
    ax.grid(True, linestyle="--", alpha=0.2, linewidth=0.8)
    ax.tick_params(axis="y", labelleft=True)
    ax.set_axisbelow(True)

    for spine in ax.spines.values():
        spine.set_edgecolor("#e0e0e0")
        spine.set_linewidth(1.5)

# --------------------------------------------------
# Variance explained annotation (global)
# --------------------------------------------------
var_pc1 = pca.explained_variance_ratio_[0] * 100
var_pc2 = pca.explained_variance_ratio_[1] * 100

# Main title
fig.suptitle(
    "DistilBERT Embedding Space Comparison (PCA)",
    fontsize=20, fontweight="500", y=1.07
)

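# Subtitle (variance explained; PCA was fit on the baseline embeddings only,
# so these ratios describe the baseline panel)
fig.text(
    0.5, 0.98,
    f"PC1: {var_pc1:.1f}% variance  |   PC2: {var_pc2:.1f}% variance (baseline fit)",
    ha="center", fontsize=18, fontweight="150"
)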

plt.tight_layout()
plt.subplots_adjust(wspace=0.2)
plt.show()

#### Interpretation (for report)
Figure X compares the DistilBERT CLS embedding space under the three training strategies. All panels share the PCA axes fitted on the baseline embeddings, so differences between panels reflect changes in the representations rather than in the projection.

In the baseline model, embeddings form a compact and largely unstructured cluster, reflecting the absence of task-specific representation learning. After full fine-tuning, the embedding space becomes elongated and structured, indicating that the model has learned discriminative directions relevant to news category classification. LoRA fine-tuning produces an intermediate geometry: embeddings are more dispersed than in the baseline, which suggests task adaptation, but less structured than after full fine-tuning, consistent with its slightly lower classification performance.

These results suggest that while LoRA captures much of the task-specific structure, full fine-tuning reshapes the latent representation space more strongly.
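
The visual comparison can be complemented by a numeric measure of how well the classes separate in embedding space. The sketch below computes the ratio of between-class to within-class scatter with NumPy. It runs on synthetic data, not on the notebook's embeddings, and `class_separation` is a hypothetical helper introduced only for illustration; other measures (for example a silhouette score) would serve the same purpose.

```python
import numpy as np


def class_separation(embeddings, labels):
    """Ratio of between-class to within-class scatter (higher = better separated)."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    overall_mean = embeddings.mean(axis=0)

    between, within = 0.0, 0.0
    for cls in np.unique(labels):
        members = embeddings[labels == cls]
        centroid = members.mean(axis=0)
        # Spread of class centroids around the global mean, weighted by class size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Spread of points around their own class centroid
        within += np.sum((members - centroid) ** 2)
    return between / within


# Synthetic stand-ins for embeddings (assumption: 4 classes, 8 dimensions)
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
unstructured = rng.normal(size=(200, 8))                 # no class structure
centers = rng.normal(scale=5.0, size=(4, 8))
structured = centers[labels] + rng.normal(size=(200, 8))  # one cluster per class

print("unstructured:", class_separation(unstructured, labels))
print("structured:  ", class_separation(structured, labels))
```

Applied to the full-dimensional embeddings of each model, rather than to the 2-D projection, a measure like this avoids conclusions that depend on which two principal components happen to be plotted.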

## 12. Deployment of the Ensemble Model

This section deploys the ensemble-based news classification system as an
interactive web interface using Gradio. The deployed system allows users to
input a news headline or short description and receive a predicted category
along with class probability scores.

The ensemble combines multiple fine-tuned transformer models at the probability
level, leveraging their complementary strengths to produce more robust and
reliable predictions.
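
Combining models at the probability level amounts to a weighted average of their class-probability vectors (often called soft voting). The following is a minimal sketch with made-up probabilities rather than real model outputs. `soft_vote` is a hypothetical helper, and normalising the weights is an assumption made here so that the combined vector sums to 1.

```python
import numpy as np


def soft_vote(prob_list, weights):
    """Weighted average of per-model class-probability vectors."""
    probs = np.asarray(prob_list, dtype=float)   # shape: (n_models, n_classes)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalise so the result sums to 1
    return weights @ probs                       # shape: (n_classes,)


label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Illustrative probabilities (made up, not model outputs)
model_a = [0.70, 0.10, 0.15, 0.05]
model_b = [0.40, 0.05, 0.50, 0.05]

combined = soft_vote([model_a, model_b], weights=[2.0, 1.0])
print(label_names[int(np.argmax(combined))], combined.round(3))
```

In this example the second model alone would predict Business, but the more heavily weighted first model shifts the combined prediction to World.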

In [None]:
"""
Load fine-tuned models for ensemble deployment.

This cell:
- Loads the best fine-tuned checkpoints for all models
- Prepares them for inference
- Sets models to evaluation mode
"""

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
from torch.nn.functional import softmax

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Label mapping
label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Model configuration
deployment_models = {
    "DistilBERT_Full": {
        "path": "models/distilbert_full/best_model",
        "base": "distilbert-base-uncased",
        "lora": False,
    },
    "DistilBERT_LoRA": {
        "path": "models/distilbert_lora/best_model",
        "base": "distilbert-base-uncased",
        "lora": True,
    },
    "RoBERTa_Full": {
        "path": "models/roberta_full/best_model",
        "base": "roberta-base",
        "lora": False,
    },
    "RoBERTa_LoRA": {
        "path": "models/roberta_lora/best_model",
        "base": "roberta-base",
        "lora": True,
    },
}

loaded_deployment_models = {}
loaded_tokenizers = {}

for name, cfg in deployment_models.items():
    tokenizer = AutoTokenizer.from_pretrained(cfg["base"])
    loaded_tokenizers[name] = tokenizer

    if cfg["lora"]:
        base_model = AutoModelForSequenceClassification.from_pretrained(
            cfg["base"], num_labels=4
        )
        model = PeftModel.from_pretrained(base_model, cfg["path"])
    else:
        model = AutoModelForSequenceClassification.from_pretrained(cfg["path"])

    model.to(device)
    model.eval()
    loaded_deployment_models[name] = model

In [None]:
"""
Flexible prediction function supporting ensemble and single-model inference.

This cell:
- Loads the best ensemble configuration from disk
- Supports ensemble-based or single-model predictions
- Uses a unified inference pipeline for both modes
"""

import json
import numpy as np
import torch
from torch.nn.functional import softmax

# Load best ensemble configuration
with open("results/best_ensemble.json", "r") as f:
    best_ensemble = json.load(f)

ensemble_model_names = best_ensemble["models"]
ensemble_weights = np.array(best_ensemble["weights"], dtype=float)

MAX_LENGTH = 128

def predict(text, mode="ensemble", model_name=None):
    """
    Predict the news category using either an ensemble or a single model.

    Args:
        text (str): News headline or short description.
        mode (str): Prediction mode ("ensemble" or "single").
        model_name (str, optional): Model key to use in single-model mode.

    Returns:
        tuple: Predicted label and dictionary of class probabilities.
    """
    if mode == "ensemble":
        model_names = ensemble_model_names
        weights = ensemble_weights

    elif mode == "single":
        if model_name is None:
            raise ValueError(
                "model_name must be specified when mode='single'."
            )
        model_names = [model_name]
        weights = np.array([1.0])

    else:
        raise ValueError("mode must be either 'ensemble' or 'single'.")

    all_probs = []

    for name in model_names:
        model = loaded_deployment_models[name]
        tokenizer = loaded_tokenizers[name]

        inputs = tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding="max_length",
            max_length=MAX_LENGTH
        ).to(device)

        with torch.no_grad():
            logits = model(**inputs).logits
            probs = softmax(logits, dim=-1).cpu().numpy()[0]

        all_probs.append(probs)

    # Aggregate probabilities
    combined_probs = np.zeros_like(all_probs[0])
    for w, p in zip(weights, all_probs):
        combined_probs += w * p

    predicted_index = int(np.argmax(combined_probs))

    return (
        label_names[predicted_index],
        {label: float(prob) for label, prob in zip(label_names, combined_probs)}
    )
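
The prediction cell reads the `models` and `weights` keys from `results/best_ensemble.json` and pairs them with `zip`, which silently truncates if the two lists differ in length. A small validation step can surface such problems early. The sketch below is one possible check, assuming the file contains exactly those two keys; `load_ensemble_config` is a hypothetical helper and the weights shown are placeholders.

```python
import json
import tempfile
from pathlib import Path


def load_ensemble_config(path, known_models):
    """Load an ensemble config and check it against the available model keys."""
    with open(path, "r") as f:
        cfg = json.load(f)

    models, weights = cfg["models"], cfg["weights"]
    if len(models) != len(weights):
        raise ValueError("'models' and 'weights' must have the same length.")

    unknown = [m for m in models if m not in known_models]
    if unknown:
        raise ValueError(f"Unknown model keys: {unknown}")

    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("Weights must be non-negative with a positive sum.")
    return models, weights


# Round-trip through a throwaway file; the weights are placeholders.
known = {"DistilBERT_Full", "DistilBERT_LoRA", "RoBERTa_Full", "RoBERTa_LoRA"}
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "best_ensemble.json"
    cfg_path.write_text(
        json.dumps({"models": ["RoBERTa_Full", "DistilBERT_Full"], "weights": [0.6, 0.4]})
    )
    models, weights = load_ensemble_config(cfg_path, known)

print(models, weights)
```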

In [None]:
"""
Create and launch an interactive Gradio interface for model inference.

This interface:
- Allows users to select the model (single or ensemble) via UI controls
- Accepts a news headline or short description as input
- Displays class probabilities for all four categories
- Provides text-only examples without coupling to model selection
"""

import gradio as gr
import pandas as pd

MODEL_MAP = {
    "Ensemble": {"mode": "ensemble", "model_name": None},
    "RoBERTa (Full)": {"mode": "single", "model_name": "RoBERTa_Full"},
    "RoBERTa (LoRA)": {"mode": "single", "model_name": "RoBERTa_LoRA"},
    "DistilBERT (Full)": {"mode": "single", "model_name": "DistilBERT_Full"},
    "DistilBERT (LoRA)": {"mode": "single", "model_name": "DistilBERT_LoRA"},
}

def predict_category(text, model_choice):
    config = MODEL_MAP[model_choice]
    _, probabilities = predict(
        text,
        mode=config["mode"],
        model_name=config["model_name"]
    )
    return probabilities

# --------------------------------------------------
# Example table (2 columns, 5 rows)
# --------------------------------------------------
example_texts = [
    "Google unveils new AI chip for cloud customers.",
    "European Union holds emergency meeting on rising tensions.",
    "Scientists discover new exoplanet in nearby solar system.",
    "Tech companies, governments, and sports organizations meet to discuss new economic reforms.",
    "Officials raise concerns about rapid growth following recent announcements.",
    "European regulators investigate major tech firms for violating new digital privacy rules.",
    "Top athletes launch crypto startup focused on fan engagement tokens.",
    "Government to nationalize struggling telecom company after years of financial losses.",
    "NFL signs multi-billion dollar broadcast deal with major streaming platform.",
    "Serena Williams confirms her return for the upcoming Grand Slam tournament.",
]

example_df = pd.DataFrame({
    "Example A": example_texts[:5],
    "Example B": example_texts[5:]
})

def load_example(evt: gr.SelectData):
    return example_df.iloc[evt.index[0], evt.index[1]]

# --------------------------------------------------
# UI
# --------------------------------------------------
with gr.Blocks() as demo:
    gr.Markdown("<h2 style='text-align: center;'>News Article Classification Demo</h2>")

    with gr.Row():
        with gr.Column(scale=1):
            model_choice = gr.Radio(
                choices=list(MODEL_MAP.keys()),
                value="Ensemble",
                label="Model Selection"
            )

            text_input = gr.Textbox(
                lines=3,
                placeholder="Enter a news headline or short description",
                label="News Text"
            )

            submit = gr.Button("Submit")

        with gr.Column(scale=1):
            output = gr.Label(
                num_top_classes=4,
                label="Predicted Category"
            )

    gr.Markdown("### Example News Headlines")

    example_table = gr.Dataframe(
        value=example_df,
        interactive=False,
        wrap=True
    )

    example_table.select(
        fn=load_example,
        outputs=text_input
    )

    submit.click(
        fn=predict_category,
        inputs=[text_input, model_choice],
        outputs=output
    )

demo.launch()

#### Deploying on Hugging Face for Public Access

In [None]:
import os
from dotenv import load_dotenv
from pathlib import Path
from huggingface_hub import login, HfApi

# Clear any existing token
os.environ.pop("HF_TOKEN", None)

# Load .env from project root
project_root = Path.cwd().parent
load_dotenv(project_root / ".env")

# Verify
print("HF_TOKEN loaded:", "HF_TOKEN" in os.environ)

# Authenticate
login(token=os.environ["HF_TOKEN"])
HfApi().whoami()
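
If the `.env` file is missing or does not define `HF_TOKEN`, the lookup `os.environ["HF_TOKEN"]` fails with a bare `KeyError`. A small guard can produce a clearer message. The sketch below shows one way to do this; `require_env` is a hypothetical helper, and the variable name `EXAMPLE_TOKEN` and its value are placeholders, not real credentials.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file or export it in the shell."
        )
    return value


os.environ["EXAMPLE_TOKEN"] = "dummy-value"  # placeholder, not a real token
print(require_env("EXAMPLE_TOKEN"))
```

In the authentication cell, the same helper could wrap the token lookup, for example `login(token=require_env("HF_TOKEN"))`.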

In [None]:
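"""
Create public model repositories on the Hugging Face Hub.

This cell:
- Creates one repository per fine-tuned model
- Skips repositories that already exist
"""

from huggingface_hub import create_repo

HF_USERNAME = "heezuss"

repos = [
    "news-distilbert-full",
    "news-distilbert-lora",
    "news-roberta-full",
    "news-roberta-lora",
]

for repo in repos:
    create_repo(
        repo_id=f"{HF_USERNAME}/{repo}",
        exist_ok=True,
        private=False
    )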

In [None]:
"""
Upload fine-tuned models to the Hugging Face Model Hub.

This cell:
- Uploads full fine-tuned models (DistilBERT, RoBERTa)
- Uploads LoRA adapter models (DistilBERT, RoBERTa)
- Uploads corresponding tokenizers
"""

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

HF_USERNAME = "heezuss"

models_to_upload = {
    "news-distilbert-full": {
        "type": "full",
        "local_path": "models/distilbert_full/best_model",
        "base": "distilbert-base-uncased",
    },
    "news-distilbert-lora": {
        "type": "lora",
        "local_path": "models/distilbert_lora/best_model",
        "base": "distilbert-base-uncased",
    },
    "news-roberta-full": {
        "type": "full",
        "local_path": "models/roberta_full/best_model",
        "base": "roberta-base",
    },
    "news-roberta-lora": {
        "type": "lora",
        "local_path": "models/roberta_lora/best_model",
        "base": "roberta-base",
    },
}

for repo_name, cfg in models_to_upload.items():
    repo_id = f"{HF_USERNAME}/{repo_name}"
    print(f"\nUploading to {repo_id}")

    tokenizer = AutoTokenizer.from_pretrained(cfg["base"])

    if cfg["type"] == "full":
        model = AutoModelForSequenceClassification.from_pretrained(
            cfg["local_path"]
        )
    else:
        base_model = AutoModelForSequenceClassification.from_pretrained(
            cfg["base"], num_labels=4
        )
        model = PeftModel.from_pretrained(
            base_model, cfg["local_path"]
        )

    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)

print("\nAll models successfully uploaded to Hugging Face Hub.")

In [None]:
"""
Create and upload README.md (model card) for each model repository.
"""

from huggingface_hub import HfApi

api = HfApi()

README_CONTENT = """# News Article Classification Model

Fine-tuned transformer model for classifying news articles into four categories:
World, Sports, Business, and Sci/Tech.

**Dataset:** AG News  
**Fine-tuning:** Supervised (Full fine-tuning or LoRA-based fine-tuning)  
**Maximum sequence length:** 128  

This model was trained and evaluated as part of an academic project on applied
natural language processing and transformer-based text classification.
"""

for repo_name in models_to_upload.keys():
    repo_id = f"{HF_USERNAME}/{repo_name}"

    api.upload_file(
        path_or_fileobj=README_CONTENT.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="model"
    )

print("README.md added to all model repositories.")
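
The cell above uploads the same README to every repository, so each card reads "Full fine-tuning or LoRA-based fine-tuning" regardless of what the repository contains. A per-model card could state the actual method and base model. The sketch below is one way to generate such text; `build_model_card` is a hypothetical helper, and including the `base_model` and `pipeline_tag` metadata fields in the YAML header is an assumption about what metadata is wanted.

```python
def build_model_card(base_model: str, method: str) -> str:
    """Render a short model card; `method` is either 'full' or 'lora'."""
    if method not in {"full", "lora"}:
        raise ValueError("method must be 'full' or 'lora'.")

    if method == "lora":
        training = (
            "LoRA fine-tuning. This repository stores adapter weights that "
            f"are loaded on top of `{base_model}`."
        )
    else:
        training = "Full fine-tuning of all model weights."

    lines = [
        "---",
        f"base_model: {base_model}",
        "pipeline_tag: text-classification",
        "---",
        "# News Article Classification Model",
        "",
        "Classifies news articles into four categories:",
        "World, Sports, Business, and Sci/Tech.",
        "",
        "**Dataset:** AG News  ",
        f"**Fine-tuning:** {training}  ",
        "**Maximum sequence length:** 128  ",
    ]
    return "\n".join(lines) + "\n"


card = build_model_card("roberta-base", "lora")
print(card)
```

The returned string can be encoded and passed to `api.upload_file` in place of `README_CONTENT`, using the `type` and `base` entries of `models_to_upload` as arguments.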