## 1. Notebook Setup and Configuration

This notebook performs Named Entity Recognition (NER) and Relation Extraction (RE) using transformer models.

**CRITICAL WARNING ON DATASET SIZE AND EXPECTED PERFORMANCE:**
The dataset currently configured for this notebook (Train: ~50 documents, Dev: ~20 documents) is **extremely small** for training modern transformer models for complex tasks like NER (with many labels) and RE. 

**Consequences of Small Dataset Size:**
1.  **Poor Generalization:** Models trained on such limited data are highly unlikely to generalize well to unseen data. Expect low F1 scores, precision, recall, and accuracy on development and test sets.
2.  **Overfitting:** Models might easily memorize the small training set, showing deceptively good training loss but performing badly on evaluation.
3.  **Unreliable Hyperparameter Optimization:** HPO results will not be stable or indicative of true optimal parameters for a larger, representative dataset.

**Purpose of this Notebook (Given Data Limitation):**
* To demonstrate the **end-to-end pipeline** for setting up NER and RE tasks.
* To show how to use various fine-tuning strategies (full, LoRA, partial-freeze).
* To illustrate data preprocessing, metric calculation, and baseline evaluations.

**To achieve meaningful performance, a significantly larger and more diverse dataset is essential.** The training parameters (epochs, steps) have been adjusted to be more reasonable than in the original notebook, but they are still tuned for a quick demonstration rather than state-of-the-art results on this tiny dataset.

In [53]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

!pip install datasets evaluate transformers[sentencepiece] peft optuna scikit-learn pandas matplotlib seaborn nltk
!pip install evaluate seqeval

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://download.pytorch.org/whl/cu118


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [55]:
# --- Core Configuration Parameters ---
# Note: Adjust these based on your actual dataset size and available compute time.
# FOR THE CURRENT VERY SMALL DATASET, EVEN THESE VALUES ARE FOR DEMONSTRATION
# AND HIGH PERFORMANCE IS NOT EXPECTED.

# --- Model Configuration ---
MODEL_NAME_NER = "t5-small" 
MODEL_NAME_RE = "t5-small"  

# --- NER Training Parameters ---
# HPO for NER
NUM_OPTUNA_TRIALS_NER = 3  # Number of Optuna trials
NUM_EPOCHS_OPTUNA_NER = 4  # Epochs per Optuna trial for NER
# Final NER Training
NUM_TRAIN_EPOCHS_FINAL_NER = 20 # Target epochs for final NER training (use with EarlyStopping)

# --- RE Training Parameters ---
# HPO for RE (Set NUM_OPTUNA_TRIALS_RE > 0 to enable)
NUM_OPTUNA_TRIALS_RE = 0 
NUM_EPOCHS_OPTUNA_RE = 3   # Epochs per Optuna trial for RE
# Final RE Training
NUM_TRAIN_EPOCHS_FINAL_RE = 20 # Target epochs for final RE training (use with EarlyStopping)

# --- General Settings ---
SEED = 42 
OUTPUT_BASE_DIR = "/kaggle/working/outputs" 
TMP_BASE_DIR = "/kaggle/working/tmp"      
    

Path(OUTPUT_BASE_DIR).mkdir(parents=True, exist_ok=True)
Path(TMP_BASE_DIR).mkdir(parents=True, exist_ok=True)

# Dictionary to store all evaluation results
all_experiment_results = defaultdict(dict)

print(f"--- Configuration Summary ---")
print(f"NER Model: {MODEL_NAME_NER}")
print(f"RE Model: {MODEL_NAME_RE}")
print(f"NER Optuna Trials: {NUM_OPTUNA_TRIALS_NER} trials, {NUM_EPOCHS_OPTUNA_NER} epochs/trial")
print(f"NER Final Training Epochs: {NUM_TRAIN_EPOCHS_FINAL_NER}")
print(f"RE Optuna Trials: {NUM_OPTUNA_TRIALS_RE} trials, {NUM_EPOCHS_OPTUNA_RE} epochs/trial (0 means skip)")
print(f"RE Final Training Epochs: {NUM_TRAIN_EPOCHS_FINAL_RE}")
print(f"Global Seed: {SEED}")
print(f"Output Base Directory: {OUTPUT_BASE_DIR}")
print(f"Temp Base Directory for HPO: {TMP_BASE_DIR}")

--- Configuration Summary ---
NER Model: t5-small
RE Model: t5-small
NER Optuna Trials: 3 trials, 4 epochs/trial
NER Final Training Epochs: 15
RE Optuna Trials: 0 trials, 3 epochs/trial (0 means skip)
RE Final Training Epochs: 10
Global Seed: 42
Output Base Directory: /kaggle/working/outputs
Temp Base Directory for HPO: /kaggle/working/tmp


## 2. Imports and Environment Setup

In [56]:
import os
import re
import random
import json
import time
from pathlib import Path
from collections import Counter, defaultdict
import copy
import warnings

import torch
import numpy as np
import pandas as pd # For results table
import sklearn
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import optuna
import nltk

from datasets import Dataset # Keep this
import evaluate # <--- ADD THIS IMPORT

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForTokenClassification,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorForTokenClassification,
    DataCollatorWithPadding,
    T5Config,
    set_seed
)
from peft import get_peft_model, LoraConfig, TaskType

# Set seed for reproducibility early
# SEED should be defined in your config cell, e.g., SEED = 42
if 'SEED' not in globals():
    SEED = 42 
    print(f"Warning: SEED not found globally, setting to {SEED}")
set_seed(SEED)

# NLTK Punkt Tokenizer for sentence splitting
try:
    nltk.data.find('tokenizers/punkt')
except nltk.downloader.DownloadError:
    print("NLTK 'punkt' tokenizer not found. Downloading...")
    nltk.download('punkt', quiet=True)
    print("'punkt' tokenizer downloaded.")

# Suppress some common warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning, module='torch.nn.parallel._functions')
warnings.filterwarnings("ignore", category=UserWarning, message="Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.")


print(f"Libraries imported and seed set to {SEED}.")
print(f"The 'evaluate' library version is: {evaluate.__version__}") # Good to check version

Libraries imported and seed set to 42.
The 'evaluate' library version is: 0.4.3


## 3. Data Paths and Loading

In [57]:
DATA_ROOT = Path("/kaggle/input/nlp-dataset") 

TRAIN_DIR = DATA_ROOT / "train"
DEV_DIR   = DATA_ROOT / "dev"
TEST_DIR  = DATA_ROOT / "test"

print(f"Looking for data in: {DATA_ROOT}")
assert TRAIN_DIR.exists(), f"Train folder not found: {TRAIN_DIR}"
assert DEV_DIR.exists(),   f"Dev folder not found:   {DEV_DIR}"
assert TEST_DIR.exists(),  f"Test folder not found:  {TEST_DIR}"

def load_docie_docs(folder: Path, recursive: bool = False):
    docs = []
    pattern = "**/*.json" if recursive else "*.json"
    file_paths = sorted(list(folder.glob(pattern))) 
    if not file_paths:
        print(f"Warning: No JSON files found in {folder} with pattern '{pattern}'")
        return docs
        
    for file_path in file_paths:
        try:
            data = json.loads(file_path.read_text(encoding="utf-8"))
            doc_id_from_file = file_path.stem
            if isinstance(data, list):
                for sub_doc_idx, sub_doc_item in enumerate(data):
                    if isinstance(sub_doc_item, dict):
                        sub_doc_item['id'] = sub_doc_item.get('id', f"{doc_id_from_file}_{sub_doc_idx}")
                        docs.append(sub_doc_item)
            elif isinstance(data, dict):
                data['id'] = data.get('id', doc_id_from_file)
                docs.append(data)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON from {file_path}: {e}")
        except Exception as e:
            print(f"Error loading {file_path}: {e}")
    return docs

print("Loading raw documents...")
train_docs_raw = load_docie_docs(TRAIN_DIR)
dev_docs_raw = load_docie_docs(DEV_DIR)
test_docs_raw = load_docie_docs(TEST_DIR, recursive=True)

print(f"Raw documents loaded: Train: {len(train_docs_raw)} │ Dev: {len(dev_docs_raw)} │ Test: {len(test_docs_raw)}")

if not train_docs_raw or not dev_docs_raw:
    raise ValueError("CRITICAL: Training or development data is empty. Cannot proceed.")

# Standardize document text key and ensure essential keys exist
print("\nStandardizing document keys...")
for doc_list_name, doc_list in zip(["Train", "Dev", "Test"], [train_docs_raw, dev_docs_raw, test_docs_raw]):
    print(f"Processing {doc_list_name} docs...")
    for i, doc_item in enumerate(doc_list):
        if not isinstance(doc_item, dict):
            print(f"Warning: Item {i} in {doc_list_name} is not a dict, it's a {type(doc_item)}. Skipping.")
            doc_list[i] = None # Mark for removal
            continue
        if 'document' in doc_item and 'document_text' not in doc_item:
            doc_item['document_text'] = doc_item.pop('document')
        elif 'doc' in doc_item and 'document_text' not in doc_item:
            doc_item['document_text'] = doc_item.pop('doc')
        elif 'document_text' not in doc_item:
            print(f"Warning: Document {doc_item.get('id', 'UNKNOWN_ID')} in {doc_list_name} missing text field ('document' or 'doc'). Setting to empty string.")
            doc_item['document_text'] = ""
        
        if 'entities' not in doc_item:
            doc_item['entities'] = [] 
        if 'triples' not in doc_item:
            doc_item['triples'] = []   
    # Remove None items
    train_docs_raw = [d for d in train_docs_raw if d is not None]
    dev_docs_raw = [d for d in dev_docs_raw if d is not None]
    test_docs_raw = [d for d in test_docs_raw if d is not None]

if train_docs_raw:
    print(f"\nKeys in first standardized training document ('{train_docs_raw[0].get('id', 'N/A')}'): {train_docs_raw[0].keys()}")
else:
    print("No training documents available after standardization.")

Looking for data in: /kaggle/input/nlp-dataset
Loading raw documents...
Raw documents loaded: Train: 51 │ Dev: 23 │ Test: 248

Standardizing document keys...
Processing Train docs...
Processing Dev docs...
Processing Test docs...

Keys in first standardized training document ('Communication_all_examples_0'): dict_keys(['domain', 'title', 'entities', 'triples', 'label_set', 'entity_label_set', 'id', 'document_text'])


## 4. Exploratory Data Analysis (EDA)

In [58]:
print("--- EDA: Document Length ---")
if train_docs_raw:
    lengths = [len(doc.get("document_text", "").split()) for doc in train_docs_raw]
    if lengths:
        print(f"Avg Tokens per doc (approx): {np.mean(lengths):.2f}, Max Tokens: {np.max(lengths)}, Min Tokens: {np.min(lengths)}")
    else:
        print("No documents with text found for length analysis.")
else:
    print("No training documents for length EDA.")

--- EDA: Document Length ---
Avg Tokens per doc (approx): 919.08, Max Tokens: 2560, Min Tokens: 128


In [59]:
print("--- EDA: NER Entity Types Distribution (Training Set) ---")
if train_docs_raw:
    all_ner_entity_types_eda = []
    for doc in train_docs_raw:
        for ent in doc.get('entities', []):
            all_ner_entity_types_eda.append(ent.get('type', 'UNKNOWN_TYPE'))
    if all_ner_entity_types_eda:
        ner_ctr_eda = Counter(all_ner_entity_types_eda)
        print(f"Total unique NER entity types found in data: {len(ner_ctr_eda)}")
        print(f"NER Entity Types (Top 20): {ner_ctr_eda.most_common(20)}")
    else:
        print("No entities found in training documents for NER EDA.")
else:
    print("No training documents for NER entity EDA.")

--- EDA: NER Entity Types Distribution (Training Set) ---
Total unique NER entity types found in data: 19
NER Entity Types (Top 20): [('DATE', 647), ('MISC', 417), ('PERSON', 242), ('ORG', 241), ('CARDINAL', 224), ('GPE', 157), ('WORK_OF_ART', 65), ('NORP', 59), ('ORDINAL', 55), ('QUANTITY', 42), ('EVENT', 35), ('PRODUCT', 30), ('FAC', 30), ('MONEY', 29), ('PERCENT', 28), ('LOC', 24), ('LANGUAGE', 10), ('LAW', 9), ('TIME', 8)]


In [60]:
print("--- EDA: RE Relation Types Distribution (Training Set) ---")
if train_docs_raw:
    all_re_relation_types_eda = []
    for doc in train_docs_raw:
        for t in doc.get('triples', []):
            all_re_relation_types_eda.append(t.get('relation', 'UNKNOWN_RELATION'))
    if all_re_relation_types_eda:
        re_ctr_eda = Counter(all_re_relation_types_eda)
        print(f"Total unique RE relation types found in data: {len(re_ctr_eda)}")
        print(f"RE Relation Types (Top 20): {re_ctr_eda.most_common(20)}")
    else:
        print("No triples found in training documents for RE EDA.")
else:
    print("No training documents for RE relation EDA.")

--- EDA: RE Relation Types Distribution (Training Set) ---
Total unique RE relation types found in data: 68
RE Relation Types (Top 20): [('HasPart', 82), ('HasEffect', 67), ('DiplomaticRelation', 45), ('LocatedIn', 44), ('InterestedIn', 38), ('OwnerOf', 32), ('NominatedFor', 25), ('SaidToBeTheSameAs', 25), ('PartOf', 18), ('Creator', 17), ('Founded', 13), ('Country', 13), ('DifferentFrom', 11), ('SignificantEvent', 11), ('PrimeFactor', 11), ('InfluencedBy', 10), ('Follows', 10), ('UsedBy', 9), ('InspiredBy', 9), ('Uses', 8)]


# Part I: Named Entity Recognition (NER)

## 5. NER: Label Mapping

In [61]:
print("--- NER Label Mapping ---")
ner_entity_type_source = []
if train_docs_raw:
    # Prefer 'NER_label_set' if available in the first doc, otherwise derive from all entities
    if 'NER_label_set' in train_docs_raw[0] and isinstance(train_docs_raw[0]['NER_label_set'], list):
        ner_entity_type_source = sorted(list(set(train_docs_raw[0]["NER_label_set"])))
        print("Using 'NER_label_set' from the first document for NER labels.")
    elif 'entity_label_set' in train_docs_raw[0] and isinstance(train_docs_raw[0]['entity_label_set'], list):
        ner_entity_type_source = sorted(list(set(train_docs_raw[0]["entity_label_set"])))
        print("Using 'entity_label_set' from the first document for NER labels.")
    else:
        print("Deriving NER labels from all unique entity types found in training data.")
        all_types = set()
        for doc in train_docs_raw:
            for entity in doc.get('entities', []):
                all_types.add(entity.get('type', 'UNKNOWN_TYPE'))
        ner_entity_type_source = sorted(list(all_types - {'UNKNOWN_TYPE'}))

if not ner_entity_type_source:
    print("CRITICAL WARNING: No NER entity types determined. Using minimal fallback. NER will perform poorly.")
    ner_entity_type_source = ["MISC"]

ner_labels_list = ["O"] 
for t in ner_entity_type_source:
    ner_labels_list += [f"B-{t}", f"I-{t}"]

ner_label_to_id = {label: i for i, label in enumerate(ner_labels_list)}
ner_id_to_label = {i: label for i, label in enumerate(ner_labels_list)}
num_ner_labels = len(ner_labels_list)

print(f"Number of NER Labels: {num_ner_labels}")
print(f"Sample NER Labels (first 10): {ner_labels_list[:min(10, num_ner_labels)]}")
print(f"ner_label_to_id['O'] = {ner_label_to_id.get('O', 'ERROR: O not found')}")

%store ner_label_to_id
%store ner_id_to_label
%store num_ner_labels
%store ner_labels_list

--- NER Label Mapping ---
Using 'entity_label_set' from the first document for NER labels.
Number of NER Labels: 39
Sample NER Labels (first 10): ['O', 'B-CARDINAL', 'I-CARDINAL', 'B-DATE', 'I-DATE', 'B-EVENT', 'I-EVENT', 'B-FAC', 'I-FAC', 'B-GPE']
ner_label_to_id['O'] = 0
Stored 'ner_label_to_id' (dict)
Stored 'ner_id_to_label' (dict)
Stored 'num_ner_labels' (int)
Stored 'ner_labels_list' (list)


## 6. NER: Tokenizer Initialization

In [62]:
%store -r MODEL_NAME_NER 
print(f"--- Initializing Tokenizer for NER model: {MODEL_NAME_NER} ---")
ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_NER, use_fast=True)
print(f"NER Tokenizer for '{MODEL_NAME_NER}' loaded.")

no stored variable or alias MODEL_NAME_NER
--- Initializing Tokenizer for NER model: t5-small ---
NER Tokenizer for 't5-small' loaded.


## 7. NER: Dataset Preparation and Tokenization

In [63]:
print("--- Creating Hugging Face Datasets for NER from raw documents ---")
ner_hf_train_raw = Dataset.from_list(train_docs_raw)
ner_hf_dev_raw = Dataset.from_list(dev_docs_raw)
print(f"Raw NER HF Datasets created. Train: {len(ner_hf_train_raw)}, Dev: {len(ner_hf_dev_raw)}")

--- Creating Hugging Face Datasets for NER from raw documents ---
Raw NER HF Datasets created. Train: 51, Dev: 23


In [64]:
%store -r ner_label_to_id # Retrieve from mapping cell

MAX_SEQ_LENGTH_NER = 512 
TOKEN_STRIDE_NER = 128   

def tokenize_and_align_labels_ner_revised(examples, tokenizer_obj, label_to_id_map):
    """
    Tokenizes documents and aligns NER labels. Uses 'document_text' and 'entities'.
    Assumes each mention in entity['mentions'] is a STRING.
    Uses string.find() to locate mentions, which might be ambiguous if mention text is repeated.
    Handles long documents by chunking with stride.
    """
    tokenized_inputs = tokenizer_obj(
        examples["document_text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH_NER,
        stride=TOKEN_STRIDE_NER,
        return_overflowing_tokens=True, 
        return_offsets_mapping=True,    
        padding=False                   
    )
    chunk_labels_list = []
    
    for i, original_doc_idx in enumerate(tokenized_inputs["overflow_to_sample_mapping"]):
        doc_entities = examples["entities"][original_doc_idx]
        # Get the full original text for the current document for .find()
        original_document_text = examples["document_text"][original_doc_idx] 
        
        offset_mapping = tokenized_inputs["offset_mapping"][i]
        current_chunk_labels = [label_to_id_map["O"]] * len(offset_mapping)
        
        chunk_char_start_original_doc = -1
        chunk_char_end_original_doc = -1
        for token_start_char, token_end_char in offset_mapping:
            if token_start_char == 0 and token_end_char == 0: continue
            if chunk_char_start_original_doc == -1: chunk_char_start_original_doc = token_start_char
            chunk_char_end_original_doc = token_end_char
            
        if chunk_char_start_original_doc == -1:
            chunk_labels_list.append(current_chunk_labels)
            continue
            
        for entity in doc_entities:
            entity_type = entity.get("type", "UNKNOWN_TYPE") # Handle if 'type' key is missing
            for mention_text_str in entity.get("mentions", []): # mention_text_str is a string
                if not isinstance(mention_text_str, str):
                    # print(f"Warning: Expected mention to be a string, but got {type(mention_text_str)}. Skipping mention: {mention_text_str}")
                    continue

                # Find start/end of mention_text_str in the *original, unchunked* document text.
                # This is the original notebook's approach. Can be ambiguous if mention_text_str repeats.
                mention_orig_char_start = original_document_text.find(mention_text_str)
                
                if mention_orig_char_start == -1:
                    # print(f"Warning: Mention text '{mention_text_str}' not found in document. Skipping this mention.")
                    continue
                mention_orig_char_end = mention_orig_char_start + len(mention_text_str)

                # Check if the mention (found via .find()) is within the character span of THIS CHUNK
                if not (mention_orig_char_end > chunk_char_start_original_doc and \
                        mention_orig_char_start < chunk_char_end_original_doc):
                    continue # Entity mention (this occurrence) not relevant to this chunk
                
                first_token_of_mention_in_chunk = True
                for token_idx, (token_orig_char_start, token_orig_char_end) in enumerate(offset_mapping):
                    if token_orig_char_start == 0 and token_orig_char_end == 0: continue # Special token

                    # Is this token part of the current mention?
                    if max(mention_orig_char_start, token_orig_char_start) < min(mention_orig_char_end, token_orig_char_end):
                        label_key_b = f"B-{entity_type}"
                        label_key_i = f"I-{entity_type}"
                        if first_token_of_mention_in_chunk:
                            current_chunk_labels[token_idx] = label_to_id_map.get(label_key_b, label_to_id_map["O"])
                            first_token_of_mention_in_chunk = False
                        else:
                            current_chunk_labels[token_idx] = label_to_id_map.get(label_key_i, label_to_id_map["O"])
        chunk_labels_list.append(current_chunk_labels)
    
    tokenized_inputs["labels"] = chunk_labels_list
    return tokenized_inputs

print("Tokenizing and aligning labels for NER datasets (revised function to handle string mentions)...\n")

# Ensure required columns exist from the raw HF datasets
if 'document_text' not in ner_hf_train_raw.column_names or 'entities' not in ner_hf_train_raw.column_names:
    raise ValueError("NER datasets must contain 'document_text' and 'entities' columns. Check data loading (cell 'paths-loading-code') and standardization.")

# Remove potentially problematic examples if 'entities' or 'mentions' have unexpected types
# This is a defensive step. Ideally, data loading should clean this.
def filter_problematic_ner_examples(example):
    if not isinstance(example.get('entities'), list): return False
    for entity in example['entities']:
        if not isinstance(entity, dict): return False
        if not isinstance(entity.get('mentions'), list): return False
        for mention in entity['mentions']:
            if not isinstance(mention, str): # Assuming mentions are strings now
                # print(f"Found non-string mention: {mention} in doc id {example.get('id')}")
                return False # Filter out examples with non-string mentions if that's unexpected
    return True

print("Filtering raw NER datasets for valid structure before mapping...")
ner_hf_train_filtered = ner_hf_train_raw.filter(filter_problematic_ner_examples)
ner_hf_dev_filtered = ner_hf_dev_raw.filter(filter_problematic_ner_examples)
print(f"NER Train: {len(ner_hf_train_filtered)} from {len(ner_hf_train_raw)}. NER Dev: {len(ner_hf_dev_filtered)} from {len(ner_hf_dev_raw)} after filtering.")


if len(ner_hf_train_filtered) > 0:
    ner_tokenized_train = ner_hf_train_filtered.map(
        tokenize_and_align_labels_ner_revised, 
        batched=True, 
        fn_kwargs={'tokenizer_obj': ner_tokenizer, 'label_to_id_map': ner_label_to_id},
        remove_columns=ner_hf_train_filtered.column_names 
    )
else:
    ner_tokenized_train = ner_hf_train_filtered # Keep empty if it became empty
    print("Warning: NER training dataset is empty after filtering. Tokenization skipped.")


if len(ner_hf_dev_filtered) > 0:
    ner_tokenized_dev = ner_hf_dev_filtered.map(
        tokenize_and_align_labels_ner_revised, 
        batched=True, 
        fn_kwargs={'tokenizer_obj': ner_tokenizer, 'label_to_id_map': ner_label_to_id},
        remove_columns=ner_hf_dev_filtered.column_names
    )
else:
    ner_tokenized_dev = ner_hf_dev_filtered # Keep empty
    print("Warning: NER development dataset is empty after filtering. Tokenization skipped.")


print("\nNER datasets tokenized.")
print(f"Processed NER Train samples (after chunking): {len(ner_tokenized_train)}")
print(f"Processed NER Dev samples (after chunking): {len(ner_tokenized_dev)}")
if len(ner_tokenized_train) > 0:
    print(f"Features in tokenized NER train set: {ner_tokenized_train.features}")

%store ner_tokenized_train
%store ner_tokenized_dev

no stored variable or alias #
no stored variable or alias Retrieve
no stored variable or alias from
no stored variable or alias mapping
no stored variable or alias cell
Tokenizing and aligning labels for NER datasets (revised function to handle string mentions)...

Filtering raw NER datasets for valid structure before mapping...


Filter:   0%|          | 0/51 [00:00<?, ? examples/s]

Filter:   0%|          | 0/23 [00:00<?, ? examples/s]

NER Train: 51 from 51. NER Dev: 23 from 23 after filtering.


Map:   0%|          | 0/51 [00:00<?, ? examples/s]

Map:   0%|          | 0/23 [00:00<?, ? examples/s]


NER datasets tokenized.
Processed NER Train samples (after chunking): 182
Processed NER Dev samples (after chunking): 81
Features in tokenized NER train set: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'offset_mapping': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'overflow_to_sample_mapping': Value(dtype='int64', id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
Stored 'ner_tokenized_train' (Dataset)
Stored 'ner_tokenized_dev' (Dataset)


## 8. NER: Data Collator and Metrics Function

In [65]:
%store -r ner_id_to_label ner_labels_list ner_label_to_id # Retrieve from mapping cell

ner_data_collator = DataCollatorForTokenClassification(tokenizer=ner_tokenizer, padding='longest')
print("NER Data Collator initialized.")

seqeval_metric_ner = evaluate.load("seqeval")

def compute_metrics_ner_revised(p):
    predictions_logits, labels = p
    predictions = np.argmax(predictions_logits, axis=2)

    true_predictions_str = [
        [ner_id_to_label.get(p_val, "O") for (p_val, l_val) in zip(pred_seq, label_seq) if l_val != -100]
        for pred_seq, label_seq in zip(predictions, labels)
    ]
    true_labels_str = [
        [ner_id_to_label.get(l_val, "O") for (p_val, l_val) in zip(pred_seq, label_seq) if l_val != -100]
        for pred_seq, label_seq in zip(predictions, labels)
    ]

    seqeval_results = seqeval_metric_ner.compute(predictions=true_predictions_str, references=true_labels_str, zero_division=0)
    
    flat_predictions_token = predictions.flatten()
    flat_labels_token = labels.flatten()
    active_mask = flat_labels_token != -100
    active_preds = flat_predictions_token[active_mask]
    active_labels = flat_labels_token[active_mask]
    token_accuracy = accuracy_score(active_labels, active_preds) if len(active_labels) > 0 else 0.0

    return {
        "precision": seqeval_results.get("overall_precision", 0.0),
        "recall": seqeval_results.get("overall_recall", 0.0),
        "f1": seqeval_results.get("overall_f1", 0.0),
        "accuracy": seqeval_results.get("overall_accuracy", 0.0), 
        "token_accuracy": token_accuracy
    }

print("NER Metrics function (revised with seqeval and token accuracy) defined.")

no stored variable or alias #
no stored variable or alias Retrieve
no stored variable or alias from
no stored variable or alias mapping
no stored variable or alias cell
NER Data Collator initialized.
NER Metrics function (revised with seqeval and token accuracy) defined.


## 9. NER: Raw Pre-trained Model Evaluation (Baseline before Fine-tuning)

This evaluates the performance of the chosen NER model with its pre-trained weights and a randomly initialized token classification head, *before* any fine-tuning on our specific dataset. This helps establish a performance floor.

In [66]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_dev OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Raw Model Evaluation for {MODEL_NAME_NER} ---")

if len(ner_tokenized_dev) == 0:
    print("Skipping NER raw model evaluation as tokenized dev dataset is empty.")
    raw_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        raw_ner_model = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER,
            num_labels=num_ner_labels,
            id2label=ner_id_to_label,
            label2id=ner_label_to_id,
            ignore_mismatched_sizes=True # Important if base model is not for token classification
        )

        raw_ner_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-raw-eval"
        
        # Minimal TrainingArguments, just for evaluation
        raw_ner_eval_args = TrainingArguments(
            output_dir=str(raw_ner_output_dir),
            per_device_eval_batch_size=16,
            do_train=False,
            do_eval=True,
            report_to="none",
            seed=SEED
        )

        raw_ner_trainer = Trainer(
            model=raw_ner_model,
            args=raw_ner_eval_args,
            eval_dataset=ner_tokenized_dev,
            data_collator=ner_data_collator,
            compute_metrics=compute_metrics_ner_revised
        )

        print("Evaluating raw NER model...")
        raw_ner_metrics = raw_ner_trainer.evaluate()
        print(f"NER Raw Model Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in raw_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
    except Exception as e:
        print(f"Error during NER raw model evaluation: {e}")
        raw_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback
        traceback.print_exc()

all_experiment_results['NER Raw Model'] = raw_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Raw Model Evaluation for t5-small ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluating raw NER model...


NER Raw Model Metrics (t5-small):
  eval_loss: 8.0843
  eval_precision: 0.0002
  eval_recall: 0.0060
  eval_f1: 0.0005
  eval_accuracy: 0.0371
  eval_token_accuracy: 0.0371
  eval_runtime: 1.0664
  eval_samples_per_second: 75.9570
  eval_steps_per_second: 2.8130
Stored 'all_experiment_results' (defaultdict)


## 10. NER: Smoke Run (Quick Test of Training Pipeline)

In [67]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Smoke Run for {MODEL_NAME_NER} ---")

ner_smoke_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-smoke-run"

if len(ner_tokenized_train) < 10 or len(ner_tokenized_dev) < 5: 
    print("Skipping NER smoke run as processed datasets are too small.")
    smoke_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        ner_smoke_model = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER,
            num_labels=num_ner_labels,
            id2label=ner_id_to_label,
            label2id=ner_label_to_id,
            ignore_mismatched_sizes=True
        )
        ner_smoke_args = TrainingArguments(
            output_dir=str(ner_smoke_output_dir),
            per_device_train_batch_size=4, 
            per_device_eval_batch_size=8,
            eval_strategy="steps",       # <--- CORRECTED from evaluation_strategy
            eval_steps=20, 
            logging_strategy="steps",
            logging_steps=10,
            save_strategy="no", 
            max_steps=50, 
            learning_rate=5e-5,
            fp16=torch.cuda.is_available(),
            report_to="none",
            seed=SEED,
            disable_tqdm=False
        )
        smoke_train_subset = ner_tokenized_train.shuffle(seed=SEED).select(range(min(100, len(ner_tokenized_train))))
        smoke_eval_subset = ner_tokenized_dev.shuffle(seed=SEED).select(range(min(50, len(ner_tokenized_dev))))

        ner_smoke_trainer = Trainer(
            model=ner_smoke_model,
            args=ner_smoke_args,
            train_dataset=smoke_train_subset,
            eval_dataset=smoke_eval_subset,
            data_collator=ner_data_collator,
            compute_metrics=compute_metrics_ner_revised,
        )
        print(f"Starting NER smoke training ({ner_smoke_args.max_steps} steps)...")
        ner_smoke_trainer.train()
        print("NER smoke training completed.")
        print("Evaluating NER smoke model...")
        smoke_ner_metrics = ner_smoke_trainer.evaluate()
        print(f"NER Smoke-Run Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in smoke_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
    except Exception as e:
        print(f"Error during NER smoke run: {e}")
        smoke_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback
        traceback.print_exc()

all_experiment_results['NER Smoke Run'] = smoke_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Smoke Run for t5-small ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting NER smoke training (50 steps)...


Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Token Accuracy
20,7.2728,5.832761,0.000107,0.002275,0.000205,0.103257,0.103257
40,5.2225,4.151319,6.4e-05,0.001138,0.000122,0.239809,0.239809


NER smoke training completed.
Evaluating NER smoke model...


NER Smoke-Run Metrics (t5-small):
  eval_loss: 3.9380
  eval_precision: 0.0001
  eval_recall: 0.0011
  eval_f1: 0.0001
  eval_accuracy: 0.2647
  eval_token_accuracy: 0.2647
  eval_runtime: 0.6531
  eval_samples_per_second: 76.5560
  eval_steps_per_second: 6.1240
  epoch: 3.8462
Stored 'all_experiment_results' (defaultdict)


## 11. NER: Baseline Training (Full Model, No HPO)

In [68]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev NUM_TRAIN_EPOCHS_FINAL_NER OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Baseline Training for {MODEL_NAME_NER} ({NUM_TRAIN_EPOCHS_FINAL_NER} epochs) ---")

ner_baseline_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-baseline"

if len(ner_tokenized_train) == 0 or len(ner_tokenized_dev) == 0:
    print("Skipping NER baseline training as processed datasets are empty.")
    baseline_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        ner_baseline_args = TrainingArguments(
            output_dir=str(ner_baseline_output_dir),
            per_device_train_batch_size=8, 
            per_device_eval_batch_size=16,
            learning_rate=3e-5, 
            num_train_epochs=NUM_TRAIN_EPOCHS_FINAL_NER,
            weight_decay=0.01,
            logging_strategy="epoch",
            eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_f1",
            save_total_limit=1, # Save only the best model based on metric_for_best_model
            fp16=torch.cuda.is_available(),
            report_to="none",
            seed=SEED,
            disable_tqdm=False
        )
        ner_baseline_model = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER,
            num_labels=num_ner_labels,
            id2label=ner_id_to_label,
            label2id=ner_label_to_id,
            ignore_mismatched_sizes=True
        )
        ner_baseline_trainer = Trainer(
            model=ner_baseline_model,
            args=ner_baseline_args,
            train_dataset=ner_tokenized_train,
            eval_dataset=ner_tokenized_dev,
            data_collator=ner_data_collator,
            compute_metrics=compute_metrics_ner_revised,
        )
        print("Starting NER baseline training...")
        ner_baseline_trainer.train()
        print("NER baseline training completed.")
        print("Evaluating NER baseline model (best checkpoint automatically loaded)...\n(Note: If 'missing keys' warning for T5 embeddings appeared during training, these results might be unreliable)")
        baseline_ner_metrics = ner_baseline_trainer.evaluate()
        print(f"NER Baseline Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in baseline_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
        
        ner_baseline_trainer.save_model(str(ner_baseline_output_dir / "best_model_from_baseline_run"))
        print(f"NER Baseline model (best) saved to {ner_baseline_output_dir / 'best_model_from_baseline_run'}")
    except Exception as e:
        print(f"Error during NER baseline training: {e}")
        baseline_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback
        traceback.print_exc()

all_experiment_results['NER Baseline (Full FT, No HPO)'] = baseline_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_TRAIN_EPOCHS_FINAL_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Baseline Training for t5-small (15 epochs) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting NER baseline training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Token Accuracy
1,10.9769,9.877737,0.000315,0.008296,0.000607,0.007852,0.007852
2,9.4951,8.396983,0.000348,0.00905,0.000671,0.021046,0.021046
3,8.2177,7.110559,0.00036,0.00905,0.000693,0.048048,0.048048
4,7.0983,5.945067,0.000414,0.009804,0.000795,0.100235,0.100235
5,6.1011,4.926164,0.000455,0.009804,0.000869,0.174937,0.174937
6,5.226,4.050672,0.00048,0.00905,0.000911,0.268869,0.268869
7,4.4868,3.338073,0.000572,0.00905,0.001075,0.372683,0.372683
8,3.8446,2.786341,0.000471,0.006033,0.000874,0.473799,0.473799
9,3.4012,2.385925,0.000746,0.007541,0.001357,0.562737,0.562737
10,3.0425,2.115217,0.000861,0.006787,0.001529,0.635169,0.635169


There were missing keys in the checkpoint model loaded: ['transformer.encoder.embed_tokens.weight'].


NER baseline training completed.
Evaluating NER baseline model (best checkpoint automatically loaded)...


NER Baseline Metrics (t5-small):
  eval_loss: 1.7314
  eval_precision: 0.0013
  eval_recall: 0.0053
  eval_f1: 0.0021
  eval_accuracy: 0.7571
  eval_token_accuracy: 0.7571
  eval_runtime: 0.8774
  eval_samples_per_second: 92.3220
  eval_steps_per_second: 3.4190
  epoch: 15.0000
NER Baseline model (best) saved to /kaggle/working/outputs/t5-small-ner-baseline/best_model_from_baseline_run
Stored 'all_experiment_results' (defaultdict)


## 12. NER: Hyperparameter Optimization (Optuna)

**Note:** HPO on a very small dataset can be misleading. The parameters found might not generalize well. These runs are for demonstration.

### 12.1 Optuna for Full Fine-Tuning (NER)

In [69]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev NUM_OPTUNA_TRIALS_NER NUM_EPOCHS_OPTUNA_NER TMP_BASE_DIR SEED

def ner_ft_objective_optuna(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 7e-5, log=True)
    per_device_train_batch_size = trial.suggest_categorical("batch_size", [4, 8]) 
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.05)

    trial_output_dir = Path(TMP_BASE_DIR) / f"ner-ft-hpo-trial-{MODEL_NAME_NER.replace('/', '_')}-{trial.number}"

    args = TrainingArguments(
        output_dir=str(trial_output_dir),
        learning_rate=learning_rate,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_train_batch_size * 2,
        num_train_epochs=NUM_EPOCHS_OPTUNA_NER, 
        weight_decay=weight_decay,
        eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
        save_strategy="no",          # No saving during HPO trials
        logging_strategy="epoch",
        fp16=torch.cuda.is_available(),
        report_to="none",
        seed=SEED,
        disable_tqdm=True 
    )
    
    # It's important to re-initialize model for each trial to start from scratch
    model_hpo = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME_NER, 
        num_labels=num_ner_labels, 
        id2label=ner_id_to_label, 
        label2id=ner_label_to_id,
        ignore_mismatched_sizes=True # Helpful if config doesn't perfectly match
    )
    
    trainer_hpo = Trainer(
        model=model_hpo, 
        args=args,
        train_dataset=ner_tokenized_train,
        eval_dataset=ner_tokenized_dev,
        data_collator=ner_data_collator, # Ensure ner_data_collator is defined
        compute_metrics=compute_metrics_ner_revised # Ensure compute_metrics_ner_revised is defined
    )
    
    print(f"Optuna Trial {trial.number} (NER-FT): LR={learning_rate:.2e}, BS={per_device_train_batch_size}, WD={weight_decay:.3f}, Epochs={NUM_EPOCHS_OPTUNA_NER}")
    trainer_hpo.train()
    eval_metrics = trainer_hpo.evaluate()
    # Checkpoint for the warning about missing keys - This check is conceptual here as HPO trials don't load best model by default
    # The main concern for "missing keys" is in the final training runs with load_best_model_at_end=True
    return eval_metrics.get("eval_f1", 0.0) 

print(f"--- NER Optuna HPO for Full Fine-Tuning ({MODEL_NAME_NER}) ---")
study_ner_ft = optuna.create_study(direction="maximize", study_name=f"ner-ft-{MODEL_NAME_NER.replace('/', '_')}")
# Fallback parameters if HPO fails or is skipped
best_ner_ft_params_fallback = {"learning_rate": 3e-5, "batch_size": 8, "weight_decay": 0.01}

if NUM_OPTUNA_TRIALS_NER > 0 and \
   ('ner_tokenized_train' in globals() and len(ner_tokenized_train) > 0) and \
   ('ner_tokenized_dev' in globals() and len(ner_tokenized_dev) > 0):
    try:
        study_ner_ft.optimize(ner_ft_objective_optuna, n_trials=NUM_OPTUNA_TRIALS_NER, timeout=10800) # Timeout e.g. 3 hours
        print(f"NER Full-FT HPO Best Params ({MODEL_NAME_NER}): {study_ner_ft.best_params}")
        print(f"NER Full-FT HPO Best F1: {study_ner_ft.best_value:.4f}")
        best_ner_ft_params = study_ner_ft.best_params
    except Exception as e: # Catching broader exceptions for HPO stability
        print(f"An error occurred during NER FT HPO: {e}. Using fallback params.")
        best_ner_ft_params = best_ner_ft_params_fallback
        # import traceback; traceback.print_exc() # Uncomment for full traceback if needed
else:
    print("Skipping NER FT HPO due to NUM_OPTUNA_TRIALS_NER=0 or empty/undefined datasets.")
    best_ner_ft_params = best_ner_ft_params_fallback

%store best_ner_ft_params

[I 2025-05-17 14:53:13,212] A new study created in memory with name: ner-ft-t5-small


no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_OPTUNA_TRIALS_NER
no stored variable or alias NUM_EPOCHS_OPTUNA_NER
no stored variable or alias TMP_BASE_DIR
no stored variable or alias SEED
--- NER Optuna HPO for Full Fine-Tuning (t5-small) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Optuna Trial 0 (NER-FT): LR=2.03e-05, BS=4, WD=0.012, Epochs=4
{'loss': 9.492, 'grad_norm': 1621849.5, 'learning_rate': 1.5410855339738524e-05, 'epoch': 1.0}
{'eval_loss': 8.231178283691406, 'eval_precision': 0.0004989873492030879, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0009605876536233932, 'eval_accuracy': 0.02206078735110304, 'eval_token_accuracy': 0.02206078735110304, 'eval_runtime': 1.0191, 'eval_samples_per_second': 79.48, 'eval_steps_per_second': 5.887, 'epoch': 1.0}
{'loss': 7.8971, 'grad_norm': 1442126.875, 'learning_rate': 1.034728858525301e-05, 'epoch': 2.0}
{'eval_loss': 6.819033622741699, 'eval_precision': 0.0006089948539934837, 'eval_recall': 0.015082956259426848, 'eval_f1': 0.0011707202856557498, 'eval_accuracy': 0.056006623577800334, 'eval_token_accuracy': 0.056006623577800334, 'eval_runtime': 1.0612, 'eval_samples_per_second': 76.328, 'eval_steps_per_second': 5.654, 'epoch': 2.0}
{'loss': 6.8458, 'grad_norm': 1416008.0, 'learning_rate': 5.283721830767494e-06, 

[I 2025-05-17 14:53:33,819] Trial 0 finished with value: 0.001310043668122271 and parameters: {'learning_rate': 2.025426701794206e-05, 'batch_size': 4, 'weight_decay': 0.011924362780870675}. Best is trial 0 with value: 0.001310043668122271.


{'eval_loss': 5.667490482330322, 'eval_precision': 0.0006832823583002538, 'eval_recall': 0.01583710407239819, 'eval_f1': 0.001310043668122271, 'eval_accuracy': 0.11479087655573954, 'eval_token_accuracy': 0.11479087655573954, 'eval_runtime': 1.0186, 'eval_samples_per_second': 79.521, 'eval_steps_per_second': 5.89, 'epoch': 4.0}


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Optuna Trial 1 (NER-FT): LR=1.63e-05, BS=4, WD=0.021, Epochs=4
{'loss': 9.4286, 'grad_norm': 1295028.75, 'learning_rate': 1.2388518963028552e-05, 'epoch': 1.0}
{'eval_loss': 8.321836471557617, 'eval_precision': 0.0005050655099676164, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0009718450764613407, 'eval_accuracy': 0.018001175150900058, 'eval_token_accuracy': 0.018001175150900058, 'eval_runtime': 1.0531, 'eval_samples_per_second': 76.913, 'eval_steps_per_second': 5.697, 'epoch': 1.0}
{'loss': 8.2547, 'grad_norm': 1138477.75, 'learning_rate': 8.318005589462028e-06, 'epoch': 2.0}
{'eval_loss': 7.240435600280762, 'eval_precision': 0.0005438888049554314, 'eval_recall': 0.013574660633484163, 'eval_f1': 0.0010458731588274602, 'eval_accuracy': 0.035361358901768065, 'eval_token_accuracy': 0.035361358901768065, 'eval_runtime': 1.034, 'eval_samples_per_second': 78.337, 'eval_steps_per_second': 5.803, 'epoch': 2.0}
{'loss': 7.4147, 'grad_norm': 1244269.375, 'learning_rate': 4.247492215895503e

[I 2025-05-17 14:53:54,519] Trial 1 finished with value: 0.001073409267100006 and parameters: {'learning_rate': 1.6282053494266096e-05, 'batch_size': 4, 'weight_decay': 0.021412597760056547}. Best is trial 0 with value: 0.001310043668122271.


{'eval_loss': 6.379782676696777, 'eval_precision': 0.0005587979634918664, 'eval_recall': 0.013574660633484163, 'eval_f1': 0.001073409267100006, 'eval_accuracy': 0.06174883820308744, 'eval_token_accuracy': 0.06174883820308744, 'eval_runtime': 1.0378, 'eval_samples_per_second': 78.05, 'eval_steps_per_second': 5.781, 'epoch': 4.0}


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Optuna Trial 2 (NER-FT): LR=6.62e-05, BS=8, WD=0.001, Epochs=4
{'loss': 8.6314, 'grad_norm': 1226751.5, 'learning_rate': 5.1022509153600764e-05, 'epoch': 1.0}
{'eval_loss': 6.510069847106934, 'eval_precision': 0.0005550416281221092, 'eval_recall': 0.013574660633484163, 'eval_f1': 0.001066477070742979, 'eval_accuracy': 0.056033331552801664, 'eval_token_accuracy': 0.056033331552801664, 'eval_runtime': 1.0296, 'eval_samples_per_second': 78.672, 'eval_steps_per_second': 2.914, 'epoch': 1.0}
{'loss': 6.0878, 'grad_norm': 975864.9375, 'learning_rate': 3.4474668347027545e-05, 'epoch': 2.0}
{'eval_loss': 4.287390232086182, 'eval_precision': 0.0006337608112138383, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0012078152753108348, 'eval_accuracy': 0.21478553496073927, 'eval_token_accuracy': 0.21478553496073927, 'eval_runtime': 0.9953, 'eval_samples_per_second': 81.382, 'eval_steps_per_second': 3.014, 'epoch': 2.0}
{'loss': 4.4414, 'grad_norm': 1043700.375, 'learning_rate': 1.792682754045432e-

[I 2025-05-17 14:54:12,691] Trial 2 finished with value: 0.001266357112705783 and parameters: {'learning_rate': 6.619136322629288e-05, 'batch_size': 8, 'weight_decay': 0.0012090972115940957}. Best is trial 0 with value: 0.001310043668122271.


{'eval_loss': 2.7551262378692627, 'eval_precision': 0.0006808124361738341, 'eval_recall': 0.00904977375565611, 'eval_f1': 0.001266357112705783, 'eval_accuracy': 0.45822872709791146, 'eval_token_accuracy': 0.45822872709791146, 'eval_runtime': 0.9403, 'eval_samples_per_second': 86.147, 'eval_steps_per_second': 3.191, 'epoch': 4.0}
NER Full-FT HPO Best Params (t5-small): {'learning_rate': 2.025426701794206e-05, 'batch_size': 4, 'weight_decay': 0.011924362780870675}
NER Full-FT HPO Best F1: 0.0013
Stored 'best_ner_ft_params' (dict)


### 12.2 Optuna for LoRA Fine-Tuning (NER)

In [70]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev NUM_OPTUNA_TRIALS_NER NUM_EPOCHS_OPTUNA_NER TMP_BASE_DIR SEED

def ner_lora_objective_optuna(trial):
    learning_rate = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)
    r = trial.suggest_categorical("r", [4, 8, 16])
    lora_alpha = trial.suggest_categorical("lora_alpha", [r, r * 2])
    lora_dropout = trial.suggest_float("lora_dropout", 0.05, 0.2)
    per_device_train_batch_size = trial.suggest_categorical("batch_size", [4, 8])

    trial_output_dir = Path(TMP_BASE_DIR) / f"ner-lora-hpo-trial-{MODEL_NAME_NER.replace('/', '_')}-{trial.number}"
    lora_config = LoraConfig(task_type=TaskType.TOKEN_CLS, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout, bias="none")
    base_model_hpo = AutoModelForTokenClassification.from_pretrained(MODEL_NAME_NER, num_labels=num_ner_labels, id2label=ner_id_to_label, label2id=ner_label_to_id, ignore_mismatched_sizes=True)
    lora_model_hpo = get_peft_model(base_model_hpo, lora_config)
    
    args = TrainingArguments(
        output_dir=str(trial_output_dir), learning_rate=learning_rate, per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_train_batch_size * 2, num_train_epochs=NUM_EPOCHS_OPTUNA_NER,
        evaluation_strategy="epoch", save_strategy="no", logging_strategy="epoch", fp16=torch.cuda.is_available(),
        report_to="none", seed=SEED, disable_tqdm=True
    )
    trainer_hpo = Trainer(
        model=lora_model_hpo, args=args, train_dataset=ner_tokenized_train, eval_dataset=ner_tokenized_dev,
        data_collator=ner_data_collator, compute_metrics=compute_metrics_ner_revised
    )
    print(f"Optuna Trial {trial.number} (NER-LoRA): LR={learning_rate:.2e}, R={r}, Alpha={lora_alpha}, Dropout={lora_dropout:.2f}, BS={per_device_train_batch_size}, Epochs={NUM_EPOCHS_OPTUNA_NER}")
    trainer_hpo.train()
    eval_metrics = trainer_hpo.evaluate()
    return eval_metrics.get("eval_f1", 0.0)

print(f"--- NER Optuna HPO for LoRA ({MODEL_NAME_NER}) ---")
study_ner_lora = optuna.create_study(direction="maximize", study_name=f"ner-lora-{MODEL_NAME_NER.replace('/', '_')}")
best_ner_lora_params_fallback = {"learning_rate": 1e-4, "r": 8, "lora_alpha": 16, "lora_dropout": 0.1, "batch_size": 8}

if NUM_OPTUNA_TRIALS_NER > 0 and len(ner_tokenized_train) > 0 and len(ner_tokenized_dev) > 0:
    try:
        study_ner_lora.optimize(ner_lora_objective_optuna, n_trials=NUM_OPTUNA_TRIALS_NER, timeout=10800)
        print(f"NER LoRA HPO Best Params ({MODEL_NAME_NER}): {study_ner_lora.best_params}")
        print(f"NER LoRA HPO Best F1: {study_ner_lora.best_value:.4f}")
        best_ner_lora_params = study_ner_lora.best_params
    except Exception as e:
        print(f"An error occurred during NER LoRA HPO: {e}. Using fallback params.")
        best_ner_lora_params = best_ner_lora_params_fallback
else:
    print("Skipping NER LoRA HPO due to NUM_OPTUNA_TRIALS_NER=0 or empty datasets.")
    best_ner_lora_params = best_ner_lora_params_fallback

%store best_ner_lora_params

[I 2025-05-17 14:54:12,709] A new study created in memory with name: ner-lora-t5-small


no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_OPTUNA_TRIALS_NER
no stored variable or alias NUM_EPOCHS_OPTUNA_NER
no stored variable or alias TMP_BASE_DIR
no stored variable or alias SEED
--- NER Optuna HPO for LoRA (t5-small) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[W 2025-05-17 14:54:13,007] Trial 0 failed with parameters: {'learning_rate': 0.0002621560611174821, 'r': 4, 'lora_alpha': 8, 'lora_dropout': 0.16303205936500698, 'batch_size': 8} because of the following error: TypeError("TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'").
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/optuna/study/_optimize.py", line 197, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/tmp/ipykernel_35/571318928.py", line 15, in ner_lora_objective_optuna
    args = TrainingArguments(
           ^^^^^^^^^^^^^^^^^^
TypeError: TrainingArguments.__init__() got an unexpected keyword a

An error occurred during NER LoRA HPO: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'. Using fallback params.
Stored 'best_ner_lora_params' (dict)


### 12.3 Optuna for Partial-Freeze Fine-Tuning (NER)

In [71]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev NUM_OPTUNA_TRIALS_NER NUM_EPOCHS_OPTUNA_NER TMP_BASE_DIR SEED

def ner_lora_objective_optuna(trial):
    learning_rate = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)
    r = trial.suggest_categorical("r", [4, 8, 16])
    # lora_alpha = trial.suggest_categorical("lora_alpha", [r, r * 2]) # <-- Incorrect: Dynamic choices
    lora_alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64]) # <-- Corrected: Fixed choices
    lora_dropout = trial.suggest_float("lora_dropout", 0.05, 0.2)
    per_device_train_batch_size = trial.suggest_categorical("batch_size", [4, 8])

    trial_output_dir = Path(TMP_BASE_DIR) / f"ner-lora-hpo-trial-{MODEL_NAME_NER.replace('/', '_')}-{trial.number}"
    
    lora_config = LoraConfig(
        task_type=TaskType.TOKEN_CLS, 
        r=r, 
        lora_alpha=lora_alpha, 
        lora_dropout=lora_dropout, 
        bias="none"
    )
    
    base_model_hpo = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME_NER, 
        num_labels=num_ner_labels, 
        id2label=ner_id_to_label, 
        label2id=ner_label_to_id, 
        ignore_mismatched_sizes=True
    )
    lora_model_hpo = get_peft_model(base_model_hpo, lora_config)
    
    args = TrainingArguments(
        output_dir=str(trial_output_dir), 
        learning_rate=learning_rate, 
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_train_batch_size * 2, 
        num_train_epochs=NUM_EPOCHS_OPTUNA_NER,
        eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
        save_strategy="no", 
        logging_strategy="epoch", 
        fp16=torch.cuda.is_available(),
        report_to="none", 
        seed=SEED, 
        disable_tqdm=True
    )
    
    trainer_hpo = Trainer(
        model=lora_model_hpo, 
        args=args, 
        train_dataset=ner_tokenized_train, 
        eval_dataset=ner_tokenized_dev,
        data_collator=ner_data_collator, # Ensure ner_data_collator is defined
        compute_metrics=compute_metrics_ner_revised # Ensure compute_metrics_ner_revised is defined
    )
    
    print(f"Optuna Trial {trial.number} (NER-LoRA): LR={learning_rate:.2e}, R={r}, Alpha={lora_alpha}, Dropout={lora_dropout:.2f}, BS={per_device_train_batch_size}, Epochs={NUM_EPOCHS_OPTUNA_NER}")
    # It's good practice to check if datasets are empty before training, though the outer block does this
    if len(ner_tokenized_train) == 0 or len(ner_tokenized_dev) == 0:
        print(f"Skipping trial {trial.number} due to empty train/dev dataset for NER LoRA.")
        return 0.0 # Return a default low value if datasets are empty

    trainer_hpo.train()
    eval_metrics = trainer_hpo.evaluate()
    return eval_metrics.get("eval_f1", 0.0)

print(f"--- NER Optuna HPO for LoRA ({MODEL_NAME_NER}) ---")
study_ner_lora = optuna.create_study(direction="maximize", study_name=f"ner-lora-{MODEL_NAME_NER.replace('/', '_')}")
# Fallback parameters if HPO fails or is skipped
best_ner_lora_params_fallback = {"learning_rate": 1e-4, "r": 8, "lora_alpha": 16, "lora_dropout": 0.1, "batch_size": 8}

if NUM_OPTUNA_TRIALS_NER > 0 and \
   ('ner_tokenized_train' in globals() and len(ner_tokenized_train) > 0) and \
   ('ner_tokenized_dev' in globals() and len(ner_tokenized_dev) > 0):
    try:
        study_ner_lora.optimize(ner_lora_objective_optuna, n_trials=NUM_OPTUNA_TRIALS_NER, timeout=10800) # Timeout e.g. 3 hours
        print(f"NER LoRA HPO Best Params ({MODEL_NAME_NER}): {study_ner_lora.best_params}")
        print(f"NER LoRA HPO Best F1: {study_ner_lora.best_value:.4f}")
        best_ner_lora_params = study_ner_lora.best_params
    except Exception as e: # Catching broader exceptions for HPO stability
        print(f"An error occurred during NER LoRA HPO: {e}. Using fallback params.")
        best_ner_lora_params = best_ner_lora_params_fallback
        # import traceback; traceback.print_exc() # Uncomment for full traceback if needed
else:
    print("Skipping NER LoRA HPO due to NUM_OPTUNA_TRIALS_NER=0 or empty/undefined datasets.")
    best_ner_lora_params = best_ner_lora_params_fallback

%store best_ner_lora_params

[I 2025-05-17 14:54:13,029] A new study created in memory with name: ner-lora-t5-small


no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_OPTUNA_TRIALS_NER
no stored variable or alias NUM_EPOCHS_OPTUNA_NER
no stored variable or alias TMP_BASE_DIR
no stored variable or alias SEED
--- NER Optuna HPO for LoRA (t5-small) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Optuna Trial 0 (NER-LoRA): LR=2.65e-04, R=4, Alpha=32, Dropout=0.12, BS=8, Epochs=4
{'loss': 9.7162, 'grad_norm': 225311.65625, 'learning_rate': 0.0002045208201517386, 'epoch': 1.0}
{'eval_loss': 8.633600234985352, 'eval_precision': 0.00014693781591630423, 'eval_recall': 0.003770739064856712, 'eval_f1': 0.00028285342535498107, 'eval_accuracy': 0.025292452326264623, 'eval_token_accuracy': 0.025292452326264623, 'eval_runtime': 1.0204, 'eval_samples_per_second': 79.378, 'eval_steps_per_second': 2.94, 'epoch': 1.0}
{'loss': 8.8647, 'grad_norm': 303229.3125, 'learning_rate': 0.0001381897433457693, 'epoch': 2.0}
{'eval_loss': 7.520958423614502, 'eval_precision': 0.00018159806295399514, 'eval_recall': 0.004524886877828055, 'eval_f1': 0.00034918233137403245, 'eval_accuracy': 0.04786069120239304, 'eval_token_accuracy': 0.04786069120239304, 'eval_runtime': 1.0189, 'eval_samples_per_second': 79.494, 'eval_steps_per_second': 2.944, 'epoch': 2.0}
{'loss': 7.9561, 'grad_norm': 418444.65625, 'learnin

[I 2025-05-17 14:54:28,967] Trial 0 finished with value: 0.00043272648595184374 and parameters: {'learning_rate': 0.0002653243072238771, 'r': 4, 'lora_alpha': 32, 'lora_dropout': 0.12113812703836473, 'batch_size': 8}. Best is trial 0 with value: 0.00043272648595184374.


{'eval_loss': 6.229246616363525, 'eval_precision': 0.00022560995262190995, 'eval_recall': 0.005279034690799397, 'eval_f1': 0.00043272648595184374, 'eval_accuracy': 0.09865925965493297, 'eval_token_accuracy': 0.09865925965493297, 'eval_runtime': 0.9858, 'eval_samples_per_second': 82.17, 'eval_steps_per_second': 3.043, 'epoch': 4.0}


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Optuna Trial 1 (NER-LoRA): LR=1.29e-04, R=8, Alpha=32, Dropout=0.11, BS=4, Epochs=4
{'loss': 9.8971, 'grad_norm': 164972.75, 'learning_rate': 9.778508432084414e-05, 'epoch': 1.0}
{'eval_loss': 9.215785026550293, 'eval_precision': 0.00047239444936521995, 'eval_recall': 0.012066365007541479, 'eval_f1': 0.0009091942266166611, 'eval_accuracy': 0.010416110250520805, 'eval_token_accuracy': 0.010416110250520805, 'eval_runtime': 1.0874, 'eval_samples_per_second': 74.492, 'eval_steps_per_second': 5.518, 'epoch': 1.0}
{'loss': 9.1802, 'grad_norm': 191356.0625, 'learning_rate': 6.565569947256679e-05, 'epoch': 2.0}
{'eval_loss': 8.297831535339355, 'eval_precision': 0.0005045558424598581, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0009709015106085268, 'eval_accuracy': 0.01845521072592276, 'eval_token_accuracy': 0.01845521072592276, 'eval_runtime': 1.0679, 'eval_samples_per_second': 75.853, 'eval_steps_per_second': 5.619, 'epoch': 2.0}
{'loss': 8.3692, 'grad_norm': 256646.765625, 'learning_rat

[I 2025-05-17 14:54:46,591] Trial 1 finished with value: 0.0011012896681640342 and parameters: {'learning_rate': 0.00012851753939310945, 'r': 8, 'lora_alpha': 32, 'lora_dropout': 0.11065605956967692, 'batch_size': 4}. Best is trial 1 with value: 0.0011012896681640342.


{'eval_loss': 7.2410125732421875, 'eval_precision': 0.0005726513758702794, 'eval_recall': 0.014328808446455505, 'eval_f1': 0.0011012896681640342, 'eval_accuracy': 0.036135890176806795, 'eval_token_accuracy': 0.036135890176806795, 'eval_runtime': 1.0597, 'eval_samples_per_second': 76.436, 'eval_steps_per_second': 5.662, 'epoch': 4.0}


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Optuna Trial 2 (NER-LoRA): LR=5.62e-05, R=8, Alpha=64, Dropout=0.12, BS=4, Epochs=4
{'loss': 9.9808, 'grad_norm': 218381.9375, 'learning_rate': 4.276498496000552e-05, 'epoch': 1.0}
{'eval_loss': 9.466277122497559, 'eval_precision': 0.0004715451946597507, 'eval_recall': 0.012066365007541479, 'eval_f1': 0.0009076211816093258, 'eval_accuracy': 0.008920463650446023, 'eval_token_accuracy': 0.008920463650446023, 'eval_runtime': 1.048, 'eval_samples_per_second': 77.287, 'eval_steps_per_second': 5.725, 'epoch': 1.0}
{'loss': 9.612, 'grad_norm': 236333.96875, 'learning_rate': 2.871363275886085e-05, 'epoch': 2.0}
{'eval_loss': 9.028155326843262, 'eval_precision': 0.0005022601707684581, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0009666505558240697, 'eval_accuracy': 0.011617969125580899, 'eval_token_accuracy': 0.011617969125580899, 'eval_runtime': 1.0596, 'eval_samples_per_second': 76.441, 'eval_steps_per_second': 5.662, 'epoch': 2.0}
{'loss': 9.2302, 'grad_norm': 286332.96875, 'learning_ra

[I 2025-05-17 14:55:04,411] Trial 2 finished with value: 0.0009687713699566903 and parameters: {'learning_rate': 5.620540880457868e-05, 'r': 8, 'lora_alpha': 64, 'lora_dropout': 0.11845149938383173, 'batch_size': 4}. Best is trial 1 with value: 0.0011012896681640342.


{'eval_loss': 8.545991897583008, 'eval_precision': 0.0005034053893988748, 'eval_recall': 0.01282051282051282, 'eval_f1': 0.0009687713699566903, 'eval_accuracy': 0.015410501575770525, 'eval_token_accuracy': 0.015410501575770525, 'eval_runtime': 1.0759, 'eval_samples_per_second': 75.284, 'eval_steps_per_second': 5.577, 'epoch': 4.0}
NER LoRA HPO Best Params (t5-small): {'learning_rate': 0.00012851753939310945, 'r': 8, 'lora_alpha': 32, 'lora_dropout': 0.11065605956967692, 'batch_size': 4}
NER LoRA HPO Best F1: 0.0011
Stored 'best_ner_lora_params' (dict)


## 13. NER: Final Training Runs with Best HPO Parameters

### 13.1 Final Full Fine-Tuning (NER)

In [72]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev best_ner_ft_params NUM_TRAIN_EPOCHS_FINAL_NER OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Final Full Fine-Tuning for {MODEL_NAME_NER} ({NUM_TRAIN_EPOCHS_FINAL_NER} epochs) ---")

if 'best_ner_ft_params' not in globals() or not isinstance(best_ner_ft_params, dict):
    print("Warning: best_ner_ft_params not found or invalid from HPO. Using default FT params for final run.")
    best_ner_ft_params = {"learning_rate": 3e-5, "batch_size": 8, "weight_decay": 0.01}

ner_final_ft_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-ft-final"

if len(ner_tokenized_train) == 0 or len(ner_tokenized_dev) == 0:
    print("Skipping NER final FT training as processed datasets are empty.")
    final_ft_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        ft_final_args_ner = TrainingArguments(
            output_dir=str(ner_final_ft_output_dir),
            learning_rate=best_ner_ft_params["learning_rate"],
            per_device_train_batch_size=best_ner_ft_params["batch_size"],
            per_device_eval_batch_size=best_ner_ft_params["batch_size"] * 2,
            num_train_epochs=NUM_TRAIN_EPOCHS_FINAL_NER, 
            weight_decay=best_ner_ft_params.get("weight_decay", 0.01),
            logging_strategy="epoch",
            eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_f1",
            save_total_limit=1,
            fp16=torch.cuda.is_available(),
            report_to="none",
            seed=SEED,
            disable_tqdm=False
        )
        ft_final_model_ner = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER, num_labels=num_ner_labels, id2label=ner_id_to_label, label2id=ner_label_to_id, ignore_mismatched_sizes=True
        )
        ft_final_trainer_ner = Trainer(
            model=ft_final_model_ner, args=ft_final_args_ner, train_dataset=ner_tokenized_train,
            eval_dataset=ner_tokenized_dev, data_collator=ner_data_collator, compute_metrics=compute_metrics_ner_revised
        )
        print("Starting NER final full fine-tuning...")
        ft_final_trainer_ner.train()
        print("NER final full fine-tuning completed.")
        print("Evaluating final NER FT model (best checkpoint automatically loaded)...")
        final_ft_ner_metrics = ft_final_trainer_ner.evaluate()
        print(f"NER Final Full-FT Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in final_ft_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
        ft_final_trainer_ner.save_model(str(ner_final_ft_output_dir / "best_model"))
        print(f"Final NER FT model saved to {ner_final_ft_output_dir / 'best_model'}")
    except Exception as e:
        print(f"Error during NER final FT run: {e}")
        final_ft_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback; traceback.print_exc()

all_experiment_results['NER Final Full-FT (HPO)'] = final_ft_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_TRAIN_EPOCHS_FINAL_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Final Full Fine-Tuning for t5-small (15 epochs) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting NER final full fine-tuning...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Token Accuracy
1,9.2181,7.791439,0.000478,0.012066,0.00092,0.025159,0.025159
2,7.429,5.880901,0.000573,0.013575,0.001099,0.085279,0.085279
3,5.7155,4.154993,0.000537,0.010558,0.001022,0.233908,0.233908
4,4.2652,2.769305,0.000626,0.008296,0.001165,0.458122,0.458122
5,3.1308,1.894585,0.00101,0.006787,0.001759,0.672266,0.672266
6,2.4555,1.525349,0.000792,0.002262,0.001173,0.792212,0.792212
7,2.1136,1.321472,0.001136,0.002262,0.001512,0.819614,0.819614
8,1.8727,1.166644,0.001633,0.003017,0.002119,0.825624,0.825624
9,1.7396,1.065592,0.001353,0.002262,0.001693,0.831286,0.831286
10,1.6356,1.000143,0.001453,0.002262,0.00177,0.834811,0.834811


There were missing keys in the checkpoint model loaded: ['transformer.encoder.embed_tokens.weight'].


NER final full fine-tuning completed.
Evaluating final NER FT model (best checkpoint automatically loaded)...


NER Final Full-FT Metrics (t5-small):
  eval_loss: 0.9197
  eval_precision: 0.0036
  eval_recall: 0.0045
  eval_f1: 0.0040
  eval_accuracy: 0.8428
  eval_token_accuracy: 0.8428
  eval_runtime: 0.8830
  eval_samples_per_second: 91.7360
  eval_steps_per_second: 6.7950
  epoch: 15.0000
Final NER FT model saved to /kaggle/working/outputs/t5-small-ner-ft-final/best_model
Stored 'all_experiment_results' (defaultdict)


### 13.2 Final LoRA Fine-Tuning (NER)

In [73]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev best_ner_lora_params NUM_TRAIN_EPOCHS_FINAL_NER OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Final LoRA Fine-Tuning for {MODEL_NAME_NER} ({NUM_TRAIN_EPOCHS_FINAL_NER} epochs) ---")

if 'best_ner_lora_params' not in globals() or not isinstance(best_ner_lora_params, dict):
    print("Warning: best_ner_lora_params not found from HPO. Using default LoRA params for final run.")
    best_ner_lora_params = {"learning_rate": 1e-4, "r": 8, "lora_alpha": 16, "lora_dropout": 0.1, "batch_size": 8}

ner_final_lora_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-lora-final"

if len(ner_tokenized_train) == 0 or len(ner_tokenized_dev) == 0:
    print("Skipping NER final LoRA training as processed datasets are empty.")
    final_lora_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        lora_final_config_ner = LoraConfig(
            task_type=TaskType.TOKEN_CLS, 
            r=best_ner_lora_params["r"], 
            lora_alpha=best_ner_lora_params["lora_alpha"], 
            lora_dropout=best_ner_lora_params["lora_dropout"], 
            bias="none"
        )
        base_model_ner_lora_final = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER, 
            num_labels=num_ner_labels, 
            id2label=ner_id_to_label, 
            label2id=ner_label_to_id, 
            ignore_mismatched_sizes=True
        )
        lora_final_model_ner_peft = get_peft_model(base_model_ner_lora_final, lora_final_config_ner)
        
        lora_final_args_ner = TrainingArguments(
            output_dir=str(ner_final_lora_output_dir), 
            learning_rate=best_ner_lora_params["learning_rate"],
            per_device_train_batch_size=best_ner_lora_params["batch_size"], 
            per_device_eval_batch_size=best_ner_lora_params["batch_size"] * 2,
            num_train_epochs=NUM_TRAIN_EPOCHS_FINAL_NER, 
            logging_strategy="epoch", 
            eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
            save_strategy="epoch", 
            load_best_model_at_end=True, 
            metric_for_best_model="eval_f1",
            save_total_limit=1, 
            fp16=torch.cuda.is_available(), 
            report_to="none", 
            seed=SEED, 
            disable_tqdm=False
        )
        lora_final_trainer_ner = Trainer(
            model=lora_final_model_ner_peft, 
            args=lora_final_args_ner, 
            train_dataset=ner_tokenized_train,
            eval_dataset=ner_tokenized_dev, 
            data_collator=ner_data_collator, # Ensure ner_data_collator is defined
            compute_metrics=compute_metrics_ner_revised # Ensure compute_metrics_ner_revised is defined
        )
        print("Starting NER final LoRA training...")
        lora_final_trainer_ner.train()
        print("NER final LoRA training completed.")
        print("Evaluating final NER LoRA model (best checkpoint automatically loaded)...")
        final_lora_ner_metrics = lora_final_trainer_ner.evaluate()
        print(f"NER Final LoRA Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in final_lora_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
        lora_final_trainer_ner.save_model(str(ner_final_lora_output_dir / "best_model"))
        print(f"Final NER LoRA model saved to {ner_final_lora_output_dir / 'best_model'}")
    except Exception as e:
        print(f"Error during NER final LoRA run: {e}")
        final_lora_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback; traceback.print_exc()

all_experiment_results['NER Final LoRA (HPO)'] = final_lora_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias NUM_TRAIN_EPOCHS_FINAL_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Final LoRA Fine-Tuning for t5-small (15 epochs) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForTokenClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting NER final LoRA training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Token Accuracy
1,10.1685,9.504787,0.000492,0.012821,0.000948,0.010069,0.010069
2,9.0744,7.858705,0.000504,0.012821,0.00097,0.028417,0.028417
3,7.2042,5.282507,0.000713,0.015837,0.001364,0.144784,0.144784
4,4.6369,2.384498,0.001042,0.010558,0.001898,0.569815,0.569815
5,2.5681,1.391037,0.000502,0.000754,0.000603,0.835292,0.835292
6,1.8689,1.130109,0.0,0.0,0.0,0.863469,0.863469
7,1.6395,1.027583,0.0,0.0,0.0,0.871481,0.871481
8,1.4961,0.971962,0.0,0.0,0.0,0.872149,0.872149
9,1.443,0.914114,0.0,0.0,0.0,0.872176,0.872176
10,1.3808,0.886717,0.0,0.0,0.0,0.872202,0.872202


NER final LoRA training completed.
Evaluating final NER LoRA model (best checkpoint automatically loaded)...


NER Final LoRA Metrics (t5-small):
  eval_loss: 2.3845
  eval_precision: 0.0010
  eval_recall: 0.0106
  eval_f1: 0.0019
  eval_accuracy: 0.5698
  eval_token_accuracy: 0.5698
  eval_runtime: 0.9937
  eval_samples_per_second: 81.5150
  eval_steps_per_second: 6.0380
  epoch: 15.0000
Final NER LoRA model saved to /kaggle/working/outputs/t5-small-ner-lora-final/best_model
Stored 'all_experiment_results' (defaultdict)


### 13.3 Final Partial-Freeze Fine-Tuning (NER)

In [74]:
%store -r MODEL_NAME_NER num_ner_labels ner_id_to_label ner_label_to_id ner_tokenized_train ner_tokenized_dev best_ner_freeze_params NUM_TRAIN_EPOCHS_FINAL_NER OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- NER Final Partial-Freeze for {MODEL_NAME_NER} ({NUM_TRAIN_EPOCHS_FINAL_NER} epochs) ---")

if 'best_ner_freeze_params' not in globals() or not isinstance(best_ner_freeze_params, dict):
    print("Warning: best_ner_freeze_params not found from HPO. Using default Freeze params for final run.")
    model_config_temp_f_final = AutoConfig.from_pretrained(MODEL_NAME_NER)
    num_total_encoder_layers_f_final = getattr(model_config_temp_f_final, 'num_layers', getattr(model_config_temp_f_final, 'num_hidden_layers', 6))
    best_ner_freeze_params = {"learning_rate": 3e-5, "num_layers_to_freeze": num_total_encoder_layers_f_final // 2, "batch_size": 8}

ner_final_freeze_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_NER.replace('/', '_')}-ner-freeze-final"

if len(ner_tokenized_train) == 0 or len(ner_tokenized_dev) == 0:
    print("Skipping NER final Freeze training as processed datasets are empty.")
    final_freeze_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
else:
    try:
        freeze_final_model_ner = AutoModelForTokenClassification.from_pretrained(
            MODEL_NAME_NER, num_labels=num_ner_labels, id2label=ner_id_to_label, label2id=ner_label_to_id, ignore_mismatched_sizes=True
        )
        num_layers_to_freeze_final = best_ner_freeze_params["num_layers_to_freeze"]
        param_prefix_to_freeze = "encoder.block." if "t5" in MODEL_NAME_NER.lower() else "encoder.layer."
        print(f"Freezing first {num_layers_to_freeze_final} encoder blocks (prefix: '{param_prefix_to_freeze}') for NER model...")
        for name, param in freeze_final_model_ner.named_parameters():
            param.requires_grad = True
            if name.startswith(param_prefix_to_freeze):
                try:
                    layer_idx = int(name.split('.')[2])
                    if layer_idx < num_layers_to_freeze_final: param.requires_grad = False
                except: pass
            elif 'classifier' in name: param.requires_grad = True
            elif 'shared.weight' in name and "t5" in MODEL_NAME_NER.lower(): param.requires_grad = True
            elif 'embeddings' in name and "bert" in MODEL_NAME_NER.lower(): param.requires_grad = True

        freeze_final_args_ner = TrainingArguments(
            output_dir=str(ner_final_freeze_output_dir), 
            learning_rate=best_ner_freeze_params["learning_rate"],
            per_device_train_batch_size=best_ner_freeze_params["batch_size"], 
            per_device_eval_batch_size=best_ner_freeze_params["batch_size"] * 2,
            num_train_epochs=NUM_TRAIN_EPOCHS_FINAL_NER, 
            logging_strategy="epoch", 
            eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
            save_strategy="epoch", 
            load_best_model_at_end=True, 
            metric_for_best_model="eval_f1",
            save_total_limit=1, 
            fp16=torch.cuda.is_available(), 
            report_to="none", 
            seed=SEED, 
            disable_tqdm=False
        )
        freeze_final_trainer_ner = Trainer(
            model=freeze_final_model_ner, 
            args=freeze_final_args_ner, 
            train_dataset=ner_tokenized_train,
            eval_dataset=ner_tokenized_dev, 
            data_collator=ner_data_collator, # Ensure ner_data_collator is defined
            compute_metrics=compute_metrics_ner_revised # Ensure compute_metrics_ner_revised is defined
        )
        print("Starting NER final partial-freeze training...")
        freeze_final_trainer_ner.train()
        print("NER final partial-freeze training completed.")
        print("Evaluating final NER Partial-Freeze model (best checkpoint automatically loaded)...")
        final_freeze_ner_metrics = freeze_final_trainer_ner.evaluate()
        print(f"NER Final Partial-Freeze Metrics ({MODEL_NAME_NER}):")
        for k, v_metric in final_freeze_ner_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
        freeze_final_trainer_ner.save_model(str(ner_final_freeze_output_dir / "best_model"))
        print(f"Final NER Partial-Freeze model saved to {ner_final_freeze_output_dir / 'best_model'}")
    except Exception as e:
        print(f"Error during NER final Freeze run: {e}")
        final_freeze_ner_metrics = {key: 0.0 for key in ['eval_loss', 'eval_precision', 'eval_recall', 'eval_f1', 'eval_accuracy', 'eval_token_accuracy']}
        import traceback; traceback.print_exc()

all_experiment_results['NER Final Partial-Freeze (HPO)'] = final_freeze_ner_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_NER
no stored variable or alias best_ner_freeze_params
no stored variable or alias NUM_TRAIN_EPOCHS_FINAL_NER
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- NER Final Partial-Freeze for t5-small (15 epochs) ---


Some weights of T5ForTokenClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Freezing first 3 encoder blocks (prefix: 'encoder.block.') for NER model...
Starting NER final partial-freeze training...


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Token Accuracy
1,9.6898,8.464564,0.000497,0.012821,0.000958,0.019337,0.019337
2,8.0959,6.837008,0.000608,0.015083,0.00117,0.055205,0.055205
3,6.6352,5.348951,0.000733,0.016591,0.001405,0.135703,0.135703
4,5.3856,4.015586,0.000875,0.016591,0.001663,0.26716,0.26716
5,4.2339,2.963552,0.001183,0.016591,0.002208,0.440575,0.440575
6,3.3643,2.210903,0.001003,0.00905,0.001805,0.609743,0.609743
7,2.7512,1.785083,0.000886,0.004525,0.001482,0.732653,0.732653
8,2.3404,1.591673,0.001061,0.003017,0.00157,0.800465,0.800465
9,2.1782,1.481024,0.001636,0.003017,0.002121,0.828668,0.828668
10,2.02,1.380553,0.000989,0.001508,0.001194,0.837322,0.837322


There were missing keys in the checkpoint model loaded: ['transformer.encoder.embed_tokens.weight'].


NER final partial-freeze training completed.
Evaluating final NER Partial-Freeze model (best checkpoint automatically loaded)...


NER Final Partial-Freeze Metrics (t5-small):
  eval_loss: 2.9636
  eval_precision: 0.0012
  eval_recall: 0.0166
  eval_f1: 0.0022
  eval_accuracy: 0.4406
  eval_token_accuracy: 0.4406
  eval_runtime: 0.9458
  eval_samples_per_second: 85.6410
  eval_steps_per_second: 3.1720
  epoch: 15.0000
Final NER Partial-Freeze model saved to /kaggle/working/outputs/t5-small-ner-freeze-final/best_model
Stored 'all_experiment_results' (defaultdict)


# Part II: Relation Extraction (RE)

## 14. RE: Label Mapping

In [75]:
print("--- RE Label Mapping ---")
all_re_relation_types_from_data = []
if 'train_docs_raw' in globals() and train_docs_raw:
    for doc in train_docs_raw:
        for triple in doc.get('triples', []):
            all_re_relation_types_from_data.append(triple.get('relation', 'UNKNOWN_REL'))
else:
    print("Warning: train_docs_raw not found or empty. RE label set will be minimal.")

if not all_re_relation_types_from_data:
    print("CRITICAL WARNING: No relation types found in training data. Using minimal fallback for RE.")
    unique_re_relations_from_data = ["RELATED_TO"]
else:
    re_relation_counts = Counter(all_re_relation_types_from_data)
    unique_re_relations_from_data = sorted(list(re_relation_counts.keys() - {'UNKNOWN_REL'}))

NO_RELATION_LABEL_RE = "NO_RELATION"
re_labels_list = [NO_RELATION_LABEL_RE] + [r for r in unique_re_relations_from_data if r != NO_RELATION_LABEL_RE]
re_labels_list = sorted(list(set(re_labels_list)), key=lambda x: (x != NO_RELATION_LABEL_RE, x))

re_label_to_id = {label: i for i, label in enumerate(re_labels_list)}
re_id_to_label = {i: label for i, label in enumerate(re_labels_list)}
num_re_labels = len(re_labels_list)

print(f"Number of unique RE Labels (including {NO_RELATION_LABEL_RE}): {num_re_labels}")
print(f"Sample RE Labels (first 10): {re_labels_list[:min(10, num_re_labels)]}")
print(f"re_label_to_id['{NO_RELATION_LABEL_RE}'] = {re_label_to_id.get(NO_RELATION_LABEL_RE, 'ERROR')}")

%store re_label_to_id
%store re_id_to_label
%store num_re_labels
%store NO_RELATION_LABEL_RE

--- RE Label Mapping ---
Number of unique RE Labels (including NO_RELATION): 69
Sample RE Labels (first 10): ['NO_RELATION', 'AcademicDegree', 'AdjacentStation', 'Affiliation', 'AppliesToPeople', 'ApprovedBy', 'Author', 'BasedOn', 'Capital', 'CitesWork']
re_label_to_id['NO_RELATION'] = 0
Stored 're_label_to_id' (dict)
Stored 're_id_to_label' (dict)
Stored 'num_re_labels' (int)
Stored 'NO_RELATION_LABEL_RE' (str)


## 15. RE: Tokenizer Initialization

In [76]:
%store -r MODEL_NAME_RE MODEL_NAME_NER 

print(f"--- Initializing Tokenizer for RE model: {MODEL_NAME_RE} ---")
if 'ner_tokenizer' in globals() and MODEL_NAME_RE == MODEL_NAME_NER:
    re_tokenizer = ner_tokenizer
    print(f"Reusing tokenizer from NER for RE model: {MODEL_NAME_RE}")
else:
    re_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_RE, use_fast=True)
    print(f"Initialized new tokenizer for RE model: {MODEL_NAME_RE}")

# Add special tokens if a more structured input is desired for RE, e.g., entity markers.
# For this version, we embed type and text directly into the input string. Example:
# special_tokens_re = {'additional_special_tokens': ['<H>', '</H>', '<T>', '</T>', '<ENT>']}
# num_added_toks_re = re_tokenizer.add_special_tokens(special_tokens_re)
# if num_added_toks_re > 0:
#    print(f"Added {num_added_toks_re} special tokens to RE tokenizer: {special_tokens_re['additional_special_tokens']}")


no stored variable or alias MODEL_NAME_RE
no stored variable or alias MODEL_NAME_NER
--- Initializing Tokenizer for RE model: t5-small ---
Reusing tokenizer from NER for RE model: t5-small


## 16. RE: Data Preparation Function

In [77]:
import nltk.data 
from tqdm.auto import tqdm

# Ensure sentence tokenizer is loaded (might have been done in RE.1)
try:
    sent_tokenizer_re_nltk = nltk.data.load('tokenizers/punkt/english.pickle')
except LookupError:
    nltk.download('punkt', quiet=True)
    sent_tokenizer_re_nltk = nltk.data.load('tokenizers/punkt/english.pickle')

def get_sentence_spans_re(doc_text):
    """Returns a list of (start_char, end_char, sentence_text) tuples."""
    sentence_spans = []
    try: # Add try-except for robustness if doc_text is not a string
        if not isinstance(doc_text, str):
            # print(f"Warning: doc_text is not a string, it's {type(doc_text)}. Skipping sentence tokenization for this doc.")
            return []
        for start, end in sent_tokenizer_re_nltk.span_tokenize(doc_text):
            sentence_spans.append((start, end, doc_text[start:end]))
    except Exception as e:
        # print(f"Error during sentence tokenization: {e}")
        pass
    return sentence_spans

def create_re_examples_for_classification(documents, relation_map, no_relation_val):
    re_formatted_examples = []
    max_entities_per_sentence_heuristic = 10 

    for doc_idx_prog, doc in enumerate(tqdm(documents, desc="Processing documents for RE examples")):
        doc_id = doc.get('id', f'doc_{doc_idx_prog}')
        # Use 'document_text' as standardized in the data loading cell
        doc_text = doc.get('document_text', '') 
        entities_in_doc = doc.get('entities', [])
        triples_in_doc = doc.get('triples', [])

        if not doc_text or not entities_in_doc:
            continue

        gold_relation_lookup = {(triple['head'], triple['tail']): triple['relation'] for triple in triples_in_doc}
        
        # Map original entity index to its details (first mention text, type, AND DYNAMICALLY FOUND start/end)
        entity_details_map = {}
        for orig_entity_idx, entity_data in enumerate(entities_in_doc):
            if entity_data.get('mentions') and isinstance(entity_data['mentions'], list) and entity_data['mentions']:
                # *** CORRECTION STARTS HERE ***
                first_mention_text = entity_data['mentions'][0] # Assuming this is a string
                
                if not isinstance(first_mention_text, str):
                    # print(f"Warning: Expected first_mention_text to be a string, but got {type(first_mention_text)} in doc {doc_id}. Skipping entity.")
                    continue

                # Find start/end using the document text
                mention_char_start = doc_text.find(first_mention_text)
                if mention_char_start == -1:
                    # print(f"Warning: Mention '{first_mention_text}' not found in doc_text of doc {doc_id}. Skipping entity.")
                    continue
                mention_char_end = mention_char_start + len(first_mention_text)
                # *** CORRECTION ENDS HERE ***

                entity_details_map[orig_entity_idx] = {
                    'text': first_mention_text, # Store the string itself
                    'type': entity_data.get('type', 'UNK_TYPE'),
                    'start': mention_char_start, # Store the found start
                    'end': mention_char_end      # Store the found end
                }
        
        sentence_spans = get_sentence_spans_re(doc_text)

        for sent_idx, (sent_start_char, sent_end_char, sentence_text) in enumerate(sentence_spans):
            entities_in_current_sentence = []
            for orig_idx, details in entity_details_map.items():
                # Check if the entity's (dynamically found) mention is within this sentence
                if details['start'] != -1 and sent_start_char <= details['start'] < sent_end_char:
                    entities_in_current_sentence.append((orig_idx, details))
            
            if len(entities_in_current_sentence) < 2 or len(entities_in_current_sentence) > max_entities_per_sentence_heuristic:
                continue

            for i in range(len(entities_in_current_sentence)):
                for j in range(len(entities_in_current_sentence)):
                    if i == j: continue 

                    head_original_idx, head_details = entities_in_current_sentence[i]
                    tail_original_idx, tail_details = entities_in_current_sentence[j]
                    
                    input_text_re = (
                        f"relación: {head_details['type']} "
                        f"\\\"{head_details['text']}\\\" y {tail_details['type']} "
                        f"\\\"{tail_details['text']}\\\". contexto: {sentence_text}"
                    )
                    
                    relation_type = gold_relation_lookup.get((head_original_idx, tail_original_idx), no_relation_val)
                    label_id = relation_map.get(relation_type, relation_map[no_relation_val])
                                            
                    re_formatted_examples.append({"text": input_text_re, "label": label_id})
                            
    return re_formatted_examples

print("RE example creation function (create_re_examples_for_classification) defined with correction for string mentions.")

RE example creation function (create_re_examples_for_classification) defined with correction for string mentions.


## 17. RE: Create Hugging Face Datasets

In [78]:
%store -r re_label_to_id NO_RELATION_LABEL_RE re_id_to_label # Retrieve from RE label mapping cell

print("--- Creating RE examples for training set ---")
re_train_examples = create_re_examples_for_classification(train_docs_raw, re_label_to_id, NO_RELATION_LABEL_RE)
print(f"Generated {len(re_train_examples)} RE examples for training.")

print("\n--- Creating RE examples for development set ---")
re_dev_examples = create_re_examples_for_classification(dev_docs_raw, re_label_to_id, NO_RELATION_LABEL_RE)
print(f"Generated {len(re_dev_examples)} RE examples for development.")

if not re_train_examples:
    print("CRITICAL WARNING: No RE training examples generated. RE training will be skipped or fail.")
    re_hf_train = Dataset.from_dict({"text": [], "label": []})
else:
    re_hf_train = Dataset.from_list(re_train_examples)

if not re_dev_examples:
    print("WARNING: No RE development examples generated. RE evaluation will be skipped or produce trivial results.")
    re_hf_dev = Dataset.from_dict({"text": [], "label": []})
else:
    re_hf_dev = Dataset.from_list(re_dev_examples)

print("\nSample RE training example:")
if len(re_hf_train) > 0:
    print(re_hf_train[0])
    re_label_counts_train = Counter(re_hf_train['label'])
    print(f"\nRE Training label distribution (Top 10 relations by count):")
    for label_id_count, count_val in sorted(re_label_counts_train.items(), key=lambda item: item[1], reverse=True)[:10]:
        print(f"  Label '{re_id_to_label.get(label_id_count, 'UnknownLabel')}': {count_val}")
else:
    print("No RE training examples available.")

%store re_hf_train
%store re_hf_dev

no stored variable or alias #
no stored variable or alias Retrieve
no stored variable or alias from
no stored variable or alias RE
no stored variable or alias label
no stored variable or alias mapping
no stored variable or alias cell
--- Creating RE examples for training set ---


Processing documents for RE examples:   0%|          | 0/51 [00:00<?, ?it/s]

Generated 6284 RE examples for training.

--- Creating RE examples for development set ---


Processing documents for RE examples:   0%|          | 0/23 [00:00<?, ?it/s]

Generated 1390 RE examples for development.

Sample RE training example:
{'text': 'relación: PERSON \\"William Thomson, 1st Baron Kelvin\\" y ORDINAL \\"1st\\". contexto: The  automatic curb sender was a kind of telegraph key, invented by William Thomson, 1st Baron Kelvin for sending messages on a submarine communications cable, as the well-known Wheatstone transmitter sends them on a land line.', 'label': 0}

RE Training label distribution (Top 10 relations by count):
  Label 'NO_RELATION': 6284
Stored 're_hf_train' (Dataset)
Stored 're_hf_dev' (Dataset)


## 18. RE: Tokenize Datasets

In [79]:
MAX_SEQ_LENGTH_RE = 256 # Max sequence length for RE model inputs

def tokenize_re_function(examples):
    return re_tokenizer(examples["text"], truncation=True, max_length=MAX_SEQ_LENGTH_RE, padding=False)

print("--- Tokenizing RE datasets ---")
if len(re_hf_train) > 0:
    re_tokenized_train = re_hf_train.map(tokenize_re_function, batched=True, remove_columns=["text"])
    print(f"Tokenized RE training set: {len(re_tokenized_train)} examples.")
else:
    re_tokenized_train = re_hf_train 
    print("RE training set is empty, tokenization skipped.")

if len(re_hf_dev) > 0:
    re_tokenized_dev = re_hf_dev.map(tokenize_re_function, batched=True, remove_columns=["text"])
    print(f"Tokenized RE development set: {len(re_tokenized_dev)} examples.")
else:
    re_tokenized_dev = re_hf_dev 
    print("RE development set is empty, tokenization skipped.")

if len(re_tokenized_train) > 0:
    print(f"Features in tokenized RE train set: {re_tokenized_train.features}")

%store re_tokenized_train
%store re_tokenized_dev

--- Tokenizing RE datasets ---


Map:   0%|          | 0/6284 [00:00<?, ? examples/s]

Tokenized RE training set: 6284 examples.


Map:   0%|          | 0/1390 [00:00<?, ? examples/s]

Tokenized RE development set: 1390 examples.
Features in tokenized RE train set: {'label': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
Stored 're_tokenized_train' (Dataset)
Stored 're_tokenized_dev' (Dataset)


## 19. RE: Model, Data Collator, and Metrics Function

In [80]:
%store -r MODEL_NAME_RE num_re_labels re_id_to_label re_label_to_id NO_RELATION_LABEL_RE # Retrieve variables

print(f"--- Initializing RE Model: {MODEL_NAME_RE} ---")
re_model_config = AutoConfig.from_pretrained(
    MODEL_NAME_RE, 
    num_labels=num_re_labels, 
    id2label=re_id_to_label, 
    label2id=re_label_to_id
)
# This model instance is used for the raw evaluation
re_model_for_raw_eval = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME_RE, 
    config=re_model_config, 
    ignore_mismatched_sizes=True # Helpful if base model not originally for seq classification
)
print(f"RE Model for Raw Eval ({type(re_model_for_raw_eval)}) loaded.")


re_data_collator = DataCollatorWithPadding(tokenizer=re_tokenizer, padding='longest')
print("RE Data Collator initialized.")

def compute_metrics_re_revised(p):
    predictions_logits_input, labels = p
    
    print(f"Inside compute_metrics_re_revised:")
    print(f"  Type of predictions_logits_input: {type(predictions_logits_input)}")
    
    processed_logits = []
    valid_labels_list = []

    if isinstance(predictions_logits_input, tuple):
        # Common if model outputs multiple things (e.g. logits, hidden_states)
        # Assuming the first element is the actual logits
        print(f"  predictions_logits_input is a tuple. Using the first element.")
        actual_logits = predictions_logits_input[0]
    else:
        actual_logits = predictions_logits_input

    if isinstance(actual_logits, np.ndarray):
        print(f"  Shape of actual_logits (as ndarray): {actual_logits.shape}")
        # If it's already a 2D numpy array (num_samples, num_classes), it should be fine
        if actual_logits.ndim == 2 and actual_logits.shape[1] == num_re_labels:
            predictions = np.argmax(actual_logits, axis=1)
            # Ensure labels align with the number of prediction samples
            if len(labels) != actual_logits.shape[0]:
                 print(f"  Warning: Mismatch between labels length ({len(labels)}) and predictions_logits samples ({actual_logits.shape[0]})")
                 # Attempt to align if it's a batching artifact, otherwise metrics will be wrong
                 # This part might need more sophisticated handling depending on why the mismatch occurs
                 min_len = min(len(labels), actual_logits.shape[0])
                 labels = labels[:min_len]
                 predictions = predictions[:min_len]
        elif actual_logits.ndim == 1 and len(actual_logits) == num_re_labels: # Single sample passed directly
            print(f"  actual_logits looks like a single sample's logits. Reshaping.")
            predictions = np.array([np.argmax(actual_logits)]) # Make it a 1-element array
            labels = np.array([labels[0]]) if isinstance(labels, (list, np.ndarray)) and len(labels)>0 else np.array([])
        else:
            print(f"  Error: actual_logits is an ndarray but has unexpected shape: {actual_logits.shape}. Expected (num_samples, {num_re_labels}).")
            return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}
            
    elif isinstance(actual_logits, list): # Potentially a list of lists or list of arrays
        print(f"  actual_logits is a list. Length: {len(actual_logits)}")
        # Try to convert to a proper numpy array, assuming it's a list of logit arrays (one per sample)
        # This is where the inhomogeneous error likely originates if inner lists/arrays have different structures
        try:
            # We expect each item in the list to be a 1D array/list of logits of length num_re_labels
            # Filter out any elements that are not lists/arrays or don't have the correct length
            filtered_logits = []
            corresponding_labels = []
            for idx, logit_item in enumerate(actual_logits):
                if hasattr(logit_item, '__len__') and len(logit_item) == num_re_labels:
                    filtered_logits.append(logit_item)
                    if idx < len(labels):
                        corresponding_labels.append(labels[idx])
                    else:
                        print(f"  Warning: Label missing for logit item at index {idx}")
                else:
                    print(f"  Warning: Skipping logit item at index {idx} due to unexpected structure/length. Item: {logit_item}")

            if not filtered_logits:
                print("  Error: No valid logits found after filtering list items.")
                return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}

            predictions_logits_np = np.array(filtered_logits) # This might still fail if items had sub-structure
            print(f"  Shape of predictions_logits_np after converting list: {predictions_logits_np.shape}")
            if predictions_logits_np.ndim != 2 or predictions_logits_np.shape[1] != num_re_labels:
                print(f"  Error: Converted predictions_logits_np has unexpected shape: {predictions_logits_np.shape}. Expected (num_samples, {num_re_labels}).")
                return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}
            
            predictions = np.argmax(predictions_logits_np, axis=1)
            labels = np.array(corresponding_labels) # Use the labels that correspond to the filtered logits

        except ValueError as e:
            print(f"  ValueError converting list of logits to numpy array: {e}")
            print(f"  First 5 items of actual_logits list (lengths): {[len(li) if hasattr(li, '__len__') else 'NotIterable' for li in actual_logits[:5]]}")
            return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}
    else:
        print(f"  Error: Unexpected type for actual_logits: {type(actual_logits)}")
        return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}

    if len(predictions) == 0 or len(labels) == 0 or len(predictions) != len(labels):
        print(f"  Error: Predictions (len {len(predictions)}) and labels (len {len(labels)}) are empty or have mismatched lengths after processing. Cannot compute metrics.")
        return {'accuracy': 0.0, 'f1_weighted': 0.0, 'precision_weighted': 0.0, 'recall_weighted': 0.0, 'f1_actual_relations_micro': 0.0}

    # --- Metric calculation continues from here ---
    accuracy = accuracy_score(labels, predictions)
    precision_w, recall_w, f1_w, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted', zero_division=0
    )
    
    no_relation_class_id = re_label_to_id.get(NO_RELATION_LABEL_RE, -1) 
    actual_relations_mask = (labels != no_relation_class_id)
    f1_actual_micro = 0.0
    precision_actual = 0.0
    recall_actual = 0.0

    if np.sum(actual_relations_mask) > 0:
        labels_actual = labels[actual_relations_mask]
        preds_actual = predictions[actual_relations_mask]
        if len(labels_actual) > 0 and len(np.unique(labels_actual)) > 0 : 
            # Ensure there are some actual relation instances and more than one class (or handle single class case)
            # Also, ensure labels parameter for precision_recall_fscore_support contains only present labels
            unique_present_labels_actual = np.unique(labels_actual)
            if len(unique_present_labels_actual) == 1 and len(np.unique(preds_actual)) == 1 and unique_present_labels_actual[0] == np.unique(preds_actual)[0] :
                 # Handle case where only one class is present and correctly predicted for all instances of it.
                 # This might require careful thought if this is the only "actual" relation class.
                 # For micro, if all are correct, P/R/F1 = 1. If not, 0.
                 if np.array_equal(labels_actual, preds_actual):
                     precision_actual, recall_actual, f1_actual = 1.0, 1.0, 1.0
                 else: # Some are wrong if not all equal
                     p_a, r_a, f_a, _ = precision_recall_fscore_support(
                        labels_actual, preds_actual, average='micro', zero_division=0, labels=unique_present_labels_actual
                     )
                     precision_actual, recall_actual, f1_actual = p_a, r_a, f_a

            elif len(unique_present_labels_actual) > 0 :
                p_a, r_a, f_a, _ = precision_recall_fscore_support(
                    labels_actual, preds_actual, average='micro', zero_division=0, labels=unique_present_labels_actual
                )
                precision_actual, recall_actual, f1_actual = p_a, r_a, f_a


    return {
        'accuracy_overall': accuracy,
        'f1_weighted_overall': f1_w,
        'precision_weighted_overall': precision_w,
        'recall_weighted_overall': recall_w,
        'f1_actual_relations_micro': f1_actual_micro, # <--- CORRECTED to use f1_actual_micro
        'precision_actual_relations_micro': precision_actual,
        'recall_actual_relations_micro': recall_actual
    
    }

print("RE Model and Metrics function defined (with enhanced debugging for predictions_logits).")

no stored variable or alias MODEL_NAME_RE
no stored variable or alias #
no stored variable or alias Retrieve
no stored variable or alias variables
--- Initializing RE Model: t5-small ---


Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RE Model for Raw Eval (<class 'transformers.models.t5.modeling_t5.T5ForSequenceClassification'>) loaded.
RE Data Collator initialized.
RE Model and Metrics function defined (with enhanced debugging for predictions_logits).


## 20. RE: Raw Pre-trained Model Evaluation (Baseline before Fine-tuning)

In [81]:
%store -r MODEL_NAME_RE re_tokenized_dev OUTPUT_BASE_DIR SEED all_experiment_results

print(f"--- RE Raw Model Evaluation for {MODEL_NAME_RE} ---")

if len(re_tokenized_dev) == 0:
    print("Skipping RE raw model evaluation as tokenized dev dataset is empty.")
    raw_re_metrics = {key: 0.0 for key in ['eval_loss', 'eval_accuracy', 'eval_f1_weighted', 'eval_precision_weighted', 'eval_recall_weighted', 'eval_f1_actual_relations_micro']}
else:
    try:
        # Model already loaded as re_model_for_raw_eval in the previous cell
        raw_re_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_RE.replace('/', '_')}-re-raw-eval"
        raw_re_eval_args = TrainingArguments(
            output_dir=str(raw_re_output_dir),
            per_device_eval_batch_size=16,
            do_train=False, do_eval=True, report_to="none", seed=SEED
        )
        raw_re_trainer = Trainer(
            model=re_model_for_raw_eval, args=raw_re_eval_args, eval_dataset=re_tokenized_dev,
            data_collator=re_data_collator, compute_metrics=compute_metrics_re_revised
        )
        print("Evaluating raw RE model...")
        raw_re_metrics = raw_re_trainer.evaluate()
        print(f"RE Raw Model Metrics ({MODEL_NAME_RE}):")
        for k, v_metric in raw_re_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
    except Exception as e:
        print(f"Error during RE raw model evaluation: {e}")
        raw_re_metrics = {key: 0.0 for key in ['eval_loss', 'eval_accuracy', 'eval_f1_weighted', 'eval_precision_weighted', 'eval_recall_weighted', 'eval_f1_actual_relations_micro']}
        import traceback; traceback.print_exc()

all_experiment_results['RE Raw Model'] = raw_re_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_RE
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- RE Raw Model Evaluation for t5-small ---
Evaluating raw RE model...


Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390, 69)
RE Raw Model Metrics (t5-small):
  eval_loss: 4.3210
  eval_accuracy_overall: 0.0000
  eval_f1_weighted_overall: 0.0000
  eval_precision_weighted_overall: 0.0000
  eval_recall_weighted_overall: 0.0000
  eval_f1_actual_relations_micro: 0.0000
  eval_precision_actual_relations_micro: 0.0000
  eval_recall_actual_relations_micro: 0.0000
  eval_runtime: 4.1743
  eval_samples_per_second: 332.9900
  eval_steps_per_second: 10.5410
Stored 'all_experiment_results' (defaultdict)


## 21. RE: Model Training (Baseline/Final)

In [None]:
%store -r MODEL_NAME_RE re_tokenized_train re_tokenized_dev NUM_TRAIN_EPOCHS_FINAL_RE OUTPUT_BASE_DIR SEED re_id_to_label re_label_to_id num_re_labels all_experiment_results

print(f"--- RE Final Training for {MODEL_NAME_RE} ({NUM_TRAIN_EPOCHS_FINAL_RE} epochs) ---")

re_final_output_dir = Path(OUTPUT_BASE_DIR) / f"{MODEL_NAME_RE.replace('/', '_')}-re-final"

if len(re_tokenized_train) == 0 or len(re_tokenized_dev) == 0:
    print("Skipping RE final training as tokenized train or dev dataset is empty.")
    final_re_metrics = {key: 0.0 for key in ['eval_loss', 'eval_accuracy', 'eval_f1_weighted', 'eval_precision_weighted', 'eval_recall_weighted', 'eval_f1_actual_relations_micro']}
else:
    try:
        re_final_args = TrainingArguments(
            output_dir=str(re_final_output_dir),
            per_device_train_batch_size=8, 
            per_device_eval_batch_size=16,
            learning_rate=2e-5, 
            num_train_epochs=NUM_TRAIN_EPOCHS_FINAL_RE,
            weight_decay=0.01,
            logging_strategy="epoch",
            eval_strategy="epoch",       # <--- CORRECTED from evaluation_strategy
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_f1_weighted_overall", # Ensure this key matches output of compute_metrics_re_revised
            save_total_limit=1,
            fp16=torch.cuda.is_available(),
            report_to="none",
            seed=SEED,
            disable_tqdm=False
        )
        re_final_model_config = AutoConfig.from_pretrained(
            MODEL_NAME_RE, 
            num_labels=num_re_labels, 
            id2label=re_id_to_label, 
            label2id=re_label_to_id
        )
        re_final_model = AutoModelForSequenceClassification.from_pretrained(
            MODEL_NAME_RE, 
            config=re_final_model_config, 
            ignore_mismatched_sizes=True
        )
        
        # Check if tokenizer and model vocab size match, especially if special tokens were added to re_tokenizer
        # This is a common source of errors if not handled.
        # if 're_tokenizer' in globals() and hasattr(re_tokenizer, 'vocab_size') and \
        #   len(re_tokenizer) != re_final_model.config.vocab_size:
        #    print(f"Warning: RE Tokenizer vocab size ({len(re_tokenizer)}) and model vocab size ({re_final_model.config.vocab_size}) mismatch.")
        #    print("Attempting to resize model token embeddings for RE final run. This is needed if special tokens were added to the tokenizer AFTER the base model was loaded for another task (e.g. NER).")
        #    re_final_model.resize_token_embeddings(len(re_tokenizer))


        re_final_trainer = Trainer(
            model=re_final_model, 
            args=re_final_args, 
            train_dataset=re_tokenized_train,
            eval_dataset=re_tokenized_dev, 
            data_collator=re_data_collator, # Ensure re_data_collator is defined
            compute_metrics=compute_metrics_re_revised # Ensure compute_metrics_re_revised is defined
        )
        print("Starting RE final training...")
        re_final_trainer.train()
        print("RE final training completed.")
        print("Evaluating final RE model (best checkpoint automatically loaded)...")
        final_re_metrics = re_final_trainer.evaluate()
        print(f"RE Final Metrics ({MODEL_NAME_RE}):")
        for k, v_metric in final_re_metrics.items():
            print(f"  {k}: {v_metric:.4f}")
        re_final_trainer.save_model(str(re_final_output_dir / "best_model"))
        print(f"Final RE model saved to {re_final_output_dir / 'best_model'}")
    except Exception as e:
        print(f"Error during RE final training: {e}")
        final_re_metrics = {key: 0.0 for key in ['eval_loss', 'eval_accuracy', 'eval_f1_weighted', 'eval_precision_weighted', 'eval_recall_weighted', 'eval_f1_actual_relations_micro']}
        import traceback; traceback.print_exc()

all_experiment_results['RE Final Training'] = final_re_metrics
%store all_experiment_results

no stored variable or alias MODEL_NAME_RE
no stored variable or alias NUM_TRAIN_EPOCHS_FINAL_RE
no stored variable or alias OUTPUT_BASE_DIR
no stored variable or alias SEED
--- RE Final Training for t5-small (10 epochs) ---


Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting RE final training...


Epoch,Training Loss,Validation Loss,Accuracy Overall,F1 Weighted Overall,Precision Weighted Overall,Recall Weighted Overall,F1 Actual Relations Micro,Precision Actual Relations Micro,Recall Actual Relations Micro
1,0.8318,4e-06,1.0,1.0,1.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
5,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0


Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390, 69)
Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390, 69)
Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390, 69)
Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390, 69)
Inside compute_metrics_re_revised:
  Type of predictions_logits_input: <class 'tuple'>
  predictions_logits_input is a tuple. Using the first element.
  Shape of actual_logits (as ndarray): (1390,

## 22. RE: Hyperparameter Optimization (Optuna) - Placeholder

If `NUM_OPTUNA_TRIALS_RE` is set > 0, this section would run HPO for the RE task, similar to NER HPO. This involves defining an `re_objective_optuna` function and using Optuna to find the best hyperparameters. For this iteration, it's a placeholder but the structure would mirror the NER HPO cells.

## 23. Final Results Summary

In [None]:
%store -r all_experiment_results

print("--- Aggregated Experiment Results ---")

results_data = []
for experiment_name, metrics in all_experiment_results.items():
    if isinstance(metrics, dict): # Ensure metrics is a dict
        results_data.append({
            "Experiment": experiment_name,
            "Eval Loss": metrics.get('eval_loss', float('nan')),
            "F1": metrics.get('eval_f1', metrics.get('eval_f1_weighted', metrics.get('eval_f1_weighted_overall', float('nan')))), # Try different F1 keys
            "Accuracy": metrics.get('eval_accuracy', metrics.get('eval_accuracy_overall', float('nan'))),
            "Token Acc (NER)": metrics.get('eval_token_accuracy', float('nan')),
            "Precision": metrics.get('eval_precision', metrics.get('eval_precision_weighted', metrics.get('eval_precision_weighted_overall', float('nan')))),
            "Recall": metrics.get('eval_recall', metrics.get('eval_recall_weighted', metrics.get('eval_recall_weighted_overall', float('nan'))))
        })
    else:
        print(f"Warning: Metrics for '{experiment_name}' is not a dictionary: {metrics}")

if results_data:
    results_df = pd.DataFrame(results_data)
    results_df = results_df.set_index("Experiment")
    # Format float columns
    float_cols = results_df.select_dtypes(include=['float']).columns
    for col in float_cols:
        results_df[col] = results_df[col].apply(lambda x: f"{x:.4f}" if not pd.isna(x) else "N/A")
    
    print("\nFinal Performance Summary Table:")
    print(results_df.to_markdown())
else:
    print("No results were collected to display in the summary table.")

print("\n--- Notes on Results ---")
print("- 'F1' for NER refers to seqeval's overall_f1 (entity-level). For RE, it's 'f1_weighted_overall'.")
print("- 'Accuracy' for NER refers to seqeval's overall_accuracy (entity-level). For RE, it's 'accuracy_overall'.")
print("- 'Token Acc (NER)' is a separate token-wise accuracy for NER.")
print("- Extremely low scores are expected due to the very small dataset size.")
print("- If T5 models show 'missing keys' for embeddings during 'load_best_model_at_end', NER results are unreliable.")

## 24. Final Test Set Evaluation (Conceptual Placeholder)

After identifying the best performing NER and RE models on the development set, those specific saved models should be evaluated **once** on the `test_docs_raw` to get a final, unbiased performance measure. This step is crucial for reporting final results but is not automated in this notebook to prevent accidental multiple evaluations on the test set.

**Conceptual Steps for Test Evaluation:**
1.  Load your best saved NER model and its tokenizer.
2.  Preprocess `test_docs_raw` for NER (tokenize, align labels if available for scoring, otherwise just tokenize for prediction).
3.  Run NER predictions and evaluate if gold test labels exist.
4.  Load your best saved RE model and its tokenizer.
5.  Prepare RE test examples: Use entities (either gold from test data or predicted by your NER model) and sentence context from `test_docs_raw`. Tokenize these examples.
6.  Run RE predictions and evaluate if gold test relations exist.

## 25. Conclusion and Next Steps

This notebook has been structured to perform NER and RE tasks, including baseline evaluations, various fine-tuning strategies for NER, and a baseline for RE. It also includes a framework for hyperparameter optimization and a summary of results.

**Key Considerations for Improving Results Beyond this Structural Revision:**
1.  **CRITICAL: DATASET SIZE & QUALITY:** The single most impactful factor will be using a significantly larger (thousands of examples), high-quality, and diverse training dataset for both NER and RE.
2.  **SUFFICIENT TRAINING DURATION:** With more data, increase `NUM_TRAIN_EPOCHS_FINAL_NER` and `NUM_TRAIN_EPOCHS_FINAL_RE` (e.g., 10-50+ epochs, or use `max_steps`) to allow models to converge.
3.  **ROBUST HYPERPARAMETER OPTIMIZATION (HPO):** Conduct more extensive HPO (more trials, longer training per trial) for both NER and RE once you have sufficient data and adequate training time per trial.
4.  **MODEL SELECTION:** Experiment with different pre-trained models. For NER, BERT-based models (`bert-base-cased`, `roberta-base`, etc.) are often strong. For RE, various architectures can be explored.
5.  **T5 EMBEDDING WARNING (NER):** If the `missing keys` warning for T5 embeddings persists during `load_best_model_at_end`, this indicates a fundamental issue with how embeddings are handled or saved/loaded for `T5ForTokenClassification` in your setup. This will severely hamper NER performance and needs to be resolved (e.g., by checking library versions, model checkpoint structure, or switching to a different base model for NER).
6.  **RE DATA PREPARATION:** The strategy for generating RE examples (especially negative `NO_RELATION` instances and constructing input sequences with entity information) is crucial and can be refined.
7.  **ERROR ANALYSIS:** After training, thoroughly analyze the errors made by your models on the development set to understand weaknesses and guide improvements (e.g., more data for specific classes, feature engineering, rule-based post-processing).

This notebook provides a foundation. True high-performance NLP requires iterative experimentation, a strong understanding of your data, and careful tuning.

What this means together:

Your model has very quickly learned that the safest bet to minimize loss and maximize overall accuracy on your highly imbalanced dataset is to predict "NO_RELATION" for almost every entity pair. It's doing an excellent job at that, hence the perfect overall scores. However, it has learned nothing about the actual, meaningful relationships between entities, hence the zero scores for "actual relations."

Why is this happening?

Extreme Class Imbalance: As seen in your EDA (and the sample output for RE data generation in cell #42 of the notebook, which showed 6284 "NO_RELATION" examples in training), the "NO_RELATION" class likely dwarfs all other individual relation types. The model has many more examples of "NO_RELATION" to learn from.
Insufficient Signal for Minority Classes: With only 51 original documents to source your RE examples, the number of instances for each specific relation type (e.g., "Founded", "Creator") is probably very small. There isn't enough data for the model to learn the complex patterns that distinguish these specific relations.
Loss Function Domination: Standard cross-entropy loss will be dominated by the majority class. The model gets high rewards for correctly predicting "NO_RELATION" and little penalty for misclassifying the rare actual relations as "NO_RELATION".
Is this "bad"?

If your goal is to identify specific relationships, then yes, these results indicate the model is not achieving that goal, despite the misleadingly perfect overall scores.
The eval_f1_actual_relations_micro is the metric you should focus on if you care about identifying true relations.