## FYP Sprint 4 
### Ian Chia 
### 230746D

------------------

### T0.1 : Creating a fixed split

### Cell 1: Import libraries

In [1]:
import pandas as pd

### Cell 2: Load dataset_full.jsonl

In [2]:
df = pd.read_json("dataset_full.jsonl", lines=True)
df.shape
df.head()

Unnamed: 0,text,target_json
0,"A selfish, whiny teenaged girl","{'ACTOR': 'girl', 'ACTION': '', 'OBJECT': '', ..."
1,When a character delivers a speech so powerful...,"{'ACTOR': 'character', 'ACTION': 'delivers', '..."
2,An ugly (in personality or appearance)/overwei...,"{'ACTOR': 'personality', 'ACTION': 'knows', 'O..."
3,"s√°¬ª¬©c kho√°¬∫¬ª t√°¬ª‚Ä¢ng √Ñ‚Äò√É¬†n √°¬ª‚Ä¢n √Ñ‚Äò√°¬ª‚Äπnh, t√°¬ª‚Ä∞ l...","{'ACTOR': '√Ñ‚Äò√É', 'ACTION': 'h√°¬∫¬•p', 'OBJECT': ..."
4,A movie adaptation of a novel,"{'ACTOR': 'novel', 'ACTION': '', 'OBJECT': 'no..."


### Cell 3: Shuffle once with a fixed seed

In [3]:
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
df_shuffled.shape


(327, 2)

### Cell 4: Create split sizes (70/15/15)

In [4]:
n_total = len(df_shuffled)
n_train = int(0.7 * n_total)
n_dev = int(0.15 * n_total)
n_test = n_total - n_train - n_dev

n_total, n_train, n_dev, n_test


(327, 228, 49, 50)
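The sizes above come from truncating each fraction and letting the test set absorb the remainder, which is why test ends up with 50 rows rather than 49. A minimal sketch of that arithmetic follows; the helper name `split_sizes` is illustrative and not part of the notebook.

```python
def split_sizes(n_total, train_frac=0.70, dev_frac=0.15):
    """Return (n_train, n_dev, n_test) using the same rounding as Cell 4."""
    n_train = int(train_frac * n_total)  # int() truncates toward zero
    n_dev = int(dev_frac * n_total)
    n_test = n_total - n_train - n_dev   # test absorbs the rounding remainder
    return n_train, n_dev, n_test


sizes = split_sizes(327)
print(sizes)  # (228, 49, 50)
```

Because the remainder always goes to the test set, the three sizes sum to `n_total` for any dataset size.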

### Cell 5: Assign the split column

In [5]:
df_shuffled["split"] = "train"

df_shuffled.loc[n_train:n_train+n_dev-1, "split"] = "dev"
df_shuffled.loc[n_train+n_dev:, "split"] = "test"

df_shuffled["split"].value_counts()


split
train    228
test      50
dev       49
Name: count, dtype: int64

### Cell 6: Save the official Sprint 4 dataset

In [6]:
df_shuffled.to_json(
    "idea_annotator_sprint4_split_fixed.jsonl",
    orient="records",
    lines=True,
    force_ascii=False
)

This file becomes the master dataset for the entire Sprint 4 pipeline.

-----------------

### T0.2A: BERT Baseline on Fixed Split

----------------------

Sprint 3 used previously preprocessed training files that already contained BIO tags.

Sprint 4 uses the raw MongoDB export, which contains only text + JSON, so the BIO tags had to be regenerated before training BERT.

**Sprint 3 dataset**

- Exported from MongoDB, but used in a processed version.
- This version contained "tokens" + "BIO tags".
- The BIO tags had already been generated earlier, during Sprint 3, so BERT could be trained on the file directly.

**Sprint 4 dataset**

- The file used here (dataset_full.jsonl) is a different export that contains the raw annotation (target_json).
- This version does not contain BIO tags, so they had to be regenerated from the JSON.
- This also created a cleaner baseline (higher F1).

### Cell 1: Load the fixed split

In [2]:
import pandas as pd

df = pd.read_json("idea_annotator_sprint4_split_fixed.jsonl", lines=True)
train_df = df[df["split"] == "train"]
dev_df = df[df["split"] == "dev"]
test_df = df[df["split"] == "test"]

len(train_df), len(dev_df), len(test_df)


(228, 49, 50)

### Cell 2: Check what columns we have

In [3]:
train_df.columns, train_df.head(3)

(Index(['text', 'target_json', 'split'], dtype='object'),
                                                 text  \
 0  An organization is improbably selective about ...   
 1  Now, stop being lazy and go read the full arti...   
 2        Word(s) appearing through an Arc as a Motif   
 
                                          target_json  split  
 0  {'ACTOR': 'organization', 'ACTION': 'is', 'OBJ...  train  
 1  {'ACTOR': 'full article', 'ACTION': 'stop', 'O...  train  
 2  {'ACTOR': 'Word(s )', 'ACTION': 'appearing', '...  train  )

### Cell 3: Convert raw text → tokens

In [4]:
# Convert text into simple whitespace tokens
train_tokens = train_df["text"].apply(lambda x: x.split()).tolist()
dev_tokens   = dev_df["text"].apply(lambda x: x.split()).tolist()
test_tokens  = test_df["text"].apply(lambda x: x.split()).tolist()


### Cell 4: Convert target_json → BIO tags

This cell:

- Loops through each sentence's tokens
- Uses the gold JSON target
- Marks BIO tags for each slot
- Produces a list of labels per token (same format as Sprint 3)

In [5]:
def json_to_bio(tokens, json_obj):
    # Create a default tag list
    tags = ["O"] * len(tokens)

    # For each slot in the JSON (ACTOR, ACTION, etc.)
    for slot, value in json_obj.items():
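        # Skip missing, non-string, or whitespace-only values
        # (an empty token list would otherwise match at every position)
        if not isinstance(value, str) or not value.strip():
            continue

        # Value may contain multiple words, so split
        value_tokens = value.split()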

        # Find matching positions in the token list
        for i in range(len(tokens)):
            # Check if tokens[i:i+len(value_tokens)] matches the slot value
            if tokens[i:i+len(value_tokens)] == value_tokens:
                tags[i] = f"B-{slot}"
                for j in range(1, len(value_tokens)):
                    tags[i+j] = f"I-{slot}"
    
    return tags

# Apply function to build BIO labels
train_labels = [
    json_to_bio(toks, json_obj) 
    for toks, json_obj in zip(train_tokens, train_df["target_json"])
]

dev_labels = [
    json_to_bio(toks, json_obj) 
    for toks, json_obj in zip(dev_tokens, dev_df["target_json"])
]

test_labels = [
    json_to_bio(toks, json_obj) 
    for toks, json_obj in zip(test_tokens, test_df["target_json"])
]

len(train_labels), len(dev_labels), len(test_labels)


(228, 49, 50)
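Since the matching is an exact comparison against whitespace tokens, a slot value is silently left untagged whenever punctuation is attached to a token (for example `palette,` versus `palette`). The sketch below mirrors the tagging logic and also reports which slots never matched. The function name `json_to_bio_with_misses` and the example sentences are illustrative assumptions, not part of the notebook.

```python
def json_to_bio_with_misses(tokens, slots):
    """Tag tokens with BIO labels and report slots whose value never matched."""
    tags = ["O"] * len(tokens)
    missed = []
    for slot, value in slots.items():
        value_tokens = value.split() if isinstance(value, str) else []
        if not value_tokens:
            continue  # empty slot: nothing to tag
        n = len(value_tokens)
        found = False
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == value_tokens:
                tags[i] = f"B-{slot}"
                tags[i + 1:i + n] = [f"I-{slot}"] * (n - 1)
                found = True
        if not found:
            missed.append(slot)
    return tags, missed


clean = "A movie with a limited color palette".split()
tags, missed = json_to_bio_with_misses(
    clean, {"ACTOR": "movie", "OBJECT": "color palette"}
)
print(tags, missed)

# Punctuation glued to a whitespace token blocks the exact match.
noisy = "A limited color palette, on purpose".split()
tags2, missed2 = json_to_bio_with_misses(noisy, {"OBJECT": "color palette"})
print(tags2, missed2)
```

Counting the missed slots over the training split would show how much gold annotation is lost during BIO conversion.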

### Cell 5: Build label list and mappings

This cell collects all unique BIO tags from all three splits (train, dev, and test) and builds the label2id / id2label dictionaries for the model.

In [9]:
# Collect labels from ALL splits
all_labels = set()

for label_seq in train_labels + dev_labels + test_labels:
    for tag in label_seq:
        all_labels.add(tag)

# Sort for stable order
label_list = sorted(list(all_labels))

label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

label_list, label2id



(['B-ACTION',
  'B-ACTOR',
  'B-LOCATION',
  'B-OBJECT',
  'I-ACTION',
  'I-ACTOR',
  'I-LOCATION',
  'I-OBJECT',
  'O'],
 {'B-ACTION': 0,
  'B-ACTOR': 1,
  'B-LOCATION': 2,
  'B-OBJECT': 3,
  'I-ACTION': 4,
  'I-ACTOR': 5,
  'I-LOCATION': 6,
  'I-OBJECT': 7,
  'O': 8})
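The label list above is derived from the data and contains no TIME tags, presumably because no TIME value was matched in any split. One alternative, sketched here as an assumption rather than what this run used, is to build the label set from the slot inventory so that it stays the same however the data changes. The helper name `build_label_maps` is hypothetical.

```python
SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]


def build_label_maps(slot_names):
    """Build a data-independent BIO label list and both lookup tables."""
    labels = sorted(
        [f"{prefix}-{slot}" for slot in slot_names for prefix in ("B", "I")] + ["O"]
    )
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return labels, label2id, id2label


labels, label2id, id2label = build_label_maps(SLOT_NAMES)
print(len(labels), label2id["O"])
```

This yields 11 labels instead of 9, so the label ids would differ from the run reported in this notebook.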

### Cell 6: Tokenize with BERT tokenizer

We now convert the token lists (train_tokens, dev_tokens, test_tokens) into BERT input IDs.

In [7]:
from transformers import BertTokenizerFast

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Encode tokens into BERT input format
train_encodings = tokenizer(
    train_tokens,
    is_split_into_words=True,
    truncation=True,
    padding=True
)

dev_encodings = tokenizer(
    dev_tokens,
    is_split_into_words=True,
    truncation=True,
    padding=True
)

test_encodings = tokenizer(
    test_tokens,
    is_split_into_words=True,
    truncation=True,
    padding=True
)




### Cell 7: Encode BIO labels to align with WordPiece tokens

In [10]:
def encode_tags(labels_list, encodings):
    encoded_labels = []
    
    for i, sentence_labels in enumerate(labels_list):
        word_ids = encodings.word_ids(batch_index=i)
        
        previous_word_idx = None
        label_ids = []
        
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get ignored by loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of a given token
                label_ids.append(label2id[sentence_labels[word_idx]])
            else:
                # Subsequent subword piece: reuse the word's label (a B- tag is repeated on every piece)
                label_ids.append(label2id[sentence_labels[word_idx]])
            
            previous_word_idx = word_idx
        
        encoded_labels.append(label_ids)
    
    return encoded_labels

# Encode all label sets
train_labels_enc = encode_tags(train_labels, train_encodings)
dev_labels_enc   = encode_tags(dev_labels, dev_encodings)
test_labels_enc  = encode_tags(test_labels, test_encodings)

len(train_labels_enc), len(dev_labels_enc), len(test_labels_enc)


(228, 49, 50)
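`encode_tags` gives every WordPiece of a word the same label, so a word tagged `B-ACTOR` that splits into two pieces produces two consecutive `B-ACTOR` labels. A common alternative is to label only the first piece and mask the rest with -100. The sketch below shows both behaviours on hand-written `word_ids`. The function name `align_labels`, the tiny `label2id`, and the `word_ids` list are illustrative assumptions, not tokenizer output.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label2id, label_all_subwords=True):
    """Map word-level labels onto subword positions."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE_INDEX)  # special tokens and padding
        elif word_idx != previous or label_all_subwords:
            aligned.append(label2id[word_labels[word_idx]])
        else:
            aligned.append(IGNORE_INDEX)  # continuation piece is masked
        previous = word_idx
    return aligned


label2id = {"B-ACTOR": 0, "O": 1}
# Hypothetical alignment for "[CLS] A character ##s [SEP]"
word_ids = [None, 0, 1, 1, None]
word_labels = ["O", "B-ACTOR"]

every_piece = align_labels(word_ids, word_labels, label2id, label_all_subwords=True)
first_only = align_labels(word_ids, word_labels, label2id, label_all_subwords=False)
print(every_piece)  # [-100, 1, 0, 0, -100]
print(first_only)   # [-100, 1, 0, -100, -100]
```

Switching to first-piece-only labels would change the token counts that feed the metrics, so the numbers would not be directly comparable with the baseline reported here.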

### Cell 8: Build the PyTorch Dataset objects

In [11]:
import torch

class IdeaDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Create the datasets
train_dataset = IdeaDataset(train_encodings, train_labels_enc)
dev_dataset   = IdeaDataset(dev_encodings, dev_labels_enc)
test_dataset  = IdeaDataset(test_encodings, test_labels_enc)

len(train_dataset), len(dev_dataset), len(test_dataset)


(228, 49, 50)

### Cell 9: Load BERT model + define training arguments

In [12]:
from transformers import BertForTokenClassification, TrainingArguments, Trainer

# Load BERT base model with correct number of labels
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

# Training settings (same as Sprint 3)
training_args = TrainingArguments(
    output_dir="./bert_s4_outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True
)





Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Cell 10: Define compute_metrics (F1 + accuracy)

In [13]:
from seqeval.metrics import f1_score, accuracy_score
import numpy as np

def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)

    true_labels = []
    true_preds = []

    for pred_seq, label_seq in zip(preds, labels):
        curr_true = []
        curr_pred = []
        for p, l in zip(pred_seq, label_seq):
            if l == -100:  # ignore special tokens
                continue
            curr_true.append(id2label[l])
            curr_pred.append(id2label[p])
        true_labels.append(curr_true)
        true_preds.append(curr_pred)

    return {
        "accuracy": accuracy_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds)
    }
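The two metrics measure different things: accuracy is computed per token, while the F1 from seqeval is computed per span, so one wrong token inside a span costs the whole span. The sketch below is a simplified re-implementation of span-level F1 for illustration only. It is not seqeval's code, and seqeval may treat stray `I-` tags differently. The names `extract_spans` and `span_f1` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (slot, start, end) spans in one BIO sequence."""
    spans = set()
    start, slot = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues_span = tag.startswith("I-") and tag[2:] == slot
        if continues_span:
            continue
        if slot is not None:
            spans.add((slot, start, i - 1))
        if tag == "O":
            start, slot = None, None
        else:
            start, slot = i, tag[2:]  # B- tag, or a stray I- tag, opens a span
    return spans


def span_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact span matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-ACTOR", "I-ACTOR", "O", "B-ACTION"]]
pred = [["B-ACTOR", "O", "O", "B-ACTION"]]
token_acc = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
f1 = span_f1(gold, pred)
print(token_acc, f1)  # 0.75 0.5
```

Three of four tokens are right, but only one of two spans matches exactly, which is the same kind of gap seen between accuracy and F1 in the training log.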


### Cell 11: Create the Trainer and start training BERT

In [14]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,      # evaluate on dev each epoch
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

trainer.train()




| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.876613 | 0.697297 | 0.0875 |
| 2 | 0.987500 | 0.720884 | 0.745946 | 0.40458 |
| 3 | 0.987500 | 0.726624 | 0.774775 | 0.480916 |
| 4 | 0.445400 | 0.753625 | 0.763964 | 0.479167 |
| 5 | 0.445400 | 0.785769 | 0.76036 | 0.462094 |




TrainOutput(global_step=145, training_loss=0.567440677511281, metrics={'train_runtime': 879.1885, 'train_samples_per_second': 1.297, 'train_steps_per_second': 0.165, 'total_flos': 38982635435880.0, 'train_loss': 0.567440677511281, 'epoch': 5.0})
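The step count and the best epoch can be checked by hand from the numbers above. This sketch assumes a single device and no gradient accumulation, and the dev F1 values are copied from the training log. The helper name `total_steps` is illustrative.

```python
import math


def total_steps(n_train, batch_size, epochs):
    """Optimizer steps for a run with one device and no gradient accumulation."""
    return math.ceil(n_train / batch_size) * epochs


bert_steps = total_steps(n_train=228, batch_size=8, epochs=5)
print(bert_steps)  # 145, matching global_step in TrainOutput

# Dev F1 per epoch, copied from the training log above
dev_f1_by_epoch = {1: 0.0875, 2: 0.40458, 3: 0.480916, 4: 0.479167, 5: 0.462094}
best_epoch = max(dev_f1_by_epoch, key=dev_f1_by_epoch.get)
best_step = math.ceil(228 / 8) * best_epoch
print(best_epoch, best_step)  # 3 87
```

With `load_best_model_at_end=True` and `metric_for_best_model="f1"`, the Trainer reloads the epoch 3 weights after training, so later evaluation uses that epoch rather than the final one.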

### Cell 12: Evaluate on dev + test (get baseline metrics)

In [15]:
# Evaluate on dev set
dev_results = trainer.evaluate(eval_dataset=dev_dataset)
print("DEV BASELINE:", dev_results)

# Evaluate on test set
test_results = trainer.evaluate(eval_dataset=test_dataset)
print("TEST BASELINE:", test_results)




DEV BASELINE: {'eval_loss': 0.7266241312026978, 'eval_accuracy': 0.7747747747747747, 'eval_f1': 0.48091603053435117, 'eval_runtime': 1.8782, 'eval_samples_per_second': 26.089, 'eval_steps_per_second': 3.727, 'epoch': 5.0}
TEST BASELINE: {'eval_loss': 0.691470205783844, 'eval_accuracy': 0.7797147385103012, 'eval_f1': 0.460431654676259, 'eval_runtime': 4.1597, 'eval_samples_per_second': 12.02, 'eval_steps_per_second': 1.683, 'epoch': 5.0}


In [17]:
import numpy as np

# We will focus on these slots
SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def bio_to_slots(tokens, tags):
    """
    Convert BIO tag sequence into a simple slot dictionary:
    { 'ACTOR': '...', 'ACTION': '...', ... }
    """
    slots = {s: "" for s in SLOT_NAMES}
    current_slot = None
    current_tokens = []

    def close_current():
        nonlocal current_slot, current_tokens
        if current_slot is not None and current_tokens and slots[current_slot] == "":
            slots[current_slot] = " ".join(current_tokens)
        current_slot = None
        current_tokens = []

    for tok, tag in zip(tokens, tags):
        if tag == "O" or tag == "PAD" or tag == "IGN":
            # end any ongoing span
            close_current()
            continue

        if tag.startswith("B-"):
            # close previous slot, start new
            close_current()
            slot = tag[2:]
            if slot in slots:
                current_slot = slot
                current_tokens = [tok]
            else:
                # unknown slot label, ignore
                current_slot = None
                current_tokens = []
        elif tag.startswith("I-"):
            slot = tag[2:]
            if current_slot == slot:
                current_tokens.append(tok)
            else:
                # if I- without B- or different slot, treat as new B-
                close_current()
                if slot in slots:
                    current_slot = slot
                    current_tokens = [tok]

    # close last
    close_current()
    return slots

def format_slots(slot_dict):
    return "{ " + ", ".join([f'"{k}":"{slot_dict.get(k, "")}"' for k in SLOT_NAMES]) + " }"

# 1) Get raw predictions from BERT on the test set
pred_output = trainer.predict(test_dataset)
logits = pred_output.predictions
pred_ids = np.argmax(logits, axis=-1)
true_ids = pred_output.label_ids

# 2) Decode BIO sequences (token-level) from IDs
pred_bio_seqs = []
gold_bio_seqs = []

for i in range(len(test_tokens)):
    word_ids = test_encodings.word_ids(batch_index=i)
    bio_pred = []
    bio_gold = []
    prev_word = None

    for p_id, t_id, w_id in zip(pred_ids[i], true_ids[i], word_ids):
        if w_id is None or t_id == -100:
            continue
        if w_id != prev_word:
            bio_pred.append(id2label[p_id])
            bio_gold.append(id2label[t_id])
        prev_word = w_id

    pred_bio_seqs.append(bio_pred)
    gold_bio_seqs.append(bio_gold)

# 3) Convert BIO sequences into slot dictionaries
pred_slots_list = [
    bio_to_slots(tokens, tags)
    for tokens, tags in zip(test_tokens, pred_bio_seqs)
]

gold_slots_list = [
    bio_to_slots(tokens, tags)
    for tokens, tags in zip(test_tokens, gold_bio_seqs)
]

# 4) Compute Slot-level Precision / Recall / F1 based on exact string match
TP = FP = FN = 0

for gold, pred in zip(gold_slots_list, pred_slots_list):
    for slot in SLOT_NAMES:
        g = (gold.get(slot) or "").strip()
        p = (pred.get(slot) or "").strip()

        if not g and not p:
            continue  # ignore both empty

        if p and g:
            if p == g:
                TP += 1
            else:
                FP += 1
                FN += 1
        elif p and not g:
            FP += 1
        elif g and not p:
            FN += 1

precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
recall    = TP / (TP + FN) if (TP + FN) > 0 else 0.0
slot_f1   = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

# 5) Compute Frame-Validity % (here: frame is "valid" if at least one core slot is non-empty)
valid_frames = 0
for pred in pred_slots_list:
    if any((pred.get(s) or "").strip() for s in ["ACTOR", "ACTION", "OBJECT"]):
        valid_frames += 1

frame_valid_pct = 100.0 * valid_frames / len(pred_slots_list) if len(pred_slots_list) > 0 else 0.0

print(f"Slot-F1: {slot_f1:.3f}")
print(f"Frame-Validity %: {frame_valid_pct:.1f}%")

# 6) Print a few nice text-level examples
num_examples = 5
for i in range(num_examples):
    text = test_df.iloc[i]["text"]
    print("\nText:", text)
    print("Pred:", format_slots(pred_slots_list[i]))
    print("Gold:", format_slots(gold_slots_list[i]))


Slot-F1: 0.507
Frame-Validity %: 92.0%

Text: When a character delivers a speech so powerful that it emotionally moves the others to take action and not lose hope
Pred: { "ACTOR":"character", "ACTION":"delivers", "OBJECT":"speech", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"character", "ACTION":"delivers", "OBJECT":"speech", "LOCATION":"", "TIME":"" }

Text: A character who very rarely or never shows any emotion
Pred: { "ACTOR":"character", "ACTION":"shows", "OBJECT":"emotion", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"character", "ACTION":"shows", "OBJECT":"emotion", "LOCATION":"", "TIME":"" }

Text: Limited color palette on purpose
Pred: { "ACTOR":"", "ACTION":"", "OBJECT":"Limited color palette", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"purpose", "ACTION":"", "OBJECT":"color palette", "LOCATION":"", "TIME":"" }

Text: A character comes home and finds that they cannot fit in there anymore
Pred: { "ACTOR":"character", "ACTION":"", "OBJECT":"", "LOCATION":"", "TIME":"" }
Gold: { 

### Explanation

Track 0, Step 2A (T0.2A) establishes the official BERT baseline on the new fixed Sprint 4 train/dev/test split. Because Sprint 3 used random or inconsistent splits, its metrics were not fully reliable. By retraining BERT on the fixed 228-sample training set and evaluating on the fixed dev and test sets, we now have stable and fair metrics against which all future Sprint 4 experiments can be compared.

On the fixed test set, BERT achieved Slot-F1 ≈ 0.507 and Frame-Validity ≈ 92%. Slot-F1 measures how accurately the model extracts the correct ACTOR, ACTION, OBJECT, etc. (by exact string match), while Frame-Validity measures whether the model produces at least a minimally valid frame (at least one core slot filled). These numbers are the baseline performance of BERT before any improvements. We will use them to judge whether data cleaning, better tagging, or other pipeline adjustments improve the model in later Sprint 4 tracks.
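As a worked example of the two metrics, the sketch below scores two of the test sentences printed above (the "speech" example and the "Limited color palette" example) with the same counting rules as the evaluation cell. The function name `slot_scores` and the returned keys are illustrative assumptions.

```python
SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]
CORE_SLOTS = ["ACTOR", "ACTION", "OBJECT"]


def slot_scores(gold_frames, pred_frames):
    """Exact-match slot precision/recall/F1 plus frame validity in percent."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_frames, pred_frames):
        for slot in SLOT_NAMES:
            g = gold.get(slot, "").strip()
            p = pred.get(slot, "").strip()
            if g and p and g == p:
                tp += 1
            else:
                fp += bool(p)  # wrong or spurious prediction
                fn += bool(g)  # gold value missed or mismatched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    valid = sum(any(p.get(s, "").strip() for s in CORE_SLOTS) for p in pred_frames)
    validity = 100.0 * valid / len(pred_frames) if pred_frames else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "frame_validity_pct": validity}


gold = [
    {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech"},
    {"ACTOR": "purpose", "OBJECT": "color palette"},
]
pred = [
    {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech"},
    {"OBJECT": "Limited color palette"},
]
scores = slot_scores(gold, pred)
print(scores)
```

Here TP = 3, FP = 1, and FN = 2, giving precision 0.75, recall 0.60, and F1 ≈ 0.667. Both predicted frames count as valid even though the second has no correct slot, which shows that Frame-Validity says nothing about correctness.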

### T0.2B: T5 Baseline on Fixed Split

### Cell 1: Load the fixed split (same as BERT)

In [1]:
import pandas as pd

df = pd.read_json("idea_annotator_sprint4_split_fixed.jsonl", lines=True)

train_df = df[df["split"] == "train"].reset_index(drop=True)
dev_df   = df[df["split"] == "dev"].reset_index(drop=True)
test_df  = df[df["split"] == "test"].reset_index(drop=True)

len(train_df), len(dev_df), len(test_df)


(228, 49, 50)

### Cell 2: Prepare T5 inputs and targets

For T5:

Input = the raw text

Target = a clean JSON string built from target_json with keys
ACTOR, ACTION, OBJECT, LOCATION, TIME.

In [2]:
import json

SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def json_to_str(j):
    # j is a Python dict from the target_json column
    # Ensure all 5 slots exist and convert to a JSON string
    out = {}
    for slot in SLOT_NAMES:
        val = j.get(slot, "")
        if val is None:
            val = ""
        out[slot] = val
    return json.dumps(out, ensure_ascii=False)

# Prepare inputs (texts) and targets (JSON strings) for T5
train_inputs = train_df["text"].tolist()
dev_inputs   = dev_df["text"].tolist()
test_inputs  = test_df["text"].tolist()

train_targets = [json_to_str(j) for j in train_df["target_json"]]
dev_targets   = [json_to_str(j) for j in dev_df["target_json"]]
test_targets  = [json_to_str(j) for j in test_df["target_json"]]

len(train_inputs), len(dev_inputs), len(test_inputs)


(228, 49, 50)
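To make the target format concrete, here is a small sketch of what one target string looks like. The helper `to_target_string` is a compact, hypothetical stand-in for `json_to_str`, and the input annotation is an illustrative partial dict.

```python
import json

SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]


def to_target_string(annotation):
    """Serialize an annotation with all five slots present, in a fixed order."""
    frame = {slot: (annotation.get(slot) or "") for slot in SLOT_NAMES}
    return json.dumps(frame, ensure_ascii=False)


target = to_target_string(
    {"ACTOR": "character", "ACTION": "shows", "OBJECT": "emotion", "LOCATION": None}
)
print(target)
# {"ACTOR": "character", "ACTION": "shows", "OBJECT": "emotion", "LOCATION": "", "TIME": ""}
```

Missing and `None` slots become empty strings, so every target has the same five keys in the same order.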

### Cell 3: Load T5 tokenizer + model

In [21]:
!pip install sentencepiece



In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small")




### Cell 4: Tokenize inputs + targets for T5

In [4]:
max_input_len = 256
max_target_len = 128

# Encode input texts
train_encodings = t5_tokenizer(
    train_inputs,
    truncation=True,
    padding=True,
    max_length=max_input_len
)

dev_encodings = t5_tokenizer(
    dev_inputs,
    truncation=True,
    padding=True,
    max_length=max_input_len
)

test_encodings = t5_tokenizer(
    test_inputs,
    truncation=True,
    padding=True,
    max_length=max_input_len
)

# Encode target JSON strings
with t5_tokenizer.as_target_tokenizer():
    train_labels_enc = t5_tokenizer(
        train_targets,
        truncation=True,
        padding=True,
        max_length=max_target_len
    )
    dev_labels_enc = t5_tokenizer(
        dev_targets,
        truncation=True,
        padding=True,
        max_length=max_target_len
    )
    test_labels_enc = t5_tokenizer(
        test_targets,
        truncation=True,
        padding=True,
        max_length=max_target_len
    )
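Because the targets are tokenized with `padding=True`, the label sequences contain the pad token id, and the loss ignores only positions labelled -100. A common adjustment, sketched here as an option that was not applied in this run, is to replace pad ids in the labels with -100. The function name `mask_padding` and the example ids are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_padding(label_ids, pad_token_id):
    """Replace pad ids in every label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]


# Made-up ids; 0 plays the role of the pad token id
padded = [[71, 9, 1, 0, 0], [71, 1, 0, 0, 0]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)  # [[71, 9, 1, -100, -100], [71, 1, -100, -100, -100]]
```

In the notebook this would be applied to `train_labels_enc["input_ids"]` with `t5_tokenizer.pad_token_id`. Doing so would change the reported loss values, so they would no longer be comparable with the numbers in this baseline.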




### Cell 5: Build T5 Dataset Objects

In [5]:
import torch

class T5Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels["input_ids"][idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

# Build datasets
train_dataset = T5Dataset(train_encodings, train_labels_enc)
dev_dataset   = T5Dataset(dev_encodings, dev_labels_enc)
test_dataset  = T5Dataset(test_encodings, test_labels_enc)

len(train_dataset), len(dev_dataset), len(test_dataset)


(228, 49, 50)

### Cell 6: Create TrainingArguments + Trainer for T5

In [7]:
from transformers import TrainingArguments, DataCollatorForSeq2Seq, Trainer

training_args = TrainingArguments(
    output_dir="./t5_s4_outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
)

# Data collator for seq2seq (same idea as Sprint 3)
data_collator = DataCollatorForSeq2Seq(
    tokenizer=t5_tokenizer,
    model=t5_model
)

trainer = Trainer(
    model=t5_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=t5_tokenizer,
    data_collator=data_collator
)




### Cell 7: Train T5 baseline

In [8]:
trainer.train()




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 4.3752 | 2.078088 |
| 2 | 2.0735 | 1.177325 |
| 3 | 1.332 | 0.883351 |
| 4 | 1.0409 | 0.739457 |
| 5 | 0.9235 | 0.698634 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=285, training_loss=1.8152378684596011, metrics={'train_runtime': 960.2308, 'train_samples_per_second': 1.187, 'train_steps_per_second': 0.297, 'total_flos': 25011799326720.0, 'train_loss': 1.8152378684596011, 'epoch': 5.0})

### Cell 8: Evaluate T5 on dev + test (loss only)

In [9]:
# Evaluate on dev and test with the standard Trainer metrics (loss)
dev_results = trainer.evaluate(eval_dataset=dev_dataset)
print("DEV BASELINE (loss metrics):", dev_results)

test_results = trainer.evaluate(eval_dataset=test_dataset)
print("TEST BASELINE (loss metrics):", test_results)




DEV BASELINE (loss metrics): {'eval_loss': 0.6986342668533325, 'eval_runtime': 5.5822, 'eval_samples_per_second': 8.778, 'eval_steps_per_second': 2.329, 'epoch': 5.0}
TEST BASELINE (loss metrics): {'eval_loss': 0.6277286410331726, 'eval_runtime': 7.2534, 'eval_samples_per_second': 6.893, 'eval_steps_per_second': 1.792, 'epoch': 5.0}


### Cell 9: Generate T5 predictions, compute Slot-F1 & Frame-Validity, show examples

In [11]:
import torch

t5_model.eval()

pred_strs = []
batch_size = 8
device = t5_model.device

for i in range(0, len(test_inputs), batch_size):
    batch_texts = test_inputs[i:i+batch_size]

    enc = t5_tokenizer(
        batch_texts,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=max_input_len
    ).to(device)

    with torch.no_grad():
        out_ids = t5_model.generate(
            **enc,
            max_length=max_target_len,
            num_beams=1   # FASTEST
        )

    batch_preds = [
        t5_tokenizer.decode(ids, skip_special_tokens=True)
        for ids in out_ids
    ]

    pred_strs.extend(batch_preds)

len(pred_strs)


50

### Cell 9b (FULL evaluation + examples)

In [12]:
import json

# --- Helper: Convert predicted JSON string to dict with 5 slots ---
def clean_pred_json(pred_str):
    try:
        data = json.loads(pred_str)
        if not isinstance(data, dict):
            return {s: "" for s in SLOT_NAMES}
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {s: "" for s in SLOT_NAMES}

    out = {}
    for s in SLOT_NAMES:
        val = data.get(s, "")
        if val is None:
            val = ""
        out[s] = str(val)
    return out

# --- Convert gold into standard dict format as well ---
def clean_gold_json(gold_dict):
    out = {}
    for s in SLOT_NAMES:
        val = gold_dict.get(s, "")
        if val is None:
            val = ""
        out[s] = str(val)
    return out

# Parse predicted + gold slots
pred_slots_list = [clean_pred_json(p) for p in pred_strs]
gold_slots_list = [clean_gold_json(j) for j in test_df["target_json"]]

# --- Compute Slot-level F1 score ---
TP = FP = FN = 0

for gold, pred in zip(gold_slots_list, pred_slots_list):
    for slot in SLOT_NAMES:
        g = gold[slot].strip()
        p = pred[slot].strip()

        if g == "" and p == "":
            continue

        if g != "" and p != "":
            if g == p:
                TP += 1
            else:
                FP += 1
                FN += 1
        elif g != "" and p == "":
            FN += 1
        elif g == "" and p != "":
            FP += 1

precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
recall    = TP / (TP + FN) if (TP + FN) > 0 else 0.0
slot_f1   = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

# --- Compute Frame Validity (at least one of ACTOR/ACTION/OBJECT filled) ---
valid_frames = 0
for pred in pred_slots_list:
    if any(pred[s].strip() for s in ["ACTOR", "ACTION", "OBJECT"]):
        valid_frames += 1

frame_valid_pct = 100 * valid_frames / len(pred_slots_list)

print(f"Slot-F1: {slot_f1:.3f}")
print(f"Frame-Validity %: {frame_valid_pct:.1f}%")

# --- Print example outputs ---
def fmt(d):
    return "{ " + ", ".join([f'"{s}":"{d[s]}"' for s in SLOT_NAMES]) + " }"

num_examples = 5
for i in range(num_examples):
    print("\nText:", test_df.iloc[i]["text"])
    print("Pred:", fmt(pred_slots_list[i]))
    print("Gold:", fmt(gold_slots_list[i]))


Slot-F1: 0.000
Frame-Validity %: 0.0%

Text: When a character delivers a speech so powerful that it emotionally moves the others to take action and not lose hope
Pred: { "ACTOR":"", "ACTION":"", "OBJECT":"", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"character", "ACTION":"delivers", "OBJECT":"speech", "LOCATION":"", "TIME":"" }

Text: A character who very rarely or never shows any emotion
Pred: { "ACTOR":"", "ACTION":"", "OBJECT":"", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"character", "ACTION":"shows", "OBJECT":"emotion", "LOCATION":"", "TIME":"" }

Text: Limited color palette on purpose
Pred: { "ACTOR":"", "ACTION":"", "OBJECT":"", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"purpose", "ACTION":"", "OBJECT":"color palette", "LOCATION":"", "TIME":"" }

Text: A character comes home and finds that they cannot fit in there anymore
Pred: { "ACTOR":"", "ACTION":"", "OBJECT":"", "LOCATION":"", "TIME":"" }
Gold: { "ACTOR":"character", "ACTION":"comes", "OBJECT":"", "LOCATION":"", "TIME":"

In [13]:
# Inspect first 5 raw T5 outputs + their gold JSON strings
for i in range(5):
    print(f"\n=== Example {i+1} ===")
    print("RAW PRED:", repr(pred_strs[i]))
    print("GOLD STR:", test_targets[i])



=== Example 1 ===
RAW PRED: '""""""""""""""""""""""""""""""""""""""""""""""""""""""""'
GOLD STR: {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech", "LOCATION": "", "TIME": ""}

=== Example 2 ===
RAW PRED: '"ACTOR": "character", "ACTION": ""'
GOLD STR: {"ACTOR": "character", "ACTION": "shows", "OBJECT": "emotion", "LOCATION": "", "TIME": ""}

=== Example 3 ===
RAW PRED: '"Long color palette": "limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "Limited color palette", "'
GOLD STR: {"ACTOR": "purpose", "ACTION": "", "OBJECT": "color palette", "LOCATION": "", "TIME": ""}

=== Example 4 ===
RAW PRED: "Un personnage rejoigne et découvre qu'il ne s'y trouve plus et qu'il ne s'y trouv

### Explanation 

Purpose:
Establish a fair baseline for T5 using the new, raw Sprint 4 dataset with the fixed train/dev/test split.

Setup:

- Model: T5-small
- Input: raw text
- Target: strict JSON (5 slots: ACTOR, ACTION, OBJECT, LOCATION, TIME)
- Train size: 228
- Dev size: 49
- Test size: 50

Results (Test Set):

- Slot-F1: 0.000
- Frame-Validity: 0.0%

**Interpretation:**

- T5 does not generate valid JSON under the Sprint 4 setup: Frame-Validity is 0.0%, so no test prediction parsed into a frame with a core slot filled.
- The raw predictions are empty, malformed, or not valid JSON (runs of quote characters, missing braces, looping phrases, French text).
- This does not mean the model is broken; it suggests that the task (raw text → strict JSON) is too hard for T5-small with only 228 examples.
- This creates a weak but honest baseline, which we will improve in later tracks (Track 1, Track 2, Track 3 of Sprint 4).
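Raw prediction 2 above is a well-formed JSON body with no surrounding braces, so strict `json.loads` rejects it even though the slot content is partly right. A commonly reported cause is that the T5 SentencePiece vocabulary has no tokens for `{` and `}`. That has not been verified for this run; it can be checked by encoding and then decoding one gold string. The sketch below shows a lenient parser that adds missing braces before parsing. The name `lenient_parse` is hypothetical, and this is one possible mitigation rather than the method used in this notebook.

```python
import json

SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]


def lenient_parse(pred_str):
    """Parse a predicted frame, adding outer braces if the model omitted them."""
    text = pred_str.strip()
    if not text.startswith("{"):
        text = "{" + text
    if not text.endswith("}"):
        text = text + "}"
    try:
        data = json.loads(text)
    except ValueError:  # includes json.JSONDecodeError
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {slot: str(data.get(slot) or "") for slot in SLOT_NAMES}


# Raw prediction 2 from the inspection cell above
frame = lenient_parse('"ACTOR": "character", "ACTION": ""')
print(frame)

# A run of quote characters still fails to parse and yields an empty frame
empty = lenient_parse('""""""""')
print(empty)
```

With this parser, prediction 2 would recover `ACTOR = "character"`, while the other malformed outputs would remain empty frames.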

---------------------------

### T0.3 Error Log (Sprint 4 Baseline)

Purpose:
Before starting any improvement experiments in Sprint 4, we need to understand how the models (BERT and T5) are making mistakes.
T0.3 creates an error log: a table that records, for each test sentence:

- the original text
- the gold slots (ACTOR, ACTION, OBJECT, LOCATION, TIME)
- the model's predicted slots
- how many slots it got correct
- whether the prediction was empty or partially correct

This is like a report card showing what the model is struggling with.
Later, when we apply improvements (Track 1, 2, 3), we can compare the results before improvements (T0.3) with the results after improvements.





### T0.3 has two parts:

**T0.3A: BERT Error Log**

Use BERT’s baseline predictions (BIO → slots).

Build a DataFrame showing BERT mistakes.

**T0.3B: T5 Error Log**

Use T5’s baseline predictions (JSON → slots).

Build a DataFrame showing T5 mistakes.

Both logs help us identify:

- frequent error patterns

- difficult sentences

- which slots fail the most

- model weaknesses to target in Sprint 4
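
As a rough illustration of the "which slots fail the most" question, the sketch below computes a per-slot miss rate from an error log that uses `gold_<SLOT>` / `pred_<SLOT>` columns. The helper name `slot_miss_rates` and the three toy rows are assumptions for illustration only; the real logs are built in the cells that follow.

```python
import pandas as pd

SLOTS = ["ACTOR", "ACTION", "OBJECT"]

# Toy error log with the gold_/pred_ column layout (illustrative rows only)
log = pd.DataFrame([
    {"gold_ACTOR": "character", "pred_ACTOR": "character",
     "gold_ACTION": "delivers", "pred_ACTION": "delivers",
     "gold_OBJECT": "speech", "pred_OBJECT": ""},
    {"gold_ACTOR": "character", "pred_ACTOR": "character",
     "gold_ACTION": "shows", "pred_ACTION": "shows",
     "gold_OBJECT": "emotion", "pred_OBJECT": "emotion"},
    {"gold_ACTOR": "purpose", "pred_ACTOR": "",
     "gold_ACTION": "", "pred_ACTION": "",
     "gold_OBJECT": "color palette", "pred_OBJECT": "Limited color palette"},
])


def slot_miss_rates(df, slots):
    """Hypothetical helper: for each slot, the fraction of rows with a
    non-empty gold value where the prediction is not an exact match."""
    rates = {}
    for slot in slots:
        has_gold = df[f"gold_{slot}"] != ""
        n_gold = int(has_gold.sum())
        if n_gold == 0:
            rates[slot] = None  # slot never annotated in this log
            continue
        wrong = has_gold & (df[f"pred_{slot}"] != df[f"gold_{slot}"])
        rates[slot] = float(wrong.sum()) / n_gold
    return rates


rates = slot_miss_rates(log, SLOTS)
print(rates)
```

On these toy rows OBJECT has the highest miss rate, ACTION the lowest. Rows with an empty gold value are excluded from the denominator so that unannotated slots do not dilute the rate.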

### T0.3A (BERT)

### Cell 1: Re-load BERT model and tokenizer

The kernel has restarted, but the earlier training run should have saved its checkpoints.
This cell will:

- Reload the BERT tokenizer

- Find the latest checkpoint in ./bert_s4_outputs

- Load that checkpoint as the BERT model we will use for predictions

In [14]:
import os
import glob
from transformers import BertTokenizerFast, BertForTokenClassification

# 1) Load tokenizer (same as before)
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# 2) Find latest BERT checkpoint from Sprint 4 training
checkpoint_dirs = glob.glob("./bert_s4_outputs/checkpoint-*")

if not checkpoint_dirs:
    print("❌ No BERT checkpoints found in ./bert_s4_outputs. "
          "You may need to re-run the BERT training cells from T0.2A.")
else:
    # Sort by step number and take the last (most recent)
    def extract_step(path):
        # path looks like './bert_s4_outputs/checkpoint-285'
        name = os.path.basename(path)
        step_str = name.split("-")[-1]
        return int(step_str) if step_str.isdigit() else -1

    checkpoint_dirs = sorted(checkpoint_dirs, key=extract_step)
    # Note: this is the latest checkpoint by step number, not necessarily the best-scoring one
    best_ckpt = checkpoint_dirs[-1]
    print("✅ Loading BERT checkpoint:", best_ckpt)

    bert_model = BertForTokenClassification.from_pretrained(best_ckpt)
    bert_model.eval()

    # id2label mapping (e.g. 0 -> B-ACTION, 1 -> B-ACTOR, etc.)
    id2label = bert_model.config.id2label
    label2id = bert_model.config.label2id

    print("Labels in this model:", id2label)




✅ Loading BERT checkpoint: ./bert_s4_outputs\checkpoint-145
Labels in this model: {0: 'B-ACTION', 1: 'B-ACTOR', 2: 'B-LOCATION', 3: 'B-OBJECT', 4: 'I-ACTION', 5: 'I-ACTOR', 6: 'I-LOCATION', 7: 'I-OBJECT', 8: 'O'}


### Cell 2: Run BERT on test set and convert to slot predictions

This cell will:
- Tokenize the test sentences for BERT

- Run the loaded BERT model to get predictions

- Convert predicted BIO tags → slot spans

- Build:

1) bert_gold_slots_list

2) bert_pred_slots_list

In [15]:
import torch

# We'll use these 4 slots for BERT (TIME not in label set)
SLOT_NAMES_BERT = ["ACTOR", "ACTION", "OBJECT", "LOCATION"]

# Helper: convert BIO tags & tokens into slot dictionary
def bio_to_slots(tokens, tags, slot_names=SLOT_NAMES_BERT):
    slots = {s: "" for s in slot_names}
    current_slot = None
    current_tokens = []

    def close_current():
        nonlocal current_slot, current_tokens
        if current_slot is not None and current_tokens and slots[current_slot] == "":
            slots[current_slot] = " ".join(current_tokens)
        current_slot = None
        current_tokens = []

    for tok, tag in zip(tokens, tags):
        if tag == "O":
            close_current()
            continue

        if tag.startswith("B-"):
            close_current()
            slot = tag[2:]
            if slot in slots:
                current_slot = slot
                current_tokens = [tok]
            else:
                current_slot = None
                current_tokens = []
        elif tag.startswith("I-"):
            slot = tag[2:]
            if current_slot == slot:
                current_tokens.append(tok)
            else:
                close_current()
                if slot in slots:
                    current_slot = slot
                    current_tokens = [tok]

    close_current()
    return slots

# 1) Prepare test tokens (simple whitespace tokenization)
test_texts = test_df["text"].tolist()
test_tokens_bert = [t.split() for t in test_texts]

# 2) Tokenize with BERT (word-level alignment)
bert_encodings = bert_tokenizer(
    test_tokens_bert,
    is_split_into_words=True,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# 3) Move model & data to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)

input_ids = bert_encodings["input_ids"].to(device)
attention_mask = bert_encodings["attention_mask"].to(device)

# 4) Run BERT to get predictions
with torch.no_grad():
    outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits  # (batch, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1).cpu().numpy()

# 5) Convert predicted IDs → BIO tags per word, then → slots
bert_pred_slots_list = []
bert_gold_slots_list = []

for i in range(len(test_tokens_bert)):
    word_ids = bert_encodings.word_ids(batch_index=i)
    bio_preds = []
    
    prev_word = None
    for token_idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev_word:
            label_id = int(pred_ids[i][token_idx])
            tag = id2label[label_id]
            bio_preds.append(tag)
            prev_word = w_id

    # Predicted slots from BIO
    pred_slots = bio_to_slots(test_tokens_bert[i], bio_preds)
    bert_pred_slots_list.append(pred_slots)

    # Gold slots from test_df["target_json"]
    gold_json = test_df.iloc[i]["target_json"]
    gold_slots = {s: (gold_json.get(s, "") or "") for s in SLOT_NAMES_BERT}
    bert_gold_slots_list.append(gold_slots)

len(bert_pred_slots_list), len(bert_gold_slots_list)


(50, 50)
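
To show what the BIO → slot conversion does on a single sentence, here is a small worked example. `bio_to_slots_sketch` is a condensed restatement of the `bio_to_slots` logic above, written so the block runs on its own. The tag sequences are hypothetical and hand-written for illustration, not actual model output.

```python
def bio_to_slots_sketch(tokens, tags,
                        slot_names=("ACTOR", "ACTION", "OBJECT", "LOCATION")):
    """Condensed version of the BIO -> slots logic: the first span found
    for each slot wins; later spans for the same slot are dropped."""
    slots = {s: "" for s in slot_names}
    current_slot, current_tokens = None, []

    def flush():
        nonlocal current_slot, current_tokens
        if current_slot is not None and current_tokens and slots[current_slot] == "":
            slots[current_slot] = " ".join(current_tokens)
        current_slot, current_tokens = None, []

    for tok, tag in zip(tokens, tags):
        prefix, _, slot = tag.partition("-")
        if prefix == "I" and slot == current_slot:
            current_tokens.append(tok)  # continue the open span
            continue
        flush()  # "O", "B-*", or a mismatched "I-*" closes the open span
        if prefix in ("B", "I") and slot in slots:
            current_slot, current_tokens = slot, [tok]
    flush()
    return slots


# Hypothetical tags that would yield the over-long OBJECT span
first = bio_to_slots_sketch(
    "Limited color palette on purpose".split(),
    ["B-OBJECT", "I-OBJECT", "I-OBJECT", "O", "O"],
)

# Two ACTOR spans: only the first one is kept
second = bio_to_slots_sketch(
    ["A", "hero", "meets", "a", "villain"],
    ["O", "B-ACTOR", "B-ACTION", "O", "B-ACTOR"],
)
print(first)
print(second)
```

The second call illustrates a design choice in the conversion: each slot holds a single string, so a second span with the same label ("villain") is silently discarded rather than appended.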

### Cell 3: Build BERT error log DataFrame

In [16]:
import pandas as pd

rows = []

for i in range(len(test_df)):
    text = test_df.iloc[i]["text"]
    gold = bert_gold_slots_list[i]   # dict with ACTOR/ACTION/OBJECT/LOCATION
    pred = bert_pred_slots_list[i]   # dict with ACTOR/ACTION/OBJECT/LOCATION

    row = {
        "text": text
    }

    correct_count = 0

    for slot in SLOT_NAMES_BERT:  # ["ACTOR", "ACTION", "OBJECT", "LOCATION"]
        g = (gold.get(slot, "") or "").strip()
        p = (pred.get(slot, "") or "").strip()

        row[f"gold_{slot}"] = g
        row[f"pred_{slot}"] = p

        if g != "" and p == g:
            correct_count += 1

    row["num_correct_slots"] = correct_count

    any_core_pred = any(
        (pred.get(s, "") or "").strip()
        for s in ["ACTOR", "ACTION", "OBJECT"]
    )
    all_empty_pred = not any(
        (pred.get(s, "") or "").strip()
        for s in SLOT_NAMES_BERT
    )

    row["any_core_pred"] = any_core_pred
    row["all_empty_pred"] = all_empty_pred

    rows.append(row)

bert_error_log_df = pd.DataFrame(rows)
bert_error_log_df.head()


Unnamed: 0,text,gold_ACTOR,pred_ACTOR,gold_ACTION,pred_ACTION,gold_OBJECT,pred_OBJECT,gold_LOCATION,pred_LOCATION,num_correct_slots,any_core_pred,all_empty_pred
0,When a character delivers a speech so powerful...,character,character,delivers,delivers,speech,,,,2,True,False
1,A character who very rarely or never shows any...,character,character,shows,shows,emotion,emotion,,,3,True,False
2,Limited color palette on purpose,purpose,,,,color palette,Limited color palette,,,0,True,False
3,A character comes home and finds that they can...,character,character,comes,,,,,,1,True,False
4,The best (or only) way to get rid of something...,way,way,burning,burning,,,,,2,True,False


### Explanation:

### T0.3A: BERT Error Log (Sprint 4 Baseline)

**Purpose:**  
After training the BERT BIO tagging model on the Sprint 4 fixed split, I generated predictions on the 50 test sentences and converted them back into slot values (ACTOR, ACTION, OBJECT, LOCATION).  
T0.3A creates an *error log* so I can see, for each test sentence:

- the original text,
- the gold slots from `target_json`,
- BERT’s predicted slots,
- how many slots were correct (`num_correct_slots`),
- whether BERT predicted any core slot at all.

**What I see:**  
- Many rows have matching `gold_` and `pred_` values for ACTOR / ACTION / OBJECT, which is consistent with the earlier BERT baseline metrics (Slot-F1 ≈ 0.50, Frame-Validity ≈ 92%).  
- Some rows show typical errors, e.g. BERT chooses a slightly different span (“Limited color palette” instead of “color palette”), or misses a slot completely.

This error log will be reused later to explain common BERT error patterns and to compare against any Sprint 4 improvements.
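
One possible way to turn these observations into countable error patterns is to label each gold/pred pair with an error type. The sketch below is an assumption about how that could be done, not part of the baseline: `classify_slot` and its category names are hypothetical, and "partial overlap" is defined loosely as sharing at least one lowercase token.

```python
from collections import Counter


def classify_slot(gold, pred):
    """Hypothetical error typing for a single slot value pair."""
    gold, pred = gold.strip(), pred.strip()
    if not gold and not pred:
        return "both_empty"
    if gold == pred:
        return "exact"
    if not pred:
        return "missed"      # gold has a value, model predicted nothing
    if not gold:
        return "spurious"    # model predicted a value where gold is empty
    if set(gold.lower().split()) & set(pred.lower().split()):
        return "partial_overlap"  # e.g. boundary error on the same phrase
    return "wrong_span"


# (gold, pred) pairs taken from the BERT error log rows shown above
pairs = [
    ("character", "character"),
    ("speech", ""),
    ("color palette", "Limited color palette"),
    ("purpose", ""),
    ("", ""),
]

labels = [classify_slot(g, p) for g, p in pairs]
counts = Counter(labels)
print(labels)
print(counts)
```

Counting these labels per slot would separate boundary errors (like the extra "Limited") from slots that BERT misses entirely.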


### T0.3B – T5 Error Log

The goal is to build a table (DataFrame) that shows exactly how the T5 baseline failed for each of the 50 test sentences.

To do that, we need to:

1) Load the Sprint 4 fixed split dataset again (so we have test_df with the same 50 test examples)

2) Extract the gold slots from each row’s target_json (ACTOR, ACTION, etc.)

3) Load the T5 predictions we previously generated (pred_strs)

4) Parse the predictions into slot dictionaries (mostly empty because T5 couldn’t produce valid JSON)

5) Combine everything into an error log row-by-row:

- gold slots
- predicted slots
- count how many slots match
- mark if T5 predicted anything at all
- mark if everything was empty

### Cell 1: Load dataset + recreate train/dev/test split

In [17]:
import json
import pandas as pd

# Load the Sprint 4 fixed split dataset
df = pd.read_json("idea_annotator_sprint4_split_fixed.jsonl", lines=True)

# Recreate train/dev/test from the fixed split
train_df = df[df["split"] == "train"].reset_index(drop=True)
dev_df   = df[df["split"] == "dev"].reset_index(drop=True)
test_df  = df[df["split"] == "test"].reset_index(drop=True)

len(train_df), len(dev_df), len(test_df)


(228, 49, 50)

### Cell 2: Define SLOT_NAMES + build gold_slots_list

In [18]:
# T0.3B – Cell 2: define slot names and build gold_slots_list from test_df

SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def normalize_to_slot_dict(raw):
    """
    Take target_json (string or dict) and return a clean dict
    with exactly the 5 slot keys. Missing ones become "".
    """
    # If it's a string, try to parse as JSON
    if isinstance(raw, str):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            data = {}
    elif isinstance(raw, dict):
        data = raw
    else:
        data = {}

    # Make sure all 5 slots exist as strings
    slots = {}
    for slot in SLOT_NAMES:
        value = data.get(slot, "")
        if value is None:
            value = ""
        slots[slot] = str(value).strip()
    return slots

# Build gold_slots_list for the 50 test examples
gold_slots_list = [normalize_to_slot_dict(x) for x in test_df["target_json"]]

# Quick sanity check: first 3 gold slot dicts
gold_slots_list[:3]


[{'ACTOR': 'character',
  'ACTION': 'delivers',
  'OBJECT': 'speech',
  'LOCATION': '',
  'TIME': ''},
 {'ACTOR': 'character',
  'ACTION': 'shows',
  'OBJECT': 'emotion',
  'LOCATION': '',
  'TIME': ''},
 {'ACTOR': 'purpose',
  'ACTION': '',
  'OBJECT': 'color palette',
  'LOCATION': '',
  'TIME': ''}]

### Cell 3: Parse pred_strs into pred_slots_list

In [19]:
# T0.3B – Cell 3: parse T5 generated predictions (pred_strs) into pred slot dicts

def parse_predicted_json(raw_pred):
    """
    Takes a raw T5 prediction (string).
    Attempts to load JSON; if fails, return empty slots.
    """
    if not isinstance(raw_pred, str):
        return {slot: "" for slot in SLOT_NAMES}

    raw_pred = raw_pred.strip()

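    # Try to load as JSON
    try:
        data = json.loads(raw_pred)
    except json.JSONDecodeError:
        # invalid JSON → return empty slots
        return {slot: "" for slot in SLOT_NAMES}

    # Valid JSON that is not an object (e.g. a bare string or list) has no slots
    if not isinstance(data, dict):
        return {slot: "" for slot in SLOT_NAMES}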
    
    # Normalize: ensure all 5 slots appear
    slots = {}
    for slot in SLOT_NAMES:
        value = data.get(slot, "")
        if value is None:
            value = ""
        slots[slot] = str(value).strip()
    return slots

# Build pred_slots_list from pred_strs
pred_slots_list = [parse_predicted_json(p) for p in pred_strs]

# Sanity preview of first few predictions
pred_slots_list[:3]


[{'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': ''},
 {'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': ''},
 {'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': ''}]

### Create the actual T5 Error Log DataFrame

In [20]:
# T0.3B – Cell 4: build the T5 error log DataFrame

rows = []

for i in range(len(test_df)):
    text = test_df.loc[i, "text"]

    gold = gold_slots_list[i]
    pred = pred_slots_list[i]

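    # count how many slots match exactly
    # NOTE: an empty gold slot matched by an empty prediction also counts here,
    # so an all-empty prediction can still score above 0 (unlike the BERT log,
    # which only counts slots with a non-empty gold value)
    num_correct = sum(
        1 for slot in SLOT_NAMES 
        if gold.get(slot, "") == pred.get(slot, "")
    )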

    # any of ACTOR/ACTION/OBJECT non-empty?
    any_core_pred = any(pred[slot] != "" for slot in ["ACTOR", "ACTION", "OBJECT"])

    # all slots empty?
    all_empty_pred = all(pred[slot] == "" for slot in SLOT_NAMES)

    row = {
        "text": text,
    }

    # add gold_ columns
    for slot in SLOT_NAMES:
        row[f"gold_{slot}"] = gold[slot]

    # add pred_ columns
    for slot in SLOT_NAMES:
        row[f"pred_{slot}"] = pred[slot]

    row["num_correct_slots"] = num_correct
    row["any_core_pred"] = any_core_pred
    row["all_empty_pred"] = all_empty_pred

    rows.append(row)

t5_error_log_df = pd.DataFrame(rows)

t5_error_log_df.head()


Unnamed: 0,text,gold_ACTOR,gold_ACTION,gold_OBJECT,gold_LOCATION,gold_TIME,pred_ACTOR,pred_ACTION,pred_OBJECT,pred_LOCATION,pred_TIME,num_correct_slots,any_core_pred,all_empty_pred
0,When a character delivers a speech so powerful...,character,delivers,speech,,,,,,,,2,False,True
1,A character who very rarely or never shows any...,character,shows,emotion,,,,,,,,2,False,True
2,Limited color palette on purpose,purpose,,color palette,,,,,,,,3,False,True
3,A character comes home and finds that they can...,character,comes,,,,,,,,,3,False,True
4,The best (or only) way to get rid of something...,way,burning,,,,,,,,,3,False,True
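
The `num_correct_slots` values of 2 and 3 in this table come from empty gold slots matching empty predictions, not from anything T5 extracted. The sketch below contrasts the counting rule used in the cell above with a stricter rule that only credits non-empty gold slots. The function names are hypothetical and the strict rule is an assumption about what a fairer count could look like; it mirrors the `g != "" and p == g` condition in the BERT log.

```python
SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]


def count_matches_lenient(gold, pred):
    # Same rule as the T5 log cell: any equal pair counts, including "" == ""
    return sum(1 for s in SLOT_NAMES if gold.get(s, "") == pred.get(s, ""))


def count_matches_strict(gold, pred):
    # Hypothetical stricter rule: only non-empty gold slots can be "correct"
    return sum(
        1 for s in SLOT_NAMES
        if gold.get(s, "") != "" and gold.get(s, "") == pred.get(s, "")
    )


# First gold frame from the test set, paired with an all-empty prediction
gold = {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech",
        "LOCATION": "", "TIME": ""}
empty_pred = {s: "" for s in SLOT_NAMES}

lenient = count_matches_lenient(gold, empty_pred)
strict = count_matches_strict(gold, empty_pred)
print(lenient, strict)  # prints "2 0"
```

Under the strict rule every all-empty T5 prediction scores 0, which lines up with the reported Slot-F1 of 0.0 and makes the T5 and BERT logs directly comparable.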


---------------------------

### T0.4 – Cell 1: Collect baseline metrics

In [24]:
# T0.4 – Baseline Summary (FINAL, using the metrics from T0.2A and T0.2B)

import pandas as pd

# Official metrics from T0.2A (BERT) and T0.2B (T5) on Sprint 4 fixed split
bert_slot_f1 = 0.507      # from your T0.2A result
bert_frame_validity = 92.0

t5_slot_f1 = 0.0          # from your T0.2B result
t5_frame_validity = 0.0

# Use the existing error logs just to count how many predictions were totally empty
bert_empty = int(bert_error_log_df["all_empty_pred"].sum())
t5_empty = int(t5_error_log_df["all_empty_pred"].sum())

summary_for_report = pd.DataFrame({
    "Model": ["BERT token tagging", "T5 text-to-JSON"],
    "Slots_used": [
        "ACTOR, ACTION, OBJECT, LOCATION",
        "ACTOR, ACTION, OBJECT, LOCATION, TIME"
    ],
    "Slot_F1": [bert_slot_f1, t5_slot_f1],
    "Frame_validity_%": [bert_frame_validity, t5_frame_validity],
    "#All_empty_predictions": [bert_empty, t5_empty]
})

summary_for_report


Unnamed: 0,Model,Slots_used,Slot_F1,Frame_validity_%,#All_empty_predictions
0,BERT token tagging,"ACTOR, ACTION, OBJECT, LOCATION",0.507,92.0,1
1,T5 text-to-JSON,"ACTOR, ACTION, OBJECT, LOCATION, TIME",0.0,0.0,50


### Track 0 – Baseline Results on Sprint 4 Fixed Split

In Track 0, I established two baselines using the Sprint 4 fixed train/dev/test split (228/49/50) and the 5-slot case-frame schema (ACTOR, ACTION, OBJECT, LOCATION, TIME).

**BERT token-tagging baseline (T0.2A).**  
I fine-tuned a BERT token classification model to predict BIO tags for four core slots (ACTOR, ACTION, OBJECT, LOCATION). On the Sprint 4 test set, BERT achieved **Slot-F1 ≈ 0.507** and **Frame-Validity ≈ 92%**. Slot-F1 measures how accurately the model extracts each slot span (e.g., correct ACTOR phrase, correct OBJECT phrase). Frame-Validity here means the percentage of test sentences where BERT produces at least a *minimally valid* frame, i.e., at least one core slot is filled. The high Frame-Validity shows that BERT almost always predicts something meaningful, but the 0.507 Slot-F1 also shows there is still room to improve the quality and boundaries of the predicted spans.

**T5 text-to-JSON baseline (T0.2B).**  
I also trained a naive **t5-small** model to directly generate a strict JSON object with all 5 slot keys from the raw idea text. On the same test set, this baseline essentially failed: it frequently produced invalid JSON strings, which became completely empty frames after parsing. As a result, the T5 baseline obtained **Slot-F1 = 0.0** and **Frame-Validity = 0.0%**, with all 50 test examples ending up as empty predictions in the T5 error log.

**Summary.**  
The comparison shows that:
- BERT, even with a simple token-tagging setup, can already recover many slot spans and produce minimally valid frames for most test ideas (92% non-empty frames, Slot-F1 ≈ 0.507).  
- A naive text-to-JSON T5 baseline collapses into invalid or empty outputs under the strict JSON format, leading to effectively zero performance.

These baselines motivate the next Tracks (1–4). Later experiments will focus on improving both **slot accuracy** and **frame validity** by refining the data and labels, improving decoding/post-processing for BERT, and redesigning the T5 setup with better prompts, constraints and JSON-aware training.


-----------------