## FYP Sprint 4 
### Ian Chia 
### 230746D

--------------------------

### Improving T5 JSON Output (Table of Contents)

### 1) Track 2.1 — Setup & Loading

### 2) Track 2.2 — Safer Decoding

### 3) Track 2.3 — JSON Repair Pipeline

### 4) Track 2.4 — Generate & Repair Predictions

### 5) Track 2.5 — Evaluate (Track 2 Metrics)

-------------------

### Track 2.1 — Setup & Loading

Main Objective:
Prepare the environment by loading libraries, constants, dataset paths, and the existing T5 model checkpoint.

2.1A: Import libraries + define slot names

2.1B: Load dataset (train/dev/test)

2.1C: Load T5 tokenizer + model from your checkpoint

### 2.1A — Import Libraries & Define Constants

Small explanation:
This cell loads all the Python libraries we will use and defines the slot names & dataset path. Nothing is executed on the model yet.

In [1]:
# 2.1A — Import libraries & define constants

import json
import re
from pathlib import Path

import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Slot names (same as Sprint 4)
SLOT_NAMES = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

# Dataset path (same as Sprint 4 fixed split)
DATA_PATH = Path("idea_annotator_sprint4_split_fixed.jsonl")

print("Loaded libraries. Ready for next step.")



Loaded libraries. Ready for next step.


### 2.1B — Load Dataset (train/dev/test)

This cell reads the Sprint 4 fixed-split JSONL file, separates the rows into train / dev / test using the file's existing split column, and prepares the test texts + gold JSON slots we’ll need later for evaluation.
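
As a rough illustration of the file layout this cell expects, the sketch below parses a few JSONL records with the standard library and groups them by their split field. The first record reuses the test example printed in this notebook's output; the other two records and the helper name group_by_split are made up for illustration. The notebook itself uses pd.read_json rather than this helper.

```python
import json
from collections import defaultdict

# One JSON object per line, same three fields as the dataset.
# Only the first record comes from the notebook output; the others are hypothetical.
records = [
    {"text": "When a character delivers a speech so powerful that it emotionally "
             "moves the others to take action and not lose hope",
     "target_json": {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech",
                     "LOCATION": "", "TIME": ""},
     "split": "test"},
    {"text": "A hero returns home at night",
     "target_json": {"ACTOR": "hero", "ACTION": "returns", "OBJECT": "",
                     "LOCATION": "home", "TIME": "night"},
     "split": "train"},
    {"text": "A villain hides a key",
     "target_json": {"ACTOR": "villain", "ACTION": "hides", "OBJECT": "key",
                     "LOCATION": "", "TIME": ""},
     "split": "dev"},
]
jsonl_text = "\n".join(json.dumps(r) for r in records)

def group_by_split(text):
    """Hypothetical helper: parse JSONL text and bucket records by their 'split' value."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            groups[record["split"]].append(record)
    return dict(groups)

groups = group_by_split(jsonl_text)
print({name: len(rows) for name, rows in groups.items()})
```

Each target_json value is already a nested object in the file, which is why no second json.loads call is needed on it after loading.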

In [3]:
# 2.1B — Load dataset (train / dev / test) and prepare gold slots for test
# Note: pd.read_json already parses target_json into dicts, so json.loads is not needed

df = pd.read_json(DATA_PATH, lines=True)

print("Columns:", df.columns.tolist())

# Split into train / dev / test
train_df = df[df["split"] == "train"].reset_index(drop=True)
dev_df   = df[df["split"] == "dev"].reset_index(drop=True)
test_df  = df[df["split"] == "test"].reset_index(drop=True)

print(f"Train size: {len(train_df)}")
print(f"Dev size:   {len(dev_df)}")
print(f"Test size:  {len(test_df)}")

# Expected sizes (safety check)
assert len(train_df) == 228
assert len(dev_df)   == 49
assert len(test_df)  == 50

# Prepare test inputs
test_texts = test_df["text"].tolist()

# Gold slot dicts — already stored as dicts in the file
gold_slots_list = test_df["target_json"].tolist()

print("\nExample test text:", test_texts[0])
print("Example gold slots:", gold_slots_list[0])
print("\nDataset loaded and split successfully.")


Columns: ['text', 'target_json', 'split']
Train size: 228
Dev size:   49
Test size:  50

Example test text: When a character delivers a speech so powerful that it emotionally moves the others to take action and not lose hope
Example gold slots: {'ACTOR': 'character', 'ACTION': 'delivers', 'OBJECT': 'speech', 'LOCATION': '', 'TIME': ''}

Dataset loaded and split successfully.


### 2.1C — Load T5 Tokenizer & Model from Checkpoint

This cell:

- points to the saved T5 model folder (e.g. ./t5_s4_outputs/)

- loads the tokenizer and model

- moves the model to GPU if available (else CPU)

- sets it to evaluation mode

In [7]:
# 2.1C — Load T5 tokenizer & model from latest checkpoint in ./t5_s4_outputs

import torch
import glob
import os
from pathlib import Path

BASE_T5_DIR = Path("./t5_s4_outputs")

print("Looking for T5 checkpoints in:", BASE_T5_DIR)

# 1) Find all checkpoint subfolders, e.g. ./t5_s4_outputs/checkpoint-500
checkpoint_dirs = glob.glob(str(BASE_T5_DIR / "checkpoint-*"))

if not checkpoint_dirs:
    raise RuntimeError(
        f"No T5 checkpoints found in {BASE_T5_DIR}.\n"
        "Please open your Sprint4 _track_0_Fixed Split.ipynb notebook,\n"
        "run the T5 training cells again (trainer.train()), and then come back here."
    )

# 2) Pick the latest checkpoint based on step number
def extract_step(path):
    basename = os.path.basename(path)
    try:
        return int(basename.split("-")[-1])
    except ValueError:
        return -1

checkpoint_dirs = sorted(checkpoint_dirs, key=extract_step)
best_ckpt = checkpoint_dirs[-1]

print(" Found T5 checkpoint:", best_ckpt)

# 3) Load tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration

try:
    tokenizer = T5Tokenizer.from_pretrained(best_ckpt)
    print("Tokenizer loaded from checkpoint folder.")
except Exception as e:
    print("Could not load tokenizer from checkpoint folder. Reason:")
    print(" ", e)
    print("\nFalling back to base 't5-small' tokenizer (same as used for training).")
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    print("Tokenizer loaded from 't5-small'.")

# 4) Load model weights from the same checkpoint
model = T5ForConditionalGeneration.from_pretrained(best_ckpt)

# 5) Move model to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

print("\n T5 model loaded successfully from:", best_ckpt)
print("Running on device:", device)


Looking for T5 checkpoints in: t5_s4_outputs
 Found T5 checkpoint: t5_s4_outputs\checkpoint-285
Tokenizer loaded from checkpoint folder.

 T5 model loaded successfully from: t5_s4_outputs\checkpoint-285
Running on device: cpu


------------------

### Track 2.2 — Safer Decoding

Main Objective:
Make T5 generate cleaner, more predictable JSON by controlling how it decodes text, while still making sure every input produces an output.
This reduces:

- random trailing sentences

- half-closed braces

- extra noise tokens

- unnecessary repetition

Track 2.2 does NOT repair JSON yet — it simply makes the raw generation less messy, so the repair step (Track 2.3) becomes easier and more accurate.


2.2A: Define safe generation settings

2.2B: Test-generation on 1 sentence (sanity check)

### 2.2A — Define safe generation settings

This cell creates a “safe generation config” you will reuse whenever T5 is generating JSON.
These settings reduce junk text, repetition, and incomplete JSON.

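- disable sampling (do_sample=False), so the output is deterministic

- use beam search (num_beams=4) and stop as soon as enough finished candidates exist (early_stopping=True)

- discourage repetition (no_repeat_ngram_size=3, repetition_penalty=1.2)

- cap the output length (max_length=128)

This makes T5 output more stable JSON-like text.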

In [8]:
# 2.2A — Define safer generation settings for T5

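safe_gen_kwargs = {
    "max_length": 128,           # hard cap on the generated length
    "no_repeat_ngram_size": 3,   # never repeat the same 3-gram
    "repetition_penalty": 1.2,   # penalise tokens that already appeared
    "early_stopping": True,      # stop once num_beams finished candidates exist
    "num_beams": 4,              # beam search over 4 candidates
    "do_sample": False,          # no sampling, so the output is deterministic
}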

print("Safe generation settings created.")
safe_gen_kwargs


Safe generation settings created.


{'max_length': 128,
 'no_repeat_ngram_size': 3,
 'repetition_penalty': 1.2,
 'early_stopping': True,
 'num_beams': 4,
 'do_sample': False}

### 2.2B — Quick Safe-Decoding Sanity Test

This cell will:

1) Take one test sentence (the 1st one in your test set)

2) Tokenize it

3) Run T5 with the safe generation settings

4) Print the raw output so we can visually confirm that:
    - generation works
    - no errors
    - T5 produces something JSON-like
    - beam search + deterministic decoding is functioning

This is NOT repairing JSON yet — just making sure T5 is generating.

In [9]:
# 2.2B — Quick sanity test with safe decoding

# Take the first test input
sample_text = test_texts[0]

print("INPUT TEXT:")
print(sample_text)
print("\n--- Generating with safe decoding ---\n")

# Encode the text
inputs = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding=True
).to(device)

# Generate with safe settings
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        **safe_gen_kwargs
    )

# Decode to string
sample_pred = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("RAW T5 OUTPUT:")
print(sample_pred)


INPUT TEXT:
When a character delivers a speech so powerful that it emotionally moves the others to take action and not lose hope

--- Generating with safe decoding ---

RAW T5 OUTPUT:
"TIME": "character", "


----------------

### Track 2.3 — JSON Repair Pipeline

Main Objective:
Repair invalid JSON produced by T5 so it becomes valid and readable.

2.3A: Basic cleanup (remove weird tokens, trim text)

2.3B: Structure repair (fix braces/quotes/commas)

2.3C: Slot normalization (ensure all 5 slots exist)

2.3D: “Full repair” function combining all steps

### 2.3A — Basic Cleanup Helper


This cell creates a function that:

- trims spaces

- removes obvious markdown wrappers (like ```json) if they ever appear

- if there’s a { ... } region, it keeps only that region, braces included (helps when T5 adds extra commentary).

This makes the text “cleaner” before we attempt parsing or regex extraction.
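
To see what these three steps do in practice, here is a small standalone sketch. It re-implements the same idea in a hypothetical helper (strip_wrappers, not the notebook's function) and runs it on a made-up noisy string as well as on the brace-less style of output that T5 produces in this notebook.

```python
FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable

def strip_wrappers(raw):
    """Hypothetical helper mirroring the cleanup steps: trim, drop code fences, keep {...}."""
    s = "" if raw is None else str(raw).strip()

    # Drop an opening fence (with or without a "json" language tag)
    if s.lower().startswith(FENCE + "json"):
        s = s[len(FENCE) + len("json"):]
    elif s.startswith(FENCE):
        s = s[len(FENCE):]

    # Drop a closing fence if present
    if s.endswith(FENCE):
        s = s[:-len(FENCE)]
    s = s.strip()

    # Keep only the outermost {...} region when one exists
    first, last = s.find("{"), s.rfind("}")
    if first != -1 and last > first:
        s = s[first:last + 1]
    return s

# Made-up noisy output: fenced, with commentary before and after the JSON object
noisy = FENCE + 'json\nHere you go: {"ACTOR": "hero"} hope that helps\n' + FENCE
# Brace-less output of the kind seen in this notebook
braceless = '"TIME": "character", "'

cleaned_noisy = strip_wrappers(noisy)
cleaned_braceless = strip_wrappers(braceless)
print(repr(cleaned_noisy))
print(repr(cleaned_braceless))
```

The second case is the important one for this notebook: when the model emits no braces at all, cleanup leaves the text unchanged and the later steps have to recover the slots.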

In [10]:
# 2.3A — Basic cleanup of raw T5 output

def clean_raw_text(raw: str) -> str:
    """
    Basic cleanup of the raw T5 string before attempting JSON parsing / regex.
    This does NOT guarantee valid JSON, it just removes obvious junk.
    """
    if raw is None:
        return ""

    s = str(raw).strip()

    # Remove common markdown wrappers if they appear (opening and closing fence)
    for prefix in ["```json", "```"]:
        if s.lower().startswith(prefix):
            s = s[len(prefix):].lstrip(": \n")
    if s.endswith("```"):
        s = s[:-3].rstrip()

    # If there is a clear JSON block {...}, keep only that region
    first_brace = s.find("{")
    last_brace = s.rfind("}")
    if first_brace != -1 and last_brace != -1 and last_brace > first_brace:
        s = s[first_brace:last_brace+1]

    return s

# Quick test on the sample output you saw
test_clean = clean_raw_text(sample_pred)
print("RAW:", repr(sample_pred))
print("CLEANED:", repr(test_clean))

RAW: '"TIME": "character", "'
CLEANED: '"TIME": "character", "'


### 2.3B — Attempt JSON Parse (with safe fallback)

Small explanation:
This function makes a single parse attempt:

1. Call json.loads(s) on the cleaned text

2. If parsing fails, return None
    (the regex extraction in 2.3C handles that case)

This is the structured-parse step of the repair pipeline. It does not modify the text itself.
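
The notebook's pipeline goes straight from json.loads to regex extraction. As one possible intermediate step, and purely as an assumption about what a brace/comma repair could look like, the sketch below keeps only complete "key": "value" pairs, wraps them in braces, and retries json.loads. The helper name try_structural_repair and the pattern PAIR_PATTERN are hypothetical and are not part of the notebook.

```python
import json
import re

# Matches one complete "key": "value" pair with double-quoted key and value.
PAIR_PATTERN = re.compile(r'"[^"]+"\s*:\s*"[^"]*"')

def try_structural_repair(s):
    """Hypothetical helper: rebuild a JSON object from the complete pairs found in s."""
    pairs = PAIR_PATTERN.findall(s or "")
    if not pairs:
        return None
    # Dangling fragments (such as a trailing lone quote) are dropped because
    # only complete pairs are joined back together.
    candidate = "{" + ", ".join(pairs) + "}"
    try:
        return json.loads(candidate)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

repaired = try_structural_repair('"TIME": "character", "')
print(repaired)
print(try_structural_repair("no pairs here"))
```

On the sample output from 2.2B this yields a dict with a single TIME key. Slot normalization would still be needed afterwards to add the four missing slots.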

In [11]:
# 2.3B — Try to parse cleaned text as JSON

def try_json_parse(s: str):
    """
    Attempt to parse the string as JSON.
    If it fails, return None (we will fix with regex in 2.3C).
    """
    if not s:
        return None

    try:
        return json.loads(s)
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError
        return None

# Test on the cleaned sample
parsed = try_json_parse(test_clean)
print("Parsed output:", parsed)


Parsed output: None


### 2.3C — Regex Slot Extraction + Slot Normalization

Small explanation:
This cell does two things at once:

1) Uses regex to find patterns like
    "ACTOR": "the hero"
    "TIME": "yesterday"
    even if the overall string is not valid JSON.

2) Normalizes them into a dict with exactly these 5 keys: ACTOR, ACTION, OBJECT, LOCATION, TIME (any slot that is not found is set to an empty string).

In [12]:
# 2.3C — Extract slot values using regex + normalize to 5 slots

def extract_slots_with_regex(s: str):
    """
    Use a regex to find patterns like "KEY": "value" and map them
    into our 5-slot dict. Any missing slot is filled with "".
    """
    # Start with empty slots
    slots = {slot: "" for slot in SLOT_NAMES}
    
    if not s:
        return slots
    
    # Regex: optional quotes around key, colon, then quoted value
    # Examples it can match:
    # "ACTOR": "the hero"
    # ACTOR: "the hero"
    pattern = r'"?\s*([A-Za-z_]+)\s*"?\s*:\s*"([^"]*)"'
    
    matches = re.findall(pattern, s)
    
    for key, value in matches:
        key_norm = key.strip().upper()
        value_norm = value.strip()
        
        if key_norm in slots:
            slots[key_norm] = value_norm
    
    return slots

# Quick test on the cleaned sample we have
regex_slots = extract_slots_with_regex(test_clean)
print("Slots extracted via regex from sample:")
print(regex_slots)


Slots extracted via regex from sample:
{'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': 'character'}


### 2.3D — Full Repair Pipeline (repair_prediction)

Small explanation:
This step combines everything I have built so far:

- 2.3A: Clean the raw text

- 2.3B: Try JSON parsing

- 2.3C: Extract slots via regex

- Slot normalization: Always produce all 5 slots

In [13]:
# 2.3D — Full repair pipeline combining 2.3A + 2.3B + 2.3C

def repair_prediction(raw_text: str):
    """
    Full repair pipeline:
    1. Clean raw T5 text
    2. Try JSON parsing
    3. If JSON parsing fails, extract slots via regex
    4. Ensure all 5 slots exist
    """
    # Step 1: basic cleanup
    cleaned = clean_raw_text(raw_text)

    # Step 2: try real JSON parsing
    parsed = try_json_parse(cleaned)
    
    # If JSON parsing worked and produced a dict
    if isinstance(parsed, dict):
        # Normalize keys (uppercase) + ensure all 5 slots
        final = {slot: "" for slot in SLOT_NAMES}
        for k, v in parsed.items():
            k_norm = str(k).strip().upper()
            if k_norm in final:
                # Treat JSON null as an empty slot instead of the string "None"
                final[k_norm] = "" if v is None else str(v).strip()
        return final
    
    # Step 3: fallback to regex extraction if JSON parse failed
    return extract_slots_with_regex(cleaned)

# Test on your sample
fixed_sample = repair_prediction(sample_pred)
print("Repaired sample prediction:")
print(fixed_sample)


Repaired sample prediction:
{'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': 'character'}


----------------

### Track 2.4 — Generate & Repair Predictions

Main Objective:
Run T5 on all 50 test sentences → repair each → save into pred_slots_list.

- 2.4A: Generate raw predictions (pred_strs)

- 2.4B: Apply repair pipeline → pred_slots_list


### 2.4A — Generate Raw Predictions (pred_strs)

Small explanation:

This cell loops through all 50 test sentences and:

1) Tokenizes each sentence

2) Runs T5 with safe generation settings

3) Stores the raw (unrepaired) T5 output in a list called pred_strs

In [14]:
# 2.4A — Generate raw T5 predictions for all 50 test sentences

pred_strs = []  # raw T5 outputs

print("Generating raw predictions...")
for i, text in enumerate(test_texts):
    # Encode
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True
    ).to(device)

    # Generate using safe settings
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            **safe_gen_kwargs
        )

    # Decode
    pred = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    pred_strs.append(pred)

    if i < 3:  # print first few
        print(f"\nExample {i+1}:")
        print("Input:", text)
        print("Raw T5 Output:", pred)

print("\nDone. Total predictions:", len(pred_strs))


Generating raw predictions...

Example 1:
Input: When a character delivers a speech so powerful that it emotionally moves the others to take action and not lose hope
Raw T5 Output: "TIME": "character", "

Example 2:
Input: A character who very rarely or never shows any emotion
Raw T5 Output: "ACTOR": "character", "ACTOR", "OBJECT", "LOCATION", "EXCELLENCE", "TIME": „"

Example 3:
Input: Limited color palette on purpose
Raw T5 Output: "Long color palette": "limited color palette", "Limited color palette on purpose", "color palette" : ""

Done. Total predictions: 50


### 2.4B — Apply Repair Pipeline to All Predictions

Small explanation:
This cell:

- Takes each raw T5 output from pred_strs

- Runs repair_prediction() on it

- Stores the fixed JSON dict in pred_slots_list

After this step, you will have:

- pred_slots_list → clean, valid predictions ready for evaluation

- No invalid JSON

- No missing slots

- No parser crashes

This step removes the formatting problems (invalid JSON, missing slots, parser crashes) that caused Track 0 to fail. It does not change what the model actually predicted.

In [15]:
# 2.4B — Apply repair pipeline to all raw predictions
pred_slots_list = []

print("Repairing predictions...")
for i, raw in enumerate(pred_strs):
    fixed = repair_prediction(raw)
    pred_slots_list.append(fixed)

    # Show first 3 repaired outputs for inspection
    if i < 3:
        print(f"\nOriginal Raw Prediction {i+1}: {raw}")
        print("Repaired:", fixed)

print("\nDone. Total repaired predictions:", len(pred_slots_list))


Repairing predictions...

Original Raw Prediction 1: "TIME": "character", "
Repaired: {'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': 'character'}

Original Raw Prediction 2: "ACTOR": "character", "ACTOR", "OBJECT", "LOCATION", "EXCELLENCE", "TIME": „"
Repaired: {'ACTOR': 'character', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': ''}

Original Raw Prediction 3: "Long color palette": "limited color palette", "Limited color palette on purpose", "color palette" : ""
Repaired: {'ACTOR': '', 'ACTION': '', 'OBJECT': '', 'LOCATION': '', 'TIME': ''}

Done. Total repaired predictions: 50


-------------------------

### Track 2.5 — Evaluate (Track 2 Metrics)

Main Objective:
Re-run Slot-F1 and Frame-Validity with fixed JSON so results reflect T5’s true capability.

2.5A: Slot-F1 calculation

2.5B: Frame-validity %

2.5C: Summary of Track 2 improvements

### 2.5A — Compute Slot-Level Precision / Recall / F1

Small explanation:
This cell compares gold_slots_list vs pred_slots_list and calculates:

- how many slots are exactly correct (TP)

- how many are wrong (FP, FN)

- overall precision, recall, F1

The rule we’ll use (simple and sensible):

- If gold slot is non-empty and prediction matches exactly → TP

- If both are non-empty but differ → FP + FN

- If gold slot is empty but prediction is non-empty → FP

- If gold slot is non-empty but prediction is empty → FN

- If both empty → ignored (doesn’t help or hurt)

This is a strict exact-match, slot-level F1, micro-averaged over all 5 slots. It is easy to explain in your report.
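
The counting rules can be traced by hand on a single frame. The sketch below applies them to the gold frame shown earlier in this notebook and to one invented prediction; count_slots is an illustrative helper, not the notebook's function.

```python
SLOTS = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def count_slots(gold, pred):
    """Illustrative helper: return (tp, fp, fn) for one gold/prediction pair."""
    tp = fp = fn = 0
    for slot in SLOTS:
        g = gold.get(slot, "").strip()
        p = pred.get(slot, "").strip()
        if not g and not p:
            continue          # both empty: ignored
        if g and p == g:
            tp += 1           # exact non-empty match
        else:
            if p:
                fp += 1       # predicted a value that is wrong or unwanted
            if g:
                fn += 1       # gold value was missed or mispredicted
    return tp, fp, fn

gold = {"ACTOR": "character", "ACTION": "delivers", "OBJECT": "speech",
        "LOCATION": "", "TIME": ""}
# Hypothetical prediction: one match, one wrong value, one miss, one spurious value
pred = {"ACTOR": "character", "ACTION": "gives", "OBJECT": "",
        "LOCATION": "", "TIME": "tonight"}

counts = count_slots(gold, pred)
print(counts)  # (1, 2, 2)
```

ACTOR is the only TP. ACTION adds one FP and one FN, OBJECT adds one FN, TIME adds one FP, and LOCATION is skipped.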

In [16]:
# 2.5A — Compute slot-level precision, recall, and F1 across all 5 slots

def compute_slot_f1(golds, preds, slot_names=SLOT_NAMES):
    tp = 0  # true positives (correct non-empty slot matches)
    fp = 0  # predicted something but it was wrong
    fn = 0  # gold had something but prediction missed or was wrong

    for gold, pred in zip(golds, preds):
        for slot in slot_names:
            gold_val = str(gold.get(slot, "")).strip()
            pred_val = str(pred.get(slot, "")).strip()

            # ignore if both are empty (no information)
            if gold_val == "" and pred_val == "":
                continue

            if gold_val != "" and pred_val == gold_val:
                # correct non-empty match
                tp += 1
            elif gold_val == "" and pred_val != "":
                # predicted something where gold has nothing
                fp += 1
            elif gold_val != "" and pred_val == "":
                # missed a gold slot
                fn += 1
            elif gold_val != "" and pred_val != "" and pred_val != gold_val:
                # both non-empty but different → wrong prediction
                fp += 1
                fn += 1

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1        = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tp": tp,
        "fp": fp,
        "fn": fn,
    }

slot_metrics = compute_slot_f1(gold_slots_list, pred_slots_list)

print("Track 2 — Slot-level metrics")
print(f"TP: {slot_metrics['tp']}, FP: {slot_metrics['fp']}, FN: {slot_metrics['fn']}")
print(f"Precision: {slot_metrics['precision']:.4f}")
print(f"Recall:    {slot_metrics['recall']:.4f}")
print(f"F1:        {slot_metrics['f1']:.4f}")


Track 2 — Slot-level metrics
TP: 1, FP: 11, FN: 133
Precision: 0.0833
Recall:    0.0075
F1:        0.0137


### Explanation

Track 2 repaired every T5 output into a valid 5-slot frame, which makes proper evaluation possible.
T5’s Slot-F1 remains very low (0.0137): only 1 of the 134 non-empty gold slots was predicted exactly, so the bottleneck is the model’s slot filling rather than JSON formatting.
This result motivates moving to BERT token classification in Track 3, which is better suited for structured extraction.
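
The reported scores can be re-derived from the raw counts printed by the 2.5A cell (TP = 1, FP = 11, FN = 133). A quick check that uses only those printed numbers:

```python
tp, fp, fn = 1, 11, 133  # counts printed by the 2.5A cell

precision = tp / (tp + fp)   # 1 / 12
recall = tp / (tp + fn)      # 1 / 134
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.0833 0.0075 0.0137
```

TP + FN = 134 is the number of non-empty gold slots across the 50 test frames, so the recall figure says that 1 of those 134 gold values was recovered exactly.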

### 2.5B — Frame Validity (% of predictions containing all 5 slots)

In [17]:
# 2.5B — Compute frame validity (does each prediction contain all 5 slots?)

def compute_frame_validity(preds, slot_names=SLOT_NAMES):
    valid = 0
    total = len(preds)

    for frame in preds:
        if all(slot in frame for slot in slot_names):
            valid += 1

    return valid / total if total > 0 else 0.0

frame_validity = compute_frame_validity(pred_slots_list)

print(f"Track 2 — Frame Validity: {frame_validity*100:.2f}%")


Track 2 — Frame Validity: 100.00%
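
Because repair_prediction always returns a dict with all five keys, this check passes for every repaired frame regardless of its content. As a complementary view, and only as a suggestion, one could also report how many frames contain at least one non-empty slot. The sketch below computes that on the three repaired examples printed in 2.4B; the function name non_empty_frame_rate is hypothetical.

```python
def non_empty_frame_rate(frames):
    """Hypothetical metric: share of frames with at least one non-empty slot value."""
    if not frames:
        return 0.0
    filled = sum(
        1 for frame in frames
        if any(str(value).strip() for value in frame.values())
    )
    return filled / len(frames)

# The three repaired predictions shown in the 2.4B output
repaired_examples = [
    {"ACTOR": "", "ACTION": "", "OBJECT": "", "LOCATION": "", "TIME": "character"},
    {"ACTOR": "character", "ACTION": "", "OBJECT": "", "LOCATION": "", "TIME": ""},
    {"ACTOR": "", "ACTION": "", "OBJECT": "", "LOCATION": "", "TIME": ""},
]

rate = non_empty_frame_rate(repaired_examples)
print(f"{rate * 100:.2f}%")  # 66.67%
```

This figure covers only the three printed examples, not the full test set. Running the same function on pred_slots_list would give the value for all 50 frames.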


### Track 2.5C — Final Summary Cell

In [18]:
# 2.5C — Track 2 Final Summary (T5 Improvements on Sprint 4 Fixed Split)

print("===============================================================")
print("                 TRACK 2 — FINAL SUMMARY")
print("===============================================================\n")

# Slot-level metrics
print("Slot-Level Evaluation (ACTOR, ACTION, OBJECT, LOCATION, TIME)")
print(f"  True Positives (TP): {slot_metrics['tp']}")
print(f"  False Positives (FP): {slot_metrics['fp']}")
print(f"  False Negatives (FN): {slot_metrics['fn']}")
print(f"  Precision: {slot_metrics['precision']:.4f}")
print(f"  Recall:    {slot_metrics['recall']:.4f}")
print(f"  F1 Score:  {slot_metrics['f1']:.4f}\n")

# Frame validity
print("Frame Validity (percentage of frames containing all 5 required slots)")
print(f"  Frame Validity: {frame_validity * 100:.2f}%\n")

# Explanation for report clarity
print("Notes:")
print("1. Track 0 failed because T5 produced invalid JSON, leading to F1 = 0 and Frame Validity = 0%.")
print("2. Track 2 introduced safer decoding and a repair pipeline to fix all invalid outputs.")
print("3. As a result, all 50 predictions are now valid JSON structures.")
print("4. Track 2 reveals T5's true slot-filling capability: very low F1, but no longer 0.")
print("5. This demonstrates that T5-small is not well-suited for strict slot extraction.")
print("6. This motivates migrating to BERT-based token classification in Track 3.\n")

print("Track 2 Summary Completed.")
print("===============================================================")


                 TRACK 2 — FINAL SUMMARY

Slot-Level Evaluation (ACTOR, ACTION, OBJECT, LOCATION, TIME)
  True Positives (TP): 1
  False Positives (FP): 11
  False Negatives (FN): 133
  Precision: 0.0833
  Recall:    0.0075
  F1 Score:  0.0137

Frame Validity (percentage of frames containing all 5 required slots)
  Frame Validity: 100.00%

Notes:
1. Track 0 failed because T5 produced invalid JSON, leading to F1 = 0 and Frame Validity = 0%.
2. Track 2 introduced safer decoding and a repair pipeline to fix all invalid outputs.
3. As a result, all 50 predictions are now valid JSON structures.
4. Track 2 reveals T5's true slot-filling capability: very low F1, but no longer 0.
5. This demonstrates that T5-small is not well-suited for strict slot extraction.
6. This motivates migrating to BERT-based token classification in Track 3.

Track 2 Summary Completed.
