## FYP Sprint 3 ML training

### Ian Chia 
### 230746D

### Mini game plan (to give the teacher an overview of what I am trying to do)

1) Build a mini version of the ML pipeline


2) Define the shape of the JSON


3) Use a tiny dataset (a few examples made up on the spot). Why? So we can test the full flow without risking the real MongoDB data.


4) Connect to the real data (350 annotated examples). Once testing is complete, we will replace the tiny dataset with the real annotations.
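
Step 2 can be made concrete with a small shape check. The sketch below is only an illustration: the five slot names come from this notebook, but the rules that every slot must be present, that values are strings or null, and that no extra keys are allowed are assumptions, and `frame_errors` is a hypothetical helper name.

```python
# Assumed frame shape: exactly these five keys, each a string or None (JSON null).
SLOTS = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def frame_errors(frame):
    """Return a list of problems; an empty list means the frame has the expected shape."""
    if not isinstance(frame, dict):
        return ["frame is not a JSON object"]
    errors = []
    for key in SLOTS:
        if key not in frame:
            errors.append(f"missing slot: {key}")
        elif frame[key] is not None and not isinstance(frame[key], str):
            errors.append(f"slot {key} must be a string or null")
    for key in frame:
        if key not in SLOTS:
            errors.append(f"unexpected key: {key}")
    return errors

example = {"ACTOR": "Alice", "ACTION": "kicked", "OBJECT": "the ball",
           "LOCATION": "the park", "TIME": "yesterday"}
print(frame_errors(example))           # no problems for a complete frame
print(frame_errors({"ACTOR": "Eva"}))  # lists the four missing slots
```

The same check could later be expressed as a `jsonschema` schema once the real schema from the app is swapped in.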

### Why are we doing this:

For now we use the mini version because:

- It’s faster: we don’t need to connect to MongoDB yet.
- It’s safer: we can test code without touching the real data.
- It’s simple: we only need 5 core slots to prove the system works.

Once it works, we’ll swap in the real schema (which already lives in the app).

------------------

### Mini Testing pipeline :

-----------------------------

### Cell 0 — Install & import needed libraries

Installs the Hugging Face tools we need to train a small T5 model.
Run once per fresh environment. No output is expected.

In [1]:
!pip install -q transformers datasets seqeval jsonschema accelerate

In [1]:
# Only the imports the T5 text-to-JSON cells below actually use
import json
from pathlib import Path

from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer,
)

-----------------------------

### Cell 1 — tiny seed dataset

Sets the 5 slots (ACTOR, ACTION, OBJECT, LOCATION, TIME).

Creates a mini practice dataset (3 train / 1 dev / 1 test).

Wraps each sentence into a prompt like:
“Extract case-frame JSON… Text: "…" JSON:”

Stores the correct JSON as the target.
This gives T5 examples of what to write.

In [2]:
SLOTS = ["ACTOR","ACTION","OBJECT","LOCATION","TIME"]

# tiny seed data (same content as BERT demo)
examples = [
    ("Alice kicked the ball at the park yesterday.",
     {"ACTOR":"Alice","ACTION":"kicked","OBJECT":"the ball","LOCATION":"the park","TIME":"yesterday"}),
    ("Bob repaired the bike in the garage last night.",
     {"ACTOR":"Bob","ACTION":"repaired","OBJECT":"the bike","LOCATION":"the garage","TIME":"last night"}),
    ("Chloe reads a novel at home every morning.",
     {"ACTOR":"Chloe","ACTION":"reads","OBJECT":"a novel","LOCATION":"home","TIME":"every morning"}),
    ("Daniel cooked pasta in the kitchen at noon.",
     {"ACTOR":"Daniel","ACTION":"cooked","OBJECT":"pasta","LOCATION":"the kitchen","TIME":"at noon"}),
    ("Eva painted the fence outside on Sunday.",
     {"ACTOR":"Eva","ACTION":"painted","OBJECT":"the fence","LOCATION":"outside","TIME":"on Sunday"}),
]

# split 3/1/1
train = examples[:3]; dev = examples[3:4]; test = examples[4:5]

def to_pairs(pairs):
    recs = []
    for text, y in pairs:
        prompt = (
          "Extract a case-frame JSON with keys ACTOR, ACTION, OBJECT, LOCATION, TIME.\n"
          f'Text: "{text}"\nJSON:'
        )
        recs.append({"input": prompt, "target": json.dumps(y)})
    return recs

ds = DatasetDict({
    "train": Dataset.from_list(to_pairs(train)),
    "validation": Dataset.from_list(to_pairs(dev)),
    "test": Dataset.from_list(to_pairs(test)),
})
ds

DatasetDict({
    train: Dataset({
        features: ['input', 'target'],
        num_rows: 3
    })
    validation: Dataset({
        features: ['input', 'target'],
        num_rows: 1
    })
    test: Dataset({
        features: ['input', 'target'],
        num_rows: 1
    })
})
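
When the 350 real annotations replace the seed data, the fixed 3/1/1 slicing could be swapped for a seeded shuffle-and-split. The sketch below is an assumption about how that might look, not part of the current notebook: `split_examples`, the 80/10/10 ratio, and the seed value are all illustrative choices, and the placeholder records stand in for the real `(text, frame)` pairs.

```python
import random

def split_examples(examples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle with a fixed seed, then cut into train/dev/test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG, so the global seed is untouched
    n_train = round(len(items) * train_frac)
    n_dev = round(len(items) * dev_frac)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

# Placeholder records; the real ones would come from the MongoDB export.
placeholder = [(f"sentence {i}", {"ACTOR": f"actor {i}"}) for i in range(350)]
train_set, dev_set, test_set = split_examples(placeholder)
print(len(train_set), len(dev_set), len(test_set))
```

Fixing the seed keeps the split identical between runs, which matters if the BERT and T5 tracks are to be compared on the same examples.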

-----------------------------

### Cell 2 — Tokenize

Loads a small T5 model (e.g., flan-t5-small).

Converts each prompt/target into token IDs the model understands.

Pairs inputs with the correct output IDs (labels) so the model can learn.

In [3]:
MODEL = "google/flan-t5-small"  # or "t5-small"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def tok_fn(batch):
    # Tokenize prompts and targets in one call; text_target puts the target
    # token IDs under "labels" (replaces the deprecated as_target_tokenizer()).
    return tok(batch["input"], text_target=batch["target"], truncation=True)

tok_ds = ds.map(tok_fn, batched=True, remove_columns=ds["train"].column_names)
collator = DataCollatorForSeq2Seq(tok, model=model)
tok_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1
    })
})

-----------------------------

### Cell 3 — Train (short)

Sets training settings (epochs, batch size, learning rate).

Trains on the 3 tiny train examples and evaluates on the 1 dev example.

In [5]:
args = TrainingArguments(
    output_dir="./models/t5_text2json",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=10,
    save_total_limit=1,
    report_to=[]   # keeps it quiet
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok_ds["train"],
    eval_dataset=tok_ds["validation"],
    data_collator=collator,
    tokenizer=tok,
)

trainer.train()
trainer.evaluate()


{'eval_loss': 2.364403009414673,
 'eval_runtime': 0.354,
 'eval_samples_per_second': 2.825,
 'eval_steps_per_second': 2.825,
 'epoch': 3.0}

### Explanation: 

**eval_loss** = “how wrong” the model still is on the dev prompt (lower is better).

Only the loss is reported (not F1) because the trainer was not asked to generate text here. That is fine: generation happens in Cell 4.

With 3 train examples, loss won’t look great. This cell mainly proves training runs end-to-end.
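
For intuition, the loss can be converted to a perplexity. This sketch assumes `eval_loss` is the mean per-token cross-entropy in natural-log units, which is how the Hugging Face seq2seq loss is normally computed; the value is copied from the output above.

```python
import math

eval_loss = 2.364403009414673     # from trainer.evaluate() above
perplexity = math.exp(eval_loss)  # exp of mean cross-entropy per token

print(f"perplexity = {perplexity:.2f}")
```

Under that assumption the perplexity comes out a little above 10, meaning the model is roughly as uncertain as if it were choosing among about ten equally likely tokens at each step of the dev target.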

-----------------------------

### Cell 4 — Generate on test + parse + simple Slot-F1

Asks T5 to write the JSON for the test sentence (it “decodes” its understanding).

Tries to parse what it wrote into a Python dict.

Compares predicted JSON vs gold JSON slot by slot (exact match) to compute a simple Slot-F1.

In [6]:
import json

# small helper to ask T5 to produce JSON for a sentence
def generate_json(text):
    prompt = (
      "Extract a case-frame JSON with keys ACTOR, ACTION, OBJECT, LOCATION, TIME.\n"
      f'Text: "{text}"\nJSON:'
    )
    ids = tok(prompt, return_tensors="pt")
    # put tensors on same device as model (handles CPU/GPU safely)
    ids = {k: v.to(model.device) for k, v in ids.items()}
    gen_ids = model.generate(**ids, max_new_tokens=80)
    s = tok.decode(gen_ids[0], skip_special_tokens=True)

    # trim to JSON block if model adds extra text
    if "{" in s and "}" in s:
        s = s[s.find("{"): s.rfind("}")+1]
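    try:
        parsed = json.loads(s)
    except json.JSONDecodeError:
        parsed = None
    # fallback if parsing fails or the output is not a JSON object
    return parsed if isinstance(parsed, dict) else {k: None for k in SLOTS}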

# pull raw text & gold label from the test set
test_input = ds["test"][0]["input"]
text = test_input.split('Text: "')[1].split('"\nJSON:')[0]
gold = json.loads(ds["test"][0]["target"])

pred = generate_json(text)
print("TEXT:", text)
print("PRED:", pred)
print("GOLD:", gold)

# very simple exact-match Slot-F1 (per-slot equality)
def slot_f1(pred, gold):
    tp = fp = fn = 0
    for k in SLOTS:
        # str() guards against non-string values in the generated JSON
        p = str(pred.get(k) or "").strip().lower()
        g = str(gold.get(k) or "").strip().lower()
        if p and p == g:
            tp += 1          # exact match
        else:
            if p:
                fp += 1      # predicted value is wrong or spurious
            if g:
                fn += 1      # gold value was missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec, "f1": f1}

print("Slot-F1:", slot_f1(pred, gold))


TEXT: Eva painted the fence outside on Sunday.
PRED: {'ACTOR': None, 'ACTION': None, 'OBJECT': None, 'LOCATION': None, 'TIME': None}
GOLD: {'ACTOR': 'Eva', 'ACTION': 'painted', 'OBJECT': 'the fence', 'LOCATION': 'outside', 'TIME': 'on Sunday'}
Slot-F1: {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}


### Explanation:

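The all-None prediction is the fallback that `generate_json` returns when parsing fails, so it means the generated text was not valid JSON, not that the model wrote an empty frame. With only 3 training examples the model has not learned the output format. It is also worth printing the raw decoded string: T5’s SentencePiece vocabulary is reported to lack the curly-brace characters, and if that is the case here the braces never appear in the output and `json.loads` cannot succeed.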

Exact-match Slot-F1 is 0.0 because none of the five slots match the gold answer.

This is normal for a micro toy run; we need more data and/or more training to get useful outputs.
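
To show how the metric behaves once outputs become partly right, here is a worked example on a made-up prediction for the same test sentence. The prediction is hypothetical (it is not model output), and the function is a self-contained restatement of the exact-match rule used above.

```python
SLOTS = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def slot_f1(pred, gold):
    """Exact-match slot scoring: same rule as the notebook's slot_f1."""
    tp = fp = fn = 0
    for k in SLOTS:
        p = str(pred.get(k) or "").strip().lower()
        g = str(gold.get(k) or "").strip().lower()
        if p and p == g:
            tp += 1
        else:
            if p:
                fp += 1
            if g:
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec, "f1": f1}

gold = {"ACTOR": "Eva", "ACTION": "painted", "OBJECT": "the fence",
        "LOCATION": "outside", "TIME": "on Sunday"}
# Hypothetical prediction: three exact matches, one wrong span, one missing slot.
partial = {"ACTOR": "Eva", "ACTION": "painted", "OBJECT": "fence",
           "LOCATION": "outside", "TIME": None}

result = slot_f1(partial, gold)
print(result)
```

Here tp = 3, fp = 1 (the wrong OBJECT) and fn = 2 (the wrong OBJECT plus the missing TIME), so precision is 3/4, recall is 3/5, and F1 is 2/3. Note that "fence" versus "the fence" counts as a full miss under exact match.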

--------------------

### Cell 5 — Save artifacts (optional)

In [7]:
from pathlib import Path
Path("results").mkdir(exist_ok=True)
with open("results/t5_test_result.json","w") as f:
    json.dump({"text":text,"pred":pred,"gold":gold}, f, indent=2)
"saved"

'saved'

Saves one test example (text, predicted JSON, gold JSON) to results/t5_test_result.json so it can be screenshotted or attached to the proposal.


----------------

### Conclusion :

- We completed a T5 text→JSON prototype on a tiny seed set.

- Training/eval ran successfully, but outputs are poor on so little data, which is expected.

- This prototype proves the pipeline: prompt → T5 generate → parse JSON → Slot-F1.

**Next steps:**

1) Export the 350 MongoDB annotations into {"text","target_json"} and re-train (5–10 epochs).

2) Compare with the BERT track using the same split and report Slot-F1 per slot and Frame-Validity %.

3) Carry the better-performing track (BERT or T5) into the Step-7 proposal with metrics + example predictions.
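
The two metrics named in step 2 could be computed along these lines. This is a sketch under assumptions: Frame-Validity % is not defined in this notebook, so it is taken here to mean the share of predictions that parse to a dict with exactly the five slot keys, with a parse failure recorded as `None` (not as the all-None fallback dict, which would otherwise count as valid). The function names are illustrative.

```python
SLOTS = ["ACTOR", "ACTION", "OBJECT", "LOCATION", "TIME"]

def norm(value):
    """Lowercase and strip a slot value; non-strings count as empty."""
    return value.strip().lower() if isinstance(value, str) else ""

def per_slot_f1(pairs):
    """pairs: list of (pred, gold); pred is a dict, or None if parsing failed."""
    scores = {}
    for slot in SLOTS:
        tp = fp = fn = 0
        for pred, gold in pairs:
            p = norm(pred.get(slot)) if isinstance(pred, dict) else ""
            g = norm(gold.get(slot))
            if p and p == g:
                tp += 1
            else:
                if p:
                    fp += 1
                if g:
                    fn += 1
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[slot] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def frame_validity(preds):
    """Percent of predictions that are dicts with exactly the expected keys."""
    if not preds:
        return 0.0
    valid = sum(1 for p in preds if isinstance(p, dict) and set(p) == set(SLOTS))
    return 100.0 * valid / len(preds)

gold_eva = {"ACTOR": "Eva", "ACTION": "painted", "OBJECT": "the fence",
            "LOCATION": "outside", "TIME": "on Sunday"}
gold_daniel = {"ACTOR": "Daniel", "ACTION": "cooked", "OBJECT": "pasta",
               "LOCATION": "the kitchen", "TIME": "at noon"}
# Made-up predictions: one perfect frame, one parse failure.
pairs = [(dict(gold_eva), gold_eva), (None, gold_daniel)]

scores = per_slot_f1(pairs)
validity = frame_validity([pred for pred, _ in pairs])
print(scores)
print(validity)
```

Running the same two functions on both the BERT and T5 outputs, with the same test split, would keep the comparison like-for-like.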

----------------