## FYP Sprint 3 ML training

### Ian Chia 
### 230746D

### Mini GamePlan Idea to give the teacher some sort of understanding of what i am trying to do 

1) Build a mini version of the ML pipeline


2) Define the shape of the JSON


3) Use a tiny dataset (few examples made up on the spot) Why? : So we can test the full flow without risking the real MongoDB data


4) Connect to your real data (350 annotated examples) : Once the testing is complete we will replace everything with the real one.

### Why are we doing this:

So right now, we just use the mini version because:

It’s faster — we don’t need to connect to MongoDB yet.

It’s safer — we can test code without touching your real data.

It’s simple — we only need 5 core slots to prove the system works.

Once it works, we’ll swap in the real schema (which already lives in your app).

---------------------------

### Mini Testing pipeline :

--------------------------------------------



### Cell 0 — Install some libraries. 

#### Downloading the tools needed to make it work:
transformers = lets you use models like BERT.

datasets = helps handle small text datasets.

seqeval = measures how well BERT tags words (F1 score).

jsonschema = checks if your output JSON follows your rulebook.

In [1]:
!pip install -q transformers datasets seqeval jsonschema accelerate


  DEPRECATION: Building 'seqeval' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'seqeval'. Discussion can be found at https://github.com/pypa/pip/issues/6334


---------------------------------------------------

### Cell 1 — Import the needed libraries & Configuration 

In [1]:
from dataclasses import dataclass
from typing import List, Dict, Any
import json, re, random
from pathlib import Path

import numpy as np
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, TrainingArguments, Trainer
from seqeval.metrics import precision_score, recall_score, f1_score
from jsonschema import validate, ValidationError

#Cell 8 - training
from transformers import TrainingArguments, Trainer


  from .autonotebook import tqdm as notebook_tqdm





---------------------------------------

### Cell 2 — Define Schema & Mini Examples

define slots  : These are the slots that we want to use

bio labels    : BIO labels are just a way to teach the model which words belong to which slot.

define schema : CASE_FRAME_SCHEMA = {

mini examples : examples = [

split data    : train / validation / test

In [2]:
# slot labels we’ll predict
SLOTS = ["ACTOR","ACTION","OBJECT","LOCATION","TIME"]
BIO_LABELS = ["O"] + [f"{p}-{s}" for s in SLOTS for p in ["B","I"]]
LABEL2ID = {lab:i for i,lab in enumerate(BIO_LABELS)}
ID2LABEL = {i:lab for lab,i in LABEL2ID.items()}

# minimal schema (acceptance: ACTOR & ACTION required for "valid frame")
CASE_FRAME_SCHEMA = {
    "type": "object",
    "properties": {s: {"type": ["string","null"]} for s in SLOTS},
    "required": ["ACTOR","ACTION"],
    "additionalProperties": False
}

# Tiny seed set to start (replace with your own later)
examples = [
    ("Alice kicked the ball at the park yesterday.",
     {"ACTOR":"Alice","ACTION":"kicked","OBJECT":"the ball","LOCATION":"the park","TIME":"yesterday"}),
    ("Bob repaired the bike in the garage last night.",
     {"ACTOR":"Bob","ACTION":"repaired","OBJECT":"the bike","LOCATION":"the garage","TIME":"last night"}),
    ("Chloe reads a novel at home every morning.",
     {"ACTOR":"Chloe","ACTION":"reads","OBJECT":"a novel","LOCATION":"home","TIME":"every morning"}),
    ("Daniel cooked pasta in the kitchen at noon.",
     {"ACTOR":"Daniel","ACTION":"cooked","OBJECT":"pasta","LOCATION":"the kitchen","TIME":"at noon"}),
    ("Eva painted the fence outside on Sunday.",
     {"ACTOR":"Eva","ACTION":"painted","OBJECT":"the fence","LOCATION":"outside","TIME":"on Sunday"}),
]

# simple split: 3 train, 1 dev, 1 test (tiny on purpose)
random.seed(3)
random.shuffle(examples)
train_ex = examples[:3]
dev_ex   = examples[3:4]
test_ex  = examples[4:5]

len(train_ex), len(dev_ex), len(test_ex)


(3, 1, 1)

--------------------------------------------------

### Cell 3 — Helper: build simple BIO tags from the gold JSON (whitespace matching)

It takes each sentence and its correct answer (the JSON), and colors every word with a tag:

B-ACTOR = first word of the ACTOR

I-ACTOR = the rest of the ACTOR words

… same for ACTION / OBJECT / LOCATION / TIME

O = not part of any slot

So later, BERT can learn: “when I see words like this, color them like that.

### More In-dept explanation:


**bio_tag(text, target_json)**

**Input:**

text = the sentence (e.g., “Alice kicked the ball…”)

target_json = the correct slots (e.g., ACTOR=Alice, ACTION=kicked,…)

**What it does:**

Splits the sentence into tokens (words)

Tries to find each slot phrase in the sentence
(e.g., OBJECT = “the ball” → it looks for the + ball)

Gives each word a label like B-OBJECT or I-OBJECT

**Output:**

tokens = list of words

labels = list of tags (same length as tokens)

Think “token list” = your word list, “label list” = your color list.

--------------------------------

**to_records(pairs)**

**Input:** a list of (text, gold_json) pairs

**What it does:** calls bio_tag for each pair and packs everything into a neat record:

{"tokens":[...], "labels":[...], "text": "...", "target_json": {...}}

**Output:** a list of these neat records (for train/dev/test)

In [3]:
def bio_tag(text: str, target_json: Dict[str,str]):
    toks = text.split()
    labels = ["O"]*len(toks)

    def norm(x): return re.sub(r"[^\w']+", "", x.lower())

    for slot, phrase in target_json.items():
        if not phrase:
            continue
        p_tokens = phrase.split()
        n = len(p_tokens)
        # find first exact (punctuation-stripped) match
        for i in range(len(toks)-n+1):
            window = [norm(t) for t in toks[i:i+n]]
            if window == [norm(t) for t in p_tokens]:
                labels[i] = f"B-{slot}"
                for j in range(1, n):
                    labels[i+j] = f"I-{slot}"
                break
    return toks, labels

def to_records(pairs):
    recs = []
    for text, gold in pairs:
        tokens, labels = bio_tag(text, gold)
        recs.append({"tokens": tokens, "labels": labels, "text": text, "target_json": gold})
    return recs

train_recs = to_records(train_ex)
dev_recs   = to_records(dev_ex)
test_recs  = to_records(test_ex)

train_recs[0]


{'tokens': ['Alice',
  'kicked',
  'the',
  'ball',
  'at',
  'the',
  'park',
  'yesterday.'],
 'labels': ['B-ACTOR',
  'B-ACTION',
  'B-OBJECT',
  'I-OBJECT',
  'O',
  'B-LOCATION',
  'I-LOCATION',
  'B-TIME'],
 'text': 'Alice kicked the ball at the park yesterday.',
 'target_json': {'ACTOR': 'Alice',
  'ACTION': 'kicked',
  'OBJECT': 'the ball',
  'LOCATION': 'the park',
  'TIME': 'yesterday'}}

----------------------------------------

### Cell 4 —  Build Hugging Face Datasets

group bio_tag records into three boxes:

a train box (to learn),

a validation/dev box (to check while learning),

a test box (final quiz).

In [4]:
ds = DatasetDict({
    "train": Dataset.from_list(train_recs),
    "validation": Dataset.from_list(dev_recs),
    "test": Dataset.from_list(test_recs)
})
ds

DatasetDict({
    train: Dataset({
        features: ['tokens', 'labels', 'text', 'target_json'],
        num_rows: 3
    })
    validation: Dataset({
        features: ['tokens', 'labels', 'text', 'target_json'],
        num_rows: 1
    })
    test: Dataset({
        features: ['tokens', 'labels', 'text', 'target_json'],
        num_rows: 1
    })
})

------------------------------------------------

### Cell 5 — Tokenizer and label alignment (wordpieces → BIO ids)
 
It turns word tokens into BERT's subword pieces and carry the BIO labels over correctly

**tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)**

Loads BERT’s tokenizer (splits words into subword pieces like “play” + “##ing”).

**tokenized = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)**

Tokenizes lists of words and keeps a mapping from subword → original word.

**word_ids = tokenized.word_ids(batch_index=i)**

Lets us know which subwords belong to which original word.

In [5]:
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def align_labels_with_tokens(examples):
    tokenized = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)
    new_labels = []
    for i, labels in enumerate(examples["labels"]):
        word_ids = tokenized.word_ids(batch_index=i)
        prev_word = None
        label_ids = []
        for w_id in word_ids:
            if w_id is None:
                label_ids.append(-100)  # ignore special tokens
            else:
                lab = labels[w_id]
                # if it's a continuation of a wordpiece, keep I- if B- was there
                if w_id == prev_word:
                    if lab.startswith("B-"):
                        lab = "I-" + lab[2:]
                label_ids.append(LABEL2ID[lab])
                prev_word = w_id
        new_labels.append(label_ids)
    tokenized["labels"] = new_labels
    return tokenized

#Applies this alignment to train/dev/test.
tokenized_ds = ds.map(align_labels_with_tokens, batched=True)
len(tokenized_ds["train"]), len(tokenized_ds["validation"]), len(tokenized_ds["test"])


Map: 100%|████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 42.00 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 78.70 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 71.24 examples/s]


(3, 1, 1)

--------------------

### Cell 6 — load BERT for token classification

Create a BERT model head that predcits a BIO label for each subword

Loads BERT + a classification layer with exactly the number of BIO labels you defined.

**data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)**
Handles dynamic padding so batches are the same length (important for training).

ETUDS: in cell 6 the model will try to learn to predict one label per token, for example wehn it looks at a person name it will try to learn that its a person name. 

In [6]:
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(BIO_LABELS),
    id2label=ID2LABEL,
    label2id=LABEL2ID
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


----------------------------------------

### Cell 7 — metrics for token labels (seqeval)

Goal: after the model predicts, compute sequence tagging metrics correctly.We want to measure how close the guesses are to the real answers. 

We compare those guesses with the true BIO labels you made in Cell 3.

Using the seqeval library, we calculate:

- Precision = how many of the model’s “colored” words were correct.

- Recall = how many of the real colored words it found.

- F1 = balanced score between the two.

**preds = np.argmax(logits, axis=-1)**
Turns model scores into predicted label IDs.



In [7]:
def compute_token_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)

    true_tags, pred_tags = [], []
    for p, l in zip(preds, labels):
        cur_true, cur_pred = [], []
        for pi, li in zip(p, l):
            if li == -100:
                continue
            cur_true.append(ID2LABEL[li])
            cur_pred.append(ID2LABEL[pi])
        true_tags.append(cur_true)
        pred_tags.append(cur_pred)

    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
    }


----------------------------------------

### Cell 8 — train (tiny, fast)

In [9]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./models/bert_slot_tagger",
    logging_dir="./models/bert_slot_tagger/logs",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=10,
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_token_metrics
)

trainer.train()
trainer.evaluate()


  trainer = Trainer(


Step,Training Loss




{'eval_loss': 2.1785221099853516,
 'eval_precision': 0.14285714285714285,
 'eval_recall': 0.2,
 'eval_f1': 0.16666666666666666,
 'eval_runtime': 0.2741,
 'eval_samples_per_second': 3.648,
 'eval_steps_per_second': 3.648,
 'epoch': 3.0}

### Explanation
What happened: BERT learned from the tiny “study notes” (3 sentences) and was quizzed on 1 validation sentence.

**Key fields that can be seen:**

eval_loss: how wrong the model still is (lower is better).

eval_precision, eval_recall, eval_f1: how well it colored the words with the right BIO tags (higher is better).

**Why are they low?** I only trained on 3 sentences on CPU. That’s not enough for BERT to learn patterns; it’s just to **prove the pipeline runs.**

----------------------------------------

### Cell 9 — predict BIO tags on the test set + assemble JSON
What it does: uses the trained model to “color” the test sentence(s), then turns colors → case-frame JSON.

In [10]:
import numpy as np

# 1) get model predictions on tokenized test data
tok_test = tokenized_ds["test"]
raw_test = ds["test"]  # has original tokens/text/gold

pred_logits = trainer.predict(tok_test).predictions
pred_ids = np.argmax(pred_logits, axis=-1)

# 2) map predicted ids back to BIO labels (skip -100 positions)
pred_bio = []
for row_pred, row_labels in zip(pred_ids, tok_test["labels"]):
    labs = []
    k = 0
    for li in row_labels:
        if li == -100:
            continue
        labs.append(ID2LABEL[row_pred[k]])
        k += 1
    pred_bio.append(labs)

# 3) assemble JSON from BIO tags
def assemble_json_from_bio(tokens, labels):
    out = {s: None for s in SLOTS}
    i = 0
    while i < len(tokens):
        lab = labels[i]
        if lab.startswith("B-"):
            slot = lab.split("-")[1]
            # collect I- continuation
            j = i + 1
            span = [tokens[i]]
            while j < len(tokens) and labels[j] == f"I-{slot}":
                span.append(tokens[j]); j += 1
            phrase = " ".join(span).strip(" ,.")
            if out[slot] is None:  # take first span per slot
                out[slot] = phrase
            i = j
        else:
            i += 1
    return out

pred_jsons = []
for idx in range(len(raw_test)):
    tokens = raw_test[idx]["tokens"]
    labels = pred_bio[idx]
    pred_jsons.append(assemble_json_from_bio(tokens, labels))

list(zip([r["text"] for r in raw_test], pred_jsons))  # quick peek


[('Bob repaired the bike in the garage last night.',
  {'ACTOR': None,
   'ACTION': None,
   'OBJECT': 'bike',
   'LOCATION': 'repaired',
   'TIME': 'the'})]

### Explanation: 

**What happened:** The model guessed BIO tags, we converted them to a JSON.

Why does it look wrong? With almost no training data, the model “colored” badly:

- It missed ACTOR and ACTION

- Misplaced LOCATION (“repaired”) and TIME (“the”)

- Only OBJECT (“bike”) came out somewhat okay

This is expected with such a tiny toy dataset.

---------------------------------------------


### Cell 10 — validate with schema + compute simple Slot-F1 + Frame-Validity
What it does: checks if the JSON is well-formed and measures how close each slot is to the gold answer.

In [11]:
from jsonschema import validate, ValidationError

def is_schema_valid(cf):
    try:
        validate(cf, CASE_FRAME_SCHEMA)
        return True
    except ValidationError:
        return False

def slot_f1(pred, gold):
    # exact string match per slot (case-insensitive)
    tp = fp = fn = 0
    for s in SLOTS:
        p = (pred.get(s) or "").strip().lower()
        g = (gold.get(s) or "").strip().lower()
        if p and g and p == g:
            tp += 1
        elif p and g and p != g:
            fp += 1; fn += 1
        elif p and not g:
            fp += 1
        elif not p and g:
            fn += 1
    prec = tp / (tp+fp) if (tp+fp) else 0.0
    rec  = tp / (tp+fn) if (tp+fn) else 0.0
    f1   = 2*prec*rec/(prec+rec) if (prec+rec) else 0.0
    return {"precision":prec, "recall":rec, "f1":f1}

texts = [r["text"] for r in raw_test]
golds = [r["target_json"] for r in raw_test]
valid_flags = [is_schema_valid(pj) for pj in pred_jsons]
slot_scores = [slot_f1(pj, gj) for pj, gj in zip(pred_jsons, golds)]

results = {
    "slot_f1_avg": float(np.mean([s["f1"] for s in slot_scores])),
    "frame_validity_pct": 100.0 * sum(valid_flags) / len(valid_flags),
    "per_item": [
        {"text": t, "pred": p, "gold": g, "is_schema_valid": v, "slot_f1": s}
        for t,p,g,v,s in zip(texts, pred_jsons, golds, valid_flags, slot_scores)
    ]
}
results


{'slot_f1_avg': 0.0,
 'frame_validity_pct': 100.0,
 'per_item': [{'text': 'Bob repaired the bike in the garage last night.',
   'pred': {'ACTOR': None,
    'ACTION': None,
    'OBJECT': 'bike',
    'LOCATION': 'repaired',
    'TIME': 'the'},
   'gold': {'ACTION': 'repaired',
    'ACTOR': 'Bob',
    'LOCATION': 'the garage',
    'OBJECT': 'the bike',
    'TIME': 'last night'},
   'is_schema_valid': True,
   'slot_f1': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}}]}

### Explanation:

**slot_f1_avg: 0.0** → Across the five slots, none matched the gold text exactly on this one test example, so average F1 is 0.

**frame_validity_pct: 100.0** → The JSON format is valid (keys exist and types are allowed).
Note: our schema allows null, so the frame can be “valid shape” even if content is missing. That’s by design, so the pipeline never crashes.

------------------------------

### Cell 11 — save artifacts (metrics + predictions)

What it does: writes outputs to files so you can paste screenshots/tables into your proposal.

In [12]:
from pathlib import Path
import json

Path("results").mkdir(exist_ok=True)
with open("results/metrics.json","w") as f:
    json.dump(results, f, indent=2)
with open("results/predictions.jsonl","w") as f:
    for t, p, g, v in zip(texts, pred_jsons, golds, valid_flags):
        f.write(json.dumps({"text":t, "pred_json":p, "gold_json":g, "is_schema_valid":v})+"\n")
"saved to results/"


'saved to results/'

This saved two files:

results/metrics.json (your scores)

results/predictions.jsonl (text, predicted JSON, gold JSON, validity)

--------------------------------------------

## Conclusion:

We successfully proved the full BERT pipeline: text → BIO tags → assembled JSON → schema validation → metrics. On a tiny toy set (3 train / 1 dev / 1 test), scores are low (e.g., slot_f1_avg ≈ 0) and predictions are noisy, which is expected because the model had almost no data to learn from. The output JSONs are still schema-valid, confirming the pipeline is robust and won’t crash.

This is not the final model. Next, we will (1) try the T5 track (direct text→JSON) for comparison, and (2) train again using our real 350 annotated examples and the real schema file. With more data and a few training tweaks (more epochs, appropriate learning rate), we expect Slot-F1 and Frame-Validity to improve significantly. The best-performing track (BERT or T5) will be carried into the proposal with metrics and example outputs.


---------------------------