**Disclaimer:**
Parts of the code were generated with the assistance of OpenAI’s GPT-5 model. I have reviewed and adapted the outputs for correctness and relevance to this work.

# Task 1 — Wizard of Tasks: Intent Recognition


This notebook trains **three models** on Wizard of Tasks (≈18k utterances):
- Multinomial **Naïve Bayes** (NB)
- **Logistic Regression** (Linear BoW / Softmax)
- **MLP** (simple neural baseline)

## 1) Imports

In [None]:
import os, json, re, time
from pathlib import Path
from typing import List, Dict

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_colwidth", 200)

## 2) Load Wizard of Tasks JSON (flexible schema)

Supports both schemas:
- **Schema A**: dict → `data_split`, `turns` with `role`, `text`, `intent`.
- **Schema B**: list → `split`, `dialog` with `speaker`, `text`, `intent`.

In [None]:
DATA_COOK = Path("./wizard_of_tasks_cooking_v1.0.json")
DATA_DIY  = Path("./wizard_of_tasks_diy_v1.0.json")

def _extract_from_schema_a(obj: Dict, domain: str):
    rows = []
    for conv_id, conv in obj.items():
        split = conv.get("data_split") or conv.get("split") or "train"
        for turn in conv.get("turns", []):
            role = turn.get("role")
            text = (turn.get("text") or "").strip()
            intent = turn.get("intent")
            if role == "student" and text and intent:
                rows.append({"text": text, "intent": intent, "split": split, "domain": domain, "conv_id": conv_id})
    return rows

def _extract_from_schema_b(lst: List, domain: str):
    rows = []
    for conv in lst:
        split = conv.get("split") or conv.get("data_split") or "train"
        conv_id = conv.get("id") or conv.get("conversation_id") or None
        for turn in conv.get("dialog", []):
            speaker = turn.get("speaker") or turn.get("role")
            text = (turn.get("text") or "").strip()
            intent = turn.get("intent")
            if (speaker == "student" or speaker == "user") and text and intent:
                rows.append({"text": text, "intent": intent, "split": split, "domain": domain, "conv_id": conv_id})
    return rows

def load_wot(path: Path, domain: str):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return _extract_from_schema_a(data, domain)
    elif isinstance(data, list):
        return _extract_from_schema_b(data, domain)
    else:
        raise ValueError("Unrecognized JSON structure.")

rows = []
for pth, dom in [(DATA_COOK, "cooking"), (DATA_DIY, "diy")]:
    if not pth.exists():
        raise FileNotFoundError(f"Missing file: {pth.resolve()}")
    rows += load_wot(pth, dom)

df = pd.DataFrame(rows)
print("Total student turns with intents:", len(df))
print("Unique intents:", df['intent'].nunique())
display(df.head())
print("\nSplit counts:")
print(df['split'].value_counts())
print("\nDomain counts:")
print(df['domain'].value_counts())
# print("Unique intents:", df['intent'].unique())
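
To make the two supported layouts concrete, the sketch below builds one toy conversation in each schema and flattens both into the same row format. The conversation ID, texts, and intent label are invented for illustration, and `student_rows` is a hypothetical, condensed stand-in for the loader above rather than part of the dataset tooling.

```python
from typing import Dict, List, Union

# Toy conversations; IDs, texts and the intent label are invented for illustration.
schema_a = {
    "conv-1": {
        "data_split": "train",
        "turns": [
            {"role": "teacher", "text": "Preheat the oven.", "intent": None},
            {"role": "student", "text": "How long should I bake it?", "intent": "ask_question"},
        ],
    }
}
schema_b = [
    {
        "id": "conv-1",
        "split": "train",
        "dialog": [
            {"speaker": "teacher", "text": "Preheat the oven.", "intent": None},
            {"speaker": "student", "text": "How long should I bake it?", "intent": "ask_question"},
        ],
    }
]

def student_rows(data: Union[Dict, List]) -> List[Dict]:
    """Flatten either schema into rows of labelled student turns."""
    if isinstance(data, dict):   # Schema A: conv_id -> conversation
        convs = [(cid, conv, "turns", "role") for cid, conv in data.items()]
    else:                        # Schema B: list of conversations
        convs = [(conv.get("id"), conv, "dialog", "speaker") for conv in data]
    rows = []
    for conv_id, conv, turns_key, role_key in convs:
        split = conv.get("data_split") or conv.get("split") or "train"
        for turn in conv.get(turns_key, []):
            text = (turn.get("text") or "").strip()
            # Keep only student turns that have both text and an intent label.
            if turn.get(role_key) == "student" and text and turn.get("intent"):
                rows.append({"conv_id": conv_id, "split": split,
                             "text": text, "intent": turn["intent"]})
    return rows

rows_a = student_rows(schema_a)
rows_b = student_rows(schema_b)
print(rows_a == rows_b)
```

Both layouts yield the same single row here, which is the property the flexible loader relies on.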

## 3) Preprocessing & Vectorizers

- Lowercasing
- **Negation handling**: `not good` → `NOT_good`
- Unigrams + bigrams
- **Count** (NB) and **TF‑IDF** (LR/MLP)

In [None]:
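NEGATORS = {"no","not","never","none"}

def negate_bigram_tokens(text: str) -> str:
    """Lowercase the text and prefix the first word after a negator with NOT_."""
    toks = re.findall(r"[A-Za-z']+|\d+|[^\w\s]", text.lower())
    out, negate = [], False
    for tok in toks:
        # The tokenizer keeps contractions whole ("don't"), so a bare "n't"
        # token never occurs; check the suffix instead.
        if tok in NEGATORS or tok.endswith("n't"):
            negate = True; out.append(tok); continue
        if re.match(r"\w+", tok) and negate:
            out.append(f"NOT_{tok}"); negate=False
        else:
            out.append(tok)
    return " ".join(out)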

def preproc(text: str) -> str:
    return negate_bigram_tokens(text)

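# A custom preprocessor replaces the vectorizer's built-in lowercasing step,
# so preproc lowercases the text itself (the NOT_ prefix stays uppercase).
count_vec = CountVectorizer(ngram_range=(1,2), min_df=5, preprocessor=preproc, binary=True)
tfidf_vec = TfidfVectorizer(ngram_range=(1,2), min_df=5, preprocessor=preproc)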
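
As a quick illustration of the negation rule described above, the following sketch re-implements it in a standalone helper and applies it to three toy sentences. `mark_negation` is a hypothetical name, and treating contractions that end in `n't` as negators is an assumption of this sketch.

```python
import re

NEGATION_WORDS = {"no", "not", "never", "none"}

def mark_negation(text: str) -> str:
    """Prefix the first word after a negator with NOT_."""
    tokens = re.findall(r"[A-Za-z']+|\d+|[^\w\s]", text.lower())
    out, negate = [], False
    for tok in tokens:
        if tok in NEGATION_WORDS or tok.endswith("n't"):
            negate = True
            out.append(tok)
        elif negate and re.match(r"\w", tok):
            out.append(f"NOT_{tok}")
            negate = False
        else:
            out.append(tok)
    return " ".join(out)

for sentence in ["It is not ready yet", "I don't have a hammer", "It is ready"]:
    print(mark_negation(sentence))
```

Only the first word after the negator is marked, so `ready` and `NOT_ready` become separate vocabulary entries while the rest of the sentence is unchanged.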

## 4) Train/Test Split (prefer official split; else stratified 80/20)

In [None]:
# Use the dataset's own split labels when present; otherwise fall back to a stratified 80/20 split.
has_official = df['split'].str.lower().isin(['train','test','dev','validation']).any()
if has_official:
    train_mask = df['split'].str.lower().isin(['train','trn'])
    test_mask  = df['split'].str.lower().isin(['test','tst'])
    if not test_mask.any():
        # No test rows: divide the non-train rows (e.g. dev/validation) 50/50 into extra train and test data.
        non_train = df.loc[~train_mask]
        tr_df, te_df = train_test_split(non_train, test_size=0.5, stratify=non_train['intent'], random_state=42)
        train_df = pd.concat([df.loc[train_mask], tr_df], ignore_index=True)
        test_df = te_df.copy()
    else:
        # Official train and test rows exist; rows with any other label (e.g. validation) are left unused.
        train_df = df.loc[train_mask].copy()
        test_df  = df.loc[test_mask].copy()
else:
    train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['intent'], random_state=42)

print("Train size:", len(train_df), "| Test size:", len(test_df))
X_train, y_train = train_df['text'].tolist(), train_df['intent'].tolist()
X_test,  y_test  = test_df['text'].tolist(),  test_df['intent'].tolist()

## 5) Models (NB, LR, MLP)

In [None]:
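def timeit_fit_predict(pipe, Xtr, ytr, Xte):
    """Fit the pipeline, predict on Xte, and return wall-clock timings with the predictions."""
    # perf_counter is a monotonic clock intended for measuring short durations.
    t0=time.perf_counter(); pipe.fit(Xtr, ytr); t1=time.perf_counter()
    yhat = pipe.predict(Xte); t2=time.perf_counter()
    return {"train_time_s": t1-t0, "infer_time_s": t2-t1, "y_pred": yhat, "pipe": pipe}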

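from sklearn.base import clone

# Each pipeline gets its own unfitted copy of the vectorizer, so fitting one
# model never refits a vectorizer object held by another pipeline.
nb_pipe = Pipeline([("vec", clone(count_vec)), ("clf", MultinomialNB())])
lr_pipe = Pipeline([("vec", clone(tfidf_vec)), ("clf", LogisticRegression(max_iter=2000, n_jobs=-1))])
mlp_pipe = Pipeline([
    ("vec", clone(tfidf_vec)),
    ("clf", MLPClassifier(
        hidden_layer_sizes=(256,),
        activation="relu",
        batch_size=64,
        max_iter=20,          # only 20 epochs; scikit-learn may emit a ConvergenceWarning
        early_stopping=False, # train on all training rows, with no held-out validation fraction
        random_state=42
    ))
])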


results = {}
for name, pipe in [("NaiveBayes", nb_pipe), ("LogReg", lr_pipe), ("MLP", mlp_pipe)]:
    print(f"\nTraining {name} ...")
    out = timeit_fit_predict(pipe, X_train, y_train, X_test)
    acc = accuracy_score(y_test, out["y_pred"])
    f1_micro = f1_score(y_test, out["y_pred"], average="micro")
    f1_macro = f1_score(y_test, out["y_pred"], average="macro")
    results[name] = {**out, "accuracy": acc, "f1_micro": f1_micro, "f1_macro": f1_macro}
    print(f"{name}: ACC={acc:.3f} | F1_micro={f1_micro:.3f} | F1_macro={f1_macro:.3f} | train {out['train_time_s']:.2f}s | infer {out['infer_time_s']:.2f}s")

pd.DataFrame({k:{m:v[m] for m in ['accuracy','f1_micro','f1_macro','train_time_s','infer_time_s']} for k,v in results.items()})
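
For a rough sense of model size, the sketch below counts learned parameters for a linear model (one weight per class and feature, plus one bias or prior per class) and for the one-hidden-layer MLP configured above. The vocabulary size `V` and number of intents `K` are placeholder values chosen for illustration; the real values could be read from a fitted pipeline via `len(vec.vocabulary_)` and `len(clf.classes_)`.

```python
def linear_params(n_features: int, n_classes: int) -> int:
    """One weight per (class, feature) pair plus one bias/prior per class."""
    return n_classes * n_features + n_classes

def mlp_params(n_features: int, n_hidden: int, n_classes: int) -> int:
    """Weights and biases of a single-hidden-layer perceptron."""
    hidden = n_features * n_hidden + n_hidden
    output = n_hidden * n_classes + n_classes
    return hidden + output

# Placeholder sizes (assumptions, not measured from the dataset).
V, K, H = 5_000, 10, 256

print("linear model:", linear_params(V, K))
print("MLP:         ", mlp_params(V, H, K))
```

With these placeholder sizes the MLP holds roughly 25 times as many parameters as either linear model, almost all of them in the input-to-hidden weight matrix.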

## 6) Detailed Evaluation (best by Macro F1)

In [None]:
best_name = max(results, key=lambda k: results[k]['f1_macro'])
best = results[best_name]
print("Best model:", best_name)
print(classification_report(y_test, best['y_pred'], digits=3))

labels = sorted(np.unique(y_test + list(best['y_pred'])))
cm = confusion_matrix(y_test, best['y_pred'], labels=labels)

plt.figure(figsize=(12,10))
sns.heatmap(cm, annot=False, cmap="Blues", xticklabels=labels, yticklabels=labels)
plt.title(f"Confusion Matrix — {best_name}")
plt.xlabel("Predicted"); plt.ylabel("True")
plt.tight_layout(); plt.show()

## 7) Interpretability (LogReg): Top n‑grams per intent

In [None]:
def top_features_logreg(pipe, n=12):
    if not isinstance(pipe.named_steps['clf'], LogisticRegression):
        print("Top features only for LogisticRegression"); return None
    vec = pipe.named_steps['vec']; clf = pipe.named_steps['clf']
    feats = np.array(vec.get_feature_names_out())
    coefs = clf.coef_; classes = clf.classes_
    out = {}
    for i, cls in enumerate(classes):
        idx = np.argsort(coefs[i])[::-1][:n]
        out[cls] = list(zip(feats[idx], coefs[i][idx]))
    return out

if 'LogReg' in results:
    tops = top_features_logreg(results['LogReg']['pipe'], n=12)
    if tops:
        for cls, pairs in list(tops.items())[:8]:
            print(f"\nIntent: {cls}")
            for f,w in pairs: print(f"  {f:35s} {w:+.3f}")

## 8) Domain-wise Scores

In [None]:
if 'domain' in test_df.columns and len(test_df['domain'].unique())>1:
    dom_scores = []
    best_pipe = best['pipe']
    for dom in sorted(test_df['domain'].unique()):
        mask = (test_df['domain']==dom)
        y_true = test_df.loc[mask, 'intent'].tolist()
        y_pred = best_pipe.predict(test_df.loc[mask, 'text'].tolist())
        dom_scores.append({
            "domain": dom,
            "n": int(mask.sum()),
            "acc": accuracy_score(y_true, y_pred),
            "f1_micro": f1_score(y_true, y_pred, average="micro"),
            "f1_macro": f1_score(y_true, y_pred, average="macro"),
        })
    display(pd.DataFrame(dom_scores).sort_values("f1_macro", ascending=False))
else:
    print("Domain info not available or only one domain in test.")

## 9) Insights


- **Model comparison:**  
  Naïve Bayes and Logistic Regression both performed strongly, with accuracy ≈ 84%.  
  • **Naïve Bayes** achieved the best **macro-F1 (0.54)**, showing better balance across all intent classes, including the rarer ones.  
  • **Logistic Regression** had the best **overall accuracy (0.84)** and micro-F1, meaning it handled the frequent intents slightly better.  
  • **MLP** underperformed (accuracy 0.80, macro-F1 0.52) and was much slower to train. With a single 256-unit hidden layer and only 20 training iterations, this neural baseline was not competitive with the linear models on sparse bag-of-words features.

- **Efficiency:**  
  • Training time: NB was fastest (0.11 s), LR was moderate (≈ 1 s), and MLP was slowest (≈ 16 s).  
  • Inference time: all models were very fast (< 0.02 s for the whole test set), but NB and LR are clearly cheaper to train and deploy.  
  • Memory footprint: NB and LR store one weight per feature–class pair, while the MLP's 256-unit hidden layer alone holds 256 weights per vocabulary feature.

- **Robustness:**  
  • NB’s higher macro-F1 suggests it is more robust across rare intents, while LR tends to favor majority classes.  
  • LR may generalize better under domain shift (cooking ↔ DIY) because it directly optimizes conditional likelihood, but this notebook trains on both domains together and does not test cross-domain transfer.  
  • MLP offered no extra robustness; it appeared more prone to overfitting and scored lower on both accuracy and macro-F1.

- **Feature analysis:**  
  • Top LR features confirmed intuitive intent cues: question words (“how”, “what”, “do I”), cooking terms (“chop”, “oven”), and DIY tools (“nails”, “hammer”).  
  • Negation handling (e.g., “NOT_ready”) helped separate confirm/deny intents.  
  • Bigram features captured useful context (e.g., “next step”, “how long”), showing why linear BoW models work so well in this domain.
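
The gap between accuracy (or micro-F1) and macro-F1 discussed above can be reproduced on a toy example. The sketch below uses invented labels and a classifier that always predicts the majority class; it is a minimal illustration of the two averages, not a recomputation of the notebook's scores.

```python
from typing import Dict, List

def f1_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Per-class F1 computed from true/false positives and false negatives."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy data: nine frequent-intent turns, one rare-intent turn (labels invented).
y_true = ["ask"] * 9 + ["deny"]
y_pred = ["ask"] * 10          # always predicts the majority intent

per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)
# For single-label classification, micro-F1 equals accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"micro-F1 = {micro_f1:.3f}, macro-F1 = {macro_f1:.3f}")
```

Micro-F1 stays at 0.9 because nine of ten turns are correct, while macro-F1 falls below 0.5 because the missed rare class contributes an F1 of zero with equal weight.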

# Task 2 — Jericho Entity Recognition (BIO Tagger + Seq2Seq Targets)

**Problem statement:**  
Given a **location description** from a text-based world (Jericho/TextWorld), **predict the set of interactive objects**.  
We implement **BIO tagging** (token-level) and also prepare data for an optional **Seq2Seq** model.

**Why this matters for a pipeline agent:**  
Extracted entities feed into a downstream action selector (e.g., build a knowledge graph like in Ammanabrolu et al. and choose actions such as *open mailbox*, *go north*, etc.).

**Example:**  
Input: “… behind the white house. A path leads into the forest …”  
Output (BIO tagger): “… O O B I O O B O O O O …” → objects: “white house”, “path” (the period is tagged as its own token)  
Output (Seq2Seq): “…, white house, path, …”

> Not all nouns are interactable (e.g., *forest* may be too far). We aim to predict the **interactable objects**, as given by the dataset.
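
The relation between the two output formats can be sketched on the example sentence. The snippet below decodes a BIO tag sequence into object phrases and joins them into a comma-separated string of the kind a Seq2Seq model would be trained to emit. The helper `spans_from_bio` is a hypothetical name, and the tokenization (with the period as its own token) is an assumption of this sketch.

```python
from typing import List

tokens = ["behind", "the", "white", "house", ".", "A", "path",
          "leads", "into", "the", "forest"]
tags   = ["O", "O", "B", "I", "O", "O", "B", "O", "O", "O", "O"]

def spans_from_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect each B token plus its following I tokens into one phrase."""
    objects, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                objects.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            # An O tag (or an I tag with no open span) closes any current span.
            if current:
                objects.append(" ".join(current))
            current = []
    if current:
        objects.append(" ".join(current))
    return objects

objects = spans_from_bio(tokens, tags)
seq2seq_target = ", ".join(objects)
print(objects)
print(seq2seq_target)
```

Note that *forest* stays tagged `O` even though it is a noun, matching the point that only interactable objects are labelled.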

## 0) Environment check & installs (run once if needed)

In [None]:
# %pip install -q sklearn-crfsuite pandas numpy matplotlib seaborn tqdm
import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())

## 1) Imports

In [None]:
import os, json, re, random, math
from pathlib import Path
from typing import List, Dict, Tuple, Set, Any

import numpy as np
import pandas as pd
from tqdm import tqdm

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn_crfsuite
from sklearn_crfsuite import metrics as crf_metrics

random.seed(42); np.random.seed(42)
pd.set_option("display.max_colwidth", 200)

## 2) Config — file paths & truncation controls

In [None]:
TRAIN_PATH = Path("./train.json")
TEST_PATH  = Path("./test.json")

# Truncation controls (set to None to disable)
MAX_TRAIN_EXAMPLES = None   # e.g., 10000
MAX_TEST_EXAMPLES  = None   # e.g., 2000
MAX_TOKENS         = 256    # truncate long loc_desc

## 3) Load & normalize Jericho data

In [None]:
from typing import Any, List, Dict

def _norm_objects(objs_any: Any) -> List[str]:
    """Normalize an object annotation (list, dict, or comma-separated string) to a list of phrases."""
    if objs_any is None:
        return []
    if isinstance(objs_any, (list, dict)):
        # For dicts, the object phrases are taken from the values, not the keys.
        values = objs_any.values() if isinstance(objs_any, dict) else objs_any
        out = [" ".join(map(str, v)) if isinstance(v, list) else str(v) for v in values]
        return [s.strip() for s in out if s and s.strip()]
    if isinstance(objs_any, str):
        return [s.strip() for s in objs_any.split(",") if s.strip()]
    return [str(objs_any).strip()] if str(objs_any).strip() else []

def _extract_from_example(ex: Any) -> Dict[str, Any]:
    if isinstance(ex, dict):
        loc = ex.get("loc_desc") or ex.get("location") or ex.get("description")
        objs = ex.get("surrounding_objects") or ex.get("objects") or ex.get("interactive_objects")
        return {"loc_desc": str(loc) if loc is not None else "", "surrounding_objects": _norm_objects(objs)}
    if isinstance(ex, (list, tuple)):
        loc_candidates = [e for e in ex if isinstance(e, str)]
        loc = max(loc_candidates, key=len) if loc_candidates else (str(ex[0]) if ex else "")
        objs_candidate = None
        for e in reversed(ex):
            if isinstance(e, (list, dict, str)):
                objs_candidate = e; break
        objs = _norm_objects(objs_candidate)
        return {"loc_desc": loc, "surrounding_objects": objs}
    return {"loc_desc": str(ex), "surrounding_objects": []}

def load_split(path: Path, max_examples=None) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        for k in ("data","examples","items","records"):
            if isinstance(data.get(k), list):
                data = data[k]; break
    assert isinstance(data, list), f"Top-level JSON must be a list; got {type(data)}"
    print(f"Loaded {len(data)} items from {path.name}. Preview of first 2:")
    for i, ex in enumerate(data[:2]):
        print(f"  [{i}] type={type(ex).__name__} ->",
              (list(ex.keys()) if isinstance(ex, dict) else f"len={len(ex)}"))
    out = [_extract_from_example(ex) for ex in data]
    if max_examples is not None:
        out = out[:max_examples]
    return out

train_data = load_split(TRAIN_PATH, MAX_TRAIN_EXAMPLES)
test_data  = load_split(TEST_PATH,  MAX_TEST_EXAMPLES)
print("Train examples (normalized):", len(train_data))
print("Test  examples (normalized):", len(test_data))
print("Sample normalized item:", train_data[0])

## 4) Tokenization, normalization & truncation

In [None]:
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")
ARTICLES = {"a","an","the"}

def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)

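def normalize_tokens(ts: List[str]) -> List[str]:
    """Lowercase tokens, drop articles and punctuation, and strip a plural-like suffix."""
    out=[]
    for t in ts:
        tl=t.lower()
        if tl in ARTICLES:
            continue
        if re.fullmatch(r"\W", t):
            continue
        # Crude suffix stripping, not a real stemmer: "boxes" -> "box", but also
        # "houses" -> "hous" and "glass" -> "glas".
        if tl.endswith("es") and len(tl)>4: tl = tl[:-2]
        elif tl.endswith("s") and len(tl)>3: tl = tl[:-1]
        out.append(tl)
    return out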

def truncate_tokens(toks: List[str], max_len: int) -> List[str]:
    if max_len is None: return toks
    return toks[:max_len]

## 5) BIO label engineering (robust span alignment)

In [None]:
from difflib import SequenceMatcher
from typing import Tuple

def windows(tokens: List[str], min_len=1, max_len=6):
    for L in range(max_len, min_len-1, -1):
        for i in range(0, len(tokens)-L+1):
            yield i, i+L

def fuzzy_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def bio_label(loc_desc: str, object_phrases: List[str], max_tokens=256) -> Tuple[List[str], List[str]]:
    toks_full = tokenize(loc_desc)
    toks = truncate_tokens(toks_full, max_tokens)
    labels = ["O"] * len(toks)
    used = [False]*len(toks)

    norm_windows = []
    for s,e in windows(toks, min_len=1, max_len=6):
        span_norm = " ".join(normalize_tokens(toks[s:e]))
        if span_norm:
            norm_windows.append((s,e,span_norm))

    spans = []
    for phrase in object_phrases or []:
        p_norm = " ".join(normalize_tokens(tokenize(str(phrase))))
        if not p_norm: 
            continue
        found = False
        for s,e,span_norm in norm_windows:
            if span_norm == p_norm:
                spans.append((s,e)); found = True; break
        if not found:
            best = None; best_r = 0.0
            for s,e,span_norm in norm_windows:
                r = fuzzy_ratio(span_norm, p_norm)
                if r > best_r:
                    best_r, best = r, (s,e)
            if best and best_r >= 0.86:
                spans.append(best)

    spans.sort(key=lambda x: (x[1]-x[0]), reverse=True)
    for s,e in spans:
        if any(used[i] for i in range(s,e)): 
            continue
        labels[s] = "B"
        for i in range(s+1,e): labels[i] = "I"
        for i in range(s,e): used[i] = True

    return toks, labels

# Print explicit BIO sequences for two examples
for k in range(min(2, len(train_data))):
    toks, tags = bio_label(train_data[k]["loc_desc"], train_data[k]["surrounding_objects"], max_tokens=MAX_TOKENS)
    print(f"Example {k}:")
    print("TOK:", " ".join(toks[:120]))
    print("TAG:", " ".join(tags[:120]))
    print()
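
The fuzzy fallback in the alignment step relies on `difflib.SequenceMatcher.ratio()`, which returns twice the number of matched characters divided by the combined length of both strings. The sketch below applies the same 0.86 threshold to a few invented phrase pairs to show what it tolerates and what it rejects; the pairs are illustrative and not taken from the dataset.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.86  # same cut-off as the alignment code above

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

pairs = [
    ("white hous", "white house"),   # over-stripped plural vs. singular form
    ("mailbox", "small mailbox"),    # span is missing a modifier
    ("path", "bath"),                # different word, one letter apart
]
for span, phrase in pairs:
    r = similarity(span, phrase)
    verdict = "match" if r >= THRESHOLD else "no match"
    print(f"{span!r} vs {phrase!r}: {r:.3f} -> {verdict}")
```

The threshold absorbs small normalization artifacts such as `hous` versus `house`, while short words that differ by one letter, or spans missing a whole modifier, fall below it.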

## 6) Build sequences & length stats

In [None]:
def build_sequences(data_split: List[Dict]) -> Tuple[List[List[str]], List[List[str]]]:
    X, Y = [], []
    for ex in data_split:
        toks, tags = bio_label(ex["loc_desc"], ex["surrounding_objects"], max_tokens=MAX_TOKENS)
        X.append(toks); Y.append(tags)
    return X, Y

X_train, y_train = build_sequences(train_data)
X_test,  y_test  = build_sequences(test_data)

train_lens = [len(x) for x in X_train]
test_lens  = [len(x) for x in X_test]
print("Train sequences:", len(X_train), "| Test sequences:", len(X_test))
print("Avg train length:", int(np.mean(train_lens)), "| 95th pct:", int(np.percentile(train_lens,95)))
print("Avg test length:",  int(np.mean(test_lens)),  "| 95th pct:", int(np.percentile(test_lens,95)))

## 7) BIO tagger features (CRF)

In [None]:
def is_punct(tok: str) -> bool:
    return bool(re.fullmatch(r'\W', tok))

def shape(tok: str) -> str:
    s=[]
    for ch in tok:
        if ch.isupper(): s.append('X')
        elif ch.islower(): s.append('x')
        elif ch.isdigit(): s.append('d')
        else: s.append(ch)
    return "".join(s)

PREPS = {"in","on","at","into","inside","within","near","by","under","over","behind","beside","through"}
COPULAS = {"is","are","was","were"}
INTRO = {"there","here"}

def word2features(tokens, i):
    t = tokens[i]
    feats = {
        'bias': 1.0,
        'w.lower': t.lower(),
        'w.shape': shape(t),
        'is_title': t.istitle(),
        'is_upper': t.isupper(),
        'is_digit': t.isdigit(),
        'is_punct': is_punct(t),
    }
    if i>0:
        p=tokens[i-1]; feats.update({'-1.w.lower': p.lower(), '-1.is_punct': is_punct(p)})
    else: feats['BOS']=True
    if i<len(tokens)-1:
        n=tokens[i+1]; feats.update({'+1.w.lower': n.lower(), '+1.is_punct': is_punct(n)})
    else: feats['EOS']=True
    if i>1: feats['-2.w.lower']=tokens[i-2].lower()
    if i+2<len(tokens): feats['+2.w.lower']=tokens[i+2].lower()
    feats['pattern_intro'] = (tokens[i-2].lower() in INTRO and tokens[i-1].lower() in COPULAS) if i>=2 else False
    feats['prev_prep'] = tokens[i-1].lower() in PREPS if i>=1 else False
    return feats

def sequences2features(X): return [[word2features(sent, i) for i in range(len(sent))] for sent in X]

Xtr_feats = sequences2features(X_train)
Xte_feats = sequences2features(X_test)
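
To see where the two hand-written context features fire, the sketch below recomputes just `pattern_intro` and `prev_prep` for a toy sentence. The helper `context_flags`, the sentence, and the reduced preposition set are assumptions made for illustration.

```python
from typing import Dict, List

INTRO = {"there", "here"}
COPULAS = {"is", "are", "was", "were"}
PREPS = {"in", "on", "under", "behind"}   # reduced set for the example

def context_flags(tokens: List[str], i: int) -> Dict[str, bool]:
    """Recompute the two pattern features for the token at position i."""
    return {
        # True when the two preceding tokens are e.g. "There is"
        "pattern_intro": i >= 2 and tokens[i - 2].lower() in INTRO
                         and tokens[i - 1].lower() in COPULAS,
        # True when the preceding token is a preposition
        "prev_prep": i >= 1 and tokens[i - 1].lower() in PREPS,
    }

tokens = ["There", "is", "a", "mailbox", "behind", "the", "house"]
flags = [context_flags(tokens, i) for i in range(len(tokens))]
for tok, f in zip(tokens, flags):
    print(f"{tok:8s} {f}")
```

In this sentence both flags fire on the article that follows the cue (`a`, `the`) rather than on the noun itself; the noun still sees the cue through the `-2.w.lower` context feature.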

## 8) Train BIO tagger (CRF)

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True,
    c1=0.1, c2=0.1,
    # Note: sklearn_crfsuite.CRF has no class_weight option, so label imbalance
    # cannot be adjusted through a constructor argument here.
)
crf.fit(Xtr_feats, y_train)
print("Trained. Labels:", crf.classes_)

## 9) Predict + show explicit BIO sequences and reconstructed objects

In [None]:
y_pred = crf.predict(Xte_feats)

def tags_to_objects(tokens: List[str], tags: List[str]) -> List[str]:
    objs, cur = [], []
    for tok, tg in zip(tokens, tags):
        if tg == "B":
            if cur: objs.append(" ".join(cur)); cur=[]
            cur=[tok]
        elif tg == "I":
            if cur: cur.append(tok)
        else:
            if cur: objs.append(" ".join(cur)); cur=[]
    if cur: objs.append(" ".join(cur))
    return [o.strip() for o in objs if o.strip()]

for k in range(min(2, len(X_test))):
    print(f"Example {k}:")
    print("TOK:", " ".join(X_test[k][:120]))
    print("TAG:", " ".join(y_pred[k][:120]))
    print("Predicted objects:", tags_to_objects(X_test[k], y_pred[k]))
    print("Gold objects (values):", test_data[k]['surrounding_objects'])
    print()

## 10) Token-level evaluation

In [None]:
labels = list(crf.classes_)
print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=3))

## 11) Entity-set evaluation (compare predicted spans to gold VALUE set)

In [None]:
from difflib import SequenceMatcher

def norm_phrase(s: str) -> str:
    return " ".join(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+", s.lower()))

def fuzzy_hit(pred: str, golds: Set[str], thr=0.86) -> bool:
    for g in golds:
        if SequenceMatcher(None, norm_phrase(pred), norm_phrase(g)).ratio() >= thr:
            return True
    return False

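TP_pred=0; TP_gold=0; Pred=0; Gold=0
for ex, toks, tags in zip(test_data, X_test, y_pred):
    pred_objs = set(tags_to_objects(toks, tags))
    gold_objs = set(map(str, ex['surrounding_objects']))
    Pred += len(pred_objs)
    Gold += len(gold_objs)
    # Precision counts predictions that match some gold object.
    TP_pred += sum(fuzzy_hit(p, gold_objs) for p in pred_objs)
    # Recall counts gold objects matched by some prediction, so several
    # predictions hitting the same gold object are not double-counted.
    TP_gold += sum(fuzzy_hit(g, pred_objs) for g in gold_objs)

precision = TP_pred/Pred if Pred else 1.0
recall    = TP_gold/Gold if Gold else 1.0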
f1 = 2*precision*recall/(precision+recall) if (precision+recall)>0 else 0.0
print(f"[Entity-set] Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
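
A small worked example may help when reading the entity-set scores. The sketch below computes set-level precision, recall, and F1 for one toy description, using exact string matching instead of the fuzzy matcher for simplicity; the object names and the helper `set_prf` are invented for illustration.

```python
from typing import Set, Tuple

def set_prf(pred: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-level precision, recall and F1 with exact string matching."""
    matched_pred = sum(p in gold for p in pred)   # predictions that are correct
    matched_gold = sum(g in pred for g in gold)   # gold objects that were found
    precision = matched_pred / len(pred) if pred else 1.0
    recall = matched_gold / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

pred = {"white house", "path", "forest"}
gold = {"white house", "path", "mailbox", "door"}

p, r, f = set_prf(pred, gold)
print(f"Precision={p:.3f} Recall={r:.3f} F1={f:.3f}")
```

Two of three predictions are correct and two of four gold objects are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.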

## 12) Insights

- **Model performance:**  
  The CRF BIO tagger achieved **macro-F1 ≈ 0.98** on the test set, measured per token and averaged over the `B`, `I`, and `O` tags.  
  - `B`: Precision = **1.00**, Recall = **0.90**, F1 = **0.95**  
  - `I`: Precision = **0.98**, Recall = **1.00**, F1 = **0.99**  
  - Overall token accuracy ≈ **0.999**, a figure that largely reflects the very frequent `O` tag. The tagger is highly precise; its main weakness is `B` recall, where about one in ten object starts is missed.

- **BIO vs Seq2Seq:**  
  - **BIO tagging** outputs explicit token-level labels (e.g., `O O B I O …`), making spans easy to interpret and align with the text.  
  - **Seq2Seq** outputs a generated object list (e.g., `"white house, path"`), which is flexible but more computationally expensive and less transparent.  
  - For this dataset, the BIO tagger is both efficient and effective.

- **Efficiency:**  
  - Training the CRF takes only seconds.  
  - Inference is real-time, so it fits smoothly into a pipeline-based agent.  
  - Seq2Seq models would require more resources to train and run.

- **Error patterns:**  
  - Occasionally misses the **beginning token** of long or unusual object phrases. Because `tags_to_objects` only opens a span at a `B` tag, a missed `B` drops the whole object.  
  - Rarely mislabels irrelevant nouns.  
  - Continuation tokens (`I`) are handled almost perfectly.

- **Improvements:**  
  - Add POS/noun-chunk features (e.g., via spaCy) to help detect object starts.  
  - Try neural architectures (BiLSTM+CRF, BERT+CRF) for more robust predictions.  
  - Explore Seq2Seq approaches if canonicalized object lists are required for downstream knowledge graph construction.
