# 04 — Agent Walkthrough (Retrieval → Extraction → Summarization → Synthesis)

This notebook walks through the agent pipeline end to end:
- Loads the dataset and agent configs
- Plans a query and retrieves relevant documents (TF-IDF)
- Runs rule-based and spaCy NER extraction on the retrieved docs
- Summarizes the retrieved docs (extractive TextRank by default)
- Synthesizes a final answer with supporting evidence
- Saves the answer as JSON and (optionally) appends it to memory


In [1]:
import os, sys, yaml, json, time
from pathlib import Path

# Ensure local package is importable (src/ layout)
REPO_ROOT = Path.cwd().resolve().parents[0] if Path.cwd().name == "notebooks" else Path.cwd()
SRC_DIR = REPO_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

print("Repo root:", REPO_ROOT)
print("Using src dir:", SRC_DIR)

import pandas as pd
from IPython.display import display, HTML


Repo root: C:\Users\BAB AL SAFA\Documents\Vani\personal\escalate-nlp-agent
Using src dir: C:\Users\BAB AL SAFA\Documents\Vani\personal\escalate-nlp-agent\src


In [2]:
# Master config -> dataset config
CFG_PATH = REPO_ROOT / "configs" / "config.yaml"
with open(CFG_PATH, "r") as f:
    cfg = yaml.safe_load(f)

DS_CFG_PATH = REPO_ROOT / cfg["dataset_config"]
AGENT_CFG_PATH = REPO_ROOT / "configs" / "agent" / "news_aggregator.yaml"

with open(DS_CFG_PATH, "r") as f:
    ds_cfg = yaml.safe_load(f)
with open(AGENT_CFG_PATH, "r") as f:
    agent_cfg = yaml.safe_load(f)

print("Dataset:", ds_cfg["name"], "| id:", ds_cfg["id"])
print("Agent:", agent_cfg["name"])


Dataset: NLTK Reuters | id: reuters
Agent: News QA Agent


In [3]:
import spacy
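try:
    _ = spacy.load("en_core_web_sm")
except OSError:
    # Model not installed: download it with the interpreter that runs this kernel
    # (quoted because the path may contain spaces)
    !"{sys.executable}" -m spacy download en_core_web_sm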


In [4]:
proc_dir = REPO_ROOT / ds_cfg["outputs"]["processed_dir"]
train_path = proc_dir / "train.parquet"
assert train_path.exists(), f"Missing {train_path}. Run Step 1 preprocessing."

docs = pd.read_parquet(train_path)[["id","text","title"]]
print("Docs:", docs.shape)
docs.head(3)


Docs: (7616, 3)


,id,text,title
0,training/1,bahia cocoa review showers continued throughou...,
1,training/10,computer terminal systems &lt;cpml> completes ...,
2,training/100,n.z. trading bank deposit growth rises slightl...,


In [5]:
from escalate_nlp_agent.agent.planner import make_plan
from escalate_nlp_agent.agent.retriever import TfidfRetriever
from escalate_nlp_agent.agent.toolchain import run_extractors, run_summarizer, synthesize_answer
from escalate_nlp_agent.agent.memory import remember




In [6]:
# Try a few examples (Reuters):
# "What did the company report about profits?"
# "Which organizations were involved in the merger?"
# "What changes happened in oil prices?"
QUERY = "What did the company report about profits?"

plan = make_plan(QUERY)
plan

Plan(query='What did the company report about profits?', keywords=['what', 'did', 'the', 'company', 'report', 'about', 'profits'], need_entities=True, need_numbers=True)
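
The source of `make_plan` is not shown in this notebook, so the following is only a minimal sketch of how a planner could produce a `Plan` like the one printed above. The cue-word sets and the name `make_plan_sketch` are hypothetical; the only things taken from the output are the field names and the fact that the keywords are lowercased tokens with stopwords left in.

```python
import re
from dataclasses import dataclass, field

# Hypothetical cue words: the real planner's heuristics are not shown in this notebook.
ENTITY_CUES = {"who", "which", "company", "companies", "organization", "organizations"}
NUMBER_CUES = {"profit", "profits", "price", "prices", "much", "many", "rate"}

@dataclass
class Plan:
    query: str
    keywords: list = field(default_factory=list)
    need_entities: bool = False
    need_numbers: bool = False

def make_plan_sketch(query: str) -> Plan:
    # Lowercase and keep alphanumeric tokens. Stopwords are NOT removed,
    # which matches the keyword list printed above.
    keywords = re.findall(r"[a-z0-9]+", query.lower())
    return Plan(
        query=query,
        keywords=keywords,
        need_entities=any(k in ENTITY_CUES for k in keywords),
        need_numbers=any(k in NUMBER_CUES for k in keywords),
    )

plan_sketch = make_plan_sketch("What did the company report about profits?")
print(plan_sketch)
```

The next cell searches with `plan.query` rather than `plan.keywords`, so the stopwords in the keyword list do not affect retrieval in this walkthrough.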

In [7]:
top_k = int(agent_cfg["retriever"]["top_k"])
ngram_range = tuple(agent_cfg["retriever"].get("ngram_range", [1,2]))
max_features = int(agent_cfg["retriever"].get("max_features", 50000))

t0 = time.time()
ret = TfidfRetriever(ngram_range=ngram_range, max_features=max_features).fit(docs[["id","text"]])
hits = ret.search(plan.query, top_k=top_k)
t1 = time.time()

print(f"Retrieved {len(hits)} docs in {t1 - t0:.2f}s")
hits[["id","score"]].head()


Retrieved 5 docs in 2.71s


,id,score
5478,training/6597,0.142219
6472,training/8163,0.118463
433,training/10689,0.116421
6902,training/8861,0.116033
21,training/10041,0.115332
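
To make the scores above easier to interpret, here is a small standard-library sketch of TF-IDF retrieval with cosine similarity. It is not the `TfidfRetriever` implementation (which is not shown here): the function names are hypothetical, it uses unigrams only, and it has no `max_features` cap. The IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def vectorize(tokens, idf):
    # Term frequency times IDF, then L2-normalise so a dot product is a cosine.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(docs):
    """docs: dict of id -> text. Returns (idf, normalised doc vectors)."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # document frequency: count each term once per doc
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = {doc_id: vectorize(tokens, idf) for doc_id, tokens in tokenized.items()}
    return idf, vectors

def search(query, idf, vectors, top_k=5):
    q = vectorize(tokenize(query), idf)
    scores = {
        doc_id: sum(w * vec.get(t, 0.0) for t, w in q.items())
        for doc_id, vec in vectors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy corpus (illustrative only, not taken from the dataset)
toy_docs = {
    "d1": "company reports higher profits",
    "d2": "oil prices fall sharply",
    "d3": "bank deposit growth rises",
}
idf, vectors = build_index(toy_docs)
results = search("company profits", idf, vectors, top_k=3)
print(results)
```

Because both the query and the documents are L2-normalised, each score is a cosine similarity between 0 and 1. Low absolute values such as those in the table above are common when a short query is compared against long documents.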


In [8]:
# Read extractor configs listed in the agent config
extractor_cfg_paths = agent_cfg["extractors"]

def load_yaml(p):
    with open(p, "r") as f:
        return yaml.safe_load(f)

extract_cfgs = [load_yaml(REPO_ROOT / p) for p in extractor_cfg_paths]
extract_cfgs


[{'name': 'Rule-based Extraction',
  'patterns': {'dates': '\\b\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}\\b',
   'numbers': '\\b\\d+(?:\\.\\d+)?\\b',
   'emails': '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}',
   'urls': 'https?://[^\\s]+'},
  'outputs': {'file': 'data/processed/{dataset}/extractions_rule_based.parquet'}},
 {'name': 'spaCy NER Extraction',
  'model': 'en_core_web_sm',
  'labels': ['PERSON', 'ORG', 'GPE', 'DATE', 'MONEY', 'PERCENT'],
  'outputs': {'file': 'data/processed/{dataset}/extractions_spacy_ner.parquet'}}]
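
The regex patterns in the rule-based config can be tried directly with Python's `re` module. The sketch below copies the four patterns from the output above; the helper name `extract_with_rules` and the sample sentence are made up for illustration, and the actual extractor in `toolchain` may post-process matches differently.

```python
import re

# Patterns copied from the rule-based extractor config printed above.
PATTERNS = {
    "dates": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
    "numbers": r"\b\d+(?:\.\d+)?\b",
    "emails": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "urls": r"https?://[^\s]+",
}

def extract_with_rules(text, patterns=PATTERNS):
    # One list of raw string matches per pattern name
    return {name: re.findall(rx, text) for name, rx in patterns.items()}

sample = "revenues of 10.2 mln dlrs reported on 31/03/1987"
found = extract_with_rules(sample)
print(found["numbers"])  # ['10.2', '31', '03', '1987']
print(found["dates"])    # ['31/03/1987']
```

Note that the `numbers` pattern also matches the individual components of a date, and a thousands-separated figure such as `732,000` is split into `732` and `000`.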

In [12]:
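# Later steps are given repo-relative paths (extractor configs, memory file), so run from the repo root
os.chdir(REPO_ROOT)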
print("CWD set to:", os.getcwd())


CWD set to: C:\Users\BAB AL SAFA\Documents\Vani\personal\escalate-nlp-agent


In [13]:
# We’ll reuse the high-level runner from toolchain (which dispatches to rule-based / spaCy)
extracts = run_extractors(hits[["id","text"]].copy(), extractor_cfg_paths)
# Dictionary with keys like 'rule_based' and 'spacy_ner'
list(extracts.keys())

['rule_based', 'spacy_ner']

In [10]:
sum_kind = agent_cfg["summarizer"]["type"]            # "extractive" or "abstractive"
sum_cfg_path = REPO_ROOT / agent_cfg["summarizer"]["config"]

summaries = run_summarizer(hits[["id","text"]].copy(), str(sum_cfg_path), sum_kind)
summaries.head(3)


,id,summary
0,training/6597,craftmatic/contour &lt;crcc> sees higher profi...
1,training/8163,booker says 1987 starts well booker plc &lt;bo...
2,training/10689,"japan isolated, yen rises, world feels cheated..."
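
For readers unfamiliar with extractive TextRank, the idea can be sketched with the standard library alone: build a sentence graph weighted by word overlap, run a PageRank-style iteration, and keep the top-ranked sentences in their original order. This is an illustration under simplifying assumptions, not the summarizer that `run_summarizer` calls; the function names, the naive sentence splitter, and the sample text are all hypothetical.

```python
import math
import re

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    # Word overlap normalised by log sentence lengths (TextRank-style).
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(text, n_sent=2, damping=0.85, iterations=50):
    sents = split_sentences(text)
    n = len(sents)
    if n <= n_sent:
        return " ".join(sents)
    weights = [
        [similarity(sents[i], sents[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update: each sentence receives score from its neighbours
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    # Keep the n_sent best sentences, restored to document order
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sent])
    return " ".join(sents[i] for i in top)

text = (
    "The company reported higher profits this quarter. "
    "Profits at the company rose on strong sales. "
    "The weather was sunny in London. "
    "Strong sales lifted company profits again."
)
summary = textrank_summary(text, n_sent=2)
print(summary)
```

The off-topic weather sentence shares almost no words with the others, so it ranks last and is dropped. A production summarizer would need a more robust sentence splitter, since this one would break on abbreviations such as "u.s." that appear in the Reuters text.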


In [14]:
answer = synthesize_answer(
    query=plan.query,
    docs=hits,
    summaries=summaries,
    extracts=extracts,
    n_sent=int(agent_cfg["reasoning"]["answer_sentences"])
)

print("=== FINAL ANSWER ===\n")
print(answer["answer"])


=== FINAL ANSWER ===

craftmatic/contour &lt;crcc> sees higher profits craftmatic/contour industries inc said it would report substantial profits for the first quarter of fiscal 1987 ending march 31. the company recorded net income of 732,000 dlrs, or 22 cts per share, on revenues of 10.2 mln dlrs. booker says 1987 starts well booker plc &lt;bokl.l> said 1987 had started well and the group had the resources to invest in its growth business both organically and by acquisition. it was commenting on figures for 1986 which showed pretax profits rising to 54.6 mln from 46.5 mln previously.


In [15]:
def show_evidence(ans):
    print("\n--- Evidence (top retrieved docs) ---")
    for e in ans["support"][:5]:
        print(f"[{e['id']}] score={e['score']:.3f} :: {e['snippet'][:220]}...")

def show_entities(ans, max_items=15):
    ents = ans.get("entities", [])[:max_items]
    if not ents:
        print("\n(entities: none)")
        return
    print("\n--- Entities ---")
    for ent in ents:
        print(f"{ent['label']}: {ent['text']}")

def show_numbers_dates(ans, max_docs=5):
    nums = ans.get("numbers_dates", [])[:max_docs]
    if not nums:
        print("\n(numbers/dates: none)")
        return
    print("\n--- Numbers/Dates by doc ---")
    for nd in nums:
        print(f"[{nd['id']}] numbers={nd['numbers']} dates={nd['dates']}")

show_evidence(answer)
show_entities(answer)
show_numbers_dates(answer)



--- Evidence (top retrieved docs) ---
[training/6597] score=0.142 :: craftmatic/contour &lt;crcc> sees higher profits craftmatic/contour industries inc said it would report substantial profits for the first quarter of fiscal 1987 ending march 31. the company recorded net income of 732,000...
[training/8163] score=0.118 :: booker says 1987 starts well booker plc &lt;bokl.l> said 1987 had started well and the group had the resources to invest in its growth business both organically and by acquisition. it was commenting on figures for 1986 w...
[training/10689] score=0.116 :: japan isolated, yen rises, world feels cheated japan is becoming dangerously isolated again as the u.s. and europe feel they have been cheated by japanese promises to switch from export to domestic-led growth, officials ...
[training/8861] score=0.116 :: prudential records best results in six years &lt;prudential corporation plc>, which earlier announced a 62 pct rise in 1986 pre-tax profits, said it had recorded it

In [16]:
out_json = REPO_ROOT / agent_cfg["outputs"]["demo_json"]
out_json.parent.mkdir(parents=True, exist_ok=True)
out_json.write_text(json.dumps(answer, ensure_ascii=False, indent=2), encoding="utf-8")
print("Saved demo JSON ->", out_json)

if agent_cfg["memory"]["enabled"]:
    remember(agent_cfg["memory"]["file"], answer)
    print("Memory appended ->", REPO_ROOT / agent_cfg["memory"]["file"])


Saved demo JSON -> C:\Users\BAB AL SAFA\Documents\Vani\personal\escalate-nlp-agent\reports\agent\demo_output.json
Memory appended -> C:\Users\BAB AL SAFA\Documents\Vani\personal\escalate-nlp-agent\data\agent_memory.jsonl
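
The memory file has a `.jsonl` extension, which suggests one JSON object per line. A minimal sketch of that append-and-read pattern follows; `remember_sketch` and `recall_sketch` are hypothetical names, the `ts` timestamp field is an added assumption, and the real `remember` function may store a different record shape.

```python
import json
import tempfile
import time
from pathlib import Path

def remember_sketch(path, record):
    """Append one JSON object per line (JSONL), creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), **record}  # "ts" is an assumed extra field
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def recall_sketch(path):
    """Read every stored record back, oldest first."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo in a temporary directory so nothing in the repo is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "memory" / "agent_memory.jsonl"
    remember_sketch(demo_file, {"query": "q1", "answer": "a1"})
    remember_sketch(demo_file, {"query": "q2", "answer": "a2"})
    recalled = recall_sketch(demo_file)

print([r["query"] for r in recalled])
```

Appending line by line keeps earlier entries intact and avoids rewriting the whole file on every run, which is the usual reason to prefer JSONL over a single JSON array for logs like this.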


In [17]:
from typing import Optional

def ask(query: str, k: Optional[int] = None):
    """Run plan -> retrieve -> extract -> summarize -> synthesize for one query."""
    # Fall back to the configured top_k only when k is not given
    k = int(agent_cfg["retriever"]["top_k"]) if k is None else k
    plan = make_plan(query)
    hits = ret.search(plan.query, top_k=k)
    extracts = run_extractors(hits[["id","text"]].copy(), agent_cfg["extractors"])
    sum_cfg = str(REPO_ROOT / agent_cfg["summarizer"]["config"])
    summaries = run_summarizer(hits[["id","text"]].copy(), sum_cfg, agent_cfg["summarizer"]["type"])
    return synthesize_answer(query, hits, summaries, extracts, n_sent=int(agent_cfg["reasoning"]["answer_sentences"]))

# Example:
ans2 = ask("Which organizations were mentioned in the report?")
print(ans2["answer"])
show_evidence(ans2)


japan isolated, yen rises, world feels cheated japan is becoming dangerously isolated again as the u.s. and europe feel they have been cheated by japanese promises to switch from export to domestic-led growth, officials and businessmen from around the world said. as the dollar today slipped to a rec world bank report criticises peru economic plan a confidential world bank report on the peruvian economy said the government's strategy does not offer good prospects for medium and long-term growth and is likely to quickly lead to inflation. the report, published today by an economic monthly, the pe indonesia seen at crossroads over economic change indonesia appears to be nearing a political crossroads over measures to deregulate its protected economy, the u.s.

--- Evidence (top retrieved docs) ---
[training/10689] score=0.185 :: japan isolated, yen rises, world feels cheated japan is becoming dangerously isolated again as the u.s. and europe feel they have been cheated by japanese promise