# 03 — Context: Semantics & Metadata

This notebook demonstrates **AI Readiness Pillar #3: Context**.

You will:
- Create a tiny **business glossary** (definitions)
- Build a **metadata registry** (owner, lineage, sensitivity)
- Create a curated, "RAG-ready" knowledge base from structured data
- Run a lightweight retrieval demo (TF‑IDF cosine similarity)

> It uses no external models or API keys, so it runs anywhere pandas and scikit-learn are installed.


In [None]:
from pathlib import Path
import pandas as pd

BASE_PATH = Path("..")
DATA_CURATED = BASE_PATH / "data" / "curated"
DOCS_PATH = BASE_PATH / "data" / "curated" / "knowledge_base"
DOCS_PATH.mkdir(parents=True, exist_ok=True)


In [None]:
customers = pd.read_csv(DATA_CURATED / "customers_silver.csv")
orders = pd.read_csv(DATA_CURATED / "orders_silver.csv")
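
Later cells read specific columns from these tables, so it can help to check for them right after loading. The following is a minimal sketch: `EXPECTED` and `missing_columns` are hypothetical names, and the demo frame stands in for the real CSV so the snippet runs on its own.

```python
import pandas as pd

# Columns that later cells in this notebook read from each table.
EXPECTED = {
    "customers": {"customer_id", "department", "country"},
    "orders": {"amount", "currency"},
}

def missing_columns(df: pd.DataFrame, expected: set[str]) -> set[str]:
    """Return the expected column names that are absent from `df`."""
    return expected - set(df.columns)

# Illustrative stand-in for customers_silver.csv with one column left out.
customers_demo = pd.DataFrame({"customer_id": [1], "country": ["GB"]})
print(missing_columns(customers_demo, EXPECTED["customers"]))  # {'department'}
```

An empty set means the table has everything the rest of the notebook needs.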


## Business glossary (minimal example)

In real organisations the glossary typically lives in a data catalog such as Microsoft Purview or Collibra.
Here we keep it in a simple pandas table.


In [None]:
glossary = pd.DataFrame([
    {"term":"Customer", "definition":"An individual or organisation that has an active relationship with the institution and a unique customer_id."},
    {"term":"Order", "definition":"A recorded transaction linked to a customer_id, representing a purchase or chargeable event."},
    {"term":"Department", "definition":"Business unit responsible for the customer relationship and data stewardship for related records."},
    {"term":"Confidential", "definition":"Data that must not be exposed outside approved roles; access requires auditing and policy enforcement."},
])
glossary
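
A glossary is only useful if each term is defined once and is easy to look up. The sketch below assumes a table with the same `term` and `definition` columns as above; `glossary_demo` and `define` are hypothetical names used for illustration, not part of any catalog API.

```python
import pandas as pd

# Illustrative two-row glossary with the same columns as the table above.
glossary_demo = pd.DataFrame([
    {"term": "Customer", "definition": "An individual or organisation with a unique customer_id."},
    {"term": "Order", "definition": "A recorded transaction linked to a customer_id."},
])

# A term defined twice is ambiguous, so fail early.
assert not glossary_demo["term"].str.lower().duplicated().any(), "duplicate glossary terms"

def define(term: str, table: pd.DataFrame = glossary_demo) -> str:
    """Return the definition for `term` (case-insensitive) or raise KeyError."""
    wanted = term.strip().lower()
    match = table.loc[table["term"].str.lower() == wanted, "definition"]
    if match.empty:
        raise KeyError(f"Term not in glossary: {term!r}")
    return match.iloc[0]

print(define("customer"))
```

Raising on an unknown term, rather than returning an empty string, keeps a downstream prompt from silently using an undefined concept.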


## Metadata registry + lineage

Context also includes:
- ownership
- sensitivity
- how the dataset was created (lineage)


In [None]:
registry = pd.DataFrame([
    {"asset":"customers_silver", "owner":"data.steward@org", "sensitivity":"confidential", "source":"customers.csv", "transform":"dedupe + standardise country + quarantine"},
    {"asset":"orders_silver", "owner":"finance.steward@org", "sensitivity":"internal", "source":"orders.csv", "transform":"remove negative amounts + require currency"},
])
registry
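
Registry entries are easy to leave half-filled, so a small validation pass can catch gaps before the metadata is published. This is a sketch under stated assumptions: the allowed sensitivity labels are an assumed scheme (only `internal` and `confidential` appear in this notebook), and `validate_registry` is a hypothetical helper.

```python
import pandas as pd

REQUIRED_COLUMNS = ["asset", "owner", "sensitivity", "source", "transform"]
# Assumed classification scheme; replace with your organisation's own labels.
ALLOWED_SENSITIVITY = {"public", "internal", "confidential"}

def validate_registry(reg: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the registry passed."""
    missing = [c for c in REQUIRED_COLUMNS if c not in reg.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Every required field must be filled in for every asset.
    for col in REQUIRED_COLUMNS:
        blank = reg[col].isna() | (reg[col].astype(str).str.strip() == "")
        for asset in reg.loc[blank, "asset"]:
            problems.append(f"{asset}: empty {col}")
    # Sensitivity must come from the agreed label set.
    bad = reg["sensitivity"].notna() & ~reg["sensitivity"].isin(ALLOWED_SENSITIVITY)
    for asset, value in zip(reg.loc[bad, "asset"], reg.loc[bad, "sensitivity"]):
        problems.append(f"{asset}: unknown sensitivity {value!r}")
    # Each asset should be registered exactly once.
    for asset in reg.loc[reg["asset"].duplicated(), "asset"]:
        problems.append(f"{asset}: duplicate registry entry")
    return problems

# Illustrative registry with two deliberate errors in the second row.
registry_demo = pd.DataFrame([
    {"asset": "customers_silver", "owner": "data.steward@org", "sensitivity": "confidential",
     "source": "customers.csv", "transform": "dedupe"},
    {"asset": "orders_silver", "owner": None, "sensitivity": "secret",
     "source": "orders.csv", "transform": "filter"},
])
print(validate_registry(registry_demo))
```

The demo row is missing an owner and uses a label outside the assumed scheme, so both problems are reported.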


## Build a 'RAG-ready' knowledge base (structured → text)

A common mistake is dumping raw files into a vector DB.
Instead, curate small, meaningful documents with:
- definitions
- key stats
- ownership
- lineage

We'll generate one text document per asset.


In [None]:
def asset_doc(asset_name: str) -> str:
    meta = registry[registry["asset"]==asset_name].iloc[0].to_dict()
    if asset_name == "customers_silver":
        df = customers
        stats = {
            "rows": len(df),
            "distinct_customers": df["customer_id"].nunique(),
            "departments": sorted(df["department"].dropna().unique().tolist()),
            "countries": sorted(df["country"].dropna().unique().tolist())
        }
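    elif asset_name == "orders_silver":
        df = orders
        stats = {
            "rows": len(df),
            "total_amount": float(df["amount"].sum()),
            "avg_amount": float(df["amount"].mean()),
            "currencies": sorted(df["currency"].dropna().unique().tolist())
        }
    else:
        # Fail loudly rather than reporting order stats for an unrelated asset.
        raise ValueError(f"No stats defined for asset: {asset_name!r}")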

    return f"""ASSET: {asset_name}
OWNER: {meta['owner']}
SENSITIVITY: {meta['sensitivity']}
SOURCE: {meta['source']}
LINEAGE: {meta['transform']}

KEY STATS: {stats}
"""

docs = {}
for asset in registry["asset"]:
    docs[asset] = asset_doc(asset)
    (DOCS_PATH / f"{asset}.txt").write_text(docs[asset], encoding="utf-8")

# glob() returns files in arbitrary order, so sort for a stable preview.
sorted(DOCS_PATH.glob("*.txt"))[:3]
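
The curated documents above cover ownership, lineage, and key stats, but not definitions. One way to add them is to render each glossary term as its own small document. This is a sketch only: the `glossary_<term>` naming scheme and the `glossary_doc` helper are assumptions, and the demo table stands in for the notebook's glossary.

```python
import pandas as pd

# Illustrative glossary with the same columns as the notebook's table.
glossary_demo = pd.DataFrame([
    {"term": "Customer", "definition": "An individual or organisation with a unique customer_id."},
    {"term": "Order", "definition": "A recorded transaction linked to a customer_id."},
])

def glossary_doc(row: pd.Series) -> str:
    """Render one glossary row in the same KEY: value layout as the asset documents."""
    return f"TERM: {row['term']}\nDEFINITION: {row['definition']}\n"

# One document per term, keyed by an assumed 'glossary_<term>' name.
glossary_docs = {
    f"glossary_{row['term'].lower()}": glossary_doc(row)
    for _, row in glossary_demo.iterrows()
}
print(glossary_docs["glossary_customer"])
```

These documents could be merged into the same dictionary as the asset documents so that a question about a term and a question about a dataset hit the same index.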


## Retrieval demo (TF‑IDF)

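TF‑IDF is lexical rather than semantic: it scores documents by shared terms, not by meaning. It stands in here for an embedding-based retriever so you can see why curation matters.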


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_names = list(docs.keys())
corpus = [docs[n] for n in doc_names]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

def retrieve(query: str, top_k: int = 2):
    q = vectorizer.transform([query])
    sims = cosine_similarity(q, X).flatten()
    ranked = sorted(zip(doc_names, sims), key=lambda x: x[1], reverse=True)[:top_k]
    return ranked

retrieve("Which dataset is confidential and who owns it?")
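
A top-k ranking always returns k documents, even when none of them share a term with the query. A minimal sketch of a score threshold follows; `retrieve_filtered`, `min_score`, and the two shortened demo documents are assumptions made so the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Shortened stand-ins for the generated asset documents.
demo_docs = {
    "customers_silver": "ASSET: customers_silver OWNER: data.steward@org SENSITIVITY: confidential",
    "orders_silver": "ASSET: orders_silver OWNER: finance.steward@org SENSITIVITY: internal",
}
demo_names = list(demo_docs)
demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_matrix = demo_vectorizer.fit_transform(list(demo_docs.values()))

def retrieve_filtered(query: str, top_k: int = 2, min_score: float = 0.0):
    """Rank documents by cosine similarity, keeping only scores above `min_score`."""
    query_vec = demo_vectorizer.transform([query])
    sims = cosine_similarity(query_vec, demo_matrix).flatten()
    ranked = sorted(zip(demo_names, sims), key=lambda pair: pair[1], reverse=True)
    return [(name, float(score)) for name, score in ranked[:top_k] if score > min_score]

print(retrieve_filtered("Which dataset is confidential?"))  # only customers_silver matches
print(retrieve_filtered("What is the weather today?"))      # no shared terms, so no results
```

Returning an empty list for an off-topic query lets the calling code say "no relevant context" instead of passing unrelated documents to a model.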


## Why this matters for AI

- RAG needs **curated, governed** context
- Agents need **definitions + lineage** to act safely
- Consistent definitions leave a model less room to guess, which can reduce hallucinations
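
As one illustration of "governed" context, the sensitivity label in the registry can decide which documents a caller is allowed to retrieve. The sketch below is hypothetical: the ordering of labels in `LEVELS` and the `allowed_assets` helper are assumptions, not an established policy model.

```python
# Assumed ordering of sensitivity labels, from least to most restricted.
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

# Illustrative registry rows using the labels from this notebook.
registry_rows = [
    {"asset": "customers_silver", "sensitivity": "confidential"},
    {"asset": "orders_silver", "sensitivity": "internal"},
]

def allowed_assets(clearance: str, rows=registry_rows) -> list[str]:
    """Return assets whose sensitivity does not exceed the caller's clearance."""
    limit = LEVELS[clearance]
    return [row["asset"] for row in rows if LEVELS[row["sensitivity"]] <= limit]

print(allowed_assets("internal"))      # ['orders_silver']
print(allowed_assets("confidential"))  # ['customers_silver', 'orders_silver']
```

Filtering before retrieval, rather than after generation, means restricted text never reaches the prompt in the first place.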

Context is what turns 'data' into 'knowledge'.
