# 01 — Metadata Cleaning & Normalization

**Objective.** Define a minimal schema and apply normalization routines (title whitespace cleanup, controlled values, keyword splitting, year checks).

**Schema (proposal).**
- `id`, `title`, `author`, `year`, `discipline`, `keywords`, `abstract`
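
Before normalizing, it can help to confirm that an input actually carries the proposed columns. The sketch below is one possible check, not part of the notebook's pipeline; `SCHEMA_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
# Column names copied from the proposed schema above.
SCHEMA_COLUMNS = ["id", "title", "author", "year", "discipline", "keywords", "abstract"]

def missing_columns(columns) -> list:
    """Return the schema columns absent from `columns`, in schema order."""
    present = set(columns)
    return [c for c in SCHEMA_COLUMNS if c not in present]

# Hypothetical header with only two of the seven columns.
print(missing_columns(["id", "title"]))
# ['author', 'year', 'discipline', 'keywords', 'abstract']
```

With a pandas frame, the same helper could be called as `missing_columns(df.columns)`.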

**Checks.**
- Year is an integer between 1990 and the current year, inclusive
- Discipline belongs to a lightweight controlled list
- Keywords are normalized to `;`-separated lower-case tokens
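
The three checks can be tried on single values before applying them to a whole table. This is a minimal standalone sketch under stated assumptions: `ALLOWED` is a hypothetical subset of a controlled list, the helper names are illustrative, and the keyword helper keeps tokens in input order.

```python
import datetime
import re

# Hypothetical subset of a controlled list, for illustration only.
ALLOWED = {"economics", "education", "public health"}

def year_ok(value) -> bool:
    """True if value parses as an integer in [1990, current year]."""
    try:
        year = int(value)
    except (TypeError, ValueError):
        return False
    return 1990 <= year <= datetime.date.today().year

def discipline_ok(value) -> bool:
    """True if the trimmed, lower-cased value is in the controlled list."""
    return str(value).strip().lower() in ALLOWED

def keyword_tokens(value: str) -> str:
    """Split on ';' or ',', trim and lower-case, then rejoin with ';'."""
    tokens = [t.strip().lower() for t in re.split(r"[;,]", value) if t.strip()]
    return ";".join(tokens)

print(year_ok("2005"), year_ok(1989), year_ok("n/a"))  # True False False
print(discipline_ok(" Public Health "))                # True
print(keyword_tokens("Open Access, Metadata; FAIR "))  # open access;metadata;fair
```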

**Artifacts.**
- Preview table of cleaned metadata
- List of records with validation warnings


In [None]:
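import datetime
import pathlib
import re

import pandas as pd

# `__file__` is not defined inside a notebook kernel, so resolve paths from the
# working directory instead. Assumption: the notebook runs from a folder
# directly under the project root, with data/ as a sibling of that folder.
ROOT = pathlib.Path.cwd().resolve().parent
DATA = ROOT / "data"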

df = pd.read_csv(DATA / "metadata_samples.csv")

CONTROLLED_DISCIPLINES = {
    "business", "computer science", "public health", "education",
    "economics", "environment", "policy"
}

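def normalize_title(x) -> str:
    # Missing titles become "" instead of raising on a non-string value.
    if pd.isna(x):
        return ""
    # Collapse whitespace runs, trim, and drop trailing spaces and periods.
    return re.sub(r"\s+", " ", str(x)).strip().rstrip(" .")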

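def normalize_keywords(x) -> str:
    # Missing values become "" rather than the literal token "nan".
    if pd.isna(x):
        return ""
    toks = [t.strip().lower() for t in re.split(r"[;,]", str(x)) if t.strip()]
    # De-duplicate and sort so equivalent keyword sets compare equal.
    return ";".join(sorted(set(toks)))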

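def validate_year(y) -> bool:
    # True only for values that parse as an integer in [1990, current year].
    try:
        y = int(y)
    except (TypeError, ValueError, OverflowError):
        # Covers None, NaN, non-numeric strings, and infinities.
        return False
    return 1990 <= y <= datetime.datetime.now().year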

def normalize_discipline(x: str) -> str:
    s = str(x).strip().lower()
    return s if s in CONTROLLED_DISCIPLINES else f"uncontrolled::{s}"

df["title"] = df["title"].map(normalize_title)
df["keywords"] = df["keywords"].map(normalize_keywords)
df["year_valid"] = df["year"].map(validate_year)
df["discipline_norm"] = df["discipline"].map(normalize_discipline)

warnings = df[~df["year_valid"] | df["discipline_norm"].str.startswith("uncontrolled::")]
display(df.head(10))
display(warnings)
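
A list of flagged records is easier to act on when each row says which check it failed. The following is a hedged sketch on a small hypothetical frame that mimics the `year_valid` and `discipline_norm` columns; `warning_reason` is an illustrative helper, not something the notebook defines.

```python
import pandas as pd

# Hypothetical rows mimicking the validation columns (illustration only).
checked = pd.DataFrame({
    "id": [1, 2, 3],
    "year_valid": [True, False, True],
    "discipline_norm": ["economics", "policy", "uncontrolled::astrology"],
})

def warning_reason(row: pd.Series) -> str:
    """Join the names of the checks that a record failed."""
    reasons = []
    if not row["year_valid"]:
        reasons.append("year")
    if row["discipline_norm"].startswith("uncontrolled::"):
        reasons.append("discipline")
    return ";".join(reasons)

checked["warning_reason"] = checked.apply(warning_reason, axis=1)
# Keep only the records with at least one failed check.
flagged = checked[checked["warning_reason"] != ""]
print(flagged[["id", "warning_reason"]])
```

Here records 2 and 3 are flagged, for `year` and `discipline` respectively, while record 1 passes both checks.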