# Venues data inspection: Raw → Silver → Gold

This notebook inspects the outputs of the data foundation pipeline: raw Overture + OSM data, silver (conflated venues), and gold (venues with `gold_text`). It checks that each stage produced the expected schema and row counts, and assesses how informative the final `gold_text` is.

**Paths** (relative to repo root):
- Raw: `data/raw/overture/temp/overture_sample.parquet`, `data/raw/osm/temp/osm_pois.parquet`
- Silver: `data/silver/venues.parquet`
- Gold: `data/gold/venues.parquet`

## Setup

In [None]:
import sys
from pathlib import Path

import pandas as pd

# Repo root — run this notebook from the repo root so that data/ is found
ROOT = Path(".").resolve()

sys.path.insert(0, str(ROOT / "src"))

PATHS = {
    "overture": ROOT / "data" / "raw" / "overture" / "temp" / "overture_sample.parquet",
    "osm": ROOT / "data" / "raw" / "osm" / "temp" / "osm_pois.parquet",
    "silver": ROOT / "data" / "silver" / "venues.parquet",
    "gold": ROOT / "data" / "gold" / "venues.parquet",
}

for name, p in PATHS.items():
    print(f"  {name}: {p}  exists={p.exists()}")


## 1. Raw data

### 1.1 Overture sample

A sample of Overture Places (filtered by bounding box and capped by a limit). Expected columns: at least `gers_id`, `lat`, and `lon`, plus `name`, `category`, and `city` if the ingest passes them through.

In [None]:
if PATHS["overture"].exists():
    df_o = pd.read_parquet(PATHS["overture"])
    print("Shape:", df_o.shape)
    print("Columns:", list(df_o.columns))
    print("\nDtypes:")
    print(df_o.dtypes)
    print("\nHead:")
    display(df_o.head(5))
else:
    print("Overture sample not found. Run the pipeline (e.g. overture_sample task) first.")
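
Printing the schema shows what is there, but not whether the coordinates are usable. The sketch below is one possible follow-up check, not part of the pipeline: `check_points` and its report keys are hypothetical names, and the only columns it assumes are the `gers_id`, `lat`, and `lon` listed above. It runs on a small in-memory frame so it works without the Parquet files.

```python
import pandas as pd

# Columns the text above says the Overture sample should at least contain.
REQUIRED = ("gers_id", "lat", "lon")


def check_points(df: pd.DataFrame, required=REQUIRED) -> dict:
    """Report missing columns plus null and out-of-range coordinates."""
    report = {"missing_columns": [c for c in required if c not in df.columns]}
    if "lat" in df.columns and "lon" in df.columns:
        # Coerce so that non-numeric values are counted as nulls, not errors.
        lat = pd.to_numeric(df["lat"], errors="coerce")
        lon = pd.to_numeric(df["lon"], errors="coerce")
        report["null_coords"] = int((lat.isna() | lon.isna()).sum())
        report["out_of_range"] = int(((lat.abs() > 90) | (lon.abs() > 180)).sum())
    return report


# Illustrative data only: one valid row, one impossible latitude, one null.
demo = pd.DataFrame(
    {"gers_id": ["a", "b", "c"], "lat": [52.5, 95.0, None], "lon": [13.4, 10.0, 8.0]}
)
report = check_points(demo)
print(report)
```

In the notebook, the same function could be called as `check_points(df_o)` once the sample has been loaded.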

### 1.2 OSM POIs

Extracted OSM points used for conflation. Expected: `osm_id`, `lat`, `lon`, and tag columns (e.g. `amenity`, `cuisine`, `dog_friendly`).

In [None]:
if PATHS["osm"].exists():
    df_osm = pd.read_parquet(PATHS["osm"])
    print("Shape:", df_osm.shape)
    print("Columns:", list(df_osm.columns))
    print("\nHead:")
    display(df_osm.head(5))
    if "amenity" in df_osm.columns:
        print("\nAmenity value counts (top 15):")
        print(df_osm["amenity"].value_counts(dropna=False).head(15))
else:
    print("OSM POIs not found. Run osm_extract (and ensure data/raw/osm/mini_region.parquet exists or RPG_OSM_EXTRACT_URI is set).")

## 2. Silver (conflated venues)

One row per Overture place, with lists of matched OSM IDs and amenities within the conflation radius. Validations:
- Row count equals the Overture sample count (compared in section 4).
- Required columns: `gers_id`, `lat`, `lon`, `osm_ids`, `osm_amenities`, `has_dog_friendly`.
- `osm_ids` and `osm_amenities` hold list-like values; the cell reports the match rate (share of rows with at least one OSM match) and the number of matches per row.

In [None]:
if not PATHS["silver"].exists():
    print("Silver file not found.")
else:
    silver = pd.read_parquet(PATHS["silver"])
    print("Shape:", silver.shape)
    print("Columns:", list(silver.columns))
    print("\nDtypes:")
    print(silver.dtypes)

    # Validations
    required = {"gers_id", "lat", "lon", "osm_ids", "osm_amenities", "has_dog_friendly"}
    missing = required - set(silver.columns)
    assert not missing, f"Silver missing columns: {missing}"

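    # Parquet list columns are read back as numpy arrays, so count any sized value,
    # not only list/tuple; otherwise an empty array would be counted as one match.
    def n_matches(x) -> int:
        if hasattr(x, "__len__") and not isinstance(x, str):
            return len(x)  # list, tuple, or numpy array
        return 0 if pd.isna(x) else 1  # scalar: None/NaN means no match

    n_match = silver["osm_ids"].apply(n_matches).astype(int)
    n_with_match = int((n_match > 0).sum())
    pct = 100 * n_with_match / len(silver) if len(silver) else 0.0
    print(f"\nValidation: Overture rows with ≥1 OSM match: {n_with_match}/{len(silver)} ({pct:.1f}%)")
    print(f"Matches per row: mean={n_match.mean():.2f}, max={n_match.max()}")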

    print("\nHead:")
    display(silver.head(5))
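
Equal row counts alone do not prove "one row per Overture place": a dropped place and a duplicated one cancel each other out. A minimal sketch of a stricter comparison on `gers_id` follows. `compare_ids` is a hypothetical helper, and the demo frames are made-up stand-ins for `df_o` and `silver`.

```python
import pandas as pd


def compare_ids(overture: pd.DataFrame, silver: pd.DataFrame, key: str = "gers_id") -> dict:
    """Compare key sets between the raw sample and the silver table."""
    o_ids = set(overture[key])
    s_ids = set(silver[key])
    return {
        # Rows in silver that repeat an earlier key.
        "duplicate_keys_in_silver": int(silver[key].duplicated().sum()),
        "missing_from_silver": sorted(o_ids - s_ids),
        "unexpected_in_silver": sorted(s_ids - o_ids),
    }


# Illustrative data: silver duplicates "b", drops "c", and adds an unknown "d".
overture_demo = pd.DataFrame({"gers_id": ["a", "b", "c"]})
silver_demo = pd.DataFrame({"gers_id": ["a", "b", "b", "d"]})
result = compare_ids(overture_demo, silver_demo)
print(result)
```

With the real data, all three entries would be expected to be empty or zero if conflation keeps exactly one row per place.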

## 3. Gold (venues + gold_text)

Silver plus a `gold_text` column: short descriptive sentences for each venue (name, category, city, amenities, location). Validations:
- Same row count as silver; silver columns preserved; `gold_text` present.
- Every row has a non-empty `gold_text`.
- Assess usefulness: length distribution, presence of name/category/amenities in text.

In [None]:
if not PATHS["gold"].exists():
    print("Gold file not found.")
else:
    gold = pd.read_parquet(PATHS["gold"])
    print("Shape:", gold.shape)
    print("Columns:", list(gold.columns))

    assert "gold_text" in gold.columns, "Gold must have gold_text column"
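    # fillna first: astype(str) alone would turn missing values into "None"/"nan",
    # which would then be counted as non-empty text.
    gold_text = gold["gold_text"].fillna("").astype(str)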
    empty = (gold_text.str.strip() == "").sum()
    print(f"\nValidation: Rows with non-empty gold_text: {len(gold) - int(empty)}/{len(gold)}")

    lengths = gold_text.str.len()
    print(f"gold_text length: min={lengths.min()}, max={lengths.max()}, mean={lengths.mean():.1f}")

    # Usefulness: does text look like "X is a Y in Z" or just " is a . It is located at ..."?
    has_name_like = gold_text.str.contains(r"\w+ is a ", regex=True).sum()
    has_category_like = gold_text.str.contains(r" is a \w+", regex=True).sum()
    has_amenities = gold_text.str.contains("It features ", regex=False).sum()
    print(f"\nUsefulness: rows with name-like start: {has_name_like}, category-like: {has_category_like}, 'It features' (amenities): {has_amenities}")

    print("\nSample gold_text (first 5):")
    for i, t in enumerate(gold["gold_text"].head(5)):
        print(f"  [{i}] {t}")

    print("\nGold head (table):")
    display(gold.head(5))
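
The aggregate counts above do not say which slot is empty in a given row. The sketch below parses the first sentence and lists the empty slots per text. The template it assumes (`<name> is a <category> ... .`) is inferred from the sample strings in this notebook, not from the `gold_text` builder itself, and `empty_slots` is a hypothetical helper.

```python
import re

# Assumed first-sentence shape: "<name> is a <category>[ in <city>]."
# The category group captures everything up to the first period.
FIRST_SENTENCE = re.compile(r"^(?P<name>.*?) is a (?P<category>[^.]*)\.")


def empty_slots(text: str) -> list[str]:
    """Return the names of template slots that are blank in `text`."""
    m = FIRST_SENTENCE.match(text)
    if m is None:
        return ["unparsed"]  # text does not follow the assumed template
    return [slot for slot in ("name", "category") if not m.group(slot).strip()]


# Illustrative strings only.
samples = [
    " is a . It is located at (52.5, 13.4).",
    "Cafe Luna is a cafe in Berlin. It features wifi.",
    "Cafe Luna is a . It is located at (52.5, 13.4).",
]
flags = [empty_slots(t) for t in samples]
for t, f in zip(samples, flags):
    print(f, "<-", repr(t))
```

Applied to the real column, something like `gold["gold_text"].fillna("").map(empty_slots)` would give a per-row view of what is missing.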

## 4. Cross-check and summary

- **Processing**: row counts should match across Raw (Overture) → Silver (conflation) → Gold (`gold_text`); required columns are checked in the sections above.
- **Usefulness**: If the Overture sample only has `gers_id`, `lat`, and `lon`, `gold_text` will be minimal (" is a . It is located at (lat, lon)."). Richer `gold_text` requires name/category/city in the silver layer (e.g. from the Overture source or a join).
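
To see in advance how rich `gold_text` can be, it may help to measure how well the optional text columns are populated. This is a sketch under assumptions: `text_field_coverage` is a hypothetical helper, and the column names `name`, `category`, and `city` are taken from the expected-columns note in section 1.1.

```python
import pandas as pd

OPTIONAL = ("name", "category", "city")


def text_field_coverage(df: pd.DataFrame, columns=OPTIONAL) -> dict:
    """Share of rows with a non-blank value per column; None if the column is absent."""
    coverage = {}
    for col in columns:
        if col not in df.columns:
            coverage[col] = None
            continue
        # Treat missing values and whitespace-only strings as unfilled.
        filled = df[col].fillna("").astype(str).str.strip() != ""
        coverage[col] = float(filled.mean())
    return coverage


# Illustrative data: "name" is half filled, the other columns are absent.
demo = pd.DataFrame(
    {"gers_id": ["a", "b", "c", "d"], "name": ["Cafe Luna", None, "", "Bar 12"]}
)
coverage = text_field_coverage(demo)
print(coverage)
```

A `None` entry means the column never reached that layer, which is the case described above where `gold_text` degenerates to the bare template.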

In [None]:
def row_count(p: Path) -> int | None:
    if not p.exists():
        return None
    return len(pd.read_parquet(p))

o = row_count(PATHS["overture"])
s = row_count(PATHS["silver"])
g = row_count(PATHS["gold"])
print("Row counts: Overture (raw)", o, "→ Silver", s, "→ Gold", g)
if o is not None and s is not None:
    print("  Silver == Overture:", s == o)
if s is not None and g is not None:
    print("  Gold == Silver:", g == s)
ok = o is not None and s == o and g == s
print("\nProcessing validation: OK" if ok else "\nCheck pipeline (run build_silver / build_gold if needed).")