# W03B: Building **Silver** (stable schema) + first **Gold** views

**Goal:** move from Bronze-lite (`raw_ps`) to a **Silver** layer with:
- official columns (Contract v1),
- explicit keys and grain,
- simple cleaning rules,
- dimensions with **1 row per key** (for healthy JOINs),

and then create 2 example **Gold** views.

**Connects with W02/W03A**
- W02: cardinality and JOINs (why row counts blow up).
- W03A: contract + minimal checks.
- W03B: we turn the above into “real” tables: Silver + Gold.

In [None]:
import os
os.chdir("..")
os.getcwd()

In [None]:
# Setup (cross-platform)
from pathlib import Path
import duckdb, json
from datetime import datetime, timezone

PROJECT_ROOT = Path(".").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw"
DB_PATH = PROJECT_ROOT / "data" / "exoplanets.duckdb"
DOCS_DIR = PROJECT_ROOT / "docs"
ART_DIR  = PROJECT_ROOT / "artifacts"

RAW_DIR.mkdir(parents=True, exist_ok=True)
DOCS_DIR.mkdir(parents=True, exist_ok=True)
ART_DIR.mkdir(parents=True, exist_ok=True)

raw_csv = RAW_DIR / "pscomppars.csv"
con = duckdb.connect(str(DB_PATH))

def sql_quote(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"

if not raw_csv.exists():
    raise FileNotFoundError(
        f"Could not find {raw_csv}. Make sure it is at data/raw/pscomppars.csv (W01/W02)."
    )

raw_csv_abs = raw_csv.resolve()
con.execute(f'''
CREATE OR REPLACE VIEW raw_ps AS
SELECT * FROM read_csv_auto({sql_quote(str(raw_csv_abs))})
''')

con.sql("DESCRIBE raw_ps").show()

## 1) Silver design (minimum viable)

**Contract v1 (Core, 16 columns):**
`pl_name, hostname, discoverymethod, disc_year, sy_snum, sy_pnum, sy_dist, ra, dec, pl_orbper, pl_rade, pl_bmasse, pl_eqt, st_teff, st_rad, st_mass`

**Silver rules (today):**
- `pl_name` and `hostname` are not null
- `disc_year` in [1980, 2026] when not null
- teaching-oriented ranges:
  - `pl_rade` > 0 and ≤ 30 when not null
  - `pl_bmasse` > 0 when not null

In [None]:
# YOUR TURN 1: build silver_planet (same as the instructor's version, but you choose 1 extra rule)
# Choose ONE extra rule:
# - (pl_orbper IS NULL OR pl_orbper > 0)
# - (sy_dist  IS NULL OR sy_dist  > 0)
# - (pl_eqt   IS NULL OR pl_eqt   > 0)

con.execute("DROP TABLE IF EXISTS silver_planet")

con.execute('''
-- TODO: create silver_planet by selecting the 16 Core columns and applying the rules
SELECT 1;
''')

# Validation:
# con.sql("SELECT COUNT(*) AS n_rows, COUNT(DISTINCT pl_name) AS n_pl FROM silver_planet").show()

## 2) Dimensions (1 row per key)

A “measured” strategy (no window functions):
- `GROUP BY hostname`
- for each column: `MAX(col)` to consolidate one row per hostname (`MAX` ignores NULLs, so the values in one output row may come from different source rows)

In [None]:
# YOUR TURN 2: create dim_host_full (1 row per hostname)
# It must contain: hostname + 3 columns of your choice from:
# sy_dist, ra, dec, st_teff, st_rad, st_mass

con.execute("DROP TABLE IF EXISTS dim_host_full")
con.execute('''
-- TODO: CREATE TABLE dim_host_full AS SELECT hostname, MAX(...) ... GROUP BY hostname
SELECT 1;
''')

# Validation:
# con.sql("SELECT COUNT(*) AS n_rows, COUNT(DISTINCT hostname) AS n_keys FROM dim_host_full").show()

## 3) Fact table (grain: 1 row ≈ 1 planet)

We create `fact_planet` from Silver using `SELECT DISTINCT`.
`DISTINCT` only removes rows that are identical in every column, so if duplicates by `pl_name` remain, document them as a data-quality issue.

In [None]:
# YOUR TURN 3: create fact_planet from silver_planet (use DISTINCT)
con.execute("DROP TABLE IF EXISTS fact_planet")
con.execute('''
-- TODO: CREATE TABLE fact_planet AS SELECT DISTINCT ...
SELECT 1;
''')

# Validation:
# con.sql("SELECT COUNT(*) AS n_rows, COUNT(DISTINCT pl_name) AS n_pl FROM fact_planet").show()

## 4) Healthy JOINs + 2 Gold views
- Healthy JOIN: `COUNT(join)` ≈ `COUNT(fact)` (the JOIN must not inflate the row count).
- Gold: `gold_by_method` and `gold_by_host`.

In [None]:
# YOUR TURN 4: create ONE Gold view (choose A or B)

# A) gold_by_method
# con.execute("DROP VIEW IF EXISTS gold_by_method")
# con.execute('''
# -- TODO: CREATE VIEW gold_by_method AS ...
# ''')
# con.sql("SELECT * FROM gold_by_method LIMIT 10").show()

# B) gold_by_host
# con.execute("DROP VIEW IF EXISTS gold_by_host")
# con.execute('''
# -- TODO: CREATE VIEW gold_by_host AS ...
# ''')
# con.sql("SELECT * FROM gold_by_host LIMIT 10").show()

## 5) Contract Silver v1 + traceability

We write `docs/data_contract_silver_v1.json`, which describes the Silver/Gold tables.
If you change columns or rules, increment the version.
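
One way to make the version bump explicit is a tiny helper that increments the patch number whenever a rule is added. This is only a sketch: `bump_patch` is a hypothetical helper, the contract dictionary is a reduced stand-in, and the extra rule string is just an example.

```python
import json


def bump_patch(version: str) -> str:
    """Increment the last component of a 'major.minor.patch' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


# Reduced stand-in for the contract (hypothetical subset of its fields).
contract = {
    "version": "1.0.0",
    "silver": {"quality_minimum": ["pl_name NOT NULL", "hostname NOT NULL"]},
}

# Deep-copy through JSON so the original contract stays untouched.
updated = json.loads(json.dumps(contract))
updated["silver"]["quality_minimum"].append("pl_orbper > 0 if not null")
updated["version"] = bump_patch(contract["version"])

print(updated["version"])  # 1.0.1
```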

In [None]:
contract_silver_v1 = {
  "dataset": "nasa_exoplanets_pscomppars",
  "version": "1.0.0",
  "bronze": {"table": "raw_ps", "note": "Bronze-lite (Core 16 columns)"},
  "silver": {
    "tables": ["silver_planet", "dim_host_full", "fact_planet"],
    "grain_fact": "1 row ≈ 1 planet (pl_name)",
    "keys": {"fact_planet": ["pl_name"], "dim_host_full": ["hostname"]},
    "core_columns": [
      "pl_name","hostname","discoverymethod","disc_year",
      "sy_snum","sy_pnum","sy_dist","ra","dec",
      "pl_orbper","pl_rade","pl_bmasse","pl_eqt",
      "st_teff","st_rad","st_mass"
    ],
    "quality_minimum": [
      "pl_name NOT NULL",
      "hostname NOT NULL",
      "disc_year in [1980,2026] if not null",
      "pl_rade in (0,30] if not null",
      "pl_bmasse > 0 if not null",
      "dim_host_full: hostname unique"
    ]
  },
  "gold": {"views": ["gold_by_method", "gold_by_host"]},
  "notes": "If you change columns/rules, increment the version."
}

(DOCS_DIR / "data_contract_silver_v1.json").write_text(
    json.dumps(contract_silver_v1, indent=2), encoding="utf-8"
)
print("Saved docs/data_contract_silver_v1.json")

## Deliverables (W03B)

### In class (minimum evidence)
1) `docs/w03b_silver_report.md` with outputs:
   - `DESCRIBE silver_planet` + counts (rows / distinct `pl_name` / `hostname`)
   - `dim_host_full`: `n_rows` vs `n_keys`
   - `n_fact` vs `n_join` (healthy JOIN)
2) `docs/decisions_log.md`: 1 decision:
   - “Silver rules applied + evidence”.

### Homework
1) Complete YOUR TURN 4 (one Gold view) if you have not done it yet.
2) Export 1 Gold view to CSV in `artifacts/` (`COPY ... TO`).
3) (Optional) If you added an extra rule, document it as version `v1.0.1`.

## Reflection (log)
- What minimum evidence convinces you that your JOIN is healthy?
- What is the trade-off between “cleaning a lot” and “not losing data”?