# W05B: Final Gold + Export to `artifacts/` + H2 Wrap-up (DuckDB)

**Goal:** close out H2 by leaving the project in a deliverable state:

✅ Stable model:
- `dim_host_sk(host_id PK, hostname UNIQUE)`
- `fact_planet_sk(pl_name PK, host_id FK)`

✅ 2 Gold outputs:
- `gold_by_discoverymethod`
- `gold_by_host`

✅ Evidence:
- CSV exports in `artifacts/`
- updated data contract (`docs/data_contract.md`)
- decisions log (`docs/decisions_log.md`)

## Bibliography (W05B)

### DDIA (Kleppmann)
- **Ch. 2: Data Models and Query Languages**
  - How the data model determines which questions are easy or hard to ask (relationships, joins, aggregations).
- **Ch. 3: Storage and Retrieval**
  - OLAP context: aggregations and scans, which is why Gold exists as a “data product”.

### Supplementary (highly recommended)
- **Kimball, The Data Warehouse Toolkit**
  - Star schema, fact/dim tables, and why “business-ready outputs” are materialized.

In [None]:
from pathlib import Path
import duckdb

PROJECT_ROOT = Path(".").resolve()
RAW_DIR = PROJECT_ROOT / "data" / "raw"
DB_PATH = PROJECT_ROOT / "data" / "exoplanets.duckdb"
ART_DIR = PROJECT_ROOT / "artifacts"
DOCS_DIR = PROJECT_ROOT / "docs"

RAW_DIR.mkdir(parents=True, exist_ok=True)
ART_DIR.mkdir(parents=True, exist_ok=True)
DOCS_DIR.mkdir(parents=True, exist_ok=True)

con = duckdb.connect(str(DB_PATH))

raw_csv = RAW_DIR / "pscomppars.csv"
if not raw_csv.exists():
    raise FileNotFoundError(f"Cannot find {raw_csv}. You need the CSV from W01/W02.")

def sql_quote(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"

con.execute(f'''
CREATE OR REPLACE VIEW raw_ps AS
SELECT * FROM read_csv_auto({sql_quote(str(raw_csv.resolve()))})
''')

# 1) Minimal Silver layer (self-contained)
con.execute("DROP TABLE IF EXISTS silver_planet")
con.execute('''
CREATE TABLE silver_planet AS
SELECT
  pl_name,
  hostname,
  discoverymethod,
  disc_year,
  sy_snum,
  sy_pnum,
  sy_dist,
  ra,
  dec,
  pl_orbper,
  pl_rade,
  pl_bmasse,
  pl_eqt,
  st_teff,
  st_rad,
  st_mass
FROM raw_ps
WHERE pl_name IS NOT NULL
  AND hostname IS NOT NULL
  AND (disc_year IS NULL OR (disc_year BETWEEN 1980 AND 2026))
  AND (pl_rade  IS NULL OR (pl_rade  > 0 AND pl_rade  <= 30))
  AND (pl_bmasse IS NULL OR (pl_bmasse > 0))
''')

# 2) Base dim (1 row per hostname)
# MAX() collapses repeated host attributes to a single value per hostname
con.execute("DROP TABLE IF EXISTS dim_host_full")
con.execute('''
CREATE TABLE dim_host_full AS
SELECT
  hostname,
  MAX(sy_dist)  AS sy_dist,
  MAX(ra)       AS ra,
  MAX(dec)      AS dec,
  MAX(st_teff)  AS st_teff,
  MAX(st_rad)   AS st_rad,
  MAX(st_mass)  AS st_mass
FROM silver_planet
GROUP BY hostname
''')

# 3) Base fact (per pl_name)
con.execute("DROP TABLE IF EXISTS fact_planet")
con.execute('''
CREATE TABLE fact_planet AS
SELECT DISTINCT
  pl_name,
  hostname,
  discoverymethod,
  disc_year,
  pl_orbper,
  pl_rade,
  pl_bmasse,
  pl_eqt
FROM silver_planet
''')

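# 4) dim_host_sk + fact_planet_sk (keyed model)
# Drop the fact table first: on a re-run, DuckDB will not drop dim_host_sk
# while fact_planet_sk still references it through the foreign key.
con.execute("DROP TABLE IF EXISTS fact_planet_sk")
con.execute("DROP TABLE IF EXISTS dim_host_sk")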
con.execute('''
CREATE TABLE dim_host_sk (
  host_id INTEGER PRIMARY KEY,
  hostname VARCHAR NOT NULL UNIQUE,
  sy_dist DOUBLE,
  ra DOUBLE,
  dec DOUBLE,
  st_teff DOUBLE,
  st_rad DOUBLE,
  st_mass DOUBLE
)
''')

con.execute('''
INSERT INTO dim_host_sk
SELECT
  ROW_NUMBER() OVER (ORDER BY hostname) AS host_id,
  hostname,
  sy_dist, ra, dec, st_teff, st_rad, st_mass
FROM dim_host_full
''')

con.execute("DROP TABLE IF EXISTS fact_planet_sk")
con.execute('''
CREATE TABLE fact_planet_sk (
  pl_name VARCHAR PRIMARY KEY,
  host_id INTEGER NOT NULL REFERENCES dim_host_sk(host_id),
  discoverymethod VARCHAR,
  disc_year INTEGER,
  pl_orbper DOUBLE,
  pl_rade DOUBLE,
  pl_bmasse DOUBLE,
  pl_eqt DOUBLE
)
''')

con.execute('''
INSERT INTO fact_planet_sk
SELECT
  f.pl_name,
  d.host_id,
  f.discoverymethod,
  f.disc_year,
  f.pl_orbper,
  f.pl_rade,
  f.pl_bmasse,
  f.pl_eqt
FROM fact_planet f
JOIN dim_host_sk d
  ON f.hostname = d.hostname
''')

# Basic checks (must pass): n_rows must equal n_keys, and orphan_rows must be 0
con.sql("SELECT COUNT(*) AS n_rows, COUNT(DISTINCT hostname) AS n_keys FROM dim_host_sk").show()

con.sql('''
SELECT COUNT(*) AS orphan_rows
FROM fact_planet_sk f
LEFT JOIN dim_host_sk d
  ON f.host_id = d.host_id
WHERE d.host_id IS NULL
''').show()

con.sql("SELECT COUNT(*) AS n_fact_sk FROM fact_planet_sk").show()

## Your turn: build 2 Gold outputs

In [None]:
# TODO 1 (Gold 1): create the gold_by_discoverymethod view (same as in class)
# Requirements:
# - COUNT(*) per discoverymethod, aliased as n_planets
# - AVG(pl_rade) and/or AVG(pl_bmasse)
# - MIN(disc_year) / MAX(disc_year)
# - ORDER BY n_planets DESC

con.execute("DROP VIEW IF EXISTS gold_by_discoverymethod")

# TODO: CREATE VIEW ...
# con.execute(""" ... """)

con.sql("SELECT * FROM gold_by_discoverymethod LIMIT 10").show()

In [None]:
# TODO 2 (Gold 2): create the gold_by_host view
# Requirements:
# - JOIN fact_planet_sk with dim_host_sk on host_id
# - GROUP BY hostname
# - COUNT(*) as n_planets
# - AVG(pl_rade) and/or AVG(pl_bmasse)
# - ORDER BY n_planets DESC

con.execute("DROP VIEW IF EXISTS gold_by_host")

# TODO: CREATE VIEW ...
# con.execute(""" ... """)

con.sql("SELECT * FROM gold_by_host LIMIT 10").show()

## Your turn: export to artifacts/

In [None]:
# TODO 3: export the 2 Gold views to CSV in artifacts/

out1 = ART_DIR / "gold_by_discoverymethod.csv"
out2 = ART_DIR / "gold_by_host.csv"

# TODO: COPY ...
# con.execute(...)

print("Should exist:", out1, "and", out2)

## H2 wrap-up: the minimum that must be “done” today

### 1) Evidence in `artifacts/`
- `artifacts/gold_by_discoverymethod.csv`
- `artifacts/gold_by_host.csv`
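
One way to confirm this evidence is in place is to check that each export exists, is non-empty, and starts with a header row. The sketch below uses only the standard library. `check_csv_artifact` is a hypothetical helper name (not part of the course code), and the demo writes a stand-in file with made-up rows to a temporary directory rather than touching the real `artifacts/` folder.

```python
import csv
import tempfile
from pathlib import Path


def check_csv_artifact(path: Path) -> dict:
    """Return the header columns and the number of data rows of an exported CSV.

    Raises FileNotFoundError if the file is missing and ValueError if it is empty.
    """
    if not path.is_file():
        raise FileNotFoundError(f"Missing artifact: {path}")
    with path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    if not rows:
        raise ValueError(f"Empty artifact: {path}")
    # First row is treated as the header; the rest are data rows.
    return {"columns": rows[0], "n_rows": len(rows) - 1}


# Demo on a stand-in file (made-up values, not real export output).
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "gold_by_host.csv"
    demo_csv.write_text("hostname,n_planets\nhost_a,3\nhost_b,1\n", encoding="utf-8")
    summary = check_csv_artifact(demo_csv)
    print(summary)
```

In the notebook, the same helper could be called as `check_csv_artifact(out1)` and `check_csv_artifact(out2)` once the export cell has run, assuming the export writes a header row.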

### 2) Updated data contract (`docs/data_contract.md`)
State explicitly:
- Datasets: `silver_planet`, `dim_host_sk`, `fact_planet_sk`, `gold_by_*`
- Grain:
  - `dim_host_sk`: 1 row per `hostname`
  - `fact_planet_sk`: 1 row per `pl_name`
- Keys:
  - dim PK: `host_id`, UNIQUE: `hostname`
  - fact PK: `pl_name`
  - FK: `fact_planet_sk.host_id → dim_host_sk.host_id`
- Minimum checks (with evidence):
  - dim uniqueness: `COUNT(*) == COUNT(DISTINCT hostname)`
  - orphans: `orphan_rows == 0`
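
The two minimum checks above can be wrapped in a small function whose output is easy to paste into the contract as evidence. This is a minimal sketch, not the course's reference solution: `run_min_checks` is a hypothetical name, and it only assumes a connection whose `execute(sql)` result supports `fetchone()`, which the notebook's DuckDB `con` provides. The demo deliberately uses the standard-library `sqlite3` with made-up rows as a stand-in, so that it runs without the project database.

```python
import sqlite3


def run_min_checks(con) -> dict:
    """Run the dim-uniqueness and orphan checks and return their raw numbers.

    `con` is any connection whose execute(sql) result supports fetchone().
    """
    n_rows, n_keys = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT hostname) FROM dim_host_sk"
    ).fetchone()
    (orphan_rows,) = con.execute(
        """
        SELECT COUNT(*)
        FROM fact_planet_sk f
        LEFT JOIN dim_host_sk d ON f.host_id = d.host_id
        WHERE d.host_id IS NULL
        """
    ).fetchone()
    return {
        "n_rows": n_rows,
        "n_keys": n_keys,
        "orphan_rows": orphan_rows,
        "dim_unique": n_rows == n_keys,
        "no_orphans": orphan_rows == 0,
    }


# Demo on a tiny stand-in database (sqlite3, made-up rows), NOT the project's DuckDB file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE dim_host_sk (host_id INTEGER PRIMARY KEY, hostname TEXT UNIQUE)")
demo.execute("CREATE TABLE fact_planet_sk (pl_name TEXT PRIMARY KEY, host_id INTEGER)")
demo.executemany("INSERT INTO dim_host_sk VALUES (?, ?)", [(1, "host_a"), (2, "host_b")])
demo.executemany(
    "INSERT INTO fact_planet_sk VALUES (?, ?)",
    [("host_a b", 1), ("host_a c", 1), ("host_b b", 2)],
)
result = run_min_checks(demo)
print(result)
assert result["dim_unique"] and result["no_orphans"]
```

Against the real database the call would simply be `run_min_checks(con)`; recording the returned numbers next to each check is one way to provide the "with evidence" part.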

### 3) Decisions Log (`docs/decisions_log.md`)
By the end of W05 it must contain at least 2 important decisions:
- “Surrogate key + FK”
- “Gold outputs and why those metrics”

## Deliverables (W05B): H2 wrap-up

### In class
1) Show on screen (or paste into `docs/w05b_gold_report.md`):
   - Top 10 of `gold_by_discoverymethod`
   - Top 10 of `gold_by_host`
2) Verify that the exported CSVs exist in `artifacts/`
3) `docs/decisions_log.md` with **at least 2 decisions** (with evidence)
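
If you choose to paste the Top 10 tables into the report, a small formatter can turn query results into a Markdown table. The sketch below is one possible approach, not a required step: `to_markdown_table` is a hypothetical helper, and the demo rows are made up rather than taken from real query output.

```python
def to_markdown_table(columns, rows) -> str:
    """Render column names and row tuples as a Markdown pipe table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        # NULLs (None) are rendered as empty cells.
        "| " + " | ".join("" if v is None else str(v) for v in row) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Demo with made-up rows (not real query output).
table = to_markdown_table(["hostname", "n_planets"], [("host_a", 3), ("host_b", None)])
print(table)
```

In the notebook, the inputs could come from a DuckDB relation, for example `rel = con.sql("SELECT * FROM gold_by_host LIMIT 10")` followed by `to_markdown_table(rel.columns, rel.fetchall())`, assuming the view already exists.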

### Homework (final H2)
1) `docs/data_contract.md` complete and consistent with what was built
2) `docs/w05b_gold_report.md` (1–2 pages max) with:
   - what each Gold output means
   - a simple scientific interpretation (2–3 lines)