# Bronze: raw TMDB ingestion (JSON)

This notebook downloads raw data from the TMDB API and stores it as JSON files in the **Bronze** (raw) layer, inside Unity Catalog Volumes.

**Data:**
- Genre list (`/genre/movie/list`)
- Movie discovery by genre (`/discover/movie`), fetched page by page

**Output:**
- `/Volumes/tmdb/default/bronze_tmdb/genres/genres_YYYY-MM-DD.json`
- `/Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_<gid>_page_<n>_YYYY-MM-DD.json`
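
The naming scheme above can be sketched as two small helpers. The names `genres_filename` and `discover_filename` are hypothetical and do not appear in the notebook, which builds the same strings inline with f-strings; this is only meant to make the pattern explicit.

```python
from datetime import date

# Same Volume root that the notebook uses for the Bronze layer.
BASE = "/Volumes/tmdb/default/bronze_tmdb"


def genres_filename(run_date: str) -> str:
    """Hypothetical helper: file name for the genre list of one run."""
    return f"genres_{run_date}.json"


def discover_filename(genre_id: int, page: int, run_date: str) -> str:
    """Hypothetical helper: file name for one page of one genre."""
    return f"discover_genre_{genre_id}_page_{page}_{run_date}.json"


# Example with a fixed date so the result is reproducible.
run_date = date(2026, 2, 5).isoformat()
print(f"{BASE}/genres/{genres_filename(run_date)}")
print(f"{BASE}/discover/{discover_filename(28, 1, run_date)}")
```

Because the run date is part of every file name, a rerun on a later day writes new files instead of overwriting earlier ones, while a rerun on the same day replaces that day's files.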


In [0]:
from datetime import datetime
import json
import requests
import os

# --- TMDB config ---
TMDB_API_KEY = os.environ["TMDB_API_KEY"]  # Read from the environment (or a secret store); never hard-code the key
TMDB_BASE_URL = "https://api.themoviedb.org/3"

# --- Bronze paths (Unity Catalog Volume) ---
BASE = "/Volumes/tmdb/default/bronze_tmdb"
BRONZE_GENRES = f"{BASE}/genres"
BRONZE_DISCOVER = f"{BASE}/discover"

# Run date (used to version the output files)
RUN_DATE = datetime.now().strftime("%Y-%m-%d")

# Extraction parameters
PAGES_PER_GENRE = 3

BASE, BRONZE_GENRES, BRONZE_DISCOVER, RUN_DATE, PAGES_PER_GENRE



('/Volumes/tmdb/default/bronze_tmdb',
 '/Volumes/tmdb/default/bronze_tmdb/genres',
 '/Volumes/tmdb/default/bronze_tmdb/discover',
 '2026-02-05',
 3)

In [0]:
os.makedirs(BRONZE_GENRES, exist_ok=True)
os.makedirs(BRONZE_DISCOVER, exist_ok=True)

if not os.path.isdir(BRONZE_GENRES):
    raise RuntimeError("Could not create the BRONZE_GENRES folder")

if not os.path.isdir(BRONZE_DISCOVER):
    raise RuntimeError("Could not create the BRONZE_DISCOVER folder")

print("Folders OK:")
print(" -", BRONZE_GENRES)
print(" -", BRONZE_DISCOVER)


Folders OK:
 - /Volumes/tmdb/default/bronze_tmdb/genres
 - /Volumes/tmdb/default/bronze_tmdb/discover


In [0]:
def tmdb_get(endpoint: str, params: dict | None = None) -> dict:
    """
    Call the TMDB API and return the parsed JSON response as a dict.
    """
    # Copy the parameters so the caller's dict is not mutated with the API key.
    query = dict(params or {})
    query["api_key"] = TMDB_API_KEY
    url = f"{TMDB_BASE_URL}{endpoint}"

    response = requests.get(url, params=query, timeout=30)
    # Raise on 4xx/5xx responses instead of saving an error body as data.
    response.raise_for_status()

    payload = response.json()
    if not payload:
        raise ValueError(f"Empty response from TMDB for endpoint {endpoint}")

    return payload


def save_json(obj: dict, folder: str, filename: str) -> str:
    """
    Save a dict as JSON in the given folder and return the final path.
    """
    os.makedirs(folder, exist_ok=True)
    path = f"{folder}/{filename}"

    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)

    return path
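
`tmdb_get` fails on the first HTTP error, so a single transient failure stops the whole run. One possible way to soften that is a small retry wrapper with exponential backoff. The sketch below is not part of the original notebook: `call_with_retries` and its parameters are hypothetical names, and the delays are arbitrary defaults rather than values recommended by TMDB.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: tuple = (Exception,),
) -> T:
    """Hypothetical helper: call fn(), retrying with exponential backoff.

    The wait before retry k is base_delay * 2**(k - 1) seconds.
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            sleep(base_delay * 2 ** (attempt - 1))

    raise AssertionError("unreachable")  # Keeps type checkers satisfied.
```

In the notebook this could wrap a request as `call_with_retries(lambda: tmdb_get("/genre/movie/list"), retry_on=(requests.RequestException,))`, so that only network and HTTP errors are retried while programming errors still fail immediately.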


In [0]:
genres_payload = tmdb_get("/genre/movie/list", params={"language": "en-US"})

genres_path = save_json(
    obj=genres_payload,
    folder=BRONZE_GENRES,
    filename=f"genres_{RUN_DATE}.json"
)

print("✅ Genres saved to:", genres_path)
print("Total genres:", len(genres_payload.get("genres", [])))


✅ Genres saved to: /Volumes/tmdb/default/bronze_tmdb/genres/genres_2026-02-05.json
Total genres: 19


In [0]:
genres = genres_payload.get("genres", [])

for g in genres:
    gid = g["id"]
    gname = g.get("name", "unknown")

    for page in range(1, PAGES_PER_GENRE + 1):
        discover_payload = tmdb_get(
            "/discover/movie",
            params={
                "with_genres": gid,
                "sort_by": "vote_count.desc",
                "page": page,
                "language": "en-US"
            }
        )

        out_path = save_json(
            obj=discover_payload,
            folder=BRONZE_DISCOVER,
            filename=f"discover_genre_{gid}_page_{page}_{RUN_DATE}.json"
        )

        print(f"✅ [{gname} | {gid}] page={page} -> {out_path}")


✅ [Action | 28] page=1 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_28_page_1_2026-02-05.json
✅ [Action | 28] page=2 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_28_page_2_2026-02-05.json
✅ [Action | 28] page=3 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_28_page_3_2026-02-05.json
✅ [Adventure | 12] page=1 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_12_page_1_2026-02-05.json
✅ [Adventure | 12] page=2 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_12_page_2_2026-02-05.json
✅ [Adventure | 12] page=3 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_12_page_3_2026-02-05.json
✅ [Animation | 16] page=1 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_16_page_1_2026-02-05.json
✅ [Animation | 16] page=2 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_16_page_2_2026-02-05.json
✅ [Animation | 16] page=3 -> /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_16_page_3_2026-02-

In [0]:
print("📌 Expected outputs:")
print("Genres:", f"{BRONZE_GENRES}/genres_{RUN_DATE}.json")
print("Discover:", f"{BRONZE_DISCOVER}/discover_genre_<gid>_page_<n>_{RUN_DATE}.json")


📌 Expected outputs:
Genres: /Volumes/tmdb/default/bronze_tmdb/genres/genres_2026-02-05.json
Discover: /Volumes/tmdb/default/bronze_tmdb/discover/discover_genre_<gid>_page_<n>_2026-02-05.json
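
The cell above only prints the expected path patterns. A quick way to confirm that the files were actually written is to count them and the movies they contain. The sketch below is an addition, not part of the original notebook: `summarize_discover_files` is a hypothetical name, and it assumes each discover payload keeps its movies under the `results` key, as TMDB's `/discover/movie` responses do.

```python
import glob
import json
import os


def summarize_discover_files(folder: str, run_date: str) -> dict:
    """Hypothetical check: count discover files and movies for one run date."""
    pattern = os.path.join(folder, f"discover_genre_*_page_*_{run_date}.json")
    paths = sorted(glob.glob(pattern))

    total_movies = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        # A payload without "results" contributes zero movies.
        total_movies += len(payload.get("results", []))

    return {"files": len(paths), "movies": total_movies}
```

Called as `summarize_discover_files(BRONZE_DISCOVER, RUN_DATE)`, the `files` count should equal the number of genres times `PAGES_PER_GENRE`, which is 19 × 3 = 57 for the run shown here. A lower number suggests the loop stopped early.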
