# **2nd Midterm - Technical MVP**

# Course: Minería de Datos II (Data Mining II)

### Student: Emilio Gomez Lencina



---


Pipelines

**(1) Batch**

Landing ----batch----> Bronze (CSV) ---> Silver ---> Gold ---> Serving (Cassandra)

**(2) Stream**

Landing ----speed----> Bronze (JSONL) ---> Silver ---> Gold (stream) ---> Serving (Cassandra)

**(3) Cassandra queries**

---

Implementation summary

Each cell (except for the demos) corresponds to a .py file that will live in the repo at the time of the final submission.

For that reason, patches like this one were used:


```
try:
    from cassandra_utils import get_cassandra_session, KEYSPACE
except ModuleNotFoundError:
    pass
```
The .py file will contain only the import from cassandra_utils. Since that module is not needed in Colab, I added this small temporary patch solely for the purposes of this MVP submission.
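The same try/except idea can be made a little more explicit by falling back to the `__main__` namespace, which is where a notebook keeps the names defined by earlier cells. The sketch below is only an illustration: `resolve_names` is a hypothetical helper that is not part of the project, and the `KEYSPACE` value is a made-up placeholder.

```python
import importlib
import sys
from types import ModuleType
from typing import Any, Dict, List, Optional


def resolve_names(
    module_name: str,
    names: List[str],
    fallback: Optional[ModuleType] = None,
) -> Dict[str, Any]:
    """Return `names` from `module_name`, or from a fallback namespace if the module is missing."""
    try:
        source = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # In a notebook, earlier cells define these names in __main__.
        source = fallback if fallback is not None else sys.modules["__main__"]

    missing = [name for name in names if not hasattr(source, name)]
    if missing:
        raise RuntimeError(f"{module_name}: run the defining cell first, missing {missing}")
    return {name: getattr(source, name) for name in names}


# Stand-in for the notebook's __main__ namespace (hypothetical value).
notebook_ns = ModuleType("fake_main")
notebook_ns.KEYSPACE = "demo_keyspace"

resolved = resolve_names("module_that_does_not_exist_xyz", ["KEYSPACE"], fallback=notebook_ns)
print(resolved["KEYSPACE"])
```

Compared with a bare `pass`, this fails early with a clear message when the defining cell has not been run yet.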


This notebook contains an end-to-end pipeline that includes:

* Left anti joins for quarantine handling (avoiding duplicate error records) and safe write strategies.
* Active business rules (negative costs, referential integrity) with automatic routing to the Quarantine zone.
* Spark optimizations: `spark.sql.shuffle.partitions` configured for a local/Colab environment, and broadcast joins.
* Streaming with watermarking, dedupe, and checkpointing for fault tolerance.
* Serving layer: connection to AstraDB (Cassandra) with Query-First modeling.


Full work here -----> [GitHub repository (code + README + diagrams)](https://github.com/Sinnick4r/Cloud_Provider_Analytics_MVP)



In [69]:
# config.py

from __future__ import annotations
import zipfile
from pathlib import Path
from typing import Final

from pyspark.sql import SparkSession



# Google Drive integration I was using; set to True or False

USE_GOOGLE_DRIVE: Final[bool] = False
GOOGLE_DRIVE_PROJECT_SUBDIR: Final[str] = "Mineria de datos II/Proyecto Cloud Provider Analysis"



def get_project_root() -> Path:
    """
    Return the project root.

    - If USE_GOOGLE_DRIVE is True, mount Google Drive in Colab and use the configured folder.
    - If it is False, use the current directory (unpacked repo).
    """
    if USE_GOOGLE_DRIVE:
        try:
            from google.colab import drive as gdrive
        except ImportError as exc:
            raise RuntimeError(
                "USE_GOOGLE_DRIVE=True but we are not running in Colab."
            ) from exc

        gdrive.mount("/content/drive")
        return (Path("/content/drive/MyDrive") / GOOGLE_DRIVE_PROJECT_SUBDIR).resolve()

    return Path(".").resolve()

# Directories for the whole project

PROJECT_ROOT: Final[Path] = get_project_root()
DATA_DIR: Final[Path] = PROJECT_ROOT / "data"
DATALAKE_ROOT: Final[Path] = PROJECT_ROOT / "datalake"

LANDING_PATH: Final[Path] = DATALAKE_ROOT / "landing"
BRONZE_PATH: Final[Path] = DATALAKE_ROOT / "bronze"
SILVER_PATH: Final[Path] = DATALAKE_ROOT / "silver"
GOLD_PATH: Final[Path] = DATALAKE_ROOT / "gold"
QUARANTINE_PATH: Final[Path] = DATALAKE_ROOT / "quarantine"

RAW_ZIP_NAME: Final[str] = "cloud_provider_challenge_dataset_v1.zip"


# Spark session

def create_spark(app_name: str = "CloudProviderAnalytics_Pipeline") -> SparkSession:
    spark = (
        SparkSession.builder
        .appName(app_name)
        .master("local[*]")
        .config("spark.sql.shuffle.partitions", "4")
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("WARN")
    return spark


# File/directory utilities

def ensure_dirs() -> None:
    # Create the datalake structure if it does not exist.
    for path in (LANDING_PATH, BRONZE_PATH, SILVER_PATH, GOLD_PATH, QUARANTINE_PATH):
        path.mkdir(parents=True, exist_ok=True)


def unpack_raw_dataset() -> None:
    # Unpack the raw dataset ZIP into the project root.
    zip_path = DATA_DIR / RAW_ZIP_NAME

    if not zip_path.exists():
        print(f"[WARN] No se encontró el ZIP de datos en {zip_path}. Saltando unpack.")
        return
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(PROJECT_ROOT)

    # Report success only after an actual extraction.
    print(f"[OK] Dataset descomprimido en {PROJECT_ROOT / 'datalake' / 'landing'}")

[OK] Dataset descomprimido en /content/datalake/landing


In [70]:
# Unpack the archive and run a sanity check

print(f"[INFO] PROJECT_ROOT: {PROJECT_ROOT}")
ensure_dirs()
unpack_raw_dataset()
spark = create_spark()
print(f"[INFO] Spark version: {spark.version}")

[INFO] PROJECT_ROOT: /content
[INFO] Spark version: 3.5.1
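
To make the directory layout concrete without touching the real project, here is a small sketch that builds the same five zones under a temporary directory. `build_datalake` and `ZONES` are illustrative names that do not exist in `config.py`. The point is only that `mkdir(parents=True, exist_ok=True)` makes the setup safe to re-run.

```python
import tempfile
from pathlib import Path
from typing import Dict

# Same zone names as the datalake paths defined in config.py.
ZONES = ("landing", "bronze", "silver", "gold", "quarantine")


def build_datalake(root: Path) -> Dict[str, Path]:
    """Create <root>/datalake/<zone> for every zone and return the paths."""
    paths = {zone: root / "datalake" / zone for zone in ZONES}
    for path in paths.values():
        # parents=True creates 'datalake' on the first call; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    zones = build_datalake(Path(tmp))
    build_datalake(Path(tmp))  # second call is a no-op instead of an error
    created = sorted(p.name for p in (Path(tmp) / "datalake").iterdir())

print(created)
```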


In [71]:
# schemas.py


from __future__ import annotations

from typing import Final
from pyspark.sql import types as T


# CSV and JSON schemas, written by hand after inspecting the files

customers_orgs_schema: Final[T.StructType] = T.StructType([
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("org_name", T.StringType(), nullable=True),
    T.StructField("industry", T.StringType(), nullable=True),
    T.StructField("hq_region", T.StringType(), nullable=True),
    T.StructField("plan_tier", T.StringType(), nullable=True),
    T.StructField("is_enterprise", T.BooleanType(), nullable=True),
    T.StructField("signup_date", T.DateType(), nullable=True),
    T.StructField("sales_rep", T.StringType(), nullable=True),
    T.StructField("lifecycle_stage", T.StringType(), nullable=True),
    T.StructField("marketing_source", T.StringType(), nullable=True),
    T.StructField("nps_score", T.DoubleType(), nullable=True),
])

users_schema: Final[T.StructType] = T.StructType([
    T.StructField("user_id", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("email", T.StringType(), nullable=True),
    T.StructField("role", T.StringType(), nullable=True),
    T.StructField("active", T.BooleanType(), nullable=True),
    T.StructField("created_at", T.DateType(), nullable=True),
    T.StructField("last_login", T.DateType(), nullable=True),
])

resources_schema: Final[T.StructType] = T.StructType([
    T.StructField("resource_id", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("service", T.StringType(), nullable=True),
    T.StructField("region", T.StringType(), nullable=True),
    T.StructField("created_at", T.DateType(), nullable=True),
    T.StructField("state", T.StringType(), nullable=True),
    T.StructField("tags_json", T.StringType(), nullable=True),
])

support_tickets_schema: Final[T.StructType] = T.StructType([
    T.StructField("ticket_id", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("category", T.StringType(), nullable=True),
    T.StructField("severity", T.StringType(), nullable=True),
    T.StructField("created_at", T.DateType(), nullable=True),
    T.StructField("resolved_at", T.DateType(), nullable=True),
    T.StructField("csat", T.DoubleType(), nullable=True),
    T.StructField("sla_breached", T.BooleanType(), nullable=True),
])

marketing_touches_schema: Final[T.StructType] = T.StructType([
    T.StructField("touch_id", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("campaign", T.StringType(), nullable=True),
    T.StructField("channel", T.StringType(), nullable=True),
    T.StructField("timestamp", T.DateType(), nullable=True),
    T.StructField("clicked", T.BooleanType(), nullable=True),
    T.StructField("converted", T.BooleanType(), nullable=True),
])

nps_surveys_schema: Final[T.StructType] = T.StructType([
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("survey_date", T.DateType(), nullable=True),
    T.StructField("nps_score", T.DoubleType(), nullable=True),
    T.StructField("comment", T.StringType(), nullable=True),
])

billing_monthly_schema: Final[T.StructType] = T.StructType([
    T.StructField("invoice_id", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("month", T.DateType(), nullable=True),
    T.StructField("subtotal", T.DecimalType(10,4), nullable=True),
    T.StructField("credits", T.DecimalType(10,4), nullable=True),
    T.StructField("taxes", T.DecimalType(10,4), nullable=True),
    T.StructField("currency", T.StringType(), nullable=True),
    T.StructField("exchange_rate_to_usd", T.DoubleType(), nullable=True),
])


usage_events_schema: Final[T.StructType] = T.StructType([
    T.StructField("event_id", T.StringType(), nullable=False),
    T.StructField("timestamp", T.StringType(), nullable=False),
    T.StructField("org_id", T.StringType(), nullable=False),
    T.StructField("resource_id", T.StringType(), nullable=False),
    T.StructField("service", T.StringType(), nullable=True),
    T.StructField("region", T.StringType(), nullable=True),
    T.StructField("metric", T.StringType(), nullable=True),
    T.StructField("value", T.DecimalType(10,4), nullable=True),
    T.StructField("unit", T.StringType(), nullable=True),
    T.StructField("cost_usd_increment", T.DecimalType(10,4), nullable=True),
    T.StructField("schema_version", T.IntegerType(), nullable=True),
    T.StructField("carbon_kg", T.DoubleType(), nullable=True),
])


In [72]:
# io_utils.py

from __future__ import annotations
from pathlib import Path
from typing import Final
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F  # required by add_audit_columns
from pyspark.sql import types as T


# --- Generic path helpers --- #

def zone_path(zone_root: Path, table_name: str) -> Path:
    """
    Return the full path to a table inside a datalake zone.
    """

    return (zone_root / table_name).resolve()


def add_audit_columns(df: DataFrame) -> DataFrame:
    """
    Enrich a DataFrame with technical audit columns:
    - ingest_ts: ingestion timestamp
    - source_file: name of the source file
    """
    return df.withColumn("ingest_ts", F.current_timestamp()) \
             .withColumn("source_file", F.input_file_name())

# --- CSV reading --- #

def read_csv(
    spark: SparkSession,
    path: Path,
    schema: T.StructType,
    header: bool = True,
) -> DataFrame:

    # Read a CSV given its path and schema

    return (
        spark.read
        .option("header", str(header).lower())
        .schema(schema)
        .csv(str(path))
    )



def write_parquet(
    df: DataFrame,
    base_path: Path,
    partition_cols: list[str] | None = None,
    mode: str = "overwrite",
) -> None:
    # Write a DataFrame in Parquet format

    writer = df.write.mode(mode)
    if partition_cols:
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(str(base_path))


def read_parquet(spark, base_path, partition_glob: str | None = None):

    # Read Parquet: without partition_glob, read the base path; with partition_glob, set basePath and read the matching pattern.

    base_path = Path(base_path)
    base_str = str(base_path)

    if partition_glob:
        return (
            spark.read
                 .option("basePath", base_str)
                 .parquet(f"{base_str}/{partition_glob}")
        )

    return spark.read.parquet(base_str)
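
The `partition_glob` argument relies on Hive-style partition directories such as `event_date=2025-01-01`, while `basePath` tells Spark where the table root is so the partition column can still be derived from the directory names. The plain-Python sketch below only approximates the path selection with `fnmatch`, which is not identical to Spark's own glob handling. `select_partition_dirs` and the directory names are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import List, Optional


def select_partition_dirs(base: str, children: List[str], partition_glob: Optional[str]) -> List[str]:
    """Return the paths a reader would be pointed at for a given partition glob."""
    if not partition_glob:
        return [base]  # no glob: read the table root as a whole
    return [str(PurePosixPath(base) / child) for child in children if fnmatch(child, partition_glob)]


# Hypothetical contents of a table directory.
children = ["event_date=2025-01-01", "event_date=2025-01-02", "_checkpoints"]

print(select_partition_dirs("/lake/gold/t", children, "event_date=*"))
```

A pattern like `event_date=*` selects only partition directories, whereas a bare `*` would also match anything else stored next to them.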


In [73]:
# audit.py
from __future__ import annotations

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

# Colab-only fallback; the final code simply imports these names
try:
    from config import BRONZE_PATH, SILVER_PATH, GOLD_PATH, QUARANTINE_PATH
    from io_utils import read_parquet, zone_path
except ModuleNotFoundError:
    import __main__ as _m
    BRONZE_PATH = _m.BRONZE_PATH
    SILVER_PATH = _m.SILVER_PATH
    GOLD_PATH = _m.GOLD_PATH
    QUARANTINE_PATH = _m.QUARANTINE_PATH
    read_parquet = _m.read_parquet
    zone_path = _m.zone_path






def audit_bronze_layer(spark: SparkSession, table_name: str, pk_col: str) -> None:

    # Check volume, PK uniqueness, and that ingest_ts is populated

    print(f"\n chequeo Bronze: {table_name}")

    path = zone_path(BRONZE_PATH, table_name)
    try:
        df = read_parquet(spark, path)
    except AnalysisException:
        print(f"[ERR] no se encontro la tabla en {path}")
        return

    # Volume
    count_total = df.count()

    # Uniqueness
    count_distinct = df.select(pk_col).distinct().count()
    duplicates = count_total - count_distinct

    # ingest_ts
    if "ingest_ts" in df.columns:
        null_tech = df.filter(F.col("ingest_ts").isNull()).count()
    else:
        null_tech = "falta columna ingest_ts"

    print(f"registros totales: {count_total}")
    print(f"duplicados en PK ({pk_col}): {duplicates}")
    print(f"nulos en ingest_ts: {null_tech}")

    # Result
    if duplicates == 0 and (isinstance(null_tech, int) and null_tech == 0):
        print("Resultado: Todo ok")
    else:
        print("Resultado: Revisar data por posibles duplicados o falta de metadatos")


def audit_silver_quality(spark: SparkSession) -> None:

    # Check the outcome of the Silver batch by computing the ratio of rows in Silver vs Quarantine
    print(f"\n Chequeop Silver:")

    path_good = zone_path(SILVER_PATH, "usage_events_enriched")
    path_bad  = zone_path(QUARANTINE_PATH, "usage_events_quarantine")

    # Count accepted rows
    try:
        good_df = read_parquet(spark, path_good)
        count_good = good_df.count()
    except AnalysisException:
        count_good = 0

    # Count rejected rows
    try:
        bad_df = read_parquet(spark, path_bad)
        count_bad = bad_df.count()
        has_bad = True
    except AnalysisException:
        count_bad = 0
        has_bad = False

    total = count_good + count_bad
    if total == 0:
        print("[WARN] No hay datos procesados en Silver ni cuarentena.")
        return

    bad_ratio = (count_bad / total) * 100

    print(f"Total: {total}")
    print(f"Aceptados (Silver): {count_good}")
    print(f"Rechazados (cuarentena): {count_bad} ({bad_ratio:.2f}%)")

    if bad_ratio == 0:
        print("CALIDAD: PERFECTA (0% Rechazo)")
    elif bad_ratio < 5:
        print("CALIDAD: ACEPTABLE")
    else:
        print("CALIDAD: CRÍTICA (>5% Rechazo). Revisar reglas de negocio.")

    if has_bad:
        print("ejemplo de rechazo:")
        bad_df.select("event_id", "quarantine_reason").show(1, truncate=False)


def audit_gold_layer(spark: SparkSession, table_name: str) -> None:

    # Audit a Gold mart against the business rules for the serving layer: volume, integrity, and KPIs
    print(f"\n Chequeo Gold: {table_name}")

    path = zone_path(GOLD_PATH, table_name)
    try:
        # Gold is partitioned by event_date
        df = read_parquet(spark, path, partition_glob="event_date=*")
    except AnalysisException:
        print(f"   [ERR] No se encontró el mart en {path}")
        return

    count_total = df.count()

    # Rule: negative costs
    neg_costs = df.filter(F.col("daily_cost_usd") < 0).count()

    print(f"registros Totales (Agregados): {count_total}")
    print(f"Costos Negativos detectados: {neg_costs}")

    if neg_costs == 0:
        print("todo listo para serving layer")
    else:
        print("error: data corrupta en Gold")

def audit_quarantine(spark: SparkSession):
    print(f"\nCalidad de Datos (Silver Batch)")

    path_good = zone_path(SILVER_PATH, "usage_events_enriched")
    path_bad  = zone_path(QUARANTINE_PATH, "usage_events_quarantine")

    # Count accepted rows
    try:
        good_df = read_parquet(spark, path_good)
        count_good = good_df.count()
    except AnalysisException:
        count_good = 0
        print("[WARN] No hay data en Silver.")

    # Count rejected rows
    try:
        bad_df = read_parquet(spark, path_bad)
        count_bad = bad_df.count()
        has_bad_data = True
    except AnalysisException:
        count_bad = 0
        has_bad_data = False
        print("[INFO] No hay data en cuarentena")

    # ratio
    total = count_good + count_bad
    if total == 0:
        print("[ERR] No hay data procesada")
        return

    bad_ratio = (count_bad / total) * 100

    print(f"\n Estadisticas:")
    print(f"Total Procesado: {total}")
    print(f"Aceptados (Silver): {count_good} ({(100 - bad_ratio):.2f}%)")
    print(f"Rechazados (Quarantine): {count_bad} ({bad_ratio:.2f}%)")

    print("\nresultado:")

    if bad_ratio == 0:
        print("Satifactorio - sin datos rechazados")
    elif bad_ratio < 5:
        print("Aceptable- rechazo  bajo y esperado.")
    else:
        print("malo -demasiada data rechazada (>5%)")

    # Show a sample of the rejected rows
    if has_bad_data:
        print("\n Muestra de registros en Cuarentena (top 5):")
        cols_to_show = ["event_id", "cost_usd_increment", "org_id", "quarantine_reason"]
        actual_cols = [c for c in cols_to_show if c in bad_df.columns]
        bad_df.select(*actual_cols).show(5, truncate=False)


def audit_speed_layer_results(spark: SparkSession):

    # Check that the Speed Layer persisted data to disk.
    # Run this after stopping the stream.

    print(f"\n cheque de Speed Layer ---")
    path = zone_path(GOLD_PATH, "org_daily_usage_by_service_speed")

    try:
        df = read_parquet(spark, path, partition_glob="*")
        total_rows = df.count()

        print(f"Ruta: {path}, Total acumulado en disco: {total_rows}")

        if total_rows > 0:
            print("funcionando todo OK (Datos persistidos correctamente)")
            df.show(3, truncate=False)
        else:
            print("vacio - dejar el stream corriendo mas tiempo")
    except Exception as e:
        print(f"[ERR] No se pudo leer la Speed Layer: {e}")
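
The quality verdict used by the Silver and quarantine audits boils down to a rejection percentage compared against a 5% threshold. As a minimal sketch of that rule in isolation, with no Spark involved: `classify_reject_ratio` and its English labels are illustrative names, not functions from `audit.py`.

```python
from typing import Tuple


def classify_reject_ratio(count_good: int, count_bad: int, threshold_pct: float = 5.0) -> Tuple[float, str]:
    """Return (rejection percentage, verdict) for accepted vs quarantined row counts."""
    total = count_good + count_bad
    if total == 0:
        raise ValueError("no processed rows in Silver or quarantine")

    bad_ratio = (count_bad / total) * 100
    if bad_ratio == 0:
        verdict = "perfect"
    elif bad_ratio < threshold_pct:
        verdict = "acceptable"
    else:
        verdict = "critical"
    return bad_ratio, verdict


print(classify_reject_ratio(98, 2))
print(classify_reject_ratio(90, 10))
```

Keeping the threshold as a parameter makes it easy to tighten the rule later without touching the counting code.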


In [74]:
# bronze_batch.py

from __future__ import annotations

from pathlib import Path
from typing import Optional

from pyspark.sql import DataFrame, SparkSession


try:
    from config import LANDING_PATH, BRONZE_PATH

except ModuleNotFoundError:
    import __main__ as _m
    try:
        LANDING_PATH = _m.LANDING_PATH
        BRONZE_PATH = _m.BRONZE_PATH
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import config; run the 'config.py' cell first."
        ) from exc


# Schema imports

try:
    from schemas import (
        customers_orgs_schema,
        users_schema,
        resources_schema,
        support_tickets_schema,
        marketing_touches_schema,
        nps_surveys_schema,
        billing_monthly_schema,
    )
except ModuleNotFoundError:
    import __main__ as _m  # type: ignore[import]
    try:
        customers_orgs_schema = _m.customers_orgs_schema
        users_schema = _m.users_schema
        resources_schema = _m.resources_schema
        support_tickets_schema = _m.support_tickets_schema
        marketing_touches_schema = _m.marketing_touches_schema
        nps_surveys_schema = _m.nps_surveys_schema
        billing_monthly_schema = _m.billing_monthly_schema
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import schemas; run the 'schemas.py' cell first."
        ) from exc


# Import all I/O helpers from io_utils

try:
    from io_utils import read_csv, write_parquet, zone_path, add_audit_columns
except ModuleNotFoundError:
    import __main__ as _m
    try:
        read_csv = _m.read_csv
        write_parquet = _m.write_parquet
        zone_path = _m.zone_path
        add_audit_columns = _m.add_audit_columns
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import io_utils; run the 'io_utils.py' cell first."
        ) from exc


# Internal helper to read a .csv file from Landing

def _read_landing_csv(
    spark: SparkSession,
    file_name: str,
    schema,
) -> Optional[DataFrame]:

    csv_path = LANDING_PATH / file_name
    if not csv_path.exists():
        print(f"[WARN] CSV no encontrado en landing: {csv_path}")
        return None

    print(f"[INFO] Leyendo {csv_path}")
    return read_csv(spark, csv_path, schema)


# Ingestion

def ingest_customers_orgs_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "customers_orgs.csv", customers_orgs_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "customers_orgs")
    write_parquet(df, dest, partition_cols=["hq_region"])
    print(f"[OK] Bronze customers_orgs -> {dest}")


def ingest_users_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "users.csv", users_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "users")
    write_parquet(df, dest, partition_cols=["role"])
    print(f"[OK] Bronze users -> {dest}")


def ingest_resources_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "resources.csv", resources_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "resources")
    write_parquet(df, dest, partition_cols=["region"])
    print(f"[OK] Bronze resources -> {dest}")


def ingest_support_tickets_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "support_tickets.csv", support_tickets_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "support_tickets")
    write_parquet(df, dest, partition_cols=["severity"])
    print(f"[OK] Bronze support_tickets -> {dest}")


def ingest_marketing_touches_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "marketing_touches.csv", marketing_touches_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "marketing_touches")
    write_parquet(df, dest, partition_cols=["channel"])
    print(f"[OK] Bronze marketing_touches -> {dest}")


def ingest_nps_surveys_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "nps_surveys.csv", nps_surveys_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "nps_surveys")
    write_parquet(df, dest, partition_cols=["survey_date"])
    print(f"[OK] Bronze nps_surveys -> {dest}")


def ingest_billing_monthly_to_bronze(spark: SparkSession) -> None:
    df = _read_landing_csv(spark, "billing_monthly.csv", billing_monthly_schema)
    if df is None:
        return
    df = add_audit_columns(df)
    dest = zone_path(BRONZE_PATH, "billing_monthly")
    write_parquet(df, dest, partition_cols=["month"])
    print(f"[OK] Bronze billing_monthly -> {dest}")


# Bronze batch orchestrator

def run_bronze_batch(spark: SparkSession) -> None:
    ingest_customers_orgs_to_bronze(spark)
    ingest_users_to_bronze(spark)
    ingest_resources_to_bronze(spark)
    ingest_support_tickets_to_bronze(spark)
    ingest_marketing_touches_to_bronze(spark)
    ingest_nps_surveys_to_bronze(spark)
    ingest_billing_monthly_to_bronze(spark)

-main bronze-

In [75]:
ensure_dirs()
unpack_raw_dataset()
spark = create_spark()
run_bronze_batch(spark)

[INFO] Leyendo /content/datalake/landing/customers_orgs.csv
[OK] Bronze customers_orgs -> /content/datalake/bronze/customers_orgs
[INFO] Leyendo /content/datalake/landing/users.csv
[OK] Bronze users -> /content/datalake/bronze/users
[INFO] Leyendo /content/datalake/landing/resources.csv
[OK] Bronze resources -> /content/datalake/bronze/resources
[INFO] Leyendo /content/datalake/landing/support_tickets.csv
[OK] Bronze support_tickets -> /content/datalake/bronze/support_tickets
[INFO] Leyendo /content/datalake/landing/marketing_touches.csv
[OK] Bronze marketing_touches -> /content/datalake/bronze/marketing_touches
[INFO] Leyendo /content/datalake/landing/nps_surveys.csv
[OK] Bronze nps_surveys -> /content/datalake/bronze/nps_surveys
[INFO] Leyendo /content/datalake/landing/billing_monthly.csv
[OK] Bronze billing_monthly -> /content/datalake/bronze/billing_monthly


In [76]:
# bronze_stream.py

from __future__ import annotations

from typing import Optional

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


try:
    from config import LANDING_PATH, BRONZE_PATH
except ModuleNotFoundError:
    import __main__ as _m
    try:
        LANDING_PATH = _m.LANDING_PATH
        BRONZE_PATH = _m.BRONZE_PATH
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import config; run the 'config.py' cell first."
        ) from exc


try:
    from schemas import usage_events_schema
except ModuleNotFoundError:
    import __main__ as _m
    try:
        usage_events_schema = _m.usage_events_schema
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import schemas; run the 'schemas.py' cell first."
        ) from exc



try:
    from io_utils import zone_path
except ModuleNotFoundError:
    import __main__ as _m
    try:
        zone_path = _m.zone_path
    except AttributeError as exc:
        raise RuntimeError(
            "Could not import io_utils; run the 'io_utils.py' cell first."
        ) from exc


# Streaming DataFrame creation

def create_usage_events_stream(spark: SparkSession) -> DataFrame:

    src_dir = LANDING_PATH / "usage_events_stream"

    return (
        spark.readStream
        .schema(usage_events_schema)
        .option("maxFilesPerTrigger", 1)
        .json(str(src_dir))
    )


def transform_usage_events_bronze(df_stream: DataFrame) -> DataFrame:

    # Transformations: 'timestamp' -> 'event_ts', 'event_date' (date), watermark, and dedupe by event_id

    df = (
        df_stream
        .withColumn("event_ts", F.to_timestamp("timestamp"))
        .withColumn("event_date", F.to_date("event_ts"))
    )

    df = (
        df
        .withWatermark("event_ts", "1 day")
        .dropDuplicates(["event_id"])
    )

    return df


# Start the stream from usage_events_stream in Landing

def start_usage_events_to_bronze(spark: SparkSession):

    df_stream = create_usage_events_stream(spark)
    df_bronze = transform_usage_events_bronze(df_stream)

    dest_path = zone_path(BRONZE_PATH, "usage_events")
    checkpoint_path = BRONZE_PATH / "_checkpoints" / "usage_events"

    query = (
        df_bronze
        .writeStream
        .format("parquet")
        .option("checkpointLocation", str(checkpoint_path))
        .option("path", str(dest_path))
        .partitionBy("event_date")
        .outputMode("append")
        .start()
    )

    print(f"[INFO] Streaming usage_events -> {dest_path}")
    print(f"[INFO] Checkpoints en {checkpoint_path}")
    return query

In [77]:
# Notebook-only cell
import time

# Streaming runner for usage_events (for notebook use)
query_usage = start_usage_events_to_bronze(spark)

print("[INFO] corriendo el stream por 20 segundos...")
time.sleep(20)

print("[INFO] parando stream...")
query_usage.stop()
print("[INFO] Stream parado.")

[INFO] Streaming usage_events -> /content/datalake/bronze/usage_events
[INFO] Checkpoints en /content/datalake/bronze/_checkpoints/usage_events
[INFO] corriendo el stream por 20 segundos...
[INFO] parando stream...
[INFO] Stream parado.
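
The intent behind combining a watermark with dedupe is to bound how long the engine has to remember which `event_id` values it has already seen. The plain-Python simulation below is a simplified sketch of that idea, not a model of Spark's internals: it advances the watermark per event rather than per micro-batch, and `dedupe_with_watermark` is a hypothetical name. In Spark itself, my understanding of the Structured Streaming guide is that state is only cleaned up when the event-time column is part of the dedupe key, or when `dropDuplicatesWithinWatermark` (available from PySpark 3.5) is used, so that is worth verifying against the pipeline above.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Tuple

Event = Tuple[str, datetime]  # (event_id, event_ts)


def dedupe_with_watermark(events: Iterable[Event], delay: timedelta) -> List[Event]:
    """Keep the first occurrence of each event_id, holding state only within the watermark."""
    seen: Dict[str, datetime] = {}      # dedupe state: event_id -> event_ts
    max_ts: Optional[datetime] = None   # largest event time observed so far
    kept: List[Event] = []

    for event_id, ts in events:
        if max_ts is not None and ts < max_ts - delay:
            continue                    # older than the watermark: treated as too late
        if event_id in seen:
            continue                    # duplicate still remembered in state
        seen[event_id] = ts
        kept.append((event_id, ts))
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - delay
        # Evict state that can no longer match an on-time event.
        seen = {k: v for k, v in seen.items() if v >= watermark}

    return kept


t0 = datetime(2025, 1, 1)
events = [
    ("a", t0),
    ("a", t0),                               # immediate duplicate
    ("b", t0 + timedelta(days=2)),           # advances the watermark to t0 + 1 day
    ("a", t0),                               # late duplicate; state for "a" was already evicted
    ("c", t0 + timedelta(days=2, hours=1)),
]
kept = dedupe_with_watermark(events, delay=timedelta(days=1))
print([event_id for event_id, _ in kept])
```

The late duplicate of `"a"` is rejected by the watermark check rather than by the dedupe state, which is what allows the state to stay small.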


In [78]:
# silver.py


from __future__ import annotations

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

# Colab import fallback again; this will not be in the final .py
try:
    from config import BRONZE_PATH, SILVER_PATH, QUARANTINE_PATH
except ModuleNotFoundError:
    import __main__ as _m
    BRONZE_PATH = _m.BRONZE_PATH
    SILVER_PATH = _m.SILVER_PATH
    QUARANTINE_PATH = _m.QUARANTINE_PATH

try:
    from io_utils import read_parquet, write_parquet, zone_path
except ModuleNotFoundError:
    import __main__ as _m
    read_parquet = _m.read_parquet
    write_parquet = _m.write_parquet
    zone_path = _m.zone_path



def read_bronze_usage_events(spark: SparkSession) -> DataFrame:
    return read_parquet(spark, zone_path(BRONZE_PATH, "usage_events"), partition_glob="event_date=*")

def read_bronze_customers_orgs(spark: SparkSession) -> DataFrame:
    return read_parquet(spark, zone_path(BRONZE_PATH, "customers_orgs"))

# Cleaning, joins, and quarantine

def run_silver_batch(spark: SparkSession) -> None:

    print("[INFO] Iniciando Silver...")

    usage_df = read_bronze_usage_events(spark)
    orgs_df = read_bronze_customers_orgs(spark)

    orgs_sel = orgs_df.select(
        "org_id", "org_name", "hq_region", "plan_tier", "is_enterprise"
    )

    # Broadcast join: orgs is small compared with the events table
    enriched_df = usage_df.join(F.broadcast(orgs_sel), on="org_id", how="left")

    # Rules:
    # 1: Cost must not be negative. Zero or greater is allowed; tiny negatives caused by float/decimal
    #    rounding are tolerated, with the cutoff set at -0.01 to be safe.
    # 2: The row must have an org_id (the left join keeps it, but we still validate it is not null,
    #    as inner-join logic would).
    # Note: a NULL cost with a non-null org_id makes the condition NULL, so such rows pass neither filter below.
    dq_condition = (F.col("cost_usd_increment") >= -0.01) & (F.col("org_id").isNotNull())

    # Split
    good_df = enriched_df.filter(dq_condition)
    bad_df = enriched_df.filter(~dq_condition)

    if not bad_df.rdd.isEmpty():
        # A. Prepare the currently failing rows
        bad_df = bad_df.withColumn("quarantine_reason", F.lit("cost_negative_or_null_org"))

        quarantine_dest = zone_path(QUARANTINE_PATH, "usage_events_quarantine")

        # B. Check whether quarantine already exists, to avoid duplicates
        try:
            existing_quarantine = read_parquet(spark, quarantine_dest)

            # C. Left anti join so quarantine never holds the same event twice
            unique_bad_df = bad_df.join(
                existing_quarantine,
                on="event_id",
                how="left_anti"
            )

            new_errors_count = unique_bad_df.count()
            if new_errors_count > 0:
                print(f"[WARN] Nuevos registros invalidos detectados: {new_errors_count}")
                write_parquet(unique_bad_df, quarantine_dest, mode="append")
            else:
                print(f"[INFO] Errores detectados ya existian en cuarentena")

        except Exception:
            # No quarantine data yet: create it
            print(f"[WARN] Creando cuarentena por primera vez")
            write_parquet(bad_df, quarantine_dest, mode="append")

    # Write the clean Silver data
    silver_dest = zone_path(SILVER_PATH, "usage_events_enriched")
    good_df = good_df.withColumnRenamed("service", "service_name")

    write_parquet(
        good_df,
        silver_dest,
        partition_cols=["event_date"],
        mode="overwrite"
    )
    print(f"[OK] Silver Batch completado, todo ok -> {silver_dest}")
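
The left anti join is what makes re-running the Silver batch idempotent with respect to quarantine: only rows whose `event_id` is not already there get appended. A minimal plain-Python sketch of the same logic, using lists of dicts instead of DataFrames (`left_anti_join` and the sample rows are illustrative, not project code):

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def left_anti_join(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Return the rows of `left` whose key value does not appear in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Quarantine already holds e1 from a previous run.
quarantine: List[Row] = [{"event_id": "e1", "quarantine_reason": "cost_negative_or_null_org"}]
bad_rows: List[Row] = [
    {"event_id": "e1", "quarantine_reason": "cost_negative_or_null_org"},
    {"event_id": "e2", "quarantine_reason": "cost_negative_or_null_org"},
]

# Running the same batch twice appends e2 once and nothing the second time.
for _ in range(2):
    new_rows = left_anti_join(bad_rows, quarantine, key="event_id")
    quarantine.extend(new_rows)

print([row["event_id"] for row in quarantine])
```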

In [79]:
# gold.py

from __future__ import annotations
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

# Colab import fallback again; this will not be in the final .py
try:
    from config import SILVER_PATH, GOLD_PATH, BRONZE_PATH
except ModuleNotFoundError:
    import __main__ as _m
    SILVER_PATH = _m.SILVER_PATH
    GOLD_PATH = _m.GOLD_PATH
    BRONZE_PATH = _m.BRONZE_PATH

try:
    from io_utils import read_parquet, write_parquet, zone_path
except ModuleNotFoundError:
    import __main__ as _m
    read_parquet = _m.read_parquet
    write_parquet = _m.write_parquet
    zone_path = _m.zone_path

try:
    from bronze_stream import create_usage_events_stream, transform_usage_events_bronze
except ModuleNotFoundError:
    import __main__ as _m
    create_usage_events_stream = _m.create_usage_events_stream
    transform_usage_events_bronze = _m.transform_usage_events_bronze


# Batch Gold: the FinOps metrics are computed here

def build_gold_finops_mart(spark: SparkSession) -> DataFrame:
    # Build the FinOps mart by aggregating clean Silver data by org_id, service_name, and event_date.

    silver_path = zone_path(SILVER_PATH, "usage_events_enriched")
    silver_df = read_parquet(spark, silver_path)

    # Aggregations
    aggregated_df = (
        silver_df
        .groupBy("org_id", "org_name", "service_name", "event_date", "hq_region", "plan_tier")
        .agg(
            F.sum("cost_usd_increment").alias("daily_cost_usd"),
            F.sum(
                F.when(F.col("metric") == "requests", F.col("value")).otherwise(0.0)
            ).alias("daily_requests"),
            F.sum("carbon_kg").alias("daily_carbon_kg")
        )
    )

    # KPIs (left as null when the denominator is not positive)
    gold_df = (
        aggregated_df
        .withColumn(
            "cost_per_request",
            F.when(F.col("daily_requests") > 0,
                   F.col("daily_cost_usd") / F.col("daily_requests")).otherwise(None)
        )
        .withColumn(
            "carbon_per_dollar",
            F.when(F.col("daily_cost_usd") > 0,
                   F.col("daily_carbon_kg") / F.col("daily_cost_usd")).otherwise(None)
        )
    )

    return gold_df

def run_gold_batch(spark: SparkSession) -> None:
    df = build_gold_finops_mart(spark)
    dest = zone_path(GOLD_PATH, "org_daily_usage_by_service")
    write_parquet(df, dest, partition_cols=["event_date"], mode="overwrite")
    print(f"[OK] Gold Batch (FinOps) -> {dest}")


# SPEED GOLD: streaming straight into Gold


def start_gold_speed_stream(spark: SparkSession):

    # Speed Layer: reads the stream and writes it to Gold.
    # The Orgs dimension is cached to avoid repeated I/O, and coalesce(1) avoids the small-files problem in Gold.

    print("[INFO] Comenzando Speed Layer...")

    # Stream
    raw_stream = create_usage_events_stream(spark)
    stream_bronze = transform_usage_events_bronze(raw_stream)

    # Cache the Orgs dimension

    orgs_df = read_parquet(spark, zone_path(BRONZE_PATH, "customers_orgs"))
    orgs_sel = orgs_df.select("org_id", "org_name", "hq_region", "plan_tier")
    orgs_sel.cache()
    print(f"[INFO] Dimensión Organizaciones cacheada: {orgs_sel.count()} registros.")


    dest_speed = zone_path(GOLD_PATH, "org_daily_usage_by_service_speed")

    def process_microbatch(batch_df: DataFrame, batch_id: int):

        if batch_df.rdd.isEmpty():
            return

        # metrics
        input_count = batch_df.count()

        # enrichment with the cached Orgs dimension
        enriched = batch_df.join(F.broadcast(orgs_sel), on="org_id", how="left")

        # data quality: drop rows whose cost is below the -0.01 tolerance
        valid_stream = enriched.filter(F.col("cost_usd_increment") >= -0.01)
        valid_count = valid_stream.count()
        dropped_count = input_count - valid_count

        # aggregations
        agg_batch = (
            valid_stream
            .groupBy("org_id", "org_name", "service", "event_date")
            .agg(
                F.sum("cost_usd_increment").alias("daily_cost_usd"),
                F.sum(F.when(F.col("metric") == "requests", F.col("value")).otherwise(0)).alias("daily_requests"),
                F.sum("carbon_kg").alias("daily_carbon_kg")
            )
            .withColumnRenamed("service", "service_name")
        )

        # Append
        (
            agg_batch
            .coalesce(1)
            .write
            .mode("append")
            .partitionBy("event_date")
            .parquet(str(dest_speed))
        )

        # Data-quality log for monitoring

        print(f"[STREAM {batch_id}] Reporte ")
        print(f" Input: {input_count} eventos")
        print(f"Validos: {valid_count}")
        if dropped_count > 0:
            print(f"Dropped (Cost < -0.01): {dropped_count} ({(dropped_count/input_count)*100:.1f}%)")
        print(f"todo pasado a Gold de Speed layer")

    # Start the stream with outputMode("update"); the aggregations run inside foreachBatch, which handles the final write
    query = (
        stream_bronze
        .writeStream
        .foreachBatch(process_microbatch)
        .outputMode("update")
        .trigger(processingTime="5 seconds") # trigger interval to avoid overloading
        .start()
    )

    print(f"[INFO] Streaming ejecutandose -> {dest_speed}")
    return query
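
Because `process_microbatch` appends one set of partial sums per micro-batch, the same `(org_id, service_name, event_date)` key can show up in several rows of the speed table. A reader would probably need to sum those partial rows before computing any ratio. The sketch below shows that idea in plain Python; the sample rows and the `merge_partials` helper are hypothetical and are not part of the notebook.

```python
from collections import defaultdict


def merge_partials(rows):
    """Sum partial daily aggregates that share the same key."""
    totals = defaultdict(lambda: {"daily_cost_usd": 0.0, "daily_requests": 0.0})
    for row in rows:
        key = (row["org_id"], row["service_name"], row["event_date"])
        totals[key]["daily_cost_usd"] += row["daily_cost_usd"]
        totals[key]["daily_requests"] += row["daily_requests"]
    return dict(totals)


# Hypothetical partial rows, as if written by two different micro-batches
partials = [
    {"org_id": "org_a", "service_name": "compute", "event_date": "2025-08-31",
     "daily_cost_usd": 1.5, "daily_requests": 10.0},
    {"org_id": "org_a", "service_name": "compute", "event_date": "2025-08-31",
     "daily_cost_usd": 2.5, "daily_requests": 30.0},
]
merged = merge_partials(partials)
```

In Spark the equivalent would be a `groupBy` on the same three columns followed by `F.sum` when reading the speed table.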

In [80]:
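# cassandra_loader.py

from pyspark.sql import SparkSession

# Colab import patch; the final .py keeps only the imports
try:
    from config import GOLD_PATH
    from io_utils import read_parquet, zone_path
    from cassandra_utils import get_cassandra_session, create_schema, insert_batch_to_cassandra
except ModuleNotFoundError:
    pass  # in Colab these names come from other notebook cells

def upload_gold_to_cassandra(spark: SparkSession):

    # Reads the Gold table and loads it into Cassandra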

    print("\n[SERVING] Iniciando carga a Cassandra...")

    # Schema (DDL)
    try:
        session = get_cassandra_session()
        create_schema(session)
        session.shutdown()
    except Exception as e:
        print(f"[ERR] Error conectando a Cassandra: {e}")
        return


    gold_df = read_parquet(spark, zone_path(GOLD_PATH, "org_daily_usage_by_service"), partition_glob="event_date=*")

    # Convert the rows to Python objects on the driver

    print(f"[SERVING] Leyendo {gold_df.count()} filas de Gold...")
    rows = gold_df.collect()
    data_to_insert = [row.asDict() for row in rows]

    insert_batch_to_cassandra(data_to_insert)
    print(f"[SERVING] Carga completada. {len(data_to_insert)} registros insertados.")

# --- FOR STREAMING (optional / extra credit) ---
def write_stream_to_cassandra(batch_df, batch_id):
    """
    Function meant to be passed to .foreachBatch in the streaming query.
    """
    if batch_df.rdd.isEmpty():
        return

    # Convert to a list of dicts
    rows = batch_df.collect()
    data = [row.asDict() for row in rows]

    # Insert (Cassandra writes are upserts by nature)
    insert_batch_to_cassandra(data)
    print(f"[CASSANDRA STREAM] Batch {batch_id} cargado ({len(data)} filas).")
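
Both loaders call `collect()`, which pulls every row to the driver at once. That is fine for the 816 Gold rows loaded in this notebook, but a larger table might be safer to process in chunks. The sketch below is one possible approach, assuming the rows come from `DataFrame.toLocalIterator()`; the `chunked` helper is hypothetical and a plain generator stands in for the Spark iterator.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Stand-in for gold_df.toLocalIterator(); any iterable of rows works
fake_rows = ({"org_id": f"org_{i}"} for i in range(7))
sizes = [len(chunk) for chunk in chunked(fake_rows, 3)]
```

Note that `insert_batch_to_cassandra` opens a new session on every call, so feeding it chunk by chunk would reconnect each time; reusing one session across chunks would be the natural follow-up.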

Batch Layer → Gold demo

In [81]:
print("Demo Batch Layer")

run_bronze_batch(spark)
audit_bronze_layer(spark, "customers_orgs", pk_col="org_id")

run_silver_batch(spark)
audit_silver_quality(spark)
audit_quarantine(spark)

run_gold_batch(spark)
audit_gold_layer(spark, "org_daily_usage_by_service")


print("\nFin demo Batch Layer (Bronze CSV → Silver → Gold)")

Demo Batch Layer
[INFO] Leyendo /content/datalake/landing/customers_orgs.csv
[OK] Bronze customers_orgs -> /content/datalake/bronze/customers_orgs
[INFO] Leyendo /content/datalake/landing/users.csv
[OK] Bronze users -> /content/datalake/bronze/users
[INFO] Leyendo /content/datalake/landing/resources.csv
[OK] Bronze resources -> /content/datalake/bronze/resources
[INFO] Leyendo /content/datalake/landing/support_tickets.csv
[OK] Bronze support_tickets -> /content/datalake/bronze/support_tickets
[INFO] Leyendo /content/datalake/landing/marketing_touches.csv
[OK] Bronze marketing_touches -> /content/datalake/bronze/marketing_touches
[INFO] Leyendo /content/datalake/landing/nps_surveys.csv
[OK] Bronze nps_surveys -> /content/datalake/bronze/nps_surveys
[INFO] Leyendo /content/datalake/landing/billing_monthly.csv
[OK] Bronze billing_monthly -> /content/datalake/bronze/billing_monthly

 chequeo Bronze: customers_orgs
registros totales: 80
duplicados en PK (org_id): 0
nulos en ingest_ts: 0
Res

Speed Layer → Gold demo

In a real deployment, the streaming pipeline would run as a separate service or orchestrated job.

In [82]:
# DEMO Speed → Gold
import time

query_speed_gold = start_gold_speed_stream(spark)

print("Streaming Speed → Gold")
print(f"ID: {query_speed_gold.id}")
print(f"Nombre: {query_speed_gold.name}")
print(f"Activo: {query_speed_gold.isActive}")


# pause so the stream can process a few micro-batches

time.sleep(15)

print(" Progreso del streaming")
print(query_speed_gold.lastProgress)

print("parando stream...")

# The isActive guard covers the case where the stream is stopped while a batch is in flight and
# raises java.lang.InterruptedException, which does not halt the Colab run. That error only shows
# up here because the stream is stopped at an arbitrary moment; a real deployment would not do this.
if query_speed_gold.isActive:
    query_speed_gold.stop()
    try:
        query_speed_gold.awaitTermination(timeout=2)
    except Exception:
        pass

audit_speed_layer_results(spark)
print("Streaming Speed → Gold parado.")

[INFO] Comenzando Speed Layer...
[INFO] Dimensión Organizaciones cacheada: 80 registros.
[INFO] Streaming ejecutandose -> /content/datalake/gold/org_daily_usage_by_service_speed
Streaming Speed → Gold
ID: a68e2513-f8b4-4e69-83f3-c533a23abb0b
Nombre: None
Activo: True
[STREAM 0] Reporte 
 Input: 360 eventos
Validos: 359
Dropped (Cost < -0.01): 1 (0.3%)
todo pasado a Gold de Speed layer
[STREAM 1] Reporte 
 Input: 360 eventos
Validos: 359
Dropped (Cost < -0.01): 1 (0.3%)
todo pasado a Gold de Speed layer
[STREAM 2] Reporte 
 Input: 10 eventos
Validos: 10
todo pasado a Gold de Speed layer
[STREAM 3] Reporte 
 Input: 9 eventos
Validos: 9
todo pasado a Gold de Speed layer
 Progreso del streaming
{'id': 'a68e2513-f8b4-4e69-83f3-c533a23abb0b', 'runId': '2ed8bc51-e5b2-43df-ac04-0deed58f4f07', 'name': None, 'timestamp': '2025-11-25T22:27:15.000Z', 'batchId': 3, 'numInputRows': 360, 'inputRowsPerSecond': 72.0, 'processedRowsPerSecond': 192.41047568145376, 'durationMs': {'addBatch': 1683, 'commit
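
The stop logic in the demo cell could be wrapped in a small helper so the same pattern can be reused. This is only a sketch: `stop_query_safely` and `FakeQuery` are hypothetical names, and the fake object mimics just the three `StreamingQuery` members the demo uses (`isActive`, `stop()`, `awaitTermination(timeout=...)`).

```python
def stop_query_safely(query, timeout_s=2):
    """Stop a streaming query and wait briefly for it to finish.

    Returns True if the query was active and a stop was requested.
    """
    if not query.isActive:
        return False
    query.stop()
    try:
        query.awaitTermination(timeout=timeout_s)
    except Exception as exc:  # an interrupted micro-batch may raise here
        print(f"[WARN] ignored while stopping stream: {exc!r}")
    return True


class FakeQuery:
    """Hypothetical stand-in for a StreamingQuery, used only for this demo."""

    def __init__(self):
        self.isActive = True

    def stop(self):
        self.isActive = False

    def awaitTermination(self, timeout=None):
        raise RuntimeError("interrupted")


q = FakeQuery()
stopped = stop_query_safely(q)
```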

In [83]:
# install the drivers in case they are missing, to be safe
!pip install cassandra-driver
!pip install astrapy



In [87]:
# cassandra_utils.py

import os
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from dotenv import load_dotenv
from pathlib import Path

# load credentials
load_dotenv("/content/creds/cred.env", override=True)
# config
SECURE_BUNDLE_PATH = "/content/creds/secure-connect-proyecto-cloud-analytics.zip"
ASTRA_DB_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
KEYSPACE = "Cloud_analytics_db"

# Cassandra helper functions: get a session, create the schema, insert rows
def get_cassandra_session():

    if not Path(SECURE_BUNDLE_PATH).exists():
        raise FileNotFoundError(f"Falta el Secure Connect Bundle en: {SECURE_BUNDLE_PATH}")

    if not ASTRA_DB_TOKEN:
        raise RuntimeError("No se encontro ASTRA_DB_APPLICATION_TOKEN en cred.env")

    cloud_config = {
        'secure_connect_bundle': SECURE_BUNDLE_PATH
    }

    auth_provider = PlainTextAuthProvider("token", ASTRA_DB_TOKEN)

    cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, protocol_version=4)
    session = cluster.connect()
    return session

def create_schema(session):
    print(f"[CASSANDRA] Creando esquema en keyspace '{KEYSPACE}'...")

    ddl_gold = f"""
    CREATE TABLE IF NOT EXISTS "{KEYSPACE}".org_daily_usage_by_service (
        org_id text,
        usage_date date,
        service_name text,
        daily_cost_usd double,
        daily_requests double,
        daily_carbon_kg double,
        cost_per_request double,
        carbon_per_dollar double,
        PRIMARY KEY ((org_id), usage_date, service_name)
    ) WITH CLUSTERING ORDER BY (usage_date DESC, service_name ASC);
    """
    session.execute(ddl_gold)
    print("[CASSANDRA] Tabla 'org_daily_usage_by_service' lista.")

def insert_batch_to_cassandra(rows: list[dict]):
    if not rows:
        return
    session = get_cassandra_session()

    query = f"""
    INSERT INTO "{KEYSPACE}".org_daily_usage_by_service (
        org_id, usage_date, service_name,
        daily_cost_usd, daily_requests, daily_carbon_kg,
        cost_per_request, carbon_per_dollar
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """
    try:
        prepared = session.prepare(query)

        # Gold names the date column event_date; the Cassandra table names it usage_date
        for row in rows:
            session.execute(prepared, (
                row["org_id"],
                row["event_date"],
                row["service_name"],
                row["daily_cost_usd"],
                row["daily_requests"],
                row["daily_carbon_kg"],
                row["cost_per_request"],
                row["carbon_per_dollar"]
            ))
    finally:
        # Close the session even if an insert fails
        session.shutdown()

In [88]:
upload_gold_to_cassandra(spark)


[SERVING] Iniciando carga a Cassandra...
[CASSANDRA] Creando esquema en keyspace 'Cloud_analytics_db'...
[CASSANDRA] Tabla 'org_daily_usage_by_service' lista.
[SERVING] Leyendo 816 filas de Gold...
[SERVING] Carga completada. 816 registros insertados.
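
The table created by `create_schema` uses `org_id` as partition key and clusters rows by `usage_date DESC, service_name ASC`, so a query that filters on one `org_id` gets its rows back newest first without any extra sorting. The sketch below imitates that ordering in plain Python to make the Query-First choice concrete; the sample rows are hypothetical.

```python
from datetime import date

# Hypothetical rows from a single org_id partition, in arbitrary insert order
rows = [
    {"usage_date": date(2025, 8, 18), "service_name": "compute"},
    {"usage_date": date(2025, 8, 31), "service_name": "genai"},
    {"usage_date": date(2025, 8, 31), "service_name": "compute"},
    {"usage_date": date(2025, 8, 31), "service_name": "database"},
]

# Clustering order: usage_date DESC, then service_name ASC
ordered = sorted(rows, key=lambda r: (-r["usage_date"].toordinal(), r["service_name"]))
order = [(r["usage_date"].isoformat(), r["service_name"]) for r in ordered]
```

With this layout, a `LIMIT 5` on one partition reads the five most recent `(usage_date, service_name)` rows directly.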


In [89]:
# cassandra_queries.py


try:
    from cassandra_utils import get_cassandra_session, KEYSPACE
except ModuleNotFoundError:
    pass

def run_business_queries():

    print("\n  resultados de negocio \n")
    session = get_cassandra_session()

    # QUERY 1: costs by org and service (filtered to one specific org)

    target_org = "org_c11ertj5"  # can be swapped for any org; this could be driven by the BI integration

    print(f"1. detalle de consumo para {target_org} (ultimos registros)")
    query_1 = f"""
        SELECT usage_date, service_name, daily_cost_usd, daily_requests
        FROM "{KEYSPACE}".org_daily_usage_by_service
        WHERE org_id = '{target_org}'
        LIMIT 5
    """
    rows = session.execute(query_1)
    for r in rows:
        print(f"{r.usage_date}|{r.service_name:10}|${r.daily_cost_usd:.2f} |{r.daily_requests} reqs")


    # QUERY 2: cost check on the most recent rows (ordered by usage_date DESC, the table's clustering order; not a ranking by cost)

    print(f"\n 2. costos (chequeo de datos insertados")
    query_2 = f"""
        SELECT org_id, service_name, daily_cost_usd
        FROM "{KEYSPACE}".org_daily_usage_by_service
        WHERE org_id = '{target_org}'
        ORDER BY usage_date DESC
        LIMIT 5
    """
    rows = session.execute(query_2)
    for r in rows:
        print(f"{r.org_id}|{r.service_name} | costo:${r.daily_cost_usd:.4f}")

    session.shutdown()

# Run the final report
run_business_queries()



  resultados de negocio 

1. detalle de consumo para org_c11ertj5 (ultimos registros)
2025-08-31|compute   |$7.66 |120.0 reqs
2025-08-31|database  |$5.85 |119.0 reqs
2025-08-31|genai     |$31.13 |245.0 reqs
2025-08-18|compute   |$0.06 |0.0 reqs
2025-08-13|database  |$0.11 |0.0 reqs

 2. costos (chequeo de datos insertados
org_c11ertj5|compute | costo:$7.6606
org_c11ertj5|database | costo:$5.8466
org_c11ertj5|genai | costo:$31.1266
org_c11ertj5|compute | costo:$0.0571
org_c11ertj5|database | costo:$0.1065
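
Both queries interpolate `target_org` straight into the CQL text. Since the loader already uses `session.prepare`, the same mechanism could bind `org_id` as a parameter instead. The sketch below is one way to do it; `fetch_latest_usage` and `FakeSession` are hypothetical names, and the fake session only records calls so the example runs without a Cassandra connection.

```python
def fetch_latest_usage(session, keyspace, org_id, limit=5):
    """Read the newest rows of one org, binding org_id as a parameter."""
    cql = (
        "SELECT usage_date, service_name, daily_cost_usd, daily_requests "
        f'FROM "{keyspace}".org_daily_usage_by_service '
        f"WHERE org_id = ? LIMIT {int(limit)}"
    )
    prepared = session.prepare(cql)
    return list(session.execute(prepared, (org_id,)))


class FakeSession:
    """Hypothetical stand-in that records calls instead of reaching Cassandra."""

    def __init__(self):
        self.calls = []

    def prepare(self, cql):
        return cql

    def execute(self, statement, params=None):
        self.calls.append((statement, params))
        return []


fake = FakeSession()
result = fetch_latest_usage(fake, "Cloud_analytics_db", "org_c11ertj5")
```

With a real session, `get_cassandra_session()` would take the place of `FakeSession()`.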


In [90]:
# final idempotency check
print("\n chequeo de idempotencia")

# count the rows currently in Cassandra
session = get_cassandra_session()
count_1 = session.execute(f'SELECT count(*) FROM "{KEYSPACE}".org_daily_usage_by_service').one()[0]
session.shutdown()
print(f"Registros en Cassandra (1era ejecucion): {count_1}")

# re-run the load into Cassandra
print(">>> Re-ejecutando carga a Cassandra...")
upload_gold_to_cassandra(spark)

# count again
session = get_cassandra_session()
count_2 = session.execute(f'SELECT count(*) FROM "{KEYSPACE}".org_daily_usage_by_service').one()[0]
session.shutdown()
print(f"Registros en Cassandra (2da ejecucion): {count_2}")

if count_1 == count_2:
    print("idempotencia ok")
else:
    print("[WARN]se generaron duplicados.")


 chequeo de idempotencia




Registros en Cassandra (1era ejecucion): 876
>>> Re-ejecutando carga a Cassandra...

[SERVING] Iniciando carga a Cassandra...
[CASSANDRA] Creando esquema en keyspace 'Cloud_analytics_db'...
[CASSANDRA] Tabla 'org_daily_usage_by_service' lista.
[SERVING] Leyendo 816 filas de Gold...
[SERVING] Carga completada. 816 registros insertados.




Registros en Cassandra (2da ejecucion): 876
idempotencia ok
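
The count stays the same because a Cassandra `INSERT` is an upsert: a row whose primary key `(org_id, usage_date, service_name)` already exists is overwritten instead of duplicated. The sketch below models that behaviour with a plain dict keyed by the primary key; the `upsert` helper and the sample rows are hypothetical.

```python
def upsert(table, row):
    """Insert or overwrite a row keyed by (org_id, usage_date, service_name)."""
    key = (row["org_id"], row["usage_date"], row["service_name"])
    table[key] = row


# Hypothetical Gold rows
rows = [
    {"org_id": "org_a", "usage_date": "2025-08-31", "service_name": "compute", "daily_cost_usd": 7.66},
    {"org_id": "org_a", "usage_date": "2025-08-31", "service_name": "genai", "daily_cost_usd": 31.13},
]

table = {}
for row in rows:
    upsert(table, row)
count_first = len(table)

# Loading the same rows again overwrites them in place
for row in rows:
    upsert(table, row)
count_second = len(table)
```

Equal counts only show that no new keys were added; they do not prove that the stored values are unchanged.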
