
## Notebook 11: Model Registry with MLflow
**Objective**: Manage the lifecycle of the SECOP model (versions and stages).


In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql.functions import col, when
import mlflow
import mlflow.spark
from mlflow.tracking import MlflowClient

# %%
spark = SparkSession.builder \
    .appName("SECOP_ModelRegistry") \
    .master("spark://spark-master:7077") \
    .getOrCreate()



## CHALLENGE 1: Configure MLflow and the Registry
**Difference**: Tracking records "attempts" (snapshots of individual runs); the Registry manages "products" (official, versioned models).
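
One concrete way to see the difference is in how each side addresses a model. As a rough sketch: Tracking points at an artifact inside a run (`runs:/<run_id>/<artifact_path>`), while the Registry points at a named, versioned entry (`models:/<name>/<version>`). The helper names and the run ID below are hypothetical; only the two URI schemes come from MLflow.

```python
# Illustrative helpers (hypothetical names) that build the two MLflow URI forms.

def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """URI of a model artifact logged inside a tracking run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name: str, version: int) -> str:
    """URI of a specific version of a registered model."""
    return f"models:/{name}/{version}"


# The run ID here is a made-up placeholder, not a real run.
print(run_model_uri("abc123"))
print(registry_model_uri("Clasificador_Contratos_Top25", 1))
```

Either form can be passed to `mlflow.spark.load_model`; the registry form stays valid even if the run that produced the model is no longer the one you care about.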


In [None]:
mlflow.set_tracking_uri("http://mlflow:5000")
client = MlflowClient()
model_name = "Clasificador_Contratos_Top25"

# Load the data
df_raw = spark.read.parquet("/opt/spark-data/processed/secop_ml_ready.parquet")
# Bucket the original label into quartiles; the top quartile (bucket 3) becomes the positive class
discretizer = QuantileDiscretizer(numBuckets=4, inputCol="label", outputCol="cuartil")
df_final = discretizer.fit(df_raw).transform(df_raw) \
    .withColumn("label", when(col("cuartil") == 3.0, 1.0).otherwise(0.0)) \
    .select("features", "label")

train, test = df_final.randomSplit([0.8, 0.2], seed=42)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
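
The relabeling step above turns the original numeric label into a binary "top 25%" target. A minimal pure-Python sketch of the same idea follows, assuming exact quartiles; Spark's `QuantileDiscretizer` uses approximate quantiles, so its bucket boundaries can differ slightly. The function name and the sample values are hypothetical.

```python
import statistics


def top_quartile_labels(values):
    """Return 1.0 for values at or above the third quartile, else 0.0."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q3 = statistics.quantiles(values, n=4, method="inclusive")[2]
    return [1.0 if v >= q3 else 0.0 for v in values]


# Hypothetical contract values, for illustration only.
contract_values = [10, 20, 30, 40, 50, 60, 70, 80]
labels = top_quartile_labels(contract_values)
print(labels)
print(sum(labels) / len(labels))  # share of positives
```

By construction roughly a quarter of the rows end up positive, so the classes are imbalanced about 3:1. This is one reason an AUC-based evaluator is a reasonable choice here.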


                                                                                

## CHALLENGE 2: Train and Register Model v1 (Baseline)


In [3]:
mlflow.set_experiment("/SECOP_Model_Registry")

with mlflow.start_run(run_name="Run_v1_Baseline") as run:
    lr_v1 = LogisticRegression(regParam=0.1, labelCol="label")
    model_v1 = lr_v1.fit(train)
    auc_v1 = evaluator.evaluate(model_v1.transform(test))
    
    mlflow.log_metric("auc", auc_v1)
    
    # Automatically register the model in the Model Registry
    mlflow.spark.log_model(
        spark_model=model_v1,
        artifact_path="model",
        registered_model_name=model_name
    )
    print(f"v1 registered. AUC: {auc_v1:.4f}")

v1 registered. AUC: 0.8260


Created version '19' of model 'Clasificador_Contratos_Top25'.



## CHALLENGE 3: Train and Register Model v2 (Optimized)




In [None]:
with mlflow.start_run(run_name="Run_v2_Optimizado") as run:
    lr_v2 = LogisticRegression(regParam=0.01, labelCol="label")
    model_v2 = lr_v2.fit(train)
    auc_v2 = evaluator.evaluate(model_v2.transform(test))
    
    mlflow.log_metric("auc", auc_v2)
    
    mlflow.spark.log_model(
        spark_model=model_v2,
        artifact_path="model",
        registered_model_name=model_name
    )
    print(f"v2 registered. AUC: {auc_v2:.4f}")

Registered model 'Clasificador_Contratos_Top25' already exists. Creating a new version of this model...
2026/02/15 19:50:05 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: Clasificador_Contratos_Top25, version 20


v2 registered. AUC: 0.8262


Created version '20' of model 'Clasificador_Contratos_Top25'.
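
The registry assigns version numbers itself (19 and 20 in the outputs above), so later cells cannot safely assume that the freshly trained models are versions 1 and 2. One possible way to recover the number is to look it up by run ID. The helper below is a hypothetical sketch; it relies only on `search_model_versions`, whose filter string accepts `name` and `run_id`.

```python
def version_for_run(client, model_name: str, run_id: str) -> str:
    """Return the registry version number that was created from the given run."""
    matches = client.search_model_versions(
        f"name='{model_name}' and run_id='{run_id}'"
    )
    if not matches:
        raise LookupError(f"No version of {model_name!r} found for run {run_id}")
    # A run normally maps to one version; pick the highest if there are several.
    return max(matches, key=lambda mv: int(mv.version)).version


# Possible usage in this notebook, right after the v2 run has finished:
# version_v2 = version_for_run(client, model_name, run.info.run_id)
```

The returned value can then be passed as `version=` to the stage and metadata calls instead of a literal number.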



## CHALLENGE 4: Manage Versions and Stages
**Question**: Why use Staging?
**Answer**: To run integration tests (for example, checking that the model loads correctly in the API) before it affects real users in Production.
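
As a hedged illustration of that idea, the sketch below moves a version to Staging, runs a caller-supplied check, and promotes to Production only if the check passes. The function name and the `smoke_test` callable are hypothetical; the only MLflow call used is `transition_model_version_stage`, the same one this notebook uses.

```python
def promote_via_staging(client, model_name, version, smoke_test):
    """Move a version to Staging, run a check, and promote only if it passes."""
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Staging"
    )
    # smoke_test is a caller-supplied callable (hypothetical) that returns True
    # when the Staging model loads and scores a sample request correctly.
    if not smoke_test(f"models:/{model_name}/Staging"):
        return "Staging"
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Production"
    )
    return "Production"


# Possible usage, with a check that simply tries to load the model:
# promote_via_staging(client, model_name, 20,
#                     lambda uri: mlflow.spark.load_model(uri) is not None)
```

If the check fails, the version stays in Staging and the model currently serving users is left untouched.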


In [5]:
# List all registered versions and their current stage
versions = client.search_model_versions(f"name='{model_name}'")
for v in versions:
    print(f"Version: {v.version}, Stage: {v.current_stage}")

# Promote version 2 to Production and archive version 1.
# Note: these are hard-coded registry version numbers; the models trained in
# this session were registered as versions 19 and 20 (see the outputs above).
client.transition_model_version_stage(name=model_name, version=2, stage="Production")
client.transition_model_version_stage(name=model_name, version=1, stage="Archived")

print("Stage transitions completed: v2 -> Production, v1 -> Archived")


Version: 20, Stage: None
Version: 19, Stage: None
Version: 2, Stage: Production
Version: 1, Stage: Archived
Version: 18, Stage: None
Version: 17, Stage: None
Version: 16, Stage: None
Version: 15, Stage: None
Version: 14, Stage: None
Version: 13, Stage: None
Version: 12, Stage: None
Version: 11, Stage: None
Version: 10, Stage: None
Version: 9, Stage: None
Version: 8, Stage: None
Version: 7, Stage: None
Version: 6, Stage: None
Version: 5, Stage: None
Version: 4, Stage: None
Version: 3, Stage: None


Stage transitions completed: v2 -> Production, v1 -> Archived
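
Recent MLflow releases deprecate stage transitions in favor of model version aliases. A minimal sketch of the alias-based equivalent follows, assuming an MLflow version that provides `set_registered_model_alias`; the helper name and the alias `champion` are illustrative choices, not something this notebook defines.

```python
def assign_alias(client, model_name: str, alias: str, version) -> str:
    """Point `alias` at `version` and return the URI that resolves through it."""
    client.set_registered_model_alias(model_name, alias, str(version))
    return f"models:/{model_name}@{alias}"


# Possible usage in this notebook (alias name is an assumption):
# uri = assign_alias(client, model_name, "champion", 20)
# model = mlflow.spark.load_model(uri)
```

Reassigning the alias to another version changes what the URI resolves to, so consumers that load `models:/<name>@champion` need no code change when a new version is promoted.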



## CHALLENGE 5: Add Metadata to the Model


In [None]:
client.update_model_version(
    name=model_name,
    version=2,  # hard-coded registry version number
    description=f"Model optimized with ridge (L2) regularization (regParam=0.01). Test AUC: {auc_v2:.4f}. SECOP 2026 dataset."
)

<ModelVersion: aliases=[], creation_timestamp=1771181286413, current_stage='Production', description='Model optimized with ridge (L2) regularization (regParam=0.01). Test AUC: 0.8262. SECOP 2026 dataset.', last_updated_timestamp=1771185006714, name='Clasificador_Contratos_Top25', run_id='0523ec85d7f74e4086ff6d6b7c3e8173', run_link='', source='file:///opt/mlflow/mlruns/678707886628810925/0523ec85d7f74e4086ff6d6b7c3e8173/artifacts/model', status='READY', status_message='', tags={}, user_id='', version='2'>
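
A free-text description is hard to query later, and the `tags={}` field in the output above is still empty. As a sketch, the same facts could also be stored as key/value tags with `set_model_version_tag`. The helper name and the tag keys (`auc`, `reg_param`) are hypothetical.

```python
def annotate_version(client, model_name, version, description, tags):
    """Attach a description and a set of key/value tags to one model version."""
    client.update_model_version(
        name=model_name, version=version, description=description
    )
    for key, value in tags.items():
        # Tag values are stored as strings in the registry.
        client.set_model_version_tag(
            name=model_name, version=version, key=key, value=str(value)
        )


# Possible usage in this notebook (tag keys are assumptions):
# annotate_version(client, model_name, 20, "Ridge-regularized model",
#                  {"auc": round(auc_v2, 4), "reg_param": 0.01})
```

Tags set this way can be used in `search_model_versions` filters, for example to find every version trained with a given `reg_param`.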


## CHALLENGE 6: Reactivate the SparkContext and Predict


In [None]:
from pyspark import SparkContext

# getOrCreate() returns the active SparkContext or starts a new one
sc = SparkContext.getOrCreate()

if sc is not None:
    print("SparkContext reactivated successfully.")

    try:
        # Score a small sample of the test set with the in-memory v2 model
        print("Generating final predictions...")

        demo_data = test.limit(5)
        final_predictions = model_v2.transform(demo_data)

        final_predictions.select("label", "probability", "prediction").show(truncate=False)

    except Exception as e:
        print(f"Error during transformation: {e}")
        print("Hint: if the error persists, re-run the cell where 'model_v2' was trained.")
else:
    print("Could not activate the SparkContext. Restart the kernel.")

SparkContext reactivated successfully.
Generating final predictions...
+-----+----------------------------------------+----------+
|label|probability                             |prediction|
+-----+----------------------------------------+----------+
|0.0  |[0.6182985454953545,0.38170145450464554]|0.0       |
|0.0  |[0.6182985454953545,0.38170145450464554]|0.0       |
|0.0  |[0.6182985454953545,0.38170145450464554]|0.0       |
|0.0  |[0.6182985454953545,0.38170145450464554]|0.0       |
|0.0  |[0.6182985454953545,0.38170145450464554]|0.0       |
+-----+----------------------------------------+----------+




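Result 0.0: for these 5 examples, the model predicts that they do NOT belong to the top 25% of most expensive contracts. In each row the probability of the positive class (about 0.38) is below the default 0.5 decision threshold.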
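
The rule that turns the `probability` column into the `prediction` column can be sketched in plain Python. This mirrors the binary thresholding in Spark's `LogisticRegression`, whose default threshold is 0.5; the function name is hypothetical and the probability vector is copied from the table above.

```python
def predict_label(probability, threshold: float = 0.5) -> float:
    """Return 1.0 when the positive-class probability exceeds the threshold."""
    p_positive = probability[1]  # index 1 holds P(label = 1)
    return 1.0 if p_positive > threshold else 0.0


# Probability vector copied from the output table above.
row = [0.6182985454953545, 0.38170145450464554]
print(predict_label(row))                 # default threshold of 0.5
print(predict_label(row, threshold=0.3))  # a lower, illustrative threshold
```

Lowering the threshold trades precision for recall on the top-quartile class. In Spark this corresponds to the `threshold` parameter of `LogisticRegression`.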