### Data Modeling in Databricks - Mining Accident Prediction

This notebook is part of the final project for the MLOps course and focuses on the **modeling** stage of the Machine Learning pipeline.

The goal is to build a predictive model that identifies **accident risk in mining** from historical data. The notebook is organized as follows:

- **Data loading and cleanup:** Leftover columns from earlier stages are dropped and the relevant variables are selected.
- **Dataset split:** The data is split into training (about 80%) and test (about 20%) sets.
- **Building an ML pipeline:** A scalable process is defined that includes:
  - **Feature vectorization** with `VectorAssembler`.
  - **Feature scaling** with `StandardScaler` *(optional)*.
  - **Model training** with `RandomForestClassifier`.

The model will be evaluated in the next stage to validate its predictive performance and determine whether it is suitable for production deployment within an MLOps environment.



### 1. Data loading and environment setup
An Apache Spark session is started in Databricks, and the dataset cleaned during the ingestion and exploration stages is loaded. This dataset is the basis for training the accident prediction model.

In [0]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import mlflow
import mlflow.spark

# Make sure a SparkSession is active (Databricks normally provides one).
spark = SparkSession.builder.getOrCreate()

# Load the table produced by Notebook 01 or 02
df_spark = spark.table("mining.safety_analysis.cleaned_mining_data")

# Show a preview
display(df_spark)

# Optional: list the column names
print("Columnas disponibles:", df_spark.columns)


Accident_Occurred,Hours_Worked,Weather_Risk_Index,Machine_Age_Years,Employee_Experience_Years,Safety_Violations,Inspection_Frequency,Job_Risk_Level,Shift_Type,Area,Employee_Age,Noise_Level_dB,Temperature_C,features,scaled_features,Moving_Avg_Noise,Moving_Avg_Temperature,Diff_Safety_Violations,Diff_Inspection_Frequency,Age_Group
0,8,0.2792038645274789,9,16,8,2,2,0,0,18,94.67060562045376,35.142627099878524,"Map(vectorType -> dense, length -> 10, values -> List(8.0, 0.2792038645274789, 9.0, 16.0, 8.0, 2.0, 2.0, 18.0, 94.67060562045377, 35.142627099878524))","Map(vectorType -> dense, length -> 10, values -> List(-0.2875829431914361, -0.7535824891525968, -0.1651267216301929, 0.16987826760051303, 1.2191699090714634, -1.2659000558795286, 0.7448507855134497, -1.7034203426534413, 1.360695451761776, 1.1778917142587966))",94.67060562045376,35.142627099878524,,,Joven
1,8,0.9522924209262468,16,13,5,1,1,0,0,18,77.11794943249967,21.95232815104577,"Map(vectorType -> dense, length -> 10, values -> List(8.0, 0.9522924209262469, 16.0, 13.0, 5.0, 1.0, 1.0, 18.0, 77.11794943249967, 21.95232815104577))","Map(vectorType -> dense, length -> 10, values -> List(-0.2875829431914361, 1.557519390932546, 1.1109684996232383, -0.17676491210397047, 0.1777970580907674, -1.5825017105112626, -0.7392113494007705, -1.7034203426534413, 0.14602346410106173, -0.3475641320609971))",85.89427752647671,28.547477625462147,-3.0,-1.0,Joven
0,7,0.7876088143823133,1,9,9,11,2,0,0,18,97.82037145789188,32.299454658105226,"Map(vectorType -> dense, length -> 10, values -> List(7.0, 0.7876088143823133, 1.0, 9.0, 9.0, 11.0, 2.0, 18.0, 97.82037145789187, 32.299454658105226))","Map(vectorType -> dense, length -> 10, values -> List(-0.8772547561736581, 0.9920654263915089, -1.6235212602055429, -0.6389558183766151, 1.5662941927316953, 1.5835148358060784, 0.7448507855134497, -1.7034203426534413, 1.5786643044219437, 0.8490793014224517))",89.86964217028178,29.798136636343173,4.0,10.0,Joven
1,10,0.6184551853641882,14,0,4,1,2,1,0,18,92.3351107942414,10.96302728252493,"Map(vectorType -> dense, length -> 10, values -> List(10.0, 0.6184551853641882, 14.0, 0.0, 4.0, 1.0, 2.0, 18.0, 92.3351107942414, 10.96302728252493))","Map(vectorType -> dense, length -> 10, values -> List(0.891760682773008, 0.411263292213776, 0.7463698649794008, -1.6788853574900655, -0.16932722556946456, -1.5825017105112626, 0.7448507855134497, -1.7034203426534413, 1.1990754603285962, -1.618474922019686))",90.48600932627168,25.089359297888613,-5.0,-10.0,Joven
0,8,0.1218559206868523,2,22,8,1,1,0,1,18,65.00342819864227,20.02801594253608,"Map(vectorType -> dense, length -> 10, values -> List(8.0, 0.12185592068685236, 2.0, 22.0, 8.0, 1.0, 1.0, 18.0, 65.00342819864227, 20.02801594253608))","Map(vectorType -> dense, length -> 10, values -> List(-0.2875829431914361, -1.2938488855733414, -1.4412219428836242, 0.8631646270094799, 1.2191699090714634, -1.5825017105112626, -0.7392113494007705, -1.7034203426534413, -0.6923208542241678, -0.570110514065131))",83.0692149708188,21.310706508553004,4.0,0.0,Joven
0,11,0.6781768212500029,9,9,8,3,1,0,1,18,60.08889602431877,39.25504829750295,"Map(vectorType -> dense, length -> 10, values -> List(11.0, 0.6781768212500029, 9.0, 9.0, 8.0, 3.0, 1.0, 18.0, 60.088896024318764, 39.25504829750295))","Map(vectorType -> dense, length -> 10, values -> List(1.48143249575523, 0.6163221714137601, -0.1651267216301929, -0.6389558183766151, 1.2191699090714634, -0.9492984012477944, -0.7392113494007705, -1.7034203426534413, -1.032414370594915, 1.6534925314033369))",78.81195161877358,25.636386545167294,0.0,2.0,Joven
1,7,0.8378964868326003,3,16,7,7,2,0,1,18,89.38827781727758,16.410058428066762,"Map(vectorType -> dense, length -> 10, values -> List(7.0, 0.8378964868326003, 3.0, 16.0, 7.0, 7.0, 2.0, 18.0, 89.38827781727758, 16.410058428066762))","Map(vectorType -> dense, length -> 10, values -> List(-0.8772547561736581, 1.1647320587983598, -1.2589226255617054, 0.16987826760051303, 0.8720456254112313, 0.317108217279142, 0.7448507855134497, -1.7034203426534413, 0.9951498874065191, -0.9885266897403794))",76.70392820862,21.66403748765768,-1.0,4.0,Joven
1,11,0.830407296579816,14,28,7,6,1,0,0,18,87.92094123160857,14.990767105593406,"Map(vectorType -> dense, length -> 10, values -> List(11.0, 0.830407296579816, 14.0, 28.0, 7.0, 6.0, 1.0, 18.0, 87.92094123160857, 14.990767105593406))","Map(vectorType -> dense, length -> 10, values -> List(1.48143249575523, 1.1390173419026668, 0.7463698649794008, 1.556450986418447, 0.8720456254112313, 5.065626474079068E-4, -0.7392113494007705, -1.7034203426534413, 0.8936078399809542, -1.152667491544536))",75.6003858179618,22.6709724434248,0.0,-1.0,Joven
1,9,0.896378483223569,7,29,4,11,2,0,1,18,77.89275439144811,17.4946753151898,"Map(vectorType -> dense, length -> 10, values -> List(9.0, 0.896378483223569, 7.0, 29.0, 4.0, 11.0, 2.0, 18.0, 77.89275439144811, 17.4946753151898))","Map(vectorType -> dense, length -> 10, values -> List(0.3020888697907859, 1.3655345394309775, -0.5297253562740304, 1.671998712986608, -0.16932722556946456, 1.5835148358060784, 0.7448507855134497, -1.7034203426534413, 0.19964121115071623, -0.8630909298764662))",78.82271736616326,22.03763728658823,-3.0,5.0,Joven
0,11,0.5197432380137817,12,2,5,9,1,0,0,18,98.31591315581436,38.0379540193144,"Map(vectorType -> dense, length -> 10, values -> List(11.0, 0.5197432380137817, 12.0, 2.0, 5.0, 9.0, 1.0, 18.0, 98.31591315581437, 38.0379540193144))","Map(vectorType -> dense, length -> 10, values -> List(1.48143249575523, 0.0723281477374324, 0.38177123033556337, -1.4477899043537432, 0.1777970580907674, 0.9503115265426103, -0.7392113494007705, -1.7034203426534413, 1.6129565854726815, 1.5127357836308084))",88.37947164903716,21.73336371704109,1.0,-2.0,Joven


Columnas disponibles: ['Accident_Occurred', 'Hours_Worked', 'Weather_Risk_Index', 'Machine_Age_Years', 'Employee_Experience_Years', 'Safety_Violations', 'Inspection_Frequency', 'Job_Risk_Level', 'Shift_Type', 'Area', 'Employee_Age', 'Noise_Level_dB', 'Temperature_C', 'features', 'scaled_features', 'Moving_Avg_Noise', 'Moving_Avg_Temperature', 'Diff_Safety_Violations', 'Diff_Inspection_Frequency', 'Age_Group']


### 2. Removing columns left over from a previous pipeline
Before building the modeling pipeline, temporary columns such as `features` and `scaled_features`, which may be left over from earlier processing, are dropped.

In [0]:
# Step 2: Drop columns left over from an earlier pipeline (features, scaled_features)

df_spark_clean = df_spark.drop("features", "scaled_features")

# Verify that they were removed
print("Columnas nuevas:", df_spark_clean.columns)

# Check the number of records
print("Cantidad de registros:", df_spark_clean.count())

# Show a preview
display(df_spark_clean.limit(5))


Columnas nuevas: ['Accident_Occurred', 'Hours_Worked', 'Weather_Risk_Index', 'Machine_Age_Years', 'Employee_Experience_Years', 'Safety_Violations', 'Inspection_Frequency', 'Job_Risk_Level', 'Shift_Type', 'Area', 'Employee_Age', 'Noise_Level_dB', 'Temperature_C', 'Moving_Avg_Noise', 'Moving_Avg_Temperature', 'Diff_Safety_Violations', 'Diff_Inspection_Frequency', 'Age_Group']
Cantidad de registros: 10000


Accident_Occurred,Hours_Worked,Weather_Risk_Index,Machine_Age_Years,Employee_Experience_Years,Safety_Violations,Inspection_Frequency,Job_Risk_Level,Shift_Type,Area,Employee_Age,Noise_Level_dB,Temperature_C,Moving_Avg_Noise,Moving_Avg_Temperature,Diff_Safety_Violations,Diff_Inspection_Frequency,Age_Group
0,8,0.2792038645274789,9,16,8,2,2,0,0,18,94.67060562045376,35.142627099878524,94.67060562045376,35.142627099878524,,,Joven
1,8,0.9522924209262468,16,13,5,1,1,0,0,18,77.11794943249967,21.95232815104577,85.89427752647671,28.547477625462147,-3.0,-1.0,Joven
0,7,0.7876088143823133,1,9,9,11,2,0,0,18,97.82037145789188,32.299454658105226,89.86964217028178,29.798136636343173,4.0,10.0,Joven
1,10,0.6184551853641882,14,0,4,1,2,1,0,18,92.3351107942414,10.96302728252493,90.48600932627168,25.089359297888613,-5.0,-10.0,Joven
0,8,0.1218559206868523,2,22,8,1,1,0,1,18,65.00342819864227,20.02801594253608,83.0692149708188,21.310706508553004,4.0,0.0,Joven


### 3. Defining the feature columns and the target variable
The predictor variables (features) and the target variable (label) that the model uses to predict mining accidents are selected.

In [0]:
# Step 3: Define columns

feature_cols = [
    "Shift_Type",
    "Weather_Risk_Index",
    "Job_Risk_Level",
    "Hours_Worked",
    "Employee_Experience_Years",
    "Safety_Violations",
    "Inspection_Frequency",
    "Temperature_C"
]

label_col = "Accident_Occurred"  # Adjust if your label is different

print("Columnas de características:", feature_cols)
print("Columna Label:", label_col)


Columnas de características: ['Shift_Type', 'Weather_Risk_Index', 'Job_Risk_Level', 'Hours_Worked', 'Employee_Experience_Years', 'Safety_Violations', 'Inspection_Frequency', 'Temperature_C']
Columna Label: Accident_Occurred
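
`VectorAssembler` raises an error when one of its input columns is absent, so a quick pre-check of the selected names can save a failed run. The following is a minimal sketch in plain Python that uses the column list printed in step 2. The helper name `missing_columns` is hypothetical and not part of the notebook.

```python
# Column names copied from the "Columnas nuevas" output of step 2.
available_cols = [
    "Accident_Occurred", "Hours_Worked", "Weather_Risk_Index", "Machine_Age_Years",
    "Employee_Experience_Years", "Safety_Violations", "Inspection_Frequency",
    "Job_Risk_Level", "Shift_Type", "Area", "Employee_Age", "Noise_Level_dB",
    "Temperature_C", "Moving_Avg_Noise", "Moving_Avg_Temperature",
    "Diff_Safety_Violations", "Diff_Inspection_Frequency", "Age_Group",
]

feature_cols = [
    "Shift_Type", "Weather_Risk_Index", "Job_Risk_Level", "Hours_Worked",
    "Employee_Experience_Years", "Safety_Violations", "Inspection_Frequency",
    "Temperature_C",
]
label_col = "Accident_Occurred"


def missing_columns(required, available):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Hypothetical guard: every feature and the label must exist in the table.
missing = missing_columns(feature_cols + [label_col], available_cols)
print("Missing columns:", missing)

# The label must not also appear among the features, or the target would leak.
label_is_feature = label_col in feature_cols
print("Label used as feature:", label_is_feature)
```

In the notebook itself, `available_cols` could simply be `df_spark_clean.columns`.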


### 4. Splitting the dataset into training and test sets
The dataset is split into two parts:

- About 80% training (`train_data`): used to fit the model.
- About 20% test (`test_data`): used to evaluate model performance.

`randomSplit` assigns rows at random, so the resulting sizes are approximate rather than exact.


In [0]:
# Step 4: Split the data into training (train) and test (test) sets

train_data, test_data = df_spark_clean.randomSplit([0.8, 0.2], seed=42)

print(f"Tamaño train_data: {train_data.count()}")
print(f"Tamaño test_data: {test_data.count()}")

# Preview a few rows from each split
print("train_data - ejemplo:")
display(train_data.limit(5))

print("test_data - ejemplo:")
display(test_data.limit(5))


Tamaño train_data: 7950
Tamaño test_data: 2050
train_data - ejemplo:


Accident_Occurred,Hours_Worked,Weather_Risk_Index,Machine_Age_Years,Employee_Experience_Years,Safety_Violations,Inspection_Frequency,Job_Risk_Level,Shift_Type,Area,Employee_Age,Noise_Level_dB,Temperature_C,Moving_Avg_Noise,Moving_Avg_Temperature,Diff_Safety_Violations,Diff_Inspection_Frequency,Age_Group
0,6,0.0006887463886322553,2,29,1,10,2,0,0,22,82.6818721435589,28.32275681321615,67.20817812777105,21.19731439340012,-1,0,Joven
0,6,0.0010298403174023,12,22,2,5,1,0,1,62,71.05716001457117,20.731845144997497,82.70016476257013,23.73145244787014,1,-1,Mayor
0,6,0.0017217368857351,19,19,7,1,1,0,0,35,54.6235579719154,31.612214480155107,70.15203929897523,33.88353921737678,3,-1,Adulto Joven
0,6,0.002139862392091,14,29,8,8,1,0,1,45,77.90271308579051,11.301129843891292,75.9459567862335,29.77813260853289,-1,0,Adulto Joven
0,6,0.0028721379227563,16,9,6,2,1,0,1,28,89.74902482083128,27.519597007003103,89.04514392786267,21.566136708670165,2,0,Joven


test_data - ejemplo:


Accident_Occurred,Hours_Worked,Weather_Risk_Index,Machine_Age_Years,Employee_Experience_Years,Safety_Violations,Inspection_Frequency,Job_Risk_Level,Shift_Type,Area,Employee_Age,Noise_Level_dB,Temperature_C,Moving_Avg_Noise,Moving_Avg_Temperature,Diff_Safety_Violations,Diff_Inspection_Frequency,Age_Group
0,6,0.006373931764365,10,14,4,6,1,0,1,44,66.42093764723188,30.814743394223257,80.66060614483226,25.861273269124187,4,2,Adulto Joven
0,6,0.0092622321332359,18,10,6,6,1,0,1,36,79.77109334433887,13.89005471820044,89.49790417718718,26.265453336662105,-3,4,Adulto Joven
0,6,0.0152387922183293,8,1,7,6,2,0,0,21,78.33906595976951,38.874422493884815,69.95415608701755,29.112824841490465,-2,4,Joven
0,6,0.0224401434234241,10,18,0,7,1,0,0,51,76.47789478726244,12.670552326271835,81.73660400565944,18.309108610352634,-7,0,Adulto Mayor
0,6,0.0237128124736765,16,29,4,1,2,0,0,59,74.25597978298865,26.198995209769333,81.0064323122527,20.79613700068474,-3,-9,Adulto Mayor
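
The sizes reported above (7950 and 2050) are close to 8000 and 2000 but not equal to them, which is what one would expect when each row is assigned by a random draw rather than by cutting at a fixed index. The sketch below is a plain-Python simulation of that idea. It is not Spark's implementation, and the function name `bernoulli_split` is an assumption made for illustration.

```python
import random


def bernoulli_split(n_rows, train_fraction, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        # Each row lands in train with probability `train_fraction`.
        if rng.random() < train_fraction:
            train.append(i)
        else:
            test.append(i)
    return train, test


train_idx, test_idx = bernoulli_split(n_rows=10000, train_fraction=0.8, seed=42)

# The sizes are near 8000 / 2000 but generally not exact.
print(len(train_idx), len(test_idx))
```

A fixed seed makes the simulated split repeatable, which mirrors the role of `seed=42` in the notebook.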


### 5. Building the Machine Learning pipeline
A pipeline with three stages is created:

- `VectorAssembler`: Combines the input columns into a single feature vector.
- `StandardScaler` (optional): Rescales the features to unit standard deviation. Tree-based models such as random forests are largely insensitive to feature scale, so this stage mainly matters if the classifier is later swapped for a scale-sensitive one.
- `RandomForestClassifier`: Classification model based on an ensemble of decision trees.

In [0]:
# Step 5: Build the Pipeline (Assembler, optional Scaler, and RandomForest)

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

# 5.1 VectorAssembler combines the 8 input columns into 'rawFeatures'
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="rawFeatures"
)

# 5.2 Scaler (optional). To skip scaling, omit this stage and point the
#     classifier's featuresCol at 'rawFeatures'.
scaler = StandardScaler(
    inputCol="rawFeatures",
    outputCol="scaledFeatures",
    withMean=False,  # do not center the features
    withStd=True     # scale each feature to unit standard deviation
)

# 5.3 Define the RandomForest, which takes 'scaledFeatures' as input
rf = RandomForestClassifier(
    featuresCol="scaledFeatures",
    labelCol=label_col,
    predictionCol="prediction",
    maxDepth=5,      # Example value, can be tuned
    numTrees=50      # Example value, can be tuned
)

# 5.4 Build the pipeline
pipeline = Pipeline(stages=[assembler, scaler, rf])

print("Pipeline creado con 3 stages: Assembler -> Scaler -> RandomForest")


Pipeline creado con 3 stages: Assembler -> Scaler -> RandomForest
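
With `withMean=False` and `withStd=True`, the scaler divides each feature by its standard deviation without centering it. The sketch below reproduces that arithmetic in plain Python on a hypothetical toy column, assuming the sample (n - 1) standard deviation. It also shows why the stage is optional for a tree model: dividing by a positive constant does not change the ordering of the rows, so a threshold split separates the same rows before and after scaling.

```python
import statistics


def scale_by_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(values)  # sample (n - 1) standard deviation
    return [v / std for v in values]


# Hypothetical toy column, not taken from the mining dataset.
hours_worked = [6.0, 8.0, 7.0, 11.0, 10.0]
scaled = scale_by_std(hours_worked)

# Row ordering before and after scaling.
order_raw = sorted(range(len(hours_worked)), key=lambda i: hours_worked[i])
order_scaled = sorted(range(len(scaled)), key=lambda i: scaled[i])

print("Same ordering:", order_raw == order_scaled)
print("Std after scaling:", round(statistics.stdev(scaled), 6))
```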


### 6. Model training and prediction generation
The Machine Learning pipeline is trained on the training data. The fitted model is then applied to the test data to generate predictions and evaluate its performance. This step is key in the MLOps workflow because it checks how well the model generalizes to unseen data before production deployment.



In [0]:
# Step 6: Train the pipeline
pipeline_model = pipeline.fit(train_data)

# Apply the fitted pipeline to the test data
pred_test = pipeline_model.transform(test_data)

print("Predicciones en test_data:")
display(pred_test.limit(5))


Predicciones en test_data:


Accident_Occurred,Hours_Worked,Weather_Risk_Index,Machine_Age_Years,Employee_Experience_Years,Safety_Violations,Inspection_Frequency,Job_Risk_Level,Shift_Type,Area,Employee_Age,Noise_Level_dB,Temperature_C,Moving_Avg_Noise,Moving_Avg_Temperature,Diff_Safety_Violations,Diff_Inspection_Frequency,Age_Group,rawFeatures,scaledFeatures,rawPrediction,probability,prediction
0,6,0.006373931764365,10,14,4,6,1,0,1,44,66.42093764723188,30.814743394223257,80.66060614483226,25.861273269124187,4,2,Adulto Joven,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.006373931764365071, 1.0, 6.0, 14.0, 4.0, 6.0, 30.814743394223257))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.02203575404321017, 1.4832442015258043, 3.529877107004314, 1.6220392020904264, 1.3905013280954106, 1.906741597944472, 3.5682427168043978))","Map(vectorType -> dense, length -> 2, values -> List(43.74952888249381, 6.2504711175061916))","Map(vectorType -> dense, length -> 2, values -> List(0.8749905776498761, 0.1250094223501238))",0.0
0,6,0.0092622321332359,18,10,6,6,1,0,1,36,79.77109334433887,13.89005471820044,89.49790417718718,26.265453336662105,-3,4,Adulto Joven,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.00926223213323596, 1.0, 6.0, 10.0, 6.0, 6.0, 13.890054718200439))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.032021094157326076, 1.4832442015258043, 3.529877107004314, 1.1585994300645903, 2.085751992143116, 1.906741597944472, 1.6084212011814032))","Map(vectorType -> dense, length -> 2, values -> List(44.22044505061191, 5.7795549493881))","Map(vectorType -> dense, length -> 2, values -> List(0.884408901012238, 0.11559109898776197))",0.0
0,6,0.0152387922183293,8,1,7,6,2,0,0,21,78.33906595976951,38.874422493884815,69.95415608701755,29.112824841490465,-2,4,Joven,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.015238792218329356, 2.0, 6.0, 1.0, 7.0, 6.0, 38.874422493884815))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.052683067477447464, 2.9664884030516085, 3.529877107004314, 0.11585994300645903, 2.4333773241669685, 1.906741597944472, 4.501526206438757))","Map(vectorType -> dense, length -> 2, values -> List(43.49014781335763, 6.509852186642375))","Map(vectorType -> dense, length -> 2, values -> List(0.8698029562671525, 0.1301970437328475))",0.0
0,6,0.0224401434234241,10,18,0,7,1,0,0,51,76.47789478726244,12.670552326271835,81.73660400565944,18.309108610352634,-7,0,Adulto Mayor,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.022440143423424153, 1.0, 6.0, 18.0, 0.0, 7.0, 12.670552326271835))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.07757934967824248, 1.4832442015258043, 3.529877107004314, 2.0854789741162625, 0.0, 2.2245318642685503, 1.467206962514709))","Map(vectorType -> dense, length -> 2, values -> List(44.74346925095585, 5.256530749044165))","Map(vectorType -> dense, length -> 2, values -> List(0.8948693850191167, 0.10513061498088327))",0.0
0,6,0.0237128124736765,16,29,4,1,2,0,0,59,74.25597978298865,26.198995209769333,81.0064323122527,20.79613700068474,-3,-9,Adulto Mayor,"Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.023712812473676514, 2.0, 6.0, 29.0, 4.0, 1.0, 26.198995209769333))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.08197918061564827, 2.9664884030516085, 3.529877107004314, 3.359938347187312, 1.3905013280954106, 0.31779026632407864, 3.033754740348672))","Map(vectorType -> dense, length -> 2, values -> List(43.95405647835588, 6.045943521644125))","Map(vectorType -> dense, length -> 2, values -> List(0.8790811295671175, 0.12091887043288248))",0.0
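
In the output above, each `rawPrediction` vector sums to about 50, the value of `numTrees`, and the `probability` column appears to be the same vector normalized to sum to 1. The following plain-Python sketch checks that relationship using the first test row. The helper name `normalize` is hypothetical, and the argmax rule assumes the classifier's default thresholds.

```python
def normalize(raw_prediction):
    """Scale a vector of non-negative votes so that its entries sum to 1."""
    total = sum(raw_prediction)
    return [value / total for value in raw_prediction]


# rawPrediction of the first test row in the output above.
raw = [43.74952888249381, 6.2504711175061916]
prob = normalize(raw)

# With default thresholds, the predicted class is the index of the largest probability.
predicted = float(prob.index(max(prob)))

print("Sum of raw votes:", sum(raw))
print("Probability:", prob)
print("Prediction:", predicted)
```

All five preview rows give the positive class a probability near 0.12, so they are all predicted as class 0.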


### 7. Model Evaluation
Model performance on the test data is measured with the Area Under the ROC Curve (AUC-ROC). This metric measures how well the model separates the positive and negative classes: 0.5 corresponds to random ranking, and values close to 1.0 indicate strong separation.

In [0]:
# Step 7: Evaluate the model

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol=label_col,   # 'Accident_Occurred'
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"  # 'areaUnderPR' is another option
)

auc_test = evaluator.evaluate(pred_test)
print(f"AUC en test_data: {auc_test:.4f}")


AUC en test_data: 0.9815
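
One way to read the AUC is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python on hypothetical labels and scores. It is not how `BinaryClassificationEvaluator` computes the metric internally, so small numerical differences from Spark's value are possible.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, not taken from the notebook's predictions.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(f"AUC: {auc:.2f}")  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The pairwise loop is quadratic in the number of rows, so it is only meant for small illustrative inputs.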


### 8. Registering the Model in MLflow

The Machine Learning pipeline is registered in MLflow, which makes it possible to manage model versions and simplifies production deployment.
The registration includes:

- The AUC, logged as the key metric.
- The model signature, which specifies the inputs and the output.
- An input example (`input_example`) that documents how the model should be called.

In [0]:
# Step 8: Register the pipeline in MLflow

import mlflow
import mlflow.spark
import pandas as pd
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, ColSpec

with mlflow.start_run(run_name="Pipeline_RandomForest"):
    # Log the AUC metric
    mlflow.log_metric("AUC_test", auc_test)

    # 1. Define the input signature: one double column per feature
    input_schema = Schema([
        ColSpec("double", "Shift_Type"),
        ColSpec("double", "Weather_Risk_Index"),
        ColSpec("double", "Job_Risk_Level"),
        ColSpec("double", "Hours_Worked"),
        ColSpec("double", "Employee_Experience_Years"),
        ColSpec("double", "Safety_Violations"),
        ColSpec("double", "Inspection_Frequency"),
        ColSpec("double", "Temperature_C")
    ])

    # 2. Define the output signature (e.g. 'prediction' as a double)
    output_schema = Schema([
        ColSpec("double")
    ])

    signature = ModelSignature(inputs=input_schema, outputs=output_schema)

    # Example input row documenting the expected columns and types
    input_example = pd.DataFrame(
        data=[[1.0, 3.5, 2.0, 8.0, 5.0, 0.0, 3.0, 25.0]],
        columns=[
            "Shift_Type", "Weather_Risk_Index", "Job_Risk_Level", "Hours_Worked",
            "Employee_Experience_Years", "Safety_Violations", "Inspection_Frequency", "Temperature_C"
        ]
    )

    # IMPORTANT: pass 'signature=' and 'input_example=' to log_model
    mlflow.spark.log_model(
        spark_model=pipeline_model,     # the fitted pipeline
        artifact_path="random_forest_pipeline",
        registered_model_name="AccidentPrediction2025",
        signature=signature,
        input_example=input_example
    )

    # Optional: log hyperparameters
    # mlflow.log_param("maxDepth", 5)
    # mlflow.log_param("numTrees", 50)

print("✅ Pipeline registrada con firma e input_example en MLflow.")


2025/03/04 13:23:49 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().
Registered model 'AccidentPrediction2025' already exists. Creating a new version of this model...
Created version '1' of model 'mlop_final2025.default.accidentprediction2025'.
2025/03/04 13:24:25 INFO mlflow.tracking._tracking_service.client: 🏃 View run Pipeline_RandomForest at: adb-1781258311325241.1.azuredatabricks.net/ml/experiments/1516674257214700/runs/106d32f2aea04537b5c7670ca93a4f1e.
2025/03/04 13:24:25 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: adb-1781258311325241.1.azuredatabricks.net/ml/experiments/1516674257214700.


✅ Pipeline registrada con firma e input_example en MLflow.
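
The signature declares every input as a double, while several source columns (for example `Shift_Type` and `Hours_Worked`) are printed as integers in the previews. Casting explicitly before calling the model avoids relying on implicit conversion rules. The sketch below is one possible way to do that in plain Python. The helper `to_model_input` is hypothetical and not part of the notebook or of MLflow.

```python
# Feature names in the order declared in the model signature.
FEATURE_ORDER = [
    "Shift_Type", "Weather_Risk_Index", "Job_Risk_Level", "Hours_Worked",
    "Employee_Experience_Years", "Safety_Violations", "Inspection_Frequency",
    "Temperature_C",
]


def to_model_input(record):
    """Return the signature's features as floats, in order; raise if one is missing."""
    missing = [name for name in FEATURE_ORDER if name not in record]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    # Extra keys in `record` are ignored; only the declared features are kept.
    return {name: float(record[name]) for name in FEATURE_ORDER}


# First row of the step 2 preview; integer columns arrive as Python ints.
raw_record = {
    "Shift_Type": 0,
    "Weather_Risk_Index": 0.2792038645274789,
    "Job_Risk_Level": 2,
    "Hours_Worked": 8,
    "Employee_Experience_Years": 16,
    "Safety_Violations": 8,
    "Inspection_Frequency": 2,
    "Temperature_C": 35.142627099878524,
    "Area": 0,  # not a model feature, dropped by the helper
}

model_input = to_model_input(raw_record)
print(model_input)
```

The resulting dictionary can be wrapped in a single-row DataFrame with the same column names as `input_example`.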
