# 📊 MLflow Tracking - Hate Speech Detection Experiments

This notebook shows how to use MLflow to track machine learning experiments: it trains four classifiers on TF-IDF features and logs each run's parameters and metrics.


In [1]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Add src to the import path
project_root = Path('../').resolve()
sys.path.append(str(project_root / 'src'))

from models.train import train_model
from models.evaluate import evaluate_model
from features.vectorization import load_vectorized_data
from utils.mlflow_tracking import get_tracker


## 1. Load Vectorized Data


In [2]:
# Load the TF-IDF vectorized data
X_train, X_test, y_train, y_test = load_vectorized_data(
    input_dir=Path('../data/processed'),
    prefix='tfidf'
)

print("✅ Data loaded:")
print(f"   Train: {X_train.shape}")
print(f"   Test: {X_test.shape}")


✅ Vectorized data loaded from: ../data/processed
✅ Data loaded:
   Train: (800, 1000)
   Test: (200, 1000)


## 2. Initialize the MLflow Tracker


In [3]:
# Initialize the MLflow tracker
tracker = get_tracker(experiment_name="hate_speech_detection")
print(f"✅ MLflow tracker initialized: {tracker.experiment_name}")


✅ MLflow tracker initialized: hate_speech_detection
✅ MLflow tracker initialized: hate_speech_detection
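
The `get_tracker` helper lives in this project's `utils.mlflow_tracking` module, and its source is not shown here. As a rough idea of what such a wrapper might do underneath, the sketch below logs one run with MLflow's public API. The function name `log_training_run` and its parameters are hypothetical and are not the project's actual interface.

```python
try:
    import mlflow  # third-party dependency: pip install mlflow
except ImportError:  # keep this sketch importable when MLflow is absent
    mlflow = None


def log_training_run(experiment_name, run_name, params, metrics, tags=None):
    """Record one training run (hypothetical helper, not the project's tracker)."""
    if mlflow is None:
        raise RuntimeError("mlflow is not installed")
    # set_experiment creates the experiment if it does not exist yet
    mlflow.set_experiment(experiment_name)
    # The context manager ends the run when the block exits
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(params)    # hyperparameters, stored as strings
        mlflow.log_metrics(metrics)  # values must be numeric scalars
        if tags:
            mlflow.set_tags(tags)
```

A call such as `log_training_run("hate_speech_detection", "svm", {"C": 0.056}, {"test_f1": 0.6866})` would record one run under the experiment used in this notebook.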




## 3. Train and Log Models


In [4]:
# Train several models and log each one to MLflow
models_to_test = [
    {'name': 'svm', 'type': 'svm', 'params': {'C': 0.056, 'kernel': 'linear', 'class_weight': 'balanced'}},
    {'name': 'logistic', 'type': 'logistic', 'params': {'C': 0.1, 'penalty': 'l2', 'class_weight': 'balanced', 'max_iter': 1000}},
    {'name': 'naive_bayes', 'type': 'naive_bayes', 'params': {'alpha': 10.0}},
    {'name': 'random_forest', 'type': 'random_forest', 'params': {'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 5, 'class_weight': 'balanced'}}
]

results = {}

for model_config in models_to_test:
    print(f"\n🔧 Training {model_config['name']}...")

    # Train the model with the hyperparameters from its config entry
    model = train_model(
        model_type=model_config['type'],
        X_train=X_train,
        y_train=y_train,
        **model_config['params']
    )

    # Evaluate on both the train and test splits
    metrics = evaluate_model(
        model, X_train, X_test, y_train, y_test, verbose=False
    )

    results[model_config['name']] = metrics

    # Log the model, its metrics, and its parameters to MLflow
    tracker.log_model_training(
        model=model,
        model_name=model_config['name'],
        metrics=metrics,
        params=model_config['params'],
        vectorizer_type='tfidf',
        tags={'experiment': 'model_comparison', 'vectorizer': 'tfidf'}
    )

    print(f"✅ {model_config['name']} logged to MLflow")
    print(f"   F1-score (test): {metrics['test_f1']:.4f}")
    print(f"   Overfitting: {metrics['diff_f1']:.2f}%")



🔧 Training svm...
⚠️  Skipped metrics: confusion_matrix (2D array)
⚠️  Could not save the model to MLflow: No module named '_lzma'
   Metrics and parameters were saved correctly.
   To save models, install Python with full support or use the SQLite backend.
✅ Model svm logged to MLflow
✅ svm logged to MLflow
   F1-score (test): 0.6866
   Overfitting: 2.54%

🔧 Training logistic...
⚠️  Skipped metrics: confusion_matrix (2D array)
⚠️  Could not save the model to MLflow: No module named '_lzma'
   Metrics and parameters were saved correctly.
   To save models, install Python with full support or use the SQLite backend.
✅ Model logistic logged to MLflow
✅ logistic logged to MLflow
   F1-score (test): 0.7119
   Overfitting: 11.36%

🔧 Training naive_bayes...
⚠️  Skipped metrics: confusion_matrix (2D array)
⚠️  Could not save the model to MLflow: No module named '_lzma'
   Metrics and parameters were saved correctly.
   To save models, install Python with full support or use the SQLite backend.
✅ Model naive_bayes logged to MLflow
✅ naive_bayes logged to MLflow
   F1-score (test): 0.4355
   Overfitting: 30.45%

🔧 Training random_forest...
⚠️  Skipped metrics: confusion_matrix (2D array)
⚠️  Could not save the model to MLflow: No module named '_lzma'
   Metrics and parameters were saved correctly.
   To save models, install Python with full support or use the SQLite backend.
✅ Model random_forest logged to MLflow
✅ random_forest logged to MLflow
   F1-score (test): 0.6335
   Overfitting: 12.24%
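
Two warnings repeat for every model in the output above: `confusion_matrix` is skipped, and the model artifact cannot be saved because `_lzma` is missing. MLflow metrics must be numeric scalars, so a 2D array has to be filtered out or logged some other way. The `_lzma` error typically means the Python interpreter was built without the liblzma (xz) headers. The sketch below illustrates both checks with the standard library only; the helper names are hypothetical and the matrix is a placeholder, not a value from this run.

```python
import importlib.util
from numbers import Real


def split_scalar_metrics(metrics):
    """Separate numeric scalars (loggable as metrics) from everything else."""
    scalars, skipped = {}, []
    for name, value in metrics.items():
        # bool is a Real subclass; exclude it so flags are not logged as 0/1
        if isinstance(value, Real) and not isinstance(value, bool):
            scalars[name] = float(value)
        else:
            skipped.append(name)
    return scalars, skipped


def has_lzma():
    """Return True if this interpreter ships the _lzma extension module."""
    return importlib.util.find_spec("_lzma") is not None


example = {
    "test_f1": 0.6866,
    "diff_f1": 2.54,
    "confusion_matrix": [[0, 0], [0, 0]],  # placeholder 2D value
}
scalars, skipped = split_scalar_metrics(example)
print("loggable:", sorted(scalars))
print("skipped:", skipped)
print("lzma available:", has_lzma())
```

If `has_lzma()` returns `False`, installing the xz development package and rebuilding or reinstalling Python is the usual remedy.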


## 4. Compare Models


In [5]:
# Build a DataFrame with one row of results per model
comparison_data = []
for model_name, metrics in results.items():
    comparison_data.append({
        'Model': model_name,
        'F1 (test)': metrics['test_f1'],
        'F1 (train)': metrics['train_f1'],
        'Overfitting (%)': metrics['diff_f1'],
        'Accuracy (test)': metrics['test_accuracy'],
        'Precision (test)': metrics['test_precision'],
        'Recall (test)': metrics['test_recall']
    })

df_comparison = pd.DataFrame(comparison_data)
df_comparison = df_comparison.sort_values('F1 (test)', ascending=False)

print("\n📊 Model Comparison:")
print(df_comparison.to_string(index=False))



📊 Model Comparison:
        Model  F1 (test)  F1 (train)  Overfitting (%)  Accuracy (test)  Precision (test)  Recall (test)
     logistic   0.711864    0.825485        11.362036            0.745          0.741176       0.684783
          svm   0.686567    0.711930         2.536300            0.580          0.522727       1.000000
random_forest   0.633540    0.755952        12.241201            0.705          0.739130       0.554348
  naive_bayes   0.435484    0.740000        30.451613            0.650          0.843750       0.293478
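
The columns in the table above are related, and a short check makes that explicit. F1 is the harmonic mean of precision and recall. The `Overfitting (%)` column appears to be the train F1 minus the test F1, in percentage points. That second definition is inferred from the numbers in the table, not taken from the `evaluate_model` source, so treat it as an assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def overfitting_gap(train_f1, test_f1):
    """Train/test F1 difference in percentage points (assumed definition)."""
    return (train_f1 - test_f1) * 100


# Values copied from the comparison table
rows = {
    "logistic": {"precision": 0.741176, "recall": 0.684783,
                 "train_f1": 0.825485, "test_f1": 0.711864},
    "svm": {"precision": 0.522727, "recall": 1.000000,
            "train_f1": 0.711930, "test_f1": 0.686567},
    "random_forest": {"precision": 0.739130, "recall": 0.554348,
                      "train_f1": 0.755952, "test_f1": 0.633540},
    "naive_bayes": {"precision": 0.843750, "recall": 0.293478,
                    "train_f1": 0.740000, "test_f1": 0.435484},
}

for name, r in rows.items():
    f1 = f1_from(r["precision"], r["recall"])
    gap = overfitting_gap(r["train_f1"], r["test_f1"])
    print(f"{name:13s} F1={f1:.4f} gap={gap:.2f}")
```

Read this way, the SVM's small gap is less reassuring than it looks: its recall of 1.0 alongside an accuracy of 0.58 suggests it labels most test samples as positive, so logistic regression remains the stronger candidate despite its larger gap.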


## 5. View in the MLflow UI

To browse the experiments in the MLflow interface:

```bash
mlflow ui
```

Then open: http://localhost:5000
