# Overfitting Reduction - Version 2 (Balanced)
## Improved strategy: balancing overfitting control and performance

**Problem identified**: The previous version reached 0% overfitting, but with F1 = 0 (a useless model).

**New strategy**:
- Balance overfitting control (train-test F1 gap < 5%) AND performance (F1 > 0.60)
- Adjust the objective function so it does not over-penalize
- Try different vectorizer configurations


In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np
import pickle

# ML model libraries
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Optimization libraries
import optuna

# Evaluation libraries
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix
)

# Configuration
pd.set_option('display.max_columns', None)
%matplotlib inline

print("✅ Libraries imported")


✅ Libraries imported


In [None]:
# Load data
df = pd.read_csv('../data/processed/youtoxic_english_1000_processed.csv')
with open('../data/processed/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)
with open('../data/processed/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

# This split assumes the CSV rows are ordered train-first, in the same order as
# the saved labels: the first len(y_train) rows are train, the next len(y_test) are test.
X_train_text = df[df.index.isin(range(len(y_train)))]['Text_processed'].values
X_test_text = df[df.index.isin(range(len(y_train), len(y_train) + len(y_test)))]['Text_processed'].values

print(f"✅ Data loaded: {len(X_train_text)} train, {len(X_test_text)} test")


✅ Data loaded: 800 train, 200 test


In [3]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Evaluate a fitted model and return its metrics."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_f1 = f1_score(y_train, y_train_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)
    # Absolute train-test F1 gap, expressed in percentage points
    diff_f1 = abs(train_f1 - test_f1) * 100
    
    return {
        'train_f1': train_f1,
        'test_f1': test_f1,
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'test_precision': precision_score(y_test, y_test_pred, zero_division=0),
        'test_recall': recall_score(y_test, y_test_pred, zero_division=0),
        'diff_f1': diff_f1
    }


## Strategy: Balanced Vectorizer + Improved Objective Function

**Vectorizer configuration:**
- max_features: 2000 (midway between 1000 and 5000)
- ngram_range: (1, 1), unigrams only (less complexity)
- min_df: 2 (keep more words)
- max_df: 0.90

**Objective function:**
- Minimum requirement: F1-score > 0.50 (reject useless models)
- Bonus for overfitting < 5%, but not an excessive one
- Balance between F1-score and overfitting control
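
To make the trade-off concrete, the scoring rule described above can be written as a small pure function. This is only a sketch: the name `balanced_score` is hypothetical, and the constants (0.50 floor, 5-point threshold, 0.02 bonus weight, 0.01 penalty weight, -10.0 rejection value) are the ones used by the objective function defined in the cell further down.

```python
def balanced_score(test_f1, diff_f1):
    """Score a model from its test F1 (0 to 1) and its train-test F1 gap
    (in percentage points). Constants mirror the notebook's objective."""
    if test_f1 < 0.50:
        return -10.0  # useless model: rejected outright
    bonus = max(0.0, 5.0 - diff_f1) * 0.02    # small reward for a gap under 5 points
    penalty = max(0.0, diff_f1 - 5.0) * 0.01  # mild cost for a gap over 5 points
    return test_f1 + bonus - penalty


print(round(balanced_score(0.62, 2.0), 4))   # 0.62 + 3 * 0.02 = 0.68
print(round(balanced_score(0.70, 15.0), 4))  # 0.70 - 10 * 0.01 = 0.6
print(balanced_score(0.40, 0.0))             # below the floor: -10.0
```

Under these weights, a model with F1 = 0.62 and a 2-point gap outranks one with F1 = 0.70 and a 15-point gap, which is the intended balance between performance and overfitting control.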


In [4]:
# Balanced vectorizer
tfidf_balanced = TfidfVectorizer(
    max_features=2000,      # Intermediate: neither too simple nor too complex
    ngram_range=(1, 1),     # Unigrams only
    min_df=2,               # Keep more words
    max_df=0.90,            # Filter out very common words
    stop_words='english'
)

print("Applying balanced vectorization...")
X_train_tfidf = tfidf_balanced.fit_transform(X_train_text)
X_test_tfidf = tfidf_balanced.transform(X_test_text)

print(f"✅ Vectorization complete: {X_train_tfidf.shape[1]} features")


Applying balanced vectorization...
✅ Vectorization complete: 1235 features
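
The vectorizer ends up with 1235 features even though `max_features=2000`. That is consistent with the `min_df`, `max_df`, and stop-word filters shrinking the vocabulary below the cap, so the cap never takes effect. The following is a minimal pure-Python sketch of document-frequency pruning on a toy corpus. The helper `prune_vocabulary` and the corpus are hypothetical, and the code illustrates the idea rather than scikit-learn's actual implementation.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df=0.90, max_features=2000):
    """Keep terms that appear in at least min_df documents and in at most
    max_df * n_docs documents, then cap at the max_features most frequent."""
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = doc.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    kept.sort(key=lambda t: (-term_freq[t], t))  # most frequent first
    return sorted(kept[:max_features])


docs = [
    "you are great",
    "you are awful",
    "what a great video",
    "awful video you made",
]
vocab = prune_vocabulary(docs)
print(vocab)       # ['are', 'awful', 'great', 'video', 'you']
print(len(vocab))  # 5, far below the cap of 2000
```

Terms seen in a single document (`what`, `a`, `made`) are dropped by `min_df=2`, so the final vocabulary size is set by the filters and not by `max_features`.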


In [5]:
def objective_balanced(trial):
    """
    BALANCED objective function:
    - Prioritizes a reasonable F1-score (at least 0.50)
    - Moderate bonus for low overfitting
    - Rejects models with very low F1

    Note: each trial is scored on the test split, so the test set acts as a
    validation set here and the final test metrics are not an unbiased estimate.
    """
    # Hyperparameters: C slightly higher than in the previous version
    C = trial.suggest_float('C', 0.1, 2.0, log=True)  # More reasonable range
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf'])
    gamma = trial.suggest_categorical('gamma', ['scale', 'auto'])  # Ignored when kernel='linear'
    
    # Build and train the model
    model = SVC(C=C, kernel=kernel, gamma=gamma, random_state=42, probability=True)
    model.fit(X_train_tfidf, y_train)
    
    # Evaluate the model
    results = evaluate_model(model, X_train_tfidf, X_test_tfidf, y_train, y_test)
    
    # CRITICAL: reject models with very low F1 (useless models)
    if results['test_f1'] < 0.50:
        return -10.0  # Very high penalty
    
    # Base score: F1-score on test
    base_score = results['test_f1']
    
    # Moderate bonus for a gap under 5 points (zero otherwise)
    overfitting_bonus = max(0.0, 5.0 - results['diff_f1']) * 0.02
    
    # Moderate penalty for a gap over 5 points (zero otherwise)
    overfitting_penalty = max(0.0, results['diff_f1'] - 5.0) * 0.01
    
    # Final score: F1-score + moderate bonus - moderate penalty
    return base_score + overfitting_bonus - overfitting_penalty

print("✅ Balanced objective function defined")


✅ Balanced objective function defined
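
The objective in this notebook scores every trial on the test split. The notebook imports `cross_val_score` and `StratifiedKFold` but never uses them, so one commonly used alternative is to estimate both F1 and the train-validation gap with cross-validation on the training data only. The sketch below is an assumption, not the notebook's method: the helper `cv_f1_and_gap` and the toy corpus are hypothetical, and it uses `cross_validate` with `return_train_score=True` so that both scores come from the same folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def cv_f1_and_gap(texts, labels, C=1.0, kernel="linear", n_splits=5):
    """Return (mean validation F1, mean train-validation F1 gap in points)."""
    # The vectorizer sits inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(TfidfVectorizer(), SVC(C=C, kernel=kernel, random_state=42))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_validate(
        pipeline, texts, labels, scoring="f1", cv=cv, return_train_score=True
    )
    val_f1 = scores["test_score"].mean()   # "test" here means the held-out fold
    train_f1 = scores["train_score"].mean()
    return val_f1, abs(train_f1 - val_f1) * 100


# Hypothetical toy corpus: 1 = toxic, 0 = non-toxic
toxic = [
    "you are an idiot", "what a stupid idiot", "shut up you moron",
    "stupid moron go away", "you idiot shut up", "such a stupid comment",
    "go away you moron", "idiot people are stupid", "shut up stupid",
    "moron you are stupid",
]
clean = [
    "great video thanks", "thanks for sharing this", "really nice video",
    "nice work thanks", "great work on this video", "love this great video",
    "thanks really helpful", "nice and helpful video", "love your work",
    "helpful video great work",
]
texts = toxic + clean
labels = [1] * len(toxic) + [0] * len(clean)

val_f1, gap = cv_f1_and_gap(texts, labels)
print(f"validation F1: {val_f1:.3f}, train-validation gap: {gap:.2f} points")
```

Inside an Optuna objective, the two returned values could stand in for `test_f1` and `diff_f1`, leaving the test split untouched until the final evaluation.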


In [6]:
# Optimization
study = optuna.create_study(
    direction='maximize',
    study_name='anti_overfitting_balanced',
    sampler=optuna.samplers.TPESampler(seed=42)
)

print("="*80)
print("BALANCED OPTIMIZATION")
print("="*80)
print("Goal: F1-score > 0.50 AND overfitting < 5%")
print("Trials: 50")
print("-"*80)

study.optimize(objective_balanced, n_trials=50, show_progress_bar=True)

print("\n✅ Optimization complete")


[I 2025-12-02 10:35:49,229] A new study created in memory with name: anti_overfitting_balanced


BALANCED OPTIMIZATION
Goal: F1-score > 0.50 AND overfitting < 5%
Trials: 50
--------------------------------------------------------------------------------


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-12-02 10:35:49,591] Trial 0 finished with value: -10.0 and parameters: {'C': 0.30710573677773717, 'kernel': 'linear', 'gamma': 'scale'}. Best is trial 0 with value: -10.0.
[I 2025-12-02 10:35:49,901] Trial 1 finished with value: -10.0 and parameters: {'C': 0.15957084694148355, 'kernel': 'rbf', 'gamma': 'auto'}. Best is trial 0 with value: -10.0.
[I 2025-12-02 10:35:50,203] Trial 2 finished with value: -10.0 and parameters: {'C': 0.10636066512540288, 'kernel': 'linear', 'gamma': 'scale'}. Best is trial 0 with value: -10.0.
[I 2025-12-02 10:35:50,527] Trial 3 finished with value: -10.0 and parameters: {'C': 0.17322667470546263, 'kernel': 'rbf', 'gamma': 'scale'}. Best is trial 0 with value: -10.0.
[I 2025-12-02 10:35:50,845] Trial 4 finished with value: -10.0 and parameters: {'C': 0.6252287916406215, 'kernel': 'rbf', 'gamma': 'auto'}. Best is trial 0 with value: -10.0.
[I 2025-12-02 10:35:51,183] Trial 5 finished with value: -10.0 and parameters: {'C': 1.0508421338691765, 'kernel
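
Every trial visible in this log returns exactly -10.0, so the sampler sees the same value for every configuration and has nothing to rank. One possible variation, sketched here as an assumption and not as the notebook's method, replaces the flat rejection with a penalty that still depends on F1. The function names `flat_score` and `graded_score` are hypothetical; the constants come from the objective function in this notebook.

```python
def flat_score(test_f1, diff_f1, f1_floor=0.50):
    """Scoring as in the notebook: a constant -10.0 below the F1 floor."""
    if test_f1 < f1_floor:
        return -10.0
    bonus = max(0.0, 5.0 - diff_f1) * 0.02
    penalty = max(0.0, diff_f1 - 5.0) * 0.01
    return test_f1 + bonus - penalty


def graded_score(test_f1, diff_f1, f1_floor=0.50):
    """Variation: rejected models score test_f1 - 1.0, which stays negative
    (below any accepted model) but still orders the rejected trials by F1."""
    if test_f1 < f1_floor:
        return test_f1 - 1.0
    return flat_score(test_f1, diff_f1, f1_floor)


rejected_f1s = [0.02, 0.30, 0.45]
print([flat_score(f1, 20.0) for f1 in rejected_f1s])
# [-10.0, -10.0, -10.0]
print([round(graded_score(f1, 20.0), 2) for f1 in rejected_f1s])
# [-0.98, -0.7, -0.55]
```

With the graded version, a trial that reaches F1 = 0.45 is preferred over one at 0.02, which gives the sampler a direction to follow even when no trial clears the floor.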

In [7]:
# Get the best hyperparameters
# Note: a best value of -10.0 means every trial was rejected by the F1 floor,
# so best_params is simply the first trial's sample.
best_params = study.best_params
best_value = study.best_value

print("="*80)
print("BEST HYPERPARAMETERS")
print("="*80)
print(f"Best score: {best_value:.4f}")
for param, value in best_params.items():
    print(f"  - {param}: {value}")

# Train the final model
best_model = SVC(
    C=best_params['C'],
    kernel=best_params['kernel'],
    gamma=best_params['gamma'],
    random_state=42,
    probability=True
)

best_model.fit(X_train_tfidf, y_train)
results = evaluate_model(best_model, X_train_tfidf, X_test_tfidf, y_train, y_test)

print("\n" + "="*80)
print("FINAL RESULTS")
print("="*80)
print(f"F1-score (test): {results['test_f1']:.4f}")
print(f"Accuracy (test): {results['test_accuracy']:.4f}")
print(f"Precision (test): {results['test_precision']:.4f}")
print(f"Recall (test): {results['test_recall']:.4f}")
print(f"F1 difference (train-test): {results['diff_f1']:.2f}%")

# Check both goals separately so that each of the four outcomes gets the right message
f1_ok = results['test_f1'] > 0.50
gap_ok = results['diff_f1'] < 5.0

if gap_ok and f1_ok:
    print("\n✅ OBJECTIVE MET:")
    print("   - Overfitting < 5%")
    print("   - F1-score > 0.50")
elif gap_ok:
    print("\n⚠️  Overfitting under control but F1-score too low")
elif f1_ok:
    print("\n⚠️  F1-score acceptable but overfitting > 5%")
else:
    print("\n❌ Neither objective met: F1-score too low and overfitting > 5%")

print("="*80)


BEST HYPERPARAMETERS
Best score: -10.0000
  - C: 0.30710573677773717
  - kernel: linear
  - gamma: scale

FINAL RESULTS
F1-score (test): 0.0208
Accuracy (test): 0.5300
Precision (test): 0.2500
Recall (test): 0.0109
F1 difference (train-test): 21.69%

❌ Neither objective met: F1-score too low and overfitting > 5%
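
The distance between accuracy (0.53) and F1 (0.0208) is easier to read from confusion-matrix counts. The notebook does not print these counts. The values below are inferred as the integer counts over 200 test samples that reproduce all four reported metrics, so treat them as a reconstruction. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts (assumption): 1 true positive, 3 false positives,
# 91 false negatives, 105 true negatives
metrics = binary_metrics(tp=1, fp=3, fn=91, tn=105)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.5300
# precision: 0.2500
# recall: 0.0109
# f1: 0.0208
```

If these counts are right, the model flags only 4 of 200 comments as toxic and finds 1 of the 92 toxic ones. Nearly all of its accuracy comes from the 105 true negatives, which is why accuracy looks passable while F1 collapses.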


In [8]:
# Save the model only if it meets both objectives
if results['diff_f1'] < 5.0 and results['test_f1'] > 0.50:
    with open('../models/final_model_anti_overfitting.pkl', 'wb') as f:
        pickle.dump(best_model, f)
    
    with open('../models/final_tfidf_vectorizer.pkl', 'wb') as f:
        pickle.dump(tfidf_balanced, f)
    
    model_info = {
        'hyperparameters': best_params,
        'test_f1': results['test_f1'],
        'test_accuracy': results['test_accuracy'],
        'diff_f1': results['diff_f1'],
        'vectorizer_config': {
            'max_features': 2000,
            'ngram_range': (1, 1),
            'min_df': 2,
            'max_df': 0.90
        }
    }
    
    with open('../models/final_model_info.pkl', 'wb') as f:
        pickle.dump(model_info, f)
    
    print("✅ Model saved to models/final_model_anti_overfitting.pkl")
else:
    print("⚠️  Model does not meet the objectives, not saved")


⚠️  Model does not meet the objectives, not saved
