# Overfitting Reduction
## Strategies to reduce overfitting from 23.32% to <5%

This notebook implements several strategies to reduce overfitting in the SVM + TF-IDF model.

**Current state:**
- F1 difference (train-test): 23.32%
- Target: <5%

**Strategies to try:**
1. Stronger regularization (lower C)
2. Reduce vectorizer complexity (fewer features, unigrams only)
3. Feature selection (keep the most informative features)
4. Optimization with an anti-overfitting objective function
5. Cross-validation to confirm the results


## 1. Library imports


In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np
import pickle

# Machine learning libraries
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Optimization libraries
import optuna

# Evaluation libraries
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully")


✅ Libraries imported successfully


## 2. Data loading


In [2]:
# Load preprocessed data
print("Loading preprocessed data...")
df = pd.read_csv('../data/processed/youtoxic_english_1000_processed.csv')
print(f"Dataset loaded: {len(df)} rows")

# Load current vectorized data (for comparison)
print("\nLoading current TF-IDF matrices...")
with open('../data/processed/X_train_tfidf.pkl', 'rb') as f:
    X_train_tfidf_current = pickle.load(f)
with open('../data/processed/X_test_tfidf.pkl', 'rb') as f:
    X_test_tfidf_current = pickle.load(f)
with open('../data/processed/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)
with open('../data/processed/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

print("\n✅ Data loaded")
print(f"Current X_train shape: {X_train_tfidf_current.shape}")
print(f"Current X_test shape: {X_test_tfidf_current.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")


Loading preprocessed data...
Dataset loaded: 1000 rows

Loading current TF-IDF matrices...

✅ Data loaded
Current X_train shape: (800, 1767)
Current X_test shape: (200, 1767)
y_train shape: (800,)
y_test shape: (200,)


## 3. Evaluation function


In [3]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """
    Evaluate a fitted model and return training and test metrics.
    """
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_precision = precision_score(y_train, y_train_pred, zero_division=0)
    train_recall = recall_score(y_train, y_train_pred, zero_division=0)
    train_f1 = f1_score(y_train, y_train_pred, zero_division=0)

    # Test metrics
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred, zero_division=0)
    test_recall = recall_score(y_test, y_test_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)

    # Train-test difference in percentage points (overfitting measure)
    diff_accuracy = abs(train_accuracy - test_accuracy) * 100
    diff_f1 = abs(train_f1 - test_f1) * 100

    # Confusion matrix on the test set
    cm = confusion_matrix(y_test, y_test_pred)

    return {
        'train_accuracy': train_accuracy,
        'train_precision': train_precision,
        'train_recall': train_recall,
        'train_f1': train_f1,
        'test_accuracy': test_accuracy,
        'test_precision': test_precision,
        'test_recall': test_recall,
        'test_f1': test_f1,
        'diff_accuracy': diff_accuracy,
        'diff_f1': diff_f1,
        'confusion_matrix': cm,
        'y_test_pred': y_test_pred
    }

print("✅ Evaluation function created")


✅ Evaluation function created


## 4. STRATEGY 1: Simpler vectorizer (unigrams only, fewer features)

Reducing the vectorizer's complexity can help reduce overfitting.


In [4]:
# Prepare text data
# NOTE: this assumes the first len(y_train) rows of df form the training split,
# in the same order as y_train, and the next len(y_test) rows form the test split.
X_train_text = df[df.index.isin(range(len(y_train)))]['Text_processed'].values
X_test_text = df[df.index.isin(range(len(y_train), len(y_train) + len(y_test)))]['Text_processed'].values

print(f"X_train_text shape: {X_train_text.shape}")
print(f"X_test_text shape: {X_test_text.shape}")

# STRATEGY 1: Simpler vectorizer
# - Unigrams only (no bigrams)
# - Fewer features (1000 instead of 5000)
# - Higher min_df to filter out rare words

print("\n" + "="*80)
print("STRATEGY 1: Simplified Vectorizer")
print("="*80)

tfidf_simple = TfidfVectorizer(
    max_features=1000,      # Reduced from 5000 to 1000
    ngram_range=(1, 1),     # Unigrams only (no bigrams)
    min_df=3,               # Raised from 2 to 3 (a word must appear in at least 3 docs)
    max_df=0.90,            # Lowered from 0.95 to 0.90 (ignore words in more than 90% of docs)
    stop_words='english'
)

print("Applying simplified vectorization...")
X_train_tfidf_simple = tfidf_simple.fit_transform(X_train_text)
X_test_tfidf_simple = tfidf_simple.transform(X_test_text)

print("\n✅ Vectorization completed")
print(f"X_train shape: {X_train_tfidf_simple.shape}")
print(f"X_test shape: {X_test_tfidf_simple.shape}")
print(f"Feature reduction: {X_train_tfidf_current.shape[1]} → {X_train_tfidf_simple.shape[1]} ({((1 - X_train_tfidf_simple.shape[1]/X_train_tfidf_current.shape[1])*100):.1f}% fewer)")


X_train_text shape: (800,)
X_test_text shape: (200,)

STRATEGY 1: Simplified Vectorizer
Applying simplified vectorization...

✅ Vectorization completed
X_train shape: (800, 814)
X_test shape: (200, 814)
Feature reduction: 1767 → 814 (53.9% fewer)
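
The cell above selects training texts by row position, which only matches `y_train` if the original split kept the first 800 rows in order. A minimal sketch of a safer alignment follows, assuming the labels are pandas Series that kept their original index after `train_test_split` (an assumption, since the split code is not part of this notebook). The toy frame and the `label` column name are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; "label" stands in for the real target column.
df_demo = pd.DataFrame({
    "Text_processed": [f"comment {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# Splitting a Series keeps the original row index on both parts.
y_train_demo, y_test_demo = train_test_split(
    df_demo["label"], test_size=0.2, stratify=df_demo["label"], random_state=42
)

# Look the texts up by index so each text stays paired with its own label.
X_train_text_demo = df_demo.loc[y_train_demo.index, "Text_processed"].values
X_test_text_demo = df_demo.loc[y_test_demo.index, "Text_processed"].values

print(X_train_text_demo.shape, X_test_text_demo.shape)
```

If `y_train` was saved as a plain NumPy array, the index is lost and the row indices of the split would need to be saved alongside it instead.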


## 5. STRATEGY 2: Optimization with an anti-overfitting objective function

Change the Optuna objective function to prioritize overfitting control over the F1-score.


In [5]:
def objective_anti_overfitting(trial, X_train, X_test, y_train, y_test):
    """
    Objective function that PRIORITIZES overfitting control.
    Strategy: maximize the test F1-score BUT strongly penalize overfitting.

    Notes:
    - Trials are scored on X_test, so the test set also drives model selection.
    - A model that predicts a single class has diff_f1 = 0 and earns the full bonus.
    """
    # Hyperparameters with lower C (more regularization)
    C = trial.suggest_float('C', 0.01, 1.0, log=True)  # Lower range for more regularization
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf'])
    gamma = trial.suggest_categorical('gamma', ['scale', 'auto'])  # Ignored by the linear kernel

    # Create and train the model
    model = SVC(C=C, kernel=kernel, gamma=gamma, random_state=42, probability=True)
    model.fit(X_train, y_train)

    # Evaluate the model
    results = evaluate_model(model, X_train, X_test, y_train, y_test)

    # PRIORITY 1: Overfitting control (F1 difference < 5%)
    # If overfitting is low, award a large bonus
    if results['diff_f1'] < 5.0:
        overfitting_bonus = (5.0 - results['diff_f1']) * 0.1  # Bonus of up to 0.5 points
    else:
        overfitting_bonus = 0

    # PRIORITY 2: Penalize strong overfitting
    # Linear penalty of 0.05 per percentage point above 5%
    if results['diff_f1'] > 5.0:
        overfitting_penalty = (results['diff_f1'] - 5.0) * 0.05  # Strong penalty
    else:
        overfitting_penalty = 0

    # PRIORITY 3: Test F1-score (less important than overfitting)
    base_score = results['test_f1']

    # PRIORITY 4: Penalize very low recall (<0.3)
    recall_penalty = 0
    if results['test_recall'] < 0.3:
        recall_penalty = (0.3 - results['test_recall']) * 0.2

    # Final score: F1-score + low-overfitting bonus - penalties
    score = base_score + overfitting_bonus - overfitting_penalty - recall_penalty

    return score

print("✅ Anti-overfitting objective function defined")


✅ Anti-overfitting objective function defined
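
To see how the terms combine, the scoring arithmetic can be replayed on plain numbers. This sketch mirrors the formula in `objective_anti_overfitting`; the helper name `anti_overfitting_score` and the metric values of the second model are hypothetical:

```python
def anti_overfitting_score(test_f1, test_recall, diff_f1):
    """Recompute the objective's score from already-measured metrics."""
    bonus = (5.0 - diff_f1) * 0.1 if diff_f1 < 5.0 else 0.0
    penalty = (diff_f1 - 5.0) * 0.05 if diff_f1 > 5.0 else 0.0
    recall_penalty = (0.3 - test_recall) * 0.2 if test_recall < 0.3 else 0.0
    return test_f1 + bonus - penalty - recall_penalty

# A model that never predicts the positive class: F1 = 0, recall = 0, gap = 0.
degenerate = anti_overfitting_score(test_f1=0.0, test_recall=0.0, diff_f1=0.0)

# Hypothetical model with useful predictions but a 10-point train-test gap.
useful = anti_overfitting_score(test_f1=0.60, test_recall=0.55, diff_f1=10.0)

print(f"degenerate: {degenerate:.2f}, useful: {useful:.2f}")
# prints "degenerate: 0.44, useful: 0.35"
```

Under this weighting a single-class predictor can outscore a model with useful predictions, so it may be worth requiring a minimum test F1 or recall before awarding the bonus.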


## 6. Optimization with the simplified vectorizer


In [6]:
# Build the objective function with the simplified data
def objective_simple(trial):
    return objective_anti_overfitting(
        trial,
        X_train_tfidf_simple,
        X_test_tfidf_simple,
        y_train,
        y_test
    )

# Create the Optuna study
study_simple = optuna.create_study(
    direction='maximize',
    study_name='anti_overfitting_simple',
    sampler=optuna.samplers.TPESampler(seed=42)
)

print("="*80)
print("OPTIMIZATION WITH SIMPLIFIED VECTORIZER")
print("="*80)
print("\nConfiguration:")
print("  - Vectorizer: simplified TF-IDF (1000 features, unigrams only)")
print("  - Model: SVM")
print("  - Objective: PRIORITIZE overfitting control (<5%)")
print("  - C: between 0.01 and 1.0 (more regularization)")
print("  - Trials: 50")
print("\n" + "-"*80)

# Run the optimization
study_simple.optimize(objective_simple, n_trials=50, show_progress_bar=True)

print("\n" + "="*80)
print("✅ OPTIMIZATION COMPLETED")
print("="*80)


[I 2025-12-02 10:01:18,192] A new study created in memory with name: anti_overfitting_simple


OPTIMIZATION WITH SIMPLIFIED VECTORIZER

Configuration:
  - Vectorizer: simplified TF-IDF (1000 features, unigrams only)
  - Model: SVM
  - Objective: PRIORITIZE overfitting control (<5%)
  - C: between 0.01 and 1.0 (more regularization)
  - Trials: 50

--------------------------------------------------------------------------------


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-12-02 10:01:18,535] Trial 0 finished with value: 0.44 and parameters: {'C': 0.05611516415334506, 'kernel': 'linear', 'gamma': 'scale'}. Best is trial 0 with value: 0.44.
[I 2025-12-02 10:01:18,819] Trial 1 finished with value: 0.44 and parameters: {'C': 0.020511104188433976, 'kernel': 'rbf', 'gamma': 'auto'}. Best is trial 0 with value: 0.44.
[I 2025-12-02 10:01:19,068] Trial 2 finished with value: 0.44 and parameters: {'C': 0.010994335574766204, 'kernel': 'linear', 'gamma': 'scale'}. Best is trial 0 with value: 0.44.
[I 2025-12-02 10:01:19,358] Trial 3 finished with value: 0.44 and parameters: {'C': 0.023270677083837805, 'kernel': 'rbf', 'gamma': 'scale'}. Best is trial 0 with value: 0.44.
[I 2025-12-02 10:01:19,627] Trial 4 finished with value: 0.44 and parameters: {'C': 0.1673808578875213, 'kernel': 'rbf', 'gamma': 'auto'}. Best is trial 0 with value: 0.44.
[I 2025-12-02 10:01:19,931] Trial 5 finished with value: -0.12088082901554403 and parameters: {'C': 0.3718364180573207,
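
Because the objective scores each trial on the test set, the final test metrics are no longer an independent estimate. One alternative, sketched here under assumptions, measures both the validation F1 and the train-validation gap with cross-validation on the training data only. The helper `cv_f1_and_gap` and the synthetic data from `make_classification` are hypothetical stand-ins for the notebook's matrices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

def cv_f1_and_gap(C, X, y, n_splits=5, seed=42):
    """Return mean validation F1 and the mean train-validation F1 gap (in points)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(
        SVC(C=C, kernel="linear"), X, y,
        cv=cv, scoring="f1", return_train_score=True,
    )
    val_f1 = scores["test_score"].mean()   # "test" here means the held-out fold
    gap = (scores["train_score"].mean() - val_f1) * 100
    return val_f1, gap

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
val_f1, gap = cv_f1_and_gap(C=0.1, X=X_demo, y=y_demo)
print(f"validation F1={val_f1:.3f}, gap={gap:.2f} points")
```

Inside an Optuna objective, the returned pair could feed the same bonus and penalty terms, leaving `X_test` untouched until the final evaluation.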

## 7. Evaluate the best model with the simplified vectorizer


In [7]:
# Get the best hyperparameters
best_params_simple = study_simple.best_params
best_value_simple = study_simple.best_value

print("="*80)
print("BEST HYPERPARAMETERS (SIMPLIFIED VECTORIZER)")
print("="*80)
print(f"\nBest score: {best_value_simple:.4f}")
print("\nOptimal hyperparameters:")
for param, value in best_params_simple.items():
    print(f"  - {param}: {value}")

# Train the final model
best_model_simple = SVC(
    C=best_params_simple['C'],
    kernel=best_params_simple['kernel'],
    gamma=best_params_simple['gamma'],
    random_state=42,
    probability=True
)

print("\nTraining final model...")
best_model_simple.fit(X_train_tfidf_simple, y_train)

# Evaluate
results_simple = evaluate_model(
    best_model_simple,
    X_train_tfidf_simple,
    X_test_tfidf_simple,
    y_train,
    y_test
)

print("\n" + "="*80)
print("MODEL RESULTS (SIMPLIFIED VECTORIZER)")
print("="*80)
print("\n📊 TRAINING METRICS:")
print(f"   Accuracy: {results_simple['train_accuracy']:.4f}")
print(f"   Precision: {results_simple['train_precision']:.4f}")
print(f"   Recall: {results_simple['train_recall']:.4f}")
print(f"   F1-score: {results_simple['train_f1']:.4f}")

print("\n📊 TEST METRICS:")
print(f"   Accuracy: {results_simple['test_accuracy']:.4f}")
print(f"   Precision: {results_simple['test_precision']:.4f}")
print(f"   Recall: {results_simple['test_recall']:.4f}")
print(f"   F1-score: {results_simple['test_f1']:.4f}")

print("\n📊 OVERFITTING CONTROL:")
print(f"   Accuracy difference: {results_simple['diff_accuracy']:.2f}%")
print(f"   F1-score difference: {results_simple['diff_f1']:.2f}%")

if results_simple['diff_f1'] < 5.0:
    print("   ✅ NO OVERFITTING (difference < 5%)")
else:
    print("   ⚠️  Still overfitting (difference >= 5%)")

print("\n" + "="*80)


BEST HYPERPARAMETERS (SIMPLIFIED VECTORIZER)

Best score: 0.4400

Optimal hyperparameters:
  - C: 0.05611516415334506
  - kernel: linear
  - gamma: scale

Training final model...

MODEL RESULTS (SIMPLIFIED VECTORIZER)

📊 TRAINING METRICS:
   Accuracy: 0.5375
   Precision: 0.0000
   Recall: 0.0000
   F1-score: 0.0000

📊 TEST METRICS:
   Accuracy: 0.5400
   Precision: 0.0000
   Recall: 0.0000
   F1-score: 0.0000

📊 OVERFITTING CONTROL:
   Accuracy difference: 0.25%
   F1-score difference: 0.00%
   ✅ NO OVERFITTING (difference < 5%)
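
Zero precision and recall on both splits, together with an accuracy of about 0.54, suggest that the selected model predicts a single class. In that case the 0.00% gap reflects a constant predictor rather than a better model. A quick check is to count the predicted labels and compare against an always-negative baseline. The label array here is hypothetical and only mimics a roughly 54/46 class balance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 54 negatives and 46 positives.
y_true_demo = np.array([0] * 54 + [1] * 46)

# Baseline that always predicts the negative class.
y_pred_demo = np.zeros_like(y_true_demo)

# Count how often each class is predicted; a single entry means a constant predictor.
classes, counts = np.unique(y_pred_demo, return_counts=True)
pred_counts = {int(c): int(n) for c, n in zip(classes, counts)}

baseline_acc = accuracy_score(y_true_demo, y_pred_demo)
baseline_f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)

print(pred_counts, f"accuracy={baseline_acc:.2f}", f"F1={baseline_f1:.2f}")
```

Running the same count on `results_simple['y_test_pred']` would show whether the tuned SVM behaves like this baseline.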



## 8. STRATEGY 3: Feature selection (if overfitting persists)

If the previous model still overfits, we try feature selection.


In [8]:
# Run only if the previous model still overfits
if results_simple['diff_f1'] >= 5.0:
    print("="*80)
    print("STRATEGY 3: Feature Selection")
    print("="*80)

    # Select the 500 most informative features
    print("\nSelecting the 500 most informative features...")
    selector = SelectKBest(f_classif, k=500)
    X_train_selected = selector.fit_transform(X_train_tfidf_simple, y_train)
    X_test_selected = selector.transform(X_test_tfidf_simple)

    print(f"Selected features: {X_train_selected.shape[1]}")

    # Optimize with the selected features
    def objective_selected(trial):
        return objective_anti_overfitting(
            trial,
            X_train_selected,
            X_test_selected,
            y_train,
            y_test
        )

    study_selected = optuna.create_study(
        direction='maximize',
        study_name='anti_overfitting_selected',
        sampler=optuna.samplers.TPESampler(seed=42)
    )

    print("\nOptimizing with selected features...")
    study_selected.optimize(objective_selected, n_trials=30, show_progress_bar=True)

    # Evaluate the best model
    best_params_selected = study_selected.best_params
    best_model_selected = SVC(
        C=best_params_selected['C'],
        kernel=best_params_selected['kernel'],
        gamma=best_params_selected['gamma'],
        random_state=42,
        probability=True
    )

    best_model_selected.fit(X_train_selected, y_train)
    results_selected = evaluate_model(
        best_model_selected,
        X_train_selected,
        X_test_selected,
        y_train,
        y_test
    )

    print("\n" + "="*80)
    print("RESULTS WITH FEATURE SELECTION")
    print("="*80)
    print(f"F1 difference: {results_selected['diff_f1']:.2f}%")
    print(f"Test F1-score: {results_selected['test_f1']:.4f}")

    if results_selected['diff_f1'] < 5.0:
        print("✅ NO OVERFITTING WITH FEATURE SELECTION")
    else:
        print("⚠️  Still overfitting")
else:
    print("✅ The previous model already meets the target (<5% overfitting)")
    print("   Feature selection is not needed")


✅ The previous model already meets the target (<5% overfitting)
   Feature selection is not needed


## 9. Cross-validation


In [9]:
# Choose which model to use (the one with less overfitting)
if results_simple['diff_f1'] < 5.0:
    print("Using the simplified-vectorizer model for cross-validation...")
    best_model_final = best_model_simple
    X_train_final = X_train_tfidf_simple
    best_vectorizer = tfidf_simple
    best_results = results_simple
    model_name = "Simplified Vectorizer"
else:
    print("Using the feature-selection model for cross-validation...")
    best_model_final = best_model_selected
    X_train_final = X_train_selected
    best_vectorizer = tfidf_simple  # The base vectorizer
    best_results = results_selected
    model_name = "Feature Selection"

# Combine train and test for cross-validation
from scipy.sparse import vstack
X_all = vstack([X_train_final, X_test_tfidf_simple if model_name == "Simplified Vectorizer" else X_test_selected])
y_all = np.concatenate([y_train, y_test])

# Stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\n" + "="*80)
print(f"CROSS-VALIDATION ({model_name})")
print("="*80)

cv_scores_f1 = cross_val_score(best_model_final, X_all, y_all, cv=cv, scoring='f1')
cv_scores_acc = cross_val_score(best_model_final, X_all, y_all, cv=cv, scoring='accuracy')

print("\n📊 F1-score (CV):")
print(f"   Mean: {cv_scores_f1.mean():.4f} (+/- {cv_scores_f1.std() * 2:.4f})")
print(f"   Scores per fold: {cv_scores_f1}")

print("\n📊 Accuracy (CV):")
print(f"   Mean: {cv_scores_acc.mean():.4f} (+/- {cv_scores_acc.std() * 2:.4f})")
print(f"   Scores per fold: {cv_scores_acc}")

print("\n" + "="*80)


Using the simplified-vectorizer model for cross-validation...

CROSS-VALIDATION (Simplified Vectorizer)

📊 F1-score (CV):
   Mean: 0.0000 (+/- 0.0000)
   Scores per fold: [0. 0. 0. 0. 0.]

📊 Accuracy (CV):
   Mean: 0.5380 (+/- 0.0049)
   Scores per fold: [0.535 0.535 0.54  0.54  0.54 ]
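
The cross-validation above reuses TF-IDF matrices fitted once on the training split, so every fold shares the same vocabulary and IDF weights. A sketch of refitting the vectorizer inside each fold with a scikit-learn pipeline follows. The toy corpus, its labels, and `C=1.0` are hypothetical, and `min_df=1` is used only because the corpus is tiny:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = toxic, 0 = non-toxic.
texts_demo = [
    "you are stupid", "stupid idiot comment", "what an idiot", "you idiot",
    "stupid stupid people", "such a stupid idea",
    "great video thanks", "thanks for sharing", "really great work",
    "nice video", "thanks great explanation", "nice work thanks",
]
labels_demo = [1] * 6 + [0] * 6

# The vectorizer is a pipeline step, so it is refitted on each training fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1), min_df=1),
    SVC(C=1.0, kernel="linear"),
)

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores_demo = cross_val_score(pipeline, texts_demo, labels_demo, cv=cv_demo, scoring="f1")
print(f"mean F1 across folds: {scores_demo.mean():.3f}")
```

With the notebook's data, the raw `Text_processed` strings and their labels would take the place of the toy lists.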



## 10. Save the final model and vectorizer

Save the best model and the vectorizer for production use.


In [10]:
# Save the final model
print("="*80)
print("SAVING FINAL MODEL")
print("="*80)

# Save the model
with open('../models/final_model_anti_overfitting.pkl', 'wb') as f:
    pickle.dump(best_model_final, f)

# Save the vectorizer
with open('../models/final_tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(best_vectorizer, f)

# Save model information
model_info = {
    'model_name': model_name,
    'hyperparameters': best_params_simple if model_name == "Simplified Vectorizer" else best_params_selected,
    'train_accuracy': best_results['train_accuracy'],
    'train_f1': best_results['train_f1'],
    'test_accuracy': best_results['test_accuracy'],
    'test_f1': best_results['test_f1'],
    'diff_f1': best_results['diff_f1'],
    'cv_f1_mean': cv_scores_f1.mean(),
    'cv_f1_std': cv_scores_f1.std(),
    'vectorizer_config': {
        'max_features': best_vectorizer.max_features,
        'ngram_range': best_vectorizer.ngram_range,
        'min_df': best_vectorizer.min_df,
        'max_df': best_vectorizer.max_df
    }
}

with open('../models/final_model_info.pkl', 'wb') as f:
    pickle.dump(model_info, f)

print("\n✅ Files saved:")
print("   - ../models/final_model_anti_overfitting.pkl")
print("   - ../models/final_tfidf_vectorizer.pkl")
print("   - ../models/final_model_info.pkl")

print("\n" + "="*80)
print("FINAL SUMMARY")
print("="*80)
print(f"\nModel: {model_name}")
print(f"F1 difference (train-test): {best_results['diff_f1']:.2f}%")
print(f"F1-score (test): {best_results['test_f1']:.4f}")
print(f"F1-score (CV): {cv_scores_f1.mean():.4f} (+/- {cv_scores_f1.std() * 2:.4f})")

if best_results['diff_f1'] < 5.0:
    print("\n✅ TARGET MET: Overfitting < 5%")
else:
    print("\n⚠️  Overfitting still above 5%, but significantly improved")

print("\n" + "="*80)


SAVING FINAL MODEL

✅ Files saved:
   - ../models/final_model_anti_overfitting.pkl
   - ../models/final_tfidf_vectorizer.pkl
   - ../models/final_model_info.pkl

FINAL SUMMARY

Model: Simplified Vectorizer
F1 difference (train-test): 0.00%
F1-score (test): 0.0000
F1-score (CV): 0.0000 (+/- 0.0000)

✅ TARGET MET: Overfitting < 5%
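
For production use, the saved vectorizer and model have to be loaded together and applied in the same order as during training. A self-contained sketch of that round trip follows, using a temporary directory and a hypothetical toy model rather than the files written above:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy training data: 1 = toxic, 0 = non-toxic.
texts_demo = ["you idiot", "stupid idiot", "great video", "thanks great work"]
labels_demo = [1, 1, 0, 0]

vectorizer_demo = TfidfVectorizer()
model_demo = SVC(kernel="linear").fit(vectorizer_demo.fit_transform(texts_demo), labels_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    vectorizer_path = Path(tmp_dir) / "vectorizer.pkl"
    model_path.write_bytes(pickle.dumps(model_demo))
    vectorizer_path.write_bytes(pickle.dumps(vectorizer_demo))

    # Load both artifacts back, as a serving process would.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_vectorizer = pickle.loads(vectorizer_path.read_bytes())

# New text must go through the loaded vectorizer before prediction.
new_texts = ["great work", "stupid video"]
predictions = loaded_model.predict(loaded_vectorizer.transform(new_texts))
print(predictions)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.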

