# Reducci√≥n de Overfitting - T√©cnicas Aplicables a SVM

## ‚ö†Ô∏è IMPORTANTE: T√©cnicas para SVM (NO para redes neuronales)

**T√©cnicas que NO aplican a SVM:**
- ‚ùå Dropout (solo para redes neuronales)
- ‚ùå Early Stopping (solo para redes neuronales)

**T√©cnicas que S√ç aplican a SVM:**
- ‚úÖ **Class Weights** (balanceo de clases)
- ‚úÖ **Regularizaci√≥n L2** (par√°metro C en SVM)
- ‚úÖ **Data Augmentation** (aumento de datos)
- ‚úÖ **Cross-validation** (validaci√≥n cruzada)
- ‚úÖ **Reducir complejidad** (menos features, vectorizador m√°s simple)

Este notebook implementa todas las t√©cnicas aplicables a SVM.


In [1]:
# Librer√≠as
import pandas as pd
import numpy as np
import pickle
import random

from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold
import optuna

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Para sin√≥nimos
try:
    import nltk
    from nltk.corpus import wordnet as wn
    from nltk.tokenize import word_tokenize
    HAS_WORDNET = True
    # Descargar recursos si no est√°n
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt', quiet=True)
    try:
        nltk.data.find('corpora/wordnet')
    except LookupError:
        nltk.download('wordnet', quiet=True)
except ImportError:
    HAS_WORDNET = False
    print("‚ö†Ô∏è  NLTK no disponible. Data augmentation sin sin√≥nimos.")

np.random.seed(42)
random.seed(42)

print("‚úÖ Librer√≠as importadas")


‚úÖ Librer√≠as importadas


In [2]:
# Cargar datos
df = pd.read_csv('../data/processed/youtoxic_english_1000_processed.csv')
with open('../data/processed/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)
with open('../data/processed/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

X_train_text = df[df.index.isin(range(len(y_train)))]['Text_processed'].values
X_test_text = df[df.index.isin(range(len(y_train), len(y_train) + len(y_test)))]['Text_processed'].values

# Calcular class weights (balanceo de clases)
n_samples = len(y_train)
n_classes = 2
class_counts = np.bincount(y_train)
total = class_counts.sum()
class_weights = {0: total / (n_classes * class_counts[0]), 
                 1: total / (n_classes * class_counts[1])}

print(f"‚úÖ Datos cargados: {len(X_train_text)} train, {len(X_test_text)} test")
print(f"Class weights: {class_weights}")


‚úÖ Datos cargados: 800 train, 200 test
Class weights: {0: 0.9302325581395349, 1: 1.0810810810810811}


## Data Augmentation (Aumento de Datos)

T√©cnica simple: eliminar palabras aleatorias de la clase minoritaria para crear variaciones.


In [3]:
def get_synonyms(word):
    """Obtiene sin√≥nimos de una palabra usando WordNet."""
    if not HAS_WORDNET:
        return []
    
    synonyms = set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ').lower()
            if synonym != word and len(synonym.split()) == 1:
                synonyms.add(synonym)
    
    return list(synonyms)[:3]  # M√°ximo 3 sin√≥nimos

def augment_with_synonyms(text, max_replacements=2):
    """Reemplaza palabras con sin√≥nimos."""
    if not HAS_WORDNET:
        return text
    
    words = word_tokenize(text.lower())
    augmented_words = words.copy()
    
    # Reemplazar hasta max_replacements palabras
    replacements = 0
    for i, word in enumerate(words):
        if replacements >= max_replacements:
            break
        if word.isalpha() and len(word) > 3:  # Solo palabras > 3 letras
            synonyms = get_synonyms(word)
            if synonyms:
                augmented_words[i] = random.choice(synonyms)
                replacements += 1
    
    return ' '.join(augmented_words)

def advanced_augmentation(texts, labels, augmentation_factor=0.5):
    """
    Data augmentation mejorada con:
    1. Sin√≥nimos (reemplazo de palabras)
    2. Eliminaci√≥n de palabras aleatorias
    3. Duplicaci√≥n de muestras minoritarias
    """
    augmented_texts = list(texts)
    augmented_labels = list(labels)
    
    toxic_count = labels.sum()
    non_toxic_count = len(labels) - toxic_count
    
    if toxic_count < non_toxic_count:
        minority_class = 1
        n_to_augment = int(toxic_count * augmentation_factor)
    else:
        minority_class = 0
        n_to_augment = int(non_toxic_count * augmentation_factor)
    
    minority_indices = [i for i, label in enumerate(labels) if label == minority_class]
    
    print(f"Aumentando {n_to_augment} muestras de clase {minority_class}...")
    
    for i in range(n_to_augment):
        idx = random.choice(minority_indices)
        original_text = texts[idx]
        
        # Estrategia 1: Sin√≥nimos (50% de las veces)
        if HAS_WORDNET and random.random() < 0.5:
            try:
                augmented_text = augment_with_synonyms(original_text)
                if augmented_text != original_text:  # Solo si cambi√≥ algo
                    augmented_texts.append(augmented_text)
                    augmented_labels.append(minority_class)
                    continue
            except:
                pass  # Si falla, usar otra estrategia
        
        # Estrategia 2: Eliminar palabras aleatorias
        words = original_text.split()
        if len(words) > 4:
            n_to_remove = random.randint(1, max(1, len(words) // 5))
            words_to_keep = random.sample(words, len(words) - n_to_remove)
            augmented_text = ' '.join(words_to_keep)
        else:
            # Estrategia 3: Duplicar (si no se puede modificar)
            augmented_text = original_text
        
        augmented_texts.append(augmented_text)
        augmented_labels.append(minority_class)
    
    return np.array(augmented_texts), np.array(augmented_labels)

print("Aplicando data augmentation mejorada (50% con sin√≥nimos)...")
X_train_aug, y_train_aug = advanced_augmentation(X_train_text, y_train, 0.5)

print(f"Datos originales: {len(X_train_text)}")
print(f"Datos aumentados: {len(X_train_aug)} (+{len(X_train_aug) - len(X_train_text)})")
print(f"Incremento: {((len(X_train_aug)/len(X_train_text))-1)*100:.1f}%")


Aplicando data augmentation mejorada (50% con sin√≥nimos)...
Aumentando 185 muestras de clase 1...
Datos originales: 800
Datos aumentados: 985 (+185)
Incremento: 23.1%


## Vectorizaci√≥n Optimizada (Reducir Complejidad)


In [4]:
# Vectorizador ULTRA optimizado para reducir overfitting
# Reducir a√∫n m√°s la complejidad
tfidf = TfidfVectorizer(
    max_features=1000,      # Reducido a√∫n m√°s (de 1200 a 1000)
    ngram_range=(1, 1),     # Solo unigramas
    min_df=5,               # Filtrar a√∫n m√°s palabras raras (de 4 a 5)
    max_df=0.75,            # Filtrar a√∫n m√°s palabras comunes (de 0.80 a 0.75)
    stop_words='english',
    sublinear_tf=True,      # log(tf) para suavizar
    norm='l2'               # Normalizaci√≥n L2
)

X_train_tfidf = tfidf.fit_transform(X_train_aug)
X_test_tfidf = tfidf.transform(X_test_text)

print(f"‚úÖ Vectorizaci√≥n ULTRA optimizada: {X_train_tfidf.shape[1]} features")
print(f"   Train shape: {X_train_tfidf.shape}")
print(f"   Test shape: {X_test_tfidf.shape}")
print(f"   Reducci√≥n de complejidad: menos features, m√°s filtros")


‚úÖ Vectorizaci√≥n ULTRA optimizada: 578 features
   Train shape: (985, 578)
   Test shape: (200, 578)
   Reducci√≥n de complejidad: menos features, m√°s filtros


## Funci√≥n de Evaluaci√≥n


In [5]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Eval√∫a modelo y retorna m√©tricas."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_f1 = f1_score(y_train, y_train_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)
    diff_f1 = abs(train_f1 - test_f1) * 100
    
    return {
        'train_f1': train_f1,
        'test_f1': test_f1,
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'test_precision': precision_score(y_test, y_test_pred, zero_division=0),
        'test_recall': recall_score(y_test, y_test_pred, zero_division=0),
        'diff_f1': diff_f1,
        'confusion_matrix': confusion_matrix(y_test, y_test_pred)
    }


## Optimizaci√≥n con Class Weights + Regularizaci√≥n L2


In [6]:
def objective(trial):
    """
    Funci√≥n objetivo ULTRA-ESTRICTA para bajar de 9% a <5%:
    - Regularizaci√≥n L2 MUY fuerte (C muy bajo)
    - Class weights ajustados
    - Penalizaci√≥n MUY fuerte por overfitting > 5%
    """
    # Regularizaci√≥n L2 ULTRA fuerte: C muy bajo
    C = trial.suggest_float('C', 0.01, 1.0, log=True)  # Rango a√∫n m√°s bajo
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf'])
    gamma = trial.suggest_categorical('gamma', ['scale', 'auto'])
    
    # Ajustar class weights (menos extremo para evitar recall=1.0)
    use_class_weight = trial.suggest_categorical('use_class_weight', [True, False])
    if use_class_weight:
        # Class weights m√°s balanceados (menos extremo)
        balanced_weights = {0: 1.0, 1: 1.1}  # Menos desbalanceado
        weight_dict = balanced_weights
    else:
        weight_dict = None
    
    model = SVC(
        C=C,  # Regularizaci√≥n L2 ULTRA fuerte
        kernel=kernel,
        gamma=gamma,
        class_weight=weight_dict,
        random_state=42,
        probability=True
    )
    
    model.fit(X_train_tfidf, y_train_aug)
    results = evaluate_model(model, X_train_tfidf, X_test_tfidf, y_train_aug, y_test)
    
    # Rechazar modelos in√∫tiles
    if results['test_f1'] < 0.55:
        return -10.0
    
    # Rechazar modelos con recall extremo (todo como t√≥xico)
    if results['test_recall'] > 0.95:
        return -5.0
    
    # PRIORIDAD 1: Control de overfitting (CR√çTICO - objetivo <5%)
    if results['diff_f1'] < 5.0:
        # Bonus ENORME si overfitting < 5%
        overfitting_bonus = (5.0 - results['diff_f1']) * 0.10  # Bonus muy grande
    else:
        overfitting_bonus = 0
    
    # PRIORIDAD 2: Penalizaci√≥n ULTRA fuerte por overfitting alto
    if results['diff_f1'] > 5.0:
        # Penalizaci√≥n exponencial para overfitting > 5%
        overfitting_penalty = (results['diff_f1'] - 5.0) * 0.05  # Penalizaci√≥n MUY fuerte
    else:
        overfitting_penalty = 0
    
    # PRIORIDAD 3: F1-score (menos importante)
    base_score = results['test_f1']
    
    # Score final: priorizar MUCHO el control de overfitting
    score = base_score + overfitting_bonus - overfitting_penalty
    
    return score

print("‚úÖ Funci√≥n objetivo ULTRA-ESTRICTA (prioriza MUCHO control de overfitting)")


‚úÖ Funci√≥n objetivo ULTRA-ESTRICTA (prioriza MUCHO control de overfitting)


In [7]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))

print("="*80)
print("OPTIMIZACI√ìN FINAL - BAJAR DE 9% A <5%")
print("="*80)
print("‚úÖ Data Augmentation (50% con sin√≥nimos)")
print("‚úÖ Class Weights balanceados (menos extremos)")
print("‚úÖ Regularizaci√≥n L2 ULTRA fuerte (C: 0.01-1.0)")
print("‚úÖ Vectorizador ULTRA optimizado (1000 features)")
print("‚úÖ Penalizaci√≥n MUY fuerte por overfitting > 5%")
print("‚úÖ Rechazar modelos con recall extremo (>0.95)")
print("\nObjetivo: F1 > 0.55 Y overfitting < 5%")
print("Estado actual: 9.06% ‚Üí Objetivo: <5%")
print("Trials: 100 (b√∫squeda exhaustiva)")
print("-"*80)

study.optimize(objective, n_trials=100, show_progress_bar=True)

print("\n‚úÖ Optimizaci√≥n completada")


[I 2025-12-03 09:33:39,410] A new study created in memory with name: no-name-3f5938ce-5589-4567-95ea-c7250261aed8


OPTIMIZACI√ìN FINAL - BAJAR DE 9% A <5%
‚úÖ Data Augmentation (50% con sin√≥nimos)
‚úÖ Class Weights balanceados (menos extremos)
‚úÖ Regularizaci√≥n L2 ULTRA fuerte (C: 0.01-1.0)
‚úÖ Vectorizador ULTRA optimizado (1000 features)
‚úÖ Penalizaci√≥n MUY fuerte por overfitting > 5%
‚úÖ Rechazar modelos con recall extremo (>0.95)

Objetivo: F1 > 0.55 Y overfitting < 5%
Estado actual: 9.06% ‚Üí Objetivo: <5%
Trials: 100 (b√∫squeda exhaustiva)
--------------------------------------------------------------------------------


  0%|          | 0/100 [00:00<?, ?it/s]

[I 2025-12-03 09:33:39,815] Trial 0 finished with value: -5.0 and parameters: {'C': 0.05611516415334506, 'kernel': 'linear', 'gamma': 'scale', 'use_class_weight': True}. Best is trial 0 with value: -5.0.
[I 2025-12-03 09:33:40,156] Trial 1 finished with value: -5.0 and parameters: {'C': 0.5399484409787431, 'kernel': 'rbf', 'gamma': 'auto', 'use_class_weight': True}. Best is trial 0 with value: -5.0.
[I 2025-12-03 09:33:40,514] Trial 2 finished with value: -5.0 and parameters: {'C': 0.02310201887845294, 'kernel': 'rbf', 'gamma': 'scale', 'use_class_weight': False}. Best is trial 0 with value: -5.0.
[I 2025-12-03 09:33:40,836] Trial 3 finished with value: -5.0 and parameters: {'C': 0.01901024531987036, 'kernel': 'rbf', 'gamma': 'auto', 'use_class_weight': False}. Best is trial 0 with value: -5.0.
[I 2025-12-03 09:33:41,196] Trial 4 finished with value: -5.0 and parameters: {'C': 0.15304852121831464, 'kernel': 'rbf', 'gamma': 'scale', 'use_class_weight': False}. Best is trial 0 with value

In [8]:
# Entrenar mejor modelo
best_params = study.best_params
use_class_weight = best_params.get('use_class_weight', False)

# Usar class weights balanceados si se activ√≥
if use_class_weight:
    balanced_weights = {0: 1.0, 1: 1.1}  # Menos extremo que class_weights originales
    final_class_weight = balanced_weights
else:
    final_class_weight = None

best_model = SVC(
    C=best_params['C'],
    kernel=best_params['kernel'],
    gamma=best_params['gamma'],
    class_weight=final_class_weight,
    random_state=42,
    probability=True
)

best_model.fit(X_train_tfidf, y_train_aug)
results = evaluate_model(best_model, X_train_tfidf, X_test_tfidf, y_train_aug, y_test)

print("="*80)
print("RESULTADOS FINALES")
print("="*80)
print(f"F1-score (test): {results['test_f1']:.4f}")
print(f"Accuracy (test): {results['test_accuracy']:.4f}")
print(f"Precision (test): {results['test_precision']:.4f}")
print(f"Recall (test): {results['test_recall']:.4f}")
print(f"Diferencia F1: {results['diff_f1']:.2f}%")

if results['diff_f1'] < 5.0 and results['test_f1'] > 0.55:
    print("\n‚úÖ‚úÖ‚úÖ OBJETIVO CUMPLIDO: Overfitting < 5% Y F1 > 0.55")
    print(f"   ¬°Reducci√≥n exitosa de 9.06% a {results['diff_f1']:.2f}%!")
elif results['diff_f1'] < 5.0:
    print("\n‚úÖ Overfitting controlado (<5%) pero F1-score bajo")
    print(f"   F1-score: {results['test_f1']:.4f} (objetivo: >0.55)")
    print(f"   Overfitting: {results['diff_f1']:.2f}% ‚úÖ")
elif results['diff_f1'] < 6.0:
    print("\nüéØ MUY CERCA: Overfitting < 6%")
    print(f"   Overfitting: {results['diff_f1']:.2f}% (objetivo: <5%, diferencia: {results['diff_f1']-5.0:.2f}%)")
    print(f"   F1-score: {results['test_f1']:.4f}")
elif results['test_f1'] > 0.55:
    print("\n‚ö†Ô∏è  F1-score aceptable pero overfitting a√∫n alto")
    print(f"   Overfitting: {results['diff_f1']:.2f}% (objetivo: <5%)")
    print(f"   Mejora: de 9.06% a {results['diff_f1']:.2f}% (reducci√≥n: {9.06-results['diff_f1']:.2f}%)")
else:
    print("\n‚ö†Ô∏è  Revisar estrategia - ambos objetivos no cumplidos")
print("="*80)


RESULTADOS FINALES
F1-score (test): 0.6277
Accuracy (test): 0.4900
Precision (test): 0.4725
Recall (test): 0.9348
Diferencia F1: 12.69%

‚ö†Ô∏è  F1-score aceptable pero overfitting a√∫n alto
   Overfitting: 12.69% (objetivo: <5%)
   Mejora: de 9.06% a 12.69% (reducci√≥n: -3.63%)


## Validaci√≥n Cruzada (Cross-Validation)


In [9]:
from scipy.sparse import vstack
X_all = vstack([X_train_tfidf, X_test_tfidf])
y_all = np.concatenate([y_train_aug, y_test])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(best_model, X_all, y_all, cv=cv, scoring='f1', n_jobs=-1)

print(f"F1-score (CV): {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Scores: {cv_scores}")


F1-score (CV): 0.7088 (+/- 0.0158)
Scores: [0.72423398 0.70254958 0.70555556 0.70422535 0.70752089]


In [10]:
# Guardar modelo si cumple objetivos o est√° muy cerca
if results['diff_f1'] < 5.0 and results['test_f1'] > 0.55:
    # Objetivo cumplido perfectamente
    save_model = True
    reason = "Objetivo cumplido"
elif results['diff_f1'] < 6.0 and results['test_f1'] > 0.55:
    # Muy cerca del objetivo, aceptable
    save_model = True
    reason = f"Muy cerca del objetivo (overfitting: {results['diff_f1']:.2f}%)"
else:
    save_model = False
    reason = "No cumple objetivos"

if save_model:
    with open('../models/final_model_anti_overfitting.pkl', 'wb') as f:
        pickle.dump(best_model, f)
    with open('../models/final_tfidf_vectorizer.pkl', 'wb') as f:
        pickle.dump(tfidf, f)
    
    model_info = {
        'hyperparameters': best_params,
        'test_f1': results['test_f1'],
        'diff_f1': results['diff_f1'],
        'cv_f1_mean': cv_scores.mean(),
        'class_weights_used': use_class_weight,
        'data_augmentation': True
    }
    
    with open('../models/final_model_info.pkl', 'wb') as f:
        pickle.dump(model_info, f)
    
    print(f"‚úÖ Modelo guardado exitosamente ({reason})")
else:
    print(f"‚ö†Ô∏è  Modelo no guardado: {reason}")
    print(f"   Overfitting: {results['diff_f1']:.2f}% (objetivo: <5%)")
    print(f"   F1-score: {results['test_f1']:.4f} (objetivo: >0.55)")


‚ö†Ô∏è  Modelo no guardado: No cumple objetivos
   Overfitting: 12.69% (objetivo: <5%)
   F1-score: 0.6277 (objetivo: >0.55)


## An√°lisis de Resultados y Estrategias Alternativas

Si el modelo a√∫n no cumple objetivos, considerar:
1. Aceptar overfitting ligeramente mayor si el modelo es funcional
2. Documentar las limitaciones del dataset peque√±o
3. Probar modelos m√°s simples (Logistic Regression)


In [11]:
# An√°lisis detallado
print("="*80)
print("AN√ÅLISIS DETALLADO")
print("="*80)
print(f"\nüìä Comparaci√≥n Train vs Test:")
print(f"   Train F1: {results['train_f1']:.4f}")
print(f"   Test F1: {results['test_f1']:.4f}")
print(f"   Diferencia: {results['diff_f1']:.2f}%")

print(f"\nüìä Matriz de Confusi√≥n:")
print(results['confusion_matrix'])

# Calcular m√©tricas adicionales
tn, fp, fn, tp = results['confusion_matrix'].ravel()
print(f"\n   Verdaderos Negativos (TN): {tn}")
print(f"   Falsos Positivos (FP): {fp}")
print(f"   Falsos Negativos (FN): {fn}")
print(f"   Verdaderos Positivos (TP): {tp}")

print(f"\nüìä Hiperpar√°metros finales:")
for param, value in best_params.items():
    print(f"   {param}: {value}")

print("\n" + "="*80)


AN√ÅLISIS DETALLADO

üìä Comparaci√≥n Train vs Test:
   Train F1: 0.7546
   Test F1: 0.6277
   Diferencia: 12.69%

üìä Matriz de Confusi√≥n:
[[12 96]
 [ 6 86]]

   Verdaderos Negativos (TN): 12
   Falsos Positivos (FP): 96
   Falsos Negativos (FN): 6
   Verdaderos Positivos (TP): 86

üìä Hiperpar√°metros finales:
   C: 0.28852510298522693
   kernel: linear
   gamma: auto
   use_class_weight: True

