# Naive Bayes - Reducing Overfitting

## Objective
Test Naive Bayes with anti-overfitting techniques as an alternative to more complex models, aiming to reduce overfitting while keeping the F1-score above 0.55.

## Advantages of Naive Bayes
- ✅✅✅ Simple model (less prone to overfitting)
- ✅✅✅ Built-in regularization (the alpha smoothing parameter)
- ✅✅✅ Fast to train and predict
- ✅✅✅ Works well with TF-IDF
- ✅✅✅ Fewer parameters = lower risk of overfitting
- ✅ Good baseline for comparison with other models
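
To make the "built-in regularization" point concrete, here is a minimal sketch of how `alpha` changes the per-class feature probabilities that `MultinomialNB` learns. The count matrix and the helper `feature_probs` are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, 3 features (values are illustrative only).
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 3, 0],
])
y = np.array([0, 0, 1, 1])

def feature_probs(alpha):
    """Return P(feature | class) for each class under smoothing `alpha`."""
    model = MultinomialNB(alpha=alpha).fit(X, y)
    return np.exp(model.feature_log_prob_)

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}:")
    print(feature_probs(alpha).round(3))
```

As `alpha` grows, each row moves toward a uniform distribution, so no single feature can dominate the prediction. That is the sense in which larger values regularize more.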


## 1. Importing libraries


In [1]:
import pandas as pd
import numpy as np
import pickle
import random

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold
import optuna

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix
)

np.random.seed(42)
random.seed(42)

print("✅ Libraries imported")


✅ Libraries imported


## 2. Loading the data


In [2]:
# Load data
df = pd.read_csv('../data/processed/youtoxic_english_1000_processed.csv')
with open('../data/processed/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)
with open('../data/processed/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

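# Assumes the first len(y_train) rows of the CSV are the training rows and the
# next len(y_test) rows are the test rows, in the same order as the pickled labels.
X_train_text = df[df.index.isin(range(len(y_train)))]['Text_processed'].values
X_test_text = df[df.index.isin(range(len(y_train), len(y_train) + len(y_test)))]['Text_processed'].values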

print(f"✅ Data loaded: {len(X_train_text)} train, {len(X_test_text)} test")
print(f"Train distribution: {np.bincount(y_train)}")
print(f"Test distribution: {np.bincount(y_test)}")


✅ Data loaded: 800 train, 200 test
Train distribution: [430 370]
Test distribution: [108  92]
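
The index filter in the loading cell selects rows by position, so it only pairs texts with the right labels if the CSV rows are stored in the same order as the pickled labels. The notebook does not show how the labels were split, so that ordering is an assumption. The sketch below uses a toy frame (`df_demo` and the sizes are hypothetical) to show that the filter is equivalent to plain positional slicing on a default index.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the processed CSV; read_csv yields the same default RangeIndex.
df_demo = pd.DataFrame({"Text_processed": [f"doc {i}" for i in range(10)]})
n_train, n_test = 8, 2

# Selection as written in the notebook.
by_isin = df_demo[df_demo.index.isin(range(n_train))]["Text_processed"].values

# Equivalent positional slices.
by_position = df_demo["Text_processed"].iloc[:n_train].values
test_part = df_demo["Text_processed"].iloc[n_train:n_train + n_test].values

print(np.array_equal(by_isin, by_position))
print(len(by_position), len(test_part))
```

If the labels came from a shuffled split instead, the texts would need to be selected with the saved split indices rather than by position.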


## 3. Optimized vectorization


In [3]:
# Vectorization tuned to reduce overfitting
tfidf = TfidfVectorizer(
    max_features=600,        # Fewer features = less overfitting
    ngram_range=(1, 2),      # Bigrams to capture context
    min_df=3,                # Drop terms that appear in fewer than 3 documents
    max_df=0.85,             # Drop terms that appear in more than 85% of documents
    stop_words='english',
    sublinear_tf=True,
    norm='l2'
)

# NO augmentation (Naive Bayes is simple and does not need as much)
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)

print(f"✅ Vectorization: {X_train_tfidf.shape[1]} features")
print(f"   Train shape: {X_train_tfidf.shape}")
print(f"   Test shape: {X_test_tfidf.shape}")


✅ Vectorization: 600 features
   Train shape: (800, 600)
   Test shape: (200, 600)
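
A small sketch of what the `min_df` and `max_df` filters do to the vocabulary. The four-document corpus and the thresholds here are illustrative assumptions, chosen so the effect is visible on a tiny example. They are not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "you are great",
    "you are awful",
    "you are great today",
    "you look awful",
]

# No filtering: every unigram that appears is kept.
full = TfidfVectorizer().fit(docs)

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.9 drops terms seen in more than 90% of documents.
filtered = TfidfVectorizer(min_df=2, max_df=0.9).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Here "today" and "look" occur in one document each and are dropped by `min_df`, while "you" occurs in every document and is dropped by `max_df`.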


## 4. Evaluation function


In [4]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Evaluate a model and return its metrics."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_f1 = f1_score(y_train, y_train_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)
    diff_f1 = abs(train_f1 - test_f1) * 100
    
    return {
        'train_f1': train_f1,
        'test_f1': test_f1,
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'test_precision': precision_score(y_test, y_test_pred, zero_division=0),
        'test_recall': recall_score(y_test, y_test_pred, zero_division=0),
        'diff_f1': diff_f1,
        'confusion_matrix': confusion_matrix(y_test, y_test_pred)
    }


## 5. Objective function for Optuna


In [5]:
def objective(trial):
    """
    Objective function for Naive Bayes:
    - Regularization via alpha (smoothing parameter)
    - Prioritize overfitting <5% and F1 >0.55
    - Naive Bayes is simple, so less prone to overfitting
    """
    # Alpha: regularization (smoothing) parameter
    # Higher values = more regularization = less overfitting
    alpha = trial.suggest_float('alpha', 0.1, 10.0, log=True)

    # fit_prior: whether to learn class prior probabilities from the data
    fit_prior = trial.suggest_categorical('fit_prior', [True, False])

    model = MultinomialNB(
        alpha=alpha,
        fit_prior=fit_prior,
        class_prior=None  # Computed automatically
    )

    model.fit(X_train_tfidf, y_train)
    results = evaluate_model(model, X_train_tfidf, X_test_tfidf, y_train, y_test)

    # Reject useless models
    if results['test_f1'] < 0.50:
        return -10.0

    # Reject extreme overfitting
    if results['diff_f1'] > 8.0:
        return -20.0

    # PRIORITY 1: Overfitting control
    if results['diff_f1'] < 5.0:
        overfitting_bonus = (5.0 - results['diff_f1']) * 0.50  # Large bonus
    else:
        overfitting_bonus = 0

    # PRIORITY 2: Overfitting penalty
    if results['diff_f1'] > 5.0:
        overfitting_penalty = ((results['diff_f1'] - 5.0) ** 2) * 0.05
    else:
        overfitting_penalty = 0

    # PRIORITY 3: Base F1-score
    base_score = results['test_f1'] * 0.4

    score = base_score + overfitting_bonus - overfitting_penalty
    return score

print("✅ Objective function defined (prioritizes overfitting <5%)")


✅ Objective function defined (prioritizes overfitting <5%)
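
The scoring rule inside `objective` can be isolated as a pure function, which makes its behaviour easy to check by hand. The helper name `composite_score` is hypothetical; the thresholds and weights are copied from the cell above.

```python
def composite_score(test_f1, diff_f1):
    """Score used by the objective; `diff_f1` is the train/test F1 gap in percentage points."""
    if test_f1 < 0.50:       # reject weak models first
        return -10.0
    if diff_f1 > 8.0:        # then reject extreme overfitting
        return -20.0
    bonus = (5.0 - diff_f1) * 0.50 if diff_f1 < 5.0 else 0.0
    penalty = ((diff_f1 - 5.0) ** 2) * 0.05 if diff_f1 > 5.0 else 0.0
    return test_f1 * 0.4 + bonus - penalty

print(round(composite_score(0.60, 2.0), 4))    # small gap: 0.24 base plus 1.5 bonus
print(round(composite_score(0.60, 6.0), 4))    # gap above 5: 0.24 base minus 0.05 penalty
print(composite_score(0.3286, 41.65))          # test F1 below 0.50: rejected
```

Two properties follow from the arithmetic. The bonus (up to 2.5) outweighs the F1 term (at most 0.4), so the gap dominates the ranking among accepted trials. And because the F1 check runs first, any trial with test F1 below 0.50 returns -10.0 regardless of its gap.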


## 6. Optimization with Optuna


In [6]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))

print("="*80)
print("NAIVE BAYES OPTIMIZATION - OVERFITTING CONTROL")
print("="*80)
print("✅ NO augmentation (Naive Bayes is simple)")
print("✅ Regularization with alpha (smoothing parameter)")
print("✅ Simplified vectorizer (600 features)")
print("✅ Penalty for overfitting >5%")
print("\nGoal: F1 > 0.55 AND overfitting < 5%")
print("Trials: 150")
print("-"*80)

study.optimize(objective, n_trials=150, show_progress_bar=True)

print("\n✅ Optimization complete")


[I 2025-12-04 09:20:57,456] A new study created in memory with name: no-name-4c03dce4-aaf0-4eda-8328-80c82d886f82


NAIVE BAYES OPTIMIZATION - OVERFITTING CONTROL
✅ NO augmentation (Naive Bayes is simple)
✅ Regularization with alpha (smoothing parameter)
✅ Simplified vectorizer (600 features)
✅ Penalty for overfitting >5%

Goal: F1 > 0.55 AND overfitting < 5%
Trials: 150
--------------------------------------------------------------------------------


  0%|          | 0/150 [00:00<?, ?it/s]

[I 2025-12-04 09:20:57,543] Trial 0 finished with value: -10.0 and parameters: {'alpha': 0.5611516415334505, 'fit_prior': True}. Best is trial 0 with value: -10.0.
[I 2025-12-04 09:20:57,573] Trial 1 finished with value: -10.0 and parameters: {'alpha': 1.5751320499779735, 'fit_prior': True}. Best is trial 0 with value: -10.0.
[I 2025-12-04 09:20:57,604] Trial 2 finished with value: -10.0 and parameters: {'alpha': 0.13066739238053282, 'fit_prior': True}. Best is trial 0 with value: -10.0.
[I 2025-12-04 09:20:57,636] Trial 3 finished with value: -10.0 and parameters: {'alpha': 2.607024758370768, 'fit_prior': False}. Best is trial 0 with value: -10.0.
[I 2025-12-04 09:20:57,667] Trial 4 finished with value: -10.0 and parameters: {'alpha': 4.622589001020832, 'fit_prior': True}. Best is trial 0 with value: -10.0.
[I 2025-12-04 09:20:57,701] Trial 5 finished with value: -10.0 and parameters: {'alpha': 0.2327067708383781, 'fit_prior': False}. Best is trial 0 with value: -10.0.
[I 2025-12-04 0
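
Every trial shown in the log returns -10.0, the value reserved for test F1 below 0.50, so the search has nothing to rank. The objective also scores each trial on the test split, which lets the test set steer the hyperparameter choice. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to score each candidate by cross-validated F1 on the training data only. The helper `cv_f1` and the synthetic features are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def cv_f1(alpha, fit_prior, X_train, y_train, n_splits=5, seed=42):
    """Mean F1 over stratified folds of the training data only."""
    model = MultinomialNB(alpha=alpha, fit_prior=fit_prior)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X_train, y_train, cv=cv, scoring="f1").mean()

# Synthetic non-negative features standing in for X_train_tfidf (illustrative only).
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.random((100, 20))
X_demo[y_demo == 1, :5] += 1.0   # make the first 5 features informative

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(cv_f1(alpha, True, X_demo, y_demo), 3))
```

Inside `objective(trial)`, a call such as `cv_f1(alpha, fit_prior, X_train_tfidf, y_train)` could replace the test-set F1, leaving the test split for a single final evaluation.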

## 7. Evaluating the best model


In [7]:
# Train the best model
best_params = study.best_params

best_model = MultinomialNB(
    alpha=best_params['alpha'],
    fit_prior=best_params['fit_prior'],
    class_prior=None
)

best_model.fit(X_train_tfidf, y_train)
results = evaluate_model(best_model, X_train_tfidf, X_test_tfidf, y_train, y_test)

print("="*80)
print("FINAL RESULTS - NAIVE BAYES")
print("="*80)
print(f"F1-score (train): {results['train_f1']:.4f}")
print(f"F1-score (test): {results['test_f1']:.4f}")
print(f"Accuracy (test): {results['test_accuracy']:.4f}")
print(f"Precision (test): {results['test_precision']:.4f}")
print(f"Recall (test): {results['test_recall']:.4f}")
print(f"F1 gap: {results['diff_f1']:.2f}%")
print(f"\nBest hyperparameters:")
print(f"  Alpha: {best_params['alpha']:.4f}")
print(f"  Fit prior: {best_params['fit_prior']}")
print(f"\nConfusion matrix:")
print(results['confusion_matrix'])

if results['diff_f1'] < 5.0 and results['test_f1'] > 0.55:
    print("\n✅✅✅ GOAL MET: Overfitting < 5% AND F1 > 0.55")
elif results['diff_f1'] < 6.0:
    print("\n🎯 VERY CLOSE: Overfitting < 6%")
else:
    print("\n⚠️  Overfitting still high")

print("="*80)


FINAL RESULTS - NAIVE BAYES
F1-score (train): 0.7450
F1-score (test): 0.3286
Accuracy (test): 0.5300
Precision (test): 0.4792
Recall (test): 0.2500
F1 gap: 41.65%

Best hyperparameters:
  Alpha: 0.5612
  Fit prior: True

Confusion matrix:
[[83 25]
 [69 23]]

⚠️  Overfitting still high
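
The reported test metrics can be recomputed from the confusion matrix by hand, which is a quick way to see where the low F1 comes from. The sketch below uses only the four counts printed above, read with scikit-learn's convention of rows as true classes and columns as predicted classes.

```python
# Confusion matrix from the run above: rows = true class, columns = predicted class.
tn, fp = 83, 25
fn, tp = 69, 23

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# prints: precision=0.4792 recall=0.2500 f1=0.3286 accuracy=0.5300
```

Only 23 of the 92 positive test examples are recovered, so recall is what pulls the F1-score down. Against a train F1 of about 0.745, that gives the gap of roughly 41.6 percentage points reported above.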


## 8. Cross-validation


In [8]:
X_all = tfidf.transform(np.concatenate([X_train_text, X_test_text]))
y_all = np.concatenate([y_train, y_test])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(best_model, X_all, y_all, cv=cv, scoring='f1', n_jobs=-1)

print(f"F1-score (CV): {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Scores: {cv_scores}")


F1-score (CV): 0.3937 (+/- 0.0961)
Scores: [0.47058824 0.42105263 0.37974684 0.36601307 0.33103448]
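
In the cell above the TF-IDF vectorizer was fitted once on the training texts and then reused across all folds, so each fold's validation rows may already have shaped the vocabulary and IDF weights. A minimal sketch of an alternative, assuming one wants the vectorizer refitted inside every fold, is to wrap both steps in a scikit-learn pipeline and cross-validate on the raw texts. The tiny corpus and the parameter values are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; in the notebook this would be the processed texts.
texts = ["good video thanks", "great content good job", "nice work great video",
         "thanks nice content", "good job nice video", "great thanks good work",
         "awful video stupid", "stupid content hate it", "hate this awful job",
         "terrible stupid work", "awful terrible content", "hate terrible video"]
labels = [0] * 6 + [1] * 6

# The vectorizer is refitted on the training part of each fold.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    MultinomialNB(alpha=1.0),
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
print(scores.mean())
```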


## 9. Saving the model (if it meets the goals)


In [9]:
if results['diff_f1'] < 6.0 and results['test_f1'] > 0.55:
    with open('../models/naive_bayes_model.pkl', 'wb') as f:
        pickle.dump(best_model, f)
    with open('../models/naive_bayes_tfidf.pkl', 'wb') as f:
        pickle.dump(tfidf, f)
    
    model_info = {
        'model_type': 'Naive Bayes (MultinomialNB)',
        'hyperparameters': best_params,
        'test_f1': results['test_f1'],
        'diff_f1': results['diff_f1'],
        'cv_f1_mean': cv_scores.mean(),
        'data_augmentation': False
    }
    
    with open('../models/naive_bayes_info.pkl', 'wb') as f:
        pickle.dump(model_info, f)
    
    print("✅ Naive Bayes model saved")
else:
    print("⚠️  Model not saved (does not meet the goals)")


⚠️  Model not saved (does not meet the goals)
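
When a model does pass the thresholds, the classifier and the fitted vectorizer must be loaded together, because the classifier only understands the feature columns that this exact vectorizer produces. The sketch below round-trips a toy model through a temporary directory. The file names mirror the ones in the cell above, while the training texts are illustrative assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy model standing in for best_model and tfidf.
texts = ["good video", "nice video", "awful video", "stupid awful comment"]
labels = [0, 0, 1, 1]
vectorizer = TfidfVectorizer().fit(texts)
model = MultinomialNB().fit(vectorizer.transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "naive_bayes_model.pkl"
    tfidf_path = Path(tmp) / "naive_bayes_tfidf.pkl"
    model_path.write_bytes(pickle.dumps(model))
    tfidf_path.write_bytes(pickle.dumps(vectorizer))

    # Load both artifacts back before the temporary directory is removed.
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_tfidf = pickle.loads(tfidf_path.read_bytes())

new_texts = ["awful stupid video"]
prediction = loaded_model.predict(loaded_tfidf.transform(new_texts))
print(prediction)
```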
