# 🔄 Data Augmentation for Dataset Expansion

This notebook applies data augmentation techniques to expand the dataset and improve model performance.

## Techniques implemented:
1. **Synonym replacement** (WordNet)
2. **Translation and back-translation** (googletrans)
3. **Combination of techniques**

## Objectives:
1. Expand the dataset from 1,000 to ~2,000-3,000 examples
2. Evaluate the improvement in model metrics
3. Compare performance before and after augmentation


## 1. Import libraries


In [8]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add src to the path
sys.path.append(str(Path('../src').resolve()))

# Download NLTK resources if they are missing
try:
    import nltk
    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)
    nltk.download('punkt', quiet=True)
except Exception:  # best effort: keep going if NLTK is missing or offline
    pass

from data.augmentation import TextAugmenter
from data.preprocessing import TextPreprocessor
from features.vectorization import TextVectorizer
from models.train import train_model
from models.evaluate import evaluate_model
from sklearn.model_selection import train_test_split

print("✅ Libraries imported")


✅ Libraries imported


## 2. Load the original dataset


In [9]:
# Load the original dataset
data_path = Path('../data/raw/youtoxic_english_1000.csv')
df_original = pd.read_csv(data_path)

print("✅ Original dataset loaded:")
print(f"   Total: {len(df_original)} comments")
print(f"   Toxic: {df_original['IsToxic'].sum()}")
print(f"   Non-toxic: {len(df_original) - df_original['IsToxic'].sum()}")
print(f"\n   Columns: {list(df_original.columns)}")


✅ Original dataset loaded:
   Total: 1000 comments
   Toxic: 462
   Non-toxic: 538

   Columns: ['CommentId', 'VideoId', 'Text', 'IsToxic', 'IsAbusive', 'IsThreat', 'IsProvocative', 'IsObscene', 'IsHatespeech', 'IsRacist', 'IsNationalist', 'IsSexist', 'IsHomophobic', 'IsReligiousHate', 'IsRadicalism']


## 3. Preprocess the original data


In [10]:
# Preprocess the original text
preprocessor = TextPreprocessor(use_spacy=True)

print("🔄 Preprocessing original text...")
df_original['Text_processed'] = df_original['Text'].apply(
    lambda x: preprocessor.preprocess_text(str(x), remove_stopwords=True)
)

# Prepare the columns for augmentation
df_for_aug = df_original[['Text_processed', 'IsToxic']].copy()
df_for_aug.columns = ['text', 'label']

print("✅ Preprocessing completed")
print("   Example of processed text:")
print(f"   Original: {df_original['Text'].iloc[0][:100]}...")
print(f"   Processed: {df_for_aug['text'].iloc[0][:100]}...")


✅ spaCy loaded: en_core_web_sm
🔄 Preprocessing original text...
✅ Preprocessing completed
   Example of processed text:
   Original: If only people would just take a step back and not make this case about them, because it wasn't abou...
   Processed: people step case wasn t people situation lump mess matter hand make kind protest selfish rational th...


## 4. Initialize the Augmenter and Test the Techniques


In [11]:
# Initialize the augmenter
augmenter = TextAugmenter(use_translation=True, use_synonyms=True)

print("✅ TextAugmenter initialized")
print(f"   Translation available: {augmenter.use_translation}")
print(f"   Synonyms available: {augmenter.use_synonyms}")

# Try it on one example
test_text = df_for_aug['text'].iloc[0]
print("\n📝 Original text:")
print(f"   {test_text}")

# Try synonym replacement
if augmenter.use_synonyms:
    augmented_synonyms = augmenter.replace_with_synonyms(test_text, replacement_ratio=0.3)
    print("\n🔄 With synonyms:")
    print(f"   {augmented_synonyms}")


✅ TextAugmenter initialized
   Translation available: False
   Synonyms available: True

📝 Original text:
   people step case wasn t people situation lump mess matter hand make kind protest selfish rational thought investigation guy video heavily emotional hype want hear get hear press reasonable discussion kudo smerconish keep level time let masri fool dare tear city protest dishonor entire incident hate way police brutality epidemic wish stop pretend like know exactly go s measurable people honestly witness incident clue way issue swing grand jury informed trust majority rule right course action let thank 99 99 police officer america actually serve protect bit jerk pull respect job know people go pout hold accountable action people hate police need officer emergency

🔄 With synonyms:
   people step face wasn t people site lump mess matter paw cook kind dissent selfish rational thought investigation guy tv heavily emotional hype want learn get hear press sensible disco
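
The synonym replacement tried above can be sketched in a few lines. This is only an illustration, not the project's `TextAugmenter.replace_with_synonyms`: the `SYNONYMS` table is a hypothetical stand-in for WordNet lookups (its pairs are taken from the output above), and the function name `replace_with_synonyms_sketch` is made up. Only `replacement_ratio` mirrors the parameter used in the cell.

```python
import random

# Hypothetical toy table; the pairs come from the notebook output above.
# The real augmenter is described as using WordNet instead.
SYNONYMS = {
    "case": ["face"],
    "hand": ["paw"],
    "protest": ["dissent"],
    "video": ["tv"],
    "reasonable": ["sensible"],
}

def replace_with_synonyms_sketch(text, replacement_ratio=0.3, seed=None):
    """Replace about `replacement_ratio` of the tokens, limited to tokens with a known synonym."""
    rng = random.Random(seed)
    tokens = text.split()
    # Positions of the tokens that can be replaced
    candidates = [i for i, tok in enumerate(tokens) if tok in SYNONYMS]
    # Never try to replace more tokens than there are candidates
    n_replace = min(len(candidates), round(len(tokens) * replacement_ratio))
    for i in rng.sample(candidates, n_replace):
        tokens[i] = rng.choice(SYNONYMS[tokens[i]])
    return " ".join(tokens)

example = "people step case hand protest video reasonable discussion"
augmented = replace_with_synonyms_sketch(example, replacement_ratio=0.3, seed=0)
print(augmented)
```

With 8 tokens and a ratio of 0.3, two tokens are swapped. A text with no replaceable token comes back unchanged, which matters when counting how many new examples an augmentation run really produces.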

## 5. Augment the Full Dataset


In [12]:
# Augment the dataset (augmentation_factor=1.0 targets doubling its size)
print("🔄 Augmenting dataset...")
print("   This may take several minutes...")
print("\n" + "="*60)

# Use synonyms only, for speed (translation is very slow)
methods = ['synonyms']
if augmenter.use_translation:
    # Optional: add translation (very slow)
    # methods.append('translation')
    pass

df_augmented = augmenter.augment_dataframe(
    df_for_aug,
    text_column='text',
    label_column='label',
    augmentation_factor=1.0,  # Double the dataset
    methods=methods
)

n_added = len(df_augmented) - len(df_for_aug)
print("\n" + "="*60)
print("✅ Dataset augmented:")
print(f"   Original: {len(df_for_aug)} examples")
print(f"   Augmented: {len(df_augmented)} examples")
print(f"   Increase: {n_added} examples ({n_added / len(df_for_aug) * 100:.1f}%)")


🔄 Augmenting dataset...
   This may take several minutes...

🔄 Augmenting dataset: 1000 → 2000 examples
   Methods: ['synonyms']
✅ Dataset augmented: 1917 total examples
   Originals: 1000
   Augmented: 917

✅ Dataset augmented:
   Original: 1000 examples
   Augmented: 1917 examples
   Increase: 917 examples (91.7%)
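
The run targeted 2,000 examples but produced 1,917. The source of `augment_dataframe` is not shown here, so the following is only a sketch of one plausible reason: augmented texts that come back empty or identical to their source are skipped. The function `augment_rows`, the `toy_synonyms` helper, and the `"original"` placeholder value are hypothetical; only the `_augmented` and `_augmentation_method` column names come from the notebook.

```python
import random

def augment_rows(rows, methods, augmentation_factor=1.0, seed=42):
    """Return the original rows plus up to len(rows) * augmentation_factor augmented rows.

    rows: list of {"text": str, "label": int}
    methods: dict mapping a method name to a callable str -> str
    """
    rng = random.Random(seed)
    # Keep every original row, flagged as not augmented ("original" is a placeholder value)
    out = [dict(r, _augmented=False, _augmentation_method="original") for r in rows]
    n_target = int(len(rows) * augmentation_factor)
    names = sorted(methods)
    for _ in range(n_target):
        source = rng.choice(rows)
        name = rng.choice(names)
        new_text = methods[name](source["text"])
        # Assumed behaviour: drop no-op augmentations, so the total can fall short of the target
        if not new_text or new_text == source["text"]:
            continue
        out.append({"text": new_text, "label": source["label"],
                    "_augmented": True, "_augmentation_method": name})
    return out

def toy_synonyms(text):
    # Hypothetical stand-in for a WordNet-based replacement
    return text.replace("video", "tv")

rows = [
    {"text": "guy video heavily emotional", "label": 1},
    {"text": "thank police officer", "label": 0},  # nothing to replace: augmentation is a no-op
]
augmented_rows = augment_rows(rows, {"synonyms": toy_synonyms}, augmentation_factor=1.0)
n_new = sum(r["_augmented"] for r in augmented_rows)
print(f"{len(rows)} originals, {n_new} augmented")
```

A back-translation function could be registered in `methods` the same way, since each method is just a callable from text to text.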


## 6. Save the Augmented Dataset


In [13]:
# Save the augmented dataset
output_path = Path('../data/processed/youtoxic_english_1000_augmented.csv')
output_path.parent.mkdir(parents=True, exist_ok=True)

# Prepare for saving (use the original column names)
# Note: 'Text' holds the preprocessed text here, not the raw comment
df_to_save = df_augmented.copy()
df_to_save['Text'] = df_to_save['text']
df_to_save['IsToxic'] = df_to_save['label'].astype(int)

# Save only the required columns
df_to_save[['Text', 'IsToxic', '_augmented', '_augmentation_method']].to_csv(
    output_path,
    index=False
)

print(f"✅ Augmented dataset saved to: {output_path}")
print("\n   Statistics:")
print(f"   Total: {len(df_to_save)} examples")
print(f"   Originals: {len(df_to_save[~df_to_save['_augmented']])}")
print(f"   Augmented: {len(df_to_save[df_to_save['_augmented']])}")
print(f"   Toxic: {df_to_save['IsToxic'].sum()}")
print(f"   Non-toxic: {len(df_to_save) - df_to_save['IsToxic'].sum()}")


✅ Augmented dataset saved to: ../data/processed/youtoxic_english_1000_augmented.csv

   Statistics:
   Total: 1917 examples
   Originals: 1000
   Augmented: 917
   Toxic: 904
   Non-toxic: 1013


## 7. Vectorize and Train a Model on the Augmented Dataset (Optional)

> **Note**: This section is optional. You can train the model on the augmented dataset and compare metrics.


In [14]:
# Vectorize the augmented dataset
print("🔄 Vectorizing augmented dataset...")

# Prepare the data
X_aug = df_to_save['Text'].values
y_aug = df_to_save['IsToxic'].values

# Create the vectorizer and fit it
vectorizer = TextVectorizer(method='tfidf', max_features=5000)
X_aug_vectorized = vectorizer.fit_transform(pd.Series(X_aug))

# Train/test split
X_train_aug, X_test_aug, y_train_aug, y_test_aug = train_test_split(
    X_aug_vectorized,
    y_aug,
    test_size=0.2,
    random_state=42,
    stratify=y_aug
)

print("✅ Vectorization completed:")
print(f"   Train: {X_train_aug.shape}")
print(f"   Test: {X_test_aug.shape}")
print(f"   Features: {X_train_aug.shape[1]}")


🔄 Vectorizing augmented dataset...
✅ Vectorization completed:
   Train: (1533, 5000)
   Test: (384, 5000)
   Features: 5000
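
Augmented rows are near-copies of original rows, so a random split over the augmented dataset can put a comment in train and its synonym variant in test, which may make test scores optimistic. A possible alternative, sketched below under assumptions, is to hold out original rows first and augment only the training portion. The names `split_then_augment` and `source_id` are hypothetical, the `augment` callable is a stand-in for the real augmenter, and the split here is not stratified.

```python
import random

def split_then_augment(rows, augment, test_size=0.2, seed=42):
    """Hold out original rows first, then augment only the training rows."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # The test set contains untouched original rows only
    test = [dict(rows[i], source_id=i) for i in test_idx]
    train = [dict(rows[i], source_id=i) for i in train_idx]
    # Augmented copies inherit the source_id of the row they were derived from
    train += [dict(rows[i], text=augment(rows[i]["text"]), source_id=i) for i in train_idx]
    return train, test

toy_rows = [{"text": f"comment number {i}", "label": i % 2} for i in range(10)]
train, test = split_then_augment(toy_rows, augment=lambda t: t.replace("comment", "remark"))

train_ids = {r["source_id"] for r in train}
test_ids = {r["source_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)  # 16 2 set()
```

With scikit-learn, `GroupShuffleSplit` with the source row as the group gives a similar guarantee. Fitting the TF-IDF vectorizer on the training rows only, then calling `transform` on the test rows, avoids the same kind of leakage at the feature level.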


## 8. Train the Model on the Augmented Dataset


In [15]:
# Train an SVM on the augmented dataset with the same optimized parameters
print("🔄 Training SVM model on augmented dataset...")
print("="*60)

from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

# Use the optimized parameters from the original model
svm_aug = SVC(C=0.056, kernel='linear', probability=True, class_weight='balanced', random_state=42)
svm_aug_calibrated = CalibratedClassifierCV(svm_aug, method='sigmoid', cv=3)
svm_aug_calibrated.fit(X_train_aug, y_train_aug)

print("✅ Model trained on augmented dataset")

# Evaluate the augmented model
results_aug = evaluate_model(
    svm_aug_calibrated,
    X_train_aug,
    X_test_aug,
    pd.Series(y_train_aug),
    pd.Series(y_test_aug),
    verbose=True
)


🔄 Training SVM model on augmented dataset...
✅ Model trained on augmented dataset
EVALUATION RESULTS

📊 TRAIN METRICS:
   Accuracy:  0.8969
   Precision: 0.8929
   Recall:    0.8880
   F1-score:  0.8904

📊 TEST METRICS:
   Accuracy:  0.8490
   Precision: 0.8773
   Recall:    0.7901
   F1-score:  0.8314

⚠️  OVERFITTING:
   F1 difference (train-test): 5.90%
   ⚠️  Moderate overfitting (5-10%)

📋 Confusion matrix (test):
[[183  20]
 [ 38 143]]
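
The test metrics above can be re-derived from the confusion matrix. The sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes, class 1 is toxic); the implementation of `evaluate_model` is not shown, so that layout is an assumption that the matching numbers support.

```python
# Confusion matrix from the test output, read as [[TN, FP], [FN, TP]]
tn, fp = 183, 20
fn, tp = 38, 143

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted-toxic comments that are truly toxic
recall = tp / (tp + fn)      # share of truly toxic comments that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8490
print(f"Precision: {precision:.4f}")  # 0.8773
print(f"Recall:    {recall:.4f}")     # 0.7901
print(f"F1-score:  {f1:.4f}")         # 0.8314
```

The 38 false negatives against 20 false positives explain why recall (0.7901) trails precision (0.8773): the model misses toxic comments more often than it flags clean ones.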


## 9. Load and Evaluate the Original Model for Comparison


In [None]:
# Load the original vectorized data for comparison
print("🔄 Loading original data for comparison...")
from features.vectorization import load_vectorized_data

# Assumption: the TF-IDF matrices were saved under ../data/processed.
# The original cell used `data_dir` without defining it; adjust the path as needed.
data_dir = Path('../data/processed')
X_train_orig, X_test_orig, y_train_orig, y_test_orig = load_vectorized_data(data_dir, prefix='tfidf')

# Load the original optimized model
import pickle
model_path = Path('../models/optimized/best_optimized_model.pkl')
if model_path.exists():
    with open(model_path, 'rb') as f:
        model_original = pickle.load(f)
    print("✅ Original model loaded")

    # Evaluate the original model
    results_original = evaluate_model(
        model_original,
        X_train_orig,
        X_test_orig,
        y_train_orig,
        y_test_orig,
        verbose=True
    )
else:
    print("⚠️  Original model not found. Using known metrics.")
    # Metrics of the original model (optimized SVM)
    results_original = {
        'test_f1': 0.7407,
        'test_accuracy': 0.64,
        'test_precision': 0.6452,
        'test_recall': 0.8696,
        'train_f1': 0.7119,
        'diff_f1': 2.54
    }


🔄 Loading original data for comparison...


NameError: name 'load_vectorized_data' is not defined

## 10. Detailed Comparison: Original vs Augmented


In [None]:
# Build the comparison DataFrame
print("\n" + "="*60)
print("📊 COMPARISON: Original Model vs Model with Augmentation")
print("="*60)

comparison_data = {
    'Métrica': ['F1-Score (Test)', 'Accuracy (Test)', 'Precision (Test)', 'Recall (Test)', 'Overfitting (%)'],
    'Original': [
        results_original['test_f1'],
        results_original['test_accuracy'],
        results_original['test_precision'],
        results_original['test_recall'],
        results_original['diff_f1']
    ],
    'Con Augmentation': [
        results_aug['test_f1'],
        results_aug['test_accuracy'],
        results_aug['test_precision'],
        results_aug['test_recall'],
        results_aug['diff_f1']
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df['Mejora'] = comparison_df['Con Augmentation'] - comparison_df['Original']
comparison_df['Mejora %'] = ((comparison_df['Con Augmentation'] - comparison_df['Original']) / comparison_df['Original'] * 100).round(2)

print("\n")
print(comparison_df.to_string(index=False))

# Decide whether there is an improvement
f1_improvement = comparison_df[comparison_df['Métrica'] == 'F1-Score (Test)']['Mejora'].values[0]
overfitting_change = comparison_df[comparison_df['Métrica'] == 'Overfitting (%)']['Mejora'].values[0]

print("\n" + "="*60)
if f1_improvement > 0.01:  # Meaningful improvement (more than 0.01 absolute F1)
    print("✅ RESULT: Data Augmentation IMPROVES the model")
    print(f"   - F1-Score improved by {f1_improvement:.4f} ({comparison_df[comparison_df['Métrica'] == 'F1-Score (Test)']['Mejora %'].values[0]:.2f}%)")
elif f1_improvement < -0.01:  # Meaningful degradation
    print("❌ RESULT: Data Augmentation WORSENS the model")
    print(f"   - F1-Score dropped by {abs(f1_improvement):.4f}")
else:
    print("➖ RESULT: Data Augmentation does not change the model significantly")
    print(f"   - F1-Score changed by {f1_improvement:.4f}")

if abs(overfitting_change) > 1:
    if overfitting_change > 0:
        print(f"   ⚠️  Overfitting increased by {overfitting_change:.2f}%")
    else:
        print(f"   ✅ Overfitting decreased by {abs(overfitting_change):.2f}%")
else:
    print(f"   ➖ Overfitting stays similar ({overfitting_change:.2f}%)")


## 11. Comparison Visualization


In [None]:
# Visualize the comparison
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Main metrics
metrics = ['F1-Score (Test)', 'Accuracy (Test)', 'Precision (Test)', 'Recall (Test)']
x = np.arange(len(metrics))
width = 0.35
by_metric = comparison_df.set_index('Métrica')  # look rows up by metric name

ax1 = axes[0]
ax1.bar(x - width/2, by_metric.loc[metrics, 'Original'],
        width, label='Original', color='steelblue')
ax1.bar(x + width/2, by_metric.loc[metrics, 'Con Augmentation'],
        width, label='With Augmentation', color='coral')
ax1.set_xlabel('Metrics')
ax1.set_ylabel('Score')
ax1.set_title('Metric Comparison', fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(metrics, rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
ax1.set_ylim(0, 1)

# Overfitting
ax2 = axes[1]
overfitting_orig = by_metric.loc['Overfitting (%)', 'Original']
overfitting_aug = by_metric.loc['Overfitting (%)', 'Con Augmentation']
ax2.bar(['Original', 'With Augmentation'],
        [overfitting_orig, overfitting_aug],
        color=['steelblue', 'coral'])
ax2.axhline(y=5, color='r', linestyle='--', label='Target limit (5%)')
ax2.set_ylabel('Overfitting (%)')
ax2.set_title('Overfitting', fontweight='bold')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../data/processed/augmentation_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ Plots saved to: ../data/processed/augmentation_comparison.png")


## 12. Save the Augmented Model (if it improves)


In [None]:
# Save the augmented model if it improves meaningfully
import pickle

# Same value as the 'Mejora' entry of the F1 row in the comparison table
f1_improvement = results_aug['test_f1'] - results_original['test_f1']

if f1_improvement > 0.01:  # Improvement of more than 0.01 absolute F1
    models_dir = Path('../models/augmented')
    models_dir.mkdir(parents=True, exist_ok=True)

    model_path = models_dir / 'svm_augmented_model.pkl'
    vectorizer_path = models_dir / 'tfidf_vectorizer_augmented.pkl'

    # Save the model
    with open(model_path, 'wb') as f:
        pickle.dump(svm_aug_calibrated, f)

    # Save the vectorizer
    with open(vectorizer_path, 'wb') as f:
        pickle.dump(vectorizer, f)

    # Save the model metadata
    model_info = {
        'model_name': 'SVM (Augmented)',
        'vectorizer_type': 'tfidf',
        'test_f1': results_aug['test_f1'],
        'test_accuracy': results_aug['test_accuracy'],
        'overfitting': results_aug['diff_f1'],
        'train_f1': results_aug['train_f1'],
        'dataset_size': len(df_augmented),
        'augmentation_method': 'synonyms'
    }

    info_path = models_dir / 'svm_augmented_model_info.pkl'
    with open(info_path, 'wb') as f:
        pickle.dump(model_info, f)

    print("✅ Augmented model saved:")
    print(f"   Model: {model_path}")
    print(f"   Vectorizer: {vectorizer_path}")
    print(f"   Info: {info_path}")
else:
    print("ℹ️  Augmented model not saved (no meaningful improvement)")
    print("   The original model remains the best")
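
Since `model_info` holds only plain strings and numbers, it could also be stored as JSON, which stays readable without unpickling. The sketch below is one possible variant, not what the notebook does: the `.json` file name is an assumption, the values are copied from the outputs above, and a temporary directory stands in for `../models/augmented`.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the evaluation output above (illustrative only)
model_info = {
    "model_name": "SVM (Augmented)",
    "vectorizer_type": "tfidf",
    "test_f1": 0.8314,
    "test_accuracy": 0.8490,
    "train_f1": 0.8904,
    "dataset_size": 1917,
    "augmentation_method": "synonyms",
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, mirroring the .pkl name used in the notebook
    info_path = Path(tmp) / "svm_augmented_model_info.json"
    info_path.write_text(json.dumps(model_info, indent=2), encoding="utf-8")
    restored = json.loads(info_path.read_text(encoding="utf-8"))

print(restored == model_info)  # True
```

If the metrics come back as NumPy scalars, wrapping them in `float(...)` before dumping avoids a serialization error.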


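## 13. Final Conclusions

### Data Augmentation Results:

1. **Dataset size**: ✅ Expanded from 1,000 to 1,917 examples (917 added through synonym replacement), slightly below the ~2,000-3,000 target
2. **Class balance**: Preserved: 462 of 1,000 examples were toxic before (46.2%) and 904 of 1,917 are toxic after (47.2%)
3. **Metric improvement**: The augmented model reaches a test F1 of 0.8314 (accuracy 0.8490) with a 5.90-point train-test F1 gap; see the comparison in section 10 for the original model

### Recommendations:

- **If it improves**: Consider using the augmented model in production
- **If it does not improve**: The original dataset is sufficient and augmentation adds no value
- **Next steps**: Try other techniques (translation, paraphrasing) or augment only the minority class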
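
For the last next step, augmenting only the minority class, a first question is how many examples to generate. A minimal sketch, using the class counts from section 2; the function name `balancing_plan` is hypothetical.

```python
def balancing_plan(class_counts):
    """Return how many augmented examples each class needs to match the largest class."""
    target = max(class_counts.values())
    return {label: target - count for label, count in class_counts.items()}

# Class counts from the original dataset (1 = toxic, 0 = non-toxic)
original_counts = {1: 462, 0: 538}
plan = balancing_plan(original_counts)
print(plan)  # {1: 76, 0: 0}
```

Here only 76 extra toxic examples would be needed, far fewer than the 917 rows added above, so this variant mainly addresses balance rather than dataset size.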


