# Random Forest - Reducing Overfitting

## Objective
Evaluate Random Forest as an alternative to SVM, aiming to reduce overfitting while keeping the F1-score above 0.55.

## Advantages of Random Forest
- ✅ Built-in overfitting control (max_depth, min_samples_leaf)
- ✅ Less sensitive to extreme hyperparameter values
- ✅ Handles sparse data (TF-IDF) well
- ✅ Does not require feature scaling
- ✅ Less prone than a linear SVM to collapsing to F1=0
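
As a rough illustration of the first point, the sketch below compares a fully grown forest with a depth-limited one on a synthetic, noisy dataset. The data, the helper name `fit_and_score`, and the chosen limits are assumptions made for this example only; they are not the notebook's data or settings, and the size of the effect will differ on real text features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary problem (illustrative only; not the notebook's data)
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

def fit_and_score(**rf_params):
    """Fit a forest and return (train accuracy, test accuracy)."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0, **rf_params)
    rf.fit(X_tr, y_tr)
    return rf.score(X_tr, y_tr), rf.score(X_te, y_te)

unconstrained = fit_and_score()  # fully grown trees
constrained = fit_and_score(max_depth=3, min_samples_leaf=10)

print(f"unconstrained: train={unconstrained[0]:.3f}, test={unconstrained[1]:.3f}")
print(f"constrained:   train={constrained[0]:.3f}, test={constrained[1]:.3f}")
```

Limiting depth and leaf size usually lowers the training score, which narrows the train/test gap; whether the test score also improves depends on the data.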


## 1. Importing Libraries


In [1]:
import pandas as pd
import numpy as np
import pickle
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold
import optuna

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix
)

np.random.seed(42)
random.seed(42)

print("✅ Libraries imported")


✅ Libraries imported


## 2. Loading the Data


In [2]:
# Load the processed dataset and the pickled label arrays
df = pd.read_csv('../data/processed/youtoxic_english_1000_processed.csv')
with open('../data/processed/y_train.pkl', 'rb') as f:
    y_train = pickle.load(f)
with open('../data/processed/y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

# NOTE: this slicing assumes that y_train and y_test correspond, in order, to the
# first len(y_train) rows of df and the len(y_test) rows that follow. If the labels
# came from a shuffled split, texts and labels will be misaligned.
X_train_text = df[df.index.isin(range(len(y_train)))]['Text_processed'].values
X_test_text = df[df.index.isin(range(len(y_train), len(y_train) + len(y_test)))]['Text_processed'].values

print(f"✅ Data loaded: {len(X_train_text)} train, {len(X_test_text)} test")
print(f"Train distribution: {np.bincount(y_train)}")
print(f"Test distribution: {np.bincount(y_test)}")


✅ Data loaded: 800 train, 200 test
Train distribution: [430 370]
Test distribution: [108  92]
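
The cell above rebuilds the text split from row positions. One way to avoid depending on row order is to split texts and labels together in a single call. The sketch below uses a toy DataFrame; the `label` column name and the toy rows are hypothetical and do not come from the notebook's dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the processed dataset; 'label' is a hypothetical column name
toy = pd.DataFrame({
    "Text_processed": [f"comment {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# Splitting texts and labels in one call keeps each text paired with its label
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Text_processed"], toy["label"],
    test_size=0.2, stratify=toy["label"], random_state=42,
)

# The pandas index survives the split, so alignment can be checked directly
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)
print(len(X_tr), "train /", len(X_te), "test, aligned:", aligned)
```

If the labels must come from existing pickles, storing the row indices alongside them would allow the same kind of alignment check.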


## 3. Vectorization (same as the improved SVM)


In [3]:
# Improved vectorization (same as for the SVM)
tfidf = TfidfVectorizer(
    max_features=800,        # More features
    ngram_range=(1, 2),      # Unigrams and bigrams
    min_df=3,                # Less restrictive: keep terms found in >= 3 documents
    max_df=0.85,             # More permissive: drop terms found in > 85% of documents
    stop_words='english',
    sublinear_tf=True,
    norm='l2'
)

# NO augmentation initially (Random Forest handles the small dataset better)
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)

print(f"✅ Vectorization: {X_train_tfidf.shape[1]} features")
print(f"   Train shape: {X_train_tfidf.shape}")
print(f"   Test shape: {X_test_tfidf.shape}")


✅ Vectorization: 800 features
   Train shape: (800, 800)
   Test shape: (200, 800)


## 4. Evaluation Function


In [4]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Evaluate the model and return its metrics."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_f1 = f1_score(y_train, y_train_pred, zero_division=0)
    test_f1 = f1_score(y_test, y_test_pred, zero_division=0)
    diff_f1 = abs(train_f1 - test_f1) * 100
    
    return {
        'train_f1': train_f1,
        'test_f1': test_f1,
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'test_precision': precision_score(y_test, y_test_pred, zero_division=0),
        'test_recall': recall_score(y_test, y_test_pred, zero_division=0),
        'diff_f1': diff_f1,
        'confusion_matrix': confusion_matrix(y_test, y_test_pred)
    }


## 5. Objective Function for Optuna


In [5]:
def objective(trial):
    """
    Objective function for Random Forest:
    - Controls overfitting through max_depth and min_samples_leaf
    - Prioritizes overfitting < 5% and F1 > 0.55
    """
    # Hyperparameters that directly limit overfitting
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 15)  # Limits tree depth
    min_samples_split = trial.suggest_int('min_samples_split', 5, 20)  # Minimum samples to split a node
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 2, 10)  # Minimum samples in a leaf
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2'])  # Feature sampling
    
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        class_weight='balanced',  # Class balancing
        random_state=42,
        n_jobs=-1
    )
    
    model.fit(X_train_tfidf, y_train)
    # NOTE: each trial is scored on the test set, so the test metrics reported
    # later are not an unbiased estimate of generalization.
    results = evaluate_model(model, X_train_tfidf, X_test_tfidf, y_train, y_test)

    # Reject useless models
    if results['test_f1'] < 0.55:
        return -10.0

    # Reject extreme overfitting (diff_f1 is in percentage points)
    if results['diff_f1'] > 6.0:
        return -20.0

    # Reject extreme recall
    if results['test_recall'] >= 0.95:
        return -15.0

    # PRIORITY 1: Overfitting control
    if results['diff_f1'] < 5.0:
        overfitting_bonus = (5.0 - results['diff_f1']) * 0.50  # Large bonus
    else:
        overfitting_bonus = 0

    # PRIORITY 2: Overfitting penalty
    if results['diff_f1'] > 5.0:
        overfitting_penalty = ((results['diff_f1'] - 5.0) ** 2) * 0.05
    else:
        overfitting_penalty = 0

    # PRIORITY 3: Penalize extreme recall
    recall_penalty = 0
    if results['test_recall'] > 0.80:
        recall_penalty = ((results['test_recall'] - 0.80) ** 2) * 0.40

    # PRIORITY 4: Base F1-score
    base_score = results['test_f1'] * 0.3

    score = base_score + overfitting_bonus - overfitting_penalty - recall_penalty
    return score

print("✅ Objective function defined (prioritizes overfitting < 5%)")


✅ Objective function defined (prioritizes overfitting < 5%)
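
To see how the terms of the objective weigh against each other, the sketch below restates its scoring rules as a pure function of three metrics. The function name `trial_score` and the metric values passed to it are hypothetical and chosen only for illustration; the thresholds and weights are copied from `objective` above.

```python
def trial_score(test_f1, diff_f1, test_recall):
    """Mirror of the scoring rules in objective(); diff_f1 is in percentage points."""
    if test_f1 < 0.55:
        return -10.0
    if diff_f1 > 6.0:
        return -20.0
    if test_recall >= 0.95:
        return -15.0
    bonus = (5.0 - diff_f1) * 0.50 if diff_f1 < 5.0 else 0.0
    penalty = ((diff_f1 - 5.0) ** 2) * 0.05 if diff_f1 > 5.0 else 0.0
    recall_penalty = ((test_recall - 0.80) ** 2) * 0.40 if test_recall > 0.80 else 0.0
    return test_f1 * 0.3 + bonus - penalty - recall_penalty

# Hypothetical metric values chosen for illustration
low_gap = trial_score(0.60, 3.0, 0.70)   # 0.18 base + 1.00 bonus
high_f1 = trial_score(0.70, 5.5, 0.70)   # 0.21 base - 0.0125 penalty
rejected = trial_score(0.50, 1.0, 0.70)  # F1 below 0.55, so rejected

print(low_gap, high_f1, rejected)
```

With these weights, a model with F1 0.60 and a 3-point gap scores well above one with F1 0.70 and a 5.5-point gap, so the search favors a small gap over a higher F1 once the 0.55 threshold is passed.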


## 6. Optimization with Optuna


In [6]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))

print("="*80)
print("RANDOM FOREST OPTIMIZATION - OVERFITTING CONTROL")
print("="*80)
print("✅ NO augmentation (RF handles the small dataset better)")
print("✅ Built-in overfitting control (max_depth, min_samples_leaf)")
print("✅ Balanced class weights")
print("✅ Feature sampling (sqrt/log2)")
print("✅ Penalty for overfitting > 5%")
print("\nGoal: F1 > 0.55 AND overfitting < 5%")
print("Trials: 150")
print("-"*80)

study.optimize(objective, n_trials=150, show_progress_bar=True)

print("\n✅ Optimization complete")


[I 2025-12-03 12:41:40,485] A new study created in memory with name: no-name-96a4f7a5-28f7-48b9-8fcf-c63b0150f276


RANDOM FOREST OPTIMIZATION - OVERFITTING CONTROL
✅ NO augmentation (RF handles the small dataset better)
✅ Built-in overfitting control (max_depth, min_samples_leaf)
✅ Balanced class weights
✅ Feature sampling (sqrt/log2)
✅ Penalty for overfitting > 5%

Goal: F1 > 0.55 AND overfitting < 5%
Trials: 150
--------------------------------------------------------------------------------


  0%|          | 0/150 [00:00<?, ?it/s]

[I 2025-12-03 12:41:40,914] Trial 0 finished with value: -10.0 and parameters: {'n_estimators': 144, 'max_depth': 15, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -10.0.
[I 2025-12-03 12:41:41,096] Trial 1 finished with value: -10.0 and parameters: {'n_estimators': 64, 'max_depth': 14, 'min_samples_split': 14, 'min_samples_leaf': 8, 'max_features': 'log2'}. Best is trial 0 with value: -10.0.
[I 2025-12-03 12:41:41,688] Trial 2 finished with value: -10.0 and parameters: {'n_estimators': 258, 'max_depth': 5, 'min_samples_split': 7, 'min_samples_leaf': 3, 'max_features': 'log2'}. Best is trial 0 with value: -10.0.
[I 2025-12-03 12:41:42,075] Trial 3 finished with value: -10.0 and parameters: {'n_estimators': 158, 'max_depth': 6, 'min_samples_split': 14, 'min_samples_leaf': 3, 'max_features': 'log2'}. Best is trial 0 with value: -10.0.
[I 2025-12-03 12:41:42,491] Trial 4 finished with value: -10.0 and parameters: {'n_estimators': 164,
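
The objective above scores every trial on the test set. A possible alternative, sketched here under stated assumptions, is to estimate both the validation F1 and the train/validation gap with cross-validation on the training split only. The synthetic data, the helper name `cv_f1_and_gap`, and the example parameters are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the training split (illustrative only)
X_train, y_train = make_classification(n_samples=300, n_features=20, random_state=0)

def cv_f1_and_gap(params, X, y):
    """Return (mean validation F1, train/validation F1 gap in percentage points)."""
    model = RandomForestClassifier(random_state=42, **params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    out = cross_validate(model, X, y, cv=cv, scoring="f1", return_train_score=True)
    val_f1 = out["test_score"].mean()
    gap = abs(out["train_score"].mean() - val_f1) * 100
    return val_f1, gap

val_f1, gap = cv_f1_and_gap(
    {"n_estimators": 50, "max_depth": 5, "min_samples_leaf": 5},
    X_train, y_train,
)
print(f"validation F1={val_f1:.3f}, gap={gap:.1f} points")
```

Inside an Optuna objective, these two numbers could stand in for `results['test_f1']` and `results['diff_f1']`, leaving the test set untouched until the final evaluation.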

## 7. Evaluating the Best Model


In [7]:
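# Train the best model found by the study
# Caveat: if every trial returned a rejection value (e.g. -10.0), best_params
# comes from one of those rejected trials rather than from a model that met the goals.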
best_params = study.best_params

best_model = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf'],
    max_features=best_params['max_features'],
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

best_model.fit(X_train_tfidf, y_train)
results = evaluate_model(best_model, X_train_tfidf, X_test_tfidf, y_train, y_test)

print("="*80)
print("FINAL RESULTS - RANDOM FOREST")
print("="*80)
print(f"F1-score (test): {results['test_f1']:.4f}")
print(f"Accuracy (test): {results['test_accuracy']:.4f}")
print(f"Precision (test): {results['test_precision']:.4f}")
print(f"Recall (test): {results['test_recall']:.4f}")
print(f"F1 difference: {results['diff_f1']:.2f}%")
print("\nConfusion matrix:")
print(results['confusion_matrix'])

if results['diff_f1'] < 5.0 and results['test_f1'] > 0.55:
    print("\n✅✅✅ GOAL MET: Overfitting < 5% AND F1 > 0.55")
elif results['diff_f1'] < 6.0:
    print("\n🎯 VERY CLOSE: Overfitting < 6%")
else:
    print("\n⚠️  Overfitting still high")

print("="*80)


FINAL RESULTS - RANDOM FOREST
F1-score (test): 0.4731
Accuracy (test): 0.5100
Precision (test): 0.4681
Recall (test): 0.4783
F1 difference: 27.27%

Confusion matrix:
[[58 50]
 [48 44]]

⚠️  Overfitting still high
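
As a sanity check, the reported test metrics can be recomputed by hand from the confusion matrix, which scikit-learn lays out as [[TN, FP], [FN, TP]]. The short sketch below does that arithmetic.

```python
# Confusion matrix reported above, in scikit-learn's layout [[TN, FP], [FN, TP]]
tn, fp = 58, 50
fn, tp = 48, 44

precision = tp / (tp + fp)                          # 44 / 94
recall = tp / (tp + fn)                             # 44 / 92
f1 = 2 * precision * recall / (precision + recall)  # equals 88 / 186
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 102 / 200

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.4681 recall=0.4783 f1=0.4731 accuracy=0.5100
```

For comparison, always predicting class 0 on this test split (108 of 200 samples) would reach an accuracy of 0.54, so the model is below the majority-class baseline on accuracy.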


## 8. Cross-Validation


In [8]:
from scipy.sparse import vstack
X_all = vstack([X_train_tfidf, X_test_tfidf])
y_all = np.concatenate([y_train, y_test])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(best_model, X_all, y_all, cv=cv, scoring='f1', n_jobs=-1)

print(f"F1-score (CV): {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Scores: {cv_scores}")


F1-score (CV): 0.4972 (+/- 0.0724)
Scores: [0.53456221 0.53608247 0.47777778 0.44       0.49740933]
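
The cross-validation above reuses TF-IDF features whose vocabulary and IDF weights were fitted on the original training split. One way to refit the vectorizer inside each fold is to wrap it and the classifier in a pipeline and cross-validate on raw text. The sketch below uses a tiny made-up corpus and illustrative settings, not the notebook's data or tuned parameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (illustrative only): 1 = toxic, 0 = not toxic
texts = ["you are awful", "awful stupid video", "stupid people here",
         "great video thanks", "thanks for sharing", "great people here"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# The vectorizer is refitted on the training part of every fold
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
print(f"F1 (CV): {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
```

Because the toy corpus repeats the same six sentences, its scores say nothing about real performance; the point is only the structure of the pipeline.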


## 9. Saving the Model (if it meets the goals)


In [9]:
if results['diff_f1'] < 6.0 and results['test_f1'] > 0.55:
    with open('../models/random_forest_model.pkl', 'wb') as f:
        pickle.dump(best_model, f)
    with open('../models/random_forest_tfidf.pkl', 'wb') as f:
        pickle.dump(tfidf, f)
    
    model_info = {
        'model_type': 'RandomForest',
        'hyperparameters': best_params,
        'test_f1': results['test_f1'],
        'diff_f1': results['diff_f1'],
        'cv_f1_mean': cv_scores.mean(),
        'data_augmentation': False
    }
    
    with open('../models/random_forest_info.pkl', 'wb') as f:
        pickle.dump(model_info, f)
    
    print("✅ Random Forest model saved")
else:
    print("⚠️  Model not saved (goals not met)")


⚠️  Model not saved (goals not met)


## 10. Feature Importance Analysis


In [10]:
# Feature importance (top 20)
feature_names = tfidf.get_feature_names_out()
importances = best_model.feature_importances_
indices = np.argsort(importances)[::-1][:20]

print("Top 20 most important features:")
print("-"*50)
for i in range(20):
    print(f"{i+1:2d}. {feature_names[indices[i]]:30s} {importances[indices[i]]:.4f}")


Top 20 most important features:
--------------------------------------------------
 1. cnn                            0.0304
 2. black                          0.0265
 3. time                           0.0252
 4. look                           0.0249
 5. woman                          0.0227
 6. good                           0.0221
 7. protester                      0.0217
 8. police                         0.0194
 9. people                         0.0185
10. love                           0.0175
11. white                          0.0170
12. law                            0.0164
13. know                           0.0157
14. run                            0.0151
15. street                         0.0130
16. government                     0.0125
17. evidence                       0.0124
18. world                          0.0122
19. car                            0.0122
20. shit                           0.0119
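
The ranking above comes from `feature_importances_`, which is computed from the training data. Since the test F1 is close to chance, it may be worth cross-checking the ranking with permutation importance on held-out data. The sketch below shows the call on synthetic dense data; the dataset and settings are illustrative assumptions, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=400, n_features=10, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in F1
result = permutation_importance(rf, X_te, y_te, scoring="f1", n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

Features whose permutation importance is indistinguishable from zero on held-out data are unlikely to carry real signal, whatever their impurity-based rank.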
