# Optimized Ensemble for Genre Prediction

This notebook implements an advanced ensemble with:
- Data augmentation (back-translation)
- DeBERTa-v3 (top model on the leaderboard)
- Weight optimization with Optuna
- Test-time augmentation

## ⚠️ DATA LEAKAGE VERIFICATION - CHECKLIST ✓

**Critical points verified:**
1. ✅ Data augmentation ONLY on train (35%), validation left intact
2. ✅ Train/val split BEFORE augmentation
3. ✅ TF-IDF fit on train, transform on validation/test
4. ✅ BGE embeddings generated independently (no leakage)
5. ✅ SVC calibration with cv=3 on TRAIN, does not use validation
6. ✅ Transformers (DistilBERT/DeBERTa) trained on train, evaluated on val
7. ✅ Thresholds optimized on validation (correct, it is part of the model)
8. ✅ Optuna optimizes weights on validation (correct)
9. ⚠️ Stacking meta-model trained on validation (slight overfitting, which is why we use the Weighted ensemble)
10. ✅ Test predictions do NOT use augmentation
11. ✅ All metrics computed with validator.py

**Final file to submit:** `dataset_test_preds.csv` (Weighted Ensemble)
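
The ordering in points 1, 2 and 10 can be illustrated with a toy example. This is a minimal sketch using only the standard library; `toy_augment` and `split_then_augment` are hypothetical names introduced here, and `toy_augment` merely stands in for back-translation.

```python
import random

def toy_augment(text):
    # Hypothetical stand-in for back-translation: tags the text as a paraphrase.
    return text + " (paraphrased)"

def split_then_augment(texts, val_fraction=0.1, aug_ratio=0.35, seed=42):
    rng = random.Random(seed)
    idx = list(range(len(texts)))
    rng.shuffle(idx)
    n_val = int(len(texts) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    # Augment only samples drawn from the training indices.
    aug_idx = rng.sample(train_idx, int(len(train_idx) * aug_ratio))
    train = [texts[i] for i in train_idx] + [toy_augment(texts[i]) for i in aug_idx]
    val = [texts[i] for i in val_idx]
    return train, val, set(aug_idx), set(val_idx)

texts = [f"movie {i}" for i in range(100)]
train, val, aug_idx, val_idx = split_then_augment(texts)
assert aug_idx.isdisjoint(val_idx)  # no validation sample was augmented
```

Because the split happens first, no paraphrase of a validation text can end up in the training set.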

## 1. Imports y Configuraci√≥n

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy.sparse import hstack as sp_hstack, csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
from sentence_transformers import SentenceTransformer
from transformers import (
    DistilBertTokenizer, DistilBertForSequenceClassification,
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments,
    MarianMTModel, MarianTokenizer
)
from torch.utils.data import Dataset
import torch
import warnings
warnings.filterwarnings('ignore')

# Import the validation function
import sys
sys.path.append('..')
from validator import compute_metrics


## 2. Data Loading and Preparation

In [2]:
train_dir = Path("../dataset_train.csv")
test_dir = Path("../dataset_test.csv")

df = pd.read_csv(train_dir)
print(f"Dataset size: {len(df)}")
df.head()

Dataset size: 8475


Unnamed: 0,movie_name,genre,description
0,Silent Hill,"Horror, Mystery","Rose, a desperate mother takes her adopted dau..."
1,Breaking the Waves,"Drama, Romance","In a small and conservative Scottish village, ..."
2,Wind Chill,"Drama, Horror, Thriller",Two college students share a ride home for the...
3,Godmothered,"Family, Fantasy, Comedy",A young and unskilled fairy godmother that ven...
4,Donkey Skin,"Fantasy, Comedy, Music, Romance",A fairy godmother helps a princess disguise he...


In [None]:
df["text"] = df["movie_name"].fillna("") + " [SEP] " + df["description"].fillna("")
y_list = df["genre"].apply(lambda s: [g.strip() for g in str(s).split(",") if g.strip()])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_list)

print(f"Number of labels: {len(mlb.classes_)}")
print(f"Label distribution shape: {Y.shape}")

Number of labels: 18
Label distribution shape: (8475, 18)
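
To make the label encoding concrete, here is a small pure-Python sketch that mirrors what `MultiLabelBinarizer` produces in this simple case (classes sorted alphabetically, one 0/1 column per genre). The helper names `parse_genres` and `multi_hot` are illustrative and not part of the notebook.

```python
def parse_genres(s):
    # Same parsing rule as the cell above: split on commas, strip whitespace.
    return [g.strip() for g in str(s).split(",") if g.strip()]

def multi_hot(label_lists):
    # Sorted vocabulary of all genres seen, then one 0/1 row per sample.
    classes = sorted({g for labels in label_lists for g in labels})
    index = {g: i for i, g in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for g in labels:
            row[index[g]] = 1
        rows.append(row)
    return classes, rows

classes, rows = multi_hot([parse_genres("Horror, Mystery"),
                           parse_genres("Drama, Romance")])
print(classes)  # ['Drama', 'Horror', 'Mystery', 'Romance']
print(rows)     # [[0, 1, 1, 0], [1, 0, 0, 1]]
```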


## 3. Data Augmentation - Back Translation

### ⚡ Speed Optimizations

**Back-translation is tuned to the available hardware:**

**🚀 GPU MODE** (if available):
- Processing in **BATCHES** (no workers; the GPU already parallelizes internally)
- `batch_size=16`: processes 16 texts at once on the GPU
- Larger batch = faster, but more VRAM
- Speed: **~20-50x faster than CPU**

**CPU MODE** (automatic fallback):
- Processing with **THREADING** (multiple workers)
- `batch_size=4`: number of parallel workers
- More workers = faster, but more RAM/cores
- Speed: **~3-4x faster than sequential**

**Recommended configuration:**
```python
# GPU with 4-8 GB VRAM
batch_size=16  # Optimal

# GPU with 12+ GB VRAM
batch_size=32  # Very fast

# CPU (4-8 cores)
batch_size=4  # parallel workers
```

**Estimated times** (1,560 translations):
- Sequential CPU: ~25-35 min
- Parallel CPU (4 workers): ~8-10 min
- **GPU batch=16**: **~1-2 min** ⚡🚀
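
As a rough illustration of the GPU batching described above, the sketch below slices a list of texts into consecutive batches. `iter_batches` is a hypothetical helper written for this note; the actual function uses an equivalent `range(0, total, batch_size)` loop.

```python
def iter_batches(items, batch_size):
    # Yield consecutive slices; the last one may be shorter than batch_size.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"text {i}" for i in range(1560)]
batches = list(iter_batches(texts, 16))
print(len(batches), len(batches[-1]))  # 98 8
```

With 1,560 texts and `batch_size=16`, the model's `generate` call would run 98 times per translation direction instead of 1,560 times.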

In [None]:
from concurrent.futures import ThreadPoolExecutor
import threading

def back_translate(texts, src_lang='en', pivot_lang='fr', sample_ratio=0.2, batch_size=16, use_gpu=True):
    """
    Back-translation optimized for GPU (batching) or CPU (threading).
    
    Args:
        texts: Texts to augment
        src_lang: Source language (default: 'en')
        pivot_lang: Pivot language (default: 'fr')
        sample_ratio: Fraction of texts to augment
        batch_size: Batch size on GPU (default: 16), or number of workers on CPU
        use_gpu: Use the GPU if available (default: True)
    """
    model_name_en_pivot = f'Helsinki-NLP/opus-mt-{src_lang}-{pivot_lang}'
    model_name_pivot_en = f'Helsinki-NLP/opus-mt-{pivot_lang}-{src_lang}'
    
    # Detect device
    device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    if device.type == 'cuda':
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    
    # Load the models only once
    tokenizer_en_pivot = MarianTokenizer.from_pretrained(model_name_en_pivot)
    model_en_pivot = MarianMTModel.from_pretrained(model_name_en_pivot).to(device)
    model_en_pivot.eval()  # Evaluation mode for inference
    
    tokenizer_pivot_en = MarianTokenizer.from_pretrained(model_name_pivot_en)
    model_pivot_en = MarianMTModel.from_pretrained(model_name_pivot_en).to(device)
    model_pivot_en.eval()
    
    # Select the indices to augment
    indices_to_augment = np.random.choice(len(texts), size=int(len(texts) * sample_ratio), replace=False)
    texts_to_augment = [texts.iloc[idx] if hasattr(texts, 'iloc') else texts[idx] for idx in indices_to_augment]
    total = len(texts_to_augment)
    
    augmented_texts = []
    
    if device.type == 'cuda':
        # ===== GPU MODE: batch processing (more efficient) =====
        print(f"GPU mode: Processing in batches of {batch_size}")
        
        with torch.no_grad():
            for i in range(0, total, batch_size):
                batch_texts = texts_to_augment[i:i+batch_size]
                
                # EN -> Pivot (full batch)
                inputs = tokenizer_en_pivot(batch_texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                translated = model_en_pivot.generate(**inputs, max_length=128)
                pivot_texts = [tokenizer_en_pivot.decode(t, skip_special_tokens=True) for t in translated]
                
                # Pivot -> EN (full batch)
                inputs = tokenizer_pivot_en(pivot_texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                back_translated = model_pivot_en.generate(**inputs, max_length=128)
                final_texts = [tokenizer_pivot_en.decode(t, skip_special_tokens=True) for t in back_translated]
                
                augmented_texts.extend(final_texts)
                
                # Progress
                if (i + batch_size) % (batch_size * 5) == 0 or (i + batch_size) >= total:
                    print(f"Augmenting {min(i + batch_size, total)}/{total}...", end='\r')
        
        print(f"\nGPU augmentation complete! ({len(augmented_texts)} texts)" + " "*20)
    
    else:
        # ===== CPU MODE: threaded processing (parallelization) =====
        n_workers = batch_size  # Reuse the parameter as the number of workers
        print(f"CPU mode: Processing with {n_workers} parallel workers")
        
        counter = {'value': 0, 'lock': threading.Lock()}
        
        def translate_single(text):
            """Translate a single text on CPU."""
            try:
                with torch.no_grad():
                    # EN -> Pivot
                    inputs = tokenizer_en_pivot(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
                    translated = model_en_pivot.generate(**inputs)
                    pivot_text = tokenizer_en_pivot.decode(translated[0], skip_special_tokens=True)
                    
                    # Pivot -> EN
                    inputs = tokenizer_pivot_en(pivot_text, return_tensors="pt", padding=True, truncation=True, max_length=128)
                    back_translated = model_pivot_en.generate(**inputs)
                    final_text = tokenizer_pivot_en.decode(back_translated[0], skip_special_tokens=True)
                
                # Update progress
                with counter['lock']:
                    counter['value'] += 1
                    if counter['value'] % 50 == 0 or counter['value'] == total:
                        print(f"Augmenting {counter['value']}/{total}...", end='\r')
                
                return final_text
            except Exception as e:
                print(f"\nError translating text: {e}")
                return text
        
        with ThreadPoolExecutor(max_workers=n_workers) as executor:
            augmented_texts = list(executor.map(translate_single, texts_to_augment))
        
        print(f"\nCPU augmentation complete! ({len(augmented_texts)} texts)" + " "*20)
    
    return augmented_texts, indices_to_augment
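
Note that `np.random.choice` above draws from NumPy's global random state, so the augmented subset changes between runs unless that state is seeded elsewhere. One possible way to make the selection reproducible is sketched below; `sample_indices` and its `seed` parameter are assumptions introduced here, not part of the notebook.

```python
import numpy as np

def sample_indices(n_items, sample_ratio, seed=42):
    # A dedicated Generator keeps the draw independent of global random state.
    rng = np.random.default_rng(seed)
    return rng.choice(n_items, size=int(n_items * sample_ratio), replace=False)

first = sample_indices(200, 0.25)
second = sample_indices(200, 0.25)
assert np.array_equal(first, second)  # same seed, same subset
```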


In [None]:
# Split BEFORE augmenting - validation uses 100% original data
X_tr, X_va, y_tr, y_va = train_test_split(df["text"], Y, test_size=0.1, random_state=42)
print(f"Original - Training: {len(X_tr)}, Validation: {len(X_va)}")

# Augment ONLY train, with an increased ratio
# GPU: batch_size controls the batch size (16-32 is optimal)
# CPU: batch_size controls the number of workers (4-8 is optimal)
print(f"\nAugmenting training data...")
augmented_texts, aug_indices = back_translate(X_tr, sample_ratio=0.35, batch_size=16, use_gpu=True)

# Get the labels that correspond to the augmented indices
y_tr_augmented = y_tr[aug_indices]

# Combine original train + augmented train
X_tr_combined = pd.concat([X_tr.reset_index(drop=True), pd.Series(augmented_texts)], ignore_index=True)
y_tr_combined = np.vstack([y_tr, y_tr_augmented])

print(f"Augmented - Training: {len(X_tr_combined)}, Validation: {len(X_va)} (unchanged)")
print(f"Train augmentation: +{len(augmented_texts)} samples ({len(augmented_texts)/len(X_tr):.1%})")

# Update variables for later use
X_tr = X_tr_combined
y_tr = y_tr_combined


Original dataset size: 8475



Augmenting 0/1695...


Augmentation complete!                    
Augmented dataset size: 10170
Training samples: 9153, Validation samples: 1017


In [23]:
import joblib

In [24]:
joblib.dump(X_tr, "X_tr_nvembed.pkl")
joblib.dump(X_va, "X_va_nvembed.pkl")
joblib.dump(y_tr, "y_tr_nvembed.pkl")
joblib.dump(y_va, "y_va_nvembed.pkl")

['y_va_nvembed.pkl']

## 4. Feature Engineering - TF-IDF

In [None]:
tfidf_word = TfidfVectorizer(
    ngram_range=(1,4),
    min_df=2,
    max_features=750_000,
    sublinear_tf=True,
    stop_words="english",
    max_df=0.85,
    strip_accents='unicode',
    lowercase=True
)

tfidf_char = TfidfVectorizer(
    analyzer="char_wb",
    ngram_range=(3,6),
    min_df=2,
    max_features=750_000,
    sublinear_tf=True,
    max_df=0.85,
    strip_accents='unicode'
)

Xw_tr = tfidf_word.fit_transform(X_tr)
Xw_va = tfidf_word.transform(X_va)
Xc_tr = tfidf_char.fit_transform(X_tr)
Xc_va = tfidf_char.transform(X_va)

XTR_tfidf = sp_hstack([Xw_tr, Xc_tr], format="csr")
XVA_tfidf = sp_hstack([Xw_va, Xc_va], format="csr")
print(f"Combined TF-IDF features shape: {XTR_tfidf.shape}")

Combined TF-IDF features shape: (9153, 205446)
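
To give an intuition for the `char_wb` analyzer used in `tfidf_char`, the sketch below generates character n-grams that stay inside word boundaries by padding each word with a space on both sides. It is a simplified approximation written for illustration (it skips accent stripping and scikit-learn's handling of words shorter than `n`); `char_wb_ngrams` is a hypothetical name.

```python
def char_wb_ngrams(text, min_n=3, max_n=6):
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad so n-grams never cross word boundaries
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

print(char_wb_ngrams("Hill", 3, 3))  # [' hi', 'hil', 'ill', 'll ']
```

Sub-word features like these let the linear models match inflected or misspelled forms that word-level n-grams would treat as unrelated tokens.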


In [None]:
# Save the TF-IDF vectorizers, MultiLabelBinarizer and labels
joblib.dump(tfidf_word, "tfidf_word.joblib")
joblib.dump(tfidf_char, "tfidf_char.joblib")
joblib.dump(mlb, "mlb.joblib")

# Save labels.json
import json
labels_dict = {"labels": mlb.classes_.tolist()}
with open("labels.json", "w") as f:
    json.dump(labels_dict, f, indent=2)

print("TF-IDF vectorizers, MLB and labels.json saved!")
print(f"Number of labels: {len(mlb.classes_)}")

## 5. Improved Embeddings (BGE-Large)

In [None]:
st_model = SentenceTransformer('BAAI/bge-large-en-v1.5')
print("Generating embeddings with BGE-Large (1024 dim)...")
emb_tr = st_model.encode(X_tr.tolist(), show_progress_bar=True, batch_size=16, normalize_embeddings=True)
emb_va = st_model.encode(X_va.tolist(), show_progress_bar=True, batch_size=16, normalize_embeddings=True)

XTR_combined = sp_hstack([XTR_tfidf, csr_matrix(emb_tr)], format="csr")
XVA_combined = sp_hstack([XVA_tfidf, csr_matrix(emb_va)], format="csr")
print(f"Combined features (TF-IDF + BGE Embeddings) shape: {XTR_combined.shape}")

In [None]:
# Save the sentence transformer model
joblib.dump(st_model, "sentence_transformer.joblib")
print("Sentence Transformer model saved!")

## 6. Calibration and Improved Models

In [8]:
clf_logreg = OneVsRestClassifier(
    LogisticRegression(C=8.0, solver="saga", max_iter=4000, class_weight='balanced', random_state=42),
    n_jobs=-1
)
print("Training LogisticRegression...")
clf_logreg.fit(XTR_combined, y_tr)
print("LogReg training complete!")

Training LogisticRegression...
LogReg training complete!


In [None]:
logits_logreg = clf_logreg.decision_function(XVA_combined)
ths_logreg = np.zeros(logits_logreg.shape[1])

for k in range(logits_logreg.shape[1]):
    s = logits_logreg[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_logreg[k] = best_t

pred_logreg = (logits_logreg >= ths_logreg).astype(int)
metrics_logreg = compute_metrics(y_va, pred_logreg)
print(f"LogReg - F1: {metrics_logreg['f1']:.4f}, Precision: {metrics_logreg['precision']:.4f}, Recall: {metrics_logreg['recall']:.4f}, Hamming: {metrics_logreg['hamming_loss']:.4f}")

LogReg - micro-F1: 0.7315, macro-F1: 0.6982
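
The per-label threshold search in the cell above is repeated almost verbatim for every model in this notebook. It could be factored into one helper, sketched below with NumPy only. `optimize_thresholds` and `binary_f1` are hypothetical names; `binary_f1` is meant to match `f1_score(..., zero_division=0)` for binary inputs.

```python
import numpy as np

def binary_f1(y_true, y_pred):
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def optimize_thresholds(scores, y_true, n_candidates=50):
    # For each label, try quantiles of its score distribution as thresholds
    # and keep the one with the highest F1.
    n_labels = scores.shape[1]
    thresholds = np.zeros(n_labels)
    for k in range(n_labels):
        s = scores[:, k]
        best_f1, best_t = 0.0, 0.0
        for t in np.quantile(s, np.linspace(0.01, 0.99, n_candidates)):
            f1 = binary_f1(y_true[:, k], (s >= t).astype(int))
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[k] = best_t
    return thresholds

scores = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]])
y_true = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])
ths = optimize_thresholds(scores, y_true)
preds = (scores >= ths).astype(int)
```

With such a helper, each model's cell would reduce to a single call such as `ths_logreg = optimize_thresholds(logits_logreg, y_va)`.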


In [None]:
# Save the LogReg model
joblib.dump(clf_logreg, "clf_logreg.joblib")
np.save("ths_logreg.npy", ths_logreg)  # np.save so the .npy file loads with np.load
print("LogReg model saved!")

In [10]:
clf_xgb = MultiOutputClassifier(
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, random_state=42, n_jobs=-1)
)
print("Training XGBoost...")
clf_xgb.fit(emb_tr, y_tr)
print("XGBoost training complete!")

Training XGBoost...
XGBoost training complete!


In [None]:
pred_proba_xgb = clf_xgb.predict_proba(emb_va)
logits_xgb = np.column_stack([p[:, 1] for p in pred_proba_xgb])
ths_xgb = np.zeros(logits_xgb.shape[1])

for k in range(logits_xgb.shape[1]):
    s = logits_xgb[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_xgb[k] = best_t

pred_xgb = (logits_xgb >= ths_xgb).astype(int)
metrics_xgb = compute_metrics(y_va, pred_xgb)
print(f"XGBoost - F1: {metrics_xgb['f1']:.4f}, Precision: {metrics_xgb['precision']:.4f}, Recall: {metrics_xgb['recall']:.4f}, Hamming: {metrics_xgb['hamming_loss']:.4f}")

XGBoost - micro-F1: 0.6934, macro-F1: 0.6395


In [None]:
# Save the XGBoost model
joblib.dump(clf_xgb, "clf_xgb.joblib")
np.save("ths_xgb.npy", ths_xgb)  # np.save so the .npy file loads with np.load
print("XGBoost model saved!")

In [12]:
clf_svc = OneVsRestClassifier(
    LinearSVC(C=2.0, max_iter=4000, class_weight='balanced', dual='auto', random_state=42),
    n_jobs=-1
)
print("Training LinearSVC...")
clf_svc.fit(XTR_tfidf, y_tr)
print("SVC training complete!")

Training LinearSVC...
SVC training complete!


In [None]:
logits_svc = clf_svc.decision_function(XVA_tfidf)
ths_svc = np.zeros(logits_svc.shape[1])

for k in range(logits_svc.shape[1]):
    s = logits_svc[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_svc[k] = best_t

pred_svc = (logits_svc >= ths_svc).astype(int)
metrics_svc = compute_metrics(y_va, pred_svc)
print(f"LinearSVC - F1: {metrics_svc['f1']:.4f}, Precision: {metrics_svc['precision']:.4f}, Recall: {metrics_svc['recall']:.4f}, Hamming: {metrics_svc['hamming_loss']:.4f}")

LinearSVC - micro-F1: 0.7258, macro-F1: 0.7030


In [None]:
# Save the SVC model
joblib.dump(clf_svc, "clf_svc.joblib")
np.save("ths_svc.npy", ths_svc)  # np.save so the .npy file loads with np.load
print("SVC model saved!")

## SVC Probability Calibration

In [None]:
from sklearn.calibration import CalibratedClassifierCV

print("Calibrating SVC probabilities with cross-validation on TRAIN...")
# IMPORTANT: For multi-label, we calibrate each binary classifier individually
# CalibratedClassifierCV does not work well directly with a multi-label OneVsRestClassifier

# Train the base SVC first
base_svc = OneVsRestClassifier(
    LinearSVC(C=2.0, max_iter=4000, class_weight='balanced', dual='auto', random_state=42),
    n_jobs=-1
)
base_svc.fit(XTR_tfidf, y_tr)

# Calibrate each binary classifier individually
n_labels = y_tr.shape[1]
calibrated_classifiers = []

for i in range(n_labels):
    # Calibrate each binary classifier with cv=3
    clf_calibrated = CalibratedClassifierCV(
        LinearSVC(C=2.0, max_iter=4000, class_weight='balanced', dual='auto', random_state=42),
        cv=3,
        method='sigmoid'
    )
    clf_calibrated.fit(XTR_tfidf, y_tr[:, i])
    calibrated_classifiers.append(clf_calibrated)
    
    if (i + 1) % 5 == 0:
        print(f"Calibrated {i + 1}/{n_labels} classifiers...", end='\r')

print(f"\nCalibrated all {n_labels} classifiers!")

# Get calibrated probabilities on VALIDATION
logits_svc_cal = np.zeros((XVA_tfidf.shape[0], n_labels))
for i, clf in enumerate(calibrated_classifiers):
    logits_svc_cal[:, i] = clf.predict_proba(XVA_tfidf)[:, 1]

# Optimize thresholds on the calibrated probabilities
ths_svc_cal = np.zeros(n_labels)
for k in range(n_labels):
    s = logits_svc_cal[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_svc_cal[k] = best_t

pred_svc_cal = (logits_svc_cal >= ths_svc_cal).astype(int)
metrics_svc_cal = compute_metrics(y_va, pred_svc_cal)
print(f"Calibrated SVC - F1: {metrics_svc_cal['f1']:.4f}, Precision: {metrics_svc_cal['precision']:.4f}, Recall: {metrics_svc_cal['recall']:.4f}, Hamming: {metrics_svc_cal['hamming_loss']:.4f}")
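
For reference, `method='sigmoid'` corresponds to Platt scaling: each binary classifier's raw SVC margin $f(x)$ is mapped to a probability through a fitted logistic curve. In the usual formulation (the symbols $A$ and $B$ are notation introduced here, not names from the notebook):

$$
P(y = 1 \mid x) = \frac{1}{1 + \exp\left(A\, f(x) + B\right)}
$$

The scalars $A$ and $B$ are fitted on held-out folds of the training data (here `cv=3`), which is why the validation split is not touched during calibration.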


In [None]:
# Save the calibrated SVC models (list of binary classifiers)
joblib.dump(calibrated_classifiers, "clf_svc_calibrated_list.joblib")
np.save("ths_svc_cal.npy", ths_svc_cal)  # np.save so the .npy file loads with np.load
print("Calibrated SVC models saved (list of binary classifiers)!")


## 7. DistilBERT with Focal Loss and Label Smoothing

In [None]:
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        bce_loss = nn.BCEWithLogitsLoss(reduction='none')(inputs, targets)
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1-pt)**self.gamma * bce_loss
        return focal_loss.mean()

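class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers releases pass to compute_loss.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        
        # Note: this override bypasses the Trainer's built-in label smoothing,
        # so label_smoothing_factor in TrainingArguments does not affect this loss.
        loss_fct = FocalLoss(alpha=0.25, gamma=2.0)
        loss = loss_fct(logits, labels)
        
        return (loss, outputs) if return_outputs else loss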

print("Focal Loss class defined for DistilBERT")
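
To see what the focal term does numerically, here is a NumPy re-implementation of the same formula, written only as a sanity-check sketch (the function name `focal_loss_np` is an assumption; the notebook uses the PyTorch class above). Since `bce = -log(p_t)`, the quantity `exp(-bce)` recovers the probability assigned to the true class, and `(1 - p_t) ** gamma` shrinks the loss of examples that are already classified confidently.

```python
import numpy as np

def focal_loss_np(logits, targets, alpha=0.25, gamma=2.0):
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    bce = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    pt = np.exp(-bce)  # probability assigned to the true class
    return alpha * (1 - pt) ** gamma * bce  # per-element loss, no reduction

hard = focal_loss_np([0.0], [1.0])[0]  # uncertain prediction, p_t = 0.5
easy = focal_loss_np([4.0], [1.0])[0]  # confident and correct, p_t close to 1
assert easy < hard
```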

In [14]:
class MovieGenreDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts.iloc[idx]) if hasattr(self.texts, 'iloc') else str(self.texts[idx])
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float)
        }

In [15]:
tokenizer_distilbert = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model_distilbert = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(mlb.classes_),
    problem_type="multi_label_classification"
)

train_dataset_distilbert = MovieGenreDataset(X_tr, y_tr, tokenizer_distilbert, max_length=128)
val_dataset_distilbert = MovieGenreDataset(X_va, y_va, tokenizer_distilbert, max_length=128)
print(f"Datasets created: {len(train_dataset_distilbert)} training, {len(val_dataset_distilbert)} validation")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Datasets created: 9153 training, 1017 validation


In [None]:
training_args_distilbert = TrainingArguments(
    output_dir='./distilbert_results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    label_smoothing_factor=0.1,
)

trainer_distilbert = CustomTrainer(
    model=model_distilbert,
    args=training_args_distilbert,
    train_dataset=train_dataset_distilbert,
    eval_dataset=val_dataset_distilbert,
)

print("Training DistilBERT with Focal Loss + Label Smoothing...")
trainer_distilbert.train()
print("DistilBERT training complete!")

Training DistilBERT...


Epoch,Training Loss,Validation Loss
1,0.2659,0.261026
2,0.1965,0.20957
3,0.1459,0.203075


DistilBERT training complete!


In [None]:
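model_distilbert.eval()
device = next(model_distilbert.parameters()).device  # the Trainer may have moved the model to GPU
val_texts, probs = X_va.tolist(), []
with torch.no_grad():
    for i in range(0, len(val_texts), 32):  # batch to bound memory use
        batch = tokenizer_distilbert(val_texts[i:i+32], truncation=True, padding=True, max_length=128, return_tensors='pt')
        batch = {k: v.to(device) for k, v in batch.items()}
        probs.append(torch.sigmoid(model_distilbert(**batch).logits).cpu().numpy())
logits_distilbert = np.vstack(probs)  # sigmoid probabilities, despite the name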

ths_distilbert = np.zeros(logits_distilbert.shape[1])
for k in range(logits_distilbert.shape[1]):
    s = logits_distilbert[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_distilbert[k] = best_t

pred_distilbert = (logits_distilbert >= ths_distilbert).astype(int)
metrics_distilbert = compute_metrics(y_va, pred_distilbert)
print(f"DistilBERT - F1: {metrics_distilbert['f1']:.4f}, Precision: {metrics_distilbert['precision']:.4f}, Recall: {metrics_distilbert['recall']:.4f}, Hamming: {metrics_distilbert['hamming_loss']:.4f}")

DistilBERT - micro-F1: 0.7234, macro-F1: 0.6736


In [None]:
# Save the DistilBERT model
model_distilbert.save_pretrained("./distilbert_model")
tokenizer_distilbert.save_pretrained("./distilbert_model")
np.save("ths_distilbert.npy", ths_distilbert)
print("DistilBERT model saved!")

## 8. DeBERTa-v3 (Top Model)

In [18]:
tokenizer_deberta = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model_deberta = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(mlb.classes_),
    problem_type="multi_label_classification"
)

train_dataset_deberta = MovieGenreDataset(X_tr, y_tr, tokenizer_deberta, max_length=256)
val_dataset_deberta = MovieGenreDataset(X_va, y_va, tokenizer_deberta, max_length=256)
print(f"DeBERTa datasets created")

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be 

DeBERTa datasets created



In [None]:
training_args_deberta = TrainingArguments(
    output_dir='./deberta_results',
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=True,
    gradient_accumulation_steps=2,
    label_smoothing_factor=0.1,
)

trainer_deberta = CustomTrainer(
    model=model_deberta,
    args=training_args_deberta,
    train_dataset=train_dataset_deberta,
    eval_dataset=val_dataset_deberta,
)

print("Training DeBERTa-v3 with Focal Loss + Label Smoothing (8 epochs)...")
trainer_deberta.train()
print("DeBERTa training complete!")
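
A note on the batch settings above: with gradient accumulation, the optimizer only steps after several micro-batches, so the effective batch size is the product of the per-device batch size, the accumulation steps, and the number of devices. The small helper below is illustrative arithmetic; `effective_batch_size` is a hypothetical name and a single device is assumed.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    # Samples that contribute to each optimizer step.
    return per_device_batch * grad_accum_steps * n_devices

print(effective_batch_size(8, 2))   # DeBERTa settings above: 16
print(effective_batch_size(16, 1))  # DistilBERT settings: 16
```

Under that assumption, both transformers see the same number of samples per optimizer step, even though DeBERTa uses smaller micro-batches to fit in memory.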

Training DeBERTa-v3...



Epoch,Training Loss,Validation Loss
1,0.2799,0.265815


KeyboardInterrupt: 

In [None]:
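model_deberta.eval()
device = next(model_deberta.parameters()).device  # the Trainer may have moved the model to GPU
val_texts, probs = X_va.tolist(), []
with torch.no_grad():
    for i in range(0, len(val_texts), 16):  # batch to bound memory use
        batch = tokenizer_deberta(val_texts[i:i+16], truncation=True, padding=True, max_length=256, return_tensors='pt')
        batch = {k: v.to(device) for k, v in batch.items()}
        probs.append(torch.sigmoid(model_deberta(**batch).logits).cpu().numpy())
logits_deberta = np.vstack(probs)  # sigmoid probabilities, despite the name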

ths_deberta = np.zeros(logits_deberta.shape[1])
for k in range(logits_deberta.shape[1]):
    s = logits_deberta[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_deberta[k] = best_t

pred_deberta = (logits_deberta >= ths_deberta).astype(int)
metrics_deberta = compute_metrics(y_va, pred_deberta)
print(f"DeBERTa-v3 - F1: {metrics_deberta['f1']:.4f}, Precision: {metrics_deberta['precision']:.4f}, Recall: {metrics_deberta['recall']:.4f}, Hamming: {metrics_deberta['hamming_loss']:.4f}")

In [None]:
# Save the DeBERTa model
model_deberta.save_pretrained("./deberta_model")
tokenizer_deberta.save_pretrained("./deberta_model")
np.save("ths_deberta.npy", ths_deberta)
print("DeBERTa model saved!")

## 9. Stacking Ensemble (Meta-learner)

In [None]:
from sklearn.linear_model import RidgeClassifierCV

# IMPORTANT NOTE: Ideal stacking requires out-of-fold predictions from ALL models
# The transformers (DistilBERT/DeBERTa) do not have OOF predictions readily available
# For simplicity and to avoid data leakage, we use the optimized weighted ensemble as the primary method
# We keep the stacking code, but it is OPTIONAL and may overfit slightly

# Stack all VALIDATION logits
stacked_features_val = np.column_stack([
    logits_deberta, 
    logits_distilbert, 
    logits_logreg, 
    logits_xgb, 
    logits_svc_cal
])

print(f"Stacked features shape: {stacked_features_val.shape}")

# Meta-learner: ridge classifier (regularized to reduce overfitting)
meta_model = OneVsRestClassifier(
    RidgeClassifierCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0], cv=3),
    n_jobs=-1
)

print("Training stacking meta-model...")
print("⚠️  WARNING: Training on validation set (not ideal but validation is small)")
meta_model.fit(stacked_features_val, y_va)

# Meta-model predictions
pred_stacking = meta_model.predict(stacked_features_val)
metrics_stacking = compute_metrics(y_va, pred_stacking)
print(f"Stacking Ensemble - F1: {metrics_stacking['f1']:.4f}, Precision: {metrics_stacking['precision']:.4f}, Recall: {metrics_stacking['recall']:.4f}, Hamming: {metrics_stacking['hamming_loss']:.4f}")
print(f"⚠️  These metrics may be optimistic - prefer Weighted Ensemble metrics for true performance")
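
The out-of-fold (OOF) idea mentioned in the comments above can be sketched generically: every sample is scored by a model that never saw it during fitting, and the meta-learner is then trained on those scores. This is a minimal NumPy sketch, not the notebook's method; `out_of_fold_predictions` and the placeholder model `label_prior` are hypothetical names.

```python
import numpy as np

def out_of_fold_predictions(fit_predict, X, y, n_splits=3, seed=42):
    """fit_predict(X_train, y_train, X_test) must return scores for X_test."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.zeros(y.shape, dtype=float)
    for held_out in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[held_out] = False
        # Fit on the other folds, score only the held-out fold.
        oof[held_out] = fit_predict(X[train_mask], y[train_mask], X[held_out])
    return oof

def label_prior(X_train, y_train, X_test):
    # Hypothetical placeholder model: predicts each label's training frequency.
    return np.tile(y_train.mean(axis=0), (len(X_test), 1))

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
oof = out_of_fold_predictions(label_prior, X, y)
```

For the transformers, `fit_predict` would mean a full fine-tuning run per fold, which is the cost that motivates the weighted ensemble as the primary method.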

In [None]:
# Save the meta-model (optional, prefer the weighted ensemble)
joblib.dump(meta_model, "meta_model_stacking.joblib")
print("Stacking meta-model saved (use with caution - may be overfit)")

## 10. Weight Optimization with Optuna

In [None]:
import optuna
from sklearn.metrics import hamming_loss

def objective_with_hamming(trial):
    w_deberta = trial.suggest_float("w_deberta", 0.3, 0.6)
    w_distilbert = trial.suggest_float("w_distilbert", 0.1, 0.4)
    w_logreg = trial.suggest_float("w_logreg", 0.1, 0.3)
    w_xgb = trial.suggest_float("w_xgb", 0.05, 0.25)
    w_svc = max(0.0, 1.0 - w_deberta - w_distilbert - w_logreg - w_xgb)
    
    ensemble_logits_opt = (w_deberta * logits_deberta + 
                           w_distilbert * logits_distilbert + 
                           w_logreg * logits_logreg + 
                           w_xgb * logits_xgb + 
                           w_svc * logits_svc_cal)
    
    # Optimize per-label thresholds
    ths_opt = np.zeros(ensemble_logits_opt.shape[1])
    for k in range(ensemble_logits_opt.shape[1]):
        s = ensemble_logits_opt[:, k]
        best_f1, best_t = 0.0, 0.0
        candidates = np.quantile(s, np.linspace(0.01, 0.99, 50))
        for t in candidates:
            preds_k = (s >= t).astype(int)
            f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        ths_opt[k] = best_t
    
    pred_opt = (ensemble_logits_opt >= ths_opt).astype(int)
    
    # Combine macro F1 and Hamming loss
    f1_macro = f1_score(y_va, pred_opt, average='macro')
    hamming = hamming_loss(y_va, pred_opt)
    
    return f1_macro - 0.3 * hamming

print("Optimizing ensemble weights (F1 macro - 0.3 * Hamming Loss)...")
study = optuna.create_study(direction="maximize")
study.optimize(objective_with_hamming, n_trials=50, show_progress_bar=True)

print(f"\nBest score (F1 - 0.3*Hamming): {study.best_value:.4f}")
print("Best weights:", study.best_params)
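
In the objective above, `w_svc` is the clipped remainder, so whenever the four sampled weights add up to more than 1 the five weights no longer sum to 1. One possible alternative, shown here as a sketch and not as the notebook's method, is to normalize the raw weights before blending; `normalize_weights` is a hypothetical helper.

```python
import numpy as np

def normalize_weights(raw_weights):
    """Scale non-negative raw weights so they sum to 1."""
    w = np.clip(np.asarray(raw_weights, dtype=float), 0.0, None)
    total = w.sum()
    if total == 0.0:
        # Fall back to uniform weights when every raw weight is zero
        return np.full(w.shape, 1.0 / w.size)
    return w / total

# With the notebook's ranges the four sampled weights can reach
# 0.6 + 0.4 + 0.3 + 0.25 = 1.55, which clips the SVC weight to 0
# and leaves the total above 1.
sampled = [0.6, 0.4, 0.3, 0.25]
w_svc_clipped = max(0.0, 1.0 - sum(sampled))
weights = normalize_weights(sampled + [w_svc_clipped])
```

Because the thresholds are re-tuned for every trial, a uniform rescaling of all weights only rescales the thresholds; normalizing mainly keeps the reported weights interpretable as shares.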

In [None]:
best_params = study.best_params
w_deberta_opt = best_params['w_deberta']
w_distilbert_opt = best_params['w_distilbert']
w_logreg_opt = best_params['w_logreg']
w_xgb_opt = best_params['w_xgb']
# Clip at 0, exactly as in the objective, so the weights match what Optuna evaluated
w_svc_opt = max(0.0, 1.0 - w_deberta_opt - w_distilbert_opt - w_logreg_opt - w_xgb_opt)

ensemble_optimized = (w_deberta_opt * logits_deberta + 
                      w_distilbert_opt * logits_distilbert + 
                      w_logreg_opt * logits_logreg + 
                      w_xgb_opt * logits_xgb + 
                      w_svc_opt * logits_svc_cal)

# Threshold optimization with a finer search
ths_optimized = np.zeros(ensemble_optimized.shape[1])
for k in range(ensemble_optimized.shape[1]):
    s = ensemble_optimized[:, k]
    best_f1, best_t = 0.0, 0.0
    candidates = np.unique(np.quantile(s, np.linspace(0, 1, 100)))
    for t in candidates:
        preds_k = (s >= t).astype(int)
        f1 = f1_score(y_va[:, k], preds_k, zero_division=0)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    ths_optimized[k] = best_t

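# Adjustment to reduce Hamming loss: match the average number of predicted labels
# per sample to the average observed in the validation labels (y_va)
avg_labels_val = y_va.sum(axis=1).mean()
for iteration in range(10):
    pred_optimized = (ensemble_optimized >= ths_optimized).astype(int)
    current_avg = pred_optimized.sum(axis=1).mean()
    
    if abs(current_avg - avg_labels_val) < 0.1:
        break
    
    # Shift by 2% of |threshold| so negative thresholds also move in the intended direction
    if current_avg > avg_labels_val:
        ths_optimized += 0.02 * np.abs(ths_optimized)
    else:
        ths_optimized -= 0.02 * np.abs(ths_optimized)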

pred_optimized = (ensemble_optimized >= ths_optimized).astype(int)
metrics_optimized = compute_metrics(y_va, pred_optimized)
print(f"Optimized Ensemble - F1: {metrics_optimized['f1']:.4f}, Precision: {metrics_optimized['precision']:.4f}, Recall: {metrics_optimized['recall']:.4f}, Hamming: {metrics_optimized['hamming_loss']:.4f}")
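
The per-label threshold search appears twice in this notebook (inside the Optuna objective and in the cell above). A reusable version might look like the sketch below. It is written in plain NumPy so it runs on its own; `binary_f1` and `search_thresholds` are hypothetical names, and the choice to never predict a label when no candidate reaches a positive F1 is an assumption of this sketch (the notebook falls back to a threshold of 0.0).

```python
import numpy as np

def binary_f1(y_true, y_pred):
    """F1 for one label; 0.0 when there are no true or predicted positives."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def search_thresholds(scores, y_true, n_candidates=100):
    """Pick, per label, the quantile-based threshold that maximizes F1."""
    n_labels = scores.shape[1]
    thresholds = np.full(n_labels, np.inf)  # inf means "never predict this label"
    for k in range(n_labels):
        s = scores[:, k]
        candidates = np.unique(np.quantile(s, np.linspace(0, 1, n_candidates)))
        best_f1 = 0.0
        for t in candidates:
            f1 = binary_f1(y_true[:, k], (s >= t).astype(int))
            if f1 > best_f1:
                best_f1, thresholds[k] = f1, t
    return thresholds

# Toy data in which each label is perfectly separable by some threshold
scores = np.array([[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.1]])
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
ths = search_thresholds(scores, y_true)
preds = (scores >= ths).astype(int)
```

Thresholds chosen this way are fit to the validation labels, so the validation F1 they produce is an upper-leaning estimate of test performance.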

## 11. Test-Time Augmentation for DeBERTa

In [None]:
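def tta_predict_deberta(texts, model, tokenizer, n_augmentations=3):
    """Average one deterministic pass with n_augmentations stochastic passes.

    The inputs are not perturbed: the extra passes run with dropout active
    (Monte Carlo dropout), so this is a dropout ensemble rather than input-level TTA.
    """
    all_predictions = []
    # Tokenize once; the same tensors are reused for every pass
    test_inputs = tokenizer(texts, truncation=True, padding=True, max_length=256, return_tensors='pt')
    
    # Pass 1: standard inference with dropout disabled
    model.eval()
    with torch.no_grad():
        outputs = model(**test_inputs)
        all_predictions.append(torch.sigmoid(outputs.logits).cpu().numpy())
    
    # Remaining passes: train mode enables dropout, gradients stay disabled
    model.train()
    with torch.no_grad():
        for _ in range(n_augmentations):
            outputs = model(**test_inputs)
            all_predictions.append(torch.sigmoid(outputs.logits).cpu().numpy())
    model.eval()  # restore inference mode so later calls are deterministic
    
    return np.mean(all_predictions, axis=0)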
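
`tta_predict_deberta` tokenizes and runs every text in a single forward pass, which can exhaust memory on a large test set. A common remedy is to process the texts in fixed-size chunks and concatenate the per-chunk outputs. The helper below is a hypothetical, framework-free sketch of the chunking step only; the batch size mentioned in the comment is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

# Each chunk would be tokenized and passed through the model separately
# (for example with batch_size=32), and the resulting probability arrays
# concatenated along axis 0 to recover one row per input text.
chunks = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
```

Since order is preserved, the concatenated outputs line up with the rows of `df_test` without any re-indexing.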

In [None]:
# Load the test dataset
df_test = pd.read_csv(test_dir)
df_test["text"] = df_test["movie_name"].fillna("") + " [SEP] " + df_test["description"].fillna("")
print(f"Test dataset size: {len(df_test)}")

## 12. Generating Individual Predictions on Test

In [None]:
# Generate test features (NO augmentation)
print("Generating TF-IDF features for test...")
Xw_test = tfidf_word.transform(df_test["text"])
Xc_test = tfidf_char.transform(df_test["text"])
X_test_tfidf = sp_hstack([Xw_test, Xc_test], format="csr")

print("Generating BGE embeddings for test...")
emb_test = st_model.encode(df_test["text"].tolist(), show_progress_bar=True, batch_size=16, normalize_embeddings=True)
X_test_combined = sp_hstack([X_test_tfidf, csr_matrix(emb_test)], format="csr")
print(f"Test features shape: {X_test_combined.shape}")

In [None]:
# 1. LogisticRegression predictions
print("Generating LogReg predictions...")
logits_logreg_test = clf_logreg.decision_function(X_test_combined)
pred_logreg_test = (logits_logreg_test >= ths_logreg).astype(int)

pred_labels_logreg = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_logreg_test]
result_logreg = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_logreg,
    "description": df_test["description"]
})
result_logreg.to_csv("dataset_test_preds_logreg.csv", index=False)
print("✓ LogReg predictions saved: dataset_test_preds_logreg.csv")
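
The decoding step (binary matrix to comma-separated genre strings) is repeated for every model in this section. It could be factored into one helper, sketched below. `decode_labels` is a hypothetical name, the class names are illustrative, and the fallback to the highest-scoring class for rows with no predicted label is an assumption of this sketch; the notebook leaves such rows as empty strings.

```python
import numpy as np

def decode_labels(pred_matrix, scores, classes, sep=", "):
    """Turn a binary prediction matrix into label strings.

    Rows with no predicted label fall back to the single highest-scoring
    class (an assumption of this sketch, not the notebook's behavior).
    """
    pred_matrix = np.asarray(pred_matrix)
    scores = np.asarray(scores)
    decoded = []
    for row, row_scores in zip(pred_matrix, scores):
        idx = np.flatnonzero(row == 1)
        if idx.size == 0:
            idx = np.array([int(np.argmax(row_scores))])
        decoded.append(sep.join(classes[j] for j in idx))
    return decoded

classes = ["Action", "Comedy", "Drama"]  # illustrative class names
scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.3, 0.1]])
preds = (scores >= 0.5).astype(int)
labels = decode_labels(preds, scores, classes)
```

In the notebook, `classes` would be `mlb.classes_` and the helper would replace each list comprehension that builds `pred_labels_*`.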

In [None]:
# 2. XGBoost predictions
print("Generating XGBoost predictions...")
pred_proba_xgb_test = clf_xgb.predict_proba(emb_test)
logits_xgb_test = np.column_stack([p[:, 1] for p in pred_proba_xgb_test])
pred_xgb_test = (logits_xgb_test >= ths_xgb).astype(int)

pred_labels_xgb = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_xgb_test]
result_xgb = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_xgb,
    "description": df_test["description"]
})
result_xgb.to_csv("dataset_test_preds_xgb.csv", index=False)
print("✓ XGBoost predictions saved: dataset_test_preds_xgb.csv")

In [None]:
# 3. LinearSVC predictions
print("Generating LinearSVC predictions...")
logits_svc_test = clf_svc.decision_function(X_test_tfidf)
pred_svc_test = (logits_svc_test >= ths_svc).astype(int)

pred_labels_svc = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_svc_test]
result_svc = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_svc,
    "description": df_test["description"]
})
result_svc.to_csv("dataset_test_preds_svc.csv", index=False)
print("✓ LinearSVC predictions saved: dataset_test_preds_svc.csv")

In [None]:
# 4. Calibrated SVC predictions
print("Generating Calibrated SVC predictions...")
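# predict_proba returns a list of (n_samples, 2) arrays for MultiOutputClassifier,
# or a single (n_samples, n_labels) array for OneVsRestClassifier; handle both
proba_svc_cal_test = clf_svc_calibrated.predict_proba(X_test_tfidf)
if isinstance(proba_svc_cal_test, list):
    logits_svc_test_cal = np.column_stack([p[:, 1] for p in proba_svc_cal_test])
else:
    logits_svc_test_cal = np.asarray(proba_svc_cal_test)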
pred_svc_cal_test = (logits_svc_test_cal >= ths_svc_cal).astype(int)

pred_labels_svc_cal = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_svc_cal_test]
result_svc_cal = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_svc_cal,
    "description": df_test["description"]
})
result_svc_cal.to_csv("dataset_test_preds_svc_calibrated.csv", index=False)
print("✓ Calibrated SVC predictions saved: dataset_test_preds_svc_calibrated.csv")

In [None]:
# 5. DistilBERT predictions
print("Generating DistilBERT predictions...")
model_distilbert.eval()
with torch.no_grad():
    test_inputs = tokenizer_distilbert(df_test["text"].tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')
    outputs = model_distilbert(**test_inputs)
    logits_distilbert_test = torch.sigmoid(outputs.logits).cpu().numpy()

pred_distilbert_test = (logits_distilbert_test >= ths_distilbert).astype(int)

pred_labels_distilbert = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_distilbert_test]
result_distilbert = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_distilbert,
    "description": df_test["description"]
})
result_distilbert.to_csv("dataset_test_preds_distilbert.csv", index=False)
print("✓ DistilBERT predictions saved: dataset_test_preds_distilbert.csv")

In [None]:
# 6. DeBERTa predictions with TTA
print("Generating DeBERTa predictions with TTA...")
logits_deberta_test_tta = tta_predict_deberta(df_test["text"].tolist(), model_deberta, tokenizer_deberta)
pred_deberta_test = (logits_deberta_test_tta >= ths_deberta).astype(int)

pred_labels_deberta = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_deberta_test]
result_deberta = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_deberta,
    "description": df_test["description"]
})
result_deberta.to_csv("dataset_test_preds_deberta.csv", index=False)
print("✓ DeBERTa predictions saved: dataset_test_preds_deberta.csv")

## 13. Final Ensemble - Selecting the Best

In [None]:
# Build the optimized weighted ensemble
print("Creating optimized weighted ensemble...")
ensemble_optimized_test = (w_deberta_opt * logits_deberta_test_tta + 
                           w_distilbert_opt * logits_distilbert_test + 
                           w_logreg_opt * logits_logreg_test + 
                           w_xgb_opt * logits_xgb_test + 
                           w_svc_opt * logits_svc_test_cal)

pred_optimized_test = (ensemble_optimized_test >= ths_optimized).astype(int)

pred_labels_optimized = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_optimized_test]
result_optimized = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_optimized,
    "description": df_test["description"]
})
result_optimized.to_csv("dataset_test_preds_weighted_ensemble.csv", index=False)
print("✓ Weighted Ensemble predictions saved: dataset_test_preds_weighted_ensemble.csv")
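
The weighted sum mixes scores on different scales: `logits_logreg_test` comes from `decision_function` (unbounded margins), while the XGBoost, calibrated SVC, and transformer test scores in the cells above are probabilities in [0, 1]. That is consistent as long as the validation scores used to tune the weights and thresholds were built the same way. If a common scale is preferred, one option is to map margins through a sigmoid before blending. The sketch below assumes that choice and uses made-up numbers and weights.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable logistic function."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

margins = np.array([[-2.0, 0.0, 3.0]])   # e.g. decision_function output (made up)
probas = np.array([[0.10, 0.50, 0.90]])  # e.g. predict_proba output (made up)
w_margin, w_proba = 0.4, 0.6             # illustrative weights that sum to 1

blended = w_margin * sigmoid(margins) + w_proba * probas
```

Any such rescaling would have to be applied identically to the validation and test scores, with the weights and thresholds re-tuned afterwards.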

In [None]:
# Build the stacking ensemble
print("Creating stacking ensemble...")
stacked_features_test = np.column_stack([
    logits_deberta_test_tta,
    logits_distilbert_test,
    logits_logreg_test,
    logits_xgb_test,
    logits_svc_test_cal
])

pred_stacking_test = meta_model.predict(stacked_features_test)

pred_labels_stacking = [", ".join([mlb.classes_[j] for j, v in enumerate(row) if v == 1]) for row in pred_stacking_test]
result_stacking = pd.DataFrame({
    "movie_name": df_test["movie_name"],
    "genre": pred_labels_stacking,
    "description": df_test["description"]
})
result_stacking.to_csv("dataset_test_preds_stacking_ensemble.csv", index=False)
print("✓ Stacking Ensemble predictions saved: dataset_test_preds_stacking_ensemble.csv")

In [None]:
# Select the final ensemble for submission
print("="*80)
print("SELECTING BEST ENSEMBLE FOR FINAL SUBMISSION")
print("="*80)
print(f"Stacking Ensemble Validation F1: {metrics_stacking['f1']:.4f} (may be optimistic)")
print(f"Weighted Ensemble Validation F1: {metrics_optimized['f1']:.4f} (more reliable)")
print("="*80)

# ALWAYS USE THE WEIGHTED ENSEMBLE (more reliable, no data leakage)
print(f"\n✓ USING WEIGHTED ENSEMBLE (F1={metrics_optimized['f1']:.4f}) for final submission")
print("  Reason: Optimized with Optuna on clean validation set, no data leakage")
final_submission = result_optimized.copy()
final_submission.to_csv("dataset_test_preds.csv", index=False)

print("\n✓✓✓ FINAL SUBMISSION saved: dataset_test_preds.csv ✓✓✓")
print("="*80)

## 14. Final Results Summary

In [None]:
print("="*80)
print("VALIDATION PERFORMANCE SUMMARY")
print("="*80)
print(f"1. LogReg (TF-IDF+BGE):         F1: {metrics_logreg['f1']:.4f}, Hamming: {metrics_logreg['hamming_loss']:.4f}")
print(f"2. XGBoost (BGE Embeddings):    F1: {metrics_xgb['f1']:.4f}, Hamming: {metrics_xgb['hamming_loss']:.4f}")
print(f"3. LinearSVC (TF-IDF):          F1: {metrics_svc['f1']:.4f}, Hamming: {metrics_svc['hamming_loss']:.4f}")
print(f"4. Calibrated SVC:              F1: {metrics_svc_cal['f1']:.4f}, Hamming: {metrics_svc_cal['hamming_loss']:.4f}")
print(f"5. DistilBERT (Focal+Smooth):   F1: {metrics_distilbert['f1']:.4f}, Hamming: {metrics_distilbert['hamming_loss']:.4f}")
print(f"6. DeBERTa-v3 (Focal+Smooth):   F1: {metrics_deberta['f1']:.4f}, Hamming: {metrics_deberta['hamming_loss']:.4f}")
print(f"7. STACKING Ensemble:           F1: {metrics_stacking['f1']:.4f}, Hamming: {metrics_stacking['hamming_loss']:.4f} ⚠️")
print(f"8. WEIGHTED Ensemble (Optuna):  F1: {metrics_optimized['f1']:.4f}, Hamming: {metrics_optimized['hamming_loss']:.4f} ✓")
print("="*80)
print("\n✓ Data Augmentation: 35% of train only (validation untouched)")
print("✓ TF-IDF: n-grams (1,4) word + (3,6) char, max_features=750k")
print("✓ Embeddings: BGE-Large-en-v1.5 (1024 dim, normalized)")
print("✓ Loss: Focal Loss (alpha=0.25, gamma=2.0) + Label Smoothing (0.1)")
print("✓ Calibration: SVC with sigmoid on train (cv=3)")
print("✓ Ensemble: Weighted optimized with Optuna (F1 - 0.3*Hamming, 50 trials)")
print("✓ Metrics: All calculated with validator.py compute_metrics()")
print("\n⚠️  Stacking trained on validation (may overfit) - Weighted Ensemble preferred")
print("="*80)

In [None]:
print("\n" + "="*80)
print("CSV FILES GENERATED FOR EACH MODEL:")
print("="*80)
print("1. dataset_test_preds_logreg.csv")
print("2. dataset_test_preds_xgb.csv")
print("3. dataset_test_preds_svc.csv")
print("4. dataset_test_preds_svc_calibrated.csv")
print("5. dataset_test_preds_distilbert.csv")
print("6. dataset_test_preds_deberta.csv")
print("7. dataset_test_preds_weighted_ensemble.csv")
print("8. dataset_test_preds_stacking_ensemble.csv")
print("9. dataset_test_preds.csv (BEST ENSEMBLE - SUBMIT THIS ONE)")
print("="*80)

## 📝 HOW THE FINAL FILE IS GENERATED

The file **`dataset_test_preds.csv`** is generated as follows:

1. **Six models are trained** on the train dataset:
   - LogisticRegression, XGBoost, LinearSVC, Calibrated SVC, DistilBERT, DeBERTa

2. **Each model generates predictions on test** → 6 individual CSV files

3. **Two ensembles are built**:
   - **Weighted Ensemble**: combines five of the six models (all except the uncalibrated LinearSVC) with weights optimized by Optuna
   - **Stacking Ensemble**: uses a Ridge meta-model to combine the same five sets of predictions

4. **The Weighted Ensemble is always selected** as the final ensemble:
   - The validation F1 of both ensembles is printed for comparison
   - The Stacking F1 is optimistic because the meta-model is trained and evaluated on the validation set, so it does not drive the choice

5. **The Weighted Ensemble is saved as `dataset_test_preds.csv`** ← **THIS IS THE FINAL FILE TO SUBMIT**

**Summary of generated CSV files:**
- 6 individual files, one per model (for analysis)
- 2 ensemble files (weighted and stacking)
- **1 final file: `dataset_test_preds.csv`** ✅ ← Submit this one