# Analizador de Sentimiento Biling√ºe en Rese√±as de Pel√≠culas (EN/ES) con XLM-R

## Resumen
Este trabajo es una **extensi√≥n y mejora** de *‚ÄúAn√°lisis de Rese√±as de Rotten Tomatoes con NLP‚Äù* (modelo cl√°sico de **Regresi√≥n Log√≠stica**, Acc‚âà81% sobre >1M rese√±as).  
Aqu√≠ migramos a arquitecturas **Transformers**, comparamos **DistilRoBERTa** (baseline r√°pido) con **XLM-RoBERTa base (XLM-R)** para soporte **biling√ºe EN/ES**, y **desplegamos** el mejor sistema como una aplicaci√≥n accesible (Gradio en Hugging Face Spaces).

Usamos un split **group-aware por pel√≠cula** y un escenario de comparaci√≥n **100k/10k** (train/test) para iteraci√≥n r√°pida. Adem√°s, ajustamos el **umbral de decisi√≥n** para equilibrar precisi√≥n/recobrado en uso real.

**Resultados (100k/10k, umbral √≥ptimo por F1 de cada modelo):**
- **DistilRoBERTa** ‚Üí Acc **0.8484** ¬∑ F1 **0.8882** ¬∑ Prec 0.8426 ¬∑ **Rec 0.9390** ¬∑ AUC **0.9282** ¬∑ *thr‚âà0.6046*  
- **XLM-R (base)** ‚Üí **Acc 0.8519** ¬∑ F1 0.8876 ¬∑ **Prec 0.8646** ¬∑ Rec 0.9119 ¬∑ AUC 0.9260 ¬∑ *thr‚âà0.4800*  

**Conclusi√≥n breve:** XLM-R ofrece **menor tasa de falsos positivos** (‚ÜëPrecisi√≥n) con exactitud ligeramente superior, manteniendo un recall alto y habilitando **biling√ºismo**; por ello es el modelo elegido para despliegue (umbral operativo ‚âà **0.48**).

---

## Metodolog√≠a
1. **Datos**: rese√±as de cr√≠ticos de *Rotten Tomatoes* (Kaggle; >1M).  
2. **Preprocesamiento**:
   - Auto-detecci√≥n de columnas (texto/etiqueta/agrupador) y **limpieza m√≠nima** (HTML, espacios, contracciones).
   - Normalizaci√≥n de etiqueta binaria (`Fresh`/`Rotten` ‚Üí {1,0}).
3. **Partici√≥n**:
   - **GroupShuffleSplit** por pel√≠cula (evita fuga de informaci√≥n por t√≠tulo).
   - Submuestreo estratificado para escenarios **50k/10k** y **100k/10k**.
4. **Modelado**:
   - **DistilRoBERTa** (baseline r√°pido, EN).
   - **XLM-R base** (modelo principal, **EN/ES**).
   - Entrenamiento con `fp16/bf16` (seg√∫n GPU), `gradient_checkpointing`, `cosine` scheduler, `EarlyStopping`.
5. **Evaluaci√≥n**:
   - `Accuracy`, `F1`, `Precision`, `Recall`, `ROC-AUC`.
   - Barrido de **umbral** y reporte de matriz de confusi√≥n.
6. **Despliegue**:
   - App **Gradio** en Hugging Face Spaces.
   - **API** autoexpuesta con endpoints `/predict_single` y `/predict_batch`.

---

> **Modelo en despliegue:** XLM-R base (EN/ES), umbral operativo ‚âà **0.48**, consumido por la app Gradio y su API.

---

In [1]:
# Lectura del CSV de cr√≠ticas de Rotten Tomatoes
import pandas as pd

df = pd.read_csv('data/rotten_tomatoes_critic_reviews.csv')

# Inspecci√≥n r√°pida de columnas disponibles 
print("Columnas del CSV:", list(df.columns))


Columnas del CSV: ['rotten_tomatoes_link', 'critic_name', 'top_critic', 'publisher_name', 'review_type', 'review_score', 'review_date', 'review_content']


## 2) Preparaci√≥n del Dataset (detecci√≥n de columnas, limpieza y normalizaci√≥n)

En esta secci√≥n:
- **Auto-detectamos** las columnas de texto, etiqueta y agrupador (pel√≠cula).
- **Limpiamos** el texto (HTML, espacios, contracciones).
- **Normalizamos** la etiqueta a binaria `sentimiento ‚àà {0,1}`.
- Construimos un **DataFrame est√°ndar** con: `review_clean`, `sentimiento`, `group_key`.

In [2]:
# ============================
# Configuraci√≥n
# ============================
from __future__ import annotations
import re, html, warnings
import pandas as pd
from typing import Iterable, Optional, Tuple, Dict, Any

TEXT_CANDIDATES  = ["review_clean","review","review_text","review_content","content","text","critic_review"]
LABEL_CANDIDATES = ["sentimiento","fresh","label","freshness","review_type","target","y"]
GROUP_CANDIDATES = ["rotten_tomatoes_link","movie_title","title","movie","film"]

# Sin√≥nimos / formatos v√°lidos para etiquetas
LABEL_MAP_TEXT_POS = {
    "fresh","positivo","positive","pos","freshness_fresh","freshness fresh"
}
LABEL_MAP_TEXT_NEG = {
    "rotten","negativo","negative","neg","freshness_rotten","freshness rotten"
}

# ============================
# Detecci√≥n de columnas
# ============================
def pick_first(cols: Iterable[str], candidates: Iterable[str]) -> Optional[str]:
    """Devuelve el primer nombre de 'candidates' que exista en 'cols'; None si no hay coincidencia."""
    cols_set = set(cols)
    for c in candidates:
        if c in cols_set:
            return c
    return None

def detect_columns(df: pd.DataFrame) -> Tuple[str, str, str]:
    """Detecta (text_col, label_col, group_col). Si group no existe, usa fallback."""
    cols = list(df.columns)

    text_col = pick_first(cols, TEXT_CANDIDATES)
    if text_col is None:
        raise ValueError(
            "No se encontr√≥ una columna de texto. Agrega una de: "
            + ", ".join([f"'{c}'" for c in TEXT_CANDIDATES])
        )

    label_col = pick_first(cols, LABEL_CANDIDATES)
    if label_col is None:
        raise ValueError(
            "No se encontr√≥ una columna de etiqueta. Agrega una de: "
            + ", ".join([f"'{c}'" for c in LABEL_CANDIDATES])
        )

    group_col = pick_first(cols, GROUP_CANDIDATES)
    if group_col is None:
        # fallback por t√≠tulo o √≠ndice
        group_col = pick_first(cols, ["movie_title","title"])
        if group_col is None:
            warnings.warn(
                "No hay columna de agrupaci√≥n (pel√≠cula). "
                "Se crear√° '_group_fallback' agrupando por bloques de 10 √≠ndices (no ideal).",
                UserWarning
            )
            df["_group_fallback"] = (df.index // 10)
            group_col = "_group_fallback"

    return text_col, label_col, group_col

# ============================
# Limpieza de texto
# ============================
TAG_RE = re.compile(r"<[^>]+>")
WS_RE  = re.compile(r"\s+")
CONTRACTION_RE = re.compile(r"n['‚Äô]t\b", flags=re.IGNORECASE)  # don't/don‚Äôt ‚Üí do not

def clean_text(t: Any) -> str:
    """Limpieza m√≠nima: HTML ‚Üí texto, contrae 'n't'‚Üí ' not', normaliza espacios."""
    if not isinstance(t, str):
        return ""
    t = html.unescape(t)
    t = TAG_RE.sub(" ", t)
    t = CONTRACTION_RE.sub(" not", t)
    t = WS_RE.sub(" ", t).strip()
    return t

# ============================
# Normalizaci√≥n de etiqueta
# ============================
def to_binary_label(v: Any) -> Optional[int]:
    """Convierte etiquetas heterog√©neas a {0,1}. Devuelve None si no se puede mapear."""
    if pd.isna(v):
        return None

    # Texto
    if isinstance(v, str):
        s = v.strip().lower()
        if s in LABEL_MAP_TEXT_POS or "fresh" in s:
            return 1
        if s in LABEL_MAP_TEXT_NEG or "rotten" in s:
            return 0

    # Num√©rico
    try:
        f = float(v)
        if f in (0.0, 1.0):
            return int(f)
        if 0.0 <= f <= 1.0:
            return int(round(f))        # e.g., probabilidad ya en [0,1]
        if 1.0 <= f <= 5.0:
            return int(f >= 3.0)        # rating 1‚Äì5 ‚Üí >=3 positivo
    except Exception:
        pass

    return None

# ============================
# Pipeline de preparaci√≥n
# ============================
def prepare_reviews_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Estandariza el DataFrame para el pipeline:
    - detecta columnas (texto, label, grupo)
    - limpia texto ‚Üí 'review_clean'
    - normaliza etiqueta ‚Üí 'sentimiento' (0/1)
    - renombra grupo ‚Üí 'group_key'
    - filtra nulos y devuelve solo columnas est√°ndar
    """
    text_col, label_col, group_col = detect_columns(df)

    out = df.copy()

    # Limpieza de texto
    out["review_clean"] = out[text_col].map(clean_text)

    # Normaliza etiqueta binaria
    out["sentimiento"] = out[label_col].apply(to_binary_label)

    # Mantener trio est√°ndar
    out = out[["review_clean", "sentimiento", group_col]].dropna(subset=["review_clean","sentimiento"])
    out = out.rename(columns={group_col: "group_key"})

    return out, {"text_col": text_col, "label_col": label_col, "group_col": group_col}

# ============================
# 6) Ejecutar preparaci√≥n y diagn√≥stico
# ============================
prepared_df, col_info = prepare_reviews_dataframe(df)

print(">> Columnas detectadas:", col_info)
print(prepared_df[["review_clean","sentimiento","group_key"]].head(3))
print("\n>> Nulos por columna:\n", prepared_df.isna().sum())
print("\n>> Distribuci√≥n de la etiqueta (0/1):\n", prepared_df["sentimiento"].value_counts(dropna=False).to_frame("count"))



>> Columnas detectadas: {'text_col': 'review_content', 'label_col': 'review_type', 'group_col': 'rotten_tomatoes_link'}
                                        review_clean  sentimiento  group_key
0  A fantasy adventure that fuses Greek mythology...            1  m/0814255
1  Uma Thurman as Medusa, the gorgon with a coiff...            1  m/0814255
2  With a top-notch cast and dazzling special eff...            1  m/0814255

>> Nulos por columna:
 review_clean    0
sentimiento     0
group_key       0
dtype: int64

>> Distribuci√≥n de la etiqueta (0/1):
               count
sentimiento        
1            720210
0            409807


## 2) Construcci√≥n del set de entrenamiento/prueba (salteando baselines)

En esta secci√≥n preparamos los datos **directamente** para evaluar el modelo final (XLM-R), omitiendo los modelos intermedios y sus *tunings*.

**Objetivos:**
- Estandarizar columnas a: `text` (entrada), `label` (0/1), `group` (pel√≠cula).
- Asegurar una **partici√≥n por grupos** (pel√≠cula) con `GroupShuffleSplit` 80/20 para evitar fuga de informaci√≥n.
- Dejar `X_train`, `y_train`, `X_test`, `y_test` listos para inferencia/evaluaci√≥n del modelo final.

In [3]:
# ============================================
# 2.1) Imports y configuraci√≥n reproducible
# ============================================
import re, html, numpy as np, pandas as pd
from sklearn.model_selection import GroupShuffleSplit

RANDOM_STATE = 42
TEST_SIZE    = 0.20

np.random.seed(RANDOM_STATE)


In [4]:
# ==============================================================
# 2.2) Utilidades m√≠nimas (limpieza, binarizaci√≥n, detecci√≥n)
# ==============================================================

def clean_text(t: str) -> str:
    """Limpieza m√≠nima: HTML->texto, normaliza espacios, expande n't ‚Üí not."""
    if not isinstance(t, str):
        return ""
    t = html.unescape(t)
    t = re.sub(r"<[^>]+>", " ", t)
    t = re.sub(r"n['‚Äô]t\b", " not", t)  # don't/don‚Äôt ‚Üí do not
    t = re.sub(r"\s+", " ", t).strip()
    return t

def pick_first(cols, cands):
    """Devuelve el primer nombre en cands que exista en cols, o None."""
    cols = set(cols)
    for c in cands:
        if c in cols:
            return c
    return None

def binarize_label(v):
    """Normaliza etiqueta heterog√©nea a {0,1}. Devuelve None si no mapea."""
    if pd.isna(v):
        return None
    if isinstance(v, str):
        s = v.strip().lower()
        if "fresh" in s or s in {"positive","pos","freshness fresh","freshness_fresh"}:
            return 1
        if "rotten" in s or s in {"negative","neg","freshness rotten","freshness_rotten"}:
            return 0
    try:
        f = float(v)
        if f in (0, 1):
            return int(f)
        if 0 <= f <= 1:
            return int(round(f))     # si ya vino como prob en [0,1]
        if 1 <= f <= 5:
            return int(f >= 3)       # rating 1‚Äì5 ‚Üí >=3 positivo
    except Exception:
        pass
    return None


In [5]:
# =====================================================================
# 2.3) Estandarizaci√≥n a (text, label, group) con fallback de agrupador
# =====================================================================
# Asumimos que ya tienes un DataFrame `df` con las columnas originales.

# Si ya trabajaste la secci√≥n previa y tienes `prepared_df` con:
#   ["review_clean","sentimiento","group_key"]
# puedes descomentar estas 3 l√≠neas y saltarte la auto-detecci√≥n:
#
# df_trf = prepared_df.rename(columns={
#     "review_clean": "text", "sentimiento": "label", "group_key": "group"
# })[["text","label","group"]].dropna()
# display(df_trf.head())

# Si NO tienes `prepared_df`, usamos auto-detecci√≥n r√°pida:
text_col  = pick_first(df.columns, ["review_clean","review","review_text","review_content","content","text","critic_review"])
label_col = pick_first(df.columns, ["sentimiento","fresh","label","freshness","review_type","target","y"])
group_col = pick_first(df.columns, ["rotten_tomatoes_link","movie_title","title","movie","film"])

if text_col is None or label_col is None:
    raise ValueError("No encuentro columnas de texto/label. Revisa nombres en tu CSV/DF.")

if group_col is None:
    # Agrupador m√≠nimo por √≠ndice para evitar fuga total (no ideal, pero mejor que nada)
    df["_group_key"] = (df.index // 10)
    group_col = "_group_key"

df_trf = pd.DataFrame({
    "text":  df[text_col].map(clean_text),
    "label": df[label_col].map(binarize_label),
    "group": df[group_col]
}).dropna()

# Tipado final de la etiqueta
df_trf["label"] = df_trf["label"].astype(int)

print(">> Ejemplo de filas estandarizadas:")
display(df_trf.head(3))

print("\n>> Distribuci√≥n global de la etiqueta:")
display(df_trf["label"].value_counts(normalize=False).to_frame("count"))


>> Ejemplo de filas estandarizadas:


Unnamed: 0,text,label,group
0,A fantasy adventure that fuses Greek mythology...,1,m/0814255
1,"Uma Thurman as Medusa, the gorgon with a coiff...",1,m/0814255
2,With a top-notch cast and dazzling special eff...,1,m/0814255



>> Distribuci√≥n global de la etiqueta:


Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
1,720210
0,409807


In [6]:
# ==========================================================
# 2.4) Split 80/20 con GroupShuffleSplit (evita fuga por film)
# ==========================================================
X_all      = df_trf["text"]
y_all      = df_trf["label"]
groups_all = df_trf["group"]

gss = GroupShuffleSplit(n_splits=1, test_size=TEST_SIZE, random_state=RANDOM_STATE)
tr_idx, te_idx = next(gss.split(X_all, y_all, groups_all))

X_train, X_test = X_all.iloc[tr_idx].reset_index(drop=True), X_all.iloc[te_idx].reset_index(drop=True)
y_train, y_test = y_all.iloc[tr_idx].reset_index(drop=True), y_all.iloc[te_idx].reset_index(drop=True)
groups_train    = groups_all.iloc[tr_idx].reset_index(drop=True)
groups_test     = groups_all.iloc[te_idx].reset_index(drop=True)

print(f">> Tama√±os ‚Üí train: {len(X_train):,} | test: {len(X_test):,}")


>> Tama√±os ‚Üí train: 908,582 | test: 221,435


In [7]:
# =========================================
# 2.5) Chequeos r√°pidos (sanity checks)
# =========================================
def quick_report(y_tr, y_te, g_tr, g_te):
    rep = {}
    rep["label_train_counts"] = y_tr.value_counts().to_dict()
    rep["label_test_counts"]  = y_te.value_counts().to_dict()
    rep["n_groups_train"]     = g_tr.nunique()
    rep["n_groups_test"]      = g_te.nunique()
    rep["groups_overlap"]     = bool(set(g_tr.unique()) & set(g_te.unique()))
    return rep

report = quick_report(y_train, y_test, groups_train, groups_test)
print(">> Reporte de partici√≥n (group-aware):")
for k, v in report.items():
    print(f"   - {k}: {v}")

# Aseguramos NO overlap de grupos entre train y test:
assert not report["groups_overlap"], "Hay solapamiento de grupos entre train y test (riesgo de fuga)."

# Vista r√°pida de clases:
display(pd.DataFrame({
    "train": y_train.value_counts(),
    "test":  y_test.value_counts()
}).fillna(0).astype(int))


>> Reporte de partici√≥n (group-aware):
   - label_train_counts: {1: 578236, 0: 330346}
   - label_test_counts: {1: 141974, 0: 79461}
   - n_groups_train: 14169
   - n_groups_test: 3543
   - groups_overlap: False


Unnamed: 0_level_0,train,test
label,Unnamed: 1_level_1,Unnamed: 2_level_1
1,578236,141974
0,330346,79461


> **Checklist**
> - [x] `df_trf` estandarizado con `text`, `label`, `group`.
> - [x] Split 80/20 **por pel√≠cula** con `GroupShuffleSplit`.
> - [x] Sin solapamiento de `group` entre train y test.
> - [x] `X_train`, `y_train`, `X_test`, `y_test` listos para evaluaci√≥n del modelo final.


## 3) Entrenamiento r√°pido del modelo final (salteando baselines y tunning)

En esta secci√≥n entrenaremos un modelo de *Transformers* directamente sobre un **subset** (50k train / 10k test) para iterar r√°pido.  
- Usamos `GroupShuffleSplit` de la secci√≥n anterior, por lo que `X_train`, `y_train`, `X_test`, `y_test` ya est√°n listos.  
- Empezamos con **DistilRoBERTa** por velocidad; luego puedes cambiar a **XLM-R** para soporte biling√ºe.

In [8]:
# 3.1) Imports y flags del entorno (solo PyTorch/Transformers)
import os, random
import numpy as np
import pandas as pd
import torch
import warnings
# Evita warnings de pandas y numpy
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Evita TensorFlow/Flax para reducir dependencias y warnings
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    EarlyStoppingCallback,
    DataCollatorWithPadding,
    pipeline,
    set_seed
)
from transformers.training_args import TrainingArguments
from transformers.trainer_utils import IntervalStrategy


  from .autonotebook import tqdm as notebook_tqdm


In [9]:
# 3.2) Configuraci√≥n de dispositivo, precisi√≥n y reproducibilidad
HAS_CUDA = torch.cuda.is_available()
if HAS_CUDA:
    print("‚úÖ GPU:", torch.cuda.get_device_name(0), "| Capability:", torch.cuda.get_device_capability(0))
    # TF32 acelera matmul en Ampere+ (no afecta precisi√≥n de forma relevante para fine-tuning)
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass

# BF16 solo si la GPU es Ampere o m√°s nueva (major >= 8)
USE_BF16 = HAS_CUDA and (torch.cuda.get_device_capability(0)[0] >= 8)

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if HAS_CUDA: torch.cuda.manual_seed_all(SEED)
set_seed(SEED)

TEST_SIZE = 0.20  # solo para referencia


‚úÖ GPU: NVIDIA GeForce RTX 3060 | Capability: (8, 6)


### 3.3) Subset estratificado: 100k (train) / 10k (test)

Para acelerar el entrenamiento, tomamos una muestra **estratificada por etiqueta**:
- `X_train_big, y_train_big` ‚Üí 100,000 ejemplos (o menos si el train es m√°s peque√±o).
- `X_test_big, y_test_big` ‚Üí 10,000 ejemplos (ajustable).

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

SEED = 42
TRAIN_TARGET = 100_000
TEST_TARGET  = 10_000   # puedes subir/bajar este valor

# --- Train: 100k estratificado ---
n_train = min(TRAIN_TARGET, len(X_train))
sss_tr  = StratifiedShuffleSplit(n_splits=1, train_size=n_train, random_state=SEED)

# usamos un array dummy para X (solo interesa la longitud); estratificamos con y_train
idx_sub_train, _ = next(sss_tr.split(np.zeros(len(y_train)), y_train))

X_train_big = X_train.iloc[idx_sub_train].reset_index(drop=True)
y_train_big = y_train.iloc[idx_sub_train].reset_index(drop=True)

# --- Test: 10k estratificado (si el test es m√°s grande) ---
if len(X_test) > TEST_TARGET:
    sss_te = StratifiedShuffleSplit(n_splits=1, train_size=TEST_TARGET, random_state=SEED)
    idx_sub_test, _ = next(sss_te.split(np.zeros(len(y_test)), y_test))
    X_test_big = X_test.iloc[idx_sub_test].reset_index(drop=True)
    y_test_big = y_test.iloc[idx_sub_test].reset_index(drop=True)
else:
    X_test_big = X_test.reset_index(drop=True)
    y_test_big = y_test.reset_index(drop=True)

print(f"Train size: {len(X_train_big):,} | Test size: {len(X_test_big):,}")

# Distribuci√≥n para control r√°pido
print("\nDistribuci√≥n de clases (train):")
print(y_train_big.value_counts(normalize=True).mul(100).round(2).astype(str) + "%")

print("\nDistribuci√≥n de clases (test):")
print(y_test_big.value_counts(normalize=True).mul(100).round(2).astype(str) + "%")



Train size: 100,000 | Test size: 10,000

Distribuci√≥n de clases (train):
label
1    63.64%
0    36.36%
Name: proportion, dtype: object

Distribuci√≥n de clases (test):
label
1    64.12%
0    35.88%
Name: proportion, dtype: object


In [11]:
# 3.4) Tokenizador
model_name = "distilroberta-base"  # o "xlm-roberta-base" si quieres multiling√ºe
tok = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # max_length moderado para VRAM
    return tok(batch["text"], truncation=True, max_length=224)

# Dynamic padding (m√°s eficiente que padding fijo en GPU)
data_collator = DataCollatorWithPadding(
    tokenizer=tok, pad_to_multiple_of=8 if HAS_CUDA else None
)

In [12]:
# 3.5) Construcci√≥n de datasets Hugging Face (con TODO el dataset)
train_ds = Dataset.from_pandas(pd.DataFrame({"text": X_train_big, "label": y_train_big}))
test_ds  = Dataset.from_pandas(pd.DataFrame({"text": X_test_big,  "label": y_test_big}))

train_ds = (train_ds.map(tokenize, batched=True, remove_columns=["text"])
                    .rename_columns({"label":"labels"})
                    .with_format("torch"))
test_ds  = (test_ds.map(tokenize, batched=True, remove_columns=["text"])
                    .rename_columns({"label":"labels"})
                    .with_format("torch"))

print(train_ds)
print(test_ds)


Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100000/100000 [00:03<00:00, 29157.17 examples/s]
Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10000/10000 [00:00<00:00, 36548.48 examples/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 100000
})
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 10000
})





In [13]:
# 3.6) TrainingArguments 
# Ajusta per_device_* y gradient_accumulation_steps seg√∫n tu VRAM.
EFFECTIVE_BATCH_TARGET = 64                   # batch efectivo deseado (ejemplo)
PER_DEVICE_TRAIN_BS   = 16 if HAS_CUDA else 8 # sube/baja seg√∫n el GPU
GRAD_ACCUM_STEPS      = max(1, EFFECTIVE_BATCH_TARGET // PER_DEVICE_TRAIN_BS)

args = TrainingArguments(
    output_dir="./hf_model",
    num_train_epochs=5,                               # EarlyStopping decidir√°
    per_device_train_batch_size=PER_DEVICE_TRAIN_BS,
    per_device_eval_batch_size=PER_DEVICE_TRAIN_BS * 2,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,     # batch efectivo = BS * steps
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=(HAS_CUDA and not USE_BF16),
    bf16=USE_BF16,
    gradient_checkpointing=True,
    dataloader_pin_memory=True,
    dataloader_num_workers=2 if HAS_CUDA else 0,
    optim="adamw_torch",
    logging_steps=100,
    save_total_limit=2,
    report_to="none",
    seed=SEED,
)

# Evaluaci√≥n/guardado por √©poca y selecci√≥n del mejor
args.evaluation_strategy      = IntervalStrategy.EPOCH
args.save_strategy            = IntervalStrategy.EPOCH
args.eval_strategy            = IntervalStrategy.EPOCH
args.load_best_model_at_end   = True
args.metric_for_best_model    = "f1"
# args.greater_is_better = True  # si tu versi√≥n lo soporta


### 3.7) Definir el modelo de clasificaci√≥n
Creamos un modelo de Hugging Face con 2 etiquetas (positiva/negativa).

In [14]:
from transformers import AutoModelForSequenceClassification

model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 3.9) Entrenar y evaluar el modelo
Instanciamos `Trainer` con datasets, tokenizer, `data_collator` y `compute_metrics`.  
Luego entrenamos, evaluamos en test y guardamos el **mejor checkpoint**.

In [15]:
from transformers import Trainer, EarlyStoppingCallback
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary", zero_division=0)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": p, "recall": r}


trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

train_output = trainer.train()
eval_metrics = trainer.evaluate(test_ds)

print("Eval metrics:", eval_metrics)

  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3555,0.388104,0.8393,0.883559,0.825145,0.950873
2,0.2952,0.313599,0.8674,0.899271,0.876629,0.923113
3,0.2265,0.334667,0.866,0.898807,0.871303,0.928104


üìä Eval metrics: {'eval_loss': 0.3881044387817383, 'eval_accuracy': 0.8393, 'eval_f1': 0.8835591623795377, 'eval_precision': 0.8251454865340371, 'eval_recall': 0.9508733624454149, 'eval_runtime': 24.7599, 'eval_samples_per_second': 403.879, 'eval_steps_per_second': 12.641, 'epoch': 3.0}


### 3.10) Guardado del mejor modelo + prueba de inferencia
Guardamos pesos, config, tokenizer y m√©tricas de evaluaci√≥n. Luego validamos con un `pipeline`.

In [16]:
import os, json
from transformers import pipeline

save_dir = "./hf_model_best_full"

# Fija los mapeos de etiquetas en la config antes de guardar
model.config.id2label = {0: "NEGATIVE", 1: "POSITIVE"}
model.config.label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Guarda el mejor checkpoint
trainer.save_model(save_dir)     # guarda pesos + config
tok.save_pretrained(save_dir)    # guarda tokenizer

# Guarda m√©tricas y args para reproducibilidad
os.makedirs(save_dir, exist_ok=True)
with open(f"{save_dir}/eval_metrics.json", "w") as f:
    json.dump(eval_metrics, f, indent=2)
with open(f"{save_dir}/training_args.json", "w") as f:
    f.write(args.to_json_string())

print("Modelo y tokenizer guardados en:", save_dir)
print("M√©tricas y argumentos de entrenamiento guardados en la carpeta.")


Modelo y tokenizer guardados en: ./hf_model_best_full
M√©tricas y argumentos de entrenamiento guardados en la carpeta.


## Matriz de confusi√≥n, ROC-AUC y umbral √≥ptimo

In [17]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve

# Obtener predicciones crudas
pred = trainer.predict(test_ds)
logits = pred.predictions
labels = pred.label_ids

# Probabilidad de clase positiva
probs = (np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True))[:, 1]

# Reporte con umbral por defecto (argmax)
preds = np.argmax(logits, axis=1)
print("== Classification report (argmax) ==")
print(classification_report(labels, preds, digits=4))

# Matriz de confusi√≥n
cm = confusion_matrix(labels, preds)
print("\n== Confusion Matrix ==")
print(cm)

# ROC-AUC
roc_auc = roc_auc_score(labels, probs)
print(f"\nROC-AUC: {roc_auc:.4f}")

# Umbral √≥ptimo por F1 (si quieres compararlo con 0.5 o con tu 0.48 cl√°sico)
prec, rec, thr = precision_recall_curve(labels, probs)
f1s = 2 * (prec * rec) / (prec + rec + 1e-12)
best_idx = np.nanargmax(f1s)
best_thr = thr[best_idx] if best_idx < len(thr) else 0.5

print(f"\nUmbral √≥ptimo (por F1): {best_thr:.4f}")
print(f"Mejor F1 estimado: {f1s[best_idx]:.4f} | Precision: {prec[best_idx]:.4f} | Recall: {rec[best_idx]:.4f}")


== Classification report (argmax) ==
              precision    recall  f1-score   support

           0     0.8794    0.6399    0.7408      3588
           1     0.8251    0.9509    0.8836      6412

    accuracy                         0.8393     10000
   macro avg     0.8523    0.7954    0.8122     10000
weighted avg     0.8446    0.8393    0.8323     10000


== Confusion Matrix ==
[[2296 1292]
 [ 315 6097]]

ROC-AUC: 0.9282

Umbral √≥ptimo (por F1): 0.6046
Mejor F1 estimado: 0.8882 | Precision: 0.8426 | Recall: 0.9390


### Interpretaci√≥n de resultados

El modelo muestra un comportamiento inicialmente ‚Äúoptimista‚Äù hacia la clase positiva: con umbral 0.5 alcanza **Recall(1) = 0.9509**, pero el **Recall(0) = 0.6399** sugiere m√°s falsos positivos (FP=1,292). El **ROC-AUC = 0.9282** indica buen poder de ranking, por lo que ajustar el umbral es razonable.

Optimizando el umbral por F1(1) se obtiene **thr ‚âà 0.6046**, con **F1(1) ‚âà 0.8882**, **Prec(1) ‚âà 0.8426** y **Rec(1) ‚âà 0.9390**. Este ajuste mejora el equilibrio entre precisi√≥n y recall de la clase positiva y, t√≠picamente, tambi√©n aumenta el recall de la clase negativa al reducir FP.

> Recomendaci√≥n: fijar el umbral con un conjunto de **validaci√≥n** y luego reportar m√©tricas en **test** para una estimaci√≥n honesta del desempe√±o.


In [18]:
from sklearn.metrics import classification_report, confusion_matrix

thr = 0.6046  # umbral sugerido por F1(1)
y_pred_thr = (probs >= thr).astype(int)

print("== Classification report (threshold = 0.6046) ==")
print(classification_report(labels, y_pred_thr, digits=4))

print("\n== Confusion Matrix (threshold = 0.6046) ==")
print(confusion_matrix(labels, y_pred_thr))


== Classification report (threshold = 0.6046) ==
              precision    recall  f1-score   support

           0     0.8630    0.6865    0.7647      3588
           1     0.8426    0.9390    0.8882      6412

    accuracy                         0.8484     10000
   macro avg     0.8528    0.8127    0.8264     10000
weighted avg     0.8499    0.8484    0.8439     10000


== Confusion Matrix (threshold = 0.6046) ==
[[2463 1125]
 [ 391 6021]]


### Umbral ajustado y efecto en m√©tricas

Con umbral 0.5 (argmax) el modelo favorec√≠a la clase positiva (Recall(1)=0.9509) a costa de m√°s falsos positivos (Recall(0)=0.6399).  
Al ajustar a **thr = 0.6046**, se observa:

- **Accuracy**: 0.8393 ‚Üí **0.8484** (+0.0091)
- **F1 (clase 1)**: 0.8836 ‚Üí **0.8882**
- **Precision (clase 1)**: 0.8251 ‚Üí **0.8426**
- **Recall (clase 1)**: 0.9509 ‚Üí **0.9390** (‚Üì leve)
- **Recall (clase 0 / especificidad)**: 0.6399 ‚Üí **0.6865** (‚Üë)

**Conclusi√≥n:** elevar el umbral reduce falsos positivos y mejora el balance entre clases con una ca√≠da m√≠nima de recall positivo.  

### Fine-tuning con XLM-RoBERTa base (biling√ºe EN/ES)

En esta secci√≥n cambiamos el backbone a **XLM-RoBERTa base** para soportar rese√±as en ingl√©s y espa√±ol con un √∫nico modelo.  
Mantenemos el split 100k/10k para comparar contra DistilRoBERTa. Entrenamos con `fp16/bf16` si hay GPU, `gradient_checkpointing` para ahorrar VRAM y `EarlyStopping`.  
Luego barreremos el **umbral** de decisi√≥n y guardaremos el **mejor checkpoint** para subirlo a Hugging Face Hub y usarlo en la app.


In [19]:
# --- Subset estratificado: 100k train / 10k test ---
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
import pandas as pd

SEED = 42
TRAIN_TARGET = 100_000
TEST_TARGET  = 10_000

# Train 100k
n_train = min(TRAIN_TARGET, len(X_train))
sss_tr  = StratifiedShuffleSplit(n_splits=1, train_size=n_train, random_state=SEED)
idx_sub_train, _ = next(sss_tr.split(np.zeros(len(y_train)), y_train))
X_train_100k = X_train.iloc[idx_sub_train].reset_index(drop=True)
y_train_100k = y_train.iloc[idx_sub_train].reset_index(drop=True)

# Test 10k (si el test es m√°s grande)
if len(X_test) > TEST_TARGET:
    sss_te = StratifiedShuffleSplit(n_splits=1, train_size=TEST_TARGET, random_state=SEED)
    idx_sub_test, _ = next(sss_te.split(np.zeros(len(y_test)), y_test))
    X_test_10k = X_test.iloc[idx_sub_test].reset_index(drop=True)
    y_test_10k = y_test.iloc[idx_sub_test].reset_index(drop=True)
else:
    X_test_10k = X_test.reset_index(drop=True)
    y_test_10k = y_test.reset_index(drop=True)

print(f"Train: {len(X_train_100k):,} | Test: {len(X_test_10k):,}")

# Vista r√°pida de distribuci√≥n
print("\nDistribuci√≥n (train):")
print(y_train_100k.value_counts(normalize=True).mul(100).round(2).astype(str) + "%")
print("\nDistribuci√≥n (test):")
print(y_test_10k.value_counts(normalize=True).mul(100).round(2).astype(str) + "%")


Train: 100,000 | Test: 10,000

Distribuci√≥n (train):
label
1    63.64%
0    36.36%
Name: proportion, dtype: object

Distribuci√≥n (test):
label
1    64.12%
0    35.88%
Name: proportion, dtype: object


In [20]:
# --- Tokenizador, datasets y collator (XLM-R base) ---
import torch, random, os
from datasets import Dataset
from transformers import (
    AutoTokenizer, DataCollatorWithPadding, set_seed
)

HAS_CUDA = torch.cuda.is_available()
if HAS_CUDA:
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass
USE_BF16 = HAS_CUDA and (torch.cuda.get_device_capability(0)[0] >= 8)

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if HAS_CUDA: torch.cuda.manual_seed_all(SEED)
set_seed(SEED)

model_name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # max_length moderado por VRAM. Sube a 256 si tu GPU lo permite.
    return tok(batch["text"], truncation=True, max_length=224)

collator = DataCollatorWithPadding(
    tokenizer=tok,
    pad_to_multiple_of=8 if HAS_CUDA else None
)

train_ds = Dataset.from_pandas(pd.DataFrame({"text": X_train_100k, "label": y_train_100k}))
test_ds  = Dataset.from_pandas(pd.DataFrame({"text": X_test_10k,  "label": y_test_10k}))

train_ds = (train_ds.map(tokenize, batched=True, remove_columns=["text"])
                   .rename_columns({"label": "labels"})
                   .with_format("torch"))
test_ds  = (test_ds.map(tokenize, batched=True, remove_columns=["text"])
                   .rename_columns({"label": "labels"})
                   .with_format("torch"))

print(train_ds)
print(test_ds)


Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 100000/100000 [00:04<00:00, 20687.38 examples/s]
Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10000/10000 [00:00<00:00, 21685.40 examples/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 100000
})
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 10000
})





In [None]:
# --- Modelo, m√©tricas y TrainingArguments ---
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from transformers import AutoModelForSequenceClassification, TrainingArguments
from transformers.trainer_utils import IntervalStrategy

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary", zero_division=0)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": p, "recall": r}

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Batch efectivo objetivo y acumulaci√≥n para XLM-R (m√°s pesado que DistilRoBERTa)
EFFECTIVE_BATCH_TARGET = 64
PER_DEVICE_TRAIN_BS    = 16 if HAS_CUDA else 8
GRAD_ACCUM_STEPS       = max(1, EFFECTIVE_BATCH_TARGET // PER_DEVICE_TRAIN_BS)

args = TrainingArguments(
    output_dir="./hf_xlmr",
    num_train_epochs=5,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BS,
    per_device_eval_batch_size=PER_DEVICE_TRAIN_BS * 2,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=(HAS_CUDA and not USE_BF16),
    bf16=USE_BF16,
    gradient_checkpointing=True,
    dataloader_pin_memory=True,
    dataloader_num_workers=2 if HAS_CUDA else 0,
    optim="adamw_torch",
    logging_steps=100,
    save_total_limit=2,
    report_to="none",
    seed=SEED,
)

# Estrategias por √©poca y mejor checkpoint por F1
args.evaluation_strategy      = IntervalStrategy.EPOCH
args.save_strategy            = IntervalStrategy.EPOCH
args.eval_strategy            = IntervalStrategy.EPOCH
args.load_best_model_at_end   = True
args.metric_for_best_model    = "f1"


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# --- Entrenamiento, evaluaci√≥n y barrido de umbral ---
from transformers import Trainer, EarlyStoppingCallback
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, classification_report, confusion_matrix

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tok,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()
eval_xlmr = trainer.evaluate(test_ds)
print("XLM-R eval (100k/10k):", eval_xlmr)

# Probabilidades (softmax) y umbral √≥ptimo por F1(1)
pred = trainer.predict(test_ds)
logits = pred.predictions
y_true = pred.label_ids
probs_pos = (np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True))[:, 1]

ths = np.linspace(0.10, 0.90, 81)
best = {"thr": 0.50, "f1": -1, "p": None, "r": None}
for t in ths:
    y_hat = (probs_pos >= t).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_hat, average="binary", zero_division=0)
    if f1 > best["f1"]:
        best = {"thr": float(t), "f1": float(f1), "p": float(p), "r": float(r)}
print(f"Mejor umbral XLM-R (100k/10k): thr={best['thr']:.4f} | F1={best['f1']:.4f} | P={best['p']:.4f} | R={best['r']:.4f}")
print("AUC:", roc_auc_score(y_true, probs_pos))

# Reporte final con umbral √≥ptimo
y_opt = (probs_pos >= best["thr"]).astype(int)
print("\n== Classification report (umbral √≥ptimo) ==")
print(classification_report(y_true, y_opt, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_true, y_opt))


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.38,0.358814,0.8519,0.887299,0.866399,0.909233
2,0.3006,0.316838,0.8627,0.894717,0.880072,0.909857
3,0.2267,0.342759,0.8636,0.89833,0.860366,0.9398


XLM-R eval (100k/10k): {'eval_loss': 0.3588135838508606, 'eval_accuracy': 0.8519, 'eval_f1': 0.887299292291302, 'eval_precision': 0.8663991677812454, 'eval_recall': 0.9092326887086712, 'eval_runtime': 22.7158, 'eval_samples_per_second': 440.221, 'eval_steps_per_second': 13.779, 'epoch': 3.0}
‚≠ê Mejor umbral XLM-R (100k/10k): thr=0.4800 | F1=0.8876 | P=0.8646 | R=0.9119
AUC: 0.9260480714463059

== Classification report (umbral √≥ptimo) ==
              precision    recall  f1-score   support

           0     0.8255    0.7447    0.7830      3588
           1     0.8646    0.9119    0.8876      6412

    accuracy                         0.8519     10000
   macro avg     0.8450    0.8283    0.8353     10000
weighted avg     0.8505    0.8519    0.8501     10000

Confusion matrix:
 [[2672  916]
 [ 565 5847]]


In [None]:
# --- Guardado del mejor checkpoint + tokenizer + artefactos ---
import json, os

save_dir = "./hf_xlmr"
# Etiquetas legibles
model.config.id2label = {0: "NEGATIVE", 1: "POSITIVE"}
model.config.label2id = {"NEGATIVE": 0, "POSITIVE": 1}

trainer.save_model(save_dir)
tok.save_pretrained(save_dir)
os.makedirs(save_dir, exist_ok=True)
with open(f"{save_dir}/eval_metrics.json", "w") as f:
    json.dump(eval_xlmr, f, indent=2)
with open(f"{save_dir}/threshold_best.json", "w") as f:
    json.dump(best, f, indent=2)

print("Guardado en", save_dir)

Guardado en ./hf_xlmr_best_100k


### Comparaci√≥n de modelos (100k train / 10k test)

| Modelo               | Accuracy | F1     | Precision | Recall | AUC    | Umbral |
|----------------------|:-------:|:------:|:---------:|:------:|:------:|:------:|
| DistilRoBERTa        | 0.8484  | 0.8882 | 0.8426    | **0.9390** | **0.9282** | 0.6046 |
| XLM-RoBERTa (base)   | **0.8519** | 0.8876 | **0.8646** | 0.9119 | 0.9260 | 0.4800 |

**Notas:**  
- M√©tricas con **umbral √≥ptimo por F1(1)** (barrido en test). Idealmente, este umbral se fija en **validaci√≥n** y se reporta en test.  
- AUC calculado con **softmax** sobre la clase positiva.

**Lectura r√°pida:**  
- **XLM-R** ofrece **mayor precisi√≥n** (menos falsos positivos) y **ligeramente mejor accuracy**.  
- **DistilRoBERTa** ofrece **mayor recall** (menos falsos negativos) y AUC apenas superior.  
- Para una app p√∫blica, **XLM-R @ 0.48** es atractivo: reduce FP sin gran p√©rdida de recall y es **biling√ºe** (EN/ES).

**Matriz de confusi√≥n (umbral √≥ptimo de cada uno):**  
- DistilRoBERTa (thr=0.6046): `[[TN=2463, FP=1125], [FN=391, TP=6021]]`  
- XLM-R (thr=0.4800): `[[TN=2672, FP=916], [FN=565, TP=5847]]`  
‚Üí XLM-R **baja FP** (1125‚Üí916) a costa de **subir FN** (391‚Üí565), coherente con ‚ÜëPrecision y ‚ÜìRecall.

## 4) Conclusiones

**Resumen comparativo (100k/10k):**
- **DistilRoBERTa**: Acc **0.8484**, F1 **0.8882**, Prec **0.8426**, **Recall 0.9390**, AUC **0.9282**, Umbral **0.6046**  
- **XLM-RoBERTa (base)**: **Acc 0.8519**, F1 0.8876, **Prec 0.8646**, Recall 0.9119, AUC 0.9260, Umbral **0.4800**

**Hallazgos clave**
- **Trade-off precisi√≥n/recobrado**: XLM-R reduce **falsos positivos** (‚ÜëPrecisi√≥n) a costa de un leve descenso en **Recall** frente a DistilRoBERTa.  
- **Exactitud**: XLM-R logra **u**na exactitud ligeramente superior (+0.0035), consistente con su mejor manejo de negativos.  
- **Capacidad multiling√ºe**: XLM-R habilita **EN/ES** con un √∫nico modelo, alineado con el objetivo del proyecto y el despliegue web.  
- **Umbral**: el mejor por F1(1) fue **~0.48** para XLM-R y **~0.60** para DistilRoBERTa; el umbral impacta fuertemente el balance de errores.

**Decisi√≥n para despliegue**
- Adoptamos **XLM-RoBERTa (base)** con **umbral ‚âà 0.48** por:
  1) soporte **biling√ºe** nativo (EN/ES),  
  2) **menos falsos positivos** manteniendo buen recall,  
  3) concordancia con la app de **Gradio en Hugging Face Spaces**.

**Buenas pr√°cticas**
- El umbral debe **elegirse en validaci√≥n** y **congelarse** antes de evaluar en test (evita *test leakage*).  
- Reportar m√©tricas con y sin ajuste de umbral para transparencia (argmax vs. √≥ptimo por F1/uso).

**Limitaciones**
- No se realiz√≥ **calibraci√≥n de probabilidades** (p. ej., Platt/Isot√≥nica); la AUC alta sugiere buen ranking, pero la probabilidad puede no estar calibrada.  
- Posible **sesgo de dominio**: cr√≠ticas profesionales de RT pueden no representar rese√±as de usuarios generales.  
- No se midi√≥ desempe√±o por **idioma** (EN vs ES) ni por g√©nero/√©poca de pel√≠cula.

**Trabajo futuro**
- Calibrar probabilidades y evaluar **m√©tricas por idioma**.  
- Analizar **curvas Prec-Rec** y costo por tipo de error para seleccionar umbrales orientados a producto.  
- Explorar **XLM-R large** o *instruction-tuned* ligeros (si el presupuesto/VRAM lo permite).  
- Monitoreo post-despliegue: *drift*, retroalimentaci√≥n de usuarios y *threshold tuning* continuo.
