# Assignment 2 — Neural Sequence Models: From Recurrence to Self-Attention

---

## 1. Introduction

### 1.1 Recap: Assignment 1 Baseline Performance

Assignment 1 established classical machine learning baselines for **EXIST 2025 — Task 1, Subtask 1.1 (Sexism Identification in Tweets)**, a binary text classification task on a bilingual (English/Spanish) corpus of ~10,000 tweets. Using TF-IDF feature representations paired with linear classifiers (Logistic Regression and LinearSVC), and FastText mean-pooled embeddings as an alternative dense representation, the following results were obtained on the held-out validation set (1,038 samples):

| Model | Feature Representation | CV F1-Macro | Val F1-Macro | Val Accuracy |
|---|---|---|---|---|
| **Logistic Regression** *(best overall)* | TF-IDF (unigrams, sublinear_tf) | 0.6536 | **0.75** | 0.75 |
| LinearSVC | TF-IDF (unigrams) | 0.6524 | — | — |
| SVM (LinearSVC) *(best dense)* | FastText embeddings (scratch) | 0.5322 | 0.62 | 0.62 |
| Logistic Regression | FastText embeddings (scratch) | 0.5234 | — | — |

Key takeaways from Assignment 1:

- **Simple lowercasing** was the optimal preprocessing strategy (F1: 0.6506), outperforming stemming, lemmatization, and stopword removal.
- **Unigrams only** outperformed all higher-order n-gram configurations, due to feature space explosion relative to the small training corpus (~6,920 samples).
- **Sparse TF-IDF representations decisively outperformed dense FastText embeddings** trained from scratch (~12.1-point gap in CV F1-Macro, 0.6536 vs. 0.5322), highlighting the limitations of mean-pooled embeddings on a small, domain-specific corpus.
- **Error analysis** revealed two systematic failure modes of bag-of-words models: (1) false negatives on tweets that *report or condemn* sexism, where the anti-sexist tone masks the SEXIST label, and (2) false positives driven by gender-related keywords used in neutral or non-sexist contexts (e.g., references to *patriarchy* or *mgtow* without sexist intent). These errors share a common root cause: TF-IDF cannot model *how* words are used, only *which* words appear.
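
The shared root cause named in the last takeaway can be illustrated with a minimal sketch. It uses raw unigram counts from the standard library as a stand-in for TF-IDF (an assumption made for brevity, since TF-IDF only reweights those same counts), and the two sentences are invented for illustration rather than taken from the corpus.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Return unigram counts, discarding word order (toy stand-in for TF-IDF)."""
    return Counter(re.findall(r"\w+", text.lower()))

# Two invented sentences: same words, different roles for "he" and "she".
a = "she said he was wrong"
b = "he said she was wrong"

same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```

Because both sentences map to the same representation, any classifier built on it must assign them the same label, regardless of who is saying what about whom.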

---

### 1.2 Research Questions

Assignment 1 demonstrated the ceiling of bag-of-words approaches for this task. This assignment transitions to **neural sequence models** — architectures that explicitly model sequential structure, word order, and long-range dependencies. Concretely, we implement and evaluate a **Bidirectional LSTM** and a **Transformer-based model** (fine-tuned from a pre-trained checkpoint, given the small dataset size), comparing them against the Assignment 1 baselines. 

This motivates the following research questions:

> **RQ1 — Performance:** Do neural sequence models (BiLSTM, Transformer/BERT) yield meaningful F1-Macro improvements over the TF-IDF + Logistic Regression baseline on sexism identification in tweets?

> **RQ2 — Error correction:** Which specific error types from Assignment 1 — irony/sarcasm, anti-sexist reporting, keyword mismatch — are resolved by architectures that capture sequential context and pragmatic meaning?

> **RQ3 — Data efficiency:** Given the limited training size (~6,920 samples), do neural models trained from scratch suffer from data starvation, and does transfer learning via a pre-trained Transformer (e.g., multilingual BERT / XLM-R) mitigate this?

> **RQ4 — Computational cost:** Is the added complexity of neural models — in training time, GPU memory, and inference latency — justified by the performance gains over the classical baseline?

> **RQ5 — Architecture trade-offs:** How do BiLSTM and Transformer architectures compare in terms of their inductive biases for this specific task (short bilingual tweets, subtle/ironic content, near-balanced classes)?

In [1]:
import json
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import ast
from time import time, sleep
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Subset

# Sklearn imports
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# NLTK downloads
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('punkt', quiet=True)

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Set the seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

In [2]:
def load_and_parse_data(filepath):
    """
    Parses nested JSON and applies Majority Voting for labels.
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    df = pd.DataFrame.from_dict(data, orient='index')
    # drop=True discards the original dictionary keys, so no 'index' column is created to rename
    df = df.reset_index(drop=True)
    
    # Label Processing (Majority Voting)
    if 'labels_task1_1' in df.columns:
        def get_majority_vote(labels_list):
            if not isinstance(labels_list, list): return np.nan
            counts = pd.Series(labels_list).value_counts()
            # Tie-breaking: Prioritize 'YES' (Sexism) if tie
            if len(counts) > 1 and counts.iloc[0] == counts.iloc[1]:
                if 'YES' in counts.index[:2]: return 'YES'
            return counts.idxmax()
        
        df['final_label_str'] = df['labels_task1_1'].apply(get_majority_vote)
        df['label'] = df['final_label_str'].map({'YES': 1, 'NO': 0})
        df = df.dropna(subset=['label']).copy()  # copy() avoids pandas chained-assignment warnings below
        df['label'] = df['label'].astype(int)
        
    return df

print("Loading Data...")
df_train = load_and_parse_data('../data/training/EXIST2025_training.json')
df_val = load_and_parse_data('../data/dev/EXIST2025_dev.json')
df_test = load_and_parse_data('../data/test/EXIST2025_test_clean.json')

print(f"\nTotal Samples - Training: {len(df_train)}")
print(df_train['final_label_str'].value_counts())
print(f"\nTotal Samples - Validation (Will be used as TEST): {len(df_val)}")
print(df_val['final_label_str'].value_counts())

Loading Data...

Total Samples - Training: 6920
final_label_str
YES    3553
NO     3367
Name: count, dtype: int64

Total Samples - Validation (Will be used as TEST): 1038
final_label_str
YES    559
NO     479
Name: count, dtype: int64
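
The label aggregation rule inside `load_and_parse_data` can be restated in isolation. This is a simplified sketch using only the standard library; the function name `majority_vote` and the annotation lists are invented for illustration and are not rows from the dataset.

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Most frequent label; a tie for first place that involves 'YES' resolves to 'YES'."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        if "YES" in (ranked[0][0], ranked[1][0]):
            return "YES"
    return ranked[0][0]

print(majority_vote(["YES", "NO", "NO", "NO", "YES", "NO"]))   # NO
print(majority_vote(["NO", "YES", "NO", "YES", "NO", "YES"]))  # YES (3-3 tie)
```

Resolving ties toward `YES` slightly favours recall on the sexist class, which is one plausible reason the training split contains marginally more `YES` than `NO` labels.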


In [3]:
stop_words = set(stopwords.words('english')) | set(stopwords.words('spanish'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text, strategy='raw'):
    text_processed = str(text)
    if strategy == 'raw':
        return text_processed
    if strategy == 'lowercase':
        return text_processed.lower()
    if strategy == 'no_punct':
        text_processed = re.sub(r'[^\w\s]', '', text_processed)
        return text_processed.lower()
    if strategy == 'no_stopwords':
        text_processed = text_processed.lower()
        words = text_processed.split()
        return " ".join([w for w in words if w not in stop_words])
    if strategy == 'stemmed':
        text_processed = text_processed.lower()
        words = text_processed.split()
        return " ".join([stemmer.stem(w) for w in words])
    if strategy == 'lemmatized':
        text_processed = text_processed.lower()
        words = text_processed.split() 
        return " ".join([lemmatizer.lemmatize(w) for w in words])
    return text_processed

# Process texts
df_train['text_clean'] = df_train['tweet'].apply(lambda x: preprocess_text(x, 'lowercase'))
df_val['text_clean'] = df_val['tweet'].apply(lambda x: preprocess_text(x, 'lowercase'))
df_test['text_clean'] = df_test['tweet'].apply(lambda x: preprocess_text(x, 'lowercase'))

In [4]:
class Vocabulary:
    def __init__(self, min_freq=2):
        self.itos = {0: "<PAD>", 1: "<UNK>"}
        self.stoi = {"<PAD>": 0, "<UNK>": 1}
        self.min_freq = min_freq
        
    def build_vocabulary(self, sentence_list):
        frequencies = {}
        idx = 2
        for sentence in sentence_list:
            for word in self.tokenize(sentence):
                if word not in frequencies:
                    frequencies[word] = 1
                else:
                    frequencies[word] += 1
                if frequencies[word] == self.min_freq:
                    self.stoi[word] = idx
                    self.itos[idx] = word
                    idx += 1
                    
    def tokenize(self, text):
        return re.findall(r'\w+', text)
        
    def numericalize(self, text):
        tokenized_text = self.tokenize(text)
        return [self.stoi.get(token, self.stoi["<UNK>"]) for token in tokenized_text]

vocab = Vocabulary(min_freq=2)
vocab.build_vocabulary(df_train['text_clean'].tolist())
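
The effect of `min_freq=2` is easiest to see on a toy corpus. The sketch below is a compact, function-based restatement of the same idea, not the `Vocabulary` class itself; `build_stoi` and the three sentences are hypothetical, and the order in which ids are assigned may differ from the class above on other inputs.

```python
import re
from collections import Counter

def build_stoi(sentences, min_freq=2):
    """Map each word seen at least min_freq times to an integer id (0=<PAD>, 1=<UNK>)."""
    counts = Counter(w for s in sentences for w in re.findall(r"\w+", s))
    stoi = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.items():
        if freq >= min_freq:
            stoi[word] = len(stoi)
    return stoi

def numericalize(text, stoi):
    """Replace every token by its id, falling back to <UNK> for unknown words."""
    return [stoi.get(tok, stoi["<UNK>"]) for tok in re.findall(r"\w+", text)]

toy_corpus = ["good morning all", "good night all", "good luck"]
stoi = build_stoi(toy_corpus)
print(stoi)                                    # {'<PAD>': 0, '<UNK>': 1, 'good': 2, 'all': 3}
print(numericalize("good evening all", stoi))  # [2, 1, 3]
```

Words that occur only once in training ("morning", "night", "luck") never receive an id, so at inference time they are indistinguishable from genuinely unseen words such as "evening".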

In [5]:
class EXISTDataset(Dataset):
    def __init__(self, df, vocab, max_len=64):
        self.df = df
        self.vocab = vocab
        self.max_len = max_len
        
    def __len__(self):
        return len(self.df)
        
    def __getitem__(self, index):
        text = self.df.iloc[index]['text_clean']
        label = self.df.iloc[index]['label'] if 'label' in self.df.columns else -1
        
        tokens = self.vocab.numericalize(text)
        length = len(tokens)
        
        # Avoid empty sequences (they break pack_padded_sequence)
        if length == 0:
            tokens = [self.vocab.stoi["<UNK>"]]
            length = 1
            
        # Truncate if longer than max_len
        if length > self.max_len:
            tokens = tokens[:self.max_len]
            length = self.max_len
            
        # Pad up to max_len
        padded_tokens = tokens + [self.vocab.stoi["<PAD>"]] * (self.max_len - length)
            
        return torch.tensor(padded_tokens), torch.tensor(label, dtype=torch.long), torch.tensor(length, dtype=torch.long)

# Global training dataset (split into folds during cross-validation)
train_dataset = EXISTDataset(df_train, vocab)

# The original "dev" set serves as our actual TEST set
test_dataset_real = EXISTDataset(df_val, vocab)
test_loader_real = DataLoader(test_dataset_real, batch_size=32, shuffle=False)
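
The three length rules applied in `EXISTDataset.__getitem__` (empty input, truncation, padding) can be checked on plain lists. This is a minimal sketch; the helper name `pad_or_truncate` and the token ids are hypothetical, with `0` and `1` standing in for `<PAD>` and `<UNK>` as in the vocabulary above.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0, unk_id=1):
    """Return (fixed_length_ids, true_length) following the dataset's three rules."""
    if not token_ids:                # empty input: substitute a single <UNK>
        token_ids = [unk_id]
    token_ids = token_ids[:max_len]  # truncate inputs longer than max_len
    length = len(token_ids)
    return token_ids + [pad_id] * (max_len - length), length

print(pad_or_truncate([5, 8, 3], max_len=5))           # ([5, 8, 3, 0, 0], 3)
print(pad_or_truncate([5, 8, 3, 9, 4, 7], max_len=5))  # ([5, 8, 3, 9, 4], 5)
print(pad_or_truncate([], max_len=5))                  # ([1, 0, 0, 0, 0], 1)
```

Returning the true length alongside the padded ids is what later allows the recurrent model to skip the padding positions.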

In [6]:
import optuna
import mlflow
import mlflow.pytorch
from optuna.integration.mlflow import MLflowCallback

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout=0.5):
        super(BiLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=True, batch_first=True,
                            dropout=dropout if n_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, lengths):
        embedded = self.dropout(self.embedding(text))
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        packed_output, (hidden, cell) = self.lstm(packed)
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        final_hidden = torch.cat([forward_hidden, backward_hidden], dim=1)
        output = self.dropout(final_hidden)
        logits = self.fc(output)
        return logits


class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.best_model_state = None

    def __call__(self, val_score, model):
        if self.best_score is None:
            self.best_score = val_score
            # Clone the tensors: state_dict().copy() is shallow and would keep tracking the live weights
            self.best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        elif val_score < self.best_score + self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = val_score
            self.best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            self.counter = 0
        return self.early_stop


# Fixed constants
VOCAB_SIZE  = len(vocab.stoi)
OUTPUT_DIM  = 2
EPOCHS      = 20
N_SPLITS    = 5
SEED        = 42
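
The interaction between `patience` and `min_delta` in `EarlyStopping` can be traced without a model. The sketch below replays the same decision rule over a list of scores; the function name `stopping_epoch` and the validation curve are invented for illustration.

```python
def stopping_epoch(scores, patience=5, min_delta=0.001):
    """Return (epoch index at which training stops, best score seen)."""
    best, counter = None, 0
    for epoch, score in enumerate(scores):
        if best is None or score >= best + min_delta:
            best, counter = score, 0   # meaningful improvement: reset patience
        else:
            counter += 1               # improvement smaller than min_delta
            if counter >= patience:
                return epoch, best
    return len(scores) - 1, best       # patience was never exhausted

# Invented validation F1 curve: improves, then plateaus.
curve = [0.60, 0.65, 0.70, 0.7005, 0.69, 0.70, 0.699, 0.70, 0.71]
print(stopping_epoch(curve, patience=5))  # (7, 0.7)
```

Note that the gain from 0.70 to 0.7005 does not reset the counter because it is below `min_delta`, and the later score of 0.71 is never reached; a larger `patience` trades extra epochs for robustness against such plateaus.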

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

# ── MLflow setup ────────────────────────────────────────────────────────────
MLFLOW_EXPERIMENT = "BiLSTM_Optuna_Tuning"
mlflow.set_experiment(MLFLOW_EXPERIMENT)

# ── Search space ────────────────────────────────────────────────────────────
# All hyperparameters Optuna will tune are defined here for easy editing.
SEARCH_SPACE = {
    "embedding_dim" : ("categorical", [100, 200, 300]),
    "hidden_dim"    : ("categorical", [128, 256, 512]),
    "n_layers"      : ("int",         [1, 3]),          # suggest_int(low, high)
    "dropout"       : ("float",       [0.2, 0.5]),       # suggest_float(low, high)
    "lr"            : ("loguniform",  [1e-4, 1e-2]),     # suggest_float(..., log=True)
    "batch_size"    : ("categorical", [16, 32, 64]),
}


def suggest_hyperparams(trial):
    """Translate SEARCH_SPACE into Optuna suggestions."""
    params = {}
    for name, (kind, bounds) in SEARCH_SPACE.items():
        if kind == "categorical":
            params[name] = trial.suggest_categorical(name, bounds)
        elif kind == "int":
            params[name] = trial.suggest_int(name, *bounds)
        elif kind == "float":
            params[name] = trial.suggest_float(name, *bounds)
        elif kind == "loguniform":
            params[name] = trial.suggest_float(name, *bounds, log=True)
    return params


def run_cv(params: dict) -> tuple[float, dict]:
    """
    5-fold stratified CV for a given hyperparameter configuration.
    Returns (mean_val_f1, per-fold metrics dict).
    """
    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
    y_array = df_train['label'].values
    fold_f1s = []

    for fold, (train_idx, val_idx) in enumerate(
        skf.split(np.zeros(len(y_array)), y_array)
    ):
        train_loader = DataLoader(
            Subset(train_dataset, train_idx),
            batch_size=params["batch_size"], shuffle=True
        )
        val_loader = DataLoader(
            Subset(train_dataset, val_idx),
            batch_size=params["batch_size"], shuffle=False
        )

        model = BiLSTMClassifier(
            vocab_size    = VOCAB_SIZE,
            embedding_dim = params["embedding_dim"],
            hidden_dim    = params["hidden_dim"],
            output_dim    = OUTPUT_DIM,
            n_layers      = params["n_layers"],
            dropout       = params["dropout"],
        ).to(device)

        criterion     = nn.CrossEntropyLoss()
        optimizer     = torch.optim.Adam(model.parameters(), lr=params["lr"])
        early_stopping = EarlyStopping(patience=5)

        for epoch in range(EPOCHS):
            # ── Train ──
            model.train()
            for texts, labels, lengths in train_loader:
                texts, labels = texts.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(texts, lengths), labels)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()

            # ── Validate ──
            model.eval()
            preds, targets = [], []
            with torch.no_grad():
                for texts, labels, lengths in val_loader:
                    texts = texts.to(device)
                    out   = model(texts, lengths)
                    preds.extend(out.argmax(1).cpu().numpy())
                    targets.extend(labels.numpy())

            val_f1 = f1_score(targets, preds, average='macro')

            if early_stopping(val_f1, model):
                break

        fold_f1s.append(early_stopping.best_score)

    mean_f1      = float(np.mean(fold_f1s))
    per_fold_log = {f"fold_{i+1}_f1": v for i, v in enumerate(fold_f1s)}
    return mean_f1, per_fold_log


def objective(trial: optuna.Trial) -> float:
    """Optuna objective: suggest params → CV → log to MLflow → return metric."""
    params = suggest_hyperparams(trial)

    with mlflow.start_run(
        run_name=f"trial_{trial.number}", nested=True
    ):
        # Log all hyperparameters
        mlflow.log_params(params)
        mlflow.log_param("trial_number", trial.number)

        mean_f1, per_fold_log = run_cv(params)

        # Log per-fold and aggregate metrics
        mlflow.log_metrics(per_fold_log)
        mlflow.log_metric("mean_cv_f1_macro", mean_f1)

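        # Report the aggregate score to the pruner. The full CV has already run
        # at this point, so pruning here cannot save compute; it only marks
        # trials rejected by the pruner as pruned rather than completed.
        trial.report(mean_f1, step=0)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()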

    return mean_f1

Device: cuda


2026/02/24 20:17:03 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/02/24 20:17:03 INFO mlflow.store.db.utils: Updating database tables
2026/02/24 20:17:04 INFO mlflow.tracking.fluent: Experiment with name 'BiLSTM_Optuna_Tuning' does not exist. Creating a new experiment.
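
The `"loguniform"` entry for `lr` in `SEARCH_SPACE` means the logarithm of the learning rate is sampled uniformly, so each order of magnitude is equally likely. The sketch below illustrates that idea with the standard library only; it is not Optuna's TPE sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
draws = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(10_000)]

# Roughly half of the draws fall below the geometric midpoint 1e-3,
# whereas a plain uniform draw on [1e-4, 1e-2] would put only about 9% there.
share_below_1e3 = sum(d < 1e-3 for d in draws) / len(draws)
print(f"{share_below_1e3:.2f}")
```

This is why learning rates are usually searched on a log scale: a plain uniform draw would spend most of the budget on values near the upper bound.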


In [8]:
# Parent MLflow run wraps the entire study
with mlflow.start_run(run_name="optuna_study"):

    sampler = optuna.samplers.TPESampler(seed=SEED)
    pruner  = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=0)

    study = optuna.create_study(
        direction   = "maximize",
        sampler     = sampler,
        pruner      = pruner,
        study_name  = "bilstm_sexism",
    )

    study.optimize(
        objective,
        n_trials         = 30,      # ← adjust to your compute budget
        timeout          = None,
        show_progress_bar= True,
    )

    # ── Log best result to the parent run ──────────────────────────────────
    best = study.best_trial
    mlflow.log_params({f"best_{k}": v for k, v in best.params.items()})
    mlflow.log_metric("best_mean_cv_f1_macro", best.value)

    print("\n" + "=" * 50)
    print(f"Best trial:  #{best.number}")
    print(f"Best CV F1-Macro: {best.value:.4f}")
    print("Best hyperparameters:")
    for k, v in best.params.items():
        print(f"  {k}: {v}")

[I 2026-02-24 20:17:20,096] A new study created in memory with name: bilstm_sexism

[I 2026-02-24 20:19:27,678] Trial 0 finished with value: 0.7184964405687821 and parameters: {'embedding_dim': 200, 'hidden_dim': 128, 'n_layers': 1, 'dropout': 0.45985284373248053, 'lr': 0.0015930522616241021, 'batch_size': 64}. Best is trial 0 with value: 0.7184964405687821.
[I 2026-02-24 20:21:58,257] Trial 1 finished with value: 0.6961201858060088 and parameters: {'embedding_dim': 100, 'hidden_dim': 512, 'n_layers': 2, 'dropout': 0.2873687420594126, 'lr': 0.0016738085788752138, 'batch_size': 64}. Best is trial 0 with value: 0.7184964405687821.
[I 2026-02-24 20:25:35,490] Trial 2 finished with value: 0.6993984921712976 and parameters: {'embedding_dim': 200, 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.2511572371061875, 'lr': 0.00013492834268013249, 'batch_size': 32}. Best is trial 0 with value: 0.7184964405687821.
[I 2026-02-24 20:29:16,560] Trial 3 finished with value: 0.6987525959654033 and parameters: {'embedding_dim': 300, 'hidden
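
The study above attaches a `MedianPruner` with `n_startup_trials=5`. As a rough mental model, and only a simplified sketch of the rule for a maximised metric reported at a single step, a trial is rejected when its reported value falls below the median of earlier trials. The function `should_prune` and the scores below are invented for illustration and do not reproduce Optuna's internals.

```python
from statistics import median

def should_prune(value, finished_values, n_startup_trials=5):
    """Simplified median rule for a maximised metric reported at one step."""
    if len(finished_values) < n_startup_trials:
        return False                  # not enough history: pruning is disabled
    return value < median(finished_values)

history = [0.60, 0.62, 0.64, 0.66, 0.68]  # invented scores, median 0.64
print(should_prune(0.63, history))      # True
print(should_prune(0.65, history))      # False
print(should_prune(0.63, history[:3]))  # False (still in the startup phase)
```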

In [None]:
best_params = study.best_trial.params

with mlflow.start_run(run_name="final_model_best_params"):
    mlflow.log_params(best_params)

    final_model = BiLSTMClassifier(
        vocab_size    = VOCAB_SIZE,
        embedding_dim = best_params["embedding_dim"],
        hidden_dim    = best_params["hidden_dim"],
        output_dim    = OUTPUT_DIM,
        n_layers      = best_params["n_layers"],
        dropout       = best_params["dropout"],
    ).to(device)

    full_loader = DataLoader(
        train_dataset,
        batch_size = best_params["batch_size"],
        shuffle    = True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(final_model.parameters(), lr=best_params["lr"])

    # No early stopping: we train for a fixed number of epochs on the full
    # training set. A reasonable default is the mean epoch at which early
    # stopping triggered across the CV folds; here we reuse EPOCHS directly.
    for epoch in range(EPOCHS):
        final_model.train()
        train_loss = 0
        all_preds, all_labels = [], []

        for texts, labels, lengths in full_loader:
            texts, labels = texts.to(device), labels.to(device)
            optimizer.zero_grad()
            logits = final_model(texts, lengths)  # single forward pass, reused for the metrics below
            loss = criterion(logits, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(final_model.parameters(), max_norm=1.0)
            optimizer.step()

            train_loss += loss.item()
            all_preds.extend(logits.detach().argmax(1).cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

        train_f1 = f1_score(all_labels, all_preds, average='macro')
        mlflow.log_metric("train_f1_macro", train_f1, step=epoch)
        mlflow.log_metric("train_loss", train_loss / len(full_loader), step=epoch)

        print(f"Epoch {epoch+1:02d}/{EPOCHS} | "
              f"Train Loss: {train_loss/len(full_loader):.4f} | "
              f"Train F1: {train_f1:.4f}")

    # Save the final model artifact
    mlflow.pytorch.log_model(final_model, artifact_path="bilstm_final_model")
    print("\nFinal model trained and logged to MLflow.")

In [None]:
print("Evaluating the final model on the TEST set (original 'dev' set)...")

with mlflow.start_run(run_name="final_model_test_evaluation"):
    mlflow.log_params(best_params)

    final_model.eval()
    test_preds, test_labels_list = [], []

    start_time = time()
    with torch.no_grad():
        for texts, labels, lengths in test_loader_real:
            texts, labels = texts.to(device), labels.to(device)
            outputs = final_model(texts, lengths)
            test_preds.extend(outputs.argmax(1).cpu().numpy())
            test_labels_list.extend(labels.cpu().numpy())
    inference_time = time() - start_time

    test_f1 = f1_score(test_labels_list, test_preds, average='macro')

    # Log metrics to MLflow
    mlflow.log_metric("test_f1_macro", test_f1)
    mlflow.log_metric("inference_time_seconds", inference_time)

    # Console output
    print(f"\nFinal F1-Macro (Test Set):       {test_f1:.4f}")
    print(f"Total inference time:            {inference_time:.3f} seconds")
    print("\nConfusion Matrix:\n",
          confusion_matrix(test_labels_list, test_preds))
    print("\nClassification Report:\n",
          classification_report(test_labels_list, test_preds))
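
Since F1-Macro is the headline metric throughout, it helps to see how it follows from the confusion matrix printed above. The sketch below computes it by hand for a square matrix in scikit-learn's layout (rows are true classes, columns are predictions); the helper `macro_f1` and the 2x2 matrix are invented and are not results from this notebook.

```python
def macro_f1(confusion):
    """Macro-F1 from a square confusion matrix (rows = true class, columns = predicted)."""
    n = len(confusion)
    f1s = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n                                   # unweighted mean over classes

# Invented matrix in the layout [[TN, FP], [FN, TP]].
toy = [[40, 10],
       [20, 30]]
print(round(macro_f1(toy), 4))  # 0.697
```

Because every class contributes equally to the mean, a model cannot obtain a high F1-Macro by favouring the majority class, which is why it is preferred over accuracy here.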

# Transformers

In [9]:
from transformers import AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
import torch.nn as nn
import torch

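# We use a lightweight base model, a good starting point.
# Note: bert-base-uncased is an English-only checkpoint, while the corpus is bilingual (English/Spanish).
MODEL_NAME = "bert-base-uncased"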
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)




In [13]:
class BERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts.reset_index(drop=True)
        self.labels = labels.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.texts)
        
    def __getitem__(self, index):
        text = str(self.texts[index])
        label = self.labels[index]
        
        # Direct tokenizer call (the modern, standard API)
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Prepare the raw texts
X_train_bert = df_train['tweet']
y_train_bert = df_train['label']

X_test_bert = df_val['tweet'] 
y_test_bert = df_val['label']

bert_test_dataset = BERTDataset(X_test_bert, y_test_bert, tokenizer)
bert_test_loader = DataLoader(bert_test_dataset, batch_size=16, shuffle=False)

In [14]:
class BERTClassifier(nn.Module):
    def __init__(self, model_name, num_classes, freeze_bert=False):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)
        
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
                
    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation (first position)
        pooled = outputs.last_hidden_state[:, 0]
        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits

def unfreeze_last_n_layers(model, n):
    """Utility for the ablation study: freeze the encoder, then unfreeze its last n layers."""
    # Freeze everything first
    for param in model.bert.parameters():
        param.requires_grad = False
    
    # Unfreeze the last n encoder layers
    if n > 0:
        for layer in model.bert.encoder.layer[-n:]:
            for param in layer.parameters():
                param.requires_grad = True
                
    # The classification head must always remain trainable
    for param in model.classifier.parameters():
        param.requires_grad = True

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from torch.utils.data import Subset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Entrenando BERT en: {device}\n")

# Fine-tuning parameters, following the assignment PDF
EPOCHS_BERT = 4 
BATCH_SIZE = 16 
N_SPLITS = 5

skf_bert = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
y_train_array_bert = y_train_bert.values
bert_fold_metrics = []
best_bert_global_f1 = 0
best_bert_model_state = None

bert_full_dataset = BERTDataset(X_train_bert, y_train_bert, tokenizer)

for fold, (train_idx, val_idx) in enumerate(skf_bert.split(np.zeros(len(y_train_array_bert)), y_train_array_bert)):
    print(f"================ BERT FOLD {fold + 1}/{N_SPLITS} ================")
    
    train_sub = Subset(bert_full_dataset, train_idx)
    val_sub = Subset(bert_full_dataset, val_idx)
    
    train_loader = DataLoader(train_sub, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_sub, batch_size=BATCH_SIZE, shuffle=False)
    
    # Instantiate the model (initial full fine-tuning)
    model = BERTClassifier(MODEL_NAME, num_classes=2, freeze_bert=False).to(device)
    
    # Different learning rates: low for BERT, higher for the new layer (required by the PDF)
    optimizer = torch.optim.AdamW([
        {'params': model.bert.parameters(), 'lr': 2e-5},
        {'params': model.classifier.parameters(), 'lr': 1e-4}
    ])
    
    # Scheduler with warmup (10% of the total steps)
    total_steps = len(train_loader) * EPOCHS_BERT
    warmup_steps = int(total_steps * 0.1)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    
    criterion = nn.CrossEntropyLoss()
    best_fold_f1 = 0
    
    for epoch in range(EPOCHS_BERT):
        model.train()
        train_loss = 0
        all_train_preds, all_train_labels = [], []
        
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)
            loss = criterion(logits, labels)
            loss.backward()
            
            # Gradient clipping (max_norm = 1.0)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            scheduler.step()  # Update the learning rate
            
            train_loss += loss.item()
            _, predicted = torch.max(logits, 1)
            all_train_preds.extend(predicted.cpu().numpy())
            all_train_labels.extend(labels.cpu().numpy())
            
        # Validation
        model.eval()
        val_loss = 0
        all_val_preds, all_val_labels = [], []
        
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)
                
                logits = model(input_ids, attention_mask)
                loss = criterion(logits, labels)
                
                val_loss += loss.item()
                _, predicted = torch.max(logits, 1)
                all_val_preds.extend(predicted.cpu().numpy())
                all_val_labels.extend(labels.cpu().numpy())
                
        val_f1 = f1_score(all_val_labels, all_val_preds, average='macro')
        print(f'Epoch {epoch+1}/{EPOCHS_BERT} | Train Loss: {train_loss/len(train_loader):.4f} | Val Loss: {val_loss/len(val_loader):.4f} | Val F1: {val_f1:.4f}')
        
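        # Save the best model of this fold (clone the tensors: state_dict().copy() is shallow,
        # so later epochs would silently overwrite the "best" weights)
        if val_f1 > best_fold_f1:
            best_fold_f1 = val_f1
            best_model_state_for_this_fold = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}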
            
    bert_fold_metrics.append(best_fold_f1)
    print(f"\nMejor F1-Macro en Fold {fold+1}: {best_fold_f1:.4f}\n")
    
    if best_fold_f1 > best_bert_global_f1:
        best_bert_global_f1 = best_fold_f1
        best_bert_model_state = best_model_state_for_this_fold

print("=" * 40)
print(f"F1-Macro Promedio BERT (5 Folds): {np.mean(bert_fold_metrics):.4f}")
print("=" * 40)

Entrenando BERT en: cuda



BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Epoch 1/4 | Train Loss: 0.5995 | Val Loss: 0.5007 | Val F1: 0.7452
Epoch 2/4 | Train Loss: 0.4460 | Val Loss: 0.5136 | Val F1: 0.7764
Epoch 3/4 | Train Loss: 0.3107 | Val Loss: 0.5435 | Val F1: 0.7879
Epoch 4/4 | Train Loss: 0.2146 | Val Loss: 0.6583 | Val F1: 0.7876

Mejor F1-Macro en Fold 1: 0.7879
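
The learning-rate schedule used in the fine-tuning loop (linear warmup over the first 10% of steps, then linear decay to zero) can be written as a single multiplier on the base rate. The sketch below is a plain-Python restatement of that shape, not a call into the `transformers` library; `linear_warmup_factor` is a hypothetical name, and the step counts assume a fold with exactly 5,536 training samples (stratification may shift this by a sample or so).

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to each base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    # linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# 6,920 samples and 5 folds give about 5,536 training samples per fold.
steps_per_epoch = -(-5536 // 16)       # ceiling division: 346 batches of size 16
total_steps = steps_per_epoch * 4      # 4 epochs: 1384 steps
warmup_steps = int(total_steps * 0.1)  # 138 steps

for step in (0, 69, 138, 761, 1384):
    factor = linear_warmup_factor(step, warmup_steps, total_steps)
    print(step, f"encoder lr = {2e-5 * factor:.2e}", f"head lr = {1e-4 * factor:.2e}")
```

Both parameter groups share the same multiplier, so the classifier head stays at five times the encoder's learning rate throughout training.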





In [None]:
print("Evaluating the best BERT model on the TEST set...")

final_bert = BERTClassifier(MODEL_NAME, num_classes=2).to(device)
final_bert.load_state_dict(best_bert_model_state)
final_bert.eval()

test_preds = []
test_labels = []

from time import time
start_time = time()

with torch.no_grad():
    for batch in bert_test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        logits = final_bert(input_ids, attention_mask)
        _, predicted = torch.max(logits, 1)
        
        test_preds.extend(predicted.cpu().numpy())
        test_labels.extend(labels.cpu().numpy())
        
inference_time = time() - start_time

test_f1 = f1_score(test_labels, test_preds, average='macro')

print(f"\nFinal BERT F1-Macro (Test Set): {test_f1:.4f}")
print(f"Total inference time: {inference_time:.3f} seconds")
print("\nClassification Report:\n", classification_report(test_labels, test_preds))