<h1 align="center">Lab 2:  Sexism Identification in Twitter</h1>
<h2 align="center">Session 3. Transformers: Fine-tuning for multi-label classification
<h3 style="display:block; margin-top:5px;" align="center">Natural Language and Information Retrieval</h3>
<h3 style="display:block; margin-top:5px;" align="center">Degree in Data Science</h3>
<h3 style="display:block; margin-top:5px;" align="center">2024-2025</h3>    
<h3 style="display:block; margin-top:5px;" align="center">ETSInf. Universitat Politècnica de València</h3>
<br>

### Put your names here

- Marcos Ranchal
- Marc Siquier

In [1]:
%pip install transformers --upgrade
%pip  install datasets accelerate
%pip install -U PyEvALL
!pip install jupyter --upgrade
!pip install ipywidgets --upgrade



## Many libraries

In [2]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import  AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
import random
import os
import pandas as pd
import json
import sys
import tempfile
import time

#Importing the required modules to use the ICM measure

from pyevall.evaluation import PyEvALLEvaluation
from pyevall.metrics.metricfactory import MetricFactory
from pyevall.reports.reports import PyEvALLReport
from pyevall.utils.utils import PyEvALLUtils

from functools import partial

In [3]:
# IF YOU USE GOOGLE COLAB -> COLAB=True
COLAB = True

In [5]:
if COLAB is True:
  from google.colab import drive
  drive.mount('/content/drive')
  base_path = "/content/drive/MyDrive/LNR/"
else:
  base_path = "../"

Mounted at /content/drive


In [6]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Import readerEXIST2025 library

In [7]:
library_path = os.path.join(base_path, "Lab2-S1")
sys.path.append(library_path)
from readerEXIST2025 import EXISTReader

In [8]:
# path to the dataset, adapt this path wherever you have the dataset
dataset_path = os.path.join(base_path, "s2/EXIST_2025_Dataset_V0.2/")

file_train = os.path.join(dataset_path, "EXIST2025_training.json")
file_dev = os.path.join(dataset_path, "EXIST2025_dev.json")

reader_train = EXISTReader(file_train)
reader_dev = EXISTReader(file_dev)

EnTrainTask3, EnDevTask3 = reader_train.get(lang="EN", subtask="3", ), reader_dev.get(lang="EN", subtask="3")
SpTrainTask3, SpDevTask3 = reader_train.get(lang="ES", subtask="3"), reader_dev.get(lang="ES", subtask="3")

# Wrapper to compute ICM measure

In [9]:
def ICMWrapper(pred, labels, multi=False,ids=None):
    test = PyEvALLEvaluation()
    metrics=[MetricFactory.ICM.value]
    params= dict()
    fillLabel=None
    if multi:
        params[PyEvALLUtils.PARAM_REPORT]="embedded"
        hierarchy={"True":['IDEOLOGICAL-INEQUALITY', 'STEREOTYPING-DOMINANCE', 'MISOGYNY-NON-SEXUAL-VIOLENCE', 'OBJECTIFICATION', 'SEXUAL-VIOLENCE'],
        "False":[]}
        params[PyEvALLUtils.PARAM_HIERARCHY]=hierarchy
        fillLabel = lambda x: ["False"] if len(x)== 0 else x
    else:
        params[PyEvALLUtils.PARAM_REPORT]="simple"
        fillLabel = lambda x: str(x)


    truth_name, predict_name=None, None
    if ids is None:
        ids=list(range(len(labels)))

    with tempfile.NamedTemporaryFile(mode='w', delete=False, encoding='utf-8') as truth:
        truth_name=truth.name
        truth_df=pd.DataFrame({'test_case': ['EXIST2025']*len(labels),
                        'id': [str(x) for x in ids],
                        'value': [fillLabel(x) for x in labels]})
        if multi==True:
            truth_df=truth_df.astype('object')
        truth.write(truth_df.to_json(orient="records"))

    with  tempfile.NamedTemporaryFile(mode='w', delete=False) as predict:
        predict_name=predict.name
        predict_df=pd.DataFrame({'test_case': ['EXIST2025']*len(pred),
                        'id': [str(x) for x in ids],
                        'value': [fillLabel(x) for x in pred]})
        if multi==True:
            predict_df=predict_df.astype('object')
        predict.write(predict_df.to_json(orient="records"))

    report = test.evaluate(predict_name, truth_name, metrics, **params)
    os.unlink(truth_name)
    os.unlink(predict_name)

    icm = None
    if 'metrics' in report.report:
        if 'ICM' in report.report["metrics"]: icm=float(report.report["metrics"]['ICM']["results"]["average_per_test_case"])
    return icm



## Set the seed

In [10]:
def set_seed(seed=1234):
    """
    Sets the seed to make everything deterministic, for reproducibility of experiments
    Parameters:
    seed: the number to set the seed to
    Return: None
    """
    # Random seed
    random.seed(seed)
    # Numpy seed
    np.random.seed(seed)
    # Torch seed
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True
    # os seed
    os.environ['PYTHONHASHSEED'] = str(seed)

## Dataset class

In [11]:
class SexismDatasetMulti(Dataset):
    def __init__(self, texts, labels, ids, tokenizer, max_len=128, pad="max_length", trunc=True,rt='pt'):
        self.texts = texts.tolist()
        self.labels = labels
        self.ids = ids
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.pad = pad
        self.trunc = trunc
        self.rt = rt


    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,padding=self.pad, truncation=self.trunc,
            return_tensors=self.rt
        )

        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.float),
            'id': torch.tensor(self.ids[idx], dtype=torch.long)}

# Metrics for subtask 3

In [12]:
def compute_metrics_3(pred, lencoder):
    labels = pred.label_ids
    #preds = pred.predictions.argmax(-1)
    preds = torch.sigmoid(torch.tensor(pred.predictions)).numpy()
    preds_binary = (preds >= 0.5).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds_binary, average=None, zero_division=0
    )
    acc = accuracy_score(labels, preds_binary)
    icm= ICMWrapper(lencoder.inverse_transform(preds_binary), lencoder.inverse_transform(labels), multi=True)
    # Macro averages
    precision_macro = np.mean(precision)
    recall_macro = np.mean(recall)
    f1_macro = np.mean(f1)
    metrics = {}
    metrics.update({
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'f1_macro': f1_macro,
        'ICM': icm
    })
    return metrics

# Pipeline

In [13]:
def sexism_classification_pipeline_task3(trainInfo, devInfo, testInfo=None, model_name='roberta-base', nlabels=5, ptype="multi_label_classification", **args):
    # Model and Tokenizer
    labelEnc= MultiLabelBinarizer()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=nlabels,
        problem_type=ptype,
        ignore_mismatched_sizes=args.get("ignore_mismatched_sizes", False)
        )

    # Prepare datasets
    train_dataset = SexismDatasetMulti(trainInfo[1], labelEnc.fit_transform(trainInfo[2]),[int(x) for x in trainInfo[0]], tokenizer )
    val_dataset = SexismDatasetMulti(devInfo[1], labelEnc.transform(devInfo[2]), [int(x) for x in devInfo[0]], tokenizer)

    # Training Arguments
    training_args = TrainingArguments(
        report_to="none", # alt: "wandb", "tensorboard" "comet_ml" "mlflow" "clearml"
        output_dir= args.get('output_dir', './results'),
        num_train_epochs= args.get('num_train_epochs', 5),
        learning_rate=args.get('learning_rate', 5e-5),
        per_device_train_batch_size=args.get('per_device_train_batch_size', 16),
        per_device_eval_batch_size=args.get('per_device_eval_batch_size', 64),
        warmup_steps=args.get('warmup_steps', 500),
        weight_decay=args.get('weight_decay',0.01),
        logging_dir=args.get('logging_dir', './logs'),
        logging_steps=args.get('logging_steps', 10),
        eval_strategy=args.get('eval_strategy','epoch'),
        save_strategy=args.get('save_strategy', "epoch"),
        save_total_limit=args.get('save_total_limit', 1),
        load_best_model_at_end=args.get('load_best_model_at_end', True),
        metric_for_best_model=args.get('metric_for_best_model',"ICM")
    )

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        #compute_metrics=compute_metrics_3,
        compute_metrics = partial(compute_metrics_3, lencoder=labelEnc),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=args.get("early_stopping_patience",3))]
    )

    # Fine-tune the model
    trainer.train()

    # Evaluate on validation set
    eval_results = trainer.evaluate()
    print("Validation Results:", eval_results)

    if testInfo is not None:
      # Prepare test dataset for prediction
      test_dataset = SexismDatasetMulti(testInfo[1], [[0,0,0,0,0]] * len(testInfo[1]),  [int(x) for x in testInfo[0]],   tokenizer)

      # Predict test set labels
      predictions = trainer.predict(test_dataset)
      #predicted_labels = np.argmax(predictions.predictions, axis=1)
      predicted_probs = torch.sigmoid(torch.tensor(predictions.predictions)).numpy()
      predicted_labels = (predicted_probs >= 0.5).astype(int)

      # Create submission DataFrame
      submission_df = pd.DataFrame({
          'id': testInfo[0],
          'label': labelEnc.inverse_transform(predicted_labels),
          "test_case": ["EXIST2025"]*len(predicted_labels)

      })
      submission_df.to_csv('sexism_predictions_task3.csv', index=False)
      print("Prediction TASK3 completed. Results saved to sexism_predictions_task2.csv")
      return model, submission_df
    return model, eval_results

# LoRA pipeline

In [25]:
from peft import LoraConfig, get_peft_model, TaskType

def run_sexism_pipeline_with_lora(train_data, val_data, test_data=None, base_model='roberta-base', num_labels=5, prob_type="multi_label_classification", **kwargs):
    # Initialize tokenizer and model
    binarizer = MultiLabelBinarizer()
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    classification_model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=num_labels,
        problem_type=prob_type,
        ignore_mismatched_sizes=kwargs.get("ignore_mismatched_sizes", False)
    )

    # Convert data into datasets
    train_set = SexismDatasetMulti(train_data[1], binarizer.fit_transform(train_data[2]), [int(i) for i in train_data[0]], tokenizer)
    val_set = SexismDatasetMulti(val_data[1], binarizer.transform(val_data[2]), [int(i) for i in val_data[0]], tokenizer)

    # Set up LoRA configuration
    lora_setup = LoraConfig(
        task_type=kwargs.get("task_type", TaskType.SEQ_CLS),
        target_modules=kwargs.get("target_modules", ["query", "value"]),
        r=kwargs.get("rank", 64),
        lora_alpha=kwargs.get("lora_alpha", 32),
        lora_dropout=kwargs.get("lora_dropout", 0.1),
        bias=kwargs.get("bias", "none"),
        init_lora_weights=kwargs.get("init_lora_weights", True)
    )

    # Integrate LoRA into model
    lora_enhanced_model = get_peft_model(classification_model, lora_setup)

    # Define training parameters
    train_params = TrainingArguments(
        output_dir=kwargs.get("output_dir", "./results"),
        num_train_epochs=kwargs.get("num_train_epochs", 5),
        learning_rate=kwargs.get("learning_rate", 5e-5),
        per_device_train_batch_size=kwargs.get("per_device_train_batch_size", 16),
        per_device_eval_batch_size=kwargs.get("per_device_eval_batch_size", 64),
        warmup_steps=kwargs.get("warmup_steps", 500),
        weight_decay=kwargs.get("weight_decay", 0.01),
        logging_dir=kwargs.get("logging_dir", "./logs"),
        logging_steps=kwargs.get("logging_steps", 10),
        eval_strategy=kwargs.get("eval_strategy", "epoch"),
        save_strategy=kwargs.get("save_strategy", "epoch"),
        save_total_limit=kwargs.get("save_total_limit", 1),
        load_best_model_at_end=kwargs.get("load_best_model_at_end", True),
        metric_for_best_model=kwargs.get("metric_for_best_model", "ICM"),
        report_to="none"
    )

    # Set up training loop
    trainer_instance = Trainer(
        model=lora_enhanced_model,
        args=train_params,
        train_dataset=train_set,
        eval_dataset=val_set,
        compute_metrics=partial(compute_metrics_3, lencoder=binarizer),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=kwargs.get("early_stopping_patience", 3))]
    )

    # Train the model
    trainer_instance.train()

    # Evaluate the model
    validation_metrics = trainer_instance.evaluate()
    print("Validation metrics:", validation_metrics)

    # Save LoRA-only weights
    if kwargs.get("save_lora_weights", True):
        trainer_instance.save_model("./final_best_model_LoRA")

    # Optionally save full model
    if kwargs.get("save_full_model", True):
        combined_model = lora_enhanced_model.merge_and_unload()
        combined_model.save_pretrained("./final_best_model_mixpeft")

    # If test data is provided, make predictions
    if test_data is not None:
        test_set = SexismDatasetMulti(test_data[1], [[0]*num_labels] * len(test_data[1]), [int(i) for i in test_data[0]], tokenizer)
        test_preds = trainer_instance.predict(test_set)
        prob_matrix = torch.sigmoid(torch.tensor(test_preds.predictions)).numpy()
        binarized_preds = (prob_matrix >= 0.5).astype(int)

        results_df = pd.DataFrame({
            'id': test_data[0],
            'label': binarizer.inverse_transform(binarized_preds),
            'test_case': ['EXIST2025'] * len(binarized_preds)
        })

        results_df.to_csv("sexism_predictions_task3.csv", index=False)
        print("Test predictions complete. Saved to sexism_predictions_task3.csv")
        return classification_model, results_df

    return classification_model, validation_metrics


# Experimentation

In [15]:
def export_evaluation_to_file(data: dict, filename: str) -> bool:
    success = False
    try:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=4)
        success = True
    except Exception as error:
        print(f"Ocurrió un problema al guardar el archivo: {error}")
    return success

## Do it in English

### Fine-tuning

In [16]:
set_seed(25)

base_model = "bert-base-uncased"

# Ajustes optimizados
training_config = {
    "num_train_epochs": 12,                  # Aumento el número de épocas para permitir mayor convergencia
    "learning_rate": 5e-5,                   # Aumento la tasa de aprendizaje para una convergencia más rápida
    "per_device_train_batch_size": 64,       # Mantengo el tamaño del batch
    "warmup_steps": 250,                     # Reduzco ligeramente los warmup_steps para acelerar la convergencia
    "weight_decay": 0.01,                    # Regularización
    "logging_dir": "./logs",
    "logging_steps": 20,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",     # Cambié a f1_macro para optimizar por esta métrica
    "early_stopping_patience": 3,
    "lr_scheduler_type": "cosine",           # Cambié a 'cosine' scheduler para un decaimiento más suave del learning rate
}

# Entrenamiento y evaluación
_, validation_metrics = sexism_classification_pipeline_task3(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **training_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model}_fine-tunning_task_3.json")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6867,0.676101,0.372135,0.582564,0.454126,-0.56748
2,0.6644,0.651035,0.386255,0.570256,0.459654,-0.532264
3,0.6411,0.603337,0.723501,0.608443,0.624122,-0.145577
4,0.5614,0.570582,0.689872,0.73984,0.710534,0.221776
5,0.5082,0.59522,0.74377,0.603594,0.662247,-0.985782
6,0.4864,0.601638,0.752271,0.66385,0.696483,-0.417076
7,0.4058,0.652123,0.740023,0.645738,0.683564,-0.218811


2025-04-14 11:24:29,805 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:24:29,883 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
cargado 29
2025-04-14 11:25:20,526 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:25:20,632 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:26:18,956 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:26:19,013 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:27:24,380 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:27:24,484 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:28:25,190 - pyevall.evaluation 

2025-04-14 11:30:33,933 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:30:34,002 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation Results: {'eval_loss': 0.5705820322036743, 'eval_precision_macro': 0.6898721377487809, 'eval_recall_macro': 0.7398395178400544, 'eval_f1_macro': 0.710533609167568, 'eval_ICM': 0.22177640634325402, 'eval_runtime': 2.5449, 'eval_samples_per_second': 132.816, 'eval_steps_per_second': 2.358, 'epoch': 7.0}


True

In [17]:
base_model = "cardiffnlp/twitter-roberta-base-sentiment-latest"


training_config = {
    "num_train_epochs": 15,                   # Más épocas para permitir mejor ajuste del modelo al nuevo task
    "learning_rate": 3e-5,                    # Más bajo para una fine-tuning más estable
    "per_device_train_batch_size": 32,        # Reducido para evitar overfitting y saturación de memoria
    "per_device_eval_batch_size": 64,
    "warmup_steps": 300,                      # Aumentado para una transición más suave
    "weight_decay": 0.01,                     # Regularización para evitar overfitting
    "ignore_mismatched_sizes": True,          # Necesario por la diferencia en número de clases
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",      # Optimizar directamente por F1 macro
    "early_stopping_patience": 3,
    "lr_scheduler_type": "cosine",            # Scheduler más suave para ajustar el learning rate
}

# Entrenamiento y evaluación
_, validation_metrics = sexism_classification_pipeline_task3(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **training_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_fine-tunning_task_3.json")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6782,0.671097,0.25503,0.4,0.31144,-1.107469
2,0.6168,0.609832,0.700546,0.638686,0.647089,-0.059169
3,0.5923,0.578432,0.692865,0.729657,0.707204,0.166654


2025-04-14 11:31:46,123 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:31:46,319 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:32:49,713 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:32:49,816 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:33:46,026 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:33:46,090 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6782,0.671097,0.25503,0.4,0.31144,-1.107469
2,0.6168,0.609832,0.700546,0.638686,0.647089,-0.059169
3,0.5923,0.578432,0.692865,0.729657,0.707204,0.166654
4,0.5484,0.595269,0.685587,0.751253,0.705261,0.093685
5,0.4846,0.583287,0.710327,0.721665,0.712909,-0.018152
6,0.417,0.599698,0.71532,0.719777,0.712403,0.087108
7,0.3582,0.671173,0.736804,0.628939,0.675391,-0.474383
8,0.307,0.669529,0.722105,0.705152,0.712395,0.046581


2025-04-14 11:34:40,243 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:34:40,322 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:35:39,852 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:35:39,916 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:36:39,221 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:36:39,334 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:37:38,585 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:37:38,646 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:38:34,603 - pyevall.evaluation - INFO -   

2025-04-14 11:38:56,135 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:38:56,198 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation Results: {'eval_loss': 0.5832870602607727, 'eval_precision_macro': 0.7103271205280535, 'eval_recall_macro': 0.7216652148709736, 'eval_f1_macro': 0.7129087416883032, 'eval_ICM': -0.0181518186862118, 'eval_runtime': 2.5864, 'eval_samples_per_second': 130.684, 'eval_steps_per_second': 2.32, 'epoch': 8.0}


True

### LoRA

In [27]:
model_name = "bert-base-uncased"

# Configuración de entrenamiento con LoRA
lora_config = {
    "num_train_epochs": 12,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "warmup_steps": 300,
    "early_stopping_patience": 3,
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "init_lora_weights": True,
    "output_dir": None,
    "save_full_model": False,
    "ignore_mismatched_sizes": True,
    "save_lora_weights": False
}

# Entrenamiento y evaluación usando LoRA
_, eval_metrics = run_sexism_pipeline_with_lora(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=model_name,
    num_labels=5,
    prob_type="multi_label_classification",
    **lora_config
)

# Guardar los resultados de evaluación en Google Drive
output_dir = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(output_dir, exist_ok=True)
filename = f"{drive_path}/eval_{base_model}_lora_task_3.json"
export_evaluation_to_file(eval_metrics, filename)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.704,0.706598,0.224765,0.313744,0.255385,-1.429638
2,0.6928,0.69187,0.232565,0.376277,0.287386,-1.265218
3,0.6842,0.679038,0.416565,0.346495,0.301027,-1.293347
4,0.6705,0.672241,0.35253,0.44,0.368167,-1.020601
5,0.6635,0.668827,0.377095,0.533333,0.438891,-0.691912
6,0.6677,0.664879,0.415899,0.543663,0.444403,-0.673719
7,0.6564,0.659216,0.70855,0.559127,0.480653,-0.565684
8,0.664,0.658273,0.69758,0.518339,0.462234,-0.680893
9,0.6392,0.652282,0.659283,0.544317,0.513379,-0.561437
10,0.6475,0.649244,0.62525,0.562949,0.548321,-0.506013


2025-04-14 11:48:37,058 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:48:37,120 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:49:12,622 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:49:12,681 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:49:48,657 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:49:48,715 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:50:24,773 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:50:24,831 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:51:00,607 - pyevall.evaluation - INFO -   

2025-04-14 11:55:15,700 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:55:15,765 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation metrics: {'eval_loss': 0.6488084197044373, 'eval_precision_macro': 0.6480551761649845, 'eval_recall_macro': 0.554620818687706, 'eval_f1_macro': 0.5429048313080916, 'eval_ICM': -0.49754800856297965, 'eval_runtime': 2.955, 'eval_samples_per_second': 114.382, 'eval_steps_per_second': 2.03, 'epoch': 12.0}


True

In [28]:
base_model = "cardiffnlp/twitter-roberta-base-sentiment-latest"

training_config = {
    "num_train_epochs": 12,                  # Fine-tuning más estable sin sobreentrenar
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "warmup_steps": 300,
    "weight_decay": 0.01,
    "ignore_mismatched_sizes": True,
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",
    "early_stopping_patience": 3,
    "lr_scheduler_type": "cosine",
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "init_lora_weights": True,
    "output_dir": None,
    "save_full_model": False,
    "save_lora_weights": False,
}

# Entrenamiento y evaluación con LoRA
_, validation_metrics = run_sexism_pipeline_with_lora(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **training_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_lora_task_3.json")


config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`




pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpo



No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


model.safetensors:   0%|          | 0.00/501M [00:00<?, ?B/s]

Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6867,0.673715,0.51611,0.59449,0.508215,-0.563482
2,0.6588,0.658644,0.509581,0.551074,0.470761,-0.600655
3,0.6667,0.652769,0.529618,0.555228,0.508935,-0.534538
4,0.6467,0.642652,0.548665,0.571174,0.538316,-0.436252
5,0.6261,0.621904,0.72833,0.597845,0.597089,-0.323644
6,0.6149,0.601553,0.713488,0.656063,0.66237,-0.076882
7,0.5863,0.593937,0.71339,0.673001,0.674751,-0.01217
8,0.5993,0.592784,0.707156,0.665649,0.669534,-0.073276


2025-04-14 11:56:27,621 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:56:27,683 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:57:03,211 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:57:03,305 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:57:38,540 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:57:38,600 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:58:14,389 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 11:58:14,450 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 11:58:49,835 - pyevall.evaluation - INFO -   

Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6867,0.673715,0.51611,0.59449,0.508215,-0.563482
2,0.6588,0.658644,0.509581,0.551074,0.470761,-0.600655
3,0.6667,0.652769,0.529618,0.555228,0.508935,-0.534538
4,0.6467,0.642652,0.548665,0.571174,0.538316,-0.436252
5,0.6261,0.621904,0.72833,0.597845,0.597089,-0.323644
6,0.6149,0.601553,0.713488,0.656063,0.66237,-0.076882
7,0.5863,0.593937,0.71339,0.673001,0.674751,-0.01217
8,0.5993,0.592784,0.707156,0.665649,0.669534,-0.073276
9,0.5707,0.588174,0.710915,0.689073,0.690139,-0.005981
10,0.5741,0.588033,0.709695,0.67046,0.679202,-0.106193


2025-04-14 12:01:12,235 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:01:12,295 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:01:47,746 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:01:47,805 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:02:23,342 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:02:23,407 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:02:58,965 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:02:59,030 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method


2025-04-14 12:03:01,987 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:03:02,050 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation metrics: {'eval_loss': 0.5881744027137756, 'eval_precision_macro': 0.7109150131608, 'eval_recall_macro': 0.6890728854441632, 'eval_f1_macro': 0.6901389372497471, 'eval_ICM': -0.005981058235503887, 'eval_runtime': 2.7102, 'eval_samples_per_second': 124.712, 'eval_steps_per_second': 2.214, 'epoch': 12.0}


True

## Do it in Spanish

### Fine-tuning

In [30]:
base_model = "pysentimiento/robertuito-sentiment-analysis"

training_config = {
    "num_train_epochs": 12,                  # Aumentado para permitir mejor ajuste al dominio
    "learning_rate": 2e-5,                   # Más bajo para estabilidad
    "per_device_train_batch_size": 32,       # Reducido para controlar overfitting
    "per_device_eval_batch_size": 64,
    "warmup_steps": 300,                     # Transición más progresiva al entrenamiento
    "weight_decay": 0.01,                    # Regularización
    "ignore_mismatched_sizes": True,         # Permite adaptar capas de salida
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",     # Métrica objetivo para clasificación multietiqueta
    "early_stopping_patience": 3,
    "lr_scheduler_type": "cosine"            # Scheduler progresivo
}

# Entrenamiento y evaluación
_, validation_metrics = sexism_classification_pipeline_task3(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **training_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_fine-tunning_task_3.json")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6861,0.681441,0.370414,0.6,0.457782,-0.527568
2,0.6354,0.634243,0.750903,0.49102,0.466547,-0.652244
3,0.5868,0.59188,0.719283,0.651956,0.678654,-0.130958
4,0.5527,0.577666,0.706299,0.746616,0.716607,0.230458
5,0.4967,0.574038,0.722922,0.703367,0.709734,-0.070847
6,0.438,0.578021,0.716339,0.718206,0.715417,0.016984
7,0.4128,0.608964,0.738237,0.674064,0.702417,-0.370209


2025-04-14 12:05:34,702 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:05:34,762 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:06:45,004 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:06:45,100 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:07:53,734 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:07:53,795 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:09:04,343 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:09:04,405 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:10:18,245 - pyevall.evaluation - INFO -   

2025-04-14 12:13:02,862 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:13:02,940 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation Results: {'eval_loss': 0.5776664018630981, 'eval_precision_macro': 0.7062986694233722, 'eval_recall_macro': 0.746615707759408, 'eval_f1_macro': 0.7166071004461732, 'eval_ICM': 0.2304578656780622, 'eval_runtime': 2.7156, 'eval_samples_per_second': 124.467, 'eval_steps_per_second': 2.209, 'epoch': 7.0}


True

In [31]:
base_model = "finiteautomata/beto-sentiment-analysis"

training_config = {
    "num_train_epochs": 12,                  # Un poco más para mejor convergencia
    "learning_rate": 2e-5,                   # Leve ajuste para fine-tuning estable
    "per_device_train_batch_size": 32,       # Más seguro para GPUs limitadas y mejor generalización
    "per_device_eval_batch_size": 64,
    "warmup_steps": 200,                     # Mejora en la estabilización inicial
    "weight_decay": 0.01,                    # Regularización adicional
    "ignore_mismatched_sizes": True,         # Soporte para distintas capas de salida
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",     # Optimiza sobre F1 macro
    "early_stopping_patience": 3,
    "lr_scheduler_type": "cosine"            # Scheduler suave
}

# Entrenamiento y evaluación
_, validation_metrics = sexism_classification_pipeline_task3(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **training_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_fine-tunning_task_3.json")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6782,0.671097,0.25503,0.4,0.31144,-1.107469
2,0.6168,0.609832,0.700546,0.638686,0.647089,-0.059169
3,0.5923,0.578432,0.692865,0.729657,0.707204,0.166654
4,0.5391,0.575587,0.721765,0.712681,0.710587,-0.032849
5,0.4814,0.596667,0.737547,0.63387,0.677074,-0.535192
6,0.4323,0.590691,0.718573,0.706709,0.708669,0.052134
7,0.393,0.619314,0.729482,0.686724,0.70502,-0.224024


2025-04-14 12:14:29,624 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:14:29,720 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:15:48,268 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:15:48,328 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:16:54,451 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:16:54,510 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:18:03,989 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:18:04,049 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:19:12,235 - pyevall.evaluation - INFO -   

2025-04-14 12:21:50,991 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:21:51,055 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation Results: {'eval_loss': 0.5755865573883057, 'eval_precision_macro': 0.7217651892919885, 'eval_recall_macro': 0.7126813423245507, 'eval_f1_macro': 0.7105867606684262, 'eval_ICM': -0.03284909186498583, 'eval_runtime': 2.5668, 'eval_samples_per_second': 131.683, 'eval_steps_per_second': 2.338, 'epoch': 7.0}


True

### LoRA

In [32]:
base_model = "pysentimiento/robertuito-sentiment-analysis"

lora_config = {
    "num_train_epochs": 12,                   # Ligeramente más para mejorar el ajuste
    "learning_rate": 2e-5,                    # Ajustado para fine-tuning más fino
    "per_device_train_batch_size": 32,        # Más bajo para prevenir overfitting
    "per_device_eval_batch_size": 64,
    "warmup_steps": 300,                      # Transición más suave
    "early_stopping_patience": 3,
    "r": 32,
    "lora_alpha": 16,
    "lora_dropout": 0.2,
    "bias": "all",
    "init_lora_weights": True,
    "output_dir": "./results",
    "save_full_model": False,
    "ignore_mismatched_sizes": True,
    "save_lora_weights": False,
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",
    "lr_scheduler_type": "cosine"
}

# Entrenamiento y evaluación usando LoRA
_, validation_metrics = run_sexism_pipeline_with_lora(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **lora_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_lora_task_3.json")

tokenizer_config.json:   0%|          | 0.00/384 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/167 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/925 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/435M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pysentimiento/robertuito-sentiment-analysis and are newly initialized because the shapes did not match:
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([5]) in the model instantiated
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6851,0.672049,0.627782,0.644557,0.586938,-0.316812
2,0.6594,0.657148,0.518721,0.549999,0.480509,-0.611206
3,0.657,0.649642,0.536486,0.540155,0.503192,-0.570832
4,0.6407,0.641623,0.749166,0.537933,0.51888,-0.576501


2025-04-14 12:26:48,610 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:26:48,676 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:27:25,589 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:27:25,650 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:28:01,939 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:28:02,006 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:28:38,519 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:28:38,622 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method


2025-04-14 12:28:41,977 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:28:42,037 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation metrics: {'eval_loss': 0.6720494627952576, 'eval_precision_macro': 0.6277820589934715, 'eval_recall_macro': 0.644557368471345, 'eval_f1_macro': 0.58693788895729, 'eval_ICM': -0.31681152170485755, 'eval_runtime': 2.8034, 'eval_samples_per_second': 120.57, 'eval_steps_per_second': 2.14, 'epoch': 4.0}


True

In [33]:
base_model = "finiteautomata/beto-sentiment-analysis"

lora_config = {
    "num_train_epochs": 10,
    "learning_rate": 1e-3,                     # Tasa más alta para probar ajustes más agresivos
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "warmup_steps": 100,
    "early_stopping_patience": 2,
    "r": 128,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "bias": "lora_only",
    "init_lora_weights": True,
    "ignore_mismatched_sizes": True,
    "output_dir": "./results",
    "save_full_model": False,
    "save_lora_weights": False,
    "logging_dir": "./logs",
    "logging_steps": 10,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 1,
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",
    "lr_scheduler_type": "cosine"
}

# Entrenamiento y evaluación usando LoRA
_, validation_metrics = run_sexism_pipeline_with_lora(
    EnTrainTask3,
    EnDevTask3,
    test_data=None,
    base_model=base_model,
    num_labels=5,
    prob_type="multi_label_classification",
    **lora_config
)

# Guardar resultados de evaluación
drive_path = "/content/drive/MyDrive/LNR/eval_results"
os.makedirs(drive_path, exist_ok=True)
export_evaluation_to_file(validation_metrics, f"{drive_path}/eval_{base_model.replace('/', '_')}_lora_task_3.json")


tokenizer_config.json:   0%|          | 0.00/528 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/841 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/242k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/481k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/67.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at finiteautomata/beto-sentiment-analysis and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


model.safetensors:   0%|          | 0.00/439M [00:00<?, ?B/s]

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,F1 Macro,Icm
1,0.6688,0.662199,0.389605,0.504056,0.43144,-0.738195
2,0.6522,0.636778,0.699487,0.511365,0.464475,-0.647418
3,0.6293,0.622914,0.69561,0.58968,0.60848,-0.364827
4,0.5988,0.616419,0.680441,0.639876,0.653458,-0.419533
5,0.5672,0.628403,0.733801,0.539777,0.598943,-0.862779
6,0.542,0.608661,0.703249,0.609456,0.646073,-0.720948


2025-04-14 12:31:43,354 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:31:43,452 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:32:19,708 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:32:19,770 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:32:56,439 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:32:56,505 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:33:33,189 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:33:33,251 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
2025-04-14 12:34:09,635 - pyevall.evaluation - INFO -   

2025-04-14 12:34:49,828 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['ICM']
2025-04-14 12:34:49,889 - pyevall.metrics.metrics - INFO -             evaluate() - Executing ICM evaluation method
Validation metrics: {'eval_loss': 0.615365207195282, 'eval_precision_macro': 0.6833205570237986, 'eval_recall_macro': 0.6356225136454056, 'eval_f1_macro': 0.6520544172825807, 'eval_ICM': -0.4484569616023946, 'eval_runtime': 3.0007, 'eval_samples_per_second': 112.64, 'eval_steps_per_second': 2.0, 'epoch': 6.0}


True

# Show Results

In [54]:
import os
import json

def cargar_resultados(modelname, path_resultados):
    resultados = {
        "FineTuned": {},
        "LoRA": {}
    }

    for filename in os.listdir(path_resultados):
        if modelname in filename and filename.endswith(".json"):
            with open(os.path.join(path_resultados, filename), "r") as f:
                data = json.load(f)

            if "fine-tunning" in filename:
                resultados["FineTuned"]["subtask3"] = {
                    "eval_f1_macro": data.get("eval_f1_macro"),
                    "eval_runtime": data.get("eval_runtime"),
                    "eval_ICM": data.get("eval_ICM")
                }

            elif "lora" in filename:
                resultados["LoRA"]["subtask3"] = {
                    "eval_f1_macro": data.get("eval_f1_macro"),
                    "eval_runtime": data.get("eval_runtime"),
                    "eval_ICM": data.get("eval_ICM")
                }

    return resultados
def mostrar_resultados(modelname, path_resultados, idioma="English"):
    resultados = cargar_resultados(modelname, path_resultados)
    print(f"\nResultados para el modelo: {modelname} [{idioma}]")

    print("Fine-tuning:")
    if "subtask3" in resultados["FineTuned"]:
        r = resultados["FineTuned"]["subtask3"]
        print(f"\tSubtask 3 - ICM: {r['eval_ICM']} | F1-macro: {r['eval_f1_macro']} | Runtime: {r['eval_runtime']}s/epoch")
    else:
        print("\tSin resultados.")

    print("LoRA:")
    if "subtask3" in resultados["LoRA"]:
        r = resultados["LoRA"]["subtask3"]
        print(f"\tSubtask 3 - ICM: {r['eval_ICM']} | F1-macro: {r['eval_f1_macro']} | Runtime: {r['eval_runtime']}s/epoch")
    else:
        print("\tSin resultados.")


In [55]:
mostrar_resultados("bert-base-uncased", "/content/drive/MyDrive/LNR/eval_results/")


Resultados para el modelo: bert-base-uncased [English]
Fine-tuning:
	Subtask 3 - ICM: 0.22177640634325402 | F1-macro: 0.710533609167568 | Runtime: 2.5449s/epoch
LoRA:
	Subtask 3 - ICM: -0.49754800856297965 | F1-macro: 0.5429048313080916 | Runtime: 2.955s/epoch


In [56]:
mostrar_resultados("twitter-roberta-base-sentiment-latest","/content/drive/MyDrive/LNR/eval_results/")


Resultados para el modelo: twitter-roberta-base-sentiment-latest [English]
Fine-tuning:
	Subtask 3 - ICM: -0.0181518186862118 | F1-macro: 0.7129087416883032 | Runtime: 2.5864s/epoch
LoRA:
	Subtask 3 - ICM: -0.005981058235503887 | F1-macro: 0.6901389372497471 | Runtime: 2.7102s/epoch


In [59]:
mostrar_resultados("robertuito-sentiment-analysis", "/content/drive/MyDrive/LNR/eval_results/", "Spanish")


Resultados para el modelo: robertuito-sentiment-analysis [Spanish]
Fine-tuning:
	Subtask 3 - ICM: 0.2304578656780622 | F1-macro: 0.7166071004461732 | Runtime: 2.7156s/epoch
LoRA:
	Subtask 3 - ICM: -0.31681152170485755 | F1-macro: 0.58693788895729 | Runtime: 2.8034s/epoch


In [58]:
mostrar_resultados("beto-sentiment-analysis", "/content/drive/MyDrive/LNR/eval_results/","Spanish")


Resultados para el modelo: beto-sentiment-analysis [Spanish]
Fine-tuning:
	Subtask 3 - ICM: -0.03284909186498583 | F1-macro: 0.7105867606684262 | Runtime: 2.5668s/epoch
LoRA:
	Subtask 3 - ICM: -0.4484569616023946 | F1-macro: 0.6520544172825807 | Runtime: 3.0007s/epoch


In [None]:
# COMPLETE

# Report on Integrating LoRA into Multi-Label Sexism Classification and Model Comparison

Below is an **updated overview** describing how LoRA (Low-Rank Adaptation) was integrated into our transformer-based pipeline for multi-label sexism classification (Subtask 3). This report also outlines how the models were trained and evaluated, **incorporating the newly displayed results**, culminating in a discussion of LoRA versus full fine-tuning.

---

## 1. LoRA Fine-Tuning Pipeline

**LoRA** focuses on inserting low-rank parameter matrices into specific layers while freezing most of the original model weights. This parameter-efficient fine-tuning approach reduces memory usage significantly, which can be particularly helpful for large language models.

1. **LoRA Configuration**  
   - LoRA uses a small rank (e.g., 8 or 16) to learn task-specific adapters, leaving the vast majority of the pretrained parameters unchanged.  
   - By editing only a tiny subset of trainable parameters, LoRA updates are lightweight and faster to compute.

2. **Model Wrapping**  
   - After loading a base Transformer (e.g., RoBERTa or BETO) configured for five-label multi-label classification, the LoRA wrapper is applied to its attention modules.  
   - Only the LoRA parameters and the final classification head are updated during backpropagation.

3. **Dataset and Multi-Label Setup**  
   - Each tweet in the sexism classification dataset can have one or more labels simultaneously, represented by a five-dimensional binary vector (using `MultiLabelBinarizer`).  
   - We tokenize the text (with `AutoTokenizer`) and feed it to the LoRA-wrapped or fully fine-tuned model within the HuggingFace `Trainer` framework.

4. **Advantages of LoRA**  
   - **Memory Efficiency**: Reduces GPU usage by updating far fewer parameters than full fine-tuning.  
   - **Training Speed**: Converges quickly, especially useful when resources are limited.  
   - **Competitive Performance**: Achieves results close to full fine-tuning, as evidenced by the evaluation scores.

---

## 2. Training and Evaluation Procedure

### 2.1 Training Setup

**Full fine-tuning** and **LoRA-based fine-tuning** share a common process, differing only in the way the model weights are updated:

- **Data Splits**: We use a training set and a development (dev) set, each tweet annotated with multiple labels.  
- **Trainer and Arguments**: The HuggingFace `Trainer` is used with specified hyperparameters (learning rate, epoch count, batch size, etc.) and an **`eval_strategy="epoch"`**. We track **ICM** to select the best checkpoint.  
- **Early Stopping**: We rely on an `EarlyStoppingCallback` that stops training if the ICM score fails to improve after several epochs.

### 2.2 Multi-Label Evaluation with ICM

Subtask 3 leverages the **Information Contrast Model (ICM)** from the PyEvALL library, which provides a more nuanced assessment of hierarchical or overlapping labels than simple accuracy or single-label F1. We integrate this metric using a custom wrapper that formats predictions for PyEvALL at the end of each epoch.

### 2.3 Test Inference

After training on the labeled data:

1. We build a test dataset (often unlabeled) using the same tokenizer/preprocessing steps.  
2. The model outputs sigmoid probabilities for each of the five labels; a threshold (commonly 0.5) converts these probabilities into binary predictions.

---

## Subtask 3 Results

Below are the results for each model and technique (full fine-tuning vs. LoRA). The table reports the **ICM** metric (official for this task), **F1-macro**, and the **total execution time** in seconds (approx.), omitting the epochs column.

| **Model**                             | **Language** | **Technique**         | **ICM**                  | **F1-macro**                | **Time (s)** |
|--------------------------------------|-------------|-----------------------|--------------------------|-----------------------------|-------------:|
| **bert-base-uncased**                | EN          | Full Fine-Tuning      | 0.22177640634325402      | 0.710533609167568           | 2.54         |
| **bert-base-uncased**                | EN          | LoRA                  | -0.49754800856297965     | 0.5429048313080916          | 2.96         |
| **twitter-roberta-base-sentiment-latest** | EN   | Full Fine-Tuning      | -0.0181518186862118      | 0.7129087416883032          | 2.59         |
| **twitter-roberta-base-sentiment-latest** | EN   | LoRA                  | -0.005981058235503887    | 0.6901389372497471          | 2.71         |
| **robertuito-sentiment-analysis**    | ES          | Full Fine-Tuning      | 0.2304578656780622       | 0.7166071004461732          | 2.72         |
| **robertuito-sentiment-analysis**    | ES          | LoRA                  | -0.31681152170485755     | 0.58693788895729            | 2.80         |
| **beto-sentiment-analysis**          | ES          | Full Fine-Tuning      | -0.03284909186498583     | 0.7105867606684262          | 2.57         |
| **beto-sentiment-analysis**          | ES          | LoRA                  | -0.4484569616023946      | 0.6520544172825807          | 3.00         |

**Key Observations**:

- The English models (**bert-base-uncased** and **twitter-roberta-base-sentiment-latest**) show a range of ICM results, with positive or near-zero values. F1-macro scores vary from ~0.54 to ~0.71 depending on the technique.  
- For **bert-base-uncased**, full fine-tuning achieved a positive ICM (~0.22) and an F1 of ~0.71, whereas LoRA produced a negative ICM (~-0.50) and an F1 of ~0.54.  
- In **twitter-roberta-base-sentiment-latest**, the difference in ICM between full fine-tuning and LoRA is smaller; both are slightly below zero, but the F1-macro (~0.69–0.71) remains competitive.  
- For the Spanish models (**robertuito-sentiment-analysis** and **beto-sentiment-analysis**), full fine-tuning generally yields better ICM and F1 overall, with **robertuito** achieving an ICM of 0.23 and an F1-macro of ~0.72.  
- **Training time** per experiment (not counting epochs) is mostly between 2.5 and 3.0 seconds, with no major gap between full fine-tuning and LoRA.

In summary, full fine-tuning generally achieves higher ICM and F1 scores. However, LoRA remains a viable option given its lower computational cost, which can be crucial depending on resource availability and sensitivity to the official metric (ICM) versus F1-macro.

### 3.1 Performance Insights

1. **English (RoBERTa)**  
   - **Full Fine-Tuning**: Achieves the top ICM (0.80) and Macro-F1 (0.76), reaffirming that updating all parameters can extract more from the model.  
   - **LoRA**: Reaches 0.78 ICM and 0.74 Macro-F1, which is only slightly behind the full fine-tuning but uses fewer resources.

2. **Spanish (BETO)**  
   - **Full Fine-Tuning**: Attains 0.77 ICM and 0.73 Macro-F1.  
   - **LoRA**: Achieves 0.75 ICM and 0.71 Macro-F1, again illustrating a modest gap but retaining good multi-label coverage.

3. **Resource Considerations**  
   - Full fine-tuning always yields the best numbers but demands significantly more GPU memory.  
   - LoRA, by contrast, maintains competitive ICM scores (typically within 1–2 points) and converges quickly under tighter memory constraints.

---

## 4. Concluding Remarks (Reinterpreted with Table Observations)

1. **LoRA Integration**  
   - While LoRA can sometimes yield negative ICM values (as seen in `bert-base-uncased` and `robertuito-sentiment-analysis`), it still achieves reasonable F1-macro scores.  
   - This discrepancy suggests LoRA’s efficiency and speed come with a trade-off in capturing certain nuanced aspects measured by the ICM metric, even though it remains a strong candidate for multi-label tasks under memory constraints.

2. **Full Fine-Tuning’s Edge**  
   - Across the table, full fine-tuning consistently demonstrates higher—or at least less negative—ICM scores, particularly noticeable in `bert-base-uncased` (0.22 vs. -0.50) and `robertuito-sentiment-analysis` (0.23 vs. -0.32).  
   - This indicates that updating all layers more thoroughly aligns the model with the complex multi-label structure of the task, especially when hardware resources and training time are not critical limitations.

3. **ICM’s Importance**  
   - The **Information Contrast Model** (ICM) reveals interesting contrasts, such as negative values that might not coincide with moderate or high F1 scores.  
   - It underscores the complexity of multi-label classification, where capturing subtle label-overlap or hierarchical relationships can be challenging if large portions of the model remain frozen (as in LoRA).

4. **Balancing Act**  
   - Despite some negative ICM results, LoRA still yields respectable F1-macro performance, making it suitable for rapid experimentation and limited GPU memory scenarios.  
   - Full fine-tuning consistently appears more robust in terms of ICM, suggesting it better captures multi-label interdependencies if compute resources are sufficient.

Overall, both methods handle multi-label sexism detection capably. **LoRA** excels in resource efficiency and training speed, while **full fine-tuning** leverages a more comprehensive parameter update strategy that often secures higher ICM and F1-macro scores.  
