# Transformers and Explainability

**Experimental Setup**

Pretrained transformer models (`PlanTL-GOB-ES/roberta-base-bne` for Spanish and `roberta-base` for English) were fine-tuned with LoRA and passed to ferret's `Benchmark` to generate and evaluate explanations for sexism classification of tweets.

**Methodology**

Explanations for the given sentences and target classes were generated, deduplicated (repeated tokens were given unique suffixes), and quantitatively evaluated for faithfulness. The results are displayed below as interactive tables for easy inspection.
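
To make "evaluated for faithfulness" concrete, here is a minimal sketch of one metric of this kind, a comprehensiveness-style score: delete the tokens the explanation ranks highest and measure how much the predicted probability drops. This is a simplified illustration and not ferret's implementation; `toy_predict_proba` and the scores are hypothetical stand-ins for a real classifier and a real explainer.

```python
from typing import Callable, List, Sequence


def comprehensiveness(
    tokens: Sequence[str],
    scores: Sequence[float],
    predict_proba: Callable[[List[str]], float],
    k: int = 1,
) -> float:
    """Probability drop after deleting the k highest-scored tokens."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top_k = set(order[:k])
    reduced = [tok for i, tok in enumerate(tokens) if i not in top_k]
    return predict_proba(list(tokens)) - predict_proba(reduced)


def toy_predict_proba(tokens: List[str]) -> float:
    """Hypothetical classifier: probability grows with the number of trigger words."""
    triggers = {"bad", "awful"}
    hits = sum(tok in triggers for tok in tokens)
    return min(1.0, 0.1 + 0.4 * hits)


tokens = ["this", "is", "bad", "and", "awful"]
faithful_scores = [0.0, 0.0, 0.9, 0.0, 0.8]    # highlights the trigger words
unfaithful_scores = [0.9, 0.8, 0.0, 0.0, 0.0]  # highlights filler words

faithful_drop = comprehensiveness(tokens, faithful_scores, toy_predict_proba, k=2)
unfaithful_drop = comprehensiveness(tokens, unfaithful_scores, toy_predict_proba, k=2)
print("faithful explanation drop:", round(faithful_drop, 3))
print("unfaithful explanation drop:", round(unfaithful_drop, 3))
```

A larger drop means the highlighted tokens really mattered to the prediction, which is the sense in which an explanation is called faithful.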

## Some libraries

In [None]:
!pip install \
    transformers==4.43.1 \
    sentence-transformers==2.2.2 \
    ferret-xai==0.4.2 \
    scikit-learn==1.4.2 \
    pandas==2.2.2 \
    tqdm==4.66.4

In [None]:
!pip install numpy==1.25.2

In [None]:
import os
import sys
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from typing import List, Tuple
from tqdm import tqdm

In [None]:
!pip install numpy==1.26.4
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

## Import the readerEXIST2025 library and read the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')
base_path = "/content/drive/MyDrive"

In [None]:
sys.path.append(base_path)
from readerEXIST2025 import EXISTReader

In [None]:
file_train = os.path.join(base_path, "EXIST2025_training.json")
file_dev = os.path.join(base_path, "EXIST2025_dev.json")

reader_train = EXISTReader(file_train)
reader_dev = EXISTReader(file_dev)

EnTrainTask1, EnDevTask1 = reader_train.get(lang="EN", subtask="1"), reader_dev.get(lang="EN", subtask="1")
SpTrainTask1, SpDevTask1 = reader_train.get(lang="ES", subtask="1"), reader_dev.get(lang="ES", subtask="1")

## Dataset class

In [None]:
class SexismDataset(Dataset):
    def __init__(self, texts, labels, ids, tokenizer, max_len=128, pad="max_length", trunc=True, rt='pt'):
        self.texts = texts.tolist()
        self.labels = labels
        self.ids = ids
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.pad = pad
        self.trunc = trunc
        self.rt = rt

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding=self.pad,
            truncation=self.trunc,
            return_tensors=self.rt
        )

        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long),
            'id': torch.tensor(self.ids[idx], dtype=torch.long)
        }

## Auxiliary functions

In [None]:
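from collections import Counter
from itertools import tee, count

def uniquify(seq, suffs=None):
    """Make all items unique in place by appending a suffix (1, 2, ...) to repeated ones.

    `seq` is a mutable sequence of strings.
    `suffs` is an optional alternative suffix iterable.
    """
    if suffs is None:
        # Fresh counter per call; a shared default iterator would be consumed across calls.
        suffs = count(1)
    not_unique = [k for k, v in Counter(seq).items() if v > 1]
    # One independent suffix iterator per repeated item.
    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))
    for idx, s in enumerate(seq):
        try:
            suffix = str(next(suff_gens[s]))
        except KeyError:
            continue  # item is already unique
        else:
            seq[idx] += suffix

def deduplicate(explanations):
    """Rename repeated tokens in each explanation (e.g. `la_1`, `la_2`)."""
    for explanation in explanations:
        tokens = explanation.tokens
        uniquify(tokens, (f'_{x!s}' for x in range(1, 100)))
        explanation.tokens = tokens
    return explanations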
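
As a quick illustration of what the deduplication step does to a token list, the sketch below applies the same renaming rule to a small example. `uniquify_copy` is a hypothetical, non-mutating variant written only for this demonstration; the notebook itself uses the in-place `uniquify` above. The renaming is presumably needed because the explanation tables use tokens as column labels, which must be unique.

```python
from collections import Counter
from typing import List


def uniquify_copy(tokens: List[str]) -> List[str]:
    """Return a copy where every repeated token gets a `_1`, `_2`, ... suffix."""
    repeated = {tok for tok, n in Counter(tokens).items() if n > 1}
    seen: Counter = Counter()
    result = []
    for tok in tokens:
        if tok in repeated:
            seen[tok] += 1
            result.append(f"{tok}_{seen[tok]}")
        else:
            result.append(tok)  # unique tokens are left untouched
    return result


tokens = ["la", "mujer", "la", "casa", "mujer"]
renamed = uniquify_copy(tokens)
print(renamed)  # ['la_1', 'mujer_1', 'la_2', 'casa', 'mujer_2']
```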

# Two options to predict

### LoRA pipeline

In [None]:
from peft import LoraConfig, get_peft_model, TaskType
from transformers import EarlyStoppingCallback  # used by the Trainer below

def sexism_classification_pipeline_task1_LoRA(trainInfo, devInfo, testInfo=None, model_name='roberta-base', nlabels=2, ptype="single_label_classification", **args):
    # Model and Tokenizer
    labelEnc = LabelEncoder()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=nlabels,
        problem_type=ptype
    )

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=args.get("task_type", TaskType.SEQ_CLS),
        target_modules=args.get("target_modules", ["query", "value"]),
        r=args.get("rank", 64),  # Rank of LoRA adaptation
        lora_alpha=args.get("lora_alpha", 32),  # Scaling factor
        lora_dropout=args.get("lora_dropout", 0.1),
        bias=args.get("bias", "none")
    )

    # Prepare LoRA model
    peft_model = get_peft_model(model, lora_config)

    # Prepare datasets
    train_dataset = SexismDataset(trainInfo[1], labelEnc.fit_transform(trainInfo[2]),[int(x) for x in trainInfo[0]], tokenizer )
    val_dataset = SexismDataset(devInfo[1], labelEnc.transform(devInfo[2]), [int(x) for x in devInfo[0]], tokenizer)

    # Training Arguments
    training_args = TrainingArguments(
        report_to="none", # alt: "wandb", "tensorboard" "comet_ml" "mlflow" "clearml"
        output_dir= args.get('output_dir', './results_task1_LoRA0'),
        num_train_epochs= args.get('num_train_epochs', 5),
        learning_rate=args.get('learning_rate', 5e-5),
        per_device_train_batch_size=args.get('per_device_train_batch_size', 16),
        per_device_eval_batch_size=args.get('per_device_eval_batch_size', 64),
        warmup_steps=args.get('warmup_steps', 500),
        weight_decay=args.get('weight_decay',0.01),
        logging_dir=args.get('logging_dir', './logs'),
        logging_steps=args.get('logging_steps', 10),
        eval_strategy=args.get('eval_strategy','epoch'),
        save_strategy=args.get('save_strategy', "epoch"),
        save_total_limit=args.get('save_total_limit', 1),
        load_best_model_at_end=args.get('load_best_model_at_end', True),
        metric_for_best_model=args.get('metric_for_best_model',"f1")
    )

    # Initialize Trainer
    trainer = Trainer(
        # Prepare LoRA model
        model=peft_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics_1,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=args.get("early_stopping_patience",3))]
    )

    # Fine-tune the model
    trainer.train()

    # Evaluate on validation set
    eval_results = trainer.evaluate()
    print("Validation Results:", eval_results)

    # Saving the new weights for the LoRA model:
    # trainer.save_model(dir)
    # Note that in this case only the LoRA matrices are saved.
    # The weights for the classification head are not saved.

    # Merging the LoRA matrices into the weights of the base model
    mixModel = peft_model.merge_and_unload()
    # mixModel.save_pretrained(dir)
    # In this case the full model is saved.

    if testInfo is not None:
        # Prepare test dataset for prediction
        test_dataset = SexismDataset(testInfo[1], [0] * len(testInfo[1]),  [int(x) for x in testInfo[0]],   tokenizer)

        # Predict test set labels
        predictions = trainer.predict(test_dataset)
        predicted_labels = np.argmax(predictions.predictions, axis=1)

        # Create submission DataFrame
        submission_df = pd.DataFrame({
            # 'id': testInfo[0],
            'label': labelEnc.inverse_transform(predicted_labels),
            "test_case": ["EXIST2025"]*len(predicted_labels)
        })
        submission_df.to_csv('sexism_predictions_task1.csv', index=False)
        print("Prediction for TASK 1 completed. Results saved to sexism_predictions_task1.csv")
        # Return the merged model (LoRA weights folded into the base model)
        return mixModel, submission_df
    return mixModel, eval_results
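
The merge step in the pipeline folds the low-rank update into the base weights. A minimal sketch of the arithmetic, assuming the standard LoRA formulation `W' = W + (lora_alpha / r) * B @ A`, is shown below with plain Python lists. The helper names `matmul` and `merge_lora` and the tiny matrices are illustrative only; the real merge is performed by `merge_and_unload()`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, r: int, lora_alpha: float) -> Matrix:
    """Return W + (lora_alpha / r) * B @ A, where B is (out x r) and A is (r x in)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
lora_b = [[1.0], [2.0]]        # B: out_features x r
lora_a = [[3.0, 4.0]]          # A: r x in_features
merged = merge_lora(w, lora_a, lora_b, r=1, lora_alpha=2)
print(merged)  # [[7.0, 8.0], [12.0, 17.0]]

# With the pipeline defaults (rank 64, lora_alpha 32) the update is scaled by 0.5.
default_scale = 32 / 64
print(default_scale)
```

After merging, the model has the same shape as the base model, which is why it can be passed to other tools as an ordinary sequence-classification model.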

### Metrics

In [None]:
def compute_metrics_1(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
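
For reference, the four metrics returned above can be reproduced by hand from the confusion counts. The sketch below is a dependency-free illustration on made-up labels; `binary_metrics` is a hypothetical helper, and it returns 0.0 where a denominator is zero, mirroring `zero_division=0`.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
example_metrics = binary_metrics(y_true, y_pred)
print(example_metrics)
```

Here there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics equal 0.75.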

### The simplest option: predict with `Trainer`

In [None]:
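def predict_op1(model, dataset, args=None):
    """Return the predicted classes and the softmax probabilities for `dataset`."""
    args = args or {}  # avoid a mutable default argument

    # Inference-only Trainer: neither training nor evaluation is run.
    training_args = TrainingArguments(
        output_dir="./output",
        per_device_eval_batch_size=args.get("per_device_eval_batch_size", 16),
        do_train=False,
        do_eval=False,
    )
    trainer = Trainer(model=model, args=training_args)

    predictions = trainer.predict(dataset)
    logits = predictions.predictions

    # Convert logits to class probabilities; the argmax is the predicted class.
    probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
    pred_classes = np.argmax(logits, axis=-1)

    return pred_classes, probs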
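
The last two steps of `predict_op1` turn logits into probabilities and class indices. A small dependency-free sketch of that conversion follows; the logits are made-up values, and `softmax` and `argmax` here are illustrative helpers rather than the torch and NumPy calls used above.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits_batch = [[2.0, 0.0], [-1.0, 1.0]]
probs_batch = [softmax(row) for row in logits_batch]
pred_classes = [argmax(row) for row in logits_batch]

for row, cls in zip(probs_batch, pred_classes):
    print([round(p, 4) for p in row], "-> class", cls)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits or of the probabilities gives the same class.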

### Predictions from the best Spanish model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback

params = {
    "learning_rate": 5e-5,
    "num_train_epochs": 4,
    "per_device_train_batch_size": 16,
    "warmup_steps": 100,
    "weight_decay": 0.01,
    "early_stopping_patience": 2,
    "use_lora": True,
}
modelname = "PlanTL-GOB-ES/roberta-base-bne"
output_dir = "./drive/MyDrive/best-finetuned-model-spanish"
mixmodel, res = sexism_classification_pipeline_task1_LoRA(SpTrainTask1, SpDevTask1, None, modelname, 2, "single_label_classification", **params)

In [None]:
label_encoder = LabelEncoder()
val_labels_encoded = label_encoder.fit_transform(SpDevTask1[2])  # assuming SpDevTask1[2] holds the string labels

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")

# Reuse the labels encoded above instead of fitting a second encoder
val_dataset = SexismDataset(SpDevTask1[1], val_labels_encoded, [int(x) for x in SpDevTask1[0]], tokenizer)

preds, probs = predict_op1(mixmodel, val_dataset)


### Evaluation of the results

In [None]:
def compute_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary'
    )
    acc = accuracy_score(y_true, y_pred)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall,
    }

metrics = compute_metrics(val_dataset.labels, preds)
print("Metrics:", metrics)

conf_matrix = confusion_matrix(val_dataset.labels, preds)
print("\nConfusion matrix:")
print(conf_matrix)

class_report = classification_report(val_dataset.labels, preds, target_names=["NO", "YES"], digits=4)
print("\nReport:")
print(class_report)

### Plot confusion matrix


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["No", "Yes"],
            yticklabels=["No", "Yes"])
plt.title("Confusion Matrix")
plt.xlabel("Prediction")
plt.ylabel("Actual")
plt.show()

### Text and probability of false positives and false negatives

In [None]:
def false_positive_false_negative(
    y_true: List[int], y_pred: List[int], pred_probs: List[float], texts: List[str]
) -> Tuple[List[Tuple[float, str]], List[Tuple[float, str]]]:
    """
    Returns the incorrect predictions of a binary classification model.

    Parameters:
    - y_true (list[int]): True labels for each sample (0 or 1).
    - y_pred (list[int]): Labels predicted by the model (0 or 1).
    - pred_probs (list[float]): Probabilities assigned by the model to each sample;
      in this notebook, the softmax output of `predict_op1` (one value per class).
    - texts (list[str]): Text of each sample.

    Returns:
    - false_positives (list[tuple[float, str]]): List of tuples for false positives.
    - false_negatives (list[tuple[float, str]]): List of tuples for false negatives.
    Each tuple in both lists holds the probability and the text of the sample.
    """

    false_positive = []
    false_negative = []

    for i in range(len(y_pred)):
        if y_pred[i] == 1 and y_true[i] == 0:
            false_positive.append((pred_probs[i], texts[i]))
        elif y_pred[i] == 0 and y_true[i] == 1:
            false_negative.append((pred_probs[i], texts[i]))

    return false_positive, false_negative

In [None]:
fpositive, fnegative = false_positive_false_negative(y_true = val_dataset.labels, y_pred = preds, pred_probs = probs, texts = val_dataset.texts)

print("False positive:", len(fpositive))
# for i,s in enumerate(fpositive):
#     print(i,'S', s[0], s[1])
print(19, fpositive[19][0])  # highest prob
print(fpositive[19][1])
print(22, fpositive[22][0])
print(fpositive[22][1]) # lowest prob, almost 1/2
print(33, fpositive[33][0])
print(fpositive[33][1]) # almost highest prob

In [None]:
print("False negative:", len(fnegative))
# for i,s in enumerate(fnegative):
#     print(i,'S', s[0], s[1])

print(22, fnegative[22][0])  # highest prob
print(fnegative[22][1])
print(0, fnegative[0][0])
print(fnegative[0][1]) # lowest prob, almost 1/2
print(9, fnegative[9][0])
print(fnegative[9][1]) # prob somewhere in the middle - 75%

### Select some samples to analyze

In [None]:
fpositive_samples = [fpositive[19][1], fpositive[22][1], fpositive[33][1]]
fnegative_samples = [fnegative[22][1], fnegative[0][1], fnegative[9][1]]
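
The indices above were picked by reading the printed list. As a sketch of how the same selection could be automated, the hypothetical helper below sorts the errors by confidence and returns the lowest, middle and highest one. It assumes each entry is a `(probability, text)` pair with a scalar probability for the predicted class; with the per-class softmax output used in this notebook, one would first reduce each probability vector to a scalar, for example with `max`. The tweets are placeholders.

```python
from typing import Dict, List, Tuple

Error = Tuple[float, str]


def pick_by_confidence(errors: List[Error]) -> Dict[str, Error]:
    """Return the least, median and most confident misclassified samples."""
    if not errors:
        raise ValueError("no misclassified samples to choose from")
    ranked = sorted(errors, key=lambda item: item[0])
    return {
        "lowest": ranked[0],
        "middle": ranked[len(ranked) // 2],
        "highest": ranked[-1],
    }


# Placeholder data, not taken from the dataset.
errors = [
    (0.93, "tweet A"),
    (0.51, "tweet B"),
    (0.75, "tweet C"),
    (0.88, "tweet D"),
    (0.62, "tweet E"),
]
picked = pick_by_confidence(errors)
for name, (prob, text) in picked.items():
    print(name, prob, text)
```

Selecting samples this way keeps the analysis reproducible when the model is retrained and the order of the errors changes.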

In [None]:

from ferret import Benchmark
from IPython.display import display_html

model_es = mixmodel
bench = Benchmark(model_es, tokenizer)

def explain_this(benchmark, sentence, target):
    explanations = benchmark.explain(sentence, target=target)
    explanations_de = deduplicate(explanations)
    explanation_evaluations = benchmark.evaluate_explanations(explanations_de, target=target)
    print("Sentence:", sentence)
    print("Class:", target)
    tble = benchmark.show_table(explanations_de)
    tble2 = benchmark.show_evaluation_table(explanation_evaluations)
    display_html(tble.to_html(), raw=True)
    display_html(tble2.to_html(), raw=True)

### Show explanations

In [None]:
for sample in fpositive_samples:
    print("False Positive:")
    explain_this(bench, sample, 1)
    print("\n\n")

**Report**

**Spanish - false positives**

Based on the SHAP and LIME analysis, specific keywords like "mujer," "mujeres," "feminismo," and "feminista" appear to strongly influence the model's prediction. The model gives high importance to these terms, leading it to classify tweets as sexist mainly based on their presence. This strong keyword association, without taking into account the surrounding context or intent, likely causes these tweets to be misclassified as false positives.
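
One rough way to check the keyword hypothesis beyond three hand-picked samples is to compare how often these keywords occur in false positives versus correctly classified negatives. The sketch below assumes such lists of texts are available; `keyword_rate` is a hypothetical helper and the example sentences are invented placeholders, not tweets from the dataset.

```python
import re
from typing import Iterable, Sequence

KEYWORDS = ["mujer", "mujeres", "feminismo", "feminista"]


def keyword_rate(texts: Sequence[str], keywords: Iterable[str]) -> float:
    """Fraction of texts containing at least one keyword as a whole word."""
    if not texts:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        flags=re.IGNORECASE,
    )
    return sum(bool(pattern.search(text)) for text in texts) / len(texts)


# Invented placeholder sentences, used only to exercise the function.
fp_texts = [
    "Las mujeres también programan",
    "El feminismo es un movimiento social",
    "Hoy llueve mucho",
]
tn_texts = [
    "Me gusta el fútbol",
    "Una mujer ganó la carrera",
    "Mañana hay examen",
    "Buen día a todos",
]

fp_rate = keyword_rate(fp_texts, KEYWORDS)
tn_rate = keyword_rate(tn_texts, KEYWORDS)
print("keyword rate in false positives:", round(fp_rate, 2))
print("keyword rate in true negatives:", round(tn_rate, 2))
```

A clearly higher rate among false positives would be consistent with the keyword-driven behaviour described above, although it would not prove it on its own.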

In [None]:
for sample in fnegative_samples:
    print("False Negative:")
    explain_this(bench, sample, 0)
    print("\n\n")

**Spanish - false negatives**

According to the SHAP and LIME values, the model does not give enough weight to the parts of these tweets that signal sexism. None of the tokens receives a significant attribution, which suggests that, to the model, no word stands out from the average tone of the sentence.

As a result, the potentially relevant content is overlooked or canceled out, and the tweets are incorrectly classified as non-sexist.
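
The observation that no token has a significant value can be made more concrete by summarizing each attribution vector with two numbers: the largest absolute score and the share of total attribution mass held by that token. The sketch below uses made-up scores; `attribution_summary` is a hypothetical helper, and in practice the scores would come from the explanations produced by `bench.explain`.

```python
from typing import Dict, Sequence


def attribution_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Largest absolute attribution and its share of the total absolute mass."""
    abs_scores = [abs(s) for s in scores]
    total = sum(abs_scores)
    max_abs = max(abs_scores) if abs_scores else 0.0
    top_share = max_abs / total if total else 0.0
    return {"max_abs": max_abs, "top_share": top_share}


# Made-up attribution vectors for illustration.
peaked_scores = [0.02, 0.70, 0.05, 0.03]   # one token dominates
flat_scores = [0.05, -0.04, 0.06, -0.05]   # no token stands out

peaked = attribution_summary(peaked_scores)
flat = attribution_summary(flat_scores)
print("peaked:", {k: round(v, 3) for k, v in peaked.items()})
print("flat:", {k: round(v, 3) for k, v in flat.items()})
```

A flat profile like the second one matches the pattern described for the false negatives, where the attribution is spread thinly across the sentence.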


# Repeat the analysis in English

### Model

In [None]:
params = {
    "learning_rate": 5e-5,
    "num_train_epochs": 6,
    "per_device_train_batch_size": 16,
    "warmup_steps": 100,
    "weight_decay": 0.01,
    "early_stopping_patience": 2,
    "use_lora": True,
}
modelname = "roberta-base"
output_dir = "./drive/MyDrive/best-finetuned-model-english"

model2, res = sexism_classification_pipeline_task1_LoRA(EnTrainTask1, EnDevTask1, None, modelname, 2, "single_label_classification", **params)

### Predictions from best English model

In [None]:
label_encoder = LabelEncoder()
val_labels_encoded = label_encoder.fit_transform(EnDevTask1[2])

tokenizer_en = AutoTokenizer.from_pretrained("roberta-base")
# Reuse the labels encoded above instead of fitting a second encoder
val_dataset_en = SexismDataset(EnDevTask1[1], val_labels_encoded, [int(x) for x in EnDevTask1[0]], tokenizer_en)

preds_en, probs_en = predict_op1(model2, val_dataset_en)


### Evaluation of results

In [None]:
metrics_en = compute_metrics(val_dataset_en.labels, preds_en)
print("Metrics:", metrics_en)

conf_matrix_en = confusion_matrix(val_dataset_en.labels, preds_en)
print("\nConfusion matrix:")
print(conf_matrix_en)

class_report_en = classification_report(val_dataset_en.labels, preds_en, target_names=["NO", "YES"], digits=4)
print("\nReport:")
print(class_report_en)

### Plot confusion matrix

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_en, annot=True, fmt="d", cmap="Blues", cbar=False,
            xticklabels=["No", "Yes"],
            yticklabels=["No", "Yes"])
plt.title("Confusion Matrix")
plt.xlabel("Prediction")
plt.ylabel("Actual")
plt.show()

### False negatives and positives

In [None]:
fpositive_en, fnegative_en = false_positive_false_negative(y_true = val_dataset_en.labels, y_pred = preds_en, pred_probs = probs_en, texts = val_dataset_en.texts)

print("False positive:", len(fpositive_en))
# for i,s in enumerate(fpositive_en):
#     print(i,'S', s[0], s[1])

print(40, fpositive_en[40][0]) # - highest
print(fpositive_en[40][1])
print(42, fpositive_en[42][0]) # - lowest
print(fpositive_en[42][1])
print(26, fpositive_en[26][0]) # - middle
print(fpositive_en[26][1])

In [None]:
print("False negative:", len(fnegative_en))
# for i,s in enumerate(fnegative_en):
#     print(i,'S', s[0], s[1])

print(22, fnegative_en[22][0]) # highest
print(fnegative_en[22][1])
print(10, fnegative_en[10][0]) # lowest
print(fnegative_en[10][1])
print(16, fnegative_en[16][0]) # middle
print(fnegative_en[16][1])

### Select some samples to analyze

In [None]:
fpositive_samples_en = [fpositive_en[40][1], fpositive_en[42][1], fpositive_en[26][1]]
fnegative_samples_en = [fnegative_en[22][1], fnegative_en[10][1], fnegative_en[16][1]]

In [None]:
bench_en = Benchmark(model2, tokenizer_en)

### Show explanations

In [None]:
for sample in fpositive_samples_en:
    print("False Positive:")
    explain_this(bench_en, sample, 1)
    print("\n\n")

**English - false positives**

Based on the SHAP and LIME values, the misclassifications stem from the model's strong positive reaction to keywords. Consistent with the pattern seen for Spanish, terms like "women" significantly drive the prediction toward the sexist class. Words describing negative social concepts, such as "envy" and "gossip," and even explicit terms like "penis," especially when linked to "men," also contribute heavily to these false positives. This suggests the model associates sexism with specific gendered nouns and negatively charged vocabulary.

In [None]:
for sample in fnegative_samples_en:
    print("False Negative:")
    explain_this(bench_en, sample, 0)
    print("\n\n")

**English - false negatives**

Based on the SHAP and LIME values, the model did not strongly recognize the phrases that indicate sexism. The words expected to have a high positive influence received low or mixed scores, while several other words received negative importance, pulling the prediction toward the non-sexist class and causing these tweets to be missed.

**Overall**

**False positives** in both languages occur mainly because the model focuses on keywords and overlooks the important context in which the words are used.

**False negatives**, on the other hand, do not show significantly high SHAP and LIME values, which indicates that no small set of words pushes the prediction toward the non-sexist class. This usually happens when the tweets are more complex or the sexism is only reported, which might suggest that the model has not seen enough tweets that report sexism.