# LLM Inference (Ministral-3-8B-Instruct)

This notebook performs binary classification (YES/NO) inference by prompting an LLM.

## Import libraries

In [1]:
import os
import torch
import pandas as pd
import numpy as np

from transformers import MistralCommonBackend, Mistral3ForConditionalGeneration, FineGrainedFP8Config
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support
)

from pyevall.evaluation import PyEvALLEvaluation
from pyevall.metrics.metricfactory import MetricFactory

ImportError: cannot import name 'MistralCommonBackend' from 'transformers' (/home/alumno.upv.es/scheng1/.conda/envs/RFA2526pt/lib/python3.12/site-packages/transformers/__init__.py)
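The `ImportError` above means the installed `transformers` package does not export `MistralCommonBackend`, most likely because that release does not provide the class. A minimal sketch for checking this before running the rest of the notebook follows. `has_symbol` is a hypothetical helper, not part of any library; it only reports whether a name can be resolved, not which release introduced it.

```python
import importlib
import importlib.util


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if `module_name` is installed and exposes `symbol`."""
    if importlib.util.find_spec(module_name) is None:
        return False
    try:
        module = importlib.import_module(module_name)
        # Lazy modules may raise errors other than AttributeError here,
        # so any failure is treated as "not available".
        getattr(module, symbol)
    except Exception:
        return False
    return True


for name in ("MistralCommonBackend", "Mistral3ForConditionalGeneration", "FineGrainedFP8Config"):
    print(f"transformers.{name}: {has_symbol('transformers', name)}")
```

If any of the three names prints `False`, the import cell will fail in that environment.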

## Configuration and parameters

In [None]:
# Placeholder: replace with your own Hugging Face access token
os.environ["HF_TOKEN"] = "your_token"

MODEL_NAME = "mistralai/Ministral-3-8B-Instruct-2512"
MAIN_PATH = ".."
GROUP_ID = "BeingChillingWeWillWin"

TEXT_COLUMN = "tweet"
LABEL_COLUMN = "task1"

DATA_TRAIN_PATH = os.path.join(MAIN_PATH, "preprocessed_data", "train_preprocessed_v2.json")
DATA_VAL_PATH   = os.path.join(MAIN_PATH, "preprocessed_data", "val_preprocessed_v2.json")
DATA_TEST_PATH  = os.path.join(MAIN_PATH, "preprocessed_data", "test_preprocessed_v2.json")

PREDICTIONS_DIR = os.path.join(MAIN_PATH, "results_v2", "Ministral3B", "predictions")
os.makedirs(PREDICTIONS_DIR, exist_ok=True)
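Writing the token into the notebook makes it easy to leak through version control. One possible alternative, sketched under the assumption that `HF_TOKEN` is exported in the shell before the kernel starts, is to read it from the environment and fail early when it is missing. `require_env` is a hypothetical helper introduced here for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a dummy variable so the snippet runs anywhere;
# in the notebook the call would be require_env("HF_TOKEN").
os.environ["DEMO_TOKEN"] = "dummy-value"
print(require_env("DEMO_TOKEN"))
```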

## Data loading and preprocessing

In [3]:
train_df = pd.read_json(DATA_TRAIN_PATH)
val_df   = pd.read_json(DATA_VAL_PATH)
test_df  = pd.read_json(DATA_TEST_PATH)

label_map = {"NO": 0, "YES": 1}
label_map_inverse = {0: "NO", 1: "YES"}

train_df["label"] = train_df[LABEL_COLUMN].map(label_map)
val_df["label"]   = val_df[LABEL_COLUMN].map(label_map)

print(f"Text column used: {TEXT_COLUMN}")
print(f"Train size: {len(train_df)} | Val size: {len(val_df)} | Test size: {len(test_df)}")
print("\nLabel distribution in TRAIN:")
print(train_df[LABEL_COLUMN].value_counts())
print("\nLabel distribution in VAL:")
print(val_df[LABEL_COLUMN].value_counts())

Text column used: tweet
Train size: 5154 | Val size: 910 | Test size: 934

Label distribution in TRAIN:
task1
NO     2862
YES    2292
Name: count, dtype: int64

Label distribution in VAL:
task1
NO     505
YES    405
Name: count, dtype: int64
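
The class counts above give a reference point for the later metrics: a classifier that always answers `NO` would score an accuracy equal to the share of `NO` examples. The sketch below computes that majority-class baseline from the printed counts; `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(counts: dict) -> tuple:
    """Return (majority_label, accuracy) of always predicting the most frequent label."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


# Counts copied from the output above
train_counts = {"NO": 2862, "YES": 2292}
val_counts = {"NO": 505, "YES": 405}

for split, counts in (("TRAIN", train_counts), ("VAL", val_counts)):
    label, acc = majority_baseline(counts)
    print(f"{split}: always '{label}' -> accuracy {acc:.4f}")
```

On both splits the baseline is roughly 55%, so any useful classifier should land clearly above that.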


## Loading the LLM

In [4]:
tokenizer = MistralCommonBackend.from_pretrained(MODEL_NAME)

model = Mistral3ForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=FineGrainedFP8Config(dequantize=True)
)
model.eval()
print("Model loaded successfully.")

Model loaded successfully.


## Prompt definition and inference function

We build a binary classification prompt. The model must answer only `YES` or `NO`.

In [5]:
SYSTEM_PROMPT = (
    "You are a text classification assistant. "
    "Your task is to determine whether the following text contains sexism. "
    "Answer with exactly one word: YES or NO."
)

def build_messages(text: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Text: {text}\n\nDoes this text contain sexism?"},
    ]

def predict_single(text: str) -> str:
    """Return 'YES' or 'NO' for a given text."""
    messages = build_messages(text)
    tokenized = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        return_dict=True
    )
    # Move inputs to the model's device instead of hard-coding "cuda"
    input_ids = tokenized["input_ids"].to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=input_ids,
            max_new_tokens=5,
            do_sample=False,  # greedy decoding
        )[0]

    # Keep only the newly generated tokens (drop the prompt)
    new_tokens = output_ids[input_ids.shape[-1]:]
    decoded = tokenizer.decode(new_tokens, skip_special_tokens=True).strip().upper()

    # Substring match: "YES" is checked first, so it wins if both appear
    if "YES" in decoded:
        return "YES"
    elif "NO" in decoded:
        return "NO"
    else:
        print(f"[WARN] Unexpected response: '{decoded}' → assigning NO")
        return "NO"

def predict_batch(texts: list, verbose: bool = True) -> list:
    """Run inference over a list of texts and return a list of 'YES'/'NO' labels."""
    preds = []
    for i, text in enumerate(texts):
        pred = predict_single(text)
        preds.append(pred)
        if verbose and (i + 1) % 50 == 0:
            print(f"  Processed {i+1}/{len(texts)}...")
    return preds
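
Substring checks such as `"NO" in decoded` also fire on words that merely contain the label (for example `NOT` or `UNKNOWN`). A stricter alternative, shown here as a sketch and not as the parsing used for the reported results, matches `YES`/`NO` as whole words and returns `None` when neither is present so the caller can choose a fallback. `parse_label` is a hypothetical name.

```python
import re
from typing import Optional

# Whole-word match for either label, case-insensitive
_LABEL_RE = re.compile(r"\b(YES|NO)\b", re.IGNORECASE)


def parse_label(raw: str) -> Optional[str]:
    """Return the first whole-word 'YES' or 'NO' found in `raw`, else None."""
    match = _LABEL_RE.search(raw)
    return match.group(1).upper() if match else None


for sample in ["YES", "no.", "Answer: No", "UNKNOWN", "not sure"]:
    print(repr(sample), "->", parse_label(sample))
```

The last two samples return `None`, whereas a plain substring check would classify both as `NO`.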

## Inference on DEV (validation)

In [6]:
print("Running inference on DEV...")
dev_preds_str = predict_batch(val_df[TEXT_COLUMN].tolist())
dev_preds = np.array([label_map[p] for p in dev_preds_str])
y_true_dev = val_df["label"].values

# average='binary' scores only the positive class (label 1 = YES)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_dev, dev_preds, average='binary', zero_division=0
)
acc = accuracy_score(y_true_dev, dev_preds)

print("\nMetrics on DEV:")
print(f"Accuracy:  {acc:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")

Running inference on DEV...
  Processed 50/910...
  Processed 100/910...
  Processed 150/910...
  Processed 200/910...
  Processed 250/910...
  Processed 300/910...
  Processed 350/910...
  Processed 400/910...
  Processed 450/910...
  Processed 500/910...
  Processed 550/910...
  Processed 600/910...
  Processed 650/910...
  Processed 700/910...
  Processed 750/910...
  Processed 800/910...
  Processed 850/910...
  Processed 900/910...

Metrics on DEV:
Accuracy:  0.8264
Precision: 0.7892
Recall:    0.8321
F1-Score:  0.8101
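
All four scores are functions of the confusion matrix. As a worked check, the counts below (TP=337, FP=90, FN=68, TN=415, with `YES` as the positive class) are inferred from the reported metrics and the VAL class sizes (405 `YES`, 505 `NO`). The notebook does not print them, so treat them as an assumption that happens to reproduce the numbers above.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts (assumption), positive class = YES
scores = binary_metrics(tp=337, fp=90, fn=68, tn=415)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these give 0.8264, 0.7892, 0.8321 and 0.8101, matching the cell output.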


## Evaluation on DEV with PyEvALL

In [7]:
dev_preds_for_pyevall = [
    {'test_case': 'EXIST2025', 'id': str(id_exist), 'value': pred}
    for id_exist, pred in zip(val_df['id_EXIST'].values, dev_preds_str)
]

dev_preds_df = pd.DataFrame(dev_preds_for_pyevall)
dev_preds_path = os.path.join(PREDICTIONS_DIR, 'dev_predictions_temp.json')
with open(dev_preds_path, 'w', encoding='utf-8') as f:
    f.write(dev_preds_df.to_json(orient='records'))

dev_gold = [
    {'test_case': 'EXIST2025', 'id': str(id_exist), 'value': label}
    for id_exist, label in zip(val_df['id_EXIST'].values, val_df[LABEL_COLUMN].values)
]

dev_gold_df = pd.DataFrame(dev_gold)
dev_gold_path = os.path.join(PREDICTIONS_DIR, 'dev_gold_temp.json')
with open(dev_gold_path, 'w', encoding='utf-8') as f:
    f.write(dev_gold_df.to_json(orient='records'))

evaluator = PyEvALLEvaluation()
metrics = [
    MetricFactory.Accuracy.value,
    MetricFactory.FMeasure.value,
]
report = evaluator.evaluate(dev_preds_path, dev_gold_path, metrics)
print("\n=== Evaluation on DEV with PyEvALL ===")
report.print_report()

2026-02-25 17:33:34,749 - pyevall.evaluation - INFO -             evaluate() - Evaluating the following metrics ['Accuracy', 'FMeasure']
2026-02-25 17:33:34,801 - pyevall.metrics.metrics - INFO -             evaluate() - Executing accuracy evaluation method
2026-02-25 17:33:34,912 - pyevall.metrics.metrics - INFO -             evaluate() - Executing fmeasure evaluation method

=== Evaluation on DEV with PyEvALL ===
{
  "metrics": {
    "Accuracy": {
      "name": "Accuracy",
      "acronym": "Acc",
      "description": "Coming soon!",
      "status": "OK",
      "results": {
        "test_cases": [{
          "name": "EXIST2025",
          "average": 0.8263736263736263
        }],
        "average_per_test_case": 0.8263736263736263
      }
    },
    "FMeasure": {
      "name": "F-Measure",
      "acronym": "F1",
      "description": "Coming soon!",
      "status": "OK",
      "results": {
        "test_cases": [{
          "name": "EXIST2025",
          "classes": {
            "NO": 0.8400809716599191,
            "YES": 0.8100961538461539
          },
          "average": 0.8250885627530364
        }],
        "average_per_test_case": 0.8250885627530364
      }
    }
  },
  "files": {
    "dev_predictions_temp.json": {
      "name": "dev_predictions_temp.json",
      "status": "OK",
      "gold": false,
      "description": "Use parameter: report=\"embedd

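PyEvALL's `FMeasure` average (0.8251) is higher than the F1 printed by scikit-learn (0.8101) because the two numbers measure different things. `average='binary'` scores only the `YES` class, while the report above averages the per-class F1 of `NO` and `YES`. The sketch below reproduces that unweighted mean from the per-class values in the report.

```python
# Per-class F1 values copied from the PyEvALL report above
f1_per_class = {"NO": 0.8400809716599191, "YES": 0.8100961538461539}

# Unweighted (macro) mean over classes
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

print(f"F1 (YES only): {f1_per_class['YES']:.4f}")
print(f"Macro F1:      {macro_f1:.4f}")
```

When comparing this model against other systems, it helps to state which of the two F1 variants is being quoted.
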
## Inference on TEST and generation of final predictions

In [8]:
print("Running inference on TEST...")
test_preds_str = predict_batch(test_df[TEXT_COLUMN].tolist())
test_preds = np.array([label_map[p] for p in test_preds_str])

print("\nPredictions on TEST:")
print(f"Total samples:     {len(test_preds)}")
print(f"YES predictions:   {np.sum(test_preds == 1)} ({100*np.mean(test_preds == 1):.2f}%)")
print(f"NO predictions:    {np.sum(test_preds == 0)} ({100*np.mean(test_preds == 0):.2f}%)")

Running inference on TEST...
  Processed 50/934...
  Processed 100/934...
  Processed 150/934...
  Processed 200/934...
  Processed 250/934...
  Processed 300/934...
  Processed 350/934...
  Processed 400/934...
  Processed 450/934...
  Processed 500/934...
  Processed 550/934...
  Processed 600/934...
  Processed 650/934...
  Processed 700/934...
  Processed 750/934...
  Processed 800/934...
  Processed 850/934...
  Processed 900/934...

Predictions on TEST:
Total samples:     934
YES predictions:   437 (46.79%)
NO predictions:    497 (53.21%)


## Saving TEST predictions in PyEvALL format

In [9]:
MODEL_ID = "Ministral3B"

test_preds_for_submission = [
    {'test_case': 'EXIST2025', 'id': str(id_exist), 'value': pred}
    for id_exist, pred in zip(test_df['id_EXIST'].values, test_preds_str)
]

test_preds_df = pd.DataFrame(test_preds_for_submission)

output_filename = f"{GROUP_ID}_{MODEL_ID}.json"
output_path = os.path.join(PREDICTIONS_DIR, output_filename)

with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(test_preds_df.to_json(orient='records'))

print(f"\nPredictions saved to: {output_path}")


Predictions saved to: ../results/Ministral3B/predictions/BeingChillingWeWillWin_Ministral3B.json
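
The prediction and gold files in this notebook share one record layout: a list of objects with `test_case`, `id` (as a string) and `value`. A small helper could build and write both without the intermediate DataFrame. This is a sketch with hypothetical names (`to_records`, `write_records`) and made-up ids, and it relies only on the standard `json` module.

```python
import json
import os
import tempfile


def to_records(ids, values, test_case="EXIST2025"):
    """Build the list of {'test_case', 'id', 'value'} dicts used in this notebook."""
    return [
        {"test_case": test_case, "id": str(i), "value": v}
        for i, v in zip(ids, values)
    ]


def write_records(records, path):
    """Write records as a JSON array, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)


# Illustrative ids and labels, not taken from the dataset
records = to_records([101, 102], ["YES", "NO"])
path = os.path.join(tempfile.mkdtemp(), "demo_predictions.json")
write_records(records, path)

# Read the file back to confirm it round-trips
with open(path, encoding="utf-8") as f:
    print(json.load(f))
```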
