# Tarea 2: Question Answering Fine-tuning

In [1]:
pip install -U datasets huggingface_hub fsspec


Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [2]:
pip install evaluate



In [3]:
# Librerías

import logging
logging.getLogger("transformers").setLevel(logging.ERROR)

import torch
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("Number of GPUs available:", torch.cuda.device_count())

from time import time
from datasets import *
from transformers import *
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.colheader_justify', 'center')

Is CUDA available: True
CUDA version: 12.4
Number of GPUs available: 1


GroupViT models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version.Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.
TAPAS models are not usable since `tensorflow_probability` can't be loaded. It seems you have `tensorflow_probability` installed with the wrong tensorflow version. Please try to reinstall it following the instructions here: https://github.com/tensorflow/probability.


## Dataset

El dataset de SQuAD (Stanford Question Answering Dataset) es un conjunto de datos utilizado principalmente para entrenar y evaluar modelos de comprensión lectora. Consiste en ternas de preguntas, respuestas y contexto.

Aquí la ficha del dataset para que podáis explorarla: https://huggingface.co/datasets/rajpurkar/squad

In [4]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

dataset = load_dataset("squad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Con el único motivo de no demorar los tiempos de entrenamiento. Filtraremos el dataset y nos quedaremos solo con los registros que tenga longitud del campo _context_ inferior a 300.

El resto de la práctica se pide trabajarla sobre la variable `ds_tarea`.

In [5]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

def filtra_por_longitud(ejemplo):
    return len(ejemplo["context"]) < 300

ds_tarea = dataset.filter(filtra_por_longitud)

assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

ds_tarea

Filter:   0%|          | 0/87599 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 3466
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 345
    })
})

In [6]:
# # You can adjust the number of examples as needed
# train_subset_size = 100
# validation_subset_size = 20

# ds_tarea = DatasetDict({
#     'train': ds_tarea['train'].select(range(min(train_subset_size, len(ds_tarea['train'])))),
#     'validation': ds_tarea['validation'].select(range(min(validation_subset_size, len(ds_tarea['validation']))))
# })

# print("Shortened dataset:")
# print(ds_tarea)

## Modeling

En este apartado es donde tendréis que realizar todo el trabajo de la práctica. El formato, el análisis, el modelo escogido y cualquier proceso intermedio que consideréis es totalmente libre. Sin embargo, hay algunas pautas que tendréis que cumplir:

- La variable `model_checkpoint` debe almacenar el nombre del modelo y el tokenizador de 🤗 que vais a utilizar.
- La variable `model` y la variable `tokenizer` almacenarán, respectivamente, el modelo y el tokenizador de 🤗 que vais a utilizar.
- La variable `trainer` almacenará el _Trainer_ de 🤗 que, en la siguiente sección utilizaréis para entrenar el modelo.

In [7]:
from transformers import TrainingArguments, Trainer, AutoTokenizer, AutoModelForQuestionAnswering
from datasets import DatasetDict # Import DatasetDict

model_checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

def preprocess_function(examples):
    questions = examples["question"]
    contexts = examples["context"]
    ids = examples["id"] # Get the original IDs

    inputs = tokenizer(
        questions,
        contexts,
        max_length=160,
        truncation="only_second", # Truncate only the context
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]

    start_positions = []
    end_positions = []
    sequence_ids_list = [] # To store sequence_ids for each feature
    example_ids = [] # To store example_ids for each feature
    feature_answers = [] # To store answers for each feature


    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        context = contexts[sample_idx]

        # Get the list of answer texts and their start positions for the current sample
        answer_texts = answer["text"]
        answer_starts = answer["answer_start"]

        sequence_ids = inputs.sequence_ids(i)
        sequence_ids_list.append(sequence_ids) # Store sequence_ids

        # Associate the original example ID and answers with the current feature
        example_ids.append(ids[sample_idx])
        feature_answers.append(answer)

        # If no answers are given, set the cls_index as answer.
        if len(answer_texts) == 0:
            start_positions.append(tokenizer.cls_token_id)
            end_positions.append(tokenizer.cls_token_id)
        else:
            # Find start and end character index of the first answer in the context.
            # Assuming there is at least one answer and we are using the first one.
            start_char = answer_starts[0]
            end_char = start_char + len(answer_texts[0])

            # Start token index of the context in the two texts.
            idx = 0
            while sequence_ids[idx] != 1:
                idx += 1
            context_start = idx

            # End token index of the context.
            idx = len(sequence_ids) - 1
            while sequence_ids[idx] != 1:
                idx -= 1
            context_end = idx

            # If the answer is not fully contained in the context, label it with the cls index.
            if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
                start_positions.append(tokenizer.cls_token_id)
                end_positions.append(tokenizer.cls_token_id)
            else:
                # Otherwise it's the start and end token indices.
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    inputs["sequence_ids"] = sequence_ids_list # Add sequence_ids to inputs
    inputs["example_id"] = example_ids # Add example_ids to inputs
    inputs["offset_mapping"] = offset_mapping # Add offset_mapping to inputs
    inputs["answers"] = feature_answers # Add answers to inputs
    return inputs

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/e2da8e2f811d1448a5b465c236feacd80ffbac7b/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.53.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/e2da8e2f811d1448a5b465c236feacd80ffbac7b/vocab.json
loading file merges.txt from cache at /root/.ca

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    # Change metric_for_best_model to 'eval_f1' which is provided by squad_v2 metric
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    warmup_steps=500,
    report_to=["tensorboard"],
    seed=42,
)

PyTorch: setting up devices


In [9]:
import evaluate
from collections import defaultdict
from tqdm.auto import tqdm
import numpy as np

metric = evaluate.load("squad_v2")

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions

    # Build a map from features to examples.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    predictions = []
    null_scores = {}

    print(f"Post-processing {len(examples)} example predictions of team")

    for example_index, example in enumerate(tqdm(examples)):
        feature_indices = features_per_example[example_index]

        min_null_score = None
        valid_answers = []
        context = example["context"]

        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            score = start_logits[0] + end_logits[0]
            if min_null_score is None or min_null_score < score:
                min_null_score = score

            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    if features[feature_index]["sequence_ids"] is not None and (
                        features[feature_index]["sequence_ids"][start_index] != 1 or
                        features[feature_index]["sequence_ids"][end_index] != 1):
                        continue
                    if (features[feature_index]["offset_mapping"] is not None and
                        start_index < len(features[feature_index]["offset_mapping"]) and
                        end_index < len(features[feature_index]["offset_mapping"])):
                        offsets = features[feature_index]["offset_mapping"][start_index]
                        start_char = offsets[0]
                        offsets = features[feature_index]["offset_mapping"][end_index]
                        end_char = offsets[1]
                        valid_answers.append({
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        })

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            best_answer = {"text": "", "score": 0.0}

        null_scores[example["id"]] = min_null_score

        if min_null_score is None or best_answer["score"] > min_null_score:
            predictions.append({"id": example["id"], "prediction_text": best_answer["text"]})
        else:
            predictions.append({"id": example["id"], "prediction_text": ""})

    return predictions, null_scores


def compute_metrics(eval_pred):
    raw_predictions = eval_pred.predictions
    eval_examples = ds_tarea['validation']
    eval_features = tokenized_ds['validation']

    predictions, null_scores = postprocess_qa_predictions(eval_examples, eval_features, raw_predictions)

    references = [{"id": ex["id"], "answers": ex["answers"]} for ex in eval_examples]

    # SOLO pasamos predictions, references, y null_scores
    return metric.compute(predictions=predictions, references=references, null_scores=null_scores)

In [10]:
tokenized_ds = ds_tarea.map(preprocess_function, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Map:   0%|          | 0/3466 [00:00<?, ? examples/s]

Map:   0%|          | 0/345 [00:00<?, ? examples/s]

  trainer = Trainer(


## Training

In [11]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

start = time()

trainer.train()

end = time()
print(f">>>>>>>>>>>>> elapsed time: {(end-start)/60:.0f}m")

The following columns in the Training set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: sequence_ids, example_id, answers, question, id, context, title, offset_mapping. If sequence_ids, example_id, answers, question, id, context, title, offset_mapping are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3,466
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1,302
  Number of trainable parameters = 124,056,578


Epoch,Training Loss,Validation Loss


The following columns in the Evaluation set don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: sequence_ids, example_id, answers, question, id, context, title, offset_mapping. If sequence_ids, example_id, answers, question, id, context, title, offset_mapping are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 345
  Batch size = 8


Post-processing 345 example predictions of team


  0%|          | 0/345 [00:00<?, ?it/s]

KeyError: 'no_answer_probability'

## Evaluation

In [None]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

print(f"**** EVALUACIÓN ****")
print(f"********\nTokenizer config:\n{tokenizer}")
print(f"\n\n********\nModel config:\n{model.config}")
print(f"\n\n********\nTrainer arguments:\n{trainer.args}")

In [None]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer, device=device)

In [None]:
# No modificar esta celda
# Esta celda, celda tiene que estar ejecutada en la entrega

assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

def calculate_sentence_similarity(sentence1, sentence2):
    sentence1 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence1).lower()
    sentence2 = re.sub(r'[^a-zA-Z0-9\s]', '', sentence2).lower()
    words1 = set(sentence1.lower().split())
    words2 = set(sentence2.lower().split())
    matches = len(words1.intersection(words2))
    total_words = len(words1.union(words2))
    if total_words == 0:
        return 0.0
    return (matches / total_words) * 100

samples = [324,342,249,176,70,168,120,58,90,192,278,289,197,146,323,248,260,273,112,211]
evaluation_list = []

for ii in samples:
    context = ds_tarea['validation'][ii]['context']
    question = ds_tarea['validation'][ii]['question']
    answer = ds_tarea['validation'][ii]['answers']
    answers = [f"{tt}" for ii, tt in enumerate(answer['text'])]
    prediction = question_answerer(context=context, question=question)['answer']
    match = max([calculate_sentence_similarity(w, prediction) for w in answers])
    evaluation_list.append((ii,context,question,answers,prediction,match))

print(f"*** evaluation_df ***")
evaluation_df = pd.DataFrame(evaluation_list, columns=['sample', 'context', 'question', 'real_answers', 'predicted_answer', 'match'])
evaluation_df[['sample','real_answers','predicted_answer', 'match']]

### Criterio de evaluación

La **nota final de la tarea2** estará relacionada con el resultado de las predicciones de vuestro modelo.

El criterio de evaluación será el siguiente:

- La tarea2 se aprobará si el notebook se entrega sin fallos y con un modelo entrenado (independientemente de sus predicciones).
- Se ponderará en función de la columna _match_, que otorga 100% de acierto si todas las palabras coinciden y bajará gradualmente el porcentaje de acierto en función del número de palabras que no coincidan.
    
Nota: La nota que se calcula a continuación es orientativa y podría verse reducida en función del código de la entrega.

In [None]:
print(f"Tu nota de la tarea2 es: {max(np.ceil(evaluation_df['match'].sum() / len(evaluation_df) / 10), 5.0)}")