# Question Answering Fine-tuning

The following assignment consists of training a HuggingFace (HF) model to perform the _question answering_ task. The dataset used to train the model is predefined. However, the model, the tokenizer, and the trainer can be fully customized. This means you will need to do some research and trial and error in order to learn and gain proficiency with HF.

Recommendations:
- During this process you will have many questions and run into many errors. Try to resolve them on your own first by understanding the cause of the error, then turn to online resources. Finally, there is always the forum, which you can use collaboratively.
- Do not leave the assignment for the last day. Models take time to train, and problems are rarely solved on the first iteration.

Finally, the following is required:
- Rigorous tidiness in the presentation of the notebook.
- The notebook must be submitted with all cells executed.
- Comments (optional) are best placed in the code using '#'.

Good luck!

## Dataset

Next, you will download a dataset called _squad_. The _question_ column contains the question, the _answers_ column contains the answers (their text and their starting character position in the context), and the _context_ column contains the context from which the question must be answered.

In [None]:
# !pip install datasets
# !pip install transformers[torch]
# !pip install accelerate -U
# !pip install transformers



In [None]:
# `time` is part of the Python standard library, so it does not need to be installed with pip.

In [None]:
from datasets import load_dataset
ds = load_dataset("squad")
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

The first thing you need to do is build a new DatasetDict, called **ds_tarea**, that filters the previous DatasetDict to:
- keep only the records whose _context_ column contains strictly fewer than 300 characters.

In [None]:
def length_x(x):
    return len(x['context'])<300

ds_tarea = ds.filter(length_x)

In [None]:
assert len(ds_tarea['train']) == 3466
assert len(ds_tarea['validation']) == 345

## EDA

If you need to do any data exploration, use this section.

In [None]:
# Free-use cells
print(ds_tarea['train']['question'][0])
print(ds_tarea['validation'][0])

Beyonce released the song "Formation" on which online music service?
{'id': '56be53b8acb8001400a50314', 'title': 'Super_Bowl_50', 'context': 'In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".', 'question': 'Who was the NFL Commissioner in early 2012?', 'answers': {'text': ['Roger Goodell', 'Roger Goodell', 'Goodell'], 'answer_start': [32, 32, 38]}}
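
The validation record printed above shows how the _answers_ column is structured: a list of answer texts plus a parallel list of starting character positions in the context. As a minimal sketch (the helper name `answer_spans` is hypothetical, not part of the assignment), each answer can be recovered by slicing the context:

```python
# Record copied from the validation example printed above (only the fields used here).
record = {
    "context": 'In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".',
    "answers": {
        "text": ["Roger Goodell", "Roger Goodell", "Goodell"],
        "answer_start": [32, 32, 38],
    },
}

def answer_spans(example):
    """Return (start_char, end_char, text) for every annotated answer (hypothetical helper)."""
    spans = []
    for text, start in zip(example["answers"]["text"], example["answers"]["answer_start"]):
        end = start + len(text)  # end position is exclusive
        spans.append((start, end, text))
    return spans

spans = answer_spans(record)
for start, end, text in spans:
    # Each annotated answer should match the corresponding slice of the context.
    assert record["context"][start:end] == text
print(spans)
```

These character positions are what later has to be converted into token positions for the model.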


In [None]:
df_train = ds_tarea['train'].to_pandas()
df_val = ds_tarea['validation'].to_pandas()

In [None]:
print(df_train.head())
print(df_val.head())
print(df_train.info())
print(df_val.info())
print(df_train.describe())
print(df_val.describe())
print(df_train.isnull().sum())
print(df_val.isnull().sum())
print(df_train['question'].value_counts())
print(df_val['question'].value_counts())

                         id    title  \
0  56bea27b3aeaaa14008c9199  Beyoncé   
1  56bea27b3aeaaa14008c919a  Beyoncé   
2  56bfa8bba10cfb140055120b  Beyoncé   
3  56bfa8bba10cfb140055120c  Beyoncé   
4  56bfa8bba10cfb140055120d  Beyoncé   

                                             context  \
0  On February 6, 2016, one day before her perfor...   
1  On February 6, 2016, one day before her perfor...   
2  On February 6, 2016, one day before her perfor...   
3  On February 6, 2016, one day before her perfor...   
4  On February 6, 2016, one day before her perfor...   

                                            question  \
0  Beyonce released the song "Formation" on which...   
1  Beyonce's new single released before the super...   
2  What day did Beyonce release her single, Forma...   
3                       How was the single released?   
4        What was the name of the streaming service?   

                                             answers  
0         {'text': ['Tidal'], 

## Model and Tokenizer

Store the model and the tokenizer in the variables _model_ and _tokenizer_.
Even if they are not used until later, declare them in this section.

In [None]:
tokenizer = None
model = None

### BEGIN SOLUTION
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained("deepset/minilm-uncased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/minilm-uncased-squad2")

### END SOLUTION

Some weights of the model checkpoint at deepset/minilm-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Feature Engineering

In this part, you have to prepare the input dataset for the model. Mainly, you will need to tokenize the sentences, which creates columns such as _input_ids_, _attention_mask_, and so on. The exact columns depend on the chosen model, so you will need to do a bit of research.

_Note:_ For ForQuestionAnswering model architectures, it is common to create the _start_positions_ and _end_positions_ columns.

At the end of this section, whether or not you have modified the DatasetDict, store it in __ds_tarea_featured__.
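
To illustrate what _start_positions_ and _end_positions_ hold, here is a minimal sketch that converts a character span into token indices. It assumes a hypothetical whitespace tokenizer (`whitespace_offsets`) as a stand-in for the `offset_mapping` a real HF tokenizer returns. A real tokenizer splits into subwords and also emits offsets for the question and for special tokens, which this sketch ignores.

```python
import re

def whitespace_offsets(text):
    """Hypothetical stand-in for a tokenizer: one (start, end) character offset per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices."""
    start_idx = end_idx = None
    for idx, (start, end) in enumerate(offsets):
        if start <= start_char < end:
            start_idx = idx  # token containing the first answer character
        if start < end_char <= end:
            end_idx = idx    # token containing the last answer character
            break
    return start_idx, end_idx

context = "In early 2012, NFL Commissioner Roger Goodell stated that"
offsets = whitespace_offsets(context)
# "Roger Goodell" starts at character 32 and ends (exclusive) at character 45.
token_span = char_span_to_token_span(offsets, 32, 45)
print(token_span)
```

With this toy tokenizer the answer covers the sixth and seventh words, so the result is the pair of indices 5 and 6.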

In [None]:
print(ds_tarea['train'].column_names)

['id', 'title', 'context', 'question', 'answers']


In [None]:

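def prepare_features(examples):
    # Tokenize question/context pairs; offset_mapping links each token to its character span.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation=True,
        padding="max_length",
        max_length=384,
        return_offsets_mapping=True,
    )

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized_examples["offset_mapping"]):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        # sequence_ids: 0 for question tokens, 1 for context tokens, None for special/padding tokens.
        sequence_ids = tokenized_examples.sequence_ids(i)
        # Use the first annotated answer; its character positions are relative to the context.
        start_char = examples["answers"][i]["answer_start"][0]
        end_char = start_char + len(examples["answers"][i]["text"][0])
        # Fall back to the [CLS] token if the answer span is not found.
        start_idx = cls_index
        end_idx = cls_index
        for idx, (start, end) in enumerate(offsets):
            # Question-token offsets refer to the question string, so skip everything but the context.
            if sequence_ids[idx] != 1:
                continue
            if start <= start_char < end:
                start_idx = idx
            if start < end_char <= end:
                end_idx = idx
                break

        start_positions.append(start_idx)
        end_positions.append(end_idx)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions
    return tokenized_examples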

ds_tarea_featured = ds_tarea.map(prepare_features, batched=True, remove_columns=ds_tarea["train"].column_names)

In [None]:
# Control cell

assert len(ds_tarea_featured['train']) == 3466
assert len(ds_tarea_featured['validation']) == 345

## Fine-tuning

Next, train a HuggingFace model of your choice. You must use a HuggingFace Trainer with at least the following arguments (more arguments may be added to any of the variables):

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir='./finetuned2',
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tarea_featured["train"],
    eval_dataset=ds_tarea_featured["validation"],
    tokenizer=tokenizer,
)

### BEGIN SOLUTION
args = TrainingArguments(
    output_dir='./finetuned2',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    greater_is_better=False,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tarea_featured["train"],
    eval_dataset=ds_tarea_featured["validation"],
    tokenizer=tokenizer,
)

### END SOLUTION



The model is trained next.

In [None]:
# This cell must be executed in the submission

from time import time

start = time()

trainer.train()

end = time()
print(f">>>>>>>>>>>>> elapsed time: {(end-start)/60:.0f}m")

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 1.702209 |
| 2 | No log | 1.761398 |


>>>>>>>>>>>>> elapsed time: 3m
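
The Training Loss column reads "No log" for both epochs. One plausible explanation is that the run finishes before the first logging step. The arithmetic below is a sketch under three assumptions that the notebook does not state: `logging_steps` is left at its `TrainingArguments` default of 500, training runs on a single device, and there is no gradient accumulation.

```python
import math

train_examples = 3466   # size of ds_tarea["train"], from the control cell
batch_size = 16         # per_device_train_batch_size in the solution
epochs = 2              # num_train_epochs in the solution
logging_steps = 500     # assumption: TrainingArguments default, not set in the notebook

# Assumption: one device and no gradient accumulation, so one batch equals one optimizer step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"training ends before first log: {total_steps < logging_steps}")
```

If those assumptions hold, passing a smaller `logging_steps` value to `TrainingArguments` should make the training loss appear in the table.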


In [None]:
# trainer.save_model("xxx")  # Use this code if you want to save the model

In [None]:
# This cell must be executed in the submission
# A validation_loss below 2.00 is expected
# A lower validation_loss does not earn a higher grade; getting below the 2.00 threshold is enough

results = trainer.evaluate()
final_val_loss = results.get("eval_loss")

print(f"Final validation Loss: {final_val_loss:.2f}")

Final validation Loss: 1.70
