## T5 fine-tuning 
This section implements fine-tuning of the desired T5 model. To keep the code easy to read, it has been adapted to fine-tune T5-small for question answering on SQuAD. With small changes, however, the same code can be used for any of the tasks performed in the paper.

In [None]:
!nvidia-smi

In [None]:
!pip install --quiet transformers
!pip install --quiet nlp
!pip install --quiet tokenizers
!pip install --quiet datasets

In [None]:
import torch
import transformers
import nlp
from datasets import load_dataset
from transformers import T5TokenizerFast as T5Tokenizer

### Pre-processing
First, the dataset is loaded. To fine-tune on a task from the GLUE or SuperGLUE benchmarks, the task name must also be specified so that the corresponding dataset is loaded correctly.

In [None]:
train_dataset = load_dataset('squad', split="train")
valid_dataset = load_dataset('squad', split="validation")

For this example, the prefix is hard-coded in the *add_eos_to_examples* function. However, it is also possible to specify the prefix here and add it later.

In [None]:
prefix = ""
max_input_length = 512
max_target_length = 16

The example inputs are pre-processed by adding the prefix, formatting the target text and appending the EOS (**e**nd **o**f **s**equence) token.

In [None]:
def add_eos_to_examples(example):
    example['input_text'] = 'question: %s  context: %s </s>' % (example['question'], example['context'])
    example['target_text'] = '%s </s>' % example['answers']['text'][0]
    return example
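
To make the formatting step concrete, here is a minimal sketch that applies the same string templates to a single hand-written record. The record and the helper name `format_example` are hypothetical and only mirror the SQuAD fields used above.

```python
# Hypothetical helper: same templates as add_eos_to_examples, but it returns
# the two strings instead of modifying the example in place.
def format_example(example):
    """Build the text-to-text input and target strings for one QA record."""
    input_text = 'question: %s  context: %s </s>' % (example['question'], example['context'])
    target_text = '%s </s>' % example['answers']['text'][0]
    return input_text, target_text

# Hand-written toy record with the same fields as a SQuAD example
toy = {
    'question': 'Who wrote Hamlet?',
    'context': 'Hamlet is a tragedy written by William Shakespeare.',
    'answers': {'text': ['William Shakespeare'], 'answer_start': [31]},
}

input_text, target_text = format_example(toy)
print(input_text)
print(target_text)
```

Only the first reference answer is used as the training target, which matches the function above.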

Load the tokenizer.

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')

Then the examples are encoded through the tokenizer. This function builds the encodings.

In [None]:
def convert_to_features(example_batch):
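    # Pad and explicitly truncate to fixed lengths so the examples can later be stacked into batches
    input_encodings = tokenizer(example_batch['input_text'], padding='max_length', truncation=True, max_length=max_input_length)
    target_encodings = tokenizer(example_batch['target_text'], padding='max_length', truncation=True, max_length=max_target_length)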

    encodings = {
        'input_ids': input_encodings['input_ids'], 
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids'],
        'decoder_attention_mask': target_encodings['attention_mask']
    }

    return encodings

Finally, the datasets are mapped through the previous functions.

In [None]:
train_dataset = train_dataset.map(add_eos_to_examples)
train_dataset = train_dataset.map(convert_to_features, batched=True)
valid_dataset = valid_dataset.map(add_eos_to_examples, load_from_cache_file=False)
valid_dataset = valid_dataset.map(convert_to_features, batched=True, load_from_cache_file=False)

Restrict the datasets to the columns needed for training and return them as PyTorch tensors.

In [None]:
columns = ['input_ids', 'labels', 'attention_mask', 'decoder_attention_mask']
train_dataset.set_format(type='torch', columns=columns)
valid_dataset.set_format(type='torch', columns=columns)

Save the datasets so that they can be loaded directly during training.

In [None]:
torch.save(train_dataset, 'train_data.pt')
torch.save(valid_dataset, 'valid_data.pt')

### Training

In [None]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np
import torch
import torch.optim
import tensorflow as tf
import datetime

from transformers import T5ForConditionalGeneration, T5TokenizerFast as T5Tokenizer, EvalPrediction
from transformers import (
    HfArgumentParser,
    DataCollator,
    Trainer,
    TrainingArguments,
    set_seed,
)
from transformers import integrations

In [None]:
logger = logging.getLogger(__name__)

Build a DataCollator. It takes a list of samples from a Dataset and collates them into a batch. It returns a dictionary of tensors with the keys that the model's forward method expects. This is necessary because the Trainer passes this dictionary directly to the model as arguments.

In [None]:
@dataclass
class T2TDataCollator:

    def __call__(self, batch: List) -> Dict[str, torch.Tensor]:

        input_ids = torch.stack([example['input_ids'] for example in batch])
        labels = torch.stack([example['labels'] for example in batch])
        # Replace padding positions (T5 pad token id 0) with -100 so the loss ignores them
        labels[labels == 0] = -100
        attention_mask = torch.stack([example['attention_mask'] for example in batch])
        decoder_attention_mask = torch.stack([example['decoder_attention_mask'] for example in batch])
        
        return {
            'input_ids': input_ids, 
            'attention_mask': attention_mask,
            'labels': labels, 
            'decoder_attention_mask': decoder_attention_mask
        }
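
The label masking performed by the collator can be illustrated without any framework. The sketch below uses plain Python lists instead of tensors; the helper name `collate_labels` and the token ids in the toy batch are made up for illustration. It assumes, as the collator above does, that the pad token id is 0.

```python
PAD_TOKEN_ID = 0      # assumption: pad id of the T5 vocabulary, as in the collator above
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def collate_labels(batch):
    """Gather the label lists of a batch and mask the padding positions."""
    return [
        [IGNORE_INDEX if token_id == PAD_TOKEN_ID else token_id
         for token_id in example['labels']]
        for example in batch
    ]

# Toy batch: two already-padded targets of length 5 (ids are illustrative)
toy_batch = [
    {'labels': [363, 19, 1, 0, 0]},
    {'labels': [71, 1, 0, 0, 0]},
]

masked = collate_labels(toy_batch)
print(masked)  # [[363, 19, 1, -100, -100], [71, 1, -100, -100, -100]]
```

Masking matters because every target is padded to the same length; without it, the model would also be trained to predict the padding tokens.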

Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.

In [None]:
@dataclass
class ModelArguments:

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )

Arguments pertaining to the data fed to the model for training and evaluation.

In [None]:
@dataclass
class DataTrainingArguments:

    train_file_path: Optional[str] = field(
        default='train_data.pt',
        metadata={"help": "Path for cached train dataset"},
    )
    valid_file_path: Optional[str] = field(
        default='valid_data.pt',
        metadata={"help": "Path for cached valid dataset"},
    )
    max_len: Optional[int] = field(
        default=max_input_length,
        metadata={"help": "Max input length for the source text"},
    )
    target_max_len: Optional[int] = field(
        default=max_target_length,
        metadata={"help": "Max input length for the target text"},
    )

Main function, which contains the code that actually fine-tunes the pre-trained model.

In [None]:
def main():

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))

    model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath('args.json'))

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
        )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
    set_seed(training_args.seed)

    # Load pretrained model
    model = T5ForConditionalGeneration.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )

    # Get datasets
    print('Loading data...')
    train_dataset  = torch.load(data_args.train_file_path)
    valid_dataset = torch.load(data_args.valid_file_path)
    print('Loading done!')

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=T2TDataCollator(),
    )

    # Training
    if training_args.do_train:
        trainer.train(
            resume_from_checkpoint=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
        )
        trainer.save_model()

    # Evaluation
    results = {}
    if training_args.do_eval and training_args.local_rank in [-1, 0]:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()

        logger.info("***** Eval results *****")
        for key in sorted(eval_output.keys()):
            logger.info("  %s = %s", key, str(eval_output[key]))
        results.update(eval_output)
    
    return results

Load TensorBoard to monitor training. This step is not required for fine-tuning, but it is useful.

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir ./models/gpu/runs

In [None]:
import json

These are the TrainingArguments used by the model at training-time.

In [None]:
args_dict = {
  "model_name_or_path": 't5-small',
  "max_len": max_input_length,
  "target_max_len": max_target_length,
  "output_dir": './models/gpu',
  "overwrite_output_dir": True,
  "per_device_train_batch_size": 32,
  "per_device_eval_batch_size": 32,
  "learning_rate": 1e-4,
  "num_train_epochs": 5,
  "optim": "adafactor",
  "do_train": True,
  "logging_steps": 500,
  "logging_first_step": True,
  "save_steps": 500,
  "do_eval": True,
  "evaluation_strategy": "steps",
  "eval_steps": 500,
}
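
To get a feel for what these settings imply, the following sketch estimates how many optimizer steps and evaluations a run performs. The helper `training_schedule` is hypothetical, and the numbers assume 87,599 training examples (the size of the SQuAD v1.1 train split), a single device and no gradient accumulation.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, eval_steps):
    """Return (steps per epoch, total steps, number of step-based evaluations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * num_epochs
    num_evals = total_steps // eval_steps                   # one evaluation every eval_steps
    return steps_per_epoch, total_steps, num_evals

# Assumption: 87,599 examples, batch size 32, 5 epochs, evaluation every 500 steps
schedule = training_schedule(num_examples=87599, batch_size=32, num_epochs=5, eval_steps=500)
print(schedule)  # (2738, 13690, 27)
```

Since `logging_steps` and `save_steps` are also set to 500, logging and checkpointing happen at roughly the same rate under these assumptions.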

In [None]:
with open('args.json', 'w') as f:
  json.dump(args_dict, f)
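
The single flat JSON file works because `main()` asks the parser to distribute its keys over several argument dataclasses. The sketch below illustrates that idea in plain Python; it is a simplified stand-in, not the actual `HfArgumentParser` implementation, and `ToyModelArgs`, `ToyTrainArgs` and `split_args` are hypothetical names.

```python
import dataclasses
from dataclasses import dataclass

@dataclass
class ToyModelArgs:           # hypothetical stand-in for ModelArguments
    model_name_or_path: str

@dataclass
class ToyTrainArgs:           # hypothetical stand-in for TrainingArguments
    output_dir: str
    learning_rate: float = 5e-5

def split_args(flat, dataclass_types):
    """Route each key of a flat dict to the dataclass that declares a field with that name."""
    instances = []
    for dtype in dataclass_types:
        names = {f.name for f in dataclasses.fields(dtype)}
        instances.append(dtype(**{k: v for k, v in flat.items() if k in names}))
    return tuple(instances)

flat = {'model_name_or_path': 't5-small', 'output_dir': './models/gpu', 'learning_rate': 1e-4}
toy_model_args, toy_train_args = split_args(flat, (ToyModelArgs, ToyTrainArgs))
print(toy_model_args)
print(toy_train_args)
```

Keys that are left out of the file simply fall back to the defaults declared in the dataclass fields.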

In [None]:
main()

### Evaluation
Once the model has been fine-tuned, it can be used to generate predictions. As stated in the T5 paper, the SQuAD validation set also serves as the test set, so the evaluation process is reported here. The GLUE and SuperGLUE tasks, however, have to be evaluated on their respective benchmark servers.

In [None]:
import nlp
import pickle
from transformers import T5ForConditionalGeneration, T5Tokenizer

from tqdm.auto import tqdm
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation
from sklearn import metrics

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

Load the fine-tuned model.

In [None]:
model = T5ForConditionalGeneration.from_pretrained('/content/models/gpu/')

This is the dataset for the evaluation.

In [None]:
test_dataset = torch.load('valid_data.pt')
dataloader = DataLoader(test_dataset, batch_size=32)

Generate the predictions.

In [None]:
answers = []
for batch in tqdm(dataloader):
  outs = model.generate(input_ids=batch['input_ids'], 
                        attention_mask=batch['attention_mask'],
                        max_length=16)
  outs = [tokenizer.decode(ids) for ids in outs]
  answers.extend(outs)

Retrieve the predictions (with some clean-up):

In [None]:
predictions = []
for preds in answers:
  # Strip special tokens and surrounding whitespace from the decoded text
  pred = preds.replace("<pad>", "").replace("</s>", "").strip()
  predictions.append(pred)

And these are the references, i.e. the ground truth.

In [None]:
references = []
for refs in valid_dataset["answers"]:
  references.append(refs["text"])

The last step in evaluating the fine-tuned model is to compare the *predictions* with the *references* using the metric of the task performed.
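
For SQuAD, the usual metrics are exact match (EM) and token-level F1. The sketch below is a hand-written approximation of that scoring, following the commonly used answer normalization (lowercasing, removing punctuation and English articles). It is not the official evaluation script, and the function names and toy data are illustrative. It expects one prediction string and a list of acceptable reference strings per question, as built above.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts the tokens shared by both answers
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(predictions, references):
    """Average the best EM and F1 over each question's references, as percentages."""
    em = sum(max(exact_match(p, r) for r in refs) for p, refs in zip(predictions, references))
    f1 = sum(max(f1_score(p, r) for r in refs) for p, refs in zip(predictions, references))
    n = len(predictions)
    return {'exact_match': 100.0 * em / n, 'f1': 100.0 * f1 / n}

# Toy data (not model output): one exact hit and one partial overlap
toy_predictions = ['William Shakespeare', 'in Paris']
toy_references = [['William Shakespeare', 'Shakespeare'], ['Paris', 'Paris, France']]
scores = evaluate(toy_predictions, toy_references)
print(scores)
```

Taking the maximum over the references gives credit to a prediction that matches any one of the acceptable answers.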