<a href="https://colab.research.google.com/github/EdmilsonSantana/tcc-2022-2/blob/main/notebooks/PTT5_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Package installation

In [1]:
%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install tokenizers
pip install evaluate
pip install rouge_score
pip install sentencepiece
pip install huggingface_hub


In [2]:
import nltk
import evaluate
import json
import numpy as np
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import EvalPrediction
import torch
import gc

In [3]:
nltk.download("punkt", quiet=True)

True

## Dataset preparation

In [4]:
DATA_DIR = '/content/drive/MyDrive/tcc'

In [5]:
with open(f"{DATA_DIR}/vehicle_repair_and_maintenance_qa.json", 'r', encoding='utf-8') as fp:
    data = json.load(fp)
data[0]

{'metadata': {'document_id': '14b8b663-36f9-4072-b540-8ab74e1949f9',
  'section': 'SUSPENSÃO'},
 'data': {'document': 'O sistema pode ser resumido nos pneumáticos, nos amortecedores, nas molas e barras estabilizadoras. E têm por finalidade tornar o veículo confortável, estável, ter boa dirigibilidade e garantir seu desempenho dentro dos padrões de segurança recomendados. Pequenas irregularidades das vias de rodagem são absorvidas pelos pneumáticos. Quando essas irregularidades se tornam maiores, são absorvidas pelo sistema de molas que tem importância fundamental na suspensão. Os amortecedores entram em ação para reduzir o número e a amplitude das oscilações das molas. Nas suspensões são empregados diversos tipos de molas e amortecedores. As molas podem ser helicoidais, de ar, semi-elípticas ou barras de torção e os amortecedores podem ser comuns, de dupla ação, pressurizados a gás, podem ter controle eletrônico, etc. A barra estabilizadora é uma barra de seção circular confeccionada c

In [6]:
questions = [entry['data']['question'] for entry in data]
answers = [entry['data']['answer'] for entry in data]

## Metric definition

In [125]:
def calculate_rogue(predictions: list, labels: list) -> dict:
  """Compute ROUGE scores for decoded predictions against reference answers."""
  metric = evaluate.load("rouge")
  # rougeLsum expects one sentence per line, so split sentences with NLTK first.
  decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in predictions]
  decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in labels]
  return metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
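
For intuition about what the function above reports, the following is a minimal sketch of ROUGE-1 F1 computed from unigram overlap. It is a simplified illustration, not the `rouge_score` implementation: the hypothetical helper `rouge1_f1` uses plain whitespace tokenization and applies no stemming.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization.

    Hypothetical helper for illustration only; no stemming is applied.
    """
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("o sistema de suspensão", "o sistema de suspensão inclui molas"))
```

The stemmer enabled by `use_stemmer=True` is the English Porter stemmer, so scores on Portuguese text should be read as approximate.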

## Model training

In [117]:
class QuestionAnswerT5Model:
    _model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'

    def __init__(self,
                 questions: list[str],
                 answers: list[str],
                 last_checkpoint: str = None,
                 test_size: float = 0.1) -> None:
        model_name = last_checkpoint if last_checkpoint is not None else self._model_name
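        # Fall back to the CPU when no GPU is available instead of failing.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")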
        self._model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
        self._tokenizer = T5Tokenizer.from_pretrained(model_name)
        self._data_collator = DataCollatorForSeq2Seq(tokenizer=self._tokenizer, model=self._model)
        self._load_dataset(questions, answers, test_size)
        self._metric = evaluate.load("rouge")

    def __del__(self):
        self._model.cpu()
        del self._model
        gc.collect()
        torch.cuda.empty_cache()

    def _load_dataset(self, questions: list[str], answers: list[str], test_size: float):
        max_length = 512
        model_inputs = self._tokenizer(questions,
                                       max_length=max_length,
                                       truncation=True)
        labels = self._tokenizer(text_target=answers,
                                max_length=max_length,
                                truncation=True)

        model_inputs["labels"] = labels["input_ids"]

        tokenized_dataset = Dataset.from_dict(model_inputs)

        split_dataset = tokenized_dataset.train_test_split(
            test_size=test_size, shuffle=True)

        self._train_dataset = split_dataset["train"]
        self._test_dataset = split_dataset["test"]

        print(self._train_dataset)
        print(self._train_dataset[0])

    def _compute_metrics(self, eval_predictions: EvalPrediction):
        preds, labels = eval_predictions

        # The data collator pads labels with -100; restore the pad token so they can be decoded.
        labels = np.where(labels != -100, labels, self._tokenizer.pad_token_id)
        decoded_preds = self._tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = self._tokenizer.batch_decode(labels, skip_special_tokens=True)

        # rougeLsum expects one sentence per line.
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

        return self._metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    def inference(self, questions: list[str], max_new_tokens: int = 100) -> list[str]:
        self._model.eval()

        inputs = self._tokenizer(questions, return_tensors="pt", padding=True).to(self._model.device)
        outputs = self._model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self._tokenizer.batch_decode(outputs, skip_special_tokens=True)

    def train(self,
              output_dir: str,
              num_epochs: int = 10,
              train_batch_size = 8,
              eval_batch_size = 8,
              learning_rate = 3e-4,
              weight_decay = 0.001) -> None:
        self._model.train()

        training_args = Seq2SeqTrainingArguments(
          output_dir=output_dir,
          evaluation_strategy="epoch",
          save_strategy="epoch",
          learning_rate=learning_rate,
          weight_decay=weight_decay,
          per_device_train_batch_size=train_batch_size,
          per_device_eval_batch_size=eval_batch_size,
          save_total_limit=3,
          load_best_model_at_end=True,
          num_train_epochs=num_epochs,
          predict_with_generate=True,
          push_to_hub=False
        )

        trainer = Seq2SeqTrainer(
          model=self._model,
          args=training_args,
          train_dataset=self._train_dataset,
          eval_dataset=self._test_dataset,
          tokenizer=self._tokenizer,
          data_collator=self._data_collator,
          compute_metrics=self._compute_metrics
        )

        trainer.train()

        save_dir = f'{output_dir}/final_model'
        trainer.save_model(save_dir)
        print(f"Saved model to: {save_dir}")
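
`_load_dataset` shuffles without a fixed seed, so each instantiation of the class produces a different train/test split (the first training example printed by each instantiation in this notebook differs). A model reloaded from a checkpoint may therefore be evaluated on examples it was trained on. The following is a minimal pure-Python sketch of a reproducible split, assuming a fixed seed is acceptable. The helper `split_indices` and the seed value are hypothetical; `Dataset.train_test_split` accepts a `seed` argument that serves the same purpose.

```python
import random


def split_indices(n_rows: int, test_size: float = 0.1, seed: int = 42):
    """Return (train, test) index lists that are identical on every call."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(100)
train_b, test_b = split_indices(100)
print(test_a == test_b)  # the same held-out rows are selected on both calls
```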

In [64]:
model = QuestionAnswerT5Model(
    questions=questions,
    answers=answers
)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 6337
})
{'input_ids': [4264, 133, 4415, 10, 2011, 3715, 164, 3037, 5826, 6, 108, 9, 9449, 252, 18120, 1854, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [142, 1116, 8571, 3, 8807, 717, 5212, 3, 31, 1021, 2459, 757, 3, 13645, 4, 7196, 8, 4165, 4, 18337, 9208, 6, 3715, 164, 3037, 5826, 6, 108, 9, 9449, 252, 18120, 9602, 52, 547, 5, 1]}


In [14]:
model.train(output_dir=DATA_DIR, num_epochs=20)

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum |
|---|---|---|---|---|---|---|
| 1 | 2.4851 | 1.368958 | 0.375319 | 0.216086 | 0.32673 | 0.33122 |
| 2 | 1.2627 | 0.950681 | 0.42102 | 0.282275 | 0.379837 | 0.384009 |
| 3 | 0.972 | 0.789567 | 0.439743 | 0.31242 | 0.401119 | 0.404749 |
| 4 | 0.7072 | 0.706995 | 0.462175 | 0.335448 | 0.422534 | 0.42614 |
| 5 | 0.621 | 0.668399 | 0.482333 | 0.36546 | 0.444661 | 0.448684 |
| 6 | 0.4718 | 0.636674 | 0.486397 | 0.375199 | 0.453469 | 0.457051 |
| 7 | 0.4013 | 0.631426 | 0.491615 | 0.373037 | 0.453863 | 0.457875 |
| 8 | 0.357 | 0.631686 | 0.49951 | 0.389472 | 0.46582 | 0.469187 |
| 9 | 0.2957 | 0.646623 | 0.506422 | 0.39552 | 0.471162 | 0.475269 |
| 10 | 0.2745 | 0.645944 | 0.511321 | 0.402549 | 0.478547 | 0.482136 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Saved model to: /content/drive/MyDrive/tcc/final_model
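
Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` ranks checkpoints by evaluation loss by default. The sketch below applies that criterion to the validation losses logged above and contrasts it with the epoch that maximizes ROUGE-1. The variable names are illustrative.

```python
# (epoch, validation loss, ROUGE-1) copied from the training log above
log = [
    (1, 1.368958, 0.375319),
    (2, 0.950681, 0.42102),
    (3, 0.789567, 0.439743),
    (4, 0.706995, 0.462175),
    (5, 0.668399, 0.482333),
    (6, 0.636674, 0.486397),
    (7, 0.631426, 0.491615),
    (8, 0.631686, 0.49951),
    (9, 0.646623, 0.506422),
    (10, 0.645944, 0.511321),
]

best_by_loss = min(log, key=lambda row: row[1])
best_by_rouge = max(log, key=lambda row: row[2])

print("lowest validation loss: epoch", best_by_loss[0])
print("highest ROUGE-1: epoch", best_by_rouge[0])
```

Validation loss bottoms out at epoch 7 while ROUGE-1 keeps rising through epoch 10, so the checkpoint restored at the end of training is not necessarily the one with the best ROUGE. If ROUGE is the selection criterion of interest, setting `metric_for_best_model` in the training arguments is one option.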


In [9]:
model = QuestionAnswerT5Model(
    questions=questions,
    answers=answers,
    last_checkpoint=f'{DATA_DIR}/checkpoint-15860',
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 6337
})
{'input_ids': [28, 13, 1117, 13, 16, 2011, 4, 106, 1827, 393, 106, 2769, 10, 31, 1021, 2459, 124, 1854, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [292, 2011, 4, 106, 1827, 393, 106, 2769, 10, 31, 1021, 2459, 124, 3, 52, 474, 3, 17, 418, 11, 4165, 4, 18337, 9208, 6, 5, 1]}



In [19]:
model.train(output_dir=DATA_DIR, num_epochs=10)

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum |
|---|---|---|---|---|---|---|
| 1 | 0.2337 | 0.129949 | 0.610079 | 0.553009 | 0.594398 | 0.59589 |
| 2 | 0.213 | 0.141442 | 0.602778 | 0.541865 | 0.586715 | 0.588684 |
| 3 | 0.1888 | 0.156366 | 0.600204 | 0.536008 | 0.584482 | 0.586374 |
| 4 | 0.1592 | 0.162141 | 0.601692 | 0.538396 | 0.584652 | 0.586444 |
| 5 | 0.143 | 0.17113 | 0.602085 | 0.536766 | 0.58215 | 0.584689 |
| 6 | 0.1125 | 0.174021 | 0.59907 | 0.533693 | 0.580934 | 0.583671 |
| 7 | 0.0978 | 0.177698 | 0.596122 | 0.529106 | 0.57684 | 0.579747 |
| 8 | 0.0839 | 0.183856 | 0.59572 | 0.529556 | 0.577501 | 0.579402 |
| 9 | 0.0754 | 0.188831 | 0.600847 | 0.53458 | 0.582671 | 0.584972 |
| 10 | 0.0714 | 0.189217 | 0.598951 | 0.53295 | 0.581114 | 0.583676 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Saved model to: /content/drive/MyDrive/tcc/final_model


## Model evaluation

In [127]:
!pip install more-itertools
import more_itertools as mit
import pandas as pd



In [179]:
del model
gc.collect()
torch.cuda.empty_cache()

In [181]:
model = QuestionAnswerT5Model(
    questions=questions,
    answers=answers,
    last_checkpoint=f'{DATA_DIR}/checkpoint-793',
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 6337
})
{'input_ids': [40, 13, 2063, 9, 18228, 4, 9395, 1854, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [28, 18228, 4, 9395, 21, 9, 15127, 6246, 51, 9395, 19139, 6, 8, 10576, 6, 12, 527, 143, 1540, 4, 870, 8, 9586, 5, 1]}
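
Checkpoint directory names count optimizer steps, not epochs. As a rough check, assuming a single device and no gradient accumulation (the defaults, which this notebook does not override), the 6337 training rows and the default batch size of 8 give the following number of steps per epoch:

```python
import math

train_rows = 6337  # num_rows printed for the training split
batch_size = 8     # default train_batch_size in QuestionAnswerT5Model.train

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(train_rows / batch_size)

print(steps_per_epoch)          # 793
print(15860 / steps_per_epoch)  # 20.0
```

Under these assumptions, `checkpoint-793` corresponds to the end of the first epoch of a run and `checkpoint-15860` to the end of the twentieth.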


In [182]:
preds = []
batch_size = 500
for chunk in mit.chunked(questions, batch_size):
  preds.extend(model.inference(chunk))

In [183]:
assert len(preds) == len(questions)

In [184]:
calculate_rogue(preds, answers)

{'rouge1': 0.8542537636111834,
 'rouge2': 0.8063267942265744,
 'rougeL': 0.8385173314522122,
 'rougeLsum': 0.8413228327233703}

In [185]:
df = pd.DataFrame({'inputs': questions, 'preds': preds, 'labels': answers})

In [186]:
df.head()

|   | inputs | preds | labels |
|---|---|---|---|
| 0 | Quais são os componentes principais do sistema... | O sistema de suspensão inclui pneumáticos, mol... | O sistema de suspensão inclui pneumáticos, mol... |
| 1 | O que é a função do sistema de suspensão em ve... | O sistema de suspensão é responsável por torna... | O sistema de suspensão é responsável por torna... |
| 2 | Qual é o objetivo do sistema de suspensão em v... | O sistema de suspensão tem como finalidade tor... | O sistema de suspensão tem por finalidade torn... |
| 3 | Como é que as pequenas irregularidades nas via... | As pequenas irregularidades das vias de rodage... | As pequenas irregularidades das vias de rodage... |
| 4 | Como se absorvem pequenas irregularidades nas ... | Pequenas irregularidades das vias de rodagem s... | Pequenas irregularidades das vias de rodagem s... |


In [187]:
df.to_csv(f'{DATA_DIR}/preds.csv')

In [188]:
print("Exact matches:", df[df['preds'] == df['labels']].shape[0])

Exact matches: 3618
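
The exact-match count above compares raw strings. The following is a minimal sketch of the same measure expressed as a rate, with optional whitespace and case normalization. `exact_match_rate` is a hypothetical helper, not part of the notebook.

```python
def exact_match_rate(preds: list[str], labels: list[str], normalize: bool = True) -> float:
    """Fraction of predictions identical to their reference."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not preds:
        return 0.0

    def norm(text: str) -> str:
        # Collapse runs of whitespace and ignore case when normalize is True.
        return " ".join(text.split()).lower() if normalize else text

    matches = sum(norm(p) == norm(l) for p, l in zip(preds, labels))
    return matches / len(preds)


print(exact_match_rate(["O sistema", "molas "], ["o  sistema", "barras"]))
```

Since `questions` covers the full dataset, including the rows used for training, both the ROUGE scores and the exact-match count in this section measure fit to the data rather than performance on held-out examples.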


## Model publishing

In [161]:
from huggingface_hub import login
login()


In [168]:
model._model.save_pretrained(f"{DATA_DIR}/emgs/ptt5-qa")

In [169]:
model._model.push_to_hub("emgs/ptt5-qa")

CommitInfo(commit_url='https://huggingface.co/emgs/ptt5-qa/commit/509de372fe415a365e5dacd5972bf00f9b8e7304', commit_message='Upload T5ForConditionalGeneration', commit_description='', oid='509de372fe415a365e5dacd5972bf00f9b8e7304', pr_url=None, pr_revision=None, pr_num=None)

In [170]:
model._tokenizer.save_pretrained(f"{DATA_DIR}/emgs/ptt5-qa")

('/content/drive/MyDrive/tcc/emgs/ptt5-qa/tokenizer_config.json',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/special_tokens_map.json',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/spiece.model',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/added_tokens.json')

In [171]:
model._tokenizer.push_to_hub("emgs/ptt5-qa")

CommitInfo(commit_url='https://huggingface.co/emgs/ptt5-qa/commit/156eb5a93a0cbb33419be1ac8b98aecfa018cf02', commit_message='Upload tokenizer', commit_description='', oid='156eb5a93a0cbb33419be1ac8b98aecfa018cf02', pr_url=None, pr_revision=None, pr_num=None)