<a href="https://colab.research.google.com/github/EdmilsonSantana/tcc-2022-2/blob/main/notebooks/PTT5_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Package installation

In [None]:
%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install tokenizers
pip install evaluate
pip install rouge_score
pip install sentencepiece
pip install huggingface_hub

In [None]:
import evaluate
import nltk
import json
import numpy as np
import pandas as pd
from datasets import Dataset, DatasetDict, load_from_disk
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer
from transformers import EvalPrediction
from transformers import DataCollatorForSeq2Seq
from sklearn.model_selection import train_test_split
import gc
import torch

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Data preparation

In [None]:
DATA_DIR = '/content/drive/MyDrive/tcc'

In [None]:
with open(f"{DATA_DIR}/vehicle_repair_and_maintenance_qa.json", 'r', encoding='utf-8') as fp:
    data = json.load(fp)

In [None]:
questions = [entry['data']['question'] for entry in data]
answers = [entry['data']['answer'] for entry in data]
sections = [entry['metadata']['section'] for entry in data]
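
The three list comprehensions above imply that each JSON record nests the question and answer under `data` and the section under `metadata`. A minimal sketch of that shape follows; the record and its values are made-up placeholders, not taken from the dataset.

```python
import json

# Hypothetical record that mirrors the keys the notebook reads.
# The values are illustrative placeholders only.
raw = """
[
  {
    "data": {"question": "What is torque?", "answer": "Torque is a twisting force."},
    "metadata": {"section": "ENGINE"}
  }
]
"""

sample = json.loads(raw)

# Same access pattern as the notebook cell.
sample_questions = [entry["data"]["question"] for entry in sample]
sample_answers = [entry["data"]["answer"] for entry in sample]
sample_sections = [entry["metadata"]["section"] for entry in sample]

print(sample_questions, sample_answers, sample_sections)
```

A record missing either key raises `KeyError`, so a malformed file fails at this step rather than later during training.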

In [None]:
df_qa = pd.DataFrame({'question': questions, 'answer': answers, 'section': sections})

In [None]:
df_qa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13014 entries, 0 to 13013
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  13014 non-null  object
 1   answer    13014 non-null  object
 2   section   13014 non-null  object
dtypes: object(3)
memory usage: 305.1+ KB


In [None]:
df_qa.drop_duplicates(subset=['question'], inplace=True)
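
`drop_duplicates(subset=['question'])` keeps the first row for each exact question string; here it reduces the frame from 13014 to 10047 rows. The sketch below reproduces that keep-first rule in plain Python. The `normalize` helper is a hypothetical addition, not part of the notebook, that shows how questions differing only in case or spacing could also be merged.

```python
def dedupe_keep_first(rows, key=lambda q: q):
    """Keep the first row for each distinct key(question)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row["question"])
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


def normalize(question):
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(question.lower().split())


# Made-up rows for illustration.
rows = [
    {"question": "O que é torque?", "answer": "A"},
    {"question": "O que é torque?", "answer": "B"},
    {"question": "o que é  torque?", "answer": "C"},
]

exact = dedupe_keep_first(rows)                  # exact string match only
loose = dedupe_keep_first(rows, key=normalize)   # also merges case/spacing variants
print(len(exact), len(loose))
```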

In [None]:
df_qa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10047 entries, 0 to 13013
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  10047 non-null  object
 1   answer    10047 non-null  object
 2   section   10047 non-null  object
dtypes: object(3)
memory usage: 314.0+ KB


In [None]:
df_qa.head()

,question,answer,section
0,Quem construiu o primeiro triciclo a vapor?,O primeiro triciclo a vapor foi construído na ...,MOTOR DE COMBUSTÃO
1,O que é a máquina a vapor?,"A máquina a vapor, conhecida também como motor...",MOTOR DE COMBUSTÃO
2,O que é o motor de combustão interna do ciclo ...,O motor de combustão interna do ciclo Otto é u...,MOTOR DE COMBUSTÃO
3,O que é o primeiro automóvel?,"O primeiro automóvel foi construído em 1885, n...",MOTOR DE COMBUSTÃO
4,O que é o motor de combustão interna?,O motor de combustão interna é constituído de ...,MOTOR DE COMBUSTÃO


In [None]:
counts_by_section = df_qa.groupby('section').count()

In [None]:
counts_by_section[counts_by_section['question'] == 1]

section,question,answer
folga radial e folga axial.,1,1


In [None]:
df_qa[df_qa['section'] == 'folga radial e folga axial.']

,question,answer,section
1980,O que é a compreensão de expressões e suas cor...,As figuras abaixo mostram o que cada expressão...,folga radial e folga axial.


In [None]:
df_qa[df_qa['question'].str.contains('figura')]

,question,answer,section
2055,O que é representado na figura a seguir?,A figura representa um triciclo a vapor.,MOTOR DE COMBUSTÃO EXTERNA
2600,Onde o cilindro mestre é encontrado na configu...,O cilindro mestre é comumente encontrado junto...,Cilindro mestre


In [None]:
df_qa[df_qa['answer'].str.contains('figura')].info()

<class 'pandas.core.frame.DataFrame'>
Index: 56 entries, 292 to 12404
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  56 non-null     object
 1   answer    56 non-null     object
 2   section   56 non-null     object
dtypes: object(3)
memory usage: 1.8+ KB


In [None]:
df_qa[df_qa['answer'].str.contains('figura')].sample(1)['answer'].values

array(['Na figura mostrada, a catraca é um componente do micrômetro que serve para assegurar uma pressão de medição constante.'],
      dtype=object)

In [None]:
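# Drop question/answer pairs that refer to a figure, since the images are not part of the text data.
df_qa.drop(index=df_qa[df_qa['question'].str.contains('figura')].index, inplace=True)
df_qa.drop(index=df_qa[df_qa['answer'].str.contains('figura')].index, inplace=True)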

In [None]:
assert df_qa[df_qa['answer'].str.contains('figura')].shape[0] == 0
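
The filter above is a case-sensitive match on the lowercase word `figura`, so a sentence starting with "Figura" would pass through. Whether the dataset contains such rows is not known from the notebook. The sketch below uses only the standard library to show the difference; in pandas, `str.contains` accepts `case=False` for the same effect. Only the first sentence comes from the notebook output; the other two are made up.

```python
import re

texts = [
    "A figura representa um triciclo a vapor.",  # from the notebook output
    "Figura 3 mostra o cilindro mestre.",        # made-up example
    "O torque depende da força aplicada.",       # made-up example
]

# Case-sensitive substring test, like str.contains('figura') with its defaults.
sensitive = [t for t in texts if "figura" in t]

# Case-insensitive variant, comparable to str.contains('figura', case=False).
pattern = re.compile("figura", re.IGNORECASE)
insensitive = [t for t in texts if pattern.search(t)]

print(len(sensitive), len(insensitive))
```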

In [None]:
df_qa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9990 entries, 0 to 13013
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  9990 non-null   object
 1   answer    9990 non-null   object
 2   section   9990 non-null   object
dtypes: object(3)
memory usage: 312.2+ KB


In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_qa['question'], df_qa['answer'], test_size=0.1, stratify=df_qa['section'], random_state=42)
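
With `stratify=df_qa['section']`, `train_test_split` raises a `ValueError` if any section has a single row, because one row cannot be divided between train and test. That is presumably why the notebook inspects single-question sections earlier. A minimal pre-check, sketched with the standard library only; `small_strata` is a hypothetical helper name and the label list is illustrative.

```python
from collections import Counter


def small_strata(labels, min_count=2):
    """Return the labels that occur fewer than `min_count` times."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_count)


# Illustrative labels; section names taken from the notebook output.
sections_example = [
    "MOTOR DE COMBUSTÃO",
    "MOTOR DE COMBUSTÃO",
    "Cilindro mestre",
    "Cilindro mestre",
    "folga radial e folga axial.",
]

print(small_strata(sections_example))
```

Running this kind of check on `df_qa['section']` before the split surfaces the problematic sections by name instead of through an exception.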

In [None]:
train_dataset = Dataset.from_dict({'question': X_train, 'answer': y_train})
test_dataset = Dataset.from_dict({'question': X_test, 'answer': y_test})

In [None]:
dataset = DatasetDict({'train': train_dataset, 'test': test_dataset})
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 8991
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 999
    })
})

In [None]:
dataset_dir = f"{DATA_DIR}/vehicle_repair_and_maintenance_qa"

In [None]:
dataset.save_to_disk(dataset_dir)

Saving the dataset (0/1 shards):   0%|          | 0/8991 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/999 [00:00<?, ? examples/s]

In [None]:
train_dataset = load_from_disk(f'{dataset_dir}/train')
test_dataset = load_from_disk(f'{dataset_dir}/test')
dataset = DatasetDict({'train': train_dataset, 'test': test_dataset})
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 8991
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 999
    })
})

## Metric definition

In [None]:
metric = evaluate.load("rouge")
def calculate_rogue(predictions: list, labels: list) -> dict:
  decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip(), language='portuguese')) for pred in predictions]
  decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip(), language='portuguese')) for label in labels]
  return metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
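
To make the metric concrete, here is a simplified sketch of what ROUGE-1 measures: the F1 score of unigram overlap between prediction and reference. It is not the `rouge_score` implementation used by `evaluate`; it splits on whitespace and applies no stemming, so its numbers will differ from the ones reported by this notebook.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration.
reference = "o torque significa torção"
identical_score = rouge1_f1(reference, reference)
partial_score = rouge1_f1("o torque depende da força", reference)
print(identical_score, partial_score)
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence.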

In [None]:
def compute_metrics(eval_preds: EvalPrediction) -> dict:
    preds, labels = eval_preds

    # -100 marks positions ignored by the loss; replace it with the pad token id so the ids can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)

    # Turn token ids back into text before scoring.
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    return calculate_rogue(decoded_preds, decoded_labels)
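
`DataCollatorForSeq2Seq` pads label sequences with `-100` by default so that the loss skips those positions, and `compute_metrics` has to undo that before decoding. The plain-Python sketch below walks through both steps on toy token ids. The pad id of `0` is an assumption for illustration; in the notebook the real value comes from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # value the loss function skips
PAD_TOKEN_ID = 0     # assumed here; read tokenizer.pad_token_id in practice


def pad_labels(batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (width - len(seq)) for seq in batch]


def restore_pad_ids(batch, pad_token_id=PAD_TOKEN_ID):
    """Replace the ignore index with a real token id so decoding works."""
    return [
        [pad_token_id if tok == IGNORE_INDEX else tok for tok in seq]
        for seq in batch
    ]


# Toy token ids, not produced by a real tokenizer.
labels_batch = [[12, 7, 1], [5, 1]]
padded = pad_labels(labels_batch)
restored = restore_pad_ids(padded)
print(padded)
print(restored)
```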

## Model training

In [None]:
model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
max_length = 512
learning_rate = 5e-4
weight_decay = 0.01
n_epochs = 20
train_batch_size = 8
test_batch_size = 4

In [None]:
def tokenize_data(examples):
    model_inputs = tokenizer(examples['question'], max_length=max_length, truncation=True)
    labels = tokenizer(text_target=examples['answer'], max_length=max_length, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

In [None]:
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to('cuda')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
train_dataset_tokenized = dataset['train'].map(tokenize_data, batched=True)
test_dataset_tokenized = dataset['test'].map(tokenize_data, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir=DATA_DIR,
    num_train_epochs=n_epochs,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=test_batch_size,
    weight_decay=weight_decay,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=learning_rate,
    predict_with_generate=True,
    generation_max_length=100
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_tokenized,
    eval_dataset=test_dataset_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,1.8838,0.911703,0.385126,0.238665,0.340636,0.347974
2,0.8687,0.628922,0.487583,0.358881,0.443715,0.451123
3,0.5813,0.55487,0.538234,0.42245,0.494779,0.503855
4,0.4377,0.528685,0.544968,0.429653,0.500942,0.510388
5,0.3453,0.506659,0.570713,0.468838,0.532037,0.539167
6,0.2785,0.523544,0.587311,0.484838,0.545712,0.554389
7,0.2308,0.535356,0.594236,0.493332,0.555832,0.563165
8,0.1913,0.556149,0.607407,0.509752,0.566818,0.576456
9,0.1586,0.575376,0.610233,0.513667,0.572005,0.581155
10,0.136,0.593316,0.609515,0.512217,0.570272,0.578592


In [None]:
trainer.train(resume_from_checkpoint=f'{DATA_DIR}/checkpoint-15736')

There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
15,0.0635,0.721499,0.621776,0.526392,0.583014,0.591777
16,0.0549,0.734052,0.625036,0.530529,0.587235,0.596462
17,0.0486,0.749139,0.625945,0.531596,0.588745,0.597689
18,0.0415,0.780283,0.625569,0.530107,0.587309,0.596811
19,0.0375,0.789886,0.628372,0.534027,0.59058,0.600067
20,0.0336,0.797307,0.627335,0.531797,0.588519,0.599136


TrainOutput(global_step=22480, training_loss=0.013979988505407584, metrics={'train_runtime': 4787.8889, 'train_samples_per_second': 37.557, 'train_steps_per_second': 4.695, 'total_flos': 4902916223093760.0, 'train_loss': 0.013979988505407584, 'epoch': 20.0})
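
In the training logs above, validation loss reaches its minimum early and then rises while training loss keeps falling and ROUGE improves only slowly. The sketch below picks out the relevant epochs from the logged values, which are copied from the two tables. Epochs 11 to 14 do not appear in the logs and are omitted.

```python
# (epoch, validation_loss, rougeL) copied from the training logs above.
history = [
    (1, 0.911703, 0.340636),
    (2, 0.628922, 0.443715),
    (3, 0.554870, 0.494779),
    (4, 0.528685, 0.500942),
    (5, 0.506659, 0.532037),
    (6, 0.523544, 0.545712),
    (7, 0.535356, 0.555832),
    (8, 0.556149, 0.566818),
    (9, 0.575376, 0.572005),
    (10, 0.593316, 0.570272),
    (15, 0.721499, 0.583014),
    (16, 0.734052, 0.587235),
    (17, 0.749139, 0.588745),
    (18, 0.780283, 0.587309),
    (19, 0.789886, 0.590580),
    (20, 0.797307, 0.588519),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_rouge = max(history, key=lambda row: row[2])

print("lowest validation loss: epoch", best_by_loss[0])
print("highest ROUGE-L: epoch", best_by_rouge[0])
```

The two criteria disagree, so the choice of checkpoint depends on which metric matters for the application. The notebook saves the final epoch; `Seq2SeqTrainingArguments` also accepts `load_best_model_at_end=True` together with `metric_for_best_model` to keep a specific checkpoint instead, although that option is not used here.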

In [None]:
final_model_dir = f'{DATA_DIR}/final_model'

In [None]:
trainer.save_model(final_model_dir)

## Model evaluation

In [None]:
del model, tokenizer

In [None]:
torch.cuda.empty_cache()
gc.collect()

0

In [None]:
tokenizer = T5Tokenizer.from_pretrained(final_model_dir)
model = T5ForConditionalGeneration.from_pretrained(final_model_dir).to('cuda')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
!pip install more-itertools
import more_itertools as mit
import pandas as pd



In [None]:
def inference(model, tokenizer, questions):
  model.eval()

  # Tokenize the whole batch at once, padding to the longest question.
  inputs = tokenizer(questions, return_tensors="pt", padding=True).to('cuda')
  outputs = model.generate(**inputs, max_new_tokens=512)
  return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def get_preds(questions, answers):
  # `answers` is unused; the parameter is kept so existing calls keep working.
  # Uses the global `model` and `tokenizer`.
  preds = []
  batch_size = 100
  for chunk in mit.chunked(questions, batch_size):
    preds.extend(inference(model, tokenizer, chunk))
  return preds

def save_preds(inputs, preds, labels, filename):
  df = pd.DataFrame({'inputs': inputs, 'preds': preds, 'labels': labels})
  # Persist to Drive, then reload so later cells work with exactly what was saved.
  df.to_csv(f'{DATA_DIR}/{filename}.csv')
  df = pd.read_csv(f'{DATA_DIR}/{filename}.csv')
  return df

def evaluate_preds(df):
  # ROUGE is computed only on rows where the prediction differs from the label;
  # exact matches are counted separately.
  exact = df['preds'] == df['labels']
  not_exact_matches = df[~exact]
  labels = not_exact_matches['labels'].astype(str).values
  preds = not_exact_matches['preds'].astype(str).values
  metrics = calculate_rogue(preds, labels)
  metrics['exactMatches'] = int(exact.sum())
  return metrics
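
`get_preds` depends on `more_itertools.chunked` to split the questions into batches of 100. A standard-library stand-in, sketched below, behaves the same way for this use and avoids the extra install. The name `chunked` is reused only for illustration.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


batches = list(chunked(range(7), 3))
print(batches)

# With the 999 test questions and a batch size of 100, the last batch is smaller.
sizes = [len(chunk) for chunk in chunked(range(999), 100)]
print(sizes)
```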

In [None]:
test_questions = test_dataset['question']
test_answers = test_dataset['answer']

In [None]:
test_preds = get_preds(test_questions, test_answers)
assert len(test_preds) == len(test_questions)

In [None]:
df_test = save_preds(test_questions, test_preds, test_answers, 'test_result')

In [None]:
df_test.head()

Unnamed: 0.1,Unnamed: 0,inputs,preds,labels
0,0,Qual é o procedimento recomendado antes de usa...,Antes de iniciar a medição de uma peça devemos...,Antes de iniciar a medição de uma peça devemos...
1,1,O que acontece se o tamanho do pinhão for muit...,"Se o tamanho do pinhão for muito pequeno, have...","Se o tamanho do pinhão for muito pequeno, ele ..."
2,2,O que significa torque em um motor?,O torque depende não só da força (F) que é apl...,O torque significa torção. O torque depende nã...
3,3,Qual é a fórmula para calcular a precisão em p...,Precisão = 1mm ÷ divisões do nônio,A precisão é a menor medida que o instrumento ...
4,4,Qual é o significado de cada divisão em uma es...,Cada divisão em uma escala métrica equivale a 5’.,Cada centímetro gravado na Escala encontra-se ...


In [None]:
evaluate_preds(df_test)

{'rouge1': 0.5385830105555676,
 'rouge2': 0.4226439063512055,
 'rougeL': 0.49095520775615153,
 'rougeLsum': 0.5031979105529967,
 'exactMatches': 194}

In [None]:
train_questions = train_dataset['question']
train_answers = train_dataset['answer']

In [None]:
train_preds = get_preds(train_questions, train_answers)
assert len(train_preds) == len(train_questions)

In [None]:
df_train = save_preds(train_questions, train_preds, train_answers, 'train_result')

In [None]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,inputs,preds,labels
0,0,Qual é o papel da bomba de óleo no sistema de ...,A bomba de óleo tem como finalidade manter o ó...,A bomba de óleo tem como finalidade manter o ó...
1,1,O que é o controle periódico do óleo para tran...,O controle periódico do óleo para transmissões...,O controle periódico do óleo para transmissões...
2,2,Como se mide a KPI em direção?,"Na geometria de direção, a KPI é medida em com...","Na geometria de direção, a KPI é medida em com..."
3,3,Qual é o propósito da alavança de mudanças?,A alavança de mudanças é usada para selecionar...,A alavança de mudanças é usada para selecionar...
4,4,Como funciona a varetas na distribuição mecânica?,As varetas recebem o movimento dos tuchos e tr...,As varetas recebem o movimento dos tuchos e tr...


In [None]:
evaluate_preds(df_train)

{'rouge1': 0.786980085807462,
 'rouge2': 0.7375047215242283,
 'rougeL': 0.7736338696942494,
 'rougeLsum': 0.7771416217970222,
 'exactMatches': 8086}
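
Because `evaluate_preds` scores ROUGE only on predictions that are not exact matches, the reported ROUGE values and the exact-match counts have to be read together. The sketch below turns the reported counts into rates and estimates test ROUGE-1 over all rows. It assumes that an exact match scores 1.0, which the notebook does not compute directly.

```python
# Counts reported by the notebook.
test_total, test_exact = 999, 194
train_total, train_exact = 8991, 8086

test_em_rate = test_exact / test_total
train_em_rate = train_exact / train_total

# ROUGE-1 reported above, computed only on the non-exact-match test rows.
test_rouge1_non_exact = 0.5385830105555676

# Assumption: every exact match contributes a ROUGE-1 of 1.0.
test_rouge1_all = (
    test_exact * 1.0 + (test_total - test_exact) * test_rouge1_non_exact
) / test_total

print(f"test exact match:  {test_em_rate:.3f}")
print(f"train exact match: {train_em_rate:.3f}")
print(f"test ROUGE-1 over all rows (estimated): {test_rouge1_all:.3f}")
```

The gap between the train and test exact-match rates suggests that the model reproduces many training answers verbatim, which is consistent with the rising validation loss seen during training.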

## Publishing the model

In [None]:
from huggingface_hub import login
login()


In [None]:
model.save_pretrained(f"{DATA_DIR}/emgs/ptt5-qa")

In [None]:
model.push_to_hub("emgs/ptt5-qa")



README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/emgs/ptt5-qa/commit/8c6013844bc25715ec6ec940ec4119ac66c9d1f9', commit_message='Upload T5ForConditionalGeneration', commit_description='', oid='8c6013844bc25715ec6ec940ec4119ac66c9d1f9', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.save_pretrained(f"{DATA_DIR}/emgs/ptt5-qa")

('/content/drive/MyDrive/tcc/emgs/ptt5-qa/tokenizer_config.json',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/special_tokens_map.json',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/spiece.model',
 '/content/drive/MyDrive/tcc/emgs/ptt5-qa/added_tokens.json')

In [None]:
tokenizer.push_to_hub("emgs/ptt5-qa")

CommitInfo(commit_url='https://huggingface.co/emgs/ptt5-qa/commit/7719be22354c941ea113bf7595a9eea36de077c8', commit_message='Upload tokenizer', commit_description='', oid='7719be22354c941ea113bf7595a9eea36de077c8', pr_url=None, pr_revision=None, pr_num=None)

## Exploring the model

In [None]:
from huggingface_hub import login
login()


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('emgs/ptt5-qa').to("cuda")
tokenizer = T5Tokenizer.from_pretrained('emgs/ptt5-qa')

config.json:   0%|          | 0.00/792 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.6k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/756k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.59k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
test_questions[0]

'Qual é o procedimento recomendado antes de usar um micrômetro de 0 a 25 mm ou 0 a 1"?'

In [None]:
test_answers[0]

'Antes de iniciar a medição de uma peça devemos calibrar o instrumento de acordo com a sua capacidade. Para os micrômetros cuja capacidade é de 0 a 25 mm ou de 0 a 1", precisamos tomar os seguintes cuidados: limpe cuidadosamente as partes móveis eliminando poeiras e sujeiras, com pano macio e limpo; antes do uso, limpe as faces de medição; use somente uma folha de papel macio; encoste suavemente as faces de medição usando apenas a catraca; em seguida, verifique a coincidência das linhas de referência da bainha com o zero do tambor; se estas não coincidirem, faça o ajuste movimentando a bainha com a chave de micrômetro que normalmente acompanha o instrumento.'

In [None]:
inference(model, tokenizer, [test_questions[0]])

['Antes de iniciar a medição de uma peça devemos calibrar o instrumento de acordo com a sua capacidade. Para os micrômetros cuja capacidade é de 0 a 25 mm ou de 0 a 1", precisamos tomar os seguintes cuidados: limpe cuidadosamente as partes móveis eliminando poeiras e sujeiras, com pano macio e limpo; antes do uso, limpe as faces de medição; use somente uma folha de papel macio; encoste suavemente as faces de medição usando apenas a catraca; em seguida, verifique a coincidência das linhas de referência da bainha com o zero do tambor; se estas não coincidirem, faça o ajuste movimentando a bainha com a chave de micrômetro que normalmente acompanha o instrumento.']