<a href="https://colab.research.google.com/github/EdmilsonSantana/llm-vehicle-repair/blob/main/Assistente_do_Mecanico_TCC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [33]:
%%capture
%pip install selenium beautifulsoup4 datasets transformers[torch] deep-translator

## Scraping data from AutoZone

We extract the content of the articles listed in the AutoZone sitemap and use their FAQ question/answer pairs to fine-tune the Flan-T5 model.
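The `extract_articles` helper used below comes from the repository's `articles` module and is not shown in this notebook. As a rough illustration of the first step only, collecting article URLs from a sitemap might look like the sketch below. The sample XML, the `example.com` URLs, the `/diy/` filter and the `article_urls` function are all hypothetical and are not the project's actual code.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, for illustration only
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/diy/ac/car-ac-blowing-hot-air</loc></url>
  <url><loc>https://example.com/store-locator</loc></url>
  <url><loc>https://example.com/diy/oil/synthetic-blend-oil</loc></url>
</urlset>
""".strip()


def article_urls(sitemap_xml, path_fragment="/diy/"):
    """Return the <loc> URLs whose path contains `path_fragment` (an assumed filter)."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [url for url in urls if path_fragment in url]


for url in article_urls(SAMPLE_SITEMAP):
    print(url)
```

Each URL would then be fetched and parsed (for example with Selenium and BeautifulSoup, which are installed above) to obtain the title, content and FAQ entries.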

In [1]:
from articles import extract_articles
import pandas as pd

In [2]:
data_dir = './data'

In [3]:
articles = extract_articles(data_dir)

In [4]:
print(articles[0])

{'title': 'Car AC Blowing Hot Air', 'content': "Understanding the Causes\n\nA quick understanding of how air conditioning works can help with understanding what the causes could be. When AC is turned on, refrigerant that flows through the system absorbs heat from your vehicle's cabin where it's removed and, through a series of parts and processes, the heat is released into the atmosphere before circulating back and repeating the process. There are several points where something can be wrong, causing warm air rather than cool:\n\nThere isn't sufficient airflow in the cabin. This could be a problem with a bad blower motor, but more commonly a plugged cabin air filter is the culprit.\nThere isn't enough refrigerant. The gas that circulates through the system can leak out, preventing it from working efficiently.\nThe compressor may not be cycling. A clutch issue or a compressor failure can prevent the AC system from being able to disperse the heat the refrigerant has absorbed.\nThe expansi

In [5]:
faq_questions = []
for article in articles:
    faq_questions.extend(article['faq_questions'])

In [6]:
df_faq = pd.DataFrame(faq_questions)

In [7]:
has_autozone_text = df_faq['question'].str.contains('AutoZone')
df_faq.drop(index=df_faq[has_autozone_text].index, inplace=True)

In [8]:
df_faq.head()

,question,answer
0,Why is my car AC blowing hot air?,There could be a multitude of root causes with...
1,Can I fix a hot AC issue myself?,There are some issues that can be done on your...
2,What are the signs of a failing AC compressor?,"Clunking noises when the compressor cycles, in..."
3,How often should I service my car's AC system?,"Annually, check that your AC is working proper..."
4,When should I consider professional help for m...,If DIY solutions haven't fixed the problem or ...


In [9]:
df_faq.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1769 entries, 0 to 1782
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  1769 non-null   object
 1   answer    1769 non-null   object
dtypes: object(2)
memory usage: 41.5+ KB


## Fine-Tuning Flan T5 Model

In [10]:
%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install tokenizers
pip install evaluate
pip install rouge_score
pip install sentencepiece
pip install huggingface_hub



In [11]:
import nltk
import evaluate
import numpy as np
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from articles import extract_articles
import pandas as pd



In [12]:
# Load the tokenizer, model, and data collator
MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [13]:
dataset = Dataset.from_pandas(df_faq)
train_test_ds = dataset.train_test_split(test_size=0.2)
# Every question is prefixed with this task instruction before tokenization
prefix = "Please answer this question: "

In [14]:
train_test_ds

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', '__index_level_0__'],
        num_rows: 1415
    })
    test: Dataset({
        features: ['question', 'answer', '__index_level_0__'],
        num_rows: 354
    })
})
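The 1415/354 split above follows from the 1769 rows and `test_size=0.2`. The sketch below reproduces that arithmetic and shows a seeded shuffle in plain Python. It assumes the test size is rounded up and the train size rounded down, which matches the numbers shown here; `split_sizes` and `seeded_split` are illustrative helpers, not part of `datasets`. In the notebook itself, `train_test_split` also accepts a `seed` argument, which makes the split reproducible across runs.

```python
import math
import random


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming ceil for test and floor for train."""
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test


def seeded_split(rows, test_size, seed):
    """Shuffle a copy of `rows` with a fixed seed, then cut it into train/test."""
    n_train, n_test = split_sizes(len(rows), test_size)
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


n_train, n_test = split_sizes(1769, 0.2)
print(n_train, n_test)

# The same seed always yields the same partition
train_a, test_a = seeded_split(range(1769), 0.2, seed=42)
train_b, test_b = seeded_split(range(1769), 0.2, seed=42)
print(train_a == train_b and test_a == test_b)
```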

In [15]:
# Define the preprocessing function

def preprocess_function(examples):
    """Add the prefix to each question, tokenize the text, and set the labels."""
    # The inputs are the prefixed questions, truncated to 128 tokens:
    inputs = [prefix + doc for doc in examples["question"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # The labels are the tokenized answers, truncated to 512 tokens:
    labels = tokenizer(text_target=examples["answer"],
                       max_length=512,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Map the preprocessing function across our dataset
tokenized_dataset = train_test_ds.map(preprocess_function, batched=True)



In [16]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1415
    })
    test: Dataset({
        features: ['question', 'answer', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 354
    })
})
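The tokenized sequences above have different lengths, so `DataCollatorForSeq2Seq` pads each batch at training time. By default it pads the labels with -100, a value the loss function ignores, which is why `compute_metrics` replaces -100 with the pad token before decoding. The following is a minimal plain-Python sketch of that idea; the token ids are made up for illustration, and `pad_labels` and `restore_pad_tokens` are hypothetical helpers rather than library functions.

```python
IGNORE_INDEX = -100   # label value skipped by the loss
PAD_TOKEN_ID = 0      # T5's pad token id


def pad_labels(batch_labels, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (longest - len(labels)) for labels in batch_labels]


def restore_pad_tokens(padded, pad_token_id=PAD_TOKEN_ID):
    """Swap the ignore index back to a real token id so the ids can be decoded."""
    return [[pad_token_id if t == IGNORE_INDEX else t for t in row] for row in padded]


# Two label sequences of different lengths (illustrative ids, ending in EOS id 1)
batch = [[94, 5619, 1], [8, 1043, 686, 11, 1]]

padded = pad_labels(batch)
print(padded)
print(restore_pad_tokens(padded))
```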

In [17]:
nltk.download("punkt")
metric = evaluate.load("rouge")

[nltk_data] Downloading package punkt to
[nltk_data]     /teamspace/studios/this_studio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [18]:
def compute_metrics(eval_preds):
   preds, labels = eval_preds

   # decode preds and labels
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

   result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
  
   return result
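To give a feel for what the ROUGE-1 score reported by `compute_metrics` measures, here is a deliberately simplified version: unigram overlap between prediction and reference, combined into an F1 score. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation exactly; `rouge1_f1` is an illustrative name, and the two sentences are invented examples.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter intersection keeps, per token, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the compressor is not cycling",
    "the compressor may not be cycling",
)
print(round(score, 4))
```

Four of the five predicted words appear in the six-word reference, so precision is 4/5 and recall is 4/6.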

In [19]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 10

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)

In [20]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)



In [21]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,No log,2.149851,0.325831,0.135755,0.264907,0.270366
2,No log,2.093176,0.325019,0.133975,0.26356,0.27024
3,2.070600,2.108422,0.320579,0.137358,0.262536,0.268787
4,2.070600,2.157729,0.329891,0.145051,0.272676,0.278419
5,2.070600,2.221337,0.324979,0.140882,0.266297,0.271246
6,1.278700,2.276956,0.324359,0.14091,0.266741,0.271738
7,1.278700,2.402885,0.326588,0.143634,0.269227,0.275727
8,1.278700,2.450764,0.327633,0.144797,0.267569,0.274056
9,0.897400,2.572177,0.322558,0.139045,0.264247,0.270081
10,0.897400,2.615739,0.328173,0.144781,0.268941,0.274802




TrainOutput(global_step=1770, training_loss=1.314621032025181, metrics={'train_runtime': 592.9578, 'train_samples_per_second': 23.863, 'train_steps_per_second': 2.985, 'total_flos': 443392422460416.0, 'train_loss': 1.314621032025181, 'epoch': 10.0})
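In the log above, training loss keeps falling while validation loss bottoms out early and then rises, and the ROUGE scores stay roughly flat. That pattern usually points to overfitting. The short sketch below picks out the best epochs from the values copied from the table; it only inspects the logged numbers and does not touch the model. `Seq2SeqTrainingArguments` also offers a `load_best_model_at_end` option for this situation, which this notebook does not use.

```python
# (epoch, validation_loss, rouge1) copied from the training log above
LOG = [
    (1, 2.149851, 0.325831),
    (2, 2.093176, 0.325019),
    (3, 2.108422, 0.320579),
    (4, 2.157729, 0.329891),
    (5, 2.221337, 0.324979),
    (6, 2.276956, 0.324359),
    (7, 2.402885, 0.326588),
    (8, 2.450764, 0.327633),
    (9, 2.572177, 0.322558),
    (10, 2.615739, 0.328173),
]

best_by_loss = min(LOG, key=lambda row: row[1])
best_by_rouge = max(LOG, key=lambda row: row[2])

print(f"lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[1]})")
print(f"highest ROUGE-1:        epoch {best_by_rouge[0]} ({best_by_rouge[2]})")
```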

In [30]:
last_checkpoint = "./results/checkpoint-1500"

finetuned_model = T5ForConditionalGeneration.from_pretrained(last_checkpoint)
tokenizer = T5Tokenizer.from_pretrained(last_checkpoint)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
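The checkpoint name can be related back to the training run with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the Trainer's default of saving every 500 steps (no `save_steps` is set in the training arguments above); under those assumptions it reproduces the `global_step=1770` reported by `TrainOutput`.

```python
import math

TRAIN_ROWS = 1415   # size of the train split
BATCH_SIZE = 8      # per_device_train_batch_size
NUM_EPOCHS = 10
SAVE_STEPS = 500    # assumed default save interval

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

# Steps at which a checkpoint would be written during training
checkpoint_steps = list(range(SAVE_STEPS, total_steps + 1, SAVE_STEPS))
epoch_of_last = checkpoint_steps[-1] / steps_per_epoch

print(steps_per_epoch, total_steps)
print(checkpoint_steps)
print(round(epoch_of_last, 2))
```

Under these assumptions, `checkpoint-1500` is the last checkpoint written and corresponds to a point partway through epoch 9, not to the end of training.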


In [32]:
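from textwrap import fill

my_question = "What are the disadvantages of synthetic blend oil?"
# Note: this prompt says "answer to this question", while training used
# "Please answer this question: "; reusing `prefix` keeps the two consistent.
inputs = "Please answer to this question: " + my_question

inputs = tokenizer(inputs, return_tensors="pt")
# generate() is called with its default settings, so the answer is short
outputs = finetuned_model.generate(**inputs)
# decode() keeps special tokens by default, hence the leading <pad> in the output;
# pass skip_special_tokens=True to drop it.
answer = tokenizer.decode(outputs[0])

print(fill(answer, width=80))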



<pad>Synthetic blend is more expensive than conventional oil, and it's less durable than conventional oil


tensor([[   0,   94, 5619,   30,    8, 1043,  686,   11,    8, 1689,   25, 1262,
            5, 6067, 3115,  523,   12,   36, 2130,  334]])
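One way to keep the inference prompt identical to the training prefix, and to tidy the decoded text, is a pair of small helpers like the ones below. This is only a sketch: `build_prompt` and `clean_answer` are hypothetical names, and in practice `tokenizer.decode(..., skip_special_tokens=True)` is the usual way to drop special tokens.

```python
# Same string as `prefix` in the training cell
PREFIX = "Please answer this question: "


def build_prompt(question, prefix=PREFIX):
    """Build the model input exactly as it was built for training."""
    return prefix + question.strip()


def clean_answer(decoded, special_tokens=("<pad>", "</s>")):
    """Remove T5 special-token markers from an already decoded string."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


print(build_prompt("What are the disadvantages of synthetic blend oil?"))
print(clean_answer("<pad>Synthetic blend is more expensive than conventional oil</s>"))
```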