<a href="https://colab.research.google.com/github/EdmilsonSantana/llm-vehicle-repair/blob/main/Assistente_do_Mecanico_TCC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install tokenizers
pip install evaluate
pip install rouge_score
pip install sentencepiece
pip install huggingface_hub



In [2]:
import nltk
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from articles import extract_articles
import pandas as pd



In [4]:
# Load the tokenizer, model, and data collator
MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
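
The data collator created above pads each batch to the length of its longest item, and by default it pads the labels with -100 so those positions are ignored by the loss. A minimal pure-Python sketch of that idea follows; `pad_batch` and the toy token ids are hypothetical stand-ins for illustration, not the `transformers` implementation.

```python
# Illustrative stand-in for dynamic padding; not the transformers implementation.
PAD_TOKEN_ID = 0            # T5 uses id 0 for its pad token
LABEL_PAD_TOKEN_ID = -100   # default label padding value, ignored by the loss


def pad_batch(features):
    """Pad a list of {"input_ids", "labels"} dicts to the longest item in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [LABEL_PAD_TOKEN_ID] * label_pad)
    return batch


# Toy ids chosen arbitrarily; 1 plays the role of the end-of-sequence token
toy_batch = pad_batch([
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "labels": [4, 3, 2, 1]},
])
print(toy_batch["labels"])
```

Padding per batch rather than to a fixed global length keeps short questions from being inflated to the maximum sequence length.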


In [5]:
DATA_NAME = "yahoo_answers_qa"
yahoo_answers_qa = load_dataset(DATA_NAME)



Generating train split:   0%|          | 0/87362 [00:00<?, ? examples/s]

In [11]:
yahoo_answers_qa['train'][9]

{'id': '1274254',
 'question': 'How to boil lobster?',
 'answer': 'Fill a large pot with 1/4 full of water or just enough to cover your lobster and add a generous handful of salt. When it comes to a boil, put the lobster in the pot head first. Then boil for 18 for the first pound and 10 minutes more for each additional pound.  For lobsters over 7 pounds, 8 minutes per additional pound is enough.',
 'nbestanswers': ['Fill a large pot with 1/4 full of water or just enough to cover your lobster and add a generous handful of salt. When it comes to a boil, put the lobster in the pot head first. Then boil for 18 for the first pound and 10 minutes more for each additional pound.  For lobsters over 7 pounds, 8 minutes per additional pound is enough.',
  "Here's how:. . You'll need the following:. . large deep pot. long tongs. live lobsters. boiling salted water. melted butter. . Bring salted water to a rolling boil. Using long tongs, quickly but carefully lower live lobsters into the boiling w

In [12]:
# Prefix every question with an instruction so the model knows the task
prefix = "Please answer this question: "

# Define the preprocessing function

def preprocess_function(examples):
    """Add the prefix to each question, tokenize the text, and set the labels."""
    # The inputs are the prefixed, tokenized questions:
    inputs = [prefix + question for question in examples["question"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    # The labels are the tokenized answers:
    labels = tokenizer(text_target=examples["answer"],
                       max_length=512,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
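
With `batched=True`, the preprocessing function receives a dict of lists rather than a single row. The sketch below mimics that shape with a hypothetical whitespace tokenizer (`toy_tokenize`) in place of `T5Tokenizer`, so the prefixing and truncation steps can be inspected without downloading a model. The length limits are shrunk for illustration and are not the values used in this notebook.

```python
PREFIX = "Please answer this question: "


def toy_tokenize(text, max_length):
    """Hypothetical whitespace tokenizer with truncation; stands in for T5Tokenizer."""
    return text.split()[:max_length]


def toy_preprocess(examples, max_input_len=8, max_label_len=6):
    """Mirror the structure of preprocess_function on a batch (dict of lists)."""
    inputs = [PREFIX + question for question in examples["question"]]
    return {
        "input_ids": [toy_tokenize(text, max_input_len) for text in inputs],
        "labels": [toy_tokenize(answer, max_label_len) for answer in examples["answer"]],
    }


# A one-row batch in the same dict-of-lists layout that Dataset.map passes in
toy_examples = {
    "question": ["How to boil lobster?"],
    "answer": ["Fill a large pot with water and add salt."],
}
toy_out = toy_preprocess(toy_examples)
print(toy_out["input_ids"][0])
print(toy_out["labels"][0])
```

The nine-word answer is cut to six tokens, which is the same effect `truncation=True` has on answers longer than `max_length`.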

In [13]:
# Map the preprocessing function across our dataset
tokenized_dataset = yahoo_answers_qa.map(preprocess_function, batched=True)

Map:   0%|          | 0/87362 [00:00<?, ? examples/s]



In [24]:
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")


In [25]:
def compute_metrics(eval_preds):
    """Decode generated ids and reference ids, then score them with ROUGE."""
    preds, labels = eval_preds

    # -100 marks padded positions that the loss ignores; replace it with the
    # pad token id so the tokenizer can decode both arrays
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLSum expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    return result
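
To make the metric less of a black box, here is a simplified sketch of what a ROUGE-1 F1 score measures: unigram overlap between a prediction and a reference. The helper `rouge1_f1` is hypothetical and uses plain whitespace tokenization with no stemming, so its numbers will generally differ from those returned by `evaluate.load("rouge")`.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "boil the lobster for ten minutes",
    "boil the lobster head first",
)
print(round(score, 4))
```

Here three of the six predicted words appear in the five-word reference, giving a precision of 0.5 and a recall of 0.6.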

In [26]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)
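
As a rough sanity check on the training budget these arguments imply, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation, and that all 87,362 examples reported for the train split are used for training; holding out an evaluation set would lower the count. `estimate_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def estimate_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate optimizer steps, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    # The last, partially filled batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from this notebook: 87,362 examples, BATCH_SIZE = 8, NUM_EPOCHS = 3
steps_per_epoch, total_steps = estimate_steps(87362, 8, 3)
print(steps_per_epoch, total_steps)
```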

In [None]:
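# yahoo_answers_qa ships only a "train" split, so hold out part of it for evaluation.
# The 10% test size and the seed are assumptions; adjust them as needed.
split_dataset = tokenized_dataset["train"].train_test_split(test_size=0.1, seed=42)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)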

In [27]:
data_dir = './data'

In [28]:
articles = extract_articles(data_dir)

In [29]:
print(articles[0])

{'title': 'Car AC Blowing Hot Air', 'content': 'Understanding the Causes\n\nA quick understanding of how air conditioning works can help with understanding what the causes could be. When AC is turned on, refrigerant that flows through the system absorbs heat from your vehicle’s cabin where it’s removed and, through a series of parts and processes, the heat is released into the atmosphere before circulating back and repeating the process. There are several points where something can be wrong, causing warm air rather than cool:\n\nThere isn’t sufficient airflow in the cabin. This could be a problem with a bad blower motor, but more commonly a plugged cabin air filter is the culprit.\nThere isn’t enough refrigerant. The gas that circulates through the system can leak out, preventing it from working efficiently.\nThe compressor may not be cycling. A clutch issue or a compressor failure can prevent the AC system from being able to disperse the heat the refrigerant has absorbed.\nThe expansi

In [30]:
df_articles = pd.DataFrame(articles)

In [31]:
df_articles.head()

                                        title                                            content              category
0                      Car AC Blowing Hot Air  Understanding the Causes\n\nA quick understand...  AC & Climate Control
1                    Does the Car AC Use Gas?  Introduction to Car AC and Heating Systems\n\n...  AC & Climate Control
2  How to Get Rid of a Musty Smell in Your AC  Understanding the Musty Smell\n\nOne of the fi...  AC & Climate Control
3                   How to Use Car Defrosters  Types of Car Defrosters: Rear and Front Defros...  AC & Climate Control
4                 Does AC Affect Gas Mileage?  How Air Conditioning Affects Fuel Consumption\...  AC & Climate Control


In [32]:
df_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 980 entries, 0 to 979
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     980 non-null    object
 1   content   980 non-null    object
 2   category  980 non-null    object
dtypes: object(3)
memory usage: 23.1+ KB


In [44]:
df_articles.iloc[3]['content']

'Types of Car Defrosters: Rear and Front Defrosters\n\nCar defrosters are found in two main places. The front is meant to defrost or defog your front windshield and the side windows, and the rear defrosters is intended to clear the back glass. Both serve the same purpose but are designed for different parts of the vehicle.\n\nRear defrosters are primarily responsible for clearing the rear windshield of frost, ice, and condensation. They consist of heating elements embedded in the glass, which radiate heat to melt away any obstructions. It’s triggered by hitting the defrost switch or button on the HVAC controls.\nFront defrosters are essential for maintaining visibility through the front windshield. They work by blowing air, usually warm air, onto the glass surface to remove fog and ice buildup. It’s activated when the heater control air direction is switched to defrost setting, whether it’s only defrost or a blend with floor or dash vents too.\n\nHow to Use Car Defrosters: Step-by-Step

In [49]:

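from textwrap import fill

# Note: this currently loads the base checkpoint, not locally fine-tuned weights
finetuned_model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-base')
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base')

inputs = tokenizer("Translate from portuguese to english: Ola, tudo bom ?", return_tensors="pt")
# Generate with the model loaded in this cell rather than the global `model`
outputs = finetuned_model.generate(**inputs)
# Special tokens (<pad>, </s>) are kept in the decoded text;
# pass skip_special_tokens=True to tokenizer.decode to drop them
answer = tokenizer.decode(outputs[0])

print(fill(answer, width=80))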



<pad> Ola, tudo bom?</s>


['Use a swivel swivel swivel ']
