# 🤖 Fine-Tuning T5 for Product Review Generation

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
import torch
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

torch.cuda.empty_cache()  # Frees unreferenced memory
torch.cuda.ipc_collect()  # Collects inter-process memory

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Loading the dataset

Our journey begins with preparing our dataset. We'll use a subset of Amazon product reviews for our analysis and training.

#### Loading and Merging Datasets
The "amazon_us_reviews" dataset is no longer available, so we use the similar "McAuley-Lab/Amazon-Reviews-2023" dataset instead and merge its product metadata with the review data.

In [2]:
# You can also choose "Electronics", as in the lesson, but that dataset is larger and takes longer to load
dataset_category = "Software"

meta_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_meta_{dataset_category}", split='full').to_pandas()[['parent_asin', 'title']]
review_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{dataset_category}", split='full').to_pandas()[['parent_asin', 'rating', 'text', 'verified_purchase']]

ds = meta_ds.merge(review_ds, on='parent_asin', how='inner').drop(columns="parent_asin")
ds = ds.rename(columns={"rating":"star_rating", "title":"product_title", "text":"review_body"})

ds = ds[ds['verified_purchase'] & (ds['review_body'].map(len) > 100)].sample(100_000)
ds

| | product_title | star_rating | review_body | verified_purchase |
|---|---|---|---|---|
| 4844282 | ESET Smart Security 2014 Edition - 3 Users | 3.0 | I've used eset for a long time and have been h... | True |
| 83854 | Flight Tracker Plus | 1.0 | All I wanted was a way to track a flight in re... | True |
| 1663526 | Cogs | 1.0 | The game needs permission to &#34;run at start... | True |
| 4243990 | Subway Surfers | 4.0 | It's really fun and addictive but whenever I b... | True |
| 8257 | UniWar | 5.0 | this game is well worth it. don't miss this on... | True |
| ... | ... | ... | ... | ... |
| 1921480 | Ultimate Jewel | 5.0 | This is the best gem matching game I've played... | True |
| 859834 | SHOWTIME | 1.0 | There isn't anything that I am interested in v... | True |
| 589032 | YouTube | 5.0 | What I liked about it was you could search lik... | True |
| 2086682 | NBC News: Breaking News, US News & Live Video | 3.0 | It's a really good and useful app when it work... | True |


#### Encoding and Splitting
Next, we encode our `star_rating` column as class labels (required for a stratified split) and split our dataset into training and testing sets.

In [3]:
# Converting the pandas DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(ds)

# encoding the 'star_rating' column
dataset = dataset.class_encode_column("star_rating")

# Splitting the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column="star_rating")

train_dataset = dataset['train']
test_dataset = dataset['test']
print(train_dataset[0])

{'product_title': 'Paperama', 'star_rating': 4, 'review_body': '...and my 6 year old is better at it then I am! Fun game.  Who knew virtual paper-folding could be so enthralling!', 'verified_purchase': True, '__index_level_0__': 79214}
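
Note that the integer `star_rating` in the printed example is a class index rather than the original rating value. The following is a minimal pure-Python sketch of that mapping, assuming `class_encode_column` stringifies the values and assigns indices in sorted order of the unique names. The helper `encode_classes` is hypothetical and only mimics that behavior; it is not part of the `datasets` library.

```python
def encode_classes(values):
    """Mimic class encoding: stringify, sort the unique names, map each value to its index."""
    names = sorted({str(v) for v in values})
    name_to_index = {name: i for i, name in enumerate(names)}
    return names, [name_to_index[str(v)] for v in values]


# Toy ratings, in the same float format as the star_rating column
ratings = [5.0, 1.0, 3.0, 4.0, 2.0, 5.0]
names, encoded = encode_classes(ratings)

print(names)    # ['1.0', '2.0', '3.0', '4.0', '5.0']
print(encoded)  # [4, 0, 2, 3, 1, 4]
```

Under this assumption, and provided all five ratings occur in the sample, index 4 corresponds to a 5-star review, so adding 1 to the index recovers the original rating.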


### Model Preparation 🛠️
Now, let's prepare our T5 model for training.

#### Tokenizer Initialization

In [4]:
MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### Data Preprocessing Function
We define a function to preprocess our data, preparing it for the model.

In [5]:
# Preprocess a batch of examples into tokenized prompts (inputs) and reviews (labels)
def preprocess_data(examples):
    # Note: after class_encode_column, star_rating holds a class index, not the original rating value
    examples['prompt'] = [f"review: {product_title}, {star_rating} Stars!" for product_title, star_rating in zip(examples["product_title"], examples["star_rating"])]
    examples['response'] = [f"{review_body}" for review_body in examples["review_body"]]

    inputs = tokenizer(examples['prompt'], padding='max_length', truncation=True, max_length=128)
    targets = tokenizer(examples['response'], padding='max_length', truncation=True, max_length=128)

    # Replace padding token ids in the targets with -100 so the loss ignores those positions
    target_input_ids = []
    for ids in targets['input_ids']:
        target_input_ids.append([token_id if token_id != tokenizer.pad_token_id else -100 for token_id in ids])

    inputs.update({'labels': target_input_ids})
    return inputs
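
To make the label-masking step concrete, here is a small standalone sketch of the same idea on toy token ids. The ids are made up for illustration, and `PAD_TOKEN_ID = 0` is hard-coded as an assumption (it matches T5's pad token id) instead of being read from the tokenizer; `mask_padding` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
PAD_TOKEN_ID = 0     # assumed pad id (T5 uses 0); the notebook reads it from the tokenizer


def mask_padding(batch_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace every padding id with IGNORE_INDEX so it does not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in ids]
        for ids in batch_ids
    ]


# Two toy target sequences, already padded to length 5
toy_targets = [[363, 19, 1, 0, 0], [71, 1, 0, 0, 0]]
labels = mask_padding(toy_targets)
print(labels)  # [[363, 19, 1, -100, -100], [71, 1, -100, -100, -100]]
```

Without this masking, the model would also be trained to predict padding tokens, which would dominate the loss for short reviews.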

### Preprocessing the Datasets

In [6]:
train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

As with GPT, we need a data collator to batch the examples when training T5.

In [7]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
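
Since `preprocess_data` already pads every sequence to 128 tokens, there is little left for the collator to pad in this notebook. In general, though, a padding collator pads each batch to the length of its longest sequence. A rough pure-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, not the library's implementation, and the pad id of 0 is an assumption.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths
batch = pad_batch([[5, 6, 7, 1], [8, 1]])
print(batch["input_ids"])       # [[5, 6, 7, 1], [8, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding per batch rather than to a fixed maximum length can reduce wasted computation when most sequences are short.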

### Fine-Tuning the Model 🎯
With our data ready, we proceed to fine-tune the T5 model on our dataset.

In [8]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.gradient_checkpointing_enable()

We load the model class for conditional generation, i.e., given an input sequence, our T5 model is expected to produce a separate output sequence.

In our case, given the product title and star rating, it should generate a full review; this is what we will train it to do.

In [9]:
TRAINING_OUTPUT = "./t5_fine_tuned_reviews"

training_args = TrainingArguments(
    output_dir=TRAINING_OUTPUT,
    num_train_epochs=3, 
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    save_strategy='epoch',
    gradient_checkpointing=True,
    fp16=True,
    logging_steps=10,
    eval_steps=50,  # has no effect unless step-based evaluation is enabled (it is disabled by default)
    save_total_limit=2
)
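
To get a feel for how long these settings would train, here is a rough back-of-the-envelope calculation. It assumes 90,000 training examples (the 100,000 sampled rows minus the 10% test split), a single GPU, and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs,
                          num_devices=1, grad_accum_steps=1):
    """Estimate the number of optimizer steps for a full training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=90_000, per_device_batch_size=1, num_epochs=3)
print(steps)        # 270000
print(steps // 10)  # 27000 log entries with logging_steps=10
```

With a batch size of 1 the run needs a very large number of steps, which is one reason the training cell further down is skipped in favor of loading an already fine-tuned model.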

In [10]:
# trainer = Trainer(
#     args=training_args,
#     model=model,
#     train_dataset=train_dataset,
#     data_collator=data_collator
# )
# # Training is GPU-intensive and slow, so we skip it here and load an already fine-tuned model instead
# trainer.train()
# Saving and Loading the Model 💾
# After training, we save our model for later use and demonstrate how to load it.

# trainer.save_model(TRAINING_OUTPUT)

In [11]:
# Loading the already fine-tuned model for this task from Hugging Face
model = T5ForConditionalGeneration.from_pretrained("TheFuzzyScientist/T5-base_Amazon-product-reviews").to(device)

### Generating Reviews ✍️
Finally, we use our fine-tuned model to generate reviews for new products.

In [37]:
def generate_review(text):
    # Note: this prefix is "review " without the colon used in the training prompts ("review: ...")
    inputs = tokenizer("review "+ text, return_tensors='pt', max_length=512, padding='max_length', truncation=True).to(device)
    # no_repeat_ngram_size=3 prevents any sequence of 3 consecutive tokens from appearing more than once in the output
    # num_beams=6 uses beam search, keeping the 6 most probable candidate sequences at each step instead of a single one
    # early_stopping=True ends the beam search as soon as num_beams finished candidates are available
    outputs = model.generate(inputs['input_ids'], max_length=1024, no_repeat_ngram_size=3, num_beams=6, early_stopping=True)
    # Decode the output and skip the special tokens
    review = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return review
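
The effect of `no_repeat_ngram_size` can be illustrated with a small pure-Python sketch. It works on words for readability, whereas the model applies the constraint to token ids; both helper functions are hypothetical and only approximate the idea, not the library's implementation.

```python
def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram occurs more than once in `tokens`."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


def banned_next_tokens(tokens, n=3):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # the last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


generated = "great product great service great product".split()
print(banned_next_tokens(generated))  # {'great'}
print(has_repeated_ngram("great great great".split()))        # False
print(has_repeated_ngram("great great great great".split()))  # True
```

In the first example, generating "great" next would repeat the 3-gram "great product great", so it is banned. The last two lines show that a single word may still appear three times in a row; only a repeated 3-gram is ruled out.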

In [38]:
test_dataset[0]

{'product_title': 'Goat Evolution',
 'star_rating': 4,
 'review_body': "This game is so cool! I love it it's awesome I have another game like a called Cal volution another called platypus of evolution and they're really fun to . Playing this game can show you the things of God about goats also known as the weird stuff by the way if you don't play this game exactly yeah and you're looking their ratings to see what's in the see anything helpful that could help you when you're playing this game if you want to download it just telling you the fifth thing for either the fifth or the six I forgot one of the go looks like a dog or not a dog they look like dingoes so if you want to play this game it's really fun see you should find it helpful right that put the check on it helpful we don't think that will don't check it helpful .",
 'verified_purchase': True,
 '__index_level_0__': 2559599,
 'prompt': 'review: Goat Evolution, 4 Stars!',
 'response': "This game is so cool! I love it it's awesome

In [39]:
# Testing the model
random_test_products = test_dataset.shuffle(seed=42).select(range(10))["product_title"]

In [40]:
random_test_products

['Picus Wav Player Trial Unlocker',
 'Nyan Cat!',
 'SeekDroid',
 'Hoyle Card Puzzle & Board Games',
 'Parallels Desktop 7 for Mac [Old Version]',
 'Temple Run',
 'Mirrors of Albion',
 'Toilet Paper - Speed Challenge',
 'Funimation',
 'Amazon Alexa']

In [41]:
print(generate_review(random_test_products[0]+ ", 3 Stars!"))

Worked great for a couple of months before it stopped working. Now it's time to get back to work.


In [42]:
print(generate_review(random_test_products[5]+ ", 5 Stars!"))

br />Also, if you're looking for a "smartphone", you'll want to look no further.


In [43]:
print(generate_review(random_test_products[2]+ ", 4 Stars!"))

Great product, great service, great customer service. I would recommend this product to anyone looking for a great product.


In [44]:
print(generate_review(random_test_products[9]+ ", 5 Stars!"))

So far, so good. I haven't had any problems with this product.


In [45]:
print(generate_review(random_test_products[5]+ ", 1 Stars!"))

Worked great for a few months before it stopped working. Now it's time to get back to work.
