# Fine-Tuning T5 for Product Review Generation

In this interactive lab, we fine-tune the T5 (Text-to-Text Transfer Transformer) model to generate product reviews. We cover data preparation, model training, and finally review generation.

### Setup and Installation

First, we install the required libraries so that the environment is ready for the steps ahead.

In [None]:
!pip install numpy==1.25.1
!pip install transformers[torch]
!pip install datasets==2.13.1


### Importing Libraries

Let's import all the necessary modules that will help us load datasets, process data, and utilize the T5 model.


In [None]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

### Data Preparation

We begin by preparing the dataset.

#### Loading and Merging Datasets
The "amazon_us_reviews" dataset is no longer available, so we use the similar "McAuley-Lab/Amazon-Reviews-2023" dataset instead and merge its product metadata with its review data.

### Summary of Data Merging Process

- **Load Datasets**: Load the metadata and review datasets for the "Software" category using the `load_dataset` function from the `datasets` library and convert them to pandas DataFrames.
- **Select Relevant Columns**: Retain only the necessary columns (`parent_asin` and `title` from the metadata; `parent_asin`, `rating`, `text`, and `verified_purchase` from the reviews).
- **Merge Datasets**: Perform an `inner join` on the `parent_asin` column to combine the metadata and review datasets. This ensures that only rows with matching `parent_asin` values in both datasets are included.
- **Drop Redundant Column**: Drop the `parent_asin` column from the resulting DataFrame as it is no longer needed.
- **Rename Columns**: Rename columns for clarity: `rating` to `star_rating`, `title` to `product_title`, and `text` to `review_body`.
- **Filter Data**: Filter the DataFrame to include only reviews from verified purchases and those with a review body longer than 100 characters.
- **Sample Data**: Randomly sample 100,000 reviews from the filtered DataFrame to create a manageable subset of the data.
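
The merge and filter steps above can be tried on a tiny example first. The sketch below uses two made-up tables (the ASINs, titles, and review texts are invented for illustration) and assumes the same column names as the real dataset:

```python
import pandas as pd

# Toy stand-ins for the metadata and review tables (all values are made up).
meta = pd.DataFrame({
    "parent_asin": ["A1", "A2", "A3"],
    "title": ["Clock App", "Radio App", "Orphan App"],
})
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2", "A9"],
    "rating": [4.0, 5.0, 1.0, 3.0],
    "text": ["x" * 150, "short", "y" * 120, "z" * 200],
    "verified_purchase": [True, True, False, True],
})

# The inner join keeps only parent_asin values present in both tables,
# so "A3" (no reviews) and "A9" (no metadata) are dropped.
merged = meta.merge(reviews, on="parent_asin", how="inner").drop(columns="parent_asin")
merged = merged.rename(
    columns={"rating": "star_rating", "title": "product_title", "text": "review_body"}
)

# Keep verified purchases whose review body is longer than 100 characters.
mask = merged["verified_purchase"] & (merged["review_body"].str.len() > 100)
filtered = merged[mask]

print(len(merged), len(filtered))
```

Only one toy review survives both filters: the others are either too short or not from a verified purchase.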


In [None]:
dataset_category = "Software"
# You can also choose "Electronics" as in the lesson, but that dataset is larger and takes longer to load.

meta_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_meta_{dataset_category}", split='full').to_pandas()[['parent_asin', 'title']]
review_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{dataset_category}", split='full').to_pandas()[['parent_asin', 'rating', 'text', 'verified_purchase']]

ds = meta_ds.merge(review_ds, on='parent_asin', how='inner').drop(columns="parent_asin")
ds = ds.rename(columns={"rating":"star_rating", "title":"product_title", "text":"review_body"})

ds = ds[ds['verified_purchase'] & (ds['review_body'].map(len) > 100)].sample(100_000)
ds


| | product_title | star_rating | review_body | verified_purchase |
|---|---|---|---|---|
| 3967541 | Alarm Clock & Timer & Stopwatch & World Clock ... | 4.0 | This dev describes why this application requir... | True |
| 3656321 | SpinArt | 5.0 | This is a very easy to use little app, but the... | True |
| 4106737 | Rain Therapy: Rest, Relax, Unwind | 1.0 | Disappointed: I put the app settings AND my Ki... | True |
| 3553421 | Netflix | 5.0 | I downloaded Netflix on my kindle and I love i... | True |
| 911605 | SLAMMED! | 4.0 | The game was great. I was a little frustrated ... | True |
| ... | ... | ... | ... | ... |
| 820275 | Gold Fish Casino Slots – Free Online Slot Mach... | 5.0 | I enjoy this app but the newest version for my... | True |
| 4412980 | ZOOKEEPER DX | 4.0 | Like the game. Same premise as Tetris Attack ... | True |
| 2394305 | Flats | 5.0 | This game is awesome. But a few changes would ... | True |
| 1845489 | FollowMyHealth® Mobile (Kindle Tablet Edition) | 5.0 | It is very easily accessible and helpful. I c... | True |


#### Encoding and Splitting
Next, we encode the `star_rating` column as class labels, which `stratify_by_column` requires, and split the dataset into training (90%) and test (10%) sets.

In [None]:
# Converting the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(ds)

# Encoding the 'star_rating' column as a ClassLabel (needed for the stratified split)
dataset = dataset.class_encode_column("star_rating")

# Splitting the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column="star_rating")

train_dataset = dataset['train']
test_dataset = dataset['test']
print(train_dataset[:5])
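
It helps to know what class encoding does to the values in `star_rating`. The plain-Python sketch below imitates the idea and is not the library's code. It assumes, as we understand the `datasets` behaviour, that class ids are assigned to the sorted string forms of the unique values:

```python
# Plain-Python imitation of class encoding (not the datasets implementation).
ratings = [4.0, 5.0, 1.0, 5.0, 2.0, 3.0]

# Class names are the sorted string forms of the unique values.
class_names = sorted({str(r) for r in ratings})
name_to_id = {name: i for i, name in enumerate(class_names)}

# Each rating is replaced by the 0-based id of its class name.
encoded = [name_to_id[str(r)] for r in ratings]

# The original rating can be recovered from the class names.
decoded = [float(class_names[i]) for i in encoded]

print(class_names)
print(encoded)
```

Under that assumption a 4.0-star rating is stored as class id 3, which is worth keeping in mind whenever the encoded column is printed or formatted into text.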

### Model Preparation 🛠️

Now, let's prepare our T5 model for training.

### Tokenizer Initialization

In [None]:
MODEL_NAME = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)



### Data Preprocessing Function
We define a function that turns each example into a tokenized prompt and a tokenized target review.

### Why `inputs.update({'labels': target_input_ids})` is Done

During training, the model needs both the input tokens and the target tokens:
- **Input Tokens**: The tokenized prompt that the model conditions on when making predictions.
- **Target Tokens**: The tokenized review, used to compute the loss and update the model's weights. Padding positions in the targets are set to `-100` so that the loss ignores them.

Adding the targets to the `inputs` dictionary under the `labels` key lets the `Trainer` pass inputs and targets to the model together; when `labels` is present, the model returns the loss directly. This is common practice when preparing data for models in the Hugging Face Transformers library.
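
To see why padding positions in the labels are replaced with `-100`, here is a small plain-Python sketch. The token ids and probabilities are made up, and `mask_padding` and `mean_nll` are hypothetical helpers that only imitate how an ignore index removes positions from an averaged loss:

```python
import math

PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # label value that the loss skips

def mask_padding(token_ids, pad_id=PAD_ID):
    """Replace padding ids with IGNORE_INDEX so they do not count towards the loss."""
    return [t if t != pad_id else IGNORE_INDEX for t in token_ids]

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the (made-up) probability assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

target_ids = [37, 419, 1, 0, 0]  # hypothetical ids: two tokens, </s>, then padding
labels = mask_padding(target_ids)
print(labels)  # [37, 419, 1, -100, -100]

probs = [0.5, 0.25, 0.5, 0.01, 0.01]
loss = mean_nll(probs, labels)
print(round(loss, 4))
```

Without the mask, the two padding positions (with their tiny probabilities) would dominate the average even though they carry no information about the review.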


In [None]:
# Defining the function to preprocess the data

# class_encode_column replaced 'star_rating' with 0-based class ids, so we look up
# the original rating name (e.g. '4.0') to put the real star value in the prompt.
rating_names = train_dataset.features['star_rating'].names

def preprocess_data(examples):
    examples['prompt'] = [
        f"review: {product_title}, {int(float(rating_names[star_rating]))} Stars!"
        for product_title, star_rating in zip(examples['product_title'], examples['star_rating'])
    ]
    examples['response'] = list(examples['review_body'])

    inputs = tokenizer(examples['prompt'], padding='max_length', truncation=True, max_length=128)
    targets = tokenizer(examples['response'], padding='max_length', truncation=True, max_length=128)

    # Set -100 at the padding positions of the target tokens so the loss ignores them
    target_input_ids = []
    for ids in targets['input_ids']:
        target_input_ids.append([token_id if token_id != tokenizer.pad_token_id else -100 for token_id in ids])

    inputs.update({'labels': target_input_ids})
    return inputs


### Preprocessing Datasets

### Using DataCollatorWithPadding

`DataCollatorWithPadding` is a utility from the Hugging Face Transformers library that pads the sequences in a batch to the length of the longest sequence in that batch. All sequences in a batch must have the same length before they can be stacked into a tensor, which most deep learning models require.

#### Key Points:
- **Padding Token**: The collator delegates padding to the tokenizer, so it uses the padding token ID defined in the tokenizer.
- **Dynamic Padding**: Instead of padding all sequences to a fixed length, it pads them to the length of the longest sequence in each batch.
- **Efficiency**: Shorter batches mean less memory and computation spent on padding tokens during training.

Note that `preprocess_data` already pads every sequence to `max_length=128`, so in this lab the collator has no additional padding to add. We still pass it to the `Trainer` so that each batch is assembled in a consistent way.
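
The idea of dynamic padding can be sketched in a few lines of plain Python. `pad_batch` below is a hypothetical helper written for illustration; it is not how the collator is implemented, and the token ids are made up:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Here the batch is padded to length 3, the longest sequence it contains, rather than to a fixed maximum such as 128.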


In [None]:
train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


### Fine-Tuning the Model 🎯

With our data ready, we proceed to fine-tune the T5 model on our dataset.

In [None]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

TRAINING_OUTPUT = "./models/t5_fine_tuned_reviews"
training_args = TrainingArguments(
    output_dir=TRAINING_OUTPUT,
    num_train_epochs=3,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()
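
For a sense of how long this run is, here is a back-of-the-envelope count of optimizer steps. It assumes a single device and no gradient accumulation (neither is configured in the arguments above), and `total_optimizer_steps` is our own helper name, not part of `transformers`:

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs, num_devices=1, grad_accum=1):
    """Rough number of optimizer steps for a full training run."""
    examples_per_step = batch_size * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    return steps_per_epoch * epochs

# 90,000 training examples (90% of the 100,000 sampled reviews), batch size 12, 3 epochs.
steps = total_optimizer_steps(90_000, 12, 3)
print(steps)  # 22500
```

With more devices or with gradient accumulation, the step count shrinks accordingly.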

### Saving and Loading the Model 💾

After training, we save our model for later use and demonstrate how to load it.

In [None]:
trainer.save_model(TRAINING_OUTPUT)

In [None]:
# Loading the fine-tuned model saved above
# model = T5ForConditionalGeneration.from_pretrained(TRAINING_OUTPUT)

# Or load an already fine-tuned checkpoint from the Hugging Face Hub:
model = T5ForConditionalGeneration.from_pretrained("TheFuzzyScientist/T5-base_Amazon-product-reviews")


### Generating Reviews ✍️

Finally, we use our fine-tuned model to generate reviews for new products.

In [None]:
# Defining the function to generate reviews
def generate_review(text):
    # A single prompt needs no padding; truncate overly long prompts to 512 tokens
    inputs = tokenizer("review: " + text, return_tensors='pt', max_length=512, truncation=True)
    # Pass input_ids and attention_mask together; beam search with no repeated 3-grams
    outputs = model.generate(**inputs, max_length=128, no_repeat_ngram_size=3, num_beams=6, early_stopping=True)
    review = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return review
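
The `no_repeat_ngram_size=3` argument prevents any sequence of three tokens from appearing twice in the output. The sketch below is a simplified illustration of that rule on word-level tokens; `banned_next_tokens` is a hypothetical helper, not the library's implementation:

```python
def banned_next_tokens(generated, n=3):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens form the start of the next n-gram.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier n-gram starts with the same n-1 tokens, its last token is banned.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

so_far = ["it", "is", "great", "and", "it", "is"]
print(banned_next_tokens(so_far))  # {'great'}
```

Because "it is great" has already been generated, "great" may not follow the second "it is". During generation this check is applied at every step to every candidate beam.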

In [None]:
# Generating reviews for random products
random_products = test_dataset.shuffle(seed=42).select(range(10))['product_title']

print(generate_review(random_products[0] + ", 3 Stars!"))
print(generate_review(random_products[1] + ", 5 Stars!"))
print(generate_review(random_products[2] + ", 2 Stars!"))

Mystical Oracle Cards I've been using these cards for a few months now and they are working great. The only thing I don't like about them is that they are a bit bulky. I'm not sure if it's just me or if they're just me.
XiiaLive - Internet Radio I bought this radio for my daughter for Christmas and she loves it. It's easy to use and the sound quality is great. I would recommend this radio to anyone looking for a good Internet radio.
It's a good product, but it's not as good as I thought it would be. I've been using it for about a month now, and I'm not sure if I'll ever use it again.
