# Consumer Reviews Summarization - Project Part 3


[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/Ariamestra/ConsumerReviews/blob/main/project_part3.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ariamestra/ConsumerReviews/blob/main/project_part3.ipynb)


## 1. Introduction
My goal for this project is to develop a system that generates concise summaries of customer reviews. This will help users quickly skim feedback on products by condensing detailed reviews into short comments. These comments will be categorized as positive, neutral, or negative, according to the sentiment of the rating provided. To achieve this, the system uses the pre-trained T5 model. I selected T5 because it is a text-to-text transformer that is well suited to tasks like summarization, and I opted for the T5-small checkpoint because its size is more manageable. T5 produces abstractive summaries: it generates new text rather than only copying important sentences from the input.<br>
<br>
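The mapping from a star rating to a sentiment category can be sketched as follows. The thresholds (4 and above is positive, 3 is neutral, below 3 is negative) and the function name `rating_to_sentiment` are assumptions for illustration; the notebook does not state the exact cut-offs.

```python
def rating_to_sentiment(rating: float) -> str:
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if rating >= 4:
        return "positive"
    if rating >= 3:
        return "neutral"
    return "negative"


# A few example ratings on the 1-5 scale
for rating in (5.0, 3.0, 1.0):
    print(rating, "->", rating_to_sentiment(rating))
```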

**Data** <br>
The dataset was sourced from Kaggle, specifically the [Consumer Review of Clothing Product](https://www.kaggle.com/datasets/jocelyndumlao/consumer-review-of-clothing-product) dataset. It contains customer reviews from Amazon, covering feedback from buyers about different products. Along with the review text, each record includes a rating, the product type, and columns for material, construction, color, finish, and durability.<br>



In [1]:
# Import all packages needed
#!pip install transformers pandas nltk scikit-learn
#!pip install sentencepiece
#!pip install transformers[torch]


import pandas as pd
import re
import nltk
import torch
import sys
import sentencepiece as spm
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sklearn.model_selection import train_test_split

nltk.download('punkt')       
nltk.download('stopwords')  
nltk.download('wordnet')     

MODEL_NAME = 't5-small'
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

data_URL = 'https://raw.githubusercontent.com/Ariamestra/ConsumerReviews/main/Reviews.csv'
df = pd.read_csv(data_URL)
print(f"Shape: {df.shape}")
df.head()


Shape: (49338, 9)


|   | Title | Review | Cons_rating | Cloth_class | Materials | Construction | Color | Finishing | Durability |
|---|---|---|---|---|---|---|---|---|---|
| 0 |  | Absolutely wonderful - silky and sexy and comf... | 4.0 | Intimates | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 |  | Love this dress! it's sooo pretty. i happene... | 5.0 | Dresses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Some major design flaws | I had such high hopes for this dress and reall... | 3.0 | Dresses | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5.0 | Pants | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Flattering shirt | This shirt is very flattering to all due to th... | 5.0 | Blouses | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [2]:
# Drop all rows with any null values
df = df.dropna()
print(f"Shape: {df.shape}")

# Make lowercase
df['Review'] = df['Review'].str.lower()

# Remove punctuation
df['Review'] = df['Review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# Tokenize reviews
df['Review'] = df['Review'].apply(word_tokenize)

# Remove stop words
stop_words = set(stopwords.words('english'))
df['Review'] = df['Review'].apply(lambda x: [word for word in x if word not in stop_words])

# Lemmatize the words
lemmatizer = WordNetLemmatizer()
df['Review'] = df['Review'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

# Join the words
df['Review'] = df['Review'].apply(lambda x: ' '.join(x))

Shape: (5442, 9)
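
To make the first two cleaning steps concrete, here is a minimal sketch that applies the same lowercasing and punctuation regex to one sample sentence. It uses only the standard library; the helper name `basic_clean` is hypothetical, and the tokenization, stop-word, and lemmatization steps from the cell above are left out because they need the NLTK data.

```python
import re


def basic_clean(text: str) -> str:
    """Lowercase the text and drop every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())


sample = "Love this dress! It's sooo pretty."
print(basic_clean(sample))  # love this dress its sooo pretty
```

Note that the apostrophe is removed as well, so a contraction such as "it's" becomes "its" before the stop-word step runs.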


In [4]:
# Split the dataset
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Reset the indices 
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

total_samples = df.shape[0]
train_size = X_train.shape[0]
test_size = X_test.shape[0]

train_percentage = (train_size / total_samples) * 100
test_percentage = (test_size / total_samples) * 100

print(f"Train size: {train_size} ({train_percentage:.2f}%)")
print(f"Test size: {test_size} ({test_percentage:.2f}%)")

Train size: 4353 (79.99%)
Test size: 1089 (20.01%)
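
The reported sizes follow from simple arithmetic. As far as I understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the training set. A quick check using the row count printed after `dropna()`:

```python
import math

total_samples = 5442   # rows left after df.dropna()
test_fraction = 0.2    # test_size passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training
test_size = math.ceil(total_samples * test_fraction)
train_size = total_samples - test_size

print(f"Train size: {train_size} ({train_size / total_samples * 100:.2f}%)")
print(f"Test size: {test_size} ({test_size / total_samples * 100:.2f}%)")
```

This is why the split comes out as 79.99% / 20.01% rather than an exact 80/20.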


In [5]:
from torch.utils.data import Dataset, DataLoader
# -------------------------------------------------------------------------------------------------------------------
class CustomDataset(Dataset):
    """Wraps the review DataFrame so that each item is tokenized on access."""

    def __init__(self, dataframe, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.max_length = max_length
        self.text = dataframe['Review'].tolist()  # cleaned review text

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text = str(self.text[index])
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True
        )
        return {
            'input_ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
            # Note: this is one integer rating per review, whereas T5ForConditionalGeneration
            # expects labels to be a sequence of target token IDs
            'labels': torch.tensor(int(self.data['Cons_rating'].iloc[index]), dtype=torch.long)
        }

# Build the training and evaluation datasets from the two splits
train_dataset = CustomDataset(X_train, tokenizer, max_length=512)
eval_dataset = CustomDataset(X_test, tokenizer, max_length=512)


In [None]:
class CustomDataset(Dataset):
    def __init__(self, tokenizer, texts, labels, max_length):
        self.tokenizer = tokenizer
        self.texts = texts
        self.labels = labels
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        # Tokenize the text to get input_ids and attention_mask
        inputs = self.tokenizer.encode_plus(
            str(self.texts[index]),
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt' # Ensure that the tokenizer returns PyTorch tensors
        )

        # Retrieve the label
        label = torch.tensor(self.labels[index], dtype=torch.long)

        # Return a dictionary with one entry per model input
        return {
            'input_ids': inputs['input_ids'].squeeze(0), # Remove the batch dimension
            'attention_mask': inputs['attention_mask'].squeeze(0), # Remove the batch dimension
            'labels': label
        }

# Take the texts and labels from the training split
texts = X_train['Review'].tolist()
labels = X_train['Cons_rating'].astype(int).tolist()
dataset = CustomDataset(tokenizer=tokenizer, texts=texts, labels=labels, max_length=512)

# Then create the DataLoader
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# When you iterate over the DataLoader, it will collate the individual items
# into a single batch, which is then formatted correctly for the Trainer
for batch in data_loader:
    # Now batch is a dictionary with the keys 'input_ids', 'attention_mask', and 'labels'
    # where the values are tensors with the first dimension being the batch size
    print(batch['input_ids'].shape)  # e.g., torch.Size([32, 512])
    print(batch['attention_mask'].shape)  # e.g., torch.Size([32, 512])
    print(batch['labels'].shape)  # e.g., torch.Size([32])
    break  # Just as an example to print the first batch


In [6]:
from torch.utils.data import DataLoader

# Define the batch size you want
batch_size = 32

# Create the DataLoader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

first_batch = next(iter(train_loader))
print(first_batch['input_ids'].shape)


torch.Size([32, 512])
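
With 4353 training rows and a batch size of 32, the loader does not divide evenly. The sketch below works out how many batches one epoch yields and how large the final batch is, assuming the `DataLoader` default of `drop_last=False`.

```python
train_rows = 4353   # size of the training split reported above
batch_size = 32

full_batches, remainder = divmod(train_rows, batch_size)
num_batches = full_batches + (1 if remainder else 0)
last_batch_size = remainder or batch_size

print(f"Batches per epoch: {num_batches}")
print(f"Last batch size: {last_batch_size}")
```

So the shape `[32, 512]` printed above holds for every batch except the final one, which contains a single review.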


The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure.

In [7]:
# Stand-alone draft of CustomDataset.__getitem__; it expects self.texts and self.labels to exist
def __getitem__(self, index):
    # Retrieve the input data
    inputs = self.tokenizer.encode_plus(
        self.texts[index],
        max_length=self.max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    # Extract the input IDs and attention mask
    input_ids = inputs['input_ids'].squeeze()
    attention_mask = inputs['attention_mask'].squeeze()

    # Retrieve the label
    label = torch.tensor(self.labels[index], dtype=torch.long)

    # Return a dictionary
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': label
    }


In [8]:
from transformers import Trainer, TrainingArguments
import torch


training_args = TrainingArguments(
    output_dir='./results',          
    num_train_epochs=3,              
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=16,   
    warmup_steps=500,                
    weight_decay=0.01,               
    logging_dir='./logs',            
    logging_steps=10,
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",     
)

trainer = Trainer(
    model=model,                        
    args=training_args,                  
    train_dataset=train_dataset,         
    eval_dataset=eval_dataset,          
)

trainer.train()


ValueError: not enough values to unpack (expected 2, got 1)
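
The `ValueError` above is consistent with how the dataset builds `labels`: each item carries a single integer rating, whereas `T5ForConditionalGeneration` expects `labels` to be a sequence of target token IDs for each example. One way to address this, sketched below, is to tokenize a target text for every review. Using the `Title` column as that target is my assumption here, not something the notebook does, and `build_example` is a hypothetical helper. It only relies on the tokenizer being callable with `max_length`, `padding`, and `truncation` and exposing `pad_token_id`, as the Hugging Face tokenizers do.

```python
def build_example(review, target, tokenizer, max_input_len=512, max_target_len=32):
    """Encode one (review, target) pair in the shape a seq2seq model expects."""
    source = tokenizer(
        "summarize: " + str(review),
        max_length=max_input_len,
        padding="max_length",
        truncation=True,
    )
    target_ids = tokenizer(
        str(target),
        max_length=max_target_len,
        padding="max_length",
        truncation=True,
    )["input_ids"]

    # Replace padding positions with -100 so the loss ignores them
    labels = [tok if tok != tokenizer.pad_token_id else -100 for tok in target_ids]

    return {
        "input_ids": source["input_ids"],
        "attention_mask": source["attention_mask"],
        "labels": labels,  # a sequence of token IDs, not a single rating
    }
```

Inside `CustomDataset.__getitem__`, the returned lists would be wrapped with `torch.tensor(..., dtype=torch.long)`, so that a batch of `labels` has shape `[batch_size, max_target_len]` instead of `[batch_size]`.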

Evaluate and save model

In [None]:
trainer.evaluate()

model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')


In [None]:
def summarize_review(review):
    # Prefix the input with the "summarize: " task name and truncate long reviews
    input_ids = tokenizer.encode(
        "summarize: " + review, return_tensors="pt", max_length=512, truncation=True
    ).to(model.device)
    summary_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Summarize the first five reviews from the test split
reviews = X_test['Review'].tolist()
for review in reviews[:5]:
    summarized_review = summarize_review(review)
    print(f"Original review: {review}")
    print(f"Summarized review: {summarized_review}\n")

## Conclusion
In conclusion, the project uses the pre-trained T5 model to transform extensive customer reviews into brief summaries. This makes user evaluations more efficient by giving a quick understanding of product feedback. The summarization tool shows a practical use of text-to-text transformers in a real-world scenario, simplifying the decision-making process for consumers.