# Review Summarization using GPT2

## Author: 
ShrugalTayal (shrugal20408@iiitd.ac.in)

## Introduction
This notebook guides through fine-tuning Hugging Face's GPT-2 model on Amazon Fine Food Reviews dataset for summarization.

Through this exploration, we seek to improve retrieval accuracy and user experience in information retrieval applications.

## Data Preparation

### Load the Amazon Fine Food Reviews dataset

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../res/Reviews.csv')

In [7]:
# Retrieve the column names of the DataFrame
data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [8]:
# Retrieve the first 10 rows of the DataFrame
# This line returns a new DataFrame containing the first 10 rows of the original DataFrame
# It's useful for quickly inspecting the structure and content of the dataset
data.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [9]:
# Find NaN values in each column
nan_values_per_column = data.isna().sum()

print("NaN values in each column:")
print(nan_values_per_column)

NaN values in each column:
Id                         0
ProductId                  0
UserId                     0
ProfileName               26
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64


In [3]:
# Print the shape of the DataFrame before dropping NaN values
print("Shape of DataFrame before dropping rows with NaN values:", data.shape)

# Drop rows with NaN values
data = data.dropna()

# Print the shape of the DataFrame after dropping NaN values
print("Shape of DataFrame after dropping rows with NaN values:", data.shape)

Shape of DataFrame before dropping rows with NaN values: (568454, 10)
Shape of DataFrame after dropping rows with NaN values: (568401, 10)


In [4]:
# Take only 25% of the data randomly
data_sampled = data.sample(frac=0.25, random_state=42)  # Adjust the random_state as needed for reproducibility

# Print the shape of the sampled DataFrame
print("Shape of sampled DataFrame:", data_sampled.shape)

Shape of sampled DataFrame: (142100, 10)


### Text Preprocessing on the ‘Text’ and ‘Summary’ column from the dataset

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# Define a function for text preprocessing
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters and numbers using regex
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join tokens back into text
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text

In [8]:
# Clean and preprocess the 'Text' column
data_sampled['cleaned_text'] = data_sampled['Text'].apply(preprocess_text)

In [9]:
# Clean and preprocess the 'Summary' column
data_sampled['cleaned_summary'] = data_sampled['Summary'].apply(preprocess_text)

In [33]:
df = data_sampled[['cleaned_text', 'cleaned_summary']]

# Verify the resulting DataFrame
df.head()

Unnamed: 0,cleaned_text,cleaned_summary
18828,avocado oil beat olive oil dressing anything v...,avocado oil
363857,new year resolution season bought protein plus...,healthy tastesoky
342609,pretty sad got peach indeed delicious stated c...,peach china
62213,find people north america brew tea long period...,perhaps brewing wrong
467133,love recycled work well storebought product ce...,work great


## Model Training

1. Initialize GPT-2 Tokenizer and Model
2. Data Splitting: 75:25 for training and testing.
3. Custom Dataset Class for data prep.
4. Fine-Tune GPT-2 on review dataset.
5. Hyperparameter Tuning for optimization.

In [51]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from torch.utils.data import Dataset
from transformers import TrainerCallback

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# Initialize GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Add padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token

class AmazonReviewDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['cleaned_text']
        summary = self.data.iloc[idx]['cleaned_summary']

        input_text = summary + ' ' + text
        inputs = self.tokenizer(input_text, padding='max_length', truncation=True, max_length=self.max_length, return_tensors='pt')
        labels = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_length, return_tensors='pt')['input_ids']

        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'labels': labels.flatten()
        }


# Create training and testing datasets
train_dataset = AmazonReviewDataset(train_df, tokenizer)
test_dataset = AmazonReviewDataset(test_df, tokenizer)


class DeleteCheckpointCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        if args.output_dir is not None:
            # Delete checkpoints and logs after each step
            control.should_save = False
            control.should_log = False

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",  # Directory to save model checkpoints and logs
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="epoch",
    logging_dir="./logs",
)

# Define Trainer object with DeleteCheckpointCallback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[DeleteCheckpointCallback()]
)

# Fine-tune the model
trainer.train()

# Save the trained model and tokenizer
model.save_pretrained("fine_tuned_gpt2_model")
tokenizer.save_pretrained("fine_tuned_gpt2_model")

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


  0%|          | 0/2665 [00:00<?, ?it/s]

  0%|          | 0/889 [00:00<?, ?it/s]

{'eval_loss': 2.584134101867676, 'eval_runtime': 1297.9282, 'eval_samples_per_second': 2.737, 'eval_steps_per_second': 0.685, 'epoch': 1.0}
{'train_runtime': 21533.9564, 'train_samples_per_second': 0.495, 'train_steps_per_second': 0.124, 'train_loss': 2.555314112335835, 'epoch': 1.0}


('fine_tuned_gpt2_model\\tokenizer_config.json',
 'fine_tuned_gpt2_model\\special_tokens_map.json',
 'fine_tuned_gpt2_model\\vocab.json',
 'fine_tuned_gpt2_model\\merges.txt',
 'fine_tuned_gpt2_model\\added_tokens.json')

## Model Evaluation

Compute ROUGE scores on test set to assess model performance.

In [3]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from rouge import Rouge

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("fine_tuned_gpt2_model")

# Load model
model = GPT2LMHeadModel.from_pretrained("fine_tuned_gpt2_model")

# Define review text
review_text = "The Fender CD-60S Dreadnought Acoustic Guitar is a great instrument for beginners. It has a solid construction, produces a rich sound, and feels comfortable to play. However, some users have reported issues with the tuning stability."
# review_text = input("Given Review Text: ")

# Tokenize the review text
inputs = tokenizer(review_text, return_tensors="pt")

# Generate summary
generated_summary_ids = model.generate(inputs.input_ids, max_length=50, num_beams=4, early_stopping=True)
generated_summary = tokenizer.decode(generated_summary_ids[0], skip_special_tokens=True)

# Print generated summary
print("Generated Summary:", generated_summary)

# Define actual summary
actual_summary = "Good for beginners but has tuning stability issues."
# actual_summary = input("Given Summary :")

# Compute ROUGE scores
rouge = Rouge()
scores = rouge.get_scores(generated_summary, actual_summary, avg=True)

# Print ROUGE scores
print("ROUGE-1: Precision:", scores["rouge-1"]["p"], "Recall:", scores["rouge-1"]["r"], "F1-Score:", scores["rouge-1"]["f"])
print("ROUGE-2: Precision:", scores["rouge-2"]["p"], "Recall:", scores["rouge-2"]["r"], "F1-Score:", scores["rouge-2"]["f"])
print("ROUGE-L: Precision:", scores["rouge-l"]["p"], "Recall:", scores["rouge-l"]["r"], "F1-Score:", scores["rouge-l"]["f"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Summary: The Fender CD-60S Dreadnought Acoustic Guitar is a great instrument for beginners. It has a solid construction, produces a rich sound, and feels comfortable to play. However, some users have reported issues with the tuning stability.
ROUGE-1: Precision: 0.17647058823529413 Recall: 0.75 F1-Score: 0.2857142826303855
ROUGE-2: Precision: 0.05714285714285714 Recall: 0.2857142857142857 F1-Score: 0.09523809246031753
ROUGE-L: Precision: 0.14705882352941177 Recall: 0.625 F1-Score: 0.2380952350113379
