# Summarizing Review Texts

This notebook fine-tunes a pre-trained model to summarize the closed-caption (CC) texts of headphone review videos.

In [1]:
import numpy as np
import pandas as pd

import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

import nltk
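# Uncomment on first run to fetch the NLTK data used below
#nltk.download('stopwords')
#nltk.download('punkt')  # tokenizer data needed by word_tokenize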

## Text Cleaning if Needed

Text cleaning follows here.

In [3]:
YT_df = pd.read_csv('youtube_reviews.csv')
YT_df.head(3)

Unnamed: 0,Headphone_Name,Sony_Review_Text
0,sony xm4 earbuds,"The Sony WF-1000XM4 earbuds, which is a mouth..."
1,sony xm4 earbuds,(wind rushing)\n(slow music) - As much as I l...
2,sony xm4 earbuds,[Music] what's going on guys it's your averag...


In [14]:
YT_df.rename(columns={'Sony_Review_Text': 'Review_Text'}, inplace=True)

YT_df = YT_df.dropna()

In [17]:
stop_words = set(stopwords.words('english'))

stemmer = PorterStemmer()

emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # Emoticons
                           u"\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                           u"\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                           u"\U0001F700-\U0001F77F"  # Alchemical Symbols
                           u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                           u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           u"\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
                           u"\U0001FC00-\U0001FCFF"  # Currently unassigned range
                           u"\U0001F004-\U0001F0CF"  # Mahjong, domino, and playing-card symbols
                           u"\U0001F170-\U0001F251"  # Enclosed alphanumeric and ideographic supplements
                           "]+", flags=re.UNICODE)

In [18]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove emojis
    text = emoji_pattern.sub(r'', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    words = word_tokenize(text)
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Join the words back into a string
    preprocessed_text = ' '.join(words)
    return preprocessed_text

YT_df['Review_Text'] = YT_df['Review_Text'].apply(preprocess_text)

In [19]:
YT_df.head(3)

Unnamed: 0,Headphone_Name,Review_Text
0,sony xm4 earbuds,soni wfxm earbud mouth name sit toward top ear...
1,sony xm4 earbuds,wind rush slow music much love airpod pro admi...
2,sony xm4 earbuds,music what go guy averag consum today final go...


In [21]:
YT_df['Headphone_Name'].unique()

array(['sony xm4 earbuds', 'Galaxy Buds2 Pro', 'Sennheiser MTW3',
       'Bose Quietcomfort Earbuds 2', 'Bose Quietcomfort Earbuds',
       'Beats Fit Pro Earbuds', 'AirPods Pro 2 Earbuds',
       'Sony Linkbuds S', 'Sony Linkbuds original', 'Pixel Buds Pro',
       'Soundcore Liberty 3', 'AirPods 3', 'Jabra Elite 7 Pro',
       'Sony WF-1000XM5', '1MORE Evo', 'Buy LG TONE TF8'], dtype=object)

In [31]:
YT_df.groupby(by = 'Headphone_Name').agg(lambda x: ' '.join(x)).reset_index()

Unnamed: 0,Headphone_Name,Review_Text
0,1MORE Evo,video sponsor theyv got us check evo earbud to...
1,AirPods 3,music hey what mkbhd okay im last stage macboo...
2,AirPods Pro 2 Earbuds,music right ill honest wasnt even plan review ...
3,Beats Fit Pro Earbuds,hey got airpod third gener last week turn appl...
4,Bose Quietcomfort Earbuds,final bose new dollar nois cancel quietcomfort...
5,Bose Quietcomfort Earbuds 2,bose quietcomfort earbud frustrat product grea...
6,Buy LG TONE TF8,hi everyon welcom channel lg quit busi late an...
7,Galaxy Buds2 Pro,six month sinc bought pair samsung galaxi bud ...
8,Jabra Elite 7 Pro,jabra claim reinvent true wireless earbud new ...
9,Pixel Buds Pro,music pixel bud seri cheapest one right cost n...


## Manually gathering summaries for headphones

In [2]:
orig_df = pd.read_csv("youtube_reviews.csv")

headphone_names = pd.Series(orig_df['Headphone_Name'].unique())
headphone_names

0                sony xm4 earbuds
1                Galaxy Buds2 Pro
2                 Sennheiser MTW3
3     Bose Quietcomfort Earbuds 2
4       Bose Quietcomfort Earbuds
5           Beats Fit Pro Earbuds
6           AirPods Pro 2 Earbuds
7                 Sony Linkbuds S
8          Sony Linkbuds original
9                  Pixel Buds Pro
10            Soundcore Liberty 3
11                      AirPods 3
12              Jabra Elite 7 Pro
13                Sony WF-1000XM5
14                      1MORE Evo
15                Buy LG TONE TF8
dtype: object

Let's get review summaries from blogs and put them in a list in the same order as `headphone_names`. First, focus on reviews from Engadget.

Here are the links used to get the text: 
https://www.engadget.com/sony-wf-1000xm4-review-160006474.html

https://www.engadget.com/samsung-galaxy-buds-2-pro-review-160057740.html

No review for Sennheiser MTW3

https://www.engadget.com/bose-quietcomfort-earbuds-2-review-130026306.html

https://www.engadget.com/bose-quietcomfort-earbuds-review-144502194.html

https://www.engadget.com/beats-fit-pro-review-140004462.html

https://www.engadget.com/airpods-pro-review-second-generation-130048218.html

https://www.engadget.com/sony-linkbuds-review-170020552.html

No review for original linkbuds

https://www.engadget.com/google-pixel-buds-review-170044941.html

No review for Soundcore Liberty 3

No review for AirPods 3

No review for Jabra Elite 7 Pro

https://www.engadget.com/sony-wf-1000xm5-earbuds-review-striving-for-perfection-160023581.html#:~:text=Wrap-up,riddles%20Sony%20needed%20to%20solve.

No review for 1MORE Evo

No review for Buy LG TONE TF8

In [3]:
engadget = ['Sony nearly did it again. The company has dominated both over-ear and true wireless product categories for the last few years. It has a knack for creating a compelling combination of sound quality, noise cancelling performance, customization and features. None of the competition comes close to what the WF-1000XM4 offers in terms of what the earbuds can do for you automatically with features like Adaptive Sound Control and Speak-to-Chat. These are almost the complete package, if only the new ear tips offered a better fit. Even the best of the three pairs included in the box never felt truly comfortable. I only found relief when I grabbed the silicone tips from the M3 instead, and most people won’t have access to those. It seems so simple, but if you mess it up, a basic thing like ear tips can nearly ruin otherwise stellar earbuds. The WF-1000XM4 is available now in black and silver color options for $280.',
            'I’ve said a set of Samsung’s Galaxy Buds are its best yet before – more than once. That’s because the company continues to improve its formula with each subsequent release, whether that’s the regular Buds or the Buds Pro. And now I have to declare it again. The Buds 2 Pro are a huge leap from the 2021 Pro model, with massive improvements to the audio, notable gains in noise cancellation and the introduction of several new features. Samsung lets its loyal customers unlock the best of the Buds 2 Pro, the same way Apple and Google have done. That’s not likely to change, but Samsung is making a strong case for owners of its phones to invest in its audio products too.',
            '',
            'If it’s supreme noise blocking you’re looking for in your next set of true wireless earbuds, the QCE II is the choice. With the updates Bose delivers here with the help of CustomTune, not only is the ANC noticeably better than the previous model, but overall audio quality and ambient sound mode are also improved. Sure, I’d like more than six hours of battery life and conveniences like multipoint connectivity and wireless charging should be standard fare at this point. For $299, I’d expect some of those basics to be included and Bose passed on them.',
            'Bose has come a long way since the SoundSport Free. The company had years to perfect its next set(s) of true wireless earbuds, and it’s created a tempting package. The QuietComfort Earbuds have powerful ANC and great overall sound quality, plus premium features like wireless charging. The limited customization and touch controls could be a headache for some, and the large-sized buds create a look some may not want. And when you factor in price, Sony’s WF-1000XM3 is an attractive alternative despite its age. Bose and Sony have done battle over noise-cancelling headphones during the last few years, now they’re doing the same for true wireless earbuds. And Bose finally has a product that can give Sony a run for its money.',
            'If you’re looking for the best of what AirPods has to offer in earbuds that don’t have the polarizing stick apparatus, the Beats Fit Pro should do the trick. They offer a nice blend of features, sound and noise-cancelling performance for the price. Sure, there are better options but they also cost significantly more, especially if you’re looking for the absolute best audio quality. For now, Beats is giving the masses an AirPods alternative that’s actually still packed with Apple tech. And that’s an interesting proposition for iPhone owners.',
            'Apple’s noise-canceling earbuds were way overdue for an update. While the company didn’t see the need to change the overall design, it did extensive upgrades on the inside, introducing new features and improving performance along the way. Importantly, it made all of these changes while keeping the price at $249. Things like improved audio, more powerful ANC, Adaptive Transparency and even the upgrades to the charging case make the new AirPods Pro a worthwhile update to a familiar formula. Let’s just hope we don’t have to wait another three years for a full redesign.',
            'Sony largely succeeded at what it set out to do: It built a set of true wireless earbuds that offers transparent audio by design rather than relying on microphones to pipe in ambient sound. Indeed, the LinkBuds blend your music, podcasts or videos with whatever is going on around you. There are certainly benefits for this, whether it be the ability to be less of a jerk in the office or to stay safe outdoors. Even with all of the handy tech Sony packs in, earbuds need to be comfortable enough to wear for long periods of time, and the area around the unique ring-shaped drivers is simply too hard to be accommodating. Consistent audio performance would make a big difference, too. For now, the LinkBuds are an interesting product that could be more compelling with some refinements. Hopefully Sony will do just that, because I’m very much looking forward to version 2.0. The LinkBuds are available to order today from Amazon and Best Buy in grey and white color options for $180.',
            '',
            'Google’s best earbuds yet are also its most complete package thus far. All of the features that made 2020’s redesigned Pixel Buds and the A-Series follow-up such compelling options for Android users, especially Pixel owners, are back. And while the Pixel Buds Pro are $20 more than what we got two years ago, the 2022 version is much improved. Active noise cancellation and the refined sound quality are equally impressive, and well worth the extra money. As long as Google can deliver spatial audio quickly and it works well, the only thing lacking is call quality, which may not be a dealbreaker for you.',
            '',
            '',
            '',
            'With the WF-1000XM5, Sony improves its already formidable mix of great sound, effective ANC and handy features. These earbuds are undoubtedly the company’s best and most comfortable design in its premium model so far, which was one of the few remaining riddles Sony needed to solve. For all of the company’s ability to add so many features, many of them still need fine-tuning, but that doesn’t make them any less useful in their current state. The WF-1000XM5 are more expensive too, which means the competition has one key area it can beat Sony. As is typically the case, there aren’t many flaws with the company’s latest model and its rivals still have their work cut out for them. The WF-1000XM5 are available for pre-order now in black and silver color options for $300. According to Amazon, the earbuds will ship on August 4th.',
            '',
            '']

In [4]:
summaries_df = pd.concat([headphone_names, pd.Series(engadget)], axis = 1)

summaries_df = summaries_df.rename(columns={0: 'Headphone_Name', 1: 'Summary'})

summaries_df = summaries_df.set_index('Headphone_Name')

summaries_df

Unnamed: 0_level_0,Summary
Headphone_Name,Unnamed: 1_level_1
sony xm4 earbuds,Sony nearly did it again. The company has domi...
Galaxy Buds2 Pro,I’ve said a set of Samsung’s Galaxy Buds are i...
Sennheiser MTW3,
Bose Quietcomfort Earbuds 2,If it’s supreme noise blocking you’re looking ...
Bose Quietcomfort Earbuds,Bose has come a long way since the SoundSport ...
Beats Fit Pro Earbuds,If you’re looking for the best of what AirPods...
AirPods Pro 2 Earbuds,Apple’s noise-canceling earbuds were way overd...
Sony Linkbuds S,Sony largely succeeded at what it set out to d...
Sony Linkbuds original,
Pixel Buds Pro,Google’s best earbuds yet are also its most co...
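
The `engadget` list lines up with `headphone_names` purely by position, so one missing or misplaced entry would silently shift every later summary onto the wrong product. A minimal sketch of a keyed alternative follows. The names `summary_by_name` and `align_summaries` are hypothetical, and the summary strings are shortened placeholders rather than the real review text.

```python
# Hypothetical keyed lookup: product name -> summary text (placeholders only).
summary_by_name = {
    "sony xm4 earbuds": "Sony nearly did it again. ...",
    "Galaxy Buds2 Pro": "The Buds 2 Pro are a huge leap ...",
}

def align_summaries(names, lookup, missing=""):
    """Return one summary per name, using `missing` where no review exists."""
    return [lookup.get(name, missing) for name in names]

names = ["sony xm4 earbuds", "Galaxy Buds2 Pro", "Sennheiser MTW3"]
aligned = align_summaries(names, summary_by_name)
print(aligned[2] == "")  # the product without a review gets an empty summary
```

The resulting list could be passed to `pd.Series(...)` in the same way as `engadget`, and reordering `headphone_names` would no longer affect the pairing.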


## Grouping Reviews per headphone to get training dataset.

In [5]:
orig_df['Sony_Review_Text'] = orig_df['Sony_Review_Text'].astype(str)

#concatenating all review texts for each headphone so we have one row per headphone
grouped_orig_df  = orig_df.groupby(by = 'Headphone_Name')['Sony_Review_Text'].apply(''.join)

In [6]:
reviews_summaries_df = pd.concat([grouped_orig_df, summaries_df], axis=1)
reviews_summaries_df = reviews_summaries_df.rename(columns = {'Sony_Review_Text':'Review_Text'})
reviews_summaries_df

Unnamed: 0_level_0,Review_Text,Summary
Headphone_Name,Unnamed: 1_level_1,Unnamed: 2_level_1
1MORE Evo,- This video was sponsored by 1MORE. And they...,
AirPods 3,[Music] hey what's up mkbhd here okay so i'm ...,
AirPods Pro 2 Earbuds,[Music] all right I'll be honest I wasn't eve...,Apple’s noise-canceling earbuds were way overd...
Beats Fit Pro Earbuds,hey there so we got the airpods third generat...,If you’re looking for the best of what AirPods...
Bose Quietcomfort Earbuds,finally bose's new 280 dollar noise canceling...,Bose has come a long way since the SoundSport ...
Bose Quietcomfort Earbuds 2,The Bose QuietComfort Earbuds are a \nfrustra...,If it’s supreme noise blocking you’re looking ...
Buy LG TONE TF8,hi everyone and welcome to the channel lg hav...,
Galaxy Buds2 Pro,It's been six months since I bought \na pair ...,I’ve said a set of Samsung’s Galaxy Buds are i...
Jabra Elite 7 Pro,jabra claims to have reinvented true wireless...,
Pixel Buds Pro,[Music] so the pixel buds a series the cheape...,Google’s best earbuds yet are also its most co...


In [7]:
len(reviews_summaries_df.loc['1MORE Evo', 'Review_Text'])

95035

In [8]:
len(orig_df.iloc[0, 1])

10576

## Pegasus Fine-Tuning

Testing out pre-trained summarization models (Pegasus, with a quick configuration check on BART).

Borrowed some code from https://github.com/rohan-paul/MachineLearning-DeepLearning-Code-for-my-YouTube-Channel/blob/master/NLP/Fine_Tuning_Pegasus_for_Text_Summarization.ipynb

The Hugging Face course chapter below is also useful and seems to be where that code came from:
https://huggingface.co/learn/nlp-course/chapter7/5?fw=pt

In [9]:
#!pip install tokenizers==0.13.2
#!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q
#!pip install bert-extractive-summarizer
#!pip install transformers
#!pip install --upgrade huggingface-hub
#!pip install rouge_score
#!conda install -c pytorch pytorch

In [10]:
from transformers import set_seed, pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, BertTokenizer
from nltk.tokenize import sent_tokenize
import torch

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

from datasets import Dataset, DatasetDict

In [18]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "google/pegasus-cnn_dailymail"
#model_ckpt = "google/pegasus-multi_news"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
from tqdm import tqdm  # progress bar used by calculate_metric_on_test_ds

def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements.

    Splitting the dataset into smaller batches lets us process several
    examples simultaneously.

    Args:
        list_of_elements (List[Any]): The list to be divided into chunks.
        batch_size (int): The size of chunks.

    Yields:
        List[Any]: A chunk from the list of the specified size.
    """
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, 
                               batch_size=2, device=device, 
                               column_text="Review_Text", 
                               column_summary="Summary"):
    """
    Calculates a specified metric on a test dataset.

    Args:
        dataset (Dataset): The dataset to evaluate.
        metric (Metric): The metric to calculate.
        model (nn.Module): The model to evaluate.
        tokenizer (Tokenizer): The tokenizer to use for text processing.
        batch_size (int, optional): The batch size for evaluation.
        device (torch.device, optional): The device to use for computation.
        column_text (str, optional): The name of the text column in the dataset.
        column_summary (str, optional): The name of the summary column in the dataset.

    Returns:
        Dict[str, float]: The calculated metric scores.
    """
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        
        inputs = tokenizer(article_batch, max_length=1024,  truncation=True, 
                        padding="max_length", return_tensors="pt")
        
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device), 
                         length_penalty=0.8, num_beams=8, max_length=128)
        # length_penalty is the exponent applied to sequence length when beams are scored;
        # a value below 1.0 favors long outputs less than the default of 1.0.
        
        # Finally, we decode the generated texts, 
        # replace the <n> token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, 
                                clean_up_tokenization_spaces=True) 
               for s in summaries]      
        
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        
        
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
        
    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score
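
As a quick illustration of the batching helper, the sketch below repeats its definition so the snippet runs on its own and applies it to a toy list. The input values are made up for the example.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

toy_reviews = ["r1", "r2", "r3", "r4", "r5"]
batches = list(generate_batch_sized_chunks(toy_reviews, 2))
print(batches)  # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```

The last batch is shorter when the list length is not a multiple of `batch_size`, which is why the evaluation loop sizes its progress bar from the number of batches rather than the number of rows.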

In [13]:
from transformers import AutoConfig
import os

model_ckpt = "facebook/bart-large-cnn"

# Get the configuration for the model
config = AutoConfig.from_pretrained(model_ckpt)

# Check if the configuration contains the 'model_type' field
if hasattr(config, 'model_type'):
    print(f"Model type: {config.model_type}")
else:
    print("Configuration does not contain 'model_type' field.")

Model type: bart


In [14]:
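# Check whether 'config.json' exists under a local path built from the checkpoint name.
# Checkpoints downloaded from the Hub are stored in the Hugging Face cache, not at this path.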
model_directory = os.path.join(model_ckpt, 'model')
config_file_path = os.path.join(model_directory, 'config.json')

if os.path.exists(config_file_path):
    print(f"'config.json' file exists in the model directory: {config_file_path}")
else:
    print("No 'config.json' file found in the model directory.")

No 'config.json' file found in the model directory.


Trying this on some sample text first.

In [19]:
sample_text = reviews_summaries_df.loc['sony xm4 earbuds', 'Review_Text']

In [20]:
# Tokenize and encode the text
input_ids = tokenizer.encode("summarize: " + sample_text, return_tensors="pt", max_length=1024, truncation=True).to(device)

# Generate the summary
summary_ids = model_pegasus.generate(input_ids, max_length=500, min_length=300, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)


Summary: The Sony WF-1000XM4 earbuds cost $279.<n>They're one of the most affordable earbuds on the market.<n>Reviewer says they have the best noise cancellation of any buds he's tried.<n>They have great bass, clarity and support Sony's LDAC technology.<n>But they're not the best for phone calls that I've ever used, but they'll get you by.<n>They're also great for when I'm cooking or cleaning around the home.<n>But they're not the best for phone calls that I've ever used, but they'll still get you by.<n>You can customize what you want each tap to register, plus a long press, plus a long press, plus a double tap, plus the IPX4 water resistance rating, as well as a long press, and double tap, and triple tap, as well as a long press, and double tap, as well as a long press, and triple tap, as well as a long press, and double tap, as well as a long press, and triple tap, as well as a long press, and double tap, as well as a long press, and triple tap, as well as a long press, and double ta

In [21]:
len(summary)

1179
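
The tail of the generated summary loops on the same few phrases, which can happen when a large `min_length` forces the model to keep writing. A rough way to quantify this is the share of word n-grams that repeat an earlier n-gram. The helper name `repeated_ngram_fraction` is hypothetical and the whitespace tokenization is a simplifying assumption. `generate` also accepts a `no_repeat_ngram_size` argument that may reduce such loops, though that is not tested in this notebook.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that already occurred earlier in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)

# Toy strings modeled on the output: one looping, one without repetition.
looping = "as well as a long press, and double tap, " * 4
varied = "they have great bass, clarity and support for ldac"
print(repeated_ngram_fraction(looping) > repeated_ngram_fraction(varied))  # True
```

Applied to `summary`, a high value would flag the degenerate ending without having to read the full text.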

Now let's try to fine-tune it.

In [22]:
from sklearn.model_selection import train_test_split

nonempty_df = reviews_summaries_df.replace(r'^\s*$', pd.NA, regex=True).dropna()

# Define the proportions for the splits
train_size = 0.6
validation_size = 0.2
test_size = 0.2

# First, split the data into a temporary training set and a temporary test set
train, temp_test = train_test_split(nonempty_df, test_size=1 - train_size, random_state=42)

# Then, split the temporary test set into the validation set and the final test set
final_validation, final_test = train_test_split(temp_test, test_size=test_size / (test_size + validation_size), random_state=42)


In [23]:
train_ds = Dataset.from_pandas(train)
validation_ds = Dataset.from_pandas(final_validation)
test_ds = Dataset.from_pandas(final_test)

ds = DatasetDict()

ds['train'] = train_ds
ds['validation'] = validation_ds
ds['test'] = test_ds

ds

DatasetDict({
    train: Dataset({
        features: ['Review_Text', 'Summary', 'Headphone_Name'],
        num_rows: 5
    })
    validation: Dataset({
        features: ['Review_Text', 'Summary', 'Headphone_Name'],
        num_rows: 2
    })
    test: Dataset({
        features: ['Review_Text', 'Summary', 'Headphone_Name'],
        num_rows: 2
    })
})

In [24]:
model_ckpt = "google/pegasus-cnn_dailymail"

pipe = pipeline('summarization', model = model_ckpt )

pipe_out = pipe(ds['test'][0]['Review_Text'][:1000])


tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
from datasets import load_metric
from tqdm import tqdm

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(ds['test'], rouge_metric, model_pegasus, tokenizer)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:43<00:00, 43.12s/it]


In [26]:
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['pegasus'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.284566,0.044188,0.159174,0.159174
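
For intuition about the `rouge1` column, here is a simplified unigram-overlap F-measure. It is a sketch only: the real `rouge_score` package applies its own tokenization and optional stemming, so its numbers will differ, and the function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure from clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

pred = "the earbuds have great noise cancellation"
ref = "great noise cancellation but poor fit"
print(round(rouge1_f(pred, ref), 3))  # 0.5
```

`rouge2` does the same with bigrams, and `rougeL` scores the longest common subsequence instead of n-gram counts.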


In [27]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['Review_Text'] , max_length = 1024, truncation = True )
    
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['Summary'], max_length = 128, truncation = True )
        
    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

#dataset = Dataset.from_dict(dataset_dict)

In [28]:
dataset_dict_pt = ds.map(convert_examples_to_features, batched = True)


In [29]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [55]:
import accelerate
import transformers
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-reviews', num_train_epochs=6, warmup_steps=500,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    weight_decay=0.01, logging_steps=1,
    evaluation_strategy='epoch', save_steps=1e6,
    gradient_accumulation_steps=16
) 

In [56]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=data_collator,
                  train_dataset=dataset_dict_pt["train"], 
                  eval_dataset=dataset_dict_pt["validation"])

In [57]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.8541,4.311434
2,0.8178,4.311206
3,0.9366,4.310781
4,0.831,4.310161
5,0.2781,4.308462


TrainOutput(global_step=6, training_loss=0.7578463157018026, metrics={'train_runtime': 874.1067, 'train_samples_per_second': 0.034, 'train_steps_per_second': 0.007, 'total_flos': 78015539183616.0, 'train_loss': 0.7578463157018026, 'epoch': 5.33})

In [58]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(
    ds['test'], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'Review_Text', column_summary= 'Summary'
)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:42<00:00, 42.50s/it]


In [59]:
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['pegasus'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.284566,0.044188,0.159174,0.159174
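
The scores after training match the baseline to six decimals. One plausible contributor, offered as a hypothesis rather than a diagnosis, is the schedule: with `warmup_steps=500` and only 6 optimizer steps in total, a linear warmup never gets past its first few percent. The sketch below assumes the default peak learning rate of 5e-5 and a multiplier of `step / warmup_steps` during warmup. `warmup_lr` is a hypothetical helper, not a library function.

```python
def warmup_lr(step, peak_lr=5e-5, warmup_steps=500):
    """Learning rate during linear warmup (assumed schedule, simplified)."""
    return peak_lr * min(1.0, step / max(1, warmup_steps))

# Learning rate at the first and last optimizer step of this run.
for step in (1, 6):
    print(step, warmup_lr(step))

# Share of the peak rate reached by the final step.
print(warmup_lr(6) / 5e-5)
```

Under these assumptions the last step runs at roughly 1.2% of the peak rate, so a much smaller `warmup_steps` value or more training steps may be needed before the weights move noticeably.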


In [61]:
from huggingface_hub import notebook_login
notebook_login()


In [62]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")


'https://huggingface.co/ravinderbrai/pegasus-reviews/tree/main/'

In [63]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = ds["test"][0]["Review_Text"]

reference = ds["test"][0]["Summary"]

pipe = pipeline("summarization", model='pegasus-reviews')

In [69]:
## 
print("Review_Text:")
#print(sample_text)


print("\nSummary:")
#print(reference)


print("\nModel Summary:")
#print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Review_Text:

Summary:

Model Summary:


In [64]:
pipe(sample_text[:1025], **gen_kwargs)[0]["summary_text"]

"The new beats fit pro noise cancelling earbuds come in four color options .<n>The earhook gives you a feeling of security that your buds aren't going to fall off your head .<n>What's interesting with the fit pro is that the wing tip has been ."

In [65]:
len(sample_text)

93063

In [66]:
sample_tokens = tokenizer.tokenize(sample_text)  # Use your specific tokenizer

# Calculate the total token count
total_tokens = len(sample_tokens)
print(total_tokens)

Token indices sequence length is longer than the specified maximum sequence length for this model (20104 > 1024). Running this sequence through the model will result in indexing errors


20104


In [67]:
# Divide the sample text into parts that fit the 1024-token maximum length
max_sequence_length = 1024
num_segments = (total_tokens + max_sequence_length - 1) // max_sequence_length

slicing_indices = list(range(0, total_tokens, max_sequence_length))
slicing_indices[-1]

19456
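
`slicing_indices` is derived from the token count, yet the loop that follows uses it as character offsets into `sample_text`, so it covers only about the first 19,456 of the 93,063 characters. A possible alternative, sketched below under stated assumptions, is to chunk the token ids themselves and decode each chunk before summarizing. `chunk_token_ids` is a hypothetical helper, and the toy integer list stands in for real token ids.

```python
def chunk_token_ids(token_ids, max_len):
    """Split a list of token ids into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# Toy stand-in for real token ids: 10 tokens with a limit of 4 per chunk.
toy_ids = list(range(10))
chunks = chunk_token_ids(toy_ids, 4)
print([len(c) for c in chunks])  # [4, 4, 2]

# With a real tokenizer (assumed usage, not run here), each chunk could be
# turned back into text and summarized:
#   ids = tokenizer.encode(sample_text, add_special_tokens=False)
#   for chunk in chunk_token_ids(ids, 1000):  # leave room for special tokens
#       text = tokenizer.decode(chunk)
#       full_txt += pipe(text, truncation=True, **gen_kwargs)[0]["summary_text"]
```

Unlike character-based slicing, this keeps the final partial chunk, so the end of the review would be summarized too.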

In [70]:
full_txt = ""

for i in range(0, len(slicing_indices)-1):
    full_txt += pipe(sample_text[slicing_indices[i]:slicing_indices[i+1]], **gen_kwargs)[0]["summary_text"]
    

In [71]:
full_txt

"The new beats fit pro noise cancelling earbuds come in four color options .<n>The earhook gives you a feeling of security that your buds aren't going to fall off your head .<n>What's interesting with the fit pro is that the wing tip has been .integrated into the design it's one size fits all you can't replace it as far as i can tell but it does seem durable .<n>Because it's an extension of the sport fin it is a soft to the touch finish and a little bit of a grip to it use it to control music playback answer and end calls .<n>A long press switches between noise canceling and transparency modes you can also program the long press to be volume controls on the buds themselves .The case for the fit pro isn't as small as the airpods pro case or even the beats studio buds case but it's still pretty compact and much smaller .<n>The case charges via usbc not lightning however it's missing the wireless charging found in the airpods pro and airpods 3 cases .beats studio buds don't have apple's h

# Summarizing Review Texts Version 2

Because training data is scarce, this version takes an alternative approach: the reviews are not grouped by headphone. Instead, every review of a given headphone is paired with that headphone's summary. This yields many more training samples, and each review is much shorter. As a result, the text does not have to be split into as many parts to stay within the maximum token length when generating summaries.

In [5]:
from transformers import set_seed, pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, BertTokenizer
from nltk.tokenize import sent_tokenize
import torch

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

from datasets import Dataset, DatasetDict

from datasets import load_metric
from tqdm import tqdm

In [6]:
individual_reviews_df = orig_df.merge(summaries_df, on='Headphone_Name', how='left')

# Rename the mislabeled review-text column
individual_reviews_df = individual_reviews_df.rename(columns={'Sony_Review_Text': 'Review_Text'})

In [24]:
individual_reviews_df.iloc[0:15]

Unnamed: 0,Headphone_Name,Review_Text,Summary
0,sony xm4 earbuds,"The Sony WF-1000XM4 earbuds, which is a mouth...",Sony nearly did it again. The company has domi...
1,sony xm4 earbuds,(wind rushing)\n(slow music) - As much as I l...,Sony nearly did it again. The company has domi...
2,sony xm4 earbuds,[Music] what's going on guys it's your averag...,Sony nearly did it again. The company has domi...
3,sony xm4 earbuds,so it's been almost two years since sony rele...,Sony nearly did it again. The company has domi...
4,sony xm4 earbuds,[Music] hey guys so is the Sony wfos xm4 stil...,Sony nearly did it again. The company has domi...
5,sony xm4 earbuds,Sony wf-1000xm4 have one major flaw at least ...,Sony nearly did it again. The company has domi...
6,sony xm4 earbuds,outside of airpods I wouldn't be surprised if...,Sony nearly did it again. The company has domi...
7,Galaxy Buds2 Pro,It's been six months since I bought \na pair ...,I’ve said a set of Samsung’s Galaxy Buds are i...
8,Galaxy Buds2 Pro,all right the galaxy buds 2 pro wow these ear...,I’ve said a set of Samsung’s Galaxy Buds are i...
9,Galaxy Buds2 Pro,hi there i've got samsung's 230 galaxy buds 2...,I’ve said a set of Samsung’s Galaxy Buds are i...


In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "google/pegasus-cnn_dailymail"
#model_ckpt = "google/pegasus-multi_news"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements.

    Splitting the dataset into smaller batches lets each batch be
    processed in a single forward pass.

    Args:
        list_of_elements (List[Any]): The list to be divided into chunks.
        batch_size (int): The maximum size of each chunk.

    Yields:
        List[Any]: A chunk of at most batch_size elements.
    """
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, 
                               batch_size=2, device=device, 
                               column_text="Review_Text", 
                               column_summary="Summary"):
    """
    Calculates a specified metric on a test dataset.

    Args:
        dataset (Dataset): The dataset to evaluate.
        metric (Metric): The metric to calculate.
        model (nn.Module): The model to evaluate.
        tokenizer (Tokenizer): The tokenizer to use for text processing.
        batch_size (int, optional): The batch size for evaluation.
        device (torch.device, optional): The device to use for computation.
        column_text (str, optional): The name of the text column in the dataset.
        column_summary (str, optional): The name of the summary column in the dataset.

    Returns:
        Dict[str, float]: The calculated metric scores.
    """
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        
        inputs = tokenizer(article_batch, max_length=1024,  truncation=True, 
                        padding="max_length", return_tensors="pt")
        
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device), 
                         length_penalty=0.8, num_beams=8, max_length=128)
        # A length_penalty below the default of 1.0 makes beam search favor
        # shorter summaries than it otherwise would; max_length caps the
        # output at 128 tokens.

        # Decode the generated texts, replace the <n> token, and add the
        # decoded texts together with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, 
                                clean_up_tokenization_spaces=True) 
               for s in summaries]      
        
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        
        
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
        
    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score

def convert_examples_to_features(example_batch):
    """Tokenize reviews (inputs) and summaries (labels) for seq2seq training."""
    input_encodings = tokenizer(example_batch['Review_Text'], max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, capped at 128 tokens.
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['Summary'], max_length=128, truncation=True)
        
    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }
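
To make the batching in `calculate_metric_on_test_ds` concrete, here is a minimal sketch that runs the chunking helper on three placeholder strings (the strings are hypothetical; the function body is copied from the cell above).

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


# Three placeholder reviews, batch size 2 as in the evaluation default.
texts = ["review 1", "review 2", "review 3"]
batches = list(generate_batch_sized_chunks(texts, batch_size=2))
print(batches)  # [['review 1', 'review 2'], ['review 3']]
```

Three items with `batch_size=2` give two batches, the last one smaller. This is consistent with the `2/2` progress bars printed when the three-row test set is evaluated.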

In [9]:
from sklearn.model_selection import train_test_split

# Keep the first 20 rows, replace empty or whitespace-only values with NA, then drop those rows
individual_reviews_df = individual_reviews_df.iloc[0:20].replace(r'^\s*$', pd.NA, regex=True).dropna()

# Define the proportions for the splits
train_size = 0.6
validation_size = 0.2
test_size = 0.2

# First, split the data into a temporary training set and a temporary test set
train, temp_test = train_test_split(individual_reviews_df, test_size=1 - train_size, random_state=42)

# Then, split the temporary test set into the validation set and the final test set
final_validation, final_test = train_test_split(temp_test, test_size=test_size / (test_size + validation_size), random_state=42)

In [10]:
# Convert the splits into the Dataset format used for training with the Hugging Face libraries
train_ds = Dataset.from_pandas(train)
validation_ds = Dataset.from_pandas(final_validation)
test_ds = Dataset.from_pandas(final_test)

ds = DatasetDict()

ds['train'] = train_ds
ds['validation'] = validation_ds
ds['test'] = test_ds

ds

DatasetDict({
    train: Dataset({
        features: ['Headphone_Name', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 7
    })
    validation: Dataset({
        features: ['Headphone_Name', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 3
    })
    test: Dataset({
        features: ['Headphone_Name', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 3
    })
})
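
The 7/3/3 row counts can be checked by hand. The sketch below assumes 13 rows survived the `dropna()` step (inferred from 7 + 3 + 3 in the output above) and relies on `train_test_split` rounding a fractional `test_size` up and giving the remainder to the training side. The helper name `split_sizes` is made up for illustration.

```python
from math import ceil


def split_sizes(n_rows, train_size=0.6, validation_size=0.2, test_size=0.2):
    """Reproduce the two-stage split arithmetic used in the notebook."""
    # Stage 1: the held-out part is rounded up, train gets the remainder.
    n_temp = ceil((1 - train_size) * n_rows)
    n_train = n_rows - n_temp
    # Stage 2: the held-out part is split again, test side rounded up.
    n_test = ceil(test_size / (test_size + validation_size) * n_temp)
    n_validation = n_temp - n_test
    return n_train, n_validation, n_test


print(split_sizes(13))  # (7, 3, 3)
```

Because of the rounding, the realized proportions are about 54/23/23 rather than 60/20/20, which matters when the dataset is this small.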

In [11]:
model_ckpt = "google/pegasus-cnn_dailymail"

pipe = pipeline('summarization', model = model_ckpt)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(ds['test'], rouge_metric, model_pegasus, tokenizer)

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:06<00:00, 33.26s/it]


In [13]:
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['pegasus'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.249946,0.052983,0.145596,0.145596
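
For intuition about what the `rouge1` column measures, here is a simplified sketch of a unigram-overlap F-measure. It is an assumption-laden toy: it splits on whitespace and lowercases, whereas the metric loaded in the notebook applies its own tokenization and aggregates scores over the test set, so the numbers will not match exactly. The function name `rouge1_f` and the example sentences are made up.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score_example = rouge1_f(
    "the earbuds sound great",
    "the earbuds sound great and fit well",
)
print(round(score_example, 3))  # 0.727
```

Here all four predicted words appear in the seven-word reference, so precision is 1, recall is 4/7, and the F-measure is 8/11. A `rouge1` of about 0.25, as in the table above, indicates far less word overlap between generated and reference summaries.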


In [14]:
dataset_dict_pt = ds.map(convert_examples_to_features, batched = True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

Map:   0%|          | 0/7 [00:00<?, ? examples/s]



Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]
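
`convert_examples_to_features` leaves the label sequences unpadded, so the collator has to bring them to a common length per batch. `DataCollatorForSeq2Seq` pads labels with `-100` by default, a value the cross-entropy loss ignores. The following is a plain-Python sketch of that idea only; the token ids and the helper name `pad_labels` are hypothetical.

```python
LABEL_PAD_ID = -100  # default label_pad_token_id of DataCollatorForSeq2Seq


def pad_labels(label_lists, pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in label_lists)
    return [labels + [pad_id] * (longest - len(labels)) for labels in label_lists]


# Two made-up label sequences of different lengths.
batch = pad_labels([[12, 45, 1], [7, 1]])
print(batch)  # [[12, 45, 1], [7, 1, -100]]
```

Padding with `-100` rather than the tokenizer's pad id means the padded positions do not contribute to the training loss.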

In [15]:
import accelerate
import transformers
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-individual-reviews', num_train_epochs=1,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    logging_steps=8,
    evaluation_strategy='epoch', save_steps=1e6,
    gradient_accumulation_steps=16
) 

In [16]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=data_collator,
                  train_dataset=dataset_dict_pt["train"], 
                  eval_dataset=dataset_dict_pt["validation"])

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mravinderbrai[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a PegasusTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,3.75068


TrainOutput(global_step=1, training_loss=1.0914041996002197, metrics={'train_runtime': 184.7545, 'train_samples_per_second': 0.038, 'train_steps_per_second': 0.005, 'total_flos': 20226250899456.0, 'train_loss': 1.0914041996002197, 'epoch': 1.0})
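
The `global_step=1` in the output follows from the training arguments. A rough back-of-the-envelope sketch, assuming a single device and that the trainer floors the number of batches per epoch by `gradient_accumulation_steps` with a minimum of one update (the helper name `optimizer_steps` is made up):

```python
from math import ceil


def optimizer_steps(n_samples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates for a training run."""
    batches_per_epoch = ceil(n_samples / batch_size)
    # At least one update per epoch, even with fewer batches than grad_accum.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return ceil(epochs * updates_per_epoch)


print(optimizer_steps(7, 2, 16, 1))  # 1 (settings used above)
print(optimizer_steps(7, 2, 1, 1))   # 4 (no gradient accumulation)
```

With 7 training rows, a batch size of 2, and 16 accumulation steps, the whole epoch collapses into a single weight update. That single step is also below `logging_steps=8`, which is why the training-loss column shows `No log`.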

In [17]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(
    ds['test'], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'Review_Text', column_summary= 'Summary'
)

rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['pegasus'] )

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:59<00:00, 29.50s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.257471,0.053969,0.147844,0.147844
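
To compare this table with the scores measured before fine-tuning, a small sketch that subtracts the two result rows (the values are copied from the two output tables in this section; the variable names are made up):

```python
# Scores before fine-tuning (cell 13 output) and after (cell 17 output).
baseline = {"rouge1": 0.249946, "rouge2": 0.052983,
            "rougeL": 0.145596, "rougeLsum": 0.145596}
finetuned = {"rouge1": 0.257471, "rouge2": 0.053969,
             "rougeL": 0.147844, "rougeLsum": 0.147844}

# Absolute change per metric, rounded to the precision of the tables.
deltas = {name: round(finetuned[name] - baseline[name], 6) for name in baseline}
print(deltas)
```

All four metrics move by less than 0.01. With a three-row test set and a single optimizer step, a change this small is probably not distinguishable from noise.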
