# Summarization on SoundGuys Reviews

Training Pegasus model on a small sample of headphones using summaries and youtube videos by the website https://www.soundguys.com. Instead of using review summaries from other sites and pairing them with various youtube reviews in the Summarizing_Reviews notebook, this will be more consistent since there is a dedicated summary in te article review, and then they have a youtube video as well.

In [1]:
import requests
from bs4 import BeautifulSoup

#!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

### Beautiful Soup

In [2]:
url = 'https://www.soundguys.com/sony-wf-1000xm4-review-53454/'
response = requests.get(url)

if response.status_code == 200:
    html_content = response.text
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

To get the class names here, will have to inspect the page manually. Also, if not working double check to see if the class name could have changed.

In [17]:
soup = BeautifulSoup(html_content, 'html.parser')

h2_elements = soup.find_all('h2')

second_h2 = h2_elements[0]
print(second_h2.get_text())

target_div = soup.find('div', class_='--___mf')
print(target_div.get_text())

Sony WF-1000XM4
If you want the best true wireless earphones, the Sony WF-1000XM4 has to be in the discussion among the most popular models. It has its foibles, but it's probably the most mature stab at the true wireless design that includes ANC. Still, the price is tough to surmount.


### Selenium

Now using selenium to get youtube link which is in an iframe

In [18]:
# Initialize ChromeDriver - note this takes some time to run but driver.get(url) takes even longer
driver = webdriver.Chrome()

# Now you can use the driver for navigating and interacting with Opera GX
driver.get(url)

In [20]:
outer_div = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "--___0b"))
)

In [21]:
outer_div.get_attribute('outerHTML')

'<div class="--___0b --___e --___d --___Vk"><div class="--___Eq" style="width: 100%; height: 100%;"><div style="width: 100%; height: 100%;"><iframe frameborder="0" allowfullscreen="" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" title="Sony WF-1000XM4: The new king of true wireless ANC earbuds" width="100%" height="100%" src="https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&amp;mute=0&amp;controls=1&amp;origin=https%3A%2F%2Fwww.soundguys.com&amp;playsinline=1&amp;showinfo=0&amp;rel=0&amp;iv_load_policy=3&amp;modestbranding=1&amp;enablejsapi=1&amp;widgetid=1" id="widget2" data-gtm-yt-inspected-16="true"></iframe></div></div></div>'

Getting the "--___ yq" which has the iframe with the youtube link wouldn't work until   you open the class that it is nested in, which the click method does.

In [22]:
outer_div.click()

In [23]:
outer_div.get_attribute('outerHTML')

'<div class="--___0b --___e --___d --___Vk"><div class="--___Eq" style="width: 100%; height: 100%;"><div style="width: 100%; height: 100%;"><iframe frameborder="0" allowfullscreen="" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" title="Sony WF-1000XM4: The new king of true wireless ANC earbuds" width="100%" height="100%" src="https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&amp;mute=0&amp;controls=1&amp;origin=https%3A%2F%2Fwww.soundguys.com&amp;playsinline=1&amp;showinfo=0&amp;rel=0&amp;iv_load_policy=3&amp;modestbranding=1&amp;enablejsapi=1&amp;widgetid=1" id="widget2" data-gtm-yt-inspected-16="true"></iframe></div></div></div>'

In [24]:
outer_div = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "--___Eq"))
)
outer_div.get_attribute('outerHTML')

'<div class="--___Eq" style="width: 100%; height: 100%;"><div style="width: 100%; height: 100%;"><iframe frameborder="0" allowfullscreen="" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" title="Sony WF-1000XM4: The new king of true wireless ANC earbuds" width="100%" height="100%" src="https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&amp;mute=0&amp;controls=1&amp;origin=https%3A%2F%2Fwww.soundguys.com&amp;playsinline=1&amp;showinfo=0&amp;rel=0&amp;iv_load_policy=3&amp;modestbranding=1&amp;enablejsapi=1&amp;widgetid=1" id="widget2" data-gtm-yt-inspected-16="true"></iframe></div></div>'

In [25]:
# Find the iframe within the outer div
iframe = outer_div.find_element(By.TAG_NAME, 'iframe')

# Get the src attribute of the iframe
iframe_src = iframe.get_attribute('src')

In [26]:
iframe_src

'https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&mute=0&controls=1&origin=https%3A%2F%2Fwww.soundguys.com&playsinline=1&showinfo=0&rel=0&iv_load_policy=3&modestbranding=1&enablejsapi=1&widgetid=1'

In [27]:
# Close the WebDriver
driver.quit()

## Refactoring Web Scraping

Turning the above into functions now to automate the process of getting data from SoundGuys websites.

In [28]:
def get_reviews_and_yt_vid(url):
    """
    url - the link to the website we are scraping
    """
    response = requests.get(url)
    
    #check if request worked
    if response.status_code == 200:
        html_content = response.text
    else:
        print('Failed to retrieve the webpage. Status code:', response.status_code)
        
    soup = BeautifulSoup(html_content, 'html.parser')
    
    #get the header for the summary text, which will also be the headphone name
    h2_elements = soup.find_all('h2')
    headphone_name = h2_elements[0].get_text()

    #get the review text now
    review_text = soup.find('div', class_='--___mf').get_text()
    
    return [headphone_name, review_text]

In [29]:
x, y = get_reviews_and_yt_vid(url)
print(x, y)

Sony WF-1000XM4 If you want the best true wireless earphones, the Sony WF-1000XM4 has to be in the discussion among the most popular models. It has its foibles, but it's probably the most mature stab at the true wireless design that includes ANC. Still, the price is tough to surmount.


In [30]:
#use selenium to get the src in a nested html class to get the youtube review automatically
def get_yt_vid(url):
    driver = webdriver.Chrome()
    driver.get(url)
    #need to add a wait time to load the page
    outer_div = WebDriverWait(driver, 50).until(
        EC.presence_of_element_located((By.CLASS_NAME, "--___0b"))
    )
    
    outer_div.click()
    
    outer_div = WebDriverWait(driver, 50).until(
        EC.presence_of_element_located((By.CLASS_NAME, "--___Eq"))
    )
    
    outer_div.click()
    
    # Find the iframe within the outer div and get src attribute which is the youtube link
    iframe = WebDriverWait(outer_div, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'iframe'))
    )
    #iframe = outer_div.find_element(By.TAG_NAME, 'iframe')
    youtube_link = iframe.get_attribute('src')
    
    # Close the WebDriver
    driver.quit()
    
    return youtube_link

In [52]:
driver = webdriver.Chrome()
driver.get(url)
#need to add a wait time to load the page
outer_div = WebDriverWait(driver, 50).until(
    EC.presence_of_element_located((By.CLASS_NAME, "--___ec"))
)

outer_div.click()

outer_div = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "--___uq"))
)

outer_div.click()

outer_div.get_attribute('outerHTML')

'<div class="--___uq" style="width: 100%; height: 100%;"><div style="width: 100%; height: 100%;"><div></div></div></div>'

In [53]:
outer_div.get_attribute('outerHTML')

'<div class="--___uq" style="width: 100%; height: 100%;"><div style="width: 100%; height: 100%;"><iframe frameborder="0" allowfullscreen="" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" title="Sony WF-1000XM4: The new king of true wireless ANC earbuds" width="100%" height="100%" src="https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&amp;mute=0&amp;controls=1&amp;origin=https%3A%2F%2Fwww.soundguys.com&amp;playsinline=1&amp;showinfo=0&amp;rel=0&amp;iv_load_policy=3&amp;modestbranding=1&amp;enablejsapi=1&amp;widgetid=1" id="widget2" data-gtm-yt-inspected-16="true"></iframe></div></div>'

In [54]:
# Find the iframe within the outer div and get src attribute which is the youtube link

#iframe = outer_div.find_element(By.TAG_NAME, 'iframe')
iframe = WebDriverWait(outer_div, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'iframe'))
)


youtube_link = iframe.get_attribute('src')

# Close the WebDriver
driver.quit()

In [55]:
youtube_link

'https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&mute=0&controls=1&origin=https%3A%2F%2Fwww.soundguys.com&playsinline=1&showinfo=0&rel=0&iv_load_policy=3&modestbranding=1&enablejsapi=1&widgetid=1'

In [31]:
print(get_yt_vid(url))

https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&mute=0&controls=1&origin=https%3A%2F%2Fwww.soundguys.com&playsinline=1&showinfo=0&rel=0&iv_load_policy=3&modestbranding=1&enablejsapi=1&widgetid=1


## Collecting Data 

In [1]:
import pandas as pd
from tqdm import tqdm

In [29]:
headphone_name, review_text = get_reviews_and_yt_vid(url)

print(headphone_name)
print(review_text)

Sony WF-1000XM4
If you want the best true wireless earphones, the Sony WF-1000XM4 has to be in the discussion among the most popular models. It has its foibles, but it's probably the most mature stab at the true wireless design that includes ANC. Still, the price is tough to surmount.


In [59]:
yt_link = get_yt_vid(url)
print(yt_link)

https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&mute=0&controls=1&origin=https%3A%2F%2Fwww.soundguys.com&playsinline=1&showinfo=0&rel=0&iv_load_policy=3&modestbranding=1&enablejsapi=1&widgetid=1


In [32]:
soundguys_dct = {'Headphone': [], 'Review': [], 'YouTubeLink': []}

Collecting data manually from SoundGuys website. Each link here follows the same html structure, so we can get the headphone name, summary, and youtube link the same way for each one. 

**Note:** Unfortunately, there was not enough links like these for true wireless earbuds, so over ear headphones were also included. This should be okay though since we just want to fine tune the model a little bit for headphones in general.

Here is a link for future reference, where you can go to the review section for headphone reviews: https://www.soundguys.com/earbuds-headphones/.

In [33]:
urls = ['https://www.soundguys.com/sony-wf-1000xm4-review-53454/',
        'https://www.soundguys.com/apple-airpods-pro-review-27106/',
        'https://www.soundguys.com/bose-quietcomfort-ultra-earbuds-review-101362/',
        'https://www.soundguys.com/apple-airpods-pro-2nd-generation-review-78938/',
        'https://www.soundguys.com/sony-wf-1000xm5-review-95465/',
        'https://www.soundguys.com/apple-airpods-max-review-44975/',
        'https://www.soundguys.com/bose-quietcomfort-45-review-60657/',
        'https://www.soundguys.com/bose-qc35-ii-review-14264/',
        'https://www.soundguys.com/sennheiser-accentum-wireless-review-100972/',
        'https://www.soundguys.com/beyerdynamic-dt-990-pro-review-15223/',
        'https://www.soundguys.com/sony-wh-1000xm5-review-71783/',
        'https://www.soundguys.com/sennheiser-hd-560s-review-85989/',
        'https://www.soundguys.com/beats-studio3-wireless-review-16041/',
        'https://www.soundguys.com/nothing-ear-2-review-88883/',
        'https://www.soundguys.com/razer-blackshark-v2-review-36783/',
        'https://www.soundguys.com/microsoft-surface-headphones-2-review-35282/',
        'https://www.soundguys.com/sony-wh-ch710n-review-32483/',
        'https://www.soundguys.com/new-google-pixel-buds-review-32211/',
        'https://www.soundguys.com/sony-wf-1000xm3-review-25342/',
        'https://www.soundguys.com/bose-700-noise-cancelling-headphones-review-24897/',
        'https://www.soundguys.com/skullcandy-push-true-wireless-review-22815/',
        'https://www.soundguys.com/jlab-jbuds-air-review-19768/',
        'https://www.soundguys.com/razer-nari-ultimate-review-19718/',
        'https://www.soundguys.com/corsair-virtuoso-wireless-se-review-38016/'
        'https://www.soundguys.com/v-moda-crossfade-2-wireless-codex-18345/'
       ]

In [38]:
#using try and except in case anything doesn't work
for u in tqdm(urls, desc="Processing URLs", unit="URL"):
    try:
        # get headphone name and summary with beautiful soup
        headphone_name, review_text = get_reviews_and_yt_vid(u)
        soundguys_dct['Headphone'].append(headphone_name)
        soundguys_dct['Review'].append(review_text)
        
        #using selenium to get the youtube link
        yt_link = get_yt_vid(u)
        soundguys_dct['YouTubeLink'].append(yt_link)        
    except Exception as e:
        # Print or log the exception message if needed
        print(f"An error occurred for URL {u}: {e}")
        # Continue to the next iteration
        continue

Processing URLs: 100%|████████████████████████████████████████████████████████████████| 24/24 [10:07<00:00, 25.32s/URL]

Failed to retrieve the webpage. Status code: 404
An error occurred for URL https://www.soundguys.com/corsair-virtuoso-wireless-se-review-38016/https://www.soundguys.com/v-moda-crossfade-2-wireless-codex-18345/: local variable 'html_content' referenced before assignment





In [34]:
for u in urls:
    headphone_name, review_text = get_reviews_and_yt_vid(u)
    soundguys_dct['Headphone'].append(headphone_name)
    soundguys_dct['Review'].append(review_text)
    
    yt_link = get_yt_vid(url)
    soundguys_dct['YouTubeLink'].append(yt_link)
    

In [51]:
#soundguys_dct

In [42]:
df = pd.DataFrame(soundguys_dct)
df_no_duplicates = df.drop_duplicates()
df_no_duplicates_reset = df_no_duplicates.reset_index(drop=True)

In [46]:
df_no_duplicates_reset

Unnamed: 0,Headphone,Review,YouTubeLink
0,Sony WF-1000XM4,"If you want the best true wireless earphones, ...",https://www.youtube.com/embed/EcosCxU_GEU?auto...
1,Apple AirPods Pro (1st generation),For just $50 USD more than the original model ...,https://www.youtube.com/embed/JF_wpY8BqK8?auto...
2,Bose QuietComfort Ultra Earbuds,The Bose QuietComfort Ultra Earbuds are a comp...,https://www.youtube.com/embed/8vsE8xVN6rE?auto...
3,Apple AirPods Pro (2nd generation),The Apple AirPods Pro (2nd generation) modestl...,https://www.youtube.com/embed/icUWvqtJfUI?auto...
4,Sony WF-1000XM5,The Sony WF-1000XM5 takes notes from the succe...,https://www.youtube.com/embed/vPFM8x1LzwI?auto...
5,Apple AirPods Max,While the Apple AirPod Max is a great pair of ...,https://www.youtube.com/embed/FbNTNhNm0hE?auto...
6,Bose QuietComfort 45,The Bose QuietComfort 45 has all the parts to ...,https://www.youtube.com/embed/42m3l6DDtec?auto...
7,Bose QuietComfort 35 II,The Bose QuietComfort 35 II have some of the b...,https://www.youtube.com/embed/XJj5oM7rWvg?auto...
8,Sennheiser ACCENTUM Wireless,For a set of ANC headphones that cost roughly ...,https://www.youtube.com/embed/LFO6BsO8Yhg?auto...
9,Beyerdynamic DT 990 PRO,"For less than $200, it's not hard to understan...",https://www.youtube.com/embed/5oK0jW4wEzg?auto...


In [49]:
df_no_duplicates_reset.to_csv('SoundGuys_Data.csv', index = False)

Now we have a dataset with summaries associated with reviews by the same author. The dataset is still small but we will have to make do for now. 

Note some things to come back to: the html class names used to find this data changes daily, so will have to change it within the function if this is needed again. Should probably make it a parameter in the function too. 

## Fine-Tuning

### Converting Youtube Links to CC Text

In [2]:
df = pd.read_csv('SoundGuys_Data.csv')

df.head(3)

Unnamed: 0,Headphone,Review,YouTubeLink
0,Sony WF-1000XM4,"If you want the best true wireless earphones, ...",https://www.youtube.com/embed/EcosCxU_GEU?auto...
1,Apple AirPods Pro (1st generation),For just $50 USD more than the original model ...,https://www.youtube.com/embed/JF_wpY8BqK8?auto...
2,Bose QuietComfort Ultra Earbuds,The Bose QuietComfort Ultra Earbuds are a comp...,https://www.youtube.com/embed/8vsE8xVN6rE?auto...


In [3]:
from YouTubeReviewScraper import YouTubeReviewData
from googleapiclient.discovery import build
import json

In [4]:
api_key = "AIzaSyCr2cER-JhJlIf4JxkU2KMC1Cm7bhLX9oE"

In [6]:
""" This code will create a json file with your api key which you will need to load it in, in the next cell
# Define your API key
api_key = "your_api_key_here"

# Create a dictionary with the API key
data = {"api_key": api_key}

# Specify the file path
file_path = "config.json"

# Write the dictionary to the JSON file
with open(file_path, 'w') as json_file:
    json.dump(data, json_file, indent=2)

print(f"JSON file '{file_path}' created with API key.")
"""
pass

In [4]:
file_path = "config.json"

# Read the content of the JSON file
with open(file_path, 'r') as json_file:
    config_data = json.load(json_file)

# Extract the API key from the config file, which should also be in gitignore
api_key = config_data.get('api_key', None)

In [5]:
sample_yt_url = df.iloc[0, 2]
sample_yt_url

'https://www.youtube.com/embed/EcosCxU_GEU?autoplay=0&mute=0&controls=1&origin=https%3A%2F%2Fwww.soundguys.com&playsinline=1&showinfo=0&rel=0&iv_load_policy=3&modestbranding=1&enablejsapi=1&widgetid=1'

In [6]:
youtube = build('youtube', 'v3', developerKey=api_key)

#the youtube urls are embedded so the url has a lot of text after the video id, which we should clean up
def clean_yt_url(video_url):
    # Extract video ID from the URL
    video_id = video_url.split("?")[0]
    
    # If there are additional parameters after the video ID, remove them
    if '&' in video_id:
        video_id = video_id.split('&')[0]

    return video_id

# Get the cleaned YT url
cleaned_yt_url = clean_yt_url(sample_yt_url)
print(f'YouTube Video ID: {cleaned_yt_url}')

YouTube Video ID: https://www.youtube.com/embed/EcosCxU_GEU


In [10]:
def get_video_id(url):
    # Extract video ID from YouTube URL
    video_id = None
    if 'youtube.com' in url or 'youtube' in url:
        video_id = url.split('/')[-1] if 'youtube' in url else requests.utils.urlparse(url).query.split('=')[-1]
    return video_id

In [11]:
sample_yt_id = get_video_id(cleaned_yt_url)
print(sample_yt_id)

EcosCxU_GEU


In [50]:
youtube_scraper = YouTubeReviewData(api_key)

sample_cc_text = youtube_scraper.fetch_captions(sample_yt_id)
#sample_cc_text

In [12]:
def get_cc_text(url):
    cleaned_url = clean_yt_url(url)
    
    cleaned_url_id = get_video_id(cleaned_url)
    
    youtube_scraper = YouTubeReviewData(api_key)
    cc_text = youtube_scraper.fetch_captions(cleaned_url_id)
    
    return cc_text

In [13]:
tqdm.pandas(desc="Processing Videos")
df['CCText'] = df['YouTubeLink'].progress_apply(lambda x: get_cc_text(x))

Processing Videos: 100%|███████████████████████████████████████████████████████████████| 23/23 [00:39<00:00,  1.70s/it]

An error occurred: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=_UdeZdUUA8I! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!





In [14]:
final_df = df.copy().dropna()
final_df = final_df.drop('YouTubeLink', axis=1)
final_df.rename(columns = {'Review': 'Summary', 'CCText': 'Review_Text'}, inplace = True)
final_df = final_df[['Headphone', 'Review_Text', 'Summary']]
final_df.head(3)

Unnamed: 0,Headphone,Review_Text,Summary
0,Sony WF-1000XM4,[Applause] well he is embedded [Music] nice [...,"If you want the best true wireless earphones, ..."
1,Apple AirPods Pro (1st generation),so I never liked the original air pods they l...,For just $50 USD more than the original model ...
2,Bose QuietComfort Ultra Earbuds,so Bose released their quiet Comfort Ultra ea...,The Bose QuietComfort Ultra Earbuds are a comp...


In [15]:
final_df.to_csv('SoundGuys_Review_Summary_Pairs.csv', index = False)

### Training Pegasus on SoundGuys Data

In [7]:
from transformers import set_seed, pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, BertTokenizer
from nltk.tokenize import sent_tokenize
import torch

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

from datasets import Dataset, DatasetDict

from datasets import load_metric

In [8]:
final_df = pd.read_csv("SoundGuys_Review_Summary_Pairs.csv")

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model_ckpt = "google/pegasus-cnn_dailymail"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements.
    
    Yields consecutive chunks from a list.

    Args:
        list_of_elements (List[Any]): The list to be divided into chunks.
        batch_size (int): The size of chunks.

    Yields:
        List[Any]: A chunk from the list of the specified size.
        
    """
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, 
                               batch_size=2, device=device, 
                               column_text="Review_Text", 
                               column_summary="Summary"):
    """
    Calculates a specified metric on a test dataset.

    Args:
        dataset (Dataset): The dataset to evaluate.
        metric (Metric): The metric to calculate.
        model (nn.Module): The model to evaluate.
        tokenizer (Tokenizer): The tokenizer to use for text processing.
        batch_size (int, optional): The batch size for evaluation.
        device (torch.device, optional): The device to use for computation.
        column_text (str, optional): The name of the text column in the dataset.
        column_summary (str, optional): The name of the summary column in the dataset.

    Returns:
        Dict[str, float]: The calculated metric scores.
    """
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        
        inputs = tokenizer(article_batch, max_length=1024,  truncation=True, 
                        padding="max_length", return_tensors="pt")
        
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device), 
                         length_penalty=0.8, num_beams=8, max_length=128)
        ''' parameter for length penalty ensures that the model does not generate sequences that are too long. '''
        
        # Finally, we decode the generated texts, 
        # replace the <n> token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, 
                                clean_up_tokenization_spaces=True) 
               for s in summaries]      
        
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        
        
        metric.add_batch(predictions=decoded_summaries, references=target_batch)
        
    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score

def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['Review_Text'] , max_length = 1024, truncation = True )
    
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['Summary'], max_length = 128, truncation = True )
        
    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

In [11]:
from sklearn.model_selection import train_test_split

# Define the proportions for the splits
train_size = 0.6
validation_size = 0.2
test_size = 0.2

# First, split the data into a temporary training set and a temporary test set
train, temp_test = train_test_split(final_df, test_size=(1 - train_size), random_state=42)

# Then, split the temporary test set into the validation set and the final test set
final_validation, final_test = train_test_split(temp_test, test_size=test_size / (test_size + validation_size), random_state=42)

In [12]:
#getting dataset in a form for trainign with hugging face libraries
train_ds = Dataset.from_pandas(train)
validation_ds = Dataset.from_pandas(final_validation)
test_ds = Dataset.from_pandas(final_test)

ds = DatasetDict()

ds['train'] = train_ds
ds['validation'] = validation_ds
ds['test'] = test_ds

ds

DatasetDict({
    train: Dataset({
        features: ['Headphone', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 13
    })
    validation: Dataset({
        features: ['Headphone', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 4
    })
    test: Dataset({
        features: ['Headphone', 'Review_Text', 'Summary', '__index_level_0__'],
        num_rows: 5
    })
})

In [72]:
pipe = pipeline('summarization', model = model_ckpt)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [81]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(ds['test'], rouge_metric, model_pegasus, tokenizer)

100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:45<00:00, 35.24s/it]


In [82]:
rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = ['pegasus'])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.312797,0.064264,0.176921,0.176485


In [85]:
dataset_dict_pt = ds.map(convert_examples_to_features, batched = True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

Map:   0%|          | 0/13 [00:00<?, ? examples/s]



Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

In [88]:
import accelerate
import transformers
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-individual-reviews', num_train_epochs=3,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    logging_steps=8,
    evaluation_strategy='epoch', save_steps=1e6,
    gradient_accumulation_steps=16
) 

In [89]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=data_collator,
                  train_dataset=dataset_dict_pt["train"], 
                  eval_dataset=dataset_dict_pt["validation"])

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,3.600292
2,No log,3.508366


TrainOutput(global_step=3, training_loss=1.2454113165537517, metrics={'train_runtime': 729.432, 'train_samples_per_second': 0.053, 'train_steps_per_second': 0.004, 'total_flos': 86683932426240.0, 'train_loss': 1.2454113165537517, 'epoch': 2.29})

In [90]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

rouge_metric = load_metric('rouge')

score = calculate_metric_on_test_ds(
    ds['test'], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'Review_Text', column_summary= 'Summary'
)

rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = [f'pegasus'] )

100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:56<00:00, 38.71s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.297014,0.074641,0.183399,0.184


In [92]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [93]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.54k [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

'https://huggingface.co/ravinderbrai/pegasus-individual-reviews/tree/main/'

Now let's use the model on the test set.

In [13]:
sample_text = ds["test"][2]["Review_Text"]

reference = ds["test"][2]["Summary"]

In [14]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}

pipe = pipeline("summarization", model='pegasus-reviews')

In [15]:
# we need to split the text into even parts and take summaries for each and then concatentate them, since the review texts are too long.
# this function will get the number of parts to split a list into given we want each subset of the list to be of a certain max length
def optimal_part_count(input_list, max_length):
    """
    Calculate the optimal number of parts, each having a length less than max_length.
    
    Parameters:
    - input_list: The list to be split.
    - max_length: The maximum length of each part.
    
    Returns:
    - The optimal number of parts.
    """
    total_length = len(input_list)
    optimal_count = total_length // max_length + (1 if total_length % max_length != 0 else 0)
    return optimal_count

In [16]:
tokenizer(sample_text)[0]

Token indices sequence length is longer than the specified maximum sequence length for this model (1626 > 1024). Running this sequence through the model will result in indexing errors


Encoding(num_tokens=1626, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [17]:
optimal_part_count(tokenizer(sample_text)[0], 1024)

2

In [16]:
#from transformers import pipeline, AutoTokenizer

def optimal_part_count(input_list, max_length):
    total_length = len(input_list)
    optimal_count = total_length // max_length + (1 if total_length % max_length != 0 else 0)
    return optimal_count

def get_full_summaries(review_text, max_part_length=1024, overlap=50):
    gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}
    pipe = pipeline("summarization", model='pegasus-reviews')
    tokenizer = AutoTokenizer.from_pretrained("pegasus-reviews")

    # Tokenize the review_text
    tokens = tokenizer(review_text, return_tensors="pt")["input_ids"][0]
    total_tokens = len(tokens)

    # Calculate the optimal number of parts based on token count
    num_parts_of_review_text = optimal_part_count(tokens, max_part_length)

    # Calculate the token indices for each substring
    idx_lst = [i * total_tokens // num_parts_of_review_text for i in range(num_parts_of_review_text)]

    # Summarize each substring and concatenate the summaries
    full_summary = ''
    for i in range(len(idx_lst) - 1):
        idx_1 = idx_lst[i]
        idx_2 = idx_lst[i + 1]

        # Extract substring based on token indices
        substring = tokenizer.decode(tokens[idx_1:idx_2])

        # Summarize the substring
        summary = pipe(substring, **gen_kwargs)[0]["summary_text"]
        full_summary += summary

    # Add the last part that got missed from the for loop
    idx_1 = idx_lst[-1]
    full_summary += pipe(tokenizer.decode(tokens[idx_1:]), **gen_kwargs)[0]["summary_text"]

    return full_summary

# Example usage
result_summary = get_full_summaries(sample_text)
print(result_summary)


Token indices sequence length is longer than the specified maximum sequence length for this model (1626 > 1024). Running this sequence through the model will result in indexing errors


The airpods max is apple's first ever pair of over-ear bluetooth headphones .<n>The 549 price tag is double the price of some flagship headphones .<n>The headphone sounds great but it is very expensive and at 549 us dollars this costs double the price of some flagship headphones .If you're someone who prioritizes audio quality and customization above convenience i'm sorry to say but the airpods max isn't likely to please .<n>The airpods max are not waterproof or water resistant meaning that a little rain could be enough to ruin these headphones .<n>The airpods max will automatically detect where you're playing content from and switch accordingly .


In [23]:
num_parts_of_sample_text = optimal_part_count(tokenizer(sample_text)[0], 1024)

idx_lst = [i*len(sample_text)//num_parts_of_sample_text for i in range(num_parts_of_sample_text)]

full_summary = ''

for i in range(len(idx_lst)-1):
    idx_1 = idx_lst[i]
    idx_2 = idx_lst[i+1]
    full_summary += pipe(sample_text[idx_1:idx_2], **gen_kwargs)[0]["summary_text"]
    
full_summary += pipe(sample_text[idx_2:], **gen_kwargs)[0]["summary_text"]
print(full_summary)

The airpods max is apple's first ever pair of over-ear bluetooth headphones .<n>The 549 price tag is double the price of some flagship headphones .<n>The headphone sounds great but it is very expensive and at 549 us dollars this costs double the price of some flagship headphones .The sound is being adjusted by software if you're someone who prioritizes audio quality and customization above convenience .<n>The airpods max are not waterproof or water resistant meaning that a little rain could be enough to ruin these headphones .<n>The airpods max will automatically detect where you're playing content from and switch accordingly .


In [79]:
#pipe(sample_text[:4812], **gen_kwargs)[0]["summary_text"]
#pipe(sample_text[4812:], **gen_kwargs)[0]["summary_text"]

In [18]:
def get_full_summaries(review_text, max_part_length = 1024):
    gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}
    pipe = pipeline("summarization", model='pegasus-reviews')
    
    num_parts_of_review_text = optimal_part_count(tokenizer(review_text)[0], max_part_length)

    idx_lst = [i*len(review_text)//num_parts_of_review_text for i in range(num_parts_of_review_text)]

    #breaking the text up into parts and getting summaries for each, to then concatenate them into this empty string
    full_summary = ''
    for i in range(len(idx_lst)-1):
        idx_1 = idx_lst[i]
        idx_2 = idx_lst[i+1]
        full_summary += pipe(review_text[idx_1:idx_2], **gen_kwargs)[0]["summary_text"]
    
    #adding last part that got missed from for loop since it would give an index error
    full_summary += pipe(review_text[idx_lst[-1]:], **gen_kwargs)[0]["summary_text"]
    
    return full_summary

In [33]:
tokenizer(sample_text)[0]

Encoding(num_tokens=1626, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [18]:
get_full_summaries(sample_text)

"The airpods max is apple's first ever pair of over-ear bluetooth headphones .<n>The 549 price tag is double the price of some flagship headphones .<n>The headphone sounds great but it is very expensive and at 549 us dollars this costs double the price of some flagship headphones .The sound is being adjusted by software if you're someone who prioritizes audio quality and customization above convenience .<n>The airpods max are not waterproof or water resistant meaning that a little rain could be enough to ruin these headphones .<n>The airpods max will automatically detect where you're playing content from and switch accordingly ."

In [88]:
get_full_summaries(final_df['Review_Text'][6])

'The new bose qc45 headphones look pretty much identical to the previous model sporting the same plastic construction and matte finish .<n>The ear cups are large enough to suit most if not all ears and the synthetic leather pads create a seal to offer decent isolation .<n>The bose qc45 now features bluetooth 5.1 for range up to 9 meters before your connection starts to drop .The bose qc45 boasts improved noise cancellation compared to previous models .<n>Battery lasts 24 hours and 50 minutes with active noise cancelling enabled .<n>It only takes 2.5 hours to fully charge the headphones and just 15 minutes of charging gives you up to three hours of playtime .'

In [31]:
testing_text= ds["test"][1]["Review_Text"]
print(len(testing_text))
tokenizer(testing_text)[0]

7956


Encoding(num_tokens=1549, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [32]:
#pipe(testing_text, **gen_kwargs)[0]["summary_text"]

In [108]:
#for rt in list(ds["test"]['Review_Text']):
#    print(get_full_summaries(rt))

Now let's look at the summaries for the test set, and just makes sure they look nice and are readable.

In [109]:
for j in range(len(ds["test"]['Review_Text'])):
    print(ds["test"][j]['Headphone'])
    print(get_full_summaries(ds["test"][j]['Review_Text']))

HD 560S
The HD 560s are some of the best audio file headphones on a budget released in the last 10 years .<n>This set of headphones has really large plush ear cups and they have angled drivers to make sure that they don't actually touch your ears .<n>The open back design means that a lot of the music that you'll be listening to through these headphones will sound a lot more natural than it would on a set of closed-back headphones .
Nothing Ear (2)
The nothing year 2 doesn't reinvent the wheel but it certainly builds upon the strong foundations laid out by its predecessor .<n>The ear 2 also features in-ear detection which automatically pauses your music when you remove buds and resumes when you put them back in to control the buds .<n>The ear 2 can transmit up to 24 kilohertz audio via the old HDC 5.0 Bluetooth .Nothing's EQ doesn't tell you what frequency ranges are being affected for bass mid and treble sounds still it's better than no EQ .<n>The ANC performance does get close to the 

In [110]:
#tqdm.pandas()
#final_df['Review_Text'].progress_apply(lambda x: get_full_summaries(x))

In [13]:
youtube_reviews_df = pd.read_csv('youtube_reviews.csv')
youtube_reviews_df = youtube_reviews_df.rename(columns = {'Sony_Review_Text': 'Review_Text'})
youtube_reviews_df

Unnamed: 0,Headphone_Name,Review_Text
0,sony xm4 earbuds,"The Sony WF-1000XM4 earbuds, which is a mouth..."
1,sony xm4 earbuds,(wind rushing)\n(slow music) - As much as I l...
2,sony xm4 earbuds,[Music] what's going on guys it's your averag...
3,sony xm4 earbuds,so it's been almost two years since sony rele...
4,sony xm4 earbuds,[Music] hey guys so is the Sony wfos xm4 stil...
...,...,...
107,Buy LG TONE TF8,these are algae's first sports earbuds they a...
108,Buy LG TONE TF8,all right guys so this video is sponsored by ...
109,Buy LG TONE TF8,the market of true wireless earbuds has grown...
110,Buy LG TONE TF8,last year i reviewed lg's tone free fp 8 and ...


In [14]:
get_full_summaries(youtube_reviews_df['Review_Text'][0])

Token indices sequence length is longer than the specified maximum sequence length for this model (2350 > 1024). Running this sequence through the model will result in indexing errors


"The Sony WF-1000XM4 earbuds are a great value at $279 .<n>They have the best noise cancellation of any buds I've tried .<n>They have one of the best sound profiles of any buds I've tried .<n>They also support Sony’s LDAC which gets you higher quality audio over bluetooth .The Sony Pixel Buds have a touch surface that can register a single, double, or triple tap, plus a long press, which is something most other buds I’ve tried just can’t do .<n>You can customize what you want each tap to register as in the Sony app .<n>I’ve never run out of battery while listening to these and they’re rated for 8 hours of battery and the charge in the case will get you up to another 16 hours .The Sony WF-XM4s are the best noise canceling earbuds I've ever used .<n>They've worked great on my Samsung and Pixel phones, plus with Android you get LDAC quality streaming .<n>The only issue I ever ran into is when I had them in and I turned transparency mode on, it just turned off transparency mode without any

Trying the fine-tuned model on the updated version 2 dataset with reviews from various sources.

In [19]:
import pandas as pd
youtube_reviews_df = pd.read_csv('youtube_reviews_v2.csv')
youtube_reviews_df

Unnamed: 0,video_id,title,video_link,channel_name,review_text,headphone
0,0L4tO4HDjnU,Sony WF-1000XM4 Earbuds Review - 6 Months Later,https://www.youtube.com/watch?v=0L4tO4HDjnU,6 Months Later,"The Sony WF-1000XM4 earbuds, which is a mouth...",sony xm4 earbuds
1,hIWn_RjN-Wg,I&#39;m ditching my AirPods Pros - Sony WF-100...,https://www.youtube.com/watch?v=hIWn_RjN-Wg,ShortCircuit,(wind rushing) (slow music) - As much as I lo...,sony xm4 earbuds
2,fMw1h4hKT_E,Sony WF-1000XM4 review: silence to my ears,https://www.youtube.com/watch?v=fMw1h4hKT_E,The Verge,so it's been almost two years since sony rele...,sony xm4 earbuds
3,mpOv9h0L9Cc,The Most Advanced Earbuds? Sony WF-1000XM4 Rev...,https://www.youtube.com/watch?v=mpOv9h0L9Cc,UrAvgConsumer,[Music] what's going on guys it's your averag...,sony xm4 earbuds
4,Rknl4uIOyfY,Sony WF-1000XM4 true wireless earbuds full review,https://www.youtube.com/watch?v=Rknl4uIOyfY,GSMArena Official,hey what's up guys will here for gsm arena we...,sony xm4 earbuds
...,...,...,...,...,...,...
100,84qmGGT1erc,LG Tone Free TF8 Review | Waterproof True Wire...,https://www.youtube.com/watch?v=84qmGGT1erc,Andy's Tech Tone,hi everyone and welcome to the channel lg hav...,lg tone tf8
101,zExw0KHUlss,These have a WORLD&#39;S FIRST - LG TONE Free ...,https://www.youtube.com/watch?v=zExw0KHUlss,ShortCircuit,y'all ready for this lg sponsored this video ...,lg tone tf8
102,f_B1bY7Yatg,LG TONE Free Fit TF8 : Wireless Earbuds Fit To...,https://www.youtube.com/watch?v=f_B1bY7Yatg,Gamesky,all right guys so this video is sponsored by ...,lg tone tf8
103,cy4bC4or8Ho,LG TONE Free Fit TF8 Unboxing &amp; Hands-On W...,https://www.youtube.com/watch?v=cy4bC4or8Ho,TechTablets,these are algae's first sports earbuds they a...,lg tone tf8


In [21]:
#tokenizer(youtube_reviews_df['review_text'][6])['input_ids']

In [23]:
#tokenizer.decode(tokenizer(youtube_reviews_df['review_text'][6])['input_ids'])

In [34]:
optimal_part_count(tokenizer(youtube_reviews_df['review_text'][6])[0], 1024)

4

In [61]:
#tokenizer(youtube_reviews_df['review_text'][6])['input_ids']

In [90]:
total_length = len(tokens)
optimal_count = total_length // 1024 + (1 if total_length % 1024 != 0 else 0)
print(optimal_count)
for i in range(len(sublists)):
    print(len(token_chunks[i]))

4
973
1023
1023
1023


In [86]:
#[tokens[i - overlap:i + chunk_size] for i in range(overlap, len(tokens), chunk_size - overlap)]
idx_chunk_sizes = [i*len(tokens)//4 for i in range(4)]
[tokens[idx:idx + 1] for idx in idx_chunk_sizes]
idx_chunk_sizes

[0, 1022, 2044, 3066]

In [92]:
sublists = [tokens[idx_chunk_sizes[i]:idx_chunk_sizes[i+1]] for i in range(len(idx_chunk_sizes)-1)]
sublists.append(tokens[idx_chunk_sizes[-1]:])
len(sublists)

4

In [94]:
# Tokenize the entire text
tokens = tokenizer(youtube_reviews_df['review_text'][6])['input_ids']

# Split tokens into chunks of a specific size (e.g., 512) with an overlap
chunk_size = 1024-51
overlap = 50
token_chunks = [tokens[i - overlap:i + chunk_size] for i in range(overlap, len(tokens), chunk_size - overlap)]

token_chunks = sublists

# Summarize each chunk separately
chunk_summaries = []
for chunk_tokens in token_chunks:
    # Convert tokens back to text
    chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)

    # Summarize the chunk
    summary = pipe(chunk_text, **gen_kwargs)
    #summary = summarizer(chunk_text, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)
    chunk_summaries.append(summary[0]['summary_text'])

# Concatenate the individual summaries to form the final summary
final_summary = " ".join(chunk_summaries)

print(final_summary)

In-depth review of the wf-1000xm4 premium headphones .<n>Review includes build quality, packaging, apps, and connectivity .<n>Reviewer: This might be the ideal tws for a lot of you but personally for me this was not the perfect tws . These earphones have a lot of features that speak to chat and voice control .<n>They also have a lot of equalizer options and you can customize them .<n>The battery life on these is rated about eight hours . Airport pro xm4 headphones active noise cancellation is a big improvement from the last generation that were bad really regarding the battery life .<n>This has to be the best active noise cancellation on a tws period .<n>Memory foam ear tips also you press them and it inside the seal is so good that even if you are not using active noise cancellation it isolates you quite a bit . Battery life is not an issue on this one and when you put back in the case it just charges .<n>Supports wireless charging so if you're a fan of that wireless charging uh you c

In [57]:
def summarizing_long_text(text, max_length = 1024):
    #instantiate this in a class later on when refactoring
    #gen_kwargs = {"length_penalty": 0.8, "num_beams":5, "max_length": 128}
    #pipe = pipeline("summarization", model='pegasus-reviews')
    
    try:
        tokens = tokenizer(text)['input_ids']

        total_length = len(tokens)
        optimal_count = total_length // max_length + (1 if total_length % max_length != 0 else 0)

        idx_chunk_sizes = [i*len(tokens)//optimal_count for i in range(optimal_count)]

        token_chunks = [tokens[idx_chunk_sizes[i]:idx_chunk_sizes[i+1]] for i in range(len(idx_chunk_sizes)-1)]
        token_chunks.append(tokens[idx_chunk_sizes[-1]:])


        # Summarize each chunk separately
        chunk_summaries = []
        for chunk_tokens in token_chunks:
            # Convert tokens back to text
            chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)

            # Summarize the chunk
            summary = pipe(chunk_text, **gen_kwargs)
            #summary = summarizer(chunk_text, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)
            chunk_summaries.append(summary[0]['summary_text'])

        # Concatenate the individual summaries to form the final summary
        final_summary = " ".join(chunk_summaries)

        return final_summary
    
    except Exception as e:
        # Print or log the error for debugging
        print(f"An error occurred: {e}")
        # Return None or any other value to indicate failure
        return None

In [29]:
for i in range(0, 5):
    print(summarizing_long_text(youtube_reviews_df['review_text'][i*6]))

The Sony WF-1000XM4 earbuds are a great value at $279 .<n>They have the best noise cancellation of any buds I've tried .<n>They have one of the best sound profiles of any buds I’ve tried .<n>They also support Sony’s LDAC which gets you higher quality audio over bluetooth . The Sony Pixel Buds have a touch surface that can register a single, double, or triple tap, plus a long press, which is something most other buds I’ve tried just can’t do .<n>You can customize what you want each tap to register as in the Sony app .<n>I’ve never run out of battery while listening to these and they’re rated for 8 hours of battery and the charge in the case will get you up to another 16 hours . The Sony WF-XM4s are the best noise canceling earbuds I've ever used .<n>They've worked great on my Samsung and Pixel phones, plus with Android you get LDAC quality streaming .<n>The only issue I ever ran into is when I had them in and I turned transparency mode on, it just turned off transparency mode without an

Your max_length is set to 128, but your input_length is only 92. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=46)


The Galaxy buds 2 pro are Google's new earbuds for the iPhone 14 .<n>The buds come in black, white, lilac-y purple and are available in any color .<n>The fit is about the same as the google pixel pro but they're about 15 smaller than last year's version . Apple's earbuds don't sound as good as the company's advertised sound quality .<n>Head tracking doesn't work as well as the company's app .<n>The noise cancellation function doesn't work as well as the company's app .<n>The noise canceling function doesn't work as well as the company's app . The galaxy buds 2 pro are a pair of earbuds that connect to your phone .<n>They come in two sizes and come with built in noise cancellation and head tracking .<n>Reviewer says they're good for everyday music but not great for podcasts .
The Sennheiser momentum true Wireless 3.<n>One of my favorite sets of true wireless earbuds with a super strong feature set and app support .<n>Multi-point connectivity makes them even more useful for work or schoo

In [27]:
#get_full_summaries(youtube_reviews_df['review_text'][6])

In [28]:
tqdm.pandas()
#youtube_reviews_df['summarized_reviews'] = youtube_reviews_df['review_text'].progress_apply(lambda x: get_full_summaries(x))

In [31]:
summarizing_long_text(youtube_reviews_df['review_text'][18])

Your max_length is set to 128, but your input_length is only 92. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=46)


'The Sennheiser momentum true Wireless 3.<n>One of my favorite sets of true wireless earbuds with a super strong feature set and app support .<n>Multi-point connectivity makes them even more useful for work or school needs .'

In [36]:
tokens = tokenizer(youtube_reviews_df['review_text'][18])['input_ids']

# Split tokens into chunks of a specific size (e.g., 512) with an overlap
chunk_size = 1024-51
overlap = 50
token_chunks = [tokens[i - overlap:i + chunk_size] for i in range(overlap, len(tokens), chunk_size - overlap)]

len(tokens)

92

This video (18th row in current dataframe) has too few tokens because it is a youtube short. We don't want to summarize these so let's remove them.

In [38]:
youtube_reviews_df

Unnamed: 0,video_id,title,video_link,channel_name,review_text,headphone
0,0L4tO4HDjnU,Sony WF-1000XM4 Earbuds Review - 6 Months Later,https://www.youtube.com/watch?v=0L4tO4HDjnU,6 Months Later,"The Sony WF-1000XM4 earbuds, which is a mouth...",sony xm4 earbuds
1,hIWn_RjN-Wg,I&#39;m ditching my AirPods Pros - Sony WF-100...,https://www.youtube.com/watch?v=hIWn_RjN-Wg,ShortCircuit,(wind rushing) (slow music) - As much as I lo...,sony xm4 earbuds
2,fMw1h4hKT_E,Sony WF-1000XM4 review: silence to my ears,https://www.youtube.com/watch?v=fMw1h4hKT_E,The Verge,so it's been almost two years since sony rele...,sony xm4 earbuds
3,mpOv9h0L9Cc,The Most Advanced Earbuds? Sony WF-1000XM4 Rev...,https://www.youtube.com/watch?v=mpOv9h0L9Cc,UrAvgConsumer,[Music] what's going on guys it's your averag...,sony xm4 earbuds
4,Rknl4uIOyfY,Sony WF-1000XM4 true wireless earbuds full review,https://www.youtube.com/watch?v=Rknl4uIOyfY,GSMArena Official,hey what's up guys will here for gsm arena we...,sony xm4 earbuds
...,...,...,...,...,...,...
100,84qmGGT1erc,LG Tone Free TF8 Review | Waterproof True Wire...,https://www.youtube.com/watch?v=84qmGGT1erc,Andy's Tech Tone,hi everyone and welcome to the channel lg hav...,lg tone tf8
101,zExw0KHUlss,These have a WORLD&#39;S FIRST - LG TONE Free ...,https://www.youtube.com/watch?v=zExw0KHUlss,ShortCircuit,y'all ready for this lg sponsored this video ...,lg tone tf8
102,f_B1bY7Yatg,LG TONE Free Fit TF8 : Wireless Earbuds Fit To...,https://www.youtube.com/watch?v=f_B1bY7Yatg,Gamesky,all right guys so this video is sponsored by ...,lg tone tf8
103,cy4bC4or8Ho,LG TONE Free Fit TF8 Unboxing &amp; Hands-On W...,https://www.youtube.com/watch?v=cy4bC4or8Ho,TechTablets,these are algae's first sports earbuds they a...,lg tone tf8


In [None]:
def remove_yt_shorts(text, max_length = 128):
    tokens = tokenizer(text)['input_ids']
    
    if len(tokens) > max_length:
        

In [43]:
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

def get_youtube_video_duration(api_key, video_url):
    # Extract video ID from the URL
    video_id = video_url.split("v=")[1]

    # Initialize the YouTube Data API
    youtube = build("youtube", "v3", developerKey=api_key)

    try:
        # Call the videos().list method to retrieve video details
        response = youtube.videos().list(
            part="contentDetails",
            id=video_id
        ).execute()

        # Extract video duration from the response
        duration = response["items"][0]["contentDetails"]["duration"]
        return duration

    except HttpError as e:
        print(f"An error occurred: {e}")
        return None

# Replace 'VIDEO_URL' with the URL of the YouTube video you want to check
video_url = youtube_reviews_df['video_link'][1]

video_duration = get_youtube_video_duration(api_key, video_url)

if video_duration:
    print(f"The duration of the video is: {video_duration}")


The duration of the video is: PT9M38S


In [45]:
def filter_short_videos(df, api_key):
    # Create a new column 'duration' to store video durations
    df['duration'] = df['video_link'].apply(lambda x: get_youtube_video_duration(api_key, x))

    # Convert duration strings to seconds (optional, depending on your needs)
    df['duration_seconds'] = df['duration'].apply(lambda x: pd.to_timedelta(x).seconds)

    # Filter rows where the duration is less than 60 seconds
    df_filtered = df[df['duration_seconds'] >= 60].copy()

    # Drop the temporary columns
    df_filtered = df_filtered.drop(['duration', 'duration_seconds'], axis=1)

    return df_filtered

In [48]:
cleaned_yt_reviews_df = filter_short_videos(youtube_reviews_df, api_key)

In [58]:
cleaned_yt_reviews_df['review_text'][0:2].apply(lambda x: summarizing_long_text(x))

0    The Sony WF-1000XM4 earbuds are a great value ...
1    Sony's WF-1000XM4s have a new integrated V1 pr...
Name: review_text, dtype: object

In [59]:
cleaned_yt_reviews_df['generated_summaries'] = cleaned_yt_reviews_df['review_text'].apply(lambda x: summarizing_long_text(x))

An error occurred: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
An error occurred: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
An error occurred: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).


In [62]:
cleaned_yt_reviews_df['generated_summaries'][0]

"The Sony WF-1000XM4 earbuds are a great value at $279 .<n>They have the best noise cancellation of any buds I've tried .<n>They have one of the best sound profiles of any buds I’ve tried .<n>They also support Sony’s LDAC which gets you higher quality audio over bluetooth . The Sony Pixel Buds have a touch surface that can register a single, double, or triple tap, plus a long press, which is something most other buds I’ve tried just can’t do .<n>You can customize what you want each tap to register as in the Sony app .<n>I’ve never run out of battery while listening to these and they’re rated for 8 hours of battery and the charge in the case will get you up to another 16 hours . The Sony WF-XM4s are the best noise canceling earbuds I've ever used .<n>They've worked great on my Samsung and Pixel phones, plus with Android you get LDAC quality streaming .<n>The only issue I ever ran into is when I had them in and I turned transparency mode on, it just turned off transparency mode without a

Now finally let's clean the text a bit and save this to the a new csv file with all the generated summaries.

In [None]:
import re

def clean_text(input_text):
    # Remove <n>s
    cleaned_text = input_text.replace("<n>", "")
    
    # Remove spaces before periods
    cleaned_text = re.sub(r'\s+(\.)', r'\1', cleaned_text)
    
    return cleaned_text

cleaned_text = clean_text(cleaned_yt_reviews_df['generated_summaries'][0])
print(cleaned_text)

The Sony WF-1000XM4 earbuds are a great value at $279.They have the best noise cancellation of any buds I've tried.They have one of the best sound profiles of any buds I’ve tried.They also support Sony’s LDAC which gets you higher quality audio over bluetooth. The Sony Pixel Buds have a touch surface that can register a single, double, or triple tap, plus a long press, which is something most other buds I’ve tried just can’t do.You can customize what you want each tap to register as in the Sony app.I’ve never run out of battery while listening to these and they’re rated for 8 hours of battery and the charge in the case will get you up to another 16 hours. The Sony WF-XM4s are the best noise canceling earbuds I've ever used.They've worked great on my Samsung and Pixel phones, plus with Android you get LDAC quality streaming.The only issue I ever ran into is when I had them in and I turned transparency mode on, it just turned off transparency mode without any input from me.I think this i

In [87]:
final_yt_reviews_df = cleaned_yt_reviews_df.dropna()

In [88]:
final_yt_reviews_df['generated_summaries'] = final_yt_reviews_df['generated_summaries'].apply(lambda x: clean_text(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_yt_reviews_df['generated_summaries'] = final_yt_reviews_df['generated_summaries'].apply(lambda x: clean_text(x))


In [92]:
final_yt_reviews_df.to_csv('yt_reviews_gen_summaries.csv')