# Data Collection and Exploration

Here we load data from YouTube. The main goal is to collect closed-caption (cc) text from videos returned by relevant search queries.

The outline is as follows (using laptops as the example product):

 - Use the YouTube API to query videos like "top ten laptops"
 - Get the cc text and other relevant information from several of those videos
 - Extract top laptops from the cc text with NER
 - Then query the YouTube API again to get review cc text for each laptop
 - Store this all as raw data.
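
The outline above can be sketched as a small two-pass function. This is only an illustration under assumptions: `collect_raw_data`, its parameters, and the `"{name} review"` query template are hypothetical names chosen here, and the search and NER steps are passed in as callables so the sketch does not depend on any particular API.

```python
from typing import Callable, Dict, List


def collect_raw_data(
    fetch_videos: Callable[[str], List[dict]],
    extract_products: Callable[[str], List[str]],
    roundup_query: str = "top ten laptops",
) -> Dict[str, object]:
    """Two-pass collection: roundup videos first, then review videos per product."""
    # Pass 1: search for roundup videos; each result carries its caption text.
    roundup_videos = fetch_videos(roundup_query)

    # NER step: pull product names out of each caption, de-duplicated in order.
    products: List[str] = []
    for video in roundup_videos:
        for name in extract_products(video["cc_text"]):
            if name not in products:
                products.append(name)

    # Pass 2: search again for reviews of each extracted product.
    # The "<name> review" template is an assumption, not taken from the notebook.
    reviews = {name: fetch_videos(f"{name} review") for name in products}

    # Everything is returned unprocessed so it can be stored as raw data.
    return {"roundups": roundup_videos, "products": products, "reviews": reviews}
```

Passing the two steps in as functions keeps the flow easy to try out with stand-ins before spending API quota.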

In [1]:
import os
from dotenv import load_dotenv
from youtube_transcript_api import YouTubeTranscriptApi
from googleapiclient.discovery import build

Change to the project root directory to get access to the .env file.

In [2]:
os.chdir("../")

In [3]:
pwd

'c:\\Users\\RaviB\\GitHub\\TechKnowBot'

Build YouTube connector.

In [4]:
load_dotenv()

youtube_api_key = os.getenv('GCP_YOUTUBE_API_KEY')

In [5]:
openai_api_key = os.getenv('OPENAI_API_KEY')
openai_api_org = os.getenv('OPENAI_API_ORG')
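
`os.getenv` returns `None` when a variable is missing, so a typo in the .env file would only surface later as a confusing API error. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail early instead of passing None on to an API client.
        raise RuntimeError(f"Missing environment variable: {name}")
    return value
```

It could be used in place of the plain lookups, for example `youtube_api_key = require_env('GCP_YOUTUBE_API_KEY')`.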

Reuse old code from the SentiRec Analytics project: https://github.com/RavinderRai/SentiRec-Analytics/blob/main/modules/YouTubeReviewScraper.py.

In [110]:
class YouTubeAPIData:    
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build('youtube', 'v3', developerKey=api_key)
        
    def fetch_youtube_data(self, search_query, filtered_strings=[], max_results=5):
        """
        Search for YouTube videos based on a given query and retrieve additional information including closed captions.

        Parameters:
        - search_query (str): The search query used to find relevant videos on YouTube.
        - filtered_strings (list): Strings used to filter results; a video is skipped if its title contains any of them. Defaults to an empty list.
        - max_results (int): The maximum number of videos to retrieve. Defaults to 5.

        Returns:
        List[dict]: A list of dictionaries, each containing information about a video, including:
            - 'video_id' (str): The unique identifier for the video.
            - 'title' (str): The title of the video.
            - 'video_link' (str): The YouTube link to the video.
            - 'channel_name' (str): The name of the channel that uploaded the video.
            - 'cc_text' (str): The closed captions text for the video.

        Note:
        - The 'cc_text' field may contain an empty string if closed captions are not available. Handle that case as needed.
        """       
        
        search_response = self.youtube.search().list(
            q=search_query,
            type='video',
            part='id, snippet',
            maxResults=max_results
        ).execute()
        
        videos_data = []
        for result in search_response.get('items', []):
            video_id = result['id']['videoId']
            title = result['snippet']['title']
            video_link = f'https://www.youtube.com/watch?v={video_id}'
            channel_name = result['snippet']['channelTitle']

            # Check and remove unwanted titles
            if not any(s in title for s in filtered_strings):
                cc_text = self.fetch_captions(video_id)
                videos_data.append({
                    'video_id': video_id,
                    'title': title,
                    'video_link': video_link,
                    'channel_name': channel_name,
                    'cc_text': cc_text
                })

        return videos_data
    
    def fetch_captions(self, video_id):
        """
        Get the closed captions. 

        Parameters:
        - video_id (str): The video ID obtained in fetch_youtube_data.

        Returns:
        str: The closed caption text of the YouTube video.
        """
        try:
            # Retrieve the transcript for the video
            transcript = YouTubeTranscriptApi.get_transcript(video_id)

            cc_text = ""

            # Concatenate the transcript text
            for entry in transcript:
                cc_text += ' ' + entry['text']
                
            cc_text = cc_text.replace('\n', ' ')
            return cc_text

        except Exception as e:
            # Log the failure and return an empty string, as documented in fetch_youtube_data
            print(f"An error occurred: {str(e)}")
            return ""

In [111]:
youtube = YouTubeAPIData(youtube_api_key)

In [114]:
query_ex = "top ten laptops"
search_ex = youtube.fetch_youtube_data(query_ex, max_results=5)
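
Some videos will not have a usable transcript, so it may help to drop those results before exploring the text. The sketch below is an assumption-laden illustration: `has_captions` and `keep_captioned` are hypothetical helpers, and they treat both an empty string and a string starting with "An error occurred" as missing captions.

```python
ERROR_PREFIX = "An error occurred"  # prefix of the message printed when captions fail


def has_captions(video: dict) -> bool:
    """True if the video dict carries usable closed-caption text."""
    text = video.get("cc_text", "").strip()
    return bool(text) and not text.startswith(ERROR_PREFIX)


def keep_captioned(videos: list) -> list:
    """Return only the videos whose captions were retrieved."""
    return [video for video in videos if has_captions(video)]
```

For example, `search_ex = keep_captioned(search_ex)` would leave only the results that can be passed on to NER.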

In [9]:
print(search_ex[0]['cc_text'])

 top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 360 if you like graphic design or working with metrics in the office the galaxy book 2 Pro 360 is ideal for you fairly priced at $950 this laptop has a 15.6 in Amo LED ful

Get examples for the few-shot config file.

In [10]:
search_ex[0]

{'video_id': 'FK8veh-L8AE',
 'title': 'TOP 10 BEST LAPTOPS 2023',
 'video_link': 'https://www.youtube.com/watch?v=FK8veh-L8AE',
 'channel_name': 'Trend Max',
 'cc_text': " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 

Here is the text printed out so we can see it and extract the laptops manually to make the examples.json file.

top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 360 if you like graphic design or working with metrics in the office the galaxy book 2 Pro 360 is ideal for you fairly priced at $950 this laptop has a 15.6 in Amo LED full HD screen with a great resolution of 1920x 1080 pixels its image quality stands out offering vibrant colors and being 33% brighter than conventional laptops I love it the screen can rotate 360° and become a tablet since it's also a touchscreen so that you can work directly on your graphics or designs with the S Pen it CPU is a core i7 1260p which gives it tremendous strength and for its Graphics power it has an Intel Iris XE GPU so you can also use it to play it has 16 GB of RAM and 512 GB of storage that can be expanded with an SSD card number eight Acer Swift 5 2022 if you live a busy life you probably need the Acer Swift 5 designed for maximum portability it weighs just 2.2 lb thanks to its magnesium lithium and magnesium aluminum alloy casing don't think it's fragile though it screen is protected by a robust Gorilla Glass 4 has a resolution of 1920x 1080 pixels in full HD and is a touchcreen in case you need to handle elements in an agile way cool and practical right inside it 
hides an Intel Core i7 1260p CPU and an Intel Iris XE GPU so it's light and yet very powerful since it has 16 GB of RAM and up to 1 Terra of storage wow it's priced at $1,500 although I wouldn't say this is the best cost benefit ratio let's see what other options the market will offer number seven Macbook Pro 13 18 in M2 2022 if you need portability as in the previous number and the battery matters most to you the MacBook Pro is for you I won't make you wait until the end right now I'm telling you that the MacBook Pro's battery gives you Independence for more than 18 hours so you could work all day away from any plug that seems perfect to me but going into other specs I'll tell you that it screen measures 13.3 in and has a resolution of up to 2560 x600 pixels which is great its processor is the 8 core Apple M2 and as Graphics support it has an integrated 8 core M2 GPU which together gives it great integrity for all types of functions which is best combined with its up to 24 GB of RAM and up to 2 terab of storage giving the MacBook Pro colossal Power by the way it cost $970 but an attractive price number six Asus Rog zephrus G14 this is one one of my favorites on the list and you'll love it too if you're a gaming fan the Asus Rog zephrus G14 features an AMD ryzen 9 4900 HS CPU which has enough power for any program and even the most resource demanding games plus it has the graphics support of Nvidia GeForce RTX 2060 Max Q you can edit movies on this laptop because its 14-in LCD screen offers a wqhd image with a resolution of 2560 by 1440 pixels which is amazing and it's also anti-reflective so that nothing distorts what you see by the way have you seen that awesome backlit keyboard yet fabulous right well it has 8 GB of RAM not bad but it can be better and up to 1 TB of storage that is great how much for this laptop $2,000 kind of overpriced if we compare it with what we've seen so far but it's worth it number five framework laptop 13 this laptop really caught my 
attention because of its concept the framework laptop 13 is probably the only modular laptop on the market what does this mean it means that you can upgrade its parts yourself fabulous even so it comes perfectly equipped with an AMD ryzen 9 7840 U CPU which is very powerful and also has a radon 700m GPU how can you upgrade that well when better Pieces come out in the future but let's continue with the specs its Ram comes from 8 to 64 GB it's up to you and has a storage capacity from 250 GB to 2 TB have you felt the power of modular Paradise yet it screen is 13.5 in and you can't change it but you won't need to do it anyway because it already offers an incredible resolution of 2256 X 1540 pixels where you can work or even play with spectacular image quality it's priced starting at $11,000 and I must say that it's worth every penny number four Lenovo Chromebook duet 3 this is my absolute favorite on the list it has everything and for an incredible $380 this is as absurd as it is great the Lenovo Chromebook duet 3 is powered by a Snapdragon 7c Gen 2 processor that provides impressive performance for any kind of task if you want to play or like graphic design you'll be backed by a gorgeous Qualcomm adreno GPU wow by the way it's 10.95 in screen makes it a portable treasure which perfectly matches its 1.14 lb weight but let's go back to the screen it's a touchscreen that offers an incredible resolution of 2,000x 1200 pixels and comes with a pen to design or make your graphics for your meetings weight size incredible performance and economy if you work remotely and like to travel the Chromebook duet 3 could be your best friend number three Asus Zenbook 13 OLED this is another laptop that I love how couldn't I it takes image quality and portability to a level that's hard to beat for a perfectly balanced price of $970 the Asus Zenbook 13 OLED as its name suggests features a 13-in HDR OLED display giving you an absurdly sharp image with incredibly Vivid colors and a 
resolution of 1920x 1080 pixels spectacular but if you're still not totally stunned you're about to see that its interior is an unparalleled appetizer since it houses is an AMD ryzen 7 5700 U CPU which is very powerful and will provide Power to everything you need and obviously behind such graphic quality there had to be this integrated Radeon graphics GPU the only detail that could be improved is its Ram which is 8 GB but it has 512 GB of storage which is great if you focus on portability number two Apple MacBook Air M2 the trick to the Apple MacBook Air that makes it probably the best laptop out there is that it's absurdly complete and portable it has a 13.6 in screen with Incredible image quality and provides a resolution of 2560 X 1664 pixels also look at that elegant design it comes equipped with an 8 core M2 CPU which is one of the best creations of the brand and gives it incredible support for everything it also comes with it a 10 core M2 GPU to give it a bunch of power to play anything what about the ram well it goes to the cloud since it has 24 GB for an impeccable interface and up to 2 terab of storage which feels like you can carry your entire office or studio in a briefcase I love it it's priced at $2,000 which is fair but not so affordable number one Dell XPS 15 OLED naming the Dell XPS 15 OLED as the best laptop of these times isn't a difficult task for $1,580 which is fully balanced in a cost benefit ratio it gives you such a great performance that's very difficult to beat to start it has a very solid and Powerful Intel Core I9 12,900 HK CPU for graphic work gaming and any Management program and as if that wasn't enough it also comes with an Nvidia RTX 3050 TI 4 GB GPU so the Dell XPS 15 OLED can play anything it's 15.6 in OLED screen provides a great resolution of 3456 X 2160 pixels and an impressively sharp image quality but if you check its Ram get ready because it has 64 GB of RAM for a practically immediate interface and a storage of a monstrous 
four teras there's nothing more to add this is the best laptop which of these laptops best suits your needs tell me in the comments and while you're there don't forget to like this video And subscribe to my channel because all be posting more videos like this see you in the next electronic [Music] Adventure

ChatGPT was used to help extract the laptops. Here are some items that an LLM could pick up as laptops or organizations, but that are neither:

 - GPU
 - SSD
 - CPU
 - Terra
 - Powerful Intel Core I9
 - AMD ryzen 9
 - Nvidia Geforce RTX
 - Nvidia RTX 3050 TI

In [107]:
data = [
    {
        "text": " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 360 if you like graphic design or working with metrics in the office the galaxy book 2 Pro 360 is ideal for you fairly priced at $950 this laptop has a 15.6 in Amo LED full HD screen with a great resolution of 1920x 1080 pixels its image quality stands out offering vibrant colors and being 33% brighter than conventional laptops I love it the screen can rotate 360\u00b0 and become a tablet since it's also a touchscreen so that you can work directly on your graphics or designs with the S Pen it CPU is a core i7 1260p which gives it tremendous strength and for its Graphics power it has an Intel Iris XE GPU so you can also use it to play it has 16 GB of RAM and 512 GB of storage that can be expanded with an SSD card number eight Acer Swift 5 2022 if you live a busy life you probably need the Acer Swift 5 designed for maximum portability it weighs just 2.2 lb thanks to its magnesium lithium and magnesium aluminum alloy casing don't think it's fragile though it screen is protected by a robust Gorilla Glass 4 has a resolution of 1920x 1080 pixels in full HD and is a touchcreen in case you need to handle elements in an agile way cool and 
practical right inside it hides an Intel Core i7 1260p CPU and an Intel Iris XE GPU so it's light and yet very powerful since it has 16 GB of RAM and up to 1 Terra of storage wow it's priced at $1,500 although I wouldn't say this is the best cost benefit ratio let's see what other options the market will offer number seven Macbook Pro 13 18 in M2 2022 if you need portability as in the previous number and the battery matters most to you the MacBook Pro is for you I won't make you wait until the end right now I'm telling you that the MacBook Pro's battery gives you Independence for more than 18 hours so you could work all day away from any plug that seems perfect to me but going into other specs I'll tell you that it screen measures 13.3 in and has a resolution of up to 2560 x600 pixels which is great its processor is the 8 core Apple M2 and as Graphics support it has an integrated 8 core M2 GPU which together gives it great integrity for all types of functions which is best combined with its up to 24 GB of RAM and up to 2 terab of storage giving the MacBook Pro colossal Power by the way it cost $970 but an attractive price number six Asus Rog zephrus G14 this is one one of my favorites on the list and you'll love it too if you're a gaming fan the Asus Rog zephrus G14 features an AMD ryzen 9 4900 HS CPU which has enough power for any program and even the most resource demanding games plus it has the graphics support of Nvidia GeForce RTX 2060 Max Q you can edit movies on this laptop because its 14-in LCD screen offers a wqhd image with a resolution of 2560 by 1440 pixels which is amazing and it's also anti-reflective so that nothing distorts what you see by the way have you seen that awesome backlit keyboard yet fabulous right well it has 8 GB of RAM not bad but it can be better and up to 1 TB of storage that is great how much for this laptop $2,000 kind of overpriced if we compare it with what we've seen so far but it's worth it number five framework laptop 13 this 
laptop really caught my attention because of its concept the framework laptop 13 is probably the only modular laptop on the market what does this mean it means that you can upgrade its parts yourself fabulous even so it comes perfectly equipped with an AMD ryzen 9 7840 U CPU which is very powerful and also has a radon 700m GPU how can you upgrade that well when better Pieces come out in the future but let's continue with the specs its Ram comes from 8 to 64 GB it's up to you and has a storage capacity from 250 GB to 2 TB have you felt the power of modular Paradise yet it screen is 13.5 in and you can't change it but you won't need to do it anyway because it already offers an incredible resolution of 2256 X 1540 pixels where you can work or even play with spectacular image quality it's priced starting at $11,000 and I must say that it's worth every penny number four Lenovo Chromebook duet 3 this is my absolute favorite on the list it has everything and for an incredible $380 this is as absurd as it is great the Lenovo Chromebook duet 3 is powered by a Snapdragon 7c Gen 2 processor that provides impressive performance for any kind of task if you want to play or like graphic design you'll be backed by a gorgeous Qualcomm adreno GPU wow by the way it's 10.95 in screen makes it a portable treasure which perfectly matches its 1.14 lb weight but let's go back to the screen it's a touchscreen that offers an incredible resolution of 2,000x 1200 pixels and comes with a pen to design or make your graphics for your meetings weight size incredible performance and economy if you work remotely and like to travel the Chromebook duet 3 could be your best friend number three Asus Zenbook 13 OLED this is another laptop that I love how couldn't I it takes image quality and portability to a level that's hard to beat for a perfectly balanced price of $970 the Asus Zenbook 13 OLED as its name suggests features a 13-in HDR OLED display giving you an absurdly sharp image with incredibly 
Vivid colors and a resolution of 1920x 1080 pixels spectacular but if you're still not totally stunned you're about to see that its interior is an unparalleled appetizer since it houses is an AMD ryzen 7 5700 U CPU which is very powerful and will provide Power to everything you need and obviously behind such graphic quality there had to be this integrated Radeon graphics GPU the only detail that could be improved is its Ram which is 8 GB but it has 512 GB of storage which is great if you focus on portability number two Apple MacBook Air M2 the trick to the Apple MacBook Air that makes it probably the best laptop out there is that it's absurdly complete and portable it has a 13.6 in screen with Incredible image quality and provides a resolution of 2560 X 1664 pixels also look at that elegant design it comes equipped with an 8 core M2 CPU which is one of the best creations of the brand and gives it incredible support for everything it also comes with it a 10 core M2 GPU to give it a bunch of power to play anything what about the ram well it goes to the cloud since it has 24 GB for an impeccable interface and up to 2 terab of storage which feels like you can carry your entire office or studio in a briefcase I love it it's priced at $2,000 which is fair but not so affordable number one Dell XPS 15 OLED naming the Dell XPS 15 OLED as the best laptop of these times isn't a difficult task for $1,580 which is fully balanced in a cost benefit ratio it gives you such a great performance that's very difficult to beat to start it has a very solid and Powerful Intel Core I9 12,900 HK CPU for graphic work gaming and any Management program and as if that wasn't enough it also comes with an Nvidia RTX 3050 TI 4 GB GPU so the Dell XPS 15 OLED can play anything it's 15.6 in OLED screen provides a great resolution of 3456 X 2160 pixels and an impressively sharp image quality but if you check its Ram get ready because it has 64 GB of RAM for a practically immediate interface and a 
storage of a monstrous four teras there's nothing more to add this is the best laptop which of these laptops best suits your needs tell me in the comments and while you're there don't forget to like this video And subscribe to my channel because all be posting more videos like this see you in the next electronic [Music] Adventure",
        "spans": [
            {
                "text": "HP Spectre x364",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is a laptop from the company HP"
            },
            {
                "text": "HP Spectre X 3614",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "while this is a laptop, its name is spelled slightly incorrectly, resulting in a duplicate laptop"
            },
            {
                "text": "Spectre x364",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "is the model of a laptop from the company HP, but not the full name in this case."
            },
            {
                "text": "Samsung Galaxy Book 2 Pro 360",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is one of many laptops from the company Samsung's popular line of laptops called Galaxy Book"
            },
            {
                "text": "Acer Swift 5",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is a laptop from the company Acer, and Swift 5 is a reference to the model"
            },
            {
                "text": "MacBook Pro 13 in M2",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is one of many laptops from the company Apple's popular line of laptops called MacBook, and includes 13 in which is the size of the screen: 13 inches, and M2 which is the Apple's signature chip"
            },
            {
                "text": "Asus ROG Zephyrus G14",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is one of many laptops from the company Asus's popular line of laptops called ROG Zephyrus"
            },
            {
                "text": "Framework Laptop 13",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is a laptop from a lesser known company called Framework, who specializes in making customizable laptops"
            },
            {
                "text": "Lenovo Chromebook Duet 3",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is a laptop from Lenovo, and Chromebook is a reference to Google's OS"
            },
            {
                "text": "Asus Zenbook 13 OLED",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is one of many laptops from the company Asus's popular line of laptops called Zenbook"
            },
            {
                "text": "MacBook Air M2",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is one of many laptops from the company Apple's popular line of laptops called MacBook"
            },
            {
                "text": "Dell XPS 15 OLED",
                "is_entity": True,
                "label": "LAPTOP",
                "reason": "is a laptop from a popular company Dell"
            },
            {
                "text": "MacBook Pro",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "is a type of Apple laptop but doesn't have enough specifications in its name to be a LAPTOP entity"
            },
            {
                "text": "GPU",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "GPU stands for graphics processing unit and is a computer component"
            },
            {
                "text": "SSD",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "SSD stands for solid state drive and is a computer component"
            },
            {
                "text": "CPU",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "CPU stands for central processing unit and is a computer component"
            },
            {
                "text": "Terra",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "Terra is a typo; here it should be tera, which is short for terabyte: a unit of digital information storage"
            },
            {
                "text": "Powerful Intel Core I9",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is referencing a CPU type using the word Powerful as an adjective"
            },
            {
                "text": "AMD ryzen 9",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is one of many CPU components a computer or laptop could have"
            },
            {
                "text": "Nvidia Geforce RTX",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is referencing a specific type of GPU, namely Nvidia's GeForce RTX line"
            },
            {
                "text": "Nvidia RTX 3050 TI",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is one of many GPU components a computer or laptop could have"
            },
            {
                "text": "Nvidia",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is a company that is known for making GPUs, not laptops"
            },
            {
                "text": "AMD",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is a company that is known for making CPUs and GPUs, not laptops"
            },
            {
                "text": "Chromebook",
                "is_entity": False,
                "label": "==NONE==",
                "reason": "this is a type of laptop that many companies make, that runs on Google's OS"
            }
        ]
    }
]

In [109]:
import json

# Write to a JSON file
with open('notebooks/examples.json', 'w') as f:
    json.dump(data, f, indent=4)
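
Since the few-shot examples are written by hand, a quick consistency check before relying on the file can catch slips. The sketch below assumes, based on the pattern in most of the spans above, that an entity should carry a real label and a non-entity should carry the `==NONE==` placeholder; `find_span_problems` is a hypothetical helper.

```python
REQUIRED_KEYS = {"text", "is_entity", "label", "reason"}
NONE_LABEL = "==NONE=="


def find_span_problems(examples: list) -> list:
    """Return human-readable descriptions of inconsistent few-shot spans."""
    problems = []
    for i, example in enumerate(examples):
        for span in example.get("spans", []):
            missing = REQUIRED_KEYS - span.keys()
            if missing:
                problems.append(
                    f"example {i}: span {span.get('text')!r} is missing {sorted(missing)}"
                )
                continue
            # Assumed rule: entities have a real label, non-entities the placeholder.
            if span["is_entity"] == (span["label"] == NONE_LABEL):
                problems.append(
                    f"example {i}: span {span['text']!r} has is_entity="
                    f"{span['is_entity']} with label {span['label']!r}"
                )
    return problems
```

Calling `find_span_problems(data)` returns an empty list when every span follows the assumed rule.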

## NER

### Extracting Objects

After searching for "best of" videos for a given type of product, we now want to extract which products those videos mention. We can do this with named entity recognition (NER).

In [12]:
import spacy
ner = spacy.load("en_core_web_sm")

In [13]:
search_ex[0]

{'video_id': 'FK8veh-L8AE',
 'title': 'TOP 10 BEST LAPTOPS 2023',
 'video_link': 'https://www.youtube.com/watch?v=FK8veh-L8AE',
 'channel_name': 'Trend Max',
 'cc_text': " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 

In [14]:
cc_text_ex = search_ex[0]['cc_text']
doc = ner(cc_text_ex)

In [15]:
for ent in doc.ents:
    if ent.label_ in ["ORG", "PRODUCT"]:
        print(ent)

CPU
Intel
GPU
RAM
Samsung
CPU
Intel
RAM
SSD
Acer
CPU
Intel
RAM
Macbook Pro
the MacBook Pro
the MacBook Pro's
Apple
M2
RAM
G14
G14
AMD
CPU
Nvidia GeForce
LCD
RAM
AMD
CPU
Lenovo Chromebook
Lenovo Chromebook
Snapdragon
GPU
Chromebook
HDR
Vivid
AMD
CPU
GPU
Apple
the Apple MacBook Air
Dell
XPS 15
GPU
RAM


In [16]:
# Pair each entity with the next one; zip stops before running past the last entity
for ent, next_ent in zip(doc.ents, doc.ents[1:]):
    if ent.label_ == "ORG" and next_ent.label_ == "PRODUCT":
        print(ent, next_ent)

Apple M2
Dell XPS 15


### spacy-llm NER

Trying the spacy-llm integration for a better NER method, since the basic version above misses a lot. Here is a guide on how to do this: https://github.com/explosion/spacy-llm/tree/main/usage_examples/ner_v3_openai.

In [12]:
from spacy_llm.util import assemble

In [13]:
os.environ["OPENAI_API_KEY"] = openai_api_key
os.environ["OPENAI_API_ORG"] = openai_api_org

In [14]:
nlp = assemble("notebooks/config_test.cfg", overrides={"paths.examples": "notebooks/example_test.json"})

In [15]:
search_ex[1]["cc_text"]

" this year we tested over 70 different laptops by far a record for this channel we tested small ones we tested big ones we tested cheap ones and we tested expensive ones a huge variety from all kinds of Manufacturers and when you use these laptops side by side just like we do it becomes so obvious which laptops are great and which are completely mediocre well today is the day that we countd down the top 10 laptops that we tested in 2023 if you are planning to buy one of these lap tops you'll obviously want to buy them at the best possible price so check the links below the video our team scours the internet to find the best deals and we update them daily plus if new laptops are released after this video that we end up liking even better we'll include them down there too number 10 the HB Pavilion plus 14 in for a price of around $729 you get a crazy amount for the money it will be configured with an OLED panel at that price which is bright vibrant color accurate has a fast refresh rate

In [16]:
def extract_entities(text, config_file, examples_file):
    # build the spacy-llm pipeline from the config, pointing it at the examples file
    nlp = assemble(config_file, overrides={"paths.examples": examples_file})
    doc = nlp(text)

    # return (entity text, label) pairs
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return entities

In [17]:
search_ex[0]

{'video_id': 'FK8veh-L8AE',
 'title': 'TOP 10 BEST LAPTOPS 2023',
 'video_link': 'https://www.youtube.com/watch?v=FK8veh-L8AE',
 'channel_name': 'Trend Max',
 'cc_text': " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 

In [18]:
ents_ex_0 = extract_entities(text=search_ex[0]["cc_text"], config_file="notebooks/config_test2.cfg", examples_file="notebooks/example_test2.json")
print(len(ents_ex_0))
ents_ex_0



8


[('HP Spectre x364', 'LAPTOP'),
 ('Samsung Galaxy Book 2 Pro 360', 'LAPTOP'),
 ('Acer Swift 5', 'LAPTOP'),
 ('framework laptop 13', 'LAPTOP'),
 ('Lenovo Chromebook duet 3', 'LAPTOP'),
 ('Asus Zenbook 13 OLED', 'LAPTOP'),
 ('Apple MacBook Air M2', 'LAPTOP'),
 ('Dell XPS 15 OLED', 'LAPTOP')]

In [19]:
for i in range(0, 3):
    ents_ex = extract_entities(text=search_ex[i]["cc_text"], config_file="notebooks/config_test2.cfg", examples_file="notebooks/example_test2.json")
    print(len(ents_ex))
    print(ents_ex, "\n")


8
[('HP Spectre x364', 'LAPTOP'), ('Samsung Galaxy Book 2 Pro 360', 'LAPTOP'), ('Acer Swift 5', 'LAPTOP'), ('framework laptop 13', 'LAPTOP'), ('Lenovo Chromebook duet 3', 'LAPTOP'), ('Asus Zenbook 13 OLED', 'LAPTOP'), ('Apple MacBook Air M2', 'LAPTOP'), ('Dell XPS 15 OLED', 'LAPTOP')] 



KeyboardInterrupt: 

This performance is good enough for now. The one remaining issue is duplicates with slightly different names. For example,

('Apple MacBook air15 with M2', 'LAPTOP') and ('MacBook airs with M2', 'LAPTOP')

are the same laptop.

### Removing Duplicates with Cosine Similarity

We could check whether each name is a subset of another in terms of character counts, but in this case it would not be. We can instead use a similarity metric, like the cosine similarity.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
ex_ner = 'Apple MacBook air15 with M2'
ex_ner_dup = 'MacBook airs with M2'

vectorizer = CountVectorizer().fit_transform([ex_ner, ex_ner_dup])
vectors = vectorizer.toarray()
cosine_sim = cosine_similarity(vectors)
print(f"Cosine Similarity: {cosine_sim[0, 1]}")    

Cosine Similarity: 0.6708203932499369
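
To see where the 0.67 comes from, the same number can be reproduced by hand. The sketch below assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters). `token_cosine` is a hypothetical helper and is not part of the notebook.

```python
import math
import re
from collections import Counter

def token_cosine(a, b):
    # hypothetical helper: mimic CountVectorizer's defaults
    # (lowercase, tokens of two or more word characters)
    count_a = Counter(re.findall(r"\b\w\w+\b", a.lower()))
    count_b = Counter(re.findall(r"\b\w\w+\b", b.lower()))

    # dot product over the tokens the two names share
    dot = sum(count_a[tok] * count_b[tok] for tok in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# shared tokens: macbook, with, m2 -> 3 / (sqrt(5) * sqrt(4))
sim = token_cosine('Apple MacBook air15 with M2', 'MacBook airs with M2')
print(sim)
```

The first name has five tokens and the second has four, with three in common, so the score is 3 / (√5 · √4). Near-miss tokens such as "air15" and "airs" count as different words and contribute nothing.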


This works, but we need to run the cosine similarity on all our NER results (laptop names).

In [102]:
# a function to get the pairwise cosine similarities between all names
def detect_ner_duplicates(ner_lst):
    vectorizer = CountVectorizer().fit_transform(ner_lst)
    vectors = vectorizer.toarray()
    cosine_sim = cosine_similarity(vectors)
    return cosine_sim

Let's run through an example. Use the laptop names from the NER example above, plus the two duplicate names we just compared in the cosine similarity calculation.

In [103]:
laptop_names = [ent[0] for ent in ents_ex_0] + [ex_ner, ex_ner_dup]

cosine_similarities = detect_ner_duplicates(laptop_names)
cosine_similarities

array([[1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.        ,
        0.28867513, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.28867513, 0.        ,
        1.        , 0.        , 0.25      , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 1.        , 0.        , 0.67082039, 0.5       ],
       [0.        , 0.        , 0.       

We need a threshold to filter out the laptops that are too similar. Let's use 0.5 for now. And we need to identify the duplicates before we can remove them. Remember we need to keep one of the duplicate names, since they don't all have the exact same name.

In [104]:
threshold = 0.5
duplicates = []
for i in range(len(laptop_names)):
    for j in range(i + 1, len(laptop_names)):
        if cosine_similarities[i, j] >= threshold:
            duplicates.append((laptop_names[i], laptop_names[j], cosine_similarities[i, j]))

duplicates

[('Apple MacBook Air M2', 'Apple MacBook air15 with M2', 0.6708203932499369),
 ('Apple MacBook Air M2', 'MacBook airs with M2', 0.5),
 ('Apple MacBook air15 with M2', 'MacBook airs with M2', 0.6708203932499369)]

Now we can filter out the duplicate laptop names by iterating over the duplicate pairs and removing the second value in each pair, so we keep the first one. We then also need to remove from the duplicates list any later pair whose first value is the name we just removed, such as the third pair above, since that name is no longer in laptop_names.

Note: ('MacBook airs with M2', 'Apple MacBook Air M2', 0.5) does not appear above because the inner loop starts at j = i + 1, so each pair is listed only once, with the earlier name first. This is fine to leave as is, since we would've removed that value anyway.

In [100]:
i = 0
while i < len(duplicates):
    dup = duplicates[i]

    # we'll keep the first value, and remove the second one
    # (guarding against a name that an earlier pair already removed)
    if dup[1] in laptop_names:
        laptop_names.remove(dup[1])

    # later pairs that start with the value we just removed are now stale,
    # so drop them; build a new list rather than removing while iterating
    duplicates = duplicates[:i + 1] + [d for d in duplicates[i + 1:] if d[0] != dup[1]]

    i += 1


In [101]:
laptop_names

['HP Spectre x364',
 'Samsung Galaxy Book 2 Pro 360',
 'Acer Swift 5',
 'framework laptop 13',
 'Lenovo Chromebook duet 3',
 'Asus Zenbook 13 OLED',
 'Apple MacBook Air M2',
 'Dell XPS 15 OLED']
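
The identify-then-remove steps above can also be folded into a single pass. This is only a sketch of an alternative, not the notebook's method. `keep_first_of_duplicates` is a hypothetical helper that keeps a name only if it is not too similar to any name already kept. It assumes `similarities` is a square matrix indexed in the same order as `names`, such as the output of `detect_ner_duplicates`. The toy matrix below is hand-written for illustration, not computed.

```python
def keep_first_of_duplicates(names, similarities, threshold=0.5):
    # indices of the names we have decided to keep, in original order
    kept_idx = []
    for i in range(len(names)):
        # keep name i only if it is below the threshold against every kept name
        if all(similarities[i][j] < threshold for j in kept_idx):
            kept_idx.append(i)
    return [names[i] for i in kept_idx]

# toy example with a made-up similarity matrix (illustrative values only)
toy_names = ["MacBook Air M2", "MacBook airs with M2", "Dell XPS 15"]
toy_sims = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
unique_names = keep_first_of_duplicates(toy_names, toy_sims)
print(unique_names)
```

One difference from the pairwise loop: a name is compared only against names that were kept, so a name that resembles only an already-dropped name survives. Whether that is the desired behaviour depends on the data.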

### Collecting Full NER Data

In [115]:
query = "top ten laptops 2022"
search = youtube.fetch_youtube_data(query, max_results=10)

An error occurred: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=rGgDatUdedw! This is most likely caused by:

No transcripts were found for any of the requested language codes: ('en',)

For this video (rGgDatUdedw) transcripts are available in the following languages:

(MANUALLY CREATED)
None

(GENERATED)
 - nl ("Dutch (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
 - af ("Afrikaans")
 - ak ("Akan")
 - sq ("Albanian")
 - am ("Amharic")
 - ar ("Arabic")
 - hy ("Armenian")
 - as ("Assamese")
 - ay ("Aymara")
 - az ("Azerbaijani")
 - bn ("Bangla")
 - eu ("Basque")
 - be ("Belarusian")
 - bho ("Bhojpuri")
 - bs ("Bosnian")
 - bg ("Bulgarian")
 - my ("Burmese")
 - ca ("Catalan")
 - ceb ("Cebuano")
 - zh-Hans ("Chinese (Simplified)")
 - zh-Hant ("Chinese (Traditional)")
 - co ("Corsican")
 - hr ("Croatian")
 - cs ("Czech")
 - da ("Danish")
 - dv ("Divehi")
 - nl ("Dutch")
 - en ("English")
 - eo ("Esperanto")
 - et ("Estonian")
 - ee ("Ewe")
 - fil

We need to filter out entries whose cc_text holds an error message rather than an actual transcript.
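
A minimal sketch of such a filter is below. How `fetch_youtube_data` stores a failed transcript is not shown here, so this assumes the error text ends up in `cc_text` and contains the phrase from the message above. Both `ERROR_MARKER` and `drop_failed_transcripts` are hypothetical names, and the sample records are made up.

```python
# assumption: failed lookups leave this phrase inside cc_text
ERROR_MARKER = "Could not retrieve a transcript"

def drop_failed_transcripts(videos, marker=ERROR_MARKER):
    cleaned = []
    for video in videos:
        cc_text = video.get("cc_text")
        # skip missing or empty captions, and captions that hold the error message
        if not isinstance(cc_text, str) or not cc_text.strip() or marker in cc_text:
            continue
        cleaned.append(video)
    return cleaned

# made-up records for illustration only
sample = [
    {"video_id": "a", "cc_text": " top 10 best laptops 2022 ..."},
    {"video_id": "b", "cc_text": "\nCould not retrieve a transcript for the video ..."},
    {"video_id": "c", "cc_text": None},
]
valid = drop_failed_transcripts(sample)
print([v["video_id"] for v in valid])
```

If the scraper stores failures differently (for example as `None` or an empty string only), the marker check can be dropped and the empty-caption check is enough.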

In [118]:
search

[{'video_id': '2MXtOocAVXc',
  'title': 'TOP 10 BEST LAPTOPS 2022',
  'video_link': 'https://www.youtube.com/watch?v=2MXtOocAVXc',
  'channel_name': 'Trend Max',
  'cc_text': " do you need a powerful fast laptop with large memory capacity and good graphics stay tuned because in this video you're about to see the top 10 best laptops 2022 number 10 hp envy x360 let's start with a laptop that has great graphics especially for graphics designer games that require vivid visuals the hp envy x360 is powered by an amd ryzen 5 3500u processor and runs windows 10 although it can be upgraded to windows 11. it has a 13.3 inch convertible touchscreen so it can be used as a tablet as its screen can rotate 360 degrees the keyboard has its own lighting and added hotkeys including microphone on off keys it boasts a built-in 8 gigabyte ram and its battery is quite durable it can last 11 hours on a single charge the hp envy x360 is quite affordable you can get it starting at 700 with this model hp solved