# Data Collection and Exploration

This notebook loads data from YouTube. The main goal is to collect closed-caption (cc) text from the videos returned by relevant search queries.

The outline is as follows (using laptops as the example product):

 - Use the YouTube API to query videos like "top ten laptops"
 - Get the cc text and other relevant information from several of those videos
 - Extract top laptops from the cc text with NER
 - Then query the YouTube API again to get review cc text for each laptop
 - Store this all as raw data.
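
The outline above can be sketched as a single function. This is only a rough sketch under stated assumptions: `collect_raw_data`, `extract_products`, and the JSON layout are hypothetical names chosen for illustration, and `search` is assumed to be any callable with the same positional arguments as the `search_videos` method defined later in this notebook.

```python
import json
from pathlib import Path


def collect_raw_data(search, extract_products, top_query, out_path, max_results=5):
    """Run the two-pass collection and store the result as raw JSON.

    search(query, max_results, individual_review) -> list of video dicts
    extract_products(cc_text) -> list of product names (hypothetical NER step)
    """
    # Pass 1: "top ten" style videos
    top_videos = search(top_query, max_results, False)

    # Collect product names in first-seen order, without duplicates
    products = []
    for video in top_videos:
        for name in extract_products(video.get("cc_text") or ""):
            if name not in products:
                products.append(name)

    # Pass 2: individual reviews for each extracted product
    reviews = {name: search(f"{name} review", max_results, True) for name in products}

    raw = {"query": top_query, "top_videos": top_videos, "reviews": reviews}
    Path(out_path).write_text(json.dumps(raw, indent=2), encoding="utf-8")
    return raw
```

Passing the search and extraction steps in as callables keeps the sketch independent of the YouTube API and of the NER method that ends up being used.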

In [1]:
import os
from dotenv import load_dotenv
from youtube_transcript_api import YouTubeTranscriptApi
from googleapiclient.discovery import build

Change to the repository root so that `load_dotenv()` can find the env file.

In [2]:
os.chdir("../")

In [3]:
pwd

'c:\\Users\\RaviB\\GitHub\\TechKnowBot'

Build YouTube connector.

In [4]:
load_dotenv()

youtube_api_key = os.getenv('GCP_YOUTUBE_API_KEY')

In [7]:
openai_api_key = os.getenv('OPENAI_API_KEY')
openai_api_org = os.getenv('OPENAI_API_ORG')
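
`os.getenv` returns `None` when a variable is missing, which only surfaces later as a confusing API error. A small check like the one below can fail early instead. `require_env` is a hypothetical helper, not part of `python-dotenv`; it is a minimal sketch that assumes `load_dotenv()` has already been called.

```python
import os


def require_env(*names):
    """Return the named environment variables, or raise if any are unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

For example, `require_env('GCP_YOUTUBE_API_KEY', 'OPENAI_API_KEY', 'OPENAI_API_ORG')` would return all three keys used in this notebook in one dictionary.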

The class below reuses code from the SentiRec Analytics project: https://github.com/RavinderRai/SentiRec-Analytics/blob/main/modules/YouTubeReviewScraper.py.

In [21]:
class YouTubeReviewData:    
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build('youtube', 'v3', developerKey=api_key)
        
    def search_videos(self, search_query, max_results=5, individual_review=False):
        """
        Search for YouTube videos based on a given query and retrieve additional information including closed captions.

        Parameters:
        - search_query (str): The search query used to find relevant videos on YouTube.
        - max_results (int): The maximum number of videos to retrieve. Defaults to 5.
        - individual_review (bool): Set this to True if the search query is for reviews of a specific product,
        so that videos with VS in their titles are ignored, since those are comparisons rather than reviews of the individual product.

        Returns:
        List[dict]: A list of dictionaries, each containing information about a video, including:
            - 'video_id' (str): The unique identifier for the video.
            - 'title' (str): The title of the video.
            - 'video_link' (str): The YouTube link to the video.
            - 'channel_name' (str): The name of the channel that uploaded the video.
            - 'cc_text' (str): The closed captions text for the video. This is the review text.

        Note:
        - When individual_review is True, videos with titles containing 'VS', 'vs', or 'Vs' are excluded, as they indicate
        comparisons rather than reviews of the single product in the search query.
        - The 'cc_text' field may contain an empty string if closed captions are not available.
        """       
        
        search_response = self.youtube.search().list(
            q=search_query,
            type='video',
            part='id, snippet',
            maxResults=max_results
        ).execute()        
        
        videos_info = []
        for result in search_response.get('items', []):
            video_id = result['id']['videoId']
            title = result['snippet']['title']
            video_link = f'https://www.youtube.com/watch?v={video_id}'
            channel_name = result['snippet']['channelTitle']

            # Check and remove unwanted titles
            strings_to_check = ["VS", "vs", "Vs"] if individual_review else []
            if not any(s in title for s in strings_to_check):
                review_text = self.fetch_captions(video_id)
                videos_info.append({
                    'video_id': video_id,
                    'title': title,
                    'video_link': video_link,
                    'channel_name': channel_name,
                    'cc_text': review_text
                })

        return videos_info
    
    def fetch_captions(self, video_id):
        """
        Get the closed captions for a video.

        Parameters:
        - video_id (str): The video ID obtained in search_videos.

        Returns:
        str: Closed-caption text of the YouTube video, or an empty string if no captions could be retrieved.
        """
        try:
            # Retrieve the transcript for the video
            transcript = YouTubeTranscriptApi.get_transcript(video_id)

            cc_text = ""

            # Concatenate the transcript text
            for entry in transcript:
                cc_text += ' ' + entry['text']

            # Flatten line breaks inside caption entries
            cc_text = cc_text.replace('\n', ' ')
            return cc_text

        except Exception as e:
            # Captions may be disabled or unavailable; return an empty string as documented
            print(f"An error occurred: {str(e)}")
            return ""
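
The title filter in `search_videos` uses plain substring checks, so a title such as "Best laptops for devs" or "Best TVs for gaming" is also dropped because it happens to contain "vs" or "Vs". One possible alternative is a word-boundary regular expression. This is a sketch only: `is_comparison_title` is a hypothetical helper and is not part of the original SentiRec code.

```python
import re

# "vs" as a standalone word, in any casing, optionally followed by a period
VS_PATTERN = re.compile(r"\bvs\b\.?", re.IGNORECASE)


def is_comparison_title(title):
    """Return True if the title looks like a head-to-head comparison video."""
    return bool(VS_PATTERN.search(title))
```

A title like "VS Code on the Dell XPS 15" would still be flagged, so this reduces false positives rather than eliminating them.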

In [22]:
youtube = YouTubeReviewData(youtube_api_key)

In [23]:
query_ex = "top ten laptops"
search_ex = youtube.search_videos(query_ex, 5)

In [24]:
print(search_ex[0]['cc_text'])

 top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 360 if you like graphic design or working with metrics in the office the galaxy book 2 Pro 360 is ideal for you fairly priced at $950 this laptop has a 15.6 in Amo LED ful

Get examples for the few-shot config file.

In [26]:
search_ex[0]

{'video_id': 'FK8veh-L8AE',
 'title': 'TOP 10 BEST LAPTOPS 2023',
 'video_link': 'https://www.youtube.com/watch?v=FK8veh-L8AE',
 'channel_name': 'Trend Max',
 'cc_text': " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 

In [27]:
ex_0 = {
        # Use the caption string itself, not the whole video dict
        "text": search_ex[0]['cc_text'],
        "spans": [
            {
                "text": "HP Spectre x364",
                "is_entity": True,
                "label": "Laptop",
                # "reason" is the key used in the spacy-llm NER v3 few-shot examples
                "reason": "is a laptop"
            },
            {
                "text": "Terra",
                "is_entity": False,
                "label": "Laptop",
                "reason": "a unit of measurement for storage"
            },
            {
                "text": "Samsung Galaxy Book 2 Pro 360",
                "is_entity": True,
                "label": "Laptop",
                "reason": "is a laptop"
            }

        ]
    }
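
Since the few-shot examples end up in a JSON file (`notebooks/examples.json` is the path used further down), it can help to validate them before writing. The sketch below is an assumption-laden illustration: `write_examples` is a hypothetical helper, and the required span keys (`text`, `is_entity`, `label`, `reason`) are assumed to follow the spacy-llm NER v3 usage example.

```python
import json
from pathlib import Path

# Assumed span keys, following the spacy-llm NER v3 usage example
REQUIRED_SPAN_KEYS = {"text", "is_entity", "label", "reason"}


def write_examples(examples, path):
    """Validate few-shot examples and write them to a JSON file."""
    for example in examples:
        for span in example["spans"]:
            missing = REQUIRED_SPAN_KEYS - span.keys()
            if missing:
                raise ValueError(f"Span {span.get('text')!r} is missing keys: {sorted(missing)}")
            # Each span should be an exact substring of the example text
            if span["text"] not in example["text"]:
                raise ValueError(f"Span {span['text']!r} does not occur in the example text")
    # json.dumps converts Python True/False into JSON true/false
    Path(path).write_text(json.dumps(examples, indent=2), encoding="utf-8")
```

Writing through `json.dumps` also avoids the Python versus JSON literal mix-up (`true` versus `True`) that is easy to hit when typing examples by hand in a cell.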

## Extracting Objects

After searching for "best of" videos for a given product category, we now want to extract which products those videos mention. We can do this with named entity recognition (NER).

In [9]:
import spacy
ner = spacy.load("en_core_web_sm")

In [10]:
search_ex[0]

{'video_id': 'FK8veh-L8AE',
 'title': 'TOP 10 BEST LAPTOPS 2023',
 'video_link': 'https://www.youtube.com/watch?v=FK8veh-L8AE',
 'channel_name': 'Trend Max',
 'cc_text': " top 10 best laptops 2023 number 10 HP Spectre x364 careful because you can't help but fall in love with the HP Spectre X 3614 just look at that beautiful 13.5 in touchscreen with a resolution of 3000x 2000 pixels that adjusts its image color to the environment thanks to its artificial intelligence system incredible plus its screen can be turned back becoming a kind of tablet regarding the CPU the Spectre x314 comes equipped with a core i7 1255 U and an Intel UHD GPU so it has power to spare whether you're doing everyday office tasks or if you're a content creator like me can you see why I love it it has 16 GB of RAM and up to one Terra of storage so you can save what you want last but not least it has four speakers that sound great so how much for this beauty a reasonable $1,050 number nine Samsung Galaxy Book 2 Pro 

In [11]:
cc_text_ex = search_ex[0]['cc_text']
doc = ner(cc_text_ex)

In [12]:
for ent in doc.ents:
    if ent.label_ in ["ORG", "PRODUCT"]:
        print(ent)

CPU
Intel
GPU
RAM
Samsung
CPU
Intel
RAM
SSD
Acer
CPU
Intel
RAM
Macbook Pro
the MacBook Pro
the MacBook Pro's
Apple
M2
RAM
G14
G14
AMD
CPU
Nvidia GeForce
LCD
RAM
AMD
CPU
Lenovo Chromebook
Lenovo Chromebook
Snapdragon
GPU
Chromebook
HDR
Vivid
AMD
CPU
GPU
Apple
the Apple MacBook Air
Dell
XPS 15
GPU
RAM


In [13]:
# Pair each entity with the one that follows it; zip avoids indexing past the end of doc.ents
for ent, next_ent in zip(doc.ents, doc.ents[1:]):
    if ent.label_ == "ORG" and next_ent.label_ == "PRODUCT":
        print(ent, next_ent)

Apple M2
Dell XPS 15
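
The entity list printed earlier contains near-duplicates such as "Macbook Pro", "the MacBook Pro", and "the MacBook Pro's". A light normalization pass can merge these before counting. The helpers below (`normalize_entity`, `count_entities`) are hypothetical and work on plain strings, for example `[ent.text for ent in doc.ents]`; this is a sketch, not a full entity-resolution method.

```python
import re
from collections import Counter


def normalize_entity(text):
    """Strip a leading article and a trailing possessive from an entity string."""
    text = text.strip()
    text = re.sub(r"^(the|a|an)\s+", "", text, flags=re.IGNORECASE)
    text = re.sub(r"['’]s$", "", text)
    return text


def count_entities(entities):
    """Count entities case-insensitively, keeping the first-seen spelling for display."""
    counts = Counter()
    display = {}
    for raw in entities:
        name = normalize_entity(raw)
        key = name.lower()
        counts[key] += 1
        display.setdefault(key, name)
    return [(display[key], n) for key, n in counts.most_common()]
```

Frequent generic terms such as "CPU" or "RAM" would still dominate the counts, so a filter on the entity label or a stop list would likely be needed on top of this.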


## spacy-llm NER

Next, try the spacy-llm integration for better NER, since the basic model above misses many of the laptops (the ORG/PRODUCT pairing found only two). Here is a guide on how to do this: https://github.com/explosion/spacy-llm/tree/main/usage_examples/ner_v3_openai.

In [20]:
#nlp = spacy.blank("en")

In [12]:
from spacy_llm.util import assemble

In [10]:
os.environ["OPENAI_API_KEY"] = openai_api_key
os.environ["OPENAI_API_ORG"] = openai_api_org

In [41]:
#import openai

#openai.api_key = os.getenv("OPENAI_API_ORG")

In [13]:
nlp = assemble("notebooks/config.cfg", overrides={"paths.examples": "notebooks/examples.json"})

In [14]:
doc = nlp("Sriracha sauce goes really well with hoisin stir fry, but you should add it after you use the wok.")

In [15]:
doc.text

'Sriracha sauce goes really well with hoisin stir fry, but you should add it after you use the wok.'

In [16]:
[(ent.text, ent.label_) for ent in doc.ents]

[('Sriracha sauce', 'INGREDIENT'),
 ('hoisin', 'INGREDIENT'),
 ('stir fry', 'DISH'),
 ('wok', 'EQUIPMENT')]
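
Once the pipeline returns `(text, label)` pairs like the ones above, the second pass of the outline needs one review query per product. A minimal sketch, assuming the product label is `LAPTOP` (as in the config for this notebook) and using a hypothetical helper name `build_review_queries`:

```python
def build_review_queries(entities, product_label="LAPTOP", suffix="review"):
    """Turn (text, label) entity pairs into de-duplicated search queries."""
    seen = set()
    queries = []
    for text, label in entities:
        if label != product_label:
            continue  # skip organizations and any other labels
        key = text.lower()
        if key in seen:
            continue  # same product mentioned again, possibly with different casing
        seen.add(key)
        queries.append(f"{text} {suffix}")
    return queries
```

Each resulting query could then be passed to `search_videos` with `individual_review=True`.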

Contents of `notebooks/config.cfg`, the config file passed to `assemble`:

[paths]
examples = null

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["ORG", "LAPTOP"]
description = Entities are the names of laptops, 
    and the organizations are companies that made the laptop.
    Adjectives, verbs, adverbs are not entities.
    Pronouns are not entities.

[components.llm.task.label_definitions]
ORG = "Organization a laptop belongs to."
LAPTOP = "Name of a laptop."

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "${paths.examples}"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"