# YouTube Scraper Tutorial

In this post, we'll walk through a quick way to collect closed caption text from YouTube videos. A dataset of text like this supports tasks such as named entity recognition (NER) on product mentions, sentiment analysis of YouTube reviews, and summarization of longer videos or podcasts.

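To get started, you will need to set up a Google Cloud Platform (GCP) account and enable the YouTube Data API v3. Don't worry, this is free within the API's default daily quota. Next, install the packages from PyPI: youtube-transcript-api, google-api-python-client (imported as googleapiclient), and python-dotenv (imported as dotenv). Now we can import our packages.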

In [17]:
# all the packages we will need
import os
from dotenv import load_dotenv
from youtube_transcript_api import YouTubeTranscriptApi
from googleapiclient.discovery import build

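Next you'll need your GCP API key. You can paste it directly into the notebook, or create a .env file in the same directory with a line of the form GCP_YOUTUBE_API_KEY=<your key> and load it from there. It's up to you, though the .env route keeps the key out of your code.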

In [19]:
load_dotenv()
youtube_api_key = os.getenv('GCP_YOUTUBE_API_KEY')
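
Since os.getenv quietly returns None when a variable is missing, it can help to fail early with a clear message instead of passing an empty key along. Here is a minimal sketch of that idea. Note that require_env is a hypothetical helper name of our own, not part of python-dotenv, and it assumes load_dotenv() has already run.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank.

    Hypothetical helper for illustration; it is not part of python-dotenv.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value


# Usage, after load_dotenv() has run:
# youtube_api_key = require_env("GCP_YOUTUBE_API_KEY")
```

With this in place, a missing or blank key shows up as a readable error at load time rather than later, when the first request is made.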

Now let's get to the YouTube API part. First we'll use the imported build function, which lets us create an object for interacting with a specified Google API. We'll pass 'youtube' as the service name, 'v3' as the version, and your API key as developerKey.

In [20]:
youtube = build('youtube', 'v3', developerKey=youtube_api_key)

To use this youtube object, you send in a query, just as if you were typing into the YouTube search bar. We'll fetch the top results for this query, where max_results sets the maximum number of results to return. We'll also restrict the search to videos (no playlists, channels, etc.) and request the id and snippet parts (snippet contains info like title, description, etc.).

In [21]:
search_query = "Top video games of 2023"
max_results = 5

search_response = youtube.search().list(
    q=search_query,
    type='video',
    part='id, snippet',
    maxResults=max_results
).execute()

Now let's explore what this result looks like. The response is a dictionary whose 'items' key holds a list of up to 5 entries, one for each result (since we set max_results to 5). So let's just look at the first item in that list.

In [24]:
search_response.get('items', [])[0]

{'kind': 'youtube#searchResult',
 'etag': 'cAWTfIz-k4IQLn33ru7_-03bFE8',
 'id': {'kind': 'youtube#video', 'videoId': 'sXnoQdA6cYM'},
 'snippet': {'publishedAt': '2023-11-23T00:00:32Z',
  'channelId': 'UCaWd5_7JhbQBe4dknZhsHJg',
  'title': 'Top 10 Best Video Games of 2023',
  'description': "2023 has been another great year for video games! Welcome to WatchMojo, and today we're counting down our picks for the ...",
  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/sXnoQdA6cYM/default.jpg',
    'width': 120,
    'height': 90},
   'medium': {'url': 'https://i.ytimg.com/vi/sXnoQdA6cYM/mqdefault.jpg',
    'width': 320,
    'height': 180},
   'high': {'url': 'https://i.ytimg.com/vi/sXnoQdA6cYM/hqdefault.jpg',
    'width': 480,
    'height': 360}},
  'channelTitle': 'WatchMojo.com',
  'liveBroadcastContent': 'none',
  'publishTime': '2023-11-23T00:00:32Z'}}
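
Most of that item is metadata we won't need. As a rough sketch, the fields we care about can be pulled out into a compact record per video. The function name summarize_results and the trimmed sample_response below are our own illustrations, with values copied from the output above; the real response carries many more fields.

```python
from typing import Any


def summarize_results(search_response: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten a search response into one small dict per video."""
    summaries = []
    for item in search_response.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if video_id is None:
            # Skip anything that is not a video result
            continue
        snippet = item.get("snippet", {})
        summaries.append({
            "video_id": video_id,
            "title": snippet.get("title", ""),
            "channel_name": snippet.get("channelTitle", ""),
            "video_link": f"https://www.youtube.com/watch?v={video_id}",
        })
    return summaries


# A trimmed-down stand-in for the real response shown above
sample_response = {
    "items": [
        {
            "id": {"kind": "youtube#video", "videoId": "sXnoQdA6cYM"},
            "snippet": {
                "title": "Top 10 Best Video Games of 2023",
                "channelTitle": "WatchMojo.com",
            },
        }
    ]
}

print(summarize_results(sample_response))
```

Using .get with defaults means an empty or partial response yields an empty list instead of raising a KeyError.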

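Now let's put these steps together in a class that searches for videos and fetches the closed captions for each result.

In [12]: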
class YouTubeReviewData:    
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build('youtube', 'v3', developerKey=api_key)
        
    def search_videos(self, search_query, max_results=5):
        """
        Search for YouTube videos based on a given query and retrieve additional information including closed captions.

        Parameters:
        - search_query (str): The search query used to find relevant videos on YouTube.
        - max_results (int): The maximum number of videos to retrieve. Defaults to 5.

        Returns:
        List[dict]: A list of dictionaries, each containing information about a video, including:
            - 'video_id' (str): The unique identifier for the video.
            - 'title' (str): The title of the video.
            - 'video_link' (str): The YouTube link to the video.
            - 'channel_name' (str): The name of the channel that uploaded the video.
            - 'review_text' (str): The closed captions text for the video. This is the review text.

        Note:
        - Videos with titles containing specific strings ('VS', 'vs', 'Vs') are excluded, as they indicate videos that aren't reviews
        specific to the product in the search query.
        - The 'review_text' field may be empty if closed captions are not available.
        """       
        
        search_response = self.youtube.search().list(
            q=search_query,
            type='video',
            part='id, snippet',
            maxResults=max_results
        ).execute()
        
        
        """ old version
        video_ids = [result['id']['videoId'] for result in search_response.get('items', [])]
        titles = [result['snippet']['title'] for result in search_response.get('items', [])]
        
        #any videos with vs in the title means it is not a direct review of the specific headphone, so we will not want those
        strings_to_check = ["VS", "vs", "Vs"]
        indices_to_remove = [i for i, title in enumerate(titles) if any(s in title for s in strings_to_check)]
        # Remove items from the video_ids list using the indices
        video_ids = [video_ids[i] for i in range(len(video_ids)) if i not in indices_to_remove]
        """
        
        
        videos_info = []
        for result in search_response.get('items', []):
            video_id = result['id']['videoId']
            title = result['snippet']['title']
            video_link = f'https://www.youtube.com/watch?v={video_id}'
            channel_name = result['snippet']['channelTitle']

            # Check and remove unwanted titles
            strings_to_check = ["VS", "vs", "Vs"]
            if not any(s in title for s in strings_to_check):
                review_text = self.fetch_captions(video_id)
                videos_info.append({
                    'video_id': video_id,
                    'title': title,
                    'video_link': video_link,
                    'channel_name': channel_name,
                    'review_text': review_text
                })

        return videos_info
    
    def fetch_captions(self, video_id):
        """
        Get the closed captions. 

        Parameters:
        - video_id (str): The video id which is obtained in search_videos.
        
        Returns:
        String: Closed caption text of a youtube video
        """
        try:
            # Retrieve the transcript for the video
            transcript = YouTubeTranscriptApi.get_transcript(video_id)

            cc_text = ""

            # Concatenate the transcript text
            for entry in transcript:
                cc_text += ' ' + entry['text']
                
            cc_text = cc_text.replace('\n', ' ')
            return cc_text

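        except Exception as e:
            # Captions may be disabled or unavailable; report it and return an
            # empty string so callers always receive a str, as documented
            print(f"An error occurred: {str(e)}")
            return ""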
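
Two steps in the class are easy to try on their own without calling any API: the title filter and the caption joining. The sketch below shows one possible variation of each. Both function names are hypothetical, and the whole-word regex is a swapped-in alternative to the substring check used above, which would also reject a title like "Best 4K TVs of 2023" because "TVs" contains "Vs". The transcript is assumed to be a list of dicts with a 'text' key, which is how the class indexes it.

```python
import re

# Matches "vs" as a standalone word, in any letter case ("vs", "VS", "Vs.")
_VS_PATTERN = re.compile(r"\bvs\b", re.IGNORECASE)


def is_comparison_title(title: str) -> bool:
    """Return True if the title contains 'vs' as a whole word."""
    return _VS_PATTERN.search(title) is not None


def join_transcript(entries: list[dict]) -> str:
    """Join the 'text' field of each transcript entry into one string.

    Newlines inside an entry become spaces and empty entries are dropped,
    so the result has no leading or trailing whitespace.
    """
    parts = (entry["text"].replace("\n", " ").strip() for entry in entries)
    return " ".join(part for part in parts if part)


# Made-up transcript entries for illustration
sample_entries = [
    {"text": "welcome back\nto the channel", "start": 0.0, "duration": 2.5},
    {"text": "today we review laptops", "start": 2.5, "duration": 3.0},
]

print(is_comparison_title("AirPods Pro vs Sony WF-1000XM5"))
print(is_comparison_title("Best 4K TVs of 2023"))
print(join_transcript(sample_entries))
```

Building the text with a single join also avoids the leading space that repeated string concatenation leaves behind.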


Finally, let's create an instance of the class and run a search.

In [16]:
youtube_scraper = YouTubeReviewData(youtube_api_key)
youtube_scraper.search_videos("Top ten laptops")

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/search?q=Top+ten+laptops&type=video&part=id%2C+snippet&maxResults=5&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">