## Scraping YouTube videos

### Installing required libraries

In [1]:
! pip install google-api-python-client

Defaulting to user installation because normal site-packages is not writeable


In [2]:
! pip install youtube-transcript-api

Defaulting to user installation because normal site-packages is not writeable


### Scraping one YouTube video's details and transcript

##### Scraping YouTube video details

In [None]:
from googleapiclient.discovery import build

# Replace this placeholder with your own YouTube Data API v3 key
API_key = "YOUR_API_KEY"

# Defining a function that uses the YouTube Data API to fetch the basic data of a video
def get_video_details(video_id):
    youtube = build("youtube", "v3", developerKey=API_key)
    request = youtube.videos().list(part="snippet,statistics", id=video_id)
    response = request.execute()

    if "items" in response and response["items"]:
        video = response["items"][0]
        details = {
            "title": video["snippet"]["title"],
            "channel": video["snippet"]["channelTitle"],
            "views": video["statistics"].get("viewCount", "N/A"),
            "publish_date": video["snippet"]["publishedAt"],
            "url": f"https://www.youtube.com/watch?v={video_id}"
        }
        return details
    return None
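
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `YOUTUBE_API_KEY` and a helper called `load_api_key`. Both names are hypothetical and are not part of the original notebook.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name you export in your own shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of sending an empty key
        raise RuntimeError(
            f"Set the {env_var} environment variable to your YouTube Data API key"
        )
    return key
```

With the variable set, `API_key = load_api_key()` could replace the hard-coded string, which keeps the key out of any shared copy of the notebook.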

In [9]:
# Testing the function on an example video from the VVD YouTube channel
video_id = "XqtessUPQEY"
details = get_video_details(video_id)
print(details)

{'title': '"Ik sta hier voor de veiligheid van Nederlanders." Dilan Yeşilgöz-Zegerius clasht met FvD.', 'channel': 'VVD', 'views': '649', 'publish_date': '2025-02-19T12:48:47Z', 'url': 'https://www.youtube.com/watch?v=XqtessUPQEY'}
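
The API returns every field as a string, including the view count and the publish date. For later filtering or sorting it may help to convert them to typed values. The sketch below is one possible way to do that, assuming the timestamp format seen in the output above; `normalize_details` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime, timezone

def normalize_details(details):
    """Return a copy of a details dict with typed 'views' and 'publish_date'."""
    cleaned = dict(details)

    # get_video_details falls back to "N/A" when the view count is missing
    views = str(details.get("views", "N/A"))
    cleaned["views"] = int(views) if views.isdigit() else None

    # publishedAt looks like 2025-02-19T12:48:47Z, where the trailing Z means UTC
    cleaned["publish_date"] = datetime.strptime(
        details["publish_date"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return cleaned

# Values copied from the example output above (title omitted for brevity)
example = {
    "channel": "VVD",
    "views": "649",
    "publish_date": "2025-02-19T12:48:47Z",
    "url": "https://www.youtube.com/watch?v=XqtessUPQEY",
}
typed = normalize_details(example)
print(typed["views"], typed["publish_date"].isoformat())
```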


##### Scraping YouTube video transcript in Dutch

In [12]:
from youtube_transcript_api import YouTubeTranscriptApi

# Defining a function that fetches a video's transcript via the youtube-transcript-api package
def get_transcript(video_id, language="nl"):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
        transcript_text = " ".join([entry["text"] for entry in transcript])
        return transcript_text

    except Exception as e:
        return f"Error fetching transcript: {e}"

In [14]:
# Testing the function on the same example video from the VVD YouTube channel
# Note: this video is a speech, so the transcript reads as unpunctuated spoken language
video_id = "XqtessUPQEY"
transcript = get_transcript(video_id)
print("\nTranscript:", transcript)


Transcript: ehm de heer Rutte Secretaris generaal van de NAVO heeft uitgesproken dat het hele Westen dus van Wat zei die San Francisco tot Ankara geloof ik eh maar een derde aan wapentuig en munitie kan produceren van wat Rusland produceert hè Dat heeft hij uitgesproken nu dreigen wij in de situatie te zitten vandaag al dat Oekraïne met volledige steun van de NAVO dit eh conflict verliest eh is het dan niet wereldvreemd om te zeggen we gaan als Europa zelf onze broek ophouden en we gaan dan zorgen desnoods om dan zelf maar iets tegen die Russen tweer te stellen dat is toch totaal niet realistisch voorzitter mijn eerste reactie zou niet zijn op mijn rug liggen met mijn pootjes omhoog en aan Poetin vragen hoever die Europa door wil trekken ik sta wel voor de veiligheid van Nederlanders en ik zal er dan vervolgens ook alles aan doen dat wij daar paraat op eh zijn dus of dat is de samenwerking met Europa of dat is toch met onze Amerikaanse bondgenoot op een andere manier maar hier de sugg
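
Joining all entries into one string discards the timing information. In the releases of `youtube-transcript-api` that expose `get_transcript`, each entry is, as far as I know, a dict with `text`, `start` and `duration` keys. Under that assumption, the sketch below keeps a timestamp per entry. The helper names and the sample timings are hypothetical; only the words come from the transcript above.

```python
def format_timestamp(seconds):
    """Turn a start offset in seconds into an MM:SS label."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def transcript_with_timestamps(entries):
    """Return one '[MM:SS] text' line per transcript entry."""
    return [f"[{format_timestamp(e['start'])}] {e['text']}" for e in entries]

# Hypothetical entries: the timings are made up for illustration
sample_entries = [
    {"text": "ehm de heer Rutte", "start": 0.0, "duration": 2.5},
    {"text": "Secretaris generaal van de NAVO", "start": 2.5, "duration": 3.1},
    {"text": "heeft uitgesproken", "start": 65.2, "duration": 1.8},
]

lines = transcript_with_timestamps(sample_entries)
print("\n".join(lines))
```

Keeping the timestamps makes it possible to link a quoted passage back to the moment in the video where it was spoken.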

### Scraping the previously defined data and transcripts for multiple videos

##### Getting an uploads list for a certain YouTube channel

In [22]:
# Finding a channel ID takes a few steps; this tutorial shows how: https://www.youtube.com/watch?v=0oDy2sWPF38
channel_id_VVD = "UCZean7nAZKDGIHANq-MuaGA"

# Getting the playlist of uploads for a specific YouTube channel
def get_uploads_playlist_id(channel_id):
    youtube = build("youtube", "v3", developerKey=API_key)
    request = youtube.channels().list(
        part="contentDetails",
        id=channel_id
    )
    response = request.execute()
    
    if response.get("items"):
        return response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    return None

In [23]:
# Fetching the uploads playlist ID for the VVD channel
uploads_playlist_id = get_uploads_playlist_id(channel_id_VVD)
print("Uploads Playlist ID:", uploads_playlist_id)

Uploads Playlist ID: UUZean7nAZKDGIHANq-MuaGA
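
In the output above, the uploads playlist ID equals the channel ID with the `UC` prefix replaced by `UU`. This is a commonly observed pattern rather than something I can confirm as a documented guarantee, so the API call remains the authoritative source. As a sketch under that assumption, a hypothetical helper can derive the ID without spending any quota:

```python
def guess_uploads_playlist_id(channel_id):
    """Derive the likely uploads playlist ID from a channel ID.

    Assumes the observed UC -> UU prefix pattern; returns None when the
    channel ID does not start with "UC".
    """
    if not channel_id.startswith("UC"):
        return None
    return "UU" + channel_id[2:]

guessed = guess_uploads_playlist_id("UCZean7nAZKDGIHANq-MuaGA")
print("Guessed uploads playlist ID:", guessed)
```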


##### Getting all video IDs from the uploads playlist

In [26]:
import datetime

def get_videos_after_date(playlist_id, after_date_str):
    youtube = build("youtube", "v3", developerKey=API_key)
    video_list = []
    next_page_token = None

    # Converting string to datetime object
    after_date = datetime.datetime.strptime(after_date_str, "%Y-%m-%d")

    while True:
        request = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page_token
        )
        response = request.execute()

        for item in response["items"]:
            video_id = item["snippet"]["resourceId"]["videoId"]
            title = item["snippet"]["title"]
            publish_date = item["snippet"]["publishedAt"]
            
            # Convert publish date to datetime
            publish_date_obj = datetime.datetime.strptime(publish_date, "%Y-%m-%dT%H:%M:%SZ")

            if publish_date_obj > after_date:
                video_list.append({
                    "video_id": video_id,
                    "title": title,
                    "publish_date": publish_date
                })
            
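        # Move on to the next page only after all items on this page are processed;
        # stop the while loop once the API returns no further page token
        next_page_token = response.get("nextPageToken")
        if not next_page_token:
            break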
            
    return video_list

In [28]:
# Example: Get all videos published after January 1, 2024
videos = get_videos_after_date(uploads_playlist_id, "2024-01-01")

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/playlistItems?part=snippet&playlistId=UUZean7nAZKDGIHANq-MuaGA&maxResults=50&pageToken=EAAaflBUOkNKQURJaEEyUVVFd01UQTFSa1E1UVRGRE5qWTNLQUZJbDlqajdNVGNpd05RQVZvNElrTm9hRlpXVm5Cc1dWYzBNMkpyUm1GVE1GSklVMVZvUWxSdVJYUlVXRlpvVWpCRlUwUkJhVTVwWmtzNVFtaEVXVGM0VDBoQlp5SQ&key=REDACTED&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">
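
A `quotaExceeded` error means the API project has used up its daily quota. Every `playlistItems().list` request counts against that quota, so a paging loop that does not stop after the last page can exhaust it quickly. The sketch below shows one way to page through results with a hard cap on the number of requests. It runs against an in-memory stand-in instead of the real API; `collect_all_items`, `fake_fetch` and `FAKE_PAGES` are hypothetical names introduced for illustration.

```python
def collect_all_items(fetch_page, max_pages=20):
    """Collect items from a paged source, making at most max_pages requests.

    fetch_page(token) must return a dict with an "items" list and,
    when more pages exist, a "nextPageToken" string.
    """
    items = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get("items", []))
        # Read the token once per page, after the page has been processed
        token = response.get("nextPageToken")
        if not token:
            break
    return items

# In-memory stand-in for three pages of API results
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"], "nextPageToken": "p3"},
    "p3": {"items": ["d"]},
}
calls = []

def fake_fetch(token):
    calls.append(token)  # record each request so the call count can be checked
    return FAKE_PAGES[token]

result = collect_all_items(fake_fetch)
print(result, "fetched in", len(calls), "requests")
```

With the real client, `fetch_page` could wrap `youtube.playlistItems().list(part="snippet", playlistId=playlist_id, maxResults=50, pageToken=token).execute()`. The `max_pages` cap is a safety net: even if the stop condition were wrong, the loop could not issue more than that many requests.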