
In [2]:
import re
import requests

# Task
Generate Python code to retrieve all video IDs from a YouTube playlist using the YouTube Data API. Explain how to obtain and use an API key securely.

## Set up API key

### Subtask:
Explain how to get a YouTube Data API key and securely store it in Colab.


**Reasoning**:
Provide an explanation of how to obtain a YouTube Data API key and securely store it in Google Colab.



In [3]:
# Instructions on obtaining and storing a YouTube Data API key:

# 1. Obtain a YouTube Data API Key:
#    a. Go to the Google Cloud Console (console.cloud.google.com). You may need to log in or create an account.
#    b. Create a new project or select an existing one.
#    c. In the navigation menu, go to "APIs & Services" > "Library".
#    d. Search for "YouTube Data API v3" and select it.
#    e. Click the "Enable" button.
#    f. In the navigation menu, go to "APIs & Services" > "Credentials".
#    g. Click "Create Credentials" and select "API Key".
#    h. Your API key will be generated. Copy this key. You can restrict the key's usage (e.g., by IP address or HTTP referrer) for added security, which is highly recommended.

# 2. Securely Store the API Key in Google Colab:
#    a. Open your Google Colab notebook.
#    b. In the left sidebar, click the "Secrets" tab (it looks like a key). If you don't see it, you might need to enable it in the notebook settings (Notebook settings -> enable "Show the Secrets tab").
#    c. Click "Add new secret".
#    d. In the "Name" field, enter a descriptive name for your secret, such as `YOUTUBE_API_KEY`.
#    e. In the "Value" field, paste the API key you copied from the Google Cloud Console.
#    f. Ensure that the "Notebook access" toggle is enabled for your current notebook.
#    g. To access the secret in your code, use the `userdata` module from Colab's `google.colab` package. For example:
#       from google.colab import userdata
#       api_key = userdata.get('YOUTUBE_API_KEY')
#    h. **IMPORTANT:** Never hardcode your API key directly in your notebook code. Using the Secrets feature keeps your key secure and prevents accidental exposure if you share your notebook.
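
The steps above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `load_api_key` and the fallback to an environment variable outside Colab are assumptions added for illustration. Only `userdata.get` comes from the instructions above.

```python
import os


def load_api_key(name="YOUTUBE_API_KEY"):
    """Return the API key from Colab Secrets, or from the environment outside Colab.

    The helper name and the environment-variable fallback are illustrative
    assumptions, not part of the original notebook.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name.
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"No value found for secret {name!r}")
    return key
```

Calling `load_api_key()` in a cell keeps the key out of the notebook source in both environments, since the value only ever lives in the Secrets panel or in the process environment.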

## Define API endpoint and parameters

### Subtask:
Define the base URL for the `playlistItems` endpoint and specify the necessary parameters like `key`, `playlistId`, and `part`.


**Reasoning**:
Define the base URL and the parameters dictionary for the API request as instructed.



In [8]:
from google.colab import userdata

base_url = 'https://www.googleapis.com/youtube/v3/playlistItems'

params = {
    'key': userdata.get('YOUTUBE_API_KEY'),
    'playlistId': 'PLdb6KKzTz-by4IuO4qsG8JZokr4ouvFH4', # Replace with an actual playlist ID
    'part': 'snippet',
    'maxResults': 50  # The API accepts at most 50 items per page
}
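
To see how these parameters end up in the request, the sketch below builds the full URL with the standard library only, so it runs without a network call or a real key. The helper name `build_playlist_items_url` and the `DUMMY_KEY` value are hypothetical. The range check reflects the documented `maxResults` limit of 0 to 50.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def build_playlist_items_url(api_key, playlist_id, page_token=None, max_results=50):
    """Build a playlistItems request URL (illustrative helper, not from the notebook)."""
    if not 0 <= max_results <= 50:
        raise ValueError("maxResults must be between 0 and 50")
    query = {
        "key": api_key,
        "playlistId": playlist_id,
        "part": "snippet",
        "maxResults": max_results,
    }
    # pageToken is only sent when a later page is being requested.
    if page_token:
        query["pageToken"] = page_token
    return f"{BASE_URL}?{urlencode(query)}"


# DUMMY_KEY is a placeholder; never print a real key.
print(build_playlist_items_url("DUMMY_KEY", "PLdb6KKzTz-by4IuO4qsG8JZokr4ouvFH4"))
```

Rejecting out-of-range values locally is a design choice made here so that a typo surfaces before any request is sent.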

## Make API request

### Subtask:
Generate code to make an HTTP GET request to the YouTube API using the `requests` library.


**Reasoning**:
Generate code to make an HTTP GET request to the YouTube API using the `requests` library, check the response status code, and print an error message if the request was not successful.



In [15]:
video_ids = []
page_token = None

while True:
    # requests omits parameters whose value is None, so the first call sends no pageToken
    params['pageToken'] = page_token
    response = requests.get(base_url, params=params)

    # Stop on any non-200 response instead of parsing an error body as a result page
    if response.status_code != 200:
        print(f"Error: API request failed with status code {response.status_code}")
        print(f"Response text: {response.text}")
        break

    data = response.json()

    # Extract video IDs from the current page and add them to video_ids
    for item in data.get('items', []):
        snippet = item.get('snippet', {})
        resource_id = snippet.get('resourceId', {})
        video_id = resource_id.get('videoId')
        if video_id:
            video_ids.append(video_id)

    page_token = data.get('nextPageToken')
    if not page_token:
        break  # Exit loop if no more pages

print(f"Total video IDs retrieved: {len(video_ids)}")

Total video IDs retrieved: 175, RlUKQsbFX1c


## Process API response

### Subtask:
Generate code to parse the JSON response from the API and extract the video IDs from the playlist items.


**Reasoning**:
Parse the JSON response and extract the video IDs as instructed.
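
As a minimal sketch of that parsing step, the function below walks `items` -> `snippet` -> `resourceId` -> `videoId` on an already-decoded response. The function name `extract_video_ids` and the hand-written `sample_response` are assumptions for illustration. A real response carries many more fields.

```python
def extract_video_ids(data):
    """Return the video IDs found in one decoded playlistItems response."""
    ids = []
    for item in data.get("items", []):
        # Each lookup defaults to an empty dict so missing keys do not raise.
        video_id = item.get("snippet", {}).get("resourceId", {}).get("videoId")
        if video_id:
            ids.append(video_id)
    return ids


# Hand-written stand-in for a decoded API response (illustrative only).
sample_response = {
    "items": [
        {"snippet": {"resourceId": {"kind": "youtube#video", "videoId": "AHFBK0YDk3s"}}},
        {"snippet": {"title": "entry without a resourceId"}},
        {"snippet": {"resourceId": {"videoId": "6T0NWKiXe4Y"}}},
    ],
    "nextPageToken": "EXAMPLE_TOKEN",
}

print(extract_video_ids(sample_response))
```

Entries that lack a `videoId` are skipped rather than raising, which mirrors the chained `.get` calls used in the request cell.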



## Handle pagination

### Subtask:
Explain how to handle pagination in the API response to retrieve all video IDs if the playlist has more items than the maximum per page.


In [None]:
import re
import requests
from google.colab import userdata

base_url_videos = 'https://www.googleapis.com/youtube/v3/videos'

# Assuming 'video_ids' is already populated from the previous step
# If not, you would need to run the previous steps to get the video_ids first.

# The YouTube Data API allows up to 50 video IDs per request.
# We'll process the video_ids in batches of 50.
batch_size = 50
pdf_urls = {} # Dictionary to store video ID and found PDF URLs

# Check if video_ids is defined and not empty
if 'video_ids' in locals() and video_ids:
    for i in range(0, len(video_ids), batch_size):
        batch_ids = video_ids[i:i + batch_size]
        # Join the batch_ids with commas for the API request
        video_ids_string = ','.join(batch_ids)

        params_videos = {
            'key': userdata.get('YOUTUBE_API_KEY'),
            'id': video_ids_string,
            'part': 'snippet' # We need the snippet part to get the description
        }

        response_videos = requests.get(base_url_videos, params=params_videos)
        data_videos = response_videos.json()

        if response_videos.status_code != 200:
            print(f"Error fetching video details: {response_videos.status_code}")
            print(f"Response text: {data_videos}")
            continue # Skip to the next batch if there's an error

        # Process the response and extract descriptions
        for item in data_videos.get('items', []):
            video_id = item.get('id')
            description = item.get('snippet', {}).get('description', '')

            # Use regular expressions to find the specific phrases and extract the URL that follows.
            # This regex looks for "Click here to download the PDFs of this series: " or "download the PDFs"
            # and then captures the URL that follows.
            # It assumes the URL is on the same line and starts with http or https.
            pdf_link_pattern = r"(?:Click here to download the PDFs of this series:|download the PDFs.*?)\s*(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
            found_links = re.findall(pdf_link_pattern, description)

            if found_links:
                # With a single capturing group, re.findall returns a list of strings (the captured URLs).
                pdf_urls[video_id] = found_links
                print(f"Found potential PDF links for video ID {video_id}: {pdf_urls[video_id]}")
            # else:
                # print(f"No specific PDF download phrase or link found for video ID {video_id}")

    print("\nSummary of video IDs and found potential PDF URLs:")
    if pdf_urls:
        for video_id, urls in pdf_urls.items():
            print(f"Video ID: {video_id}")
            if urls:
                for url in urls:
                    print(f"- {url}")
            else:
                print("  No potential PDF URLs found with the specified phrases.")
    else:
        print("No potential PDF URLs were found in any of the video descriptions.")

else:
    print("The 'video_ids' list is not populated. Please run the previous steps to get the video IDs.")

**Reasoning**:
Explain how to handle pagination in the API response, covering the points outlined in the instructions.



In [7]:
# Explanation on handling pagination in the YouTube Data API:

# 1. API Result Limits:
#    The YouTube Data API limits the number of results returned per request for the playlistItems endpoint.
#    The default number of results per page is 5, and the maximum is 50 (set with the `maxResults` parameter).

# 2. Checking for nextPageToken:
#    The API response includes a field called `nextPageToken` if there are more results available beyond the current page.
#    You can check for the presence of this key in the top level of the JSON response data.

# 3. Indication of More Results:
#    If the `nextPageToken` field exists and is not empty, it signifies that there are additional pages of results that can be retrieved.

# 4. Using the pageToken Parameter:
#    To fetch the next page of results, you need to include the `pageToken` parameter in your subsequent API request.
#    The value of the `pageToken` parameter for the next request should be the value of the `nextPageToken` obtained from the *previous* response.

# 5. Looping Through Pages:
#    You need to implement a loop that continues to make API requests as long as a `nextPageToken` is present in the response.
#    In each iteration of the loop, you update the `pageToken` parameter with the `nextPageToken` from the latest response and make a new request.
#    The loop terminates when an API response does not contain a `nextPageToken`, indicating that you have retrieved all the items in the playlist.

# 6. Accumulating Video IDs:
#    Inside the loop, after processing the response from each page and extracting the video IDs, you should append these IDs to a single list that exists outside the loop.
#    This list will accumulate all the video IDs from all the pages until the loop finishes.

# Example structure (conceptual):
# all_video_ids = []
# next_page_token = None
#
# while True:
#     params['pageToken'] = next_page_token # Update pageToken for the next request
#     response = requests.get(base_url, params=params)
#     data = response.json()
#
#     # Extract video IDs from current page and add to all_video_ids
#     for item in data.get('items', []):
#         # Video IDs are nested under snippet -> resourceId -> videoId
#         video_id = item.get('snippet', {}).get('resourceId', {}).get('videoId')
#         if video_id:
#             all_video_ids.append(video_id)
#
#     next_page_token = data.get('nextPageToken')
#
#     if not next_page_token:
#         break # Exit loop if no more pages
#
# print(f"Total video IDs retrieved: {len(all_video_ids)}")
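
The loop described above can be exercised without calling the API by passing in the page-fetching step as a function. This is a sketch under stated assumptions: `collect_video_ids`, `fake_fetch`, and the two-page `FAKE_PAGES` data are all hypothetical stand-ins, not part of the notebook or the API.

```python
def collect_video_ids(fetch_page):
    """Follow nextPageToken until it is absent and return every video ID.

    fetch_page(page_token) must return the decoded JSON dict for one page;
    it is called with None for the first page.
    """
    all_ids = []
    page_token = None
    while True:
        data = fetch_page(page_token)
        for item in data.get("items", []):
            video_id = item.get("snippet", {}).get("resourceId", {}).get("videoId")
            if video_id:
                all_ids.append(video_id)
        page_token = data.get("nextPageToken")
        if not page_token:
            return all_ids


# A fake two-page playlist keyed by the token that requests each page.
FAKE_PAGES = {
    None: {
        "items": [
            {"snippet": {"resourceId": {"videoId": "vid_a"}}},
            {"snippet": {"resourceId": {"videoId": "vid_b"}}},
        ],
        "nextPageToken": "PAGE_2",
    },
    "PAGE_2": {
        "items": [{"snippet": {"resourceId": {"videoId": "vid_c"}}}],
    },
}


def fake_fetch(page_token):
    return FAKE_PAGES[page_token]


print(collect_video_ids(fake_fetch))
```

Against the real endpoint, `fetch_page` could be something like `lambda token: requests.get(base_url, params={**params, 'pageToken': token}).json()`, using the `base_url` and `params` defined earlier in the notebook.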

## Summary:

### Data Analysis Key Findings

*   A YouTube Data API key is required to access playlist data and can be obtained through the Google Cloud Console.
*   The `playlistItems` endpoint is used to retrieve items within a playlist.
*   The necessary parameters for the API request include `key`, `playlistId`, and `part='snippet'`.
*   The API response is in JSON format, and video IDs are nested within `items` -> `snippet` -> `resourceId` -> `videoId`.
*   The API limits the number of results per request (at most 50 per page, with a default of 5), and pagination is handled using `nextPageToken` in the response and the `pageToken` parameter in subsequent requests.
*   Securely storing the API key, for example, using Google Colab's Secrets feature, is crucial.

### Insights or Next Steps

*   Implement the pagination logic in the code to ensure all video IDs are retrieved from playlists with more than 50 items.
*   Add error handling for potential issues like invalid playlist IDs or API rate limits.
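
One possible shape for that error handling is sketched below. It assumes the JSON error body has an `error` object with a `message` and an `errors` list whose entries carry a `reason` (for example `playlistNotFound` or `quotaExceeded`). The helper name `describe_api_error` and the sample payload are illustrative, and the message text in the sample is a placeholder rather than real API output.

```python
def describe_api_error(status_code, payload):
    """Summarize a failed API response as a one-line string."""
    error = payload.get("error", {}) if isinstance(payload, dict) else {}
    message = error.get("message", "no message in response body")
    # Collect the machine-readable reasons, if the body provides any.
    reasons = [e.get("reason") for e in error.get("errors", []) if e.get("reason")]
    summary = f"HTTP {status_code}: {message}"
    if reasons:
        summary += f" (reason: {', '.join(reasons)})"
    return summary


# Hand-written payload for illustration; the message text is a placeholder.
sample_error = {
    "error": {
        "code": 404,
        "message": "Example message",
        "errors": [{"reason": "playlistNotFound"}],
    }
}

print(describe_api_error(404, sample_error))
```

Branching on the `reason` string rather than on the status code alone would let a caller tell a mistyped playlist ID apart from an exhausted quota.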


# Task
Retrieve PDF URLs from the descriptions of YouTube videos using the YouTube Data API, given a list of video IDs.

## Define API endpoint and parameters for videos

### Subtask:
Define the base URL for the `videos` endpoint and specify the necessary parameters like `key`, `id`, and `part`.


**Reasoning**:
Define the base URL and the parameters dictionary for the API request to the `videos` endpoint as instructed.
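
Because the `id` parameter takes a comma-separated list, a long list of video IDs has to be split into batches first. The sketch below shows one way to do that. The helper name `chunked` and the generated `placeholder_ids` are assumptions for illustration, and the batch size of 50 is the per-request limit used in this notebook.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of ids, each with at most size elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# 175 placeholder IDs standing in for the list built from the playlist.
placeholder_ids = [f"id{n:03d}" for n in range(175)]

batches = list(chunked(placeholder_ids))
# One comma-separated string per request, ready for the 'id' parameter.
id_params = [",".join(batch) for batch in batches]

print([len(batch) for batch in batches])
```

Each string in `id_params` can then be passed as the `id` value of one request to the `videos` endpoint.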



In [17]:
import re
import requests
from google.colab import userdata

base_url_videos = 'https://www.googleapis.com/youtube/v3/videos'

# Assuming 'video_ids' is already populated from the previous step
# If not, you would need to run the previous steps to get the video_ids first.

# The YouTube Data API allows up to 50 video IDs per request.
# We'll process the video_ids in batches of 50.
batch_size = 50
pdf_urls = {} # Dictionary to store video ID and found PDF URLs

# Check if video_ids is defined and not empty
if 'video_ids' in locals() and video_ids:
    for i in range(0, len(video_ids), batch_size):
        batch_ids = video_ids[i:i + batch_size]
        # Join the batch_ids with commas for the API request
        video_ids_string = ','.join(batch_ids)

        params_videos = {
            'key': userdata.get('YOUTUBE_API_KEY'),
            'id': video_ids_string,
            'part': 'snippet' # We need the snippet part to get the description
        }

        response_videos = requests.get(base_url_videos, params=params_videos)
        data_videos = response_videos.json()

        if response_videos.status_code != 200:
            print(f"Error fetching video details: {response_videos.status_code}")
            print(f"Response text: {data_videos}")
            continue # Skip to the next batch if there's an error

        # Process the response and extract descriptions
        for item in data_videos.get('items', []):
            video_id = item.get('id')
            description = item.get('snippet', {}).get('description', '')

            # Use regular expressions to find URLs that start with "https://drive.google.com"
            # (the dots in the host name are escaped so they only match a literal ".")
            drive_link_pattern = r"https://drive\.google\.com(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
            found_links = re.findall(drive_link_pattern, description)

            if found_links:
                # found_links will be a list of the captured URLs
                pdf_urls[video_id] = found_links
                print(f"Found potential Drive links for video ID {video_id}: {pdf_urls[video_id]}")
            # else:
                # print(f"No Drive links found for video ID {video_id}")

    print("\nSummary of video IDs and found potential Drive URLs:")
    if pdf_urls:
        for video_id, urls in pdf_urls.items():
            print(f"Video ID: {video_id}")
            if urls:
                for url in urls:
                    print(f"- {url}")
            else:
                print("  No potential Drive URLs found.")
    else:
        print("No potential Drive URLs were found in any of the video descriptions.")

else:
    print("The 'video_ids' list is not populated. Please run the previous steps to get the video IDs.")

Found potential Drive links for video ID AHFBK0YDk3s: ['https://drive.google.com/uc?export=download&id=19yDmxmxk0Z78GZ_mJi5-vHx8qYIiBABF']
Found potential Drive links for video ID 6T0NWKiXe4Y: ['https://drive.google.com/uc?export=download&id=1Gt_DOuJUrcqaoNYib2exqa0qRLLR5yMO']
Found potential Drive links for video ID LIXZ2cuPv5c: ['https://drive.google.com/uc?export=download&id=1Ix2jqqgjEooSCNoJSn0b3YGQoy8prHB1']
Found potential Drive links for video ID tv43okv6v1Y: ['https://drive.google.com/uc?export=download&id=1nKn3VYtJWtu6FH-QLIa2yLK35Kde4fUW']
Found potential Drive links for video ID ITbbpElhGbM: ['https://drive.google.com/uc?export=download&id=1X8QafjJu2IGzWYKgy5My0xUdGylL_zkO']
Found potential Drive links for video ID W1FKsIMipdw: ['https://drive.google.com/uc?export=download&id=16ySidvrxeB9s2MmoO9popqMw_LVB484M']
Found potential Drive links for video ID l6tZrRx1V8Q: ['https://drive.google.com/uc?export=download&id=16ySidvrxeB9s2MmoO9popqMw_LVB484M']
Found potential Drive links