# Scraping more pitches from a playlist of over 1000 Shark Tank related videos

In [None]:
curl \
  'https://youtube.googleapis.com/youtube/v3/playlistItems?part=snippet%2CcontentDetails&maxResults=25&playlistId=PLExfYdecaiH7MgXZxNf_4xwqvYoSPz-vm&key=[YOUR_API_KEY]' \
  --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
  --header 'Accept: application/json' \
  --compressed


In [None]:
from google.colab import userdata

# Task
Retrieve at least 100 unique videos from a YouTube playlist with 500 videos using the YouTube Data API, handling pagination to fetch multiple batches of results.

## Setup

### Subtask:
Import necessary libraries and define the API key and playlist ID.


In [None]:
import requests

api_key = userdata.get("YOUTUBE_DATA_API_KEY")  # Read the API key from Colab secrets
playlist_id = 'PLExfYdecaiH7MgXZxNf_4xwqvYoSPz-vm'

## Initial API call

Make the initial API call to retrieve the first batch of playlist items and the `nextPageToken`.

In [None]:
base_url = "https://www.googleapis.com/youtube/v3/playlistItems"
all_playlist_items = []
next_page_token = None
# Stop paginating once at least this many items (counted before de-duplication) are collected.
target_items = 300  # raised from the original target of 100

params = {
    'key': api_key,
    'playlistId': playlist_id,
    'part': 'snippet,contentDetails',
    'maxResults': 50,  # Maximum results per page
}

response = requests.get(base_url, params=params)
response.raise_for_status()  # Fail early on HTTP errors such as a bad key or exceeded quota
response_json = response.json()

all_playlist_items.extend(response_json.get('items', []))
next_page_token = response_json.get('nextPageToken')

print(f"Retrieved {len(all_playlist_items)} items from the first page.")

Retrieved 50 items from the first page.


## Iterate and retrieve

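Loop through the API calls using the `nextPageToken` until at least `target_items` items have been collected or there are no more pages, then de-duplicate by video ID and store the unique items in a list. Because the loop counts items before de-duplication, the unique total can fall slightly below the target (299 of 300 in the run below).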

In [None]:
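# Keep requesting pages while the API returns a token and the raw item count is below the target.
while next_page_token and len(all_playlist_items) < target_items:
    params['pageToken'] = next_page_token
    response = requests.get(base_url, params=params)
    response.raise_for_status()  # Fail early on HTTP errors
    response_json = response.json()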

    all_playlist_items.extend(response_json.get('items', []))
    next_page_token = response_json.get('nextPageToken')

    print(f"Retrieved a total of {len(all_playlist_items)} items so far.")

unique_video_ids = set()
unique_playlist_items = []

for item in all_playlist_items:
    video_id = item['contentDetails']['videoId']
    if video_id not in unique_video_ids:
        unique_video_ids.add(video_id)
        unique_playlist_items.append(item)

print(f"\nFinished retrieving items. Total unique items: {len(unique_playlist_items)}")

Retrieved a total of 100 items so far.
Retrieved a total of 150 items so far.
Retrieved a total of 200 items so far.
Retrieved a total of 250 items so far.
Retrieved a total of 300 items so far.

Finished retrieving items. Total unique items: 299
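
The loop above stops on the raw item count and de-duplicates afterwards, which is why 300 fetched items produced 299 unique videos. The following is a minimal sketch of de-duplicating while paginating, so that the stop condition is based on the unique count instead. `collect_unique_items`, `fetch_page`, and the fake pages are hypothetical names and made-up data introduced only for illustration; `fetch_page` stands in for the `requests.get(...).json()` call so the sketch runs without network access.

```python
from typing import Callable, Optional


def collect_unique_items(
    fetch_page: Callable[[Optional[str]], dict],
    target_unique: int,
) -> list:
    """Page through results until at least `target_unique` distinct videos are seen.

    `fetch_page` is a hypothetical callable: it takes a page token (or None
    for the first page) and returns a parsed playlistItems response dict.
    """
    seen_ids = set()
    unique_items = []
    page_token = None

    while True:
        page = fetch_page(page_token)
        for item in page.get("items", []):
            video_id = item["contentDetails"]["videoId"]
            if video_id not in seen_ids:
                seen_ids.add(video_id)
                unique_items.append(item)

        page_token = page.get("nextPageToken")
        # Stop when there are no more pages or enough unique videos.
        if not page_token or len(unique_items) >= target_unique:
            return unique_items


# Made-up pages for an offline demonstration; video "b" appears twice.
_FAKE_PAGES = {
    None: {"items": [{"contentDetails": {"videoId": v}} for v in ("a", "b")],
           "nextPageToken": "p2"},
    "p2": {"items": [{"contentDetails": {"videoId": v}} for v in ("b", "c")],
           "nextPageToken": "p3"},
    "p3": {"items": [{"contentDetails": {"videoId": v}} for v in ("d", "e")]},
}


def fake_fetch_page(page_token):
    """Return a canned response for the given token (illustration only)."""
    return _FAKE_PAGES[page_token]


demo_items = collect_unique_items(fake_fetch_page, target_unique=3)
demo_ids = [item["contentDetails"]["videoId"] for item in demo_items]
print(demo_ids)
```

Because each page is processed in full before the check, the function can return slightly more than `target_unique` items. With the real API, `fetch_page` could wrap the same `requests.get` call used in the cells above.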


## Process results

Display the number of unique videos retrieved.

In [None]:
print(f"Successfully retrieved {len(unique_playlist_items)} unique videos.")
# You can further process unique_playlist_items as needed
print(unique_playlist_items)

Successfully retrieved 299 unique videos.
[{'kind': 'youtube#playlistItem', 'etag': '80Vm3vNKbW1nfZepYkqS-qU56Oc', 'id': 'UExFeGZZZGVjYWlIN01nWFp4TmZfNHh3cXZZb1NQei12bS43MDBCRThERkRDNUY4MDg3', 'snippet': {'publishedAt': '2022-07-05T09:13:58Z', 'channelId': 'UCw5gIZqTRvdMtVvm4ye3QQg', 'title': 'Shark Tank US | Uprising Food Drives Mark Nuts', 'description': 'Kristen and William Schumacher are seeking $500k for a 3% stake in their company Uprising Food.\n\nFrom Season 13 Episode 1\n \nWatch Shark Tank Now: http://AAN.SonyPictures.com/SharkTankUS \nSome of the links in above are affiliate links, we may earn a small commission if you click through and make a purchase.\n\nSubscribe to SPTV for more from your favorite shows: https://bit.ly/OfficialSPTV\n\nFOLLOW SONY PICTURES TELEVISION \nSPTV Facebook: https://www.facebook.com/SonyPicturesTV \nSPTV Twitter: https://twitter.com/SPTV \nSPTV Instagram: https://www.instagram.com/sptv/ \nSPTV: https://www.sonypictures.com\n\nAbout Shark Tank: Th

## Using the official Shark Tank channel IDs to filter out unofficial videos

In [None]:
title_prop_name = "videoOwnerChannelTitle"
id_prop_name = "videoOwnerChannelId"

SONY_PICTURES_TELEVISION_CHANNEL_TITLE = "Sony Pictures Television"
SONY_PICTURES_TELEVISION_CHANNEL_ID = "UCw5gIZqTRvdMtVvm4ye3QQg"
SHARK_TANK_GLOBAL_CHANNEL_TITLE = "Shark Tank Global"
SHARK_TANK_GLOBAL_CHANNEL_ID = "UCREgA-BmOocJ9Is_bZV6aJQ"


## Filter by Channel

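Filter the retrieved unique playlist items to include only videos from the specified channel IDs. Note that in a `playlistItems` response, `snippet.channelId` and `snippet.channelTitle` describe the account that added the item to the playlist, while the uploader is reported in `snippet.videoOwnerChannelId` and `snippet.videoOwnerChannelTitle`. The cell below matches on `snippet.channelId`, so in the recorded run all 299 items pass and every kept item shows the channel title `roddyrod`.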

In [None]:
# Reuse the channel IDs defined in the previous cell instead of placeholder values.
SONY_PICTURES_TELEVISION_CHANNEL_IDS = [
    SONY_PICTURES_TELEVISION_CHANNEL_ID,  # SPTV (Sony Pictures Television)
]
SHARK_TANK_GLOBAL_CHANNEL_IDS = [
    SHARK_TANK_GLOBAL_CHANNEL_ID,  # Shark Tank Global
]

filtered_playlist_items = [
    item for item in unique_playlist_items
    if (
        item['snippet']['channelId'] in SONY_PICTURES_TELEVISION_CHANNEL_IDS
        or item['snippet']['channelId'] in SHARK_TANK_GLOBAL_CHANNEL_IDS
        or 'Sony Pictures' in item['snippet']['channelTitle']
        or 'Shark Tank' in item['snippet']['channelTitle']
    )
]

print(f"Successfully retrieved {len(filtered_playlist_items)} unique videos from the specified channels.")

# Optional: see which channels were kept
for i, item in enumerate(filtered_playlist_items[:5], start=1):
    print(f"{i}. {item['snippet']['title']} — {item['snippet']['channelTitle']}")

Successfully retrieved 299 unique videos from the specified channels.
1. Shark Tank US | Uprising Food Drives Mark Nuts — roddyrod
2. Wisp Entrepreneur Admits He's Made Some Business Mistakes | Shark Tank US | Shark Tank Global — roddyrod
3. Mr. Wonderful Kicks Pavlok Entrepreneur Out Of The Tank | Shark Tank US | Shark Tank Global — roddyrod
4. Shark Tank US | Top 3 Biggest Deals — roddyrod
5. The Sharks Fight For A Deal With Little Elf  | Shark Tank US | Shark Tank Global — roddyrod
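
Every kept item in the output above reports the channel title `roddyrod`, which suggests that `snippet.channelId` identifies the playlist owner rather than the uploader. The sketch below filters on `snippet.videoOwnerChannelId` instead, the property named by `id_prop_name` earlier. The function name `filter_by_video_owner`, the sample items, and the channel IDs in it are hypothetical and exist only to illustrate the idea; the real allow-list should be checked against the official channels before use.

```python
def filter_by_video_owner(items, allowed_channel_ids):
    """Keep items whose uploader channel is in `allowed_channel_ids`.

    Matches on `snippet.videoOwnerChannelId`; items without that field
    (which may happen for unavailable videos) are dropped.
    """
    allowed = set(allowed_channel_ids)
    return [
        item for item in items
        if item.get("snippet", {}).get("videoOwnerChannelId") in allowed
    ]


# Hypothetical sample items and channel IDs, for illustration only.
sample_items = [
    {"snippet": {"title": "Official clip", "channelId": "UC_playlist_owner",
                 "videoOwnerChannelId": "UC_official"}},
    {"snippet": {"title": "Re-upload", "channelId": "UC_playlist_owner",
                 "videoOwnerChannelId": "UC_someone_else"}},
    {"snippet": {"title": "Unavailable video", "channelId": "UC_playlist_owner"}},
]

kept = filter_by_video_owner(sample_items, ["UC_official"])
kept_titles = [item["snippet"]["title"] for item in kept]
print(kept_titles)
```

All three sample items share the same `channelId`, so a filter on that field could not tell them apart; only the uploader field separates the official clip from the others.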


## Extract Video IDs and Save to JSON

Extract the video IDs from the filtered playlist items, map each ID to the first line of its video description, and save the mapping to a JSON file.

In [None]:
import json

video_data = {item['contentDetails']['videoId']: item['snippet']['description'].split('\n')[0] for item in filtered_playlist_items}

output_filename = 'filtered_video_data.json'
with open(output_filename, 'w') as f:
    json.dump(video_data, f, indent=4)

print(f"Successfully extracted {len(video_data)} video IDs and first line of descriptions and saved to {output_filename}")

Successfully extracted 299 video IDs and first line of descriptions and saved to filtered_video_data.json
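
The dictionary comprehension above raises a `KeyError` if an item lacks `contentDetails` or `snippet`. As a hedged alternative, the sketch below builds the same ID-to-first-line mapping defensively and round-trips it through a temporary file to confirm the JSON can be read back. `build_video_data` and the sample items are hypothetical names and made-up data, not part of the original notebook.

```python
import json
import os
import tempfile


def build_video_data(items):
    """Map each video ID to the first line of its description."""
    video_data = {}
    for item in items:
        video_id = item.get("contentDetails", {}).get("videoId")
        if not video_id:
            continue  # skip malformed items rather than raising KeyError
        lines = item.get("snippet", {}).get("description", "").splitlines()
        video_data[video_id] = lines[0] if lines else ""
    return video_data


# Made-up items for illustration only.
sample_items = [
    {"contentDetails": {"videoId": "vid1"},
     "snippet": {"description": "First pitch summary.\n\nMore text."}},
    {"contentDetails": {"videoId": "vid2"}, "snippet": {"description": ""}},
    {"snippet": {"description": "No video ID here."}},
]

video_data_demo = build_video_data(sample_items)

# Round-trip through a temporary file to confirm the JSON is readable.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filtered_video_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(video_data_demo, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded)
```

Writing with an explicit `encoding="utf-8"` and `ensure_ascii=False` keeps any non-ASCII characters in the descriptions readable in the saved file.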
