# IMT 547 Project: Channel-Based Data Collection

Chesie Yu

02/17/2024


## Authentication

In [1]:
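# The YouTube API key, read from the environment instead of being hardcoded
# (the variable name YOUTUBE_API_KEY is an assumption; use whichever name you export)
import os
API_KEY = os.environ["YOUTUBE_API_KEY"]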

In [2]:
# Install libraries
!pip install --upgrade google-api-python-client --quiet

In [3]:
# Import libraries
import json
import time
import pandas as pd
import googleapiclient
from googleapiclient import discovery, errors

In [4]:
# Initialize the YouTube API
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

## Functions

In [5]:
def get_uploads_id(channel_id):
    """
    Fetch the uploads playlist ID for a given YouTube channel. 
    """
    # Call the API to find the uploads playlist id
    # Documentation: https://developers.google.com/youtube/v3/docs/channels/list
    request = youtube.channels().list(
        part="contentDetails",
        id=channel_id
    )
    res = request.execute()

    # Extract the uploads playlist id
    uploads_id = res["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    
    return uploads_id
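
The lookup above indexes `res["items"][0]` directly, so it raises an exception if the response carries no items for a channel ID. A minimal sketch of a more defensive parser follows. `extract_uploads_id` is a hypothetical helper that is not part of this notebook, and the two response dictionaries are hand-written stand-ins rather than real API output.

```python
def extract_uploads_id(res):
    """Return the uploads playlist ID from a channels.list response, or None."""
    # Treat a missing or empty "items" list as "channel not found"
    items = res.get("items") or []
    if not items:
        return None
    related = items[0].get("contentDetails", {}).get("relatedPlaylists", {})
    return related.get("uploads")


# Hand-written stand-ins for API responses (not real data)
found = {"items": [{"contentDetails": {"relatedPlaylists": {"uploads": "UU_example"}}}]}
missing = {"pageInfo": {"totalResults": 0}}

print(extract_uploads_id(found))    # UU_example
print(extract_uploads_id(missing))  # None
```

Returning `None` lets the caller skip a bad channel ID and continue with the remaining channels instead of aborting the whole run.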

In [6]:
def get_video_ids(uploads_id, max_videos=30, keywords=""):
    """
    Fetch video IDs from a YouTube playlist.
    """
    # Empty list to store video_ids
    video_ids = []
    page_token = None
    
    # Loop until we collect enough videos
    while len(video_ids) < max_videos: 
        # Call the API to extract video IDs from playlist
        # Documentation: https://developers.google.com/youtube/v3/docs/playlistItems
        request = youtube.playlistItems().list(
            part="snippet",
            playlistId=uploads_id,
            pageToken=page_token,
            maxResults=50
        )
        res = request.execute()
    
        # Store the video ids
        for v in res["items"]: 
            # Check if title contains keywords
            # Maybe try stemming/lemmatization if I have the time?
            title = v["snippet"]["title"].lower()
            # An empty `keywords` value means "no filter": keep every video
            if not keywords or any(k.lower() in title for k in keywords):
                video_ids.append(v["snippet"]["resourceId"]["videoId"])

            # Exit the loop once the required number of videos is reached
            if len(video_ids) >= max_videos:
                break 
    
        # Set the token
        page_token = res.get("nextPageToken")

        # Exit the loop if no token is found
        if not page_token: 
            break
    
    return video_ids
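
The title filter is a case-insensitive substring match, which can be tried on its own without calling the API. The sketch below uses a hypothetical helper, `title_matches`, and made-up titles. Treating an empty keyword list as "no filter" is a design assumption of this sketch.

```python
def title_matches(title, keywords):
    """Case-insensitive substring match; an empty keyword list matches everything."""
    if not keywords:
        return True
    title = title.lower()
    return any(k.lower() in title for k in keywords)


# Made-up titles for illustration
titles = ["GTA 5 Funny Moments", "Minecraft Part 1", "Elden Ring: First Boss"]
keywords = ["gta", "elden ring"]

print([t for t in titles if title_matches(t, keywords)])
# ['GTA 5 Funny Moments', 'Elden Ring: First Boss']
```

Note that `keywords` should be a list of strings. Passing a bare string such as `"gta"` makes the loop iterate over single characters, so nearly every title would match.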

In [7]:
def get_video_info(video_ids):
    """
    Fetch video info from a list of YouTube videos.
    """
    # Empty list to store video info
    video_info = []

    for vid in video_ids:
        # Call the API to extract video info from ids
        # Documentation: https://developers.google.com/youtube/v3/docs/videos#resource
        request = youtube.videos().list(
            part="snippet, statistics",
            id=vid
        )
        res = request.execute()
        
        for v in res["items"]:
            # Extract relevant video info
            video_info.append({
                "channel_id": v["snippet"]["channelId"],
                "channel_name": v["snippet"]["channelTitle"],
                "video_id": v["id"],
                "video_title": v["snippet"]["title"],
                "video_creation_time": v["snippet"]["publishedAt"],
                "video_description": v["snippet"]["description"],
                "video_tags": v["snippet"].get("tags", []), 
                "video_viewcount": v["statistics"].get("viewCount", "0"),
                "video_likecount": v["statistics"].get("likeCount", "0"),
                "video_commentcount": v["statistics"].get("commentCount", "0"),
            })
            
    # Print summary
    print(f"Number of Videos Extracted: {len(video_info)}\n")
    
    return video_info
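
The function above issues one `videos().list` request per video. As far as I know, the `id` parameter also accepts a comma-separated list of IDs (up to 50 per request, to my understanding), which would cut the number of calls considerably. A minimal sketch of the batching step, using a hypothetical `chunked` helper and placeholder IDs:

```python
def chunked(ids, size=50):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs, not real videos
video_ids = [f"vid{i:03d}" for i in range(120)]

# Each string could be passed as `id=` in a single videos().list call
batches = [",".join(batch) for batch in chunked(video_ids)]

print(len(batches))                 # 3
print(batches[-1].count(",") + 1)   # 20
```

With 30 videos per channel, this would turn 30 requests into one, assuming the batch limit above holds.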

In [8]:
def get_video_comments(video_ids, max_comments=100):
    """
    Fetch comments (up to max_comments) for a list of videos.
    """
    # Empty list to store the comments
    comment_info = []

    # Loop through the video ids
    for vid in video_ids:
        page_token = None 
        
        # Empty list to store individual video comments
        video_comment_info = []
        
        while len(video_comment_info) < max_comments:
            try:
                # Call the API to extract comments for videos
                # Documentation: https://developers.google.com/youtube/v3/docs/commentThreads/list
                request = youtube.commentThreads().list(
                    videoId=vid,
                    part="id, snippet, replies",
                    textFormat="plainText",
                    maxResults=100,
                    pageToken=page_token
                )
                res = request.execute()

                # Extract relevant comment info
                for c in res["items"]:
                    video_comment_info.append({
                        "video_id": c["snippet"]["videoId"],
                        "comment_id": c["snippet"]["topLevelComment"]["id"],
                        # authorChannelId may be absent for some comments, so fall back to None
                        "comment_author_id": c["snippet"]["topLevelComment"]["snippet"].get("authorChannelId", {}).get("value"),
                        "comment_text": c["snippet"]["topLevelComment"]["snippet"]["textOriginal"],
                        "comment_time": c["snippet"]["topLevelComment"]["snippet"]["updatedAt"],
                        "comment_likecount": c["snippet"]["topLevelComment"]["snippet"]["likeCount"],
                        "comment_replycount": c["snippet"]["totalReplyCount"]
                    })
                    
                    # Exit the loop once the required number of comments is reached
                    if len(video_comment_info) >= max_comments:
                        break

                # Set the token
                page_token = res.get("nextPageToken")

                # Exit the loop if no token is found
                if not page_token: 
                    break

            # Error handling for commentsDisabled
            except errors.HttpError as e:
                if e.resp.status == 403 and "commentsDisabled" in str(e):
                    print(f"Comments are disabled for video {vid}.")
                else:
                    print(f"An error occurred for video {vid}: {e}")
                break 
    
        # Extend the comment info
        comment_info.extend(video_comment_info)

    # Print summary
    print(f"Number of Comments Extracted: {len(comment_info)}")
    print("====================================\n")
    
    return comment_info
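
Both the playlist and the comment functions use the same pagination pattern: request a page, keep the items, and follow `nextPageToken` until it is missing or enough items have been collected. The sketch below isolates that loop so it can be run without the API. `collect_paged`, `fake_fetch`, and the `PAGES` dictionary are hypothetical stand-ins.

```python
def collect_paged(fetch_page, limit):
    """Collect up to `limit` items by following nextPageToken."""
    items, token = [], None
    while len(items) < limit:
        res = fetch_page(token)
        items.extend(res.get("items", []))
        token = res.get("nextPageToken")
        # No token means the last page has been reached
        if not token:
            break
    return items[:limit]


# A fake three-page source keyed by page token
PAGES = {
    None: {"items": [1, 2, 3], "nextPageToken": "p2"},
    "p2": {"items": [4, 5, 6], "nextPageToken": "p3"},
    "p3": {"items": [7]},
}


def fake_fetch(token):
    return PAGES[token]


print(collect_paged(fake_fetch, 5))    # [1, 2, 3, 4, 5]
print(collect_paged(fake_fetch, 100))  # [1, 2, 3, 4, 5, 6, 7]
```

Slicing at the end trims any overshoot from the final page, which plays the same role as the early `break` inside the item loops above.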

### Main Function

In [9]:
def get_youtube_data(channel_ids, max_videos=30, max_comments=100, keywords=""):
    """
    Fetch videos and comments for a dictionary of channels (name -> ID), merged into one DataFrame.
    """
    # Start timing
    start_time = time.time()
    
    all_video_info = []
    all_comment_info = []

    for channel_name, channel_id in channel_ids.items():
        print(f"Processing channel: {channel_name}")
        print("--------------------------------")
        
        # Get uploads playlist id for channel
        uploads_id = get_uploads_id(channel_id)

        # Get video ids from uploads playlist
        video_ids = get_video_ids(uploads_id, max_videos, keywords)

        # Get video info from video ids
        video_info = get_video_info(video_ids)
        all_video_info.extend(video_info)

        # Fetch comments for each video
        comment_info = get_video_comments(video_ids, max_comments)
        all_comment_info.extend(comment_info)
    
    # Convert to DataFrames
    video_info_df = pd.DataFrame(all_video_info)
    comment_info_df = pd.DataFrame(all_comment_info)

    # Merge video information with comments
    yt_comments = pd.merge(video_info_df, comment_info_df, on="video_id", how="inner")

    # End timing 
    print(f"Runtime: {time.time() - start_time:.4f} seconds")
    
    return yt_comments

## Data Collection

In [10]:
# Set the parameters
max_videos = 30
max_comments = 100

# Select the keywords
action_keywords = [
    "call of duty", "gta", "the last of us", "god of war", "batman", 
    "red dead redemption", "assassin's creed", "star wars jedi", 
    "resident evil", "cyberpunk", "fallout", "tomb raider", "elden ring"
]

nonaction_keywords = [
    "minecraft", "pokemon go", "just dance", "it takes two", "uncharted",
    "brawl stars"
]

In [11]:
# Select the channels
channel_ids_dict = {
    # "AboFlah": "UCqq5n-Oe-r1EEHI3yvhVJcA",
    "Markiplier": "UC7_YxT-KID8kRbqZo7MyscQ", 
    # "Frost Diamond": "UC4hGmH5sABOA70D4fGb8qNQ",
    "SSSniperWolf": "UCpB959t8iPrxQWj7G6n0ctQ",
    # "VEGETTA777": "UCam8T03EOFBsNdR0thrFHdQ",
    # "rezendeevil": "UCbTVTephX30ZhQF5zwFppBg",
    "jacksepticeye": "UCYzPXprvl5Y-Sf0g4vX-m6g",
    "DanTDM": "UCS5Oz6CHmeoF7vSad0qqXfw",
    "VanossGaming": "UCKqH_9mk1waLgBiL2vT5b9g",
    "Preston": "UC70Dib4MvFfT1tU6MqeyHpQ", 
    "Aphmau": "UCzYfz8uibvnB7Yc1LjePi4g", 
    "The Game Theorists": "UCo_IB5145EVNcf8hw1Kku7w",
    "Ali-A": "UCYVinkwSX7szARULgYpvhLw",
    "PDK Films": "UC-F6LZSWz34xKknY8NNCzgQ"
}
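
With ten channels selected, it can be useful to estimate the API quota a run will consume before starting it. The sketch below assumes that each `list` call used in this notebook costs one quota unit, which matches my understanding of the YouTube Data API pricing and should be checked against the current quota documentation. `estimate_units` is a hypothetical helper. The number of playlist pages scanned per channel depends on how often titles match the keywords, so it is left as an assumed parameter.

```python
import math


def estimate_units(n_channels, videos_per_channel, comments_per_video,
                   playlist_pages_per_channel=1, comment_page_size=100):
    """Rough count of API calls, assuming one quota unit per list call."""
    channel_calls = n_channels                                # channels().list
    playlist_calls = n_channels * playlist_pages_per_channel  # playlistItems().list
    video_calls = n_channels * videos_per_channel             # one videos().list per video
    comment_pages = math.ceil(comments_per_video / comment_page_size)
    comment_calls = n_channels * videos_per_channel * comment_pages
    return channel_calls + playlist_calls + video_calls + comment_calls


# 10 channels, 30 videos each, 100 comments per video, 1 playlist page per channel
print(estimate_units(10, 30, 100))  # 620
```

Since the keyword filter may force many more playlist pages than one per channel, this figure is a lower bound.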

In [12]:
# Get YouTube videos and comments for action videos
yt_action = get_youtube_data(channel_ids_dict, max_videos, max_comments, action_keywords)
yt_action["genre"] = "action"
yt_action.head(3)

Processing channel: Markiplier
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: SSSniperWolf
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: jacksepticeye
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: DanTDM
--------------------------------
Number of Videos Extracted: 16

Number of Comments Extracted: 1600

Processing channel: VanossGaming
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: Preston
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: Aphmau
--------------------------------
Number of Videos Extracted: 1

Number of Comments Extracted: 66

Processing channel: The Game Theorists
--------------------------------
Number of V

Unnamed: 0,channel_id,channel_name,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,genre
0,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,9pWXDlBvHHo,Resident Evil 4: Separate Ways DLC,2023-10-30T20:08:30Z,Check out Resident Evil 4 HERE ►► https://gsgh...,"[markiplier, resident evil 4, seperate ways dl...",1593069,104925,8686,UgzcGeP-nTXkwgwmyLV4AaABAg,UCYCM610WdShF-g-_cKePxrw,"I was falling off a cliff when Mark flew down,...",2024-02-17T17:44:28Z,0,0,action
1,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,9pWXDlBvHHo,Resident Evil 4: Separate Ways DLC,2023-10-30T20:08:30Z,Check out Resident Evil 4 HERE ►► https://gsgh...,"[markiplier, resident evil 4, seperate ways dl...",1593069,104925,8686,Ugy1pYQTYdvqlHkwM8B4AaABAg,UCtiE2RZAE4EWQ049ib979xg,If you edit out all the parts where Mark talks...,2024-02-17T12:26:10Z,0,0,action
2,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,9pWXDlBvHHo,Resident Evil 4: Separate Ways DLC,2023-10-30T20:08:30Z,Check out Resident Evil 4 HERE ►► https://gsgh...,"[markiplier, resident evil 4, seperate ways dl...",1593069,104925,8686,UgzNpxHp-Tks7j-KvDh4AaABAg,UCEx_pl0E2MYAYEuJCyxs2FQ,Wish he played the entire game :(,2024-02-16T15:31:38Z,0,0,action


In [13]:
# Check the dimension
yt_action.shape

(22566, 17)

In [14]:
# Write to CSV
yt_action.to_csv("../data/yt_action.csv", index=False)
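
One caveat when writing to CSV: the `video_tags` column holds Python lists, and a CSV cell can only store their string form, so the column comes back as text when the file is read again. The sketch below shows the round trip with the standard-library `csv` module instead of pandas, on a made-up row, and recovers the list with `ast.literal_eval`. It illustrates the behaviour and is not a step of the original pipeline.

```python
import ast
import csv
import io

# A made-up row with a list-valued column
row = {"video_id": "abc123", "video_tags": ["markiplier", "resident evil 4"]}

# Write to an in-memory CSV; the list is stored as its string representation
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["video_id", "video_tags"])
writer.writeheader()
writer.writerow(row)

# Read it back: the tags are now a plain string
buf.seek(0)
restored = next(csv.DictReader(buf))
print(type(restored["video_tags"]).__name__)  # str

# Parse the string back into a list
tags = ast.literal_eval(restored["video_tags"])
print(tags)  # ['markiplier', 'resident evil 4']
```

With pandas, the same parsing could be applied to the column after `read_csv`, for example through `apply(ast.literal_eval)`.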

In [15]:
# Get YouTube videos and comments for non-action videos
yt_nonaction = get_youtube_data(channel_ids_dict, max_videos, max_comments, nonaction_keywords)
yt_nonaction["genre"] = "non-action"
yt_nonaction.head(3)

Processing channel: Markiplier
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: SSSniperWolf
--------------------------------
Number of Videos Extracted: 9

Number of Comments Extracted: 900

Processing channel: jacksepticeye
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: DanTDM
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: VanossGaming
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: Preston
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: Aphmau
--------------------------------
Number of Videos Extracted: 30

Number of Comments Extracted: 3000

Processing channel: The Game Theorists
--------------------------------
Number of 

Unnamed: 0,channel_id,channel_name,video_id,video_title,video_creation_time,video_description,video_tags,video_viewcount,video_likecount,video_commentcount,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,genre
0,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,cCz7O3davRI,GOING WHERE NO ONE SHOULD EVER GO | Minecraft ...,2020-12-18T18:47:19Z,It's time to dive deeper into the Nether than ...,"[markiplier, minecraft, minecraft nether, the ...",2817363,162960,13459,UgzsAqou3xvJR1aD1mt4AaABAg,UC68tZ7GbfJl2GaMI5aD8g4A,MOREEEEEEE PLEASEEEEEE 😭,2024-02-18T05:01:32Z,0,0,non-action
1,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,cCz7O3davRI,GOING WHERE NO ONE SHOULD EVER GO | Minecraft ...,2020-12-18T18:47:19Z,It's time to dive deeper into the Nether than ...,"[markiplier, minecraft, minecraft nether, the ...",2817363,162960,13459,UgzZU8JFOa2mhfVQohF4AaABAg,UC9DJud-BrdcgME-BZcZ-tVw,"Day #1,495, still waiting for that next episod...",2024-02-16T18:42:36Z,0,0,non-action
2,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,cCz7O3davRI,GOING WHERE NO ONE SHOULD EVER GO | Minecraft ...,2020-12-18T18:47:19Z,It's time to dive deeper into the Nether than ...,"[markiplier, minecraft, minecraft nether, the ...",2817363,162960,13459,UgzXVcaeU1TvJ70cD7h4AaABAg,UCoHpRoWN8S8QGFesSc5QKWQ,mark.. where did this series go,2024-02-16T03:24:51Z,0,0,non-action


In [16]:
# Check the dimension
yt_nonaction.shape

(22800, 17)

In [17]:
# Write to CSV
yt_nonaction.to_csv("../data/yt_nonaction.csv", index=False)