# 1. Introduction

In this project, I scraped metadata from two popular YouTube channels that I like: PewDiePie and MrBeast. Both channels have very high subscriber counts and publish a wide variety of videos. Using the video metadata, I wanted to see how viewing trends have changed over the past few years and how they relate to like count, view count, and video duration.

After identifying and analysing some research questions, I wanted to build a machine learning model to assess the popularity of a new video based on its view count, comment count, and video duration.

I used the YouTube Data API to collect the required data, which gave me the fields described below.

## Column Descriptions

- videoId - A unique video ID
- title - The title of the YouTube video
- publishedAt - The date of upload
- categoryId - A number that categorises the video type (see https://mixedanalytics.com/blog/list-of-youtube-video-category-ids/ for more information)
- categoryName - Name of the category
- tags - Common tags associated with the YouTube video
- viewCount - The total number of views to date
- likeCount - The total number of likes to date
- dislikeCount - Should ideally show the dislike count, but YouTube stopped exposing it publicly in a recent policy update, so the value is recorded as 0
- commentCount - Total number of comments on the YouTube video
- duration - The duration of the YouTube video
- subscriberCount - Total number of subscribers of the channel to date

Note that the YouTube API only provides the current subscriber count, not the historical count at the time each video was published. Also, because of YouTube's policy update on dislikes, the dislike count is always zero in this dataset.
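
The `duration` field comes back from the API as an ISO 8601 duration string (for example `PT4M13S`) rather than a number, so it has to be converted before it can be compared with views or likes. Below is a minimal sketch of such a conversion. `duration_to_seconds` is a hypothetical helper that is not part of the original notebook, and it assumes durations only use days, hours, minutes, and seconds.

```python
import re

# Matches durations such as "PT4M13S", "PT1H2M3S", or "P1DT2H".
# Assumption: only day, hour, minute, and second components appear.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unrecognised duration: {duration!r}")
    # Components missing from the string are treated as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT4M13S"))  # 4 * 60 + 13 = 253
```

Applying a helper like this to the whole column would give a numeric duration that can be plotted against likes or views.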


## Research Questions

Here are the research questions I identified:

- What are the yearly trends in video view counts?
- What is the general correlation between the like count and view count across the dataset of YouTube videos, and what does the scatter plot reveal about their relationship?
- What are the top 10 tags observed across all the YouTube videos?
- What does the plot of average like count versus video duration look like?
- What is the average duration of videos for PewDiePie and MrBeast, and how does video length relate to popularity metrics like views and likes?


# 2. Data Collection

A Google Cloud account is required. In it, I created a new project and enabled 'YouTube Data API v3'. Once credentials are created, a unique API key is available for pulling publicly accessible data. Another API, the YouTube Analytics API, can provide deeper insights into a channel (such as subscriber demographics), but it is only available to the channel's owner. This project therefore uses only publicly available data.


In [2]:
import requests
import pandas as pd
from datetime import datetime
from googleapiclient.discovery import build

I start by gathering video IDs from a specific YouTube channel. The code first retrieves the channel's ID by passing the channel's username to the YouTube API. With the channel ID in hand, it fetches the IDs of videos uploaded by that channel. The code performs this request twice (the count can be adjusted to pull more data), collecting up to 50 video IDs per request and skipping any IDs already collected in earlier runs. This builds a list of videos for further analysis without redundancy.

In [3]:
def fetch_video_ids(channel_username, api_key, existing_ids):
    base_url = "https://www.googleapis.com/youtube/v3/"
    channel_endpoint = base_url + "channels"
    channel_params = {
        'part': 'id',
        'forUsername': channel_username,
        'key': api_key
    }
    channel_response = requests.get(channel_endpoint, params=channel_params).json()

    # Check for errors in the response
    if 'error' in channel_response:
        print(f"Error in response: {channel_response['error']}")
        return []

    if 'items' not in channel_response or not channel_response['items']:
        print(f"No items found for channel: {channel_username}. Response: {channel_response}")
        return []

    channel_id = channel_response['items'][0]['id']
    
    video_ids = []
    next_page_token = None
    for _ in range(2):  # Adjust the range as needed
        search_endpoint = base_url + "search"
        search_params = {
            'part': 'id',
            'channelId': channel_id,
            'type': 'video',
            'maxResults': 50,
            'pageToken': next_page_token,
            'key': api_key
        }
        search_response = requests.get(search_endpoint, params=search_params).json()
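        if 'items' in search_response:
            for item in search_response['items']:
                video_id = item['id']['videoId']
                if video_id not in existing_ids:
                    video_ids.append(video_id)
            next_page_token = search_response.get('nextPageToken', None)
            # Stop when there are no more pages; otherwise the next request
            # would start again from the first page and collect duplicates
            if next_page_token is None:
                break
        else:
            break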

    return video_ids
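
The paging and de-duplication logic can be checked without spending any API quota by separating it from the HTTP call. The following is only a sketch under that assumption: `collect_new_ids` and `fetch_page` are hypothetical names, and the fake pages are hand-written stand-ins for real search responses.

```python
def collect_new_ids(fetch_page, existing_ids, max_pages=2):
    """Collect unseen video IDs from up to max_pages pages of search results.

    fetch_page is a hypothetical callable: it takes a page token (or None
    for the first page) and returns one parsed JSON response as a dict.
    """
    seen = set(existing_ids)
    new_ids = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        for item in response.get("items", []):
            video_id = item["id"]["videoId"]
            if video_id not in seen:
                seen.add(video_id)  # also guards against repeats across pages
                new_ids.append(video_id)
        token = response.get("nextPageToken")
        if token is None:  # no further pages to request
            break
    return new_ids


# Hand-written stand-in for two pages of search results (not real API output)
fake_pages = {
    None: {
        "items": [{"id": {"videoId": "a"}}, {"id": {"videoId": "b"}}],
        "nextPageToken": "p2",
    },
    "p2": {"items": [{"id": {"videoId": "b"}}, {"id": {"videoId": "c"}}]},
}

print(collect_new_ids(fake_pages.get, existing_ids={"a"}))  # ['b', 'c']
```

In real use, `fetch_page` would wrap the `requests.get` call to the search endpoint shown above.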


Next, I focus on collecting detailed information about each video. For each video ID obtained previously, I use the YouTube API to gather various pieces of data. This includes the video's title, publication date, category, and various engagement metrics like view count, like count, and so on. I also fetch channel-related data to add more context to the video.


In [4]:
def fetch_youtube_data(video_id, api_key):
    video_url = "https://www.googleapis.com/youtube/v3/videos"
    channel_url = "https://www.googleapis.com/youtube/v3/channels"
    category_url = "https://www.googleapis.com/youtube/v3/videoCategories"
    
    # Fetch video data
    video_params = {
        'part': 'snippet,statistics,contentDetails',
        'id': video_id,
        'key': api_key
    }
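    video_response = requests.get(video_url, params=video_params).json()
    # Return the raw error payload (e.g. quota exceeded) so the caller can inspect it
    if 'error' in video_response:
        return video_response
    video_data = video_response['items'][0]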
    
    # Fetch channel data
    channel_id = video_data['snippet']['channelId']
    channel_params = {
        'part': 'statistics',
        'id': channel_id,
        'key': api_key
    }
    channel_response = requests.get(channel_url, params=channel_params).json()
    channel_data = channel_response['items'][0]
    
    # Fetch category name
    category_id = video_data['snippet']['categoryId']
    category_params = {
        'part': 'snippet',
        'id': category_id,
        'key': api_key
    }
    category_response = requests.get(category_url, params=category_params).json()
    category_name = category_response['items'][0]['snippet']['title']
    
    # Compile and return the data
    data = {
        'videoId': video_id,  
        'title': video_data['snippet'].get('title', None),
        'publishedAt': video_data['snippet'].get('publishedAt', None),
        'categoryId': video_data['snippet'].get('categoryId', None),
        'categoryName': category_name,
        'tags': video_data['snippet'].get('tags', []),
        'viewCount': video_data['statistics'].get('viewCount', 0),
        'likeCount': video_data['statistics'].get('likeCount', 0),
        'dislikeCount': video_data['statistics'].get('dislikeCount', 0),
        'commentCount': video_data['statistics'].get('commentCount', 0),
        'duration': video_data['contentDetails'].get('duration', None),
        'subscriberCount': channel_data['statistics'].get('subscriberCount', 0)
    }
    
    return data
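
The function above makes three requests per video (video, channel, and category). One possible way to reduce quota usage, offered here only as a sketch, is to request several videos at once: to my knowledge the `videos` endpoint accepts a comma-separated list of IDs in its `id` parameter, with 50 per request being the assumed limit here. `batch_ids` is a hypothetical helper, not part of the original notebook.

```python
def batch_ids(video_ids, batch_size=50):
    """Yield comma-separated ID strings, each covering at most batch_size videos.

    Assumption: the endpoint accepts up to 50 comma-separated IDs per call.
    """
    for start in range(0, len(video_ids), batch_size):
        yield ",".join(video_ids[start:start + batch_size])


# Made-up IDs for illustration only
ids = [f"video{i}" for i in range(120)]
batches = list(batch_ids(ids))
print(len(batches))  # 3 batches: 50 + 50 + 20 IDs
```

Each yielded string could be passed as the `id` value in `video_params`, and the response's `items` list would then hold one entry per video instead of a single entry.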


Here, I've written a function to handle potential quota errors from the YouTube API. Since there's a limit to the number of requests one can make, this function helps in identifying when these limits are exceeded, allowing me to avoid unnecessary errors.

In [5]:
# Function to check if quota limit error is in the response
def is_quota_error(response):
    if 'error' in response:
        error_code = response['error']['code']
        if error_code == 403:
            error_reason = response['error']['errors'][0]['reason']
            if error_reason == 'quotaExceeded' or error_reason == 'rateLimitExceeded':
                return True
    return False
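
The check above indexes directly into the error payload, so it would raise if a field were missing. A slightly more defensive variant is sketched below. `is_quota_error_safe` is a hypothetical name, and the sample payloads are hand-written to mimic the shape the function above expects; they are not captured API responses.

```python
QUOTA_REASONS = {"quotaExceeded", "rateLimitExceeded"}


def is_quota_error_safe(response):
    """Return True only for a 403 error whose reason indicates a quota limit."""
    error = response.get("error")
    if not isinstance(error, dict) or error.get("code") != 403:
        return False
    # Gather every listed reason instead of assuming the first entry exists
    reasons = {entry.get("reason") for entry in error.get("errors", [])}
    return bool(reasons & QUOTA_REASONS)


# Hand-written sample payloads (assumed shapes, for illustration only)
sample_quota = {"error": {"code": 403, "errors": [{"reason": "quotaExceeded"}]}}
sample_other = {"error": {"code": 404, "errors": [{"reason": "videoNotFound"}]}}
sample_ok = {"items": []}

print(is_quota_error_safe(sample_quota))  # True
print(is_quota_error_safe(sample_other))  # False
print(is_quota_error_safe(sample_ok))     # False
```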


In the code chunk below, a new CSV file is created, or an existing one is loaded so that new data can build on previous runs. I have set up print statements to keep track of what's happening. Here, "pewdiepie.csv" stores the PewDiePie channel's data and "mrbeast.csv" stores MrBeast's. I introduced this split because scraping just one creator's content was enough to exceed my daily API limit. Keeping a separate file per creator was the workaround for getting more data.

The channel username is defined in a variable called channel_username; uncomment the line for the channel whose data you want to pull. The rest of the code in the chunk below calls the previously defined functions to get unique video IDs, which are then used to fetch the video metadata. A further check for the daily limit stops the loop cleanly once the quota is reached. The scraped data is then converted to a DataFrame.

In [6]:
filename = "pewdiepie.csv"  # for pewdiepie data
# filename = "mrbeast.csv" # for mrbeast data

# Load existing data or create an empty DataFrame
try:
    df = pd.read_csv(filename)
    existing_ids = set(df['videoId'].tolist())
    print(f"Loaded existing data from {filename}.")
except FileNotFoundError:
    df = pd.DataFrame()
    existing_ids = set()
    print(f"{filename} not found. A new file will be created.")

# Defining the channel username (a single string, as expected by 'forUsername')
channel_username = 'pewdiepie' # Enabling this line by default to scrape PewDiePie's data
# channel_username = 'mrbeast6000' # uncomment this line to scrape MrBeast data

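# API key, read from an environment variable instead of being hard-coded so that
# it is not exposed when the notebook is shared
# (the variable name YOUTUBE_API_KEY is an assumption; use whatever name you set)
import os
api_key = os.environ['YOUTUBE_API_KEY']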

# fetching the video id's
video_ids = fetch_video_ids(channel_username, api_key, existing_ids)

# fetching the metadata of each video 
for vid in video_ids:
    video_data = fetch_youtube_data(vid, api_key)
    
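    # to check if the daily API limit is reached
    if is_quota_error(video_data):
        print("YouTube API quota limit reached. Stopping data fetch.")
        break

    # skip this video if the result carries any other API error
    if 'error' in video_data:
        print(f"Skipping {vid}: {video_data['error']}")
        continue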
        
    # converting the data to a dataframe   
    df_new = pd.DataFrame([video_data])
    df = pd.concat([df, df_new], ignore_index=True)


temp2.csv not found. A new file will be created.
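
One detail worth handling before new rows are appended: as far as I know, the API returns the statistics fields as strings, whereas rows loaded back from an existing CSV are usually parsed as integers by pandas, so the combined column can end up with mixed types. The sketch below converts the count fields of one row to integers. `coerce_counts` and `COUNT_FIELDS` are hypothetical names, and the sample row is made up.

```python
COUNT_FIELDS = (
    "viewCount",
    "likeCount",
    "dislikeCount",
    "commentCount",
    "subscriberCount",
)


def coerce_counts(row):
    """Return a copy of row with the count fields converted to int.

    Missing or empty values are treated as 0, matching the defaults
    used when the row is compiled.
    """
    cleaned = dict(row)
    for field in COUNT_FIELDS:
        cleaned[field] = int(cleaned.get(field, 0) or 0)
    return cleaned


# Made-up row for illustration only
row = {"videoId": "abc", "viewCount": "1200", "likeCount": "35", "commentCount": 0}
print(coerce_counts(row))
```

Calling this on each `video_data` dict before building `df_new` would keep the count columns numeric across runs.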


## Saving data to CSV file

A total of 642 rows for PewDiePie and 597 rows for MrBeast have been scraped so far and saved to their respective CSV files using the code below.


In [10]:
# Saving data to csv file
df.to_csv(filename, index=False, encoding='utf-8')
print(f"Data updated and saved to {filename}")


Data updated and saved to temp2.csv
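
Because `tags` holds a Python list, writing the DataFrame to CSV stores each cell as the list's string representation, and reading the file back yields strings such as `"['a', 'b']"` rather than lists. The sketch below shows one way to restore the lists and count tags, which the top-10-tags question needs. `parse_tags` is a hypothetical helper and the sample cells are invented.

```python
import ast
from collections import Counter


def parse_tags(cell):
    """Turn a CSV cell such as "['a', 'b']" back into a Python list."""
    if isinstance(cell, list):
        return cell  # already a list (row scraped in the current session)
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing values (e.g. NaN) or empty cells
    # literal_eval only evaluates Python literals, so it is safer than eval
    return ast.literal_eval(cell)


# Invented cells mimicking how the tags column looks after a CSV round trip
cells = ["['gaming', 'funny']", "['funny']", "[]"]
counts = Counter(tag for cell in cells for tag in parse_tags(cell))
print(counts.most_common(1))  # [('funny', 2)]
```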
