
# Exploratory Data Analysis of S.T.E.M Youtube Channels

## 1. Aims, objectives and background

### 1.1 Introduction

By daily searches, YouTube is the second largest search engine on the internet behind Google. With over 122 million daily users and 62% of U.S. internet users using the platform daily[[1]](https://thesocialshepherd.com/blog/youtube-statistics#:~:text=YouTube%20has%202.1%20billion%20monthly,122%20million%20users%20per%20day), the reach and scope of content that can gain traction on the platform is among the largest in the world. This has made it one of the first sites that educators go to in order to share their content with users, and also one of the go-to places for students looking for educational content. However, the algorithm behind YouTube's video recommendations is often changing and not completely understood by most users despite being one of the largest scale recommendation systems in use today[[2]](https://dl.acm.org/doi/10.1145/2959100.2959190).  
One of the main factors in the mainstream is that if a video receives more likes and comments (known as "engagement) then it will be promoted to more users. Another big consideration to many youtubers is video duration. Short videos are often watched all the way through, while longer ones have more opportunities to place advertisments (which benefits YouTube so logically would be rewarded by YouTube's algorithm) but may find watchers losing interest and leaving the video halfway through.[[3]](https://vidiq.com/blog/post/5-youtube-algorithm-myths-youtubers-need-to-know-about/).


### 1.2 Aims and objectives

Within this project, I will explore the following:

- Understand the YouTube API and use it to obtain video/channel metadata.
- Analyze this data to understand best practices among S.T.E.M. YouTube channels in order to grow a channel's audience.
- Explore the trending topics within the S.T.E.M. YouTube community using NLP techniques.
    - What topics are evergreen and a constant source of curiosity out of viewers? 
    - What questions are being asked in comments of these videos?



### 1.3 Steps of this project

1. Obtain video metadata via the YouTube API of mainstream S.T.E.M. communicators on YouTube. 
    - This contains many smaller steps, including: creating a developer key, requesting data from the API and transforming this data into something usable.
2. Preprocess and clean the data.
3. Build additional features for analysis.
4. Exploratory Data Analysis.
5. Conclusion.


### 1.4 The Dataset

#### Data Selection

As this project is focused on S.T.E.M. channels, there are not many available datasets that contain realtime data that is suitable for this purpose. As such, I have decided to use the [Google YouTube Data API version 3.0](https://developers.google.com/youtube/v3) to build my own dataset to explore. 

For analysis of a broader scope, there is a constantly updated dataset [available on Kaggle](https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset) containing the top 200 most trending videos of each day for the last several months. 

Complete steps for how I collected and built the dataset is contained in section *2. Data Creation*

#### Data Limitations

This dataset contains fully up-to-date, real-time and real-world data. However because this dataset was set up to look for channels that I am familiar with, the scope of the dataset might not reach as far as it could be with additional automation. Another consideration is what consistutes a 'popular' or 'large' channel. There are many metrics with which to measure a channel's popularity, I have mostly weighted the subscriber count as the main factor. However, there are many channels that recieve large viewcounts without gaining a large subscriber count because of the nature of the content. There are also channels that may focus on a more specific area within S.T.E.M. and therefore draw a smaller audience despite having a large audience for their niche. A final consideration is that without requesting additional request units from YouTube the API is limited to 10,000 requests per day. And so I have limited the scope of this EDA to remain within that limit. However, if desired a larger dataset could be generated by requesting different channels data on each day, though it would be slightly less up-to-date.

#### Ethics of data source

All data obtained in this dataset was done so through YouTube's API, the usage of which is free of charge given the the application send requests are within a certain limit. "The YouTube API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others." Additional unit allocations can be requests via the YouTube API site but that is unneccesary for a project of this scope. 

The data in this project is also public, and could be collected manually just by looking at each channel's page in your internet browser. No information that has been collected is hidden or private. Additionally, the data is used only for research purposes and not for any commercial ones.

In [None]:
from googleapiclient.discovery import build
from dateutil import parser
import pandas as pd
import isodate 
from dateutil import parser
# Data viz packages
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from wordcloud import WordCloud

## 2. Data creation with Youtube API

To begin this project, I used my own personal google account to create a project on Google Developers Console. Next, I created an API key. Next I searched for the target YouTube channels to be analysed and collected the channel ID of each one. Finally I enabled the API for use in this application and imported the channel IDs into an array.  

In [19]:
api_key = 'AIzaSyBe804qZLpGzADIOi0yUXaO4qcwDJiSo3w'

channel_ids = [#'UCoxcjq-8xIDTYp3uz647V5A', #numberphile
               #'UC9-y-6csu5WGm29I7JiwpnA', #computerphile
               'UCYO_jab_esuFRV4b17AJtAw', #1blue3brown
               #'UCbfYPyITQ-7l4upoX8nvctg', #2MinutePapers
               
              ]
               
#Setting up api connection. 

api_service_name = "youtube"
api_version = 'v3'

#Get credentials and create API client.
youtube = build(api_service_name, api_version, developerKey = api_key)


Following this, I wrote some methods that would allow requests to be sent to YouTube which collects various data from these channels.

In [20]:
def get_channel_stats(youtube, channel_ids):
    '''
    Get channel stats
    Params:
    ----
    youtube: build object from youtube API
    channel_ids: list of channel IDs
    
    Returns:
    -----
    dataframe of all channel stats for each id in the channel_ids list.
    '''
    
    all_data = []
    request = youtube.channels().list(part = 'snippet,contentDetails,statistics', 
                                        id = ', '.join(channel_ids))
    response = request.execute()

    #loop to store data in dictionary.
    for item in response['items']:
        data = {'channelName' : item['snippet']['title'],
                'subscribers' : item['statistics']['subscriberCount'],
                'views' : item['statistics']['viewCount'],
                'totalVideos' : item['statistics']['videoCount'],
                'playlistId' : item['contentDetails']['relatedPlaylists']['uploads']
               }
        all_data.append(data)

    return pd.DataFrame(all_data)

In [21]:
def get_video_ids(youtube, playlist_ids):
    '''
    Get video ids for channels in channel_ids
    Params:
    ----
    youtube: build object from youtube API
    playlist_ids: list of channels' playlist IDs
    
    Returns:
    -----
    Array of id values for each video in each channels 
    '''
    
    video_ids = []
    
    request = youtube.playlistItems().list(
            part="snippet,contentDetails",
            playlistId=playlist_id,
            maxResults = 50
        )
    response = request.execute()
    
    #loop to collect all ids.
    for item in response['items']:
        video_ids.append(item['contentDetails']['videoId'])
        
        
    next_page_token = response.get('nextPageToken')

    while next_page_token is not None:
        request = youtube.playlistItems().list(
            part="snippet, contentDetails",
            playlistId=playlist_id,
            maxResults = 50
        )
        response = request.execute()
        #loop to collect all ids.
        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])


        next_page_token = response.get('nextPageToken') 
    return video_ids

In [22]:
def get_video_details(youtube, video_ids):
    '''
    Get video statistics for all videos contained in each playlist of playlist_ids
    Params:
    ----
    youtube: build object from youtube API
    playlist_id: list of channels' playlist IDs
    
    Returns:
    -----
    Dataframe of video ids with various statistics such at Video Title, date of upload, viewcount etc.
    '''
    all_video_info = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet, contentDetails, statistics",
            id = ','.join(video_ids[i: i+50])
        )
        response = request.execute()


        for video in response['items']:
            stats_to_keep = {'snippet' : ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics' : ['viewCount', 'likeCount', 'commentCount'],
                             'contentDetails' : ['duration', 'definition', 'caption']
                            }
            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except: 
                        video_info[v] = None

            all_video_info.append(video_info)
    return pd.DataFrame(all_video_info)

In [23]:
def get_comments_in_videos(youtube, video_ids):
    '''
    Get top level comments as text from all videos given in video_ids (max 10 comments due to API's request quota).
    Params:
    youtube: the build object from googleapliclient.discovery
    video_ids: the array of video ids
    
    Returns: 
    DataFrame with video IDs and top level comments of each video.
    '''
    all_comments = []
    
    for video_id in video_ids:
        try:
            request = youtube.commentThreads().list(
                part = "snippet,replies",
                videoId = video_id
            )
            response = request.execute()
            
            comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items'][0:10]]
            comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}

            all_comments.append(comments_in_video_info)
            
        except:
            #When error occurs because of commends being disabled on a video etc.
            print(f"Could not collect comments for video {video_id}")
        
    return pd.DataFrame(all_comments)


## Get overall channel statistics

Using the first defined method above, get_channel_stats, we are going to obtain the channel stats for each channel in our array of channel_ids.

In [24]:
channel_data = get_channel_stats(youtube, channel_ids)

channel_data

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/channels?part=snippet%2CcontentDetails%2Cstatistics&id=UCYO_jab_esuFRV4b17AJtAw&key=AIzaSyBe804qZLpGzADIOi0yUXaO4qcwDJiSo3w&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">

Looking at the datatypes for the columns in this dataframe, many of the 'count' columns are currently stored as objects. This should be changed to more easily access this data.

In [None]:
#convert count columns to numeric columns.
numeric_cols = ['subscribers', 'views', 'totalVideos']
channel_data[numeric_cols] = channel_data[numeric_cols].apply(pd.to_numeric, errors = 'coerce')

With this fixed, we can look at the number of subscribers for each channel to understand the relative popularity of these channels.

In [None]:
sns.set(rc={'figure.figsize': (10, 8)})
ax = sns.barplot(data = channel_data.sort_values('subscribers', 
                ascending = False), x = 'channelName', y = 'subscribers')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)


Next, a look at the rankings based on total views across all videos rather than number of subscribers as overall popularity can be based on several of these metrics not only one. 

In [None]:
ax = sns.barplot(data = channel_data.sort_values('views', ascending = False), x = 'channelName', y = 'views')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)

## Get video statistics for all channels

Now, we will gather data from all videos among all channels in our channel_ids array. In total we collected metadata for 2322 videos.

In [None]:
# Create a dataframe with video statistics and comments from all channels

video_df = pd.DataFrame()
comments_df = pd.DataFrame()

for c in channel_data['channelName'].unique():
    print("Getting video information from channel: " + c)
    playlist_id = channel_data.loc[channel_data['channelName']== c, 'playlistId'].iloc[0]
    video_ids = get_video_ids(youtube, playlist_id)
    
    # get video data
    video_data = get_video_details(youtube, video_ids)
    # get comment data
    comments_data = get_comments_in_videos(youtube, video_ids)

    # append video data together and comment data toghether
    video_df = video_df.append(video_data, ignore_index=True)
    comments_df = comments_df.append(comments_data, ignore_index=True)

In [18]:
video_df

0

For comment_df we limited the number of comments we recieved in order to not go over the daily 10,000 unit request quota that the API has.

In [3]:
comment_df

NameError: name 'comment_df' is not defined

In [None]:
# Save as csv file.
video_df.to_csv("video_data_youtube_eda.csv")
comment_df.to_csv("comment_data_youtube_eda.csv")

## Preprocessing & Feature engineering

In order to effectively analyze the data, a few pre-processing steps have to be taken. Firstly, some columns should be reformatted, particularly the publishedAt and duration columns which are not in a logical format currently. Also, enriching the data with some new features that combine or refactor some of the data currently collected would help as well.

### Check for empty values

In [None]:
video_df.isnull().any()

In [None]:
video_df.publishedAt.sort_values().value_counts()

In [None]:
col = ['viewCount', 'likeCount', 'commentCount']
video_df[cols] = video_df[cols].apply(pd.to_numeric, errors = 'coerce', axis = 1)

### Enriching data

Some data can be enriched to help with analysis, for example:

- creating a column that shows on which day of the week the video was uploaded.
- convert duration column into seconds.
- calculate comments and likes per 1000 views.
- calculate title character length.

In [None]:
# publish data column
video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x))
video_df['publishedDay'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A"))

#convert duration to seconds
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x))
video_df['durationSecs'] = video_df['durationSecs'].astype('timedelta64[s]')

#add number of tags for each video as a column
video_df['tagsCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))

#comments and likes per 1000 views
video_df['likeRatio'] = video_df['likeCount'] / video_df['viewCount'] * 1000
video_df['commentRatio'] = video_df['commentCount'] / video_df['viewCount'] * 1000

# Title length
video_df['titleLength'] = video_df['title'].apply(lambda x: len(x))

Check the data set to ensure everything has come out as expected.

In [None]:
video_df.head()

## Exploratory Data Analysis

### Views distribution per channel

We can see how the views are distributed across each channel in the dataset using the video statistics we have collected. Some channels will be shown to have a consistent views per video while others more spikey as they fluctuate between trending and not. 

In [None]:
plt.rcParams['figure.figsize'] = (18,6)
sns.violinplot(video_df['channelTitle'], video_df['viewCount'])
plt.title("Views per channel", loc = 'right', fontsize = 14)
plt.show()

### Does having more comments and likes correlate with viewcount?
It seems logical that more people watching a video will result in more likes and comments, but as some channels have more commited repeat viewers than others, does this actually hold true in reality? 
Below shows a plot of this and suggests there is infact a strong correlation. However, likes and comments have a stronger correlation to each other than to the viewcount. As well as this, the number of likes seems to correlate more heavily than the number of comments.  

In [None]:
fig, ax = plt.subplots(1,2)
sns.scatterplot(data = video_df, x = 'commentCount', y = 'viewCount', ax = ax[0])
sns.scatterplot(data = video_df, x = 'likeCount', y = 'viewCount', ax = ax[1])

Next, looking at the ratio of likes/comments to 1000 views.

In [None]:
fig, ax = plt.subplot(1,2)
sns.scatterplot(data = video_df, x = 'commentRatio', y = 'viewCount', ax = ax[0])
sns.scatterplot(data = video_df, x = 'likeRatio', y = 'viewCount', ax = ax[1])

After comparing these figures with the absolute number of comments and likes, the correlation is much less clear. The connection between comments and view counts seems to diminish. This would suggest that channels with a high comment-view ratio have built a close community of viewers, who are willing to discuss the video with each other instead of moving onto another once they have finished watching. 

### Does video duration matter for views and interactions?

Many YouTube experts claim that the duration of a video is important for generating a higher 'weight' in the YouTube algorithm for a video. And that videos under or over a certain limit won't be pushed as much by the recommendation engine[[4]](https://piktochart.com/blog/ideal-video-length/#:~:text=Ideal%20YouTube%20video%20length%3A%205,lasting%205%20to%2015%20minutes). 

In [None]:
sns.histplot(data = video_df[video_df['durationSecs'] < 10000], x = 'durationSecs', bins = 30)

Next a look at duration against comment and like count. It appears that ______ shorter/longer videos tend to get more likes and comments than longer/shorter ones. 

In [None]:
fig, ax = plt.subplots(1,2)
sns.scatterplot(data = video_df, x = "durationSecs", y = "commentCount", ax = ax[0])
sns.scatterplot(data = video_df, x = "durationSecs", y = "likeCount", ax = ax[1])


### Is title length important for attracting viewers? 

This is another consideration often mentioned by experts claiming that title lengths between 60-70 characters[[5]](https://vidiq.com/blog/post/youtube-video-title/#:~:text=Stay%20Under%20the%20YouTube%20Title%20Character%20Limit&text=On%20YouTube%2C%20you%20have%20a,up%20as%20a%20truncated%20title) will encourage the algorithm to recommend that particular video. But is there any weight to this claim?

In [None]:
sns.scatterplot(data = video_df, x = "titleLength", y = 'viewCount')

## Wordcloud for title buzzwords

It is also interesting to see what common topics or words are used in titled in this demographic and which words most frequently appear. Below shows the wordcloud for the most common words in video titles among those in the dataset. However, stopwords such as "you" and "I" have not been counted.

In [None]:
stop_words = set(stopwords.words('english'))
video_df['title_no_stopwords'] = video_df['title'].apply(lambda x: [item for item in str(x).split() if item not in stop_words])

all_words = list([a for b in video_df['title_no_stopwords'].tolist() for a in b])
all_words_str = " ".join(all_words)

In [None]:
def plot_cloud(wordcloud):
    plt.figure(figsize = (30, 30))
    plt.imshow(wordcloud)
    plt.axis('off')
    
wordcloud = WordCloud(width = 2000, height = 1000, random_state = 1, background_color = 'white', 
                     colormap = 'viridis', collocation = False).generate(all_words_str)

plot_cloud(wordcloud)

(TODO)