YouTube, founded in 2005, has grown to become the second-largest search engine globally, processing over 3 billion searches per month [1]. Despite its popularity, the inner workings of the YouTube algorithm, such as how videos gain views and recommendations, remain somewhat of a mystery. YouTube boasts one of the most expansive and sophisticated recommendation systems in existence [2]. For new content creators, understanding why some videos succeed while others don't can be quite challenging. Numerous "myths" surround the factors contributing to a video's success on YouTube [3], such as the number of likes or comments, video duration, and other factors. It's also intriguing to explore emerging trends in specific niche topics covered by YouTube channels.
As someone looking into starting a new YouTube channel focused on the Pathfinder Second Edition tabletop role-playing system, and as someone keen on getting involved in the data science/data analytics field, I wanted to get a view of the market while polishing my data and Python skills. For the purposes of this project, I'll be focusing solely on Pathfinder channels. Thus, this project will delve into a statistical analysis of approximately the top 10 most successful Pathfinder 2e YouTube channels.
This project aims to achieve the following objectives:
- Gain proficiency in utilizing the YouTube API and extracting video data.
- Analyze video data to investigate common assumptions about YouTube video success, including:
  - Does the number of likes and comments impact a video's view count?
  - Is video duration a factor in views and user interaction (likes/comments)?
  - Does title length influence views?
  - How many tags do well-performing videos typically have, and what are the common tags used?
  - Across all the creators considered, how frequently do they upload new videos, and on which days of the week?
- Employ Natural Language Processing (NLP) techniques to explore trending topics, focusing on:
  - Identifying popular topics covered in videos (e.g., using word clouds for video titles).
  - Examining the questions posed in the video comments section.
The project will progress through the following stages:
- Retrieve video metadata from the YouTube API for the selected Pathfinder 2e channels. This involves several steps, including creating a developer key, requesting data, and transforming the responses into a usable format.
- Preprocess the data and engineer additional features for analysis.
- Conduct exploratory data analysis.
- Draw conclusions based on the findings.
I created my own dataset using the Google YouTube Data API version 3.0. The specific steps involved in data creation are detailed in section 2, "Creating the Dataset," below.
The dataset employed in this research is suitable for analysis and reflects real-world data. However, the YouTube channels were selected based solely on my knowledge of and participation in the Pathfinder 2e community, so the selection may not be entirely representative.
According to the YouTube API's guidelines, usage of the API is free of charge, as long as the application's requests adhere to the designated quota limit. "The YouTube Data API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others." The default quota allocation for each application is 10,000 units per day, with the option to request additional quota by submitting a form to YouTube API Services if the limit is reached.
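As a rough illustration of how far the default quota goes for this kind of project, here is a back-of-envelope sketch. It assumes (as documented for simple `list` endpoints) that each `channels.list`, `playlistItems.list`, `videos.list`, and `commentThreads.list` call costs 1 unit and that pages hold up to 50 items; the per-channel video counts are those of the channels analyzed later in this notebook.

```python
import math

def estimate_quota_units(videos_per_channel):
    """Back-of-envelope quota estimate for this project's request pattern,
    assuming each list call costs 1 unit and pages hold up to 50 items."""
    channel_calls = 1  # one channels.list call covers all channel IDs
    playlist_pages = sum(math.ceil(n / 50) for n in videos_per_channel)
    total_videos = sum(videos_per_channel)
    video_pages = math.ceil(total_videos / 50)  # videos.list in batches of 50
    comment_calls = total_videos                # one commentThreads.list per video
    return channel_calls + playlist_pages + video_pages + comment_calls

# Video counts per channel (from the channel stats gathered below)
units = estimate_quota_units([219, 214, 454, 68, 316, 181])
print(units)  # well under the 10,000-unit daily quota
```

Even with one comment request per video dominating the total, a one-off crawl of this size fits comfortably within a single day's quota.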
Since all data obtained from the YouTube API is public data, accessible to anyone on the internet, privacy concerns are not prominent in this case. Moreover, the data obtained is exclusively for research purposes and not for commercial gain.
# Imports
from googleapiclient.discovery import build # Youtube API
from dateutil import parser
import pandas as pd
from IPython.display import JSON
import isodate
# Data Visualization Packages
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# NLP
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from wordcloud import WordCloud
To start, I created a project in the Google Developers Console, requested an API key, and enabled the YouTube Data API so I could send requests to YouTube API services. Then, using https://syncwith.com/youtube/find-youtube-channel-id, I found the channel IDs for the handful of YouTube channels I wanted to analyze.
Then I created the functions for getting the channel statistics by means of the API.
api_key = '' # API key
channel_ids = ['UCYzLQM31VrbDEH8lNaBBLWQ', # The Rules Lawyer
'UCIkfrKHG0Ps8yeo2LR1Cqjw', # NoNat1s
'UCMo2HMsssNQ-AOM-ZT9yN7A', # deadly d8
'UChtQk-wDX5F15f1dYGO5SQQ', # How It's Played
'UCC562W7VxEyLFr_l5uwUIaA', # Frizz
'UCPV4Vwo_msUIqjB3vUvX_7g', # Team Player Gaming
]

api_service_name = "youtube"
api_version = "v3"
# Get credentials and create an API client
youtube = build(
api_service_name, api_version, developerKey=api_key)

# Functions for getting channel statistics from the API
def get_channel_stats(youtube, channel_ids):
"""
Get channel statistics: title, subscriber count, view count, video count, upload playlist
Params:
youtube: the build object from googleapiclient.discovery
channels_ids: list of channel IDs
Returns:
Dataframe containing the channel statistics for all channels in the provided list: title, subscriber count, view count, video count, upload playlist
"""
all_data = []
request = youtube.channels().list(
part='snippet,contentDetails,statistics',
id=','.join(channel_ids))
response = request.execute()
for i in range(len(response['items'])):
data = dict(channelName = response['items'][i]['snippet']['title'],
subscribers = response['items'][i]['statistics']['subscriberCount'],
views = response['items'][i]['statistics']['viewCount'],
totalVideos = response['items'][i]['statistics']['videoCount'],
playlistId = response['items'][i]['contentDetails']['relatedPlaylists']['uploads'])
all_data.append(data)
return pd.DataFrame(all_data)
def get_video_ids(youtube, playlist_id):
"""
Get list of video IDs of all videos in the given playlist
Params:
youtube: the build object from googleapiclient.discovery
playlist_id: playlist ID of the channel
Returns:
List of video IDs of all videos in the playlist
"""
request = youtube.playlistItems().list(
part='contentDetails',
playlistId = playlist_id,
maxResults = 50)
response = request.execute()
video_ids = []
for i in range(len(response['items'])):
video_ids.append(response['items'][i]['contentDetails']['videoId'])
next_page_token = response.get('nextPageToken')
more_pages = True
while more_pages:
if next_page_token is None:
more_pages = False
else:
request = youtube.playlistItems().list(
part='contentDetails',
playlistId = playlist_id,
maxResults = 50,
pageToken = next_page_token)
response = request.execute()
for i in range(len(response['items'])):
video_ids.append(response['items'][i]['contentDetails']['videoId'])
next_page_token = response.get('nextPageToken')
return video_ids
def get_video_details(youtube, video_ids):
"""
Get video statistics of all videos with given IDs
Params:
youtube: the build object from googleapiclient.discovery
video_ids: list of video IDs
Returns:
Dataframe with statistics of videos, i.e.:
'channelTitle', 'title', 'description', 'tags', 'publishedAt'
'viewCount', 'likeCount', 'favoriteCount', 'commentCount'
'duration', 'definition', 'caption'
"""
all_video_info = []
for i in range(0, len(video_ids), 50):
request = youtube.videos().list(
part="snippet,contentDetails,statistics",
id=','.join(video_ids[i:i+50])
)
response = request.execute()
for video in response['items']:
stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
'statistics': ['viewCount', 'likeCount', 'favouriteCount', 'commentCount'], # note: the API returns 'favoriteCount' (US spelling), so this entry always falls through to None
'contentDetails': ['duration', 'definition', 'caption']
}
video_info = {}
video_info['video_id'] = video['id']
for k in stats_to_keep.keys():
for v in stats_to_keep[k]:
try:
video_info[v] = video[k][v]
except KeyError:
video_info[v] = None
all_video_info.append(video_info)
return pd.DataFrame(all_video_info)
def get_comments_in_videos(youtube, video_ids):
"""
Get top-level comments as text from all videos with given IDs (only the first 10 comments per video, due to the quota limit of the YouTube API)
Params:
youtube: the build object from googleapiclient.discovery
video_ids: list of video IDs
Returns:
Dataframe with video IDs and associated top level comment in text.
"""
all_comments = []
for video_id in video_ids:
try:
request = youtube.commentThreads().list(
part="snippet,replies",
videoId=video_id
)
response = request.execute()
comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items'][0:10]]
comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}
all_comments.append(comments_in_video_info)
except Exception:
# When error occurs - most likely because comments are disabled on a video
print('Could not get comments for video ' + video_id)
return pd.DataFrame(all_comments)

Using `get_channel_stats` to get the channel stats for the YouTube channels in scope.
channel_data = get_channel_stats(youtube, channel_ids)

Now print out the data and take a look at the channel stats overview.
channel_data
| channelName | subscribers | views | totalVideos | playlistId | |
|---|---|---|---|---|---|
| 0 | Frizz | 2180 | 244674 | 219 | UUC562W7VxEyLFr_l5uwUIaA |
| 1 | Team Player Gaming | 808 | 111349 | 214 | UUPV4Vwo_msUIqjB3vUvX_7g |
| 2 | Nonat1s | 49700 | 7854610 | 454 | UUIkfrKHG0Ps8yeo2LR1Cqjw |
| 3 | deadly d8 | 3640 | 141837 | 68 | UUMo2HMsssNQ-AOM-ZT9yN7A |
| 4 | How It's Played | 21400 | 2558816 | 316 | UUhtQk-wDX5F15f1dYGO5SQQ |
| 5 | The Rules Lawyer | 28400 | 3042191 | 181 | UUYzLQM31VrbDEH8lNaBBLWQ |
The count columns in channel_data are strings, so we'll need to convert them to numeric before doing any numeric operations on them.
# Converting the columns describing counts from strings to numeric columns
numeric_cols = ['subscribers', 'views', 'totalVideos']
channel_data[numeric_cols] = channel_data[numeric_cols].apply(pd.to_numeric, errors='coerce') # convert columns that should be numeric to numeric

Let's take a look at the number of subscribers per channel to get a view of how popular the channels are compared with one another.
# Graphing results: Channel Names vs Subscriber count
sns.set(rc={'figure.figsize':(10,8)})
ax = sns.barplot(x='channelName', y='subscribers', data=channel_data.sort_values('subscribers', ascending=False))
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

Looking at the channels by their total number of views.
# Graphing results: Channel Names vs Views
ax = sns.barplot(x='channelName', y='views', data=channel_data.sort_values('views', ascending=False))
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)

From here, we get the video statistics for all the channels. There are a total of 1,451 videos.
# Create a dataframe with video statistics and comments from all channels
video_df = pd.DataFrame()
comments_df = pd.DataFrame()
for c in channel_data['channelName'].unique():
print("Getting video information from channel: " + c)
playlist_id = channel_data.loc[channel_data['channelName']== c, 'playlistId'].iloc[0]
video_ids = get_video_ids(youtube, playlist_id)
# get video data
video_data = get_video_details(youtube, video_ids)
# get comment data
comments_data = get_comments_in_videos(youtube, video_ids)
# append video data together and comment data together (pd.concat, since DataFrame.append is deprecated)
video_df = pd.concat([video_df, video_data], ignore_index=True)
comments_df = pd.concat([comments_df, comments_data], ignore_index=True)
Getting video information from channel: Frizz
Getting video information from channel: Team Player Gaming
Getting video information from channel: Nonat1s
Could not get comments for video -SOX01Kp9H4
Getting video information from channel: deadly d8
Could not get comments for video 2H9nQy_efq0
Getting video information from channel: How It's Played
Getting video information from channel: The Rules Lawyer
Could not get comments for video gP6p_UBxVvc
video_df
| video_id | channelTitle | title | description | tags | publishedAt | viewCount | likeCount | favouriteCount | commentCount | duration | definition | caption | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | isaqzjQaa00 | Frizz | Why Pathfinder 2e fights are EASY to make! | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-10T13:00:34Z | 715 | 53 | None | 13 | PT10M8S | hd | false |
| 1 | 2LVkmEJyM_o | Frizz | How the Pathfinder 2e Remaster could fix the A... | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-08T13:01:01Z | 780 | 69 | None | 20 | PT8M24S | hd | false |
| 2 | Q3_akUXtmMk | Frizz | What makes MELEE and RANGED combat different! | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-03T13:05:51Z | 1078 | 76 | None | 15 | PT5M4S | hd | false |
| 3 | E_D8_lJhj00 | Frizz | Why you should be EXCITED for Pathfinder 2e's ... | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-01T13:00:17Z | 1685 | 111 | None | 23 | PT10M9S | hd | false |
| 4 | CvxEQ6Nd1KY | Frizz | Lets talk about Paizocon (and play Risk of Rain)! | Welcome back folks! Today we're playing some R... | [pathfinder, ttrpg, paizo, dnd, D&D, 2e, secon... | 2023-05-31T01:12:55Z | 243 | 14 | None | 1 | PT3H43M37S | hd | false |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1446 | MB7ut3qL_zI | The Rules Lawyer | Melding Might and Magic: Overview of Pathfinde... | We look at the new Magus class in Pathfinder 2... | [2nd edition, Premiere_Elements_2021, magus, p... | 2021-08-30T10:27:40Z | 20710 | 669 | None | 67 | PT34M11S | hd | false |
| 1447 | PRq0VaN1-XE | The Rules Lawyer | This Ain't D&D: Tactics + Strategies for Pathf... | Here we look at the notorious difficulty of Pa... | [Premiere_Elements_2021, pathfinder 2e, pathfi... | 2021-08-26T19:22:25Z | 32414 | 1311 | None | 182 | PT1H1M11S | hd | false |
| 1448 | fLnmENaQykE | The Rules Lawyer | Martials and clerics are FUN in Pathfinder 2e!... | Playing a martial character or a cleric doesn'... | [pathfinder rpg, pf2e, pf2, 2nd edition, pathf... | 2021-08-19T21:40:47Z | 8760 | 337 | None | 51 | PT54M19S | hd | false |
| 1449 | zBb10JowP30 | The Rules Lawyer | Understand Character Creation in Pathfinder 2e | (This is a reposting with improved audio, but ... | [pf2, pf2e, pathfinder rpg, pathfinder, 2nd ed... | 2021-08-18T20:29:03Z | 31876 | 1059 | None | 57 | PT32M55S | hd | false |
| 1450 | jafxucfk0PY | The Rules Lawyer | Pathfinder 2nd Edition is easy to learn! A Fir... | A video where I try to dispel the idea that Pa... | [pathfinder rpg, pathfinder, 2nd edition, pf2e... | 2021-07-31T07:31:06Z | 13634 | 638 | None | 44 | PT22M37S | hd | false |
1451 rows × 13 columns
comments_df
| video_id | comments | |
|---|---|---|
| 0 | isaqzjQaa00 | [It's the perfect system until them there barb... |
| 1 | 2LVkmEJyM_o | [Chirugeon could get into body-modification vi... |
| 2 | Q3_akUXtmMk | [I will never not like your videos! Concise an... |
| 3 | E_D8_lJhj00 | [Finally I will be able to make an atomic ant ... |
| 4 | CvxEQ6Nd1KY | [I am extreamely consered about powercreep in ... |
| ... | ... | ... |
| 1443 | MB7ut3qL_zI | [Which hybrid study for the Magus are other pe... |
| 1444 | PRq0VaN1-XE | [Is it possible to use reactions after taking ... |
| 1445 | fLnmENaQykE | [A FEW CORRECTIONS: \n8:10 - Sneaking to a pla... |
| 1446 | zBb10JowP30 | [I've been binging your videos. Do you address... |
| 1447 | jafxucfk0PY | [This combat seems far more Engaging. I just s... |
1448 rows × 2 columns
# Write video data to CSV file for future references
video_df.to_csv('video_data_top10_channels.csv')
comments_df.to_csv('comments_data_top10_channels.csv')

# Analysis Ideas
# * Views per video
# * Likes and comments vs views
# * Title Length vs views
# * Video Duration
# * Video Upload Schedules
# * Tags
# * Tags vs views

Checking the dataframe for empty values.
video_df.isnull().any() # any nulls?

video_id False
channelTitle False
title False
description False
tags True
publishedAt False
viewCount False
likeCount False
favouriteCount True
commentCount False
duration False
definition False
caption False
dtype: bool
video_df.publishedAt.sort_values().value_counts()

2023-03-23T13:00:34Z 2
2019-01-25T16:21:14Z 1
2022-07-20T20:40:51Z 1
2022-07-26T14:27:41Z 1
2022-07-25T14:00:30Z 1
..
2021-09-06T14:14:55Z 1
2021-09-06T08:06:29Z 1
2021-09-04T21:29:12Z 1
2021-09-03T15:30:03Z 1
2023-06-14T15:30:07Z 1
Name: publishedAt, Length: 1450, dtype: int64
video_df.dtypes # checking data types

video_id object
channelTitle object
title object
description object
tags object
publishedAt object
viewCount object
likeCount object
favouriteCount object
commentCount object
duration object
definition object
caption object
dtype: object
# Checking the data types shows that count columns such as viewCount and commentCount are strings, not numeric
# convert the appropriate columns to numeric
cols = ['viewCount', 'likeCount', 'favouriteCount', 'commentCount']
video_df[cols] = video_df[cols].apply(pd.to_numeric, errors='coerce')

Now I want to enrich the data for further analysis, for example:
- To create published date column with another column showing the day in the week the video was published, which will be useful for later analysis.
- To convert video duration to seconds instead of the current default string format
- To calculate number of tags for each video
- To calculate comments and likes per 1000 view ratio
- To calculate title character length
# Create Publish Day of the Week column
video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x))
video_df['pushblishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A"))

# convert duration to seconds
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())

video_df[['durationSecs', 'duration']]
| durationSecs | duration | |
|---|---|---|
| 0 | 608.0 | PT10M8S |
| 1 | 504.0 | PT8M24S |
| 2 | 304.0 | PT5M4S |
| 3 | 609.0 | PT10M9S |
| 4 | 13417.0 | PT3H43M37S |
| ... | ... | ... |
| 1446 | 2051.0 | PT34M11S |
| 1447 | 3671.0 | PT1H1M11S |
| 1448 | 3259.0 | PT54M19S |
| 1449 | 1975.0 | PT32M55S |
| 1450 | 1357.0 | PT22M37S |
1451 rows × 2 columns
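As a sanity check on the conversion above, the ISO-8601 `PT…` strings decompose into hours, minutes, and seconds. This is a minimal stdlib-only sketch of the parsing for the simple cases in this dataset; `isodate.parse_duration` handles the full spec (days, weeks, fractional seconds, etc.):

```python
import re

def pt_to_seconds(s):
    """Convert a simple ISO-8601 duration like 'PT1H1M11S' to seconds.
    Minimal sketch covering only the H/M/S designators seen in this dataset."""
    m = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', s)
    h, mi, sec = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mi * 60 + sec

print(pt_to_seconds('PT10M8S'))     # 608, matching the first row above
print(pt_to_seconds('PT3H43M37S'))  # 13417, the long stream recording
```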
# Add tag count
video_df['tagCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))

# comments and likes per 1000 views ratio
video_df['likeRatio'] = video_df['likeCount']/ video_df['viewCount'] * 1000
video_df['commentRatio'] = video_df['commentCount'] / video_df['viewCount'] * 1000

# Title character length
video_df['titleLength'] = video_df['title'].apply(lambda x: len(x))

# looking at the dataset to be sure everything went okay
video_df.head()
| video_id | channelTitle | title | description | tags | publishedAt | viewCount | likeCount | favouriteCount | commentCount | duration | definition | caption | pushblishDayName | durationSecs | tagCount | likeRatio | commentRatio | titleLength | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | isaqzjQaa00 | Frizz | Why Pathfinder 2e fights are EASY to make! | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-10 13:00:34+00:00 | 715.0 | 53.0 | NaN | 13.0 | PT10M8S | hd | false | Saturday | 608.0 | 34 | 74.125874 | 18.181818 | 42 |
| 1 | 2LVkmEJyM_o | Frizz | How the Pathfinder 2e Remaster could fix the A... | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-08 13:01:01+00:00 | 780.0 | 69.0 | NaN | 20.0 | PT8M24S | hd | false | Thursday | 504.0 | 28 | 88.461538 | 25.641026 | 55 |
| 2 | Q3_akUXtmMk | Frizz | What makes MELEE and RANGED combat different! | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-03 13:05:51+00:00 | 1078.0 | 76.0 | NaN | 15.0 | PT5M4S | hd | false | Saturday | 304.0 | 34 | 70.500928 | 13.914657 | 45 |
| 3 | E_D8_lJhj00 | Frizz | Why you should be EXCITED for Pathfinder 2e's ... | Welcome back folks! Today we're talking about ... | [Pathfinder, Pathfinder 2e, TTRPG, equipment, ... | 2023-06-01 13:00:17+00:00 | 1685.0 | 111.0 | NaN | 23.0 | PT10M9S | hd | false | Thursday | 609.0 | 34 | 65.875371 | 13.649852 | 59 |
| 4 | CvxEQ6Nd1KY | Frizz | Lets talk about Paizocon (and play Risk of Rain)! | Welcome back folks! Today we're playing some R... | [pathfinder, ttrpg, paizo, dnd, D&D, 2e, secon... | 2023-05-31 01:12:55+00:00 | 243.0 | 14.0 | NaN | 1.0 | PT3H43M37S | hd | false | Wednesday | 13417.0 | 10 | 57.613169 | 4.115226 | 49 |
With the video statistics for all channels, we can now see how views are distributed per channel. Some channels might concentrate most of their views in a single video while the rest receive few views; other channels might have views distributed more evenly across their videos.
plt.rcParams['figure.figsize'] = (18, 6)
sns.violinplot(x=video_df['channelTitle'], y=video_df['viewCount'], data=video_df, palette = 'pastel')
plt.title('Views per channel', fontsize = 14)
plt.show()
These are the videos in the collection with the highest number of views.
ax = sns.barplot(x = 'title', y = 'viewCount', data = video_df.sort_values('viewCount', ascending=False)[0:9])
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation=90) # thank you Google
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))

Conversely, these are the videos in the collection with the fewest views.
ax = sns.barplot(x = 'title', y = 'viewCount', data = video_df.sort_values('viewCount', ascending=True)[0:9])
plot = ax.set_xticklabels(ax.get_xticklabels(), rotation=45) # thank you Google
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x)))

First, I would like to check whether comments and likes correlate with how many views a video gets. In the plots below, it can be observed that view counts and comment/like counts are strongly correlated, with likes showing a stronger correlation than comments. However, this is expected: the more people watch a video, the more likely it is to receive comments and likes. To correct for this factor, we will plot these relationships again using comments per 1,000 views and likes per 1,000 views.
fig, ax =plt.subplots(1,2)
sns.scatterplot(data = video_df, x = "commentCount", y = "viewCount", ax=ax[0])
sns.scatterplot(data = video_df, x = "likeCount", y = "viewCount", ax=ax[1])
Now we will take a look at the correlation if we look at the comment ratio and like ratio instead of the absolute number.
fig, ax =plt.subplots(1,2)
sns.scatterplot(data = video_df, x = "commentRatio", y = "viewCount", ax=ax[0])
sns.scatterplot(data = video_df, x = "likeRatio", y = "viewCount", ax=ax[1])
Upon adjusting for the total number of views, the previously observed correlation becomes less evident. The association between comments and views seems to disappear almost entirely: many videos with high view counts receive relatively few comments, while some videos with few views show higher levels of interaction. It is worth noting that leaving a comment requires more effort than a view or a like, and comments typically decline as videos age.
Regarding the relationship between likes and views, there still appears to be a slight positive correlation, albeit subtle. This suggests that as the number of views increases, more individuals are inclined to click the like button. This finding aligns with the concept of social proof, which suggests that people are more likely to favor products or content that have already received approval from a large number of others.
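The scale effect described above can be demonstrated on synthetic data: if every video earns likes at a roughly constant per-1,000-view rate that is independent of its popularity, raw like counts still track views almost perfectly, while the like ratio shows no correlation at all. A small illustrative sketch (synthetic numbers, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
views = rng.integers(1_000, 1_000_000, size=500).astype(float)
rate = rng.normal(50, 10, size=500)   # likes per 1,000 views, independent of popularity
likes = views * rate / 1000

r_raw = np.corrcoef(views, likes)[0, 1]                    # high: pure scale effect
r_ratio = np.corrcoef(views, likes / views * 1000)[0, 1]   # near zero: no real signal
print(round(r_raw, 2), round(r_ratio, 2))
```

This is why the per-1,000-view ratios are the more honest lens: any correlation that survives normalization reflects engagement, not just audience size.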
Glancing at the histogram below, we can observe that most videos fall within the 300-to-1,200-second range, roughly 5 to 20 minutes in duration. To keep excessively long videos, such as stream recordings, from dominating the axis, the plot is limited to durations under 10,000 seconds.
sns.histplot(data=video_df[video_df['durationSecs'] < 10000], x="durationSecs", bins=50)
Now we plot duration against comment count and like count. Shorter videos actually tend to get more likes and comments than very long videos.
fig, ax =plt.subplots(1,2)
sns.scatterplot(data = video_df, x = "durationSecs", y = "commentCount", ax=ax[0])
sns.scatterplot(data = video_df, x = "durationSecs", y = "likeCount", ax=ax[1])
There is no clear relationship between title length and views, as seen in the scatterplot below.
sns.scatterplot(data = video_df, x = "titleLength", y = "viewCount")
I'm curious to explore the content that creators are focusing on in their videos and discover the key terms that appear most frequently in their video titles. To achieve this, I'll generate a wordcloud showcasing the most common words used. However, before diving into the analysis, it's important to eliminate stopwords like "you," "I," "the," and others that don't significantly contribute to the overall meaning of the title.
from nltk.corpus import stopwords  # requires a one-off nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
video_df['title_no_stopwords'] = video_df['title'].apply(
    lambda x: [item for item in str(x).split() if item not in stop_words]
)
# Flatten the per-title word lists into one string for the word cloud.
all_words = [a for b in video_df['title_no_stopwords'].tolist() for a in b]
all_words_str = ' '.join(all_words)
from wordcloud import WordCloud

def plot_cloud(wordcloud):
    plt.figure(figsize=(30, 20))
    plt.imshow(wordcloud)
    plt.axis("off")

wordcloud = WordCloud(width=2000, height=1000, random_state=1, background_color='black',
                      colormap='viridis', collocations=False).generate(all_words_str)
plot_cloud(wordcloud)
We can see that the most common words are Pathfinder, 2e, Rule, Reminder and Teamwork.
It seems that most videos have between 5 and 40 tags. The relationship between the number of tags and view count is not clear-cut, but having too few or too many tags does seem to correlate with fewer views.
sns.scatterplot(data=video_df, x="tagCount", y="viewCount")
It's interesting to see that more videos are uploaded on Mondays and Thursdays, while fewer videos are uploaded on Sundays.
day_df = pd.DataFrame(video_df['pushblishDayName'].value_counts())
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_df = day_df.reindex(weekdays)
ax = day_df.reset_index().plot.bar(x='index', y='pushblishDayName', rot=0)
We can also look at the most frequently mentioned words in the video comments, again by visualizing the keywords in a word cloud.
stop_words = set(stopwords.words('english'))
comments_df['comments_no_stopwords'] = comments_df['comments'].apply(
    lambda x: [item for item in str(x).split() if item not in stop_words]
)
all_words = [a for b in comments_df['comments_no_stopwords'].tolist() for a in b]
all_words_str = ' '.join(all_words)
wordcloud = WordCloud(width=2000, height=1000, random_state=1, background_color='black',
                      colormap='viridis', collocations=False).generate(all_words_str)
plot_cloud(wordcloud)
In this project, we have explored the video data of several Pathfinder channels and revealed many interesting findings for anyone who may be starting out with a YouTube channel:
- The more likes and comments a video has, the more views it gets (this is not necessarily a causal relationship; it is simply a correlation and could work both ways). Likes seem to be a better indicator of interaction than comments, and the number of likes appears to follow the principle of "social proof": the more views a video has, the more people will like it.
- Most videos have between 5 and 40 tags.
- Most-viewed videos tend to have titles of 30-70 characters. Titles that are too short or too long seem to harm viewership.
- Videos are usually uploaded on Mondays and Thursdays. Weekends, and Sundays in particular, are not a popular time for posting new videos.
- Comments on videos are generally positive.
The findings should also be taken with a grain of salt for a number of reasons:
- The number of videos is quite small (the dataset has only ~3,700 videos)
- I have only considered the first 10 comments on each video, which might not be representative for all comments
- There are many other factors that haven't been taken into the analysis, including the marketing strategy of the creators and many random effects that would affect how successful a video is
To expand and build on this research project, one can:
- Expand the dataset to include smaller channels in the Pathfinder niche
- Do sentiment analysis on the comments and find out which videos get more positive comments and which videos get less positive comments
- Do market research by analyzing questions in the comment threads and identifying common questions/market gaps that could potentially be filled
- Conduct this research for other niches (e.g. vlogs or beauty channels), to compare different niches with each other to see the different patterns in viewership and video characteristics.
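As a starting point for the sentiment-analysis idea above, here is a deliberately crude lexicon-based sketch. The word lists are tiny, made-up examples and the function name is my own; a real analysis would use a dedicated tool such as NLTK's VADER or TextBlob.

```python
import re

# Toy sentiment lexicons; a real analysis would use a proper tool
# (e.g. NLTK's VADER) rather than hand-picked word lists.
POSITIVE = {"great", "love", "awesome", "helpful", "thanks"}
NEGATIVE = {"bad", "boring", "confusing", "hate", "wrong"}

def comment_sentiment(comment: str) -> int:
    """Return a crude score: positive-word count minus negative-word count."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(comment_sentiment("Great video, thanks!"))  # 2
```

Scoring every row of `comments_df` this way would let us rank videos by how positive their comment sections are.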
References/ Resources used:
[1] YouTube API. Available at https://developers.google.com/youtube/v3
[2] Converting video durations to time function. https://stackoverflow.com/questions/15596753/how-do-i-get-video-durations-with-youtube-api-version-3
[3] P. Covington, J. Adams, E. Sargin. The youtube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '16, pages 191-198, New York, NY, USA, 2016. ACM.
Inspired by https://youtu.be/D56_Cx36oGY