# Analyzing Fitness Channels: An Exploratory Data Analysis of YouTube's Workout Videos

## 1. Background
### 1.1 Introduction
### 1.2 Objectives
In this project, I explore the following questions:

- Which channels have the highest engagement (views, likes, and comments) per video on average?
- Is there any correlation between video duration and engagement metrics (views, likes, comments)?
- Do certain types of videos (e.g., cardio, strength training) tend to receive more engagement?
- What is the distribution of video duration across the different duration categories for each channel?
- Is there any seasonal pattern in video uploads for each channel?
- What are the most common video tags or topics for each channel?
- Does the length or content of a video's title affect its engagement or viewership?
- Which channels have the highest percentage of videos with specific tags (e.g., yoga, HIIT, dance)?
- Are there any specific topics or video tags that consistently perform well across all channels?
- Are there any significant differences in engagement metrics between channels that focus on different types of workouts?
- Do videos with certain characteristics (e.g., no equipment, low impact) attract more views or engagement?
- Is there any correlation between the number of tags and the engagement of videos?
- Which channels have the highest proportion of specific workout categories (e.g., upper body, legs, pilates)?
- Do videos with certain workout categories tend to receive more comments or likes compared to others?

## 2. Data

The analysis covers five YouTube fitness channels: MadFit, blogilates, emi wong, Rebecca-Louise, and Chloe Ting.

I used the YouTube Data API v3 to retrieve the channel and video information.
To do so, I carried out the following steps in order:
- Created a project on the Google Developers Console
- Requested an authorization credential (API key)
- Enabled the YouTube Data API for my project
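
Once the steps above are complete, the API key can be passed to the official Python client. The following is a minimal sketch, not the code used in this project: it assumes the `google-api-python-client` package is installed, and `parse_channel_items`, `get_channel_stats`, and `sample_response` are hypothetical names introduced for illustration. The output column names are chosen to resemble the ones used later in this notebook.

```python
def parse_channel_items(response):
    """Flatten a channels.list response into one dict per channel."""
    rows = []
    for item in response.get("items", []):
        stats = item["statistics"]
        rows.append({
            "channel": item["snippet"]["title"],
            # The API returns counts as strings, so cast them to int.
            "total_subscribers": int(stats.get("subscriberCount", 0)),
            "total_views": int(stats.get("viewCount", 0)),
            "total_videos": int(stats.get("videoCount", 0)),
            "uploads_playlist_id": item["contentDetails"]["relatedPlaylists"]["uploads"],
        })
    return rows


def get_channel_stats(api_key, channel_ids):
    """Request basic channel information for a list of channel IDs."""
    # Imported here so the parsing helper works without the client installed.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=",".join(channel_ids),
    )
    return parse_channel_items(request.execute())


# Offline demonstration with a hand-written response (values are made up).
sample_response = {
    "items": [{
        "snippet": {"title": "Example Channel"},
        "statistics": {"subscriberCount": "100", "viewCount": "2500", "videoCount": "10"},
        "contentDetails": {"relatedPlaylists": {"uploads": "UU_example"}},
    }]
}
print(parse_channel_items(sample_response))
```

Calling `get_channel_stats` requires network access and a valid key. Keeping the parsing step in its own function makes it possible to test against a saved response without spending API quota.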

The functions I used to retrieve the YouTube data come from thu-vu92's GitHub project 'youtube-api-analysis', and I give her full credit for that code. This project was inspired by the YouTube API tutorial video by Thu Vu data analytics, titled 'Youtube API for Python: How to Create a Unique Data Portfolio Project'.

## 3. Analysis and Key Findings

In [2]:
# standard imports
import os
import pandas as pd
import numpy as np
from datetime import datetime
from os.path import dirname, abspath
from dateutil import parser
import scipy.stats as stats

# for visualization
import datashader as ds, bokeh
import holoviews as hv
from holoviews import opts
import hvplot.pandas     # adds hvplot method to pandas objects
import colorcet as cc
from colorcet.plotting import swatch, swatches, candy_buttons

import panel as pn
import panel.widgets as pnw

hv.extension('bokeh','matplotlib', 'plotly')

In [3]:
# Read in the csv files from the processed data folder
project_dir = dirname(dirname(abspath("03-data-exploration.ipynb")))
videos_df = pd.read_csv(project_dir + "/data/processed/fitness_videos_processed_2023_07_11.csv")
channels_df = pd.read_csv(project_dir + "/data/processed/fitness_channels_processed_2023_07_11.csv")

In [4]:
# Get the names of columns that are of boolean data type
boolean_columns = videos_df.select_dtypes(include='bool').columns.tolist()

# Convert boolean columns to numeric (0 or 1)
videos_df[boolean_columns] = videos_df[boolean_columns].astype(int)

# Get only the date from the datetime column
channels_df['published_datetime'] = pd.to_datetime(channels_df['published_date'].apply(lambda x: parser.parse(x)))
channels_df['published_date'] = channels_df['published_datetime'].dt.date

#### Which channels have the highest average views, likes, and comments per video?

Some channels have been around longer than others, so comparing total views, likes, and comments would not be a fair comparison. Instead, I compare the channels on their average views, likes, and comments per video. Measuring engagement on a per-video basis allows channels with fewer videos, or channels that have been active for less time, to be compared fairly with the others.
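
To illustrate why the per-video average matters, here is a small sketch on made-up data (the channels `A` and `B` and their view counts are hypothetical). Channel `A` has more videos and more total views, yet channel `B` performs better per video.

```python
import pandas as pd

# Made-up data: channel A has four videos, channel B has two.
toy_videos = pd.DataFrame({
    "channel": ["A", "A", "A", "A", "B", "B"],
    "total_views": [100, 200, 300, 400, 450, 500],
})

summary = (toy_videos
           .groupby("channel")
           .agg(total_videos=("total_views", "size"),
                sum_views=("total_views", "sum"),
                average_views=("total_views", "mean"))
           )

# A leads on summed views, B leads on average views per video.
print(summary)
```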

In [5]:
# Per-channel video count and mean engagement per video
channels_summary = (videos_df
                    .groupby('channel')
                    .agg(total_videos = ('video_id', 'size'),
                         average_views = ('total_views', 'mean'),
                         average_likes = ('total_likes', 'mean'),
                         average_comments = ('total_comments', 'mean')
                         )
                    .reset_index()
                    )

channels_summary = (channels_df[['channel', 'published_date','total_subscribers']]
                    .merge(channels_summary, on='channel', how='left')
                    .sort_values(by='published_date')
                    .set_index('channel')
                    .style
                    .format(precision=1)
                    .background_gradient(cmap='Blues')
                    )

channels_summary

| channel | published_date | total_subscribers | total_videos | average_views | average_likes | average_comments |
|---|---|---|---|---|---|---|
| blogilates | 2009-06-13 | 8730000 | 1184 | 2415288.0 | 112896.1 | 1702.8 |
| Chloe Ting | 2011-08-17 | 24700000 | 410 | 7297930.2 | 144762.5 | 4225.5 |
| Rebecca-Louise | 2012-09-22 | 720000 | 1269 | 92988.1 | 1251.9 | 86.4 |
| emi wong | 2014-11-02 | 6130000 | 506 | 1634240.1 | 33017.5 | 782.8 |
| MadFit | 2018-03-02 | 8050000 | 683 | 1367484.0 | 32873.8 | 655.7 |


We can see that Chloe Ting has the highest performance in all metrics despite having the lowest number of videos. On the other hand, Rebecca-Louise has the lowest performance in all metrics despite having the highest number of videos on her channel.

#### Is there any correlation between video duration and engagement metrics (views, likes, comments)?

In [6]:
# Select the relevant columns for the correlation test
columns_for_correlation = ['duration_mins', 'total_views', 'total_likes', 'total_comments']
correlation_data = videos_df[columns_for_correlation]

# Calculate the Pearson correlation coefficient and p-values for each pair of variables
correlation_results = correlation_data.corr(method='pearson')
p_values = correlation_data.corr(method=lambda x, y: stats.pearsonr(x, y)[1])

# Display the correlation matrix and p-values
print("Correlation Matrix:")
print(correlation_results)

print("\nP-Values:")
print(p_values)

Correlation Matrix:
                duration_mins  total_views  total_likes  total_comments
duration_mins        1.000000    -0.008449    -0.099690        0.004963
total_views         -0.008449     1.000000     0.858080        0.950088
total_likes         -0.099690     0.858080     1.000000        0.799916
total_comments       0.004963     0.950088     0.799916        1.000000

P-Values:
                duration_mins  total_views   total_likes  total_comments
duration_mins    1.000000e+00     0.590809  2.022046e-10        0.752122
total_views      5.908088e-01     1.000000  0.000000e+00        0.000000
total_likes      2.022046e-10     0.000000  1.000000e+00        0.000000
total_comments   7.521217e-01     0.000000  0.000000e+00        1.000000
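
Reading the output above: video duration is essentially uncorrelated with views (r ≈ -0.008, p ≈ 0.59) and with comments (r ≈ 0.005, p ≈ 0.75), and only weakly negatively correlated with likes (r ≈ -0.10, p ≈ 2e-10). Views, likes, and comments are strongly correlated with each other.

Pearson's r measures linear association and can be dominated by a few very popular videos, so a rank-based check may be a useful complement. The sketch below is not part of the original analysis. It runs on made-up data and uses Spearman correlation through `DataFrame.corr(method='spearman')` to show how a single outlier lowers Pearson's r while leaving the rank correlation unchanged.

```python
import pandas as pd

# Made-up data: views rise steadily with duration, but the last video is an extreme outlier.
toy = pd.DataFrame({
    "duration_mins": [5, 10, 15, 20, 25, 30],
    "total_views": [1_000, 2_000, 3_000, 4_000, 5_000, 5_000_000],
})

pearson_r = toy.corr(method="pearson").loc["duration_mins", "total_views"]
spearman_r = toy.corr(method="spearman").loc["duration_mins", "total_views"]

# The relationship is perfectly monotonic, so the rank correlation is 1,
# while the outlier pulls the linear correlation well below 1.
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```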


#### Is there any seasonal pattern in video uploads for each channel?

In [6]:
# Convert 'published_date' column to datetime format
videos_df['published_date'] = pd.to_datetime(videos_df['published_date'])


# Group videos by 'channel' and month of 'published_date', and count the number of videos in each group
video_upload_freq = (videos_df
                     .groupby(['channel', pd.Grouper(key='published_date', freq='M')])
                     .size()
                     .reset_index(name='upload_count')
                     .query("published_date > '2018-01-01'")
                     )

# Create the interactive line plot using hvplot
upload_freq = video_upload_freq.hvplot.line(x='published_date', 
                                          y='upload_count', 
                                          by='channel',
                                          xlabel='Published Date', 
                                          ylabel='Number of Videos Uploaded',
                                          title='Video Upload Frequency for each Channel over Past 5 Years',
                                          width=950, 
                                          height=500, 
                                          color=hv.Cycle('Category10'),
                                          alpha = 0.7,
                                          line_dash = 'dotdash',
                                          legend = 'top_right',
                                          line_width = 2.5
                                          )

# Display the line plot
upload_freq

Interestingly, there was no unusually high production of videos during the COVID-19 pandemic. More people may have been introduced to home workouts or become more inclined to do them, but the channels themselves did not noticeably increase the number of workout videos they uploaded compared to before the pandemic.
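
One way to put a number on the observation above is to compare average monthly uploads before and during the pandemic. This is a minimal sketch on made-up data, assuming a frame shaped like `video_upload_freq`. The cut-off date of March 2020 and the channel `A` are assumptions made for illustration.

```python
import pandas as pd

# Made-up monthly upload counts for one hypothetical channel.
monthly = pd.DataFrame({
    "published_date": pd.to_datetime(
        ["2019-11-30", "2019-12-31", "2020-01-31", "2020-02-29",
         "2020-03-31", "2020-04-30", "2020-05-31", "2020-06-30"]),
    "channel": ["A"] * 8,
    "upload_count": [8, 10, 9, 9, 10, 9, 8, 9],
})

# Assumption: treat March 2020 onward as the pandemic period.
pandemic_start = pd.Timestamp("2020-03-01")
monthly["period"] = (monthly["published_date"]
                     .ge(pandemic_start)
                     .map({False: "before", True: "during"}))

# Mean uploads per month for each channel and period, one column per period.
avg_uploads = (monthly
               .groupby(["channel", "period"])["upload_count"]
               .mean()
               .unstack("period"))
print(avg_uploads)
```

Similar values in the `before` and `during` columns would support the reading that upload frequency did not change much.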

#### What is the distribution of video duration across the different duration categories for each channel?

In [7]:
# Convert 'duration_category' to a categorical data type with the desired order
desired_order = ['0-5','5-10', '10-20', '20-30', '30-45', '45+']
videos_df['duration_category'] = pd.Categorical(videos_df['duration_category'], categories=desired_order, ordered=True)

# Group videos by 'duration_category' and 'channel', and count the number of videos in each group
video_counts_by_channel_duration = (videos_df
                                    .groupby(['duration_category', 'channel'])
                                    .size()
                                    .reset_index(name='video_count')
                                    )

# Calculate the total number of videos for each channel
total_videos_by_channel = (videos_df
                           .groupby('channel')
                           .size()
                           .reset_index(name='total_videos')
                           )

# Merge the two DataFrames to calculate the percentage of videos in each duration category for each channel
video_counts_by_channel_duration = video_counts_by_channel_duration.merge(total_videos_by_channel, on='channel', how='left')
video_counts_by_channel_duration['percentage'] = (video_counts_by_channel_duration['video_count'] / video_counts_by_channel_duration['total_videos']) * 100


# Create the interactive bar chart using hvplot
bar_chart = video_counts_by_channel_duration.hvplot.barh(x='channel',
                                                        y='percentage', 
                                                        by='duration_category',
                                                        xlabel='Channel', 
                                                        ylabel='Percentage of Videos (%)',
                                                        title='Percentage of Videos in Each Duration Category for Each Channel',
                                                        width=900, 
                                                        height=400, 
                                                        legend='top', 
                                                        color=hv.Cycle('Pastel1'),
                                                        stacked = True,
                                                        alpha = 0.9
                                                        )

# Display the bar chart
bar_chart

The 10-20 minute category holds the highest percentage of videos for every channel except blogilates, where it ranks second. For MadFit, the second-highest category is 20-30 minutes.
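
Rankings like the one above are easier to verify from a table than from a stacked bar chart. The sketch below uses made-up percentages in the same shape as `video_counts_by_channel_duration` and extracts the two largest duration categories per channel. The channels `A` and `B` and the name `top_two` are hypothetical.

```python
import pandas as pd

# Made-up percentages for two hypothetical channels.
toy_pct = pd.DataFrame({
    "channel": ["A", "A", "A", "B", "B", "B"],
    "duration_category": ["5-10", "10-20", "20-30", "5-10", "10-20", "20-30"],
    "percentage": [20.0, 50.0, 30.0, 45.0, 40.0, 15.0],
})

top_two = (toy_pct
           # Largest percentage first within each channel
           .sort_values(["channel", "percentage"], ascending=[True, False])
           .groupby("channel")
           .head(2)
           # Number the kept rows 1 and 2 within each channel
           .assign(rank=lambda d: d.groupby("channel").cumcount() + 1)
           .pivot(index="channel", columns="rank", values="duration_category"))
print(top_two)
```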

#### Which channels have the highest proportion of specific workout categories (e.g., upper body, legs, pilates)?

In [8]:
body_parts = ['FULL_BODY','UPPER_BODY','LOWER_BODY','CHEST','BACK','ABS','ARMS','LEGS','GLUTES']

# Group by 'channel' and sum the 0/1 flags to count the videos in each body-part category
grouped_by_bodypart = videos_df.groupby('channel')[body_parts].agg('sum').reset_index()

grouped_by_bodypart = grouped_by_bodypart.melt(id_vars='channel', var_name='body_category', value_name='video_count')
grouped_by_bodypart['body_category'] = pd.Categorical(grouped_by_bodypart['body_category'])
grouped_by_bodypart['channel'] = pd.Categorical(grouped_by_bodypart['channel'])

body_part_chart = grouped_by_bodypart.hvplot.bar(x = 'body_category',
                                                  y = 'video_count',
                                                  by = 'channel',
                                                  stacked = True,
                                                  color = hv.Cycle('Category10'),
                                                  alpha = 0.7,
                                                  width = 900,
                                                  height = 450,
                                                  legend = 'top_left'
                                                  )

body_part_chart
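
The chart above shows raw video counts, which favor channels with more uploads. To compare proportions instead, the 0/1 flags can be averaged per channel. This is a minimal sketch on made-up data: the channels `A` and `B` and the name `share_by_channel` are hypothetical, and only two of the body-part columns are shown.

```python
import pandas as pd

# Made-up 0/1 flags mirroring two of the body-part columns.
toy_flags = pd.DataFrame({
    "channel": ["A", "A", "A", "A", "B", "B"],
    "ABS":  [1, 1, 0, 0, 1, 1],
    "LEGS": [0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of videos carrying that flag.
share_by_channel = toy_flags.groupby("channel")[["ABS", "LEGS"]].mean() * 100
print(share_by_channel)
```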

## References

[1] Google Developers. (2023). YouTube Data API v3. Retrieved from https://developers.google.com/youtube/v3  
[2] Thu Vu data analytics. (2022, January 22). Youtube API for Python: How to Create a Unique Data Portfolio Project [Video file]. Retrieved from https://www.youtube.com/watch?v=D56_Cx36oGY