# Data Exploration

I plan to divide the analysis into two parts to get both a broad and an in-depth understanding of this landscape. I am keeping in mind, though, that these 5 channels alone do not represent the entire home fitness industry.

Part 1: focus on analyzing the overall trends and patterns present in the workout videos.
* What are the most and least preferred workout lengths?
* Which workout types or body parts are the most and least popular? 
* Is there a significant difference in engagement (e.g., likes, comments, view counts) between workout videos with different video lengths?
* Is there a correlation between the number of video tags and video views for workout videos?
* Has COVID-19 resulted in a significant increase in viewership? If so, try to quantify by how much.


Part 2: focus on analyzing specific channels within the workout video category. 
* Is there a significant difference in engagement between workout videos uploaded by different channels?

In [1]:
# standard imports
import os
import pandas as pd
import numpy as np

# for visualizations using bokeh
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
output_notebook()

#### Read the channels and videos files into dataframes

In [2]:
# Get the parent directory of the current working directory
cwd = os.path.dirname(os.getcwd())
# Create filepath to the processed subfolder in data folder
path = os.path.join(cwd, "data", "processed")

# Read in the csv files from the processed data folder
channels_df = pd.read_csv(path + "/fitness_channels_processed_2023_06_28.csv")
videos_df = pd.read_csv(path + "/fitness_videos_processed_2023_06_28.csv")

In [3]:
videos_df.head()

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,commentCount,duration,...,tagsCount,youtubeShorts,workoutTime,workoutType,bodyPart,standingWorkout,noEquipment,noJumping,lowImpact,strengthTraining
0,e7zzES8PeG4,Chloe Ting,Shocking Before After Transformation Results! ...,Check out these amazing before and after trans...,"['Abs', 'Abs results', 'Abs workout results', ...",2023-06-28T14:00:23Z,36033,2677.0,163.0,PT9M22S,...,25,False,,,,,,,,
1,5GLA8MrlDnM,Chloe Ting,A day in my life living in Korea,Short vlog from a day out and about while in S...,"['dayinmylife', 'korea', 'seoul', 'vlog', 'chl...",2023-06-05T14:51:22Z,317017,9618.0,708.0,PT12M37S,...,13,False,,,,,,,,
2,ljNgkSctkXg,Chloe Ting,INTENSE Full Body Workout - 30 Min No Equipment,This is a 30 min full body intense workout fro...,"['workout', 'home workout', 'full body workout...",2023-05-17T14:00:27Z,620735,17223.0,863.0,PT31M14S,...,27,False,30 Min,,FULL_BODY,,NO_EQUIPMENT,,,
3,0rL2496zybs,Chloe Ting,10 Min Core & Upper Body | No Equip Home Workout!,This is episode 4 of the 2023 Summer Shred Cha...,"['core', 'abs', 'upper body', 'upper body work...",2023-05-15T14:00:01Z,349717,10354.0,313.0,PT10M52S,...,27,False,10 Min,,ABS,,NO_EQUIPMENT,,,
4,PEX2uefaUAY,Chloe Ting,Perky Booty & Leg Workout | 20 min Glute Workout,This is episode 3 of the 2023 Summer Shred Cha...,"['tiny waist', 'waist', 'booty', 'booty workou...",2023-05-09T14:00:15Z,684910,18468.0,531.0,PT20M39S,...,29,False,20 min,,GLUTES,,,,,


Filter out rows that have empty values in all of the columns that describe the workout. A row with no workout attributes is most likely a vlog or a reaction to people's transformation results, not a workout video.

In [4]:
# Specify the columns to check for missing values
columns_to_check = ['workoutTime', 'workoutType',
                    'bodyPart', 'standingWorkout',
                    'noEquipment', 'noJumping',
                    'lowImpact', 'strengthTraining']

# Filter out rows with missing values in all the specified columns
workouts_df = videos_df.dropna(subset=columns_to_check, how='all')

In [5]:
print("The number of rows dropped from the videos dataframe is:",
      videos_df.shape[0] - workouts_df.shape[0])

The number of rows dropped from the videos dataframe is: 852


!!! Note to self: review these dropped videos to confirm we didn't drop any workouts.
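
One way to carry out this check is to rebuild the set of dropped rows and search their titles for workout-related words. The sketch below is only illustrative: it uses a tiny stand-in for `videos_df` with two of the eight workout columns, the title "Quick Abs Workout" is made up for the demo, and the keyword list is an assumption that would need tuning against the real titles.

```python
import pandas as pd

# Toy stand-in for videos_df (assumption: only two of the workout columns shown)
videos_df = pd.DataFrame({
    "title": ["A day in my life living in Korea",
              "Quick Abs Workout",  # made-up title for illustration
              "10 Min Core & Upper Body | No Equip Home Workout!"],
    "workoutTime": [None, None, "10 Min"],
    "bodyPart": [None, None, "ABS"],
})
columns_to_check = ["workoutTime", "bodyPart"]

# Rows that dropna(how='all') removes: every workout column is missing
dropped_df = videos_df[videos_df[columns_to_check].isna().all(axis=1)]

# Flag dropped titles that still mention a workout-related keyword
keywords = r"workout|exercise|routine"  # hypothetical keyword list
is_suspect = dropped_df["title"].str.contains(keywords, case=False, regex=True)
suspects = dropped_df[is_suspect]

print(suspects["title"].tolist())
```

Any title that shows up in `suspects` is a candidate for manual review, since it may be a workout whose attributes were not extracted.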

## Part 1: General Trends and Patterns in Workout Videos

In [6]:
grouped = workouts_df.groupby('workoutTime')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E4E38A25E0>

In [11]:
list(grouped.groups.keys())

['1 HOUR',
 '1 Hour',
 '1 Min',
 '10 MIN',
 '10 MINS',
 '10 MINUTE',
 '10 MINUTES',
 '10 Min',
 '10 Mins',
 '10 Minute',
 '10 Minutes',
 '10 min',
 '10 mins',
 '10 minute',
 '10 minutes',
 '11 MIN',
 '11 Min',
 '11 min',
 '11 minutes',
 '12 MIN',
 '12 Min',
 '12 Minute',
 '12 Minutes',
 '12 min',
 '12 minute',
 '12 minutes',
 '13 min',
 '14 Min',
 '14 Minute',
 '15 MIN',
 '15 MINS',
 '15 MINUTE',
 '15 MINUTES',
 '15 Min',
 '15 Mins',
 '15 Minute',
 '15 Minutes',
 '15 min',
 '15 mins',
 '15 minute',
 '15 minutes',
 '16 min',
 '20 MIN',
 '20 MINUTES',
 '20 Min',
 '20 Mins',
 '20 Minute',
 '20 min',
 '20 mins',
 '20 minute',
 '20 minutes',
 '21 Minute',
 '24 HOURS',
 '24 Hour',
 '24 Hours',
 '25 Min',
 '25 min',
 '25 minute',
 '28 min',
 '3 MIN',
 '3 Min',
 '3 Minute',
 '3 min',
 '3 minute',
 '3 minutes',
 '30 MIN',
 '30 MINS',
 '30 MINUTE',
 '30 MINUTES',
 '30 Min',
 '30 Mins',
 '30 Minute',
 '30 Minutes',
 '30 min',
 '30 mins',
 '30 minute',
 '30 minutes',
 '35 MIN',
 '36 HOURS',
 '4 MI
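
The keys above show the same length written in many ways (`'10 MIN'`, `'10 Mins'`, `'10 minutes'`, ...), so grouping on the raw label splits one workout length across several groups. A minimal sketch of one possible normalization follows, assuming every label has the form "number + unit"; the helper name `workout_minutes` is hypothetical and not part of the original notebook.

```python
import re

# Pattern for labels such as '10 MINS' or '1 Hour' (assumed label format)
LABEL_PATTERN = re.compile(r"\s*(\d+)\s*(mins?|minutes?|hours?)\s*", re.IGNORECASE)

def workout_minutes(label):
    """Convert a workoutTime label to whole minutes; return None if it does not parse."""
    match = LABEL_PATTERN.fullmatch(str(label))
    if match is None:
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    # Hours are converted to minutes; minute variants are kept as-is
    return value * 60 if unit.startswith("hour") else value

# A few of the labels listed above
labels = ["1 HOUR", "10 MIN", "10 Mins", "10 minutes", "15 Minute", "30 mins"]
minutes = [workout_minutes(label) for label in labels]
print(minutes)
```

Applied to the dataframe, this could look like `workouts_df['workoutTime'].map(workout_minutes)`, stored in a new column before grouping. Labels such as `'24 Hours'` would map to 1440 minutes, so they probably deserve a manual look to see whether they describe a workout length at all.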

In [None]:
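# Prepare data: Bokeh figures have no violin glyph method, so summarize the
# view-count distribution of each workout length by its quartiles instead
stats = (workouts_df.groupby('workoutTime')['viewCount']
         .quantile([0.25, 0.5, 0.75])
         .unstack()
         .reset_index())
stats.columns = ['workoutTime', 'q1', 'median', 'q3']
source = ColumnDataSource(stats)

# Create figure with one categorical x position per workout length
p = figure(height=400,
           title="Video Length Comparison (Views)",
           x_range=list(stats['workoutTime']),
           toolbar_location=None,
           tools="")

# Draw the interquartile range as a box and the median as a dash marker
p.vbar(source=source,
       x='workoutTime',
       bottom='q1',
       top='q3',
       width=0.5,
       line_color='black',
       fill_color='skyblue')
p.scatter(source=source, x='workoutTime', y='median',
          marker='dash', size=15, color='black')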

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.yaxis.axis_label = "Views"

show(p)


## Part 2: Comparison of Specific Channels