# Understanding the landscape of YouTube Home Workout Creators

**Since Covid-19 spread globally, many countries have imposed lockdowns and quarantines. Stuck at home, many people turned to YouTube for workout videos to keep active. Home workout content creators rose to the top of the charts during that time. However, the platform is competitive for everyone without exception, whether a creator targets the general public or a more niche area. This sparked my interest in understanding the landscape of YouTube fitness creators. How do they differ from each other? Who are their target audiences? What mindsets do they convey to viewers?**

## PLAN: Outline the scope of this project
This project aims to provide insights into the characteristics and strategies of successful fitness channels and to offer recommendations for marketers and new content creators. It includes:
1. Identify and Compare the Popular Workout YouTubers
2. Analysis of Content and Sentiment
3. Audience Comment Analysis
4. Summary and Recommendation


Notes:
1. "Home workout creators" here means creators who focus on follow-along videos only.

### Import Data And Data Cleaning

In [17]:
 #import library

import pandas as pd
import seaborn as sns
import plotly.express as px
from matplotlib import ticker
from wordcloud import WordCloud
import re

In [7]:
video_df = pd.read_csv('fitness_youtuber_data.csv')

In [8]:
video_df = video_df.drop('Unnamed: 0', axis = 1)

In [9]:
video_df.columns

Index(['video_id', 'channelTitle', 'title', 'description', 'tags',
       'publishedAt', 'viewCount', 'likeCount', 'favouriteCount',
       'dislikeCount', 'commentCount', 'duration', 'definition', 'caption'],
      dtype='object')

In [10]:
video_df.shape

(6451, 14)

In [11]:
video_df.isnull().sum()

video_id             0
channelTitle         0
title                0
description        882
tags              1084
publishedAt          0
viewCount            1
likeCount          201
favouriteCount    6451
dislikeCount      6451
commentCount         4
duration             0
definition           0
caption              0
dtype: int64

In [12]:
video_df = (video_df.drop(['favouriteCount', 'dislikeCount'], axis = 1))
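
The two dropped columns are the ones reporting 6451 nulls out of 6451 rows. As a hedged sketch on a toy frame (the values below are made up for illustration and are not from the dataset), fully empty columns can also be dropped without naming them, using `DataFrame.dropna` with `axis=1` and `how="all"`:

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "viewCount": [10.0, None, 30.0],
    "favouriteCount": [None, None, None],
    "dislikeCount": [None, None, None],
})

# Keep only the columns that contain at least one non-null value.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Partially filled columns such as `viewCount` survive, because `how="all"` only removes columns in which every value is missing.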

In [13]:
video_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6451 entries, 0 to 6450
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   video_id      6451 non-null   object 
 1   channelTitle  6451 non-null   object 
 2   title         6451 non-null   object 
 3   description   5569 non-null   object 
 4   tags          5367 non-null   object 
 5   publishedAt   6451 non-null   object 
 6   viewCount     6450 non-null   float64
 7   likeCount     6250 non-null   float64
 8   commentCount  6447 non-null   float64
 9   duration      6451 non-null   object 
 10  definition    6451 non-null   object 
 11  caption       6451 non-null   bool   
dtypes: bool(1), float64(3), object(8)
memory usage: 560.8+ KB


In [14]:
#find the numeric columns and change the data type
numeric_cols = ['viewCount', 'likeCount', 'commentCount']
#convert column by column; invalid values become NaN
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors = 'coerce')

In [15]:
#parse the publish timestamp and derive the day of the week

video_df['publishedAt'] = pd.to_datetime(video_df['publishedAt'])
video_df['publishedDayName'] = video_df['publishedAt'].dt.day_name()
#drop any timezone information so the column is plain datetime64[ns]
video_df['publishedAt'] = video_df['publishedAt'].dt.tz_localize(None)

In [18]:
def get_seconds(iso_str):
    #convert an ISO 8601 duration such as 'PT1H2M3S' to total seconds
    hours = re.search(r"(\d+)H", iso_str)
    hours = hours.group(1) if hours else 0
    minutes = re.search(r"(\d+)M", iso_str)
    minutes = minutes.group(1) if minutes else 0
    seconds = re.search(r"(\d+)S", iso_str)
    seconds = seconds.group(1) if seconds else 0
    #one hour is 3600 seconds, one minute is 60 seconds
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)

video_df['durationSec'] = video_df['duration'].apply(get_seconds).astype(int)
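
As a sanity check on the duration conversion, here is a minimal stand-alone sketch. It assumes the `duration` column holds ISO 8601 durations in the `PnDTnHnMnS` form; the helper name `iso8601_to_seconds` is hypothetical and not part of the notebook, and week or month designators are deliberately not handled.

```python
import re

# Hypothetical helper, not part of the notebook: converts an ISO 8601
# duration such as "PT1H2M3S" into a total number of seconds.
_DURATION = re.compile(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?")

def iso8601_to_seconds(iso_str):
    match = _DURATION.fullmatch(iso_str)
    if match is None:
        raise ValueError(f"not a supported ISO 8601 duration: {iso_str!r}")
    # Missing components (for example no hours) count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

print(iso8601_to_seconds("PT1H2M3S"))   # 1 hour, 2 minutes, 3 seconds
print(iso8601_to_seconds("PT15M33S"))   # 15 minutes, 33 seconds
```

Matching the whole string with one pattern, rather than searching for each unit separately, means malformed input raises an error instead of silently returning 0.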

In [19]:
video_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6451 entries, 0 to 6450
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   video_id          6451 non-null   object        
 1   channelTitle      6451 non-null   object        
 2   title             6451 non-null   object        
 3   description       5569 non-null   object        
 4   tags              5367 non-null   object        
 5   publishedAt       6451 non-null   datetime64[ns]
 6   viewCount         6450 non-null   float64       
 7   likeCount         6250 non-null   float64       
 8   commentCount      6447 non-null   float64       
 9   duration          6451 non-null   object        
 10  definition        6451 non-null   object        
 11  caption           6451 non-null   bool          
 12  publishedDayName  6451 non-null   object        
 13  durationSec       6451 non-null   int64         
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(8)

### Identify and Compare the Popular Workout YouTubers

In [20]:
df = video_df.copy()
#tags are stored as one comma-separated string per video, so count the items rather than the characters
df['tag_count'] = df['tags'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)

In [37]:
#detect and remove noise data-- short videos
#flag videos whose title mentions 'shorts' (case-sensitive)
is_short = df['title'].str.contains('shorts', regex=False)
short_videos = df[is_short]
#keep non-shorts that run at least 5 minutes; .copy() avoids SettingWithCopyWarning when columns are added later
long_videos = df[~is_short & (df.durationSec >= 300)].copy()

In [38]:
long_videos['current_date'] = pd.to_datetime('2024-07-24')
long_videos['days'] = long_videos['current_date'] - long_videos['publishedAt']
long_videos['days'] = long_videos['days'].dt.days




In [39]:
#detect and remove noise data-- non-workout videos
workout_videos = long_videos.copy()
other_type_video = ['eat', 'recipes', 'meals','vlog','drinks', 'meal', 'recipe','Vlog', 'vegan', 'travel', 'a day with me', 'a day', 'subscriber', 'million','mental health','podcast','nutrition', 'fitness app', 'app', 'mindset', 'questions', 'q&a']
exclude_words = {word.lower() for word in other_type_video}

#split the tag string and strip brackets and quotes so every tag, including the first and last, is a clean lowercase string
workout_videos['tag_list'] = workout_videos['tags'].str.split(',').apply(lambda x: [tag.strip(" []'\"").lower() for tag in x] if isinstance(x, list) else [])

# Function to check if any tag exactly matches a word from other_type_video
def contains_excluded_word(tag_list, exclude_list):
    return any(tag in exclude_list for tag in tag_list)

mask = workout_videos['tag_list'].apply(lambda x: not contains_excluded_word(x, exclude_words))
workout_videos = workout_videos[mask]
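
The tag filter above can also be sketched on its own. This version assumes each `tags` value is a stringified Python list such as `"['abs', 'hiit']"` (an inference from how the notebook slices the quotes off each tag), and the helper names `parse_tags` and `is_non_workout` are hypothetical. It uses `ast.literal_eval` from the standard library, so tags that contain commas or apostrophes are still parsed as single items.

```python
import ast

# Hypothetical helpers for illustration; the names are not from the notebook.
def parse_tags(raw):
    """Turn a stringified list such as "['Abs', 'HIIT']" into lowercase tags."""
    if not isinstance(raw, str):
        return []  # missing tags (NaN) become an empty list
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(tag).strip().lower() for tag in parsed]

def is_non_workout(raw, excluded):
    """True when any tag exactly matches an excluded word."""
    return any(tag in excluded for tag in parse_tags(raw))

excluded = {"vlog", "recipe", "q&a"}
print(is_non_workout("['Abs Workout', 'Vlog']", excluded))
print(is_non_workout("['abs workout', 'hiit']", excluded))
print(is_non_workout(float("nan"), excluded))
```

Exact matching is a deliberate choice here: a substring test would also drop tags such as "apple" for the excluded word "app".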

#### Video Uploads Overview 

In [27]:
colors = {"Lilly Sabri": "indigo", 
         "Caroline Girvan": "mediumaquamarine", 
         "Chloe Ting": "pink", 
         "Eleni Fit": "lightblue", 
         "Boho Beautiful Yoga": "darkturquoise",
         "Move With Nicole": "orange", 
         "emi wong": "cornflowerblue", 
         "growingannanas" : "violet",
          "Pamela Reif" : "PaleVioletRed",
          "MadFit": "tomato"}

In [43]:
fig = px.histogram(workout_videos, x="publishedAt", color = "channelTitle", facet_col="channelTitle", 
                   template  = "simple_white", facet_col_wrap =3,height = 1000,
           color_discrete_map=colors,  opacity=0.7)

# style the layout and add a centered title
fig.update_layout(showlegend=False,
                  xaxis_title="Published Date", yaxis_title="Number of Videos",
                  title= dict(text ="<b>Video uploads by channel</b>",
                              x=0.5,
                              font = dict(family ="Old Standard TT", size = 24) ), ) 

Judging by the monthly upload counts, workout content creators tend to publish on a stable and consistent schedule. The frequency of creating new videos varies:.... However, most creators reduced their output in 2021 after a highly productive 2020 during Covid, which may indicate that they were fatigued and needed a break.
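
One way to check the 2020 versus 2021 comparison numerically is to count uploads per channel and calendar year. The sketch below uses a toy frame; the channel names and dates are made up for illustration, and only the column names `channelTitle` and `publishedAt` are taken from the notebook.

```python
import pandas as pd

# Toy data for illustration only; channel names and dates are made up.
toy = pd.DataFrame({
    "channelTitle": ["A", "A", "A", "B", "B"],
    "publishedAt": pd.to_datetime(
        ["2020-03-01", "2020-06-15", "2021-02-10", "2020-04-20", "2021-05-05"]),
})

# Count uploads per channel and calendar year, one column per year.
uploads_per_year = (
    toy.groupby(["channelTitle", toy["publishedAt"].dt.year])
       .size()
       .unstack(fill_value=0)
)
print(uploads_per_year)
```

On the notebook's data, `toy` would be replaced by `workout_videos`; a channel whose 2021 column is lower than its 2020 column supports the observation above.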

In [44]:
#video uploads over the years, all channels combined
fig = px.histogram(long_videos, x="publishedAt", 
                   template="simple_white",  
                   opacity=0.7)

# add a centered title with the data source as a subtitle
fig.update_layout(showlegend=True, 
                  title= dict(text ="<b>Video uploads</b><br><sup>Data from YouTube API - 24th Jun 2024 </sup>",
                              x=0.5, 
                              font = dict(family ="Old Standard TT", size = 24) )) 


In [25]:
# other_type_video = ['eat', 'recipes', 'meals','vlog','drinks', 'meal', 'recipe','Vlog', 'vegan', 'travel', 'a day with me', 'a day', 'subscriber', 'million']
# all_tags = ' '.join(long_videos['tags'].dropna())
# tag_list = re.findall(r"'(.*?)'", all_tags.lower())
# 
# tag_count_list = {}
# for tag in tag_list:
#     if not any(word in tag for word in other_type_video):
#         tag_count_list[tag] = tag_count_list.get(tag,0) + 1
# tag_count_list = pd.DataFrame.from_dict(tag_count_list, orient='index', columns = [ 'count']).reset_index()
# tag_count_list.columns = ['tag', 'count']
# tag_count_list = tag_count_list.sort_values(by = 'count', ascending = False)
# tag_count_list = tag_count_list.drop([3954,3955], axis = 0).reset_index().drop('index', axis = 1)