### Data Exploration and Analysis

Now we'll dive into the data and look for insights, patterns, and trends.
First, we'll read in the data and format it. Afterwards, we'll carry out the exploration step by step in the following manner:
1. Univariate analysis: Look at the individual (marginal) distribution of the features we deem important.
    - important numeric features are: total views/likes/comments, and video duration.
    - important categorical features are: published year, body categories (abs, legs, etc.) and workout types (cardio, HIIT, etc.).
    - we may also identify outliers at this stage and handle them accordingly.
2. Multivariate analysis: Look at the relationships between numeric features.
    - see if there's any correlation between any of the numeric features.
3. Answer questions of interest.
    - Is there any correlation between video duration and engagement metrics (views, likes, comments)?
    - Do video titles' length impact their engagement or viewership?
    - Is there any correlation between the number of tags and the engagement of videos?
    - Which channels have the highest engagement (views, likes, and comments) per video on average?
    - What is the distribution of video duration across the different duration categories for each channel?
    - Is there any seasonal pattern in video uploads for each channel?
    - Which channels have the highest proportion of body-specific workout categories (e.g., upper body, legs, abs)?
    - Which channels have the highest percentage of videos with specific tags (e.g., yoga, HIIT, dance)?

In [469]:
# standard imports
import pandas as pd
import numpy as np
from datetime import datetime
from os.path import dirname, abspath
from dateutil import parser
import calendar

# for stats
import scipy.stats as stats
from scipy.stats import kruskal
from scikit_posthocs import posthoc_dunn

# for visualization
import plotly
import plotly.express as px
import plotly.graph_objects as go

px.defaults.template = "plotly_white"

### Read in and Format Data

In [470]:
# Read in the csv files from the processed data folder
project_dir = dirname(dirname(abspath("03-data-analysis.ipynb")))
videos_df = pd.read_csv(project_dir + "/data/processed/fitness_videos_processed_2023_07_11.csv")
channels_df = pd.read_csv(project_dir + "/data/processed/fitness_channels_processed_2023_07_11.csv")

In [471]:
# Formatting channels dataframe
# Get only the date from the datetime column
channels_df['published_date'] = pd.to_datetime(channels_df['published_date'].apply(lambda x: parser.parse(x)))
channels_df['published_date'] = channels_df['published_date'].dt.date

# Formatting videos dataframe
# Convert published_date column to datetime format
videos_df['published_date'] = pd.to_datetime(videos_df['published_date'])

# Get the names of columns that are of boolean data type
boolean_columns = videos_df.select_dtypes(include='bool').columns.tolist()
# Convert boolean columns to numeric (0 or 1)
videos_df[boolean_columns] = videos_df[boolean_columns].astype(int)

# Convert channel column to category data type
videos_df['channel'] = videos_df['channel'].astype('category')

# Convert 'duration_category' to a categorical data type with the desired order
desired_order = ['0-5', '5-10', '10-20', '20-30', '30-45', '45+']
videos_df['duration_category'] = pd.Categorical(videos_df['duration_category'], categories=desired_order, ordered=True)

### 1. Univariate Analysis

In [472]:
px.histogram(videos_df, x='total_views', color='channel', marginal='violin', title='Distribution of Total Views')


First of all, we observe that the distribution of total views is right-skewed for every channel. In other words, each channel has a few videos that are far more popular with viewers than the rest. A handful of videos are clear outliers, and they make the distribution heavily right-skewed. Emi Wong's channel has one such outlier, while blogilates and Chloe Ting each have two. Let's take a closer look at those outliers.
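
The outliers above were picked out by eye from the histogram. As a rough cross-check, the same idea can be sketched with the common 1.5 × IQR rule applied within each channel. This is only an illustration on made-up numbers: the helper `flag_iqr_outliers` and the toy data are hypothetical, and with view counts as skewed as ours the rule would likely flag many more videos than the handful discussed here.

```python
import pandas as pd


def flag_iqr_outliers(df, value_col, group_col, k=1.5):
    """Return a boolean Series that is True where value_col lies above
    Q3 + k * IQR of its own group (upper-tail outliers only)."""
    grouped = df.groupby(group_col, observed=True)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    return df[value_col] > q3 + k * (q3 - q1)


# Toy data (not from the notebook): channel A contains one extreme video
toy = pd.DataFrame({
    "channel": ["A"] * 6 + ["B"] * 6,
    "total_views": [10, 12, 11, 13, 12, 500, 100, 110, 105, 95, 102, 104],
})
toy["is_outlier"] = flag_iqr_outliers(toy, "total_views", "channel")
print(toy[toy["is_outlier"]])
```

Computing the fences per channel rather than over the whole dataset keeps a large channel's typical video from being flagged merely because a smaller channel has lower view counts.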

#### 1.1 Identifying and handling outliers

In [473]:
videos_df[['video_id','channel','title','published_date','total_views','duration_mins']].sort_values('total_views', ascending=False).head(6)

Unnamed: 0,video_id,channel,title,published_date,total_views,duration_mins
1418,2pLT-olgUJs,Chloe Ting,Get Abs in 2 WEEKS | Abs Workout Challenge,2019-08-12,508201510,11.066667
1419,2MoGxae-zyo,Chloe Ting,Do This Everyday To Lose Weight | 2 Weeks Shre...,2019-08-08,207907358,14.383333
937,jXm0y-csiuE,emi wong,"10 MIN EXERCISE & STRETCH FOR SHOULDERS, NECK ...",2020-04-05,86202358,10.5
1441,6TmQiugy_qw,Chloe Ting,Abs Workout 🔥Get that 11 Line Abs in 35 days,2019-04-30,75241032,11.133333
1615,k43k68kmeyE,blogilates,You asked for a thicker yoga mat. And with me…...,2023-05-19,73619594,0.25
1864,LEeluLcp14w,blogilates,Agree or disagree? Headphones in = don’t talk ...,2022-04-01,71375553,1.0


I recall that at the beginning of the COVID-19 pandemic, Chloe Ting's channel experienced a large influx of new viewers as some of her abs workouts were very well received. One abs challenge in particular garnered attention, brought her channel many new subscribers, and became the most-watched video among her viewers during the pandemic: that's the one with 508 million views.

For emi wong's channel, the outlier is a shoulder, neck and collarbone stretch and exercise video released during the pandemic. More definition in the collarbone area seems to be something many viewers wanted, and many confirmed in the comments that they saw results. Since few videos focus specifically on this area of the body, the video naturally got a lot of views. Although the outliers identified so far have far more views than the rest of their respective channels' videos, we choose to keep them because they are informative about viewers' fitness needs and preferences.

For the blogilates channel, however, the outliers appear to be promotional YouTube Shorts for new products from her brand, popflex. In this case, we want to drop those videos because their high view counts reflect viewers checking out the new products, not necessarily an interest in fitness. However, this raises a new problem: the YouTube Shorts may or may not be workout- or fitness-related.

In [474]:
# look at the top 10 views for the channel blogilates
videos_df.query('channel == "blogilates"')[['video_id','title','published_date','total_views','duration_mins']].sort_values('total_views', ascending=False).head(10)

Unnamed: 0,video_id,title,published_date,total_views,duration_mins
1615,k43k68kmeyE,You asked for a thicker yoga mat. And with me…...,2023-05-19,73619594,0.25
1864,LEeluLcp14w,Agree or disagree? Headphones in = don’t talk ...,2022-04-01,71375553,1.0
1727,tLX3UcdxlPc,And I’m not even close to being done #fashion ...,2022-08-18,50358153,1.0
1631,sSroH0LEfVM,I am genuinely so sorry it’s taken me THIS LON...,2023-03-27,46683966,1.0
1624,vqB7hVyP0ps,I played around with a brand new silhouette an...,2023-04-20,43664933,1.016667
1925,XHErU2TiopY,Easiest messy bun hack EVER 🙌,2022-01-29,43400143,0.35
1928,EJuRti53M74,What Disney Princesses would wear to the gym 👑,2022-01-25,34055805,0.416667
1944,Un15N-r9Wzg,I tried the human windmill challenge…,2021-12-24,29306268,0.266667
1728,tHfy5gGVUg4,I designed the most ROMANTIC workout skort. Wh...,2022-08-15,28988625,1.0
1921,zpL59ry_Ur0,You HAVE to be one of these 5 kinds of people,2022-02-03,26912193,0.25


We looked at the ten most-viewed blogilates videos: all of them are YouTube Shorts and none are fitness-related. We therefore decided to remove the YouTube Shorts from the rest of the data, since they should be treated differently.

In [475]:
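# Save all the YouTube Shorts (videos of at most 1 minute) in a separate dataframe
ytshorts_df = videos_df.query('duration_mins <= 1').copy()

# Drop all YouTube Shorts from the videos data
# (.copy() so that later column assignments do not act on a slice of the original)
videos_df = videos_df.query('duration_mins > 1').copy()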

In [476]:
px.histogram(videos_df, 
             x ='total_likes', 
             color='channel', 
             marginal='violin', 
             title="Distribution of Total Likes", 
             labels={'total_likes':'Total Likes','count':'Count','channel':'Channel'})

In [477]:
px.histogram(videos_df, 
             x ='total_comments', 
             color='channel', 
             marginal='violin', 
             title="Distribution of Total Comments",
             labels={'total_comments':'Total Comments','count':'Count','channel':'Channel'})


The distributions of total likes and total comments have a similar shape to the distribution of total views. The only difference is the scale: total comments < total likes < total views.

In [478]:
px.box(ytshorts_df, 
       x ='total_views',
       y='channel', 
       color='channel', 
       title="Distribution of Total Views of YouTube Shorts",
       labels={'total_views':'Total Views','channel':'Channel'})

As expected, blogilates has many YouTube Shorts, far more than the other channels. This makes sense, as blogilates also uses her channel to introduce and promote her brand's products, which are very popular.

In [479]:
fig = px.histogram(videos_df, 
                   x='duration_mins',
                   color='channel', 
                   marginal='box',
                   title='Distribution of Video Duration',
                   labels={'duration_mins':'Video Duration (in Mins)','channel':'Channel'})
fig.show()

In [480]:
df = videos_df['duration_category'].value_counts(normalize = True).reset_index(name='percentage').rename(columns={'index': 'duration_category'})
df['percentage'] = df['percentage'].round(4) * 100
df['duration_category'] = pd.Categorical(df['duration_category'], categories=desired_order, ordered=True)

fig = px.bar(df, 
             x = 'duration_category', 
             y = 'percentage', 
             title="Percentage of Videos for each Duration Category",
             labels={'duration_category':'Workout Duration Category','percentage':'Percentage'},
             width=800)
fig.show()

From our data, we observe that most videos fall in the '10-20 Mins' duration category, which is more or less expected. It is surprising, though, that there are more videos shorter than 5 mins than videos in the '20-30 Mins' category. This can be explained by the distribution of video duration: the 25th percentile is roughly 8-10 mins for every channel except blogilates, whose 25th percentile is 6.2 mins. Thus, compared to the other channels, blogilates seems to prefer shorter videos.

Finally, we note that less than 1% of the videos are longer than 45 mins. So, it would make sense to get rid of the '45+' category and simply create a new category for videos that are longer than 30 mins. 

In [481]:
# All videos in the '30-45' and '45+' categories will now be in the '30+' category.
# Also, since the YouTube Shorts (videos of at most 1 minute) have been removed, we rename the '0-5' category to '1-5'.
# Replace values in 'duration_category' column
videos_df['duration_category'] = videos_df['duration_category'].replace({'0-5': '1-5', '30-45': '30+', '45+': '30+'})
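
An alternative to relabelling the existing categories is to rebuild them directly from `duration_mins` with `pd.cut`, which also keeps the result an ordered categorical. The sketch below uses toy durations and assumes right-closed intervals, so a 5-minute video lands in '1-5'. How the original `duration_category` column treated videos exactly on a boundary is not shown in this notebook, so that choice is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed bin edges; intervals are right-closed: (1, 5], (5, 10], ..., (30, inf)
bin_edges = [1, 5, 10, 20, 30, np.inf]
bin_labels = ["1-5", "5-10", "10-20", "20-30", "30+"]

# Toy durations in minutes (not from the notebook)
toy = pd.DataFrame({"duration_mins": [1.5, 5.0, 7.25, 18.0, 29.9, 30.0, 44.0, 61.0]})

# pd.cut returns an ordered categorical when labels are given
toy["duration_category"] = pd.cut(toy["duration_mins"], bins=bin_edges, labels=bin_labels)

# sort=False keeps the categories in their natural order
print(toy["duration_category"].value_counts(sort=False))
```

Because the categories come out ordered, plots and group-bys list them from shortest to longest without a separate reordering step.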

In [482]:
df = videos_df.groupby(['published_year', 'channel']).agg(
    count=('video_id', np.size)).fillna(0).reset_index()

# Calculate the percentage change in total number of videos from the past year
df['percentage_change'] = df.groupby('channel')['count'].pct_change() * 100

fig = px.bar(df, 
             x='published_year', 
             y='count', 
             color='channel',
             title='Total Number of Videos Published per Year',
             labels={'channel':'Channel','published_year':'Published Year of Video','count':'Video Count'},
             width=800,
             template='plotly'
             )

fig.update_layout(legend=dict(
    yanchor="top",
    y=0.97,
    xanchor="left",
    x=0.013
))


fig.show()

In 2020, all the channels published more videos than usual, which makes it the year with the highest number of videos published. This is most likely because it was early in the COVID-19 pandemic and many more people became interested in YouTube workouts to stay in shape during lockdown. After 2020, the number of videos produced by each channel returned to its normal level. However, aside from MadFit and Rebecca-Louise, the other three channels had quite a sharp decrease in the number of videos produced, most notably blogilates.
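
To make the year-over-year comparison concrete, here is a small worked example of the `groupby(...).pct_change()` pattern used for `percentage_change`. The counts are invented for illustration, and the explicit sort is a defensive assumption: `pct_change` compares each row with the previous row of the same channel, so the rows need to be in chronological order.

```python
import pandas as pd

# Toy yearly upload counts (not from the notebook)
toy = pd.DataFrame({
    "published_year": [2019, 2020, 2021, 2019, 2020, 2021],
    "channel": ["A", "A", "A", "B", "B", "B"],
    "count": [50, 100, 80, 40, 60, 30],
})

# Make sure each channel's rows are in chronological order
toy = toy.sort_values(["channel", "published_year"])

# Percentage change relative to the same channel's previous year
toy["percentage_change"] = toy.groupby("channel")["count"].pct_change() * 100
print(toy)
```

The first year of each channel has no predecessor, so its change is `NaN` rather than 0.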

In [483]:
body_parts = ['FULL_BODY', 'UPPER_BODY', 'LOWER_BODY', 'CHEST', 'BACK', 'ABS', 'ARMS','LEGS', 'GLUTES']
df = videos_df[body_parts].agg('sum').reset_index(name='count').rename(columns={'index':'body_part'}).sort_values('count', ascending=False)
df['percentage'] = df['count']/ df['count'].sum() * 100
px.bar(df, 
       x='body_part', 
       y='percentage', 
       title='Percentage of Videos per Body-Specific Workout Type',
       labels={'body_part':'Body Part','percentage':'Percentage'},
       width=600)

First, we note that there are more workouts that target specific body parts than workouts that target the whole upper, lower or full body. The only exceptions are back and chest workouts, which are the least targeted body parts.

In [484]:
workout_type = ['HIIT', 'CARDIO', 'DANCE', 'TABATA', 'PILATES', 'BARRE', 'YOGA', 'STRENGTH_TRAINING']
df = videos_df[workout_type].agg('sum').reset_index(name='count').rename(columns={'index':'workout_type'}).sort_values('count',ascending=False)
df['percentage'] = df['count']/ df['count'].sum() * 100
px.bar(df, 
       x='workout_type', 
       y='percentage', 
       title='Percentage of Videos per Workout Type', 
       labels={'workout_type':'Type of Workout','percentage':'Percentage'},
       width=600)

Before we delve any further, note that these workout types are not mutually exclusive: a workout can fall under more than one type, and this is often the case. First, we note that fewer than 1% of videos are barre or tabata workouts, so it makes sense to group them into an 'Others' category.
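
Because the flags overlap, there are two different percentages one can report, and they answer different questions. The sketch below, on a made-up table of 0/1 flags, contrasts the share of all tag assignments (what dividing by the sum of the counts gives) with the share of videos carrying each tag (the column mean). The variable names are hypothetical.

```python
import pandas as pd

# Toy 0/1 flags for four videos (not from the notebook); rows can carry several tags
toy = pd.DataFrame({
    "HIIT":    [1, 0, 1, 0],
    "CARDIO":  [1, 1, 0, 0],
    "PILATES": [0, 0, 1, 1],
})

# Share of all tag assignments: denominator is the 6 tags handed out in total
share_of_tags = toy.sum() / toy.sum().sum() * 100

# Share of videos carrying each tag: denominator is the 4 videos
share_of_videos = toy.mean() * 100

print(share_of_tags.round(1))
print(share_of_videos)
```

The first set of percentages always sums to 100, while the second can sum to more than 100 whenever videos carry several tags.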

Second, the majority of the YouTube workouts are strength training workouts, which is not surprising. However, pilates being the second most common workout type is surprising, as we would have expected that to be either cardio or HIIT. This may be largely because the creator of blogilates is a pilates instructor. Still, considering the target audience, we would have expected the blogilates channel to post more cardio or HIIT workouts than pilates. Let's take a closer look at the pilates workouts.

In [485]:
videos_df.groupby('channel').agg(count = ('PILATES',np.sum))

Unnamed: 0_level_0,count
channel,Unnamed: 1_level_1
Chloe Ting,0
MadFit,2
Rebecca-Louise,10
blogilates,779
emi wong,1


Our speculation was correct: there are 792 pilates workouts in total in this data and 779 of them were created by blogilates. That is, 98% of the pilates workouts in this dataset are from blogilates.

In [486]:
# Group by 'channel' and sum the boolean flags for each workout type
grouped_by_workouttype = videos_df.groupby('channel')[workout_type].agg('sum').reset_index()

grouped_by_workouttype = grouped_by_workouttype.melt(id_vars='channel', var_name='workout_type', value_name='video_count')

grouped_by_workouttype['workout_type'] = pd.Categorical(grouped_by_workouttype['workout_type'])
grouped_by_workouttype['channel'] = pd.Categorical(grouped_by_workouttype['channel'])

grouped_by_workouttype.query('channel == "blogilates"').sort_values('video_count',ascending=False)


Unnamed: 0,channel,workout_type,video_count
23,blogilates,PILATES,779
38,blogilates,STRENGTH_TRAINING,447
8,blogilates,CARDIO,361
3,blogilates,HIIT,337
33,blogilates,YOGA,299
13,blogilates,DANCE,27
28,blogilates,BARRE,2
18,blogilates,TABATA,0


### 2. Multivariate Analysis

In [487]:
# Define indices corresponding to channel categories, using pandas label encoding
index_vals = videos_df['channel'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='total views',
                                 values=videos_df['total_views']),
                            dict(label='total likes',
                                 values=videos_df['total_likes']),
                            dict(label='total comments',
                                 values=videos_df['total_comments']),
                            dict(label='video duration',
                                 values=videos_df['duration_mins']),
                            dict(label='title length',
                                 values=videos_df['title_length']),
                            dict(label='total tags',
                                 values=videos_df['total_tags'])],
                text=videos_df['channel'],
                diagonal_visible=False, # remove plots on diagonal
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Matrix Plot of Numerical Features of the Workout Videos',
    dragmode='select',
    width=1000,
    height=1000,
    hovermode='closest',
    template='plotly'
)

fig.show()

In [488]:
numeric_cols = ['total_views', 'total_likes', 'total_comments',
                'duration_mins', 'video_age', 'title_length', 'total_tags']
df_corr = videos_df[numeric_cols].corr().round(2)

fig = px.imshow(df_corr, text_auto=True, width=950, title="Correlation of Numeric Features of the Workout Videos", aspect="auto")

fig.show()

### 3. Explore Specific Questions

Using the matrix plot and correlation heatmap, we can answer the following questions:

#### **Is there any correlation between video duration and engagement metrics (views, likes, comments)?**
No, there is no linear correlation between video duration and engagement metrics such as views, likes and comments; the correlation coefficients are close to 0. This means there is no positive or negative linear trend in the engagement metrics as duration increases. However, the scatter matrix shows that most of the videos with the highest views, likes and comments are around 15-20 mins long. Beyond that, the engagement metrics actually decrease.
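
A near-zero correlation coefficient and a visible peak are not contradictory. The sketch below uses synthetic, noise-free data (not the notebook's) in which views rise up to 18 minutes and then fall off symmetrically. Both the Pearson and the Spearman coefficient come out close to zero, even though views depend entirely on duration. The curve shape and all numbers are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: views peak at 18 minutes and fall off symmetrically
duration = np.linspace(3, 33, 201)
views = 1e6 * np.exp(-((duration - 18) / 6) ** 2)

# Linear (Pearson) and rank-based (Spearman) correlation coefficients
pearson_r, _ = stats.pearsonr(duration, views)
spearman_rho, _ = stats.spearmanr(duration, views)

# Duration at which the synthetic views are highest
peak_duration = duration[np.argmax(views)]

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
print(f"Duration with the most views: {peak_duration:.1f} min")
```

Both coefficients only measure monotonic trends, so a hump-shaped relationship like this one is better examined with a scatter plot or with per-bin summaries.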

#### **Do video titles' length or the number of tags impact their engagement or viewership?**
There is no strong correlation between title length and views or likes. It would be more accurate to say that videos with short titles, roughly fewer than 25 characters, tend to get fewer views and likes. This may be because such titles are not descriptive enough to attract viewers. At the same time, videos with overly descriptive titles do not get a significant increase in viewers either. Hence, neither title length nor the number of tags has a significant impact on views, likes, or comments.

#### **Is there any correlation between the number of tags and the engagement of videos?**
No, there is none. Still, we can see that the videos with the highest engagement (views, likes or comments) have around 20-35 tags in total. This suggests that the videos with the most engagement have neither too few nor too many tags.

#### **Which channels have the highest average views, likes, and comments per video?**

Since some channels have been around longer than others, comparing total views, likes, and comments would not be fair. Instead, it is more appropriate to compare the channels on a per-video basis. Because the engagement metrics are heavily right-skewed, we use the median as the per-video average, so that a few viral videos do not dominate the comparison. This way, even channels with fewer videos or channels that have been active for a shorter time can be compared fairly with the others.
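
The choice of summary statistic matters for skewed data. As a small illustration on invented numbers, the sketch below compares the mean and the median of two hypothetical channels, one of which has a single viral video. The ranking of the channels flips depending on which statistic is used.

```python
import pandas as pd

# Toy data (not from the notebook): channel A has one viral video
toy = pd.DataFrame({
    "channel": ["A"] * 5 + ["B"] * 5,
    "total_views": [100, 120, 110, 130, 5000, 300, 310, 290, 305, 295],
})

# Per-channel mean and median of views per video
summary = toy.groupby("channel")["total_views"].agg(["mean", "median"])
print(summary)
```

By the mean, channel A looks stronger because of its one viral video. By the median, which reflects a typical video, channel B comes out ahead.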

In [498]:
# Note: the 'average_*' columns hold per-video medians, which are robust to viral outliers
channels_summary = (videos_df
                    .groupby('channel')
                    .agg(total_videos = ('video_id', np.size),
                         average_views = ('total_views', np.median),
                         average_likes = ('total_likes', np.median),
                         average_comments = ('total_comments', np.median)
                         )
                    .reset_index()
                    )

channels_summary = (channels_df[['channel', 'published_date','total_subscribers']]
                    .merge(channels_summary, on='channel', how='left')
                    .sort_values(by='published_date')
                    .set_index('channel')
                    .style
                    .format(precision=1)
                    .background_gradient(cmap='Oranges')
                    )

channels_summary

Unnamed: 0_level_0,published_date,total_subscribers,total_videos,average_views,average_likes,average_comments
channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blogilates,2009-06-13,8730000,822,619272.5,12568.0,914.0
Chloe Ting,2011-08-17,24700000,401,2036668.0,51740.0,1393.0
Rebecca-Louise,2012-09-22,720000,1053,28683.0,662.0,65.0
emi wong,2014-11-02,6130000,428,286521.5,6288.5,236.5
MadFit,2018-03-02,8050000,600,740090.5,18992.0,428.0


We can see that Chloe Ting has the highest performance in all metrics despite having the lowest number of videos. On the other hand, Rebecca-Louise has the lowest performance in all metrics despite having the highest number of videos on her channel.

#### **Is there any seasonal pattern in video uploads for each channel?**

In [490]:
# Convert the numeric month values to abbreviated month names using calendar.month_abbr
videos_df['published_month_abbr'] = videos_df['published_month'].apply(lambda x: calendar.month_abbr[x])

# Order the 'month' column
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
videos_df['published_month_abbr'] = pd.Categorical(videos_df['published_month_abbr'], categories=month_order, ordered=True)

# Keep videos published from 2018 onwards, group them by channel and month, and count the uploads in each group
uploads_data = (videos_df
                    .query('published_year >= 2018')
                    .groupby(['channel', 'published_month_abbr'])
                    .size()
                    .reset_index(name='video_count'))


# Create a heatmap using Plotly's go.Heatmap
fig = go.Figure(go.Heatmap(
    x=uploads_data['published_month_abbr'],
    y=uploads_data['channel'],
    z=np.log(uploads_data['video_count'])))

# Customize the layout
fig.update_layout(title="Log-Transformed Monthly Total Videos by Channel",
                  xaxis_title='Month of the Year',
                  yaxis_title='Channel',
                  xaxis_nticks=36,  # Adjust the number of x-axis ticks as needed
                  yaxis_nticks=len(videos_df['channel'].unique()),  # Adjust the number of y-axis ticks
                  yaxis_autorange='reversed',  # To have the first channel at the top
                  height=450,
                  width=950)

# Show the heatmap
fig.show()
fig.write_html(project_dir+ "/output/figures/heatmap_monthly_total_videos.html",full_html=False, include_plotlyjs='cdn')

Interestingly enough, there was no unusually high production of videos during the COVID-19 pandemic. More people may have been introduced to home workouts or been more inclined to do them; however, the channels themselves did not upload significantly more workout videos than before the pandemic.
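
The heatmap above shows log-transformed raw counts, so channels with more videos overall tend to look "hotter" in every month. One possible complement, sketched here on toy data, is to express each month as a share of that channel's own uploads using `pd.crosstab` with `normalize="index"`. This is an alternative view under stated assumptions, not what the notebook computes. It also avoids taking the logarithm of zero counts.

```python
import pandas as pd

# Toy uploads (not from the notebook): one row per video
toy = pd.DataFrame({
    "channel": ["A", "A", "A", "A", "B", "B"],
    "published_month_abbr": ["Jan", "Jan", "Feb", "Mar", "Jan", "Mar"],
})

# Rows: channels, columns: months, values: % of that channel's uploads
shares = pd.crosstab(toy["channel"], toy["published_month_abbr"], normalize="index") * 100
print(shares.round(1))
```

Each row sums to 100, so a month stands out only if that channel uploads disproportionately often in it.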

#### **What is the distribution of video duration across the different duration categories for each channel?**

In [491]:
# Group videos by 'duration_category' and 'channel', and count the number of videos in each group
video_counts_by_channel_duration = (videos_df
                                    .groupby(['duration_category', 'channel'])
                                    .size()
                                    .reset_index(name='video_count')
                                    )

# Calculate the total number of videos for each channel
total_videos_by_channel = (videos_df
                           .groupby('channel')
                           .size()
                           .reset_index(name='total_videos')
                           )

# Merge the two DataFrames to calculate the percentage of videos in each duration category for each channel
video_counts_by_channel_duration = video_counts_by_channel_duration.merge(
    total_videos_by_channel, on='channel', how='left')
video_counts_by_channel_duration['percentage'] = (
    video_counts_by_channel_duration['video_count'] / video_counts_by_channel_duration['total_videos']) * 100

video_counts_by_channel_duration['channel'] = pd.Categorical(
    video_counts_by_channel_duration['channel'])


# Create the interactive bar chart using Plotly Express
duration_bar_chart = px.bar(video_counts_by_channel_duration, x='channel',
                            y='percentage',
                            color='duration_category',
                            title='Percentage of Videos in each Duration Category per Channel',
                            labels={'duration_category':'Duration Category','channel':'Channel','percentage':'% of Videos'}
                            )


# Display the bar chart
duration_bar_chart

#### **Which channels have the highest number of body-specific workout categories (e.g., upper body, legs, abs)?**

In [492]:
# Group by 'channel' and sum the boolean flags for each body part
grouped_by_body_part = videos_df.groupby('channel')[body_parts].agg('sum').reset_index()

grouped_by_body_part = grouped_by_body_part.melt(id_vars='channel', var_name='body_part', value_name='video_count')

grouped_by_body_part['body_part'] = pd.Categorical(grouped_by_body_part['body_part'])
grouped_by_body_part['channel'] = pd.Categorical(grouped_by_body_part['channel'])

px.bar(grouped_by_body_part, 
       x='channel',
       y='video_count', 
       color='body_part', 
       title='Distribution of Body-Specific Workout Categories by Channel',
       labels={'video_count': 'Count',
               'body_part': 'Body Part',
               'channel': 'Channel'})

#### **Which channels have the highest number of videos with specific tags (e.g., yoga, HIIT, dance)?**

In [493]:
# Group by 'channel' and sum the boolean flags for each workout type
grouped_by_workouttype = videos_df.groupby('channel')[workout_type].agg('sum').reset_index()

grouped_by_workouttype = grouped_by_workouttype.melt(id_vars='channel', var_name='workout_type', value_name='video_count')

grouped_by_workouttype['workout_type'] = pd.Categorical(grouped_by_workouttype['workout_type'])
grouped_by_workouttype['channel'] = pd.Categorical(grouped_by_workouttype['channel'])

# Group the less common workout types (tabata, barre, dance) into 'OTHERS'
grouped_by_workouttype['workout_type'] = grouped_by_workouttype['workout_type'].replace({'TABATA': 'OTHERS', 'BARRE': 'OTHERS', 'DANCE':'OTHERS'})

px.bar(grouped_by_workouttype, 
       x='channel',
       y='video_count', 
       color='workout_type',
       title='Distribution of Workout Types by Channel',
       labels={'video_count': 'Count',
               'workout_type': 'Workout Type',
               'channel': 'Channel'})

#### **Is there a significant difference in engagement (views) between channels?**

In [494]:
# Create a dictionary to store data for each channel
channel_data = {}

# Iterate through unique channel names and extract total_views for each channel
unique_channels = videos_df['channel'].unique()
for channel in unique_channels:
    channel_data[channel] = videos_df[videos_df['channel'] == channel]['total_views']

# Perform Kruskal-Wallis test
statistic, p_value = kruskal(*channel_data.values())

# Print the test results
print("Kruskal-Wallis H Test:")
print(f"Test Statistic: {statistic}")
print(f"P-value: {p_value}")

# Check for statistical significance (typically p-value < 0.05)
if p_value < 0.05:
    print("There is a statistically significant difference in total_views between the channels.")
else:
    print("There is no statistically significant difference in total_views between the channels.")


Kruskal-Wallis H Test:
Test Statistic: 1731.192533298366
P-value: 0.0
There is a statistically significant difference in total_views between the channels.
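
With several thousand videos, even small differences between channels can produce a tiny p-value, so it helps to report an effect size alongside the test. One commonly used option for the Kruskal-Wallis test is epsilon-squared, which equals H / (n - 1) for n observations in total. The sketch below computes it on synthetic log-normal samples. The helper `kruskal_epsilon_squared` and the data are hypothetical and not part of the original analysis.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_epsilon_squared(*groups):
    """Run the Kruskal-Wallis H test and return (H, p-value, epsilon-squared)."""
    h_stat, p_value = kruskal(*groups)
    n_total = sum(len(g) for g in groups)
    # epsilon^2 = H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1)
    epsilon_sq = h_stat / (n_total - 1)
    return h_stat, p_value, epsilon_sq


# Synthetic right-skewed "view counts" for three hypothetical channels
rng = np.random.default_rng(42)
a = rng.lognormal(mean=10, sigma=1, size=200)
b = rng.lognormal(mean=11, sigma=1, size=200)
c = rng.lognormal(mean=12, sigma=1, size=200)

h_stat, p_value, epsilon_sq = kruskal_epsilon_squared(a, b, c)
print(f"H = {h_stat:.1f}, p = {p_value:.3g}, epsilon^2 = {epsilon_sq:.3f}")
```

Epsilon-squared lies between 0 and 1, with values closer to 1 meaning that group membership accounts for more of the variation in ranks.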


In [495]:
# Perform Dunn's test for pairwise comparisons
dunn_results = posthoc_dunn(videos_df, val_col='total_views', group_col='channel')

# Display the results
print(dunn_results < 0.05)


                Chloe Ting  MadFit  Rebecca-Louise  blogilates  emi wong
Chloe Ting           False    True            True        True      True
MadFit                True   False            True        True      True
Rebecca-Louise        True    True           False        True      True
blogilates            True    True            True       False      True
emi wong              True    True            True        True     False


#### **Is there a significant difference in engagement between duration categories?**

In [496]:
# Create a dictionary to store data for each duration category
duration_data = {}

# Iterate through unique duration categories and extract total_views for each category
unique_duration_categories = videos_df['duration_category'].unique()
for category in unique_duration_categories:
    duration_data[category] = videos_df[videos_df['duration_category'] == category]['total_views']

# Perform Kruskal-Wallis test
statistic, p_value = kruskal(*duration_data.values())

# Print the test results
print("Kruskal-Wallis H Test for Engagement vs. Duration Categories:")
print(f"Test Statistic: {statistic}")
print(f"P-value: {p_value}")

# Check for statistical significance (typically p-value < 0.05)
if p_value < 0.05:
    print("There is a statistically significant difference in engagement (total_views) between duration categories.")
else:
    print("There is no statistically significant difference in engagement (total_views) between duration categories.")


Kruskal-Wallis H Test for Engagement vs. Duration Categories:
Test Statistic: 145.12422718800417
P-value: 2.255945844089391e-30
There is a statistically significant difference in engagement (total_views) between duration categories.


In [497]:
# Perform Dunn's test for pairwise comparisons of duration categories
dunn_results = posthoc_dunn(videos_df, val_col='total_views', group_col='duration_category')

# Display the results
print(dunn_results < 0.05)


         1-5   5-10  10-20  20-30    30+
1-5    False  False   True   True  False
5-10   False  False   True   True   True
10-20   True   True  False   True   True
20-30   True   True   True  False   True
30+    False   True   True   True  False
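
The tables above compare each p-value with 0.05, and `posthoc_dunn` is called here without its `p_adjust` argument. When many pairs are tested, it is common to adjust for multiple comparisons. The sketch below illustrates the idea with a different, simpler procedure than Dunn's test: pairwise Mann-Whitney U tests from SciPy with a Bonferroni correction. The helper name and the synthetic data are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def pairwise_mwu_bonferroni(df, val_col, group_col):
    """Pairwise two-sided Mann-Whitney U tests with Bonferroni-adjusted p-values."""
    groups = {name: g[val_col].to_numpy()
              for name, g in df.groupby(group_col, observed=True)}
    pairs = list(combinations(groups, 2))
    rows = []
    for g1, g2 in pairs:
        _, p_raw = mannwhitneyu(groups[g1], groups[g2], alternative="two-sided")
        # Bonferroni: multiply by the number of comparisons, cap at 1
        rows.append({"group_1": g1, "group_2": g2, "p_raw": p_raw,
                     "p_adjusted": min(p_raw * len(pairs), 1.0)})
    return pd.DataFrame(rows)


# Synthetic data: groups A and B share a distribution, C is shifted upwards
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100 + ["C"] * 100,
    "total_views": np.concatenate([
        rng.lognormal(10, 0.5, 100),
        rng.lognormal(10, 0.5, 100),
        rng.lognormal(12, 0.5, 100),
    ]),
})

result = pairwise_mwu_bonferroni(toy, "total_views", "group")
print(result)
```

Bonferroni is the most conservative of the usual corrections. Less strict alternatives such as Holm's method follow the same pattern of adjusting the raw p-values before comparing them with the threshold.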

