In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np


In [None]:
channels = pd.read_csv("channel_data.csv")
channels['avgViews'] = channels['View Count'] / channels['Video Count']
channels.replace([np.inf, -np.inf], np.nan, inplace=True)
channels.dropna(inplace=True)
channels

# Is Average Viewership Proportional to Subscriber Count?

Does the number of subscribers a channel has influence its average viewership? In this analysis, we aim to answer that question so that content creators can judge whether growing their subscriber base is a worthwhile way to increase average viewership, and in turn profit. To do this, we will use a scatterplot with a fitted linear regression line to visualize any correlation, and a Pearson correlation test to check whether the relationship is statistically significant.

In [None]:

sns.lmplot(data=channels, x='Subscriber Count', y='avgViews')
plt.title("Average Views per Video vs. Subscriber Count")
plt.xlabel("Subscriber Count")
plt.ylabel("Average Views per Video")
channels.sort_values(by='avgViews', inplace=True)
display(channels)

result = stats.pearsonr(x=channels['Subscriber Count'], y=channels['avgViews'])
display(f"Correlation coefficient: {result.statistic}")
display(f"p-value: {result.pvalue}")
# Look up each title by row label so a plain string is shown instead of a Series repr
display(f"Largest # of Subs: {channels.loc[channels['Subscriber Count'].idxmax(), 'Title']}")
display(f"Largest # of Views/Video: {channels.loc[channels['avgViews'].idxmax(), 'Title']}")

The plot above shows a positive correlation between subscriber count and average views per video. This is plausible: a channel with more subscribers has a larger audience that regularly watches its content, so each video tends to collect more views. The p-value from the Pearson test supports this, allowing us to reject the null hypothesis that the two variables are not linearly correlated. Note, however, the channels with low subscriber counts but enormous views per video. These outliers are popular musicians who, unlike typical content creators, upload only a small number of videos. Each music video they post tends to go viral, so their views per video are very high.
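Because a few channels with extreme views per video can dominate a Pearson coefficient, a rank-based check can be a useful complement. The sketch below is not part of the original analysis: it uses synthetic data (the `subs` and `avg_views` arrays are made up for illustration, standing in for `Subscriber Count` and `avgViews`) to compare `scipy.stats.pearsonr` with `scipy.stats.spearmanr`, which depends only on ranks and is therefore less sensitive to outliers.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for 'Subscriber Count' and 'avgViews' (illustrative only)
subs = rng.lognormal(mean=13, sigma=1, size=200)
avg_views = subs * rng.lognormal(mean=-3, sigma=0.5, size=200)

# Append a few low-subscriber, very-high-view points, mimicking the musician outliers
subs_out = np.append(subs, [1e4, 2e4, 3e4])
views_out = np.append(avg_views, [5e8, 6e8, 7e8])

# Pearson measures linear association and is pulled strongly by extreme values
pearson_r, pearson_p = stats.pearsonr(subs_out, views_out)

# Spearman measures monotonic association using ranks only
spearman_rho, spearman_p = stats.spearmanr(subs_out, views_out)

print(f"Pearson r:    {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho: {spearman_rho:.3f} (p = {spearman_p:.3g})")
```

On the real data, the same two calls could be made with `channels['Subscriber Count']` and `channels['avgViews']`. A large gap between the two coefficients would suggest that the outliers are driving the Pearson result.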

# Are larger YouTubers better than small YouTubers at getting views?

In [None]:
cutoff = channels['Subscriber Count'].median()
small = channels[channels['Subscriber Count'] < cutoff]
large = channels[channels['Subscriber Count'] >= cutoff]

sns.displot(data=small, x='avgViews', log_scale=True, kde=True)
plt.title(f"Views per Video of Channels with Fewer Than {cutoff:,.0f} Subscribers")
plt.xlabel("Views Per Video")
sns.displot(data=large, x='avgViews', log_scale=True, kde=True)
# 'large' includes channels exactly at the cutoff, so label it "At Least"
plt.title(f"Views per Video of Channels with At Least {cutoff:,.0f} Subscribers")
plt.xlabel("Views Per Video")


result = stats.ttest_ind(large['avgViews'],small['avgViews'])
display(f"T-test p-value: {result.pvalue}")
display(f"Median Views/Video for Small Creators: {small['avgViews'].median()}")
display(f"Median Views/Video for Large Creators: {large['avgViews'].median()}")


The plots above show that the distributions of views per video differ between small and large creators, and the median views per video differs substantially between the two groups. The t-test gives a p-value of 0.06, which is above the conventional 0.05 threshold, so we cannot reject the null hypothesis of equal mean viewership at that level. Taken together, the evidence suggests that large creators get more views per video than small creators, but the difference falls just short of statistical significance.
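Views per video are strongly right-skewed, which is why the histograms above use a log scale, and a standard t-test on the raw values can be sensitive to that skew. As a hedged sketch of two common alternatives, not the method used above, the example below runs Welch's t-test on log-transformed values and a Mann-Whitney U test. The `small_views` and `large_views` arrays are synthetic stand-ins for `small['avgViews']` and `large['avgViews']`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)

# Synthetic, right-skewed stand-ins for the two groups (illustrative only)
small_views = rng.lognormal(mean=10, sigma=1.5, size=100)
large_views = rng.lognormal(mean=11, sigma=1.5, size=100)

# Welch's t-test on log10(views): compares means on the log scale
# without assuming equal variances (requires strictly positive values)
welch = stats.ttest_ind(np.log10(large_views), np.log10(small_views), equal_var=False)

# Mann-Whitney U test: rank-based, makes no normality assumption
mwu = stats.mannwhitneyu(large_views, small_views, alternative="two-sided")

print(f"Welch t-test on log10(views): t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U: U = {mwu.statistic:.1f}, p = {mwu.pvalue:.3g}")
```

If these tests were applied to the real groups, their p-values would not be expected to match the 0.06 reported above, since they test different hypotheses.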

# Subscribers vs. Video Count

Is the number of subscribers to a YouTube channel proportional to the number of videos that channel has uploaded? If the two are strongly correlated, then a good strategy to grow a YouTube channel would be to churn out a large quantity of videos. If they are weakly correlated or uncorrelated, putting out a large quantity of videos may matter less. Determining whether subscriber count is correlated with the total number of videos uploaded can help content creators formulate a strategy to maximize their channel growth.

In [None]:
sns.lmplot(data=channels, x="Video Count", y="Subscriber Count")
plt.title("Subscriber Count vs. Number of Videos")
result = stats.pearsonr(x=channels['Video Count'], y=channels["Subscriber Count"])

display(f"Correlation coefficient: {result.statistic}")
display(f"p-value: {result.pvalue}")


The plot above suggests at most a weak correlation between the total number of videos a channel has uploaded and its subscriber count. The p-value from the Pearson test is 0.19, so the correlation is not statistically significant. Having more videos therefore does not necessarily correspond to having more subscribers, and video quantity alone is not a reliable way to grow a YouTube channel.
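When reading a Pearson result, it helps to look at the coefficient, its square (the share of variance explained by a linear fit), and the p-value together. The helper below is a minimal sketch under that assumption: `summarize_correlation` is a hypothetical name, not something from the original notebook, and the data is synthetic.

```python
import numpy as np
import scipy.stats as stats

def summarize_correlation(x, y, alpha=0.05):
    """Return the Pearson r, r squared, p-value, and a significance flag."""
    r, p = stats.pearsonr(x, y)
    return {
        "r": float(r),
        "r_squared": float(r) ** 2,
        "p_value": float(p),
        "significant": bool(p < alpha),
    }

rng = np.random.default_rng(2)

# Synthetic, independent stand-ins for 'Video Count' and 'Subscriber Count'
video_count = rng.integers(10, 5000, size=100)
subscriber_count = rng.lognormal(mean=13, sigma=1, size=100)

summary = summarize_correlation(video_count, subscriber_count)
print(summary)
```

On the real data, the call would be `summarize_correlation(channels['Video Count'], channels['Subscriber Count'])`.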

# Subscribers vs. Channel Views


## Analysis Objective
This expanded analysis aims to provide a deeper understanding of the engagement and efficiency of the top YouTube channels by examining their views-to-subscriber ratio, visualizing the distribution of this ratio, identifying the top performers, and exploring the relationship between subscribers and views.
    

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('./channel_data.csv')

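# Calculate the Views per Subscriber Ratio
df['Views_per_Subscriber'] = df['View Count'] / df['Subscriber Count']
# A subscriber count of zero would give an infinite or undefined ratio; drop such rows, if any
# (np is the NumPy alias imported in the first cell)
df = df[np.isfinite(df['Views_per_Subscriber'])]
df_sorted = df.sort_values(by='Views_per_Subscriber', ascending=False)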

df_sorted.head()


In [None]:

# Plotting the distribution of views-to-subscriber ratios
plt.figure(figsize=(10, 6))
sns.histplot(df['Views_per_Subscriber'], bins=30, kde=True)
plt.title('Distribution of Views to Subscriber Ratio')
plt.xlabel('Views per Subscriber')
plt.ylabel('Frequency')
plt.show()


In [None]:

# Displaying the top 10 channels by Views per Subscriber Ratio
top_10 = df_sorted.head(10)
top_10


In [None]:

# Scatter plot of Subscriber Count vs View Count
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df, x='Subscriber Count', y='View Count', size='Views_per_Subscriber', legend=False, sizes=(20, 200))
plt.title('Subscriber Count vs View Count with Views to Subscriber Ratio')
plt.xlabel('Subscriber Count')
plt.ylabel('View Count')
plt.xscale('log')
plt.yscale('log')
plt.grid(True)
plt.show()



## Conclusion
This extended analysis examines how efficiently YouTube channels turn subscribers into views. The views-to-subscriber ratio identifies channels that draw many views relative to the size of their subscriber base. The histogram shows how this ratio is distributed across channels, while the top 10 channels by this metric are those with the highest ratios. The scatter plot shows the relationship between subscriber count and view count on logarithmic axes, with point size indicating each channel's ratio. Overall, the analysis suggests that subscriber count alone does not tell the whole story, and that creating content which resonates with and retains viewers matters for turning an audience into views.
