<a href="https://www.kaggle.com/code/bencaiello/top-1000-yt-ers-w-language-translation-english?scriptVersionId=143687055" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()


# Install translation package (only need to do once!)
try:
    import translators as ts
except ImportError:
    !pip install translators
    import translators as ts

import warnings
warnings.simplefilter('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        filepath = (os.path.join(dirname, filename))


Note that this first step may run slowly if you copy this notebook, because the translation package is downloaded with pip. This should only occur on the first run of the notebook.

# Introduction & First Look at Data!

YouTube is a major social media platform dedicated to allowing users to publish, share, and watch video content. This data set contains data about the top 1000 users of YouTube -- the top 1000 channels -- including the number of subscribers to each channel, as well as the average engagement per video (visits/views, likes, comments) that a channel gets. 

Additionally, many (but not all) channels have a known country of origin and/or category of content that they produce.

In this notebook, we will primarily plot how total subscriber count relates to the country / category of content produced by a given top 1000 YouTube channel, but also explore some other interesting questions -- like how do subscriber counts relate to engagement per video? & Do 20% of the top 1000 channels have 80% of the subscribers?

In this first step, I display some of the data and information about the uncleaned dataframe. Some data cleaning / rearranging then takes place inside the hidden code block to address some of the issues I found. I don't go into the details of that in this visible text, but if you are interested, look inside the hidden code block to see what I did and why!
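As a rough illustration of one of those cleaning steps, the sketch below uses made-up toy rows (not the real dataset) to show why a whole-row duplicate check finds nothing when every rank is unique, and how restricting the check to `Username` catches the repeat. The values are assumptions chosen only for the example.

```python
import pandas as pd

# Toy data: the second "ChannelA" row repeats a username under a different rank.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Username": ["ChannelA", "ChannelB", "ChannelA", "ChannelC"],
    "Suscribers": [300, 200, 290, 100],
})

# Whole-row comparison finds nothing, because every Rank is unique.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the check to Username flags the second occurrence.
username_dupes = toy[toy["Username"].duplicated()]

# Keep the first occurrence of each username and renumber the rows.
deduped = toy.drop_duplicates(subset="Username", keep="first").reset_index(drop=True)

print(full_row_dupes)                # 0
print(list(username_dupes["Rank"]))  # [3]
print(len(deduped))                  # 3
```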

In [None]:
file = pd.read_csv(filepath)

# Inspect the nature of the data in each column, the number of null values and the datatypes of the columns
display(file.head())
file.info()  # info() prints its report directly and returns None, so display() is not needed

# Check duplicates. Note that simply checking duplicates on the overall dataframe will fail because the rank of every entry is unique.
duplicate_users = file[file['Username'].duplicated()]

# Remove triple quotes to see the duplicated entries:
'''
for i in duplicate_users['Username']:
    print(file[file['Username'] == i])
'''

# Drop all duplicated entries. Note some have varied numerical values. Just dropped the second occurrence of each duplicate.
file = file.drop(duplicate_users.index)
file = file.reset_index(drop=True)

#  The numerical columns do not need to be floats as they represent discrete units. 
# Note: trying to read these columns (3,5-7) directly as 'int' type in the pd.read_csv call did not succeed.
for i in file.columns:
    try:
        file[i] = file[i].astype('int')
    except (ValueError, TypeError):
        # Non-numeric columns (such as Username or Country) are left unchanged.
        pass

# The Country column, although having no null values has 'Unknown' values
# Will convert the nulls in categories to 'Unknown' for consistency.
# Do not want to drop nulls because there are too many in the categories column. Instead Unknown will be its own category.
file['Categories'] = file['Categories'].fillna('Unknown')

# Additionally, I will drop the links column, as it does not provide useful information for most visualizations.
file = file.drop('Links',axis = 1)

# I will make a copy of the dataframe for use later (under the correlation header)
fresh_file = file.copy()

# I will also create a column of subscribers in millions (easier to plot)
file['Subscribers (millions)'] = file['Suscribers'] / 1000000

# Translation of Category and Country columns to English

I want all the country and category names to be in English, but along the way, I also convert the category column into a column of lists (instead of a column of comma-separated strings). This change may be useful later when dealing with the fact that each channel can have more than one category!
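The general idea can be sketched without calling any translation service. In the toy example below, the Spanish labels, the `fake_translations` lookup table, and the `translate_batch` helper are all hypothetical stand-ins; only the split, deduplicate, translate once, and map back pattern is the point.

```python
import pandas as pd

# Toy Spanish category strings; some rows carry two categories.
categories = pd.Series(["Música y baile", "Videojuegos, Humor", "Humor"])

# Turn each comma-separated string into a list of stripped labels.
as_lists = categories.apply(lambda s: [part.strip() for part in s.split(",")])

# Unique labels, sorted so the order is reproducible.
unique_labels = sorted({label for labels in as_lists for label in labels})

# Hypothetical stand-in for the translation service: a fixed lookup table.
fake_translations = {
    "Humor": "Humor",
    "Música y baile": "Music and dance",
    "Videojuegos": "Video games",
}

def translate_batch(labels):
    """Pretend to translate all labels in one request (illustration only)."""
    return [fake_translations[label] for label in labels]

# One batched call, then a dictionary from original label to translation.
mapping = dict(zip(unique_labels, translate_batch(unique_labels)))

translated = as_lists.apply(lambda labels: [mapping[label] for label in labels])
print(translated.tolist())
# [['Music and dance'], ['Video games', 'Humor'], ['Humor']]
```

Translating only the unique labels keeps the number of requests small, which matters when the service limits how often it can be queried.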

In [None]:
# I am a monolingual English speaker -- so let's convert the country / category names into English!
# Link to package:  https://pypi.org/project/translate-api/
# Note that documentation at the provided link is inaccurate!

# First, identify the categories in the Categories columns. Note that some channels are more than one category.
cat_list = []
for i in file['Categories']:
    split_list = []
    split_list = i.split(',')
    for j in split_list:
        j = j.strip()
        cat_list.append(j)

# Now I isolate the unique categories, by using the set datatype:
cat_set = set(cat_list)


# Next, I convert the unique categories into a single string of each category, separated by commas.
# This is important to reduce the number of queries to the translation tool / URL to a minimum.
# The translate_text function throws an error if you query the same URL more than ~7 times in rapid succession.
cat_list = list(cat_set)
cat_str = ''
for i in cat_list:
    cat_str = cat_str + i + ','
    
# Now we translate:
english = ts.translate_text(cat_str)

# Then undo the single string back into a list:
english_list = english.split(',')

# Then make a dictionary to match the Spanish phrases to the English translations.
trans_dict = {}
for i,ii in enumerate(cat_list):
    trans_dict[ii] = english_list[i]
    
# Next we translate the column using the dictionary:
translation_list = []
for i in file['Categories']:
    entry = i.split(',')
    entry_list = []
    for j in entry:
        j = j.strip()
        j = trans_dict[j]
        entry_list.append(j)
    translation_list.append(entry_list)
file['Categories'] = translation_list

# Countries are much simpler, as they only have singular values per entry:
# A similar process is followed as above, with fewer steps. 
# Consult the comments above if you want to understand why each step is taken
country_list = list(file['Country'].unique())
country_str = ''
for i in country_list:
    country_str = country_str + i + ','
english_c = ts.translate_text(country_str)
english_list_c = english_c.split(',')
trans_dict = {}
for i,ii in enumerate(country_list):
    trans_dict[ii] = english_list_c[i]
file['Country'] = file['Country'].replace(trans_dict)

display(file.head())

# Make dummy variables for every category & begin to plot!

I add columns for each of the ~24 categories, with each channel receiving a value of 0 if they aren't in that category or 1 if they are. 

This is useful for plotting the categories later, and for containing all the information in the category column, including channels with more than one category, while also not trying to change the shape of the dataframe.
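For readers who prefer a vectorized route, one possible alternative (not the approach used in this notebook) is to combine `explode` with `pd.get_dummies`. The toy frame below is made up for the illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Username": ["A", "B", "C"],
    "Categories": [["Music"], ["Video games", "Humor"], ["Humor"]],
})

# explode() gives one row per (channel, category) pair; get_dummies() one-hot
# encodes those labels; grouping by the original index and taking the max
# collapses the pairs back to one row per channel.
dummies = (
    pd.get_dummies(toy["Categories"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)

# The dataframe keeps its shape: one row per channel, one new column per category.
wide = toy.join(dummies)
print(wide)
```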

Next I begin plotting the distribution of the numerical variables in the dataset, to see if they are normally distributed and whether a logarithmic transformation of the data might be useful when plotting / analyzing the data.
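A quick numerical check can complement the box plots. The sketch below uses made-up, heavy-tailed values (not the real data) and compares the sample skewness before and after a log transform; the `0.01` offset is an assumed choice that keeps zero values finite.

```python
import numpy as np
import pandas as pd

# Made-up, heavy-tailed values standing in for a per-video engagement column.
likes = pd.Series([0, 120, 450, 900, 3_000, 25_000, 1_200_000], dtype=float)

# Sample skewness: strongly positive when there is a long right tail.
raw_skew = likes.skew()

# log(0) is undefined, so a small offset is added first (an assumed choice;
# np.log1p is a common alternative).
log_likes = np.log(likes + 0.01)
log_skew = log_likes.skew()

print(round(raw_skew, 2), round(log_skew, 2))
```

If the skewness shrinks substantially in magnitude after the transform, plotting on a log scale is likely to be more informative.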

In [None]:
# Make dummy variable columns!
english_list = english_list[0:-1]
for i in english_list:
    T_F_list = []
    for j in file['Categories']:
        if i in j:
            T_F_list.append(1)
        else:
            T_F_list.append(0)
    file[i] = T_F_list
display(file.head())

# Rescale likes / comments / visits and add a small offset so zero values can later be plotted as log-values:
file['Likes'] = (file['Likes'] + 0.01) / 10000
file['Comments'] = (file['Comments'] + 0.001) / 1000
file['Visits'] = (file['Visits'] + 0.001) / 10000

# Look at distribution of numerical variables:
dist_list = ['Subscribers (millions)','Likes','Comments','Visits']
dist_df = file[dist_list]

sns.boxplot(dist_df)
plt.title('Distribution of Subs (in millions), Likes (10,000s), Comments (1000s), & Visits (10,000s)')
plt.show()

print('\n The data is extremely skewed! \n')

sns.boxplot(dist_df)
plt.title('Distribution of Subs (in millions), Likes (10,000s), Comments (1000s), & Visits (10,000s) on Log scale')
plt.yscale('log')
plt.show()

print('\n Will use log distribution when plotting likes/comments! \n')



# Categories: Counts and AVG Subscribers by category

Now I will plot the number of channels in each category, the total number of subscribers represented in each category, and the average number of subscribers per channel in each category.

Note that since each channel can have more than one category, the numbers of subscribers from all the categories added up should be greater than the overall number of subscribers of the channels of the dataset.
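A tiny made-up example shows the double counting: the channel with 50 million subscribers belongs to both categories, so it contributes to both category totals.

```python
import pandas as pd

toy = pd.DataFrame({
    "Subscribers (millions)": [100.0, 50.0, 30.0],
    "Music": [1, 1, 0],
    "Humor": [0, 1, 1],
})

overall_total = float(toy["Subscribers (millions)"].sum())

# Each category total sums the subscribers of every channel flagged for it.
category_totals = {
    cat: float(toy.loc[toy[cat] == 1, "Subscribers (millions)"].sum())
    for cat in ["Music", "Humor"]
}

print(overall_total)                  # 180.0
print(category_totals)                # {'Music': 150.0, 'Humor': 80.0}
print(sum(category_totals.values()))  # 230.0
```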

Additionally, the total number of subscribers is >20,000 million (or in other words, >20 billion). Since the total world population is around 8 billion, this indicates a lot of overlapping subscribers (people subscribed to more than one of the top 1000 channels), duplicate accounts, and/or fake accounts inflating the numbers seen here! Likely a combination of all of these, especially the 

In [None]:
# Make a data frame with all the categories of channel
# Note that this has to be a new dataframe, as it will be longer than 1000

cat_counts = pd.DataFrame()
cat_list = []
for i in file['Categories']:
    for j in i:
        cat_list.append(j)       
cat_counts['Categories'] = cat_list
cat_count_ordered_list = cat_counts['Categories'].value_counts().index

sns.countplot(x=cat_counts['Categories'], order = cat_count_ordered_list)
plt.xticks(rotation=90,size = 10)
plt.title('Channel Counts of Each Category')
plt.xlabel(None)
plt.show()


loop_dict = {}
# Column positions 10 through 32 are used here for the category dummy variables.
for col in file.columns[10:33]:
    loop_dict[col] = file[file[col] == 1]['Subscribers (millions)']
loop_dict['Overall'] = file['Subscribers (millions)']

cats = pd.DataFrame(loop_dict)
cat_mean_order = cats.mean().sort_values(ascending = False).index
cat_total_order = cats.sum().sort_values(ascending = False)
ov_mean = loop_dict['Overall'].mean()

plt.bar(cat_total_order.index, cat_total_order)
plt.xticks(rotation = 90, size = 10)
plt.title('Total Subscribers from all Channels for a given Category')
plt.ylabel('Total Subscribers')
plt.show()


sns.barplot(cats, order = list(cat_mean_order),color = 'r')
plt.xticks(rotation = 90, size = 10)
plt.title('Average Subscribers per Channel by Category')
plt.ylabel('AVG Subscribers (Millions)')
plt.hlines(ov_mean,xmin = -1, xmax = 23, linestyles = 'dashed' )
plt.annotate('Avg subs',xy=(20,ov_mean + 1),size = 10)
plt.show()

**Only five categories outperform the overall average: Toys, Music & Dance, Education, Video Games, and Animation!**

# Now by Country!

Here I repeat the analysis from the category section above, this time by country. Note that since no channel has more than one country, the per-country totals ought to add up to the overall number of subscribers across the 1000 channels.
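The same three summaries (count, total, and mean) can be sketched in one `groupby` call. The rows below are made up for the illustration; the last lines check that the per-country totals add up to the overall total.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", "United States", "India", "Unknown"],
    "Subscribers (millions)": [200.0, 150.0, 90.0, 60.0],
})

# Channel count, total subscribers, and mean subscribers for each country.
per_country = (
    toy.groupby("Country")["Subscribers (millions)"]
    .agg(["count", "sum", "mean"])
    .sort_values("sum", ascending=False)
)
print(per_country)

# Every channel belongs to exactly one country, so nothing is double counted.
totals_match = bool(per_country["sum"].sum() == toy["Subscribers (millions)"].sum())
print(totals_match)  # True
```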

In [None]:
country_counts = file['Country'].value_counts().index

sns.countplot(x=file['Country'], order = country_counts)
plt.xticks(rotation=90, size = 10)
plt.title('Channel Counts of Each Country')
plt.xlabel(None)
plt.show()

total_subs_per_country = file.groupby('Country')['Subscribers (millions)'].sum().sort_values(ascending = False)

plt.bar(x = total_subs_per_country.index, height = total_subs_per_country)
plt.xticks(rotation = 90, size = 10)
plt.title('Total Subscribers from all Channels in a Given Country')
plt.ylabel('Total Subscribers (millions)')
plt.show()

avg_subs_per_country_list = file.groupby('Country')['Subscribers (millions)'].mean().sort_values(ascending = False).index

sns.barplot(file, x = 'Country', y = 'Subscribers (millions)', order = list(avg_subs_per_country_list),color = 'r')
plt.xticks(rotation = 90, size = 10)
plt.title('Average Subscribers per Channel by Country')
plt.ylabel('AVG Subscribers (Millions)')
plt.hlines(ov_mean,xmin = -1, xmax = 28, linestyles = 'dashed' )
plt.annotate('Avg subs',xy=(22,ov_mean + 1),size = 10)
plt.show()



**Once again, only a few countries exceed the average subs / channel! The large number of channels from India and from 'Unknown' countries, together with their relatively high subscriber counts, is likely pulling up the overall average.**

# Test the Pareto Distribution! Do 20% of the channels have 80% of the Subscribers?


The Pareto distribution describes situations in which the top few percent possess or produce the majority of a given resource -- perhaps in this case, something like YT subscribers?

One name for this is the "80-20" rule: 80% of the given resource (the principle was originally discovered in terms of wealth) is held by only 20% of people. 

[Follow this link to see a discussion of this principle, & as a source!](https://dlab.berkeley.edu/news/explaining-80-20-rule-pareto-distribution#:~:text=The%20Pareto%20distribution%20is%20a%20power%2Dlaw%20probability%20distribution%2C%20and,sloped%20(see%20Figure%201).)
You will also see at this link that the general shape of the Pareto distribution -- high values at the start that rapidly drop off -- has shown up throughout the earlier visualizations of subscriber count across categories / countries. Here, though, I look at the subscribers of the individual channels, not the categories / countries.
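The 80-20 check itself reduces to a short calculation. The helper below, `top_share`, is a hypothetical name introduced for this sketch, and both input lists are made-up examples rather than the channel data.

```python
import numpy as np

def top_share(values, top_fraction=0.2):
    """Share of the total held by the largest `top_fraction` of entries."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]
    n_top = int(round(len(ordered) * top_fraction))
    return ordered[:n_top].sum() / ordered.sum()

# Made-up example: 2 of 10 holders own 80 of 100 units, an exact 80-20 split.
pareto_like = [50, 30, 5, 4, 3, 2, 2, 2, 1, 1]
# Made-up example: an even split, where the top 20% hold only 20%.
even = [10] * 10

print(top_share(pareto_like))  # 0.8
print(top_share(even))         # 0.2
```

A result near 0.8 would support the 80-20 rule, while a result near 0.2 would indicate an even spread.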

In [None]:
# Recall that 6 duplicate entries were dropped, so instead of 1000 [999] index, and 200 [199] index for the total
# length of the dataframe & the 20% mark, respectively, I use [993] and [198].

plt.bar(x=file.index,height=file['Subscribers (millions)'],edgecolor = 'b',color = 'b')
plt.vlines(198, ymin = 0, ymax = 250,color = 'r')
plt.annotate('Top 20%',xy = (10,250), size = 10)
plt.annotate('Lower 80%',xy = (250,250), size = 10)
plt.title('Distribution of Subs in order of rank')
plt.ylabel('Subscribers (millions)')
plt.xlabel('Rank')
plt.show()

file['Subscribers cumulative (mil)'] = file['Subscribers (millions)'].cumsum()

sns.displot(y = file['Subscribers cumulative (mil)'], kind = 'ecdf')
plt.vlines(0.198, ymin = 0, ymax = file['Subscribers cumulative (mil)'][993],color = 'r')
plt.annotate('Top 20%',xy = (0,21000), size = 10)
plt.annotate('Lower 80%',xy = (0.21,21000), size = 10)
plt.xlabel('Proportion of Channels included in Cumulative Subscribers')
plt.title('Cumulative Subscribers over top 1000 YT channels')
plt.show()

twenty = file['Subscribers cumulative (mil)'][198]
total = file['Subscribers cumulative (mil)'][993]
eighty = total - twenty

print('Number of Subscribers from top 200 channels (in millions): ', round(twenty, 3))
print('% of total: ', round(twenty / total * 100, 1))

**Answer:**

No! Here the top 20% of the top 1000 YT channels have 40% of the subscribers, not 80%!

# What about per video engagement? Does it relate to Subscriber count?

Let's look at how engagement per video tracks with total subscribers!

However, a number of channels have 0 average visits, likes, and/or comments per video. While this might be possible for comments (some videos disable comments), it seems improbable that any of the top 1000 channels would have zero views / video or zero likes / video. 
I will drop all rows where there are zero visits, as these channels all also had 0 likes / comments. This still leaves the rows with non-zero visits / comments but zero likes. For now I will keep them, as they only represent a small number (28) of the entries in the dataset.

Note that subscribers are in millions, likes and visits are in units of 10,000s, and comments are in units of 1,000s before the log transformation.
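The filtering logic described above can be sketched on a made-up toy frame: rows with zero visits are dropped, remaining zeros are kept, and a small offset (an assumed value of `0.01` here) keeps the logarithm finite.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Username": ["A", "B", "C", "D"],
    "Visits": [120_000, 0, 45_000, 8_000],
    "Likes": [5_000, 0, 0, 300],
    "Comments": [200, 0, 15, 0],
})

# Rows with zero visits are treated as a likely collection problem and dropped.
kept = toy[toy["Visits"] > 0].copy()

# Remaining zeros (likes or comments) are kept; a small offset keeps log finite.
for col in ["Visits", "Likes", "Comments"]:
    kept["log " + col] = np.log(kept[col] + 0.01)

zero_like_rows = int((kept["Likes"] == 0).sum())
print(len(kept), zero_like_rows)  # 3 1
```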

In [None]:
# Plot these columns after log transformation
file['log Likes'] = np.log(file['Likes'])
file['log Comments'] = np.log(file['Comments'])
file['log Subs'] = np.log(file['Subscribers (millions)'])
file['log Visits'] = np.log(file['Visits'])

# Recall I made a copy of the file ('fresh_file') after the initial cleaning steps. 
# Here I use it to slice the transformed dataframe, as the transformed dataframe does not have 0 values for the log-transformed columns:
zero_visits = file[fresh_file['Visits'] == 0]
display(zero_visits)

# As it turns out all channels with 0 average views also have 0 likes / comments.
# This looks like a data scraping / collection problem (perhaps?), so I will drop all these rows when doing correlations:
file = file.drop(zero_visits.index)
fresh_file_no_zero = fresh_file.drop(zero_visits.index)

zero_likes = file[fresh_file_no_zero['Likes'] == 0]
zero_likes.shape

In [None]:
corr = np.corrcoef(file['log Likes'],file['log Subs'])
correlation = 'corr = ' + str(round(corr[0][1],2))


sns.regplot(file,x='log Subs',y = 'log Likes', ci = None, line_kws = {'color':'r', 'linestyle':'dashed'})
plt.title('Likes per video vs. Subscribers')
plt.annotate(correlation, xy = (5,0.1), size = 10)
plt.show()

corr = np.corrcoef(file['log Comments'],file['log Subs'])
correlation = 'corr = ' + str(round(corr[0][1],2))

sns.regplot(file,x='log Subs',y = 'log Comments', ci = None, line_kws = {'color':'r', 'linestyle':'dashed'})
plt.title('Comments per video vs. Subscribers')
plt.annotate(correlation, xy = (4.5,-7.5), size = 10)
plt.show()

corr = np.corrcoef(file['log Visits'],file['log Subs'])
correlation = 'corr = ' + str(round(corr[0][1],2))

sns.regplot(file,x='log Subs',y = 'log Visits', ci = None, line_kws = {'color':'r', 'linestyle':'dashed'})
plt.title('Visits per video vs. Subscribers')
plt.annotate(correlation, xy = (4.75,2), size = 10)
plt.show()


corr = np.corrcoef(file['log Comments'],file['log Likes'])
correlation = 'corr = ' + str(round(corr[0][1],2))

sns.regplot(file,x='log Likes',y = 'log Comments', ci = None, line_kws = {'color':'r', 'linestyle':'dashed'})
plt.title('Comments per video vs. Likes per video (logarithmic scale)')
plt.annotate(correlation, xy = (-12,-10), size = 10)
plt.show()

corr = np.corrcoef(file['log Visits'],file['log Likes'])
correlation = 'corr = ' + str(round(corr[0][1],2))

sns.regplot(file,x='log Likes',y = 'log Visits', ci = None, line_kws = {'color':'r', 'linestyle':'dashed'})
plt.title('Visits per video vs. Likes per video')
plt.annotate(correlation, xy = (-12,-3), size = 10)
plt.show()


**Per Video Likes and Visits / Comments correlate well with each other -- but these do not strongly correlate with subscriber count!**

While it might be expected that the most successful channels (in terms of subscribers) would also be the most successful in terms of per video engagement, this does not seem to be the case. There could be a number of reasons for this discrepancy:

-- There continues to be the possibility that there is an issue with the underlying data, as the presence of top 1000 channels with an average of 0 likes per video seems fairly improbable. However, even apart from the few channels with 0 values for likes, there does not seem to be a strong trend between subscribers and visits, nor with the other two engagement metrics.

-- For comments specifically, some channels may, by default, prohibit comments. This guarantees a 0 comment count regardless of other metrics of engagement or of video quality. 

-- Some channels may specialize in large volumes of videos, where high engagement per video is less important to the overall subscriber count. For example, a channel that specializes in many short sports clips might have limited engagement per video while still having many sports-enthusiast subscribers. If many sports are represented in the clips, then a given subscriber is likely to only engage with a few videos in the sport of their interest and not the rest of the channel's videos.

I'm interested in whether this lack of correlation between a channel's subscribers and its per video engagement holds for all categories of video, or whether certain categories do show evidence of a relationship between these metrics.
Let's look a bit deeper!


# Engagement and Correlations to Subscribers by category!

Maybe subscribers and engagement don't correlate well in the overall dataset, but perhaps this is an effect of video category?
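One way this question might be approached is sketched below, using made-up log-scale values and two made-up dummy columns. The `corr_by_category` helper and its `min_rows` cutoff are hypothetical, introduced only for this illustration; this is not the analysis planned for this section.

```python
import numpy as np
import pandas as pd

# Made-up log-scale values plus two category dummy columns.
toy = pd.DataFrame({
    "log Subs":  [3.0, 3.5, 4.0, 4.5, 3.2, 3.8, 4.1, 4.9],
    "log Likes": [1.0, 1.6, 2.1, 2.4, 2.5, 0.4, 1.9, 1.0],
    "Music":     [1, 1, 1, 1, 0, 0, 0, 0],
    "Humor":     [0, 0, 0, 0, 1, 1, 1, 1],
})

def corr_by_category(df, categories, x="log Subs", y="log Likes", min_rows=3):
    """Pearson correlation of x and y within each category's channels."""
    rows = {}
    for cat in categories:
        subset = df[df[cat] == 1]
        if len(subset) >= min_rows:  # skip categories too small to be meaningful
            rows[cat] = {"n": len(subset), "corr": subset[x].corr(subset[y])}
    return pd.DataFrame.from_dict(rows, orient="index")

result = corr_by_category(toy, ["Music", "Humor"])
print(result.round(2))
```

In this toy data the two categories show opposite trends, which is the kind of pattern a pooled correlation could hide.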

In [None]:
# Planned!