# The primary objective of this EDA project is to provide guidance for anyone aiming to build a top 1000 YouTube channel.
We will look into the following questions:

Which categories are the most popular?

Which categories are the fastest growing?

Which categories are the highest earning?

In [117]:
# Importing the pandas library and aliasing it as 'pd'
import pandas as pd

# Importing the plotly express module from the plotly library and aliasing it as 'px'
import plotly.express as px

In [118]:
# Loading the csv into pandas DataFrame
df = pd.read_csv("/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv", encoding="latin-1")

In [119]:
# Getting an overview of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

In [120]:
# Finding missing values
df.isnull().sum()

rank                                         0
Youtuber                                     0
subscribers                                  0
video views                                  0
category                                    46
Title                                        0
uploads                                      0
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
lowest_monthly_earnings                      0
highest_monthly_earnings                     0
lowest_yearly_earnings                       0
highest_yearly_earnings                      0
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date 

In [121]:
# Looking for duplicates based on 'Title'
duplicate_rows = df[df.duplicated(subset='Title', keep=False)]
duplicate_rows

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
8,9,Like Nastya,106000000,90479060000.0,People & Blogs,Like Nastya Vlog,493,Russia,RU,People,...,100000.0,2016.0,Jan,14.0,81.9,144373535.0,4.59,107683889.0,61.52401,105.318756
265,266,Preston,24000000,8279004000.0,Gaming,Preston,4009,United States,US,Entertainment,...,200000.0,2012.0,Jul,4.0,88.2,328239523.0,14.7,270663028.0,37.09024,-95.712891
458,459,Like Nastya Vlog,18500000,8147576000.0,People & Blogs,Like Nastya Vlog,493,Russia,RU,People,...,100000.0,2016.0,Jan,14.0,81.9,144373535.0,4.59,107683889.0,61.52401,105.318756
729,730,PrestonPlayz,14600000,5766647000.0,Gaming,Preston,4009,United States,US,Entertainment,...,200000.0,2012.0,Jul,4.0,88.2,328239523.0,14.7,270663028.0,37.09024,-95.712891
783,784,#Refugio Mental,14100000,3594937000.0,Entertainment,#Refugio Mental,846,,,Entertainment,...,,2013.0,Aug,10.0,,,,,,
939,940,#Refï¿½ï¿½ï¿½ï¿½,12700000,2709954000.0,Entertainment,#Refugio Mental,846,,,Entertainment,...,,2013.0,Aug,10.0,,,,,,


In [122]:
# Removing duplicates based on the 'Title' column, keeping the first occurrence
df = df.drop_duplicates(subset='Title')
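
To make the difference between the two calls above concrete, here is a minimal sketch on a made-up frame (the channel names and ranks are hypothetical, not taken from the dataset). `duplicated(keep=False)` flags every row in a duplicate group, while `drop_duplicates` with its default `keep='first'` retains the first row of each group.

```python
import pandas as pd

# Hypothetical toy data: two rows share the same 'Title'
toy = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "Youtuber": ["Alpha", "Beta", "Alpha Vlog", "Gamma"],
    "Title": ["Alpha", "Beta", "Alpha", "Gamma"],
})

# keep=False marks every member of a duplicate group, which is handy for inspection
flagged = toy[toy.duplicated(subset="Title", keep=False)]

# The default keep='first' retains the first occurrence of each 'Title'
deduped = toy.drop_duplicates(subset="Title")

print(flagged["rank"].tolist())
print(deduped["rank"].tolist())
```

Because the table is ordered by rank, keeping the first occurrence means the higher-ranked entry of each duplicate pair survives.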

**Looking at the difference between the columns 'channel_type' and 'category'.
We use 'category' in the analysis because it has more unique values (18 versus 14, not counting missing values), allowing for more detail.**

In [123]:
# Looking at the difference between channel type and channel category
unique_channel_types = df['channel_type'].unique()
unique_categories = df['category'].unique()

print("Unique Channel Types:", unique_channel_types)
print("\nUnique Categories:", unique_categories)

Unique Channel Types: ['Music' 'Games' 'Entertainment' 'Education' 'People' 'Sports' 'Film'
 'News' nan 'Comedy' 'Howto' 'Nonprofit' 'Autos' 'Tech' 'Animals']

Unique Categories: ['Music' 'Film & Animation' 'Entertainment' 'Education' 'Shows' nan
 'People & Blogs' 'Gaming' 'Sports' 'Howto & Style' 'News & Politics'
 'Comedy' 'Trailers' 'Nonprofits & Activism' 'Science & Technology'
 'Movies' 'Pets & Animals' 'Autos & Vehicles' 'Travel & Events']
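
The number of distinct labels in each column can also be counted directly instead of read off the printed arrays. A minimal sketch, assuming a made-up frame with the same two column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "channel_type": ["Music", "Games", "Film", None, "Music"],
    "category": ["Music", "Gaming", "Film & Animation", "Movies", None],
})

# nunique() ignores missing values by default, so they do not inflate the count
distinct_counts = toy[["channel_type", "category"]].nunique()
print(distinct_counts)
```

Unlike `unique()`, which lists `nan` as one of the values, `nunique()` leaves it out of the count.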


**Some rows have low or zero yearly earnings.
Channels with this many views should in theory be earning, so these values are more likely gaps in the data than real figures. Because we will be analysing earnings, we remove these rows.**

In [124]:
# drop rows with less than 100 in earnings
df.drop(df[df['highest_yearly_earnings'] < 100].index, inplace=True)
df.drop(df[df['lowest_yearly_earnings'] < 100].index, inplace=True)
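
The two `drop` calls above can also be expressed as a single boolean mask, which avoids `inplace=True` and states the condition for keeping a row in one place. This is a sketch on hypothetical values, not the notebook's own cell; `MIN_EARNINGS` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "lowest_yearly_earnings": [0.0, 50.0, 1200.0, 90.0],
    "highest_yearly_earnings": [0.0, 800.0, 19000.0, 99.0],
})

MIN_EARNINGS = 100  # same threshold as in the cell above

# Keep a row only if both earnings estimates reach the threshold
keep = (
    (toy["lowest_yearly_earnings"] >= MIN_EARNINGS)
    & (toy["highest_yearly_earnings"] >= MIN_EARNINGS)
)
filtered = toy[keep].copy()

print(filtered["Title"].tolist())
```

One difference to be aware of: a mask built with `>=` also removes rows where earnings are missing, whereas dropping rows with `< 100` keeps them. The null counts above show no missing values in these columns, so the two forms should agree on this dataset.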

**We use the average of highest yearly earnings and lowest yearly earnings as a proxy for yearly earnings**

In [125]:
# Displaying the shape of the DataFrame mainly to show row count
df.shape

(830, 28)

In [126]:
# We don't have data for every year so we decide to average the highest and lowest yearly earnings estimate
# Create a new column 'yearly_earnings' with the average of 'lowest_yearly_earnings' and 'highest_yearly_earnings'
df['yearly_earnings'] = (df['lowest_yearly_earnings'] + df['highest_yearly_earnings']) / 2

**Looking at minimum and maximum 'created_year'**

In [127]:
# finding minimum and maximum 'created_year'
min_created_year = df['created_year'].min()
max_created_year = df['created_year'].max()

print(f"Minimum 'created_year': {min_created_year}")
print(f"Maximum 'created_year': {max_created_year}")

Minimum 'created_year': 1970.0
Maximum 'created_year': 2022.0


**We find a channel with a creation year of 1970.**

In [128]:
# Find the index of the row with the minimum 'created_year'
min_created_year_index = df['created_year'].idxmin()
# display the row
df.loc[min_created_year_index]

rank                                                   102
Youtuber                                           YouTube
subscribers                                       36300000
video views                                   3010784935.0
category                                   News & Politics
Title                                              YouTube
uploads                                                744
Country                                      United States
Abbreviation                                            US
channel_type                                 Entertainment
video_views_rank                                    2860.0
country_rank                                          32.0
channel_type_rank                                      5.0
video_views_for_the_last_30_days                21103000.0
lowest_monthly_earnings                             5300.0
highest_monthly_earnings                           84400.0
lowest_yearly_earnings                             63300

**This one channel has a created year of 1970, which predates YouTube's launch in 2005 and so must be an error. We decide to drop this row because other data in the row could be affected by the same error.**

**For the rest of the data, the values of 'created_year' seem reasonable.**

In [129]:
# Drop the rows with the 'created_year' equal to 1970
df = df[df['created_year'] != 1970]
# this drops 1 row
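
A year of 1970 often appears when a missing timestamp is stored as zero, because Unix time counts from 1 January 1970. Whether that is what happened in this dataset is an assumption; the sketch below only shows why the value looks like a placeholder, and how a plausibility check against YouTube's launch year could be written. The frame and the `YOUTUBE_LAUNCH_YEAR` constant are introduced here for illustration.

```python
import pandas as pd

# Unix timestamp 0 (in seconds) converts to the very start of 1970
epoch_start = pd.to_datetime(0, unit="s")
print(epoch_start.year)

# YouTube launched in 2005, so an earlier creation year is implausible
YOUTUBE_LAUNCH_YEAR = 2005

# Hypothetical toy data, including one missing year
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "created_year": [1970.0, 2006.0, None],
})

# Comparisons with NaN are False, so rows with a missing year are not flagged
implausible = toy[toy["created_year"] < YOUTUBE_LAUNCH_YEAR]
print(implausible["Title"].tolist())
```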

In [130]:
# Displaying the shape of the DataFrame mainly to show row count
df.shape

(829, 29)

**We will be doing analysis using category so we decide to drop rows with missing category values**

In [131]:
# drop rows where category value is missing
df.dropna(subset=['category'], inplace=True)

**Looking over the data, we see some channels with 0 views. This could cause problems in the statistical analysis, so we decide to drop these rows.**

The rows with 0 views seem to be YouTube's own channel categories, which show subscriber count, but views are not shown.

In [132]:
# drop rows where video views are 0
df = df[df['video views'] != 0]

In [133]:
# seeing how many rows are remaining for analysis
df.shape

(795, 29)

# What is the relationship between views and subscribers?
Here we look at the relationship between views and subscribers. The scatter plot shows them to be linearly correlated.

This seems reasonable, and it gives us a clue that the data could be reliable.

In [134]:
# Creating a scatter plot using plotly express (px)
fig1 = px.scatter(
    x=df["subscribers"],       # X-axis data from the "subscribers" column of the DataFrame
    y=df["video views"],       # Y-axis data from the "video views" column of the DataFrame
    color=df["category"],       # Color-coding points based on the "category" column of the DataFrame
    size=df["video views"],     # Adjusting point sizes based on the "video views" column of the DataFrame
    hover_name=df["Title"]      # Adding title information for hover-over tooltips based on the "Title" column
)

# Updating the layout of the plot
fig1.update_layout(
    xaxis=dict(title="Subscribers"),  # Adding a title to the X-axis
    yaxis=dict(title="Views"),        # Adding a title to the Y-axis
    title="Relationship between Views and Subscribers",  # Adding a title to the entire plot
    title_x=0.48                      # Adjusting the horizontal position of the title
)

# Displaying the plot
fig1.show()
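
The visual impression of a linear relationship can be backed up with a number. A minimal sketch, assuming made-up values rather than the real dataset, using `Series.corr`, which computes the Pearson correlation coefficient by default:

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names
toy = pd.DataFrame({
    "subscribers": [10_000_000, 20_000_000, 30_000_000, 40_000_000],
    "video views": [4.0e9, 9.0e9, 1.1e10, 1.7e10],
})

# Pearson's r: close to 1 means a strong positive linear relationship
pearson_r = toy["subscribers"].corr(toy["video views"])
print(round(pearson_r, 3))
```

Running the same call on `df["subscribers"]` and `df["video views"]` would give the coefficient for the cleaned data.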

# Scatterplot showing the relationship between earnings and subscribers

In [135]:
# Creating another scatter plot using plotly express (px)
fig2 = px.scatter(
    x=df["subscribers"],           # X-axis data from the "subscribers" column of the DataFrame
    y=df["yearly_earnings"],       # Y-axis data from the "yearly_earnings" column of the DataFrame
    color=df["category"],           # Color-coding points based on the "category" column of the DataFrame
    size=df["yearly_earnings"],     # Adjusting point sizes based on the "yearly_earnings" column of the DataFrame
    hover_name=df["Title"]          # Adding title information for hover-over tooltips based on the "Title" column
)

# Updating the layout of the second plot
fig2.update_layout(
    xaxis=dict(title="Subscribers"),   # Adding a title to the X-axis
    yaxis=dict(title="Earnings"),        # Adding a title to the Y-axis
    title="Relationship between Earnings and Subscribers",
    title_x=0.48
)
# Displaying the plot
fig2.show()

In [136]:
# Define a pleasant color scheme of 20 distinct colors
colors = [
    'skyblue', 'lightgreen', 'lightcoral', 'gold', 'lightskyblue',
    'lightpink', 'palegreen', 'lightblue', 'salmon', 'lightsalmon',
    'lightseagreen', 'palegoldenrod','paleturquoise', 'lightsteelblue',
    'lightcyan', 'thistle', 'plum', 'khaki', 'peachpuff', 'aliceblue'
]

# Distribution of channels by category

In [137]:
# Calculating the count of each category and sorting them in ascending order
category = df['category'].value_counts().sort_values(ascending=True)

# Creating a pie chart using plotly express (px)
fig3 = px.pie(
    values=category.values,          # Values represent the count of each category
    names=category.index             # Names represent the category labels
)

# Updating trace properties for the pie chart
fig3.update_traces(
    textposition='inside',           # Placing text labels inside the pie chart
    textinfo='percent+label'         # Displaying both percentage and label information
)

# Updating the layout of the pie chart
fig3.update_layout(
    title_text="Distribution of Categories",  # Adding a title to the pie chart
    title_x=0.49,                            # Adjusting the horizontal position of the title
    uniformtext_minsize=10,                 # Setting the minimum size for uniform text
    showlegend=False                        # Hiding the legend
)

# Displaying the pie chart
fig3.show()

In [138]:
# Get the top 12 channel categories
top12_channel_categories = df['category'].value_counts().head(12)

# Plotting the bar chart using plotly.express
fig4 = px.bar(
    x=top12_channel_categories.index,
    y=top12_channel_categories.values,
    color=top12_channel_categories.index,  # Color the bars based on channel categories
    text=top12_channel_categories.values,  # Display the count values on top of the bars
    title='12 Most Popular Channel Categories',
    labels={'x': 'Category', 'y': 'Number of Channels'}
)

# Customize the layout
fig4.update_layout(
    title_x=0.5,  # Center the title above the plot
    font=dict(size=18),  # Set the font size
    showlegend=False,  # Hide the legend
    xaxis={'categoryorder': 'total ascending'}  # Sort x-axis categories in ascending order
)

# Display the plot
fig4.show()

# Which categories are growing the fastest on average?
We see that entertainment and music are popular, but which categories are growing the fastest on average?

In [139]:
# Calculate Channel Age
df['channel_age'] = 2023 - df['created_year']

# Calculate Average Subscribers growth per Year (asgpy)
df['asgpy'] = df['subscribers']/df['channel_age']
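
Note that `asgpy` is a lifetime average (total subscribers divided by years since creation), not a measure of recent growth. A channel created in the reference year would have an age of zero and an infinite rate; the maximum `created_year` found earlier is 2022, so that case does not arise here. The sketch below, on hypothetical values, shows one way to guard against it. `REFERENCE_YEAR` and `safe_age` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2023  # same reference year as in the cell above

# Hypothetical toy data; the last channel was created in the reference year
toy = pd.DataFrame({
    "Title": ["Old", "New", "Brand new"],
    "subscribers": [36_000_000, 20_000_000, 15_000_000],
    "created_year": [2005.0, 2021.0, 2023.0],
})

toy["channel_age"] = REFERENCE_YEAR - toy["created_year"]

# Replace a zero age with NaN so the division yields NaN rather than inf
safe_age = toy["channel_age"].replace(0, np.nan)
toy["asgpy"] = toy["subscribers"] / safe_age

print(toy[["Title", "channel_age", "asgpy"]])
```

A younger channel with fewer subscribers can still have the higher `asgpy`, which is the effect the metric is meant to capture.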

In [140]:
# Calculate the mean average subscriber growth per year (asgpy) for each category
category_growth = df.groupby('category')['asgpy'].mean().reset_index()

# Calculate the overall mean growth for all categories
overall_mean_growth = df['asgpy'].mean()

# Calculate the percentage growth relative to the overall mean
category_growth['percentage_growth'] = (category_growth['asgpy'] / overall_mean_growth) * 100

# Sort the categories by percentage growth in ascending order; head(20) keeps all of them, as there are fewer than 20 categories
category_growth = category_growth.sort_values(by='percentage_growth', ascending=True).head(20)

# Plotting the bar chart using plotly.express
fig5 = px.bar(category_growth, x='category', y='percentage_growth', color='category',
             color_discrete_sequence=colors, labels={'percentage_growth': 'Percentage Growth Relative to Overall Mean (%)'},
             title='Category-wise Average Subscribers Growth')

# Center the title above the bar chart
fig5.update_layout(title=dict(x=0.5), showlegend=False)

# Rotate x-axis labels for better readability
fig5.update_layout(xaxis=dict(tickangle=45))

# Set y-axis limits
fig5.update_layout(yaxis=dict(range=[0, 200]))

# Show the plot
fig5.show()
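
To show how the plotted percentage is derived, here is a small worked example with hypothetical numbers. A value of 100 means a category grows at the overall average rate, and 200 means twice that rate.

```python
import pandas as pd

# Hypothetical toy data: three 'Music' channels and one 'Shows' channel
toy = pd.DataFrame({
    "category": ["Music", "Music", "Music", "Shows"],
    "asgpy": [1_000_000.0, 2_000_000.0, 3_000_000.0, 6_000_000.0],
})

# Mean growth per category
category_mean = toy.groupby("category")["asgpy"].mean()

# Overall mean is taken over channels, not over categories
overall_mean = toy["asgpy"].mean()

# Express each category mean as a percentage of the overall mean
relative_pct = category_mean / overall_mean * 100
print(relative_pct.round(2))
```

Because the overall mean is taken over channels, large categories pull it towards their own level, and a small category with a few fast-growing channels can land far above 100. Any bar above 200 would also be cut off by the y-axis range set in the cell above.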

# How do categories compare to each other in terms of earnings?

In [141]:
# Group by 'category' and calculate the mean of the highest yearly earnings estimate for each category
category_earnings = df.groupby('category')['highest_yearly_earnings'].mean().reset_index()

# Sort the categories based on yearly earnings in ascending order
sorted_categories_earnings = category_earnings.sort_values(by='highest_yearly_earnings', ascending=True)

# Keep up to 20 categories (all of them here, as there are fewer than 20)
top_categories_earnings = sorted_categories_earnings.head(20)

# Plot the bar chart using plotly.express
fig6 = px.bar(top_categories_earnings, x='category', y='highest_yearly_earnings', 
             color='category', color_discrete_sequence=colors,
             labels={'highest_yearly_earnings': 'Highest Yearly Earnings'},
             title='Highest Yearly Earnings Average by Category')

# Center the title above the bar chart
fig6.update_layout(title=dict(x=0.5), showlegend=False)

# Rotate x-axis labels for better readability
fig6.update_layout(xaxis=dict(tickangle=45))

# Show the plot
fig6.show()

In [142]:
# Create a box plot using plotly.express
fig7 = px.box(df, x='category', y='yearly_earnings', color='category',
             labels={'yearly_earnings': 'Yearly Earnings', 'category': 'Category'},
             title='Approximate Yearly Earnings by Category')

# Center the title above the box plot
fig7.update_layout(title=dict(x=0.5), showlegend=False)

# Customize the layout for better readability
fig7.update_layout(xaxis=dict(tickangle=45))

# Show the plot
fig7.show()

# Conclusion
The data indicates that 'Shows' and 'Pets & Animals' are the fastest growing categories.

They are also the highest earning.

They are relatively niche (less than 2% of top 1000).

The data indicates that 'Shows' or 'Pets & Animals' would be strong categories to choose if aiming for a top 1000 YouTube channel.

The results could be impacted by insufficient data, as we only have 12 channels in the 'Shows' category and 2 channels in the 'Pets & Animals' category.
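
One way to keep the small-sample caveat visible is to report the number of channels next to each category mean. The sketch below uses made-up earnings, and `MIN_CHANNELS` is a hypothetical cut-off chosen only for illustration, not a threshold used in this analysis.

```python
import pandas as pd

# Hypothetical toy data: a large category and a very small one
toy = pd.DataFrame({
    "category": ["Music"] * 5 + ["Pets & Animals"] * 2,
    "yearly_earnings": [1.0e6, 2.0e6, 3.0e6, 4.0e6, 5.0e6, 2.0e6, 18.0e6],
})

# Show the group size alongside the mean so thin categories stand out
summary = (
    toy.groupby("category")["yearly_earnings"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)

MIN_CHANNELS = 5  # hypothetical minimum group size for a stable average

# Categories with enough channels for the mean to be reasonably stable
reliable = summary[summary["count"] >= MIN_CHANNELS]

print(summary)
print(reliable.index.tolist())
```

In this toy example the top-ranked category rests on only two channels, one of which accounts for most of the mean, which is the situation the caveat describes.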