<a href="https://www.kaggle.com/code/isaachrad/eda-on-global-youtube-statistics-2023?scriptVersionId=144051387" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Table of Contents

* [1. Importing and Reading Data](#Section-one)
* [2. First glance at Data](#Section-two)
* [3. Data Cleaning](#Section-three)
    * [3.1 Columns Review](#subsection-one-of-section-3)
    * [3.2 Remove unwanted columns](#subsection-two-of-section-3)
    * [3.3 Duplicates Review](#subsection-three-of-section-3)
    * [3.4 Missing Values Review](#subsection-four-of-section-3)
    * [3.5 Dtype Review](#subsection-five-of-section-3)
    * [3.6 Removing Wrong Values](#subsection-six-of-section-3)
    * [3.7 Resetting Index](#subsection-seven-of-section-3)
* [4. Output the Cleaned Data](#Section-four)
* [5. Data Analysis, Visualisation, Interpretation](#Section-five)
     * [5.1 Correlation Heatmap](#subsection-one-of-section-5)
     * [5.2 Number of Channels created in each year (Bar Chart)](#subsection-two-of-section-5)
     * [5.3 Distribution of Channels by type (Pie Chart)](#subsection-three-of-section-5)
     * [5.4 Distribution of Channels by Country -top 15- (Pie Chart)](#subsection-four-of-section-5)
     * [5.5 Top 10 YouTube Channels (H-Bar Chart)](#subsection-five-of-section-5)
         * [5.5.1 By no. of Subscribers](#subsection-one-of-five-of-section-5)
         * [5.5.2 By total Video Views](#subsection-two-of-five-of-section-5)
     * [5.6  Relationship between Subscribers & Video Views (Scatter Plot)](#subsection-six-of-section-5)
     * [5.7  Relationship between Video Views & Highest Yearly Earning (Scatter Plot)](#subsection-seven-of-section-5)
     * [5.8  Relationship between Video Views & no. of Uploads (Scatter Plot)](#subsection-eight-of-section-5)
* [6. Brief Conclusion](#Section-6)

# 1 Importing and Reading Data <a id="Section-one" ></a>

In [None]:
# to be able to see all columns in dataframe:
pd.set_option('display.max_columns', None)

#import Data
df = pd.read_csv('/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv', encoding = 'latin-1')

# 2 First glance at Data <a id="Section-two" ></a>

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

# 3 Data Cleaning <a id="Section-three" ></a>



In [None]:
# create a copy of our dataset to manipulate it freely
df = df.copy()

## 3.1 Columns Review  <a id="subsection-one-of-section-3" ></a>

In [None]:
df.columns

## 3.2 Remove unwanted columns <a id="subsection-two-of-section-3" ></a>

In [None]:
df = df.drop(['rank', 'Abbreviation', 'country_rank', 'created_month',
             'created_date', 'Gross tertiary education enrollment (%)', 'Unemployment rate', 'Urban_population'], axis=1)

#to Confirm that the selected columns are dropped
df.columns

## 3.3 Duplicates Review <a id="subsection-three-of-section-3" ></a>

In [None]:
# using duplicated() to mark duplicates
duplicates = df.duplicated()


# filter df to check if any duplicates are detected. as we will see, there are none
duplicated_rows = df[duplicates]

print(duplicated_rows)
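
As a small illustration of how `duplicated()` behaves, here is a sketch on a made-up three-row DataFrame. The channel names and numbers are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy = pd.DataFrame({
    'Youtuber': ['A', 'B', 'A'],
    'subscribers': [100, 200, 100],
})

# duplicated() marks each repeat of an earlier row as True;
# the first occurrence of a row stays False
mask = toy.duplicated()
print(mask.tolist())

# drop_duplicates() keeps only the first occurrence of each row
deduped = toy.drop_duplicates()
print(len(deduped))
```

An empty result from `df[duplicates]`, as in the cell above, therefore means that no row is an exact repeat of an earlier one.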

## 3.4 Missing Values Review <a id="subsection-four-of-section-3" ></a>

In [None]:
# define a variable to detect nulls
nulls_count = df.isnull().sum()


# filter to just focus on the columns with nulls
nulls_count[nulls_count > 0]

In [None]:
# collect the columns with object dtype, in other words the categorical ones
cat_colms = df.select_dtypes(include =['object']).columns

# using fillna() to replace missing values in cat_colms with 'Unknown'
df[cat_colms]= df[cat_colms].fillna("Unknown")

In [None]:
# make a variable for Numerical dtype.
num_colms = df.select_dtypes(include = ['int64', 'float']).columns

# again using fillna() to replace missing values in num_colms with '0'
df[num_colms] = df[num_colms].fillna(0)

In [None]:
# to confirm that there are no nulls anymore
df.isnull().sum()
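
The two fill steps can be tried on a tiny made-up frame to see their effect. This is only a sketch: the values are hypothetical, and the column names are borrowed from the dataset for readability. One side effect worth keeping in mind is that a missing year becomes 0, which is presumably where the "Year 0" dropped in section 5.2 comes from.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'Country': pd.Series(['India', None, 'Japan'], dtype=object),
    'created_year': [2006.0, np.nan, 2011.0],
})

# split the columns by dtype, as in the cells above
cat_cols = toy.select_dtypes(include=['object']).columns
num_cols = toy.select_dtypes(include=['number']).columns

# categorical gaps become 'Unknown', numerical gaps become 0
toy[cat_cols] = toy[cat_cols].fillna('Unknown')
toy[num_cols] = toy[num_cols].fillna(0)

print(toy)
```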

## 3.5 Dtype Review <a id="subsection-five-of-section-3" ></a>

In [None]:
df.info()

In [None]:
''' There is no need for some of the columns to be float
therefore I change them to int64'''


df= df.astype({
    'video views': 'int64',
    'channel_type_rank' : 'int64',
    'video_views_rank' : 'int64',
    'video_views_for_the_last_30_days' : 'int64',
    'subscribers_for_last_30_days' : 'int64',
    'created_year' : 'int64',
    'Population' : 'int64',
    'lowest_monthly_earnings' : 'int64',
    'highest_monthly_earnings' : 'int64',
    'lowest_yearly_earnings' : 'int64',
    'highest_yearly_earnings' : 'int64'
})

#  to check the result
df.info()
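
The order of the steps matters here: a float column that still contains missing values cannot be cast to `int64`. The sketch below shows this on a hypothetical three-value Series (not dataset values), which is why the nulls were filled in 3.4 before changing the dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical float Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# casting directly fails because NaN has no integer representation
try:
    s.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# after filling the gap, the cast works
filled = s.fillna(0).astype('int64')
print(filled.tolist())
```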

## 3.6 Removing Wrong Values <a id="subsection-six-of-section-3" ></a>

Some values in df are surely wrong, such as 0 for video views.
If we had access to the team that compiled the statistics, we might be able to correct those values.
That is not the case here,
so we remove them by dropping their entire rows.

In [None]:
# seeking wrong values
df.groupby('video views').size().head(5)

In [None]:
zero_views_index= (df[df['video views']==0]).index
print(zero_views_index)

In [None]:
# now we can drop them using their index
df= df.drop(axis=0, index=(zero_views_index))

In [None]:
# to validate the result
df.groupby('video views').size().head(5)

## 3.7 Resetting Index <a id="subsection-seven-of-section-3" ></a>

Because we have dropped some rows, the index is no longer consecutive, so we need to reset it. Before we do that, it is better to make sure that the df is sorted correctly.
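
A minimal sketch of the idea on a made-up frame (the numbers are hypothetical): `sort_values` returns a new DataFrame rather than sorting in place, so its result has to be assigned back before the index is reset.

```python
import pandas as pd

# Hypothetical toy data with gaps in the index, as left behind by drop()
toy = pd.DataFrame({'subscribers': [50, 300, 120]}, index=[0, 2, 5])

# sort in descending order and keep the sorted result
toy = toy.sort_values(by='subscribers', ascending=False)

# drop=True discards the old index instead of adding it as a column
toy = toy.reset_index(drop=True)

print(toy)
```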

In [None]:
#take a look at the index. for example index 1 is missing
df.head(5)

In [None]:
# to make sure that the data set is sorted by the number of Subscribers in descending order
# (sort_values returns a new DataFrame, so the result has to be assigned back)
df = df.sort_values(by= 'subscribers', ascending = False)

# to reset the index
df = df.reset_index(drop= True) #If drop=True then it does not add the new column of the current row index in the DataFrame

df.head()

# 4 Output the Cleaned Data <a id="Section-four" ></a>

In [None]:
# stores in Notebook > Output section
df.to_csv('/kaggle/working/Cleaned_Global_YouTube_Statistic_2023.csv', index=False, encoding='latin-1')

# 5 Data Analysis, Visualisation, Interpretation <a id="Section-five" ></a>

## 5.1 Correlation Heatmap <a id="subsection-one-of-section-5" ></a>

In [None]:
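# the "seaborn" style name is deprecated since Matplotlib 3.6; sns.set_theme() applies the seaborn look instead
sns.set_theme()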

plt.rcParams['figure.figsize']= (16,8)

title = "Correlation Heatmap"

plt.title(title,fontsize=18, weight= 'bold')

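# numeric_only=True (pandas >= 1.5) skips text columns such as 'Youtuber', which corr() cannot handle
sns.heatmap(df.corr(numeric_only=True), cmap="BuPu", annot=True)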

plt.show()

#### Interpretation

* **Subscribers & Video Views:** There is a very strong positive correlation between these two variables. Channels with more *Subscribers* tend to have more *Video Views*, and vice versa.

* **Video Views & Earnings:** There is a moderate positive correlation between *Video Views* and all four Earnings columns (highest/lowest monthly and highest/lowest yearly). More Video Views tend to go along with higher Earnings, although a correlation alone does not establish a direct effect.

* **video_views_for_the_last_30_days & Earnings:** There is a positive correlation between *video_views_for_the_last_30_days* and all four Earnings columns (highest/lowest monthly and highest/lowest yearly).

* **Population & subscribers / Video Views:** There is a very weak positive correlation between *Population* and both *subscribers* and *Video Views*. The population of the country that a YouTube channel is based in does not greatly affect its *subscribers* or *Video Views*. (One possible interpretation: because the Internet connects audiences worldwide, local factors such as a country's population have little effect on a channel's success on YouTube.)

* **Uploads & subscribers / Video Views / Earnings:** There is a weak positive correlation between the number of *uploads* and subscribers / Video Views / Earnings. This means that uploading more videos on YouTube does not guarantee the success of a Channel.
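
To make the reading of the heatmap concrete, here is a sketch that computes Pearson correlations on a hypothetical four-channel frame. The numbers are invented for illustration and are not dataset values; `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical toy data: views grow with subscribers, uploads do not
toy = pd.DataFrame({
    'Youtuber': ['A', 'B', 'C', 'D'],
    'subscribers': [10, 20, 30, 40],
    'video views': [100, 210, 290, 405],
    'uploads': [500, 20, 300, 90],
})

# Pearson correlation matrix over the numeric columns only
corr = toy.corr(numeric_only=True)

# a value close to +1 means the two columns rise together almost linearly
r = corr.loc['subscribers', 'video views']
print(round(r, 3))
print(corr)
```

Each cell of the heatmap is one such coefficient: values near 1 are the "strong" correlations listed above, values near 0 the "weak" ones.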

## 5.2  Number of Channels created in each year (Bar Chart) <a id="subsection-two-of-section-5" ></a>

In [None]:
channels_in_year= df['created_year'].value_counts() 
channels_in_year

In [None]:
# In order to be able to plot 'channels_in_year' we need it as a dataframe. one solution is:
channels_in_year= pd.DataFrame(channels_in_year) #first make a DataFrame. 
channels_in_year= channels_in_year.reset_index() #reset the index
channels_in_year.columns = ['Year', 'Created Channels'] #define columns' names
channels_in_year

In [None]:
# it would be a good idea to sort it by Year
channels_in_year= channels_in_year.sort_values(by = 'Year', ascending = True) 

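# let's drop Year 0 (a filled-in missing value) and Year 1970 (long before YouTube started in 2005);
# filtering on the value is safer than relying on hard-coded index positions
channels_in_year= channels_in_year[channels_in_year['Year'] >= 2005]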

#we need to reset our index for one last time
channels_in_year= channels_in_year.reset_index(drop= True)

#let's check
channels_in_year

In [None]:
# Plotting 'channels_in_year'

x= channels_in_year['Year']
y= channels_in_year['Created Channels']
 
# Figure Size
fig = plt.figure(figsize =(10, 4))
 
# Vertical bar plot
plt.bar(x, y, color = '#32A645')

# Define range for values in axis (an upper x limit of 2023 keeps the 2022 bar fully visible)
plt.ylim(0,110)
plt.xlim(2004,2023)

plt.xlabel("Year")
plt.ylabel("Number of Channels")
plt.title("Number of Channels created in each Year",  weight = 'bold')

# overlay a line on the bars to make the trend easier to follow
plt.plot(x,y)
plt.show()

#### Observation

The bar chart illustrates the number of channels created in each year from 2005 to 2022. Here are some highlights:
1. Quite a sharp increase in 2006, followed by a significant decrease in the next year
2. A more stable period from 2007 to 2010
3. Another dramatic increase in 2011, followed by a drop in the next year
4. It is interesting to note that 2014 is the peak of the chart
5. In the following eight years (2015-2022), the number of new Channels gradually decreased and reached its lowest level in 2022.
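
The increases, drops and the peak listed above can also be read off numerically. The sketch below assumes a frame shaped like `channels_in_year` (columns 'Year' and 'Created Channels'), but the four rows are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical counts, shaped like channels_in_year
toy = pd.DataFrame({
    'Year': [2012, 2013, 2014, 2015],
    'Created Channels': [60, 80, 95, 70],
})

# year-over-year change: positive means more channels than the year before
toy['Change'] = toy['Created Channels'].diff()

# the year with the highest number of created channels
peak_year = toy.loc[toy['Created Channels'].idxmax(), 'Year']

print(toy)
print(peak_year)
```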


## 5.3 Distribution of Channels by type (Pie Chart) <a id="subsection-three-of-section-5" ></a>

In [None]:
channel_types = df['channel_type'].value_counts()
channel_types

In [None]:
plt.pie(channel_types[:-3],colors=sns.color_palette('Set1'),
        labels= channel_types.index[:-3], autopct='%1.1f%%') #I ignored the last three for more visual clarity

plt.title('Distribution of YouTube channels by Type', weight = 'bold')
plt.show() 

### Observation

1. From the pie chart it is clear that Entertainment is the most common channel type, which suggests that YouTube users predominantly come to the platform for Entertainment purposes.
2. In a broader sense, we can group 'Entertainment, Music, Games, Comedy, Film' as ***all-Entertainments*** and 'Education, Howto' as ***all-Educations***
3. ***all-Entertainments*** forms the major chunk of the pie chart
4. ***all-Educations*** makes up only a minor chunk of the pie chart
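
The grouping described above can be sketched as follows. The counts are hypothetical stand-ins for `channel_types`, and the `groups` mapping and the 'Other' bucket are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical counts per channel type (not the real dataset values)
channel_types = pd.Series({
    'Entertainment': 30, 'Music': 25, 'Games': 10,
    'Education': 5, 'Howto': 4, 'News': 3,
})

# map each channel type to its broader group
groups = {
    'Entertainment': 'all-Entertainments', 'Music': 'all-Entertainments',
    'Games': 'all-Entertainments', 'Comedy': 'all-Entertainments',
    'Film': 'all-Entertainments',
    'Education': 'all-Educations', 'Howto': 'all-Educations',
}

# group by the mapped label (types without a mapping go to 'Other') and add up the counts
broad = channel_types.groupby(lambda label: groups.get(label, 'Other')).sum()

# share of each broad group in percent
share = (broad / broad.sum() * 100).round(1)
print(share)
```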


## 5.4 Distribution of Channels by Country -top 15- (Pie Chart) <a id="subsection-four-of-section-5" ></a>

In [None]:
channel_orig = df['Country'].value_counts().head(15)
channel_orig

In [None]:
plt.pie(channel_orig, labels= channel_orig.index, autopct='%1.1f%%', colors=sns.color_palette('Set2'))
plt.title('Distribution of YouTube channels by Country - top 15', weight = 'bold')
plt.show() 

### Observation

From the pie chart it is clear that
1. ***United States*** has the highest number of YouTubers 
2. ***India*** is in second place by a considerable margin
3. Because ***Unknown*** countries rank third, the missing country information may distort these results.
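
One way to gauge how much the ***Unknown*** entries influence the picture is to compare the shares with and without them. This is a sketch on a hypothetical list of eight countries, not on the dataset itself.

```python
import pandas as pd

# Hypothetical 'Country' column with two unknown entries
countries = pd.Series(
    ['United States', 'India', 'Unknown', 'United States',
     'Brazil', 'Unknown', 'India', 'United States'],
    name='Country',
)

# shares including the 'Unknown' placeholder
share_all = countries.value_counts(normalize=True)

# shares among channels whose country is known
known = countries[countries != 'Unknown']
share_known = known.value_counts(normalize=True)

print(share_all)
print(share_known)
```

If the ranking of the known countries stays the same in both views, the unknown entries change the percentages but not the order.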

## **5.5 Top 10 YouTube Channels** (H-Bar Chart) <a id="subsection-five-of-section-5" ></a>


### ***5.5.1 By no. of Subscribers***  <a id="subsection-one-of-five-of-section-5" ></a>

In [None]:
# first select columns
colms= ['Youtuber', 'subscribers']

# to select the top 10 
bar_colms= df.loc[0:9, colms]

bar_colms= bar_colms.sort_values('subscribers', ascending = True)

# for better readability, show the no. of subscribers in millions
bar_colms['subscribers (MM)'] = (bar_colms['subscribers'] / 1000000).astype(int)

bar_colms

In [None]:
# Plotting 'Top 10 Youtube Channels by Subscribers'

x= bar_colms['Youtuber']
y= bar_colms['subscribers (MM)']
 
# Figure Size
fig = plt.figure(figsize =(10, 6))
 
# creating the bar plot
plt.barh(x, y, color= 'skyblue', height= 0.5 )
 
plt.xlabel("No. of Subscribers in million", weight='bold', fontsize = 12)
plt.ylabel("Youtuber", weight='bold', fontsize = 12)
plt.title("Top 10 Youtube Channels by no. of Subscribers", weight='bold', fontsize = 14)
plt.show()

### ***5.5.2 By total Video Views*** <a id="subsection-two-of-five-of-section-5" ></a>

In [None]:
# first select columns
colms= ['Youtuber', 'video views']

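# to select the top 10 by total Video Views
# (df is ordered by subscribers, so df.loc[0:9] would pick the top 10 by subscribers instead)
bar_colms= df.nlargest(10, 'video views')[colms]

bar_colms= bar_colms.sort_values('video views', ascending = True)

# for better readability, show the no. of Video Views in billions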
bar_colms['video views (bil)'] = (bar_colms['video views'] / 1000000000).astype(int)

bar_colms

In [None]:
# Plotting 'Top 10 Youtube channels by total Video Views'

x= bar_colms['Youtuber']
y= bar_colms['video views (bil)']
 
# Figure Size
fig = plt.figure(figsize =(10, 6))
 
# creating the bar plot
plt.barh(x, y, color= 'limegreen', height= 0.5 )
 
plt.xlabel("Total Video Views in billion", weight='bold', fontsize = 12)
plt.ylabel("Youtuber", weight='bold', fontsize = 12)
plt.title("Top 10 Youtube Channels by total Video Views", weight='bold', fontsize = 14)
plt.show()

## 5.6  Relationship between Subscribers & Video Views (Scatter Plot) <a id="subsection-six-of-section-5" ></a>

In [None]:
# to select columns
colms= ['subscribers', 'video views']

scatter_colms= df.loc[0:, colms]

scatter_colms['subscribers (MM)'] = (scatter_colms['subscribers'] / 1000000).astype(int)
scatter_colms['video views (bil)'] = (scatter_colms['video views'] / 1000000000).astype(int)

scatter_colms

In [None]:
x= scatter_colms['subscribers (MM)']
y= scatter_colms['video views (bil)']

size= scatter_colms['subscribers (MM)'] * 3

plt.scatter(x, y, s = size, c ="navy", alpha=0.4)

plt.title('Relationship Between no. of Subscribers & total Video Views', fontsize = 14, weight = 'bold')
plt.xlabel('no. of Subscribers in million', fontsize = 12,  weight = 'bold')
plt.ylabel('total Video Views (bil)', fontsize = 12,  weight = 'bold')

plt.show()

### Observation

As it was revealed in 5.1, there is a strong positive correlation between no. of Subscribers and total Video Views.


## 5.7  Relationship between Video Views & Highest Yearly Earning (Scatter Plot) <a id="subsection-seven-of-section-5" ></a>

In [None]:
# first lets select our columns
colms= ['video views', 'highest_yearly_earnings']

scatter_colms= df.loc[0:, colms]

scatter_colms['video views (bill)'] = (scatter_colms['video views'] / 1000000000).astype(int)
scatter_colms['highest_yearly_earnings (MM)'] = (scatter_colms['highest_yearly_earnings'] / 1000000).astype(int)

scatter_colms

In [None]:
x= scatter_colms['video views (bill)']
y= scatter_colms['highest_yearly_earnings (MM)']

size= scatter_colms['highest_yearly_earnings (MM)'] * 3

plt.scatter(x, y, s= size, c ="green", alpha=0.4)

plt.title('Relationship Between Total Video Views & Highest Yearly Earning', fontsize = 14, weight = 'bold')
plt.xlabel('Total Video Views (bil)', fontsize = 12,  weight = 'bold')
plt.ylabel('Highest Yearly Earning (MM)', fontsize = 12,  weight = 'bold')

plt.show()

### Observation

This scatter plot makes it clear that higher total Video Views ***do not*** guarantee higher income. We can list several possible reasons:
* The length of videos has an impact on earnings
* It is crucial to enable Ads on videos
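
A simple ratio makes this point tangible: channels with the same total views can earn very different amounts per view. The sketch below uses three hypothetical channels, and the `earnings_per_1000_views` column is a name introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data: A and B have equal views but very different earnings
toy = pd.DataFrame({
    'Youtuber': ['A', 'B', 'C'],
    'video views': [10_000_000_000, 10_000_000_000, 2_000_000_000],
    'highest_yearly_earnings': [5_000_000, 500_000, 4_000_000],
})

# earnings per 1000 total views
toy['earnings_per_1000_views'] = (
    toy['highest_yearly_earnings'] / toy['video views'] * 1000
)

print(toy[['Youtuber', 'earnings_per_1000_views']])
```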

## 5.8  Relationship between Video Views & no. of Uploads (Scatter Plot) <a id="subsection-eight-of-section-5" ></a>

In [None]:
# to select our columns
colms= ['uploads', 'video views']

scatter_colms= df.loc[0:, colms]

scatter_colms['video views (bil)'] = (scatter_colms['video views'] / 1000000000).astype(int)

scatter_colms

In [None]:
x= scatter_colms['video views (bil)']
y= scatter_colms['uploads']

size= scatter_colms['video views (bil)'] * 2

plt.scatter(x, y, s= size, c ="red", alpha=0.4)

plt.title('Relationship Between total Video Views & total Uploads', fontsize = 14, weight = 'bold')
plt.xlabel('total Video Views (bil)', fontsize = 12,  weight = 'bold')
plt.ylabel('total Uploads', fontsize = 12,  weight = 'bold')

plt.show()

### Observation

Echoing the same statement as in 5.1, uploading tons of videos ***does not*** necessarily result in higher Video Views. Several possible reasons might be:

1. The quality of the content plays an essential role in attracting viewers
2. Defining the right target audience is crucial
3. Continuity is more important than uploading a lot in a short time
4. Adopting effective marketing strategies is a must

# 6. Brief Conclusion <a id="Section-6" ></a>

1. 2006, 2011 and 2014 are the years in which the highest numbers of YouTube Channels were created.
2. From 2018 to the last year of this survey (2022), the number of channels created has decreased drastically.
3. There is a strong positive correlation between ***Subscribers*** and ***Video Views***
4. There is a moderate positive correlation between ***Video Views*** and all four columns of ***Earnings (highest/lowest monthly and highest/lowest yearly)***
5. There is a weak positive correlation between the number of ***uploads*** and ***subscribers / Video Views / Earnings***
6. YouTube users predominantly come to the platform for ***Entertainment*** purposes
7. ***Educational Categories*** (e.g. Education, Howto) account for only a minority of users' motivations
8. ***United States*** has the highest number of YouTubers and ***India*** is in second place
9. Higher total Video Views ***do not*** guarantee higher income