# Machine Project Phase 1: 

This Jupyter Notebook includes the following contents:
    
    0.  2023 Global Youtube Statistics Dataset
    1.  Description of the Dataset
    2.  Data Cleaning Process of the Dataset
    3.  Exploratory Data Analysis 
    4.  Proposed Research Question

### GARYNATION  (S13)
    1.  Bien Aaron Miranda 
    2.  Luis Miguel Rana
    3.  Karl Andre Aquino
    4.  Dominic Luis Baccay

### Python Libraries

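Import the libraries used in this notebook: numpy, pandas, matplotlib, seaborn, and scikit-learn.

pandas is a Python library that provides data structures and data analysis tools; matplotlib and seaborn are used for plotting.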

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression


## The Dataset
This dataset covers the most-subscribed YouTube channels and offers a way to analyze and gain insights from the platform's top creators. It includes details on subscriber counts, video views, upload frequency, country of origin, earnings, and more, which makes it useful for aspiring content creators, data enthusiasts, and anyone interested in the online content landscape.

In [None]:
# The encoding was suggested in the dataset discussion so that the file can be read
youtube_df = pd.read_csv("Global YouTube Statistics.csv", encoding='iso-8859-1')

If you view the `.csv` file in Excel, you can see that our dataset contains about 995 **observations** (rows) across 28 **variables** (columns). The following are the descriptions of each variable in the dataset.

- **`rank`**: Position of the YouTube channel based on the number of subscribers
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers`**: Number of subscribers to the channel
- **`video views`**: Total number of video views on the channel
- **`category`**: Category or niche of the channel
- **`Title`**: Title of the YouTube channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`Abbreviation`**: Abbreviation of the country
- **`channel_type`**: Type of the YouTube channel (e.g., individual, brand)
- **`video_views_rank`**: Ranking of the channel based on total video views
- **`country_rank`**: Ranking of the channel based on the number of subscribers within its country
- **`channel_type_rank`**: Ranking of the channel based on its type (individual or brand)
- **`video_views_for_the_last_30_days`**: Total video views in the last 30 days
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel (\$)
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel (\$)
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel (\$)
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel (\$)
- **`subscribers_for_last_30_days`**: Number of new subscribers gained in the last 30 days
- **`created_year`**: Year when the YouTube channel was created
- **`created_month`**: Month when the YouTube channel was created
- **`created_date`**: Exact date of the YouTube channel's creation
- **`Gross tertiary education enrollment (%)`**: Percentage of the population enrolled in tertiary education in the country
- **`Population`**: Total population of the country
- **`Unemployment rate`**: Unemployment rate in the country (%)
- **`Urban_population`**: Percentage of the population living in urban areas (%)
- **`Latitude`**: Latitude coordinate of the country's location
- **`Longitude`**: Longitude coordinate of the country's location
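
The observation and variable counts quoted above can also be confirmed in pandas instead of Excel. The sketch below uses a small toy frame (`sample_df`, with made-up values) as a stand-in for `youtube_df`; on the real data the same `.shape` call would be applied to `youtube_df`.

```python
import pandas as pd

# Toy stand-in for youtube_df; the values are invented for illustration only.
sample_df = pd.DataFrame({
    "Youtuber": ["Channel A", "Channel B", "Channel C"],
    "subscribers": [245_000_000, 170_000_000, 166_000_000],
    "Country": ["India", "United States", None],
})

# .shape returns (number of rows, number of columns)
n_rows, n_cols = sample_df.shape
print(f"{n_rows} observations across {n_cols} variables")

# The column names can be listed to compare against the descriptions
print(sample_df.columns.tolist())
```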

## Reading the Dataset

Our first step is to load the dataset into a pandas DataFrame using the `read_csv` function, as done in the cell above. Note that you may need to change the path depending on the location of the file on your machine.

We will also look at what the dataset contains once it is loaded as a DataFrame.

In [None]:
youtube_df.info()

### Examining the observations within the Dataset

    
To better understand the observations in the dataset, we examine what the first few rows look like. This helps us make informed decisions during cleaning.

In [None]:
youtube_df.head(10)

# Cleaning the Dataset

    This section of the Jupyter Notebook contains the overall process and variables affected during the data cleaning of our dataset.

### Before Cleaning:
    Number of Observations: 995
    Number of Columns: 28
    Data types: float64(18), int64(3), object(7)

### Initial Observations: 
-   There are special characters in the 'Youtuber' column.
-   There are missing/null values in the dataset.
-   Some variables are similar, therefore, unnecessary.
-   Some data types are inappropriate for their respective variables.



#### Variables to be Observed:
We will be dropping all columns/variables except the following.
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers`**: Number of subscribers to the channel
- **`video views`**: Total number of video views on the channel
- **`category`**: Category or niche of the channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`channel_type`**: Type of the YouTube channel (e.g., individual, brand)
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel (\$)
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel (\$)
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel (\$)
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel (\$)
- **`created_year`**: Year when the YouTube channel was created
- **`Population`**: Total population of the country

In [None]:
# Declare the variables that will be used
columns_to_keep = ['Youtuber', 'subscribers', 'video views', 'uploads', 'category', 'channel_type', 'Country', 'lowest_monthly_earnings', 'highest_monthly_earnings', 'lowest_yearly_earnings', 'highest_yearly_earnings', 'created_year', 'Population']
columns_to_drop = list(set(youtube_df.columns) - set(columns_to_keep))

# Drop unused columns (variables)
youtube_df.drop(columns_to_drop, axis=1, inplace=True)

youtube_df

### The "Youtuber" variable - Names of Youtubers

- **`Youtuber`**: Name of the YouTube channel<br>

We want to check whether any observations have a null "Youtuber" value. If there are any, we will exclude them from the dataset.
     

In [None]:
# Check for null values
nullyoutuber_df = youtube_df[youtube_df["Youtuber"].isnull()]
nullyoutuber_df

We want to check whether there are any duplicate "Youtuber" values. If there are any, the duplicated rows will be removed.
     

In [None]:
# Check for duplicates
column_name = "Youtuber"
unique_count = youtube_df[column_name].nunique()
observation_count = youtube_df[column_name].count()

print(f"Number of observations in '{column_name}': {observation_count}")
print(f"Number of unique values in '{column_name}': {unique_count}")
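
Comparing the two counts only shows whether duplicates exist. A minimal sketch of how the duplicated rows could be listed and removed is given below; it assumes the first occurrence of each name is the one to keep, and `sample_df` is a toy frame rather than the real dataset.

```python
import pandas as pd

# Toy data with one repeated channel name (illustrative values only)
sample_df = pd.DataFrame({
    "Youtuber": ["Channel A", "Channel B", "Channel A", "Channel C"],
    "subscribers": [100, 90, 100, 80],
})

# keep=False marks every member of a duplicated group, not just the repeats
duplicate_rows = sample_df[sample_df.duplicated(subset="Youtuber", keep=False)]
print(duplicate_rows)

# Keep the first occurrence of each name and drop the later ones
deduplicated_df = (
    sample_df.drop_duplicates(subset="Youtuber", keep="first")
    .reset_index(drop=True)
)
print(deduplicated_df)
```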

Since there are special characters present within the dataset, we will replace them with underscores "_" instead for readability.

In [None]:
# Define an array of special characters to replace
chars_to_replace = ['ï', '¿', '½', 'ý']

# Replace the characters with underscores
for char in chars_to_replace:
    youtube_df['Youtuber'] = youtube_df['Youtuber'].str.replace(char, '_', regex=False)

youtube_df
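
The characters `ï`, `¿`, and `½` are consistent with the Unicode replacement character (U+FFFD) having been stored as UTF-8 and then decoded as ISO-8859-1. This is our interpretation and is not documented by the dataset. The sketch below reproduces that sequence and, as an alternative to replacing each character separately, replaces the whole three-character sequence with a single underscore. The names in `names` are made up.

```python
import pandas as pd

# U+FFFD encoded as UTF-8 is the bytes EF BF BD; read as ISO-8859-1 they become three characters
replacement_char = "\ufffd"
mojibake = replacement_char.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # ï¿½

# Toy channel names containing the garbled sequence
names = pd.Series(["Good Name", "Bad" + mojibake + "Name", "Other" + mojibake + mojibake])

# Replace each whole garbled sequence with one underscore (literal match, no regex)
cleaned = names.str.replace(mojibake, "_", regex=False)
print(cleaned.tolist())
```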

### The "Subscribers" variable

- **`subscribers`**: Number of subscribers to the channel<br>

We want to check if there are any observations with null "subscribers" values.

In [None]:
# Check for null values
nullsubscriber_df = youtube_df[youtube_df["subscribers"].isnull()]
nullsubscriber_df

In [None]:
# Display the count of null values
null_subscriber_count = youtube_df['subscribers'].isnull().sum()
print(f'Number of null values in "subscribers": {null_subscriber_count}')

Since there are none, we now want to convert the data type from int64 to float64 for scaling. <br>
We will scale the values to millions, so we will need to convert the data type to one that can recognize decimals.

In [None]:
# Convert to float64 data type
youtube_df['subscribers'] = youtube_df['subscribers'].astype('float64')
# Convert the "subscribers" column to values in millions
youtube_df['subscribers'] = youtube_df['subscribers'] / 1000000
# Rename the "subscribers" column to indicate values are in millions
youtube_df = youtube_df.rename(columns={'subscribers': 'subscribers (millions)'})
youtube_df

### The "Video Views" and "Uploads" variables
- **`video views`**: Total number of video views on the channel
- **`uploads`**: Total number of videos uploaded on the channel<br>

We want to check if there are any observations that have null or zero values in these columns.

In [None]:
# Check for zero values in "video views" (note: == 0 does not match NaN)
nullviews_df = youtube_df[youtube_df['video views'] == 0]
nullviews_df

In [None]:
# Check for zero values in "uploads"
nulluploads_df = youtube_df[youtube_df['uploads'] == 0]
nulluploads_df.head(10)

In [None]:
# Display the counts
null_views_count = youtube_df['video views'].eq(0).sum()
null_uploads_count = youtube_df['uploads'].eq(0).sum()
print(f'Number of zero values in "video views": {null_views_count}')
print(f'Number of zero values in "uploads": {null_uploads_count}')
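
Zero and null are different conditions in pandas: `.eq(0)` does not match `NaN`, and `.isnull()` does not match `0`. The helper below is a hypothetical sketch (the name `count_zero_and_null` is ours) that reports both counts per column, shown on toy data.

```python
import numpy as np
import pandas as pd

def count_zero_and_null(df, columns):
    """Return a table with the number of zero and NaN entries per column."""
    return pd.DataFrame(
        {
            "zero_count": [int(df[col].eq(0).sum()) for col in columns],
            "null_count": [int(df[col].isnull().sum()) for col in columns],
        },
        index=columns,
    )

# Toy data: one zero and one NaN in "video views", two zeros in "uploads"
sample_df = pd.DataFrame({
    "video views": [1.0e9, 0.0, np.nan, 5.0e8],
    "uploads": [10, 0, 0, 3],
})

summary = count_zero_and_null(sample_df, ["video views", "uploads"])
print(summary)
```

On the real data, the same function could be called with `youtube_df` and the list of numeric columns being cleaned.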

We want to see whether "video views" and "uploads" show any pattern when plotted against their row indexes.

In [None]:
# Create a figure with two subplots side by side
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

# Plot 'video views' (in millions) in the first subplot
axs[0].scatter(youtube_df.index, youtube_df['video views'] / 1000000)  # Divide by 1,000,000
axs[0].set_title('Correlation between Video Views (Millions) and Index')
axs[0].set_xlabel('Index')
axs[0].set_ylabel('Video Views (Millions)')

# Plot 'upload' in the second subplot
axs[1].scatter(youtube_df.index, youtube_df['uploads'], color='red')
axs[1].set_title('Correlation between Upload and Index')
axs[1].set_xlabel('Index')
axs[1].set_ylabel('Upload')

# Adjust the layout for better spacing
plt.tight_layout()

# Show the combined figure
plt.show()

Since neither video views nor uploads shows a significant correlation with the row index, we will remove the observations in question rather than replace their values with a measure of central tendency.

In [None]:
# Drop the rows where video views == 0
youtube_df = youtube_df[youtube_df["video views"] != 0]

# Drop the rows where uploads == 0
youtube_df = youtube_df[youtube_df["uploads"] != 0]
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the data remains continuous.

In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

We will also scale the values of video views to the millions. Since the data type is already float64, there will be no conversion of data type.

In [None]:
# Convert the "video views" column to values in millions
youtube_df['video views'] = youtube_df['video views'] / 1000000
# Rename the "video views" column to indicate values are in millions
youtube_df = youtube_df.rename(columns={'video views': 'video views (millions)'})
# Round the values to 3 decimal places
youtube_df['video views (millions)'] = youtube_df['video views (millions)'].round(3)
youtube_df
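
The same divide, rename, and round steps are applied to several columns in this notebook. One way to avoid repeating them is a small helper; `scale_to_millions` below is a hypothetical function of ours, shown on toy data, and it returns a new DataFrame instead of modifying the input.

```python
import pandas as pd

def scale_to_millions(df, column, decimals=3):
    """Return a copy of df with `column` divided by 1,000,000, rounded, and renamed."""
    out = df.copy()
    out[column] = (out[column].astype("float64") / 1_000_000).round(decimals)
    return out.rename(columns={column: f"{column} (millions)"})

# Toy data (illustrative values only)
sample_df = pd.DataFrame({
    "Youtuber": ["Channel A", "Channel B"],
    "video views": [228_000_000_000, 1_234_567],
})

scaled_df = scale_to_millions(sample_df, "video views")
print(scaled_df)
```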

### The "Category" variable
- **`category`**: Category or niche of the channel<br>

We want to check whether any observations have null values in the "category" column.

In [None]:
# Check for null values in "category"
nullcategory_df = youtube_df[youtube_df["category"].isnull()]
nullcategory_df.head(10)

In [None]:
# Display the counts
null_category_count = youtube_df['category'].isnull().sum()
print(f'Number of null values in "category": {null_category_count}')

We will now check for possible values we can use to replace the null values within the 'category' column based on the 'channel_type' column.<br>
We are replacing the null values instead of removing them due to the presence of the 'channel_type' variable which is similar in nature.

In [None]:
# Displaying the 'category' and 'channel_type' values for cleaning
column1 = 'category'
column2 = 'channel_type'

categoryandchannel_df = youtube_df.dropna(subset=[column1, column2])

unique_values_column1 = categoryandchannel_df[column1].unique().tolist()
unique_values_column2 = categoryandchannel_df[column2].unique().tolist()

df_unique_values_column1 = pd.DataFrame(unique_values_column1, columns=[column1])
df_unique_values_column2 = pd.DataFrame(unique_values_column2, columns=[column2])

# Concatenate DataFrames horizontally
result_df = pd.concat([df_unique_values_column1, df_unique_values_column2], axis=1)
result_df


After going through the unique values, we have prepared the following values to be used as replacement based on the channel_type.
| channel_type | category |
| ----------- | ----------- |
| Games      | Gaming       |
| People   | People & Blogs        |
| Entertainment      | Entertainment       |
| Sports   | Sports        |
| Film      | Film & Animation       |
| Howto   | Howto & Style        |
| Education      | Education       |
| Tech   | Science & Technology        |
| Music   | Music    |

The remaining rows, which have neither a "category" nor a "channel_type" value, will be removed.

In [None]:
# Replace category null values with closest value in channel_type

# Use the .loc method to set the new value based on the condition (where 'category' is null)
youtube_df.loc[(youtube_df['channel_type'] == 'Games') & (youtube_df['category'].isnull()), 'category'] = 'Gaming'
youtube_df.loc[(youtube_df['channel_type'] == 'People') & (youtube_df['category'].isnull()), 'category'] = 'People & Blogs'
youtube_df.loc[(youtube_df['channel_type'] == 'Entertainment') & (youtube_df['category'].isnull()), 'category'] = 'Entertainment'
youtube_df.loc[(youtube_df['channel_type'] == 'Sports') & (youtube_df['category'].isnull()), 'category'] = 'Sports'
youtube_df.loc[(youtube_df['channel_type'] == 'Film') & (youtube_df['category'].isnull()), 'category'] = 'Film & Animation'
youtube_df.loc[(youtube_df['channel_type'] == 'Howto') & (youtube_df['category'].isnull()), 'category'] = 'Howto & Style'
youtube_df.loc[(youtube_df['channel_type'] == 'Education') & (youtube_df['category'].isnull()), 'category'] = 'Education'
youtube_df.loc[(youtube_df['channel_type'] == 'Tech') & (youtube_df['category'].isnull()), 'category'] = 'Science & Technology'
youtube_df.loc[(youtube_df['channel_type'] == 'Music') & (youtube_df['category'].isnull()), 'category'] = 'Music'

# Drop remaining rows with null values for 'category' column
youtube_df.dropna(subset=['category'], inplace=True)

# Display the modified DataFrame
youtube_df
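
The replacement table can also be expressed as a dictionary and applied in one step with `map` and `fillna`. This is a sketch of an alternative to the repeated `.loc` assignments, not a change to the method; `sample_df` is toy data, and any `channel_type` not in the dictionary (for example "Nonprofit" in the toy frame) is left null and then dropped.

```python
import numpy as np
import pandas as pd

# Mapping taken from the replacement table above
category_by_channel_type = {
    "Games": "Gaming",
    "People": "People & Blogs",
    "Entertainment": "Entertainment",
    "Sports": "Sports",
    "Film": "Film & Animation",
    "Howto": "Howto & Style",
    "Education": "Education",
    "Tech": "Science & Technology",
    "Music": "Music",
}

# Toy data: some categories missing, one row missing both columns
sample_df = pd.DataFrame({
    "channel_type": ["Games", "Tech", "Music", np.nan, "Nonprofit"],
    "category": [np.nan, np.nan, "Music", np.nan, np.nan],
})

# Look up a fallback category for every row, then use it only where category is null
fallback = sample_df["channel_type"].map(category_by_channel_type)
sample_df["category"] = sample_df["category"].fillna(fallback)

# Rows that still have no category are dropped
sample_df = sample_df.dropna(subset=["category"]).reset_index(drop=True)
print(sample_df)
```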

We will also drop the "channel_type" column since we will no longer be using it.

In [None]:
# Drop 'channel_type' column
youtube_df.drop(columns='channel_type', inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the data remains continuous.


In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

### The "Country" variable
- **`Country`**: Country where the YouTube channel originates<br>

We want to check whether any observations have null values in the "Country" column.

In [None]:
# Check for null values in Country
null_country_df = youtube_df[youtube_df["Country"].isnull()]
null_country_df.head(10)

In [None]:
# Check for null values in Country
null_country_count = youtube_df["Country"].isnull().sum()
print(f'Number of null values in "Country": {null_country_count}')

There are 94 observations without a country, so approximately 10% of the dataset has a null "Country" value.<br>
Next, we will look for possible misrepresentations of the data.

In [None]:
# Get the unique values in the "Country" column
unique_country = youtube_df['Country'].unique()
print(unique_country)
print(f"Number of countries: {len(unique_country)}")


Since there are no misrepresentations, we will check the count of each unique value and determine whether it is possible to use the mode as a replacement for the missing values.

In [None]:
# Get the count of each unique value in the "Country" column
country_counts = youtube_df['Country'].value_counts()

# Display the count of each unique value
print("Count of each unique value in 'Country':")
print(country_counts)

The mode, "United States", appears 306 times, so approximately 32.27% of the Youtubers in the dataset are from the United States.<br>
We will not replace the missing values with the mode, since the column has a large number of unique values and the mode already represents a significant portion of the data. <br>
Replacing them would reduce the diversity of the dataset and might not accurately reflect the underlying data distribution.
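
The effect of mode imputation on the mode's share can be quantified with `value_counts(normalize=True)`. The sketch below uses a toy `country` Series with invented values; the shares on the real data will differ.

```python
import numpy as np
import pandas as pd

# Toy "Country" column: 3 x United States, 2 x India, 1 x Brazil, 2 missing
country = pd.Series(
    ["United States", "India", "United States", np.nan,
     "Brazil", "United States", np.nan, "India"],
    name="Country",
)

# Share of each country among the non-null values
share_before = country.value_counts(normalize=True)

# Share of each country if every missing value were filled with the mode
share_after = country.fillna(country.mode()[0]).value_counts(normalize=True)

print(f"Mode share before imputation: {share_before['United States']:.3f}")
print(f"Mode share after imputation:  {share_after['United States']:.3f}")
```

In the toy data the mode's share rises from 3 of 6 non-null rows to 5 of 8 rows, which illustrates how imputation inflates the most common value.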

To visualize this, the plot below shows the distribution of "Country" values if the missing values were replaced with the mode.

In [None]:
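# Replace missing values with the mode (on a copy, for illustration only)
test_df = youtube_df.copy()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
test_df['Country'] = test_df['Country'].fillna(test_df['Country'].mode()[0])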

# Create a countplot to show the distribution of 'country' values
plt.figure(figsize=(10, 6))
sns.countplot(data=test_df, x='Country')
plt.xlabel('Country')
plt.ylabel('Count')
plt.title('Distribution of Country Values')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability

plt.show()

We will be removing the observations with null "Country" values.

In [None]:
# Drop observations with null values
youtube_df.dropna(subset=['Country'], inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the data remains continuous.



In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

### The "Lowest Monthly Earnings" variable
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel \$<br>

We want to check if there are any values for "lowest_monthly_earnings" that are zero (0) or null.
     

In [None]:
# Check for zero values in lowest_monthly_earnings (eq(0) does not match NaN)
null_lowestmonthly_df = youtube_df[youtube_df["lowest_monthly_earnings"].eq(0)]
null_lowestmonthly_df.head(10)

In [None]:
# Display the count
null_lowestmonthly_count = youtube_df["lowest_monthly_earnings"].eq(0).sum()
print(f'Number of zero values in "lowest_monthly_earnings": {null_lowestmonthly_count}')

There are 34 observations with zero (0) or null values for "lowest_monthly_earnings".<br><br>
Since earnings can also come from sponsorships, affiliate marketing, and similar sources, some of the recorded earnings may not be accurate, particularly the zero or null values.
We will remove these observations to maintain the overall quality and integrity of our dataset. Removing them avoids distorting the underlying patterns in the data and avoids the complications of imputation. We can still perform a meaningful analysis with the reduced dataset.

In [None]:
# Drop all observations with zero lowest_monthly_earnings
youtube_df.drop(youtube_df[youtube_df["lowest_monthly_earnings"].eq(0)].index, inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the data remains continuous.


In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

### The "Highest Monthly Earnings" variable
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel \$<br>

We want to check if there are any values for "highest_monthly_earnings" that are zero (0) or null.

In [None]:
# Check for zero values in highest_monthly_earnings (eq(0) does not match NaN)
null_highestmonthly_df = youtube_df[youtube_df["highest_monthly_earnings"].eq(0)]
null_highestmonthly_df.head(10)

In [None]:
# Display the count
null_highestmonthly_count = youtube_df["highest_monthly_earnings"].eq(0).sum()
print(f'Number of zero values in "highest_monthly_earnings": {null_highestmonthly_count}')

No observations have zero (0) or null values for "highest_monthly_earnings".

There is no need to drop any observations.

### The "Lowest Yearly Earnings" variable
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel \$<br>

We want to check if there are any values for "lowest_yearly_earnings" that are zero (0) or null.
     

In [None]:
# Check for zero values in lowest_yearly_earnings (eq(0) does not match NaN)
null_lowestyearly_df = youtube_df[youtube_df["lowest_yearly_earnings"].eq(0)]
null_lowestyearly_df.head(10)

In [None]:
# Display the count
null_lowestyearly_count = youtube_df["lowest_yearly_earnings"].eq(0).sum()
print(f'Number of zero values in "lowest_yearly_earnings": {null_lowestyearly_count}')

No observations have zero (0) or null values for "lowest_yearly_earnings".

There is no need to drop any observations.

### The "Highest Yearly Earnings" variable
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel (\$)<br>

We want to check if there are any values for "highest_yearly_earnings" that are zero (0) or null.

In [None]:
# Check for zero values in highest_yearly_earnings (eq(0) does not match NaN)
null_highestyearly_df = youtube_df[youtube_df["highest_yearly_earnings"].eq(0)]
null_highestyearly_df.head(10)

In [None]:
# Display the count
null_highestyearly_count = youtube_df["highest_yearly_earnings"].eq(0).sum()
print(f'Number of zero values in "highest_yearly_earnings": {null_highestyearly_count}')

No observations have zero (0) or null values for "highest_yearly_earnings", so no cleaning is needed.

### The "Created Year" variable
- **`created_year`**: Year when the YouTube channel was created<br>

We want to check if there are any values for "created_year" that are zero (0) or null.

In [None]:
# Check for zero values in created_year (eq(0) does not match NaN)
null_createdyear_df = youtube_df[youtube_df["created_year"].eq(0)]
null_createdyear_df.head(10)

In [None]:
# Display the count
null_createdyear_count = youtube_df["created_year"].eq(0).sum()
print(f'Number of zero values in "created_year": {null_createdyear_count}')

We also want to check whether any values are earlier than 2005, since YouTube was founded in that year.

In [None]:
# Check if there are any observations of Youtube channels created before 2005
subset = youtube_df[youtube_df['created_year'] < 2005]
subset


After manual checking, the channel does exist and the dataset reflects the recorded value, but a creation year before 2005 cannot be correct. It is likely an error or a default value assigned in the database, such as POSIX time 0, which corresponds to 1970.

Because we are dealing with years, we will remove the observation, since an incorrect year can significantly affect the accuracy and interpretation of the data.

In [None]:
# Get the index of the subset
indices_to_drop = subset.index

# Drop the observations from the original DataFrame
youtube_df.drop(indices_to_drop, inplace=True)

Since the values of the variables are years, it would be appropriate to convert the data type to int.

In [None]:
# Convert the values in the 'created_year' column to integers
youtube_df['created_year'] = youtube_df['created_year'].astype('int64')
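
Note that `astype('int64')` raises an error if the column still contains `NaN`, because `NaN` has no integer representation. Whether `created_year` still has nulls at this point depends on the earlier cleaning steps, so the sketch below is only a precaution. It uses a toy `years` Series and shows two options: pandas' nullable `Int64` dtype, or dropping the nulls before converting.

```python
import numpy as np
import pandas as pd

# Toy year column stored as float with one missing value
years = pd.Series([2006.0, 2012.0, np.nan], name="created_year")

# Option 1: the nullable integer dtype keeps the missing value as <NA>
nullable_years = years.astype("Int64")
print(nullable_years)

# Option 2: drop the missing values first, then convert to a plain int64
strict_years = years.dropna().astype("int64")
print(strict_years)
```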

### The "Population" variable
- **`Population`**: Total population of the country<br>

We want to check if there are any values for "Population" that are zero (0) or null.

In [None]:
# Check for zero values in Population (eq(0) does not match NaN)
null_population_df = youtube_df[youtube_df["Population"].eq(0)]
null_population_df.head(10)

In [None]:
# Display the count
null_population_count = youtube_df["Population"].eq(0).sum()
print(f'Number of zero values in "Population": {null_population_count}')

No observations have zero (0) or null values for "Population".

We want to scale the values to millions to make it more readable. 

In [None]:
# Convert the "Population" column to values in millions
youtube_df['Population'] = youtube_df['Population'] / 1000000
# Rename the "Population" column to indicate values are in millions
youtube_df = youtube_df.rename(columns={'Population': 'Population (millions)'})
# Round the values to 3 decimal places
youtube_df['Population (millions)'] = youtube_df['Population (millions)'].round(3)
youtube_df

### Check Data Types of the Values in each Column

We want to check if the values in each column match their respective data types (float64, int64, and object).

In [None]:
# Define the expected data types
expected_data_types = {
    'Youtuber': object,
    'subscribers (millions)': float,
    'video views (millions)': float,
    'uploads': int,
    'Country': object,
    'lowest_monthly_earnings': float,
    'highest_monthly_earnings': float,
    'lowest_yearly_earnings': float,
    'highest_yearly_earnings': float,
    'created_year': int,
    'Population (millions)': float
}

# Check if values in each column match the expected data type using a custom function
def check_data_type(column_name, expected_dtype):
    for value in youtube_df[column_name]:
        if not isinstance(value, expected_dtype):
            return False
    return True

all_columns_match = True

for column_name, expected_dtype in expected_data_types.items():
    if not check_data_type(column_name, expected_dtype):
        all_columns_match = False
        print(f"Column '{column_name}' does not match the expected data type {expected_dtype}.")

if all_columns_match:
    print("All columns match the expected data types.")
else:
    print("Not all columns match the expected data types.")
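
An alternative to checking every value with `isinstance` is to check each column's dtype once, using the predicates in `pandas.api.types`. The sketch below is illustrative: `dtype_checks`, `find_dtype_mismatches`, and the toy `sample_df` are our own names and data, and only three of the notebook's columns are shown.

```python
import pandas as pd
from pandas.api import types as ptypes

# Expected dtype family per column, expressed as pandas dtype predicates
dtype_checks = {
    "Youtuber": ptypes.is_object_dtype,
    "uploads": ptypes.is_integer_dtype,
    "lowest_monthly_earnings": ptypes.is_float_dtype,
}

def find_dtype_mismatches(df, checks):
    """Return the names of columns whose dtype fails its expected predicate."""
    return [col for col, check in checks.items() if not check(df[col])]

# Toy data where "uploads" is deliberately stored as float to trigger a mismatch
sample_df = pd.DataFrame({
    "Youtuber": pd.Series(["Channel A", "Channel B"], dtype=object),
    "uploads": [10.0, 20.0],
    "lowest_monthly_earnings": [1.5, 2.5],
})

mismatches = find_dtype_mismatches(sample_df, dtype_checks)
print(mismatches)
```

A dtype-level check is faster than looping over values, but unlike the per-value loop it will not flag a stray `NaN` inside an object column.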

### Post Data Cleaning



#### Dataset Information:
- **`Rows/Observations`**: 820
- **`Variables/Columns`**: 12
- **`Data Types`**: float64(7), int64(2), object(3)

The variables that will be used are as follows:
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers (millions)`**: Number of subscribers to the channel
- **`video views (millions)`**: Total number of video views on the channel, in millions
- **`category`**: Category or niche of the channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel (\$)
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel (\$)
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel (\$)
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel (\$)
- **`created_year`**: Year when the YouTube channel was created
- **`Population (millions)`**: Total population of the country

In [None]:
youtube_df.info()

# Exploratory Data Analysis

    
This section of the Jupyter Notebook contains our Exploratory Data Analysis of the dataset, together with the questions we want to answer through it.

### 1. What are the lowest and highest yearly earning categories?

**Description**:
- This analysis aims to identify and explore the YouTube content categories or niches that yield the lowest and highest yearly earnings.

**Variables**:
- `category`
- `lowest_yearly_earnings`
- `highest_yearly_earnings`

**Numerical Summary**

Get the lowest and highest yearly earnings of Youtube Channels per category.
Calculate the average yearly earnings for reference as well.
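
A per-category summary can be produced with `groupby`. The sketch below uses toy data and takes the mean of each earnings column per category, which is one possible reading of "lowest and highest yearly earning categories"; other aggregations such as the median could be substituted.

```python
import pandas as pd

# Toy data (illustrative values only)
sample_df = pd.DataFrame({
    "category": ["Music", "Music", "Gaming", "Gaming", "Education"],
    "lowest_yearly_earnings": [100.0, 300.0, 50.0, 150.0, 80.0],
    "highest_yearly_earnings": [1000.0, 3000.0, 500.0, 1500.0, 800.0],
})

# Midpoint of the lowest and highest estimates for each channel
sample_df["average_yearly_earnings"] = (
    sample_df["lowest_yearly_earnings"] + sample_df["highest_yearly_earnings"]
) / 2

# Mean of each earnings column per category, sorted by the average estimate
earnings_columns = [
    "lowest_yearly_earnings",
    "highest_yearly_earnings",
    "average_yearly_earnings",
]
earnings_by_category = (
    sample_df.groupby("category")[earnings_columns]
    .mean()
    .sort_values("average_yearly_earnings", ascending=False)
)
print(earnings_by_category)

top_category = earnings_by_category.index[0]
bottom_category = earnings_by_category.index[-1]
print(f"Highest: {top_category}, lowest: {bottom_category}")
```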

In [None]:
# show first 10 lowest_yearly_earnings values
youtube_df['lowest_yearly_earnings'].head(10)

In [None]:
# show first 10 highest_yearly_earnings values
youtube_df['highest_yearly_earnings'].head(10)

In [None]:
# calculate the average yearly earnings
youtube_df['average_yearly_earnings'] = (youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']) / 2

**Visualization**

Create a horizontal bar chart to visualize the lowest and highest yearly earnings categories, showing how they compare to one another.

In [None]:
# make category as string
youtube_df['category'] = youtube_df['category'].astype(str)

# create horizontal bar chart for lowest yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='lowest_yearly_earnings', ascending=False),
            x='lowest_yearly_earnings', y='category')
plt.xlabel('Lowest Yearly Earnings')
plt.ylabel('Category')
plt.title('Lowest Yearly Earnings by Category')
plt.show()

# create horizontal bar chart for highest yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='highest_yearly_earnings', ascending=False),
            x='highest_yearly_earnings', y='category')
plt.xlabel('Highest Yearly Earnings')
plt.ylabel('Category')
plt.title('Highest Yearly Earnings by Category')
plt.show()

# create horizontal bar chart for average yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='average_yearly_earnings', ascending=False),
            x='average_yearly_earnings', y='category')
plt.xlabel('Average Yearly Earnings')
plt.ylabel('Category')
plt.title('Average Yearly Earnings by Category')
plt.show()

**Conclusion**

### 2. Is there a correlation between video views and subscriber counts on YouTube channels?

**Description**
- This analysis investigates the relationship between the number of video views and the number of subscribers on YouTube channels to determine if there's a correlation indicating viewer engagement.

**Variables**
- `video views (millions)`
- `subscribers (millions)`

**Numerical Summary**

Calculate the Pearson correlation coefficient to quantify the strength and direction of the correlation between video views and subscribers.


In [None]:
# calculate the Pearson correlation between video views and subscribers
youtube_df['video views (millions)'].corr(youtube_df['subscribers (millions)'])
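
For reference, the Pearson coefficient is the sum of products of the centered values divided by the square root of the product of their sums of squares. The sketch below computes it by hand on toy data and compares the result with `Series.corr`, which uses the Pearson method by default. `pearson_r` is a hypothetical helper of ours.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: sum of centered products over the root of the summed squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# Toy data (illustrative values only)
views = pd.Series([100.0, 250.0, 400.0, 800.0, 1200.0])
subs = pd.Series([10.0, 20.0, 35.0, 50.0, 90.0])

manual_r = pearson_r(views, subs)
pandas_r = views.corr(subs)  # method="pearson" is the default
print(manual_r, pandas_r)
```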


**Visualization**

Create a scatterplot to visualize the relationship between video views and subscribers, indicating the nature of the correlation.

In [None]:
# scatterplot of video views against subscribers using seaborn
plt.figure(figsize=(16, 8))
sns.scatterplot(x='video views (millions)', y='subscribers (millions)', data=youtube_df)
plt.xlabel('Video Views (millions)')
plt.ylabel('Subscribers (millions)')
plt.title('Video Views vs. Subscribers')
plt.show()

**Conclusion**


### 3. Is there a correlation between video views and average yearly earnings on YouTube channels?

**Description**
- This analysis explores whether there's a correlation between the number of video views and the average yearly earnings of YouTube channels, offering insights into the monetization potential of views.

**Variables**
- `video views (millions)`
- `lowest_yearly_earnings`
- `highest_yearly_earnings`

**Numerical Summary**

Calculate the correlation coefficient to determine the strength and direction of the correlation between video views and average yearly earnings

In [None]:
# calculate average yearly earnings
youtube_df['avg_yearly_earnings'] = (youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']) / 2

# calculate the Pearson correlation between video views and average yearly earnings
youtube_df['video views (millions)'].corr(youtube_df['avg_yearly_earnings'])

**Visualization**

Create a scatterplot to visualize the relationship between video views and average yearly earnings, highlighting the nature of the correlation.


In [None]:
# visualize the data
plt.figure(figsize=(16, 8))
sns.scatterplot(x='video views (millions)', y='avg_yearly_earnings', data=youtube_df)
plt.xlabel('Video Views')
plt.ylabel('Average Yearly Earnings')
plt.title('Video Views vs. Average Yearly Earnings')
plt.show()

**Conclusion**

### 4. Are there discernible patterns related to a YouTube channel's category and its subscribers?


**Description**:
- We aim to explore if there are discernible patterns related to a YouTube channel's category and its subscribers. We will perform numerical summary and visualization to analyze the data.

**Variables**:
- `subscribers (millions)` : The number of subscribers for YouTube channels, in millions.
- `category` : The category of the YouTube channels.


**Numerical Summary**

We calculated summary statistics for the `subscribers (millions)` column using `youtube_df['subscribers (millions)'].describe()`. Each value in this column is the subscriber count of a single YouTube channel; the per-category averages are computed in the visualization step.

In [None]:
subscribers_summary = youtube_df['subscribers (millions)'].describe()
subscribers_summary

**Visualization**

We created a horizontal bar chart to display the average subscribers for each category, using the average_subscribers_by_category variable. The x-axis represents the average subscribers (in millions), and the y-axis displays the categories. The chart is color-coded for better visualization.

In [None]:
# Bar Chart - Average Subscribers by Category
average_subscribers_by_category = youtube_df.groupby('category')['subscribers (millions)'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=average_subscribers_by_category.values, y=average_subscribers_by_category.index, palette="viridis", orient="h")
plt.title('Average Subscribers by Category')
plt.xlabel('Average Subscribers (in millions)')
plt.ylabel('Category')
plt.show()
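A category mean can be pulled up by a few very large channels, and a category with only one or two channels gives an unstable average. The following is a minimal sketch, using a made-up `toy_df` rather than the real dataset, of how the count and median could be reported alongside the mean for each category.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({
    "category": ["Music", "Music", "Music", "Shows", "Gaming", "Gaming"],
    "subscribers (millions)": [20.0, 30.0, 100.0, 45.0, 15.0, 25.0],
})

# Report how many channels each category has, plus its mean and median,
# sorted by the mean to match the ordering used in the bar chart.
category_summary = (
    toy_df.groupby("category")["subscribers (millions)"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(category_summary)
```

Reading the three columns together helps judge whether a high average reflects the category as a whole or just a small number of channels.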

**Conclusion**

From our analysis, we have found discernible patterns related to a YouTube channel's category and its subscribers. The analysis revealed the following:

- Shows had the highest average number of subscribers, with an average of over 4 million subscribers.

- Trailers came in second place in terms of average subscribers.

- Film & Animation occupied the third position in terms of average subscribers.

- Pets & Animals had the third-to-last position with lower average subscribers.

- Autos & Vehicles ranked second to last in terms of average subscribers.

- Travel & Events was the category with the lowest average number of subscribers.

These patterns suggest that a channel's category is associated with its average number of subscribers. Channels in the "Shows" category, on average, tend to have the most subscribers, while those in the "Travel & Events" category have the fewest. This information can be valuable for content creators and marketers looking to understand subscriber trends on YouTube.

### 5. Are there discernible patterns related to a YouTube channel's category and its average yearly earnings?

**Description**:
- We aim to investigate if there are discernible patterns related to a YouTube channel's category and its average yearly earnings. We will conduct numerical summary and visualization to analyze the data.


**Variables**:
- `category` : The category of the YouTube channels.
- `lowest_yearly_earnings` : The lowest estimated yearly earnings of the YouTube channels.
- `highest_yearly_earnings` : The highest estimated yearly earnings of the YouTube channels.


**Numerical Summary**

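We will compute an `average_yearly_earnings` column for each channel by adding its lowest and highest yearly earnings estimates. Note that, despite its name, this column is the sum of the two estimates rather than their midpoint; the midpoint (as used for `avg_yearly_earnings` in Question 3) is half this value. Dividing by a constant does not change how the categories rank against each other, but the dollar amounts reported below are on the scale of the sum.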

In [None]:
youtube_df['average_yearly_earnings'] = youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']
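Because the column above is a sum of the low and high estimates, a true midpoint would be exactly half of it. The sketch below, on a made-up `toy_df`, illustrates that the category ranking is the same either way and only the scale differs. The variable names are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({
    "category": ["Music", "Music", "Shows", "Gaming"],
    "lowest_yearly_earnings": [100.0, 300.0, 500.0, 50.0],
    "highest_yearly_earnings": [900.0, 1700.0, 2500.0, 350.0],
})

# Sum of the two estimates (what the notebook cell computes) ...
earnings_sum = toy_df["lowest_yearly_earnings"] + toy_df["highest_yearly_earnings"]
# ... and the midpoint of the two estimates.
earnings_midpoint = earnings_sum / 2

# Per-category means, sorted from highest to lowest.
by_category_sum = (
    earnings_sum.groupby(toy_df["category"]).mean().sort_values(ascending=False)
)
by_category_mid = (
    earnings_midpoint.groupby(toy_df["category"]).mean().sort_values(ascending=False)
)

print(by_category_sum)
print(by_category_mid)
```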

**Visualization**

We will create a horizontal bar chart to visualize the average yearly earnings for each category, using the average_yearly_earnings_by_category variable. The x-axis will represent the average yearly earnings (in dollars), and the y-axis will display the categories. The chart will be color-coded for better visualization.

In [None]:
# Bar Chart - Average Yearly Earnings by Category
average_yearly_earnings_by_category = youtube_df.groupby('category')['average_yearly_earnings'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=average_yearly_earnings_by_category.values, y=average_yearly_earnings_by_category.index, palette="viridis", orient="h")
plt.title('Average Yearly Earnings by Category')
plt.xlabel('Average Yearly Earnings (in dollars)')
plt.ylabel('Category')
plt.show()

**Conclusion**

From our analysis, we have found discernible patterns related to a YouTube channel's category and its average yearly earnings, calculated as the sum of the lowest and highest yearly earnings. The analysis reveals the following:

- Shows had the highest average yearly earnings, with an average of 2.5 million dollars.

- Autos & Vehicles came in second place in terms of average yearly earnings.

- Sports occupied the third position in terms of average yearly earnings.

- Science & Technology had the third-to-last position with lower average yearly earnings.

- Howto & Style ranked second to last in terms of average yearly earnings.

- Travel & Events was the category with the lowest average yearly earnings.

These patterns suggest that a channel's category is associated with its average yearly earnings. Channels in the "Shows" category, on average, tend to have the highest earnings, while those in the "Travel & Events" category have the lowest earnings. This information can be valuable for content creators and marketers looking to understand earning trends on YouTube.

### 6. What is the distribution of video views for YouTube channels within each content category?

**Description**:
- We aim to explore the distribution of video views for YouTube channels within each content category.


**Variables**:
- `category` : The category of the YouTube channels.
- `video views` : The number of video views for each YouTube channel.


**Numerical Summary**

Calculate summary statistics for the `video views (millions)` column within each category. This will help us understand the distribution of video views, including measures like the median and quartiles, and hint at potential outliers for each category.

In [None]:
# Calculate summary statistics (count, mean, std, min, quartiles, max) of video views within each category
views_summary = youtube_df.groupby('category')['video views (millions)'].describe()

# Sort the summary statistics by the "count" column in descending order
views_summary_sorted = views_summary.sort_values(by='count', ascending=False)

# Display the sorted summary
views_summary_sorted
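The quartiles from `describe()` can also be used to flag potential outliers. One common convention, and the default that box plots conventionally use, is to flag values more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below applies that rule per category on a made-up `toy_df`; the helper `count_iqr_outliers` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({
    "category": ["Music"] * 6 + ["Gaming"] * 5,
    "video views (millions)": [10.0, 12.0, 11.0, 13.0, 12.0, 90.0,
                               5.0, 6.0, 7.0, 6.0, 5.0],
})


def count_iqr_outliers(values: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())


# Apply the rule separately to each category's video views.
outliers_by_category = (
    toy_df.groupby("category")["video views (millions)"].apply(count_iqr_outliers)
)

print(outliers_by_category)
```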

**Visualization**

Create a box plot and a bar chart of video views for each category. The box plot shows the median, spread, and outliers of video views within each category, while the bar chart shows each category's mean (seaborn's `barplot` plots the mean by default).

In [None]:
# Horizontal Box Plot - Distribution of Video Views by Category
plt.figure(figsize=(12, 6))
sns.boxplot(x='video views (millions)', y='category', data=youtube_df)
plt.title('Distribution of Video Views by Category')
plt.xlabel('Video Views (millions)')
plt.ylabel('Category')
plt.show()

# Horizontal Bar Chart - Mean Video Views by Category (barplot aggregates with the mean by default)
plt.figure(figsize=(12, 6))
sns.barplot(x='video views (millions)', y='category', data=youtube_df)
plt.title('Mean Video Views by Category')
plt.xlabel('Mean Video Views (millions)')
plt.ylabel('Category')
plt.show()

**Conclusion**

### 7. What are the most prevalent YouTube channel categories in the dataset?

**Description**
- This analysis counts how many channels fall into each category to identify the most prevalent ones.

**Variables**
- `category`

**Numerical Summary**

Count the frequency of YouTube channels within each unique category.

In [None]:
category_counts = youtube_df['category'].value_counts()
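Raw counts are easier to compare as shares of the total, and `value_counts()` silently drops missing values by default. The following is a minimal sketch on a made-up `toy_categories` series (not the real column) showing both options.

```python
import pandas as pd

# Hypothetical toy data (values are made up); None marks a missing category.
toy_categories = pd.Series(
    ["Music", "Music", "Entertainment", "Music", "Gaming", "Entertainment", None],
    name="category",
)

# Default behaviour: missing values are excluded from the counts.
counts = toy_categories.value_counts()

# Keep missing values as their own bucket to see how many rows lack a category.
counts_with_missing = toy_categories.value_counts(dropna=False)

# Proportion of (non-missing) channels in each category.
shares = toy_categories.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares.round(3))
```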

**Visualization**

Create a bar chart to visualize the distribution of YouTube channels across unique categories.

In [None]:
# create a bar chart to visualize the distribution of YouTube channels across categories
plt.figure(figsize=(12, 6))  # Adjust the figure size
ax = sns.barplot(x=category_counts.values, y=category_counts.index, palette="viridis", orient="h")
ax.set(xlabel='Number of Channels', ylabel='Category')
plt.title('Distribution of YouTube Channels Across Categories')

# Add data labels to the right of each bar for readability
for p in ax.patches:
    width = p.get_width()
    ax.annotate(f'{int(width)}', xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center', fontsize=9, color='black')

plt.tight_layout()  # Ensure labels and titles fit in the plot area
plt.show()


**Conclusion**

### 8. What is the distribution of YouTube channels by their country of origin in the dataset? Retrieve the top 15 countries.

**Description**
- This analysis aims to identify the 15 countries with the most channels in the dataset.

**Variables**
- `Country`

**Numerical Summary**

Get the frequency of YouTube channels from each country.

In [None]:
country_counts = youtube_df['Country'].value_counts()

Retrieve the 15 countries with the most YouTube channels.

In [None]:
top_15_countries = country_counts[:15]

top_15_countries
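Slicing works here because `value_counts()` returns counts sorted in descending order. When several countries are tied at the cutoff, a plain slice keeps only some of them. The sketch below, on a made-up `toy_countries` series, compares slicing, `head`, and `nlargest` with `keep="all"`, which retains every country tied with the last place.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_countries = pd.Series(
    ["India", "United States", "India", "Brazil",
     "United States", "India", "Japan", "Mexico"],
    name="Country",
)

country_counts_toy = toy_countries.value_counts()
top_n = 3

# Positional slice and head() both take the first top_n rows.
top_by_slice = country_counts_toy[:top_n]
top_by_head = country_counts_toy.head(top_n)

# nlargest with keep="all" also returns every entry tied with the cutoff value.
top_with_ties = country_counts_toy.nlargest(top_n, keep="all")

print(top_by_head)
print(top_with_ties)
```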

**Visualization**

Create a bar chart to visualize the distribution of YouTube channels by country – showing only the top 15.

In [None]:
plt.figure(figsize=(12, 6))  # Adjust the figure size
ax = sns.barplot(x=top_15_countries.values, y=top_15_countries.index, palette="viridis", orient="h")
ax.set(xlabel='Number of Channels', ylabel='Country')
plt.title('Distribution of YouTube Channels Across Countries')

# Add data labels to the right of each bar for readability
for p in ax.patches:
    width = p.get_width()
    ax.annotate(f'{int(width)}', xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center', fontsize=9, color='black')

plt.tight_layout()  # Ensure labels and titles fit in the plot area
plt.show()

**Conclusion**