# CSMODEL Machine Project: 

This Jupyter Notebook includes the following contents:
    
    0.  2023 Global Youtube Statistics Dataset
    1.  Description of the Dataset
    2.  Data Cleaning Process of the Dataset
    3.  Exploratory Data Analysis 
    4.  Proposed Research Question
    5.  Data Preprocessing
    6.  Data Modelling
    7.  Statistical Inference
    8.  Insights and Conclusion

### GARYNATION  (S13)
    1.  Bien Aaron Miranda 
    2.  Luis Miguel Rana
    3.  Karl Andre Aquino
    4.  Dominic Luis Baccay

# Phase 1

### Python Libraries

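Import the libraries used throughout the notebook: NumPy and pandas for data handling, Matplotlib and seaborn for visualization, and scikit-learn and SciPy for modelling and statistical inference.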

pandas is a software library for Python which provides data structures and data analysis tools.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from scipy.stats import chi2_contingency
from scipy import stats

## The Dataset
This dataset, which includes well-known YouTube personalities, offers a compelling opportunity to analyze and glean invaluable insights from the influential figures within the platform. It contains comprehensive details on subscriber counts, video views, posting frequency, country of origin, income, and other relevant information about top creators. This treasure trove of data is indispensable for up-and-coming content creators, data aficionados, and anyone intrigued by the dynamic landscape of online content creation.

As stated by the dataset owner, the dataset was compiled from several credible sources and curated to ensure the accuracy and reliability of the information it contains.





In [None]:
# The encoding was taken from the discussion so that the file can be read
youtube_df = pd.read_csv("Global YouTube Statistics.csv", encoding='iso-8859-1')

If you view the `.csv` file in Excel, you can see that our dataset contains about 995 **observations** (rows) across 28 **variables** (columns). The following are the descriptions of each variable in the dataset.

- **`rank`**: Position of the YouTube channel based on the number of subscribers
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers`**: Number of subscribers to the channel
- **`video views`**: Total number of video views across the channel
- **`category`**: Category or niche of the channel
- **`Title`**: Title
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`Abbreviation`**: Abbreviation of the country
- **`channel_type`**: Type of the YouTube channel (e.g., individual, brand)
- **`video_views_rank`**: Ranking of the channel based on total video views
- **`country_rank`**: Ranking of the channel based on the number of subscribers within its country
- **`channel_type_rank`**: Ranking of the channel based on its type (individual or brand)
- **`video_views_for_the_last_30_days`**: Total video views in the last 30 days
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel \$
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel \$
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel \$
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel \$
- **`subscribers_for_last_30_days`**: Number of new subscribers gained in the last 30 days
- **`created_year`**: Year when the YouTube channel was created
- **`created_month`**: Month when the YouTube channel was created
- **`created_date`**: Exact date of the YouTube channel's creation
- **`Gross tertiary education enrollment (\%)`**: Percentage of the population enrolled in tertiary education in the country %
- **`Population`**: Total population of the country
- **`Unemployment rate`**: Unemployment rate in the country %
- **`Urban_population`**: Percentage of the population living in urban areas %
- **`Latitude`**: Latitude coordinate of the country's location
- **`Longitude`**: Longitude coordinate of the country's location




## Reading the Dataset

Our first step is to load the dataset with the pandas `read_csv` function, which returns a pandas DataFrame (this was done in the cell above). Note that you may need to change the path depending on the location of the file on your machine.

We will also look at what the dataset contains once it is loaded as a DataFrame.

In [None]:
youtube_df.info()
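
The row and column counts quoted earlier can also be read programmatically instead of opening the file in Excel. The following is a minimal sketch on a hypothetical three-row `toy_df` (the channel names and numbers are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy_df = pd.DataFrame({
    "Youtuber": ["Channel A", "Channel B", "Channel C"],
    "subscribers": [245_000_000, 170_000_000, 166_000_000],
    "Country": ["India", "United States", None],
})

n_rows, n_cols = toy_df.shape          # (observations, variables)
null_counts = toy_df.isnull().sum()    # missing values per column
dtype_counts = toy_df.dtypes.astype(str).value_counts()  # columns per data type

print(f"{n_rows} observations across {n_cols} variables")
print(null_counts)
print(dtype_counts)
```

Running the same three expressions on `youtube_df` should give the figures that `info()` reports in a form that can be reused in later cells.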

### Examining the observations within the Dataset

    
To better understand the observations in the dataset, we display its first 10 rows. Seeing actual values makes it easier to spot problems and to decide how each column should be cleaned.

In [None]:
youtube_df.head(10)

# Cleaning the Dataset

This section of the Jupyter Notebook contains the overall process and variables affected during the data cleaning of our dataset.

### Before Cleaning:
    Number of Observations: 995
    Number of Columns: 28
    Data types: float64(18), int64(3), object(7)

### Initial Observations: 
-   There are special characters in the 'Youtuber' column.
-   There are missing/null values in the dataset.
-   Some variables are redundant and therefore unnecessary.
-   Some data types are inappropriate for their respective variables.



#### Variables to be Observed:
We will be dropping all columns/variables except the following.
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers`**: Number of subscribers to the channel
- **`video views`**: Total number of video views across the channel
- **`category`**: Category or niche of the channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`channel_type`**: Type of the YouTube channel (e.g., individual, brand)
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel \$
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel \$
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel \$
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel \$
- **`created_year`**: Year when the YouTube channel was created
- **`Population`**: Total population of the country

In [None]:
# Declare the variables that will be used
columns_to_keep = ['Youtuber', 'subscribers', 'video views', 'uploads', 'category', 'channel_type', 'Country', 'lowest_monthly_earnings', 'highest_monthly_earnings', 'lowest_yearly_earnings', 'highest_yearly_earnings', 'created_year', 'Population']
columns_to_drop = list(set(youtube_df.columns) - set(columns_to_keep))

# Drop unused columns (variables)
youtube_df.drop(columns_to_drop, axis=1, inplace=True)

youtube_df

### The "Youtuber" variable

- **`Youtuber`**: Name of the YouTube channel<br>

We will explore the **`Youtuber`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any observations that have null **`Youtuber`** values. If there are any, we will exclude these observations from the dataset.


In [None]:
# Check for null values
column_name = "Youtuber"
nullyoutuber_count = youtube_df[column_name].isnull().sum()
print(f"Number of null values in '{column_name}': {nullyoutuber_count}")


We want to check if there are any duplicate **`Youtuber`** values. If there are any, the duplicate rows will be removed.
     

In [None]:
# Check for duplicates
unique_count = youtube_df[column_name].nunique()
observation_count = youtube_df[column_name].count()

print(f"Number of observations in '{column_name}': {observation_count}")
print(f"Number of unique values in '{column_name}': {unique_count}")
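
Comparing the two counts shows whether duplicates exist, but not which rows they are. A minimal sketch of how the duplicated names could be listed and removed, using a hypothetical `toy_df` rather than the real data:

```python
import pandas as pd

# Hypothetical toy data: "Channel A" appears twice
toy_df = pd.DataFrame({"Youtuber": ["Channel A", "Channel B", "Channel A", "Channel C"]})

# keep=False marks every member of a duplicated group, not only the repeats
duplicate_rows = toy_df[toy_df["Youtuber"].duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each name by default
deduplicated_df = toy_df.drop_duplicates(subset="Youtuber").reset_index(drop=True)

print(duplicate_rows)
print(deduplicated_df)
```

Whether the first or the last occurrence should be kept is a judgment call; the default (`keep='first'`) is assumed here.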

Since there are special characters present within the dataset, we will replace them with underscores "_" instead for readability.

In [None]:
# Define an array of special characters to replace
chars_to_replace = ['ï', '¿', '½', 'ý']

# Replace the characters with underscores
for char in chars_to_replace:
    youtube_df['Youtuber'] = youtube_df['Youtuber'].str.replace(char, '_', regex=False)

youtube_df
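
As an alternative sketch, the same replacement can be done in a single pass with a regular-expression character class instead of a loop. The sample names below are hypothetical and only illustrate the pattern:

```python
import pandas as pd

# Hypothetical channel names containing the four special characters
names = pd.Series(["ýýý Music", "Channel ï¿½", "Plain Name"])

# Every character inside [...] is replaced by an underscore in one call
cleaned = names.str.replace(r"[ï¿½ý]", "_", regex=True)

print(cleaned.tolist())
```

The loop in the cell above and this one-liner should produce the same result; the regex form mainly saves repeated passes over the column.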

### The "Subscribers" variable

- **`subscribers`**: Number of subscribers to the channel<br>

We will explore the **`subscribers`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any observations with null or zero **`subscribers`** values.

In [None]:
# Display the count of null or zero values
null_subscriber_count = (youtube_df['subscribers'].isnull() | youtube_df['subscribers'].eq(0)).sum()
print(f'Number of null or zero values in "subscribers": {null_subscriber_count}')

Since there are none, we now want to convert the data type from **`int64`** to **`float64`** for scaling. <br>
We will scale the values to millions, so we will need to convert the data type to one that can recognize decimals. <br>
We are scaling the values for easier interpretation and visualization.

In [None]:
# Convert to float64 data type
youtube_df['subscribers'] = youtube_df['subscribers'].astype('float64')
# Convert the "subscribers" column to values in millions
youtube_df['subscribers'] = youtube_df['subscribers'] / 1000000
# Rename the "subscribers" column to indicate values are in millions
youtube_df.rename(columns={'subscribers': 'subscribers (millions)'}, inplace=True)
youtube_df

We also renamed the column from **`"subscribers"`** to **`"subscribers (millions)"`**. 
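
Because the same scale-and-rename step is repeated for other columns later in the notebook, it could be wrapped in a small helper. This is only a sketch; `scale_to_millions` is a hypothetical name that does not appear in the notebook, and the values are made up:

```python
import pandas as pd

def scale_to_millions(df, column):
    """Return a copy of df with `column` divided by 1,000,000 and renamed."""
    scaled = df.copy()
    scaled[column] = scaled[column].astype("float64") / 1_000_000
    return scaled.rename(columns={column: f"{column} (millions)"})

# Hypothetical subscriber counts
toy_df = pd.DataFrame({"subscribers": [245_000_000, 12_300_000]})
scaled_df = scale_to_millions(toy_df, "subscribers")

print(scaled_df)
```

Returning a copy instead of modifying the DataFrame in place keeps the original values available for comparison.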

### The "Video Views" and "Uploads" variables
- **`video views`**: Total number of video views across the channel
- **`uploads`**: Total number of videos uploaded on the channel<br>

We will explore the **`video views`** and **`uploads`** columns and check if they need any cleaning or preprocessing.

We want to check if there are any observations with zero values in these columns, which we treat as missing.

In [None]:
# Check for zero values in "video views"
nullviews_df = youtube_df[youtube_df['video views'] == 0]
nullviews_df

In [None]:
# Check for zero values in "uploads"
nulluploads_df = youtube_df[youtube_df['uploads'] == 0]
nulluploads_df.head(10)

In [None]:
# Display the counts
null_views_count = youtube_df['video views'].eq(0).sum()
null_uploads_count = youtube_df['uploads'].eq(0).sum()
print(f'Number of zero values in "video views": {null_views_count}')
print(f'Number of zero values in "uploads": {null_uploads_count}')
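
Note that `eq(0)` only counts zeros; a true missing value (`NaN`) compares as not equal to zero and would go unnoticed. The sketch below, on hypothetical data, shows the two counts side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two zeros and one NaN
toy_df = pd.DataFrame({"uploads": [120.0, 0.0, np.nan, 45.0, 0.0]})

zero_count = toy_df["uploads"].eq(0).sum()    # zeros only
nan_count = toy_df["uploads"].isna().sum()    # missing values only
zero_or_nan = (toy_df["uploads"].eq(0) | toy_df["uploads"].isna()).sum()

print(zero_count, nan_count, zero_or_nan)
```

If both kinds of values should be treated as missing, the combined mask is the one to filter on.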

We want to scale the **`video views`** into millions for easier interpretation and visualization.

In [None]:
# Convert the "video views" column to values in millions
youtube_df['video views'] = youtube_df['video views'] / 1000000
# Rename the "video views" column to indicate values are in millions
youtube_df.rename(columns={'video views': 'video views (millions)'}, inplace=True)

In [None]:
youtube_df['uploads'] = youtube_df['uploads'].astype('float64')

We want to see whether **`video views (millions)`** and **`uploads`** show any correlation with their row index.

In [None]:
# Create a figure with two subplots side by side
fig, axs = plt.subplots(1, 2, figsize=(12, 4))

# Plot 'video views' (in millions) in the first subplot
axs[0].scatter(youtube_df.index, youtube_df['video views (millions)'])
axs[0].set_title('Correlation between Video Views (Millions) and Index')
axs[0].set_xlabel('Index')
axs[0].set_ylabel('Video Views (Millions)')

# Plot 'uploads' in the second subplot
axs[1].scatter(youtube_df.index, youtube_df['uploads'], color='red')
axs[1].set_title('Correlation between Uploads and Index')
axs[1].set_xlabel('Index')
axs[1].set_ylabel('Uploads')

# Adjust the layout for better spacing
plt.tight_layout()

# Show the combined figure
plt.show()

Given the absence of any significant correlation between **`video views`**, **`uploads`**, and their respective indexes, we decided to remove the observations in question rather than replace them with a measure of central tendency.

In [None]:
# Drop the rows where video views == 0
youtube_df = youtube_df[youtube_df["video views (millions)"] != 0]

# Drop the rows where uploads == 0
youtube_df = youtube_df[youtube_df["uploads"] != 0]
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the index remains continuous.

In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df
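
To make the effect of the reset concrete, here is a minimal sketch on hypothetical data showing the index before and after:

```python
import pandas as pd

# Hypothetical data: two channels report zero uploads
toy_df = pd.DataFrame({"Youtuber": ["A", "B", "C", "D"], "uploads": [10, 0, 25, 0]})

# Boolean-mask filtering keeps the original row labels, leaving gaps
filtered = toy_df[toy_df["uploads"] != 0]
print(filtered.index.tolist())

# reset_index(drop=True) renumbers the rows and discards the old labels
reindexed = filtered.reset_index(drop=True)
print(reindexed.index.tolist())
```

Without `drop=True`, the old labels would be kept as an extra `index` column, which is not wanted here.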

### The "Category" variable
- **`category`**: Category or niche of the channel<br>

We will explore the **`category`** column and check if it needs any cleaning or preprocessing.


We want to check if there are any observations that have null values for the **`category`** column.

In [None]:
# Check for null values in "category"
nullcategory_df = youtube_df[youtube_df["category"].isnull()]
nullcategory_df.head(10)

In [None]:
# Display the counts
null_category_count = youtube_df['category'].isnull().sum()
print(f'Number of null values in "category": {null_category_count}')

We will now check for possible values we can use to replace the null values within the **`category`** column based on the **`channel_type`** column.<br>
We are replacing the null values instead of removing them due to the presence of the **`channel_type`** variable which is similar in nature.

In [None]:
# Displaying the 'category' and 'channel_type' values for cleaning
column1 = 'category'
column2 = 'channel_type'

categoryandchannel_df = youtube_df.dropna(subset=[column1, column2])

unique_values_column1 = categoryandchannel_df[column1].unique().tolist()
unique_values_column2 = categoryandchannel_df[column2].unique().tolist()

df_unique_values_column1 = pd.DataFrame(unique_values_column1, columns=[column1])
df_unique_values_column2 = pd.DataFrame(unique_values_column2, columns=[column2])

# Concatenate DataFrames horizontally
result_df = pd.concat([df_unique_values_column1, df_unique_values_column2], axis=1)
result_df


After going through the unique values, we have prepared the following values to be used as replacement based on the **`channel_type`**.

| channel_type  | category             |
| ------------- | -------------------- |
| Games         | Gaming               |
| People        | People & Blogs       |
| Entertainment | Entertainment        |
| Sports        | Sports               |
| Film          | Film & Animation     |
| Howto         | Howto & Style        |
| Education     | Education            |
| Tech          | Science & Technology |
| Music         | Music                |

The remaining rows whose **`category`** is still null, because they have no usable **`channel_type`** value, will be removed.

In [None]:
# Replace category null values with closest value in channel_type

# Use the .loc method to set the new value based on the condition (where 'category' is null)
youtube_df.loc[(youtube_df['channel_type'] == 'Games') & (youtube_df['category'].isnull()), 'category'] = 'Gaming'
youtube_df.loc[(youtube_df['channel_type'] == 'People') & (youtube_df['category'].isnull()), 'category'] = 'People & Blogs'
youtube_df.loc[(youtube_df['channel_type'] == 'Entertainment') & (youtube_df['category'].isnull()), 'category'] = 'Entertainment'
youtube_df.loc[(youtube_df['channel_type'] == 'Sports') & (youtube_df['category'].isnull()), 'category'] = 'Sports'
youtube_df.loc[(youtube_df['channel_type'] == 'Film') & (youtube_df['category'].isnull()), 'category'] = 'Film & Animation'
youtube_df.loc[(youtube_df['channel_type'] == 'Howto') & (youtube_df['category'].isnull()), 'category'] = 'Howto & Style'
youtube_df.loc[(youtube_df['channel_type'] == 'Education') & (youtube_df['category'].isnull()), 'category'] = 'Education'
youtube_df.loc[(youtube_df['channel_type'] == 'Tech') & (youtube_df['category'].isnull()), 'category'] = 'Science & Technology'
youtube_df.loc[(youtube_df['channel_type'] == 'Music') & (youtube_df['category'].isnull()), 'category'] = 'Music'

# Drop remaining rows with null values for 'category' column
youtube_df.dropna(subset=['category'], inplace=True)

# Display the modified DataFrame
youtube_df
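
The nine `.loc` assignments above can be sketched more compactly with a dictionary and `fillna`. The mapping below copies the replacement table; `channel_type_to_category` and the toy rows are hypothetical names and data introduced only for this illustration:

```python
import pandas as pd

# Mapping taken from the replacement table (channel_type -> category)
channel_type_to_category = {
    "Games": "Gaming",
    "People": "People & Blogs",
    "Entertainment": "Entertainment",
    "Sports": "Sports",
    "Film": "Film & Animation",
    "Howto": "Howto & Style",
    "Education": "Education",
    "Tech": "Science & Technology",
    "Music": "Music",
}

# Hypothetical rows: three missing categories, one of them with no channel_type
toy_df = pd.DataFrame({
    "category": [None, "Music", None, None],
    "channel_type": ["Games", "Music", "Tech", None],
})

# fillna only touches rows where category is null; unmapped types stay null
toy_df["category"] = toy_df["category"].fillna(
    toy_df["channel_type"].map(channel_type_to_category)
)

# Rows that are still null have no usable channel_type and are dropped
toy_df = toy_df.dropna(subset=["category"]).reset_index(drop=True)
print(toy_df)
```

Keeping the mapping in one dictionary makes it easier to review against the table and to extend if more channel types need to be covered.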

We will also drop the **`channel_type`** column since we will no longer be using it.

In [None]:
# Drop 'channel_type' column
youtube_df.drop(columns='channel_type', inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the index remains continuous.


In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

### The "Country" variable
- **`Country`**: Country where the YouTube channel originates<br>

We will explore the **`Country`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any observations that have null values for the **`Country`** column.

In [None]:
# Check for null values in Country
null_country_df = youtube_df[youtube_df["Country"].isnull()]
null_country_df.head(10)

In [None]:
# Check for null values in Country
null_country_count = youtube_df["Country"].isnull().sum()
print(f'Number of null values in "Country": {null_country_count}')

There are 94 observations without a country, which is approximately 10% of the dataset.<br>
Next, we will look for possible misrepresentations of the data.

In [None]:
# Get the unique values in the "Country" column
unique_country = youtube_df['Country'].unique()
print(unique_country)
print(f"Number of countries: {len(unique_country)}")


Since there are no misrepresentations, we will check the count for each unique value and determine whether it is possible to use the mode as a replacement for the missing values.

In [None]:
# Get the count of each unique value in the "Country" column
country_counts = youtube_df['Country'].value_counts()

# Display the count of each unique value
print("Count of each unique value in 'Country':")
print(country_counts)

The mode value **`United States`** has a count of 306, meaning approximately 32.27% of Youtubers in the dataset are from the **`United States`**.<br>
We will not replace missing values with the mode value since there is a large number of unique values in the dataset and the mode already represents a significant portion of the data. <br>
If replaced, it can lead to a lack of diversity in the dataset, and it may not accurately reflect the underlying data distribution.

For visualization, the distribution of values after such a replacement is shown below.

In [None]:
# Replace missing values with the mode
test_df = youtube_df.copy()

test_df['Country'] = test_df['Country'].fillna(test_df['Country'].mode()[0])

# Create a countplot to show the distribution of 'country' values
plt.figure(figsize=(10, 5))
sns.countplot(data=test_df, x='Country')
plt.xlabel('Country')
plt.ylabel('Count')
plt.title('Distribution of Country Values')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability

plt.show()
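
The effect of mode imputation can also be expressed as a number rather than a plot. The following sketch uses a hypothetical ten-row series (not the real country counts) to compare the share of the mode before and after filling:

```python
import pandas as pd

# Hypothetical country column with four missing values
toy_country = pd.Series([
    "United States", "India", None, "United States", None,
    "Brazil", "India", "United States", None, None,
])

# value_counts ignores missing values, so this is the share among known rows
share_before = toy_country.value_counts(normalize=True)

# After imputation every missing row is counted toward the mode
filled = toy_country.fillna(toy_country.mode()[0])
share_after = filled.value_counts(normalize=True)

print(share_before["United States"], share_after["United States"])
```

In this toy case the mode grows from half of the known rows to well over half of all rows, which is the kind of distortion the paragraph above warns about.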

We will be removing the observations with null **`Country`** values.

In [None]:
# Drop observations with null values
youtube_df.dropna(subset=['Country'], inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the index remains continuous.



In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

### The "Lowest Monthly Earnings" variable
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel \$<br>

We will explore the **`lowest_monthly_earnings`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any values for **`lowest_monthly_earnings`** that are zero (0) or null.

In [None]:
# Check for zero values in lowest_monthly_earnings
null_lowestmonthly_df = youtube_df[youtube_df["lowest_monthly_earnings"].eq(0)]
null_lowestmonthly_df.head(10)

In [None]:
# Display the count
null_lowestmonthly_count = youtube_df["lowest_monthly_earnings"].eq(0).sum()
print(f'Number of zero values in "lowest_monthly_earnings": {null_lowestmonthly_count}')

There are 34 observations with zero (0) or null values for **`lowest_monthly_earnings`**.<br><br>
Since earnings may also come from sponsorships, affiliate marketing, etc., some of the estimated earnings may not be accurate, particularly the zero or null values.
We will remove these observations to maintain the overall quality and integrity of our dataset. Removing them avoids distorting the underlying patterns in the data and avoids the complications of imputation. We can still perform a meaningful analysis despite the smaller dataset.

In [None]:
# Drop all observations with zero lowest_monthly_earnings
youtube_df.drop(youtube_df[youtube_df["lowest_monthly_earnings"].eq(0)].index, inplace=True)
youtube_df

After removing the observations, we reset the index to ensure there are no gaps and the index remains continuous.


In [None]:
# Reset the index after dropping rows
youtube_df.reset_index(drop=True, inplace=True)
youtube_df

We will not remove outliers, since it is possible for a channel to have no "earnings" or very high "earnings".

### The "Highest Monthly Earnings" variable
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel \$<br>

We will explore the **`highest_monthly_earnings`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any values for **`highest_monthly_earnings`** that are zero (0) or null.

In [None]:
# Check for zero values in highest_monthly_earnings
null_highestmonthly_df = youtube_df[youtube_df["highest_monthly_earnings"].eq(0)]
null_highestmonthly_df.head(10)

In [None]:
# Display the count
null_highestmonthly_count = youtube_df["highest_monthly_earnings"].eq(0).sum()
print(f'Number of zero values in "highest_monthly_earnings": {null_highestmonthly_count}')

There are no observations with zero (0) or null values for **`highest_monthly_earnings`**.

There is no need to drop any observations.

### The "Lowest Yearly Earnings" variable
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel \$<br>

We will explore the **`lowest_yearly_earnings`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any values for **`lowest_yearly_earnings`** that are zero (0) or null.

In [None]:
# Check for zero values in lowest_yearly_earnings
null_lowestyearly_df = youtube_df[youtube_df["lowest_yearly_earnings"].eq(0)]
null_lowestyearly_df.head(10)

In [None]:
# Display the count
null_lowestyearly_count = youtube_df["lowest_yearly_earnings"].eq(0).sum()
print(f'Number of zero values in "lowest_yearly_earnings": {null_lowestyearly_count}')

There are no observations with zero (0) or null values for **`lowest_yearly_earnings`**.

There is no need to drop any observations.

### The "Highest Yearly Earnings" variable
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel \$<br>

We will explore the **`highest_yearly_earnings`** column and check if it needs any cleaning or preprocessing.

We want to check if there are any values for **`highest_yearly_earnings`** that are zero (0) or null.

In [None]:
# Check for zero values in highest_yearly_earnings
null_highestyearly_df = youtube_df[youtube_df["highest_yearly_earnings"].eq(0)]
null_highestyearly_df.head(10)

In [None]:
# Display the count
null_highestyearly_count = youtube_df["highest_yearly_earnings"].eq(0).sum()
print(f'Number of zero values in "highest_yearly_earnings": {null_highestyearly_count}')

There are no observations with zero (0) or null values for **`highest_yearly_earnings`**. There is no need for cleaning.

### The "Created Year" variable
- **`created_year`**: Year when the YouTube channel was created<br>

We will explore the **`created_year`** column and check if it needs any cleaning or preprocessing.


We want to check if there are any values for **`created_year`** that are zero (0) or null.

In [None]:
# Check for zero values in created_year
null_createdyear_df = youtube_df[youtube_df["created_year"].eq(0)]
null_createdyear_df.head(10)

In [None]:
# Display the count
null_createdyear_count = youtube_df["created_year"].eq(0).sum()
print(f'Number of zero values in "created_year": {null_createdyear_count}')

We also want to check if there are any values earlier than 2005, since YouTube was launched in that year.

In [None]:
# Check if there are any observations of Youtube channels created before 2005
subset = youtube_df[youtube_df['created_year'] < 2005]
subset


After manual checking, the channel does exist, but the year shown is most likely an error or a default value assigned in the database (POSIX time 0 corresponds to 1970), since it predates YouTube itself.

Because we are dealing with years, we will remove the observation: an erroneous year can significantly affect the accuracy and interpretation of the data.

In [None]:
# Get the index of the subset
indices_to_drop = subset.index

# Drop the observations from the original DataFrame
youtube_df.drop(indices_to_drop, inplace=True)
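
The POSIX default mentioned above can be checked directly: timestamp 0 falls in the year 1970. The sketch below also shows an equivalent filter on hypothetical data; `YOUTUBE_LAUNCH_YEAR` is a name introduced here for illustration only:

```python
import pandas as pd

# POSIX time 0 converted to a calendar date
epoch_start = pd.to_datetime(0, unit="s")
print(epoch_start.year)

# Hypothetical rows: one channel carries the default year
toy_df = pd.DataFrame({"Youtuber": ["A", "B"], "created_year": [1970.0, 2012.0]})

YOUTUBE_LAUNCH_YEAR = 2005  # assumption: no channel can predate the platform
valid_df = toy_df[toy_df["created_year"] >= YOUTUBE_LAUNCH_YEAR]
print(valid_df)
```

One difference to keep in mind: filtering with `>=` would also exclude rows whose year is `NaN`, whereas dropping only the rows found with `< 2005` leaves them in place.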

### The "Population" variable
- **`Population`**: Total population of the country<br>

We will explore the  **`Population`** column and check if it needs any cleaning or preprocessing.


We want to check if there are any values for  **`Population`** that are zero (0) or null.

In [None]:
# Check for zero values in Population
null_population_df = youtube_df[youtube_df["Population"].eq(0)]
null_population_df.head(10)

In [None]:
# Display the count
null_population_count = youtube_df["Population"].eq(0).sum()
print(f'Number of zero values in "Population": {null_population_count}')

In [None]:
# Check for NaN values in the 'Population' column
nan_population_count = youtube_df['Population'].isna().sum()
print("Number of NaN values in 'Population' column:", nan_population_count)

There is 1 observation with a NaN value for **`Population`**. We will drop this observation.



In [None]:
# Drop rows with NaN values in 'Population' column
youtube_df.dropna(subset=['Population'], inplace=True)

We want to scale the values to millions to make it easier to interpret and visualize. 

In [None]:
# Convert the "Population" column to values in millions
youtube_df['Population'] = youtube_df['Population'] / 1000000
# Rename the "Population" column to indicate values are in millions
youtube_df = youtube_df.rename(columns={'Population': 'Population (millions)'})
# Round the values to 3 decimal places
youtube_df['Population (millions)'] = youtube_df['Population (millions)'].round(3)

youtube_df

The **`Population`** column is renamed to **`Population (millions)`** to represent the scaling.

### Check Data Types of the Values in each Column

We want to check if the values in each column match their respective data types **`float64`**, **`int64`**, and **`object`**.

In [None]:
# Define the expected data types
expected_data_types = {
    'Youtuber': object,
    'subscribers (millions)': float,
    'video views (millions)': float,
    'uploads': float,
    'category': object,
    'Country': object,
    'lowest_monthly_earnings': float,
    'highest_monthly_earnings': float,
    'lowest_yearly_earnings': float,
    'highest_yearly_earnings': float,
    'created_year': float,
    'Population (millions)': float
}

# Check if values in each column match the expected data type using a custom function
def check_data_type(column_name, expected_dtype):
    for value in youtube_df[column_name]:
        if not isinstance(value, expected_dtype):
            return False
    return True

all_columns_match = True

for column_name, expected_dtype in expected_data_types.items():
    if not check_data_type(column_name, expected_dtype):
        all_columns_match = False
        print(f"Column '{column_name}' does not match the expected data type {expected_dtype}.")

if all_columns_match:
    print("All columns match the expected data types.")
else:
    print("Not all columns match the expected data types.")
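
One caveat about the loop above: `isinstance(value, object)` is true for every Python value, so it cannot flag a mismatch in the `object` columns. As a complementary sketch, the column-level dtype can be checked with helpers from `pandas.api.types`. The toy DataFrame and the `float_columns` list are hypothetical:

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Hypothetical data: created_year was deliberately left as integers
toy_df = pd.DataFrame({
    "Youtuber": ["A", "B"],
    "uploads": [10.0, 25.0],
    "created_year": [2006, 2012],
})

# Columns that are expected to be stored as floats
float_columns = ["uploads", "created_year"]

# Collect every column whose stored dtype is not a float type
mismatched = [col for col in float_columns if not is_float_dtype(toy_df[col])]
print(mismatched)
```

Checking the dtype once per column is also much faster than inspecting each value, although it says nothing about individual cells in mixed `object` columns.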

### Post Data Cleaning



#### Dataset Information:
- **`Rows/Observations`**: 820
- **`Variables/Columns`**: 12
- **`Data Types`**: float64(9), object(3)

The variables that will be used are as follows:
- **`Youtuber`**: Name of the YouTube channel
- **`subscribers (millions)`**: Number of subscribers to the channel
- **`video views (millions)`**: Total number of video views across the channel, in millions
- **`category`**: Category or niche of the channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel \$
- **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel \$
- **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel \$
- **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel \$
- **`created_year`**: Year when the YouTube channel was created
- **`Population (millions)`**: Total population of the country

In [None]:
youtube_df.info()

# Exploratory Data Analysis
This section of the Jupyter Notebook contains the overall process of our Exploratory Data Analysis, together with the questions that we want to answer through it.

### 1. What are the lowest and highest yearly earning categories?

**Description**:
- This analysis aims to identify and explore the YouTube content categories or niches that yield the lowest and highest yearly earnings.

**Variables**:
- `category`
- `lowest_yearly_earnings`
- `highest_yearly_earnings`

**Numerical Summary**

Get the lowest and highest yearly earnings of Youtube Channels per category.
Calculate the average yearly earnings for reference as well.

In [None]:
# calculate the average yearly earnings
youtube_df['average_yearly_earnings'] = (youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']) / 2

average_category_means = youtube_df.groupby('category')['average_yearly_earnings'].mean()

# Sort the categories by mean earnings in descending order
average_category_means = average_category_means.sort_values(ascending=False)

# Display the sorted categories and their mean earnings
average_category_means
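
The cell above reports only the mean of the averaged earnings. A minimal sketch of how the lowest and highest estimates could be summarized alongside it, and how the top and bottom categories could be picked out programmatically, is shown below. All figures are hypothetical and chosen only to make the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical earnings figures for three categories
toy_df = pd.DataFrame({
    "category": ["Shows", "Shows", "Music", "Music", "Travel & Events"],
    "lowest_yearly_earnings": [100.0, 300.0, 50.0, 150.0, 10.0],
    "highest_yearly_earnings": [300.0, 500.0, 150.0, 250.0, 30.0],
})

earning_columns = ["lowest_yearly_earnings", "highest_yearly_earnings"]

# Midpoint of the lowest and highest estimate for each channel
toy_df["average_yearly_earnings"] = toy_df[earning_columns].mean(axis=1)

# Per-category mean of all three measures
summary = toy_df.groupby("category")[earning_columns + ["average_yearly_earnings"]].mean()

# idxmax / idxmin return the category label of the extreme value
top_category = summary["average_yearly_earnings"].idxmax()
bottom_category = summary["average_yearly_earnings"].idxmin()

print(summary)
print(top_category, bottom_category)
```

Because categories with very few channels can dominate such a ranking, it may be worth adding a per-category count to the summary before drawing conclusions.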

**Visualization**

Create a horizontal bar chart to visualize the lowest and highest yearly earnings categories, showing how they compare to one another.

In [None]:
# make category as string
youtube_df['category'] = youtube_df['category'].astype(str)

# create horizontal bar chart for lowest yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='lowest_yearly_earnings', ascending=False),
            x='lowest_yearly_earnings', y='category')
plt.xlabel('Lowest Yearly Earnings')
plt.ylabel('Category')
plt.title('Lowest Yearly Earnings by Category')
plt.show()

# create horizontal bar chart for highest yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='highest_yearly_earnings', ascending=False),
            x='highest_yearly_earnings', y='category')
plt.xlabel('Highest Yearly Earnings')
plt.ylabel('Category')
plt.title('Highest Yearly Earnings by Category')
plt.show()

# create horizontal bar chart for average yearly earnings sorted 
plt.figure(figsize=(10, 6))
sns.barplot(data=youtube_df.sort_values(by='average_yearly_earnings', ascending=False),
            x='average_yearly_earnings', y='category')
plt.xlabel('Average Yearly Earnings')
plt.ylabel('Category')
plt.title('Average Yearly Earnings by Category')
plt.show()

**Conclusion**

The 'Shows' category stands out as the highest-earning category, with an average yearly earning of \$14,035,000, followed by 'Pets & Animals' at \$10,183,300. In contrast, 'Travel & Events' and 'Howto & Style' have the lowest average yearly earnings, at \$796,500 and \$968,481, respectively. These findings show how much earnings differ between content categories on YouTube.


In [None]:
# summary statistics (including the mean) for the Shows category
youtube_df[youtube_df['category'] == 'Shows']['average_yearly_earnings'].describe()

In [None]:
# summary statistics (including the mean) for the Travel & Events category
youtube_df[youtube_df['category'] == 'Travel & Events']['average_yearly_earnings'].describe()

In [None]:
# summary statistics (including the mean) for the Pets & Animals category
youtube_df[youtube_df['category'] == 'Pets & Animals']['average_yearly_earnings'].describe()

In [None]:
# summary statistics (including the mean) for the Howto & Style category
youtube_df[youtube_df['category'] == 'Howto & Style']['average_yearly_earnings'].describe()
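The four per-category summaries above could also be produced in a single call. The following is a minimal sketch on a made-up stand-in for `youtube_df` (the name `toy_df` and its values are assumptions for illustration, not data from the dataset).

```python
import pandas as pd

# Hypothetical stand-in for youtube_df; the earnings values are invented.
toy_df = pd.DataFrame({
    "category": ["Shows", "Shows", "Travel & Events",
                 "Pets & Animals", "Howto & Style", "Music"],
    "average_yearly_earnings": [10.0, 20.0, 1.0, 8.0, 2.0, 5.0],
})

categories_of_interest = ["Shows", "Travel & Events", "Pets & Animals", "Howto & Style"]

# Keep only the categories of interest, then summarize each group in one call.
category_summary = (
    toy_df[toy_df["category"].isin(categories_of_interest)]
    .groupby("category")["average_yearly_earnings"]
    .describe()
)

print(category_summary[["count", "mean"]])
```

The `count` column makes it easy to see how many channels each mean is based on.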

### 2. Is there a correlation between video views and subscriber counts on YouTube channels?

**Description**
- This analysis investigates the relationship between the number of video views and the number of subscribers on YouTube channels to determine if there's a correlation indicating viewer engagement.

**Variables**
- `video views`
- `subscribers`

**Numerical Summary**

Calculate the Pearson correlation coefficient to quantify the strength and direction of the correlation between video views and subscribers.


In [None]:
# calculate pearson correlation between video views and subscribers
youtube_df['video views (millions)'].corr(youtube_df['subscribers (millions)'])
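To make the Pearson coefficient concrete, here is a small sketch that computes it from its definition and compares the result with `Series.corr`. The two short series are invented values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the two columns.
views = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
subs = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson's r: the sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = views - views.mean()
dy = subs - subs.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr uses the Pearson method by default.
r_pandas = views.corr(subs)

print(r_manual, r_pandas)  # both are approximately 0.8
```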


**Visualization**

Create a scatterplot to visualize the relationship between video views and subscribers, indicating the nature of the correlation.

In [None]:
# scatterplot of video views against subscribers using seaborn
plt.figure(figsize=(16, 8))
sns.scatterplot(x='video views (millions)', y='subscribers (millions)', data=youtube_df)
plt.xlabel('Video Views')
plt.ylabel('Subscribers')
plt.title('Video Views vs. Subscribers')
plt.show()

**Conclusion**

The Pearson correlation coefficient between video views and subscriber count is 0.7509576173780214. Because r is positive and greater than 0.5, there is a strong positive correlation between the video views and the subscriber count of a YouTube channel.

### 3. Is there a correlation between video views and average yearly earnings on YouTube channels?

**Description**
- This analysis explores whether there's a correlation between the number of video views and the average yearly earnings of YouTube channels, offering insights into the monetization potential of views.

**Variables**
- `video views`
- `lowest_yearly_earnings`
- `highest_yearly_earnings`

**Numerical Summary**

Calculate the Pearson correlation coefficient to determine the strength and direction of the correlation between video views and average yearly earnings.

In [None]:
# calculate pearson correlation between video views and average yearly earnings
youtube_df['video views (millions)'].corr(youtube_df['average_yearly_earnings'])

**Visualization**

Create a scatterplot to visualize the relationship between video views and average yearly earnings, highlighting the nature of the correlation.


In [None]:
# visualize the data
plt.figure(figsize=(16, 8))
sns.scatterplot(x='video views (millions)', y='average_yearly_earnings', data=youtube_df)
plt.xlabel('Video Views')
plt.ylabel('Average Yearly Earnings')
plt.title('Video Views vs. Average Yearly Earnings')
plt.show()

**Conclusion**

The correlation coefficient is only 0.58, which indicates a moderate positive correlation between video views and average yearly earnings.

### 4. Are there discernible patterns related to a YouTube channel's category and its subscribers?


**Description**:
- We aim to explore if there are discernible patterns related to a YouTube channel's category and its subscribers. We will perform numerical summary and visualization to analyze the data.

**Variables**:
- `subscribers` : The number of subscribers for YouTube channels.
- `category` : The category of the YouTube channels.


**Numerical Summary**

We calculated summary statistics for the subscriber counts using `youtube_df['subscribers (millions)'].describe()`. The `subscribers (millions)` column holds the subscriber count of each YouTube channel; the averages per category are computed in the visualization step.

In [None]:
subscribers_summary = youtube_df['subscribers (millions)'].describe()

# Display the summary statistics
subscribers_summary

**Visualization**

We created a horizontal bar chart to display the average subscribers for each category, using the average_subscribers_by_category variable. The x-axis represents the average subscribers (in millions), and the y-axis displays the categories. The chart is color-coded for better visualization.

In [None]:
# Bar Chart - Average Subscribers by Category
average_subscribers_by_category = youtube_df.groupby('category')['subscribers (millions)'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=average_subscribers_by_category.values, y=average_subscribers_by_category.index, palette="viridis", orient="h")
plt.title('Average Subscribers by Category')
plt.xlabel('Average Subscribers (in millions)')
plt.ylabel('Category')
plt.show()

**Conclusion**

From our analysis, we have found discernible patterns related to a YouTube channel's category and its subscribers. The analysis revealed the following:

- Shows had the highest average number of subscribers, with an average of over 40 million subscribers.

- Trailers came in second place in terms of average subscribers.

- Movies occupied the third position in terms of average subscribers.

- Autos & Vehicles had the third-to-last position with lower average subscribers.

- Howto & Style ranked second to last in terms of average subscribers.

- Travel & Events was the category with the lowest average number of subscribers.

These patterns suggest that a channel's category is associated with its average number of subscribers. Channels in the "Shows" category, on average, tend to have the most subscribers, while those in the "Travel & Events" category have the fewest. No significance test has been applied at this stage, so these differences are descriptive only. This information can be valuable for content creators and marketers looking to understand subscriber trends on YouTube.
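Since the ranking rests on category means, it can help to report how many channels each mean is based on. The sketch below uses an invented `toy_df` (an assumption, not the project data) to show the mean, median, and channel count side by side.

```python
import pandas as pd

# Hypothetical data: one category with a single channel, two with several.
toy_df = pd.DataFrame({
    "category": ["Shows", "Music", "Music", "Music", "Gaming", "Gaming"],
    "subscribers (millions)": [40.0, 30.0, 20.0, 10.0, 25.0, 15.0],
})

# Aggregate mean, median, and group size per category, sorted by mean.
summary = (
    toy_df.groupby("category")["subscribers (millions)"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

A category whose `count` is 1 or 2 tops the ranking on the strength of very few channels, which is worth stating next to the chart.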

### 5. Are there discernible patterns related to a YouTube channel's category and its average yearly earnings?

**Description**:
- We aim to investigate if there are discernible patterns related to a YouTube channel's category and its average yearly earnings. We will conduct numerical summary and visualization to analyze the data.


**Variables**:
- `category` : The category of the YouTube channels.
- `lowest_yearly_earnings` : The lowest estimated yearly earnings of the YouTube channels.
- `highest_yearly_earnings` : The highest estimated yearly earnings of the YouTube channels.


**Numerical Summary**

We will calculate summary statistics for the `average_yearly_earnings` column, which is derived from the lowest and highest yearly earnings. This column holds the estimated average yearly earnings of each YouTube channel.

In [None]:
# youtube_df['average_yearly_earnings'] = youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']

youtube_df['average_yearly_earnings'].describe()

**Visualization**

We will create a horizontal bar chart to visualize the average yearly earnings for each category, using the average_yearly_earnings_by_category variable. The x-axis will represent the average yearly earnings (in dollars), and the y-axis will display the categories. The chart will be color-coded for better visualization.

In [None]:
# Bar Chart - Average Yearly Earnings by Category
average_yearly_earnings_by_category = youtube_df.groupby('category')['average_yearly_earnings'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=average_yearly_earnings_by_category.values, y=average_yearly_earnings_by_category.index, palette="viridis", orient="h")
plt.title('Average Yearly Earnings by Category')
plt.xlabel('Average Yearly Earnings (in dollars)')
plt.ylabel('Category')
plt.show()

**Conclusion**

From our analysis, we have found discernible patterns related to a YouTube channel's category and its average yearly earnings, which are derived from the lowest and highest yearly earnings. The analysis reveals the following:

- Shows had the highest average yearly earnings, with an average of about 14 million dollars.

- Pets & Animals came in second place in terms of average yearly earnings.

- Autos & Vehicles occupied the third position in terms of average yearly earnings.

- Science & Technology had the third-to-last position with lower average yearly earnings.

- Howto & Style ranked second to last in terms of average yearly earnings.

- Travel & Events was the category with the lowest average yearly earnings.

These patterns suggest that a channel's category is associated with its average yearly earnings. Channels in the "Shows" category, on average, tend to have the highest earnings, while those in the "Travel & Events" category have the lowest earnings. This information can be valuable for content creators and marketers looking to understand earning trends on YouTube.

### 6. What is the distribution of video views for YouTube channels within each content category?

**Description**:
- We aim to explore the distribution of video views for YouTube channels within each content category.


**Variables**:
- `category` : The category of the YouTube channels.
- `video views` : The number of video views for each YouTube channel.


**Numerical Summary**

Calculate the summary statistics for the `video views (millions)` column within each category. This will help us understand the distribution of video views, including measures like the median, quartiles, and potential outliers for each category.

In [None]:
# calculate summary statistics (count, mean, std, quartiles) of video views within each category
views_summary = youtube_df.groupby('category')['video views (millions)'].describe()

views_summary
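The quartiles from `describe()` can be turned into an explicit outlier flag. The following sketch applies the common 1.5 × IQR rule within each category; the rule, the `toy_df` data, and the `is_outlier` column name are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical data: one extreme value in the Music group.
toy_df = pd.DataFrame({
    "category": ["Music"] * 6 + ["Gaming"] * 5,
    "video views (millions)": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0,
                               5.0, 6.0, 7.0, 6.0, 5.0],
})

col = "video views (millions)"
grouped = toy_df.groupby("category")[col]

# Broadcast each category's quartiles back onto its rows.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values beyond 1.5 * IQR from the quartiles of their own category.
toy_df["is_outlier"] = (toy_df[col] < q1 - 1.5 * iqr) | (toy_df[col] > q3 + 1.5 * iqr)

print(toy_df[toy_df["is_outlier"]])
```

These are the same points a box plot draws beyond its whiskers under seaborn's default settings.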

**Visualization**

Create a box plot and bar chart to visually represent the distribution of video views for each category. Box plots are effective for showing the central tendency and spread of data within categories.

In [None]:
# Horizontal Box Plot - Distribution of Video Views by Category
plt.figure(figsize=(12, 6))
sns.boxplot(x='video views (millions)', y='category', data=youtube_df)
plt.title('Distribution of Video Views by Category')
plt.xlabel('Video Views (millions)')
plt.ylabel('Category')
plt.show()

# Horizontal Bar Chart - Mean Video Views by Category (barplot shows the mean by default)
plt.figure(figsize=(12, 6))
sns.barplot(x='video views (millions)', y='category', data=youtube_df)
plt.title('Mean Video Views by Category')
plt.xlabel('Mean Video Views (millions)')
plt.ylabel('Category')
plt.show()

**Conclusion**

Based on our EDA of the distribution of video views per category, the results revealed the following:
1. The Shows category has the highest number of video views of all YouTube categories.
2. The Trailers, Movies, and Travel & Events categories each contain only one YouTube channel, while Nonprofits & Activism, Pets & Animals, and Autos & Vehicles each contain only two. The low number of channels in these categories explains their abnormally narrow spread in the box plot and the bar chart, yet they still yield a high number of video views.

### 7. What are the most prevalent YouTube channel categories in the dataset?

**Description**
- How many channels are in each category?

**Variables**
- `category`
- `Youtube Channels`

**Numerical Summary**

Count the frequency of YouTube channels within each unique category.

In [None]:
category_counts = youtube_df['category'].value_counts()

# Display the count of each unique value
print("Count of each unique value in 'category':")
print(category_counts)
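Percent shares per category can be read directly from `value_counts(normalize=True)`. The sketch below uses an invented series of category labels (an assumption for illustration) to build a table of counts and shares.

```python
import pandas as pd

# Hypothetical category labels; the proportions are invented.
toy_categories = pd.Series(
    ["Entertainment"] * 4 + ["Music"] * 3 + ["Gaming"] * 2 + ["Comedy"],
    name="category",
)

# Absolute counts and percentage shares, both sorted in descending order.
counts = toy_categories.value_counts()
shares = toy_categories.value_counts(normalize=True).mul(100).round(1)

category_table = pd.DataFrame({"channels": counts, "share_percent": shares})
print(category_table)
```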

**Visualization**

Create a bar chart to visualize the distribution of YouTube channels across unique categories.

In [None]:
# create a bar chart to visualize the distribution of YouTube channels across categories
plt.figure(figsize=(12, 6))  # Adjust the figure size
ax = sns.barplot(x=category_counts.values, y=category_counts.index, palette="viridis", orient="h")
ax.set(xlabel='Number of Channels', ylabel='Category')
plt.title('Distribution of YouTube Channels Across Categories')

# Add data labels to the right of each bar for readability
for p in ax.patches:
    width = p.get_width()
    ax.annotate(f'{int(width)}', xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center', fontsize=9, color='black')

plt.tight_layout()  # Ensure labels and titles fit in the plot area
plt.show()


**Conclusion**

Notably, the Entertainment category is the most prominent, with around 212 channels, or approximately 20% of the dataset. The Music category follows with 175 channels. The next three categories are People & Blogs, Gaming, and Comedy. It's interesting to note that despite the Pets & Animals category being among the highest earners, it is represented by only 2 channels in our dataset, underscoring the niche nature of this content within the YouTube landscape.

### 8. What is the distribution of YouTube channels by their country of origin in the dataset? Retrieve the top 15 countries.

**Description**
- This analysis aims to identify the 15 countries with the most YouTube channels.

**Variables**
- `Country`
- `Youtube Channels`

**Numerical Summary**

Get the frequency of YouTube channels from each country.

Retrieve the top 15 countries with most frequency of YouTube channels.

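In [None]:
# count channels per country (value_counts sorts in descending order)
country_counts = youtube_df['Country'].value_counts()

# keep the 15 countries with the most channels
top_15_countries = country_counts[:15]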

top_15_countries

**Visualization**

Create a bar chart to visualize the distribution of YouTube channels by country – showing only the top 15.

In [None]:
plt.figure(figsize=(12, 6))  # Adjust the figure size
ax = sns.barplot(x=top_15_countries.values, y=top_15_countries.index, palette="viridis", orient="h")
ax.set(xlabel='Number of Channels', ylabel='Country')
plt.title('Distribution of YouTube Channels Across Countries')

# Add data labels to the right of each bar for readability
for p in ax.patches:
    width = p.get_width()
    ax.annotate(f'{int(width)}', xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center', fontsize=9, color='black')

plt.tight_layout()  # Ensure labels and titles fit in the plot area
plt.show()

**Conclusion**

The distribution of channels in our exploratory data analysis reveals a clear dominance by the United States and India, with the United States leading with 306 channels and India following with approximately 167 channels. Beyond these two leaders, the counts drop sharply: countries like Brazil, the United Kingdom, Mexico, and others have at most about 65 channels each, a much smaller presence in the dataset.

# Research Question
    This section of the Jupyter Notebook contains this study's Research Question that we wish to answer based on our results and findings from the Exploratory Data Analysis.

    After careful consideration of our observations and analysis, we want to answer the following research question.
### What are the key factors that make a YouTube Channel Successful?

#### In connection with the EDA...
Based on our exploratory data analysis, it is evident that determining the success of a YouTube channel is a wide-ranging task; it is not solely reliant on one variable. Our findings reveal that a channel can excel in certain metrics while falling short in others. A noteworthy case in point is earnings: some channels with high view and subscriber counts may generate minimal or no revenue, and conversely, some less prominent channels might generate substantial earnings. That said, some metrics are correlated with each other. The positive correlation between subscriber count and view count means that channels with more subscribers tend to have more views.

Our data analysis suggests that there are potential factors associated with the metrics that determine the success of a YouTube channel. These factors include a channel's category of content and country of origin. However, further study is needed to determine how each potential factor relates to each metric of success.


#### Description:

- Analyzing the observations from our exploratory data analysis, we have observed that there are key factors that explain the high values in some of the YouTube metrics.
- A YouTube channel's success can be traced to these factors, which is why YouTube channel analytics consider metrics such as impressions, click-throughs, likes, shares, view count, and subscribers. YouTube channel analytics give a bird's-eye view of a channel's vitals, such as real-time views, total views, watch time, and subscribers, and provide a quick performance overview for a given period.
- YouTube's algorithm also considers viewer demographics, video relevance, and category, and recommends videos to similar audiences.
- In addition, the metrics that determine "success" for a YouTube channel include subscribers, video views, earnings, and rankings (overall and by country). We test other parameters in the dataset that can be considered "factors": country, population of the country, category and category rank, educational attainment, urban population, and unemployment rate.

### Significance of the Study

    Determining the factors that enhance a YouTube channel's success carries substantial importance on various fronts. It provides practical guidance for content creators, empowering them to make data-driven choices about their content and strategies. This research has broader industry implications as it aids businesses in optimizing their marketing endeavors on the YouTube platform. Moreover, unraveling these success factors contributes to the academic understanding of digital media, informs evidence-based decision-making, and can potentially influence policies and regulations pertaining to online content. This study could extend its impact to a wider spectrum, offering insights into the dynamics of content creation and user engagement in the digital realm, making it a multidimensional and highly pertinent field of inquiry.



# Phase 2

    As we delve into the second phase of our case study on Global YouTube Statistics, we will be focusing on three interconnected sections: data modeling, statistical inference, and insights and conclusions. 

This phase of our study is dedicated to unraveling the intricate web of data that underlies the vast YouTube ecosystem, drawing insightful conclusions from the analysis, and shedding light on the broader implications for content creators, consumers, and the platform itself. By meticulously examining the data and applying rigorous statistical techniques, we aim to uncover the patterns, trends, and behaviors that define the global YouTube landscape, ultimately providing valuable insights into the dynamic and ever-expanding world of online video content.

## Dependent and Independent Variable
    Based on our selected dataset and our research question, we identified the following variables...

### Dependent Variables
- **`subscribers (millions)`**: Number of subscribers to the channel
    - **`rank`**: Position of the YouTube channel based on the number of subscribers


- **`video views (millions)`**: Total view count of the channel
    - **`video_views_rank`**: Ranking of the channel based on total video views


- **`earnings`**: Total earnings of the youtuber
    - **`Monthly Earnings`**:
        - **`lowest_monthly_earnings`**: Lowest estimated monthly earnings from the channel in Dollars
        - **`highest_monthly_earnings`**: Highest estimated monthly earnings from the channel in Dollars
    - **`Yearly Earnings`**:
        - **`lowest_yearly_earnings`**: Lowest estimated yearly earnings from the channel in Dollars
        - **`highest_yearly_earnings`**: Highest estimated yearly earnings from the channel in Dollars 

### Independent Variables
- **`Youtuber`**: Name of the YouTube channel
- **`category`**: Category or niche of the channel
- **`uploads`**: Total number of videos uploaded on the channel
- **`Country`**: Country where the YouTube channel originates
- **`video_views_for_the_last_30_days`**: Total video views in the last 30 days
- **`Population (millions)`**: Total population of the country
- **`Unemployment rate`**: Unemployment rate in the country %
- **`Urban_population`**: Percentage of the population living in urban areas %
- **`created_year`**: Year when the YouTube channel was created
- **`created_month`**: Month when the YouTube channel was created



## Data Modelling

In the data modeling section of our study, we focus on constructing visual models to capture the intricacies of the dataset. 

### Objective
Our primary objective here is to gather and organize a diverse set of data points, including viewership statistics, engagement metrics, demographics, and content categories. Through data cleaning, transformation, and structuring, we create a structured dataset that forms the foundation for our subsequent statistical analyses. This structured data helps unveil hidden patterns and relationships within the vast dataset. By methodically representing the characteristics of YouTube content, creators, and viewers, we pave the way for deeper insights into the platform's dynamics, leading to a more comprehensive understanding of the factors driving the global YouTube phenomenon.



### Scaling the Data

In this project, we performed data preprocessing to prepare the data for analysis. Specifically, outlier detection and removal were performed on the following columns:
 - **`subscribers (millions)`**
 - **`uploads`**
 - **`Population (millions)`**
 - **`created_year`**
 - **`video views (millions)`**
 - **`average_yearly_earnings`**
 
A z-score threshold of 3 standard deviations from the mean was used for outlier detection and removal in the data preprocessing phase. This threshold was chosen for several reasons:
 - A threshold of 3 standard deviations helps ensure that only data points deviating substantially from the mean are flagged as outliers.
 - It allows for the removal of extreme values while retaining a sufficient amount of data for a meaningful analysis.
 - By removing outliers using a z-score threshold of 3, the subsequent normalization process is less likely to be skewed by these extreme values, leading to a more accurate representation of the data distribution.

 This was done to reduce the influence of extreme values on the analysis. It was a necessary step because we plan to apply min-max normalization, which is sensitive to outliers: a single extreme value stretches the range and compresses the remaining values into a narrow interval.



In [None]:
# First, compute the average yearly earnings that will be used for the following hypotheses,
# taken as the midpoint of the lowest and highest yearly earnings estimates
youtube_df['average_yearly_earnings'] = (youtube_df['lowest_yearly_earnings'] + youtube_df['highest_yearly_earnings']) / 2

column_names = ['subscribers (millions)', 'uploads', 'Population (millions)', 'created_year', 'video views (millions)', 'average_yearly_earnings']
z_score_threshold = 3
filtered_df = youtube_df.copy()

for column_name in column_names:
    column_data = filtered_df[column_name]
    filtered_df = filtered_df[(np.abs(stats.zscore(column_data)) < z_score_threshold)]

youtube_df = filtered_df

youtube_df
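As a self-contained illustration of z-score filtering, the sketch below builds synthetic data with two injected extreme values and removes rows in a single pass. Unlike the loop above, which recomputes z-scores on the already-filtered data for each column, this variant computes all z-scores once from the full data; the data, seed, and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for two of the filtered columns.
toy_df = pd.DataFrame({
    "uploads": rng.normal(100, 10, 200),
    "subscribers (millions)": rng.normal(20, 2, 200),
})
toy_df.loc[0, "uploads"] = 1000.0                 # injected extreme value
toy_df.loc[1, "subscribers (millions)"] = 200.0   # injected extreme value

z_score_threshold = 3
columns = ["uploads", "subscribers (millions)"]

# z = (x - mean) / std with ddof=0, matching the default of scipy.stats.zscore.
z_scores = (toy_df[columns] - toy_df[columns].mean()) / toy_df[columns].std(ddof=0)

# Keep rows whose |z| is below the threshold in every column.
keep_mask = (z_scores.abs() < z_score_threshold).all(axis=1)
filtered = toy_df[keep_mask]

print(len(toy_df), "->", len(filtered))
```

The two approaches can remove different rows, because dropping an extreme value changes the mean and standard deviation used for the columns processed afterwards.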

Additionally, normalization was performed on the said data to address any potential issues with scale and distribution. This was performed on several columns including:
 - **`subscribers (millions)`**
 - **`uploads`**
 - **`Population (millions)`**
 - **`created_year`**
 - **`video views (millions)`**
 - **`average_yearly_earnings`**

This step puts the variables on a common scale so that they can be compared fairly.

Normalization creates a level playing field for variables with different units and ranges. By transforming the data to a common [0, 1] scale, the impact of varying magnitudes among the features is minimized, preventing any single variable from disproportionately influencing the outcomes of the statistical tests.

The t-tests, in turn, rely on the assumption that the data is approximately normally distributed. Min-max normalization is a linear rescaling, so it changes the range of each column but not the shape of its distribution; it does not by itself satisfy this assumption, which should be checked separately (for example, with the histograms below). The calculated statistics and p-values from the t-tests are used to decide whether to reject or fail to reject the null hypothesis.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Specify the columns you want to normalize
column_names = ['subscribers (millions)', 'uploads', 'Population (millions)', 'created_year', 'video views (millions)', 'average_yearly_earnings']

# Create a new DataFrame with only the selected columns
selected_columns_df = youtube_df[column_names]

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Perform normalization on the selected columns
normalized_data = scaler.fit_transform(selected_columns_df)

# Create a new DataFrame with the normalized data, keeping the original row index
normalized_df = pd.DataFrame(normalized_data, columns=column_names, index=youtube_df.index)
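To show what min-max scaling does and does not change, the sketch below applies the formula x' = (x - min) / (max - min) by hand to synthetic right-skewed data, which is the per-column transformation `MinMaxScaler` performs with its default settings. The data and seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Right-skewed synthetic values standing in for a column such as uploads.
raw = pd.Series(rng.exponential(scale=5.0, size=500))

# Min-max scaling maps the smallest value to 0 and the largest to 1.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

print("range:", scaled.min(), scaled.max())
print("skewness before:", raw.skew(), "after:", scaled.skew())
```

The range becomes [0, 1], but the skewness is unchanged, since a linear rescaling preserves the shape of the distribution.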

### Histogram Models 

In [None]:
# Plotting the histogram
plt.hist(youtube_df['subscribers (millions)'], bins=15, edgecolor='black')

# Adding labels and title
plt.xlabel('Subscriber Count (millions)')
plt.ylabel('Frequency')
plt.title('Histogram of Subscriber Count')

# Displaying the histogram
plt.show()


In [None]:
# Plotting the histogram
plt.hist(youtube_df['video views (millions)'], bins=15, edgecolor='black')

# Adding labels and title
plt.xlabel('Video Views (millions)')
plt.ylabel('Frequency')
plt.title('Histogram of Video Views')

# Displaying the histogram
plt.show()



In [None]:
# Plotting the histogram
plt.hist(youtube_df['average_yearly_earnings'], bins=15, edgecolor='black')

# Adding labels and title
plt.xlabel('Average Yearly Earnings (dollars)')
plt.ylabel('Frequency')
plt.title('Histogram of Average Yearly Earnings')

# Displaying the histogram
plt.show()


## Statistical Inference

In the statistical inference section of our study, we delve into the heart of our data analysis, seeking to draw meaningful conclusions and insights from the structured YouTube dataset. Through a rigorous application of statistical techniques and methodologies, we aim to extract valuable information about the platform's performance, trends, and user behaviors. Statistical inference allows us to identify patterns, correlations, and dependencies within the data, offering answers to critical questions about user engagement, content preferences, and the impact of various factors on YouTube's success.

By conducting hypothesis tests, regression analyses, and other statistical procedures, we can make informed assessments of the relationships between different variables. Moreover, we explore the significance of various factors, such as upload count, country of origin, and content category, in influencing subscribers, views, and earnings. This section of our study enables us to provide evidence-based insights into the factors contributing to the dynamics of YouTube's global ecosystem, helping content creators, businesses, and researchers better understand how to navigate and succeed in the ever-evolving world of online video content.

## Hypothesis 1

    In this section of the notebook, the group conducts a statistical analysis to investigate potential associations or dependencies between certain variables within the dataset and a YouTuber's Subscriber Count. The goal is to assess whether specific factors exhibit a statistically significant relationship with the Subscriber Count.

### Hypothesis 1.1:  Is Subscriber Count associated with or dependent on the channel's country of origin?
    To determine whether a YouTuber's Subscriber Count is associated with the channel's country of origin, the group uses a chi-square test at a chosen significance level to assess whether there is a dependency between these variables. By exploring this relationship, the study examines how geographic elements, such as cultural preferences and regional trends, may influence the Subscriber Count.

**`Null Hypothesis:`** There is no significant association or dependence between the Subscriber Count and the channel's country of origin.

**`Alternative Hypothesis:`** There is a significant association or dependence between the Subscriber Count and the channel's country of origin.


In [None]:
# Cross-tabulate subscriber counts against country of origin (one row per distinct subscriber value)
hp1_1_table = pd.crosstab(youtube_df['subscribers (millions)'], youtube_df['Country'])
hp1_1_table


In [None]:
chi2, p, degrees_of_freedom, expected = chi2_contingency(hp1_1_table)

print("Chi-Square Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", degrees_of_freedom)
print("Expected Frequencies Table: \n", expected)
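A contingency table with one row per distinct subscriber value tends to be very sparse, and the chi-square test is usually considered unreliable when expected counts are small. One possible alternative, sketched here on synthetic data, is to bin the subscriber counts first; the quartile bins, the `subscriber_bin` column, and the data are illustrative assumptions, not the method used above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)

# Synthetic stand-in for the Country and subscriber columns.
toy_df = pd.DataFrame({
    "Country": rng.choice(["United States", "India", "Brazil"], size=300),
    "subscribers (millions)": rng.gamma(shape=4.0, scale=5.0, size=300),
})

# Bin the continuous subscriber counts into quartile groups.
toy_df["subscriber_bin"] = pd.qcut(
    toy_df["subscribers (millions)"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# 4 x 3 contingency table of subscriber bin against country.
binned_table = pd.crosstab(toy_df["subscriber_bin"], toy_df["Country"])
chi2, p, dof, expected = chi2_contingency(binned_table)

print(binned_table)
print("chi2:", chi2, "p:", p, "dof:", dof)
print("smallest expected count:", expected.min())
```

With 4 bins and 3 countries the test has (4 - 1) × (3 - 1) = 6 degrees of freedom, and the smallest expected count can be checked against the common rule of thumb of at least 5.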

#### Findings and Results for Hypothesis 1.1

   Based on the results and findings of the chi-square test investigating the association between a YouTuber's Subscriber Count and the channel's country of origin, the Chi-Square Statistic yielded a value of **`[Insert Value]`**, which measures how far the observed counts depart from the counts expected under independence. The associated p-value, calculated as **`[Insert Value]`**, represents the probability of obtaining results at least as extreme as those observed under the assumption of no association. Having the degrees of freedom calculated at **`[Insert Value]`**, the Chi-Square Statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association. 
   
   In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 1.1]`** and we therefore **`[reject or do not reject]`** the null hypothesis.



### Hypothesis 1.2:  Is rank associated with or dependent on the channel's upload count?
    This hypothesis examines whether there is a statistically significant relationship between the rank of YouTube channels and their upload count. Upload count serves as a proxy for the frequency and consistency of content production, which could influence viewer engagement and, consequently, a channel's rank. The group's focus is to investigate whether the frequency of content uploads has a discernible association with or dependency on a channel's ranking. This exploration is essential for understanding the potential association between a channel's upload activity and its perceived popularity or success. 

**`Null Hypothesis:`** There is no association or dependency between the rank of YouTube channels and the channel's upload count.

**`Alternative Hypothesis:`** There is a significant association or dependency between the rank of channels and their upload count.


In [None]:
data = {'subscribers': normalized_df['subscribers (millions)'],
        'uploads': normalized_df['uploads']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['subscribers'], df['uploads'])

print("t-statistic =", t_stat)
print("p-value =", p_value)
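
`stats.ttest_ind` compares the means of two independent samples, so it answers whether two columns have the same average rather than whether they move together. As a minimal sketch of one alternative, assuming a rank correlation is acceptable for this question, `scipy.stats.spearmanr` tests for a monotonic association between two numeric columns. The data below is synthetic, `toy_df` is a hypothetical stand-in for `normalized_df`, and the printed values are not results from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for normalized_df (assumption: two numeric columns).
uploads = rng.lognormal(mean=6.0, sigma=1.5, size=300)
subscribers = 10 + 0.002 * uploads + rng.normal(scale=5.0, size=300)
toy_df = pd.DataFrame({"subscribers": subscribers, "uploads": uploads})

# Spearman's rho ranges from -1 to +1; 0 means no monotonic association.
rho, p_value = stats.spearmanr(toy_df["subscribers"], toy_df["uploads"])

print(f"Spearman rho = {rho:.4f}")
print(f"p-value = {p_value:.4e}")
```

Because it works on ranks, Spearman's rho is not affected by the very different scales of subscriber counts and upload counts.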

#### Findings and Results for Hypothesis 1.2

Based on the t-test investigating the association between a YouTuber's Subscriber Count and the channel's upload count, the test statistic yielded a value of **`13.6543`**. The associated p-value, recorded as **`4.155`** (apparently without its scientific-notation exponent, since a p-value cannot exceed 1), is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 1.2]`** and we therefore **`do not reject`** the null hypothesis.
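
A p-value is a probability, so it always lies between 0 and 1. Very small p-values are printed in scientific notation (for example `1.2e-30`), and copying only the leading digits gives a number that looks larger than 1. The sketch below shows one way to print the full value and state the decision explicitly. The two samples are synthetic, and the significance level `alpha = 0.05` is an assumption rather than a value stated in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration only (not from the YouTube dataset).
sample_a = rng.normal(loc=0.0, scale=1.0, size=500)
sample_b = rng.normal(loc=1.0, scale=1.0, size=500)

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "do not reject"

print(f"t-statistic = {t_stat:.4f}")
print(f"p-value = {p_value:.4e}")  # keeps the exponent visible
print(f"Decision at alpha = {alpha}: {decision} the null hypothesis")
```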


*Placeholder results, saved here just in case. -lui*

*This is reportedly wrong; the values have apparently changed. -lui again*

Chi-Square Statistic: 211302.00000000003
P-value: 0.34532206754248224
Degrees of Freedom: 211044

We do not reject the null hypothesis.

### Hypothesis 1.3: Is Subscriber Count associated with or dependent on the channel’s category?
This test investigates whether there is an association between a YouTuber's Subscriber Count and the category of their channel. The group's aim is to discern whether the content genre, represented by the channel's category, is significantly associated with the number of subscribers it attracts. This exploration is crucial for understanding the connection between a channel's content type and its perceived popularity, as measured by the Subscriber Count. The working assumption is that the channel's category plays a role in shaping its Subscriber Count: diverse content genres attract varied audiences, and the analysis seeks to unveil whether specific categories tend to attract higher subscriber numbers. This understanding provides valuable insights for content creators aiming to optimize their channel's category for enhanced subscriber engagement and overall success on the platform.

**`Null Hypothesis:`** There is no association or dependence between the Subscriber Count and the channel's category.

**`Alternative Hypothesis:`** There is a significant association or dependence between the Subscriber Count and the channel's category.


In [None]:
hp1_3_table = pd.crosstab(youtube_df['subscribers (millions)'], youtube_df['category'])
hp1_3_table

In [None]:

chi2, p, degrees_of_freedom, expected = chi2_contingency(hp1_3_table)

print("Chi-Square Statistic:", chi2) # too high
print("P-value:", p) # too high
print("Degrees of Freedom:", degrees_of_freedom) # too high
print("Expected Frequencies Table: \n", expected)
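
Cross-tabulating a continuous column directly produces one row per distinct value, which is why the statistic and the degrees of freedom come out very large, as the `# too high` comments note. A minimal sketch of one workaround, assuming quantile tiers are an acceptable way to discretize subscriber counts: bin the column with `pd.qcut` before building the table. The frame `toy_df` and the column `subscriber_tier` are hypothetical, and the values are randomly generated rather than taken from `youtube_df`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)

# Synthetic stand-in for youtube_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "subscribers (millions)": rng.uniform(12.0, 250.0, size=400),
    "category": rng.choice(["Music", "Gaming", "Education"], size=400),
})

# Bin the continuous column into four equal-sized tiers (quartiles).
toy_df["subscriber_tier"] = pd.qcut(
    toy_df["subscribers (millions)"],
    q=4,
    labels=["Q1", "Q2", "Q3", "Q4"],
)

binned_table = pd.crosstab(toy_df["subscriber_tier"], toy_df["category"])
chi2, p, dof, expected = chi2_contingency(binned_table)

print(binned_table)
print(f"chi2 = {chi2:.4f}, p = {p:.4e}, dof = {dof}")
```

With four tiers and three categories the table has (4 - 1) x (3 - 1) = 6 degrees of freedom, which keeps the expected counts per cell large enough for the test to be meaningful.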

#### Findings and Results for Hypothesis 1.3

Based on the chi-square test of association between a YouTuber's Subscriber Count and the YouTube channel's content category, the Chi-Square Statistic yielded a value of **`[Insert Value]`**, which measures how far the observed counts deviate from the counts expected if the two variables were independent. The associated p-value, calculated as **`[Insert Value]`**, is the probability of obtaining results at least this extreme under the assumption of no association. With the degrees of freedom calculated at **`[Insert Value]`**, the Chi-Square Statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 1.3]`** and we therefore **`[reject or do not reject]`** the null hypothesis.



### Hypothesis 1.4: Is Subscriber Count associated with or dependent on the population of the country of origin?
This test explores whether there is an association between a YouTuber's Subscriber Count and the population of their country of origin. The objective is to investigate whether the demographic size of a country has a statistically significant association with the number of subscribers a YouTuber attracts. The population of a YouTuber's country of origin may play a substantial role in shaping their Subscriber Count: a larger population might suggest a broader potential audience, while a smaller population could imply a more niche viewership. The analysis seeks to reveal whether YouTubers from countries with larger populations tend to have higher subscriber numbers. Understanding this potential association provides valuable insights into the broader demographic factors that may contribute to a YouTuber's popularity.

**`Null Hypothesis:`** There is no association or dependence between the Subscriber Count and the population of the country of origin.

**`Alternative Hypothesis:`** There is a significant association or dependence between the Subscriber Count and the population of the country of origin.



In [None]:
data = {'subscribers': normalized_df['subscribers (millions)'],
        'population': normalized_df['Population (millions)']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['subscribers'], df['population'])

print("t-statistic =", t_stat)
print("p-value =", p_value)


#### Findings and Results for Hypothesis 1.4

Based on the t-test investigating the association between a YouTuber's Subscriber Count and the population of the channel's country of origin, the test statistic yielded a value of **`-12.5363`**. The associated p-value, recorded as **`2.2874`** (apparently without its scientific-notation exponent, since a p-value cannot exceed 1), is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 1.4]`** and we therefore **`do not reject`** the null hypothesis.



### Hypothesis 1.5: Is Subscriber Count associated with or dependent on the channel’s creation date?
This test explores the potential connection between a YouTuber's Subscriber Count and the channel's creation date. The main goal is to determine whether the age of a channel, represented by its creation date, has a significant association with the number of subscribers it attracts. The hypothesis posits that the creation date of a YouTube channel carries meaningful implications for its Subscriber Count, reflecting factors such as platform tenure, visibility, reputation, and subscriber acquisition trends over time. The analysis seeks to uncover whether older channels generally have higher subscriber numbers or whether newer channels can also achieve notable subscriber engagement.

**`Null Hypothesis:`** There is no association or dependence between the Subscriber Count and the channel's creation date.

**`Alternative Hypothesis:`** There is a significant association or dependence between the Subscriber Count and the channel's creation date.


In [None]:
data = {'subscribers': normalized_df['subscribers (millions)'],
        'created_year': normalized_df['created_year']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['subscribers'], df['created_year'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

#### Findings and Results for Hypothesis 1.5

Based on the t-test investigating the association between a YouTuber's Subscriber Count and the channel's creation date, the test statistic yielded a value of **`-30.4090`**. The associated p-value, recorded as **`9.4169`** (apparently without its scientific-notation exponent, since a p-value cannot exceed 1), is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 1.5]`** and we therefore **`do not reject`** the null hypothesis.

## Hypothesis 2

In this section, we perform a statistical analysis to assess potential associations or dependencies between various variables within the dataset and a YouTuber's total number of video views. The objective is to determine whether specific factors demonstrate a statistically significant relationship with the total count of video views for a YouTuber.


### Hypothesis 2.1: Is a YouTube channel’s video views associated with or dependent on the channel’s country of origin?
In investigating the potential association between a YouTuber's total number of video views and the country of origin of their channel, our objective is to determine whether the geographic location of the channel has a statistically significant relationship with the overall count of video views. By examining the association between video views and the country of origin, the analysis seeks to identify whether specific countries are more likely to yield higher video views for a YouTuber. This exploration provides valuable insights into the geographical factors that shape a channel's video viewership.

**`Null Hypothesis:`** There is no association or dependence between the 'Video Views' and the Country origin of the YouTube channel.

**`Alternative Hypothesis:`** There is a significant association or dependence between the 'Video Views' and the Country origin of the YouTube channel.


In [None]:
hp2_1_table = pd.crosstab(youtube_df['video views (millions)'], youtube_df['Country'])
chi2, p, degrees_of_freedom, expected = chi2_contingency(hp2_1_table)

print("Chi-Square Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", degrees_of_freedom)
print("Expected Frequencies Table: \n", expected)
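
When one variable is numeric (video views) and the other is categorical (country), an alternative to a cross-tabulation is to compare the distribution of the numeric variable across the groups. The sketch below uses the Kruskal-Wallis H-test (`scipy.stats.kruskal`), which does not assume normally distributed views. It is offered as one possible approach, not as the method this notebook settled on; `toy_df`, the country labels, and all values are synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for youtube_df: three countries, 100 channels each.
toy_df = pd.DataFrame({
    "Country": np.repeat(["A", "B", "C"], 100),
    "video views (millions)": np.concatenate([
        rng.lognormal(mean=9.0, sigma=0.5, size=100),
        rng.lognormal(mean=9.2, sigma=0.5, size=100),
        rng.lognormal(mean=9.0, sigma=0.5, size=100),
    ]),
})

# One array of view counts per country.
groups = [
    group["video views (millions)"].to_numpy()
    for _, group in toy_df.groupby("Country")
]

# H0: the view distributions are the same in every country.
h_stat, p_value = stats.kruskal(*groups)

print(f"H-statistic = {h_stat:.4f}")
print(f"p-value = {p_value:.4e}")
```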

#### Findings and Results for Hypothesis 2.1
Based on the chi-square test of association between a YouTuber's total number of video views and the country of origin of the channel, the Chi-Square Statistic yielded a value of **`[Insert Value]`**, which measures how far the observed counts deviate from the counts expected if the two variables were independent. The associated p-value, calculated as **`[Insert Value]`**, is the probability of obtaining results at least this extreme under the assumption of no association. With the degrees of freedom calculated at **`[Insert Value]`**, the Chi-Square Statistic suggests **`[Insert interpretation]`**, signifying that the country of origin has **`[a notable / no notable]`** impact on the total count of video views.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 2.1]`** and we therefore **`[reject or do not reject]`** the null hypothesis.

##### Example Summary
In summary, these findings indicate that certain countries are more likely to generate higher video views for a Youtuber, highlighting the influence of geographic factors on video viewership. Therefore, we reject the null hypothesis, affirming that there is a significant association between a Youtuber's total number of video views and the country of origin of their channel. This suggests that the geographic location plays a meaningful role in shaping a Youtuber's overall video viewership.




### Hypothesis 2.2: Is a channel’s video views associated with or dependent on the channel’s upload count?
In exploring the potential relationship between a YouTuber's total number of video views and the upload count of their channel, our objective is to investigate whether the frequency of content uploads, as indicated by the upload count, has a statistically significant association with the overall count of video views. The upload count of a YouTube channel may influence its total number of video views, since the frequency and consistency of content uploads can affect viewer engagement. By scrutinizing the association between video views and the upload count, we seek to determine whether channels with a higher upload count tend to attract more video views, providing valuable insights into the impact of content production frequency on a channel's overall video viewership.

**`Null Hypothesis:`** There is no association between the upload count and the video views of the YouTube channel.

**`Alternative Hypothesis:`** There is a significant association between the upload count and the video views of the YouTube channel.


In [None]:
data = {'video views': normalized_df['video views (millions)'],
        'uploads': normalized_df['uploads']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['uploads'], df['video views'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

#### Findings and Results for Hypothesis 2.2

Based on the t-test investigating the association between a YouTuber's video views and the YouTube channel's upload count, the test statistic yielded a value of **`-26.5936`**. The associated p-value, recorded as **`4.2711`** (apparently without its scientific-notation exponent, since a p-value cannot exceed 1), is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 2.2]`** and we therefore **`do not reject`** the null hypothesis.

### Hypothesis 2.3: Is a channel’s video views associated with or dependent on the population of the country of origin?
This test investigates the potential relationship between a YouTuber's total number of video views and the population of their country of origin. Our objective is to examine whether the demographic size of a country has a statistically significant association with the total count of video views. The population of a YouTuber's country of origin may play a significant role in influencing their total number of video views, since the demographic size of a country can affect the potential audience base and viewer engagement. By scrutinizing the association between video views and the population of the country of origin, we seek to reveal whether YouTubers from countries with larger populations tend to have higher video views. Understanding this potential association provides valuable insights into the broader demographic factors that may contribute to a YouTuber's overall video viewership.

**`Null Hypothesis:`** There is no association between video views and the population of the country of origin.

**`Alternative Hypothesis:`** There is a significant association between video views and the population of the country of origin.


In [None]:
data = {'video views': normalized_df['video views (millions)'],
        'population': normalized_df['Population (millions)']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['video views'], df['population'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

#### Findings and Results for Hypothesis 2.3

Based on the t-test investigating the association between a YouTuber's video views and the population of the YouTube channel's country of origin, the test statistic yielded a value of **`-3.141`**. The associated p-value, calculated as **`0.0017`**, is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 2.3]`** and we therefore **`do not reject`** the null hypothesis.

### Hypothesis 2.4: Is a channel’s video views associated with or dependent on the channel’s creation date?

This test explores the potential relationship between a YouTuber's total number of video views and the creation date of their channel. We investigate whether the age of a channel, indicated by its creation date, has a statistically significant association with the overall count of video views. A YouTube channel's creation date may influence its total number of video views, with the age of the channel serving as an indicator of its establishment within the platform. This aspect may affect visibility, reputation, and viewer engagement over time. By examining whether older channels tend to have higher video views or whether newer channels can also achieve significant viewership, the analysis provides insights into the impact of a channel's longevity on its overall video viewership.

**`Null Hypothesis:`** There is no association between video views and the creation date of the YouTube channel.

**`Alternative Hypothesis:`** There is a significant association between video views and the creation date of the YouTube channel.


In [None]:
data = {'video views': normalized_df['video views (millions)'],
        'created_year': normalized_df['created_year']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['video views'], df['created_year'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

#### Findings and Results for Hypothesis 2.4

Based on the t-test investigating the association between a YouTuber's video views and the YouTube channel's creation date, the test statistic yielded a value of **`-17.0253`**. The associated p-value, recorded as **`1.3671`** (apparently without its scientific-notation exponent, since a p-value cannot exceed 1), is the probability of obtaining results at least this extreme under the assumption of no association. The test statistic suggests **`[Insert interpretation]`**, signifying the **`[Nature/Direction]`** of the association.

In summary, these findings indicate that **`[insert concise summary of the main results and their implications for Hypothesis 2.4]`** and we therefore **`do not reject`** the null hypothesis.

### Hypothesis 2.5: Is a channel’s video views associated with or dependent on the channel’s average yearly earnings?
Description: In this analysis phase, the team investigates the potential relationship between a Youtuber's total number of video views and the channel's average yearly earnings. The primary goal is to determine whether the financial success of a channel, represented by its average yearly earnings, has a statistically significant association with the overall count of video views. The hypothesis suggests that a Youtuber's video views may be significantly linked to the average yearly earnings of their channel, with financial success serving as an indicator of popularity, content quality, and audience engagement. By examining the association between video views and average yearly earnings, the analysis seeks to unveil whether channels with higher video views tend to generate greater financial returns. This exploration provides valuable insights into the financial dynamics of a channel's success and its correlation with video viewership.

**`Null Hypothesis:`** There is no association or dependence between a channel's video views and its average yearly earnings.

**`Alternative Hypothesis:`** There is a significant association or dependence between a channel's video views and its average yearly earnings.


In [None]:
data = {'video views': normalized_df['video views (millions)'],
        'average_yearly_earnings': normalized_df['average_yearly_earnings']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['video views'], df['average_yearly_earnings'])

print("t-statistic =", t_stat)
print("p-value =", p_value)
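
For two numeric variables such as video views and average yearly earnings, a simple linear regression reports both the direction of the relationship (the slope) and a p-value for the hypothesis that the slope is zero. The sketch below uses `scipy.stats.linregress` on synthetic data. The arrays `views` and `earnings` and the relationship built into them are assumptions for illustration only and say nothing about the actual dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic data: earnings rise with views, plus random noise.
views = rng.uniform(1_000.0, 50_000.0, size=250)
earnings = 0.5 * views + rng.normal(scale=2_000.0, size=250)

# Fit earnings = intercept + slope * views.
result = stats.linregress(views, earnings)

print(f"slope = {result.slope:.4f}")
print(f"intercept = {result.intercept:.4f}")
print(f"r = {result.rvalue:.4f}, r^2 = {result.rvalue ** 2:.4f}")
print(f"p-value (H0: slope = 0) = {result.pvalue:.4e}")
```

The sign of the slope gives the direction of the association, and `r^2` gives the share of the variance in earnings explained by views.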

## Hypothesis 3

### Hypothesis 3.1: Is a channel’s average yearly earnings associated with its category?
Description: (placeholder only)

**`Null Hypothesis:`** There is no association between a channel's average yearly earnings and its category.

**`Alternative Hypothesis:`** There is a significant association between a channel's average yearly earnings and its category.


In [None]:

hp3_1_table = pd.crosstab(youtube_df['average_yearly_earnings'], youtube_df['category'])
chi2, p, degrees_of_freedom, expected = chi2_contingency(hp3_1_table)

print("Chi-Square Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", degrees_of_freedom)
print("Expected Frequencies Table: \n", expected)
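
The chi-square test relies on an approximation that is usually considered reliable only when the expected counts are not too small; a common rule of thumb asks for an expected count of at least 5 in most cells. A crosstab built from nearly unique earnings values tends to have many cells with counts of 0 or 1, so it is worth inspecting the `expected` array that `chi2_contingency` returns. The sketch below uses a small made-up table, and the 20% threshold is one common convention rather than a rule stated in this notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts only: a sparse table, similar in spirit to a crosstab
# built from nearly unique numeric values.
sparse_table = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [2, 1, 1],
])

chi2, p, dof, expected = chi2_contingency(sparse_table)

# Share of cells whose expected count is below 5.
share_small = (expected < 5).mean()

print(f"dof = {dof}")
print(f"cells with expected count < 5: {share_small:.0%}")
if share_small > 0.2:
    print("Warning: expected counts are too small for a reliable chi-square test.")
```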

### Hypothesis 3.2: Is a channel’s average yearly earnings associated with the country of origin’s population?
Description: (placeholder only)

**`Null Hypothesis:`** There is no association between a channel's average yearly earnings and the population of the country of origin.

**`Alternative Hypothesis:`** There is a significant association between a channel's average yearly earnings and the population of the country of origin.


In [None]:
data = {'average_yearly_earnings': normalized_df['average_yearly_earnings'],
        'population': normalized_df['Population (millions)']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['average_yearly_earnings'], df['population'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

### Hypothesis 3.3: Is a channel’s average yearly earnings associated with the channel's country of origin?
Description: (placeholder only)

**`Null Hypothesis:`** There is no association between a channel's average yearly earnings and the channel's country of origin.

**`Alternative Hypothesis:`** There is a significant association between a channel's average yearly earnings and the channel's country of origin.


In [None]:
hp3_3_table = pd.crosstab(youtube_df['average_yearly_earnings'], youtube_df['Country'])
chi2, p, degrees_of_freedom, expected = chi2_contingency(hp3_3_table)

print("Chi-Square Statistic:", chi2)
print("P-value:", p)
print("Degrees of Freedom:", degrees_of_freedom)
print("Expected Frequencies Table: \n", expected)

### Hypothesis 3.4: Is a channel’s average yearly earnings associated with the channel's upload count?
Description: (placeholder only)

**`Null Hypothesis:`** There is no association between a channel's average yearly earnings and the channel's upload count.

**`Alternative Hypothesis:`** There is a significant association between a channel's average yearly earnings and the channel's upload count.



In [None]:
data = {'average_yearly_earnings': normalized_df['average_yearly_earnings'],
        'uploads': normalized_df['uploads']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['average_yearly_earnings'], df['uploads'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

### Hypothesis 3.5: Is a channel’s average yearly earnings associated with the channel's creation date?
Description: (placeholder only)

**`Null Hypothesis:`** There is no association between a channel's average yearly earnings and the channel's creation date.

**`Alternative Hypothesis:`** There is a significant association between a channel's average yearly earnings and the channel's creation date.


In [None]:
data = {'average_yearly_earnings': normalized_df['average_yearly_earnings'],
        'created_year': normalized_df['created_year']}
df = pd.DataFrame(data)

t_stat, p_value = stats.ttest_ind(df['average_yearly_earnings'], df['created_year'])

print("t-statistic =", t_stat)
print("p-value =", p_value)

## Insights and Conclusion

Clear

    - End of Document -