# **YouTube Earnings Prediction Project**
---

## **Problem Statement**

The success of content creators on YouTube is often associated with various factors such as the number of views, engagement metrics, and the audience's demographics. This project seeks to develop a predictive model that estimates a YouTuber's earnings based on key performance indicators and other relevant factors. The goal is to provide content creators and stakeholders with a tool that offers insights into potential earnings, aiding in strategic decision-making and content optimization.

---


## **Main Objective**
Develop a robust and accurate linear regression model to predict YouTube earnings for content creators, leveraging key performance indicators and relevant factors, in order to empower content creators with actionable insights for optimizing their content strategy and maximizing revenue on the YouTube platform.

---

### Specific Objectives
1. **Exploratory Data Analysis:** Perform EDA on the data to better understand the dataset. Clean and preprocess the dataset to handle missing values, outliers, and any other data inconsistencies.
2. **Feature Selection:** Identify and select the independent variables that have the most significant impact on earnings.
3. **Model Development:** Apply linear regression modeling techniques to establish a relationship between the chosen independent variables and YouTube earnings.
4. **Assess Model Performance:** Estimate and interpret the coefficients of the significant predictor variables.
5. **Interpretation of Results:** Interpret the coefficients of the regression model to understand the relative importance of each independent variable in predicting YouTube earnings. Provide insights into the factors that most strongly influence earnings.
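
As a rough illustration of objectives 3 to 5, the sketch below fits a one-predictor least-squares line in plain Python and reports the intercept, slope, and R². The numbers are made up for illustration and are not taken from the dataset, and the names `fit_simple_ols`, `views_millions`, and `earnings_thousands` are hypothetical. The project itself may use several predictors and a dedicated modeling library.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared


# Made-up values, for illustration only (not from the dataset)
views_millions = [1.0, 2.0, 3.0, 4.0, 5.0]
earnings_thousands = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, r_squared = fit_simple_ols(views_millions, earnings_thousands)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, R^2={r_squared:.3f}")
```

The slope reads as the expected change in earnings per one-unit change in the predictor, which is the kind of interpretation objective 5 calls for.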

---


## **Details on the Dataset**

This dataset contains statistics on the most-subscribed YouTube channels. It includes each channel's subscriber count, video views, upload count, country of origin, estimated earnings, and more, which makes it well suited to analyzing the platform's top creators.



### **Description of columns:**
---
**rank:** Position of the YouTube channel based on the number of subscribers

**Youtuber:** Name of the YouTube channel

**subscribers:** Number of subscribers to the channel

**video views:** Total views across all videos on the channel

**category:** Category or niche of the channel

**Title:** Title of the YouTube channel

**uploads:** Total number of videos uploaded on the channel

**Country:** Country where the YouTube channel originates

**Abbreviation:** Abbreviation of the country

**channel_type:** Type of the YouTube channel (e.g., Music, Games, Entertainment)

**video_views_rank:** Ranking of the channel based on total video views

**country_rank:** Ranking of the channel based on the number of subscribers within its country

**channel_type_rank:** Ranking of the channel among channels of the same type

**video_views_for_the_last_30_days:** Total video views in the last 30 days

**lowest_monthly_earnings:** Lowest estimated monthly earnings from the channel

**highest_monthly_earnings:** Highest estimated monthly earnings from the channel

**lowest_yearly_earnings:** Lowest estimated yearly earnings from the channel

**highest_yearly_earnings:** Highest estimated yearly earnings from the channel

**subscribers_for_last_30_days:** Number of new subscribers gained in the last 30 days

**created_year:** Year when the YouTube channel was created

**created_month:** Month when the YouTube channel was created

**created_date:** Day of the month on which the YouTube channel was created

**Gross tertiary education enrollment (%):** Percentage of the population enrolled in tertiary education in the country

**Population:** Total population of the country

**Unemployment rate:** Unemployment rate in the country

**Urban_population:** Number of people living in urban areas in the country

**Latitude:** Latitude coordinate of the country's location

**Longitude:** Longitude coordinate of the country's location
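
Several of the column names above contain spaces or symbols (for example `video views` and `Gross tertiary education enrollment (%)`), which can be awkward to type in later code. One optional way to normalize them is sketched below. The helper `to_snake_case` is hypothetical and not part of the original notebook.

```python
import re


def to_snake_case(name):
    """Lowercase a column name and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


# A few column names taken from the descriptions above
sample_columns = [
    "Youtuber",
    "video views",
    "Gross tertiary education enrollment (%)",
    "Unemployment rate",
    "Urban_population",
]

renamed = {name: to_snake_case(name) for name in sample_columns}
for old, new in renamed.items():
    print(f"{old!r} -> {new!r}")
```

With a loaded DataFrame this could be applied as `df = df.rename(columns=to_snake_case)`, since `DataFrame.rename` accepts a function for `columns`.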

In [3]:
# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [4]:
# Load the dataset; latin-1 can decode any byte, and the first column (rank) becomes the index
df = pd.read_csv('Global YouTube Statistics (1).csv', encoding='latin-1', index_col=0)

In [5]:
df.head()

rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,1.0,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,4055159.0,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,48.0,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2.0,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,3.0,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


In [6]:
df.tail()

rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
991,Natan por Aï¿,12300000,9029610000.0,Sports,Natan por Aï¿,1200,Brazil,BR,Entertainment,525.0,...,700000.0,2017.0,Feb,12.0,51.3,212559400.0,12.08,183241641.0,-14.235004,-51.92528
992,Free Fire India Official,12300000,1674410000.0,People & Blogs,Free Fire India Official,1500,India,IN,Games,6141.0,...,300000.0,2018.0,Sep,14.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
993,Panda,12300000,2214684000.0,,HybridPanda,2452,United Kingdom,GB,Games,129005.0,...,1000.0,2006.0,Sep,11.0,60.0,66834400.0,3.85,55908316.0,55.378051,-3.435973
994,RobTopGames,12300000,374123500.0,Gaming,RobTopGames,39,Sweden,SE,Games,35112.0,...,100000.0,2012.0,May,9.0,67.0,10285450.0,6.48,9021165.0,60.128161,18.643501
995,Make Joke Of,12300000,2129774000.0,Comedy,Make Joke Of,62,India,IN,Comedy,4568.0,...,100000.0,2017.0,Aug,1.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
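
The channel name in row 991 ends in `Aï¿`, which looks like mojibake rather than a real name. One plausible explanation, offered as an assumption rather than something verified against the CSV, is that the file stores the Unicode replacement character (U+FFFD) as UTF-8 bytes, and reading with `encoding='latin-1'` turns each of those bytes into a separate character. The standalone sketch below reproduces that effect on a made-up string:

```python
# Assumption: the source file holds U+FFFD encoded as UTF-8 (bytes EF BF BD).
stored_text = "Natan por A\ufffd"
raw_bytes = stored_text.encode("utf-8")

# latin-1 maps every byte to exactly one character, so decoding never fails,
# but a multi-byte UTF-8 sequence comes out as several unrelated characters.
decoded_as_latin1 = raw_bytes.decode("latin-1")
print(decoded_as_latin1)

# Re-encoding as latin-1 and decoding as UTF-8 reverses the damage.
repaired = decoded_as_latin1.encode("latin-1").decode("utf-8")
print(repaired == stored_text)
```

Even after the round trip the original letter is not recovered, because U+FFFD only marks the place where an undecodable character used to be.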


In [7]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
subscribers,995.0,22982410.0,17526110.0,12300000.0,14500000.0,17700000.0,24600000.0,245000000.0
video views,995.0,11039540000.0,14110840000.0,0.0,4288145000.0,7760820000.0,13554700000.0,228000000000.0
uploads,995.0,9187.126,34151.35,0.0,194.5,729.0,2667.5,301308.0
video_views_rank,994.0,554248.9,1362782.0,1.0,323.0,915.5,3584.5,4057944.0
country_rank,879.0,386.0535,1232.245,1.0,11.0,51.0,123.0,7741.0
channel_type_rank,962.0,745.7193,1944.387,1.0,27.0,65.5,139.75,7741.0
video_views_for_the_last_30_days,939.0,175610300.0,416378200.0,1.0,20137500.0,64085000.0,168826500.0,6589000000.0
lowest_monthly_earnings,995.0,36886.15,71858.72,0.0,2700.0,13300.0,37900.0,850900.0
highest_monthly_earnings,995.0,589807.8,1148622.0,0.0,43500.0,212700.0,606800.0,13600000.0
lowest_yearly_earnings,995.0,442257.4,861216.1,0.0,32650.0,159500.0,455100.0,10200000.0
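
The `count` column above is lower than 995 for several columns, which means those columns contain missing values. The sketch below works out how many, using only the counts printed in the summary. The dictionary `non_null_counts` is transcribed by hand for illustration.

```python
TOTAL_ROWS = 995  # the fully populated columns above report count = 995

# Non-null counts copied from the summary table above
non_null_counts = {
    "video_views_rank": 994,
    "country_rank": 879,
    "channel_type_rank": 962,
    "video_views_for_the_last_30_days": 939,
}

# Missing values per column = total rows minus non-null count
missing = {col: TOTAL_ROWS - count for col, count in non_null_counts.items()}

# Print columns from most to least missing
for col, n_missing in sorted(missing.items(), key=lambda item: -item[1]):
    share = 100 * n_missing / TOTAL_ROWS
    print(f"{col}: {n_missing} missing ({share:.1f}%)")
```

On the loaded DataFrame, `df.isna().sum()` gives the same per-column counts directly.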


In [8]:
df.shape

(995, 27)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 995 entries, 1 to 995
Data columns (total 27 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Youtuber                                 995 non-null    object 
 1   subscribers                              995 non-null    int64  
 2   video views                              995 non-null    float64
 3   category                                 949 non-null    object 
 4   Title                                    995 non-null    object 
 5   uploads                                  995 non-null    int64  
 6   Country                                  873 non-null    object 
 7   Abbreviation                             873 non-null    object 
 8   channel_type                             965 non-null    object 
 9   video_views_rank                         994 non-null    float64
 10  country_rank                             879 non-null  