### Predictive analysis of YouTube trending videos using Machine Learning 

#### 1. Introduction

YouTube is the largest video-sharing platform today. It offers interactive features to viewers and content creators, such as the view count, which records the total number of views a video has gathered to date. The number of views is generally taken as a measure of a video's popularity, and a video usually needs some time to become popular.

#### 2. Problem statement

The purpose of this project is to help two kinds of reader:

a) A YouTuber who wants to make videos that trend

b) An advertiser who wants to identify the best videos to advertise on before they start trending


The project covers two studies:

1. Analyse the correlation between features to determine how interactive video features help a video trend on YouTube. Is a large number of views required for a video to trend? How important is the correlation between features?

2. Compare ML classifiers for predicting the lifecycle of a YouTube trending video, using Random Forest, Support Vector Machine, Decision Tree, Logistic Regression and Gaussian Naïve Bayes, to determine which classifier is best suited for forecasting.
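
As a rough illustration of the first study, the correlation between the interactive features can be computed with pandas. The sketch below uses only the five rows shown in the cleaned-data preview at the end of this notebook, so its coefficients say nothing about the full dataset. The name `sample` is used purely for illustration.

```python
import pandas as pd

# Five rows copied from the cleaned-data preview in this notebook (illustrative only)
sample = pd.DataFrame({
    "view_count":    [1514614, 2381688, 2038853, 496771, 1123889],
    "likes":         [156908, 146739, 353787, 23251, 45802],
    "dislikes":      [5855, 2794, 2628, 1856, 964],
    "comment_count": [35313, 16549, 40221, 7647, 2196],
})

# Pearson correlation between every pair of numeric features
corr = sample.corr(method="pearson")
print(corr.round(2))
```

On the real data, the same call on the four count columns would give the matrix that a heatmap such as `sns.heatmap(corr, annot=True)` could display.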

#### 3. Data description

The dataset is obtained from Kaggle [Source](https://www.kaggle.com/rsrishav/youtube-trending-video-dataset?select=BR_youtube_trending_data.csv).

Data is included for the CA, GB and US regions (Canada, Great Britain and the USA respectively), with up to 200 listed trending videos per day. It includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

File type: csv

* video_id: Uniquely identifies each video
* publishedAt: Date and time the video was published
* categoryId: Id of the category the video belongs to
* trending_date: Date and time when the video got to Trending
* view_count: Number of views (cumulative)
* likes: Number of likes (cumulative)
* dislikes: Number of dislikes (cumulative)
* comment_count: Number of comments (cumulative)
* country: Country in which the video was trending (added in this notebook after loading)
* description: Description of the video by the creator
* tags: Tags of the video by the creator
* title: Title of the video
* channelTitle: Title of the channel that published the video
* thumbnail_link: Link to the video thumbnail
* comments_disabled: Boolean value that indicates whether comments are disabled for the video
* ratings_disabled: Boolean value that indicates whether ratings (likes and dislikes) are disabled for the video
* channelId: Uniquely identifies the channel the video comes from

File type: json

* id: Id of category the video belongs to
* name: Respective category names of category ids

#### Importing libraries and data

In [32]:
# import libraries

# Basic libraries
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_format = 'retina'
%matplotlib inline


#NLTK libraries
import re
import string
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer


#Ignore warnings
import warnings
warnings.filterwarnings('ignore')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rimay\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
# read the files

US_df = pd.read_csv('./Data/US_youtube_trending_data.csv')
US_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...


In [34]:
# read the files

UK_df = pd.read_csv('./Data/GB_youtube_trending_data.csv')
UK_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353790,2628,40228,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...


In [35]:
# read the files

CA_df = pd.read_csv('./Data/CA_youtube_trending_data.csv')
CA_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,KX06ksuS6Xo,Diljit Dosanjh: CLASH (Official) Music Video |...,2020-08-11T07:30:02Z,UCZRdNleCgW-BGUJf-bbjzQg,Diljit Dosanjh,10,2020-08-12T00:00:00Z,clash diljit dosanjh|diljit dosanjh|diljit dos...,9140911,296541,6180,30059,https://i.ytimg.com/vi/KX06ksuS6Xo/default.jpg,False,False,CLASH official music video performed by DILJIT...


In [36]:
# looking at the shape of the data

print('Shape of UK File: '+ str(UK_df.shape))
print('Shape of CA File: '+ str(CA_df.shape))
print('Shape of US File: '+ str(US_df.shape))


Shape of UK File: (111995, 16)
Shape of CA File: (111944, 16)
Shape of US File: (111991, 16)


In [37]:
import json  # standard-library JSON module

# load the Canadian category lookup
with open('./Data/CA_category_id.json', 'r') as f:
    category_ca = json.load(f)

In [38]:
import json  # standard-library JSON module

# load the US category lookup
with open('./Data/US_category_id.json', 'r') as f:
    category_us = json.load(f)

In [39]:
import json  # standard-library JSON module

# load the GB category lookup
with open('./Data/GB_category_id.json', 'r') as f:
    category_uk = json.load(f)

In [40]:
# Read the json file into dataframe and normalise the data

us_cat = pd.json_normalize(category_us,record_path='items')
ca_cat = pd.json_normalize(category_ca,record_path='items')
uk_cat = pd.json_normalize(category_uk,record_path='items')
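
To show what `record_path='items'` does, here is a minimal sketch on a single hand-built category entry. The field names and values are taken from the merged output shown later in this notebook. Storing `id` as a string is an assumption, inferred from the `astype(int)` conversion applied afterwards.

```python
import pandas as pd

# One category entry shaped like the columns that appear after the merge
# (assumption: "id" is a string in the JSON file)
category_sample = {
    "items": [
        {
            "kind": "youtube#videoCategory",
            "etag": "QMEBz6mxVdklVaq8JwesPEw_4nI",
            "id": "22",
            "snippet": {
                "title": "People & Blogs",
                "assignable": True,
                "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
            },
        }
    ]
}

# Each element of "items" becomes one row; nested keys are flattened with a dot
cat = pd.json_normalize(category_sample, record_path="items")
cat["id"] = cat["id"].astype(int)
print(cat.columns.tolist())
```

The flattened `snippet.title` column is the one later renamed to `category_name`.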

In [41]:
# convert the 'id' to int type

us_cat['id']= us_cat['id'].astype(int)
ca_cat['id']= ca_cat['id'].astype(int)
uk_cat['id']= uk_cat['id'].astype(int)


In [42]:
# merging the data and category data

US_df = US_df.merge(us_cat, how='left', left_on='categoryId',
                    right_on='id').rename(columns={'snippet.title': 'category_name'})
CA_df = CA_df.merge(ca_cat, how='left', left_on='categoryId',
                    right_on='id').rename(columns={'snippet.title': 'category_name'})
UK_df = UK_df.merge(uk_cat, how='left', left_on='categoryId',
                    right_on='id').rename(columns={'snippet.title': 'category_name'})
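
A left merge keeps every video even when its `categoryId` has no match in the category file, which leaves `category_name` empty for those rows. The sketch below shows one way to surface such rows with the `indicator` and `validate` options of `merge`. The frames `videos` and `categories` and their video ids are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames: category 29 has no entry in the lookup
videos = pd.DataFrame({"video_id": ["a", "b", "c"], "categoryId": [22, 29, 22]})
categories = pd.DataFrame({"id": [22], "category_name": ["People & Blogs"]})

merged = videos.merge(
    categories,
    how="left",
    left_on="categoryId",
    right_on="id",
    indicator=True,          # adds a "_merge" column describing each row's origin
    validate="many_to_one",  # raises if a category id is duplicated in the lookup
)

# Category ids that found no match on the right-hand side
unmatched = merged.loc[merged["_merge"] == "left_only", "categoryId"].value_counts()
print(unmatched)
```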

In [43]:
# shape of data after merging

print('Shape of UK File: '+ str(UK_df.shape))
print('Shape of CA File: '+ str(CA_df.shape))
print('Shape of US File: '+ str(US_df.shape))

Shape of UK File: (111995, 22)
Shape of CA File: (111944, 22)
Shape of US File: (111991, 22)


In [44]:
# adding "country" to specify the location

US_df['country']= 'usa'
CA_df['country']= 'canada'
UK_df['country']= 'united kingdom'

In [45]:
US_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,...,comments_disabled,ratings_disabled,description,kind,etag,id,category_name,snippet.assignable,snippet.channelId,country
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,...,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...,youtube#videoCategory,QMEBz6mxVdklVaq8JwesPEw_4nI,22,People & Blogs,True,UCBR8-60-B28hp2BmDPdntcQ,usa


In [46]:
# concatenate the 3 files

df_list= [US_df,CA_df,UK_df]
df= pd.concat(df_list).reset_index(drop=True)
df.shape

(335930, 23)
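
The read, label and concatenate steps above repeat the same code for each region, so they could be folded into a loop. This is only a sketch: the helper `load_regions` and the mapping `REGIONS` are hypothetical names, the file naming follows the paths used in this notebook, and the category merge is left out for brevity.

```python
from pathlib import Path

import pandas as pd

# Region code used in the file name -> label written to the "country" column
REGIONS = {"US": "usa", "CA": "canada", "GB": "united kingdom"}


def load_regions(data_dir):
    """Read one trending CSV per region, tag it with its country, and stack them."""
    frames = []
    for code, country in REGIONS.items():
        frame = pd.read_csv(Path(data_dir) / f"{code}_youtube_trending_data.csv")
        frame["country"] = country
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Calling `load_regions('./Data')` would return a single frame with a fresh index, similar to `pd.concat(df_list).reset_index(drop=True)`.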

#### 4. Data preprocessing

In [47]:
# Drop unused column
df.drop(columns=['thumbnail_link','kind','etag','id','snippet.assignable','snippet.channelId','channelId'], axis='columns',inplace=True)

In [48]:
df.shape

(335930, 16)

In [49]:
df.dtypes

video_id             object
title                object
publishedAt          object
channelTitle         object
categoryId            int64
trending_date        object
tags                 object
view_count            int64
likes                 int64
dislikes              int64
comment_count         int64
comments_disabled      bool
ratings_disabled       bool
description          object
category_name        object
country              object
dtype: object

In [50]:
# check for null value

df.isna().sum()

video_id                0
title                   0
publishedAt             0
channelTitle            0
categoryId              0
trending_date           0
tags                    0
view_count              0
likes                   0
dislikes                0
comment_count           0
comments_disabled       0
ratings_disabled        0
description          8259
category_name         173
country                 0
dtype: int64

In [51]:
# Dealing with null value in `description`, fill up the description with title

df.description = np.where(df.description.isnull(), df.title, df.description)
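
The same fill can also be written with `Series.fillna`, which accepts another Series and aligns it by index. The sketch below compares both forms on a made-up two-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the second video has no description
sample = pd.DataFrame({
    "title": ["Video A", "Video B"],
    "description": ["Some description", np.nan],
})

# Form used in this notebook
filled_where = np.where(sample["description"].isnull(), sample["title"], sample["description"])

# Equivalent form: missing descriptions take the title from the same row
filled_fillna = sample["description"].fillna(sample["title"])

print(filled_fillna.tolist())  # ['Some description', 'Video B']
```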


In [52]:
# see the list of category ID
df['categoryId'].value_counts()

24    71463
20    56930
10    48106
17    44646
22    31302
23    19447
28    12132
1     11180
25    10756
26    10699
27     8473
2      6763
19     2021
15     1751
29      261
Name: categoryId, dtype: int64

In [53]:
df['category_name'].value_counts()

Entertainment            71463
Gaming                   56930
Music                    48106
Sports                   44646
People & Blogs           31302
Comedy                   19447
Science & Technology     12132
Film & Animation         11180
News & Politics          10756
Howto & Style            10699
Education                 8473
Autos & Vehicles          6763
Travel & Events           2021
Pets & Animals            1751
Nonprofits & Activism       88
Name: category_name, dtype: int64

In [54]:
# Fill the null values in category_name with `Nonprofits & Activism`:
# the 173 unnamed rows all have categoryId 29 (261 rows in total, 88 of them already named)

df['category_name'] = df['category_name'].fillna('Nonprofits & Activism')
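
Filling with a constant works here because every missing name belongs to the same category. A more general alternative, sketched below on made-up rows, builds an id-to-name lookup from the rows that already have a name and maps it onto the missing ones. This is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two rows of category 29 are missing their name
sample = pd.DataFrame({
    "categoryId": [29, 29, 22, 29],
    "category_name": ["Nonprofits & Activism", np.nan, "People & Blogs", np.nan],
})

# Build an id -> name lookup from the rows that already carry a name
lookup = (
    sample.dropna(subset=["category_name"])
          .drop_duplicates("categoryId")
          .set_index("categoryId")["category_name"]
)

# Fill each missing name from the lookup, keyed by the row's own categoryId
sample["category_name"] = sample["category_name"].fillna(sample["categoryId"].map(lookup))
print(sample)
```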

In [55]:
df.groupby(['categoryId', 'category_name']).size()

categoryId  category_name        
1           Film & Animation         11180
2           Autos & Vehicles          6763
10          Music                    48106
15          Pets & Animals            1751
17          Sports                   44646
19          Travel & Events           2021
20          Gaming                   56930
22          People & Blogs           31302
23          Comedy                   19447
24          Entertainment            71463
25          News & Politics          10756
26          Howto & Style            10699
27          Education                 8473
28          Science & Technology     12132
29          Nonprofits & Activism      261
dtype: int64

In [56]:
# Convert the ISO 8601 timestamp strings to dates
df['trending_date'] = pd.to_datetime(df['trending_date']).dt.date


# Show date-range of data
print("Youtube start date:", df['trending_date'].min())
print("Youtube end date:", df['trending_date'].max())


Youtube start date: 2020-08-12
Youtube end date: 2022-02-18


In [57]:
# Convert the ISO 8601 timestamp strings to dates
df['publishedAt'] = pd.to_datetime(df['publishedAt']).dt.date


# Show date-range of data
print("Youtube start date:", df['publishedAt'].min())
print("Youtube end date:", df['publishedAt'].max())

Youtube start date: 2020-07-27
Youtube end date: 2022-02-17
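
With both columns parsed, the gap between publication and trending can be derived, which may be useful when studying a video's lifecycle. The sketch below uses two timestamp pairs from the previews above. The column name `days_to_trend` is a hypothetical addition, not something this notebook creates.

```python
import pandas as pd

# Timestamp strings copied from the US and CA previews in this notebook
sample = pd.DataFrame({
    "publishedAt": ["2020-08-11T19:20:14Z", "2020-08-11T07:30:02Z"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
})

# Parse the ISO 8601 strings as timezone-aware UTC timestamps
published = pd.to_datetime(sample["publishedAt"], utc=True)
trending = pd.to_datetime(sample["trending_date"], utc=True)

# Compare calendar days only, by truncating both timestamps to midnight
sample["days_to_trend"] = (trending.dt.normalize() - published.dt.normalize()).dt.days
print(sample["days_to_trend"].tolist())  # [1, 1]
```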


In [58]:
# Drop duplicated `video_id` rows, keeping the first occurrence of each video

df.drop_duplicates(subset=['video_id'], inplace=True)

In [59]:
df.shape

(36355, 16)
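
Because the counts are cumulative, which duplicate is kept decides which snapshot of a video survives. The sketch below, on made-up rows, contrasts the default `keep='first'` with keeping the latest trending day, and counts how many days each video trended before the duplicates are dropped. The names `first_seen`, `latest` and `days_trending` are illustrative only.

```python
import pandas as pd

# Hypothetical sample: "vid1" trends on two consecutive days
sample = pd.DataFrame({
    "video_id": ["vid1", "vid1", "vid2"],
    "trending_date": ["2020-08-12", "2020-08-13", "2020-08-12"],
    "view_count": [100, 250, 80],
})

# Default behaviour: keep the first row seen for each video
first_seen = sample.drop_duplicates(subset=["video_id"], keep="first")

# Alternative: keep the most recent trending day, which holds the latest counts
latest = (
    sample.sort_values("trending_date")
          .drop_duplicates(subset=["video_id"], keep="last")
)

# Number of distinct trending days per video, computed before deduplication
days_trending = sample.groupby("video_id")["trending_date"].nunique()
print(days_trending)
```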

In [60]:
# clean text
import re
from nltk.corpus import stopwords
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def clean_text(text):
    """
        text: a string
        
        return: modified string
    """
    text = str(text).lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(r" \d+", " ", text) # remove digit runs that follow a space
    text = re.sub(r" #\d+", " ", text) # remove digit runs that follow " #"
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
df['clean_text'] = df['title'].apply(clean_text)
df.head()

Unnamed: 0,video_id,title,publishedAt,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,comments_disabled,ratings_disabled,description,category_name,country,clean_text
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11,Brawadis,22,2020-08-12,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...,People & Blogs,usa,asked girlfriend
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11,Apex Legends,20,2020-08-12,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,False,False,"While running her own modding shop, Ramya Pare...",Gaming,usa,apex legends stories outlands endorsement
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11,jacksepticeye,24,2020-08-12,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,False,False,I left youtube for a month and this is what ha...,Entertainment,usa,left youtube month happened
3,kXLn3HkpjaA,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11,XXL,10,2020-08-12,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,False,False,Subscribe to XXL → http://bit.ly/subscribe-xxl...,Music,usa,xxl freshman class revealed official announcement
4,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11,Mr. Kate,26,2020-08-12,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,False,False,Transforming The LaBrant Family's empty white ...,Howto & Style,usa,ultimate diy home movie theater labrant family
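
To trace what `clean_text` does to a single title, the sketch below repeats its steps in a standalone form. So that it runs without downloading the NLTK corpus, it uses a small hand-picked stopword set, `SAMPLE_STOPWORDS`, which is an assumption made for this example and not the full English list used above.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# Assumption: a tiny stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = {"i", "her", "to", "be", "my", "the", "for", "a", "and"}


def clean_text_demo(text, stop_words=SAMPLE_STOPWORDS):
    """Apply the same cleaning steps as clean_text with a configurable stopword set."""
    text = str(text).lower()                   # lowercase
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub('', text)        # drop any other punctuation
    text = re.sub(r" \d+", " ", text)          # drop digit runs that follow a space
    text = re.sub(r" #\d+", " ", text)         # drop digit runs that follow " #"
    return ' '.join(word for word in text.split() if word not in stop_words)


print(clean_text_demo("I ASKED HER TO BE MY GIRLFRIEND..."))  # asked girlfriend
```

Note that only digits preceded by a space are removed, so a number at the very start of a title is kept.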


In [61]:
# Convert the data to csv

df.to_csv('./Data/df_clean.csv', index=False)
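
When the cleaned file is read back later, the date columns come back as plain text unless they are parsed again. The sketch below shows the round trip on a one-row frame written to a temporary directory. The file name `df_clean_sample.csv` is hypothetical.

```python
import os
import tempfile
from datetime import date

import pandas as pd

# One row with date objects, like the cleaned frame after the .dt.date conversion
sample = pd.DataFrame({
    "video_id": ["3C66w5Z0ixs"],
    "publishedAt": [date(2020, 8, 11)],
    "trending_date": [date(2020, 8, 12)],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df_clean_sample.csv")
    sample.to_csv(path, index=False)

    # Without parse_dates the columns are read back as text
    plain = pd.read_csv(path)
    # With parse_dates they are restored as datetime columns
    parsed = pd.read_csv(path, parse_dates=["publishedAt", "trending_date"])

print(plain.dtypes["publishedAt"], parsed.dtypes["publishedAt"])
```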