PREDICTING VIDEO CATEGORY USING A NAIVE BAYES MODEL:

Bayes’ Theorem gives us a way to calculate the probability that a piece of data belongs to a given class, given our prior knowledge. Naive Bayes is a classification algorithm that handles both binary and multiclass problems. It is called naive because it assumes the features are conditionally independent given the class, which simplifies the probability calculations for each class and makes them tractable.
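
In symbols, for a class $c$ and features $x_1, \dots, x_n$ (here, word counts from a description; the notation is ours and does not come from the notebook's code), Bayes' theorem and the resulting naive Bayes decision rule can be written as:

```latex
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}

\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The second line uses the naive assumption that the features are conditionally independent given the class, and it drops the denominator because that term is the same for every class.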

Introduction

This notebook is a data science project that predicts the category of a video fetched from YouTube. The program scrapes YouTube video metadata and processes it into train/test data for a Naive Bayes model, which then predicts a category for a given video. The model is trained on data from YouTube, but it can be tested on data from other sites such as Amazon Prime and Netflix, or on any other video that has a title and description.

Background

We will use Selenium to load the pages and BeautifulSoup as an HTML/XML parser to fetch our data, which we organize into a Pandas DataFrame. We will parse common endpoints for each of these websites and attempt to categorize the videos by genre. Once the scraping and data collection are complete, we use the data to train a Naive Bayes model.

In [100]:
from bs4 import BeautifulSoup
import pandas as pd 
import requests
import json
import re

In [101]:
class Soup:
    def __init__(self, name, url):
        self.name = name
        self.url = url
        self.df = pd.DataFrame(columns = ['video_id', 'title', 'category_id', 'description'])

In [102]:
YouTube = Soup("YouTube", "https://www.youtube.com")
NfxMovies = Soup("Netflix Movies", "https://www.netflix.com/ca/browse/genre/34399")
PrimeMovies = Soup("Amazon Prime Movies", "https://www.primevideo.com/storefront/movie/ref=atv_tc_m")

Getting Test Data from YouTube:

Since YouTube is rendered with JavaScript, we need a library like Selenium to crawl the site and find video titles, links, and descriptions. Selenium is a portable framework for testing web applications, but here we use it to crawl YouTube.com.

In [103]:
import time
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

BASE_PATH = "https://www.youtube.com"
CHROMEDRIVER_PATH = '/Users/alanjudi/Downloads/chromedriver'

#selenium chrome driver
options = Options()
#options.add_argument('--headless')
#options.add_argument('--disable-gpu') 

In [104]:
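#Chrome Driver Setup
#Note: executable_path is deprecated in Selenium 4; there, pass
#service=Service(CHROMEDRIVER_PATH) from selenium.webdriver.chrome.service instead
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=options)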
wait = WebDriverWait(driver, 10)

#Get YouTube Url
driver.get(YouTube.url)
time.sleep(5)
driver.execute_script("return window.scrollBy(0,2000);")

#array with all video ids
ids = []

#scroll 30 times so we can get more links
for i in range(30):
    driver.execute_script("return window.scrollBy(0,2000);")
    time.sleep(3)

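#get all a tags with watch inside the href attribute
from urllib.parse import urlparse, parse_qs

soup = BeautifulSoup(driver.page_source,"lxml")
for item in soup.find_all('a', href=True):
    if 'watch' in item['href']:
        #read the v= query parameter; str.strip() removes a set of characters,
        #not a prefix, so it can truncate ids and leaves &list=... in place
        video_id = parse_qs(urlparse(item['href']).query).get('v', [None])[0]
        if video_id and video_id not in ids:
            ids.append(video_id)
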
#quit the driver        
driver.quit()
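
Extracting the video ID from a link is easy to get subtly wrong, because `str.strip` treats its argument as a set of characters rather than as a prefix. The following is a minimal sketch of a safer approach using only the standard library; `extract_video_id` is a hypothetical helper name and the ID in the example is made up for illustration:

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(href):
    """Return the value of the `v` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(href).query).get("v")
    return values[0] if values else None


# Made-up link for illustration only.
href = "/watch?v=abc123XYZ_w"

# strip() removes any leading/trailing characters found in "/watch?v=",
# so characters that belong to the ID can be lost as well.
print(href.strip("/watch?v="))

# Parsing the query string keeps the ID intact.
print(extract_video_id(href))

# Extra query parameters such as a playlist reference are ignored.
print(extract_video_id(href + "&list=PLxyz&start_radio=1"))
```

Links without a `v` parameter yield `None`, so they can be skipped before they reach the list of IDs.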

In [105]:
#Here, we use the YouTubeAPI to get data on the video ids that we collected
KEY = 'AIzaSyDaIwPWq3A1kVxRriYlFqsrfosnnDpPwCo'

def get_info(video_id):
    LINK = 'https://www.googleapis.com/youtube/v3/videos?part=id%2C+snippet&id={vid}&key={API_KEY}'.format(API_KEY=KEY, vid = video_id)
    r = requests.get(LINK)
    return r.json()

for i, _id in enumerate(ids):
    print(_id)
    data = get_info(_id)

    #video information; .get() avoids a KeyError when a response has no 'items' key
    items = data.get('items')

    #skip if empty
    if not items:
        continue
        
    #attach to dataframe    
    title = items[0]['snippet']['title']
    description = items[0]['snippet']['description']
    category_id = items[0]['snippet']['categoryId']
    YouTube.df.loc[i] = [_id, title, category_id, description]

mp3V9tuPf_0
lJlEQim-yMo
bsXSnrSPypU
8Qi39aL3RSE
r6VrteAxb4U
1RzOeVtroY
Uxt9dBOyOY
uS3fNEuUHCQ
QXpxUeL0lz4
iN94uR0aQ0
CfxvtGFG_kk
mDGvPDgk3rU
5_dKkSW5ZDQ
qfNTxW1Jhc0
K5GP0v5p9Wk
Z0TjyjqU39E
1TO48Cnl66
CNgOqNNWLQ4
9do1Uw4w-IU
o0etimvtD74
glOYwS1G-vA
lW80YxvdtT
mjeltClqfF0
ySgylW6tU3A
ZmDBbnmKpqQ
FXzE9eP1U_E
C7Tl7Zn-KRE
QPKXw8XEQiA
n8BVdIoRXi4
rnFRcjzhTzE
pkf1t2trKyg
sN5Oj8ALHV0
D_K45Ltfb24
08mTLN3CG7A
_77HS_NS4Co
jXZAbnn1kTU
OiJsbXXq-AY
6GVgncA9oi
PvaO7A09_HY
QsUfsZzxi9
jkOCa2Xsl
sTd4O8bfVT
5qap5aO4i9A
j-cnex3Bfq8
lStp1m9aF0k
rXUc7rfCG
NaY91YjVbEM
JBRf3nEqfZ8
iS5jqXWECbI
M1jBjq4-bt0
_4kHxtiuML0
PHVv8g7H5sI
Oyx3xkdi4u
j0ViQI6y74
Q_WrhIgNHS8
R14a3rLrZ1o
DmL12NRE4hQ
baoWE8LlK8
Z1CX41MiEi
nK_CklnpOOM
N8lCEJo1y5g
S-gxNYXogKU
8R_7KCtShkU
ArR-ctuKraE
T2YGXxYx-
QfEI7YPJMXY
1LmeQ3Vci-
FqLEwAvaHI
65qVdgZT2to
Z9WUrOX9bKY
kXZwQcyxg2o
_9TShlMkQnc&list=RDCLAK5uy_lrj9qy29eJKUUvkLFw56PiEHq07rDHwkU&start_radio=1
7qH4qyi1-Ys&list=RDCLAK5uy_lrj9qy29eJKUUvkLFw56PiEHq07rDHwkU&start_radio=1
oqMN3y8k9So
8hGGVW
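
The fields read from each response can also be pulled out by a small helper that tolerates missing keys. This is a minimal sketch, assuming the response shape used in the cell above (`items[0]['snippet']` with `title`, `description` and `categoryId`); `row_from_response` and the sample payloads are hypothetical and are not real API output:

```python
def row_from_response(video_id, data):
    """Build one row [video_id, title, category_id, description], or None."""
    items = data.get("items") or []
    if not items:
        return None
    snippet = items[0].get("snippet", {})
    return [
        video_id,
        snippet.get("title", ""),
        snippet.get("categoryId", ""),
        snippet.get("description", ""),
    ]


# Hypothetical payloads, shaped like the fields the notebook reads.
found = {"items": [{"snippet": {"title": "A title",
                                "description": "A description",
                                "categoryId": "10"}}]}
missing = {"items": []}

print(row_from_response("some_id", found))
print(row_from_response("some_id", missing))
```

The returned list follows the column order of `YouTube.df`, so it can be assigned with `YouTube.df.loc[i] = row` when it is not `None`.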

In [106]:
#YouTube Categories to be used for classification
CategoriesJSON = pd.read_json("category_id.JSON")
CategoryDict = [{'id': item['id'], 'title': item['snippet']['title']} for item in CategoriesJSON['items']]
CategoriesDF = pd.DataFrame(CategoryDict)
Categories = CategoriesDF.rename(index=str, columns={"id": "category_id", "title": "category"})
Categories.head(len(Categories))

    category_id  category
0   1            Film & Animation
1   2            Autos & Vehicles
2   10           Music
3   15           Pets & Animals
4   17           Sports
5   18           Short Movies
6   19           Travel & Events
7   20           Gaming
8   21           Videoblogging
9   22           People & Blogs


CREATING THE TRAINING MODEL USING NAIVE BAYES:

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). 
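
To make the multinomial model concrete, here is a minimal plain-Python sketch of the same idea: per-class word counts with additive (Laplace) smoothing, scored in log space. It is an illustration under stated assumptions, not scikit-learn's implementation; the function names and the toy documents are hypothetical, and `alpha=1.0` is chosen to match the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Denominator: all word occurrences in class c plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict(tokens, log_prior, log_like):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, made-up descriptions; labels are category ids as strings.
docs = [
    "official music video remastered".split(),
    "live music performance".split(),
    "funny dog video".split(),
    "cat chases laser".split(),
]
labels = ["10", "10", "15", "15"]

log_prior, log_like = train_multinomial_nb(docs, labels)
print(predict("music video".split(), log_prior, log_like))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long descriptions.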

In [114]:
vids = pd.read_csv("vids.csv", names=['video_id', 'title','category_id', 'description'])
frames = [YouTube.df, vids]
joined_data = pd.concat(frames)

joined_data.head(len(joined_data))

       video_id     title                                               category_id  description
0      mp3V9tuPf_0  2020 YouTube Streamy Awards                         1            Join hosts Trixie and Katya for the 2020 YouTu...
1      lJlEQim-yMo  Relaxing Christmas Jazz Music 10 Hours              10           he Best Compilation of Relaxing, Soothing Trad...
2      bsXSnrSPypU  Shin Lim Performs His KISS Transfer Trick And ...   24           America’s Got Talent: The Champions | Season 1...
3      8Qi39aL3RSE  Shark Demands Hugs Whenever She Sees Her Diver...   15           Blondie the lemon shark loves hugs\n\nVideo by...
4      r6VrteAxb4U  Dog perfectly imitates owner on crutches            15           This dog totally mocks his owner's walk... or ...
...    ...          ...                                                 ...          ...
41471  BZt0qjTWNhw  The Cat Who Caught the Laser                        15           The Cat Who Caught the Laser - Aaron's Animals
41472  1h7KV2sjUWY  True Facts : Ant Mutualism                          22
41473  D6Oy4LfoqsU  I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...   24           I had so much fun transforming Safiyas hair in...
41474  oV0zkMe1K8s  How Black Panther Should Have Ended                 1            How Black Panther Should Have EndedWatch More ...


In [115]:
import numpy as np
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

vector = CountVectorizer()
counts = vector.fit_transform(joined_data['description'].values.astype(str))
print(counts.shape)

(41729, 71348)
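
The printed shape reads as (number of descriptions, number of distinct tokens): 41729 rows and a vocabulary of 71348 terms. The sketch below builds the same kind of count matrix in plain Python for two made-up sentences. It assumes tokens are lowercased runs of two or more word characters, which mirrors the default behaviour of `CountVectorizer`; `count_matrix` is a hypothetical helper, and it returns dense lists where scikit-learn returns a sparse matrix.

```python
import re


def count_matrix(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok in toks:
            row[index[tok]] += 1
        rows.append(row)
    return vocabulary, rows


# Made-up example texts.
texts = ["Dog imitates owner", "Cat chases laser, cat wins"]
vocabulary, rows = count_matrix(texts)

print(vocabulary)
print(len(rows), len(vocabulary))  # documents x distinct tokens
print(rows)
```

Each row is the word-count vector for one description, which is the representation the Naive Bayes model is fitted on.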


In [116]:
#NAIVE BAYES MODEL
Model = MultinomialNB()
targets = joined_data['category_id'].values.astype(str)
Model.fit(counts,targets)
print(targets.shape)

(41729,)


In [117]:
#check the accuracy using a 90/10 train/test split 
X= counts
y= targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .1)

NBtest = MultinomialNB().fit(X_train, y_train)
nb_predictions = NBtest.predict(X_test)
accuracy_nb = NBtest.score(X_test, y_test)
print('The Naive Bayes Algorithm scored an accuracy of', accuracy_nb)

The Naive Bayes Algorithm scored an accuracy of 0.9197220225257609
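
The reported figure is the fraction of held-out descriptions whose predicted category equals the true one. Because `train_test_split` is called without a `random_state`, the split changes from run to run and the accuracy can vary slightly. The sketch below shows the two steps in plain Python; the helper names are hypothetical and the fixed `seed` plays the role that `random_state` would play in scikit-learn.

```python
import math
import random


def split_train_test(items, test_size=0.1, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train, test = split_train_test(list(range(20)), test_size=0.1)
print(len(train), len(test))

# Three of the four made-up predictions match the true labels.
print(accuracy(["10", "15", "24", "10"], ["10", "15", "10", "10"]))
```

Fixing the seed makes the split, and therefore the reported accuracy, reproducible across runs.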


Here we test on brand-new data that the algorithm has never seen, which tells us how well the model generalizes. The results are printed as an array in the same order as the descriptions. For example, an input such as the following:

input = ["description_1", "description_2"]

will output the categories as an array in that same order:

output = ['category_1', 'category_2']

In [118]:
Descriptions = [
    "REMASTERED IN HD!Music video by Ludacris performing Act A Fool. (C) 2003 The Island Def Jam Music Group#Ludacris #ActAFool #Remastered",
    "Watch the G.O.A.T. Bernie Mac perform LIVE from Anaheim Walter Latham Comedy invites you to enjoy your favorite comedy videos of all time.  Please like and share these videos with friends We can also be found at: www.facebook.com/walterlathamproducer www.walterlathamcomedy.com The best multi-ethnic comedy ever produced in one place. Enjoy! We continue to strive to be,"
         ]
_counts = vector.transform(Descriptions)
Predict = Model.predict(_counts)

In [133]:
#print out results
for i in range(len(Predict)):
    print('Description: {D}'.format(D=Descriptions[i]))
    print('\n')
    category = Categories.loc[Categories['category_id'] == Predict[i]].values[0][1]
    print('CATEGORY => {C}'.format(C=category))
    print('\n')

Description: REMASTERED IN HD!Music video by Ludacris performing Act A Fool. (C) 2003 The Island Def Jam Music Group#Ludacris #ActAFool #Remastered


CATEGORY => Music


Description: Watch the G.O.A.T. Bernie Mac perform LIVE from Anaheim Walter Latham Comedy invites you to enjoy your favorite comedy videos of all time.  Please like and share these videos with friends We can also be found at: www.facebook.com/walterlathamproducer www.walterlathamcomedy.com The best multi-ethnic comedy ever produced in one place. Enjoy! We continue to strive to be,


CATEGORY => Entertainment


