Linear regression model that predicts the number of views a YouTube video will get

In AI/ML engineering, being able to curate your own dataset is valuable, but it is equally important to be able to find existing datasets (usually publicly available, but essentially any that you did not create yourself) that provide clean, high-quality data in large enough quantities for your model to produce accurate outputs.

For this YouTube regression model, we are going to use a publicly available dataset from a popular source, Kaggle.

In [None]:
# first we need to install the kaggle and pandas packages
%pip install kaggle
%pip install pandas

Collecting kaggle
  Downloading kaggle-1.7.4.5-py3-none-any.whl (181 kB)
Collecting protobuf
  Downloading protobuf-6.31.1-cp310-abi3-win_amd64.whl (435 kB)
Collecting bleach
  Downloading bleach-6.2.0-py3-none-any.whl (163 kB)
Collecting webencodings
  Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting text-unidecode
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: webencodings, text-unidecode, python-slugify, protobuf, bleach, kaggle
Successfully installed bleach-6.2.0 kaggle-1.7.4.5 protobuf-6.31.1 python-slugify-8.0.4 text-unidecode-1.3 webencodings-0.5.1
Note: you may need to restart the kernel to use updated packages.




In [5]:
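# then we need to set up the Kaggle API credentials
# Set these BEFORE importing kaggle: the package tries to authenticate as soon as
# it is imported and fails if it cannot find any credentials.
import os
os.environ["KAGGLE_USERNAME"] = "your_username"
os.environ["KAGGLE_KEY"] = "your_key"
import kaggle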

In [6]:
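# Enter your own credentials at runtime instead of hardcoding them in the notebook
from getpass import getpass
os.environ["KAGGLE_USERNAME"] = input("Kaggle username: ")
os.environ["KAGGLE_KEY"] = getpass("Kaggle API key: ")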
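
Hardcoding an API key in a notebook makes it easy to leak. Here is a minimal sketch of one alternative, assuming a `kaggle.json` token file in the format Kaggle generates (a JSON object with `username` and `key` fields). The helper name `load_kaggle_credentials` is hypothetical, and the path is wherever you saved the file:

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json-style file and export its values as environment variables.

    Hypothetical helper for illustration; `path` points at your own token file.
    """
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```

For example, `load_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")` would read the token from the `~/.kaggle` folder the Kaggle client conventionally uses, keeping the key out of the notebook itself.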

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

# then we use the Kaggle API to authenticate
api = KaggleApi()
api.authenticate()

# Our dataset is 'rsrishav/youtube-trending-video-dataset'
dataset_name = 'rsrishav/youtube-trending-video-dataset'

# For some datasets, Kaggle creates a zip file of the dataset and uploads it to Google Cloud Storage.
# The problem is that Kaggle only does this for SOME datasets, not all of them,
# and the dataset we are using does not have a zip file like this.
# If it DID have one, we could download it with the following code:
# api.dataset_download_files(dataset_name, path='data/', unzip=True)

# Since it does not, the remaining options are to download the file manually from the Kaggle dataset page,
# or to use the Kaggle CLI together with the API credentials we set up above.
# From a brief look, setting up the Kaggle CLI seems time-consuming in a notebook, so I might add that later.
# For now, we will download the dataset directly from the Kaggle dataset page:
# https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset
# I have downloaded the dataset manually and saved it to data/GB_youtube_trending_data.csv


If any of the above does not work for you, don't worry. This repository ships with all of its datasets already downloaded, so if the Kaggle step is taking too much of your time, you can skip it and move on to the rest of this notebook.
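
The fallback described here (use the bundled copy when it exists, and only otherwise fetch the data) can be sketched as follows. The helper `resolve_dataset` and its `download` callback are hypothetical names; the callback stands in for whichever download method works for you, and no Kaggle call is made in this sketch:

```python
from pathlib import Path


def resolve_dataset(csv_path, download=None):
    """Return the path to the CSV, downloading it only if it is missing.

    `download` is an optional zero-argument callable (hypothetical) that is
    expected to create the file at `csv_path`.
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        return csv_path  # bundled or previously downloaded copy
    if download is not None:
        download()
    if not csv_path.exists():
        raise FileNotFoundError(f"Dataset not found at {csv_path}")
    return csv_path
```

In this repository, `resolve_dataset('data/GB_youtube_trending_data.csv')` would simply return the path to the bundled file.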

In [36]:
# Now we're going to read the dataset (using pandas)
import pandas as pd
df = pd.read_csv('data/GB_youtube_trending_data.csv')

In [None]:
# Here's a quick look at the first 10 rows of our dataset
df.head(10)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353790,2628,40228,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...
1,9nidKH8cM38,TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS',2020-08-11T20:00:45Z,UCFMbX7frWZfuWdjAML0babA,Eleanor Neale,27,2020-08-12T00:00:00Z,eleanor|neale|eleanor neale|eleanor neale true...,236830,16423,209,1642,https://i.ytimg.com/vi/9nidKH8cM38/default.jpg,False,False,The first 1000 people to click the link will g...
2,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11T17:00:10Z,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12T00:00:00Z,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
3,kgUV1MaD_M8,Nines - Clout (Official Video),2020-08-10T18:30:28Z,UCvDkzrj8ZPlBqRd6fIxdhTw,Nines,24,2020-08-12T00:00:00Z,Nines|Trapper of the year|Crop Circle|Nines Tr...,613785,37567,669,2101,https://i.ytimg.com/vi/kgUV1MaD_M8/default.jpg,False,False,Nines - Clout (Official Video)Listen to Clout ...
4,49Z6Mv4_WCA,i don't know what im doing anymore,2020-08-11T20:24:34Z,UCtinbF-Q-fVthA0qrFQTgXQ,CaseyNeistat,22,2020-08-12T00:00:00Z,[None],940036,87113,1860,7052,https://i.ytimg.com/vi/49Z6Mv4_WCA/default.jpg,False,False,ssend love to my sponsor; for a super Limited ...
5,ua4QMFQATco,CGP Grey was WRONG,2020-08-11T17:15:11Z,UC2C_jShtL725hvbm1arSV9w,CGP Grey,27,2020-08-12T00:00:00Z,cgpgrey|education|hello internet,1050143,89192,855,6455,https://i.ytimg.com/vi/ua4QMFQATco/default.jpg,False,False,‣ What Was TEKOI: https://www.youtube.com/watc...
6,x-KbnJ9fvJc,Kya Baat Aa : Karan Aujla (Official Video) Tan...,2020-08-11T09:00:11Z,UCm9SZAl03Rev9sFwloCdz1g,Rehaan Records,10,2020-08-12T00:00:00Z,[None],11308046,655449,33242,405146,https://i.ytimg.com/vi/x-KbnJ9fvJc/default.jpg,False,False,Singer/Lyrics: Karan Aujla Feat Tania Music/ D...
7,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156910,5856,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
8,ZNfeMbO_AHo,Popek ft. Dr Alban - It's My Life (prod. Clay...,2020-08-12T10:00:09Z,UC8Mh9UmrIaQPEcybdWvQsOg,KrólAlbaniiTV,24,2020-08-12T00:00:00Z,[None],277506,27420,617,1268,https://i.ytimg.com/vi/ZNfeMbO_AHo/default.jpg,False,False,Nowa wersja kultowego utworu z lat 90’.Posłuch...
9,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11T15:10:05Z,UCDVPcEbVLQgLZX0Rt6jo34A,Mr. Kate,26,2020-08-12T00:00:00Z,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45803,964,2198,https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg,False,False,Transforming The LaBrant Family's empty white ...


In [None]:
# And some stats
df.describe()

Unnamed: 0,categoryId,view_count,likes,dislikes,comment_count
count,268791.0,268791.0,268791.0,268791.0,268791.0
mean,19.04627,2190558.0,108097.7,975.519404,7982.66
std,6.380657,6651493.0,346531.9,7288.658511,59532.16
min,1.0,0.0,0.0,0.0,0.0
25%,17.0,343502.0,11095.0,0.0,765.0
50%,20.0,774505.0,32137.0,0.0,1978.0
75%,24.0,1844036.0,87201.0,275.0,5071.0
max,29.0,1406330000.0,15246510.0,865075.0,5987770.0


Recall that our goal is to predict the view count. We will first clean the data, then set up the model architecture and train the model.

In [41]:
# Now we're going to clean our dataset to prepare for our model.
# First, we will drop any rows with missing values in the columns we care about.
df = df.dropna(subset=["view_count", "likes", "dislikes", "comment_count"])

# We're also going to create some new features that might be useful for our model.
# Engagement features (the +1 avoids division by zero when view_count is 0)
df["like_ratio"] = df["likes"] / (df["view_count"] + 1)
df["comment_ratio"] = df["comment_count"] / (df["view_count"] + 1)

# Time features: hour of publication (UTC) and day of the week the video trended (Monday=0)
df["published_hour"] = pd.to_datetime(df["publishedAt"]).dt.hour
df["trending_day"] = pd.to_datetime(df["trending_date"]).dt.dayofweek
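
To see what these engineered columns contain, here is a small self-contained sketch that applies the same transformations to two made-up rows. The values in `toy` are illustrative and are not taken from the dataset:

```python
import pandas as pd

# Two made-up rows using the same column names as the dataset
toy = pd.DataFrame({
    "publishedAt": ["2020-08-11T16:34:06Z", "2020-08-12T10:00:09Z"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
    "view_count": [999, 0],
    "likes": [100, 0],
    "comment_count": [10, 0],
})

# The +1 in the denominator keeps the ratio defined when view_count is 0
toy["like_ratio"] = toy["likes"] / (toy["view_count"] + 1)
toy["comment_ratio"] = toy["comment_count"] / (toy["view_count"] + 1)

# The timestamps end in "Z", so the extracted hour is in UTC
toy["published_hour"] = pd.to_datetime(toy["publishedAt"]).dt.hour
# dayofweek counts from Monday=0 to Sunday=6
toy["trending_day"] = pd.to_datetime(toy["trending_date"]).dt.dayofweek

print(toy[["like_ratio", "comment_ratio", "published_hour", "trending_day"]])
```

The second row shows why the `+ 1` matters: with zero views, the ratios come out as 0 instead of raising a division problem.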

In [None]:
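# Now we're going to finalize a list of features we want to use for our model.
# view_count is our target, so it must not appear among the features;
# otherwise the model could simply read the answer from its input.
# Note that like_ratio and comment_ratio are computed from view_count, so they
# still leak information about the target and may need to be dropped as well.
features = [
    "likes",
    "dislikes",
    "comment_count",
    "like_ratio",
    "comment_ratio",
    "published_hour",
    "trending_day"
]
y = df["view_count"]  # Our target variable is view_count
X = df[features]  # Our features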

### Model Architecture
- I still need to decide whether to use scikit-learn or TensorFlow to implement this linear regression model. I will state my decision and justify it here, then set the model up, train it, and use it.
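
Whichever library ends up being chosen, the underlying computation is an ordinary least-squares fit. As a library-neutral sketch (using NumPy on synthetic data, not the YouTube dataset, and not a stand-in for the scikit-learn or TensorFlow implementation still to be decided), the fit and a held-out evaluation might look like this. All names ending in `_demo` and the chosen weights are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known weights plus a little noise
X_demo = rng.normal(size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
true_bias = 4.0
y_demo = X_demo @ true_weights + true_bias + rng.normal(scale=0.01, size=200)

# Hold out the last 20% of rows for evaluation
split = int(0.8 * len(X_demo))
X_train, X_test = X_demo[:split], X_demo[split:]
y_train, y_test = y_demo[:split], y_demo[split:]

# Append a column of ones so the bias is learned as an extra weight
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
weights, bias = coef[:-1], coef[-1]

# Evaluate with mean squared error on the held-out rows
predictions = X_test @ weights + bias
mse = float(np.mean((predictions - y_test) ** 2))
print("weights:", weights, "bias:", bias, "test MSE:", mse)
```

Because the synthetic targets are generated from known weights, the recovered `weights` and `bias` should land close to `true_weights` and `true_bias`, which makes this a handy sanity check before moving to the real data.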