In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gensim
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm

## Project:
## YouTube Views: Predict the success of a video before it even goes live.


Let's start by importing the dataset generated via the scrapping notebook.

In [2]:
dfvideos = pd.read_csv("Videos_DF1.csv")


dfvideos.head()

Unnamed: 0.1,Unnamed: 0,title,description,keywords,channel_links,video,date,length,views
0,0,I got the Fortnite Only Up WORLD RECORD! (Spee...,⬆️ PLAY MY ONLY UP MAP NOW!! ► 5264-1761-9807❤...,"video, sharing, camera phone, video phone, fre...",https://www.youtube.com/@TGplays,https://www.youtube.com/watch?v=4HlBgHmknY4,2023-07-12T19:36:52-07:00,PT19M55S,1.2M
1,1,Ron DeSantis: It is important to stand for a c...,2024 GOP presidential candidate Gov. Ron DeSan...,"DeSantis, Ron DeSantis, DeSantis abortion, abo...",https://www.youtube.com/@FoxNews,https://www.youtube.com/watch?v=pWpOn6C0YAk,2024-01-09T16:14:58-08:00,PT5M23S,26K
2,2,"Game Theory: Viewers' Choice, Cyborgs, Fatalit...",Your voices have been heard! As thanks for sup...,"Chrono Trigger, Mario, Super Mario, Illusion o...",https://www.youtube.com/@GameTheory,https://www.youtube.com/watch?v=z4QwsHsu3uw,2011-07-06T09:05:29-07:00,PT8M27S,958K
3,3,C-R-O-W-N-E-D - Kirby's Return to Dream Land +...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=iPx1YkOVGKE,2023-04-26T17:51:46-07:00,PT30M1S,1.2K
4,4,Tostarena: Night - Super Mario Odyssey Music E...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=xYY8KI_00tY,2023-06-28T22:01:18-07:00,PT30M2S,3.1K


Then, let's prepare the tokenization via Word2Vec. Word2Vec is useful because it looks for the meaning of the words within a context. Since we have many different features, its use is appropriate.

In [3]:
# Tokenize the text in each column
title_tokens = [str(title).split() for title in dfvideos['title']]
description_tokens = [str(description).split() for description in dfvideos['description']]
keywords_tokens = [str(keywords).split() for keywords in dfvideos['keywords']]

# Train Word2Vec model
word2vec_model = Word2Vec(title_tokens + description_tokens + keywords_tokens, vector_size=100, window=5, min_count=1, workers=4)


In [4]:
def vectorize_with_word2vec(tokens, model):
    vectors = []
    for token_list in tokens:
        vector = sum([model.wv[word] for word in token_list if word in model.wv])
        vectors.append(vector)
    return vectors

Now, applying the trained model:

In [5]:
# Apply Word2Vec to each column and add vectors to DataFrame
dfvideos['title_vectors'] = vectorize_with_word2vec(title_tokens, word2vec_model)
dfvideos['description_vectors'] = vectorize_with_word2vec(description_tokens, word2vec_model)
dfvideos['keywords_vectors'] = vectorize_with_word2vec(keywords_tokens, word2vec_model)


In [6]:
# Dropping variables that are no longer useful, to save up memory

del title_tokens, description_tokens, keywords_tokens, word2vec_model

Now, to make the final model even more robust, we're gonna tokenize the features with TF-IDF. This technique captures the importance of the words in a given document. It works by measuring how many times a word appears in a given text, and then dividing that by the number of texts that have the word in it. It does not capture semantic relationships, and that's why also used word2vec.

In [7]:
from scipy.sparse import csr_matrix
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD


dfvideos['combined_text'] = dfvideos['title'] + ' ' + dfvideos['description'] + ' ' + dfvideos['keywords']

dfvideos['combined_text'].fillna('', inplace=True)


tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(dfvideos['combined_text'])

tfidf_csr = csr_matrix(tfidf_matrix)

n_components = 5000

svd = TruncatedSVD(n_components=n_components)
tfidf_reduced = svd.fit_transform(tfidf_matrix)

del tfidf_matrix, tfidf_vectorizer

# Add the reduced TF-IDF vectors to your DataFrame
dfvideos['tfidf_vector'] = tfidf_reduced.tolist()


del tfidf_csr

Here we have code for using GloVe and FastText embedding. It was used for experimentation, but won't be used in the final model, so I'm not gonna go into the details. If you just want to reproduce the results, don't run the two cells below.

In [44]:
import pandas as pd
import fasttext
import spacy
import numpy as np

# Load the pre-trained spaCy model with word embeddings
nlp = spacy.load("en_core_web_md")


dfvideos = dfvideos.dropna()

# Function to get GloVe embeddings for a given text
def get_glove_embedding(text):
    doc = nlp(text)
    return doc.vector

tqdm.pandas(desc="Adding GloVe embeddings")
dfvideos['title_glove'] = dfvideos['title'].progress_apply(get_glove_embedding)
dfvideos['description_glove'] = dfvideos['description'].progress_apply(get_glove_embedding)
dfvideos['keywords_glove'] = dfvideos['keywords'].progress_apply(get_glove_embedding)


Adding GloVe embeddings: 100%|██████████| 13334/13334 [00:54<00:00, 243.52it/s]
Adding GloVe embeddings: 100%|██████████| 13334/13334 [01:16<00:00, 173.69it/s]
Adding GloVe embeddings: 100%|██████████| 13334/13334 [01:28<00:00, 149.95it/s]


In [45]:

# Load the pre-trained FastText model
ft_model = fasttext.load_model('cc.en.300.bin')  # Download the FastText model from https://fasttext.cc/docs/en/crawl-vectors.html

# Assuming dfvideos is your DataFrame with "title", "description", and "keywords" columns
# Replace these column names with the actual names in your dataset

# Function to get FastText embeddings for a given text
def get_fasttext_embedding(text):
    try:
        return ft_model.get_sentence_vector(text)
    except Exception as e:
        print(f"Error in FastText embedding: {e}")
        return np.zeros(300)  # Return zeros if there's an error


# Add columns for FastText and GloVe embeddings with progress bar
tqdm.pandas(desc="Adding FastText embeddings")
dfvideos['title_fasttext'] = dfvideos['title'].progress_apply(get_fasttext_embedding)
dfvideos['description_fasttext'] = dfvideos['description'].progress_apply(get_fasttext_embedding)
dfvideos['keywords_fasttext'] = dfvideos['keywords'].progress_apply(get_fasttext_embedding)




Adding FastText embeddings: 100%|██████████| 13334/13334 [00:00<00:00, 32811.10it/s]
Adding FastText embeddings: 100%|██████████| 13334/13334 [00:00<00:00, 14479.90it/s]
Adding FastText embeddings:  77%|███████▋  | 10321/13334 [00:00<00:00, 13965.79it/s]

Error in FastText embedding: predict processes one line at a time (remove '\n')


Adding FastText embeddings: 100%|██████████| 13334/13334 [00:00<00:00, 14044.41it/s]


In [21]:
dfvideos

Unnamed: 0.1,Unnamed: 0,title,description,keywords,channel_links,video,date,length,views,title_vectors,description_vectors,keywords_vectors,combined_text,tfidf_vector,title_glove,description_glove,keywords_glove,title_fasttext,description_fasttext,keywords_fasttext
0,0,I got the Fortnite Only Up WORLD RECORD! (Spee...,⬆️ PLAY MY ONLY UP MAP NOW!! ► 5264-1761-9807❤...,"video, sharing, camera phone, video phone, fre...",https://www.youtube.com/@TGplays,https://www.youtube.com/watch?v=4HlBgHmknY4,2023-07-12T19:36:52-07:00,PT19M55S,1.2M,"[-2.3758144, 0.41034794, 3.293144, 3.3258832, ...","[-15.166822, 1.345074, 11.923557, 0.2294234, 0...","[15.253917, 14.500347, 0.69548845, -8.14942, 1...",I got the Fortnite Only Up WORLD RECORD! (Spee...,"[0.3607159380254042, 0.10303499587196051, -0.0...","[-0.3377533, -2.060342, 1.7744125, -0.66898245...","[-0.2325979, -2.5292861, -0.6437167, -1.292271...","[-0.9185046, -1.9906908, -1.1067432, -0.106053...","[0.0049519073, -0.009549503, -0.00046868168, 0...","[-0.0035246222, -0.028688457, -0.0018514928, 0...","[0.047917824, 0.04999085, 0.01913951, 0.133339..."
1,1,Ron DeSantis: It is important to stand for a c...,2024 GOP presidential candidate Gov. Ron DeSan...,"DeSantis, Ron DeSantis, DeSantis abortion, abo...",https://www.youtube.com/@FoxNews,https://www.youtube.com/watch?v=pWpOn6C0YAk,2024-01-09T16:14:58-08:00,PT5M23S,26K,"[-8.914736, 7.5490937, 7.3160815, -4.9427176, ...","[-18.933815, 19.705341, 24.190193, -3.2342658,...","[2.062563, 30.105453, 30.074707, -10.259567, 2...",Ron DeSantis: It is important to stand for a c...,"[0.021789777898420563, 0.054521154698280436, -...","[-1.5506387, 2.4491537, -1.7139022, 0.6070992,...","[-0.11473874, 0.595177, -3.0886812, 1.0771065,...","[0.13954824, -0.03859296, -1.859019, 0.7005172...","[0.009222708, -0.04204326, -0.004863213, 0.026...","[-0.024331383, 0.025316384, -0.0034863118, 0.0...","[-0.017978324, 0.033592585, -0.005582819, 0.07..."
2,2,"Game Theory: Viewers' Choice, Cyborgs, Fatalit...",Your voices have been heard! As thanks for sup...,"Chrono Trigger, Mario, Super Mario, Illusion o...",https://www.youtube.com/@GameTheory,https://www.youtube.com/watch?v=z4QwsHsu3uw,2011-07-06T09:05:29-07:00,PT8M27S,958K,"[-1.1888382, 5.4095435, 0.91301775, 3.1505556,...","[-6.701402, 17.643934, 11.497351, 1.8601605, 6...","[1.9697646, 11.668364, 3.6080647, 0.93656063, ...","Game Theory: Viewers' Choice, Cyborgs, Fatalit...","[0.028186108025626364, 0.04834267842641934, -0...","[-1.3669965, -0.86046076, 1.9244978, 0.7163654...","[-0.37628472, -0.34862334, -3.1958642, -0.6927...","[-1.8910546, -1.4549284, 0.19222069, 1.020602,...","[-0.019393837, -0.0006271716, 0.016845139, 0.0...","[-0.013621625, 0.04849749, 0.01893261, 0.05062...","[-0.028666733, -0.008186309, 0.005269124, 0.05..."
3,3,C-R-O-W-N-E-D - Kirby's Return to Dream Land +...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=iPx1YkOVGKE,2023-04-26T17:51:46-07:00,PT30M1S,1.2K,"[3.8679698, 1.9457426, -0.07090827, 1.4073573,...","[-6.5522676, 12.250958, 23.587572, -11.756236,...","[-1.9883513, 2.3534052, 6.4368353, -2.772706, ...",C-R-O-W-N-E-D - Kirby's Return to Dream Land +...,"[0.09172592824018282, 0.4808903284503428, -0.1...","[-2.6555786, 3.6305854, 2.4743733, 2.0091543, ...","[-0.03246466, -0.8104347, 0.030219333, -0.3588...","[-2.900834, -0.04849205, -0.78084403, -1.53059...","[-0.010565607, 0.02092209, -0.010245121, -0.01...","[0.007729591, 0.003838814, 0.0016227973, 0.104...","[-0.035380017, 0.005370192, -0.059529617, 0.02..."
4,4,Tostarena: Night - Super Mario Odyssey Music E...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=xYY8KI_00tY,2023-06-28T22:01:18-07:00,PT30M2S,3.1K,"[9.088351, 3.2380605, -1.4039601, 5.007807, 2....","[-6.5522676, 12.250958, 23.587572, -11.756236,...","[-1.9883513, 2.3534052, 6.4368353, -2.772706, ...",Tostarena: Night - Super Mario Odyssey Music E...,"[0.09505771803969609, 0.5183311351157022, -0.1...","[-0.59108883, 0.028681166, 2.929658, 1.8105221...","[-0.03246466, -0.8104347, 0.030219333, -0.3588...","[-2.900834, -0.04849205, -0.78084403, -1.53059...","[-0.020172965, 0.031549398, 0.02322162, -0.045...","[0.007729591, 0.003838814, 0.0016227973, 0.104...","[-0.035380017, 0.005370192, -0.059529617, 0.02..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17007,17007,15 Most Dangerous Creatures in Australia,Having been separated from other landmasses ar...,"Australia, Creatures, animals",https://www.youtube.com/@topfives,https://www.youtube.com/watch?v=yrF2zEZzONU,2023-04-02T12:30:03-07:00,PT21M45S,199K,"[0.03892468, 1.3765994, 1.0549036, 3.222323, -...","[-9.911185, 19.195124, 15.653146, -2.4605153, ...","[0.048628185, 0.777444, 0.5377083, -0.24989927...",15 Most Dangerous Creatures in Australia Havin...,"[0.025590237885365658, 0.04114059419848127, -0...","[-1.9602083, -3.55472, -0.3945527, 0.38665667,...","[-1.6225536, 1.2916812, -2.1171246, -0.4127153...","[-2.6629, -2.93736, -1.747624, -0.79012, 3.362...","[-0.0034223883, -0.011237001, -0.008882027, -0...","[-0.009651948, 0.011737424, -0.017385239, 0.05...","[0.004893913, -0.000835652, -0.016947158, 0.10..."
17008,17008,I FINALLY Brought Him Back to Fortnite!,Today I brought back the BOYS to Fortnite Seas...,"fortnite, fortnite battle royale, competitive,...",https://www.youtube.com/@SypherPK,https://www.youtube.com/watch?v=sa0kUgZhhBc,2021-07-31T11:00:24-07:00,PT11M37S,1.2M,"[-4.1728954, 0.8249187, 2.1631951, -0.65770596...","[-21.973852, 11.141667, 17.342066, 2.7240803, ...","[-36.89027, -25.899267, 68.23906, -39.533424, ...",I FINALLY Brought Him Back to Fortnite! Today ...,"[0.8500308116822854, -0.2449982376307785, 0.03...","[0.120314956, 0.98569244, -2.5165224, -3.07832...","[-0.9745164, 1.4345919, -3.2211037, -1.1439137...","[-0.80179644, -1.6077292, -0.25197995, -0.1904...","[0.018937973, 0.016860327, 0.020391785, 0.0167...","[0.003019045, 0.005188585, 0.001199611, 0.0392...","[-0.031987187, -0.0067877197, 0.009353142, 0.0..."
17009,17009,Samsung Galaxy Fold Impressions!,"Samsung Galaxy Fold. Forget the crease, foldin...","Galaxy Fold, Samsung Galaxy Fold, Fold, foldab...",https://www.youtube.com/@mkbhd,https://www.youtube.com/watch?v=0Z8J3axc0oY,2019-04-15T15:15:25-07:00,PT9M1S,5.2M,"[6.6263123, 2.2944014, -1.09031, 0.09204132, 1...","[6.6162806, 11.265955, 7.0389843, -0.7767883, ...","[65.597206, 29.52255, -8.481855, -9.327614, 17...",Samsung Galaxy Fold Impressions! Samsung Galax...,"[0.017138348225193032, 0.036942335382177185, -...","[1.0832908, 1.6207802, -1.544462, 0.8206361, 2...","[-0.3563003, 0.20096889, -0.48135647, 0.467675...","[-1.1736034, 0.90537554, -1.775319, 1.9402494,...","[0.018514559, -0.021889158, 0.029857337, 0.024...","[-0.00952022, 0.012519878, 0.012090147, 0.0516...","[0.010021943, -0.0138607435, 0.007902909, 0.05..."
17010,17010,Facebook Metaverse In Disarray & NFT Sales Are...,PATREON: http://www.patreon.com/yongyeaTWITTER...,"video, sharing, camera phone, video phone, fre...",https://www.youtube.com/@YongYea,https://www.youtube.com/watch?v=PjbFU4L9VgA,2022-07-01T06:00:16-07:00,PT24M28S,366K,"[1.4562414, 2.7552037, 0.8469313, 4.8084846, 1...","[1.6659677, 2.1038742, 5.0325594, -2.6629167, ...","[15.253917, 14.500347, 0.69548845, -8.14942, 1...",Facebook Metaverse In Disarray & NFT Sales Are...,"[0.1206278414048883, 0.32267957055697544, -0.0...","[-0.9380464, -1.1273698, 1.3055894, 0.07360577...","[0.88729, -1.7203048, 6.1775503, 0.94269496, 1...","[-0.9185046, -1.9906908, -1.1067432, -0.106053...","[0.011839648, 0.0065996223, -0.004159878, 0.00...","[-0.031069312, -0.01780277, -0.016309816, 0.08...","[0.047917824, 0.04999085, 0.01913951, 0.133339..."


Now, we're gonna transform the views and duration columns in a way that is useful to us: Numerical values. In the dataset, duration is coded in an unfamiliar format that uses letters, and the views are counted by the thousands, millions or billions.

In [8]:
def convert_views(view_count):
    if 'K' in view_count:
        return float(view_count.replace('K', '').replace(',', '')) * 1000
    elif 'M' in view_count:
        return float(view_count.replace('M', '').replace(',', '')) * 1_000_000
    elif 'B' in view_count:
        return float(view_count.replace('B', '').replace(',', '')) * 1_000_000_000
    else:
        try:
            return float(view_count.replace(',', ''))
        except:
            return np.nan


def convert_video_length(length):
    minutes, seconds = 0, 0

    try:
        # Extract minutes
        if 'M' in length:
            minutes = int(length.split('M')[0][2:])

        # Extract seconds
        if 'S' in length:
            seconds = int(length.split('S')[0][-2:])

        # Calculate total seconds
        total_seconds = minutes * 60 + seconds
        return total_seconds
    
    except:
        return np.nan

In [9]:
dfvideos['views'] = dfvideos['views'].apply(convert_views)
dfvideos['length'] = dfvideos['length'].apply(convert_video_length)
dfvideos

Unnamed: 0.1,Unnamed: 0,title,description,keywords,channel_links,video,date,length,views,title_vectors,description_vectors,keywords_vectors,combined_text,tfidf_vector
0,0,I got the Fortnite Only Up WORLD RECORD! (Spee...,⬆️ PLAY MY ONLY UP MAP NOW!! ► 5264-1761-9807❤...,"video, sharing, camera phone, video phone, fre...",https://www.youtube.com/@TGplays,https://www.youtube.com/watch?v=4HlBgHmknY4,2023-07-12T19:36:52-07:00,1195.0,1200000.0,"[-1.6795774, 1.4490157, 6.178439, 4.588057, -4...","[-13.2073765, 2.4838564, 13.264118, 6.486306, ...","[8.519925, 16.488422, 2.7007475, -11.303694, 7...",I got the Fortnite Only Up WORLD RECORD! (Spee...,"[0.36071593802540525, 0.10303499587195948, -0...."
1,1,Ron DeSantis: It is important to stand for a c...,2024 GOP presidential candidate Gov. Ron DeSan...,"DeSantis, Ron DeSantis, DeSantis abortion, abo...",https://www.youtube.com/@FoxNews,https://www.youtube.com/watch?v=pWpOn6C0YAk,2024-01-09T16:14:58-08:00,323.0,26000.0,"[-8.695058, 1.9177513, 9.501862, -6.849505, -0...","[-23.187578, 13.495493, 26.062983, -8.764625, ...","[-14.120467, 35.019222, 31.265934, -13.492206,...",Ron DeSantis: It is important to stand for a c...,"[0.02178977789842047, 0.05452115469828015, -0...."
2,2,"Game Theory: Viewers' Choice, Cyborgs, Fatalit...",Your voices have been heard! As thanks for sup...,"Chrono Trigger, Mario, Super Mario, Illusion o...",https://www.youtube.com/@GameTheory,https://www.youtube.com/watch?v=z4QwsHsu3uw,2011-07-06T09:05:29-07:00,507.0,958000.0,"[-2.5269418, 6.9454813, 2.7381387, 4.648952, 1...","[-11.6576605, 15.131871, 20.671322, -5.504745,...","[-0.123255976, 10.845466, 6.976281, 0.43738675...","Game Theory: Viewers' Choice, Cyborgs, Fatalit...","[0.0281861080256261, 0.048342678426418935, -0...."
3,3,C-R-O-W-N-E-D - Kirby's Return to Dream Land +...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=iPx1YkOVGKE,2023-04-26T17:51:46-07:00,,1200.0,"[2.8032727, 1.880723, -0.8495823, 4.3075924, -...","[-7.4030027, 7.9482284, 23.360329, -10.031336,...","[-0.84046376, 0.9997828, 5.7130127, -2.5322502...",C-R-O-W-N-E-D - Kirby's Return to Dream Land +...,"[0.09172592824018294, 0.4808903284503385, -0.1..."
4,4,Tostarena: Night - Super Mario Odyssey Music E...,MY LINKS:●Main channel: https://www.youtube.co...,"music, extended, ost",https://www.youtube.com/@AacroXtensions,https://www.youtube.com/watch?v=xYY8KI_00tY,2023-06-28T22:01:18-07:00,,3100.0,"[7.0866675, 3.2768304, -3.1805847, 6.8738713, ...","[-7.4030027, 7.9482284, 23.360329, -10.031336,...","[-0.84046376, 0.9997828, 5.7130127, -2.5322502...",Tostarena: Night - Super Mario Odyssey Music E...,"[0.09505771803969633, 0.5183311351156975, -0.1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17007,17007,15 Most Dangerous Creatures in Australia,Having been separated from other landmasses ar...,"Australia, Creatures, animals",https://www.youtube.com/@topfives,https://www.youtube.com/watch?v=yrF2zEZzONU,2023-04-02T12:30:03-07:00,1305.0,199000.0,"[-0.37331486, 1.02241, 2.1009853, 2.4208696, -...","[-14.026737, 10.74867, 24.69941, -12.685656, 1...","[-0.07851909, 0.64927053, 0.62222964, -0.31127...",15 Most Dangerous Creatures in Australia Havin...,"[0.02559023788536576, 0.04114059419848119, -0...."
17008,17008,I FINALLY Brought Him Back to Fortnite!,Today I brought back the BOYS to Fortnite Seas...,"fortnite, fortnite battle royale, competitive,...",https://www.youtube.com/@SypherPK,https://www.youtube.com/watch?v=sa0kUgZhhBc,2021-07-31T11:00:24-07:00,697.0,1200000.0,"[-3.0663505, 1.149898, 4.157325, 1.02996, -1.5...","[-20.11136, 5.561459, 25.665005, -0.48888242, ...","[-3.42567, -46.744816, 58.472916, 15.307017, -...",I FINALLY Brought Him Back to Fortnite! Today ...,"[0.8500308116822851, -0.24499823763078035, 0.0..."
17009,17009,Samsung Galaxy Fold Impressions!,"Samsung Galaxy Fold. Forget the crease, foldin...","Galaxy Fold, Samsung Galaxy Fold, Fold, foldab...",https://www.youtube.com/@mkbhd,https://www.youtube.com/watch?v=0Z8J3axc0oY,2019-04-15T15:15:25-07:00,,5200000.0,"[4.68511, 3.6033359, -1.0770307, -0.9587405, -...","[1.778667, 12.182019, 11.117801, -4.256032, -2...","[43.34031, 41.584156, -4.1926985, -23.19965, 8...",Samsung Galaxy Fold Impressions! Samsung Galax...,"[0.017138348225193063, 0.03694233538217726, -0..."
17010,17010,Facebook Metaverse In Disarray & NFT Sales Are...,PATREON: http://www.patreon.com/yongyeaTWITTER...,"video, sharing, camera phone, video phone, fre...",https://www.youtube.com/@YongYea,https://www.youtube.com/watch?v=PjbFU4L9VgA,2022-07-01T06:00:16-07:00,1468.0,366000.0,"[1.0633997, 5.568253, 2.2949955, 7.4735384, 1....","[1.6599969, 1.531536, 3.6590078, -1.4927975, 2...","[8.519925, 16.488422, 2.7007475, -11.303694, 7...",Facebook Metaverse In Disarray & NFT Sales Are...,"[0.12062784140488901, 0.32267957055697427, -0...."


In [1]:
# Dropping null columns

dfvideos.dropna(inplace= True)

NameError: name 'dfvideos' is not defined

Now, since we only have the channel links, we're gonna trim them down to only include the youtube handle of the user. This will be used later.

In [11]:
dfvideos['channel_links'] = dfvideos['channel_links'].apply(lambda x: x[24:])

In [12]:

dfvideos['date'] = pd.to_datetime(dfvideos['date'], utc=True)

dfvideos['year'] = dfvideos['date'].dt.year
dfvideos['month'] = dfvideos['date'].dt.month
dfvideos['day'] = dfvideos['date'].dt.day
dfvideos['hour'] = dfvideos['date'].dt.hour
dfvideos['minute'] = dfvideos['date'].dt.minute


dfvideos['day_of_week'] = dfvideos['date'].dt.dayofweek

dfvideos


Unnamed: 0.1,Unnamed: 0,title,description,keywords,channel_links,video,date,length,views,title_vectors,description_vectors,keywords_vectors,combined_text,tfidf_vector,year,month,day,hour,minute,day_of_week
0,0,I got the Fortnite Only Up WORLD RECORD! (Spee...,⬆️ PLAY MY ONLY UP MAP NOW!! ► 5264-1761-9807❤...,"video, sharing, camera phone, video phone, fre...",@TGplays,https://www.youtube.com/watch?v=4HlBgHmknY4,2023-07-13 02:36:52+00:00,1195.0,1200000.0,"[-1.6795774, 1.4490157, 6.178439, 4.588057, -4...","[-13.2073765, 2.4838564, 13.264118, 6.486306, ...","[8.519925, 16.488422, 2.7007475, -11.303694, 7...",I got the Fortnite Only Up WORLD RECORD! (Spee...,"[0.36071593802540525, 0.10303499587195948, -0....",2023,7,13,2,36,3
1,1,Ron DeSantis: It is important to stand for a c...,2024 GOP presidential candidate Gov. Ron DeSan...,"DeSantis, Ron DeSantis, DeSantis abortion, abo...",@FoxNews,https://www.youtube.com/watch?v=pWpOn6C0YAk,2024-01-10 00:14:58+00:00,323.0,26000.0,"[-8.695058, 1.9177513, 9.501862, -6.849505, -0...","[-23.187578, 13.495493, 26.062983, -8.764625, ...","[-14.120467, 35.019222, 31.265934, -13.492206,...",Ron DeSantis: It is important to stand for a c...,"[0.02178977789842047, 0.05452115469828015, -0....",2024,1,10,0,14,2
2,2,"Game Theory: Viewers' Choice, Cyborgs, Fatalit...",Your voices have been heard! As thanks for sup...,"Chrono Trigger, Mario, Super Mario, Illusion o...",@GameTheory,https://www.youtube.com/watch?v=z4QwsHsu3uw,2011-07-06 16:05:29+00:00,507.0,958000.0,"[-2.5269418, 6.9454813, 2.7381387, 4.648952, 1...","[-11.6576605, 15.131871, 20.671322, -5.504745,...","[-0.123255976, 10.845466, 6.976281, 0.43738675...","Game Theory: Viewers' Choice, Cyborgs, Fatalit...","[0.0281861080256261, 0.048342678426418935, -0....",2011,7,6,16,5,2
5,5,How Super Mario Kart was Created,"In the world of the local multiplayer game, on...","gaming, video games, Nintendo, Super Mario, Ma...",@ThomasGameDocs,https://www.youtube.com/watch?v=Qu27yfItsSw,2018-11-30 11:34:07+00:00,576.0,332000.0,"[3.3885865, 2.6789274, -1.7763621, 2.6008675, ...","[-11.602807, 22.244604, 24.447556, 1.8333958, ...","[4.108917, 14.299869, -0.7578187, 0.4368142, 5...",How Super Mario Kart was Created In the world ...,"[0.06658263443937545, 0.14046179739737427, -0....",2018,11,30,11,34,4
6,6,Film Theory: The Scary Monsters Living Under Y...,Subscribe to never miss a Theory! ► http://bit...,"the descent, monsters, the descent 2, the desc...",@FilmTheory,https://www.youtube.com/watch?v=aIKNqE7mpv8,2020-06-25 21:08:58+00:00,836.0,3300000.0,"[-0.2099945, 6.942519, 2.2158422, 4.463892, 1....","[-16.648933, 12.353871, 23.819704, -14.266366,...","[-19.513763, 35.493492, 56.287556, -17.102392,...",Film Theory: The Scary Monsters Living Under Y...,"[0.019829686486974172, 0.03656216083497131, -0...",2020,6,25,21,8,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17006,17006,Huge Fake Bear Conspiracy,This is the greatest sun bear of All TimeMerch...,"video, sharing, camera phone, video phone, fre...",@penguinz0,https://www.youtube.com/watch?v=_Mo-UJl5yoQ,2023-08-02 00:00:13+00:00,550.0,2300000.0,"[0.034584038, 0.94406486, 0.70802, -0.07303694...","[0.26178515, 7.8868046, 7.810991, -6.1481094, ...","[8.519925, 16.488422, 2.7007475, -11.303694, 7...",Huge Fake Bear Conspiracy This is the greatest...,"[0.11718101741847348, 0.25154776013797636, -0....",2023,8,2,0,0,2
17007,17007,15 Most Dangerous Creatures in Australia,Having been separated from other landmasses ar...,"Australia, Creatures, animals",@topfives,https://www.youtube.com/watch?v=yrF2zEZzONU,2023-04-02 19:30:03+00:00,1305.0,199000.0,"[-0.37331486, 1.02241, 2.1009853, 2.4208696, -...","[-14.026737, 10.74867, 24.69941, -12.685656, 1...","[-0.07851909, 0.64927053, 0.62222964, -0.31127...",15 Most Dangerous Creatures in Australia Havin...,"[0.02559023788536576, 0.04114059419848119, -0....",2023,4,2,19,30,6
17008,17008,I FINALLY Brought Him Back to Fortnite!,Today I brought back the BOYS to Fortnite Seas...,"fortnite, fortnite battle royale, competitive,...",@SypherPK,https://www.youtube.com/watch?v=sa0kUgZhhBc,2021-07-31 18:00:24+00:00,697.0,1200000.0,"[-3.0663505, 1.149898, 4.157325, 1.02996, -1.5...","[-20.11136, 5.561459, 25.665005, -0.48888242, ...","[-3.42567, -46.744816, 58.472916, 15.307017, -...",I FINALLY Brought Him Back to Fortnite! Today ...,"[0.8500308116822851, -0.24499823763078035, 0.0...",2021,7,31,18,0,5
17010,17010,Facebook Metaverse In Disarray & NFT Sales Are...,PATREON: http://www.patreon.com/yongyeaTWITTER...,"video, sharing, camera phone, video phone, fre...",@YongYea,https://www.youtube.com/watch?v=PjbFU4L9VgA,2022-07-01 13:00:16+00:00,1468.0,366000.0,"[1.0633997, 5.568253, 2.2949955, 7.4735384, 1....","[1.6599969, 1.531536, 3.6590078, -1.4927975, 2...","[8.519925, 16.488422, 2.7007475, -11.303694, 7...",Facebook Metaverse In Disarray & NFT Sales Are...,"[0.12062784140488901, 0.32267957055697427, -0....",2022,7,1,13,0,4


Now, using MultiLabelBinarizer, I transform the channel names into matrixes that function as classes for our regression.

In [13]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

channel_matrix = mlb.fit_transform(dfvideos['channel_links'].str.split())

dfvideos['channel_vector'] = channel_matrix.tolist()

dfvideos['channel_vector']


0        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
5        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
                               ...                        
17006    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17007    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17008    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17010    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
17011    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
Name: channel_vector, Length: 13334, dtype: object

Now, preparing the features that will be used. Simply selecting the numerical features for X and the target for y.

In [47]:
features = dfvideos.columns[9:]

X = dfvideos.dropna()[features].drop(columns='combined_text')
y = dfvideos.dropna()["views"]

Since the columns represent entire vectors, we have to "unpack" them as features for x, where each column of the vectors become columns in the dataframe instead.

In [51]:
vector_columns = ['title_vectors', 'description_vectors', 'keywords_vectors', 'channel_vector', 'tfidf_vector',]
                  # 'title_fasttext', 'description_fasttext', 'keywords_fasttext', 'title_glove', 'description_glove', 'keywords_glove']

for col in vector_columns:
    print(col)
    rows = pd.DataFrame(list(X[col].values))
    rows.columns = rows.columns.astype(str) + "_" + col
    X = pd.concat([X, rows], axis= 1)

X.drop(columns= vector_columns, inplace= True)

X


title_vectors
description_vectors
keywords_vectors
channel_vector
tfidf_vector
title_fasttext
description_fasttext
keywords_fasttext
title_glove
description_glove
keywords_glove


Unnamed: 0,year,month,day,hour,minute,day_of_week,0_title_vectors,1_title_vectors,2_title_vectors,3_title_vectors,...,290_keywords_glove,291_keywords_glove,292_keywords_glove,293_keywords_glove,294_keywords_glove,295_keywords_glove,296_keywords_glove,297_keywords_glove,298_keywords_glove,299_keywords_glove
0,2023,7,13,2,36,3,-1.679577,1.449016,6.178439,4.588057,...,0.521308,-1.739232,0.159932,-3.166825,-1.465603,1.512945,1.033550,-1.237042,-1.486300,-1.367467
1,2024,1,10,0,14,2,-8.695058,1.917751,9.501862,-6.849505,...,0.047307,-2.141437,-0.086640,-2.433806,-1.290793,1.055457,3.134836,1.865578,-1.743173,0.334735
2,2011,7,6,16,5,2,-2.526942,6.945481,2.738139,4.648952,...,0.333862,-2.064317,-0.252769,-0.895644,-1.429959,1.200893,1.252971,-0.641076,-0.626000,0.021667
3,2018,11,30,11,34,4,3.388587,2.678927,-1.776362,2.600868,...,1.025264,-2.678861,-1.235668,0.195949,-1.372418,0.725798,1.802477,-1.541324,-0.723837,-0.762619
4,2020,6,25,21,8,3,-0.209994,6.942519,2.215842,4.463892,...,1.015643,-1.812851,0.218364,-1.053538,-1.302648,1.039712,1.695771,-1.265315,-1.991667,-0.028532
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13329,2023,8,2,0,0,2,0.034584,0.944065,0.708020,-0.073037,...,0.521308,-1.739232,0.159932,-3.166825,-1.465603,1.512945,1.033550,-1.237042,-1.486300,-1.367467
13330,2023,4,2,19,30,6,-0.373315,1.022410,2.100985,2.420870,...,1.766632,-2.121922,-0.758560,-1.891864,-1.586520,0.870102,1.052715,-0.605248,-3.021980,0.742946
13331,2021,7,31,18,0,5,-3.066350,1.149898,4.157325,1.029960,...,1.107978,-1.567314,-0.457945,-0.217118,-1.346896,0.844148,1.622319,-1.506048,-1.912868,0.153523
13332,2022,7,1,13,0,4,1.063400,5.568253,2.294996,7.473538,...,0.521308,-1.739232,0.159932,-3.166825,-1.465603,1.512945,1.033550,-1.237042,-1.486300,-1.367467


Preparing a scaler. This turned out not to be useful after experimentation, but it can be in certain situations. 

In [52]:
from sklearn.preprocessing import StandardScaler

# Scale your features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Now, testing a Ridge model. We're gonna test multiple models to check which performs better.

In [54]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=2.5)


model.fit(X_train, y_train)


y_pred = model.predict(X_test)


mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {np.sqrt(mse)}')
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)


Mean Squared Error: 2462209.035853819
Coefficients: [ 58950.73175672  -2512.02291562   2213.19864087 ...  57160.4377328
 149396.87816722 177370.73193752]
Intercept: -117511131.28762428


Since we're dealing with very big numbers, the mean squared error (Which is exactly what it sounds like) is misleading. A better way to measure the performance of the model is using R². R² Measures how much of the variation in the data can be captured in the model.

In [55]:
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
r_squared

0.6109203429096104

Not quite great yet. We can do better! Let's try RandomForest.

In [56]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)


rf_model = RandomForestRegressor(n_estimators=200, criterion="poisson", random_state=42, verbose= 2, n_jobs= 10)

rf_model.fit(X_train, y_train)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.


building tree 1 of 200
building tree 2 of 200
building tree 3 of 200
building tree 4 of 200
building tree 5 of 200
building tree 6 of 200
building tree 7 of 200
building tree 8 of 200
building tree 9 of 200
building tree 10 of 200
building tree 11 of 200
building tree 12 of 200
building tree 13 of 200
building tree 14 of 200
building tree 15 of 200
building tree 16 of 200
building tree 17 of 200
building tree 18 of 200
building tree 19 of 200
building tree 20 of 200
building tree 21 of 200
building tree 22 of 200
building tree 23 of 200
building tree 24 of 200
building tree 25 of 200
building tree 26 of 200
building tree 27 of 200
building tree 28 of 200
building tree 29 of 200
building tree 30 of 200
building tree 31 of 200


[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed:  6.0min


building tree 32 of 200
building tree 33 of 200
building tree 34 of 200
building tree 35 of 200
building tree 36 of 200
building tree 37 of 200
building tree 38 of 200
building tree 39 of 200
building tree 40 of 200
building tree 41 of 200
building tree 42 of 200
building tree 43 of 200
building tree 44 of 200
building tree 45 of 200
building tree 46 of 200
building tree 47 of 200
building tree 48 of 200
building tree 49 of 200
building tree 50 of 200
building tree 51 of 200
building tree 52 of 200
building tree 53 of 200
building tree 54 of 200
building tree 55 of 200
building tree 56 of 200
building tree 57 of 200
building tree 58 of 200
building tree 59 of 200
building tree 60 of 200
building tree 61 of 200
building tree 62 of 200
building tree 63 of 200
building tree 64 of 200
building tree 65 of 200
building tree 66 of 200
building tree 67 of 200
building tree 68 of 200
building tree 69 of 200
building tree 70 of 200
building tree 71 of 200
building tree 72 of 200
building tree 73

[Parallel(n_jobs=10)]: Done 142 tasks      | elapsed: 32.5min


building tree 153 of 200
building tree 154 of 200
building tree 155 of 200
building tree 156 of 200
building tree 157 of 200
building tree 158 of 200
building tree 159 of 200
building tree 160 of 200
building tree 161 of 200
building tree 162 of 200
building tree 163 of 200
building tree 164 of 200
building tree 165 of 200
building tree 166 of 200
building tree 167 of 200
building tree 168 of 200
building tree 169 of 200
building tree 170 of 200
building tree 171 of 200
building tree 172 of 200
building tree 173 of 200
building tree 174 of 200
building tree 175 of 200
building tree 176 of 200
building tree 177 of 200
building tree 178 of 200
building tree 179 of 200
building tree 180 of 200
building tree 181 of 200
building tree 182 of 200
building tree 183 of 200
building tree 184 of 200
building tree 185 of 200
building tree 186 of 200
building tree 187 of 200
building tree 188 of 200
building tree 189 of 200
building tree 190 of 200
building tree 191 of 200
building tree 192 of 200


[Parallel(n_jobs=10)]: Done 200 out of 200 | elapsed: 45.1min finished


In [58]:
r_squared = r2_score(y_test, y_pred)
r_squared

0.6547290190616355

Much better! In experimentation, I was able to reach a 0.67 value, unfortunately, I lost the code. However, the fact that I got so far makes me think that with even more data, we might be able to reach r² > 0.7! Upon further scrapping, I will be re-testing the algorithms here and update the code accordingly.



The Random forest regressor wielded the best result. But the following models were also investigated. Notice that the xgboost model came extremely close to the RandomForest performance. This may indicate that ensemble models are more effective for this task.

In [79]:
from sklearn.ensemble import HistGradientBoostingRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb_model = HistGradientBoostingRegressor(max_iter= 50, verbose= 2, loss="poisson")

gb_model.fit(X_train, y_train)

Binning 0.412 GB of training data: 6.216 s
Binning 0.046 GB of validation data: 0.067 s
Fitting gradient boosted rounds:
[1/50] 1 tree, 31 leaves, max depth = 15, train loss: -29885168.74738, val loss: -31240085.31271, in 0.715s
[2/50] 1 tree, 31 leaves, max depth = 15, train loss: -30140340.18978, val loss: -31447707.27735, in 0.786s
[3/50] 1 tree, 31 leaves, max depth = 17, train loss: -30303804.58635, val loss: -31582107.81894, in 0.762s
[4/50] 1 tree, 31 leaves, max depth = 15, train loss: -30442953.61509, val loss: -31697732.73749, in 0.870s
[5/50] 1 tree, 31 leaves, max depth = 11, train loss: -30546666.51338, val loss: -31789035.31585, in 0.841s
[6/50] 1 tree, 31 leaves, max depth = 13, train loss: -30633516.72575, val loss: -31859475.35777, in 0.772s
[7/50] 1 tree, 31 leaves, max depth = 11, train loss: -30707187.65129, val loss: -31918742.78210, in 0.710s
[8/50] 1 tree, 31 leaves, max depth = 11, train loss: -30771768.79730, val loss: -31966155.88231, in 0.742s
[9/50] 1 tree, 

In [81]:
r_squared = r2_score(y_test, y_pred)
r_squared

0.5998088006616161

In [27]:

import xgboost as xgb


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an XGBoost regression model
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                          max_depth = 5, alpha = 10, n_estimators = 100, verbosity = 2, random_state=42)

# Fit the model to the training data
xg_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xg_reg.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
print(f"R² Score (XGBoost): {r2}")

R² Score (XGBoost): 0.6509721624862198


In [28]:
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
r_squared

0.6509721624862198

In [52]:
import lightgbm as lgb


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a LightGBM regression model
lgb_reg = lgb.LGBMRegressor(objective='regression', num_leaves=31, learning_rate=0.05, n_estimators=100)

# Fit the model to the training data
lgb_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lgb_reg.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
print(f"R² Score (LightGBM): {r2}")

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.867059 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1351600
[LightGBM] [Info] Number of data points in the train set: 10667, number of used features: 5350
[LightGBM] [Info] Start training from score 2174805.015937
R² Score (LightGBM): 0.6191037017547409


## Conclusion:

In this project, we used youtube scrapped data to try and predict the number of views a video will receive considering the information present in the dataset. Since the information is mostly textual, we had to use a number of natural language processing techniques to get it to a usable format. 

In the end, we found that:

- The best performing regression model was the RandomForestRegressor.
- Ensemble models seem to work better in this dataset, and they seem necessary to obtain a satisfactory performance.
- The maximum r² reproduced was ~6.5.
- More data may be useful to increase the r² value.

In the end, this project still has room to grow. But it could already be a useful tool for youtube channels. Once a higher r² value is obtained, I will also prepare an analysis of the dataset, hopefully producing useful information for those interested in knowing how the youtube algorithm works.