Imports

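This cell imports the modules used throughout the notebook.

We import regex ("re"), which the string cleaner uses to transform our strings. We also import truediv from operator, although nothing in the notebook uses it.
We import numpy and pandas to load and handle the data set.

Each import is commented to show what it is for.

The line that sets "device" checks if a GPU is accessible, and if it is, it will be used. This is because encoding on a GPU is a lot faster than on a CPU. If no GPU is available, a smaller sample of the data set is used later on. The final try/except block tries to load the sentence transformer model from a pickle file, and only downloads and pickles the model if that fails.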

In [9]:
#operator.truediv (imported but not used in this notebook)
from operator import truediv

#regex
import re

#pandas and numpy
import numpy as np
import pandas as pd

#setting pandas options
pd.set_option('display.max_colwidth', 200)

#storing and loading models
import pickle

#to set types for functions
from typing import Tuple

#plots
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

#gpu debug
import torch

#setting device to use GPU for NLP backend if you have GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

#SBERT
from sentence_transformers import SentenceTransformer

#UMAP - Used to reduce dimensionality from 700+ dimensional arrays to a 2d array
from umap import UMAP

#HDBSCAN
from hdbscan import HDBSCAN

#topic finding
from sklearn.feature_extraction.text import TfidfVectorizer

#loading model from pickle if possible, to avoid downloading it again
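try:
    with open(f'model-{device}.pkl', 'rb') as f:
        model = pickle.load(f)
    model_load = True
except Exception:
    #no usable pickle found, so download the model and cache it for next time
    model = SentenceTransformer('all-mpnet-base-v2', device=device)
    with open(f'model-{device}.pkl', 'wb') as f:
        pickle.dump(model, f)
    model_load = False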

print(f"""
GPUS detected:          {torch.cuda.device_count()}
Using GPU:              {torch.cuda.is_available()}
Device:                 {device}
Got model from pickle:  {model_load}""")


GPUS detected:          1
Using GPU:              True
Device:                 cuda
Got model from pickle:  True
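
The load-or-create pattern used for the model can be tried in isolation. This is a minimal sketch, not the notebook's code: the helper name "load_or_create" and the cache file location are made up for illustration, and a plain dictionary stands in for the SentenceTransformer model.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, Tuple


def load_or_create(path: str, factory: Callable[[], Any]) -> Tuple[Any, bool]:
    """Return (object, loaded_from_pickle). Hypothetical helper for illustration."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f), True
    except (OSError, EOFError, pickle.UnpicklingError):
        # No usable cache file: build the object and cache it for next time
        obj = factory()
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return obj, False


# A dictionary stands in for the expensive-to-download model
cache_path = os.path.join(tempfile.mkdtemp(), "model-cpu.pkl")
first, first_from_pickle = load_or_create(cache_path, lambda: {"name": "stand-in model"})
second, second_from_pickle = load_or_create(cache_path, lambda: {"name": "stand-in model"})
print(first_from_pickle, second_from_pickle)
```

The first call finds no file and creates one, and the second call reads it back, which matches the "Got model from pickle" flag printed above.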


Functions:
A function to find the most relevant words is defined. For each cluster it takes the list of titles, fits a TfidfVectorizer on them, sums the TF-IDF weight of every word over all titles in the cluster and sorts the words by that summed weight in descending order. The num_words highest-ranked words (5 by default) are then appended to the list of most relevant words, so the function returns one array of words per cluster. The inputs are described in the docstring of the function.

In [10]:
def tfidf_most_relevant_word(input: list, num_words=5) -> list:
    """
    Function that finds the most relevant words per cluster id.

    Args:
        input (list): A list of title strings aggregated by cluster id.
        num_words (int, optional): How many words you want. Defaults to 5.

    Returns:
        list: A list of the most relevant words, with one entry per unique cluster id.
    """
    most_relevant_words = []

    for corpus in input:
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform(corpus)

        importance = np.argsort(np.asarray(X.sum(axis=0)).ravel())[::-1]
        tfidf_feature_names = np.array(vectorizer.get_feature_names_out()) #get_feature names
        most_relevant_words.append(tfidf_feature_names[importance[:num_words]])
    return most_relevant_words

Cleaning
Defining a function to clean up the input strings. It starts by turning the string to lower case, and then uses regex to remove punctuation and every other character that is neither a word character (letter, digit or underscore) nor whitespace. It then returns the "cleaned" input string.

In [11]:
#Function to clean up strings. 
def string_cleaner(input: str) -> str:
    #starts by turning it to lowercase
    input = input.lower()

    #removing punctuation and other non-alphanumeric characters with regex
    input = re.sub(r'[^\w\s]', '',input)

    return input
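
A short check of what the cleaner does to a real title from the data set. The function is copied from the cell above; "collapse_spaces" is a hypothetical extra step that is not part of the notebook, shown only because removing a character that sits between two spaces leaves a double space behind.

```python
import re


def string_cleaner(input: str) -> str:
    # Same two steps as the notebook's function: lowercase, then strip punctuation
    input = input.lower()
    return re.sub(r'[^\w\s]', '', input)


def collapse_spaces(text: str) -> str:
    # Hypothetical extra step, not in the notebook: squeeze runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()


title = "Racist Superman | Rudy Mancuso, King Bach & Lele Pons"
cleaned = string_cleaner(title)
print(repr(cleaned))                   # 'racist superman  rudy mancuso king bach  lele pons'
print(repr(collapse_spaces(cleaned)))  # 'racist superman rudy mancuso king bach lele pons'
print(repr(string_cleaner("snake_case & CamelCase!")))  # 'snake_case  camelcase'
```

The last line shows that underscores survive, because the regex class \w counts them as word characters.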

Topic Modeling:
Defining our own functions for topic modeling. The first function is the same TF-IDF function as above, redefined here without the docstring: it vectorizes the titles of each cluster with TfidfVectorizer and returns the most relevant words. The second function maps topics to cluster ids. Comments are included to explain the functions.

In [12]:
#Function that finds the most relevant words per cluster id.
def tfidf_most_relevant_word(input: list, num_words=5) -> list:
    
    most_relevant_words = []

    for corpus in input:
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform(corpus)

        importance = np.argsort(np.asarray(X.sum(axis=0)).ravel())[::-1]
        tfidf_feature_names = np.array(vectorizer.get_feature_names_out()) #get feature names
        most_relevant_words.append(tfidf_feature_names[importance[:num_words]])
    return most_relevant_words

#Function that maps topics to cluster ids. Takes the dataframe as input and returns a dictionary with cluster ids as keys and topics as values
def topic_by_clusterId(result: pd.DataFrame) -> dict:

    #print(result.isna().sum())

    df_group = result[["titles","cluster_label"]].groupby("cluster_label").agg(list).reset_index()

    df_group["topics"] = tfidf_most_relevant_word(df_group["titles"])

    return dict(zip(df_group.cluster_label, df_group.topics))
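
The grouping step in topic_by_clusterId can be followed on a tiny dataframe. This is a sketch with made-up titles and labels; to keep it short, the TF-IDF step is replaced by a stand-in that takes the first word of the first title in each cluster.

```python
import pandas as pd

# Made-up rows: three titles in cluster 0, two in cluster 1, one outlier (-1)
result = pd.DataFrame({
    "titles": ["nfl highlights", "nfl game recap", "iphone x review",
               "nfl top plays", "apple animoji demo", "random vlog"],
    "cluster_label": [0, 0, 1, 0, 1, -1],
})

# One row per cluster label, with all titles of that cluster collected in a list
df_group = result[["titles", "cluster_label"]].groupby("cluster_label").agg(list).reset_index()

# Stand-in for the TF-IDF step: first word of the first title in each cluster
df_group["topics"] = [titles[0].split()[0] for titles in df_group["titles"]]

topic_dict = dict(zip(df_group.cluster_label, df_group.topics))
print(df_group)
print(topic_dict)
```

Note that the outliers (label -1) form a group of their own, which is why they also get topic words in the real function.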


Plotting Functions:
Functions in the two following code blocks build the result dataframe, split it into clustered points and outliers, make scatter traces of both and arrange the traces in a figure with subplots. Comments are included to explain the functions.

In [13]:
#Function to build the result dataframe: 2D coordinates, titles, cluster labels and the topic words of each cluster
def result_df_maker(embeddings: np.ndarray, cluster_labels: np.ndarray, titles: np.ndarray) -> pd.DataFrame:

    result = pd.DataFrame(embeddings, columns=['x','y'])

    result["titles"] = titles
    
    result["cluster_label"] = cluster_labels

    topic_dict = topic_by_clusterId(result)

    result["topics"] = result["cluster_label"].apply(lambda x: topic_dict[x])

    result["topics"] = result["topics"].apply(lambda x: " ".join(x))

    return result

#Function to split the dataframe into two dataframes, one for clustered and one for outliers
def result_splitter(result: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    clustered = result.loc[result.cluster_label != -1, :]

    outliers = result.loc[result.cluster_label == -1, :]

    return clustered, outliers

#Function to make scatter traces of the clustered and outliers
def result_tracer(clustered: pd.DataFrame, outliers: pd.DataFrame) -> Tuple[go.Scattergl, go.Scattergl]:

    trace_cluster = go.Scattergl(
        x=clustered.x,
        y=clustered.y,
        mode="markers",
        name="Clustered",

        #styling markers
        marker=dict(
            size=2,
            color=clustered.cluster_label,
            colorscale="Rainbow"
        ),
        #setting hover text to the topic words and cluster id of each point
        hovertemplate = "<b>Topics:</b> %{customdata[0]} <br><b>Cluster Id:</b> %{customdata[1]}<extra></extra>",
        customdata=np.column_stack([clustered.topics, clustered.cluster_label])
    )

    trace_outlier = go.Scattergl(
        x=outliers.x,
        y=outliers.y,
        mode="markers",
        name="Outliers",

        marker=dict(
            size=1,
            color="grey"
        ),

        hovertemplate="Outlier<extra></extra>"
    )
    return trace_cluster, trace_outlier

#Function to make a scatter trace of the clustered and outliers.
def result_tracer_wrapper(uembs: np.ndarray, cluster_labels: np.ndarray, titles: np.ndarray) -> Tuple[go.Scattergl, go.Scattergl]:
    result = result_df_maker(uembs, cluster_labels, titles)
    clustered, outliers = result_splitter(result)
    trace_cluster, trace_outlier = result_tracer(clustered, outliers)
    return trace_cluster, trace_outlier

In [14]:
#Function to make a figure with subplots of the clustered and outliers
def subplotter(trace_nested_list: list, titles: list, base_size=1000) -> go.Figure:
    row_count = len(trace_nested_list)
    col_count = len(trace_nested_list[0])

    fig = make_subplots(
        rows=row_count,
        cols=col_count,
        subplot_titles=(titles),
        vertical_spacing = 0.02,
        horizontal_spacing= 0.02
    )

    for i, row in enumerate(trace_nested_list):
        for j, col in enumerate(row):
            #adding both outliers and clustered
            for trace in col:
                fig.add_trace(trace, row=i+1, col=j+1)
    fig.update_xaxes(visible = False)
    fig.update_yaxes(visible = False)

    fig.update_layout(width = base_size*col_count, height=base_size*row_count, plot_bgcolor='rgba(250,250,250,1)')

    return fig
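
The nested list passed to subplotter has three levels: rows, subplots within a row, and traces within a subplot. The sketch below walks such a list in the same loop order and records where each trace belongs, using 1-based positions like plotly's row and col arguments. The helper "grid_positions" is hypothetical, and strings stand in for the go.Scattergl traces.

```python
from typing import List, Tuple


def grid_positions(trace_nested_list: list) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: list (row, col, trace) for every trace, 1-based like plotly."""
    positions = []
    for i, row in enumerate(trace_nested_list):
        for j, col in enumerate(row):
            for trace in col:
                positions.append((i + 1, j + 1, trace))
    return positions


# Strings stand in for go.Scattergl traces; one row with two subplots
nested = [[["clustered A", "outliers A"], ["clustered B", "outliers B"]]]
for position in grid_positions(nested):
    print(position)
```

With one row and two subplots, the "A" traces land in column 1 and the "B" traces in column 2.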

Saving / Showing Plots:
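Function to save and show the figure created. It writes an .html and a .png file to ../figures, using the filename given as input. It also takes the boolean argument "show", and if it is true, as is the default, it will also show the plotly figure which it saved. Note that writing the .png file requires a static image export backend for plotly, such as kaleido.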

In [15]:
#Function to show and save a figure
def fig_show_save(fig: go.Figure, filename: str, show=True):
    fig.write_html(f"../figures/{filename}.html")
    fig.write_image(f"../figures/{filename}.png")

    if show:
        fig.show()
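
Both write calls expect the ../figures folder to exist already. The sketch below shows one way to build the two paths and create the folder first. The helper "figure_paths" is hypothetical and not part of the notebook, and a temporary directory stands in for ../figures.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def figure_paths(filename: str, folder: str) -> Tuple[Path, Path]:
    """Hypothetical helper: create the folder if needed and return the .html and .png paths."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{filename}.html", out_dir / f"{filename}.png"


# A temporary directory stands in for ../figures
base = tempfile.mkdtemp()
html_path, png_path = figure_paths("umap-scatter", f"{base}/figures")
print(html_path.name, png_path.name, html_path.parent.is_dir())
```

The returned paths could then be passed to fig.write_html and fig.write_image in place of the f-strings.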

Data Part
The code block below uses the imported pandas to read a comma-separated values file into a dataframe called "df_whole".
It then assigns a variable called "df" to be a copy of the column "title" from the data set.
If the device is the CPU, only a 5% random sample of the data set is used; with a GPU the whole data set is used. This is to ensure the data can be processed in a reasonable time even if the computer is not top of the line. To use the whole data set on a CPU anyway, the sampling line can be commented out.
In the last lines, the shape of the data set is printed and its first three rows are displayed.

In [16]:
#Got data from Lab repo, who again got it from kaggle: https://www.kaggle.com/datasets/datasnaek/youtube-new?resource=download
df_whole = pd.read_csv("../DataFiles/USvideos.csv")

df = df_whole[["title"]].copy()

#if your computer does not have GPU support, a sample of the dataset is used instead to make it run in a reasonable time
#if you want to use the full dataset even without GPU, in case you have a strong CPU, then you can just comment out the next line

if device == "cpu": df=df.sample(frac=0.05)

print(df_whole.shape)

df_whole.head(3)

(40949, 16)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJ...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with John Oliver (HBO),LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week tonight donald trump""|""john oliver trump""|""donald trump""",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John Oliver discusses what we've learned so far and enlists our catheter cowboy to teach Donald Trump what he hasn't.\n\nConnect with Last Week Tonight on..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Lele Pons",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""|""racist""|""superman""|""love""|""rudy mancuso poo bear black white official music video""|""iphone x by pineapple""|""lelepons""|""hannahstocking""|""rudymancuso""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confirmation=1\n\nTHANKS FOR WATCHING! LIKE & SUBSCRIBE FOR MORE VIDEOS!\n-------------------...
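
The sampling used on the CPU path draws different rows on every run. A minimal sketch of the same call on made-up data is shown below; the random_state argument is an addition that the notebook does not use, included here as one way to make the sample repeatable.

```python
import pandas as pd

# Made-up stand-in for the title column of the real data set
df = pd.DataFrame({"title": [f"video {i}" for i in range(200)]})

# random_state is an assumption added for repeatability; the notebook omits it
sample_a = df.sample(frac=0.05, random_state=0)
sample_b = df.sample(frac=0.05, random_state=0)

print(len(sample_a))                          # 10 rows, 5% of 200
print(sample_a.index.equals(sample_b.index))  # True: same seed, same rows
```

The sampled dataframe keeps the original index values, so rows can still be traced back to df_whole.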


In the code block below, we turn the column "title" into a list and show the first 20 titles from the data set.

In [17]:
list(df["title"])[0:20]

['WE WANT TO TALK ABOUT OUR MARRIAGE',
 'The Trump Presidency: Last Week Tonight with John Oliver (HBO)',
 'Racist Superman | Rudy Mancuso, King Bach & Lele Pons',
 'Nickelback Lyrics: Real or Fake?',
 'I Dare You: GOING BALD!?',
 '2 Weeks with iPhone X',
 'Roy Moore & Jeff Sessions Cold Open - SNL',
 '5 Ice Cream Gadgets put to the Test',
 'The Greatest Showman | Official Trailer 2 [HD] | 20th Century FOX',
 'Why the rise of the robots won’t mean the end of work',
 "Dion Lewis' 103-Yd Kick Return TD vs. Denver! | Can't-Miss Play | NFL Wk 10 Highlights",
 "(SPOILERS) 'Shiva Saves the Day' Talked About Scene Ep. 804 | The Walking Dead",
 'Marshmello - Blocks (Official Music Video)',
 'Which Countries Are About To Collapse?',
 'SHOPPING FOR NEW FISH!!!',
 'The New SpotMini',
 'One Change That Would Make Pacific Rim a Classic',
 "How does your body know you're full? - Hilary Coller",
 'HomeMade Electric Airplane',
 'Founding An Inbreeding-Free Space Colony']

Cleaning
In AI, data cleaning is an important part of preparing the inputs. Unlike conventional programming, where the behaviour is written out explicitly, the output of an AI model depends on the data it is given, so noisy input leads to noisy output. For this to work we need clean input data, and in the code block below we are doing just this. We are again using the dataframe called df, and create a new column called "title_clean" by taking the column "title" and applying our earlier defined function "string_cleaner" to every entry in the column. We end the code block by displaying the first five rows of df to compare the values of both "title" and "title_clean".

In [18]:
df["title_clean"] = df["title"].apply(string_cleaner)

df.head(5)

Unnamed: 0,title,title_clean
0,WE WANT TO TALK ABOUT OUR MARRIAGE,we want to talk about our marriage
1,The Trump Presidency: Last Week Tonight with John Oliver (HBO),the trump presidency last week tonight with john oliver hbo
2,"Racist Superman | Rudy Mancuso, King Bach & Lele Pons",racist superman rudy mancuso king bach lele pons
3,Nickelback Lyrics: Real or Fake?,nickelback lyrics real or fake
4,I Dare You: GOING BALD!?,i dare you going bald


Machine Learning part:
In the three following code blocks we first display the column "title_clean" that we defined in the preceding code blocks. After checking that we have 40949 cleaned titles, we encode them so that the machine can group our data for us.

To do this we are using embeddings, which we get by applying the sentence transformer (loaded in the first code block of the notebook) to the column "title_clean". For this to work we first need to convert df["title_clean"] to a numpy array. The sentence transformer's encode function runs the pre-trained model on each title and gives us our embeddings. After this line we print the shape of our embeddings: 40949 rows, one per title, each of them a vector of 768 numbers. We then create a column in the dataframe called "embs" (for embeddings), which holds our embeddings cast to a list of vectors. We then display the first three rows of the dataframe.

In [19]:
df["title_clean"]

0                                                       we want to talk about our marriage
1                              the trump presidency last week tonight with john oliver hbo
2                                       racist superman  rudy mancuso king bach  lele pons
3                                                           nickelback lyrics real or fake
4                                                                    i dare you going bald
                                               ...                                        
40944                                                         the cat who caught the laser
40945                                                            true facts  ant mutualism
40946    i gave safiya nygaard a perfect hair makeover based on her features bts bradmondo
40947                                                  how black panther should have ended
40948                        official call of duty black ops 4  multiplayer reveal trailer

In [20]:
embs = model.encode(df["title_clean"].to_numpy())
print(f"The shape of our embeddings: {embs.shape}")

The shape of our embeddings: (40949, 768)


In [21]:
df["embs"] = list(embs)

df.head(3)

Unnamed: 0,title,title_clean,embs
0,WE WANT TO TALK ABOUT OUR MARRIAGE,we want to talk about our marriage,"[0.026689166, 0.008800405, -0.03485281, -0.009949934, -0.03847832, -0.031960357, 0.006905073, -0.007085408, 0.07219982, -0.00948024, -0.05225678, -0.1279048, 0.062239345, 0.0018435045, -0.06286907..."
1,The Trump Presidency: Last Week Tonight with John Oliver (HBO),the trump presidency last week tonight with john oliver hbo,"[-0.071231276, 0.06831777, 0.014243533, -0.040540695, 0.0148761235, -0.003980817, -0.029666796, 0.024369014, 0.003718506, -0.023094337, 0.08075214, -0.0003890591, -0.05722484, 0.045889538, 0.02455..."
2,"Racist Superman | Rudy Mancuso, King Bach & Lele Pons",racist superman rudy mancuso king bach lele pons,"[0.01062062, 0.051775046, -0.005729636, 0.035479583, 0.023077734, 0.024468908, 0.007058366, 0.011016046, -0.023958901, -0.0019661316, -0.0098380195, 0.04014703, 0.027782077, -0.057235688, 0.033899..."


Dimensionality Reduction:
Our embeddings represent each title as a vector of 768 numbers. We can think of these numbers as coordinates in a 768-dimensional space, where the position of each title reflects how it stands in relation to the other titles. It is difficult to gather any value from this if we do not reduce it to a "human readable" format. To do this we have to reduce it to a two-dimensional space.
This is done in the code block below by assigning a variable called umap to a "UMAP" object, configured with the number of neighbors (20) and the minimum distance between points (0.1). We then call the function "fit_transform" on our embeddings and print the shape of the result, uembs, which is the dimensionally reduced version of our embeddings. After this is done we assign a variable called fig to a scatter plot (using plotly) of the x and y coordinates of uembs. Finally we set the figure size and marker size and then save the figure as files called "umap-scatter".

In [22]:
umap = UMAP(n_neighbors=20, min_dist=0.1)

uembs = umap.fit_transform(embs)

print(uembs.shape)

fig = px.scatter(x=uembs[:,0], y=uembs[:,1])


fig.update_layout(width=800, height=800)
fig.update_traces(marker=dict(size=2))

#plotting to show how the embeddings are when just dimensionality reduction is used
fig_show_save(fig, "umap-scatter")

(40949, 2)


Clustering 2D data:
After reducing our embeddings to two dimensions we need to cluster the data. We assign a new variable called clusters_2d and use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), which creates clusters based on the density of our input data and marks points that fit no cluster as noise. We set the minimum cluster size to 10 and use the cluster_selection_method "leaf", which selects clusters from the leaves of the cluster tree and gives us fine-grained and homogeneous clusters. We then call the function "fit", which performs HDBSCAN clustering on a feature array, in this case our 2-dimensional embeddings "uembs".
HDBSCAN is in short a way to figure out which points should form clusters, and we say "take minimum 10 into a cluster" with the "leaf" method on our 2-dimensional embeddings.

We then print the number of clusters created by HDBSCAN and the number of outliers, which HDBSCAN labels -1.

In [23]:
clusters_2d = HDBSCAN(min_cluster_size=10, cluster_selection_method="leaf").fit(uembs)

print(f"""
    2D
    Number of clusters: {len(set(clusters_2d.labels_)) - 1}
    Number of rows as outliers: {clusters_2d.labels_.tolist().count(-1)}
""")


    2D
    Number of clusters: 1977
    Number of rows as outliers: 3378
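
The two counts can be reproduced on a small array to see what they mean. The labels below are made up but follow the format of clusters_2d.labels_. Note that the expression len(set(labels)) - 1 used above assumes that at least one outlier exists; the sketch counts the real clusters directly, which also works when there are none.

```python
import numpy as np

# Made-up labels in the format of clusters_2d.labels_: -1 marks an outlier
labels = np.array([0, 0, 1, -1, 2, 2, 2, -1, 1, 0])

unique_labels, counts = np.unique(labels, return_counts=True)
sizes = {int(label): int(count) for label, count in zip(unique_labels, counts)}

n_outliers = sizes.get(-1, 0)
# Count only real clusters, so the result is right even when there are no outliers
n_clusters = len([label for label in sizes if label != -1])

print(sizes)       # {-1: 2, 0: 3, 1: 2, 2: 3}
print(n_clusters)  # 3
print(n_outliers)  # 2
```

The "sizes" dictionary also gives the number of points per cluster, which the notebook later derives with a pandas groupby.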



Results
We then turn clusters_2d.labels_ into a Python set, which keeps each value only once, to see the distinct cluster labels that were assigned.

In [24]:
set(clusters_2d.labels_)

{0,
 1,
 2,
 ... (labels 3 to 183, one per line)
 184,


Plotting the results:
In the code block below we trace the clusters and the outliers in 2D using our defined result_tracer_wrapper function. We pass in the 2D embeddings called "uembs", the 2D cluster labels and the clean titles as a numpy array. After this we assign col1 as a list containing the cluster trace and the outlier trace, and then assign row1 as a list containing col1. We then wrap row1 in a list called trace_list and pass it to our pre-defined subplotter function, with "Topics by HDBSCAN Cluster" as the subplot title. In the last line we call our fig_show_save function and save the figure as files called "topics-by-hdbscan-clusters".

In [25]:
trace_cluster_2d, trace_outlier_2d = result_tracer_wrapper(uembs, clusters_2d.labels_, df["title_clean"].to_numpy())


col1 = [trace_cluster_2d, trace_outlier_2d]


row1 = [col1]


trace_list = [row1]

fig = subplotter(trace_list, ["Topics by HDBSCAN Cluster", ])

fig_show_save(fig, "topics-by-hdbscan-clusters")

Showing topics per cluster:
In the code block below we use our function "result_df_maker" on our 2D embeddings, clusters_2d.labels_ and our clean titles as a numpy array to create a new dataframe. We then use pandas on result_2d to create a table showing the cluster label, topics and video count of each cluster. We then sort the values by video count and output the first 20 rows of the table we created.

In [26]:
result_2d = result_df_maker(uembs, clusters_2d.labels_, df["title_clean"].to_numpy())

result_2d[["cluster_label", "topics"]].groupby(["cluster_label", "topics"])["topics"].count().reset_index(name="videos_count").sort_values(by="videos_count", ascending=False).head(20)

Unnamed: 0,cluster_label,topics,videos_count
0,-1,official trailer 2018 2017 video,3378
1774,1773,netflix hd trailer official outsider,62
1249,1248,google animoji apple taxi driver,60
1850,1849,date ellen love stories girl,59
1507,1506,messenger tedashii day year ed,59
1582,1581,game week nfl highlights vs,58
1843,1842,tesla 2018 ride bobsledder olympic,53
1945,1944,national anthem fergies reaction trump,52
1679,1678,kristen ep momsplaining bell riverdale,52
1281,1280,school useless degree trebuchet trap,52
