# Topic Modelling YouTube Comments

In this notebook, I am going to perform topic modelling on YouTube Comments I've extracted previously. There are many Topic Modelling articles available that work through each stage in enough detail. Still I believe there's room for showing beginners how to get the best performance out of the topic modelling code.

When performing topic modelling you need to identify:
1. The Document
2. The Paragraphs within the document
3. The approach 
4. The assumptions

In this example, I will be using Cyberpunk 2077 YouTube trailer comment sections as my documents.

1. The document - Whole comment section
2. The paragraphs - Each individual comment
3. The approach - LDA (and maybe NNMF after)
4. The assumptions - Similar words = Similar topics/ Re-occuring groups of words = similar topic

I will be following the steps outlined in [this article](https://stackabuse.com/python-for-nlp-topic-modeling/) as I felt everything was thoroughly explained step by step.


In [67]:
import pandas as pd
import numpy as np

dataset = pd.read_csv("/content/4yQ--nrwy5w_comments.csv")
dataset.columns = ["Comment", "Comment ID", "Reply Count", "Like Count", "Publish Date"]
display(youtube_dataset.head())
print(youtube_dataset.shape)

Unnamed: 0,Comment,Comment ID,Reply Count,Like Count,Publish Date
0,I like how theres quests in game with song nam...,UgzBfnjpNP8sTyVuEh54AaABAg,0,0,2021-01-02T13:18:19Z
1,This is one of the best game trailers ever mad...,UgxmWkk3_vhYmnxOYzl4AaABAg,0,0,2021-01-02T12:37:41Z
2,No matter what people say about this game and ...,UgxwIEXrUcnwg7s3JOl4AaABAg,0,0,2021-01-02T11:54:51Z
3,It gives me chills every time I watch this. Ik...,UgwAaP00la2JB6vrCOZ4AaABAg,1,1,2021-01-02T11:13:04Z
4,POLSKA GRA I NIE MA POLAKOW XDDDXDDD,UgxIKIqITUIheHzS-cZ4AaABAg,0,0,2021-01-02T11:12:49Z


(24087, 5)


# Looking at the text data

It's always important to look at the raw text data to check for things that might not get picked up by the machine which is definitely the case with Social Media.

The use of slang, typos and poor grammar can hinder the effectiveness of NLP models significantly, so we will to do some significant cleaning before we do any technical analysis.

In [68]:
print(dataset['Comment'][351])
print(" ")
print(dataset['Comment'][1200])
print(" ")
print(dataset['Comment'][45])
print(" ")
print(dataset['Comment'][1666])

Absolutely beuatiful... can't wait to get my hands on this game.
 
Is it me or it doesnt look sooo beautiful as they have been showing us?
 
I say we start a petition to see to it that no cdpr developer receives their bonus.
 
This is the only game that makes me think about getting a new gen console


You can clean your text data without creating a function but I believe it is more thorough to put as much of the preprocessing in functions so you can tailor it to the different text data you'll be working with.

In this case we are cleaning by:
* Lowercasing the text
* Removing numbers
* Stripping out white text
* Removing punctuation
* (Optional) - Removing topic specific stopwords. If the document is about Cyberpunk 2077 then CDPR will most likely appear consistently, whether this is useful in your topic modelling is up to you.



In [69]:
def clean_text(text):
  text = text.lower().replace("'","").replace('[^\w\s]', ' ').replace(" \d+", " ").strip()
  return text


sample = dataset['Comment'][351]
clean_text(sample)

'absolutely beuatiful... cant wait to get my hands on this game.'

In [71]:
dataset['Clean Comment'] = dataset['Comment'].apply(clean_text)
dataset.head()

Unnamed: 0,Comment,Comment ID,Reply Count,Like Count,Publish Date,Clean Comment
0,so that was a fucking lie,Ugz2OiOZa8EZbPRPfHt4AaABAg,0,0,2021-01-01T05:51:31Z,so that was a fucking lie
1,1:13 bro why are you going so slow... why you ...,UgylPC2tsFYHrGFd-UN4AaABAg,0,0,2020-12-31T03:03:45Z,1:13 bro why are you going so slow... why you ...
2,LLLLLIIIIEEEESSSSSS!!!!!!!!,UgzwwVVxRnmPkw8HEvl4AaABAg,0,0,2020-12-29T12:27:39Z,llllliiiieeeessssss!!!!!!!!
3,Nice relaxing scam,UgwFVxwEn2tpceqFm_94AaABAg,0,0,2020-12-29T09:59:42Z,nice relaxing scam
4,Just coming back here to see how much we were ...,UgyJLTdjbqX464mJ8xZ4AaABAg,0,1,2020-12-29T06:28:09Z,just coming back here to see how much we were ...


# Advanced Optimisation of Text Data 
## Skip section until you have ran LDA model first time round.

There is so much more you can do if the text data is very dirty (typos, grammar, memes) to make the data optimised for topic modelling including:

* Word or Character limit
* Removal Topic specific stopwords
* Language translation
* Lemming
* Stemming





In [23]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
#stop_words_list = stopwords.words('english') + ['though', 'game', 'games']

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [29]:
# initiate stopwords from nltk

stop_words = stopwords.words('english')
print(len(stop_words))
# add additional missing terms

stop_words.extend(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l','m','n','o','p','q','r','s','t', 'u', 'v', 'w', 'x', 'y', 'z', "about", "across", "after", "all", "also", "an", "and", "another", "added",
"any", "are", "as", "at", "basically", "be", "because", 'become', "been", "before", "being", "between","both", "but", "by","came","can","come","could","did","do","does","each","else","every","either","especially", "for","from","get","given","gets",
'give','gives',"got","goes","had","has","have","he","her","here","him","himself","his","how","if","in","into","is","it","its","just","lands","like","make","making", "made", "many","may","me","might","more","most","much","must","my","never","provide", 
"provides", "perhaps","no","now","of","on","only","or","other", "our","out","over","re","said","same","see","should","since","so","some","still","such","seeing", "see", "take","than","that","the","their","them","then","there",
"these","they","this","those","through","to","too","under","up","use","using","used", "underway", "very","want","was","way","we","well","were","what","when","where","which","while","whilst","who","will","with","would","you","your", 
'etc', 'via', 'eg', 'game', 'games' 'like']) 

# remove stopwords
print(len(stop_words))
#df[1] = df[1].apply(lambda x: [item for item in x if item not in stop_words])

#display(df.head(10))

179
352


In [31]:
def superclean(text):
  #tokens = text.split(" ")
  text = text.lower().replace("'","").replace('[^\w\s]', ' ').replace(" \d+", " ").strip()
  tokens = nltk.word_tokenize(text)
  stop_tokens = [item for item in tokens if item not in stop_words]
  new_text = ' '.join(stop_tokens)
  return new_text
print(youtube_dataset['Comment'][45])
print("------")
print(superclean(youtube_dataset['Comment'][45]))

It is a shame to have such errors in a game that has only been released in 8 years, and it is disrespectful to people who expect the game
And as someone who does not like FPS games, I am sure that if there were camera angles such as GTA, he would appeal to more masses
------
shame errors released 8 years , disrespectful people expect someone fps games , sure camera angles gta , appeal masses


In [45]:
dataset['Clean Comment'] = dataset['Comment'].apply(superclean)
dataset.head()
dataset["char_count"] = dataset['Clean Comment'].apply(len)
df = dataset.drop(dataset[dataset['char_count'] < 50].index)
#df = df.drop(df[df.score < 50].index)
#df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
print(df.shape)
print(dataset.shape)
display(df.head())

(1348, 3)
(1928, 3)


Unnamed: 0,Comment,Clean Comment,char_count
0,What this is telling me is I should get one of...,telling one new consoles play pc . dont chuggi...,96
1,"Honestly, I'd be fine if it was delayed until ...","honestly , id fine delayed new consoles releas...",285
2,Feel like they buried the lede where the sourc...,feel buried lede source mentioned cdpr worked ...,63
3,To anyone mildly worried that this means that ...,anyone mildly worried means wont release ps4/x...,145
4,"I'm at that point where I'm like, do I play on...","im point im , play ps4 , wait play ps5 . even ...",151


# Create your Document Term Matrix

You need a Document Term Matrix of some sort to do the topic modelling calculations. 

I've seen a lot of topic modelling approaches that skip customising their Vectoriser. We won't because never forget, garbage in, garbage out.

Before converting our words into numeric values, we will:#

* Only include those words that appear in less than 80% of the document (max_df=0.8)
* Only include those words that ppear in at least 2 documents
* Remove english stopwords (even if you've removed stopwords before, do it again as different models you've used previously might have missed out some words.

In [72]:
from sklearn.feature_extraction.text import CountVectorizer
#Only include those words that appear in less than 80% of the document (max_df=0.8)
#Only include those words that ppear in at least 2 documents
count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
#.values Only the values in the DataFrame will be returned, the axes labels will be removed.
#The astype(‘U’) is telling numpy to convert the data to Unicode (essentially a string in python 3).
doc_term_matrix = count_vect.fit_transform(dataset['Clean Comment'].values.astype('U'))

In [61]:
doc_term_matrix
#Each of 16236 documents(comments) is represented as 9651 dimensional vector, which means that our vocabulary has 9651 words.

<1348x2719 sparse matrix of type '<class 'numpy.int64'>'
	with 23833 stored elements in Compressed Sparse Row format>

# Run your LDA Model

This is quite a basic implementation of LDA without the [extensive list of parameters](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) you can edit. So I added a more advanced version underneath as a comparison.



In [73]:
from sklearn.decomposition import LatentDirichletAllocation
#n_components = num. of topics
#random_state = Is just random
LDA = LatentDirichletAllocation(n_components=10, random_state=42)
LDA.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [77]:
#Thanks fellow coder - 
#https://stackoverflow.com/questions/61373994/how-to-specify-random-state-in-lda-model-for-topic-modelling
LDA_Advanced = LatentDirichletAllocation(n_components=10,        
                                  max_iter=10,               
                                  learning_method='online',   
                                  random_state=100,          
                                  batch_size=128,            
                                  evaluate_every = -1,       
                                  n_jobs = -1 )

LDA_Advanced.fit(doc_term_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=-1,
                          perp_tol=0.1, random_state=100, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [50]:
import random

for i in range(10):
  #randomly fetch 10 word ids using the get_feature_names() method
    random_id = random.randint(0,len(count_vect.get_feature_names()))
    print(count_vect.get_feature_names()[random_id])

mountain
contradicts
north
microsoft
profile
properly
companies
enormous
janky
2500k


# Print out the Top 10 words for Topic 1

In [51]:
#Use components_ to get the first topic at index 0
first_topic = LDA.components_[0]
#Out of 14546 probabilities for each word for topic 1, sort the indexes according to probability values - argsort().
#Return last 10 indexes [-10]
top_topic_words = first_topic.argsort()[-10:]

for i in top_topic_words:
    print(count_vect.get_feature_names()[i])

consoles
play
run
ps5
ps4
lot
think
people
pc
games


# First Attempt at Top Topics And Analysis

* Topic 0: December 10th delay
* Topic 1: undistinguishable - Foreign language and stopwords
* Topic 2: Cars look great
* Topic 3: Positive reception, in awe
* Topic 4: Can't wait to play on next gen or PC
* Topic 5: Technical Capabilities of running 4k and 1080p on consoles
* Topic 6: Undistinguishable - Very mixed
* Topic 7: Looks good
* Topic 8: Undistinghuishable - Very mixed
* Topic 9: The look of night city

There are so clear topics but there is also a lot of overlap of positive sentiment from "wow+omg" to "good+great+love". We can maybe reduce the overlap by having less topics.

Non-english words is a poor oversight by me and should be factored into the clean_text function.

There's scope for adding new stopwords such as "like, game, im, whats, video". Even https is an interesting removal as it refers to external links to other websites.

In [75]:
for i,topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    print([count_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['10', 'just', '10th', 'wait', 'year', 'release', 'delay', 'december', 'gen', 'game']


Top 10 words for topic #1:
['look', 'на', '3rd', 'что', 'не', 'https', 'im', 'game', 'just', 'person']


Top 10 words for topic #2:
['video', 'look', 'fucking', 'love', 'car', 'really', 'im', 'just', 'game', 'like']


Top 10 words for topic #3:
['got', 'just', 'good', 'wow', 'awesome', 'shit', 'dont', 'like', 'looks', 'game']


Top 10 words for topic #4:
['gen', 'pc', 'im', 'ps5', 'play', 'xbox', 'wait', 'game', 'series', 'looks']


Top 10 words for topic #5:
['gameplay', 'dont', 'run', 'consoles', '1080p', 'video', '4k', 'ps4', 'xbox', 'game']


Top 10 words for topic #6:
['15', 'like', 'know', '60', 'whats', 'hollie', 'xbox', 'song', '30', 'fps']


Top 10 words for topic #7:
['really', 'amazing', 'bad', 'people', 'gen', 'game', 'look', 'good', 'like', 'looks']


Top 10 words for topic #8:
['god', 'gta', '50', 'look', 'game', 'reeves', 'looks', 'music', 'like', 'keanu']



The topics with the advanced LDA model are completely different.

* Topic 0: 1080p or 4k on playstation
* Topic 1: Current gen will run this at 30fps
* Topic 2: Console versions
* Topic 3: Keanu Reeves is awesome
* Topic 4: Game should be delayed something about graphics
* Topic 5: Looks great on PS5
* Topic 6: Undistinguishable -  Mixture of topics
* Topic 7: The gameplay looks good
* Topic 8: Weird pop ins with Combat and Car mechanics
* Topic 9: Ray tracing isn't going to make a difference.

The LDA Advanced might have just performed better with a more unique set of words but some of the issues are still present.


In [78]:
for i,topic in enumerate(LDA_Advanced.components_):
    print(f'Top 10 words for topic #{i}:')
    print([count_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['upload', 'screen', 'que', 'yes', 'boring', 'playstation', 'excited', '1080p', 'video', '4k']


Top 10 words for topic #1:
['running', '30fps', 'current', 'character', 'series', 'year', 'fucking', 'consoles', '10', 'gen']


Top 10 words for topic #2:
['version', 'just', 'consoles', 'new', 'console', 'delay', 'ps4', 'pc', 'hope', 'xbox']


Top 10 words for topic #3:
['johnny', 'reeves', 'что', 'downgrade', 'на', 'не', '50', 'awesome', 'song', 'keanu']


Top 10 words for topic #4:
['delayed', '10th', 'graphics', 'lol', 'december', 'cyberpunk', '2077', 'game', 'looks', 'like']


Top 10 words for topic #5:
['version', 'just', 'great', 'fps', 'play', 'ps5', 'release', 'series', 'wait', 'game']


Top 10 words for topic #6:
['god', 'wtf', 'looks', 'time', 'shit', 'game', 'waiting', 'like', '60fps', 'nice']


Top 10 words for topic #7:
['just', 'city', 'dont', 'people', 'gameplay', 'im', 'good', 'like', 'looks', 'game']


Top 10 words for topic #8:
['driving', 'pop'

# Add Topic Column to dataframe

This is a nice way to look how aligned the topic allocation is to each comment. Short 1-2 worded comments are obviously going to be completely incorrect but I have seen a lot of very accurate Topic allocations when compared to what I assumed the topics to be about.

So that's another thing I need to do for the second run. Remove short comments which are shorter than one first so maybe 10 words or 50 characters?

I think at this point I need to reveal what the video is about.... 


[Cyberpunk 2077 — Night City Wire Special: Xbox One X and Xbox Series X footage](https://www.youtube.com/watch?v=4yQ--nrwy5w&ab_channel=Cyberpunk2077)

In [33]:
topic_values = LDA_Advanced.transform(doc_term_matrix)
topic_values.shape
dataset['Topic'] = topic_values.argmax(axis=1)
dataset[16000:16010]

Unnamed: 0,Comment,Comment ID,Reply Count,Like Count,Publish Date,Clean Comment,Topic
16000,They look like angels,Ugy5voH_HW9F4E_mfmR4AaABAg,0,0,2020-11-17T17:13:07Z,they look like angels,4
16001,I have reached levels of hype for this game th...,UgzDew78mKSh8ERMAC14AaABAg,2,106,2020-11-17T17:13:06Z,i have reached levels of hype for this game th...,7
16002,The people in the chat were going ballistic be...,UgxMalIws8N4G5CS4fp4AaABAg,26,518,2020-11-17T21:57:03Z,the people in the chat were going ballistic be...,7
16003,Looks unplayable on current gen consoles.,UgwF_ZaOev1Izbw6zFV4AaABAg,0,0,2020-11-17T17:13:04Z,looks unplayable on current gen consoles.,1
16004,"Hollie Bennett is really cute, she has an Inst...",UgzrEKsm76OVQikx6XB4AaABAg,0,0,2020-11-17T17:13:03Z,"hollie bennett is really cute, she has an inst...",7
16005,Who else can't wait for their pre-order?,Ugy2BckZ4Q5OGoXFG6d4AaABAg,1,1,2020-11-17T17:13:00Z,who else cant wait for their pre-order?,7
16006,12 Tflops = 30fps console plebs lmao,Ugzb0jdME2n0WTFBGq94AaABAg,0,0,2020-11-17T17:13:00Z,12 tflops = 30fps console plebs lmao,1
16007,"Wow, this looks awesome!",UgwFJPYFwmfojerNcW94AaABAg,0,0,2020-11-17T17:12:59Z,"wow, this looks awesome!",7
16008,Xbox? What is this?,Ugw3T34PJjXpz1ptevN4AaABAg,0,0,2020-11-17T17:12:59Z,xbox? what is this?,2
16009,I just hope it works on my ps4,UgypkOZR-r3oRIQHv454AaABAg,0,0,2020-11-17T17:12:53Z,i just hope it works on my ps4,2
