# Topic Modeling on a TED Talk Recorded Session Transcript

![Topic Modeling Bill Gates](https://talkstar-photos.s3.amazonaws.com/uploads/ba9f8c13-f0a9-4698-a435-5767e478d715/BillGates_2022-1350x675.jpg)

After converting the recorded interview with Bill Gates from speech to text with the Google Cloud API service, it is time to preprocess the text and apply three topic modeling algorithms: **Latent Dirichlet Allocation (LDA)**, **Latent Semantic Analysis (LSA)**, and **BERTopic**. 
I also ran the same techniques on a large, widely used dataset (**20 Newsgroups**) to serve as a baseline benchmark, which should give more insight when drawing conclusions about how the models perform on the ted_talk corpus. 

So, without further ado, let's begin.


### Importing the Relevant Libraries 

In [1]:
# Importing general libraries 
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Importing the Gensim library
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# I will use this library for implementing the truncated singular value decomposition for the LSA model
from gensim.models import LsiModel 


# Importing nltk and downloading stopwords 
import nltk
nltk.download('stopwords')

# Importing spacy for lemmatization
import spacy

# Importing the BERTopic model
from bertopic import BERTopic
# Importing the sentence-transformers package for the purpose of document embeddings
from sentence_transformers import SentenceTransformer
from sentence_transformers import *
# Importing UMAP for dimensionality reduction in the BERTopic model
import umap
# Importing HDBSCAN to perform its clustering
import hdbscan

# Importing various dimensionality reduction and clustering techniques 
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


# Importing LexRank, an unsupervised approach to text summarization based on graph-based centrality scoring of sentences
from lexrank import *
# Importing the torch package  
import torch


# Importing plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

# Enabling logging for gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

# Importing warnings 
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yoni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Preparing Stopwords

In [2]:
# Importing NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

### Importing the Ted Talk Corpus of Documents 

In [3]:
# Importing Dataset
df = pd.read_json('tedtalk_corpus.json')
df.head(10)

| | Sentences | Document# |
|---|---|---|
| 0 | ["it said that's a lie I merely asked you on T... | 1 |
| 1 | "Bill Gates cousin self an imperfect Messenger... | 2 |
| 2 | "who is the author of the book when I come bac... | 3 |
| 3 | "you know it's pretty demanding it's not a 50%... | 4 |
| 4 | "manufacturing including steel and cement peop... | 5 |
| 5 | "yeah so the green premium berries from emissi... | 6 |
| 6 | "X-Men where we haven't really gotten started ... | 7 |
| 7 | "25% across all categories will that conversat... | 8 |
| 8 | "and we have to do everything we can to accele... | 9 |
| 9 | "where is now breakthrough energy Ventures as ... | 10 |


In [6]:
df['Sentences'].shape

(59,)

There are only 59 documents in my corpus from the recorded session with Bill Gates. This is a very small dataset, but to test my hypothesis about how well different types of topic models cope with such a small dataset, I will run the models on this corpus and draw the relevant conclusions from the results. 

So, let's continue.

### Removing Newline Characters and Stray Punctuation

In [9]:
# Converting to list
data = df.Sentences.values.tolist()

# Collapsing newlines and other whitespace runs into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Removing distracting quotes, commas and square brackets
data = [re.sub(r"'", "", sent) for sent in data]
data = [re.sub(r",", "", sent) for sent in data]
data = [re.sub(r"\[", "", sent) for sent in data]
data = [re.sub(r"\]", "", sent) for sent in data]
data = [re.sub(r'"', "", sent) for sent in data]


print(data)

['it said thats a lie I merely asked you on Today Show the philanthropist and Microsoft co-founder Bill Gates in conversation with Ted Global curator changes for the world to avoid climate disaster he talks or something called the green premium lays out Innovations we need to invest in and shares why younger Generations are the key to getting to net zero emissions and also have his love for burgers is changing the conversation is from March 2021 and part of countdown Ted Global initiative to xcelerate solutions to The Climate Crisis get involved at countdown head.com', 'Bill Gates cousin self an imperfect Messenger on climate because of his high carbon footprint and the lifestyle however he is just made a major contribution to our thinking about confronting climate change a book A book about decarbonizing our economy and Society its an optimistic can do kind of book with a strong focus on technological solutions he discusses the things we have such as wind and solar power the things we

### Tokenizing Words and Clean-Up Text

In [10]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes accent marks; punctuation is dropped by the tokenizer

data_words = list(sent_to_words(data))

print(data_words)

[['it', 'said', 'thats', 'lie', 'merely', 'asked', 'you', 'on', 'today', 'show', 'the', 'philanthropist', 'and', 'microsoft', 'co', 'founder', 'bill', 'gates', 'in', 'conversation', 'with', 'ted', 'global', 'curator', 'changes', 'for', 'the', 'world', 'to', 'avoid', 'climate', 'disaster', 'he', 'talks', 'or', 'something', 'called', 'the', 'green', 'premium', 'lays', 'out', 'innovations', 'we', 'need', 'to', 'invest', 'in', 'and', 'shares', 'why', 'younger', 'generations', 'are', 'the', 'key', 'to', 'getting', 'to', 'net', 'zero', 'emissions', 'and', 'also', 'have', 'his', 'love', 'for', 'burgers', 'is', 'changing', 'the', 'conversation', 'is', 'from', 'march', 'and', 'part', 'of', 'countdown', 'ted', 'global', 'initiative', 'to', 'xcelerate', 'solutions', 'to', 'the', 'climate', 'crisis', 'get', 'involved', 'at', 'countdown', 'head', 'com'], ['bill', 'gates', 'cousin', 'self', 'an', 'imperfect', 'messenger', 'on', 'climate', 'because', 'of', 'his', 'high', 'carbon', 'footprint', 'and',
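The output above no longer contains tokens such as `I`, `a`, or `2021`. The sketch below is a rough standard-library approximation of what `simple_preprocess` does, not Gensim's actual code: it lowercases the text, keeps runs of non-digit word characters, and drops tokens outside a length range. The helper name `approx_simple_preprocess` is hypothetical, and the 2 to 15 character limits are what I believe Gensim's defaults to be.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (an approximation, not the real code)."""
    # Lowercase, then keep runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop very short and very long tokens (assumed defaults: 2 and 15 characters)
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

sample = "is from March 2021 and part of countdown I a co-founder"
print(approx_simple_preprocess(sample))
```

Single-letter words and numbers disappear at this step, and hyphenated words such as `co-founder` are split in two.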

### Creating Bigram and Trigram Models

In [11]:
# Building the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['it', 'said', 'thats', 'lie', 'merely', 'asked', 'you', 'on', 'today', 'show', 'the', 'philanthropist', 'and', 'microsoft', 'co', 'founder', 'bill', 'gates', 'in', 'conversation', 'with', 'ted', 'global', 'curator', 'changes', 'for', 'the', 'world', 'to', 'avoid', 'climate', 'disaster', 'he', 'talks', 'or', 'something', 'called', 'the', 'green', 'premium', 'lays', 'out', 'innovations', 'we', 'need', 'to', 'invest', 'in', 'and', 'shares', 'why', 'younger', 'generations', 'are', 'the', 'key', 'to', 'getting', 'to', 'net', 'zero', 'emissions', 'and', 'also', 'have', 'his', 'love', 'for', 'burgers', 'is', 'changing', 'the', 'conversation', 'is', 'from', 'march', 'and', 'part', 'of', 'countdown', 'ted', 'global', 'initiative', 'to', 'xcelerate', 'solutions', 'to', 'the', 'climate', 'crisis', 'get', 'involved', 'at', 'countdown', 'head', 'com']
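The printed tokens above contain no joined bigrams or trigrams. A likely reason is that, in a 59-document corpus, very few word pairs clear both `min_count=5` and `threshold=100`. The sketch below re-implements what I understand to be Gensim's default phrase score, `(bigram_count - min_count) / (count_a * count_b) * vocab_size`. The function name `phrase_score` and all counts are made-up illustrative values, not numbers measured on this corpus.

```python
def phrase_score(count_a, count_b, bigram_count, vocab_size, min_count=5):
    """Default-style phrase score; a pair is merged only if the score exceeds the threshold."""
    return (bigram_count - min_count) / (count_a * count_b) * vocab_size

# Hypothetical counts for a pair such as ("green", "premium") in a small corpus
score = phrase_score(count_a=30, count_b=12, bigram_count=10, vocab_size=1500)
print(round(score, 2))   # compare this value against threshold=100
print(score > 100)       # False here, so the pair would stay as two separate tokens
```

With a small vocabulary and low pair counts the score stays well below 100, so lowering `threshold` or `min_count` may be needed before any phrases are formed.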


### Removing Stopwords and Making Bigrams and Lemmatization

In [12]:
# Defining functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [13]:
# Removing Stop Words
data_words_nostops = remove_stopwords(data_words)

# Forming Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initializing the spaCy English model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Performing lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['say', 's', 'lie', 'merely', 'ask', 'today', 'conversation', 'te', 'global', 'curator', 'change', 'world', 'avoid', 'climate', 'disaster', 'talk', 'call', 'green', 'premium', 'lay', 'innovation', 'need', 'invest', 'share', 'young', 'generation', 'key', 'get', 'net', 'emission', 'also', 'love', 'burger', 'change', 'conversation', 'part', 'countdown', 'global', 'solution', 'climate', 'crisis', 'get', 'involve', 'countdown', 'head', 'com']]


### Creating the Dictionary and Corpus needed for Topic Modeling

In [14]:
# Creating Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Creating Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Viewing the Term Document Frequency
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 1), (8, 2), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)]]


In [15]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('also', 1),
  ('ask', 1),
  ('avoid', 1),
  ('burger', 1),
  ('call', 1),
  ('change', 2),
  ('climate', 2),
  ('com', 1),
  ('conversation', 2),
  ('countdown', 2),
  ('crisis', 1),
  ('curator', 1),
  ('disaster', 1),
  ('emission', 1),
  ('generation', 1),
  ('get', 2),
  ('global', 2),
  ('green', 1),
  ('head', 1),
  ('innovation', 1),
  ('invest', 1),
  ('involve', 1),
  ('key', 1),
  ('lay', 1),
  ('lie', 1),
  ('love', 1),
  ('merely', 1),
  ('need', 1),
  ('net', 1),
  ('part', 1),
  ('premium', 1),
  ('s', 1),
  ('say', 1),
  ('share', 1),
  ('solution', 1),
  ('talk', 1),
  ('te', 1),
  ('today', 1),
  ('world', 1),
  ('young', 1)]]
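To show where the `(id, count)` pairs come from, here is a minimal standard-library sketch of a bag-of-words encoding. It only approximates what `corpora.Dictionary` and `doc2bow` do; the helper `build_bow` and the two toy documents are hypothetical.

```python
from collections import Counter

def build_bow(docs):
    """Assign an integer id to every distinct token and count tokens per document."""
    token2id = {}
    corpus = []
    for doc in docs:
        counts = Counter(doc)
        # Sorting keeps the id assignment deterministic within each document
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        corpus.append(sorted((token2id[tok], n) for tok, n in counts.items()))
    return token2id, corpus

docs = [["climate", "change", "climate", "book"], ["book", "green", "premium"]]
token2id, corpus = build_bow(docs)
print(token2id)
print(corpus)
```

Word order is discarded; each document keeps only which words occur and how often.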

### Building the LDA Topic Model

**Latent Dirichlet Allocation (LDA)** is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. Here, the observations are words collected into documents, and each word's presence is attributed to one of the document's topics. Each document is assumed to contain only a small number of topics.
LDA is one of the most popular topic modeling methods.
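The generative story described above can be sketched with the standard library alone. This is an illustration of the assumed data-generating process, not of how Gensim fits the model; the two toy topics, the `alpha` value, and the helper names are all hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """LDA's generative story: pick a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet([alpha] * len(topics), rng)  # document-topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # topic for this word
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # word drawn from that topic
    return theta, words

# Two hand-written toy topics (illustrative only, not learned from the transcript)
toy_topics = [
    {"green": 0.5, "premium": 0.3, "price": 0.2},
    {"book": 0.4, "climate": 0.4, "future": 0.2},
]
rng = random.Random(100)
theta, words = generate_document(toy_topics, alpha=0.5, n_words=8, rng=rng)
print([round(t, 3) for t in theta], words)
```

Training runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's mixture.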

In [18]:
# Building the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1, # Determines how often the model parameters should be updated
                                           chunksize=10, # The number of documents to be used in each training chunk
                                           passes=10, # Total number of training passes
                                           alpha='auto',
                                           per_word_topics=True)

### Viewing the Topics in The LDA Model

In [19]:
# Printing the Keyword in the 5 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.043*"know" + 0.029*"s" + 0.026*"people" + 0.023*"lot" + 0.016*"get" + '
  '0.014*"use" + 0.013*"hard" + 0.012*"problem" + 0.012*"come" + '
  '0.012*"important"'),
 (1,
  '0.031*"thing" + 0.030*"make" + 0.029*"go" + 0.024*"offset" + 0.023*"look" + '
  '0.017*"want" + 0.016*"know" + 0.015*"actually" + 0.014*"well" + '
  '0.013*"see"'),
 (2,
  '0.027*"year" + 0.024*"say" + 0.017*"s" + 0.016*"get" + 0.015*"part" + '
  '0.014*"world" + 0.012*"young" + 0.011*"time" + 0.011*"innovation" + '
  '0.011*"generation"'),
 (3,
  '0.026*"green" + 0.025*"know" + 0.021*"get" + 0.016*"go" + 0.015*"re" + '
  '0.015*"emission" + 0.014*"today" + 0.013*"really" + 0.012*"price" + '
  '0.011*"say"'),
 (4,
  '0.024*"book" + 0.021*"think" + 0.019*"climate" + 0.018*"change" + '
  '0.018*"talk" + 0.017*"individual" + 0.017*"fund" + 0.015*"future" + '
  '0.014*"year" + 0.013*"term"')]


### Computing Model Perplexity and Coherence Score

In [25]:
# Computing Perplexity
print('\nPerplexity Score: ', lda_model.log_perplexity(corpus))  # per-word log-likelihood bound (higher is better); perplexity is 2**(-bound), where lower is better

# Computing Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity Score:  -6.895566757446175

Coherence Score:  nan


  m_lr_i = np.log(numerator / denominator)
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
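The first number printed above is the value returned by `log_perplexity`, which is a per-word likelihood bound and not the perplexity itself. As far as I know, Gensim reports perplexity as `2 ** (-bound)`, so the conversion can be sketched as follows (the variable names are my own):

```python
bound = -6.895566757446175   # value printed by lda_model.log_perplexity(corpus) above
perplexity = 2 ** (-bound)   # conversion Gensim uses when it logs a perplexity estimate
print(round(perplexity, 1))  # about 119
```

Since this is computed on the training corpus itself, it is an optimistic estimate and is best used only for comparing LDA configurations on the same data.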


Because I am dealing with a small corpus of documents, at least one of the top words of the trained topics ends up with a word frequency count of 0 in the texts used for the coherence computation. For that reason, the coherence model emits the warnings shown above and returns `nan`.

I will need to take this into consideration when evaluating the models in the written paper.
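The second warning line above points at a cosine similarity whose denominator can be zero. The sketch below illustrates, under the assumption that the final score is an average of per-segment similarities, how one undefined value is enough to turn the whole average into `nan`. The vectors are hypothetical and the code does not reproduce Gensim's `c_v` pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns NaN when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else float("nan")

# Hypothetical context vectors; the all-zero one stands for a word never seen in the texts
segment_scores = [cosine([1.0, 2.0], [2.0, 1.0]), cosine([0.0, 0.0], [2.0, 1.0])]

plain_mean = sum(segment_scores) / len(segment_scores)
finite = [s for s in segment_scores if not math.isnan(s)]
nan_aware_mean = sum(finite) / len(finite)

print(plain_mean)       # nan: a single undefined score propagates to the average
print(nan_aware_mean)   # average over the defined scores only
```

A NaN-aware average hides the problem rather than fixing it, so a `nan` coherence is better read as a sign that the corpus is too small for this measure.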


### Visualizing the Topics-Keywords

In [21]:
# Visualizing the topics using pyLDAvis package's interactive chart
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis

# Latent Semantic Analysis (LSA) Topic Modeling
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a natural language processing method that analyzes relationships between a set of documents and the terms they contain. It uses singular value decomposition (SVD), a matrix factorization technique, to find hidden relationships between terms and concepts in unstructured data.

All the preprocessing work done on the ted_talk dataset is still valid here. So, I can continue straight to the LSA model.
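To give a feel for what the decomposition extracts, here is a small pure-Python sketch that finds the strongest singular triplet of a toy term-document matrix by power iteration. It swaps in power iteration for the solver Gensim actually uses, and the matrix, the term list, and the function name `top_singular_triplet` are hypothetical.

```python
import math

def top_singular_triplet(A, iters=200):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # w = A v
        z = [sum(A[i][j] * w[i] for i in range(n_rows)) for j in range(n_cols)]  # z = A^T w
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    w = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in w))
    u = [x / sigma for x in w]
    return sigma, u, v

# Rows are terms, columns are documents (toy counts, not taken from the transcript)
terms = ["green", "premium", "offset", "carbon"]
A = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
    [0, 1, 2],
]
sigma, u, v = top_singular_triplet(A)
print(round(sigma, 3))
print(sorted(zip(terms, (round(x, 3) for x in u)), key=lambda t: -t[1]))
```

The entries of `u` play the role of the term weights that `print_topics` reports for one LSA topic, and `v` gives the matching weight of each document.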




  







Again, I can obtain the coherence score with Gensim. Let's see what the coherence score is for the LSA model with 5 topics (the same number of topics I initially chose for the LDA model, for comparison purposes).  
**Note** - Unlike LdaModel, LsiModel has no `log_perplexity` method for computing a perplexity score. So, I will drop the perplexity score and focus only on the coherence score.

In [30]:
lsi = LsiModel(corpus, num_topics=5, id2word=id2word, chunksize=10)

# Computing Coherence Score
coherence_model_lsi = CoherenceModel(model=lsi, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lsi = coherence_model_lsi.get_coherence()
print('\nCoherence Score: ', coherence_lsi)


Coherence Score:  nan


Again, the coherence score is `nan`, for the same reason described above.

### Performing SVD

In [32]:
# performing SVD on the bag of words with the LsiModel to extract 5 topics
lsi = LsiModel(corpus, num_topics=5, id2word=id2word)

In [33]:
# finding the 10 words with the strongest association to the derived topics
for topic_num, words in lsi.print_topics(num_words=10):
    print('Words in {}: {}.'.format(topic_num, words))

Words in 0: 0.413*"know" + 0.317*"get" + 0.302*"s" + 0.248*"go" + 0.216*"green" + 0.186*"make" + 0.160*"people" + 0.154*"thing" + 0.146*"year" + 0.142*"say".
Words in 1: -0.585*"know" + 0.279*"s" + -0.193*"people" + 0.173*"get" + 0.167*"green" + 0.155*"hydrogen" + 0.152*"energy" + 0.144*"make" + 0.121*"book" + -0.107*"hard".
Words in 2: -0.540*"offset" + -0.320*"look" + -0.226*"thing" + 0.194*"s" + -0.184*"company" + -0.165*"really" + -0.139*"actually" + -0.138*"way" + -0.136*"pay" + -0.133*"carbon".
Words in 3: 0.373*"green" + -0.287*"year" + -0.269*"get" + 0.220*"product" + 0.207*"thing" + 0.153*"premium" + -0.153*"say" + 0.146*"make" + -0.120*"way" + -0.112*"term".
Words in 4: -0.315*"s" + 0.287*"go" + 0.279*"get" + 0.218*"make" + -0.183*"offset" + 0.179*"thing" + -0.158*"premium" + -0.130*"buy" + -0.129*"green" + -0.124*"cost".
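The LSA weights above mix positive and negative signs. In an SVD the sign of each singular vector pair is arbitrary: flipping the signs of both the term vector and the document vector leaves the decomposition unchanged. The sketch below checks this on one hypothetical rank-1 component (all values are made up):

```python
def rank_one(sigma, u, v):
    """Rank-1 matrix sigma * u * v^T built from one singular triplet."""
    return [[sigma * ui * vj for vj in v] for ui in u]

sigma = 2.0
u = [0.6, -0.8]       # term loadings of one hypothetical topic
v = [0.0, 1.0, 0.0]   # matching document loadings

flipped_u = [-x for x in u]
flipped_v = [-x for x in v]

# Both sign conventions reconstruct exactly the same matrix
print(rank_one(sigma, u, v) == rank_one(sigma, flipped_u, flipped_v))
```

For interpretation, the magnitude of a weight matters, together with which words share a sign within a topic; a negative sign by itself does not mean a word is "against" the topic.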



# BERTopic

**BERTopic** is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
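The c-TF-IDF step can be sketched in a few lines. As I understand the BERTopic documentation, all documents of a cluster are joined into one "class document", and each term is scored as its frequency within the class multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per class and `f_t` is the term's frequency across all classes. The function `c_tf_idf` and the two toy clusters are hypothetical, and this is not BERTopic's own implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Class-based TF-IDF: term frequency inside a cluster, scaled by log(1 + A / f_t)."""
    class_counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of tokens per class
    avg_tokens = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_tokens = sum(counts.values())
        scores[c] = {
            term: (freq / n_tokens) * math.log(1 + avg_tokens / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

clusters = {
    0: "green premium product green price".split(),
    1: "young people climate book people".split(),
}
scores = c_tf_idf(clusters)
for c, term_scores in scores.items():
    print(c, sorted(term_scores, key=term_scores.get, reverse=True)[:2])
```

Words that are frequent inside one cluster but rare elsewhere receive the highest scores, which is what makes them usable as topic descriptions.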






Earlier in the project, preprocessing produced a final lemmatized dataset (named **data_lemmatized**) containing the words I have been working with. Now, I want to add those words to my dataframe, one list of words per row. After that, I will join the words back into sentences so that they can be encoded with the sentence-transformers model and passed to BERTopic.

In [34]:
# Adding new column to the dataframe (named: text_cleaned) 
# containing the different lemmatized words in each corresponding row. 
df['text_cleaned'] = data_lemmatized

In [35]:
# Function to make it back into a sentence 
def make_sentences(data,name):
    data[name]=data[name].apply(lambda x:' '.join([i+' ' for i in x]))
    # Removing double spaces if created
    data[name]=data[name].apply(lambda x:re.sub(r'\s+', ' ', x, flags=re.I))

In [36]:
# Converting all the texts back to sentences
make_sentences(df, 'text_cleaned')

In [37]:
df.head()

| | Sentences | Document# | text_cleaned |
|---|---|---|---|
| 0 | ["it said that's a lie I merely asked you on T... | 1 | say s lie merely ask today conversation te glo... |
| 1 | "Bill Gates cousin self an imperfect Messenger... | 2 | cousin self imperfect messenger climate high c... |
| 2 | "who is the author of the book when I come bac... | 3 | author book come back town thank start start t... |
| 3 | "you know it's pretty demanding it's not a 50%... | 4 | know pretty demanding reduction way scale pock... |
| 4 | "manufacturing including steel and cement peop... | 5 | manufacturing include steel cement people leas... |


### Importing a Pre-Trained Model from SentenceTransformer

In [38]:
# Getting a model
model=SentenceTransformer('all-MiniLM-L12-v2')

### Encoding the Preprocessed Text Data

In [39]:
embeddings = model.encode(df['text_cleaned'])

### Getting Topics Using BERTopic and SentenceTransformer Embeddings

In [51]:
model2 = BERTopic()
topics, probabilities = model2.fit_transform(df['text_cleaned'],embeddings)

In [52]:
# viewing how frequent certain topics are
model2.get_topic_freq().head()

| | Topic | Count |
|---|---|---|
| 0 | -1 | 34 |
| 1 | 0 | 14 |
| 2 | 1 | 11 |


Topic -1 collects all documents that were not assigned to any topic.
BERTopic does not force every document into a cluster. If no cluster can be found for a document, it is simply treated as an outlier.
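A quick calculation from the counts in the table above shows how large the outlier group is relative to the corpus (the variable names are my own):

```python
topic_counts = {-1: 34, 0: 14, 1: 11}   # counts from get_topic_freq() above

total_docs = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_docs
print(f"{total_docs} documents, {outlier_share:.1%} outliers")
# prints "59 documents, 57.6% outliers"
```

More than half of the documents fall outside both topics, which should be kept in mind when comparing BERTopic with LDA and LSA on this corpus.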

After generating topics and their probabilities, I can access the frequent topics that were generated.

In [58]:
model2.get_topic(0)

[('know', 0.07907707609430518),
 ('green', 0.06842145047879848),
 ('get', 0.06393522627030282),
 ('thing', 0.05675945699409091),
 ('product', 0.052837841742575185),
 ('go', 0.047782039786862365),
 ('re', 0.043394995848731285),
 ('make', 0.04217159336112181),
 ('start', 0.0418673957099745),
 ('even', 0.04038281079701216)]

In [59]:
model2.get_topic(1)

[('know', 0.10208146889594683),
 ('get', 0.07646596104042899),
 ('people', 0.07066159902726571),
 ('go', 0.05541516985202422),
 ('young', 0.05533023632355649),
 ('year', 0.054959021465651105),
 ('think', 0.053443387531400724),
 ('climate', 0.053443387531400724),
 ('time', 0.04768860217680837),
 ('thank', 0.04587810564124359)]

In [60]:
model2.get_topic(2)

False

In [61]:
model2.get_topics()

{-1: [('get', 0.04883814581387286),
  ('go', 0.046453559435876285),
  ('know', 0.04602250476205853),
  ('green', 0.044134917606143895),
  ('make', 0.042951437904072846),
  ('year', 0.0401164001431861),
  ('say', 0.03888080752001887),
  ('book', 0.03867278079549097),
  ('emission', 0.03314809782470655),
  ('carbon', 0.03173864782209457)],
 0: [('know', 0.07907707609430518),
  ('green', 0.06842145047879848),
  ('get', 0.06393522627030282),
  ('thing', 0.05675945699409091),
  ('product', 0.052837841742575185),
  ('go', 0.047782039786862365),
  ('re', 0.043394995848731285),
  ('make', 0.04217159336112181),
  ('start', 0.0418673957099745),
  ('even', 0.04038281079701216)],
 1: [('know', 0.10208146889594683),
  ('get', 0.07646596104042899),
  ('people', 0.07066159902726571),
  ('go', 0.05541516985202422),
  ('young', 0.05533023632355649),
  ('year', 0.054959021465651105),
  ('think', 0.053443387531400724),
  ('climate', 0.053443387531400724),
  ('time', 0.04768860217680837),
  ('thank', 0.04

With `get_topics()` I can view all the topics found across the documents.

The LDA model has some advantage over the other two models in terms of interpretability and the insights it gives into the themes discussed in the recorded session. The explanation of the results and the relevant conclusions are covered in the paper submitted alongside this code implementation.

 
 
### THANK YOU

![Thank You NLP](https://miro.medium.com/max/960/0*xLRsbQ02J7sQpNNy)


