# Project ML -2/2   Topic model on comments
--Dan Hua Li

Perform latent semantic analysis (LSA), non-negative matrix factorization (NMF), and latent Dirichlet allocation (LDA) using scikit-learn and Gensim on the ***Clean_comments dataset***. Set the number of topics = 10,3,5.  ( Try the top 10/20 words for each topic). Specific steps include:

- a.	scikit-learn for LSA, NMF, and LDA.
- b.	scikit-learn grid search for the best number of topics k for LDA, based on the perplexity score, with the grid of k = [2,3,4,5,6,7,8,10,12,14,16,18,20] (this search may take about 10-20 minutes).
- c.	Gensim for LSI, NMF, and LDA.
- d.	Gensim tqdm search for the best number of topics k for LDA, based on the c_v coherence score, with k ranging from 2 to 20 (consecutively).


#### Import data 'clean_comments'(clean it in excel)
clean Data: far more symbols need to be clean from Excel, especially:
- https, youtube.com, github, .com (impact frequent words)
- symbols from Emoticons (make insert data into python difficultly)
- different language..
- special symbols.


In [1]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')
#pip install --upgrade gensim

In [2]:
#pip install --upgrade gensim
import pandas as pd
# pd.options.display.max_rows = 100
import numpy as np
from time import time

textdf = pd.read_csv('Clean_comments-UTF8.csv',encoding='utf-8', sep=',')
textdf.info()
textdf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10217 entries, 0 to 10216
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  10217 non-null  object
dtypes: object(1)
memory usage: 79.9+ KB


Unnamed: 0,Reviews
0,Thanks fucken.\n\nI decided to go into Tech in...
1,Hello ken jee!!! I'm doing a graduation on Com...
2,Thanks fuck for
3,Great video!!! I started learning Python 8 mon...
4,Been watching hours ofucknow that it is not an...


In [3]:
import nltk
import string
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

def lemma_tokenizer(corpus): # a method to lemmatize corpus
    corpus = ''.join([ch for ch in corpus if ch not in string.punctuation]) # remove punctuation
    tokens = nltk.word_tokenize(corpus)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

nltk_stopwords = nltk.corpus.stopwords.words('english') # use nltk's English stopwords list
tf = CountVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tf_sparse = tf.fit_transform(textdf.Reviews)
tf_dictionary = tf.get_feature_names_out()
print(tf_dictionary)
tf_sparse

['0' '000' '00000001000' ... '…………' '›' '›\x8d']


<10217x11886 sparse matrix of type '<class 'numpy.int64'>'
	with 112108 stored elements in Compressed Sparse Row format>

In [4]:
nltk_stopwords.append('ha') # add 'ha' to stopword list for removal
nltk_stopwords.append('hey')
nltk_stopwords.append('wa')
nltk_stopwords.append('hi')
nltk_stopwords.append('Woo')
tf = CountVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tf_sparse = tf.fit_transform(textdf.Reviews)
tf_dictionary = tf.get_feature_names_out()
print(tf_dictionary)
tf_sparse

['0' '000' '00000001000' ... '…………' '›' '›\x8d']


<10217x11884 sparse matrix of type '<class 'numpy.int64'>'
	with 111211 stored elements in Compressed Sparse Row format>

In [5]:
tf_dense = tf_sparse.toarray() # convert sparse to dense matrix
pd.DataFrame(tf_dense, columns=tf_dictionary)

Unnamed: 0,0,000,00000001000,0015459375000001213,005,0050,00s,01,0100,0120,...,…,…watching,…¡,…¸,…û´,……,………,…………,›,›
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10212,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10213,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10214,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10215,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tfidf_sparse = tfidf.fit_transform(textdf.Reviews)
tfidf_dictionary = tfidf.get_feature_names_out()
tfidf_sparse

<10217x11884 sparse matrix of type '<class 'numpy.float64'>'
	with 111211 stored elements in Compressed Sparse Row format>

In [7]:
from sklearn.decomposition import TruncatedSVD
lsa = TruncatedSVD(n_components=10)
lsa

In [8]:
lsa_tf_topics = lsa.fit_transform(tf_sparse)
lsa_tf_topics.shape

(10217, 10)

In [9]:
lsa.components_.shape

(10, 11884)

In [10]:
# print top terms for each topic
def print_top_terms(model, vocabulary, n_top_terms):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([vocabulary[i]
                             for i in topic.argsort()[:-n_top_terms - 1:-1]])
        print(message)
    print()

print('LSA topics based on term-document matrix:')
print_top_terms(lsa, tf_dictionary, 10)

LSA topics based on term-document matrix:
Topic #0: data science ken video fuck im learning great would project
Topic #1: data science scientist machine degree job computer analyst master field
Topic #2: fuck im fucke ofuck hi get project learning much know
Topic #3: video fuck im really learning like great get would ofuck
Topic #4: im project really get learning would one like good fucke
Topic #5: project thanks data scientist one great learning would question model
Topic #6: science learning thanks project machine course computer master learn deep
Topic #7: great learning would thank machine really one ken like content
Topic #8: learning video thank much get like learn machine time fucke
Topic #9: thanks learning great machine get lot really scientist fucke content



In [11]:
# sklearn for non-negative matrix factorization (NMF)
from sklearn.decomposition import NMF
nmf = NMF(n_components=10, random_state=1, alpha_W=.1, l1_ratio=.5) # alpha_W and l1 related to regularization
nmf

In [24]:
nmf.fit_transform(tf_sparse)
# print('NMF topics based on term-document matrix:')
# print_top_terms(nmf, tf_dictionary, 10)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [22]:
nmf.fit_transform(tfidf_sparse)
# print('NMF topics based on tfidf matrix:')
# print_top_terms(nmf, tfidf_dictionary, 10)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [23]:
# sklearn for latent Dirichlet allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=10, random_state=1, learning_method='online', learning_offset=50.)
# lda

In [15]:
lda.fit_transform(tf_sparse)
print('LDA topics based on term-document matrix:')
print_top_terms(lda, tf_dictionary, 10)

LDA topics based on term-document matrix:
Topic #0: project use using like problem cool code deep idea model
Topic #1: topic bro recommendation list got 0 day life habit ¸
Topic #2: data science im ken fuck learning learn get course would
Topic #3: resume excited ‚ wait cant email necessary already sent 25
Topic #4: error getting tweet help line please first info element session
Topic #5: video ken great thanks fuck thank really much love content
Topic #6:  subscribed applying congrats „ 100k glad bring pas develop
Topic #7: nice youre lol ” seems check people na bit gon
Topic #8: comment company tell review like u 2 everyone ifuck see
Topic #9: link page discord name api episode column conversation drop indian



In [16]:
lda.fit_transform(tfidf_sparse)
print('LDA topics based on tfidf matrix:')
print_top_terms(lda, tfidf_dictionary, 10)

LDA topics based on tfidf matrix:
Topic #0: keyboard shirt eagerly john mouse kernel portfucken  snow de
Topic #1: pizza pc papaya monitor funny commenting dope hot pick hardware
Topic #2: honest discount annual lighting 365 titan super keen fuckenjee neural
Topic #3: helpfucks datacamp x excited period ì haircut soo mic voice
Topic #4: ¸ brother error thanx csv import wall sa object 115
Topic #5: ken video data thanks fuck great science thank really im
Topic #6:  10k plant biomedical keyword silver pycharm greate vomit woo
Topic #7: thumbnail lmao gem absolute wink beautiful exact epic stormbreaker musk
Topic #8: subscribed congrats road extra lol “ 100k sub remotely fast
Topic #9: name green stufucken discord tea brilliant link secret congratulation algo



In [17]:
# sklearn grid search for the best number of topics for LDA based on perplexity score
from sklearn.model_selection import GridSearchCV
param_grid = {'n_components': [3,5,8,10,15,18,20]}
lda = LatentDirichletAllocation(random_state=1, learning_method='online', learning_offset=50.)
lda_grid = GridSearchCV(lda, param_grid)
lda_grid.fit(tf_sparse)

In [18]:
print("Best model's params: ", lda_grid.best_params_)
print("Best log likelihood score: ", lda_grid.best_score_)
print("Model perplexity: ", lda_grid.best_estimator_.perplexity(tf_sparse))
cvresult = lda_grid.cv_results_
for mean_test_score, params in zip(cvresult['mean_test_score'], cvresult['params']):
    print(mean_test_score, params)
# the best number of topics is  based on sklearn grid search, which does not looks good 

Best model's params:  {'n_components': 3}
Best log likelihood score:  -203295.05947136274
Model perplexity:  1889.0721663769823
-203295.05947136274 {'n_components': 3}
-213036.8929452252 {'n_components': 5}
-220867.81692879763 {'n_components': 8}
-226860.10152153028 {'n_components': 10}
-236008.06206140685 {'n_components': 15}
-242457.09837598112 {'n_components': 18}
-247687.1072977081 {'n_components': 20}


In [19]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=3, random_state=1, learning_method='online', learning_offset=50.)
lda

In [21]:
lda.fit_transform(tfidf_sparse)
print('LDA topics based on tfidf matrix:')
print_top_terms(lda, tfidf_dictionary, 20) # 3 is terrible, i use component =20,

LDA topics based on tfidf matrix:
Topic #0: bro error name tweet helpfucks discord email line panda module call sound send file server keyboard brother pizza import sent
Topic #1: ken video data thanks fuck great science thank really im project much good hi like would learning hey fucke love
Topic #2: congrats subscribed 100k sub beast „ lol clock commenting congratulation notion gem super green template absolute lighting wall wink 50k



Wow, I am not sure this is what I need.
### Continue search for best topic
#### Gensim

In [27]:
pd.options.display.max_rows = 30
# data processing for gensim
t0 = time()
stopwds = nltk.corpus.stopwords.words('english')
stopwds.append('ha')  # add 'ha' to stopword list for removal
stopwds.append('wa')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def tokenize_docs(docs):
    tokened_docs = []
    for doc in docs:
        doc = doc.lower()
        doc_tokens = [token.strip() for token in wtk.tokenize(doc)]
        doc_tokens = [wnl.lemmatize(token) for token in doc_tokens if not token.isnumeric()]
        doc_tokens = [token for token in doc_tokens if len(token) > 1]
        doc_tokens = [token for token in doc_tokens if token not in stopwds]
        doc_tokens = list(filter(None, doc_tokens))
        if doc_tokens:
            tokened_docs.append(doc_tokens)
    return tokened_docs

tokened_docs = tokenize_docs(textdf.Reviews)
print("Computing time: %0.3f seconds." % (time() - t0))
print('\nTokenized documents:\n', tokened_docs[:20])  # shows a little

Computing time: 2.166 seconds.

Tokenized documents:
 [['thanks', 'fucken', 'decided', 'go', 'tech', 'learning', 'web', 'development', 'learnt', 'bit', 'ofucknowledge', 'front', 'end', 'web', 'development', 'helpful', 'web', 'scraping'], ['hello', 'ken', 'jee', 'graduation', 'computer', 'science', 'really', 'keen', 'learn', 'data', 'science', 'fuck', 'honest', 'excitement', 'learning', 'data', 'science', 'get', 'top', 'notch', 'watching', 'video'], ['thanks', 'fuck'], ['great', 'video', 'started', 'learning', 'python', 'month', 'ago', 'quickly', 'became', 'interested', 'background', 'fucked', 'fuck', 'investment', 'said', 'im', 'considering', 'boot', 'camp', 'even', 'master', 'would', 'love', 'get', 'opinion', 'thanks'], ['watching', 'hour', 'ofucknow', 'easy', 'fucke', 'pinnacle', 'give'], ['hey', 'ken', 'almost', 'fucket', 'research', 'marketing', 'going', 'fuck', 'sure', 'compatibility', 'osx', 'fuck', 'good', 'idea', 'better', 'stick', 'zbooks', 'similar', 'window', 'laptop'], ['ba

In [31]:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(tokened_docs)
bow_tf = [dictionary.doc2bow(word) for word in tokened_docs]
# print(dictionary)
# print('\n', list(dictionary.items())) 

#### Gensim for LSI

In [34]:
# gensim for latent semantic indexing (LSI)
from gensim.models.lsimodel import LsiModel
NumberOfTopics = 5
lsi = LsiModel(corpus=bow_tf, num_topics=NumberOfTopics, id2word=dictionary)
print(lsi)

LsiModel<num_terms=10232, num_topics=5, decay=1.0, chunksize=20000>


In [35]:
# print term-topic matrix (top 10 terms in each topic), the U matrix in X = U*S*V
for i, topic in lsi.print_topics(num_words=15):
    print('\nTopic', i)
    print(topic)


Topic 0
0.674*"data" + 0.413*"science" + 0.229*"ken" + 0.205*"video" + 0.197*"fuck" + 0.112*"learning" + 0.102*"great" + 0.097*"would" + 0.093*"project" + 0.091*"scientist" + 0.090*"thanks" + 0.082*"get" + 0.079*"really" + 0.079*"job" + 0.078*"hi"

Topic 1
0.445*"video" + -0.438*"data" + 0.418*"ken" + 0.334*"fuck" + -0.302*"science" + 0.224*"great" + 0.153*"thanks" + 0.123*"really" + 0.106*"project" + 0.100*"thank" + 0.095*"hi" + 0.079*"fucke" + 0.075*"hey" + 0.072*"like" + 0.067*"much"

Topic 2
0.660*"video" + -0.643*"fuck" + 0.245*"great" + -0.081*"fucke" + 0.070*"science" + -0.067*"ofuck" + 0.067*"data" + -0.061*"hi" + -0.059*"project" + -0.057*"learning" + -0.055*"get" + -0.051*"much" + -0.049*"ken" + -0.042*"learn" + -0.040*"fucking"

Topic 3
-0.773*"ken" + 0.357*"fuck" + 0.317*"video" + -0.170*"hi" + -0.136*"hey" + -0.113*"thanks" + -0.092*"science" + -0.087*"jee" + 0.082*"learning" + 0.080*"really" + 0.073*"like" + 0.073*"great" + 0.059*"would" + 0.058*"ofuck" + 0.056*"project"

In [37]:
# seperate the terms with positive weight from those with negative weight
for i in range(NumberOfTopics):
    print('Topic', i)
    print('-'*50)
    g1 = []
    g2 = []
    for term, wt in lsi.show_topic(i, topn=10):
        if wt >= 0: g1.append((term, round(wt, 3)))
        else:       g2.append((term, round(wt, 3)))
    print('Gropu +:', g1)
    print('-'*50)
    print('Group -:', g2)
    print('='*50)

Topic 0
--------------------------------------------------
Gropu +: [('data', 0.674), ('science', 0.413), ('ken', 0.229), ('video', 0.205), ('fuck', 0.197), ('learning', 0.112), ('great', 0.102), ('would', 0.097), ('project', 0.093), ('scientist', 0.091)]
--------------------------------------------------
Group -: []
Topic 1
--------------------------------------------------
Gropu +: [('video', 0.445), ('ken', 0.418), ('fuck', 0.334), ('great', 0.224), ('thanks', 0.153), ('really', 0.123), ('project', 0.106), ('thank', 0.1)]
--------------------------------------------------
Group -: [('data', -0.438), ('science', -0.302)]
Topic 2
--------------------------------------------------
Gropu +: [('video', 0.66), ('great', 0.245), ('science', 0.07), ('data', 0.067)]
--------------------------------------------------
Group -: [('fuck', -0.643), ('fucke', -0.081), ('ofuck', -0.067), ('hi', -0.061), ('project', -0.059), ('learning', -0.057)]
Topic 3
---------------------------------------------

In [38]:
# print document-topic matrix, the V matrix in X = U*S*V
from gensim.matutils import corpus2dense
term_topic = lsi.projection.u  # left singular vectors
singular_values = lsi.projection.s  # singular values
topic_document = (corpus2dense(lsi[bow_tf], len(singular_values)).T / singular_values).T 
                             # lsi[bow_tf] is right singular vectors
term_topic.shape, singular_values.shape, topic_document.shape

((10232, 5), (5,), (5, 10168))

In [39]:
document_topics = pd.DataFrame(np.round(topic_document.T, 4),
                               columns=['Topic '+str(i) for i in range(NumberOfTopics)])
document_topics.head()

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4
0,0.0032,0.004,-0.003,0.0014,0.0127
1,0.0363,-0.0053,0.0032,-0.0043,-0.0138
2,0.0028,0.0077,-0.0141,0.0056,-0.0117
3,0.0118,0.0218,-0.001,0.0217,0.0078
4,0.0013,0.0019,-0.0018,0.0019,0.0035


#### Gensim (LDA)

In [40]:
# gensim for latent Dirichlet allocation (LDA)
from gensim.models.ldamodel import LdaModel
lda_model = LdaModel(corpus=bow_tf, num_topics=NumberOfTopics, id2word=dictionary, random_state=1)
print(lda_model)

LdaModel<num_terms=10232, num_topics=5, decay=0.5, chunksize=2000>


In [48]:
# print document-topic matrix, the theta parameter in LDA model
topic_document_theta = corpus2dense(lda_model.get_document_topics(bow_tf), NumberOfTopics)
document_topic_theta_df = pd.DataFrame(np.round(topic_document_theta.T, 4),
                                       columns=['Topic '+str(i) for i in range(NumberOfTopics)])
# document_topic_theta_df.head()

In [42]:
# get the topics with the highest coherence score based on top 10 terms
topics_coherences = lda_model.top_topics(bow_tf, topn=10)  # default coherence='u_mass'

# compute the average score for the top topics
avg_coherence_score = np.mean([item[1] for item in topics_coherences])
print('Average coherence score:', avg_coherence_score)

Average coherence score: -2.365187803749801


In [44]:
topics_with_wts = [item[0] for item in topics_coherences]

In [47]:
topics_df = pd.DataFrame([[(term, round(wt, 3)) for (wt, term) in topic] for topic in topics_with_wts],
                           columns=['Term '+str(i) for i in range(0, 10)],
                           index=['Topic '+str(t) for t in range(0, lda_model.num_topics)]).T

pd.set_option('display.max_colwidth', None)
topics_df = pd.DataFrame([', '.join([term for wt, term in topic]) for topic in topics_with_wts],
                         columns=['Terms per Topic'],
                         index=['Topic '+str(t) for t in range(0, 5)])
topics_df

Unnamed: 0,Terms per Topic
Topic 0,"thanks, fuck, project, like, ken, thank, fucking, video, fucke, much"
Topic 1,"data, science, learning, course, get, ken, master, scientist, hey, would"
Topic 2,"video, ken, great, thanks, fuck, hi, data, fucke, hey, good"
Topic 3,"thank, tweet, ken, error, one, got, fuck, get, help, please"
Topic 4,"video, thank, much, time, ken, fuck, great, channel, work, code"


In [49]:
# compute four different coherence scores based on all terms
from gensim.models import CoherenceModel
cv_coherence_model_lda = CoherenceModel(model=lda_model,
                                        corpus=bow_tf, 
                                        texts=tokened_docs, 
                                        dictionary=dictionary, 
                                        coherence='c_v')
print('Average coherence score (c_v):', cv_coherence_model_lda.get_coherence())

umass_coherence_model_lda = CoherenceModel(model=lda_model,
                                           corpus=bow_tf,
                                           texts=tokened_docs,
                                           dictionary=dictionary,
                                           coherence='u_mass')
print('Average coherence score (u_mass):', umass_coherence_model_lda.get_coherence())

uci_coherence_model_lda = CoherenceModel(model=lda_model,
                                         corpus=bow_tf,
                                         texts=tokened_docs,
                                         dictionary=dictionary,
                                         coherence='c_uci')
print('Average coherence score (c_uci):', uci_coherence_model_lda.get_coherence())

npmi_coherence_model_lda = CoherenceModel(model=lda_model,
                                          corpus=bow_tf,
                                          texts=tokened_docs,
                                          dictionary=dictionary,
                                          coherence='c_npmi')
print('Average coherence score (c_npmi):', npmi_coherence_model_lda.get_coherence())

perplexity = lda_model.log_perplexity(bow_tf)
print('Model perplexity:', perplexity)

Average coherence score (c_v): 0.45430698132842046
Average coherence score (u_mass): -2.9950420939880367
Average coherence score (c_uci): -0.10625597414531388
Average coherence score (c_npmi): 0.019166505933121046
Model perplexity: -7.763722252439627


In [50]:
# gensim tqdm search for the best number of topics for LDA based on coherence score
from tqdm import tqdm
def topic_model_coherence_generator(corpus, texts, dictionary,
                                    start_topic_count, end_topic_count, step):
    models = []
    coherence_scores = []
    for number_of_topics in tqdm(range(start_topic_count, end_topic_count+1, step)):
        lda_model = LdaModel(corpus=bow_tf,
                             num_topics=number_of_topics,
                             id2word=dictionary,
                             random_state=1)
        coherence_model_lda = CoherenceModel(model=lda_model,
                                             corpus=bow_tf, 
                                             texts=tokened_docs, 
                                             dictionary=dictionary, 
                                             coherence='c_v')
        coherence_score = coherence_model_lda.get_coherence()
        coherence_scores.append(coherence_score)
        models.append(lda_model)
    return models, coherence_scores

lda_models, coherence_scores = topic_model_coherence_generator(corpus=bow_tf,
                                                               texts=tokened_docs,
                                                               dictionary=dictionary,
                                                               start_topic_count=2,
                                                               end_topic_count=20,
                                                               step=1)

100%|██████████████████████████████████████████████████████████████████████████| 19/19 [05:34<00:00, 17.60s/it]


In [51]:
# display the search results
coherence_df = pd.DataFrame({'Number of Topics': range(2, 21, 1), #(start_topic_ct, end_topic_ct+1, step)
                             'Coherence Score': np.round(coherence_scores, 4)})
coherence_df.sort_values(by=['Coherence Score'], ascending=False)

Unnamed: 0,Number of Topics,Coherence Score
2,4,0.4794
0,2,0.4607
1,3,0.4558
3,5,0.4543
5,7,0.4249
6,8,0.416
4,6,0.4099
13,15,0.4037
8,10,0.3947
15,17,0.3942


4 topics should be the best. But it seems not suitable for these dataset which are the comments for the same subject of data science.

In [None]:
-end