# First example
1. Defining the corpus

In [1]:
corpus = ["cryptography can be used for preventing data leakage in computer security",
"supervised learning and unsupervised learning are the two main groups of methods in machine learning",
"while in supervised learning we have access to the target variable in unsupervised learning we do not have such a variable",
"there are some methods in security for reducing the risk of information leakage like authentication and cryptography",
"topic modeling in an unsupervised machine learning model and therefore we do not have target variables"
]

2. Preprocessing the corpus

As the next step, we want to preprocess the corpus. One crucial aspect of preprocessing is removing the stop words. As mentioned in [this Wikipedia entry](https://en.wikipedia.org/wiki/Stop_word), stop words are filtered out before or after processing natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all-natural language processing tools. In our example, it seems that the following is a good initial candidate for the stop words list.

In [2]:
stop_words = ["can","be","for","two","the","for","we","in","not","do",\
              "are","to","an","there","some","have","a","and","of","like","while","therefore","such"]

Defining a function to remove the stop words and make the docs lower case:

In [3]:
def clean_doc(doc):
    return " ".join([word for word in doc.lower().split() if word not in stop_words])

In [4]:
clean_doc(corpus[0])

'cryptography used preventing data leakage computer security'

In [5]:
corpus_clean = [clean_doc(doc) for doc in corpus]

In [6]:
corpus_clean

['cryptography used preventing data leakage computer security',
 'supervised learning unsupervised learning main groups methods machine learning',
 'supervised learning access target variable unsupervised learning variable',
 'methods security reducing risk information leakage authentication cryptography',
 'topic modeling unsupervised machine learning model target variables']

3. Tokenization

For the next phase, we need to split the docs in the corpus to a list of words

In [7]:
corpus_clean = [doc.split() for doc in corpus_clean]

In [8]:
corpus_clean

[['cryptography',
  'used',
  'preventing',
  'data',
  'leakage',
  'computer',
  'security'],
 ['supervised',
  'learning',
  'unsupervised',
  'learning',
  'main',
  'groups',
  'methods',
  'machine',
  'learning'],
 ['supervised',
  'learning',
  'access',
  'target',
  'variable',
  'unsupervised',
  'learning',
  'variable'],
 ['methods',
  'security',
  'reducing',
  'risk',
  'information',
  'leakage',
  'authentication',
  'cryptography'],
 ['topic',
  'modeling',
  'unsupervised',
  'machine',
  'learning',
  'model',
  'target',
  'variables']]

4. Creating the dictionary

Now, we want to create a dictionary which more having than the words, has an id  assigned to each word. For this, gensim can help us as the following:

In [9]:
from gensim import corpora
dictionary = corpora.Dictionary(corpus_clean)

In [10]:
for id, word in dictionary.iteritems():
    print(id, word)

0 computer
1 cryptography
2 data
3 leakage
4 preventing
5 security
6 used
7 groups
8 learning
9 machine
10 main
11 methods
12 supervised
13 unsupervised
14 access
15 target
16 variable
17 authentication
18 information
19 reducing
20 risk
21 model
22 modeling
23 topic
24 variables


4. Creating M1

Now we want to create M1. In other words, we want a model to tell us for each document and each word what is the frequency of the word in the document. We call these models [bag of words (bow)](https://en.wikipedia.org/wiki/Bag-of-words_model). If we call the *doc2bow* method of the dictionary on a document, it will give us the document's bow.

In [11]:
dictionary.doc2bow(corpus_clean[2]) 

[(8, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2)]

From the above output, we can conclude that word with the id of 8 (learning) has been repeated two times in document 3 (corpus_clean\[2\]). So now we can make M1:

In [12]:
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]

In [13]:
M1

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(7, 1), (8, 3), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)],
 [(8, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 2)],
 [(1, 1), (3, 1), (5, 1), (11, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
 [(8, 1), (9, 1), (13, 1), (15, 1), (21, 1), (22, 1), (23, 1), (24, 1)]]

5. Creating and training th LDA model

In [14]:
import gensim
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(M1, num_topics=2, id2word = dictionary, passes=3,random_state =0)

The first three arguments of Lda are clear. Regarding the other arguments, note that Lda may not be able to find optimal topics initially. Therefore, we can give it the possibility to go over the corpus for more than one pass. Usually, more passes increase the model's quality, however after some point, it converges, and therefore more passes do not give better topic models. Regarding the last parameter, regard that the Lda implementation of gensim has a certain degree of randomness. Therefore, there is no guarantee that you get the same topics each time you run the model. By fixing the random_state to some fixed number, we will be sure that the results will be the same after each run. I fixed this so that you also get the same results of this notebook, and we can discuss the results. 

Now, we have the model and M2 and M3 has been generated. Let us see M2:

In [15]:
ldamodel.print_topics()

[(0,
  '0.098*"learning" + 0.067*"cryptography" + 0.067*"security" + 0.067*"leakage" + 0.066*"methods" + 0.042*"unsupervised" + 0.041*"computer" + 0.041*"data" + 0.041*"supervised" + 0.041*"used"'),
 (1,
  '0.120*"learning" + 0.087*"unsupervised" + 0.083*"target" + 0.077*"variable" + 0.055*"machine" + 0.053*"supervised" + 0.052*"topic" + 0.052*"variables" + 0.052*"modeling" + 0.052*"model"')]

The output shows that for topic 0 (the first topic), cryptography has the weight of 0.067, and for topic 1(the second topic), learning has the weight of 0.12. You can also use the show_topic method of the model to focus on each topic. You can pass two parameters to it. The first one is the topic number, and the second one is the number of top words of the topic which you want to get the weights of:

In [16]:
ldamodel.show_topic(topicid=1,topn=5)

[('learning', 0.12041809),
 ('unsupervised', 0.08701295),
 ('target', 0.0833895),
 ('variable', 0.076944105),
 ('machine', 0.055268224)]

6. Evaluating the model

To evaluate the model, the first step is to check the generated topics and see whether they make sense. In the above example, it seems that the model did a good job in finding the security and machine learning topics. Note that interpretation of the name of the topic is on us. It seems that we can call the first topic computer security and the second one machine learning

Another step is to test the model on some unseen documents to see whether it would be successful in finding each topic's weight in the document. For this, let us try it on the first paragraph of [this page](https://en.wikipedia.org/wiki/Computer_security)

In [17]:
doc = """Computer security, cybersecurity or information technology security (IT security) is the protection of computer 
systems and networks from information disclosure, theft of or damage to their hardware, software, or electronic data,
as well as from the disruption or misdirection of the services they provide"""

In [18]:
doc_clean = clean_doc(doc).split()

In [19]:
doc_bow = dictionary.doc2bow(doc_clean)

In [20]:
ldamodel[doc_bow]

[(0, 0.91121024), (1, 0.08878979)]

It is shown that the model has assigned the weight of 0.91 for the first topic, which is the computer security topic which confirms that the model is doing a good job on unseen data also.

Another step that can help in evaluating the model is to visualize it. To do this, the optimal solution is to use the pyLDAvis package as follows:

In [21]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, M1, dictionary)
vis

This figure enables you to see the topics in a two-dimensional space. Circles in this figure represent the topics. If you hover on the words, you can see their weights.

The other step for evaluating a topic model is computing a quantitative metric for comparing the different topic models that we can make on a specific corpus.  We will discuss one of these metrics, which is the coherence score, in the following example.  

# Second example
For this example, I will use paragraphs of [this page](https://www.sigsac.org/ccs/CCS2020/proceedings.html) as  the documents of a corpus. This is the proceedings of the CCS 2020 conference. We can directly scrap the page or save its text in a file and use that file. We will follow the second approach for simplicity here. I have saved the text in the _ccs2020.corpus_ file. Let us read the corpus first.

In [21]:
corpus = [doc for doc in open('ccs2020.corpus', encoding='utf-8') if len(doc) > 80]
len(corpus)

291

We have limited the corpus to the paragraphs  which have the length of at least 80 so that the authors' names are not included in the corpus

In [22]:
corpus[0]

'Tor exit blocking, in which websites disallow clients arriving from Tor, is a growing and potentially existential threat to the anonymity network. This paper introduces HebTor, a new and robust architecture for exit bridges---short-lived proxies that serve as alternative egress points for Tor. A key insight of HebTor is that exit bridges can operate as Tor onion services, allowing any device that can create outbound TCP connections to serve as an exit bridge, regardless of the presence of NATs and/or firewalls. HebTor employs a micropayment system that compensates exit bridge operators for their services, and a privacy-preserving reputation scheme that prevents freeloading. We show that HebTor effectively thwarts server-side blocking of Tor, and we describe the security, privacy, and legal implications of our design.\n'

Now we follow the same steps as the previous example

In [23]:
corpus_clean = [clean_doc(doc).split() for doc in corpus]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamodel.LdaModel
lda_model = Lda(M1, num_topics=5, id2word = dictionary, passes=5,random_state =0)
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)

(0, '0.015*"that" + 0.008*"on" + 0.008*"this" + 0.007*"is" + 0.007*"our" + 0.005*"by" + 0.005*"as" + 0.005*"it" + 0.004*"from" + 0.004*"attack"')
(1, '0.012*"that" + 0.011*"is" + 0.009*"on" + 0.006*"this" + 0.006*"security" + 0.006*"our" + 0.005*"with" + 0.005*"data" + 0.004*"as" + 0.004*"or"')
(2, '0.013*"that" + 0.012*"is" + 0.010*"with" + 0.009*"on" + 0.009*"as" + 0.008*"by" + 0.008*"this" + 0.007*"our" + 0.006*"security" + 0.005*"which"')
(3, '0.015*"that" + 0.009*"on" + 0.009*"this" + 0.009*"is" + 0.008*"our" + 0.008*"by" + 0.008*"security" + 0.006*"with" + 0.005*"new" + 0.004*"protocol"')
(4, '0.007*"that" + 0.005*"with" + 0.005*"our" + 0.004*"is" + 0.004*"on" + 0.004*"as" + 0.003*"by" + 0.003*"from" + 0.003*"show" + 0.003*"at"')


As you see in the above example, most of the topics extracted do not make sense. We can quantitatively see this also by computing the coherence score of the model. To do that, we should make a coherence model as it follows:

In [24]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.2176289704922158


Let us see whether we can improve this score. As you can guess one problem is with the preprocessing step. One initial step is to extend the stopwords. We can for example use the stopwords of the english language as our stop words list. To do that one method is to use the stopwords of the nltk pacakge.

In [25]:
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [26]:
len(stop_words)

179

So let us now repeat the example with these stop words.

In [27]:
corpus_clean = [clean_doc(doc).split() for doc in corpus]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamodel.LdaModel
lda_model = Lda(M1, num_topics=5, id2word = dictionary, passes=5,random_state =0)
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)

(0, '0.006*"security" + 0.005*"data" + 0.005*"system" + 0.004*"new" + 0.003*"code" + 0.003*"model" + 0.003*"propose" + 0.003*"attacks" + 0.003*"existing" + 0.003*"show"')
(1, '0.007*"security" + 0.004*"data" + 0.004*"attack" + 0.003*"key" + 0.003*"attacks" + 0.003*"code" + 0.003*"privacy" + 0.003*"using" + 0.002*"learning" + 0.002*"number"')
(2, '0.007*"protocol" + 0.006*"security" + 0.004*"secure" + 0.004*"using" + 0.003*"show" + 0.003*"protocols" + 0.003*"new" + 0.003*"attacks" + 0.003*"also" + 0.003*"present"')
(3, '0.004*"privacy" + 0.004*"new" + 0.004*"security" + 0.003*"analysis" + 0.003*"secure" + 0.003*"set" + 0.003*"data" + 0.003*"first" + 0.003*"workshop" + 0.003*"censorship"')
(4, '0.007*"security" + 0.005*"attacks" + 0.004*"attack" + 0.004*"new" + 0.003*"analysis" + 0.003*"show" + 0.003*"approach" + 0.003*"devices" + 0.003*"however," + 0.003*"using"')


The results are better now. Let us see whether the coherence score has also improved : 

In [28]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.2959806780538898


We can see an improvement of around 8 percent in the score. Let us see whether we can improve it more. one thing that we can do is to extend the list of stopwords as it follows:

In [29]:
stop_words = set(stop_words)
stop_words.update(set(["attack","new","security","first","however","ha",\
                       "protocols","privacy","paper","also","new","eg",\
                       "secure","system","approach","key","using","zk",\
                       "present","user","show","attack","attacks","workshop"\
                       "paper","et","propose","two","per","paper,",\
                       "data","study","al.","wang","zhang","however,"
                      ]))

In [30]:
len(stop_words)

212

For example, the intuition behind adding "paper" to the stop words is that we know that it is repeated in most of the corpus documents as it is a conference proceeding. "Security" has been added as it is a security conference, and many of the documents have it; therefore, it can not help in identifying the topics. More than extending the stopwords, let us also focus on another aspect of the preprocessing and see whether we can improve that part. We can add [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) to preprocessing, for which I am going to provide you a function to use it out of the box. For the lemmatization, we will use the _lemmatize_sentence_ function defined below. It is not necessary to understand the details of it at this stage, and you can use it out of the box. It is borrowed from [this page](https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258) with a little bit of modification.

In [31]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            # lemmatized_sentence.append(word)
            pass# This part is modified so that we will just have ADJ VERB NOUN ADVERB remained 
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [32]:
print(lemmatize_sentence("I am loving it")) #I be love it

be love


Now we can modify the clean_doc function as it follows:

In [33]:
def clean_doc(doc):
    lemmatized_doc = lemmatize_sentence(doc).lower().split()
    stop_free_lemmatized_doc = " ".join([word for word in lemmatized_doc if (word not in stop_words and len(word) > 3)])
    return stop_free_lemmatized_doc

now let us see how much does it affect our model

In [34]:
corpus_clean = [clean_doc(doc).split() for doc in corpus]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamodel.LdaModel
lda_model = Lda(M1, num_topics=5, id2word = dictionary, passes=5,random_state =0)
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

(0, '0.005*"input" + 0.005*"client" + 0.005*"proof" + 0.005*"browser" + 0.005*"message" + 0.004*"protocol" + 0.004*"model" + 0.004*"code" + 0.004*"adversarial" + 0.004*"allow"')
(1, '0.008*"patch" + 0.006*"model" + 0.006*"protocol" + 0.005*"code" + 0.005*"skill" + 0.004*"group" + 0.004*"demonstrate" + 0.004*"provide" + 0.004*"scheme" + 0.004*"vulnerability"')
(2, '0.006*"mechanism" + 0.005*"analysis" + 0.005*"method" + 0.005*"domain" + 0.005*"vulnerability" + 0.004*"e.g." + 0.004*"software" + 0.004*"kernel" + 0.004*"provide" + 0.004*"include"')
(3, '0.008*"model" + 0.008*"device" + 0.005*"contract" + 0.004*"distribution" + 0.004*"detection" + 0.003*"different" + 0.003*"large" + 0.003*"input" + 0.003*"metric" + 0.003*"network"')
(4, '0.017*"protocol" + 0.005*"application" + 0.005*"analysis" + 0.004*"result" + 0.004*"code" + 0.004*"make" + 0.004*"technique" + 0.004*"work" + 0.004*"provide" + 0.004*"proof"')
Coherence Score:  0.30974244506568543


The topics make more sense, and the coherence score has improved. One other aspect that can improve the model is to add the bigrams and even trigrams to the model. To get a notion of how this can help to enhance the model consider two documents. One with five occurrences of the "computer science" combination and the other one with the five repetitions of computer and science that happens separately in the text. Our current model can not make a difference between these two documents. So, let us make the bigrams and trigrams and add them to the model. For this, gensim phrases can help us. 

In [35]:
bigram = gensim.models.Phrases(corpus_clean)

let us see how this bigraming can affect our corpus. 

In [36]:
set(bigram[corpus_clean[8]])-set(corpus_clean[8])

{'deep_neural'}

In [37]:
set(corpus_clean[8])-set(bigram[corpus_clean[8]])

{'deep', 'neural'}

As you can see by using this bigram model, frequent bigrams have been detected. The same way you can make trigrams also.

In [38]:
trigram = gensim.models.Phrases(bigram[corpus_clean])

Now let us apply it to our model and see how does it affect it 

In [39]:
corpus_clean = [trigram[bigram[doc]] for doc in corpus_clean]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamodel.LdaModel
lda_model = Lda(M1, num_topics=5, id2word = dictionary, passes=5,random_state =0)
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

(0, '0.007*"vulnerability" + 0.007*"protocol" + 0.004*"device" + 0.004*"exploit" + 0.004*"model" + 0.004*"technique" + 0.004*"base" + 0.004*"target" + 0.003*"proof" + 0.003*"make"')
(1, '0.007*"model" + 0.005*"provide" + 0.004*"code" + 0.004*"large" + 0.004*"method" + 0.004*"framework" + 0.004*"analysis" + 0.003*"exist" + 0.003*"pets" + 0.003*"domain"')
(2, '0.005*"analysis" + 0.005*"protocol" + 0.005*"code" + 0.005*"application" + 0.004*"scheme" + 0.004*"malicious" + 0.004*"introduce" + 0.004*"result" + 0.004*"host" + 0.003*"technique"')
(3, '0.015*"protocol" + 0.006*"model" + 0.005*"network" + 0.004*"provide" + 0.004*"client" + 0.004*"device" + 0.004*"input" + 0.004*"technique" + 0.004*"analysis" + 0.003*"achieve"')
(4, '0.007*"patch" + 0.006*"mechanism" + 0.006*"application" + 0.005*"protocol" + 0.005*"model" + 0.004*"test" + 0.004*"call" + 0.003*"cloud" + 0.003*"include" + 0.003*"base"')
Coherence Score:  0.31012864342450797


As it has been discussed in our meeting, gensim Lda can be used in multicore mode. It can save a lot of time for us while processing big datasets. The only change you need to do is instantiate the lda model from gensim.models.ldamulticore.LdaMulticore and pass the number of workers, which is the number of cpu cores which you want to use, to it:

In [40]:
corpus_clean = [trigram[bigram[doc]] for doc in corpus_clean]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamulticore.LdaMulticore
lda_model = Lda(M1, num_topics=10, id2word = dictionary, passes=10,random_state =0,workers=4)
topics = lda_model.print_topics(num_topics=5, num_words=10)
for topic in topics:
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

(8, '0.007*"algorithm" + 0.007*"skill" + 0.006*"vulnerability" + 0.005*"platform" + 0.005*"work" + 0.004*"model" + 0.004*"kernel" + 0.004*"verify" + 0.004*"database" + 0.004*"implement"')
(3, '0.014*"protocol" + 0.006*"device" + 0.006*"setting" + 0.005*"technique" + 0.005*"design" + 0.005*"provide" + 0.005*"network" + 0.005*"generate" + 0.004*"email" + 0.004*"honeypot"')
(5, '0.006*"software" + 0.005*"scheme" + 0.005*"provide" + 0.005*"base" + 0.005*"message" + 0.004*"result" + 0.004*"proof" + 0.004*"device" + 0.004*"vulnerability" + 0.004*"analysis"')
(9, '0.008*"provide" + 0.008*"method" + 0.007*"model" + 0.007*"pets" + 0.006*"network" + 0.005*"webauthn" + 0.005*"service" + 0.005*"private" + 0.005*"analysis" + 0.005*"authenticator"')
(6, '0.008*"distribution" + 0.007*"find" + 0.007*"test" + 0.005*"search" + 0.005*"utility_metric" + 0.005*"captcha" + 0.005*"developer" + 0.005*"efficient" + 0.005*"increase" + 0.005*"error"')
Coherence Score:  0.3207110052344472


Due to the differences in the internal  implementations of the two modes of Lda, we can expect that the results will not be the same. To continue our discussion, now let us focus on a bigger dataset which is the [20 newsgroups text dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). First let us load it:

# Third example

In [41]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_data = fetch_20newsgroups(subset='train')
corpus = newsgroups_data.data

In [42]:
corpus[:2]

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

First, we test our latest model on it:

In [43]:
stop_words = set(stopwords.words('english'))
corpus_clean = [clean_doc(doc).split() for doc in corpus]
corpus_clean = [trigram[bigram[doc]] for doc in corpus_clean]
dictionary = corpora.Dictionary(corpus_clean)
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamulticore.LdaMulticore
lda_model = Lda(M1, num_topics=8, id2word = dictionary, passes=15,random_state =0,workers=4)
topics = lda_model.print_topics(num_topics=8, num_words=10)
for topic in topics:
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

(0, '0.011*"subject" + 0.010*"lines" + 0.010*"organization" + 0.008*"window" + 0.006*"write" + 0.004*"nntp-posting-host" + 0.004*"file" + 0.004*"university" + 0.004*"problem" + 0.004*"server"')
(1, '0.011*"subject" + 0.011*"lines" + 0.011*"organization" + 0.008*"game" + 0.007*"team" + 0.007*"write" + 0.006*"university" + 0.006*"year" + 0.006*"article" + 0.006*"nntp-posting-host"')
(2, '0.011*"people" + 0.007*"write" + 0.007*"think" + 0.006*"make" + 0.006*"know" + 0.005*"subject" + 0.005*"article" + 0.005*"right" + 0.005*"organization" + 0.005*"lines"')
(3, '0.009*"chip" + 0.009*"drive" + 0.008*"encryption" + 0.007*"clipper" + 0.005*"scsi" + 0.004*"government" + 0.004*"system" + 0.004*"subject" + 0.004*"organization" + 0.004*"lines"')
(4, '0.013*"subject" + 0.013*"lines" + 0.012*"organization" + 0.008*"write" + 0.006*"article" + 0.006*"university" + 0.006*"nntp-posting-host" + 0.005*"good" + 0.005*"know" + 0.004*"work"')
(5, '0.005*"turkish" + 0.004*"armenian" + 0.004*"stephanopoulos" +

The result does not seem promising(E.g., "lines" and "subject" words which do not seem to be topic keywords, are repeated in multiple topics). The topics also do not make sense.  We guess that the problem is with the stop words list, and we should extend it. However, for this big data dataset, it is not easy to find all the stop words. In these cases, the [filter_extremes](https://radimrehurek.com/gensim/corpora/dictionary.html) method of the dictionary object can help us. This method, by getting the no_below and no_above parameters, can remove some of the entries from the dictionary as per the following:
<br>filter_extremes removes all tokens in the dictionary that are:
-  Less frequent than no_below documents (absolute number, e.g., 5) or
-  More frequent than no_above documents (fraction of the total corpus size, e.g., 0.3).

In [44]:
stop_words = set(stopwords.words('english'))
corpus_clean = [clean_doc(doc).split() for doc in corpus]
corpus_clean = [trigram[bigram[doc]] for doc in corpus_clean]
dictionary = corpora.Dictionary(corpus_clean)
dictionary.filter_extremes(no_below=20, no_above=0.1) 
M1 = [dictionary.doc2bow(doc) for doc in corpus_clean]
Lda = gensim.models.ldamulticore.LdaMulticore
lda_model = Lda(M1, num_topics=8, id2word = dictionary, passes=15,random_state =0,workers=4)
topics = lda_model.print_topics(num_topics=8, num_words=10)
for topic in topics:
    print(topic)
coherence_model_lda = CoherenceModel(model=lda_model, texts=corpus_clean, dictionary=dictionary)
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

(0, '0.014*"jesus" + 0.007*"bible" + 0.007*"church" + 0.006*"christian" + 0.006*"christ" + 0.005*"life" + 0.005*"word" + 0.005*"love" + 0.004*"hell" + 0.004*"christians"')
(1, '0.010*"israel" + 0.008*"president" + 0.007*"israeli" + 0.005*"government" + 0.004*"fire" + 0.004*"kill" + 0.004*"report" + 0.004*"child" + 0.004*"today" + 0.004*"talk"')
(2, '0.006*"bike" + 0.004*"weapon" + 0.004*"government" + 0.004*"little" + 0.003*"power" + 0.003*"crime" + 0.003*"insurance" + 0.003*"keep" + 0.003*"firearm" + 0.003*"money"')
(3, '0.013*"game" + 0.012*"team" + 0.009*"space" + 0.008*"play" + 0.007*"player" + 0.005*"hockey" + 0.005*"season" + 0.004*"league" + 0.003*"launch" + 0.003*"division"')
(4, '0.009*"armenian" + 0.008*"armenians" + 0.007*"turkish" + 0.007*"keith" + 0.004*"drive" + 0.004*"armenia" + 0.004*"engine" + 0.004*"drug" + 0.004*"homosexual" + 0.004*"light"')
(5, '0.011*"file" + 0.008*"drive" + 0.007*"program" + 0.007*"card" + 0.006*"window" + 0.006*"windows" + 0.006*"version" + 0.00

In [49]:
vis = pyLDAvis.gensim.prepare(lda_model, M1, dictionary)
vis

  and should_run_async(code)


Now the results make more sense, and the coherence score has improved a little bit more. Now you should be able to use this code as an initial template for your topic models. In case that you face any problems, you can extend this template. As an example, you can change the number of topics and passes, limit the dictionary, extend the preprocessing module (e.g., removing the emails and web addresses, [stemming the text](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/), ...), use [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), ...

Good Luck!