# **Latent Dirichlet Allocation(LDA) for 20 News Group**

**Topic:** Data Scientist Intern at RBC<br>
**Name:** Xu(Shawn) Zhang<br>
**Date:** May 11, 2020

In [0]:
#@title Import { display-mode: "form" }

from sklearn.datasets import fetch_20newsgroups
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np

import random
import nltk

random.seed(2019)
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Load the Dataset**

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.<br>

**sklearn.datasets.fetch_20newsgroups**, returns a list of the raw texts that can be fed to text feature extractors with custom parameters so as to extract feature vectors. 

In [0]:
# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

### **Preprocess the Dataset**

**1. Stemming**

stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root formâ€”generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.<br>

The three major stemming algorithms in use today are Porter, Snowball(Porter2), and Lancaster (Paice-Husk). Where Porter is the most commonly used stemmer and one of the most gentle stemmers; Snowball is regarded as an improvement over porter and it has slightly faster computation time than porter, with a fairly large community around it. Lancaster is the fastest algorithm here, and will reduce your working set of words hugely, so it is a very aggressive stemming algorithm, sometimes to a fault.

**2. lemmatization**

In many languages, words appear in several inflected forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.<br>

Wordnet is an large, freely and publicly available lexical database for the English language aiming to establish structured semantic relationships between words. It offers lemmatization capabilities as well and is one of the earliest and most commonly used lemmatizers.

**3. Tokenization**

Tokenization is the process of tokenizing or splitting a string, text into a list of tokens. One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph.

In gensim, we can perform tokenization using **gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)**:
- **doc:** the document to be tokenized
- **deacc:** de-accents
- **min_len:** minimum length of token
- **max_len:** maximum length of token

In [0]:
# Word stemming and lemmatization
def lemmatize_stemming(text):
    stemmer = SnowballStemmer("english")
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Text preprocessing
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result


# Get preprocessed dataset
processed_docs = []

for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))

### **Create Bag of Words(BOW) Representation of the Dataset**

**Class gensim.corpora.Dictionary(documents=None, prune_at=2000000)**

Dictionary encapsulates the mapping between normalized words and their integer ids.

**Methods:**<br>
**1. Dictionary.filter_extremes(no_below=10, no_above=0.7, keep_n=100000)**
- **no_below=10** means word appears less than 10 documents
- **no_above=0.7** means word appears more than 70% documents
- **keep_n=100000** means keep only the first 100000 most frequent tokens

**2. Dictionary.doc2bow(document, allow_update=False, return_missing=False)**
- **document:** document to be converted into BOW
- **allow_update:** if allow_update is set, update dictionary and document frequencies in the process.
- **return_missing:** if true, also return words not in dictionary


In [0]:
# Create word dictionary / word2id
dictionary = gensim.corpora.Dictionary(processed_docs)

# Filter out low frequency and high frequency words
dictionary.filter_extremes(no_below=10, no_above=0.7, keep_n=100000)

# Create Bag of Words(BOW) representation of dataset
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Preview BOW for our sample preprocessed document
document_num = 1
bow_doc_x = bow_corpus[document_num]

for i in range(10):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0],
                                                     dictionary[bow_doc_x[i][0]],
                                                     bow_doc_x[i][1]))

Word 11 ("host") appears 1 time.
Word 20 ("nntp") appears 1 time.
Word 22 ("post") appears 1 time.
Word 29 ("thank") appears 1 time.
Word 31 ("univers") appears 1 time.
Word 34 ("acceler") appears 1 time.
Word 35 ("adapt") appears 1 time.
Word 36 ("answer") appears 1 time.
Word 37 ("articl") appears 1 time.
Word 38 ("attain") appears 1 time.


### **Build and Train LDA Model**

We can build LDA model using **gensim.models.LdaMulticore** API, it using all CPU cores to parallelize and speed up model training.
- **corpus:** the input corpus for LDA model
- **id2word:** mapping from word IDs to words. 
- **num_topics:** the number of topics, to compare LDA and ETM, set to the same as num_topics in ETM
- **chunksize:** number of documents to load into memory at a time and process E step of EM, we can save some memory using the smaller chunksize, but will be doing multiple loading/processing steps prior to moving onto the M step, in practice we can set 100 to consider speed and memory.
- **passes:** number of passes through the corpus during training, in practice set to 10 to get fairly good results.
- **iterations:** number of iterations of EM algorithm, default value is 50, to compare with ETM(which set epochs=1000, where epochs stands for the number times that the learning algorithm will work through the entire training dataset), we set iterations=1000. But notics that different model may has it's own optimal epochs.

Below are some of examples to show the effect of chunksize and passes: <br>
chunksize = 100k, corpus = 1M docs, passes =1 : 10 updates total<br>
chunksize = 100k, corpus = 1M docs, passes =2 : 20 updates total<br>

In [0]:
# Build and train LDA model for 20 topics
lda_model_20 = gensim.models.LdaMulticore(corpus=bow_corpus,
                                          id2word=dictionary,
                                          num_topics=20, 
                                          random_state=100,
                                          chunksize=100,
                                          iterations=50,
                                          passes=10)

In [0]:
# Build and train LDA model for 50 topics
lda_model_50 = gensim.models.LdaMulticore(corpus=bow_corpus,
                                          id2word=dictionary,
                                          num_topics=50, 
                                          random_state=100,
                                          chunksize=100,
                                          iterations=50,
                                          passes=10)

In [0]:
# Build and train LDA model for 100 topics
lda_model_100 = gensim.models.LdaMulticore(corpus=bow_corpus,
                                           id2word=dictionary,
                                           num_topics=100, 
                                           random_state=100,
                                           chunksize=100,
                                           iterations=50,
                                           passes=10)

  diff = np.log(self.expElogbeta)


In [0]:
# Build and train LDA model for 300 topics
lda_model_300 = gensim.models.LdaMulticore(corpus=bow_corpus,
                                           id2word=dictionary,
                                           num_topics=300, 
                                           random_state=100,
                                           chunksize=100,
                                           iterations=50,
                                           passes=10)

  diff = np.log(self.expElogbeta)


### **Visualize Results**

#### **1. For 20 topics**

For each topic, we will explore the words occuring in that topic and its relative weight.

In [0]:
# Visualize the results
topics_20 = []
for idx, topic in lda_model_20.print_topics(num_topics=10, num_words=10):
    topics_20.append(topic)
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 16 
Words: 0.024*"sale" + 0.020*"washington" + 0.013*"price" + 0.012*"sell" + 0.010*"offer" + 0.010*"leagu" + 0.009*"year" + 0.009*"host" + 0.009*"nntp" + 0.009*"univers"


Topic: 1 
Words: 0.030*"christian" + 0.016*"believ" + 0.016*"jesus" + 0.016*"exist" + 0.015*"faith" + 0.013*"write" + 0.012*"bibl" + 0.012*"evid" + 0.011*"religion" + 0.010*"atheist"


Topic: 11 
Words: 0.020*"state" + 0.014*"nation" + 0.013*"presid" + 0.011*"american" + 0.010*"year" + 0.009*"public" + 0.009*"announc" + 0.009*"clinton" + 0.008*"group" + 0.008*"issu"


Topic: 19 
Words: 0.024*"write" + 0.019*"articl" + 0.013*"bike" + 0.013*"uiuc" + 0.011*"drive" + 0.010*"engin" + 0.010*"colorado" + 0.009*"car" + 0.009*"like" + 0.008*"umich"


Topic: 18 
Words: 0.045*"space" + 0.027*"nasa" + 0.018*"research" + 0.016*"scienc" + 0.015*"orbit" + 0.013*"sphere" + 0.013*"earth" + 0.011*"launch" + 0.011*"center" + 0.009*"moon"


Topic: 0 
Words: 0.020*"peopl" + 0.013*"like" + 0.013*"think" + 0.012*"reason" + 0.010*"k

#### **2. For 50 topics**

For each topic, we will explore the words occuring in that topic and its relative weight.

In [0]:
# Visualize the results
topics_50 = []
for idx, topic in lda_model_50.print_topics(num_topics=10, num_words=10):
    topics_50.append(topic)
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 29 
Words: 0.105*"netcom" + 0.033*"write" + 0.032*"homosexu" + 0.028*"articl" + 0.026*"cool" + 0.024*"water" + 0.020*"guest" + 0.019*"heat" + 0.019*"communic" + 0.019*"cycl"


Topic: 40 
Words: 0.029*"go" + 0.027*"think" + 0.026*"say" + 0.025*"know" + 0.022*"peopl" + 0.022*"come" + 0.019*"time" + 0.018*"like" + 0.016*"tell" + 0.012*"happen"


Topic: 49 
Words: 0.022*"weapon" + 0.021*"gun" + 0.020*"access" + 0.018*"right" + 0.017*"crime" + 0.016*"state" + 0.015*"firearm" + 0.015*"arm" + 0.014*"columbia" + 0.013*"amend"


Topic: 22 
Words: 0.066*"technolog" + 0.064*"caltech" + 0.061*"institut" + 0.055*"keith" + 0.048*"california" + 0.037*"pasadena" + 0.032*"keyboard" + 0.026*"motto" + 0.025*"post" + 0.024*"host"


Topic: 4 
Words: 0.036*"like" + 0.030*"problem" + 0.029*"good" + 0.025*"time" + 0.022*"thing" + 0.018*"help" + 0.017*"know" + 0.016*"work" + 0.014*"need" + 0.013*"think"


Topic: 35 
Words: 0.051*"michael" + 0.049*"lebanes" + 0.036*"articl" + 0.033*"write" + 0.027*"henri

#### **2. For 100 topics**

For each topic, we will explore the words occuring in that topic and its relative weight.

In [0]:
# Visualize the results
topics_100 = []
for idx, topic in lda_model_100.print_topics(num_topics=10, num_words=10):
    topics_100.append(topic)
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 57 
Words: 0.122*"colorado" + 0.059*"buffalo" + 0.055*"vote" + 0.042*"state" + 0.039*"cursor" + 0.036*"thier" + 0.033*"boulder" + 0.032*"prize" + 0.031*"revers" + 0.028*"remind"


Topic: 25 
Words: 0.049*"sourc" + 0.041*"entri" + 0.029*"section" + 0.028*"code" + 0.028*"page" + 0.026*"number" + 0.025*"contain" + 0.025*"follow" + 0.022*"includ" + 0.020*"inform"


Topic: 41 
Words: 0.058*"time" + 0.037*"advic" + 0.036*"week" + 0.036*"hour" + 0.032*"father" + 0.029*"long" + 0.028*"doctor" + 0.027*"wait" + 0.026*"acn" + 0.025*"island"


Topic: 93 
Words: 0.210*"space" + 0.139*"nasa" + 0.067*"orbit" + 0.051*"launch" + 0.039*"mission" + 0.034*"shuttl" + 0.032*"satellit" + 0.032*"station" + 0.030*"flight" + 0.023*"astro"


Topic: 60 
Words: 0.101*"bank" + 0.098*"pitt" + 0.079*"univ" + 0.059*"gordon" + 0.055*"soon" + 0.039*"widget" + 0.037*"surveil" + 0.037*"pittsburgh" + 0.036*"articl" + 0.033*"repli"


Topic: 42 
Words: 0.118*"group" + 0.058*"umich" + 0.052*"newsgroup" + 0.040*"engin" 

#### **3. For 300 topics**


For each topic, we will explore the words occuring in that topic and its relative weight.

In [0]:
# Visualize the results
topics_300 = []
for idx, topic in lda_model_300.print_topics(num_topics=10, num_words=10):
    topics_300.append(topic)
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 241 
Words: 0.226*"ignor" + 0.134*"chanc" + 0.078*"upenn" + 0.065*"visibl" + 0.062*"maintain" + 0.050*"quiet" + 0.047*"solv" + 0.046*"threaten" + 0.046*"blame" + 0.040*"threat"


Topic: 175 
Words: 0.000*"jolla" + 0.000*"pythagorean" + 0.000*"ggrrrrrr" + 0.000*"marcus" + 0.000*"mcguir" + 0.000*"piss" + 0.000*"jay" + 0.000*"unto" + 0.000*"padr" + 0.000*"danc"


Topic: 120 
Words: 0.311*"say" + 0.047*"know" + 0.046*"fact" + 0.044*"come" + 0.037*"convinc" + 0.034*"refer" + 0.027*"like" + 0.026*"tell" + 0.026*"reveal" + 0.025*"scriptur"


Topic: 283 
Words: 0.171*"attack" + 0.132*"soldier" + 0.073*"civilian" + 0.061*"brad" + 0.060*"kill" + 0.058*"blood" + 0.047*"effort" + 0.044*"territori" + 0.044*"border" + 0.040*"northern"


Topic: 254 
Words: 0.174*"jam" + 0.125*"texa" + 0.113*"austin" + 0.103*"purdu" + 0.095*"utexa" + 0.080*"univers" + 0.078*"lanc" + 0.050*"roman" + 0.035*"cult" + 0.031*"timothi"


Topic: 214 
Words: 0.381*"opinion" + 0.118*"express" + 0.092*"dead" + 0.069*"disc

### **Model Evaluation**

**1. Calculate Topic Coherence(TC)**

Gensim provides Topic Coherence(TC) API to calculate TC, the details of different TC calculation methods can be found here:https://palmetto.demos.dice-research.org/<br>

The TC measurement mentioned in the ETM paper: https://arxiv.org/pdf/1907.04907.pdf is NPMI, we use this measurement to calculate TC of LDA is shown below:

In [0]:
from gensim.models import CoherenceModel

# Get Topic Coherence(TC) for 20 topics
coherence_model_lda_20 = CoherenceModel(model=lda_model_20, texts=processed_docs, dictionary=dictionary, coherence='c_npmi')
coherence_lda_20 = coherence_model_lda_20.get_coherence()

print("The Topic Coherence(TC) of LDA model for 20 topics calculated using API is:", coherence_lda_20)

The Topic Coherence(TC) of LDA model for 20 topics calculated using API is: 0.00185423528902863


In [0]:
from gensim.models import CoherenceModel

# Get Topic Coherence(TC) for 50 topics
coherence_model_lda_50 = CoherenceModel(model=lda_model_50, texts=processed_docs, dictionary=dictionary, coherence='c_npmi')
coherence_lda_50 = coherence_model_lda_50.get_coherence()

print("The Topic Coherence(TC) of LDA model for 50 topics calculated using API is:", coherence_lda_50)

The Topic Coherence(TC) of LDA model for 50 topics calculated using API is: -0.06824679916267129


In [0]:
from gensim.models import CoherenceModel

# Get Topic Coherence(TC) for 100 topics
coherence_model_lda_100 = CoherenceModel(model=lda_model_100, texts=processed_docs, dictionary=dictionary, coherence='c_npmi')
coherence_lda_100 = coherence_model_lda_100.get_coherence()

print("The Topic Coherence(TC) of LDA model for 100 topics calculated using API is:", coherence_lda_100)

The Topic Coherence(TC) of LDA model for 100 topics calculated using API is: -0.11313425676892294


In [0]:
from gensim.models import CoherenceModel

# Get Topic Coherence(TC) for 100 topics
coherence_model_lda_300 = CoherenceModel(model=lda_model_300, texts=processed_docs, dictionary=dictionary, coherence='c_npmi')
coherence_lda_300 = coherence_model_lda_300.get_coherence()

print("The Topic Coherence(TC) of LDA model for 300 topics calculated using API is:", coherence_lda_300)

The Topic Coherence(TC) of LDA model for 300 topics calculated using API is: -0.2118525558847258


To compare with ETM paper, I implememnt TC from scratch as the same way as the code provided by ETM paper, and the implementation details and corresponding result is shown below:

In [0]:
import string

# Filter out punctuations
def contains_punctuation(w):
    return any(char in string.punctuation for char in w)

# Filter out numeric values
def contains_numeric(w):
    return any(char.isdigit() for char in w)

# Document frequency
def D_w(word, corpus):
  D_w = 0
  for i in range(len(corpus)):
    if word in corpus[i]:
      D_w += 1
  return D_w

# Get Topic Coherence(TC)
def topic_coherence(corpus, topics):
  
  # Select top-10 most likely words in each topic
  words = [None] * len(topics)
  for i in range(len(topics)):
    topic = topics[i]
    topic = [w.lower() for w in topic if not contains_punctuation(w)] 
    topic = [w for w in topic if not contains_numeric(w)]
    topic = "".join(topic)
    topic = topic.split(sep=' ')
    topic = [w for w in topic if w != '']
    words[i] = topic[0:11]
  
  # Calculate topic coherence
  D = len(corpus)
  TC = []
  for k in range(len(words)):
    TC_k = 0
    counter = 0
    word_k = words[k]
    
    for i in range(10):
      w_i = word_k[i]
      tmp = 0

      for j in range(i+1, 10):
        w_j = word_k[j]
        D_wi = D_w(w_i, corpus)
        D_wj = D_w(w_j, corpus)
        # Joint document frequency
        D_wi_wj = 0
        for i in range(len(corpus)):
          if (w_i in corpus[i]) and (w_j in corpus[i]):
            D_wi_wj += 1

        if D_wi_wj == 0:
          f_wi_wj = -1
        else:
          f_wi_wj = -1 + (np.log(D_wi)+np.log(D_wj)-2.0*np.log(D)) / (np.log(D_wi_wj)-np.log(D))
        tmp += f_wi_wj
        counter += 1
      
      TC_k += tmp
    
    TC.append(TC_k)
    TC = np.mean(TC) / counter

    return TC

In [0]:
# Topic Coherence(TC) for 20 topics
topic_coherence_20 = topic_coherence(corpus=processed_docs, topics=topics_20)
print("The Topic Coherence(TC) for 20 topics of LDA model calculated from scratch is:", topic_coherence_20)

The Topic Coherence(TC) for 20 topics of LDA model calculated from scratch is: -0.0018815240063567594


In [0]:
# Topic Coherence(TC) for 50 topics
topic_coherence_50 = topic_coherence(corpus=processed_docs, topics=topics_50)
print("The Topic Coherence(TC) for 50 topics of LDA model calculated from scratch is:", topic_coherence_50)

The Topic Coherence(TC) for 50 topics of LDA model calculated from scratch is: 0.147


In [0]:
# Topic Coherence(TC) for 100 topics
topic_coherence_100 = topic_coherence(corpus=processed_docs, topics=topics_100)
print("The Topic Coherence(TC) for 100 topics of LDA model calculated from scratch is:", topic_coherence_100)

The Topic Coherence(TC) for 100 topics of LDA model calculated from scratch is: 0.142


In [0]:
# Topic Coherence(TC) for 50 topics
topic_coherence_300 = topic_coherence(corpus=processed_docs, topics=topics_300)
print("The Topic Coherence(TC) for 300 topics of LDA model calculated from scratch is:", topic_coherence_300)

The Topic Coherence(TC) for 300 topics of LDA model calculated from scratch is: 0.108


**2. Calculate Topic Diversity(TD)**

This part I implemented the same calculation method of Topic Diversity(TD) with ETM, from TD perspect, LDA get better results compare to ETM on topic diversity.

In [0]:
import string

# Filter out punctuations
def contains_punctuation(w):
    return any(char in string.punctuation for char in w)

# Filter out numeric values
def contains_numeric(w):
    return any(char.isdigit() for char in w)

In [0]:
"""
TD for 20 topics
"""
# Get all words in generated 20 topics 
topic_words_20 = []
for i in range(len(topics_20)):
  topic = topics_20[i]
  topic = [w.lower() for w in topic if not contains_punctuation(w)] 
  topic = [w for w in topic if not contains_numeric(w)]
  topic = "".join(topic)
  topic = topic.split(sep=' ')
  topic = [w for w in topic if w != '']
  topic_words_20.extend(topic)

# Get all unique words for 50 topics 
unique_words_20 = []
for w in topic_words_20:
  if w not in unique_words_20:
    unique_words_20.append(w)

# Calculate Topic Diversity(TD)
TD_20 = len(unique_words_20) / len(topic_words_20)

print("The Topic Diversity(TD) for 20 topics of LDA model is: ", TD_20)

The Topic Diversity(TD) for 20 topics of LDA model is:  0.87


In [0]:
"""
TD for 50 topics
"""
# Get all words in generated 50 topics 
topic_words_50 = []
for i in range(len(topics_50)):
  topic = topics_50[i]
  topic = [w.lower() for w in topic if not contains_punctuation(w)] 
  topic = [w for w in topic if not contains_numeric(w)]
  topic = "".join(topic)
  topic = topic.split(sep=' ')
  topic = [w for w in topic if w != '']
  topic_words_50.extend(topic)

# Get all unique words for 50 topics 
unique_words_50 = []
for w in topic_words_50:
  if w not in unique_words_50:
    unique_words_50.append(w)

# Calculate Topic Diversity(TD)
TD_50 = len(unique_words_50) / len(topic_words_50)

print("The Topic Diversity(TD) for 50 topics of LDA model is: ", TD_50)

The Topic Diversity(TD) for 50 topics of LDA model is:  0.947


In [0]:
"""
TD for 100 topics
"""
# Get all words in generated 50 topics 
topic_words_100 = []
for i in range(len(topics_100)):
  topic = topics_100[i]
  topic = [w.lower() for w in topic if not contains_punctuation(w)] 
  topic = [w for w in topic if not contains_numeric(w)]
  topic = "".join(topic)
  topic = topic.split(sep=' ')
  topic = [w for w in topic if w != '']
  topic_words_100.extend(topic)

# Get all unique words
unique_words_100 = []
for w in topic_words_100:
  if w not in unique_words_100:
    unique_words_100.append(w)

# Calculate Topic Diversity(TD)
TD_100 = len(unique_words_100) / len(topic_words_100)

print("The Topic Diversity(TD) for 100 topics of LDA model is: ", TD_100)

The Topic Diversity(TD) for 100 topics of LDA model is:  0.991


In [0]:
"""
TD for 300 topics
"""
# Get all words in generated 50 topics 
topic_words_300 = []
for i in range(len(topics_300)):
  topic = topics_300[i]
  topic = [w.lower() for w in topic if not contains_punctuation(w)] 
  topic = [w for w in topic if not contains_numeric(w)]
  topic = "".join(topic)
  topic = topic.split(sep=' ')
  topic = [w for w in topic if w != '']
  topic_words_300.extend(topic)

# Get all unique words
unique_words_300 = []
for w in topic_words_300:
  if w not in unique_words_300:
    unique_words_300.append(w)

# Calculate Topic Diversity(TD)
TD_300 = len(unique_words_300) / len(topic_words_300)

print("The Topic Diversity(TD) for 300 topics of LDA model is: ", TD_300)

The Topic Diversity(TD) for 300 topics of LDA model is:  0.996


### **Query On New Dataset**

Query the model using test set.

In [0]:
# Word stemming and lemmatization
def lemmatize_stemming(text):
    stemmer = SnowballStemmer("english")
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Text preprocessing
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result

# Get preprocessed dataset
test_docs = []
for doc in newsgroups_test.data[0:100]:
    test_docs.append(preprocess(doc))

# Create word dictionary / word2id
dictionary = gensim.corpora.Dictionary(test_docs)

# Filter out low frequency and high frequency words
dictionary.filter_extremes(no_below=10, no_above=0.7, keep_n=100000)

# Create Bag of Words(BOW) representation of dataset
bow_test = [dictionary.doc2bow(doc) for doc in test_docs]

# Query on test docs
vector = lda_model_20[bow_test[0]]
print("Generated vectors are: \n", vector)

Generated vectors are: 
 [(0, 0.07990225), (3, 0.13209473), (7, 0.3069345), (14, 0.17250848), (16, 0.26168504)]
