# Introduction

**What is gensim?**

Gensim is billed as a Natural Language Processing package that does ‘Topic Modeling for Humans’. But in practice it is much more than that.

If you are unfamiliar with topic modeling, it is a technique to extract the underlying topics from large volumes of text. Gensim provides algorithms like LDA and LSI (which we will see later in this post) and the necessary sophistication to build high-quality topic models.

You may argue that topic models and word embeddings are available in other packages such as scikit-learn and R. But the breadth and scope of gensim’s facilities for building and evaluating topic models are hard to match, and it adds many convenient facilities for text processing.

It is a great package for processing texts, working with word vector models (such as Word2Vec, FastText etc) and for building topic models.

Another significant advantage of gensim is that it lets you handle large text files without having to load the entire file in memory.

This post intends to give a practical overview of nearly all the major features, explained in a simple and easy-to-understand way.

By the end of this tutorial, you will know:
* What are the core concepts in gensim?
* What are a dictionary and a corpus, why do they matter, and where are they used?
* How to create and work with a dictionary and a corpus?
* How to load and work with text data from multiple text files in a memory-efficient way
* Create topic models with LDA and interpret the outputs
* Create TFIDF model, bigrams, trigrams, Word2Vec model, Doc2Vec model
* Compute similarity metrics

And much more.

# What is a Dictionary and Corpus?

In order to work on text documents, Gensim requires the words (aka tokens) be converted to unique ids. In order to achieve that, Gensim lets you create a Dictionary object that maps each word to a unique id.

So, how do you create a `Dictionary`? By converting your text/sentences to a list of words and passing it to `corpora.Dictionary()`.

We will see how to actually do this in the next section.

But why is the dictionary object needed and where can it be used?

The dictionary object is typically used to create a ‘bag of words’ Corpus. It is this Dictionary and the bag-of-words (Corpus) that are used as inputs to topic modeling and other models that Gensim specializes in.

Alright, what sort of text inputs can gensim handle? The input text typically comes in 3 different forms:
* As sentences stored in python’s native list object
* As one single text file, small or large.
* In multiple text files.

Now, when your text input is large, you need to be able to create the dictionary object without having to load the entire text file.

The good news is Gensim lets you read the text and update the dictionary, one line at a time, without loading the entire text file into system memory. Let’s see how to do that in the next 2 sections.

But, before we get in, let’s understand some NLP jargon.

A ‘token’ typically means a ‘word’. A ‘document’ can typically refer to a ‘sentence’ or ‘paragraph’ and a ‘corpus’ is typically a ‘collection of documents as a bag of words’. That is, for each document, a corpus contains each word’s id and its frequency count in that document. As a result, information of the order of words is lost.
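To make the jargon concrete, here is a minimal plain-Python sketch of how a tokenized document becomes a bag of words. It is only an illustration of the idea, not gensim’s implementation: the toy document, the `token2id` mapping and the `to_bow` helper are names I made up for this example.

```python
from collections import Counter

# Hypothetical toy document, already split into tokens
document = ["the", "dog", "chased", "the", "cat"]

# Give each distinct token an integer id, in order of first appearance
token2id = {}
for token in document:
    token2id.setdefault(token, len(token2id))


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; the word order is discarded."""
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())


bow = to_bow(document, token2id)
print(token2id)  # {'the': 0, 'dog': 1, 'chased': 2, 'cat': 3}
print(bow)       # [(0, 2), (1, 1), (2, 1), (3, 1)]
```

Reversing or shuffling the tokens gives exactly the same bag of words, which is what "the order of words is lost" means in practice.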

# How to create a Dictionary from a list of sentences?

In gensim, the dictionary maps every word (token) to its unique id.

You can create a dictionary from a paragraph of sentences, from a text file that contains multiple lines of text and from multiple such text files contained in a directory. For the second and third cases, we will do it without loading the entire file into memory so that the dictionary gets updated as you read the text line by line.

Let’s start with the ‘List of sentences’ input.

When you have multiple sentences, you need to convert each sentence to a list of words. A list comprehension is a common way to do this.

In [1]:
import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that", 
                "the operation was carried out without clearance and", 
                "transparency and that those involved will be held", 
                "responsible. One of the sources acknowledged that the", 
                "report is still being prepared and cautioned that", 
                "things could change."]

# Tokenize(split) the sentences into words
texts = [[text for text in doc.split()] for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)

Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


As it says, the dictionary has 33 unique tokens (or words). Let’s see the unique ids for each of these tokens.

In [2]:
# Show the word to id map
print(dictionary.token2id)

{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32}


We have successfully created a Dictionary object. Gensim will use this dictionary to create a bag-of-words corpus where the words in the documents are replaced with their respective ids provided by this dictionary.

If you get new documents in the future, it is also possible to **update an existing dictionary** to include the new words.

In [3]:
documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts_2 = [[text for text in doc.split()] for doc in documents_2]

dictionary.add_documents(texts_2)

In [5]:
# If you check now, the dictionary should have been updated with the new words (tokens).
print(dictionary)
print()
print(dictionary.token2id)

Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)

{'Saudis': 0, 'The': 1, 'a': 2, 'acknowledge': 3, 'are': 4, 'preparing': 5, 'report': 6, 'that': 7, 'will': 8, 'Jamal': 9, "Khashoggi's": 10, 'Saudi': 11, 'an': 12, 'death': 13, 'journalist': 14, 'of': 15, 'result': 16, 'the': 17, 'was': 18, 'intended': 19, 'interrogation': 20, 'lead': 21, 'one': 22, 'to': 23, 'went': 24, 'wrong,': 25, 'Turkey,': 26, 'abduction': 27, 'according': 28, 'from': 29, 'his': 30, 'sources.': 31, 'two': 32, 'graph': 33, 'in': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'Graph': 38, 'IV': 39, 'Widths': 40, 'and': 41, 'minors': 42, 'ordering': 43, 'quasi': 44, 'well': 45, 'A': 46, 'survey': 47}


## How to create a Dictionary from one or more text files?

You can also create a dictionary from a text file or from a directory of text files.

The below example reads a file line-by-line and uses gensim’s simple_preprocess to process one line of the file at a time.

The advantage here is that it lets you read an entire text file without loading the whole file into memory at once.

Let’s use a sample.txt file to demonstrate this.

In [10]:
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

# Create gensim dictionary from a single text file
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in open(r'C:\Users\saurabhkumar9\1. NLP Practicum Materials\simple.txt', encoding='utf-8'))

# Token to Id map
dictionary.token2id

{'army': 0,
 'china': 1,
 'chinese': 2,
 'force': 3,
 'liberation': 4,
 'of': 5,
 'people': 6,
 'recently': 7,
 'recruited': 8,
 'rocket': 9,
 'tank': 10,
 'technicians': 11,
 'the': 12,
 'think': 13,
 'companies': 14,
 'daily': 15,
 'from': 16,
 'on': 17,
 'pla': 18,
 'private': 19,
 'reported': 20,
 'saturday': 21,
 'and': 22,
 'appointment': 23,
 'at': 24,
 'ceremony': 25,
 'experts': 26,
 'founding': 27,
 'hao': 28,
 'letters': 29,
 'other': 30,
 'received': 31,
 'science': 32,
 'technology': 33,
 'zhang': 34,
 'according': 35,
 'by': 36,
 'defense': 37,
 'national': 38,
 'panel': 39,
 'published': 40,
 'report': 41,
 'to': 42,
 'as': 43,
 'fellow': 44,
 'his': 45,
 'honored': 46,
 'will': 47,
 'conduct': 48,
 'design': 49,
 'fields': 50,
 'into': 51,
 'like': 52,
 'members': 53,
 'overall': 54,
 'research': 55,
 'serve': 56,
 'which': 57,
 'five': 58,
 'for': 59,
 'launching': 60,
 'missile': 61,
 'missiles': 62,
 'network': 63,
 'system': 64,
 'years': 65,
 'counterparts': 66,
 '
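To see why this approach is memory friendly, here is a rough plain-Python sketch of what building a token-to-id map from a stream of lines looks like. It is a simplification under my own assumptions: `build_token2id` is a made-up helper, a plain `split()` stands in for simple_preprocess, and `io.StringIO` stands in for a large file on disk.

```python
import io


def build_token2id(lines):
    """Build a token -> id map from any iterable of lines, one line at a time."""
    token2id = {}
    for line in lines:
        for token in line.lower().split():
            # Only the growing vocabulary is kept in memory, never the whole text
            token2id.setdefault(token, len(token2id))
    return token2id


# io.StringIO behaves like an open file: iterating over it yields one line at a time
fake_file = io.StringIO("the army recruited technicians\nthe report was published\n")
token2id = build_token2id(fake_file)
print(token2id)
```

At any moment only the current line and the vocabulary seen so far are held in memory, which is the same reason the generator expression passed to corpora.Dictionary() above scales to large files.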

## How to read one line at a time from multiple files?

Assuming you have all the text files in the same directory, you need to define a class with an `__iter__` method. The `__iter__()` method should iterate through all the files in a given directory and yield the processed list of word tokens.

Let’s define one such class by the name ReadTxtFiles, which takes in the path to the directory containing the text files. I am using this directory of [sports food docs](https://github.com/selva86/datasets/tree/master/lsa_sports_food_docs) as input.

In [None]:
class ReadTxtFiles(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname), encoding='latin'):
                yield simple_preprocess(line)

path_to_text_directory = "lsa_sports_food_docs"

dictionary = corpora.Dictionary(ReadTxtFiles(path_to_text_directory))

# Token to Id map
dictionary.token2id
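Why a class with `__iter__` rather than a plain generator function? One common reason is that an object with `__iter__` can be iterated over as many times as needed, whereas a generator is exhausted after a single pass. The sketch below illustrates the difference with the standard library only; `ReadLines`, `read_lines_once` and the two tiny files are made up for this example, and a plain `split()` stands in for simple_preprocess.

```python
import os
import tempfile


class ReadLines:
    """Re-iterable stream of tokenized lines from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as fh:
                for line in fh:
                    yield line.lower().split()


def read_lines_once(dirname):
    """Generator version: it can be consumed only a single time."""
    yield from ReadLines(dirname)


with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny hypothetical files to stream from
    for name, text in [("a.txt", "pizza is food\n"), ("b.txt", "tennis is sport\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(text)

    stream = ReadLines(tmp)
    first_pass = list(stream)
    second_pass = list(stream)   # __iter__ runs again, so the files are re-read

    gen = read_lines_once(tmp)
    gen_first = list(gen)
    gen_second = list(gen)       # the generator is already exhausted

print(len(first_pass), len(second_pass))  # 2 2
print(len(gen_first), len(gen_second))    # 2 0
```

This matters whenever the same corpus has to be read more than once, for example when a model is trained with several passes over the data.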

# How to create a bag of words corpus in gensim?

Now you know how to create a dictionary from a list and from a text file.

The next important object you need to familiarize yourself with in order to work in gensim is the Corpus (a Bag of Words). That is, it is a corpus object that contains the word id and its frequency in each document. You can think of it as gensim’s equivalent of a Document-Term matrix.

Once you have the updated dictionary, all you need to do to create a bag of words corpus is to pass the tokenized list of words to Dictionary.doc2bow().

Let’s create a Corpus for a simple list (my_docs) containing 2 sentences.

In [11]:
# List with 2 sentences
my_docs = ["Who let the dogs out?",
           "Who? Who? Who? Who?"]

# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in my_docs]

# Create the Corpus
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]
pprint(mycorpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]


How to interpret the above corpus?

The (0, 1) in line 1 means, the word with id=0 appears once in the 1st document.

Likewise, the (4, 4) in the second list item means the word with id 4 appears 4 times in the second document. And so on.

Well, this is not human readable. To convert the ids to words, you will need the dictionary to do the conversion.

Let’s see how to get the original texts back.

In [13]:
word_counts = [[(mydict[id], count) for id, count in line] for line in mycorpus]
pprint(word_counts)

[[('dogs', 1), ('let', 1), ('out', 1), ('the', 1), ('who', 1)], [('who', 4)]]


Notice that the order of the words gets lost. Only each word and its frequency are retained.
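Since the corpus was described as gensim’s equivalent of a Document-Term matrix, it can help to expand the sparse (id, count) pairs into that matrix by hand. This is a small plain-Python sketch; the corpus and the id-to-word mapping are copied from the outputs above, while `to_dense` is a helper name I made up.

```python
# Values copied from the two outputs above
corpus = [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]
id2token = {0: "dogs", 1: "let", 2: "out", 3: "the", 4: "who"}


def to_dense(corpus, num_terms):
    """Expand sparse (id, count) pairs into one row of counts per document."""
    matrix = []
    for doc in corpus:
        row = [0] * num_terms
        for token_id, count in doc:
            row[token_id] = count
        matrix.append(row)
    return matrix


dense = to_dense(corpus, num_terms=len(id2token))
print([id2token[i] for i in range(len(id2token))])  # column labels
for row in dense:
    print(row)
# [1, 1, 1, 1, 1]
# [0, 0, 0, 0, 4]
```

The sparse form stores only the non-zero cells, which is why it stays compact even when the vocabulary has many thousands of words.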

# How to create a bag of words corpus from a text file?

Reading words from a python list is quite straightforward because the entire text was in-memory already.

However, you may have a large file that you don’t want to load into memory in its entirety.

You can import such files one line at a time by defining a class with an `__iter__` method that reads the file one line at a time and yields a bag-of-words document for each line. But how to create the corpus object?

The `__iter__()` method of BoWCorpus reads a line from the file, processes it into a list of words using simple_preprocess() and passes that to dictionary.doc2bow(). Can you relate how this is similar to and different from the ReadTxtFiles class we created earlier?

Also, notice that I am using **smart_open()** from the smart_open package because it lets you open and read large files line-by-line from a variety of sources such as S3, HDFS, WebHDFS, HTTP, or local and compressed files. That’s pretty awesome by the way!

However, if you had used open() for a file on your system, it would work perfectly fine as well.

In [None]:
from gensim.utils import simple_preprocess
from smart_open import smart_open
import nltk
nltk.download('stopwords')  # run once
from nltk.corpus import stopwords
stop_words = stopwords.words('english')


class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        for line in smart_open(self.filepath, encoding='latin'):
            # tokenize
            tokenized_list = simple_preprocess(line, deacc=True)

            # create bag of words; allow_update=True also adds any new
            # tokens to self.dictionary, so the dictionary passed in by
            # the caller (mydict below) is updated in place
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)

            # lazy return the BoW
            yield bow


# Create the Dictionary
mydict = corpora.Dictionary()

# Create the Corpus
bow_corpus = BoWCorpus('sample.txt', dictionary=mydict)  # memory friendly

# Print the token_id and count for each line.
for line in bow_corpus:
    print(line)

# How to save a gensim dictionary and corpus to disk and load them back?

In [None]:
# Save the Dict and Corpus
mydict.save('mydict.dict')  # save dict to disk
corpora.MmCorpus.serialize('bow_corpus.mm', bow_corpus)  # save corpus to disk

In [None]:
# Load them back
loaded_dict = corpora.Dictionary.load('mydict.dict')

corpus = corpora.MmCorpus('bow_corpus.mm')
for line in corpus:
    print(line)

# How to create the TFIDF matrix (corpus) in gensim?

The Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

**How is TFIDF computed?**

Tf-Idf is computed by multiplying a local component like term frequency (TF) with a global component, that is, inverse document frequency (IDF) and optionally normalizing the result to unit length.

As a result of this, the words that occur frequently across documents will get downweighted.

Several variants of the TF and IDF formulas exist. Gensim follows the notation of the SMART information retrieval system to select among them: you choose the formula by passing the smartirs parameter to TfidfModel. See help(models.TfidfModel) for more details.

So, how to get the TFIDF weights?

By training models.TfidfModel() on the corpus. Then, pass the corpus inside the square brackets of the trained tfidf model. See the example below.

In [15]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]
[['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
[['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
[['this', 0.15], ['document', 0.7], ['third', 0.7]]


Notice the difference in weights of the words between the original corpus and the tfidf weighted corpus.

The words ‘is’ and ‘the’ occur in two documents and were weighted down. The word ‘this’, which appears in all three documents, was weighted down even further and gets the smallest weight in each document. In simple terms, words that occur more frequently across the documents get smaller weights.
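To see where numbers like 0.63, 0.31 and 0.13 can come from, here is a hand computation in plain Python. It is a sketch under an assumption: I use IDF = log2((N + 1) / df) followed by normalization to unit length, which happens to reproduce the weights printed above. The exact formula gensim applies for a given smartirs code can differ between versions, so treat this as an illustration rather than as the library’s definition.

```python
import math

# Raw term counts per document, taken from the example above
docs = [
    {"this": 1, "is": 1, "the": 1, "first": 1, "line": 1},
    {"this": 1, "is": 1, "the": 1, "second": 1, "sentence": 1},
    {"this": 1, "third": 1, "document": 1},
]
n_docs = len(docs)

# Document frequency: in how many documents does each token occur?
doc_freq = {}
for doc in docs:
    for token in doc:
        doc_freq[token] = doc_freq.get(token, 0) + 1


def tfidf(doc):
    """TF * IDF, then scale the document vector to unit length."""
    # ASSUMPTION: idf = log2((N + 1) / df); chosen because it matches the output above
    weights = {t: tf * math.log2((n_docs + 1) / doc_freq[t]) for t, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: round(w / norm, 2) for t, w in weights.items()}


for doc in docs:
    print(tfidf(doc))
```

For the first document this gives 0.63 for ‘first’ and ‘line’, 0.31 for ‘is’ and ‘the’, and 0.13 for ‘this’: the more documents a word appears in, the smaller its IDF and hence its weight.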

# How to use gensim downloader API to load datasets?

Gensim provides an inbuilt API to download popular text datasets and word embedding models.

A comprehensive list of available datasets and models is maintained [here](https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json).

Using the API to download the dataset is as simple as calling the api.load() method with the right data or model name.

The below example shows how to download the ‘glove-wiki-gigaword-50’ model.

In [17]:
import gensim.downloader as api

# Get information about the model or dataset
api.info('glove-wiki-gigaword-50')

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

In [None]:
w2v_model = api.load("glove-wiki-gigaword-50")
w2v_model.most_similar('blue')

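In simple words, these are pre-trained word vectors (GloVe vectors built from Wikipedia 2014 and Gigaword 5, as the info above shows) that you can load and query directly, without training a model yourself.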

# How to create bigrams and trigrams using Phraser models?

Now you know how to download datasets and pre-trained models with gensim.

Let’s download the text8 dataset, which is nothing but the “First 100,000,000 bytes of plain text from Wikipedia”. Then, from this, we will generate bigrams and trigrams.

But **what are bigrams and trigrams, and why do they matter?**

In text, certain words tend to occur in pairs (bigrams) or in groups of three (trigrams), because the words combined together form the actual entity. For example, the word ‘French’ refers to the language or region and the word ‘revolution’ can refer to a planetary revolution. But combined, ‘French Revolution’ refers to something completely different.

It’s quite important to form bigrams and trigrams from sentences, especially when working with bag-of-words models.

So how to create the bigrams?

It’s quite easy and efficient with gensim’s Phrases model. The created Phrases model allows indexing, so, just pass the original text (list) to the built Phrases model to form the bigrams. An example is shown below:

In [19]:
dataset = api.load("text8")
dataset = [wd for wd in dataset]

dct = corpora.Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]

# Build the bigram models
bigram = gensim.models.phrases.Phrases(dataset, min_count=3, threshold=10)

# Construct bigram
print(bigram[dataset[0]])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans_culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative_way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken_up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived_from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political_philosophy', 'is', 'the', 'belief_that', 'rulers', 'are', 'unnecessary', 'and', 'should_be', 'abolished', 'although', 'there_are', 'differing_interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers_to', 'related', 'social_movements', 'that', 'advocate',
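How does a phrase model decide that ‘french_revolution’ is a bigram but ‘of_the’ is not? The sketch below illustrates the idea with the collocation score from Mikolov et al. (2013), which, to my understanding, gensim’s default scorer is modeled on. Everything here is a simplification for illustration: the toy corpus, `score` and `merge_bigrams` are made up, and the vocabulary size is taken to be the number of distinct single words.

```python
from collections import Counter

# Hypothetical toy corpus of tokenized sentences
sentences = [
    ["the", "french", "revolution", "began"],
    ["the", "french", "revolution", "ended"],
    ["the", "french", "revolution", "spread"],
    ["the", "people", "began", "the", "march"],
]

unigrams = Counter(tok for sent in sentences for tok in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def score(a, b, min_count=1):
    """(count(a b) - min_count) * vocab_size / (count(a) * count(b))."""
    return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])


def merge_bigrams(sentence, threshold=1.5, min_count=1):
    """Greedily join adjacent tokens whose score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and score(sentence[i], sentence[i + 1], min_count) > threshold:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


print(merge_bigrams(sentences[0]))  # ['the', 'french_revolution', 'began']
```

Here ‘the french’ co-occurs just as often as ‘french revolution’, but ‘the’ is frequent on its own, so its score (about 1.07) stays below the threshold while ‘french revolution’ (about 1.78) passes. The min_count and threshold arguments of Phrases play the same roles.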

**Can you guess how to create a trigram?**

Well, simply rinse and repeat the same procedure on the output of the bigram model. Once you’ve generated the bigrams, you can pass the output to train a new Phrases model. Then, apply the bigrammed corpus to the trained trigram model.

In [20]:
# Build the trigram models
trigram = gensim.models.phrases.Phrases(bigram[dataset], threshold=10)

# Construct trigram
print(trigram[bigram[dataset[0]]])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans_culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative_way', 'to_describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also_been', 'taken_up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived_from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political_philosophy', 'is', 'the', 'belief_that', 'rulers', 'are', 'unnecessary', 'and', 'should_be', 'abolished', 'although', 'there_are', 'differing_interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers_to', 'related', 'social_movements', 'that', 'advocate', 'the'

# How to create Topic Models with LDA?

The objective of topic models is to extract the underlying topics from a given collection of text documents. Each document in the text is considered as a combination of topics and each topic is considered as a combination of related words.

Topic modeling can be done by algorithms like **Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI)**.

In both cases you need to provide the number of topics as input. The topic model, in turn, will provide the topic keywords for each topic and the percentage contribution of topics in each document.

The quality of topics is highly dependent on the quality of text processing and the number of topics you provide to the algorithm. The earlier post on how to build best topic models explains the procedure in more detail. However, I recommend understanding the basic steps involved and the interpretation in the example below.

Step 0: Load the necessary packages and import the stopwords.

In [21]:
# Step 0: Import packages and stopwords
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess, lemmatize
from nltk.corpus import stopwords
import re
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)
stop_words = stopwords.words('english')
stop_words = stop_words + ['com', 'edu', 'subject', 'lines', 'organization', 'would', 'article', 'could']

Step 1: Import the dataset. I am going to use the text8 dataset that can be downloaded using gensim’s downloader API.

In [27]:
# Step 1: Import the dataset and load the documents into a list
dataset = api.load("text8")
data = [d for d in dataset]

Step 2: Prepare the downloaded data by removing stopwords and lemmatizing it. For lemmatization, gensim requires the pattern package. So, be sure to do pip install pattern in your terminal or prompt before running this. I have set up lemmatization such that only Nouns (NN), Adjectives (JJ) and Adverbs (RB) are retained, because I prefer only such words to go as topic keywords. This is a personal choice.

In [None]:
# Step 2: Prepare Data (Remove stopwords and lemmatize)
data_processed = []

for i, doc in enumerate(data[:50]):
    doc_out = []
    for wd in doc:
        if wd not in stop_words:  # remove stopwords
            lemmatized_word = lemmatize(wd, allowed_tags=re.compile('(NN|JJ|RB)'))  # lemmatize
            if lemmatized_word:
                doc_out = doc_out + [lemmatized_word[0].split(b'/')[0].decode('utf-8')]
        else:
            continue
    data_processed.append(doc_out)

# Print a small sample    
print(data_processed[0][:5])

The data_processed is now processed as a list of list of words. You can now use this to create the Dictionary and Corpus, which will then be used as inputs to the LDA model.

In [None]:
# Step 3: Create the Inputs of LDA model: Dictionary and Corpus
dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]

We have the Dictionary and Corpus created. Let’s build an LDA topic model with 7 topics, using LdaMulticore(). 7 topics is an arbitrary choice for now.

In [None]:
# Step 4: Train the LDA model
lda_model = LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=7,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

# save the model
lda_model.save('lda_model.model')

lda_model.print_topics(-1)

The lda_model.print_topics() output shows which words contributed to each of the 7 topics, along with the weight of each word’s contribution to that topic.

You can see words like ‘also’ and ‘many’ coming up across different topics. So I would add such words to the stop_words list to remove them, and further tune the topic model for the optimal number of topics.

LdaMulticore() supports parallel processing. Alternatively, you could also try and see what topics LdaModel() gives.

# How to interpret the LDA Topic Model’s output?

The lda_model object supports indexing. That is, if you pass a document (in bag-of-words form) to the lda_model, it provides 3 things, because the model was trained with per_word_topics=True:

* The topic(s) that document belongs to along with percentage.
* The topic(s) each word in that document belongs to.
* The topic(s) each word in that document belongs to AND the phi values.

So, **what is phi value?**

**Phi value** is the probability of the word belonging to that particular topic, scaled by the number of times the word occurs. So the sum of phi values for a given word adds up to the number of times that word occurred in that document.

For example, if the output for the 0th document shows that the word with id=0 belongs to topic number 6 with a phi value of 3.999, it means the word with id=0 appeared 4 times in the 0th document.

In [None]:
# Reference: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb
for c in lda_model[corpus[5:8]]:
    print("Document Topics      : ", c[0])      # [(Topics, Perc Contrib)]
    print("Word id, Topics      : ", c[1][:3])  # [(Word id, [Topics])]
    print("Phi Values (word id) : ", c[2][:2])  # [(Word id, [(Topic, Phi Value)])]
    print("Word, Topics         : ", [(dct[wd], topic) for wd, topic in c[1][:2]])   # [(Word, [Topics])]
    print("Phi Values (word)    : ", [(dct[wd], topic) for wd, topic in c[2][:2]])  # [(Word, [(Topic, Phi Value)])]
    print("------------------------------------------------------\n")
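To make the three parts easier to read, here is a small plain-Python sketch that picks the dominant topic of a document and recovers word counts from phi values. The numbers are hypothetical, made up only to mirror the shape of the output described above; they are not results from the model trained here.

```python
# Hypothetical values shaped like one element of lda_model[corpus]
# when the model was trained with per_word_topics=True
doc_topics = [(0, 0.12), (3, 0.25), (6, 0.63)]               # [(topic, share)]
word_topics = [(0, [6]), (7, [6, 3])]                        # [(word id, [topics])]
phi_values = [(0, [(6, 3.999)]), (7, [(3, 0.4), (6, 1.6)])]  # [(word id, [(topic, phi)])]

# Dominant topic: the topic with the largest share in this document
dominant_topic, share = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, share)  # 6 0.63

# Summing the phi values of a word over its topics approximates its count
approx_counts = {
    word_id: round(sum(phi for _, phi in topics))
    for word_id, topics in phi_values
}
print(approx_counts)  # {0: 4, 7: 2}
```

In this made-up example the word with id=7 is split between topics 3 and 6, and its phi values still add up to its count of 2.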

# How to create a LSI topic model using gensim?

The syntax for using an LSI model is similar to how we built the LDA model, except that we will use the LsiModel().

In [33]:
from gensim.models import LsiModel

# Build the LSI Model
lsi_model = LsiModel(corpus=corpus, id2word=dct, num_topics=7, decay=0.5)

2020-04-19 20:54:55,461 : INFO : using serial LSI version on this node
2020-04-19 20:54:55,484 : INFO : updating model with new documents
2020-04-19 20:54:55,877 : INFO : preparing a new chunk of documents
2020-04-19 20:55:50,619 : INFO : using 100 extra samples and 2 power iterations
2020-04-19 20:55:50,629 : INFO : 1st phase: constructing (253854, 107) action matrix
2020-04-19 20:55:51,688 : INFO : orthonormalizing (253854, 107) action matrix
2020-04-19 20:56:05,022 : INFO : 2nd phase: running dense svd on (107, 1701) matrix
2020-04-19 20:56:05,344 : INFO : computing the final decomposition
2020-04-19 20:56:05,348 : INFO : keeping 7 factors (discarding 2.323% of energy spectrum)
2020-04-19 20:56:05,452 : INFO : processed documents up to #1701
2020-04-19 20:56:11,311 : INFO : topic #0(39552.536): 0.655*"the" + 0.365*"of" + 0.255*"and" + 0.254*"one" + 0.228*"in" + 0.199*"a" + 0.194*"to" + 0.163*"zero" + 0.154*"nine" + 0.118*"two"
2020-04-19 20:56:11,322 : INFO : topic #1(9756.470): -0.

In [34]:
# View Topics
pprint(lsi_model.print_topics(-1))

2020-04-19 20:56:11,482 : INFO : topic #0(39552.536): 0.655*"the" + 0.365*"of" + 0.255*"and" + 0.254*"one" + 0.228*"in" + 0.199*"a" + 0.194*"to" + 0.163*"zero" + 0.154*"nine" + 0.118*"two"
2020-04-19 20:56:11,500 : INFO : topic #1(9756.470): -0.615*"one" + -0.459*"nine" + 0.272*"the" + -0.231*"zero" + -0.202*"eight" + -0.177*"two" + -0.162*"seven" + -0.155*"six" + -0.149*"five" + -0.144*"four"
2020-04-19 20:56:11,519 : INFO : topic #2(3430.528): 0.748*"zero" + -0.293*"one" + 0.272*"two" + -0.202*"of" + -0.169*"his" + -0.159*"nine" + -0.145*"he" + 0.132*"the" + -0.123*"a" + -0.094*"that"
2020-04-19 20:56:11,534 : INFO : topic #3(2975.647): -0.477*"is" + -0.400*"a" + 0.393*"the" + 0.233*"was" + -0.222*"are" + -0.192*"or" + 0.145*"his" + -0.145*"be" + -0.126*"zero" + -0.118*"that"
2020-04-19 20:56:11,551 : INFO : topic #4(2672.783): -0.756*"of" + 0.300*"a" + 0.199*"to" + -0.196*"university" + 0.162*"nine" + 0.149*"was" + 0.138*"the" + 0.118*"s" + 0.115*"his" + 0.113*"he"
2020-04-19 20:56:

[(0,
  '0.655*"the" + 0.365*"of" + 0.255*"and" + 0.254*"one" + 0.228*"in" + '
  '0.199*"a" + 0.194*"to" + 0.163*"zero" + 0.154*"nine" + 0.118*"two"'),
 (1,
  '-0.615*"one" + -0.459*"nine" + 0.272*"the" + -0.231*"zero" + -0.202*"eight" '
  '+ -0.177*"two" + -0.162*"seven" + -0.155*"six" + -0.149*"five" + '
  '-0.144*"four"'),
 (2,
  '0.748*"zero" + -0.293*"one" + 0.272*"two" + -0.202*"of" + -0.169*"his" + '
  '-0.159*"nine" + -0.145*"he" + 0.132*"the" + -0.123*"a" + -0.094*"that"'),
 (3,
  '-0.477*"is" + -0.400*"a" + 0.393*"the" + 0.233*"was" + -0.222*"are" + '
  '-0.192*"or" + 0.145*"his" + -0.145*"be" + -0.126*"zero" + -0.118*"that"'),
 (4,
  '-0.756*"of" + 0.300*"a" + 0.199*"to" + -0.196*"university" + 0.162*"nine" + '
  '0.149*"was" + 0.138*"the" + 0.118*"s" + 0.115*"his" + 0.113*"he"'),
 (5,
  '-0.487*"the" + 0.339*"his" + 0.317*"and" + 0.298*"he" + 0.247*"zero" + '
  '0.225*"in" + 0.219*"s" + 0.217*"was" + 0.215*"of" + -0.200*"is"'),
 (6,
  '0.667*"nine" + -0.243*"two" + 0.209*"an
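The formatted topic strings are easy to read but awkward to process further. Here is a small sketch that parses such a string into (word, weight) pairs with a regular expression. The topic string is copied from the output above; `parse_topic` is a helper name I made up, not a gensim function.

```python
import re

# Topic 0, copied from the print_topics() output above
topic_0 = ('0.655*"the" + 0.365*"of" + 0.255*"and" + 0.254*"one" + 0.228*"in" + '
           '0.199*"a" + 0.194*"to" + 0.163*"zero" + 0.154*"nine" + 0.118*"two"')


def parse_topic(topic_string):
    """Turn 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(topic_0)
print(terms[:3])  # [('the', 0.655), ('of', 0.365), ('and', 0.255)]

# Negative weights, which LSI topics often contain, are handled as well
print(parse_topic('-0.615*"one" + -0.459*"nine"'))  # [('one', -0.615), ('nine', -0.459)]
```

Notice that the top words here are ‘the’, ‘of’ and ‘and’. That is a hint that this corpus still contains stop words, and that removing them before training would likely give more meaningful topics.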

# How to train Word2Vec model using gensim?

A **word embedding model** is a model that can provide numerical vectors for a given word. Using the Gensim’s downloader API, you can download pre-built word embedding models like **word2vec, fasttext, GloVe and ConceptNet**. These are built on large corpuses of commonly occurring text data such as wikipedia, google news etc.

However, if you are working in a specialized niche such as technical documents, you may not able to get word embeddings for all the words. So, in such cases its desirable to train your own model.

Gensim’s Word2Vec implementation lets you train your own word embedding model for a given corpus.

In [35]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]

# Train the Word2Vec model. The default vector size is 100.
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())

# Get the word vector for a given word (vectors live on the model's `wv` attribute)
model.wv['topic']

# Find the words most similar to a given word
model.wv.most_similar('topic')

2020-04-19 20:59:22,451 : INFO : collecting all words and their counts
2020-04-19 20:59:22,463 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-19 20:59:25,598 : INFO : collected 189074 word types from a corpus of 10000000 raw words and 1000 sentences
2020-04-19 20:59:25,602 : INFO : Loading a fresh vocabulary
2020-04-19 20:59:34,828 : INFO : effective_min_count=0 retains 189074 unique words (100% of original 189074, drops 0)
2020-04-19 20:59:34,830 : INFO : effective_min_count=0 leaves 10000000 word corpus (100% of original 10000000, drops 0)
2020-04-19 20:59:35,836 : INFO : deleting the raw counts dictionary of 189074 items
2020-04-19 20:59:35,847 : INFO : sample=0.001 downsamples 38 most-common words
2020-04-19 20:59:35,848 : INFO : downsampling leaves estimated 7563517 word corpus (75.6% of prior 10000000)
2020-04-19 20:59:36,651 : INFO : estimated required memory for 189074 words and 100 dimensions: 245796200 bytes
2020-04-19 20:59:36,653 : INFO :

2020-04-19 21:00:22,781 : INFO : EPOCH 4 - PROGRESS: at 81.10% examples, 757391 words/s, in_qsize 14, out_qsize 1
2020-04-19 21:00:23,799 : INFO : EPOCH 4 - PROGRESS: at 91.00% examples, 755491 words/s, in_qsize 14, out_qsize 1
2020-04-19 21:00:24,667 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-04-19 21:00:24,685 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-04-19 21:00:24,687 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-04-19 21:00:24,690 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-04-19 21:00:24,710 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-19 21:00:24,718 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-19 21:00:24,720 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-19 21:00:24,737 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-19 21:00:24,740 : INFO : EPOCH - 4 :

[('interpretation', 0.7374930381774902),
 ('discussion', 0.7164048552513123),
 ('focuses', 0.7051739692687988),
 ('consensus', 0.7041907906532288),
 ('discourse', 0.7038161754608154),
 ('merits', 0.6988430023193359),
 ('focus', 0.6950662136077881),
 ('speculation', 0.6800159811973572),
 ('viewpoint', 0.6783989071846008),
 ('debate', 0.6778930425643921)]

In [None]:
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')

We have trained and saved a Word2Vec model for our documents. However, when new data arrives, you will want to update the model so that it accounts for the new words.

# How to update an existing Word2Vec model with new data?

On an existing Word2Vec model, call build_vocab() on the new dataset and then call the train() method. build_vocab() is called first because the model has to be apprised of what new words to expect in the incoming corpus.

In [36]:
# Update the model with new data.
model.build_vocab(data_part2, update=True)
model.train(data_part2, total_examples=model.corpus_count, epochs=model.epochs)
model.wv['topic']

2020-04-19 21:05:08,563 : INFO : collecting all words and their counts
2020-04-19 21:05:08,565 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-19 21:05:12,710 : INFO : collected 153347 word types from a corpus of 7005207 raw words and 701 sentences
2020-04-19 21:05:12,711 : INFO : Updating model with new vocabulary
2020-04-19 21:05:13,198 : INFO : New added 153347 unique words (50% of original 306694) and increased the count of 153347 pre-existing words (50% of original 306694)
2020-04-19 21:05:14,908 : INFO : deleting the raw counts dictionary of 153347 items
2020-04-19 21:05:14,913 : INFO : sample=0.001 downsamples 72 most-common words
2020-04-19 21:05:14,914 : INFO : downsampling leaves estimated 10509051 word corpus (150.0% of prior 7005207)
2020-04-19 21:05:15,513 : INFO : estimated required memory for 306694 words and 100 dimensions: 398702200 bytes
2020-04-19 21:05:15,514 : INFO : updating layer weights

2020-04-19 21:05:49,871 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-19 21:05:49,873 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-19 21:05:49,885 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-19 21:05:49,888 : INFO : EPOCH - 4 : training on 7005207 raw words (5253121 effective words) took 8.1s, 651020 effective words/s
2020-04-19 21:05:50,900 : INFO : EPOCH 5 - PROGRESS: at 13.12% examples, 692525 words/s, in_qsize 15, out_qsize 0
2020-04-19 21:05:51,918 : INFO : EPOCH 5 - PROGRESS: at 27.10% examples, 708685 words/s, in_qsize 15, out_qsize 0
2020-04-19 21:05:52,935 : INFO : EPOCH 5 - PROGRESS: at 39.94% examples, 693685 words/s, in_qsize 14, out_qsize 1
2020-04-19 21:05:53,935 : INFO : EPOCH 5 - PROGRESS: at 52.07% examples, 676609 words/s, in_qsize 14, out_qsize 1
2020-04-19 21:05:54,940 : INFO : EPOCH 5 - PROGRESS: at 63.62% examples, 662747 words/s, in_qsize 14, out_qsize 1
2020-04-19 21:05:55,

array([ 0.2852214 , -1.0016894 ,  0.36538497, -0.485041  , -0.17642544,
       -0.23392639,  0.5296541 , -0.89992523,  1.2112411 ,  1.0386949 ,
       -0.9957959 , -0.71208715, -0.6928785 , -0.4220287 ,  0.60961115,
        1.8373113 ,  0.8656116 , -0.19760089,  0.9054839 ,  0.23233497,
        0.17219867,  0.6895245 , -0.4219415 , -0.6903395 , -0.75695884,
       -1.130927  , -0.7414807 , -1.0186759 ,  1.679157  , -1.7498915 ,
       -0.868277  , -0.35877767,  0.63911796,  0.43743446, -1.6824471 ,
        1.7797668 , -1.1549269 ,  1.8024943 ,  0.84520864,  0.40655845,
       -0.06938665, -0.30292737,  1.3847766 , -0.95705163,  0.5873874 ,
       -0.6506737 , -0.607128  ,  1.587782  ,  0.6351494 ,  0.01420654,
       -0.26282036, -2.3642516 ,  0.04982575, -0.96697474, -0.39967275,
        0.26223755,  1.1660594 , -0.1115896 , -0.44340703,  0.8213022 ,
       -2.3636622 , -0.93666196,  0.21714604, -0.28326985,  0.06686672,
        2.050047  ,  0.8305246 ,  0.7034536 , -0.13184188,  1.93

# How to extract word vectors using pre-trained Word2Vec and FastText models?

We just saw how to get the word vectors from the Word2Vec model we trained. However, gensim also lets you download state-of-the-art pretrained models through the downloader API. Let’s see how to extract the word vectors from a couple of these models.

In [None]:
import gensim.downloader as api

# Download the models
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')

# Get word embeddings
word2vec_model300.most_similar('support')

We have 3 different embedding models. You can evaluate which one performs better using the respective model’s evaluate_word_analogies() on a standard analogies dataset.

In [None]:
# Word2Vec accuracy
word2vec_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]
#> 0.7401448525607863

# FastText accuracy
fasttext_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]
#> 0.8827876424099353

# GloVe accuracy
glove_model300.evaluate_word_analogies(analogies="questions-words.txt")[0]
#> 0.7195422354510931
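
To make the accuracy numbers above less opaque, here is a simplified sketch of the idea behind an analogy test. It is not gensim's implementation. Each line of an analogies file has the form `a b c d` ("a is to b as c is to d"), and the prediction is the word whose vector is closest to `b - a + c`. The 2-D vectors and the helper name `solve_analogy` are made up for illustration.

```python
import math

# Made-up 2-D vectors for illustration; real models use hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Each tuple is (a, b, c, expected d).
analogies = [
    ("man", "king", "woman", "queen"),
    ("man", "woman", "king", "queen"),
    ("man", "king", "woman", "apple"),  # deliberately wrong expected answer
]

correct = sum(solve_analogy(a, b, c) == d for a, b, c, d in analogies)
accuracy = correct / len(analogies)
print(accuracy)  # fraction of analogies answered correctly
```

The accuracy is simply the share of analogies where the predicted word equals the expected one, which is why a higher score suggests the embedding space encodes these relationships more consistently.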

# How to create document vectors using Doc2Vec?

Unlike **Word2Vec**, a **Doc2Vec** model provides a vectorised representation of a group of words taken collectively as a single unit. It is not a simple average of the word vectors of the words in the sentence.
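
For contrast, the averaging baseline mentioned above can be sketched in a few lines. The word vectors here are made up, and `average_vector` is a hypothetical helper, not a gensim function. One visible limitation of averaging is that it ignores word order entirely.

```python
# Made-up 2-D word vectors, for illustration only.
word_vectors = {
    "dog":   [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man":   [0.5, 0.5],
}

def average_vector(words):
    """Naive document vector: the element-wise mean of the word vectors."""
    vecs = [word_vectors[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

v1 = average_vector(["dog", "bites", "man"])
v2 = average_vector(["man", "bites", "dog"])

# Both sentences contain the same words, so their averages are identical
# even though the sentences mean different things.
print(v1 == v2)
```

Doc2Vec instead learns a dedicated vector per document during training, so the document representation is not just a function of which words appear.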

Let’s use the text8 dataset to train a Doc2Vec model.

In [None]:
import gensim
import gensim.downloader as api

# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

The training data for Doc2Vec should be a list of TaggedDocuments. To create one, pass a list of words and a list of tags (here, a single unique integer) to models.doc2vec.TaggedDocument().

In [37]:
# Create the tagged document needed for Doc2Vec
def create_tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

train_data = list(create_tagged_document(data))

print(train_data[:1])

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

The input is prepared. To train the model, you need to initialize the Doc2Vec model, build the vocabulary and then finally train the model.

In [None]:
# Init the Doc2Vec model
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

# Build the vocabulary
model.build_vocab(train_data)

# Train the Doc2Vec model
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

To get the document vector of a sentence, pass it as a list of words to the infer_vector() method.

In [None]:
print(model.infer_vector(['australian', 'captain', 'elected', 'to', 'bowl']))
#> array([-0.11043505,  0.21719663, -0.21167697, -0.10790558,  0.5607173 ,
#>        ...
#>        0.16428669, -0.31307793, -0.28575218, -0.0113026 ,  0.08981086],
#>       dtype=float32)

# How to compute similarity metrics like cosine similarity and soft cosine similarity?

Soft cosine similarity is like cosine similarity, but it additionally takes into account the semantic relationship between words through their vector representations.
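
The difference between the two measures can be sketched in plain Python. This is a minimal illustration of the formula, not gensim's implementation: the soft cosine of bag-of-words vectors `a` and `b` is `aᵀSb / sqrt(aᵀSa · bᵀSb)`, where `S[i][j]` is the similarity between terms `i` and `j`. The three-word vocabulary and the values in `S` are made up.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def soft_cosine(a, b, S):
    """Soft cosine similarity, where S[i][j] is the similarity of terms i and j."""
    def bilinear(u, v):
        return sum(u[i] * S[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))
    return bilinear(a, b) / math.sqrt(bilinear(a, a) * bilinear(b, b))

# Toy vocabulary: ["cricket", "batsman", "chess"]
# Hypothetical term-similarity matrix: "cricket" and "batsman" are related.
S = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

doc_cricket = [1, 0, 0]  # a document containing only "cricket"
doc_batsman = [0, 1, 0]  # a document containing only "batsman"

plain = cosine(doc_cricket, doc_batsman)
soft = soft_cosine(doc_cricket, doc_batsman, S)
print(plain)  # no shared words, so the plain cosine is zero
print(soft)   # related words, so the soft cosine is positive
```

With an identity matrix for `S`, the soft cosine reduces to the plain cosine, which is why the term-similarity matrix is the piece that has to come from a word embedding model.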

To compute soft cosines, you will need a word embedding model like Word2Vec or FastText. First, build a dictionary from the input sentences and use it to compute the similarity matrix. Then convert the input sentences to a bag-of-words corpus and pass them to softcossim() along with the similarity matrix.

In [38]:
from gensim.matutils import softcossim
from gensim import corpora

sent_1 = 'Sachin is a cricket player and a opening batsman'.split()
sent_2 = 'Dhoni is a cricket player too He is a batsman and keeper'.split()
sent_3 = 'Anand is a chess player'.split()

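# Prepare a dictionary from the sentences.
documents = [sent_1, sent_2, sent_3]
dictionary = corpora.Dictionary(documents)

# Prepare the similarity matrix. It needs the dictionary, so it is built second.
# fasttext_model300 is the model loaded with the downloader API in the earlier section.
# Note: softcossim() and similarity_matrix() are gensim 3.x APIs.
similarity_matrix = fasttext_model300.similarity_matrix(dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)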

# Convert the sentences into bag-of-words vectors.
sent_1 = dictionary.doc2bow(sent_1)
sent_2 = dictionary.doc2bow(sent_2)
sent_3 = dictionary.doc2bow(sent_3)

# Compute soft cosine similarity
print(softcossim(sent_1, sent_2, similarity_matrix))
#> 0.7868705819999783

print(softcossim(sent_1, sent_3, similarity_matrix))
#> 0.6036445529268666

print(softcossim(sent_2, sent_3, similarity_matrix))
#> 0.60965453519611


Below are some useful similarity and distance metrics based on the word embedding models like fasttext and GloVe. We have already downloaded these models using the downloader API.

In [None]:
# Which word from the given list doesn't go with the others?
print(fasttext_model300.doesnt_match(['india', 'australia', 'pakistan', 'china', 'beetroot']))  
#> beetroot

# Compute cosine distance between two words.
print(fasttext_model300.distance('king', 'queen'))
#> 0.22957539558410645


# Compute cosine distances from given word or vector to all words in `other_words`.
print(fasttext_model300.distances('king', ['queen', 'man', 'woman']))
#> [0.22957546 0.465837   0.547001  ]


# Compute cosine similarities
print(fasttext_model300.cosine_similarities(fasttext_model300['king'], 
                                            vectors_all=(fasttext_model300['queen'], 
                                                        fasttext_model300['man'], 
                                                        fasttext_model300['woman'],
                                                        fasttext_model300['queen'] + fasttext_model300['man'])))  
#> array([0.77042454, 0.534163  , 0.45299897, 0.76572555], dtype=float32)
# Note: Queen + Man is very similar to King.

# Get the words closer to w1 than w2
print(glove_model300.words_closer_than(w1='king', w2='kingdom'))
#> ['prince', 'queen', 'monarch']


# Find the top-N most similar words.
print(fasttext_model300.most_similar(positive='king', negative=None, topn=5, restrict_vocab=None, indexer=None))
#> [('queen', 0.63), ('prince', 0.62), ('monarch', 0.59), ('kingdom', 0.58), ('throne', 0.56)]


# Find the top-N most similar words, using the multiplicative combination objective.
print(glove_model300.most_similar_cosmul(positive='king', negative=None, topn=5))
#> [('queen', 0.82), ('prince', 0.81)

# How to summarize text documents?

Gensim implements TextRank summarization through the summarize() function in the summarization module. All you need to do is pass in the text string along with either the output summarization ratio or the maximum count of words in the summarized output.

There is no need to split the text into a tokenized list because gensim does the splitting using the built-in split_sentences() method in the gensim.summarization.textcleaner module.

Let’s summarize a clipping from a news article in sample.txt.

In [None]:
from gensim.summarization import summarize, keywords
from pprint import pprint
from smart_open import open as smart_open

# Read the whole file into a single string
text = " ".join((line for line in smart_open('sample.txt', encoding='utf-8')))

# Summarize the paragraph
pprint(summarize(text, word_count=20))
#> ('the PLA Rocket Force national defense science and technology experts panel, '
#>  'according to a report published by the')

# Important keywords from the paragraph
print(keywords(text))
#> force zhang technology experts pla rocket
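
To give a feel for what a TextRank-style summarizer does, here is a heavily simplified sketch in plain Python. It is not gensim's implementation: the sentence splitter, the word-overlap similarity and the helper names (`split_into_sentences`, `textrank`, `summarize_sketch`) are assumptions chosen for brevity. Sentences are nodes in a graph, edges are weighted by how many words two sentences share, and a PageRank-style iteration scores each sentence.

```python
import math
import re

def split_into_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(s1, s2):
    """Word-overlap similarity, normalised by the log of the sentence sizes."""
    w1 = set(re.findall(r"\w+", s1.lower()))
    w2 = set(re.findall(r"\w+", s2.lower()))
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    return len(w1 & w2) / denom if denom > 0 else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank iteration over the similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize_sketch(text, top_k=1):
    """Return the top_k highest-scoring sentences, in their original order."""
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]

sample = ("Gensim is a library for topic modeling. "
          "Topic modeling finds topics in text with a library like gensim. "
          "I had pasta for lunch.")
print(summarize_sketch(sample))
```

The sentence that shares the most vocabulary with the rest of the text ends up with the highest score, which is the intuition behind extractive summarization: the summary is made of sentences taken verbatim from the input.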