In [11]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: regex, joblib, nltk
Successfully installed joblib-1.4.2 nltk-3.9.1 regex-2024.11.6

Create a Dictionary from a list of sentences
======================================

In gensim, the dictionary maps every word (token) to a unique integer id.

You can create a dictionary from a list of sentences, from a text file that contains multiple lines of text, or from multiple such text files in a directory. For the second and third cases, we will avoid loading the entire file into memory, so the dictionary is updated as the text is read line by line.

Let’s start with the ‘List of sentences’ input.

When you have multiple sentences, you need to convert each sentence to a list of words. A list comprehension is a common way to do this.

In [1]:
import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = ["WG Grace: Cricket legend has 10 matches wiped from iconic first-class record",
             "One hundred and fourteen years after the legendary Victorian era cricketer's last match",
             "he has had 685 runs, 67 wickets and two centuries wiped from the records books",
             "In a ruthless move the Wisden Cricketers' Almanack has decided that 10 of Grace's matches were not at first-class level and as a result updated its records"]

# Tokenize (split) the sentences into words
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)


Dictionary<54 unique tokens: ['10', 'Cricket', 'Grace:', 'WG', 'first-class']...>


As the output shows, the dictionary has 54 unique tokens (or words). Let’s see the unique id of each token.

In [2]:
# Show the word to id map
print(dictionary.token2id)


{'10': 0, 'Cricket': 1, 'Grace:': 2, 'WG': 3, 'first-class': 4, 'from': 5, 'has': 6, 'iconic': 7, 'legend': 8, 'matches': 9, 'record': 10, 'wiped': 11, 'One': 12, 'Victorian': 13, 'after': 14, 'and': 15, "cricketer's": 16, 'era': 17, 'fourteen': 18, 'hundred': 19, 'last': 20, 'legendary': 21, 'match': 22, 'the': 23, 'years': 24, '67': 25, '685': 26, 'books': 27, 'centuries': 28, 'had': 29, 'he': 30, 'records': 31, 'runs,': 32, 'two': 33, 'wickets': 34, 'Almanack': 35, "Cricketers'": 36, "Grace's": 37, 'In': 38, 'Wisden': 39, 'a': 40, 'as': 41, 'at': 42, 'decided': 43, 'its': 44, 'level': 45, 'move': 46, 'not': 47, 'of': 48, 'result': 49, 'ruthless': 50, 'that': 51, 'updated': 52, 'were': 53}


We have successfully created a Dictionary object. Gensim will use this dictionary to create a bag-of-words corpus, in which the words in each document are replaced with their respective ids from this dictionary.
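To make the mapping concrete, here is a minimal plain-Python sketch of the idea behind a token-to-id dictionary. It is not gensim's implementation: `build_token2id` is a hypothetical helper, and it assumes (judging from the ids printed above) that unseen tokens are numbered in sorted order within each document.

```python
def build_token2id(tokenized_docs, token2id=None):
    """Assign a unique integer id to every token (illustrative helper, not a gensim API)."""
    # Start from an existing mapping so ids that were already assigned never change
    token2id = {} if token2id is None else token2id
    for doc in tokenized_docs:
        # Assumption: unseen tokens are numbered in sorted order within a document
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


docs = [
    "WG Grace: Cricket legend has 10 matches wiped from iconic first-class record".split(),
    "he has had 685 runs, 67 wickets and two centuries wiped from the records books".split(),
]
token2id = build_token2id(docs)
print(token2id["10"], token2id["WG"], token2id["wiped"])

# Adding a new document extends the mapping without renumbering old tokens
build_token2id([["Klopp", "has", "a", "record"]], token2id)
print(token2id["WG"], token2id["Klopp"])
```

Passing the existing mapping back in is the plain-Python counterpart of updating a dictionary with new documents: tokens that are already known keep their ids, and only unseen tokens receive new ones.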

If you get new documents in the future, it is also possible to update an existing dictionary to include the new words.

In [3]:
documents_2 = ["Jurgen Klopp says Liverpool will have to \"be ready to suffer\" in their Champions League semi-final second leg against Villarreal, despite a two-goal advantage",
                "An own goal from Pervis Estupinan and a Sadio Mane strike gave Liverpool a 2-0 win at Anfield last Wednesday",
                "Manager Klopp warned: \"We have to be ready to play a top game because they will go for us.\"",
                "The Reds have been European champions six times and last triumphed in 2019"]

texts_2 = [doc.split() for doc in documents_2]

dictionary.add_documents(texts_2)


# If you check now, the dictionary should have been updated with the new words (tokens).
print(dictionary)

print(dictionary.token2id)


Dictionary<110 unique tokens: ['10', 'Cricket', 'Grace:', 'WG', 'first-class']...>
{'10': 0, 'Cricket': 1, 'Grace:': 2, 'WG': 3, 'first-class': 4, 'from': 5, 'has': 6, 'iconic': 7, 'legend': 8, 'matches': 9, 'record': 10, 'wiped': 11, 'One': 12, 'Victorian': 13, 'after': 14, 'and': 15, "cricketer's": 16, 'era': 17, 'fourteen': 18, 'hundred': 19, 'last': 20, 'legendary': 21, 'match': 22, 'the': 23, 'years': 24, '67': 25, '685': 26, 'books': 27, 'centuries': 28, 'had': 29, 'he': 30, 'records': 31, 'runs,': 32, 'two': 33, 'wickets': 34, 'Almanack': 35, "Cricketers'": 36, "Grace's": 37, 'In': 38, 'Wisden': 39, 'a': 40, 'as': 41, 'at': 42, 'decided': 43, 'its': 44, 'level': 45, 'move': 46, 'not': 47, 'of': 48, 'result': 49, 'ruthless': 50, 'that': 51, 'updated': 52, 'were': 53, '"be': 54, 'Champions': 55, 'Jurgen': 56, 'Klopp': 57, 'League': 58, 'Liverpool': 59, 'Villarreal,': 60, 'advantage': 61, 'against': 62, 'despite': 63, 'have': 64, 'in': 65, 'leg': 66, 'ready': 67, 'says': 68, 'secon

Create a Dictionary from one file
=============================

You can also create a dictionary from a text file.

The example below reads a file line by line and uses gensim’s simple_preprocess to process one line of the file at a time.

The advantage is that it lets you read an entire text file without loading it into memory all at once.
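The line-by-line idea can be sketched with the standard library alone. The snippet below is only an illustration: `iter_tokenized_lines` is a hypothetical helper, the regular-expression tokenizer is a crude stand-in for simple_preprocess (which applies different rules), and the sample file is created on the fly so the sketch runs by itself.

```python
import os
import re
import tempfile


def iter_tokenized_lines(path):
    """Yield one list of lowercase tokens per line, reading the file lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # only the current line is held in memory
            # Crude stand-in tokenizer (assumption); simple_preprocess applies different rules
            yield re.findall(r"[a-z]+", line.lower())


# Write a tiny sample file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("MOBY-DICK;\nor, THE WHALE.\nBy Herman Melville\n")
    sample_path = tmp.name

vocabulary = {}
for tokens in iter_tokenized_lines(sample_path):
    # Number unseen tokens in sorted order within each line
    for token in sorted(set(tokens)):
        vocabulary.setdefault(token, len(vocabulary))

os.remove(sample_path)
print(vocabulary)
```

Because the generator yields one tokenized line at a time, memory use depends on the longest line and the size of the vocabulary, not on the size of the file.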

In [None]:
!wget https://www.gutenberg.org/ebooks/2701.txt.utf-8
!head -n 5 2701.txt.utf-8

--2025-04-30 14:42:22--  https://www.gutenberg.org/ebooks/2701.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.gutenberg.org/cache/epub/2701/pg2701.txt [following]
--2025-04-30 14:42:26--  http://www.gutenberg.org/cache/epub/2701/pg2701.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/cache/epub/2701/pg2701.txt [following]
--2025-04-30 14:42:26--  https://www.gutenberg.org/cache/epub/2701/pg2701.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1276288 (1.2M) [text/plain]
Saving to: ‘2701.txt.utf-8’


2025-04-30 14:42:32 (529 KB

In [4]:
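# Copy only the book text between the Project Gutenberg START and END markers
with open('2701.txt.utf-8', 'r', encoding='utf-8') as in_file, \
     open('moby_dick.txt', 'w', encoding='utf-8') as out_file:
    copy_to_file = False
    for line in in_file:
        # Stop copying once the END marker is reached
        if line.startswith('*** END OF THE PROJECT GUTENBERG EBOOK'):
            copy_to_file = False
        if copy_to_file:
            # Each line is written with a leading space, as in the output below
            out_file.write(' ')
            out_file.write(line)
        # Start copying from the line after the START marker
        if line.startswith('*** START OF THE PROJECT GUTENBERG EBOOK'):
            copy_to_file = True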

In [5]:
!head -n 500 moby_dick.txt
!cat moby_dick.txt | wc -l

 
 
 
 
 MOBY-DICK;
 
 or, THE WHALE.
 
 By Herman Melville
 
 
 
 CONTENTS
 
 ETYMOLOGY.
 
 EXTRACTS (Supplied by a Sub-Sub-Librarian).
 
 CHAPTER 1. Loomings.
 
 CHAPTER 2. The Carpet-Bag.
 
 CHAPTER 3. The Spouter-Inn.
 
 CHAPTER 4. The Counterpane.
 
 CHAPTER 5. Breakfast.
 
 CHAPTER 6. The Street.
 
 CHAPTER 7. The Chapel.
 
 CHAPTER 8. The Pulpit.
 
 CHAPTER 9. The Sermon.
 
 CHAPTER 10. A Bosom Friend.
 
 CHAPTER 11. Nightgown.
 
 CHAPTER 12. Biographical.
 
 CHAPTER 13. Wheelbarrow.
 
 CHAPTER 14. Nantucket.
 
 CHAPTER 15. Chowder.
 
 CHAPTER 16. The Ship.
 
 CHAPTER 17. The Ramadan.
 
 CHAPTER 18. His Mark.
 
 CHAPTER 19. The Prophet.
 
 CHAPTER 20. All Astir.
 
 CHAPTER 21. Going Aboard.
 
 CHAPTER 22. Merry Christmas.
 
 CHAPTER 23. The Lee Shore.
 
 CHAPTER 24. The Advocate.
 
 CHAPTER 25. Postscript.
 
 CHAPTER 26. Knights and Squires.
 
 CHAPTER 27. Knights and Squires.
 
 CHAPTER 28. Ahab.
 
 CHAPTER 29. Enter Ahab; to Him, Stubb.
 
 CHAPTER 30. The Pipe.
 
 CHAPTER 31. 

In [6]:
from gensim.utils import simple_preprocess

# Create a gensim dictionary from a single text file, one line at a time.
# deacc=True removes accent marks from tokens.
with open('moby_dick.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

# Token to Id map
dictionary.token2id


{'dick': 0,
 'moby': 1,
 'or': 2,
 'the': 3,
 'whale': 4,
 'by': 5,
 'herman': 6,
 'melville': 7,
 'contents': 8,
 'etymology': 9,
 'extracts': 10,
 'librarian': 11,
 'sub': 12,
 'supplied': 13,
 'chapter': 14,
 'loomings': 15,
 'bag': 16,
 'carpet': 17,
 'inn': 18,
 'spouter': 19,
 'counterpane': 20,
 'breakfast': 21,
 'street': 22,
 'chapel': 23,
 'pulpit': 24,
 'sermon': 25,
 'bosom': 26,
 'friend': 27,
 'nightgown': 28,
 'biographical': 29,
 'wheelbarrow': 30,
 'nantucket': 31,
 'chowder': 32,
 'ship': 33,
 'ramadan': 34,
 'his': 35,
 'mark': 36,
 'prophet': 37,
 'all': 38,
 'astir': 39,
 'aboard': 40,
 'going': 41,
 'christmas': 42,
 'merry': 43,
 'lee': 44,
 'shore': 45,
 'advocate': 46,
 'postscript': 47,
 'and': 48,
 'knights': 49,
 'squires': 50,
 'ahab': 51,
 'enter': 52,
 'him': 53,
 'stubb': 54,
 'to': 55,
 'pipe': 56,
 'mab': 57,
 'queen': 58,
 'cetology': 59,
 'specksnyder': 60,
 'cabin': 61,
 'table': 62,
 'head': 63,
 'mast': 64,
 'deck': 65,
 'quarter': 66,
 'sunset': 

Create a bag of words corpus
===========================

Now you know how to create a dictionary from a list and from a text file.

The next important object you need to be familiar with in gensim is the Corpus (a bag of words). It is an object that contains the word ids and their frequencies in each document. You can think of it as gensim’s equivalent of a document-term matrix.

Once you have the dictionary, all you need to do to create a bag-of-words corpus is pass each tokenized list of words to Dictionary.doc2bow().

Let’s create a corpus for a simple list (my_docs) containing 2 sentences.


In [7]:
# List with 2 sentences
my_docs = ["Who let the dogs out?",
           "Who? Who? Who? Who?"]

# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in my_docs]

# Create the Corpus
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(tok, allow_update=True) for tok in tokenized_list]
pprint(mycorpus)


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]


How to interpret the above corpus?

The (0, 1) in the first list item means that the word with id 0 appears once in the first document.
Likewise, the (4, 4) in the second list item means that the word with id 4 appears 4 times in the second document, and so on.

This is not human-readable. To convert the ids back to words, you need the dictionary.

Let’s see how to recover the words and their counts.

In [8]:
word_counts = [[(mydict[id], count) for id, count in line] for line in mycorpus]
pprint(word_counts)


[[('dogs', 1), ('let', 1), ('out', 1), ('the', 1), ('who', 1)], [('who', 4)]]


Notice that the order of the words is lost. Only each word and its frequency are retained.
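The counting step itself is small enough to sketch in plain Python. This is an illustration of the idea rather than gensim's code: `doc_to_bow` is a hypothetical helper, and the token-to-id mapping is copied from the output above.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document (illustrative helper)."""
    # Tokens missing from the mapping are ignored in this sketch
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Same mapping as the one recovered for my_docs above
token2id = {"dogs": 0, "let": 1, "out": 2, "the": 3, "who": 4}
id2token = {idx: tok for tok, idx in token2id.items()}

bow = doc_to_bow(["who", "who", "who", "who"], token2id)
print(bow)
print([(id2token[idx], n) for idx, n in bow])
```

Since the pairs are sorted by id and repeated tokens collapse into a single count, the original word order cannot be reconstructed from the result.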

Create a bag of words corpus from a text file
=======================================

Reading words from a Python list is straightforward because the entire text is already in memory.
However, you may have a large file that you don’t want to load into memory all at once.

You can read such files one line at a time by defining a class whose __iter__ method reads the file line by line and yields one bag-of-words document per line. But how do you create each bag-of-words document?

The __iter__() method of BoWCorpus reads a line from the file, converts it to a list of words using simple_preprocess(), and passes that list to the dictionary’s doc2bow().

Also, notice that I am using smart_open() from the smart_open package because it lets you open and read large files line by line from a variety of sources such as S3, HDFS, WebHDFS, HTTP, or local and compressed files. That’s pretty awesome, by the way!

However, if you use the built-in open() for a file on your own system, it will work perfectly fine as well.
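One reason to wrap the reading logic in a class with __iter__, rather than in a bare generator, is that the corpus usually has to be traversed more than once (for printing, for converting ids back to words, for saving). The standard-library sketch below illustrates the difference; `LineCorpus` and `line_generator` are hypothetical names, and the sample file is created only for the demonstration.

```python
import os
import tempfile


class LineCorpus:
    """Re-iterable stream of stripped, non-empty lines (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated repeatedly
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


def line_generator(path):
    """Plain generator: it can be consumed only once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield line.strip()


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\n\nsecond line\n")
    sample_path = tmp.name

corpus = LineCorpus(sample_path)
first_pass, second_pass = list(corpus), list(corpus)

gen = line_generator(sample_path)
gen_first, gen_second = list(gen), list(gen)

os.remove(sample_path)
print(first_pass == second_pass, gen_second)
```

The class yields the same lines on every pass, while the generator is empty the second time it is consumed.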


In [9]:
# Use only a portion of the moby_dick.txt file.
!head -n 2000 moby_dick.txt > moby_dick_2000.txt

In [12]:
from gensim.utils import simple_preprocess
# smart_open.open is the package's current entry point; it is aliased to the name used below
from smart_open import open as smart_open
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Note: stop_words is loaded here but not applied in this example
stop_words = stopwords.words('english')


class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        with smart_open(self.filepath, encoding='utf-8') as fin:
            for line in fin:
                line = line.strip()
                # skip empty lines
                if len(line) > 0:
                    # tokenize
                    tokenized_list = simple_preprocess(line, deacc=True)

                    # create bag of words; allow_update=True adds unseen tokens
                    # to self.dictionary in place, so no separate merge step is needed
                    bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)

                    # lazily yield the BoW
                    yield bow


# Create the Dictionary
mydict = corpora.Dictionary()

# Create the Corpus
bow_corpus = BoWCorpus('moby_dick_2000.txt', dictionary=mydict)  # memory friendly

# Print the token_id and count for each line.
for line in bow_corpus:
    print(line)

[nltk_data] Downloading package stopwords to /home/fonty/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[(0, 1), (1, 1)]
[(2, 1), (3, 1), (4, 1)]
[(5, 1), (6, 1), (7, 1)]
[(8, 1)]
[(9, 1)]
[(5, 1), (10, 1), (11, 1), (12, 2), (13, 1)]
[(14, 1), (15, 1)]
[(3, 1), (14, 1), (16, 1), (17, 1)]
[(3, 1), (14, 1), (18, 1), (19, 1)]
[(3, 1), (14, 1), (20, 1)]
[(14, 1), (21, 1)]
[(3, 1), (14, 1), (22, 1)]
[(3, 1), (14, 1), (23, 1)]
[(3, 1), (14, 1), (24, 1)]
[(3, 1), (14, 1), (25, 1)]
[(14, 1), (26, 1), (27, 1)]
[(14, 1), (28, 1)]
[(14, 1), (29, 1)]
[(14, 1), (30, 1)]
[(14, 1), (31, 1)]
[(14, 1), (32, 1)]
[(3, 1), (14, 1), (33, 1)]
[(3, 1), (14, 1), (34, 1)]
[(14, 1), (35, 1), (36, 1)]
[(3, 1), (14, 1), (37, 1)]
[(14, 1), (38, 1), (39, 1)]
[(14, 1), (40, 1), (41, 1)]
[(14, 1), (42, 1), (43, 1)]
[(3, 1), (14, 1), (44, 1), (45, 1)]
[(3, 1), (14, 1), (46, 1)]
[(14, 1), (47, 1)]
[(14, 1), (48, 1), (49, 1), (50, 1)]
[(14, 1), (48, 1), (49, 1), (50, 1)]
[(14, 1), (51, 1)]
[(14, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1)]
[(3, 1), (14, 1), (56, 1)]
[(14, 1), (57, 1), (58, 1)]
[(14, 1), (59, 1)]
[(3, 

In [13]:
print(mydict)

Dictionary<3810 unique tokens: ['dick', 'moby', 'or', 'the', 'whale']...>


In [14]:
word_counts = [[(mydict[id], count) for id, count in line] for line in bow_corpus]
pprint(word_counts)

[[('dick', 1), ('moby', 1)],
 [('or', 1), ('the', 1), ('whale', 1)],
 [('by', 1), ('herman', 1), ('melville', 1)],
 [('contents', 1)],
 [('etymology', 1)],
 [('by', 1), ('extracts', 1), ('librarian', 1), ('sub', 2), ('supplied', 1)],
 [('chapter', 1), ('loomings', 1)],
 [('the', 1), ('chapter', 1), ('bag', 1), ('carpet', 1)],
 [('the', 1), ('chapter', 1), ('inn', 1), ('spouter', 1)],
 [('the', 1), ('chapter', 1), ('counterpane', 1)],
 [('chapter', 1), ('breakfast', 1)],
 [('the', 1), ('chapter', 1), ('street', 1)],
 [('the', 1), ('chapter', 1), ('chapel', 1)],
 [('the', 1), ('chapter', 1), ('pulpit', 1)],
 [('the', 1), ('chapter', 1), ('sermon', 1)],
 [('chapter', 1), ('bosom', 1), ('friend', 1)],
 [('chapter', 1), ('nightgown', 1)],
 [('chapter', 1), ('biographical', 1)],
 [('chapter', 1), ('wheelbarrow', 1)],
 [('chapter', 1), ('nantucket', 1)],
 [('chapter', 1), ('chowder', 1)],
 [('the', 1), ('chapter', 1), ('ship', 1)],
 [('the', 1), ('chapter', 1), ('ramadan', 1)],
 [('chapter', 

Save a gensim dictionary and corpus to disk and load them back
========================================================

In [15]:
# Save the Dict and Corpus
mydict.save('moby_dick.dict')  # save dict to disk
corpora.MmCorpus.serialize('bow_moby_dick.mm', bow_corpus)  # save corpus to disk

In [16]:
# Load them back
loaded_dict = corpora.Dictionary.load('moby_dick.dict')

corpus = corpora.MmCorpus('bow_moby_dick.mm')
for line in corpus:
    print(line)

[(0, 1.0), (1, 1.0)]
[(2, 1.0), (3, 1.0), (4, 1.0)]
[(5, 1.0), (6, 1.0), (7, 1.0)]
[(8, 1.0)]
[(9, 1.0)]
[(5, 1.0), (10, 1.0), (11, 1.0), (12, 2.0), (13, 1.0)]
[(14, 1.0), (15, 1.0)]
[(3, 1.0), (14, 1.0), (16, 1.0), (17, 1.0)]
[(3, 1.0), (14, 1.0), (18, 1.0), (19, 1.0)]
[(3, 1.0), (14, 1.0), (20, 1.0)]
[(14, 1.0), (21, 1.0)]
[(3, 1.0), (14, 1.0), (22, 1.0)]
[(3, 1.0), (14, 1.0), (23, 1.0)]
[(3, 1.0), (14, 1.0), (24, 1.0)]
[(3, 1.0), (14, 1.0), (25, 1.0)]
[(14, 1.0), (26, 1.0), (27, 1.0)]
[(14, 1.0), (28, 1.0)]
[(14, 1.0), (29, 1.0)]
[(14, 1.0), (30, 1.0)]
[(14, 1.0), (31, 1.0)]
[(14, 1.0), (32, 1.0)]
[(3, 1.0), (14, 1.0), (33, 1.0)]
[(3, 1.0), (14, 1.0), (34, 1.0)]
[(14, 1.0), (35, 1.0), (36, 1.0)]
[(3, 1.0), (14, 1.0), (37, 1.0)]
[(14, 1.0), (38, 1.0), (39, 1.0)]
[(14, 1.0), (40, 1.0), (41, 1.0)]
[(14, 1.0), (42, 1.0), (43, 1.0)]
[(3, 1.0), (14, 1.0), (44, 1.0), (45, 1.0)]
[(3, 1.0), (14, 1.0), (46, 1.0)]
[(14, 1.0), (47, 1.0)]
[(14, 1.0), (48, 1.0), (49, 1.0), (50, 1.0)]
[(14, 1.0), 
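MmCorpus stores the corpus in the Matrix Market coordinate format, a plain-text layout with one `row column value` triple per line and 1-based indices. The sketch below is a simplified standard-library illustration of that layout, not gensim's reader or writer; `write_mm` and `read_mm` are hypothetical helpers. It also shows one way integer counts can come back as floats, as in the output above.

```python
import io


def write_mm(corpus, num_terms, stream):
    """Write BoW documents in Matrix Market coordinate layout (simplified sketch)."""
    entries = [(doc_no, term_id, value)
               for doc_no, doc in enumerate(corpus)
               for term_id, value in doc]
    stream.write("%%MatrixMarket matrix coordinate real general\n")
    # Size line: number of documents, number of terms, number of stored entries
    stream.write(f"{len(corpus)} {num_terms} {len(entries)}\n")
    for doc_no, term_id, value in entries:
        # Matrix Market indices are 1-based
        stream.write(f"{doc_no + 1} {term_id + 1} {value}\n")


def read_mm(stream):
    """Read the layout written above back into a list of BoW documents."""
    stream.readline()  # skip the header line
    num_docs, _num_terms, _num_entries = map(int, stream.readline().split())
    docs = [[] for _ in range(num_docs)]
    for row in stream:
        doc_no, term_id, value = row.split()
        # Values are parsed as floats, so a stored count of 1 is read back as 1.0
        docs[int(doc_no) - 1].append((int(term_id) - 1, float(value)))
    return docs


bow_docs = [[(0, 1), (1, 1)], [(2, 1), (3, 1), (4, 1)], [(12, 2)]]
buffer = io.StringIO()
write_mm(bow_docs, num_terms=13, stream=buffer)
buffer.seek(0)
restored = read_mm(buffer)
print(restored)
```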

Create the TFIDF matrix
=======================

Term Frequency – Inverse Document Frequency (TF-IDF) is also a bag-of-words model, but unlike the regular corpus, TF-IDF down-weights tokens (words) that appear frequently across documents.

How is TF-IDF computed?

TF-IDF is computed by multiplying a local component, such as term frequency (TF), by a global component, the inverse document frequency (IDF), and optionally normalizing the result to unit length.

As a result, words that occur frequently across documents are down-weighted.

There are several variations of the formulas for TF and IDF. Gensim uses the SMART Information Retrieval System notation to select among them: you choose the formula through the smartirs parameter of TfidfModel. See help(models.TfidfModel) for more details.

So, how do you get the TF-IDF weights?

Train a models.TfidfModel() on the corpus, then apply the trained model to the corpus using square brackets. See the example below.

In [17]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

# [['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
# [['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
# [['this', 1], ['document', 1], ['third', 1]]
print('======TF-IDF======')
# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
# [['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
# [['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
# [['this', 0.15], ['document', 0.7], ['third', 0.7]]

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]
[['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
[['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
[['this', 0.15], ['document', 0.7], ['third', 0.7]]


Notice the difference in weights of the words between the original corpus and the TF-IDF weighted corpus.

The words ‘is’ and ‘the’ occur in two documents and were weighted down. The word ‘this’, which appears in all three documents, received the lowest weight of all (0.13 to 0.15). In simple terms, words that occur more frequently across the documents get smaller weights.
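The printed weights can be reproduced by hand, which helps to see what the three letters of smartirs='ntc' stand for. The sketch below is a plain-Python reconstruction, not gensim's code. It assumes raw term frequency for "n", an IDF of log2((N + 1) / df) for "t" (an assumption inferred from the numbers printed above, which may differ in other gensim versions), and cosine (unit-length) normalization for "c".

```python
import math

docs = [
    ["this", "is", "the", "first", "line"],
    ["this", "is", "the", "second", "sentence"],
    ["this", "third", "document"],
]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1


def tfidf(doc):
    """Raw term frequency times log2((N + 1) / df), scaled to unit length."""
    weights = {w: doc.count(w) * math.log2((n_docs + 1) / df[w]) for w in set(doc)}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: round(v / norm, 2) for w, v in weights.items()}


for doc in docs:
    print(tfidf(doc))
```

For the first document, ‘first’ has df = 1 and so an IDF of log2(4) = 2, while ‘this’ has df = 3 and an IDF of log2(4/3), roughly 0.42. After dividing by the vector length these become the 0.63 and 0.13 shown in the output.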

Downloader API and create bigrams and trigrams
============================================

Gensim provides a built-in API to download popular text datasets and word embedding models.

A comprehensive list of available datasets and models is maintained here: https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json.

Using the API to download a dataset is as simple as calling the api.load() method with the right dataset or model name. That is all it takes to download datasets and pre-trained models with gensim.

Let’s download the text8 dataset, which is nothing but the “First 100,000,000 bytes of plain text from Wikipedia”. Then, from this, we will generate bigrams and trigrams.

But what are bigrams and trigrams, and why do they matter?

In running text, certain words tend to occur in pairs (bigrams) or in groups of three (trigrams), because the words combined form the actual entity. For example, the word ‘French’ refers to a language or region and the word ‘revolution’ can refer to a planetary revolution. But combined, ‘French Revolution’ refers to something completely different.

It’s quite important to form bigrams and trigrams from sentences, especially when working with bag-of-words models.


In [18]:
import gensim.downloader as api

# get dataset info
api.info('text8')
# load dataset
dataset = api.load("text8")
dataset = [wd for wd in dataset]

dct = corpora.Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]



So how to create the bigrams?

It’s quite easy and efficient with gensim’s Phrases model. The created Phrases model allows indexing, so, just pass the original text (list) to the built Phrases model to form the bigrams. An example is shown below:

In [19]:
# Build the bigram models
bigram = gensim.models.phrases.Phrases(dataset, min_count=10, threshold=100)
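To get a feel for what min_count and threshold control, here is a toy plain-Python sketch of collocation scoring. It is not the Phrases implementation: `train_bigram_scores` and `apply_bigrams` are hypothetical helpers, the score follows the commonly cited form (pair_count - min_count) / (count_a * count_b) * vocabulary_size, and the vocabulary size is taken here as the number of distinct words, which may differ from what gensim counts internally.

```python
from collections import Counter


def train_bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (pair_count - min_count) / (count_a * count_b) * vocab_size."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: number of distinct words
    return {
        (a, b): (n - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n in bigrams.items()
    }


def apply_bigrams(sentence, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold with '_'."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append("_".join(pair))
            i += 2  # both words were consumed by the bigram
        else:
            out.append(sentence[i])
            i += 1
    return out


toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city", "is", "old"],
    ["the", "city", "is", "big"],
    ["new", "york", "is", "old"],
]
scores = train_bigram_scores(toy_sentences, min_count=1)
print(scores[("new", "york")])
print(apply_bigrams(["i", "love", "new", "york"], scores, threshold=1.5))
```

A pair scores highly when its two words occur together much more often than their individual frequencies would suggest. Raising the threshold keeps fewer, stronger pairs, and raising min_count discounts pairs that were seen only a few times.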

Apply the bigram model to the first text.

In [20]:
bm = {}
for b in bigram[dataset[0]]:
  if "_" in b: # '_' is used to concatenate words into the bigram
    if b in bm:
      bm[b]=bm[b]+1
    else:
      bm[b]=1
pprint(bm)

{'anarchist_communism': 3,
 'anarcho_capitalism': 8,
 'anarcho_capitalist': 3,
 'anarcho_capitalists': 3,
 'anarcho_syndicalism': 8,
 'anarcho_syndicalist': 3,
 'anti_fascist': 2,
 'armoured_cars': 1,
 'avant_garde': 1,
 'ayn_rand': 1,
 'benjamin_tucker': 5,
 'bertrand_russell': 1,
 'carried_out': 1,
 'chat_rooms': 1,
 'civil_rights': 1,
 'civil_war': 8,
 'classical_liberalism': 1,
 'co_operative': 1,
 'cognitive_behavioral': 1,
 'cultural_imperialism': 1,
 'd_ration': 1,
 'dates_back': 1,
 'detailed_discussion': 1,
 'diagnostic_criteria': 1,
 'don_t': 2,
 'dsm_iv': 4,
 'e_g': 3,
 'emma_goldman': 5,
 'environmental_factors': 1,
 'ethnic_groups': 1,
 'external_links': 1,
 'f_lix': 1,
 'facial_expressions': 3,
 'folk_music': 1,
 'gender_roles': 1,
 'gilles_deleuze': 1,
 'gnu_linux': 1,
 'herbert_spencer': 1,
 'highly_controversial': 1,
 'hip_hop': 1,
 'holy_spirit': 1,
 'human_beings': 1,
 'hunter_gatherer': 2,
 'i_am': 2,
 'identify_themselves': 1,
 'individualist_anarchism': 4,
 'indiv

Can you guess how to create a trigram?

Well, rinse and repeat the same procedure on the output of the bigram model. Once you’ve generated the bigrams, you can pass the output to train a new Phrases model. Then, the bigrammed corpus is applied to the trained trigram model. Confused? See the example below.


In [21]:
# Build the trigram models
trigram = gensim.models.phrases.Phrases(bigram[dataset], min_count=10, threshold=100)

In [22]:
tm = {}
# Construct trigram
for t in trigram[bigram[dataset[0]]]:
  if len(t.split("_"))>2:
    if t in tm:
      tm[t]=tm[t]+1
    else:
      tm[t]=1
pprint(tm)

{'dsm_iv_tr': 1,
 'f_lix_guattari': 1,
 'g_n_rale': 1,
 'jean_jacques_rousseau': 1,
 'pierre_joseph_proudhon': 3,
 'spanish_civil_war': 4}


In [23]:
#Create bigram from a file
from gensim.utils import simple_preprocess

tokens = []
for line in open('moby_dick.txt', encoding='utf-8'):
    # tokenize
    tokenized_list = simple_preprocess(line, deacc=True)
    tokens.append(tokenized_list)

dct = corpora.Dictionary(tokens)
corpus = [dct.doc2bow(line) for line in tokens]

# Build the bigram models
bigram = gensim.models.phrases.Phrases(tokens , min_count=5, threshold=100)

In [24]:
# Construct bigram
bm = {}
for s in tokens:
  for b in bigram[s]:
    if not b.endswith('_'):
      if "_" in b: # '_' is used to concatenate words into the bigram
        if b in bm:
          bm[b]=bm[b]+1
        else:
          bm[b]=1
pprint(bm)

{'any_rate': 12,
 'aye_aye': 30,
 'book_ii': 7,
 'cape_horn': 14,
 'captain_peleg': 29,
 'centuries_ago': 6,
 'chief_mate': 18,
 'cruising_ground': 7,
 'cutting_spade': 7,
 'dost_thou': 13,
 'dough_boy': 16,
 'drew_nigh': 6,
 'fast_fish': 17,
 'father_mapple': 8,
 'forty_years': 10,
 'full_grown': 11,
 'gallant_sails': 3,
 'good_bye': 13,
 'good_deal': 8,
 'guernsey_man': 11,
 'ha_ha': 12,
 'heidelburgh_tun': 7,
 'ivory_leg': 13,
 'jack_knife': 8,
 'king_post': 10,
 'lamp_feeder': 7,
 'life_buoy': 12,
 'look_outs': 12,
 'loose_fish': 16,
 'lower_jaw': 15,
 'main_mast': 14,
 'main_top': 10,
 'mast_head': 44,
 'mast_heads': 37,
 'moby_dick': 80,
 'monkey_rope': 8,
 'mr_starbuck': 25,
 'mrs_hussey': 12,
 'must_needs': 11,
 'never_mind': 19,
 'new_bedford': 17,
 'pequod_meets': 10,
 'pivot_hole': 7,
 'quarter_deck': 32,
 'rose_bud': 6,
 'sag_harbor': 7,
 'seven_hundred': 8,
 'seventy_seventh': 6,
 'she_blows': 15,
 'spout_hole': 10,
 'spouter_inn': 6,
 'st_george': 9,
 'steering_oar': 7,
 

Train Word2Vec model using gensim
=================================

A word embedding model is a model that provides a numerical vector for a given word. Using gensim’s downloader API, you can download pre-built word embedding models like word2vec, fastText, GloVe and ConceptNet. These are built on large corpora of commonly occurring text data such as Wikipedia, Google News, etc.

However, if you are working in a specialized niche such as technical documents, you may not be able to get word embeddings for all the words. In such cases, it’s desirable to train your own model.

Gensim’s Word2Vec implementation lets you train your own word embedding model for a given corpus.

In [25]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]

# Train Word2Vec model. The default vector size is 100.
model = Word2Vec(data_part1, min_count = 0, workers=cpu_count())

In [26]:
# Get the word vector for a given word
wvcat = model.wv['cat']
pprint(wvcat)

array([ 0.72837514,  0.46999422, -1.5906204 ,  0.02661184, -0.7880465 ,
       -1.0289775 ,  1.0799946 ,  1.2812871 , -0.5881823 , -0.26630947,
        0.32309926, -0.29398748, -0.05283977,  0.66423386, -0.21376342,
       -0.44050166,  0.17594026,  0.00531982,  0.7763083 ,  0.5426263 ,
       -1.3476917 , -0.9364098 ,  0.15847233,  0.6604394 , -0.7820675 ,
       -1.6109391 ,  0.03881091,  0.49071988,  0.87336236, -0.7390944 ,
        1.348666  , -0.26909322,  0.41117972,  0.2415903 , -1.0835719 ,
       -0.45663828, -0.57888913,  1.0062555 , -0.7629311 , -1.4163765 ,
        0.38934773,  0.57231486, -1.182781  , -1.1817743 ,  0.46336493,
       -1.2941573 ,  0.3917595 , -0.09167895,  0.74685794, -1.4349161 ,
        1.3214214 ,  0.47409415, -1.7148645 , -0.11898834,  0.20113483,
       -1.8425608 , -1.4740602 , -1.4940908 , -0.23297542,  0.37020728,
       -0.05250269,  0.12941001,  0.20102324,  0.4134369 , -0.635621  ,
        0.44304103,  0.23150936, -0.95405966,  0.39170527,  0.30

In [29]:
#get similar words
model.wv.most_similar('cat')

[('dog', 0.8306573629379272),
 ('sweet', 0.788445234298706),
 ('bee', 0.7865030765533447),
 ('flower', 0.7734196782112122),
 ('goat', 0.7535891532897949),
 ('bird', 0.7403355240821838),
 ('eyed', 0.7379123568534851),
 ('aloe', 0.7363643050193787),
 ('cow', 0.7336905002593994),
 ('bear', 0.7272558808326721)]
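The scores returned by most_similar() are cosine similarities between word vectors. The standard-library sketch below shows the computation on made-up three-dimensional vectors; the vectors and the `most_similar` helper are illustrative assumptions and have nothing to do with the values learned by the model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word` (illustrative helper)."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


print(most_similar("cat", toy_vectors))
```

A value close to 1 means the two vectors point in nearly the same direction, so ‘dog’ ranks above ‘car’ for these toy vectors.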

In [30]:
# Save model
model.save('model_1')

We have trained and saved a Word2Vec model for our documents. However, when a new dataset arrives, you may want to update the model to account for new words.
On an existing Word2Vec model, call build_vocab() on the new dataset with update=True and then call the train() method. build_vocab() is called first because the model has to be apprised of what new words to expect in the incoming corpus.

In [31]:
# Load model
model = Word2Vec.load('model_1')
# Update the model with new data
model.build_vocab(data_part2, update=True)
model.train(data_part2, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model_2')

In [32]:
# Get the word vector for given word
wvcat = model.wv['cat']
pprint(wvcat)

array([ 0.8282245 ,  0.84294796, -1.690042  ,  0.19623046, -0.72806877,
       -1.6114273 ,  1.1052216 ,  1.660066  , -0.9468567 , -0.15754865,
        0.83342034, -0.24367188, -0.10019975,  1.0250324 ,  0.19377466,
       -0.94260854,  0.3857294 ,  0.06678723,  1.2230035 ,  0.9974487 ,
       -1.449075  , -0.71466994,  0.04461149,  0.806071  , -0.75222015,
       -1.3038522 ,  0.4726949 ,  0.21151261,  1.4201963 , -0.7107925 ,
        1.1038533 , -0.28408712,  0.5263383 ,  0.29119483, -1.1705174 ,
       -0.5010353 , -0.577786  ,  1.0438479 , -1.1100441 , -1.7172271 ,
        0.6027778 ,  0.7704786 , -1.2443457 , -1.0379335 ,  0.3584203 ,
       -1.5220275 ,  0.38735113, -0.46733156,  0.5794025 , -1.4255886 ,
        1.6245542 ,  0.78526783, -1.8197856 , -0.9461159 ,  0.42868504,
       -2.1866922 , -1.4867642 , -1.1405015 , -0.43092072,  0.9959547 ,
        0.01077598,  0.05654882,  0.3313903 ,  0.14858323, -0.46720445,
        0.279997  ,  0.4989249 , -0.97454333,  0.3022567 ,  0.31

In [33]:
#get similar words
model.wv.most_similar('cat')

[('dog', 0.8749537467956543),
 ('bee', 0.8030326962471008),
 ('cow', 0.7863951921463013),
 ('goat', 0.7785074710845947),
 ('eyed', 0.7617635130882263),
 ('ass', 0.7543697953224182),
 ('stuffed', 0.7390522360801697),
 ('bird', 0.7387641072273254),
 ('pet', 0.7382631897926331),
 ('bear', 0.72823166847229)]

Create word2vec model from a text file.

In [35]:
from gensim.utils import simple_preprocess
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

#create word2vec model from text file
tokens = []
for line in open('moby_dick.txt', encoding='utf-8'):
    # tokenize
    tokenized_list = simple_preprocess(line, deacc=True)
    tokens.append(tokenized_list)

# Train Word2Vec model
model = Word2Vec(tokens, min_count = 1, vector_size=50, workers=cpu_count())

In [36]:
# Get the word vector for given word
model.wv['whole']

array([-0.08501944, -0.16705601, -0.34895813,  0.06211347,  0.13621935,
       -1.1478244 ,  0.43677822,  1.7451652 , -0.31557438, -0.6042562 ,
        0.2538355 , -1.6225078 , -0.11179993,  1.2106034 , -0.63412726,
        0.785033  ,  0.88433653,  0.10936393, -0.9340433 , -1.6650339 ,
        0.3126828 ,  0.27254486,  0.99801886, -0.6984207 ,  0.64365894,
       -0.15205695, -0.06257302, -0.03117912, -0.70501226, -0.36861187,
       -0.01487787,  0.37061378,  0.91295177, -0.5892126 , -0.47076505,
       -0.08561809,  0.8600785 , -0.50784945,  0.13303635, -0.10457736,
        0.7683668 ,  0.18873312, -0.72218215,  0.2811368 ,  1.3298738 ,
       -0.2285316 , -0.3541198 , -0.04142136,  0.30809757,  1.0555038 ],
      dtype=float32)

In [37]:
#get similar words
model.wv.most_similar('ocean')

[('whole', 0.9992893934249878),
 ('end', 0.9990779161453247),
 ('under', 0.9990672469139099),
 ('iron', 0.9990155696868896),
 ('full', 0.9989867210388184),
 ('off', 0.9989599585533142),
 ('black', 0.9989580512046814),
 ('waters', 0.9989373683929443),
 ('ground', 0.9989253878593445),
 ('while', 0.9989184737205505)]

In [38]:
#get similarity between two words
model.wv.similarity('ocean','whole')

0.9992892

Import pre-trained word2vec
============================

We just saw how to get the word vectors from the Word2Vec model we trained. However, gensim also lets you download state-of-the-art pretrained models through the downloader API. Let’s see how to extract the word vectors from a couple of these models.

In [39]:
import gensim.downloader as api

# Download the models (1660MB)
word2vec_model300 = api.load('word2vec-google-news-300')



KeyboardInterrupt: 

In [None]:
#get similar words
word2vec_model300.most_similar('cat')

In [40]:
import gensim.downloader as api

#download a model based on Glove (128MB)
glove_model100 = api.load('glove-wiki-gigaword-100')



In [41]:
#get similar words
glove_model100.most_similar('cat')

[('dog', 0.8798074722290039),
 ('rabbit', 0.7424426674842834),
 ('cats', 0.7323005199432373),
 ('monkey', 0.7288710474967957),
 ('pet', 0.719014048576355),
 ('dogs', 0.716387152671814),
 ('mouse', 0.6915250420570374),
 ('puppy', 0.6800068020820618),
 ('rat', 0.6641027331352234),
 ('spider', 0.6501135230064392)]

In [42]:
glove_model100.similarity('road','car')

0.53735423

In [43]:
glove_model100.similarity('dog','cat')

0.8798075