In [1]:
# Gensim is a Python library for topic modelling, document 
# indexing and similarity retrieval with large corpora. 
# The target audience is the natural language processing (NLP)
# and information retrieval (IR) community.

In [2]:
# Features
# * All algorithms are memory-independent with respect to the
#   corpus size (can process input larger than RAM, streamed, out-of-core)
# * Intuitive interfaces
#    * easy to plug in your own input corpus/datastream (simple stream API)
#    * easy to extend with other Vector Space algorithms (simple transform API)
# * Efficient multicore implementations of popular algorithms
#   such as online Latent Semantic Analysis (LSA/LSI/SVD),
#   Latent Dirichlet Allocation (LDA), Random Projections (RP),
#   Hierarchical Dirichlet Process (HDP) or word2vec deep 
#   learning.
# * Distributed computing: can run Latent Semantic Analysis and 
#   Latent Dirichlet Allocation on a cluster of computers.

In [7]:

# Vector space model
# The vector space model, or term vector model, is an algebraic model
# for representing text documents (and any objects, in general)
# as vectors of identifiers (such as index terms). It is used 
# in information filtering, information retrieval, indexing
# and relevancy rankings. Its first use was in the SMART
# Information Retrieval System.

# Definitions
# Documents and queries are represented as vectors

In [6]:
%%latex
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$


In [8]:
# Each dimension corresponds to a separate term. If a term 
# occurs in the document, its value in the vector is 
# non-zero. Several different ways of computing these values,
# also known as (term) weights, have been developed. One of
# the best known schemes is tf-idf weighting.

# The definition of term depends on the application. Typically 
# terms are single words, keywords, or longer phrases. If words
# are chosen to be the terms, the dimensionality of the vector
# is the number of words in the vocabulary (the number of 
# distinct words occurring in the corpus).

# Vector operations can be used to compare documents with queries.



In [9]:
# Application
# Relevance rankings of documents in a keyword search can be 
# calculated, using the assumptions of document similarity
# theory, by comparing the deviation of angles between 
# each document vector and the original query vector,
# where the query is represented as a vector with the
# same dimension as the vectors that represent the other 
# documents. 
# In practice, it is easier to calculate the cosine of the 
# angle between the vectors, instead of the angle itself.
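
# As a minimal sketch of the idea above, the cosine can be computed
# directly from the dot product and the two vector norms. The function
# name cosine_similarity and the example vectors are illustrative
# assumptions, not part of Gensim.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
doc_a = [2.0, 2.0, 0.0]  # points in the same direction as the query
doc_b = [0.0, 0.0, 3.0]  # shares no terms with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

# A document that uses the same terms in the same proportions as the
# query scores close to 1, regardless of its length; a document with
# no terms in common scores 0.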


In [13]:
%%latex
$\cos\theta = \frac{d_2 \cdot q}{\|d_2\| \, \|q\|}$


In [12]:
# Advantages
# The vector space model has the following advantages over 
# the Standard Boolean model:
# 1.  Simple model based on linear algebra
# 2.  Term weights not binary
# 3.  Allows computing a continuous degree of similarity 
#     between queries and documents.
# 4.  Allows ranking documents according to their possible 
#     relevance.
# 5.  Allows partial matching.


In [14]:
# Topic modelling for humans
# * Train large-scale semantic NLP models
# * Represent text as semantic vectors
# * Find semantically related documents

In [15]:
# Core concepts
# The core concepts of gensim are: 
# 1. Document: some text.
# 2. Corpus: a collection of documents.
# 3. Vector: a mathematically convenient representation of 
# a document.
# 4. Model: an algorithm for transforming vectors from 
# one representation to another. 
# Let's examine each of these in slightly more detail. 

In [16]:
# Document 
# In Gensim, a document is an object of the text sequence type
# (commonly known as str in Python 3). A document could be 
# anything from a short 140-character tweet, a single paragraph
# (e.g., a journal article abstract), a news article, or a book. 

In [17]:
document = "Human machine interface for lab abc computer applications"

In [18]:
# Corpus
# A corpus is a collection of Document objects. Corpora serve
# two roles in Gensim:
# 1. Input for training a Model. During training, the models 
# use this training corpus to look for common themes and topics,
# initializing their internal model parameters. 
# 
#    Gensim focuses on unsupervised models so that no human 
#    intervention, such as costly annotations or tagging 
#    documents by hand, is required. 
# 2. Documents to organize. After training, a topic model can 
# be used to extract topics from new documents (documents not
# seen in the training corpus).
#    Such corpora can be indexed for Similarity Queries, 
#    queried by semantic similarity, clustered etc. 
# Here is an example corpus. It consists of 9 documents, where
# each document is a string consisting of a single sentence. 

In [19]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [20]:
# Important
# The above example loads the entire corpus into memory. In
# practice, corpora may be very large, so loading them into 
# memory may be impossible. Gensim intelligently handles 
# such corpora by streaming them one document at a time.
# See Corpus Streaming - One Document at a Time for details.
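
# One common way to stream a corpus is an object whose __iter__ reads
# the source lazily. The sketch below assumes a plain-text file with
# one document per line; the class name StreamedCorpus and the
# temporary file are hypothetical and only serve the illustration.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one tokenized document per line without loading the whole file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Only the current line is held in memory.
                yield line.lower().split()


# Write a tiny stand-in corpus file (made-up data, one document per line).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\nGraph minors A survey\n")
    corpus_path = tmp.name

streamed = StreamedCorpus(corpus_path)
for tokens in streamed:
    print(tokens)

# Unlike a plain generator, the object can be iterated more than once.
second_pass = list(streamed)
os.remove(corpus_path)
```

# Using a class with __iter__ rather than a generator function matters
# when the corpus has to be read several times, since a generator is
# exhausted after one pass.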

In [21]:
# This is a particularly small example of a corpus for 
# illustration purposes. Another example could be a list 
# of all the plays written by Shakespeare, a list of all 
# Wikipedia articles, or all tweets by a particular person 
# of interest. 

In [22]:
# After collecting our corpus, there are typically a number of 
# preprocessing steps we want to undertake. We'll keep it 
# simple and just remove some commonly used English words (
# such as 'the') and words that occur only once in the corpus. 
# In the process of doing so, we'll tokenize our data. 
# Tokenization breaks up the documents into words (in this 
# case using space as a delimiter).

In [23]:
# Important
# There are better ways to perform preprocessing than just 
# lower-casing and splitting by space. Effective preprocessing
# is beyond the scope of this tutorial: if you're interested, 
# check out the gensim.utils.simple_preprocess() function.
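
# To see why splitting on spaces is limited, the sketch below compares
# it with a small regular-expression tokenizer built on the standard
# library. This is not how simple_preprocess is implemented; the
# pattern, the tokenize helper and the sample sentence are assumptions
# made for illustration.

```python
import re

# Runs of lowercase letters or digits count as tokens in this sketch.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its alphanumeric runs."""
    return TOKEN_PATTERN.findall(text.lower())


sample = "Graph minors: a survey."

# Plain splitting leaves punctuation attached to the words.
print(sample.lower().split())

# The regex tokenizer strips it, so "survey." and "survey" become one token.
print(tokenize(sample))
```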

In [24]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))

In [25]:
stoplist

{'a', 'and', 'for', 'in', 'of', 'the', 'to'}

In [27]:
# Lowercase each document, split it by whitespace and filter
# out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
        for document in text_corpus]

In [28]:
texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

In [29]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [30]:
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token]>1]
                   for text in texts]

In [31]:
processed_corpus

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [32]:
# Before proceeding, we want to associate each word in the 
# corpus with a unique integer ID. We can do this using 
# the gensim.corpora.Dictionary class. This dictionary defines
# the vocabulary of all words that our processing knows about.

In [36]:
from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


In [37]:
# Because our corpus is small, there are only 12 different
# tokens in this gensim.corpora.Dictionary. For larger corpora,
# dictionaries that contain hundreds of thousands of tokens
# are quite common.

In [39]:
# Vector
# To infer the latent structure in our corpus we need a way 
# to represent documents that we can manipulate mathematically. 
# One approach is to represent each document as a vector of
# features. For example, a single feature may be thought of 
# as a question-answer pair: 
# 1. How many times does the word splonge appear in the 
# document? Zero
# 2. How many paragraphs does the document consist of? Two
# 3. How many fonts does the document use? Five
# The question is usually represented only by its integer id 
# (such as 1, 2 and 3). The representation of this document 
# becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0).
# This is known as a dense vector, because it contains an 
# explicit answer to each of the above questions. 

In [40]:
# If we know all the questions in advance, we may leave them 
# implicit and simply represent the document as (0, 2, 5). 
# This sequence of answers is the vector for our document
# (in this case a 3-dimensional dense vector). For practical
# purposes, only questions to which the answer is (or 
# can be converted to) a single floating point number 
# are allowed in Gensim.
# In practice, vectors often consist of many zero values. 
# To save memory, Gensim omits all vector elements with value
# 0.0. The above example thus becomes (2, 2.0), (3, 5.0). This
# is known as a sparse vector or bag-of-words vector. The values
# of all missing features in this sparse representation
# can be unambiguously resolved to zero (0.0).
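
# The conversion between the two forms can be sketched in a few lines.
# The helper names are hypothetical, and feature ids start at 1 here
# only to match the numbered questions above (Gensim's own dictionary
# ids start at 0).

```python
def dense_to_sparse(dense):
    """Keep only the non-zero entries as (feature_id, value) pairs."""
    return [(i, float(v)) for i, v in enumerate(dense, start=1) if v != 0]


def sparse_to_dense(sparse, num_features):
    """Rebuild the full vector, filling every missing feature with 0.0."""
    dense = [0.0] * num_features
    for feature_id, value in sparse:
        dense[feature_id - 1] = value
    return dense


dense = [0.0, 2.0, 5.0]
sparse = dense_to_sparse(dense)

print(sparse)
print(sparse_to_dense(sparse, num_features=3))
```

# Note that going back to the dense form requires knowing the total
# number of features, because the sparse form does not record it.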

In [41]:
# Assuming the questions are the same, we can compare the 
# vectors of two different documents to each other. For 
# example, assume we are given two vectors (0.0, 2.0, 5.0)
# and (0.1, 1.9, 4.9). Because the vectors are very similar
# to each other, we can conclude that the documents 
# corresponding to those vectors are similar, too. Of course, 
# the correctness of that conclusion depends on how well we 
# picked the questions in the first place. 
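
# One way to make "very similar" concrete is to measure the distance
# between the vectors. The sketch below uses Euclidean distance, which
# is only one of several possible measures; doc_c is a made-up third
# vector added for contrast. math.dist requires Python 3.8 or later.

```python
import math

doc_a = (0.0, 2.0, 5.0)
doc_b = (0.1, 1.9, 4.9)
doc_c = (5.0, 0.0, 0.0)  # hypothetical document with very different answers

# A small distance means the documents answer the questions alike.
dist_ab = math.dist(doc_a, doc_b)
dist_ac = math.dist(doc_a, doc_c)

print(dist_ab)
print(dist_ac)
```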

In [42]:
# Another approach to represent a document as a vector is 
# the bag-of-words model. Under the bag-of-words model 
# each document is represented by a vector containing the 
# frequency counts of each word in the dictionary. For 
# example, assume we have a dictionary containing the words
# ['coffee', 'milk', 'sugar', 'spoon']. A document consisting
# of the string "coffee milk coffee" would then be represented
# by the vector [2, 1, 0, 0] where the entries of the vector are 
# (in order) the occurrences of "coffee", "milk", "sugar"
# and "spoon" in the document. The length of the vector is the
# number of entries in the dictionary. One of the main properties
# of the bag-of-words model is that it completely ignores the
# order of the tokens in the document that is encoded, which is
# where the name bag-of-words comes from. 
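
# The coffee example can be reproduced with a short standard-library
# sketch. The bag_of_words helper is hypothetical and returns a dense
# vector, unlike the sparse form Gensim uses.

```python
from collections import Counter

vocabulary = ["coffee", "milk", "sugar", "spoon"]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


original = bag_of_words("coffee milk coffee", vocabulary)
reordered = bag_of_words("coffee coffee milk", vocabulary)

print(original)
print(reordered)
```

# Both calls give the same vector, which shows that word order is
# lost in this representation.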

In [43]:
# Our processed corpus has 12 unique words in it, which means
# that each document will be represented by a 12-dimensional
# vector under the bag-of-words model. We can use the dictionary
# to turn tokenized documents into these 12-dimensional vectors.
# We can see which ID each token corresponds to:

In [45]:
import pprint
pprint.pprint(dictionary.token2id)

{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}


In [46]:
# For example, suppose we wanted to vectorize the phrase 
# "Human computer interaction" (note that this phrase was 
# not in our original corpus). We can create the bag-of-words
# representation for a document using the doc2bow method of 
# the dictionary, which returns a sparse representation of the
# word counts:

In [49]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec

[(0, 1), (1, 1)]

In [50]:
# The first entry in each tuple corresponds to the ID of the 
# token in the dictionary, the second corresponds to the count
# of this token. 

In [51]:
# Note that "interaction" did not occur in the original corpus
# and so it was not included in the vectorization. Also note
# that this vector only contains entries for words that actually
# appeared in the document. Because any given document will only
# contain a few words out of the many words in the dictionary, 
# words that do not appear in the vectorization are represented
# as implicitly zero as a space saving measure. 
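
# The behaviour described above can be imitated in plain Python. This
# is a rough sketch of the idea, not Gensim's implementation of
# doc2bow; the helper name doc_to_bow is hypothetical, and the
# token-to-ID mapping is copied from the dictionary printed earlier.

```python
from collections import Counter

token2id = {
    "computer": 0, "human": 1, "interface": 2, "response": 3,
    "survey": 4, "system": 5, "time": 6, "user": 7,
    "eps": 8, "trees": 9, "graph": 10, "minors": 11,
}


def doc_to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    # Tokens missing from the mapping (such as "interaction") are skipped.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


new_bow = doc_to_bow("human computer interaction".split(), token2id)
repeated_bow = doc_to_bow(["system", "human", "system", "eps"], token2id)

print(new_bow)
print(repeated_bow)
```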

In [52]:
# We can convert our entire original corpus to a list of vectors:

In [54]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


In [55]:
# Note that while this list lives entirely in memory, in most 
# applications you will want a more scalable solution. Luckily, 
# gensim allows you to use any iterator that returns a single
# document vector at a time. See the documentation for more details.

In [56]:
# Important!
# Depending on how the representation was obtained, two 
# different documents may have the same vector representations. 

In [57]:
# Model
# Now that we have vectorized our corpus we can begin to
# transform it using models. We use model as an abstract 
# term referring to a transformation from one document 
# representation to another. In Gensim, documents are 
# represented as vectors so a model can be thought of as 
# a transformation between two vector spaces. The model learns
# the details of this transformation during training, when
# it reads the training Corpus.

In [58]:
# One simple example of a model is tf-idf. The tf-idf model
# transforms vectors from the bag-of-words representation
# to a vector space where the frequency counts are weighted
# according to the relative rarity of each word in the corpus.
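
# The weighting can be sketched by hand. The code below assumes the
# weight of a term is its count multiplied by log2(N / df), where N is
# the number of documents and df the number of documents containing
# the term, and that the result is scaled to unit length. These are
# believed to match the defaults of Gensim's TfidfModel, but treat
# that as an assumption; the helper name tfidf_weights is hypothetical.

```python
import math

processed_corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["user", "response", "time"],
    ["trees"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["graph", "minors", "survey"],
]


def tfidf_weights(tokens, corpus):
    """Return {token: weight} using tf * log2(N / df), scaled to unit length."""
    num_docs = len(corpus)
    weights = {}
    for token in set(tokens):
        doc_freq = sum(1 for doc in corpus if token in doc)
        if doc_freq == 0:
            continue  # token never seen in the corpus
        weights[token] = tokens.count(token) * math.log2(num_docs / doc_freq)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


weights = tfidf_weights(["system", "minors"], processed_corpus)
print(weights)
```

# "system" appears in 3 of the 9 documents and "minors" in 2, so the
# rarer word "minors" receives the larger weight.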

In [59]:
# Here's a simple example. Let's initialize the tf-idf model,
# train it on our corpus and transform the string
# "system minors":

In [60]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]


In [61]:
# The tfidf model again returns a list of tuples, where the 
# first entry is the token ID and the second entry is the 
# tf-idf weighting. Note that the ID corresponding to "system"
# (which occurred 4 times in the original corpus) has been 
# weighted lower than the ID corresponding to "minors" (which 
# only occurred twice).

# You can save trained models to disk and later load them back,
# either to continue training on new training documents or 
# to transform new documents.

# Gensim offers a number of different models/transformations. 
# For more, see Topics and Transformations.

In [62]:
# Once you've created the model, you can do all sorts of cool 
# stuff with it. For example, to transform the whole corpus
# via tf-idf and index it, in preparation for similarity queries:

In [63]:
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

In [65]:
# And to query the similarity of our query document
# query_document against every document in the corpus:

In [66]:
query_document = 'system engineering'.split()

In [67]:
query_bow = dictionary.doc2bow(query_document)

In [68]:
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


In [70]:
text_corpus

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time',
 'The EPS user interface management system',
 'System and human system engineering testing of EPS',
 'Relation of user perceived response time to error measurement',
 'The generation of random binary unordered trees',
 'The intersection graph of paths in trees',
 'Graph minors IV Widths of trees and well quasi ordering',
 'Graph minors A survey']

In [71]:
# How should we read this output? Document 3 has a similarity score
# of 0.718 (about 72%), document 2 has a similarity score of about
# 42%, etc. We can make this slightly more readable by sorting.

In [72]:
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0


In [73]:
# Summary 
# The core concepts of gensim are:
# 1. Document: some text
# 2. Corpus: a collection of documents
# 3. Vector: a mathematically convenient representation of a document
# 4. Model: an algorithm for transforming vectors from 
# one representation to another. 

In [None]:
# We saw these concepts in action. First, we started with a 
# corpus of documents. Next, we transformed these documents
# to a vector space representation. After that, we created 
# a model that transformed our original vector representation
# to tf-idf. Finally, we used our model to calculate the 
# similarity between some query document and all documents 
# in the corpus.