In [1]:
import pprint

In [3]:
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [4]:
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

In [5]:
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

In [6]:
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


In [7]:
# Associate each word in the corpus with a unique integer ID using the gensim.corpora.Dictionary class.

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


Because our corpus is small, there are only 12 unique tokens in the dictionary. Larger corpora commonly have dictionaries containing hundreds of thousands of tokens.
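
To make the idea of a dictionary concrete, here is a minimal plain-Python sketch of a token-to-ID mapping. It is not Gensim's implementation. The assumption that unseen tokens receive IDs in alphabetical order within each document is mine; it happens to reproduce the IDs seen in the outputs of this notebook. On the real object, the mapping should be available as `dictionary.token2id`.

```python
# Plain-Python sketch of a token -> integer ID mapping (not Gensim's implementation).
processed_corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

token2id = {}
for text in processed_corpus:
    # Assumption: unseen tokens of a document get IDs in alphabetical order.
    for token in sorted(set(text)):
        if token not in token2id:
            token2id[token] = len(token2id)

print(len(token2id))       # 12
print(token2id['system'])  # 5
```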

<h3>Vector</h3>
To infer the latent structure in the corpus, we need to represent the documents mathematically. We can do this by representing each document as a vector of features.
A single feature may be thought of as a question-answer pair:
    
    1. How many times does the word splonge appear in the document? 0
    2. How many paragraphs does the document consist of? 2
    3. How many fonts does the document use? 5
    
The features of this document can then be represented as a series of pairs like:
    (1, 0), (2, 2), (3, 5). This is known as a dense vector, because it contains an explicit answer to every question.
    

    Note: Only questions to which the answer is (or can be converted to) a single floating point number are allowed in Gensim.

    Note: In practice, vectors often contain many zero values. To save memory, Gensim omits all vector elements with value 0.0.
    
Assuming the questions are the same, we can compare the vectors of two different documents to each other.
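
The difference between the dense form and the zero-omitting form can be sketched in plain Python, using the three question-answer pairs from above. The helper name `to_dense` is hypothetical and only serves this illustration.

```python
# Dense vector: one (question ID, answer) pair for every question.
dense = [(1, 0.0), (2, 2.0), (3, 5.0)]

# Sparse vector: the same information with zero-valued entries dropped.
sparse = [(qid, value) for qid, value in dense if value != 0.0]
print(sparse)  # [(2, 2.0), (3, 5.0)]


def to_dense(sparse_vec, question_ids):
    """Restore the omitted zeros for a known, ordered list of question IDs."""
    lookup = dict(sparse_vec)
    return [(qid, lookup.get(qid, 0.0)) for qid in question_ids]


restored = to_dense(sparse, [1, 2, 3])
print(restored == dense)  # True
```

Nothing is lost by dropping the zeros, as long as the full list of questions is known.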


<h3> Bag of words model</h3>

Another way to represent a document as a vector is the 'bag of words' model. Under this model each document is represented by a vector containing the frequency count of each word in the dictionary.
For example, ['coffee','milk','sugar','spoon'] could be the dictionary.

A document consisting of the string "coffee milk coffee" would be represented in a bag of words model for this dict by the vector: [2,1,0,0]

One of the main properties of the 'bag of words' model is that it completely ignores the order of the tokens in the document that is encoded.    
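
The coffee example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not how Gensim does it; the names `vocabulary` and `bag_of_words` are made up for the illustration.

```python
from collections import Counter

# The example dictionary from the text, in a fixed order.
vocabulary = ['coffee', 'milk', 'sugar', 'spoon']


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]


vec_a = bag_of_words("coffee milk coffee", vocabulary)
vec_b = bag_of_words("milk coffee coffee", vocabulary)
print(vec_a)           # [2, 1, 0, 0]
print(vec_a == vec_b)  # True
```

The two strings differ only in word order, so they end up with the same vector.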

The whole corpus can be converted into a list of vectors. Each vector is stored sparsely as (token ID, count) pairs:

In [9]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


Note: The distinction between a document and a vector is that the former is text, and the latter is a mathematically convenient representation of the text.

Note: Depending on how the representation was obtained, two different documents may have the same vector representation.

The vectorised corpus can now be transformed using models.
One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag of words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

In [10]:
from gensim import models

# Train the model
tfidf = models.TfidfModel(bow_corpus)

# Transform the "system minors" string

words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]


The tf-idf model returns a list of tuples, where the first entry is the token ID and the second entry is the tf-idf weight. The word 'system' (ID 5), which occurs 4 times in the original corpus, receives a weight of 0.59. The word 'minors' (ID 11), which occurs only twice, receives a higher weight of 0.81.
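
The two weights for "system minors" can be checked by hand. The sketch below assumes that the model uses raw term counts, an inverse document frequency of log2(N / df), and normalisation to unit length. These assumptions are mine; they reproduce the printed numbers, but the Gensim documentation is the place to confirm the defaults.

```python
import math

# Bag-of-words corpus copied from the output above: (token ID, count) pairs.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_docs = len(bow_corpus)

# Document frequency: in how many documents does each token ID occur?
doc_freq = {}
for doc in bow_corpus:
    for token_id, _count in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1


def tfidf_vector(bow):
    """Assumed weighting: count * log2(N / df), then scaled to unit length."""
    weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in bow if tid in doc_freq]
    norm = math.sqrt(sum(w * w for _tid, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []


# "system" has ID 5 and "minors" has ID 11.
query_weights = tfidf_vector([(5, 1), (11, 1)])
print(query_weights)  # roughly [(5, 0.5898), (11, 0.8075)]
```

'system' occurs in 3 of the 9 documents and 'minors' in only 2, which is why 'minors' ends up with the larger weight.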

Once the model has been created, you can use it for other transformations.

For example, transform the whole corpus via tf-idf and index it in preparation for similarity queries:

In [12]:
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
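
As far as I understand, the index scores documents by cosine similarity between sparse vectors. A minimal plain-Python sketch of that measure, using the tf-idf vector of "system minors" printed earlier, could look like this. The function `cosine_similarity` and the two one-word vectors are illustrative only, not part of Gensim.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (ID, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b[tid] for tid, weight in a.items() if tid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# tf-idf vector of "system minors", copied from the output above.
system_minors = [(5, 0.5898341626740045), (11, 0.8075244024440723)]
# Hypothetical one-word vectors for comparison.
system_only = [(5, 1.0)]
trees_only = [(9, 1.0)]

print(cosine_similarity(system_minors, system_only))  # about 0.59
print(cosine_similarity(system_minors, trees_only))   # 0.0
```

Vectors that share no token IDs always score 0.0, whatever their individual weights are.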

Now we can query the similarity of a query document against every document in the corpus:
    

In [13]:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
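
The scores are easier to read when sorted. The sketch below copies the values from the output above into a plain list, so it runs without the index.

```python
# Similarity scores copied from the output above, one per corpus document.
sims = [0.0, 0.32448703, 0.41707572, 0.7184812, 0.0, 0.0, 0.0, 0.0, 0.0]

# Pair each score with its document number and sort by score, highest first.
ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)

for doc_number, score in ranked[:3]:
    print(doc_number, score)
# 3 0.7184812
# 2 0.41707572
# 1 0.32448703
```

Document 3 ("System and human system engineering testing of EPS") is the closest match. Only the three documents containing 'system' get a non-zero score, because 'engineering' occurs just once in the corpus and was therefore filtered out before the dictionary was built.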


<h2> Summary</h2>

The core concepts of Gensim are:
<li>1. Document: some text
<li>2. Corpus: a collection of documents
<li>3. Vector: a mathematically convenient representation of a document
<li>4. Model: an algorithm for transforming vectors from one representation to another

<h4>What we did in this notebook</h4>

1. Collected a corpus of documents
2. Transformed the documents to a vector space representation
3. Created a model that transformed the original vector representation to tf-idf
4. Used the model to calculate the similarity between a query document and all documents in the corpus