### Similarity
In this tutorial we determine similarity between pairs of documents, or the similarity between a specific document and a set of other documents (such as a user query vs. indexed documents).

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities

2017-02-06 16:50:50,265 : INFO : 'pattern' package not found; tag filters are not available for English


In [3]:
# load dictionary and corpus
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"

2017-02-06 16:55:58,782 : INFO : loading Dictionary object from /tmp/deerwester.dict
2017-02-06 16:55:58,784 : INFO : loaded /tmp/deerwester.dict
2017-02-06 16:55:58,785 : INFO : loaded corpus index from /tmp/deerwester.mm.index
2017-02-06 16:55:58,786 : INFO : initializing corpus reader from /tmp/deerwester.mm
2017-02-06 16:55:58,787 : INFO : accepted corpus with 9 documents, 12 features, 28 non-zero entries


In [4]:
# initialize lsi with corpus and dictionary
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

2017-02-06 16:56:23,464 : INFO : using serial LSI version on this node
2017-02-06 16:56:23,465 : INFO : updating model with new documents
2017-02-06 16:56:23,467 : INFO : preparing a new chunk of documents
2017-02-06 16:56:23,468 : INFO : using 100 extra samples and 2 power iterations
2017-02-06 16:56:23,469 : INFO : 1st phase: constructing (12, 102) action matrix
2017-02-06 16:56:23,471 : INFO : orthonormalizing (12, 102) action matrix
2017-02-06 16:56:23,475 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2017-02-06 16:56:23,477 : INFO : computing the final decomposition
2017-02-06 16:56:23,478 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2017-02-06 16:56:23,479 : INFO : processed documents up to #9
2017-02-06 16:56:23,481 : INFO : topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"
2017-02-06 16:56:23,482 : INFO : topic #1(2

Now suppose a user typed in the query “Human computer interaction”. We would like to sort our nine corpus documents in decreasing order of relevance to this query.

In [5]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)

[(0, 0.46182100453271574), (1, -0.070027665278999993)]
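
Only words the dictionary already knows contribute to the query vector: by default, `doc2bow` ignores tokens that are missing from the dictionary. As a rough illustration of that step, here is a minimal pure-Python sketch. The `token2id` mapping and the helper `bow_from_tokens` are hypothetical stand-ins introduced for this example, not gensim's actual data or implementation.

```python
from collections import Counter

# Hypothetical vocabulary: these ids are made up for illustration and
# need not match the ones stored in /tmp/deerwester.dict.
token2id = {"computer": 0, "human": 1, "interface": 2}

def bow_from_tokens(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs.

    Tokens that are missing from the vocabulary are silently skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

query = "Human computer interaction"
vec_bow = bow_from_tokens(query.lower().split(), token2id)
print(vec_bow)  # [(0, 1), (1, 1)]
```

In this sketch "interaction" is not in the vocabulary, so it leaves no trace in the bag-of-words vector.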


To prepare for similarity queries, we need to index all the documents that we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to 2-D LSI space. That is only incidental, though: we might just as well index a different corpus altogether.

In [8]:
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it

# save and load the index
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

2017-02-06 17:03:54,502 : INFO : creating matrix with 9 documents and 2 features
2017-02-06 17:03:54,505 : INFO : saving MatrixSimilarity object under /tmp/deerwester.index, separately None
2017-02-06 17:03:54,507 : INFO : saved /tmp/deerwester.index
2017-02-06 17:03:54,508 : INFO : loading MatrixSimilarity object from /tmp/deerwester.index
2017-02-06 17:03:54,509 : INFO : loaded /tmp/deerwester.index


#### Performing queries
To obtain the similarities of our query document against the nine indexed documents, we query the index with the query's LSI vector.
The cosine measure returns similarities in the range [-1, 1] (the greater, the more similar).
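
For reference, the cosine measure of two vectors is their dot product divided by the product of their lengths. The sketch below computes it in plain Python. The helper `cosine` and the three example document vectors are hypothetical, chosen only to show the extremes of the range; the query is the LSI vector printed earlier, written as a dense list.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.46182100453271574, -0.070027665278999993]  # LSI vector of the query

# Hypothetical 2-D document vectors (not taken from the Deerwester corpus).
same_direction = [2 * x for x in query]   # scaled copy of the query
opposite = [-x for x in query]            # points the other way
orthogonal = [-query[1], query[0]]        # query rotated by 90 degrees

for name, doc in [("same", same_direction),
                  ("opposite", opposite),
                  ("orthogonal", orthogonal)]:
    print(name, round(cosine(query, doc), 6))
```

Because the measure depends only on the angle between the vectors, the scaled copy is as similar to the query as the query is to itself.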

In [9]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.098794639), (8, 0.050041765)]


In [45]:
# sort by decreasing similarity; each pair is (document_number, document_similarity)
sorted(enumerate(sims), key=lambda item: -item[1])

[(2, 0.99844527),
 (0, 0.99809301),
 (3, 0.9865886),
 (1, 0.93748635),
 (4, 0.90755945),
 (8, 0.050041765),
 (7, -0.098794639),
 (6, -0.10639259),
 (5, -0.12416792)]
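
In practice, often only the best few matches are needed. A minimal sketch, assuming the similarity values printed by the query above are copied into a plain list (`sims_list` and `top_n` are names introduced here for illustration), picks the three most similar documents without sorting the whole list:

```python
import heapq

# Similarities of the query against documents 0..8, copied from the query output.
sims_list = [0.99809301, 0.93748635, 0.99844527, 0.9865886, 0.90755945,
             -0.12416792, -0.10639259, -0.098794639, 0.050041765]

def top_n(sims, n):
    """Return the n (document_number, similarity) pairs with the highest similarity."""
    return heapq.nlargest(n, enumerate(sims), key=lambda item: item[1])

for doc_id, score in top_n(sims_list, 3):
    print(doc_id, score)
```

The first element of each pair is the position of the document in the indexed corpus, so document 2, which has the largest similarity value, comes first.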