# Similarity metrics

In [1]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
from gensim.models import TfidfModel
import numpy
%matplotlib inline

In [2]:
# setting up corpus and documents we will be comparing
texts = [['bank','river','shore','water'],
        ['river','water','flow','fast','tree'],
        ['bank','water','fall','flow'],
        ['bank','bank','water','rain','river'],
        ['river','water','mud','tree'],
        ['money','transaction','bank','finance'],
        ['bank','borrow','money'],
        ['bank','finance'],
        ['finance','money','sell','bank'],
        ['borrow','sell'],
        ['bank','loan','sell']]

In [3]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts] 

### Creating TF-IDF and LDA models for this corpus will help us illustrate our distance metrics.

In [4]:
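# TF-IDF weights each word by how rare it is across the corpus
tfidf = TfidfModel(corpus)
# LDA training is randomised, so the topics and weights shown below will vary
# between runs; pass random_state to LdaModel for reproducible results
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)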

#### A TF-IDF representation has as many features as there are words in the vocabulary, while an LDA representation has as many features as there are topics. We will use both models later to compare distances.

In [5]:
model.show_topics()

[(0,
  '0.149*"water" + 0.148*"bank" + 0.135*"river" + 0.077*"tree" + 0.066*"flow" + 0.049*"money" + 0.049*"fast" + 0.048*"finance" + 0.047*"mud" + 0.047*"rain"'),
 (1,
  '0.212*"bank" + 0.112*"sell" + 0.098*"finance" + 0.097*"money" + 0.072*"borrow" + 0.060*"water" + 0.053*"loan" + 0.051*"transaction" + 0.040*"flow" + 0.038*"fall"')]

Let's use three documents to compare – a document to do with river banks, one to do with
financial banks, and one that has the context of both (maybe a financial bank on the bank of
a river?).

In [6]:
doc_water = ['river', 'water', 'shore']
doc_finance = ['finance', 'money', 'sell']
doc_bank = ['finance', 'bank', 'tree', 'water']

#### Once we have our documents, we convert them into bag-of-words, TF-IDF, and LDA representations.

In [7]:
bow_water = model.id2word.doc2bow(doc_water)
bow_finance = model.id2word.doc2bow(doc_finance)
bow_bank = model.id2word.doc2bow(doc_bank)

lda_bow_water = model[bow_water]
lda_bow_finance = model[bow_finance]
lda_bow_bank = model[bow_bank]

tfidf_bow_water = tfidf[bow_water]
tfidf_bow_finance = tfidf[bow_finance]
tfidf_bow_bank = tfidf[bow_bank]
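
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the kind of output `doc2bow` produces: a sorted list of `(token_id, count)` pairs, with tokens missing from the vocabulary skipped. The `token2id` mapping and the `to_bow` helper are hypothetical and only for illustration; gensim's `Dictionary` assigns its own ids.

```python
from collections import Counter

# Hypothetical token-to-id mapping (an assumption for this sketch);
# a real gensim Dictionary builds and numbers its own vocabulary.
token2id = {"bank": 0, "river": 1, "shore": 2, "water": 3}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(to_bow(["river", "water", "shore"], token2id))
# 'rain' is not in the toy vocabulary, so it is dropped; 'bank' is counted twice
print(to_bow(["bank", "bank", "water", "rain"], token2id))
```

Each document becomes a sparse vector: only the words that actually occur are stored, which is why the TF-IDF and LDA transformations can be applied to it directly.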

Let's have a look at lda_bow_water and see what it looks like:

In [8]:
lda_bow_water

[(0, 0.8510084), (1, 0.1489916)]

This makes sense – the document contained words to do with river banks, and its
proportion of topic_0 is 85%.

In [9]:
# The lda_bow_finance variable should be roughly the opposite – let's test this:
lda_bow_finance

[(0, 0.14109361), (1, 0.8589064)]

As expected, the LDA representations of the two documents are quite
different, which was already apparent when we constructed them. The distance
between them should therefore be large, since the documents are not similar.
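
The text does not name a specific distance here. As one illustration, the sketch below uses the Hellinger distance, a common choice for comparing probability distributions, on the two topic distributions printed above. The `hellinger` helper and the hard-coded vectors are assumptions for this example, and the numbers will differ on another LDA run.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Ranges from 0 (identical) to 1 (no overlap).
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Topic proportions copied from the outputs above (topic 0, topic 1)
water = [0.8510084, 0.1489916]
finance = [0.14109361, 0.8589064]

print(hellinger(water, water))    # a distribution compared with itself
print(hellinger(water, finance))  # the two dissimilar documents
```

A distribution compared with itself gives zero, while the river and finance documents sit much further apart.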

In [10]:
lda_bow_bank

[(0, 0.736364), (1, 0.26363602)]

This document mixes both topics, as expected: it leans towards topic 0 (about 74%) but carries a larger share of topic 1 (about 26%) than doc_water does.

# Similarity queries

In [11]:
from gensim import similarities

Since we have a small corpus, we can use the MatrixSimilarity class to build our
index.

In [12]:
index = similarities.MatrixSimilarity(model[corpus])

We built our index from the LDA transformation of our corpus. We could build the
same kind of index from TF-IDF or even bag-of-words vectors, but we can expect
better results when using topics. Keep in mind that a query must be in the same
vector space as the representation used to build the index: an LDA index needs an
LDA query.
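
MatrixSimilarity scores a query against each indexed document using cosine similarity. The sketch below computes that measure by hand on the topic vectors printed earlier, to show why a query only makes sense against vectors from the same space. The `cosine` helper and the hard-coded vectors are assumptions for illustration, and the values depend on the particular LDA run.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Topic proportions copied from the outputs above (topic 0, topic 1)
water = [0.8510084, 0.1489916]
finance = [0.14109361, 0.8589064]
bank = [0.736364, 0.26363602]

print(cosine(finance, finance))  # identical direction
print(cosine(finance, bank))     # mixed document
print(cosine(finance, water))    # river document
```

Both vectors must have one entry per topic for the dot product to be meaningful, which is the reason a TF-IDF or bag-of-words query cannot be used against an LDA index.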

In [13]:
# Let's use the same lda_bow_finance document and find which articles are most
# similar.
sims = index[lda_bow_finance]

In [14]:
# Print a list pairing each document's index with its similarity to the query.
print(list(enumerate(sims)))

[(0, 0.32152134), (1, 0.25986925), (2, 0.3550473), (3, 0.29625738), (4, 0.27964425), (5, 0.99957216), (6, 0.99986553), (7, 0.99556077), (8, 0.9995367), (9, 0.9981149), (10, 0.99999887)]


In [15]:
# Let's look at which documents were actually picked up, and sort them according to how
# similar they are.
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, similarity in sims:
    print(texts[doc_id], similarity)

['bank', 'loan', 'sell'] 0.99999887
['bank', 'borrow', 'money'] 0.99986553
['money', 'transaction', 'bank', 'finance'] 0.99957216
['finance', 'money', 'sell', 'bank'] 0.9995367
['borrow', 'sell'] 0.9981149
['bank', 'finance'] 0.99556077
['bank', 'water', 'fall', 'flow'] 0.3550473
['bank', 'river', 'shore', 'water'] 0.32152134
['bank', 'bank', 'water', 'rain', 'river'] 0.29625738
['river', 'water', 'mud', 'tree'] 0.27964425
['river', 'water', 'flow', 'fast', 'tree'] 0.25986925


Our query was the LDA representation of a finance-related document, and the
similarity query returned all finance-related documents as most similar, while
the documents to do with trees and rivers were least similar, just as we would
expect.