<a href="https://colab.research.google.com/github/ShaunakSen/Contextual-hyperlinks/blob/master/WMD_Python_Genism.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finding similar documents with Word2Vec and WMD

Source: [Gensim WMD tutorial](https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html)

Word Mover's Distance (WMD) is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. For example, in a blog post, OpenTable uses WMD on restaurant reviews; with this approach, they are able to mine different aspects of the reviews. Part 1 of this tutorial shows how to compute the WMD between two documents using wmdistance. Part 2 shows how to use Gensim's WmdSimilarity to do something similar to what OpenTable did. Part 1 is optional if you only want to use WmdSimilarity, but it is also useful in its own right.

### Word Mover's Distance basics

WMD is a method that allows us to assess the "distance" between two documents in a meaningful way, even when they have no words in common. It uses word2vec [4] vector embeddings of words. It has been shown to outperform many state-of-the-art methods in k-nearest neighbors classification [3].

WMD is illustrated below for two very similar sentences (illustration taken from Vlad Niculae's blog). The sentences have no words in common, but by matching the relevant words, WMD is able to accurately measure the (dis)similarity between the two sentences. The method also uses the bag-of-words representation of the documents (simply put, the word frequencies in each document), denoted d in the figure below. The intuition behind the method is that we find the minimum "traveling distance" between documents, in other words the most efficient way to "move" the distribution of document 1 to the distribution of document 2.

![](https://vene.ro/images/wmd-obama.png)
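The bag-of-words vector d described above can be sketched in a few lines. This is a minimal illustration, assuming d is the count of each word divided by the total number of words in the document; the helper name `nbow` is hypothetical and not part of Gensim.

```python
from collections import Counter


def nbow(tokens):
    """Return the normalized bag-of-words weights of a tokenized document.

    Hypothetical helper for illustration: each word's weight is its count
    divided by the total number of tokens, so the weights sum to 1.
    """
    if not tokens:
        raise ValueError("document must contain at least one token")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Four distinct words, so each one carries a quarter of the document's mass.
print(nbow(['obama', 'speaks', 'media', 'illinois']))
```

These weights are the "mass" that WMD moves from the words of one document to the words of the other.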

This method was introduced in the article "From Word Embeddings To Document Distances" by Matt Kusner et al. It is inspired by the "Earth Mover's Distance" and employs a solver of the "transportation problem".

In this tutorial, we will learn how to use Gensim's WMD functionality, which consists of the wmdistance method for distance computation and the WmdSimilarity class for corpus-based similarity queries.
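To make the "traveling distance" idea concrete, here is a toy sketch of the transportation problem for a restricted special case: two documents with the same number of distinct words, each word carrying equal weight. In that case an optimal plan can be found by trying every one-to-one matching of words. The 2-D vectors and the function name `toy_wmd` are invented for illustration; this is not how Gensim solves the general problem.

```python
from itertools import permutations
from math import sqrt

# Hypothetical 2-D "embeddings", invented for this example only.
toy_vectors = {
    'obama': (0.0, 0.0),
    'president': (0.0, 1.0),
    'speaks': (5.0, 0.0),
    'greets': (5.0, 2.0),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_wmd(doc1, doc2, vectors):
    """Brute-force WMD for two documents of n distinct, equally weighted words.

    With uniform weights 1/n on both sides, some optimal transport plan is a
    one-to-one matching, so we take the cheapest matching over all permutations.
    """
    if len(doc1) != len(doc2):
        raise ValueError("this sketch only handles documents of equal length")
    if len(set(doc1)) != len(doc1) or len(set(doc2)) != len(doc2):
        raise ValueError("this sketch only handles documents without repeated words")
    n = len(doc1)
    best_total = min(
        sum(euclidean(vectors[a], vectors[b]) for a, b in zip(doc1, perm))
        for perm in permutations(doc2)
    )
    return best_total / n


# Cheapest matching: obama -> president (1.0) and speaks -> greets (2.0).
print(toy_wmd(['obama', 'speaks'], ['president', 'greets'], toy_vectors))  # 1.5
```

Brute force over permutations is only practical for a handful of words; the general problem, with unequal weights and lengths, needs a proper transportation solver.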



### Part 1: Computing the Word Mover's Distance

To use WMD, we first need some word embeddings. You could train a word2vec model on some corpus, but we will start by downloading pre-trained word2vec embeddings: the GoogleNews-vectors-negative300.bin.gz file (warning: 1.5 GB; the file is not needed for part 2), which the wget command below fetches. Training your own embeddings can be beneficial, but to simplify this tutorial, we will use pre-trained embeddings at first.

Let's take some sentences to compute the distance between.



In [5]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"


--2019-09-04 21:28:43--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.110
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2019-09-04 21:29:27 (36.1 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [0]:
from time import time

start_nb = time()

In [0]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')

sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_obama = sentence_obama.lower().split()
sentence_president = sentence_president.lower().split()

These sentences have very similar content, and as such the WMD should be low. Before we compute the WMD, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.



In [5]:
!pip install nltk==3.4.4

Collecting nltk==3.4.4
  Downloading https://files.pythonhosted.org/packages/87/16/4d247e27c55a7b6412e7c4c86f2500ae61afcbf5932b9e3491f8462f8d9e/nltk-3.4.4.zip (1.5MB)
     |████████████████████████████████| 1.5MB 4.7MB/s 
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.4-cp36-none-any.whl size=1450224 sha256=c6dd7f2acf7927706f22248415bc625874efd11f070f9c1b124b3576fe0fe811
  Stored in directory: /root/.cache/pip/wheels/41/c8/31/48ace4468e236e0e8435f30d33e43df48594e4d53e367cf061
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.4.4


In [8]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download

download('stopwords') # Download stopwords list.

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
print (stopwords.words('english')[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [10]:
# Remove stopwords.

stopwords = stopwords.words('english')

sentence_obama = [word for word in sentence_obama if word not in stopwords]
sentence_president = [word for word in sentence_president if word not in stopwords]

print (sentence_obama)
print (sentence_president)

['obama', 'speaks', 'media', 'illinois']
['president', 'greets', 'press', 'chicago']


Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into Gensim's KeyedVectors class. Note that the embeddings we have chosen here require a lot of memory.
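If memory is tight, one option is to load only part of the file. The sketch below uses the `limit` argument of `KeyedVectors.load_word2vec_format`, which reads only the first `limit` vectors. The helper name `load_limited_vectors` and the default of 500000 are assumptions chosen for illustration; words outside the loaded subset will be treated as out-of-vocabulary.

```python
import os


def load_limited_vectors(path, limit=500000):
    """Load only the first `limit` vectors from a binary word2vec file.

    Hypothetical helper: the default limit is an arbitrary choice that trades
    vocabulary coverage for lower memory use.
    """
    if not os.path.exists(path):
        raise FileNotFoundError("word2vec file not found: %s" % path)
    # Imported lazily so the path check above works even without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

Usage would look like `model = load_limited_vectors('./GoogleNews-vectors-negative300.bin.gz')`.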



In [0]:
import os
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

In [12]:
start = time()

if not os.path.exists(path="./GoogleNews-vectors-negative300.bin.gz"):
  raise ValueError("SKIP: You need to download the google news model")

model = KeyedVectors.load_word2vec_format(fname='./GoogleNews-vectors-negative300.bin.gz', binary=True)

print('Cell took %.2f seconds to run.' % (time() - start))



Cell took 117.37 seconds to run.


So let's compute WMD using the wmdistance method.



In [13]:
distance = model.wmdistance(sentence_obama, sentence_president)
print (distance)

3.3741233214730024


In [14]:
# Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.

sentence_orange = 'Oranges are my favorite fruit'
sentence_orange = sentence_orange.lower().split()
sentence_orange = [word for word in sentence_orange if word not in stopwords]

print (model.wmdistance(document1=sentence_obama, document2=sentence_orange))

4.380239402988511


#### Normalizing word2vec vectors

When using the wmdistance method, it is beneficial to normalize the word2vec vectors first, so they all have equal length. To do this, simply call model.init_sims(replace=True) and Gensim will take care of that for you.

Usually, one measures the distance between two word2vec vectors using the cosine distance (see cosine similarity), which measures the angle between vectors. WMD, on the other hand, uses the Euclidean distance. The Euclidean distance between two vectors might be large because their lengths differ, but the cosine distance is small because the angle between them is small; we can mitigate some of this by normalizing the vectors.
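A small worked example may help here. The vectors below are made up: they point in the same direction but have very different lengths, so their Euclidean distance is large while their cosine distance is essentially zero. After scaling both to unit length, the Euclidean distance also drops to essentially zero. All function names are illustrative.

```python
from math import sqrt


def norm(v):
    """Euclidean length of a vector."""
    return sqrt(sum(x * x for x in v))


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm(a) * norm(b))


def normalize(v):
    """Scale a vector to unit length."""
    length = norm(v)
    return tuple(x / length for x in v)


a = (1.0, 1.0)
b = (10.0, 10.0)  # same direction as a, ten times longer

print(euclidean(a, b))                        # large, about 12.73
print(cosine_distance(a, b))                  # approximately 0
print(euclidean(normalize(a), normalize(b)))  # approximately 0
```

For unit-length vectors the two measures are tied together: the squared Euclidean distance equals twice the cosine distance, which is why normalizing brings WMD closer in spirit to the usual cosine comparison.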

Note that normalizing the vectors can take some time, especially if you have a large vocabulary and/or large vectors.

Usage is illustrated in the example below. For the vectors downloaded here, normalization does make a difference: the distances printed after calling init_sims (about 1.02 and 1.37) are much smaller than those computed above (about 3.37 and 4.38).

In [15]:
# Normalizing word2vec vectors.
start = time()

model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.
print ('Cell took %.2f seconds to run.' %(time() - start))


Cell took 27.60 seconds to run.


In [16]:
print(model.wmdistance(sentence_obama, sentence_president))  # Compute WMD as normal.

print (model.wmdistance(sentence_orange, sentence_obama))

1.0174646259300113
1.3663488311444436


In [0]:
real_captions = ['two dogs playing in the field', 'man in red shirt riding bike', 'little bird sitting on a branch', 'man in blue is in the water']

generated_captions = ['puppies running in the ground', 'man riding bicycle in maroon', 'bird sits in leafless tree', 'child in black wetsuit is in the waves on surfboard']

# NOTE: these comprehensions iterate over whole caption strings, not over words,
# so no caption ever matches a stopword and the lists are left unchanged.
real_captions = [word for word in real_captions if word not in stopwords]
generated_captions = [word for word in generated_captions if word not in stopwords]

wrong_captions = ['a computer on the floor'] * len(real_captions)

wrong_captions = [word for word in wrong_captions if word not in stopwords]


# NOTE: each caption is still a raw string here, so wmdistance iterates over it
# character by character rather than word by word.
for x in range(len(real_captions)):
  print (model.wmdistance(real_captions[x], generated_captions[x]))

0.9862614148723878
1.0401546156734276
1.0679314297375004
0.8883583908601389


In [0]:
for x in range(len(real_captions)):
  print (model.wmdistance(real_captions[x], wrong_captions[x]))

1.28264269547198
1.5681078434373743
1.3253273516472157
1.1607600246182723


A better idea is to split each caption into words and remove the stopwords first, then compute the distances again.
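One way to do that is sketched below. The helper names `tokenize` and `caption_distances` are hypothetical; the sketch only assumes that `model` exposes the `wmdistance` method used earlier and that `stop_words` is a collection of lowercase stopwords. No outputs are shown because these distances were not computed in this notebook.

```python
def tokenize(caption, stop_words):
    """Lowercase a caption, split it on whitespace, and drop stopwords."""
    return [word for word in caption.lower().split() if word not in stop_words]


def caption_distances(model, references, candidates, stop_words):
    """Return the WMD for each (reference, candidate) caption pair.

    `model` is any object with a wmdistance(tokens1, tokens2) method,
    such as the KeyedVectors instance loaded earlier.
    """
    return [
        model.wmdistance(tokenize(ref, stop_words), tokenize(cand, stop_words))
        for ref, cand in zip(references, candidates)
    ]
```

With the variables from this notebook, the call would be `caption_distances(model, real_captions, generated_captions, stopwords)`, and likewise with `wrong_captions` as the candidates.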

In [35]:
idx = 3
# sentence_bleu expects a list of tokenized references and one tokenized candidate.
reference = [real_captions[idx].split()]
candidate = generated_captions[idx].split()
print (reference, candidate)

[['child', 'in', 'black', 'wetsuit', 'is', 'in', 'the', 'waves', 'on', 'surfboard']] ['man', 'in', 'blue', 'is', 'in', 'the', 'water']


In [36]:
# Individual n-gram BLEU scores: each weights tuple selects a single n-gram order.
from nltk.translate.bleu_score import sentence_bleu

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=None))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0), smoothing_function=None))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0), smoothing_function=None))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1), smoothing_function=None))

Individual 1-gram: 0.372251
Individual 2-gram: 0.217146
Individual 3-gram: 0.130288
Individual 4-gram: 0.000000
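Each of the four scores above puts all of its weight on one n-gram order, so each is the clipped n-gram precision for that order multiplied by the brevity penalty. The sketch below recomputes them by hand for the single reference and candidate printed earlier. It is a simplified illustration (one reference, no smoothing), and the function names are made up for the example, not NLTK's API.

```python
from collections import Counter
from math import exp


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(reference, candidate):
    """Penalize candidates that are shorter than the reference."""
    r, c = len(reference), len(candidate)
    if c == 0:
        return 0.0
    return 1.0 if c > r else exp(1 - r / c)


def individual_bleu(reference, candidate, n):
    """BLEU with all weight on a single n-gram order."""
    return brevity_penalty(reference, candidate) * modified_precision(reference, candidate, n)


reference = 'child in black wetsuit is in the waves on surfboard'.split()
candidate = 'man in blue is in the water'.split()

for n in range(1, 5):
    print('Individual %d-gram: %f' % (n, individual_bleu(reference, candidate, n)))
# Individual 1-gram: 0.372251
# Individual 2-gram: 0.217146
# Individual 3-gram: 0.130288
# Individual 4-gram: 0.000000
```

A cumulative score would instead spread the weights over several orders, for example weights=(0.5, 0.5, 0, 0), and combine the precisions through a weighted geometric mean.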
