In [None]:
%matplotlib inline

# Document similarity methods

In this notebook, we illustrate document similarity methods and use them to retrieve similar customer reviews. The method illustrated is Word Mover's Distance (WMD). It relies on word embeddings, but it also compares documents at the level of individual words.

## Word Mover's Distance

WMD combines word embeddings with a normalised bag-of-words representation to calculate the distance between sentences or documents. Because it compares word meanings rather than exact tokens, it handles sentences that use synonyms instead of the same words. The intuition behind the method is that we find the minimum "travelling distance" between documents, in other words the most efficient way to "move" the word distribution of document 1 onto the word distribution of document 2.
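
The "travelling distance" idea can be sketched without any external library. The example below is a minimal illustration, not the gensim implementation: the 2-D vectors in `toy_vectors` are invented for this sketch, and `toy_wmd` assumes both documents contain the same number of distinct words, each occurring once. Under that assumption every word carries the same weight, so the cheapest transport plan is a one-to-one matching that can be found by brute force.

```python
from itertools import permutations
from math import dist

# Hypothetical 2-D "embeddings", invented purely for illustration.
toy_vectors = {
    'obama': (1.0, 0.9),
    'president': (0.9, 1.0),
    'media': (-1.0, 0.5),
    'press': (-0.9, 0.6),
    'oranges': (0.0, -1.0),
    'fruit': (0.1, -0.9),
}


def toy_wmd(doc1, doc2, vectors):
    """Toy WMD for two equally long documents whose words each occur once.

    Every word has weight 1/n, so the optimal transport plan is a
    one-to-one matching; we try every matching and keep the cheapest.
    """
    if len(doc1) != len(doc2):
        raise ValueError('this sketch only handles documents of equal length')
    n = len(doc1)
    cheapest = min(
        sum(dist(vectors[a], vectors[b]) for a, b in zip(doc1, perm))
        for perm in permutations(doc2)
    )
    return cheapest / n


close = toy_wmd(['obama', 'media'], ['president', 'press'], toy_vectors)
far = toy_wmd(['obama', 'media'], ['oranges', 'fruit'], toy_vectors)
print('close pair = %.4f, unrelated pair = %.4f' % (close, far))
```

The pair of documents with related words gets the smaller value, even though the two documents share no word. Real WMD also handles repeated words and documents of different lengths, which requires solving a general transport problem rather than a matching.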

The two sentences below, `sentence_obama` and `sentence_president`, are a good illustration: after stopword removal they share no words, yet they mean nearly the same thing.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Image from https://vene.ro/images/wmd-obama.png
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from time import time

#img = mpimg.imread('/content/drive/MyDrive/Colab Notebooks/images/wmd-obama.png')
#imgplot = plt.imshow(img)
#plt.axis('off')
#plt.show()

In [None]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'

## Preprocessing
First, let's do some pre-processing, removing stopwords and punctuation. 

In [None]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
stop_words = stopwords.words('english')

# Pre-processing a document.
from nltk import word_tokenize
download('punkt')  # Download data for tokenizer.

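def preprocess(doc):
    """Lower-case and tokenize a string, dropping stopwords, numbers and punctuation."""
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

# wmdistance expects lists of tokens; a raw string would be read character by character.
sentence_obama = preprocess(sentence_obama)
sentence_president = preprocess(sentence_president)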

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We will use the pre-trained word vectors published by Google (https://code.google.com/archive/p/word2vec/), which are also available through gensim's downloader. They were trained on part of the Google News dataset (about 100 billion words) and contain 300-dimensional vectors for 3 million words and phrases.

More options:
https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec
model = api.load('word2vec-google-news-300')

WMD is built on the word vectors: the distance between two documents is derived directly from the distances between their individual word vectors. In the run below, the Obama sentence and the president sentence have a distance of about 3.80.
Note that similarity and distance move in opposite directions: when the distance is high, the similarity is low, and vice versa.
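
A distance can be turned into a similarity score in several ways. The sketch below uses `1 / (1 + distance)`, which, as far as we know, is also the conversion gensim's `WmdSimilarity` applies. The helper name `wmd_to_similarity` is ours, and the two distances are the ones printed in this notebook.

```python
def wmd_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)


# Distances reported in this notebook for the two sentence pairs.
sim_president = wmd_to_similarity(3.7959)
sim_orange = wmd_to_similarity(4.3802)
print('president: %.4f, orange: %.4f' % (sim_president, sim_orange))
```

A distance of 0 maps to a similarity of exactly 1, and larger distances map to values closer to 0.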

In [None]:
distance = model.wmdistance(sentence_obama, sentence_president)
print('distance = %.4f' % distance)

2021-01-13 08:21:06,038 : INFO : Removed 0 and 7 OOV words from document 1 and 2 (respectively).
2021-01-13 08:21:06,039 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:21:06,040 : INFO : built Dictionary(18 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'C']...) from 2 documents (total 38 corpus positions)


distance = 3.7959


In another case, with a very different sentence about oranges, the distance is larger (less similar), at about 4.38.

In [None]:
sentence_orange = preprocess('Oranges are my favorite fruit')
distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

2021-01-13 08:21:08,857 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:21:08,859 : INFO : built Dictionary(7 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'favorite']...) from 2 documents (total 7 corpus positions)


distance = 4.3802


## Normalising WMD
 
The WMD value depends on the scale of the word vectors: vectors with different magnitudes (Euclidean norms) can inflate the distance. To mitigate this, the word vectors are each normalised to unit length, so that they all have the same norm. In this case, the normalisation reduces the distances.
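
The effect of unit-length normalisation can be seen on a toy pair of vectors. This is a minimal sketch with invented 2-D vectors and a hypothetical helper `unit_normalise`; it only illustrates the arithmetic, not what gensim does internally.

```python
from math import dist, sqrt


def unit_normalise(vec):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec)


# Hypothetical word vectors: same direction, very different magnitudes.
short_vec = (3.0, 4.0)
long_vec = (30.0, 40.0)

raw_distance = dist(short_vec, long_vec)
normed_distance = dist(unit_normalise(short_vec), unit_normalise(long_vec))
print('raw = %.4f, normalised = %.4f' % (raw_distance, normed_distance))
```

Before normalisation the two vectors look far apart only because one is longer. After normalisation they coincide, so any remaining distance between unit vectors reflects direction alone.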

In [None]:
model.init_sims(replace=True)  # L2-normalises the word vectors in place (gensim 3.x API, deprecated in gensim 4).

distance = model.wmdistance(sentence_obama, sentence_president)  # Compute WMD as normal.
print('distance = %.4f' % distance)

distance = model.wmdistance(sentence_obama, sentence_orange)
print('distance = %.4f' % distance)

## Building our own word vectors

Instead of using the word vectors from Google, we train our own word vectors on Yelp reviews. The reviews are in JSON format, one review per line, and here we look at six reviews selected by their review IDs. The dataset is available from Kaggle: https://www.kaggle.com/yelp-dataset/yelp-dataset
The following code reads the reviews one by one and adds the selected ones to both `wmd_corpus` (for similarity queries) and `w2v_corpus` (for training word vectors).

Training a new word vector model can be advantageous, since the trained model reflects the semantics of the review domain, provided the corpus is large enough to learn from.
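
The filtering step can be tried on a small in-memory sample before running it on the full file. The sketch below is illustrative only: the three records, their IDs and the helper `read_selected_reviews` are made up, and `io.StringIO` stands in for the real JSON-lines file.

```python
import io
import json

# Made-up records in the same one-JSON-object-per-line layout.
sample_file = io.StringIO(
    '{"review_id": "r1", "text": "Great gyro plate."}\n'
    '{"review_id": "r2", "text": "Slow service."}\n'
    '{"review_id": "r3", "text": "Loved the rice."}\n'
)
wanted_ids = {'r1', 'r3'}  # A set makes the membership test fast.


def read_selected_reviews(lines, wanted):
    """Return the text of every review whose review_id is in `wanted`."""
    selected = []
    for line in lines:
        record = json.loads(line)
        if record['review_id'] in wanted:
            selected.append(record['text'])
    return selected


texts = read_selected_reviews(sample_file, wanted_ids)
print(texts)
```

Reading line by line keeps memory use low, which matters because the full review file is large.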

In [None]:
start = time()

import json

# Review IDs of the selected reviews.
ids = ['fWKvX83p0-ka4JS3dc6E5A', 'IjZ33sJrzXqU-0X6U8NwyA', 'IESLBzqUCLdSzSqm0eCSxQ',
      '1uJFq2r5QfJG_6ExMRCaGw', 'm2CKSsepBCoRYWxiRUsxAg', 'jJAIXA46pU1swYyRCdfXtQ']

w2v_corpus = []  # Documents to train word2vec on (the selected reviews).
wmd_corpus = []  # Documents to run queries against (the same selected reviews).
documents = []  # wmd_corpus, with no pre-processing (so we can see the original documents).
with open('/content/drive/MyDrive/Colab Data/yelp_academic_dataset_review.json') as data_file: 
    for line in data_file:
        json_line = json.loads(line)
        if json_line['review_id'] not in ids:
            # Not one of the 6 selected reviews.
            continue
         
        # Pre-process document.
        text = json_line['text']  # Extract text from JSON object.
        text = preprocess(text)
        # Add to corpus for training Word2Vec.
        w2v_corpus.append(text)

        # Every review that reaches this point is one of the selected ones,
        # so it is also added to the corpus for similarity queries.
        wmd_corpus.append(text)
        documents.append(json_line['text'])

Train a new word vector model on the reviews, then use it to build the WMD similarity index.

In [None]:
# Train Word2Vec on the selected reviews.
# `size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`.
n_model = Word2Vec(w2v_corpus, workers=3, size=100, min_count=1)

# Initialize WmdSimilarity with the newly trained word vectors.
from gensim.similarities import WmdSimilarity
num_best = 5  # Retrieve the top 5 documents.
instance = WmdSimilarity(wmd_corpus, n_model.wv, num_best=num_best)
# Candidate documents come only from wmd_corpus; the distances use the
# word vectors trained on w2v_corpus.
# If Word2Vec raises "you must first build vocabulary before training the model",
# w2v_corpus is usually empty.

2021-01-13 08:23:05,471 : INFO : collecting all words and their counts
2021-01-13 08:23:05,472 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-13 08:23:05,474 : INFO : collected 318 word types from a corpus of 438 raw words and 6 sentences
2021-01-13 08:23:05,476 : INFO : Loading a fresh vocabulary
2021-01-13 08:23:05,478 : INFO : effective_min_count=1 retains 318 unique words (100% of original 318, drops 0)
2021-01-13 08:23:05,479 : INFO : effective_min_count=1 leaves 438 word corpus (100% of original 438, drops 0)
2021-01-13 08:23:05,481 : INFO : deleting the raw counts dictionary of 318 items
2021-01-13 08:23:05,482 : INFO : sample=0.001 downsamples 73 most-common words
2021-01-13 08:23:05,483 : INFO : downsampling leaves estimated 354 word corpus (80.9% of prior 438)
2021-01-13 08:23:05,489 : INFO : estimated required memory for 318 words and 100 dimensions: 413400 bytes
2021-01-13 08:23:05,491 : INFO : resetting layer weights
2021-01-13 08:23:05,

Test it on one of the review sentences. 

In [None]:
start = time()
sent = 'love the gyro plate. Rice is so good and I also dig their candy selection.'
query = preprocess(sent)
sims = instance[query]  # A query is simply a "look-up" in the similarity class.
print('Cell took %.2f seconds to run.' % (time() - start))

2021-01-13 08:23:42,026 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:23:42,028 : INFO : built Dictionary(76 unique tokens: ['absolute', 'absolutely', 'amazing', 'anyway', 'arrived']...) from 2 documents (total 85 corpus positions)
2021-01-13 08:23:42,047 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:23:42,048 : INFO : built Dictionary(90 unique tokens: ['arrived', 'awesome', 'back', 'bad', 'baked']...) from 2 documents (total 120 corpus positions)
2021-01-13 08:23:42,067 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:23:42,067 : INFO : built Dictionary(9 unique tokens: ['also', 'candy', 'dig', 'good', 'gyro']...) from 2 documents (total 18 corpus positions)
2021-01-13 08:23:42,071 : INFO : Removed 1 and 0 OOV words from document 1 and 2 (respectively).
2021-01-13 08:23:42,073 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-13 08:23:42,074 : INFO : built Dictionary(43 uniq

Cell took 0.11 seconds to run.


Output the num_best sentences that are most similar to the query.
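
Conceptually, keeping the `num_best` results means ranking every document by similarity and keeping the highest-scoring ones as `(index, similarity)` pairs. The sketch below shows that idea with the standard library; the scores in `toy_similarities` and the helper `top_matches` are invented for illustration and do not come from the similarity index used in this notebook.

```python
import heapq


def top_matches(similarities, num_best):
    """Return the num_best (index, similarity) pairs, best first."""
    indexed = enumerate(similarities)
    return heapq.nlargest(num_best, indexed, key=lambda pair: pair[1])


# Invented similarity scores for a four-document corpus.
toy_similarities = [0.31, 1.0, 0.22, 0.45]
top_two = top_matches(toy_similarities, 2)
print(top_two)
```

Each pair holds the document's position in the corpus first and its score second, which is the same layout the retrieval loop in the next cell reads.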

In [None]:
# Print the query and the retrieved documents, together with their similarities.
print('Query:', sent)
for rank, (doc_index, similarity) in enumerate(sims):  # sims holds at most num_best pairs.
    print(rank, '----')
    print(documents[doc_index])
    print('sim = %.4f' % similarity)

Query: love the gyro plate. Rice is so good and I also dig their candy selection.
0 ----
love the gyro plate. Rice is so good and I also dig their candy selection :)
sim = 1.0000
1 ----
I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.

In any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small "Here's The Beef" pizza so 

**Workshop submission:**

Replace the Yelp academic dataset with the news.xls corpus and train a new word vector model. Let's call this word vector model WV_News and the word vector model trained on the Yelp reviews WV_Yelp.


*   Get the most similar document for the following two sentences from the two models (WV_Yelp and WV_News):


sentence_1: The pandemic is showing little sign of slowing down, with more than 10,000 new deaths recorded worldwide every day, according to an AFP tally.

WV_Yelp doc:

WV_News doc:


---


sentence_2: The dollar's weakening is likely to last at least another six months as investors continue to shift to risky assets and higher returns, according to a Reuters poll.

WV_Yelp doc:

WV_News doc:



---



*   Comment on your findings, for example on the choice of words and on the usage and size of the corpus, all of which can affect the results. Does a different word vector corpus also affect the similarity scores? Also comment on the retrieved news sentence: does it look suitable?
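
One possible scaffold for the comparison is a helper that takes the distance function as an argument, so the same code can be run once per word vector model. This is only a sketch under stated assumptions: `most_similar_document` is a hypothetical helper, the toy corpus is invented, and `jaccard_distance` is a stand-in so the example runs without gensim. It is not WMD.

```python
def most_similar_document(query_tokens, corpus_tokens, distance_fn):
    """Return (index, distance) of the corpus document closest to the query."""
    distances = [distance_fn(query_tokens, doc) for doc in corpus_tokens]
    best_index = min(range(len(distances)), key=distances.__getitem__)
    return best_index, distances[best_index]


def jaccard_distance(tokens_a, tokens_b):
    """Stand-in distance for illustration only; it is NOT WMD."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Invented, already pre-processed documents.
toy_corpus = [
    ['dollar', 'weakening', 'investors'],
    ['pandemic', 'deaths', 'worldwide'],
    ['gyro', 'plate', 'rice'],
]
toy_query = ['pandemic', 'new', 'deaths']

best_index, best_distance = most_similar_document(toy_query, toy_corpus, jaccard_distance)
print(best_index, best_distance)
```

To compare the two models, the stand-in could be swapped for each model's own distance method, for example `n_model.wv.wmdistance` for the Yelp vectors and the equivalent method of the model trained on the news corpus.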





