# Ranking document similarity using keywords and semantic matching

#### 4OH4
#### March 2020

https://github.com/4OH4/doc-similarity

This notebook contains examples of two methods for comparing the content of text documents for similarity, such as might be used for search queries or content recommender systems. The first (TF-IDF) scores the relationship between documents based on the frequency of occurrence of shared words. It is fast, and works well when documents are large and/or have lots of overlap. The second technique looks for shared words that address similar concepts, but does not require an exact match: for example, it links 'fruit and vegetables' with the word 'tomato'. This is slower, and gives less clear-cut results, but is good with shorter search queries or documents with low overlap.

## Contents
1. [TF-IDF to score shared key words](#sec1)  
1b. [Using a lemmatizer](#sec1b)  
2. [Semantic matching using GloVe embeddings](#sec2)  
2b. [Using the ready-made DocSim class](#sec2b)  

## Requirements
To install the required packages:

    pip install -r requirements.txt
    
## Known Issues
 - Warning generated by `gensim`: `RuntimeWarning: divide by zero encountered in true_divide`. I haven't been able to find the root cause of this, although it does not appear to be producing erroneous results.

### Load test data

In [1]:
import json

In [2]:
# Load test data
with open('test_data.json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

print(f'{len(documents)} documents')

for idx in range(5):
    print(idx, " \t ", titles[idx], " : \t", documents[idx][:100])

28 documents
0  	  Pomegranate Bhagwa  : 	 Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India
1  	  Pomegranate Arakta  : 	 Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bigger in size, sweet 
2  	  About Us  : 	 About Us Anushka Avni International (AAI) takes pleasure in presenting itself as one of the renowned
3  	  Contact Us  : 	 About Us Anushka Avni International (AAI) takes pleasure in presenting itself as one of the renowned
4  	  White Onions  : 	 White Onions from Anushka Avni International Fresh White Onion, which is widely acclaimed for its he


<a id="sec1"></a>
## 1. TF-IDF to score shared key words

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
from nltk.corpus import stopwords

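# 'punkt' is needed by word_tokenize (section 1b); stopwords.words() needs the
# 'stopwords' corpus, so run nltk.download('stopwords') first if it is missing
nltk.download('punkt')
stop_words = set(stopwords.words('english'))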

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rupert.thomas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
search_terms = 'fruit and vegetables'
# search_terms = 'tomato'
# search_terms = 'sewing machine'

vectorizer = TfidfVectorizer(stop_words=stop_words)
vectors = vectorizer.fit_transform([search_terms] + documents)

# TfidfVectorizer L2-normalises its vectors by default, so the linear kernel (dot product)
# gives the cosine similarity of the search terms to each document
cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]  # convert back to native Python dtypes

# Print the top-scoring results and their titles
score_titles = [(score, title) for score, title in zip(document_scores, titles)]

for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
    print(f'{score:0.3f} \t {title}')

0.122 	 Pomegranate Bhagwa
0.046 	 Pomegranate Arakta
0.000 	 About Us
0.000 	 Contact Us
0.000 	 White Onions


With the search terms 'fruit and vegetables', only two documents return non-zero similarity scores; both contain the word 'fruit'. When searching for 'tomato', however, there are no matches: only the plural 'tomatoes' is present in the document corpus, and TF-IDF requires an exact token match.
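
The exact-match behaviour can be reproduced without any libraries. The following is a minimal sketch, not the scikit-learn implementation: the toy corpus is hypothetical, tokenisation is a plain whitespace split, and the IDF uses the smoothed formula documented for scikit-learn's default settings.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(token_lists)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Term count multiplied by a smoothed inverse document frequency
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical toy corpus; the first entry plays the role of the search query
texts = ["tomato", "fresh tomatoes from the farm", "fresh tomato salad"]
vectors = tfidf_vectors([text.split() for text in texts])

score_plural = cosine(vectors[0], vectors[1])    # 'tomato' vs 'tomatoes': no shared token
score_singular = cosine(vectors[0], vectors[2])  # shares the exact token 'tomato'
print(f'{score_plural:0.3f} \t {score_singular:0.3f}')
```

Because 'tomato' and 'tomatoes' are different tokens, the first score is exactly zero, while the document containing the singular form scores above zero.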

<a id="sec1b"></a>
## 1b. Using a lemmatizer

A lemmatizer reduces words to their base form, or 'lemma'. This is particularly helpful when dealing with plurals.
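
To illustrate why this helps with matching, here is a deliberately crude sketch that strips a few common plural endings. The function `naive_singular` is hypothetical and is only a stand-in for illustration; it is not how the WordNet lemmatizer works, and it will get many words wrong.

```python
def naive_singular(token):
    """Very rough plural stripping; a stand-in for a real lemmatizer, not a replacement."""
    if token.endswith('ies') and len(token) > 4:
        return token[:-3] + 'y'        # 'berries' -> 'berry'
    if token.endswith('oes') and len(token) > 4:
        return token[:-2]              # 'tomatoes' -> 'tomato'
    if token.endswith('s') and not token.endswith('ss') and len(token) > 3:
        return token[:-1]              # 'onions' -> 'onion'
    return token

tokens = ['tomatoes', 'onions', 'berries', 'glass', 'fresh']
normalised = [naive_singular(t) for t in tokens]
print(normalised)
```

Once both the query and the documents are normalised in this way, 'tomato' and 'tomatoes' map to the same token and can be matched by TF-IDF.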

In [5]:
# from: https://scikit-learn.org/stable/modules/feature_extraction.html

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    """
    Interface to the WordNet lemmatizer from nltk
    """
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t not in self.ignore_tokens]

In [6]:
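# Demonstrate the job of the tokenizer
# Note: the lemmatizer treats every token as a noun by default, which is why 'was' becomes 'wa' below

tokenizer = LemmaTokenizer()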

tokenizer('It was raining cats and dogs in FooBar')

['It', 'wa', 'raining', 'cat', 'and', 'dog', 'in', 'FooBar']

In [7]:
# search_terms = 'fruit and vegetables'
search_terms = 'tomato'
# search_terms = 'sewing machine'

# Initialise TfidfVectorizer with the LemmaTokenizer. The stop words must be lemmatized too, so that they match the lemmatized tokens
token_stop = tokenizer(' '.join(stop_words))
vectorizer = TfidfVectorizer(stop_words=token_stop, tokenizer=tokenizer)

# Calculate the word frequency, and calculate the cosine similarity of the search terms to the documents
vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

document_scores = [item.item() for item in cosine_similarities[1:]]  # convert back to native Python dtypes

score_titles = [(score, title) for score, title in zip(document_scores, titles)]

for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
    print(f'{score:0.3f} \t {title}')

0.365 	 Tomatoes
0.000 	 Pomegranate Bhagwa
0.000 	 Pomegranate Arakta
0.000 	 About Us
0.000 	 Contact Us


This gives better results - the document that contains the word 'tomatoes' is now scoring highly.

<a id="sec2"></a>
## 2. Semantic matching using GloVe embeddings

This example and the class code for DocSim re-use and extend code from the Gensim tutorial notebook:  
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

The first part of this section runs through the individual steps in the process. This code is also available packaged in a ready-to-use class - scroll further down to see how it works.
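
Before the step-by-step code, it may help to see the underlying measure in isolation. The sketch below computes a soft cosine score, in which a term similarity matrix lets related but different terms contribute to the result. The three-term vocabulary and all similarity values are made up for illustration; in the notebook they come from the GloVe embeddings.

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine measure between two dense term-weight vectors, given a term similarity matrix."""
    def bilinear(x, y):
        # x^T * sim * y
        return sum(x[i] * sim[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

# Hypothetical vocabulary: ['fruit', 'tomato', 'sewing'], with made-up similarity values
sim = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix, the measure reduces to the ordinary cosine similarity
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

query_vec = [1.0, 0.0, 0.0]   # contains only 'fruit'
doc_tomato = [0.0, 1.0, 0.0]  # contains only 'tomato'
doc_sewing = [0.0, 0.0, 1.0]  # contains only 'sewing'

hard = soft_cosine(query_vec, doc_tomato, identity)
soft = soft_cosine(query_vec, doc_tomato, sim)
unrelated = soft_cosine(query_vec, doc_sewing, sim)
print(f'{hard:0.3f} \t {soft:0.3f} \t {unrelated:0.3f}')
```

The ordinary cosine similarity between the 'fruit' query and the 'tomato' document is zero, because they share no terms, whereas the soft version returns a non-zero score driven by the similarity between the two words.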

In [8]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [9]:
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)  # DEBUG # INFO

In [10]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rupert.thomas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# Support functions for pre-processing and calculation
# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

### Prepare the data

In [12]:
# Load test data
with open('test_data.json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

print(f'{len(documents)} documents')

# Print the first few document titles and intro text
# for idx in range(5):
#     print(idx, "\t | \t", titles[idx], "\t | \t", documents[idx][:100])

28 documents


In [13]:
query_string = 'fruit and vegetables'

# Preprocess the documents, including the query string
corpus = [preprocess(document) for document in documents]
query = preprocess(query_string)

### Build the model

The word embedding model is a large file, so loading is quite a long-running task.

In [14]:
%%time

# Download and/or load the GloVe word vector embeddings

if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")
    
similarity_index = WordEmbeddingSimilarityIndex(glove)

Wall time: 32 s


In [15]:
%%time

# Build the term dictionary, TF-IDF model
# The search query must be in the dictionary as well, in case the terms do not overlap with the documents (we still want similarity)
dictionary = Dictionary(corpus+[query])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix. 
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column. 
# For my application, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

Wall time: 7.65 s
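
The effect of a per-column limit on non-zero entries, as described in the comment on `nonzero_limit` above, can be sketched on a small dense matrix. This is a simplified illustration only: `limit_nonzero` is a hypothetical helper, the matrix values are made up, and gensim's own construction differs in detail (for example in how it treats symmetry).

```python
def limit_nonzero(matrix, limit):
    """Keep the diagonal plus at most `limit` largest off-diagonal entries per column.

    Passing limit=None keeps every positive entry.
    """
    size = len(matrix)
    result = [[0.0] * size for _ in range(size)]
    for col in range(size):
        result[col][col] = matrix[col][col]
        # Collect positive off-diagonal entries in this column
        off_diag = [(matrix[row][col], row) for row in range(size)
                    if row != col and matrix[row][col] > 0.0]
        # Retain only the largest ones
        for value, row in sorted(off_diag, reverse=True)[:limit]:
            result[row][col] = value
    return result

# Made-up term similarity matrix for a four-term vocabulary
dense = [
    [1.0, 0.7, 0.4, 0.1],
    [0.7, 1.0, 0.5, 0.2],
    [0.4, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
sparse = limit_nonzero(dense, limit=1)
for row in sparse:
    print(row)
```

A small limit discards weaker word relationships, which speeds up the calculation but can hide matches; removing the limit keeps all of them.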


In [16]:
# Compute Soft Cosine Measure between the query and the documents.
query_tf = tfidf[dictionary.doc2bow(query)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]

  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))


### Output the document similarity results

In [17]:
# Output the similarity scores for top 15 documents
sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
for idx in sorted_indexes[:15]:
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {titles[idx]}')

0 	 0.631 	 Pomegranate Bhagwa
11 	 0.613 	 Small Onions
12 	 0.612 	 Tomatoes
16 	 0.589 	 Grapes Black Sharad Seedless
17 	 0.563 	 Grapes Flame / Red Seedless
1 	 0.560 	 Pomegranate Arakta
4 	 0.516 	 White Onions
10 	 0.490 	 About Us
7 	 0.480 	 Red Onions
9 	 0.469 	 Pink Onions
22 	 0.426 	 LPI DA-6 sewing machine
21 	 0.426 	 LPI DE-DA sewing machine
20 	 0.426 	 WAZIR DE-DA sewing machine
19 	 0.426 	 LPI DA ? 1 Extra Heavy Duty
18 	 0.426 	 DA-TEX sewing machine


### Find the most relevant terms in the documents

In [18]:
# For each term in the search query, what were the most similar words in each document?
doc_similar_terms = []
max_results_per_doc = 5
for term in query:
    idx1 = dictionary.token2id[term]
    for document in corpus:
        results_this_doc = []
        for word in set(document):
            idx2 = dictionary.token2id[word]
            score = similarity_matrix.matrix[idx1, idx2]
            if score > 0.0:
                results_this_doc.append((word, score))
        results_this_doc = sorted(results_this_doc, reverse=True, key=lambda x: x[1])  # sort results by score
        results_this_doc = results_this_doc[:min(len(results_this_doc), max_results_per_doc)]  # take the top results
        doc_similar_terms.append(results_this_doc)

In [19]:
# Output the results for the top 15 documents
for idx in sorted_indexes[:15]:
    similar_terms_string = ', '.join([result[0] for result in doc_similar_terms[idx]])
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {titles[idx]}  :  {similar_terms_string}')

0 	 0.631 	 Pomegranate Bhagwa  :  fruit, delicious, sweet, cherry, fresh
11 	 0.613 	 Small Onions  :  fresh
12 	 0.612 	 Tomatoes  :  fresh, taste, tomatoes
16 	 0.589 	 Grapes Black Sharad Seedless  :  grapes, taste
17 	 0.563 	 Grapes Flame / Red Seedless  :  varieties, grapes
1 	 0.560 	 Pomegranate Arakta  :  fruit, sweet, fresh, taste, soft
4 	 0.516 	 White Onions  :  fresh
10 	 0.490 	 About Us  :  harvest
7 	 0.480 	 Red Onions  :  fresh, flesh
9 	 0.469 	 Pink Onions  :  fresh, flesh
22 	 0.426 	 LPI DA-6 sewing machine  :  cotton
21 	 0.426 	 LPI DE-DA sewing machine  :  cotton
20 	 0.426 	 WAZIR DE-DA sewing machine  :  cotton
19 	 0.426 	 LPI DA ? 1 Extra Heavy Duty  :  cotton
18 	 0.426 	 DA-TEX sewing machine  :  cotton


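This shows which terms in each of the documents were most similar to terms in the search query. Note that `doc_similar_terms` holds one entry per (query term, document) pair, so indexing it by document index, as above, lists only the matches for the first query term ('fruit'). What it doesn't show, however, is the exact contribution of each term to the document score, as each word similarity score is weighted by the TF-IDF weights of the terms involved.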
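
One way to look at those contributions is to split the numerator of the soft cosine score into one product per pair of query term and document term. The sketch below assumes hypothetical TF-IDF weights and similarity values; it is not part of the notebook's pipeline, and the final score would also be divided by the normalisation terms.

```python
def term_contributions(query_weights, doc_weights, term_sim):
    """Split the soft cosine numerator into one contribution per (query term, document term) pair."""
    contributions = []
    for q_term, q_weight in query_weights.items():
        for d_term, d_weight in doc_weights.items():
            # Identical terms have similarity 1; otherwise look up the pair
            similarity = 1.0 if q_term == d_term else term_sim.get((q_term, d_term), 0.0)
            if similarity > 0.0:
                contributions.append((q_term, d_term, q_weight * similarity * d_weight))
    return sorted(contributions, key=lambda item: item[2], reverse=True)

# Hypothetical TF-IDF weights and term similarities, for illustration only
query_weights = {'fruit': 0.8, 'vegetables': 0.6}
doc_weights = {'fresh': 0.2, 'tomatoes': 0.9}
term_sim = {
    ('fruit', 'fresh'): 0.7,
    ('fruit', 'tomatoes'): 0.5,
    ('vegetables', 'tomatoes'): 0.6,
}

ranked = term_contributions(query_weights, doc_weights, term_sim)
for q_term, d_term, value in ranked:
    print(f'{q_term} ~ {d_term} \t {value:0.3f}')
```

In this made-up example 'fresh' is the word most similar to 'fruit', yet it contributes least to the score because its weight in the document is low.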

<a id="sec2b"></a>
## 2b. Using the ready-made DocSim class

The `DocSim` class wraps up functionality to prepare and compare data in a single object. It also persists the word embedding model to avoid having to reload it each time it is used. The word embedding model is loaded on initialisation, as this is quite a long-running task.

`DocSim_threaded` has similar functionality, but loads the model in a separate thread. Similarity queries cannot be evaluated until the model is ready - check the status of the `model_ready` flag.
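
The general pattern of loading a slow resource in a worker thread and exposing a readiness flag can be sketched as follows. `BackgroundLoader` and `slow_load` are hypothetical names used for illustration; this is not the implementation of `DocSim_threaded`, only the idea behind checking a `model_ready` flag.

```python
import threading
import time

class BackgroundLoader:
    """Hypothetical helper: loads a slow resource in a worker thread."""

    def __init__(self, load_fn):
        self.model = None
        self.model_ready = False
        self._thread = threading.Thread(target=self._load, args=(load_fn,), daemon=True)
        self._thread.start()

    def _load(self, load_fn):
        self.model = load_fn()
        self.model_ready = True  # set only after the model has been assigned

    def wait(self, timeout=None):
        """Block until loading finishes (or the timeout expires) and return the readiness flag."""
        self._thread.join(timeout)
        return self.model_ready

def slow_load():
    time.sleep(0.2)  # stand-in for loading a large word vector file
    return {'fruit': [0.1, 0.2]}

loader = BackgroundLoader(slow_load)
print(f'Model ready: {loader.model_ready}')   # the load is normally still running here
ready_after_wait = loader.wait(timeout=5)
print(f'Model ready: {ready_after_wait}')
```

The constructor returns immediately, so other setup work can continue while the model loads; queries should only be issued once the flag is set.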

In [20]:
import json
import docsim

In [21]:
%%time

docsim_obj = docsim.DocSim(verbose=True)
# docsim_obj = docsim.DocSim_threaded(verbose=True)

Loading default GloVe word vector model: glove-wiki-gigaword-50
Model loaded
Wall time: 28.2 s


In [22]:
print(f'Model ready: {docsim_obj.model_ready}')

Model ready: True


In [23]:
# Load test data
with open('test_data.json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

print(f'{len(documents)} documents')

query_string = 'fruit and vegetables'

28 documents


In [24]:
%%time

similarities = docsim_obj.similarity_query(query_string, documents)

28 documents loaded into corpus
Wall time: 7.26 s


  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))


In [25]:
# Output the similarity scores for top 15 documents
for idx, score in (sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:15]):
    print(f'{idx} \t {score:0.3f} \t {titles[idx]}')

0 	 0.631 	 Pomegranate Bhagwa
11 	 0.613 	 Small Onions
12 	 0.612 	 Tomatoes
16 	 0.589 	 Grapes Black Sharad Seedless
17 	 0.563 	 Grapes Flame / Red Seedless
1 	 0.560 	 Pomegranate Arakta
4 	 0.516 	 White Onions
10 	 0.490 	 About Us
7 	 0.480 	 Red Onions
9 	 0.469 	 Pink Onions
18 	 0.426 	 DA-TEX sewing machine
19 	 0.426 	 LPI DA ? 1 Extra Heavy Duty
20 	 0.426 	 WAZIR DE-DA sewing machine
21 	 0.426 	 LPI DE-DA sewing machine
22 	 0.426 	 LPI DA-6 sewing machine
