# Performing search with the built Whoosh and Annoy modules

After building the indices for our search engine(s), we only need to read the stored objects back in. The resulting hit indices can be carried forward to another pipeline, or rendered here by looking them up in the corpus.

In [1]:
# -*- coding: utf-8 -*-

import json, os, spacy, re, gensim, string, collections, pickle, sys, time
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim import corpora
import pandas as pd
import numpy as np

from pathos.helpers import cpu_count, freeze_support
from pathos.multiprocessing import ProcessingPool
from tqdm import tqdm

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser, OrGroup, MultifieldParser

from annoy import AnnoyIndex




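Don't forget to add COVID-19-specific information to the spaCy model: each of the terms below is assigned the vector of a short description of the virus and the disease.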

In [2]:
spacy_nlp = spacy.load('en_core_sci_lg')
new_vector = spacy_nlp(
               """Positive-sense single‐stranded ribonucleic acid virus, subgenus 
                   sarbecovirus of the genus Betacoronavirus. 
                   Also known as severe acute respiratory syndrome coronavirus 2, 
                   also known by 2019 novel coronavirus. It is 
                   contagious in humans and is the cause of the ongoing pandemic of 
                   coronavirus disease. Coronavirus disease 2019 is a zoonotic infectious 
                   disease.""").vector    
vector_data = {"COVID-19": new_vector,
               "2019-nCoV": new_vector,
               "SARS-CoV-2": new_vector}    
for word, vector in vector_data.items():
    spacy_nlp.vocab.set_vector(word, vector)


#spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
#cpu_number = cpu_count()

import warnings
warnings.filterwarnings("ignore")

In [3]:
def search_index(ix, search_query, max_nb_docs):
    #OR-combine the query terms; OrGroup.factory(0.9) could be passed as `group`
    #instead to reward documents that match more of the terms
    #specify fields to search through in the indexed objects
    parser = MultifieldParser(["title", "content", "umls"], ix.schema, group=OrGroup)
    query = parser.parse(search_query)
    #run the search and collect (title, score) pairs while the searcher is still open
    with ix.searcher() as searcher:
        results = searcher.search(query, limit=max_nb_docs)
        results = [(x["title"], x.score) for x in results]

    return results

Our current search engine model is twofold: a Whoosh word-based search and an Annoy doc2vec-based nearest-neighbour search. The final search hits are obtained by mixing both result lists.
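How the two lists are mixed is not spelled out in this notebook. As one possible approach, the sketch below uses reciprocal rank fusion, which combines lists by rank only and therefore sidesteps the fact that Whoosh scores (higher is better) and Annoy distances (lower is better) live on different scales. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the toy paper ids are assumptions made for illustration, not part of the original pipeline.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of (doc_id, score) pairs using ranks only.

    Each list must already be sorted best-first; the raw scores are ignored.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# toy data with made-up ids, both lists sorted best-first
toy_whoosh = [("paper_a", 32.3), ("paper_b", 31.7), ("paper_c", 29.5)]
toy_annoy = [("paper_b", 0.86), ("paper_d", 0.88), ("paper_a", 0.98)]

merged = reciprocal_rank_fusion([toy_whoosh, toy_annoy])
print([doc_id for doc_id, _ in merged])
```

Documents that appear in both lists (`paper_a`, `paper_b`) end up ahead of those found by only one engine, which is usually the behaviour wanted from a hybrid search.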

Let's enter our search query and the maximal number of results to retrieve from each search engine.

In [4]:
#put in a search query
search_query = 'covid19 heart diseases risks'
max_nb_docs = 10

Since the Whoosh index has been constructed from lemmas and UMLS terms, we need to transform the initial search query into its lemma forms:

In [5]:
#load the pickled whoosh index (the file handle is closed again right after reading)
with open("ix_whoosh_doc.p", "rb") as f:
    ix = pickle.load(f)
#transform the search query into its lemma forms
sq_nlp = spacy_nlp(search_query)
search_query_lemmas = ' '.join([x.lemma_ for x in sq_nlp])
print('Lemma form of the search query: {}'.format(search_query_lemmas))

Lemma form of the search query: covid19 heart disease risk


Obtaining the Whoosh hits with their scores:

In [6]:
whoosh_result = search_index(ix, search_query_lemmas, max_nb_docs)

To get indices from Annoy, we transform the search query into a document vector and search for its nearest neighbours in Annoy's forest. Once we have the indices, we convert them into the original paper ids:

In [7]:
#search query doc encoding from scispacy
search_query_vector = sq_nlp.vector
#annoy indexing: the dimension (200) must match the vectors the index was built with
u = AnnoyIndex(200, 'angular')
u.load('semantic_search_doc.tree')
#get_nns_by_vector gives us two lists: a list with indices and a list with distances
#note: search_k=10 inspects very few nodes; a larger value trades speed for accuracy
annoy_results = u.get_nns_by_vector(search_query_vector, max_nb_docs, search_k=10, include_distances=True)
annoy_results = zip(*annoy_results)
#re-map the annoy indices to the paper ids of the text corpus
with open("paper_id_list.txt", "r") as f:
    paper_id_list = f.readlines()
annoy_results_ids = [(paper_id_list[x[0]].strip(),x[1]) for x in annoy_results]
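The values returned alongside the Annoy indices are distances, so smaller means closer. For the `'angular'` metric, Annoy's documentation gives the distance as the Euclidean distance between the normalized vectors, i.e. `sqrt(2 * (1 - cos))`. A minimal sketch of converting such a distance back into a cosine similarity follows; the helper name `angular_to_cosine` is an assumption introduced here for illustration.

```python
import math

def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# identical directions have distance 0, orthogonal vectors have distance sqrt(2)
print(angular_to_cosine(0.0))
print(round(angular_to_cosine(math.sqrt(2.0)), 6))
# an angular distance in the range of 0.86 corresponds to a cosine similarity of roughly 0.63
print(round(angular_to_cosine(0.8624654412269592), 2))
```

Read this way, a distance close to 1.0 means a cosine similarity of only about 0.5, which helps when judging how good a match really is.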

Let's compare the results from both search modules. We load the whole corpus once again as a data frame and select the corresponding contents by paper id. For convenience, we print only the first 300 characters of each paper.

In [8]:
#load the corpus once again to be able to read the results
source_folder = r'C:\Users\lga\eclipse-workspace\Covid19\FirstAttempt\v6-7_papers\document_data'
vector_df = pd.concat([pd.read_json(os.path.join(source_folder,p)) for p in list(os.listdir(source_folder))])
    
for nb_hit, (whoosh_id, whoosh_score) in enumerate(whoosh_result):
    text_result = vector_df[vector_df.paper_id == whoosh_id]["text"].tolist()[0]
    print("Whoosh hit number: {}".format(nb_hit))
    print("Paper id: {}".format(whoosh_id))
    print("Paper whoosh score: {}".format(whoosh_score))
    print("Begining of the paper: {}".format(text_result[:300]))
    print("\n")
    


Whoosh hit number: 0
Paper id: 5f139e7f8d32031001059ea4a8fce4881a1f725b
Paper whoosh score: 32.258142099097945
Begining of the paper: journal pre-proof catheterization laboratory considerations during the coronavirus (covid-19) pandemic: from acc's interventional council and scai disclosures: fgpw reports serving as a site principal investigator for a multicenter trial supported by medtronic and receiving compensation from medtron


Whoosh hit number: 1
Paper id: 4fc8df5bcd44fad04689ea400e49294248ed36c7
Paper whoosh score: 31.710969867083314
Begining of the paper: journal pre-proof covid-19 and the renin-angiotensin system covid-19 and the renin-angiotensin system in late 2019, a coronavirus disease (covid-19) leading to severe acute respiratory syndrome (sars) started in china and has become a pandemic. the responsible virus has been designated sars-cov-2. t


Whoosh hit number: 2
Paper id: 5f390d49e1013bb9d4e4e7ece57004c4538737c1
Paper whoosh score: 29.50293315528709
Begining of the 

In [9]:
for nb_hit, (annoy_id, annoy_score) in enumerate(annoy_results_ids):
    text_result = vector_df[vector_df.paper_id == annoy_id]["text"].tolist()[0]
    print("Annoy hit number: {}".format(nb_hit))
    print("Paper id: {}".format(annoy_id))
    print("Paper annoy score: {}".format(annoy_score))
    print("Begining of the paper: {}".format(text_result[:300]))
    print("\n")

Annoy hit number: 0
Paper id: c3ba4e042c5173d4a141f12cc5af6bcc9d7e9bb1
Paper annoy score: 0.8624654412269592
Begining of the paper: the care of highly contagious life-threatening infectious diseases (hlid


Annoy hit number: 1
Paper id: 85819dd0484e8be37690f33d94215268d6aa908d
Paper annoy score: 0.8804579973220825
Begining of the paper: emerging viral diseases in pulmonary medicine


Annoy hit number: 2
Paper id: d48449ff827937b1f17f84eaa5ce9e12e34333a3
Paper annoy score: 0.9756640791893005
Begining of the paper: risikomanagement besonderer infektionssituationen risk management of special infections phase


Annoy hit number: 3
Paper id: e685324959e008880480a9051e82e25ea28ab28a
Paper annoy score: 0.9995529055595398
Begining of the paper: intensivmed 5 2007 der notfallplan des krankenhauses bei allgemeingefährlichen infektionskrankheiten hospital emergency plan for the management of patients with highly contagious diseases " abstract patients with imported highly contagious diseases like

As seen, the semantic search by Annoy tends to pick up incomplete documents consisting of only a few words, which should be a subject of further investigation (data corruption?).
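A first step in such an investigation could be to flag hits whose stored text is suspiciously short. The sketch below is only a starting point under stated assumptions: the helper `flag_short_documents`, the `min_chars=300` threshold (chosen to match the preview length used above), and the toy ids and texts are all hypothetical.

```python
def flag_short_documents(hits, texts_by_id, min_chars=300):
    """Return (paper_id, text_length) for hits whose text is shorter than min_chars."""
    flagged = []
    for paper_id, _score in hits:
        text = texts_by_id.get(paper_id, "")
        if len(text) < min_chars:
            flagged.append((paper_id, len(text)))
    return flagged

# toy data with made-up ids: one very short text and one of normal length
toy_texts = {
    "paper_x": "emerging viral diseases in pulmonary medicine",
    "paper_y": "full text " * 100,
}
toy_hits = [("paper_x", 0.88), ("paper_y", 0.97)]

print(flag_short_documents(toy_hits, toy_texts))
```

With the data frame from this notebook, the lookup table could be built with `dict(zip(vector_df.paper_id, vector_df.text))` and the function applied to `annoy_results_ids`.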