# Building (a) search engine(s) with Whoosh and Annoy

Once we have the data produced by the [CoronaWhy Team](https://www.coronawhy.org/) for the [Covid19 Kaggle challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), we can read it in and feed it into both a simple Whoosh search index and Annoy's search forest. Our data consists of 40k+ papers, from which we have produced three separate data sets that keep the original texts but split them at three levels of granularity: sentences, sections and entire documents.

[Whoosh](https://whoosh.readthedocs.io/en/latest/quickstart.html) is a pure-Python indexing and search library that uses the Okapi BM25F ranking function by default and also supports user-defined scoring functions. We use it here to perform a basic word-based search.
[Annoy](https://github.com/spotify/annoy) is a library for approximate nearest-neighbour search in n-dimensional numerical space, e.g. over word/document embeddings.
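
To give a feeling for what a BM25-style ranking does, here is a minimal sketch of plain Okapi BM25 in pure Python. It is not Whoosh's implementation: it ignores the per-field weighting that the "F" in BM25F adds, and the function name `bm25_scores`, the toy documents and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` (plain BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # term frequency saturates with k1 and is normalised by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# toy corpus (hypothetical), already tokenised and lemmatised
docs = [
    ["coronavirus", "spike", "protein", "binding"],
    ["influenza", "vaccine", "trial"],
    ["coronavirus", "vaccine", "coronavirus", "trial"],
]
scores = bm25_scores(["coronavirus", "vaccine"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The third document matches both query terms and therefore receives the highest score; documents that match only one term still rank above documents that match none.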

The results from the two indexing modules can be combined in different ways: with Whoosh as an n-gram filter and Annoy as a refinement step, as two competing search methods, etc.
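
The "filter, then refine" combination can be sketched without either library: a keyword search supplies candidate ids, and the candidates are then re-ranked by vector similarity to the query. Everything below is a hypothetical illustration; `filter_then_refine`, the paper ids and the two-dimensional vectors are made up, and in the real pipeline the candidates would come from a Whoosh query and the vectors from scispacy.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filter_then_refine(keyword_hits, doc_vectors, query_vector, top_k=3):
    """Re-rank the ids returned by a keyword search by similarity to the query vector."""
    ranked = sorted(
        keyword_hits,
        key=lambda doc_id: cosine(doc_vectors[doc_id], query_vector),
        reverse=True,
    )
    return ranked[:top_k]

# toy document vectors (hypothetical)
doc_vectors = {
    "p1": [1.0, 0.0],
    "p2": [0.7, 0.7],
    "p3": [0.0, 1.0],
    "p4": [0.9, 0.1],
}
# ids returned by the keyword stage; "p4" did not match the query words
keyword_hits = ["p3", "p2", "p1"]
refined = filter_then_refine(keyword_hits, doc_vectors, [1.0, 0.1], top_k=2)
```

Note that `p4` is semantically close to the query but never reaches the refinement stage, which is the main trade-off of using the keyword search as a hard filter rather than as a competing method.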

In [None]:
# -*- coding: utf-8 -*-
import json, os, spacy, re, gensim, string, collections, pickle, sys, time
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim import corpora
import pandas as pd
import numpy as np

from pathos.helpers import cpu_count, freeze_support
from pathos.multiprocessing import ProcessingPool
from tqdm import tqdm

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser, OrGroup, MultifieldParser

from annoy import AnnoyIndex

spacy_nlp = spacy.load('en_core_sci_lg')
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
cpu_number = cpu_count()

import warnings
warnings.filterwarnings("ignore")

In [None]:
source_folder = "..."

list_paths_source_texts = [os.path.join(source_folder,p) for p in os.listdir(source_folder)]


## Building search index with Whoosh

With the extracted list of lemmas, and optionally UMLS terms, we can build the Whoosh index. To make use of multiprocessing, we pass in the number of available CPUs. Afterwards, we save the Whoosh index object so that it can be reused later for search queries.

In [None]:
ix = index_texts(paper_id_list, list_lemma, list_umls, cpu_number)
# save the index object for later search queries
with open("ix_whoosh_doc.p", "wb") as f:
    pickle.dump(ix, f)

## Collecting document vectors from Scispacy

To benefit from semantic search, we need to gather document vectors across the whole corpus. The vectors themselves come from [a scispacy model](https://allenai.github.io/scispacy/). To speed this up, we once again employ multiprocessing with [Pathos](https://pypi.org/project/pathos/).

In [None]:
#divide both lists into as many even chunks as there are processes
content_list_chunked = chunking(content_list, cpu_number)
paper_id_list_chunked = chunking(paper_id_list, cpu_number)

#pool with one worker process per CPU
pp = ProcessingPool(nodes=cpu_number)
results = pp.map(obtain_doc_vec, content_list_chunked, paper_id_list_chunked, list(range(len(content_list_chunked))))
#flatten the list from multiple processes
vectors_doc_list = [x for y in results for x in y]
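
The `chunking` helper is not shown in this notebook. A minimal sketch of what such a helper could look like is given below, under the assumption that it splits a list into contiguous, nearly equal parts; the name `chunk_evenly` is hypothetical and is not the notebook's own function.

```python
def chunk_evenly(items, n_chunks):
    """Split `items` into `n_chunks` contiguous chunks whose sizes differ by at most one."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `remainder` chunks take one extra item each
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def flatten(chunked):
    """Undo the chunking, preserving the original order."""
    return [x for chunk in chunked for x in chunk]

ids = [f"paper_{i}" for i in range(10)]
chunks = chunk_evenly(ids, 3)
sizes = [len(c) for c in chunks]
```

Because the chunks are contiguous and the split depends only on the list length, applying the same function to the texts and to the paper ids keeps the two lists aligned chunk by chunk, and flattening the per-process results restores the original order.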

## Building Annoy forest for semantic search

Finally, we can feed the document vectors into Annoy to build a search forest of 10 trees.

In [None]:
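# the first argument is the vector dimensionality and must match the document vectors
t = AnnoyIndex(200, 'angular')

# the item id is the position of the vector in vectors_doc_list
for nb, x in enumerate(vectors_doc_list):
    t.add_item(nb, x)

# build a forest of 10 trees and store it on disk
t.build(10)
t.save("semantic_search_doc.tree")
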
print('\n')
print("---finished---")
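
Annoy's `'angular'` metric is the Euclidean distance between the normalised vectors, i.e. `sqrt(2 * (1 - cos(u, v)))`. Since Annoy returns approximate neighbours, an exact brute-force search over a small sample can serve as a reference when checking the forest. The sketch below is pure Python with made-up two-dimensional vectors; the function names are hypothetical and it assumes that no vector is all zeros.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance between the unit-normalised vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # clamp at zero to guard against tiny negative values from rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def brute_force_neighbours(query, vectors, n=2):
    """Exact nearest neighbours by angular distance, returned as list positions."""
    dists = sorted((angular_distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in dists[:n]]

# toy vectors (hypothetical); positions play the role of the Annoy item ids
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbours = brute_force_neighbours([1.0, 0.2], vectors, n=2)
```

The distance ranges from 0 for vectors pointing in the same direction to 2 for vectors pointing in opposite directions, so it depends only on direction and not on vector length.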