# Information Retrieval and Web Analytics: Deep Learning for Search

## Query Expansion with Synonyms: Word2Vec vs. FastText


- One of the most frequent issues that prevent matching is the fact that people can express a concept in multiple different ways.
- **Synonyms** are words that differ in spelling and pronunciation, but that have the same or a very close meaning. 
- In information retrieval, it’s common to use synonyms to decorate text in order to increase the probability that an appropriate query will match. 


- We will implement query expansion through synonym generation at query time only (not at index time), comparing two word embedding approaches:
    - word2vec
    - FastText

**A drawback of word2vec** is that it is not able to represent words that do not appear in the training dataset.

**FastText is an extension of Word2Vec** proposed by Facebook in 2016. Instead of feeding whole words into the neural network, **FastText breaks each word into character n-grams**.
After training, we have embeddings for all the n-grams found in the training dataset. Rare words (or words that are not in the training set) can still be represented, since it is highly likely that some of their n-grams also appear in other words.
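
To make the n-gram idea concrete, here is a minimal sketch of character n-gram extraction. It assumes the `<` and `>` word-boundary markers described in the FastText paper and the `min_n=2`, `max_n=5` range used later in this notebook; the helper name `char_ngrams` is made up for illustration and is not part of Gensim.

```python
def char_ngrams(word, min_n=2, max_n=5):
    """Return the set of character n-grams of `word`, using '<' and '>' as boundary markers."""
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(marked) - n + 1):
            grams.add(marked[start:start + n])
    return grams


seen = char_ngrams("covid")  # a word assumed to be in the training vocabulary
typo = char_ngrams("covi")   # a misspelling assumed to be absent from it
shared = seen & typo         # n-grams the typo has in common with the known word
print(sorted(shared))
```

Because the misspelling shares many n-grams with the known word, a vector built from its n-grams can land close to the vector of the correctly spelled term.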


We will work with a subset of the COVID-19 [Open Research Dataset (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) published on Kaggle. CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. 

We will issue some queries and we want to get the text that is related to the submitted query.

The dataset has already been preprocessed and stored in a CSV file (```'inputs/biorxiv_clean.csv'```). For each paper we have:
- paper_id
- title
- authors
- text

#### 0.Import libraries 

In [1]:
import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import preprocess_string
import nltk
nltk.download('punkt') # used in sent_tokenize

[nltk_data] Downloading package punkt to C:\Users\Carlos
[nltk_data]     Chen\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

#### 1.Read data

In [2]:
def load_corpus():
    cols = ['paper_id', 'title', 'authors', 'text']
    
    biorxiv_clean = pd.read_csv('inputs/biorxiv_clean.csv', na_filter=False, usecols=cols)
    
    # 'paper_id' is used as an index column
    biorxiv_clean.set_index('paper_id', inplace=True)    
    return biorxiv_clean

In [3]:
corpus = load_corpus()
corpus.shape

(1625, 3)

In [4]:
corpus.head(3)

| paper_id | title | authors | text |
|---|---|---|---|
| bbf09194127619f57b3ddf5daf684593a5831367 | The Effectiveness of Targeted Quarantine for M... | Alastair Jamieson-Lane, Eric Cytrnbaum | Introduction\n\nCOVID-19, initially observed/d... |
| 2a21fdd15e07c89c88e8c2f6c6ab5692568876ec | Evaluation of Group Testing for SARS-CoV-2 RNA | Nasa Sinnott-Armstrong, Daniel L Klein, Brenda... | Introduction\n\nGroup testing was first descri... |
| e686d1ce1540026ecb100c09f99ed091c139b92c | Why estimating population-based case fatality ... | Lucas Böttcher, Mingtao Xia, Tom Chou | \n\nDifferent ways of calculating mortality ra... |


#### 2.Split data into paragraphs

The full text of a paper is contained in column 'text'.

Let's work at paragraph level: this way the search returns more focused and more readable answers than full-text search.

- **Split the text of each paper into paragraphs with a simple rule: paragraphs are separated by "\n", and a paragraph is kept only if it is longer than 100 characters.**

- We also return respective paper IDs because we want to go from a paragraph to its respective paper later.

In [23]:
def get_paragraphs(corpus, sep="\n", min_length=100, verbose=10000):
    paragraphs, paper_ids = [], []
    
    for i, (paper_id, row) in enumerate(corpus.iterrows()):
        
        # loop over the candidate paragraphs of this paper
        for text in row["text"].split(sep):
            if len(text) > min_length:  # keep only sufficiently long paragraphs
                paragraphs.append(text)  # add paragraph to the list of all paragraphs
                paper_ids.append(paper_id)  # record the paper the paragraph came from
        
        # print progress if needed
        if verbose > 0 and (i + 1) % verbose == 0:
            print(f"Progress: {i + 1}")
            
    return paragraphs, paper_ids

In [33]:
paragraphs, paper_ids = get_paragraphs(corpus, verbose=1000)
len(paragraphs)

Progress: 1000


15768

In [11]:
paragraphs[0]

'COVID-19, initially observed/detected in Hubei province of China during December 2019, has since spread to all but a handful countries, causing (as of the time of writing) an estimated 855,000 infections and 42,000 deaths ( [8] , March 31st). COVID-19 has a basic reproductive number, R 0 , currently estimated in the region of 2.5 -3 [5] . Social distance and general quarantine measures can reduce R 0 temporarily, but not permanently. For R 0 = 3, left unchecked COVID-19 can be expected to infect more than 90% of our community, with 30% of the population infected at the epidemic peak. Even with significant quarantine measures in place the population will not reach "herd immunity" to this virus until 2/3 of the population has gained resistance-either through vaccination, or infection and subsequent recovery.In order to place these numbers in a concrete context, a recent survey in New Zealand indicated that the country had a total of 520 ventilator machines [7] . Given the country\'s dem

In [12]:
paper_ids[0]

'bbf09194127619f57b3ddf5daf684593a5831367'

#### 3. Preprocess paragraphs

Preprocess and tokenize paragraphs for the search engine. Please use ```gensim.parsing.preprocessing.preprocess_string``` (note that it is passed as an input parameter), which with its default filters:
- lowercases the text and removes tags, punctuation, numbers, extra whitespace, stop words, and very short tokens;
- stems the remaining words and returns them as a list of tokens.

**Note:** function ```get_tokens``` is a **generator**: this way we can tokenize the entire corpus on the fly, i.e. without loading the result in memory.

In [34]:
def get_tokens(docs, preprocess=preprocess_string, verbose=10000):
    
    for i, doc in enumerate(docs):
        yield preprocess(doc) # preprocess
        
        # print progress if needed
        if verbose > 0 and (i + 1) % verbose == 0:
            print(f"Progress: {i + 1}")

In [35]:
paragraph_tokens = list(get_tokens(paragraphs))

Progress: 10000


In [36]:
paragraph_tokens[:1]

[['covid',
  'initi',
  'observ',
  'detect',
  'hubei',
  'provinc',
  'china',
  'decemb',
  'spread',
  'hand',
  'countri',
  'caus',
  'time',
  'write',
  'estim',
  'infect',
  'death',
  'march',
  'covid',
  'basic',
  'reproduct',
  'number',
  'current',
  'estim',
  'region',
  'social',
  'distanc',
  'gener',
  'quarantin',
  'measur',
  'reduc',
  'temporarili',
  'perman',
  'left',
  'uncheck',
  'covid',
  'expect',
  'infect',
  'commun',
  'popul',
  'infect',
  'epidem',
  'peak',
  'signific',
  'quarantin',
  'measur',
  'place',
  'popul',
  'reach',
  'herd',
  'immun',
  'viru',
  'popul',
  'gain',
  'resist',
  'vaccin',
  'infect',
  'subsequ',
  'recoveri',
  'order',
  'place',
  'number',
  'concret',
  'context',
  'recent',
  'survei',
  'new',
  'zealand',
  'indic',
  'countri',
  'total',
  'ventil',
  'machin',
  'given',
  'countri',
  'demograph',
  'tabl',
  'current',
  'estim',
  'tabl',
  'provid',
  'demograph',
  'data',
  'new',
  'zealand

#### 4. Build index with BM25

Use the class ```BM25``` from Gensim to build the **paragraph search index** based on paragraphs tokens.

In [37]:
from gensim.summarization.bm25 import BM25

bm25 = BM25(paragraph_tokens) # constructing a paragraph search index
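
Note that the `gensim.summarization` module imported above ships with Gensim 3.x and was removed in Gensim 4. To make explicit what such an index computes, here is a minimal pure-Python sketch of a textbook Okapi BM25 variant. It is not Gensim's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the smoothed IDF formula are assumptions chosen for illustration, and the toy documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a textbook BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # smoothed IDF; the "+ 1" inside the log keeps it non-negative
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # term-frequency saturation with document-length normalization
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    ["covid", "infect", "popul"],
    ["quarantin", "measur", "reduc"],
    ["covid", "covid", "vaccin"],
]
toy_bm25 = bm25_scores(["covid"], toy_docs)
print(toy_bm25)
```

Documents that do not contain any query term score 0, and repeated occurrences of a term raise the score with diminishing returns.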


#### 5.Implement Search

Having built the paragraph index, we now implement the function ```get_top_n```, which retrieves the top N best-matching results for a given query (a string of keywords). It does the following:

- Split and preprocess the query with the same pipeline used for the paragraphs.
- Calculate the BM25 scores of all indexed documents with respect to the query.
- Get the indices of the top N scores using NumPy's efficient ```argpartition()``` function.
- Sort the retrieved top N scores in descending order and return their indices.
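
The selection step can also be sketched with the standard library alone, which may help when reading the NumPy version. This is an illustrative alternative based on `heapq.nlargest`, not the notebook's method; the helper name `top_n_indices` and the example scores are made up.

```python
import heapq


def top_n_indices(scores, n=10):
    """Return the indices of the n highest scores, best first; empty list if nothing matched."""
    best = heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)
    # if every selected score is 0, no document matched the query
    if all(scores[i] == 0 for i in best):
        return []
    return best


example_scores = [0.0, 2.5, 0.3, 1.1]
top_two = top_n_indices(example_scores, n=2)
no_match = top_n_indices([0.0, 0.0, 0.0], n=2)
print(top_two, no_match)
```

Like the partition-based approach, this avoids sorting the full score list when only a handful of results is needed.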

In [38]:
import numpy as np

def get_top_n(bm25, query, n=10):
    
    # turn the query string into a flat list of preprocessed terms
    query = query.split() # split the query string into words
    query = list(get_tokens(query)) # apply the same preprocessing used for the paragraphs
    query = [item for sublist in query for item in sublist] # flatten list of lists into a list
    # score every indexed paragraph against the query
    scores = np.array(bm25.get_scores(query))
    
    # get indices of top N scores
    idx = np.argpartition(scores, -n)[-n:]

    # if all the top scores are 0, nothing matched: return an empty list
    if np.sum(scores[idx]) == 0: 
        return []
    # otherwise sort the top N by descending score and return their indices
    return idx[np.argsort(-scores[idx])]

#### Play with the search

Try some queries related to COVID-19, including queries with a typo, to see the results. Example queries:
- covid
- coronavirus
- covi (with typo)
- etc.


In [39]:
top_idx = None
test_query = "covid"
try:
    top_idx = get_top_n(bm25, test_query)[0]
    print('paper_id: {}'.format(paper_ids[top_idx]))
    print('\nText: {}'.format(paragraphs[top_idx]))
    
except IndexError:  # get_top_n returned an empty list
    print("No matching documents found")


paper_id: 2f516463f332a70a46e5157c6ce1030d0abc994d

Text: . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity.is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.05.20050245 doi: medRxiv preprint Clinical screening will decrease the prevalence of COVID-19 carriers in the test population if the probability of being asymptomatic given COVID-19 (+) is less than the proportion of the population that is asymptomatic (regardless of COVID-19 (+) or (-)). We demonstrate this by proving the following claim.Claim. The sub-population not showing any symptoms will have a lower proportion of symptomatic carriers when the proportion of people not showing any symptoms in our group is less than the estimated proportion of COVID-19 carriers who are asymptomatic.We can prove this through a couple of applications of Bayes' theorem. We refer to t

To check which keywords occur in a retrieved paragraph, I implement the function ```highlight```, which marks in red the given keywords (after preprocessing) within a tokenized paragraph.

In [40]:
from IPython.display import HTML

def mark(s, color='black'):
    return "<text style=color:{}>{}</text>".format(color, s)

def highlight(keywords, tokens, color='red'):
    keywords = keywords.split()
    keywords = list(get_tokens(keywords))
    keywords = [item for sublist in keywords for item in sublist]
    kw_set = set(keywords)
    tokens_hl = []
    
    for t in tokens:
        if t in kw_set:
            tokens_hl.append(mark(t, color=color))
        else:
            tokens_hl.append(t)
    
    return " ".join(tokens_hl)

In [41]:

par = ''
if top_idx is not None:  # index 0 is a valid result
    par = paragraph_tokens[top_idx]
HTML(highlight(test_query, par))

- You may notice that when you submit a query with a typo (for instance "covi") the search does not find any matching result.

### Query Expansion with Word2Vec and Fast Text

Here we compare **word2vec** and **FastText**. We will show that the former can find synonyms (and thus perform query expansion) only for terms that are part of the training set. The latter, working with n-grams, can also generate synonyms for terms that do not appear in the training set.

We will train the two models, a Word2Vec and a fastText model using **Gensim**.

Since FastText word embedding models are trained on sentences, we need to split paragraphs into sentences first. 

Complete the function ```get_sentences```. To split paragraphs into sentences use ```nltk.sent_tokenize```.

Note: the function ```get_sentences``` is a generator, but this time we materialize its output as a list.


In [45]:
def get_sentences(docs, verbose=10000):
    #loop over all docs (paragraphs in our case)
    for i, doc in enumerate(docs):
        
        # use nltk.sent_tokenize to split paragraphs into sentences
        for sentence in nltk.sent_tokenize(doc):
            # preprocess each sentence using gensim (return string not list)
            yield " ".join(preprocess_string(sentence))
            
        # print progress if needed
        if verbose > 0 and (i + 1) % verbose == 0:
            print(f"Progress: {i + 1}")

In [46]:
#split paragraphs into sentences
sentences = list(get_sentences(paragraphs))

Progress: 10000


In [47]:
sentences[:2]

['covid initi observ detect hubei provinc china decemb spread hand countri caus time write estim infect death march',
 'covid basic reproduct number current estim region']

- **We also split each sentence into a list of words to work with word2vec**

In [48]:
# split each sentence into a list of words
words = [s.split() for s in sentences]
words[:2]

[['covid',
  'initi',
  'observ',
  'detect',
  'hubei',
  'provinc',
  'china',
  'decemb',
  'spread',
  'hand',
  'countri',
  'caus',
  'time',
  'write',
  'estim',
  'infect',
  'death',
  'march'],
 ['covid', 'basic', 'reproduct', 'number', 'current', 'estim', 'region']]

#### Create a Word2Vec model

Use the following parameters:

- size=100: embedding dimension
- window=10: 10 tokens before and 10 tokens after, to get a wider context
- min_count=10: only consider tokens with at least 10 occurrences in the corpus
- negative=15: negative sampling; bigger than the default, to sample more negative examples
- sg=1: training algorithm; 1 for skip-gram, otherwise CBOW


In [54]:
#create a word2Vec model  
w2v_model = Word2Vec(sentences = words, size=100, window=10, min_count=10, negative=15, sg = 1)

- At this point we have trained a word2vec model. Given a query, we can find similar terms, treat them as synonyms, and use them to expand the original query.

- As an example, try to expand the query "coronavirus".
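
Similar terms are ranked by the cosine similarity between their embedding vectors. The following minimal sketch shows that ranking on hand-made 3-dimensional vectors; the vectors and the helper names `cosine_similarity` and `most_similar_toy` are purely illustrative and are not taken from the trained model or from Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(term, vectors, topn=2):
    """Rank all other terms by cosine similarity to `term` (KeyError if `term` is unknown)."""
    target = vectors[term]
    ranked = sorted(
        ((other, cosine_similarity(target, vec)) for other, vec in vectors.items() if other != term),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]


# made-up vectors, for illustration only
toy_vectors = {
    "coronaviru": [0.9, 0.1, 0.0],
    "betacoronaviru": [0.8, 0.2, 0.1],
    "quarantin": [0.0, 0.9, 0.4],
}
neighbours = most_similar_toy("coronaviru", toy_vectors)
print(neighbours)
```

As with a word-level model, asking for a term that has no vector fails with a `KeyError`, because there is nothing to compare against.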

In [55]:
query = 'coronavirus'
w2v_model.wv.most_similar(query)


[('coronaviru', 0.7579174041748047),
 ('betacoronaviru', 0.7556777596473694),
 ('virus', 0.750465989112854),
 ('betacoronavirus', 0.7273703813552856),
 ('cov', 0.7027322053909302),
 ('sar', 0.6927350163459778),
 ('gorbalenya', 0.6909762024879456),
 ('coronavirida', 0.689199686050415),
 ('alphacoronaviru', 0.6876479387283325),
 ('civet', 0.6861595511436462)]

- Implement a function ```expand_query``` that, given a query, expands it with the n most similar terms for each query term

In [60]:
def expand_query(query, wv, topn=10):
    
    query = preprocess_string(query)
    expanded_query = [t for t in query] # initialize with original query. Note, it is a list
    
    # extend each single term of the original query and append to expanded query
    for t in query:
        expanded_query.extend(s for s, f in wv.most_similar(t, topn=topn))
        
    return expanded_query

In [61]:
query = 'covid'
try:
    expanded_query = ' '.join(expand_query(query, w2v_model.wv))
    print(expanded_query)
except Exception as e:
    print(e)

covid misdiagnos sari nephropathi diseas jurisdict covd afflict advoc case swiftli




 - Notice that we get some odd-looking terms because we are working on preprocessed (stemmed, etc.) text.

#### Perform search with expanded query using Word2Vec
 - Play with some queries - **insert also query with typos (see Covi for instance)**

In [62]:
# Play with some queries - insert also query with typos

expanded_query = ''
query = 'covid coronavirus'
try:
    expanded_query = ' '.join(expand_query(query, w2v_model.wv))
except Exception as e:
    print(e)

top_idx = None
print('Original query: {}'.format(query))
print('Expanded query: {}'.format(expanded_query))
print('---')
try:
    top_idx = get_top_n(bm25, expanded_query)[0]
    print('\npaper_id: {}'.format(paper_ids[top_idx]))
    print('\nText: {}'.format(paragraphs[top_idx]))
    
except IndexError:  # get_top_n returned an empty list
    print("No matching documents found")



Original query: covid coronavirus
Expanded query: covid coronaviru misdiagnos sari nephropathi diseas jurisdict covd afflict advoc case swiftli renam coronavirus novel ncov betacoronaviru sar deadli cov peiri broke
---

paper_id: f4636bda5c5ae53ed6c05f5aba349a84cdb862be

Text: All the databases of ICMJE-accepted platforms of clinical-trial registries [10] were considered. Search terms for Chinese Clinical Trial Registry (ChiCTR) were: "COVID-19," "2019-novel Corona Virus (2019-nCoV)," "Novel Coronavirus Pneumonia (NCP)," "Severe Acute Respiratory Infection (SARI)," and "Severe Acute Respiratory Syndrome -Corona Virus-2 (SARS-CoV-2)." Search terms for the Netherlands National Trial Register were "nCoV," "Coronavirus," "SARS," "SARI," "NCP," and "COVID." Search terms for other databases were "2019-nCoV OR Novel Coronavirus OR New Coronavirus OR SARS-CoV-2 OR SARI OR NCP OR Novel Coronavirus Pneumonia OR COVID-19 OR Wuhan pneumonia."The search was conducted on 14 February 2020. The detail



In [63]:
par = ''
if top_idx is not None:  # index 0 is a valid result
    par = paragraph_tokens[top_idx]
HTML(highlight(expanded_query, par))

#### Important!! 

**Try to expand a query containing a term not seen in the training set (for instance, a single-term query with a typo such as "covi")**

In [64]:
query = 'covi'
try:
    expanded_query = ' '.join(expand_query(query, w2v_model.wv))
    print(expanded_query)
except Exception as e:
    print(e)

"word 'covi' not in vocabulary"




- **Note: Word2Vec can only generate synonyms for terms already seen during the training phase.**
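
One consequence is that a single unknown term makes the expansion of a whole multi-term query fail. A possible workaround, sketched below, is to keep unknown terms unexpanded instead of aborting. The function `expand_query_skip_unknown` and the stand-in class `ToyVectors` are hypothetical names introduced for this illustration; the only assumption about the vectors object is that its `most_similar` raises `KeyError` for unknown terms, as the error message above suggests. The similarity scores are made up.

```python
def expand_query_skip_unknown(tokens, wv, topn=10):
    """Expand each token with its nearest neighbours; unknown tokens are kept but not expanded."""
    expanded = list(tokens)
    for token in tokens:
        try:
            expanded.extend(term for term, _score in wv.most_similar(token, topn=topn))
        except KeyError:
            # token not in the vocabulary: leave it in the query as typed
            continue
    return expanded


class ToyVectors:
    """Hypothetical stand-in for trained word vectors, used only in this example."""

    def __init__(self, neighbours):
        self._neighbours = neighbours

    def most_similar(self, term, topn=10):
        if term not in self._neighbours:
            raise KeyError(f"word '{term}' not in vocabulary")
        return self._neighbours[term][:topn]


toy_wv = ToyVectors({"coronaviru": [("betacoronaviru", 0.75), ("cov", 0.70)]})
expanded_tokens = expand_query_skip_unknown(["covi", "coronaviru"], toy_wv)
print(expanded_tokens)
```

This keeps the search usable for the known terms, but the misspelled term itself still contributes no synonyms.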

--------------------------------------

### Query Expansion with fastText
- We have already split the paragraphs into sentences above, since FastText is trained on sentences.

In [65]:
sentences = list(get_sentences(paragraphs))

Progress: 10000


The reason why I made the list (thus storing the results in memory) is that I am going to use it twice: for building vocabulary and then fastText training. This just saves a lot of runtime in this case.

**Initialize the fastText model**. Use the following parameters:
- sg=1, # use skip-gram: usually gives better results
- size=100, # embedding dimension (default)
- window=10, # window size: 10 tokens before and 10 tokens after to get wider context
- min_count=10, # only consider tokens with at least 10 occurrences in the corpus
- negative=15, # negative sampling: bigger than default to sample negative examples more
- min_n=2, # min character n-gram
- max_n=5 # max character n-gram

In [66]:
from gensim.models.fasttext import FastText

ft_model = FastText(
    sg=1, # use skip-gram: usually gives better results
    size=100, # embedding dimension (default)
    window=10, # window size: 10 tokens before and 10 tokens after to get wider context
    min_count=10, # only consider tokens with at least 10 occurrences in the corpus
    negative=15, # negative sampling: bigger than default to sample negative examples more
    min_n=2, # min character n-gram
    max_n=5 # max character n-gram
)

Build a vocabulary using the in-memory list of sentences and generator of tokens.

In [67]:
ft_model.build_vocab(get_tokens(sentences, verbose=100000))

Progress: 100000
Progress: 200000


Train the FastText model for 3 epochs using the in-memory list of sentences and a generator of tokens. Since we use a generator to get tokens on the fly, we need to call "train" 3 times, specifying "epochs=1" each time. Otherwise, the generator would be exhausted after the first epoch.
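
The exhaustion behaviour is a general property of Python generators and can be seen on a toy example. The function `token_stream` and the sample sentences are made up for this illustration.

```python
def token_stream(sentences):
    """Yield one token list per sentence, lazily."""
    for sentence in sentences:
        yield sentence.split()


toy_sentences = ["covid basic reproduct number", "social distanc measur"]

stream = token_stream(toy_sentences)
first_pass = list(stream)    # consumes the generator
second_pass = list(stream)   # same generator object: nothing is left
fresh_pass = list(token_stream(toy_sentences))  # a new generator yields everything again

print(len(first_pass), len(second_pass), len(fresh_pass))
```

Creating a new generator for every epoch, as the training loop does, gives each pass over the data a full stream of sentences.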

In [68]:
epochs = 3

for epoch in range(epochs):
    print(f"Epoch {epoch}")
    
    ft_model.train(
        get_tokens(sentences, verbose=100000),
        epochs=1,
        total_examples=ft_model.corpus_count, 
        total_words=ft_model.corpus_total_words)

Epoch 0
Progress: 100000
Progress: 200000
Epoch 1
Progress: 100000
Progress: 200000
Epoch 2
Progress: 100000
Progress: 200000


Inspect the trained fastText model by listing the terms most similar to a test query.

In [69]:
test_query = "covid"
ft_model.wv.most_similar(test_query)


[('covd', 0.861100435256958),
 ('chinazzi', 0.804714024066925),
 ('seventi', 0.8020514845848083),
 ('seventeen', 0.8010070323944092),
 ('pediatr', 0.7969161868095398),
 ('selfreport', 0.7945656776428223),
 ('sever', 0.7945213317871094),
 ('nineteen', 0.785205602645874),
 ('clinicaltri', 0.7835091352462769),
 ('sari', 0.7827175855636597)]

#### Perform search with query expanded using FastText

 - Play with some queries - **insert also query with typos (see Covi for instance)**

In [70]:
# Play with some queries - insert also query with typos (see Covi for instance)

expanded_query = ''
query = 'Immune response and immunity'
try:
    expanded_query = ' '.join(expand_query(query, ft_model.wv))
except Exception as e:
    print(e)

top_idx = None
print('Original query: {}'.format(query))
print('Expanded query: {}'.format(expanded_query))
print('---')
try:
    top_idx = get_top_n(bm25, expanded_query)[0]
    print('\npaper_id: {}'.format(paper_ids[top_idx]))
    print('\nText: {}'.format(paragraphs[top_idx]))
    
except IndexError:  # get_top_n returned an empty list
    print("No matching documents found")





Original query: Immune response and immunity
Expanded query: immun respons immun immuni autoimmun immunosuppress immunocompromi immunomodulatori immunolog immunogen immunodomin immunoreact immunostain respon respond rumor congress recess correspondingli regain anymor recapitul repress immuni autoimmun immunosuppress immunocompromi immunomodulatori immunolog immunogen immunodomin immunoreact immunostain
---

paper_id: fda076d8d8b2fb52990e3db1982e4af2ff0fb08e

Text: The overall analysis revealed thirteen promiscuous B-cell and T-cell epitopes that consist of four immunogenic continuous B-cell epitopes (ANYVQASEK, NYVQASEK, KSVEKPAS, and TPQQPPAQ), seven discontinuous B-cell epitopes, three immunogenic MHC-I epitopes (YVYDTRGKL, FYRQGAFEL, and FTQLVAAYL) , and three immunogenic MHC-II epitopes (FFGGKVLNF, FYRQGAFEL, and  FDYALVQHF) , that proposed to be used in multi-epitope peptide vaccine designing.Molecular docking and population coverage analysis are crucial factor in the development 

In [71]:
par = ''
if top_idx is not None:  # index 0 is a valid result
    par = paragraph_tokens[top_idx]
HTML(highlight(expanded_query, par))

#### Important!! 

**Try to expand a query containing a term not seen in the training set (for instance, a single-term query with a typo such as "covi")**

In [72]:
query = 'covi'
try:
    expanded_query = ' '.join(expand_query(query, ft_model.wv))
    print(expanded_query)
except Exception as e:
    print(e)

covi covid covd seventeen seventi cot seventh draconian seven residenti nineteen




**Notice how the FastText model can generate synonyms even for terms that were not used to train the model. As a consequence, the query can be expanded even when it contains terms not previously seen.**

#### Reference:
- https://www.kaggle.com/slavaz/simple-paragraph-search-bm25-fasttext-qe