<a href="https://www.kaggle.com/code/aisuko/retrieve-re-rank?scriptVersionId=167290626" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

The notebooks [Semantic Textual Similarity](https://www.kaggle.com/code/aisuko/semantic-textual-similarity), [Semantic Search](https://www.kaggle.com/code/aisuko/semantic-search), [Similar Questions Retrieval](https://www.kaggle.com/code/aisuko/similar-questions-retrieval), [Semantic Search in Publications](https://www.kaggle.com/code/aisuko/semantic-search-in-publications), and [Wikipedia Q&A Retrieval Semantic Search](https://www.kaggle.com/code/aisuko/wikipedia-q-a-retrieval-semantic-search) showed how to compute embeddings of queries, sentences, paragraphs, and articles, and how to use the semantic search function.

In this notebook, we look at more complex search tasks. For question answering retrieval, for example, search can be improved significantly by using **Retrieve & Re-Rank**.

# Retrieve & Re-Rank Pipeline

The following pipeline works well for information retrieval and question answering retrieval. All of its components are explained in this notebook:

<div style="text-align: center"><img src="https://hostux.social/system/media_attachments/files/111/898/410/559/698/272/original/c090c2ca83c51a40.png" width="100%" height="100%" alt="Retrieve&Re-Rank pipeline"></div>


As the picture above shows, for a query we first use a **retrieval system** that retrieves a large list of, e.g., 100 hits that are potentially relevant to the query. For the retrieval, we can use either lexical search, e.g. with Elasticsearch, or dense retrieval with a semantic model.

However, the retrieval system might retrieve documents that are not very relevant to the search query. Hence, in a second stage, we use a **re-ranker** based on a **cross-encoder** that scores the relevance of every candidate for the given search query.

The output will be a ranked list of hits we can present to the user.
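
The two stages described above can be sketched in a few lines. This is a minimal illustration, not the code used later in this notebook: `retrieve_and_rerank`, `word_overlap`, and `toy_rerank` are hypothetical names, and the two scoring functions are simple stand-ins for a real retriever and a real cross-encoder.

```python
from typing import Callable, List, Tuple


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    retrieve_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 100,
    top_n: int = 5,
) -> List[Tuple[float, str]]:
    """Two-stage search: cheap scoring over the corpus, costly scoring over the candidates."""
    # Stage 1: score every document with the cheap retriever and keep the top_k candidates.
    candidates = sorted(
        corpus, key=lambda doc: retrieve_score(query, doc), reverse=True
    )[:top_k]
    # Stage 2: score only the candidates with the more expensive re-ranker.
    reranked = sorted(
        ((rerank_score(query, doc), doc) for doc in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return reranked[:top_n]


def word_overlap(query: str, doc: str) -> float:
    # Hypothetical retriever: number of distinct query words found in the document.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_rerank(query: str, doc: str) -> float:
    # Hypothetical re-ranker: fraction of the document's words that appear in the query.
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)


corpus = [
    "Adelaide is the capital city of South Australia",
    "Canberra is the capital of Australia",
    "Cats usually live for 13 to 20 years",
]
results = retrieve_and_rerank(
    "capital of Australia", corpus, word_overlap, toy_rerank, top_k=2, top_n=1
)
print(results)
```

The retriever cannot separate the two Australian passages here (both contain all three query words), while the second-stage scorer prefers the more focused one. A real cross-encoder plays the same role with a far better relevance signal.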

## Retrieval System: Bi-Encoder

For the retrieval of the candidate set, we can either use lexical search (e.g. Elasticsearch), or we can use a semantic model. However, lexical search looks for literal matches of the query words in our document collection. It will not recognize synonyms, acronyms or spelling variations. In contrast, semantic search (or dense retrieval) encodes the search query into vector space and retrieves the document embeddings that are close in vector space.

<div style="text-align: center"><img src="https://hostux.social/system/media_attachments/files/111/888/433/770/763/125/original/dafea983f4905b4b.png" width="60%" height="60%" alt="Semantic Search"></div>

Have a look at the notebook [Semantic Search](https://www.kaggle.com/code/aisuko/semantic-search) to get more detail.
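
To make "close in vector space" concrete, here is a small sketch of dense retrieval by cosine similarity. The three-dimensional vectors are made up by hand for illustration (a real bi-encoder produces them), and `dense_search` is a hypothetical helper, not a library function.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine of the angle between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_search(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    # Score every document embedding against the query and keep the closest top_k.
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hand-made toy embeddings (assumption): related words point in similar directions.
doc_vecs = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.1, 0.0]  # pretend embedding of the query "vehicle"

hits = dense_search(query_vec, doc_vecs, top_k=2)
print(hits)
```

A lexical search for "vehicle" would match none of these three documents literally, while the vector comparison still ranks "car" and "automobile" ahead of "banana".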

## Re-Rank: Cross-Encoder

The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.

A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a candidate document are passed simultaneously to a transformer network, which then outputs a single score indicating how relevant the document is for the given query (between 0 and 1 when a sigmoid is applied to the output).

<div style="text-align: center"><img src="https://hostux.social/system/media_attachments/files/111/898/502/437/707/103/original/cd1bb78c46020897.png" width="60%" height="60%" alt="Re-Rank:Cross-Encoder"></div>

The advantage of Cross-Encoders is their higher accuracy, as they perform attention across the query and the document.
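
The cross-encoder scores printed in the search outputs of this notebook are not confined to the range 0 to 1 (values such as 8.267 and -3.161 appear), which suggests the model returns raw, unbounded scores in this setup. Assuming that is the case, a logistic sigmoid is one common way to map such scores into (0, 1) without changing their order. The sketch below only illustrates that mapping; it is not part of the notebook's pipeline.

```python
import math


def sigmoid(logit: float) -> float:
    # Logistic function: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


# Three cross-encoder scores copied from the outputs in this notebook.
raw_scores = [8.267, 0.879, -3.161]
squashed = [sigmoid(score) for score in raw_scores]

for raw, prob in zip(raw_scores, squashed):
    print(f"{raw:7.3f} -> {prob:.4f}")
```

Because the sigmoid is monotonic, the ranking of the passages is the same whether raw or squashed scores are used; only the scale changes.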

Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.
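
A rough count of transformer forward passes shows why the two stages are combined. The sketch below uses the corpus size (169,597 passages) and `top_k=32` from this notebook; `count_forward_passes` is a hypothetical helper, and it counts passes only, ignoring batching and the different cost of each pass.

```python
def count_forward_passes(num_docs: int, num_queries: int, top_k: int) -> dict:
    """Count model forward passes for cross-encoder-only search vs. retrieve & re-rank."""
    return {
        # Cross-encoder alone: every (query, document) pair needs its own pass.
        "cross_encoder_only": num_queries * num_docs,
        # Bi-encoder: documents are embedded once, offline, and reused for every query.
        "bi_encoder_indexing": num_docs,
        # At query time: one pass to embed the query plus top_k cross-encoder passes.
        "retrieve_and_rerank_query_time": num_queries * (1 + top_k),
    }


costs = count_forward_passes(num_docs=169_597, num_queries=1, top_k=32)
print(costs)
```

The indexing cost is paid once, so each additional query needs a few dozen passes instead of one pass per passage in the corpus.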

# Implementation of Retrieve & Re-Rank

Let's use the smaller Simple English Wikipedia (it fits better in RAM) as the document collection for answering user questions and search queries. First, the Wikipedia articles are split into paragraphs (this notebook keeps only the first paragraph of each article) and encoded with a bi-encoder (a semantic model). When a new query or question is entered, it is encoded by the same bi-encoder and the paragraphs with the highest cosine similarity are retrieved (see [Semantic Search](https://www.kaggle.com/code/aisuko/semantic-search)). Next, the retrieved candidates are scored by a Cross-Encoder re-ranker, and the passages with the highest Cross-Encoder scores are presented to the user.

**Note: We also compare the results with lexical (keyword) search, using the BM25 algorithm as implemented in the `rank_bm25` package. This comparison is optional; its code can be removed without affecting the rest of the pipeline.**

In [1]:
%%capture
!pip install sentence-transformers==2.3.1
!pip install rank_bm25==0.2.2
!pip install datasets==2.15.0
!pip install tqdm==4.66.1

In [2]:
import os

os.environ['DATASET_NAME']='simplewiki-2020-11-01.jsonl.gz'
os.environ['DATASET_URL']='http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz'

os.environ['MODEL_NAME']='multi-qa-MiniLM-L6-cos-v1'
os.environ['CROSS_CODE_NAME']='cross-encoder/ms-marco-MiniLM-L-6-v2'

In [3]:
import json
import gzip

from sentence_transformers.util import http_get

http_get(os.getenv('DATASET_URL'), os.getenv('DATASET_NAME'))

passages=[]
with gzip.open(os.getenv('DATASET_NAME'), 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data=json.loads(line.strip())
        # add all paragraphs
#         passages.extend(data['paragraphs'])

        # only add the first paragraph
        passages.append(data['paragraphs'][0])

#         for paragraph in data['paragraphs']:
#             # We encode the passages as [title, text]
#             passages.append([data['title'], paragraph])

len(passages)

169597
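
The loading loop above expects one JSON object per line, each with at least a `paragraphs` list (and a `title`, as the commented-out variant shows). The sketch below reproduces that parsing on two made-up records written to an in-memory gzip buffer, so it runs without downloading anything. The sample records and the `load_passages` helper are hypothetical.

```python
import gzip
import io
import json

# Made-up records that mimic the 'title' / 'paragraphs' fields used above.
sample_records = [
    {"title": "Canberra", "paragraphs": ["Canberra is the capital city of Australia.", "It is in the ACT."]},
    {"title": "Cat", "paragraphs": ["Cats are small mammals."]},
]

# Write the records as gzip-compressed JSON Lines into an in-memory buffer.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf8") as f_out:
    for record in sample_records:
        f_out.write(json.dumps(record) + "\n")
buffer.seek(0)


def load_passages(file_obj, first_only=True):
    """Read JSON Lines and collect the first paragraph, or all paragraphs, per article."""
    passages = []
    with gzip.open(file_obj, "rt", encoding="utf8") as f_in:
        for line in f_in:
            paragraphs = json.loads(line.strip())["paragraphs"]
            if not paragraphs:
                continue  # skip articles without text
            if first_only:
                passages.append(paragraphs[0])
            else:
                passages.extend(paragraphs)
    return passages


first_paragraphs = load_passages(buffer)
print(first_paragraphs)
```

Keeping only the first paragraph gives one passage per article, which is why the passage count and the number of precomputed embedding rows can be compared directly.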

## Loading embeddings

We load the precomputed embeddings of the Simple English Wikipedia passages. See [simple_english_wikipedia_p0](https://huggingface.co/datasets/aisuko/simple_english_wikipedia_p0).

In [4]:
from datasets import load_dataset

# The dataset is 1.48 GB, so loading it may take a little longer than the other resources.
embedding_dataset=load_dataset('aisuko/simple_english_wikipedia_p0')

In [5]:
import torch

corpus_embeddings=torch.from_numpy(embedding_dataset['train'].to_pandas().to_numpy()).to(torch.float)
len(corpus_embeddings)

169597

## Retriever: bi-encoder

For semantic search, we use the model `multi-qa-MiniLM-L6-cos-v1` and retrieve 32 passages that potentially answer the input query.

In [6]:
from sentence_transformers import SentenceTransformer

bi_encoder=SentenceTransformer(os.getenv('MODEL_NAME'))
bi_encoder.max_seq_length=256
bi_encoder.to('cuda')
bi_encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
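
The printed model has three stages: a transformer that produces one vector per token, mean pooling over those token vectors, and a normalization step. The sketch below illustrates the last two stages in plain Python on made-up two-dimensional token vectors. It is a simplification: the function names are hypothetical, and padding tokens, which a real pooling layer has to ignore, are not handled.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    # Average the token vectors component by component.
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]


def l2_normalize(vec: List[float]) -> List[float]:
    # Scale the vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Made-up token vectors for two short texts.
text_a_tokens = [[1.0, 2.0], [3.0, 4.0]]
text_b_tokens = [[2.0, 2.0], [2.0, 4.0]]

emb_a = l2_normalize(mean_pool(text_a_tokens))
emb_b = l2_normalize(mean_pool(text_b_tokens))

# For unit-length vectors, the dot product is the cosine similarity.
print(dot(emb_a, emb_b))
```

Because the embeddings are normalized, comparing them with a dot product or with cosine similarity gives the same ranking.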

## Re-Rank: Cross-Encoder

We use a more powerful CrossEncoder, `cross-encoder/ms-marco-MiniLM-L-6-v2`, that scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance, especially when we search over a corpus on which the bi-encoder was not trained.

Here we use it to re-rank the result list and improve its quality.

In [7]:
from sentence_transformers import CrossEncoder

cross_encoder=CrossEncoder(os.getenv('CROSS_CODE_NAME'))
cross_encoder

<sentence_transformers.cross_encoder.CrossEncoder.CrossEncoder at 0x7c9c4576a470>

## Lexical (keyword) search

We also compare the results with lexical search (keyword search). Here, we use the BM25 algorithm as implemented in the `rank_bm25` package.
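
To show what BM25 scoring does, here is a compact textbook-style sketch in plain Python. It is not the `rank_bm25` implementation: the IDF variant (with `1 +` inside the logarithm, so it never goes negative) and the parameter values `k1=1.5`, `b=0.75` are assumptions chosen for illustration, and the function expects a non-empty tokenized corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    tokenized_corpus: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every document against the query with a textbook Okapi BM25 formula."""
    num_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / num_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


toy_corpus = [
    ["canberra", "capital", "australia"],
    ["adelaide", "capital", "south", "australia"],
    ["cats", "live", "years"],
]
toy_scores = bm25_scores(["capital", "australia"], toy_corpus)
print(toy_scores)
```

Both Australian passages contain the two query terms, so only length normalization separates them. This purely lexical signal is why BM25 can rank a passage about Adelaide above one about Canberra for a question about Australia's capital.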

In [8]:
import string
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
from tqdm.autonotebook import tqdm
import numpy as np

# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc=[]
    for token in text.lower().split():
        token=token.strip(string.punctuation)
        
        if len(token) >0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

tokenized_corpus=[]
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))
    
bm25=BM25Okapi(tokenized_corpus)

Next, we define the search function, which searches the Wikipedia passages for ones that answer the query.

In [9]:
from sentence_transformers.util import semantic_search


def search(query, top_k=32):
    print('Input question', query)
    
    #### BM25 search (lexical search) ####
    bm25_scores=bm25.get_scores(bm25_tokenizer(query))
    top_n=np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits=[{'corpus_id':idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits=sorted(bm25_hits, key=lambda x:x['score'], reverse=True)
    
    print('Top-3 lexical search (BM25) hits')
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))
        
    #### Semantic Search ####
    # encode the query using the bi-encoder and find potentially relevant passages
    question_embedding=bi_encoder.encode(query, convert_to_tensor=True).to('cuda')
    hits=semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits=hits[0] # get the hits from the first query
    
    #### Re-Ranking ####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp=[[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores=cross_encoder.predict(cross_inp)
    
    # attach the cross-encoder score to each retrieved hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score']=cross_scores[idx]
    
    # output of the top-3 hits listed under the bi-encoder heading
    # (hits are sorted by cross-score here, so this block prints the same ranking as the re-ranker block)
    print('\n--------------------------\n')
    print('Top-3 bi-encoder Retrieval hits')
    hits=sorted(hits, key=lambda x:x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print('\t{:.3f}\t{}'.format(hit['cross-score'], passages[hit['corpus_id']].replace('\n',' ')))
    
    # output of the top-3 hits from the re-ranker
    print('\n--------------------------\n')
    print('Top-3 Cross-Encoder Re-ranker hits')
    hits=sorted(hits, key=lambda x:x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print('\t{:.3f}\t{}'.format(hit['cross-score'], passages[hit['corpus_id']].replace('\n',' ')))
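
In the `search` function above, the hits are sorted by `cross-score` before both the bi-encoder block and the re-ranker block are printed, so the two blocks show the same ranking and the same scores. To compare the two stages, the bi-encoder block could instead be ordered by the `score` field that `semantic_search` attaches to each hit. The sketch below shows the idea on made-up hits; `rank_by` is a hypothetical helper and the numbers are invented.

```python
from typing import Dict, List


def rank_by(hits: List[Dict], key: str, top_n: int = 3) -> List[Dict]:
    # Return the top_n hits ordered by the given score field, highest first.
    return sorted(hits, key=lambda hit: hit[key], reverse=True)[:top_n]


# Invented hits: 'score' stands for the bi-encoder similarity,
# 'cross-score' for the cross-encoder score attached afterwards.
hits = [
    {"corpus_id": 0, "score": 0.71, "cross-score": 2.5},
    {"corpus_id": 1, "score": 0.65, "cross-score": 8.3},
    {"corpus_id": 2, "score": 0.60, "cross-score": -1.2},
]

bi_encoder_order = [hit["corpus_id"] for hit in rank_by(hits, "score")]
re_ranker_order = [hit["corpus_id"] for hit in rank_by(hits, "cross-score")]

print("bi-encoder order:", bi_encoder_order)
print("re-ranker order: ", re_ranker_order)
```

With separate sort keys, the two lists differ whenever the cross-encoder disagrees with the bi-encoder, which is the effect of re-ranking that the comparison is meant to reveal.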

In [10]:
search(query='What is the capital of the Australia?')

Input question What is the capital of the Australia?
Top-3 lexical search (BM25) hits
	13.172	South Australia is one of the six states of Australia. The capital city is Adelaide.
	11.948	Western Australia is one of the eight states and territories of Australia. It is the biggest state in Australia measured by amount of land. It has a population of about two million people. Its capital city is Perth.
	11.621	Adelaide is a city in Australia. It is the capital city of the state of South Australia, and it has an approximate population of 1.2 million people.



--------------------------

Top-3 bi-encoder Retrieval hits
	8.267	Canberra is the capital city of Australia. There are 403,468 people who live there. It doesn’t belong to a state but it is in the Australian Capital Territory (ACT).
	7.242	Australia, formally the Commonwealth of Australia, is a country and sovereign state in the southern hemisphere, located in Oceania. Its capital city is Canberra, and its largest city is Sydney.
	7.238	The Australian Capital Territory or "ACT" is a small territory in Australia. It was created as the home for the Australian capital city, Canberra, because of fighting between New South Wales and Victoria over where the capital city should be. The Jervis Bay Territory was created at the same time so that the ACT would have a harbour without relying on one of the states.

--------------------------

Top-3 Cross-Encoder Re-ranker hits
	8.267	Canberra is the capital city of Australia. There are 403,468 people who live there. It doesn’t belong to a state bu

In [11]:
search(query='What is the best orchestra in the world?')

Input question What is the best orchestra in the world?
Top-3 lexical search (BM25) hits
	15.328	The BBC Symphony Orchestra is the main orchestra of the British Broadcasting Corporation. It is one of the best orchestras in Britain.
	15.320	The NHK Symphony Orchestra is a Japanese orchestra based in Tokyo, Japan. In Japanese it is written: NHK交響楽団, pronounced: Enueichikei Kōkyō Gakudan. When the orchestra was started in 1926 it was called "New Symphony Orchestra". It was the first large professional orchestra in Japan. Later, it changed its name to "Japan Symphony Orchestra". In 1951 it started to get money from the Japanese radio station NHK (Nippon Hōsō Kyōkai), so it changed its name again to the name it has now. It is thought of as the best orchestra in Japan. They have played in many parts of the world, including at the BBC Proms in London.
	14.079	The Bamberger Symphoniker (Bamberg Symphony Orchestra) is a world-famous orchestra from the city of Bamberg, Germany. It was formed in 


--------------------------

Top-3 bi-encoder Retrieval hits
	5.952	The London Symphony Orchestra (LSO) is one of the most famous orchestras of the world. They are based in London's Barbican Centre, but they often tour to lots of different countries.
	5.794	The Vienna Philharmonic (in German: die Wiener Philharmoniker) is an orchestra based in Vienna, Austria. It is thought of as one of the greatest orchestras in the world.
	5.324	The Berlin Philharmonic (in German: Die Berliner Philharmoniker), is an orchestra from Berlin, Germany. It is one of the greatest orchestras in the world. The conductor of the orchestra is Sir Simon Rattle.

--------------------------

Top-3 Cross-Encoder Re-ranker hits
	5.952	The London Symphony Orchestra (LSO) is one of the most famous orchestras of the world. They are based in London's Barbican Centre, but they often tour to lots of different countries.
	5.794	The Vienna Philharmonic (in German: die Wiener Philharmoniker) is an orchestra based in Vienna, A

In [12]:
search(query='Number areas of Australia')

Input question Number areas of Australia
Top-3 lexical search (BM25) hits
	13.402	The Tasmanian Wilderness is a term that is used for a range of areas in Tasmania, Australia.
	12.114	Bangladesh consists of a number of administrative areas called divisions ("bibhag"), each named after its respective capital.
	11.962	Parks Australia is an agency that manages national parks and conservation areas in Australia. The parks are governed by the Director of National Parks, which is a division of the federal government's Department of Sustainability, Environment, Water, Populations and Communities. Parks Australia looks after parks in the federal territories; those in the six states are managed by their own state-government agencies. Parks Australia provides some support to each state in managing their parklands. It also supports indigenous landholders to establish Indigenous Protected Areas.



--------------------------

Top-3 bi-encoder Retrieval hits
	0.879	The Australian House of Representatives is elected from 151 single-member areas called Divisions. They are also commonly known as electorates or seats.
	0.024	The outback is the remote areas of Australia. The outback is not a defined area, it is only a term used to refer to locations that are far away from big cities.
	-0.081	The continent of Australia consists of the land masses which sit on Australia's continental plate. This includes mainland Australia, Tasmania, and the island of New Guinea. It is Situated in the geographical region of Oceania.

--------------------------

Top-3 Cross-Encoder Re-ranker hits
	0.879	The Australian House of Representatives is elected from 151 single-member areas called Divisions. They are also commonly known as electorates or seats.
	0.024	The outback is the remote areas of Australia. The outback is not a defined area, it is only a term used to refer to locations that are far away fro

In [13]:
search(query='When did the cold war end?')

Input question When did the cold war end?
Top-3 lexical search (BM25) hits
	17.374	The Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the "Cold" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.
	17.291	The Reagan Doctrine was a document by the United States under the Reagan Administration. It was about being against the global influence of the Soviet Union during the final years of the Cold War. The doctrine lasted for less than a decade, it was the most important document of United States foreign policy from the early 1980s until the end of the Cold War in 1991.
	15.420	Cold Norton is a village and civil parish in Maldon District, Essex, England. In 2001 there were 1103 people living in Cold No


--------------------------

Top-3 bi-encoder Retrieval hits
	5.203	The Cold War was the tense relationship between the United States (and its allies), and the Soviet Union (the USSR and its allies) between the end of World War II and the fall of the Soviet Union. It is called the "Cold" War because the US and the USSR never actually fought each other directly. Instead, they opposed each other in conflicts known as proxy wars, where each country chose a side to support.
	2.531	A speech was made by American President Harry S. Truman to the U.S. Congress on 12 March 1947. In this speech he said he thought that The United States should help Greece and Turkey to stop them being 'Totalitarianists' although he meant Soviet Communism. This became known as the Truman Doctrine. Some Historians believe that this was the start of the Cold War.
	2.079	The Sino-Soviet split (1960–1989) was a time when the relations between the People's Republic of China and the Soviet Union weakened during the Cold

In [14]:
search(query='How long do cats live?')

Input question How long do cats live?
Top-3 lexical search (BM25) hits
	22.997	Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.
	16.974	The sabertoothed cats or sabretooth cats are some of the best known and most popular extinct animals. They are among the most impressive carnivores that ever have lived. These cats had long canines and jaws which opened wider than modern cats. This suggests a different style of killing from modern felines.
	16.490	The Cyprus cat is a breed of cat. These cats are thought to have first come from ancient Egypt or Palestine. They were brought to the island of Cyprus by St. Helen. These are now common domestic cats that live in homes or outside. Many of these cats still live all over Cypru


--------------------------

Top-3 bi-encoder Retrieval hits
	10.431	Reliable information on the lifespans of house cats is hard to find. However, research has been done to get an estimate (an educated guess) on how long cats usually live. Cats usually live for 13 to 20 years. Sometimes cats can live for 22 to 30 years but there are claims of cats dying at ages of more than 30 years old.
	2.998	The sand cat ("Felis margarita") is a small wild cat in the Felinae subfamily. It is distributed over African and Asian deserts. Sometimes people call it "desert cat," but that is really the name of a different animal. The sand cat does live in deserts, even the Sahara and Arabian Desert. It is also found in Iran and Pakistan. In zoos, this cat can live for up to 13 years.
	2.348	Bobcat ("Lynx rufus") are fierce cats that live in forests, swamps, mountains, prairie, and deserts in much of North America. Bobcats are generally nocturnal (most active at night), but are most active at dawn and dusk.

In [15]:
search(query='How many people live in Melbourne?')

Input question How many people live in Melbourne?
Top-3 lexical search (BM25) hits
	17.634	Melbourne is a town in south Derbyshire, England. 6,500 people live in Melbourne. It is eight miles south of Derby and two miles from the River Trent.
	16.498	Footscray is a suburb of Melbourne. It is 5 kilometres from Melbourne's centre. About 13,000 people live in Footscray.
	15.058	Ballarat is a city in central Victoria, Australia. Nearly 90,200 people live there, which makes it the third biggest city in Victoria, after Melbourne and Geelong. It is also the biggest city that is not on the coast in Victoria. It is about north-west of Melbourne. The city area covers about .



--------------------------

Top-3 bi-encoder Retrieval hits
	9.152	Melbourne is a town in south Derbyshire, England. 6,500 people live in Melbourne. It is eight miles south of Derby and two miles from the River Trent.
	7.577	Melbourne is the second largest city in Australia. It is the capital of Victoria, which is a state in the south-east of Australia. The population of Melbourne is 4.9 million.
	4.817	It has about 110,000 people living there.

--------------------------

Top-3 Cross-Encoder Re-ranker hits
	9.152	Melbourne is a town in south Derbyshire, England. 6,500 people live in Melbourne. It is eight miles south of Derby and two miles from the River Trent.
	7.577	Melbourne is the second largest city in Australia. It is the capital of Victoria, which is a state in the south-east of Australia. The population of Melbourne is 4.9 million.
	4.817	It has about 110,000 people living there.


In [16]:
search(query='Which US president was killed?')

Input question Which US president was killed?
Top-3 lexical search (BM25) hits
	10.179	Lyndon Baines Johnson (August 27, 1908 – January 22, 1973) was a member of the Democratic Party and the 36th president of the United States serving from 1963 to 1969. Johnson took over as president when President Kennedy was killed in November 1963. He was then re-elected in the 1964 election.
	10.091	Lech Kaczyński, the fourth President of the Republic of Poland, died on 10 April 2010. He died in a plane crash outside of Smolensk, Russia. The plane was a Tu-154 belonging to the Polish Air Force. The crash killed all 96 on board. His wife, Maria Kaczyńska, was also among those killed.
	9.791	Jacobo Majluta Azar (October 9, 1934 – March 2, 1996) was a Dominican politician. He was Vice President of the Dominican Republic during the Antonio Guzmán Fernández presidency between 1978 to 1982. He became President of the Dominican Republic after Guzmán Fernández killed himself in 1982. He was president for a


--------------------------

Top-3 bi-encoder Retrieval hits
	9.278	William McKinley, the 25th President of the United States, was assassinated on September 6, 1901, inside the Temple of Music on the grounds of the Pan-American Exposition in Buffalo, New York.
	8.403	John F. Kennedy was the 35th President of the United States. He was assassinated (murdered) in Dealey Plaza, Dallas, Texas, on Friday, November 22, 1963. This happened while he was traveling in a Presidential motorcade with his wife Jacqueline, the Governor of Texas John Connally, and the governor's wife Nellie.
	7.914	On December 26, 2006, Gerald Ford, the 38th President of the United States, died at his home in Rancho Mirage, California at 6:45 p.m. local time (02:45, December 27, UTC).

--------------------------

Top-3 Cross-Encoder Re-ranker hits
	9.278	William McKinley, the 25th President of the United States, was assassinated on September 6, 1901, inside the Temple of Music on the grounds of the Pan-American Exposit

In [17]:
search(query='When is LunarNewYear')

Input question When is LunarNewYear
Top-3 lexical search (BM25) hits
	9.340	Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.
	0.000	Tardets-Sorholus is a commune of the Pyrénées-Atlantiques "département" in the southwestern part of France.
	0.000	Taron-Sadirac-Viellenave is a commune of the Pyrénées-Atlantiques "département" in the southwestern part of France.



--------------------------

Top-3 bi-encoder Retrieval hits
	7.094	Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.
	-1.367	The Lunar Society was an important club in the Midlands of 18th century England. It was a dinner club, and a learned society. Its members were industrialists and inventors, natural philosophers (scientists), and other intellectuals. They met regularly in Birmingham and elsewhere from 1765 to 1813.
	-3.161	A lunar eclipse is an astronomical phenomenon. It happens when the moon passes through the shadow of the Earth which can only occur during a full moon. Luna