<a href="https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_msmarco_passage_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pyserini Demo on the MS MARCO Passage Dataset

This notebook allows you to replicate the BM25 baseline for the [MS MARCO passage ranking task](http://www.msmarco.org/) with [Pyserini](https://github.com/castorini/anserini/blob/master/docs/pyserini.md), the Python interface to [Anserini](http://anserini.io).


## Installation


Install the Python dependencies and point `JAVA_HOME` at the Java 11 JDK:

In [0]:
%%capture
!pip install pyjnius==1.2
!pip install -i https://test.pypi.org/simple/ pyserini==0.6.1.post0

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

Work around a known issue with pyjnius by symlinking `libjvm.so` into the legacy `jre/` directory layout (see [this explanation](https://github.com/castorini/pyserini/blob/master/README.md#known-issues) for more details):


In [0]:
%%capture
!mkdir -p /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/
!ln -s /usr/lib/jvm/java-1.11.0-openjdk-amd64/lib/server/libjvm.so /usr/lib/jvm/java-1.11.0-openjdk-amd64/jre/lib/amd64/server/libjvm.so

Let's grab the pre-built index:

In [0]:
%%capture
!wget https://www.dropbox.com/s/zo15pi6iilgjjew/index-msmarco-passage-20191117-0ed488.tar.gz
!tar xvfz index-msmarco-passage-20191117-0ed488.tar.gz

Sanity check of index size:

In [0]:
!du -h index-msmarco-passage-20191117-0ed488

2.5G	index-msmarco-passage-20191117-0ed488
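
If you would rather run the same sanity check from Python, the sketch below sums file sizes under the index directory. The helper names `dir_size_bytes` and `human_readable` are illustrative and not part of Pyserini, and the result is the apparent file size, which can differ slightly from the disk usage that `du` reports.

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def human_readable(num_bytes):
    """Format a byte count with binary units, roughly like `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return "{:.1f}{}".format(size, unit)
        size /= 1024

# Directory name taken from the cells above; nothing is printed if it is missing.
index_dir = "index-msmarco-passage-20191117-0ed488"
if os.path.isdir(index_dir):
    print(human_readable(dir_size_bytes(index_dir)), index_dir)
```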


## Usage

You can use `pysearch` to search over an index. Start by loading the topics (queries) of the dev subset and listing them:

In [0]:
from pyserini.search import pysearch

topics = pysearch.get_topics('msmarco_passage_dev_subset')
for topicid in topics:
    query = topics[topicid]['title']
    print('{} {}'.format(topicid, query))

print('{} queries total'.format(len(topics)))


2 Androgen receptor define
1215 3 levels of government in canada and their responsibilities
1288 3/5 of 60
1576 60x40 slab cost
2235 Bethel University was founded in what year
2798 Does Suddenlink Carry ESPN3
2962 Explain what a bone scan is and what it is used for.
4696 Is the Louisiana sales tax 4.75
4947 Ludacris Net Worth
5925 Sony PS-LX300USB how to connect to pc
6217 The hormone that does the opposite of calcitonin is
6791 What Does Noel Mean in the Bible
7968 When did the earthquake hit San Francisco during the World Series
8701 _____ is the ability of cardiac pacemaker cells to spontaneously initiate an electrical impulse without being stimulated from another source, such as a nerve.
8714 _____ is the name used to refer to the era of legalized segregation in the united states
8798 _______ is a fuel produced by fermenting crops.
8854 ________ disparity refers to the slightly different view of the world that each eye receives.cyclopeanbinocularmonoculartrichromatic
9083 _________

Now search the index with one of the dev-subset queries:

In [0]:
from pyserini.search import pysearch

topics = pysearch.get_topics('msmarco_passage_dev_subset')
searcher = pysearch.SimpleSearcher('index-msmarco-passage-20191117-0ed488')
hits = searcher.search(topics[1102400]['title'])

# Print up to the first 10 hits (fewer if the query returns fewer)
for i in range(min(10, len(hits))):
    print('{} {} {}'.format(i+1, hits[i].score, hits[i].content))

1 17.335800170898438 Why do Bears hibernate? March 31, 2010, Joan, Leave a comment. Why do bears hibernate? When we hear the word "hibernate" we always associate it with bears. That is because, while a lot of animals go through hibernation during the winter season such as squirrels, rodents and bats, the bear is the most famous when it comes to hibernating. What comes first to our mind is why do bears hibernate? First of all, let's get to know the meaning of hibernation.
2 13.230899810791016 Why do bears hibernate? Watch this to discover how much effort is spent on survival during winter in the world of the Big Sky Bears. Subscribe to BBC Earth: http://bit.ly/BBCEarthSub
3 13.135700225830078 Technically, as the other anwerer said, bears do not hibernate, but there isn't a good term for what they do. Some use 'winter lethergy' or winter torpor' or some other phrase to distinguish it, but hibernation is commonly used by the public.ll kinds of bears technically don't hibernate. They

Each entry in `hits` holds the `docid`, the retrieval score, and the document content. Here the content of the top hit is rendered as HTML:

In [0]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].content + '</div>'))
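
To make the shape of a hit concrete without needing the index, here is a minimal sketch that uses a hypothetical `Hit` stand-in with the three fields used in this notebook (`docid`, `score`, `content`). The `summarize` helper and the toy values are invented for illustration and are not part of Pyserini.

```python
from collections import namedtuple

# Hypothetical stand-in for one search hit; field names mirror the notebook's usage.
Hit = namedtuple("Hit", ["docid", "score", "content"])

def summarize(hits, k=10, width=60):
    """Return one 'rank docid score snippet' line for each of the top-k hits."""
    lines = []
    # Slicing keeps this safe when fewer than k hits are available.
    for rank, hit in enumerate(hits[:k], start=1):
        snippet = hit.content[:width]
        lines.append("{} {} {:.4f} {}".format(rank, hit.docid, hit.score, snippet))
    return lines

# Toy values, invented for illustration only.
toy_hits = [
    Hit("doc-a", 17.3358, "Why do bears hibernate?"),
    Hit("doc-b", 13.2309, "Watch this to discover more."),
]
for line in summarize(toy_hits):
    print(line)
```

The same function should work on real hits as long as they expose those three attributes.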

Let's run all the queries from the dev subset, retrieving up to 1000 hits per query and writing them to a run file in TREC format:

In [0]:
from pyserini.search import pysearch

def do_run(file, topics, searcher):
    with open(file, 'w') as runfile:
        cnt = 0
        print('Running {} queries in total'.format(len(topics)))
        for topicid in topics:
            query = topics[topicid]['title']
            # Retrieve up to 1000 hits for this query
            hits = searcher.search(query, 1000)
            # One line per hit: topicid Q0 docid rank score run-tag
            for i in range(0, len(hits)):
                _ = runfile.write('{} Q0 {} {} {:.6f} Anserini\n'.format(topicid, hits[i].docid, i+1, hits[i].score))
            cnt += 1
            if cnt % 100 == 0:
                print('{} queries completed'.format(cnt))

searcher = pysearch.SimpleSearcher('index-msmarco-passage-20191117-0ed488')
topics = pysearch.get_topics('msmarco_passage_dev_subset')

do_run('run-msmarco-passage-bm25.txt', topics, searcher)


Running 6980 queries in total
100 queries completed
200 queries completed
300 queries completed
400 queries completed
500 queries completed
600 queries completed
700 queries completed
800 queries completed
900 queries completed
1000 queries completed
1100 queries completed
1200 queries completed
1300 queries completed
1400 queries completed
1500 queries completed
1600 queries completed
1700 queries completed
1800 queries completed
1900 queries completed
2000 queries completed
2100 queries completed
2200 queries completed
2300 queries completed
2400 queries completed
2500 queries completed
2600 queries completed
2700 queries completed
2800 queries completed
2900 queries completed
3000 queries completed
3100 queries completed
3200 queries completed
3300 queries completed
3400 queries completed
3500 queries completed
3600 queries completed
3700 queries completed
3800 queries completed
3900 queries completed
4000 queries completed
4100 queries completed
4200 queries completed
4300 queries completed
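
Each line that `do_run` writes has six whitespace-separated columns: topic id, the literal `Q0`, docid, rank, score, and a run tag. The sketch below formats and parses one such line. The helper names and the toy docid are assumptions made for illustration.

```python
def format_run_line(topicid, docid, rank, score, tag="Anserini"):
    """Build one run-file line in the same layout that do_run writes."""
    return "{} Q0 {} {} {:.6f} {}".format(topicid, docid, rank, score, tag)

def parse_run_line(line):
    """Split one run-file line back into its six fields."""
    topicid, _q0, docid, rank, score, tag = line.split()
    return {
        "topicid": topicid,
        "docid": docid,
        "rank": int(rank),
        "score": float(score),
        "tag": tag,
    }

# Toy docid, invented for illustration only.
line = format_run_line(1102400, "12345", 1, 17.3358)
print(line)
print(parse_run_line(line))
```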

Download the `jtreceval` evaluation tool and the relevance judgments (qrels) for the dev subset:

In [0]:
%%capture
!wget -O jtreceval-0.0.5-jar-with-dependencies.jar "https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar"
!wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt

Evaluate the run file against the qrels:

In [0]:
!java -jar jtreceval-0.0.5-jar-with-dependencies.jar qrels.msmarco-passage.dev-subset.txt run-msmarco-passage-bm25.txt

runid                 	all	Anserini
num_q                 	all	6980
num_ret               	all	6974598
num_rel               	all	7437
num_rel_ret           	all	6309
map                   	all	0.1926
gm_map                	all	0.0168
Rprec                 	all	0.1048
bpref                 	all	0.8526
recip_rank            	all	0.1960
iprec_at_recall_0.00  	all	0.1964
iprec_at_recall_0.10  	all	0.1964
iprec_at_recall_0.20  	all	0.1964
iprec_at_recall_0.30  	all	0.1964
iprec_at_recall_0.40  	all	0.1952
iprec_at_recall_0.50  	all	0.1952
iprec_at_recall_0.60  	all	0.1898
iprec_at_recall_0.70  	all	0.1898
iprec_at_recall_0.80  	all	0.1893
iprec_at_recall_0.90  	all	0.1893
iprec_at_recall_1.00  	all	0.1893
P_5                   	all	0.0591
P_10                  	all	0.0394
P_15                  	all	0.0301
P_20                  	all	0.0246
P_30                  	all	0.0182
P_100                 	all	0.0069
P_200                 	all	0.0038
P_500                 	all	0.0017
P_1000           
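
For intuition about the `recip_rank` row, the following is a minimal sketch of mean reciprocal rank on toy data. It is not a replacement for `jtreceval`: the function names and the toy run and qrels are invented for illustration, and the sketch averages over every topic in the qrels, which may not match the evaluation tool's handling of topics that return no hits.

```python
def reciprocal_rank(ranked_docids, relevant, cutoff=None):
    """Return 1/rank of the first relevant docid, or 0.0 if none is found."""
    if cutoff is not None:
        ranked_docids = ranked_docids[:cutoff]
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(run, qrels, cutoff=None):
    """Average the reciprocal rank over all topics that have judgments."""
    if not qrels:
        return 0.0
    total = 0.0
    for topicid, relevant in qrels.items():
        total += reciprocal_rank(run.get(topicid, []), relevant, cutoff)
    return total / len(qrels)

# Toy data, invented for illustration only.
toy_qrels = {"q1": {"d2"}, "q2": {"d9"}}
toy_run = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}

# q1 finds its relevant passage at rank 2 (0.5); q2 finds none (0.0).
print(mean_reciprocal_rank(toy_run, toy_qrels))
print(mean_reciprocal_rank(toy_run, toy_qrels, cutoff=1))
```

Passing `cutoff=10` gives the MRR@10 variant commonly reported for MS MARCO passage ranking.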