# Indexing exercise

In this exercise, we are going to index the [MS MARCO](http://www.msmarco.org/) passage collection and explore some features of the index.

We use the [Anserini](https://github.com/castorini/anserini) toolkit and its Python interface, [Pyserini](https://github.com/castorini/pyserini), to run our experiments.

*This notebook is based on the Anserini/Pyserini tutorials. You can learn more by checking their repositories and tutorials.*

## 1. Set up the environment



Install Pyserini via PyPI:

In [None]:
!pip install pyserini

Clone the Anserini repository from GitHub and check out a fixed commit:

In [None]:
!git clone https://github.com/castorini/anserini.git
!cd anserini && git checkout ad5ba1c76196436f8a0e28efdb69960d4873efe3

## 2. Get the collection and prepare the files
MS MARCO (MicroSoft MAchine Reading COmprehension) is a large-scale dataset that defines many tasks from question answering to ranking. Here we focus on the collection designed for passage re-ranking.

In [None]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P data/msmarco_passage/

In [None]:
!ls data/msmarco_passage/ 
!tar xvfz data/msmarco_passage/collection.tar.gz -C data/msmarco_passage

The original MS MARCO collection is a tab-separated values (TSV) file. We need to convert the collection into the jsonl format that Anserini can process. A jsonl file contains one JSON object per line.

The following command generates 9 jsonl files in your `data/msmarco_passage/collection_jsonl` directory, each with 1M lines (except for the last one, which should have 841,823 lines).

In [None]:
!cd anserini && python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
 --collection_path ../data/msmarco_passage/collection.tsv --output_folder ../data/msmarco_passage/collection_jsonl
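
To make the conversion step more concrete, here is a minimal sketch of what a TSV-to-jsonl conversion looks like. It is not the Anserini script itself: the helper name `tsv_to_jsonl_lines` and the two sample rows are made up for illustration, and the sketch assumes each row has the form `<id><TAB><passage text>`.

```python
import io
import json


def tsv_to_jsonl_lines(tsv_file):
    """Yield one JSON string per TSV row of the form '<id>\t<passage text>'."""
    for line in tsv_file:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split only on the first tab so tabs inside the passage text survive.
        doc_id, text = line.split("\t", 1)
        yield json.dumps({"id": doc_id, "contents": text})


# Two made-up rows standing in for collection.tsv (hypothetical data).
sample_tsv = io.StringIO("0\tFirst toy passage.\n1\tSecond toy passage.\n")
jsonl_lines = list(tsv_to_jsonl_lines(sample_tsv))
for row in jsonl_lines:
    print(row)
```

Each printed line is a standalone JSON object with the keys `id` and `contents`, which is the shape the indexer expects from a jsonl collection.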


**Check the data!**

Each line of a jsonl file is a JSON object with the keys `id` and `contents`:

In [None]:
!wc -l data/msmarco_passage/collection_jsonl/*

In [None]:
!head -5 data/msmarco_passage/collection_jsonl/docs00.json

Remove the original files to make room for the index. 
Check the contents of `data/msmarco_passage` before and after.

In [None]:
!ls data/msmarco_passage
!rm data/msmarco_passage/*.tsv
!ls data/msmarco_passage
!rm -rf sample_data

## 3. Generate the index
Some common indexing options with Anserini:

* input: Path to collection
* threads: Number of threads to run
* collection: Type of Anserini collection, e.g., JsonCollection
* generator: Type of document generator, e.g., DefaultLuceneDocumentGenerator, TweetGenerator (subclass of LuceneDocumentGenerator for TREC Microblog)
* index: Path to index output
* storePositions: Boolean flag to store positions
* storeDocvectors: Boolean flag to store document vectors
* storeRaw: Boolean flag to store raw document text
* keepStopwords: Boolean flag to keep stopwords (False by default)
* stemmer: Stemmer to use (Porter by default)

We now have everything in place to index the collection. The indexing speed may vary; the process takes about 10 minutes in Google Colab.




In [None]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 9 \
-input data/msmarco_passage/collection_jsonl -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

Check the size of the index at the specified destination:

In [None]:
!ls indexes
!du -h indexes/lucene-index-msmarco-passage

## 4. Explore the index

We can now explore the index using the `IndexReader` class of Pyserini.

Read the [Usage of the Index Reader API](https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md) guide for more information on accessing and manipulating an inverted index.

In [None]:
from pyserini.index import IndexReader

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')

Compute the collection and document frequencies of a term:

In [None]:
term = 'played'

# Look up the term's document frequency (df) and collection frequency (cf).
# Note that we pass the unanalyzed form; it is analyzed before the lookup.
df, cf = index_reader.get_term_counts(term)

# Show the analyzed form that the counts refer to.
analyzed_form = index_reader.analyze(term)
print(f'Term "{term}" (analyzed form "{analyzed_form[0]}"): df={df}, cf={cf}')
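
To see how the two counts differ, the following sketch computes them by hand on a tiny, made-up corpus. It does not use Pyserini; `toy_docs` and `term_counts` are hypothetical names introduced only for this illustration, and the documents are assumed to be tokenized already.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
toy_docs = {
    "d1": ["she", "played", "and", "played"],
    "d2": ["he", "played", "chess"],
    "d3": ["no", "games", "today"],
}


def term_counts(docs, term):
    """Return (df, cf): number of documents containing the term,
    and total number of occurrences across the whole collection."""
    doc_freq = 0
    coll_freq = 0
    for tokens in docs.values():
        tf = Counter(tokens)[term]  # term frequency in this document
        if tf > 0:
            doc_freq += 1
        coll_freq += tf
    return doc_freq, coll_freq


toy_df, toy_cf = term_counts(toy_docs, "played")
print(f"df={toy_df}, cf={toy_cf}")
```

Here the term appears in two documents but three times overall, so the document frequency is lower than the collection frequency.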

Get basic statistics of the index.

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), unique_terms will show -1 (nope, that's not a bug).

In [None]:
index_reader.stats()

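Get the postings list of a term and traverse it. Each posting holds a document ID, the term frequency in that document, and the term's positions (available because the index was built with `-storePositions`).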





In [None]:
term = "played"

postings_list = index_reader.get_postings_list(term)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
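
For intuition about what a postings list holds, here is a minimal sketch that builds a positional inverted index over a made-up corpus in plain Python. It is a simplification under stated assumptions, not Lucene's internal format: `toy_token_lists` and `build_positional_index` are hypothetical names, and document IDs are simply list positions.

```python
from collections import defaultdict

# Toy tokenized documents; the list index serves as the docid (assumption).
toy_token_lists = [
    ["she", "played", "and", "played"],
    ["he", "played", "chess"],
]


def build_positional_index(docs):
    """Map each term to a list of (docid, tf, positions), ordered by docid."""
    index = defaultdict(list)
    for docid, tokens in enumerate(docs):
        # Collect the positions of every term within this document.
        positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            positions[token].append(pos)
        # Append one posting per distinct term in the document.
        for term, pos_list in positions.items():
            index[term].append((docid, len(pos_list), pos_list))
    return index


toy_index = build_positional_index(toy_token_lists)
for docid, tf, positions in toy_index["played"]:
    print(f"docid={docid}, tf={tf}, pos={positions}")
```

The loop mirrors the traversal above: one entry per document that contains the term, with the term frequency equal to the number of stored positions.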