# Indexing Exercise

This exercise has two parts: 

- In part 1, we index the [MS MARCO](http://www.msmarco.org/) passage collection with the Pyserini toolkit and explore some features of the index. For this part, you only need to run the code and understand it. You will use the index and code snippets in the next assignment.

- In part 2, we write code that generates an inverted index and use it to index part of the MS MARCO collection. Before starting this part, run sections 1.1 and 1.2 of part 1 to set up the environment and prepare the data.




## PART 1: Generate the index via Pyserini

We use the [Anserini](https://github.com/castorini/anserini) toolkit and its Python interface, [Pyserini](https://github.com/castorini/pyserini), to run our experiments.

*This part is based on the Anserini/Pyserini tutorials. You can learn more by checking their repositories and tutorials.*

### 1.1 Set up the environment

Install Pyserini:

In [1]:
!pip install pyserini

Collecting pyserini
  Obtaining dependency information for pyserini from https://files.pythonhosted.org/packages/bc/e8/adeba14e4c8d2e2cb26089337fb94c40116ffd73db19f24cc9fd13c1d47c/pyserini-0.22.0-py3-none-any.whl.metadata
  Downloading pyserini-0.22.0-py3-none-any.whl.metadata (4.5 kB)
Collecting Cython>=0.29.21 (from pyserini)
  Obtaining dependency information for Cython>=0.29.21 from https://files.pythonhosted.org/packages/e8/1a/26113a7a220b360a13f1a060deb1461bf55d433673dc79e523b6648ccc2d/Cython-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading Cython-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.1 kB)
Collecting pyjnius>=1.4.0 (from pyserini)
  Obtaining dependency information for pyjnius>=1.4.0 from https://files.pythonhosted.org/packages/62/c9/6ae043600ddeae09376fe60c09d1b39265991f811bdab58a9218c3aab6ae/pyjnius-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pyjnius-1.5.0-

Clone the Anserini repository from GitHub:

In [2]:
!git clone https://github.com/castorini/anserini.git
!cd anserini && git checkout ad5ba1c76196436f8a0e28efdb69960d4873efe3

Cloning into 'anserini'...
remote: Enumerating objects: 29354, done.
remote: Counting objects: 100% (3672/3672), done.
remote: Compressing objects: 100% (959/959), done.
remote: Total 29354 (delta 3146), reused 3059 (delta 2634), pack-reused 25682
Receiving objects: 100% (29354/29354), 85.32 MiB | 19.19 MiB/s, done.
Resolving deltas: 100% (19795/19795), done.
Note: switching to 'ad5ba1c76196436f8a0e28efdb69960d4873efe3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at ad5ba1c7 Release notes for v0.9

### 1.2 Get the collection and prepare the files
MS MARCO (MicroSoft MAchine Reading COmprehension) is a large-scale dataset that defines many tasks from question answering to ranking. Here we focus on the collection designed for passage re-ranking.

In [3]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P data/msmarco_passage/

--2023-09-08 13:48:10--  https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1035009698 (987M) [application/octet-stream]
Saving to: ‘data/msmarco_passage/collection.tar.gz’


2023-09-08 13:49:26 (13.0 MB/s) - ‘data/msmarco_passage/collection.tar.gz’ saved [1035009698/1035009698]



In [4]:
!ls data/msmarco_passage/ 
!tar xvfz data/msmarco_passage/collection.tar.gz -C data/msmarco_passage

collection.tar.gz
collection.tsv


The original MS MARCO collection is a tab-separated values (TSV) file. We need to convert the collection into the jsonl format, which Anserini can process. A jsonl file contains one JSON object per line.
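
To make the conversion concrete, here is a minimal sketch of the idea. It is not the Anserini conversion script: the function names, the `max_docs_per_file` parameter, and the sample row are illustrative assumptions. Each TSV row (`id<TAB>text`) becomes one JSON object with the keys `id` and `contents`, and the output is split across files of a fixed maximum size.

```python
import json
import math

def tsv_line_to_json(line):
    """Turn one 'id<TAB>passage' row into a JSON string with keys id and contents."""
    doc_id, text = line.rstrip("\n").split("\t", 1)
    return json.dumps({"id": doc_id, "contents": text})

def shard_sizes(num_docs, max_docs_per_file):
    """Number of lines each output file receives when rows are written sequentially."""
    if num_docs <= 0:
        return []
    num_files = math.ceil(num_docs / max_docs_per_file)
    sizes = [max_docs_per_file] * (num_files - 1)
    # The last file holds whatever is left over.
    sizes.append(num_docs - max_docs_per_file * (num_files - 1))
    return sizes

# Made-up row, used only to show the output shape.
print(tsv_line_to_json("42\tAn example passage.\n"))

# 8,841,823 passages at 1M per file: 9 files, the last one with 841,823 lines.
sizes = shard_sizes(8_841_823, 1_000_000)
print(len(sizes), sizes[-1])
```

Keeping `id` as a string mirrors the sample lines printed later in this section, where the IDs appear in quotes.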

This command generates 9 jsonl files in your data/msmarco_passage/collection_jsonl directory, each with 1M lines (except for the last one, which should have 841,823 lines).

In [5]:
!cd anserini && python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
 --collection_path ../data/msmarco_passage/collection.tsv --output_folder ../data/msmarco_passage/collection_jsonl


Converting collection...
Converted 0 docs in 1 files
Converted 100000 docs in 1 files
Converted 200000 docs in 1 files
Converted 300000 docs in 1 files
Converted 400000 docs in 1 files
Converted 500000 docs in 1 files
Converted 600000 docs in 1 files
Converted 700000 docs in 1 files
Converted 800000 docs in 1 files
Converted 900000 docs in 1 files
Converted 1000000 docs in 2 files
Converted 1100000 docs in 2 files
Converted 1200000 docs in 2 files
Converted 1300000 docs in 2 files
Converted 1400000 docs in 2 files
Converted 1500000 docs in 2 files
Converted 1600000 docs in 2 files
Converted 1700000 docs in 2 files
Converted 1800000 docs in 2 files
Converted 1900000 docs in 2 files
Converted 2000000 docs in 3 files
Converted 2100000 docs in 3 files
Converted 2200000 docs in 3 files
Converted 2300000 docs in 3 files
Converted 2400000 docs in 3 files
Converted 2500000 docs in 3 files
Converted 2600000 docs in 3 files
Converted 2700000 docs in 3 files
Converted 2800000 docs in 3 files
Conv

**Check the data!**

Each line of a jsonl file is a JSON object with the keys `id` and `contents`:

In [6]:
!wc -l data/msmarco_passage/collection_jsonl/*

   1000000 data/msmarco_passage/collection_jsonl/docs00.json
   1000000 data/msmarco_passage/collection_jsonl/docs01.json
   1000000 data/msmarco_passage/collection_jsonl/docs02.json
   1000000 data/msmarco_passage/collection_jsonl/docs03.json
   1000000 data/msmarco_passage/collection_jsonl/docs04.json
   1000000 data/msmarco_passage/collection_jsonl/docs05.json
   1000000 data/msmarco_passage/collection_jsonl/docs06.json
   1000000 data/msmarco_passage/collection_jsonl/docs07.json
    841823 data/msmarco_passage/collection_jsonl/docs08.json
   8841823 total


In [7]:
!head -5 data/msmarco_passage/collection_jsonl/docs00.json

{"id": "0", "contents": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."}
{"id": "1", "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science."}
{"id": "2", "contents": "Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade."}
{"id": "3", "contents": "The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to t

Remove the original files to make room for the index. 
Check the contents of `data/msmarco_passage` before and after.

In [8]:
!ls data/msmarco_passage
!rm data/msmarco_passage/*.tsv
!ls data/msmarco_passage
!rm -rf sample_data

collection.tar.gz  collection.tsv  collection_jsonl
collection.tar.gz  collection_jsonl


### 1.3 Generate the index using Pyserini


Here are some common indexing options with Pyserini (for more options, check Pyserini documentation):

```
* input: Path to collection
* threads: Number of threads to run
* collection: Type of Anserini Collection, e.g., JsonCollection
* generator: Type of Anserini document generator, e.g., DefaultLuceneDocumentGenerator, TweetGenerator (subclass of LuceneDocumentGenerator for TREC Microblog)
* index: Path to index output
* storePositions: Boolean flag to store positions
* storeDocvectors: Boolean flag to store document vectors
* storeRaw: Boolean flag to store raw document text
* keepStopwords: Boolean flag to keep stopwords (False by default)
* stemmer: Stemmer to use (Porter by default)
```

We now have everything in place to index the collection. **Indexing speed varies; the process may take about 10 minutes (or more) in Google Colab.**




In [9]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 9 \
-input data/msmarco_passage/collection_jsonl -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

pyserini.index is deprecated, please use pyserini.index.lucene.
2023-09-08 13:57:01,779 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-09-08 13:57:01,780 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-09-08 13:57:01,781 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: data/msmarco_passage/collection_jsonl
2023-09-08 13:57:01,781 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-09-08 13:57:01,781 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-09-08 13:57:01,782 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 9
2023-09-08 13:57:01,782 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: en
2023-09-08 13:57:01,782 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-09-0

Check the size of the index at the specified destination:

In [10]:
!ls indexes
!du -h indexes/lucene-index-msmarco-passage

lucene-index-msmarco-passage
4.1G	indexes/lucene-index-msmarco-passage


### 1.4 Explore Pyserini index

We can now explore the index using the `IndexReader` class of Pyserini.

Read the [Usage of the Index Reader API](https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md) guide for more information on accessing and manipulating an inverted index.

In [11]:
from pyserini.index import IndexReader

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')

Compute the collection and document frequencies of a term:

In [12]:
term = 'played'

# Look up its document frequency (df) and collection frequency (cf).
# Note, we use the unanalyzed form:
df, cf = index_reader.get_term_counts(term)

analyzed_form = index_reader.analyze(term)
print(f'Analyzed form of term "{analyzed_form[0]}": df={df}, cf={cf}')

Analyzed form of term "plai": df=155044, cf=200696
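
The difference between the two counts is easy to see on a toy collection. The sketch below uses plain Python rather than Pyserini, and the three tokenized documents and the `term_counts` helper are made up for illustration. Document frequency counts the documents that contain the term, while collection frequency counts every occurrence, so cf is never smaller than df.

```python
def term_counts(term, docs):
    """Return (df, cf) for `term` over a list of tokenized documents."""
    df = sum(1 for tokens in docs if term in tokens)  # documents containing the term
    cf = sum(tokens.count(term) for tokens in docs)   # total number of occurrences
    return df, cf

# Hypothetical, already analyzed documents.
toy_docs = [
    ["plai", "game", "plai"],
    ["game", "score"],
    ["plai", "music"],
]

df, cf = term_counts("plai", toy_docs)
print(f"df={df}, cf={cf}")  # df=2, cf=3
```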


Get basic index statistics of the index.

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1 (think about what the reason could be).

In [13]:
index_reader.stats()

{'total_terms': 352316036,
 'documents': 8841823,
 'non_empty_documents': 8841823,
 'unique_terms': -1}

## PART 2: Generate the index yourself

### 2.1 Processing the text

We need to process the text, which includes tokenization, stopword removal, and lowercasing.

In [14]:
STOPWORDS = ['a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in', 'into', 'is', 'it', 'no', 'not', 'of', 'on', 'or', 'such', 'that', 'the', 'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'with']

def process(text):
    terms = []
    # Remove special characters
    chars = ['\'', '.', ':', ',', '!', '?', '(', ')']
    for ch in chars:
        if ch in text:
            text = text.replace(ch, ' ')
    
    # Lowercasing and stopword removal
    for term in text.split():
        term = term.lower()
        if term not in STOPWORDS:
            terms.append(term)
    return terms
    

### 2.2 Complete the code for Inverted Index

Implement the InvertedIndex class. 

Write the index to a file in which each line holds the posting list of one term, in the format `term docID1:freq1 docID2:freq2 ...`, e.g.,

```
term1 1:1 4:2 5:1
term2 2:1 
term3 1:3 3:3 9:2
...
```
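
To sanity-check a written file, it can help to read a line back into memory. The helper below is a hypothetical sketch, not part of the required `InvertedIndex` interface. It assumes that terms contain no whitespace, that document IDs and frequencies are integers, and that the line is not empty.

```python
def parse_posting_line(line):
    """Split 'term docID:freq docID:freq ...' into (term, [(doc_id, freq), ...])."""
    term, *entries = line.split()
    postings = []
    for entry in entries:
        doc_id, freq = entry.rsplit(":", 1)
        postings.append((int(doc_id), int(freq)))
    return term, postings

# Trailing whitespace and the newline are ignored by str.split().
print(parse_posting_line("term1 1:1 4:2 5:1 \n"))
# ('term1', [(1, 1), (4, 2), (5, 1)])
```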



In [17]:
class InvertedIndex(object):
    def __init__(self):
        self.index = {}

    def add_posting(self, term:str, doc_id:int, count:int):
        """Adds a posting (term and Document ID) to the index."""
        # =======Your code=======
        if term not in self.index:
            self.index[term] = []
        self.index[term].append((doc_id, count))
        # =======================

    def get_posting(self,term:str):
        """Returns the posting list of the term from the index."""
        # =======Your code=======
        if term in self.index:
            return self.index[term]
        else:
            return None
        # =======================
        
    def get_dictionary(self):
        """Returns the dictionary of the index (unique terms in the index)."""
        # =======Your code=======
        return list(self.index.keys())   
        # =======================
    
    def write_to_file(self, filename_index:str):
        """Writes the index to a textfile."""
        # =======Your code=======
        with open(filename_index, 'w') as f:
            for term, postings in self.index.items():
                f.write(f"{term} ")
                for posting in postings:
                    f.write(f"{posting[0]}:{posting[1]} ")
                f.write('\n')
        # =======================

Run this to test your code. If everything is correct, you should not get errors here. 

In [18]:
index = InvertedIndex()
index.add_posting("t1", 1, 2)
index.add_posting("t1", 2, 1)
index.add_posting("t2", 2, 3)
assert len(index.get_dictionary()) == 2
assert len(index.get_posting("t1")) == 2
assert index.get_posting("t3") is None
index.write_to_file("data/msmarco_passage/collection_jsonl/text_index.txt")

### 2.3 Index part of the MS MARCO collection

Complete the code to process the text and create the index. 
Note that we only index the `docs00.json` file, and that creating the index takes a few minutes.

In [19]:
import collections
import json

ind = InvertedIndex()
file = "data/msmarco_passage/collection_jsonl/docs00.json"
index_file = "data/msmarco_passage/collection_jsonl/tiny_index.txt"

def index(jsonl_file):
    with open(jsonl_file, 'r') as f:
        for line in f:
            doc = json.loads(line)
            # =======Your code=======
            text = doc['contents']
            doc_id = int(doc['id'])  # IDs are stored as strings in the jsonl files
            terms = process(text)

            for term, count in collections.Counter(terms).items():
                ind.add_posting(term, doc_id, count)            
            # =======================
            
index(file)
ind.write_to_file(index_file)
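
The per-document term frequencies in the cell above come from `collections.Counter`, which maps each distinct term to the number of times it occurs in the document. A small illustration on a made-up token list:

```python
import collections

# Hypothetical output of process() for a single document.
tokens = ["manhattan", "project", "atomic", "manhattan"]

counts = collections.Counter(tokens)
for term, count in counts.items():
    # One posting per distinct term, with its within-document frequency.
    print(term, count)
```

Because each distinct term yields exactly one `(doc_id, count)` posting per document, a posting list never contains the same document twice.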


Run this to test your code. 

In [20]:
with open(index_file, 'r') as fp:
    assert len(fp.readlines()) == 698784

assert len(ind.get_posting("pressingly")) == 3
assert len(ind.get_posting("veada")) == 2

## Handing in

Hand in both the result file and the filled-in notebook:

- The result file should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_tiny_index.txt
- The notebook should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_indexing.ipynb