# Retrieval exercise

In this exercise, you will implement the query likelihood model with Jelinek-Mercer smoothing. This assignment builds on the previous assignment and you will need to reuse some code from the Indexing notebook.


## 1. Build the index
Download the MS MARCO passage collection and build an index using [Pyserini](https://github.com/castorini/pyserini).


In [1]:
pip install pyserini

Collecting pyserini
  Downloading pyserini-0.13.0-py3-none-any.whl (72.8 MB)
Collecting pyjnius>=1.2.1
  Downloading pyjnius-1.4.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting sentencepiece>=0.1.95
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting transformers>=4.6.0
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
Collecting nmslib
  Downloading nmslib-2.1.1-cp37-cp37m-manylinux2010_x86_64.whl (13.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

In [2]:
!git clone https://github.com/castorini/anserini.git
!cd anserini && git checkout ad5ba1c76196436f8a0e28efdb69960d4873efe3

Cloning into 'anserini'...
remote: Enumerating objects: 17780, done.
remote: Counting objects: 100% (2251/2251), done.
remote: Compressing objects: 100% (1082/1082), done.
remote: Total 17780 (delta 1338), reused 1795 (delta 1012), pack-reused 15529
Receiving objects: 100% (17780/17780), 31.31 MiB | 18.17 MiB/s, done.
Resolving deltas: 100% (10256/10256), done.
Note: checking out 'ad5ba1c76196436f8a0e28efdb69960d4873efe3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at ad5ba1c7 Release notes for v0.9.2 (#1197)


In [3]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P data/msmarco_passage/

--2021-11-04 12:45:10--  https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1035009698 (987M) [application/octet-stream]
Saving to: ‘data/msmarco_passage/collection.tar.gz’


2021-11-04 12:45:49 (25.1 MB/s) - ‘data/msmarco_passage/collection.tar.gz’ saved [1035009698/1035009698]



In [4]:
!ls data/msmarco_passage/ 
!tar xvfz data/msmarco_passage/collection.tar.gz -C data/msmarco_passage

collection.tar.gz
collection.tsv


In [5]:
!cd anserini && python ./src/main/python/msmarco/convert_collection_to_jsonl.py \
 --collection_path ../data/msmarco_passage/collection.tsv --output_folder ../data/msmarco_passage/collection_jsonl


Converting collection...
Converted 0 docs in 1 files
Converted 100000 docs in 1 files
Converted 200000 docs in 1 files
Converted 300000 docs in 1 files
Converted 400000 docs in 1 files
Converted 500000 docs in 1 files
Converted 600000 docs in 1 files
Converted 700000 docs in 1 files
Converted 800000 docs in 1 files
Converted 900000 docs in 1 files
Converted 1000000 docs in 2 files
Converted 1100000 docs in 2 files
Converted 1200000 docs in 2 files
Converted 1300000 docs in 2 files
Converted 1400000 docs in 2 files
Converted 1500000 docs in 2 files
Converted 1600000 docs in 2 files
Converted 1700000 docs in 2 files
Converted 1800000 docs in 2 files
Converted 1900000 docs in 2 files
Converted 2000000 docs in 3 files
Converted 2100000 docs in 3 files
Converted 2200000 docs in 3 files
Converted 2300000 docs in 3 files
Converted 2400000 docs in 3 files
Converted 2500000 docs in 3 files
Converted 2600000 docs in 3 files
Converted 2700000 docs in 3 files
Converted 2800000 docs in 3 files
Conv

In [6]:
!ls data/msmarco_passage
!rm data/msmarco_passage/*.tsv
!ls data/msmarco_passage
!rm -rf sample_data

collection_jsonl  collection.tar.gz  collection.tsv
collection_jsonl  collection.tar.gz


In [7]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 9 \
-input data/msmarco_passage/collection_jsonl -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

2021-11-04 12:48:02,065 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Setting log level to INFO
2021-11-04 12:48:02,072 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Starting indexer...
2021-11-04 12:48:02,073 INFO  [main] index.IndexCollection (IndexCollection.java:648) - DocumentCollection path: data/msmarco_passage/collection_jsonl
2021-11-04 12:48:02,074 INFO  [main] index.IndexCollection (IndexCollection.java:649) - CollectionClass: JsonCollection
2021-11-04 12:48:02,074 INFO  [main] index.IndexCollection (IndexCollection.java:650) - Generator: DefaultLuceneDocumentGenerator
2021-11-04 12:48:02,074 INFO  [main] index.IndexCollection (IndexCollection.java:651) - Threads: 9
2021-11-04 12:48:02,075 INFO  [main] index.IndexCollection (IndexCollection.java:652) - Stemmer: porter
2021-11-04 12:48:02,075 INFO  [main] index.IndexCollection (IndexCollection.java:653) - Keep stopwords? false
2021-11-04 12:48:02,076 INFO  [main] index.IndexCollection (Inde

In [8]:
from pyserini.index import IndexReader

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')

## 2. Download and read the query file
You will rank MS MARCO passages for this set of queries.

In [9]:
!wget http://gem.cs.ru.nl/IR-Course-2021-2022/queries.txt
    
queries = dict()
with open("queries.txt", "r") as f:
    for line in f:
        cols = line.split("\t")
        queries[cols[0].strip()] = cols[1].strip()

# queries

--2021-11-04 13:02:56--  http://gem.cs.ru.nl/IR-Course-2021-2022/queries.txt
Resolving gem.cs.ru.nl (gem.cs.ru.nl)... 131.174.31.31
Connecting to gem.cs.ru.nl (gem.cs.ru.nl)|131.174.31.31|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2275 (2.2K) [text/plain]
Saving to: ‘queries.txt’


2021-11-04 13:02:57 (189 MB/s) - ‘queries.txt’ saved [2275/2275]



## 3. Implement the retrieval model
You will implement a language model with Jelinek-Mercer (JM) smoothing:
$$\mathrm{score}(q,d) = \sum_{t \in q} \log \left((1-\lambda) \frac{c(t, d)}{|d|} + \lambda \frac{c(t, C)}{|C|}\right),$$
where $c(t, d)$ and $c(t, C)$ are the frequencies of term $t$ in document $d$ and in the collection $C$, respectively, and $|d|$ and $|C|$ are the document and collection lengths, counted in terms.

**Notes about your implementation:**
- Skip a term if it does not exist in the whole collection. This avoids $\log(0)$.
- Make sure to use the right form of a query (analyzed vs. not analyzed).
- Use the natural logarithm.
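The formula and the notes above can be tried out on a tiny in-memory collection before touching the real index. The following is a minimal sketch, not the notebook's implementation: the documents, the name `jm_score`, and the assumption that all terms are already analyzed are made up for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical data): each document is a list of already-analyzed terms.
docs = {
    "d1": ["record", "player", "record", "vinyl"],
    "d2": ["world", "record", "sprint"],
}

coll_counts = Counter(t for terms in docs.values() for t in terms)  # c(t, C)
coll_len = sum(coll_counts.values())                                # |C|

def jm_score(query_terms, doc_id, lam=0.1):
    """Sum of natural-log JM-smoothed term probabilities for one document."""
    doc_counts = Counter(docs[doc_id])  # c(t, d)
    doc_len = len(docs[doc_id])         # |d|
    score = 0.0
    for t in query_terms:
        if coll_counts[t] == 0:
            continue  # term not in the collection: skip it to avoid log(0)
        p = (1 - lam) * doc_counts[t] / doc_len + lam * coll_counts[t] / coll_len
        score += math.log(p)
    return score

query = ["vinyl", "record", "unseen"]
print(jm_score(query, "d1"))
print(jm_score(query, "d2"))
```

In this toy setup `d1` scores higher than `d2` because it contains both query terms, while `d2` relies on the collection estimate for "vinyl"; the term "unseen" contributes nothing to either score.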


### 3.1. Obtain collection length
In this code, the global variable `len_C` denotes collection length.

In [10]:
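# Total number of term occurrences in the collection; used as |C| in the JM formula.
# (A `global` statement is not needed at module level.)
len_C = index_reader.stats()['total_terms']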

print("Collection length =", len_C)

Collection length = 352316036


### 3.2. Obtain document length

In [11]:
def len_doc(d):
    len_d = sum(index_reader.get_document_vector(d).values())
    return len_d

doc = "2674124" # this is an example document
print("Length of document \""+doc+"\":", len_doc(doc))   

Length of document "2674124": 31


### 3.3. Obtain collection frequency of a term

In [35]:
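def coll_freq(term):
    # get_term_counts returns (document frequency, collection frequency).
    # By default it runs `term` through the default analyzer; pass analyzer=None
    # if the term has already been analyzed.
    _df, cf = index_reader.get_term_counts(term)
    return cf

term = "record"
print("Frequency of term \""+term+"\" in the collection:", coll_freq(term)) 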

Frequency of term "record" in the collection: 226439


### 3.4. Obtain term frequency


In [36]:
def term_freq(t, d):
    # The document vector maps each analyzed term to its frequency in the document.
    dv = index_reader.get_document_vector(d)
    return dv.get(t, 0)

term = "record"
doc = "2674124"
print("Frequency of term \""+term+"\" in document \""+doc+"\":", term_freq(term, doc)) 

Frequency of term "record" in document "2674124": 2


### 3.5. Compute JM-smoothed probability for a single term

In [37]:
import math

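def prob_t_Md(t, d, lambd):
    cf = coll_freq(t)
    if cf <= 0:
        # Term does not occur in the collection; callers must skip it before taking the log.
        return 0
    # Interpolate the document estimate with the collection (background) estimate.
    p_t_Md = ((1 - lambd) * term_freq(t, d) / len_doc(d) + 
             lambd * cf / len_C)
    return p_t_Md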

term = "record"
doc = "2674124" 
print("p(t|Md) =", prob_t_Md(term, doc, 0.1))


p(t|Md) = 0.05812878768549357
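The printed value can be checked by hand from the numbers reported in the earlier cells: term frequency 2, document length 31, collection frequency 226439, and collection length 352316036. The short sketch below only redoes that arithmetic; the variable names are chosen for illustration.

```python
import math

# Values printed by the cells above for term "record" and document "2674124".
tf, doc_len = 2, 31
cf, coll_len = 226439, 352316036
lambd = 0.1

p_doc = tf / doc_len    # maximum-likelihood estimate from the document
p_coll = cf / coll_len  # background estimate from the collection
p = (1 - lambd) * p_doc + lambd * p_coll

print(p)            # should agree with the value printed above, up to rounding
print(math.log(p))  # this term's contribution to the query score
```

With $\lambda = 0.1$ the document estimate dominates: almost all of the probability mass comes from the $2/31$ term, and the collection estimate only adds a small correction.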


### 3.6. Compute JM-smoothed log-probability for a query

In [38]:
import math

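def score_doc(q, d, lambd):
    # Sum of log-probabilities of the query terms under the document model.
    p_q_Md = 0
    for term in q:
        p = prob_t_Md(term, d, lambd)
        if p <= 0:
            continue  # term is absent from the collection: skip it to avoid log(0)
        p_q_Md += math.log(p)
    return p_q_Md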

query = index_reader.analyze(queries["23849"])
doc = "2674124" 
print("p(q|Md) =", score_doc(query, doc, 0.1))            

p(q|Md) = -12.763186444333824
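The score is negative because it is a sum of logarithms of probabilities rather than the product of the probabilities themselves. Since the logarithm is monotonic, both give the same ranking, but the sum is numerically safer. The sketch below illustrates this with made-up probabilities for an unrealistically long query; for the short queries in this exercise underflow is unlikely to matter in practice.

```python
import math

probs = [0.01] * 400  # hypothetical per-term probabilities for a very long query

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)  # stays a finite negative number

print(product, log_sum)
```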


## 4. Rank documents for the given queries
Ranking is done in two steps:
1. First-pass retrieval: Use a fast ranker (i.e., Anserini SimpleSearcher) to rank all documents for a given query.
2. Second-pass retrieval: Re-rank the top-100 documents from the first pass using your retrieval model. Restricting the more expensive scoring to 100 candidates keeps the ranking process efficient.
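The two steps above can be sketched independently of Pyserini. In this minimal sketch, `rerank`, `toy_first_pass`, and `toy_scorer` are hypothetical names: the first-pass function stands in for the fast ranker and the scorer stands in for the JM model. Note that the candidates are explicitly sorted by the second-pass score.

```python
def rerank(query, first_pass, scorer, k=100):
    """Fetch k candidates with first_pass, then order them by scorer (highest first)."""
    candidates = first_pass(query, k)
    scored = [(doc_id, scorer(query, doc_id)) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-ins for the fast ranker and the JM scorer.
toy_scores = {"a": -12.5, "b": -10.1, "c": -17.7}

def toy_first_pass(query, k):
    return ["a", "b", "c"][:k]

def toy_scorer(query, doc_id):
    return toy_scores[doc_id]

ranking = rerank("some query", toy_first_pass, toy_scorer, k=100)
print(ranking)
```

Passing the two stages in as functions keeps the re-ranking logic testable without an index; in the notebook the corresponding pieces would be the searcher and `score_doc`.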

**Notes:**
- `SimpleSearcher.search` returns only 10 hits by default; change this default to obtain the top-100 documents.
- Set the value of lambda to 0.1.
- Store your final ranking results in the `results` variable. Every item in the `results` list is a list containing the query ID, the document ID, and the score. This is an example of what the content of `results` should look like:

`[['23849', '4348282', -10.65],
 ['23849', '7119957', -12.63],
 ['23849', '', -17.687729001682484], 
 ...]`

In [39]:
from pyserini.search import SimpleSearcher

results = []
searcher = SimpleSearcher("indexes/lucene-index-msmarco-passage")
lambd = 0.1

for qID, q in queries.items():
    analyzed_q = index_reader.analyze(q)  # analyze once per query, not once per hit
    for hit in searcher.search(q, 100):   # first pass: top-100 candidates
        results.append([qID, hit.docid, score_doc(analyzed_q, str(hit.docid), lambd)])
results

[['1030303', '8726436', -13.36961560248476],
 ['1030303', '8726435', -15.502391047225595],
 ['1030303', '8726429', -15.604977448038678],
 ['1030303', '8726437', -15.851837894324593],
 ['1030303', '7156982', -15.91338110569287],
 ['1030303', '8726433', -16.0310618938083],
 ['1030303', '8726434', -16.08740354177648],
 ['1030303', '8726430', -16.395704265639907],
 ['1030303', '1305521', -27.64988446875305],
 ['1030303', '1305520', -27.67929834596989],
 ['1030303', '6222298', -27.945001427813594],
 ['1030303', '7284047', -24.083997913808915],
 ['1030303', '5285789', -23.558285653614412],
 ['1030303', '3302257', -23.341453320667405],
 ['1030303', '1305528', -27.664699550543464],
 ['1030303', '8230659', -24.129391038838868],
 ['1030303', '336399', -28.03015920419607],
 ['1030303', '309441', -27.377017546002076],
 ['1030303', '5824363', -24.53991606373053],
 ['1030303', '7508058', -24.7620958889916],
 ['1030303', '3302249', -24.818348396489505],
 ['1030303', '8726432', -22.98438075592842],
 [

In [40]:
# Sanity check: sum of all scores (no NumPy round trip through strings needed).
sum(float(score) for _, _, score in results)

-160109.87527309445

Write your results into a file.
Submit this file together with the completed notebook.

In [19]:
# check duplicates
check = set()
for res in results:
    if (res[0], res[1]) in check:
        raise Exception("Error: Duplicate query-doc is found", res[0], res[1])
    check.add((res[0], res[1]))

# write results to a file in TREC run format: qid, Q0, docid, rank, score, tag
output_str = "\n".join(f"{qid}\tQ0\t{docid}\t0\t{score}\tlm_jm" for qid, docid, score in results)
with open("lm_jm.run", "w") as f:  # the context manager closes the file
    n_chars = f.write(output_str)
n_chars

246929
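The cell above leaves the rank column at 0 for every line. If explicit ranks are wanted, one possible variant is sketched below: it groups the rows by query, sorts each group by score, and numbers the hits from 1. The helper name `to_trec_run` is hypothetical, and the sample rows reuse the example values shown earlier.

```python
from collections import defaultdict

def to_trec_run(results, tag="lm_jm"):
    """Format [qid, docid, score] rows as TREC run lines, ranked by score per query."""
    by_query = defaultdict(list)
    for qid, docid, score in results:
        by_query[qid].append((docid, score))
    lines = []
    for qid, hits in by_query.items():
        hits.sort(key=lambda h: h[1], reverse=True)  # best score first
        for rank, (docid, score) in enumerate(hits, start=1):
            lines.append(f"{qid}\tQ0\t{docid}\t{rank}\t{score}\t{tag}")
    return "\n".join(lines)

sample = [["23849", "7119957", -12.63], ["23849", "4348282", -10.65]]
print(to_trec_run(sample))
```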

## Handing in
Hand in both the result file and the filled-in notebook:
- The result file should be named `STUDENTNUMBER_FIRSTNAME_LASTNAME_lm_jm.run`
- The notebook should be named `STUDENTNUMBER_FIRSTNAME_LASTNAME_retrieval.ipynb`
