# Retrieval exercise

In this exercise, you will implement the query likelihood model with Jelinek-Mercer smoothing. This assignment builds on the previous assignment and you will need to reuse some code from the Indexing notebook.


## 1. Build the index
Download the MS MARCO passage collection and build an index using [Pyserini](https://github.com/castorini/pyserini).


In [1]:
# =======Your code=======

# =======================


## 2. Download and read the query file
You will rank MS MARCO passages for this set of queries.

In [12]:
!wget http://gem.cs.ru.nl/IR-Course-2021-2022/queries.txt
    
queries = dict()
with open("queries.txt", "r") as f:
    for line in f:
        cols = line.split("\t")
        queries[cols[0].strip()] = cols[1].strip()

# queries

--2021-09-15 22:10:45--  http://gem.cs.ru.nl/IR-Course-2021-2022/queries.txt
Resolving gem.cs.ru.nl (gem.cs.ru.nl)... 131.174.31.31
Connecting to gem.cs.ru.nl (gem.cs.ru.nl)|131.174.31.31|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2275 (2,2K) [text/plain]
Saving to: ‘queries.txt.2’


2021-09-15 22:10:45 (2,73 MB/s) - ‘queries.txt.2’ saved [2275/2275]



## 3. Implement the retrieval model
You will implement a language model with Jelinek-Mercer (JM) smoothing:
$$\text{score}(q,d) = \sum_{t \in q} \log \left((1-\lambda) \frac{c(t, d)}{|d|} + \lambda \frac{c(t, C)}{|C|}\right),$$
where $c(t, d)$ and $c(t, C)$ are the frequencies of term $t$ in document $d$ and in the collection $C$, respectively, and $|d|$ and $|C|$ are the document length and the collection length.

**Notes about your implementation:**
- Skip a term if it does not occur anywhere in the collection. This avoids $\log(0)$.
- Make sure to use the right form of a query (analyzed vs. not analyzed).
- Use the natural logarithm.
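
As a rough illustration of the formula and the notes above, here is a minimal sketch on a tiny in-memory collection. The documents, the names `docs`, `jm_prob`, and `jm_score`, and the assumption that all tokens are already analyzed are hypothetical; the sketch does not use the Pyserini index that the exercise asks for.

```python
import math
from collections import Counter

# Toy, already-analyzed collection (hypothetical data, not MS MARCO).
docs = {
    "d1": ["record", "high", "jump", "record"],
    "d2": ["world", "record", "holder"],
    "d3": ["jump", "rope"],
}

# c(t, d) for every document, c(t, C) for the collection, and |C|.
doc_tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
coll_tf = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(coll_tf.values())


def jm_prob(term, doc_id, lambd):
    """(1 - lambda) * c(t, d) / |d| + lambda * c(t, C) / |C|."""
    doc_len = sum(doc_tf[doc_id].values())
    p_doc = doc_tf[doc_id][term] / doc_len
    p_coll = coll_tf[term] / coll_len
    return (1 - lambd) * p_doc + lambd * p_coll


def jm_score(query_terms, doc_id, lambd):
    """Sum of natural-log probabilities over the query terms."""
    score = 0.0
    for term in query_terms:
        if coll_tf[term] == 0:
            continue  # term absent from the whole collection: skip, avoids log(0)
        score += math.log(jm_prob(term, doc_id, lambd))
    return score
```

Note that a term missing from a document but present in the collection still receives a non-zero probability through the collection component, which is the point of the smoothing.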


### 3.1. Obtain collection length
In this code, the global variable `len_C` denotes collection length.

In [3]:
global len_C
# =======Your code=======

# =======================

print("Collection length =", len_C)

### 3.2. Obtain document length

In [4]:
def len_doc(d):
    # =======Your code=======

    # =======================
    return len_d

doc = "2674124" # this is an example document
print("Length of document \""+doc+"\":", len_doc(doc))   

### 3.3. Obtain collection frequency of a term

In [5]:
def coll_freq(t):
    # =======Your code=======

    # =======================
    return cf
term = "record"
print("Frequency of term \""+term+"\" in the collection:", coll_freq(term)) 

### 3.4. Obtain term frequency


In [6]:
def term_freq(t, d):
    # =======Your code=======

    # =======================
    return tf

term = "record"
doc = "2674124"
print("Frequency of term \""+term+"\" in document \""+doc+"\":", term_freq(term, doc)) 

### 3.5. Compute JM-smoothed probability for a single term

In [7]:
def prob_t_Md(t, d, lambd):
    # =======Your code=======

    # =======================
    return p_t_Md

term = "record"
doc = "2674124" 
print("p(t|Md) =", prob_t_Md(term, doc, 0.1))


### 3.6. Compute JM-smoothed probability for a query

In [8]:
import math

def score_doc(q, d, lambd):
    # =======Your code=======

    # =======================
    return p_q_Md

query = index_reader.analyze(queries["23849"])
doc = "2674124" 
print("p(q|Md) =", score_doc(query, doc, 0.1))            

## 4. Rank documents for the given queries
Ranking is done in two steps:
1. First pass retrieval: Use a fast ranker (i.e., Anserini SimpleSearcher) to rank all documents for a given query.
2. Second pass retrieval: Re-rank the top-100 documents from the first pass using your retrieval model. Re-scoring only these candidates keeps the ranking process efficient.

**Notes:**
- You need to change the default values of the SimpleSearcher functions to obtain the top-100 documents.
- Set the value of lambda to 0.1.
- Store your final ranking results in the `results` variable. Every item in the `results` list is a list containing queryID, documentID, and score. This is an example of what the content of `results` should look like:

`[['23849', '4348282', -10.65],
 ['23849', '7119957', -12.63],
 ['23849', '', -17.687729001682484], 
 ...]`
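
The two-step procedure can be sketched independently of any search library. In the sketch below, `rerank`, `toy_scores`, `first_pass`, and `results_sketch` are hypothetical names, and the candidate list is made up; in the exercise the candidates would presumably come from the first-pass searcher and `score_fn` would be your JM scoring function.

```python
def rerank(query_id, candidates, score_fn, k=100):
    """Re-score the top-k first-pass candidates and sort by the new score.

    candidates: doc IDs in first-pass order (hypothetical input).
    score_fn:   callable mapping a doc ID to a float, e.g. a JM score.
    """
    rescored = [[query_id, doc_id, score_fn(doc_id)] for doc_id in candidates[:k]]
    # Higher (less negative) log-probability means a better match.
    rescored.sort(key=lambda row: row[2], reverse=True)
    return rescored


# Toy usage with made-up doc IDs and scores.
toy_scores = {"a": -12.5, "b": -10.1, "c": -17.0}
first_pass = ["a", "b", "c"]
results_sketch = rerank("q1", first_pass, toy_scores.get, k=2)
```

Each returned row has the same `[queryID, documentID, score]` shape as the items in `results`, so the rows for all queries can be appended to one list.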

In [9]:
from pyserini.search import SimpleSearcher

results = []
searcher = SimpleSearcher("indexes/lucene-index-msmarco-passage")
# =======Your code=======

# =======================
results

Write your results into a file.
Submit this file together with the completed notebook.

In [10]:
# check duplicates
check = set()
for res in results:
    if ((res[0], res[1])) in check:
        raise Exception("Error: Duplicate query-doc is found", res[0], res[1])
    check.add((res[0], res[1]))

# write results in a file
output_str = "\n".join([l[0] + "\tQ0\t" + l[1] + "\t0\t" + str(l[2]) + "\tlm_jm" for l in results])
# use a context manager so the file is flushed and closed
with open("lm_jm.run", "w") as f:
    f.write(output_str)
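
For reference, each line written above has six tab-separated fields: query ID, the literal `Q0`, document ID, a rank column (written as `0` here), the score, and the run tag. The following sketch builds one such line and reads it back; `format_run_line` and `parse_run_line` are hypothetical helper names and are not part of the assignment.

```python
def format_run_line(query_id, doc_id, score, tag="lm_jm"):
    """Build one tab-separated run line: qid, Q0, docid, rank, score, tag."""
    return "\t".join([query_id, "Q0", doc_id, "0", str(score), tag])


def parse_run_line(line):
    """Split a run line back into (query_id, doc_id, score, tag)."""
    query_id, _q0, doc_id, _rank, score, tag = line.rstrip("\n").split("\t")
    return query_id, doc_id, float(score), tag


# Round trip on one made-up result row.
line = format_run_line("23849", "4348282", -10.65)
parsed = parse_run_line(line)
```

Reading a few lines of `lm_jm.run` back this way is a quick sanity check before submitting.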

## Handing in
Submit the result file (ranked documents), the filled-in notebook, and the PDF version of your notebook:

- The result file should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_lm_jm.run
- The notebook should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_retrieval.ipynb
- The PDF version of your notebook should be named STUDENTNUMBER_FIRSTNAME_LASTNAME_retrieval.pdf