# ENGG*6600: Special Topics in Information Retrieval - Fall 2022
## Assignment 3: Retrieval Models (Total: 100 points)

**Description**

This is a coding assignment where you will implement three retrieval models. Basic proficiency in Python is recommended.  

**Instructions**

* To start working on the assignment, first save the notebook to your Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button at the top right corner, choose *Copy Link* under *Get Link*, and use that link to copy this notebook to your Google Drive.  

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.
*   You should implement all the functions yourself and should not use a library or tool for the computation.
*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.
* To create the final PDF submission file, execute *Runtime->RunAll* from the menu to re-execute all the cells and then generate a PDF using *File->Print->Save as PDF*. Make sure that the generated PDF contains all the code and printed outputs before submission.
* To create the final Python submission file, click on *File->Download .py*.


**Submission Details**

* Due date: Nov. 03, 2022 at 11:59 PM (EST).
* The final PDF and python file must be uploaded on CourseLink.
* After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. ***You will not receive any credit if you don't paste the link!*** Make sure we can access the file.
***LINK:*** https://colab.research.google.com/drive/1fZN1APdzxXOVE-tgu4kqD7lpsfybRd7a?usp=sharing

**Academic Honesty**

Please follow the guidelines under the *Collaboration and Help* section in the first lecture.     

# Download input files and code

Please execute the cell below to download the input files.

In [None]:

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


import zipfile

download = drive.CreateFile({'id': '1obnYvxGG8-x025j2U8aVYf4yBc5mFQsw'})
download.GetContentFile('HW03.zip')

with zipfile.ZipFile('HW03.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW03.zip')
# We will use HW03 as our working directory
os.chdir('HW03')

#Setting the input files
queries_file = "queries_tok_clean_kstem"
col = "antique-collection.tok.clean_kstem"
qrel_file = "test.qrel"


# 1 : Initial Data Setup (10 points)

We use files from the ANTIQUE  [https://arxiv.org/pdf/1905.08957.pdf] dataset for this assignment. As described in the previous assignments, this is a passage retrieval dataset.

The description of the input files provided for this assignment is given below.

**Query File**

We randomly sampled a set of 15 queries from the test set of the ANTIQUE dataset. Each row of the input file contains the following information:

*queryid query_text*

The id and text are tab separated. queryid is a unique identifier for a query, and the query text has been pre-processed to remove punctuation, tokenised, and stemmed using the Krovetz stemmer.  


**Query Relevance (qrel) file**

The qrel file contains the relevance judgements (ground truth) for the query passage combinations. Each row of the file contains the following information.

*queryid topicid passageid relevance_judgement*

Please note that the entries are space separated. The second column (topicid) can be ignored.

Given below are a couple of rows of a sample qrel file.

*2146313 U0 2146313_0 4*

*2146313 Q0 2146313_23 2*

The relevance judgements range from 1 to 4. The labels are described below:

Label 1: Non-Relevant

Label 2: Slightly Relevant

Label 3 : Relevant

Label 4: Highly Relevant

Note: for metrics with binary relevance assumptions, Labels 1 and 2 are considered non-relevant and Labels 3 and 4 are considered relevant.

Note: if a query-document pair is not listed in the qrels file, we assume that the document is not relevant to the query.
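The qrel format and the binary relevance rule above can be sketched as follows. This is only an illustration: the helper names `parse_qrel_line` and `is_relevant` are hypothetical and are not part of the assignment skeleton, and the two rows are the sample rows shown above.

```python
def parse_qrel_line(line):
    """Split one space-separated qrel row into (queryid, passageid, label)."""
    queryid, _topicid, passageid, label = line.split()
    return queryid, passageid, int(label)


def is_relevant(label):
    """Binary relevance: labels 3 and 4 are relevant, labels 1 and 2 are not."""
    return label >= 3


sample_rows = ["2146313 U0 2146313_0 4", "2146313 Q0 2146313_23 2"]
parsed = [parse_qrel_line(row) for row in sample_rows]

# Map each (queryid, passageid) pair to its binary relevance.
binary = {(qid, pid): is_relevant(label) for qid, pid, label in parsed}
print(binary)

# A pair that is not listed in the qrels is treated as non-relevant.
print(binary.get(("2146313", "not_listed"), False))
```

Looking up unlisted pairs with a default of `False` is one simple way to apply the second note above.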


**Collection file**

Each row of the file consists of the following information:

*passage_id  passage_text*

The id and text are tab separated. The passage text has been pre-processed to remove punctuation, tokenised, and stemmed using the Krovetz stemmer (same as the queries). The terms in the passage text can be accessed by splitting the text on spaces.
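As a minimal sketch of reading one row of the query or collection file, the snippet below splits a line on the tab and then on spaces. The helper name `parse_tsv_row` and the sample passage text are made up for illustration.

```python
def parse_tsv_row(line):
    """Return (row_id, list_of_terms) for one 'id<TAB>text' row."""
    row_id, text = line.rstrip("\n").split("\t", 1)
    return row_id, text.split(" ")


# Hypothetical row: the text is invented, only the layout matches the files.
sample_line = "2020338_0\twhat be the best way to learn\n"
pid, terms = parse_tsv_row(sample_line)
print(pid, terms, len(terms))
```

The number of terms returned here is the passage length $|p|$ used by the retrieval models later in the assignment.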


In this section, you have to implement the following:

* Load the queries from the query file into a data structure.
* Load the query relevance information into a data structure. You can reuse some of the code written in Assignment 1 for this and modify it as needed.


You may use data structures other than the suggested ones in your implementation.




In [None]:
import numpy as np
import pandas as pd
'''
This function is used to load query file information into datastructure(s).
Return Variables:
queries - mapping from queryid to querytext
'''
def loadQueries(queries_file):
    queries = {}
    with open(queries_file, encoding="utf8") as fd:
        for line in fd:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            # each row is "queryid<TAB>query_text"
            queryId, query = line.split('\t', 1)
            queries[queryId] = query
    return queries


'''
This function is used to load qrel file information into datastructure(s).
The qrel file format is the same as the one provided in Assignment 1 and is given below:
"queryid topicid passageid relevance_judgement"
The entries are space separated.
You can copy your qrel loading code from Assignment 1 and make modifications if necessary.

Return Variables:
num_queries - number of queries in the qrel file
qrels - query relevance information
'''

def loadQrels(qrel_file):
    # the qrel file is space separated: queryid topicid passageid relevance_judgement
    qrel_dataframe = pd.read_csv(qrel_file, names=['q_id', 'topic_id', 'p_id', 'relevance'], sep=' ')
    # the topic id column is not needed
    qrels = qrel_dataframe.drop(columns="topic_id")

    # number of unique queries in the qrel file
    num_queries = qrels.q_id.unique().size

    return num_queries, qrels

# You can return additional datastructures for your implementation.
queries = loadQueries(queries_file)
num_queries, qrels = loadQrels(qrel_file)


print ('Total Num of queries in the query file  : {0}'.format(len(queries)))
print ('Total Num of queries in the qrel file  : {0}'.format(num_queries))


Total Num of queries in the query file  : 15
Total Num of queries in the qrel file  : 15


In [None]:
qrels

Unnamed: 0,q_id,p_id,relevance
0,1844896,1844896_0,4
1,1844896,1844896_4,3
2,1844896,4083883_2,1
3,1844896,1844896_1,4
4,1844896,1844896_2,2
...,...,...,...
451,1262692,247023_6,3
452,1262692,1499030_5,3
453,1262692,2916758_0,3
454,1262692,1105845_15,3



In the cell below, an inverted index with term counts is built in memory. Please run the cell and use its variables when implementing the retrieval models.

In [None]:


'''
An inverted index with count information.
'''
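class indexCount:

   def __init__(self, col):
     self.col = col
     # per-instance state, so that a second index does not share these containers
     self.pcount = 0      # number of passages
     self.ctf = {}        # term -> collection term frequency
     self.sumdl = 0       # total number of tokens in the collection
     self.avgdl = 0       # average passage length
     self.doclen = {}     # passage id -> number of tokens
     self.index = {}      # term -> posting list of "passage_id:term_frequency"
     self.probctf = {}    # term -> ctf / total number of tokens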


   def create_index(self):
     for line in open(self.col, encoding="utf8"):
       pid,ptext = line.strip().split('\t')
       self.pcount+=1

       if pid not in self.doclen:
           self.doclen[pid]=0
       pfreq = {}
       for term in ptext.split(' '):
           self.sumdl+=1
           if term not in self.ctf:
             self.ctf[term]=0
           self.ctf[term]+=1
           self.doclen[pid]+=1

           if term not in pfreq:
              pfreq[term]=0
           pfreq[term]+=1


       for k,v in pfreq.items():
        if k not in self.index:
          self.index[k]=[]
        self.index[k].append(pid+":"+str(v))

     for k,v in self.ctf.items():
        self.probctf[k]=v/float(self.sumdl)

     self.avgdl = self.sumdl/float(self.pcount)




buildIndex = indexCount(col)
buildIndex.create_index()


'''
inverted index with count: dict with term as key and posting list as value
posting list is a list with each element in the format "passage_id:term frequency"
Example - {'the': ['2020338_0:11', '3174498_1:4']}
'''
index = buildIndex.index

#total number of passages in the collection
num_passages = buildIndex.pcount

# Average passage length
avgdl = buildIndex.avgdl

# Collection Term Frequency : dict with term as the key and the term frequency in collection as value
ctf = buildIndex.ctf

# Collection term probabilities : dict with term as key and ctf(term) / total number of tokens in the collection as value
probctf = buildIndex.probctf

#dict with passageId as key and number of tokens in the passage as value
doclen = buildIndex.doclen

# Total number of tokens in the collection
totNumTerms = buildIndex.sumdl


print ('Total number of passages in the collection :{0}'.format(num_passages))
print('Average passage length :{0}'.format(avgdl))
print('Total num of unique terms :{0}'.format(len(ctf)))
print('Total num of terms in the collection :{0}'.format(totNumTerms))


Total number of passages in the collection :403492
Average passage length :41.11619809066846
Total num of unique terms :149467
Total num of terms in the collection :16590057
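The posting lists store each entry as a `"passage_id:term_frequency"` string, so they need to be split before use. The sketch below shows one way to do that on a toy index that reuses the example from the index cell; the helper names `read_postings` and `doc_freq` are hypothetical.

```python
# Toy index in the same layout as the real one.
toy_index = {"the": ["2020338_0:11", "3174498_1:4"]}


def read_postings(term, index):
    """Return [(passage_id, term_frequency), ...]; empty for an unseen term."""
    result = []
    for entry in index.get(term, []):
        pid, tf = entry.rsplit(":", 1)
        result.append((pid, int(tf)))
    return result


def doc_freq(term, index):
    """df(w): the number of passages whose posting list contains the term."""
    return len(index.get(term, []))


print(read_postings("the", toy_index))
print(doc_freq("the", toy_index), doc_freq("unseen", toy_index))
```

Using `index.get(term, [])` avoids a `KeyError` for query terms that never occur in the collection.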


# 2 : Vector Space model (VSM model) (30 points)

In the cell below, implement the VSM model given in Slide 19 of 'Basic Retrieval Models Part 1'. The score function has been given below for reference.

$$ score(q,p) = \sum_{w \in {q \cap p}} count(w,q) \frac{ln(1+ln(1+count(w,p)))}{1-b+b \frac{|p|}{avgdl}} ln \frac{|C|+1}{df(w)} $$

$score(q,p)$ - score assigned to a passage $p$ for a query $q$

$count(w,q)$ - number of times term $w$ occurs in query $q$

$count(w,p)$ - number of times term $w$ occurs in passage $p$

$b$ - set this to 0.75

$|p|$ - Number of tokens in passage $p$

$avgdl$ - Average number of tokens in passages in collection

$|C|$ - number of passages in collection $C$

$df(w)$ - number of passages containing term $w$

Please note that we consider each query term once, since this is equivalent to a dot product.

For each query, you have to return the top 5 retrieved passages ranked based on the score returned by the VSM model using "term at a time" scoring method.
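To make the scoring function and the "term at a time" strategy concrete, here is a small sketch on hypothetical toy data (two passages, two terms). None of the values come from the ANTIQUE collection, and `vsm_term_at_a_time` is an illustrative name, not the required function signature.

```python
import math

# Hypothetical toy data, not taken from the ANTIQUE collection.
toy_index = {"cat": ["p1:2", "p2:1"], "dog": ["p2:3"]}
toy_doclen = {"p1": 4, "p2": 6}
toy_avgdl = 5.0
toy_num_passages = 3  # assumes a third passage that contains neither term


def vsm_term_at_a_time(query_text, index, doclen, avgdl, num_passages, b=0.75):
    """Process one query term at a time, accumulating partial passage scores."""
    # count(w, q): each distinct query term is visited once.
    query_counts = {}
    for term in query_text.split():
        query_counts[term] = query_counts.get(term, 0) + 1

    scores = {}
    for term, q_count in query_counts.items():
        posting_list = index.get(term, [])
        if not posting_list:
            continue  # a term absent from the collection contributes nothing
        idf = math.log((num_passages + 1) / len(posting_list))
        for entry in posting_list:
            pid, tf = entry.rsplit(":", 1)
            tf_part = math.log(1 + math.log(1 + int(tf)))
            norm = 1 - b + b * doclen[pid] / avgdl
            scores[pid] = scores.get(pid, 0.0) + q_count * tf_part / norm * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


toy_ranking = vsm_term_at_a_time("cat dog", toy_index, toy_doclen,
                                 toy_avgdl, toy_num_passages)
print(toy_ranking)
```

In this toy example `p2` ranks first because it matches both query terms, even though it is longer than average and its term-frequency component is therefore scaled down.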


In [None]:
'''
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_vsm : map with query id as key and list of top 5 ranked passages as value
'''
import operator


def vsm(queries, index, avgdl, num_passages, doclen):
    b = 0.75
    final_ranking_vsm = {}
    for qid, qtext in queries.items():
        # count(w, q): each distinct query term is scored once, weighted by its query frequency
        query_wc = {}
        for word in qtext.split():
            query_wc[word] = query_wc.get(word, 0) + 1

        score = {}
        for w, qcount in query_wc.items():
            postings = index.get(w, [])
            if not postings:
                continue  # terms absent from the collection contribute nothing
            # IDF part: ln((|C| + 1) / df(w))
            idf = np.log((num_passages + 1) / len(postings))
            for posting in postings:
                pid, pcount = posting.rsplit(':', 1)
                # TF part: ln(1 + ln(1 + count(w, p)))
                tf = np.log(1 + np.log(1 + int(pcount)))
                # length normalisation: 1 - b + b * |p| / avgdl
                norm = 1 - b + b * (doclen[pid] / avgdl)
                # term-at-a-time accumulation, kept as a float
                score[pid] = score.get(pid, 0.0) + qcount * (tf / norm) * idf
        final_ranking_vsm[qid] = sorted(score.items(), key=operator.itemgetter(1), reverse=True)[:5]

    return final_ranking_vsm

final_ranking_vsm = vsm(queries, index, avgdl, num_passages, doclen)


# Hint: The score would be in the interval: [13,14]
print ('The top retrieved passage and score for query id "3698636" using VSM is : {0}'.format(final_ranking_vsm['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using VSM is : [('754739_3', 13)]


# 3: BM25 (30 points)

In the cell below, implement the BM25 model given in slide 31 of 'Basic Retrieval Models Part 3'.

$$score(q,p) = \sum_{w \in {q \cap p}} \frac{(k_1+1) count(w,p)}{k_1(1-b+b(\frac{|p|}{avgdl})) + count(w,p)} ln\frac{|C|-df(w)+0.5}{df(w)+0.5}$$


$score(q,p)$ - score assigned to a passage $p$ for a query $q$

$count(w,p)$ - number of times term $w$ occurs in passage $p$

$b$ - set this to 0.75

$|p|$ - Number of tokens in passage $p$

$avgdl$ - Average number of tokens in passages in collection

$|C|$ - number of passages in collection $C$

$df(w)$ - number of passages containing term $w$

$k_1$ - set to 1.2

Please note that we iterate over all query tokens including repetitions.

Similar to the previous model, return the top 5 retrieved passages for each query ranked based on the BM25 scoring using "term at a time" scoring method.
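The BM25 formula can be sketched in the same way on hypothetical toy data. The collection size of 10 is an assumption chosen for illustration, and `bm25_term_at_a_time` is an illustrative name rather than the required signature.

```python
import math

# Hypothetical toy data, not taken from the ANTIQUE collection.
toy_index = {"cat": ["p1:2", "p2:1"], "dog": ["p2:3"]}
toy_doclen = {"p1": 4, "p2": 6}
toy_avgdl = 5.0
toy_num_passages = 10  # assumed collection size


def bm25_term_at_a_time(query_text, index, doclen, avgdl, num_passages,
                        k1=1.2, b=0.75):
    """Accumulate BM25 scores, visiting every query token including repeats."""
    scores = {}
    for term in query_text.split():
        posting_list = index.get(term, [])
        if not posting_list:
            continue
        df = len(posting_list)
        idf = math.log((num_passages - df + 0.5) / (df + 0.5))
        for entry in posting_list:
            pid, tf = entry.rsplit(":", 1)
            tf = int(tf)
            length_norm = k1 * (1 - b + b * doclen[pid] / avgdl)
            scores[pid] = scores.get(pid, 0.0) + (k1 + 1) * tf / (length_norm + tf) * idf
    return scores


once = bm25_term_at_a_time("dog", toy_index, toy_doclen, toy_avgdl, toy_num_passages)
twice = bm25_term_at_a_time("dog dog", toy_index, toy_doclen, toy_avgdl, toy_num_passages)
both = bm25_term_at_a_time("cat dog", toy_index, toy_doclen, toy_avgdl, toy_num_passages)
print(once, twice, both)
```

Because repeated query tokens are iterated again, the query "dog dog" gives `p2` twice the score of the query "dog". Note also that with this form of the IDF component, a term that occurs in more than half of the passages receives a negative weight.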

In [None]:
'''
Rank passages for each query and return top 5 passages.
Return Variables:
final_ranking_bm25 : map with query id as key and list of top 5 ranked passages as value
'''
def bm25(queries, index, avgdl, num_passages, doclen):
    k1 = 1.2
    b = 0.75
    final_ranking_bm25 = {}
    for qid, qtext in queries.items():
        score = {}
        # iterate over all query tokens, including repetitions
        for w in qtext.split():
            postings = index.get(w, [])
            if not postings:
                continue  # terms absent from the collection contribute nothing
            df = len(postings)
            # second part: ln((|C| - df(w) + 0.5) / (df(w) + 0.5))
            idf = np.log((num_passages - df + 0.5) / (df + 0.5))
            for posting in postings:
                pid, pcount = posting.rsplit(':', 1)
                tf = int(pcount)
                # first part: (k1 + 1) * tf / (k1 * (1 - b + b * |p| / avgdl) + tf)
                norm = k1 * (1 - b + b * (doclen[pid] / avgdl))
                f = (k1 + 1) * tf / (norm + tf)
                # term-at-a-time accumulation, kept as a float
                score[pid] = score.get(pid, 0.0) + f * idf

        final_ranking_bm25[qid] = sorted(score.items(), key=operator.itemgetter(1), reverse=True)[:5]
    return final_ranking_bm25


final_ranking_bm25 = bm25(queries, index, avgdl, num_passages, doclen)

# Hint: The score would be in the interval: [18,19]
print ('The top retrieved passage and score for query id "3698636" using BM25 is : {0}'.format(final_ranking_bm25['3698636'][:1]))

The top retrieved passage and score for query id "3698636" using BM25 is : [('3698636_9', 17.405618765524807)]


# 4: Evaluation (30 points)

In the cell below, evaluate the top 5 retrieved passages corresponding to each of the models using the Precision@5 and Recall@5 metrics.
You can use the code from Assignment 1, modified as needed.
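As a reminder of what the two metrics measure for a single query, the sketch below computes them for a hypothetical ranking and a hypothetical set of relevant passages. The function names are illustrative, and returning 0 recall for a query without relevant passages is an assumption, not something the assignment specifies.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k retrieved passages that are relevant."""
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k=5):
    """Fraction of all relevant passages that appear in the top k."""
    if not relevant:
        return 0.0  # assumption: such a query contributes 0
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


# Hypothetical ranking and relevance set for one query.
toy_ranked = ["a", "b", "c", "d", "e"]
toy_relevant = {"a", "c", "x", "y"}

print(precision_at_k(toy_ranked, toy_relevant))  # 2 hits out of 5 retrieved
print(recall_at_k(toy_ranked, toy_relevant))     # 2 hits out of 4 relevant
```

The values reported for a model are then the averages of these per-query numbers over all queries.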

In [None]:
queryids=list(final_ranking_bm25.keys())

In [None]:
# collect the relevant passages (labels 3 and 4) for each query from the qrel file
qrelpid={}
for line in open('test.qrel', encoding="utf8"):
    qid,tid,pid,r=line.strip().split(' ')
    if int(r)>2:
        qrelpid.setdefault(qid, []).append(pid)

In [None]:
# keep only the passage ids (drop the scores) from each ranking
final_ranking_vsm_list={}
for k in queryids:
    final_ranking_vsm_list[k]=[pid for pid, _ in final_ranking_vsm[k]]

final_ranking_bm25_list={}
for k in queryids:
    final_ranking_bm25_list[k]=[pid for pid, _ in final_ranking_bm25[k]]

In [None]:
# return precision of the top retrieved passages, averaged over the ranked queries
def calcPrecision(top, qrelpid, rank_in):
    count=0
    for qid, ranked in rank_in.items():
        # a query with no relevant passages in the qrels simply has no hits
        relevant=set(qrelpid.get(qid, []))
        count=count+len(relevant.intersection(ranked[:top]))
    return (count/top)/len(rank_in)

# return recall of the top retrieved passages, averaged over the ranked queries
def calcRecall(top, qrelpid, rank_in):
    recall_avg=0
    for qid, ranked in rank_in.items():
        relevant=set(qrelpid.get(qid, []))
        if relevant:
            recall_avg=recall_avg+len(relevant.intersection(ranked[:top]))/len(relevant)
    return recall_avg/len(rank_in)


# Hint: Precision value interval [0.1,0.2],  Recall value interval [0.04,0.05]
print("Evaluate VSM model")
print ('Precision at top 5 : {0}'.format(calcPrecision(5, qrelpid, final_ranking_vsm_list)))
print ('Recall at top 5 : {0}'.format(calcRecall(5, qrelpid, final_ranking_vsm_list)))
print ("*********************************************************************")
# Hint: Precision value interval [0.3,0.4], Recall value interval [0.10,0.20]
print("Evaluate BM25 model")
print ('Precision at top 5 : {0}'.format(calcPrecision(5, qrelpid, final_ranking_bm25_list)))
print ('Recall at top 5 : {0}'.format(calcRecall(5, qrelpid, final_ranking_bm25_list)))
print ("*********************************************************************")
# Hint: Precision value interval [0.3,0.4], Recall value interval [0.1,0.2]



Evaluate VSM model
Precision at top 5 : 0.08
Recall at top 5 : 0.01963169412444775
*********************************************************************
Evaluate BM25 model
Precision at top 5 : 0.36000000000000004
Recall at top 5 : 0.1192170416808098
*********************************************************************
