


# ENGG*6600: Special Topics in Information Retrieval - Fall 2022
## Assignment 1: Information Retrieval Metrics (Total: 100 points)

**Description**

This is a coding assignment where you will write and execute code to evaluate ranked outputs generated by a retrieval model or a recommender system. Basic proficiency in Python is recommended.  

**Instructions**

* To start working on the assignment, first save the notebook to your own Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button at the top right corner, then *Copy Link* under *Get Link*, and use that link to copy this notebook to your Google Drive.

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.
*   You should implement all the functions yourself and should not use a library or tool for computing the metrics.
*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.
* To create the final PDF submission file, execute *Runtime->RunAll* from the menu to re-execute all the cells and then generate a PDF using *File->Print->Save as PDF*. Make sure that the generated PDF contains all the code and printed outputs before submission. You are responsible for uploading the correct PDF with all the information required for grading.
* To create the final Python submission file, click on *File->Download .py*.


**Submission Details**

* Due date: Sep. 26, 2022 at 11:59 PM (EST).
* The final PDF and Python file must be uploaded to the Dropbox on CourseLink.
* After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. ***You will not receive any credit if you don't paste the link!*** Make sure we can access the file.

**LINK:** https://colab.research.google.com/drive/1Jv-NzbhSkzzvPz1P-V9rhNI1YwhoHhXt?usp=sharing

**Academic Honesty**

Please follow the guidelines under the *Collaboration and Help* section in the first lecture.     

# Download input files

Please execute the cell below to download the input files.

In [None]:

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


import zipfile

download = drive.CreateFile({'id': '1myaSouVnJygjLlQI54_S0vRXLNYekEC8'})
download.GetContentFile('HW01.zip')

with zipfile.ZipFile('HW01.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW01.zip')
# We will use HW01 as our working directory
os.chdir('HW01')

#Setting the input files
qrel_file = "antique-train-final.qrel"
rank_file = "ranking_file"

# 1: Initial Data Setup (10 Points)

We use the files from the ANTIQUE dataset [https://arxiv.org/pdf/1905.08957.pdf] for this assignment. This is a passage retrieval dataset for non-factoid questions created by the Center for Intelligent Information Retrieval (CIIR) at UMass Amherst.

The description of the input files provided for this assignment is given below.

**1) Query Relevance (qrel) file**

The qrel file contains the relevance judgements (ground truth) for the query-passage pairs. The file consists of 4 columns, as shown below.

*\[queryid]  [topicid]  [passageid]  [relevancejudgment]*

Entries in each row are space separated. The second column (topicid) can be ignored.

Given below are a couple of rows of a sample qrel file.

*2146313 U0 2146313_0 4*

*2146313 Q0 2146313_23 2*

The relevance judgements range from 1 to 4.
The description of the labels is given below:

Label 1: Non-Relevant

Label 2: Slightly Relevant

Label 3: Relevant

Label 4: Highly Relevant

**Note**: For metrics with binary relevance assumptions, Labels 1 and 2 are considered non-relevant and Labels 3 and 4 are considered relevant.

**Note**: if a query-document pair is not listed in the qrels file, we assume that the document is not relevant to the query.

**2) Ranking file**

The evaluation metric value has to be calculated for the input ranking file. The file was generated using a standard search engine by executing a ranking model, retrieving the top 100 passages for each of the train queries of the ANTIQUE dataset. The format of this file is given below.

*\[queryid]  [topicid]  [passageid]  [rank] [relevance_score]  [indri]*

Similar to the qrel file, the entries in each row are space delimited.

Given below are some sample rows from the ranking file.

*3097310 Q0 2367043_3 1 -6.01785 indri*

*3097310 Q0 3007432_0 2 -6.22531 indri*

*3097310 Q0 674672_2 3 -6.28514 indri*

**Note**: For this assignment, you only need the information from Column 1 (queryid) and Column 3 (passageid). The passages corresponding to each query are ranked by relevance score (highest to lowest), so you do not need to use Column 4 (rank) explicitly.
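As an illustration of the two file formats described above, here is a minimal plain-Python sketch that parses the sample rows into dictionaries. The helper names (`parse_qrels`, `parse_ranking`, `is_relevant`) are hypothetical and not part of the assignment template; the threshold of 3 follows the binary-relevance note above, and unjudged pairs are treated as non-relevant.

```python
from collections import defaultdict

# Sample rows copied from the format descriptions above.
QREL_LINES = [
    "2146313 U0 2146313_0 4",
    "2146313 Q0 2146313_23 2",
]
RANK_LINES = [
    "3097310 Q0 2367043_3 1 -6.01785 indri",
    "3097310 Q0 3007432_0 2 -6.22531 indri",
    "3097310 Q0 674672_2 3 -6.28514 indri",
]


def parse_qrels(lines):
    """Map query id -> {passage id: graded label}; the topic column is ignored."""
    qrels = defaultdict(dict)
    for line in lines:
        query_id, _topic, passage_id, label = line.split()
        qrels[query_id][passage_id] = int(label)
    return dict(qrels)


def parse_ranking(lines):
    """Map query id -> passage ids in file order (already sorted by score)."""
    ranking = defaultdict(list)
    for line in lines:
        query_id, _topic, passage_id, _rank, _score, _tag = line.split()
        ranking[query_id].append(passage_id)
    return dict(ranking)


def is_relevant(qrels, query_id, passage_id):
    """Binary relevance: labels 3 and 4 count; unjudged pairs do not."""
    return qrels.get(query_id, {}).get(passage_id, 0) >= 3


sample_qrels = parse_qrels(QREL_LINES)
sample_ranking = parse_ranking(RANK_LINES)
print(sample_qrels)
print(sample_ranking)
print(is_relevant(sample_qrels, "2146313", "2146313_0"))
```

Keying the judgements by query id first means a passage is only ever looked up under the query it was judged for.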




In [None]:
import numpy as np
import pandas as pd

# Load the qrel and ranking files into DataFrames (one named column per field)
qrel_dataframe=pd.read_csv(qrel_file, names=['q_id','topic_id','p_id','relevance'], sep=' ')
rank_dataframe=pd.read_csv(rank_file, names=['q_id','topic_id','p_id','rank_id','relevance_score','indri'], sep=' ')

In order to make it easier to access this information in subsequent cells, please store them in appropriate data structures in the cell below.

In [None]:

'''
In this function, load the qrel file into qrel datastructure
Return Variables:
num_queries_1 - Number of unique queries in the qrel file
num_rel - Number of total relevant passages in the dataset across all queries
qrels - datastructure with the query passage relevance information
'''
def loadQrels(qrel_file):
      # build the qrels data structure (a DataFrame) from qrel_dataframe; the topic column is not needed
      qrels=qrel_dataframe.drop(columns="topic_id")

      # number of unique queries
      num_queries_1=qrels.q_id.unique().size

      # number of judged pairs with a relevance label of 3 or 4 (binary-relevant)
      num_rel=len(qrels.loc[(qrels['relevance']==4) | (qrels['relevance']==3)])

      return num_queries_1, num_rel, qrels

'''
In this function, load the ranking files into rank_in datastructure
Return Variables:
num_queries_2 - Number of unique queries in the ranking file
rank_in - datastructure with stored ranking information
'''
def  loadRankfile(rank_file):
      # keep only the query id, passage id and rank columns
      rank_in= rank_dataframe.drop(columns=["topic_id","relevance_score","indri"])

      num_queries_2=rank_in.q_id.unique().size

      return num_queries_2, rank_in



''' You can return single/multiple variables to store data if that makes it convenient
for data processing.
This has been given as an example. However, you would still need to correctly print the
queries in both files and total relevant passages.'''
num_queries_1, num_rel, qrels  = loadQrels(qrel_file)
num_queries_2, rank_in = loadRankfile(rank_file)

# print to ensure the file has been read correctly
print ('Total Num of queries in the qrel file  : {0}'.format(num_queries_1))
print ('Total Num of queries in the rank file  : {0}'.format(num_queries_2))
print ('Total number of relevant passages in the dataset :{0}'.format(num_rel))

Total Num of queries in the qrel file  : 2426
Total Num of queries in the rank file  : 2426
Total number of relevant passages in the dataset :19813




# 2 : Precision (15 Points)


Question 2.1 (5 points)

Definition of Precision corresponding to the top *k* (P@k):




**Answer:**

Precision is the fraction of the **retrieved documents that are relevant to the user's information need**.

Precision at k (P@k) is the number of relevant documents among the top k retrieved documents, divided by k.

> **Precision@k**
>
> $$P@k = \frac{\text{Relevant Retrieved @ } k}{\text{Total Retrieved @ } k} = \frac{n}{k}$$

where **n** is the number of relevant documents among the top k retrieved documents and **k** is the number of documents retrieved up to the cutoff. The reported value is the average of P@k over all queries.

Source: https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
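The definition above can be checked by hand on a small example. The following is a minimal sketch in plain Python, not the notebook's implementation; the function names and the toy data are hypothetical, and relevance is assumed to be already binarized into a set of relevant passage ids per query.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked ids that appear in relevant_ids."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / k


def mean_precision_at_k(ranking, relevant, k):
    """Average P@k over every query in the ranking."""
    scores = [
        precision_at_k(ids, relevant.get(qid, set()), k)
        for qid, ids in ranking.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two queries with five retrieved passages each.
toy_ranking = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["f", "g", "h", "i", "j"],
}
toy_relevant = {"q1": {"a", "c", "z"}, "q2": {"g"}}

# q1 has 2 relevant passages in its top 5, q2 has 1.
print(mean_precision_at_k(toy_ranking, toy_relevant, 5))
```

Because relevance is looked up per query, a passage that is relevant to one query is never counted as a hit for another.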




Question 2.2 (10 points)

In the cell below, please enter the code to print the P@k where k={5,10} for the input ranking file.  As mentioned above, please note that the final value is the average of metric values across all queries.


In [None]:
'''
In this function, calculate Precision@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
precision - Precision@k
'''
def calcPrecision(k, qrels, rank_in):
      # one row per query id, so that queries without a relevant hit still count as 0
      basic=pd.Series(rank_in['q_id'].unique()).reset_index(name='q_id').drop(columns="index")
      rank_filtered=rank_in[rank_in.rank_id<=k]
      qrels_relevant=qrels[qrels.relevance>=3]

      # count relevant passages in the top k per query (membership is tested on p_id alone, not on the (q_id, p_id) pair)
      relevant_retrived=rank_filtered[rank_filtered['p_id'].isin(qrels_relevant['p_id'])]
      relevant_count=relevant_retrived['q_id'].value_counts().rename_axis('q_id').reset_index(name='relevant_count')
      ans_a_pre=pd.merge(basic, relevant_count, how="outer", on='q_id').fillna(0)

      basic=basic.merge(ans_a_pre)
      basic['precision']= basic['relevant_count']/k

      precision= basic['precision'].mean()

      return precision

print ('Precision at top 5 : {0}'.format(calcPrecision(5, qrels, rank_in)))
print ('Precision at top 10 : {0}'.format(calcPrecision(10, qrels, rank_in)))



Precision at top 5 : 0.20263808738664466
Precision at top 10 : 0.15704863973619126


# 3 : Recall (15 points)

Question 3.1 (5 points)

Give the definition of Recall corresponding to the top *k* retrieved results for *n* queries (R@k). Please note that you have to use averaging to aggregate the results from all queries.

**Answer:**

Recall is the fraction of the **documents relevant to the query that are successfully retrieved.**

Recall at k (R@k) is the number of relevant documents among the top k retrieved documents, divided by the total number of relevant documents for that query.

> $$R@k = \frac{\text{Relevant Retrieved @ } k}{\text{Total Possible Relevant}} = \frac{n}{r}$$

where

> **n** = the number of relevant documents among the top k retrieved documents,

> **r** = the total number of relevant documents in the collection for that query.

The reported value is the average of R@k over all queries.

Source: https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
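A small worked example may help make the recall definition concrete. This is a sketch in plain Python with hypothetical names and toy data; it assumes binarized relevance sets per query and, as an explicit assumption, skips queries that have no relevant passage at all (recall is undefined for them).

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the query's relevant ids that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(ranking, relevant, k):
    """Average R@k over queries that have at least one relevant passage."""
    scores = [
        recall_at_k(ids, relevant[qid], k)
        for qid, ids in ranking.items()
        if relevant.get(qid)  # assumption: skip queries with no relevant passage
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical); q3 has no relevant passage and is skipped.
toy_ranking = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["f", "g", "h", "i", "j"],
    "q3": ["k", "l", "m", "n", "o"],
}
toy_relevant = {"q1": {"a", "c", "z"}, "q2": {"g"}}

print(mean_recall_at_k(toy_ranking, toy_relevant, 2))
print(mean_recall_at_k(toy_ranking, toy_relevant, 5))
```

For q1, two of its three relevant passages are in the top 5, so its recall is 2/3 even though its precision is only 2/5.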


Question 3.2 (10 points)

In the cell below, please enter the code to print Recall (R@k) where k={5,10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.

In [None]:
'''
In this function, calculate Recall@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
recall - Recall@k
'''
def calcRecall(k, qrels, rank_in):

    basic=pd.Series(rank_in['q_id'].unique()).reset_index(name='q_id').drop(columns="index")
    # qrel rows judged relevant (label 3 or 4)
    qrels_relevant=qrels[qrels.relevance>=3]

    # count relevant passages in the top k per query (membership is tested on p_id alone)
    rank_filtered=rank_in[rank_in.rank_id<=k]
    relevant_retrived=rank_filtered[rank_filtered['p_id'].isin(qrels_relevant['p_id'])]
    relevant_count=relevant_retrived['q_id'].value_counts().rename_axis('q_id').reset_index(name='relevant_count')
    ans_a_pre=pd.merge(basic, relevant_count, how="outer", on='q_id').fillna(0)

    # recall calculation (the inner merge keeps only queries with at least one relevant passage in qrels)
    z=qrels_relevant['q_id'].value_counts().rename_axis('q_id').reset_index(name='total_relevant')
    recall_counts=pd.merge(ans_a_pre, z, on="q_id")
    recall_counts['recall']=recall_counts['relevant_count']/recall_counts['total_relevant']

    basic=basic.merge(recall_counts)
    recall=recall_counts['recall'].mean()

    return recall

print ('Recall at top 5 : {0}'.format(calcRecall(5, qrels, rank_in)))
print ('Recall at top 10 : {0}'.format(calcRecall(10, qrels, rank_in)))


Recall at top 5 : 0.17867080409742425
Recall at top 10 : 0.2734828274489789


# 4 : F1 Measure (15 Points)

Question 4.1 (5 points)

Give the definition of F1 measure corresponding to the top *k* retrieved results for *n* queries (F1@k). Please note that you have to use averaging to aggregate the results from all queries.

**Answer:**

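The F1 measure at k is the harmonic mean of precision@k and recall@k.

> $$F1@k = \frac{2}{\frac{1}{P@k} + \frac{1}{R@k}} = \frac{2 \cdot P@k \cdot R@k}{P@k + R@k}$$

F1@k is computed for each query (and taken as 0 when both P@k and R@k are 0), then averaged over all queries.

Source: https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)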
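To illustrate the harmonic-mean formula, here is a minimal sketch in plain Python. The function names and toy data are hypothetical; relevance is assumed to be binarized per query, and a query whose top k contains no relevant passage is assigned an F1 of 0 to avoid dividing by zero.

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of P@k and R@k for one query (0 when there are no hits)."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant_ids)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(ranking, relevant, k):
    """Average the per-query F1@k values over every query in the ranking."""
    scores = [
        f1_at_k(ids, relevant.get(qid, set()), k)
        for qid, ids in ranking.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical); q3 retrieves none of its relevant passages.
toy_ranking = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["f", "g", "h", "i", "j"],
    "q3": ["k", "l", "m", "n", "o"],
}
toy_relevant = {"q1": {"a", "c", "z"}, "q2": {"g"}, "q3": {"x"}}

print(mean_f1_at_k(toy_ranking, toy_relevant, 5))
```

Note that the per-query F1 values are averaged; this is generally different from plugging the averaged P@k and R@k into the formula.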



Question 4.2 (10 points)

In the cell below, please enter the code to print the F1@k where k={5,10} for the input ranking file.  Please note that you have to calculate F1 score for each query and compute the final score by averaging the metric values across all queries.

In [None]:
'''
In this function, calculate F1@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
f1 - F1@k
'''

def calcFScore(k, qrels, rank_in):

    basic=pd.Series(rank_in['q_id'].unique()).reset_index(name='q_id').drop(columns="index")
    # qrel rows judged relevant (label 3 or 4)
    qrels_relevant=qrels[qrels.relevance>=3]

    # count relevant passages in the top k per query (membership is tested on p_id alone)
    rank_filtered=rank_in[rank_in.rank_id<=k]
    relevant_retrived=rank_filtered[rank_filtered['p_id'].isin(qrels_relevant['p_id'])]
    relevant_count=relevant_retrived['q_id'].value_counts().rename_axis('q_id').reset_index(name='relevant_count')
    ans_a_pre=pd.merge(basic, relevant_count, how="outer", on='q_id').fillna(0)

    #recall calculation
    z=qrels_relevant['q_id'].value_counts().rename_axis('q_id').reset_index(name='total_relevant')
    recall_counts=pd.merge(ans_a_pre, z, on="q_id")
    recall_counts['recall']=recall_counts['relevant_count']/recall_counts['total_relevant']
    basic=basic.merge(recall_counts)


    #precision calculation
    basic['precision']= basic['relevant_count']/k

    #f1score calculation
    basic['f1']=(2*(basic['precision']*basic['recall'])/(basic['precision']+basic['recall'])).fillna(0)
    f1=basic['f1'].mean()

    return f1

print ('F1 score at top 5 : {0}'.format(calcFScore(5, qrels, rank_in)))
print ('F1 score at top 10 : {0}'.format(calcFScore(10, qrels, rank_in)))

F1 score at top 5 : 0.16780895628714962
F1 score at top 10 : 0.17655764871188323


# 5 : Mean Reciprocal Rank (MRR) (15 Points)

Question 5.1 (5 points)

Give the definition of MRR@k corresponding to the top *k* retrieved results for *n* queries (MRR@k). Please note that you have to use averaging to aggregate the results from all queries.

**Answer:**
The Mean Reciprocal Rank is a measure for evaluating **any process that produces, for each query, a list of possible responses ordered by probability of correctness.**
MRR@k is the mean, over all queries, of the reciprocal of the rank of the first relevant document retrieved within the top k results.

$$MRR@k = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{rank_i}$$

where **Q** is the total number of queries and **rank_i** is the rank of the first relevant document in the top k results for query i.

Source:
https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
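The formula can be traced on a toy example. This is a sketch in plain Python with hypothetical names and data; it assumes binarized relevance sets and the common convention that a query with no relevant passage in its top k contributes a reciprocal rank of 0 while still counting in the denominator.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant id within the top k, or 0 if there is none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(ranking, relevant, k):
    """Mean reciprocal rank over every query in the ranking."""
    scores = [
        reciprocal_rank_at_k(ids, relevant.get(qid, set()), k)
        for qid, ids in ranking.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): q1's first relevant passage is at rank 2, q2's at rank 7.
toy_ranking = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f", "g", "h", "i", "j"],
}
toy_relevant = {"q1": {"b"}, "q2": {"j"}}

print(mrr_at_k(toy_ranking, toy_relevant, 5))   # q2 contributes 0 at this cutoff
print(mrr_at_k(toy_ranking, toy_relevant, 10))  # q2 now contributes 1/7
```

Under this convention MRR@k cannot decrease as k grows, since a larger cutoff can only turn a 0 contribution into a positive one.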



Question 5.2 (10 points)

In the cell below, please enter the code to print the MRR@k where k={5,10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.

In [None]:
'''
In this function, calculate MRR@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
mrr - MRR@k
'''

def calcMRR(k, qrels, rank_in):
    # keep the top-k rows per query and attach the qrel label (unjudged pairs get 0)
    filtered=rank_in[rank_in.rank_id<=k]
    da=pd.merge(filtered,qrels,how='left').fillna(0)
    # keep only relevant rows (label 3 or 4)
    db=da.loc[da['relevance']>=3]
    # rank of the first relevant passage per query
    dc=db.loc[db.groupby('q_id').rank_id.idxmin()]
    # mean reciprocal rank over the queries that have a relevant passage in the top k
    mrr=(1/dc['rank_id']).mean()

    return mrr

print ('MRR at top 5 : {0}'.format(calcMRR(5, qrels, rank_in)))
print ('MRR at top 10 : {0}'.format(calcMRR(10, qrels, rank_in)))

MRR at top 5 : 0.7228743748161224
MRR at top 10 : 0.6435193263055858


# 6 : Mean Average Precision (MAP) (15 points)

Question 6.1 (5 points)

Give the definition of MAP@k corresponding to the top *k* retrieved results for *n* queries (MAP@k). Please note that you have to use averaging to aggregate the results from all queries.

**Answer:**
Mean average precision at k is defined as **the sum of the average precision values of all the queries for the top k retrieved results, divided by the number of queries.**

$$MAP@k = \frac{1}{Q} \sum_{q=1}^{Q} AP@k(q)$$

where

**Q** is the total number of queries,

**AP@k(q)** is the average precision over the top k retrieved documents for query q.
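As a worked illustration of AP@k and MAP@k, here is a minimal sketch in plain Python. The names and toy data are hypothetical. It assumes binarized relevance sets and normalizes AP@k by the total number of relevant passages for the query; other conventions (for example dividing by the number of relevant passages found in the top k) exist, so the normalizer here is an assumption rather than the course's definition.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Sum of P@i at each relevant rank i <= k, divided by the number of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # P@rank, using the running hit count
    return precision_sum / len(relevant_ids)  # assumption: normalize by all relevant


def map_at_k(ranking, relevant, k):
    """Mean of AP@k over every query in the ranking."""
    scores = [
        average_precision_at_k(ids, relevant.get(qid, set()), k)
        for qid, ids in ranking.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical).
toy_ranking = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["f", "g", "h", "i", "j"],
}
toy_relevant = {"q1": {"a", "c", "z"}, "q2": {"g"}}

# q1: relevant at ranks 1 and 3, so AP = (1/1 + 2/3) / 3; q2: AP = (1/2) / 1.
print(map_at_k(toy_ranking, toy_relevant, 5))
```

The running hit count is what makes each term a precision value: at rank 3 the term is 2/3 (two hits so far), not 1/3.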



Question 6.2 (10 points)

In the cell below, please enter the code to print the MAP@k where k={50, 100} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.


In [None]:
'''
In this function, calculate MAP@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
map_score - MAP@k
'''

def calcMAP(k, qrels, rank_in):
  # keep the top-k rows per query and attach the qrel label (unjudged pairs get 0)
  filtered=rank_in[rank_in.rank_id<=k]
  da=pd.merge(filtered,qrels,how='left').fillna(0)

  # binarize the labels: 3 and 4 become 1, everything else becomes 0
  da['relevance']=(da['relevance']>=3).astype(float)
  # per-row value: 1/rank for a relevant passage, 0 otherwise
  da['precision']=da['relevance']/da['rank_id']
  db=da.loc[da['relevance']==1]

  # average the per-row values within each query, then across the queries with a relevant passage in the top k
  map_score=db.groupby('q_id', as_index=False)['precision'].mean()['precision'].mean()

  return map_score

print ('MAP at top 50 : {0}'.format(calcMAP(50, qrels, rank_in)))
print ('MAP at top 100 : {0}'.format(calcMAP(100, qrels, rank_in)))

MAP at top 50 : 0.3202907736005067
MAP at top 100 : 0.27487083134970786


# 7 : Normalized Discounted Cumulative Gain (NDCG) (15 Points)

Question 7.1 (5 points)

Give the definition of NDCG@k corresponding to the top *k* retrieved results for *n* queries (NDCG@k). Use the definition discussed in the lectures. Note that this metric considers graded relevance judgments and you should not binarize the labels. To assign zero gain to non-relevant documents, decrease all relevance labels in the ANTIQUE qrels by 1 point i.e. map relevance judgements 1-4 to 0-3. Please note that you have to use averaging to aggregate the results from all queries.

**Answer:**

The **Discounted Cumulative Gain** for k shown recommendations (DCG@k) sums the relevance of the shown items, with each item's relevance discounted by the logarithm of its rank.

The **Normalized Discounted Cumulative Gain** for k shown recommendations (NDCG@k) divides this score by the maximum possible value of DCG@k.

Here, the maximum possible value of DCG@k is obtained when the items in the **ranking are sorted by their relevance labels (graded labels 1/2/3/4, in descending order)**.
This is called the Ideal Discounted Cumulative Gain (IDCG@k).

IDCG@k is calculated by sorting the items by their true relevance (in descending order) and then applying the formula for DCG@k.
The formula for DCG@k is

> $$DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$

where rel_i is the relevance label of the item at rank i, and

> $$NDCG@k = \frac{DCG@k}{IDCG@k}$$
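The two formulas above can be traced on a small example. This is a sketch in plain Python with hypothetical names and toy data. It follows the question's instruction to shift labels 1-4 down to gains 0-3 and gives unjudged passages a gain of 0. Two choices here are assumptions, since the lecture definition is not reproduced in this notebook: the ideal ranking is built from all judged passages for the query, and NDCG is computed per query before averaging.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains, using log base 2."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, labels, k):
    """NDCG@k for one query; labels maps passage id -> graded label (1-4)."""
    gain = {pid: label - 1 for pid, label in labels.items()}  # 1-4 -> 0-3
    actual = [gain.get(pid, 0) for pid in ranked_ids[:k]]     # unjudged -> 0
    ideal = sorted(gain.values(), reverse=True)[:k]           # assumption: all judged
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg_at_k(ranking, qrels, k):
    """Average of the per-query NDCG@k values."""
    scores = [
        ndcg_at_k(ids, qrels.get(qid, {}), k) for qid, ids in ranking.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): q2 is already in ideal order.
toy_ranking = {"q1": ["a", "b", "c", "d", "e"], "q2": ["g", "f"]}
toy_qrels = {
    "q1": {"a": 4, "b": 1, "c": 3, "z": 4},
    "q2": {"g": 4, "f": 2},
}

print(ndcg_at_k(toy_ranking["q1"], toy_qrels["q1"], 5))
print(mean_ndcg_at_k(toy_ranking, toy_qrels, 5))
```

For q1 the DCG is 3/log2(2) + 2/log2(4) = 4, while the ideal order (gains 3, 3, 2, 0) includes the unretrieved passage z, so the score stays below 1.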





Question 7.2 (10 points)

In the cell below, please enter the code to print the NDCG@k where k={5, 10} for the input ranking file. As mentioned above, please note that the final value is the average of metric values across all queries.

Use log base 2 for the calculations.


In [None]:

'''
In this function, calculate NDCG@k, given the input ranking information (rank_in)
and the query passage relevance information (qrels).
Return Value:
ndcg - NDCG@k
'''

def DCG(k, qrels, rank_in):
    #filtering first k documents
    rank_filtered=rank_in[rank_in.rank_id<=k]

    # attach the qrel labels (unjudged pairs get 0; labels are used as given, 1-4) and discount each by log2(rank+1)
    da=pd.merge(rank_filtered,qrels,how='left').fillna(0)
    da['log2+1']=np.log2(da['rank_id']+1)
    da['divide']=da['relevance']/da['log2+1']

    #calculation of DCG
    db=da.groupby('q_id', as_index=False)['divide'].sum()
    DCG=db['divide'].mean()
    return DCG

def iDCG(k, qrels, rank_in):
    #filtering first k documents
    rank_filtered=rank_in[rank_in.rank_id<=k]

    # attach the qrel labels, then re-sort each query's retrieved top-k rows by label to form the ideal order
    da=pd.merge(rank_filtered,qrels,how='left').fillna(0)
    da=da.sort_values(by=['q_id','relevance'],ascending=False)
    da['index']=(da.groupby('q_id', as_index=False).cumcount())+1
    da['log2+1']=np.log2(da['index']+1)
    da['divide']=da['relevance']/da['log2+1']

    #calculation of iDCG
    db=da.groupby('q_id', as_index=False)['divide'].sum()
    DCG=db['divide'].mean()
    return DCG

def calcNDCG(k, qrels, rank_in):
  # NDCG as the ratio of the mean DCG@k to the mean IDCG@k across queries

  ndcg=DCG(k, qrels, rank_in)/iDCG(k, qrels, rank_in)
  return ndcg

print ('NDCG at top 5 : {0}'.format(calcNDCG(5, qrels, rank_in)))
print ('NDCG at top 10 : {0}'.format(calcNDCG(10, qrels, rank_in)))

NDCG at top 5 : 0.8214802384921778
NDCG at top 10 : 0.7539200567798043


References:


1.   [Stanford handout](https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf)
2.   [A Theoretical Analysis of NDCG Ranking Measures by Wang et al.](http://proceedings.mlr.press/v30/Wang13.pdf)

