# ENGG*6600: Special Topics in Information Retrieval - Fall 2022
## Assignment 7: Learning to Rank (Total: 100 points)

**Description**

This assignment consists of programming and analytical questions on Learning to Rank models and neural ranking models. Basic proficiency in Python is recommended.  

**Instructions**

* To start working on the assignment, first save the notebook to your Google Drive by clicking the *Copy to Drive* button. Alternatively, click the *Share* button at the top right corner, then click *Copy Link* under *Get Link* to get a link, and use it to copy this notebook to your Google Drive.

*   For questions with descriptive answers, please replace the text in the cell which states "Enter your answer here!" with your answer. If you are using mathematical notation in your answers, please define the variables.
*   You should implement all the functions yourself and should not use a library or tool for the computation.
*   For coding questions, you can add code where it says "enter code here" and execute the cell to print the output.
* To create the final PDF submission file, execute *Runtime->RunAll* from the menu to re-execute all the cells and then generate a PDF using *File->Print->Save as PDF*. Make sure that the generated PDF contains all the code and printed output before submission.
* To create the final Python submission file, click *File->Download .py*.


**Submission Details**

* Due date: December 5, 2022 at 11:59 PM (EDT).
* The final PDF and python file must be uploaded on CourseLink.
* After copying this notebook to your Google Drive, please paste a link to it below. Use the same process given above to generate a link. ***You will not receive any credit if you don't paste the link!*** Make sure we can access the file.
***LINK:*** https://colab.research.google.com/drive/1zKzc-C7RBnyjJ1IhYsprh8dlfKgMiyWg?usp=sharing

**Academic Honesty**

Please follow the guidelines under the Collaboration and Help section in the first lecture.      

# Download input files

Please execute the cell below to download the input files.

In [None]:

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


import zipfile
import numpy as np

download = drive.CreateFile({'id': '11K8r5a_Aj9S4Cpd8k3x76ccBTluNguip'})
download.GetContentFile('HW07.zip')

with zipfile.ZipFile('HW07.zip', 'r') as zip_file:
    zip_file.extractall('./')
os.remove('HW07.zip')
# We will use HW07 as our working directory
os.chdir('HW07')

#Setting the input files
passage_file = "antique-collection.tok.clean_kstem"
test_queries_file = "antique-test-queries.tok.clean_kstem"
train_queries_file = "antique-train-queries.tok.clean_kstem"
val_queries_file = "antique-val-queries.tok.clean_kstem"
sample = "sample.txt"
stopwords_file = "stopword_INQUERY"
val_baseline_features_file = "val_baseline_features_top10"
test_baseline_features_file = "test_baseline_features_top10"
train_baseline_features_file = "train_baseline_features_top10"


# 1 : Initial Data Setup (30 points)

We use the collection from the ANTIQUE dataset [https://arxiv.org/pdf/1905.08957.pdf] for this assignment. As described in the previous assignments, this is a passage retrieval dataset.

The description of the input files provided for this assignment is given below.

**Collection file**

Each row of the file consists of the following information:

*passage_id  passage_text*

The id and text are tab separated. The passage text has been pre-processed to remove punctuation, tokenised, and stemmed using the Krovetz stemmer. The terms in the passage text can be accessed by splitting the text on spaces.

**Query files**

You are provided with train, validation and test query files. Each row of a file consists of the following information:

*query_id  query_text*

The id and text are tab separated. The query text has been pre-processed to remove punctuation, tokenised, and stemmed using the Krovetz stemmer. The terms in the query text can be accessed by splitting the text on spaces.

**Feature files**

You are provided with train, validation and test feature files. Each row of a file consists of the following information:

*query_id  passage_id relevance_score vsm_score bm25_score*

Each row contains the features for one (query, passage) pair and is space separated. The relevance_score is the human-annotated relevance score. vsm_score and bm25_score are the scores assigned to the pair by the two retrieval methods (vector space model and BM25).
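As an illustration of the row layout described above, here is a minimal sketch of parsing one feature row. The helper name `parse_feature_row` is hypothetical, the example values are taken from the sample row shown later in this notebook, and the sketch assumes the relevance label is an integer.

```python
def parse_feature_row(line):
    """Split one space-separated row into typed fields.

    Assumes the layout: query_id passage_id relevance_score vsm_score bm25_score
    and that relevance_score is an integer label (an assumption).
    """
    qid, pid, rel, vsm, bm25 = line.strip().split(' ')
    return qid, pid, int(rel), float(vsm), float(bm25)

row = parse_feature_row("3990512 882429_10 0 3.5053628162466897 10.841493137122296\n")
print(row)
```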

**Stopwords file**

The stopword file contains the list of stopwords. This file has a stopword per line.

**Sample file**

For this assignment, we use the pyltr Learning to Rank framework from [https://github.com/harshhpareek/pyltr]. The input file for the framework must follow the same format as sample.txt:

*relevance_score qid:query_id 1:feature1 2:feature2 #docid = passage_id*

Each entry is space separated. This file has been provided only for reference to create files of the same format.
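To make the format concrete, the following is a minimal sketch that builds one row from its parts. The helper name `format_letor_row` is hypothetical and is not part of pyltr; the values are taken from the example row shown later in this notebook.

```python
def format_letor_row(rel, qid, features, pid):
    """Build one row: 'rel qid:QID 1:f1 2:f2 ... #docid = PID'.

    Feature indices start at 1, as in the format above.
    """
    feats = ' '.join('{}:{}'.format(i, v) for i, v in enumerate(features, start=1))
    return '{} qid:{} {} #docid = {}'.format(rel, qid, feats, pid)

example = format_letor_row('0', '3990512',
                           ['3.5053628162466897', '10.841493137122296'],
                           '882429_10')
print(example)
```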

In the cell below, you have to implement the following:


*   Load the query files
*   Load the collection
*   Load the stopwords




In [None]:

'''
In this function, load the query files into dict
Return Variables:
queries - dict with qid as key and querytext as value
'''
def loadQueryFile(filename):
    #enter your code here
    queries={}
    for line in open(filename, encoding="utf8"):
      qid,qtext=line.strip().split('\t')
      queries[qid]=qtext
    return queries


'''
In this function, load the collection into dict
Return Variables:
coll - dict with passage id as key and passage text as value
'''
def loadCollection(passage_file):
    #enter your code here
    coll={}
    for line in open(passage_file, encoding="utf8"):
       pid,ptext = line.strip().split('\t')
       coll[pid]=ptext
    return coll

'''
In this function, load the stopwords into dict
Return Variables:
stopwords - dict with stopword as key
'''
def loadStopWords(stopwords_file):
    #enter your code here
    stopwords={}
    for line in open(stopwords_file, encoding="utf8"):
      stopwords[line.strip()]=None
    return stopwords


train_queries = loadQueryFile(train_queries_file)
val_queries = loadQueryFile(val_queries_file)
test_queries = loadQueryFile(test_queries_file)
coll = loadCollection(passage_file)
stopwords = loadStopWords(stopwords_file)

print('Total Number of train queries: {0}'.format(len(train_queries)))
print('Total Number of validation queries: {0}'.format(len(val_queries)))
print('Total Number of test queries: {0}'.format(len(test_queries)))
print('Total Number of passages in the collection: {0}'.format(len(coll)))
print('Total Number of stopwords: {0}'.format(len(stopwords)))


Total Number of train queries: 2226
Total Number of validation queries: 200
Total Number of test queries: 200
Total Number of passages in the collection: 403492
Total Number of stopwords: 418


# 2 : Feature Preparation (40 points)

The input feature file contains two features for each (query, passage) pair: the VSM score and the BM25 score. In this section, you will implement three additional features and use them to create an input feature file that contains all 5 features. The feature file must have the same format as the sample file.

In the cell below, implement the following features:

*  Number of unique terms shared by the query and the passage, after excluding stopwords and one-character words from both.

  [ Example = Query : why do a cat headbutt

  Passage : cat fight for attention and domination if you show a can of food to my cat he headbutt it.

  Number of Overlapped terms for the query/passage pair: 2  ]

*  Number of terms in query
*  Number of terms in passage
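The example above can be checked with a short sketch. The stopword set here is a hypothetical mini list for illustration only (the assignment uses the stopword_INQUERY file), and the sketch assumes that the two length features count whitespace-separated terms before stopword removal.

```python
# Hypothetical mini stopword list, for illustration only.
demo_stopwords = {"why", "do", "for", "and", "if", "you", "of", "to", "my", "he", "it"}

def content_terms(text, stopwords):
    """Unique terms left after dropping stopwords and one-character tokens."""
    return {t for t in text.split() if t not in stopwords and len(t) > 1}

query = "why do a cat headbutt"
passage = ("cat fight for attention and domination if you show a can of food "
           "to my cat he headbutt it")

overlap = content_terms(query, demo_stopwords) & content_terms(passage, demo_stopwords)
num_overlap = len(overlap)                # feature 3: unique shared terms
num_query_terms = len(query.split())      # feature 4: terms in query (assumed raw count)
num_passage_terms = len(passage.split())  # feature 5: terms in passage (assumed raw count)
print(sorted(overlap), num_overlap, num_query_terms, num_passage_terms)
```

The shared terms are `cat` and `headbutt`, which matches the overlap of 2 stated in the example.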


In [None]:
'''
In this function, create new feature file with additional features in the format required as input
Return Variables:
There is no return variable. You would create a new feature file "final_features_file"
One example of the row of the newly created file is
"0 qid:3990512 1:3.5053628162466897 2:10.841493137122296 3:1 4:6 5:112 #docid = 882429_10"
The format of the file is:
"relevance_score qid:query_id 1:feature1 2:feature2 3:feature3 4:feature4 5:feature5 #docid = passage_id"
You can read through the input baseline_features_file, create additional features and
add the updated entry into the new file.
'''
# Set of stopwords for fast membership checks
stopword_set = set(stopwords.keys())

def rmstopwords(string):
    # Drop stopwords and one-character tokens; return the remaining terms joined by spaces
    querywords = string.split()
    resultwords = [word for word in querywords if ((word.lower() not in stopword_set) and (len(word) != 1))]
    result = ' '.join(resultwords)
    return result

def featureCreation(baseline_features_file, stopwords, queries, coll, final_features_file):

    with open(str(final_features_file), 'w') as f, open(baseline_features_file, encoding="utf8") as fin:
        for line in fin:
            qid, pid, r, vsm, bm25 = line.strip().split(' ')
            query = queries[qid]
            passage = coll[pid]
            srquery = rmstopwords(query)
            srpassage = rmstopwords(passage)
            # Feature 3: number of unique terms shared by query and passage
            common = len(set(srquery.split()) & set(srpassage.split()))
            # Features 4 and 5: number of terms (not characters) in query and passage
            s = "{} qid:{} 1:{} 2:{} 3:{} 4:{} 5:{} #docid = {}\n".format(r, qid, vsm, bm25, common, len(query.split()), len(passage.split()), pid)
            f.write(s)



featureCreation(train_baseline_features_file, stopwords, train_queries, coll, 'train_features_final')
featureCreation(val_baseline_features_file, stopwords, val_queries, coll, 'val_features_final')
featureCreation(test_baseline_features_file, stopwords, test_queries, coll, 'test_features_final')


# 3 : Model Training and Evaluation (30 points)

In the cell below, the pyltr library is used to train and evaluate a LambdaMART model on the ANTIQUE data. This takes less than a minute to execute.


In [None]:
!pip install --upgrade scikit-learn==0.22

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==0.22
  Downloading scikit_learn-0.22-cp38-cp38-manylinux1_x86_64.whl (7.0 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.22 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.22 which is incompatible.
Successfully installed scikit-learn-0.22


In [None]:
import warnings
warnings.filterwarnings('ignore')

import pyltr
with open('train_features_final') as trainfile, open('val_features_final') as valifile, open('test_features_final') as evalfile:
    TX, Ty, Tqids, _ = pyltr.data.letor.read_dataset(trainfile)
    VX, Vy, Vqids, _ = pyltr.data.letor.read_dataset(valifile)
    EX, Ey, Eqids, _ = pyltr.data.letor.read_dataset(evalfile)


metric = pyltr.metrics.NDCG(k=10)
monitor = pyltr.models.monitors.ValidationMonitor(VX, Vy, Vqids, metric=metric, stop_after=250)

model = pyltr.models.LambdaMART(
    metric=metric,
    n_estimators=100,
    learning_rate=0.02,
    max_features=0.5,
    query_subsample=0.5,
    max_leaf_nodes=10,
    min_samples_leaf=64,
    verbose=1,
)
model.fit(TX, Ty, Tqids,monitor=monitor)
Epred = model.predict(EX)
print ('LambdaMART model test score :'+ str(metric.calc_mean(Eqids, Ey, Epred)))


Early termination at iteration  99
LambdaMART model test score :0.7976729429247746
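The test score above is NDCG@10 averaged over the test queries. As a rough illustration of what is computed for a single query, here is a minimal sketch using one common NDCG formulation (gain 2^rel - 1, log2 position discount). This is an assumption: pyltr's exact conventions may differ, and the function names are hypothetical.

```python
import math

def dcg_at_k(labels_in_rank_order, k=10):
    """DCG over the top-k labels, with gain 2^rel - 1 and a log2 discount."""
    return sum((2 ** y - 1) / math.log2(pos + 2)
               for pos, y in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(labels, scores, k=10):
    """NDCG@k for one query: DCG of the predicted order over DCG of the ideal order."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

# One relevant passage ranked second out of two.
swapped = ndcg_at_k(labels=[1, 0], scores=[0.0, 1.0])
# The same passages in the ideal order.
perfect = ndcg_at_k(labels=[1, 0], scores=[1.0, 0.0])
print(swapped, perfect)
```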


Describe how LambdaMART works. A brief description of the model and training objective would be sufficient.

LambdaMART combines LambdaRank with MART (Multiple Additive Regression Trees). Like other pairwise approaches, it looks at pairs of items for the same query that have different relevance labels and learns scores that place the more relevant item higher. The final ranking for a query is obtained by sorting its items by predicted score.

MART is a gradient boosted ensemble of regression trees for prediction tasks. LambdaMART trains these trees with gradients derived from the LambdaRank cost instead of the gradients of an ordinary regression loss, so the ensemble is fitted for ranking rather than for predicting the relevance label itself.

Training Objective of LambdaMART:

LambdaMART is a Learning to Rank (LTR) method. LTR is a class of techniques that apply supervised machine learning (ML) to ranking problems. The main difference between LTR and traditional supervised ML is this:
1. Traditional ML solves a prediction problem (classification or regression) on a single instance at a time. For example, in spam detection you look at all the features of one email and classify it as spam or not. The aim is to produce a class (spam or not spam) or a single numerical score for that instance.
2. LTR solves a ranking problem on a list of items. The aim is to produce an optimal ordering of those items, so LTR cares less about the exact score each item gets than about the relative ordering among the items.

LambdaMART does not minimise an explicit loss on the scores. For each pair of items with different relevance labels for the same query, it uses the RankNet pairwise gradient scaled by the change in the ranking metric (NDCG@10 in the cell above) that swapping the two items would cause, and each new tree is fitted to the resulting per-item gradients. Training therefore pushes the model toward orderings that improve the metric directly.

The most common application of LTR is search engine ranking, but it’s useful anywhere you need to produce a ranked list of items.
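As an illustration of the LambdaRank-derived gradients that LambdaMART fits its trees to, here is a minimal sketch for a single query. It is a simplified version under stated assumptions, not pyltr's implementation: the function name is hypothetical, the sigmoid scale is fixed at 1, NDCG is not truncated at a cutoff, and a positive value means "push this item's score up".

```python
import math

def lambdas_for_query(scores, labels):
    """Per-item lambda gradients for one query (simplified LambdaRank sketch)."""
    n = len(scores)
    # rank[i] = 0-based position of item i when sorted by descending score
    order = sorted(range(n), key=lambda i: -scores[i])
    rank = [0] * n
    for pos, i in enumerate(order):
        rank[i] = pos
    gain = [2 ** y - 1 for y in labels]
    disc = [1.0 / math.log2(r + 2) for r in range(n)]
    ideal_dcg = sum(g * d for g, d in zip(sorted(gain, reverse=True), disc))
    lambdas = [0.0] * n
    if ideal_dcg == 0:
        return lambdas  # no relevant items, so no pair carries a gradient
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                # RankNet-style weight: large when i is scored below j
                rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                # |change in NDCG| if items i and j swapped positions
                delta = abs((gain[i] - gain[j]) * (disc[rank[i]] - disc[rank[j]])) / ideal_dcg
                lambdas[i] += rho * delta
                lambdas[j] -= rho * delta
    return lambdas

# The relevant item (index 0) is currently scored below the non-relevant one.
lam = lambdas_for_query(scores=[0.0, 1.0], labels=[1, 0])
print(lam)
```

In this example the relevant item receives a positive lambda and the non-relevant item an equal negative one, so the next tree would be fitted to move them toward the correct order.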