# Zoekmachines Project 2022

This notebook contains a pipeline that you can use and extend to return relevant passages for users' queries. The pipeline consists of the following parts:
* Dataloader
* Preprocessing
* Full-ranking + Feature construction
* Re-ranking
* Evaluation
* Submission to CodaLab leaderboard  
  
Of these, the dataloader, preprocessing, full-ranking + feature construction, re-ranking and evaluation are modules of an information retrieval system.
Their implementations are very basic, so there is plenty of room for you to improve them.

At the end of the pipeline you are asked to submit your ranked list of the top 100 passages on the test set to the CodaLab leaderboard. This way you can keep track of other teams' performance and compare yourself with your peers.

First, you need to check whether you have the right Python version and the necessary packages installed. 
Please run the cell below to ensure this.

In [6]:
!python3 -V # please make sure this is python 3
# !pip3 install torch
# !pip3 install numpy
# !pip3 install nltk
# !pip3 install wheel
# !pip3 install memory_profiler
%load_ext memory_profiler

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


Python was not found. Run without arguments to install from the Microsoft Store, or disable this shortcut via Settings > Manage App Execution Aliases.


The imports below are all you need for this project. 

In [7]:
import os
import codecs
import json
import argparse
from collections import defaultdict, Counter
from zipfile import ZipFile

import numpy as np

import torch.utils.data as data
import torch
import torch.nn as nn
import torch.optim as optim

import nltk

## Dataset and Dataloader

Here is some information about the datasets we provide.
* Regarding passages, we have a large passage collection `passages_large.json` and a small passage collection `passages_small.json`. Note that `passages_large.json` is larger and so contains more labelled passages. Therefore, retrieval on `passages_large.json` can potentially lead to better performance than retrieval on `passages_small.json`. We encourage you to use the larger one.
* Regarding queries, we have a training set `training_queries.json` with 8000 queries, a validation set `validation_queries.json` with 200 queries, and a test set `test_queries.json` with 200 queries.
* Regarding labels, we have corresponding label sets `training_labels.json`, `validation_labels.json` and `test_labels.json` for the queries in the training, validation and test sets, respectively. Each query id has one or more labelled passage ids, together with the relevance score of each labelled passage for that query.
  * The relevance scores on the training set are not graded (i.e., the scores are always 1), while the scores on the validation and test sets are graded (i.e., the scores can be 1, 2, 3, etc.). The larger the score, the more relevant the passage is.
  * The label set for the test set, `test_labels.json`, is unseen, and the evaluation on the CodaLab leaderboard is conducted on this unseen test set.

Here are more details.

|Dataset name |Number of records |Format|
|:---|:---|:---|
|passages_large.json | 10,000,000 | {passage_id: passage_text, passage_id: …}|
|passages_small.json | 1,000,000 | {passage_id: passage_text, passage_id: …}|
|training_queries.json | 8,000 | {query_id: query_text, query_id: …}|
|validation_queries.json | 200 | {query_id: query_text, query_id: …}|
|test_queries.json | 200 | {query_id: query_text, query_id: …}|
|training_labels.json | 8,000 | {query_id: {passage_id: relevance_score, passage_id: …}, query_id: …}|
|validation_labels.json | 200 | {query_id: {passage_id: relevance_score, passage_id: …}, query_id: …}|
|test_labels.json (unseen) | 200 | {query_id: {passage_id: relevance_score, passage_id: …}, query_id: …}|


To load the data, run the following cell.

In [8]:
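def passage_loader(path):
    print("load passages from: {}".format(path))
    # Use a context manager so the file handle is closed after parsing.
    with open(path, 'r', encoding="utf-8", errors="ignore") as f:
        passages = json.load(f)
    return passages


def query_loader(path):
    print("load queries from: {}".format(path))
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(path, 'r', encoding="utf-8") as f:
        queries = json.load(f)
    return queries


def label_loader(path):
    print("Load labels from: {}".format(path))
    with open(path, 'r', encoding="utf-8") as f:
        labels = json.load(f)
    return labels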

# you can choose passages_small.json or passages_large.json 
%memit passages = passage_loader("data/passages_small.json")

%memit queries_training = query_loader("data/training_queries.json")
%memit queries_validation = query_loader("data/validation_queries.json")
%memit queries_test = query_loader("data/test_queries.json")

%memit labels_training = label_loader("data/training_labels.json")
%memit labels_validation = label_loader("data/validation_labels.json")

load passages from: data/passages_small.json
peak memory: 1037.60 MiB, increment: 849.95 MiB
load queries from: data/training_queries.json
load queries from: data/training_queries.json
load queries from: data/training_queries.json
load queries from: data/training_queries.json
peak memory: 720.27 MiB, increment: 4.14 MiB
load queries from: data/validation_queries.json
load queries from: data/validation_queries.json
load queries from: data/validation_queries.json
load queries from: data/validation_queries.json
peak memory: 720.02 MiB, increment: 0.01 MiB
load queries from: data/test_queries.json
load queries from: data/test_queries.json
load queries from: data/test_queries.json
load queries from: data/test_queries.json
peak memory: 720.03 MiB, increment: 0.00 MiB
Load labels from: data/training_labels.json
Load labels from: data/training_labels.json
Load labels from: data/training_labels.json
Load labels from: data/training_labels.json
peak memory: 726.06 MiB, increment: 6.03 MiB
Load la

Let's take a look at a loaded query, a label and a passage.

In [9]:
list(queries_training.items())[0]

('qid_1',
 ')what was the immediate impact of the success of the manhattan project?')

In [10]:
list(labels_training.items())[0]

('qid_1', {'pid_2255197': 1})

In [11]:
list(passages.items())[0]

('pid_811758',
 'The file browser will appear. Select the DXF File you want to import and click Import. 3. Position Image on Canvas. Using the cursor select where you want the image to be placed and click and drag to position the image on the canvas. 4. Edit your DXF file. Make your edits to the image. 5.')

## Preprocessing

The preprocessing step currently consists only of a whitespace tokeniser.  
Feel free to implement your own preprocessing steps, such as:  
* tokenising words with more advanced methods (e.g., nltk, spaCy and so on)
* lowercasing
* stemming
* stop-word removal, etc.
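
As an illustration of the steps listed above, here is a minimal sketch of a tokeniser that lowercases the text, splits on non-alphanumeric characters, and drops stop words. The names `STOP_WORDS`, `tokenise` and `process_texts` are hypothetical, and the stop-word list is a tiny placeholder rather than a recommended one.

```python
import re

# A tiny illustrative stop-word list (an assumption); use a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "was", "what"}

# Keep runs of lowercase letters and digits; everything else acts as a separator.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenise(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]


def process_texts(texts):
    """Apply tokenise to every value of an {id: text} dictionary."""
    return {text_id: tokenise(text) for text_id, text in texts.items()}


print(tokenise(")what was the immediate impact of the success of the manhattan project?"))
# ['immediate', 'impact', 'success', 'manhattan', 'project']
```

Compared with splitting on spaces, this removes the stray `)` and the trailing `?` seen in the first training query, so `project?` and `project` map to the same token.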


In [7]:
def process_passages(passages):
    passages_tokenised = {}
    for passage_id in passages.keys():
        passages_tokenised[passage_id] = passages[passage_id].split()
    return passages_tokenised


def process_queries(queries):   
    queries_tokenised = {}  
    for query_id in queries.keys():
        queries_tokenised[query_id] = queries[query_id].split()
    return queries_tokenised  


%memit tokenised_queries_training = process_queries(queries_training)
%memit tokenised_queries_validation = process_queries(queries_validation)
%memit tokenised_queries_test = process_queries(queries_test)

%memit tokenised_passages = process_passages(passages)

peak memory: 818.98 MiB, increment: -6.76 MiB
peak memory: 482.38 MiB, increment: -340.46 MiB
peak memory: 200.99 MiB, increment: -113.17 MiB
peak memory: 2232.12 MiB, increment: 2024.39 MiB


Let's take a look at a tokenised query and a passage.

In [8]:
print(tokenised_queries_training['qid_1']) 

[')what', 'was', 'the', 'immediate', 'impact', 'of', 'the', 'success', 'of', 'the', 'manhattan', 'project?']


In [9]:
print(tokenised_passages['pid_811758'])     

['The', 'file', 'browser', 'will', 'appear.', 'Select', 'the', 'DXF', 'File', 'you', 'want', 'to', 'import', 'and', 'click', 'Import.', '3.', 'Position', 'Image', 'on', 'Canvas.', 'Using', 'the', 'cursor', 'select', 'where', 'you', 'want', 'the', 'image', 'to', 'be', 'placed', 'and', 'click', 'and', 'drag', 'to', 'position', 'the', 'image', 'on', 'the', 'canvas.', '4.', 'Edit', 'your', 'DXF', 'file.', 'Make', 'your', 'edits', 'to', 'the', 'image.', '5.']


## Full-Ranking + Feature Construction

Given a user query, full-ranking aims to quickly and roughly rank all passages and return a ranked list of passages.
Here, we implement __Term Frequency (TF)__ and regard it as a full-ranking method.

You are encouraged to implement more advanced full-ranking methods, such as TF-IDF or BM25. You are also encouraged to add a __relevance feedback module__ here to further improve performance.
You are asked to implement the full-ranking method yourself; you are not allowed to use off-the-shelf packages such as scikit-learn or pandas.
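
To make the BM25 suggestion concrete, here is a minimal pure-Python sketch built on an inverted index. The class name `BM25`, its method names, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration. The `score` method is assumed to return `{query_id: {passage_id: score}}`, mirroring how the `TermFrequency` ranker is used in this notebook; it is not the provided implementation.

```python
import math
from collections import Counter, defaultdict


class BM25:
    """Minimal BM25 ranker over {passage_id: [token, ...]} (illustrative sketch)."""

    def __init__(self, tokenised_passages, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.lengths = {}
        self.index = defaultdict(dict)  # term -> {passage_id: term frequency}
        for p_id, tokens in tokenised_passages.items():
            self.lengths[p_id] = len(tokens)
            for term, freq in Counter(tokens).items():
                self.index[term][p_id] = freq
        self.n_passages = len(tokenised_passages)
        self.avg_length = sum(self.lengths.values()) / max(self.n_passages, 1)

    def idf(self, term):
        """Smoothed inverse document frequency; always non-negative."""
        df = len(self.index.get(term, {}))
        return math.log(1 + (self.n_passages - df + 0.5) / (df + 0.5))

    def score_query(self, query_tokens):
        """Score only passages that share at least one term with the query."""
        scores = defaultdict(float)
        for term in set(query_tokens):  # repeated query terms are counted once
            postings = self.index.get(term)
            if not postings:
                continue
            idf = self.idf(term)
            for p_id, freq in postings.items():
                norm = 1 - self.b + self.b * self.lengths[p_id] / self.avg_length
                scores[p_id] += idf * freq * (self.k1 + 1) / (freq + self.k1 * norm)
        return dict(scores)

    def score(self, tokenised_queries):
        """Return {query_id: {passage_id: score}} for a dictionary of queries."""
        return {q_id: self.score_query(tokens) for q_id, tokens in tokenised_queries.items()}


passages_demo = {
    "p1": ["manhattan", "project", "impact"],
    "p2": ["file", "browser"],
    "p3": ["project", "file", "file"],
}
ranker = BM25(passages_demo)
scores = ranker.score_query(["manhattan", "project"])
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
print(ranked[0][0])  # p1
```

Because the inverted index only visits passages containing a query term, passages with no overlap (such as `p2` here) never receive a score, which keeps scoring cheaper than iterating over the whole collection for every query.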

Next, let's conduct full-ranking on the __training__ set.

In [10]:
from tf import TermFrequency

In [11]:
%memit full_ranker = TermFrequency(tokenised_passages)

peak memory: 1846.24 MiB, increment: -77.58 MiB


In [None]:
# For each query, calculate the scores of all passages on the training set.
%memit scores = full_ranker.score(tokenised_queries_training)

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score

When the full-ranking calculation finishes, the ranked list of the top 100 passages for each query is stored.
At the same time, the features for the subsequent re-ranking step are constructed.
Here we only consider two features, __TF scores__ and __passage lengths__.

Before outputting the ranked list, you first need to create an output directory.

In [None]:
if not os.path.exists("output/"):
    os.makedirs("output/")

In [None]:
# output the result file and build_features  
with codecs.open("output/full_ranking_training_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score[:100]:
            ranking+=1         
            feature_1 = score
            feature_2 = len(full_ranker.passages[p_id])             
    
            file.write('\t'.join([q_id, p_id, str(ranking), str(feature_1), str(feature_2), "full_ranking_on_the_training_set"])+os.linesep) 

print("Produce file {}".format("output/full_ranking_training_result.text"))         

Similarly, you can conduct full-ranking on the validation set.

In [None]:
# For each query, calculate the scores of all passages on the validation set.
%memit scores = full_ranker.score(tokenised_queries_validation)

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
    
with codecs.open("output/full_ranking_validation_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score[:100]:
            ranking+=1         
            feature_1 = score
            feature_2 = len(full_ranker.passages[p_id])             
    
            file.write('\t'.join([q_id, p_id, str(ranking), str(feature_1), str(feature_2), "full_ranking_on_the_validation_set"])+os.linesep) 

print("Produce file {}".format("output/full_ranking_validation_result.text")) 

Also, you can conduct full-ranking on the test set.

In [None]:
# For each query, calculate the scores of all passages on the test set.
%memit scores = full_ranker.score(tokenised_queries_test)

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
    
with codecs.open("output/full_ranking_test_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score[:100]:
            ranking+=1         
            feature_1 = score
            feature_2 = len(full_ranker.passages[p_id])             
    
            file.write('\t'.join([q_id, p_id, str(ranking), str(feature_1), str(feature_2), "full_ranking_on_the_test_set"])+os.linesep) 

print("Produce file {}".format("output/full_ranking_test_result.text")) 

## Re-ranking

Re-ranking takes the top 100 passages returned by full-ranking and ranks them again more carefully.   
Re-ranking methods are usually based on neural networks and are more complex than full-ranking methods.

Here, we implement __RankNet__ (https://dl.acm.org/doi/pdf/10.1145/1102351.1102363) and regard it as a re-ranking method.

You are encouraged to implement more advanced re-ranking methods, such as LambdaRank or even embedding-based rankers.
You are also asked to implement the re-ranking method yourself; you are not allowed to use off-the-shelf re-ranking models.
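
For intuition, RankNet models the probability that passage i should rank above passage j as a sigmoid of the score difference and minimises the cross-entropy of that probability. The sketch below computes this pairwise cost in plain Python for pairs where passage i has the higher label. The function names are hypothetical, and this is not necessarily how the provided `ranknet` module is written.

```python
import math


def pairwise_logistic_loss(score_i, score_j):
    """RankNet cost for one pair in which passage i should rank above passage j.

    Equals log(1 + exp(-(score_i - score_j))), computed in a numerically stable way.
    """
    diff = score_i - score_j
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranknet_query_loss(scores, labels):
    """Average pairwise cost over all pairs of one query whose labels differ."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                losses.append(pairwise_logistic_loss(scores[i], scores[j]))
    return sum(losses) / len(losses) if losses else 0.0


# The relevant passage (label 1) scored higher than the irrelevant one: small loss.
print(ranknet_query_loss([2.0, 0.0], [1, 0]))
# The same pair scored the wrong way round: larger loss.
print(ranknet_query_loss([0.0, 2.0], [1, 0]))
```

Pairs with equal labels contribute nothing in this sketch, so a query whose top 100 passages are all unlabelled yields a loss of 0.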

First, we need to set the hyperparameters for RankNet.

In [None]:
# hyperparameters for RankNet
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=30)
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--input_size", type=int, default=2)
parser.add_argument("--hidden_size1", type=int, default=128)
parser.add_argument("--hidden_size2", type=int, default=128)
parser.add_argument("--output_size", type=int, default=1)
parser.add_argument("--batch_size", type=int, default=512)
parser.add_argument("--random_seed", type=int, default=0)
args = parser.parse_known_args()[0]

Also, we need to ensure reproducibility.

In [None]:
np.random.seed(args.random_seed)
torch.manual_seed(args.random_seed)
torch.cuda.manual_seed_all(args.random_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # benchmark mode may pick different cuDNN algorithms between runs

In [None]:
from ranknet import train, inference

Next, you need to train RankNet on the training set.

In [None]:
# load the full-ranking result on the training set.
q_id = []
features = []
labels = []
        
print("Load file {}".format("output/full_ranking_training_result.text"))
with codecs.open("output/full_ranking_training_result.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split('\t')
        q_id.append(content[0]) 
        features.append([float(content[3]),float(content[4])])
        labels.append(labels_training[content[0]][content[1]] if content[1] in labels_training[content[0]] else 0)

# train model
%memit train(args, q_id, features, labels)

Then, you need to conduct inference on the validation set.

In [None]:
# load the full-ranking result on the validation set.

print("Load file {}".format("output/full_ranking_validation_result.text")) 
q_id = []
p_id = []
features = []
        
with codecs.open("output/full_ranking_validation_result.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split('\t')
        q_id.append(content[0]) 
        features.append([float(content[3]),float(content[4])])
        p_id.append(content[1])

# conduct inference on the validation set.
%memit scores = inference(args, q_id, p_id, features) 

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
        
with codecs.open("output/re_ranking_validation_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score:
            ranking+=1           
                    
            file.write('\t'.join([q_id, p_id, str(ranking), str(score), "re_ranking_on_the_validation_set"])+os.linesep)

# report where the result file was written.
print("Produce file {}".format("output/re_ranking_validation_result.text")) 

Similarly, you need to conduct inference on the test set.

In [None]:
# load the full-ranking result on the test set.

print("Load file {}".format("output/full_ranking_test_result.text")) 
q_id = []
p_id = []
features = []
        
with codecs.open("output/full_ranking_test_result.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split('\t')
        q_id.append(content[0]) 
        features.append([float(content[3]),float(content[4])])
        p_id.append(content[1])

# conduct inference on the test set.
%memit scores = inference(args, q_id, p_id, features) 

# rank the calculated scores from largest to smallest.
for q_id, p2score in scores.items():
    sorted_p2score=sorted(p2score.items(), key=lambda x:x[1], reverse = True)
    scores[q_id]=sorted_p2score
        
with codecs.open("output/re_ranking_test_result.text", "w", "utf-8") as file:
    for q_id, p2score in scores.items():
        ranking=0
        for (p_id, score) in p2score:
            ranking+=1           
                    
            file.write('\t'.join([q_id, p_id, str(ranking), str(score), "re_ranking_on_the_test_set"])+os.linesep)

# report where the result file was written.
print("Produce file {}".format("output/re_ranking_test_result.text")) 

## Evaluation

Here, we evaluate the final ranked lists on the validation set with Mean Reciprocal Rank (MRR).
We encourage you to add more metrics, such as NDCG, which takes graded relevance into account.
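
As a starting point for NDCG, here is a minimal sketch. It assumes the same inputs that are passed to `mrr` in this notebook: a dictionary `{query_id: [passage_id, ...]}` in ranked order and a label dictionary `{query_id: {passage_id: relevance_score}}`. The function names `dcg` and `ndcg` are hypothetical. The gain `2**rel - 1` is one common choice; a linear gain is another.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list of graded relevance scores."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(ranked_lists, labels, k):
    """Mean NDCG@k over all queries in ranked_lists."""
    values = []
    for q_id, ranked in ranked_lists.items():
        judged = labels.get(q_id, {})
        # Unlabelled passages are treated as non-relevant (score 0).
        gains = [judged.get(p_id, 0) for p_id in ranked[:k]]
        ideal = sorted(judged.values(), reverse=True)[:k]
        ideal_dcg = dcg(ideal)
        values.append(dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0)
    return sum(values) / len(values) if values else 0.0


labels_demo = {"q1": {"a": 2, "b": 1}}
print(ndcg({"q1": ["a", "b"]}, labels_demo, 10))  # 1.0, the ideal order
print(ndcg({"q1": ["b", "a"]}, labels_demo, 10))  # below 1.0, the order is swapped
```

Unlike MRR, which only looks at the position of the first relevant passage, this metric rewards placing passages with higher relevance scores nearer the top.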

In [None]:
from metrics import mrr

Evaluate full-ranking on the validation set.

In [None]:
scores = defaultdict(list)
with codecs.open("output/full_ranking_validation_result.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split('\t')
        scores[content[0]].append(content[1])

print("Full-ranking")
print('MRR@{}: {:.4f}'.format(100, mrr(scores, labels_validation, 100)))

Evaluate re-ranking on the validation set.

In [None]:
scores = defaultdict(list)
with codecs.open("output/re_ranking_validation_result.text", "r", "utf-8") as file:
    for line in file.readlines():
        content = line.split('\t')
        scores[content[0]].append(content[1])

print("Re-ranking")
print('MRR@{}: {:.4f}'.format(100, mrr(scores, labels_validation, 100)))

## Submission to CodaLab Leaderboard

You are asked to zip the final ranked lists on the validation and test sets and then submit the zipped file to the leaderboard.
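
Before zipping, it may be worth running a quick sanity check on the ranked lists. The sketch below only checks the tab-separated layout written by the cells above (query id, passage id, rank, score, tag) and the limit of 100 passages per query. The function name `check_ranked_lines` is hypothetical, and the exact checks performed by the leaderboard are not known here.

```python
from collections import defaultdict


def check_ranked_lines(lines, max_rank=100):
    """Return a list of problems found in tab-separated ranked-list lines."""
    problems = []
    ranks = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            problems.append(f"line {line_number}: expected at least 4 fields, got {len(fields)}")
            continue
        q_id, rank = fields[0], fields[2]
        if not rank.isdigit():
            problems.append(f"line {line_number}: rank {rank!r} is not a positive integer")
            continue
        ranks[q_id].append(int(rank))
    for q_id, seen in ranks.items():
        if seen != list(range(1, len(seen) + 1)):
            problems.append(f"{q_id}: ranks are not consecutive starting from 1")
        if len(seen) > max_rank:
            problems.append(f"{q_id}: {len(seen)} passages, more than {max_rank}")
    return problems


good = ["qid_1\tpid_5\t1\t0.9\ttag\n", "qid_1\tpid_7\t2\t0.4\ttag\n"]
bad = ["qid_1\tpid_5\t2\t0.9\ttag\n"]
print(check_ranked_lines(good))  # []
print(check_ranked_lines(bad))
```

To check a real file, pass the lines read from it, for example the contents of `output/re_ranking_test_result.text`.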

In [None]:
# zip results
studentnumber = "201814828"
studentname = "ChuanMeng"

filename = f"{studentnumber}_{studentname}_codalab_submission.zip"

with ZipFile("output/"+filename, 'w') as zipObj:
    zipObj.write("output/re_ranking_validation_result.text","re_ranking_validation_result.text")
    zipObj.write("output/re_ranking_test_result.text","re_ranking_test_result.text")