# Argument retrieval for comparative questions

The task was selected from those proposed by Touché at CLEF 2022; a detailed explanation is available on the [task website](https://touche.webis.de/clef22/touche22-web/argument-retrieval-for-comparative-questions.html).   
To recap: given a comparative question and a collection of documents, we need to retrieve the most relevant text passages for either or both of the compared objects, and to detect the stance of each passage towards the objects it discusses. The first part of this notebook explains in detail how we structured the document retrieval part; at the end we also test the stance detection task (the training procedure and a full explanation of the stance detection model are in the separate notebook 'stance_detection.ipynb').

Some cells in this notebook need the indexes or the datasets. To speed up the downloads and avoid repeating computationally heavy steps, we created a shared folder on Drive that contains the files needed to test our proposed system.      
Link to Drive shared folder: TODO: add link.

To be able to mount the shared files you first need to "Add shortcut to your Drive" when accessing the specified link.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installation and import of dependencies

Some functions written in external Python scripts are essential to run this notebook (e.g. creating the retrieval pipeline or downloading the datasets). The mandatory folders, which contain the Python scripts, are *src* and *utils*. In addition, a *requirements.txt* file is provided to install the dependencies.

These folders and files are expected in the current working directory, which in Colab is the */content/* directory.

In [None]:
!git clone --recursive https://github.com/castorini/pygaggle.git

Cloning into 'pygaggle'...
remote: Enumerating objects: 1539, done.
remote: Counting objects: 100% (609/609), done.
remote: Compressing objects: 100% (207/207), done.
remote: Total 1539 (delta 514), reused 430 (delta 402), pack-reused 930
Receiving objects: 100% (1539/1539), 505.03 KiB | 3.32 MiB/s, done.
Resolving deltas: 100% (988/988), done.
Submodule 'tools' (https://github.com/castorini/anserini-tools.git) registered for path 'tools'
Cloning into '/content/pygaggle/tools'...
remote: Enumerating objects: 718, done.        
remote: Counting objects: 100% (475/475), done.        
remote: Compressing objects: 100% (411/411), done.        
remote: Total 718 (delta 73), reused 456 (delta 63), pack-reused 243        
Receiving objects: 100% (718/718), 57.78 MiB | 21.22 MiB/s, done.
Resolving deltas: 100% (157/157), done.
Submodule path 'tools': checked out '808f48711b5e172da6aec8b1855518c8ea65489f'


In [None]:
!pip install -q -r requirements.txt


In [None]:
!pip -q install pygaggle/

  Preparing metadata (setup.py) ... done

In [None]:
# python modules
import itertools
import os
import os.path

# 3rd-party modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm

# user modules
import utils.manage_files

## Confirm the remote directory

The remote directory (e.g. on Google Drive) contains the precomputed indexes and the downloaded files, which saves time when running tests and computations. The cell below therefore asks the user to confirm the path, so that files are retrieved from the correct directory.

In [None]:
# default path for the remote directory
default_remote_path = "/content/drive/MyDrive/tmp/NLP_project/"
while True:
    # Ask the user whether the default path is correct; otherwise ask for a
    # new path. The loop exits only once the chosen path exists.
    res = input(f"Are you sure {default_remote_path} is the right path? y/n ")
    if res.strip().lower() == "y":
        remote_path = default_remote_path
    else:
        remote_path = input("Enter new path: ")

    # check that the path exists
    if os.path.isdir(remote_path):
        break
    print("Invalid path")
print(f"The path of the root folder is {remote_path}")

Are you sure /content/drive/MyDrive/tmp/NLP_project/ is the right path? y/n y
The path of the root folder is /content/drive/MyDrive/tmp/NLP_project/


## Download datasets

To simplify importing the necessary files, we packed all of them into a single .tar.gz archive stored in the shared folder.   
If you skip the import from Drive, the files are downloaded from the links provided by the Touché team.

In [None]:
datasets_filename = "downloads.tar.gz"
datasets_path = os.path.join(remote_path, datasets_filename)

In [None]:
# Run the following cell if you want to import the files from Drive
!cp $datasets_path .
!tar -xzvf $datasets_filename
!rm $datasets_filename

downloads/
downloads/topics-task-2/
downloads/topics-task-2/topics-task-2.xml
downloads/touche-task2-2022-quality.qrels
downloads/touche-task2-passages-version-002.jsonl
downloads/touche-task2-passages-version-002-expanded-with-doc-t5-query.jsonl
downloads/topics-task-2-2021/
downloads/topics-task-2-2021/topics-task2-51-100.xml
downloads/touche-task2-2022-stance.qrels
downloads/touche-task2-2022-relevance.qrels


### ClueWeb12 corpus

The documents available for the retrieval task were selected from the ClueWeb12 dataset. 

All tested retrieval methods use the same corpus, which is downloaded from the task website. It comes in two versions: one with only the passage texts, and one in which the passage texts are expanded with generated queries.

We work with both versions for two reasons:
- To explore how they can be used in distinct types of indexes.
- To compare the performance they yield on the same index.

We load both versions using a function from the '*utils*' directory; if a file is not already present, it is downloaded from the provided URL. The data, in *JSONL* format, are then loaded into Pandas DataFrames.
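As an illustration of the loading step, here is a minimal sketch of reading JSONL records into a DataFrame with `pandas.read_json`. It is not the implementation of `utils.manage_files.open_df`, which may differ; the two toy records and their URLs are made up and only mirror the column names of the corpus.

```python
import io

import pandas as pd

# Two toy records in JSON Lines format: one JSON object per line.
# The values are invented; only the keys mirror the real corpus.
jsonl_text = (
    '{"id": "doc-1___1", "contents": "first passage", "chatNoirUrl": "https://example.org/1"}\n'
    '{"id": "doc-1___2", "contents": "second passage", "chatNoirUrl": "https://example.org/2"}\n'
)

# lines=True tells pandas to parse every line as a separate record.
toy_corpus_df = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(toy_corpus_df.shape)
```

With a real file, the `io.StringIO` object would simply be replaced by the path of the extracted `.jsonl` file.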

Download and load in a DataFrame the corpus with only passage texts

In [None]:
url_corpus = "https://zenodo.org/record/6802592/files/touche-task2-passages-version-002.jsonl.gz?download=1"
corpus_df = utils.manage_files.open_df(utils.manage_files.download_files(url_corpus))


Downloading and extracting files
'/content/downloads/touche-task2-passages-version-002.jsonl' already present


Take a look at the corpus

In [None]:
corpus_df.head()

| | id | contents | chatNoirUrl |
|---|---|---|---|
| 0 | clueweb12-0000tw-14-21168___1 | Shuga: Love, Sex, Money MTV Shuga Home Swag Bl... | https://chatnoir.eu/cache?uuid=f338e91e-a3e9-5... |
| 1 | clueweb12-0000tw-14-21168___2 | We LOVE sending #TeamShuga the exclusives. Ban... | https://chatnoir.eu/cache?uuid=f338e91e-a3e9-5... |
| 2 | clueweb12-0000tw-14-21168___3 | Now take note.. because you will be seeing a w... | https://chatnoir.eu/cache?uuid=f338e91e-a3e9-5... |
| 3 | clueweb12-0000tw-22-19226___1 | Sex and love: The modern matchmakers \| The Eco... | https://chatnoir.eu/cache?uuid=2bf4b08d-2f65-5... |
| 4 | clueweb12-0000tw-22-19226___2 | But have they? Feb 11th 2012 \| from the print ... | https://chatnoir.eu/cache?uuid=2bf4b08d-2f65-5... |


In [None]:
print(f"The corpus has {corpus_df.shape[0]} elements.")

The corpus has 868655 elements.


Download and load in a DataFrame the corpus with passage texts and expanded queries

In [None]:
url_corpus_exp = "https://zenodo.org/record/6873567/files/touche-task2-passages-version-002-expanded-with-doc-t5-query.jsonl.gz?download=1"
corpus_df_exp = utils.manage_files.open_df(utils.manage_files.download_files(url_corpus_exp))


Downloading and extracting files
'/content/downloads/touche-task2-passages-version-002-expanded-with-doc-t5-query.jsonl' already present


Difference in a passage text between the two corpora

In [None]:
print("Original text:", corpus_df.contents.iloc[0])
print("Original text expanded:", corpus_df_exp.contents.iloc[0])

Original text: Shuga: Love, Sex, Money MTV Shuga Home Swag Blog Cast Swag Video Team Shuga Partners Shuga Talks (NEW!) Unicef G-PANGE MTV Base MTV Staying Alive Shuga Premiere: Today Is The Day! Today is the day! #ShugaPremiere. We talk Twitter hashtags, competition winners and of course the FREE d/l of the official Shuga: Love, Sex, Money track feat Banky W, WizKid, L-Tido and Bon’eye Download The Official Shuga: Love, Sex, Money Track HERE For Free! Be the first to download the Shuga: Love, Sex, Money track featuring Banky W, WizKid, L-Tido and Bon’eye exclusively here! That Shuga Love Sex Money Premiere The Shuga Track, worldwide Twitter trending and how YOU can win a ticket to the OFFICIAL Shuga: Love, Sex, Money premiere… The Official Shuga: Love, Sex, Money Trailer The Official Shuga: Love, Sex, Money trailer is here!
Original text expanded: Do Asian-Americans Face Bias in Admissions at Elite Colleges? - NYTimes.com Home Page Today's Paper Video Most Popular Times Topics Search A
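
The two passages printed above come from different documents, which suggests that the rows of the two files are not in the same order. Below is a minimal sketch of how the same passage could be compared by looking it up through its `id`. The toy DataFrames stand in for `corpus_df` and `corpus_df_exp`, and the helper `contents_by_id` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for corpus_df and corpus_df_exp, deliberately in a different row order.
plain_df = pd.DataFrame({
    "id": ["a___1", "b___1"],
    "contents": ["text a", "text b"],
})
expanded_df = pd.DataFrame({
    "id": ["b___1", "a___1"],
    "contents": ["text b what is b?", "text a what is a?"],
})


def contents_by_id(df: pd.DataFrame, passage_id: str) -> str:
    """Return the contents of the passage with the given id (hypothetical helper)."""
    return df.loc[df["id"] == passage_id, "contents"].iloc[0]


# Pick one passage id and fetch it from both corpora.
passage_id = plain_df["id"].iloc[0]
original = contents_by_id(plain_df, passage_id)
expanded = contents_by_id(expanded_df, passage_id)
print("Original text:", original)
print("Expanded text:", expanded)
```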

### Topics, quality, relevance and stance

In this subsection we download the list of topics, i.e. the actual queries to submit to the model. Only 50 topics are available for the 2022 task, so we also retrieve the queries of the 2021 task. There are two reasons for this choice:
- To show that our model is not biased towards a few topics, but remains robust when more queries are provided.
- The quality and relevance judgements available for the 2022 task also cover topics from the previous year, so without the additional queries our results would be limited to a subset of the possible scores.

Three types of judgements are provided in the *QRELS* format: **quality**, **relevance** and **stance**. They are used to evaluate our different pipelines; only the last one (stance) is used to evaluate the stance classification.

The headers used to load the judgements into a Pandas DataFrame are: *TOPIC*, *Q0*, *DOC_ID*, *SCORE*.
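
To make the format concrete, here is a minimal sketch of parsing space-separated QRELS lines with `pandas.read_csv`. The three lines are toy data modelled on the judgements shown in this notebook, and the value of the second column is an assumption; the notebook itself loads the files through `utils.manage_files.open_df`, whose internals may differ.

```python
import io

import pandas as pd

# Toy QRELS content: topic, unused Q0 column, document id, score.
qrels_text = (
    "12 0 clueweb12-0002wb-18-34442___2 0\n"
    "12 0 clueweb12-0004wb-78-20304___1 1\n"
    "12 0 clueweb12-0004wb-78-20304___11 2\n"
)

columns = ["topic", "Q0", "doc_id", "relevance"]

# The file has no header row, so the column names are passed explicitly;
# the Q0 column carries no information and is dropped.
toy_qrels_df = pd.read_csv(io.StringIO(qrels_text), sep=" ", names=columns).drop("Q0", axis=1)
print(toy_qrels_df)
```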

In total, the list of **topics** contains 100 questions (50 from 2022 and 50 from 2021), but judgements are available for only 50 of them, distributed across the two lists.

Therefore we perform two types of evaluation:
- Quantitative: the *nDCG* score on the join of the provided topics and judgements (limited to the 50 judged topics).
- Qualitative: we select some queries and retrieved documents and judge them ourselves on stance, relevance and quality.
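
For reference, the quantitative metric can be sketched as follows. This is a minimal illustration of nDCG@k with linear gains and a log2 rank discount, which is one common definition; the evaluation code used by our pipelines may apply a different variant (for example exponential gains). The function names and the toy scores are assumptions made for the example.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Judged scores of the documents in the order the system returned them ...
ranked_gains = [2, 0, 1]
# ... and the scores of all judged documents for the same topic.
all_gains = [2, 1, 0, 0]

toy_ndcg = ndcg_at_k(ranked_gains, all_gains, k=5)
print(f"nDCG@5 = {toy_ndcg:.4f}")
```

A ranking that places the judged documents in decreasing order of score obtains 1.0; every misplaced relevant document lowers the value.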

#### Topics

In [None]:
# Download and parse the xml file of the 0-50 topics of Touche 2022
url_topics_22 = "https://zenodo.org/record/6873559/files/topics-task-2.zip?download=1"
topics_22_filename = os.path.join(utils.manage_files.download_files(url_topics_22), "topics-task-2.xml")
topics = utils.manage_files.open_xml(topics_22_filename)

# Download and parse the xml file of the 51-100 topics of Touche 2021
url_topics_21 = "https://zenodo.org/record/6873565/files/topics-task-2-2021.zip?download=1"
topics_21_filename = os.path.join(utils.manage_files.download_files(url_topics_21), "topics-task2-51-100.xml")
topics += utils.manage_files.open_xml(topics_21_filename)


Downloading and extracting files
'/content/downloads/topics-task-2' already present

Downloading and extracting files
'/content/downloads/topics-task-2-2021' already present


As you can see below, we have a list of 100 topics, but for the evaluation we use only the ones selected by the Touché team.

In [None]:
print(f"There are {len(topics)} topics.\n{topics}")

There are 100 topics.
['What is the difference between sex and love?', 'Which is better, a laptop or a desktop?', 'Which is better, Canon or Nikon?', 'What are the best dish detergents?', 'What are the best cities to live in?', 'What is the longest river in the U.S.?', 'Which is healthiest: coffee, green tea or black tea and why?', 'What are the advantages and disadvantages of PHP over Python and vice versa?', 'Why is Linux better than Windows?', 'How to sleep better?', 'Should I buy an LCD TV or a plasma TV?', 'Train or plane? Which is the better choice?', 'What is the highest mountain on Earth?', 'Should one prefer Chinese medicine or Western medicine?', 'What are the best washing machine brands?', 'Should I buy or rent?', 'Do you prefer cats or dogs, and why?', 'What is the better way to grill outdoors: gas or charcoal?', 'Which is better, MAC or PC?', 'What is better: to use a brush or a sponge?', 'Which is better, Linux or Microsoft?', 'Which is better, Pepsi or Coke?', 'What is b

#### Relevance judgements

In [None]:
# Download 2022 relevance qrels for 50 topics
url_relevance = "https://zenodo.org/record/6873567/files/touche-task2-2022-relevance.qrels?download=1"

rel_names = ["topic", "0", "doc_id", "relevance"]

relevance_df = utils.manage_files\
        .open_df(utils.manage_files.download_files(url_relevance), names=rel_names, sep=" ")\
        .drop("0", axis=1)


Downloading files


Downloading file: 100%|██████████| 78.4k/78.4k [00:00<00:00, 640kB/s]


In [None]:
relevance_df.head()

| | topic | doc_id | relevance |
|---|---|---|---|
| 0 | 12 | clueweb12-0002wb-18-34442___2 | 0 |
| 1 | 12 | clueweb12-0004wb-69-30215___112 | 0 |
| 2 | 12 | clueweb12-0004wb-78-20304___1 | 1 |
| 3 | 12 | clueweb12-0004wb-78-20304___11 | 2 |
| 4 | 12 | clueweb12-0008wb-62-05967___1 | 0 |


#### Quality judgements

In [None]:
# Download 2022 quality qrels for 50 topics
url_quality = "https://zenodo.org/record/6873567/files/touche-task2-2022-quality.qrels?download=1"

qual_names = ["topic", "0", "doc_id", "quality"]

quality_df = utils.manage_files\
        .open_df(utils.manage_files.download_files(url_quality), names=qual_names, sep=" ")\
        .drop("0", axis=1)


Downloading files


Downloading file: 100%|██████████| 78.4k/78.4k [00:00<00:00, 652kB/s]


In [None]:
quality_df.head()

| | topic | doc_id | quality |
|---|---|---|---|
| 0 | 12 | clueweb12-0002wb-18-34442___2 | 2 |
| 1 | 12 | clueweb12-0004wb-69-30215___112 | 2 |
| 2 | 12 | clueweb12-0004wb-78-20304___1 | 2 |
| 3 | 12 | clueweb12-0004wb-78-20304___11 | 2 |
| 4 | 12 | clueweb12-0008wb-62-05967___1 | 0 |


#### Stance judgements

In [None]:
# Download 2022 stance qrels for 50 topics
url_stance = "https://zenodo.org/record/6873567/files/touche-task2-2022-stance.qrels?download=1"

stance_names = ["topic", "0", "doc_id", "stance"]

stance_df = utils.manage_files\
        .open_df(utils.manage_files.download_files(url_stance), names=stance_names, sep=" ")\
        .drop("0", axis=1)


Downloading files


Downloading file: 100%|██████████| 84.9k/84.9k [00:00<00:00, 257kB/s] 


In [None]:
stance_df.head()

| | topic | doc_id | stance |
|---|---|---|---|
| 0 | 12 | clueweb12-0002wb-18-34442___2 | NO |
| 1 | 12 | clueweb12-0004wb-69-30215___112 | NO |
| 2 | 12 | clueweb12-0004wb-78-20304___1 | SECOND |
| 3 | 12 | clueweb12-0004wb-78-20304___11 | NEUTRAL |
| 4 | 12 | clueweb12-0008wb-62-05967___1 | NO |


## Document retrieval and indexes explanation

During the task, all teams were provided with an API key for the [ChatNoir](https://www.chatnoir.eu/doc/) system, an Elasticsearch-based search engine offering a document retrieval interface for several corpora (including ClueWeb12). The API returns the most relevant documents for a query together with further information about each of them, such as the BM25 score, the PageRank score and the spam score. Unfortunately we were not able to obtain an API key, so we used the provided corpus of almost 900K passages to build our own indexes on which to perform document retrieval. 

Different types of indexes are used in Information Retrieval; the main ones are:
- ***sparse indexes***: an inverted index that maps each term to the documents containing it. Each document is in effect a high-dimensional, mostly-zero vector of term weights, and ranking relies on lexical matching functions such as BM25. This keeps index size and search cost manageable when working with large collections.
- ***dense indexes***: we store the embeddings of the documents, in our case created by TCT-ColBERT, so the semantic meaning of each passage is captured in a fixed-length vector. To retrieve the passages most similar to a given query, the index computes a similarity between the query embedding and the document embeddings; we use the inner product, which is the default metric.
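
The inner-product search performed by a dense index can be sketched with plain NumPy. This is an exhaustive toy search over hand-made three-dimensional vectors, meant only to illustrate the scoring; real embeddings come from the encoder, and the Faiss index used later may rely on approximate search structures instead.

```python
import numpy as np

# Toy document embeddings (one row per document) and a toy query embedding.
doc_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 0.0, 1.0],
    ],
    dtype="float32",
)
query_embedding = np.array([0.9, 0.1, 0.0], dtype="float32")

# Inner product between the query and every document.
scores = doc_embeddings @ query_embedding

# Indices of the top-3 documents, from the highest to the lowest score.
top_k = np.argsort(-scores)[:3]
print(top_k, scores[top_k])
```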

To improve the robustness and the quality of our retrieval we combine the two approaches in two ways: a hybrid pipeline, and a pipeline that uses a sparse index followed by reranking with MonoT5. 



We found several libraries that can build an index from a set of documents.
- The first one is [Pyserini](https://github.com/castorini/pyserini), a Python toolkit for reproducible information retrieval research with sparse and dense representations. The sparse index can be created on a custom collection of documents, while the creation of a dense index for our own documents was not available at the time of writing.
- This led us to look for another library that can build a dense index, and we found [autofaiss](https://github.com/criteo/autofaiss). ***autofaiss*** creates [Faiss](https://github.com/facebookresearch/faiss) kNN indexes, automatically selecting suitable similarity search parameters. It only needs the embedding vector of each document, which we computed using the ***pyserini encode*** module and [TCT-Colbert-v2](https://arxiv.org/abs/2112.01488) pre-trained on the [MS MARCO dataset](https://microsoft.github.io/msmarco/).


At this point you can skip the subsections that explain how the indexes are created and go straight to the 'Models' section, where you will find the code to import the saved indexes and to test the models.

In [None]:
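def create_folder(folder_path: str):
    """Create folder_path and report whether it already existed."""
    try:
        os.makedirs(folder_path, exist_ok=False)
        print("Folder created")
    except FileExistsError:
        # Catch only the "already exists" case, so that other errors
        # (e.g. missing permissions) are not silently misreported
        print("Folder already exists")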

In [None]:
def save_corpus(df: pd.DataFrame, folder_path: str):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

    save_path = os.path.join(folder_path, "index.jsonl")
    df[["id", "contents"]].to_json(save_path, orient="records", lines=True)

### Creation of a sparse index 

To create a sparse index from a .jsonl file, Pyserini expects each record to have only two keys, 'id' and 'contents', so we remove the other columns from the DataFrame and save a new .jsonl file.   
We first create a 'collections' directory to hold the file and then save it there.

Both corpora can be saved this way, but for the sparse index below only the one with expanded queries is used.

In [None]:
# save in the collections folder the corpus expanded
collection_exp_path = "collections/corpus_exp"
save_corpus(corpus_df_exp, collection_exp_path)

To build the sparse index (a Lucene inverted index) on our passage corpus we invoke the script `python -m pyserini.index.lucene`. 

[Available parameters](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java) are:

Required:
- `--input collections/`: Directory location where to find the json collection.
- `--threads int`: Number of indexing threads.
- `--collection JsonCollection`: To indicate to the documents ingestor that the documents in input are inside a json file.
- `--generator DefaultLuceneDocumentGenerator`: Document generator class to create the index.

Optional general arguments:
- `--verbose bool`: Enables verbose logging for each indexing thread; can be noisy if collection has many small file segments. *Defaults to false*.
- `--quiet bool`: Turns off all logging. *Defaults to false*.

Optional arguments:
- `--index indexes/sparse_exp/`: Directory where to save the index files.
- `--storePositions bool`: Boolean switch to index store term positions; needed for phrase queries. *Defaults to false*.
- `--storeDocvectors bool`: Boolean switch to store document vectors; needed for (pseudo) relevance feedback. *Defaults to false*.
- `--storeContents bool`: Boolean switch to store document contents. *Defaults to false*.
- `--storeRaw bool`: Boolean switch to store raw source documents. *Defaults to false*.
- `--keepStopwords bool`: Boolean switch to keep stopwords. *Defaults to false*.
- `--stopwords str`: Path to file with stopwords. *Defaults to null*.
- `--stemmer str`: Stemmer: one of the following {porter, krovetz, none}. *Defaults to 'porter'*.
- `--bm25.accurate bool`: Use the "accurate" variant of BM25, which is more computationally expensive but scores with exact document lengths. If not set, an approximation of the document lengths is used. *Defaults to false*.
- `--pretokenized bool`: Index pre-tokenized collections without any additional stemming or stopword processing. *Defaults to false*.

In [None]:
index_exp_path = "indexes/sparse_exp"
create_folder(index_exp_path)

Folder created


Create the sparse index 

In [None]:
!python -m pyserini.index.lucene --collection JsonCollection --input $collection_exp_path --index $index_exp_path --bm25.accurate --generator DefaultLuceneDocumentGenerator --threads 2 --storePositions --storeDocvectors --storeRaw

By default Pyserini applies Porter stemming and stopword removal to the input texts.

### Creation of a dense index

As for the sparse index, Pyserini expects each record of the .jsonl file to have only two keys, 'id' and 'contents', so we remove the other columns from the DataFrame and save a new .jsonl file.   
We first create a 'collections' directory to hold the file and then save it there.

Both corpora can be saved this way, but for the dense index only the one with the plain passage texts is used.

In [None]:
# save in the collections folder the corpus
collection_path = "collections/corpus"
save_corpus(corpus_df, collection_path)

As explained before, Pyserini did not yet support creating a dense index on a custom document collection. We therefore first computed the document embeddings with Pyserini and then built the index with autofaiss.   
The parameters passed to the encode module are:

Input:
- `--corpus`: Directory that contains corpus files to be encoded, in jsonl format.
- `--fields`: Keys of the json to consider for the embedding.
- `--shard-id 0`: Index of the shard to encode, in case the corpus is split into several shards.
- `--shard-num 1`: Total number of shards. We use a single shard; with more shards the command must be run once per shard, changing `--shard-id` each time.

Output:
- `--embeddings`: Directory where to put the embeddings once computed.

Encoder:
- `--encoder`: Encoder used to compute the embeddings (in our case TCT-ColBERT-v2 trained on MS MARCO).
- `--fields`: Fields to encode, equal to the previous `--fields` parameter.
- `--batch-size`: Batch size to use.
- `--fp16`: Run inference in half precision through PyTorch autocast, which speeds up the computation.

In [None]:
embeddings_path = "embeddings/dense"
create_folder(embeddings_path)

Folder already exists


Create the embeddings

In [None]:
!python -m pyserini.encode input --corpus $collection_path --fields text --shard-id 0 --shard-num 1 output --embeddings $embeddings_path encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields text --batch-size 32 --fp16

Downloading: 100% 559/559 [00:00<00:00, 532kB/s]
Downloading: 100% 438M/438M [00:10<00:00, 42.5MB/s]
Downloading: 100% 334/334 [00:00<00:00, 336kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 258kB/s]
Downloading: 100% 112/112 [00:00<00:00, 92.3kB/s]
868655it [00:08, 99616.76it/s]
100% 27146/27146 [1:17:09<00:00,  5.86it/s]


After this step the embeddings are stored in a .jsonl file, under the 'vector' key. The file is very large, so we split it into smaller ones that fit in RAM. Moreover, autofaiss takes .npy files as input, so we split the embedding vectors into .npy files that contain 70000 vectors each.

Create the npy_embeddings directory

In [None]:
npy_path = "npy_embeddings/"
create_folder(npy_path)

Folder already exists


In [None]:
def convert_to_npy(path, npy_path, file_len, chunksize=70000):
    '''
        Read a .jsonl file in chunks and write one .npy file per chunk, keeping
        only the 'vector' key, that is the embedding of the documents.
        Parameters:
            - path: str 
                The path of the .jsonl file that contains the embeddings in the 'vector' key.
            - npy_path: str
                The directory where the .npy files are saved.
            - file_len: int
                The number of elements inside the .jsonl file.
            - chunksize: int
                The number of lines to read at each step, so the number of vectors for each new
                .npy file.
    '''
    # Round up so that the last, partially filled chunk is counted too
    steps = -(-file_len // chunksize)
    for i, chunk in enumerate(tqdm(pd.read_json(path, lines=True, chunksize=chunksize), total=steps)):
        # Stack the per-document vectors of this chunk into a 2D array
        vectors = np.array(chunk['vector'].tolist())

        # Save different files to avoid RAM consumption.
        # The +10 offset keeps the file numbers two-digit, presumably so that
        # the file names sort in chunk order.
        idx_filename = os.path.join(npy_path, f"embeddings_{i+10}.npy")
        np.save(idx_filename, vectors)
        del vectors

In [None]:
# Actually call the function
# The embeddings were computed on the passage-only corpus
file_len = corpus_df.shape[0]
embeddings_filename = os.path.join(embeddings_path, "embeddings.jsonl")
convert_to_npy(embeddings_filename, npy_path, file_len)

13it [04:28, 20.67s/it]


Given the embedding vectors, autofaiss automatically builds a dense index when the following cells are executed:

In [None]:
index_dense_path = "indexes"
create_folder(index_dense_path)

Folder created


In [None]:
from autofaiss import build_index

knn_filename = os.path.join(index_dense_path, "dense.index")
dense_filename = os.path.join(index_dense_path, "dense_index_infos.json")

# Load the .npy files from the "npy_embeddings" directory where we saved them
build_index(embeddings=npy_path, 
            index_path=knn_filename,
            index_infos_path=dense_filename, 
            max_index_memory_usage="6GB",
            current_memory_available="9GB")

100%|██████████| 13/13 [00:00<00:00, 663.57it/s]
 36%|███▌      | 41/114 [00:29<00:02, 24.65it/s]

## Models

To load the pre-saved indexes from the Drive shared folder run the following cell:

In [None]:
indexes_filename = "indexes.tar.gz"
indexes_path = os.path.join(remote_path, indexes_filename)

In [None]:
# Load the indexes from Drive
!cp $indexes_path .
!tar -xzvf $indexes_filename
!rm $indexes_filename

We created a ```DocumentsIndex``` class to import the dense and sparse indexes and to set their parameters; you can find it in the ```src``` directory of the project.

Before presenting the different pipelines that we implemented for document retrieval, we create the instance of the dense index: it is quite heavy in memory, so we reuse the same instance in all the pipelines that need it. 
We also create a directory for saving the search results.

In [None]:
results_folder = "results/"
create_folder(results_folder)

Folder created


In [None]:
rel_qrels = "downloads/touche-task2-2022-relevance.qrels"
quality_qrels = "downloads/touche-task2-2022-quality.qrels"

In [None]:
from src.documents_index import DocumentsIndex
from utils.retrieval_util import print_ndcg, retrieve_docs_ranked
from src.sparse_pipeline import SparsePipeline
from src.dense_pipeline import DensePipeline
from src.hybrid_pipeline import HybridPipeline
from src.monot5_pipeline import MonoT5Pipeline
from src.evaluate_qrels import compute_recall



In [None]:
dense_index = DocumentsIndex('indexes/dense.index', 'dense')

Loading the dense index file ...
Loading the encoder castorini/tct_colbert-v2-hnp-msmarco ...


Downloading:   0%|          | 0.00/559 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/334 [00:00<?, ?B/s]


The process is finished correctly!



The two helper functions imported above, `print_ndcg` and `retrieve_docs_ranked`, are used respectively to print the nDCG scores computed for the different qrels files and to return a list of URLs given the corpus and the dictionary of the results.

### Sparse index (BM25 score)

BM25 is a probabilistic retrieval model that estimates the relevance of a passage to a query. In this simple pipeline we use the sparse index created before, i.e. the accurate BM25 variant for ranking, and we compute the *nDCG@5* as all teams did during the task. This pipeline is the ***baseline*** for document retrieval and it is the least effective in terms of *nDCG@5* computed on quality and relevance. 

As default values for the BM25 parameters we keep $k=0.9$ and $b=0.4$ ($k$ is the term-frequency saturation parameter, usually written $k_1$, and $b$ controls the document length normalization), because these are the values that work best in combination with the dense index in the hybrid pipeline. 
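
To show what the two parameters control, here is a minimal sketch of the textbook BM25 formula on a toy tokenized corpus. It is an illustration only: the function `bm25_score` is hypothetical, the corpus is invented, and the Lucene implementation used by Pyserini differs in some details (for example analysis, stemming and the exact constants).

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=0.9, b=0.4):
    """Textbook BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    # b controls how strongly long documents are penalized
    length_norm = 1 - b + b * len(doc_terms) / avg_len
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # k1 controls how quickly repeated occurrences of a term saturate
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


toy_corpus = [
    ["linux", "is", "better", "than", "windows"],
    ["windows", "is", "popular"],
    ["cats", "and", "dogs"],
]
toy_query = ["linux", "windows"]
bm25_scores = [bm25_score(toy_query, doc, toy_corpus) for doc in toy_corpus]
print(bm25_scores)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing.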

To get the best possible *nDCG@5* using only the sparse index, we run a grid search over the BM25 parameters with respect to the relevance judgements. Within a certain range of parameters the scores are similar, so the grid search mainly lets us report the best score for comparison with other participants' results. 
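
The search itself can be sketched as follows. The function `evaluate_ndcg` is a hypothetical stand-in: in the real setting it would rebuild the sparse index wrapper with the given parameters, run the pipeline and return the *nDCG@5* on the relevance judgements. Here it is a toy surrogate with a single maximum, chosen only so that the example runs on its own, and the candidate values are likewise assumptions.

```python
import itertools


def evaluate_ndcg(k1: float, b: float) -> float:
    """Hypothetical stand-in for running the pipeline and measuring nDCG@5.

    This toy surrogate is not a measurement; it only provides a surface
    with a unique maximum for the grid search to find.
    """
    return 1.0 - (k1 - 1.05) ** 2 - (b - 0.7) ** 2


# Candidate values for the two BM25 parameters (assumed for the example).
k1_values = [0.9, 1.05, 1.2, 1.55]
b_values = [0.4, 0.7, 0.8]

# Evaluate every combination and keep the best-scoring pair.
best_k1, best_b = max(
    itertools.product(k1_values, b_values),
    key=lambda params: evaluate_ndcg(*params),
)
print(f"best k={best_k1}, b={best_b}")
```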

In the end, the best values are $k=1.05$ and $b=0.7$, which yield an *nDCG@5* of **0.473** on relevance and **0.500** on quality.

To see how the pipeline is implemented, check the ```src/sparse_pipeline.py``` file, which contains the related class. In the following experiments we retrieve the top-100 documents, which we consider a good size for computing Recall@K and *nDCG@K* to evaluate both retrieval and ranking.
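
As a reference for the retrieval metric, Recall@K for a single topic can be sketched as below. This follows a common definition (the fraction of relevant documents found among the top K results) and is not necessarily how `compute_recall` in `src/evaluate_qrels.py` is implemented; the function name and the toy ids are assumptions.

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(retrieved_ids[:k]))
    return hits / len(relevant)


# Toy judged-relevant documents and a toy ranked result list.
toy_relevant = ["d1", "d2", "d4"]
toy_retrieved = ["d3", "d1", "d2", "d5"]

recall_2 = recall_at_k(toy_relevant, toy_retrieved, k=2)
recall_4 = recall_at_k(toy_relevant, toy_retrieved, k=4)
print(recall_2, recall_4)
```

Averaging this value over all judged topics gives a single score for a run.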

In [None]:
sparse_index = DocumentsIndex("indexes/sparse_index", 'sparse', set_bm25=True, k=1.05, b=0.7)  

Loading the sparse index file ...

The process is finished correctly!



In [None]:
sparse_pipeline = SparsePipeline(results_folder, "sparse_pipeline", sparse_index)

# Evaluate the pipeline both on relevance and quality .qrels
sparse_scores, ndcg_sparse = sparse_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, k=100, 
                                evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_sparse)

The nDCG for touche-task2-2022-relevance qrel is:
               Tag    nDCG@5
0  sparse_pipeline  0.472967 

The nDCG for touche-task2-2022-quality qrel is:
               Tag    nDCG@5
0  sparse_pipeline  0.500204 



In [None]:
compute_recall(rel_qrels, "results/sparse_pipeline_results.qrels")

0.6927539674320243

### Sparse index using T5 expanded corpus

The Touché team also released a corpus with a DocT5Query expansion appended to each document. It has been shown that this can lead to [better performance](https://arxiv.org/abs/2103.04831), so we thought it was worth trying.  
We created a sparse index on the expanded corpus; the results were quite satisfying and allowed us to break the 0.5 threshold for the nDCG score even with a sparse index. 

In [None]:
sparse_exp_index = DocumentsIndex("indexes/sparse_exp_index", 'sparse', set_bm25=True, k=1.55, b=0.8)  

Loading the sparse index file ...

The process is finished correctly!



Similarly to the baseline model, we perform a grid search over the BM25 parameters. We noticed that a higher value of the *k* parameter is needed to achieve better results, probably because of the document expansion performed on the original corpus. 

In the end, with $k=1.55$ and $b=0.8$ we obtain an *nDCG@5* of **0.508** on relevance and **0.548** on quality.

In [None]:
sparse_exp_pipeline = SparsePipeline(results_folder, "sparse_exp_pipeline", sparse_exp_index)

# Evaluate the pipeline both on relevance and quality .qrels
sparse_scores, ndcg_sparse = sparse_exp_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, k=100, 
                                evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_sparse)

The nDCG for touche-task2-2022-relevance qrel is:
                   Tag    nDCG@5
0  sparse_exp_pipeline  0.508426 

The nDCG for touche-task2-2022-quality qrel is:
                   Tag    nDCG@5
0  sparse_exp_pipeline  0.548904 



In [None]:
compute_recall(rel_qrels, "results/sparse_exp_pipeline_results.qrels")

0.7364467296259576

### Dense index

In this case the approach is quite different: as seen before, the index performs a k-nearest-neighbour search between the query embedding and the document embeddings. Which type of encoder did we use?

We decided to use a version of TCT-ColBERT-v2 pre-trained on the second version of the MS MARCO dataset. You can find more information about TCT-ColBERT in the [paper](https://aclanthology.org/2021.repl4nlp-1.17.pdf). 
The general idea is that the embeddings capture the semantic meaning of the documents, including a notion of term importance; we then measure the similarity between embeddings with the inner product of the vectors.

In this pipeline we need to encode the queries with TCT-ColBERT-v2 before passing them to the index; the entire process is done within the `search` function implemented in the ```DocumentsIndex``` class. Once the queries are encoded, we give the vectors to the index, which retrieves the top-k documents with the highest inner product with respect to the query embedding.
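The retrieval step can be illustrated with an exhaustive search over toy vectors. This is only a sketch of the idea: the names `inner_product` and `dense_search` are hypothetical, the three-dimensional vectors stand in for real encoder outputs, and an actual dense index relies on an optimized library rather than on a Python loop.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def dense_search(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest inner product."""
    scored = [(doc_id, inner_product(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional embeddings (assumption: real ones come from the encoder)
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.2, 0.8, 0.1],
    "d3": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
top = dense_search(query_vec, doc_vecs, k=2)
```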

Moreover, we used a parameter to clean the queries by removing punctuation; in the case of the dense index this led us to the best **nDCG@5**.

Our best results are **0.594** on relevance and **0.614** on quality.
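The query cleaning mentioned above could look like the following sketch. The helper name `clean_query_text` is hypothetical; the actual behaviour of `clean_query=True` is defined in the pipeline classes and may differ in details.

```python
import string

def clean_query_text(query):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(query.translate(table).split())

cleaned = clean_query_text("Which is better, a laptop or a desktop?")
```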

In [None]:
dense_pipeline = DensePipeline(results_folder, 'dense_pipeline', dense_index)

dense_scores, ndcg_dense = dense_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, 
                                corpus_df, 
                                k=100, 
                                clean_query=True,
                                evaluate=True
                            )

print_ndcg(ndcg_dense)

The nDCG for touche-task2-2022-relevance qrel is:
              Tag    nDCG@5
0  dense_pipeline  0.594256 

The nDCG for touche-task2-2022-quality qrel is:
              Tag    nDCG@5
0  dense_pipeline  0.614225 



In [None]:
compute_recall(rel_qrels, "results/dense_pipeline_results.qrels")

0.7068018240091396

### Hybrid pipeline (sparse + dense index)

We also decided to test a combination of the two approaches, retrieving k documents from both indexes and combining the obtained scores. We took the main idea behind the following algorithm from the HybridSearcher class of Pyserini.

**Algorithm**
1. First, we save the minimum and maximum scores of the retrieved documents for both the sparse and the dense index (0 if no documents were retrieved).
2. Then we iterate over the union of the ids retrieved from the two indexes: if a document was found by an index, its score from that index is taken; otherwise the minimum score of that index is used.
3. Optionally, we normalize the scores; then we sum them, multiplying the sparse score by an alpha value (in our case alpha=0.2). 
4. Finally, we return the list re-ranked according to the newly computed scores.

Multiplying the sparse score by alpha gives more weight to the dense index: we know that it performs better, so we do not want the two scores to have the same weight in the sum. We decided to retrieve 700 documents per index because we observed empirically that increasing this number further does not change the resulting ranking.
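The fusion steps listed above can be sketched as follows. This is an illustrative version written from the description, not the code of `HybridPipeline`: the function name `hybrid_scores` is hypothetical, and the min-max scaling used when `normalize=True` is an assumption, since other normalizations are possible.

```python
def hybrid_scores(sparse_hits, dense_hits, alpha=0.2, normalize=False, k_end=100):
    """Fuse two {doc_id: score} dicts into a ranked list of (doc_id, score)."""
    def bounds(hits):
        # Minimum and maximum score of an index, 0 if nothing was retrieved
        return (min(hits.values()), max(hits.values())) if hits else (0.0, 0.0)

    def scale(score, lo, hi):
        # Min-max scaling (assumption, see the lead-in)
        return (score - lo) / (hi - lo) if hi > lo else 0.0

    s_min, s_max = bounds(sparse_hits)
    d_min, d_max = bounds(dense_hits)
    fused = {}
    for doc_id in set(sparse_hits) | set(dense_hits):
        # A document missed by one index gets that index's minimum score
        s = sparse_hits.get(doc_id, s_min)
        d = dense_hits.get(doc_id, d_min)
        if normalize:
            s, d = scale(s, s_min, s_max), scale(d, d_min, d_max)
        fused[doc_id] = alpha * s + d
    ranked = sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k_end]

# Toy scores: BM25-like values for the sparse index, inner products for the dense one
sparse_hits = {"d1": 12.0, "d2": 9.0, "d3": 7.0}
dense_hits = {"d2": 80.0, "d4": 78.0, "d1": 75.0}
ranking = hybrid_scores(sparse_hits, dense_hits, alpha=0.2)
```

In the toy example `d4` was found only by the dense index, yet it still ranks above `d1`, because the sparse contribution is scaled down by alpha.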

We observed that better performance was reached using the sparse index built on the expanded passage texts. The best parameters for the sparse index are the default ones, i.e. $k=0.9$ and $b=0.6$. Using the *nDCG@5* metric, we achieve **0.619** on relevance and **0.655** on quality. 

In [None]:
sparse_hybrid_index = DocumentsIndex("indexes/sparse_exp_index", 'sparse', set_bm25=True, k=0.9, b=0.6)  

hybrid_pipeline = HybridPipeline(results_folder, 'hybrid_pipeline', sparse_hybrid_index, dense_index)

scores_hybrid, ndcg_hybrid = hybrid_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, corpus_df, k=700, alpha=0.2, evaluate=True, k_end=100
                             )

print_ndcg(ndcg_hybrid)

Loading the sparse index file ...

The process is finished correctly!

The nDCG for touche-task2-2022-relevance qrel is:
               Tag    nDCG@5
0  hybrid_pipeline  0.618735 

The nDCG for touche-task2-2022-quality qrel is:
               Tag    nDCG@5
0  hybrid_pipeline  0.655515 



In [None]:
compute_recall(rel_qrels, "results/hybrid_pipeline_results.qrels")

0.8027718715098813

### Sparse index + MonoT5 re-ranking

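In this pipeline we first retrieve the top-100 documents from the sparse index built on the expanded corpus (BM25 with $k=1.05$ and $b=0.7$) and then re-rank them with MonoT5, a T5-based model that scores each query-document pair. Using the *nDCG@5* metric, we achieve **0.726** on relevance and **0.699** on quality, the best scores among the pipelines tested so far.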
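A two-stage pipeline of this kind retrieves candidates with the sparse index and then re-orders them with a second model. The sketch below shows only the pattern: `rerank` and `overlap_score` are hypothetical names, and the toy word-overlap scorer merely stands in for a neural re-ranker such as MonoT5.

```python
def rerank(query, candidates, score_fn, k=100):
    """Re-order the top-k first-stage candidates by a second-stage relevance score.

    candidates: list of (doc_id, text) pairs, best first, from the first stage.
    score_fn:   callable (query, text) -> float, a stand-in for the re-ranker.
    """
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates[:k]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy scorer (assumption): fraction of query terms found in the passage
def overlap_score(query, text):
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms)

candidates = [
    ("d1", "trains are comfortable"),
    ("d2", "planes are faster than trains"),
    ("d3", "cooking recipes"),
]
reranked = rerank("planes or trains", candidates, overlap_score)
```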

In [None]:
alt_sparse_exp_index = DocumentsIndex("indexes/sparse_exp_index", 'sparse', set_bm25=True, k=1.05, b=0.7)  

monot5_pipeline = MonoT5Pipeline(results_folder, "sparse_monot5_pipeline", alt_sparse_exp_index)

# Evaluate the pipeline both on relevance and quality .qrels
_, ndcg_monot5 = monot5_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, k=100, 
                                evaluate=True,
                                corpus_df= corpus_df_exp,
                                clean_query=False
                             )

print_ndcg(ndcg_monot5)

Loading the sparse index file ...

The process is finished correctly!

The nDCG for touche-task2-2022-relevance qrel is:
                      Tag    nDCG@5
0  sparse_monot5_pipeline  0.726057 

The nDCG for touche-task2-2022-quality qrel is:
                      Tag    nDCG@5
0  sparse_monot5_pipeline  0.699194 



## Further attempts 

First of all, we want to show how preprocessing impacted document retrieval. Then, we tried query expansion on the sparse index as a possible direction for future research.

### Preprocessing attempt

Given that we worked with texts taken from the internet, we tried preprocessing the corpus to check whether the overall performance of the pipelines would improve. Our goal was to remove some noise from the text so that document retrieval could focus on the important words.

We wrote a function to perform the following cleaning operations:
1. Make the documents lowercase.
2. Expand contractions.
3. Remove words with numbers inside.
4. Replace newlines, punctuation and characters outside the English alphabet with a space.
5. Remove adjacent spaces.
6. Remove URLs and stopwords.
7. Perform lemmatization.

For the lemmatization we used the ***spacy*** library, while for the expansion of the contractions we used the ***contractions*** library.

In [None]:
import contractions
import spacy

import string, re
# To enable progress bar in apply function
from tqdm.notebook import tqdm
tqdm.pandas()

nlp = spacy.load("en_core_web_sm", disable=['ner','parser'])
nlp.max_length=5000000

In [None]:
# Clean the documents performing pre-processing
def clean_documents(text, nlp):
    clean = text.lower()
    clean = contractions.fix(clean)
    # Remove URLs first: once punctuation is stripped the pattern can no longer match
    clean = re.sub(r"https?:/*\S+", " ", clean)
    # Replace punctuation with spaces
    clean = clean.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    # Remove words with a number inside
    clean = re.sub(r'\w*\d\w*', '', clean)
    # Replace every character that is not a lowercase letter (newlines included) with a space
    clean = re.sub(r'[^a-z]', ' ', clean)
    # Collapse adjacent spaces
    clean = re.sub(r' +', ' ', clean)
    # Lemmatize and drop stopwords
    clean = ' '.join(token.lemma_ for token in nlp(clean) if not token.is_stop)
    return clean

We saved the results of the cleaning in a new column of the dataframe in order to access them easily.

In [None]:
corpus_df['clean'] = corpus_df['contents'].progress_apply(clean_documents, nlp=nlp)

Unfortunately, we discovered that the entire process was useless for our task: the preprocessing negatively affected the quality of document retrieval. 

#### Sparse index

Using the same parameters as the baseline sparse index, we achieved worse results on both judgments with the *nDCG@5* metric: **0.411** on relevance and **0.437** on quality, a drop of *~0.06*.

In [None]:
sparse_clean_index = DocumentsIndex("indexes/sparse_clean_index", 'sparse_clean', set_bm25=True, k=1.05, b=0.7)

sparse_clean_pipeline = SparsePipeline(results_folder, "sparse_clean_pipeline", sparse_clean_index)

# Evaluate the pipeline both on relevance and quality .qrels
_, ndcg_clean_sparse = sparse_clean_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, k=40, 
                                evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_clean_sparse)

Loading the sparse index file ...

The process is finished correctly!

The nDCG for touche-task2-2022-relevance qrel is:
                     Tag    nDCG@5
0  sparse_clean_pipeline  0.411673 

The nDCG for touche-task2-2022-quality qrel is:
                     Tag    nDCG@5
0  sparse_clean_pipeline  0.437932 



#### Dense index

On the cleaned corpus the dense index is even worse than the cleaned sparse index, which is surprising because without preprocessing it is better by a remarkable margin.

Using the metric *nDCG@5* we got **0.384** on relevance and **0.429** on quality.

In [None]:
dense_clean_index = DocumentsIndex('indexes/dense_clean.index', 'dense')

dense_clean_pipeline = DensePipeline(results_folder, 'dense_clean_pipeline', dense_clean_index)

_, ndcg_dense_clean = dense_clean_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, 
                                corpus_df, 
                                k=40, 
                                clean_query=True,
                                evaluate=True
                            )

print_ndcg(ndcg_dense_clean)

Loading the dense index file ...
Loading the encoder castorini/tct_colbert-v2-hnp-msmarco ...

The process is finished correctly!

The nDCG for touche-task2-2022-relevance qrel is:
                    Tag    nDCG@5
0  dense_clean_pipeline  0.384234 

The nDCG for touche-task2-2022-quality qrel is:
                    Tag    nDCG@5
0  dense_clean_pipeline  0.429425 



### RM3

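RM3 is a pseudo-relevance feedback technique for query expansion: the query is expanded with terms taken from the top documents of a first retrieval round. We tested it on the sparse index built on the expanded corpus, with BM25 parameters $k=1.05$ and $b=0.7$. Using the *nDCG@5* metric, we obtained **0.498** on relevance and **0.530** on quality.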
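As a rough illustration of pseudo-relevance feedback in the style of RM3, the sketch below builds a term distribution from the top first-pass documents and interpolates it with the original query. It is a simplified version, not the implementation used by the index: `rm3_expand`, `fb_terms` and `orig_weight` are hypothetical names, and that they correspond to the `ft` and `lam` arguments used in this notebook is an assumption.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=20, orig_weight=0.9):
    """Return {term: weight} for the expanded query (simplified RM3 sketch).

    feedback_docs: list of (tokens, score) for the top first-pass documents.
    """
    original = {t: c / len(query_terms) for t, c in Counter(query_terms).items()}
    total_score = sum(score for _, score in feedback_docs)
    if total_score <= 0:
        return original  # nothing to learn from, keep the query unchanged
    feedback = Counter()
    for tokens, score in feedback_docs:
        for term, tf in Counter(tokens).items():
            # P(term | doc) weighted by the normalized retrieval score of the doc
            feedback[term] += (tf / len(tokens)) * (score / total_score)
    # Keep only the strongest feedback terms and renormalize them
    top = dict(feedback.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: w / norm for t, w in top.items()}
    # Interpolate the original query model with the feedback model
    return {
        t: orig_weight * original.get(t, 0.0) + (1 - orig_weight) * top.get(t, 0.0)
        for t in set(original) | set(top)
    }

# Toy first-pass results: (tokens, retrieval score)
feedback_docs = [
    ("python speed java speed".split(), 2.0),
    ("java memory".split(), 1.0),
]
expanded = rm3_expand(["python", "java"], feedback_docs)
```

The expanded query keeps most of its weight on the original terms and adds the feedback terms (`speed`, `memory`) with small weights.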

In [None]:
rm3_sparse_index = DocumentsIndex("indexes/sparse_exp_index", 'sparse', set_rm3=True, ft=20, fd=3, lam=0.9, set_bm25=True, k=1.05, b=0.7)  

rm3_sparse_pipeline = SparsePipeline(results_folder, "rm3_sparse_pipeline", rm3_sparse_index)

# Evaluate the pipeline both on relevance and quality .qrels
_, ndcg_rm3_sparse = rm3_sparse_pipeline.compute_results(
                                [rel_qrels, quality_qrels],
                                topics, k=40, 
                                evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_rm3_sparse)

Loading the sparse index file ...

The process is finished correctly!

The nDCG for touche-task2-2022-relevance qrel is:
                   Tag    nDCG@5
0  rm3_sparse_pipeline  0.497615 

The nDCG for touche-task2-2022-quality qrel is:
                   Tag    nDCG@5
0  rm3_sparse_pipeline  0.529924 

