# Argument retrieval for comparative questions

The task was selected from those proposed by Touché at CLEF 2022; a detailed description is available on the [task website](https://touche.webis.de/clef22/touche22-web/argument-retrieval-for-comparative-questions.html).   
In short, given a comparative question and a collection of documents, we need to retrieve the most relevant text passages for either or both of the compared objects and detect the stance of each passage towards the objects it discusses. The first part of this notebook explains in detail how we structured document retrieval; at the end we also test stance detection (the training procedure and a full description of the stance detection model are in the companion notebook 'stance_detection.ipynb').

Some steps in this notebook rely on pre-built indexes and additional datasets. To avoid long downloads and computationally heavy processing, we created a shared folder on Drive containing the files needed to test our system.
Link to Drive shared folder: TODO: add link.

To mount the shared files, first open the link above and select "Add shortcut to your Drive".

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Installation and import of dependencies

In [None]:
!pip install tqdm

In [None]:
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import xml.etree.ElementTree as ET

import utils.manage_files

## Download datasets

To simplify importing the necessary files, we packed all of them into a single .tar.gz archive.   
If you skip this step, the files will be downloaded from the links provided by the Touché team.

In [None]:
# Run the following cell if you want to import the files from Drive
!cp /content/drive/MyDrive/NLP_project/downloads.tar.gz .
!tar -xvf downloads.tar.gz
!rm downloads.tar.gz

### ClueWeb12 corpus

The documents available for the retrieval task were selected from the ClueWeb12 dataset. We load them with a helper that you can find in the '*utils*' directory; if the file is already present, the helper simply creates an instance that points to the existing file instead of downloading it again.

In [None]:
url_corpus = "https://zenodo.org/record/6802592/files/touche-task2-passages-version-002.jsonl.gz?download=1"
zip_path_corpus = "corpus.jsonl.gz"
file_path_corpus = "corpus.jsonl"

download_corpus = utils.manage_files.DownloadFile(file_path_corpus, zip_path_corpus, url_corpus)
download_corpus()

We can then load the .jsonl file into a Pandas dataframe and explore the data.

In [None]:
corpus_df = pd.read_json(download_corpus.file_name, lines=True)
corpus_df.head()

In [None]:
print(f"The corpus has {len(corpus_df)} elements.")

### Topics, quality, relevance and stance

In this subsection we download the list of topics, which are the actual queries submitted to the model, and the quality, relevance and stance qrels files used to evaluate our pipelines.  
The list of ***topics*** contains 100 entries, but only 50 were selected by the team for this task, so we evaluate our models on those.

The .qrels files have four columns: TOPIC, Q0, DOC_ID, SCORE.
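
As a minimal sketch of this four-column layout, the snippet below parses two made-up qrels lines (the document ids and scores are illustrative, not taken from the real files) and groups them into a nested dictionary, a shape that is often convenient for evaluation. The column names are our own choice.

```python
import io

import pandas as pd

# Two invented lines in the TOPIC Q0 DOC_ID SCORE layout (illustrative only)
sample_qrels = io.StringIO(
    "12 0 doc-a___1 2\n"
    "12 0 doc-b___3 0\n"
)

# The second column is not needed here, so it is dropped after parsing
qrels_df = pd.read_csv(
    sample_qrels, sep=" ", names=["topic", "q0", "doc_id", "score"]
).drop(columns="q0")

# Nested mapping: topic -> {doc_id: score}
qrels = {
    int(topic): {doc_id: int(score) for doc_id, score in zip(group["doc_id"], group["score"])}
    for topic, group in qrels_df.groupby("topic")
}
print(qrels)
```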

#### Topics

In [None]:
# Download and parse the xml file of the 0-50 topics
url_topics = "https://zenodo.org/record/6873559/files/topics-task-2.zip?download=1"
zip_path_topics = "topics-task-2.zip"
file_path_topics = "topics-task-2"

download_topics = utils.manage_files.DownloadFile(file_path_topics, zip_path_topics, url_topics)
download_topics()
# Retrieve a list of strings from the xml
topics = utils.manage_files.open_xml(download_topics.file_name + "/topics-task-2.xml")

# Topics from 51 to 100
url_topics_21 = "https://zenodo.org/record/6873565/files/topics-task-2-2021.zip?download=1"
zip_path_topics_21 = "topics-task-2-2021.zip"
file_path_topics_21 = "topics-task-2-2021"

download_topics_21 = utils.manage_files.DownloadFile(file_path_topics_21, zip_path_topics_21, url_topics_21)
download_topics_21()
topics += utils.manage_files.open_xml(download_topics_21.file_name + "/topics-task2-51-100.xml")

As you can see below, we have a list of 100 topics, but for the evaluation we will use only those selected by the Touché team.

In [None]:
print(f"There are {len(topics)} topics.\n{topics}")

#### Relevance

In [None]:
# Download relevance qrels 50 topics
url_relevance = "https://zenodo.org/record/6873567/files/touche-task2-2022-relevance.qrels?download=1"
file_path_rel = "relevance.qrels"

download_relevance = utils.manage_files.DownloadFile(file_path_rel, url=url_relevance)
download_relevance()

In [None]:
relevance_df = pd.read_csv(download_relevance.file_name, index_col=None, 
                    names=["topic", "0", "doc_id", "relevance"], sep=" ").drop("0", axis=1)

relevance_df.head()

#### Quality

In [None]:
# Download quality qrels 50 topics
url_quality = "https://zenodo.org/record/6873567/files/touche-task2-2022-quality.qrels?download=1"
file_path_qual = "quality.qrels"

download_quality = utils.manage_files.DownloadFile(file_path_qual, url=url_quality)
download_quality()

In [None]:
quality_df = pd.read_csv(download_quality.file_name, index_col=None, 
                    names=["topic", "0", "doc_id", "quality"], sep=" ").drop("0", axis=1)

quality_df.head()

#### Stance

In [None]:
# Download stance qrels 50 topics
url_stance = "https://zenodo.org/record/6873567/files/touche-task2-2022-stance.qrels?download=1"
file_path_stance = "stance.qrels"

download_stance = utils.manage_files.DownloadFile(file_path_stance, url=url_stance)
download_stance()

In [None]:
stance_df = pd.read_csv(download_stance.file_name, index_col=None, 
                    names=["topic", "0", "doc_id", "stance"], sep=" ").drop("0", axis=1)

stance_df.head()

## Document retrieval and ranking

During the task, all teams were provided with an API key for the [ChatNoir](https://www.chatnoir.eu/doc/) system, an Elasticsearch-based search engine that offers a document retrieval interface for several corpora (including ClueWeb12). The API returns the documents most relevant to a query together with further information for each of them, such as the BM25 score, the PageRank score and the spam score. Unfortunately we were not able to obtain an API key, so we decided to build our own indexes for document retrieval. Several types of indexes are used in IR; the main ones are:
- ***sparse indexes***: each document is represented by the terms it contains (a sparse bag-of-words vector) and stored in an inverted index that maps every term to the documents in which it appears. Documents are ranked with a lexical scoring function such as BM25, which keeps search efficient even on large collections.
- ***dense indexes***: we store the embeddings of the documents, in our case created by TCT-ColBERT, so the semantic meaning of each document is captured in a fixed-length vector. To retrieve the documents most similar to a given query, the index computes a similarity between the query embedding and the document embeddings; we decided to use the inner product, which is a default metric.
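
To make the dense case concrete, here is a small brute-force sketch of inner-product retrieval with NumPy. The embeddings are random toy vectors (assumptions for illustration, not TCT-ColBERT outputs), and a real Faiss index would typically use approximate search rather than this exhaustive scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 "documents" and 1 "query", each an 8-dimensional embedding
doc_embeddings = rng.normal(size=(5, 8)).astype(np.float32)
query_embedding = rng.normal(size=8).astype(np.float32)

# Inner product between the query and every document embedding
scores = doc_embeddings @ query_embedding

# Positions of the top-k documents, highest inner product first
top_k = 3
top_ids = np.argsort(-scores)[:top_k]
print(top_ids, scores[top_ids])
```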

To improve the robustness and quality of our retrieval, we decided to combine the two approaches, considering both a hybrid pipeline and a pipeline that uses a sparse index followed by reranking with MonoT5.

We found several libraries on the web that can create an index from a set of documents.
- The first one is [Pyserini](https://github.com/castorini/pyserini), a Python toolkit for reproducible information retrieval research with sparse and dense representations. It can build a sparse index on a custom collection of documents, whereas building a dense index on our own documents is not currently available.
- This led us to look for another library that can build a dense index, and we found [autofaiss](https://github.com/criteo/autofaiss). ***autofaiss*** creates [Faiss](https://github.com/facebookresearch/faiss) knn indexes, automatically selecting suitable similarity search parameters. It only needs the embedding vector of each document, which we computed using the ***pyserini encode*** module and the second version of [TCT-ColBERT](https://arxiv.org/pdf/2010.11386.pdf), trained on the [MS MARCO dataset](https://microsoft.github.io/msmarco/).


In [None]:
# Installation of the different libraries for using the indexes
!pip install -q pyserini
!pip install -q faiss-cpu==1.7.2
!pip install -q autofaiss

At this point you can simply run the following cell to import the pre-saved indexes, skip the subsections that explain how they were created, and go to the 'Models' section.

In [None]:
# Load the indexes from Drive
!cp /content/drive/MyDrive/NLP_project/indexes.zip .
!unzip indexes.zip
!rm indexes.zip

### Creation of a sparse index 

To create a sparse index from a .jsonl file, Pyserini expects each record to contain only two keys, 'id' and 'contents', so we remove the other columns from the dataframe and save a new .jsonl file.   
First we create a 'collections' directory for the file, and then we save it.

In [None]:
!mkdir collections
corpus_df.drop('chatNoirUrl', axis=1).to_json('collections/corpus_index.jsonl', orient="records", lines=True)

Now we can run the command that creates the sparse index, with the following parameters:
- ```--collection JsonCollection``` tells the document ingestor that the input documents are stored in a json file.
- ```--input collections``` is the directory that contains the json file.
- ```--index indexes/sparse_index``` is the directory where the index files are saved.
- ```--bm25.accurate``` if set, Anserini uses an algorithm that is more computationally expensive but more accurate. The "accurate" variant of BM25 computes the idf of terms by taking into account accurate document lengths. If not set, an approximation for idf will be used.
- ```--generator DefaultLuceneDocumentGenerator``` is the default generator used to create the index.
- ```--threads 2``` is the number of threads to use while creating the index.
- ```--storePositions``` stores term positions, needed for phrase queries.
- ```--storeDocvectors``` stores document vectors, needed for (pseudo) relevance feedback.
- ```--storeRaw``` stores raw source documents.

In [None]:
# Create the sparse index 
!python -m pyserini.index.lucene --collection JsonCollection --input collections/ --index indexes/sparse_index --bm25.accurate --generator DefaultLuceneDocumentGenerator --threads 2 --storePositions --storeDocvectors --storeRaw

By default Pyserini applies Porter stemming and stopword removal to the input texts.

### Creation of a dense index

As explained before, Pyserini does not yet support creating a dense index on a custom document collection. We therefore first computed the document embeddings with Pyserini and then built the index with autofaiss.   
The parameters passed to the encode module are:
- ```--corpus```, the .jsonl file that contains the documents (the same as before).
- ```--fields```, the key of the json to consider for the embedding.
- ```--shard-id 0```, the index of the current shard, in case we want to split the corpus.
- ```--shard-num 1```, the total number of shards; in our case there is only one, but you can set multiple shards and then run the command once per shard, changing the shard-id each time.
- ```--embeddings```, the directory where the computed embeddings are stored.
- ```--encoder```, the encoder used to compute the embeddings (in our case the second version of TCT-ColBERT, trained on MS MARCO).
- ```--fields``` (in the encoder section), which must be equal to the previous --fields parameter.
- ```--batch```, the batch size to use.
- ```--fp16```, which enables mixed-precision inference through PyTorch autocast to speed up the computation.

In [None]:
# Compute the embeddings
!python -m pyserini.encode input --corpus collections/corpus_index.jsonl --fields text --shard-id 0 --shard-num 1 output --embeddings embeddings/ encoder --encoder castorini/tct_colbert-v2-hnp-msmarco --fields text --batch 32 --fp16

After that, the embeddings are in a .jsonl file, in the 'vector' column. We need to split this very large file into smaller ones so that each of them fits in RAM. Moreover, autofaiss takes .npy files as input, so we decided to split the embedding vectors into .npy files that contain 75000 vectors each.

In [None]:
# Create the npy_embeddings directory
!mkdir npy_embeddings

import tqdm

def convert_to_npy(path, file_len, chunksize=75000):
    '''
        It takes as input a .jsonl file and it creates some .npy files taking
        only the 'vector' key, that is the embedding of the documents. It saves
        the results in the npy_embeddings directory.
        Parameters:
            - path: str 
                The path of the .jsonl file that contains the embeddings in the 'vector' key.
            - file_len: int
                The number of elements inside the .jsonl file.
            - chunksize: int
                The number of lines to read at each step, so the number of vectors for each new
                .npy file.
    '''
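    # Ceiling division, so that a final partial chunk is counted in the progress bar
    steps = -(-file_len // chunksize)
    for i, chunk in enumerate(tqdm.tqdm(pd.read_json(path, lines=True, chunksize=chunksize), total=steps)):
        # Stack the embeddings of this chunk into a single 2D array
        vectors = np.array(chunk['vector'].tolist())

        # Save one file per chunk to avoid RAM consumption. With the +10 offset the
        # file indices stay two-digit (for up to 90 chunks), so names sort in chunk order
        np.save(f'npy_embeddings/embeddings_{i+10}.npy', vectors)
        del vectors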

In [None]:
# Actually call the function
file_len = 868655
convert_to_npy('embeddings/embeddings.jsonl', file_len)

Given the embedding vectors, autofaiss automatically builds a dense index when the following cell is executed:

In [None]:
from autofaiss import build_index

# Load the .npy files from the "npy_embeddings" directory where we saved them
build_index(embeddings="npy_embeddings", index_path="indexes/knn.index",
            index_infos_path="indexes/dense_index_infos.json", max_index_memory_usage="6GB",
            current_memory_available="9GB")

## Models

If you have not loaded them already, run the following cell to load the pre-saved indexes from the shared Drive folder:

In [None]:
# Load the indexes from Drive
!cp /content/drive/MyDrive/NLP_project/indexes.zip .
!unzip indexes.zip
!rm indexes.zip

We created a ```DocumentsIndex``` class to load the dense and sparse indexes and to set their parameters; you can find it in the ```src``` directory of the project.

Before presenting the pipelines that we implemented for document retrieval, we create the instance of the dense index: it is quite heavy in memory, so we share the same instance across all the pipelines that need it. 
We also create a directory for saving the search results.

In [None]:
!mkdir results
from src.DocumentsIndex import DocumentsIndex

In [None]:
dense_index = DocumentsIndex('indexes/knn.index', 'dense')

The following two functions print the nDCG scores and return the lists of ids and urls, given the corpus and the dictionary of results.

In [None]:
def print_ndcg(ndcg_scores):
    '''
        It prints the nDCG score for each key in the input.

        Parameters: 
            - ndcg_scores: dict
                The dictionary that contains for each .qrels file
                the mean nDCG.
    '''
    if ndcg_scores and isinstance(ndcg_scores, dict):
        for key, ndcg in ndcg_scores.items():
            print(f"The nDCG for {key} qrel is:")
            print(ndcg, '\n')
    else:
        print("nDCG has not been computed, set 'evaluate=True' to compute it.")


# Retrieve the urls from the corpus given the results of the search on the index
def retrieve_docs_ranked(corpus, hits, k=10):
    '''
        It returns a list of ids and urls retrieved from the corpus
        using the ids within hits.

        Parameters:
            - corpus: pd.DataFrame
                The corpus from which we want to retrieve the urls.
            - hits: dict
                The dictionary that has ids and scores for a specific
                topic.
            - k: int
                The number of elements to return.
    '''
    # Keep only the ids of the top-k hits
    ids = list(hits['ids'][:k])
    # Look up the url of each retrieved id in the corpus
    urls = [corpus[corpus['id'] == el]['chatNoirUrl'].item() for el in ids]
    return ids, urls

### Sparse index (BM25 score)

BM25 is a probabilistic retrieval model that estimates the relevance of a passage to a query. In this simple pipeline we use the sparse index created before, ranking with the accurate BM25 variant, and we compute nDCG@5 as all the teams did during the task. This pipeline is the ***baseline*** for a retrieval approach and it is the least effective in terms of nDCG@5 computed on quality and relevance. 

As default values for the BM25 parameters we keep k1=0.9 and b=0.4, because these work best in combination with the dense index in the hybrid pipeline. 

When using only the sparse index, the best nDCG@5 is obtained with k1=1.15 and b=0.75.
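
To give an intuition for the two BM25 parameters discussed above (the saturation parameter, usually written k1, and the length-normalization parameter b), here is a sketch of the textbook per-term BM25 score. The function name and arguments are our own, and the exact variant implemented by Anserini/Lucene may differ in details, so treat it as an illustration rather than the scoring code used by the index.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 contribution of a single query term to one document."""
    # Rare terms (low document frequency) get a higher idf
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalised (b=0 disables it)
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences of the term saturate
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Same term statistics, scored with the two parameter settings mentioned above
for k1, b in [(0.9, 0.4), (1.15, 0.75)]:
    score = bm25_term_score(tf=3, doc_len=120, avg_doc_len=100,
                            n_docs=1000, doc_freq=50, k1=k1, b=b)
    print(f"k1={k1}, b={b}: {score:.3f}")
```

A higher b penalises passages that are longer than average more strongly, while a higher k1 lets repeated occurrences of a term keep contributing for longer before the score saturates.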

To see how the pipeline is implemented, check the ```src/SparsePipeline.py``` file, which contains the related class. In the following experiments we retrieve the top-40 documents, although computing the nDCG@5 score only requires the first 5 of them.
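
For reference, nDCG@k can be sketched as follows. This is a minimal version with linear gains and a log2 discount, written by us for illustration; the evaluation code inside the pipeline classes may differ (for example in how unjudged documents are handled), and the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k gains (rank 1 has discount 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, judgments, k=5):
    """nDCG@k of a ranking, given a {doc_id: score} dict of judgments for one topic."""
    # Unjudged documents are treated as having gain 0 (an assumption of this sketch)
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking sorts all judged documents by decreasing score
    ideal_dcg = dcg_at_k(sorted(judgments.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_judgments = {"a": 2, "b": 1, "c": 0}
print(ndcg_at_k(["a", "b", "c"], toy_judgments))  # ideal order
print(ndcg_at_k(["c", "b", "a"], toy_judgments))  # reversed order scores lower
```

Only the first k positions of the ranking contribute to the score, so documents retrieved beyond rank 5 do not affect nDCG@5.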

In [None]:
from src.SparsePipeline import SparsePipeline

# The index with the default values (k=0.9, b=0.4)
sparse_index = DocumentsIndex('indexes/sparse_index', 'sparse')

In [None]:
sparse_pipeline = SparsePipeline('results', "sparse_pipeline", sparse_index)

# Evaluate the pipeline both on relevance and quality .qrels
sparse_scores, ndcg_sparse = sparse_pipeline.compute_results(
                                ['downloads/relevance.qrels', 'downloads/quality.qrels'],
                                topics, k=40, evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_sparse)

The following cells show the best result we obtained using only the sparse index.

In [None]:
# Set the parameters of bm25
sparse_index_best = DocumentsIndex('indexes/sparse_index', 'sparse', 
                                    set_bm25=True, k1=1.15, b=0.75)

In [None]:
sparse_pipeline_best = SparsePipeline('results', "sparse_pipeline_best", sparse_index_best)

# Evaluate the pipeline both on relevance and quality .qrels
sparse_scores_best, ndcg_sparse_best = sparse_pipeline_best.compute_results(
                                ['downloads/relevance.qrels', 'downloads/quality.qrels'],
                                topics, k=40, evaluate=True,
                                clean_query=False
                             )

print_ndcg(ndcg_sparse_best)

### Dense index

In this case the approach is quite different: as seen before, the index performs a k-nearest-neighbour search between the query embedding and the document embeddings.

As encoder we decided to use the second version of TCT-ColBERT, trained on the MS MARCO dataset. You can find more information about TCT-ColBERT in the [paper](https://arxiv.org/pdf/2010.11386.pdf). 
The general idea is that the embeddings capture the semantic meaning of the documents, including a notion of term importance, and that similarity between embeddings is measured with the inner product between the vectors.

In this pipeline the queries must be encoded with TCT-ColBERT before being passed to the index; the entire process is handled by the 'search' function implemented in the ```DocumentsIndex``` class. Once the queries are encoded, the vectors are given to the index, which retrieves the top-k documents with the highest inner product with the query embedding.

Moreover, we added a parameter that cleans the queries by removing punctuation; for the dense index this gave us the best nDCG@5.
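
As an illustration of this kind of cleaning, the sketch below strips ASCII punctuation from a query. The helper name `strip_punctuation` is hypothetical, and the actual behaviour of the `clean_query` option in the project code may differ.

```python
import string


def strip_punctuation(query: str) -> str:
    """Remove ASCII punctuation and collapse repeated whitespace."""
    no_punct = query.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


print(strip_punctuation("Which is better, a laptop or a desktop?"))
# -> Which is better a laptop or a desktop
```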

In [None]:
from src.DensePipeline import DensePipeline

dense_pipeline = DensePipeline('results', 'dense_pipeline', dense_index)

dense_scores, ndcg_dense = dense_pipeline.compute_results(
                                ['downloads/relevance.qrels', 'downloads/quality.qrels'],
                                topics, corpus_df, k=40, clean_query=True,
                                evaluate=True
                            )

print_ndcg(ndcg_dense)

### Hybrid pipeline (sparse + dense index)

We also decided to test a combination of the two approaches, retrieving k documents from both indexes and combining the obtained scores. The main idea behind the following algorithm comes from the HybridSearcher class of Pyserini.

**Algorithm**
1. First we record the minimum and maximum scores of the retrieved documents for both the sparse and the dense index (0 if no documents were retrieved).
2. Then we iterate over the union of the ids retrieved from the two indexes: if a document was found by an index, its score from that index is used; otherwise the minimum score of that index is used.
3. Optionally we normalize the scores; then we sum the two scores, multiplying the sparse score by an alpha value (in our case alpha=0.2). 
4. Finally we return the list re-ranked by the newly computed scores.

Multiplying by alpha gives more weight to the dense index: we know that it performs better, so we do not want the two scores to contribute equally to the sum.
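
The steps above can be sketched as follows. This is our own simplified illustration, not the code in ```src/HybridPipeline.py```: the function name `fuse_scores`, the `{doc_id: score}` input format and the min-max normalization are assumptions made for the example.

```python
def fuse_scores(sparse_hits, dense_hits, alpha=0.2, normalize=False, k=10):
    """Combine {doc_id: score} dicts from a sparse and a dense index."""

    def bounds(hits):
        # Step 1: minimum and maximum score (0 if nothing was retrieved)
        return (min(hits.values()), max(hits.values())) if hits else (0.0, 0.0)

    def scale(score, low, high):
        # Min-max normalization (an assumption of this sketch)
        return (score - low) / (high - low) if high > low else 0.0

    min_sparse, max_sparse = bounds(sparse_hits)
    min_dense, max_dense = bounds(dense_hits)

    fused = {}
    # Step 2: union of ids, falling back to the minimum score of the missing index
    for doc_id in set(sparse_hits) | set(dense_hits):
        sparse_score = sparse_hits.get(doc_id, min_sparse)
        dense_score = dense_hits.get(doc_id, min_dense)
        if normalize:
            sparse_score = scale(sparse_score, min_sparse, max_sparse)
            dense_score = scale(dense_score, min_dense, max_dense)
        # Step 3: weighted sum, with the sparse score scaled by alpha
        fused[doc_id] = alpha * sparse_score + dense_score

    # Step 4: re-rank by the fused score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


toy_sparse = {"a": 10.0, "b": 5.0}
toy_dense = {"b": 0.9, "c": 0.8}
print(fuse_scores(toy_sparse, toy_dense, alpha=0.2))
print(fuse_scores(toy_sparse, toy_dense, alpha=0.2, normalize=True))
```

Without normalization the raw BM25 scores can still dominate the sum even with a small alpha, because they live on a different scale from the inner products; normalizing first makes alpha easier to interpret.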

In [None]:
from src.HybridPipeline import HybridPipeline

hybrid_pipeline = HybridPipeline('results', 'hybrid_pipeline', sparse_index, dense_index)

scores_hybrid, ndcg_hybrid = hybrid_pipeline.compute_results(
                                ['downloads/relevance.qrels', 'downloads/quality.qrels'],
                                topics, corpus_df, k=700, alpha=0.2, evaluate=True
                             )

print_ndcg(ndcg_hybrid)
