# CQADupStack Collection

The [*CQADupStack*](https://github.com/D1Doris/CQADupStack) is "[a] Benchmark Data Set for Community Question-Answering Research" [1] that is a part of the [*Benchmarking Information Retrieval (BEIR)*](https://github.com/beir-cellar/beir) collection.

CQADupStack contains data from 12 different [*Stack Exchange*](https://stackexchange.com/) subforums, based on the data dump released on September 26, 2014.

Your tasks, reviewed by your colleagues and the course instructors, are the following:

TODO


1. *Implement a ranked retrieval system* [1, Chapter 6], which will produce a list of documents from the CQADupStack collection in descending order of relevance to a query from the CQADupStack collection.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.
   *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least XX% mean average precision* [1, Section 8.4] with your system on the CQADupStack collection.
4.   _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).







[1] Hoogeveen, Doris and Verspoor, Karin M. and Baldwin, Timothy. [*CQADupStack: A Benchmark Data Set for Community Question-Answering Research*](https://dl.acm.org/doi/10.1145/2838931.2838934). ACM, 2015.

### Import the utility tools from the git repository.

First, we will install [our library](https://github.com/MIR-MU/pv211-utils).

It may be necessary to restart the runtime to get the installed packages to work.

In [1]:
! pip install git+https://github.com/MIR-MU/pv211-utils.git@add-beir-datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git@dataset_and_irsystem_evaluator
  Cloning https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git (to revision dataset_and_irsystem_evaluator) to /tmp/pip-req-build-mwh950k4
  Running command git clone -q https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git /tmp/pip-req-build-mwh950k4
  Running command git checkout -b dataset_and_irsystem_evaluator --track origin/dataset_and_irsystem_evaluator
  Switched to a new branch 'dataset_and_irsystem_evaluator'
  Branch 'dataset_and_irsystem_evaluator' set up to track remote branch 'dataset_and_irsystem_evaluator' from 'origin'.


### Define the necessary classes

These classes will represent the queries, documents, and relevance judgements from the CQADupStack collection.

A `Query` and a `Document` each consist only of an ID and a body.
The judgements are simply a set of tuples, each pairing a query with a document that is relevant to it.
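As an illustration of the set-of-tuples idea, here is a minimal sketch that uses plain named tuples instead of the library classes. The names `ToyQuery`, `ToyDocument`, and `is_relevant` are hypothetical, and the `(query, document)` order inside each tuple is an assumption made for this example.

```python
from typing import NamedTuple, Set, Tuple


class ToyQuery(NamedTuple):
    query_id: int
    body: str


class ToyDocument(NamedTuple):
    document_id: str
    body: str


def is_relevant(judgements: Set[Tuple[ToyQuery, ToyDocument]],
                query: ToyQuery, document: ToyDocument) -> bool:
    """Return True if the (query, document) pair was judged relevant."""
    return (query, document) in judgements


query = ToyQuery(1, "how to root an android phone")
relevant = ToyDocument("d1", "Rooting guide for Android devices")
other = ToyDocument("d2", "Sorting algorithms in Python")

# Only pairs that were judged relevant are stored in the set.
judgements = {(query, relevant)}
```

Because the judgements form a set, checking whether a retrieved document is relevant to a query is a single membership test.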

In [2]:
from pv211_utils.beir.entities import BeirDocumentBase, BeirQueryBase, BeirJudgementBase
from typing import Set

class Query(BeirQueryBase):
    """
    A processed query from the BEIR collection.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """
    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)

class Document(BeirDocumentBase):
    """
    A processed document from the BEIR collection.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    body : str
        The text of the document.

    """
    def __init__(self, document_id: str, body: str):
        super().__init__(document_id, body)
        
BeirJudgements = Set[BeirJudgementBase]


### Define the datasets that are to be used.


TODO - Either this or the one below

`RawBeirDataset` stores the basic setup of a single dataset:
- Name of the dataset
- Subset(s) to use
- Alternative(s) to the subset(s) if they are not available


`RawBeirDatasets` then stores:
- Common download location
- List of `RawBeirDataset` instances

If more than one dataset is used, they will be merged and used as one. This functionality is primarily aimed at the CQADupStack datasets, but should work with any other combination as well.
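To make the idea of merging concrete, here is a minimal sketch of combining several corpora into one. How the library merges datasets internally is not shown in this notebook. The helper `merge_corpora` is hypothetical, and prefixing each document ID with the dataset name to avoid ID collisions is an assumption of this example.

```python
from typing import Dict

Corpus = Dict[str, str]


def merge_corpora(corpora: Dict[str, Corpus]) -> Corpus:
    """Merge several corpora into one, prefixing IDs with the dataset name."""
    merged: Corpus = {}
    for name, corpus in corpora.items():
        for document_id, body in corpus.items():
            # The same raw ID may occur in two subforums, so make it unique.
            merged[f"{name}/{document_id}"] = body
    return merged


merged = merge_corpora({
    "android": {"1": "How do I root my phone?"},
    "programmers": {"1": "What is a monad?"},
})
```

Both documents survive the merge even though they share the raw ID `"1"`.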

## Loading the datasets


First, we import the `RawBeirDataset` and `RawBeirDatasets` classes.

The former simply stores the name and the desired subsets of the selected CQADupStack dataset. Out of the box, all CQADupStack datasets contain only the test subset.

The latter is needed to use multiple datasets at once. In this example, we combine all twelve CQADupStack datasets.

You must also define the path to the download directory where the datasets will be stored. If a desired dataset is already present in this directory, it will not be downloaded again.



### CQADupStack contains 12 datasets:
- Android
- English
- Gaming
- GIS
- Mathematica
- Physics
- Programmers
- Stats
- TeX
- Unix
- Webmasters
- WordPress

These are represented by their lowercase names.



In [3]:
from pv211_utils.beir.entities import RawBeirDataset, RawBeirDatasets


android = RawBeirDataset("android", test=True)
english = RawBeirDataset("english", test=True)
gaming = RawBeirDataset("gaming", test=True)
gis = RawBeirDataset("gis", test=True)
mathematica = RawBeirDataset("mathematica", test=True)
physics = RawBeirDataset("physics", test=True)
programmers = RawBeirDataset("programmers", test=True)
stats = RawBeirDataset("stats", test=True)
tex = RawBeirDataset("tex", test=True)
unix = RawBeirDataset("unix", test=True)
webmasters = RawBeirDataset("webmasters", test=True)
wordpress = RawBeirDataset("wordpress", test=True)

# programmers = RawBeirDataset("programmers", test=True)
# android = RawBeirDataset("android", test=True)
download_location = "datasets"
desired_datasets = RawBeirDatasets(
    [android, english, gaming, gis, mathematica, physics,
     programmers, stats, tex, unix, webmasters, wordpress],
    download_location,
)

# If download_location is None or omitted and data_path is set instead,
# the data should not be downloaded again.
# desired_datasets = RawBeirDatasets(
#     [android, english, gaming, gis, mathematica, physics,
#      programmers, stats, tex, unix, webmasters, wordpress],
#     data_path="datasets/cqadupstack",
# )



### Load and split raw data
Once all the desired datasets are defined, we can load them. The `load_beir_datasets` function downloads (if necessary), loads, and prepares the raw data from the selected datasets.

It returns three values,
`raw_train_data`, `raw_dev_data`, and `raw_test_data`,
but since the train and dev subsets are not present in these datasets, the first two can be ignored.

To get the train and dev (validation) subsets, we use `split_beir_dataset` to split the original test data.
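The two-step split can be sketched on a plain dictionary of queries. This is not the implementation of the library's split function. The helper `split_queries` is hypothetical, and it assumes that the first returned part receives the `split_factor` share of the data.

```python
from typing import Dict, Tuple

Queries = Dict[str, str]


def split_queries(queries: Queries, split_factor: float) -> Tuple[Queries, Queries]:
    """Put the first `split_factor` share of the queries in the first part."""
    items = list(queries.items())
    cut = int(len(items) * split_factor)
    return dict(items[:cut]), dict(items[cut:])


queries = {f"q{i}": f"query text {i}" for i in range(100)}

# First keep 90% for training, then halve the remaining 10%.
train, rest = split_queries(queries, split_factor=0.9)
dev, test = split_queries(rest, split_factor=0.5)
```

With 100 queries this gives 90 for training, 5 for validation, and 5 for testing, and no query appears in more than one part.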



In [4]:
from pv211_utils.beir.loader import load_beir_datasets, split_beir_dataset


_, _, raw_test_data = load_beir_datasets(desired_datasets)
# Leave 90% for training and 10% for validation and testing
raw_train_data, raw_test_data = split_beir_dataset(raw_test_data, split_factor=0.9)
# Subsequently split the remaining 10% into 5% for validation and 5% for testing
raw_dev_data, raw_test_data = split_beir_dataset(raw_test_data, split_factor=0.5)




### The loaded raw data consists of three parts:
- corpus (the set of documents)
- queries (the search terms)
- qrels (the relevance judgements)
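For orientation, here is a toy example of the three parts in the dictionary layout that BEIR data loaders commonly use. That the raw data in this notebook has exactly this shape is an assumption, and the IDs and texts below are made up for illustration.

```python
from typing import Dict

# corpus: document ID -> fields of the document
corpus: Dict[str, Dict[str, str]] = {
    "doc1": {"title": "Rooting Android", "text": "Steps to root a phone."},
    "doc2": {"title": "Monads", "text": "An introduction to monads."},
}

# queries: query ID -> query text
queries: Dict[str, str] = {"q1": "how to root android"}

# qrels: query ID -> {document ID -> relevance score}
qrels: Dict[str, Dict[str, int]] = {"q1": {"doc1": 1}}

# Collect the IDs of the documents judged relevant to query "q1".
relevant_ids = {doc_id for doc_id, score in qrels["q1"].items() if score > 0}
```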

In [5]:
raw_corpus_test, raw_queries_test, raw_qrels_test = raw_test_data

raw_corpus_train, raw_queries_train, raw_qrels_train = raw_train_data

raw_corpus_dev, raw_queries_dev, raw_qrels_dev = raw_dev_data

### Process the loaded data

Before the loaded data can be used, it must be processed.

The `load_documents`, `load_queries`, and `load_judgements` functions return the processed data.

If desired or necessary, you can limit the number of queries used
(to speed up the evaluation).
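Limiting the number of queries can be pictured as keeping only the first few entries. The helper `limit_queries` below is a hypothetical sketch on a plain dictionary, not the library's code, and treating `None` as "use everything" is an assumption of this example.

```python
from itertools import islice
from typing import Dict, Optional


def limit_queries(queries: Dict[str, str], max_queries: Optional[int]) -> Dict[str, str]:
    """Keep the first `max_queries` queries; None keeps all of them."""
    return dict(islice(queries.items(), max_queries))


all_queries = {"q1": "root android", "q2": "latex tables", "q3": "unix pipes"}
some_queries = limit_queries(all_queries, 2)
every_query = limit_queries(all_queries, None)
```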

In [6]:
from pv211_utils.beir.loader import load_documents, load_queries, load_judgements

# Generally, the same set of documents is used for train, dev, and test,
# so there is no need to store it multiple times.
documents = load_documents(raw_corpus_test)
# For HotpotQA, for example, just 200 test queries are suggested; None means use all available.
max_test_queries = None
test_queries = load_queries(raw_queries_test, max_test_queries)
test_judgements = load_judgements(test_queries, documents, raw_qrels_test)
        


## Implement the Information Retrieval system

Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.

This example returns documents in decreasing order of the
[*Okapi BM25*](https://en.wikipedia.org/wiki/Okapi_BM25) similarity score between the documents and the given query.
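To show what such a score looks like, here is a minimal pure-Python sketch of BM25 scoring. It is not the implementation used by the example system: the function `bm25_scores` is hypothetical, the defaults `k1=1.5` and `b=0.75` are assumptions, and it uses a non-negative variant of the inverse document frequency that may differ from what a library computes.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq: Dict[str, int] = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through `b`.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "how to root an android phone".split(),
    "sorting a list in python".split(),
    "android phone battery drains fast".split(),
]
scores = bm25_scores("root android phone".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document matches all three query terms and ranks first, while the second shares no term with the query and receives a score of zero.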

It also allows for the use of a [*re-ranking*](https://developers.google.com/machine-learning/recommendation/dnn/re-ranking) function, which takes the selected top k results from the base function and reorders them to achieve better results.
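The re-ranking step can be sketched independently of any model. The function below is a hypothetical illustration, not the library's re-ranker: it re-orders only the first `k` items of an existing ranking with a caller-supplied scoring function and passes the tail through unchanged.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def rerank_top_k(ranked: Iterable[str], k: int,
                 score: Callable[[str], float]) -> Iterator[str]:
    """Re-order the first k items by `score` and keep the tail as is."""
    iterator = iter(ranked)
    # Take exactly k items from the front; the rest stays in the iterator.
    head = list(islice(iterator, k))
    yield from sorted(head, key=score, reverse=True)
    yield from iterator


# A stand-in for a real re-ranking model: prefer longer documents.
base_ranking = ["bb", "a", "cccc", "ddd", "e"]
result = list(rerank_top_k(base_ranking, 3, score=len))
```

Only the top three items change places here. In practice the scoring function would be an expensive model, which is why it is applied to a short prefix rather than to the whole ranking.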

You can use this example as the basis for your own implementation of an information retrieval system. Experiment with better-suited preprocessing options. Try different base and re-ranking functions. Tune the various hyperparameters, either by hand or with the help of the dev subset. And of course, you can scrap this entire piece of code and make something completely new.
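One way to use the dev subset for tuning is a small grid search. The sketch below is hypothetical: `grid_search` tries every combination of two hyperparameters, and `toy_dev_score` is a made-up stand-in for building a system with the given values and measuring its score on the dev queries.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple


def grid_search(evaluate: Callable[[float, float], float],
                k1_values: Iterable[float],
                b_values: Iterable[float]) -> Tuple[Tuple[float, float], float]:
    """Return the (k1, b) pair with the highest dev score, and that score."""
    results: Dict[Tuple[float, float], float] = {
        (k1, b): evaluate(k1, b) for k1, b in product(k1_values, b_values)
    }
    best = max(results, key=results.get)
    return best, results[best]


# Stand-in for "build the system with (k1, b) and evaluate it on the dev queries".
def toy_dev_score(k1: float, b: float) -> float:
    return 1.0 - abs(k1 - 1.2) - abs(b - 0.75)


best_params, best_score = grid_search(toy_dev_score, [0.9, 1.2, 1.5], [0.5, 0.75, 1.0])
```

Selecting hyperparameters on the dev subset rather than on the test subset keeps the final test score an honest estimate.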




In [7]:
import torch
from typing import Iterable

from tqdm.notebook import tqdm

from pv211_utils.beir.irsystem import BeirIRSystemBase
from pv211_utils.beir.rerank import GenericReRank

from gensim.utils import simple_preprocess
# gensim.summarization was removed in Gensim 4.0, so this import needs gensim < 4.
from gensim.summarization import bm25
from sentence_transformers import CrossEncoder


class IRSystem(BeirIRSystemBase):
    """
    A system that returns documents ordered by their Okapi BM25 score.

    Parameters
    ----------
    rerank_first_k : int, optional
        The number of top BM25 results to re-rank with a cross-encoder.
        If zero (the default), no re-ranking takes place.

    """
    def __init__(self, rerank_first_k: int = 0):
        # Uses the global `documents` loaded in the previous cell.
        document_bodies = (simple_preprocess(doc.body) for doc in documents.values())
        document_bodies = tqdm(document_bodies, desc='Preprocessing documents', total=len(documents))
        self.index_to_document = dict(enumerate(documents.values()))
        self.bm25_model = bm25.BM25(document_bodies)
        self.rerank_first_k = rerank_first_k
        if rerank_first_k != 0:
            # Fall back to the CPU when no GPU is available.
            device = "cuda" if torch.cuda.is_available() else "cpu"
            cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)
            self.reranker = GenericReRank(cross_encoder)

    def search(self, query: Query) -> Iterable[Document]:
        query_doc = simple_preprocess(query.body)
        similarities = enumerate(self.bm25_model.get_scores(query_doc))
        similarities = sorted(similarities, key=lambda item: item[1], reverse=True)
        # A generator: whatever the re-ranker does not consume is yielded afterwards.
        ranked_documents = (self.index_to_document[i] for i, _ in similarities)

        if self.rerank_first_k != 0:
            # Re-rank the top k results.
            reranked = self.reranker.rerank_top_k(query, ranked_documents, self.rerank_first_k)
            for doc in reranked:
                yield doc
        # Yield the rest.
        for doc in ranked_documents:
            yield doc



### Evaluate the system on a given dataset

Lastly, we will evaluate the IR system using the [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP).
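For reference, the measure itself can be sketched in a few lines. This is a hypothetical illustration on document IDs, not the code of the library's evaluation class: the average precision of one query averages the precision at each rank where a relevant document appears, and MAP is the mean of that value over all queries.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at the ranks of relevant documents."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, document_id in enumerate(ranking, start=1):
        if document_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           judgements: Dict[str, Set[str]]) -> float:
    """Mean of the average precision over all queries."""
    return sum(
        average_precision(rankings[query_id], judgements[query_id])
        for query_id in rankings
    ) / len(rankings)


rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
judgements = {"q1": {"d1", "d3"}, "q2": {"d5"}}
map_score = mean_average_precision(rankings, judgements)
```

In this toy case the first query has an average precision of (1/1 + 2/3) / 2 = 5/6 and the second of 1/2, so the MAP is 2/3.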

Set the `rerank_first_k` parameter to a nonzero number such as 100 to use the re-ranking function on the top k results of every search.



In [None]:
from pv211_utils.beir.eval import BeirEvaluation
from pv211_utils.beir.leaderboard import BeirLeaderboard

submit_result = False
author_name = 'Surname, Name'
system = IRSystem(rerank_first_k=0)

evaluation = BeirEvaluation(system, test_judgements, BeirLeaderboard(), author_name, num_workers=1)
evaluation.evaluate(tqdm(test_queries.values(), desc='Querying the system'), submit_result)