# CQADupStack Collection

*CQADupStack* is "[a] Benchmark Data Set for Community Question-Answering Research" [1] and is part of the [*Benchmarking Information Retrieval (BEIR)*](https://github.com/beir-cellar/beir) collection.

CQADupStack contains data from 12 different [*Stackexchange*](https://stackexchange.com/) subforums based on the data dump released on September 26, 2014.

Your tasks, reviewed by your colleagues and the course instructors, are the following:



1. *Implement a ranked retrieval system* [1, Chapter 6] that produces a list of documents from the CQADupStack collection in descending order of relevance to a query from the same collection.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.
   *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least 22% mean average precision* [1, Section 8.4] with your system on the CQADupStack collection.
4. _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).







[1] Hoogeveen, Doris, Karin M. Verspoor, and Timothy Baldwin. [*CQADupStack: A Benchmark Data Set for Community Question-Answering Research*](https://dl.acm.org/doi/10.1145/2838931.2838934). ACM, 2015.

### Import the utility tools from the git repository

First, we will install [our library](https://github.com/MIR-MU/pv211-utils).

It may be necessary to restart the runtime to get the installed packages to work.

In [1]:
%%capture
! pip install git+https://github.com/MIR-MU/pv211-utils.git

### Define the necessary classes

These will eventually represent the Queries, Documents and Relevance Judgements from the CQADupStack collection.

A `Query` and a `Document` each consist only of an ID and a body.
Judgements are simply a set of tuples, each pairing a query with a document that is relevant to it.
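To make the set-of-tuples idea concrete, here is a minimal sketch in plain Python. It uses bare IDs instead of the library's query and document objects, and the IDs themselves, the `(query_id, document_id)` ordering, and the helper names are assumptions made for illustration only.

```python
from typing import Set, Tuple

# Assumption: each judgement is a (query_id, document_id) pair of plain IDs.
ToyJudgements = Set[Tuple[int, str]]

# Hypothetical IDs, not taken from the real collection.
toy_judgements: ToyJudgements = {
    (1, "d3"),
    (1, "d7"),
    (2, "d3"),
}


def is_relevant(query_id: int, document_id: str, judgements: ToyJudgements) -> bool:
    """Return True if the document is judged relevant to the query."""
    return (query_id, document_id) in judgements


def relevant_documents(query_id: int, judgements: ToyJudgements) -> Set[str]:
    """Collect the IDs of all documents judged relevant to one query."""
    return {doc_id for q_id, doc_id in judgements if q_id == query_id}
```

Because the judgements form a set, a membership test such as `is_relevant(1, "d3", toy_judgements)` is a single hash lookup, which is convenient when evaluating many ranked lists.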

In [2]:
from pv211_utils.beir.entities import BeirDocumentBase, BeirQueryBase, BeirJudgementBase
from typing import Set


class Query(BeirQueryBase):
    """
    A processed query from the BEIR collection.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """

    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)

    def __str__(self):
        return self.body


class Document(BeirDocumentBase):
    """
    A processed document from the BEIR collection.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    body : str
        The text of the document.

    """

    def __init__(self, document_id: str, body: str):
        super().__init__(document_id, body)

    def __str__(self):
        return self.body


BeirJudgements = Set[BeirJudgementBase]


## Loading the datasets

CQADupStack contains 12 datasets that will be loaded and merged:
- Android
- English
- Gaming
- GIS
- Mathematica
- Physics
- Programmers
- Stats
- TeX
- Unix
- Webmasters
- WordPress

For more details, see the [CQADupStack site](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/).



In [3]:
from pv211_utils.datasets import CQADupStackDataset 

data = CQADupStackDataset()


In [4]:
documents = data.load_documents(document_class=Document)

train_queries = data.load_train_queries(query_class=Query)
train_judgements = data.load_train_judgements()

validation_queries = data.load_validation_queries(query_class=Query)
validation_judgements = data.load_validation_judgements()

### Implementation of information retrieval system

Here we will define our IR system. If you want to use your own class, it must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.
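As a rough illustration of that `search` contract, the sketch below ranks documents by how many words they share with the query. The `ToyDocument`, `ToyQuery`, and `TermOverlapSystem` names are hypothetical stand-ins, not classes from the library, and the assumption that documents arrive as a mapping from ID to document is ours.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ToyDocument:
    document_id: str
    body: str


@dataclass
class ToyQuery:
    query_id: int
    body: str


class TermOverlapSystem:
    """Toy system that ranks documents by word overlap with the query."""

    def __init__(self, documents: Dict[str, ToyDocument]):
        # Assumption: documents is a mapping from document ID to document.
        self.documents = documents

    def search(self, query: ToyQuery) -> Iterable[ToyDocument]:
        query_terms = set(query.body.lower().split())

        def overlap(document: ToyDocument) -> int:
            return len(query_terms & set(document.body.lower().split()))

        # sorted() is stable, so ties keep the original document order.
        return sorted(self.documents.values(), key=overlap, reverse=True)


toy_documents = {
    "d1": ToyDocument("d1", "how to root android phone"),
    "d2": ToyDocument("d2", "latex table alignment"),
    "d3": ToyDocument("d3", "android phone battery"),
}
toy_system = TermOverlapSystem(toy_documents)
toy_ranking = [d.document_id for d in toy_system.search(ToyQuery(1, "android phone root"))]
```

Word overlap is a deliberately weak baseline. The only point here is the shape of the interface: one query in, an ordered collection of documents out.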

This example returns documents in decreasing order of their [*Okapi BM25+*](https://en.wikipedia.org/wiki/Okapi_BM25#Modifications) similarity score with respect to the given query.
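For intuition about what such a score looks like, here is a minimal sketch of BM25+ scoring over pre-tokenized documents. It is not the implementation used by `BM25PlusSystem`. The function name, the parameter defaults (`k1`, `b`, `delta`), and the choice of `log((N + 1) / df)` as the inverse document frequency are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_plus_scores(
    query: List[str],
    corpus: List[List[str]],
    k1: float = 1.5,    # illustrative default, an assumption
    b: float = 0.75,    # illustrative default, an assumption
    delta: float = 1.0,  # lower bound added for every matching term
) -> List[float]:
    """Score every tokenized document in `corpus` against `query`."""
    n_docs = len(corpus)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    if avg_len == 0:
        return [0.0] * n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / df[term])
            score += idf * ((k1 + 1) * tf[term] / (length_norm + tf[term]) + delta)
        scores.append(score)
    return scores


toy_corpus = [
    ["android", "phone", "root"],
    ["latex", "table"],
    ["android", "battery", "life", "phone"],
]
toy_scores = bm25_plus_scores(["android", "root"], toy_corpus)
```

The `delta` term is what distinguishes BM25+ from plain BM25: a long document that contains a query term still receives at least `delta * idf` for it, so length normalization cannot push the contribution of a matching term toward zero.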

If you wish, you can use the [preprocessing](https://github.com/MIR-MU/pv211-utils/tree/main/pv211_utils/preprocessing) or [ensemble](https://github.com/MIR-MU/pv211-utils/blob/developer/pv211_utils/ensembles.py) techniques from our library.
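To give a sense of what preprocessing can involve, the following sketch lowercases text, splits it into alphanumeric tokens, and drops a few stop words. It does not reproduce the library's preprocessing classes. The `preprocess` function and the tiny `TOY_STOPWORDS` list are hypothetical.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration.
TOY_STOPWORDS = frozenset({"a", "an", "the", "is", "to", "of", "in", "how", "do", "i"})


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on alphanumeric runs, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]
```

For example, `preprocess("How do I root the Android phone?")` yields `["root", "android", "phone"]`. Whatever steps you choose, apply the same ones to queries and documents so that their tokens remain comparable.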

In [5]:
from pv211_utils.systems import BM25PlusSystem
from pv211_utils.preprocessing.preprocessing import NoneDocPreprocessing, SimpleDocPreprocessing

system = BM25PlusSystem(documents, SimpleDocPreprocessing())

### Evaluate the system on a given dataset

We will evaluate the IR system using the [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP).
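As a reminder of how MAP is computed, here is a small sketch on plain document IDs. It is not the code behind `BeirEvaluation`, which evaluates with a cutoff of `k=10` below and may normalize differently. The function names and toy data are assumptions; this version scores the full ranking and divides by the total number of relevant documents.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of precision values at the ranks where relevant documents occur."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[int, List[str]], relevance: Dict[int, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranking, relevance.get(query_id, set()))
        for query_id, ranking in rankings.items()
    )
    return total / len(rankings)


# Hypothetical rankings and judgements for two queries.
toy_rankings = {1: ["d1", "d2", "d3", "d4"], 2: ["d2", "d1"]}
toy_relevance = {1: {"d1", "d3"}, 2: {"d1"}}
toy_map = mean_average_precision(toy_rankings, toy_relevance)
```

For query 1 the relevant documents sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; for query 2 the single relevant document sits at rank 2, giving 1/2. MAP is the mean of these two values.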

In [6]:
from pv211_utils.beir.leaderboard import BeirLeaderboard
from pv211_utils.beir.eval import BeirEvaluation

submit_result = False
author_name = 'Surname, Name'

test_queries = data.load_test_queries(Query)
test_judgements = data.load_test_judgements()


evaluation = BeirEvaluation(
    system,
    test_judgements,
    k=10,
    leaderboard=BeirLeaderboard(),
    author_name=author_name,
    num_workers=8,
)
evaluation.evaluate(test_queries, submit_result)


Your system achieved **21.96% MAP score**.

You need at least **22%** to pass. 😢

Try playing with the preprocessing of queries and documents! 💡

Set `submit_result = True` and write your name in the `author_name` variable to submit your result to [the leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vSnyvgqXDq3XPzGz3eLz_8JPwceou10HiEShI0wJ2A8vlosRZc1QhKZ10aOmmQFitv2yPAyBERD2wwx/pubhtml). 🏆

The best submissions on the leaderboard will receive *small awards during the semester*, and some *__seriously big__ awards* after the personal check at the end of the competition (2023-04-30). Please be polite, do not spoil the game for the others, and **have fun!** 😉