# Second Term Project: CQADupStack Collection

The *CQADupStack* is "[a] Benchmark Data Set for Community Question-Answering Research" [1] that is a part of the [*Benchmarking Information Retrieval (BEIR)*](https://github.com/beir-cellar/beir) collection.

CQADupStack contains data from 12 different [*Stackexchange*](https://stackexchange.com/) subforums based on the data dump released on September 26, 2014.

Your tasks, reviewed by your colleagues and the course instructors, are the following:



1. *Implement a ranked retrieval system* [1, Chapter 6], which will produce a list of documents from the CQADupStack collection in descending order of relevance to a query from the CQADupStack collection.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.
   *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least 25% mean average precision at 10* [1, Section 8.4] with your system on the CQADupStack collection.
4. _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).







[1] Hoogeveen, Doris and Verspoor, Karin M. and Baldwin, Timothy. [*CQADupStack: A Benchmark Data Set for Community Question-Answering Research*](https://dl.acm.org/doi/10.1145/2838931.2838934). ACM, 2015.

## Import the utility tools from the git repository.

First, we will install [our library](https://github.com/MIR-MU/pv211-utils).

It may be necessary to restart the runtime to get the installed packages to work.

In [1]:
%%capture
! pip install git+https://github.com/MIR-MU/pv211-utils.git

## Define the necessary classes

These classes represent the queries, documents, and relevance judgements from the CQADupStack collection.

A query and a document each consist only of an ID and a body.
The judgements are a set of tuples, each pairing a query with a document that is relevant to it.
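
As a rough illustration of how such a set of pairs can be used, the sketch below models judgements with plain tuples of made-up IDs instead of the library's `Query` and `Document` objects. The IDs, the (query, document) ordering, and the helper `is_relevant` are assumptions made for this example only.

```python
from typing import Set, Tuple

# Hypothetical (query ID, document ID) pairs standing in for real judgements.
toy_judgements: Set[Tuple[str, str]] = {
    ("q1", "d3"),
    ("q1", "d7"),
    ("q2", "d1"),
}


def is_relevant(query_id: str, document_id: str) -> bool:
    """Return True if the pair appears among the relevance judgements."""
    return (query_id, document_id) in toy_judgements


# Membership tests on a set take constant time on average.
print(is_relevant("q1", "d7"))
print(is_relevant("q2", "d7"))
```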

In [2]:
from pv211_utils.beir.entities import BeirDocumentBase, BeirQueryBase, BeirJudgementBase
from typing import Set


class Query(BeirQueryBase):
    """
    A processed query from the BEIR collection.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """

    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)

    def __str__(self):
        return self.body


class Document(BeirDocumentBase):
    """
    A processed document from the BEIR collection.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    body : str
        The text of the document.

    """

    def __init__(self, document_id: str, body: str):
        super().__init__(document_id, body)

    def __str__(self):
        return self.body


BeirJudgements = Set[BeirJudgementBase]


## Loading the datasets
### CQADupStack contains 12 datasets that will be loaded and merged:
- Android
- English
- Gaming
- GIS
- Mathematica
- Physics
- Programmers
- Stats
- TeX
- Unix
- Webmasters
- WordPress

For more details, see the [CQADupStack site](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/).



In [3]:
from pv211_utils.datasets import CQADupStackDataset 

data = CQADupStackDataset()


In [4]:
documents = data.load_documents(document_class=Document)

train_queries = data.load_train_queries(query_class=Query)
train_judgements = data.load_train_judgements()

validation_queries = data.load_validation_queries(query_class=Query)
validation_judgements = data.load_validation_judgements()

## Implementation of information retrieval system

Here we will define our IR system. If you want to use your own class, it must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.

This implementation returns documents in decreasing order of the dot product between the embedding of the given query and the precomputed document embeddings.
Both are produced by an `all-mpnet-base-v2` SentenceTransformer model fine-tuned on the training judgements.

If you wish, you may use [preprocessing](https://github.com/MIR-MU/pv211-utils/tree/main/pv211_utils/preprocessing) or [ensemble](https://github.com/MIR-MU/pv211-utils/blob/developer/pv211_utils/ensembles.py) techniques from our library.
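
The `search` contract described above can be sketched on toy data. This is a minimal illustration, assuming three-dimensional vectors in place of real document representations; the class `ToyIRSystem` and its data are hypothetical and do not use the pv211-utils base classes.

```python
from typing import Dict, Iterable

import numpy as np


class ToyIRSystem:
    """Hypothetical system that ranks documents by dot-product similarity."""

    def __init__(self, document_vectors: Dict[str, np.ndarray]):
        self.document_ids = list(document_vectors)
        # One row per document, in the same order as `document_ids`.
        self.matrix = np.stack([document_vectors[i] for i in self.document_ids])

    def search(self, query_vector: np.ndarray) -> Iterable[str]:
        """Yield document IDs in descending order of similarity to the query."""
        scores = self.matrix.dot(query_vector)
        # Negating the scores turns an ascending sort into a descending one.
        for index in np.argsort(-scores, kind="stable"):
            yield self.document_ids[index]


toy_system = ToyIRSystem({
    "d1": np.array([1.0, 0.0, 0.0]),
    "d2": np.array([0.0, 1.0, 0.0]),
    "d3": np.array([0.7, 0.7, 0.0]),
})
ranking = list(toy_system.search(np.array([1.0, 0.2, 0.0])))
print(ranking)
```

Because `search` is a generator, a caller that only needs the top results can stop iterating early without materializing the full ranking.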

In [5]:
# Imports
import numpy as np
import torch
from pv211_utils.cranfield.irsystem import CranfieldIRSystemBase
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader
from tqdm import tqdm
from typing import Iterable

In [6]:
# If training is True, the model is fine-tuned and the documents are encoded
# again, which takes more than an hour.
# If training is False, the saved vectors and model are loaded instead
# (see the cell with the comment "loading").
training = False

In [7]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [8]:
# TRAINING
if training:
    mmodel = SentenceTransformer('all-mpnet-base-v2').to(device)
    # Each judgement pair becomes one positive training example.
    train_examples = [InputExample(texts=[i.body, j.body]) for i, j in train_judgements]

    # MultipleNegativesRankingLoss uses the other pairs in a batch as negatives.
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
    train_loss = losses.MultipleNegativesRankingLoss(model=mmodel)

    mmodel.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

In [9]:
if training:
    # Encoding documents in batches is faster than encoding them one at a time.
    batch_size = 200
    bodies = [doc.body for doc in documents.values()]
    semantic_document_vectors = []
    for start in tqdm(range(0, len(bodies), batch_size)):
        # Slicing never produces an empty batch, even for the last chunk.
        batch = bodies[start:start + batch_size]
        semantic_document_vectors.extend(mmodel.encode(batch, convert_to_tensor=True))

    vectors = torch.stack(semantic_document_vectors).cpu().numpy()

In [10]:
# optional save
# if training:
#     with open('my_document_vectors.npy', 'wb') as f:
#         np.save(f, vectors)
#     mmodel.save("model")

In [11]:
# loading, data available at https://drive.google.com/drive/folders/1g74J2yYMwaqlpU1_wY_oX78zIxmM3Sb6?usp=sharing
if not training:
    with open('my_document_vectors.npy', 'rb') as f:
        vectors = np.load(f)
    mmodel = SentenceTransformer('model').to(device)

In [12]:
class IRSystem(CranfieldIRSystemBase):
    """
    A dense retrieval system that ranks documents by embedding similarity.

    Parameters
    ----------
    documents : dict
        The documents of the collection, in the same order as the rows
        of the global `vectors` matrix.

    """

    def __init__(self, documents):
        self.index_to_document = dict(enumerate(documents.values()))
        self._count = 0

    def search(self, query: Query) -> Iterable[Document]:
        """
        Rank all documents by their dot-product similarity to a query.

        Parameters
        ----------
        query : Query
            The query to search for.

        Yields
        ------
        Document
            The documents in descending order of similarity to the query.

        """
        # Print a rough progress indicator (scaled for about 300 queries).
        self._count += 1
        if self._count % 15 == 0:
            print(self._count / 3, end="% ")

        query_vector = mmodel.encode(query.body)
        similarities = vectors.dot(query_vector)

        # A stable sort keeps the original order of equally scored documents.
        for document_number in np.argsort(-similarities, kind="stable"):
            yield self.index_to_document[int(document_number)]

## Evaluate the system on a given dataset

We will evaluate the IR system using the [Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP).
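
To make the measure concrete, here is a minimal sketch of mean average precision at k on toy data. It follows one common definition, in which the sum of precision values at relevant hits is divided by the number of relevant documents capped at k; the normalization used by `BeirEvaluation` may differ, and the function names and IDs below are hypothetical.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average the precision values at the ranks of relevant hits in the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, document_id in enumerate(ranking[:k], start=1):
        if document_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: the number of relevant documents, capped at k.
    denominator = min(len(relevant), k)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10
) -> float:
    """Average the per-query average precision over all queries."""
    scores = [
        average_precision_at_k(rankings[query_id], relevant.get(query_id, set()), k)
        for query_id in rankings
    ]
    return sum(scores) / len(scores) if scores else 0.0


toy_rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
toy_relevant = {"q1": {"d1", "d3"}, "q2": {"d6"}}
toy_map = mean_average_precision_at_k(toy_rankings, toy_relevant, k=10)
print(f"MAP@10 = {toy_map:.4f}")
```

In the toy data, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.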

In [13]:
from pv211_utils.beir.leaderboard import BeirLeaderboard
from pv211_utils.beir.eval import BeirEvaluation

test_queries = data.load_test_queries(Query)
test_judgements = data.load_test_judgements()
leaderboard = BeirLeaderboard()

In [14]:
submit_result = False
author_name = 'Strompová, Alžbeta'

system = IRSystem(documents)
evaluation = BeirEvaluation(system, test_judgements, k=10, leaderboard=leaderboard, author_name=author_name)
evaluation.evaluate(test_queries, submit_result)

5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0% 50.0% 55.0% 60.0% 65.0% 70.0% 75.0% 80.0% 85.0% 90.0% 95.0% 100.0% 

Your system achieved **39.74% MAP score**.

Congratulations, you passed the **25%** minimum! 🥳

Set `submit_result = True` and write your name to the `author_name` variable to submit your result to [the leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vSLYKoYcsTgyTp2T-pNgW2heZrwvmBVKAgWAAG_vELv8kgnxHffnJ-IKt5huAacvO7r-zKWOgSiqWFU/pubhtml?gid=0&single=true). 🏆

The best submissions on the leaderboard will receive *small awards during the semester*, and some *__seriously big__ awards* after the personal check at the end of the competition (2023-05-01). Please be polite, do not spoil the game for the others, and **have fun!** 😉