# Data Loading

The `datasets` module provides an interface for loading the Arqmath, Cranfield, Trec, and Beir collections.

## Cranfield

The Cranfield collection consists of 1398 abstracts of aerodynamics journal articles, a set of 225 queries,
and exhaustive relevance judgments of all (query, document) pairs.[1, Section 8.2]

To load the Cranfield collection, we construct a `CranfieldDataset` object, where we can set how the data set is split into test, train, and validation sets.



[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.

In [None]:
from pv211_utils.datasets import CranfieldDataset

# Creating the dataset object and setting the parameters.
cranfield_data = CranfieldDataset(test_split_size=0.2, validation_split_size=0)

# The parameters can be changed with methods. For example:
cranfield_data.set_validation_split_size(new_size=0.1)

# Loading documents.
documents = cranfield_data.load_documents()  # -> OrderedDict of {document_id : Document}

print("document body:")
print(list(documents.values())[0].body)

# Examples of loading queries and judgements
queries = cranfield_data.load_test_queries() # -> OrderedDict of {query_id : Query}
judgements = cranfield_data.load_train_judgements() # -> Set of (Query, Document) pairs

## Arqmath

The Arqmath dataset is based on threads from Math StackExchange and consists of queries, answers, questions, and relevance judgements between queries and answers. For more information, see the <a href="https://www.cs.rit.edu/~dprl/ARQMath/index.html">Arqmath website</a>.

We construct an `ArqmathDataset` object, where we can set how the data set is split into test, train, and validation sets. The test/train split is determined by year: the chosen year becomes the test set and the remaining two years become the train set. The validation set is obtained by further splitting the train set. We also need to choose a text format, which determines the format/encoding of the text and mathematical formulae in questions, answers, and queries.

Available years: `2020`, `2021`, `2022`.

Available text formats: 
- `text` - Plain text which does not contain any mathematical formulae.
- `text+latex` - Plain text with mathematical formulae in LaTeX surrounded by dollar signs.
- `text+prefix` - Plain text with mathematical formulae in [the prefix format][1].
- `text+tangentl` - Plain text with mathematical formulae in [the mathtuples format][5] of [the Tangent-L system][6].
- `xhtml+latex` - XHTML text with mathematical formulae in LaTeX, surrounded by the `<span class="math-container">` tags.
- `xhtml+pmml` - XHTML text with mathematical formulae in the [Presentation MathML][4] XML format.
- `xhtml+cmml` - XHTML text with mathematical formulae in the [Content MathML][3] XML format.

 [1]: http://ceur-ws.org/Vol-2696/paper_235.pdf#page=5
 [2]: https://en.wikipedia.org/wiki/Polish_notation
 [3]: https://www.w3.org/TR/MathML2/chapter4.html
 [4]: https://www.w3.org/TR/MathML2/chapter3.html
 [5]: https://github.com/fwtompa/mathtuples
 [6]: http://ceur-ws.org/Vol-2936/paper-05.pdf#page=3

In [None]:
from pv211_utils.datasets import ArqmathDataset

# Creating the dataset object and setting the parameters.
arqmath_data = ArqmathDataset(year=2022, text_format="text", validation_split_size=0.2)

# The parameters can be changed with methods. For example:
arqmath_data.set_text_format(new_text_format="text+latex")

# Loading answers and questions.
questions = arqmath_data.load_questions() # -> OrderedDict of {question_id : Question}
answers = arqmath_data.load_answers() # -> OrderedDict of {answer_id : Answer}

print("answer body:")
print(list(answers.values())[0].body)

print("question body:")
print(list(questions.values())[0].body)

# Examples of loading queries and judgements
queries = arqmath_data.load_test_queries() # -> OrderedDict of {query_id : Query}
judgements = arqmath_data.load_train_judgements() # -> Set of (Query, Document) pairs


## Trec
"Text Retrieval Conference (TREC). The U.S. National Institute of Standards
and Technology (NIST) has run a large IR test bed evaluation series since
1992.[...] TRECs 6–8 provide 150 information needs
over about 528,000 newswire and Foreign Broadcast Information Service
articles. [...] there are no exhaustive relevance judgments." [1, Section 8.2]

To load the Trec collection, we construct a `TrecDataset` object, where we can set how to split the train set into train and validation sets.


[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.

In [None]:
from pv211_utils.datasets import TrecDataset

# Creating the dataset object and setting the parameters.
trec_data = TrecDataset(validation_split_size=0)

# The parameters can be changed with methods. For example:
trec_data.set_validation_split_size(new_size=0.2)

# Loading documents.
documents = trec_data.load_documents() # -> OrderedDict of {document_id : Document}

print("document body:")
print(list(documents.values())[0].body)

# Examples of loading queries and judgements
queries = trec_data.load_test_queries() # -> OrderedDict of {query_id : Query}
judgements = trec_data.load_train_judgements() # -> Set of (Query, Document) pairs

## Beir
Beir is a benchmark/collection of heterogeneous datasets. For more information about the collection or the available datasets, see the <a href="https://github.com/beir-cellar/beir">Beir GitHub page</a>.

To load a dataset from the Beir collection, we create a `BeirDataset` object and specify the desired dataset's name. The splitting into test/train/validation sets is as provided by the given dataset (see the <a href="https://github.com/beir-cellar/beir">Beir GitHub page</a>).

Available datasets:
- `msmarco` and `msmarco-v2` - A dataset of questions from Bing search query logs with human-generated answers.
- `trec-covid` - A collection of COVID-related questions and documents from the COVID-19 Open Research Dataset.
- `nfcorpus` - A medical dataset consisting of questions and medical documents.
- `nq` - Questions and Wikipedia-based documents.
- `hotpotqa` - Wikipedia-based question-answer pairs.
- `fiqa` - Questions and documents from various financial sources.
- `arguana` - A corpus of argument and best counterargument pairs.
- `webis-touche2020` - A corpus focused on looking for arguments for and against a given topic.
- `quora` - A dataset consisting of potential question duplicate pairs.
- `dbpedia-entity` - Consists of free text queries and entities (entity search).
- `scidocs` - A dataset of scientific documents and indications of their relatedness.
- `fever` - A dataset of claims with annotations consisting of their validity and the evidence.
- `climate-fever` - A dataset of claims about climate with evidence for or against them from Wikipedia articles.
- `scifact` - Scientific claims paired with evidence that supports/refutes them.

In [None]:
from pv211_utils.datasets import BeirDataset

# Creating the dataset object and specifying the dataset to be loaded.
beir_data = BeirDataset(dataset_name="scifact") 

# Loading documents.
documents = beir_data.load_documents() # -> OrderedDict of {document_id : Document}

print("document body:")
print(list(documents.values())[0].body)

# Examples of loading queries and judgements
queries = beir_data.load_test_queries() # -> OrderedDict of {query_id : Query}
judgements = beir_data.load_train_judgements() # -> Set of (Query, Document) pairs

# Preprocessing

## Document preprocessing

The preprocessing module contains several classes for preprocessing documents, taking string inputs and producing outputs as lists of strings.

In [None]:
from pv211_utils.preprocessing import NoneDocPreprocessing, LowerDocPreprocessing

text = "2 Horses jumping over a fence in San-Francisco (město v U.S.A.)"

# NoneDocPreprocessing splits the input on spaces

none_preprocess = NoneDocPreprocessing()
print(none_preprocess(text))

lower_preprocess = LowerDocPreprocessing()
print(lower_preprocess(text))

In [None]:
from pv211_utils.preprocessing import SimpleDocPreprocessing

# SimpleDocPreprocessing splits the input on spaces and can remove accents or words that are too short/long

preprocess = SimpleDocPreprocessing()
preprocess2 = SimpleDocPreprocessing(min_len=0)
preprocess3 = SimpleDocPreprocessing(min_len=4, max_len=6, deacc=True)

print(preprocess(text))
print(preprocess2(text))
print(preprocess3(text))


In [None]:
from pv211_utils.preprocessing import DocPreprocessing
from gensim.parsing.porter import PorterStemmer
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

def simple_stem(x):
    return x.replace("es", "").replace("ing", "")


stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# DocPreprocessing offers more options, like stemming or lemmatization

preprocess = DocPreprocessing(stem=simple_stem, stopwords=['fence'])
preprocess2 = DocPreprocessing(stem=stemmer.stem, lemm=lemmatizer.lemmatize)
preprocess3 = DocPreprocessing(stem=stemmer.stem)
preprocess4 = DocPreprocessing(lemm=lemmatizer.lemmatize)

print(preprocess(text))
print(preprocess2(text))
print(preprocess3(text))
print(preprocess4(text))


## Math Preprocessing

It's often a good idea to preprocess plain-text math. Language models like BERT are often trained on LaTeX representations, so that format is likely to give the best results with them. If you are trying to do something fancier, like building syntax trees, you may find it easier to work with MathML representations.

In [None]:
from pv211_utils.preprocessing import exp_to_latex, exp_to_pmathml, exp_to_cmathml

pyexp = "2**3 + ((4*x)**2) / 8"

print(f"expression:\n{pyexp}")

print(f"\nlatex:\n{exp_to_latex(pyexp)}")

print(f"\npresentation mathml:\n{exp_to_pmathml(pyexp)}")

print(f"\ncontent mathml:\n{exp_to_cmathml(pyexp)}")

# Information Retrieval Systems

pv211_utils provides several systems, ranging from traditional ones like BM25 to systems based on Transformer architectures. Feel free to use and explore them, but note that to build a state-of-the-art system, you might need to write and tune one from scratch.

In [None]:
from pv211_utils.datasets import CranfieldDataset
from pv211_utils.systems import BM25PlusSystem
from pv211_utils.preprocessing import NoneDocPreprocessing
from pv211_utils.evaluation_metrics import mean_precision

cranfield = CranfieldDataset(0.25)

judgements = cranfield.load_test_judgements()
queries = cranfield.load_test_queries()
documents = cranfield.load_documents()

preprocessing = NoneDocPreprocessing()
bm25 = BM25PlusSystem(documents, preprocessing)
result = mean_precision(system=bm25, queries=queries, judgements=judgements, k=5, num_processes=1)

print(result)


pv211_utils contains two transformer-based IR systems: a retriever and a reranker.

- Retriever systems compute a vector representation of each document and, during a search, compare these with the query representation using a similarity measure such as cosine similarity or Euclidean distance.
- Reranker systems contain a retriever and additionally rerank the top-k retrieved documents using a CrossEncoder (for more, see https://www.sbert.net/examples/applications/cross-encoder/README.html).
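
To make the retriever idea concrete, here is a minimal sketch of ranking by cosine similarity. It is not how `RetrieverSystem` is implemented: the function names are hypothetical, and the tiny hand-written vectors stand in for the embeddings a real model would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, doc_vecs, top_k):
    """Return the ids of the top_k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
top = retrieve([1.0, 0.1], doc_vecs, top_k=2)
print(top)
```

A reranker would take such a top-k list and reorder it with a second, more expensive model that scores each (query, document) pair jointly.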

In [None]:
from pv211_utils.datasets import ArqmathDataset
from pv211_utils.systems import RetrieverSystem
from pv211_utils.evaluation_metrics import mean_average_precision
from sentence_transformers import SentenceTransformer

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

arqmath_data = ArqmathDataset(2020, "text+latex")

answers = arqmath_data.load_answers()
# There are 1.4M answers in the full ARQMath. Let's use just a small subset for demonstration.
answers_subset = dict(list(answers.items())[0:50000])

# use pretrained transformer model from hugging face and embed our subset
retriever_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
retriever_system = RetrieverSystem(retriever_model, answers_subset)

queries = arqmath_data.load_test_queries()
judgements = arqmath_data.load_test_judgements()

# evaluate retriever system
result = mean_average_precision(retriever_system, queries, judgements, 100, 1)
print(result)

# Evaluation metrics

The `evaluation_metrics` module provides functions/metrics for evaluation of IR systems. 
The metrics included are:
- Mean Precision
- Mean Recall
- Mean Average Precision (MAP)
- Normalized Discounted Cumulative Gain (nDCG)
- Mean Bpref

## Mean Precision

Precision at k is the fraction of the first k retrieved documents that are relevant. Mean precision is the mean of this value over all queries. The formula:

$\text{P}_k = \frac{\text{number of relevant documents in top k}}{k}$

$\text{MP}_k = \frac{1}{|Q|} \sum_{q \in Q} \text{P}_k(q) $

Where $Q$ is the set of queries.
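
As a worked example of the two formulas, here is a minimal pure-Python sketch. The function names and the toy result lists are hypothetical and independent of `pv211_utils`.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """P_k: number of relevant documents in the top k, divided by k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(results, relevant, k):
    """MP_k: mean of P_k over all queries (both dicts are keyed by query id)."""
    return sum(precision_at_k(results[q], relevant[q], k) for q in results) / len(results)


# Toy data: ranked result lists and relevant-document sets for two queries.
results = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7", "d8"]}
relevant = {"q1": {"d1", "d3"}, "q2": {"d8"}}

mp = mean_precision_at_k(results, relevant, k=4)
print(mp)  # q1 has P_4 = 2/4, q2 has P_4 = 1/4, so the mean is 0.375
```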

<br> 

To calculate the mean precision of an IR system, we call the `mean_precision` function:

In [None]:
judgements = cranfield.load_test_judgements()
queries = cranfield.load_test_queries()
documents = cranfield.load_documents()

preprocessing = NoneDocPreprocessing()
bm25 = BM25PlusSystem(documents, preprocessing)

In [None]:
from pv211_utils.evaluation_metrics import mean_precision 

mp_score = mean_precision(system=bm25,            # System to be evaluated (must follow the IRSystemBase template).
                          queries=queries,        # Queries to be used in the evaluation.
                          judgements=judgements,  # Judgements to be used in the evaluation.
                          k=10,                   # Depth of the evaluation.
                          num_processes=4)        # Number of processes/workers to be used to run the evaluation.

print(f"Mean Precision: {mp_score}")

## Mean Recall

Recall at k is the number of relevant documents among the first k retrieved documents divided by the total number of relevant documents. Mean recall is the mean of this value over all queries. The formula:

$\text{R}_k = \frac{\text{number of relevant documents in top k}}{\text{number of relevant documents}}$

$\text{MR}_k = \frac{1}{|Q|} \sum_{q \in Q} \text{R}_k(q) $

Where $Q$ is the set of queries.
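
The same kind of sketch works for recall. Again, the names and toy data are hypothetical; returning 0 for a query without relevant documents is an assumption of this sketch, not something the formula prescribes.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """R_k: relevant documents in the top k, divided by all relevant documents."""
    if not relevant_ids:
        return 0.0  # assumption: a query with no relevant documents scores 0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(results, relevant, k):
    """MR_k: mean of R_k over all queries (both dicts are keyed by query id)."""
    return sum(recall_at_k(results[q], relevant[q], k) for q in results) / len(results)


results = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
relevant = {"q1": {"d1", "d3", "d9"}, "q2": {"d6"}}

mr = mean_recall_at_k(results, relevant, k=2)
print(mr)  # q1 has R_2 = 1/3, q2 has R_2 = 1/1, so the mean is 2/3
```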

<br> 

To calculate the mean recall of an IR system, we call the `mean_recall` function:

In [None]:
from pv211_utils.evaluation_metrics import mean_recall 

mr_score = mean_recall(system=bm25,            # System to be evaluated (must follow the IRSystemBase template).
                       queries=queries,        # Queries to be used in the evaluation.
                       judgements=judgements,  # Judgements to be used in the evaluation.
                       k=10,                   # Depth of the evaluation.
                       num_processes=4)        # Number of processes/workers to be used to run the evaluation.

print(f"Mean Recall: {mr_score}")

## MAP 

MAP is calculated by first computing the average precision (AP) at k for each query (only the first k returned documents are considered) and then taking the mean of these APs. The average precision assigns less value to relevant documents that are lower in the ranked list of retrieved documents than to those in higher positions. The formula:

$\text{AP}_k = \frac{1}{\text{number of relevant documents in top k}} \cdot \sum_{i = 1}^{k} \text{P}_i \cdot r_i$

$\text{MAP}_k = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}_k(q)$

Where $\text{P}_i$ is the precision at i, $r_i$ is an indicator of the relevance of the i-th document, and $Q$ is the set of queries.
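
The AP formula can be traced with a short sketch. The function name and toy ranking are hypothetical, and returning 0 when the top k contains no relevant document is an assumption made to avoid dividing by zero.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP_k: mean of P_i over the positions i <= k that hold a relevant document."""
    hits = 0
    precision_sum = 0.0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:  # r_i = 1
            hits += 1
            precision_sum += hits / i  # P_i at this position
    return precision_sum / hits if hits else 0.0


# Relevant documents sit at positions 1 and 3, so AP_4 = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=4)
print(ap)
```

MAP is then the plain average of these per-query values.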

<br> 

To calculate the MAP of an IR system, we call the `mean_average_precision` function:

In [None]:
from pv211_utils.evaluation_metrics import mean_average_precision

map_score = mean_average_precision(system=bm25,            # System to be evaluated (must follow the IRSystemBase template).
                                   queries=queries,        # Queries to be used in the evaluation.
                                   judgements=judgements,  # Judgements to be used in the evaluation.
                                   k=10,                   # Depth of the evaluation.
                                   num_processes=4)        # Number of processes/workers to be used to run the evaluation.

print(f"MAP: {map_score}")

## Mean nDCG

nDCG is calculated as the ratio of the discounted cumulative gain (DCG) to the ideal discounted cumulative gain (IDCG). The DCG accumulates the
relevance gain of the documents in the retrieved document list, where a document's relevance value is penalized (discounted) for being lower in the result.
The IDCG represents the maximum possible DCG for a given query and is used to normalize the DCG. This normalization is done because the value of DCG also depends on the list's size or, more specifically, on the total number of relevant documents in the result list. The formula:

$\text{DCG}_k = \sum_{i=1}^{k}\frac{r_i}{\log_{2}(i + 1)}$

$\text{IDCG}_k = \sum_{i=1}^{|\text{rel}_k|}\frac{\text{rel}_k[i]}{\log_{2}(i + 1)}$

$\text{mean_nDCG}_k = \frac{1}{|Q|} \sum_{q \in Q} \frac{\text{DCG}_k(q)}{\text{IDCG}_k(q)}$

Where $r_i$ is an indicator of the relevance of the i-th document, $\text{rel}_k$ is the list of relevant documents sorted by their relevance, up to position k,<br>
 $\text{rel}_k[i]$ is the relevance of the i-th document in the $\text{rel}_k$ list, and $Q$ is the set of queries.
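
A per-query nDCG can be sketched as follows. The function names are hypothetical; the sketch assumes relevance is given as a dict from document id to a numeric gain, and that documents missing from the dict have gain 0.

```python
import math


def dcg(gains):
    """Sum of gains discounted by log2(position + 1), positions starting at 1."""
    return sum(gain / math.log2(i + 1) for i, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG_k of the ranking divided by IDCG_k of the ideal ordering."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


relevance = {"d1": 1, "d3": 1}
score = ndcg_at_k(["d1", "d2", "d3"], relevance, k=3)    # relevant at positions 1 and 3
perfect = ndcg_at_k(["d1", "d3", "d2"], relevance, k=3)  # relevant at positions 1 and 2
print(score, perfect)
```

The second ranking places both relevant documents first, so its DCG equals the IDCG and its nDCG is 1.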

<br> 

To calculate the mean nDCG of an IR system, we call the `normalized_discounted_cumulative_gain` function:

In [None]:
from pv211_utils.evaluation_metrics import normalized_discounted_cumulative_gain 

ndcg_score = normalized_discounted_cumulative_gain(system=bm25,            # System to be evaluated (must follow the IRSystemBase template).
                                                   queries=queries,        # Queries to be used in the evaluation.
                                                   judgements=judgements,  # Judgements to be used in the evaluation.
                                                   k=10,                   # Depth of the evaluation.
                                                   num_processes=4)        # Number of processes/workers to be used to run the evaluation.

print(f"nDCG: {ndcg_score}")

## Mean Bpref

"Bpref is a preference-based information retrieval measure that considers whether relevant documents are ranked above irrelevant ones.<br>
 It is designed to be robust to missing relevance judgments, such that it gives the same experimental outcome with incomplete judgments <br> that Mean Average Precision would with complete judgments."[2]
The formula:

$\text{Bpref}_k = \frac{1}{|R|} \sum_{r \in R} \left(1 - \frac{\text{|n ranked higher than r|}}{|R|}\right)$

$\text{mean_Bpref}_k = \frac{1}{|Q|} \sum_{q \in Q} \text{Bpref}_k(q)$

Where $R$ is the set of relevant documents among the top k documents in the result list, $n$ is a nonrelevant document among the first $|R|$ retrieved nonrelevant documents, and $Q$ is the set of queries.
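
The following sketch follows the formula above for a single query. The function name is hypothetical, and for simplicity it treats every document outside the relevant set as nonrelevant, which is an assumption: with incomplete judgements, only documents judged nonrelevant would normally count.

```python
def bpref_at_k(ranked_ids, relevant_ids, k):
    """Bpref_k as defined above, counting at most |R| nonrelevant documents."""
    top_k = ranked_ids[:k]
    num_relevant = sum(1 for doc_id in top_k if doc_id in relevant_ids)  # |R|
    if num_relevant == 0:
        return 0.0
    nonrelevant_seen = 0
    total = 0.0
    for doc_id in top_k:
        if doc_id in relevant_ids:
            # Only the first |R| nonrelevant documents ranked above r are counted.
            total += 1 - min(nonrelevant_seen, num_relevant) / num_relevant
        else:
            nonrelevant_seen += 1
    return total / num_relevant


ranking = ["d1", "n1", "d2", "n2", "n3", "d3"]
score = bpref_at_k(ranking, {"d1", "d2", "d3"}, k=6)
print(score)  # contributions 1, 2/3 and 0 give (1 + 2/3 + 0) / 3 = 5/9
```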

<br>

To calculate the mean Bpref of an IR system, we call the `mean_bpref` function:


<br>

[2] Craswell, N. (2009). Bpref. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_489

In [None]:
from pv211_utils.evaluation_metrics import mean_bpref 

bpref_score = mean_bpref(system=bm25,            # System to be evaluated (must follow the IRSystemBase template).
                         queries=queries,        # Queries to be used in the evaluation.
                         judgements=judgements,  # Judgements to be used in the evaluation.
                         k=10,                   # Depth of the evaluation.
                         num_processes=4)        # Number of processes/workers to be used to run the evaluation.

print(f"Bpref: {bpref_score}")

# Ensembling Algorithms

The `ensembles` module provides functions for ensembling IR systems. These include methods based on:
- inverse mean rank
- inverse median rank
- reciprocal rank fusion
- IBC
- weighted IBC
- RBC

An example of how to create an ensemble IR system using inverse mean rank:

In [None]:
# importing an ensemble algorithm
from pv211_utils.ensembles import inverse_mean_rank
# other imports
from pv211_utils.systems import BM25PlusSystem, TfidfSystem
from pv211_utils.datasets import CranfieldDataset
from pv211_utils.preprocessing import NoneDocPreprocessing
from pv211_utils.irsystem import IRSystemBase
from pv211_utils.evaluation_metrics import mean_average_precision
 
# Create the systems to be ensembled and load data
data = CranfieldDataset()
system_1 = BM25PlusSystem(data.load_documents(), NoneDocPreprocessing())
system_2 = TfidfSystem(data.load_documents(), NoneDocPreprocessing())

# Create the ensemble IR system
class EnsembleSystem(IRSystemBase):
    def __init__(self, systems):
        self.systems = systems

    def search(self, query):
        return inverse_mean_rank(query, self.systems)

# Ensemble system_1 and system_2
ens_system = EnsembleSystem([system_1, system_2])

# We can evaluate its MAP score and compare it to individual systems' scores
print(f"BM25 system's MAP: {mean_average_precision(system_1, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")
print(f"TF-IDF system's MAP: {mean_average_precision(system_2, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")
print(f"ensemble system's MAP: {mean_average_precision(ens_system, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")

## Inverse Mean Rank

The inverse mean rank ensembling algorithm takes a list of IR systems and a query and produces a list of documents ranked by their inverse mean rank. The formula for the inverse mean rank of document $i$:

$\text{inverse_mean_rank}_i = \frac{1}{\text{mean}(\text{ranks}_i)}$

Where $\text{ranks}_i$ is a list of ranks of document $i$ in the results of individual systems.
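
As an illustration of the formula (not of the library's `inverse_mean_rank` implementation), the sketch below fuses three toy rankings. The function name is hypothetical, and it assumes every system ranks the same set of documents.

```python
from statistics import mean


def inverse_mean_rank_scores(rankings):
    """Map each document id to 1 / mean(rank of the document in each ranking)."""
    ranks = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            ranks.setdefault(doc_id, []).append(rank)
    return {doc_id: 1 / mean(doc_ranks) for doc_id, doc_ranks in ranks.items()}


# Ranked result lists of three hypothetical systems for one query.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
scores = inverse_mean_rank_scores(rankings)
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # mean ranks: a = 2, b = 4/3, c = 8/3, so b comes first
```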

## Inverse Median Rank

The inverse median rank ensembling algorithm takes a list of IR systems and a query and produces a list of documents ranked by their inverse median rank. The formula for the inverse median rank of document $i$:

$\text{inverse_median_rank}_i = \frac{1}{\text{median}(\text{ranks}_i)}$

Where $\text{ranks}_i$ is a list of ranks of document $i$ in the results of individual systems.
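
A minimal sketch of the median variant, under the same assumptions (hypothetical function name, every system ranks the same documents). In the toy data the third system disagrees strongly with the other two, and the median ignores that outlier.

```python
from statistics import median


def inverse_median_rank_scores(rankings):
    """Map each document id to 1 / median(rank of the document in each ranking)."""
    ranks = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            ranks.setdefault(doc_id, []).append(rank)
    return {doc_id: 1 / median(doc_ranks) for doc_id, doc_ranks in ranks.items()}


rankings = [["a", "b", "c"], ["a", "b", "c"], ["c", "b", "a"]]
scores = inverse_median_rank_scores(rankings)
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # median ranks: a = 1, b = 2, c = 3
```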

## Reciprocal Rank Fusion

The reciprocal rank fusion (RRF) ensembling algorithm takes a list of IR systems, a query, and a parameter $k$ and produces a list of documents ranked by RRF score, where for document $i$ the formula is:

$\text{RRF_score}_i = \sum_{s \in S}\frac{1}{k + \text{rank}_i(s)}$

Where $S$ is the set of the ensembled systems, $\text{rank}_i(s)$ is the rank of document $i$ in the result of system $s$, and $k$ is a parameter.

The authors' reasoning behind the constant $k$ in the formula is that it mitigates the impact of high ranks given by outlier systems [1].

[1] Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '09). Association for Computing Machinery, New York, NY, USA, 758–759. https://doi.org/10.1145/1571941.1572114
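
The RRF score can be sketched in a few lines. The function name is hypothetical; the default `k=60` is the value used in the experiments of the paper cited above and is not necessarily the default of `pv211_utils`.

```python
def rrf_scores(rankings, k=60):
    """Map each document id to the sum over systems of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return scores


# Ranked result lists of three hypothetical systems for one query.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
scores = rrf_scores(rankings)
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)
```

Unlike the mean and median variants, a document missing from one system's list simply receives no contribution from that system.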

## IBC

The IBC ensembling algorithm takes a list of IR systems and a query and produces a list of documents ranked by the score given by the following formula (for document $i$):

$\text{ibc}_i = \frac{|D| - \text{median}(\text{ranks}_i)}{|D|}$

Where $\text{ranks}_i$ is a list of ranks of document $i$ in the results of individual systems, and $D$ is the set of all documents.

IBC also incorporates a tie-breaking mechanism: ties between documents are broken by calculating new scores from a rank selected uniformly at random from each document's list of ranks (instead of taking the median rank). Further ties are broken randomly.
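
The scoring and tie-breaking described above might be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name, the `rng` parameter, and the use of a sort key to apply the tie-breakers are all choices made for this sketch.

```python
import random
from statistics import median


def ibc_ranking(rankings, num_documents, rng=None):
    """Rank documents by (|D| - median rank) / |D| with random tie-breaking."""
    rng = rng or random.Random(0)
    ranks = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            ranks.setdefault(doc_id, []).append(rank)

    def score(rank):
        return (num_documents - rank) / num_documents

    keyed = []
    for doc_id, doc_ranks in ranks.items():
        primary = score(median(doc_ranks))
        # First tie-breaker: score of a uniformly chosen rank of this document.
        tie_break = score(rng.choice(doc_ranks))
        # Second tie-breaker: a random number, so remaining ties break randomly.
        keyed.append((primary, tie_break, rng.random(), doc_id))
    keyed.sort(reverse=True)
    return [doc_id for *_, doc_id in keyed]


rankings = [["a", "b", "c"], ["a", "b", "c"], ["c", "b", "a"]]
fused = ibc_ranking(rankings, num_documents=3)
print(fused)  # median ranks 1, 2, 3 give scores 2/3, 1/3, 0
```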

## Weighted IBC

The Weighted IBC (WIBC) ensembling algorithm takes a list of IR systems, a query, and the weights of the systems and produces a list of documents ranked by the score given by the following formula (for document $i$):

$\text{wibc}_i = \frac{|D| - \text{weighted_median}(\text{ranks}_i, \text{ weights})}{|D|}$

Where $\text{ranks}_i$ is a list of ranks of document $i$ in the results of individual systems, weights is the list of the systems' weights, and $D$ is the set of all documents.

WIBC also incorporates a tie-breaking mechanism: ties between documents are broken by calculating new scores from a rank selected at random (with the distribution defined by the weights) from each document's list of ranks (instead of taking the median rank). Further ties are broken randomly.
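
The weighted score can be illustrated with the sketch below, which leaves out tie-breaking. The function names are hypothetical, and the weighted median used here (the smallest rank at which the cumulative weight reaches half of the total) is one common definition, assumed for this sketch; the library may define it differently. It also assumes each document appears in every system's ranking, so ranks and weights line up.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


def wibc_scores(rankings, weights, num_documents):
    """Map each document id to (|D| - weighted median rank) / |D|."""
    ranks = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            ranks.setdefault(doc_id, []).append(rank)
    return {
        doc_id: (num_documents - weighted_median(doc_ranks, weights)) / num_documents
        for doc_id, doc_ranks in ranks.items()
    }


# The first system is trusted three times as much as the second.
rankings = [["a", "b", "c"], ["c", "b", "a"]]
scores = wibc_scores(rankings, weights=[3, 1], num_documents=3)
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # the heavier first system decides the order
```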

## RBC

The RBC algorithm uses a trained model (linear regression by default) to estimate the relevance of a document based on its ranks in the result lists of the individual systems (more specifically, the score used to train/predict is calculated as $\frac{|D| - \text{rank}}{|D|}$, where $D$ is the set of all documents). The resulting document list is sorted by this predicted relevance.

To create an RBC ensemble, we create an object of the `Rbc` class, whose parameters are the systems to be ensembled, the queries and judgements used for training the model, and optionally a pipeline (the default is a standard scaler followed by a linear regression model).

In [None]:
# Import the Rbc class.
from pv211_utils.ensembles import Rbc
# Other imports.
from pv211_utils.systems import BM25PlusSystem, TfidfSystem
from pv211_utils.datasets import CranfieldDataset 
from pv211_utils.preprocessing import DocPreprocessing 
from pv211_utils.evaluation_metrics import mean_average_precision
 
# Create the systems to be ensembled and load data.
data = CranfieldDataset(0.1)
system_1 = BM25PlusSystem(data.load_documents(), DocPreprocessing())
system_2 = TfidfSystem(data.load_documents(), DocPreprocessing())

# Create an ensemble system.
rbc_ens = Rbc([system_1, system_2], # Systems to be ensembled.
              data.load_train_queries(), # Queries used for training.
              data.load_train_judgements()) # Judgements used for training.


# We can evaluate its MAP score and compare it to individual systems' scores
print(f"BM25 system's MAP: {mean_average_precision(system_1, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")
print(f"TF-IDF system's MAP: {mean_average_precision(system_2, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")
print(f"RBC ensemble system's MAP: {mean_average_precision(rbc_ens, data.load_test_queries(), data.load_test_judgements(), 10, 4)}")