# NIR 2022 - Lab 3: Evaluation Metrics

In Lab 2, we saw how to index a collection of documents and how to search the index with different systems in PyTerrier.
At the end of Lab 2, we also saw how to evaluate the performance of the different systems using standard metrics such as MAP and NDCG.

Today, we will take a closer look at standard evaluation metrics.
In particular, we will see how to use `pytrec_eval`, a Python library for evaluating runs on TREC-style data, whether or not you use PyTerrier.

## Systems Setup

We will start by building an index of our data collection and a few systems in PyTerrier.
This step is only required to obtain system outputs.

As we will see shortly, `pytrec_eval` only needs access to output files, which can be obtained in any other way.

In [None]:
# Load the data
import pandas as pd

# corpus
docs_df = pd.read_csv('data/lab_docs.csv', dtype=str)
print(docs_df.shape)
print(docs_df.head())

# topics
topics_df = pd.read_csv('data/lab_topics.csv', dtype=str)
print(topics_df.shape)
print(topics_df.head())

In [None]:
# Init PyTerrier
import pyterrier as pt
if not pt.started():
    pt.init()

In [None]:
# Build index
indexer = pt.DFIndexer("./indexes/default", overwrite=True, blocks=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

In [None]:
# Build IR systems
tf = pt.BatchRetrieve(index, wmodel="Tf")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## Search and Evaluate in PyTerrier

In PyTerrier, we can use `search()` to retrieve documents relevant to a given query.

In [None]:
# Search the index for a query using TF-IDF model
tfidf.search("black wall").head(10)

We can also search for multiple queries at once by grouping them in a Pandas DataFrame and then using the `transform()` method.

In [None]:
# Search the index for multiple queries using TF-IDF model
queries = pd.DataFrame([["q1", "dragon"], ["q2", "wall"]], columns=["qid", "query"])
tfidf.transform(queries).head(10)

Finally, PyTerrier provides an interface for evaluating the performance of IR systems through the `Experiment` abstraction.
Behind the scenes, `pt.Experiment` uses the `pytrec_eval` library!

In [None]:
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
qrels_df.head()

In [None]:
topics_df.head()

In [None]:
# Evaluate systems on the first three topics using the PyTerrier Experiment interface
qrels_df = qrels_df.astype({'label': 'int32'})
pt.Experiment(
    retr_systems=[tf, tfidf, bm25],
    names=['TF', 'TF-IDF', 'BM25'],
    topics=topics_df[:3],
    qrels=qrels_df,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "P_10"])
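To make the metric names passed to `eval_metrics` more concrete, here is a minimal sketch of how precision at k (`P_10` when k=10) and NDCG at k (`ndcg_cut_10` when k=10) can be computed by hand for a single query. It assumes linear gain and a log2 rank discount, which is one common NDCG formulation and may differ in details from what `trec_eval` does. The toy labels, the toy ranking, and the helper names `precision_at_k`, `dcg_at_k`, and `ndcg_at_k` are made up for illustration; they are not part of PyTerrier or `pytrec_eval`.

```python
import math

def precision_at_k(ranking, labels, k):
    """Fraction of the top-k positions holding a relevant document (label > 0)."""
    hits = sum(1 for docno in ranking[:k] if labels.get(docno, 0) > 0)
    return hits / k  # divide by k even if fewer than k documents were retrieved

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranking, labels, k):
    """DCG of the ranking divided by the DCG of an ideal ordering of the labels."""
    gains = [labels.get(docno, 0) for docno in ranking]  # unjudged docs count as 0
    ideal = sorted(labels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy data (hypothetical): graded labels for one query and a system ranking.
toy_labels = {"d1": 2, "d2": 0, "d3": 1}
toy_ranking = ["d2", "d1", "d3"]

toy_p2 = precision_at_k(toy_ranking, toy_labels, k=2)
toy_ndcg3 = ndcg_at_k(toy_ranking, toy_labels, k=3)
print(toy_p2, round(toy_ndcg3, 4))
```

Averaging such per-query values over all topics gives the single number per system that an evaluation table reports.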

## Transformers & Operators

You may have noticed that `BatchRetrieve` has a `transform()` method that takes a dataframe as input and returns another dataframe, which is a *transformation* of the input (e.g., a retrieval transformation). In fact, `BatchRetrieve` is just one of many similar objects in PyTerrier, which we call [transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html) (represented by the `TransformerBase` class).

Let's take a look at a `BatchRetrieve` transformer, starting with one for the TF_IDF weighting model.

In [None]:
# check tfidf is a transformer...
print(isinstance(tfidf, pt.transformer.TransformerBase))

In [None]:
# this shows the type hierarchy (method resolution order) of the class of the tfidf object
tfidf.__class__.__mro__

The interesting capability of all transformers is that they can be combined using Python operators (this is called operator overloading).

Concretely, imagine that you want to chain transformers together, e.g. rank documents first by Tf and then re-rank the exact same documents by TF_IDF. We can do this using the `>>` operator, which we call composition, or "then".

In [None]:
# now let's define a pipeline 
pipeline = tf >> tfidf
# the composed pipeline is itself a transformer
print(isinstance(pipeline, pt.transformer.TransformerBase))

In [None]:
print(tf.search("black wall"))
print(pipeline.search("black wall"))
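To illustrate the idea of operator overloading behind `>>`, here is a minimal, self-contained sketch in plain Python. `ToyTransformer` is a hypothetical stand-in written for this example; it is not PyTerrier's class, and it models results as a list of `(docno, score)` pairs rather than a dataframe.

```python
class ToyTransformer:
    """Hypothetical stand-in for a transformer; not the real PyTerrier class."""

    def __init__(self, fn):
        self.fn = fn

    def transform(self, results):
        return self.fn(results)

    def __rshift__(self, other):
        # `a >> b` builds a new transformer that runs a, then feeds its output to b.
        return ToyTransformer(lambda results: other.transform(self.transform(results)))


# First stage: keep the two highest-scoring documents.
top2 = ToyTransformer(lambda res: sorted(res, key=lambda r: r[1], reverse=True)[:2])
# Second stage: re-score the documents it receives (here, simply double the score).
double = ToyTransformer(lambda res: [(docno, 2 * score) for docno, score in res])

toy_pipeline = top2 >> double
toy_output = toy_pipeline.transform([("d1", 1.0), ("d2", 3.0), ("d3", 2.0)])
print(toy_output)
```

The second stage only ever sees the documents that the first stage returned, which is the same reason a composed pipeline re-ranks a candidate set rather than searching the whole index again.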

## Practice Task – Pipeline Construction

Create a ranker that performs the following:
 - obtains the top 10 highest scoring documents by term frequency (`wmodel="Tf"`)
 - obtains the top 10 highest scoring documents by TF.IDF (`wmodel="TF_IDF"`)
 - reranks only those documents found in BOTH of the previous retrieval settings using BM25.

How many documents are retrieved by this full pipeline for the query `"black wall"`?

If you obtain the correct solution, the document with docid `'1357'` should have a score of 14.5976.

In [None]:
# Todo


### Saving system outputs

We now save each system's rankings for every query to disk so that we can later evaluate them with `pytrec_eval`.

In [None]:
topics_df

In [None]:
!mkdir -p outputs

In [None]:
# Save system rankings in TREC format
# qid Q0 docno rank score tag
tf_run = []
for _, row in topics_df.iterrows():
    qid, query = row
    res_df = tf.search(query)
    for _, res_row in res_df.iterrows():
        _, docid, docno, rank, score, query = res_row
        row_str = f"{qid} Q0 {docno} {rank} {score} tf"
        tf_run.append(row_str)
with open("outputs/tf.run", "w") as f:
    for l in tf_run:
        f.write(l + "\n")
        
tfidf_run = []
for _, row in topics_df.iterrows():
    qid, query = row
    res_df = tfidf.search(query)
    for _, res_row in res_df.iterrows():
        _, docid, docno, rank, score, query = res_row
        row_str = f"{qid} Q0 {docno} {rank} {score} tfidf"
        tfidf_run.append(row_str)
with open("outputs/tfidf.run", "w") as f:
    for l in tfidf_run:
        f.write(l + "\n")

bm25_run = []
for _, row in topics_df.iterrows():
    qid, query = row
    res_df = bm25.search(query)
    for _, res_row in res_df.iterrows():
        _, docid, docno, rank, score, query = res_row
        row_str = f"{qid} Q0 {docno} {rank} {score} bm25"
        bm25_run.append(row_str)
with open("outputs/bm25.run", "w") as f:
    for l in bm25_run:
        f.write(l + "\n")

bm25_run[0]
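As a compact illustration of the TREC run format used above (`qid Q0 docno rank score tag`), the following sketch formats a small ranked list into run lines and splits one line back into its six fields. The helper `to_trec_lines`, the document ids, and the scores are hypothetical; starting ranks at 0 is an arbitrary choice for this sketch.

```python
def to_trec_lines(qid, ranked_docs, tag):
    """Format (docno, score) pairs, already sorted best-first, as TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {tag}"
        for rank, (docno, score) in enumerate(ranked_docs)
    ]

# Toy ranked list for one query (made-up ids and scores).
toy_lines = to_trec_lines("q1", [("d7", 2.5), ("d3", 1.25)], tag="demo")
for line in toy_lines:
    print(line)

# Reading a line back: six whitespace-separated fields.
toy_fields = toy_lines[0].split()
print(toy_fields)
```

The last field is a free-text tag naming the system, so giving each run its own tag makes the files easier to tell apart later.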

## pytrec_eval

[pytrec_eval](https://github.com/cvangysel/pytrec_eval) is a Python interface to TREC's evaluation tool [`trec_eval`](https://github.com/usnistgov/trec_eval).
You can install it as follows.

In [None]:
!pip install pytrec_eval

Evaluating with `pytrec_eval` requires three inputs:
- qrel: a dictionary mapping each query id to its judged documents and their relevance labels. For example:
```python
qrel = {
    'q1': {'d1': 0, 'd2': 1, 'd3': 0},
    'q2': {'d2': 1, 'd3': 1},
}
```
- metrics: a set of standard metrics to be used to assess your system. See [here](http://www.rafaelglater.com/en/post/learn-how-to-use-trec_eval-to-evaluate-your-information-retrieval-system) for a list of available metrics.
- run: similar to `qrel`, this is a dictionary for a given run that maps each query id to the retrieved documents and their scores. For example:
```python
run = {
    'q1': {'d1': 1.0, 'd2': 0.0, 'd3': 1.5},
    'q2': {'d1': 1.5, 'd2': 0.2, 'd3': 0.5}
}
```
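To see how these two dictionaries fit together, the sketch below computes average precision by hand for the example `qrel` and `run` shown in the list. It is a simplified illustration: documents are ranked by descending score, unjudged documents count as non-relevant, and ties are not handled in any special way. The helper name `average_precision` is hypothetical and is not part of `pytrec_eval`.

```python
qrel = {
    'q1': {'d1': 0, 'd2': 1, 'd3': 0},
    'q2': {'d2': 1, 'd3': 1},
}
run = {
    'q1': {'d1': 1.0, 'd2': 0.0, 'd3': 1.5},
    'q2': {'d1': 1.5, 'd2': 0.2, 'd3': 0.5},
}

def average_precision(qrel_q, run_q):
    """Mean of the precision values at the ranks where a relevant document appears."""
    ranking = sorted(run_q, key=run_q.get, reverse=True)  # best score first
    num_relevant = sum(1 for label in qrel_q.values() if label > 0)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if qrel_q.get(docno, 0) > 0:  # unjudged documents count as non-relevant
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

ap_per_query = {qid: average_precision(qrel[qid], run[qid]) for qid in qrel}
mean_ap = sum(ap_per_query.values()) / len(ap_per_query)
print(ap_per_query, mean_ap)
```

For `q1`, the only relevant document (`d2`) has the lowest score and lands at rank 3, so its average precision is 1/3. Mean average precision (MAP) is the average of these per-query values.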

In [None]:
# Load qrels
qrels_df = pd.read_csv('data/lab_qrels.csv', dtype=str)
print(qrels_df.shape)
print(qrels_df.head())

qrels_dict = dict()
for _, r in qrels_df.iterrows():
    qid, docno, label, iteration = r
    if qid not in qrels_dict:
        qrels_dict[qid] = dict()
    qrels_dict[qid][docno] = int(label)

Check out `pytrec_eval.parse_qrel()` to quickly load qrels files in TREC format (as in your project).
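For reference, a TREC qrels file has one judgment per line with four whitespace-separated fields: query id, iteration, document id, and relevance label. The following minimal sketch parses such lines into the same nested dictionary as above. The function `parse_trec_qrels` and the toy file contents are hypothetical, and this is not `pytrec_eval`'s own implementation.

```python
import io
from collections import defaultdict

def parse_trec_qrels(lines):
    """Parse TREC qrels lines of the form: qid iteration docno label."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, docno, label = line.split()
        qrels[qid][docno] = int(label)
    return dict(qrels)

# A file-like object with made-up judgments stands in for a real qrels file.
toy_qrels_file = io.StringIO("q1 0 d1 0\nq1 0 d2 1\nq2 0 d3 1\n")
toy_qrels = parse_trec_qrels(toy_qrels_file)
print(toy_qrels)
```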

In [None]:
import pytrec_eval

In [None]:
# Build evaluator based on the qrels and metrics
metrics = {"map", "ndcg", "ndcg_cut_10", "P_10"}
my_qrel = {q: d for q, d in qrels_dict.items() if q in {'1015979', '2674', '340095'}}  # let's evaluate the first 3 topics to compare with PyTerrier above
evaluator = pytrec_eval.RelevanceEvaluator(my_qrel, metrics)

In [None]:
# Load run
with open("outputs/tf.run", 'r') as f_run:
    tf_run = pytrec_eval.parse_run(f_run)

In [None]:
# Evaluate tf model
tf_evals = evaluator.evaluate(tf_run)
tf_evals

In [None]:
tf_metric2vals = {m: [] for m in metrics}
for q, d in tf_evals.items():
    for m, val in d.items():
        tf_metric2vals[m].append(val)

In [None]:
# Compute average across topics
for m in metrics:
    print(m, '\t', pytrec_eval.compute_aggregated_measure(m, tf_metric2vals[m]))
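The aggregation step can also be written in plain Python. The sketch below assumes that, for the metrics used here, aggregating across topics means taking the arithmetic mean of the per-query values. The per-query numbers in `toy_evals` are made up, and `mean_per_metric` is a hypothetical helper rather than a `pytrec_eval` function.

```python
from statistics import mean

def mean_per_metric(evals):
    """Average each metric over all queries in a {qid: {metric: value}} dictionary."""
    metric2vals = {}
    for per_query in evals.values():
        for metric, value in per_query.items():
            metric2vals.setdefault(metric, []).append(value)
    return {metric: mean(vals) for metric, vals in metric2vals.items()}

# Made-up per-query results in the same shape as the evaluator's output.
toy_evals = {
    "q1": {"map": 0.5, "P_10": 0.2},
    "q2": {"map": 0.25, "P_10": 0.4},
}
toy_means = mean_per_metric(toy_evals)
print(toy_means)
```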