In this notebook we evaluate the indri_ql_baseline run files for both the training and test topics. The run is described as follows:


We provide an Indri baseline run with Query Likelihood run, including both the topics and run files. Queries are generated by running AllenNLP coreference resolution to perform rewriting and stopwords are removed using the Indri stopword list.

In [1]:
from trectools import TrecEval, TrecRun, TrecQrel

## File formats
We need two files to evaluate a run: the run file itself and a qrels (relevance judgements) file.
#### Run file:

FORMAT: qid Q0 docno rank score tag

* qid	is the query number
* Q0	is the literal Q0
* docno	is the id of a document returned for qid
* rank	(1-999) is the rank of this response for this qid
* score	is a system-dependent indication of the quality of the response
* tag	is the identifier for the system
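
As a quick illustration of the format above, here is a minimal sketch that splits one run line into typed fields. The names `RunLine` and `parse_run_line` are hypothetical helpers introduced for this example; they are not part of trectools.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    """One line of a TREC run file (the literal Q0 column is dropped)."""
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line: str) -> RunLine:
    """Split one whitespace-separated run line into typed fields."""
    qid, q0, docno, rank, score, tag = line.split()
    if q0 != "Q0":
        raise ValueError(f"expected the literal Q0, got {q0!r}")
    return RunLine(qid, docno, int(rank), float(score), tag)


# First line of the training run shown later in this notebook.
example = parse_run_line("1_1 Q0 MARCO_955948 1 -5.32579 indri")
print(example)
```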

#### Qrels file:

FORMAT: qid 0 docno relevance

* qid	is the query number
* 0	is the literal 0
* docno	is the id of a document in your collection
* relevance	is the relevance grade of docno for qid
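
The qrels format can be loaded into a nested dictionary in a few lines. This is only a sketch for inspecting the data by hand; `load_qrels` is a hypothetical helper, and the evaluation below uses `TrecQrel` from trectools instead.

```python
from collections import defaultdict


def load_qrels(lines):
    """Build {qid: {docno: relevance}} from an iterable of qrels lines."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _zero, docno, relevance = line.split()
        qrels[qid][docno] = int(relevance)
    return dict(qrels)


# Two judgement lines taken from the training qrels shown later.
sample = ["1_1 0 MARCO_955948 2", "1_1 0 MARCO_849267 0"]
judgements = load_qrels(sample)
print(judgements)
```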

### Explore run file

In [2]:
run_path = "baseline/train_topics.teIn"
# Read the run file as a list of stripped lines for a quick look.
with open(run_path, 'r') as f:
    run = [line.strip() for line in f]

In [3]:
run[:10]

['1_1 Q0 MARCO_955948 1 -5.32579 indri',
 '1_1 Q0 MARCO_6203672 2 -5.40291 indri',
 '1_1 Q0 MARCO_5692406 3 -5.412 indri',
 '1_1 Q0 MARCO_849267 4 -5.4137 indri',
 '1_1 Q0 MARCO_2331424 5 -5.41859 indri',
 '1_1 Q0 MARCO_4455128 6 -5.4235 indri',
 '1_1 Q0 MARCO_8528286 7 -5.42714 indri',
 '1_1 Q0 MARCO_5780723 8 -5.42758 indri',
 '1_1 Q0 MARCO_920443 9 -5.44404 indri',
 '1_1 Q0 CAR_87772d4208721133d00d7d62f4eaaf164da5b4e3 10 -5.44505 indri']

### Explore qrels file
This is in the TREC qrels format described above

In [4]:
# Note: despite its name, query_path points to the qrels file, not to a topics file.
query_path = "treccastweb/2019/data/training/train_topics_mod.qrel"
with open(query_path, 'r') as f:
    qrels = [line.strip() for line in f]

In [5]:
qrels[:10]

['1_1 0 MARCO_955948 2',
 '1_1 0 MARCO_6203672 2',
 '1_1 0 MARCO_849267 0',
 '1_1 0 MARCO_2331424 0',
 '1_1 0 MARCO_4455128 0',
 '1_1 0 MARCO_5692406 1',
 '1_1 0 MARCO_8528286 0',
 '1_1 0 CAR_87772d4208721133d00d7d62f4eaaf164da5b4e3 0',
 '1_1 0 MARCO_920443 0',
 '1_1 0 MARCO_4903530 0']
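
To make the metrics concrete, here is a worked example for query 1_1 that uses only the ten run lines and ten qrels lines printed above. It is a sketch under stated assumptions: unjudged documents count as non-relevant, DCG uses the linear gain `rel / log2(rank + 1)`, and the full qrels file may contain further judgements for 1_1 that are not shown here. The helper names are hypothetical.

```python
import math

# Top-10 documents for query 1_1, in rank order, copied from the run output above.
ranked_docs = [
    "MARCO_955948", "MARCO_6203672", "MARCO_5692406", "MARCO_849267",
    "MARCO_2331424", "MARCO_4455128", "MARCO_8528286", "MARCO_5780723",
    "MARCO_920443", "CAR_87772d4208721133d00d7d62f4eaaf164da5b4e3",
]

# Judgements for query 1_1 copied from the qrels output above.
shown_qrels = {
    "MARCO_955948": 2, "MARCO_6203672": 2, "MARCO_849267": 0,
    "MARCO_2331424": 0, "MARCO_4455128": 0, "MARCO_5692406": 1,
    "MARCO_8528286": 0, "CAR_87772d4208721133d00d7d62f4eaaf164da5b4e3": 0,
    "MARCO_920443": 0, "MARCO_4903530": 0,
}


def reciprocal_rank(ranking, judgements, min_rel=1):
    """Return 1/rank of the first document with relevance >= min_rel, else 0."""
    for rank, docno in enumerate(ranking, start=1):
        if judgements.get(docno, 0) >= min_rel:  # unjudged counts as 0
            return 1.0 / rank
    return 0.0


def dcg_at_k(ranking, judgements, k):
    """Discounted cumulative gain over the top k, with linear gain (assumption)."""
    return sum(
        judgements.get(docno, 0) / math.log2(rank + 1)
        for rank, docno in enumerate(ranking[:k], start=1)
    )


rr = reciprocal_rank(ranked_docs, shown_qrels)
dcg3 = dcg_at_k(ranked_docs, shown_qrels, 3)
print("RR:", rr, "DCG@3:", round(dcg3, 4))
```

The first retrieved document is judged relevant, so this single query has a reciprocal rank of 1. The MRR reported for the whole run is an average over all queries, so it can still be much lower.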

## Evaluate with TrecEval
#### 1. Evaluate the given train run and qrel

In [6]:
run = TrecRun(run_path)
qrels = TrecQrel(query_path)

In [7]:
te = TrecEval(run, qrels)

In [8]:
map_score = te.get_map(depth=10)
mrr = te.get_reciprocal_rank(depth=10)
ndcg = te.get_ndcg(depth=3)

In [9]:
print('MAP: {}\nMRR: {}\nNDCG@3: {}'.format(map_score, mrr, ndcg))

MAP: 0.03701183744675955
MRR: 0.07325927892842392
NDCG@3: 0.043880365013731826


#### 2. Evaluate the given test run and qrel

The scores should be:
* MAP: 0.139
* MRR: 0.328
* NDCG@3: 0.152

In [10]:
test_run_path = "baseline/test_topics.teIn"
test_qrels_path = "2019qrels.txt"
run = TrecRun(test_run_path)
qrels = TrecQrel(test_qrels_path)

In [11]:
te = TrecEval(run, qrels)

Check with per_query=True

In [12]:
map_score = te.get_map(per_query=True)
mrr = te.get_reciprocal_rank(per_query=True)
ndcg = te.get_ndcg(depth=3, per_query=True)

In [13]:
mrr.mean()

recip_rank@1000    0.407231
dtype: float64

In [14]:
map_score.mean()

MAP@1000    0.129923
dtype: float64

In [15]:
ndcg.mean()

NDCG@3    0.393203
dtype: float64

Check with per_query=False

In [16]:
map_score = te.get_map(per_query=False)
mrr = te.get_reciprocal_rank(per_query=False)
ndcg = te.get_ndcg(depth=3, per_query=False)

In [17]:
print('MAP: {}\nMRR: {}\nNDCG@3: {}'.format(map_score, mrr, ndcg))

MAP: 0.04692423734583531
MRR: 0.11477276954435729
NDCG@3: 0.0533574446354449


### Run from trec_eval repository
To cross-check with the original tool, go to the trec_eval repository and run:

* For training: ./trec_eval ../treccastweb/2019/data/training/train_topics_mod.qrel ../baseline/train_topics.teIn
* For test: ./trec_eval ../2019qrels.txt ../baseline/test_topics.teIn
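
The commands above print every measure trec_eval knows. A small sketch for restricting the output to the three measures used in this notebook is given below. The `-m` (select measure) and `-l` (minimum relevance level) options are standard trec_eval flags; `build_trec_eval_cmd` and `run_trec_eval` are hypothetical helpers, and whether the reference scores were produced with a particular `-l` value is not stated in this notebook.

```python
import subprocess


def build_trec_eval_cmd(qrels_path, run_path,
                        measures=("map", "recip_rank", "ndcg_cut.3"),
                        min_rel=None, binary="./trec_eval"):
    """Assemble a trec_eval command line as a list of arguments."""
    cmd = [binary]
    for measure in measures:
        cmd += ["-m", measure]
    if min_rel is not None:
        # -l sets the minimum judgement value counted as relevant.
        cmd += ["-l", str(min_rel)]
    cmd += [qrels_path, run_path]
    return cmd


def run_trec_eval(cmd):
    """Run the command and return its stdout (needs a compiled trec_eval binary)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_trec_eval_cmd("../2019qrels.txt", "../baseline/test_topics.teIn")
print(" ".join(cmd))
```

Calling `run_trec_eval(cmd)` from inside the trec_eval directory would then return only the selected measures.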


# TODO
Find out why both this Python package and the original trec_eval program give scores that differ from the reference values
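
One thing worth checking first (a guess, not a confirmed cause) is whether the run and the qrels cover the same set of queries, since averaging over different query sets changes the aggregate score. The sketch below compares the query ids found in the two files. `query_ids` and `compare_query_sets` are hypothetical helpers, and the two toy lines for query 1_2 are made up for illustration.

```python
def query_ids(lines):
    """Collect the first column (qid) of every non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def compare_query_sets(run_lines, qrels_lines):
    """Report which query ids appear in both files or in only one of them."""
    run_q = query_ids(run_lines)
    qrels_q = query_ids(qrels_lines)
    return {
        "both": run_q & qrels_q,
        "run_only": run_q - qrels_q,
        "qrels_only": qrels_q - run_q,
    }


# Toy input: the run has two queries, the qrels judge only one of them.
toy_run = [
    "1_1 Q0 MARCO_955948 1 -5.32579 indri",
    "1_2 Q0 TOY_DOC 1 -5.1 indri",  # made-up line for illustration
]
toy_qrels = ["1_1 0 MARCO_955948 2"]

report = compare_query_sets(toy_run, toy_qrels)
print({name: sorted(ids) for name, ids in report.items()})
```

With the real files, the same function can be called as `compare_query_sets(open(test_run_path), open(test_qrels_path))`.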