# Assignment 2A, Part 1: Evaluation

You are given two sample files, `data/sample_ranking.csv` and `data/sample_qrels.csv`, to test your solution.

Use this notebook to evaluate the rankings generated in [Part 2](2_Retrieval.ipynb) and [Part 3](3_Multifield_retrieval.ipynb).

In [1]:
RANKING_FILE = "data/sample_ranking.csv"  # file with the document rankings
QRELS_FILE = "data/sample_qrels.csv"  # file with the relevance judgments (ground truth)

**TODO**: Complete the function that calculates evaluation metrics for a given ranking (`ranking`) against the ground truth (`gt`). It should return the results as a dictionary keyed by the name of the retrieval metric (`"P10"`, `"AP"`, `"RR"`).

(Hint: see [Exercises #1 and #2 from Lecture 8](https://github.com/kbalog/uis-dat640-fall2019/tree/master/exercises/lecture_08).)

In [2]:
def eval_query(ranking, gt):
    """Computes P@10, AP, and RR for a single query's ranking against the ground truth."""
    p10, ap, rr, num_rel = 0, 0, 0, 0

    for i, doc_id in enumerate(ranking):
        if doc_id in gt:
            num_rel += 1  
            pi = num_rel / (i + 1)
            ap += pi  # AP
            
            if i < 10:
                p10 += 1
    
            if rr == 0:  # Reciprocal rank
                rr = 1 / (i + 1)
                
    p10 /= 10
    ap = ap / len(gt) if gt else 0  # avoid division by zero when a query has no relevant documents
    
    return {"P10": p10, "AP": ap, "RR": rr}
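
To make the three metrics concrete, here is a small self-contained sketch, independent of the function above, that computes them for a made-up ranking. The document IDs and the helper name `toy_metrics` are hypothetical and chosen only for illustration.

```python
def toy_metrics(ranking, relevant, k=10):
    """Returns (P@k, AP, RR) for one ranking; `relevant` is a set of doc IDs."""
    hits = 0             # relevant documents seen so far
    hits_at_k = 0        # relevant documents within the top k
    precision_sum = 0.0  # sum of precision values at each relevant rank
    rr = 0.0             # reciprocal rank of the first relevant document
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
            if rank <= k:
                hits_at_k += 1
            if rr == 0.0:
                rr = 1 / rank
    # AP divides by the total number of relevant documents, retrieved or not.
    ap = precision_sum / len(relevant) if relevant else 0.0
    return hits_at_k / k, ap, rr


# Hypothetical ranking: relevant documents d2 and d5 appear at ranks 2 and 4.
ranking = ["d1", "d2", "d3", "d5", "d4"]
relevant = {"d2", "d5", "d9"}  # d9 is relevant but never retrieved

p10, ap, rr = toy_metrics(ranking, relevant)
print(p10, ap, rr)
```

In this example P@10 is 2/10, RR is 1/2, and AP is (1/2 + 2/4) / 3 = 1/3. The unretrieved relevant document `d9` lowers AP because the sum is divided by the total number of relevant documents.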

**TODO**: Complete the function that evaluates an output file containing rankings for a set of queries. The function is almost complete; you only need to add the computation of the mean scores over the entire query set.

In [3]:
def eval(gt_file, output_file):
    """Prints evaluation scores for each query as well as the means over the query set."""
    # load data from ground truth file
    gt = {}  # holds a list of relevant documents for each queryID
    with open(gt_file, "r") as fin:
        header = fin.readline().strip()
        if header != "queryID,docIDs":
            raise Exception("Incorrect file format!")
        for line in fin.readlines():
            qid, docids = line.strip().split(",")
            gt[qid] = docids.split()
            
    # load data from output file
    output = {}
    with open(output_file, "r") as fin:
        header = fin.readline().strip()
        if header != "QueryId,DocumentId":
            raise Exception("Incorrect file format!")
        for line in fin.readlines():
            qid, docid = line.strip().split(",")
            if qid not in output:
                output[qid] = []
            output[qid].append(docid)
    
    # evaluate each query that is in the ground truth
    print("  QID  P@10   (M)AP  (M)RR")
    sum_p10, sum_ap, sum_rr = 0, 0, 0
    for qid in sorted(gt.keys()):
        res = eval_query(output.get(qid, []), gt.get(qid, []))
        print("%5s %6.3f %6.3f %6.3f" % (qid, res["P10"], res["AP"], res["RR"]))
        sum_p10 += res["P10"]
        sum_ap += res["AP"]
        sum_rr += res["RR"]
    
    # compute averages over the entire query set (mean of the per-query scores);
    # the "%6.3f" format below already rounds to three decimals
    size = len(gt)
    avg_p10 = sum_p10 / size
    avg_ap = sum_ap / size
    avg_rr = sum_rr / size
    
    # print averages
    print("%5s %6.3f %6.3f %6.3f" % ("ALL", avg_p10, avg_ap, avg_rr))
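
The two input formats can be read off the parsing code above: the ground truth file has the header `queryID,docIDs`, with the relevant document IDs separated by whitespace in the second column, and the ranking file has the header `QueryId,DocumentId`, with one ranked document per row. The following sketch parses tiny in-memory examples of both. The file contents and the helper names `parse_qrels` and `parse_ranking` are made up for illustration and are not taken from the sample data.

```python
import io


def parse_qrels(fin):
    """Maps each query ID to its list of relevant document IDs."""
    if fin.readline().strip() != "queryID,docIDs":
        raise ValueError("Incorrect file format!")
    gt = {}
    for line in fin:
        qid, docids = line.strip().split(",")
        gt[qid] = docids.split()  # document IDs are whitespace-separated
    return gt


def parse_ranking(fin):
    """Maps each query ID to its ranked list of document IDs (file order)."""
    if fin.readline().strip() != "QueryId,DocumentId":
        raise ValueError("Incorrect file format!")
    output = {}
    for line in fin:
        qid, docid = line.strip().split(",")
        output.setdefault(qid, []).append(docid)
    return output


# Hypothetical file contents, held in memory instead of on disk.
qrels_text = "queryID,docIDs\nQ1,d2 d5\nQ2,d7\n"
ranking_text = "QueryId,DocumentId\nQ1,d1\nQ1,d2\nQ2,d7\n"

gt = parse_qrels(io.StringIO(qrels_text))
output = parse_ranking(io.StringIO(ranking_text))
print(gt)
print(output)
```

Note that the rank of a document is implied by the order of the rows for its query, so the ranking file must list each query's documents from best to worst.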

### Main

In [4]:
eval(QRELS_FILE, RANKING_FILE)

  QID  P@10   (M)AP  (M)RR
   Q1  0.200  0.467  1.000
   Q2  0.500  0.925  1.000
   Q3  0.100  0.500  0.500
  ALL  0.267  0.631  0.833
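
As a sanity check, the `ALL` row should be the arithmetic mean of the per-query rows. The sketch below recomputes it from the rounded values printed above, so agreement is only expected up to rounding in the third decimal place.

```python
# Per-query scores copied from the output above: (P@10, AP, RR).
per_query = {
    "Q1": (0.200, 0.467, 1.000),
    "Q2": (0.500, 0.925, 1.000),
    "Q3": (0.100, 0.500, 0.500),
}

n = len(per_query)
# Average each metric column over all queries.
means = [sum(scores[i] for scores in per_query.values()) / n for i in range(3)]
print("%5s %6.3f %6.3f %6.3f" % ("ALL", *means))
```

The recomputed means round to 0.267, 0.631, and 0.833, matching the `ALL` row.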
