# Week 3 Overview

During this week's lessons, you will learn how to evaluate an information retrieval system (a search engine). Topics include the basic measures for evaluating a set of retrieved results, the major measures for evaluating a ranked list, such as average precision (AP) and normalized discounted cumulative gain (nDCG), and practical issues in evaluation, such as statistical significance testing and pooling.

## Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

- Cranfield evaluation methodology
- Precision and recall
- Average precision, mean average precision (MAP), and geometric mean average precision (gMAP)
- Reciprocal rank and mean reciprocal rank
- F-measure
- Normalized discounted cumulative gain (nDCG)
- Statistical significance test

## Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

- Explain the Cranfield evaluation methodology and how it works for evaluating a text retrieval system.
- Explain how to evaluate a set of retrieved documents and how to compute precision, recall, and F1.
- Explain how to evaluate a ranked list of documents.
- Explain how to compute and plot a precision-recall curve.
- Explain how to compute average precision and mean average precision (MAP).
- Explain how to evaluate a ranked list with multi-level relevance judgments.
- Explain how to compute normalized discounted cumulative gain.
- Explain why it is important to perform statistical significance tests.

## Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

### Why is evaluation so critical for research and application development in text retrieval?

- Text retrieval is an empirical task, so the quality of retrieval results must be measured against user judgments rather than assessed subjectively.
- We need to understand the actual utility of a text retrieval system from the user's perspective. To do that, we evaluate each possible utility and measure it through user studies.
- Measuring the actual utility of only one system or method is not enough. We need to evaluate different systems and methods to compare them and reveal their relative utility to users.

### How does the Cranfield evaluation methodology work?

Let:
- \\(D\\) be the set of documents \\(\{d_1, d_2, ..., d_n\}\\)
- \\(Q\\) be the set of queries \\(\{q_1, q_2, ..., q_m\}\\)
- \\(S\\) be the set of systems \\(\{s_1, s_2, ..., s_k\}\\)
- \\(J\\) be the set of possible relevance judgments
- \\(R\\) be the relevance judgments made by users: every document in \\(D\\) that a system in \\(S\\) returns should have a judgment in \\(J\\), such that \\(R_{s} = \{d_i \rightarrow j_i \ | \ d_i \in D \ , \ j_i \in J\}\\) for each system \\(s \in S\\)

Suppose:
- We have two systems \\(S = \{A, B\}\\)
- We want to match query \\(q_i\\) against each document in \\(D\\) using each system
- We have binary judgments, such that \\(J = \{+, -\}\\)
- \\(R_A\\) returns \\(\{d_2 \rightarrow +, d_1 \rightarrow +, d_4 \rightarrow -\}\\)
- \\(R_B\\) returns \\(\{d_1 \rightarrow +, d_4 \rightarrow -, d_3 \rightarrow -, d_5 \rightarrow +, d_2 \rightarrow +\}\\)

Then:
- Using **precision**, we decide that \\(R_A\\) is better than \\(R_B\\), since \\(2/3 > 3/5\\)

![cranfield](images/cranfield.png)
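
The comparison above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionaries mirror the judged results \\(R_A\\) and \\(R_B\\) from the example, and the function name `precision` is a name chosen here for illustration.

```python
def precision(judged_results):
    """Fraction of retrieved documents that were judged relevant ('+')."""
    relevant = sum(1 for judgment in judged_results.values() if judgment == "+")
    return relevant / len(judged_results)


# Judged results of systems A and B, copied from the example above.
results_a = {"d2": "+", "d1": "+", "d4": "-"}
results_b = {"d1": "+", "d4": "-", "d3": "-", "d5": "+", "d2": "+"}

print(precision(results_a))  # 2/3
print(precision(results_b))  # 3/5
print(precision(results_a) > precision(results_b))  # A wins on precision
```

Note that precision alone ignores the fact that system B retrieved one more relevant document than system A, which is why recall is introduced next.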

### How do we evaluate a set of retrieved documents?

- Use **precision** to measure the fraction of retrieved documents that are relevant.
- Use **recall** to measure the fraction of relevant documents that are retrieved.
- Use **F1** to combine the two into a single score.

### How do you compute precision, recall, and F1?

Consider this matrix:

Doc \ Action | Retrieved | Not Retrieved |
-------------|-----------|---------------|
Relevant     |     a     |       b       |
Not Relevant |     c     |       d       |

$$Precision = \frac{a}{a+c}$$
$$Recall = \frac{a}{a+b}$$
$$\eqalign{
    F_{\beta} &= \frac{1}{\frac{\beta^2}{\beta^2+1}\frac{1}{R} + \frac{1}{\beta^2+1}\frac{1}{P}}\\
              &= \frac{(\beta^2+1)PR}{\beta^2P+R}\\
    F_1       &= \frac{2PR}{P+R} \quad \text{if } \beta = 1
}$$

where \\(P\\) is precision and \\(R\\) is recall.
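
As a rough illustration, the three formulas can be computed directly from the counts in the matrix. The function name `precision_recall_f` and the counts in the usage lines are hypothetical, chosen only for this sketch; returning 0 when the denominator of the F-measure is zero is an assumption, not something the formula itself defines.

```python
def precision_recall_f(a, b, c, beta=1.0):
    """Compute precision, recall, and F-beta from contingency counts.

    a: relevant and retrieved
    b: relevant but not retrieved
    c: not relevant but retrieved
    """
    precision = a / (a + c)
    recall = a / (a + b)
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    # Assumption: define F as 0 when both precision and recall are 0.
    f_beta = (beta_sq + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_beta


# Hypothetical counts: 8 relevant retrieved, 2 relevant missed, 12 non-relevant retrieved.
p, r, f1 = precision_recall_f(a=8, b=2, c=12)
print(p, r, f1)
```

With these counts, precision is 8/20 and recall is 8/10. Since F1 is a harmonic mean, it lies closer to the lower of the two values than the arithmetic mean would.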

### How do we evaluate a ranked list of search results?

Let:
- A user walks through the retrieved documents in rank order and judges each one.
- Compute precision and recall at each rank of the retrieved list, so that \\(N\\) retrieved documents yield \\(N\\) precision-recall pairs.

Suppose:
- The judgment \\(J\\) is binary, such that \\(J = \{+, -\}\\)
- Precision and recall at each rank, with 10 relevant documents in the collection:

Doc, judge  | Precision | Recall |
------------|-----------|--------|
\\(D_1+\\)  | 1/1       |  1/10  |
\\(D_2+\\)  | 2/2       |  2/10  |
\\(D_3-\\)  | 2/3       |  2/10  |
\\(D_4-\\)  | 2/4       |  2/10  |
\\(D_5+\\)  | 3/5       |  3/10  |
\\(D_6-\\)  | 3/6       |  3/10  |
\\(D_7-\\)  | 3/7       |  3/10  |
\\(D_8+\\)  | 4/8       |  4/10  |
\\(D_9-\\)  | 4/9       |  4/10  |
\\(D_{10}-\\) | 4/10      |  4/10  |

Then we get:
- A precision-recall curve:
![Precision Recall Curve](images/precision-recall.png)
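
The table above can be regenerated with a short loop. This is a sketch under the assumptions of the example: binary judgments written as `"+"` and `"-"`, and 10 relevant documents in the collection. The function name is chosen here for illustration.

```python
def precision_recall_at_each_rank(judgments, total_relevant):
    """Return a (precision, recall) pair for every rank of a judged ranked list."""
    points = []
    relevant_so_far = 0
    for rank, judgment in enumerate(judgments, start=1):
        if judgment == "+":
            relevant_so_far += 1
        points.append((relevant_so_far / rank, relevant_so_far / total_relevant))
    return points


# Judgments for D1..D10, copied from the table above.
judgments = ["+", "+", "-", "-", "+", "-", "-", "+", "-", "-"]
points = precision_recall_at_each_rank(judgments, total_relevant=10)

for rank, (precision, recall) in enumerate(points, start=1):
    print(f"D{rank}: precision={precision:.2f} recall={recall:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis for these pairs gives the precision-recall curve.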

### How do you compute average precision? How do you compute mean average precision (MAP) and geometric mean average precision (gMAP)?

For a single search engine system and a specific query, the **average precision** of a ranked list \\(L\\) is:

$$avp(L) = \frac{1}{|Rel|}\sum_{i=1}^n p(i)$$

where:
- \\(n\\) is the length of \\(L\\)
- \\(Rel\\) is the set of all relevant documents in the collection
- $$p(i) = \begin{cases}
0,& \text{if } D_i \text{ is judged as not relevant}\\
\frac{rel(i)}{i},& \text{if } D_i \text{ is judged as relevant}
\end{cases}$$
- \\(rel(i)\\) is the number of documents judged relevant among the top \\(i\\) ranks
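
A minimal sketch of this computation, applied to the ten-document ranked list from the precision-recall table earlier in these notes (binary judgments, 10 relevant documents in the collection). The function name `average_precision` is chosen here for illustration.

```python
def average_precision(judgments, total_relevant):
    """Sum the precision at each relevant rank, then divide by the number of
    relevant documents in the collection (not just those retrieved)."""
    relevant_so_far = 0
    precision_sum = 0.0
    for rank, judgment in enumerate(judgments, start=1):
        if judgment == "+":
            relevant_so_far += 1
            precision_sum += relevant_so_far / rank
    return precision_sum / total_relevant


judgments = ["+", "+", "-", "-", "+", "-", "-", "+", "-", "-"]
ap = average_precision(judgments, total_relevant=10)
print(round(ap, 2))  # (1/1 + 2/2 + 3/5 + 4/8) / 10 = 0.31
```

Dividing by the number of relevant documents in the collection means that the six relevant documents the system never retrieved each contribute a precision of zero.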


For a single system and multiple queries, **mean average precision** (MAP) is the arithmetic mean of the average precisions over the queries or topics. Let \\(\mathcal{L} = L_1, L_2, ..., L_m\\) be the ranked lists returned from running \\(m\\) different queries. Then we have:

$$MAP(\mathcal{L}) = \frac{1}{m} \sum\limits_{i=1}^m avp(L_i)$$


**Geometric mean average precision** (gMAP) is the geometric mean of the same average precisions. Compared with MAP, it is more sensitive to queries with low average precision, so it highlights difficult queries that the arithmetic mean would hide. We define gMAP as:

$$gMAP(\mathcal{L}) = \big( \prod\limits_{i=1}^m avp(L_i) \big)^{\frac{1}{m}}$$

or in log space as

$$gMAP(\mathcal{L}) = \exp \big( \frac{1}{m} \sum\limits_{i=1}^m \ln avp(L_i) \big)$$
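
The difference between the two means is easiest to see on numbers. The sketch below uses hypothetical average precision values for three queries, and assumes every value is strictly positive because the log-space form is undefined for zero.

```python
import math


def mean_average_precision(ap_values):
    """Arithmetic mean of per-query average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_mean_average_precision(ap_values):
    """Geometric mean of per-query average precision, computed in log space.

    Assumption: every value is > 0, otherwise math.log raises ValueError.
    """
    return math.exp(sum(math.log(ap) for ap in ap_values) / len(ap_values))


# Hypothetical per-query average precisions; the third query is a hard one.
ap_values = [0.8, 0.5, 0.05]
print(mean_average_precision(ap_values))
print(geometric_mean_average_precision(ap_values))
```

Here the single hard query pulls gMAP well below MAP, which illustrates why gMAP is preferred when performance on difficult queries matters.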

### What is mean reciprocal rank?

Reciprocal rank is a special case of average precision in which there is exactly one relevant document in the entire collection, so the average precision always equals \\(\frac{1}{r}\\), where \\(r\\) is the position (rank) of the single relevant document. Mean reciprocal rank (MRR) is the arithmetic mean of the reciprocal ranks over a set of queries.
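
A small sketch of reciprocal rank and its mean over several queries. The ranked lists are hypothetical, and scoring a list with no relevant document as 0 is a common convention assumed here rather than something stated in these notes.

```python
def reciprocal_rank(judgments):
    """Return 1/r, where r is the rank of the first relevant ('+') document."""
    for rank, judgment in enumerate(judgments, start=1):
        if judgment == "+":
            return 1 / rank
    return 0.0  # assumption: no relevant document retrieved scores 0


def mean_reciprocal_rank(ranked_lists):
    """Arithmetic mean of the reciprocal rank over several queries."""
    return sum(reciprocal_rank(judgments) for judgments in ranked_lists) / len(ranked_lists)


# Hypothetical judged lists for three queries (relevant at ranks 2, 1, and 4).
ranked_lists = [["-", "+", "-"], ["+", "-"], ["-", "-", "-", "+"]]
print(mean_reciprocal_rank(ranked_lists))  # (1/2 + 1 + 1/4) / 3
```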

### Why is MAP more appropriate than precision at k documents when comparing two retrieval methods?

Precision at k documents gives an unfair comparison between two retrieval methods because it looks only at the top \\(k\\) documents of each method and ignores where the relevant documents appear within them, so two quite different rankings can receive the same score. MAP is more useful for comparing two retrieval methods because average precision is sensitive to the rank of every relevant document and combines precision with recall. Average precision can thus be seen as the expected precision that a single retrieval method achieves for a query, and MAP summarizes it over all queries.

### Why is precision at k documents more meaningful than average precision from a user’s perspective?

Since the order of the retrieved documents reflects their probability of relevance, users tend to consider only the top \\(k\\) documents. Precision at k measures exactly what the user sees in those top results, so it matches the user's own, subjective view of quality. Also, in a question-answering search engine, the top answer is always preferred to be the right answer.

### How can we evaluate a ranked list of search results using multi-level relevance judgments?

Use **Cumulative Gain** (CG) and **Discounted Cumulative Gain** (DCG):

Let:
- \\(r_i\\) be the **gain** (relevance level) of the result at rank \\(i\\)
- \\(i\\) be the **rank position** in the list of retrieved documents \\(\{d_1, d_2, ..., d_n\}\\)

Then:
- $$CG(L) = \sum\limits_{i=1}^n r_i$$
- $$DCG(L) = r_1 + \sum\limits_{i=2}^n \frac{r_i}{\log_2 i}$$
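
Both formulas translate almost directly into code. The gain values below are hypothetical multi-level judgments (for example on a 0 to 3 scale), and the function names are chosen for this sketch.

```python
import math


def cumulative_gain(gains):
    """CG: plain sum of the gains, ignoring rank."""
    return sum(gains)


def discounted_cumulative_gain(gains):
    """DCG: the first gain is kept as is, later gains are divided by log2(rank)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total


# Hypothetical gains for a ranked list of five documents.
gains = [3, 2, 1, 0, 2]
print(cumulative_gain(gains))
print(discounted_cumulative_gain(gains))
```

CG gives the same score to any ordering of these five documents, while DCG rewards placing the high-gain documents near the top.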

### How do you compute normalized discounted cumulative gain (nDCG)?

Let:
- \\(iDCG\\) be the **ideal discounted cumulative gain**, that is, the DCG of the ideal ranking in which the documents are sorted by decreasing gain

Then:
- $$nDCG(L) = \frac{DCG(L)}{iDCG}$$
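
A minimal sketch of the normalization step. It assumes the ideal ranking is built from all judged gains for the query, sorted in decreasing order and cut to the length of the evaluated list. The parameter name `judged_gains` and the sample values are hypothetical.

```python
import math


def dcg(gains):
    """DCG with the first gain undiscounted and later gains divided by log2(rank)."""
    return sum(g if rank == 1 else g / math.log2(rank)
               for rank, g in enumerate(gains, start=1))


def ndcg(gains, judged_gains):
    """Divide the DCG of the ranked list by the DCG of the ideal ranking."""
    ideal = sorted(judged_gains, reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    # Assumption: a query with no positive gain at all scores 0.
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: the system ranked the judged documents in the worst order.
print(ndcg([1, 2, 3], judged_gains=[1, 2, 3]))
print(ndcg([3, 2, 1], judged_gains=[1, 2, 3]))  # ideal order
```

A list that already matches the ideal order scores exactly 1, and any other ordering scores less.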

### Why is normalization necessary in nDCG? Does MAP need a similar normalization?

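Normalization is necessary because the highest achievable DCG differs from query to query, for example depending on how many highly relevant documents exist. nDCG divides by the ideal DCG, so every query is scored relative to the best possible ranking on the same 0-to-1 scale, which ensures comparability across queries.

MAP does not need a similar normalization, since the ideal average precision is always 1.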

### Why is it important to perform statistical significance tests when we compare the retrieval accuracies of two search engine systems?

A statistical significance test provides a way to assess whether an observed difference between two systems is reliable, given the variance in average precision scores across the different queries. If the variance is large, the results fluctuate from query to query, and the observed difference may come from the particular queries chosen rather than from a real difference between the systems.

One popular statistical significance test is the **Wilcoxon signed-rank test**.
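
To make the idea concrete, here is a small, self-contained sketch of an exact two-sided Wilcoxon signed-rank test on paired per-query scores. It enumerates all sign patterns, so it is only practical for a handful of queries, and the average precision values are hypothetical. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W+, exact two-sided p-value) for paired per-query scores."""
    # Zero differences carry no information about direction and are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under the null hypothesis, every assignment of signs is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical average precision of systems A and B on the same six queries.
ap_a = [0.35, 0.52, 0.61, 0.48, 0.70, 0.44]
ap_b = [0.30, 0.40, 0.62, 0.33, 0.50, 0.41]
w_plus, p_value = wilcoxon_signed_rank(ap_a, ap_b)
print(w_plus, p_value)  # 20.0 0.0625
```

In this made-up example system A wins on five of the six queries, yet the p-value stays above the conventional 0.05 threshold, so the difference would not be called significant with so few queries.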

## Additional Readings and Resources

- Mark Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval 4, 4 (2010), 247-375.
- Diane Kelly. Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval 3, 1-2 (2009), 1-224.
- C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 9.