These metrics are crucial when your model's output is an *ordered list* of items, and the quality of this ordering is what you want to evaluate. This is common in:

* **Information Retrieval:** Search engines ranking web pages.
* **Recommendation Systems:** Recommending movies, products, or articles in a ranked order.
* **Document Retrieval:** Finding relevant documents for a query from a large corpus.

The key here is that simply predicting *if* an item is relevant isn't enough; its *position* in the ranked list matters greatly.

---

## 34. Mean Average Precision (MAP)

MAP is a popular metric that evaluates the overall quality of a ranked list across multiple queries or users. To understand MAP, we first need to understand Precision@K and Average Precision (AP).

### A. Precision@K (P@K)

* **Concept:** Measures the precision (proportion of relevant items) within the top K items of a ranked list.
* **Formula:**
    $P@K = \frac{\text{Number of relevant items in the top K positions}}{K}$
* **Interpretation:** A score between 0 and 1. $P@K = 0.6$ means 60% of the top K items presented were relevant.
* **Pros:** Simple to calculate and interpret. Focuses on the quality of the very top results, which are often the most visible to users.
* **Cons:**
    * Ignores the ranking *within* the top K (a relevant item at rank 1 has the same impact as one at rank K).
    * Doesn't consider recall (how many of *all* relevant items were found).
    * Can be sensitive to the choice of K.
* **Example:**
    A search query returns a ranked list of 10 documents. Let 'R' denote a relevant document and 'N' an irrelevant one.
    Ranked List: `[R, N, R, R, N, N, R, N, N, N]`
    Total relevant documents for this query = 4 (at positions 1, 3, 4, 7).

    * $P@1 = 1/1 = 1.0$ (1 relevant in top 1)
    * $P@3 = 2/3 \approx 0.67$ (2 relevant in top 3: R at 1, R at 3)
    * $P@5 = 3/5 = 0.6$ (3 relevant in top 5: R at 1, R at 3, R at 4)
    * $P@10 = 4/10 = 0.4$ (4 relevant in top 10)
* **Implementation:** Usually calculated manually as part of AP.
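
The P@K formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper name `precision_at_k` is an assumption introduced here, and the input is the 0/1 encoding (R=1, N=0) of the example list.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k items that are relevant (1 = relevant, 0 = not)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Divide by k, as in the formula, even if the list holds fewer than k items.
    return sum(ranked_relevance[:k]) / k


ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # R=1, N=0
for k in (1, 3, 5, 10):
    print(f"P@{k} = {precision_at_k(ranked_relevance, k):.2f}")
# Prints: P@1 = 1.00, P@3 = 0.67, P@5 = 0.60, P@10 = 0.40 (one per line)
```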

---

### B. Average Precision (AP)

* **Concept:** AP summarizes the precision-recall trade-off for a *single query* or user. It's the average of Precision@K values, but calculated *only at the ranks where a relevant item is found*. It rewards models that rank relevant items higher in the list.
* **Formula:**
    $AP = \frac{\sum_{k=1}^{N} (P@k \times rel(k))}{\text{Total number of relevant items for the query}}$
    Where:
    * $N$ is the total number of items in the ranked list for the query.
    * $P@k$ is the Precision at cut-off $k$.
    * $rel(k)$ is an indicator function: 1 if the item at rank $k$ is relevant, 0 otherwise.
* **Interpretation:** A score between 0 and 1 for a single query. A higher AP means relevant items are generally ranked higher and found consistently. An AP of 1.0 means all relevant items were found and ranked at the very top of the list.
* **Pros:**
    * Considers the order of relevant items.
    * Rewards finding many relevant items and placing them early.
    * More stable and comprehensive than a single P@K value for a query.
* **Cons:**
    * Assumes binary relevance (an item is either relevant or not; no degrees of relevance).
* **Example:**
    Using the same ranked list: `[R, N, R, R, N, N, R, N, N, N]`
    Relevant items are at ranks 1, 3, 4, 7. Total relevant items = 4.

    1.  Item at rank 1 is **R**: $P@1 = 1/1 = 1.0$. $rel(1)=1$. Term = $1.0 \times 1 = 1.0$.
    2.  Item at rank 2 is N: $rel(2)=0$. Term = $P@2 \times 0 = 0$.
    3.  Item at rank 3 is **R**: $P@3 = 2/3 \approx 0.667$. $rel(3)=1$. Term = $0.667 \times 1 \approx 0.667$.
    4.  Item at rank 4 is **R**: $P@4 = 3/4 = 0.75$. $rel(4)=1$. Term = $0.75 \times 1 = 0.75$.
    5.  Item at rank 5 is N: $rel(5)=0$. Term = $0$.
    6.  Item at rank 6 is N: $rel(6)=0$. Term = $0$.
    7.  Item at rank 7 is **R**: $P@7 = 4/7 \approx 0.571$. $rel(7)=1$. Term = $0.571 \times 1 \approx 0.571$.
    (No more relevant items)

    $AP = \frac{(1.0 \times 1) + (P@2 \times 0) + (0.667 \times 1) + (0.75 \times 1) + (P@5 \times 0) + (P@6 \times 0) + (0.571 \times 1)}{4}$
    $AP = \frac{1.0 + 0.667 + 0.75 + 0.571}{4} = \frac{2.988}{4} \approx 0.747$

In [13]:
# Implementation (manual Python calculation)

# scikit-learn's `average_precision_score` expects binary true labels and continuous scores.
# For AP on a list that is already ranked, a manual calculation is clearer.
def calculate_ap(ranked_relevance_list, total_relevant_docs):
    """
    Calculate Average Precision (AP) for a single query.

    ranked_relevance_list: list of 0s and 1s in ranked order, where 1 means relevant.
    total_relevant_docs: total number of truly relevant docs for this query.
    """
    if total_relevant_docs == 0:
        return 0.0

    hits = 0
    sum_precision_at_k = 0.0
    for k, is_relevant in enumerate(ranked_relevance_list):
        if is_relevant:
            hits += 1
            precision_at_k = hits / (k + 1)
            sum_precision_at_k += precision_at_k

    return sum_precision_at_k / total_relevant_docs

# Example from above
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0] # R=1, N=0
total_relevant = 4
ap = calculate_ap(ranked_relevance, total_relevant)
print(f"AP: {ap:.3f}")  # Expected: 0.747

---

### C. Mean Average Precision (MAP)

* **Concept:** The MAP is the mean of Average Precision (AP) scores calculated over a set of multiple queries or users. It provides a single figure measure of quality across all queries.
* **Formula:**
    $MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} AP_q$
    Where:
    * $|Q|$ is the total number of queries.
    * $AP_q$ is the Average Precision for query $q$.
* **Interpretation:** A score between 0 and 1. A higher MAP indicates better overall ranking performance across the entire set of queries. An MAP of 0.75 means that, on average, a query's ranked list scores 0.75 on AP.
* **Pros:**
    * Provides a single, comprehensive metric for evaluating ranking systems over multiple queries.
    * Widely used and understood in information retrieval and recommendation literature.
    * Weights every query equally, so a system that performs poorly on many queries gets a low overall score.
* **Cons:**
    * Still assumes binary relevance (relevant/not relevant).
    * Can be influenced by the number of relevant documents per query. Queries with very few relevant documents might have unstable AP scores.
* **Example:**
    Suppose we have AP scores for 3 queries:
    * Query 1 AP: 0.747
    * Query 2 AP: 0.500
    * Query 3 AP: 0.950
    $MAP = \frac{0.747 + 0.500 + 0.950}{3} = \frac{2.197}{3} \approx 0.732$
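
The instability noted under Cons can be illustrated with a small sketch. It assumes a hypothetical helper `average_precision` (a self-contained restatement of the AP formula) and made-up relevance lists: with a single relevant document, AP equals 1 divided by its rank, so a one-position slip halves the score, while the same slip in a query with four relevant documents costs much less.

```python
def average_precision(ranked_relevance, total_relevant):
    """AP for one query: mean of P@k over the ranks k that hold a relevant item."""
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# One relevant document: moving it from rank 1 to rank 2 halves AP.
single_at_1 = average_precision([1, 0, 0, 0, 0], total_relevant=1)
single_at_2 = average_precision([0, 1, 0, 0, 0], total_relevant=1)

# Four relevant documents: the same slip of the top hit changes AP far less.
four_base = average_precision([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], total_relevant=4)
four_slip = average_precision([0, 1, 1, 1, 0, 0, 1, 0, 0, 0], total_relevant=4)

print(f"1 relevant doc:  {single_at_1:.3f} -> {single_at_2:.3f}")  # 1.000 -> 0.500
print(f"4 relevant docs: {four_base:.3f} -> {four_slip:.3f}")      # 0.747 -> 0.622
```

Because MAP averages these per-query values with equal weight, a handful of such sparse queries can move the overall score noticeably.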

In [14]:
# Implementation: scikit-learn has no direct `map_score`. Calculate AP for each query, then average the AP scores.

# Continuing from the AP example:
ap_scores = []

# Query 1
ranked_relevance_q1 = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
total_relevant_q1 = 4
ap_scores.append(calculate_ap(ranked_relevance_q1, total_relevant_q1)) # Expected 0.747

# Query 2 (example data)
ranked_relevance_q2 = [0, 1, 0, 1, 0] # Relevant at rank 2 and 4
total_relevant_q2 = 2
# P@2 (1st relevant) = 1/2 = 0.5
# P@4 (2nd relevant) = 2/4 = 0.5
# AP = (0.5 + 0.5) / 2 = 0.5
ap_scores.append(calculate_ap(ranked_relevance_q2, total_relevant_q2)) # Expected 0.5

# Query 3 (example data)
ranked_relevance_q3 = [1, 1, 1, 0, 1] # Relevant at 1, 2, 3, 5
total_relevant_q3 = 4
# P@1 (1st relevant) = 1/1 = 1.0
# P@2 (2nd relevant) = 2/2 = 1.0
# P@3 (3rd relevant) = 3/3 = 1.0
# P@5 (4th relevant) = 4/5 = 0.8
# AP = (1.0 + 1.0 + 1.0 + 0.8) / 4 = 3.8 / 4 = 0.95
ap_scores.append(calculate_ap(ranked_relevance_q3, total_relevant_q3)) # Expected 0.95

if ap_scores:
    map_score = sum(ap_scores) / len(ap_scores)
    print(f"AP Scores: {[round(s, 3) for s in ap_scores]}")
    print(f"MAP: {map_score:.3f}") # Expected MAP: (0.747 + 0.5 + 0.95) / 3 = 0.732
else:
    print("No AP scores to calculate MAP.")

AP Scores: [0.747, 0.5, 0.95]
MAP: 0.732


* **Context:** MAP is a standard evaluation metric in academic information retrieval (e.g., TREC conferences) and is often used to compare search algorithms or recommendation models that produce ranked lists based on binary relevance.

---

## 35. Normalized Discounted Cumulative Gain (NDCG)

NDCG is designed to evaluate rankings where items have *graded relevance* (e.g., not just relevant/irrelevant, but "highly relevant," "somewhat relevant," "irrelevant"). It also emphasizes placing highly relevant items at the top of the list. We need to understand CG, DCG, and IDCG first.

### A. Cumulative Gain (CG@K)

* **Concept:** The sum of the relevance scores of the items in the top K positions of a ranked list.
* **Formula:**
    $CG@K = \sum_{i=1}^{K} rel_i$
    Where $rel_i$ is the graded relevance score of the item at rank $i$.
* **Interpretation:** The total relevance accumulated within the top K items. Higher CG@K indicates more relevant items are in the top K.
* **Pros:** Simple to understand and compute. Handles graded relevance scores.
* **Cons:** Ignores the position of items within the top K. A highly relevant item at rank K contributes the same as if it were at rank 1.
* **Example:**
    Ranked list of documents with graded relevance scores (0=irrelevant, 1=somewhat, 2=relevant, 3=highly relevant):
    Relevance scores in ranked order: `rel = [3, 0, 2, 2, 1]`

    * $CG@1 = 3$
    * $CG@3 = 3 + 0 + 2 = 5$
    * $CG@5 = 3 + 0 + 2 + 2 + 1 = 8$
* **Implementation:** Manual sum.
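
The manual sum can be written out as a short sketch. The helper name `cumulative_gain_at_k` is an assumption made for illustration; the last line shows the drawback listed under Cons, namely that reordering the list leaves CG unchanged.

```python
def cumulative_gain_at_k(relevances, k):
    """Sum of the graded relevance scores in the top-k positions."""
    return sum(relevances[:k])


rel = [3, 0, 2, 2, 1]
for k in (1, 3, 5):
    print(f"CG@{k} = {cumulative_gain_at_k(rel, k)}")  # 3, then 5, then 8

# Position is ignored: reversing the list gives the same CG@5.
print(cumulative_gain_at_k(rel[::-1], 5))  # 8
```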

---

### B. Discounted Cumulative Gain (DCG@K)

* **Concept:** Similar to CG, but it applies a logarithmic discount to the relevance scores based on their rank. Relevant items appearing lower in the list contribute less to the DCG.
* **Formula:**
    $DCG@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$
    (Scikit-learn's `dcg_score` uses this same $\log_2(i+1)$ discount by default, where $i$ is the 1-based rank.)
* **Interpretation:** The total discounted relevance accumulated in the top K items. Higher DCG@K is better.
* **Pros:**
    * Values item position: Highly relevant items are rewarded more if ranked higher.
    * Handles graded relevance.
* **Cons:**
    * Absolute DCG values are not easily comparable across different queries or different K.
    * Not normalized.
* **Example:**
    Using `rel = [3, 0, 2, 2, 1]`
    Discount factors $\log_2(i+1)$: $\log_2(2)=1$, $\log_2(3)\approx1.585$, $\log_2(4)=2$, $\log_2(5)\approx2.322$, $\log_2(6)\approx2.585$.

    * $DCG@1 = 3 / 1 = 3.0$
    * $DCG@3 = (3/1) + (0/1.585) + (2/2) = 3 + 0 + 1 = 4.0$
    * $DCG@5 = (3/1) + (0/1.585) + (2/2) + (2/2.322) + (1/2.585) \approx 3 + 0 + 1 + 0.861 + 0.387 \approx 5.248$
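
The worked DCG values above can be checked with a dependency-free sketch. The function name `dcg_at_k` is an assumption introduced for illustration; it applies the $\log_2(i+1)$ discount with ranks starting at 1.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K: each relevance score is divided by log2(rank + 1), ranks start at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


rel = [3, 0, 2, 2, 1]
for k in (1, 3, 5):
    print(f"DCG@{k} = {dcg_at_k(rel, k):.3f}")
# Prints: DCG@1 = 3.000, DCG@3 = 4.000, DCG@5 = 5.248 (one per line)
```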

In [17]:
# Implementation (scikit-learn `dcg_score`)

from sklearn.metrics import dcg_score
import numpy as np

# True relevance scores in the order they were ranked by the model (2D: one row per query)
true_relevance_ranked_by_model = np.asarray([[3, 0, 2, 2, 1]])
# dcg_score also requires y_score; descending dummy scores reproduce the model's order
rank_order_scores = np.asarray([[5, 4, 3, 2, 1]])

print(f"DCG@1: {dcg_score(true_relevance_ranked_by_model, rank_order_scores, k=1):.3f}")
print(f"DCG@3: {dcg_score(true_relevance_ranked_by_model, rank_order_scores, k=3):.3f}")
print(f"DCG@5: {dcg_score(true_relevance_ranked_by_model, rank_order_scores, k=5):.3f}")
# Expected outputs: DCG@1: 3.000, DCG@3: 4.000, DCG@5: 5.248

### C. Ideal Discounted Cumulative Gain (IDCG@K)

* **Concept:** The DCG score of a "perfect" or ideal ranking for the top K items. This is achieved by taking all known items for a query, sorting them by their true relevance scores in descending order, and then calculating DCG@K on this ideal list.
* **Formula:** Same as DCG@K, but applied to the ideally ranked relevance scores:
    $IDCG@K = \sum_{i=1}^{K} \frac{rel_i^{ideal}}{\log_2(i+1)}$
* **Interpretation:** The maximum possible DCG@K for a given query and set of documents at cut-off K.
* **Pros:** Provides the normalization factor needed for NDCG.
* **Cons:** Requires knowing all true relevance scores for all documents relevant to a query to construct the ideal list.
* **Example:**
    Suppose that, for our query, the full set of available items with their true relevance scores is: DocA (3), DocB (0), DocC (2), DocD (2), DocE (1), DocF (3).
    Sorted ideally: `[3, 3, 2, 2, 1, 0]`
    Ideal relevance list for top 5: `ideal_rel = [3, 3, 2, 2, 1]`

    * $IDCG@1 = 3 / 1 = 3.0$
    * $IDCG@3 = (3/1) + (3/1.585) + (2/2) \approx 3 + 1.893 + 1 = 5.893$
    * $IDCG@5 = (3/1) + (3/1.585) + (2/2) + (2/2.322) + (1/2.585) \approx 3 + 1.893 + 1 + 0.861 + 0.387 \approx 7.141$
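
The ideal-ranking step can be sketched without any library: sort every judged relevance score in descending order, then reuse the DCG formula. The names `dcg_at_k` and `idcg_at_k` are assumptions made for this illustration, and the input is the DocA to DocF example above.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K with the log2(rank + 1) discount; ranks start at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def idcg_at_k(all_relevances, k):
    """DCG@K of the ideal ordering: all judged items sorted by relevance, descending."""
    ideal_order = sorted(all_relevances, reverse=True)
    return dcg_at_k(ideal_order, k)


all_relevances = [3, 0, 2, 2, 1, 3]  # DocA, DocB, DocC, DocD, DocE, DocF
for k in (1, 3, 5):
    print(f"IDCG@{k} = {idcg_at_k(all_relevances, k):.3f}")
# Prints: IDCG@1 = 3.000, IDCG@3 = 5.893, IDCG@5 = 7.141 (one per line)
```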


In [19]:
# Implementation: calculate DCG on the ideally sorted list of true relevance scores.

# Top 5 of the full set of true relevances for the query, sorted in descending order
ideal_true_relevance_scores = np.asarray([[3, 3, 2, 2, 1]])
# dcg_score also requires y_score; descending dummy scores keep the list in this ideal order
ideal_order_scores = np.asarray([[5, 4, 3, 2, 1]])

print(f"IDCG@1: {dcg_score(ideal_true_relevance_scores, ideal_order_scores, k=1):.3f}")
print(f"IDCG@3: {dcg_score(ideal_true_relevance_scores, ideal_order_scores, k=3):.3f}")
print(f"IDCG@5: {dcg_score(ideal_true_relevance_scores, ideal_order_scores, k=5):.3f}")
# Expected outputs: IDCG@1: 3.000, IDCG@3: 5.893, IDCG@5: 7.141

### D. Normalized Discounted Cumulative Gain (NDCG@K)

* **Concept:** DCG@K normalized by IDCG@K. This scales the DCG score to a range between 0 and 1, making it comparable across different queries and different numbers of results.
* **Formula:**
    $NDCG@K = \frac{DCG@K}{IDCG@K}$
    (If IDCG@K is 0, NDCG@K is typically defined as 0.)
* **Interpretation:** A score between 0 and 1.
    * NDCG@K = 1 means the model's top K achieves the same DCG as the ideal ranking (the two orderings match, up to swaps among equally relevant items).
    * NDCG@K = 0 means DCG@K is 0: no item with positive relevance appears in the top K.
    * Higher values indicate better ranking performance.
* **Pros:**
    * Considers both the relevance of items and their position in the list.
    * Handles graded relevance scores.
    * **Normalized:** Allows for fair comparison across queries.
    * Widely used and robust.
* **Cons:**
    * Can be more complex to calculate manually.
    * IDCG calculation requires knowing the complete set of relevance judgments for a query.
* **Example:**
    Using our calculated values:
    * $DCG@3 = 4.0$, $IDCG@3 = 5.893$ (based on global ideal `[3,3,2]`)
        $NDCG@3 = 4.0 / 5.893 \approx 0.679$
    * $DCG@5 = 5.248$, $IDCG@5 = 7.141$ (based on global ideal `[3,3,2,2,1]`)
        $NDCG@5 = 5.248 / 7.141 \approx 0.735$

In [21]:
# Implementation (scikit-learn `ndcg_score`)
# `ndcg_score` calculates DCG for `y_true` based on the order induced by `y_score`. For IDCG, it sorts `y_true` by its own values to get the ideal order of those specific items.

from sklearn.metrics import ndcg_score
import numpy as np

# Case 1: Manual calculation using pre-computed DCG and IDCG
dcg_at_5 = 5.248
idcg_at_5_global = 7.141  # Based on a globally defined ideal list
ndcg_at_5_manual = dcg_at_5 / idcg_at_5_global if idcg_at_5_global > 0 else 0.0
print(f"Manual NDCG@5 (using global IDCG): {ndcg_at_5_manual:.3f}")  # Expected: 0.735

# Case 2: Using sklearn.metrics.ndcg_score
# y_true holds the true relevances of the items.
# y_score holds the scores given by the model to these items.
# True relevances of items in their original (e.g., database) order:
true_relevances_all = np.asarray([[3, 0, 2, 2, 1, 3]])  # DocA, DocB, DocC, DocD, DocE, DocF
# Model's scores for these items:
model_prediction_scores = np.asarray([[0.9, 0.1, 0.7, 0.6, 0.3, 0.8]])
# Based on these scores, the model ranks: A(3), F(3), C(2), D(2), E(1), B(0)
# Model's ranked true relevances: [3, 3, 2, 2, 1, 0]

# ndcg_score sorts y_true by y_score to get DCG,
# and sorts y_true by its own values to get IDCG.
print(f"Sklearn NDCG@5: {ndcg_score(true_relevances_all, model_prediction_scores, k=5):.3f}")
# For DCG: items with scores 0.9, 0.8, 0.7, 0.6, 0.3 -> relevances [3 (A), 3 (F), 2 (C), 2 (D), 1 (E)]
# DCG@5 for [3, 3, 2, 2, 1] is 7.141 (same list as in our IDCG@5 example)
# For IDCG: sorting true_relevances_all = [3, 0, 2, 2, 1, 3] gives [3, 3, 2, 2, 1, 0]; the top 5 are [3, 3, 2, 2, 1]
# IDCG@5 for [3, 3, 2, 2, 1] is also 3 + 1.893 + 1 + 0.861 + 0.387 = 7.141
# Sklearn NDCG@5 = 7.141 / 7.141 = 1.000, because the model's ranking matches the ideal order

**Note on `ndcg_score`:** If you provide `y_true` as the actual relevance scores of the items in the order they were presented by your model, and `y_score` as dummy scores that reflect this order (e.g., `[num_items, num_items-1, ..., 1]`), then the DCG part will be correct. However, the IDCG part will be based on sorting *only those presented items*. If the ideal ranking involves items *not* presented by your model (but known for the query), `ndcg_score` might not give the globally correct NDCG. In such cases, calculate DCG (for model's list) and IDCG (for global ideal list) separately and then divide.
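
The difference described in the note can be made concrete with a dependency-free sketch. The helper names `dcg_at_k` and `ndcg_at_k` and the `ideal_pool` parameter are assumptions introduced here. The data reuses the running example, treating DocF (relevance 3) as judged for the query but not presented by the model.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K with the log2(rank + 1) discount; ranks start at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(presented_relevances, ideal_pool, k):
    """NDCG@K where the ideal ranking is built from the items in `ideal_pool`."""
    idcg = dcg_at_k(sorted(ideal_pool, reverse=True), k)
    return dcg_at_k(presented_relevances, k) / idcg if idcg > 0 else 0.0


presented = [3, 0, 2, 2, 1]      # relevances in the order the model presented them
all_judged = [3, 0, 2, 2, 1, 3]  # DocA..DocF; DocF (3) was never presented

# Ideal list built only from the presented items.
ndcg_presented_only = ndcg_at_k(presented, presented, k=5)
# Ideal list built from every judged item for the query.
ndcg_global = ndcg_at_k(presented, all_judged, k=5)

print(f"NDCG@5, ideal from presented items: {ndcg_presented_only:.3f}")  # 0.922
print(f"NDCG@5, ideal from all judged items: {ndcg_global:.3f}")         # 0.735
```

Normalizing against only the presented items hides the missing highly relevant document and inflates the score, which is why the global ideal list is the safer denominator when complete judgments are available.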

* **Context:** NDCG is one of the most popular and robust metrics for evaluating ranked search results and recommendation lists, especially when relevance is not binary. It's often the primary metric in academic research and industry benchmarks for ranking tasks.

---

### Summary for Ranking Metrics:

* **MAP** is excellent for tasks with **binary relevance** judgments, where the order of relevant items matters significantly. It averages performance across multiple queries.
* **NDCG** is preferred for tasks with **graded relevance** judgments (e.g., 0-5 stars) and where the position of highly relevant items is paramount.

Both metrics are more complex than simple accuracy or precision but provide a much more nuanced view of ranking quality. The choice depends on the nature of your relevance data and the specific aspects of ranking performance you want to prioritize.