*UE Learning from User-generated Data, CP MMS, JKU Linz 2025*
# Exercise 4: Evaluation

Evaluating a recommender system using offline metrics is crucial to ensuring its quality before deployment. The choice of evaluation metrics is typically guided by the specific application scenario of the recommendation system. Therefore, it is essential to understand how each metric is calculated and what it measures.

In this exercise we evaluate the accuracy of the three recommender systems we already implemented (``TopPop``, ``ItemKNN`` and ``SVD``). We start with the predictive quality metrics ``Precision@K`` and ``Recall@K``. Afterwards, we look at the ranking quality metrics ``DCG`` and ``nDCG``. Finally, all three recommender systems are evaluated with these metrics.

The implementations for the three recommender systems are provided in a file ``rec.py`` and are imported later in the notebook.

Make sure to rename the notebook according to the convention:

LUD25_ex04_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD25_ex04_k000007_Bond_James.ipynb

## Overview

Please consult the lecture slides and the presentation from UE Session 4 for a recap.

|Metric|Range|Selection criteria|Limitation|
|------|-------------------------------|---------|----------|
|Precision|$\geq 0$ and $\leq 1$|The closer to $1$ the better.|Only for hits in recommendations. Rank-agnostic.                                                        |
|Recall|$\geq 0$ and $\leq 1$|The closer to $1$ the better.|Only for hits in the ground truth. Rank-agnostic.                                                          |
|nDCG|$\geq 0$ and $\leq 1$|The closer to $1$ the better.|Does not penalize for irrelevant/missing items in the ranking. For example, the following two recommended lists 1,1,1 and 1,1,1,0 would be considered equally good, even if the latter contains an irrelevant item. |
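
The nDCG limitation from the table can be checked numerically. The following is a minimal sketch, assuming binary relevance, a user with exactly three relevant items, and an ideal DCG built from `min(num_relevant, K)` hits; the helper name `ndcg_from_relevance` is hypothetical and not part of the exercise templates.

```python
import numpy as np

def ndcg_from_relevance(relevance, num_relevant):
    """nDCG of one ranked list of binary relevance labels (hypothetical helper)."""
    relevance = np.asarray(relevance, dtype=float)
    # Discount for rank i (1-based) is 1 / log2(i + 1); rank 1 is not discounted.
    discounts = 1.0 / np.log2(np.arange(len(relevance)) + 2)
    dcg = np.sum(relevance * discounts)
    # Ideal ranking: all relevant items at the top, limited by the list length.
    ideal_hits = min(num_relevant, len(relevance))
    idcg = np.sum(discounts[:ideal_hits])
    return dcg / idcg if idcg > 0 else 0.0

ndcg_short = ndcg_from_relevance([1, 1, 1], num_relevant=3)
ndcg_long = ndcg_from_relevance([1, 1, 1, 0], num_relevant=3)
print(ndcg_short, ndcg_long)
```

Both lists receive the same score because the trailing irrelevant item adds nothing to the DCG and does not change the ideal DCG.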

## Implementation
In this exercise, as before, you are required to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency. **Make sure to try your implementations on toy examples and sanity checks.**

Please **only use libraries already imported in the notebook**.

In [1]:
import pandas as pd
import numpy as np
from typing import Callable

## <font color='red'>TASK 1/3</font>: Predictive Quality Metrics

### Precision@K

Precision@K measures *how many items in the recommendation list are relevant* (hits) according to the ground-truth data. Precision@K is calculated separately for every user and then averaged across all users. For each user, the precision score is normalized by **K**.

It is defined as:

$Precision@K = \frac{1}{|Users|} \sum_{u \in Users} \frac{|\text{Relevant items}_u \cap \text{Recommended Items}_u(K)|}{K}$


#### Input:
* prediction - (**not** an interaction matrix!) numpy array with recommendations. Row index corresponds to ``user_id``, column index corresponds to the rank of the contained recommended item. Every cell (i,j) contains ``item id`` recommended to the user (i) on the position (j) in the list. For example:

The following predictions ``[[12, 7, 99], [0, 97, 6]]`` mean that the user with ``id==1`` (second row) got recommended item **0** on the top of the list, item **97** on the second place and item **6** on the third place.

* test_interaction_matrix - (plain interaction matrix, the same format as before!) interaction matrix built from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered

#### Output:
* average ``Precision@k`` score across all users
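
To make the input format concrete, the sketch below looks up the recommended item ids from the example above in a test interaction matrix. The matrix itself (100 items, three held-out interactions) is made up purely for illustration.

```python
import numpy as np

# Predictions from the example above: 2 users, 3 ranked item ids each.
predictions = np.array([[12, 7, 99], [0, 97, 6]])

# Hypothetical test interaction matrix with 100 items (assumed for illustration).
test_interaction_matrix = np.zeros((2, 100), dtype=int)
test_interaction_matrix[0, 7] = 1    # user 0 interacted with item 7
test_interaction_matrix[1, 0] = 1    # user 1 interacted with item 0
test_interaction_matrix[1, 50] = 1   # user 1 interacted with item 50 (never recommended)

# For every user, pick the cells of the test matrix addressed by the recommended ids.
hits = np.take_along_axis(test_interaction_matrix, predictions, axis=1)
print(hits)
```

Each row of `hits` has the same length as the recommendation list, with a 1 at every rank where the recommended item is relevant.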

In [2]:
def get_pk_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, average precision@K score over all users;
    """
    score = None
    
    # TODO: YOUR IMPLEMENTATION.

    precision = []

    for user, user_pred in enumerate(predictions):
        user_rec = user_pred[:topK]
        num = np.count_nonzero(test_interaction_matrix[user][user_rec])
        precision.append(num/topK)

    score = np.mean(precision)
    return score

In [3]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

pk_score = get_pk_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(pk_score, 0.25), "precision@K score is incorrect."
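
As an additional sanity check, a loop-free variant can be compared against the implementation above. This is only a sketch under the same input assumptions; the function name `precision_at_k_vectorized` is hypothetical, and the toy data differs from the test cell so that the two users get different scores.

```python
import numpy as np

def precision_at_k_vectorized(predictions, test_interaction_matrix, topK=10):
    """Average precision@K without an explicit Python loop (hypothetical helper)."""
    top_k = predictions[:, :topK]
    # hits[u, j] is non-zero if the item at rank j is relevant for user u.
    hits = np.take_along_axis(test_interaction_matrix, top_k, axis=1)
    return float(np.mean(np.count_nonzero(hits, axis=1) / topK))

predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 1, 0, 0], [0, 0, 0, 1]])

# User 0: both top-2 items are relevant (2/2); user 1: one of two (1/2).
p_at_2 = precision_at_k_vectorized(predictions, test_interaction_matrix, topK=2)
print(p_at_2)
```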

### Recall@K

Recall@k evaluates *how many relevant items in the ground-truth data are in the recommendation list*. Recall@K is calculated separately for every user and then averaged across all users. For each user, the recall score is normalized by the total number of ground-truth items.

It is defined as:  

$Recall@K = \frac{1}{|Users|} \sum_{u \in Users} \frac{|\text{Relevant items}_u \cap \text{Recommended Items}_u(K)|}{|\text{Relevant Items}_u|}$

**Follow the "same" input and output definition as for Precision@K**.

In [4]:
def get_rk_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, average recall@K score over all users;
    """
    score = 0.0
    
    # TODO: YOUR IMPLEMENTATION.

    recall = []

    for user, user_pred in enumerate(predictions):
        user_rec = user_pred[:topK]                   
        num_hits = np.count_nonzero(test_interaction_matrix[user][user_rec])
        num_relevant = np.count_nonzero(test_interaction_matrix[user])
        
        if num_relevant > 0:
            recall_u = num_hits / num_relevant
        else:
            recall_u = 0.0

        recall.append(recall_u)

    score = np.mean(recall)
    
    return score

In [5]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

rk_score = get_rk_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(rk_score, 1), "recall@K score is incorrect."
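
In the toy test above every user has exactly one relevant item, so it cannot show how the two metrics differ. The single-user sketch below uses made-up data in which the user has three relevant items and only one of them appears in the top-2 list.

```python
import numpy as np

# One user: top-2 recommendation list and a test row over 6 items (illustrative data).
recommended = np.array([4, 0])
relevant_row = np.array([1, 0, 1, 0, 0, 1])  # items 0, 2 and 5 are relevant

num_hits = np.count_nonzero(relevant_row[recommended])  # only item 0 is a hit
precision = num_hits / len(recommended)                 # normalized by K = 2
recall = num_hits / np.count_nonzero(relevant_row)      # normalized by 3 relevant items
print(precision, recall)
```

The number of hits is identical for both metrics; only the normalization differs.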

## Questions

* Assume a user wants to find all good items. Which is more important: high recall or high precision?
* Write a one-sentence situation where high precision is more important than high recall.
* How do recall and precision relate at the R-th position in the ranking (Precision@R and Recall@R), where R is the number of relevant items?

*-- Answer Here --*

1) High recall: the user does not want to miss any good item, so the goal is to minimize false negatives.
2) Any scenario in which proposing many alternatives is not possible due to limited time (e.g., medical support) or limited resources (e.g., the luxury market). In those cases, recommending only highly relevant items (high precision) matters more, even if some relevant items are missed.
3) At the $R$-th position the two metrics are equal, Precision@R = Recall@R: both divide the same number of hits by $R$ (precision by the list length $K = R$, recall by the number of relevant items $R$).
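
The relation at the R-th position can be verified on a small ranking. The sketch assumes a single user whose relevant items all appear somewhere in the ranked list; the relevance labels are made up for illustration.

```python
import numpy as np

# Relevance of the items in ranked order for one user (illustrative data).
ranked_relevance = np.array([1, 0, 1, 0, 1, 0])
num_relevant = int(ranked_relevance.sum())  # R = 3

ks = np.arange(1, len(ranked_relevance) + 1)
hits_at_k = np.cumsum(ranked_relevance)     # number of hits within the top K
precision_at_k = hits_at_k / ks
recall_at_k = hits_at_k / num_relevant

for k, p, r in zip(ks, precision_at_k, recall_at_k):
    print(f"K={k}: precision={p:.3f}, recall={r:.3f}")
```

Precision starts high and falls as K grows, recall rises towards 1, and the two curves meet at K = R.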

## <font color='red'>TASK 2/3</font>: Ranking Quality Metrics

Implement DCG and nDCG in the corresponding templates.

### DCG Score

DCG@K measures the relevance of ranked items while giving higher importance to items appearing earlier in the ranking. It incorporates a logarithmic discount factor to penalize relevant items appearing lower in the ranking.

DCG@K is calculated separately for every user and then averaged across all users. It is defined as:  

$DCG@K = \sum^K_{i=1} \frac{relevancy_i}{\log_2(i+1)}$

**Follow the "same" input and output definition as for Precision@K**.

Don't forget, DCG is calculated for every user separately and then the average is returned.

<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted! Users without interactions in the test set shouldn't contribute to the score.
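
The discount factors can be inspected directly. A minimal sketch for ranks 1 to 4, using 1-based ranks as in the formula:

```python
import numpy as np

topK = 4
ranks = np.arange(1, topK + 1)          # 1-based ranks: 1, 2, 3, 4
discounts = 1.0 / np.log2(ranks + 1)    # 1 / log2(i + 1)

for rank, discount in zip(ranks, discounts):
    print(f"rank {rank}: discount {discount:.4f}")
```

The first discount equals 1 because $\log_2(2) = 1$, so the top-1 recommendation is not discounted; a hit at rank 3 contributes only 0.5.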

In [6]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    topK - int - topK recommendations should be evaluated.
    
    returns - float - mean dcg score over all users.
    """
    score = None

    # TODO: YOUR IMPLEMENTATION.

    user_dcg_scores = []

    for user_idx, user_preds in enumerate(predictions):
        # Skip users with no relevant test interactions
        if np.count_nonzero(test_interaction_matrix[user_idx]) == 0:
            continue

        # Take topK items
        top_k_items = user_preds[:topK]

        # Compute DCG for this user
        dcg = 0.0
        for rank, item_id in enumerate(top_k_items):
            relevancy = test_interaction_matrix[user_idx, item_id]
            if relevancy > 0:
                dcg += relevancy / np.log2(rank + 2)

        user_dcg_scores.append(dcg)

    if len(user_dcg_scores) == 0:
        return 0.0
    
    score = np.mean(user_dcg_scores)

    return score

In [7]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

## Questions

* Can DCG score be higher than 1? Why?
* Can the average DCG score be higher than 1? Why?

*-- Answer Here --*

1) Yes. For a user with relevant items at the first two positions, $DCG = 1/\log_2(1+1) + 1/\log_2(2+1) \approx 1.63 > 1$.
2) Yes. The average is taken over the per-user DCG scores, and each of them can exceed 1 (see point 1), so the average can also be $>1$.
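
Both answers can be confirmed numerically. The sketch below uses two made-up users, each with two hits in a top-4 list, and binary relevance.

```python
import numpy as np

# Relevance of the recommended items in ranked order (illustrative data).
ranked_relevance = np.array([
    [1, 1, 0, 0],   # hits at ranks 1 and 2
    [1, 0, 1, 0],   # hits at ranks 1 and 3
])

discounts = 1.0 / np.log2(np.arange(1, ranked_relevance.shape[1] + 1) + 1)
dcg_per_user = (ranked_relevance * discounts).sum(axis=1)
mean_dcg = dcg_per_user.mean()
print(dcg_per_user, mean_dcg)
```

Every additional hit adds a positive term, so DCG is bounded only by the ideal DCG for the given number of relevant items, not by 1.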

### nDCG Score

nDCG evaluates how well the recommender ranks the items it recommends to users. Both hitting relevant items and placing them at the right ranks therefore matter for nDCG. The total nDCG score is normalized by the total number of users.

nDCG@K is calculated separately for every user and then averaged across all users. It is defined as:  

$nDCG@K = \frac{DCG@K}{iDCG@K}$

**Follow the "same" input and output definition as for Precision@K**

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separately for each user and depends on the number of tracks held out for them as a Test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

<font color='red'>**Note:**</font> nDCG is calculated for **every user separately** and then the average is returned. You do not necessarily need to use the function you implemented above. Writing nDCG from scratch might be a good idea as well.
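
The dependence of the ideal DCG on the number of held-out items can be sketched as follows. The helper `ideal_dcg` is hypothetical and assumes binary relevance, with the ideal list placing `min(num_relevant, topK)` hits at the top.

```python
import numpy as np

def ideal_dcg(num_relevant, topK):
    """DCG of a perfect top-K list for a user with num_relevant test items (hypothetical helper)."""
    ideal_hits = min(num_relevant, topK)
    # Ranks are 0-based here, so the discount is 1 / log2(rank + 2).
    return float(np.sum(1.0 / np.log2(np.arange(ideal_hits) + 2)))

idcg_one = ideal_dcg(1, topK=4)    # a single held-out item
idcg_two = ideal_dcg(2, topK=4)    # two held-out items
idcg_many = ideal_dcg(7, topK=4)   # more held-out items than list positions
print(idcg_one, idcg_two, idcg_many)
```

A user with more held-out items than list positions gets the same ideal DCG as a user with exactly `topK` held-out items, since at most `topK` hits fit into the list.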

In [8]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user;
    test_interaction_matrix - np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, average ndcg score over all users;
    """   
    score = 0
    # TODO: YOUR IMPLEMENTATION.

    user_ndcg_scores = []
    
    for user_idx, user_preds in enumerate(predictions):
        user_relevant_items = np.count_nonzero(test_interaction_matrix[user_idx])
        if user_relevant_items == 0:
            continue
        
        # Compute DCG per user
        dcg = 0.0
        top_k_items = user_preds[:topK]  
        for rank, item_id in enumerate(top_k_items):
            relevancy = test_interaction_matrix[user_idx, item_id]
            if relevancy > 0:
                dcg += 1 / np.log2(rank + 2)
        
        max_relevant_in_top_k = min(user_relevant_items, topK)
        idcg = 0.0
        for ideal_rank in range(max_relevant_in_top_k):
            idcg += 1 / np.log2(ideal_rank + 2)
        
        ndcg = dcg / idcg if idcg > 0 else 0.0
        user_ndcg_scores.append(ndcg)
    
    if len(user_ndcg_scores) == 0:
        return 0.0
    
    score = np.mean(user_ndcg_scores)

    return score

In [9]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0], [1, 2, 3, 0], [-1, -1, -1, -1]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

### Questions

* Can nDCG score be higher than 1?

*-- Answer Here --*

1) No. By definition, iDCG@K is the DCG of the ideal ranking (all relevant items ranked at the top), so DCG@K can never exceed iDCG@K and nDCG@K is at most 1.

## <font color='red'>TASK 3/3</font>: Evaluation
Use the provided ``rec.py`` (see imports below) to build a simple evaluation framework. It should be able to evaluate ``TopPop``, ``ItemKNN`` and ``SVD``.


In [10]:
from rec import inter_matr_implicit
from rec import svd_decompose, svd_recommend_to_list  #SVD
from rec import recTopK  #ItemKNN
from rec import recTopKPop  #TopPop

In [11]:
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')


users = read("lfm-tiny-tunes", 'user')
items = read("lfm-tiny-tunes", 'item')
train_inters = read("lfm-tiny-tunes", 'inter_train')
test_inters = read("lfm-tiny-tunes", 'inter_test')

train_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=train_inters,
                                               dataset_name="lfm-tiny-tunes")
test_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=test_inters,
                                              dataset_name="lfm-tiny-tunes")

### Get Recommendations

In [12]:
def get_recommendations_for_algorithms(config: dict) -> dict:
    """
    config - dict - configuration as defined in the next cell

    returns - dict - already predefined below with name "rec_dict"
    """

    #use this structure to return results
    rec_dict = {"recommenders": {
        "SVD": {
            #Add your predictions here
            "recommendations": np.array([], dtype=np.int64)
        },
        "ItemKNN": {
            "recommendations": np.array([], dtype=np.int64)
        },
        "TopPop": {
            "recommendations": np.array([], dtype=np.int64)
        },
    }}

    # TODO: YOUR IMPLEMENTATION.

    # Take info from config
    train_inter = config["train_inter"]
    top_k = config["top_k"]
    svd_config = config["recommenders"]["SVD"]
    itemknn_config = config["recommenders"]["ItemKNN"]
    toppop_config = config["recommenders"]["TopPop"] 

    num_users = train_inter.shape[0]

    n_factors = svd_config.get("n_factors", 50) 
    U, V = svd_decompose(train_inter, f=n_factors)

    svd_recs = np.zeros((num_users, top_k), dtype=np.int64)

    for user_id in range(num_users):
        seen_item_ids = np.nonzero(train_inter[user_id])[0]
        recs_for_user = svd_recommend_to_list(
            user_id=user_id,
            seen_item_ids=seen_item_ids,
            U=U,
            V=V,
            topK=top_k
        )
        svd_recs[user_id, :] = recs_for_user

    rec_dict["recommenders"]["SVD"]["recommendations"] = svd_recs


    # ItemKNN
    n_neighbors = itemknn_config.get("n_neighbours", 5)
    itemknn_recs = np.zeros((num_users, top_k), dtype=np.int64)

    for user_id in range(num_users):
        recs_for_user = recTopK(
            inter_matr=train_inter,
            user=user_id,
            top_k=top_k,
            n=n_neighbors
        )
        itemknn_recs[user_id, :] = recs_for_user

    rec_dict["recommenders"]["ItemKNN"]["recommendations"] = itemknn_recs

    # TopPop
    toppop_recs = np.zeros((num_users, top_k), dtype=np.int64)
    for user_id in range(num_users):
        recs_for_user = recTopKPop(
            inter_matr=train_inter,
            user=user_id,
            top_k=top_k
        )
        toppop_recs[user_id, :] = recs_for_user

    rec_dict["recommenders"]["TopPop"]["recommendations"] = toppop_recs

    return rec_dict

In [13]:
config_predict = {
    #interaction matrix
    "train_inter": train_interaction_matrix,
    #topK parameter used for all algorithms
    "top_k": 10,
    #specific parameters for all algorithms
    "recommenders": {
        "SVD": {
            "n_factors": 50
        },
        "ItemKNN": {
            "n_neighbours": 5
        },
        "TopPop": {
        }
    }
}

In [14]:
recommendations = get_recommendations_for_algorithms(config_predict)

assert "SVD" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["SVD"]
assert isinstance(recommendations["recommenders"]["SVD"]["recommendations"], np.ndarray)
assert np.issubdtype(recommendations["recommenders"]["SVD"]["recommendations"].dtype, np.integer), "Predictions must contain integer indices"
assert "ItemKNN" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["ItemKNN"]
assert isinstance(recommendations["recommenders"]["ItemKNN"]["recommendations"], np.ndarray)
assert np.issubdtype(recommendations["recommenders"]["ItemKNN"]["recommendations"].dtype, np.integer), "Predictions must contain integer indices"
assert "TopPop" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["TopPop"]
assert isinstance(recommendations["recommenders"]["TopPop"]["recommendations"], np.ndarray)
assert np.issubdtype(recommendations["recommenders"]["TopPop"]["recommendations"].dtype, np.integer), "Predictions must contain integer indices"


### Evaluate Recommendations

Implement the function such that it evaluates the previously generated recommendations. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary.

In [15]:
def evaluate_algorithms(config: dict, calculate_ndcg_score: Callable,
    calculate_pk_score: Callable, calculate_rk_score: Callable,) -> dict:
    """
    config - dict, configuration containing recommenders and test interaction matrix;
    calculate_ndcg_score - callable, function to calculate the ndcg score;
    calculate_pk_score - callable, function to calculate the precision@k score;
    calculate_rk_score - callable, function to calculate the recall@k score;

    returns - dict, {recommender key from input dict: {
        "ndcg": float, ndcg score,
        "pk": float, precision@k score,
        "rk": float, recall@k score,
    }};
    """

    metrics = {
        "SVD": {
            "pk": None,
            "rk":None,
            "ndcg": None,
        },
        "ItemKNN": {
            "pk": None,
            "rk":None,
            "ndcg": None,
        },
        "TopPop": {
            "pk": None,
            "rk":None,
            "ndcg": None,
        },
    }

    # TODO: YOUR IMPLEMENTATION.

    top_k = config["top_k"]
    test_inter = config["test_inter"]

    for alg_name in metrics.keys():
        preds = config["recommenders"][alg_name]["recommendations"]
        
        pk_value = calculate_pk_score(preds, test_inter, top_k)
        rk_value = calculate_rk_score(preds, test_inter, top_k)
        ndcg_value = calculate_ndcg_score(preds, test_inter, top_k)

        metrics[alg_name]["pk"] = float(pk_value)
        metrics[alg_name]["rk"] = float(rk_value)
        metrics[alg_name]["ndcg"] = float(ndcg_value)
    
    return metrics

In [16]:
config_test = {
    "top_k": 10,
    "test_inter": test_interaction_matrix,
    "recommenders": {}  # here you can access the recommendations from get_recommendations_for_algorithms

}
# add dictionary with recommendations to config dictionary
config_test.update(recommendations)
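
The `update` call replaces the empty `"recommenders"` placeholder with the dictionary returned by the recommendation step and leaves the other keys untouched. A minimal sketch with stand-in values (the nested content here is made up for illustration):

```python
# Stand-in config with an empty placeholder, mirroring the structure above.
config = {"top_k": 10, "recommenders": {}}

# Stand-in for the returned recommendations (illustrative values only).
recs = {"recommenders": {"TopPop": {"recommendations": [[1, 2]]}}}

# dict.update overwrites existing top-level keys and keeps all others.
config.update(recs)
print(config)
```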

### Evaluating Every Algorithm
Make sure everything works.
We expect KNN to outperform other algorithms on our small data sample.

In [17]:
evaluations = evaluate_algorithms(config_test, get_ndcg_score, get_pk_score, get_rk_score)

evaluation_metrics = ["pk", "rk", "ndcg"]
recommendation_algs = ["SVD", "ItemKNN", "TopPop"]

for metric in evaluation_metrics:
    for algorithm in recommendation_algs:
        assert algorithm in evaluations and metric in evaluations[algorithm] and isinstance(evaluations[algorithm][metric], float)

In [18]:
for recommender in evaluations.keys():
    print(f"{recommender}:")
    print(f"p@k: {evaluations[recommender]['pk']}")
    print(f"r@k: {evaluations[recommender]['rk']}")
    print(f"ndcg: {evaluations[recommender]['ndcg']}\n")

SVD:
p@k: 0.04288065843621399
r@k: 0.18918746548376178
ndcg: 0.14300409512681314

ItemKNN:
p@k: 0.06534979423868313
r@k: 0.28769321670556236
ndcg: 0.20568927986328173

TopPop:
p@k: 0.0325925925925926
r@k: 0.14350453387490425
ndcg: 0.09429753895348715
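
For a side-by-side comparison, the scores printed above can be collected in a table. This is a sketch that copies the printed values into a dictionary of the same shape as `evaluations`.

```python
import pandas as pd

# Scores copied from the output above.
evaluations = {
    "SVD": {"pk": 0.04288065843621399, "rk": 0.18918746548376178, "ndcg": 0.14300409512681314},
    "ItemKNN": {"pk": 0.06534979423868313, "rk": 0.28769321670556236, "ndcg": 0.20568927986328173},
    "TopPop": {"pk": 0.0325925925925926, "rk": 0.14350453387490425, "ndcg": 0.09429753895348715},
}

# One row per recommender, one column per metric.
results = pd.DataFrame.from_dict(evaluations, orient="index")
best_per_metric = results.idxmax()  # recommender with the highest score per metric
print(results.round(4))
print(best_per_metric)
```

On these numbers ``ItemKNN`` has the highest score on all three metrics, which matches the expectation stated above.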



## Questions and Potential Future Work
* How would you try to improve performance of all three algorithms?

1) Parameter tuning, e.g. the number of latent factors for ``SVD`` and the number of neighbours for ``ItemKNN``.

* What other metrics would you consider to compare these recommender systems?

2a) The number of unique items recommended (coverage of the catalogue). High coverage means that the system is not recommending the same items to everybody.

2b) Weighted precision, giving more weight to correct recommendations at higher ranks.
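
Catalogue coverage could be computed from the same prediction arrays used throughout this exercise. The following is a sketch of one possible definition (share of catalogue items that appear in at least one top-K list); the function name `catalogue_coverage` is hypothetical.

```python
import numpy as np

def catalogue_coverage(predictions, num_items, topK=10):
    """Share of the catalogue that appears in at least one top-K list (hypothetical helper)."""
    recommended = np.unique(predictions[:, :topK])
    return len(recommended) / num_items

# Two users, three recommendations each, catalogue of 10 items (illustrative data).
predictions = np.array([[0, 1, 2], [0, 1, 3]])
coverage = catalogue_coverage(predictions, num_items=10, topK=3)
print(coverage)
```

Here four distinct items (0, 1, 2, 3) are recommended out of ten. A popularity-based recommender that filters out already seen items would typically score low on this metric, because most users receive nearly the same list.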

In [19]:
# The end.