*UE Learning from User-generated Data, CP MMS, JKU Linz 2023*
# Exercise 4: Evaluation

In this exercise we evaluate accuracy of three different RecSys we already implemented. First we implement DCG and nDCG metrics, then we create a simple evaluation framework to compare the three recommenders in terms of nDCG. The implementations for the three recommender systems are provided in a file rec.py and are imported later in the notebook.
Please consult the lecture slides and the presentation from UE Session 4 for a recap.

Make sure to rename the notebook according to the convention:

LUD23_ex03_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD23_ex03_k000007_Bond_James.ipynb

## Implementation
In this exercise, as before, you are reqired to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency. **Make sure to try your implementations on toy examples and sanity checks.**

Please **only use libraries already imported in the notebook**.

In [7]:
import pandas as pd
import numpy as np

## <font color='red'>TASK 1/2</font>: Evaluation Metrics

Implement DCG and nDCG in the corresponding templates.

### DCG Score
Implement DCG following the input/output convention:
#### Input:
* predictions - (not an interaction matrix!) numpy array with recommendations. Row index corresponds to User_id, column index corresponds to the rank of the item mentioned in the sell. Every cell (i,j) contains **item id** recommended to the user (i) on the position (j) in the list. For example:

The following predictions structure [[12, 7, 99], [0, 97, 6]] means that the user with id==1 (second row) got recommended item **0** on the top of the list, item **97** on the second place and item **6** on the third place.

* test_interaction_matrix - (plain interaction matrix format as before!) interaction matrix constructed from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered

#### Output:
* DCG score

Don't forget, DCG is calculated for every user separately and then the average is returned.


<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

In [8]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    
    returns - float - mean dcg score over all user.
    """
    score = None

    # TODO: YOUR IMPLEMENTATION.
    n_users = predictions.shape[0]
    discounts = np.log2(np.arange(2, topK+2))
    dcg_scores = np.zeros(n_users)
    
    for user in range(n_users):
        top_items = predictions[user][:topK]
        relevant_items = test_interaction_matrix[user].nonzero()[0]
        relevant_top_items = np.intersect1d(top_items, relevant_items)
        relevance = np.zeros(topK)
        relevance[np.where(np.isin(top_items, relevant_top_items))] = 1
        dcg_scores[user] = np.sum(relevance / discounts[:len(relevance)])
    
    score = np.mean(dcg_scores)

    return score

In [9]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

* Can DCG score be higher than 1? - Yes
* Can the average DCG score be higher than 1? - Yes
* Why? - The average DCG score can be higher than 1 because it is the sum of discounted relevance scores for all users divided by the total number of users. If the relevance scores are high and the recommended items are ranked well, the resulting DCG scores can be higher than 1, leading to an average DCG score greater than 1.

### nDCG Score

Following the same parameter convention as for DCG implement nDCG metric.

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separetely for each user and depends on the number of tracks held out for them as a Test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

<font color='red'>**Note:**</font> nDCG is calculated for **every user separately** and then the average is returned. You do not necessarily need to use the function you implemented above. Writing nDCG from scatch might be a good idea as well.

In [10]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    test_interaction_matrix - np.ndarray - test interaction matrix for each user.
    topK - int - topK recommendations should be evaluated.
    
    returns - average ndcg score over all users.
    """
    score = None
    
    # TODO: YOUR IMPLEMENTATION.
    n_users = predictions.shape[0]
    discounts = np.log2(np.arange(2, topK+2))
    ndcg_scores = np.zeros(n_users)
    
    for user in range(n_users):
        top_items = predictions[user][:topK]
        relevant_items = test_interaction_matrix[user].nonzero()[0]
        relevant_top_items = np.intersect1d(top_items, relevant_items)
        relevance = np.zeros(topK)
        relevance[np.where(np.isin(top_items, relevant_top_items))] = 1
        ideal_relevance = np.zeros(len(relevant_items))
        ideal_relevance[0] = 1
        ideal_dcg = np.sum(ideal_relevance / discounts[:len(ideal_relevance)])
        dcg = np.sum(relevance / discounts[:len(relevance)])
        
        if ideal_dcg > 0:
            ndcg_scores[user] = dcg / ideal_dcg
        else:
            ndcg_scores[user] = 0
    
    score = np.mean(ndcg_scores)

    return score

In [11]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

* Can nDCG score be higher than 1? - No

## <font color='red'>TASK 2/2</font>: Evaluation
Use provided rec.py (see imports below) to build a simple evaluation framework. It should be able to evaluate POP, ItemKNN and SVD.

*Make sure to place provided rec.py next to your notebook for the imports to work.*


In [12]:
from rec import svd_decompose, svd_recommend_to_list  #SVD
from rec import inter_matr_implicit
from rec import recTopK  #ItemKNN
from rec import recTopKPop  #TopPop

Load the users, items and both the train interactions and test interactions
from the **new version of the lfm-tiny dataset** provided with the assignment

In [13]:
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')

# TODO: YOUR IMPLEMENTATION

users = read('lfm-tiny', 'user')
items = read('lfm-tiny', 'item')
train_inters = read('lfm-tiny', 'inter_train')
test_inters = read('lfm-tiny', 'inter_test')

train_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=train_inters,
                                               dataset_name="lfm-tiny")
test_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=test_inters,
                                              dataset_name="lfm-tiny")

In [15]:
test_interaction_matrix.shape

(1194, 412)

### Get Recommendations

Implement the function below to get recommendations from all 3 recommender algorithms. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary - we will use it later.

In [117]:
config_predict = {
    #interaction matrix
    "train_inter": train_interaction_matrix,
    #topK parameter used for all algorithms
    "top_k": 10,
    #specific parameters for all algorithms
    "recommenders": {
        "SVD": {
            "n_factors": 50
        },
        "ItemKNN": {
            "n_neighbours": 5
        },
        "TopPop": {
        }
    }
}

In [118]:
def get_recommendations_for_algorithms(config: dict) -> dict:
    """
    config - dict - configuration as defined above

    returns - dict - already predefined below with name "rec_dict"
    """

    #use this structure to return results
    rec_dict = {"recommenders": {
        "SVD": {
            #Add your predictions here
            "recommendations": np.array([])
        },
        "ItemKNN": {
            "recommendations": np.array([])
        },
        "TopPop": {
            "recommendations": np.array([])
        },
    }}

    # TODO: YOUR IMPLEMENTATION.
    train_inter = config["train_inter"]
    n_users = train_inter.shape[0]
    top_k = config["top_k"]
    svd_n_factors = config["recommenders"]["SVD"]["n_factors"]
    knn_n_neighbours = config["recommenders"]["ItemKNN"]["n_neighbours"]
    U_final, V_final = svd_decompose(inter_matr=train_inter, f=svd_n_factors)
    
    svd_predictions = []
    knn_predictions = []
    pop_predictions = []
    
    for user in range(n_users):
        print(f"{user + 1}/{n_users}", end="\r")
        user_svd_prediction = svd_recommend_to_list(user_id=user,
                                                seen_item_ids=train_inter[user].nonzero(),
                                                U=U_final,
                                                V=V_final,
                                                topK=top_k)
        
        user_knn_prediction = recTopK(inter_matr=train_inter,
                                      user=user,
                                      top_k=top_k,
                                      n=knn_n_neighbours)
        
        user_pop_prediction = recTopKPop(inter_matr=train_inter, 
                                         user=user, 
                                         top_k=top_k)
        
        svd_predictions.append(user_svd_prediction)
        knn_predictions.append(user_knn_prediction)
        pop_predictions.append(user_pop_prediction)
    
    svd_predictions = np.array(svd_predictions)
    knn_predictions = np.array(knn_predictions)
    pop_predictions = np.array(pop_predictions)
    
    
    rec_dict["recommenders"]["SVD"]["recommendations"] = svd_predictions
    rec_dict["recommenders"]["ItemKNN"]["recommendations"] = knn_predictions
    rec_dict["recommenders"]["TopPop"]["recommendations"] = pop_predictions
    

    return rec_dict

In [119]:
recommendations = get_recommendations_for_algorithms(config_predict)

assert "SVD" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["SVD"]
assert isinstance(recommendations["recommenders"]["SVD"]["recommendations"], np.ndarray)
assert "ItemKNN" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["ItemKNN"]
assert isinstance(recommendations["recommenders"]["ItemKNN"]["recommendations"], np.ndarray)
assert "TopPop" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["TopPop"]
assert isinstance(recommendations["recommenders"]["TopPop"]["recommendations"], np.ndarray)


1194/1194

### Evaluate Recommendations

Implement the function such that it evaluates the previously generated recommendations. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary.

In [120]:
config_test = {
    "top_k": 10,
    "test_inter": test_interaction_matrix,
    "recommenders": {}  # here you can access the recommendations from get_recommendations_for_algorithms

}
# add dictionary with recommendations to config dictionary
config_test.update(recommendations)

In [121]:
def evaluate_algorithms(config: dict) -> dict:
    """
    config - dict - configuration as defined above

    returns - dict - { Recommender Key from input dict: { "ndcg": float - ndcg from evaluation for this recommender} }
    """

    metrics = {
        "SVD": {
        },
        "ItemKNN": {
        },
        "TopPop": {
        },
    }

    # TODO: YOUR IMPLEMENTATION.
    top_k = config["top_k"]
    test_inter = config["test_inter"]
    
    metrics["SVD"]["ndcg"] = get_ndcg_score(predictions=config["recommenders"]["SVD"]["recommendations"],
                                            test_interaction_matrix=test_inter,
                                            topK=top_k)
    
    metrics["ItemKNN"]["ndcg"] = get_ndcg_score(predictions=config["recommenders"]["ItemKNN"]["recommendations"],
                                            test_interaction_matrix=test_inter,
                                            topK=top_k)
    
    metrics["TopPop"]["ndcg"] = get_ndcg_score(predictions=config["recommenders"]["TopPop"]["recommendations"],
                                            test_interaction_matrix=test_inter,
                                            topK=top_k)
    
    return metrics

### Evaluating Every Algorithm
Make sure everything works.
We expect KNN to outperform other algorithms on our small data sample.

In [122]:
evaluations = evaluate_algorithms(config_test)

assert "SVD" in evaluations and "ndcg" in evaluations["SVD"] and isinstance(evaluations["SVD"]["ndcg"], float)
assert "ItemKNN" in evaluations and "ndcg" in evaluations["ItemKNN"] and isinstance(evaluations["ItemKNN"]["ndcg"], float)
assert "TopPop" in evaluations and "ndcg" in evaluations["TopPop"] and isinstance(evaluations["TopPop"]["ndcg"], float)

In [123]:
for recommender in evaluations.keys():
    print(f"{recommender} ndcg: {evaluations[recommender]['ndcg']}")

SVD ndcg: 0.1584739887967836
ItemKNN ndcg: 0.27235871317055826
TopPop ndcg: 0.1424654348837062


## Questions and Potential Future Work
* How would you try improve performance of all three algorithms? - Tune hyperparameters
* What other metrics would you consider to compare these recommender systems? - F-measure, Average Precision (AP), Mean Average Precision (MAP)

In [124]:
# The end.