In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np

The project focuses on the top-k recommendation problem: the evaluation applies only to the first k items that each recommender system suggests.

It's useful to create a small synthetic dataset to test our evaluator. For simplicity, a single user is sufficient.

A synthetic test set is all we really need, since our dummy recommender has no actual implementation and learns nothing from the training data. Note that testing_ratings contains only 5-star ratings, because that is how we split the dataset into training and testing sets: only 5-star ratings go into the testing set. This is a common methodology in papers on top-k recommendation.
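The split itself is not part of this notebook. As a rough sketch of the idea only, the following moves a random share of the 5-star ratings into the testing set and leaves everything else for training. The helper name `split_five_star`, its `test_fraction` and `seed` parameters, and the sample ratings are all assumptions made for illustration, not the project's actual split code.

```python
import pandas as pd

def split_five_star(ratings, test_fraction=0.2, seed=0):
    """Hypothetical helper: hold out a random share of the 5-star ratings."""
    five_star = ratings[ratings.rating == 5]
    # Only 5-star ratings are eligible for the testing set
    testing = five_star.sample(frac=test_fraction, random_state=seed)
    # Everything that was not held out stays in the training set
    training = ratings.drop(testing.index)
    return training, testing

# Made-up ratings for a single user
ratings = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1],
    'book_id': [1, 2, 3, 4, 5, 6],
    'rating': [5, 5, 5, 5, 3, 1]
})
training, testing = split_five_star(ratings, test_fraction=0.5)
```

With this sketch, every row of `testing` has rating 5, and `training` and `testing` together cover the original ratings without overlap.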

In [12]:
training_ratings = pd.DataFrame({
    'user_id': [1],
    'book_id': [123],
    'rating': [1]
})

In [3]:
testing_ratings = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'book_id': [1, 2, 3, 4],
    'rating': [5, 5, 5, 5]
})
testing_ratings

   user_id  book_id  rating
0        1        1       5
1        1        2       5
2        1        3       5
3        1        4       5


And our dummy recommender:

The expected output of our recommender system is a mapping from each user_id to the top-k items that it suggests (a dict in the code below). For simplicity, k=3 is assumed here, although k=10 will be our main focus.

In [14]:
class DummyRecommender():
    name = 'Dummy RS'
    
    def fit(self, training_ratings):
        pass
    
    def all_recommendation(self):
        return {1: [6, 5, 1]}

The dummy recommender will recommend items 6, 5 and 1 to user 1.

In [16]:
model = DummyRecommender()
model.fit(training_ratings)
model.all_recommendation()

{1: [6, 5, 1]}
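The dummy model hard-codes its output, but a real recommender would typically derive the same user-to-items mapping from predicted scores. A minimal sketch of that step, assuming a hypothetical `scores` dataframe with one predicted `score` per (user_id, book_id) pair; the column name, the helper `top_k_from_scores` and the score values are made up for illustration:

```python
import pandas as pd

def top_k_from_scores(scores, k):
    """Hypothetical helper: keep each user's k highest-scored books, best first."""
    ranked = scores.sort_values('score', ascending=False)
    # head(k) keeps the first k rows of each user in the sorted order
    top = ranked.groupby('user_id').head(k)
    return top.groupby('user_id').book_id.apply(list).to_dict()

# Made-up predicted scores for user 1
scores = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'book_id': [1, 2, 5, 6],
    'score': [0.7, 0.1, 0.8, 0.9]
})
recommendations = top_k_from_scores(scores, k=3)
```

The result has the same shape as the output of `all_recommendation()`: one list of book ids per user, ordered from most to least recommended.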

## The evaluator

In [19]:
class Evaluator():
    def __init__(self, k=10, training_ratings=None, testing_ratings=None):
        self.k = k
        if training_ratings is not None:
            self.training_ratings = training_ratings
            self.num_users = len(self.training_ratings.user_id.unique())
            self.num_books = len(self.training_ratings.book_id.unique())
        if testing_ratings is not None:
            self.testing_ratings = testing_ratings
            self.testing_idx = {}
            for user_id in testing_ratings.user_id.unique():
                self.testing_idx[user_id] = testing_ratings[testing_ratings.user_id==user_id].book_id.values
        self.result = {}
    
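    def _average_precision(self, pred, truth):
        # Boolean mask: True where a recommended item is in the user's test set
        # (np.isin replaces np.in1d, which is deprecated in recent NumPy versions)
        in_arr = np.isin(pred, truth)
        score = 0.0
        num_hits = 0.0
        for idx, correct in enumerate(in_arr):
            if correct:
                num_hits += 1
                # Precision at this rank (idx is 0-based, so the rank is idx + 1)
                score += num_hits / (idx + 1)
        # Normalise by the largest number of hits achievable within the top k
        return score / min(len(truth), self.k)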
    
    def evaluate(self, model):
        model.fit(self.training_ratings)
        preds = model.all_recommendation()
        ap_sum = 0
        for user_id in preds.keys():
            # Only the first k recommendations are evaluated
            pred = preds[user_id][:self.k]
            truth = self.testing_idx[user_id]
            ap_sum += self._average_precision(pred, truth)
        
        # The mean is taken over the number of distinct users in the training ratings
        self.result[model.name] = {}
        self.result[model.name]['Mean Average Precision'] = "%.2f%%" % (ap_sum / self.num_users * 100)
        
    def print_result(self):
        print(pd.DataFrame(self.result).loc[['Mean Average Precision']])

We'll start with mean average precision only. Later, this project will turn to other metrics that capture qualities beyond pure effectiveness.
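Read off the Evaluator code above, the metric can be written as follows. The notation is introduced here for illustration only: $p_{u,i}$ is the item recommended to user $u$ at rank $i$, $T_u$ is the set of test items of user $u$, $h_u(i)$ is the number of hits among the first $i$ recommendations, and $|U|$ is the number of distinct users in the training ratings.

$$
\mathrm{AP@}k(u) = \frac{1}{\min(|T_u|, k)} \sum_{i=1}^{k} \frac{h_u(i)}{i} \, \mathbb{1}\left[p_{u,i} \in T_u\right],
\qquad
\mathrm{MAP@}k = \frac{1}{|U|} \sum_{u} \mathrm{AP@}k(u)
$$

The sum in MAP@k runs over the users returned by the recommender. Dividing by $\min(|T_u|, k)$ rather than $|T_u|$ means a recommender whose top-k list consists entirely of test items scores 1 even when the user has more than k test items.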

Let's test the evaluator on our synthetic testing ratings.

In [22]:
evl = Evaluator(k=3, training_ratings=training_ratings, testing_ratings=testing_ratings)
evl.evaluate(model)
evl.print_result()

                       Dummy RS
Mean Average Precision   11.11%


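The dummy prediction is [6, 5, 1].

The true items are [1, 2, 3, 4].

Of the items in the truth array, only item 1 is retrieved by our dummy model, at position 3. The precision at that position is 1 / 3 and the normaliser is min(4, 3) = 3, so average precision = (1 / 3) / 3 = 1 / 9, or about 11.11%.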
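The same arithmetic can be checked without the Evaluator class. The standalone function below is a sketch that mirrors the logic of `_average_precision` in plain Python. The name `average_precision_at_k` and the early return for an empty truth list are assumptions added here; the original method does not handle that case.

```python
def average_precision_at_k(pred, truth, k):
    """AP@k with the same min(len(truth), k) normalisation as the Evaluator."""
    truth = set(truth)
    if not truth:
        # Assumption: a user with no test items contributes a score of 0
        return 0.0
    score = 0.0
    num_hits = 0
    for rank, item in enumerate(pred[:k], start=1):
        if item in truth:
            num_hits += 1
            # Precision at this rank
            score += num_hits / rank
    return score / min(len(truth), k)

dummy_ap = average_precision_at_k([6, 5, 1], [1, 2, 3, 4], k=3)
perfect_ap = average_precision_at_k([1, 2, 3], [1, 2, 3, 4], k=3)
print(dummy_ap, perfect_ap)
```

Here `dummy_ap` equals 1 / 9, matching the evaluator's 11.11%, while a recommender that returns three true items in its top 3 would reach the maximum score of 1.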

Our evaluator is working as intended.

For the next 2-3 notebooks, this evaluator will be copied over and tested further before it is moved into its own file.