# Recommender Systems 2020/21


## Practice 4 - Building an ItemKNN Recommender From Scratch

This practice session guides students through building a recommender system from scratch: data loading, preprocessing, model creation, evaluation, hyperparameter tuning, and a sample submission to the competition.

Outline:
- Data Loading with Pandas (MovieLens 10M, link: http://files.grouplens.org/datasets/movielens/ml-10m.zip)
- Data Preprocessing
- Dataset splitting in Train, Validation and Testing
- Similarity Measures
- Collaborative Item KNN
- Evaluation Metrics
- Evaluation Procedure
- Hyperparameter Tuning
- Submission to competition

In [1]:
__author__ = 'Fernando Benjamín Pérez Maurera'
__credits__ = ['Fernando Benjamín Pérez Maurera']
__license__ = 'MIT'
__version__ = '0.1.0'
__maintainer__ = 'Fernando Benjamín Pérez Maurera'
__email__ = 'fernandobenjamin.perez@polimi.it'
__status__ = 'Dev'

import os
from typing import Tuple, Callable, Dict, Optional, List

import numpy as np
import pandas as pd
import scipy.sparse as sp

from sklearn.model_selection import train_test_split


## Dataset Loading with pandas

The Movielens 10M dataset is a collection of ratings given by users to items. They are stored in a columnar `.dat` file using `::` as separators for each attribute, and every row follows this structure: `<user_id>::<item_id>::<rating>::<timestamp>`. 

The function `read_csv` from pandas provides a wonderful and fast interface to load tabular data like this. For better results and performance we provide the separator `::`, the column names `["user_id", "item_id", "ratings", "timestamp"]`, and the types of each attribute in the `dtype` parameter.

In [2]:
pd.read_csv?

In [3]:
def load_data():
    return pd.read_csv("./data/Movielens_10M/ml-10M100K/ratings.dat", 
                       sep="::", 
                       engine="python",  # multi-character separators are handled by the Python parsing engine
                       names=["user_id", "item_id", "ratings", "timestamp"],
                       header=None,
                       dtype={"user_id": np.int32,
                               "item_id": np.int32,
                               "ratings": np.int32,
                               "timestamp": np.int64})

In [4]:
ratings = load_data()
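
To see the parsing on its own, here is a minimal sketch that reads two rows in the same `::`-separated layout from an in-memory string instead of the real file. The variable names `sample` and `toy_ratings` are made up for this illustration.

```python
import io

import numpy as np
import pandas as pd

# Two rows in the `<user_id>::<item_id>::<rating>::<timestamp>` layout.
sample = io.StringIO("1::122::5::838985046\n1::185::5::838983525\n")

toy_ratings = pd.read_csv(sample,
                          sep="::",
                          engine="python",  # multi-character separators need the Python engine
                          names=["user_id", "item_id", "ratings", "timestamp"],
                          header=None,
                          dtype={"user_id": np.int32,
                                 "item_id": np.int32,
                                 "ratings": np.int32,
                                 "timestamp": np.int64})

print(toy_ratings)
print(toy_ratings.dtypes)
```

Passing `names` together with `header=None` tells pandas that the file has no header row, so the first line is read as data.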



In [5]:
ratings

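|   | user_id | item_id | ratings | timestamp |
|---|---------|---------|---------|-----------|
| 0 | 1 | 122 | 5 | 838985046 |
| 1 | 1 | 185 | 5 | 838983525 |
| 2 | 1 | 231 | 5 | 838983392 |
| 3 | 1 | 292 | 5 | 838983421 |
| 4 | 1 | 316 | 5 | 838983392 |
| 5 | 1 | 329 | 5 | 838983392 |
| 6 | 1 | 355 | 5 | 838984474 |
| 7 | 1 | 356 | 5 | 838983653 |
| 8 | 1 | 362 | 5 | 838984885 |
| 9 | 1 | 364 | 5 | 838983707 |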


## Data Preprocessing

This section works with the previously loaded ratings dataset and extracts the number of users, the number of items, and the min/max user/item identifiers. Exploring and understanding the data is an essential step before fitting any recommender/algorithm.

In this specific case, we discover that item identifiers range from 1 to 65133, yet there are only 10677 different items (meaning that ~5/6 of the item identifiers are not present in the dataset). To ease further calculations, we create new contiguous user/item identifiers and assign exactly one of them to each user/item. To keep track of these mappings, we add them to the original dataframe using the `pd.merge` function.

In [6]:
pd.merge?

In [7]:
def preprocess_data(ratings: pd.DataFrame):
    unique_users = ratings.user_id.unique()
    unique_items = ratings.item_id.unique()
    
    num_users, min_user_id, max_user_id = unique_users.size, unique_users.min(), unique_users.max()
    num_items, min_item_id, max_item_id = unique_items.size, unique_items.min(), unique_items.max()
    
    print(num_users, min_user_id, max_user_id)
    print(num_items, min_item_id, max_item_id)
    
    mapping_user_id = pd.DataFrame({"mapped_user_id": np.arange(num_users), "user_id": unique_users})
    mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(num_items), "item_id": unique_items})
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_user_id,
                       how="inner",
                       on="user_id")
    
    ratings = pd.merge(left=ratings, 
                       right=mapping_item_id,
                       how="inner",
                       on="item_id")
    
    return ratings
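
The remapping is easier to follow on a handful of rows. The sketch below applies the same `unique` + `pd.merge` steps to a made-up dataframe (`toy`) whose item identifiers have large gaps; the data is hypothetical and only meant to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical interactions: only three distinct items, but identifiers up to 65133.
toy = pd.DataFrame({"user_id": [1, 1, 7, 7, 42],
                    "item_id": [122, 5000, 122, 65133, 5000]})

# `unique` keeps the order of first appearance, so identifiers are assigned in that order.
unique_items = toy.item_id.unique()
mapping_item_id = pd.DataFrame({"mapped_item_id": np.arange(unique_items.size),
                                "item_id": unique_items})

# The inner join attaches the contiguous identifier to every interaction.
toy = pd.merge(left=toy, right=mapping_item_id, how="inner", on="item_id")
print(toy)
```

Every row keeps its original `item_id` next to the new `mapped_item_id`, which is what later allows mapping recommendations back to the original identifiers.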
    

In [8]:
ratings = preprocess_data(ratings)

69878 1 71567
10677 1 65133


In [9]:
ratings 

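|   | user_id | item_id | ratings | timestamp | mapped_user_id | mapped_item_id |
|---|---------|---------|---------|-----------|----------------|----------------|
| 0 | 1 | 122 | 5 | 838985046 | 0 | 0 |
| 1 | 139 | 122 | 3 | 974302621 | 128 | 0 |
| 2 | 149 | 122 | 2 | 1112342322 | 136 | 0 |
| 3 | 182 | 122 | 3 | 943458784 | 168 | 0 |
| 4 | 215 | 122 | 4 | 1102493547 | 201 | 0 |
| 5 | 217 | 122 | 3 | 844429650 | 203 | 0 |
| 6 | 281 | 122 | 3 | 844437024 | 265 | 0 |
| 7 | 326 | 122 | 3 | 838997566 | 307 | 0 |
| 8 | 351 | 122 | 1 | 955831012 | 332 | 0 |
| 9 | 357 | 122 | 3 | 945884437 | 338 | 0 |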


## Dataset Splitting into Train, Validation, and Test

This is the last step before creating the recommender. It is an *important* one, as it is the basis for the training, hyperparameter optimization, and evaluation of the recommender(s).

Here we take the ratings (which we loaded and preprocessed before) and create the `train`, `validation`, and `test` User-Rating Matrices (URM). These must be disjoint to avoid information leakage from the train set into the validation/test sets. In our case it is safe to use the `train_test_split` function from `scikit-learn`, because the dataset contains only *one* datapoint for every `(user, item)` pair. We first create the `test` set and then create the `validation` set by splitting the remaining `train` set again.


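`train_test_split` takes an array (or several arrays) and divides it into `train` and `test` according to a given size (in our case `testing_percentage` and `validation_percentage`, each of which must be a float between 0 and 1). Because the second split is applied to what remains after the first one, `validation_percentage` is relative to the remaining training data, not to the full dataset: with `testing_percentage=0.20` and `validation_percentage=0.10`, the validation set holds 0.10 × 0.80 = 8% of all ratings.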

After we have our different splits, we create the *sparse URMs* by using the `csr_matrix` function from `scipy`.
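
As a small sanity check of the disjointness argument, the sketch below splits a made-up set of 20 distinct `(user, item)` pairs and builds two sparse matrices from it. All names ending in `_toy` and the data itself are assumptions made for this illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import train_test_split

# Hypothetical toy data: every cell of a 4x5 grid is one distinct (user, item) pair.
users, items = np.divmod(np.arange(20), 5)
values = np.ones(20, dtype=np.int32)

(u_train, u_test,
 i_train, i_test,
 r_train, r_test) = train_test_split(users, items, values,
                                     test_size=0.2, shuffle=True, random_state=1234)

urm_train_toy = sp.csr_matrix((r_train, (u_train, i_train)), shape=(4, 5))
urm_test_toy = sp.csr_matrix((r_test, (u_test, i_test)), shape=(4, 5))

# The element-wise product is non-zero only where both matrices store an interaction.
overlap = urm_train_toy.multiply(urm_test_toy).sum()
print(urm_train_toy.nnz, urm_test_toy.nnz, overlap)
```

If the input contained repeated `(user, item)` pairs, the same pair could land in both splits and the overlap would no longer be guaranteed to be zero.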

In [10]:
train_test_split?

In [11]:
def dataset_splits(ratings, num_users, num_items, validation_percentage: float, testing_percentage: float):
    seed = 1234
    
    (user_ids_training, user_ids_test,
     item_ids_training, item_ids_test,
     ratings_training, ratings_test) = train_test_split(ratings.mapped_user_id,
                                                        ratings.mapped_item_id,
                                                        ratings.ratings,
                                                        test_size=testing_percentage,
                                                        shuffle=True,
                                                        random_state=seed)
    
    # Reuse the same seed so the validation split is reproducible as well.
    (user_ids_training, user_ids_validation,
     item_ids_training, item_ids_validation,
     ratings_training, ratings_validation) = train_test_split(user_ids_training,
                                                              item_ids_training,
                                                              ratings_training,
                                                              test_size=validation_percentage,
                                                              shuffle=True,
                                                              random_state=seed)
    
    urm_train = sp.csr_matrix((ratings_training, (user_ids_training, item_ids_training)), 
                              shape=(num_users, num_items))
    
    urm_validation = sp.csr_matrix((ratings_validation, (user_ids_validation, item_ids_validation)), 
                              shape=(num_users, num_items))
    
    urm_test = sp.csr_matrix((ratings_test, (user_ids_test, item_ids_test)), 
                              shape=(num_users, num_items))
    
    
    
    return urm_train, urm_validation, urm_test
    
    
    
    

In [12]:
urm_train, urm_validation, urm_test = dataset_splits(ratings, 
                                                     num_users=69878, 
                                                     num_items=10677, 
                                                     validation_percentage=0.10, 
                                                     testing_percentage=0.20)

In [13]:
urm_train

<69878x10677 sparse matrix of type '<class 'numpy.int32'>'
	with 7200038 stored elements in Compressed Sparse Row format>

In [14]:
urm_validation

<69878x10677 sparse matrix of type '<class 'numpy.int32'>'
	with 800005 stored elements in Compressed Sparse Row format>

In [16]:
urm_test

<69878x10677 sparse matrix of type '<class 'numpy.int32'>'
	with 2000011 stored elements in Compressed Sparse Row format>

## Cosine Similarity

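We can implement the cosine similarity in different ways, some faster than others.

The simplest version loops item by item and calculates the similarity of each pair of items. With `shrink = 0` this is the plain cosine between the two item profiles; a positive shrink term lowers the similarity of items whose profiles have small norms. The code additionally adds a small constant (`1e-6`) to the denominator to avoid divisions by zero.
$$ W_{i,j} 
= \frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert + shrink} 
= \frac{\sum_{u \in U}{URM_{u,i} \cdot URM_{u,j}}}{\sqrt{\sum_{u \in U}{URM_{u,i}^2}} \cdot \sqrt{\sum_{u \in U}{URM_{u,j}^2}} + shrink} $$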
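
A quick worked example may help to see what the shrink term does. The two vectors below are made-up item profiles over three users; this is only an illustration of the formula, not part of the recommender.

```python
import numpy as np

# Hypothetical item profiles (ratings given by three users to items i and j).
v_i = np.array([1.0, 0.0, 2.0])
v_j = np.array([2.0, 0.0, 1.0])
shrink = 5

dot = v_i @ v_j                                     # 1*2 + 0*0 + 2*1 = 4
norms = np.linalg.norm(v_i) * np.linalg.norm(v_j)   # sqrt(5) * sqrt(5) = 5

plain_cosine = dot / norms               # approximately 4 / 5
shrunk_cosine = dot / (norms + shrink)   # approximately 4 / 10

print(plain_cosine, shrunk_cosine)
```

With so few ratings behind each profile, the shrink term halves the similarity; for items with large norms the same `shrink` would barely change the value.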


In [17]:
def naive_similarity(urm: sp.csc_matrix, shrink: int):
    num_items = urm.shape[1]
    weights = np.empty(shape=(num_items, num_items))
    for item_i in range(num_items):
        item_i_profile = urm[:, item_i] # mx1 vector
        
        for item_j in range(num_items):
            item_j_profile = urm[:, item_j] # mx1 vector
            
            numerator = item_i_profile.T.dot(item_j_profile).todense()[0,0]
            denominator = (np.sqrt(np.sum(item_i_profile.power(2)))
                           * np.sqrt(np.sum(item_j_profile.power(2)))
                           + shrink
                           + 1e-6)
            
            weights[item_i, item_j] = numerator / denominator
    
    np.fill_diagonal(weights, 0.0)
    return weights
    
            

Another (faster) version computes one full row of the similarity matrix at a time with a matrix-vector product
$$ W_{i,I} 
= \frac{v_i \cdot URM_{I}}{\lVert v_i \rVert \, IW_{I} + shrink} $$

where 

$$ IW_{i} = \sqrt{\sum_{u \in U}{URM_{u,i}^2}} $$

In [26]:
def vector_similarity(urm: sp.csc_matrix, shrink: int):
    item_weights = np.sqrt(
        np.sum(urm.power(2), axis=0)
    ).A.flatten()
    
    num_items = urm.shape[1]
    urm_t = urm.T
    weights = np.empty(shape=(num_items, num_items))
    for item_id in range(num_items):
        # .toarray() is the explicit spelling of .A for sparse results
        numerator = urm_t.dot(urm[:, item_id]).toarray().ravel()
        denominator = item_weights[item_id] * item_weights + shrink + 1e-6
        
        weights[item_id] = numerator / denominator
        
    np.fill_diagonal(weights, 0.0)
    return weights
    

Lastly, the fastest but most memory-intensive version computes the whole similarity matrix at once with a single matrix product (the division is element-wise)
$$ W  
= \frac{URM^{T} \cdot URM}{IW^{T} IW + shrink} $$

In [29]:
def matrix_similarity(urm: sp.csc_matrix, shrink: int):
    item_weights = np.sqrt(
        np.sum(urm.power(2), axis=0)
    ).A
    
    numerator = urm.T.dot(urm)
    denominator = item_weights.T.dot(item_weights) + shrink + 1e-6
    weights = numerator / denominator
    np.fill_diagonal(weights, 0.0)
    
    return weights
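
To convince ourselves that the matrix formula and the pairwise loop describe the same quantity, the following sketch re-implements both on a tiny dense NumPy array. The function names `dense_matrix_similarity` and `dense_loop_similarity` and the toy URM are hypothetical and exist only for this check; they are not the sparse implementations used in this notebook.

```python
import numpy as np


def dense_matrix_similarity(urm: np.ndarray, shrink: float) -> np.ndarray:
    # Norm of every item profile (one value per column).
    item_norms = np.sqrt((urm ** 2).sum(axis=0))
    numerator = urm.T @ urm
    # Outer product of the norms gives the pairwise norm products.
    denominator = np.outer(item_norms, item_norms) + shrink + 1e-6
    weights = numerator / denominator
    np.fill_diagonal(weights, 0.0)
    return weights


def dense_loop_similarity(urm: np.ndarray, shrink: float) -> np.ndarray:
    num_items = urm.shape[1]
    weights = np.zeros((num_items, num_items))
    for i in range(num_items):
        for j in range(num_items):
            if i == j:
                continue  # the diagonal stays at zero
            norms = np.linalg.norm(urm[:, i]) * np.linalg.norm(urm[:, j])
            weights[i, j] = (urm[:, i] @ urm[:, j]) / (norms + shrink + 1e-6)
    return weights


# Hypothetical URM with 3 users and 3 items.
toy_urm = np.array([[5, 3, 0],
                    [4, 0, 1],
                    [0, 2, 5]], dtype=np.float64)

w_matrix = dense_matrix_similarity(toy_urm, shrink=5)
w_loop = dense_loop_similarity(toy_urm, shrink=5)
print(np.allclose(w_matrix, w_loop))
```

The comparison uses `np.allclose` rather than exact equality, because the two versions perform the floating-point operations in a different order and may differ in the last bits.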

In [18]:
urm_csc = urm_train.tocsc()
shrink = 5
slice_size = 100

In [19]:
%%time 
naive_weights = naive_similarity(urm_csc[:slice_size,:slice_size], shrink)
naive_weights

CPU times: user 8.08 s, sys: 67.5 ms, total: 8.14 s
Wall time: 8.28 s


array([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
        0.        ],
       [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
        0.        ],
       [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
        0.        ],
       ...,
       [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
        0.1006602 ],
       [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
        0.        ]])

In [27]:
%%time
vector_weights = vector_similarity(urm_csc[:slice_size,:slice_size], shrink)
vector_weights

CPU times: user 60.8 ms, sys: 2.53 ms, total: 63.4 ms
Wall time: 62.5 ms


array([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
        0.        ],
       [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
        0.        ],
       [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
        0.        ],
       ...,
       [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
        0.1006602 ],
       [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
        0.        ]])

In [30]:
%%time
matrix_weights = matrix_similarity(urm_csc[:slice_size,:slice_size], shrink)
matrix_weights

CPU times: user 4.71 ms, sys: 2.08 ms, total: 6.79 ms
Wall time: 5.22 ms


matrix([[0.        , 0.36632423, 0.36526804, ..., 0.        , 0.        ,
         0.        ],
        [0.36632423, 0.        , 0.54985153, ..., 0.        , 0.03425119,
         0.        ],
        [0.36526804, 0.54985153, 0.        , ..., 0.03108656, 0.11382563,
         0.        ],
        ...,
        [0.        , 0.        , 0.03108656, ..., 0.        , 0.2717996 ,
         0.1006602 ],
        [0.        , 0.03425119, 0.11382563, ..., 0.2717996 , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.1006602 , 0.        ,
         0.        ]])

In [28]:
np.array_equal(naive_weights, vector_weights)

True

In [31]:
np.array_equal(vector_weights, matrix_weights)

True

## Collaborative Filtering ItemKNN Recommender

This step creates a `CFItemKNN` class that represents a Collaborative Filtering ItemKNN Recommender. As we have mentioned in previous practice sessions, our recommenders have two main functions: `fit` and `recommend`. 

The first receives the training URM and the similarity function, computes the item-item similarities, and saves them (`weights`) in the class instance. 

The second function takes a user id, the train URM, the recommendation length, and a boolean flag that controls whether items the user has already seen are removed. It returns a recommendation list for the user.

In [65]:
class CFItemKNN(object):
    def __init__(self, shrink: int):
        self.shrink = shrink
        self.weights = None
    
    
    def fit(self, urm_train: sp.csc_matrix, similarity_function):
        if not sp.isspmatrix_csc(urm_train):
            raise TypeError(f"We expected a CSC matrix, we got {type(urm_train)}")
        
        self.weights = similarity_function(urm_train, self.shrink)
        
    def recommend(self, user_id: int, urm_train: sp.csr_matrix, at: Optional[int] = None, remove_seen: bool = True):
        user_profile = urm_train[user_id]
        
        # np.asarray handles both np.matrix and np.ndarray weights
        ranking = np.asarray(user_profile.dot(self.weights)).ravel()
        
        if remove_seen:
            user_profile_start = urm_train.indptr[user_id]
            user_profile_end = urm_train.indptr[user_id+1]
            
            seen_items = urm_train.indices[user_profile_start:user_profile_end]
            
            ranking[seen_items] = -np.inf
            
        ranking = np.flip(np.argsort(ranking))
        return ranking[:at]
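
The ranking step can also be looked at in isolation. The sketch below is one possible alternative, not the method used in `recommend`: it masks seen items in the same way but selects the top `k` scores with `np.argpartition`, so that only `k` candidates have to be sorted. The helper name `top_k_items` and the scores are made up.

```python
import numpy as np


def top_k_items(scores: np.ndarray, seen_items: np.ndarray, k: int) -> np.ndarray:
    # Work on a float copy so the caller's scores are left untouched.
    scores = scores.astype(np.float64, copy=True)
    scores[seen_items] = -np.inf          # seen items can never be recommended
    k = min(k, scores.size)               # assumes k >= 1
    # Indices of the k largest scores, in arbitrary order.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates by decreasing score.
    return candidates[np.argsort(-scores[candidates])]


# Hypothetical scores for four items; item 1 was already seen by the user.
toy_scores = np.array([0.1, 0.9, 0.5, 0.7])
toy_top = top_k_items(toy_scores, seen_items=np.array([1]), k=2)
print(toy_top)
```

For catalogues with thousands of items and short recommendation lists, partial selection avoids sorting scores that will be discarded anyway.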

In [66]:
itemknn_recommender = CFItemKNN(shrink=50)
itemknn_recommender

<__main__.CFItemKNN at 0x7fa9acf67748>

In [67]:
%%time

itemknn_recommender.fit(urm_train.tocsc(), matrix_similarity)

CPU times: user 11.5 s, sys: 1.33 s, total: 12.8 s
Wall time: 12.3 s


In [70]:
for user_id in range(10):
    print(itemknn_recommender.recommend(user_id=user_id, urm_train=urm_train, at=10, remove_seen=True))

[ 93  85 175  91  74  75  84  77  92  82]
[ 37  14  96  19 101 177 179  11   7 187]
[ 798  793   60  279  821  195  235  179   62 1906]
[  85  175   75   11    7   91    9    4 1009   19]
[1008  176  179  145  148 1122  403  387   24  213]
[195 179 228 235  37 415 382 404 259 401]
[ 179  399 1073  404  146  411  244 1122   34  241]
[ 519  382 1093  504  411  798  425  992  793  279]
[ 195  411  235  616  277  259 1147  179 1093  625]
[ 166  179  170  411  387  146  235   34  145 1316]


## Evaluation Metrics

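In this practice session we use the same evaluation metrics defined in Practice session 2, i.e., precision, recall and mean average precision (MAP). Each function below scores the recommendation list of a single user; the averages over users are computed later in the evaluation procedure.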

In [35]:
def recall(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    # Boolean mask: True where a recommended item is relevant (present in the test set)
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # Fraction of the relevant items that were recommended
    recall_score = np.sum(is_relevant) / relevant_items.shape[0]
    
    return recall_score
    
    
def precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # Fraction of the recommended items that are relevant
    precision_score = np.sum(is_relevant) / recommendations.shape[0]

    return precision_score

def mean_average_precision(recommendations: np.ndarray, relevant_items: np.ndarray) -> float:
    is_relevant = np.isin(recommendations, relevant_items, assume_unique=True)
    
    # Precision at every cutoff k, kept only at the positions that hold a relevant item
    precision_at_k = is_relevant * np.cumsum(is_relevant, dtype=np.float32) / (1 + np.arange(is_relevant.shape[0]))

    # Average precision of this user; the evaluator averages it over users to obtain MAP
    map_score = np.sum(precision_at_k) / np.min([relevant_items.shape[0], is_relevant.shape[0]])

    return map_score
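
A worked example with a short, made-up recommendation list shows how the three scores are obtained for one user. The identifiers below are hypothetical; the computations mirror the functions above.

```python
import numpy as np

# Hypothetical recommendation list (length 5) and test items of one user.
recommendations = np.array([10, 20, 30, 40, 50])
relevant_items = np.array([20, 50, 60])

# Hits at positions 2 and 5: [False, True, False, False, True]
is_relevant = np.isin(recommendations, relevant_items)

precision_score = is_relevant.sum() / recommendations.size   # 2 hits / 5 recommended
recall_score = is_relevant.sum() / relevant_items.size       # 2 hits / 3 relevant

# Precision at each cutoff, counted only where a hit occurs: [0, 1/2, 0, 0, 2/5]
precision_at_k = is_relevant * np.cumsum(is_relevant) / (1 + np.arange(is_relevant.size))
ap_score = precision_at_k.sum() / min(relevant_items.size, is_relevant.size)  # 0.9 / 3

print(precision_score, recall_score, ap_score)
```

Note that the average precision rewards hits that appear early in the list: moving the hit from position 5 to position 3 would raise it, while precision and recall would stay the same.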
    

## Evaluation Procedure

The evaluation procedure returns the averaged accuracy scores (in terms of precision, recall and MAP) for all users (that have at least 1 rating in the test set). It also calculates the number of evaluated and skipped users. It receives a recommender instance, and the train and test URMs.
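
The evaluator reads each user's test items directly from the CSR arrays `indptr` and `indices`. As a minimal sketch of how that lookup works, consider a made-up 3x4 matrix (`urm_toy`) in which the second user has no interactions:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical URM: user 0 rated items 1 and 3, user 1 rated nothing, user 2 rated item 0.
urm_toy = sp.csr_matrix(np.array([[0, 4, 0, 5],
                                  [0, 0, 0, 0],
                                  [3, 0, 0, 0]]))

for user_id in range(urm_toy.shape[0]):
    # indptr[u]:indptr[u+1] delimits the slice of `indices` that belongs to user u.
    start = urm_toy.indptr[user_id]
    end = urm_toy.indptr[user_id + 1]
    print(user_id, urm_toy.indices[start:end])
```

A user with an empty slice has nothing to be evaluated against, which is why such users are counted as skipped.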

In [74]:
def evaluator(recommender: object, urm_train: sp.csr_matrix, urm_test: sp.csr_matrix):
    recommendation_length = 10
    accum_precision = 0
    accum_recall = 0
    accum_map = 0
    
    num_users = urm_train.shape[0]
    
    num_users_evaluated = 0
    num_users_skipped = 0
    for user_id in range(num_users):
        user_profile_start = urm_test.indptr[user_id]
        user_profile_end = urm_test.indptr[user_id+1]
        
        relevant_items = urm_test.indices[user_profile_start:user_profile_end]
        
        if relevant_items.size == 0:
            num_users_skipped += 1
            continue
            
        recommendations = recommender.recommend(user_id=user_id, 
                                               at=recommendation_length, 
                                               urm_train=urm_train, 
                                               remove_seen=True)
        
        accum_precision += precision(recommendations, relevant_items)
        accum_recall += recall(recommendations, relevant_items)
        accum_map += mean_average_precision(recommendations, relevant_items)
        
        num_users_evaluated += 1
        
    
    accum_precision /= max(num_users_evaluated, 1)
    accum_recall /= max(num_users_evaluated, 1)
    accum_map /=  max(num_users_evaluated, 1)
    
    return accum_precision, accum_recall, accum_map, num_users_evaluated, num_users_skipped
    

In [75]:
%%time

accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped = evaluator(itemknn_recommender, 
                                                                                            urm_train, 
                                                                                            urm_test)

CPU times: user 2min 21s, sys: 1.58 s, total: 2min 23s
Wall time: 2min 25s


In [79]:
accum_precision, accum_recall, accum_map, num_user_evaluated, num_users_skipped

(0.25290409467323116, 0.16215820144910845, 0.17916230515769266, 69798, 80)

## Hyperparameter Tuning

This step is fundamental to getting the best performance out of an algorithm: we train the `CFItemKNN` recommender with different hyperparameter configurations and select the best-performing one.

In order for this step to be meaningful (and to avoid overfitting on the test set), we perform it using the `validation` URM as test set.

This step is the longest one to run in the entire pipeline when building a recommender.

In [90]:
def hyperparameter_tuning():
    shrinks = [0,1,5,10,50]
    results = []
    for shrink in shrinks:
        print(f"Currently trying shrink {shrink}")
        
        itemknn_recommender = CFItemKNN(shrink=shrink)
        itemknn_recommender.fit(urm_train.tocsc(), matrix_similarity)
        
        ev_precision, ev_recall, ev_map, _, _ = evaluator(itemknn_recommender, urm_train, urm_validation)
        
        results.append((shrink, (ev_precision, ev_recall, ev_map)))
        
    return results
    


In [91]:
%%time

hyperparameter_results = hyperparameter_tuning()

Currently trying shrink 0
Currently trying shrink 1
Currently trying shrink 5
Currently trying shrink 10
Currently trying shrink 50
CPU times: user 10min 19s, sys: 10 s, total: 10min 29s
Wall time: 10min 29s


In [92]:
hyperparameter_results

[(0, (0.10417748757143938, 0.15800031735328562, 0.08277634539659091)),
 (1, (0.104173035542056, 0.15799589203952574, 0.08277511836838672)),
 (5, (0.1041641314832892, 0.15798864104525823, 0.0827729518318653)),
 (10, (0.10416413148328918, 0.15799223727560768, 0.08276682548217212)),
 (50, (0.10415374341472793, 0.1579780807320129, 0.08274985204536793))]
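
Picking the best configuration from this list can be done programmatically. The sketch below copies the validation results shown above and selects the shrink with the highest MAP; using MAP as the selection criterion is an assumption made here, and another metric could be used instead.

```python
# Validation results copied from the output above: (shrink, (precision, recall, MAP)).
results = [
    (0, (0.10417748757143938, 0.15800031735328562, 0.08277634539659091)),
    (1, (0.104173035542056, 0.15799589203952574, 0.08277511836838672)),
    (5, (0.1041641314832892, 0.15798864104525823, 0.0827729518318653)),
    (10, (0.10416413148328918, 0.15799223727560768, 0.08276682548217212)),
    (50, (0.10415374341472793, 0.1579780807320129, 0.08274985204536793)),
]

# result[1][2] is the MAP of a configuration.
best_shrink, best_metrics = max(results, key=lambda result: result[1][2])
print(best_shrink, best_metrics)
```

The scores differ only around the fifth decimal place, so in this run the shrink value has very little influence on the validation results.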

## Submission to competition

This step mirrors what you will do when preparing a submission to the competition, after you have chosen and trained your recommender.

For this step the best suggestion is to select the best-performing configuration obtained in the hyperparameter tuning step and to train the recommender using both the `train` and `validation` sets. Remember that in the competition you *do not* have access to the test set.

We simulate the list of users to generate recommendations for by randomly selecting 100 users from the original identifiers. In the competition you will most likely be given the list of users for whom to generate recommendations.

Another consideration is that, for easier and faster calculations, we replaced the user/item identifiers with new ones in the preprocessing step. For the competition, you are required to generate recommendations using the dataset's original identifiers. For this reason, this step also maps the new identifiers back to the ones originally found in the dataset.

Last, this step creates a function that writes the recommendations for all users to a single file, one line per user, in the following format: 
```csv
<user_id>,<item_id_1> <item_id_2> <item_id_3> <item_id_4> <item_id_5> <item_id_6> <item_id_7> <item_id_8> <item_id_9> <item_id_10>
```

Always verify the competition's submission file format, as it might differ from the one presented here.
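
To make the format concrete, here is a minimal sketch that maps a few recommended items back to their original identifiers and formats one line. The helper `format_submission_line`, the three-entry mapping, and the recommendation list are hypothetical; an in-memory buffer stands in for the real file.

```python
import io


def format_submission_line(user_id: int, items: list) -> str:
    # One line per user: the user id, a comma, then the items separated by spaces.
    return f"{user_id},{' '.join(str(item) for item in items)}\n"


# Hypothetical mapping from contiguous identifiers back to original item identifiers.
mapping_to_item_id_toy = {0: 122, 1: 185, 2: 231}

# Hypothetical recommendations expressed with the contiguous identifiers.
mapped_recommendations = [2, 0, 1]
original_items = [mapping_to_item_id_toy[item] for item in mapped_recommendations]

buffer = io.StringIO()
buffer.write(format_submission_line(user_id=1, items=original_items))
print(buffer.getvalue())
```

Converting the identifiers before writing keeps the file free of the internal contiguous identifiers, which have no meaning outside this notebook.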

In [100]:
best_shrink = 0
urm_train_validation = urm_train + urm_validation


In [101]:
best_recommender = CFItemKNN(shrink=best_shrink)
best_recommender.fit(urm_train_validation.tocsc(), matrix_similarity)

In [102]:
users_to_recommend = np.random.choice(ratings.user_id.unique(), size=100, replace=False)
users_to_recommend

array([40536, 49488, 49027, 29190, 49628, 66278, 17103, 23818, 12478,
       63886, 65972, 34986, 38319, 15855, 56982, 31855, 23636, 62620,
       34576, 14511, 55701, 69757, 66263, 18236, 16196, 29592, 65752,
       18857, 21273, 55964, 21366, 17794,  4773, 22825, 65160, 21034,
       11477,  3113, 15141, 18417,  8258, 24614,  2792, 15623, 40459,
       55495, 69727, 61689,   808, 10745, 39082, 60191, 29653, 33863,
        5817,  3991, 70220, 24478, 22145, 63028, 42481, 17621, 50446,
       52154, 45137, 56487, 48219, 68044, 48168, 20727, 41317, 32000,
         761, 69862,  1393, 45440, 41666, 60975, 65756, 68451, 61114,
       20542, 29077, 68373, 64886, 45504,  2284, 10828, 67424, 55725,
       11531, 61683, 60648, 32854, 60272, 34765, 53522, 12346,  7316,
       54992])

In [106]:
mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))

In [107]:
mapping_to_item_id

{0: 122,
 1: 185,
 2: 231,
 3: 292,
 4: 316,
 5: 329,
 6: 355,
 7: 356,
 8: 362,
 9: 364,
 10: 370,
 11: 377,
 12: 420,
 13: 466,
 14: 480,
 15: 520,
 16: 539,
 17: 586,
 18: 588,
 19: 589,
 20: 594,
 21: 616,
 22: 110,
 23: 151,
 24: 260,
 25: 376,
 26: 590,
 27: 648,
 28: 719,
 29: 733,
 30: 736,
 31: 780,
 32: 786,
 33: 802,
 34: 858,
 35: 1049,
 36: 1073,
 37: 1210,
 38: 1356,
 39: 1391,
 40: 1544,
 41: 213,
 42: 1148,
 43: 1246,
 44: 1252,
 45: 1276,
 46: 1288,
 47: 1408,
 48: 1552,
 49: 1564,
 50: 1597,
 51: 1674,
 52: 3408,
 53: 3684,
 54: 4535,
 55: 4677,
 56: 4995,
 57: 5299,
 58: 5505,
 59: 5527,
 60: 5952,
 61: 6287,
 62: 6377,
 63: 6539,
 64: 7153,
 65: 7155,
 66: 8529,
 67: 8533,
 68: 8783,
 69: 27821,
 70: 33750,
 71: 21,
 72: 34,
 73: 39,
 74: 150,
 75: 153,
 76: 161,
 77: 165,
 78: 208,
 79: 253,
 80: 266,
 81: 317,
 82: 344,
 83: 349,
 84: 367,
 85: 380,
 86: 410,
 87: 432,
 88: 434,
 89: 435,
 90: 440,
 91: 500,
 92: 587,
 93: 592,
 94: 595,
 95: 597,
 96: 1,
 97: 7,


In [110]:
def prepare_submission(ratings: pd.DataFrame, users_to_recommend: np.ndarray, urm_train: sp.csr_matrix, recommender: object):
    # Original and mapped identifiers of the users we need recommendations for
    users_ids_and_mappings = ratings[ratings.user_id.isin(users_to_recommend)][["user_id", "mapped_user_id"]].drop_duplicates()
    
    # Lookup used to translate recommended items back to their original identifiers
    mapping_to_item_id = dict(zip(ratings.mapped_item_id, ratings.item_id))
    
    
    recommendation_length = 10
    submission = []
    for idx, row in users_ids_and_mappings.iterrows():
        user_id = row.user_id
        mapped_user_id = row.mapped_user_id
        
        recommendations = recommender.recommend(user_id=mapped_user_id,
                                                urm_train=urm_train,
                                                at=recommendation_length,
                                                remove_seen=True)
        
        submission.append((user_id, [mapping_to_item_id[item_id] for item_id in recommendations]))
        
    return submission
    

In [111]:
submission = prepare_submission(ratings, users_to_recommend, urm_train_validation, best_recommender)


In [112]:
submission

[(11477, [1270, 1097, 1198, 1580, 1197, 2858, 1196, 1784, 2762, 608]),
 (22825, [260, 589, 457, 356, 1270, 1240, 296, 318, 1265, 527]),
 (54992, [592, 457, 153, 480, 590, 377, 356, 589, 588, 110]),
 (67424, [380, 377, 480, 296, 590, 356, 318, 589, 500, 292]),
 (3991, [2762, 4963, 3793, 4306, 5952, 1923, 1517, 1200, 2987, 4027]),
 (11531, [592, 480, 590, 367, 316, 253, 597, 364, 587, 500]),
 (17103, [457, 592, 380, 356, 590, 110, 50, 349, 454, 586]),
 (18857, [150, 589, 593, 318, 597, 539, 329, 21, 527, 474]),
 (23636, [1291, 1198, 1097, 260, 1265, 457, 593, 2115, 592, 780]),
 (23818, [457, 153, 480, 377, 356, 589, 593, 588, 454, 10]),
 (24614, [1291, 1270, 5349, 4306, 1210, 3793, 1036, 1240, 2115, 480]),
 (29190, [2571, 1198, 2716, 4993, 1682, 1097, 1923, 3793, 4226, 4963]),
 (41666, [457, 592, 377, 480, 153, 165, 356, 454, 588, 587]),
 (42481, [150, 589, 480, 318, 377, 590, 50, 356, 110, 349]),
 (48219, [1198, 1291, 1036, 377, 1377, 648, 1387, 1374, 457, 1197]),
 (53522, [1270, 1240, 

In [116]:
def write_submission(submissions):
    with open("./submission.csv", "w") as f:
        for user_id, items in submissions:
            f.write(f"{user_id},{' '.join([str(item) for item in items])}\n")
    

In [117]:
write_submission(submission)

## Exercises

In this lecture we saw the simplest version of Cosine Similarity, which only includes a shrink factor. There are several optimizations and extensions that we can apply to it.

- Implement TopK Neighbors
- When calculating the cosine similarity we used `urm.T.dot(urm)` to calculate the numerator. However, depending on the dataset and the number of items, this matrix might not fit in memory. Implement a `block` version that is faster than our `vector` version but does not compute the full `urm.T.dot(urm)` beforehand.
- Implement Adjusted Cosine [Formula link](http://www10.org/cdrom/papers/519/node14.html)
- Implement Dice Similarity [Wikipedia Link](https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient)
- Implement an implicit CF ItemKNN.
- Implement a CF UserKNN model