# This notebook uses the code available in the repository of the Recommender systems course at Politecnico di Milano

We adapt the KNN User Collaborative Filtering Recommender to remove temporal data leakage.
We change the code so that we compute for every test session s' taken independently the k closest sessions s only from the train set.

We compute so only a part of the similarity matrix, in particular we consider only the part that computes similarities of the test sessions with the train sessions.

We don't compute similarities among the train sessions and other train sessions (which wouldn't help in predicting the test month) and also we don't compute similarities between test sessions and other test sessions (which would lead to leakages that are against the rules).

We also modify the code of Top Popular recommender to use only a slice of all the available sessions, this is done to allow the use of only the sessions closer to the test month.
This is done to allow the use of these recommenders as we would use any other recommender of the repository.

Item similarity based recommenders are trained only using train sessions and features provided for each item.
The inference uses the learned similarity matrix between the items, learned on train sessions, on the test sessions to suggest items close to those seen in the currently considered test session according to what can be learned in the train sessions.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pwd

/content


In [3]:
just_checking_integrity=True
boundary=900000
boundary_after=-75000

In [4]:
%%capture
!git clone https://github.com/MaurizioFD/RecSys_Course_AT_PoliMi.git

In [5]:
import sys
sys.path.append("./RecSys_Course_AT_PoliMi")

In [6]:
%cd ./RecSys_Course_AT_PoliMi

/content/RecSys_Course_AT_PoliMi


In [7]:
%%writefile ./run_compile_all_cython.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on 30/03/2019

@author: Maurizio Ferrari Dacrema
"""

import sys, glob, traceback, os
from CythonCompiler.run_compile_subprocess import run_compile_subprocess


if __name__ == '__main__':

    # cython_file_list = glob.glob('**/*.pyx', recursive=True)

    subfolder_to_compile_list = [
        "./Recommenders/Similarity",
    ]


    cython_file_list = []

    for subfolder_to_compile in subfolder_to_compile_list:
        cython_file_list.extend(glob.glob('{}/Cython/*.pyx'.format(subfolder_to_compile), recursive=True))



    print("run_compile_all_cython: Found {} Cython files in {} folders...".format(len(cython_file_list), len(subfolder_to_compile_list)))
    print("run_compile_all_cython: All files will be compiled using your current python environment: '{}'".format(sys.executable))


    save_folder_path = "./result_cython_compile/"
    log_file_path = save_folder_path + "log.txt"

    # If directory does not exist, create
    if not os.path.exists(save_folder_path):
        os.makedirs(save_folder_path)


    log_file = open(log_file_path, "w")

    fail_count = 0

    for file_index, file_path in enumerate(cython_file_list):

        file_path = file_path.replace("\\", "/").split("/")

        file_name = file_path[-1]
        file_path = "/".join(file_path[:-1]) + "/"


        log_string = "Compiling [{}/{}]: {}... ".format(file_index+1, len(cython_file_list), file_name)
        print(log_string)

        try:
            run_compile_subprocess(file_path, [file_name])

            log_string += "PASS\n"
            print(log_string)
            log_file.write(log_string)
            log_file.flush()

        except Exception as exc:
            traceback.print_exc()

            fail_count += 1
            log_string += "FAIL: {}\n".format(str(exc))
            print(log_string)
            log_file.write(log_string)
            log_file.flush()


    log_string = "run_compile_all_cython: Compilation finished. "

    if fail_count != 0:
        log_string += "FAILS {}/{}.".format(fail_count, len(cython_file_list))
    else:
        log_string += "SUCCESS."

    log_string += "\nCompilation log can be found here: '{}'".format(log_file_path)

    print(log_string)
    log_file.write(log_string)
    log_file.close()

Overwriting ./run_compile_all_cython.py


In [8]:
%%writefile ./Recommenders/Similarity/Cython/Compute_Similarity_Cython.pyx

"""
Created on 23/10/17
@author: Maurizio Ferrari Dacrema
"""

#cython: boundscheck=False
#cython: wraparound=True
#cython: initializedcheck=False
#cython: language_level=3
#cython: nonecheck=False
#cython: cdivision=True
#cython: unpack_method_calls=True
#cython: overflowcheck=False

"""
Determine the operative system. The interface of numpy returns a different type for argsort under windows and linux
http://docs.cython.org/en/latest/src/userguide/language_basics.html#conditional-compilation
"""
IF UNAME_SYSNAME == "linux":
    DEF LONG_t = "long"
ELIF  UNAME_SYSNAME == "Windows":
    DEF LONG_t = "long long"
ELSE:
    DEF LONG_t = "long long"



import time, sys
import cython
import numpy as np
cimport numpy as np

from cpython.array cimport array, clone

from libc.math cimport sqrt




import scipy.sparse as sps
from Recommenders.Recommender_utils import check_matrix
from Utils.seconds_to_biggest_unit import seconds_to_biggest_unit

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
@cython.nonecheck(False)
@cython.cdivision(True)
@cython.overflowcheck(False)
cdef class Compute_Similarity_Cython:

    cdef int TopK
    cdef long n_columns, n_rows , end_col

    cdef double[:] this_item_weights
    cdef int[:] this_item_weights_mask, this_item_weights_id
    cdef int this_item_weights_counter

    cdef int[:] user_to_item_row_ptr, user_to_item_cols
    cdef int[:] item_to_user_rows, item_to_user_col_ptr
    cdef double[:] user_to_item_data, item_to_user_data
    cdef double[:] sum_of_squared, sum_of_squared_to_1_minus_alpha, sum_of_squared_to_alpha
    cdef int shrink, normalize, adjusted_cosine, pearson_correlation, tanimoto_coefficient, asymmetric_cosine, dice_coefficient, tversky_coefficient
    cdef float asymmetric_alpha, tversky_alpha, tversky_beta

    cdef int use_row_weights
    cdef double[:] row_weights

    cdef double[:,:] W_dense

    def __init__(self, dataMatrix, topK = 100, shrink=0, normalize = True,
                 asymmetric_alpha = 0.5, tversky_alpha = 1.0, tversky_beta = 1.0,
                 similarity = "cosine", row_weights = None):
        """
        Computes the cosine similarity on the columns of dataMatrix
        If it is computed on URM=|users|x|items|, pass the URM as is.
        If it is computed on ICM=|items|x|features|, pass the ICM transposed.
        :param dataMatrix:
        :param topK:
        :param shrink:
        :param normalize:           If True divide the dot product by the product of the norms
        :param row_weights:         Multiply the values in each row by a specified value. Array
        :param asymmetric_alpha     Coefficient alpha for the asymmetric cosine
        :param similarity:  "cosine"        computes Cosine similarity
                            "adjusted"      computes Adjusted Cosine, removing the average of the users
                            "asymmetric"    computes Asymmetric Cosine
                            "pearson"       computes Pearson Correlation, removing the average of the items
                            "jaccard"       computes Jaccard similarity for binary interactions using Tanimoto
                            "dice"          computes Dice similarity for binary interactions
                            "tversky"       computes Tversky similarity for binary interactions
                            "tanimoto"      computes Tanimoto coefficient for binary interactions
        """
        """
        Asymmetric Cosine as described in:
        Aiolli, F. (2013, October). Efficient top-n recommendation for very large scale binary rated datasets. In Proceedings of the 7th ACM conference on Recommender systems (pp. 273-280). ACM.

        """

        super(Compute_Similarity_Cython, self).__init__()

        self.n_columns = dataMatrix.shape[1]
        self.n_rows = dataMatrix.shape[0]
        self.shrink = shrink
        self.normalize = normalize
        self.asymmetric_alpha = asymmetric_alpha
        self.tversky_alpha = tversky_alpha
        self.tversky_beta = tversky_beta

        self.adjusted_cosine = False
        self.asymmetric_cosine = False
        self.pearson_correlation = False
        self.tanimoto_coefficient = False
        self.dice_coefficient = False
        self.tversky_coefficient = False

        if similarity == "adjusted":
            self.adjusted_cosine = True
        elif similarity == "asymmetric":
            self.asymmetric_cosine = True
        elif similarity == "pearson":
            self.pearson_correlation = True
        elif similarity == "jaccard" or similarity == "tanimoto":
            self.tanimoto_coefficient = True
            # Tanimoto has a specific kind of normalization
            self.normalize = False

        elif similarity == "dice":
            self.dice_coefficient = True
            self.normalize = False

        elif similarity == "tversky":
            self.tversky_coefficient = True
            self.normalize = False

        elif similarity == "cosine":
            pass
        else:
            raise ValueError("Cosine_Similarity: value for parameter 'mode' not recognized."
                             " Allowed values are: 'cosine', 'pearson', 'adjusted', 'asymmetric', 'jaccard', 'tanimoto',"
                             "dice, tversky."
                             " Passed value was '{}'".format(similarity))


        self.TopK = min(topK, self.n_columns)
        self.this_item_weights = np.zeros(self.n_columns, dtype=np.float64)
        self.this_item_weights_id = np.zeros(self.n_columns, dtype=np.int32)
        self.this_item_weights_mask = np.zeros(self.n_columns, dtype=np.int32)
        self.this_item_weights_counter = 0

        # Copy data to avoid altering the original object
        dataMatrix = dataMatrix.copy()





        if self.adjusted_cosine:
            dataMatrix = self.applyAdjustedCosine(dataMatrix)
        elif self.pearson_correlation:
            dataMatrix = self.applyPearsonCorrelation(dataMatrix)
        elif self.tanimoto_coefficient or self.dice_coefficient or self.tversky_coefficient:
            dataMatrix = self.useOnlyBooleanInteractions(dataMatrix)



        # Compute sum of squared values to be used in normalization
        self.sum_of_squared = np.array(dataMatrix.power(2).sum(axis=0), dtype=np.float64).ravel()

        # Tanimoto does not require the square root to be applied
        if not (self.tanimoto_coefficient or self.dice_coefficient or self.tversky_coefficient):
            self.sum_of_squared = np.sqrt(self.sum_of_squared)

        if self.asymmetric_cosine:
            # The power of 1-alpha may be negative so add small value to ensure values are non-zeros
            sum_of_squared_np = np.array(self.sum_of_squared) + 1e-6
            self.sum_of_squared_to_alpha = np.power(sum_of_squared_np, 2 * self.asymmetric_alpha)
            self.sum_of_squared_to_1_minus_alpha = np.power(sum_of_squared_np, 2 * (1 - self.asymmetric_alpha))

        # Apply weight after sum_of_squared has been computed but before the matrix is
        # split in its inner data structures
        self.use_row_weights = False

        if row_weights is not None:

            if dataMatrix.shape[0] != len(row_weights):
                raise ValueError("Cosine_Similarity: provided row_weights and dataMatrix have different number of rows."
                                 "Row_weights has {} rows, dataMatrix has {}.".format(len(row_weights), dataMatrix.shape[0]))


            self.use_row_weights = True
            self.row_weights = np.array(row_weights, dtype=np.float64)





        dataMatrix = check_matrix(dataMatrix, 'csr')

        self.user_to_item_row_ptr = dataMatrix.indptr
        self.user_to_item_cols = dataMatrix.indices
        self.user_to_item_data = np.array(dataMatrix.data, dtype=np.float64)

        dataMatrix = check_matrix(dataMatrix, 'csc')
        self.item_to_user_rows = dataMatrix.indices
        self.item_to_user_col_ptr = dataMatrix.indptr
        self.item_to_user_data = np.array(dataMatrix.data, dtype=np.float64)




        if self.TopK == 0:
            self.W_dense = np.zeros((self.n_columns,self.n_columns))





    cdef useOnlyBooleanInteractions(self, dataMatrix):
        """
        Set to 1 all data points
        :return:
        """

        cdef long index

        for index in range(len(dataMatrix.data)):
            dataMatrix.data[index] = 1

        return dataMatrix



    cdef applyPearsonCorrelation(self, dataMatrix):
        """
        Remove from every data point the average for the corresponding column
        :return:
        """

        cdef double[:] sumPerCol
        cdef int[:] interactionsPerCol
        cdef long colIndex, innerIndex, start_pos, end_pos
        cdef double colAverage


        dataMatrix = check_matrix(dataMatrix, 'csc')


        sumPerCol = np.array(dataMatrix.sum(axis=0), dtype=np.float64).ravel()
        interactionsPerCol = np.diff(dataMatrix.indptr)


        #Remove for every row the corresponding average
        for colIndex in range(self.n_columns):

            if interactionsPerCol[colIndex]>0:

                colAverage = sumPerCol[colIndex] / interactionsPerCol[colIndex]

                start_pos = dataMatrix.indptr[colIndex]
                end_pos = dataMatrix.indptr[colIndex+1]

                innerIndex = start_pos

                while innerIndex < end_pos:

                    dataMatrix.data[innerIndex] -= colAverage
                    innerIndex+=1


        return dataMatrix



    cdef applyAdjustedCosine(self, dataMatrix):
        """
        Remove from every data point the average for the corresponding row
        :return:
        """

        cdef double[:] sumPerRow
        cdef int[:] interactionsPerRow
        cdef long rowIndex, innerIndex, start_pos, end_pos
        cdef double rowAverage

        dataMatrix = check_matrix(dataMatrix, 'csr')

        sumPerRow = np.array(dataMatrix.sum(axis=1), dtype=np.float64).ravel()
        interactionsPerRow = np.diff(dataMatrix.indptr)


        #Remove for every row the corresponding average
        for rowIndex in range(self.n_rows):

            if interactionsPerRow[rowIndex]>0:

                rowAverage = sumPerRow[rowIndex] / interactionsPerRow[rowIndex]

                start_pos = dataMatrix.indptr[rowIndex]
                end_pos = dataMatrix.indptr[rowIndex+1]

                innerIndex = start_pos

                while innerIndex < end_pos:

                    dataMatrix.data[innerIndex] -= rowAverage
                    innerIndex+=1


        return dataMatrix





    cdef int[:] getUsersThatRatedItem(self, long item_id):
        return self.item_to_user_rows[self.item_to_user_col_ptr[item_id]:self.item_to_user_col_ptr[item_id+1]]

    cdef int[:] getItemsRatedByUser(self, long user_id):
        return self.user_to_item_cols[self.user_to_item_row_ptr[user_id]:self.user_to_item_row_ptr[user_id+1]]




    cdef computeItemSimilarities(self, long item_id_input):
        """
        For every item the cosine similarity against other items depends on whether they have users in common. The more
        common users the higher the similarity.

        The basic implementation is:
        - Select the first item
        - Loop through all other items
        -- Given the two items, get the users they have in common
        -- Update the similarity for all common users

        That is VERY slow due to the common user part, in which a long data structure is looped multiple times.

        A better way is to use the data structure in a different way skipping the search part, getting directly the
        information we need.

        The implementation here used is:
        - Select the first item
        - Initialize a zero valued array for the similarities
        - Get the users who rated the first item
        - Loop through the users
        -- Given a user, get the items he rated (second item)
        -- Update the similarity of the items he rated


        """

        # Create template used to initialize an array with zeros
        # Much faster than np.zeros(self.n_columns)
        #cdef array[double] template_zero = array('d')
        #cdef array[double] result = clone(template_zero, self.n_columns, zero=True)


        cdef long user_index, user_id, item_index, item_id, item_id_second

        cdef int[:] users_that_rated_item = self.getUsersThatRatedItem(item_id_input)
        cdef int[:] items_rated_by_user

        cdef double rating_item_input, rating_item_second, row_weight

        # Clean previous item
        for item_index in range(self.this_item_weights_counter):
            item_id = self.this_item_weights_id[item_index]
            self.this_item_weights_mask[item_id] = False
            self.this_item_weights[item_id] = 0.0

        self.this_item_weights_counter = 0



        # Get users that rated the items
        for user_index in range(len(users_that_rated_item)):

            user_id = users_that_rated_item[user_index]
            rating_item_input = self.item_to_user_data[self.item_to_user_col_ptr[item_id_input]+user_index]

            if self.use_row_weights:
                row_weight = self.row_weights[user_id]
            else:
                row_weight = 1.0

            # Get all items rated by that user
            items_rated_by_user = self.getItemsRatedByUser(user_id)

            for item_index in range(len(items_rated_by_user)):

                item_id_second = items_rated_by_user[item_index]
                if item_id_second>=self.end_col:
                    continue
                # Do not compute the similarity on the diagonal
                if item_id_second != item_id_input:
                    # Increment similairty
                    rating_item_second = self.user_to_item_data[self.user_to_item_row_ptr[user_id]+item_index]

                    self.this_item_weights[item_id_second] += rating_item_input*rating_item_second*row_weight


                    # Update global data structure
                    if not self.this_item_weights_mask[item_id_second]:

                        self.this_item_weights_mask[item_id_second] = True
                        self.this_item_weights_id[self.this_item_weights_counter] = item_id_second
                        self.this_item_weights_counter += 1




    def compute_similarity(self, start_col=None, end_col=None):
        """
        Compute the similarity for the given dataset
        :param self:
        :param start_col: column to begin with
        :param end_col: column to stop before, end_col is excluded
        :return:
        """

        cdef int print_block_size = 500

        cdef int item_index, inner_item_index, item_id, local_topK
        cdef long long topK_item_index

        cdef long long[:] top_k_idx

        # Declare numpy data type to use vector indexing and simplify the topK selection code
        cdef np.ndarray[LONG_t, ndim=1] top_k_partition, top_k_partition_sorting
        cdef np.ndarray[np.float64_t, ndim=1] this_item_weights_np = np.zeros(self.n_columns, dtype=np.float64)
        #cdef double[:] this_item_weights

        cdef long processed_items = 0

        # Data structure to incrementally build sparse matrix
        # Preinitialize max possible length
        cdef unsigned long long max_cells = <long long> self.n_columns*self.TopK
        cdef double[:] values = np.zeros((max_cells))
        cdef int[:] rows = np.zeros((max_cells,), dtype=np.int32)
        cdef int[:] cols = np.zeros((max_cells,), dtype=np.int32)
        cdef long sparse_data_pointer = 0

        cdef int start_col_local = 0, end_col_local = self.n_columns

        cdef array[double] template_zero = array('d')
        if end_col is None:
            self.end_col=self.n_columns
            start_col=0
        else:
            self.end_col=end_col
            start_col=end_col

        end_col=None

        if start_col is not None and start_col>0 and start_col<self.n_columns:
            start_col_local = start_col

        if end_col is not None and end_col>start_col_local and end_col<self.n_columns:
            end_col_local = end_col







        start_time = time.time()
        last_print_time = start_time

        item_index = start_col_local

        # Compute all similarities for each item
        while item_index < end_col_local:

            processed_items += 1

            # Computed similarities go in self.this_item_weights
            self.computeItemSimilarities(item_index)


            # Apply normalization and shrinkage, ensure denominator != 0
            if self.normalize:
                for inner_item_index in range(self.n_columns):

                    if self.asymmetric_cosine:
                        self.this_item_weights[inner_item_index] /= self.sum_of_squared_to_alpha[item_index] * self.sum_of_squared_to_1_minus_alpha[inner_item_index]\
                                                             + self.shrink + 1e-6

                    else:
                        self.this_item_weights[inner_item_index] /= self.sum_of_squared[item_index] * self.sum_of_squared[inner_item_index]\
                                                             + self.shrink + 1e-6

            # Apply the specific denominator for Tanimoto
            elif self.tanimoto_coefficient:
                for inner_item_index in range(self.n_columns):
                    self.this_item_weights[inner_item_index] /= self.sum_of_squared[item_index] + self.sum_of_squared[inner_item_index] -\
                                                         self.this_item_weights[inner_item_index] + self.shrink + 1e-6

            elif self.dice_coefficient:
                for inner_item_index in range(self.n_columns):
                    self.this_item_weights[inner_item_index] /= self.sum_of_squared[item_index] + self.sum_of_squared[inner_item_index] +\
                                                         self.shrink + 1e-6

            elif self.tversky_coefficient:
                for inner_item_index in range(self.n_columns):
                    self.this_item_weights[inner_item_index] /= self.this_item_weights[inner_item_index] + \
                                                              (self.sum_of_squared[item_index]-self.this_item_weights[inner_item_index])*self.tversky_alpha + \
                                                              (self.sum_of_squared[inner_item_index]-self.this_item_weights[inner_item_index])*self.tversky_beta +\
                                                              self.shrink + 1e-6

            elif self.shrink != 0:
                for inner_item_index in range(self.n_columns):
                    self.this_item_weights[inner_item_index] /= self.shrink


            if self.TopK == 0:

                for inner_item_index in range(self.n_columns):
                    self.W_dense[inner_item_index,item_index] = self.this_item_weights[inner_item_index]

            else:

                # Sort indices and select TopK
                # Using numpy implies some overhead, unfortunately the plain C qsort function is even slower
                #top_k_idx = np.argsort(this_item_weights) [-self.TopK:]

                # Sorting is done in three steps. Faster then plain np.argsort for higher number of items
                # because we avoid sorting elements we already know we don't care about
                # - Partition the data to extract the set of TopK items, this set is unsorted
                # - Sort only the TopK items, discarding the rest
                # - Get the original item index
                #



                #this_item_weights_np = clone(template_zero, self.this_item_weights_counter, zero=False)
                for inner_item_index in range(self.n_columns):
                    this_item_weights_np[inner_item_index] = 0.0


                # Add weights in the same ordering as the self.this_item_weights_id data structure
                for inner_item_index in range(self.this_item_weights_counter):
                    item_id = self.this_item_weights_id[inner_item_index]
                    this_item_weights_np[inner_item_index] = - self.this_item_weights[item_id]

                #this_item_weights_np[start_col:]=0.0

                local_topK = min([self.TopK, self.this_item_weights_counter])

                # Get the unordered set of topK items
                top_k_partition = np.argpartition(this_item_weights_np, local_topK-1)[0:local_topK]
                # Sort only the elements in the partition
                top_k_partition_sorting = np.argsort(this_item_weights_np[top_k_partition])
                # Get original index
                top_k_idx = top_k_partition[top_k_partition_sorting]



                # Incrementally build sparse matrix, do not add zeros
                for inner_item_index in range(len(top_k_idx)):

                    topK_item_index = top_k_idx[inner_item_index]

                    item_id = self.this_item_weights_id[topK_item_index]

                    if self.this_item_weights[item_id] != 0.0:

                        values[sparse_data_pointer] = self.this_item_weights[item_id]
                        rows[sparse_data_pointer] = item_id
                        cols[sparse_data_pointer] = item_index

                        sparse_data_pointer += 1


            item_index += 1


            if processed_items % print_block_size==0 or processed_items==end_col_local:

                current_time = time.time()

                # Set block size to the number of items necessary in order to print every 300 seconds
                if current_time - start_time != 0:
                    items_per_sec = processed_items/(current_time - start_time)
                else:
                    items_per_sec = 1

                print_block_size = int(items_per_sec*60)

                if current_time - last_print_time > 60  or processed_items==end_col_local:
                    new_time_value, new_time_unit = seconds_to_biggest_unit(time.time() - start_time)

                    print("Similarity column {} ({:4.1f}%), {:.2f} column/sec. Elapsed time {:.2f} {}".format(
                        processed_items, processed_items*1.0/(end_col_local-start_col_local)*100, items_per_sec, new_time_value, new_time_unit))

                    last_print_time = current_time

                    sys.stdout.flush()
                    sys.stderr.flush()

        # End while on columns


        if self.TopK == 0:

            return np.array(self.W_dense)

        else:

            values = np.array(values[0:sparse_data_pointer])
            rows = np.array(rows[0:sparse_data_pointer])
            cols = np.array(cols[0:sparse_data_pointer])

            W_sparse = sps.csr_matrix((values, (rows, cols)),
                                    shape=(self.n_columns, self.n_columns),
                                    dtype=np.float32)

            return W_sparse

Overwriting ./Recommenders/Similarity/Cython/Compute_Similarity_Cython.pyx


In [9]:
%%writefile ./Recommenders/KNN/UserKNNCFRecommender.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on 23/10/17

@author: Maurizio Ferrari Dacrema
"""

from Recommenders.Recommender_utils import check_matrix
from Recommenders.BaseSimilarityMatrixRecommender import BaseUserSimilarityMatrixRecommender

from Recommenders.IR_feature_weighting import okapi_BM_25, TF_IDF
import numpy as np

from Recommenders.Similarity.Compute_Similarity import Compute_Similarity


class UserKNNCFRecommender(BaseUserSimilarityMatrixRecommender):
    """ UserKNN recommender"""

    RECOMMENDER_NAME = "UserKNNCFRecommender"

    FEATURE_WEIGHTING_VALUES = ["BM25", "TF-IDF", "none"]


    def __init__(self, URM_train, verbose = True):
        super(UserKNNCFRecommender, self).__init__(URM_train, verbose = verbose)


    def fit(self, feature_weighting = 'none',start_user=None, **similarity_args):

        self.topK = similarity_args["topK"]
        self.shrink = similarity_args["shrink"]
        if feature_weighting not in self.FEATURE_WEIGHTING_VALUES:
            raise ValueError("Value for 'feature_weighting' not recognized. Acceptable values are {}, provided was '{}'".format(self.FEATURE_WEIGHTING_VALUES, feature_weighting))

        similarity = Compute_Similarity(self.URM_train.T, **similarity_args)

        self.W_sparse = similarity.compute_similarity(end_col=start_user)
        self.W_sparse = check_matrix(self.W_sparse, format='csr')

    def recommend(self, user_id_array, cutoff = None, remove_seen_flag=True, items_to_compute = None,
                  remove_top_pop_flag = False, remove_custom_items_flag = False, return_scores = False):
        if np.isscalar(user_id_array):
            user_id_array = np.atleast_1d(user_id_array)
            single_user = True
        else:
            single_user = False
        user_id_array=user_id_array.copy()-self.offset
        #print(user_id_array)
        return super(UserKNNCFRecommender, self).recommend(user_id_array,cutoff,remove_seen_flag,items_to_compute,remove_top_pop_flag,remove_custom_items_flag ,return_scores)


Overwriting ./Recommenders/KNN/UserKNNCFRecommender.py


In [10]:
%%writefile ./Recommenders/NonPersonalizedRecommender.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: Massimo Quadrana
"""

import numpy as np
from Recommenders.BaseRecommender import BaseRecommender
from Recommenders.Recommender_utils import check_matrix
from Recommenders.DataIO import DataIO


class TopPop(BaseRecommender):
    """Top Popular recommender"""

    RECOMMENDER_NAME = "TopPopRecommender"

    def __init__(self, URM_train):
        super(TopPop, self).__init__(URM_train)


    def fit(self,offset):

        # Use np.ediff1d and NOT a sum done over the rows as there might be values other than 0/1
        self.item_pop = np.ediff1d(self.URM_train.tocsc().indptr)
        self.n_items = self.URM_train.shape[1]
        self.offset=offset


    def _compute_item_score(self, user_id_array, items_to_compute = None):

        # Create a single (n_items, ) array with the item score, then copy it for every user

        if items_to_compute is not None:
            item_pop_to_copy = - np.ones(self.n_items, dtype=np.float32)*np.inf
            item_pop_to_copy[items_to_compute] = self.item_pop[items_to_compute].copy()
        else:
            item_pop_to_copy = self.item_pop.copy()

        item_scores = np.array(item_pop_to_copy, dtype=np.float32).reshape((1, -1))
        item_scores = np.repeat(item_scores, len(user_id_array), axis = 0)

        return item_scores


    def save_model(self, folder_path, file_name = None):

        if file_name is None:
            file_name = self.RECOMMENDER_NAME

        self._print("Saving model in file '{}'".format(folder_path + file_name))

        data_dict_to_save = {"item_pop": self.item_pop}

        dataIO = DataIO(folder_path=folder_path)
        dataIO.save_data(file_name=file_name, data_dict_to_save = data_dict_to_save)

        self._print("Saving complete")
    def recommend(self, user_id_array, cutoff = None, remove_seen_flag=True, items_to_compute = None,
                  remove_top_pop_flag = False, remove_custom_items_flag = False, return_scores = False):
        if np.isscalar(user_id_array):
            user_id_array = np.atleast_1d(user_id_array)
            single_user = True
        else:
            single_user = False
        user_id_array=user_id_array.copy()-self.offset
        #print(user_id_array)
        return super(TopPop, self).recommend(user_id_array,cutoff,remove_seen_flag,items_to_compute,remove_top_pop_flag,remove_custom_items_flag ,return_scores)



class GlobalEffects(BaseRecommender):
    """GlobalEffects"""

    RECOMMENDER_NAME = "GlobalEffectsRecommender"

    def __init__(self, URM_train):
        super(GlobalEffects, self).__init__(URM_train)


    def fit(self, lambda_user=10, lambda_item=25):

        self.lambda_user = lambda_user
        self.lambda_item = lambda_item
        self.n_items = self.URM_train.shape[1]

        # convert to csc matrix for faster column-wise sum
        self.URM_train = check_matrix(self.URM_train, 'csc', dtype=np.float32)

        # 1) global average
        self.mu = self.URM_train.data.sum(dtype=np.float32) / self.URM_train.nnz

        # 2) item average bias
        # compute the number of non-zero elements for each column
        # it is equivalent to:
        # col_nnz = X.indptr[1:] - X.indptr[:-1]
        # and it is **much faster** than
        # col_nnz = (X != 0).sum(axis=0)
        col_nnz = np.ediff1d(self.URM_train.indptr)

        URM_train_unbiased = self.URM_train.copy()
        URM_train_unbiased.data -= self.mu
        self.item_bias = URM_train_unbiased.sum(axis=0) / (col_nnz + self.lambda_item)
        self.item_bias = np.asarray(self.item_bias).ravel()  # converts 2-d matrix to 1-d array without anycopy
        self.item_bias[col_nnz==0] = -np.inf

        # 3) user average bias
        # NOTE: the user bias is *useless* for the sake of ranking items. We just show it here for educational purposes.

        # first subtract the item biases from each column
        # then repeat each element of the item bias vector a number of times equal to col_nnz
        # and subtract it from the data vector
        URM_train_unbiased.data -= np.repeat(self.item_bias, col_nnz)

        # now convert the csc matrix to csr for efficient row-wise computation
        URM_train_unbiased_csr = URM_train_unbiased.tocsr()
        row_nnz = np.ediff1d(URM_train_unbiased_csr.indptr)
        # finally, let's compute the bias
        self.user_bias = URM_train_unbiased_csr.sum(axis=1).ravel() / (row_nnz + self.lambda_user)
        self.user_bias = np.asarray(self.user_bias).ravel()
        self.user_bias[row_nnz==0] = -np.inf

        self.URM_train = check_matrix(self.URM_train, 'csr', dtype=np.float32)




    def _compute_item_score(self, user_id_array, items_to_compute=None):

        # Create a single (n_items, ) array with the item score, then copy it for every user
        # 4) Compute the item ranking by using the item bias only
        # the global average and user bias won't change the ranking, so there is no need to use them

        if items_to_compute is not None:
            item_bias_to_copy = - np.ones(self.n_items, dtype=np.float32)*np.inf
            item_bias_to_copy[items_to_compute] = self.item_bias[items_to_compute].copy()
        else:
            item_bias_to_copy = self.item_bias.copy()

        item_scores = np.array(item_bias_to_copy, dtype=np.float).reshape((1, -1))
        item_scores = np.repeat(item_scores, len(user_id_array), axis = 0)

        return item_scores


    def save_model(self, folder_path, file_name = None):

        if file_name is None:
            file_name = self.RECOMMENDER_NAME

        self._print("Saving model in file '{}'".format(folder_path + file_name))

        data_dict_to_save = {"item_bias": self.item_bias}

        dataIO = DataIO(folder_path=folder_path)
        dataIO.save_data(file_name=file_name, data_dict_to_save = data_dict_to_save)

        self._print("Saving complete")



class Random(BaseRecommender):
    """Random recommender"""

    RECOMMENDER_NAME = "RandomRecommender"

    def __init__(self, URM_train):
        super(Random, self).__init__(URM_train)


    def fit(self, random_seed=42):
        np.random.seed(random_seed)
        self.n_items = self.URM_train.shape[1]


    def _compute_item_score(self, user_id_array, items_to_compute = None):

        # Create a random block (len(user_id_array), n_items) array with the item score

        if items_to_compute is not None:
            item_scores = - np.ones((len(user_id_array), self.n_items), dtype=np.float32)*np.inf
            item_scores[:, items_to_compute] = np.random.rand(len(user_id_array), len(items_to_compute))

        else:
            item_scores = np.random.rand(len(user_id_array), self.n_items)

        return item_scores



    def save_model(self, folder_path, file_name = None):

        if file_name is None:
            file_name = self.RECOMMENDER_NAME

        self._print("Saving model in file '{}'".format(folder_path + file_name))

        data_dict_to_save = {}

        dataIO = DataIO(folder_path=folder_path)
        dataIO.save_data(file_name=file_name, data_dict_to_save = data_dict_to_save)

        self._print("Saving complete")


Overwriting ./Recommenders/NonPersonalizedRecommender.py


In [11]:
!python ./run_compile_all_cython.py

run_compile_all_cython: Found 1 Cython files in 1 folders...
run_compile_all_cython: All files will be compiled using your current python environment: '/usr/bin/python3'
Compiling [1/1]: Compute_Similarity_Cython.pyx... 
In file included from [01m[K/usr/local/lib/python3.10/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1929[m[K,
                 from [01m[K/usr/local/lib/python3.10/dist-packages/numpy/core/include/numpy/ndarrayobject.h:12[m[K,
                 from [01m[K/usr/local/lib/python3.10/dist-packages/numpy/core/include/numpy/arrayobject.h:5[m[K,
                 from [01m[KCompute_Similarity_Cython.c:1217[m[K:
      |  [01;35m[K^~~~~~~[m[K
  tree = Parsing.p_module(s, pxd, full_module_name)
Compiling [1/1]: Compute_Similarity_Cython.pyx... PASS

run_compile_all_cython: Compilation finished. SUCCESS.
Compilation log can be found here: './result_cython_compile/log.txt'


In [12]:
%cd ../

/content


In [13]:
%%capture
import numpy as np
import pandas as pd
import os
import scipy.sparse as sps


In [14]:
from Data_manager.split_functions.split_train_validation_random_holdout import *
from Evaluation.Evaluator import *
from Evaluation.metrics import *
from Evaluation.Evaluator import _create_empty_metrics_dict

In [15]:
from Recommenders.MatrixFactorization.IALSRecommender import IALSRecommender
from Recommenders.NonPersonalizedRecommender import TopPop, Random, GlobalEffects
from Recommenders.KNN.UserKNNCFRecommender import UserKNNCFRecommender
from Recommenders.KNN.ItemKNNCFRecommender import ItemKNNCFRecommender
from Recommenders.KNN.ItemKNNCBFRecommender import ItemKNNCBFRecommender
from Recommenders.GraphBased.RP3betaRecommender import RP3betaRecommender
from Recommenders.GraphBased.P3alphaRecommender import P3alphaRecommender

In [16]:
import os, multiprocessing
from functools import partial
import traceback, os
import scipy.sparse as sps

# HITRATE

In [18]:
import pandas as pd
class HITRATEEvaluator(EvaluatorHoldout):
    def __init__(self, URM_test_list, cutoff_list, min_ratings_per_user=1, exclude_seen=True,
                 diversity_object = None,
                 ignore_items = None,
                 ignore_users = None,
                 verbose=True):


        super(HITRATEEvaluator, self).__init__(URM_test_list, cutoff_list,
                                               diversity_object = diversity_object,
                                               min_ratings_per_user =min_ratings_per_user, exclude_seen=exclude_seen,
                                               ignore_items = ignore_items, ignore_users = ignore_users,
                                               verbose = verbose)

    def _run_evaluation_on_selected_users(self, recommender_objects, users_to_evaluate, block_size = None,items_to_compute=None,name="_"):

        if block_size is None:
            # Reduce block size if estimated memory requirement exceeds 4 GB
            block_size = min([4000, int(4*1e9*8/64/self.n_items), len(users_to_evaluate)])


        results_dict = _create_empty_metrics_dict(self.cutoff_list,
                                                  self.n_items, self.n_users,
                                                  recommender_objects[0].get_URM_train(),
                                                  self.URM_test,
                                                  self.ignore_items_ID,
                                                  self.ignore_users_ID,
                                                  self.diversity_object)




        # Start from -block_size to ensure it to be 0 at the first block
        user_batch_start = 0
        user_batch_end = 0

        while user_batch_start < len(users_to_evaluate):

            user_batch_end = user_batch_start + block_size
            user_batch_end = min(user_batch_end, len(users_to_evaluate))

            test_user_batch_array = np.array(users_to_evaluate[user_batch_start:user_batch_end])
            user_batch_start = user_batch_end
            recommended_items_batch_list = None
            # Compute predictions for a batch of users using vectorization, much more efficient than computing it one at a time
            for recommender_object in recommender_objects:
                recommended_items_batch_list_single, score = recommender_object.recommend(test_user_batch_array,
                                                                          remove_seen_flag=self.exclude_seen,
                                                                          cutoff = self.max_cutoff,
                                                                          items_to_compute=items_to_compute,
                                                                          remove_top_pop_flag=False,
                                                                          remove_custom_items_flag=self.ignore_items_flag,
                                                                          return_scores = True
                                                                         )

                if recommended_items_batch_list is None:
                    recommended_items_batch_list = recommended_items_batch_list_single
                    score_sum=score
                else:
                    recommended_items_batch_list=np.hstack((recommended_items_batch_list,recommended_items_batch_list_single))
                    score_sum+=score

            recommended_items_batch_list

            results_dict = self._compute_metrics_on_recommendation_list(test_user_batch_array = test_user_batch_array,
                                                         recommended_items_batch_list = recommended_items_batch_list,
                                                         score_batch=score_sum,
                                                         results_dict = results_dict,name=name)


        return results_dict

    def _compute_metrics_on_recommendation_list(self, test_user_batch_array, recommended_items_batch_list,score_batch,results_dict,name="_"):

        assert len(recommended_items_batch_list) == len(test_user_batch_array), "{}: recommended_items_batch_list contained recommendations for {} users, expected was {}".format(
            self.EVALUATOR_NAME, len(recommended_items_batch_list), len(test_user_batch_array))

        temp=pd.DataFrame()
        col_session=None
        col_items=None
        col_score=None
        col_max_score=None
        length=0
        # Compute recommendation quality for each user in batch
        for batch_user_index,recommended_items in enumerate(recommended_items_batch_list):
            score=score_batch[batch_user_index]
            recommended_items=np.unique(recommended_items)
            scores_top=score[recommended_items]

            self.len_unique+=len(recommended_items)
            test_user = test_user_batch_array[batch_user_index]
            session_id= np.repeat(test_user,len(recommended_items))
            maximum_score= np.repeat(np.max(score),len(recommended_items))
            if col_session is None:
                col_session=session_id
                col_items=recommended_items
                col_score=scores_top/np.max(score)
                col_max_score=maximum_score
            else:
                col_session=np.append(col_session,session_id)
                col_items=np.append(col_items,recommended_items)
                col_score=np.append(col_score,scores_top/np.max(score))
                col_max_score=np.append(col_max_score,maximum_score)
            #print(recommended_items[-4:])
            #print(scores_top[-4:])
            #print(session_id[-4:])
            relevant_items = self.get_user_relevant_items(test_user)

            # Being the URM CSR, the indices are the non-zero column indexes

            is_relevant = np.in1d(recommended_items, relevant_items, assume_unique=True)

            self._n_users_evaluated += 1

            cutoff = self.max_cutoff

            results_current_cutoff = results_dict[cutoff]
            results_current_cutoff[EvaluatorMetrics.HIT_RATE.value].add_recommendations(is_relevant)
        if time.time() - self._start_time_print > 60 or self._n_users_evaluated==len(self.users_to_evaluate):

            elapsed_time = time.time()-self._start_time
            new_time_value, new_time_unit = seconds_to_biggest_unit(elapsed_time)

            self._print("Processed {} ({:4.1f}%) in {:.2f} {}. Users per second: {:.0f}".format(
                          self._n_users_evaluated,
                          100.0* float(self._n_users_evaluated)/len(self.users_to_evaluate),
                          new_time_value, new_time_unit,
                          float(self._n_users_evaluated)/elapsed_time if elapsed_time>0.0 else np.nan))

            sys.stdout.flush()
            sys.stderr.flush()

            self._start_time_print = time.time()
        temp['Session_Id']=col_session
        temp['Item_ID']=col_items
        temp[f'Score_{name}']=col_score
        temp[f'Max_Score_{name}']=col_max_score
        self.dataset_for_ranker=pd.concat([self.dataset_for_ranker, temp], ignore_index=True)
        #print(temp.tail(4))
        #print(self.dataset_for_ranker.tail(4))

        return results_dict
    def evaluateRecommender(self, recommender_objects,block_size=None,items_to_compute=None,name="_"):
        """
        :param recommender_object: the trained recommender object, a BaseRecommender subclass
        :param URM_test_list: list of URMs to test the recommender against, or a single URM object
        :param cutoff_list: list of cutoffs to be use to report the scores, or a single cutoff
        :return results_df: dataframe with index the cutoff and columns the metric
        :return results_run_string: printable result string
        """
        self.len_unique=0
        if self.ignore_items_flag:
            for recommender_object in recommender_objects:
                recommender_object.set_items_to_ignore(self.ignore_items_ID)

        self._start_time = time.time()
        self._start_time_print = time.time()
        self._n_users_evaluated = 0
        self.dataset_for_ranker=pd.DataFrame(columns=['Session_Id', 'Item_ID', f'Score_{name}',f'Max_Score_{name}'])
        results_dict = self._run_evaluation_on_selected_users(recommender_objects, self.users_to_evaluate,block_size=block_size,items_to_compute=items_to_compute,name=name)
        self.dataset_for_ranker.to_csv(f"/drive/MyDrive/recsys2022-main/dataset/candidates/traditional_recs/train/{name}.csv",index=False)

        if self._n_users_evaluated > 0:

            for cutoff in self.cutoff_list:
                results_current_cutoff = results_dict[cutoff]

                for key in results_current_cutoff.keys():
                    if key!="HIT_RATE":
                        continue
                    value = results_current_cutoff[key]


                    results_current_cutoff[key] = value.get_metric_value()*self._n_users_evaluated

                    #results_current_cutoff[key] = value/self._n_users_evaluated


        else:
            self._print("WARNING: No users had a sufficient number of relevant items")

        if self.ignore_items_flag:
            recommender_object.reset_items_to_ignore()

        results_df = pd.DataFrame(columns=results_dict[self.cutoff_list[0]].keys(),
                                  index=self.cutoff_list)
        results_df.index.rename("cutoff", inplace = True)

        for cutoff in results_dict.keys():
            results_df.loc[cutoff] = results_dict[cutoff]
        print("average number of different items for each session",self.len_unique/self._n_users_evaluated)
        #results_run_string = get_result_string_df(results_df)

        return results_df.loc[self.cutoff_list[0]]["HIT_RATE"]

In [19]:
from collections import defaultdict
from tqdm import tqdm
import numpy as np

from joblib import Parallel, delayed


In [20]:
import time

In [24]:
def get_ICM(files_directory="drive/MyDrive/recsys2022-main/dataset/processed_data/"):
    df_icm = pd.read_csv(filepath_or_buffer=os.path.join(files_directory, 'simplified_features_and_categories_30.csv'), sep=',', header=0)

    item_id_list = df_icm['item_id'].values
    feat_id_list = df_icm['feature_idx'].values
    rating_id_list = np.ones_like(feat_id_list)
    ICM_matrix = sps.csr_matrix((rating_id_list, (item_id_list, feat_id_list)))
    return ICM_matrix

In [25]:
URM_train = sps.load_npz("drive/MyDrive/recsys2022-main/dataset/processed_data/URM_WT_train_full.npz")
URM_train.data=np.ones_like(URM_train.data)
URM_valid = sps.load_npz("drive/MyDrive/recsys2022-main/dataset/processed_data/URM_WT_valid_bought.npz").tocsr()

temp= sps.load_npz("drive/MyDrive/recsys2022-main/dataset/processed_data/URM_WT_valid_seen.npz")
URM_after_train = temp+URM_train

ICM_all = get_ICM()
if just_checking_integrity:
    URM_train = URM_train[boundary:boundary_after]
    URM_valid = URM_valid[boundary:boundary_after]
    URM_after_train = URM_after_train[boundary:boundary_after]




params_ICF=[
    {"topK":334,"shrink":396, "similarity":'cosine', "feature_weighting" : 'none' ,           "power":1.2561991065561426, "weight" :0.6393597471969044},
    {"topK":236,"shrink":687, "similarity":'tanimoto', "feature_weighting" : 'none' ,           "power":0.5509167891518838, "weight" :0.23362677165465845},
    {"topK":186,"shrink":699, "similarity":'dice', "feature_weighting" : 'none' ,           "power":0.5454807130915313, "weight" :0.37150060696200815},
    {"topK":236,"shrink":700, "similarity":'jaccard', "feature_weighting" : 'none' ,           "power":0.5224371929702275, "weight" :0.4532502310580744},
    {"topK":194,"shrink":45, "similarity":'adjusted', "feature_weighting" : 'none' ,           "power":1.3923780294419528, "weight" :0.747093242387402},
    {"topK":348,"shrink":661, "similarity":'asymmetric', "feature_weighting" : 'none' ,           "power":1.2462156230754484, "weight" :0.6410862143760282, "asymmetric_alpha" :0.6586444014201189},
    {"topK":587,"shrink":675, "similarity":'tversky', "feature_weighting" : 'none' ,           "power":1.894411025512657, "weight" :0.5871629850378687, "tversky_alpha" : 0.0022760544555811006,"tversky_beta" : 0.4049955813941767},
    {"topK":299,"shrink":674, "similarity":'pearson', "feature_weighting" : 'none' ,           "power":1.892818062582165, "weight" :0.5806120640896252}
]
params_graph=[
{"topK":760 , "alpha":0.35376041951192866  , "implicit":False , "power":1.6642347425902089, "weight":0.23322573709955957},
{"topK":1434 , "alpha":0.37391558455295765, "beta":0.13809420287767862  , "implicit":True , "power":1.6030310290761034, "weight":0.733513865097587}
]
params_ICBF=[
    {"topK":124,"shrink":9, "similarity":'tversky', "feature_weighting" : 'TF-IDF' ,           "power":1.861087725409631,  "tversky_alpha" : 0.7689675774356587,"tversky_beta" : 0.9052725399846582},
    {"topK":127,"shrink":79, "similarity":'asymmetric', "feature_weighting" : 'none' ,           "power":1.8699713052217346,  "asymmetric_alpha" : 0.7611612156193225},
    {"topK":218,"shrink":773, "similarity":'adjusted', "feature_weighting" : 'BM25' ,           "power":0.18601468383405495},
    {"topK":127,"shrink":19, "similarity":'cosine', "feature_weighting" : 'none' ,           "power":1.8849223054440538},
    {"topK":492,"shrink":348, "similarity":'pearson', "feature_weighting" : 'TF-IDF' ,           "power":1.467718817406988},
    {"topK":139,"shrink":3, "similarity":'dice', "feature_weighting" : 'TF-IDF' ,           "power":1.8969043493328803},
    {"topK":153,"shrink":5, "similarity":'jaccard', "feature_weighting" : 'BM25' ,           "power":1.8125421495255112},
    {"topK":152,"shrink":4, "similarity":'tanimoto', "feature_weighting" : 'BM25' ,           "power":1.8224861269207941}
]
params_UCF=[
    {"topK":1184,"shrink":386, "similarity":'cosine',           "power":0.1327615452747038,"split" :809438, "keep_dup" :False},
{"topK":1070,"shrink":15, "similarity":'tanimoto',           "power":0.845380696129735,"split" :813333, "keep_dup" :False},
{"topK":1751,"shrink":5, "similarity":'dice',           "power":0.6757609944671665,"split" :768495, "keep_dup" :False},
{"topK":1624,"shrink":16, "similarity":'jaccard',           "power":1.4426671932509196,"split" :765230, "keep_dup" :False},
{"topK":5123,"shrink":129, "similarity":'adjusted',           "power":0.12005038535002555,"split" :760859, "keep_dup" :True},
{"topK":4318,"shrink":3, "similarity":'pearson',           "power":0.16747332134568205,"split" :757664, "keep_dup" :True},
{"topK":5918,"shrink":1, "similarity":'asymmetric',           "power":1.443558847537245,"split" :776618, "keep_dup" :False,"asymmetric_alpha":0.33399818943172976},
{"topK":1031,"shrink":3, "similarity":'tversky',           "power":0.9137962256579533,"split" :775536, "keep_dup" :False,"tversky_alpha":0.5927402991066659,"tversky_beta":0.9129868678836787}
]

params=params_UCF+params_ICBF+params_ICF+params_graph

use_stacking=[False]*16+[True]*10
duplicates=[param["keep_dup"] if "keep_dup" in param else False for param in params ]
transpose=[True]*8+[False]*18
get_input=[lambda urm_train,urm,icm: {"URM_train":urm}]*8+[lambda urm_train,urm,icm: {"URM_train":urm,"ICM_train":icm}]*8+[lambda urm_train,urm,icm: {"URM_train":urm_train}]*10

rec_classes=([UserKNNCFRecommender]*8)+([ItemKNNCBFRecommender]*8)+([ItemKNNCFRecommender]*8)+[P3alphaRecommender,RP3betaRecommender]

recs=[]




In [26]:
candidates=np.unique(URM_valid.indices)

In [27]:
recs_top=[]
if just_checking_integrity:
    splits=[0]
else:
    splits=[850000]
for split in splits:
    recTop=TopPop(URM_train[split:])
    recTop.fit(split)
    recs_top.append(np.array([recTop]))


AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

In [None]:
#ev = HITRATEEvaluator(URM_valid,cutoff_list=[200])
#result=ev.evaluateRecommender(recs_top[0],block_size=4000,items_to_compute=candidates,name= "tentative")

In [None]:
if not just_checking_integrity:
    assert URM_valid[-81618,2638]==1,"should be 1, something went wrong with data preparation"

In [None]:
for i, rec in enumerate(rec_classes):
    print(f"-------------------{i}-------------------")
    params_rec=params[i]
    params_recommender={k:v for k,v in params_rec.items()  if k not in ["weight","power","split","keep_dup"]}
    p=params_rec["power"]

    split="split" in params_rec
    X=URM_train.copy()
    X_post=URM_after_train.copy()
    if not duplicates[i]:
        X.data=np.ones_like(X.data)
        X_post.data=np.ones_like(X_post.data)
    if use_stacking[i]:
        w=params_rec["weight"]
        URM_temp=sps.vstack([X*w,ICM_all.T*(1-w)]).tocsr()
        URM_temp_after=sps.vstack([X_post*w,ICM_all.T*(1-w)]).tocsr()
    else:
        URM_temp=X
        URM_temp_after=X_post

    if split:
        split_point=params_rec["split"]
        if just_checking_integrity:
            temp=URM_train.sum(axis=1)
        else:
            URM_temp=URM_temp[split_point:,:]
            temp=URM_train[split_point:,:].sum(axis=1)
        start_valid_user=np.sum(temp!=0)
        params_recommender["start_user"]=start_valid_user
        if not just_checking_integrity:
            URM_temp_after=URM_temp_after[split_point:,:]


    rec=rec(**get_input[i](URM_temp,URM_temp_after,ICM_all))
    rec.fit(**params_recommender)
    if split:
        if not just_checking_integrity:
            rec.offset=split_point
        else:
            rec.offset=0

    rec.W_sparse.data=np.power(rec.W_sparse.data,p)
    if transpose[i]:
        print("transposed")
        rec.W_sparse=rec.W_sparse.T
    rec.W_sparse=rec.W_sparse.astype("float32").tocsr()
    rec.URM_train=URM_temp_after.astype("float32")
    recs.append(rec)
    #rec.save_model("./",file_name =f"{rec.RECOMMENDER_NAME}-{i}.data")
recs=np.array(recs)

-------------------0-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------1-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------2-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------3-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------4-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.




transposed
-------------------5-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------6-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------7-------------------
UserKNNCFRecommender: URM Detected 18686 (78.9%) items with no interactions.
transposed
-------------------8-------------------
ItemKNNCBFRecommender: URM Detected 18686 (78.9%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 1 ( 0.0%) items with no features.
Similarity column 23692 (100.0%), 587.70 column/sec. Elapsed time 40.31 sec
-------------------9-------------------
ItemKNNCBFRecommender: URM Detected 18686 (78.9%) items with no interactions.
ItemKNNCBFRecommender: ICM Detected 1 ( 0.0%) items with no features.
Similarity column 23692 (100.0%), 719.65 column/sec. Elapsed time 32.92 sec
-------------------10-------------------
ItemKNNCBFRecommender: URM Detected 

In [None]:
UCF=recs[:8]
ICBF=recs[8:16]
ICF=recs[16:24]
Graph=recs[24:]

In [None]:
if just_checking_integrity:
    rec_groups=[UCF,ICBF,ICF,Graph,*recs_top]
    names=["UCF_W","ICBF_W","ICF_W","Graph_W","TopPop15_W"]
else:
    rec_groups=[UCF,ICBF,ICF,Graph,*recs_top]
    names=["UCF_W","ICBF","ICF_W","Graph_W","TopPop15_W"]

In [None]:
for i,rec_group in enumerate(rec_groups):
    name=names[i]
    ev = HITRATEEvaluator(URM_valid,cutoff_list=[100])
    result=ev.evaluateRecommender(rec_group,block_size=4000,items_to_compute=candidates,name= name)

EvaluatorHoldout: Ignoring 18382 (73.5%) Users that have less than 1 test interactions




EvaluatorHoldout: Processed 6618 (100.0%) in 26.77 sec. Users per second: 247
average number of different items for each session 211.52689634330613
EvaluatorHoldout: Ignoring 18382 (73.5%) Users that have less than 1 test interactions
EvaluatorHoldout: Processed 6618 (100.0%) in 24.37 sec. Users per second: 272
average number of different items for each session 191.69537624660018
EvaluatorHoldout: Ignoring 18382 (73.5%) Users that have less than 1 test interactions
EvaluatorHoldout: Processed 6618 (100.0%) in 25.09 sec. Users per second: 264
average number of different items for each session 244.41387126019944
EvaluatorHoldout: Ignoring 18382 (73.5%) Users that have less than 1 test interactions
EvaluatorHoldout: Processed 6618 (100.0%) in 8.42 sec. Users per second: 786
average number of different items for each session 123.63145965548505
EvaluatorHoldout: Ignoring 18382 (73.5%) Users that have less than 1 test interactions
EvaluatorHoldout: Processed 6618 (100.0%) in 4.53 sec. Users 