In [15]:
import os
import json
import numpy as np
from scipy.sparse import csr_matrix

In [54]:
class MatrixUtil:
    def __init__(self):
        with open("../datasets/ecom_ratings/customers.json") as f_:
            self.customers = json.load(f_)

        with open("../datasets/ecom_ratings/products.json") as f_:
            self.products = json.load(f_)

        with open("../datasets/ecom_ratings/ratings.json") as f_:
            self.ratings = json.load(f_)
            
        self.product_id_to_idx = {
            product['Id']: idx
            for idx, product in enumerate(self.products)
        }

        self.customer_id_to_idx = {
            customer['Id']: idx
            for idx, customer in enumerate(self.customers)
        }
        
        self.matrix = self._generate_matrix()
        
    def _generate_matrix(self):
        customer_data = []
        product_data = []
        rating_data = []

        for rating in self.ratings:
            rating_data.append(rating['Rate'])
            customer_data.append(self.customer_id_to_idx[rating['CustomerID']])
            product_data.append(self.product_id_to_idx[rating['ProductID']])

        # Rows are customers, columns are products; unrated cells stay 0.
        matrix = csr_matrix((rating_data, (customer_data, product_data)),
                            shape=(len(self.customers), len(self.products))).toarray()
        return matrix
    
    def get_user_vector(self, id_):
        user_idx = self.customer_id_to_idx.get(id_)
        if user_idx is None:
            raise ValueError("Provided customer id does not exist")
        # A user's ratings are a row of the matrix.
        return self.matrix[user_idx]

    def get_item_vector(self, id_):
        item_idx = self.product_id_to_idx.get(id_)
        # Compare against None: index 0 is a valid item.
        if item_idx is None:
            raise ValueError("Provided product id does not exist")
        # An item's ratings are a column of the matrix.
        return self.matrix[:, item_idx]

In [55]:
mutil = MatrixUtil()

In [60]:
vector = mutil.get_user_vector(103954)

In [13]:
# The end goal is often either to:
# 1. Predict a user's rating value of an item
# 2. Predict the top-k items

# Note that the first goal can be used to reach the second (predict a rating for every item, then sort), although this is less efficient.
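
As a rough illustration of deriving the second goal from the first, the sketch below ranks items by predicted rating and skips the ones the user has already rated. The function name `top_k_items` and its inputs are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_items(predicted, already_rated, k=3):
    """Return the indices of the k highest predicted ratings among unrated items."""
    predicted = np.asarray(predicted, dtype=float).copy()
    # Exclude items the user has already rated from the ranking.
    predicted[np.asarray(already_rated, dtype=bool)] = -np.inf
    # Never return more items than there are unrated candidates.
    k = min(k, int(np.isfinite(predicted).sum()))
    order = np.argsort(predicted)[::-1]
    return order[:k].tolist()
```

Sorting every item is what makes this route less efficient than ranking methods that avoid predicting a rating for each item.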

# User-based collaborative filtering

The ratings given by users similar to a target user are used to make recommendations for that user. The similarity-weighted average of those users' ratings on an item k is used as the predicted rating of the target user for item k.

### Issues

1. Users may rate on different scales: one user may tend to rate most items highly, while another may tend to rate most items poorly.
2. Users may have rated different items.

To address the second issue, take the intersection of the sets of items rated by both users and compute the similarity over that intersection.

### Finding similar users using the Pearson correlation coefficient

The similarity metric treats the ratings on items that both the target user and a candidate similar user have rated as a pair of vectors.

(1)
$$ 
\mu_{u} = \frac{\sum_{k \in I_u} r_{uk}}{|I_u|}
$$


(1.1)
$$ 
\text{Sim}(u, v) = \text{Pearson}(u, v) = \frac{\sum_{k \in I_u \cap I_v}(r_{uk} - \mu_u) \cdot (r_{vk} - \mu_v)}{\sqrt{\sum_{k \in I_u \cap I_v}(r_{uk} - \mu_u)^2} \cdot \sqrt{\sum_{k \in I_u \cap I_v}(r_{vk} - \mu_v)^2}}
$$


(1.2)
$$ 
\text{Sim}(u, v) = \text{Cosine}(u, v) = \frac{\sum_{k \in I_u \cap I_v} r_{uk} \cdot r_{vk}}{\sqrt{\sum_{k \in I_u \cap I_v} r_{uk}^2} \cdot \sqrt{\sum_{k \in I_u \cap I_v} r_{vk}^2}}
$$


Where:

$I_u$ is the set of items rated by user u.

$r_{uk}$ is the rating a user u gives to an item k.

$\mu_{u}$ is the mean of the ratings given by user u.

Replacing u with v in the definitions above gives the corresponding quantities for user v.
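
Eqns (1) and (1.1) can be sketched in NumPy as follows. This is a minimal sketch that assumes dense rating vectors in which 0 marks an unrated item (the convention the matrix above uses); the name `pearson_sim` and the choice to return 0.0 when the similarity is undefined are assumptions, not something the formulas prescribe.

```python
import numpy as np

def pearson_sim(u_vector, v_vector):
    """Pearson similarity per eqn (1.1); 0 marks an unrated item (assumption)."""
    u_vector = np.asarray(u_vector, dtype=float)
    v_vector = np.asarray(v_vector, dtype=float)

    # Items rated by both users (the intersection of I_u and I_v).
    common = (u_vector > 0) & (v_vector > 0)
    if not common.any():
        return 0.0  # assumption: no co-rated items means no signal

    # Means over all items each user rated, as in eqn (1).
    mu_u = u_vector[u_vector > 0].mean()
    mu_v = v_vector[v_vector > 0].mean()

    u_centered = u_vector[common] - mu_u
    v_centered = v_vector[common] - mu_v

    denom = np.sqrt((u_centered ** 2).sum()) * np.sqrt((v_centered ** 2).sum())
    if denom == 0:
        return 0.0  # assumption: treat an undefined similarity as 0
    return float((u_centered * v_centered).sum() / denom)
```

For example, `pearson_sim([1, 2, 3, 0], [2, 4, 6, 0])` evaluates to 1.0, because the second user's ratings rise and fall exactly in step with the first user's.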

### Cosine and Pearson Correlation

Cosine similarity measures the angle between two vectors relative to the origin. This means that the vectors [1, 1, 1, 1] (call it A) and [500, 500, 500, 500] (call it B) have an angle of 0 between them despite their different magnitudes. Pearson correlation measures the strength of the linear relationship between two datasets (here, vectors A and B are treated as samples of two separate variables rather than of a single one). For example, if an increase in a value in dataset a tends to come with an increase in the corresponding value in dataset b, the two are positively correlated. This notion of correlation can then be used to measure the similarity between two vectors. Pearson correlation is equivalent to cosine similarity computed on mean-centered vectors, so it removes each user's rating offset and is more discriminative than cosine similarity when users rate on different scales.
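
The difference can be seen on a small made-up example. In the sketch below, the two hypothetical users disagree on every item (whenever one rates 5 the other rates 4), yet their cosine similarity is close to 1 because all ratings are positive and of similar size. The helper names `cosine` and `pearson` are illustrative only.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between a and b, measured from the origin."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation: cosine similarity after mean-centering each vector."""
    return cosine(a - a.mean(), b - b.mean())

a = np.array([5.0, 4.0, 5.0, 4.0])
b = np.array([4.0, 5.0, 4.0, 5.0])

cos_ab = cosine(a, b)        # 80 / 82, roughly 0.976
pearson_ab = pearson(a, b)   # -1.0: the users' preferences are exactly opposed
```

Note that Pearson correlation is undefined for constant vectors such as A and B above, since mean-centering turns them into zero vectors.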

References: https://blogs.sas.com/content/iml/2019/09/03/cosine-similarity.html, https://stats.stackexchange.com/questions/235673/is-there-any-relationship-among-cosine-similarity-pearson-correlation-and-z-sc, https://www.geeksforgeeks.org/python-pearson-correlation-test-between-two-variables/, https://leimao.github.io/blog/Cosine-Similarity-VS-Pearson-Correlation-Coefficient/


### Question on calculating the mean of ratings...

Since we are using the intersection of items rated by both users, should we not also use that intersection when calculating the mean rating of user u, instead of using all the items the user rated?

#### Answer...

1. It can be computationally expensive to recalculate the mean for every (u, v) pair of users.
2. It is hard to argue that using the intersection is better than using all the items, or vice versa.
3. When the intersection between two users contains only 1 item, the similarity metric fails: the mean over the intersection equals that single rating, so the term $ (r_{uk} - µ_u) $ in eqn (1.1) is zero and the similarity becomes 0/0.
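
The third point can be checked numerically. The ratings below are made up, and the means are deliberately taken over the intersection rather than over all rated items.

```python
import numpy as np

# Hypothetical ratings of two users on their single co-rated item.
u_common = np.array([4.0])
v_common = np.array([2.0])

# Means taken over the intersection equal the ratings themselves.
u_centered = u_common - u_common.mean()
v_centered = v_common - v_common.mean()

num = float((u_centered * v_centered).sum())
denom = float(np.sqrt((u_centered ** 2).sum()) * np.sqrt((v_centered ** 2).sum()))
# Both are 0.0 here, so num / denom is undefined.
```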


### Predicting Ratings

To predict a target user's rating of an item, we rely on the ratings from other users and amplify or attenuate the impact of a rating from those users based on the similarity between them and the target user.
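
One way to sketch this is a plain similarity-weighted average over the k most similar users who rated the item. This is an assumption about the exact weighting, since the text does not fix a formula; it also ignores the rating-scale issue raised earlier, which a mean-centered variant would address. The function `predict_rating` and its `sims` input (similarity of every user to the target user) are hypothetical.

```python
import numpy as np

def predict_rating(matrix, sims, item_idx, k=2):
    """Similarity-weighted average of neighbours' ratings on one item.

    matrix: users x items array where 0 marks an unrated item (assumption).
    sims: similarity of every user to the target user (hypothetical input).
    """
    matrix = np.asarray(matrix, dtype=float)
    sims = np.asarray(sims, dtype=float)
    item_ratings = matrix[:, item_idx]

    # Candidates rated the item and are positively similar to the target.
    # The target user is excluded automatically if they have not rated it.
    candidates = np.flatnonzero((item_ratings > 0) & (sims > 0))
    if candidates.size == 0:
        return None  # no usable neighbour, so no prediction

    # Keep the k most similar candidates.
    top = candidates[np.argsort(sims[candidates])[::-1][:k]]
    weights = sims[top]
    return float(np.dot(weights, item_ratings[top]) / weights.sum())
```

A larger similarity amplifies a neighbour's rating in the average and a smaller one attenuates it, which is the behaviour described above.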

In [83]:
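def average_rating(vector):
    # Only positive entries are actual ratings; 0 marks an unrated item.
    ratings = vector[vector > 0]
    if len(ratings) == 0:
        return 0.0
    return round(ratings.sum() / len(ratings), 4)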

In [110]:
def cos_sim(u_id, v_id, matrix):
    u_vector = matrix.get_user_vector(u_id)
    v_vector = matrix.get_user_vector(v_id)

    # Indices of the items each user has rated (0 marks an unrated item).
    u_ratings_idx = np.flatnonzero(u_vector)
    v_ratings_idx = np.flatnonzero(v_vector)

    # Keep only the items rated by both users, as in eqn (1.2).
    intersecting_idx = np.intersect1d(u_ratings_idx, v_ratings_idx)
    u_ratings = u_vector[intersecting_idx]
    v_ratings = v_vector[intersecting_idx]

    num = np.dot(u_ratings, v_ratings)
    u_mag = np.sqrt(np.dot(u_ratings, u_ratings))
    v_mag = np.sqrt(np.dot(v_ratings, v_ratings))
    denom = u_mag * v_mag

    # No co-rated items: the similarity is undefined, so report 0.
    if denom == 0:
        return 0.0
    return round(num / denom, 4)

In [114]:
cos_sim(103954, 103954, mutil)
# 103603
# 103654

1.0

# Item-based collaborative filtering

The items most similar to a target item are retrieved. The user's ratings on those similar items are then extracted, and their weighted average becomes the predicted rating for the target item.
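
The steps above can be sketched as follows, assuming the weights are item-to-item similarities and that 0 marks an unrated item. The function `predict_item_based` and its `item_sims` input (similarity of every item to the target item) are hypothetical names for illustration.

```python
import numpy as np

def predict_item_based(user_ratings, item_sims, k=2):
    """Predict a user's rating of a target item from their ratings of similar items.

    user_ratings: the user's ratings over all items, 0 = unrated (assumption).
    item_sims: similarity of every item to the target item (hypothetical input).
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    item_sims = np.asarray(item_sims, dtype=float)

    # Candidates: items the user rated that are positively similar to the target.
    # The target item itself drops out because the user has not rated it.
    candidates = np.flatnonzero((user_ratings > 0) & (item_sims > 0))
    if candidates.size == 0:
        return None  # nothing to base a prediction on

    # Keep the k most similar rated items and average with similarity weights.
    top = candidates[np.argsort(item_sims[candidates])[::-1][:k]]
    weights = item_sims[top]
    return float(np.dot(weights, user_ratings[top]) / weights.sum())
```

Unlike the user-based approach, every rating in the average comes from the target user, so differences in rating scale between users do not enter the prediction.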