### User-based collaborative filtering

Below is a rating matrix consisting of 5 items and 6 users. Two user-item ratings are unknown, marked with 0. 
Intuitively, user 1 (row 0) looks similar to user 4 (row 3) and dissimilar to user 2 (row 1).



In [1]:
import numpy as np
ratings = np.array([[8, 1, 0, 2, 7], [2, 0, 5, 7, 5], [5, 4, 7, 4, 7], [7, 1, 7, 3, 8], [1, 7, 4, 6, 5], [8, 3, 8, 3, 7]])

In [2]:
ratings

array([[8, 1, 0, 2, 7],
       [2, 0, 5, 7, 5],
       [5, 4, 7, 4, 7],
       [7, 1, 7, 3, 8],
       [1, 7, 4, 6, 5],
       [8, 3, 8, 3, 7]])

In [4]:
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

In [5]:
def cosine_similarity(x, y):
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

In [7]:
print(euclidean_distance(ratings[0], ratings[1]))
print(euclidean_distance(ratings[0], ratings[3]))

9.539392014169456
7.211102550927978


In [9]:
print(cosine_similarity(ratings[0], ratings[1]))
print(cosine_similarity(ratings[0], ratings[3]))

0.5895949304375813
0.8352985630281698


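By Euclidean distance, `user 2` is farther from `user 1` (9.54) than `user 4` is (7.21), but the gap is small. Cosine similarity tells the same story: 0.59 for `user 2` versus 0.84 for `user 4`. Both measures treat the unknown ratings (0) as if they were real, very low ratings.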
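
One reason the gap looks small is that the 0 placeholders take part in the computation. As a rough illustration only, the sketch below compares users on the items that both have rated. The helper `corated_euclidean` and the variables `user_1`, `user_2`, `user_4` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

# Rows 0, 1 and 3 of the rating matrix above; 0 marks an unknown rating.
user_1 = np.array([8, 1, 0, 2, 7])
user_2 = np.array([2, 0, 5, 7, 5])
user_4 = np.array([7, 1, 7, 3, 8])

def corated_euclidean(x, y):
    # Hypothetical helper: keep only items that both users have rated (non-zero).
    mask = (x != 0) & (y != 0)
    return np.sqrt(np.sum((x[mask] - y[mask]) ** 2))

print(corated_euclidean(user_1, user_2))  # user 1 vs user 2
print(corated_euclidean(user_1, user_4))  # user 1 vs user 4
```

Under this assumption the distances are roughly 8.06 and 1.73, so the contrast between the two pairs is much sharper than with the raw vectors.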

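Now we center each user's ratings: subtract the user's mean rating (computed over rated items only) from every rated entry, and leave unknown entries at 0.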
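
In symbols, a sketch of the centering step, where $I_u$ (the set of items rated by user $u$), $r_{ui}$, $\bar{r}_u$ and $\tilde{r}_{ui}$ are notation assumed here rather than taken from the notebook:

$$
\bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui},
\qquad
\tilde{r}_{ui} =
\begin{cases}
r_{ui} - \bar{r}_u & \text{if } i \in I_u \\
0 & \text{otherwise}
\end{cases}
$$

For user 1 (row 0), the rated entries are 8, 1, 2 and 7, so $\bar{r}_u = 18 / 4 = 4.5$ and the centered row is $(3.5, -3.5, 0, -2.5, 2.5)$.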

In [10]:
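def centered(x):
    # Mean over rated (non-zero) entries only; 0 marks an unknown rating.
    avg = np.sum(x) / np.count_nonzero(x)
    # Subtract the mean from rated entries; unknown entries stay at 0.
    return np.where(x != 0, x - avg, 0.0)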

In [11]:
def centered_cosine_similarity(x, y):
    x = list(centered(x))
    y = list(centered(y))
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

In [12]:
print(centered_cosine_similarity(ratings[0], ratings[1]))
print(centered_cosine_similarity(ratings[0], ratings[3]))

-0.6733485361923414
0.9078624123797538


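After centering, the difference becomes much larger: `user 2` now has a negative similarity to `user 1` (-0.67), while `user 4` stays strongly positive (0.91).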
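
The same comparison can be extended to every other user to rank potential neighbours of user 1. The following is a minimal sketch, not the notebook's own code: `user_ratings`, `center_row` and `centered_cosine` are names assumed here so that the cells above are left untouched.

```python
import numpy as np

# Same 6 x 5 user-item matrix as above; 0 marks an unknown rating.
user_ratings = np.array([[8, 1, 0, 2, 7], [2, 0, 5, 7, 5], [5, 4, 7, 4, 7],
                         [7, 1, 7, 3, 8], [1, 7, 4, 6, 5], [8, 3, 8, 3, 7]])

def center_row(x):
    # Subtract the mean of the rated entries; unrated entries stay at 0.
    avg = x.sum() / np.count_nonzero(x)
    return np.where(x != 0, x - avg, 0.0)

def centered_cosine(x, y):
    cx, cy = center_row(x), center_row(y)
    return np.dot(cx, cy) / (np.linalg.norm(cx) * np.linalg.norm(cy))

target = 0  # row index of user 1
scores = {u: centered_cosine(user_ratings[target], user_ratings[u])
          for u in range(len(user_ratings)) if u != target}

# Row indices sorted from most to least similar to user 1.
neighbours = sorted(scores, key=scores.get, reverse=True)
print(neighbours)
```

With this matrix the ranking starts with row 3 (user 4), which matches the intuition stated at the top of the section.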

### Item-item collaborative filtering

Below is a rating matrix consisting of 3 items and 4 users. Three user-item ratings are unknown, marked with 0. We want to predict these unseen user-item ratings. In this example, let's predict the rating for the second item by the first user.

Calculate the similarity between the second item and other items by looking at all users who have rated both items, and use this to compute a rating for the unseen user-item pair.

In [17]:
ratings = np.array([[2, 0, 3], [5, 2, 0], [3, 3, 1], [0, 2, 2]])

In [18]:
ratings

array([[2, 0, 3],
       [5, 2, 0],
       [3, 3, 1],
       [0, 2, 2]])

Let's first transpose the matrix so that rows are items and columns are users.

In [19]:
ratings = np.transpose(ratings)
print(ratings)

[[2 5 3 0]
 [0 2 3 2]
 [3 0 1 2]]


How similar is each item to item 2 (row 1 of the transposed matrix)?

In [22]:
def sim(sim_fn, ratings):
    # Similarity of every item (row) to item 2, which is row 1 of the item-user matrix.
    return [sim_fn(x, ratings[1]) for x in ratings]

In [23]:
sim(cosine_similarity, ratings)

[0.747545001596402, 1.0, 0.4537426064865151]

In [24]:
sim(centered_cosine_similarity, ratings)

[-0.44095855184409855, 1.0000000000000002, -0.5773502691896255]

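Item 1 is the most similar to item 2 (cosine similarity of about 0.75) and has been rated by user 1. Using |K| = 1, we choose item 1 as the only neighbour. The centered similarities are both negative here, so the plain cosine similarity is used as the weight.

The prediction is the similarity-weighted average of user 1's ratings for the neighbours:

rating(user 1, item 2) = (0.75 * 2) / 0.75 = 2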
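
The prediction step can be sketched in code. This assumes the common weighted-average form, which divides the similarity-weighted sum by the sum of the similarities. The names `item_user`, `cosine` and `predict_rating` are hypothetical and chosen so they do not overwrite anything defined above.

```python
import numpy as np

# Item-user matrix from above (rows: items 1-3, columns: users 1-4); 0 = unrated.
item_user = np.array([[2, 5, 3, 0],
                      [0, 2, 3, 2],
                      [3, 0, 1, 2]])

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def predict_rating(item_user, user, item, k=1):
    # Candidate neighbours: other items that this user has actually rated.
    rated = [j for j in range(item_user.shape[0])
             if j != item and item_user[j, user] != 0]
    # Keep the k candidates most similar to the target item.
    rated.sort(key=lambda j: cosine(item_user[j], item_user[item]), reverse=True)
    top = rated[:k]
    sims = np.array([cosine(item_user[j], item_user[item]) for j in top])
    vals = np.array([item_user[j, user] for j in top], dtype=float)
    # Similarity-weighted average of the user's ratings for those neighbours.
    return np.dot(sims, vals) / np.sum(sims)

print(predict_rating(item_user, user=0, item=1, k=1))  # user 1, item 2
```

With a single neighbour the weighted average reduces to that neighbour's rating. Passing `k=2` blends user 1's ratings for items 1 and 3, weighted by their similarity to item 2.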