### Demonstration of a simple neighborhood-based item-item collaborative filtering recommender
based on section 14.5.1 in Machine Learning in Action.  
Docstrings in the functions follow the [NumPy convention](https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard).

In [1]:
from scipy import stats
from scipy.spatial.distance import euclidean
import numpy as np

<img src="./images/ratings_MLIA.png" alt="Drawing" style="width: 300px;"/>

The initial ratings matrix. Items are rated from 1 to 5; a blank is an unrated item.

In [2]:
def ratings_matrix():
    """ Sets initial ratings matrix with items rated 1-5.
        
        Rows are users, columns are items. Fill unrated items
        with 0 values.
        
        Parameters
        ----------
        None
        
        Returns
        -------
        2d numpy array of type float
    """ 
    return np.array([[4, 4, 0, 2, 2],
                     [4, 0, 0, 3, 3],
                     [4, 0, 0, 1, 1],
                     [1, 1, 1, 2, 0],
                     [2, 2, 2, 0, 0],
                     [1, 1, 1, 0, 0],
                     [5, 5, 5, 0, 0]]).astype(float)

Similarity metrics

In [3]:
def sim_euclidean(u, v):
    """ Finds euclidean similarity between two vectors 
        
        Parameters
        ----------
        u : 1d np.array
        v : 1d np.array
        
        Returns
        -------
        float
    """ 
    return 1 / (1 + euclidean(u, v))

def sim_cosine(u, v):
    """ Finds cosine similarity between two vectors 
        
        Parameters
        ----------
        u : 1d np.array
        v : 1d np.array
        
        Returns
        -------
        float
    """
    costheta = np.dot(u, v)/(np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 + 0.5 * costheta
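
The two metrics respond differently to scale, which a quick check makes visible. The sketch below restates both functions so that it runs on its own; as an assumption for self-containment it uses `np.linalg.norm(u - v)` in place of SciPy's `euclidean`, and the vectors `a`, `b`, `c`, `d` are illustrative values, not columns of the ratings matrix.

```python
import numpy as np

def sim_euclidean(u, v):
    # Same formula as above, with the distance computed by NumPy.
    return 1 / (1 + np.linalg.norm(u - v))

def sim_cosine(u, v):
    costheta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 + 0.5 * costheta  # rescales cos(theta) from [-1, 1] to [0, 1]

a = np.array([1.0, 2.0, 1.0, 5.0])
b = 2 * a                      # same direction as a, twice the magnitude
c = np.array([1.0, 0.0])
d = np.array([0.0, 1.0])       # orthogonal to c

print(sim_euclidean(a, a))     # identical vectors: 1.0
print(sim_cosine(a, b))        # same direction: 1.0 up to floating-point rounding
print(sim_euclidean(a, b))     # about 0.15, because the magnitudes differ
print(sim_cosine(c, d))        # orthogonal vectors: 0.5
```

Cosine similarity ignores magnitude, so an item whose ratings are consistently double another's still counts as fully similar, while the Euclidean version penalizes the gap.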

### Methodology for predicting an item rating:
* The predicted rating for the item of interest is a weighted average of the ratings the user has given to other items.
* Each weight is the similarity between the item of interest and the rated item.
* To find the similarity of each item to the item of interest:  
    1) Determine the rows (users) in which the item of interest was rated.  
    2) Determine the rows in which the item being compared was rated.  
    3) Find the rows that overlap and take the ratings of both items in those rows to make two vectors.  
    4) Calculate the similarity (a single number) from those two vectors.  
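
The same procedure can be written as one formula. The notation here is introduced for this note and is not taken from the book: $r_{u,j}$ is user $u$'s rating of item $j$, $I_u$ is the set of items user $u$ has rated, and $s(i,j)$ is the similarity between items $i$ and $j$ computed over the rows where both were rated.

```latex
\hat{r}_{u,i} \;=\; \frac{\sum_{j \in I_u} s(i,j)\, r_{u,j}}{\sum_{j \in I_u} s(i,j)}
```

When two items share no rated rows, $s(i,j)$ is taken as 0, and if the denominator is 0 the function below returns 0 rather than dividing.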

In [4]:
def rating_collab(R, user, sim_func, item_ur):
    """User rating for unrated item using item-item similarity.

       Parameters
       ----------
       R : 2d numpy array
           ratings matrix where rows are users, columns are items
       user : int
           the user of interest (row in R)
       sim_func : function
           similarity function to quantify similarity of items
       item_ur : int
           the user's unrated item of interest (col in R)
           
       Returns
       -------
       float
          weighted rating for item based on item-item similarity
    """        
    n = R.shape[1] # number of columns (items) in R array
    sim_total = 0.0 # summed similarity of items to item of interest
    rat_sim_prod_total = 0.0 # summed rating * similarity for all items
    where_item_ur_rated = R[:,item_ur] > 0 # where the item of interest was rated
    for j in range(n): # for all the items
        rating = R[user, j] # get the user rating for that item
        if rating == 0:
            continue # not rated so it will not factor in rating
        where_item_rated = R[:, j] > 0
        where_both_rated = np.logical_and(where_item_ur_rated, where_item_rated)
        rows_both_rated = np.nonzero(where_both_rated)[0] 
        if len(rows_both_rated) == 0:
            similarity = 0 # no rows where they are both rated
        else:
            u = R[rows_both_rated, item_ur] # masked vector
            v = R[rows_both_rated, j] # masked vector
            similarity = sim_func(u, v) # find similarity between vectors
        sim_total += similarity  # denominator of weighted rating
        rat_sim_prod_total += rating * similarity # numerator of weighted rating
    if sim_total == 0:
        return 0
    else:
        weighted_rating = rat_sim_prod_total / sim_total  # weighted rating
        return weighted_rating

In [5]:
R = ratings_matrix()
print(R)

[[4. 4. 0. 2. 2.]
 [4. 0. 0. 3. 3.]
 [4. 0. 0. 1. 1.]
 [1. 1. 1. 2. 0.]
 [2. 2. 2. 0. 0.]
 [1. 1. 1. 0. 0.]
 [5. 5. 5. 0. 0.]]


User 0, item 2 doesn't have a rating.  Let's predict User 0's rating on that item.

In [6]:
# walk through the similarity step of rating_collab for one pair of items
usr = 0      # user
itm_ur = 2   # unrated item of interest
itm_cu = 0   # current item

ratings_usr = R[usr,:]
print(f"User {usr} ratings:")
print(ratings_usr)

ratings_cu = R[:,itm_cu]
print(f"\nCurrent item {itm_cu}")
print(ratings_cu)

where_item_ur_rated = R[:,itm_ur] > 0
print(f"\nRows where unrated item {itm_ur} was rated")
print(where_item_ur_rated)

where_item_cu_rated = R[:, itm_cu] > 0
print(f"\nRows where current item {itm_cu} was rated")
print(where_item_cu_rated)

where_both_rated = np.logical_and(where_item_ur_rated, where_item_cu_rated)
rows_both_rated = np.nonzero(where_both_rated)[0] 
print(f"\nRows where both unrated and current item rated")
print(rows_both_rated)

# make vectors
u = R[rows_both_rated, itm_ur]  # masked vector
v = R[rows_both_rated, itm_cu]  # masked vector
# calculate similarity
similarity = sim_cosine(u, v)      # find similarity between vectors

print(f"\nu {u}")
print(f"v {v}")
print(f"Similarity: {similarity}")

User 0 ratings:
[4. 4. 0. 2. 2.]

Current item 0
[4. 4. 4. 1. 2. 1. 5.]

Rows where unrated item 2 was rated
[False False False  True  True  True  True]

Rows where current item 0 was rated
[ True  True  True  True  True  True  True]

Rows where both unrated and current item rated
[3 4 5 6]

u [1. 2. 1. 5.]
v [1. 2. 1. 5.]
Similarity: 1.0


In [7]:
# using euclidean similarity
usr = 0
itm_ur = 2
rtng = rating_collab(R, user=usr, sim_func=sim_euclidean, item_ur=itm_ur)
print("The predicted rating of user {0} on item {1} is {2:0.1f}.".format(usr, itm_ur, rtng))

The predicted rating of user 0 on item 2 is 3.6.


In [8]:
# using cosine similarity
rtng = rating_collab(R, user=usr, sim_func=sim_cosine, item_ur=itm_ur)
print("The predicted rating of user {0} on item {1} is {2:0.1f}.".format(usr, itm_ur, rtng))

The predicted rating of user 0 on item 2 is 3.3.
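
The gap between 3.6 and 3.3 comes from a single item. The sketch below lists the weight each of user 0's rated items receives under both metrics. It restates the matrix and the similarity functions so that it runs on its own (with `np.linalg.norm` standing in for SciPy's `euclidean`); the helpers `item_weights` and `weighted_average` are hypothetical names introduced here for illustration.

```python
import numpy as np

R = np.array([[4, 4, 0, 2, 2],
              [4, 0, 0, 3, 3],
              [4, 0, 0, 1, 1],
              [1, 1, 1, 2, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0]], dtype=float)

def sim_euclidean(u, v):
    return 1 / (1 + np.linalg.norm(u - v))

def sim_cosine(u, v):
    costheta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 + 0.5 * costheta

def item_weights(R, user, item_ur, sim_func):
    """Map each item the user rated to its similarity with item_ur."""
    weights = {}
    rated_ur = R[:, item_ur] > 0
    for j in np.flatnonzero(R[user] > 0):
        both = rated_ur & (R[:, j] > 0)          # rows where both items are rated
        if both.any():
            weights[int(j)] = float(sim_func(R[both, item_ur], R[both, j]))
        else:
            weights[int(j)] = 0.0                # no overlap, no weight
    return weights

def weighted_average(R, user, weights):
    """Similarity-weighted average of the user's ratings."""
    total = sum(weights.values())
    return sum(R[user, j] * s for j, s in weights.items()) / total

for name, func in [("euclidean", sim_euclidean), ("cosine", sim_cosine)]:
    w = item_weights(R, user=0, item_ur=2, sim_func=func)
    print(name, w, round(weighted_average(R, 0, w), 2))
```

Items 0 and 1 get weight 1 under both metrics and item 4 gets weight 0 because it shares no rated rows with item 2. Item 3 overlaps with item 2 only in row 3 (ratings 1 and 2). Two positive one-element vectors always point in the same direction, so cosine gives item 3 full weight, while the Euclidean version gives 1 / (1 + 1) = 0.5. That yields (4 + 4 + 2) / 3 ≈ 3.33 versus (4 + 4 + 1) / 2.5 = 3.6.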


Let's find ratings for all the unrated items in the matrix.

In [9]:
def find_unrated_items_per_user(R):
    """ Finds unrated items (columns) for each user 
        
        Parameters
        ----------
        R : 2d numpy array
           ratings matrix where rows are users, columns are items
        Returns
        -------
        dict where key = user (row) and values = 1d numpy array of ints (columns)
    """
    unrated = dict()
    for user in range(R.shape[0]):
        unrated_items = np.nonzero(R[user, :] == 0)[0]
        unrated[user] = unrated_items
    return unrated
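
For comparison, the same mapping can be built with a dict comprehension over `np.flatnonzero`. This is a sketch under the same assumption that 0 means unrated, shown on the ratings matrix from above; it is not meant to replace the function.

```python
import numpy as np

R = np.array([[4, 4, 0, 2, 2],
              [4, 0, 0, 3, 3],
              [4, 0, 0, 1, 1],
              [1, 1, 1, 2, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0]], dtype=float)

# One entry per user (row): the column indices whose rating is 0.
unrated = {user: np.flatnonzero(R[user] == 0) for user in range(R.shape[0])}

for user, items in unrated.items():
    print(user, items.tolist())
```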

In [10]:
def estimate_ratings(R, unrated, sim_func=sim_euclidean, est_method=rating_collab):
    """ Estimates the unrated items in the ratings matrix for each user 
        
        Parameters
        ----------
        R : 2d numpy array
           ratings matrix where rows are users, columns are items
        unrated: dict 
           key = user (row) and values = list of ints (columns)
        sim_func : function
           similarity function to quantify similarity of items
        est_method : function
           the function that will be used to estimate the ratings
        Returns
        -------
        R_est : 2d numpy array
           estimated ratings matrix where rows are users, columns are items
    """
    R_est = R.copy()
    for user, unrated_items in unrated.items():
        for unrated_item in unrated_items:
            R_est[user, unrated_item] = est_method(R, user, sim_func, unrated_item)
    return R_est

In [11]:
def recommend_n_items(R, user, n=3, sim_func=sim_euclidean, est_method=rating_collab):
    """Recommends n previously unrated items to a user

       Parameters
       ----------
       R : 2d numpy array
           ratings matrix where rows are users, columns are items
       user : int
           the user of interest (row in R)
       n : int
           maximum number of items to recommend
       sim_func : function
           the similarity measure that will be used
       est_method : function
           the function that will be used to estimate the ratings
       Returns
       -------
       list of (int, float) tuples
          (item column index, estimated rating) pairs sorted by
          estimated rating, highest first; empty if everything is rated
    """
    unrated_items = np.nonzero(R[user, :] == 0)[0]
    if len(unrated_items) == 0:
        return []  # everything is rated, so there is nothing to recommend
    item_scores = []
    for item in unrated_items:
        estimated_score = est_method(R, user, sim_func, item)
        item_scores.append((item, estimated_score))
    return sorted(item_scores, key = lambda x: x[1], reverse=True)[:n]

With all functions defined, we can now run the full calculation.

In [12]:
R = ratings_matrix()
print("\nThe original ratings matrix:")
print(R.round(1))
unrated = find_unrated_items_per_user(R)
R_est = estimate_ratings(R, unrated, sim_func=sim_cosine, est_method=rating_collab)
print("\nThe ratings matrix filled in using item-item similarity:") 
print(R_est.round(1))
print("\nRecommendations:")
for user in range(R.shape[0]):
    recs = recommend_n_items(R, user, n=3, sim_func=sim_cosine, est_method=rating_collab)
    rec_str = " and ".join(["item {0} ({1:0.2f})".format(rec[0], rec[1]) for rec in recs])
    print("".join(["User {0} should be recommended ".format(user), rec_str, '.']))


The original ratings matrix:
[[4. 4. 0. 2. 2.]
 [4. 0. 0. 3. 3.]
 [4. 0. 0. 1. 1.]
 [1. 1. 1. 2. 0.]
 [2. 2. 2. 0. 0.]
 [1. 1. 1. 0. 0.]
 [5. 5. 5. 0. 0.]]

The ratings matrix filled in using item-item similarity:
[[4.  4.  3.3 2.  2. ]
 [4.  3.3 3.5 3.  3. ]
 [4.  2.  2.5 1.  1. ]
 [1.  1.  1.  2.  1.3]
 [2.  2.  2.  2.  2. ]
 [1.  1.  1.  1.  1. ]
 [5.  5.  5.  5.  5. ]]

Recommendations:
User 0 should be recommended item 2 (3.33).
User 1 should be recommended item 2 (3.50) and item 1 (3.34).
User 2 should be recommended item 2 (2.50) and item 1 (2.02).
User 3 should be recommended item 4 (1.34).
User 4 should be recommended item 3 (2.00) and item 4 (2.00).
User 5 should be recommended item 3 (1.00) and item 4 (1.00).
User 6 should be recommended item 3 (5.00) and item 4 (5.00).
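
One property is visible in these results: every estimate lies between the user's own lowest and highest rating, because the prediction is a weighted average with non-negative weights. Users 4, 5 and 6 each gave the same value to everything they rated, so all of their estimates collapse to that value. The sketch below checks the bound for the cosine case. It restates the matrix and similarity function so that it runs on its own, and `predict` is a hypothetical compact restatement of `rating_collab` written for this check.

```python
import numpy as np

R = np.array([[4, 4, 0, 2, 2],
              [4, 0, 0, 3, 3],
              [4, 0, 0, 1, 1],
              [1, 1, 1, 2, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0]], dtype=float)

def sim_cosine(u, v):
    costheta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 + 0.5 * costheta

def predict(R, user, item_ur, sim_func):
    """Similarity-weighted average of the user's ratings (0 if no weight)."""
    rated_ur = R[:, item_ur] > 0
    num = den = 0.0
    for j in np.flatnonzero(R[user] > 0):
        both = rated_ur & (R[:, j] > 0)          # rows where both items are rated
        s = sim_func(R[both, item_ur], R[both, j]) if both.any() else 0.0
        num += R[user, j] * s
        den += s
    return num / den if den > 0 else 0.0

for user in range(R.shape[0]):
    own = R[user][R[user] > 0]                   # ratings the user actually gave
    for item in np.flatnonzero(R[user] == 0):
        p = predict(R, user, item, sim_cosine)
        inside = own.min() - 1e-9 <= p <= own.max() + 1e-9
        print(f"user {user}, item {item}: {p:.2f} "
              f"within [{own.min():.0f}, {own.max():.0f}]: {inside}")
```

A consequence is that this method can never predict a rating higher than the best rating a user has already given, which limits how strongly it can rank a new item for users with uniform ratings.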
