# CF book recommendations for a user with context

This notebook demonstrates how to predict book ratings for a user who has existing rating records, using the goodreads-10k dataset.

We have seen that for UBCF (user-based collaborative filtering), adjusted cosine similarity gives good cross-validation scores (RMSE and MAE).
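
As a rough illustration of this similarity measure, the sketch below computes cosine similarity on mean-centered ratings for a few toy users. It assumes that a zero entry means "not rated" and that norms are taken over all of a user's rated books. The function name `adjusted_cosine` and the toy data are made up for this example and are not part of the model defined later.

```python
import numpy as np

def adjusted_cosine(u, v):
    """Cosine similarity of two rating vectors, each centered on its own mean.

    Hypothetical helper for illustration; a zero entry means "not rated".
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # center only the rated entries; unrated entries stay at zero
    cu = np.where(u != 0, u - u[u != 0].mean(), 0.0)
    cv = np.where(v != 0, v - v[v != 0].mean(), 0.0)
    denom = np.linalg.norm(cu) * np.linalg.norm(cv)
    return float(cu @ cv / denom) if denom > 0 else 0.0

# toy ratings on four books
alice = [5, 3, 0, 1]
bob = [4, 2, 0, 0]
carol = [1, 3, 5, 0]

sim_ab = adjusted_cosine(alice, bob)    # similar taste on the co-rated books
sim_ac = adjusted_cosine(alice, carol)  # opposite taste on the co-rated books
print(sim_ab, sim_ac)
```

A positive value means the two users deviate from their own means in the same direction on the books they share; a negative value means they deviate in opposite directions.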

This notebook implements a model that generates the top book ids for a user.

In [1]:
import numpy as np
import scipy as sp
import scipy.sparse  # make sure the sp.sparse submodule is loaded
!readlink -f .  # prints the absolute path of the directory holding this notebook

/home/zebalgebra/School/DVA/The-Last-Book-Bender/Notebooks


In [2]:
# define your path to data directory here
fp = "/home/zebalgebra/School/DVA/The-Last-Book-Bender/Data/Raw/"
fname = "ratings_for_cf.npz"

If you don't have the file `ratings_for_cf.npz`, execute this block to generate it. The block reads the raw CSV, centers each rating on its user's mean, and shifts the result slightly to give more useful recommendations.

In [3]:
%%time
import pandas as pd
df = pd.read_csv(fp + "ratings.csv")

# center each rating on its user's mean, then add a tiny offset
mean = df.groupby("user_id").agg({"rating": "mean"}).rename(columns={"rating": "mean"})
df = df.merge(mean, on="user_id")
df["rating"] = df["rating"] - df["mean"] + 10 ** (-8)

# generate and save csc matrix
mat = sp.sparse.csc_matrix(
    (
        np.array(df["rating"]),
        (
            np.array(df["book_id"]),
            np.array(df["user_id"])
        )
    )
)
with open(fp + fname, "wb") as f:
    sp.sparse.save_npz(f, mat)

CPU times: user 4.41 s, sys: 1.33 s, total: 5.73 s
Wall time: 4.53 s
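
One way to read the tiny `10 ** (-8)` shift: without it, a rating that equals the user's mean centers to exactly zero, and once the ratings are viewed as a dense array it can no longer be told apart from an unrated book. The sketch below illustrates this on a hypothetical user who rates every book 3. The variable names and data are for illustration only.

```python
import numpy as np

ratings = np.array([3.0, 3.0, 0.0, 3.0])  # hypothetical user; 0 marks an unrated book
rated = ratings != 0
mean = ratings[rated].mean()

# plain centering: every rated entry collapses to exactly zero
centered = np.zeros_like(ratings)
centered[rated] = ratings[rated] - mean

# centering with the tiny shift: rated entries stay nonzero
shifted = np.zeros_like(ratings)
shifted[rated] = ratings[rated] - mean + 10 ** (-8)

n_plain = np.count_nonzero(centered)
n_shifted = np.count_nonzero(shifted)
print(n_plain, n_shifted)
```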


## Demo on predicting ratings
Execute the following block to initialize model.

Methods prefixed with `_` are intended for internal use.

In [4]:
class cf_model:
    
    def __init__(self, fp, fname):
        """
        Assumes each user's ratings have already been centered on that user's mean.
        Rows of mat are books; columns are users.
        """
        self.mat = sp.sparse.load_npz(fp + fname)
        self.norms = np.sqrt(np.array(self.mat.power(2).sum(axis=0))).flatten()
        self.n_books, self.n_users = self.mat.shape
        
    def _context_to_vec(self, context):
        """
        Utility: convert a context (dict, list or tuple of pairs, NumPy array,
        or SciPy sparse matrix) into a SciPy CSC column vector.
        """
        if type(context) not in [
            sp.sparse.csc_matrix,
            sp.sparse.csr_matrix,
            sp.sparse.coo_matrix,
            np.ndarray,
            list,
            tuple,
            dict
        ]:
            raise NotImplementedError("type not supported.")
        if type(context) in [sp.sparse.csr_matrix, sp.sparse.coo_matrix]:
            context = context.tocsc()
        if type(context) is np.ndarray:
            context = sp.sparse.csc_matrix(context)
        if type(context) is sp.sparse.csc_matrix:
            if context.shape[0] == 1:  # row vector: turn it into a column
                context = context.transpose().tocsc()
            return context
        if len(context) == 0:
            # empty context: an all-zero column with one row per book
            return sp.sparse.csc_matrix((self.n_books, 1))
        if type(context) is dict: # assumes the form {book_id: rating}
            context = list(context.items())
        n = len(context)
        return sp.sparse.csc_matrix(
            (
                np.array([t[1] for t in context]), # data
                (
                    np.array([t[0] for t in context]).astype(int),
                    np.zeros(n).astype(int) # columns
                )
            )
        )

    def _get_mean(self, context_vec):
        """
        Utility: mean of the context ratings. Assumes context_vec is a SciPy CSC matrix
        (mean over the stored entries) or a NumPy array (mean over all entries).
        """
        if type(context_vec) is np.ndarray:
            return context_vec.mean()
        if type(context_vec) is sp.sparse.csc_matrix:
            return context_vec.data.sum() / len(context_vec.data)

    def _process_context(self, context_vec):
        """
        Utility: center the context on its mean, then scale it to unit norm (in place).
        Assumes context_vec is a SciPy CSC matrix or a NumPy array.
        """
        mu = self._get_mean(context_vec)
        if type(context_vec) is np.ndarray:
            # center by mean
            supp = context_vec != 0
            context_vec[supp] = (context_vec[supp] - mu + 10 ** (-8))
            # normalize
            s2 = (context_vec ** 2).sum() # sum of elements squared
            context_vec[supp] = context_vec[supp] / np.sqrt(s2)
            return context_vec
        if type(context_vec) is sp.sparse.csc_matrix:
            # center by mean
            context_vec.data = (context_vec.data - mu + 10 ** (-8))
            # normalize
            s2 = (context_vec.data ** 2).sum()
            context_vec.data = context_vec.data / np.sqrt(s2)
            return context_vec

    def _top_k_neighbors(self, context, k=50):
        """
        Return the indices of the k users most similar to the context, with their similarity scores.
        The raw context is converted and processed here via _context_to_vec and _process_context.
        """
        context_vec = self._process_context(self._context_to_vec(context))
        if np.prod(context_vec.shape) != self.n_books: # otherwise would raise dimension mismatch
            context_vec.resize((self.n_books, 1))
        # the result is inevitably a dense vector: one similarity score per user
        sim_scores = (self.mat.T @ context_vec).toarray().flatten() / np.maximum(self.norms, 10 ** (-30))
        neighbors = np.argpartition(sim_scores, -k)[-k:]  # top k, unsorted; argpartition avoids a full sort
        sim_scores = sim_scores[neighbors]
        return neighbors, sim_scores

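    def _get_scores(self, context, k=50):
        """
        Score every book as a similarity-weighted average of the neighbors' centered
        (non-normalized) ratings. This is an approximate formula chosen because it vectorizes well;
        the denominator sums the similarities of neighbors whose centered rating for the book is positive.
        """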
        neighbors, sim_scores = self._top_k_neighbors(context, k=k)
        submat = self.mat[:, neighbors].toarray() # retrieve rating records of these neighbors
        numer = submat * sim_scores
        denom = (submat > 0) * sim_scores
        scores = numer.sum(axis=1) / np.maximum(
            denom.sum(axis=1),
            10 ** (-30)
        )
        return scores

    def get_top_m_recs_k_neighbors(self, context, k=50, m=100):
        """
        Return the top m books as (score, book_id) pairs, sorted by descending score,
        using the k nearest neighbors of the context.
        """
        scores = self._get_scores(context, k=k)
        top_m_inds = np.argpartition(scores, -m)[-m:]
        top_m_scores = scores[top_m_inds]
        return sorted(zip(top_m_scores, top_m_inds))[::-1]
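
To make the scoring step in `_get_scores` concrete, here is a small dense sketch of the same arithmetic. The 3-book, 2-neighbor matrix and the similarity values are invented for illustration.

```python
import numpy as np

# rows are books, columns are neighbors; entries are centered ratings (0 = not rated)
submat = np.array([
    [2.0, 1.0],
    [0.0, 1.0],
    [0.0, 0.0],
])
sim_scores = np.array([0.5, 0.25])  # similarity of each neighbor to the context

numer = submat * sim_scores        # similarity-weighted ratings, broadcast across rows
denom = (submat > 0) * sim_scores  # weights of neighbors with a positive centered rating
scores = numer.sum(axis=1) / np.maximum(denom.sum(axis=1), 10 ** (-30))
print(scores)
```

The `np.maximum(..., 10 ** (-30))` guard keeps the division defined for books that none of the neighbors rated; such books end up with a score of zero.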
        

## Usage Demonstration
Initialize the model with the processed ratings matrix, specified by the path to its directory and its filename.

In [5]:
%%time
model = cf_model(fp, fname)

CPU times: user 236 ms, sys: 8.97 ms, total: 245 ms
Wall time: 245 ms


You can define the context as any of the following:
1. A dictionary of `book_id: value`.
2. A list or tuple of pairs `(book_id, value)` or `[book_id, value]`.
3. A NumPy vector `v` with `v[book_id] = value`.
4. A SciPy sparse vector (CSC, CSR, or COO).
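
As a minimal sketch of why these forms are interchangeable, the snippet below turns a dictionary and a list of pairs into the same sparse column vector and compares both with the dense form. The helper `to_column` and the size `n_books=5` are hypothetical and stand in for what the model does internally.

```python
import numpy as np
import scipy.sparse

def to_column(context, n_books):
    """Hypothetical helper: build an (n_books, 1) CSC column from a dict or from pairs."""
    if isinstance(context, dict):
        context = list(context.items())
    rows = np.array([b for b, _ in context], dtype=int)    # book ids
    vals = np.array([r for _, r in context], dtype=float)  # ratings
    cols = np.zeros(len(rows), dtype=int)                  # single column
    return scipy.sparse.csc_matrix((vals, (rows, cols)), shape=(n_books, 1))

from_dict = to_column({1: 1, 2: 1}, n_books=5)
from_pairs = to_column([(1, 1), (2, 1)], n_books=5)
dense = np.array([0, 1, 1, 0, 0], dtype=float)

same_as_dense = bool((from_dict.toarray().ravel() == dense).all())
same_as_pairs = bool((from_dict.toarray() == from_pairs.toarray()).all())
print(same_as_dense, same_as_pairs)
```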

To get the top recommendations, specify how many neighbors to use (the `k` argument) and how many recommendations to generate (the `m` argument).
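
The top-`m` selection can be sketched with `np.argpartition`, the same call the model uses for picking both neighbors and recommendations. The scores below are made up for illustration.

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])  # hypothetical score per book id
m = 2

# indices of the m largest scores, in no particular order
top = np.argpartition(scores, -m)[-m:]

# sort only those m entries, highest score first, as (score, book_id) pairs
ranked = sorted(zip(scores[top], top))[::-1]
print(ranked)
```

Partitioning first and sorting only the `m` survivors avoids a full sort of the score vector, which matters when there are many books or users.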

Say your user rated the book with id 1 with a 1 and the book with id 2 also with a 1. You can generate book recommendations as follows:

In [6]:
%%time
context_dict = {1: 1, 2: 1}
context_t_t = ((1, 1), (2, 1))
context_l_t = [(1, 1), (2, 1)]
context_t_l = ([1, 1], [2, 1])
context_l_l = [[1, 1], [2, 1]]
context_np = np.array([0, 1, 1, 0, 0])
context_sp_csc = sp.sparse.csc_matrix(np.array([0, 1, 1, 0, 0]))
context_sp_csr = sp.sparse.csr_matrix(np.array([0, 1, 1, 0, 0]))
context_sp_coo = sp.sparse.coo_matrix(np.array([0, 1, 1, 0, 0]))

context = context_dict

model.get_top_m_recs_k_neighbors(context, k=100, m=3)

CPU times: user 21.5 ms, sys: 579 µs, total: 22 ms
Wall time: 21 ms


[(2.12372248389602, 681), (2.00000001, 2157), (2.00000001, 1969)]

For example, this says that the predicted rating for the book with id 681 is about 2.124 above the user's baseline (their mean rating).
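
If an absolute rating is more convenient than an offset, one option is to add the user's mean back and clip to the rating scale. The sketch below assumes a 1 to 5 scale, which is an assumption not stated in this notebook, and uses the mean of the example context `{1: 1, 2: 1}`.

```python
import numpy as np

user_mean = 1.0            # mean of the example context {1: 1, 2: 1}
offset = 2.12372248389602  # score reported for book id 681

# assumed 1-5 rating scale; clip in case mean + offset falls outside it
predicted = float(np.clip(user_mean + offset, 1, 5))
print(predicted)
```
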
## A