# Collaborative Filtering Baseline Model

In this notebook we build an item-to-item collaborative filtering model to serve as a baseline.

### Imports

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd

random.seed(42)
np.random.seed(42)

### Load Training Data

In [2]:
OUTPUT_DATA_DIR = "./output_data/"

train_df = pd.read_csv(OUTPUT_DATA_DIR+"interactions_training.csv")

In [4]:
pd.set_option('display.max_columns', None)

### Load Validation Data

In [13]:
val_df = pd.read_csv(OUTPUT_DATA_DIR+"interactions_validation.csv")

### Collaborative Filtering - Item to Item Similarity

The predicted rating for a book is the mean of the per-book average ratings of its most similar books.

We use kNN to find those books, so the prediction for a book is the mean of the average ratings of its `k` nearest books.
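
As a concrete illustration of this rule, here is a minimal sketch with made-up numbers. The book ids, averages, and neighbour list are hypothetical and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical per-book average ratings (not from the real data).
book_average = pd.Series({"A": 4.0, "B": 3.0, "C": 5.0, "D": 2.0})

# Suppose the k = 3 nearest neighbours of book "A" are A itself, B and C.
# (A book that is in the training matrix is typically its own nearest neighbour.)
neighbours_of_a = ["A", "B", "C"]

# The prediction is the mean of the neighbours' average ratings.
predicted_rating_a = book_average[neighbours_of_a].mean()
print(predicted_rating_a)  # 4.0
```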

In [6]:
train_df['book_id'] = train_df['book_id'].astype("category")
train_df['user_id'] = train_df['user_id'].astype("category")

In [15]:
import scipy.sparse as sp

item_matrix = train_df.pivot(index='book_id', columns='user_id', values='rating').fillna(0)
item_train_matrix = sp.csr_matrix(item_matrix.values)

84037
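
The pivot above materialises a dense book-by-user table before converting it to a sparse matrix. If memory becomes a concern, one possible alternative is to build the sparse matrix directly from the category codes. The sketch below uses a toy frame with made-up ids and ratings, and assumes each (book, user) pair appears at most once; `csr_matrix` would sum duplicate pairs, whereas `pivot` raises on them.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Toy interactions; the ids and ratings are made up.
toy_df = pd.DataFrame({
    "book_id": [10, 10, 20, 30],
    "user_id": ["u1", "u2", "u1", "u3"],
    "rating": [5, 3, 4, 2],
})
toy_df["book_id"] = toy_df["book_id"].astype("category")
toy_df["user_id"] = toy_df["user_id"].astype("category")

# Dense route, mirroring the cell above.
dense = toy_df.pivot(index="book_id", columns="user_id", values="rating").fillna(0)

# Sparse route: the category codes serve as row and column positions.
rows = toy_df["book_id"].cat.codes.to_numpy()
cols = toy_df["user_id"].cat.codes.to_numpy()
shape = (toy_df["book_id"].cat.categories.size,
         toy_df["user_id"].cat.categories.size)
sparse = sp.csr_matrix((toy_df["rating"].to_numpy(), (rows, cols)), shape=shape)

# Check that both routes give the same matrix on the toy data.
print(np.array_equal(sparse.toarray(), dense.to_numpy()))
```

With this route, row `i` of the matrix corresponds to `toy_df["book_id"].cat.categories[i]`, which plays the role that `item_matrix.index` plays in the rest of the notebook.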

We now fit a few kNN models for various values of `k`. Note that there are far more users than books, so we keep `k` relatively small and try `k = [1, 2, 5, 10]` initially.
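
The sweep over `k` can be sketched as a simple loop. The example below is illustrative only: it uses a tiny made-up matrix (4 books, 3 users) and smaller values of `k` than the ones listed above, because the toy matrix has only four rows.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user rating matrix; the values are made up.
toy_matrix = sp.csr_matrix(np.array([
    [5, 0, 0],
    [4, 1, 0],
    [0, 5, 0],
    [0, 0, 3],
], dtype=float))

neighbours_by_k = {}
for k in [1, 2]:
    model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=k)
    model.fit(toy_matrix)
    # One row of neighbour positions per book, k columns.
    neighbours_by_k[k] = model.kneighbors(toy_matrix, return_distance=False)
    print(k, neighbours_by_k[k].shape)
```

Because the query rows are the training rows themselves, each book is returned as its own nearest neighbour here. With `k = 1` the prediction therefore reduces to the book's own training average.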

In [17]:
from sklearn.neighbors import NearestNeighbors

train_item_avg = train_df.groupby('book_id', as_index=False)['rating'].mean()
train_item_avg.columns = ['book_id', 'book_average']
train_item_avg = train_item_avg.set_index('book_id')

In [19]:
def build_knn_model(train_matrix, k):
    """Builds a kNN model on `train_matrix` with `k` neighbours.
    
    Parameters
    ----------
    train_matrix: sp.csr_matrix
        The sparse matrix used to build the kNN model.
    k: int
        The number of neighbours to use in the kNN model.
    
    Returns
    -------
    NearestNeighbors
        A NearestNeighbors model fit to `train_matrix`.
    
    """
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=k)
    model_knn.fit(train_matrix)
    return model_knn

In [39]:
def get_item_preds_from_knn(knn_model, train_matrix, items, item_avgs):
    """Gets the kNN predictions for the items in `items`.
    
    This assumes that every item in items was fit on the
    knn_model. This is just a precomputation step to get
    the predictions for items in the training set.
    
    Parameters
    ----------
    knn_model: NearestNeighbors
        A NearestNeighbors model that has been fit.
    train_matrix: sp.csr_matrix
        The sparse matrix representing the training data.
    items: np.array
        An array of item indices for items in `knn_model`.
    item_avgs: pd.DataFrame
        A pandas dataframe containing the average rating for
        each item in `items`.
    
    Returns
    -------
    pd.DataFrame
        A DataFrame containing the predicted rating for each item
        in `items`.
    
    """
    item_neighbors = np.asarray(knn_model.kneighbors(train_matrix, return_distance=False))
    knn_avgs = np.zeros(len(item_neighbors))   # this is more efficient than appending multiple times (no resizing)
    for i in range(len(item_neighbors)):
        knn_avgs[i] = item_avgs['book_average'][items[item_neighbors[i]]].mean()    # average of average ratings for neighbors
    return pd.concat([pd.DataFrame(items, columns=['book_id']),
                      pd.DataFrame(knn_avgs, columns=['book_rating'])],
                    axis=1)
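
The per-item loop above could also be written with NumPy fancy indexing. This is a sketch of an alternative, not the notebook's method; the ids, averages, and neighbour positions below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 4 items, k = 2 neighbours each (positions into `items`).
items = pd.Index([10, 20, 30, 40], name="book_id")
item_avgs = pd.DataFrame({"book_average": [4.0, 3.0, 5.0, 1.0]}, index=items)
item_neighbors = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])

# Align the averages with the matrix row order, then index with the whole
# neighbour array at once; the intermediate result has shape (n_items, k).
avg_by_position = item_avgs["book_average"].reindex(items).to_numpy()
knn_avgs = avg_by_position[item_neighbors].mean(axis=1)

item_preds = pd.DataFrame({"book_id": items, "book_rating": knn_avgs})
print(item_preds)
```

This avoids one label lookup per item, which may matter when there are many books.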

In [53]:
def predict_ratings(X, item_preds, default_val, merge_col):
    """Predicts the item ratings for the items in `X`.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    item_preds: pd.DataFrame
        The DataFrame of predicted ratings for the items.
    default_val: float
        A default rating used for unseen items.
    merge_col: str
        The column to merge on.
    
    Returns
    -------
    pd.DataFrame
        A DataFrame containing the predicted item ratings for
        the records in `X`.
    
    """
    id_col = "{}_id".format(merge_col)
    rating_col = "{}_rating".format(merge_col)
    df_item = pd.merge(X, item_preds, how='left', on=[id_col])
    df_item[rating_col] = df_item[rating_col].fillna(default_val)
    df_item.index = X.index
    return df_item
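
To show what the left merge and the default value do, here is a small sketch that repeats the body of `predict_ratings` inline on hypothetical data. Book `99` stands in for a book that was not seen in training, and `3.0` is an assumed fallback rating.

```python
import pandas as pd

# Hypothetical validation rows and per-book predictions.
X = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "book_id": [10, 20, 99]},
                 index=[7, 8, 9])
item_preds = pd.DataFrame({"book_id": [10, 20], "book_rating": [3.5, 4.0]})
default_val = 3.0  # assumed fallback rating for unseen books

merged = pd.merge(X, item_preds, how="left", on=["book_id"])
# Book 99 has no match in `item_preds`, so its rating is NaN until filled.
merged["book_rating"] = merged["book_rating"].fillna(default_val)
# The merge returns a fresh 0..n-1 index; restore the original labels.
merged.index = X.index
print(merged)
```

Restoring the index this way relies on `item_preds` having one row per book, so that the left merge keeps the row count and order of `X`.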

In [None]:
def get_item_knn_train_validation_preds(train_df, val_df, train_matrix, k, items, item_avgs):
    """Gets predictions on `train_df` and `val_df` from a kNN model.
    
    Parameters
    ----------
    train_df: pd.DataFrame
        A DataFrame of the training data.
    val_df: pd.DataFrame
        A DataFrame of the validation data.
    train_matrix: sp.csr_matrix
        The sparse matrix used to train the kNN model.
    k: int
        The number of neighbours in the kNN model.
    items: np.array
        An array of strings representing the ids of the
        items used in training.
    item_avgs: pd.DataFrame
        A DataFrame containing the average rating for the
        items in `items`.
    
    Returns
    -------
    np.array, np.array
        Arrays of predictions on the training and validation sets, respectively.
    
    """
    knn_model = build_knn_model(train_matrix, k)
    knn_preds = get_item_preds_from_knn(knn_model, train_matrix, items, item_avgs)
    
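    # Prediction for a book that was not seen in training.
    new_book_vec = np.zeros(train_matrix.shape[1])
    new_book_neighbours = knn_model.kneighbors(new_book_vec.reshape(1, -1), return_distance=False)
    new_book_pred = item_avgs['book_average'][items[new_book_neighbours[0]]].mean()

    # Assumed completion: the cell stops before the return promised in the docstring.
    train_preds = predict_ratings(train_df, knn_preds, new_book_pred, "book")["book_rating"].values
    val_preds = predict_ratings(val_df, knn_preds, new_book_pred, "book")["book_rating"].values
    return train_preds, val_preds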

In [24]:
knn_model = build_knn_model(item_train_matrix, 5)

In [40]:
knn_avgs = get_item_preds_from_knn(knn_model, item_train_matrix, item_matrix.index, train_item_avg)

In [42]:
new_book_vec = np.zeros(item_train_matrix.shape[1])
new_book_neighbours = knn_model.kneighbors(new_book_vec.reshape(1, -1), return_distance=False)

In [47]:
default_val = train_item_avg['book_average'][item_matrix.index[new_book_neighbours[0]]].mean()
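
One point worth checking about this default: the new book is represented by an all-zero vector, which has zero cosine similarity with every row. As far as we can tell, scikit-learn then reports a cosine distance of 1 to every book, so the `k` neighbours behind `default_val` are ties rather than genuinely similar books. The sketch below checks this on a made-up toy matrix.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user rating matrix; the values are made up.
toy_matrix = sp.csr_matrix(np.array([
    [5, 0, 0],
    [4, 1, 0],
    [0, 5, 0],
    [0, 0, 3],
], dtype=float))
model = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
model.fit(toy_matrix)

# A book with no ratings at all, as in the cell above.
new_book_vec = np.zeros((1, toy_matrix.shape[1]))
distances, neighbours = model.kneighbors(new_book_vec)
print(distances)
print(neighbours)
```

If the distances do come out equal, a simpler default such as the global mean rating might be easier to justify, though that is a design choice the notebook does not make.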

In [55]:
merged_df = pd.merge(val_df, knn_avgs, how='left', on=['book_id'])
merged_df

Unnamed: 0,user_id,book_id,review_id,is_read,rating,review_text_incomplete,date_added,date_updated,read_at,started_at,isbn,text_reviews_count,series,country_code,language_code,popular_shelves,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,link,authors,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,ratings_count,work_id,title,title_without_series,shelved,read,rated,recommended,year_month_added,year_month_updated,pub_date,is_translated,main_author,is_in_series,series_length,title_description,book_rating
0,acb597b9ea2c89354bb8d70d8d4da103,428945,a3170601bc639f22c91fe97ea9620aa2,False,0,,Mon Jul 20 09:48:38 -0700 2015,Mon Jul 20 09:48:41 -0700 2015,,,1590170369,43,[],US,eng,"[{'count': '1297', 'name': 'to-read'}, {'count...",,False,3.95,,"['55213', '428527', '1504664', '1391333', '388...","""This writing has to do with some things I saw...",Paperback,https://www.goodreads.com/book/show/428945.In_...,"[{'author_id': '114265', 'role': ''}, {'author...",NYRB Classics,224.0,31.0,9781590170366,7.0,,2003.0,https://www.goodreads.com/book/show/428945.In_...,https://images.gr-assets.com/books/1320448191m...,317,1850287,In Parenthesis,In Parenthesis,1,0,0,0,2015-07,2015-07,2003-07,0,114265.0,0,1,"In Parenthesis ""This writing has to do with so...",2.103316
1,31d925e1d94d08097d98b67426266953,428945,7f42902061fe623cdb852b27c1e66c69,False,0,,Sat Dec 03 07:22:47 -0800 2016,Sat Dec 03 07:22:48 -0800 2016,,,1590170369,43,[],US,eng,"[{'count': '1297', 'name': 'to-read'}, {'count...",,False,3.95,,"['55213', '428527', '1504664', '1391333', '388...","""This writing has to do with some things I saw...",Paperback,https://www.goodreads.com/book/show/428945.In_...,"[{'author_id': '114265', 'role': ''}, {'author...",NYRB Classics,224.0,31.0,9781590170366,7.0,,2003.0,https://www.goodreads.com/book/show/428945.In_...,https://images.gr-assets.com/books/1320448191m...,317,1850287,In Parenthesis,In Parenthesis,1,0,0,0,2016-12,2016-12,2003-07,0,114265.0,0,1,"In Parenthesis ""This writing has to do with so...",2.103316
2,04605a93b883a5fb75e8aac38ec9b8c7,428945,3beec74480ed88a2aec6b3d6d6b7b5b3,False,0,,Wed Oct 20 01:39:59 -0700 2010,Wed Oct 20 02:16:50 -0700 2010,,,1590170369,43,[],US,eng,"[{'count': '1297', 'name': 'to-read'}, {'count...",,False,3.95,,"['55213', '428527', '1504664', '1391333', '388...","""This writing has to do with some things I saw...",Paperback,https://www.goodreads.com/book/show/428945.In_...,"[{'author_id': '114265', 'role': ''}, {'author...",NYRB Classics,224.0,31.0,9781590170366,7.0,,2003.0,https://www.goodreads.com/book/show/428945.In_...,https://images.gr-assets.com/books/1320448191m...,317,1850287,In Parenthesis,In Parenthesis,1,0,0,0,2010-10,2010-10,2003-07,0,114265.0,0,1,"In Parenthesis ""This writing has to do with so...",2.103316
3,63ae9b6e0a883bae8b86861e00d2df9d,428945,a02503255bac6d289fbd622e3714eacf,False,0,,Sun Sep 27 08:15:39 -0700 2015,Sun Sep 27 08:15:40 -0700 2015,,,1590170369,43,[],US,eng,"[{'count': '1297', 'name': 'to-read'}, {'count...",,False,3.95,,"['55213', '428527', '1504664', '1391333', '388...","""This writing has to do with some things I saw...",Paperback,https://www.goodreads.com/book/show/428945.In_...,"[{'author_id': '114265', 'role': ''}, {'author...",NYRB Classics,224.0,31.0,9781590170366,7.0,,2003.0,https://www.goodreads.com/book/show/428945.In_...,https://images.gr-assets.com/books/1320448191m...,317,1850287,In Parenthesis,In Parenthesis,1,0,0,0,2015-09,2015-09,2003-07,0,114265.0,0,1,"In Parenthesis ""This writing has to do with so...",2.103316
4,5aa8d21f619b434e5fe64932ae25908d,428945,6938fc5c837b8799c39317d9491e9f43,False,0,,Mon Jun 30 08:32:36 -0700 2014,Mon Jun 30 08:32:36 -0700 2014,,,1590170369,43,[],US,eng,"[{'count': '1297', 'name': 'to-read'}, {'count...",,False,3.95,,"['55213', '428527', '1504664', '1391333', '388...","""This writing has to do with some things I saw...",Paperback,https://www.goodreads.com/book/show/428945.In_...,"[{'author_id': '114265', 'role': ''}, {'author...",NYRB Classics,224.0,31.0,9781590170366,7.0,,2003.0,https://www.goodreads.com/book/show/428945.In_...,https://images.gr-assets.com/books/1320448191m...,317,1850287,In Parenthesis,In Parenthesis,1,0,0,0,2014-06,2014-06,2003-07,0,114265.0,0,1,"In Parenthesis ""This writing has to do with so...",2.103316
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49031,080eb4d63d102e825ee057c46ec5adfe,36390042,0eaec934ccf6c16318270317b5765f55,True,4,,Fri Nov 03 00:43:02 -0700 2017,Fri Nov 03 00:43:29 -0700 2017,Fri Nov 03 00:43:29 -0700 2017,Fri Nov 03 00:43:04 -0700 2017,,3,[],US,eng,"[{'count': '13', 'name': 'to-read'}, {'count':...",,False,3.84,,[],What are the places that have transformed you?...,,https://www.goodreads.com/book/show/36390042-t...,"[{'author_id': '2904025', 'role': ''}, {'autho...",Summit Books,,,,9.0,,2017.0,https://www.goodreads.com/book/show/36390042-t...,https://images.gr-assets.com/books/1507687390m...,19,58081365,The Map That Contains Us,The Map That Contains Us,1,1,1,1,2017-11,2017-11,2017-09,0,2904025.0,0,1,The Map That Contains Us What are the places t...,
49032,a6682963bd61f55102656458a89262b2,36390042,2255d7337428d816820d7fd41e42a75c,True,5,"<i> ""You don't need a half<br />To be whole - ...",Fri Oct 13 15:54:40 -0700 2017,Sun Oct 15 18:29:44 -0700 2017,,,,3,[],US,eng,"[{'count': '13', 'name': 'to-read'}, {'count':...",,False,3.84,,[],What are the places that have transformed you?...,,https://www.goodreads.com/book/show/36390042-t...,"[{'author_id': '2904025', 'role': ''}, {'autho...",Summit Books,,,,9.0,,2017.0,https://www.goodreads.com/book/show/36390042-t...,https://images.gr-assets.com/books/1507687390m...,19,58081365,The Map That Contains Us,The Map That Contains Us,1,1,1,1,2017-10,2017-10,2017-09,0,2904025.0,0,1,The Map That Contains Us What are the places t...,
49033,d843e9a894be06cb48a5862080c426a7,274515,6f03e28003b438022227c62fccf9a37d,True,4,,Sun Dec 15 08:02:33 -0800 2013,Sun Aug 23 09:59:23 -0700 2015,,Sun Dec 15 00:00:00 -0800 2013,9626344172,3,[],US,eng,"[{'count': '1659', 'name': 'school'}, {'count'...",,False,4.01,,"['383337', '92250', '52823', '18545', '12287',...",This outstanding historical recording made in ...,Audio CD,https://www.goodreads.com/book/show/274515.Hamlet,"[{'author_id': '947', 'role': ''}, {'author_id...",Naxos Audiobooks,,1.0,9789626344170,8.0,Abridged,2006.0,https://www.goodreads.com/book/show/274515.Hamlet,https://s.gr-assets.com/assets/nophoto/book/11...,17,1885548,Hamlet: John Gielgud's Classic 1948 Recording,Hamlet: John Gielgud's Classic 1948 Recording,1,1,1,1,2013-12,2015-08,2006-08,0,947.0,0,1,Hamlet: John Gielgud's Classic 1948 Recording ...,
49034,f1849b2ba733d94cd2d8825325667a10,274515,b65a64d99787d345d294686ef942abab,False,0,,Fri Apr 01 05:53:20 -0700 2016,Fri Apr 01 05:53:21 -0700 2016,,,9626344172,3,[],US,eng,"[{'count': '1659', 'name': 'school'}, {'count'...",,False,4.01,,"['383337', '92250', '52823', '18545', '12287',...",This outstanding historical recording made in ...,Audio CD,https://www.goodreads.com/book/show/274515.Hamlet,"[{'author_id': '947', 'role': ''}, {'author_id...",Naxos Audiobooks,,1.0,9789626344170,8.0,Abridged,2006.0,https://www.goodreads.com/book/show/274515.Hamlet,https://s.gr-assets.com/assets/nophoto/book/11...,17,1885548,Hamlet: John Gielgud's Classic 1948 Recording,Hamlet: John Gielgud's Classic 1948 Recording,1,0,0,0,2016-04,2016-04,2006-08,0,947.0,0,1,Hamlet: John Gielgud's Classic 1948 Recording ...,
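
The empty `book_rating` values in the last rows are books that do not appear in the training data. One way to count such rows is the `indicator` option of `pd.merge`. The sketch below uses hypothetical frames rather than `val_df` and `knn_avgs`.

```python
import pandas as pd

# Hypothetical validation rows and per-book predictions.
val = pd.DataFrame({"book_id": [10, 10, 99], "rating": [0, 4, 5]})
preds = pd.DataFrame({"book_id": [10], "book_rating": [2.1]})

# `indicator=True` adds a `_merge` column recording where each row matched.
merged = pd.merge(val, preds, how="left", on=["book_id"], indicator=True)
n_unseen = int((merged["_merge"] == "left_only").sum())
print(n_unseen)  # 1
```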
