# Collaborative Filtering Baseline Model

In this notebook we build an item-to-item collaborative filtering model to serve as a baseline.

### Imports

In [1]:
import json
import os
import random
import numpy as np
import pandas as pd

random.seed(42)
np.random.seed(42)

### Load Training Data

In [2]:
OUTPUT_DATA_DIR = "./output_data/"

train_df = pd.read_csv(OUTPUT_DATA_DIR+"interactions_training.csv")

In [3]:
pd.set_option('display.max_columns', None)

### Load Validation Data

In [4]:
val_df = pd.read_csv(OUTPUT_DATA_DIR+"interactions_validation.csv")


### Collaborative Filtering - Item to Item Similarity Based on Ratings

Each book is represented by its vector of user ratings, and similarity between books is measured on those vectors.

We use kNN, so the predicted rating for a book is the mean of the average ratings of its `k` closest books.
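
The prediction rule can be illustrated on a toy matrix. The sketch below is plain NumPy rather than the scikit-learn pipeline used in this notebook; the ratings and the helper name `knn_average_prediction` are made up for illustration, and zeros are assumed to mean "no rating".

```python
import numpy as np

# Toy item-by-user rating matrix (rows: books, columns: users).
# Illustrative values only; 0 is assumed to mean "no rating".
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 2.0],
])

def knn_average_prediction(ratings, k):
    """Predict each book's rating as the mean of the average ratings of its k closest books."""
    # Average rating per book over the users who actually rated it.
    rated = ratings > 0
    book_avg = ratings.sum(axis=1) / rated.sum(axis=1)

    # Cosine similarity between every pair of books.
    unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
    similarity = unit @ unit.T

    # Indices of the k most similar books per row (a book counts as its own neighbour).
    neighbours = np.argsort(-similarity, axis=1)[:, :k]
    return book_avg[neighbours].mean(axis=1)

preds = knn_average_prediction(ratings, k=2)
print(preds)
```

With `k=2`, the first two books are each other's closest neighbours and so share one prediction, as do the last two.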

In [5]:
train_df['book_id'] = train_df['book_id'].astype("category")
train_df['user_id'] = train_df['user_id'].astype("category")

In [6]:
import scipy.sparse as sp

item_matrix = train_df.pivot(index='book_id', columns='user_id', values='rating').fillna(0)
item_train_matrix = sp.csr_matrix(item_matrix.values)
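
The `pivot(...).fillna(0)` call materialises a dense books-by-users table before it is converted to CSR. If memory becomes a concern, one possible alternative is to build the sparse matrix directly from the category codes. The sketch below uses a made-up `toy_df` and assumes that there are no missing ids and that each (book, user) pair appears at most once; `csr_matrix` sums duplicate entries, whereas `pivot` raises an error on them.

```python
import pandas as pd
import scipy.sparse as sp

# Toy interactions, illustrative only.
toy_df = pd.DataFrame({
    "book_id": ["b1", "b1", "b2", "b3"],
    "user_id": ["u1", "u2", "u2", "u3"],
    "rating": [5, 3, 4, 2],
})
toy_df["book_id"] = toy_df["book_id"].astype("category")
toy_df["user_id"] = toy_df["user_id"].astype("category")

# Category codes give integer row/column positions in category order.
rows = toy_df["book_id"].cat.codes.to_numpy()
cols = toy_df["user_id"].cat.codes.to_numpy()
shape = (toy_df["book_id"].cat.categories.size,
         toy_df["user_id"].cat.categories.size)

# Only the observed ratings are stored; everything else is an implicit zero.
sparse_matrix = sp.csr_matrix((toy_df["rating"].to_numpy(), (rows, cols)), shape=shape)
print(sparse_matrix.toarray())
```

Row `i` of `sparse_matrix` corresponds to `toy_df["book_id"].cat.categories[i]`, which plays the role that `item_matrix.index` plays in the cells that follow.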

We now fit kNN models for several values of `k`. There are far more users than books, so we keep `k` relatively small and start with `k = [1, 2, 5, 10]`.
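
One detail worth keeping in mind when reading results for small `k`: in scikit-learn, `kneighbors(X)` called with an explicit query matrix treats every query row as a new point, so a row that was also in the fitted data is normally returned as its own nearest neighbour. Calling `kneighbors()` with no argument excludes each point from its own neighbour list. The toy matrix below is illustrative only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three toy item vectors, none of them parallel to another.
X = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 5.0],
])

knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=1).fit(X)

# Querying with the training matrix: each row is its own nearest neighbour.
with_self = knn.kneighbors(X, return_distance=False)

# Querying without an argument: the point itself is not considered.
without_self = knn.kneighbors(return_distance=False)

print(with_self.ravel(), without_self.ravel())
```

If the fitted matrix is also used as the query, `k = 1` therefore amounts to predicting each book from its own average, which may be worth remembering when comparing training scores across values of `k`.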

In [7]:
from sklearn.neighbors import NearestNeighbors

rated_df = train_df[train_df['rated'] == 1]

train_item_avg = rated_df.groupby(rated_df['book_id'], as_index=False)['rating'].mean()
train_item_avg.columns = ['book_id', 'book_average']
train_item_avg = train_item_avg.set_index('book_id')

In [8]:
def build_knn_model(train_matrix, k):
    """Builds a kNN model on `train_matrix` with `k` neighbours.
    
    Parameters
    ----------
    train_matrix: sp.csr_matrix
        The sparse matrix used to build the kNN model.
    k: int
        The number of neighbours to use in the kNN model.
    
    Returns
    -------
    NearestNeighbors
        A NearestNeighbors model fit to `train_matrix`.
    
    """
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=k)
    model_knn.fit(train_matrix)
    return model_knn

In [9]:
def get_item_preds_from_knn(knn_model, train_matrix, items, item_avgs):
    """Gets the kNN predictions for the items in `items`.
    
    This assumes that every item in items was fit on the
    knn_model. This is just a precomputation step to get
    the predictions for items in the training set.
    
    Parameters
    ----------
    knn_model: NearestNeighbors
        A NearestNeighbors model that has been fit.
    train_matrix: sp.csr_matrix
        The sparse matrix representing the training data.
    items: np.array
        An array of item indices for items in `knn_model`.
    item_avgs: pd.DataFrame
        A pandas dataframe containing the average rating for
        each item in `items`.
    
    Returns
    -------
    pd.DataFrame
        A DataFrame containing the predicted rating for each item
        in `items`.
    
    """
    item_neighbors = np.asarray(knn_model.kneighbors(train_matrix, return_distance=False))
    knn_avgs = np.zeros(len(item_neighbors))   # this is more efficient than appending multiple times (no resizing)
    for i in range(len(item_neighbors)):
        knn_avgs[i] = item_avgs['book_average'][items[item_neighbors[i]]].mean()    # average of average ratings for neighbors
    return pd.concat([pd.DataFrame(items, columns=['book_id']),
                      pd.DataFrame(knn_avgs, columns=['book_rating'])],
                    axis=1)

In [10]:
def predict_ratings(X, item_preds, default_val, merge_col):
    """Predicts the item ratings for the items in `X`.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    item_preds: pd.DataFrame
        The DataFrame of predicted ratings for the items.
    default_val: float
        A default rating used for unseen items.
    merge_col: str
        The column to merge on.
    
    Returns
    -------
    pd.Series
        A binary prediction for each record in `X`: 1 if the
        predicted item rating is greater than 3, otherwise 0.
    
    """
    id_col = "{}_id".format(merge_col)
    rating_col = "{}_rating".format(merge_col)
    df_item = pd.merge(X, item_preds, how='left', on=[id_col])
    df_item[rating_col] = df_item[rating_col].fillna(default_val)
    df_item.index = X.index
    return df_item[rating_col].apply(lambda x: 1 if x > 3 else 0)

In [11]:
def get_item_knn_train_validation_preds(train_df, val_df, train_matrix, k, items, item_avgs):
    """Gets predictions on `train_df` and `val_df` from a kNN model.
    
    Parameters
    ----------
    train_df: pd.DataFrame
        A DataFrame of the training data.
    val_df: pd.DataFrame
        A DataFrame of the validation data.
    train_matrix: sp.csr_matrix
        The sparse matrix used to train the kNN model.
    k: int
        The number of neighbours in the kNN model.
    items: np.array
        An array of strings representing the ids of the
        items used in training.
    item_avgs: pd.DataFrame
        A DataFrame containing the average rating for the
        items in `items`.
    
    Returns
    -------
    pd.Series, pd.Series
        Binary predictions on the training and validation sets, respectively.
    
    """
    knn_model = build_knn_model(train_matrix, k)
    knn_preds = get_item_preds_from_knn(knn_model, train_matrix, items, item_avgs)
    
    # prediction for a new book
    new_book_vec = np.zeros(train_matrix.shape[1])
    new_book_neighbours = knn_model.kneighbors(new_book_vec.reshape(1, -1), return_distance=False)
    new_book_pred = item_avgs['book_average'][items[new_book_neighbours[0]]].mean()
    
    train_pred = predict_ratings(train_df, knn_preds, new_book_pred, "book")
    val_pred = predict_ratings(val_df, knn_preds, new_book_pred, "book")
    return train_pred, val_pred

In [12]:
from sklearn.metrics import roc_auc_score

k_vals = [1, 2, 5, 10]
# Note: despite the "MSE" in their names, these lists hold ROC AUC scores.
train_MSEs = [None] * len(k_vals)
val_MSEs = [None] * len(k_vals)

for i in range(len(k_vals)):
    k = k_vals[i]
    print("kNN with k = {}".format(k))
    print("---------------")
    train_preds, val_preds = get_item_knn_train_validation_preds(
        train_df, val_df, item_train_matrix, k, item_matrix.index, train_item_avg)
    train_MSEs[i] = roc_auc_score(train_df['recommended'], train_preds)
    val_MSEs[i] = roc_auc_score(val_df['recommended'], val_preds)
    print("Training AUC: {}".format(train_MSEs[i]))
    print("Validation AUC: {}".format(val_MSEs[i]))
    print()

kNN with k = 1
---------------
Training AUC: 0.5067737123901904
Validation AUC: 0.5018497724649386

kNN with k = 2
---------------
Training AUC: 0.5032251774455084
Validation AUC: 0.5015468723614261

kNN with k = 5
---------------
Training AUC: 0.5000900069813997
Validation AUC: 0.49993359484898875

kNN with k = 10
---------------
Training AUC: 0.5
Validation AUC: 0.5



The choice of `k` makes almost no difference here: every AUC is within 0.007 of 0.5, so these predictions are essentially no better than chance.
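
Part of the reason may lie in how the scores are evaluated: `predict_ratings` thresholds each predicted rating at 3 before it reaches `roc_auc_score`, and AUC only measures ranking. If every thresholded prediction lands in the same class, the AUC is exactly 0.5 regardless of the labels. The sketch below uses a small hand-written `auc` helper and made-up ratings (both are illustrative and not part of this notebook) to show the effect.

```python
def auc(labels, scores):
    """ROC AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted ratings; every rating is above the threshold of 3.
labels = [1, 1, 0, 1, 0, 0]
predicted_ratings = [4.2, 3.6, 3.5, 4.0, 3.8, 3.2]

# Same rule as predict_ratings: 1 if the predicted rating exceeds 3, else 0.
thresholded = [1 if r > 3 else 0 for r in predicted_ratings]

auc_thresholded = auc(labels, thresholded)       # all predictions identical
auc_continuous = auc(labels, predicted_ratings)  # ranking information kept

print(auc_thresholded, auc_continuous)
```

Here the thresholded scores give an AUC of 0.5, while the raw predicted ratings give 8/9, so scoring the unthresholded predictions might be worth trying.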

In [13]:
RESULTS_DIR = './results/'

if not os.path.exists(RESULTS_DIR):
    os.makedirs(RESULTS_DIR)

In [14]:
item_item_cf = pd.DataFrame({'k': k_vals,
                             'trainMSE': train_MSEs,
                             'valMSE': val_MSEs})
item_item_cf.to_csv(RESULTS_DIR+"itemToItemCF.csv", index=False)

### Re-Running With All Ratings

Here the per-book averages are computed over every interaction in `train_df`, not only over the rows with `rated == 1`.

In [15]:
train_item_avg = train_df.groupby(train_df['book_id'], as_index=False)['rating'].mean()
train_item_avg.columns = ['book_id', 'book_average']
train_item_avg = train_item_avg.set_index('book_id')

In [16]:
from sklearn.metrics import roc_auc_score

k_vals = [1, 2, 5, 10]
train_MSEs = [None for _ in range(4)]
val_MSEs = [None for _ in range(4)]

for i in range(len(k_vals)):
    k = k_vals[i]
    print("kNN with k = {}".format(k))
    print("---------------")
    train_preds, val_preds = get_item_knn_train_validation_preds(
        train_df, val_df, item_train_matrix, k, item_matrix.index, train_item_avg)
    train_MSEs[i] = roc_auc_score(train_df['recommended'], train_preds)
    val_MSEs[i] = roc_auc_score(val_df['recommended'], val_preds)
    print("Training AUC: {}".format(train_MSEs[i]))
    print("Validation AUC: {}".format(val_MSEs[i]))
    print()

kNN with k = 1
---------------
Training AUC: 0.6742108369200269
Validation AUC: 0.6050032017531815

kNN with k = 2
---------------
Training AUC: 0.6409586002368054
Validation AUC: 0.5868506978132447

kNN with k = 5
---------------
Training AUC: 0.6416863174652611
Validation AUC: 0.5798940542914955

kNN with k = 10
---------------
Training AUC: 0.6221939209392602
Validation AUC: 0.5644518140016275



In [17]:
item_item_cf = pd.DataFrame({'k': k_vals,
                             'trainMSE': train_MSEs,
                             'valMSE': val_MSEs})
item_item_cf.to_csv(RESULTS_DIR+"itemToItemCF.csv", index=False)

### Performing Item-Item Similarity Based on Books Read

In [18]:
item_matrix = train_df.pivot(index='book_id', columns='user_id', values='read').fillna(0)
item_train_matrix = sp.csr_matrix(item_matrix.values)

In [19]:
recommended_df = train_df[['book_id', 'recommended']]

In [20]:
def get_recommend_preds_from_knn(knn_model, train_matrix, items, recommendation_df):
    """Gets the kNN predictions for the items in `items`.
    
    This assumes that every item in items was fit on the
    knn_model. This is just a precomputation step to get
    the predictions for items in the training set.
    
    Parameters
    ----------
    knn_model: NearestNeighbors
        A NearestNeighbors model that has been fit.
    train_matrix: sp.csr_matrix
        The sparse matrix representing the training data.
    items: np.array
        An array of item indices for items in `knn_model`.
    recommendation_df: pd.DataFrame
        A pandas dataframe containing the book_id and a
        column indicating whether the user recommended
        the book or not.
    
    Returns
    -------
    pd.DataFrame
        A DataFrame containing the predicted recommendation for
        each item in `items`.
    
    """
    item_neighbors = np.asarray(knn_model.kneighbors(train_matrix, return_distance=False))
    knn_avgs = np.zeros(len(item_neighbors))   # this is more efficient than appending multiple times (no resizing)
    for i in range(len(item_neighbors)):
        knn_avgs[i] = round(recommendation_df[recommendation_df['book_id'].isin(items[item_neighbors[i]])]['recommended'].mean())
    return pd.concat([pd.DataFrame(items, columns=['book_id']),
                      pd.DataFrame(knn_avgs, columns=['book_recommend'])],
                    axis=1)
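
The loop above filters the full `recommendation_df` once per book. A possible speed-up, sketched here on a made-up `toy` frame with a hypothetical helper `pooled_recommend_rate`, is to aggregate per-book sums and counts once and combine them per neighbour set. This reproduces the pooled mean over all interaction rows of the neighbouring books, which is not the same as the mean of per-book means.

```python
import pandas as pd

# Toy interactions, illustrative only.
toy = pd.DataFrame({
    "book_id": ["a", "a", "a", "b", "c", "c"],
    "recommended": [1, 1, 0, 0, 1, 0],
})

# One pass over the data: number of recommendations and of rows per book.
stats = toy.groupby("book_id")["recommended"].agg(["sum", "count"])

def pooled_recommend_rate(neighbour_ids):
    """Share of recommended rows across all interactions of the given books."""
    picked = stats.loc[neighbour_ids]
    return picked["sum"].sum() / picked["count"].sum()

rate = pooled_recommend_rate(["a", "b"])

# Same quantity computed the way the loop does it, for comparison.
direct = toy[toy["book_id"].isin(["a", "b"])]["recommended"].mean()

print(rate, direct)
```

Both values are 0.5 for this toy input, whereas averaging the two per-book means (2/3 and 0) would give 1/3.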

In [21]:
def predict_recommend(X, item_preds, default_val, merge_col):
    """Predicts the recommendation label for the items in `X`.
    
    Parameters
    ----------
    X: pd.DataFrame
        The DataFrame of features.
    item_preds: pd.DataFrame
        The DataFrame of predicted recommendations for the items.
    default_val: float
        A default prediction used for unseen items.
    merge_col: str
        The column to merge on.
    
    Returns
    -------
    pd.Series
        The predicted recommendation for each record in `X`,
        indexed like `X`.
    
    """
    id_col = "{}_id".format(merge_col)
    recommend_col = "{}_recommend".format(merge_col)
    df_item = pd.merge(X, item_preds, how='left', on=[id_col])
    df_item[recommend_col] = df_item[recommend_col].fillna(default_val)
    df_item.index = X.index
    return df_item[recommend_col]

In [22]:
def get_recommend_knn_train_validation_preds(train_df, val_df, train_matrix, k, items, recommendation_df):
    """Gets predictions on `train_df` and `val_df` from a kNN model.
    
    Parameters
    ----------
    train_df: pd.DataFrame
        A DataFrame of the training data.
    val_df: pd.DataFrame
        A DataFrame of the validation data.
    train_matrix: sp.csr_matrix
        The sparse matrix used to train the kNN model.
    k: int
        The number of neighbours in the kNN model.
    items: np.array
        An array of strings representing the ids of the
        items used in training.
    recommendation_df: pd.DataFrame
        A pandas dataframe containing the book_id and a
        column indicating whether the user recommended
        the book or not.
    
    Returns
    -------
    pd.Series, pd.Series
        Predicted recommendations on the training and validation sets, respectively.
    
    """
    knn_model = build_knn_model(train_matrix, k)
    knn_preds = get_recommend_preds_from_knn(knn_model, train_matrix, items, recommendation_df)
    
    # prediction for a new book
    new_book_vec = np.zeros(train_matrix.shape[1])
    new_book_neighbours = knn_model.kneighbors(new_book_vec.reshape(1, -1), return_distance=False)
    new_book_pred = round(recommendation_df[recommendation_df['book_id'].isin(items[new_book_neighbours[0]])]['recommended'].mean())
    
    train_pred = predict_recommend(train_df, knn_preds, new_book_pred, "book")
    val_pred = predict_recommend(val_df, knn_preds, new_book_pred, "book")
    return train_pred, val_pred

In [23]:
k_vals = [1, 2, 5, 10]
train_MSEs = [None for _ in range(4)]
val_MSEs = [None for _ in range(4)]

for i in range(len(k_vals)):
    k = k_vals[i]
    print("kNN with k = {}".format(k))
    print("---------------")
    train_preds, val_preds = get_recommend_knn_train_validation_preds(
        train_df, val_df, item_train_matrix, k, item_matrix.index, recommended_df)
    train_MSEs[i] = roc_auc_score(train_df['recommended'], train_preds)
    val_MSEs[i] = roc_auc_score(val_df['recommended'], val_preds)
    print("Training AUC: {}".format(train_MSEs[i]))
    print("Validation AUC: {}".format(val_MSEs[i]))
    print()

kNN with k = 1
---------------
Training AUC: 0.6716172244378682
Validation AUC: 0.6112567493776453

kNN with k = 2
---------------
Training AUC: 0.6675420355936391
Validation AUC: 0.6102760423159885

kNN with k = 5
---------------
Training AUC: 0.6678445290355558
Validation AUC: 0.6059702134196963

kNN with k = 10
---------------
Training AUC: 0.6586977441126709
Validation AUC: 0.6001336897432307



In [24]:
item_item_cf = pd.DataFrame({'k': k_vals,
                             'trainMSE': train_MSEs,
                             'valMSE': val_MSEs})
item_item_cf.to_csv(RESULTS_DIR+"itemToItemCF.csv", index=False)

We would have liked to run user-user collaborative filtering as well, but the user base is too large for this approach, and chainRec appears to be a superior model.
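
To give a rough sense of scale, brute-force kNN compares every pair of rows, so the work grows quadratically with the number of rows. The counts below are hypothetical and are not taken from this dataset; `pairwise_comparisons` is an illustrative helper.

```python
def pairwise_comparisons(n_rows):
    """Number of distinct row pairs a brute-force neighbour search must compare."""
    return n_rows * (n_rows - 1) // 2

# Hypothetical sizes, NOT taken from this dataset.
n_books = 25_000
n_users = 500_000

item_pairs = pairwise_comparisons(n_books)   # item-item: rows are books
user_pairs = pairwise_comparisons(n_users)   # user-user: rows are users

print(item_pairs, user_pairs, user_pairs / item_pairs)
```

Under these assumed sizes, having 20 times as many users as books translates into roughly 400 times as many pairwise comparisons.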