# Collaborative Filtering Notebook with `surprise` (demo)

## Model Training

The following cell loads the ratings and converts them to a `surprise.Dataset` object. `Dataset.load_from_df` expects the columns in the order user, item, rating.

In [48]:
from surprise import Dataset, NormalPredictor, Reader
import pandas as pd
ratings_df = pd.read_csv(
    filepath_or_buffer="../Data/Raw/ratings.csv",
    dtype={
        "user_id": "Int32",
        "book_id": "Int32",
        "rating": "Int8"
    }
)
ratings_sdata = Dataset.load_from_df(
    df=ratings_df,
    reader=Reader(rating_scale=(1, 5))
)
ratings_df.nunique()

user_id    53424
book_id    10000
rating         5
dtype: int64

We can choose the similarity measure used for collaborative filtering (CF); the available measures are implemented in the `surprise.similarities` module.

In [56]:
from surprise import KNNBasic
sim_options = {
    "name": "cosine",  # options: cosine, msd, pearson, pearson_baseline
    "user_based": False,  # False = item-based CF; True = user-based CF
    "shrinkage": 0,  # only used when "name" is pearson_baseline; can help prevent overfitting
    "min_support": 1,  # pairs with fewer common ratings than this get similarity 0
}
algo = KNNBasic(sim_options=sim_options)  # other algorithms: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html
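
To make the `cosine` and `min_support` options concrete, here is a minimal pure-Python sketch of a cosine similarity between two items, computed only over the users who rated both. The helper name `cosine_sim` and the toy ratings are hypothetical; this illustrates the idea and is not `surprise`'s actual implementation.

```python
from math import sqrt


def cosine_sim(ratings_a, ratings_b, min_support=1):
    """Cosine similarity between two items over their common raters.

    ratings_a / ratings_b map user id -> rating for each item.
    Returns 0.0 when fewer than `min_support` users rated both items.
    """
    common_users = ratings_a.keys() & ratings_b.keys()
    if len(common_users) < min_support or not common_users:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common_users)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common_users))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common_users))
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): user id -> rating
item_x = {"u1": 5, "u2": 3, "u3": 4}
item_y = {"u1": 4, "u2": 2, "u4": 5}

print(cosine_sim(item_x, item_y))                 # two common raters: u1 and u2
print(cosine_sim(item_x, item_y, min_support=3))  # too few common raters
```

With a single common rater the value is exactly 1.0 whatever the two ratings are, which is one reason to consider a `min_support` larger than 1.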

To train, we need to convert a `surprise.Dataset` object to a `surprise.Trainset` object.\
Suppose we want to train on the whole dataset:

In [57]:
ratings_strain = ratings_sdata.build_full_trainset()

Now fit `algo` on `ratings_strain`:

In [60]:
import time
start_time = time.time()
algo.fit(ratings_strain)
print(f"Elapsed time: {time.time() - start_time} seconds")

Computing the cosine similarity matrix...
Done computing similarity matrix.
Elapsed time: 43.53758215904236 seconds


## Compute Predictions, Training Loss

We can measure the training loss by predicting on the training data itself. First build a testset from the trainset, then make predictions, then pass them to the accuracy metrics.\
Predicting over the entire training set takes a long time (about 6 million ratings).

In [80]:
ratings_stest = ratings_strain.build_testset()
start_time = time.time()
predictions = algo.test(ratings_stest)
print(f"Elapsed time: {time.time() - start_time} seconds")

Elapsed time: 1157.4575111865997 seconds


A peek at `predictions`:

In [100]:
print(len(predictions))
for x in predictions[12345:12348]:
    print(x)

5976479
user: 219        item: 4338       r_ui = 4.00   est = 4.15   {'actual_k': 40, 'was_impossible': False}
user: 219        item: 190        r_ui = 3.00   est = 4.05   {'actual_k': 40, 'was_impossible': False}
user: 219        item: 4629       r_ui = 3.00   est = 4.03   {'actual_k': 40, 'was_impossible': False}


where `r_ui` is the true rating and `est` is the estimated rating.
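
The `actual_k` entry in the details reports how many neighbors contributed to the estimate. As a rough sketch of the similarity-weighted average that a basic KNN predictor computes (the function name `knn_basic_estimate` and the toy neighbor list are hypothetical, and the real `KNNBasic` may differ in its details):

```python
import heapq


def knn_basic_estimate(neighbors, k=40, min_k=1):
    """Similarity-weighted average of neighbor ratings.

    neighbors: list of (similarity, rating) pairs, one per item the user
    has already rated. Only the k most similar pairs are considered, and
    of those only the ones with a positive similarity are used.
    Returns (estimate, actual_k), or None when too few neighbors qualify.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    used = [(sim, rating) for sim, rating in top_k if sim > 0]
    if len(used) < min_k or not used:
        return None
    weighted_sum = sum(sim * rating for sim, rating in used)
    sim_sum = sum(sim for sim, _ in used)
    return weighted_sum / sim_sum, len(used)


# Toy neighbors (hypothetical): (similarity to the target item, user's rating)
toy_neighbors = [(0.9, 5), (0.5, 3), (0.0, 1), (0.6, 4)]
estimate, actual_k = knn_basic_estimate(toy_neighbors, k=2)
print(f"est = {estimate:.2f}, actual_k = {actual_k}")
```

In the predictions shown, `actual_k` is 40 each time, meaning that 40 neighbors were used for each estimate.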

Training metrics can be computed with the functions in `surprise.accuracy`; all four are listed here.

In [101]:
from surprise.accuracy import rmse, mse, mae, fcp
start_time = time.time()
for n, f in [
    ("Root mean square error", rmse),
    ("Mean squared error", mse),
    ("Mean absolute error", mae),
    ("Fraction of concordant pairs", fcp)
]:
    print(f"{n}: {f(predictions, verbose=False)}")
print(f"Elapsed time evaluating losses: {time.time() - start_time} seconds")

Root mean square error: 0.7923006406161057
Mean squared error: 0.6277403051206915
Mean absolute error: 0.6057735404799971
Fraction of concordant pairs: 0.8216831080054721
Elapsed time evaluating losses: 63.667232513427734 seconds
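
To show what the three error metrics measure, here is a small worked sketch in plain Python that applies the textbook definitions to the three predictions printed earlier. The variable names are hypothetical. FCP is left out because it requires counting concordant pairs per user.

```python
from math import sqrt

# (true rating r_ui, estimated rating est) for the three predictions printed earlier
pairs = [(4.00, 4.15), (3.00, 4.05), (3.00, 4.03)]

errors = [est - r_ui for r_ui, est in pairs]

mse_value = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse_value = sqrt(mse_value)                            # root mean square error
mae_value = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(f"MSE:  {mse_value:.4f}")
print(f"RMSE: {rmse_value:.4f}")
print(f"MAE:  {mae_value:.4f}")
```

MSE is the square of RMSE, which is consistent with the values reported for the full training set (0.7923² ≈ 0.6277).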


## Query KNN (for items)

We trained the model with `user_based=False`, so we can query similar books (compare with the corresponding section of the `bert` notebook). Let us look at the most similar books for the first three `book_id`s, all of which are heavily rated:

In [116]:
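# groupby sorts by book_id, so head(3) keeps the three lowest book IDs
# rather than the three largest counts.
books_top_3_ids_df = ratings_df\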
                        .groupby("book_id", as_index=False)\
                        .agg(count=("user_id", "count"))\
                        .head(3)
books_top_3_ids_df

| | book_id | count |
|---|---|---|
| 0 | 1 | 22806 |
| 1 | 2 | 21850 |
| 2 | 3 | 16931 |


Book information is retrieved from another dataset and joined:

In [132]:
books_df = pd.read_csv(
    filepath_or_buffer="../Data/Raw/books_enriched.csv"
)[["book_id", "authors", "title"]]
books_df.merge(
    right=books_top_3_ids_df,
    left_on=["book_id"],
    right_on=["book_id"]
)

| | book_id | authors | title | count |
|---|---|---|---|---|
| 0 | 1 | ['Suzanne Collins'] | The Hunger Games (The Hunger Games, #1) | 22806 |
| 1 | 2 | ['J.K. Rowling', 'Mary GrandPré'] | Harry Potter and the Sorcerer's Stone (Harry P... | 21850 |
| 2 | 3 | ['Stephenie Meyer'] | Twilight (Twilight, #1) | 16931 |


Then we can call `algo.get_neighbors()`:

In [145]:
books_top_3_ids = books_top_3_ids_df["book_id"].to_list()
recommend_top_3_for_top_3 = [algo.get_neighbors(book_id, k=3) for book_id in books_top_3_ids]
recommend_top_3_for_top_3

[[10, 57, 60], [11, 157, 246], [10, 11, 29]]

Looking up the titles of the recommended books:

In [150]:
books_id_name_dict = dict(zip(books_df.book_id, books_df.title))
for book_id, rec_ids in zip(books_top_3_ids, recommend_top_3_for_top_3):
    print(f"Top three titles for {books_id_name_dict[book_id]}:")
    print(f"{[books_id_name_dict[rec_id] for rec_id in rec_ids]}")

Top three titles for The Hunger Games (The Hunger Games, #1):
['Pride and Prejudice', 'The Secret Life of Bees', 'The Curious Incident of the Dog in the Night-Time']
Top three titles for Harry Potter and the Sorcerer's Stone (Harry Potter, #1):
['The Kite Runner', 'Green Eggs and Ham', 'Marked (House of Night, #1)']
Top three titles for Twilight (Twilight, #1):
['Pride and Prejudice', 'The Kite Runner', 'Romeo and Juliet']


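These recommendations look largely irrelevant. A likely cause is an ID mismatch: `get_neighbors()` expects and returns `surprise` inner IDs, whereas the cells above pass raw `book_id` values in and look the results up as raw `book_id`s.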
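
Since `get_neighbors()` works with `surprise` inner IDs rather than raw `book_id` values, a lookup can convert raw IDs to inner IDs before the call and convert the neighbors back afterwards. The following is a minimal sketch. The helper name `similar_raw_item_ids` is hypothetical, while `to_inner_iid` and `to_raw_iid` are `Trainset` methods. It assumes `algo` is a fitted item-based model, and it has not been run against this dataset, so no output is shown.

```python
def similar_raw_item_ids(algo, raw_item_id, k=3):
    """Return the raw ids of the k nearest neighbors of a raw item id.

    Assumes `algo` is a fitted item-based KNN model (user_based=False),
    so that its neighbors are items.
    """
    trainset = algo.trainset
    inner_id = trainset.to_inner_iid(raw_item_id)  # raw id -> inner id
    neighbor_inner_ids = algo.get_neighbors(inner_id, k=k)
    # inner ids -> raw ids, so they can be looked up in books_df
    return [trainset.to_raw_iid(i) for i in neighbor_inner_ids]


# Hypothetical usage with the objects defined in this notebook:
# similar_raw_item_ids(algo, books_top_3_ids[0], k=3)
```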

## Dump Results

`surprise` provides a `dump` module for dumping and loading models.\
For a model of this size, dumping takes a long time.

In [84]:
from surprise import dump
dump.dump(
    "../Data/Dump/cf_knnbasic_all.dump",
    predictions=predictions,
    algo=algo,
    verbose=1
)

The dump has been saved as file ../Data/Dump/cf_knnbasic_all.dump


Dump files can be loaded back with `dump.load(file_name)`, which returns a `(predictions, algo)` tuple.