# Collaborative Filtering Notebook with `surprise`

## Model Training Demo

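The following cell loads the ratings and converts them into a `surprise.Dataset` object. Note that `Dataset.load_from_df` expects the columns in the order user, item, rating.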

In [48]:
from surprise import Dataset, NormalPredictor, Reader
import pandas as pd
ratings_df = pd.read_csv(
    filepath_or_buffer="../Data/Raw/ratings.csv",
    dtype={
        "user_id": "Int32",
        "book_id": "Int32",
        "rating": "Int8"
    }
)
ratings_sdata = Dataset.load_from_df(
    df=ratings_df,
    reader=Reader(rating_scale=(1, 5))
)
ratings_df.nunique()

user_id    53424
book_id    10000
rating         5
dtype: int64

We can choose the similarity measure used for collaborative filtering (CF). The available measures are implemented in the `surprise.similarities` module.

In [56]:
from surprise import KNNBasic
sim_options = {
    "name": "cosine",  # options: cosine, msd, pearson, pearson_baseline
    "user_based": False,  # False: item-based CF; True: user-based CF
    "shrinkage": 0,  # only used when "name" is pearson_baseline; a larger value can reduce overfitting
    "min_support": 1,  # if two items share fewer common ratings than this, their similarity is set to 0 (sparser similarity matrix)
}
algo = KNNBasic(sim_options=sim_options) # other algo options: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html
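To make the `cosine` and `min_support` options concrete, here is a minimal pure-Python sketch of a cosine similarity between two items, computed only over the users who rated both. The function name `cosine_sim` and the toy ratings are hypothetical, and the sketch illustrates the idea rather than reproducing the internal implementation of `surprise`.

```python
import math


def cosine_sim(ratings_i, ratings_j, min_support=1):
    """Cosine similarity between two items over their common raters.

    ratings_i, ratings_j: dicts mapping user_id -> rating (hypothetical toy input).
    Returns 0.0 when fewer than `min_support` users rated both items.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common or len(common) < min_support:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


# Toy data: users 2 and 3 rated both books.
book_a = {1: 5, 2: 3, 3: 4}
book_b = {2: 3, 3: 4, 4: 1}

sim_default = cosine_sim(book_a, book_b)                 # two common raters are enough
sim_strict = cosine_sim(book_a, book_b, min_support=3)   # too few common raters
```

Raising `min_support` zeroes out similarities that rest on very few shared ratings, which are usually the least reliable ones.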

To train, we need to convert a `surprise.Dataset` object to a `surprise.Trainset` object.\
Suppose we want to train on the whole dataset:

In [57]:
ratings_strain = ratings_sdata.build_full_trainset()

Fit `algo` on `ratings_strain` and time the run:

In [60]:
import time
start_time = time.time()
algo.fit(ratings_strain)
print(f"Elapsed time: {time.time() - start_time} seconds")

Computing the cosine similarity matrix...
Done computing similarity matrix.
Elapsed time: 43.53758215904236 seconds


## Compute Predictions, Training Loss

We can measure the training loss by evaluating the fitted model on its own training data. First build a testset from the trainset, then compute predictions, then pass them to an accuracy metric\
(this will take a while).

In [None]:
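from surprise import accuracy

ratings_stest = ratings_strain.build_testset()
start_time = time.time()
predictions = algo.test(ratings_stest)
print(f"Elapsed time: {time.time() - start_time} seconds")
accuracy.rmse(predictions)  # prints and returns the RMSE on the training ratings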
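
As an illustration of what an accuracy metric does with the predictions, the sketch below computes the root mean squared error (RMSE) by hand. `surprise` provides this as `surprise.accuracy.rmse`; the `Pred` namedtuple here is a hypothetical stand-in that mirrors only the `uid`, `iid`, `r_ui` (true rating) and `est` (estimated rating) fields of a `surprise` prediction, and the toy values are made up.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a prediction record: true rating `r_ui`, estimate `est`.
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est"])


def rmse(preds):
    """Root mean squared error between true and estimated ratings."""
    if not preds:
        raise ValueError("preds must not be empty")
    squared_errors = [(p.r_ui - p.est) ** 2 for p in preds]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy predictions with squared errors 1, 0, 4 and 1.
toy_preds = [
    Pred(uid=1, iid=10, r_ui=4.0, est=3.0),
    Pred(uid=1, iid=11, r_ui=2.0, est=2.0),
    Pred(uid=2, iid=10, r_ui=5.0, est=3.0),
    Pred(uid=2, iid=12, r_ui=1.0, est=2.0),
]
toy_rmse = rmse(toy_preds)
```

Because the model is evaluated on the same ratings it was fitted on, this number is a training error and is likely to be optimistic compared with the error on held-out ratings.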

## Trying out Recommendation (section in progress)

Let us find the 10 users with the most ratings (the power users) and recommend books for them.

In [78]:
users_top_10_df = ratings_df\
                    .groupby("user_id", as_index=False)\
                    .agg(count=("book_id", "count"))\
                    .sort_values("count", ascending=False)\
                    .head(10)
users_top_10_df

       user_id  count
12873    12874    200
30943    30944    200
28157    28158    199
52035    52036    199
12380    12381    199
45553    45554    197
6629      6630    197
7562      7563    196
15603    15604    196
37833    37834    196
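
One possible next step, sketched here under stated assumptions, is to rank the books a user has not rated yet by their predicted rating and keep the top few. The helper `top_n_for_user`, the callable `predict_rating` and the toy scores are all hypothetical. With a fitted `surprise` algorithm, `predict_rating` could be something like `lambda u, i: algo.predict(u, i).est`.

```python
def top_n_for_user(user_id, all_items, rated_items, predict_rating, n=10):
    """Rank the items the user has not rated by predicted rating, highest first.

    predict_rating: callable (user_id, item_id) -> estimated rating (assumed interface).
    Returns a list of (item_id, estimated_rating) pairs of length at most n.
    """
    candidates = [item for item in all_items if item not in rated_items]
    scored = [(item, predict_rating(user_id, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy stand-in for a fitted model: fixed scores per book, independent of the user.
toy_scores = {101: 3.2, 102: 4.7, 103: 4.1, 104: 2.5}


def toy_predict(user_id, item_id):
    return toy_scores[item_id]


recs = top_n_for_user(
    user_id=12874,
    all_items=[101, 102, 103, 104],
    rated_items={101},          # books the user already rated are skipped
    predict_rating=toy_predict,
    n=2,
)
```

Scoring every unrated book for every user can be slow with 10,000 books, so restricting the candidates (for example to popular books) may be worth considering.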


## Dump Results

In [None]:
from surprise import dump

# Persist the predictions and the fitted algorithm to a single file
dump.dump(
    "../Data/Dumps/cf_knnbasic_all.dump",
    predictions=predictions,
    algo=algo,
    verbose=1
)

Dump files can be loaded back with `dump.load(file_name)`, which returns a `(predictions, algo)` tuple.