# Scikit Surprise tester
This notebook contains a few tests to see how well Scikit Surprise performs on my data.

It uses KNN and SVD, as these should be comparable to the KNN and SVC used in my recommender. Surprise doesn't include naive Bayes, decision trees or logistic regression because it focuses on numeric rating prediction rather than classification.

This notebook is configured to match my main config as closely as possible, using the same data split as my results. The accuracy here is 55% for SVD and 57% for KNN, which is similar to (though slightly lower than) the collaborative filtering results in my recommender system. This suggests that my collaborative filtering section is working as expected.
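As a rough illustration of what "numeric prediction" means for a yes/no label: a basic neighbourhood method estimates a rating as the similarity-weighted average of the nearest neighbours' ratings, so with 0/1 labels the estimate is a score between 0 and 1 rather than a class. The sketch below is plain Python, not Surprise's code; the `knn_estimate` helper, the neighbour values and the 0.5 cut-off are all hypothetical.

```python
def knn_estimate(neighbours, k=5):
    """Similarity-weighted average of the k most similar neighbours' labels.

    neighbours: list of (similarity, label) pairs, where label is 0 or 1.
    """
    # Keep only the k neighbours with the highest similarity
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        raise ValueError("no neighbours with positive similarity")
    return sum(sim * label for sim, label in top_k) / total_sim


# Hypothetical (similarity, recommend) pairs for one user/book prediction
neighbours = [(0.9, 1), (0.8, 1), (0.5, 0), (0.3, 1), (0.2, 0), (0.1, 0)]

estimate = knn_estimate(neighbours, k=5)  # a score between 0 and 1
predicted_label = estimate >= 0.5         # assumed cut-off to get back a class
print(round(estimate, 3), predicted_label)
```

Turning the score back into a recommend/don't-recommend label needs a cut-off like this one, which is a choice made here for illustration rather than anything Surprise does itself.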

In [None]:
import sys
from pathlib import Path

from sklearn.model_selection import train_test_split
from surprise import Dataset, KNNBasic, Reader
from surprise.model_selection import cross_validate

# Make the project modules importable by adding their folder to sys.path (a bit hacky)
modules = str(Path.cwd().parent / "modules")

if modules not in sys.path:
    sys.path.append(modules)
    
import load_data

In [None]:
# Set config values
ratings = load_data.trim_ratings(
    load_data.ratings_data(Path.cwd().parent/"data/book_ratings.db"), 10, 5
)
ratings["recommend"] = load_data.set_threshold(ratings, 4)
ratings = load_data.set_class_proportions(
    ratings, 0.2, 0.8
)
print(f"Data ready: {ratings.shape[0]} ratings to use")

# Main train/test split - don't touch test from here onwards
train, test = train_test_split(
    ratings,
    random_state=50,
    test_size=0.2,
    stratify=ratings["recommend"],
)
print(f"Test/train split done: {len(train)} ratings in training set")

In [None]:
# The recommend label is binary, so the rating scale runs from 0 (False) to 1 (True)
reader = Reader(rating_scale=(0, 1))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(train[['user_id', 'book_id', 'recommend']], reader)

In [None]:
sim_options = {
    "user_based": False,
}

algo = KNNBasic(k=5, sim_options=sim_options)

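# Run 5-fold cross-validation on the training data.
# With 0/1 labels, 1 - MAE equals accuracy only when every prediction is exactly 0 or 1.
# KNNBasic returns a weighted average between 0 and 1, so this is an approximate accuracy.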
results = cross_validate(algo, data, measures=['MAE'], cv=5, verbose=False)
accuracy = 1 - results["test_mae"].mean()
print("KNN accuracy:", round(accuracy, 2))
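The link between MAE and accuracy used above can be checked on toy data. This is a minimal sketch in plain Python with made-up labels and estimates; the helper names and the 0.5 cut-off are assumptions for illustration. With hard 0/1 predictions, 1 - MAE matches accuracy exactly; with continuous estimates the two can drift apart.

```python
def mean_absolute_error(labels, estimates):
    """Average absolute difference between true labels and estimates."""
    return sum(abs(y - e) for y, e in zip(labels, estimates)) / len(labels)


def accuracy(labels, estimates, cutoff=0.5):
    """Share of estimates that land on the correct side of the cut-off."""
    hits = sum((e >= cutoff) == bool(y) for y, e in zip(labels, estimates))
    return hits / len(labels)


# Made-up data: five true labels and two sets of predictions
labels = [1, 0, 1, 1, 0]
hard = [1, 0, 0, 1, 1]             # predictions that are already 0 or 1
soft = [0.9, 0.4, 0.45, 0.7, 0.6]  # continuous scores between 0 and 1

hard_mae = mean_absolute_error(labels, hard)
hard_acc = accuracy(labels, hard)
soft_mae = mean_absolute_error(labels, soft)
soft_acc = accuracy(labels, soft)

print("hard: 1 - MAE =", round(1 - hard_mae, 2), "accuracy =", round(hard_acc, 2))
print("soft: 1 - MAE =", round(1 - soft_mae, 2), "accuracy =", round(soft_acc, 2))
```

If an exact accuracy figure is needed, one option is to threshold the estimates first and then count correct predictions, as `accuracy` does here.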