# Summary

In this notebook, we look at neighbor-based models, using Surprise's KNN prediction algorithms.
<br>

We start by setting up the infrastructure: importing all the user ratings of the top 100 games on BoardGameGeek (ranked by its default metric, the geek rating) as of March 31, 2020, and turning them into a trainset.
<br>

Then we move on to modelling with four different kinds of KNN model.

For more information, check out Surprise's documentation.


# Infrastructure

In [1]:
# surprise library
from surprise import Dataset, Reader

from surprise.similarities import cosine, msd, pearson, pearson_baseline
from surprise.prediction_algorithms.knns import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise import SVD
from surprise.model_selection import train_test_split, GridSearchCV


from surprise.model_selection import cross_validate
from surprise import accuracy
from surprise.model_selection import KFold


import pandas as pd


# my functions for this project
import bgg_data_func
import bgg_model_func
from game_name_converter import NameConverter

In [2]:
file_path = './data_input/games_31_summary.csv'

reader = Reader(line_format='user item rating', sep=',', rating_scale = (1,10))

data = Dataset.load_from_file(file_path, reader=reader)

In [3]:
trainset, testset = train_test_split(data, test_size=0.2)

In [4]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  170825 

Number of items:  32 



In [5]:
trainset_iids = list(trainset.all_items())
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids))

In [6]:
trainsetfull = data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  184061 

Number of items:  32 



In [7]:
trainsetfull_iids = list(trainsetfull.all_items())
iid_converter = lambda x: trainsetfull.to_raw_iid(x)
trainsetfull_raw_iids = list(map(iid_converter, trainsetfull_iids))

In [8]:
name_converter = NameConverter('games_master_list.csv')

# Modelling

In [10]:
# define similarity options, to be used by different models
# four options: cosine, pearson, pearson_baseline, msd

In [9]:
sim_msd = {'name':'MSD', 'user_based':False}
sim_cos = {'name':'cosine', 'user_based':False}
sim_pearson = {'name':'pearson', 'user_based':False}
sim_pearson_baseline = {'name': 'pearson_baseline','user_based':False, 'shrinkage': 100}

sim_options = [sim_msd, sim_cos, sim_pearson, sim_pearson_baseline]

# shrinkage can be tuned; 100 (Surprise's default) is used as the baseline value
# we will not use user-based similarities, due to the number of users
# every other parameter is taken into account
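
To make the similarity options a bit more concrete, here is a minimal pure-Python sketch of the MSD and cosine similarities between two items, computed over the users who rated both. The toy `ratings` dictionary and the helper names are hypothetical and only for illustration; Surprise's own implementation works on the whole trainset at once and also supports options such as `min_support`, which are left out here.

```python
from math import sqrt

# Hypothetical toy data: ratings[item][user] = rating on the 1-10 scale
ratings = {
    "game_a": {"u1": 8.0, "u2": 6.0, "u3": 9.0},
    "game_b": {"u1": 7.0, "u2": 6.0, "u4": 5.0},
}


def common_pairs(item_x, item_y):
    """Return the rating pairs from users who rated both items."""
    shared = ratings[item_x].keys() & ratings[item_y].keys()
    return [(ratings[item_x][u], ratings[item_y][u]) for u in sorted(shared)]


def msd_similarity(item_x, item_y):
    """1 / (mean squared difference + 1) over common users, 0 if there are none."""
    pairs = common_pairs(item_x, item_y)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


def cosine_similarity(item_x, item_y):
    """Cosine of the two rating vectors, restricted to common users."""
    pairs = common_pairs(item_x, item_y)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_x = sqrt(sum(a * a for a, _ in pairs))
    norm_y = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_x * norm_y)


print(msd_similarity("game_a", "game_b"))
print(cosine_similarity("game_a", "game_b"))
```

Only `u1` and `u2` rated both toy games, so both measures are computed from those two rating pairs alone.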

On to the modelling. 
KNNBasic, KNNWithMeans and KNNWithZScore can be used with the first three similarity measures. The main parameter to tune is k, the maximum number of neighbors, so we look at these models first. 
KNNBaseline additionally combines the neighbors with an underlying baseline estimate (the global mean plus user and item biases), and the recommended similarity measure for it is pearson_baseline. 
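
As a rough illustration of what k controls, the sketch below predicts one rating from a user's already-rated neighbor items, once as a plain similarity-weighted average (the idea behind KNNBasic) and once with mean-centering (the idea behind KNNWithMeans). The function names and the toy numbers are assumptions made for this example, not Surprise's code.

```python
def predict_basic(neighbors, k):
    """Similarity-weighted average of the k most similar rated items.

    neighbors: list of (similarity, rating) pairs for items the user has rated.
    Returns None when no neighbor has a positive similarity.
    """
    top_k = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    top_k = [(sim, r) for sim, r in top_k if sim > 0]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None
    return sum(sim * r for sim, r in top_k) / total_sim


def predict_with_means(item_mean, neighbors, k):
    """Target item's mean plus the weighted average deviation of the neighbors.

    neighbors: list of (similarity, rating, neighbor_item_mean) triples.
    Falls back to the item mean when no neighbor has a positive similarity.
    """
    top_k = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    top_k = [(sim, r, mean) for sim, r, mean in top_k if sim > 0]
    total_sim = sum(sim for sim, _, _ in top_k)
    if total_sim == 0:
        return item_mean
    deviation = sum(sim * (r - mean) for sim, r, mean in top_k) / total_sim
    return item_mean + deviation


# Hypothetical neighbors of one game, for one user
basic = predict_basic([(0.9, 8.0), (0.6, 6.0), (0.1, 3.0)], k=2)
centered = predict_with_means(
    7.0, [(0.9, 8.0, 7.5), (0.6, 6.0, 6.5), (0.1, 3.0, 5.0)], k=2
)
print(basic, centered)
```

With k=2 the weakest neighbor (similarity 0.1) is ignored in both predictions, which is why k is the first parameter worth tuning.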

## Similarity Matrices

We handle the first three similarity measures as one group. First, we fit three KNNBasic models on the full trainset, just to get the similarity matrices.

In [30]:
# going for the first three models
for i in range(0,3):
    model = KNNBasic(sim_options = sim_options[i], verbose = False)
    model.fit(trainsetfull)
    save_similar_games(
        model.sim, trainsetfull_raw_iids, 10, 
        './results/top_10_similar_games_' + sim_options[i]['name'] + '.csv')

In [23]:
def save_similar_games(similarity_matrix, raw_iids, top_x, output_file):
    """Save each game's top_x most similar games, by name, to a CSV file."""
    df = bgg_model_func.return_top_similar_dataframe(similarity_matrix, raw_iids, top_x)
    # convert the raw game ids in every column to game names
    for column in df.columns:
        df[column] = df[column].map(name_converter.get_game_name_from_id)
    df.sort_values(['game'], inplace = True, axis = 0)
    df.to_csv(output_file, index = False)
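
The helper `bgg_model_func.return_top_similar_dataframe` is project code that is not shown in this notebook. A minimal sketch of the underlying idea, assuming row and column `i` of the similarity matrix correspond to `raw_iids[i]`, could look like the following; the function name `top_similar` and the toy matrix are hypothetical.

```python
def top_similar(similarity_matrix, raw_iids, top_x):
    """Map each raw item id to its top_x most similar other items.

    Assumes similarity_matrix[i][j] is the similarity between
    raw_iids[i] and raw_iids[j].
    """
    result = {}
    for i, raw_iid in enumerate(raw_iids):
        # pair every other item with its similarity to item i
        others = [
            (similarity_matrix[i][j], raw_iids[j])
            for j in range(len(raw_iids))
            if j != i
        ]
        # most similar first
        others.sort(key=lambda pair: pair[0], reverse=True)
        result[raw_iid] = [iid for _, iid in others[:top_x]]
    return result


# Hypothetical 3x3 similarity matrix for items "a", "b", "c"
toy_sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
print(top_similar(toy_sim, ["a", "b", "c"], 2))
```

Each item is excluded from its own list, since its similarity to itself would otherwise always rank first.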

In [31]:
# we can check the matrices here, if we want to

In [34]:
df = pd.read_csv('./results/top_10_similar_games_' + sim_options[2]['name'] + '.csv')
df;

## Models without Baseline

In [None]:
param_grid_knn = {'k': [40]}
gs_model_basic = GridSearchCV(KNNBasic, param_grid = param_grid_knn, cv = 3, return_train_measures=True, joblib_verbose = 10)
gs_model_basic.fit(data)

In [None]:
param_grid_1 = {'k': [5, 10]}

gs_model_1 = GridSearchCV(KNNBasic,param_grid=param_grid_1,joblib_verbose=5)
gs_model_1.fit(data)

In [16]:
param_grid_1 = {'k': [5, 10]}

gs_model_1 = GridSearchCV(KNNBasic,param_grid=param_grid_1,n_jobs = -1,joblib_verbose=5)
gs_model_1.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   8 | elapsed:  1.5min remaining:   53.6s


TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
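
One plausible explanation for the killed workers (an assumption, not something verified in this notebook) is memory: the param grid above does not set `sim_options`, so KNNBasic falls back to its default user-based similarities, and a user-user similarity matrix is far larger than an item-item one. The back-of-the-envelope sketch below uses the user and item counts printed earlier and assumes one 8-byte float per matrix entry.

```python
BYTES_PER_ENTRY = 8  # assumption: one float64 per similarity value


def sim_matrix_bytes(n):
    """Size in bytes of a dense n x n similarity matrix."""
    return n * n * BYTES_PER_ENTRY


n_users = 184061  # users in the full trainset, as printed above
n_items = 32      # items in the full trainset, as printed above

print(f"item-item: {sim_matrix_bytes(n_items)} bytes")
print(f"user-user: {sim_matrix_bytes(n_users) / 1e9:.0f} GB")
```

Each cross-validation fold holds somewhat fewer users than the full trainset, but the order of magnitude stays the same, and every parallel worker would need its own matrix.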

In [13]:
param_grid_2 = {'reg_all': [0.4, 0.6]}

gs_model_2 = GridSearchCV(SVD,param_grid=param_grid_2,n_jobs = -1,joblib_verbose=5)
gs_model_2.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  3.8min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  5.3min finished


In [None]:
basic_cos = KNNBasic(sim_options=sim_cos)
basic_cos.fit(trainset)
predictions = basic_cos.test(testset)
print(accuracy.rmse(predictions))

In [None]:
df = bgg_model_func.return_top_similar_dataframe(basic_cos.sim, trainset_raw_iids, 3)
for column in df.columns:
    df[column] = df[column].map(name_converter.get_game_name_from_id)
df.sort_values(['game'], inplace = True, axis = 0)
df[:100]

In [None]:

basic_pearson = KNNBasic(sim_options=sim_pearson)
basic_pearson.fit(trainset)
predictions = basic_pearson.test(testset)
print(accuracy.rmse(predictions))

In [None]:
df = bgg_model_func.return_top_similar_dataframe(basic_pearson.sim, trainset_raw_iids, 3)
for column in df.columns:
    df[column] = df[column].map(name_converter.get_game_name_from_id)
df.sort_values(['game'], inplace = True)
df

In [None]:
sim_pearson = {'name':'pearson', 'user_based':False}
knn_means = KNNWithMeans(sim_options=sim_pearson)
knn_means.fit(trainset)
predictions = knn_means.test(testset)
print(accuracy.rmse(predictions))

In [None]:
sim_pearson = {'name':'pearson', 'user_based':False}
knn_baseline = KNNBaseline(sim_options=sim_pearson)
knn_baseline.fit(trainset)
predictions = knn_baseline.test(testset)
print(accuracy.rmse(predictions))