# Summary

In this notebook, we look at neighborhood models, using Surprise's KNN prediction algorithms.
<br>

We start by setting up the infrastructure: we import all user ratings of the top 100 games on BoardGameGeek (ranked by the default metric, called geekrating) as of March 31, 2020, and turn them into a trainset.
<br>

Then we move on to modelling with four different kinds of KNN model: KNNBasic, KNNWithMeans, KNNWithZScore and KNNBaseline.

For more information, check out Surprise's documentation.


# Infrastructure

In [23]:
# surprise library
from surprise import Dataset, Reader

from surprise.similarities import \
    cosine, msd, pearson, pearson_baseline
from surprise.prediction_algorithms.knns import \
    KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise.model_selection import \
    train_test_split, GridSearchCV, cross_validate

from surprise import accuracy
from surprise.model_selection import KFold


import numpy as np  # needed for np.mean in the cross-validation cells
import pandas as pd
import csv

# my functions for this project
import bgg_data_func
import bgg_model_func
from game_name_converter import NameConverter

In [45]:
file_path = './data_input/games_100_summary.csv'

reader = Reader(line_format='user item rating', sep=',', rating_scale = (1,10))

data = Dataset.load_from_file(file_path, reader=reader)

In [46]:
trainset, testset = train_test_split(data, test_size=0.2)

In [47]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  225845 

Number of items:  100 



In [48]:
trainset_iids = list(trainset.all_items())
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids))

In [49]:
trainsetfull = data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  237253 

Number of items:  100 



In [50]:
trainsetfull_iids = list(trainsetfull.all_items())
iid_converter = lambda x: trainsetfull.to_raw_iid(x)
trainsetfull_raw_iids = list(map(iid_converter, trainsetfull_iids))

In [51]:
name_converter = NameConverter('games_master_list.csv')

# Modelling

In [None]:
# define similarity options, to be used by different models
# four options: cosine, pearson, pearson_baseline, msd

In [52]:
sim_msd = {'name':'MSD', 'user_based':False}
sim_cos = {'name':'cosine', 'user_based':False}
sim_pearson = {'name':'pearson', 'user_based':False}
sim_pearson_baseline = {'name': 'pearson_baseline','user_based':False, 'shrinkage': 100}

sim_options = [sim_msd, sim_cos, sim_pearson, sim_pearson_baseline]

# shrinkage can be tuned; 100 is Surprise's default value
# we will not use user-based similarities, due to the number of users
# every other parameter is taken into account
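
To make the similarity options more concrete, here is a minimal pure-Python sketch of how MSD, cosine and Pearson similarity between two games can be computed over the users who rated both. It is an illustration only, not Surprise's implementation: the function names and the toy ratings are made up, and options such as `min_support` and the shrinkage used by `pearson_baseline` are left out.

```python
import math


def common_ratings(ratings_a, ratings_b):
    """Pair up the ratings of users who rated both items."""
    shared_users = sorted(set(ratings_a) & set(ratings_b))
    return [(ratings_a[u], ratings_b[u]) for u in shared_users]


def msd_similarity(ratings_a, ratings_b):
    """1 / (mean squared difference + 1), computed over common raters."""
    pairs = common_ratings(ratings_a, ratings_b)
    if not pairs:
        return 0.0
    msd = sum((x - y) ** 2 for x, y in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between the two common-rater rating vectors."""
    pairs = common_ratings(ratings_a, ratings_b)
    sum_xx = sum(x * x for x, _ in pairs)
    sum_yy = sum(y * y for _, y in pairs)
    if sum_xx == 0 or sum_yy == 0:
        return 0.0
    return sum(x * y for x, y in pairs) / math.sqrt(sum_xx * sum_yy)


def pearson_similarity(ratings_a, ratings_b):
    """Cosine similarity after centering each item on its common-rater mean."""
    pairs = common_ratings(ratings_a, ratings_b)
    if not pairs:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sum_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sum_yy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sum_xx == 0 or sum_yy == 0:
        return 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return cov / math.sqrt(sum_xx * sum_yy)


# toy ratings (user -> rating), invented for illustration
game_a = {'ann': 8, 'bob': 6, 'cy': 9}
game_b = {'ann': 7, 'bob': 5, 'dee': 10}

print(msd_similarity(game_a, game_b))
print(cosine_similarity(game_a, game_b))
print(pearson_similarity(game_a, game_b))
```

Only `ann` and `bob` rated both toy games, so all three measures are computed from those two pairs alone. This is why similarities between items with few common raters can be unreliable.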

On to the modelling. 
KNNBasic, KNNWithMeans and KNNWithZScore can be used with the first three similarity measures. The parameter to tune is k, the maximum number of neighbors. We look at these models first. 
KNNBaseline additionally takes a baseline rating estimate into account, and the recommended similarity measure for it is pearson_baseline. 
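
The difference between the first three models is easiest to see in how they combine the neighbors' ratings. The sketch below is a simplified, item-based illustration and not Surprise's code: the function names and numbers are invented, and details such as `min_k` and the filtering of non-positive similarities are omitted. Each neighbor is a tuple `(similarity, rating_by_user, neighbor_mean, neighbor_std)`.

```python
def predict_basic(neighbors):
    """KNNBasic idea: similarity-weighted average of the neighbors' ratings."""
    total_sim = sum(sim for sim, _, _, _ in neighbors)
    if total_sim == 0:
        return None
    return sum(sim * r for sim, r, _, _ in neighbors) / total_sim


def predict_with_means(item_mean, neighbors):
    """KNNWithMeans idea: weighted average of deviations from each
    neighbor's mean, added back onto the target item's mean."""
    total_sim = sum(sim for sim, _, _, _ in neighbors)
    if total_sim == 0:
        return item_mean
    offset = sum(sim * (r - mean) for sim, r, mean, _ in neighbors)
    return item_mean + offset / total_sim


def predict_with_zscore(item_mean, item_std, neighbors):
    """KNNWithZScore idea: like the means variant, but deviations are
    expressed in standard deviations (z-scores) before averaging."""
    total_sim = sum(sim for sim, _, _, _ in neighbors)
    if total_sim == 0:
        return item_mean
    z = sum(sim * (r - mean) / std for sim, r, mean, std in neighbors)
    return item_mean + item_std * z / total_sim


# two hypothetical neighbor games that the user has rated
neighbors = [
    (0.9, 8, 7.5, 1.0),
    (0.6, 6, 7.0, 2.0),
]

print(predict_basic(neighbors))
print(predict_with_means(8.0, neighbors))
print(predict_with_zscore(8.0, 1.5, neighbors))
```

In all three functions, k only decides how many tuples end up in `neighbors`. The weighting itself does not change with k.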

In [53]:
list_of_ks = [10,20,40]

## Similarity Matrices

We handle the first three similarity measures as one group. First, we fit three models on the full trainset, just to get their similarity matrices.

In [None]:
# helper: defined before the loop below so the notebook runs top to bottom
def save_similar_games(similarity_matrix, raw_iids, top_x, output_file):
    # build a dataframe with the top_x most similar games for every game
    df = bgg_model_func.return_top_similar_dataframe(similarity_matrix, raw_iids, top_x)
    # replace game ids with readable game names
    for column in df.columns:
        df[column] = df[column].map(name_converter.get_game_name_from_id)
    df.sort_values(['game'], inplace = True, axis = 0)
    df.to_csv(output_file, index = False)

In [None]:
# going for the first three similarity measures
for i in range(0,3):
    model = KNNBasic(sim_options = sim_options[i], verbose = False)
    model.fit(trainsetfull)
    save_similar_games(
        model.sim, trainsetfull_raw_iids, 10, 
        './results/top_10_similar_games_' + sim_options[i]['name'] + '.csv')
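
The helper `bgg_model_func.return_top_similar_dataframe` used in `save_similar_games` is project-specific and not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in (`top_similar` is an invented name) that picks the most similar items for each row of a square similarity matrix. It returns a plain dictionary rather than a dataframe.

```python
def top_similar(similarity_matrix, raw_ids, top_x):
    """For every item, list the raw ids of its top_x most similar items.

    similarity_matrix: square matrix (list of rows, or a 2D array),
    where row i and column i both refer to raw_ids[i].
    """
    result = {}
    for i, row in enumerate(similarity_matrix):
        # skip the diagonal: every item is trivially most similar to itself
        others = [(sim, raw_ids[j]) for j, sim in enumerate(row) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        result[raw_ids[i]] = [raw_id for _, raw_id in others[:top_x]]
    return result


# toy 3x3 similarity matrix, invented for illustration
toy_sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
toy_ids = ['a', 'b', 'c']

print(top_similar(toy_sim, toy_ids, 2))
```

The rows of a fitted model's `sim` attribute follow inner ids, which is why the raw ids have to be passed in the same order as `all_items()`.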

In [None]:
# we can check the matrices here, if we want to

In [None]:
df = pd.read_csv('./results/top_10_similar_games_' + sim_options[2]['name'] + '.csv')
df;

## Models without Baseline

In [None]:
# just showing the train/test split option here; we are not going to use it

In [12]:
algo = KNNBasic(k = 10, sim_options = sim_options[1])
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3974


1.3973772156201962

In [None]:
# we are using cross-validation as a default, saving everything in a csv called kNN_scores.csv
# structure of csv: model_type, similarity_option, k, train_rmse, test_rmse

In [54]:
with open('./results/kNN_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['model_type', 'similarity_option', 'k', 'train_rmse', 'test_rmse'])

In [55]:
# KNNBasic
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )        
        algo = KNNBasic(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNBasic', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

Currently calculating sim_option = MSD and k = 10 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 20 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 40 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = cosine and k = 10 ...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done c

In [56]:
# KNNWithMeans
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )
        algo = KNNWithMeans(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNWithMeans', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

Currently calculating sim_option = MSD and k = 10 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 20 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 40 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = cosine and k = 10 ...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done c

In [57]:
# KNNWithZScore
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )
        algo = KNNWithZScore(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNWithZScore', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

Currently calculating sim_option = MSD and k = 10 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 20 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = MSD and k = 40 ...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Currently calculating sim_option = cosine and k = 10 ...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done c

In [36]:
# can analyse with pandas now

In [58]:
df = pd.read_csv('./results/kNN_scores.csv')
df.sort_values(by = 'test_rmse', inplace = True)
df

|    | model_type    | similarity_option | k  | train_rmse | test_rmse |
|----|---------------|-------------------|----|------------|-----------|
| 15 | KNNWithMeans  | pearson           | 10 | 0.802911   | 1.268071  |
| 24 | KNNWithZScore | pearson           | 10 | 0.803766   | 1.268229  |
| 25 | KNNWithZScore | pearson           | 20 | 0.876658   | 1.270155  |
| 16 | KNNWithMeans  | pearson           | 20 | 0.875939   | 1.270204  |
| 26 | KNNWithZScore | pearson           | 40 | 0.903048   | 1.274972  |
| 17 | KNNWithMeans  | pearson           | 40 | 0.902398   | 1.275318  |
| 10 | KNNWithMeans  | MSD               | 20 | 0.875226   | 1.278689  |
| 9  | KNNWithMeans  | MSD               | 10 | 0.789032   | 1.27878   |
| 19 | KNNWithZScore | MSD               | 20 | 0.877514   | 1.281942  |
| 11 | KNNWithMeans  | MSD               | 40 | 0.9048     | 1.2827    |
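
Besides sorting by `test_rmse`, it can help to look at the gap between train and test RMSE as a rough overfitting indicator. The sketch below uses only the standard library and three rows copied from the output above (KNNWithMeans with pearson). The `gap` column is an addition for illustration, not something stored in `kNN_scores.csv`.

```python
import csv
import io

# three rows copied from the scores table above
scores_csv = """model_type,similarity_option,k,train_rmse,test_rmse
KNNWithMeans,pearson,10,0.802911,1.268071
KNNWithMeans,pearson,20,0.875939,1.270204
KNNWithMeans,pearson,40,0.902398,1.275318
"""

rows = []
for row in csv.DictReader(io.StringIO(scores_csv)):
    train = float(row['train_rmse'])
    test = float(row['test_rmse'])
    rows.append({
        'k': int(row['k']),
        'train_rmse': train,
        'test_rmse': test,
        'gap': test - train,  # larger gap = bigger train/test discrepancy
    })

# report the rows from best to worst test RMSE
for r in sorted(rows, key=lambda r: r['test_rmse']):
    print(f"k={r['k']:>2}  test={r['test_rmse']:.4f}  gap={r['gap']:.4f}")

best = min(rows, key=lambda r: r['test_rmse'])
print('best k by test RMSE:', best['k'])
```

For these three rows, the gap narrows as k grows while the test RMSE rises slightly, so the choice of k here trades a marginally better test score against a larger train/test discrepancy.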
