# Summary

In this notebook, we look at recommender systems based on k-Nearest Neighbour (kNN) models, using the [surprise](https://surprise.readthedocs.io/en/stable/index.html) library. 
<br>

We start by setting up the infrastructure: importing all the user ratings of the top 100 games on boardgamegeek (ranked by the site's default metric, called `geekrating`) as of March 31, 2020, and turning them into data structures `surprise` can work with. 
<br>

Then, we run 4 different types of models:
- `KNNBasic`: as the name suggests, the basic model, where we find similarities between items based on user ratings, and use that information to estimate unknown ratings
- `KNNWithMeans`: a model that also takes the average ratings of different items into consideration
- `KNNWithZScore`: starts by z-score normalizing the ratings
- `KNNBaseline`: a kNN-type model that also takes a baseline rating (global mean plus user and item biases) into account; matrix factorization is explored in more detail in the next Jupyter notebook
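To make the difference between the basic and the mean-centred model concrete, here is a minimal plain-Python sketch of an item-based estimate in the style of `KNNWithMeans`, as I understand it from the `surprise` documentation: the target item's mean plus a similarity-weighted average of the neighbours' deviations from their own means. The function name `predict_with_means`, the dictionaries and the toy numbers are made up for illustration; this is not the library's code.

```python
def predict_with_means(user_ratings, item_means, sims, target, k=10):
    """Estimate one user's rating for `target` from the items they already rated.

    user_ratings: {item: rating} for a single user (hypothetical toy structure)
    item_means:   {item: mean rating of that item over all users}
    sims:         {(target_item, other_item): similarity}
    """
    # Rank the items this user rated by their similarity to the target item.
    ranked = sorted(
        ((sims.get((target, j), 0.0), j) for j in user_ratings if j != target),
        reverse=True,
    )[:k]
    # Only neighbours with a positive similarity contribute to the estimate.
    neighbours = [(s, j) for s, j in ranked if s > 0]
    if not neighbours:
        return item_means[target]  # fall back to the item's mean rating
    weighted_dev = sum(s * (user_ratings[j] - item_means[j]) for s, j in neighbours)
    total_sim = sum(s for s, _ in neighbours)
    return item_means[target] + weighted_dev / total_sim


# Toy data: the user rated games A and B, and we estimate game C.
toy_user = {'A': 8.0, 'B': 6.0}
toy_means = {'A': 7.0, 'B': 7.0, 'C': 8.0}
toy_sims = {('C', 'A'): 0.75, ('C', 'B'): 0.25}
print(predict_with_means(toy_user, toy_means, toy_sims, 'C'))  # 8.5
```

The basic model would instead average the neighbours' raw ratings with the same weights, without centring them on item means.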

For each model, we run `cross_validate` to find the one with the lowest `rmse` (Root Mean Squared Error) score. We save these results in `.csv` files in the `results` folder. Please note that we are doing some hyperparameter optimization over the `k` factor and the similarity measure, so we could have used `GridSearchCV` as well. I wanted to explore the `cross_validate` function on its own, and the hyperparameter optimization is not too complicated to do by hand. 
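As a reminder of what the score measures, here is a small sketch of the RMSE computation in plain Python. The helper name `rmse` and the toy pairs are illustrative only; in the notebook itself the score comes from `surprise`.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: three ratings, each estimated as 7.0.
toy_pairs = [(8.0, 7.0), (6.0, 7.0), (9.0, 7.0)]
print(rmse(toy_pairs))  # about 1.414, i.e. sqrt((1 + 1 + 4) / 3)
```

Because it is an error measure, a lower value means a better fit.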



# Infrastructure

In [11]:
# surprise library
from surprise import Dataset, Reader

from surprise.similarities import \
    cosine, msd, pearson, pearson_baseline
from surprise.prediction_algorithms.knns import \
    KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise.model_selection import \
    train_test_split, GridSearchCV, cross_validate

from surprise import accuracy
from surprise.model_selection import KFold

import pandas as pd
import numpy as np
import csv

# my functions for this project
import bgg_data_func
import bgg_model_func
from game_name_converter import NameConverter

In [12]:
# setting a random seed for reproducibility
my_seed = 12345
np.random.seed(my_seed)

# name converter to convert raw ids to game names
name_converter = NameConverter('games_master_list.csv')

In [13]:
# importing top 100 games and create a surprise object from it
file_path = './data_input/games_100_summary.csv'

reader = Reader(line_format='user item rating', sep=',', rating_scale = (1,10))

data = Dataset.load_from_file(file_path, reader=reader)

In [14]:
trainset, testset = train_test_split(data, test_size=0.2)

In [15]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  225784 

Number of items:  100 



In [16]:
trainset_iids = list(trainset.all_items())
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids))

In [17]:
# we are using the trainsetfull to save distance metrics
trainsetfull = data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  237253 

Number of items:  100 



In [18]:
trainsetfull_iids = list(trainsetfull.all_items())
iid_converter = lambda x: trainsetfull.to_raw_iid(x)
trainsetfull_raw_iids = list(map(iid_converter, trainsetfull_iids))

# Modelling

First, we define the similarity options, to be used by different models. The four options are:
- `msd`
- `cosine`
- `pearson`
- `pearson_baseline`
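For intuition, the first three measures can be sketched in plain Python for a pair of items, using only the users who rated both. This follows my reading of the `surprise` documentation (in particular, the MSD similarity being `1 / (msd + 1)`); the function names and toy ratings are assumptions for illustration, and `pearson_baseline` is left out because it needs baseline estimates.

```python
import math


def common_ratings(ratings_i, ratings_j):
    """Return the two rating lists restricted to users who rated both items."""
    users = sorted(set(ratings_i) & set(ratings_j))
    return [ratings_i[u] for u in users], [ratings_j[u] for u in users]


def msd_sim(x, y):
    """Mean squared difference, turned into a similarity in (0, 1]."""
    msd = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 1.0 / (msd + 1.0)


def cosine_sim(x, y):
    """Cosine of the angle between the two raw rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_sim(x, y):
    """Correlation of the mean-centred rating vectors (0 if undefined)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * \
        math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return num / den if den else 0.0


# Toy ratings keyed by user; u4 rated only one of the two games.
game_a = {'u1': 8, 'u2': 6, 'u3': 9}
game_b = {'u1': 7, 'u2': 5, 'u3': 8, 'u4': 10}
x, y = common_ratings(game_a, game_b)
print(msd_sim(x, y), cosine_sim(x, y), pearson_sim(x, y))
```

In the toy data, game B is always rated exactly one point below game A, so `pearson` sees a perfect correlation while `MSD` penalises the constant offset.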

In [24]:
sim_msd = {'name':'MSD', 'user_based':False}
sim_cos = {'name':'cosine', 'user_based':False}
sim_pearson = {'name':'pearson', 'user_based':False}
sim_pearson_baseline = {'name': 'pearson_baseline','user_based':False, 'shrinkage': 100}

sim_options = [sim_msd, sim_cos, sim_pearson, sim_pearson_baseline]

In [11]:
list_of_ks = [10,20,40]

On to the modelling. 

`KNNBasic`, `KNNWithMeans` and `KNNWithZScore` can be used with the first three similarity options, and have one hyperparameter to tune: `k`, the number of neighbors. We look at these first. 

With `KNNBaseline`, the `kNN` method is combined with an underlying baseline estimate (fitted with `sgd` or `als`), and the recommended similarity measure is `pearson_baseline` (according to `surprise`'s documentation).  
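The `shrinkage` value set for `pearson_baseline` above damps correlations that rest on only a few common ratings. A minimal sketch of that shrinkage step, based on the formula given in the `surprise` documentation as I understand it (the function name and the example numbers are hypothetical):

```python
def shrunk_correlation(rho, n_common, shrinkage=100):
    """Scale a correlation `rho` by how many common ratings support it."""
    return (n_common - 1) / (n_common - 1 + shrinkage) * rho


# The same raw correlation, backed by different amounts of evidence.
for n_common in (11, 101, 1001):
    print(n_common, round(shrunk_correlation(0.8, n_common), 3))
```

With `shrinkage = 100`, a correlation based on 101 common ratings is halved, while one based on about a thousand keeps most of its value.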

## Similarity Matrices

First, we are saving the top 10 closest games to each game in a `csv` file, separately for the three different similarity options. This is used in the first, simplest form of prediction we are doing in Jupyter notebook 04 later. 
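The helper `bgg_model_func.return_top_similar_dataframe` used below is not shown in this notebook. As a rough idea of what such a step involves, here is a plain-Python sketch that ranks the other items in each row of a similarity matrix; the function `top_similar` and the toy matrix are assumptions, not the project's actual implementation.

```python
def top_similar(sim_matrix, raw_ids, top_x):
    """Map each raw id to the raw ids of its `top_x` most similar other items."""
    result = {}
    for row_idx, row in enumerate(sim_matrix):
        # Rank all other columns by similarity, highest first.
        ranked = sorted(
            (col for col in range(len(row)) if col != row_idx),
            key=lambda col: row[col],
            reverse=True,
        )
        result[raw_ids[row_idx]] = [raw_ids[col] for col in ranked[:top_x]]
    return result


# Toy symmetric similarity matrix for three hypothetical games.
toy_sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]
toy_ids = ['game_a', 'game_b', 'game_c']
print(top_similar(toy_sim, toy_ids, 2))
```

Row and column positions correspond to inner ids, which is why the raw ids have to be passed in the same order as the matrix.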

In [4]:
# defined before first use, so the notebook also runs top to bottom
def save_similar_games(similarity_matrix, raw_iids, top_x, output_file):
    df = bgg_model_func.return_top_similar_dataframe(similarity_matrix, raw_iids, top_x)
    # convert raw game ids to game names in every column
    for column in df.columns:
        df[column] = df[column].map(name_converter.get_game_name_from_id)
    df.sort_values(['game'], inplace = True, axis = 0)
    df.to_csv(output_file, index = False)

In [3]:
# going for the first three similarity options
for i in range(0,3):
    model = KNNBasic(sim_options = sim_options[i], verbose = False)
    model.fit(trainsetfull)
    save_similar_games(
        model.sim, trainsetfull_raw_iids, 10, 
        './results/top_10_similar_games_' + sim_options[i]['name'] + '.csv')

In [5]:
df = pd.read_csv('./results/top_10_similar_games_' + sim_options[2]['name'] + '.csv')
df;

We are also saving the similarity matrix, just for `pearson` for now, that being the similarity option we use in the second, more complicated prediction function. 

In [13]:
model = KNNBasic(sim_options = sim_options[2], verbose = False)
model.fit(trainsetfull)

<surprise.prediction_algorithms.knns.KNNBasic at 0x181bad390>

In [17]:
np.savetxt('./results/pearson_distances.csv', model.sim, delimiter=',')

In [21]:
with open('./results/game_ids_for_distance_matrix.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(zip(trainsetfull_raw_iids, trainsetfull_iids))

## kNN Models without Baseline

We start with the three `kNN` models that have no baseline attached to them. We use `cross_validate` on all of them, and then save the rmse scores in `kNN_scores.csv` in the `results` folder. 

Then, we pick the best performing model, and train it on the trainset only. 
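The numbers saved per configuration are the train and test `rmse` averaged over the folds. The sketch below imitates that procedure in plain Python with a placeholder "model" that always predicts the mean of its training ratings; it is only meant to show the fold mechanics, and the function name `k_fold_rmse` and the toy ratings are made up.

```python
import math
import random


def k_fold_rmse(ratings, n_splits=3, seed=12345):
    """Average train and test RMSE of a global-mean predictor over the folds."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_splits] for i in range(n_splits)]

    def rmse(rows, estimate):
        return math.sqrt(sum((r - estimate) ** 2 for r in rows) / len(rows))

    train_scores, test_scores = [], []
    for i, test_rows in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        estimate = sum(train_rows) / len(train_rows)  # placeholder model
        train_scores.append(rmse(train_rows, estimate))
        test_scores.append(rmse(test_rows, estimate))
    return sum(train_scores) / n_splits, sum(test_scores) / n_splits


toy_ratings = [6.0, 7.0, 8.0, 9.0, 5.0, 7.5, 8.5, 6.5, 10.0]
train_rmse, test_rmse = k_fold_rmse(toy_ratings)
print(train_rmse, test_rmse)
```

Comparing the two averages is what reveals a gap between how well a model fits the ratings it has seen and the ones it has not.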

### Hyperparameter Tuning

In [2]:
# only run this cell once, when creating the file, it deletes everything from it

with open('./results/kNN_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['model_type', 'similarity_option', 'k', 'train_rmse', 'test_rmse'])
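One way to avoid the run-once restriction of the cell above is a small helper that writes the header only when the file is new or empty, and appends otherwise. This is a sketch with a hypothetical helper name (`append_score`) and placeholder score values, not what the notebook actually ran.

```python
import csv
import os
import tempfile


def append_score(path, header, row):
    """Append `row` to a CSV file, writing `header` first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


# Demo in a temporary folder; the score values are placeholders, not real results.
demo_header = ['model_type', 'similarity_option', 'k', 'train_rmse', 'test_rmse']
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'kNN_scores.csv')
    append_score(demo_path, demo_header, ['KNNBasic', 'MSD', 10, 0.0, 0.0])
    append_score(demo_path, demo_header, ['KNNBasic', 'MSD', 20, 0.0, 0.0])
    with open(demo_path, newline='') as f:
        demo_rows = list(csv.reader(f))
print(demo_rows)
```

Re-running a scoring loop that uses such a helper adds rows without wiping earlier results, at the cost of possible duplicate rows.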

In [1]:
# KNNBasic
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )        
        algo = KNNBasic(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNBasic', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [6]:
# KNNWithMeans
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )
        algo = KNNWithMeans(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNWithMeans', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [7]:
# KNNWithZScore
for curr_sim_option in sim_options[0:3]:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating sim_option = ' + str(curr_sim_option['name']) + \
            ' and k = ' + str(curr_k) + ' ...' )
        algo = KNNWithZScore(k = curr_k, sim_options = curr_sim_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                ['KNNWithZScore', curr_sim_option['name'], str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [58]:
df = pd.read_csv('./results/kNN_scores.csv')
df.sort_values(by = 'test_rmse', inplace = True)
df

| | model_type | similarity_option | k | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 15 | KNNWithMeans | pearson | 10 | 0.802911 | 1.268071 |
| 24 | KNNWithZScore | pearson | 10 | 0.803766 | 1.268229 |
| 25 | KNNWithZScore | pearson | 20 | 0.876658 | 1.270155 |
| 16 | KNNWithMeans | pearson | 20 | 0.875939 | 1.270204 |
| 26 | KNNWithZScore | pearson | 40 | 0.903048 | 1.274972 |
| 17 | KNNWithMeans | pearson | 40 | 0.902398 | 1.275318 |
| 10 | KNNWithMeans | MSD | 20 | 0.875226 | 1.278689 |
| 9 | KNNWithMeans | MSD | 10 | 0.789032 | 1.27878 |
| 19 | KNNWithZScore | MSD | 20 | 0.877514 | 1.281942 |
| 11 | KNNWithMeans | MSD | 40 | 0.904816 | 1.2827 |


Based on the results, the best model seems to be this: 
- model type: `KNNWithMeans` 
- similarity option: `pearson` 
- `k`: 10
<br>

Overall, `pearson` seems to be a good similarity option. `KNNWithMeans` and `KNNWithZScore` outperform `KNNBasic`.

There is probably some overfitting: the train `rmse` scores are much lower than the test scores (and with `rmse`, lower is better), but there is not much that can be done about it. The cross-validated test scores are quite promising. 

### Chosen Model Fitting

In [15]:
chosen_k = 10
chosen_sim_option = sim_pearson

In [16]:
chosen_knn = KNNWithMeans(k = chosen_k, sim_options = chosen_sim_option)
chosen_knn.fit(trainset)
predictions = chosen_knn.test(testset)
accuracy.rmse(predictions)

Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.2530


1.2529763600132695

`rmse` score for the chosen kNN without baseline model is __1.253__. 

## kNN with Baseline

Next, we are following a similar process with the baseline model. We are using `cross_validate` to calculate scores of different models, then pick the best one, and fit it on the trainset and test on the test set. 

I am also doing some exploratory runs before running `cross_validate`. 

### Hyperparameter Tuning - SGD

With `sgd`, there are two things to potentially tune: `reg` and `learning_rate`. 
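To show where `reg` and `learning_rate` enter, here is a plain-Python sketch of estimating baselines with stochastic gradient descent: each rating is modelled as global mean plus a user bias plus an item bias, and both biases are nudged after every rating. The defaults mirror the ones I believe `surprise` documents (`reg` 0.02, `learning_rate` 0.005, 20 epochs); the function name and toy ratings are assumptions, and this is not the library's code.

```python
def sgd_baselines(ratings, reg=0.02, learning_rate=0.005, n_epochs=20):
    """Estimate the global mean and user/item biases from (user, item, rating) rows."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    user_bias = {u: 0.0 for u, _, _ in ratings}
    item_bias = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            # Prediction error of the current baseline for this rating.
            err = r - (global_mean + user_bias[u] + item_bias[i])
            # Move each bias towards the error, pulled back by regularization.
            user_bias[u] += learning_rate * (err - reg * user_bias[u])
            item_bias[i] += learning_rate * (err - reg * item_bias[i])
    return global_mean, user_bias, item_bias


# Toy data: u1 rates generously, u2 rates harshly.
toy_ratings = [
    ('u1', 'a', 9.0), ('u1', 'b', 8.0),
    ('u2', 'a', 6.0), ('u2', 'b', 5.0),
]
mean, user_bias, item_bias = sgd_baselines(toy_ratings)
print(mean, user_bias, item_bias)
```

A larger `learning_rate` moves the biases faster per rating, while a larger `reg` keeps them closer to zero.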

In [18]:
algo = KNNBaseline(
    k = 10, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'sgd','learning_rate': .00005,}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2525


1.2524560273013874

In [21]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'sgd','learning_rate': .00005,}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2549


1.254947354928457

In [19]:
algo = KNNBaseline(
    k = 10, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'sgd','reg': 0.7, 'learning_rate': 0.006}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2596


1.2596134464301816

In [20]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'sgd','reg': 0.7, 'learning_rate': 0.006}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2576


1.2576253376221016

In [22]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'sgd','reg': 1, 'learning_rate': 0.01}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2594


1.2594038932095013

Next, we choose the `sgd` hyperparameters with cross validation. 

In [12]:
sgd_bsl_options = [
    {'method':'sgd', 'reg': 0.02, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.05, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.1, 'learning_rate': 0.005},
    {'method':'sgd', 'reg': 0.02, 'learning_rate': 0.01},
    {'method':'sgd', 'reg': 0.05, 'learning_rate': 0.01},
    {'method':'sgd', 'reg': 0.1, 'learning_rate': 0.01}
]

In [14]:
with open('./results/kNN_baseline_sgd_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['reg', 'learning_rate', 'k', 'train_rmse', 'test_rmse'])

In [8]:
for curr_bsl_option in sgd_bsl_options:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating k = ' + str(curr_k) + ' ...'
        )        
        algo = KNNBaseline(k = curr_k, sim_options = sim_pearson_baseline, bsl_options = curr_bsl_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_baseline_sgd_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                [curr_bsl_option['reg'], curr_bsl_option['learning_rate'],str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [19]:
df = pd.read_csv('./results/kNN_baseline_sgd_scores.csv')
df.sort_values(by = 'test_rmse', inplace = True)
df

| | reg | learning_rate | k | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 4 | 0.05 | 0.005 | 20 | 0.717027 | 1.270645 |
| 1 | 0.02 | 0.005 | 20 | 0.714546 | 1.272096 |
| 2 | 0.02 | 0.005 | 40 | 0.734164 | 1.272906 |
| 8 | 0.1 | 0.005 | 40 | 0.721023 | 1.273011 |
| 7 | 0.1 | 0.005 | 20 | 0.706895 | 1.273243 |
| 16 | 0.1 | 0.01 | 20 | 0.688731 | 1.274284 |
| 5 | 0.05 | 0.005 | 40 | 0.734801 | 1.274333 |
| 6 | 0.1 | 0.005 | 10 | 0.640963 | 1.274801 |
| 0 | 0.02 | 0.005 | 10 | 0.643013 | 1.275287 |
| 10 | 0.02 | 0.01 | 20 | 0.693436 | 1.276044 |


In the best performing model, `reg` = 0.05, `learning_rate` = 0.005, `k` = 20. 

### Chosen Model Fitting - SGD

In [20]:
chosen_reg_sgd = 0.05
chosen_learning_rate_sgd = 0.005
chosen_k_sgd = 20

In [25]:
chosen_knn_baseline_sgd = KNNBaseline(
    k = chosen_k_sgd, 
    sim_options = sim_pearson_baseline,
    bsl_options = {
        'method':'sgd', 'reg': chosen_reg_sgd,  'learning_rate': chosen_learning_rate_sgd}
    )
chosen_knn_baseline_sgd.fit(trainset)
predictions = chosen_knn_baseline_sgd.test(testset)
accuracy.rmse(predictions)

Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2577


1.2576738344721636

`rmse` score for the chosen kNN with `sgd` baseline model is __1.2577__. 

### Hyperparameter Tuning - ALS

With `als`, there are two things to potentially tune: `reg_i` and `reg_u`. 
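To show what `reg_i` and `reg_u` do, here is a plain-Python sketch of alternating least squares for the baselines: item biases are recomputed with the user biases held fixed, then the other way round, and each regularization term is added to the rating count in the denominator. The defaults mirror what I believe `surprise` documents (`reg_i` 10, `reg_u` 15, 10 epochs); the function name and toy ratings are assumptions, not the library's code.

```python
from collections import defaultdict


def als_baselines(ratings, reg_i=10, reg_u=15, n_epochs=10):
    """Estimate the global mean and user/item biases from (user, item, rating) rows."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    by_item, by_user = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_item[i].append((u, r))
        by_user[u].append((i, r))
    user_bias = {u: 0.0 for u in by_user}
    item_bias = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Item step: average residual per item, shrunk towards zero by reg_i.
        for i, rows in by_item.items():
            dev = sum(r - global_mean - user_bias[u] for u, r in rows)
            item_bias[i] = dev / (reg_i + len(rows))
        # User step: average residual per user, shrunk towards zero by reg_u.
        for u, rows in by_user.items():
            dev = sum(r - global_mean - item_bias[i] for i, r in rows)
            user_bias[u] = dev / (reg_u + len(rows))
    return global_mean, user_bias, item_bias


toy_ratings_als = [
    ('u1', 'a', 9.0), ('u1', 'b', 8.0),
    ('u2', 'a', 6.0), ('u2', 'b', 5.0),
]
print(als_baselines(toy_ratings_als, reg_i=0, reg_u=0))
print(als_baselines(toy_ratings_als))
```

With both terms set to zero the toy biases match the plain averages; with the defaults they shrink noticeably, which matters most for users or items with few ratings.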

In [23]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'als','reg_i': 10, 'reg_u': 15}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2615


1.2615304328151062

In [24]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'als','reg_i': 20, 'reg_u': 30}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2456


1.2455804940611168

In [25]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'als','reg_i': 40, 'reg_u': 60}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2443


1.2443345584252317

In [26]:
algo = KNNBaseline(
    k = 20, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'als','reg_i': 80, 'reg_u': 120}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2476


1.2476387661036483

In [27]:
algo = KNNBaseline(
    k = 40, sim_options = sim_pearson_baseline, 
    bsl_options = {'method': 'als','reg_i': 40, 'reg_u': 60}
            )
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2479


1.2478712796308054

In [17]:
als_bsl_options = [
    {'method':'als', 'reg_i': 20, 'reg_u': 30},
    {'method':'als', 'reg_i': 40, 'reg_u': 60},
    {'method':'als', 'reg_i': 20, 'reg_u': 30},
    {'method':'als', 'reg_i': 40, 'reg_u': 60}
]

In [16]:
with open('./results/kNN_baseline_als_scores.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['reg_i', 'reg_u', 'k', 'train_rmse', 'test_rmse'])

In [26]:
for curr_bsl_option in als_bsl_options:

    for curr_k in list_of_ks:
        
        print(
            'Currently calculating k = ' + str(curr_k) + ' ...'
        )        
        algo = KNNBaseline(k = curr_k, sim_options = sim_pearson_baseline, bsl_options = curr_bsl_option)
        results = cross_validate(algo, data, measures=['RMSE'], cv=3, return_train_measures=True);
        
        with open('./results/kNN_baseline_als_scores.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow(
                [curr_bsl_option['reg_i'], curr_bsl_option['reg_u'],str(curr_k), 
                 str(np.mean(results['train_rmse'])), str(np.mean(results['test_rmse']))])

In [27]:
df = pd.read_csv('./results/kNN_baseline_als_scores.csv')
df.sort_values(by = 'test_rmse', inplace = True)
df

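| | reg_i | reg_u | k | train_rmse | test_rmse |
|---|---|---|---|---|---|
| 3 | 40 | 60 | 10 | 0.653267 | 1.259249 |
| 10 | 40 | 60 | 20 | 0.726383 | 1.259914 |
| 9 | 40 | 60 | 10 | 0.653127 | 1.26 |
| 4 | 40 | 60 | 20 | 0.726415 | 1.260152 |
| 7 | 20 | 30 | 20 | 0.602146 | 1.261174 |
| 1 | 20 | 30 | 20 | 0.60222 | 1.261175 |
| 8 | 20 | 30 | 40 | 0.61501 | 1.261723 |
| 2 | 20 | 30 | 40 | 0.614825 | 1.261848 |
| 0 | 20 | 30 | 10 | 0.546603 | 1.262745 |
| 6 | 20 | 30 | 10 | 0.5465 | 1.263341 |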


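In the best performing model, `reg_i` = 40, `reg_u` = 60, `k` = 10. Each configuration shows up twice in the table because `als_bsl_options` lists both `reg_i`/`reg_u` pairs twice. 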

### Chosen Model Fitting - ALS

In [29]:
chosen_reg_i_als = 40
chosen_reg_u_als = 60
chosen_k_als = 10

In [30]:
chosen_knn_baseline_als = KNNBaseline(
    k = chosen_k_als, 
    sim_options = sim_pearson_baseline,
    bsl_options = {
        'method':'als', 'reg_i':chosen_reg_i_als, 'reg_u': chosen_reg_u_als }
    )
chosen_knn_baseline_als.fit(trainset)
predictions = chosen_knn_baseline_als.test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.2456


1.2455509595825713

`rmse` score for the chosen kNN with `als` baseline model is __1.2456__. 

# Results

We picked one hyperparameter setup from three model groups, all of them `kNN`-type models: 
- for the model with no baseline (`KNNWithMeans`), `rmse` = 1.2530
- for the baseline model with `sgd`, `rmse` = 1.2577
- for the baseline model with `als`, `rmse` = 1.2456 

The results are quite close to each other, but strictly speaking, the `kNN` baseline model that uses `als` to estimate the baseline performed the best out of the three types. 