# Collaborative filtering

Collaborative filtering models find similarities between users or between items from the ratings users have given, and recommend items that similar users have liked.

### Importing necessary packages


In [1]:
# installing the surprise library
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1675384 sha256=6e041b1bad2679ac38d1fc20f356f73f55e491f9ceca17f76cc2b8b0ff25326d
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.0


In [2]:
# importing necessary packages
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler 
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import knns
from surprise import accuracy
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split

from surprise import Reader, Dataset
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, FreqDist
nltk.download('punkt')
import re
import string
import os

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Importing and checking through the data sets

In [3]:
# unzipping the archive containing the steam_rs.csv data
!unzip steam_rs.zip

Archive:  steam_rs.zip
  inflating: steam_rs.csv            


In [0]:
# importing the data set
steam = pd.read_csv('steam_rs.csv')

In [0]:
# removing release_date column
steam = steam.drop('release_date', axis = 1)

In [6]:
# displaying the data frame
steam.head()

Unnamed: 0,id,appid,name,purchase,hours_of_play,developer,publisher,positive,negative,english,platforms,required_age,categories,genres,steamspy_tags,achievements,average_playtime,median_playtime,owners,detailed_description,about_the_game,short_description,price,rank
0,151603712,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
1,151603712,570,Dota 2,play,0.5,Valve,Valve,1097301,194384,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
2,187131847,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
3,187131847,570,Dota 2,play,2.3,Valve,Valve,1097301,194384,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
4,176410694,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113


In [0]:
# modifying the columns in the data frame
steam = steam[['id', 'appid', 'name', 'rank', 'genres', 'steamspy_tags', 'short_description', 'hours_of_play']]

Since Steam does not use ratings for games, I will base the recommendations on each user's hours of play.

In [0]:
# reducing the data frame further to the user id, the game name and its hours of play
steam = steam[['id', 'name', 'hours_of_play']]

In [9]:
# displaying shape of data frame
steam.shape

(99632, 3)

In [0]:
# dropping duplicate (id, name) pairs and keeping the last occurrence, so each user keeps one row per game and no unique game is removed
steam.drop_duplicates(subset = ['id', 'name'],
                     keep = 'last', inplace = True)

In [11]:
# displaying shape of data frame
steam.shape

(60446, 3)
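
To illustrate what `keep = 'last'` does here, below is a minimal standard-library sketch. It assumes, as the `steam.head()` output above suggests, that a user's `purchase` row comes before the `play` row for the same game. The sample rows are copied from that output; the helper name `keep_last` is hypothetical and is not part of pandas.

```python
# Rows are (id, name, hours_of_play) in file order, taken from steam.head() above.
rows = [
    (151603712, "Dota 2", 0.0),  # purchase row
    (151603712, "Dota 2", 0.5),  # play row
    (187131847, "Dota 2", 0.0),  # purchase row
    (187131847, "Dota 2", 2.3),  # play row
]


def keep_last(rows):
    """Keep the last row seen for each (id, name) pair."""
    latest = {}
    for user_id, name, hours in rows:
        latest[(user_id, name)] = hours  # later rows overwrite earlier ones
    return [(user_id, name, hours) for (user_id, name), hours in latest.items()]


deduped = keep_last(rows)
print(deduped)
```

Under that ordering assumption, the surviving row for each user and game is the one that carries the played hours rather than the purchase marker.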

In [0]:
# transforming the current data set into something that is compatible with surprise
reader = Reader()
steam = Dataset.load_from_df(steam,reader)
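
One caveat worth checking at this step: `Reader()` with no arguments uses a rating scale of 1 to 5, and Surprise clips its estimates to that scale by default, while `hours_of_play` is not bounded by 5. The sketch below uses hypothetical hours values (not taken from the data set) to show how clipping alone leaves a large error even for an otherwise perfect predictor. Passing a `rating_scale=(low, high)` tuple derived from the column to `Reader` is one possible remedy; that is an assumption on my part and not something this notebook does.

```python
def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


hours = [0.5, 2.3, 40.0, 300.0]  # hypothetical hours of play
perfect_predictions = list(hours)  # a predictor that knows the true hours

# Clipping to the default 1-to-5 scale caps every long play time at 5.
clipped = [clip(p, 1, 5) for p in perfect_predictions]
errors = [abs(h - c) for h, c in zip(hours, clipped)]

print(clipped)
print(errors)
```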

### Train/ test split

The train and test data sets will contain randomly selected user ratings and items instead of the entire list of users and items. 80% of these ratings reside in the training set and the remaining 20% is in the test set.
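
As a rough illustration of this kind of split, the sketch below shuffles (user, item, rating) triples and cuts off 20% as the test set. It is not Surprise's implementation; the function name `split_ratings`, the fixed seed and the toy ratings are all assumptions made for the example.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=0):
    """Shuffle the rating triples and return (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 10 users x 5 games, with made-up hours.
ratings = [(user, f"game_{item}", float(user + item))
           for user in range(10) for item in range(5)]

train, test = split_ratings(ratings)
print(len(train), len(test))
```

Because individual ratings are sampled rather than whole users, a user or game whose ratings all land in the test set is absent from the training set.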

In [0]:
# performing train/test split
trainset, testset = train_test_split(steam, test_size=0.2)

In [14]:
# checking the number of items and users in the data set
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  9134 

Number of items:  2169 



From this it can be seen that the training set contains far fewer items than users.
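
These counts also give a rough idea of how sparse the user-item matrix is. The sketch below assumes the training set holds about 80% of the 60446 de-duplicated rows; the exact count is available as `trainset.n_ratings`.

```python
n_users, n_items = 9134, 2169            # from the output above
n_ratings_total = 60446                  # rows left after drop_duplicates
n_train = round(0.8 * n_ratings_total)   # approximate size of the 80% split

# Share of all possible (user, item) pairs that actually have a rating.
density = n_train / (n_users * n_items)
print(f"{density:.4%} of the user-item matrix is observed")
```

With well under 1% of the matrix filled in, most pairs of users share few or no games, which limits how reliable neighbour-based similarities can be.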

# Memory Based

For these memory-based models I will be using KNNBasic, which is a basic collaborative filtering algorithm. In addition, I will be using KNNBaseline, which is a basic collaborative filtering algorithm that takes a baseline rating into account, and KNNWithMeans, which is a basic collaborative filtering algorithm that takes the mean rating of each user into account.
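
To make the difference between these variants concrete, here is a minimal sketch of the prediction step as I understand it from the Surprise documentation. KNNBasic takes a similarity-weighted average of the neighbours' ratings; KNNWithMeans does the same on mean-centred ratings and adds the target user's mean back. The function names and the toy numbers are hypothetical, and details such as neighbour selection and `min_k` are left out.

```python
def knn_basic(neighbors):
    """neighbors: (similarity, rating) pairs for the k nearest users who rated the item."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(sim for sim, _ in neighbors)
    return numerator / denominator


def knn_with_means(user_mean, neighbors):
    """neighbors: (similarity, rating, neighbor_mean) triples."""
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    return user_mean + numerator / denominator


# Two neighbours: a very similar user who played 10 h and a weakly similar one who played 50 h.
basic_estimate = knn_basic([(0.9, 10.0), (0.1, 50.0)])
means_estimate = knn_with_means(20.0, [(0.9, 10.0, 8.0), (0.1, 50.0, 60.0)])
print(basic_estimate, means_estimate)
```

KNNBaseline follows the same pattern but centres on a baseline estimate instead of the plain mean.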

## Cosine similarity
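
Cosine similarity compares two users (or two items) by the angle between their rating vectors, using only the entries they have in common. The sketch below is a plain-Python illustration with hypothetical users and hours, not Surprise's implementation.

```python
import math


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity over the games both users have a rating for."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical hours of play; bob plays each shared game twice as long as alice.
alice = {"Dota 2": 10.0, "Unturned": 2.0, "Warframe": 5.0}
bob = {"Dota 2": 20.0, "Unturned": 4.0}
similarity = cosine_sim(alice, bob)
print(similarity)
```

The two users come out as essentially identical, because cosine similarity looks at the proportions between games and ignores how many hours each user plays overall.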

## KNNBasic with cosine similarity (USER BASED)

In [0]:
#   cosine similarity
sim_cos = {'name':'cosine', 'user_based':True}

In [0]:
#   training the model with user_based = True
basic_user = knns.KNNBasic(sim_options=sim_cos)

In [17]:
#   fitting the model
simcos_cv_user = cross_validate(basic_user, steam, measures=['rmse', 'mae'],
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)

Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    204.5769  246.6565  212.8675  221.3670  18.1999
MAE (testset)     34.6045   36.4259   33.6591   34.8965   1.1482
RMSE (trainset)   230.3942  208.7817  226.6090  221.9283  9.4236
MAE (trainset)    34.7446   33.8297   35.1954   34.5899   0.5682
Fit time          24.94     26.59     13.84     21.79     5.66
Test time         5.87      6.03      3.00      4.97      1.39


In [18]:
for i in simcos_cv_user.items():
    print(i)
print('-----------------')
print(np.mean(simcos_cv_user['test_rmse']))

('test_rmse', array([204.5769285 , 246.65650146, 212.8674908 ]))
('train_rmse', array([230.39421302, 208.78171823, 226.60901308]))
('test_mae', array([34.60454199, 36.42590337, 33.65913531]))
('train_mae', array([34.7445758 , 33.82967345, 35.1954068 ]))
('fit_time', (24.93520998954773, 26.59270405769348, 13.842443227767944))
('test_time', (5.8696136474609375, 6.029863119125366, 3.004136800765991))
-----------------
221.36697358803846


## KNNBasic with cosine similarity (ITEM BASED)


In [0]:
#   cosine similarity
sim_cos = {'name':'cosine', 'user_based':False}

In [0]:
#   training the model with user_based = False
basic_item = knns.KNNBasic(sim_options=sim_cos)

In [21]:
#   fitting the model
simcos_cv_item = cross_validate(basic_item, steam, measures=['rmse', 'mae'],
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)

Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    203.2459  264.7117  191.7883  219.9153  32.0193
MAE (testset)     34.6813   37.8216   33.2628   35.2552   1.9049
RMSE (trainset)   231.0100  197.4605  235.8619  221.4442  17.0743
MAE (trainset)    34.8713   33.3267   35.6154   34.6045   0.9532
Fit time          1.67      2.34      1.48      1.83      0.37
Test time         2.06      1.55      0.80      1.47      0.52


In [22]:
for i in simcos_cv_item.items():
    print(i)
print('-----------------')
print(np.mean(simcos_cv_item['test_rmse']))

('test_rmse', array([203.24591737, 264.71165175, 191.78831519]))
('train_rmse', array([231.01004416, 197.46053306, 235.86188335]))
('test_mae', array([34.68127341, 37.82162647, 33.2628225 ]))
('train_mae', array([34.87134728, 33.32671691, 35.61540859]))
('fit_time', (1.6671230792999268, 2.335387706756592, 1.4834179878234863))
('test_time', (2.0574514865875244, 1.5549097061157227, 0.8019325733184814))
-----------------
219.9152947688224


## Pearson similarity
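
Pearson similarity is the correlation between two rating vectors after each has been centred on its own mean, again using only the entries they share. The sketch below is a plain-Python illustration with hypothetical users and hours; it is not Surprise's implementation.

```python
import math


def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation over the games both users have a rating for."""
    common = sorted(ratings_u.keys() & ratings_v.keys())
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


# Hypothetical hours of play on three games.
alice = {"A": 1.0, "B": 2.0, "C": 3.0}
bob = {"A": 30.0, "B": 20.0, "C": 10.0}    # opposite preferences to alice
carol = {"A": 11.0, "B": 12.0, "C": 13.0}  # same preferences, 10 h more on each game

print(pearson_sim(alice, bob), pearson_sim(alice, carol))
```

Unlike cosine similarity, the centring step means a constant offset in hours (carol versus alice) does not reduce the similarity, while reversed preferences give a negative value.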

## KNNBaseline with pearson similarity (USER BASED)

In [0]:
# pearson similarity
sim_pearson = {'name':'pearson', 'user_based':True}

In [0]:
#   training the model with user_based = True
knn_baseline_user = knns.KNNBaseline(sim_options=sim_pearson)


In [25]:
#   fitting the model
sim_pearson_cv_user = cross_validate(knn_baseline_user, steam, measures=['rmse', 'mae'],
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)



Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    221.3875  217.5467  227.3137  222.0827  4.0175
MAE (testset)     35.4333   35.2876   34.1606   34.9605   0.5687
RMSE (trainset)   222.4656  224.3515  219.4561  222.0911  2.0160
MAE (trainset)    34.0063   34.0615   34.6364   34.2347   0.2849
Fit time          28.32     29.64     16.38     24.78     5.97
Test time         6.63      7.03      3.52      5.73      1.57


In [26]:
for i in sim_pearson_cv_user.items():
    print(i)
print('-----------------')
print(np.mean(sim_pearson_cv_user['test_rmse']))

('test_rmse', array([221.38753405, 217.54674443, 227.31372382]))
('train_rmse', array([222.46564568, 224.35148106, 219.4561416 ]))
('test_mae', array([35.43332079, 35.28755382, 34.16062031]))
('train_mae', array([34.00629883, 34.06145064, 34.63644716]))
('fit_time', (28.317095041275024, 29.642582416534424, 16.378057718276978))
('test_time', (6.626521348953247, 7.029038906097412, 3.5241141319274902))
-----------------
222.08266743355136


## KNNBaseline with pearson similarity (ITEM BASED)

In [0]:
# pearson similarity
sim_pearson = {'name':'pearson', 'user_based':False}

In [0]:
#   training the model with user_based = False
knn_baseline_item = knns.KNNBaseline(sim_options=sim_pearson)

In [29]:
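#   fitting the item-based model (knn_baseline_item)
#   note: the output recorded below came from a run that passed knn_baseline_user here,
#   which is why its fit times match the user-based run above
sim_pearson_cv_item = cross_validate(knn_baseline_item, steam, measures=['rmse', 'mae'],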
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)

Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    246.3717  222.3894  194.5604  221.1072  21.1713
MAE (testset)     37.0523   34.5028   33.3295   34.9615   1.5541
RMSE (trainset)   208.9182  221.9648  234.6701  221.8510  10.5135
MAE (trainset)    33.1873   34.4563   35.0580   34.2338   0.7797
Fit time          27.31     29.28     17.01     24.54     5.38
Test time         6.60      6.70      3.62      5.64      1.43


In [30]:
for i in sim_pearson_cv_item.items():
    print(i)
print('-----------------')
print(np.mean(sim_pearson_cv_item['test_rmse']))

('test_rmse', array([246.37174928, 222.38942853, 194.56042984]))
('train_rmse', array([208.91818946, 221.96484817, 234.67005547]))
('test_mae', array([37.05229668, 34.50284447, 33.32947695]))
('train_mae', array([33.18727007, 34.45627109, 35.05800225]))
('fit_time', (27.31334137916565, 29.28150248527527, 17.012567281723022))
('test_time', (6.596349477767944, 6.702720403671265, 3.6239962577819824))
-----------------
221.107202549238


From the RMSE and MAE scores obtained, this KNNBaseline run with Pearson similarity does not clearly improve on the previous models: its mean test RMSE (221.11) and mean test MAE (34.96) are within one fold-to-fold standard deviation of the earlier results.

## KNNWithMeans with pearson similarity (USER BASED)

In [0]:
# pearson similarity
sim_pearson = {'name':'pearson', 'user_based':True}

In [0]:
#   training the model with user_based = True
knn_WithMeans_user = knns.KNNWithMeans(sim_options=sim_pearson)

In [33]:
#   fitting the model
sim_pearson_wm_cv_user = cross_validate(knn_WithMeans_user, steam, measures=['rmse', 'mae'],
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)

Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    201.4611  243.7119  219.4421  221.5384  17.3124
MAE (testset)     34.2260   36.3971   33.7972   34.8067   1.1381
RMSE (trainset)   231.7860  210.5229  223.4901  221.9330  8.7502
MAE (trainset)    34.4575   33.3450   34.6925   34.1650   0.5877
Fit time          28.26     28.96     15.26     24.16     6.30
Test time         6.14      6.58      3.21      5.31      1.49


In [34]:
for i in sim_pearson_wm_cv_user.items():
    print(i)
print('-----------------')
print(np.mean(sim_pearson_wm_cv_user['test_rmse']))

('test_rmse', array([201.46113363, 243.71185531, 219.44206158]))
('train_rmse', array([231.78600249, 210.52286761, 223.49013435]))
('test_mae', array([34.2259837 , 36.39708395, 33.79716772]))
('train_mae', array([34.45748556, 33.344979  , 34.69245102]))
('fit_time', (28.259010553359985, 28.96112585067749, 15.25899052619934))
('test_time', (6.143218517303467, 6.58163857460022, 3.2142868041992188))
-----------------
221.53835017556557


## KNNWithMeans with pearson similarity (ITEM BASED)

In [0]:
# pearson similarity
sim_pearson = {'name':'pearson', 'user_based':False}

In [0]:
#   training the model with user_based = False
knn_WithMeans_item = knns.KNNWithMeans(sim_options=sim_pearson)

In [37]:
#   fitting the model
sim_pearson_wm_cv_item = cross_validate(knn_WithMeans_item, steam, measures=['rmse', 'mae'],
                           cv = 3, return_train_measures=True, n_jobs= -1,
                           verbose = True)

Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1    Fold 2    Fold 3    Mean      Std
RMSE (testset)    236.3851  204.8304  224.0300  221.7485  12.9827
MAE (testset)     35.7357   34.6474   33.8361   34.7397   0.7782
RMSE (trainset)   214.6100  230.2625  221.1376  222.0034  6.4194
MAE (trainset)    33.5380   34.0686   34.5033   34.0366   0.3948
Fit time          1.47      2.03      1.40      1.63      0.28
Test time         2.32      1.65      0.96      1.64      0.55


In [38]:
for i in sim_pearson_wm_cv_item.items():
    print(i)
print('-----------------')
print(np.mean(sim_pearson_wm_cv_item['test_rmse']))

('test_rmse', array([236.38505284, 204.8304221 , 224.02995696]))
('train_rmse', array([214.61004907, 230.26254956, 221.13764998]))
('test_mae', array([35.7357287 , 34.64737916, 33.83614146]))
('train_mae', array([33.53795612, 34.06863997, 34.50331467]))
('fit_time', (1.4732868671417236, 2.0329673290252686, 1.398512363433838))
('test_time', (2.3176026344299316, 1.6523809432983398, 0.960484504699707))
-----------------
221.7484772981121


# Model based 

In [39]:
param_grid = {'n_factors':[20, 100],'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
               'reg_all': [0.4, 0.6]}
gs_model = GridSearchCV(SVD,param_grid=param_grid,n_jobs = -1,joblib_verbose=5)
gs_model.fit(steam)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   16.9s
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  2.0min finished


In [40]:
print(gs_model.best_score)
print(gs_model.best_params)

{'rmse': 221.24626257601489, 'mae': 35.7652975930309}
{'rmse': {'n_factors': 20, 'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, 'mae': {'n_factors': 20, 'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}}
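
The "80 tasks" in the log above can be reproduced by counting what the grid search has to fit. The sketch below expands the same `param_grid` with `itertools.product`; the 5 folds are an assumption based on `GridSearchCV`'s default when `cv` is not given, which is consistent with the task count in the log.

```python
from itertools import product

param_grid = {'n_factors': [20, 100], 'n_epochs': [5, 10],
              'lr_all': [0.002, 0.005], 'reg_all': [0.4, 0.6]}

# Every combination of one value per hyperparameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: default number of cross-validation folds
print(len(combinations), len(combinations) * n_folds)  # 16 80
```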


In [0]:
## Perform a gridsearch with SVD
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(steam)

In [42]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 219.44145445420668, 'mae': 35.76530381501246}
{'rmse': {'n_factors': 20, 'reg_all': 0.02}, 'mae': {'n_factors': 20, 'reg_all': 0.02}}


In [0]:
# create our best-performing algorithm
algo = gs_model.best_estimator['rmse']

In [44]:
# refitting on the full data set (this replaces the 80% trainset created earlier)
trainset = steam.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdeb87d8860>

In [45]:
# RMSE on the data the model was trained on
predictions = algo.test(trainset.build_testset())
print('Train', end='  ')
accuracy.rmse(predictions)

Train  RMSE: 222.1159


222.1159269640164

In [0]:
svd = SVD(n_factors=100, n_epochs=10, lr_all=0.005, reg_all=0.4)

In [47]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdeb87d8278>

In [48]:
predictions = svd.test(testset)
print(accuracy.rmse(predictions))

RMSE: 262.0211
262.021095408759


# Making recommendations

In [0]:
# importing the steam_rs data frame again and naming it df_games
df_games = pd.read_csv('steam_rs.csv')

In [50]:
# displaying the data frame
df_games.head()

Unnamed: 0,id,appid,name,purchase,hours_of_play,developer,publisher,positive,negative,release_date,english,platforms,required_age,categories,genres,steamspy_tags,achievements,average_playtime,median_playtime,owners,detailed_description,about_the_game,short_description,price,rank
0,151603712,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,2013-07-09,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
1,151603712,570,Dota 2,play,0.5,Valve,Valve,1097301,194384,2013-07-09,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
2,187131847,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,2013-07-09,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
3,187131847,570,Dota 2,play,2.3,Valve,Valve,1097301,194384,2013-07-09,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113
4,176410694,570,Dota 2,purchase,0.0,Valve,Valve,1097301,194384,2013-07-09,1,windows;mac;linux,0,Multi-player;Co-op;Steam Trading Cards;Steam W...,Action;Free to Play;Strategy,Free to Play;MOBA;Strategy,0,23944,801,100000000-200000000,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...",0.0,84.95113


In [0]:
# dropping duplicate (id, name) pairs and keeping the last occurrence, so each user keeps one row per game and no unique game is removed
df_games.drop_duplicates(subset = ['id', 'name'],
                     keep = 'last', inplace = True)

In [0]:
# dropping duplicate appids so that each game appears only once
df_games.drop_duplicates(subset = ['appid'],
                     keep = 'last', inplace = True)

In [0]:
# removing unnecessary columns
df_games = df_games[['id', 'appid', 'name', 'purchase', 'developer', 'genres', 'rank']]

In [54]:
svd = SVD(n_factors= 20, reg_all=0.02)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdeb9a1e6d8>

In [55]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=5, details={'was_impossible': False})
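
An estimate of exactly 5 is what I would expect here. The raw ids `2` and `4` do not look like ids in the training data (user ids resemble `151603712` and items are game names), and for an unknown user and item SVD falls back to the global mean, which is then clipped to the reader's 1-to-5 scale. The sketch below is a simplified stand-in for that logic, not Surprise's code; the function name and the global mean of 48.0 are hypothetical.

```python
def svd_estimate(global_mean, user_bias=None, item_bias=None,
                 user_factors=None, item_factors=None, scale=(1, 5)):
    """Global mean + known biases + factor dot product, clipped to the rating scale."""
    estimate = global_mean
    if user_bias is not None:      # user seen during training
        estimate += user_bias
    if item_bias is not None:      # item seen during training
        estimate += item_bias
    if user_factors is not None and item_factors is not None:
        estimate += sum(p * q for p, q in zip(user_factors, item_factors))
    low, high = scale
    return max(low, min(high, estimate))


# Unknown user and item: only the (hypothetical) global mean is left, then clipped.
print(svd_estimate(global_mean=48.0))  # 5
```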

In [0]:
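def game_rater(df_games, num, genre=None):
    """Ask the user for hours played on `num` randomly sampled games."""
    userID = 1000  # id assigned to the new user
    rating_list = []
    while num > 0:
        # sample one game, optionally restricted to a genre
        if genre:
            game = df_games[df_games['genres'].str.contains(genre)].sample(1)
        else:
            game = df_games.sample(1)
        print(game)
        rating = input('How many hours have you spent on this game, press n if you have not played it :\n')
        if rating == 'n':
            continue
        # use the same column name as the ratings data ('appid', not 'gameId')
        # and store the hours as a number rather than the raw input string
        rating_one_game = {'id': userID, 'appid': game['appid'].values[0], 'hours_of_play': float(rating)}
        rating_list.append(rating_one_game)
        num -= 1
    return rating_list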

In [66]:
user_rating = game_rater(df_games, 4, 'RPG')

             id   appid            name purchase   developer  \
99153  49462664  263920  Zombie Grinder     play  TwinDrills   

                                        genres       rank  
99153  Action;Adventure;Indie;RPG;Early Access  71.052632  
How many hours have you spent on this game, press n if you have not played it :
n
             id   appid       name  purchase            developer  \
97557  11373749  664780  Alter Ego  purchase  Choose Multiple LLC   

                                      genres  rank  
97557  Adventure;Casual;Indie;RPG;Simulation  70.0  
How many hours have you spent on this game, press n if you have not played it :
2
             id   appid                      name purchase             developer  \
68711  60859695  247080  Crypt of the NecroDancer     play  Brace Yourself Games   

                 genres       rank  
68711  Action;Indie;RPG  96.369565  
How many hours have you spent on this game, press n if you have not played it :
n
              id 

In [0]:
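## add the new ratings to the original ratings DataFrame
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
# note: df_games was reduced above to columns without hours_of_play, so only the appended rows carry hours
new_ratings_df = pd.concat([df_games, pd.DataFrame(user_rating)], ignore_index=True)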

In [0]:
new_ratings_df = new_ratings_df[['id', 'appid', 'hours_of_play']]


In [0]:
# transforming the current data set into something that is compatible with surprise
reader = Reader()
new_data = Dataset.load_from_df(new_ratings_df,reader)

In [70]:
# train a model using the new combined DataFrame
svd_ = SVD(n_factors= 20, reg_all=0.02)
svd_.fit(new_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdeb9a32f28>

In [0]:
# make predictions for the user
list_of_games = []
for m_id in df_games['appid'].unique():
    list_of_games.append( (m_id,svd_.predict(1000,m_id)[3]))

In [0]:
# order the predictions from highest to lowest rated
ranked_games = sorted(list_of_games, key=lambda x:x[1], reverse=True)

In [0]:
# return the top n recommendations
def recommended_games(user_ratings, df_games, n):
        for idx, rec in enumerate(user_ratings):
            title = df_games.loc[df_games['appid'] == int(rec[0])]['name']
            print('Recommendation # ', idx+1, ': ', title, '\n')
            n-= 1
            if n == 0:
                break

In [74]:
recommended_games(ranked_games, df_games, 5)

Recommendation #  1 :  9681    Dota 2
Name: name, dtype: object 

Recommendation #  2 :  14327    Team Fortress 2
Name: name, dtype: object 

Recommendation #  3 :  16959    Unturned
Name: name, dtype: object 

Recommendation #  4 :  18230    Warframe
Name: name, dtype: object 

Recommendation #  5 :  18246    Tom Clancy's Rainbow Six Siege
Name: name, dtype: object 
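
The `Name: name, dtype: object` lines appear because `df_games.loc[...]['name']` returns a pandas Series rather than a string; taking its first element (for example with `.iloc[0]`) would print only the title. The sketch below shows one way the top-n step could return plain titles and skip games the user has already rated. The function name, the appids other than Dota 2's and the estimates are illustrative assumptions.

```python
def top_n_titles(ranked_games, titles, n, already_rated=()):
    """ranked_games: (appid, estimate) pairs sorted best-first; titles: appid -> name."""
    picks = []
    for appid, estimate in ranked_games:
        if appid in already_rated or appid not in titles:
            continue  # skip games the user rated or that have no known title
        picks.append((titles[appid], estimate))
        if len(picks) == n:
            break
    return picks


# Illustrative lookup table and ranking (estimates are made up).
titles = {570: "Dota 2", 440: "Team Fortress 2", 304930: "Unturned"}
ranked = [(570, 5.0), (440, 5.0), (304930, 4.2)]

recommendations = top_n_titles(ranked, titles, n=2, already_rated={570})
print(recommendations)
```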



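Across the models that were run, mean test RMSE ranged only from about 219.9 to 222.1 and mean test MAE from about 34.7 to 35.3. These gaps are smaller than the fold-to-fold standard deviations (up to about 32 for RMSE), so no model is clearly more accurate than the others. Among the memory-based models, the lowest mean test RMSE (219.92) came from KNNBasic with cosine similarity (ITEM BASED), and the second SVD grid search scored 219.44. The clearest practical difference is speed: the item-based KNNBasic and KNNWithMeans runs fit in roughly 1.5 to 2.3 seconds per fold, versus roughly 14 to 30 seconds for the user-based runs.

RMSE is the square root of the mean squared residual, so it measures the typical size of the prediction error in hours of play and weights large errors heavily. A smaller RMSE is preferred because it means the predicted values lie closer to the observed ones.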
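
To show how the two metrics react to large errors, here is a minimal sketch with hypothetical hours and predictions (not values from the data set). A single large miss pulls RMSE far above MAE, which is the same pattern as the RMSE of about 221 next to an MAE of about 35 in the results above.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [0.5, 2.3, 1.0, 500.0]    # hypothetical hours of play
predicted = [1.0, 2.0, 1.0, 5.0]   # hypothetical estimates

print(rmse(actual, predicted), mae(actual, predicted))
```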

# FUTURE WORK



*   Developing improved recommendation systems using hybrid filtering methods
*   Solely using Steam's API to form game recommendations
*   Using matrix factorization techniques

