# Knowledge Data Discovery and Neural Networks : Final Project

In this notebook we explore the domain of recommender systems.

The purpose of this section is to let you face a new problem with the skills you have acquired thus far.

Your grade will be based on your results on the test set and will be relative to your classmates' results. We attach a file, "example_submission.csv", which shows the file you need to submit so that we can check your results on the test set, "recommender_test.csv" (only we know the labels). Use "recommender_train.csv" to train and validate the algorithm you choose. Submissions are scored with the root mean squared error (RMSE) metric.
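
Since RMSE is the scoring metric, it can help to have it available independently of any library. The following is a minimal sketch with NumPy; the function name `rmse` and the sample values are illustrative assumptions, not part of the assignment files.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 1 and 2 stars give sqrt((1 + 4) / 2)
print(rmse([4, 3], [5, 5]))
```

Because the errors are squared before averaging, a few large misses cost more than many small ones.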

We add here a couple of questions to guide you through the process of understanding the problem domain, but they will not be graded.
We recommend trying a couple of algorithms from the [surprise package](http://surpriselib.com/) and finding the one that works best for you.

We **recommend** reading a couple of online posts on "collaborative filtering" in recommender systems to get to know the topic.

#### Guided questions

1. What features do we have? Are they numerical, categorical, or both?
2. What are we trying to predict? Is it classification or regression?
3. Propose a very simple prediction algorithm that you can implement yourself. It doesn't have to be complicated, but make sure at least that each user gets a different rating for an item in the test set. You may find it especially useful if you have problems with the [surprise package](http://surpriselib.com/) or another package that you want to use.
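
One possible answer to question 3, sketched under assumptions: predict the global mean plus a per-user offset plus a per-item offset, so two users generally get different predictions for the same item. The column names `user`, `item` and `rating` and the 1 to 5 scale match the code later in this notebook; the helper names `fit_baseline` and `predict_baseline` and the toy data are hypothetical.

```python
import pandas as pd

def fit_baseline(train):
    """Learn the global mean plus per-user and per-item offsets."""
    mu = train["rating"].mean()
    user_bias = train.groupby("user")["rating"].mean() - mu
    # Item offset is measured after removing each rater's own offset.
    residual = train["rating"] - mu - train["user"].map(user_bias)
    item_bias = residual.groupby(train["item"]).mean()
    return mu, user_bias, item_bias

def predict_baseline(model, test, low=1.0, high=5.0):
    """Predict mu + b_user + b_item, using 0 for unseen users or items."""
    mu, user_bias, item_bias = model
    pred = (mu
            + test["user"].map(user_bias).fillna(0.0)
            + test["item"].map(item_bias).fillna(0.0))
    return pred.clip(low, high)

toy_train = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "item": [10, 20, 10, 30, 20],
    "rating": [5, 4, 2, 1, 3],
})
toy_test = pd.DataFrame({"user": [1, 2, 9], "item": [30, 20, 10]})
toy_pred = predict_baseline(fit_baseline(toy_train), toy_test)
print(toy_pred.tolist())
```

User 9 never appears in the toy training data, so its prediction falls back to the global mean plus the item offset.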


It is recommended to read the [original paper on SVD](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf).
Other resources on collaborative filtering:

* [collaborative filtering with kNN](https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0)
* [more collaborative filtering](https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea)

In [1]:
# add more packages in this section
import numpy as np
import pandas as pd
# import surprise # install it first
%matplotlib inline

In [2]:
train = pd.read_csv("data/recommender_train.csv")
test = pd.read_csv("data/recommender_test.csv")

### Predictions

For your convenience, we add code that creates "example_submission.csv".
You need to replace "algo" with your best algorithm.
If you choose a different method to predict or to create the algorithm, you may write different code; using ours is not obligatory.

In [3]:
#data_train = pd.read_csv("recommender_train.csv")
#data_test = pd.read_csv("recommender_test.csv")

#print(data_train.shape)
# 859395 records ==> Rankings
#print(data_test.shape)
# 6040 records ==> Users


In [1]:
import pandas as pd
from surprise import Dataset, NormalPredictor, Reader, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split

df = pd.read_csv("recommender_train.csv")
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)
# Sanity check: cross-validate a random baseline (NormalPredictor) on two folds.
cross_validate(NormalPredictor(), data, cv=2)

# Hold out 25% of the ratings for validation.
trainset, testset = train_test_split(data, test_size=.25)

test = pd.read_csv("recommender_test.csv")

In [2]:
algo1 = SVD()

predictions = algo1.fit(trainset).test(testset)
# Then compute RMSE (accuracy.rmse prints the score and also returns it)
print("The RMSE using SVD:")
print(accuracy.rmse(predictions))

The RMSE using SVD:
RMSE: 0.8818
0.8817633450982018
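
To give some intuition for what an SVD-style recommender fits, here is a minimal sketch of a biased matrix factorization model: the estimate is the global mean plus a user bias, an item bias, and the dot product of two latent vectors, trained by stochastic gradient descent. This illustrates the model family only and is not the surprise implementation; the function names, learning rate, regularization and toy values are assumptions chosen for the example.

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix factorization estimate: mu + b_u + b_i + q_i . p_u."""
    return mu + b_u + b_i + float(np.dot(q_i, p_u))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient step on a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = p_u + lr * (err * q_i - reg * p_u)
    new_q_i = q_i + lr * (err * p_u - reg * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i

rng = np.random.default_rng(0)
mu, b_u, b_i = 3.5, 0.0, 0.0          # global mean and zero-initialised biases
p_u = rng.normal(0, 0.1, size=4)      # latent factors of one user
q_i = rng.normal(0, 0.1, size=4)      # latent factors of one item

before = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
for _ in range(50):                   # repeatedly fit one rating of 5 stars
    b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
print(before, after)
```

The absolute error on the fitted rating shrinks over the steps; the regularization term keeps the biases and factors from growing without bound.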


In [3]:
# define a cross-validation iterator
from surprise.model_selection import KFold

kf = KFold(n_splits=3)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo1.fit(trainset)
    predictions = algo1.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)


RMSE: 0.8902
RMSE: 0.8879
RMSE: 0.8880
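
The three fold scores are easier to compare with other models when summarized as a mean and a spread. A small sketch, using the values copied from the output above:

```python
import numpy as np

# Per-fold RMSE values copied from the output above
fold_rmse = np.array([0.8902, 0.8879, 0.8880])
mean_rmse = fold_rmse.mean()
std_rmse = fold_rmse.std(ddof=1)  # sample standard deviation across folds
print(f"RMSE: {mean_rmse:.4f} +/- {std_rmse:.4f}")
```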


In [4]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9283437645168623
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [5]:
#### Trying Basic KNN algorithm

from surprise import KNNBasic

algo2 = KNNBasic()

predictions = algo2.fit(trainset).test(testset)
# Then compute RMSE
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9319


0.9318995029130168

In [6]:
#### Trying KNNWithMeans algorithm
from surprise import KNNWithMeans

algo3 = KNNWithMeans()

predictions = algo3.fit(trainset).test(testset)
# Then compute RMSE
accuracy.rmse(predictions)


Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9313


0.9313421626863028

In [7]:
#### Trying SlopeOne algorithm
from surprise import SlopeOne

algo4 = SlopeOne()

predictions = algo4.fit(trainset).test(testset)
# Then compute RMSE
accuracy.rmse(predictions)


RMSE: 0.9064


0.9063868062339259

In [10]:
#### Trying SVDpp algorithm (it ran for a long time, so I interrupted it)
from surprise import SVDpp
algo5 = SVDpp()

predictions = algo5.fit(trainset).test(testset)
# Then compute RMSE
accuracy.rmse(predictions)

KeyboardInterrupt: 

In [1]:
import pandas as pd
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

train = pd.read_csv("data/recommender_train_copy.csv")
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train[['user', 'item', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=.25)

In [2]:
###
# Focus on the KNNBasic algorithm and tune its hyperparameters (here, the number of neighbours k)
###

from surprise import KNNBasic

for neighbours in [10, 50, 100]:
    algo = KNNBasic(k=neighbours)
    predictions = algo.fit(trainset).test(testset)
    # accuracy.rmse prints "RMSE: ..." itself and also returns the value
    print("The RMSE result for K Neighbours number of {k} is: {rmse}".format(
        k=neighbours, rmse=accuracy.rmse(predictions)))
# After the loop, `algo` is the last model fitted (k=100)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9620
The RMSE result for K Neighbours number of 10 is: 0.9620261799216681
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9250
The RMSE result for K Neighbours number of 50 is: 0.9250289272937631
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9270
The RMSE result for K Neighbours number of 100 is: 0.926983020283427


In [6]:
test = pd.read_csv("data/recommender_test_copy.csv")
# Predict a rating for every (user, item) pair in the test set
predictions = [algo.predict(row['user'], row['item']).est
               for _, row in test.iterrows()]
# One prediction per line, without index or header
pd.Series(predictions).to_csv('example_submission_031691082.csv', index=False, header=False)

In [8]:
# Check the CSV file. It has no header row, so read_csv's default treats the first prediction as the header.
result = pd.read_csv("example_submission_031691082.csv")
result.shape

(6039, 1)
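
The shape above is one row short of the 6040 test records noted earlier. The likely cause is that the file was written without a header, so `pd.read_csv` with default settings uses the first prediction as the column name. A small self-contained sketch of the effect, using an in-memory buffer and made-up values instead of the real submission file:

```python
import io
import pandas as pd

# Three predictions written the same way as the submission: no index, no header
buffer = io.StringIO()
pd.Series([3.5, 4.0, 2.5]).to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

default_read = pd.read_csv(io.StringIO(csv_text))            # first row becomes the header
full_read = pd.read_csv(io.StringIO(csv_text), header=None)  # every row is data

print(default_read.shape, full_read.shape)
```

Passing `header=None` when checking the submission file should therefore report all rows.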