# Recommender Systems
## Writing an Algorithm for Recommending Movies with Collaborative Filtering
In this notebook I try to write an algorithm that predicts the ratings people would give to films. Recommender systems are used by companies such as YouTube and Netflix to keep you on their platform longer. They can also be used by online clothing stores that want to predict which clothes you will like.

The dataset used is the MovieLens database, available at [grouplens.org](https://grouplens.org/datasets/movielens/). In addition, I draw on what I learned from this [Towards Data Science](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0) article.

This notebook is a follow-up to `CB_RecommenderSystem.ipynb`, and I recommend understanding that one first; it gives a simple introduction to the idea behind a recommender system. In this notebook we use deep learning with the library `fast.ai`. This reportedly gives the best model, although it may be slower than, for example, Singular Value Decomposition.

In [1]:
import pandas as pd 
import numpy as np
import warnings
import matplotlib.pyplot as plt

import tensorflow as tf
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
import tflearn

import sklearn as sk
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm

%matplotlib inline
warnings.filterwarnings('ignore')

curses is not supported on this machine (please install/reinstall curses for an optimal experience)


The most important thing that happens in the next cell is reading in the data and merging the two datasets. The variable `BigDataSet` determines whether you use the large dataset of 20 million rows (`True`) or the small dataset of 100,000 rows (`False`).

In [2]:
BigDataSet = False
DataSet = "data" if BigDataSet else "data_small"

df = pd.read_csv(DataSet + '/ratings.csv', sep=',', names=['userID','movieID','rating','timestamp'], header = 0)
movieTitles = pd.read_csv(DataSet + '/movies.csv', names=['movieID','title','genres'], header = 0)
df = pd.merge(df, movieTitles, on='movieID')

df.head()

Unnamed: 0,userID,movieID,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


The next cell is important. It converts the `genres` column into dummy columns, so that the algorithm can compute with numbers instead of text. `dfGenres` has one column for every genre that occurs, and is filled with a 1 if the film (`movieID`) belongs to that genre and a 0 if it does not.

In [3]:
cols = ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Romance', 'Action', 'Crime', 'Thriller', 'Mystery', 'Horror', 'Drama', 'War', 'Western', 'Sci-Fi', 'Musical', 'Film-Noir', 'IMAX', 'Documentary', '(no genres listed)']

dfGenres = pd.DataFrame(0, columns=cols, index = movieTitles.movieID.unique())
with tqdm(total=len(movieTitles.movieID.unique())) as pbar:
    for i, film in enumerate(movieTitles.movieID.unique()):
        genres = movieTitles.loc[i, "genres"].split("|")

        dfGenres.loc[film, genres] = 1
        pbar.update(1)

dfGenres.head()

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Action,Crime,Thriller,Mystery,Horror,Drama,War,Western,Sci-Fi,Musical,Film-Noir,IMAX,Documentary,(no genres listed)
1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
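The same 0/1 genre encoding can also be sketched without an explicit loop, using the pandas string method `str.get_dummies`. The three rows below are a small stand-in for `movies.csv` with the same column layout, and the names `movies` and `genre_dummies` are made up for this illustration; it is an alternative sketch, not the code the notebook ran.

```python
import pandas as pd

# Toy stand-in for movies.csv (same columns as movieTitles above)
movies = pd.DataFrame({
    "movieID": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split the genre string on "|" and build one 0/1 column per genre,
# indexed by movieID
genre_dummies = movies.set_index("movieID")["genres"].str.get_dummies(sep="|")
print(genre_dummies)
```

Unlike the loop above, this only creates columns for genres that actually occur in the data, and it orders them alphabetically rather than following a hand-written list.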


In [4]:
df.title.unique()

array(['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)', ...,
       'Hazard (2005)', 'Blair Witch (2016)', '31 (2016)'], dtype=object)

In [5]:
# Merge all genres with the ratings dataframe

# Uncomment when creating a new dataset.
# dfGenres is indexed by movieID (it has no movieID column), so join the
# ratings column against that index.
#df = pd.merge(df, dfGenres, how="outer", left_on="movieID", right_index=True)
#df.to_csv(DataSet + "/ProcessedDataSmall.csv", index=False, sep=";")

df = pd.read_csv(DataSet + "/ProcessedDataSmall.csv", sep=";")
df.sort_values(by="userID").head()

Unnamed: 0,userID,movieID,rating,timestamp,title,genres,Adventure,Animation,Children,Comedy,...,Horror,Drama,War,Western,Sci-Fi,Musical,Film-Noir,IMAX,Documentary,(no genres listed)
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12117,1,2389,2.0,964983094,Psycho (1998),Crime|Horror|Thriller,0.0,0.0,0.0,0.0,...,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12126,1,2395,5.0,964981093,Rushmore (1998),Comedy|Drama,0.0,0.0,0.0,0.5,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12182,1,2406,4.0,964982310,Romancing the Stone (1984),Action|Adventure|Comedy|Romance,0.25,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
776,1,70,3.0,964982400,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,0.0,0.0,0.0,0.25,...,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
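The genre columns in this output hold fractions (0.2 for a five-genre film, 0.5 for a two-genre film) rather than 0/1 flags. That suggests the saved file was row-normalised so that each film's genre weights sum to 1. The notebook does not show that step, so the following is only a guess at how it could be done; `flags` and `weights` are hypothetical names.

```python
import pandas as pd

# Hypothetical 0/1 genre flags for two films: one with five genres, one with two
flags = pd.DataFrame(
    {
        "Adventure": [1, 0],
        "Animation": [1, 0],
        "Children": [1, 0],
        "Comedy": [1, 1],
        "Fantasy": [1, 0],
        "Drama": [0, 1],
    },
    index=pd.Index([1, 2395], name="movieID"),
)

# Divide every row by its number of genres so that each row sums to 1
weights = flags.div(flags.sum(axis=1), axis=0)
print(weights)
```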


In [6]:
dfUserMovie = df.pivot_table(index="userID", columns="movieID", values="rating", fill_value=0)

dfUserMovie.to_csv(DataSet + "/UserMovieMatrixSmall.csv")

In [7]:
x = df.drop(['userID', 'movieID', 'timestamp', 'rating', "title", "genres"],axis=1) # drops the labels and text variables
y = df['rating'].astype(np.float32)
y = y/5 # scale ratings from (0, 5] to (0, 1] to match the sigmoid output layer used later
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train = x_train.sort_index()
x_test = x_test.sort_index()
y_train = y_train.sort_index()
y_test = y_test.sort_index()

print(x_train.head())
print()
print(y_train.head())

   Adventure  Animation  Children  Comedy  Fantasy  Romance  Action  Crime  \
0        0.2        0.2       0.2     0.2      0.2      0.0     0.0    0.0   
1        0.2        0.2       0.2     0.2      0.2      0.0     0.0    0.0   
2        0.2        0.2       0.2     0.2      0.2      0.0     0.0    0.0   
3        0.2        0.2       0.2     0.2      0.2      0.0     0.0    0.0   
4        0.2        0.2       0.2     0.2      0.2      0.0     0.0    0.0   

   Thriller  Mystery  Horror  Drama  War  Western  Sci-Fi  Musical  Film-Noir  \
0       0.0      0.0     0.0    0.0  0.0      0.0     0.0      0.0        0.0   
1       0.0      0.0     0.0    0.0  0.0      0.0     0.0      0.0        0.0   
2       0.0      0.0     0.0    0.0  0.0      0.0     0.0      0.0        0.0   
3       0.0      0.0     0.0    0.0  0.0      0.0     0.0      0.0        0.0   
4       0.0      0.0     0.0    0.0  0.0      0.0     0.0      0.0        0.0   

   IMAX  Documentary  (no genres listed)  
0

# Neural Network approach

In [16]:
nh = 64

model = tf.keras.models.Sequential([
    # Take the input width from the data instead of hard-coding it
    tf.keras.layers.Dense(nh*2, activation=tf.nn.sigmoid, input_shape=(x_train.shape[1],), use_bias = True),
    tf.keras.layers.Dropout(0.05),
    tf.keras.layers.Dense(int(nh*1.5), activation=tf.nn.sigmoid, use_bias = True),  # units must be an integer
    tf.keras.layers.Dense(nh, activation=tf.nn.sigmoid, use_bias = True),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation = tf.nn.sigmoid)
])
# Note: 'accuracy' counts exact matches and says little for a regression
# target; the mean squared error loss is the figure to watch.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=1e-3, decay=1e-5),
              loss=tf.keras.losses.mean_squared_error,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, verbose = 2, validation_split = 0.1)

Train on 60804 samples, validate on 6756 samples
Epoch 1/10
 - 6s - loss: 0.0450 - acc: 0.1395 - val_loss: 0.0566 - val_acc: 0.0594
Epoch 2/10
 - 5s - loss: 0.0430 - acc: 0.1402 - val_loss: 0.0558 - val_acc: 0.0594
Epoch 3/10
 - 5s - loss: 0.0428 - acc: 0.1402 - val_loss: 0.0539 - val_acc: 0.0594
Epoch 4/10
 - 5s - loss: 0.0426 - acc: 0.1402 - val_loss: 0.0565 - val_acc: 0.0594
Epoch 5/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0551 - val_acc: 0.0594
Epoch 6/10
 - 6s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0561 - val_acc: 0.0594
Epoch 7/10
 - 7s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0559 - val_acc: 0.0594
Epoch 8/10
 - 6s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0558 - val_acc: 0.0594
Epoch 9/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0549 - val_acc: 0.0594
Epoch 10/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0553 - val_acc: 0.0594


<tensorflow.python.keras.callbacks.History at 0x2bb307d0828>

In [40]:
model.evaluate(x_test, y_test)



[0.043225352522619415, 0.12880153864736754]
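The first number returned by `model.evaluate` is the mean squared error on ratings that were divided by 5 before training, so it is not directly readable in stars. A rough conversion, assuming only that scaling and nothing else about the model, takes the square root and multiplies by 5:

```python
import math

mse_scaled = 0.043225352522619415  # test loss reported by model.evaluate above
rating_scale = 5.0                 # ratings were divided by 5 before training

# RMSE on the scaled target, then converted back to the original star scale
rmse_scaled = math.sqrt(mse_scaled)
rmse_stars = rmse_scaled * rating_scale
print(f"RMSE: {rmse_stars:.2f} stars")
```

This works out to roughly 1.04 stars, which puts the network's error on the same scale as the RMSE that rating-prediction libraries usually report.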

# Scikit surprise approach
https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b

In [10]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9355  0.9351  0.9310  0.9409  0.9388  0.9363  0.0034  
MAE (testset)     0.7391  0.7392  0.7306  0.7430  0.7377  0.7379  0.0041  
Fit time          9.61    9.23    7.47    6.63    6.55    7.90    1.29    
Test time         0.28    0.27    0.20    0.32    0.20    0.25    0.05    


{'test_rmse': array([0.93554486, 0.93513075, 0.93095526, 0.94086107, 0.93884143]),
 'test_mae': array([0.73908881, 0.73924707, 0.7305567 , 0.74300597, 0.73765002]),
 'fit_time': (9.614481687545776,
  9.2326979637146,
  7.4707136154174805,
  6.633191108703613,
  6.548240900039673),
 'test_time': (0.2768423557281494,
  0.2688455581665039,
  0.19588613510131836,
  0.31781625747680664,
  0.19590282440185547)}

In [11]:
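# Use the famous SVD algorithm on our own ratings.
# SVD.fit() expects a Surprise Trainset built from (user, item, rating)
# triples; passing a plain list such as [x_train, y_train] raises
# AttributeError: 'list' object has no attribute 'global_mean'.
from surprise import Reader, accuracy
from surprise.model_selection import train_test_split as surprise_split

reader = Reader(rating_scale=(0.5, 5.0))
ratings = Dataset.load_from_df(df[['userID', 'movieID', 'rating']], reader)
trainset, testset = surprise_split(ratings, test_size=0.33, random_state=42)

movieSVD = SVD()
movieSVD.fit(trainset)

# test the algorithm
predictions = movieSVD.test(testset)
accuracy.rmse(predictions)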