# Recommender Systems
## Writing an Algorithm for Recommending Movies with Collaborative Filtering
In this script I try to write an algorithm that predicts the ratings people will give to movies. Recommender systems are used by companies such as YouTube and Netflix to keep you on their platform longer. They can also be used by online clothing stores that want to predict which clothes you will like.

The dataset used is the MovieLens database, available at [grouplens.org](https://grouplens.org/datasets/movielens/). I also draw on what I learned from this [Towards Data Science](https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0) article.

This notebook is a follow-up to `CB_RecommenderSystem.ipynb`, and I recommend understanding that one first, since it gives a simple introduction to the idea of a recommender system. In this notebook we use deep learning with the library `fast.ai`. This apparently gives the best model, although it is known that such a model may be slower than, for example, Singular Value Decomposition.

In [19]:
import pandas as pd 
import numpy as np
import warnings
import matplotlib.pyplot as plt

import tensorflow as tf
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
import tflearn

import sklearn as sk
from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook as tqdm

%matplotlib inline
warnings.filterwarnings('ignore')

The most important thing that happens in the next cell is reading in the data and merging the two datasets. The variable `BigDataSet` determines whether you use the large dataset of 20 million rows (`True`) or the small dataset of 100,000 rows (`False`).

In [20]:
BigDataSet = False
DataSet = "data" if BigDataSet else "data_small"

df = pd.read_csv(DataSet + '/ratings.csv', sep=',', names=['userID','movieID','rating','timestamp'], header = 0)
movieTitles = pd.read_csv(DataSet + '/movies.csv', names=['movieID','title','genres'], header = 0)
df = pd.merge(df, movieTitles, on='movieID')

df.head()

Unnamed: 0,userID,movieID,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


The next cell is important. It converts the `genres` column into dummy columns, so that the algorithm can calculate with numbers instead of text. `allGenres` lists every genre that occurs. For each film (`movieID`), the columns of the genres it belongs to are set to 1 divided by the film's number of genres, so that each row sums to 1, and all other genre columns are set to 0.

In [21]:
allGenres = ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Romance', 'Action', 'Crime', 'Thriller', 'Mystery', 'Horror', 'Drama', 'War', 'Western', 'Sci-Fi', 'Musical', 'Film-Noir', 'IMAX', 'Documentary', '(no genres listed)']
cols = ["movieID"]
cols.extend(allGenres)

dfGenres = pd.DataFrame(columns=cols)
with tqdm(total=len(df.movieID.unique())) as pbar:
    for i, film in enumerate(df.movieID.unique()):
        dfGenres.loc[i, :] = 0
        dfGenres.loc[i, "movieID"] = film
        genres = df.loc[df.movieID == film, "genres"].iloc[0].split("|")
        #for genre in genres:
        #    dfGenres.loc[i, genre] = 1
        dfGenres.loc[i, genres] = [1/len(genres) ] * len(genres) 
        pbar.update(1)

dfGenres.head()

Unnamed: 0,movieID,Adventure,Animation,Children,Comedy,Fantasy,Romance,Action,Crime,Thriller,...,Horror,Drama,War,Western,Sci-Fi,Musical,Film-Noir,IMAX,Documentary,(no genres listed)
0,1.0,0.2,0.2,0.2,0.2,0.2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
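The row-by-row loop above can be slow on the large dataset. As a minimal sketch of the same encoding in vectorized form, the snippet below uses `Series.str.get_dummies` on a small toy table; the toy data and the names `movies`, `indicators`, `weights` and `dfGenresFast` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the movies table; the column names follow the notebook.
movies = pd.DataFrame({
    "movieID": [1, 3, 6],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Action|Crime|Thriller",
    ],
})

# One 0/1 indicator column per genre, split on the "|" separator.
indicators = movies["genres"].str.get_dummies(sep="|")

# Divide each row by its number of genres so that the weights sum to 1.
weights = indicators.div(indicators.sum(axis=1), axis=0)

# Put the movie IDs back next to the genre weights.
dfGenresFast = pd.concat([movies[["movieID"]], weights], axis=1)
print(dfGenresFast)
```

Unlike the loop, this only creates columns for genres that actually occur in the data, in alphabetical order. If the fixed column set from `allGenres` is needed, `weights.reindex(columns=allGenres, fill_value=0)` should restore it.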


In [23]:
# Merge all genres with the ratings dataframe

#uncomment if creating new dataset
#df = pd.merge(df, dfGenres, how = "outer", on = "movieID")
#df.to_csv(DataSet + "/ProcessedDataSmall.csv", index = False, sep = ";")

df = pd.read_csv(DataSet + "/ProcessedDataSmall.csv", sep = ";")
df.head()

Unnamed: 0,userID,movieID,rating,timestamp,title,genres,Adventure,Animation,Children,Comedy,...,Horror,Drama,War,Western,Sci-Fi,Musical,Film-Noir,IMAX,Documentary,(no genres listed)
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0.2,0.2,0.2,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
dfUserMovie = pd.DataFrame(index = df.userID.unique(), columns=df.movieID.unique())
with tqdm(total=len(df.userID.unique())) as pbar:
    for i, user in enumerate(df.userID.unique()):
        subsetIndex = df.loc[df.userID==user, "movieID"]
        subsetRating= df.loc[df.userID==user, "rating"].values
        dfUserMovie.loc[user, subsetIndex] = subsetRating
        pbar.update(1)
        
dfUserMovie.head()
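The user-by-movie matrix built in the loop above can probably also be obtained with a single `DataFrame.pivot` call. The sketch below shows this on toy ratings; the toy values and the name `dfUserMovieFast` are assumptions for illustration, not the notebook's own data or code.

```python
import pandas as pd

# Toy ratings table with the same column names as the notebook.
ratings = pd.DataFrame({
    "userID": [1, 1, 5, 7],
    "movieID": [1, 3, 1, 6],
    "rating": [4.0, 4.0, 4.0, 4.5],
})

# Rows are users, columns are movies, cells hold the rating (NaN if unrated).
# pivot raises a ValueError if a (userID, movieID) pair occurs more than once.
dfUserMovieFast = ratings.pivot(index="userID", columns="movieID", values="rating")
print(dfUserMovieFast)
```

The result differs from the loop in two small ways: rows and columns come out sorted by ID rather than in order of first appearance, and the cells are floats instead of generic objects.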


In [4]:
x = df.drop(['rating', "title", "genres"],axis=1) # drops the labels and text variables
y = df['rating'].astype(np.float32)
y = y/5
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
x_train = x_train.sort_index()
x_test = x_test.sort_index()
y_train = y_train.sort_index()
y_test = y_test.sort_index()

# Give train and test consecutive, non-overlapping indices (features and targets alike)
x_train.index = range(x_train.shape[0])
x_test.index = range(x_train.shape[0], x_train.shape[0]+x_test.shape[0])
y_train.index = range(y_train.shape[0])
y_test.index = range(y_train.shape[0], y_train.shape[0]+y_test.shape[0])

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(67560, 23) (33276, 23) (67560,) (33276,)


In [16]:
nh = 64

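model = tf.keras.models.Sequential([
    # 23 input features: userID, movieID, timestamp and the 20 genre weights
    tf.keras.layers.Dense(nh*2, activation=tf.nn.sigmoid, input_shape=(23,), use_bias = True),
    tf.keras.layers.Dropout(0.05),
    # 96 units; cast so the layer width is an integer
    tf.keras.layers.Dense(int(nh*1.5), activation=tf.nn.sigmoid, use_bias = True),
    tf.keras.layers.Dense(nh, activation=tf.nn.sigmoid, use_bias = True),
    tf.keras.layers.Dropout(0.5),
    # Sigmoid output matches the ratings rescaled to the range (0, 1]
    tf.keras.layers.Dense(1, activation = tf.nn.sigmoid)
])
# 'accuracy' is a classification metric; for this regression target the MSE loss is the number to watch
model.compile(optimizer=tf.keras.optimizers.Adam(lr=1e-3, decay=1e-5),
              loss=tf.keras.losses.mean_squared_error,
              metrics=['accuracy'])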

model.fit(x_train, y_train, epochs=10, verbose = 2, validation_split = 0.1)

Train on 60804 samples, validate on 6756 samples
Epoch 1/10
 - 6s - loss: 0.0446 - acc: 0.1399 - val_loss: 0.0546 - val_acc: 0.0594
Epoch 2/10
 - 5s - loss: 0.0430 - acc: 0.1402 - val_loss: 0.0573 - val_acc: 0.0594
Epoch 3/10
 - 5s - loss: 0.0428 - acc: 0.1402 - val_loss: 0.0576 - val_acc: 0.0594
Epoch 4/10
 - 5s - loss: 0.0427 - acc: 0.1402 - val_loss: 0.0539 - val_acc: 0.0594
Epoch 5/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0559 - val_acc: 0.0594
Epoch 6/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0543 - val_acc: 0.0594
Epoch 7/10
 - 5s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0554 - val_acc: 0.0594
Epoch 8/10
 - 6s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0548 - val_acc: 0.0594
Epoch 9/10
 - 6s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0549 - val_acc: 0.0594
Epoch 10/10
 - 6s - loss: 0.0425 - acc: 0.1402 - val_loss: 0.0557 - val_acc: 0.0594


<tensorflow.python.keras.callbacks.History at 0x1dd0d54b438>

In [18]:
model.evaluate(x_test, y_test)



[0.043225352522619415, 0.12880153864736754]
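The evaluation above reports the mean squared error on ratings that were divided by 5, which is hard to read directly. As a small worked example, the snippet below converts that loss back to a root mean squared error in stars. The variable names are assumptions for illustration; the two numbers are copied from the output above.

```python
import math

# Values copied from the model.evaluate output above: [loss, accuracy].
mse_scaled, accuracy = 0.043225352522619415, 0.12880153864736754

# The targets were divided by 5, so errors on the scaled targets are 5 times
# smaller than errors in stars; multiply the RMSE by 5 to undo that.
rmse_scaled = math.sqrt(mse_scaled)
rmse_stars = rmse_scaled * 5

print(f"RMSE on scaled ratings: {rmse_scaled:.4f}")
print(f"RMSE in stars:          {rmse_stars:.2f}")
```

On the test set this works out to an error of roughly 1.04 stars. The accuracy value is left unused here because it is a classification metric and says little about how close the predicted ratings are.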