# Recommendation System 
Trying out the process of creating a recommendation system using the [Movielens dataset](https://grouplens.org/datasets/movielens/)
## Good resources
 - [Introduction to recommender system (Part 1)](https://hackernoon.com/introduction-to-recommender-system-part-1-collaborative-filtering-singular-value-decomposition-44c9659c5e75)
 - [Recommender system in keras](https://nipunbatra.github.io/blog/2017/recommend-keras.html)

(The implementation in this notebook is almost identical to that of the second link.) Every movie and user can be assigned a vector which should describe their characterisits in such a way that when multiplied results in the users rating for that movie. These two embeddings could also be concatenated to form the input to a network which also could be used to predict the rating for this particular user-movie pair. 

In [43]:
import csv
import numpy as np

from math import floor, sqrt
from collections import defaultdict

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

import keras
from keras.regularizers import l2
from keras.models import Model
from keras.callbacks import EarlyStopping
from keras.layers import Input, Embedding, Activation, Flatten, Lambda, Concatenate, Dense, Dropout, BatchNormalization
from keras.optimizers import Adam

## Preparing the data
We have to clean the data a bit since the user and movie id's have to be in a contiguous order for the Keras Embedding layer to work propperly. We also want to make use of the movie tags when training the model, requering us to somehow turn these into vectors. 

One idea is to also create a embedding for each tag and combine these in an average fashion.

In [2]:
# user, movie, rating, time
ratings = np.genfromtxt('ml-latest-small/ratings.csv', delimiter=',', skip_header=1)

#id, title, tags
with open('ml-latest-small/movies.csv', 'r') as f:
    reader = csv.reader(f)
    movie_header = next(reader)
    mid2cmid = {} #movie id to contiguous id
    mid2tags = {} #contigious movie id to tag
    tags = []
    for i, row in enumerate(reader):
        mid2cmid[int(row[0])] = i #mapping movie id to contigious integers
        mid2tags[i] = row[2].split('|')
        tags.append(mid2tags[i])
    #end for
#end with

mlb = MultiLabelBinarizer().fit(tags)
print(mlb.classes_)

uid2cuid = {} #user id to contiguous user id
users = set(ratings[:,0])
for i, u in enumerate(users):
    uid2cuid[u] = i


#Update the rating matrix
mlb_tags = np.zeros((len(ratings), len(mlb.classes_)))
for i, row in enumerate(ratings):
    ratings[i][0] = uid2cuid[row[0]] #swapping user id to its contiguous value
    ratings[i][1] = mid2cmid[int(row[1])] #swappings movi id to its contiuous value
    mlb_tags[i,:] = mlb.transform(mlb.transform([mid2tags[ratings[i][1]]])) #Getting the tag vector for each movie


['(no genres listed)' 'Action' 'Adventure' 'Animation' 'Children' 'Comedy'
 'Crime' 'Documentary' 'Drama' 'Fantasy' 'Film-Noir' 'Horror' 'IMAX'
 'Musical' 'Mystery' 'Romance' 'Sci-Fi' 'Thriller' 'War' 'Western']


  .format(sorted(unknown, key=str)))


In [3]:
n_users = 610 
n_movies = 9742
assert(len(users) == n_users)
assert(len(mid2cmid) == n_movies)

Test-train split the data

In [4]:
train_ratings_matrix, test_ratings_matrix, train_tags, test_tags= train_test_split(ratings, mlb_tags, train_size=0.8)



In [5]:
train_user, test_user = train_ratings_matrix[:,0], test_ratings_matrix[:,0]
train_movie, test_movie = train_ratings_matrix[:,1], test_ratings_matrix[:,1]
train_rating, test_rating = train_ratings_matrix[:,2], test_ratings_matrix[:,2]

In [6]:
min_rateing = np.min(ratings[:,2])
max_rateing = np.max(ratings[:,2])
n_tags = len(mlb.classes_)

# Functions for creating models

In [48]:
def gen_dot_model(common_emb_sz=10, dropout=0.1):
    movie_imp = Input(shape=(1,), name='movie_input')
    movie_emb = Embedding(n_movies, common_emb_sz, name='movie_embedding')(movie_imp)
    
    user_imp = Input(shape=(1,), name='user_input')
    user_emb = Embedding(n_users, common_emb_sz, name='user_embedding')(user_imp)
    
    movie_do = Dropout(dropout)(movie_emb)
    user_do = Dropout(dropout)(user_emb)
    
    dot = Flatten()(keras.layers.dot([movie_do, user_do], axes=2, name='dot'))
    
    act = Activation('sigmoid', name='activation')(dot)
    out = Lambda(lambda x: min_rateing + (max_rateing - min_rateing)*x)(act)

    model = Model(inputs=[movie_imp, user_imp], outputs=[out])
    model.compile(loss = 'mean_squared_error', optimizer = Adam(lr=0.001), metrics = ['mse'])
    model.summary()
    return model
#end def

def gen_net_model(movie_emb_sz = 10, user_emb_sz = 10, dropout=0.1):
    movie_imp = Input(shape=(1,), name='movie_input')
    movie_emb = Embedding(n_movies, movie_emb_sz, name='movie_embedding')(movie_imp)
    movie_emb = Flatten()(movie_emb)
    
    user_imp = Input(shape=(1,), name='user_input')
    user_emb = Embedding(n_users, user_emb_sz, name='user_embedding')(user_imp)
    user_emb = Flatten()(user_emb)
    
    concat = Concatenate(axis=1, name='concat')([movie_emb, user_emb])
    
    dense1 = Dense(floor(2/3*(movie_emb_sz + user_emb_sz)), activation='tanh')(concat)
    d1_do = Dropout(dropout)(dense1)
    
    dense2 = Dense(floor(1/5*(movie_emb_sz + user_emb_sz)), activation='tanh')(d1_do)
    d2_do = Dropout(dropout)(dense2)
    
    dense3 = Dense(1, activation='sigmoid', kernel_regularizer=l2)(d2_do)    
    out = Lambda(lambda x: min_rateing + (max_rateing - min_rateing)*x)(dense3)

    model = Model(inputs=[movie_imp, user_imp], outputs=[out])
    model.compile(loss = 'mean_squared_error', optimizer = Adam(lr=0.001), metrics = ['mse'])
    model.summary()
    return model
#end def

def gen_net_model_with_tags(movie_emb_sz=10, user_emb_sz=10, tag_emb_sz=10, dropout=0.1):
    movie_imp = Input(shape=(1,), name='movie_input')
    movie_emb = Embedding(n_movies, movie_emb_sz, name='movie_embedding')(movie_imp)
    movie_emb = Flatten()(movie_emb)
    
    user_imp = Input(shape=(1,), name='user_input')
    user_emb = Embedding(n_users, user_emb_sz, name='user_embedding')(user_imp)
    user_emb = Flatten()(user_emb)
    
    tag_imp = Input(shape=(n_tags,), name='tag_input')
    tag_emb = Dense(tag_emb_sz)(tag_imp)
    
#    movie_do = Dropout(0.1)(movie_emb)
#    user_do = Dropout(0.1)(user_emb)
    
    concat = Concatenate(axis=1, name='concat')([movie_emb, user_emb, tag_emb])
    
    dense1 = Dense(floor(2/3*(movie_emb_sz + user_emb_sz)), activation='tanh')(concat)
    d1_do = Dropout(dropout)(dense1)
    
    dense2 = Dense(floor(1/5*(movie_emb_sz + user_emb_sz)), activation='tanh')(d1_do)
    d2_do = Dropout(dropout)(dense2)
    
    dense3 = Dense(1, activation='sigmoid')(d2_do)    
    out = Lambda(lambda x: min_rateing + (max_rateing - min_rateing)*x)(dense3)

    model = Model(inputs=[movie_imp, user_imp, tag_imp], outputs=[out])
    model.compile(loss = 'mean_squared_error', optimizer = Adam(lr=0.001), metrics = ['mse'])
    model.summary()
    return model
#end def

es = EarlyStopping(patience=5, restore_best_weights=True)

## Simple dot model
Calculate the rating through the inner product between user embedding and movie embedding. 

In [38]:
model = gen_dot_model(common_emb_sz=100)
history = model.fit([train_movie, train_user], [train_rating], validation_split=0.1, epochs=100, batch_size=256, callbacks = [es])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
movie_input (InputLayer)        (None, 1)            0                                            
__________________________________________________________________________________________________
user_input (InputLayer)         (None, 1)            0                                            
__________________________________________________________________________________________________
movie_embedding (Embedding)     (None, 1, 100)       974200      movie_input[0][0]                
__________________________________________________________________________________________________
user_embedding (Embedding)      (None, 1, 100)       61000       user_input[0][0]                 
__________________________________________________________________________________________________
dropout_27

In [39]:
mse = model.evaluate([test_movie, test_user], [test_rating])
rmse = sqrt(mse[0])
print(f'RMSE on test set: {rmse}')

RMSE on test set: 0.890418960919065


## Simple network model
Instead of calculating the dot product between user and movie embedding, this model concatenates them into a vector which is then fed trhough a two layer network.

In [49]:
model = gen_net_model(movie_emb_sz=100, user_emb_sz=100, dropout=0.1)
history = model.fit([train_movie, train_user], [train_rating], validation_split=0.1, epochs=100, batch_size=256, callbacks = [es])

ValueError: setting an array element with a sequence.

In [41]:
mse = model.evaluate([test_movie, test_user], [test_rating])
rmse = sqrt(mse[0])
print(f'RMSE on test set: {rmse}')

RMSE on test set: 0.8702055489711874


## Tag net model
Here we also take the tags of the movie into account when creating the input to the network. 

In [42]:
model = gen_net_model_with_tags(movie_emb_sz=100, user_emb_sz=100, tag_emb_sz=100)
history = model.fit([train_movie, train_user, train_tags], [train_rating], validation_split=0.1, epochs=100, batch_size=256, callbacks = [es])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
movie_input (InputLayer)        (None, 1)            0                                            
__________________________________________________________________________________________________
user_input (InputLayer)         (None, 1)            0                                            
__________________________________________________________________________________________________
movie_embedding (Embedding)     (None, 1, 100)       974200      movie_input[0][0]                
__________________________________________________________________________________________________
user_embedding (Embedding)      (None, 1, 100)       61000       user_input[0][0]                 
__________________________________________________________________________________________________
tag_input 

In [34]:
mse = model.evaluate([test_movie, test_user, test_tags], [test_rating])
rmse = sqrt(mse[0])
print(f'RMSE on test set: {rmse}')

RMSE on test set: 0.8713615891180057
