# Movie recommendation using memory-based and model-based collaborative filtering

A movie recommender system has been developed using the MovieLens 100K dataset. The data can be obtained from https://grouplens.org/datasets/movielens/100k/.

The recommender system is built using user-user collaborative filtering, item-item collaborative filtering, and singular value decomposition (SVD).

Importing the libraries

In [10]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
import math

Reading the dataset from the files ua.base and ua.test and setting the column names

In [11]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_base = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')

In [12]:
ratings_base.head()

   user_id  movie_id  rating  unix_timestamp
0        1         1       5       874965758
1        1         2       3       876893171
2        1         3       4       878542960
3        1         4       3       876893119
4        1         5       3       889751712


In [13]:
ratings_test.head()

   user_id  movie_id  rating  unix_timestamp
0        1        20       4       887431883
1        1        33       4       878542699
2        1        61       4       878542420
3        1       117       3       874965739
4        1       155       2       878542201


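Shows the largest user id and movie id in the base set, which we use as the matrix dimensions. These match the number of unique users and movies only when no id is skipped.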

In [14]:
n_users_base = ratings_base['user_id'].unique().max()
n_items_base = ratings_base['movie_id'].unique().max()
n_users_base,n_items_base

(943, 1682)

Shows the largest user id and movie id in the test set

In [15]:
n_users_test = ratings_test['user_id'].unique().max()
n_items_test = ratings_test['movie_id'].unique().max()
n_users_test,n_items_test

(943, 1664)
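The cells above take the largest id rather than counting distinct ids. The two agree only when the ids are contiguous and start at 1. A minimal sketch of the difference, using a small made-up frame (the `toy` data is hypothetical and not part of the dataset):

```python
import pandas as pd

# Hypothetical toy ratings: user ids 1, 2 and 5 (ids 3 and 4 are missing).
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 5],
    'movie_id': [10, 11, 10, 12],
    'rating':   [4, 3, 5, 2],
})

max_user_id = toy['user_id'].max()           # largest id, suitable for sizing a matrix
n_distinct_users = toy['user_id'].nunique()  # number of distinct users

print(max_user_id, n_distinct_users)
```

Sizing the matrix by the largest id keeps `id - 1` a valid row or column index even when some ids are absent.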

# Creating user-item matrix

The ids in the training set run up to 943 for users and 1682 for movies. We now create `train_matrix` and `test_matrix`, with one row per user id and one column per movie id. Each cell holds the rating the user has given to the movie, or 0 if the user has not rated the movie.

In [16]:
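# Ids are 1-based, so subtract 1 to get row and column indices.
train_matrix = np.zeros((n_users_base, n_items_base))
for line in ratings_base.itertuples():
    train_matrix[line[1]-1,line[2]-1] = line[3]

# Use the training dimensions so test cells line up with the prediction arrays.
test_matrix = np.zeros((n_users_base, n_items_base))
for line in ratings_test.itertuples():
    test_matrix[line[1]-1,line[2]-1] = line[3]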
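The same matrices can also be filled without a Python loop. The sketch below uses NumPy integer-array indexing; the helper name `build_rating_matrix` and the toy frame are hypothetical, and it assumes each (user, movie) pair appears at most once.

```python
import numpy as np
import pandas as pd

def build_rating_matrix(ratings, n_users, n_items):
    """Return an (n_users, n_items) array of ratings, 0 where unrated."""
    matrix = np.zeros((n_users, n_items))
    rows = ratings['user_id'].to_numpy() - 1   # ids are 1-based
    cols = ratings['movie_id'].to_numpy() - 1
    matrix[rows, cols] = ratings['rating'].to_numpy()
    return matrix

# Hypothetical toy data, only to show the call.
toy = pd.DataFrame({'user_id': [1, 2, 2], 'movie_id': [1, 1, 3], 'rating': [5, 3, 4]})
toy_matrix = build_rating_matrix(toy, n_users=2, n_items=3)
print(toy_matrix)
```

Passing the same `n_users` and `n_items` for the training and test frames gives two matrices of identical shape.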

# User-user based collaborative filtering

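We create a matrix that compares every pair of users based on the ratings they have given to different movies. Note that `pairwise_distances` with `metric='cosine'` returns the cosine distance (1 minus the cosine similarity), so the diagonal is 0 and smaller values mean more similar users.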

In [17]:
user_similarity = pairwise_distances(train_matrix, metric='cosine')
print('shape: ',user_similarity.shape)
user_similarity

shape:  (943, 943)


array([[ 0.        ,  0.85324924,  0.9493235 , ...,  0.96129522,
         0.8272823 ,  0.61960392],
       [ 0.85324924,  0.        ,  0.87419215, ...,  0.82629308,
         0.82681535,  0.91905667],
       [ 0.9493235 ,  0.87419215,  0.        , ...,  0.97201154,
         0.87518372,  0.97030738],
       ..., 
       [ 0.96129522,  0.82629308,  0.97201154, ...,  0.        ,
         0.96004871,  0.98085615],
       [ 0.8272823 ,  0.82681535,  0.87518372, ...,  0.96004871,
         0.        ,  0.85528944],
       [ 0.61960392,  0.91905667,  0.97030738, ...,  0.98085615,
         0.85528944,  0.        ]])
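Since `pairwise_distances` returns distances, one way to obtain similarities is to subtract the result from 1; `cosine_similarity` from the same scikit-learn module computes them directly. A minimal sketch on a made-up matrix (the `toy` array is hypothetical):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Hypothetical toy user-item matrix (3 users, 3 movies).
toy = np.array([[5.0, 3.0, 0.0],
                [5.0, 3.0, 0.0],
                [0.0, 0.0, 4.0]])

distance = pairwise_distances(toy, metric='cosine')  # 0 means identical direction
similarity = 1.0 - distance                          # 1 means identical direction

# The same similarity values can be computed directly.
assert np.allclose(similarity, cosine_similarity(toy))
print(np.round(similarity, 2))
```

With similarities, the largest entries in a row belong to the closest neighbours; with distances they belong to the farthest ones. The outputs recorded in this notebook were produced with the distance matrix.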

The matrix has shape 943 x 943, with each cell comparing two users. We now use a prediction function that fills in the values of the user-item matrix. To make predictions for a user, we consider only the n users most similar to that user. In the formula we normalise the ratings by subtracting each user's mean rating from every rating given by that user.

$$\begin{equation*} \hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} sim_u(u_k, u_a) (x_{a,m} - \bar{x}_{a})}{\sum\limits_{u_a}|sim_u(u_k, u_a)|} \end{equation*}$$

In [18]:
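def predict_user_user(train_matrix, user_similarity, n_similar=30):
    # Column indices of the n_similar largest entries in each row, largest first.
    # NOTE: user_similarity as computed above holds cosine distances;
    # pass 1 - distance to rank neighbours by actual similarity.
    similar_n = user_similarity.argsort()[:,-n_similar:][:,::-1]
    pred = np.zeros(train_matrix.shape)
    for i,users in enumerate(similar_n):
        similar_users_indexes = users
        similarity_n = user_similarity[i,similar_users_indexes]
        matrix_n = train_matrix[similar_users_indexes,:]
        # Weighted average of the neighbours' mean-centred ratings
        # (row means include the zeros that stand for unrated movies).
        rated_items = similarity_n[:,np.newaxis].T.dot(matrix_n - matrix_n.mean(axis=1)[:,np.newaxis])/ similarity_n.sum()
        pred[i,:]  = rated_items
    return pred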
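For comparison, the sketch below is a variant of the formula above, not the function used in this notebook. It assumes a true similarity matrix (larger means more similar), centres each user on the mean of the movies they actually rated, and excludes the user from their own neighbourhood. The name `predict_user_user_sketch` and the toy data are hypothetical.

```python
import numpy as np

def predict_user_user_sketch(train, similarity, n_similar=2):
    """Mean-centred user-user prediction; train uses 0 for 'unrated'."""
    rated = train > 0
    # Mean over rated movies only (users with no ratings get mean 0).
    counts = np.maximum(rated.sum(axis=1), 1)
    user_mean = train.sum(axis=1) / counts
    centred = np.where(rated, train - user_mean[:, np.newaxis], 0.0)

    sim = similarity.copy()
    np.fill_diagonal(sim, -np.inf)           # never pick a user as their own neighbour
    neighbours = np.argsort(sim, axis=1)[:, -n_similar:]

    pred = np.zeros(train.shape)
    for k, idx in enumerate(neighbours):
        w = similarity[k, idx]
        denom = np.abs(w).sum()
        if denom > 0:
            pred[k] = w.dot(centred[idx]) / denom
    return pred + user_mean[:, np.newaxis]

# Hypothetical toy ratings (3 users, 3 movies) and their cosine similarities.
toy = np.array([[5.0, 4.0, 0.0],
                [4.0, 5.0, 2.0],
                [1.0, 0.0, 5.0]])
norms = np.linalg.norm(toy, axis=1)
toy_similarity = toy.dot(toy.T) / np.outer(norms, norms)
toy_pred = predict_user_user_sketch(toy, toy_similarity, n_similar=1)
print(np.round(toy_pred, 2))
```

Predictions from this formula are not bounded, so clipping them to the rating scale may be worth considering before computing an error.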

We now use this function to compute the mean-centred predictions and then add each user's average rating (taken over all movies, including the zeros for unrated ones) to obtain the final predicted ratings. Here we use the top 50 users for each user instead of the default of 30.

In [19]:
predictions = predict_user_user(train_matrix,user_similarity, 50) + train_matrix.mean(axis=1)[:, np.newaxis]
print('predictions shape ',predictions.shape)
predictions

predictions shape  (943, 1682)


array([[ 0.53079191,  0.53079191,  0.53079191, ...,  0.53079191,
         0.53079191,  0.53079191],
       [ 0.27556554,  0.17581381, -0.00189689, ..., -0.00189689,
        -0.00189689, -0.00189689],
       [ 1.17064209,  0.07064209,  0.01064209, ...,  0.01064209,
         0.01064209,  0.01064209],
       ..., 
       [-0.0479786 , -0.0479786 , -0.0479786 , ..., -0.0479786 ,
        -0.0479786 , -0.0479786 ],
       [ 0.8909642 ,  0.12995357,  0.12995357, ...,  0.12995357,
         0.12995357,  0.12995357],
       [ 0.27315101,  0.27315101,  0.27315101, ...,  0.31315101,
         0.27315101,  0.27315101]])

We keep only the cells that are non-zero in the test matrix and use them to measure the error of our model.

In [20]:
predicted_ratings = predictions[test_matrix.nonzero()]
test_truth = test_matrix[test_matrix.nonzero()]

In [21]:
math.sqrt(mean_squared_error(predicted_ratings,test_truth))

3.507744099069281
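The error computation above can be wrapped in a small helper so that it is reused for every model. The sketch below is one way to do this with NumPy only; the name `rmse_on_rated` and the toy arrays are hypothetical.

```python
import numpy as np

def rmse_on_rated(predictions, truth):
    """RMSE over the cells where truth holds a rating (non-zero)."""
    mask = truth != 0
    diff = predictions[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical toy data: two rated cells, with errors of -1 and 0.
toy_truth = np.array([[4.0, 0.0], [0.0, 2.0]])
toy_pred = np.array([[3.0, 1.0], [5.0, 2.0]])
print(rmse_on_rated(toy_pred, toy_truth))
```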

# Item-item based collaborative filtering

In this method we find the similarity between items instead of users. Using the similarity between items and the user's ratings for similar items, we compute the predicted ratings for unrated movies.

In [22]:
item_similarity = pairwise_distances(train_matrix.T, metric = 'cosine')
item_similarity.shape

(1682, 1682)

In [23]:
item_similarity

array([[ 0.        ,  0.59704074,  0.66673863, ...,  1.        ,
         0.94919585,  0.94919585],
       [ 0.59704074,  0.        ,  0.7308149 , ...,  1.        ,
         0.91844091,  0.91844091],
       [ 0.66673863,  0.7308149 ,  0.        , ...,  1.        ,
         1.        ,  0.90098525],
       ..., 
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,
         1.        ,  1.        ],
       [ 0.94919585,  0.91844091,  1.        , ...,  1.        ,
         0.        ,  1.        ],
       [ 0.94919585,  0.91844091,  0.90098525, ...,  1.        ,
         1.        ,  0.        ]])

The matrix has shape 1682 x 1682, with each cell comparing two items (as returned by `pairwise_distances`, the entries are cosine distances, so the diagonal is 0). We now use a prediction function that fills in the values of the user-item matrix. To make a prediction for an item, we consider only the top n items similar to it. In this formula we do not normalise the ratings of users, as each prediction is built from items rather than from other users.


$$\begin{equation*} \hat{x}_{k,m} = \frac{\sum\limits_{i_b} sim_i(i_m, i_b) (x_{k,b}) }{\sum\limits_{i_b}|sim_i(i_m, i_b)|} \end{equation*}$$

In [24]:
def predict_item_item(train_matrix, item_similarity, n_similar=30):
    # Column indices of the n_similar largest entries in each row, largest first.
    # NOTE: item_similarity as computed above holds cosine distances.
    similar_n = item_similarity.argsort()[:,-n_similar:][:,::-1]
    print('similar_n shape: ', similar_n.shape)
    pred = np.zeros(train_matrix.shape)

    for i,items in enumerate(similar_n):
        similar_items_indexes = items
        similarity_n = item_similarity[i,similar_items_indexes]
        matrix_n = train_matrix[:,similar_items_indexes]
        # Weighted average of every user's ratings for the selected items.
        rated_items = matrix_n.dot(similarity_n)/similarity_n.sum()
        pred[:,i]  = rated_items
    return pred
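To make the item-based formula concrete, the sketch below evaluates it for a single user and a single target movie. It also shows a variant that drops neighbour movies the user has not rated, which is an assumption added here and not part of the notebook's function. All names and values are hypothetical.

```python
import numpy as np

def item_based_prediction(user_ratings, sims, rated_only=False):
    """Similarity-weighted average of one user's ratings of neighbour movies."""
    if rated_only:
        keep = user_ratings > 0          # drop neighbours the user has not rated
        user_ratings, sims = user_ratings[keep], sims[keep]
    denom = np.abs(sims).sum()
    return float(user_ratings.dot(sims) / denom) if denom > 0 else 0.0

# Hypothetical values: the user rated two of three neighbour movies.
toy_ratings = np.array([5.0, 0.0, 4.0])
toy_sims = np.array([0.9, 0.5, 0.1])

as_in_formula = item_based_prediction(toy_ratings, toy_sims)
rated_neighbours_only = item_based_prediction(toy_ratings, toy_sims, rated_only=True)
print(as_in_formula, rated_neighbours_only)
```

Treating unrated neighbours as ratings of 0 pulls the weighted average toward 0, which is one reason predictions on a sparse matrix can come out far below the rating scale.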

Here we consider the top 50 items for each item and use the user's ratings of those items to predict the user's rating.

In [25]:
predictions = predict_item_item(train_matrix,item_similarity,50)
print('predictions shape ',predictions.shape)
predictions

similar_n shape:  (1682, 50)
predictions shape  (943, 1682)


array([[ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.66],
       [ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       ..., 
       [ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  , ...,  0.1 ,  0.08,  0.08],
       [ 0.  ,  0.  ,  0.  , ...,  0.44,  0.2 ,  0.06]])

We keep only the cells that are non-zero in the test matrix and use them to measure the error of our model.

In [26]:
predicted_ratings = predictions[test_matrix.nonzero()]
test_truth = test_matrix[test_matrix.nonzero()]
math.sqrt(mean_squared_error(predicted_ratings,test_truth))

3.749688827167227

# Getting recommendations for a user

We now get movie recommendations for the user with user id 42 using the item-item collaborative filtering predictions.

In [27]:
user_id = 42
user_ratings = predictions[user_id-1,:]

In [28]:
train_unkown_indices = np.where(train_matrix[user_id-1,:] == 0)[0]
train_unkown_indices

array([   2,    3,    4, ..., 1679, 1680, 1681])

In [29]:
user_recommendations = user_ratings[train_unkown_indices]

In [30]:
user_recommendations.shape

(1509,)

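The top 5 movie recommendations for user 42. Note that `argsort` returns positions within `user_recommendations` (the unrated subset), so the printed values are those positions plus one; index `train_unkown_indices` with the positions to recover the actual movie ids.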

In [31]:
print('\nRecommendations for user {} are the movies: \n'.format(user_id))
for movie_id in user_recommendations.argsort()[-5:][: : -1]:
    print(movie_id +1)


Recommendations for user 42 are the movies: 

1184
1182
1436
1291
689
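Because the scores are filtered to unrated movies before sorting, the sorted positions need to be mapped back through the index array to obtain movie ids. A minimal sketch of that mapping, with a hypothetical helper `top_n_unrated` and made-up scores:

```python
import numpy as np

def top_n_unrated(user_scores, user_train_row, n=5):
    """Return 1-based movie ids of the n best-scored movies the user has not rated."""
    unrated = np.where(user_train_row == 0)[0]         # 0-based movie indices
    order = user_scores[unrated].argsort()[-n:][::-1]  # positions within `unrated`
    return unrated[order] + 1                          # back to 1-based movie ids

# Hypothetical scores for four movies; movie 1 is already rated by the user.
toy_scores = np.array([4.9, 1.0, 3.5, 4.2])
toy_train_row = np.array([5.0, 0.0, 0.0, 0.0])
print(top_n_unrated(toy_scores, toy_train_row, n=2))
```

In the notebook's terms, this corresponds to printing `train_unkown_indices[position] + 1` for each sorted position.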


# Singular Value Decomposition

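SVD is the basis of our model-based method. It factorises a matrix into three matrices: two with orthonormal columns (U and V) and, between them, a diagonal matrix of singular values (S). Keeping only the k largest singular values gives a low-rank approximation of the rating matrix, whose entries we use as predicted ratings for the missing values.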


$$\begin{equation*}
X=U \times S \times V^T
\end{equation*}$$

In [32]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

In [33]:
u, s, vt = svds(train_matrix, k = 20)

In [34]:
u.shape, s.shape, vt.shape

((943, 20), (20,), (20, 1682))

In [35]:
s_diag_matrix = np.diag(s)

The prediction is obtained as the matrix product of the three matrices.

In [36]:
predictions_svd = np.dot(np.dot(u,s_diag_matrix),vt)

In [37]:
predictions_svd.shape

(943, 1682)

In [38]:
predicted_ratings_svd = predictions_svd[test_matrix.nonzero()]
test_truth = test_matrix[test_matrix.nonzero()]
math.sqrt(mean_squared_error(predicted_ratings_svd,test_truth))

2.825807569445831
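The choice `k = 20` above controls how closely the low-rank product reproduces the training matrix. The sketch below illustrates this on a small random matrix (hypothetical data, not MovieLens): the reconstruction error on the input shrinks as `k` grows. A closer fit to the training matrix does not by itself imply a lower test error, so `k` is usually chosen by evaluating on held-out ratings.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(8, 6)).astype(float)  # hypothetical ratings, 0 = unrated

def rank_k_approximation(matrix, k):
    """Rebuild the matrix from its k largest singular values."""
    u, s, vt = svds(matrix, k=k)
    return (u * s).dot(vt)   # same product as u @ diag(s) @ vt

# Frobenius-norm reconstruction error for increasing k.
errors = [np.linalg.norm(toy - rank_k_approximation(toy, k)) for k in (1, 2, 4)]
print(errors)
```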

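The root mean square error is the lowest of the three methods. We now obtain recommendations for user 85. As in the item-item cell, the loop prints `argsort` positions plus one rather than ids mapped through `train_unkown_indices`.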

In [39]:
user_id = 85
user_ratings = predictions_svd[user_id-1,:]
train_unkown_indices = np.where(train_matrix[user_id-1,:] == 0)[0]
user_recommendations = user_ratings[train_unkown_indices]
user_recommendations.shape

(1404,)

In [40]:
print('\nRecommendations for user {} are the movies: \n'.format(user_id))
for movie_id in user_recommendations.argsort()[-5:][: : -1]:
    print(movie_id +1)


Recommendations for user 85 are the movies: 

318
175
171
280
93
