## Movie Recommender Systems Using Collaborative Filtering

In this notebook, I will use a collaborative filtering approach to build a movie recommender system, using the same MovieLens dataset.
 
Unlike content-based filtering, which uses the similarity of item attributes to give recommendations, collaborative filtering uses knowledge of users' attitudes toward items, i.e. it uses the "wisdom of the crowd". One of the main advantages of collaborative filtering is that the algorithm can learn useful features by itself, which generally gives better performance than content-based filtering.

Collaborative filtering (CF) can be divided into Memory-Based Collaborative Filtering and Model-Based Collaborative Filtering.

Here, I will implement Model-Based CF using singular value decomposition (SVD) and Memory-Based CF using cosine similarity.

### Basic Imports 

In [1]:
import numpy as np
import pandas as pd

### Loading the Data 

In [2]:
columns = ['user_id', 'item_id', 'rating', 'timestamp']
movie_data = pd.read_csv('u.data', sep='\t', names=columns)

In [3]:
movie_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [4]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [5]:
movie_merged = pd.merge(movie_data,movie_titles,on="item_id")
movie_merged.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


Now, let's see the number of unique users and movies.

In [6]:
n_of_users = movie_merged.user_id.nunique()
n_of_movies = movie_merged.item_id.nunique()

print('Num of Unique Users: '+ str(n_of_users))
print('Num of Unique Movies: '+str(n_of_movies))

Num of Unique Users: 944
Num of Unique Movies: 1682


### Train Test Split

Note that this is not the usual train-test split: there is no target variable here, so we split the individual ratings (rows) at random.

In [7]:
from sklearn.model_selection import train_test_split

In [10]:
train_set, test_set = train_test_split(movie_merged, test_size=0.2)

## Memory-Based Collaborative Filtering

Memory-based methods can be broadly divided into two groups:

 - Item-Item Collaborative Filtering: Users who liked this item also liked other similar items.
 - User-Item Collaborative Filtering: Users who are similar to you also liked these items.
 
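First we must create a user-item rating matrix of dimension [944 x 1682] (all users by all movies), where each entry holds a user's rating of a movie, or 0 if there is none. We need two of these, one for training and one for testing. The similarity matrices are then computed from the training matrix.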

The similarity values between items in Item-Item Collaborative Filtering are measured by observing all the users who have rated both items.

For User-Item Collaborative Filtering the similarity values between users are measured by observing all the items that are rated by both users.

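In both cases we use the cosine measure. Note that scikit-learn's pairwise_distances with metric='cosine' returns the cosine distance, which is 1 minus the cosine similarity.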
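To make the measure concrete, here is a minimal sketch on a made-up 3 x 4 rating matrix. The names `toy_ratings` and `cosine_similarity_rows` are hypothetical and not part of this notebook, and the sketch assumes no row is all zeros. It computes the cosine similarity between users by hand and checks it against `1 - pairwise_distances(...)`.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy ratings: 3 users x 4 movies, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows (assumes no all-zero row)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms
    return unit_rows @ unit_rows.T

# User-user similarity computed by hand.
toy_user_sim = cosine_similarity_rows(toy_ratings)

# pairwise_distances returns cosine *distance*, so subtract it from 1.
toy_user_sim_sklearn = 1.0 - pairwise_distances(toy_ratings, metric='cosine')

# Item-item similarity uses the same function on the transposed matrix.
toy_item_sim = cosine_similarity_rows(toy_ratings.T)

print(np.round(toy_user_sim, 3))
```

Users 0 and 1 rate the same movies almost identically, so their similarity is close to 1, while user 2 has very different tastes and gets a much lower value.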



### User-item matrices

In [11]:
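train_matrix = np.zeros((n_of_users, n_of_movies))
for row in train_set.itertuples():
    # item_id starts at 1, so item_id - 1 is a 0-based column index.
    # user_id starts at 0 in this file, so user 0 maps to index -1, i.e. the last row.
    train_matrix[row.user_id - 1, row.item_id - 1] = row.rating

test_matrix = np.zeros((n_of_users, n_of_movies))
for row in test_set.itertuples():
    test_matrix[row.user_id - 1, row.item_id - 1] = row.rating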

In [12]:
train_matrix.shape

(944, 1682)

In [13]:
test_matrix.shape

(944, 1682)

We can see that we have two identically shaped matrices, which is what we want.

### Cosine similarity

In [16]:
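from sklearn.metrics.pairwise import pairwise_distances
# Note: with metric='cosine', pairwise_distances returns the cosine distance (1 - cosine similarity).
# Use 1 - pairwise_distances(...) to obtain similarities; the results below were produced with the code as written.
user_similarity = pairwise_distances(train_matrix, metric='cosine')
item_similarity = pairwise_distances(train_matrix.T, metric='cosine')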

### Making Predictions 

The function defined here implements the prediction formulas for user-based and item-based CF, selected with the type argument.
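Written out, the two branches of the predict function in the next cell compute the following. The notation is introduced here for illustration and does not come from the notebook: $r_{u,i}$ is the rating of user $u$ for item $i$ (0 if unrated), $\bar{r}_u$ is the mean of row $u$ of the rating matrix, and $\mathrm{sim}(\cdot,\cdot)$ is the corresponding entry of the matrix passed as similarity.

User-based:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} \mathrm{sim}(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v} \lvert \mathrm{sim}(u,v) \rvert}$$

Item-based:

$$\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\,\mathrm{sim}(j,i)}{\sum_{j} \lvert \mathrm{sim}(i,j) \rvert}$$

The user-based form subtracts each user's mean before averaging, which corrects for users who rate consistently high or low. The item-based form needs no such correction because it only averages the target user's own ratings.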

In [17]:
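def predict(ratings, similarity, type='user'):
    """Predict ratings from a user-item matrix and a user-user or item-item weight matrix."""
    if type == 'user':
        # Center each user's ratings on that user's mean (zeros for unrated movies are included)
        mean_user_rating = ratings.mean(axis=1)
        ratings_diff = ratings - mean_user_rating[:, np.newaxis]
        # Weighted average of all users' centered ratings, added back to the user's mean
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings over all items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred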

In [18]:
item_pred = predict(train_matrix, item_similarity, type='item')
user_pred = predict(train_matrix, user_similarity, type='user')
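The `predict` function above averages each user's row including the zeros that stand for unrated movies, which drags the user means down. The following is a minimal sketch of a variation, not the method used in this notebook, that takes the mean over rated entries only. The function name `predict_user_masked` and the toy inputs are hypothetical.

```python
import numpy as np

def predict_user_masked(ratings, similarity):
    """User-based prediction with user means taken over rated (non-zero) entries only."""
    rated = ratings != 0
    n_rated = rated.sum(axis=1)
    # Mean over rated entries; a user with no ratings gets a mean of 0.
    user_mean = ratings.sum(axis=1) / np.maximum(n_rated, 1)
    # Center only the rated entries; unrated entries stay at 0 and contribute nothing.
    ratings_diff = np.where(rated, ratings - user_mean[:, np.newaxis], 0.0)
    # Normalize by the total absolute weight of each user's row, as in predict().
    weight_sum = np.abs(similarity).sum(axis=1, keepdims=True)
    weight_sum[weight_sum == 0] = 1.0  # avoid division by zero
    return user_mean[:, np.newaxis] + similarity.dot(ratings_diff) / weight_sum

# Hypothetical toy inputs: 2 users x 3 movies and a 2 x 2 similarity matrix.
toy_ratings = np.array([[5.0, 4.0, 0.0],
                        [4.0, 0.0, 2.0]])
toy_similarity = np.array([[1.0, 0.5],
                           [0.5, 1.0]])
toy_pred = predict_user_masked(toy_ratings, toy_similarity)
print(np.round(toy_pred, 3))
```

With this change the toy user means are 4.5 and 3.0 instead of 3.0 and 2.0, so the predictions stay in the range of the ratings that were actually given.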

### Evaluating CF algorithm

There are many evaluation metrics, but I will use RMSE here for simplicity's sake.

Since we only want to consider the ratings that are in the test dataset, we simply filter out all other elements in the prediction matrix.

In [19]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [20]:
def rmse(prediction, ground_truth):
    """
    Compute the RMSE over the rated entries only: ground_truth.nonzero()
    selects the positions that hold a test rating, and everything else is ignored.
    """
    rated = ground_truth.nonzero()
    prediction = prediction[rated].flatten()
    ground_truth = ground_truth[rated].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))
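As a small worked example of the filtering step, the sketch below uses two made-up 2 x 3 matrices (the names `toy_prediction` and `toy_truth` are hypothetical) and computes the RMSE over only the positions where the ground truth holds a rating.

```python
import numpy as np

# Hypothetical toy data: predictions exist for every cell, but only
# three cells of the ground truth hold an actual (non-zero) rating.
toy_prediction = np.array([[4.0, 2.5, 3.0],
                           [1.0, 5.0, 2.0]])
toy_truth = np.array([[5.0, 0.0, 3.0],
                      [0.0, 4.0, 0.0]])

# nonzero() returns the row and column indices of the rated cells.
rows, cols = toy_truth.nonzero()
held_out_pred = toy_prediction[rows, cols]
held_out_true = toy_truth[rows, cols]

# RMSE over the three rated cells only.
toy_rmse = np.sqrt(np.mean((held_out_pred - held_out_true) ** 2))
print(toy_rmse)
```

The predictions at unrated positions (2.5, 1.0 and 2.0 here) never enter the error, so a model is not penalized for cells where no test rating exists.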

In [21]:
print('Item-based CF RMSE: ' + str(rmse(item_pred, test_matrix)))
print('User-based CF RMSE: ' + str(rmse(user_pred, test_matrix)))

Item-based CF RMSE: 3.430917128562384
User-based CF RMSE: 3.083351220437865


An RMSE above 3 on a 1-to-5 rating scale is high, so these results leave a lot of room for improvement. Part of the error comes from the setup itself: unrated movies are stored as zeros, which pulls the user means and the predictions toward zero. Memory-based CF also scales poorly in real-world use cases and does not address the well-known cold-start problem: it tends to perform badly when a new user or item enters the system. Model-based CF scales better and can deal with higher sparsity than memory-based CF, but it also suffers when new users or items without any ratings enter the system.


## Model-Based Collaborative Filtering

This method is based on matrix factorization, an unsupervised learning technique that is a common choice for recommender systems. We will look at other unsupervised methods, such as RBMs and autoencoders, in another notebook. The basic idea is that, through matrix factorization and dimensionality reduction, the model learns latent features from the data that it can use to make predictions. This makes model-based CF a lot like PCA, except that here we use singular value decomposition instead of an eigenvalue/eigenvector decomposition. Working with a compact low-rank representation of a very sparse rating matrix is what makes such models scalable to real-world applications.
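To illustrate the idea of learning a few latent features, here is a minimal sketch on synthetic data. The matrix `toy_matrix` is made up from two hidden factors, and `low_rank_approx` is a hypothetical helper, neither of which is part of this notebook. It shows that a truncated SVD with k equal to the true number of factors reconstructs the matrix almost exactly, while a smaller k leaves a visible error.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 users x 5 movies generated from 2 hidden "taste" factors,
# so the matrix has rank 2 by construction.
user_factors = rng.random((6, 2))
movie_factors = rng.random((2, 5))
toy_matrix = user_factors @ movie_factors

def low_rank_approx(matrix, k):
    """Rank-k approximation built from the k largest singular values."""
    u, s, vt = svds(matrix, k=k)
    return u @ np.diag(s) @ vt

# Frobenius-norm reconstruction error for k = 1 and k = 2.
err_k1 = np.linalg.norm(toy_matrix - low_rank_approx(toy_matrix, 1))
err_k2 = np.linalg.norm(toy_matrix - low_rank_approx(toy_matrix, 2))
print(err_k1, err_k2)
```

On real rating data the matrix is not exactly low rank, so k becomes a trade-off between capturing more structure and fitting noise.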

Let's calculate the sparsity level of the MovieLens dataset:

In [22]:
sparsity=round(1.0-len(movie_merged)/float(n_of_users*n_of_movies),3)
print('The sparsity level of MovieLens100K is {}'.format(str(sparsity*100)))

The sparsity level of MovieLens100K is 93.7


### SVD 

In [23]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

In [24]:
# Get the SVD components from the train matrix. The choice of k = 25 is arbitrary.
u, s, vt = svds(train_matrix, k=25)
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)

In [26]:
print('SVD-based CF RMSE: {}'.format(rmse(X_pred, test_matrix)))

SVD-based CF RMSE: 2.6791845369347227


The RMSE drops from about 3.08 for user-based CF to about 2.68 with SVD, so we can already see an improvement.

There are other, more advanced approaches such as hybrid recommender systems. As the name implies, they combine content-based and collaborative filtering methods and tend to give better performance. Research has shown that they can handle the well-known cold-start problem associated with recommender systems: if we don't have any ratings for a user or an item, we can use the metadata of that user or item to make a prediction.