## Collaborative Filtering

**Collaborative filtering (CF) recommends items based on what is known about users’ attitudes toward items; in other words, it uses the "wisdom of the crowd".**

- CF can be divided into memory-based and model-based collaborative filtering.

- Memory-based CF comes in two types:
  - User-User CF
  - Item-Item CF

- Model-based CF is based on matrix factorization (MF), which is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF.

In this notebook we implement memory-based CF by computing similarities between items (Pearson correlation) and between users (cosine metric), and model-based CF by using singular value decomposition (SVD).
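
To make the idea of similarity concrete before touching the dataset, here is a minimal sketch of cosine similarity between rating vectors. The users `alice`, `bob` and `carol`, their ratings, and the helper `cosine_similarity` are hypothetical and only serve as an illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings of three users for four movies (0 = not rated).
alice = [5, 4, 0, 1]
bob = [4, 5, 0, 2]
carol = [1, 0, 5, 4]

sim_alice_bob = cosine_similarity(alice, bob)
sim_alice_carol = cosine_similarity(alice, carol)
print(round(sim_alice_bob, 3), round(sim_alice_carol, 3))
```

Users with similar tastes (Alice and Bob) get a value close to 1, while users who mostly disagree get a value closer to 0.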

### Item-Item CF

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

In [2]:
column = ['user_id', 'item_id','rating','timestamp']
df = pd.read_csv('u.data.csv', sep = '\t', names = column)
movies = pd.read_csv('Movie_Id_Titles.csv')

In [3]:
df = pd.merge(df,movies, on = 'item_id').drop(['timestamp'], axis = 1)
df.head()

Unnamed: 0,user_id,item_id,rating,title
0,0,50,5,Star Wars (1977)
1,290,50,5,Star Wars (1977)
2,79,50,4,Star Wars (1977)
3,2,50,5,Star Wars (1977)
4,8,50,5,Star Wars (1977)


In [4]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


#### Creating a pivot table of the rating each user gave to each movie
Since no user has rated every movie, the table contains many NaN values.

In [5]:
userRatings = df.pivot_table(index=['user_id'],columns=['title'],values='rating')
# Drop movies rated by fewer than 10 users, then fill the remaining missing ratings with 0
userRatings = userRatings.dropna(thresh=10, axis=1).fillna(0,axis=1)
userRatings.head()

title,101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),"39 Steps, The (1935)",8 1/2 (1963),Absolute Power (1997),"Abyss, The (1989)",...,Wolf (1994),"Women, The (1939)","Wonderful, Horrible Life of Leni Riefenstahl, The (1993)",Wonderland (1997),"Wrong Trousers, The (1993)",Wyatt Earp (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)"
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,5.0,0.0,5.0,3.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
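
The effect of `pivot_table`, `dropna(thresh=...)` and `fillna(0)` is easier to see on a tiny table. The sketch below uses a hypothetical long-format frame `toy` with three movies and a threshold of 2 instead of 10.

```python
import pandas as pd

# Hypothetical long-format ratings with the same kind of columns as df.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "A", "C"],
    "rating":  [5, 3, 4, 2, 1, 4],
})

# One row per user, one column per movie; unrated cells become NaN.
wide = toy.pivot_table(index="user_id", columns="title", values="rating")

# Keep only movies rated by at least 2 users, then mark unrated cells with 0.
filtered = wide.dropna(thresh=2, axis=1).fillna(0)
print(filtered)
```

Movie `C` is dropped because only one user rated it, and user 3's missing rating for `B` becomes 0.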


#### Creating similarity matrix

In [6]:
corrMatrix = userRatings.corr(method='pearson')
corrMatrix.head()

title,101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),"39 Steps, The (1935)",8 1/2 (1963),Absolute Power (1997),"Abyss, The (1989)",...,Wolf (1994),"Women, The (1939)","Wonderful, Horrible Life of Leni Riefenstahl, The (1993)",Wonderland (1997),"Wrong Trousers, The (1993)",Wyatt Earp (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)"
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
101 Dalmatians (1996),1.0,0.059375,-0.001026,0.052983,0.128832,0.078451,0.015592,0.005819,0.22113,0.121285,...,0.057828,0.05199,-0.034379,0.000754,0.074758,0.109125,0.155599,0.1153,0.039243,-0.005846
12 Angry Men (1957),0.059375,1.0,-0.014261,0.066459,0.230361,0.298878,0.33926,0.174562,0.019941,0.156865,...,0.048841,0.145077,0.138377,-0.004764,0.170314,0.160215,0.290332,0.165072,0.079418,0.038188
187 (1997),-0.001026,-0.014261,1.0,0.078831,-0.010273,-0.039807,-0.021359,-0.006205,0.127598,0.017356,...,0.085036,-0.024068,-0.020277,0.115338,-0.025753,-0.000791,-0.021764,0.006881,0.053885,0.063828
2 Days in the Valley (1996),0.052983,0.066459,0.078831,1.0,0.056372,0.091159,-0.019876,-0.008144,0.245286,0.129326,...,0.087648,0.069056,-0.001807,-0.02696,0.028328,0.116563,0.061485,0.19771,0.176088,0.146833
"20,000 Leagues Under the Sea (1954)",0.128832,0.230361,-0.010273,0.056372,1.0,0.384624,0.274579,0.118159,0.117611,0.231341,...,0.244146,0.130682,0.06269,-0.001689,0.10168,0.286895,0.309606,0.243381,0.058035,0.071166

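What `corr(method='pearson')` computes for a pair of movie columns can be reproduced by hand. The following sketch uses a hypothetical 4-user frame `toy`; note that, as in the cell above, the zeros that stand for "not rated" are treated as real values by the correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie ratings (0 = not rated), shaped like userRatings.
toy = pd.DataFrame({
    "Movie A": [5, 4, 0, 1],
    "Movie B": [4, 5, 0, 2],
    "Movie C": [0, 1, 5, 4],
})

corr = toy.corr(method="pearson")

# Pearson correlation of two columns computed by hand:
# centre each column, then divide the co-variation by the product of the spreads.
a = toy["Movie A"] - toy["Movie A"].mean()
b = toy["Movie B"] - toy["Movie B"].mean()
manual = (a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum())
print(corr.loc["Movie A", "Movie B"], manual)
```
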

#### Creating a function to get the similar movies from the above table

In [7]:
def get_similar(movie_name,rating):
    # Weight the correlations by (rating - 2.5): ratings of 3 or more push similar movies up the list, ratings of 2 or less push them down
    similar_ratings = corrMatrix[movie_name]*(rating-2.5)
    similar_ratings = similar_ratings.sort_values(ascending=False)
    return similar_ratings
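
The weighting in `get_similar` can be traced on a small example. The sketch below uses a hypothetical 3 x 3 correlation matrix `toy_corr` and a stand-in function `score_similar` that applies the same rule.

```python
import pandas as pd

# Hypothetical 3 x 3 item correlation matrix standing in for corrMatrix.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

def score_similar(corr, movie_name, rating, midpoint=2.5):
    # Same weighting as get_similar: ratings above the midpoint push
    # correlated movies up, ratings below it push them down.
    return (corr[movie_name] * (rating - midpoint)).sort_values(ascending=False)

liked = score_similar(toy_corr, "A", 5)     # weight +2.5
disliked = score_similar(toy_corr, "A", 1)  # weight -1.5
print(liked)
print(disliked)
```

A high rating for `A` ranks the strongly correlated movie `B` near the top, whereas a low rating for `A` reverses the order and favours `C`.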

In [8]:
romantic_lover = [("Toy Story (1995)",2),("GoldenEye (1995)",4),("Get Shorty (1995)",5)]
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of Series (one row per rated movie)
similar_movies = pd.DataFrame([get_similar(movie, rating) for movie, rating in romantic_lover])

similar_movies.head()
similar_movies.sum().sort_values(ascending=False).head()

Get Shorty (1995)    3.000457
GoldenEye (1995)     2.405486
True Lies (1994)     1.949342
Batman (1989)        1.902480
Top Gun (1986)       1.847022
dtype: float64

===================================================================================================

### User-User CF

#### Train Test Split

We split the ratings into two sets: a training set (75%) and a test set (25%).

In [9]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)

Creating user-item matrix for training and test data

In [10]:
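# Create two user-item matrices, one for training and one for testing.
# itertuples() yields (index, user_id, item_id, rating, title), so line[1] is the user and line[2] the item.
# Ids are shifted by 1 to get 0-based positions; user_id 0 maps to index -1, i.e. the last row.
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]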

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

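**We will now use the `pairwise_distances` function from sklearn with the cosine metric. Note that it returns the cosine distance (1 minus the cosine similarity), not the similarity itself. Because all ratings are non-negative, the values range from 0 to 1.**

In [11]:
from sklearn.metrics import pairwise_distances
# metric="cosine" yields distances: 0 for identical rating vectors, 1 for users with no rated movie in common
user_sim = pairwise_distances(train_data_matrix, metric="cosine")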
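
The relationship between cosine similarity and cosine distance can be checked on a small matrix. The sketch below uses plain NumPy and a hypothetical 3 x 4 rating matrix `toy`; `pairwise_distances(..., metric="cosine")` returns the quantity called `dist` here.

```python
import numpy as np

# Hypothetical 3 users x 4 items matrix (0 = not rated).
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
])

# Cosine similarity: dot products divided by the product of the row norms.
norms = np.linalg.norm(toy, axis=1, keepdims=True)
sim = (toy @ toy.T) / (norms * norms.T)

# Cosine distance is simply 1 - similarity.
dist = 1.0 - sim
print(np.round(sim, 3))
print(np.round(dist, 3))
```

Similar users have a similarity near 1 and a distance near 0, so the two quantities rank neighbours in opposite order.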

**Making predictions**

In [12]:
def predict(ratings, similarity):
    mean_user_rating = ratings.mean(axis=1)
    # np.newaxis turns mean_user_rating into a column vector so that it broadcasts across the columns of ratings
    ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return pred
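
The rule implemented by `predict` can be written out as a formula. This is a transcription of the code above rather than an independent derivation; the symbols $x_{u,i}$, $\bar{x}_u$ and $s_{u,v}$ are notation introduced here for the entries of `ratings`, `mean_user_rating` and `similarity`.

$$\hat{x}_{u,i} = \bar{x}_u + \frac{\sum_{v} s_{u,v}\,(x_{v,i} - \bar{x}_v)}{\sum_{v} \lvert s_{u,v} \rvert}$$

Here $\bar{x}_u$ is the mean over all columns of the matrix, so unrated movies (stored as 0) pull it down.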

In [13]:
user_prediction = predict(train_data_matrix, user_sim)
pd.DataFrame(user_prediction).head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,1.586001,0.556256,0.466423,0.778021,0.456795,0.327816,1.412294,0.894477,1.073745,0.488665,...,0.277833,0.278616,0.27479,0.276823,0.277289,0.27479,0.278465,0.27724,0.27721,0.276942
1,1.313729,0.278874,0.15348,0.54162,0.148314,-0.004523,1.173239,0.651747,0.772925,0.16867,...,-0.063521,-0.062229,-0.066796,-0.064787,-0.063918,-0.066796,-0.063846,-0.064829,-0.063633,-0.063403


**Evaluation**

In [14]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
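
Because the test matrix stores unrated cells as 0, the error is computed only where the ground truth is non-zero. A minimal NumPy sketch of that masking, with a hypothetical helper `masked_rmse` and toy 2 x 2 matrices:

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    # Hypothetical helper mirroring rmse above: score only the cells
    # where ground_truth holds an actual (non-zero) rating.
    rows, cols = ground_truth.nonzero()
    errors = prediction[rows, cols] - ground_truth[rows, cols]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0], [0.0, 3.0]])
pred = np.array([[4.0, 2.0], [1.0, 3.0]])

# Only the two rated cells count: errors are -1 and 0.
print(masked_rmse(pred, truth))
```

The predictions for the two unrated cells (2.0 and 1.0) have no influence on the score.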

In [15]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))

User-based CF RMSE: 3.1368008046071805


## Model-Based Collaborative filtering

Model-based collaborative filtering is based on matrix factorization (MF), which is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF.

**Calculating the sparsity level of this dataset**

In [16]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%
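
The same sparsity calculation on a hypothetical 3 x 4 matrix, where the arithmetic can be checked by eye (4 ratings out of 12 cells):

```python
import numpy as np

# Hypothetical user x item matrix (0 = not rated).
toy = np.array([
    [5, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
])

n_rated = np.count_nonzero(toy)
sparsity = 1.0 - n_rated / toy.size  # fraction of empty cells
print(f"{sparsity:.1%}")
```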


### SVD

A well-known matrix factorization method is singular value decomposition (SVD). Collaborative filtering can be formulated as approximating a matrix $X$ by its singular value decomposition.

The general equation is:

$$X = U S V^T$$

Given an $m \times n$ matrix $X$:
- $U$ is an $m \times r$ matrix with orthonormal columns
- $S$ is an $r \times r$ diagonal matrix with non-negative real numbers on the diagonal
- $V^T$ is an $r \times n$ matrix with orthonormal rows

The diagonal elements of $S$ are known as the singular values of $X$.

The matrix $X$ can thus be factorized into $U$, $S$ and $V$. The matrix $U$ holds the feature vectors of the users in the hidden feature space, and the matrix $V$ holds the feature vectors of the items in that space.

We now make a prediction by taking the product of $U$, $S$ and $V^T$.

In [18]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Get the SVD components from the training matrix, keeping the k largest singular values.
u, s, vt = svds(train_data_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.7267134117552745
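
The effect of keeping only the largest singular values can be seen on a small matrix. This sketch uses `np.linalg.svd` on a hypothetical 3 x 4 matrix `X` rather than `svds`, and the helper `rank_k` is introduced here for illustration.

```python
import numpy as np

# Hypothetical 3 users x 4 items rating matrix.
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
])

# Thin SVD: singular values in s are sorted from largest to smallest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rank_k(k):
    # Rebuild X from the k largest singular values and their vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

full = rank_k(len(s))   # uses every singular value
approx = rank_k(2)      # drops the smallest one
err = np.linalg.norm(X - approx)  # Frobenius norm of the residual
print(np.round(approx, 2))
print(err, s[2])
```

Using all singular values reproduces `X` exactly, while the rank-2 approximation differs from `X` by exactly the discarded singular value (in Frobenius norm).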


Review:

- We have covered how to implement simple collaborative filtering methods, both memory-based and model-based.
- Memory-based models rely on similarity between items or users; here we used Pearson correlation for items and the cosine metric for users.
- Model-based CF is based on matrix factorization; here we used SVD to factorize the rating matrix.
- This was a simple recommender system. Real-world systems use more robust models, which rely heavily on linear algebra and other computations.