In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read the MovieLens 100k ratings file (tab-separated, no header row)
header = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=header)

In [8]:
# check the no. of users and movies
n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + '| Number of movies = ' + str(n_items))

Number of users = 943| Number of movies = 1682


In [9]:
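# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.25)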


Since this is a toy example, we build a memory-based recommendation system.
There are two main approaches to collaborative filtering (CF) here: user-based collaborative filtering and item-based collaborative filtering.
Item based - "Users who liked this item also liked ..."
User based - "Users who are similar to you also liked ..."

* User-based CF measures how similar users are from their ratings, then recommends items that the similar users liked
* Item-based CF picks an item, finds the users who liked it, looks at which other items those users liked, and recommends those

In both cases, we need to create a user-item matrix of dimension n x m, where n is the number of users and m is the number of movies

* Item-based similarity is computed from the dot product of two item columns. Only users who have rated both items contribute to that product

* User-based similarity is computed from the dot product of two user rows. Users who gave similar ratings to the same items end up with a higher similarity
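
As a rough illustration of the row and column similarities described above, here is a minimal sketch on a made-up 3 x 4 rating matrix. The matrix `R` and the helper `cosine_similarity_rows` are hypothetical names used only for this example; the dot products are normalized by the vector lengths, which gives the cosine similarity.

```python
import numpy as np

# Toy user-item matrix (made-up values): 3 users (rows) x 4 items (columns).
# A 0 means the user has not rated that item.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])

def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m (assumes no all-zero row)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms          # scale each row to unit length
    return unit @ unit.T      # dot products of unit vectors are cosines

user_sim = cosine_similarity_rows(R)    # shape (3, 3): user vs. user (rows of R)
item_sim = cosine_similarity_rows(R.T)  # shape (4, 4): item vs. item (columns of R)

print(np.round(user_sim, 3))
print(np.round(item_sim, 3))
```

In this toy data, users 0 and 1 rated the same items similarly, so their similarity comes out higher than that of users 0 and 2.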


In [11]:
# Create two user-item matrices, one for training and one for testing.
# itertuples() yields (index, user_id, item_id, rating, timestamp); IDs are 1-based, hence the -1.
train_data_matrix = np.zeros((n_users, n_items))
for row in train_data.itertuples():
    train_data_matrix[row[1]-1, row[2]-1] = row[3]

test_data_matrix = np.zeros((n_users, n_items))
for row in test_data.itertuples():
    test_data_matrix[row[1]-1, row[2]-1] = row[3]

In [12]:
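from sklearn.metrics.pairwise import pairwise_distances
# Note: metric='cosine' returns cosine distance (1 - cosine similarity), so these
# arrays hold distances; 1 - pairwise_distances(...) would give similarities.
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')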

![Item based similarity ](item_sim_formula.png)
![User based similarity](user_sim_formula.png)
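
To see how cosine distance and cosine similarity relate, the following sketch compares scikit-learn's `pairwise_distances(..., metric='cosine')` with `cosine_similarity` on a small made-up matrix. The matrix `R` is an assumption for illustration only and is not part of the MovieLens data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Made-up ratings: 3 users x 4 items, 0 = not rated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])

dist = pairwise_distances(R, metric='cosine')  # cosine distance between users
sim = cosine_similarity(R)                     # cosine similarity between users

# The two are complements: distance = 1 - similarity.
print(np.allclose(dist, 1.0 - sim))
```

A user compared with themselves has distance 0 and similarity 1, so which of the two is passed to a prediction function changes how neighbours are weighted.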

In [13]:

def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Per-user mean over all items (unrated entries count as 0)
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis makes the means a column vector so they broadcast across each user's row
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
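
To make the user-based branch of `predict` easier to follow, here is a small worked sketch of the same formula on made-up numbers. The rating matrix `R` and the similarity matrix `S` are hypothetical values chosen for illustration, not computed from the dataset.

```python
import numpy as np

# Made-up ratings: 3 users x 4 items, 0 = not rated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])

# Hypothetical user-user weights (symmetric, 1.0 on the diagonal).
S = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])

mean_user = R.mean(axis=1, keepdims=True)       # shape (3, 1), one mean per user
diff = R - mean_user                            # each user's deviation from their own mean
weights = np.abs(S).sum(axis=1, keepdims=True)  # normalizer: total weight per user

# Prediction = own mean + weighted average of the other users' deviations.
pred = mean_user + (S @ diff) / weights

print(pred.shape)
print(np.round(pred, 3))
```

Working relative to each user's mean is meant to compensate for users who rate everything high or low; the weighted sum then only carries how far above or below their usual level the neighbours rated an item.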


In [16]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [20]:
print (user_prediction.shape)
print (item_prediction.shape)

(943, 1682)
(943, 1682)


In [27]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    '''RMSE over the entries that are actually rated (non-zero) in ground_truth, the test rating matrix.'''
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
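
The masking step in `rmse` can be checked by hand on a tiny example. The sketch below uses NumPy only; the function name `rmse_on_observed` and the two 2 x 2 arrays are made up for illustration.

```python
import numpy as np

def rmse_on_observed(prediction, ground_truth):
    """RMSE restricted to positions where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    err = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(err ** 2)))

truth = np.array([[4.0, 0.0],
                  [0.0, 2.0]])   # only two entries are rated
pred = np.array([[3.0, 9.0],
                 [9.0, 4.0]])    # the 9.0 predictions fall on unrated cells

score = rmse_on_observed(pred, truth)
print(score)  # errors are -1 and 2, so this is sqrt((1 + 4) / 2)
```

The predictions on unrated cells are ignored, so only the two rated positions affect the score.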

In [28]:
print ('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print ('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.12946125005121
Item-based CF RMSE: 3.4558716606792514


Most of the code is taken as-is from this [tutorial](https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html). This notebook is only for my own understanding.