# Data description & Problem statement: 
One of the most common datasets that is available on the internet for building a Recommender System is the MovieLens DataSet. This version of the dataset that I'm working with (100K) contains 100,000 ratings from 1000 users on 1700 movies. Released 4/1998. The data was collected by GroupLens researchers over various periods of time, depending on the size of the set. Users were selected at random for inclusion. All users selected had rated at least 20 movies. Each user is represented by an id, and no other information is provided. The original data are contained in three files, movies.dat, ratings.dat and users.dat. 
In this project, I use the Item-based collaborative filtering to predict the rating of a user on new item.


# Theory of Collaborative Filtering (CF):
There are 2 main types of memory-based collaborative filtering algorithms:

1) User-User Collaborative Filtering: Here I find look alike users based on similarity and recommend movies which first user’s look-alike has chosen in past. This algorithm is very effective but takes a lot of time and resources. It requires to compute every user pair information which takes time. Therefore, for big base platforms, this algorithm is hard to implement without a very strong parallelizable system.

2) Item-Item Collaborative Filtering: It is quite similar to previous algorithm, but instead of finding user's look-alike, I try finding movie's look-alike. Once we have movie's look-alike matrix, we can easily recommend alike movies to user who have rated any movie from the dataset. This algorithm is far less resource consuming than user-user collaborative filtering. Hence, for a new user, the algorithm takes far lesser time than user-user collaborate as I don’t need all similarity scores between users. And with fixed number of movies, movie-movie look alike matrix is fixed over time.

Note: Here, I use Item-based CF.

# Workflow:
- Load and merge the datasets
- Creat functions to calculate the Item-Item similarity, using:
     - Euclidian distance
     - Pearson's correlation
     - Corrected Pearson's correlation
     
- Object Oriented Programming to:
     - compute the Item-Item similarity matrix for all items, using one of similarity functions defined above,
     - predict the User rating on a new Item, using all Item-Item similarities. 

- Define functions to calculate RMSE and evaluate the Recommender System
- Evaluate the Recommender System for different similarity functions

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import math
from math import isnan
%matplotlib inline

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

plt.style.use('seaborn-whitegrid')
plt.rc('text', usetex=True)
plt.rc('font', family='times')
plt.rc('xtick', labelsize=10) 
plt.rc('ytick', labelsize=10) 
plt.rc('font', size=12)

In [2]:
ls

 Volume in drive C is OS
 Volume Serial Number is 3EA9-93A4

 Directory of C:\Users\rhash\Documents\Datasets\Recommender systems

09/15/2018  12:13 AM    <DIR>          .
09/15/2018  12:13 AM    <DIR>          ..
09/14/2018  08:05 PM    <DIR>          .ipynb_checkpoints
09/15/2018  12:13 AM         2,145,491 itemBased_CF (Movie Lens Data).ipynb
09/14/2018  09:42 AM    <DIR>          ml-100k
09/15/2018  12:11 AM            45,651 userBased_CF (Movie Lens Data).ipynb
               2 File(s)      2,191,142 bytes
               4 Dir(s)  390,184,722,432 bytes free


In [3]:
# Load Data set
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols)

# movies file contains columns indicating the movie's genres
# let's load only the first three columns of movie file with usecols
m_cols = ['movie_id', 'title', 'release_date']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(3), encoding='cp1250')

# merge the DataFrame
data = pd.merge(pd.merge(ratings, users), movies)
data = data[['user_id','title', 'movie_id','rating','release_date','sex','age']]


print("La BD has "+ str(data.shape[0]) +" ratings")
print("La BD has ", data.user_id.nunique()," users")
print("La BD has ", data.movie_id.nunique(), " movies\n")
print(data.head(5))

La BD has 100000 ratings
La BD has  943  users
La BD has  1682  movies

   user_id         title  movie_id  rating release_date sex  age
0      196  Kolya (1996)       242       3  24-Jan-1997   M   49
1      305  Kolya (1996)       242       5  24-Jan-1997   M   23
2        6  Kolya (1996)       242       4  24-Jan-1997   M   42
3      234  Kolya (1996)       242       4  24-Jan-1997   M   60
4       63  Kolya (1996)       242       3  24-Jan-1997   M   31


In [4]:
# dataframe with the data from user 1
data_movie_1 = data[data.movie_id==6]

# dataframe with the data from user 2
data_movie_2 = data[data.movie_id==18]

# We first compute the set of common movies
common_users = set(data_movie_1.user_id) & set(data_movie_2.user_id)
print("\nNumber of common users:", len(common_users),'\n')

# creat the subdataframe with only with the common movies
mask = (data_movie_1.user_id.isin(common_users))
data_movie_1 = data_movie_1[mask]
print(data_movie_1[['user_id','rating']].head(), '\n')

mask = (data_movie_2.user_id.isin(common_users))
data_movie_2 = data_movie_2[mask]
print(data_movie_2[['user_id','rating']].head())


Number of common users: 4 

       user_id  rating
95661      181       1
95662       90       4
95665        1       5
95676      655       4 

       user_id  rating
96837      181       1
96839       90       3
96840        1       4
96846      655       3


In [8]:
# Create the Item-Item similarity functions:
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

# Returns a distance-based similarity score for item 1 and item 2:
def SimEuclid(DataFrame, Item1, Item2, min_common_users=1):
    # GET MOVIES OF ITEM1
    movies_item1=DataFrame[DataFrame['movie_id'] ==Item1 ]
    # GET MOVIES OF ITEM2
    movies_item2=DataFrame[DataFrame['movie_id'] ==Item2 ]
    
    # FIND SHARED USERS
    rep=pd.merge(movies_item1, movies_item2, on='user_id')  
    
    if len(rep)==0:
        return 0
    if(len(rep)<min_common_users):
        return 0
    #return distEuclid(rep['rating_x'],rep['rating_y']) 
    return 1.0/(1.0+euclidean(rep['rating_x'],rep['rating_y'])) 

# Returns a pearsonCorrealation-based similarity score for item 1 and item 2:
def SimPearson(DataFrame, Item1, Item2, min_common_users=1):
    # GET MOVIES OF ITEM1
    movies_item1=DataFrame[DataFrame['movie_id'] ==Item1 ]
    # GET MOVIES OF ITEM2
    movies_item2=DataFrame[DataFrame['movie_id'] ==Item2 ]
    
    # FIND SHARED USERS
    rep=pd.merge(movies_item1, movies_item2, on='user_id') 
    
    if len(rep)==0:
        return 0    
    if(len(rep)<min_common_users):
        return 0    
    res=pearsonr(rep['rating_x'],rep['rating_y'])[0]
    if(isnan(res)):
        return 0
    return res

# Returns a corrected pearsonCorrealation-based similarity score for item 1 and item 2:
def SimPearson_Corrected(DataFrame, Item1, Item2, min_common_users=1, pref_common_users=10):
    # GET MOVIES OF ITEM1
    movies_item1=DataFrame[DataFrame['movie_id'] ==Item1 ]
    # GET MOVIES OF ITEM2
    movies_item2=DataFrame[DataFrame['movie_id'] ==Item2 ]
    
    # FIND SHARED USERS
    rep=pd.merge(movies_item1, movies_item2, on='user_id')  
    
    if len(rep)==0:
        return 0    
    if(len(rep)<min_common_users):
        return 0
    
    res=pearsonr(rep['rating_x'],rep['rating_y'])[0] * min(pref_common_users, len(rep))/pref_common_users
    if(isnan(res)):
        return 0
    return res

In [9]:
""" Object Oriented Programming to:
    - compute the Item-Item similarity matrix for all items, using one of similarity functions defined above,
    - predict the User rating on a new Item, using all Item-Item similarities. """ 

class itemBased_CF:
    """ Collaborative filtering using a custom sim(u,u'). """
    
    def __init__(self, DataFrame, similarity=SimPearson, min_common_users=10, max_sim_items=10):
        """ Constructor """
        self.sim_method=similarity # Gets recommendations for a item by using a weighted average
        self.df=DataFrame
        self.sim = pd.DataFrame(np.sum([0]), columns=data_train.movie_id.unique(), index=data_train.movie_id.unique())
        self.min_common_users=min_common_users
        self.max_sim_items=max_sim_items

    def learn(self):
        """ Prepare data structures for estimation. compute Item-Item Similarity Matrix for all items """
        allItems=set(self.df['movie_id'])
        self.sim = {}
        for item1 in allItems:
            self.sim.setdefault(item1, {})
            a=data_train[data_train['movie_id']==item1][['user_id']]
            data_reduced=pd.merge(data_train, a, on='user_id')
            
            for item2 in allItems:
                # don't compare the item with itself
                if item1==item2: 
                    continue
                self.sim.setdefault(item2, {})
                
                if (item1 in self.sim[item2]):
                    continue  # since correlation matrix is a symmetric matrix
                
                sim=self.sim_method(data_reduced, item1, item2, self.min_common_users)
                
                # print(item1, item2, sim)
                
                if(sim<0):
                    self.sim[item1][item2]=0
                    self.sim[item2][item1]=0
                else:
                    self.sim[item1][item2]=sim
                    self.sim[item2][item1]=sim
                    
        self.mean_ratings=mean_rating \
                         =data_train[['user_id','movie_id','rating']].groupby('movie_id')['rating'].mean()
                
                
    def estimate(self, user_id, movie_id):
        
        totals={}
        movie_users=self.df[self.df['user_id'] ==user_id]
        rating_num=0.0
        rating_den=0.0
        allItems=set(movie_users['movie_id'])
        listOrdered=sorted([(self.sim[movie_id][other],other) for other in allItems if movie_id!=other], reverse=True)
        
        for user in range(min(len(listOrdered), self.max_sim_items)):
            other=listOrdered[user][1]
            rating_num += self.sim[movie_id][other] * (float(movie_users[movie_users['movie_id']==other]['rating'])) #-self.mean_ratings[other]
            rating_den += self.sim[movie_id][other]
        if rating_den==0: 
            if self.df.rating[self.df['user_id']==user_id].mean()>0:
                # return the mean user rating if there is no similar for the computation
                return self.df.rating[self.df['user_id']==user_id].mean()
            else:
                # else return mean item rating 
                return self.df.rating[self.df['movie_id']==movie_id].mean()
        return rating_num/rating_den

In [11]:
# Define function to calculate RMSE
def  compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

In [12]:
# Define function to evaluate the Recommender System
def  evaluate (estimate_f, data_train, data_test):
    """ RMSE-based predictive performance evaluation. """
    ids_to_estimate = zip(data_test.user_id, data_test.movie_id)
    estimated = np.array([estimate_f(u,i) if u in data_train.movie_id else 3 for (u,i) in ids_to_estimate ])
    
    real = data_test.rating.values
    return compute_rmse(estimated, real)

In [10]:
from sklearn.cross_validation import train_test_split

count=data.groupby('movie_id').count()
selectedData=data[data['movie_id'].isin(list(count[count['user_id']>100].reset_index()['movie_id']))]

data_train, data_test = train_test_split(selectedData, test_size=0.2, random_state=42)

In [13]:
record=itemBased_CF(data_train, similarity=SimEuclid, min_common_users=5, max_sim_items=10)
record.learn()

In [17]:
print('RMSE for Collaborative Recomender (using Item-Item Euclidian Similarity):  %s' % 
                                                                    evaluate(record.estimate, data_train, data_test))

RMSE for Collaborative Recomender (using Item-Item Euclidian Similarity):  1.134673114522238


In [15]:
record_new=itemBased_CF(data_train, similarity=SimPearson_Corrected, min_common_users=5, max_sim_items=10)
record_new.learn()

In [16]:
print('RMSE for Collaborative Recomender (using Corrected Item-Item Correlation Similarity):  %s' % 
                                                                    evaluate(record_new.estimate, data_train, data_test))

RMSE for Collaborative Recomender (using Corrected Item-Item Correlation Similarity):  1.11367658172128
