# Data description & Problem statement: 
One of the most widely used public datasets for building a recommender system is MovieLens. The version used here (100K) contains 100,000 ratings from 943 users on 1,682 movies and was released in April 1998. The data was collected by GroupLens researchers over various periods of time, depending on the size of the set. Users were selected at random for inclusion, and every selected user had rated at least 20 movies. Each user is represented by an id along with basic demographic fields (age, sex, occupation, zip code). This notebook reads three files from the `ml-100k` folder: `u.data` (ratings), `u.item` (movies) and `u.user` (users). 
In this project, I use user-based collaborative filtering to predict the rating a user would give to a new item.


# Theory of Collaborative Filtering (CF):
There are 2 main types of memory-based collaborative filtering algorithms:

1) User-User Collaborative Filtering: Here I find look-alike users based on rating similarity and recommend movies that a user's look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing similarity information for every pair of users. Therefore, on platforms with a large user base, it is hard to implement without a strongly parallelizable system.

2) Item-Item Collaborative Filtering: It is quite similar to the previous algorithm, but instead of finding a user's look-alikes, I try to find a movie's look-alikes. Once the movie look-alike matrix is available, similar movies can easily be recommended to any user who has rated a movie in the dataset. This algorithm is far less resource-consuming than user-user collaborative filtering: for a new user it takes far less time, since no similarity scores between users are needed. And with a fixed number of movies, the movie-movie look-alike matrix keeps a fixed size over time.

Note: Here, I use User-based CF.
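
To make the distinction concrete, here is a minimal sketch on a made-up table of 3 users and 4 movies. The data and the helper names `pearson` and `pairwise` are illustrative assumptions, not part of MovieLens or of the code in this notebook. User-user CF compares the rows of the table, while item-item CF compares its columns.

```python
import math

# Toy ratings: rows are users, columns are movies (hypothetical values).
ratings = [
    [5, 4, 1, 2],  # user A
    [4, 5, 2, 1],  # user B
    [1, 2, 5, 4],  # user C
]

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def pairwise(vectors):
    """Similarity matrix between every pair of vectors."""
    return [[pearson(u, v) for v in vectors] for u in vectors]

# User-user: one row and one column per user (3 x 3).
user_sim = pairwise(ratings)

# Item-item: transpose the table first, then compare movies (4 x 4).
item_sim = pairwise([list(col) for col in zip(*ratings)])
```

In this toy table users A and B rate alike (correlation 0.8), while A and C have opposite tastes (correlation -1.0). The user-user matrix grows with the square of the number of users, and the item-item matrix with the square of the number of movies.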

# Workflow:
- Load and merge the datasets
- Create functions to calculate the User-User similarity, using:
     - Euclidean distance
     - Pearson's correlation
     - Corrected Pearson's correlation
     
- Use object-oriented programming to:
     - compute the User-User similarity matrix for all Users, using one of similarity functions defined above
     - predict the User rating on a new Item, using all User-User similarities 

- Define functions to calculate RMSE and evaluate the Recommender System
- Evaluate the Recommender System for different similarity functions
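
For reference, the quantities computed in the cells below can be summarised as follows. The notation is an assumption introduced here for readability: $I_{uv}$ is the set of movies rated by both users $u$ and $v$, $r_{u,i}$ is the rating of user $u$ on movie $i$, $\mu_u$ is the mean rating of $u$ over $I_{uv}$, $\bar r_u$ is the mean of all training ratings of $u$, $N$ is `pref_common_items`, and $K_i(u)$ is the set of at most `max_sim_users` users most similar to $u$ among those who rated movie $i$.

```latex
\begin{aligned}
\mathrm{sim}_{\mathrm{Euclid}}(u,v) &= \frac{1}{1+\sqrt{\sum_{i\in I_{uv}} (r_{u,i}-r_{v,i})^2}} \\[4pt]
\mathrm{sim}_{\mathrm{Pearson}}(u,v) &= \frac{\sum_{i\in I_{uv}} (r_{u,i}-\mu_u)(r_{v,i}-\mu_v)}
  {\sqrt{\sum_{i\in I_{uv}} (r_{u,i}-\mu_u)^2}\,\sqrt{\sum_{i\in I_{uv}} (r_{v,i}-\mu_v)^2}} \\[4pt]
\mathrm{sim}_{\mathrm{corrected}}(u,v) &= \mathrm{sim}_{\mathrm{Pearson}}(u,v)\cdot\frac{\min(N,\,|I_{uv}|)}{N} \\[4pt]
\hat r_{u,i} &= \bar r_u + \frac{\sum_{v\in K_i(u)} \mathrm{sim}(u,v)\,(r_{v,i}-\bar r_v)}{\sum_{v\in K_i(u)} \mathrm{sim}(u,v)}
\end{aligned}
```

In the code, a similarity is set to 0 when the two users share fewer than `min_common_items` movies, and negative similarities are replaced by 0 before they are used in the prediction.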

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import math
from math import isnan
%matplotlib inline

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

plt.style.use('seaborn-whitegrid')
plt.rc('text', usetex=True)
plt.rc('font', family='times')
plt.rc('xtick', labelsize=10) 
plt.rc('ytick', labelsize=10) 
plt.rc('font', size=12)

In [2]:
ls

 Volume in drive C is OS
 Volume Serial Number is 3EA9-93A4

 Directory of C:\Users\rhash\Documents\Datasets\Recommender systems

09/15/2018  01:21 AM    <DIR>          .
09/15/2018  01:21 AM    <DIR>          ..
09/14/2018  08:05 PM    <DIR>          .ipynb_checkpoints
09/15/2018  01:15 AM            14,802 itemBased_CF (Movie Lens Data).ipynb
09/14/2018  09:42 AM    <DIR>          ml-100k
09/15/2018  01:21 AM            15,906 userBased_CF (Movie Lens Data).ipynb
               2 File(s)         30,708 bytes
               4 Dir(s)  390,187,143,168 bytes free


In [3]:
# Load Data set
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols)

# movies file contains columns indicating the movie's genres
# let's load only the first three columns of movie file with usecols
m_cols = ['movie_id', 'title', 'release_date']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(3), encoding='cp1250')

# merge the DataFrame
data = pd.merge(pd.merge(ratings, users), movies)
data = data[['user_id','title', 'movie_id','rating','release_date','sex','age']]


print("The dataset has "+ str(data.shape[0]) +" ratings")
print("The dataset has ", data.user_id.nunique()," users")
print("The dataset has ", data.movie_id.nunique(), " movies\n")
print(data.head(5))

The dataset has 100000 ratings
The dataset has  943  users
The dataset has  1682  movies

   user_id         title  movie_id  rating release_date sex  age
0      196  Kolya (1996)       242       3  24-Jan-1997   M   49
1      305  Kolya (1996)       242       5  24-Jan-1997   M   23
2        6  Kolya (1996)       242       4  24-Jan-1997   M   42
3      234  Kolya (1996)       242       4  24-Jan-1997   M   60
4       63  Kolya (1996)       242       3  24-Jan-1997   M   31


In [4]:
# dataframe with the ratings of the first user (user_id 6)
data_user_1 = data[data.user_id==6]

# dataframe with the ratings of the second user (user_id 18)
data_user_2 = data[data.user_id==18]

# We first compute the set of common movies
common_movies = set(data_user_1.movie_id) & set(data_user_2.movie_id)
print("\nNumber of common movies", len(common_movies),'\n')

# create the sub-dataframes restricted to the common movies
mask = (data_user_1.movie_id.isin(common_movies))
data_user_1 = data_user_1[mask]
print(data_user_1[['title','rating']].head(), '\n')

mask = (data_user_2.movie_id.isin(common_movies))
data_user_2 = data_user_2[mask]
print(data_user_2[['title','rating']].head())


Number of common movies 132 

                                    title  rating
2                            Kolya (1996)       4
885                Raising Arizona (1987)       5
1255  Truth About Cats & Dogs, The (1996)       2
1854          English Patient, The (1996)       2
2636                          Babe (1995)       4 

                                    title  rating
13                           Kolya (1996)       5
939                Raising Arizona (1987)       5
1303  Truth About Cats & Dogs, The (1996)       3
1906          English Patient, The (1996)       5
2676                          Babe (1995)       5


In [5]:
# Create the similarity functions:
from scipy.stats import pearsonr
from scipy.spatial.distance import euclidean

# Returns a distance-based similarity score for user 1 and user 2:
def SimEuclid(DataFrame, User1, User2, min_common_items=1):
    # GET MOVIES OF USER1
    movies_user1=DataFrame[DataFrame['user_id'] ==User1 ]
    # GET MOVIES OF USER2
    movies_user2=DataFrame[DataFrame['user_id'] ==User2 ]
    
    # FIND SHARED FILMS
    rep=pd.merge(movies_user1, movies_user2, on='movie_id')  
    
    if len(rep)==0:
        return 0
    if(len(rep)<min_common_items):
        return 0
    #return distEuclid(rep['rating_x'],rep['rating_y']) 
    return 1.0/(1.0+euclidean(rep['rating_x'],rep['rating_y'])) 

# Returns a Pearson-correlation-based similarity score for user 1 and user 2:
def SimPearson(DataFrame, User1, User2, min_common_items=1):
    # GET MOVIES OF USER1
    movies_user1=DataFrame[DataFrame['user_id'] ==User1 ]
    # GET MOVIES OF USER2
    movies_user2=DataFrame[DataFrame['user_id'] ==User2 ]
    
    # FIND SHARED FILMS
    rep=pd.merge(movies_user1, movies_user2, on='movie_id')
    
    # pearsonr is undefined for fewer than 2 shared movies
    if len(rep) < max(min_common_items, 2):
        return 0    
    res=pearsonr(rep['rating_x'],rep['rating_y'])[0]
    if(isnan(res)):
        return 0
    return res

# Returns a corrected Pearson-correlation-based similarity score for user 1 and user 2
# (the correlation is shrunk when fewer than pref_common_items movies are shared):
def SimPearson_Corrected(DataFrame, User1, User2, min_common_items=1, pref_common_items=20):
    # GET MOVIES OF USER1
    movies_user1=DataFrame[DataFrame['user_id']==User1]
    # GET MOVIES OF USER2
    movies_user2=DataFrame[DataFrame['user_id']==User2]
    
    # FIND SHARED FILMS
    rep=pd.merge(movies_user1, movies_user2, on='movie_id')
    # pearsonr is undefined for fewer than 2 shared movies
    if len(rep) < max(min_common_items, 2):
        return 0
    
    res=pearsonr(rep['rating_x'],rep['rating_y'])[0] * min(pref_common_items,len(rep))/pref_common_items
    if(isnan(res)):
        return 0
    return res
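
As a quick sanity check of the three scores, the following standalone sketch recomputes them by hand for two hypothetical users who share four movies. The rating values are made up, and the arithmetic is written out with plain Python instead of `scipy` so that each step is visible.

```python
import math

# Hypothetical ratings of two users on the 4 movies they both rated.
u = [5, 4, 1, 2]
v = [4, 5, 2, 1]

# Euclidean-based similarity: 1 / (1 + distance), always in (0, 1].
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
sim_euclid = 1.0 / (1.0 + dist)

# Pearson correlation over the shared movies.
mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
norm = math.sqrt(sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v))
sim_pearson = cov / norm

# Corrected Pearson: shrink the score when fewer than
# pref_common_items movies are shared (here 4 out of 20).
pref_common_items = 20
sim_corrected = sim_pearson * min(pref_common_items, len(u)) / pref_common_items

print(round(sim_euclid, 3), round(sim_pearson, 3), round(sim_corrected, 3))
# prints: 0.333 0.8 0.16
```

The correction leaves the score untouched once two users share at least `pref_common_items` movies, and otherwise scales it down in proportion to the overlap, so a high correlation based on very few movies counts for less.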

In [6]:
"""Object Oriented Programming, to:
    - compute the User-User similarity matrix for all Users, using one of similarity functions defined above,
    - predict the User rating on a new Item, using all User-User similarities. """

class userBased_CF:
    """ Collaborative filtering using a custom sim(u,u'). """
    
    def __init__(self, DataFrame, similarity=SimPearson, min_common_items=10, max_sim_users=10):
        """ Constructor """
        self.sim_method=similarity  # similarity function sim(u,u')
        self.df=DataFrame
        # placeholder; learn() replaces it with a nested dict of similarities
        self.sim = pd.DataFrame(0, columns=DataFrame.user_id.unique(), index=DataFrame.user_id.unique())
        self.min_common_items=min_common_items
        self.max_sim_users=max_sim_users

    def learn(self):
        """ Prepare data structures for estimation. compute User-User Similarity Matrix for all users """
        allUsers=set(self.df['user_id'])
        self.sim = {}
        for person1 in allUsers:
            self.sim.setdefault(person1, {})
            # keep only the ratings on movies that person1 has rated (use self.df, not a global)
            a=self.df[self.df['user_id']==person1][['movie_id']]
            data_reduced=pd.merge(self.df, a, on='movie_id')
            
            for person2 in allUsers:
                # don't compare the user with itself
                if person1==person2: 
                    continue
                self.sim.setdefault(person2, {})
                
                if (person1 in self.sim[person2]):
                    continue  # since correlation matrix is a symmetric matrix
                
                sim=self.sim_method(data_reduced, person1, person2, self.min_common_items)
                
                ### print person1, person2, sim
                
                if(sim<0):
                    self.sim[person1][person2]=0
                    self.sim[person2][person1]=0
                else:
                    self.sim[person1][person2]=sim
                    self.sim[person2][person1]=sim
                    
        # mean training rating of every user, used to centre the ratings in estimate()
        self.mean_ratings=self.df[['user_id','movie_id','rating']].groupby('user_id')['rating'].mean()
                
                
    def estimate(self, user_id, movie_id):
        
        """ Predict the rating of user_id on movie_id from the most similar users who rated it. """
        movie_users=self.df[self.df['movie_id'] ==movie_id]
        rating_num=0.0
        rating_den=0.0
        allUsers=set(movie_users['user_id'])
        listOrdered=sorted([(self.sim[user_id][other],other) for other in allUsers if user_id!=other],reverse=True)
        
        for item in range(min(len(listOrdered),self.max_sim_users)):
            other=listOrdered[item][1]
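            # rating of `other` on this movie, centred on that user's mean rating
            other_rating = movie_users.loc[movie_users['user_id']==other, 'rating'].iloc[0]
            rating_num += self.sim[user_id][other] * (other_rating - self.mean_ratings[other])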
            rating_den += self.sim[user_id][other]
        if rating_den==0: 
            if self.df.rating[self.df['movie_id']==movie_id].mean()>0:
                # return the mean movie rating if there is no similar for the computation
                return self.df.rating[self.df['movie_id']==movie_id].mean()
            else:
                # else return mean user rating 
                return self.df.rating[self.df['user_id']==user_id].mean()
        return self.mean_ratings[user_id]+rating_num/rating_den
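
The prediction rule in `estimate` is a mean-centred weighted average. The sketch below replays it on hypothetical numbers with plain dictionaries; the names `sim_to_u`, `rating_on_movie` and `predict` are illustrative and do not appear in the class above, and the fallback is simplified to the movie's mean rating.

```python
# Hypothetical inputs for predicting the rating of user "u" on one movie.
mean_ratings = {"u": 3.5, "a": 4.0, "b": 2.5, "c": 3.0}  # mean training rating per user
sim_to_u = {"a": 0.9, "b": 0.4, "c": 0.0}                # similarity of each neighbour to "u"
rating_on_movie = {"a": 5, "b": 2, "c": 4}               # neighbours' ratings of the movie

def predict(user, max_sim_users=2):
    """Mean-centred weighted average over the most similar raters."""
    # Rank the users who rated the movie by similarity and keep the top ones.
    neighbours = sorted(rating_on_movie, key=lambda v: sim_to_u[v], reverse=True)
    neighbours = neighbours[:max_sim_users]
    num = sum(sim_to_u[v] * (rating_on_movie[v] - mean_ratings[v]) for v in neighbours)
    den = sum(sim_to_u[v] for v in neighbours)
    if den == 0:
        # No informative neighbour: fall back to the movie's mean rating.
        return sum(rating_on_movie.values()) / len(rating_on_movie)
    return mean_ratings[user] + num / den

prediction = predict("u")  # 3.5 + (0.9 * 1.0 + 0.4 * -0.5) / 1.3
```

Here neighbour "a" rated the movie one point above their own average and "b" half a point below, so the prediction lands about 0.54 above the average rating of "u". Note that neither this sketch nor the class clips the result to the 1 to 5 rating scale.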

In [7]:
test_size=0.25

def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                  size=np.int64(np.ceil(df.index.size * test_size)),
                                  replace=False)
    # .ix was removed from pandas; .loc does the same label-based assignment
    df.loc[sampled_ids, 'for_testing'] = True
    return df

data['for_testing'] = False
grouped = data.groupby('user_id', group_keys=False).apply(assign_to_set)
data_train = data[grouped.for_testing == False]
data_test = data[grouped.for_testing == True]

#print(data_train.shape)
#print(data_test.shape)
#print(data_train.index & data_test.index, '\n')

print("Training data_set has "+ str(data_train.shape[0]) +" ratings")
print("Test data set has "+ str(data_test.shape[0]) +" ratings")
print("It has ", data.movie_id.nunique(), " movies")

Training data_set has 74646 ratings
Test data set has 25354 ratings
It has  1682  movies
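
The split above holds out a fraction of every user's ratings, so each user appears in both sets. A minimal standalone sketch of the same idea follows; the function name `split_per_user`, the seed and the toy data are assumptions made for illustration, and plain dictionaries of movie ids stand in for the DataFrame.

```python
import math
import random

def split_per_user(ratings_by_user, test_size=0.25, seed=0):
    """Hold out ceil(test_size * n) rated items of every user for testing."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = {}, {}
    for user, items in ratings_by_user.items():
        n_test = math.ceil(len(items) * test_size)
        held_out = set(rng.sample(sorted(items), n_test))
        test[user] = held_out
        train[user] = set(items) - held_out
    return train, test

# Hypothetical users who rated 20 and 22 movies (ids are made up).
toy = {1: set(range(20)), 2: set(range(100, 122))}
train, test = split_per_user(toy)
```

Because the test size is rounded up for each user separately (5 of 20 ratings, but 6 of 22), the test set can end up slightly larger than 25% of all ratings, which is consistent with the 25,354 test ratings reported above.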


In [8]:
# Define function to calculate RMSE
def  compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
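
A small worked example may help to read the RMSE values reported later. The sketch below uses a plain-Python stand-in named `rmse` (an illustrative helper, not the `compute_rmse` function above) on three made-up predictions.

```python
import math

def rmse(y_pred, y_true):
    """Root mean squared error of two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return math.sqrt(sum(squared) / len(squared))

# Errors are -1, 0 and 2, so the squared errors are 1, 0 and 4.
value = rmse([3, 4, 5], [4, 4, 3])  # sqrt(5 / 3), roughly 1.29
```

Squaring penalises the single 2-point miss more than two 1-point misses would be, so an RMSE close to 1 on a 1 to 5 scale means predictions are typically off by about one star.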

In [9]:
# Define function to evaluate Recommender System
def  evaluate (estimate_f, data_train, data_test):
    """ RMSE-based predictive performance evaluation. """
    ids_to_estimate = zip(data_test.user_id, data_test.movie_id)
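    # `u in Series` tests the index labels, so check membership against the user-id values instead
    train_users = set(data_train.user_id)
    # users absent from the training set get the neutral default rating 3
    estimated = np.array([estimate_f(u,i) if u in train_users else 3  for (u,i) in ids_to_estimate ])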
    
    real = data_test.rating.values
    return compute_rmse(estimated, real)

In [10]:
record=userBased_CF(data_train, similarity=SimPearson, min_common_items=3, max_sim_users=10)
record.learn()

In [11]:
print('RMSE for Collaborative Recommender (using User-User Correlation Similarity):  %s' % 
                                                                    evaluate(record.estimate, data_train, data_test))

RMSE for Collaborative Recommender (using User-User Correlation Similarity):  1.039392771277547


In [12]:
record_new=userBased_CF(data_train, similarity=SimPearson_Corrected, min_common_items=3, max_sim_users=10)
record_new.learn()

In [13]:
print('RMSE for Collaborative Recommender (using Corrected User-User Correlation Similarity):  %s' % 
                                                                    evaluate(record_new.estimate, data_train, data_test))

RMSE for Collaborative Recommender (using Corrected User-User Correlation Similarity):  1.027682960355351
