-------------------------------------------------------------------------------------------
# CS616-Assignment 2

-------------------------------------------------------------------------------------------
The goal of this assignment is to implement the various recommender system algorithms learnt in class. I implemented and evaluated recommender systems based on the following algorithms:
1. Matrix Factorisation Based 
2. Content-based Collaborative Filtering
3. Profile-based Collaborative Filtering
4. Hybrid (Mixture of Content and Profile based) 

The algorithms are implemented on the MovieLens dataset, which had the following information:
- Information about movies: 1682 movies cateogarised into 20 genres
- Information about users: The dataset contains information of IMDB users, such as their sex, age, postal code and their ratings for the above 1682 movies. 

The dataset was already split into multiple training and test sets. The first set (u1.base) was used to feed information into the algorithms, and u1.test to tune the hyperparameters related to each algorithm. The final results are shown on the u2.base and u2.test datasets.


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import scipy.stats as stats
from geopy.geocoders import Nominatim
from scipy.spatial.distance import pdist, squareform

## Read Data

In [2]:
userProfiles=pd.read_csv("./ml-100k/u.user",sep="|",header=None)
userProfiles=userProfiles.rename(columns={0:"id",1:"age",2:"sex",3:"occupation",4:"zip"})
print(userProfiles.head())

   id  age sex  occupation    zip
0   1   24   M  technician  85711
1   2   53   F       other  94043
2   3   23   M      writer  32067
3   4   24   M  technician  43537
4   5   33   F       other  15213


In [3]:
data=pd.read_csv("./ml-100k/u.item",sep="|",header=None,encoding = "ISO-8859-1")
data=data.rename(columns={0:"id",1:"title",2:"release",3:"vid_release",4:"URL"})
movies=pd.DataFrame(columns=["id","title","release","genre"])
for row,i in data.iterrows():
    movies.loc[row,:]=[i["id"],i["title"],i["release"],[i[j] for j in range(5,24)]]
print(movies.head())

  id              title      release  \
0  1   Toy Story (1995)  01-Jan-1995   
1  2   GoldenEye (1995)  01-Jan-1995   
2  3  Four Rooms (1995)  01-Jan-1995   
3  4  Get Shorty (1995)  01-Jan-1995   
4  5     Copycat (1995)  01-Jan-1995   

                                               genre  
0  [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
1  [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  
3  [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...  
4  [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...  


In [4]:
ratings=pd.read_csv("./ml-100k/u2.base",sep="\t",header=None)
ratings=ratings.rename(columns={0:"uid",1:"mid",2:"rating",3:"epoch"})
print(ratings.head())

   uid  mid  rating      epoch
0    1    3       4  878542960
1    1    4       3  876893119
2    1    5       3  889751712
3    1    6       5  887431973
4    1    7       4  875071561


In [5]:
rating_matrix=np.zeros(shape=(len(userProfiles),len(movies)))
for idx,row in ratings.iterrows():
    rating_matrix[row.uid-1][row.mid-1]=row.rating
print(rating_matrix[:5][:5])

[[0. 0. 4. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [4. 3. 0. ... 0. 0. 0.]]


In [6]:
test_ratings=pd.read_csv("./ml-100k/u2.test",sep="\t",header=None)
test_ratings=test_ratings.rename(columns={0:"uid",1:"mid",2:"rating",3:"epoch"})
print(test_ratings.head())

   uid  mid  rating      epoch
0    1    1       5  874965758
1    1    2       3  876893171
2    1    8       1  875072484
3    1    9       5  878543541
4    1   21       1  878542772


In [7]:
num_movies = len(movies)
num_users = len(userProfiles)

## Evaluation of the recommendations:
Since all of the algorithms that I tried return a score on a scale of 0 to 5, evaluation of the algorithm is a straightforward task.
Suppose $\hat{y}_i, i \in [1,n]$ denotes the predicted score returned for each item in the test data of length n, and ${y}_i, i \in [1,n]$ denotes the actual score. To evaluate the performance of the recommendor system algorithm, I made use of the following two metrics.
1. Mean Square Error (MSE): This is defined as follows:
$$ MSE = \frac{\sum^{n}_{1}( \hat{y}_i - y_i)^2}{n}$$
2. Pearson Correlation 

While the algorithms do try to return a predicted score for each movie, I found that the algorithms were mostly predicting values close to 3.5, since the values were a result of averaging over multiple related scores. Also, the goal of the recommender system is to output the movies that the user is likely to enjoy the most, and not to actually predict the scores. Thus just testing for a correlation between the predicted values and the actual ratings should also be a good indicator of how good the recommendations are. Thus, our second evaluation metric is the Pearson correlation between $\hat{y}_i$ and $y_i$. 

I found that both of these metrics were in tune with each other most of the times.


In [8]:
def evaluate_ratings(true_ratings,predicted_ratings):
    mse = np.sum((true_ratings - predicted_ratings)**2)/len(test_ratings)
    print("Mean Square Error: " + str(mse))
    pearson_corr = np.corrcoef(true_ratings,y = predicted_ratings)
    print("Pearson Correlation: " + str(pearson_corr[0][1]))

## Approach 1: Matrix Factorisation
In this approach, I factorise our ratings matrix into two matrices of smaller dimension, which shall represent the user feature matrix and the movie feature matrix. The idea behind this is that since the original matrix can be reconstructed by the dot product of the smaller matrices, I can use gradient descent on the individual matrices to get a fairy good approximation of the missing values in the original matrix.

Note that the number of latent features to be considered, the learning rate (alpha) and the regularisation factor (beta) are hyperparameters in this approach.

The loss function taken is the regularised square error -
     $$ \textrm{Error}= \sum^{n}_{1} (\textrm{actual rating} - \textrm{predicted rating})^2+ L^2\textrm{-Regularisation Term} $$

This was taken according to the most recent approaches in the SVD algorithm 
  

In [10]:
# Matrix Factorisation 
def fact(rating,userFeature,movieFeature,latent,steps,alpha,beta):
    movieFeature=movieFeature.T
    for s in range(steps):
        for i in range(len(rating)):
            for j in range(len(rating[i])):
                if(rating[i][j]>0):
                    err=rating[i][j]-np.dot(userFeature[i,:],movieFeature[:,j])
                    for k in range(latent):
                        # Calculating Gradient
                        userFeature[i][k]=userFeature[i][k]+alpha*(2*err*movieFeature[k][j]-beta*userFeature[i][k])
                        movieFeature[k][j]=movieFeature[k][j]+alpha*(2*err*userFeature[i][k]-beta*movieFeature[k][j])
        #erating=np.dot(userFeature,movieFeature)
        e=0
        for i in range(len(rating)):
            for j in range(len(rating[i])):
                if(rating[i][j]>0):
                    e=e+pow(rating[i][j]-np.dot(userFeature[i,:],movieFeature[:,j]),2)
                    for k in range(latent):
                        e=e+(beta/2)*(pow(userFeature[i][k],2)+pow(movieFeature[k][j],2))
        if(e<0.1):
            break
    return userFeature,movieFeature.T

In [11]:
#rating_matrix=np.array(rating_matrix)
(n,m) = rating_matrix.shape
latent=19
userFeatures=np.random.rand(n,latent)
movieFeatures=np.random.rand(m,latent)
nU,nM=fact(rating_matrix,userFeatures,movieFeatures,latent,100,0.001,0.01)
nR=np.dot(nU,nM.T)

In [12]:
# Predicting Ratings
def fact_predicted_score(u_id,m_id):
    if(rating_matrix[u_id-1,m_id-1]!=0): # should not occur if train and test data are disjoint
        print("Predicted: ",nR[u_id-1,m_id])
        return "Already seen, rated :"+str(rating_matrix[u_id-1,m_id-1])
    else:
        return nR[u_id-1,m_id]
    
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] = fact_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'])

In [15]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 1.5194176531823604
Pearson Correlation: 0.2787804333910821


## Approach 2: Item-based recommendations

In this approach, I made use of the similarity between movies to predict a user's rating for a movie that they have not seen. First, I computed the similarity between movies. I defined the similarity between movie $X$ and movie $Y$ as: 

$$sim(X,Y) = adjusted\_cosine\_similarity(user\_ratings(X),user\_ratings(Y)$$

Thus I get a similarity matrix, where the $(i,j)^{th}$ element indicates the similarity between movie $i$ and movie $j$. Now, to get the predicted score for movie $M$ and user $U$, I used the function as defined in class:
$$pred(U,M) = mean(ratings(U,:)) + \frac{\sum_{Z\in N} sim(M,Z)*(rating(U,Z)-mean(ratings(U,:))}{\sum_{Z\in N}sim(M,Z)}$$
Here, I tried defining N in two ways:
1. Threshold based: $ N = \{z: similarity(z,M)>\lambda, z \in \hat{M}\} $, where $\lambda$ is a user defined threshold and $\hat{M}$ is the set of all movies. 
2. Neighbors based: The set of n-most similar movies to movie $M$, where n is a user defined number 

Incase ${\sum_{Z\in N}sim(M,Z)} = 0,$ I just set $pred(U,M) = mean(ratings(U,:))$


In [12]:
#Computes similarity between movie X and Y using adjusted cosine similarity
def item_cosine_similarity(X,Y):
    X_mean = np.mean(X)
    Y_mean = np.mean(Y)
    if(X_mean==0 or Y_mean==0):
        return [[0]]
    X_adj = X-X_mean
    Y_adj = Y-Y_mean
    similarity = pairwise.cosine_similarity(X_adj.reshape(1,-1),Y_adj.reshape(1,-1))
    return similarity
#Returns the predicted score for movie i for a user j
def item_based_predicted_score(user_id,movie_id,similarity_matrix,rating_matrix,n_neighbors = 10,type = 'threshold'):
    score=0
    count=0
    sim_sum = 0
    l=[]
    if(rating_matrix[user_id-1,movie_id-1]!=0):
        ## Movie is already watched
        return -1
    else:
        if(type=='neighbors'):
            top_neighbors = np.flip(np.argsort(similarity_matrix[:,movie_id-1]))
            for i in top_neighbors:
                if(rating_matrix[user_id-1,i]!=0):
                    mean = np.mean(rating_matrix[:,i])
                    l.append(i)
                    score+=similarity_matrix[movie_id-1][i]*(rating_matrix[user_id-1,i] - mean)
                    sim_sum+=similarity_matrix[movie_id-1][i]
                    count+=1
                    if(count==n_neighbors):
                        break
        if(type == 'threshold'):
            for i in range(len(rating_matrix[user_id-1])):
                if(rating_matrix[user_id-1,i]!=0 and similarity_matrix[movie_id-1][i]>0):
                    mean = np.mean(rating_matrix[:,i])
                    l.append(i)
                    score+=similarity_matrix[movie_id-1][i]*(rating_matrix[user_id-1,i] - mean)
                    sim_sum+=similarity_matrix[movie_id-1][i]
                    count+=1
        if(count==0):
            return 0
        if(sim_sum==0):
            predicted_rating = np.mean(rating_matrix[:,i])
        else:
            predicted_rating = np.mean(rating_matrix[:,i]) + score/sim_sum
        return predicted_rating

In [14]:
similarity_matrix_item=cosine_similarity(rating_matrix.T)
print(similarity_matrix_item)

[[1.         0.33518385 0.2457249  ... 0.         0.05290709 0.05290709]
 [0.33518385 1.         0.20116515 ... 0.         0.08522865 0.08522865]
 [0.2457249  0.20116515 1.         ... 0.         0.         0.11420805]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.05290709 0.08522865 0.         ... 0.         1.         0.        ]
 [0.05290709 0.08522865 0.11420805 ... 0.         0.         1.        ]]


In [22]:
#Computes the similarity matrix
similarity_matrix_item=cosine_similarity(rating_matrix.T)
print(similarity_matrix_item)

[[ 1.          0.18559869  0.12886179 ... -0.02440874  0.04173334
   0.04173334]
 [ 0.18559869  1.          0.13326154 ... -0.01118631  0.07897345
   0.07897345]
 [ 0.12886179  0.13326154  1.         ... -0.0081614  -0.0081614
   0.10963762]
 ...
 [-0.02440874 -0.01118631 -0.0081614  ...  1.         -0.00106157
  -0.00106157]
 [ 0.04173334  0.07897345 -0.0081614  ... -0.00106157  1.
  -0.00106157]
 [ 0.04173334  0.07897345  0.10963762 ... -0.00106157 -0.00106157
   1.        ]]


On the validation set u1, I found that the neighbors method works better with $n = 8$

In [15]:
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] =  item_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_item,rating_matrix,n_neighbors = 8,type = 'neighbors')

In [18]:
def rmse(y_true, y_pred):
    # Ensure that y_true and y_pred are NumPy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    # Calculate the squared differences
    squared_diff = (y_true - y_pred) ** 2
    
    # Calculate the mean of the squared differences
    mean_squared_diff = np.mean(squared_diff)
    
    # Calculate the square root to get RMSE
    rmse_value = np.sqrt(mean_squared_diff)
    
    return rmse_value


In [20]:
rmse(test_ratings['rating'],predicted_ratings)

1.0468753129757744

In [24]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 1.0280864092559379
Pearson Correlation: 0.48620088979496545


I also have some more information available about the movies, which is the genres that they belong to. This can be useful information in computing the similarity between movies.

In [25]:
movie_columns = ["movie_id", "title", "release_date", "video_release_date", "IMDb_URL"]
genre_columns = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary",
    "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller",
    "War", "Western"
]
data = pd.read_csv("./ml-100k/u.item", sep="|", header=None, encoding="ISO-8859-1")
data.columns = ["movie_id"] + movie_columns[1:] + genre_columns
data.drop(columns = 'unknown',inplace = True)
data.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,IMDb_URL,Action,Adventure,Animation,Children's,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


I first compute the tf-idf scores, and then calculate the adjusted cosine similarity between the movies only based on the genre information

In [26]:
data['genres_list'] = data.apply(lambda row: ','.join(row[6:].index[row[6:] == 1]),axis=1)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['genres_list'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()

Unnamed: 0,adventure,animation,children,comedy,crime,documentary,drama,fantasy,fi,film,horror,musical,mystery,noir,romance,sci,thriller,war,western
0,0.0,0.74066,0.573872,0.349419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.771538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.636183,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.7672,0.0,0.0,0.641408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.735504,0.0,0.363186,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.571953,0.0,0.0


In [27]:
similarity_matrix_item_genre=np.zeros(shape=(num_movies,num_movies))
for i in range(num_movies):
    for j in range(num_movies):
        similarity_matrix_item_genre[i,j]=item_cosine_similarity(np.array(tfidf_df)[i,:],np.array(tfidf_df)[j,:])[0][0]
print(similarity_matrix_item_genre)

[[ 1.         -0.1409362  -0.09734822 ... -0.13778894  0.29105884
  -0.09734822]
 [-0.1409362   1.          0.61019237 ... -0.11384354 -0.08043074
  -0.08043074]
 [-0.09734822  0.61019237  1.         ... -0.07863463 -0.05555556
  -0.05555556]
 ...
 [-0.13778894 -0.11384354 -0.07863463 ...  1.         -0.07863463
   0.49967011]
 [ 0.29105884 -0.08043074 -0.05555556 ... -0.07863463  1.
  -0.05555556]
 [-0.09734822 -0.08043074 -0.05555556 ...  0.49967011 -0.05555556
   1.        ]]


To get the final predictions, I use a weighted sum of the predictions obtained based on user and movie ratings, and the predictions obtained based on genre information. I found that a weight of 0.8 for ratings based predictions gave the best results.

In [28]:
weight = 0.8
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] =  weight*item_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_item ,rating_matrix,n_neighbors = 8,type = 'neighbors') + (1-weight)*item_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_item_genre ,rating_matrix,type = 'threshold')

There is only a slight increase in accuracy

In [29]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 0.9757337468455628
Pearson Correlation: 0.4985341576258008


## Approach 3: Profile Based recommendations
In this approach, I made use of the similarity between users to predict a user's rating for a movie that they have not seen. First, I computed the similarity between users. I defined the similarity between user $X$ and user $Y$ as: 

$$sim(X,Y) = adjusted\_pearson\_correlation(movie\_ratings(X),movie\_ratings(Y)$$

Thus I get a similarity matrix, where the $(i,j)^{th}$ element indicates the similarity between user $i$ and user $j$. Now, to get the predicted score for movie $M$ and user $U$, I used the function as defined in class:
$$pred(U,M) = mean(ratings(:,M)) + \frac{\sum_{Z\in N} sim(U,Z)*(rating(Z,M)-mean(ratings(:,M))}{\sum_{Z\in N}sim(U,Z)}$$
Here, I tried defining N in two ways:
1. Threshold based: $ N = \{z: similarity(U,z)>\lambda, z \in \hat{U}\} $, where $\lambda$ is a user defined threshold and $\hat{U}$ is the set of all users. 
2. Neighbors based: The set of n-most similar users to user $U$, where n is a user defined number 

Incase ${\sum_{Z\in N}sim(U,Z)} = 0,$ I just set $pred(U,M) = mean(ratings(:,M))$

In [30]:
## Computes adjusted pearson correlation between X and Y
def user_pearson_similarity(X,Y):
    X_mean = np.mean(X)
    Y_mean = np.mean(Y)
    if(X_mean==0 or Y_mean==0):
        return [[0],[0]]
    X_adj = X-X_mean
    Y_adj = Y-Y_mean
    similarity = np.corrcoef(X_adj,Y_adj)
    return similarity
## Computes the predicted score for given user U and movie M
def user_based_predicted_score(user_id,movie_id,similarity_matrix,rating_matrix,n_neighbors = 10,type = 'threshold',threshold = 0):
    score=0
    count=0
    sim_sum = 0
    l=[]
    if(rating_matrix[user_id-1,movie_id-1]!=0):
        return -1
    else:
        if(type=='neighbors'):
            top_neighbors = np.flip(np.argsort(similarity_matrix[user_id-1,:]))
            for i in top_neighbors:
                if(rating_matrix[i,movie_id-1]!=0):
                    mean = np.mean(rating_matrix[i,:])
                    l.append(i)
                    score+=similarity_matrix[user_id-1][i]*(rating_matrix[i,movie_id-1] - mean)
                    sim_sum+=similarity_matrix[user_id-1][i]
                    count+=1
                    if(count==n_neighbors):
                        break
        if(type == 'threshold'):
            for i in range(len(rating_matrix[:,movie_id-1])):
                if(rating_matrix[i,movie_id-1]!=0 and similarity_matrix[user_id-1][i]>threshold):
                    mean = np.mean(rating_matrix[i,:])
                    l.append(i)
                    score+=similarity_matrix[user_id-1][i]*(rating_matrix[i,movie_id-1] - mean)
                    sim_sum+=similarity_matrix[user_id-1][i]
                    count+=1
        if(count==0):
            return 0
        if(sim_sum==0):
            predicted_rating = np.mean(rating_matrix[:,i])
        else:
            predicted_rating = np.mean(rating_matrix[:,i]) + score/sim_sum
        return predicted_rating

In [31]:
similarity_matrix_user=np.zeros(shape=(num_users,num_users))
for i in range(num_users):
    for j in range(num_users):
        similarity_matrix_user[i,j]=user_pearson_similarity(rating_matrix[i,:],rating_matrix[j,:])[0][1]
print(similarity_matrix_user)

[[ 1.          0.05985147 -0.03696226 ...  0.06026923  0.09672882
   0.29232671]
 [ 0.05985147  1.          0.06717595 ...  0.08067685  0.13000657
   0.05524174]
 [-0.03696226  0.06717595  1.         ...  0.03121862  0.09862254
  -0.02759119]
 ...
 [ 0.06026923  0.08067685  0.03121862 ...  1.          0.08012962
   0.06529963]
 [ 0.09672882  0.13000657  0.09862254 ...  0.08012962  1.
   0.12773543]
 [ 0.29232671  0.05524174 -0.02759119 ...  0.06529963  0.12773543
   1.        ]]


In [32]:
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] =  user_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_user,rating_matrix,type = 'threshold',threshold = 0)

In [33]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 1.117923616944161
Pearson Correlation: 0.42758331685477824


I also have some more information available regarding the users, which is the demographic information. The demographic information can play a crucial role in computing the similarity between users. 

The user profile data had the zip code for each user. Geographical location greatly impacts culture, which in turn impacts what kind of movies people enjoy. Thus this information is important, and can be extracted from the zip code. I used the library geopy to do that:

In [34]:
geolocator = Nominatim(user_agent="geoapiExercises")
userProfiles['country'] = "NA"
userProfiles['state'] = "NA"
for i in range(len(userProfiles)):
    try:
        location = geolocator.geocode(userProfiles.loc[i,'zip'])
        userProfiles.loc[i,'country'] = location[0].split(", ")[-1]
    except:
        continue

I filtered out countries that had less than 5 users belonging to them

In [35]:
less_info = userProfiles['country'].value_counts()[(userProfiles['country'].value_counts() < 5)].index
userProfiles.loc[np.isin(userProfiles['country'],np.array(less_info)),'country'] = "NA"

In [36]:
userProfiles['country'] = userProfiles['country'].str.replace(' ', '_')
userProfiles['country'] = userProfiles['country'].str.replace('/', '_')

In [37]:
userProfiles_dummy = pd.get_dummies(userProfiles,columns = ['country','occupation','sex'])
userProfiles_dummy.drop(columns = ['country_NA','occupation_other'],inplace = True)

In [38]:
userProfiles_dummy.drop(columns = ['state','zip','age','id'],inplace = True)

I obtained the tf-idf representation for the demographics, including the country information and the sex information, and then obtained the similarity matrix using adjusted pearson correlation

In [39]:
userProfiles_dummy['demographics_list'] = userProfiles_dummy.apply(lambda row: ','.join(row[6:].index[row[6:] == 1]), axis=1)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(userProfiles_dummy['demographics_list'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()

Unnamed: 0,country_hrvatska,country_indonesia,country_italia,country_lietuva,country_maroc_ⵍⵎⵖⵔⵉⴱ_المغرب,country_méxico,country_polska,country_slovensko,country_suomi___finland,country_sverige,...,occupation_none,occupation_programmer,occupation_retired,occupation_salesman,occupation_scientist,occupation_student,occupation_technician,occupation_writer,sex_f,sex_m
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.880031,0.0,0.0,0.261278
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.379583,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.582284,0.0,0.194219
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.880031,0.0,0.0,0.261278
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.49007,0.0


In [40]:
similarity_matrix_user_demo=np.zeros(shape=(num_users,num_users))
for i in range(num_users):
    for j in range(num_users):
        similarity_matrix_user_demo[i,j]=user_pearson_similarity(np.array(tfidf_df)[i,:],np.array(tfidf_df)[j,:])[0][1]
print(similarity_matrix_user_demo)

[[ 1.         -0.05428283 -0.01172943 ...  0.07164157 -0.07076029
   0.2794357 ]
 [-0.05428283  1.          0.71615281 ... -0.04728051  0.08404554
  -0.05963554]
 [-0.01172943  0.71615281  1.         ...  0.03784126 -0.0721369
   0.00661615]
 ...
 [ 0.07164157 -0.04728051  0.03784126 ...  1.         -0.06163243
   0.80813699]
 [-0.07076029  0.08404554 -0.0721369  ... -0.06163243  1.
  -0.07773779]
 [ 0.2794357  -0.05963554  0.00661615 ...  0.80813699 -0.07773779
   1.        ]]


I then took a weighted sum of the predictions obtained using the ratings information and the predictions obtained using the demographics information. I found that a weight of 0.9 on the predictions obtained using the ratings gave the best results, although the increase in accuracy is small

In [41]:
weight = 0.9
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] =  weight*user_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_user,rating_matrix,type = 'threshold') + (1-weight)*user_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_user_demo ,rating_matrix,type = 'threshold')

In [42]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 1.115142160603081
Pearson Correlation: 0.4287607951752191


## Approach 4: Hybrid: Combining predictions from user-based and profile-based approaches
To combine the predictions, I simply took a weighted sum of the predictions from the profile-based algorithm and the content-based algorithm. I found that a weight of 0.5 on the user-based algorithm gives the best results. 
In a way, the weight on the user based predictions can be adjusted to account for Serendipity. The higher this weight is, the better will be the Serendipity of the algorithm, since more weightage will be given to the user based approach

In [43]:
user_weight = 0.5
predicted_ratings = np.zeros(len(test_ratings))
for i in range(len(test_ratings)):
    predicted_ratings[i] =  user_weight*user_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_user,rating_matrix,type = 'threshold',threshold = 0) + (1-user_weight)*item_based_predicted_score(test_ratings.loc[i,'uid'],test_ratings.loc[i,'mid'],similarity_matrix_item,rating_matrix,n_neighbors = 8,type = 'neighbors')

In [44]:
evaluate_ratings(test_ratings['rating'],predicted_ratings)

Mean Square Error: 0.894633334321835
Pearson Correlation: 0.5575564146121965


## Summary of methods
<div align="center">
    
| Method    | RMSE | Correlation |
| :-----------: | :-----------: | :-----------: |
| Matrix Factorisation Based      | 1.529       | 0.280 |
| Item based (ratings)  | 1.028        | 0.486 |
| Item-based with genre information  | 0.976      | 0.498 |
| Profile Based CF (ratings)  | 1.118       | 0.427 |
| Profile Based CF with demographics information   | 1.115        | 0. 429|
| Hybrid   | 0.895       | 0.558 |
    
</div>

As expected, the more information that I feed the system, the better results I get, and therefore the hybrid algor

## Final Recommender System

In [45]:
user_id = 1

In [46]:
s = 0.5
data['predicted_ratings'] = -1
for i in range(len(data)):
    data.loc[i,'predicted_ratings'] =  s*(0.9*user_based_predicted_score(user_id,i+1,similarity_matrix_user,rating_matrix,type = 'threshold') + 0.1*user_based_predicted_score(user_id,i+1,similarity_matrix_user_demo ,rating_matrix,type = 'threshold'))+(1-s)*(0.8*item_based_predicted_score(user_id,i+1,similarity_matrix_item ,rating_matrix,n_neighbors = 8,type = 'neighbors') + 0.2*item_based_predicted_score(user_id,i+1,similarity_matrix_item_genre ,rating_matrix,type = 'threshold'))

In [47]:
print("Recommended Movies:")
data.sort_values('predicted_ratings',ascending = False)['title'][0:10].reset_index(drop = True)

Recommended Movies:


0                                     Boot, Das (1981)
1                       They Made Me a Criminal (1939)
2                                 Aiqing wansui (1994)
3                                         Faust (1994)
4               World of Apu, The (Apur Sansar) (1959)
5                                   Little City (1998)
6    Wonderful, Horrible Life of Leni Riefenstahl, ...
7                                 Kaspar Hauser (1993)
8                               Golden Earrings (1947)
9           Marlene Dietrich: Shadow and Light (1996) 
Name: title, dtype: object

In [48]:
genres_wanted = ["Comedy","Horror"]
genres_not_wanted = ["Musical"]

In [49]:
sub1 = (data.loc[:,genres_wanted] == 1).all(axis=1)
sub2 = (data.loc[:,genres_not_wanted] == 0).all(axis=1)
sub3 = (data.loc[:, 'predicted_ratings'] != -1)

In [50]:
print("Recommended Movies, in the specific genres:")
data[sub1*sub2*sub3].sort_values('predicted_ratings',ascending = False)['title'][0:10].reset_index(drop = True)

Recommended Movies, in the specific genres:


0                              Bad Taste (1987)
1            Dracula: Dead and Loving It (1995)
2    Cemetery Man (Dellamorte Dellamore) (1994)
3                              Braindead (1992)
4                             Serial Mom (1994)
5                           Howling, The (1981)
6                    Tales from the Hood (1995)
7                       April Fool's Day (1986)
8                           Machine, The (1994)
Name: title, dtype: object