--------
# Recommender systems with NMF
--------

Overview: This is the final part of the two-part project using NMF. There are no guidelines on how to compute the RMSE on the test set, so I will import the class from the Week 3 homework that constructs the ratings matrix and factor that matrix using NMF. I will then reconstruct the user-movie preferences with NMF.inverse_transform(). Finally, I will adapt the rmse() method of RecSys() to obtain the RMSE. The ratings matrix is stored in rs.Mr.

Let's Begin!


In [1]:
#Importing libraries
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix
from collections import namedtuple
from sklearn.decomposition import NMF
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
#loading data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
movie_users = pd.read_csv("users.csv")
movies = pd.read_csv("movies.csv")

# Exploratory Data Analysis

In [3]:
train.head()

Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


In [4]:
test.head()

Unnamed: 0,uID,mID,rating
0,2233,440,4
1,4274,587,5
2,2498,454,3
3,2868,2336,5
4,1636,2686,5


In [5]:
movie_users.head()

Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [6]:
movies.head()

Unnamed: 0,mID,title,year,Doc,Com,Hor,Adv,Wes,Dra,Ani,...,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,Toy Story,1995,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,3,Grumpier Old Men,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(movie_users, movies, train, test)

class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        # Use the training data held by this instance rather than the global `train`
        rating_train = list(self.data.train.rating)
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())


    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        ### BEGIN SOLUTION
        return np.ones(len(self.data.test))*3
        ### END SOLUTION
        
    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (#users,)
        """
        ### BEGIN SOLUTION
        useravg = self.Mr.sum(axis=1)/(self.Mr>0).sum(axis=1)
        return useravg[[self.uid2idx[x] for x in self.data.test.uID]]
    
    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        ### BEGIN SOLUTION
        uf = self.Mr[self.uid2idx[uid]]
        mf = self.sim[self.mid2idx[mid]]
        return np.dot(uf,mf)/np.dot(mf,uf>0)
    
    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        ### BEGIN SOLUTION
        yp=[]
        for i in range(len(self.data.test)):
            x = self.data.test.iloc[i]
            mid=x.mID
            uid=x.uID
            yp.append(self.predict_from_sim(uid,mid))
        return np.array(yp)
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

In [8]:
# Perform NMF on ratings matrix, may take some time...
rs = RecSys(data)
ratingsMatrix = rs.Mr
model = NMF(n_components = 18, random_state = 42, init="nndsvda", solver="mu", beta_loss="kullback-leibler", max_iter=1000).fit(ratingsMatrix)
W = model.transform(ratingsMatrix)
H = model.components_

In [9]:
# Reconstruct user data as predictions from NMF
X = model.inverse_transform(W)
X.shape

(6040, 3883)
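
As a small sanity check on what the reconstruction step computes, the sketch below fits NMF to a made-up 4x5 ratings matrix and confirms that `inverse_transform(W)` is the matrix product of `W` and `components_`. The values and the names `R_toy`, `toy_model`, `W_toy`, `H_toy`, and `X_toy` are illustrative assumptions, not part of the project data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy ratings matrix: rows are users, columns are movies, 0 = unrated.
R_toy = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

# Same solver settings as the notebook, but with only 2 components for the toy data.
toy_model = NMF(n_components=2, init="nndsvda", solver="mu",
                beta_loss="kullback-leibler", max_iter=1000, random_state=42)
W_toy = toy_model.fit_transform(R_toy)      # user factors, shape (4, 2)
H_toy = toy_model.components_               # movie factors, shape (2, 5)

# Reconstruct the full matrix and compare it with the explicit product W @ H.
X_toy = toy_model.inverse_transform(W_toy)
same_as_product = np.allclose(X_toy, W_toy @ H_toy)
print(X_toy.shape, same_as_product)
```

Every cell of the reconstruction gets a value, including the cells that were 0 (unrated) in the input, which is why it can be indexed for any user-movie pair in the test set.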

In [10]:
# Adapt the predict method of RecSys() to make predictions from the reconstructed user data, rather than using the baseline / imputation methods.
yhat = []
n_test = len(rs.data.test)
for i in range(n_test):
    x = rs.data.test.iloc[i]
    mid = x.mID
    uid = x.uID
    yhat.append(X[rs.uid2idx[uid],rs.mid2idx[mid]])

In [11]:
# Adapt the rmse method of RecSys()
yhat = np.asarray(yhat)
yhat[np.isnan(yhat)] = 3 
labs = np.array(rs.data.test.rating)
RMSE = np.sqrt(((labs-yhat)**2).mean())

print("The RMSE of the predictions made using NMF was:", RMSE)

The RMSE of the predictions made using NMF was: 2.8850867946900713
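
The lookup loop and the RMSE computation above can also be written with NumPy fancy indexing. The following is a minimal sketch on made-up arrays; `X_small`, `user_idx`, `movie_idx`, and `y_true` are hypothetical stand-ins for the reconstructed matrix, the mapped test indices, and the test ratings.

```python
import numpy as np

# Hypothetical reconstructed matrix (2 users x 3 movies).
X_small = np.array([[4.0, 0.5, 3.0],
                    [1.0, 5.0, 2.0]])

# Hypothetical test set: row/column indices of each (user, movie) pair and the true ratings.
user_idx = np.array([0, 1, 1])
movie_idx = np.array([2, 0, 1])
y_true = np.array([3.0, 2.0, 4.0])

# Pick one predicted rating per test row in a single indexing operation.
y_pred = X_small[user_idx, movie_idx]

# Replace NaN predictions with 3, mirroring the rmse() method, without mutating the input.
y_pred = np.where(np.isnan(y_pred), 3.0, y_pred)

rmse_small = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_small)
```

In the notebook, the index arrays could presumably be built with something like `rs.data.test.uID.map(rs.uid2idx).to_numpy()` and the equivalent for `mID`; that mapping step is an assumption about how to adapt the sketch, not code from the original homework.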


------
# Conclusion
------

The best collaborative filtering models from the Week 3 homework had an RMSE below 1. Even the worst model, which simply predicted each missing rating as the user's mean rating, had a much better RMSE. An RMSE of 2.885 is therefore poor. The primary issues are:

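1) The data are far too sparse. Unrated cells are stored as 0 in rs.Mr, and NMF fits those zeros as if they were real ratings, so the matrix is difficult to factor with approximation methods without introducing large errors, and reconstructed values for unseen user-movie pairs are likely pulled toward 0.

2) Ratings are small integers (1 to 5), so noise is large relative to the signal in each cell.

3) I used KL loss because it is much better at dealing with a matrix overpopulated with 0's. However, KL loss requires solver = "mu", and the multiplicative-update solver cannot update zeros present in the initialization. It therefore works poorly with init = "nndsvd", which is the init sklearn recommends for sparseness, so I used init = "nndsvda" instead.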
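
To put a number on the sparsity mentioned in the first point, one can measure the fraction of cells that actually hold a rating. The sketch below uses a made-up matrix; `density` is a hypothetical helper and `M_toy` is illustrative data, not the project's ratings matrix.

```python
import numpy as np

def density(M):
    """Return the fraction of cells in M that hold a nonzero (observed) rating."""
    return np.count_nonzero(M) / M.size

# Hypothetical 2 x 4 ratings matrix with 3 observed ratings out of 8 cells.
M_toy = np.array([[5, 0, 0, 3],
                  [0, 0, 4, 0]])

toy_density = density(M_toy)
print(f"observed fraction: {toy_density:.3f}")
```

Calling `density(rs.Mr)` in the notebook would give the corresponding figure for the real (6040, 3883) matrix.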

One way to improve performance might be to reduce the dimensionality of the matrix before factoring, for example with sklearn's TruncatedSVD or even PCA.