# Discussion
In the code below, I modified the class from the movie recommender assignment to use scikit-learn's NMF model. I found an RMSE value of 2.912, which is significantly higher than the baseline values in the week 3 assignment. The reason is that the rating matrix stores unrated movies as zeros, and NMF treats each zero as an actual rating of 0 rather than as a missing value. That means all of the zeros are fit along with the real ratings. Since there are lots of zeros, this pulls the predicted ratings down toward zero.
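
To make the effect concrete, here is a minimal sketch on a made-up 3x4 rating matrix (the values are illustrative and are not taken from the assignment data). It compares the mean over every cell, which counts unrated movies as 0, with the mean over rated cells only.

```python
import numpy as np

# Toy rating matrix (hypothetical values): rows are users, columns are movies,
# and 0 marks a movie the user has not rated.
ratings = np.array([
    [5, 0, 0, 4],
    [0, 3, 0, 0],
    [4, 0, 0, 0],
])

rated = ratings > 0                      # mask of entries that are real ratings
share_of_zeros = 1 - rated.mean()        # fraction of cells with no rating
mean_with_zeros = ratings.mean()         # zeros counted as if they were ratings
mean_rated_only = ratings[rated].mean()  # only the ratings users actually gave

print(f"Share of zeros:        {share_of_zeros:.2f}")
print(f"Mean including zeros:  {mean_with_zeros:.2f}")
print(f"Mean of rated entries: {mean_rated_only:.2f}")
```

A model that is fit to every cell is pushed toward the lower of the two means, which is consistent with the small values in `y_pred` in the code section.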

I think this could be fixed by filling the missing entries with a better value than zero. One suggestion would be to use a 3, since that is theoretically what an average user would rate a movie; the appendix tries this and the RMSE drops to 1.134. Other options that would make sense are to replace the zeros with each user's average rating or each movie's average rating.
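
The user-average and movie-average options could be sketched roughly as follows. The helper name `fill_missing` and the toy matrix are assumptions made for illustration; they are not part of the assignment code, and the fallback to the global mean for rows or columns with no ratings is one possible choice among several.

```python
import numpy as np

def fill_missing(ratings, strategy="user"):
    """Return a float copy of `ratings` with zeros replaced by a mean rating.

    strategy="user" uses each row's mean, any other value uses each column's
    mean. Means are taken over nonzero entries only. A row or column with no
    ratings falls back to the global mean of all rated entries.
    """
    filled = ratings.astype(float)               # astype returns a new array
    rated = ratings > 0
    global_mean = filled[rated].mean()
    axis = 1 if strategy == "user" else 0
    counts = rated.sum(axis=axis, keepdims=True)
    sums = filled.sum(axis=axis, keepdims=True)  # zeros add nothing to the sum
    means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    # Broadcast the per-row (or per-column) mean into the unrated cells only.
    return np.where(rated, filled, means)

# Hypothetical toy data: 0 marks an unrated movie; the third movie has no ratings.
toy = np.array([
    [5, 0, 0, 4],
    [0, 3, 0, 0],
    [4, 0, 0, 0],
])
by_user = fill_missing(toy, "user")
by_movie = fill_missing(toy, "movie")
print(by_user)
print(by_movie)
```

With the class in this notebook, a matrix filled this way could be assigned to `Mr` before calling `nmf()`, the same way the appendix does for the constant 3.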

Link to Jupyter notebook: https://github.com/highdeltav/Unsupervised-Learning-Week3

# Code

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx
from sklearn.decomposition import NMF

In [2]:
MV_users = pd.read_csv('movie_data/users.csv')
MV_movies = pd.read_csv('movie_data/movies.csv')
train = pd.read_csv('movie_data/train.csv')
test = pd.read_csv('movie_data/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [4]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allusers),len(self.allmovies)))  # predicted ratings, shape (#allusers,#allmovies)
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)),
                                   shape=(len(self.allusers),
                                          len(self.allmovies))).toarray())

    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        movie_index = self.mid2idx[mid]
        user_index = self.uid2idx[uid]

        # self.sim is indexed as [user, movie]
        return self.sim[user_index,movie_index]

        
    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        return np.array(self.data.test.apply(lambda x: self.predict_from_sim(x['uID'], x['mID']), axis = 1))
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 # Impute any NaN predictions with 3 (modifies yp in place)
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())
    
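    def nmf(self):
        """
        Factorize the rating matrix with NMF and store the reconstruction in self.sim
        """
        X = self.Mr
        ts = time.time()
        NMF_model = NMF(n_components = 10, max_iter = 1500)
        NMF_model.fit(X)
        te = time.time()
        print(f"Runtime of model: {te-ts}")

        W = NMF_model.transform(X)  # user factors, shape (#allusers, 10)
        H = NMF_model.components_   # movie factors, shape (10, #allmovies)
        self.sim = np.dot(W,H)      # predicted rating for every (user, movie) pair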
        

In [5]:
rec = RecSys(data)
rec.nmf()

Runtime of model: 12.982351541519165


In [6]:
y_pred = rec.predict()

In [7]:
y_pred

array([1.95395342, 0.86297973, 0.21207891, ..., 0.44894441, 0.17072982,
       0.24428557])

In [8]:
print(f"RMSE: {rec.rmse(y_pred)}")

RMSE: 2.9118191391872275


# Appendix
Bonus test to see what the RMSE would be if 3s replaced all of the zeros.

In [9]:
X_mod = rec.Mr.copy()


In [10]:
#Replace zeros with 3s
X_mod[X_mod == 0] = 3

In [11]:
rec_mod = RecSys(data)
rec_mod.Mr = X_mod
rec_mod.nmf()
y_pred1 =  rec_mod.predict()
print(f"RMSE: {rec_mod.rmse(y_pred1)}")

Runtime of model: 2.906679153442383

RMSE: 1.1341797357112018
