## Part 2: NMF on Movie Data

**1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library.** 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from scipy.sparse import coo_matrix, csr_matrix
from collections import namedtuple
from sklearn.decomposition import NMF

In [2]:
train = pd.read_csv("C:/Users/venuk/Desktop/dtsa-5510/Mini/MovieData/train.csv")
test = pd.read_csv("C:/Users/venuk/Desktop/dtsa-5510/Mini/MovieData/test.csv")
MV_users = pd.read_csv("C:/Users/venuk/Desktop/dtsa-5510/Mini/MovieData/users.csv")
MV_movies = pd.read_csv("C:/Users/venuk/Desktop/dtsa-5510/Mini/MovieData/movies.csv")

In [3]:
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [4]:
#This RecSys Class is from the HW3 assignment
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 # Replace any NaN predictions with a rating of 3.
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

In [5]:
rec = RecSys(data)

#Getting the rating matrix of the train data
tr_matrix = rec.Mr

#Initialize the nmf model
nmf_model = NMF(n_components = 20, init  = "nndsvda", max_iter = 1000, random_state = 42)

#W Matrix
user_features = nmf_model.fit_transform(tr_matrix)

#H Matrix
movie_features = nmf_model.components_

In [9]:
print("W-matrix:", user_features)
print("H-matrix:", movie_features)

W-matrix: [[0.01570787 0.         0.         ... 0.         0.         0.        ]
 [0.16809181 0.         0.         ... 0.19401762 0.46401632 0.07002243]
 [0.         0.         0.         ... 0.22259054 0.00480161 0.        ]
 ...
 [0.00316293 0.         0.00107228 ... 0.         0.         0.00885329]
 [0.         0.48581206 0.         ... 0.         0.         0.20872088]
 [0.15615691 0.22701272 0.41014161 ... 0.00595943 0.01966774 0.        ]]
H-matrix: [[0.         0.34646931 0.         ... 0.13803215 0.         0.15213657]
 [0.         0.         0.         ... 0.08414881 0.         0.        ]
 [0.         0.         0.         ... 0.1151299  0.20674001 0.89444953]
 ...
 [0.10626587 0.         0.         ... 0.         0.         0.        ]
 [2.59385428 0.17258454 0.05412258 ... 0.         0.         0.        ]
 [0.         0.         0.49137561 ... 0.         0.         0.        ]]


In [6]:
# Adjusting the predict function for NMF model
# Reference https://stackoverflow.com/questions/49341132/using-nmf-for-generating-recommendations
def predict_ratings(W, H):
    uid = [rec.uid2idx[u_id] for u_id in data.test['uID']]
    mid = [rec.mid2idx[m_id] for m_id in data.test['mID']]
    predicted_ratings = np.zeros(len(uid))
    for i in range(len(uid)):
        predicted_ratings[i] = np.dot(W[uid[i]], H.T[mid[i]])
    return predicted_ratings

In [7]:
#Predict ratings for test data
r_pred  = predict_ratings(user_features, movie_features)

#Calculate the RMSE score
rec.rmse(r_pred)

2.860841849313365

RMSE (root mean squared error) is the square root of the mean squared difference between the predicted and observed values, so it measures how far the predictions deviate from the observations on average, in the same units as the ratings. When the errors have zero mean, it equals the standard deviation of the prediction errors. A small RMSE means the model predicts the observed data well; a large RMSE means its predictions deviate substantially from the observed data.
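
As a minimal sketch of the definition above, the following computes RMSE for two sets of predictions. The rating values are made up for illustration and are not taken from the movie data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical ratings, for illustration only
observed = [4, 3, 5, 2]
close_pred = [4, 3, 4, 2]  # off by 1 on a single rating
far_pred = [1, 1, 1, 1]    # far from most of the ratings

print("close predictions:", rmse(observed, close_pred))
print("far predictions:  ", rmse(observed, far_pred))
```

The predictions that sit closer to the observed ratings give the smaller RMSE.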

In HW3, the RMSE values for the simple baseline and similarity-based methods were approximately 1. The NMF model's RMSE of 2.86 is much larger. This suggests that the model did not extract features that let it predict the test ratings accurately.

*Reference: https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e*

**2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?**

Compared with the simple baseline and similarity-based methods used in HW3 - Recommender System, the NMF model did worse, with an RMSE of approximately 2.86.

Most entries in each user's row of the rating matrix are 0 because the user has not rated those movies, but NMF treats these zeros as actual ratings rather than as missing values. Since each prediction is the dot product of the user's row of W and the movie's column of H, fitting a mostly-zero matrix pulls the predictions toward 0. With so few observed ratings per user, it is also difficult for the model to extract features that capture which items are rated similarly, so it cannot recommend movies well to the user.
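
To give a rough sense of how much the zero entries dominate the matrix that NMF is asked to reconstruct, here is a small sketch on synthetic data. The matrix size, the 5% observation rate, and the 1 to 5 rating scale are assumptions chosen for illustration; they are not measured from the movie data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies = 50, 40

# Synthetic "true" ratings on an assumed 1-5 scale
ratings = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)

# Assume only about 5% of user-movie pairs are actually rated
observed = rng.random((n_users, n_movies)) < 0.05

# Unrated entries become 0, as in a zero-filled rating matrix
zero_filled = np.where(observed, ratings, 0.0)

sparsity = 1.0 - observed.mean()
mean_observed = zero_filled[observed].mean()
mean_all = zero_filled.mean()

print(f"share of zero entries:         {sparsity:.2f}")
print(f"mean of observed ratings:      {mean_observed:.2f}")
print(f"mean of the zero-filled matrix: {mean_all:.2f}")
```

The mean of the zero-filled matrix is far below the mean of the observed ratings, which is the target a reconstruction of the full matrix is drawn toward.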

There are a few ways to improve the performance of the NMF model: replace the zeros in the rating matrix with each user's average rating, or normalize each user's ratings. Removing the zeros would decrease the amount of data available for training, so it is better to replace them.
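
The user-average replacement could be sketched as follows. The helper `fill_with_user_mean` and the tiny matrix `R` are hypothetical and written for illustration; the global-mean fallback for users with no ratings is an added assumption, and this has not been run against the movie data.

```python
import numpy as np
from sklearn.decomposition import NMF

def fill_with_user_mean(R):
    """Replace zeros (treated as missing) with each user's mean observed rating."""
    R = np.asarray(R, dtype=float)
    observed = R > 0
    counts = observed.sum(axis=1)
    global_mean = R[observed].mean()
    # Missing entries are 0, so the row sum is the sum of observed ratings.
    # Users with no ratings fall back to the global mean (an assumption).
    user_mean = np.where(counts > 0, R.sum(axis=1) / np.maximum(counts, 1), global_mean)
    return np.where(observed, R, user_mean[:, None])

# Tiny hypothetical rating matrix (rows: users, columns: movies, 0 = unrated)
R = np.array([[5, 0, 3, 0],
              [0, 4, 0, 2],
              [1, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

R_filled = fill_with_user_mean(R)

# Same NMF settings as above, with fewer components for the small example
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=42)
W = model.fit_transform(R_filled)
H = model.components_
R_hat = W @ H  # dense matrix of predicted ratings

print(R_filled)
```

In the notebook, `tr_matrix` would take the place of `R`, and the resulting `W` and `H` could be passed to `predict_ratings` and scored with `rec.rmse` to check whether the RMSE actually improves.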