# Limitation(s) of sklearn’s non-negative matrix factorization library.

### 1. Load the movie ratings data (as in the HW3-recommender-system), use matrix factorization technique(s) to predict the missing ratings from the test data, and measure the RMSE. You should use the sklearn library.

In [2]:
import pandas as pd
import numpy as np
import os
from scipy.sparse import coo_matrix, csr_matrix
from collections import namedtuple
from sklearn.decomposition import NMF
from sklearn.metrics import accuracy_score, confusion_matrix

Load the movie ratings

In [3]:
# Load data from google drive
from google.colab import drive
drive.mount('/content/drive')
train = pd.read_csv("/content/drive/My Drive/movie_train.csv")
test = pd.read_csv("/content/drive/My Drive/movie_test.csv")
MV_users = pd.read_csv("/content/drive/My Drive/movie_users.csv")
MV_movies = pd.read_csv("/content/drive/My Drive/movies.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



Reuse the `RecSys` code from the Week 3 assignment to build the ratings matrix for factorization

In [6]:
# Define data structure
Data = namedtuple('Data', ['users', 'movies', 'train', 'test'])
data = Data(MV_users, MV_movies, train, test)

class RecSys:
    """
    Recommender System class to handle movie recommendations.
    """
    def __init__(self, data):
        """
        Initialize the recommender system with user and movie data.
        """
        self.data = data
        self.allusers = list(data.users['uID'])
        self.allmovies = list(data.movies['mID'])
        self.mid2idx = dict(zip(data.movies.mID, range(len(data.movies))))
        self.uid2idx = dict(zip(data.users.uID, range(len(data.users))))
        self.Mr = self.rating_matrix()  # Rating matrix
        self.sim = np.zeros((len(self.allmovies), len(self.allmovies)))  # Similarity matrix

    def rating_matrix(self):
        """
        Convert the training data into a user-item rating matrix.
        """
        ind_movie = [self.mid2idx[m] for m in self.data.train.mID]
        ind_user = [self.uid2idx[u] for u in self.data.train.uID]
        return np.array(coo_matrix((self.data.train.rating, (ind_user, ind_movie)),
                                   shape=(len(self.allusers), len(self.allmovies))).toarray())

    def predict_everything_to_3(self):
        """
        Predict a constant rating of 3 for all test items.
        """
        return np.full(len(self.data.test), 3)

    def predict_to_user_average(self):
        """
        Predict ratings based on user average ratings.
        """
        user_avg = np.divide(self.Mr.sum(axis=1), (self.Mr > 0).sum(axis=1))
        return user_avg[[self.uid2idx[u] for u in self.data.test.uID]]

    def predict_from_sim(self, uid, mid):
        """
        Predict a rating for a given user and movie using similarity-based prediction.
        """
        uf = self.Mr[self.uid2idx[uid]]
        mf = self.sim[self.mid2idx[mid]]
        return np.dot(uf, mf) / np.dot(mf, uf > 0)

    def predict(self):
        """
        Predict ratings for all test data using similarity-based predictions.
        """
        return np.array([self.predict_from_sim(x.uID, x.mID) for _, x in self.data.test.iterrows()])

    def rmse(self, yp):
        """
        Calculate the root mean square error between predicted and actual ratings.
        """
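        yp = np.asarray(yp, dtype=float)
        yp = np.where(np.isnan(yp), 3, yp)  # Replace NaN predictions with 3 without modifying the caller's array
        return np.sqrt(((self.data.test.rating - yp) ** 2).mean())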


In [7]:
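# Perform NMF on the ratings matrix.
# Note: rs.Mr stores unrated entries as 0, and NMF fits them like any other value.
rs = RecSys(data)
ratingsMatrix = rs.Mr
model = NMF(n_components=18, random_state=42, init="nndsvda", solver="mu",
            beta_loss="kullback-leibler", max_iter=1000).fit(ratingsMatrix)
W = model.transform(ratingsMatrix)  # User factors, shape (n_users, 18)
H = model.components_  # Movie factors, shape (18, n_movies)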

In [8]:
# Reconstruct user data as predictions from NMF
X = model.inverse_transform(W)
X.shape

(6040, 3883)

In [9]:
# Adapt the predict method to use reconstructed user data
yhat = [X[rs.uid2idx[x.uID], rs.mid2idx[x.mID]] for _, x in rs.data.test.iterrows()]
print(f'Predictions: {yhat}')


Predictions: [1.5536136713256865, 0.6851166697252002, 0.3837880017819665, 0.9424926352090831, 1.1363320562217163, 0.5274423307659339, 1.1799102687370846, 2.492188732748855, 0.8509998591151985, 0.3213151837932003, 0.2802404530164094, 0.9963941459830036, 0.5546408433172821, 3.635531668969538, 0.6661843616849846, 0.5841731759791975, 1.5002069875157138, 0.051224578438354644, 0.5903916690167957, 0.4270931473753008, 2.1789115706578452, 3.653329202788758, 0.9983357882584093, 0.8286608704569871, 0.45416935698950456, 0.9833261347579355, 0.7465356827884746, 1.9765688984812715, 0.5131467140189924, 0.4007281993573487, 6.007702844110132, 0.5830309473261613, 0.5554646166859423, 2.507761318009736, 0.15008788413662494, 0.023965296423569378, 0.9083079971375952, 0.17803272598945713, 0.10285445880017308, 0.20213728535796635, 1.2459958273782128, 0.8487313440626693, 2.156267099309003, 1.204682794954764, 1.3263308361821327, 1.7669566905174865, 1.7562625988983656, 3.1556818616186306, 2.3661816494119683, 0.48

In [11]:
# Convert predictions to numpy array and handle NaNs
yhat = np.asarray(yhat)
yhat[np.isnan(yhat)] = 3  # Replace NaN values with 3

# Compute RMSE
labs = np.array(rs.data.test.rating)  # True ratings
RMSE = np.sqrt(((labs - yhat) ** 2).mean())  # Root Mean Square Error

print(f"The RMSE of the predictions made using NMF is: {RMSE:.4f}")


The RMSE of the predictions made using NMF is: 2.8851


### 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to the simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?


The Non-Negative Matrix Factorization (NMF) model has an RMSE of 2.8851, far worse than the top models from Week 3, which scored below 1. The main cause is the extreme sparsity of the ratings matrix combined with how it is passed to sklearn: unrated entries are stored as 0, and sklearn's NMF has no way to mark entries as missing, so it fits those zeros as if they were real ratings. Because the zeros vastly outnumber the actual ratings, the reconstruction is pulled toward 0, which is why many of the predictions printed above fall below 1. The use of KL loss with the "nndsvda" initialization may be a poor match for this data as well, but that is secondary to the missing-value problem.
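The pull toward zero can be illustrated on synthetic data. The following is a minimal sketch, not the notebook's data: the matrix sizes, the 10% observation rate, and all variable names are assumptions chosen for illustration. It builds a low-rank "true" ratings matrix, hides most entries by setting them to 0, fits sklearn's `NMF` on the zero-filled matrix, and compares the mean prediction on the hidden entries with their mean true rating.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users x 100 movies with low-rank structure.
n_users, n_movies, rank = 200, 100, 5
U = rng.uniform(0.5, 1.0, size=(n_users, rank))
V = rng.uniform(0.5, 1.0, size=(rank, n_movies))
raw = U @ V
# Rescale to the 1..5 range so the values look like ratings.
full = 1 + 4 * (raw - raw.min()) / (raw.max() - raw.min())

# Keep roughly 10% of the entries as "observed"; the rest become 0,
# mirroring how the ratings matrix stores unrated movies.
observed = rng.random(full.shape) < 0.10
held_out = ~observed
zero_filled = np.where(observed, full, 0.0)

# Fit NMF on the zero-filled matrix and reconstruct it.
model = NMF(n_components=rank, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(zero_filled)
recon = W @ model.components_

print(f"Mean true rating on hidden entries:      {full[held_out].mean():.2f}")
print(f"Mean predicted rating on hidden entries: {recon[held_out].mean():.2f}")
```

On data like this, the predictions for the hidden entries should sit far below the true ratings, because the model was trained to reproduce the zeros.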

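To improve performance, first deal with the missing entries rather than leaving them as zeros: for example, impute them with each user's mean rating before factorizing, or use a matrix factorization method that fits only the observed ratings. Dimensionality reduction techniques such as Truncated SVD or PCA can be tried as alternatives, though they also treat unfilled entries as zeros and so need the same imputation step. Adding regularization and tuning the number of components may improve robustness, more ratings per user would reduce the sparsity itself, and hybrid approaches that combine matrix factorization with similarity-based methods might further improve accuracy. These steps should bring NMF performance closer to that of the baseline and similarity-based methods.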
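One of the suggested fixes, imputing missing entries with the user's mean rating before running NMF, can be sketched as follows. This is an illustration on synthetic data under stated assumptions (toy matrix sizes, a 10% observation rate, and the hypothetical helper `heldout_rmse`), not a result for the movie ratings dataset.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 users x 100 movies, ratings rescaled to 1..5.
n_users, n_movies, rank = 200, 100, 5
raw = rng.uniform(0.5, 1.0, (n_users, rank)) @ rng.uniform(0.5, 1.0, (rank, n_movies))
full = 1 + 4 * (raw - raw.min()) / (raw.max() - raw.min())

# Roughly 10% of entries are observed; the rest are held out for evaluation.
observed = rng.random(full.shape) < 0.10
held_out = ~observed

# Option A: leave unrated entries as 0 (what the notebook does).
zero_filled = np.where(observed, full, 0.0)

# Option B: fill unrated entries with the user's mean observed rating,
# falling back to the global mean for users with no observed ratings.
counts = observed.sum(axis=1)
sums = zero_filled.sum(axis=1)
global_mean = full[observed].mean()
user_mean = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
mean_filled = np.where(observed, full, user_mean[:, None])

def heldout_rmse(filled):
    """Fit NMF on `filled` and return the RMSE on the held-out entries."""
    model = NMF(n_components=rank, init="nndsvda", random_state=0, max_iter=500)
    recon = model.fit_transform(filled) @ model.components_
    return float(np.sqrt(((full[held_out] - recon[held_out]) ** 2).mean()))

rmse_zero = heldout_rmse(zero_filled)
rmse_mean = heldout_rmse(mean_filled)
print(f"Held-out RMSE, zero-filled: {rmse_zero:.3f}")
print(f"Held-out RMSE, mean-filled: {rmse_mean:.3f}")
```

In the notebook, the same idea would amount to filling the zero entries of `rs.Mr` with per-user averages before calling `NMF`; how much this helps on the real data would need to be measured.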