# Limitation of Sklearn’s Non-Negative Matrix Factorization Library

**1. Load the movie ratings data (as in the HW3-recommender-system), use matrix factorization technique(s) to predict the missing ratings in the test data, and measure the RMSE. You should use the sklearn library.**

In [10]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings


In [11]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

In [12]:
# Suppress all warnings (this also hides sklearn's UserWarnings, e.g. convergence warnings)
warnings.filterwarnings('ignore')

# Create a utility matrix (user-item matrix)
utility_matrix = train_data.pivot(index='uID', columns='mID', values='rating')

# Fill missing values with 0 because sklearn's NMF rejects NaN; unrated cells are then treated as ratings of 0
utility_matrix.fillna(0, inplace=True)

# Matrix Factorization using NMF
# Note: sklearn warns that the 'mu' solver cannot update zeros produced by init='nndsvd'
nmf_model = NMF(n_components=20, init='nndsvd', beta_loss='kullback-leibler', solver='mu', random_state=42, max_iter=500)
W = nmf_model.fit_transform(utility_matrix)
H = nmf_model.components_

# Predict ratings for the utility matrix
predicted_ratings = np.dot(W, H)

# Create a function to get predicted ratings for test data
def predict_rating(uID, mID):
    if uID in utility_matrix.index and mID in utility_matrix.columns:
        return predicted_ratings[utility_matrix.index.get_loc(uID), utility_matrix.columns.get_loc(mID)]
    else:
        return np.nan  # If uID or mID not found

# Apply the function to predict ratings for the test set
test_data['predicted_rating'] = test_data.apply(lambda row: predict_rating(row['uID'], row['mID']), axis=1)

# Drop rows where prediction is NaN
test_data = test_data.dropna(subset=['predicted_rating'])

# Calculate RMSE between actual and predicted ratings
rmse = sqrt(mean_squared_error(test_data['rating'], test_data['predicted_rating']))

print(f'RMSE: {rmse}')

RMSE: 2.9400073885874316


The RMSE on the test set is about 2.94.

**2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to the simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?**

Non-negative matrix factorization (NMF) works by factorizing the user-item matrix into two lower-dimensional matrices (user features and item features) and reconstructing the original matrix through the product of these matrices.
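
To make the reconstruction step concrete, the factorization and the loss selected by beta_loss='kullback-leibler' in the code above can be written out. The symbols here are notation assumed for illustration and do not appear in the assignment: $R$ is the zero-filled utility matrix with $m$ users and $n$ movies, $k$ is n_components, and $\hat{r}_{ui}$ is the predicted rating.

$$
R \approx \hat{R} = WH, \qquad W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times n}, \qquad \hat{r}_{ui} = \sum_{f=1}^{k} W_{uf} H_{fi}
$$

$$
D_{\mathrm{KL}}(R \,\|\, \hat{R}) = \sum_{u=1}^{m} \sum_{i=1}^{n} \left( r_{ui} \log \frac{r_{ui}}{\hat{r}_{ui}} - r_{ui} + \hat{r}_{ui} \right)
$$

The sum runs over every cell, not only the rated ones. With the convention $0 \log 0 = 0$, a cell with $r_{ui} = 0$ contributes exactly $\hat{r}_{ui}$ to the loss, so minimizing it pushes the prediction for that cell toward 0.
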
There are several likely reasons for this underperformance:
- Sparse Data Problem: The user-item matrix is extremely sparse (most users have rated only a small fraction of the movies). Because the missing entries were filled with 0, NMF treats them as real ratings of 0 and fits them together with the actual ratings, which pulls the reconstructed ratings for unrated movies toward 0. Sparse data also makes it difficult for NMF to find meaningful latent factors to represent users and movies, leading to inaccurate predictions.
- Cold Start Problem: NMF struggles with cold start issues. When a user has rated very few movies (or none at all), NMF cannot effectively learn that user's preferences. Similarly, new movies with few ratings pose challenges for NMF.
- Overfitting or Underfitting: If the number of components (n_components) in NMF is not tuned properly, the model may either overfit or underfit the data. Choosing the wrong number of components can lead to poorer generalization to new users and movies.
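
A small synthetic experiment can illustrate how sparsity interacts with the fillna(0) step used in the code above. This is only a sketch: the matrix sizes, the 10% density, the random ratings, and the helper name zero_fill_bias_demo are all assumptions made for illustration, not the HW3 data or the notebook's model settings.

```python
import numpy as np
from sklearn.decomposition import NMF

def zero_fill_bias_demo(n_users=60, n_items=40, density=0.1, seed=0):
    """Fit NMF on a zero-filled synthetic rating matrix and compare scales."""
    rng = np.random.default_rng(seed)
    # Synthetic "true" ratings in 1..5 (illustrative data, not the HW3 dataset)
    true_ratings = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
    # True where a rating is known; everything else counts as missing
    observed = rng.random((n_users, n_items)) < density
    # Same treatment as fillna(0): missing cells become ratings of 0
    zero_filled = np.where(observed, true_ratings, 0.0)

    model = NMF(n_components=5, init='nndsvda', random_state=seed, max_iter=500)
    reconstruction = model.fit_transform(zero_filled) @ model.components_

    mean_observed_rating = true_ratings[observed].mean()
    mean_prediction_unobserved = reconstruction[~observed].mean()
    return mean_observed_rating, mean_prediction_unobserved

mean_rating, mean_pred = zero_fill_bias_demo()
print(f'mean observed rating: {mean_rating:.2f}')
print(f'mean prediction on unobserved cells: {mean_pred:.2f}')
```

In this setup the predictions for unobserved cells sit far below the typical observed rating, which is the same kind of gap that inflates the RMSE when test ratings are looked up in a reconstruction of a zero-filled matrix.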

To improve the performance of NMF, the following strategies could be considered:
- Tuning the number of components (n_components): It’s essential to try different values for n_components to balance the model's complexity against its ability to generalize. A cross-validation approach should be used to select the value with the lowest validation RMSE.
- Regularization: Apply regularization to the NMF factors to prevent overfitting (e.g., L2 regularization). In recent sklearn versions, NMF exposes this through the alpha_W, alpha_H, and l1_ratio parameters.
- Matrix Imputation: Before applying NMF, use a matrix imputation technique to fill in the missing ratings more intelligently than assuming they are 0, for example with each user's or each movie's mean rating.
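
The imputation and n_components suggestions can be sketched together on synthetic data. Everything below is an assumption for illustration: the low-rank synthetic ratings, the masks, the helper heldout_rmse, and the choice of per-user mean imputation (one of several possible imputation schemes). It is not a rerun of the HW3 experiment.

```python
import numpy as np
from sklearn.decomposition import NMF

def heldout_rmse(filled, true_ratings, test_mask, n_components, seed=0):
    """Fit NMF on an already-filled matrix and score it on held-out cells only."""
    model = NMF(n_components=n_components, init='nndsvda',
                random_state=seed, max_iter=500)
    reconstruction = model.fit_transform(filled) @ model.components_
    errors = reconstruction[test_mask] - true_ratings[test_mask]
    return float(np.sqrt(np.mean(errors ** 2)))

rng = np.random.default_rng(0)
n_users, n_items = 80, 50

# Synthetic ratings with low-rank structure, scaled into [1, 5]
user_factors = rng.random((n_users, 3))
item_factors = rng.random((3, n_items))
true_ratings = 1 + 4 * (user_factors @ item_factors) / 3

# About 15% of cells are "rated"; a disjoint subset of the rest is held out for testing
train_mask = rng.random((n_users, n_items)) < 0.15
test_mask = ~train_mask & (rng.random((n_users, n_items)) < 0.05)

# Option 1: fill missing cells with 0 (what fillna(0) does)
zero_filled = np.where(train_mask, true_ratings, 0.0)

# Option 2: fill missing cells with each user's mean training rating,
# falling back to the global mean for users with no training ratings
global_mean = true_ratings[train_mask].mean()
counts = train_mask.sum(axis=1)
sums = zero_filled.sum(axis=1)
user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
mean_filled = np.where(train_mask, true_ratings, user_means[:, None])

# Sweep n_components and compare held-out RMSE for both fill strategies
results = {}
for k in (2, 5, 10):
    results[k] = (
        heldout_rmse(zero_filled, true_ratings, test_mask, k),
        heldout_rmse(mean_filled, true_ratings, test_mask, k),
    )
    print(f'n_components={k}: zero-fill RMSE={results[k][0]:.3f}, '
          f'mean-fill RMSE={results[k][1]:.3f}')
```

On real data the held-out cells would come from a validation split of the training ratings, so that the chosen n_components and fill strategy are selected without looking at the test set.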


