## 1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]

In [1]:
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
import numpy as np
from math import sqrt

In [2]:
# Load the data
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [4]:
# Get the maximum user and movie IDs from the data
max_user_id = data.train['uID'].max()
max_movie_id = data.train['mID'].max()

# Create user-item rating matrix
ratings_matrix = np.zeros((max_user_id, max_movie_id))

for index, row in data.train.iterrows():
    user_id = row['uID']
    movie_id = row['mID']
    rating = row['rating']
    ratings_matrix[user_id - 1, movie_id - 1] = rating


In [5]:

# Apply Non-Negative Matrix Factorization (NMF)
n_components = 10  # Number of latent factors
model = NMF(n_components=n_components, init='random', random_state=0)
W = model.fit_transform(ratings_matrix)
H = model.components_

# Predict missing ratings
predicted_ratings = np.dot(W, H)





In [7]:
# Build the test matrix with the same shape as predicted_ratings so that the
# boolean mask in the next cell lines up with it.
# Assumes no uID or mID in the test set exceeds the training maximum.
test_ratings_matrix = np.zeros(predicted_ratings.shape)

for index, row in data.test.iterrows():
    user_id = row['uID']
    movie_id = row['mID']
    test_ratings_matrix[user_id - 1, movie_id - 1] = row['rating']


In [8]:
test_mask = test_ratings_matrix > 0
predicted_test_ratings = predicted_ratings[test_mask]
actual_test_ratings = test_ratings_matrix[test_mask]

rmse = sqrt(mean_squared_error(actual_test_ratings, predicted_test_ratings))
print(f'RMSE: {rmse}')


RMSE: 2.91409307190423
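
The row-by-row loops and the boolean mask above can also be written with NumPy integer indexing. The sketch below is one way to do that, assuming `uID` and `mID` are 1-based integer columns as in the cells above; `to_matrix`, `rmse_on_rows`, and the toy frames are hypothetical names introduced for illustration, not part of the homework files. Indexing the prediction matrix by the test rows directly avoids building a second matrix, and it does not depend on the train and test matrices having identical shapes, provided every test ID fits inside the prediction matrix.

```python
import numpy as np
import pandas as pd

def to_matrix(df, shape):
    """Dense user-by-movie matrix from (uID, mID, rating) rows; 0 marks 'no rating'."""
    matrix = np.zeros(shape)
    rows = df['uID'].to_numpy() - 1   # 1-based IDs -> 0-based row indices
    cols = df['mID'].to_numpy() - 1   # 1-based IDs -> 0-based column indices
    matrix[rows, cols] = df['rating'].to_numpy()
    return matrix

def rmse_on_rows(predicted, df):
    """RMSE of predicted[uID-1, mID-1] against the ratings listed in df."""
    rows = df['uID'].to_numpy() - 1
    cols = df['mID'].to_numpy() - 1
    errors = predicted[rows, cols] - df['rating'].to_numpy()
    return float(np.sqrt(np.mean(errors ** 2)))

# Tiny hypothetical frames standing in for data.train and data.test
toy_train = pd.DataFrame({'uID': [1, 1, 2, 3],
                          'mID': [1, 3, 2, 3],
                          'rating': [5, 3, 4, 1]})
toy_test = pd.DataFrame({'uID': [2, 3], 'mID': [1, 1], 'rating': [4, 2]})

shape = (3, 3)
R = to_matrix(toy_train, shape)
constant_prediction = np.full(shape, 3.0)   # placeholder for np.dot(W, H)
toy_rmse = rmse_on_rows(constant_prediction, toy_test)
print(R)
print(toy_rmse)
```

With the real data, `rmse_on_rows(predicted_ratings, data.test)` would play the role of the mask-based computation above.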




## 2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]

The NMF model above reaches a test RMSE of about 2.91. Several factors may explain why sklearn's Non-Negative Matrix Factorization (NMF) performed worse than the simple baseline and similarity-based methods from Module 3:

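1. **Loss of Information**: NMF approximates the ratings matrix by the product of two low-rank matrices, so it cannot reproduce every entry. The larger problem here is sparsity: the code above stores every unrated user-movie pair as 0, and sklearn's NMF fits all entries of the matrix, including those zeros. The model is therefore pulled toward predicting values near 0 for unseen pairs, well below the observed ratings, which is consistent with the RMSE of about 2.91.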

2. **Choice of Hyperparameters**: The choice of hyperparameters, such as the number of latent factors (`n_components` in your case), can significantly impact the performance of NMF. If the chosen number of latent factors is not appropriate, the model might fail to capture the underlying patterns in the data.

3. **Initialization Sensitivity**: NMF is sensitive to the choice of initial values. Different initializations can lead to different factorized matrices and therefore varying results. In your code, you've used `init='random'`, which might not always yield the best results.

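4. **Scalability**: NMF might not scale well to large datasets. The dense matrix built above has one cell per user-movie pair, so memory use and training time grow with the product of the two ID ranges, and a fit that stops at the iteration limit before converging can leave the factors under-fitted.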

5. **Non-Convex Optimization**: NMF involves non-convex optimization, meaning that there can be multiple local minima in the optimization landscape. This can result in the algorithm getting stuck in suboptimal solutions.

6. **Complexity of the Problem**: The movie recommendation problem is inherently complex, and simple matrix factorization methods like NMF might not capture all the intricate relationships between users and movies as well as more advanced techniques.
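
One concrete factor behind the points listed above is how missing ratings are encoded: the notebook fills them with 0, and sklearn's NMF fits every cell of the matrix. The sketch below illustrates the effect on synthetic data (not the HW3 files). The matrix sizes, the 10% and 5% sampling rates, and all variable names are assumptions chosen for illustration. It compares a zero-filled NMF against a predictor that always returns the global training mean.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic low-rank "ratings" rescaled into (1, 5]; hypothetical data
n_users, n_items, rank = 60, 40, 3
true = rng.random((n_users, rank)) @ rng.random((rank, n_items))
true = 1 + 4 * true / true.max()

# Observe about 10% of the cells for training, hold out about 5% for testing
u = rng.random(true.shape)
train_mask = u < 0.10
test_mask = (u >= 0.10) & (u < 0.15)

# Same construction as the notebook: unobserved cells become 0
zero_filled = np.where(train_mask, true, 0.0)

model = NMF(n_components=10, init='random', random_state=0, max_iter=500)
W = model.fit_transform(zero_filled)
pred = W @ model.components_

# RMSE of the zero-filled NMF on the held-out cells
rmse_nmf = np.sqrt(mean_squared_error(true[test_mask], pred[test_mask]))

# RMSE of a constant predictor that always returns the training mean
global_mean = true[train_mask].mean()
constant = np.full(test_mask.sum(), global_mean)
rmse_mean = np.sqrt(mean_squared_error(true[test_mask], constant))

print(f'zero-filled NMF RMSE: {rmse_nmf:.3f}')
print(f'global-mean RMSE:     {rmse_mean:.3f}')
print(f'mean NMF prediction on held-out cells: {pred[test_mask].mean():.3f}')
```

In this setup the NMF predictions for held-out cells sit far below the true ratings, because most of what the model was asked to reconstruct was zeros.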

To improve the performance of NMF or address its limitations, you can consider the following strategies:

1. **Tune Hyperparameters**: Experiment with different values of the `n_components` hyperparameter to find the optimal number of latent factors. This can be done through cross-validation on your training data.

2. **Advanced Initialization**: Instead of random initialization, you can try other initialization options such as `'nndsvd'` or `'nndsvda'`, available in sklearn's NMF implementation. These methods are designed to provide better starting points for optimization.

3. **Regularization**: Incorporate regularization terms into the NMF objective function (in recent sklearn versions, via the `alpha_W`, `alpha_H`, and `l1_ratio` parameters). Regularization can help prevent overfitting and improve generalization to the test data.

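4. **Use Other Matrix Factorization Techniques**: Consider trying other matrix factorization methods, such as Singular Value Decomposition (SVD, available in sklearn as `TruncatedSVD`). Unlike NMF, SVD permits negative values, so the ratings can be mean-centered first; a missing entry stored as 0 then means "average" rather than "lowest possible rating".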

5. **Ensemble Methods**: Combine NMF with other techniques, like collaborative filtering or content-based methods, in an ensemble approach. This can potentially mitigate the limitations of NMF by leveraging the strengths of different methods.

6. **Advanced Collaborative Filtering**: Explore more advanced collaborative filtering techniques like matrix factorization with implicit feedback, which can handle the sparsity of the data more effectively.

7. **Deep Learning**: Consider using deep learning techniques like autoencoders or neural collaborative filtering, which can capture intricate patterns in the data and might lead to better results.

8. **Evaluation Metrics**: Besides RMSE, consider using other evaluation metrics such as precision, recall, and F1-score, which can provide a more comprehensive view of the recommendation system's performance.
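
Several of the strategies above can be combined in a few lines. The sketch below is a minimal illustration on synthetic data, not the HW3 files. It fills missing cells with each user's mean rating before fitting (an added assumption, so that NMF is not asked to reconstruct zeros), uses `init='nndsvda'`, clips predictions to an assumed 1 to 5 rating range, and picks `n_components` on a held-out validation split. The helper names `impute_user_means`, `nmf_predict`, and `rmse` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF

def impute_user_means(ratings, observed):
    """Replace unobserved cells with each user's mean observed rating."""
    global_mean = ratings[observed].mean()
    counts = observed.sum(axis=1)
    sums = np.where(observed, ratings, 0.0).sum(axis=1)
    # Users with no observed ratings fall back to the global mean
    user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return np.where(observed, ratings, user_means[:, None])

def nmf_predict(filled, n_components, lo=1.0, hi=5.0):
    """Fit NMF on a fully filled matrix and clip the reconstruction to [lo, hi]."""
    model = NMF(n_components=n_components, init='nndsvda',
                random_state=0, max_iter=500)
    W = model.fit_transform(filled)
    return np.clip(W @ model.components_, lo, hi)

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic low-rank ratings in (1, 5]; hypothetical data
rng = np.random.default_rng(0)
true = rng.random((60, 3)) @ rng.random((3, 40))
true = 1 + 4 * true / true.max()

u = rng.random(true.shape)
train_mask = u < 0.10                    # about 10% of cells observed
valid_mask = (u >= 0.10) & (u < 0.15)    # about 5% held out for validation
train = np.where(train_mask, true, 0.0)

filled = impute_user_means(train, train_mask)

# Sweep the number of latent factors and keep the best validation RMSE
scores = {k: rmse(true[valid_mask], nmf_predict(filled, k)[valid_mask])
          for k in (2, 5, 10)}
best_k = min(scores, key=scores.get)
print(scores)
print(f'best n_components: {best_k}')
```

Mean imputation is only one option; the point of the sketch is that the value used for missing cells matters at least as much as the choice of `n_components` or `init`.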

In conclusion, while sklearn's NMF library can be a useful tool for matrix factorization tasks, it might not work as well as simpler methods or more advanced techniques in certain scenarios. By carefully tuning hyperparameters, exploring initialization methods, and considering other matrix factorization approaches, you can potentially improve the performance of your recommendation system.
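
As one example of another matrix factorization approach within sklearn, the sketch below applies `TruncatedSVD` to a user-mean-centered matrix. It runs on synthetic data rather than the HW3 files. The function name `svd_predict`, the 1 to 5 clipping range, and the sampling rates are assumptions made for illustration. After centering, a missing cell stored as 0 stands for "this user's average", which is a milder assumption than a rating of 0.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def svd_predict(ratings, observed, n_components=5, lo=1.0, hi=5.0):
    """Low-rank reconstruction of user-mean-centered ratings, clipped to [lo, hi]."""
    global_mean = ratings[observed].mean()
    counts = observed.sum(axis=1)
    sums = np.where(observed, ratings, 0.0).sum(axis=1)
    # Users with no observed ratings fall back to the global mean
    user_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)

    # Center observed cells; unobserved cells become 0, i.e. "user average"
    centered = np.where(observed, ratings - user_means[:, None], 0.0)

    svd = TruncatedSVD(n_components=n_components, random_state=0)
    low_rank = svd.inverse_transform(svd.fit_transform(centered))
    return np.clip(low_rank + user_means[:, None], lo, hi)

# Synthetic low-rank ratings in (1, 5]; hypothetical data
rng = np.random.default_rng(0)
true = rng.random((60, 3)) @ rng.random((3, 40))
true = 1 + 4 * true / true.max()

u = rng.random(true.shape)
train_mask = u < 0.10                   # about 10% of cells observed
test_mask = (u >= 0.10) & (u < 0.15)    # about 5% held out for testing
train = np.where(train_mask, true, 0.0)

pred = svd_predict(train, train_mask)
svd_rmse = float(np.sqrt(np.mean((true[test_mask] - pred[test_mask]) ** 2)))
print(f'centered TruncatedSVD RMSE: {svd_rmse:.3f}')
```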