# Movie Ratings Prediction

## Load packages

In [2]:
import pandas as pd
# import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from scipy.sparse import dok_matrix

from sklearn.decomposition import NMF

from sklearn.metrics import mean_squared_error

## Load data

In [3]:
users = pd.read_csv('data/movie-ratings-data/users.csv')
movies = pd.read_csv('data/movie-ratings-data/movies.csv')
train = pd.read_csv('data/movie-ratings-data/train.csv')
test = pd.read_csv('data/movie-ratings-data/test.csv')

## Process data

### Match user and movie IDs to indices

In [4]:
uid2idx = {row['uID']: index for index, row in users.iterrows()}
mid2idx = {row['mID']: index for index, row in movies.iterrows()}

### Generate the user-movie rating matrix

In [5]:
num_users = users['uID'].nunique()
num_movies = movies['mID'].nunique()
rating_matrix_train = dok_matrix((num_users, num_movies), dtype=np.float32)

for _, row in train.iterrows():
    idx_movie = mid2idx[row['mID']]
    idx_user = uid2idx[row['uID']]
    rating_matrix_train[idx_user, idx_movie] = row['rating']
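
The row-by-row loop above can be slow on a large ratings table. A minimal vectorized sketch of the same construction follows, assuming every `(uID, mID)` pair appears at most once in the ratings (unlike the `dok_matrix` assignment, `csr_matrix` sums duplicate entries rather than overwriting them). The toy tables and their values are hypothetical stand-ins for `users`, `movies`, and `train`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy tables with the same column names as the real data.
users = pd.DataFrame({'uID': [10, 20, 30]})
movies = pd.DataFrame({'mID': [100, 200]})
train = pd.DataFrame({'uID': [10, 30, 20],
                      'mID': [100, 200, 100],
                      'rating': [4, 5, 2]})

# Lookup tables from ID to row/column position.
uid2idx = pd.Series(np.arange(len(users)), index=users['uID'].to_numpy())
mid2idx = pd.Series(np.arange(len(movies)), index=movies['mID'].to_numpy())

# Translate all IDs at once instead of looping over rows.
rows = train['uID'].map(uid2idx).to_numpy()
cols = train['mID'].map(mid2idx).to_numpy()
vals = train['rating'].to_numpy(dtype=np.float32)

# Build the sparse user-movie matrix in one call.
rating_matrix = csr_matrix((vals, (rows, cols)),
                           shape=(len(users), len(movies)))
```

The result is in CSR format, which `NMF` accepts directly.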

### Define and train the NMF model

In [6]:
nmf_model = NMF(n_components=20, init='random', random_state=0)
nmf_model.fit(rating_matrix_train)

W = nmf_model.transform(rating_matrix_train)
H = nmf_model.components_

# Reconstruct the full user-movie rating matrix; train and test predictions are read from it
rating_matrix_train_pred = np.dot(W, H)

### Generate predictions

In [7]:
train_pred = train.copy()
train_pred['rating_pred'] = np.nan

for index, row in train_pred.iterrows():
    idx_movie = mid2idx[row['mID']]
    idx_user = uid2idx[row['uID']]
    train_pred.loc[index, 'rating_pred'] = rating_matrix_train_pred[idx_user, idx_movie]
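
The lookup loop above can also be expressed with NumPy integer-array indexing, which reads all requested cells in one step. This is a sketch on hypothetical toy values; `pred_matrix` and `pairs` stand in for `rating_matrix_train_pred` and `train_pred`, and it assumes every ID in `pairs` is present in the lookup dictionaries.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstructed matrix: 3 users x 2 movies.
pred_matrix = np.array([[1.0, 2.0],
                        [3.0, 4.0],
                        [5.0, 6.0]])
uid2idx = {10: 0, 20: 1, 30: 2}
mid2idx = {100: 0, 200: 1}

# The (user, movie) pairs we want predictions for.
pairs = pd.DataFrame({'uID': [30, 10], 'mID': [100, 200]})

# Convert IDs to positions, then index the matrix with both arrays at once.
rows = pairs['uID'].map(uid2idx).to_numpy()
cols = pairs['mID'].map(mid2idx).to_numpy()
pairs['rating_pred'] = pred_matrix[rows, cols]
```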

### Clip and round the prediction values

In [8]:
train_pred.loc[train_pred['rating_pred'] > 5, 'rating_pred'] = 5.
train_pred.loc[train_pred['rating_pred'] < 1, 'rating_pred'] = 1.
train_pred['rating_pred'] = train_pred['rating_pred'].round()
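
The same clipping and rounding can be written as a single chained expression with `Series.clip`. A small sketch on hypothetical raw predictions:

```python
import pandas as pd

# Hypothetical raw predictions, some outside the valid 1-5 rating range.
raw_pred = pd.Series([0.2, 2.4, 3.6, 7.0])

# Bound to [1, 5] first, then round to the nearest whole rating.
rating_pred = raw_pred.clip(lower=1, upper=5).round()
```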

### Evaluate model performance

In [9]:
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
train_rmse = np.sqrt(mean_squared_error(train_pred['rating'], train_pred['rating_pred']))
print(f'RMSE in train data: {train_rmse:.4f}')

RMSE in train data: 2.5018


In [10]:
test_pred = test.copy()
test_pred['rating_pred'] = np.nan

for index, row in test_pred.iterrows():
    idx_movie = mid2idx[row['mID']]
    idx_user = uid2idx[row['uID']]
    test_pred.loc[index, 'rating_pred'] = rating_matrix_train_pred[idx_user, idx_movie]

test_pred.loc[test_pred['rating_pred'] > 5, 'rating_pred'] = 5.
test_pred.loc[test_pred['rating_pred'] < 1, 'rating_pred'] = 1.
test_pred['rating_pred'] = test_pred['rating_pred'].round()

# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
test_rmse = np.sqrt(mean_squared_error(test_pred['rating'], test_pred['rating_pred']))
print(f'RMSE in test data: {test_rmse:.4f}')

RMSE in test data: 2.5597


## Discussion

The RMSE on both the train and test sets is much higher than in the Week 3 homework. There are several possible reasons why non-negative matrix factorization performed worse than the simple baseline and similarity-based methods we used in that homework.

### Sparsity?

In [14]:
total_elements = num_users * num_movies
non_zero_elements = len(rating_matrix_train.keys())
# Fraction of cells that hold a rating (the density; sparsity is 1 - density)
density = non_zero_elements / total_elements
print(f'{(100 * density):.2f}% of the rating matrix has valid values')

2.99% of the rating matrix has valid values


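Matrix factorization as used here relies on a user-item matrix where most entries are filled with ratings, because scikit-learn's `NMF` has no notion of a missing rating: every unobserved entry is treated as a rating of 0. The calculation above shows that only about 3% of the entries hold actual ratings, so the model is fit mostly to zeros and its reconstruction is likely pulled toward 0 even where a rating exists. Clipping then raises those low predictions to 1, which is consistent with an RMSE of about 2.5 on a 1-5 scale. We therefore expect matrix factorization to struggle on this dataset.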

One way to fix this issue is to gather more rating data. 
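
One quick way to check whether sparsity is driving the error is to measure how many raw (unclipped) predictions fall at or below the lower rating bound. The sketch below uses hypothetical values; `clip_floor_share` is an illustrative helper, not part of the notebook or of any library.

```python
import numpy as np

def clip_floor_share(raw_pred, floor=1.0):
    """Fraction of raw predictions at or below the lower clip bound."""
    raw_pred = np.asarray(raw_pred, dtype=float)
    return float(np.mean(raw_pred <= floor))

# Hypothetical raw predictions read from the reconstructed matrix.
raw = np.array([0.05, 0.30, 0.90, 2.40, 4.10])

# Three of the five values are at or below 1.
share = clip_floor_share(raw)
print(f'{100 * share:.0f}% of predictions sit at the rating floor')
```

If most predictions sit at the floor, the model is mainly reproducing the zeros of the unobserved entries rather than learning user preferences.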

### Hyperparameters?

When we defined the NMF model, we arbitrarily set the number of latent factors to 20, but the performance of matrix factorization is sensitive to this hyperparameter. Regularization hyperparameters (in scikit-learn's `NMF`, for example `alpha_W`, `alpha_H`, and `l1_ratio`) may influence model performance as well.

The fix would be to experiment with different values for hyperparameters (potentially via cross-validation) to find the optimal settings.
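
A minimal sketch of such an experiment is shown below. It uses a single hold-out split rather than full cross-validation, and it runs on randomly generated toy ratings, so the numbers it produces say nothing about the real dataset. The candidate values `(2, 5, 10)`, the 20% validation share, and all variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_users, n_movies, n_obs = 30, 20, 200

# Sample distinct (user, movie) cells so no rating is duplicated.
flat = rng.choice(n_users * n_movies, size=n_obs, replace=False)
rows, cols = np.divmod(flat, n_movies)
vals = rng.integers(1, 6, size=n_obs).astype(np.float64)  # toy ratings 1..5

# Hold out roughly 20% of the observed ratings for validation.
is_val = rng.random(n_obs) < 0.2
fit = ~is_val
X_fit = csr_matrix((vals[fit], (rows[fit], cols[fit])),
                   shape=(n_users, n_movies))

val_rmse = {}
for k in (2, 5, 10):
    model = NMF(n_components=k, init='random', random_state=0, max_iter=500)
    W = model.fit_transform(X_fit)
    H = model.components_
    # Predict only the held-out cells, clipped to the valid rating range.
    pred = np.clip((W @ H)[rows[is_val], cols[is_val]], 1, 5)
    val_rmse[k] = float(np.sqrt(np.mean((pred - vals[is_val]) ** 2)))

# Pick the number of latent factors with the lowest validation RMSE.
best_k = min(val_rmse, key=val_rmse.get)
print(val_rmse, best_k)
```

On the real data, the same loop could be repeated over several splits and extended to the regularization settings.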