# NMF with movie ratings

## Section 1

Load the movie ratings data (as in the HW3-recommender-system), use matrix factorization technique(s) to predict the missing ratings in the test data, and measure the RMSE.

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

### Load train and test data.

In [4]:
MV_users = pd.read_csv('https://raw.githubusercontent.com/BaffinLee/BBC-News-Classification/main/users.csv')
MV_movies = pd.read_csv('https://raw.githubusercontent.com/BaffinLee/BBC-News-Classification/main/movies.csv')
train = pd.read_csv('https://raw.githubusercontent.com/BaffinLee/BBC-News-Classification/main/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/BaffinLee/BBC-News-Classification/main/test.csv')

In [23]:
data = pd.concat([train, test], ignore_index=True)
print(data.info())
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   uID     1000209 non-null  int64
 1   mID     1000209 non-null  int64
 2   rating  1000209 non-null  int64
dtypes: int64(3)
memory usage: 22.9 MB
None


    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5


### Prepare train matrix for NMF model.

In [49]:
# Map raw user and movie IDs to contiguous row/column indices.
user_ids = data['uID'].unique()
movie_ids = data['mID'].unique()
user_id_mapping = {uID: idx for idx, uID in enumerate(user_ids)}
movie_id_mapping = {mID: idx for idx, mID in enumerate(movie_ids)}

data['user_idx'] = data['uID'].map(user_id_mapping)
data['movie_idx'] = data['mID'].map(movie_id_mapping)

# Full user-item matrix (train + test ratings); 0 means "no rating".
user_item_matrix = np.zeros((len(user_ids), len(movie_ids)))
user_item_matrix[data['user_idx'].to_numpy(), data['movie_idx'].to_numpy()] = data['rating'].to_numpy()

# Training matrix: hide the test ratings by setting them back to 0.
train_matrix = user_item_matrix.copy()
test_user_idx = test['uID'].map(user_id_mapping).to_numpy()
test_movie_idx = test['mID'].map(movie_id_mapping).to_numpy()
train_matrix[test_user_idx, test_movie_idx] = 0

print(train_matrix.shape)
print(train_matrix)

(6040, 3706)
[[5. 0. 0. ... 0. 0. 0.]
 [0. 4. 0. ... 0. 0. 0.]
 [4. 0. 5. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Train model

In [61]:
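# Factorize the training matrix into user factors W and movie factors H.
# Note: zeros (missing ratings) are fitted as if they were real ratings of 0.
nmf_model = NMF(n_components=25, random_state=42, max_iter=1000)
W = nmf_model.fit_transform(train_matrix)
H = nmf_model.components_
predicted_matrix = np.dot(W, H)

# Round to whole stars, then clip to the valid rating range 1-5.
predicted_matrix_rounded = np.round(predicted_matrix)
predicted_matrix_clipped = np.clip(predicted_matrix_rounded, 1, 5)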

print(predicted_matrix_clipped.shape)
print(predicted_matrix_clipped)

(6040, 3706)
[[5. 2. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [3. 3. 3. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]


### Test model

In [62]:
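# Look up the predicted rating for every (user, movie) pair in the test set.
test_user_idx = test['uID'].map(user_id_mapping).to_numpy()
test_movie_idx = test['mID'].map(movie_id_mapping).to_numpy()
predicted_rating = predicted_matrix_clipped[test_user_idx, test_movie_idx]

# Take the square root explicitly: the `squared=False` argument is deprecated
# in newer scikit-learn releases.
rmse = np.sqrt(mean_squared_error(test['rating'], predicted_rating))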

print(f'RMSE: {rmse}')

RMSE: 2.558737664212988


## Section 2

Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it?

The NMF model's RMSE is 2.5587, while the baseline (predicting a rating of 3 for every entry) achieves 1.259.
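
For reference, a constant baseline like this can be scored in a few lines. The sketch below is only an illustration: the helper name `constant_baseline_rmse` and the toy `ratings` list are assumptions, not part of the notebook, where the true values would come from `test['rating']`.

```python
import numpy as np

def constant_baseline_rmse(true_ratings, constant=3.0):
    """RMSE of predicting the same constant rating for every test entry."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - constant) ** 2)))

# Toy ratings for illustration only; in the notebook this would be test['rating'].
ratings = [5, 4, 5, 2, 5]
baseline_rmse = constant_baseline_rmse(ratings)
print(f'Baseline RMSE: {baseline_rmse:.4f}')
```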

The NMF model clearly performs poorly here; it is even worse than the baseline. NMF is designed to approximate the original matrix as the product of two smaller non-negative matrices.

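In this case, most entries in each user's row are zero because the user never rated those movies: fewer than 5% of the 6040 by 3706 entries hold a rating. NMF does not distinguish a missing rating from a rating of 0, so it fits all of those zeros as if they were real ratings, and the reconstructed values are pulled toward 0. That is why so many of the clipped predictions printed above sit at the minimum of 1. With an underlying matrix this sparse, the model fails to extract meaningful latent features.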
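
The effect of treating zeros as ratings can be seen on a tiny example. This is a sketch on made-up data: the `toy` matrix and the helper `rating_density` are hypothetical and only illustrate the idea. It compares the mean over all entries with the mean over observed ratings only.

```python
import numpy as np

def rating_density(matrix):
    """Fraction of entries that hold an actual (non-zero) rating."""
    matrix = np.asarray(matrix)
    return np.count_nonzero(matrix) / matrix.size

# Toy user-item matrix for illustration only; 0 means "not rated".
toy = np.array([
    [5, 0, 0, 0],
    [0, 4, 0, 0],
    [4, 0, 5, 0],
], dtype=float)

density = rating_density(toy)
mean_all = toy.mean()                # zeros counted as ratings
mean_observed = toy[toy > 0].mean()  # only real ratings

print(f'density={density:.3f}, mean_all={mean_all:.2f}, mean_observed={mean_observed:.2f}')
```

A model fitted to every entry is drawn toward `mean_all`, while the test ratings look like `mean_observed`; the sparser the matrix, the wider that gap becomes.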

To improve the model's performance, I suggest filling the zero (missing) ratings with a realistic value before fitting, such as the user's average rating or a fixed rating of 3. This should help the NMF model perform better in this case.
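
A minimal sketch of the user-mean imputation suggested above, assuming zeros mark missing ratings. The function name `fill_missing_with_user_mean`, the `fallback` parameter, and the toy matrix are illustrative assumptions and were not run as part of this notebook.

```python
import numpy as np

def fill_missing_with_user_mean(matrix, fallback=3.0):
    """Replace zeros (missing ratings) with each user's mean observed rating.

    Users with no ratings at all get the `fallback` value (assumed to be 3).
    """
    matrix = np.asarray(matrix, dtype=float)
    observed = matrix > 0
    counts = observed.sum(axis=1)
    sums = matrix.sum(axis=1)  # zeros add nothing, so this sums observed ratings only
    user_means = np.where(counts > 0, sums / np.maximum(counts, 1), fallback)
    # Keep observed ratings; fill every missing entry with that user's mean.
    return np.where(observed, matrix, user_means[:, None])

# Toy user-item matrix for illustration only; 0 means "not rated".
toy = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [0, 0, 0, 0],
], dtype=float)
filled = fill_missing_with_user_mean(toy)
print(filled)
```

The filled matrix could then be passed to `nmf_model.fit_transform` in place of `train_matrix`. Whether this beats the baseline on this dataset has not been verified here.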