# DTSA-5510 Week 4 Assignment

In part 2 of this assignment, I applied non-negative matrix factorization (NMF) to the movie ratings data set from module 3 and compared results with the recommender systems from that assignment.

## Part 2

We'll import data from the movie ratings recommender system assignment in module 3, build an NMF model to predict users' movie ratings, and compare the predictions with the actual ratings matrix.

We'll create a `ratings_matrix` with rows representing users and columns representing movies. Each entry is a user's rating from 1 to 5, and a 0 indicates that the user has not rated that movie.

In [21]:
movies_train_df = pd.read_csv("./data/movies_data/train.csv")

ratings_df = movies_train_df.pivot(index='uID', columns='mID', values='rating').fillna(0)
ratings_matrix = ratings_df.to_numpy().astype(float)
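To make the pivot step concrete, here is a minimal sketch on a hypothetical four-row table. It assumes the same column names as `train.csv` (`uID`, `mID`, `rating`); the values and the `toy_` names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: three users, three movies, four ratings.
toy_df = pd.DataFrame({
    "uID": [1, 1, 2, 3],
    "mID": [10, 20, 10, 30],
    "rating": [5, 3, 4, 1],
})

# Rows become users and columns become movies.
# User/movie pairs with no rating are NaN after the pivot, then 0 after fillna(0).
toy_ratings_df = toy_df.pivot(index="uID", columns="mID", values="rating").fillna(0)
toy_matrix = toy_ratings_df.to_numpy().astype(float)

print(toy_ratings_df)
```

Only four of the nine cells hold real ratings; the remaining five are filled with 0, which is why the zeros need special handling later on.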

Next we'll use the scikit-learn [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) class again to factor `ratings_matrix` into two non-negative matrices, W and H. We'll choose `n_components=20` to match the 20 metadata columns for each movie in the `MV_movies` data set from module 3.

In [22]:
nmf_model = NMF(n_components=20, init='random', random_state=42, max_iter=500)
W = nmf_model.fit_transform(ratings_matrix)
H = nmf_model.components_

Now we'll approximate the `ratings_matrix` by matrix-multiplying W and H together.

In [23]:
WH = np.dot(W, H)

Then we'll calculate the element-wise RMSE between `ratings_matrix` and `WH`. We exclude the 0 entries of `ratings_matrix` from the calculation because they represent missing ratings (movies the user did not rate), not actual ratings.

In [24]:
mask = ratings_matrix > 0
rmse = np.sqrt(np.mean((ratings_matrix[mask] - WH[mask]) ** 2))
print(f"RMSE = {rmse}")

RMSE = 2.775512629076963
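As a small illustration of what the mask does, the sketch below compares the masked RMSE with an RMSE over all entries. The 2x2 matrices `R` and `P` are hypothetical and are not taken from the assignment data.

```python
import numpy as np

# Hypothetical ratings (0 = not rated) and hypothetical predictions.
R = np.array([[5.0, 0.0],
              [0.0, 3.0]])
P = np.array([[4.0, 2.0],
              [1.0, 3.0]])

observed = R > 0  # True only where a real rating exists

# RMSE over observed ratings only: errors are 1 and 0.
masked_rmse = np.sqrt(np.mean((R[observed] - P[observed]) ** 2))

# RMSE over every entry: the zeros are scored as if they were ratings.
full_rmse = np.sqrt(np.mean((R - P) ** 2))

print(f"masked RMSE = {masked_rmse:.4f}")  # 0.7071
print(f"full RMSE   = {full_rmse:.4f}")    # 1.2247
```

The masked version scores the predictions only where a rating exists, which matches the calculation in the cell above.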


The RMSE of ~2.78 is considerably higher than the best recommender system RMSE of ~0.95 from module 3. This is likely because scikit-learn's NMF fits every entry of `ratings_matrix`, so each 0 is treated as an actual rating rather than a missing one. If most entries are 0, as is typical for ratings data, the factorization is pulled toward zero and underestimates the observed ratings. An NMF variant that excludes the missing entries from the loss would likely reduce this error.
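One way to exclude unrated entries from the loss is a weighted (masked) NMF with multiplicative updates. The following is a minimal sketch of that idea, not the method used in this assignment and not part of scikit-learn's `NMF` class. The function name `masked_nmf`, its parameters, and the toy matrix are all hypothetical.

```python
import numpy as np

def masked_nmf(X, n_components, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical helper: NMF fitted only on entries where X > 0."""
    rng = np.random.default_rng(seed)
    mask = (X > 0).astype(float)  # 1 for observed ratings, 0 for missing
    n_users, n_movies = X.shape
    W = rng.random((n_users, n_components)) + eps
    H = rng.random((n_components, n_movies)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the masked squared error
        # ||mask * (X - W @ H)||^2; missing entries contribute nothing.
        W *= ((mask * X) @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ (mask * X)) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

def masked_rmse(X, W, H):
    """RMSE between X and W @ H over observed (non-zero) entries only."""
    observed = X > 0
    return np.sqrt(np.mean((X[observed] - (W @ H)[observed]) ** 2))

# Hypothetical toy ratings matrix (0 = not rated).
X_toy = np.array([[1.0, 2.0, 0.0],
                  [2.0, 0.0, 3.0],
                  [0.0, 5.0, 3.75]])

W_toy, H_toy = masked_nmf(X_toy, n_components=2)
print(f"masked RMSE = {masked_rmse(X_toy, W_toy, H_toy):.4f}")
```

Because the mask zeroes out both the numerator and denominator contributions of unrated cells, the factors are shaped only by real ratings. Whether this would actually close the gap with the module 3 result on the full data set is untested here, and a fair comparison would also need a held-out test set.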