# Limitations of NMF: Movie Rating

## Imports

In [1]:
import os

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.preprocessing import Normalizer

from sklearn.pipeline import Pipeline

from sklearn.decomposition import NMF

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

%matplotlib inline

## Exploratory Data Analysis (EDA)

Let's load the data. It comes from [Kaggle: UnLrW3_movie_ratings_dataset](https://www.kaggle.com/datasets/yu1111/unlrw3-movie-ratings-dataset).

In [2]:
train_csv = os.path.join('./data', 'train.csv')
test_csv = os.path.join('./data', 'test.csv')
movies_csv = os.path.join('./data', 'movies.csv')
users_csv = os.path.join('./data', 'users.csv')

In [3]:
df_train = pd.read_csv(train_csv)
df_test = pd.read_csv(test_csv)
df_movies = pd.read_csv(movies_csv)
df_users = pd.read_csv(users_csv)

In [4]:
df_train.head()

Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


In [5]:
df_movies.head()

Unnamed: 0,mID,title,year,Doc,Com,Hor,Adv,Wes,Dra,Ani,...,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,Toy Story,1995,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,3,Grumpier Old Men,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,1995,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II,1995,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
df_users.head()

Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [28]:
df_train.isna().sum()

uID       0
mID       0
rating    0
dtype: int64

The training dataset has no missing values, so we can go ahead and build the user-movie rating matrix.

In [29]:
r_df = df_train.pivot(index='uID', columns='mID', values='rating')
r_df

uID \ mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,2.0,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,,,,2.0,,,,,,,...,,,,,,,,,,
6037,,,,,,,,,,,...,,,,,,,,,,
6038,,,,,,,,,,,...,,,,,,,,,,
6039,,,,,,,,,,,...,,,,,,,,,,


Now we see a lot of NaNs. We will deal with them later.
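To put a number on "a lot of NaNs", one option is to measure the sparsity of the pivoted matrix. The sketch below uses a small made-up frame (`toy`, not the real `df_train`) with the same `uID`, `mID`, `rating` columns; the same expression should carry over to `r_df`.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long format as train.csv (illustrative values only).
toy = pd.DataFrame({
    "uID": [1, 1, 2, 3],
    "mID": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
toy_matrix = toy.pivot(index="uID", columns="mID", values="rating")

# Fraction of user-movie cells that hold no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"Sparsity: {sparsity:.2%}")
```

Here 5 of the 9 cells are empty, so the toy matrix is about 56% sparse.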

## Non-Negative Matrix Factorization (NMF) model

NMF cannot deal with NaN values, so let's replace them with 0.

In [11]:
y = r_df.fillna(0).to_numpy()
y

array([[5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.]])

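Before building the model, let's implement an RMSE function to evaluate its performance. It only scores the nonzero entries of `y`, which in the zero-filled matrix are the cells that hold a real rating.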

In [12]:
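def rmse_calc(y, yh):
    """RMSE over the nonzero entries of y (zeros are treated as 'not rated')."""
    mask = y.nonzero()   # indices of the rated cells
    yf = y[mask]         # true ratings, already 1-D
    yhf = yh[mask]       # predictions at the same cells
    return np.sqrt(np.mean((yf - yhf) ** 2))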
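
For reference, the quantity computed by `rmse_calc` can be written as follows. The symbol $\Omega$ is notation introduced here, not part of the original notebook: it denotes the set of (user, movie) cells with a nonzero entry in `y`.

$$ \mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|}\sum_{(u,m)\in\Omega}\left(y_{um}-\hat{y}_{um}\right)^2} $$

Cells outside $\Omega$ contribute nothing to the sum, so predictions for unrated movies are never scored as long as unrated cells are stored as 0.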

Let's calculate a baseline RMSE. For that, we assume a trivial model that predicts a rating of 3 for every user-movie pair.

In [15]:
print(f"Base RMSE: {rmse_calc(y, np.full(y.shape, 3.0)).round(2)}")

Base RMSE: 1.26


For rating prediction, we will build the following model.

We will perform NMF on the rating matrix $R$ with an arbitrarily chosen `n_components`. As a result we will get two non-negative matrices $W$ and $H$ whose product approximates $R$:
$$R \approx WH$$
Multiplying $W$ and $H$ back together gives a reconstruction $\hat{R}$ of the rating matrix, which we use as the predicted ratings:
$$\hat{R} = WH$$
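
To make the shapes concrete, here is a minimal sketch of the same factorization on a tiny made-up matrix. The matrix `R`, the choice of 2 components, and the `init` setting are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy 5 users x 4 movies matrix; 0 stands for "not rated" (illustrative only).
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.],
              [0., 1., 5., 4.]])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors, shape (n_users, n_components)
H = model.components_        # movie factors, shape (n_components, n_movies)
R_hat = W @ H                # reconstruction, same shape as R

print(W.shape, H.shape, R_hat.shape)
```

Both factors are non-negative by construction, so every reconstructed rating is a non-negative combination of the latent components.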

In [16]:
nmf = NMF(n_components=20, random_state=42, max_iter=1000)
nmf.fit(y)
w = nmf.transform(y)
h = nmf.components_
y_hat = w.dot(h)
y_hat

array([[1.75705963e+00, 5.52859369e-01, 3.80516698e-02, ...,
        1.12477756e-02, 5.59806123e-03, 8.82495204e-02],
       [1.55143423e+00, 3.50792809e-01, 9.36638351e-02, ...,
        2.30062027e-02, 0.00000000e+00, 4.14951640e-02],
       [7.52640785e-01, 1.55689744e-01, 2.23612969e-02, ...,
        0.00000000e+00, 0.00000000e+00, 1.42194443e-03],
       ...,
       [6.82892148e-01, 1.77780700e-02, 5.18911859e-03, ...,
        5.46306804e-04, 2.18319691e-04, 1.44304428e-03],
       [1.34878448e+00, 2.90711824e-01, 1.18468009e-01, ...,
        4.12772607e-02, 0.00000000e+00, 0.00000000e+00],
       [1.39370538e+00, 9.34118947e-02, 7.78264759e-03, ...,
        8.78888730e-02, 8.45913444e-02, 3.93811803e-01]])

In [20]:
print(f"RMSE: {rmse_calc(y, y_hat).round(2)}")

RMSE: 2.78


The RMSE is terrible: at 2.78 it is more than twice the baseline RMSE of 1.26.

I believe one of the issues is that we used 0 as if it were a neutral fill value.
In reality 0 is not neutral. With ratings in the range [1, 5], it reads as a score below the worst possible rating, so the model is fitted mostly to strongly negative pseudo-ratings.

Let's check this assumption by using 3 as a neutral element. 

In [18]:
y3 = r_df.fillna(3).to_numpy()
y3

array([[5., 3., 3., ..., 3., 3., 3.],
       [3., 3., 3., ..., 3., 3., 3.],
       [3., 3., 3., ..., 3., 3., 3.],
       ...,
       [3., 3., 3., ..., 3., 3., 3.],
       [3., 3., 3., ..., 3., 3., 3.],
       [3., 3., 3., ..., 3., 3., 3.]])

In [19]:
print(f"Base RMSE: {rmse_calc(y3, np.full(y3.shape, 3.0)).round(2)}")

Base RMSE: 0.22


In [23]:
nmf = NMF(n_components=20, random_state=42, max_iter=1000)
nmf.fit(y3)
w = nmf.transform(y3)
h = nmf.components_
y_hat3 = w.dot(h)
y_hat3

array([[3.64647457, 3.07365864, 3.08936123, ..., 2.98023576, 2.98569035,
        2.99837489],
       [3.17836367, 2.99719526, 3.063883  , ..., 2.97749291, 2.9868935 ,
        3.0560864 ],
       [3.294116  , 3.01222244, 3.05370954, ..., 2.98593414, 2.99445055,
        2.98352755],
       ...,
       [3.13715105, 2.99989059, 3.05388814, ..., 2.98885259, 2.98880177,
        2.99764213],
       [3.43487113, 3.0030729 , 3.07690803, ..., 2.99637966, 2.98514955,
        2.96004352],
       [3.56760083, 2.86933872, 3.04654591, ..., 2.99345829, 3.01786141,
        3.16476086]])

In [24]:
print(f"RMSE: {rmse_calc(y3, y_hat3).round(2)}")

RMSE: 0.21


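The NMF reconstruction now scores slightly better than the constant baseline (0.21 vs. 0.22). Still not the best, but at least better than the base. These numbers are not comparable with the zero-filled experiment, though: `y3` contains no zeros, so the `y.nonzero()` mask in `rmse_calc` selects every cell, including the imputed 3s, and the many unrated cells dilute the error.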
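
One way to score only the real ratings regardless of the fill value is to build the mask from the NaNs of the pivoted frame rather than from zeros. The helper `observed_rmse` below is a hypothetical name introduced for this sketch, and the toy matrix is made up; no results for the real data are implied.

```python
import numpy as np
import pandas as pd

def observed_rmse(ratings: pd.DataFrame, predictions: np.ndarray) -> float:
    """RMSE over cells that hold a real rating (non-NaN in `ratings`)."""
    observed = ratings.notna().to_numpy()
    errors = ratings.to_numpy()[observed] - predictions[observed]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy matrix: NaN marks "not rated".
toy = pd.DataFrame([[5.0, np.nan], [np.nan, 1.0]])
constant_three = np.full(toy.shape, 3.0)

# Scored on the two real ratings only.
rmse_observed = observed_rmse(toy, constant_three)

# Scored over the 3-filled matrix: the imputed cells add zero error.
filled = toy.fillna(3).to_numpy()
rmse_all_cells = float(np.sqrt(np.mean((filled - constant_three) ** 2)))

print(rmse_observed, rmse_all_cells)
```

On this toy input the observed-only RMSE is 2.0, while averaging over all four cells gives about 1.41, which illustrates how imputed cells can make a score look better than it is.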

## Conclusion

The application of Non-negative Matrix Factorization (NMF) to predicting movie ratings performed poorly. In the basic zero-filled case, the RMSE of NMF (2.78) was more than twice that of simply predicting a rating of 3 everywhere (1.26).

This poor performance can be attributed to several factors. NMF models each rating as a linear combination of latent factors, which may be too simplistic to capture complex rating patterns. The rating matrix is also very sparse, since most users rate only a small fraction of the available movies, and because NMF cannot handle NaN, every filled-in cell is fitted as if it were a real rating.

To improve the prediction performance, several approaches could be implemented:

1. Incorporate bias terms to account for user and movie rating tendencies, as some users tend to rate higher or lower than others, and some movies consistently receive higher or lower ratings. For example, we could replace NaNs with the per-user mean or median rating.

2. Combine NMF with other techniques in a hybrid approach, such as integrating content-based features (movie genres, actors, directors) alongside the collaborative filtering aspects of NMF.

3. Consider other factorization approaches, such as SVD-based models, that might better handle the inherent structure of rating data.
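
As an illustration of the first suggestion, the sketch below fills each user's missing ratings with that user's own mean. The toy values are made up, and this is only one possible way to account for user tendencies, not a method evaluated in this notebook.

```python
import numpy as np

# Toy 2 users x 3 movies matrix; NaN marks "not rated" (illustrative only).
values = np.array([[5.0, np.nan, 4.0],
                   [np.nan, 1.0, 2.0]])

# Mean of each user's observed ratings, ignoring NaNs.
user_means = np.nanmean(values, axis=1, keepdims=True)

# Replace each missing cell with the mean of its own row (user).
filled = np.where(np.isnan(values), user_means, values)
print(filled)
```

The filled matrix stays non-negative, so it could be passed to `NMF` in the same way as `y` and `y3` above.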

## Resources

1. [Kaggle: UnLrW3_movie_ratings_dataset](https://www.kaggle.com/datasets/yu1111/unlrw3-movie-ratings-dataset)
2. [Collaborative Filtering: Matrix Factorization Recommender System](https://www.jiristodulka.com/post/recsys_cf/)
3. [Chang Liu. Personalized Recommendation Algorithm for Movie Data Combining Rating Matrix and User Subjective Preference](https://onlinelibrary.wiley.com/doi/10.1155/2022/2970514)