### Sklearn NMF on the MovieLens Dataset

[Sklearn's NMF Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.decomposition import NMF

#### Load Datasets

MovieLens 1M data retrieved from [Kaggle](https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset)

In [49]:
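# A multi-character separator needs the Python parsing engine; naming it avoids a ParserWarning
rat = pd.read_csv('ratings.dat', sep='::', header=None, engine='python')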
rat.columns = ['user', 'movie', 'rating', 'timestamp']
rat.head()

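|   | user | movie | rating | timestamp |
|---|------|-------|--------|-----------|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |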


#### Split Into Train and Test Sets

In [61]:
from sklearn.model_selection import train_test_split

X = rat[['user', 'movie', 'timestamp']].to_numpy()
y = rat['rating'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

#### Scale Data, Create and Fit Model

In [65]:
from sklearn import decomposition, preprocessing, metrics

# NMF does not allow negative input, so we don't want to center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

# n_components=None keeps all features (3 here), so no dimensionality reduction takes place
nmf = decomposition.NMF(n_components=None, max_iter=500, random_state=8, init='random').fit(X_train_sc)
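
To illustrate why the scaler is created with `with_mean=False`, here is a minimal sketch on a small made-up matrix (`X_demo` and its values are hypothetical, not MovieLens data). Scaling by the standard deviation alone keeps non-negative input non-negative, while the default centering step introduces negative entries.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical non-negative toy matrix (not taken from the MovieLens data)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Divide by the per-column standard deviation only: values stay non-negative
scaled_only = StandardScaler(with_mean=False).fit_transform(X_demo)

# Default behaviour also subtracts the column mean: some values become negative
centered = StandardScaler().fit_transform(X_demo)

print("min without centering:", scaled_only.min())
print("min with centering:   ", centered.min())
```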

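#### Calculate Reconstruction Error

The error is at least balanced between train and test, at around 0.67. Note that `metrics.mean_squared_error` returns the MSE, so the corresponding RMSE is about 0.82, and that it measures how well NMF reconstructs the scaled feature matrix rather than how well it predicts ratings. The gradient boosted classifier used for comparison (below) reports accuracy, not RMSE, so its score of about 0.35 is not directly comparable to these numbers.

In [66]:
# mean_squared_error returns the MSE; take np.sqrt of it to get the RMSE
prediction = nmf.inverse_transform(nmf.transform(X_train_sc))
print(f'Train Set MSE: {metrics.mean_squared_error(X_train_sc, prediction)}')

prediction = nmf.inverse_transform(nmf.transform(X_test_sc))
print(f'Test Set MSE: {metrics.mean_squared_error(X_test_sc, prediction)}')

Train Set MSE: 0.6738686972921465
Test Set MSE: 0.6739360760282939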
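
By default, `metrics.mean_squared_error` returns the mean squared error; taking its square root gives the RMSE. A minimal sketch with made-up values (`y_true` and `y_pred` are hypothetical and not taken from the notebook):

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 4.0, 5.0, 1.0])
y_pred = np.array([2.0, 4.0, 4.0, 3.0])

# Squared errors are 1, 0, 1, 4, so the mean is 1.5
mse = metrics.mean_squared_error(y_true, y_pred)

# RMSE is the square root of the MSE
rmse = np.sqrt(mse)

print(f"MSE:  {mse}")
print(f"RMSE: {rmse}")
```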


#### Compare With Supervised Method

In [81]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.05, max_depth=5, random_state=8)
clf.fit(X_train, y_train)

# For a classifier, score() returns mean accuracy on the test set, not RMSE
print(clf.score(X_test, y_test))

0.3538649806241077
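
Because `score` on a classifier reports accuracy, an RMSE has to be computed from the predicted labels. The sketch below shows both metrics side by side on randomly generated data (`X_demo`, `y_demo`, and the model settings are assumptions for illustration, not the MovieLens data or the settings used above).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical random features and ratings in 1..5, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = rng.integers(1, 6, size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

clf_demo = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
clf_demo.fit(X_tr, y_tr)
pred = clf_demo.predict(X_te)

# score() is mean accuracy: the fraction of exactly matching labels
acc = clf_demo.score(X_te, y_te)

# RMSE treats the predicted labels as numeric ratings
rmse = np.sqrt(mean_squared_error(y_te, pred))

print(f"accuracy: {acc:.3f}")
print(f"RMSE:     {rmse:.3f}")
```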


### Commentary on Limitations of NMF

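Exact NMF is NP-hard in general, so practical solvers rely on iterative heuristics that do not guarantee a globally optimal factorization, and the cost of fitting grows as the size of the dataset increases.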

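Another issue with NMF is that the decomposition is generally not unique: different pairs of W and H can reproduce the same matrix equally well. In practice the factors are found by iteratively optimizing an objective (for example a likelihood function or, in sklearn, a beta-divergence loss that defaults to the Frobenius norm), which only guarantees a local optimum.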
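
One simple way to see the lack of uniqueness is to rescale the factors: for any diagonal matrix D with positive entries, (W D)(D⁻¹ H) gives the same product as W H while both factors stay non-negative. A minimal NumPy sketch with randomly generated factors (all names and values here are hypothetical):

```python
import numpy as np

# Hypothetical non-negative factors, for illustration only
rng = np.random.default_rng(0)
W = rng.random((4, 2))
H = rng.random((2, 3))

# Any positive diagonal rescaling yields a different but equally valid factorization
D = np.diag([2.0, 0.5])
W2 = W @ D
H2 = np.linalg.inv(D) @ H

print("same product:    ", np.allclose(W @ H, W2 @ H2))
print("different factor:", not np.allclose(W, W2))
```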

It's also sensitive to how the W and H matrices are initialized, which is why `init` and `random_state` are fixed when the model is created above.
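
A rough way to check this sensitivity is to fit the same matrix several times with different seeds and compare the final reconstruction error. The sketch below uses a small random matrix (`V`, the component count, and the seeds are assumptions for illustration, not the MovieLens data); how much the errors differ depends on the data and the solver.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix, for illustration only
rng = np.random.default_rng(0)
V = rng.random((30, 8))

errors = {}
for seed in (0, 1, 2):
    # Same data and settings; only the random initialization of W and H changes
    model = NMF(n_components=3, init="random", random_state=seed, max_iter=500)
    model.fit(V)
    # reconstruction_err_ is the final loss between V and the fitted W @ H
    errors[seed] = model.reconstruction_err_

for seed, err in errors.items():
    print(f"random_state={seed}: reconstruction error = {err:.6f}")
```

Using a deterministic initialization such as `init='nndsvd'` removes the dependence on the seed, at the cost of committing to one particular starting point.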