# **Limitation(s) of sklearn’s non-negative matrix factorization library**
These are the questions to answer:

1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]

Make sure that your notebook includes the following:

* uses sklearn's non-negative matrix factorization
* shows the RMSE with an analysis of what that RMSE means

2. Discuss the results and why they did not work well compared to simple baseline or similarity-based methods we’ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]

# **Question 1**

```python
# Import only the libraries this cell uses
from collections import namedtuple

import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Load the data
rep = "/kaggle/input/ratings"
MV_users = pd.read_csv(rep + '/users.csv')
MV_movies = pd.read_csv(rep + '/movies.csv')
train = pd.read_csv(rep + '/train.csv')
test = pd.read_csv(rep + '/test.csv')
Data = namedtuple('Data', ['users', 'movies', 'train', 'test'])
data = Data(MV_users, MV_movies, train, test)

# Build the user x movie rating matrix; missing ratings are filled with 0
pivot = data.train.pivot(index='uID', columns='mID', values='rating').fillna(0)

# Use the number of distinct rating values as the number of latent components
n = len(data.train['rating'].unique())
nmf = NMF(n_components=n, random_state=42)
W = nmf.fit_transform(pivot)  # user factors
H = nmf.components_           # movie factors
WH = np.dot(W, H)             # reconstructed rating matrix

preds = pd.DataFrame(WH, index=pivot.index, columns=pivot.columns)

# RMSE over the observed (non-zero) entries of the training matrix
y_true = pivot.values.flatten()
y_pred = WH.flatten()
idx = np.nonzero(y_true)
err = y_true[idx] - y_pred[idx]
rmse = np.sqrt(sum(np.power(err, 2)) / len(err))

print("RMSE is: ", rmse)
```


```
RMSE is:  2.968076162331461
```


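An RMSE of about 2.97 is very large, which indicates that NMF applied this way is a poor predictor of user ratings. Note that this value is measured on the observed ratings of the training pivot table, not on `test.csv`. The simple baseline and similarity-based methods from Module 3 reach an RMSE of around 1, so NMF performs considerably worse than those methods.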
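A minimal sketch of how the held-out ratings could be scored from the reconstructed matrix, assuming the same `uID`, `mID` and `rating` columns as above. The helper `rmse_on_test`, its `fallback` argument and the tiny data frames are hypothetical and only serve as illustration; they are not part of the original notebook or of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF


def rmse_on_test(preds: pd.DataFrame, test: pd.DataFrame, fallback: float) -> float:
    """RMSE of predicted ratings against held-out (uID, mID, rating) rows.

    preds    : users x movies DataFrame of reconstructed ratings (W @ H)
    test     : DataFrame with columns uID, mID, rating
    fallback : prediction used when a user or movie never appears in preds
    """
    # Pairs whose user or movie was not seen in training have no cell in preds
    known = test['uID'].isin(preds.index) & test['mID'].isin(preds.columns)
    y_pred = np.full(len(test), fallback, dtype=float)
    rows = preds.index.get_indexer(test.loc[known, 'uID'])
    cols = preds.columns.get_indexer(test.loc[known, 'mID'])
    y_pred[known.to_numpy()] = preds.to_numpy()[rows, cols]
    err = test['rating'].to_numpy() - y_pred
    return float(np.sqrt(np.mean(err ** 2)))


# Tiny made-up example (not the course data)
train = pd.DataFrame({'uID': [1, 1, 2, 2, 3],
                      'mID': [10, 20, 10, 30, 20],
                      'rating': [5, 3, 4, 2, 1]})
test = pd.DataFrame({'uID': [1, 3, 4],
                     'mID': [30, 10, 10],
                     'rating': [4, 2, 5]})

pivot = train.pivot(index='uID', columns='mID', values='rating').fillna(0)
nmf = NMF(n_components=2, init='nndsvda', random_state=42, max_iter=500)
W = nmf.fit_transform(pivot.to_numpy())
preds = pd.DataFrame(W @ nmf.components_, index=pivot.index, columns=pivot.columns)

# Users or movies missing from training fall back to the global training mean
test_rmse = rmse_on_test(preds, test, fallback=train['rating'].mean())
print("Test RMSE:", test_rmse)
```

Falling back to the global mean for unseen users or movies is one possible choice; any other default would work with the same helper.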

# **Question 2**

* Explanation for poor performance:
  * sklearn's NMF has no way to ignore missing entries. The user-rating matrix is sparse, and after `fillna(0)` every empty cell is treated as a real rating of 0, so the factorization mostly learns to reproduce zeros, which degrades its accuracy drastically
  * NMF assumes that the latent factors combine linearly. If their effect on the ratings is non-linear, or if there are too many latent factors for the model to capture, the model performs poorly


* Improvement
  * We can fill in the missing entries of the sparse matrix with an approximation (for example, the mean rating of each user or movie) so that the model does not treat a huge amount of missing data as zeros
  * We can combine the similarity-based method with NMF to make up for its shortcomings
  * Finally, we can work on the data itself to provide more meaningful features and adapt it to NMF
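The first improvement can be sketched on synthetic data. The example below is an illustration under stated assumptions, not a result on the course dataset: the low-rank matrix `R`, the 20% observation rate and the helper `heldout_rmse` are all made up. It compares zero-filling with filling each user's missing entries by that user's mean observed rating, and measures RMSE only on the hidden entries.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic "true" ratings: a rank-3 non-negative matrix clipped to the range 1..5
U = rng.uniform(0.5, 1.5, size=(60, 3))
V = rng.uniform(0.5, 1.5, size=(3, 40))
R = np.clip(U @ V, 1.0, 5.0)

# Only about 20% of the ratings are observed, the rest are hidden
observed = rng.random(R.shape) < 0.2

# Variant 1: zero-fill, as done with fillna(0)
X_zero = np.where(observed, R, 0.0)

# Variant 2: fill with each user's mean observed rating
# (global mean as a fallback for users with no observed rating)
counts = observed.sum(axis=1)
sums = X_zero.sum(axis=1)
global_mean = R[observed].mean()
user_mean = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
X_mean = np.where(observed, R, user_mean[:, None])


def heldout_rmse(X_filled, R, observed, n_components=3):
    """Fit NMF on the filled matrix and score it on the hidden entries only."""
    nmf = NMF(n_components=n_components, init='nndsvda',
              random_state=42, max_iter=1000)
    W = nmf.fit_transform(X_filled)
    recon = W @ nmf.components_
    err = R[~observed] - recon[~observed]
    return float(np.sqrt(np.mean(err ** 2)))


rmse_zero = heldout_rmse(X_zero, R, observed)
rmse_mean = heldout_rmse(X_mean, R, observed)
print("held-out RMSE, zero-filled:", rmse_zero)
print("held-out RMSE, mean-filled:", rmse_mean)
```

With zero-filling, the reconstruction is pulled toward 0 on the hidden cells, whereas mean-filling keeps predictions near plausible rating values. A factorization that skips missing entries in its loss would avoid imputation altogether, but that is outside what sklearn's `NMF` offers.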