GitHub: https://github.com/DataChemE/Unsupervised_Learning_text_classification

# Matrix Factorization Techniques for Movie Ratings

## Introduction

In this notebook, we apply matrix factorization to movie rating data to predict missing ratings. We use the `scikit-learn` library to implement non-negative matrix factorization (NMF) and compare its performance against the simple baseline and similarity-based methods from the week 3 assignment.

Matrix factorization is a class of collaborative filtering algorithms used in recommendation systems. The idea is to decompose the user-item interaction matrix into the product of two lower-dimensional matrices. This matrix is typically sparse, with missing entries corresponding to unobserved ratings. The goal is to fill in these missing entries by learning the latent factors that best explain the observed ratings.
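
As a minimal sketch of the decomposition described above, the toy example below factorizes a small, fully observed rating matrix into a user-factor matrix `W` and an item-factor matrix `H`, then multiplies them back together. The matrix values, the rank of 2, and the `nndsvda` initialization are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4 users x 5 movies rating matrix (fully observed for simplicity)
R = np.array([
    [5, 4, 1, 1, 2],
    [4, 5, 1, 2, 1],
    [1, 1, 5, 4, 4],
    [2, 1, 4, 5, 5],
], dtype=float)

# Factorize R into W (users x factors) and H (factors x movies), both non-negative
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
W = model.fit_transform(R)
H = model.components_

# The product W @ H is the low-rank approximation of R
R_hat = W @ H
print(W.shape, H.shape)  # (4, 2) (2, 5)
print(np.round(R_hat, 1))
```

Each row of `W` describes a user in terms of two latent factors, and each column of `H` describes a movie in terms of the same factors. A predicted rating is the dot product of the two.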

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from scipy.sparse import coo_matrix
from sklearn.metrics import mean_squared_error
from collections import namedtuple

In [3]:
# Load movie ratings data
MV_users = pd.read_csv('movie_data/users.csv')
MV_movies = pd.read_csv('movie_data/movies.csv')
train = pd.read_csv('movie_data/train.csv')
test = pd.read_csv('movie_data/test.csv')

In [4]:
# Bundle the four DataFrames into one container that is passed to RecSys
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [11]:
class RecSys():
    def __init__(self, data):
        self.data = data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID, list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID, list(range(len(self.data.users)))))
        self.Mr = self.rating_matrix()  
        self.nmf_model = None
        self.user_features = None
        self.item_features = None

    def rating_matrix(self):
        """ Build a dense (#allusers, #allmovies) array of training ratings; unrated entries are 0 """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID]
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        shape = (len(self.allusers), len(self.allmovies))
        return coo_matrix((rating_train, (ind_user, ind_movie)), shape=shape).toarray()

    def fit_nmf(self, n_components=15):
        """ Fit NMF model to the ratings matrix """
        self.nmf_model = NMF(n_components=n_components, init='random', random_state=10, max_iter=1000)
        self.user_features = self.nmf_model.fit_transform(self.Mr)
        self.item_features = self.nmf_model.components_

    def predict_nmf(self):
        """ Predict ratings using the factorized matrices from NMF """
        if self.nmf_model is None:
            raise RuntimeError("NMF model is not fitted, call fit_nmf() first.")
        # Reconstruct the full rating matrix as (users x factors) @ (factors x movies)
        return np.dot(self.user_features, self.item_features)

    def calculate_rmse(self):
        """ Calculate the RMSE for the predicted ratings against the actual test set ratings """
        y_pred = self.predict_nmf()
        # Row (user) and column (movie) positions of every test rating
        rows = self.data.test.uID.map(self.uid2idx).to_numpy()
        cols = self.data.test.mID.map(self.mid2idx).to_numpy()
        y_true = self.data.test.rating.to_numpy()
        return np.sqrt(mean_squared_error(y_true, y_pred[rows, cols]))
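
The test-set lookup in `calculate_rmse` relies on NumPy integer array indexing: indexing a 2-D array with a pair of equal-length index arrays returns one element per (row, column) pair. The sketch below illustrates this on a tiny, made-up prediction matrix. The names `pred`, `rows`, `cols`, and all values are hypothetical.

```python
import numpy as np

# Hypothetical 2 users x 3 movies prediction matrix
pred = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Row (user) and column (movie) positions of three test ratings
rows = np.array([0, 1, 1])
cols = np.array([2, 0, 1])

# Picks elements (0, 2), (1, 0) and (1, 1), one per test rating
picked = pred[rows, cols]
print(picked)  # [3. 4. 5.]

# RMSE against made-up true ratings for the same three positions
y_true = np.array([3.0, 5.0, 5.0])
rmse = np.sqrt(np.mean((y_true - picked) ** 2))
print(rmse)
```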
    

In [10]:
# Build a smaller sample (up to the first 30,000 rows of train and test) for a quick experiment
np.random.seed(42)
sample_train = train[:30000]
sample_test = test[:30000]


sample_MV_users = MV_users[(MV_users.uID.isin(sample_train.uID)) | (MV_users.uID.isin(sample_test.uID))]
sample_MV_movies = MV_movies[(MV_movies.mID.isin(sample_train.mID)) | (MV_movies.mID.isin(sample_test.mID))]


sample_data = Data(sample_MV_users, sample_MV_movies, sample_train, sample_test)


rec_sys = RecSys(sample_data)
rec_sys.fit_nmf(n_components=20)  
predicted_ratings = rec_sys.predict_nmf()
rmse = rec_sys.calculate_rmse()
print("RMSE for NMF-based predictions:", rmse)

RMSE for NMF-based predictions: 3.729573183874525


# Comparison with the Week 3 Cosine Similarity Results

Below are the results from the previous week's assignment, which used cosine similarity in a collaborative filtering approach. On the same sample data, the RMSE for NMF above is roughly 3.6 times as high as that of the similarity-based method.

Cosine Similarity RMSE: 1.0263081874204125

![Week 3 cosine similarity results](./similarity_comparisons.png)

NMF and cosine similarity likely differ in their RMSE performance for a few key reasons:

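- Handling Sparse Data: The rating matrix built above stores every unrated user-movie pair as 0, and `scikit-learn`'s NMF fits all entries of the matrix, so it treats those zeros as real ratings. With most entries missing, the reconstruction is pulled toward 0, which likely accounts for much of the test error. In contrast, cosine similarity focuses on the overlapping entries between vectors, making it more effective for sparse datasets.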

- Sensitivity to Data Scale: NMF's performance is sensitive to the scale of the input data, which can affect the approximation error. Cosine similarity, on the other hand, measures the angle between vectors, so it is invariant to multiplicative rescaling and more robust to users who rate on different scales.

- Model Complexity and Overfitting: NMF has more free parameters to fit, which raises the risk of overfitting, especially when there are many items relative to users. Cosine similarity, being simpler and more interpretable, is less prone to overfitting.
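
The first two points above can be illustrated with a small synthetic experiment. This is a sketch under stated assumptions, not a reproduction of the notebook's results: the rating matrix is random, about 20% of entries are kept as observed, the rank of 5 and the `nndsvda` initialization are arbitrary choices, and the `cosine` helper is defined here for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical "true" ratings between 1 and 5 for 60 users x 40 movies
true_ratings = rng.integers(1, 6, size=(60, 40)).astype(float)

# Keep about 20% of entries as observed; the rest become 0, as in a zero-filled matrix
observed = rng.random(true_ratings.shape) < 0.2
zero_filled = np.where(observed, true_ratings, 0.0)

# Fit NMF on the zero-filled matrix, so the zeros count as real targets
model = NMF(n_components=5, init='nndsvda', random_state=0, max_iter=1000)
W = model.fit_transform(zero_filled)
pred = W @ model.components_

# Compare predictions with the true ratings on the entries the model never saw
hidden = ~observed
mean_true_hidden = true_ratings[hidden].mean()
mean_pred_hidden = pred[hidden].mean()
print(f"mean true rating on hidden entries:      {mean_true_hidden:.2f}")
print(f"mean predicted rating on hidden entries: {mean_pred_hidden:.2f}")

# Illustrative helper: cosine similarity depends only on the angle between vectors
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 2.0, 3.0])
print(cosine(u, 2 * u))  # doubling every rating leaves the similarity essentially unchanged
```

Because the hidden entries are 0 in the training matrix, the reconstruction there stays well below the true ratings. A masked or imputation-based factorization would avoid this, at the cost of a more involved fitting procedure.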