### Grading
The final score that you will receive for your programming assignment is generated in relation to the total points set in your programming assignment item—not the total point value in the nbgrader notebook.<br>
When calculating the final score shown to learners, the programming assignment takes the percentage of earned points vs. the total points provided by nbgrader and returns a score matching the equivalent percentage of the point value for the programming assignment. <br>
**DO NOT CHANGE VARIABLE OR METHOD SIGNATURES** The autograder will not work properly if you change the variable or method signatures.
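
As a rough illustration of the score scaling described above, the following sketch converts nbgrader points into the assignment item's point value. The function name `scaled_score` and the numbers are made up for this example; the platform's actual rounding rules are not specified here.

```python
def scaled_score(earned_points, nbgrader_total, assignment_points):
    """Scale nbgrader points to the assignment item's point value (hypothetical helper)."""
    if nbgrader_total <= 0:
        raise ValueError("nbgrader_total must be positive")
    # Fraction earned in nbgrader, applied to the assignment item's total
    return earned_points / nbgrader_total * assignment_points

# Made-up numbers: 45 of 60 nbgrader points on a 100-point assignment item
print(scaled_score(45, 60, 100))
```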

### Validate Button
Please note that this assignment uses nbgrader to facilitate grading. You will see a **validate button** at the top of your Jupyter notebook. Clicking this button runs the lab's test cases that aren't hidden. It is a good idea to use the validate button before submitting the lab. Note that the labs in the course also contain hidden test cases, and the validate button will not tell you whether these pass. After submitting your lab, you can see more information about the hidden test cases in the Grader Output. <br>
***Cells with longer execution times will cause the validate button to time out and freeze. Please know that if you run into Validate time-outs, it will not affect the final submission grading.*** <br>

# Building Recommender Systems for Movie Rating Prediction

In this assignment, we will build recommender systems that predict movie ratings. [MovieLens](https://grouplens.org/datasets/movielens/) currently has 25 million user-movie ratings. Since the full dataset is too big, we use a 1 million rating subset, [MovieLens 1M](https://www.kaggle.com/odedgolden/movielens-1m-dataset), which we reformatted to make it more convenient to use.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx

In [2]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [3]:
from collections import namedtuple
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [4]:
print(data.users[:2])
print(data.movies[:2])
print(data.train[:2])
print(data.test[:2])

   uID gender  age  accupation    zip
0    1      F    1          10  48067
1    2      M   56          16  70072
   mID      title  year  Doc  Com  Hor  Adv  Wes  Dra  Ani  ...  Chi  Cri  \
0    1  Toy Story  1995    0    1    0    0    0    0    1  ...    1    0   
1    2    Jumanji  1995    0    0    0    1    0    0    0  ...    1    0   

   Thr  Sci  Mys  Rom  Fil  Fan  Act  Mus  
0    0    0    0    0    0    0    0    0  
1    0    0    0    0    0    1    0    0  

[2 rows x 21 columns]
    uID   mID  rating
0   744  1210       5
1  3040  1584       4
    uID  mID  rating
0  2233  440       4
1  4274  587       5


### Starter code
Now, we will build a recommender system that uses various techniques to predict ratings. 
The `class RecSys` has baseline prediction methods (such as predicting 3 for everything, or predicting each user's average rating) and other utility functions. `class ContentBased` and `class Collaborative` inherit from `class RecSys` and add methods that calculate the item-item similarity matrix. You will complete those functions using what we learned about content-based filtering and collaborative filtering.

`RecSys`'s `rating_matrix` method converts the (user id, movie id, rating) triplet from the train data (train data's ratings are known) into a utility matrix for 6040 users and 3883 movies.    
Here, we create the utility matrix in a dense matrix (numpy.array) format for convenience. With real-world data, where hundreds of millions of users and items may exist, we would not be able to create the utility matrix in a dense format (if you are curious why, try measuring the dense matrix self.Mr using its .nbytes attribute). In that case, we would use sparse matrix operations as much as possible, and distributed file systems and distributed computing would be needed. Fortunately, our data is small enough to fit in a laptop or PC's memory. We will also use numpy and scipy.sparse, which allow significantly faster calculations than operating on pandas.DataFrame objects.    
In the `rating_matrix` method, pay attention to the index mapping, as user IDs and movie IDs are not the same as the array indices.
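
The index mapping and memory point can be sketched on toy data as follows. This is only an illustration, not the starter code's `rating_matrix` implementation: the tiny DataFrames and the names `uid2idx`, `mid2idx`, `sparse_Mr`, and `dense_Mr` are made up for this example, and the memory estimate assumes 8-byte (float64) entries.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy data (made up); IDs are deliberately non-contiguous
users = pd.DataFrame({"uID": [10, 20, 30]})
movies = pd.DataFrame({"mID": [5, 7, 100, 101]})
train = pd.DataFrame({"uID": [10, 30, 20], "mID": [7, 100, 5], "rating": [4, 5, 3]})

# Map each ID to its row/column position in the utility matrix
uid2idx = {uid: i for i, uid in enumerate(users["uID"])}
mid2idx = {mid: j for j, mid in enumerate(movies["mID"])}

rows = train["uID"].map(uid2idx).to_numpy()
cols = train["mID"].map(mid2idx).to_numpy()
vals = train["rating"].to_numpy()

# Build the matrix in sparse form, then densify (fine here because it is tiny)
sparse_Mr = coo_matrix((vals, (rows, cols)), shape=(len(users), len(movies))).tocsr()
dense_Mr = sparse_Mr.toarray()

print(dense_Mr)
print("stored ratings:", sparse_Mr.nnz)
print("dense bytes:", dense_Mr.nbytes)  # nbytes is an attribute, not a method

# Size of a dense matrix at the assignment's dimensions, assuming 8-byte entries
print("6040 x 3883 dense:", 6040 * 3883 * 8 / 1e6, "MB")
```

At the assignment's size the dense matrix takes roughly 188 MB under that 8-byte assumption, which still fits in memory; the same layout with hundreds of millions of users and items would not.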

In [12]:
# Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) 
# and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library.
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

class NMFRecSys(RecSys):
    def __init__(self, data, n_components=20):
        super().__init__(data)
        self.n_components = n_components
        self.U = None
        self.V = None

    def fit(self):
        # Create the ratings matrix
        num_users = len(self.allusers)
        num_movies = len(self.allmovies)
        ratings_matrix = np.zeros((num_users, num_movies))

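        # Map movie IDs to column indices
        movie_id_to_idx = {mID: idx for idx, mID in enumerate(self.allmovies)}

        # Row index uID - 1 assumes user IDs run contiguously from 1 to num_users.
        # Entries with no rating in the train data stay 0.
        for uID, mID, rating in self.data.train[['uID', 'mID', 'rating']].values:
            ratings_matrix[uID-1, movie_id_to_idx[mID]] = rating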

        # Apply NMF to decompose the rating matrix
        nmf = NMF(n_components=self.n_components, random_state=42)
        self.U = nmf.fit_transform(ratings_matrix)
        self.V = nmf.components_

    def predict(self):
        # Reconstruct the ratings matrix by multiplying U and V
        predicted_ratings = self.U.dot(self.V)

        # Get the predicted ratings for the test dataset
        predicted_test_ratings = []
        # Map movie IDs to column indices; rows use uID - 1, as in fit()
        movie_id_to_idx = {mID: idx for idx, mID in enumerate(self.allmovies)}
        for uID, mID in self.data.test[['uID', 'mID']].values:
            predicted_test_ratings.append(predicted_ratings[uID-1, movie_id_to_idx[mID]])

        return np.array(predicted_test_ratings)

    def rmse(self, yp):
        # Impute NaN predictions with 3 on a copy, so the caller's array is not modified
        yp = np.where(np.isnan(yp), 3, yp)
        yt = np.array(self.data.test.rating)
        return np.sqrt(((yt - yp) ** 2).mean())


# Create an instance of NMFRecSys
nmf_rec_sys = NMFRecSys(data)

# Fit the NMF model
nmf_rec_sys.fit()

# Predict ratings
predicted_ratings = nmf_rec_sys.predict()

# Calculate RMSE
rmse = nmf_rec_sys.rmse(predicted_ratings)
print("RMSE:", rmse)


RMSE: 2.862365426778359


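Interpretation: an RMSE of about 2.86 means that the predicted ratings typically deviate from the actual ratings by roughly 2.9 points. RMSE is the square root of the mean squared error, so large errors weigh more heavily than they would in a plain average of absolute errors. On a rating scale of 1 to 5, this is a very large error.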
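
To make the RMSE interpretation concrete, here is a small sketch on made-up ratings that compares RMSE with the mean absolute error (MAE). The standalone helpers `rmse` and `mae` are defined only for this illustration and are separate from the `rmse` method in the class above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Made-up ratings: three good predictions and one that is off by 4
y_true = [5, 4, 3, 1]
y_pred = [4, 4, 3, 5]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

In this example the single large miss pushes RMSE well above MAE, which is why RMSE reads as a typical error size that penalizes large misses rather than as a simple average deviation.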

There are several reasons why non-negative matrix factorization (NMF) may perform worse than the baseline or similarity-based methods. First, the model may not have enough complexity to capture the underlying patterns in the data. The number of components (latent factors) used in NMF can significantly affect its performance, and if it is too low, the model may struggle to represent the data accurately. Second, NMF is sensitive to the initialization of the factor matrices. Different initializations can lead to different solutions and affect the model's performance. A third likely contributor in this implementation is that unrated entries are stored as 0 and sklearn's NMF fits every entry of the matrix, so those zeros are treated as real ratings and pull the predictions toward 0.
To deal with this, we can increase model complexity by increasing the number of components (latent factors) used in NMF, which allows the model to capture more intricate patterns in the data. We can also try different implementations and explore their hyperparameters, initialization methods, and optimization algorithms.
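
A minimal sketch of such an exploration is shown below, using sklearn's `NMF` on a small synthetic matrix rather than the MovieLens data. The matrix `X`, the grid of values, and the `results` dictionary are assumptions made for this illustration. It reports reconstruction error on the matrix it was fit to, whereas in the assignment the comparison would be made on the test RMSE.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic non-negative matrix built from 5 latent factors (made-up data)
X = rng.random((40, 5)) @ rng.random((5, 30))

results = {}
for init in ("random", "nndsvda"):      # two of sklearn's initialization options
    for k in (2, 5, 10):                # number of latent factors
        model = NMF(n_components=k, init=init, random_state=0, max_iter=1000)
        W = model.fit_transform(X)      # row (user-side) factors
        H = model.components_           # column (item-side) factors
        # Root mean squared reconstruction error over all entries of X
        results[(init, k)] = np.sqrt(np.mean((X - W @ H) ** 2))

for (init, k), err in sorted(results.items()):
    print(f"init={init:8s} k={k:2d} reconstruction RMSE={err:.4f}")
```

The same loop could wrap `NMFRecSys(data, n_components=k)` to compare test RMSE across settings, at the cost of a much longer run time on the full rating matrix.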