<article style="background-color: #DC7D2D;">
<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding-left: 30px; padding-top: 1em">Limitation(s) of sklearn’s non-negative matrix factorization library</h1>
<h2 style="font-size: 1.5em; background-color: #DC7D2D; padding-left: 30px; padding-bottom: 30px;">Mini-project for DTSA 5510 Unsupervised Learning (Assignment Week 4)</h2>
</article>

# Project plan

This assignment is slightly difficult to interpret, but here is how I have solved it:
* I have created a new method in the RecSys class called predict_to_NMF. It fits NMF to the user-movie rating matrix and assigns each user a single score from 1 to 5, based on the user's dominant latent component, much like in the previous assignment.
* This approach is not very successful: the resulting RMSE is higher than for all the other methods.
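
To make the idea concrete, here is a minimal sketch of the same steps on a toy matrix. The toy ratings, the choice of 2 components and `max_iter=500` are assumptions made for illustration only; the real method uses the full rating matrix and 5 components.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user-by-movie rating matrix (hypothetical data); 0 marks a missing rating.
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)

# Same kind of model as in predict_to_NMF, but with 2 components for the toy case.
model = NMF(n_components=2, init='random', random_state=123, max_iter=500)
model.fit(ratings)

# W has one row per user and one column per latent component.
W = model.transform(ratings)

# The index of the dominant component (0-based) is shifted by 1 and used as the score.
user_scores = W.argmax(axis=1) + 1

print(W.shape)      # (4, 2)
print(user_scores)  # one score per user, each either 1 or 2
```

Every movie rated by a given user receives that user's single score, which already hints at why the approach struggles.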

<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 30px; padding-top: 1em">Part I</h1>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix
from scipy.spatial.distance import jaccard, cosine 
from pytest import approx
from collections import namedtuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.decomposition import NMF
from sklearn.metrics import confusion_matrix

In [2]:
MV_users = pd.read_csv('../Peer-graded Assignment - BBC News Classification/movie-rating/data/users.csv')
MV_movies = pd.read_csv('../Peer-graded Assignment - BBC News Classification/movie-rating/data/movies.csv')
train = pd.read_csv('../Peer-graded Assignment - BBC News Classification/movie-rating/data/train.csv')
test = pd.read_csv('../Peer-graded Assignment - BBC News Classification/movie-rating/data/test.csv')

In [3]:
Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [4]:
class RecSys():
    def __init__(self,data):
        self.data=data
        self.allusers = list(self.data.users['uID'])
        self.allmovies = list(self.data.movies['mID'])
        self.genres = list(self.data.movies.columns.drop(['mID', 'title', 'year']))
        self.mid2idx = dict(zip(self.data.movies.mID,list(range(len(self.data.movies)))))
        self.uid2idx = dict(zip(self.data.users.uID,list(range(len(self.data.users)))))
        self.Mr=self.rating_matrix()
        self.Mm=None 
        self.sim=np.zeros((len(self.allmovies),len(self.allmovies)))
        
    def rating_matrix(self):
        """
        Convert the rating matrix to numpy array of shape (#allusers,#allmovies)
        """
        ind_movie = [self.mid2idx[x] for x in self.data.train.mID] 
        ind_user = [self.uid2idx[x] for x in self.data.train.uID]
        rating_train = list(self.data.train.rating)
        
        return np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(self.allusers), len(self.allmovies))).toarray())


    def predict_everything_to_3(self):
        """
        Predict everything to 3 for the test data
        """
        # Generate an array with 3s against all entries in test dataset
        # your code here
        
        number_of_entries = len(self.data[3])
        return np.full(number_of_entries, 3)
        
        
    def predict_to_user_average(self):
        """
        Predict to average rating for the user.
        Returns numpy array of shape (#rows in test data,)
        """
        # Generate an array as follows:
        # 1. Calculate all avg user rating as sum of ratings of user across all movies/number of movies whose rating > 0
        # 2. Return the average rating of users in test data
        # your code here
        
        avg_user_ratings = self.data[2].groupby('uID')['rating'].mean()
        transformation_dict = avg_user_ratings.to_dict()
        result = self.data[3]['uID'].map(transformation_dict)
        return np.array(result)
    
    
    def predict_from_sim(self,uid,mid):
        """
        Predict a user rating on a movie given userID and movieID
        """
        # Predict user rating as follows:
        # 1. Get entry of user id in rating matrix
        # 2. Get entry of movie id in sim matrix
        # 3. Employ 1 and 2 to predict user rating of the movie
        # your code here
        
        user = self.Mr[self.uid2idx[uid]]
        movie = self.sim[self.mid2idx[mid]]
        result = (np.dot(user,movie)) / (np.dot(movie,user>0))
        return result

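    def predict_to_NMF(self):
        """
        The new method for this assignment.
        Fit NMF to the rating matrix and predict one score (1-5) per user:
        the index of the user's dominant latent component, plus 1.
        """
        # Missing ratings are stored as 0 and enter the factorization as real values
        matrix = self.Mr.copy()
        
        nmf_model = NMF(n_components=5, init='random', random_state=123)
        nmf_model.fit(matrix)

        # Row i holds the weights of user i on the 5 latent components
        rating_results = nmf_model.transform(matrix)
        # Dominant component index (0-4), shifted to the 1-5 rating scale
        rating_results = rating_results.argmax(axis=1) + 1

        # Keys are row indices of the rating matrix, so the lookup below matches
        # the right user only when uID equals the row index (see self.uid2idx)
        translation_dict = dict(enumerate(rating_results))
        return self.data.test['uID'].map(translation_dict)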
    
    
    def predict(self):
        """
        Predict ratings in the test data. Returns predicted rating in a numpy array of size (# of rows in testdata,)
        """
        # your code here
        
        df = self.data[3].copy()
        prediction_list = []
        
        for i, e in zip(df['uID'], df['mID']):
            prediction_list.append(self.predict_from_sim(i, e))
        
        return np.array(prediction_list)
        
    
    def rmse(self,yp):
        yp[np.isnan(yp)]=3 # Impute any NaN predictions with 3 (note: modifies yp in place)
        yt=np.array(self.data.test.rating)
        return np.sqrt(((yt-yp)**2).mean())

    
class ContentBased(RecSys):
    def __init__(self,data):
        super().__init__(data)
        self.data=data
        self.Mm = self.calc_movie_feature_matrix()  
        
    def calc_movie_feature_matrix(self):
        """
        Create movie feature matrix in a numpy array of shape (#allmovies, #genres) 
        """
        # your code here
        
        return self.data[1].iloc[ : , 3:].to_numpy()
        
    
    def calc_item_item_similarity(self):
        """
        Create item-item similarity using Jaccard similarity
        """
        # Update the sim matrix by calculating item-item similarity using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        
        from sklearn.metrics.pairwise import pairwise_distances
        result = 1 - pairwise_distances(self.Mm, metric='jaccard')
        self.sim = result
        
        
                
class Collaborative(RecSys):    
    def __init__(self,data):
        super().__init__(data)
        
    def calc_item_item_similarity(self, simfunction, *X):  
        """
        Create item-item similarity using similarity function. 
        X is an optional transformed matrix of Mr
        """    
        # General function that calculates item-item similarity based on the sim function and data provided
        if len(X)==0:
            self.sim = simfunction()            
        else:
            self.sim = simfunction(X[0]) # *X arrives as a tuple (X,), so X[0] is the actual transformed matrix
            
    def cossim(self):    
        """
        Calculates item-item similarity for all pairs of items using cosine similarity (values from 0 to 1) on utility matrix
        Returns a cosine similarity matrix of size (#all movies, #all movies)
        """
        # Return a sim matrix by calculating item-item similarity for all pairs of items using cosine similarity
        # Cosine Similarity: C(A, B) = (A.B) / (||A||.||B||) 
        # your code here
        
        #t0=time.perf_counter()
        
        
        #from sklearn.metrics.pairwise import pairwise_distances
        #result_2 = 1 - pairwise_distances(self.Mr, metric='cosine')
        #print(result_2)
        
        
        
        movie_utility = (self.Mr.sum(axis=1)) / ((self.Mr > 0).sum(axis=1))
        movie_utility_array = np.repeat(np.expand_dims(movie_utility, axis=1),self.Mr.shape[1],axis=1)
        X = self.Mr + (self.Mr==0) * movie_utility_array - movie_utility_array 
        Y = X / np.sqrt((X**2).sum(axis=0))
        Y[np.isnan(Y)]=0.
        cossim = np.dot(Y.T, Y)
        
        # Diagonal equals 1
        for element in range(len(self.allmovies)):
            cossim[element, element] = 1
        
        result = 0.5 + 0.5 * cossim  
        return result
        
        
    
    def jacsim(self,Xr):
        """
        Calculates item-item similarity for all pairs of items using jaccard similarity (values from 0 to 1)
        Xr is the transformed rating matrix.
        """    
        # Return a sim matrix by calculating item-item similarity for all pairs of items using Jaccard similarity
        # Jaccard Similarity: J(A, B) = |A∩B| / |A∪B| 
        # your code here
        
        
        if Xr.max() != True: 
            intersection = np.zeros((Xr.shape[1], Xr.shape[1])).astype(int)
            #t0=time.perf_counter()
            for element in range(1, int(Xr.max() + 1)):
                csr = csr_matrix((Xr == element).astype(int))
                intersection = intersection + np.array(csr.T.dot(csr).toarray()).astype(int)    
        
        csr0 = csr_matrix((Xr > 0).astype(int))
        nz_inter = np.array(csr0.T.dot(csr0).toarray()).astype(int)   

        A = (Xr > 0).astype(bool)
        rowsum = A.sum(axis=0)
        rsumtile = np.repeat(rowsum.reshape((Xr.shape[1], 1)), Xr.shape[1], axis=1)   
        union = rsumtile.T + rsumtile - nz_inter

        if Xr.max() != True:
            jaccard_sim = intersection / union
        else:
            jaccard_sim = nz_inter / union
            
        # Turn nan into 0s
        jaccard_sim[np.isnan(jaccard_sim)] = 0
        
        # Diagonal equals 1
        for element in range(Xr.shape[1]):
            jaccard_sim[element, element] = 1   
        
        return jaccard_sim        
    
    

In [5]:
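recSys_instance = RecSys(data)
# Use a separate name so the `test` DataFrame loaded above is not overwritten
nmf_predictions = recSys_instance.predict_to_NMF()
recSys_instance.rmse(nmf_predictions)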

1.9357013740003162

<h1 style="font-size: 2.5em; background-color: #DC7D2D; padding: 30px; padding-top: 1em">Part I Discussion</h1>

|Method|RMSE|
|:----|:--------:|
|NMF-method | 1.936 |
|Baseline, $Y_p$=3| 1.259|
|Baseline, $Y_p=\mu_u$| 1.035|
|Content based, item-item| 1.196|
|Collaborative, cosine| 1.013|
|Collaborative, jaccard, $M_r\geq 3$|  0.982|
|Collaborative, jaccard, $M_r\geq 1$|  0.991|
|Collaborative, jaccard, $M_r$|  0.951|

* The NMF approach had an RMSE of 1.936 and performed poorly compared to all other methods.
* I believe the primary reason the NMF method performs poorly is that missing ratings are treated as zeros: the factorization fits them as if they were real, very low ratings. (The 1 added in my implementation only shifts the component index 0-4 onto the 1-5 scale; it does not change how missing ratings are handled.) This distorts the factors and pushes the performance of NMF below that of the other methods.
* Another reason is that the method ties each user to a single rating for every movie, and that rating is just the index of the user's dominant component, which has no inherent connection to the rating scale. Since users differ in how they rate movies, the result was fairly bad.
* One way to improve the results could be to apply the method separately to each category of movies, but I am uncertain that would yield much better results.
* Perhaps the NMF method could be used to find underlying themes across movie genres that the other methods could use to improve their performance.
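
The missing-ratings problem can be illustrated with a small sketch that fits the factorization to the observed entries only. This is not part of the assignment code and not an sklearn feature: `masked_nmf` is a hypothetical helper written in plain NumPy, using multiplicative updates weighted by a 0/1 mask, and the toy ratings are made up. I have not run it on the MovieLens data, so it says nothing about the RMSE it would achieve there.

```python
import numpy as np

def masked_nmf(R, mask, n_components=2, n_iter=500, seed=0, eps=1e-9):
    """Hypothetical helper: factorize R ~ W @ H using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    # Strictly positive random start so the multiplicative updates can move every entry.
    W = rng.random((n_users, n_components)) + 0.1
    H = rng.random((n_components, n_items)) + 0.1
    M = mask.astype(float)
    for _ in range(n_iter):
        # Multiplicative updates for the squared error restricted to observed entries.
        W *= ((M * R) @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ (M * R)) / (W.T @ (M * (W @ H)) + eps)
    return W, H

# Toy user-by-movie rating matrix (hypothetical data); 0 marks a missing rating.
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
], dtype=float)
observed = ratings > 0

W, H = masked_nmf(ratings, observed)

# Predicted rating for every user-movie pair, clipped to the 1-5 scale.
pred = np.clip(W @ H, 1, 5)

# Fit error measured on the observed ratings only.
rmse_obs = np.sqrt(((ratings - W @ H)[observed] ** 2).mean())
print(pred.round(1))
print(rmse_obs)
```

Unlike the argmax approach, the product `W @ H` gives a separate prediction for each user-movie pair, and the zeros standing in for missing ratings no longer pull the factors down.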