# DTSA-5510 Week 4: Part 2
## Author: Alan Klein
## Create Date: 2025-04-27
github link: https://github.com/Saganoky/DTSA-5510-Week-4-Kaggle-Project

## Part 2: Questions

1. Load the movie ratings data (as in the HW3-recommender-system) and use matrix factorization technique(s) and predict the missing ratings from the test data. Measure the RMSE. You should use sklearn library. [10 pts]

The performance was terrible, having a RMSE of 2.85 for both the test and training data.


2. Discuss the results and why sklearn's non-negative matrix factorization library did not work well compared to simple baseline or similarity-based methods weâ€™ve done in Module 3. Can you suggest a way(s) to fix it? [10 pts]

With 50 iterations, this process is taking 5 minutes to run and its not reaching convergence, if we increased the iterations it could get better results.  You could try and apply some spare data techniques to increase performance.  Additionally, we could try regularization or PCA to reduce the dimensionality.

In [None]:
import pandas as pd
import numpy as np
import itertools

from collections import namedtuple

from sklearn.metrics import accuracy_score, root_mean_squared_error
from scipy.sparse import coo_matrix
from sklearn.decomposition import NMF


In [18]:
MV_users = pd.read_csv('data/part_2/users.csv')
MV_movies = pd.read_csv('data/part_2/movies.csv')
train = pd.read_csv('data/part_2/train.csv')
test = pd.read_csv('data/part_2/test.csv')

Data = namedtuple('Data', ['users','movies','train','test'])
data = Data(MV_users, MV_movies, train, test)

In [54]:
# Merge the data to be used for NMF, needs to be user ratings and movie genres for each row.
# Copy code from week 3 assignment

allusers = list(data.users['uID'])
allmovies = list(data.movies['mID'])
genres = list(data.movies.columns.drop(['mID', 'title', 'year']))
mid2idx = dict(zip(data.movies.mID,list(range(len(data.movies)))))
uid2idx = dict(zip(data.users.uID,list(range(len(data.users)))))

ind_movie = [mid2idx[x] for x in data.train.mID] 
ind_user = [uid2idx[x] for x in data.train.uID]
rating_train = list(data.train.rating)
Movie_Ratings = np.array(coo_matrix((rating_train, (ind_user, ind_movie)), shape=(len(allusers), len(allmovies))).toarray())


# Create wide dataframe
uid2idx_df = pd.DataFrame(list(uid2idx.items()), columns=['uID', 'index'])

movie_ratings_df = pd.DataFrame(Movie_Ratings)
movie_ratings_df['index'] = list(range(len(movie_ratings_df)))
movie_ratings_df = pd.merge(movie_ratings_df, uid2idx_df, on='index')

train_enriched_1 = pd.merge(data.train, data.users, on='uID')
train_enriched_2 = pd.merge(train_enriched_1, data.movies, on='mID')
train_enriched_3 = pd.merge(train_enriched_2, movie_ratings_df, on='uID')

# Get X_train and y_train
X_train = np.array(train_enriched_3.drop(columns=['uID', 'mID', 'rating',
                                          'gender', 'occupation','zip', 
                                          'title','year','index']))
y_train = np.array(train_enriched_3['rating'])


# Enrich test set
test_enriched_1 = pd.merge(data.test, data.users, on='uID')
test_enriched_2 = pd.merge(test_enriched_1, data.movies, on='mID')
test_enriched_3 = pd.merge(test_enriched_2, movie_ratings_df, on='uID')

# Get X_test and y_test
X_test = np.array(test_enriched_3.drop(columns=['uID', 'mID', 'rating',
                                          'gender', 'occupation','zip', 
                                          'title','year','index']))    
y_test = np.array(test_enriched_3['rating'])  


# Not going to normalizing the data, since we are using NMF.


In [58]:
def create_nmf_model(feature_data):
    nmf_model = NMF(n_components=5, random_state=1337, max_iter=50)
    W = nmf_model.fit_transform(feature_data)
    H = nmf_model.components_
    return W, H, nmf_model

def label_permute_compare(labels,W,n=5):
    pred = np.argmax(W, axis = 1)
    pred_pd = pd.DataFrame(pred)
    best_acc = 0
    best_rmse = 0
    best_perm = None
   
    for perm in itertools.permutations(range(n)):
        perm_yp = pred_pd.replace(list(range(n)), list(perm))
        perm_rmse = root_mean_squared_error(labels, perm_yp)
        perm_acc = accuracy_score(labels, perm_yp)
        if perm_rmse > best_rmse:
            best_rmse = perm_rmse
            best_perm = perm
            best_acc = perm_acc
    return best_perm, best_acc, best_rmse

def print_results(best_perm, best_rmse, best_acc, train_labs):
    print("Best permutation:", best_perm)
    print("Best RMSE:", best_rmse)
    print("Best accuracy:", best_acc)
    print("Train labels:", train_labs)

W, H, nmf_model = create_nmf_model(X_train)
best_perm, best_acc, best_rmse = label_permute_compare(y_train,W,n=5)
print_results(best_perm, best_rmse, best_acc, y_train)

# run on validation set
w_val = nmf_model.transform(X_test) 
val_best_perm, val_best_acc, val_best_rmse = label_permute_compare(y_test,w_val,n=5)
print_results(val_best_perm, val_best_rmse, val_best_acc, y_test)





Best permutation: (0, 4, 3, 1, 2)
Best RMSE: 2.84590445599657
Best accuracy: 0.10517520631411163
Train labels: [5 4 5 ... 3 3 1]
Best permutation: (0, 4, 3, 1, 2)
Best RMSE: 2.8512978427016757
Best accuracy: 0.10539786644804591
Train labels: [4 5 3 ... 4 3 4]
