# Part 2: NMF evaluation

The NMF model performs poorly: its test RMSE of 2.8593 is far worse than the 1.1162 RMSE of the baseline, which simply predicts the global mean of the training ratings for every test row.

sklearn's NMF has some limitations for this task. The first is sparsity handling. According to my research, NMF in sklearn cannot handle sparse matrices efficiently for large-scale recommender systems, because its implementation often converts sparse input to a dense format internally during computation. That conversion is expensive in both memory and compute time.
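
To make the memory argument concrete, here is a minimal sketch comparing the storage a CSR matrix needs with what its dense equivalent would need. The toy 1000 x 1000 matrix and the helper `sparse_vs_dense_bytes` are hypothetical illustrations, not part of the experiment. The sketch does not show what sklearn does internally; it only shows what a dense conversion would cost.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_vs_dense_bytes(matrix):
    """Return (sparse_bytes, dense_bytes) for a CSR matrix."""
    # CSR stores three arrays: the values, their column indices, and row pointers.
    sparse_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
    # A dense array stores every cell, rated or not.
    dense_bytes = matrix.shape[0] * matrix.shape[1] * matrix.dtype.itemsize
    return sparse_bytes, dense_bytes

# Hypothetical toy data: 3 ratings in a 1000 x 1000 user-by-movie grid.
rows = np.array([0, 10, 999])
cols = np.array([5, 20, 999])
vals = np.array([5.0, 3.0, 4.0])
toy = csr_matrix((vals, (rows, cols)), shape=(1000, 1000))

sparse_bytes, dense_bytes = sparse_vs_dense_bytes(toy)
density = toy.nnz / (toy.shape[0] * toy.shape[1])
print(f"density: {density:.6f}")
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

The same helper could be applied to the matrix returned by `create_user_movie_matrix` to see how large the gap is for the real ratings.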

There are also limits on how low an RMSE this model can reach. First, NMF struggles with sparse data. Most users rate only a small fraction of all movies, so the model may not have enough information to learn good representations, which often leads to overfitting. Second, NMF tends to oversimplify complex patterns. Real user behavior is complex and nuanced, and some patterns may require many components to represent accurately. With too few components the model flattens these relationships, leading to less accurate predictions.
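
One concrete way sparsity hurts in this notebook's setup: the rating matrix stores unrated pairs as implicit zeros, and sklearn's NMF fits those zeros as if they were real ratings. The sketch below uses made-up data (every true rating is 4, and only about 10% of them are observed) to show how the reconstruction for unseen pairs gets pulled toward zero. The matrix size, the seed, and the constant rating are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_users, n_items = 200, 100

# Hypothetical ground truth: every user would rate every item 4.
true_ratings = np.full((n_users, n_items), 4.0)
# Only about 10% of the ratings are actually observed.
observed = rng.random((n_users, n_items)) < 0.10

# Unobserved cells become zeros, as they do in the sparse rating matrix.
zero_filled = np.where(observed, true_ratings, 0.0)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(zero_filled)
H = model.components_
reconstruction = W @ H

# Compare predictions on the unseen cells with the true value of 4.
held_out_mean = reconstruction[~observed].mean()
held_out_rmse = np.sqrt(np.mean((reconstruction[~observed] - 4.0) ** 2))
print(f"mean prediction on unseen cells: {held_out_mean:.3f}")
print(f"RMSE on unseen cells: {held_out_rmse:.3f}")
```

If the real model behaves like this toy case, test predictions far below the true ratings would go a long way toward explaining an RMSE worse than the global-mean baseline.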

To improve RMSE, we could handle missing values explicitly rather than treating them as zeros, add bias terms for users and items, and incorporate additional features and context. All of these involve time-intensive, error-prone data munging, and enhancing the data this way can introduce new problems. If we have the right training data, we are better off choosing a different supervised learning model.
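
As a rough illustration of the first two ideas (skipping missing values and adding user and item biases), here is a minimal NumPy sketch of biased matrix factorization trained with stochastic gradient descent on observed ratings only. This is a swapped-in technique, not sklearn's NMF: the factors are not constrained to be non-negative. The function names, hyperparameters, and toy ratings are all hypothetical and untuned.

```python
import numpy as np

def fit_biased_mf(users, items, ratings, n_users, n_items,
                  n_factors=8, lr=0.05, reg=0.05, n_epochs=100, seed=0):
    """Fit r ~ mu + b_u + b_i + p_u . q_i using only the observed triples."""
    rng = np.random.default_rng(seed)
    mu = ratings.mean()                      # global mean rating
    b_u = np.zeros(n_users)                  # user biases
    b_i = np.zeros(n_items)                  # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        # Visit observed ratings in random order; missing cells are never touched.
        for idx in rng.permutation(len(ratings)):
            u, i, r = users[idx], items[idx], ratings[idx]
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()              # use the pre-update user factors for Q
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(params, users, items):
    """Predict ratings for index arrays, clipped to the 1-5 rating scale."""
    mu, b_u, b_i, P, Q = params
    raw = mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)
    return np.clip(raw, 1, 5)

# Hypothetical toy ratings: (user index, item index, rating).
users = np.array([0, 0, 1, 1, 2, 2])
items = np.array([0, 1, 0, 2, 1, 2])
ratings = np.array([5.0, 4.0, 2.0, 1.0, 4.0, 3.0])

params = fit_biased_mf(users, items, ratings, n_users=3, n_items=3)
train_pred = predict(params, users, items)
train_rmse = np.sqrt(np.mean((ratings - train_pred) ** 2))
baseline_rmse = np.sqrt(np.mean((ratings - ratings.mean()) ** 2))
print(f"train RMSE: {train_rmse:.4f}, global-mean RMSE: {baseline_rmse:.4f}")
```

On the real data, the `uID` and `mID` values would need to be valid array indices (for example by sizing the arrays to the largest ID plus one), and the comparison that matters is on `df_test`, not on the training ratings shown here.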


In [5]:
import pandas as pd
def load_and_prepare_data(file_path):
    df = pd.read_csv(file_path)
    print(f"Shape of dataset: {df.shape}")
    print("\nData sample:")
    print(df.head())
    return df

df_train = load_and_prepare_data('data/train.csv')
df_test = load_and_prepare_data('data/test.csv')
df_movies = load_and_prepare_data('data/movies.csv')
df_users = load_and_prepare_data('data/users.csv')
# Only the last expression in a cell is displayed; the other frames are already printed above.
df_users.head()

Shape of dataset: (700146, 3)

Data sample:
    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5
Shape of dataset: (300063, 3)

Data sample:
    uID   mID  rating
0  2233   440       4
1  4274   587       5
2  2498   454       3
3  2868  2336       5
4  1636  2686       5
Shape of dataset: (3883, 21)

Data sample:
   mID                        title  year  Doc  Com  Hor  Adv  Wes  Dra  Ani  \
0    1                    Toy Story  1995    0    1    0    0    0    0    1   
1    2                      Jumanji  1995    0    0    0    1    0    0    0   
2    3             Grumpier Old Men  1995    0    1    0    0    0    0    0   
3    4            Waiting to Exhale  1995    0    1    0    0    0    1    0   
4    5  Father of the Bride Part II  1995    0    1    0    0    0    0    0   

   ...  Chi  Cri  Thr  Sci  Mys  Rom  Fil  Fan  Act  Mus  
0  ...    1    0    0    0    0    0    0    0    0    0  
1  ...

Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [6]:
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix

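def create_user_movie_matrix(df_train):
    # Raw uID/mID values are used directly as row/column indices; the matrix
    # shape is inferred from the largest IDs in the training data.
    # Unrated (user, movie) pairs are stored as implicit zeros.
    user_movie_matrix = csr_matrix((df_train['rating'],
                                  (df_train['uID'], df_train['mID'])))
    return user_movie_matrix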

def train_nmf_model(user_movie_matrix, n_components=20):
    model = NMF(n_components=n_components, init='random', random_state=42)
    user_features = model.fit_transform(user_movie_matrix)
    item_features = model.components_
    return model, user_features, item_features

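def predict_ratings(user_features, item_features, test_users, test_items):
    # Predictions default to 0; users or movies whose ID falls outside the
    # training matrix keep that default.
    predicted_ratings = np.zeros(len(test_users))
    for i in range(len(test_users)):
        if test_users[i] < user_features.shape[0] and test_items[i] < item_features.shape[1]:
            # Predicted rating = dot product of the user's and the movie's latent factors.
            predicted_ratings[i] = np.dot(user_features[test_users[i]],
                                        item_features[:, test_items[i]])
    return predicted_ratings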

def evaluate_model(df_train, df_test, n_components=20):
    user_movie_matrix = create_user_movie_matrix(df_train)
    model, user_features, item_features = train_nmf_model(user_movie_matrix, 
                                                        n_components)
    
    test_predictions = predict_ratings(user_features, item_features, 
                                     df_test['uID'].values, 
                                     df_test['mID'].values)
    
    rmse = np.sqrt(mean_squared_error(df_test['rating'], test_predictions))
    return rmse, test_predictions

# Evaluate
rmse, predictions = evaluate_model(df_train, df_test)
print(f"RMSE: {rmse:.4f}")

# Compare with baseline
global_mean = df_train['rating'].mean()
baseline_predictions = np.full(len(df_test), global_mean)
baseline_rmse = np.sqrt(mean_squared_error(df_test['rating'], baseline_predictions))
print(f"Baseline RMSE: {baseline_rmse:.4f}")

RMSE: 2.8593
Baseline RMSE: 1.1162
