## Part 2: Limitations of sklearn’s Non-Negative Matrix Factorization Library 

    Author: Lawrence Ganeshalingam
    Date: November 23, 2025
    Assignment: Kaggle BBC News Classification Competition (Unsupervised Matrix Factorization Approach)
    Subject: CSCA-5632 Week-4 Lab Part 1
    email: lawrence.ganeshlingam@colorado.edu
    Github: https://github.com/LawrenceGaneshalingam/Matrix-Factorization-BBCNews


### Lab Report: Limitations of sklearn’s NMF Library (Part 2) 

#### 1. Loading Data, Applying NMF, Predicting Ratings, and Measuring RMSE
I started by loading the movie ratings from the HW3 recommender system setup: the train data (700,146 rows of uID, mID, rating) to build the model and the test data (300,063 rows) to check predictions. I pivoted each into a user-item matrix, restricted both to the users and movies they share, and filled the missing entries with zeros, since sklearn's NMF needs a complete, non-negative input.

For the model, I went with sklearn's NMF using KL loss (beta_loss='kullback-leibler', which requires solver='mu'), with n_components=20, init='random', and max_iter=200. I fitted it on the train matrix to get W (user-latent) and H (latent-item), then rebuilt the predictions as pred_matrix = W @ H.
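For reference, here is a sketch of what this configuration optimizes. With beta_loss='kullback-leibler' and the default alpha_W=0 (no regularization), sklearn's NMF minimizes the generalized KL divergence between the input matrix and the product WH, as described in the scikit-learn documentation. The symbols X, i, and j are notation introduced here and do not appear in the notebook.

$$
D_{KL}(X \,\|\, WH) = \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right)
$$

With the convention 0 log 0 = 0, each zero-filled entry contributes (WH)_ij to the loss, so every missing rating is penalized for any prediction above zero.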

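To get the RMSE, I compared predictions only at the observed (non-zero) positions in the test matrix: it came out to 2.8775. I then tuned n_components across [10, 20, 30], this time fitting on a train matrix with zeros replaced by user means and with init='nndsvda': RMSE was 2.779 for 10, 2.721 for 20, and 2.712 for 30, so 30 was best.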
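The masked RMSE described above can be sketched as a small helper. This is a minimal illustration on made-up numbers; `masked_rmse`, `toy_actual`, and `toy_pred` are hypothetical names, not part of the notebook.

```python
import numpy as np


def masked_rmse(actual, predicted):
    """RMSE over positions where `actual` holds an observed (non-zero) rating."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual > 0  # zeros mean "no rating", so they are excluded
    diff = actual[mask] - predicted[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical toy data: 0 marks a missing rating.
toy_actual = np.array([[5, 0, 3], [0, 4, 0]])
toy_pred = np.array([[4.0, 2.0, 3.0], [1.0, 2.0, 0.5]])
toy_rmse = masked_rmse(toy_actual, toy_pred)
print(toy_rmse)
```

Only the three observed entries enter the average, so predictions at unrated positions do not affect the score.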

#### Library, setup, data info

In [3]:
# Library & setup
import pandas as pd
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split

In [10]:
train_data = pd.read_csv('C:/CUBoulder/MSAI/CSCA5632/Week-3/train.csv')
test_data = pd.read_csv('C:/CUBoulder/MSAI/CSCA5632/Week-3/test.csv')

In [12]:
print("Train data info:")
print(train_data.info())
print("Test data info:")
print(test_data.info())

Train data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700146 entries, 0 to 700145
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   uID     700146 non-null  int64
 1   mID     700146 non-null  int64
 2   rating  700146 non-null  int64
dtypes: int64(3)
memory usage: 16.0 MB
None
Test data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300063 entries, 0 to 300062
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   uID     300063 non-null  int64
 1   mID     300063 non-null  int64
 2   rating  300063 non-null  int64
dtypes: int64(3)
memory usage: 6.9 MB
None


#### Pivot to user-item matrices (fill missing with 0 for NMF, as it requires non-negative input)

In [14]:
# Pivot to user-item matrices (fill missing with 0 for NMF, as it requires non-negative input)
train_matrix = train_data.pivot(index='uID', columns='mID', values='rating').fillna(0)
test_matrix = test_data.pivot(index='uID', columns='mID', values='rating').fillna(0)

# Align to common users/movies for consistent comparison (handle differing shapes)
common_users = train_matrix.index.intersection(test_matrix.index)
common_movies = train_matrix.columns.intersection(test_matrix.columns)
train_matrix = train_matrix.loc[common_users, common_movies]
test_matrix = test_matrix.loc[common_users, common_movies]

print("Aligned train matrix shape:", train_matrix.shape)
print("Aligned test matrix shape:", test_matrix.shape)

Aligned train matrix shape: (6040, 3496)
Aligned test matrix shape: (6040, 3496)


#### Apply sklearn NMF (technique: NMF with KL loss for sparsity; hyperparams: n_components=20 as starting point)

In [22]:
# Apply sklearn NMF (technique: NMF with KL loss for sparsity; hyperparams: n_components=20 as starting point)
n_components = 20  # Latent factors; can tune (e.g., try 10-50)
# beta_loss='kullback-leibler' requires solver='mu' (multiplicative updates);
# the default solver='cd' does not support KL and raised a ValueError.
nmf = NMF(n_components=n_components, init='random', random_state=42, max_iter=200,
          beta_loss='kullback-leibler', solver='mu')
W = nmf.fit_transform(train_matrix)  # User-latent matrix
H = nmf.components_  # Latent-item matrix
pred_matrix = np.dot(W, H)  # Reconstructed predictions for all entries

# Measure RMSE on test data (only on observed/non-zero positions to evaluate missing rating predictions)
test_mask = test_matrix > 0  # Boolean DataFrame marking actual ratings in test
test_actual = test_matrix.values[test_mask.values]  # Observed ratings; boolean indexing already returns a 1-D array
test_pred = pred_matrix[test_mask.values]  # Predictions at the same positions (pred_matrix is a numpy array)
rmse = sqrt(mean_squared_error(test_actual, test_pred))
print("RMSE (sklearn NMF):", rmse)  # Expected: ~1.0-1.2; interpret in discussion

RMSE (sklearn NMF): 2.8775332768905897


#### Impute zeros with user means (fix for lower RMSE)

In [26]:
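# Impute zeros with user means (fix for lower RMSE)
# Caveat: mean(axis=1) averages over the zero-filled entries too, so these
# "user means" sit well below each user's average observed rating.
user_means = train_matrix.mean(axis=1)
train_matrix_imputed = train_matrix.copy()
for user in train_matrix.index:
    zero_mask = train_matrix_imputed.loc[user] == 0  # Positions this user has not rated
    train_matrix_imputed.loc[user, zero_mask] = user_means[user]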

# Tune n_components
rmse_results = []
best_rmse = float('inf')
best_n = None

for n in [10, 20, 30]:
    nmf = NMF(n_components=n, init='nndsvda', random_state=42, max_iter=200, beta_loss='kullback-leibler', solver='mu')  # Or 'nndsvdar'
    W = nmf.fit_transform(train_matrix_imputed)
    H = nmf.components_
    pred_matrix = np.dot(W, H)
    
    test_mask = test_matrix > 0
    test_actual = test_matrix.values[test_mask.values].flatten()
    test_pred = pred_matrix[test_mask.values].flatten()
    rmse = sqrt(mean_squared_error(test_actual, test_pred))
    rmse_results.append({'n_components': n, 'rmse': rmse})
    
    if rmse < best_rmse:
        best_rmse = rmse
        best_n = n

# Results table
rmse_df = pd.DataFrame(rmse_results)
print(rmse_df)

print(f"\nBest n_components: {best_n} with RMSE: {best_rmse}")


   n_components      rmse
0            10  2.779244
1            20  2.720972
2            30  2.711828

Best n_components: 30 with RMSE: 2.7118277965288007
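The imputation cell above computes train_matrix.mean(axis=1), which counts the zero-filled entries in each row average. A possible variant, sketched here on made-up numbers, averages only the observed ratings before imputing. The names `toy_train`, `naive_means`, `observed_means`, and `imputed` are hypothetical, and this variant was not run on the real data, so no RMSE is claimed for it.

```python
import numpy as np

# Hypothetical toy user-item matrix: 0 marks a missing rating.
toy_train = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
])

mask = toy_train > 0                       # True where a rating was observed
naive_means = toy_train.mean(axis=1)       # zeros counted: deflated means
counts = np.maximum(mask.sum(axis=1), 1)   # guard against users with no ratings
observed_means = toy_train.sum(axis=1) / counts

# Keep observed ratings, fill the gaps with each user's observed mean.
imputed = np.where(mask, toy_train, observed_means[:, None])

print(naive_means)
print(observed_means)
```

With a real DataFrame the same arrays could be obtained from train_matrix.to_numpy().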


#### 2. Discussion of Results, Underperformance Reasons, and Fixes
My results showed a high RMSE of about 2.71-2.88, meaning the predictions for held-out ratings were off by close to 3 points on a 1-5 scale. Tuning helped only a little, with diminishing gains as the number of factors went up (2.779 at 10, 2.721 at 20, 2.712 at 30).

That is well short of simple baselines like the global mean (RMSE about 1.05) or the user mean (around 0.95), and of the similarity methods from Module 3 such as KNN or Jaccard (often ~0.9). The sklearn NMF seems to pull predictions low because it treats every missing entry as an actual rating of zero. It is also a general-purpose tool rather than one built for recommenders: it has no user or item bias terms, and it is slow on data this sparse.
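The two simplest baselines mentioned above can be sketched as follows. The data frames are made-up stand-ins with the same uID, mID, rating columns as the real files; `toy_train`, `toy_test`, and `rmse_of` are hypothetical names, and the printed numbers apply only to this toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as train.csv / test.csv.
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 2, 3],
    "mID": [10, 20, 10, 30, 20],
    "rating": [5, 3, 4, 2, 1],
})
toy_test = pd.DataFrame({
    "uID": [1, 2, 3],
    "mID": [30, 20, 10],
    "rating": [4, 3, 2],
})


def rmse_of(actual, predicted):
    """Plain RMSE between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Baseline 1: predict the global mean training rating everywhere.
global_mean = toy_train["rating"].mean()
global_pred = np.full(len(toy_test), global_mean)

# Baseline 2: predict each user's mean training rating,
# falling back to the global mean for users unseen in training.
user_mean = toy_train.groupby("uID")["rating"].mean()
user_pred = toy_test["uID"].map(user_mean).fillna(global_mean).to_numpy()

global_rmse = rmse_of(toy_test["rating"], global_pred)
user_rmse = rmse_of(toy_test["rating"], user_pred)
print(global_rmse, user_rmse)
```

Neither baseline ever treats a missing rating as zero, which is the main difference from the zero-filled NMF input.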

To fix it, I could fill in the zeros with user or item averages beforehand (computed over observed ratings only); add biases manually by estimating them and adjusting around the NMF step; tweak hyperparameters further, such as alpha_W=0.1 or init='nndsvda'; combine it with KNN, using the latent factors as extra features; clip outputs to the 1-5 range; or blend with the baselines, which might bring RMSE down to around 0.85.
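Two of the cheaper fixes listed above, blending with a baseline and clipping to the rating scale, can be sketched together. This is an illustration under assumptions: `clip_and_blend` is a hypothetical helper, the equal blend weight of 0.5 is arbitrary and would need tuning on validation data, and blending before clipping is one of several reasonable orders.

```python
import numpy as np


def clip_and_blend(nmf_pred, baseline_pred, weight=0.5, low=1.0, high=5.0):
    """Blend NMF predictions with baseline predictions, then clip to [low, high].

    `weight` is the share given to the NMF predictions (an assumed default).
    """
    nmf_pred = np.asarray(nmf_pred, dtype=float)
    baseline_pred = np.asarray(baseline_pred, dtype=float)
    blended = weight * nmf_pred + (1.0 - weight) * baseline_pred
    return np.clip(blended, low, high)


# Hypothetical predictions for three test positions.
toy_nmf = np.array([0.2, 6.0, 3.0])        # raw NMF output, partly off-scale
toy_baseline = np.array([4.0, 4.5, 3.0])   # e.g. user-mean predictions
toy_final = clip_and_blend(toy_nmf, toy_baseline)
print(toy_final)
```

The clip guarantees that no prediction falls outside the 1-5 range, whatever the blend produces.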

### Recommendations to Improve the Above Assignment
##### (for my future assignments, for a better experience)

- Further Tuning: Test n_components in [15, 25, 35], or grid-search with alpha_W in [0.01, 0.1] (regularization to curb overfitting at high n_components). Try init='nndsvdar', which fills the zeros of the NNDSVD initialization with small random values.
- Address Limitations: Impute zeros with means/biases before fitting (e.g., subtract the user mean, apply NMF, add it back; the centered matrix contains negative values, so it has to be shifted to be non-negative first). This could drop RMSE by roughly 20-30%.
- Hybrid/Alternatives: Ensemble NMF with KNN (use the latent factors as features), or switch to a library like Surprise for biased MF (RMSE ~0.85).
- Evaluation Enhancements: Add MAE and precision@K; cross-validate RMSE to reduce variance.
- Compute Tips: If slow, subsample the data or use GPU-accelerated alternatives (e.g., cuML NMF).