# Matrix Factorization

Okay, so time to use Matrix Factorization. I'll be using Non-Negative Matrix Factorization (NMF) from [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). I'm a fan of Non-Negative Matrix Factorization for 2 reasons:
1. There are no negative values, which prevents weird results. Since there is no 'thumbs down' other than not purchasing a product, I don't want negative values spitting out negative results.
2. I prefer the US matrix form, since it allows for manipulation with a K matrix. U is u x k, S is k x s, and K is a k x k matrix you can throw in between them. If K is I (the identity), it has absolutely no effect, but you can manipulate results by tossing numbers on the off-diagonals. For example, in a book recommender, you can find the group that pregnancy books are in and the group that stillbirth books are in, then have an interest in stillbirth books negatively influence the results for pregnancy books, since recommending those seems like a cruel thing to hit customers with.
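
As a rough sketch of the K-matrix idea in point 2: the numbers below are toy values, not anything fitted from the real data, and the `scores` helper and the choice of which off-diagonal to set are assumptions made purely for illustration.

```python
import numpy as np

# Toy factors: 3 users x 2 latent groups, and 2 latent groups x 4 items.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
S = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0]])

def scores(U, S, K=None):
    """Return U.K.S; K defaults to the identity, which changes nothing."""
    if K is None:
        K = np.eye(U.shape[1])
    return U.dot(K).dot(S)

baseline = scores(U, S)

# Hypothetical tweak: interest in group 1 pushes down items from group 0.
K = np.eye(2)
K[1, 0] = -0.5
adjusted = scores(U, S, K)
```

Here the second user (all weight on group 1) ends up with negative scores for the group 0 items, while the first user (all weight on group 0) is untouched. If negative scores are unwanted downstream, clipping with `np.maximum(adjusted, 0)` is one option.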

There is an issue, however: I won't be able to use non-users to validate the way I did above, because they need to get a vector at the same time as everyone else. So I need to 'knock out' values in my utility_matrix to fit back in later. That means I need to build the matrix again, since I'm in a new notebook.

In [1]:
import pandas as pd
from scipy import sparse

# first load my data...
ratings = pd.read_csv('../data/ratings.csv', delimiter='|', header=None, names=['user_id', 'system_id', 'ratings'])

# get highest user_id & highest system_id
highest_user_id = ratings.user_id.max()
highest_system_id = ratings.system_id.max()

# make a sparse matrix...
utility_matrix = sparse.lil_matrix((highest_user_id + 1, highest_system_id + 1))
# +1 to be able to use actual ids as indices, as opposed to having to make concessions

# of course, now I need to fill it with ratings...
for _, row in ratings.iterrows():
    utility_matrix[row.user_id, row.system_id] = row.ratings
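
Filling the matrix row by row can be slow on a large ratings file. A possible alternative, sketched here on a made-up toy DataFrame (`toy` and `toy_matrix` are illustrative names, not part of the notebook), is to build the matrix in one shot in COO format and convert it to LIL afterwards:

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for ratings.csv, using the same column names as above.
toy = pd.DataFrame({'user_id':   [0, 2, 2, 3],
                    'system_id': [1, 0, 4, 1],
                    'ratings':   [1, 3, 2, 1]})

shape = (int(toy.user_id.max()) + 1, int(toy.system_id.max()) + 1)

# COO takes (values, (rows, cols)) all at once; converting to LIL afterwards
# keeps single-entry assignment cheap for the knock-out step.
toy_matrix = sparse.coo_matrix(
    (toy.ratings.values.astype(float),
     (toy.user_id.values, toy.system_id.values)),
    shape=shape).tolil()
```

One difference to keep in mind: if the same (user_id, system_id) pair appears more than once, COO conversion sums the duplicates, whereas the loop keeps the last value.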

## Validation

Now to remove 20% of my data points and call the result the *train_utility_matrix*. The goal is to make a list of (*user_id*, *system_id*) tuples, then set the values at those locations to 0 in a copy of the *utility_matrix*.

In [2]:
from random import shuffle

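train_utility_matrix = utility_matrix.copy()
# DOK format exposes the (user_id, system_id) index pairs of the stored ratings
utility_dict = utility_matrix.todok(copy=False)
lst = list(utility_dict.keys())  # needs to be a real list so shuffle can reorder it
shuffle(lst)
# first 80% stays in training, last 20% is held out
cut = int(len(lst)*0.8)
train = lst[:cut]
hold_out = lst[cut:]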

# now remove hold_out
for tup in hold_out:
    train_utility_matrix[tup] = 0
    
for tup in hold_out[:10]:
    print utility_matrix[tup], train_utility_matrix[tup]

1.0 0.0
1.0 0.0
3.0 0.0
1.0 0.0
1.0 0.0
2.0 0.0
1.0 0.0
1.0 0.0
1.0 0.0
3.0 0.0
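
The shuffle above is unseeded, so the hold-out set changes on every run. If a repeatable split is wanted, one option is a private, seeded RNG. This is a minimal sketch on a toy matrix; the seed value and every `toy_*` name are assumptions for illustration only.

```python
import random
from scipy import sparse

# Toy utility matrix with five known ratings.
toy = sparse.lil_matrix((4, 5))
for (u, s), r in {(0, 1): 1, (1, 3): 2, (2, 0): 3, (2, 4): 2, (3, 1): 1}.items():
    toy[u, s] = r

keys = sorted(toy.todok().keys())   # sort first so the starting order is fixed
random.Random(42).shuffle(keys)     # seeded RNG: same shuffle every run
cut = int(len(keys) * 0.8)
toy_train_keys, toy_hold_out = keys[:cut], keys[cut:]

# Knock the held-out ratings out of a copy, as in the cell above.
toy_train = toy.copy()
for key in toy_hold_out:
    toy_train[key] = 0
```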


### Non-regularized

For fun, just fitting it to see what happens.

In [None]:
import numpy as np
from math import sqrt
from sklearn.decomposition import NMF

ks = list(range(4, 34, 4))
best_k = 0
max_err = [0] * len(ks)
rmse = [0] * len(ks)
lowest_rmse = 9001  # it's over 9,000
for i, k in enumerate(ks):
    nmf = NMF(k)
    nmf.fit(train_utility_matrix)
    #  v = WH, W = users, H = items
    H = nmf.components_
    W = nmf.transform(train_utility_matrix)
    # find rmse for this...
    temp_rmse = 0
    n = 0
    for j, tup in enumerate(hold_out):
        # rating and predicted rating...
        # (read the actual rating from the full matrix; filtering the DataFrame per point is very slow)
        rat = utility_matrix[tup]
        pred = np.dot(W[tup[0], :], H[:, tup[1]])
        err = pred - rat
        # save worst error...
        if abs(err) > max_err[i]:
            max_err[i] = abs(err)
        # calc RMSE
        temp_rmse = temp_rmse + err**2
        n += 1
        print 'Progress {:.2f}% \r'.format(100.0 * (i*len(hold_out) + j + 1)/ float(len(ks) * len(hold_out))),
    # done, finish the calculation
    rmse[i] = sqrt(temp_rmse/float(n))
    if rmse[i] < lowest_rmse:
        best_k = k
        lowest_rmse = rmse[i]
    

Progress 37.50% 
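
The per-point loop above is slow, which is probably why the progress stalls. A possible vectorized version of the error calculation is sketched below. `holdout_errors` is a hypothetical helper, and the factors are made-up toy values rather than the fitted model; the only thing taken from the notebook is the V = WH convention, with W as users and H as items.

```python
import numpy as np

def holdout_errors(W, H, pairs, actual):
    """Prediction errors for (user, item) pairs, given V ~= W.dot(H)."""
    rows, cols = (np.asarray(ix) for ix in zip(*pairs))
    # Row-wise dot product of W[u, :] with H[:, i] for every pair at once.
    preds = np.einsum('ij,ij->i', W[rows], H[:, cols].T)
    return preds - np.asarray(actual, dtype=float)

# Toy check with made-up factors: 2 users, 2 latent groups, 3 items.
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])
H = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.0, 1.0]])
pairs = [(0, 1), (1, 2), (1, 0)]
actual = [2.0, 1.0, 1.0]

errs = holdout_errors(W, H, pairs, actual)
toy_rmse = np.sqrt(np.mean(errs ** 2))   # root mean squared error
toy_max_err = np.abs(errs).max()         # worst single error
```

In the notebook, `pairs` would be `hold_out` and `actual` the corresponding ratings, so each value of k would need only one call instead of one DataFrame lookup per held-out point.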