# Recommender From Scratch

Given that my entire dataset is only 17,688 entries, fits neatly into a CSV file, and runs in a Jupyter Notebook, I'm going to try to write a recommender from scratch, based on a lecture by Matt Drury.

To start, I'm going to need my Utility Matrix...

In [1]:
import pandas as pd
from scipy import sparse

# first load my data...
ratings = pd.read_csv('../data/ratings.csv', delimiter='|', header=None, names=['user_id', 'system_id', 'ratings'])

# get highest user_id & highest system_id
highest_user_id = ratings.user_id.max()
highest_system_id = ratings.system_id.max()

# make a sparse matrix...
utility_matrix = sparse.lil_matrix((highest_user_id + 1, highest_system_id + 1))
# +1 to be able to use actual ids as indices, as opposed to having to make concessions

# of course, now I need to fill it with ratings...
for _, row in ratings.iterrows():
    utility_matrix[row.user_id, row.system_id] = row.ratings
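
For a dataset this small the row-by-row loop is fine, but the same matrix can probably be built in a single call. Here is a minimal sketch, assuming the ids are non-negative integers; the three arrays are made-up stand-ins for the columns of ratings.csv, and toy_utility is a hypothetical name.

```python
import numpy as np
from scipy import sparse

# Made-up (user_id, system_id, rating) triples standing in for ratings.csv.
user_ids = np.array([0, 0, 2, 3])
system_ids = np.array([1, 4, 1, 2])
values = np.array([1, 3, 2, 1])

shape = (user_ids.max() + 1, system_ids.max() + 1)  # +1 so ids work as indices

# COO takes all coordinates and values in one call, so no Python-level loop.
toy_utility = sparse.coo_matrix((values, (user_ids, system_ids)), shape=shape)
toy_utility = toy_utility.tolil()  # LIL again, so item assignment still works
```

With the real DataFrame, the three arrays would be ratings.user_id.values, ratings.system_id.values and ratings.ratings.values. One difference to keep in mind: if a (user, system) pair appears twice, the COO conversion sums the values, while the loop keeps only the last one.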


Next step is to find cosine similarities... the following function was taken from my [Galvanize](https://www.galvanize.com/seattle/data-science) classwork on the subject.

In [2]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def make_cos_sim_and_neighborhoods(ratings_mat, neighborhood_size):
    '''
    Accepts a 2 dimensional matrix ratings_mat, and an integer neighborhood_size.
    Returns a tuple containing:
        - items_cos_sim, an item-item matrix where each element is the
        cosine_similarity of the items at the corresponding row and column. This
        is a square matrix where the length of each dimension equals the number
        of columns in ratings_mat.
        - neighborhood, a 2-dimensional matrix where each row is the neighborhood
        for that item. The elements are the indices of the n (neighborhood_size)
        most similar items. Most similar items are at the end of the row.
    '''
    items_cos_sim = cosine_similarity(ratings_mat.T)
    least_to_most_sim_indexes = np.argsort(items_cos_sim, 1)
    neighborhood = least_to_most_sim_indexes[:, -neighborhood_size:]
    return items_cos_sim, neighborhood

In [3]:
# use the utility matrix from above and a neighborhood size of 75 for fun...
items_cos_sim, neighborhoods = make_cos_sim_and_neighborhoods(utility_matrix, neighborhood_size=75)
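
To see what the function is doing, here is a small sketch on a made-up 3 x 3 matrix. It computes the item-item cosine similarity by hand with NumPy instead of calling sklearn, and it assumes no item column is all zeros (the hand-rolled version would divide by zero there). The names toy, toy_sim and toy_neighborhood are hypothetical.

```python
import numpy as np

# Toy utility matrix: 3 users (rows) x 3 items (columns); values are made up.
toy = np.array([[1.0, 1.0, 0.0],
                [0.0, 2.0, 0.0],
                [0.0, 0.0, 3.0]])

# Cosine similarity between columns: dot products divided by column norms.
norms = np.linalg.norm(toy, axis=0)
toy_sim = toy.T.dot(toy) / np.outer(norms, norms)

# Same neighborhood rule as make_cos_sim_and_neighborhoods, with size 2:
# argsort ascending, keep the last 2 columns (most similar at the end).
toy_neighborhood = np.argsort(toy_sim, 1)[:, -2:]
```

Since every item has a similarity of 1 with itself, the last entry of each neighborhood row here is the item's own index. The same should hold for the real neighborhoods, unless two items have identical rating columns.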

Now to be able to make predictions...

In [4]:
import time

def pred_one_user(user_id, item_cos_sim, neighborhoods, ratings_mat, timer=False):
    '''
    Accept user id as arg. Return the predictions for a single user.

    Optional argument to specify whether or not timing should be provided
    on this operation.
    '''
    if timer:
        start = time.clock()
    n_items = ratings_mat.shape[1]  # number of items
    items_rated_by_this_user = ratings_mat[user_id].nonzero()[1]
    # Just initializing so we have somewhere to put rating preds
    output = np.zeros(n_items)
    for item_to_rate in xrange(n_items):

        relevant_items = np.intersect1d(neighborhoods[item_to_rate],
                                        items_rated_by_this_user,
                                        assume_unique=True)
                                    # assume_unique speeds up intersection op
        output[item_to_rate] = ratings_mat[user_id, relevant_items] * \
            item_cos_sim[item_to_rate, relevant_items] / \
            item_cos_sim[item_to_rate, relevant_items].sum()
    output = np.nan_to_num(output)  # get rid of nan values...
    if timer:
        end = time.clock()
        print 'output... {:.3f}'.format(end-start)
    return output
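
The core of the function is a similarity-weighted average of the user's own ratings, restricted to the items that are both in the target item's neighborhood and rated by the user. A sketch of that one step for a single item, with made-up numbers and hypothetical names:

```python
import numpy as np

# Made-up numbers for one user and one target item (item 0).
user_ratings = np.array([0.0, 3.0, 1.0, 0.0, 2.0])     # 0 means "not rated"
sims_to_target = np.array([1.0, 0.5, 0.25, 0.9, 0.0])  # similarity of each item to item 0
target_neighborhood = np.array([1, 2, 3])              # hypothetical neighbors of item 0

rated = np.nonzero(user_ratings)[0]                    # items this user rated
relevant = np.intersect1d(target_neighborhood, rated, assume_unique=True)

# Similarity-weighted average of the user's ratings on the relevant items.
weights = sims_to_target[relevant]
prediction = user_ratings[relevant].dot(weights) / weights.sum()
```

Here only items 1 and 2 are relevant, so the prediction is (3 * 0.5 + 1 * 0.25) / 0.75. When no relevant items exist the weights sum to zero and the division yields nan, which is presumably why pred_one_user finishes with np.nan_to_num.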

In [5]:
# test the above...
user_test = ratings.groupby('user_id').ratings.count().reset_index(name='count')
test_user = user_test[user_test['count'] == user_test['count'].max()]['user_id']
print test_user

2437    65232
2462    66112
Name: user_id, dtype: int64


In [6]:
preds = pred_one_user(65232, items_cos_sim, neighborhoods, utility_matrix)
for system in ratings[ratings['user_id'] == 65232]['system_id']:
    rat = int(ratings[(ratings['user_id'] == 65232) & (ratings['system_id'] == system)]['ratings'])
    print rat, preds[system]

1 1.02857636626
1 1.03726877441
1 1.06797898995
1 1.03109590023
1 1.09266881413
1 1.03368475257
1 1.02419317323
1 1.02094702972
1 1.04861761502
3 2.57315861917
1 1.02295067111
1 1.0




## Validation...

Okay, so not too terrible... but we need to do some sort of validation, and this is an issue.  There are different methods: randomly remove some number of values and check against them, use a time-based split where we train on a certain time period and then test/validate on later data, etc.

My plan, for better or worse, is to remove about 20% of users, rerun the sims and neighborhoods, then check against the hold-out users.  I can create a train_utility_matrix for this, then predict on the standard one using my holdouts.

In [9]:
from random import shuffle

# select a hold-out set...
user_ids = list(set(ratings['user_id'].tolist()))
shuffle(user_ids)  # somewhat shuffled already, but whatever
index = int(len(user_ids)*.8)
train_users = user_ids[:index]
hold_out = user_ids[index:]

# now to make a new matrix...
train_users.sort()
highest_user_id = train_users[-1]
train_utility_matrix = sparse.lil_matrix((highest_user_id + 1, highest_system_id + 1))
# +1 to be able to use actual ids as indices, as opposed to having to make concessions

# of course, now I need to fill it with ratings...
train_user_set = set(train_users)  # set lookup is O(1); a list is scanned for every row
for _, row in ratings.iterrows():
    if row.user_id in train_user_set:
        train_utility_matrix[row.user_id, row.system_id] = row.ratings

In [10]:
# use the train utility matrix from above and a neighborhood size of 75 for fun...
train_items_cos_sim, train_neighborhoods = make_cos_sim_and_neighborhoods(train_utility_matrix, neighborhood_size=75)

Since I have the ratings from my hold_out users in my old utility_matrix, I can reuse it here:

In [19]:
from math import sqrt

n = 0
RMSE = 0
max_err = 0
for i, user in enumerate(hold_out):
    # ratings come from the full utility matrix; sims and neighborhoods come from the training set
    preds = pred_one_user(user, train_items_cos_sim, train_neighborhoods, utility_matrix)
    for system in ratings[ratings['user_id'] == user]['system_id']:
        rat = int(ratings[(ratings['user_id'] == user) & (ratings['system_id'] == system)]['ratings'])
        sub = preds[system] - rat
        if abs(sub) >= max_err:
            max_err = abs(sub)
        RMSE = RMSE + (preds[system] - rat)**2
        n += 1
    print 'Progress {:.2f}% \r'.format(100.0*(i + 1)/float(len(hold_out))),
RMSE = sqrt(RMSE/n)
print 
print 'RMSE = {}, Max Error = {}'.format(RMSE, max_err)



Progress 100.00% 
RMSE = 0.182441491839, Max Error = 5.0


The RMSE is okay, but that max error is really high... I'm off by as much as 5 for some users. This is similar to the results I was getting in Spark, though with a better RMSE (I guess Spark's method is not as good as this one here, but it likely makes up for it in its ability to handle Big Data).
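
A low RMSE and a high max error are not contradictory: RMSE averages over every prediction, so a handful of bad misses gets diluted by many small ones. A quick sketch with made-up errors (not the actual errors from the run above):

```python
import numpy as np

# Made-up errors: 99 predictions that are off by 0.1, and one off by 5.0.
errors = np.array([0.1] * 99 + [5.0])

rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
max_err = np.abs(errors).max()        # largest single miss
```

With these numbers the RMSE comes out to roughly 0.51 even though the worst prediction is off by 5.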

## Matrix Factorization

Okay, so time to use Matrix Factorization.  I'll be using Nonnegative Matrix Factorization (NMF) from [sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). I'm a fan of Nonnegative Matrix Factorization for 2 reasons:
1. There are no negative values, which prevents weird results.  Since there is no 'thumbs down' other than not purchasing a product, I don't want to have negative values spitting out negative results.
2. I prefer the US matrix form, since it allows for manipulation with a K matrix. U is u x k, S is k x s, and K is a k x k matrix you can throw in between them. If K is I (identity) it has absolutely no effect, but you can manipulate results by tossing numbers on the off-diagonals. For example, in a book recommender, you can find the group that pregnancy books are in, find the group that stillbirth books are in, and have an interest in stillbirth books negatively influence the results for pregnancy books, since that seems like a cruel thing to hit customers with.
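
The K matrix idea can be sketched with plain NumPy. The factors below are made up by hand rather than fitted; with sklearn's NMF, U would correspond to the output of fit_transform and S to the components_ attribute. The particular K is just an illustrative assumption.

```python
import numpy as np

# Hypothetical factors, as if produced by NMF with k = 2 latent groups.
U = np.array([[1.0, 0.0],        # user 0 loads only on group 0
              [0.0, 1.0]])       # user 1 loads only on group 1
S = np.array([[2.0, 0.0, 1.0],   # group 0's affinity for each of 3 items
              [0.0, 3.0, 1.0]])  # group 1's affinity for each of 3 items

# With K = I the product is just the plain reconstruction U.dot(S).
K_identity = np.eye(2)
plain = U.dot(K_identity).dot(S)

# A negative off-diagonal entry makes membership in group 0 push down
# the scores of items that group 1 likes.
K = np.array([[1.0, -0.5],
              [0.0,  1.0]])
adjusted = U.dot(K).dot(S)
```

In this toy case user 0's score for item 1 drops from 0 to -1.5 while user 1's scores are untouched. Note that the adjusted scores can dip below zero even though U and S themselves are nonnegative.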

There is an issue however... I won't be able to use non-users to validate the way I did above; they need to get a vector at the same time as everyone else... so I need to 'knock out' values in my utility_matrix to fit back in later...
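
One way the knock-out could look, as a rough sketch: pick a random ~20% of the existing ratings, remember them, and zero them out in a training copy. The toy matrix here is a small dense array with made-up values; the 20% fraction, the seed, and the names train and held_out are all assumptions for illustration. A scipy sparse matrix exposes the same .nonzero() method, so the approach should carry over to utility_matrix.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the split is repeatable

# Toy utility matrix standing in for utility_matrix; values are made up.
toy = np.zeros((4, 5))
for u, s, r in [(0, 1, 1), (0, 4, 3), (1, 2, 2), (2, 0, 1), (3, 3, 5)]:
    toy[u, s] = r

rows, cols = toy.nonzero()                   # coordinates of existing ratings
n_holdout = int(round(0.2 * len(rows)))      # hold out ~20% of the ratings
picked = rng.choice(len(rows), size=n_holdout, replace=False)

train = toy.copy()
held_out = []                                # (user, system, rating) triples
for idx in picked:
    u, s = rows[idx], cols[idx]
    held_out.append((u, s, toy[u, s]))
    train[u, s] = 0                          # knock the rating out of training
```

After factorizing train, the predictions at the held_out coordinates can be compared against the stored ratings, in the same spirit as the RMSE check in the validation section.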