# Recommender Systems (Movie Reviews)
- Done as part of a SharpestMinds skills test
- Author: Chris Hodapp
- Date: 2018-02-04

This assumes that `ml-100k` from movielens data (should be the `ml-100k` archive from https://grouplens.org/datasets/movielens/100k/) has been downloaded and uncompressed in the local directory.

## Loading data

In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection

In [2]:
ml = pd.read_csv("ml-100k/u.data", sep="\t", header=None,
                 names=("user_id", "movie_id", "rating", "time"))
# Convert Unix seconds to a Pandas timestamp:
ml["time"] = pd.to_datetime(ml["time"], unit="s")

In [3]:
ml[:10]

Unnamed: 0,user_id,movie_id,rating,time
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16
5,298,474,4,1998-01-07 14:20:06
6,115,265,2,1997-12-03 17:51:28
7,253,465,5,1998-04-03 18:34:27
8,305,451,3,1998-02-01 09:20:17
9,6,86,3,1997-12-31 21:16:53


In [4]:
ml.shape

(100000, 4)

In [5]:
max_user  = int(ml["user_id"].max() + 1)
max_movie = int(ml["movie_id"].max() + 1)
max_user, max_movie, max_user * max_movie

(944, 1683, 1588752)

To get an idea of data sparsity:

In [6]:
ml.shape[0] / (max_user * max_movie)

0.06294248567429025

## Training/testing split:

In [7]:
ml_train, ml_test = sklearn.model_selection.train_test_split(ml, test_size=0.25)

## Conversion to utility matrix:

We need a mask for some later steps, hence the m > 0 step; ratings go only from 1 to 5, so values of 0 are automatically unknown/missing data.

In [8]:
def df2mat(df):
    m = np.zeros((max_user, max_movie))
    m[df["user_id"], df["movie_id"]] = df["rating"]
    return m, m > 0
ml_mat_train, ml_mask_train = df2mat(ml_train)
ml_mat_test,  ml_mask_test  = df2mat(ml_test)

If this were an actual large amount of data, which a 944x1683 matrix doesn't really count as, you'd probably want [sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html) and to use 8-bit ints rather than 32-bit floats, for instance:

```python
ml_mat = scipy.sparse.coo_matrix(
    (ml["rating"], (ml["user_id"], ml["movie_id"])),
    shape=(max_user, max_movie),
    dtype=np.int8)
```

## Slope One implementation

- Based on:  [Slope One Predictors for Online Rating-Based Collaborative Filtering](https://arxiv.org/pdf/cs/0702144v1.pdf)
- TODO: This needs better explanation but I'm not sure if it should reside here, in the Python code, or in the blag post

In [9]:
def deviation(M, mask):
    m,n = M.shape
    m2 = mask.astype(np.int)
    counts = m2.T @ m2
    S = m2.T @ M
    diffs = S.T - S
    dev = diffs / np.maximum(1, counts)
    return dev, counts

The implementation of 'deviation' above might be less-optimal vastly larger matrices. For one thing, Slope One doesn't really *need* a utility matrix, though it's easier from one. One could readily compute deviation from the list of ratings, though I don't know a fast way to do this.

In [22]:
def predict_one(M, mask, dev, counts, u, j, weighted = False):
    m,n = M.shape
    # S_u is a mask over M's columns for items user 'u' rated:
    S_u = mask[u, :]
    if weighted:
        # In 'Weighted Slope One', we sum over everything user 'u' rated,
        # regardless of whether other users rated both this and item j:
        S_u[j] = False
        c_j = counts[j, S_u]
        devs = dev[j, S_u]
        u = M[u, S_u]
        return ((devs + u) * c_j).sum() / max(1.0, c_j.sum())
    else:
        # In the 'Slope One' formula we are summing over R_j, which is:
        # Every item 'i' (i != j), such that: user 'u' rated item 'i', and
        # at least one other user rated both item 'i' and item 'j'.
        # Below we compute this likewise as a mask over M's columns:
        R_j = S_u * (counts[u, :] > 0)
        R_j[j] = False
        u = M[u, R_j].sum()
        devs = dev[j, R_j].sum()
        card = max(1.0, R_j.sum())
        return (u + devs) / card

In [11]:
def predict(M, mask, dev, counts, dataframe, weighted=False):
    err_mae = 0
    err_rms = 0
    for row in dataframe.itertuples():
        p = predict_one(M, mask, dev, counts,
                        row.user_id, row.movie_id, weighted=weighted)
        err_mae += np.abs(p - row.rating)
        err_rms += np.square(p - row.rating)
    err_mae = err_mae / len(dataframe)
    err_rms = np.sqrt(err_rms / len(dataframe))
    return err_mae, err_rms

In [12]:
# Compute deviation (which is basically our model) from training:
dev, counts = deviation(ml_mat_train, ml_mask_train)

In [13]:
err_mae_train, err_rms_train = predict(ml_mat_train, ml_mask_train, dev, counts, ml_train)
err_mae_test,  err_rms_test  = predict(ml_mat_test,  ml_mask_test,  dev, counts, ml_test)

In [14]:
print("Training error: MAE={}, RMS={}".format(err_mae_train, err_rms_train))
print("Testing error: MAE={}, RMS={}".format(err_mae_test, err_rms_test))

Training error: MAE=0.6705920085426367, RMS=0.8698715649103287
Testing error: MAE=0.7690465575577171, RMS=0.9922856864635986


## Weighted Slope One

In [15]:
err_mae_train, err_rms_train = predict(ml_mat_train, ml_mask_train, dev, counts, ml_train, True)
err_mae_test,  err_rms_test  = predict(ml_mat_test,  ml_mask_test,  dev, counts, ml_test,  True)
# why must I pass both dataframe and matrix?

In [16]:
print("Training error: MAE={}, RMS={}".format(err_mae_train, err_rms_train))
print("Testing error: MAE={}, RMS={}".format(err_mae_test, err_rms_test))

Training error: MAE=0.7366789114708739, RMS=0.9834047856298306
Testing error: MAE=0.8970442751834006, RMS=1.2349462361157233


## SVD implementation

# Implementations in `scikit-surprise`

[Surprise](http://surpriselib.com/) contains implementations of many of the same things, so these are tested below. This same dataset is included as a built-in, but for consistency, we may as well load it from our dataframe.

In [17]:
import surprise
from surprise.dataset import Dataset

In [18]:
reader = surprise.Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ml[["user_id", "movie_id", "rating"]], reader)

In [19]:
surprise.model_selection.cross_validate(surprise.NormalPredictor(), data, cv=5)

{'fit_time': (0.09823107719421387,
  0.09905147552490234,
  0.09962821006774902,
  0.09817862510681152,
  0.09549379348754883),
 'test_mae': array([ 1.22155201,  1.22409488,  1.22286712,  1.21837144,  1.22539256]),
 'test_rmse': array([ 1.5236036 ,  1.52539035,  1.52403493,  1.51874752,  1.52268848]),
 'test_time': (0.22506093978881836,
  0.22594857215881348,
  0.2251570224761963,
  0.22582459449768066,
  0.2660207748413086)}

In [20]:
surprise.model_selection.cross_validate(surprise.SlopeOne(), data, cv=5)

{'fit_time': (0.7243213653564453,
  0.7411255836486816,
  0.812612771987915,
  0.8363091945648193,
  0.669187068939209),
 'test_mae': array([ 0.746087  ,  0.74980919,  0.73966436,  0.74414759,  0.73613966]),
 'test_rmse': array([ 0.94990862,  0.95352051,  0.94189575,  0.94693599,  0.93514955]),
 'test_time': (2.5336906909942627,
  2.3776566982269287,
  2.3864481449127197,
  2.502324342727661,
  2.2572689056396484)}

In [21]:
surprise.model_selection.cross_validate(surprise.SVD(), data, cv=5)

{'fit_time': (5.609443664550781,
  5.635587453842163,
  5.525452136993408,
  5.772013902664185,
  4.828369617462158),
 'test_mae': array([ 0.744339  ,  0.74548684,  0.72696361,  0.73389245,  0.73644925]),
 'test_rmse': array([ 0.94308608,  0.94312428,  0.92535598,  0.93239132,  0.93154266]),
 'test_time': (0.24836158752441406,
  0.2776060104370117,
  0.24745988845825195,
  0.2395009994506836,
  0.28363823890686035)}