# Recommender Systems (Movie Reviews)
- Done as part of a SharpestMinds skills test
- Author: Chris Hodapp
- Date: 2018-02-04

This assumes that the MovieLens `ml-100k` archive (from https://grouplens.org/datasets/movielens/100k/) has been downloaded and uncompressed in the local directory.

## Loading data

In [1]:
import pandas as pd

In [2]:
ml = pd.read_csv("ml-100k/u.data", sep="\t", header=None,
                 names=("user_id", "movie_id", "rating", "time"))
# Convert Unix seconds to a Pandas timestamp:
ml["time"] = pd.to_datetime(ml["time"], unit="s")

In [3]:
ml[:10]

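|   | user_id | movie_id | rating | time                |
|---|---------|----------|--------|---------------------|
| 0 | 196     | 242      | 3      | 1997-12-04 15:55:49 |
| 1 | 186     | 302      | 3      | 1998-04-04 19:22:22 |
| 2 | 22      | 377      | 1      | 1997-11-07 07:18:36 |
| 3 | 244     | 51       | 2      | 1997-11-27 05:02:03 |
| 4 | 166     | 346      | 1      | 1998-02-02 05:33:16 |
| 5 | 298     | 474      | 4      | 1998-01-07 14:20:06 |
| 6 | 115     | 265      | 2      | 1997-12-03 17:51:28 |
| 7 | 253     | 465      | 5      | 1998-04-03 18:34:27 |
| 8 | 305     | 451      | 3      | 1998-02-01 09:20:17 |
| 9 | 6       | 86       | 3      | 1997-12-31 21:16:53 |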


## Conversion to utility matrix
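
One way this conversion might look is a pandas pivot, with one row per user, one column per movie, and `NaN` wherever a user has not rated a movie. The sketch below uses a small made-up frame (`toy`, a hypothetical name) with the same columns as `ml` so that it runs without the dataset; the same call should apply to `ml` itself, assuming each (user, movie) pair appears at most once.

```python
import pandas as pd

# Toy ratings with the same columns as `ml`; the values are made up for illustration.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating":   [5, 3, 4, 2, 1],
})

# Rows are users, columns are movies; unrated (user, movie) pairs become NaN.
# pivot() raises an error if a (user, movie) pair is duplicated.
utility = toy.pivot(index="user_id", columns="movie_id", values="rating")
print(utility)
```

On the full `ml-100k` data this would give a 943 x 1682 matrix that is mostly `NaN`, since only 100,000 of the roughly 1.6 million cells are rated.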

## Slope One implementation

- Based on:  [Slope One Predictors for Online Rating-Based Collaborative Filtering](https://arxiv.org/pdf/cs/0702144v1.pdf)
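
As a rough illustration of the weighted Slope One scheme described in the paper, the sketch below computes the average rating difference between every pair of items and then predicts a missing rating as a count-weighted average of "other rating plus deviation". The function names (`fit_slope_one`, `predict_slope_one`) and the toy ratings are hypothetical, and this is a minimal pure-Python version rather than a tuned implementation.

```python
from collections import defaultdict


def fit_slope_one(ratings):
    """ratings: dict of user -> {item: rating}. Returns (dev, count).

    dev[(j, i)] is the mean of (rating_j - rating_i) over the users who
    rated both items; count[(j, i)] is the number of such users.
    """
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for user_ratings in ratings.values():
        for j, r_j in user_ratings.items():
            for i, r_i in user_ratings.items():
                if i != j:
                    diff_sum[(j, i)] += r_j - r_i
                    count[(j, i)] += 1
    dev = {pair: diff_sum[pair] / count[pair] for pair in diff_sum}
    return dev, dict(count)


def predict_slope_one(dev, count, user_ratings, j):
    """Weighted Slope One prediction of item j for one user (None if impossible)."""
    num, den = 0.0, 0
    for i, r_i in user_ratings.items():
        if i != j and (j, i) in dev:
            c = count[(j, i)]
            num += (dev[(j, i)] + r_i) * c
            den += c
    return num / den if den else None


# Made-up ratings: three users, three items, with some ratings missing.
toy_ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
dev, count = fit_slope_one(toy_ratings)
lucy_a = predict_slope_one(dev, count, toy_ratings["lucy"], "A")
print(lucy_a)
```

Here item A is rated 0.5 higher than B on average (two users) and 3 higher than C (one user), so the prediction for Lucy is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3 = 13/3, about 4.33.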

## SVD implementation
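
In recommender systems, "SVD" often refers to a biased matrix factorization trained by stochastic gradient descent rather than a literal singular value decomposition; this is the approach `surprise.SVD` takes. Assuming that reading, a minimal NumPy sketch could look like the following. The function names, the toy ratings, and the hyperparameters are all hypothetical, and user and item IDs are assumed to be 0-based contiguous indices (the 1-based IDs in `ml` would need remapping first).

```python
import numpy as np


def fit_mf(triples, n_users, n_items, n_factors=2, n_epochs=500,
           lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + bu[u] + bi[i] + P[u] . Q[i] by SGD on (u, i, r) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))  # global mean rating
    bu = np.zeros(n_users)                           # user biases
    bi = np.zeros(n_items)                           # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the old value for the Q update
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q


def predict_mf(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]


def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))


# Made-up (user, item, rating) triples for illustration.
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 2), (2, 1, 1), (2, 2, 5)]
model = fit_mf(toy, n_users=3, n_items=3)

baseline_rmse = rmse([r - model[0] for _, _, r in toy])
train_rmse = rmse([r - predict_mf(model, u, i) for u, i, r in toy])
print(baseline_rmse, train_rmse)
```

The comparison at the end only checks that the fitted model beats "always predict the global mean" on the training triples; it says nothing about generalization, which is what the cross-validation below measures.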

# Implementations in `scikit-surprise`

[Surprise](http://surpriselib.com/) implements many of the same algorithms, including Slope One and SVD, so they are tested below for comparison. Surprise also ships this dataset as a built-in, but for consistency we load it from our dataframe instead.

In [4]:
import surprise
from surprise.dataset import Dataset

In [5]:
reader = surprise.Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ml[["user_id", "movie_id", "rating"]], reader)

In [6]:
surprise.model_selection.cross_validate(surprise.NormalPredictor(), data, cv=5)

{'fit_time': (0.09998893737792969,
  0.09818625450134277,
  0.1028749942779541,
  0.11400818824768066,
  0.1002800464630127),
 'test_mae': array([ 1.21683815,  1.22864312,  1.22023948,  1.22270293,  1.23460341]),
 'test_rmse': array([ 1.51518163,  1.53196473,  1.52401875,  1.52228593,  1.53492893]),
 'test_time': (0.22759389877319336,
  0.22928690910339355,
  0.23446130752563477,
  0.2422645092010498,
  0.2239856719970703)}

In [7]:
surprise.model_selection.cross_validate(surprise.SlopeOne(), data, cv=5)

{'fit_time': (0.7037899494171143,
  0.721583366394043,
  0.8074040412902832,
  0.7357473373413086,
  0.6973576545715332),
 'test_mae': array([ 0.7471126 ,  0.74196661,  0.74298677,  0.73929044,  0.73975864]),
 'test_rmse': array([ 0.94925965,  0.94434669,  0.94661347,  0.93993236,  0.94192486]),
 'test_time': (2.674623489379883,
  2.4104838371276855,
  2.4269442558288574,
  2.2999517917633057,
  2.1615943908691406)}

In [8]:
surprise.model_selection.cross_validate(surprise.SVD(), data, cv=5)

{'fit_time': (5.146217346191406,
  5.063850402832031,
  5.20699143409729,
  5.4639341831207275,
  4.780801057815552),
 'test_mae': array([ 0.73932333,  0.74096928,  0.73779762,  0.7343451 ,  0.7427832 ]),
 'test_rmse': array([ 0.93781629,  0.93760565,  0.94041924,  0.93228853,  0.93985592]),
 'test_time': (0.2464735507965088,
  0.25014376640319824,
  0.24506402015686035,
  0.23924827575683594,
  0.22957062721252441)}
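
To compare the three runs at a glance, the per-fold RMSE values can be averaged. The sketch below copies the `test_rmse` arrays from the outputs above into a plain dictionary (`fold_rmse` is a hypothetical name) and uses only the standard library.

```python
from statistics import fmean

# Per-fold test RMSE, copied from the cross_validate outputs above.
fold_rmse = {
    "NormalPredictor": [1.51518163, 1.53196473, 1.52401875, 1.52228593, 1.53492893],
    "SlopeOne": [0.94925965, 0.94434669, 0.94661347, 0.93993236, 0.94192486],
    "SVD": [0.93781629, 0.93760565, 0.94041924, 0.93228853, 0.93985592],
}

mean_rmse = {name: fmean(folds) for name, folds in fold_rmse.items()}
for name, value in mean_rmse.items():
    print(f"{name}: mean RMSE {value:.4f}")
```

On these folds SVD comes out slightly ahead of Slope One (about 0.938 versus 0.944), and both are far ahead of the random `NormalPredictor` baseline (about 1.526).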