# Recommendation Engines - MovieLens Data

## Tuesday June 20 2017

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

### Tasks

1. Load the data into the recommendation format
2. Build and assess model accuracy
3. Make individual recommendations
4. Try multiple models and compare accuracy
5. Consider how a company could use this

In [5]:
# Install Surprise - a useful library for recommendation engines
!pip install scikit-surprise

Collecting scikit-surprise
  Using cached scikit-surprise-1.0.3.tar.gz
Building wheels for collected packages: scikit-surprise
  Running setup.py bdist_wheel for scikit-surprise: started
  Running setup.py bdist_wheel for scikit-surprise: finished with status 'done'
  Stored in directory: C:\Users\michael.stone\AppData\Local\pip\Cache\wheels\5c\84\0c\21a872115299d7e2170620fc9fad866ec7588e958d9ac77b35
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.0.3


In [7]:
# Load Surprise
from surprise import SVD
from surprise import Dataset
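# evaluate and print_perf are available in Surprise 1.0.3 (installed above);
# later releases replace them with surprise.model_selection.cross_validate.
from surprise import evaluate, print_perf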
from surprise import Reader

In [8]:
# 1. Load the data into the recommendation format

# As we're loading a custom dataset, we need to define a reader. In the
# movielens dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path = '../../data/u.data', reader=reader)
data.split(n_folds=5)
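
To make the file layout described above concrete, here is a minimal sketch that parses lines in the `user item rating timestamp` tab-separated format using only the standard library. The sample rows and the `parse_ratings` helper are made up for illustration; they are not read from `u.data` and are not part of Surprise.

```python
import csv
import io

# Illustrative rows in the 'user item rating timestamp' layout (tab-separated).
# These values are invented for the example; they are not taken from u.data.
sample = "1\t10\t4\t880000000\n1\t20\t2\t880000100\n2\t10\t5\t880000200\n"


def parse_ratings(handle):
    """Return a list of (user_id, item_id, rating) tuples; ids stay as strings."""
    rows = []
    for user, item, rating, _timestamp in csv.reader(handle, delimiter="\t"):
        rows.append((user, item, float(rating)))
    return rows


ratings = parse_ratings(io.StringIO(sample))
print(ratings)
```

Keeping the ids as strings mirrors the way raw ids are passed to `predict` later in the notebook.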

In [12]:
# 2. Build and assess model accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9403
MAE:  0.7406
------------
Fold 2
RMSE: 0.9317
MAE:  0.7366
------------
Fold 3
RMSE: 0.9333
MAE:  0.7381
------------
Fold 4
RMSE: 0.9352
MAE:  0.7352
------------
Fold 5
RMSE: 0.9410
MAE:  0.7397
------------
------------
Mean RMSE: 0.9363
Mean MAE : 0.7380
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9403  0.9317  0.9333  0.9352  0.9410  0.9363  
MAE     0.7406  0.7366  0.7381  0.7352  0.7397  0.7380  
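
The two measures reported above can be computed by hand. The following is a small sketch of RMSE and MAE on a toy list of actual and predicted ratings (the toy values are assumptions for illustration), followed by a check that the mean of the five fold RMSEs printed above matches the reported mean.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Toy ratings, invented for the example.
actual = [4.0, 2.0, 5.0, 3.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted), mae(actual, predicted))

# Per-fold RMSE values copied from the SVD output above.
fold_rmse = [0.9403, 0.9317, 0.9333, 0.9352, 0.9410]
mean_rmse = round(sum(fold_rmse) / len(fold_rmse), 4)
print(mean_rmse)
```

An MAE of about 0.74 means that, on average, a prediction is off by roughly three quarters of a star on the 1-5 scale.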


In [19]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.30   {'was_impossible': False}
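
The cell above scores a single user/item pair. To turn predictions into a recommendation list, one common approach is to score every item the user has not rated yet and keep the highest-scoring ones. The sketch below assumes a hypothetical `predict_fn(user_id, item_id)` that returns an estimated rating (with Surprise this could wrap `algo.predict(uid, iid).est`); the toy scores are invented.

```python
def top_n_for_user(user_id, all_items, rated_items, predict_fn, n=3):
    """Rank items the user has not rated by predicted rating, highest first."""
    candidates = [item for item in all_items if item not in rated_items]
    scored = [(item, predict_fn(user_id, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical stand-in for a trained model: a fixed score per item.
toy_scores = {"10": 4.1, "20": 3.2, "30": 4.7, "40": 2.5}


def toy_predict(user_id, item_id):
    return toy_scores[item_id]


# The user has already rated item "20", so it is excluded from the candidates.
recs = top_n_for_user("196", ["10", "20", "30", "40"], {"20"}, toy_predict, n=2)
print(recs)
```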


# Non-Negative Matrix Factorisation

In [14]:
# 4. Try multiple models and compare accuracy

# Try at least 3 of the models mentioned below:
#random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
#baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.
#knns.KNNBasic    A basic collaborative filtering algorithm.
#knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
#knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.
#matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
#matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
#matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.
#slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.
#co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.


# Here's how to run Non-Negative Matrix Factorisation
from surprise import NMF

# Now we will try Non-Negative Matrix Factorisation (a form of collaborative filtering)
algo.NMF = NMF()

# Evaluate performances of our algorithm on the dataset.
perf.NMF = evaluate(algo.NMF, data, measures=['RMSE', 'MAE'])

print_perf(perf.NMF)

Evaluating RMSE, MAE of algorithm NMF.

------------
Fold 1
RMSE: 0.9680
MAE:  0.7628
------------
Fold 2
RMSE: 0.9600
MAE:  0.7574
------------
Fold 3
RMSE: 0.9622
MAE:  0.7579
------------
Fold 4
RMSE: 0.9614
MAE:  0.7546
------------
Fold 5
RMSE: 0.9681
MAE:  0.7571
------------
------------
Mean RMSE: 0.9639
Mean MAE : 0.7580
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9680  0.9600  0.9622  0.9614  0.9681  0.9639  
MAE     0.7628  0.7574  0.7579  0.7546  0.7571  0.7580  


In [18]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.NMF.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.30   {'was_impossible': False}


# KNN with Means

In [21]:
# 4. Try multiple models and compare accuracy

# Here's how to run KNN with Means
from surprise import KNNWithMeans

# Now we will try KNN with Means (collaborative filtering that takes each user's mean rating into account)
algo.KNNWithMeans = KNNWithMeans()

# Evaluate performances of our algorithm on the dataset.
perf.KNNWithMeans = evaluate(algo.KNNWithMeans, data, measures=['RMSE', 'MAE'])

print_perf(perf.KNNWithMeans)

Evaluating RMSE, MAE of algorithm KNNWithMeans.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9526
MAE:  0.7513
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9457
MAE:  0.7476
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9503
MAE:  0.7506
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9487
MAE:  0.7455
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9573
MAE:  0.7507
------------
------------
Mean RMSE: 0.9509
Mean MAE : 0.7491
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9526  0.9457  0.9503  0.9487  0.9573  0.9509  
MAE     0.7513  0.7476  0.7506  0.7455  0.7507  0.7491  


In [22]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.KNNWithMeans.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.21   {'actual_k': 40, 'was_impossible': False}
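
The idea behind a mean-centred nearest-neighbour prediction can be sketched in a few lines: start from the target user's own mean rating, then add a similarity-weighted average of how far each neighbour's rating of the item sits from that neighbour's mean. This is a simplified illustration, not Surprise's implementation; the similarities, ratings, and the `knn_with_means_predict` helper are assumptions for the example.

```python
def knn_with_means_predict(user_mean, neighbours):
    """Estimate a rating from mean-centred neighbour ratings.

    neighbours is a list of (similarity, neighbour_rating, neighbour_mean)
    tuples for users who rated the target item.
    """
    weighted_offsets = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    total_similarity = sum(sim for sim, _, _ in neighbours)
    if total_similarity == 0:
        # No usable neighbours: fall back to the user's own mean.
        return user_mean
    return user_mean + weighted_offsets / total_similarity


# Invented neighbours: (similarity, their rating of the item, their mean rating).
neighbours = [(0.5, 5.0, 4.0), (0.25, 2.0, 3.0), (0.25, 4.0, 3.0)]
est = knn_with_means_predict(3.5, neighbours)
print(est)
```

Centring on each user's mean compensates for the fact that some users rate generously and others harshly.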


# CoClustering

In [23]:
# 4. Try multiple models and compare accuracy

# Here's how to run CoClustering
from surprise import CoClustering

# Now we will try CoClustering (a collaborative filtering algorithm based on co-clustering)
algo.CoClustering = CoClustering()

# Evaluate performances of our algorithm on the dataset.
perf.CoClustering = evaluate(algo.CoClustering, data, measures=['RMSE', 'MAE'])

print_perf(perf.CoClustering)

Evaluating RMSE, MAE of algorithm CoClustering.

------------
Fold 1
RMSE: 0.9648
MAE:  0.7541
------------
Fold 2
RMSE: 0.9682
MAE:  0.7604
------------
Fold 3
RMSE: 0.9644
MAE:  0.7578
------------
Fold 4
RMSE: 0.9647
MAE:  0.7528
------------
Fold 5
RMSE: 0.9677
MAE:  0.7544
------------
------------
Mean RMSE: 0.9659
Mean MAE : 0.7559
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9648  0.9682  0.9644  0.9647  0.9677  0.9659  
MAE     0.7541  0.7604  0.7578  0.7528  0.7544  0.7559  


In [24]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.CoClustering.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.37   {'was_impossible': False}


# SlopeOne

In [25]:
# 4. Try multiple models and compare accuracy

# Here's how to run SlopeOne
from surprise import SlopeOne

# Now we will try SlopeOne (a simple yet accurate collaborative filtering algorithm)
algo.SlopeOne = SlopeOne()

# Evaluate performances of our algorithm on the dataset.
perf.SlopeOne = evaluate(algo.SlopeOne, data, measures=['RMSE', 'MAE'])

print_perf(perf.SlopeOne)

Evaluating RMSE, MAE of algorithm SlopeOne.

------------
Fold 1
RMSE: 0.9480
MAE:  0.7453
------------
Fold 2
RMSE: 0.9400
MAE:  0.7406
------------
Fold 3
RMSE: 0.9420
MAE:  0.7430
------------
Fold 4
RMSE: 0.9442
MAE:  0.7417
------------
Fold 5
RMSE: 0.9518
MAE:  0.7446
------------
------------
Mean RMSE: 0.9452
Mean MAE : 0.7431
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9480  0.9400  0.9420  0.9442  0.9518  0.9452  
MAE     0.7453  0.7406  0.7430  0.7417  0.7446  0.7431  


In [26]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.SlopeOne.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.25   {'was_impossible': False}
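
For intuition, here is a sketch of the classic unweighted Slope One scheme: for each item the user has rated, take the average difference between the target item and that item across users who rated both, add it to the user's rating, and average the results. This is a textbook variant written for illustration and may differ in detail from Surprise's `SlopeOne`; the toy ratings and the `slope_one_predict` helper are assumptions.

```python
def slope_one_predict(ratings, user, target):
    """Predict ratings[user][target] with unweighted Slope One.

    ratings maps user -> {item: rating}. Returns None when no other item
    rated by the user shares a rater with the target item.
    """
    estimates = []
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Differences (target - item) over users who rated both items.
        diffs = [r[target] - r[item] for r in ratings.values()
                 if target in r and item in r]
        if diffs:
            estimates.append(user_rating + sum(diffs) / len(diffs))
    if not estimates:
        return None
    return sum(estimates) / len(estimates)


# Invented toy data: user "b" has not rated item "z".
ratings = {
    "a": {"x": 4.0, "y": 3.0, "z": 2.0},
    "b": {"x": 3.0, "y": 4.0},
    "c": {"y": 2.0, "z": 5.0},
}
est = slope_one_predict(ratings, "b", "z")
print(est)
```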


# BaselineOnly

In [27]:
# 4. Try multiple models and compare accuracy

# Here's how to run BaselineOnly
from surprise import BaselineOnly

# Now we will try BaselineOnly (predicts the baseline estimate for a given user and item)
algo.BaselineOnly = BaselineOnly()

# Evaluate performances of our algorithm on the dataset.
perf.BaselineOnly = evaluate(algo.BaselineOnly, data, measures=['RMSE', 'MAE'])

print_perf(perf.BaselineOnly)

Evaluating RMSE, MAE of algorithm BaselineOnly.

------------
Fold 1
Estimating biases using als...
RMSE: 0.9460
MAE:  0.7498
------------
Fold 2
Estimating biases using als...
RMSE: 0.9373
MAE:  0.7456
------------
Fold 3
Estimating biases using als...
RMSE: 0.9438
MAE:  0.7499
------------
Fold 4
Estimating biases using als...
RMSE: 0.9439
MAE:  0.7470
------------
Fold 5
Estimating biases using als...
RMSE: 0.9505
MAE:  0.7506
------------
------------
Mean RMSE: 0.9443
Mean MAE : 0.7486
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9460  0.9373  0.9438  0.9439  0.9505  0.9443  
MAE     0.7498  0.7456  0.7499  0.7470  0.7506  0.7486  


In [28]:
# 3. Make individual recommendations
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.BaselineOnly.predict(uid, iid, r_ui=2, verbose=True)

user: 196        item: 302        r_ui = 2.00   est = 4.16   {'was_impossible': False}
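
A baseline estimate is usually written as the global mean plus a user bias plus an item bias. The sketch below estimates those biases by alternating regularised averages, loosely in the spirit of the "als" message in the output above. It is an illustration under stated assumptions: the regularisation values, epoch count, helper names, and toy ratings are all chosen for the example and are not taken from the notebook.

```python
from collections import defaultdict


def fit_baselines(ratings, reg_user=15.0, reg_item=10.0, n_epochs=10):
    """Return (global_mean, user_biases, item_biases) for (user, item, rating) rows."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_user = defaultdict(float)
    b_item = defaultdict(float)
    for _ in range(n_epochs):
        # Update item biases while holding user biases fixed.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[i] += r - mu - b_user[u]
            counts[i] += 1
        for i in sums:
            b_item[i] = sums[i] / (reg_item + counts[i])
        # Update user biases while holding item biases fixed.
        sums, counts = defaultdict(float), defaultdict(int)
        for u, i, r in ratings:
            sums[u] += r - mu - b_item[i]
            counts[u] += 1
        for u in sums:
            b_user[u] = sums[u] / (reg_user + counts[u])
    return mu, b_user, b_item


def predict_baseline(mu, b_user, b_item, user, item):
    """Global mean plus biases; unknown users or items contribute zero."""
    return mu + b_user.get(user, 0.0) + b_item.get(item, 0.0)


# Invented toy ratings: u1 rates higher than u2, and i1 is liked more than i2.
toy = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0), ("u2", "i2", 2.0)]
mu, b_user, b_item = fit_baselines(toy)
print(predict_baseline(mu, b_user, b_item, "u1", "i1"))
```

The regularisation terms shrink biases towards zero when a user or item has few ratings, which is why the biases here are much smaller than the raw differences from the mean.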


In [4]:
# Summary Results
import pandas as pd

summary_results = pd.read_csv("summary_results.csv", index_col=0)
summary_results

Algorithm                            RMSE     MAE  Prediction
BaselineOnly                       0.9443  0.7486        4.16
SVD                                0.9363  0.7380        4.30
Non-Negative Matrix Factorisation  0.9639  0.7580        4.30
KNN with Means                     0.9509  0.7491        4.21
CoClustering                       0.9659  0.7559        4.37
SlopeOne                           0.9452  0.7431        4.25
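
To compare the models at a glance, the summary figures can be ranked by error. The sketch below copies the mean RMSE and MAE values from the results above into a plain dictionary and sorts them; only the variable names are new.

```python
# Mean cross-validation errors copied from the summary results above.
results = {
    "BaselineOnly": {"RMSE": 0.9443, "MAE": 0.7486},
    "SVD": {"RMSE": 0.9363, "MAE": 0.7380},
    "Non-Negative Matrix Factorisation": {"RMSE": 0.9639, "MAE": 0.7580},
    "KNN with Means": {"RMSE": 0.9509, "MAE": 0.7491},
    "CoClustering": {"RMSE": 0.9659, "MAE": 0.7559},
    "SlopeOne": {"RMSE": 0.9452, "MAE": 0.7431},
}

# Lower error is better, so an ascending sort puts the best model first.
by_rmse = sorted(results, key=lambda name: results[name]["RMSE"])
by_mae = sorted(results, key=lambda name: results[name]["MAE"])

for name in by_rmse:
    print(f"{name:35s} RMSE={results[name]['RMSE']:.4f} MAE={results[name]['MAE']:.4f}")
```

On these numbers SVD has the lowest RMSE and the lowest MAE, with the simple BaselineOnly and SlopeOne models close behind on RMSE.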


##### 5. Consider how a company could use this

How might a company use a recommendation engine like this in practice? Write a few paragraphs on how they could apply the approach above, covering:
- How does the algorithm work?
- What data would be used?
- How would we know if it's working?
- What is the benefit of using an algorithm like this over just recommending the most popular films overall?
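
As a starting point for the last two questions, here is a sketch of the "most popular films" alternative together with one possible way to check whether recommendations are working, precision at k. Both helpers, the toy ratings, and the choice of precision at k as the measure are assumptions for illustration; a company might prefer click-through or conversion measured in an A/B test.

```python
from collections import Counter


def most_popular(ratings, user, n=2):
    """Recommend the most-rated items the user has not rated yet."""
    counts = Counter(item for _, item, _ in ratings)
    seen = {item for u, item, _ in ratings if u == user}
    ranked = [item for item, _ in counts.most_common() if item not in seen]
    return ranked[:n]


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that the user actually liked."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


# Invented toy ratings as (user, item, rating) rows.
toy = [
    ("u1", "A", 5), ("u2", "A", 4), ("u3", "A", 4),
    ("u1", "B", 3), ("u2", "B", 2),
    ("u3", "C", 5),
]

# A brand-new user and an existing user receive lists from the same ranking,
# differing only by what the existing user has already rated.
print(most_popular(toy, "u4"))
print(most_popular(toy, "u3"))
print(precision_at_k(["A", "B"], {"A"}, k=2))
```

The popularity list ignores individual taste entirely, which is the gap a personalised model such as SVD is meant to close.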