<a href="https://colab.research.google.com/github/HSV-AI/presentations/blob/master/2021/210217_Recommendation_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![HSV-AI Logo](https://github.com/HSV-AI/hugo-website/blob/master/static/images/logo_v9.png?raw=true)

# Quick Start for Recommendation Systems

Agenda:
- Welcome
- Project updates
- Current news
- Presentation on Recommendation Systems
- Q&A
- Close

We will start with a common dataset used for exploring recommendation systems, the [MovieLens Dataset](http://grouplens.org/datasets/movielens/).

We will also use the [Surprise](http://surpriselib.com/) library to build a few different recommendation systems and look at their accuracy for the dataset.

The name SurPRISE (roughly :) ) stands for Simple Python RecommendatIon System Engine.

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2021-02-15 12:58:02--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2021-02-15 12:58:05 (4.44 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [None]:
!unzip ml-latest-small.zip

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
     |████████████████████████████████| 11.8MB 6.7MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618236 sha256=9bbcffb8d69841f02d58ecac464a09dcb37ff5d69540e311319a04fb3920aaa0
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [7]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate


# Load Surprise's built-in MovieLens 100k dataset (it is downloaded if not already present).
data = Dataset.load_builtin('ml-100k')


## Looking at the data

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.

- Each user has rated at least 20 movies.
- Users and items are numbered consecutively from 1.
- The data is randomly ordered.
- This is a tab separated list of user id | item id | rating | timestamp. 
- The timestamps are Unix seconds since 1/1/1970 UTC.


In [23]:
import pandas as pd

data_df = pd.read_table('/root/.surprise_data/ml-100k/ml-100k/u.data', header=None)
data_df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
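The integer column labels are hard to read. A minimal sketch of loading the same rows with descriptive names, assuming the field order given in the description above (user id, item id, rating, timestamp). The column names and the inlined sample are illustrative choices, not part of the dataset:

```python
import io

import pandas as pd

# The first five rows of u.data shown above, inlined so the sketch runs
# without the downloaded file. Fields are tab separated.
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
)

# Column names are our own labels, following the documented field order.
columns = ["user_id", "item_id", "rating", "timestamp"]
ratings_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

# Timestamps are Unix seconds, so convert them to UTC datetimes.
ratings_df["rated_at"] = pd.to_datetime(ratings_df["timestamp"], unit="s", utc=True)

print(ratings_df)
```

The same `names=columns` argument should work when reading the real `u.data` file in place of the inlined sample.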


u.item     -- Information about the items (movies).
- This is a pipe (`|`) separated list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western
- The last 19 fields are the genres: a 1 indicates the movie is of that genre, and a 0 indicates it is not.
- Movies can be in several genres at once.
- The movie ids are the ones used in the u.data data set.

In [22]:
item_df = pd.read_csv('/root/.surprise_data/ml-100k/ml-100k/u.item', sep="|", encoding='iso-8859-1', header=None)
item_df.head()

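|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 |  | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |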
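The 19 trailing 0/1 fields are easier to interpret once they are mapped back to genre names. A small sketch, assuming the genre order listed in the u.item description above; `decode_genres` is a hypothetical helper written for this example, and the flags are copied from the Toy Story row of the output:

```python
# Genre names in the order listed in the u.item description.
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def decode_genres(flags):
    """Return the genre names whose flag is 1 (hypothetical helper)."""
    if len(flags) != len(GENRES):
        raise ValueError(f"expected {len(GENRES)} flags, got {len(flags)}")
    return [name for name, flag in zip(GENRES, flags) if flag == 1]


# Genre flags from the Toy Story (1995) row shown above.
toy_story_flags = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_story_genres = decode_genres(toy_story_flags)
print(toy_story_genres)
```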


u.user     -- Demographic information about the users.
- This is a pipe (`|`) separated list of
              user id | age | gender | occupation | zip code
- The user ids are the ones used in the u.data data set.

In [21]:
user_df = pd.read_csv('/root/.surprise_data/ml-100k/ml-100k/u.user', sep="|", encoding='iso-8859-1', header=None)
user_df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


## Types of Recommendation Systems (From Wikipedia)

### Collaborative Filtering

> Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items. By locating peer users/items with a rating history similar to the current user or item, they generate recommendations using this neighborhood. Collaborative filtering methods are classified as memory-based and model-based.

> #### Memory Based

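> They are called memory-based because predictions are computed directly from the stored ratings: the algorithm itself is not complicated, but it requires a lot of memory to keep the rating data (or the similarities derived from it) available.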
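To make the memory-based idea concrete, here is a minimal user-based sketch in plain Python. It is not how Surprise implements its neighborhood algorithms; the toy ratings, the choice of cosine similarity, and the function names are all assumptions made for illustration:

```python
import math

# Hypothetical toy ratings (user -> {item: rating}); not taken from MovieLens.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the neighbors' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(user, other)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


sim_bob = cosine_similarity("alice", "bob")
sim_carol = cosine_similarity("alice", "carol")
predicted = predict("alice", "D")
print(f"sim(alice, bob)={sim_bob:.3f}  sim(alice, carol)={sim_carol:.3f}")
print(f"predicted rating of D for alice: {predicted:.2f}")
```

Every prediction walks over the stored ratings, which is why this family of methods scales with the size of the rating data rather than with a trained model.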

> #### Model Based

Model-based approaches build some type of machine learning model. The Surprise package provides three matrix factorization models: SVD, SVDpp, and NMF.
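As a rough illustration of what a matrix factorization model learns, the sketch below fits user and item factor vectors with stochastic gradient descent on a handful of made-up ratings. It is a simplified stand-in, not Surprise's implementation: it omits the per-user and per-item bias terms, and the data, hyperparameters, and names are assumptions chosen for the example:

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: (user index, item index, rating).
ratings = [
    (0, 0, 5), (0, 1, 3), (0, 2, 4),
    (1, 0, 4), (1, 2, 5), (1, 3, 4),
    (2, 0, 1), (2, 1, 5), (2, 3, 2),
]
n_users, n_items, n_factors = 3, 4, 2
learning_rate, regularization, n_epochs = 0.05, 0.02, 200

# Small random starting factors for users (P) and items (Q).
P = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
global_mean = sum(r for _, _, r in ratings) / len(ratings)


def predict(u, i):
    """Global mean plus the dot product of the user and item factors."""
    return global_mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def training_rmse():
    squared = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(squared / len(ratings))


rmse_before = training_rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(n_factors):
            pu, qi = P[u][f], Q[i][f]
            # Gradient step on the squared error with L2 regularization.
            P[u][f] += learning_rate * (err * qi - regularization * pu)
            Q[i][f] += learning_rate * (err * pu - regularization * qi)
rmse_after = training_rmse()

print(f"training RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

The error here is measured on the training ratings only, so it says nothing about generalization; that is what the cross-validation later in the notebook is for.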

> #### Problems with Collaborative Filtering

> - Cold start: For a new user or item, there isn't enough data to make accurate recommendations.
- Scalability: In many of the environments in which these systems make recommendations, there are millions of users and products. Thus, a large amount of computation power is often necessary to calculate recommendations.
- Sparsity: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings.
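The sparsity problem is easy to quantify for the dataset used here. A quick calculation from the counts quoted in the u.data description (100000 ratings, 943 users, 1682 items):

```python
# Counts taken from the u.data description above.
n_ratings = 100_000
n_users = 943
n_items = 1682

# Fraction of the user x item matrix that actually holds a rating.
density = n_ratings / (n_users * n_items)
sparsity = 1 - density

print(f"possible user-item pairs: {n_users * n_items}")
print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

So even in this small, curated dataset, where every user rated at least 20 movies, only about 6% of the possible ratings are present.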

### Content Based Filtering

> Content-based filtering methods are based on a description of the item and a profile of the user's preferences. These methods are best suited to situations where there is known data on an item (name, location, description, etc.), but not on the user. Content-based recommenders treat recommendation as a user-specific classification problem and learn a classifier for the user's likes and dislikes based on an item's features.
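A minimal content-based sketch using genre flags as the item description. Rather than training a classifier as described above, it takes a simpler route: build a user profile from the genres of rated movies and rank unseen movies by cosine similarity to that profile. The genre flags are a five-genre subset of the u.item rows shown earlier; the user's two ratings are hypothetical:

```python
import math

# Subset of genres used as item features (reduced from 19 for readability).
FEATURES = ["Action", "Animation", "Comedy", "Drama", "Thriller"]

# Genre flags for the first five movies, restricted to FEATURES.
items = {
    "Toy Story (1995)": [0, 1, 1, 0, 0],
    "GoldenEye (1995)": [1, 0, 0, 0, 1],
    "Four Rooms (1995)": [0, 0, 0, 0, 1],
    "Get Shorty (1995)": [1, 0, 1, 1, 0],
    "Copycat (1995)": [0, 0, 0, 1, 1],
}

# Hypothetical ratings from one user.
user_ratings = {"GoldenEye (1995)": 5, "Toy Story (1995)": 1}

# Profile: genre flags weighted by how far each rating is from the user's mean.
mean_rating = sum(user_ratings.values()) / len(user_ratings)
profile = [0.0] * len(FEATURES)
for title, rating in user_ratings.items():
    for k, flag in enumerate(items[title]):
        profile[k] += (rating - mean_rating) * flag


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Score only the movies the user has not rated yet.
scores = {
    title: cosine(profile, flags)
    for title, flags in items.items()
    if title not in user_ratings
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

No other users' ratings are involved, which is the key difference from collaborative filtering.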

In [None]:
# We'll use the famous SVD algorithm.
algo = SVD()

## Accuracy measures:

- `rmse`: Root Mean Squared Error (RMSE).
- `mse`: Mean Squared Error (MSE).
- `mae`: Mean Absolute Error (MAE).
- `fcp`: Fraction of Concordant Pairs (FCP).
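The first three measures are simple functions of the prediction errors. A plain-Python sketch with made-up (true rating, predicted rating) pairs; these are hand-written helpers for illustration, not the Surprise functions, and FCP is left out because it needs per-user pairs of ratings:

```python
import math

# Hypothetical (true rating, predicted rating) pairs.
pairs = [(4, 3.5), (3, 3.0), (5, 4.0), (2, 3.0)]


def mse(pairs):
    """Mean of the squared errors."""
    return sum((true - pred) ** 2 for true, pred in pairs) / len(pairs)


def rmse(pairs):
    """Square root of the MSE, in the same units as the ratings."""
    return math.sqrt(mse(pairs))


def mae(pairs):
    """Mean of the absolute errors."""
    return sum(abs(true - pred) for true, pred in pairs) / len(pairs)


mse_value = mse(pairs)
rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(f"MSE={mse_value}  RMSE={rmse_value}  MAE={mae_value}")
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does.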

In [5]:

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MSE', 'MAE', 'FCP'], cv=5, verbose=True)


Evaluating RMSE, MSE, MAE, FCP of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9367  0.9401  0.9339  0.9319  0.9384  0.9362  0.0030  
MSE (testset)     0.8775  0.8839  0.8722  0.8685  0.8806  0.8765  0.0055  
MAE (testset)     0.7389  0.7386  0.7360  0.7351  0.7396  0.7377  0.0017  
FCP (testset)     0.6986  0.7015  0.6976  0.7025  0.6957  0.6992  0.0025  
Fit time          4.95    4.93    4.95    4.92    4.92    4.94    0.01    
Test time         0.20    0.20    0.14    0.14    0.19    0.17    0.03    


{'fit_time': (4.953832149505615,
  4.932880401611328,
  4.952653646469116,
  4.923021554946899,
  4.918716192245483),
 'test_fcp': array([0.69864914, 0.70151484, 0.69762264, 0.70253807, 0.69566175]),
 'test_mae': array([0.73889904, 0.7385772 , 0.73604459, 0.73513605, 0.73963754]),
 'test_mse': array([0.8774512 , 0.88386805, 0.87218185, 0.8685187 , 0.88055808]),
 'test_rmse': array([0.93672365, 0.94014257, 0.93390676, 0.93194351, 0.93838056]),
 'test_time': (0.20032739639282227,
  0.19601655006408691,
  0.14214229583740234,
  0.1350100040435791,
  0.19290852546691895)}
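The returned dictionary can be summarized directly. A small sketch using the per-fold values printed above, copied in by hand; it assumes the Std column is a population standard deviation, which is consistent with the printed 0.0030:

```python
import statistics

# Per-fold results copied from the cross_validate output above.
test_rmse = [0.93672365, 0.94014257, 0.93390676, 0.93194351, 0.93838056]
test_mse = [0.8774512, 0.88386805, 0.87218185, 0.8685187, 0.88055808]

mean_rmse = statistics.mean(test_rmse)
std_rmse = statistics.pstdev(test_rmse)  # population standard deviation

# Sanity check: within each fold, RMSE squared should equal MSE.
consistent = all(abs(r * r - m) < 1e-6 for r, m in zip(test_rmse, test_mse))

print(f"RMSE: {mean_rmse:.4f} +/- {std_rmse:.4f}")
print(f"RMSE^2 matches MSE in every fold: {consistent}")
```

On a 1 to 5 rating scale, a mean RMSE of roughly 0.94 means the SVD predictions are typically off by a little under one star.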