# Recommender Systems (RecSys) and SVD

## Learning Goals
- Perform collaborative filtering from ratings matrices using `pandas` and `sklearn` on the beers data
- Understand why this approach represents collaborative filtering
- Perform collaborative filtering using the [python-recsys](https://github.com/ocelma/python-recsys) library that provides some nice built-in recommender functionality
- Understand how SVDs or other matrix decompositions might fit into a recommender algorithm

>Note: The [`recsys` package](https://github.com/python-recsys/python-recsys.git) and its dependencies work best with Python 2. If you want to use Python 3, use [Surprise](http://surpriselib.com) or other options.

Recommender Systems have become ubiquitous in the modern data science landscape, as companies like Google, Netflix, Pandora, and Facebook rely on them to provide targeted content recommendations and create a more enjoyable user experience.  In this lab, we'll focus on the process of ***collaborative filtering*** for building recommenders on two different datasets (beers and movies).  

[Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) relies on a ***ratings matrix*** for all items, to generate similarities between items and users based on similar ratings.

[Content-Based Filtering](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering) explicitly maps items and/or users into a shared feature space based on explicit user/item characteristics. State-of-the-art recommenders often rely on hybrid approaches, so seek to understand the differences, strengths, and weaknesses of each approach.

### Datasets
- [Beer Ratings](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/beer_reviews.tar.gz): A dataset of beer reviews
- [Movielens Data](https://github.com/pburkard88/DS_BOS_06/blob/master/Data/movielens): A dataset of movie ratings from the original [here](http://grouplens.org/datasets/movielens/)

## Similarity-Based Recommendation System: Beers
The first dataset is a list of beer reviews from a collection of reviewers. We'll use this data to generate a reviewer/beer ratings matrix from which we can perform collaborative filtering and recommend beers based on user preferences.

## Get the Data

In [None]:
# import the usual suspects
import pandas as pd
import numpy as np

Now let's get the data.  If you don't already have it locally you can use curl to pull it down.

In [None]:
#! curl -O https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz

These steps are optional; just download the data and `read_csv()` the file into `pandas`.

In [None]:
#! mv 'beer_reviews.tar.gz' ~/

Import the data into a `pandas` dataframe called `df` by calling `read_csv()` with the appropriate path and the parameter `compression='gzip'`. You don't need this parameter if you have already extracted your file; it's just nice to see that pandas can handle gzipped data.

In [None]:
# on_bad_lines='skip' requires pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv("/Users/jb/beer_reviews.tar.gz", low_memory=False, compression='gzip', on_bad_lines='skip')
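
One caveat worth checking: a `.tar.gz` file is a tar archive wrapped in gzip, so `compression='gzip'` only strips the outer layer and the first row may pick up tar header bytes. Below is a minimal sketch of reading a CSV member through the standard-library `tarfile` module instead, assuming the archive holds a CSV file. The toy archive, its file names, and its contents are made up for illustration and are not the real beer data.

```python
import io
import os
import tarfile
import tempfile

import pandas as pd

# Build a tiny stand-in archive (hypothetical names, not the real beer data).
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "toy_reviews.tar.gz")
csv_bytes = b"beer_name,review_overall\nPale Ale,4.0\nStout,3.5\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="toy_reviews.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

# Read the first CSV member straight out of the archive.
with tarfile.open(archive_path, "r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith(".csv"))
    toy_df = pd.read_csv(tar.extractfile(member))

print(toy_df)
```

If the first column name of your own `df` looks garbled after loading, this is one way to rule out the archive layer as the cause.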

### Explore the Data
Let's look at the data with `head()`

In [None]:
df.head()

Create a separate data frame `df_test` to investigate a little further by selecting only the reviews with **beer_name** equal to "Pale Ale" using the `isin([])` method. Then sort the resulting table by **review_profilename** and examine the rows. You should notice that the same reviewer can review multiple Pale Ales.

In [None]:
df_test = df[df.beer_name.isin(['Pale Ale'])].sort_values('review_profilename', axis=0)
df_test.sample(10)

Let's restrict this to the top 250 beers. Use the `value_counts()` method to get the **beer_name** values sorted by count, then take the first 250. Overwrite `df` with this new data.

In [None]:
df.beer_name.value_counts()

In [None]:
n = 250
top_n = df.beer_name.value_counts().index[:n]

df = df[df['beer_name'].isin(top_n)]
df.head()
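
The same filtering pattern on a toy frame may make the steps easier to follow. This is a small sketch with made-up data, where `toy`, `counts`, and `top_2` are illustrative names only:

```python
import pandas as pd

# Toy reviews (made-up data): beer "A" has 3 reviews, "B" has 2, "C" has 1.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

counts = toy["beer_name"].value_counts()   # sorted by count, descending
top_2 = counts.index[:2]                   # labels of the 2 most-reviewed beers
toy_top = toy[toy["beer_name"].isin(top_2)]

print(counts)
print(toy_top)
```

Because `value_counts()` already sorts in descending order, slicing its index is enough to get the most-reviewed labels.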

How big is this dataset?

In [None]:
df.info()

Aggregate the data in a pivot table called `df_wide` using the `pivot_table` method. The table should hold the mean **review_overall** for each combination of **beer_name** and **review_profilename**, with the mean (`numpy.mean`) as the aggregator. In other words, the `values` parameter should contain **review_overall** and the `index` parameter should contain **beer_name** and **review_profilename**. Make sure to call `unstack()` at the end so that reviewers become columns.

In [None]:
df_wide = pd.pivot_table(df, values=["review_overall"],
        index=["beer_name", "review_profilename"],
        aggfunc=np.mean).unstack()
df_wide.shape
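
As an alternative sketch (not the form used in this lab), the reviewers can be passed through the `columns` argument directly, which gives a flat column index instead of the two-level one produced by `unstack()`. The data below is made up for illustration:

```python
import pandas as pd

# Made-up long-format reviews; "ann" reviewed beer "A" twice.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B"],
    "review_profilename": ["ann", "ann", "bob", "ann", "cat"],
    "review_overall": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Rows are beers, columns are reviewers, cells are mean ratings.
toy_wide = toy.pivot_table(
    values="review_overall",
    index="beer_name",
    columns="review_profilename",
    aggfunc="mean",
)
print(toy_wide)
```

Repeated reviews by the same reviewer are averaged, and beer/reviewer pairs with no review show up as `NaN`.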

Display the head of the pivot table, but only for 5 users (columns are users)

In [None]:
df_wide.iloc[0:5, 0:5]

### Discussion: what do you notice in this table?

Set NaNs to zero with the `fillna()` function.

In [None]:
df_wide = df_wide.fillna(0)

Check that rows are beers by examining the first few rows.

In [None]:
pd.Series(df_wide.index[:10])

### Calculate distance between beers

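This is the key step. We have our ratings matrix now, and we're going to use `cosine_similarity` from scikit-learn to compare every pair of beers in this space. Note that cosine similarity is a similarity rather than a distance: higher values mean two beers were rated more alike, and a value of 1 means the two rating vectors point in the same direction. We keep the variable name `dists` below for convenience.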

In [None]:
# import distance methods
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import manhattan_distances
from sklearn.metrics.pairwise import euclidean_distances

Apply `cosine_similarity()` to `df_wide` to calculate the pairwise similarities between beers and store the result in a variable called `dists`.

In [None]:
dists = cosine_similarity(df_wide)
dists
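
To see what the cosine computation amounts to, here is a hand-rolled NumPy sketch on a made-up 3-beer matrix. It is meant only to illustrate the formula (dot product divided by the product of the norms), not to replace the scikit-learn call:

```python
import numpy as np

# Made-up ratings: 3 beers (rows) x 4 reviewers (columns), 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

# Cosine similarity: dot product of rows divided by the product of their norms.
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
unit_rows = ratings / norms
similarity = unit_rows @ unit_rows.T

# A distance can be derived as 1 - similarity.
cosine_distance = 1.0 - similarity

print(np.round(similarity, 3))
```

Beers 0 and 1 share reviewers and come out highly similar, while beers 0 and 2 have no reviewers in common and get a similarity of 0.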

### Discussion: what type of object is dists?

Convert `dists` to a pandas DataFrame, using the index of `df_wide` as both the row and column labels (the matrix is square). This gives a beers-by-beers matrix of similarity scores between every pair of beers in the ratings space. Check out the first 10 or so rows and columns and make sure things look right (you should see 1s on the diagonal).

In [None]:
dists = pd.DataFrame(dists, columns=df_wide.index)

dists.index = dists.columns
dists.iloc[0:10, 0:10]

Select some beers and store them in `beers_i_like`, then look at their similarity scores against other beers with `head()`.

In [None]:
beers_i_like = ['Sierra Nevada Pale Ale', '120 Minute IPA', 'Allagash White']
dists[beers_i_like].head()

Sum the similarity scores of your favorite beers by row, so that each beer in the sample gets a single score. For instance, if there are 3 beers in your `beers_i_like`, you will be summing 3 numbers for each row. Store the results in `beers_summed`. There are 2 ways you can do this:
1. Calling `apply()` with a lambda function that contains `np.sum()` with `axis=1`
2. Calling `np.sum()` with `axis=1` on the entire dataframe (sliced by columns you like)

In [None]:
#beers_summed = dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)
beers_summed = np.sum(dists[beers_i_like], axis=1)

Optional: which function is faster? Use `%timeit` to check.

In [None]:
%timeit dists[beers_i_like].apply(lambda row: np.sum(row), axis=1)

In [None]:
%timeit np.sum(dists[beers_i_like], axis=1) #should be much faster

Sort the summed beers from best to worst using `sort_values()`.

In [None]:
beers_summed = beers_summed.sort_values(ascending=False)
beers_summed

Filter out the beers used as input using `isin()` and store this in `ranked_beers`, then transform this to a list using `tolist()`.  Print out the first 5 elements.

In [None]:
ranked_beers = beers_summed.index[~beers_summed.index.isin(beers_i_like)]
ranked_beers = ranked_beers.tolist()
ranked_beers[:5]
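
The sum, sort, and filter steps above can be traced on a tiny made-up similarity matrix. The beer labels and scores here are invented for illustration:

```python
import pandas as pd

# Made-up beer-by-beer similarity matrix (symmetric, 1s on the diagonal).
names = ["A", "B", "C", "D"]
toy_sims = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.4],
     [0.9, 1.0, 0.1, 0.5],
     [0.2, 0.1, 1.0, 0.3],
     [0.4, 0.5, 0.3, 1.0]],
    index=names, columns=names,
)

liked = ["A", "B"]
summed = toy_sims[liked].sum(axis=1)            # one score per beer
summed = summed.sort_values(ascending=False)    # most similar first
ranked = summed.index[~summed.index.isin(liked)].tolist()  # drop the inputs

print(ranked)
```

The liked beers themselves score highest (each is perfectly similar to itself), which is why they have to be filtered out before ranking.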

Define a function that does what we just did for an arbitrary input list of beers. It should also accept the maximum number of beers requested, `n`, as an optional parameter.

In [None]:
def get_similar(beers, n=None):
    """
    Calculates which beers are most similar to the inputs. Must not return
    the beers that were inputted.
    
    Parameters
    ----------
    beers: list
        some beers!
    n: int, optional
        maximum number of beers to return (default: return all)
    
    Returns
    -------
    ranked_beers: list
        rank ordered beers
    """
    # ignore any input beers that are not in the similarity matrix
    beers = [beer for beer in beers if beer in dists.columns]
    # one summed similarity score per candidate beer
    beers_summed = dists[beers].sum(axis=1)
    beers_summed = beers_summed.sort_values(ascending=False)
    # drop the input beers from the ranking
    ranked_beers = beers_summed.index[~beers_summed.index.isin(beers)]
    ranked_beers = ranked_beers.tolist()
    if n is None:
        return ranked_beers
    return ranked_beers[:n]

Test your function. Find the 10 beers most similar to "120 Minute IPA"

In [None]:
for beer in get_similar(["120 Minute IPA"], 10):
    print(beer)

Cool, let's try again with the 10 beers most similar to ["Coors Light", "Bud Light", "Amstel Light"]

In [None]:
for i, beer in enumerate(get_similar(["Coors Light", "Bud Light", "Amstel Light"], 10)):
    print("%d) %s" % (i+1, beer))

## Movie Recommendations with Recsys
[python-recsys](https://github.com/ocelma/python-recsys) is a nice Python library for implementing recommender systems.  We'll use it here to try to make movie recommendations from the [movielens dataset](http://grouplens.org/datasets/movielens/).  

### Install Recsys
First run something like the below code to install everything that you need for recsys.

Install the dependencies first with `pip install csc-pysparse networkx divisi2`, either from a terminal or by uncommenting the cell below.

In [None]:
#!pip install csc-pysparse networkx divisi2

Then install recsys from source by running the following in a terminal:
1. `git clone https://github.com/python-recsys/python-recsys.git`
2. `cd python-recsys/`
3. `python setup.py install`

Finally, restart the kernel.

Import `recsys.algorithm`, set `recsys.algorithm.VERBOSE = True` and import `recsys.algorithm.factorize.SVD` class

In [None]:
import recsys.algorithm
recsys.algorithm.VERBOSE = True
from recsys.algorithm.factorize import SVD

### Get the Data
Download the movielens dataset [here](http://files.grouplens.org/datasets/movielens/ml-20m.zip)

>Note: The MovieLens website is constantly changing, and has recently reformatted files to .CSV, though the original .DAT files are hosted on various repos.

Let's look at the files. You can do this however you like.

In [None]:
#! ls ~/data/movielens

Read in the movies.dat data into a variable `movies` by using `pd.read_table` with `sep='::'`.  Make sure to set the `names` to ITEMID, Title, and Genres to set the columns and the `index_col` to ITEMID.

In [None]:
# a multi-character separator such as '::' needs the Python parsing engine
movies = pd.read_table('https://raw.githubusercontent.com/databricks/spark-training/master/data/movielens/medium/movies.dat', sep='::', names=['ITEMID', 'Title', 'Genres'], index_col='ITEMID', engine='python')

### Explore the Data
Take a look at the movies data with `head()`.

In [None]:
movies.head()

Load the [ratings.dat](https://raw.githubusercontent.com/databricks/spark-training/master/data/movielens/medium/ratings.dat) data into a `ratings` variable with the same separator, and the column names UserID, MovieID, Rating, Timestamp.

In [None]:
# a multi-character separator such as '::' needs the Python parsing engine
ratings = pd.read_table('https://raw.githubusercontent.com/databricks/spark-training/master/data/movielens/medium/ratings.dat', sep='::', names=['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')

In [None]:
ratings.head()

Initialize an `SVD` instance called `svd`

In [None]:
svd = SVD()

Populate it with the data from the ratings dataset, using the built-in `load_data()` method.  Point `filename` at your local copy of `ratings.dat`, use `format={'col':0, 'row':1, 'value':2, 'ids': int}`, and don't forget the `sep` parameter.

In [None]:
svd.load_data(filename='./ml-20m/ratings.dat', sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})

Compute SVD with a call to `svd.compute()`.  
- Use `k=100`
- Use `min_values=10`
- Use `pre_normalize=None`
- Use `mean_center=True`
- Use `post_normalize=True`

$M=U \Sigma V^T$:

In [None]:
k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)
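
For intuition about the decomposition $M=U \Sigma V^T$ and the role of `k`, here is a NumPy-only sketch on a made-up ratings matrix. It does not reproduce what `recsys` does internally; in particular, the row-wise mean-centering is an assumption chosen to loosely mirror `mean_center=True`.

```python
import numpy as np

# Made-up ratings: 4 movies (rows) x 5 users (columns).
M = np.array([
    [5.0, 4.0, 1.0, 1.0, 3.0],
    [4.0, 5.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0, 5.0, 3.0],
])

# Mean-center each row (an assumed stand-in for mean_center=True).
row_means = M.mean(axis=1, keepdims=True)
centered = M - row_means

# Full decomposition: centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only the k largest singular values for a rank-k approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + row_means

print(np.round(s, 3))
print(np.round(approx, 2))
```

The singular values come back sorted from largest to smallest, so truncating to the first `k` keeps the directions that explain most of the variation in the ratings.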

You can also save the output SVD model (in a zip file):

In [None]:
# svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/movielens')

Reload a saved model:

In [None]:
# svd2 = SVD(filename='/tmp/movielens')

## Computing Similarities and Making Recommendations
Let's compute the similarity between two movies. First we need to use the movies table to look up the item IDs that were used in the ratings data that generated our SVD.

Determine the movie ids of "Toy Story (1995)" and "Bug's Life, A (1998)".

In [None]:
movies[movies.Title == "Toy Story (1995)"]

In [None]:
movies[movies.Title == "Bug's Life, A (1998)"]

Print the similarity of these 2 movies by calling `svd.similarity()` with those 2 IDs.

In [None]:
ITEMID1 = 1    # Toy Story (1995)
ITEMID2 = 2355 # Bug's Life, A (1998)
print(svd.similarity(ITEMID1, ITEMID2))
# print(svd2.similarity(ITEMID1, ITEMID2)) to check the reloaded model
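
One common way to define item similarity after a factorization is the cosine similarity between the items' latent factor vectors. The NumPy sketch below illustrates that idea on made-up data. Whether `svd.similarity()` computes exactly this internally is not confirmed here, so treat it as an analogy rather than a description of the library.

```python
import numpy as np

# Made-up ratings: 4 movies (rows) x 5 users (columns).
M = np.array([
    [5.0, 4.0, 1.0, 1.0, 3.0],
    [4.0, 5.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0, 5.0, 3.0],
])
centered = M - M.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2
item_factors = U[:, :k] * s[:k]   # one k-dimensional vector per movie


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_01 = cosine(item_factors[0], item_factors[1])  # liked by the same users
sim_02 = cosine(item_factors[0], item_factors[2])  # liked by different users
print(sim_01, sim_02)
```

Movies 0 and 1 are rated highly by the same users, so their factor vectors point in similar directions, while movies 0 and 2 appeal to opposite groups.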

Use `svd.similar()` to get movies similar to Toy Story.

In [None]:
svd.similar(ITEMID1)

Try using `svd.predict()` to predict ratings for a given user and movie, $\hat{r}_{ui}$

In [None]:
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING)
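
Conceptually, a predicted rating $\hat{r}_{ui}$ can be read off the low-rank reconstruction of the ratings matrix and then clipped to the valid rating range. The sketch below shows that idea with NumPy on made-up data. The `predict` helper is hypothetical and only mimics the role of the `MIN_RATING` and `MAX_RATING` arguments.

```python
import numpy as np

# Made-up ratings: 4 movies (rows) x 5 users (columns).
M = np.array([
    [5.0, 4.0, 1.0, 1.0, 3.0],
    [4.0, 5.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0, 5.0, 3.0],
])
MIN_RATING, MAX_RATING = 0.0, 5.0

row_means = M.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(M - row_means, full_matrices=False)

k = 2
reconstructed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + row_means


def predict(item, user):
    """Predicted rating: the reconstructed cell, clipped to the rating scale."""
    return float(np.clip(reconstructed[item, user], MIN_RATING, MAX_RATING))


print(predict(0, 0), "vs. observed", M[0, 0])
```

Clipping matters because a low-rank reconstruction is not constrained to stay inside the rating scale.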

Look it up in the matrix...

In [None]:
svd.get_matrix().value(ITEMID, USERID)

Try using `svd.recommend()` to recommend unrated movies to a user (`is_row=False`).

In [None]:
svd.recommend(USERID, is_row=False)
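
Recommending unrated items boils down to scoring every item for a user and keeping only the ones the user has not rated yet. Here is a NumPy sketch of that logic on made-up data. The mean-imputation step and the `recommend` helper are assumptions for illustration, not the `recsys` implementation.

```python
import numpy as np

# Made-up ratings: 4 movies (rows) x 5 users (columns); NaN = not rated.
R = np.array([
    [5.0, 4.0, 1.0, np.nan, 3.0],
    [4.0, np.nan, 1.0, 2.0, 3.0],
    [np.nan, 1.0, 5.0, 4.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0, 3.0],
])
rated = ~np.isnan(R)

# Fill gaps with each movie's mean rating before factorizing
# (a simple, assumed imputation choice).
row_means = np.nanmean(R, axis=1, keepdims=True)
filled = np.where(rated, R, row_means)

U, s, Vt = np.linalg.svd(filled - row_means, full_matrices=False)
k = 2
scores = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + row_means


def recommend(user, n=2):
    """Indices of the user's unrated movies, highest predicted score first."""
    candidates = np.flatnonzero(~rated[:, user])
    order = np.argsort(-scores[candidates, user])
    return candidates[order][:n].tolist()


print(recommend(0))
```

A user who has rated everything (user 2 in this toy matrix) gets an empty list back.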

Which users should see Toy Story? (That is, which users who have not rated Toy Story would give it a high rating?)

In [None]:
svd.recommend(ITEMID)

Find out [more about recsys](https://github.com/ocelma/python-recsys)
