## Part 3: Building a Recommender System with Implicit Feedback

In this tutorial, we will build an implicit feedback recommender system using the [implicit](https://github.com/benfred/implicit) package.

What is implicit feedback, exactly? Let's revisit collaborative filtering. In [Part 1](https://github.com/topspinj/recommender-tutorial/blob/master/part-1-item-item-recommender.ipynb), we learned that [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) is based on the assumption that `similar users like similar things`. The user-item matrix, or "utility matrix", is the foundation of collaborative filtering. In the utility matrix, rows represent users and columns represent items. 

<img src="images/utility-matrix.png" width="30%"/>

The cells of the matrix are populated by a given user's degree of preference towards an item, which can come in the form of:

1. **explicit feedback:** direct feedback towards an item (e.g., movie ratings which we explored in [Part 1](https://github.com/topspinj/recommender-tutorial/blob/master/part-1-item-item-recommender.ipynb))
2. **implicit feedback:** indirect behaviour towards an item (e.g., purchase history, browsing history, search behaviour)

Implicit feedback makes assumptions about a user's preference based on their actions towards items. Let's take Netflix for example. If you binge-watch a show and blaze through all seasons in a week, there's a high chance that you like that show. However, if you start watching a series and stop halfway through the first episode, there's reason to suspect that you don't like that show.

<img src="images/netflix_implicit_feedback.png" width="50%"/>
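To make the idea concrete, here is a minimal sketch of turning raw viewing events into an implicit feedback signal. The `watch_log` dataframe, its column names, and the choice to sum the fraction of each episode watched are all hypothetical; none of this is part of the MovieLens data used below.

```python
import pandas as pd

# Hypothetical viewing log: one row per playback event (not part of MovieLens).
watch_log = pd.DataFrame({
    "user": ["ann", "ann", "ann", "bob", "bob"],
    "show": ["A", "A", "B", "A", "B"],
    "fraction_watched": [1.0, 1.0, 0.4, 0.5, 1.0],
})

# Sum the watched fractions per (user, show) pair: repeated, complete viewings
# accumulate into a stronger signal than a single abandoned episode.
implicit_signal = (
    watch_log.groupby(["user", "show"])["fraction_watched"]
    .sum()
    .reset_index(name="strength")
)
print(implicit_signal)
```

Other weightings (for example, counting only completed episodes) would be equally reasonable; the point is that the signal is inferred from behaviour rather than stated by the user.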

### Step 1: Import Dependencies

We'll be using the following packages to build our implicit feedback recommender system:

- [numpy](https://numpy.org/)
- [pandas](https://pandas.pydata.org/)
- [implicit](https://github.com/benfred/implicit)
- scipy (specifically, the [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) class)

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

import implicit

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Step 2: Load the Data

Since we're already familiar with MovieLens from Parts 1 and 2 of this tutorial series, we'll continue using this dataset. You can browse the MovieLens datasets [here](https://grouplens.org/datasets/movielens/), or download the zip file directly [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip). We're working with the data in `ml-latest-small.zip` and will need to add the following files to our local directory:
- ratings.csv
- movies.csv

Alternatively, you can access the data here: 
- https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv
- https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv

In [2]:
ratings = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv")
movies = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv")

In [3]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


For this implicit feedback tutorial, we'll treat movie ratings as the number of times that a user watched a movie. For example, if Jane (a user in our database) gave `Batman` a rating of 1 and `Legally Blonde` a rating of 5, we'll assume that Jane watched Batman one time and Legally Blonde five times. 

### Step 3: Transforming the Data

Similar to [Part 1](https://github.com/topspinj/recommender-tutorial/blob/master/part-1-item-item-recommender.ipynb), we need to transform the `ratings` dataframe into a sparse matrix of user-movie interactions. The cells of this matrix will be populated with implicit feedback: in this case, the number of times a user watched a movie. Note that `create_X()` below builds the matrix with movies as rows and users as columns; we transpose it later, when a layout with users as rows is needed.

The `create_X()` function outputs a sparse matrix **X** with four mapper dictionaries:
- **user_mapper:** maps user id to user index
- **movie_mapper:** maps movie id to movie index
- **user_inv_mapper:** maps user index to user id
- **movie_inv_mapper:** maps movie index to movie id

We need these dictionaries because they tell us which user ID and which movie ID each index of the matrix corresponds to, and vice versa.

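The **X** matrix is a [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html), which stores only the entries that are actually present instead of a full grid of mostly empty cells.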
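As a small illustration of how mapper dictionaries and a sparse matrix fit together, here is a sketch on toy data. The ids and counts are made up, and the `toy_` names are hypothetical; the layout (movies on rows, users on columns) mirrors what `create_X()` does.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interactions (made-up ids and counts, for illustration only).
user_ids = np.array([10, 10, 42, 99])
movie_ids = np.array([7, 3, 7, 5])
counts = np.array([4.0, 1.0, 5.0, 2.0])

# Map raw ids to contiguous 0-based indices, as create_X() does.
toy_user_mapper = {uid: i for i, uid in enumerate(np.unique(user_ids))}
toy_movie_mapper = {mid: i for i, mid in enumerate(np.unique(movie_ids))}

rows = [toy_movie_mapper[m] for m in movie_ids]  # movies on the rows
cols = [toy_user_mapper[u] for u in user_ids]    # users on the columns

toy_X = csr_matrix(
    (counts, (rows, cols)),
    shape=(len(toy_movie_mapper), len(toy_user_mapper)),
)

print(toy_X.toarray())

# Fraction of cells that actually hold a value.
density = toy_X.nnz / (toy_X.shape[0] * toy_X.shape[1])
print(f"stored entries: {toy_X.nnz}, density: {density:.2f}")
```

Only four of the nine cells are stored here. On a real ratings dataset the density is far lower, which is why a sparse format is worth the extra bookkeeping.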



<img src="images/user-movie-matrix.png" width="500px" align="left">

In [4]:
def create_X(df):
    """
    Generates a sparse matrix from ratings dataframe.
    
    Args:
        df: pandas dataframe
    
    Returns:
        X: sparse matrix of shape (n_movies, n_users)
        user_mapper: dict that maps user id's to user indices
        movie_mapper: dict that maps movie id's to movie indices
        user_inv_mapper: dict that maps user indices to user id's
        movie_inv_mapper: dict that maps movie indices to movie id's
    """
    N = df['userId'].nunique()
    M = df['movieId'].nunique()

    user_mapper = dict(zip(np.unique(df["userId"]), list(range(N))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(M))))
    
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df["movieId"])))
    
    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))
    
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

In [5]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_X(ratings)

### Creating Movie Title Mappers

We need to interpret a movie title from its index in the user-item matrix and vice versa. Let's create 2 helper functions that make this interpretation easy:

- `get_movie_index()` - converts a movie title to movie index
    - Note that this function uses [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy)'s string matching to get the approximate movie title match based on the string that gets passed in. This means that you don't need to know the exact spelling and formatting of the title to get the corresponding movie index.
- `get_movie_title()` - converts a movie index to movie title
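To show what approximate title matching does, here is a minimal sketch that uses the standard library's `difflib` as a stand-in for fuzzywuzzy. The scoring differs from fuzzywuzzy's, and `closest_title` and the hard-coded `titles` list are hypothetical, so treat this as an illustration of the idea and not of fuzzywuzzy's internals.

```python
import difflib

# A few titles in the MovieLens "Title (Year)" style, hard-coded for illustration.
titles = ["Legally Blonde (2001)", "Forrest Gump (1994)", "Batman (1989)"]

def closest_title(query, candidates):
    """Return the candidate most similar to `query`, ignoring case."""
    def score(candidate):
        matcher = difflib.SequenceMatcher(None, query.lower(), candidate.lower())
        return matcher.ratio()
    return max(candidates, key=score)

# A misspelled, lower-case query still resolves to the intended title.
print(closest_title("legaly blond", titles))
```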

In [6]:
from fuzzywuzzy import process

def movie_finder(title):
    all_titles = movies['title'].tolist()
    closest_match = process.extractOne(title,all_titles)
    return closest_match[0]

movie_title_mapper = dict(zip(movies['title'], movies['movieId']))
movie_title_inv_mapper = dict(zip(movies['movieId'], movies['title']))

def get_movie_index(title):
    fuzzy_title = movie_finder(title)
    movie_id = movie_title_mapper[fuzzy_title]
    movie_idx = movie_mapper[movie_id]
    return movie_idx

def get_movie_title(movie_idx): 
    movie_id = movie_inv_mapper[movie_idx]
    title = movie_title_inv_mapper[movie_id]
    return title 

It's time to test it out! Let's get the movie index of `Legally Blonde`. 

In [7]:
get_movie_index('Legally Blonde')

3282

Let's pass this index value into `get_movie_title()`. We're expecting Legally Blonde to get returned.

In [8]:
get_movie_title(3282)

'Legally Blonde (2001)'

Great! These helper functions will be useful when we want to interpret our recommender results.

### Step 4: Building Our Implicit Feedback Recommender Model


We've transformed and prepared our data so that we can start creating our recommender model.

The [implicit](https://github.com/benfred/implicit) package is built around a linear algebra technique called [matrix factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)), which can help us discover latent features underlying the interactions between users and movies. These latent features give a more compact representation of user tastes and item descriptions. Matrix factorization is particularly useful for very sparse data and can enhance the quality of recommendations. The algorithm works by factorizing the original user-item matrix into two factor matrices:

- user-factor matrix (n_users, k)
- item-factor matrix (k, n_items)

We are reducing the dimensions of our original matrix into $k$ "taste" dimensions. We cannot directly interpret what each latent feature represents. However, we could imagine that one latent feature may represent users who like romantic comedies from the 1990s, while another latent feature may represent movies which are independent foreign language films.

$$X_{mn} \approx P_{mk} \times Q_{nk}^T = \hat{X}$$

<img src="images/matrix-factorization.png" width="60%"/>
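The shapes involved are easy to check numerically. The sketch below builds a toy matrix of exact rank $k$ and recovers one possible pair of factor matrices with a truncated SVD. SVD is used here only because it is a one-liner in numpy; it is not how the implicit package fits its model, and all sizes and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 6, 5, 2  # users, items, latent factors (toy sizes)

# Build a toy matrix that is exactly rank k, so a k-factor model can fit it.
P_true = rng.random((m, k))
Q_true = rng.random((n, k))
X_toy = P_true @ Q_true.T          # shape (m, n)

# Recover one possible factorization with a truncated SVD.
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)
P = U[:, :k] * s[:k]               # user factors, shape (m, k)
Q = Vt[:k, :].T                    # item factors, shape (n, k)

X_hat = P @ Q.T                    # reconstruction, shape (m, n)
print(P.shape, Q.shape, X_hat.shape)
print("max abs error:", np.abs(X_toy - X_hat).max())
```

The recovered `P` and `Q` generally differ from `P_true` and `Q_true` even though their product matches, which is one reason individual latent features are hard to interpret.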

In traditional matrix factorization, such as SVD, we would attempt to solve for the whole factorization at once, which can be very computationally expensive. As a more practical alternative, we can use a technique called `Alternating Least Squares (ALS)`. With ALS, we solve for one factor matrix at a time:

- Step 1: hold the user-factor matrix fixed and solve for the item-factor matrix
- Step 2: hold the item-factor matrix fixed and solve for the user-factor matrix

We alternate between Steps 1 and 2 until the product of the user-factor and item-factor matrices is approximately equal to the original X (user-item) matrix. Because each step reduces to a set of independent least-squares problems, this approach is less computationally expensive and can be run in parallel.
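The two alternating steps can be sketched in a few lines of numpy. This is plain regularized ALS on a small, fully observed dense matrix, written for illustration; it is not the confidence-weighted variant for implicit feedback that the implicit package implements, and `als_dense` and its parameters are hypothetical names.

```python
import numpy as np

def als_dense(X, k=2, reg=0.1, n_iters=20, seed=0):
    """Plain ALS on a small dense matrix (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    P = rng.normal(scale=0.1, size=(m, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n, k))  # item factors
    ridge = reg * np.eye(k)
    losses = []
    for _ in range(n_iters):
        # Step 1: hold user factors fixed, solve a ridge regression for item factors.
        Q = np.linalg.solve(P.T @ P + ridge, P.T @ X).T
        # Step 2: hold item factors fixed, solve a ridge regression for user factors.
        P = np.linalg.solve(Q.T @ Q + ridge, Q.T @ X.T).T
        # Regularized squared error, tracked to watch convergence.
        losses.append(
            np.sum((X - P @ Q.T) ** 2) + reg * (np.sum(P ** 2) + np.sum(Q ** 2))
        )
    return P, Q, losses

rng = np.random.default_rng(1)
X_toy = rng.random((8, 2)) @ rng.random((2, 6))  # toy matrix of exact rank 2
P, Q, losses = als_dense(X_toy, k=2)
print(f"objective: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Each step solves its sub-problem exactly, so the tracked objective never increases from one iteration to the next.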

The [implicit](https://github.com/benfred/implicit) package implements matrix factorization using Alternating Least Squares (see docs [here](https://implicit.readthedocs.io/en/latest/als.html)). Let's initiate the model using the `AlternatingLeastSquares` class.

In [9]:
model = implicit.als.AlternatingLeastSquares(factors=50)



This model comes with a couple of hyperparameters that can be tuned to generate optimal results:

- factors ($k$): number of latent factors
- regularization ($\lambda$): prevents the model from overfitting during training

In this tutorial, we'll set $k = 50$ and leave $\lambda$ at its default of 0.01. In a real-world scenario, I highly recommend tuning these hyperparameters before generating recommendations.
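Tuning needs some interactions to be held out, so that each candidate setting can be scored on data the model has not seen. Below is a minimal sketch of one way to split a sparse matrix for that purpose. `holdout_split` is a hypothetical helper (it is not part of implicit), the toy matrix is random, and the choice of evaluation metric is left open.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def holdout_split(X, test_fraction=0.2, seed=0):
    """Randomly move a fraction of the stored entries into a test matrix."""
    coo = X.tocoo()
    rng = np.random.default_rng(seed)
    in_test = rng.random(coo.nnz) < test_fraction
    train = csr_matrix(
        (coo.data[~in_test], (coo.row[~in_test], coo.col[~in_test])),
        shape=X.shape,
    )
    test = csr_matrix(
        (coo.data[in_test], (coo.row[in_test], coo.col[in_test])),
        shape=X.shape,
    )
    return train, test

# Random toy matrix standing in for the real interaction matrix.
X_toy = sparse_random(20, 15, density=0.3, format="csr", random_state=0)
train, test = holdout_split(X_toy)
print(f"train entries: {train.nnz}, test entries: {test.nnz}")
```

A tuning loop would then fit one model per combination of `factors` and `regularization` on `train` and keep the combination that ranks the entries in `test` best.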

The next step is to fit our model with our user-item matrix. 

In [10]:
model.fit(X)

100%|██████████| 15.0/15 [00:06<00:00,  2.30it/s]


Now, let's test out the model's recommendations. We can use the model's `similar_items()` method, which returns the movies most similar to a given movie. We can use our `get_movie_index()` helper function to get the index of the movie that we're interested in.

In [11]:
movie_of_interest = 'forrest gump'

movie_index = get_movie_index(movie_of_interest)
related = model.similar_items(movie_index)
related

[(314, 0.9999999),
 (277, 0.87162775),
 (510, 0.8275397),
 (257, 0.82705915),
 (97, 0.7511661),
 (461, 0.7223397),
 (418, 0.6889433),
 (1938, 0.6659612),
 (123, 0.6441393),
 (43, 0.61916924)]

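The output of `similar_items()` is a list of `(movie index, similarity score)` pairs, which is not user-friendly. (This is the format returned by the version of implicit used here; newer releases may structure the output differently.) We'll need to use our `get_movie_title()` function to interpret what our results are.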
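A ranking like this can be produced by comparing item factor vectors. The sketch below assumes cosine similarity as the comparison, which is consistent with the movie of interest matching itself with a score of about 1.0 above. The `item_factors` array is random toy data and `most_similar` is a hypothetical helper, not the fitted model or the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
item_factors = rng.normal(size=(6, 4))  # toy data: 6 items, 4 latent factors

def most_similar(item_idx, factors, top_n=3):
    """Rank items by cosine similarity to the item at `item_idx`."""
    norms = np.linalg.norm(factors, axis=1)
    scores = factors @ factors[item_idx] / (norms * norms[item_idx])
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order]

print(most_similar(2, item_factors))
```

The queried item always comes first, since no vector can be more similar to it than itself, which is why the printing loop below skips the movie of interest.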

In [12]:
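matched_title = movie_finder(movie_of_interest)  # resolve the fuzzy match once

print(f"Because you watched {matched_title}...")
for r in related:
    recommended_title = get_movie_title(r[0])
    # skip the movie of interest itself, which is always its own closest match
    if recommended_title != matched_title:
        print(recommended_title)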

Because you watched Forrest Gump (1994)...
Shawshank Redemption, The (1994)
Silence of the Lambs, The (1991)
Pulp Fiction (1994)
Braveheart (1995)
Schindler's List (1993)
Jurassic Park (1993)
Matrix, The (1999)
Apollo 13 (1995)
Seven (a.k.a. Se7en) (1995)


When we treat user ratings as implicit feedback, the results look pretty good! You can test out other movies by changing the `movie_of_interest` variable.

### Step 5: Generating User-Item Recommendations

A cool feature of [implicit](https://github.com/benfred/implicit) is that you can pull personalized recommendations for a given user. Let's test it out on a user in our dataset.

In [13]:
user_id = 95

In [14]:
user_ratings = ratings[ratings['userId']==user_id].merge(movies[['movieId', 'title']])
user_ratings = user_ratings.sort_values('rating', ascending=False)
print(f"Number of movies rated by user {user_id}: {user_ratings['movieId'].nunique()}")

Number of movies rated by user 95: 168


User 95 watched 168 movies. Their highest rated movies are below:

In [15]:
top_5 = user_ratings.head()
top_5

| | userId | movieId | rating | timestamp | title |
|---|---|---|---|---|---|
| 24 | 95 | 1089 | 5.0 | 1048382826 | Reservoir Dogs (1992) |
| 34 | 95 | 1221 | 5.0 | 1043340018 | Godfather: Part II, The (1974) |
| 83 | 95 | 3019 | 5.0 | 1043340112 | Drugstore Cowboy (1989) |
| 26 | 95 | 1175 | 5.0 | 1105400882 | Delicatessen (1991) |
| 27 | 95 | 1196 | 5.0 | 1043340018 | Star Wars: Episode V - The Empire Strikes Back... |


Their lowest rated movies:

In [16]:
bottom_5 = user_ratings[user_ratings['rating']<3].tail()
bottom_5

| | userId | movieId | rating | timestamp | title |
|---|---|---|---|---|---|
| 93 | 95 | 3690 | 2.0 | 1043339908 | Porky's Revenge (1985) |
| 122 | 95 | 5283 | 2.0 | 1043339957 | National Lampoon's Van Wilder (2002) |
| 100 | 95 | 4015 | 2.0 | 1043339957 | Dude, Where's My Car? (2000) |
| 164 | 95 | 7373 | 1.0 | 1105401093 | Hellboy (2004) |
| 109 | 95 | 4732 | 1.0 | 1043339283 | Bubble Boy (2001) |


Based on their preferences above, we can get a sense that user 95 prefers action and crime movies from the early 1990s over light-hearted American comedies from the early 2000s. Let's see what recommendations our model will generate for user 95.

We'll use the `recommend()` method, which takes the index of the user of interest and the transposed matrix `X_t`, in which rows represent users and columns represent movies.

In [17]:
X_t = X.T.tocsr()

user_idx = user_mapper[user_id]
recommendations = model.recommend(user_idx, X_t)
recommendations

[(855, 1.127779),
 (1043, 0.98673713),
 (1210, 0.9256185),
 (3633, 0.90900886),
 (1978, 0.8929481),
 (4155, 0.84075284),
 (2979, 0.82858247),
 (3609, 0.78015),
 (4791, 0.7672245),
 (4010, 0.7530525)]

We can't interpret the results as is since movies are represented by their index. We'll have to loop over the list of recommendations and get the movie title for each movie index. 
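For intuition about where scores like these come from, here is a minimal numpy sketch. It assumes that a recommendation score is the dot product between a user's factor vector and each item's factor vector, and that items the user has already interacted with are excluded. All arrays are random toy data and `recommend_top_n` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 8, 3
item_factors = rng.normal(size=(n_items, k))  # toy item factors
user_vector = rng.normal(size=k)              # toy factors for one user
already_seen = np.array([1, 4])               # item indices the user interacted with

def recommend_top_n(user_vec, factors, seen, top_n=3):
    """Score every item with a dot product, skip seen items, return the best."""
    scores = factors @ user_vec
    scores[seen] = -np.inf                    # never recommend seen items
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order]

print(recommend_top_n(user_vector, item_factors, already_seen))
```

Unlike the cosine similarities returned for similar movies, dot-product scores are not bounded by 1, which matches the top score of about 1.13 in the output above.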

In [18]:
for r in recommendations:
    recommended_title = get_movie_title(r[0])
    print(recommended_title)

Abyss, The (1989)
Star Trek: First Contact (1996)
Hunt for Red October, The (1990)
Lord of the Rings: The Fellowship of the Ring, The (2001)
Star Wars: Episode I - The Phantom Menace (1999)
Chicago (2002)
Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)
Ocean's Eleven (2001)
Lord of the Rings: The Return of the King, The (2003)
Punch-Drunk Love (2002)


User 95's recommendations are mostly action, adventure, and science-fiction titles. None of them resemble the light-hearted comedies that this user rated poorly.