<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Content Filtering Example
In this exercise we will demonstrate how content-based filtering can be used to build simple but reasonably effective recommendation systems.  We will demonstrate this on a subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset containing 100,000 movie ratings.  We will use item-content filtering for this example, meaning we will attempt to recommend new movies to a user that are very similar to movies which they have previously watched and rated highly.  

To determine the similarity of movies we will use feature information about the items. Here we use the genre information that the dataset provides for each movie. This is a good start, although with more information we could use other features, such as plot summaries, cast, director, or year made, to better determine the similarity of movies.

**Notes:** 
- This does not need to be run on GPU

**References:**  
- Review the details on the MovieLens dataset [here](https://grouplens.org/datasets/movielens/)  

In [5]:
import os
import urllib.request
import zipfile
import time
from itertools import combinations

import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error

## Prepare the data

In [6]:
# Download the data from the GroupLens website
datapath = './data/ml-latest-small'

if not os.path.exists('./data'):
    os.makedirs('./data')
if not os.path.exists(datapath):
    url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
    urllib.request.urlretrieve(url,filename='data/ml-latest-small.zip')
    # Extract the archive; the context manager closes the file even if extraction fails
    with zipfile.ZipFile('data/ml-latest-small.zip', 'r') as zip_ref:
        zip_ref.extractall('data/')

In [18]:
# Load data
ratings = pd.read_csv(os.path.join(datapath,'ratings.csv'))
movies = pd.read_csv(os.path.join(datapath,'movies.csv'))
ratings = ratings.merge(movies,on='movieId')
ratings = ratings[['userId','movieId','genres','rating']]
ratings['genres'] = ratings['genres'].apply(lambda x: x.replace('|',' '))
ratings.head()

| | userId | movieId | genres | rating |
|---|---|---|---|---|
| 0 | 1 | 1 | Adventure Animation Children Comedy Fantasy | 4.0 |
| 1 | 5 | 1 | Adventure Animation Children Comedy Fantasy | 4.0 |
| 2 | 7 | 1 | Adventure Animation Children Comedy Fantasy | 4.5 |
| 3 | 15 | 1 | Adventure Animation Children Comedy Fantasy | 2.5 |
| 4 | 17 | 1 | Adventure Animation Children Comedy Fantasy | 4.5 |


To perform item-content filtering we need to define a set of features for each of our items, and then use those features to evaluate the similarity of items.  In this case we will use the genre information as the feature representation of each movie (though we could use other information if we had it, such as cast, director, plot, or year made).  We will apply a bag-of-words model to the 'genres' column, which turns each movie into a vector with a 1 for every genre token it contains and a 0 everywhere else.  

In [19]:
# Get vector representations of genre
vec = CountVectorizer()
genres_vec = vec.fit_transform(movies['genres'])

# Display resulting feature vectors
genres_vectorized = pd.DataFrame(genres_vec.todense(),columns=vec.get_feature_names_out(),index=movies.movieId)
genres_vectorized.head()


| movieId | action | adventure | animation | children | comedy | crime | documentary | drama | fantasy | fi | ... | listed | musical | mystery | no | noir | romance | sci | thriller | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Now that we are able to represent each item as a numerical feature vector, we can calculate the similarity of items to each other.  We will use cosine similarity as our similarity metric and build a similarity matrix showing the similarity of every movie to every other in the set.

In [20]:
# Build similarity matrix of movies based on similarity of genres
csmatrix = cosine_similarity(genres_vec)
csmatrix = pd.DataFrame(csmatrix,columns=movies.movieId,index=movies.movieId)
csmatrix.head()

| movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.774597 | 0.316228 | 0.258199 | 0.447214 | 0.0 | 0.316228 | 0.632456 | 0.0 | 0.258199 | ... | 0.4 | 0.316228 | 0.316228 | 0.447214 | 0.0 | 0.67082 | 0.774597 | 0.0 | 0.316228 | 0.447214 |
| 2 | 0.774597 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.816497 | 0.0 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.288675 | 0.333333 | 0.0 | 0.0 | 0.0 |
| 3 | 0.316228 | 0.0 | 1.0 | 0.816497 | 0.707107 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.316228 | 0.0 | 0.5 | 0.0 | 0.0 | 0.353553 | 0.408248 | 0.0 | 0.0 | 0.707107 |
| 4 | 0.258199 | 0.0 | 0.816497 | 1.0 | 0.57735 | 0.0 | 0.816497 | 0.0 | 0.0 | 0.0 | ... | 0.258199 | 0.408248 | 0.816497 | 0.0 | 0.0 | 0.288675 | 0.333333 | 0.57735 | 0.0 | 0.57735 |
| 5 | 0.447214 | 0.0 | 0.707107 | 0.57735 | 1.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | ... | 0.447214 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.5 | 0.57735 | 0.0 | 0.0 | 1.0 |
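
As a sanity check, the 0.774597 similarity between movies 1 and 2 can be reproduced by hand. In the genre vectors shown earlier, movie 1 is flagged for adventure, animation, children, comedy and fantasy, and movie 2 for adventure, children and fantasy. The sketch below assumes those five genres are the only non-zero entries for either movie (the displayed table elides some columns) and compares a manual cosine computation with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Indicator vectors over (adventure, animation, children, comedy, fantasy);
# assumes all other genre columns are zero for both movies
movie_1 = np.array([1, 1, 1, 1, 1])
movie_2 = np.array([1, 0, 1, 0, 1])

# Cosine similarity = dot product divided by the product of the vector norms
manual_sim = movie_1 @ movie_2 / (np.linalg.norm(movie_1) * np.linalg.norm(movie_2))

# Same calculation through scikit-learn, which expects 2-D inputs
sklearn_sim = cosine_similarity(movie_1.reshape(1, -1), movie_2.reshape(1, -1))[0, 0]

print(round(manual_sim, 6))   # 0.774597
print(round(sklearn_sim, 6))  # 0.774597
```

The two movies share 3 genres, and the norms are sqrt(5) and sqrt(3), giving 3 / sqrt(15).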


In [21]:
# Split our data into training and validation sets
X = ratings.drop(labels=['rating','genres'],axis=1)
y = ratings['rating']
X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=0, test_size=0.2)

We are now ready to generate predicted ratings for each user-item pair.  The process we will use to generate each predicted rating is as follows:  
- Filter the similarity matrix to only the movies previously watched by the user  
- Find the previously watched movie that is most similar to the movie for which we want to generate the predicted rating (nearest neighbor approach)
- Get the user's rating for the most similar previously watched movie and use that as our prediction
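
The three steps above can be traced on a toy example before applying them to the full dataset. The similarity table, movie ids and ratings below are all made up for illustration.

```python
import pandas as pd

# Hypothetical similarity table for three movies with ids 10, 20 and 30
toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=[10, 20, 30], columns=[10, 20, 30])

# Hypothetical training ratings: this user has rated movies 20 and 30
toy_user_ratings = pd.Series({20: 4.5, 30: 2.0})
movie_to_rate = 10

# Step 1: keep only the similarities to movies the user has already rated
sims_to_watched = toy_sim.loc[movie_to_rate, toy_user_ratings.index]

# Step 2: pick the rated movie most similar to the target (nearest neighbor)
nearest_movie = sims_to_watched.idxmax()

# Step 3: use the user's rating of that movie as the prediction
toy_prediction = toy_user_ratings.loc[nearest_movie]

print(nearest_movie, toy_prediction)  # 20 4.5
```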

In [22]:
def predict_rating(user_item_pair,simtable=csmatrix,X_train=X_train, y_train=y_train):
    movie_to_rate = user_item_pair['movieId']
    user = user_item_pair['userId']
    # Filter similarity matrix to only movies already reviewed by user
    movies_watched = X_train.loc[X_train['userId']==user, 'movieId'].tolist()
    simtable_filtered = simtable.loc[movie_to_rate,movies_watched]
    # Get the most similar movie already watched to current movie to rate
    most_similar_watched = simtable_filtered.index[np.argmax(simtable_filtered)]
    # Get user's rating for most similar movie
    idx = X_train.loc[(X_train['userId']==user) & (X_train['movieId']==most_similar_watched)].index.values[0]
    most_similar_rating = y_train.loc[idx]
    return most_similar_rating

In [23]:
# Get the predicted ratings for each movie in the validation set and calculate the RMSE
ratings_valset = X_val.apply(lambda x: predict_rating(x),axis=1)
val_rmse = np.sqrt(mean_squared_error(y_val,ratings_valset))
print('RMSE of predicted ratings is {:.3f}'.format(val_rmse))

RMSE of predicted ratings is 1.243
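
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in rating points. A small sketch with made-up ratings (not taken from the validation set) shows the calculation both through `mean_squared_error` and by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted ratings for four user-movie pairs
toy_actual = np.array([4.0, 3.0, 5.0, 2.0])
toy_predicted = np.array([4.0, 5.0, 3.0, 2.0])

# Same pattern as the notebook: square root of scikit-learn's MSE
toy_rmse = np.sqrt(mean_squared_error(toy_actual, toy_predicted))

# Manual version: errors are 0, -2, 2, 0, so the mean squared error is 2
manual_rmse = np.sqrt(np.mean((toy_actual - toy_predicted) ** 2))

print(round(toy_rmse, 3))     # 1.414
print(round(manual_rmse, 3))  # 1.414
```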


## Get predicted rating for a user-movie pair
With the similarity matrix and the training ratings in place, we can generate a predicted rating of a given user for a given movie.  To do so we pass the user and movie ids to a prediction function and get back the predicted rating.  We could also do other things, such as determine which movie (out of all movies in our set) a particular user might rate the highest, and recommend it to them.

In [24]:
def predict_new_pair_rating(user,movie,simtable=csmatrix,X_train=X_train, y_train=y_train):
    # Filter similarity matrix to only movies already reviewed by user
    movies_watched = X_train.loc[X_train['userId']==user, 'movieId'].tolist()
    simtable_filtered = simtable.loc[movie,movies_watched]
    # Get the most similar movie already watched to current movie to rate
    most_similar_watched = simtable_filtered.index[np.argmax(simtable_filtered)]
    # Get user's rating for most similar movie
    idx = X_train.loc[(X_train['userId']==user) & (X_train['movieId']==most_similar_watched)].index.values[0]
    most_similar_rating = y_train.loc[idx]
    return most_similar_rating

In [25]:
rating = predict_new_pair_rating(5,10)
print('Predicted rating is {:.1f}'.format(rating))

Predicted rating is 4.0


## Generate recommendations for user
We can also use our content filtering approach to recommend movies that a user is likely to enjoy.

To generate recommendations for movies to watch we will do the following:  
- Identify the previously watched movie the user has rated the highest
- Find the movies most similar to the user's highest-rated movie
- Drop the highest-rated movie itself from the list (this simple version does not filter out other movies the user has already rated)
- Return the top matches as the recommendations to watch

In [44]:
def generate_recommendations(user,simtable,ratings):
    # Get top rated movie by user
    user_ratings = ratings.loc[ratings['userId']==user]
    user_ratings = user_ratings.sort_values(by='rating',axis=0,ascending=False)
    topratedmovie = user_ratings.iloc[0,:]['movieId']
    topratedmovie_title = movies.loc[movies['movieId']==topratedmovie,'title'].values[0]
    # Find most similar movies to the user's top rated movie
    sims = simtable.loc[topratedmovie,:]
    mostsimilar = sims.sort_values(ascending=False).index.values
    # Get 10 most similar movies, skipping position 0 (assumed to be the movie itself, with similarity 1.0)
    mostsimilar = mostsimilar[1:11]
    # Get titles of movies from ids
    mostsimmovies_names = []
    for m in mostsimilar:
        mostsimmovies_names.append(movies.loc[movies['movieId']==m,'title'].values[0])
    return topratedmovie_title, mostsimmovies_names
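
The function above skips only the first entry of the sorted list, so movies the user has already rated can still appear among the recommendations. One way to exclude them is to drop their ids before sorting. The helper `top_unseen` and the toy similarity values below are hypothetical and are not part of the notebook's code.

```python
import pandas as pd

def top_unseen(sim_row, watched_ids, n=2):
    """Return the ids of the n most similar movies the user has not rated.

    sim_row: Series of similarities to a reference movie, indexed by movie id.
    watched_ids: ids the user has already rated (including the reference movie).
    """
    # Drop by id rather than by position, so ties at 1.0 cannot leak through
    unseen = sim_row.drop(labels=list(watched_ids), errors='ignore')
    return unseen.sort_values(ascending=False, kind='stable').head(n).index.tolist()

# Hypothetical similarities of five movies to the user's top-rated movie (id 1)
toy_sims = pd.Series({1: 1.0, 2: 0.9, 3: 0.7, 4: 0.4, 5: 0.2})
toy_watched = {1, 3}  # the user has already rated movies 1 and 3

toy_recs = top_unseen(toy_sims, toy_watched)
print(toy_recs)  # [2, 4]
```

In the notebook's setting, `sim_row` would correspond to `simtable.loc[topratedmovie, :]` and `watched_ids` to the movie ids in the user's ratings.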



In [46]:
user = 5
topratedmovie, recs = generate_recommendations(user,simtable=csmatrix,ratings=ratings)
print("User's highest rated movie was {}".format(topratedmovie))
for i,rec in enumerate(recs):
  print('Recommendation {}: {}'.format(i,rec))

User's highest rated movie was Beauty and the Beast (1991)
Recommendation 0: Tangled (2010)
Recommendation 1: Princess and the Frog, The (2009)
Recommendation 2: Cinderella (1950)
Recommendation 3: Return of Jafar, The (1994)
Recommendation 4: Aladdin and the King of Thieves (1996)
Recommendation 5: All Dogs Go to Heaven 2 (1996)
Recommendation 6: Nightmare Before Christmas, The (1993)
Recommendation 7: Cinderella (1997)
Recommendation 8: Strange Magic (2015)
Recommendation 9: Cloudy with a Chance of Meatballs (2009)
