## Collaborative Filtering

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how similarity metrics can be used for collaborative filtering in a recommender system: we find the users most similar to a reference user and use their ratings to recommend movies.

## Preliminaries

### Import libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from scipy.stats import pearsonr

### Load Data

We will use the MovieLens dataset [1]. I have already preprocessed the data into a ratings table and a genres table so that it is easier to work with later on.

In [2]:
df_ratings = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [3]:
df_genres = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_genres.csv')
df_genres.head()

Unnamed: 0,movieId,title,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## User-based Collaborative Filtering

### Build User-Item Matrix

In [94]:
user_id = 3

In [95]:
df_user = df_ratings.pivot(index='userId', columns='movieId', values='rating')
df_user

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,
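
To make the reshaping step concrete, here is a minimal sketch on a hypothetical toy table (`toy_ratings` and its values are made up for illustration and are not part of MovieLens). It uses the same `pivot` call as above and shows that every user-movie pair without a rating becomes `NaN`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as df_ratings.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_user = toy_ratings.pivot(index='userId', columns='movieId', values='rating')

# Fraction of user-movie cells that hold no rating (the matrix sparsity).
toy_sparsity = toy_user.isna().to_numpy().mean()

print(toy_user)
print(f'sparsity: {toy_sparsity:.2f}')
```

The real user-item matrix is far sparser than this toy one, which is why the next step has to decide how to treat missing ratings.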


### Retrieve *k* most similar users

#### Preprocessing - Mean Imputation

In [96]:
df_user_filled = df_user.apply(lambda x: x.fillna(x.mean()), axis=1)
df_user_filled.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,4.366379,4.0,4.366379,4.366379,4.0,4.366379,4.366379,4.366379,4.366379,...,4.366379,4.366379,4.366379,4.366379,4.366379,4.366379,4.366379,4.366379,4.366379,4.366379
2,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,...,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276,3.948276
3,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,...,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897,2.435897
4,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,...,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556,3.555556
5,4.0,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,...,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364,3.636364
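
The following sketch repeats the row-wise mean imputation on a small hypothetical matrix (`toy_user` is illustrative only). It also subtracts each user's mean afterwards to show what the imputation implies: every unrated item ends up with a deviation of exactly zero from that user's average.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrix with missing ratings.
toy_user = pd.DataFrame(
    [[4.0, np.nan, 5.0],
     [np.nan, 3.0, np.nan],
     [2.0, 4.0, np.nan]],
    index=pd.Index([1, 2, 3], name='userId'),
    columns=pd.Index([10, 20, 30], name='movieId'),
)

# Same row-wise imputation as above: each NaN becomes that user's mean rating.
toy_filled = toy_user.apply(lambda x: x.fillna(x.mean()), axis=1)

# Deviation of every cell from the user's own mean rating.
# Imputed cells are exactly 0 here; observed cells keep their deviation.
toy_centered = toy_filled.sub(toy_user.mean(axis=1), axis=0)

print(toy_filled)
print(toy_centered)
```

A user with a single rating, like user 2 here, becomes a constant row after imputation, so a correlation with that row is undefined.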


#### Similarity Computation

In [99]:
k = 10
reference_user = df_user_filled.loc[user_id]
user_similarities = df_user_filled.apply(lambda x: pearsonr(x, reference_user)[0], axis=1)
similar_users = user_similarities.drop(user_id, axis=0).nlargest(k)
similar_users

userId
441    0.117418
496    0.067878
549    0.064006
231    0.061159
527    0.058456
537    0.058072
313    0.055313
518    0.050288
244    0.049511
246    0.048314
dtype: float64
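
As a small illustration of what `pearsonr` measures, the sketch below checks on two hypothetical rating vectors (`a` and `b` are made up) that the Pearson correlation equals the cosine similarity of the mean-centered vectors.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (already imputed) rating vectors of two users over four movies.
a = np.array([4.0, 4.5, 5.0, 4.5])
b = np.array([3.0, 2.0, 4.0, 3.0])

# Pearson correlation, as used in the similarity computation above.
r = pearsonr(a, b)[0]

# Cosine similarity after subtracting each user's own mean.
a_c = (a - a.mean()).reshape(1, -1)
b_c = (b - b.mean()).reshape(1, -1)
cos_centered = cosine_similarity(a_c, b_c)[0, 0]

print(r, cos_centered)
```

Because mean-imputed entries are zero after centering, only movies that both users actually rated contribute to the numerator. With very sparse data the overlap is small, which is consistent with the low similarity values in the output above.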

### Get average rating of similar users

In [100]:
predicted_ratings = df_user.loc[similar_users.index].mean().sort_values(ascending=False)
predicted_ratings

movieId
2450      5.0
68954     5.0
68486     5.0
2683      5.0
1199      5.0
         ... 
193581    NaN
193583    NaN
193585    NaN
193587    NaN
193609    NaN
Length: 9724, dtype: float64

#### Recommend items

In [101]:
user_unrated_items = df_user.loc[user_id].isna()
recommended_items = predicted_ratings[user_unrated_items].head(10)
recommended_items

movieId
2450     5.0
68954    5.0
68486    5.0
2683     5.0
1199     5.0
1200     5.0
1997     5.0
3153     5.0
66371    5.0
1213     5.0
dtype: float64

Let's check how the similar users rated those items.

In [102]:
df_user.loc[similar_users.index, recommended_items.index]

userId \ movieId,2450,68954,68486,2683,1199,1200,1997,3153,66371,1213
441,,,,5.0,,,,,,
496,,,,,,,,,,
549,,,,,,,,,,
231,,,,,,5.0,,,,
527,5.0,,,,,,5.0,,,
537,,,,,,,,,,
313,,,,,5.0,5.0,5.0,5.0,,5.0
518,,,,,,,,,,
244,,,,,,5.0,,,,
246,,5.0,5.0,,,,,,5.0,
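
In the table above, most of the recommended movies were rated by only one similar user, so a single 5.0 is enough to put a movie at the top of the list. One possible refinement, which is not part of the original notebook, is to require a minimum number of neighbor ratings before a movie is considered. The sketch below uses hypothetical data, and the threshold `min_raters` is an assumed value chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical neighbor ratings: rows are similar users, columns are movies.
toy_neighbors = pd.DataFrame(
    [[5.0, np.nan, 4.0],
     [np.nan, np.nan, 4.5],
     [np.nan, 3.0, 5.0]],
    index=pd.Index([11, 12, 13], name='userId'),
    columns=pd.Index([100, 200, 300], name='movieId'),
)

min_raters = 2  # assumed threshold, not taken from the notebook

toy_mean = toy_neighbors.mean()    # NaN entries are skipped by default
toy_count = toy_neighbors.count()  # number of neighbors who rated each movie

# Keep only movies with enough neighbor ratings, then rank by mean rating.
toy_supported = toy_mean[toy_count >= min_raters].sort_values(ascending=False)

print(toy_supported)
```

Without the filter, movie 100 would rank first on the strength of one rating; with it, only movie 300 remains.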


### Variation: Get weighted average of similar users

In [103]:
def get_weighted_similarity(x):
    # x holds the similar users' ratings for one movie (NaN if unrated)
    weighted_similarity = x*similar_users
    # normalize only over the similar users who actually rated the movie
    norm = similar_users[~weighted_similarity.isna()].sum()
    if norm == 0:
        # no similar user rated this movie, so there is nothing to predict
        return np.nan
    rating = weighted_similarity.sum()/norm
    return rating

predicted_ratings = df_user.loc[similar_users.index].apply(get_weighted_similarity, axis=0)
predicted_ratings = predicted_ratings.sort_values(ascending=False)
predicted_ratings


movieId
1333      5.0
1982      5.0
3071      5.0
1961      5.0
2450      5.0
         ... 
193581    NaN
193583    NaN
193585    NaN
193587    NaN
193609    NaN
Length: 9724, dtype: float64
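
To see how the similarity-weighted average differs from the plain mean, here is a minimal worked example for a single movie. The similarity weights and ratings (`toy_sims`, `toy_item`) are hypothetical values, not taken from the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical similarities of three neighbors to the reference user.
toy_sims = pd.Series([0.6, 0.3, 0.1], index=pd.Index([11, 12, 13], name='userId'))

# Their ratings for one movie; neighbor 12 has not rated it.
toy_item = pd.Series([5.0, np.nan, 3.0], index=toy_sims.index)

# Weighted average over the neighbors who rated the movie:
# (5.0 * 0.6 + 3.0 * 0.1) / (0.6 + 0.1)
rated = toy_item.notna()
weighted = (toy_item[rated] * toy_sims[rated]).sum() / toy_sims[rated].sum()

# Plain mean over the same neighbors, for comparison.
simple = toy_item.mean()

print(weighted, simple)
```

The weighted prediction is pulled toward the rating of the more similar neighbor. Note that when only one neighbor has rated a movie, the weight cancels out and the prediction equals that single rating, whatever the similarity is.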

#### Recommend items

In [104]:
user_unrated_items = df_user.loc[user_id].isna()
recommended_items = predicted_ratings[user_unrated_items].head(10)
recommended_items

movieId
1333    5.0
1982    5.0
3071    5.0
1961    5.0
2450    5.0
2118    5.0
2355    5.0
2137    5.0
4638    5.0
1035    5.0
dtype: float64

Let's check how the similar users rated those items.

In [105]:
df_user.loc[similar_users.index, recommended_items.index]

userId \ movieId,1333,1982,3071,1961,2450,2118,2355,2137,4638,1035
441,,,,,,,,,,
496,,,,,,,,,,
549,,,,,,,,,,
231,,,,,,,,,,
527,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
537,,,,,,,,,,
313,,,,,,,,,,
518,,,,,,,,,,
244,,,,,,,,,,
246,,,,,,,,,,


## Item-based Collaborative Filtering

## References

1. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

## End
<sup>Made by **Jude Michael Teves**</sup> <br>
<sup>For comments, corrections, or suggestions, please email:</sup><sup> <a href="mailto:judemichaelteves@gmail.com">judemichaelteves@gmail.com</a> or <a href="mailto:jude.teves@dlsu.edu.ph">jude.teves@dlsu.edu.ph</a></sup><br>