## Content-based Recommender Systems

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook demonstrates how to build a simple content-based recommender.

## Preliminaries

### Import libraries

In [1]:
import numpy as np
import pandas as pd

### Load Data

We will use the MovieLens dataset. I have already preprocessed the data to make the later steps easier.

In [2]:
df_ratings = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [3]:
df_genres = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_genres.csv')
df_genres.head()

Unnamed: 0,movieId,title,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Content-based - Implicit Rating

- Build the Item Profile matrix.
- Let's focus on userId 1. Compute the user profile.  
- Ignore the `rating` column for now (implicit rating). Recommend movies that the user has not watched, based on their genres.

Hint! Use the following import to compute the similarity.

In [4]:
from sklearn.metrics.pairwise import cosine_similarity
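
As a minimal sketch of what `cosine_similarity` measures, the example below computes the cosine of the angle between a user profile and two item vectors with plain NumPy. The helper `cosine` and the three-genre vectors are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Treat similarity with an all-zero vector as 0 to avoid dividing by zero.
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical genre order: [Action, Comedy, Drama]
toy_user = [0.6, 0.3, 0.1]      # made-up user profile
toy_action_movie = [1, 0, 0]    # a pure Action movie
toy_drama_movie = [0, 0, 1]     # a pure Drama movie

sim_action = cosine(toy_user, toy_action_movie)
sim_drama = cosine(toy_user, toy_drama_movie)
print(sim_action, sim_drama)    # the Action movie scores higher for this profile
```

Because the score depends only on the angle between the vectors, scaling the user profile by a constant does not change the ranking of items.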

### Building the Item Profile matrix

In [5]:
df_item = df_genres.drop('title', axis=1).set_index('movieId')
df_item

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193583,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
193585,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
193587,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Computing the User Profile 

In [6]:
user_likes = df_ratings.query("userId==1")['movieId']
user_likes

0         1
1         3
2         6
3        47
4        50
       ... 
227    3744
228    3793
229    3809
230    4006
231    5060
Name: movieId, Length: 232, dtype: int64

In [7]:
user_profile = df_item.loc[user_likes].mean(axis=0)
user_profile

Action         0.387931
Adventure      0.366379
Animation      0.125000
Children       0.181034
Comedy         0.357759
Crime          0.193966
Documentary    0.000000
Drama          0.293103
Fantasy        0.202586
Film-Noir      0.004310
Horror         0.073276
IMAX           0.000000
Musical        0.094828
Mystery        0.077586
Romance        0.112069
Sci-Fi         0.172414
Thriller       0.237069
War            0.094828
Western        0.030172
dtype: float64

Take note of the top genres here; the recommended movies should mostly belong to these genres.

In [8]:
user_profile.sort_values(ascending=False).head()

Action       0.387931
Adventure    0.366379
Comedy       0.357759
Drama        0.293103
Thriller     0.237069
dtype: float64
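
To make the implicit user profile concrete, here is a small sketch on made-up data. Averaging the binary genre rows of the watched movies gives, for each genre, the share of watched movies that carry it. The names `toy_items`, `toy_watched`, and `toy_profile` are hypothetical.

```python
import pandas as pd

# Hypothetical item profile: rows are movieIds, columns are binary genre flags.
toy_items = pd.DataFrame(
    {"Action": [1, 1, 0, 0], "Comedy": [0, 1, 1, 0], "Drama": [0, 0, 1, 1]},
    index=pd.Index([10, 20, 30, 40], name="movieId"),
)

# Movies the hypothetical user has watched (ratings are ignored here).
toy_watched = [10, 20, 30]

# Column-wise mean over the watched rows = share of watched movies per genre.
toy_profile = toy_items.loc[toy_watched].mean(axis=0)
print(toy_profile)
```

In this toy case two of the three watched movies are Action, two are Comedy, and one is Drama, so the profile weights Action and Comedy twice as heavily as Drama.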

### Retrieving Similar Items

In [9]:
df_scores = df_genres.copy()
scores = cosine_similarity(df_item, user_profile.values.reshape(1,-1)).reshape(-1)
df_scores['similarity'] = scores
df_scores.head()

Unnamed: 0,movieId,title,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,similarity
0,1,Toy Story (1995),0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0.634702
1,2,Jumanji (1995),0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.498514
2,3,Grumpier Old Men (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0.382473
3,4,Waiting to Exhale (1995),0,0,0,0,1,0,0,1,...,0,0,0,0,1,0,0,0,0,0.507109
4,5,Father of the Bride Part II (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0.411876


The recommended movies below are consistent with our **User Profile**.

In [10]:
df_scores_sorted = df_scores.sort_values('similarity', ascending=False)
df_scores_filtered = df_scores_sorted.query(f"movieId not in {user_likes.values.tolist()}")
df_scores_filtered.head(10).T

Unnamed: 0,8597,6570,3608,4681,4005,9394,3526,5471,7409,5379
movieId,117646,55116,4956,6990,5657,164226,4818,26184,80219,8968
title,Dragonheart 2: A New Beginning (2000),"Hunting Party, The (2007)","Stunt Man, The (1980)",The Great Train Robbery (1978),Flashback (1990),Maximum Ride (2016),Extreme Days (2001),"Diamond Arm, The (Brilliantovaya ruka) (1968)",Machete (2010),After the Sunset (2004)
Action,1,1,1,1,1,1,1,1,1,1
Adventure,1,1,1,1,1,1,1,1,1,1
Animation,0,0,0,0,0,0,0,0,0,0
Children,0,0,0,0,0,0,0,0,0,0
Comedy,1,1,1,1,1,1,1,1,1,1
Crime,0,0,0,1,1,0,0,1,1,1
Documentary,0,0,0,0,0,0,0,0,0,0
Drama,1,1,1,1,1,0,1,0,0,0
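
The rank-and-filter step can also be written with a boolean mask instead of a formatted `query` string. The sketch below uses made-up scores; `toy_scores`, `toy_watched`, and `toy_recommendations` are hypothetical names.

```python
import pandas as pd

# Hypothetical similarity scores for five movies.
toy_scores = pd.DataFrame(
    {"movieId": [10, 20, 30, 40, 50],
     "similarity": [0.91, 0.85, 0.40, 0.77, 0.10]}
)
toy_watched = pd.Series([10, 40])  # movies the user has already seen

# Keep only unseen movies, then rank them from most to least similar.
unseen = toy_scores[~toy_scores["movieId"].isin(toy_watched)]
toy_recommendations = unseen.sort_values("similarity", ascending=False)
print(toy_recommendations["movieId"].tolist())
```

One reason to prefer `isin` is that it avoids interpolating a potentially long list of IDs into a query string.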


## Content-based - Explicit Rating

### User Profile

In [11]:
user_id = 3
user_ratings = df_ratings.query(f"userId=={user_id}")['rating']

#### Exploratory Data Analysis

In [12]:
user_ratings.value_counts().sort_index()

0.5    20
2.0     1
3.0     1
3.5     1
4.0     1
4.5     5
5.0    10
Name: rating, dtype: int64

In [13]:
user_ratings.mean()

2.4358974358974357

#### Apply ratings to Item Profile matrix

In [14]:
user_watched = df_ratings.query(f"userId=={user_id}")['movieId']
df_item_rated = df_item.loc[user_watched] * user_ratings.values.reshape(-1, 1)
df_item_rated = df_item_rated.replace(0, np.nan)
# df_item_rated = df_item_rated - user_ratings.mean()
df_item_rated.head()

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
31,,,,,,,,0.5,,,,,,,,,,,
527,,,,,,,,0.5,,,,,,,,,,0.5,
647,0.5,,,,,0.5,,0.5,,,,,,,,,,0.5,
688,0.5,0.5,,,0.5,,,,,,,,,,,,,0.5,
720,,0.5,0.5,,0.5,,,,,,,,,,,,,,
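
The following sketch, on made-up data, shows how each rating is spread across a movie's genre flags and why the zeros are replaced with `NaN` before averaging. All `toy_` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical binary genre flags for three movies the user rated.
toy_items = pd.DataFrame(
    {"Action": [1, 0, 1], "Comedy": [0, 1, 1]},
    index=pd.Index([10, 20, 30], name="movieId"),
)
toy_ratings = np.array([5.0, 1.0, 4.0])  # one made-up rating per movie

# Broadcast each movie's rating across its genre flags (column vector).
toy_rated = toy_items * toy_ratings.reshape(-1, 1)

# Averaging with the zeros left in would count "genre absent" as a rating of 0.
toy_naive_means = toy_rated.mean(axis=0)

# Mark absent genres as missing; mean() skips NaN by default.
toy_genre_means = toy_rated.replace(0, np.nan).mean(axis=0)
print(toy_naive_means)
print(toy_genre_means)
```

With the zeros kept, the Action average is pulled down to 3.0 by a movie that is not an Action movie at all; with `NaN`, it is the average of the two Action ratings, 4.5.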


#### Computing User Profile

In [15]:
user_profile_rated = df_item_rated.mean(axis=0)
user_profile_rated = user_profile_rated.fillna(0)
user_profile_rated

Action         3.571429
Adventure      2.727273
Animation      0.500000
Children       0.500000
Comedy         1.000000
Crime          0.500000
Documentary    0.000000
Drama          0.750000
Fantasy        3.375000
Film-Noir      0.000000
Horror         4.687500
IMAX           0.000000
Musical        0.500000
Mystery        5.000000
Romance        0.500000
Sci-Fi         4.200000
Thriller       4.142857
War            0.500000
Western        0.000000
dtype: float64

Take note of the top genres here; the recommended movies should mostly belong to these genres.

In [16]:
user_profile_rated.sort_values(ascending=False).head()

Mystery     5.000000
Horror      4.687500
Sci-Fi      4.200000
Thriller    4.142857
Action      3.571429
dtype: float64

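Here are the lowest-scoring genres. A score of 0 means the user has not rated any movie in that genre (the missing value was filled with 0), so it signals missing data rather than dislike.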

In [17]:
user_profile_rated.sort_values(ascending=False).tail()

Romance        0.5
Film-Noir      0.0
IMAX           0.0
Documentary    0.0
Western        0.0
dtype: float64

### Retrieving Similar Items

#### Compute Similarity

In [18]:
df_scores_rated = df_genres.copy()
scores = cosine_similarity(df_item, user_profile_rated.values.reshape(1,-1)).reshape(-1)
df_scores_rated['similarity'] = scores
# df_scores_rated.head()

#### Filter out watched movies

The recommended movies below are consistent with our **User Profile**.

In [19]:
df_scores_sorted = df_scores_rated.sort_values('similarity', ascending=False)
df_scores_filtered = df_scores_sorted.query(f"movieId not in {user_watched.values.tolist()}")
df_scores_filtered.head(10).T

Unnamed: 0,5802,5980,6145,2354,5593,5826,7712,4690,9689,1662
movieId,31804,36509,43932,3113,26887,32213,90345,7001,184253,2232
title,Night Watch (Nochnoy dozor) (2004),"Cave, The (2005)",Pulse (2006),End of Days (1999),"Langoliers, The (1995)",Cube Zero (2004),"Thing, The (2011)",Invasion of the Body Snatchers (1978),The Cloverfield Paradox (2018),Cube (1997)
Action,1,1,1,1,0,0,0,0,0,0
Adventure,0,1,0,0,0,0,0,0,0,0
Animation,0,0,0,0,0,0,0,0,0,0
Children,0,0,0,0,0,0,0,0,0,0
Comedy,0,0,0,0,0,0,0,0,0,0
Crime,0,0,0,0,0,0,0,0,0,0
Documentary,0,0,0,0,0,0,0,0,0,0
Drama,0,0,1,0,1,0,0,0,0,0


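These are the movies least similar to the user profile. Most are documentaries, a genre this user has never rated, so the low score reflects missing information rather than a known dislike.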

In [20]:
df_scores_filtered.sort_values('similarity').head(10).T

Unnamed: 0,9100,3208,8621,3269,6872,2399,3293,3295,8648,210
movieId,144210,4329,118784,4426,62662,3182,4453,4458,120478,246
title,Just Eat It: A Food Waste Story (2014),Rio Bravo (1959),Good Copy Bad Copy (2007),Kiss Me Deadly (1955),Tokyo-Ga (1985),Mr. Death: The Rise and Fall of Fred A. Leucht...,Michael Jordan to the Max (2000),Africa: The Serengeti (1994),The Salt of the Earth (2014),Hoop Dreams (1994)
Action,0,0,0,0,0,0,0,0,0,0
Adventure,0,0,0,0,0,0,0,0,0,0
Animation,0,0,0,0,0,0,0,0,0,0
Children,0,0,0,0,0,0,0,0,0,0
Comedy,0,0,0,0,0,0,0,0,0,0
Crime,0,0,0,0,0,0,0,0,0,0
Documentary,1,0,1,0,1,1,1,1,1,1
Drama,0,0,0,0,0,0,0,0,0,0


## Content-based - Explicit Rating *mean-subtraction variation*

### User Profile

In [21]:
user_id = 3
user_ratings = df_ratings.query(f"userId=={user_id}")['rating']

#### Exploratory Data Analysis

In [22]:
user_ratings.value_counts().sort_index()

0.5    20
2.0     1
3.0     1
3.5     1
4.0     1
4.5     5
5.0    10
Name: rating, dtype: int64

In [23]:
user_ratings.mean()

2.4358974358974357

#### Apply ratings to Item Profile matrix

In [24]:
user_watched = df_ratings.query(f"userId=={user_id}")['movieId']
df_item_rated = df_item.loc[user_watched] * user_ratings.values.reshape(-1, 1)
df_item_rated = df_item_rated.replace(0, np.nan)
df_item_rated = df_item_rated - user_ratings.mean()
df_item_rated.head()

movieId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
31,,,,,,,,-1.935897,,,,,,,,,,,
527,,,,,,,,-1.935897,,,,,,,,,,-1.935897,
647,-1.935897,,,,,-1.935897,,-1.935897,,,,,,,,,,-1.935897,
688,-1.935897,-1.935897,,,-1.935897,,,,,,,,,,,,,-1.935897,
720,,-1.935897,-1.935897,,-1.935897,,,,,,,,,,,,,,


#### Computing User Profile

In [25]:
user_profile_rated = df_item_rated.mean(axis=0)
user_profile_rated = user_profile_rated.fillna(0)
user_profile_rated

Action         1.135531
Adventure      0.291375
Animation     -1.935897
Children      -1.935897
Comedy        -1.435897
Crime         -1.935897
Documentary    0.000000
Drama         -1.685897
Fantasy        0.939103
Film-Noir      0.000000
Horror         2.251603
IMAX           0.000000
Musical       -1.935897
Mystery        2.564103
Romance       -1.935897
Sci-Fi         1.764103
Thriller       1.706960
War           -1.935897
Western        0.000000
dtype: float64

Take note of the top genres here; the recommended movies should mostly belong to these genres.

**Question:** What if we compute the user profile of `userId = 1` instead? Note that, unlike this user's ratings, that user's ratings are highly imbalanced.

In [26]:
user_profile_rated.sort_values(ascending=False).head()

Mystery     2.564103
Horror      2.251603
Sci-Fi      1.764103
Thriller    1.706960
Action      1.135531
dtype: float64

Here are the genres that this user does not like.

In [27]:
user_profile_rated.sort_values(ascending=False).tail()

Musical     -1.935897
Romance     -1.935897
Children    -1.935897
Animation   -1.935897
War         -1.935897
dtype: float64
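
A small sketch of why mean subtraction matters for the similarity step: with raw average ratings every profile entry is non-negative, so a movie in a disliked genre still gets a small positive score, whereas a centered profile can push it below zero. The two-genre vectors and the user mean below are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-genre average ratings, in the order [Horror, Romance].
toy_raw_profile = np.array([5.0, 0.5])
toy_user_mean = 2.75                                     # assumed mean rating
toy_centered_profile = toy_raw_profile - toy_user_mean   # [2.25, -2.25]

toy_romance_movie = np.array([0, 1])                     # a pure Romance movie

raw_sim = cosine(toy_raw_profile, toy_romance_movie)
centered_sim = cosine(toy_centered_profile, toy_romance_movie)
print(raw_sim, centered_sim)  # small positive vs. clearly negative
```

Sorting by similarity in ascending order then surfaces movies in genres the user rated below their own average, instead of genres with no ratings at all.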

### Retrieving Similar Items

#### Compute Similarity

In [28]:
df_scores_rated = df_genres.copy()
scores = cosine_similarity(df_item, user_profile_rated.values.reshape(1,-1)).reshape(-1)
df_scores_rated['similarity'] = scores
# df_scores_rated.head()

#### Filter out watched movies

The recommended movies below are consistent with our **User Profile**.

In [29]:
df_scores_sorted = df_scores_rated.sort_values('similarity', ascending=False)
df_scores_filtered = df_scores_sorted.query(f"movieId not in {user_watched.values.tolist()}")
df_scores_filtered.head(10).T

Unnamed: 0,5802,7712,5826,4690,9689,1662,5980,2354,6034,5651
movieId,31804,90345,32213,7001,184253,2232,36509,3113,39400,27482
title,Night Watch (Nochnoy dozor) (2004),"Thing, The (2011)",Cube Zero (2004),Invasion of the Body Snatchers (1978),The Cloverfield Paradox (2018),Cube (1997),"Cave, The (2005)",End of Days (1999),"Fog, The (2005)",Cube 2: Hypercube (2002)
Action,1,0,0,0,0,0,1,1,1,0
Adventure,0,0,0,0,0,0,1,0,0,0
Animation,0,0,0,0,0,0,0,0,0,0
Children,0,0,0,0,0,0,0,0,0,0
Comedy,0,0,0,0,0,0,0,0,0,0
Crime,0,0,0,0,0,0,0,0,0,0
Documentary,0,0,0,0,0,0,0,0,0,0
Drama,0,0,0,0,0,0,0,0,0,0


These are the movies that this user will probably not like.

In [30]:
df_scores_filtered.sort_values('similarity').head(10).T

Unnamed: 0,618,44,1545,1390,8983,6230,786,1369,5160,8275
movieId,783,48,2081,1907,138702,46062,1029,1873,8360,105540
title,"Hunchback of Notre Dame, The (1996)",Pocahontas (1995),"Little Mermaid, The (1989)",Mulan (1998),Feast (2014),High School Musical (2006),Dumbo (1941),"Misérables, Les (1998)",Shrek 2 (2004),"All Dogs Christmas Carol, An (1998)"
Action,0,0,0,0,0,0,0,0,0,0
Adventure,0,0,0,1,0,0,0,0,1,0
Animation,1,1,1,1,1,0,1,0,1,1
Children,1,1,1,1,1,1,1,0,1,1
Comedy,0,0,1,1,1,1,0,0,1,1
Crime,0,0,0,0,0,0,0,1,0,0
Documentary,0,0,0,0,0,0,0,0,0,0
Drama,1,1,0,1,1,1,1,1,0,0
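
The explicit-rating steps in this notebook can be collected into one function, which makes it easier to repeat the experiment for another user. This is only a sketch under the assumptions stated in the docstring; `build_user_profile` is a hypothetical helper, and the toy frames stand in for `df_ratings` and `df_item`.

```python
import numpy as np
import pandas as pd

def build_user_profile(ratings, items, user_id, center=False):
    """Sketch: per-genre mean rating for one user, optionally mean-centered.

    Assumes `ratings` has userId, movieId and rating columns, that `items`
    is a binary genre matrix indexed by movieId, and that no rating is 0.
    """
    user_rows = ratings[ratings["userId"] == user_id]
    weights = user_rows["rating"].values.reshape(-1, 1)
    rated = items.loc[user_rows["movieId"]] * weights
    rated = rated.replace(0, np.nan)         # genre absent -> missing
    if center:
        rated = rated - user_rows["rating"].mean()
    return rated.mean(axis=0).fillna(0)      # genres never rated -> 0

# Tiny made-up data to exercise the function.
toy_ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [5.0, 1.0, 3.0]}
)
toy_items = pd.DataFrame(
    {"Action": [1, 0], "Comedy": [0, 1], "Drama": [0, 0]},
    index=pd.Index([10, 20], name="movieId"),
)

toy_plain = build_user_profile(toy_ratings, toy_items, user_id=1)
toy_centered = build_user_profile(toy_ratings, toy_items, user_id=1, center=True)
print(toy_plain)
print(toy_centered)
```

With the notebook's data, a call such as `build_user_profile(df_ratings, df_item, user_id=1, center=True)` would be one way to explore how a different rating distribution changes the profile.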


## References

1. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>For comments, corrections, or suggestions, please email judemichaelteves@gmail.com or jude.teves@dlsu.edu.ph.</sup><br>