<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PreferredAI/tutorials/blob/master/recommender-systems/06_contextual_awareness.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/PreferredAI/tutorials/blob/master/recommender-systems/06_contextual_awareness.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# Contextual Awareness

Traditional matrix factorization assumes that a recommendation depends primarily, if not exclusively, on the specific user and item in question. However, preferences may be context-sensitive: a suitable recommendation may depend on factors such as the time of day or the user's current location. To incorporate such context factors into the model, we associate each of them with a latent vector that participates in the prediction by interacting with the user and item latent vectors. One paradigm for contextual recommendation is the Factorization Machine, which is the focus of this tutorial.

## 1. Setup

In [1]:
!git clone https://github.com/srendle/libfm.git
!make all -C libfm

Cloning into 'libfm'...
remote: Enumerating objects: 233, done.
remote: Total 233 (delta 0), reused 0 (delta 0), pack-reused 233
Receiving objects: 100% (233/233), 129.46 KiB | 4.04 MiB/s, done.
Resolving deltas: 100% (112/112), done.
make: Entering directory '/content/libfm'
cd src/libfm; make all
make[1]: Entering directory '/content/libfm/src/libfm'
g++ -O3 -Wall -c libfm.cpp -o libfm.o
mkdir -p ../../bin/
g++ -O3 -Wall libfm.o -o ../../bin/libFM
g++ -O3 -Wall -c tools/transpose.cpp -o tools/transpose.o
mkdir -p ../../bin/
g++ -O3 tools/transpose.o -o ../../bin/transpose
g++ -O3 -Wall -c tools/convert.cpp -o tools/convert.o
mkdir -p ../../bin/
g++ -O3 tools/convert.o -o ../../bin/convert
make[1]: Leaving directory '/content/libfm/src/libfm'
make: Leaving directory '/content/libfm'


In [2]:
!pip install --quiet cornac==1.15.4


In [3]:
import os
import sys
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import cornac
from cornac.utils import cache

print(f"System version: {sys.version}")
print(f"Cornac version: {cornac.__version__}")

SEED = 42  # @param 
VERBOSE = False  # @param 

System version: 3.10.11 (main, Apr  5 2023, 14:15:10) [GCC 9.4.0]
Cornac version: 1.15.4


## 2. Data

In this tutorial, we use the [MovieLens + IMDb/Rotten Tomatoes](http://files.grouplens.org/datasets/hetrec2011/hetrec2011-movielens-readme.txt) dataset, which was released at the HetRec 2011 workshop. This dataset has rich information about movies (e.g., actors, genres) in addition to ratings. More details can be found at https://grouplens.org/datasets/hetrec-2011/.

In [4]:
cache("http://files.grouplens.org/datasets/hetrec2011/hetrec2011-movielens-2k-v2.zip", unzip=True)

Data from http://files.grouplens.org/datasets/hetrec2011/hetrec2011-movielens-2k-v2.zip
will be cached into /root/.cornac/hetrec2011-movielens-2k-v2.zip


Unzipping ...
File cached!


'/root/.cornac/hetrec2011-movielens-2k-v2.zip'

The data, after being downloaded and unzipped, includes the following files:

In [5]:
!ls /root/.cornac

movie_actors.dat     movies.dat		   user_ratedmovies-timestamps.dat
movie_countries.dat  movie_tags.dat	   user_taggedmovies.dat
movie_directors.dat  readme.txt		   user_taggedmovies-timestamps.dat
movie_genres.dat     tags.dat
movie_locations.dat  user_ratedmovies.dat


### User-Movie ratings

A user assigns a rating to a movie. There is also a timestamp recording when the rating was given.

In [6]:
user_ratedmovies_df = pd.read_csv("/root/.cornac/user_ratedmovies.dat", sep="\t")
user_ratedmovies_df.head()

Unnamed: 0,userID,movieID,rating,date_day,date_month,date_year,date_hour,date_minute,date_second
0,75,3,1.0,29,10,2006,23,17,16
1,75,32,4.5,29,10,2006,23,23,44
2,75,110,4.0,29,10,2006,23,30,8
3,75,160,2.0,29,10,2006,23,16,52
4,75,163,4.0,29,10,2006,23,29,30


### User-Movie Tags

In addition to rating a movie, a user can assign one or more tags to it. These tag assignments are timestamped as well.

In [7]:
user_taggedmovies_df = pd.read_csv("/root/.cornac/user_taggedmovies.dat", sep="\t")
user_taggedmovies_df.head()

Unnamed: 0,userID,movieID,tagID,date_day,date_month,date_year,date_hour,date_minute,date_second
0,75,353,5290,29,10,2006,23,20,15
1,78,4223,5264,16,4,2007,4,43,45
2,127,1343,1544,28,8,2007,3,42,27
3,127,1343,12330,28,8,2007,3,42,27
4,127,2080,1451,28,8,2007,3,42,47


### Tags Info

The mapping of a tag id to its textual description is available in the dataset.

In [8]:
tag_df = pd.read_csv("/root/.cornac/tags.dat", sep="\t", encoding="iso-8859-1")
tag_df.head()

Unnamed: 0,id,value
0,1,earth
1,2,police
2,3,boxing
3,4,painter
4,5,whale


### Movie Info

The original movie information (title and year) available in the MovieLens 10M dataset has been extended with public data from the IMDb and Rotten Tomatoes websites.

In [9]:
movie_df = pd.read_csv("/root/.cornac/movies.dat", sep="\t", encoding="iso-8859-1")
movie_df.head()

Unnamed: 0,id,title,imdbID,spanishTitle,imdbPictureURL,year,rtID,rtAllCriticsRating,rtAllCriticsNumReviews,rtAllCriticsNumFresh,...,rtAllCriticsScore,rtTopCriticsRating,rtTopCriticsNumReviews,rtTopCriticsNumFresh,rtTopCriticsNumRotten,rtTopCriticsScore,rtAudienceRating,rtAudienceNumRatings,rtAudienceScore,rtPictureURL
0,1,Toy story,114709,Toy story (juguetes),http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...,1995,toy_story,9.0,73,73,...,100,8.5,17,17,0,100,3.7,102338,81,http://content7.flixster.com/movie/10/93/63/10...
1,2,Jumanji,113497,Jumanji,http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...,1995,1068044-jumanji,5.6,28,13,...,46,5.8,5,2,3,40,3.2,44587,61,http://content8.flixster.com/movie/56/79/73/56...
2,3,Grumpy Old Men,107050,Dos viejos gruñones,http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...,1993,grumpy_old_men,5.9,36,24,...,66,7.0,6,5,1,83,3.2,10489,66,http://content6.flixster.com/movie/25/60/25602...
3,4,Waiting to Exhale,114885,Esperando un respiro,http://ia.media-imdb.com/images/M/MV5BMTczMTMy...,1995,waiting_to_exhale,5.6,25,14,...,56,5.5,11,5,6,45,3.3,5666,79,http://content9.flixster.com/movie/10/94/17/10...
4,5,Father of the Bride Part II,113041,Vuelve el padre de la novia (Ahora también abu...,http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...,1995,father_of_the_bride_part_ii,5.3,19,9,...,47,5.4,5,1,4,20,3.0,13761,64,http://content8.flixster.com/movie/25/54/25542...


### Data Statistics

In [10]:
n_users = user_ratedmovies_df.userID.nunique()
n_movies = user_ratedmovies_df.movieID.nunique()
n_tags = tag_df.id.nunique()

print("Number of users:", n_users)
print("Number of movies:", n_movies)
print("Number of ratings:", len(user_ratedmovies_df))
print("-" * 30)
print("Number of tags:", n_tags)
print("Number of tag assignments:", len(user_taggedmovies_df))
print("Number of tagged movies:", user_taggedmovies_df.movieID.nunique())

Number of users: 2113
Number of movies: 10109
Number of ratings: 855598
------------------------------
Number of tags: 13222
Number of tag assignments: 47957
Number of tagged movies: 5908


### Data Splitting

In [11]:
train_df, test_df = train_test_split(user_ratedmovies_df, test_size=0.2, random_state=SEED)
print("Training size:", len(train_df))
print("Test size:", len(test_df))

Training size: 684478
Test size: 171120


## 3. Traditional Matrix Factorization

Matrix factorization (MF) only makes use of (user, item, rating) information to train a recommendation model.  We include MF as a baseline to see if context produces an improvement.

To train an MF model with the provided data, we make use of the Cornac library as follows:

In [12]:
eval_method = cornac.eval_methods.BaseMethod.from_splits(
  train_data=list(train_df.itertuples(index=False)), 
  test_data=list(test_df.itertuples(index=False)),
  exclude_unknowns=False, 
  verbose=VERBOSE,
  seed=SEED,
)

mf = cornac.models.MF(
  k=10, 
  max_iter=20, 
  learning_rate=0.01, 
  lambda_reg=0.02, 
  use_bias=True,
  verbose=VERBOSE, seed=SEED,
)

test_result, _ = eval_method.evaluate(
  model=mf, metrics=[cornac.metrics.RMSE()], user_based=False
)
print(test_result)

   |   RMSE | Train (s) | Test (s)
-- + ------ + --------- + --------
MF | 0.7576 |    0.4617 |   5.7045
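
The metric reported above is the root mean squared error (RMSE) between predicted and observed ratings. As a minimal pure-Python sketch of the metric (the helper name `rmse` is ours for illustration and is not part of Cornac's API), assuming that `user_based=False` means the error is averaged over all test ratings rather than per user first:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired lists of ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# toy ratings, not taken from the dataset
score = rmse([4.0, 3.0, 5.0, 2.0], [3.0, 3.0, 4.0, 2.0])
```

Lower values are better, and the value is expressed in the same units as the ratings (here, stars on a 0.5 to 5 scale).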



## 4. Factorization Machines with Contextual Information




The Factorization Machines (FM) model formulates rating prediction as a regression problem in which user, item, and additional contextual information are combined into a feature vector $\mathbf{x}_i$. The predictor consists of a global bias, first-order terms, and second-order interactions of the input features. The estimate is as follows:

$$
\hat{y}(\mathbf{x}_i) = w_0 + \sum_{f=1}^{F} w_f x_{if} + \sum_{f=1}^{F} \sum_{g=f+1}^{F} x_{if} x_{ig} \Big( \sum_{k=1}^{K} v_{fk} v_{gk} \Big)
$$

where $F$ is the length of feature vectors and $K$ is the dimensionality of second-order latent factors.
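
To make the formula concrete, here is a minimal pure-Python sketch of the second-order FM predictor. It is an illustration under our own naming (`fm_predict_naive`, `fm_predict_fast`, and the toy parameters are assumptions), not libFM's implementation. The second function uses the standard reformulation of the pairwise term, $\frac{1}{2}\sum_k \big[(\sum_f v_{fk} x_f)^2 - \sum_f v_{fk}^2 x_f^2\big]$, which avoids the explicit loop over feature pairs.

```python
def fm_predict_naive(x, w0, w, V):
    """FM prediction with the explicit double sum over feature pairs."""
    F, K = len(x), len(V[0])
    y = w0 + sum(w[f] * x[f] for f in range(F))
    for f in range(F):
        for g in range(f + 1, F):
            dot = sum(V[f][k] * V[g][k] for k in range(K))
            y += x[f] * x[g] * dot
    return y


def fm_predict_fast(x, w0, w, V):
    """Same prediction, computing the pairwise term in O(F * K)."""
    F, K = len(x), len(V[0])
    y = w0 + sum(w[f] * x[f] for f in range(F))
    for k in range(K):
        s = sum(V[f][k] * x[f] for f in range(F))          # sum_f v_fk x_f
        s2 = sum((V[f][k] * x[f]) ** 2 for f in range(F))  # sum_f (v_fk x_f)^2
        y += 0.5 * (s * s - s2)
    return y


# toy model with F = 4 features and K = 2 latent dimensions
w0 = 3.0
w = [0.1, -0.2, 0.3, 0.05]
V = [[0.1, 0.2], [0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]
x = [1.0, 0.0, 1.0, 1.0]  # three active (non-zero) features

naive_pred = fm_predict_naive(x, w0, w, V)
fast_pred = fm_predict_fast(x, w0, w, V)
```

Both functions return the same value up to floating-point error. Because only non-zero features contribute, the cost in practice scales with the number of active features per example, which is small for the sparse vectors used in this tutorial.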

FM can be extended and generalized into higher-order interactions.  Given the degree of data sparsity commonly faced in recommender systems, second-order FM is usually sufficient. Higher orders would be less efficient and harder to estimate.

With only the user and item indicators as features, second-order FM reduces to matrix factorization with bias terms. FM is therefore a general framework for incorporating contextual information while maintaining the effectiveness of factorization models in recommender systems.
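
To see the reduction concretely, suppose the feature vector contains only a one-hot user block and a one-hot item block, so that exactly two entries are non-zero: $x_u = 1$ for the user and $x_j = 1$ for the item (the index names $u$ and $j$ are introduced here for illustration). All other terms vanish and the FM predictor becomes

$$
\hat{y}(\mathbf{x}) = w_0 + w_u + w_j + \sum_{k=1}^{K} v_{uk} v_{jk}
$$

which is matrix factorization with a global bias $w_0$, a user bias $w_u$, an item bias $w_j$, and $K$-dimensional user and item factors $\mathbf{v}_u$ and $\mathbf{v}_j$.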

To learn the parameters of the FM regressor, we minimize the following regularized squared loss function:

$$
\mathcal{L}(\mathbf{w}, \mathbf{V} \mid \lambda) = \frac{1}{2} \sum_{(\mathbf{x}, y) \in \mathcal{D}} \big(y - \hat{y}(\mathbf{x})\big)^2 + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 + \frac{\lambda}{2} \|\mathbf{V}\|_F^2
$$
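
As a small illustration of the regularized squared loss, the sketch below evaluates it for given predictions and parameters. The function name `regularized_squared_loss` and the toy numbers are assumptions for illustration; the global bias $w_0$ is left unregularized here, and the penalty on $\mathbf{V}$ is taken as the sum of its squared entries.

```python
def regularized_squared_loss(y_true, y_pred, w, V, lam):
    """Half the sum of squared errors plus L2 penalties on w and V."""
    data_term = 0.5 * sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    w_penalty = 0.5 * lam * sum(wf ** 2 for wf in w)
    v_penalty = 0.5 * lam * sum(v ** 2 for row in V for v in row)
    return data_term + w_penalty + v_penalty


# toy values, not taken from the dataset
loss = regularized_squared_loss(
    y_true=[4.0, 3.0],
    y_pred=[3.5, 3.0],
    w=[0.1, -0.2],
    V=[[0.3, 0.0], [0.0, 0.4]],
    lam=0.02,
)
```

A larger `lam` shrinks the weights and factors toward zero, trading training fit for better generalization on sparse data.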


In this tutorial, we will train an FM model using the [libFM](http://www.libfm.org/) package released by the original author [2]. To use the software, we need to prepare the data in the specific format expected by libFM.

As an example of contextual information, we use the user-movie tag assignments.  First, we identify a set of tags for each pair of (user, movie):

In [13]:
user_movie_tags = defaultdict(set)
for uid, mid, tid, *_ in user_taggedmovies_df.itertuples(index=False):
  user_movie_tags[(uid, mid)].add(tid)

Second, we maintain mappings from ID to index for users, movies, and tags.  These will be used to create feature vectors for the FM model.   

In [14]:
user_id2idx = eval_method.global_uid_map
movie_id2idx = eval_method.global_iid_map

# create a mapping from tag ID to a contiguous 0-based index
tag_id2idx = {}
for tagid, _ in tag_df.itertuples(index=False):
  tag_id2idx.setdefault(tagid, len(tag_id2idx))
assert len(tag_id2idx) == n_tags
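
The `setdefault` idiom used for the tag mapping assigns the next free index the first time an ID is seen and leaves existing entries untouched. A toy sketch (the helper name `build_index` and the sample IDs are hypothetical):

```python
def build_index(raw_ids):
    """Map arbitrary IDs to contiguous 0-based indices in first-seen order."""
    index = {}
    for raw_id in raw_ids:
        # only inserts when raw_id is new, so duplicates keep their first index
        index.setdefault(raw_id, len(index))
    return index


toy_index = build_index([75, 78, 75, 127])
```

Contiguous indices matter because each index becomes a column position in the sparse feature vector passed to the FM model.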

For each feature vector, most of the values will be zeros.  Thus, we will save a lot of memory by storing the data in a sparse format.  The code below will create training and test data in the sparse format used by libFM.

In [15]:
def to_fm_sparse_fmt(rating, uid, mid, tags):
  """Format one rating as a libFM line: '<rating> <index>:1 ...'."""
  # feature blocks are laid out in the order: users, movies, tags
  user_start_idx = 0
  movie_start_idx = user_start_idx + n_users
  tag_start_idx = movie_start_idx + n_movies
  return "{} {}:1 {}:1 {}\n".format(
    rating,
    user_id2idx[uid] + user_start_idx,
    movie_id2idx[mid] + movie_start_idx,
    " ".join("{}:1".format(tag_id2idx[t] + tag_start_idx) for t in tags)
  )

# save training data to file
with open("train.libfm", "w") as f:
  for uid, mid, rating, *_ in train_df.itertuples(index=False):
    f.write(to_fm_sparse_fmt(rating, uid, mid, user_movie_tags[(uid, mid)]))

# save test data to file
with open("test.libfm", "w") as f:
  for uid, mid, rating, *_ in test_df.itertuples(index=False):
    f.write(to_fm_sparse_fmt(rating, uid, mid, user_movie_tags[(uid, mid)]))   

Let's take a look at how the data is stored in files. 

In [16]:
!head train.libfm

4.5 0:1 2113:1 
3.0 1:1 2114:1 
4.5 2:1 2115:1 
2.0 3:1 2116:1 
4.0 4:1 2117:1 
2.5 5:1 2118:1 
2.0 6:1 2119:1 
3.5 7:1 2120:1 
4.0 8:1 2121:1 
3.5 9:1 2122:1 


In [17]:
!head test.libfm

4.0 597:1 4195:1 13921:1
3.0 404:1 5014:1 
4.0 128:1 3404:1 
3.0 66:1 2626:1 
4.0 399:1 9555:1 
3.5 1646:1 4225:1 
5.0 426:1 2738:1 
4.0 1057:1 6819:1 
3.0 136:1 3685:1 
4.0 229:1 3053:1 
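
Each line starts with the rating, followed by `index:value` pairs for the non-zero features. To check the block layout, the sketch below decodes a line back into per-block indices. The helper `decode_libfm_line` is hypothetical, and the block sizes are the user and movie counts reported in the Data Statistics section.

```python
N_USERS, N_MOVIES = 2113, 10109  # block sizes from the Data Statistics section


def decode_libfm_line(line, n_users=N_USERS, n_movies=N_MOVIES):
    """Split a libFM line into its rating and block-local feature indices."""
    target, *pairs = line.split()
    decoded = {"rating": float(target), "user": None, "movie": None, "tags": []}
    for pair in pairs:
        idx_str, _, value = pair.partition(":")
        idx = int(idx_str)
        if float(value) == 0.0:
            continue  # explicit zeros carry no information
        if idx < n_users:
            decoded["user"] = idx
        elif idx < n_users + n_movies:
            decoded["movie"] = idx - n_users
        else:
            decoded["tags"].append(idx - n_users - n_movies)
    return decoded


# first line of the test file shown above
example = decode_libfm_line("4.0 597:1 4195:1 13921:1")
```

For this line, the third feature falls past the user and movie blocks, so it is a tag feature; lines with only two pairs correspond to ratings without any tag assignment.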


For details on how to use libFM, refer to the manual: http://www.libfm.org/libfm-1.42.manual.pdf.

Below is the list of arguments that libFM accepts:

In [18]:
!./libfm/bin/libFM

----------------------------------------------------------------------------
libFM
  Version: 1.4.4
  Author:  Steffen Rendle, srendle@libfm.org
  WWW:     http://www.libfm.org/
This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.
This is free software, and you are welcome to redistribute it under certain
conditions; for details see license.txt.
----------------------------------------------------------------------------
-cache_size     cache size for data storage (only applicable if data is
                in binary format), default=infty
-dim            'k0,k1,k2': k0=use bias, k1=use 1-way interactions,
                k2=dim of 2-way interactions; default=1,1,8
-help           this screen
-init_stdev     stdev for initialization of 2-way factors; default=0.1
-iter           number of iterations; default=100
-learn_rate     learn_rate for SGD; default=0.1
-load_model     filename for reading the FM model
-meta           filename for meta information about dat




To train a model on our data, we run the following command:

In [19]:
!./libfm/bin/libFM -task r -train train.libfm -test test.libfm -seed $SEED -dim "1,1,10" -iter 200

----------------------------------------------------------------------------
libFM
  Version: 1.4.4
  Author:  Steffen Rendle, srendle@libfm.org
  WWW:     http://www.libfm.org/
This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.
This is free software, and you are welcome to redistribute it under certain
conditions; for details see license.txt.
----------------------------------------------------------------------------
Loading train...	
has x = 0
has xt = 1
num_rows=684478	num_values=1398326	num_features=25444	min_target=0.5	max_target=5
Loading test... 	
has x = 0
has xt = 1
num_rows=171120	num_values=349755	num_features=25431	min_target=0.5	max_target=5
#relations: 0
Loading meta data...	
#Iter=  0	Train=0.93938	Test=0.937978
#Iter=  1	Train=0.835144	Test=0.872688
#Iter=  2	Train=0.811651	Test=0.845675
#Iter=  3	Train=0.802268	Test=0.830956
#Iter=  4	Train=0.798476	Test=0.822256
#Iter=  5	Train=0.796305	Test=0.816585
#Iter=  6	Train=0.794784	Test=0.812691
#I

The numbers reported above are RMSE values. Their steady decrease over iterations indicates a stable training process. The FM model achieves a better result (lower RMSE) on the test set than the matrix factorization model. Because the RMSE is still decreasing, training for more iterations could potentially improve performance further.
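
To compare runs or plot the learning curve, the per-iteration lines can be parsed from libFM's console output. The sketch below assumes the `#Iter=... Train=... Test=...` line format shown in the log above; the helper name `parse_libfm_log` is ours, and the sample text contains only the first two iterations from that log.

```python
import re

# matches lines such as "#Iter=  0<TAB>Train=0.93938<TAB>Test=0.937978"
ITER_RE = re.compile(r"#Iter=\s*(\d+)\s+Train=([\d.]+)\s+Test=([\d.]+)")


def parse_libfm_log(text):
    """Return a list of (iteration, train_rmse, test_rmse) tuples."""
    return [(int(i), float(tr), float(te)) for i, tr, te in ITER_RE.findall(text)]


sample_log = (
    "#Iter=  0\tTrain=0.93938\tTest=0.937978\n"
    "#Iter=  1\tTrain=0.835144\tTest=0.872688\n"
)
history = parse_libfm_log(sample_log)

# iteration with the lowest test RMSE among the parsed lines
best_iter, _, best_test = min(history, key=lambda row: row[2])
```

In a notebook, the full log could be captured with `output = !./libfm/bin/libFM ...` and joined with newlines before parsing.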

## 5. Other Contextual Information to Be Explored

Below is some other information provided with the dataset that can also be used as additional features for the FM model. Temporal information (*timestamps*) that comes with ratings and tag assignments could be utilized as well.
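
One way to use such information is to append another block of indicator features after the tag block. The sketch below does this for genres with toy block sizes. The helper `to_fm_line` and its arguments are hypothetical and only illustrate the index arithmetic; they are not part of libFM or Cornac.

```python
def to_fm_line(rating, user_idx, movie_idx, tag_idxs, genre_idxs,
               n_users, n_movies, n_tags):
    """Build a libFM line with feature blocks: users, movies, tags, genres."""
    movie_start = n_users
    tag_start = movie_start + n_movies
    genre_start = tag_start + n_tags  # new block starts after the tag block
    columns = [user_idx, movie_start + movie_idx]
    columns += sorted(tag_start + t for t in tag_idxs)
    columns += sorted(genre_start + g for g in genre_idxs)
    return "{} {}".format(rating, " ".join("{}:1".format(c) for c in columns))


# toy sizes: 3 users, 4 movies, 2 tags; the movie has genres 1 and 3
line = to_fm_line(4.0, user_idx=1, movie_idx=2, tag_idxs={0}, genre_idxs={1, 3},
                  n_users=3, n_movies=4, n_tags=2)
```

Numeric context such as the rating hour could be handled similarly, for example by bucketing it into a few categories and one-hot encoding the bucket, though that is a design choice to validate on held-out data.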

### Movie Genres

The genres that a movie belongs to.

In [20]:
pd.read_csv("/root/.cornac/movie_genres.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,genre
0,1,Adventure
1,1,Animation
2,1,Children
3,1,Comedy
4,1,Fantasy


### Movie Directors

The directors of a movie.

In [21]:
pd.read_csv("/root/.cornac/movie_directors.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,directorID,directorName
0,1,john_lasseter,John Lasseter
1,2,joe_johnston,Joe Johnston
2,3,donald_petrie,Donald Petrie
3,4,forest_whitaker,Forest Whitaker
4,5,charles_shyer,Charles Shyer


### Movie Actors

The main actors and actresses of a movie. 

A ranking is given to each actor of a movie according to the order in which the actor appears on the movie's IMDb cast Web page.

In [22]:
pd.read_csv("/root/.cornac/movie_actors.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,actorID,actorName,ranking
0,1,annie_potts,Annie Potts,10
1,1,bill_farmer,Bill Farmer,20
2,1,don_rickles,Don Rickles,3
3,1,erik_von_detten,Erik von Detten,13
4,1,greg-berg,Greg Berg,17


### Movie Countries

The country of origin of a movie.

In [23]:
pd.read_csv("/root/.cornac/movie_countries.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,country
0,1,USA
1,2,USA
2,3,USA
3,4,USA
4,5,USA


### Movie Location

The filming locations of a movie.

In [24]:
pd.read_csv("/root/.cornac/movie_locations.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,location1,location2,location3,location4
0,1,,,,
1,2,Canada,British Columbia,,
2,2,Canada,British Columbia,Delta,
3,2,Canada,British Columbia,Delta,Tsawwassen
4,2,Canada,British Columbia,Maple Ridge,


### Movie Year

The release year of a movie.

In [25]:
movie_df[["id", "title", "year"]].head()

Unnamed: 0,id,title,year
0,1,Toy story,1995
1,2,Jumanji,1995
2,3,Grumpy Old Men,1993
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


### Movie Tags and Frequencies

The tags assigned to a movie, and the number of times each tag was assigned to that movie.

In [26]:
pd.read_csv("/root/.cornac/movie_tags.dat", sep="\t", encoding="iso-8859-1").head()

Unnamed: 0,movieID,tagID,tagWeight
0,1,7,1
1,1,13,3
2,1,25,3
3,1,55,3
4,1,60,1


## References

1.   Aggarwal, C. C. (2016). Recommender systems (Vol. 1). Cham: Springer International Publishing.
2.   Rendle, S. (2012). Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3), 1-22.
3.   Cornac - A Comparative Framework for Multimodal Recommender Systems (https://cornac.preferred.ai/)

