<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PreferredAI/tutorials/blob/master/multimodal-www23/02_addressing_sparsity.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/PreferredAI/tutorials/blob/master/multimodal-www23/02_addressing_sparsity.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# 0. Setup

In [None]:
!pip install --quiet cornac==1.15.0

In [None]:
import os
import sys
from collections import defaultdict

import cornac
from cornac.utils import cache
from cornac.datasets import filmtrust, amazon_clothing
from cornac.eval_methods import RatioSplit
from cornac.models import PMF, SoRec, WMF, CTR, BPR, VBPR
from cornac.data import GraphModality, TextModality, ImageModality
from cornac.data.text import BaseTokenizer

import numpy as np
import pandas as pd
import torch
import tensorflow.compat.v1 as tf

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

print(f"System version: {sys.version}")
print(f"Cornac version: {cornac.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Tensorflow version: {tf.__version__}")

SEED = 42
VERBOSE = True
USE_GPU = torch.cuda.is_available()

# 1. Multimodality

While preference data in the form of user-item interactions is the backbone of many recommender systems, such data tends to be sparse. One way to address this sparsity is to look beyond the interaction data to the additional information associated with users or items. The intuition is that items with similar "content profiles" tend to attract similar preferences. Multimodality deals with how to model both preference data (one modality) and content data on the user or item side (other modalities). In this tutorial, we see three forms of additional modalities, namely text, image, and graph, and investigate whether they add value to the resulting recommendations.

## 1.1. Text Modality

Often, we are interested in building a recommender system for textual items (e.g., news, scientific papers) or items associated with text (e.g., titles, descriptions, reviews). Text is informative and descriptive, so exploiting textual information for better recommendations is an important topic in recommender systems. In this tutorial, we introduce CTR [2], a recommendation model that combines matrix factorization and probabilistic topic modeling.



### Collaborative Topic Regression (CTR)

Under the factorization framework, the adoption prediction takes the form $\hat{r}_{i,j} = \mathbf{u}_i^T \mathbf{v}_j$. The intuition behind CTR is that two items with similar topics would behave similarly. Thus, the item latent vector $\mathbf{v}_j$ is assumed to be drawn from a Normal distribution:

$$
\mathbf{v}_j \sim \mathcal{N}(\mathbf{\theta}_j, \lambda^{-1} \mathbf{I})
$$

where the mean $\mathbf{\theta}_j$ is a vector indicating topic proportions of the item $j$. It is equivalent to:

\begin{align}
\mathbf{v}_j &= \mathbf{\theta}_j + \mathbf{\epsilon}_j \\
\mathbf{\epsilon}_j &\sim \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I})
\end{align}
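
The decomposition above can be sketched numerically. The snippet below is a minimal NumPy illustration, not part of Cornac: the dimension `K`, the precision `lam`, and the Dirichlet-sampled `theta_j` are assumed toy values chosen only to show how $\mathbf{v}_j$ stays close to $\mathbf{\theta}_j$ when $\lambda$ is large.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5          # number of topics / latent dimensions (illustrative)
lam = 100.0    # precision lambda; larger values keep v_j closer to theta_j

# Hypothetical topic proportions for one item: non-negative, summing to 1.
theta_j = rng.dirichlet(np.ones(K))

# epsilon_j ~ N(0, lambda^{-1} I), so each coordinate has std 1 / sqrt(lambda).
epsilon_j = rng.normal(loc=0.0, scale=1.0 / np.sqrt(lam), size=K)

# Item latent vector: topic proportions plus an item-specific offset.
v_j = theta_j + epsilon_j

print("theta_j:", np.round(theta_j, 3))
print("v_j    :", np.round(v_j, 3))
```

The offset $\mathbf{\epsilon}_j$ is what lets two items with identical text still differ in how users adopt them.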

Please refer to the paper [2] for the generative process of the CTR model.


CTR also extends matrix factorization: its base model is WMF under the implicit feedback setting. The adoption $p_{i,j}$ and confidence $c_{i,j}$ are defined as follows:

\begin{equation}
p_{i,j} = 
\begin{cases} 
r_{i, j} &\mbox{if } r_{i,j} > 0 \\
0 & \mbox{otherwise} 
\end{cases}
\end{equation}


\begin{equation}
c_{i,j} = 
\begin{cases} 
a & \mbox{if } r_{i,j} > 0 \\
b & \mbox{otherwise }
\end{cases}
\end{equation}
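
As a small illustration of these two definitions, the following NumPy sketch builds the adoption and confidence matrices from a toy rating matrix. The matrix `R` is made up for the example; the values of `a` and `b` mirror those passed to the models in the experiment below.

```python
import numpy as np

# Toy rating matrix (users x items); 0 marks an unobserved entry.
R = np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
])

a, b = 1.0, 0.01  # confidence for observed / unobserved entries

observed = R > 0
P = np.where(observed, R, 0.0)  # p_ij = r_ij if r_ij > 0, else 0
C = np.where(observed, a, b)    # c_ij = a if r_ij > 0, else b

print(P)
print(C)
```

Because $b$ is small but non-zero, unobserved entries still contribute to the loss, only with much lower weight than observed ones.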

The CTR model is learned by minimizing the following negative log-likelihood:

$$ \mathcal{L}(\mathbf{U,V,\theta, \beta}|\lambda) = \frac{1}{2} \sum_{i,j} c_{i,j} (p_{i,j} - \mathbf{u}_i^T \mathbf{v}_j)^2 - \sum_{j}\sum_{n} \log \big( \sum_{k=1}^K \mathbf{\theta}_{j,k} \mathbf{\beta}_{k,w_{jn}} \big) + \frac{\lambda}{2} \sum_{i=1}^{N} ||\mathbf{u}_i||^2 + \frac{\lambda}{2} \sum_{j=1}^{M} (\mathbf{v}_j - \mathbf{\theta}_j)^T (\mathbf{v}_j - \mathbf{\theta}_j) $$

Learning is an iterative procedure that alternates between three steps:
- Optimize the user and item latent vectors, $\mathbf{u}_i$ and $\mathbf{v}_j$, based on the current topic proportions $\mathbf{\theta}_j$.
- Optimize the topic proportions $\mathbf{\theta}_j$ based on the current vectors $\mathbf{u}_i$ and $\mathbf{v}_j$ and topic words $\mathbf{\beta}_k$.
- Optimize the topic words $\mathbf{\beta}_k$ based on the current topic proportions $\mathbf{\theta}_j$.
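
To make the first step concrete, here is a minimal sketch of the item-vector update obtained by setting the gradient of the loss above with respect to a single $\mathbf{v}_j$ to zero, which gives $(\mathbf{U}^T \mathbf{C}_j \mathbf{U} + \lambda \mathbf{I})\,\mathbf{v}_j = \mathbf{U}^T \mathbf{C}_j \mathbf{p}_j + \lambda \mathbf{\theta}_j$. The helper name `update_item_vector` and the toy data are assumptions for illustration; Cornac's CTR implementation may organize this computation differently.

```python
import numpy as np

def update_item_vector(U, c_j, p_j, theta_j, lam):
    """Closed-form minimizer of the loss with respect to one item vector v_j.

    U: (N, K) user vectors; c_j, p_j: (N,) confidences and adoptions for item j.
    """
    K = U.shape[1]
    # Solve (U^T C_j U + lam * I) v_j = U^T C_j p_j + lam * theta_j
    A = U.T @ (c_j[:, None] * U) + lam * np.eye(K)
    rhs = U.T @ (c_j * p_j) + lam * theta_j
    return np.linalg.solve(A, rhs)

rng = np.random.default_rng(0)
N, K = 6, 3
U = rng.normal(size=(N, K))
p_j = np.array([5.0, 0.0, 4.0, 0.0, 0.0, 3.0])
c_j = np.where(p_j > 0, 1.0, 0.01)
theta_j = rng.dirichlet(np.ones(K))
lam = 0.01

v_j = update_item_vector(U, c_j, p_j, theta_j, lam)

# Sanity check: the gradient of the loss with respect to v_j vanishes at the solution.
grad = -U.T @ (c_j * (p_j - U @ v_j)) + lam * (v_j - theta_j)
print(np.round(grad, 10))
```

The term $\lambda \mathbf{\theta}_j$ on the right-hand side is what pulls the item vector toward its topic proportions; setting $\mathbf{\theta}_j = \mathbf{0}$ recovers the usual WMF-style update.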

Let's experiment with two models, CTR and WMF, on a dataset from the Amazon Clothing category. On this dataset, CTR learns topics from the item descriptions.

In [None]:
K = 20
ctr = CTR(k=K, max_iter=50, a=1.0, b=0.01, lambda_u=0.01, lambda_v=0.01, verbose=VERBOSE, seed=SEED)
wmf = WMF(k=K, max_iter=50, a=1.0, b=0.01, learning_rate=0.005, lambda_u=0.01, lambda_v=0.01, 
          verbose=VERBOSE, seed=SEED)

ratings = amazon_clothing.load_feedback()
docs, item_ids = amazon_clothing.load_text()

item_text_modality = TextModality(
    corpus=docs,
    ids=item_ids,
    tokenizer=BaseTokenizer(sep=" ", stop_words="english"),
    max_vocab=5000,
    max_doc_freq=0.5,
)

ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    item_text=item_text_modality,
    verbose=VERBOSE,
    seed=SEED,
)

rec_50 = cornac.metrics.Recall(50)

cornac.Experiment(eval_method=ratio_split, models=[ctr, wmf], metrics=[rec_50]).run()

The results show that CTR achieves a higher Recall@50 than WMF, which can be attributed to the contribution of the items' textual information.

## 1.2. Image Modality

In some contexts (e.g., fashion), item images are informative. With the availability of effective methods to learn image representations, using item images in recommender systems is gaining popularity. In this tutorial, we present VBPR [3], a recommendation model that makes use of item image features extracted from a pre-trained Convolutional Neural Network (CNN).

### Visual Bayesian Personalized Ranking (VBPR)

VBPR, which is also based on matrix factorization, is an extension of the BPR model. The novelty of VBPR lies in how item visual features are incorporated into the matrix factorization framework. The preference score of user $i$ for item $j$ is predicted as follows:

$$
\hat{r}_{i,j} = \alpha + b_i + b_j + \mathbf{u}_i^T \mathbf{v}_j + \mathbf{p}_{i}^T(\mathbf{E} \times \mathbf{f}_j) + \mathbf{\Theta}^T \mathbf{f}_j
$$

where:
- $\alpha, b_i, b_j$ are global bias, user bias, and item bias, respectively
- $\mathbf{u}_i \in \mathbb{R}^K$ and $\mathbf{v}_j \in \mathbb{R}^K$ are user and item latent vectors, respectively
- $\mathbf{f}_j \in \mathbb{R}^D$ is the item image feature vector
- $\mathbf{p}_i \in \mathbb{R}^Q$ is the user visual preference, and $(\mathbf{E} \times \mathbf{f}_j) \in \mathbb{R}^Q$ is the item visual representation, where $\mathbf{E} \in \mathbb{R}^{Q \times D}$ is the projection from the visual feature space into the preference space
- $\mathbf{\Theta} \in \mathbb{R}^D$ is global visual bias vector
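
The scoring function can be sketched in a few lines of NumPy. This is an illustration with random toy parameters, not Cornac's VBPR code; the dimensions and the helper name `vbpr_score` are assumptions. Note that $\mathbf{E}$ needs shape $(Q, D)$ so that $\mathbf{E} \mathbf{f}_j$ lands in the same $Q$-dimensional space as $\mathbf{p}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q, D = 4, 3, 8  # latent, visual-preference, and image-feature dims (illustrative)

alpha, b_i, b_j = 0.1, 0.2, -0.05   # global, user, and item biases
u_i = rng.normal(size=K)            # user latent vector
v_j = rng.normal(size=K)            # item latent vector
p_i = rng.normal(size=Q)            # user visual preference
E = rng.normal(size=(Q, D))         # projection from feature space to preference space
Theta = rng.normal(size=D)          # global visual bias vector
f_j = rng.random(D)                 # item image feature vector (e.g., from a CNN)

def vbpr_score(alpha, b_i, b_j, u_i, v_j, p_i, E, Theta, f_j):
    """Preference score: biases + latent interaction + visual interaction + visual bias."""
    visual_item = E @ f_j  # item visual representation in R^Q
    return alpha + b_i + b_j + u_i @ v_j + p_i @ visual_item + Theta @ f_j

score = vbpr_score(alpha, b_i, b_j, u_i, v_j, p_i, E, Theta, f_j)
print(score)
```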

The parameters of VBPR can be learned, similarly to BPR, by minimizing the following negative log-likelihood:

$$ \mathcal{L}(\mathbf{U,V,b,E,\Theta, P}|\lambda) = \sum_{(j >_i l) \in \mathbf{S}} \ln (1 + \exp\{-(\hat{r}_{i,j} - \hat{r}_{i,l})\}) + \frac{\lambda}{2} \sum_{i=1}^{N} (||\mathbf{u}_i||^2 + ||\mathbf{p}_i||^2) + \frac{\lambda}{2} \sum_{j=1}^{M} (b_j^2 + ||\mathbf{v}_j||^2) + \frac{\lambda}{2} ||\mathbf{\Theta}||^2 + \frac{\lambda}{2} ||\mathbf{E}||^2_F $$

Note that the global bias $\alpha$ and user bias $b_i$ do not affect the ranking of items for a given user; they are therefore redundant and removed from the model parameters.
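
The redundancy of $\alpha$ and $b_i$ can be checked directly: the pairwise loss depends only on the score difference $\hat{r}_{i,j} - \hat{r}_{i,l}$ for the same user, so any user-level constant cancels. The sketch below uses made-up scores, and the helper name `bpr_pair_loss` is an assumption for illustration.

```python
import numpy as np

def bpr_pair_loss(score_pos, score_neg):
    """ln(1 + exp(-(score_pos - score_neg))), computed in a numerically stable way."""
    return np.logaddexp(0.0, -(score_pos - score_neg))

# Scores of one user for a preferred item j and a less preferred item l,
# computed without the global bias and the user bias.
r_ij, r_il = 1.3, 0.4

# Adding the same user-level constant (alpha + b_i) to both scores...
alpha, b_i = 0.7, -2.0
loss_without = bpr_pair_loss(r_ij, r_il)
loss_with = bpr_pair_loss(alpha + b_i + r_ij, alpha + b_i + r_il)

# ...leaves the pairwise loss unchanged.
print(loss_without, loss_with)
```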

Let's compare the VBPR and BPR models with an experiment on the Amazon Clothing dataset.


In [None]:
K = 10
vbpr = VBPR(k=K, k2=K, n_epochs=50, batch_size=100, learning_rate=0.001,
            lambda_w=1, lambda_b=0.01, lambda_e=0.0, use_gpu=USE_GPU, verbose=VERBOSE, seed=SEED)
bpr = BPR(k=(K * 2), max_iter=50, learning_rate=0.001, lambda_reg=0.001, verbose=VERBOSE, seed=SEED)

ratings = amazon_clothing.load_feedback()
img_features, item_ids = amazon_clothing.load_visual_feature()

item_image_modality = ImageModality(features=img_features, ids=item_ids, normalized=True)

ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    item_image=item_image_modality,
    verbose=VERBOSE,
    seed=SEED,
)

auc = cornac.metrics.AUC()

cornac.Experiment(eval_method=ratio_split, models=[vbpr, bpr], metrics=[auc]).run()

The results show that VBPR obtains higher performance than BPR in terms of AUC. That can be attributed to the usage of item visual features.

## 1.3. Graph Modality

In recommender systems, a graph can be used to represent a user social network or item contexts (e.g., co-views, co-purchases). In this tutorial, we take the former as an example and discuss SoRec [1], a representative model for this class of algorithms.


### Social Recommendation (SoRec)

The SoRec model is based on the matrix factorization framework. The idea is to fuse the user-item rating matrix with the users' social network. In summary, the *user-item rating matrix* ($R$) and the *user-user graph adjacency matrix* ($G$) are factorized with shared user latent factors: the user latent vectors in $\mathbf{U}$ capture both user preferences and social connections. The rating prediction is obtained as $\hat{r}_{i,j} = \mathbf{u}_i^T \mathbf{v}_j$, as in the PMF model.

To learn the model parameters, we minimize the following loss function:

$$ \mathcal{L}(\mathbf{U,V,Z}|\lambda,\lambda_C) = \frac{1}{2} \sum_{r_{i,j} \in \mathcal{R}} (r_{i,j} - \mathbf{u}_i^T \mathbf{v}_j)^2 + \frac{\lambda_C}{2} \sum_{g_{i,h} \in \mathcal{G}} (g_{i,h} - \mathbf{u}_i^T \mathbf{z}_h)^2 + \frac{\lambda}{2} \sum_{i=1}^{N} ||\mathbf{u}_i||^2 + \frac{\lambda}{2} \sum_{j=1}^{M} ||\mathbf{v}_j||^2 + \frac{\lambda}{2} \sum_{h=1}^{N} ||\mathbf{z}_h||^2 $$

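where $\mathbf{z}_h$ is an additional latent vector for user $h$ that is used only in the factorization of $G$, $\lambda_C$ is the relative importance of the social network factorization, and $\lambda$ is the regularization weight.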
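
The loss can be evaluated term by term as in the following sketch. It is a plain NumPy illustration under assumed toy data, not Cornac's SoRec implementation; the helper name `sorec_loss` and the triplet format for ratings and links are assumptions. The values of `lam` and `lam_c` mirror those used in the experiment below.

```python
import numpy as np

def sorec_loss(U, V, Z, ratings, links, lam, lam_c):
    """SoRec objective on observed entries.

    ratings: iterable of (i, j, r_ij) triplets from the rating matrix R.
    links:   iterable of (i, h, g_ih) triplets from the social graph G.
    """
    rating_term = sum((r - U[i] @ V[j]) ** 2 for i, j, r in ratings)
    social_term = sum((g - U[i] @ Z[h]) ** 2 for i, h, g in links)
    reg = (U ** 2).sum() + (V ** 2).sum() + (Z ** 2).sum()
    return 0.5 * rating_term + 0.5 * lam_c * social_term + 0.5 * lam * reg

rng = np.random.default_rng(0)
n_users, n_items, K = 3, 4, 2
U = rng.normal(scale=0.1, size=(n_users, K))  # shared user vectors
V = rng.normal(scale=0.1, size=(n_items, K))  # item vectors
Z = rng.normal(scale=0.1, size=(n_users, K))  # user vectors for the social factorization

ratings = [(0, 0, 4.0), (1, 2, 3.0), (2, 3, 5.0)]
links = [(0, 1, 1.0), (2, 0, 1.0)]

loss = sorec_loss(U, V, Z, ratings, links, lam=0.001, lam_c=3.0)
print(loss)
```

Since $\mathbf{u}_i$ appears in both the rating term and the social term, gradients from trusted neighbors shape the same user vectors that produce rating predictions.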

Let's compare SoRec with its base model PMF on the [FilmTrust dataset](http://konect.cc/networks/librec-filmtrust-trust/).

In [None]:
K = 20
sorec = SoRec(k=K, max_iter=50, learning_rate=0.001, lambda_reg=0.001, lambda_c=3.0, verbose=VERBOSE, seed=SEED)
pmf = PMF(k=K, max_iter=50, learning_rate=0.001, lambda_reg=0.001, verbose=VERBOSE, seed=SEED)

ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()

user_graph_modality = GraphModality(data=trust)

ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    user_graph=user_graph_modality,
    verbose=VERBOSE,
    seed=SEED,
)

mae = cornac.metrics.MAE()

cornac.Experiment(eval_method=ratio_split, models=[sorec, pmf], metrics=[mae]).run()

From the experiment, we see that SoRec achieves a lower (better) MAE than PMF. This improvement can be attributed to the useful information from the user social network that the model captures.

# 2. Cross-Modal Utilization

Multimodal recommender systems are commonly catalogued by the type of auxiliary data (modality) they leverage in addition to preference data: user networks (social), user or item texts (textual), or item images (visual). One consequence of this siloization along modality lines is the tendency for virtual walls to arise between modalities. For instance, a model ostensibly designed for images would be evaluated with only the image modality and compared to other models also purportedly designed for images. In turn, a text-based model would be compared to another text-based model, and similarly for models built on item graphs. However, most multimodal recommendation algorithms are innately machine learning models that fit the preference data, aided by the auxiliary data as features in some form. While the raw representations of modalities may differ, the eventual representations used in the learning process may have commonalities in form (a textual product description may be represented as a term vector, related items as a vector of adjacent graph neighbors, etc.). Indeed, if we peel off the layer of pre-processing steps specific to a modality, we find that, for most models, the underlying representation can accommodate other modalities. It is this aspect that we explore in this notebook, i.e., using a model for a modality different from the one it was originally designed for [5].
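
One way to see this commonality is to turn a feature matrix into a graph by linking each item to its most similar items. The sketch below is a plain NumPy illustration of that idea with made-up term counts; it is not Cornac's implementation, and the helper name `knn_edges` is hypothetical.

```python
import numpy as np

def knn_edges(features, k):
    """Return (item, neighbor) index pairs for the k most cosine-similar items."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against all-zero rows
    sim = unit @ unit.T                         # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)              # exclude self-loops
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return [(i, int(j)) for i in range(features.shape[0]) for j in neighbors[i]]

# Toy term-count vectors for four items (rows) over three terms (columns).
term_counts = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

edges = knn_edges(term_counts, k=1)
print(edges)
```

The same feature matrix could equally be handed to a model that expects dense item features, which is the sense in which the underlying representation can be shared across modalities.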

## 2.1. Using CDL, VBPR and MCF with the Text Modality

We consider the Amazon Clothing dataset, consisting of user-item ratings and item content information (e.g., text, visual features, relations). For the purpose of the following experiment, assume that we are interested in item textual descriptions only. To leverage this modality, we consider three different models, namely CDL [6], VBPR [3], and MCF [7]. While CDL was originally designed for and evaluated with text auxiliary data, VBPR and MCF have been investigated for integrating visual and graph information, respectively. The following code illustrates how Cornac [4] makes it possible to use VBPR and MCF with text auxiliary data.

In [None]:
# Load and split the Amazon clothing dataset
ratings = amazon_clothing.load_feedback()
docs, item_ids = amazon_clothing.load_text()

ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    seed=SEED,
    verbose=VERBOSE,
)

# obtain global mapping of item ID to synchronize across modalities 
item_id_map = ratio_split.global_iid_map

# build item text modality using the item text corpus
VOCAB_SIZE = 5000
text_modality = TextModality(corpus=docs, ids=item_ids, max_vocab=VOCAB_SIZE)
text_modality.build(id_map=item_id_map)

# here we use term-frequency matrix from item text as features, other choices available
item_ids = list(item_id_map.keys())
tf_mat = text_modality.count_matrix.toarray()  # dense term-frequency matrix (.A is unavailable in recent SciPy)
tf_mat = tf_mat[:len(item_ids)]  # remove unknown items during data splitting

# build image modality with the term-frequency matrix as features 
image_modality = ImageModality(features=tf_mat, ids=item_ids)
image_modality.build(id_map=item_id_map)

# build graph modality with the term-frequency matrix as features.
# Under the hood this will construct a k-nearest neighbor graph of items, encoding textual similarities among them.
graph_modality = GraphModality.from_feature(features=tf_mat, ids=item_ids, k=5, similarity="cosine")
graph_modality.build(id_map=item_id_map)


# provide all built modalities for access by models during the experiment
ratio_split.add_modalities(item_text=text_modality, 
                           item_image=image_modality, 
                           item_graph=graph_modality)


cdl = cornac.models.CDL(k=50, autoencoder_structure=[200], vocab_size=VOCAB_SIZE, 
                        act_fn='tanh', max_iter=50, seed=SEED, verbose=VERBOSE)

vbpr = cornac.models.VBPR(k=10, k2=40, n_epochs=50, use_gpu=USE_GPU, seed=SEED, verbose=VERBOSE)

mcf = cornac.models.MCF(k=50, max_iter=50, seed=SEED, verbose=VERBOSE)


recall = cornac.metrics.Recall(k=50)
ndcg = cornac.metrics.NDCG(k=50)


text_exp = cornac.Experiment(eval_method=ratio_split,  
                             models=[cdl, vbpr, mcf],
                             metrics=[recall, ndcg])
text_exp.run()

## Question

Without looking at the next sections, can you write Cornac code to use the above three models with the image modality and the graph modality?

## 2.2. Using CDL, VBPR and MCF with the Image Modality

In [None]:
ratings = amazon_clothing.load_feedback()
features, item_ids = amazon_clothing.load_visual_feature()


# construct item modalities using the image features
image_modality = ImageModality(features=features, ids=item_ids) 
text_modality = TextModality(features=features, ids=item_ids)
graph_modality = GraphModality.from_feature(features=features, ids=item_ids, k=5, similarity="cosine")


# provide all modalities into evaluation method to synchronize the building process
# as we don't have to build them separately for this case (available features)
ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    seed=SEED,
    verbose=VERBOSE,
    item_text=text_modality,
    item_image=image_modality,
    item_graph=graph_modality,
)


cdl = cornac.models.CDL(k=50, autoencoder_structure=[200], vocab_size=text_modality.feature_dim,
                        act_fn='tanh', max_iter=50, seed=SEED, verbose=VERBOSE)

vbpr = cornac.models.VBPR(k=10, k2=40, n_epochs=50, use_gpu=USE_GPU, seed=SEED, verbose=VERBOSE)

mcf = cornac.models.MCF(k=50, max_iter=50, seed=SEED, verbose=VERBOSE)


recall = cornac.metrics.Recall(k=50)
ndcg = cornac.metrics.NDCG(k=50)


image_exp = cornac.Experiment(eval_method=ratio_split, 
                              models=[cdl, vbpr, mcf],
                              metrics=[recall, ndcg])
image_exp.run()

## 2.3. Using CDL, VBPR and MCF with the Graph Modality

In [None]:
ratings = amazon_clothing.load_feedback()
contexts = amazon_clothing.load_graph()


ratio_split = RatioSplit(
    data=ratings,
    test_size=0.2,
    rating_threshold=4.0,
    exclude_unknowns=True,
    seed=SEED,
    verbose=VERBOSE,
)

# obtain global mapping of item ID to synchronize across modalities 
item_id_map = ratio_split.global_iid_map  
item_ids = list(item_id_map.keys())

# build item graph modality using the item contexts
graph_modality = GraphModality(data=contexts).build(id_map=item_id_map)
adj_mat = graph_modality.matrix.toarray()  # dense item graph adjacency matrix (.A is unavailable in recent SciPy)

# build text and image modalities with the adjacency matrix as features 
text_modality = TextModality(features=adj_mat, ids=item_ids).build(id_map=item_id_map)
image_modality = ImageModality(features=adj_mat, ids=item_ids).build(id_map=item_id_map)


# provide all built modalities for access by models during the experiment
ratio_split.add_modalities(item_text=text_modality,
                           item_image=image_modality,
                           item_graph=graph_modality)


cdl = cornac.models.CDL(k=50, autoencoder_structure=[200], vocab_size=text_modality.feature_dim,
                        act_fn='tanh', max_iter=50, seed=SEED, verbose=VERBOSE)

vbpr = cornac.models.VBPR(k=10, k2=40, n_epochs=50, use_gpu=USE_GPU, seed=SEED, verbose=VERBOSE)

mcf = cornac.models.MCF(k=50, max_iter=50, seed=SEED, verbose=VERBOSE)


recall = cornac.metrics.Recall(k=50)
ndcg = cornac.metrics.NDCG(k=50)


graph_exp = cornac.Experiment(eval_method=ratio_split, 
                              models=[cdl, vbpr, mcf],
                              metrics=[recall, ndcg])
graph_exp.run()

## 2.4. Results: Recall and NDCG Bar Plots

To make it convenient to analyze the results of the experiments from Sections 2.1 to 2.3, the following code generates the Recall and NDCG bar plots across models and modalities.

In [None]:
res_df = defaultdict(list)
for text_res, image_res, graph_res in zip(text_exp.result, image_exp.result, graph_exp.result):
  assert text_res.model_name == image_res.model_name == graph_res.model_name
  res_df["Model"].extend([text_res.model_name] * 3)
  res_df["Modality"].extend(["Text", "Image", "Graph"])
  res_df[recall.name].extend([text_res.metric_avg_results[recall.name],
                              image_res.metric_avg_results[recall.name],
                              graph_res.metric_avg_results[recall.name]])
  res_df[ndcg.name].extend([text_res.metric_avg_results[ndcg.name],
                            image_res.metric_avg_results[ndcg.name],
                            graph_res.metric_avg_results[ndcg.name]])
res_df = pd.DataFrame(res_df)

### Recall

In [None]:
sns.barplot(x="Model", y=recall.name, hue="Modality", palette="Set1", data=res_df);

### NDCG

In [None]:
sns.barplot(x="Model", y=ndcg.name, hue="Modality", palette="Set1", data=res_df);

## Question

Based on the above results, what can you infer regarding cross-modality utilization? 

## References

1.   Ma, H., Yang, H., Lyu, M. R., & King, I. (2008, October). Sorec: social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 931-940).
2.   Wang, C., & Blei, D. M. (2011, August). Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 448-456).
3.   He, R., & McAuley, J. (2016, February). VBPR: visual bayesian personalized ranking from implicit feedback. In Thirtieth AAAI Conference on Artificial Intelligence.
4.   Salah, A., Truong, Q. T., & Lauw, H. W. (2020). Cornac: A Comparative Framework for Multimodal Recommender Systems. Journal of Machine Learning Research, 21(95), 1-5. https://cornac.preferred.ai
5.   Truong, Q. T., Salah, A., Tran, T. B., Guo, J., & Lauw, H. W. (2021). Exploring Cross-Modality Utilization in Recommender Systems. IEEE Internet Computing.
6.   Wang, H., Wang, N., & Yeung, D. Y. (2015, August). Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1235-1244).
7.   Park, C., Kim, D., Oh, J., & Yu, H. (2017, April). Do "Also-Viewed" Products Help User Rating Prediction? In Proceedings of the 26th International Conference on World Wide Web (pp. 1113-1122).
