MovieLens Rating Prediction Workshop Notebook

This notebook runs faster on a GPU runtime. To enable it, go to Edit > Notebook Settings > Hardware Accelerator > GPU.


## Setup

In [1]:
import os

# Use the eager mode
os.environ['PT_HPU_LAZY_MODE'] = '0'

# Verify the environment variable is set
print(f"PT_HPU_LAZY_MODE: {os.environ['PT_HPU_LAZY_MODE']}")

import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

import habana_frameworks.torch.core as htcore

# use rich traceback

from rich import traceback
traceback.install()

device = torch.device("hpu")

PT_HPU_LAZY_MODE: 0




2.4.0a0+git74cd574


  return isinstance(object, types.FunctionType)


## Link Regression on the MovieLens Dataset

This notebook shows how to load a set of `*.csv` files into a `torch_geometric.data.HeteroData` object and how to train a [heterogeneous graph model](https://pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html#hgtutorial).

We are going to use the [Movielens dataset](https://grouplens.org/datasets/movielens/), which is collected by the GroupLens Research group. The toy dataset describes movies, users, and their ratings. We are going to predict the rating of a user for a movie.

## Data Ingestion

In [2]:
from torch_geometric.data import download_url, extract_zip
import pandas as pd

dataset_name = 'ml-latest-small'

url = f'https://files.grouplens.org/datasets/movielens/{dataset_name}.zip'
extract_zip(download_url(url, '.'), '.')

movies_path = f'./{dataset_name}/movies.csv'
ratings_path = f'./{dataset_name}/ratings.csv'

Using existing file ml-latest-small.zip
Extracting ./ml-latest-small.zip


In [3]:
# Load the entire ratings dataframe into memory:
ratings_df = pd.read_csv(ratings_path)[["userId", "movieId", "rating"]]

# Load the entire movie dataframe into memory:
movies_df = pd.read_csv(movies_path, index_col='movieId')

print('movies.csv:')
print('===========')
print(movies_df[["genres", "title"]].head())
print(f"Number of movies: {len(movies_df)}")
print()
print('ratings.csv:')
print('============')
print(ratings_df[["userId", "movieId", "rating"]].head())
print(f"Number of ratings: {len(ratings_df)}")
print()

movies.csv:
                                              genres   
movieId                                                
1        Adventure|Animation|Children|Comedy|Fantasy  \
2                         Adventure|Children|Fantasy   
3                                     Comedy|Romance   
4                               Comedy|Drama|Romance   
5                                             Comedy   

                                      title  
movieId                                      
1                          Toy Story (1995)  
2                            Jumanji (1995)  
3                   Grumpier Old Men (1995)  
4                  Waiting to Exhale (1995)  
5        Father of the Bride Part II (1995)  
Number of movies: 9742

ratings.csv:
   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0
Number of ratings: 100836



Additionally, let's add our ratings to the dataset to get predictions for movies we haven't seen yet.

There are two ways to add ratings:
1. **Add ratings manually**
2. **Upload IMDB ratings**


### Add your ratings manually


We recommend adding at least 10 ratings. Let's first check out the most rated movies. Additional movies in the table are: *Avatar*, *The Dark Knight*, *Pretty Women*,
*Titanic*, *The Lion King*, *Jurassic Park*, *The Matrix*, *The Lord of the Rings* and *The Avengers*. Please note that the article in the movie title is often at the end of the title.

In [4]:
from fuzzywuzzy import fuzz

# Specify your userId
our_user_id = ratings_df['userId'].max() + 1

print('Most rated movies:')
print('==================')
most_rated_movies = ratings_df['movieId'].value_counts().head(10)
print(movies_df.loc[most_rated_movies.index][["title"]])

# Initialize your rating list
ratings = []

Most rated movies:
                                             title
movieId                                           
356                            Forrest Gump (1994)
318               Shawshank Redemption, The (1994)
296                            Pulp Fiction (1994)
593               Silence of the Lambs, The (1991)
2571                            Matrix, The (1999)
260      Star Wars: Episode IV - A New Hope (1977)
480                           Jurassic Park (1993)
110                              Braveheart (1995)
589              Terminator 2: Judgment Day (1991)
527                        Schindler's List (1993)


In [5]:
# Add your ratings here:
num_ratings = 5
while len(ratings) < num_ratings:
    print(f'Select the {len(ratings) + 1}. movie:')
    print('=====================================')
    movie = input('Please enter the movie title: ')
    movies_df['title_score'] = movies_df['title'].apply(lambda x: fuzz.ratio(x, movie))
    print(movies_df.sort_values('title_score', ascending=False)[['title']].head(5))
    movie_id = input('Please enter the movie id: ')
    if not movie_id:
        continue
    movie_id = int(movie_id)
    rating = float(input('Please enter your rating: '))
    if not rating:
        continue
    assert 0 <= rating <= 5
    ratings.append({'movieId': movie_id, 'rating': rating, 'userId': our_user_id})
    print()

Select the 1. movie:
             title
movieId           
5475      Z (1969)
4745      O (2001)
1260      M (1931)
163981   31 (2016)
2188     54 (1998)

Select the 2. movie:
             title
movieId           
5475      Z (1969)
4745      O (2001)
1260      M (1931)
163981   31 (2016)
2188     54 (1998)

Select the 3. movie:
             title
movieId           
5475      Z (1969)
4745      O (2001)
1260      M (1931)
163981   31 (2016)
2188     54 (1998)

Select the 4. movie:
             title
movieId           
5475      Z (1969)
4745      O (2001)
1260      M (1931)
163981   31 (2016)
2188     54 (1998)

Select the 5. movie:
             title
movieId           
5475      Z (1969)
4745      O (2001)
1260      M (1931)
163981   31 (2016)
2188     54 (1998)



In [6]:
# Add your ratings to the rating dataframe
ratings_df = pd.concat([ratings_df, pd.DataFrame.from_records(ratings)])

### Upload your IMDB ratings

If you have an IMDB account, you can also upload your IMDB ratings. To do so, please follow the following steps:
1. Go to https://www.imdb.com/
2. Login to your account
3. Go to `Your Ratings`
4. Click on `Export Ratings` after clicking the three dots in the upper right corner
5. Upload the downloaded `ratings.csv` file to the current directory
6. Rename the file to `imdb_ratings.csv`
7. Run the following cell


In [7]:
# Select our userId
our_user_id = ratings_df['userId'].max() + 1

# Load the IMDB ratings:
imdb_rating_path = f'./imdb_ratings.csv'
imdb_ratings_df = pd.read_csv(imdb_rating_path)
imdb_ratings_df.columns = imdb_ratings_df.columns.str.strip().str.lower()

# The IMDB movie titles / ids do not match the movie titles /ids in the movielens dataframes
# so we need to map them:
imdb_ratings_df['title'] = imdb_ratings_df['title'] + ' (' + imdb_ratings_df['year'].astype(str) + ')'
imdb_ratings_df['title'] = imdb_ratings_df['title'].str.strip()
movies_df['title'] = movies_df['title'].str.strip()
imdb_ratings_df = imdb_ratings_df.merge(movies_df['title'].reset_index(), on='title', how='inner', )

# The ratings are on a scale from 1 to 10, so we need to transform them to a scale from 0 to 5:
imdb_ratings_df['rating'] = (imdb_ratings_df['your rating'] / 2).astype(int)

# Your ratings that we are going to use:
print('Your IMDB ratings:')
print('==================')
print(imdb_ratings_df[['title', 'rating']].head(10))

# Finally, we can add the ratings to the ratings data frame:
imdb_ratings_df['userId'] = our_user_id
ratings_df = pd.concat([ratings_df, imdb_ratings_df[['movieId', 'rating', 'userId']]])

Your IMDB ratings:
Empty DataFrame
Columns: [title, rating]
Index: []


## Data Preprocessing

We are going to use the genre as well as the title of the movie as node features. For the `title` features, we are going to use a pre-trained [sentence transformer](https://www.sbert.net/) model to encode the title into a vector.
For the `genre` features, we are going to use a one-hot encoding.

In [None]:
from habana_frameworks.torch.hpu import wrap_in_hpu_graph

In [12]:
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# One-hot encode the genres:
genres = movies_df['genres'].str.get_dummies('|').values
genres = torch.from_numpy(genres).to(torch.float).to(device)

# Load the pre-trained sentence transformer model and encode the movie titles:
model = SentenceTransformer('all-MiniLM-L6-v2')
with torch.no_grad():
    titles = model.encode(movies_df['title'].tolist(), convert_to_tensor=True, show_progress_bar=True)
    # titles = titles.cpu()
    titles = titles.to(device)

# Concatenate the genres and title features:
movie_features = torch.cat([genres, titles], dim=-1)

# We don't have user features, which is why we use an identity matrix
user_features = torch.eye(len(ratings_df['userId'].unique()), device=device)


Batches:   0%|          | 0/305 [00:00<?, ?it/s]

In [13]:
import habana_frameworks.torch as ht
ht.hpu.wrap_in_hpu_graph?

[0;31mSignature:[0m
[0mht[0m[0;34m.[0m[0mhpu[0m[0;34m.[0m[0mwrap_in_hpu_graph[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mmodule[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0masynchronous[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdisable_tensor_cache[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdry_run[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mmax_graphs[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Wraps the forward method of a module in an HPU graph capture and replay mechanism.

Args:
    module (torch.nn.Module): The module to be wrapped.
    asynchronous (bool, optional): Specifies whether the graph capture and replay should be asynchronous.
        Defaults to False.
    disable_tensor_cache (bool, optional): Specifies whether to cache tensors during graph replay.
        Defaults to False.
    dr

The `ratings.csv` file contains the ratings of users for movies. From this
file we are extracting the `userId`. We create a mapping from the `userId`
to a unique consecutive value in the range `[0, num_users]`. This is needed as we want our final data representation to be as compact as possible, *e.g.*, the representation of a user in the first row should be accessible via `x[0]`.
The same we do for the `movieId`.
Afterwards, we obtain the final `edge_index` representation of shape `[2, num_ratings]` from `ratings.csv` by merging mapped user and movie indices with the raw indices given by the original data frame.


In [17]:
# Create a mapping from the userId to a unique consecutive value in the range [0, num_users]:
unique_user_id = ratings_df['userId'].unique()
unique_user_id = pd.DataFrame(data={
    'userId': unique_user_id,
    'mappedUserId': pd.RangeIndex(len(unique_user_id))
    })
print("Mapping of user IDs to consecutive values:")
print("==========================================")
print(unique_user_id.head())
print()

# Create a mapping from the movieId to a unique consecutive value in the range [0, num_movies]:
unique_movie_id = ratings_df['movieId'].unique()
unique_movie_id = pd.DataFrame(data={
    'movieId': unique_movie_id,
    'mappedMovieId': pd.RangeIndex(len(unique_movie_id))
    })
print("Mapping of movie IDs to consecutive values:")
print("===========================================")
print(unique_movie_id.head())
print()

# Merge the mappings with the original data frame:
ratings_df = ratings_df.merge(unique_user_id, on='userId')
ratings_df = ratings_df.merge(unique_movie_id, on='movieId')

# With this, we are ready to create the edge_index representation in COO format
# following the PyTorch Geometric semantics:
edge_index = torch.stack([
    torch.tensor(ratings_df['mappedUserId_x'].values),
    torch.tensor(ratings_df['mappedMovieId_x'].values)]
    , dim=0)

assert edge_index.shape == (2, len(ratings_df))

print("Final edge indices pointing from users to movies:")
print("================================================")
print(edge_index[:, :10])

Mapping of user IDs to consecutive values:
   userId  mappedUserId
0       1             0
1       5             1
2       7             2
3      15             3
4      17             4

Mapping of movie IDs to consecutive values:
   movieId  mappedMovieId
0        1              0
1        3              1
2        6              2
3       47              3
4       50              4

Final edge indices pointing from users to movies:
tensor([[ 0,  4,  6, 14, 16, 17, 18, 20, 26, 30],
        [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])


## Heterogeneous Graph Construction

With this we are ready to initialize our heterogeneous graph data object and pass the
necessary information to it.

We also take care of adding reverse edges to the `HeteroData` object. This allows our GNN
model to use both directions of the edges for the message passing.

In [18]:
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

# Create the heterogeneous graph data object:
data = HeteroData()

# Add the user nodes:
data['user'].x = user_features  # [num_users, num_features_users]

# Add the movie nodes:
data['movie'].x = movie_features  # [num_movies, num_features_movies]

# Add the rating edges:
data['user', 'rates', 'movie'].edge_index = edge_index  # [2, num_ratings]

# Add the rating labels:
rating = torch.from_numpy(ratings_df['rating'].values).to(torch.float)
data['user', 'rates', 'movie'].edge_label = rating  # [num_ratings]

# We also need to make sure to add the reverse edges from movies to users
# in order to let a GNN be able to pass messages in both directions.
# We can leverage the `T.ToUndirected()` transform for this from PyG:
data = T.ToUndirected()(data)

# With the above transformation we also got reversed labels for the edges.
# We are going to remove them:
del data['movie', 'rev_rates', 'user'].edge_label

assert data['user'].num_nodes == len(unique_user_id)
assert data['user', 'rates', 'movie'].num_edges == len(ratings_df)
assert data['movie'].num_features == 404

data

HeteroData(
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_index=[2, 100841],
    edge_label=[100841],
  },
  (movie, rev_rates, user)={ edge_index=[2, 100841] }
)

## Dataset Splitting

We can now split our data into a training, validation and test set. We are going to use
the `T.RandomLinkSplit` transform from PyG to do this. This transform will randomly
split the links with their label/rating into training, validation and test set.
We are going to use 80% of the edges for training, 10% for validation and 10% for testing.

In [19]:
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
train_data, val_data

(HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80673],
     edge_label=[80673],
     edge_label_index=[2, 80673],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80673] }
 ),
 HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80673],
     edge_label=[10084],
     edge_label_index=[2, 10084],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80673] }
 ))

## Graph Neural Network

We are now ready to define our GNN model. We are going to use a simple GNN model with
two message passing layers for the encoding of the user and movie nodes.
Additionally, we are going to use a decoder to predict the rating for the encoded
user-movie combination.

In [24]:
from torch_geometric.nn import SAGEConv, to_hetero

class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x


class EdgeDecoder(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = torch.nn.Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = torch.nn.Linear(hidden_channels, 1)

    def forward(self, z_dict, edge_label_index):
        row, col = edge_label_index
        z = torch.cat([z_dict['user'][row], z_dict['movie'][col]], dim=-1)

        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)


class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
        self.encoder = to_hetero(self.encoder, data.metadata(), aggr='sum')
        self.decoder = EdgeDecoder(hidden_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        z_dict = self.encoder(x_dict, edge_index_dict)
        return self.decoder(z_dict, edge_label_index)

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device = torch.device('hpu')

model = Model(hidden_channels=32).to(device)

print(model)

Model(
  (encoder): GraphModule(
    (conv1): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
  )
  (decoder): EdgeDecoder(
    (lin1): Linear(in_features=64, out_features=32, bias=True)
    (lin2): Linear(in_features=32, out_features=1, bias=True)
  )
)


## Training a Heterogeneous GNN

Training our GNN is then similar to training any PyTorch model.
We move the model to the desired device, and initialize an optimizer that takes care of adjusting model parameters via stochastic gradient descent.

The training loop applies the forward computation of the model, computes the loss from ground-truth labels and obtained predictions, and adjusts model parameters via back-propagation and stochastic gradient descent.


In [25]:
model.train()
model = torch.compile(model,backend="hpu_backend")

In [26]:
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train():
    model.train()
    optimizer.zero_grad()
    pred = model(train_data.x_dict, train_data.edge_index_dict,
                 train_data['user', 'movie'].edge_label_index)
    target = train_data['user', 'movie'].edge_label
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def test(data):
    data = data.to(device)
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict,
                 data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    return float(rmse)


for epoch in range(1, 301):
    train_data = train_data.to(device)
    loss = train()
    train_rmse = test(train_data)
    val_rmse = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_rmse:.4f}, '
          f'Val: {val_rmse:.4f}')



Epoch: 001, Loss: 13.7033, Train: 3.4880, Val: 3.4738
Epoch: 002, Loss: 12.1663, Train: 3.1141, Val: 3.1031
Epoch: 003, Loss: 9.6976, Train: 2.4089, Val: 2.4054
Epoch: 004, Loss: 5.8026, Train: 1.3080, Val: 1.3180
Epoch: 005, Loss: 1.7110, Train: 1.7670, Val: 1.7561
Epoch: 006, Loss: 3.4628, Train: 1.7768, Val: 1.7661
Epoch: 007, Loss: 3.5347, Train: 1.2298, Val: 1.2303
Epoch: 008, Loss: 1.5123, Train: 1.0834, Val: 1.0941
Epoch: 009, Loss: 1.1738, Train: 1.3792, Val: 1.3865
Epoch: 010, Loss: 1.9023, Train: 1.5928, Val: 1.5966
Epoch: 011, Loss: 2.5370, Train: 1.6417, Val: 1.6443
Epoch: 012, Loss: 2.6952, Train: 1.5510, Val: 1.5542
Epoch: 013, Loss: 2.4056, Train: 1.3586, Val: 1.3638
Epoch: 014, Loss: 1.8459, Train: 1.1330, Val: 1.1405
Epoch: 015, Loss: 1.2837, Train: 1.0181, Val: 1.0256
Epoch: 016, Loss: 1.0365, Train: 1.1186, Val: 1.1222
Epoch: 017, Loss: 1.2513, Train: 1.2630, Val: 1.2635
Epoch: 018, Loss: 1.5953, Train: 1.2515, Val: 1.2524
Epoch: 019, Loss: 1.5663, Train: 1.1096, Val

## Evaluation

From the validation results, our model can generalize well to unseen data. The val RMSE is should be around 0.9, meaning that, on average our model is off by 0.9 stars. We can now evaluate our model on the test set and take a closer look into the predictions.

In [27]:
with torch.no_grad():
    test_data = test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict,
                 test_data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = test_data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    print(f'Test RMSE: {rmse:.4f}')

userId = test_data['user', 'movie'].edge_label_index[0].cpu().numpy()
movieId = test_data['user', 'movie'].edge_label_index[1].cpu().numpy()
pred = pred.cpu().numpy()
target = target.cpu().numpy()

print(pd.DataFrame({'userId': userId, 'movieId': movieId, 'rating': pred, 'target': target}))



Test RMSE: 1.3009
       userId  movieId    rating  target
0         131     3196  2.693830     4.0
1         533      856  2.709803     3.5
2         479       65  2.598594     5.0
3         102     2310  3.101466     4.5
4         316       12  2.438756     4.0
...       ...      ...       ...     ...
10079     373       25  3.420612     3.0
10080     364       34  3.342680     0.5
10081     181     2219  2.639490     2.0
10082     452     6202  2.850482     1.0
10083     121     1190  3.009157     5.0

[10084 rows x 4 columns]


## Movie recommendations

We can now use the model to generate ratings for a movie we haven't seen.


In [34]:
our_user_id = 5

# Your mappedUserId
mapped_user_id = unique_user_id[unique_user_id['userId'] == our_user_id]['mappedUserId'].values[0]

# Select movies that you haven't seen before
movies_rated = ratings_df[ratings_df['mappedUserId'] == mapped_user_id]
movies_not_rated = movies_df[~movies_df.index.isin(movies_rated['movieId'])]
movies_not_rated = movies_not_rated.merge(unique_movie_id, on='movieId')
movie = movies_not_rated.sample(1)

print(f"The movie we want to predict a raiting for is:  {movie['title'].item()}")

The movie we want to predict a raiting for is:  Dial M for Murder (1954)


In [None]:
# Create new `edge_label_index` between the user and the movie
edge_label_index = torch.tensor([
    mapped_user_id,
    movie.mappedMovieId.item()], device=device).unsqueeze(1)

with torch.no_grad():
    test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict, edge_label_index)
    pred = pred.clamp(min=0, max=5).detach().cpu().numpy()



In [37]:
pred.item()

2.834235906600952

## Explaining the Predictions

PyTorch Geometric also provides a way to explain the predictions of a GNN. Let's check which movie ratings have influenced this prediction the most.

We will use the [captum](https://captum.ai/) library to explain the predictions.

In [38]:
from torch_geometric.explain import Explainer, CaptumExplainer

explainer = Explainer(
    model=model,
    algorithm=CaptumExplainer('IntegratedGradients'),
    explanation_type='model',
    model_config=dict(
        mode='regression',
        task_level='edge',
        return_type='raw',
    ),
    node_mask_type=None,
    edge_mask_type='object',
)

explanation = explainer(
    test_data.x_dict, test_data.edge_index_dict, index=0,
    edge_label_index=edge_label_index).cpu().detach()
explanation



HeteroExplanation(
  prediction=[1],
  target=[1],
  index=[1],
  edge_label_index=[2, 1],
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_mask=[90757],
    edge_index=[2, 90757],
  },
  (movie, rev_rates, user)={
    edge_mask=[90757],
    edge_index=[2, 90757],
  }
)

In [39]:
# User to movie link + attribution
user_to_movie = explanation['user', 'movie'].edge_index.numpy().T
user_to_movie_attr = explanation['user', 'movie'].edge_mask.numpy().T
user_to_movie_df = pd.DataFrame(
    np.hstack([user_to_movie, user_to_movie_attr.reshape(-1,1)]),
    columns = ['mappedUserId', 'mappedMovieId', 'attr']
)

# Movie to user link + attribution
movie_to_user = explanation['movie', 'user'].edge_index.numpy().T
movie_to_user_attr = explanation[ 'movie', 'user'].edge_mask.numpy().T
movie_to_user_df = pd.DataFrame(
    np.hstack([movie_to_user, movie_to_user_attr.reshape(-1,1)]),
    columns = ['mappedMovieId', 'mappedUserId','attr']
)
explanation_df = pd.concat([user_to_movie_df, movie_to_user_df])
explanation_df[["mappedUserId", "mappedMovieId"]] = explanation_df[["mappedUserId", "mappedMovieId"]].astype(int)

print(f"Attribtion for all edges towards prediction of movie rating of movie:\n {movie['title'].item()}")
print("==========================================================================================")
print(explanation_df.sort_values(by='attr'))

Attribtion for all edges towards prediction of movie rating of movie:
 Dial M for Murder (1954)
       mappedUserId  mappedMovieId      attr
57771           591            531 -0.010007
76252           485            531 -0.009576
56876            98            531 -0.009318
76796           218            531 -0.009179
88854           603            531 -0.008979
...             ...            ...       ...
80512             1            251  0.088108
49847             1            252  0.088561
75845             1            245  0.089052
63441             1            257  0.093790
38068             1            253  0.094610

[181514 rows x 3 columns]


In [40]:
# Select links that connect to our user
explanation_df = explanation_df[explanation_df['mappedUserId'] == mapped_user_id]

# We group the attribution scores by movie
explanation_df = explanation_df.groupby('mappedMovieId').sum()

# Merge with movies_df to receive title
# But first, we need to add the original id
explanation_df = explanation_df.merge(unique_movie_id, on='mappedMovieId')
explanation_df = explanation_df.merge(movies_df, on='movieId')

pd.options.display.float_format = "{:,.9f}".format

print("Top movies that influenced the prediction:")
print("==============================================")
print(explanation_df.sort_values(by='attr', ascending=False, key= lambda x: abs(x))[['title', 'attr']].head())

Top movies that influenced the prediction:
                                title        attr
19              Lion King, The (1994) 0.094581954
23     Remains of the Day, The (1993) 0.093746770
11                   Apollo 13 (1995) 0.088945812
18  Ace Ventura: Pet Detective (1994) 0.088519772
17                   Quiz Show (1994) 0.088076740
