MovieLens Rating Prediction Notebook

This notebook runs faster on a GPU runtime. To enable it, go to Edit > Notebook Settings > Hardware Accelerator > GPU.


## Setup

In [1]:
import torch

print(torch.__version__)


2.4.0+cu121


In [2]:
# Install required packages
import os

os.environ['TORCH'] = torch.__version__
!pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
!pip install pyg-lib -f https://data.pyg.org/whl/nightly/torch-${TORCH}.html
!pip install git+https://github.com/pyg-team/pytorch_geometric.git

!pip install sentence_transformers
!pip3 install fuzzywuzzy[speedup]
!pip install captum

## Link Regression on the MovieLens Dataset

This notebook shows how to load a set of `*.csv` files into a `torch_geometric.data.HeteroData` object and how to train a [heterogeneous graph model](https://pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html#hgtutorial).

We are going to use the [MovieLens dataset](https://grouplens.org/datasets/movielens/), which is collected by the GroupLens Research group. This toy dataset describes movies, users, and their ratings. We are going to predict the rating a user would give to a movie.

## Data Ingestion

In [3]:
from torch_geometric.data import download_url, extract_zip
import pandas as pd

dataset_name = 'ml-latest-small'

url = f'https://files.grouplens.org/datasets/movielens/{dataset_name}.zip'
extract_zip(download_url(url, '.'), '.')

movies_path = f'./{dataset_name}/movies.csv'
ratings_path = f'./{dataset_name}/ratings.csv'

Downloading https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Extracting ./ml-latest-small.zip


In [4]:
# Load the entire ratings dataframe into memory:
ratings_df = pd.read_csv(ratings_path)[["userId", "movieId", "rating"]]

# Load the entire movie dataframe into memory:
movies_df = pd.read_csv(movies_path, index_col='movieId')

print('movies.csv:')
print('===========')
print(movies_df[["genres", "title"]].head())
print(f"Number of movies: {len(movies_df)}")
print()
print('ratings.csv:')
print('============')
print(ratings_df[["userId", "movieId", "rating"]].head())
print(f"Number of ratings: {len(ratings_df)}")
print()

movies.csv:
                                              genres  \
movieId                                                
1        Adventure|Animation|Children|Comedy|Fantasy   
2                         Adventure|Children|Fantasy   
3                                     Comedy|Romance   
4                               Comedy|Drama|Romance   
5                                             Comedy   

                                      title  
movieId                                      
1                          Toy Story (1995)  
2                            Jumanji (1995)  
3                   Grumpier Old Men (1995)  
4                  Waiting to Exhale (1995)  
5        Father of the Bride Part II (1995)  
Number of movies: 9742

ratings.csv:
   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0
Number of ratings: 100836



Additionally, let's add our ratings to the dataset to get predictions for movies we haven't seen yet.

There are two ways to add ratings:
1. **Add ratings manually**
2. **Upload IMDB ratings**


### Add your ratings manually


We recommend adding at least 10 ratings. Let's first check out the most rated movies. Additional movies in the table are: *Avatar*, *The Dark Knight*, *Pretty Woman*,
*Titanic*, *The Lion King*, *Jurassic Park*, *The Matrix*, *The Lord of the Rings* and *The Avengers*. Please note that the article in the movie title is often at the end of the title, for example `Matrix, The (1999)`.
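Because many titles store the leading article at the end, a query such as `The Matrix` may match less well than `Matrix, The`. The following is a minimal sketch of a helper that rewrites a query into that convention before searching. The function name `to_movielens_title` and the list of articles are illustrative assumptions, not part of the dataset or of `fuzzywuzzy`.

```python
# Hypothetical helper: move a leading English article to the end of the title,
# mirroring entries such as "Matrix, The (1999)" in movies.csv.
ARTICLES = ("The", "A", "An")

def to_movielens_title(query: str) -> str:
    query = query.strip()
    for article in ARTICLES:
        prefix = article + " "
        if query.lower().startswith(prefix.lower()):
            rest = query[len(prefix):].strip()
            return f"{rest}, {article}"
    # No leading article found: return the query unchanged.
    return query

print(to_movielens_title("The Matrix"))     # Matrix, The
print(to_movielens_title("Jurassic Park"))  # Jurassic Park
```

The rewritten string can then be typed into the title prompt in place of the original query.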

In [5]:
from fuzzywuzzy import fuzz

# Specify your userId
our_user_id = ratings_df['userId'].max() + 1

print('Most rated movies:')
print('==================')
most_rated_movies = ratings_df['movieId'].value_counts().head(10)
print(movies_df.loc[most_rated_movies.index][["title"]])

# Initialize your rating list
ratings = []

Most rated movies:
                                             title
movieId                                           
356                            Forrest Gump (1994)
318               Shawshank Redemption, The (1994)
296                            Pulp Fiction (1994)
593               Silence of the Lambs, The (1991)
2571                            Matrix, The (1999)
260      Star Wars: Episode IV - A New Hope (1977)
480                           Jurassic Park (1993)
110                              Braveheart (1995)
589              Terminator 2: Judgment Day (1991)
527                        Schindler's List (1993)


In [7]:
# Add your ratings here:
num_ratings = 5

while len(ratings) < num_ratings:
    print(f'Select the {len(ratings) + 1}. movie:')
    print('=====================================')
    movie = input('Please enter the movie title: ')
    movies_df['title_score'] = movies_df['title'].apply(lambda x: fuzz.ratio(x, movie))
    print(movies_df.sort_values('title_score', ascending=False)[['title']].head(5))
    movie_id = input('Please enter the movie id: ')
    if not movie_id:
        continue
    movie_id = int(movie_id)
    rating = input('Please enter your rating: ')
    # Skip on empty input; convert afterwards so that a rating of 0 is not dropped.
    if not rating:
        continue
    rating = float(rating)
    assert 0 <= rating <= 5
    ratings.append({'movieId': movie_id, 'rating': rating, 'userId': our_user_id})
    print()

Select the 1. movie:
Please enter the movie title: man
                 title
movieId               
143511    Human (2015)
26152    Batman (1966)
188675   Dogman (2018)
56156    Hitman (2007)
592      Batman (1989)
Please enter the movie id: 26152
Please enter your rating: 3

Select the 2. movie:
Please enter the movie title: spider
                      title
movieId                    
6197          Spider (2002)
3168      Easy Rider (1969)
5349      Spider-Man (2002)
51077    Ghost Rider (2007)
158           Casper (1995)
Please enter the movie id: 5349
Please enter your rating: 5

Select the 3. movie:
Please enter the movie title: woman
                   title
movieId                 
8666     Catwoman (2004)
188675     Dogman (2018)
5564      Swimfan (2002)
2894      Romance (1999)
5670     Comedian (2002)
Please enter the movie id: 8666
Please enter your rating: 3

Select the 4. movie:
Please enter the movie title: mom
                      title
movieId                    
171

In [8]:
# Add your ratings to the rating dataframe
ratings_df = pd.concat([ratings_df, pd.DataFrame.from_records(ratings)])

### Upload your IMDB ratings (Optional)

If you have an IMDB account, you can also upload your IMDB ratings. To do so, follow these steps:
1. Go to https://www.imdb.com/
2. Log in to your account
3. Go to `Your Ratings`
4. Click on `Export Ratings` after clicking the three dots in the upper right corner
5. Upload the downloaded `ratings.csv` file to the current directory
6. Rename the file to `imdb_ratings.csv`
7. Run the following cell


In [None]:
# Select our userId
our_user_id = ratings_df['userId'].max() + 1

# Load the IMDB ratings:
imdb_rating_path = './imdb_ratings.csv'
imdb_ratings_df = pd.read_csv(imdb_rating_path)
imdb_ratings_df.columns = imdb_ratings_df.columns.str.strip().str.lower()

# The IMDB movie titles / ids do not match the movie titles / ids in the MovieLens dataframes,
# so we need to map them:
imdb_ratings_df['title'] = imdb_ratings_df['title'] + ' (' + imdb_ratings_df['year'].astype(str) + ')'
imdb_ratings_df['title'] = imdb_ratings_df['title'].str.strip()
movies_df['title'] = movies_df['title'].str.strip()
imdb_ratings_df = imdb_ratings_df.merge(movies_df['title'].reset_index(), on='title', how='inner')

# The IMDB ratings are on a scale from 1 to 10, so we halve them to match the
# 0.5 to 5 star scale of MovieLens. Half stars are kept rather than truncated:
imdb_ratings_df['rating'] = imdb_ratings_df['your rating'] / 2

# Your ratings that we are going to use:
print('Your IMDB ratings:')
print('==================')
print(imdb_ratings_df[['title', 'rating']].head(10))

# Finally, we can add the ratings to the ratings data frame:
imdb_ratings_df['userId'] = our_user_id
ratings_df = pd.concat([ratings_df, imdb_ratings_df[['movieId', 'rating', 'userId']]])

## Data Preprocessing

We are going to use the genre as well as the title of the movie as node features. For the `title` features, we are going to use a pre-trained [sentence transformer](https://www.sbert.net/) model to encode the title into a vector.
For the `genre` features, we are going to use a one-hot encoding.
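To make the one-hot step concrete, here is a small dependency-free sketch of what a pipe-separated genre string turns into. The three genre strings are toy inputs in the same format as `movies.csv`; the variable names are illustrative.

```python
# Toy genre strings in the same pipe-separated format as movies.csv.
genre_strings = ["Adventure|Children|Fantasy", "Comedy|Romance", "Comedy"]

# Build a sorted vocabulary of all genres that occur.
vocabulary = sorted({g for s in genre_strings for g in s.split("|")})

# One row per movie, one column per genre: 1.0 if the movie has the genre.
one_hot = [[1.0 if g in s.split("|") else 0.0 for g in vocabulary]
           for s in genre_strings]

print(vocabulary)  # ['Adventure', 'Children', 'Comedy', 'Fantasy', 'Romance']
for row in one_hot:
    print(row)
```

Each movie can carry several genres, so a row may contain more than one `1.0`. The title embedding is then concatenated to the right of these columns.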

In [10]:
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# One-hot encode the genres:
genres = movies_df['genres'].str.get_dummies('|').values
genres = torch.from_numpy(genres).to(torch.float)

# Load the pre-trained sentence transformer model and encode the movie titles:
model = SentenceTransformer('all-MiniLM-L6-v2')
with torch.no_grad():
    titles = model.encode(movies_df['title'].tolist(), convert_to_tensor=True, show_progress_bar=True)
    titles = titles.cpu()

# Concatenate the genres and title features:
movie_features = torch.cat([genres, titles], dim=-1)

# We don't have user features, which is why we use an identity matrix
user_features = torch.eye(len(ratings_df['userId'].unique()))



The `ratings.csv` file contains the ratings of users for movies. From this
file we extract the `userId` and create a mapping from each `userId`
to a unique consecutive value in the range `[0, num_users)`. This is needed because we want our final data representation to be as compact as possible, *e.g.*, the representation of the first user should be accessible via `x[0]`.
We do the same for the `movieId`.
Afterwards, we obtain the final `edge_index` representation of shape `[2, num_ratings]` from `ratings.csv` by merging the mapped user and movie indices with the raw indices given by the original data frame.
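The mapping idea can be sketched without pandas on a handful of toy ratings. In this sketch the movie mapping is derived from the row order of a hypothetical feature table (`movie_table_ids`, standing in for something like `movies_df.index`), which is one way to make sure that mapped id `k` refers to feature row `k`. All ids and names here are made up for illustration.

```python
# Toy ratings as (userId, movieId, rating); the ids are deliberately non-consecutive.
ratings = [(10, 500, 4.0), (10, 700, 3.5), (42, 500, 5.0), (99, 900, 2.0)]

# Row order of a hypothetical movie feature table.
movie_table_ids = [500, 600, 700, 900]

# Users: consecutive ids in order of first appearance.
user_map = {}
for user_id, _, _ in ratings:
    user_map.setdefault(user_id, len(user_map))

# Movies: ids taken from the feature table, so mapped id k matches feature row k.
movie_map = {movie_id: k for k, movie_id in enumerate(movie_table_ids)}

# COO edge index: first row holds user indices, second row holds movie indices.
edge_index = [
    [user_map[u] for u, _, _ in ratings],
    [movie_map[m] for _, m, _ in ratings],
]

print(user_map)    # {10: 0, 42: 1, 99: 2}
print(edge_index)  # [[0, 0, 1, 2], [0, 2, 0, 3]]
```

Movie 600 has no rating in this toy data, yet it still occupies index 1, so the feature rows and node indices stay in step.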


In [11]:
# Create a mapping from the userId to a unique consecutive value in the range [0, num_users):
unique_user_id = ratings_df['userId'].unique()
unique_user_id = pd.DataFrame(data={
    'userId': unique_user_id,
    'mappedUserId': pd.RangeIndex(len(unique_user_id)),
})
print("Mapping of user IDs to consecutive values:")
print("==========================================")
print(unique_user_id.head())
print()

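# Create a mapping from the movieId to a unique consecutive value in the range [0, num_movies).
# Note: these ids follow the order of first appearance in ratings_df, whereas the rows of
# movie_features follow movies_df. Building the mapping from movies_df.index instead
# would keep mapped id k aligned with feature row k.
unique_movie_id = ratings_df['movieId'].unique()
unique_movie_id = pd.DataFrame(data={
    'movieId': unique_movie_id,
    'mappedMovieId': pd.RangeIndex(len(unique_movie_id)),
})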
print("Mapping of movie IDs to consecutive values:")
print("===========================================")
print(unique_movie_id.head())
print()

# Merge the mappings with the original data frame:
ratings_df = ratings_df.merge(unique_user_id, on='userId')
ratings_df = ratings_df.merge(unique_movie_id, on='movieId')

# With this, we are ready to create the edge_index representation in COO format
# following the PyTorch Geometric semantics:
edge_index = torch.stack([
    torch.tensor(ratings_df['mappedUserId'].values),
    torch.tensor(ratings_df['mappedMovieId'].values),
], dim=0)

assert edge_index.shape == (2, len(ratings_df))

print("Final edge indices pointing from users to movies:")
print("================================================")
print(edge_index[:, :10])

Mapping of user IDs to consecutive values:
   userId  mappedUserId
0       1             0
1       2             1
2       3             2
3       4             3
4       5             4

Mapping of movie IDs to consecutive values:
   movieId  mappedMovieId
0        1              0
1        3              1
2        6              2
3       47              3
4       50              4

Final edge indices pointing from users to movies:
tensor([[ 0,  4,  6, 14, 16, 17, 18, 20, 26, 30],
        [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])


## Heterogeneous Graph Construction

With this we are ready to initialize our heterogeneous graph data object and pass the
necessary information to it.

We also take care of adding reverse edges to the `HeteroData` object. This allows our GNN
model to use both directions of the edges for the message passing.
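As a rough picture of what adding reverse edges means for a bipartite user-movie relation, the sketch below swaps the two rows of a toy COO edge index. It only illustrates the idea on plain lists and does not reproduce the internals of the PyG transform.

```python
# COO edge index for ('user', 'rates', 'movie'): row 0 = users, row 1 = movies.
rates = [[0, 0, 1, 2],
         [0, 2, 0, 3]]

# The reverse relation ('movie', 'rev_rates', 'user') swaps the two rows,
# so every edge u -> m gains a counterpart m -> u.
rev_rates = [rates[1][:], rates[0][:]]

print(rev_rates)  # [[0, 2, 0, 3], [0, 0, 1, 2]]
```

With both relations present, movie nodes can send messages to users and users can send messages to movies.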

In [12]:
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

# Create the heterogeneous graph data object:
data = HeteroData()

# Add the user nodes:
data['user'].x = user_features  # [num_users, num_features_users]

# Add the movie nodes:
data['movie'].x = movie_features  # [num_movies, num_features_movies]

# Add the rating edges:
data['user', 'rates', 'movie'].edge_index = edge_index  # [2, num_ratings]

# Add the rating labels:
rating = torch.from_numpy(ratings_df['rating'].values).to(torch.float)
data['user', 'rates', 'movie'].edge_label = rating  # [num_ratings]

# We also need to make sure to add the reverse edges from movies to users
# in order to let a GNN be able to pass messages in both directions.
# We can leverage the `T.ToUndirected()` transform for this from PyG:
data = T.ToUndirected()(data)

# With the above transformation we also got reversed labels for the edges.
# We are going to remove them:
del data['movie', 'rev_rates', 'user'].edge_label

assert data['user'].num_nodes == len(unique_user_id)
assert data['user', 'rates', 'movie'].num_edges == len(ratings_df)
assert data['movie'].num_features == 404

data

HeteroData(
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_index=[2, 100841],
    edge_label=[100841],
  },
  (movie, rev_rates, user)={ edge_index=[2, 100841] }
)

## Dataset Splitting

We can now split our data into a training, validation and test set. We are going to use
the `T.RandomLinkSplit` transform from PyG to do this. This transform will randomly
split the links with their label/rating into training, validation and test set.
We are going to use 80% of the edges for training, 10% for validation and 10% for testing.

<font color='red'>Please note that this part is implemented for you; for grading purposes, do not modify it. However, feel free to try different configurations just for fun.</font>

In [13]:
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
train_data, val_data

(HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80673],
     edge_label=[80673],
     edge_label_index=[2, 80673],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80673] }
 ),
 HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80673],
     edge_label=[10084],
     edge_label_index=[2, 10084],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80673] }
 ))
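The edge counts in the output above follow from simple arithmetic on the 100,841 rating edges. The sketch below mirrors only that arithmetic with a seeded shuffle; `split_edges` is a hypothetical helper, not the implementation of `T.RandomLinkSplit`.

```python
import random

def split_edges(num_edges, val_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle edge positions and cut them into train/val/test index lists."""
    order = list(range(num_edges))
    random.Random(seed).shuffle(order)
    num_val = int(val_ratio * num_edges)
    num_test = int(test_ratio * num_edges)
    val = order[:num_val]
    test = order[num_val:num_val + num_test]
    train = order[num_val + num_test:]
    return train, val, test

train_idx, val_idx, test_idx = split_edges(100841)
print(len(train_idx), len(val_idx), len(test_idx))  # 80673 10084 10084
```

Note that in the output above the validation object still carries the 80,673 training edges in `edge_index` for message passing, while its 10,084 held-out edges appear only in `edge_label_index`.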

## Graph Neural Network

The overall model is provided for you, and your own GraphSage and GAT layers can serve as its components during training. You need to implement the layer below yourself before it can be used for training.

<font color='red'>This is optional and will add bonus points to your project.</font>

## GraphSage Implementation (Optional)

Now let's start working on our own implementation of layers! This part is meant to familiarize you with implementing a PyTorch layer based on message passing. You will be implementing the **forward**, **message** and **aggregate** functions.

Generally, the **forward** function is where the actual message passing is conducted. All logic in each iteration happens in **forward**, where we call the **propagate** function to propagate information from neighbor nodes to central nodes. So the general paradigm is pre-processing -> propagate -> post-processing.

Recall the process of message passing we introduced in homework 1. **propagate** in turn calls **message**, which transforms the information of neighbor nodes into messages, **aggregate**, which aggregates all messages from neighbor nodes into one, and **update**, which generates the embedding for nodes in the next iteration.

Our implementation differs slightly from this: we do not explicitly implement **update**, but put the logic for updating nodes in the **forward** function. More specifically, after information is propagated, we can apply further operations to the output of **propagate**. The output of **forward** is exactly the embeddings after the current iteration.

In addition, tensors passed to **propagate()** can be mapped to the respective nodes $i$ and $j$ by appending `_i` or `_j` to the variable name, *e.g.*, `x_i` and `x_j`. Note that we generally refer to $i$ as the central nodes that aggregate information, and to $j$ as the neighboring nodes, since this is the most common notation.

Please find more details in the comments. One thing to note is that we're adding **skip connections** to our GraphSage. Formally, the update rule for our model is described as below:

\begin{equation}
h_v^{(l)} = W_l\cdot h_v^{(l-1)} + W_r \cdot AGG(\{h_u^{(l-1)}, \forall u \in N(v) \})
\end{equation}

For simplicity, we use mean aggregations where:

\begin{equation}
AGG(\{h_u^{(l-1)}, \forall u \in N(v) \}) = \frac{1}{|N(v)|} \sum_{u\in N(v)} h_u^{(l-1)}
\end{equation}

Additionally, $\ell_2$ normalization is applied after each iteration.
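To see the update rule in numbers, here is a dependency-free sketch on a three-node toy graph. It deliberately avoids the `MessagePassing` API, which is the part left as an exercise. The weight matrices `W_l` and `W_r` are fixed, made-up values rather than learned parameters, and the helper names are illustrative.

```python
import math

def matvec(W, h):
    """Multiply matrix W (list of rows) with vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def mean(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_normalize(h, eps=1e-12):
    """Scale h to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in h))
    return [x / max(norm, eps) for x in h]

# Toy graph: previous-layer embeddings h^{(l-1)} and neighbor lists N(v).
h_prev = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 2.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}

# Assumed weights: W_l acts on the node itself, W_r on the aggregated neighborhood.
W_l = [[1.0, 0.0], [0.0, 1.0]]
W_r = [[0.5, 0.0], [0.0, 0.5]]

h_next = {}
for v, nbrs in neighbors.items():
    agg = mean([h_prev[u] for u in nbrs])  # mean aggregation over N(v)
    combined = [a + b for a, b in zip(matvec(W_l, h_prev[v]), matvec(W_r, agg))]
    h_next[v] = l2_normalize(combined)     # l2 normalization after the update

print(h_next[0])
```

For node 0 the neighbor mean is `[1.0, 2.0]`, so the vector before normalization is `[1.0, 0.0] + [0.5, 1.0] = [1.5, 1.0]`, which is then scaled to unit length.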

To complete the work correctly, we have to understand how the different functions interact with each other. In **propagate** we can pass in any parameters we want. For example, we pass in $x$ as a parameter:

... = propagate(..., $x$=($x_{neighbor}$, $x_{central}$), ...)

Here $x_{central}$ and $x_{neighbor}$ represent the features of the **central** nodes and of the **neighbor** nodes. When a tuple is passed, PyG reads its first entry as the source (neighbor) features and its second entry as the target (central) features. If we use the same representations for central and neighbor nodes, then $x_{central}$ and $x_{neighbor}$ are identical.

Suppose $x_{central}$ and $x_{neighbor}$ are both of shape N * d, where N is the number of nodes and d is the feature dimension.

Then the message function can take parameters called $x\_i$ and $x\_j$. Usually $x\_i$ represents the "central nodes" and $x\_j$ the "neighbor nodes". Pay attention to the shape here: $x\_i$ and $x\_j$ are both of shape E * d (**not N!**), where E is the number of edges. $x\_i$ is obtained by stacking the embeddings of the central node of every edge, looked up from the $x_{central}$ we passed to propagate. Similarly, $x\_j$ is obtained by stacking the embeddings of the neighbor node of every edge, looked up from the $x_{neighbor}$ we passed to propagate.

Let's look at an example. Suppose we have 4 nodes, so $x_{central}$ and $x_{neighbor}$ are of shape 4 * d. We have two edges, written here as (central, neighbor) pairs: (1, 2) and (3, 0). Thus, $x\_i$ is obtained by $[x_{central}[1]^T; x_{central}[3]^T]^T$, and $x\_j$ is obtained by $[x_{neighbor}[2]^T; x_{neighbor}[0]^T]^T$. With PyG's default `flow="source_to_target"`, the neighbor is the source and the central node is the target of an edge, so these two edges correspond to `edge_index = [[2, 0], [1, 3]]`.
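The lookup in this example can be reproduced on plain lists. The sketch below uses made-up feature values with d = 2 and treats each edge as a (central, neighbor) pair, as in the example above; it shows that both results have one row per edge, not one row per node.

```python
# Node features of shape N x d (here N = 4, d = 2); values are illustrative.
x = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]

# Two edges written as (central, neighbor) pairs.
edges = [(1, 2), (3, 0)]

# One row per edge: look up the central node for x_i and the neighbor for x_j.
x_i = [x[central] for central, _ in edges]
x_j = [x[neighbor] for _, neighbor in edges]

print(x_i)  # [[1.0, 1.1], [3.0, 3.1]]
print(x_j)  # [[2.0, 2.1], [0.0, 0.1]]
```

Both `x_i` and `x_j` have shape E x d = 2 x 2 even though the graph has 4 nodes.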

<font color='red'>For the following questions, DON'T refer to any existing implementations online.</font>

In [None]:
import torch.nn.functional as F

import torch_scatter
from torch_geometric.nn.conv import MessagePassing

class GraphSage(MessagePassing):

    def __init__(self, in_channels, out_channels, normalize = True,
                 bias = False, **kwargs):
        super(GraphSage, self).__init__(**kwargs)

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.normalize = normalize

        self.lin_l = None
        self.lin_r = None

        ############################################################################
        # TODO: Your code here!
        # Define the layers needed for the message and update functions below.
        # self.lin_l is the linear transformation that you apply to embedding
        #            for central node.
        # self.lin_r is the linear transformation that you apply to aggregated
        #            message from neighbors.
        # Our implementation is ~2 lines, but don't worry if you deviate from this.



        ############################################################################

        self.reset_parameters()

    def reset_parameters(self):
        self.lin_l.reset_parameters()
        self.lin_r.reset_parameters()

    def forward(self, x, edge_index, size = None):
        """"""

        out = None

        ############################################################################
        # TODO: Your code here!
        # Implement message passing, as well as any post-processing (our update rule).
        # 1. First call propagate function to conduct the message passing.
        #    1.1 See there for more information:
        #        https://pytorch-geometric.readthedocs.io/en/latest/notes/create_gnn.html
        #    1.2 We use the same representations for central (x_central) and
        #        neighbor (x_neighbor) nodes, which means you'll pass x=(x, x)
        #        to propagate.
        # 2. Update our node embedding with skip connection.
        # 3. If normalize is set, apply L2 normalization (defined in
        #    torch.nn.functional)
        # Our implementation is ~5 lines, but don't worry if you deviate from this.



        ############################################################################

        return out

    def message(self, x_j):

        out = None

        ############################################################################
        # TODO: Your code here!
        # Implement your message function here.
        # Our implementation is ~1 line, but don't worry if you deviate from this.


        # x_j.shape == [num_edges, num_node_features]
        #  Basically, for every edge on the graph there is a message that will be
        #  transferred by it. This message has some representation (hidden state).


        ############################################################################

        return out

    def aggregate(self, inputs, index, dim_size = None):

        out = None

        # The axis along which to index number of nodes.
        node_dim = self.node_dim

        ############################################################################
        # TODO: Your code here!
        # Implement your aggregate function here.
        # See here as how to use torch_scatter.scatter:
        # https://pytorch-scatter.readthedocs.io/en/latest/functions/scatter.html#torch_scatter.scatter
        # Our implementation is ~1 line, but don't worry if you deviate from this.


        # 1. inputs.shape == [num_edges, num_node_features]
        #       (basically the output of self.message)

        # 2. index.shape == [num_edges]
        #       (destination node of each edge, basically takes values between [0, ..., num_nodes - 1])

        # 3. self.node_dim == -2
        #       (the dimension of the nodes in the `inputs` tensor. It's -2 by default in the constructor
        #       of MessagePassing (https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/conv/message_passing.html#MessagePassing)

        # 4. dim_size == index.max() + 1,
        #       it will be the size of the output, which will be the number of nodes as we want
        #       to compute the aggregated message for every node

        # aggregate the message for every node
        #  The shape of the output will be: [num_nodes, num_features]
        #  (i.e. the aggregated message of every node)

        ############################################################################

        return out


## Graph Neural Network (GNN) Model
We are now ready to define our GNN model. We are going to use a simple GNN model with
two message passing layers for the encoding of the user and movie nodes.
Additionally, we are going to use a decoder to predict the rating for the encoded
user-movie combination.
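The decoder's input can be pictured as one concatenated vector per user-movie pair. The sketch below builds that input from toy embeddings on plain lists; the values and the hidden size of 2 are made up for illustration.

```python
# Toy embeddings produced by an encoder (values are illustrative).
z_user = [[0.1, 0.2], [0.3, 0.4]]                # 2 users, hidden size 2
z_movie = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]   # 3 movies, hidden size 2

# Supervision edges: row 0 = user indices, row 1 = movie indices.
edge_label_index = [[0, 1, 1], [2, 0, 1]]

# Decoder input: one concatenated [user || movie] vector per edge.
decoder_input = [z_user[u] + z_movie[m] for u, m in zip(*edge_label_index)]

print(len(decoder_input), len(decoder_input[0]))  # 3 4
print(decoder_input[0])                           # [0.1, 0.2, 3.0, 3.1]
```

Each row has twice the hidden size, which is why the first linear layer of an edge decoder of this kind takes `2 * hidden_channels` input features.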

<font color='red'>You may use your own GNN layers implemented above, or the GNN layers provided by the PyTorch Geometric library, such as GAT, GCN, etc.</font>

In [23]:
from torch_geometric.nn import SAGEConv, to_hetero, GCNConv

class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        # (-1, -1) lets each layer infer its source/target input sizes lazily
        # on the first forward pass.
        self.conv1 = SAGEConv((-1, -1), hidden_channels)  # input -> hidden_channels
        self.conv2 = SAGEConv((-1, -1), out_channels)  # hidden_channels -> out_channels

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu() # First conv layer with ReLU
        x = self.conv2(x, edge_index) # Second conv layer
        return x

# class GNNEncoder(torch.nn.Module):
#     def __init__(self, hidden_channels, out_channels):
#         super().__init__()
#
#         ############################################################################
#         # TODO: Your code here!
#         # Initialize two Graph convolution layers.
#         # The first layer should map from the input dimension to hidden_channels.
#         # The second layer should map from hidden_channels to out_channels.
#         #
#         # Hint 1: Use SAGEConv or any other conv operators from torch_geometric.nn
#         # Hint 2: The input dimension is unknown at initialization, so use (-1, -1)
#         #         as the input size for both layers.
#         #
#         # Your implementation should be 2 lines of code.
#         ############################################################################
#         self.conv1 = GCNConv(-1, hidden_channels, add_self_loops=False)  # First layer with add_self_loops=False
#         self.conv2 = GCNConv(-1, out_channels, add_self_loops=False)  # Second layer with add_self_loops=False
#
#     def forward(self, x, edge_index):
#
#         ############################################################################
#         # TODO: Your code here!
#         # Implement the forward pass of the GNN encoder.
#         # 1. Apply the first Graph conv layer (self.conv1) to the input.
#         # 2. Apply a ReLU activation function to the output of the first layer.
#         # 3. Apply the second Graph conv layer (self.conv2) to the result.
#         # 4. Return the final output.
#         #
#         # Hint 1: Use the conv1 and conv2 layers you defined in __init__
#         # Hint 2: You can chain operations using method chaining, e.g., .relu()
#         #
#         # Your implementation should be 2-3 lines of code.
#         ############################################################################
#
#         x = self.conv1(x, edge_index).relu()  # First conv layer with ReLU
#         x = self.conv2(x, edge_index)  # Second conv layer
#         return x

class EdgeDecoder(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = torch.nn.Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = torch.nn.Linear(hidden_channels, 1)

    def forward(self, z_dict, edge_label_index):
        row, col = edge_label_index
        z = torch.cat([z_dict['user'][row], z_dict['movie'][col]], dim=-1)

        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)


class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
        self.encoder = to_hetero(self.encoder, data.metadata(), aggr='sum')
        self.decoder = EdgeDecoder(hidden_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        z_dict = self.encoder(x_dict, edge_index_dict)
        return self.decoder(z_dict, edge_label_index)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Model(hidden_channels=32).to(device)

print(model)

Model(
  (encoder): GraphModule(
    (conv1): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
  )
  (decoder): EdgeDecoder(
    (lin1): Linear(in_features=64, out_features=32, bias=True)
    (lin2): Linear(in_features=32, out_features=1, bias=True)
  )
)


## Training a Heterogeneous GNN

Training our GNN is then similar to training any PyTorch model.
We move the model to the desired device and initialize an optimizer (Adam in the cell below) that adjusts the model parameters using the gradients of the loss.

The training loop applies the forward computation of the model, computes the loss from the ground-truth labels and the obtained predictions, and adjusts the model parameters via back-propagation followed by an optimizer step.


In [24]:
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train():
    model.train()
    optimizer.zero_grad()
    pred = model(train_data.x_dict, train_data.edge_index_dict,
                 train_data['user', 'movie'].edge_label_index)
    target = train_data['user', 'movie'].edge_label
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def test(data):
    data = data.to(device)
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict,
                 data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    return float(rmse)


train_data = train_data.to(device)  # move once, outside the loop

for epoch in range(1, 301):
    loss = train()
    train_rmse = test(train_data)
    val_rmse = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_rmse:.4f}, '
          f'Val: {val_rmse:.4f}')

Epoch: 001, Loss: 12.4500, Train: 3.2671, Val: 3.2826
Epoch: 002, Loss: 10.6739, Train: 2.8540, Val: 2.8703
Epoch: 003, Loss: 8.1455, Train: 2.1051, Val: 2.1223
Epoch: 004, Loss: 4.4314, Train: 1.0818, Val: 1.0796
Epoch: 005, Loss: 1.1704, Train: 1.8257, Val: 1.7816
Epoch: 006, Loss: 5.3059, Train: 1.7431, Val: 1.6966
Epoch: 007, Loss: 3.2936, Train: 1.1046, Val: 1.0795
Epoch: 008, Loss: 1.2202, Train: 1.1533, Val: 1.1578
Epoch: 009, Loss: 1.3300, Train: 1.4476, Val: 1.4613
Epoch: 010, Loss: 2.0956, Train: 1.6097, Val: 1.6253
Epoch: 011, Loss: 2.5912, Train: 1.6166, Val: 1.6323
Epoch: 012, Loss: 2.6135, Train: 1.4996, Val: 1.5140
Epoch: 013, Loss: 2.2487, Train: 1.2989, Val: 1.3097
Epoch: 014, Loss: 1.6871, Train: 1.0932, Val: 1.0949
Epoch: 015, Loss: 1.1950, Train: 1.0300, Val: 1.0153
Epoch: 016, Loss: 1.0608, Train: 1.1693, Val: 1.1407
Epoch: 017, Loss: 1.3674, Train: 1.3169, Val: 1.2827
Epoch: 018, Loss: 1.7343, Train: 1.3002, Val: 1.2668
Epoch: 019, Loss: 1.6906, Train: 1.1541, Val

## Evaluation

From the validation results, our model generalizes reasonably well to unseen data. The validation RMSE should be around 0.9, meaning that a typical prediction is off by roughly 0.9 stars. We can now evaluate our model on the test set and take a closer look at the predictions.

In [25]:
with torch.no_grad():
    test_data = test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict,
                 test_data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = test_data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    print(f'Test RMSE: {rmse:.4f}')

userId = test_data['user', 'movie'].edge_label_index[0].cpu().numpy()
movieId = test_data['user', 'movie'].edge_label_index[1].cpu().numpy()
pred = pred.cpu().numpy()
target = target.cpu().numpy()

print(pd.DataFrame({'userId': userId, 'movieId': movieId, 'rating': pred, 'target': target}))

Test RMSE: 0.8826
       userId  movieId    rating  target
0         306     3511  2.741167     0.5
1         357     1059  3.861096     4.5
2         375      670  3.676630     4.0
3         231     2270  3.821468     3.0
4         437      791  4.050112     5.0
...       ...      ...       ...     ...
10079     490     3934  3.780862     4.0
10080     513     1038  3.542411     4.0
10081      42     2654  3.933442     5.0
10082      99       12  4.207401     3.5
10083      67      941  3.378377     4.0

[10084 rows x 4 columns]
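
RMSE is often read as an "average error", but it weights large misses more heavily than a plain mean absolute error (MAE) does. The sketch below recomputes both metrics on the first five rows of the table above. Five rows are not representative of the full test set, so the numbers will differ from the reported test RMSE; the helper names are hypothetical.

```python
import math

# First five (prediction, target) pairs copied from the table above
rows = [
    (2.741167, 0.5),
    (3.861096, 4.5),
    (3.676630, 4.0),
    (3.821468, 3.0),
    (4.050112, 5.0),
]


def rmse(pairs):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(p - t) for p, t in pairs) / len(pairs)


sample_rmse = rmse(rows)
sample_mae = mae(rows)
print(f"RMSE: {sample_rmse:.4f}, MAE: {sample_mae:.4f}")
```

The first row (predicted about 2.7 for a 0.5-star rating) dominates the RMSE of this sample, which is why RMSE comes out larger than MAE here.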


## Movie recommendations

We can now use the model to generate ratings for a movie we haven't seen.


In [29]:
# Your mappedUserId
mapped_user_id = unique_user_id[unique_user_id['userId'] == our_user_id]['mappedUserId'].values

# Check if mapped_user_id is found
if mapped_user_id.size > 0:
    # Select movies that you haven't seen before
    movies_rated = ratings_df[ratings_df['mappedUserId'] == mapped_user_id[0]]
    movies_not_rated = movies_df[~movies_df.index.isin(movies_rated['movieId'])]
    movies_not_rated = movies_not_rated.merge(unique_movie_id, on='movieId')
    movie = movies_not_rated.sample(1)

    print(f"The movie we want to predict a rating for is: {movie['title'].item()}")
else:
    print(f"User ID {our_user_id} not found in the dataset.")

User ID 612 not found in the dataset.


In [34]:
mapped_user_id = unique_user_id[unique_user_id['userId'] == our_user_id]

# Check if the DataFrame is empty before accessing elements
if not mapped_user_id.empty:
    mapped_user_id = mapped_user_id.iloc[0]

    # Create the new edge_label_index on the same device as the model inputs
    edge_label_index = torch.tensor([
        mapped_user_id['mappedUserId'],  # mapped user index
        movie['mappedMovieId'].item()  # mapped movie index
    ], device=device)

    with torch.no_grad():
        test_data = test_data.to(device)
        pred = model(test_data.x_dict, test_data.edge_index_dict, edge_label_index)
        pred = pred.clamp(min=0, max=5).detach().cpu().numpy()

    print(f"Predicted rating: {pred}")
else:
    print(f"User ID {our_user_id} not found in the dataset.")

User ID 612 not found in the dataset.


In [35]:
pred.item()

ValueError: can only convert an array of size 1 to a Python scalar

## Explaining the Predictions

PyTorch Geometric also provides a way to explain the predictions of a GNN. Let's check which movie ratings have influenced this prediction the most.

We will use the [Captum](https://captum.ai/) library to explain the predictions.
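
The attribution method used in the next cell is Integrated Gradients. As a rough intuition, it accumulates the gradient of the output along a straight path from a baseline input to the actual input and scales it by the input difference. The sketch below illustrates this on a toy function in plain Python. It is only an illustration under stated assumptions: the toy model, its hand-written gradient, and the all-zeros baseline are hypothetical, and it attributes to plain input features rather than to graph edges.

```python
def toy_model(x):
    """Toy differentiable function standing in for a trained model."""
    return x[0] * x[1] + 3.0 * x[2]


def toy_gradient(x):
    """Analytic gradient of toy_model with respect to each input."""
    return [x[1], x[0], 3.0]


def integrated_gradients(x, baseline, steps=100):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between baseline and input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_gradient(point)
        for i, g in enumerate(grad):
            total[i] += g
    # Average gradient along the path, scaled by the input difference
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("output difference:  ", toy_model(x) - toy_model(baseline))
```

A useful sanity check is that the attributions add up (approximately) to the difference between the model output at the input and at the baseline.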

In [36]:
from torch_geometric.explain import Explainer, CaptumExplainer

explainer = Explainer(
    model=model,
    algorithm=CaptumExplainer('IntegratedGradients'),
    explanation_type='model',
    model_config=dict(
        mode='regression',
        task_level='edge',
        return_type='raw',
    ),
    node_mask_type=None,
    edge_mask_type='object',
)

explanation = explainer(
    test_data.x_dict, test_data.edge_index_dict, index=0,
    edge_label_index=edge_label_index).cpu().detach()
explanation

NameError: name 'edge_label_index' is not defined

In [None]:
# User to movie link + attribution
user_to_movie = explanation['user', 'movie'].edge_index.numpy().T
user_to_movie_attr = explanation['user', 'movie'].edge_mask.numpy().T
user_to_movie_df = pd.DataFrame(
    np.hstack([user_to_movie, user_to_movie_attr.reshape(-1, 1)]),
    columns=['mappedUserId', 'mappedMovieId', 'attr']
)

# Movie to user link + attribution
movie_to_user = explanation['movie', 'user'].edge_index.numpy().T
movie_to_user_attr = explanation['movie', 'user'].edge_mask.numpy().T
movie_to_user_df = pd.DataFrame(
    np.hstack([movie_to_user, movie_to_user_attr.reshape(-1, 1)]),
    columns=['mappedMovieId', 'mappedUserId', 'attr']
)
explanation_df = pd.concat([user_to_movie_df, movie_to_user_df])
id_columns = ['mappedUserId', 'mappedMovieId']
explanation_df[id_columns] = explanation_df[id_columns].astype(int)

print(f"Attribution for all edges towards prediction of movie rating of movie:\n {movie['title'].item()}")
print("==========================================================================================")
print(explanation_df.sort_values(by='attr'))

Attribution for all edges towards prediction of movie rating of movie:
 Savages (2012)
       mappedUserId  mappedMovieId      attr
42293           447           6090 -0.008770
23312           566           8872 -0.001163
85148           306           5778 -0.000265
40746           566           1054 -0.000029
65887           426            756 -0.000027
...             ...            ...       ...
31919           610           8872  0.028762
44472           610           1054  0.029672
51347           610            926  0.046084
90391           610            756  0.059868
34869           610           5778  0.060863

[181514 rows x 3 columns]


In [None]:
# Select links that connect to our user
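# mapped_user_id is the row selected in cell In [34], so compare against its 'mappedUserId' field
explanation_df = explanation_df[explanation_df['mappedUserId'] == mapped_user_id['mappedUserId']]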

# We group the attribution scores by movie
explanation_df = explanation_df.groupby('mappedMovieId').sum()

# Merge with movies_df to receive title
# But first, we need to add the original id
explanation_df = explanation_df.merge(unique_movie_id, on='mappedMovieId')
explanation_df = explanation_df.merge(movies_df, on='movieId')

pd.options.display.float_format = "{:,.9f}".format

print("Top movies that influenced the prediction:")
print("==============================================")
# Rank movies by the absolute value of their summed attribution
top_movies = explanation_df.sort_values(by='attr', ascending=False, key=lambda x: x.abs())
print(top_movies[['title', 'attr']].head())

Top movies that influenced the prediction:
               title        attr
3  Paper Moon (1973) 0.061177694
0  Spider-Man (2002) 0.059889573
1      Avatar (2009) 0.046110074
4           Paterson 0.029951031
2    Iron Man (2008) 0.029699224


Find more on the official PyTorch Geometric website [here](https://www.pyg.org/).