# Installation

In [1]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric

Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-scatter
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_scatter-2.0.7-cp37-cp37m-linux_x86_64.whl (2.5MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.7
Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-sparse
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_sparse-0.6.9-cp37-cp37m-linux_x86_64.whl (1.5MB)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.9
Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-cluster
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl (1.0MB)

# Introduction
In this notebook we are going to predict movie ratings in the [MovieLens 100K Dataset](https://grouplens.org/datasets/movielens/100k/). The dataset contains 100,000 ratings from 943 users on 1,682 movies. Both the users and the movies have features associated with them.

For this task, we are going to build a graph containing two types of nodes: one representing users and another representing movies. The graph will contain edges connecting users to the movies they rated. Each edge carries one attribute, the rating.

We will be going through the following steps:


1. Downloading and processing the dataset
2. Converting the dataset to a PyG Data object
3. Training a simple Graph Neural Network to predict movie ratings  

# Downloading and Processing Dataset

### Download dataset
We will first download and unzip the MovieLens 100K dataset. A full description of the files contained in this dataset can be found in the [README](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).



In [2]:
import urllib.request

url = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
filename = url.rpartition("/")[2]  # "ml-100k.zip"

# Download the archive and write it to the working directory
with urllib.request.urlopen(url) as response, open(filename, "wb") as f:
    f.write(response.read())

folder = filename[:-4]  # "ml-100k"
!ls

ml-100k.zip  sample_data


In [3]:
import zipfile

# Use a context manager and avoid shadowing the built-in `zip`
with zipfile.ZipFile(filename) as archive:
    archive.extractall()
!ls ml-100k

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


### Loading Edges
The movie ratings are stored in the file *u.data*. Each row contains a user id, a movie id, a rating and the timestamp of that rating.


In [4]:
import pandas as pd

filename_edges = folder + "/u.data"

# Load the file with pandas, using a tab as the separator and naming the columns.
# Discard the timestamp.
df_edges = pd.read_csv(filename_edges, sep="\t",
                    header=None, names=["user_id", "movie_id", "rating", "timestamp"],
                    usecols=["user_id", "movie_id", "rating"])
df_edges.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


### Loading User features

Next we will load the user features, which are stored in *u.user*. This file contains the age, gender, occupation and zip code of each user.

In [5]:
filename_users = folder +"/u.user"


In [6]:
# Load the file with pandas, using "|" as the separator and naming the columns.
# Keep the user_id, age, gender and occupation columns and discard the zipcode column.
df_users = pd.read_csv(
    filename_users,
    sep="|",
    header=None,
    names=["user_id", "age", "gender", "occupation", "zipcode"],
    usecols=["user_id", "age", "gender", "occupation"],
)
df_users.head()

Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


We ignore the zip code and one-hot encode the gender and occupation features using the pandas *get_dummies* function. We also standardize the age of the users by subtracting the mean and dividing by the standard deviation.

In [7]:
# dtype=int keeps the one-hot columns as 0/1 integers (recent pandas versions default to booleans)
df_gender_onehot = pd.get_dummies(df_users["gender"], dtype=int)
df_occupation_onehot = pd.get_dummies(df_users["occupation"], dtype=int)
age = df_users["age"]
age_standard = (age - age.mean()) / age.std()
# Post-processed user features
df_users_pp = pd.concat([df_users["user_id"], age_standard, df_occupation_onehot, df_gender_onehot], axis=1)

df_users_pp.head()

Unnamed: 0,user_id,age,administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer,F,M
0,1,-0.824422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,2,1.554043,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
2,3,-0.906438,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
3,4,-0.824422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,5,-0.086278,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0


### Loading Movie features
Finally, we will load the movie features contained in the files *u.item* and *u.genre*.

Quoting from the [readme](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt) of the dataset:

---


u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.


In [8]:
filename_movies = folder + "/u.item"


In [9]:
filename_genre = folder + "/u.genre"
df_genre = pd.read_csv(filename_genre, header=None, sep="|", names=["genre", "id"])
list_genre = df_genre.genre.to_list()
list_genre

['unknown',
 'Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

In [10]:
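movie_columns = ["movie_id", "title", "release_date", "video_release_date", "url"] + list_genre

# Load the file with pandas, using "|" as the separator and naming the columns.
# Only the movie id and the genre flags are kept.
# The encoding is set explicitly because the titles in u.item are not valid UTF-8.
df_movies = pd.read_csv(filename_movies, sep="|", header=None, names=movie_columns,
                        usecols=["movie_id"] + list_genre, encoding="latin-1")
df_movies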

Unnamed: 0,movie_id,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,1679,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,1680,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,1681,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


### Summary of the dataset loaded so far

In [11]:
df_users_pp # User features

Unnamed: 0,user_id,age,administrator,artist,doctor,educator,engineer,entertainment,executive,healthcare,homemaker,lawyer,librarian,marketing,none,other,programmer,retired,salesman,scientist,student,technician,writer,F,M
0,1,-0.824422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,2,1.554043,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
2,3,-0.906438,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
3,4,-0.824422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,5,-0.086278,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,939,-0.660390,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
939,940,-0.168294,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
940,941,-1.152486,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
941,942,1.143963,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0


In [12]:
df_movies # Movie features

Unnamed: 0,movie_id,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,1679,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,1680,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,1681,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [13]:
df_edges.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


# Converting to a PyTorch Geometric Dataset

## Heterogeneous Graphs in PyTorch Geometric

The dataframe *df_edges* contains an edge list in which the first column holds the id of a user and the second column holds the id of a movie. PyTorch Geometric does not (currently) handle this format natively: PyG expects the source and target indices in the edge list to belong to the same set of nodes.

However, there is a simple workaround.

Consider this edge list, where the first row contains the user ids and the second row contains the movie ids:

```
0 5 4 2 1 3
0 1 2 1 2 1
```

We can shift the movie ids by the number of users (6 in this example) to form this edge list:

```
0 5 4 2 1 3
6 7 8 7 8 7
```

We can pass this edge list to existing convolutional operators in PyG without modifying them or writing our own convolutional operator. 
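
As a minimal sketch of the shift described above, the toy edge list can be written as a tensor and the movie row offset by the number of users. The variable names `edge_index`, `n_user` and `shifted` are chosen for this illustration only.

```python
import torch

# Toy edge list from the text: row 0 holds user ids, row 1 holds movie ids.
edge_index = torch.tensor([[0, 5, 4, 2, 1, 3],
                           [0, 1, 2, 1, 2, 1]])
n_user = 6  # users 0..5 in this toy example

# Shift only the movie row, so movie nodes occupy indices n_user, n_user + 1, ...
shifted = edge_index.clone()  # clone so the original edge list stays untouched
shifted[1] += n_user

print(shifted)
# tensor([[0, 5, 4, 2, 1, 3],
#         [6, 7, 8, 7, 8, 7]])
```

After the shift, user and movie indices no longer collide, so both node types can live in a single feature matrix with the users stored first.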

However, we have to keep in mind that the two node types (users and movies) still represent fundamentally different entities and that their feature vectors have different sizes. We will come back to this issue when we create the graph neural network.





## Converting to PyTorch Geometric Format
We will now convert the dataset to the PyTorch Geometric format. We keep the edge index as two separate arrays (one for users, one for movies) to facilitate later calculations related to the heterogeneous nature of the problem. The reason will become clear in the next section.

In [14]:
import torch
from torch_geometric.data import Data

assert (df_movies["movie_id"].isin(df_edges["movie_id"])).all()
assert (df_users_pp["user_id"].isin(df_edges["user_id"])).all()
assert (df_edges["movie_id"].isin(df_movies["movie_id"])).all()
assert (df_edges["user_id"].isin(df_users_pp["user_id"])).all()

# User and movie feature matrices (the id columns are not features)
x_user = torch.tensor(df_users_pp.drop(columns=["user_id"]).values)
x_movie = torch.tensor(df_movies.drop(columns=["movie_id"]).values)

# Ids start at 1 in the original dataset.
# Shift the ids so that they start at 0.
edge_index_user = torch.tensor(df_edges["user_id"].values) - 1
edge_index_movie = torch.tensor(df_edges["movie_id"].values) - 1

# Edge attributes, i.e. the ratings
edge_ratings = torch.tensor(df_edges["rating"].values)

# Number of edges
n_edges = edge_ratings.shape[0]

# Checks
assert len(edge_index_user.unique()) == len(x_user)
assert len(edge_index_movie.unique()) == len(x_movie)

In [15]:
# Define train and test split for later model validation
test_size = 0.2
train_size= 0.8

ind_cut = int(train_size*n_edges)

edge_ratings_train = edge_ratings[:ind_cut]
edge_ratings_test = edge_ratings[ind_cut:]
edge_index_user_train = edge_index_user[:ind_cut]
edge_index_user_test = edge_index_user[ind_cut:]
edge_index_movie_train = edge_index_movie[:ind_cut]
edge_index_movie_test = edge_index_movie[ind_cut:]

# Package attributes into Data object
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to the CPU when no GPU is present

data = Data(edge_index_user_train = edge_index_user_train.to(device),
            edge_index_user_test = edge_index_user_test.to(device),
            edge_index_movie_train = edge_index_movie_train.to(device),
            edge_index_movie_test = edge_index_movie_test.to(device),
            edge_ratings_train = edge_ratings_train.to(device),
            edge_ratings_test = edge_ratings_test.to(device),
            n_edges = n_edges,
            x_user=x_user.float().to(device), x_movie=x_movie.float().to(device),
            n_user = x_user.shape[0],
            n_movie=x_movie.shape[0]).to(device)
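
The split above simply takes the first 80% of the rows, which relies on the rows of *u.data* not being sorted in any meaningful way (the dataset README describes the file as randomly ordered). If the edge list is ever reordered, a shuffled split may be safer. The following is a sketch only; `shuffled_split` is a hypothetical helper that is not part of this notebook.

```python
import torch

def shuffled_split(n_edges, train_size=0.8, seed=0):
    """Return index tensors for a random train/test split of n_edges edges."""
    generator = torch.Generator().manual_seed(seed)  # fixed seed for reproducibility
    perm = torch.randperm(n_edges, generator=generator)
    ind_cut = int(train_size * n_edges)
    return perm[:ind_cut], perm[ind_cut:]

# Toy usage with 10 edges: 8 go to training, 2 to testing
train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))
```

The returned indices could then be used to index the edge tensors, e.g. `edge_ratings[train_idx]` and `edge_index_user[train_idx]`, in place of the slices used above.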

# Graph Neural Network for Heterogeneous Data

Now that we have defined our data format, we design a graph neural network that takes this data as input and predicts movie ratings. The general idea of the network is:



1. Map the user and movie feature vectors to feature vectors of the same dimension. This makes it straightforward to use convolutional operators that expect homogeneous node features. We use two separate linear layers for this purpose, since the user nodes have 24 features and the movie nodes have 19. Even if both had the same number of features, this step would still be necessary, because the features mean different things.
2. Construct an edge list with movie ids shifted as described above
3. Pass the features from step 1 and edge list from step 2 through multiple graph convolutional operators
4. For each edge, concatenate the features of the adjacent nodes (one of which will be a user node and the other will be a movie node) and pass this concatenated vector through a linear layer to predict the user's rating of the movie
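
To make the tensor shapes in these steps concrete, here is a small sketch with random stand-in features. It uses plain PyTorch only and leaves out the graph convolutions of step 3, which do not change the shape of the node feature matrix in this setup. The sizes `n_user = 4`, `n_movie = 3` and the toy edges are assumptions made for the illustration.

```python
import torch
from torch.nn import Linear

torch.manual_seed(0)
n_user, n_movie, hidden = 4, 3, 5

x_user = torch.randn(n_user, 24)    # random stand-in for the 24 user features
x_movie = torch.randn(n_movie, 19)  # random stand-in for the 19 genre flags

# Step 1: separate encoders map both node types to the same width
user_encoder = Linear(24, hidden)
movie_encoder = Linear(19, hidden)
x_combined = torch.cat([user_encoder(x_user), movie_encoder(x_movie)])  # shape (7, 5)

# Step 2: edges as (user index, shifted movie index)
edge_user = torch.tensor([0, 1, 3])
edge_movie = torch.tensor([2, 0, 1]) + n_user

# Step 4: gather both endpoints of every edge and regress one rating per edge
regr = Linear(2 * hidden, 1)
edge_features = torch.cat([x_combined[edge_user], x_combined[edge_movie]], dim=1)  # shape (3, 10)
ratings = regr(edge_features).reshape(-1)  # shape (3,)

print(x_combined.shape, edge_features.shape, ratings.shape)
```

The final linear layer sees 10 inputs because each edge contributes 5 encoded user features and 5 encoded movie features.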



In [16]:
from torch.nn import Linear
from torch_geometric.nn import SGConv
import torch.nn.functional as F

In [17]:
class MovieNet(torch.nn.Module):
    def __init__(self,data):
        super().__init__()

        # Encoder of the user features
        self.user_encoder = Linear(data.x_user.shape[1], 5)

        # Encoder of the movie features
        self.movie_encoder = Linear(data.x_movie.shape[1], 5)

        #First convolutional layer
        self.conv1 = SGConv(5, 5)

        #Second convolutional layer
        self.conv2 = SGConv(5, 5)

        #Linear layer to predict movie rating
        self.regr = Linear(10,1)

    def convolutional_operators(self, x, edge_index):
        # Pass the encoded features through multiple convolutional layers
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x,edge_index)
        x = F.relu(x)
        return x

    def encode_features(self, x_user, x_movie):
        """
        Function that encodes the user and movie features
        """
        x_user_enc = self.user_encoder(x_user)
        x_user_enc = F.relu(x_user_enc)

        x_movie_enc = self.movie_encoder(x_movie)
        x_movie_enc = F.relu(x_movie_enc)
        return x_user_enc, x_movie_enc

    def construct_edge_index(self, data):
      
        # Training edge indices. These are the edges that will be passed to the
        # convolutional operators
        edge_index_user_train = data.edge_index_user_train
        edge_index_movie_train = data.edge_index_movie_train

        # Combine the edge indices into the usual format expected by the convolutional operators
        edge_index = torch.stack([edge_index_user_train,
                                  edge_index_movie_train])
        
        # Shift the movie indices by the number of users, so that the indices
        # in edge_index correspond to the rows of the *x_combined* feature
        # matrix built in forward (users first, then movies)
        edge_index[1] += data.n_user
        
        # Convert directed graph to undirected graph to allow  information
        # to  propagate in both direction (movie to user and user to movie)
        edge_index_undirected = torch.cat([edge_index, edge_index.flip(dims=(0,1)) ],dim=1)
        return edge_index, edge_index_undirected



    def forward(self,  data):
        # Encoded user and movie features
        x_user_enc, x_movie_enc = self.encode_features(data.x_user, data.x_movie)

        # Concatenate the user and movie encodings
        x_combined = torch.cat([x_user_enc,x_movie_enc])

        edge_index, edge_index_undirected = self.construct_edge_index(data)

        x = self.convolutional_operators(x_combined, edge_index_undirected)
        # Extract the features corresponding to user nodes 
        # and features corresponding to movie nodes using the indices in the 
        # edge list
        user_feature_i = x[edge_index[0]]
        movie_feature_j = x[edge_index[1]] 
        
        x = torch.cat([user_feature_i,movie_feature_j],dim=1)
        y = self.regr(x)
        return y.reshape(-1)

    def predict(self, edge_index_user, edge_index_movie, data):
        """
        Similar function to forward, except that the predictions are made on 
        the specified user/movie pairs
        """
        x_user_enc, x_movie_enc = self.encode_features(data.x_user, data.x_movie)
        x_combined = torch.cat([x_user_enc,x_movie_enc])
        edge_index, edge_index_undirected = self.construct_edge_index(data)
        
        x = self.convolutional_operators(x_combined, edge_index_undirected)

        user_feature_i = x[edge_index_user]
        movie_feature_j = x[edge_index_movie+ data.n_user] 
        
        x = torch.cat([user_feature_i,movie_feature_j],dim=1)
        y = self.regr(x)
        return  y.reshape(-1)
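
The call `edge_index.flip(dims=(0, 1))` in `construct_edge_index` may look cryptic at first. The sketch below applies it to three toy edges (values chosen for illustration) to show that it produces the reversed edges, listed in reverse order.

```python
import torch

# Three directed user -> movie edges, with movie ids already shifted
edge_index = torch.tensor([[0, 5, 4],
                           [6, 7, 8]])

# Flipping dim 0 swaps the source and target rows;
# flipping dim 1 only reverses the order in which the edges are listed.
reversed_edges = edge_index.flip(dims=(0, 1))
edge_index_undirected = torch.cat([edge_index, reversed_edges], dim=1)

print(edge_index_undirected)
# tensor([[0, 5, 4, 8, 7, 6],
#         [6, 7, 8, 4, 5, 0]])
```

Every original edge, e.g. 0 → 6, now has a counterpart 6 → 0, so the convolutions can pass information from movies to users as well as from users to movies.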



# Training and evaluation

Finally we train and evaluate the prediction capability of the Graph Neural Network on the 80/20 split defined earlier. 


In [18]:
model = MovieNet(data).to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01,
                             weight_decay=0.001)

for epoch in range(5000):

  model.train()
  optimizer.zero_grad()

  y_pred_train = model(data)

  # Compute mean-square error
  mse_train = torch.mean((y_pred_train-data.edge_ratings_train)**2)
  mse_train.backward()
  optimizer.step()

  if epoch % 200 == 0:
    model.eval()

    # No gradients are needed for evaluation
    with torch.no_grad():
      y_pred_test = model.predict(data.edge_index_user_test,
                                  data.edge_index_movie_test,
                                  data)
      mse_test = torch.mean((y_pred_test - data.edge_ratings_test)**2)

    print(" Epoch: {}, RMSE Train: {}, RMSE Test: {}".format(
        epoch, torch.sqrt(mse_train), torch.sqrt(mse_test)))

 Epoch: 0, RMSE Train: 3.9864938259124756, RMSE Test: 3.905395984649658
 Epoch: 200, RMSE Train: 1.094533920288086, RMSE Test: 1.0932884216308594
 Epoch: 400, RMSE Train: 1.0730518102645874, RMSE Test: 1.0746990442276
 Epoch: 600, RMSE Train: 1.0493640899658203, RMSE Test: 1.051659107208252
 Epoch: 800, RMSE Train: 1.036614179611206, RMSE Test: 1.0430688858032227
 Epoch: 1000, RMSE Train: 1.0335804224014282, RMSE Test: 1.0419386625289917
 Epoch: 1200, RMSE Train: 1.0314987897872925, RMSE Test: 1.0406438112258911
 Epoch: 1400, RMSE Train: 1.0297006368637085, RMSE Test: 1.0396449565887451
 Epoch: 1600, RMSE Train: 1.0284234285354614, RMSE Test: 1.0390608310699463
 Epoch: 1800, RMSE Train: 1.025988221168518, RMSE Test: 1.0374845266342163
 Epoch: 2000, RMSE Train: 1.0233066082000732, RMSE Test: 1.034857153892517
 Epoch: 2200, RMSE Train: 1.0219359397888184, RMSE Test: 1.0333436727523804
 Epoch: 2400, RMSE Train: 1.0214073657989502, RMSE Test: 1.032547950744629
 Epoch: 2600, RMSE Train: 1.0


## Optional Exercise
Try to obtain a test root-mean-square error (RMSE) below 1.0 by modifying the GNN defined above, e.g. by adding a dropout layer or by using another convolutional operator.
For reference, a list of RMSE values obtained by classical ML methods on MovieLens 100K can be found at http://surpriselib.com/. Results of GNN-based approaches can be found at https://paperswithcode.com/sota/collaborative-filtering-on-movielens-100k.

# References

[Heterogeneous Graph Neural Network](https://dl.acm.org/doi/pdf/10.1145/3292500.3330961)

