<a href="https://colab.research.google.com/github/SytzeAndr/NGCF_RP32/blob/master/NGCF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import random as rd
from google.colab import drive

**File loading**

Here we use the Google Drive mountpoint to load files. For this to work, note the following:



*   The first time you execute this, it will provide a link, which you need to follow and give permission for Colab to access your Google Drive.
*   Make sure that the data is located in the folder `RP_data` which should be located in the root of your Drive.

In [17]:
drive.mount('/content/drive')
data_path = './drive/My Drive/RP_data'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**DataReader class**

The DataReader class is a utility class that will help loading the data, compute useful properties of the data and create sample batches.

In [0]:
class DataReader(object):
  def __init__(self, path='./drive/My Drive/RP_data', batch_size=100):
    self.path = path
    self.batch_size = batch_size

    train_file = path + '/train.txt'
    test_file = path + '/test.txt'

    self.n_train = 0
    self.n_test = 0
    self.exist_users = []
    self.train_items = {}
    self.test_items = {}

    with open(train_file) as f:
      for l in f.readlines():
        if len(l) > 0:
          l = l.strip('\n')
          items = [int(i) for i in l.split(' ')]
          uid, train_items = items[0], items[1:]
          self.exist_users.append(uid)
          self.train_items[uid] = train_items
          self.n_train += len(train_items)

    with open(test_file) as f:
      for l in f.readlines():
        if len(l) > 0:
          l = l.strip('\n')
          try:
            items = [int(i) for i in l.split(' ')]
            uid, test_items = items[0], items[1:]
            self.exist_users.append(uid)
            self.test_items[uid] = test_items
            self.n_test += len(test_items)
          except Exception:
            continue

    train_max_item = max([max(items) for items in list(self.train_items.values())])
    test_max_item = max([max(items) for items in list(self.test_items.values())])
    self.n_items = max(train_max_item, test_max_item)
    self.n_users = max(self.exist_users)
  

  def sample(self):
    if self.batch_size <= self.n_users:
      users = rd.sample(self.exist_users, self.batch_size)
    else:
      users = [rd.choice(self.exist_users) for _ in range(self.batch_size)]


    def sample_pos_items_for_u(u, num):
      pos_items = self.train_items[u]
      n_pos_items = len(pos_items)
      pos_batch = []
      while True:
        if len(pos_batch) == num: break
        pos_id = np.random.randint(low=0, high=n_pos_items, size=1)[0]
        pos_i_id = pos_items[pos_id]

        if pos_i_id not in pos_batch:
          pos_batch.append(pos_i_id)
      return pos_batch


    def sample_neg_items_for_u(u, num):
      neg_batch = []
      while True:
        if len(neg_batch) == num: break
        neg_id = np.random.randint(low=0, high=self.n_items, size=1)[0]
        if neg_id not in self.train_items[u] and neg_id not in neg_batch:
          neg_batch.append(neg_id)
      return neg_batch
    

    pos_items, neg_items = [], []
    for u in users:
      pos_items += sample_pos_items_for_u(u, 1)
      neg_items += sample_neg_items_for_u(u, 1)

    return users, pos_items, neg_items


In [9]:
# small test to verify whether we can load data
dataReader = DataReader(data_path, 10)
dataReader.sample()

([50482, 6047, 46408, 6515, 13980, 19365, 22653, 51054, 33271, 51850],
 [57054, 51777, 32403, 924, 3076, 17585, 10362, 82674, 52365, 48920],
 [49145, 89035, 15884, 68525, 82618, 70079, 77642, 78705, 15443, 31528])

#The steps to take
As a rough outline, we can sketch the steps to be taken to train a Neural Graph Collaborative Filtering system as follows. 
1. Create a user-item interaction graph from our data.
2. Perform message constructing by implementing the message passing formula, which defines the relations between users and items. We then use message aggregation to create a representation for each user, which defines our first-order propagation. 
3. Use the representations from the first order-representation to create higher order representations.
4. Use embeddings obtained from our L layers to train a neural network, using theory about graph neural networks.
5. Make a prediction on our test sets and measure the recall and normalized discounted cumulative gain (ndcg), which should produce a table.


# step 1
The initial embedding is already performed in the data sets and thus we can consider to already have our embedding table. In the paper's work, it is explained that in their NGCF framework the embeddings are refined by propagating them on the user-item interaction graph. 

This implies that we need to construct an interaction graph, and update its weights according to the message construction/message aggregation functions.

By using PyTorch and PyTorch Geometric (PyG) we are able to construct a graph neural network and perform various operations. We are planning to use this framework. 

## GCN with torch_geometric (PyG)
Some of its steps are described in this blog post:
https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

There is also a google colab which trains a GCN to identify 'spammers'
https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/Deep_GCN_Spam.ipynb#scrollTo=_4_eVOI2M4Uo

**Importing torch_geometric 1.3.2**

`torch_geometric` is a geometric deep learning extension library for PyTorch. We don't use the latest version (loaded by default by google colab) due to inconsistencies with PyTorch. This also means we have to downgrade a few other packages aswell.

In [48]:
pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [10]:
import torch
# we use torch version 1.2.0 instead of the latest due to dependency errors
print(torch.__version__)

1.2.0


In [7]:
# these were the corresponding versions for torch-geometric 1.3.2, released 4 oct 2019, which runs on torch 1.2.0.
# grab some coffee, this might take a while
!pip install torch-scatter==1.3.1
!pip install torch-sparse==0.4.0
!pip install torch-cluster==1.4.4
!pip install torch-spline-conv==1.1.0
!pip install torch-geometric==1.3.2



In [0]:
import torch
import torch_geometric


In [9]:
# verify that torch geometric is imported, should be 1.3.2
print(torch_geometric.__version__)

1.3.2


**Creating the interaction graph**

First we create an interaction graph. The Data object from torch_geometric represents a graph structure, and as such, we should create one given our data.

In [0]:
from torch_geometric.data import Data

# get data
batch_length = 10
dataReader = DataReader()
edges = []
# define an edge for every user and item
for user in range(dataReader.n_users):
  pos_items = dataReader.train_items[user]
  for item in pos_items:
    # increase the item id by the number of users to distinct the nodes from users
    edges.append([user, item + dataReader.n_users])
    edges.append([item + dataReader.n_users, user])

# following example from https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

# we can extend x with any number of features we want.
# for now, we only have it include its corresponding userid or itemid, but we might want to extend this
x = torch.tensor(list(range(dataReader.n_users)) + list(range(dataReader.n_items)), dtype=torch.long)

# data is a representation of the directed graph
data = Data(x=x, edge_index=edge_index)



Data is a representation of our interaction graph

In [47]:
# torch_geometric.data.Data provides a number of utility functions
# these prints are a check to verify if our graph seems valid
print(data.is_undirected())
print(data.is_directed())
print(data.num_edges)
print(data.num_nodes)
print(data.contains_isolated_nodes())
print(data.contains_self_loops())

True
False
4761412
144240
False
False


# step 2
For each user-item pair (u,i), we define the message from i to u as

$m_{u \leftarrow i} = \dfrac{1}{\sqrt{|N_u||N_i|}} (W_1 e_i + W_2(e_i \odot e_u))$

Where $N_u, N_i$ are the first hop neighbors of $u$ and $i$, $W_1, W_2$ are trainable weight matrices to distill useful information for propagation, and $e_i$ and $e_u$ are embeddings of the users and items.

Message passing, constructing and aggregation aswell as training neural networks is all included in the `torch_geometric` package.


https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html

In [0]:
from torch_geometric.nn import MessagePassing

**todo: it is not entirely clear for me how to set the initial embeddings from the given data. according to the paper the "embedding table serves as an initial state for user embeddings and item embeddings to be optimized in an end-to-end fashion"**