<a href="https://colab.research.google.com/github/SytzeAndr/NGCF_RP32/blob/master/NGCF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import random as rd
from google.colab import drive

**File loading**

Here we use the Google Drive mount point to load files. For this to work, note the following:

*   The first time you execute this cell, it prints a link. Follow it and give Colab permission to access your Google Drive.
*   Make sure that the data is located in the folder `RP_data`, which should be in the root of your Drive.

In [3]:
drive.mount('/content/drive')
data_path = './drive/My Drive/RP_data'

Mounted at /content/drive
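
After mounting, it can help to confirm that the expected files are actually present before loading them. The helper below is only a sketch: the name `missing_data_files` is hypothetical, and the file names `train.txt` and `test.txt` are taken from what the `Data` class in this notebook opens.

```python
import os

def missing_data_files(data_path, names=("train.txt", "test.txt")):
    """Return the expected file names that are not present in data_path."""
    return [n for n in names if not os.path.isfile(os.path.join(data_path, n))]
```

Calling `missing_data_files(data_path)` should return an empty list when the `RP_data` folder is set up as described above.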


**Data class**

The Data class is a utility class that helps with loading the data, computing useful properties of the data, and creating sample batches.
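
As a small illustration of the input format the class expects, each line of `train.txt` and `test.txt` is read as a user ID followed by the IDs of the items that user interacted with, separated by spaces. The function name `parse_interaction_line` is hypothetical and only serves this sketch.

```python
def parse_interaction_line(line):
    """Split 'uid item item ...' into (uid, [items]); return None for blank lines."""
    tokens = line.split()
    if not tokens:
        return None
    ids = [int(t) for t in tokens]
    return ids[0], ids[1:]

print(parse_interaction_line("0 12 7 31\n"))  # (0, [12, 7, 31])
```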

In [0]:
class Data(object):
  def __init__(self, path, batch_size):
    self.path = path
    self.batch_size = batch_size

    train_file = path + '/train.txt'
    test_file = path + '/test.txt'

    self.n_train = 0
    self.n_test = 0
    self.exist_users = []
    self.train_items = {}
    self.test_items = {}

    with open(train_file) as f:
      for l in f.readlines():
        # strip first, so that blank lines are actually skipped
        l = l.strip()
        if len(l) > 0:
          # split() without arguments tolerates repeated or trailing spaces
          items = [int(i) for i in l.split()]
          uid, train_items = items[0], items[1:]
          self.exist_users.append(uid)
          self.train_items[uid] = train_items
          self.n_train += len(train_items)

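    with open(test_file) as f:
      for l in f.readlines():
        l = l.strip()
        if len(l) > 0:
          try:
            items = [int(i) for i in l.split()]
            uid, test_items = items[0], items[1:]
            # users are registered from the training file only; appending here
            # would duplicate ids in exist_users and could add users that
            # sample() cannot look up in train_items
            self.test_items[uid] = test_items
            self.n_test += len(test_items)
          except ValueError:
            # skip malformed lines
            continue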

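    # skip users with an empty item list, for which max() would fail
    train_max_item = max([max(items) for items in self.train_items.values() if items])
    test_max_item = max([max(items) for items in self.test_items.values() if items])
    # assuming 0-based ids (the negative sampler draws from low=0),
    # the counts are the largest id plus one
    self.n_items = max(train_max_item, test_max_item) + 1
    self.n_users = max(self.exist_users + list(self.test_items.keys())) + 1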
  

  def sample(self):
    if self.batch_size <= self.n_users:
      users = rd.sample(self.exist_users, self.batch_size)
    else:
      users = [rd.choice(self.exist_users) for _ in range(self.batch_size)]


    def sample_pos_items_for_u(u, num):
      pos_items = self.train_items[u]
      n_pos_items = len(pos_items)
      pos_batch = []
      while True:
        if len(pos_batch) == num: break
        pos_id = np.random.randint(low=0, high=n_pos_items, size=1)[0]
        pos_i_id = pos_items[pos_id]

        if pos_i_id not in pos_batch:
          pos_batch.append(pos_i_id)
      return pos_batch


    def sample_neg_items_for_u(u, num):
      neg_batch = []
      while True:
        if len(neg_batch) == num: break
        neg_id = np.random.randint(low=0, high=self.n_items, size=1)[0]
        if neg_id not in self.train_items[u] and neg_id not in neg_batch:
          neg_batch.append(neg_id)
      return neg_batch
    

    pos_items, neg_items = [], []
    for u in users:
      pos_items += sample_pos_items_for_u(u, 1)
      neg_items += sample_neg_items_for_u(u, 1)

    return users, pos_items, neg_items

data = Data(data_path, 10)

In [7]:
data.sample()

([23323, 42808, 25158, 13144, 34845, 46601, 15128, 3246, 42544, 43562],
 [66111, 5260, 58472, 47765, 64092, 58786, 19457, 8990, 52296, 58227],
 [38466, 49956, 45615, 25744, 2817, 71553, 42324, 56610, 56784, 90067])

# The steps to take
As a rough outline, training a Neural Graph Collaborative Filtering (NGCF) system involves the following steps.
1. Create a user-item interaction graph from our data.
2. Construct messages by implementing the message-passing formula, which defines the relation between users and items. Message aggregation then combines these messages into a representation for each user, which defines our first-order propagation.
3. Use the first-order representations to create higher-order representations.
4. Use the embeddings obtained from our L layers to train a neural network, using theory about graph neural networks.
5. Make predictions on our test sets and measure the recall and normalized discounted cumulative gain (NDCG), which should produce a table.
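
To make the metrics in step 5 concrete, here is a minimal sketch of recall@k and NDCG@k for a single user. It assumes binary relevance (an item is relevant if it appears in the user's test items) and the common log2 discount; the function names are hypothetical, and the evaluation code used in the paper may differ in details.

```python
import numpy as np

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for i in ranked_items[:k] if i in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant_items, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    relevant = set(relevant_items)
    gains = np.array([1.0 if i in relevant else 0.0 for i in ranked_items[:k]])
    # position p (1-based) is discounted by 1 / log2(p + 1)
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(np.sum(gains * discounts))
    # the ideal ranking puts all relevant items (at most k) first
    ideal_hits = min(len(relevant), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these values over all test users for a fixed k gives one row of the results table.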


# step 1
The initial embedding is already performed in the data sets, so we can assume that we already have our embedding table. The paper explains that the NGCF framework refines these embeddings by propagating them on the user-item interaction graph.

This implies that we need to construct an interaction graph and update its weights according to the message construction and message aggregation functions.
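
One possible way to represent the interaction graph is an edge list over a single node set, where users keep their IDs and item IDs are shifted by the number of users. The sketch below uses only NumPy; the function name `build_edge_index` and the node numbering are assumptions made for illustration. The result has the `[2, num_edges]` layout that PyG uses for `edge_index`, so it could be converted with `torch.from_numpy` later.

```python
import numpy as np

def build_edge_index(train_items, n_users):
    """Return a [2, 2*E] integer array of edges for the bipartite graph.

    Users keep ids 0..n_users-1; item i becomes node n_users + i.
    Each interaction is stored in both directions (user->item, item->user).
    """
    src, dst = [], []
    for u, items in train_items.items():
        for i in items:
            src.append(u)
            dst.append(n_users + i)
    src = np.array(src, dtype=np.int64)
    dst = np.array(dst, dtype=np.int64)
    return np.stack([np.concatenate([src, dst]), np.concatenate([dst, src])])

# toy example: user 0 interacted with items 0 and 1, user 1 with item 1
print(build_edge_index({0: [0, 1], 1: [1]}, n_users=2))
```

With the `Data` class from this notebook, the call would be `build_edge_index(data.train_items, data.n_users)`.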

PyTorch and PyTorch Geometric (PyG) let us construct a graph neural network and perform various operations on it, so we plan to use this framework.

## GCN with torch_geometric (PyG)
Some of its steps are described in this blog post:
https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

There is also a Google Colab notebook that trains a GCN to identify 'spammers':
https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/Deep_GCN_Spam.ipynb#scrollTo=_4_eVOI2M4Uo

First we need to install torch_geometric. This is a geometric deep learning extension library for PyTorch.

In [1]:
!pip install torch==1.2.0 torchvision==0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [3]:
import torch
# we use torch version 1.2.0 instead of the latest due to dependency errors
print(torch.__version__)

1.2.0


In [8]:
# these were the corresponding versions for torch-geometric 1.3.2, released 4 oct 2019, which runs on torch 1.2.0.
!pip install torch-scatter==1.3.1
!pip install torch-sparse==0.4.0
!pip install torch-cluster==1.4.4
!pip install torch-spline-conv==1.1.0
!pip install torch-geometric==1.3.2

Collecting torch-scatter==1.3.1
  Downloading https://files.pythonhosted.org/packages/35/d4/750403a8aa32cdb3d2d05849c6a10e4e0604de5e0cc94b81a0d0d69a75f3/torch_scatter-1.3.1.tar.gz
Building wheels for collected packages: torch-scatter
  Building wheel for torch-scatter (setup.py) ... done
  Created wheel for torch-scatter: filename=torch_scatter-1.3.1-cp36-cp36m-linux_x86_64.whl size=2726045 sha256=9a31666bfe71bc02e93eb5e865f671d7f2b4a1041dcfec5df6933be8451318d0
  Stored in directory: /root/.cache/pip/wheels/7f/21/0b/c42fa9353ceec5e87464599e470a03e4250ec667b4a392fa7d
Successfully built torch-scatter
Installing collected packages: torch-scatter
  Found existing installation: torch-scatter 2.0.4
    Uninstalling torch-scatter-2.0.4:
      Successfully uninstalled torch-scatter-2.0.4
Successfully installed torch-scatter-1.3.1
Collecting torch-sparse==0.4.0
  Downloading https://files.pythonhosted.org/packages/b0/0a/2ff678e0d04e524dd2cf990a6202ced8c0ffe3fe6b08e02f25cc9fd27da0/to

In [0]:
import torch
import torch_geometric

In [16]:
# verify that torch geometric is imported, should be 1.3.2
print(torch_geometric.__version__)

1.3.2


**todo: create a graph by following the steps from either the blog post or the Google Colab notebook linked earlier**

# step 2
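For each user-item pair $(u, i)$, we define the message from $i$ to $u$ as

$m_{u \leftarrow i} = \dfrac{1}{\sqrt{|N_u||N_i|}} (W_1 e_i + W_2(e_i \odot e_u))$

where $N_u$ and $N_i$ are the first-hop neighbors of $u$ and $i$, $e_u$ and $e_i$ are their embeddings, $W_1$ and $W_2$ are trainable weight matrices, and $\odot$ denotes the element-wise product.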
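
The formula can be checked on a toy example with plain NumPy. This is only a sketch for a single user-item pair: the function name `message` and the argument names are hypothetical, and `n_u` and `n_i` stand for the neighborhood sizes $|N_u|$ and $|N_i|$.

```python
import numpy as np

def message(e_u, e_i, W1, W2, n_u, n_i):
    """Message from item i to user u for one user-item pair."""
    norm = 1.0 / np.sqrt(n_u * n_i)
    # e_i * e_u is the element-wise product, written as \odot in the formula
    return norm * (W1 @ e_i + W2 @ (e_i * e_u))

e_u = np.array([1.0, 2.0])
e_i = np.array([3.0, 4.0])
W1 = np.eye(2)  # identity matrices keep the arithmetic easy to follow
W2 = np.eye(2)
print(message(e_u, e_i, W1, W2, n_u=4, n_i=1))  # [3. 6.]
```

Here the normalization is $1/\sqrt{4 \cdot 1} = 0.5$, and the bracketed term is $[3, 4] + [3, 8] = [6, 12]$.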