<a href="https://colab.research.google.com/github/SytzeAndr/NGCF_RP32/blob/master/NGCF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import random as rd
from google.colab import drive

**File loading**

Here we use the Google Drive mountpoint to load files. For this to work, note the following:



*   The first time you execute this, it provides a link that you need to follow to give Colab permission to access your Google Drive.
*   Make sure that the data is in a folder named `RP_data` in the root of your Drive.

In [17]:
drive.mount('/content/drive')
data_path = './drive/My Drive/RP_data'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
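
Before loading anything, it can help to confirm that the expected files are actually present in the folder. The following is a minimal sketch: the helper name `missing_data_files` is hypothetical, and the file names `train.txt` and `test.txt` are taken from the Data class below.

```python
import os

def missing_data_files(data_path, names=('train.txt', 'test.txt')):
  """Return the expected file names that are not present in data_path."""
  return [n for n in names if not os.path.isfile(os.path.join(data_path, n))]

# An empty list means both files were found.
print(missing_data_files('./drive/My Drive/RP_data'))
```

If the list is not empty, check the folder name and whether the Drive mount succeeded.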


**Data class**

The Data class is a utility class that loads the data, computes useful properties of it (interaction counts and the largest user and item IDs) and creates sample batches.

In [0]:
class Data(object):
  def __init__(self, path='./drive/My Drive/RP_data', batch_size=100):
    self.path = path
    self.batch_size = batch_size

    train_file = path + '/train.txt'
    test_file = path + '/test.txt'

    self.n_train = 0
    self.n_test = 0
    self.exist_users = []
    self.train_items = {}
    self.test_items = {}

    with open(train_file) as f:
      for l in f.readlines():
        if len(l) > 0:
          l = l.strip('\n')
          items = [int(i) for i in l.split(' ')]
          uid, train_items = items[0], items[1:]
          self.exist_users.append(uid)
          self.train_items[uid] = train_items
          self.n_train += len(train_items)

    with open(test_file) as f:
      for l in f.readlines():
        if len(l) > 0:
          l = l.strip('\n')
          try:
            items = [int(i) for i in l.split(' ')]
            uid, test_items = items[0], items[1:]
            # users already seen in the train file are not appended again
            if uid not in self.train_items:
              self.exist_users.append(uid)
            self.test_items[uid] = test_items
            self.n_test += len(test_items)
          except Exception:
            continue

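    # Item and user IDs are 0-based, so n_items and n_users hold the largest ID seen;
    # an array indexed by ID therefore needs n_items + 1 (or n_users + 1) rows.
    train_max_item = max(max(items) for items in self.train_items.values())
    test_max_item = max(max(items) for items in self.test_items.values())
    self.n_items = max(train_max_item, test_max_item)
    self.n_users = max(self.exist_users)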
  

  def sample(self):
    if self.batch_size <= self.n_users:
      users = rd.sample(self.exist_users, self.batch_size)
    else:
      users = [rd.choice(self.exist_users) for _ in range(self.batch_size)]


    def sample_pos_items_for_u(u, num):
      pos_items = self.train_items[u]
      n_pos_items = len(pos_items)
      pos_batch = []
      while True:
        if len(pos_batch) == num: break
        pos_id = np.random.randint(low=0, high=n_pos_items, size=1)[0]
        pos_i_id = pos_items[pos_id]

        if pos_i_id not in pos_batch:
          pos_batch.append(pos_i_id)
      return pos_batch


    def sample_neg_items_for_u(u, num):
      neg_batch = []
      while True:
        if len(neg_batch) == num: break
        # n_items is the largest item ID and `high` is exclusive, so add 1 to include it
        neg_id = np.random.randint(low=0, high=self.n_items + 1, size=1)[0]
        if neg_id not in self.train_items[u] and neg_id not in neg_batch:
          neg_batch.append(neg_id)
      return neg_batch
    

    pos_items, neg_items = [], []
    for u in users:
      pos_items += sample_pos_items_for_u(u, 1)
      neg_items += sample_neg_items_for_u(u, 1)

    return users, pos_items, neg_items


In [9]:
# small test to verify whether we can load data
data = Data(data_path, 10)
data.sample()

([50482, 6047, 46408, 6515, 13980, 19365, 22653, 51054, 33271, 51850],
 [57054, 51777, 32403, 924, 3076, 17585, 10362, 82674, 52365, 48920],
 [49145, 89035, 15884, 68525, 82618, 70079, 77642, 78705, 15443, 31528])
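
The loader assumes that every line of `train.txt` and `test.txt` holds a user ID followed by that user's item IDs, separated by spaces. As a rough illustration of this format, the sketch below parses a few made-up lines held in memory; `parse_interactions` and `toy_lines` are hypothetical names and not part of the Data class.

```python
def parse_interactions(lines):
  """Map each user ID to its list of item IDs; blank lines are skipped."""
  interactions = {}
  for line in lines:
    tokens = line.split()  # tolerates trailing spaces and newlines
    if not tokens:
      continue
    uid, items = int(tokens[0]), [int(t) for t in tokens[1:]]
    interactions[uid] = items
  return interactions

# Made-up example lines in the "uid item item ..." format.
toy_lines = ["0 0 1 2\n", "1 2 3\n", "\n"]
toy_train = parse_interactions(toy_lines)
print(toy_train)  # {0: [0, 1, 2], 1: [2, 3]}
```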

# The steps to take
As a rough outline, training a Neural Graph Collaborative Filtering (NGCF) system involves the following steps.
1. Create a user-item interaction graph from our data.
2. Construct messages by implementing the message-passing formula, which defines the relations between users and items. Message aggregation then combines these messages into a representation for each user; together these define the first-order propagation.
3. Use the first-order representations to create higher-order representations.
4. Use the embeddings obtained from our L layers to train a neural network, using theory about graph neural networks.
5. Make predictions on our test set and measure the recall and normalized discounted cumulative gain (NDCG), to be reported in a table.
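
To make the two metrics in step 5 concrete, here is a minimal sketch of recall@k and NDCG@k for a single user, assuming binary relevance (an item is either in the user's test set or not). The function names and the toy ranking are made up for illustration; the exact evaluation protocol used in the paper may differ in its details.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
  """Fraction of the relevant items that appear in the top-k ranking."""
  relevant = set(relevant_items)
  if not relevant:
    return 0.0
  hits = sum(1 for item in ranked_items[:k] if item in relevant)
  return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant_items, k):
  """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
  relevant = set(relevant_items)
  dcg = sum(1.0 / math.log2(rank + 2)
            for rank, item in enumerate(ranked_items[:k]) if item in relevant)
  ideal_hits = min(len(relevant), k)
  idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
  return dcg / idcg if idcg > 0 else 0.0

# Toy example: the model ranks four items, three items are relevant.
ranking, test_items = [3, 1, 7, 5], [1, 5, 9]
print(recall_at_k(ranking, test_items, k=3))
print(ndcg_at_k(ranking, test_items, k=3))
```

Averaging these per-user values over all test users gives the numbers that would go into the table.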


# Step 1
The initial embedding is already performed in the data sets, so we can treat the embedding table as given. The paper explains that NGCF refines these embeddings by propagating them over the user-item interaction graph.

This implies that we need to construct an interaction graph and update its weights according to the message construction and message aggregation functions.

PyTorch and PyTorch Geometric (PyG) let us construct a graph neural network and perform the required operations on it, so we plan to use this framework.

## GCN with torch_geometric (PyG)
Some of its steps are described in this blog post:
https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8

There is also a Google Colab notebook that trains a GCN to identify 'spammers':
https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/Deep_GCN_Spam.ipynb#scrollTo=_4_eVOI2M4Uo

First we need to install torch_geometric. This is a geometric deep learning extension library for PyTorch.

In [10]:
pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch===1.2.0
  Downloading https://files.pythonhosted.org/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
Collecting torchvision===0.4.0
  Downloading https://files.pythonhosted.org/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
  Found existing installation: torchvision 0.5.0
    Uninstalling torchvision-0.5.0:
      Successfully uninstalled torchvision-0.5.0
Successfully installed torch-1.2.0 torchvision-0.4.0


In [10]:
import torch
# we use torch version 1.2.0 instead of the latest due to dependency errors
print(torch.__version__)

1.2.0


In [7]:
# These are the companion-package versions matching torch-geometric 1.3.2 (released 4 Oct 2019), which runs on torch 1.2.0.
# Grab some coffee, this might take a while.
!pip install torch-scatter==1.3.1
!pip install torch-sparse==0.4.0
!pip install torch-cluster==1.4.4
!pip install torch-spline-conv==1.1.0
!pip install torch-geometric==1.3.2



In [0]:
import torch
import torch_geometric

In [9]:
# verify that torch geometric is imported, should be 1.3.2
print(torch_geometric.__version__)

1.3.2


First we create an interaction graph. 

*(todo: this is just a simple design that probably needs to be tweaked such that it matches the format of the torch_geometric library)*

In [31]:
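# First we want to create an interaction graph.

# Load the data with the default path and batch size.
data = Data()

edges = []
# Add one (user, item) edge for every interaction in the training set.
# n_users holds the largest user ID, so this loop visits IDs 0 .. n_users - 1.
for user in range(data.n_users):
  pos_items = data.train_items[user]
  for item in pos_items:
    edges.append((user, item))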

print(len(edges))
print(data.n_users)
print(data.n_items)
print(edges[0:100])

2380706
52642
91598
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16), (0, 17), (0, 18), (0, 19), (0, 20), (0, 21), (0, 22), (0, 23), (0, 24), (0, 25), (0, 26), (0, 27), (0, 28), (0, 29), (0, 30), (0, 31), (0, 32), (0, 33), (0, 34), (0, 35), (0, 36), (0, 37), (0, 38), (0, 39), (0, 40), (0, 41), (0, 42), (0, 43), (0, 44), (0, 45), (0, 46), (0, 47), (0, 48), (0, 49), (0, 50), (0, 51), (0, 52), (1, 53), (1, 54), (1, 49), (1, 55), (1, 56), (1, 57), (1, 58), (1, 59), (1, 60), (1, 61), (1, 62), (1, 63), (1, 64), (1, 65), (1, 66), (1, 67), (1, 68), (1, 69), (1, 70), (1, 71), (1, 72), (1, 73), (1, 74), (1, 26), (1, 75), (1, 76), (1, 77), (1, 78), (1, 79), (1, 80), (1, 81), (1, 82), (1, 83), (1, 14), (1, 84), (1, 85), (1, 86), (1, 87), (1, 88), (1, 89), (1, 90), (1, 91), (1, 92), (1, 93), (1, 94), (1, 95), (1, 96)]
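
In the edge list above, user IDs and item IDs both start at 0, so user 0 and item 0 would be treated as the same node if the list were used directly. One possible way to address the todo, sketched below with NumPy, is to shift the item IDs past the user IDs and store every edge in both directions in a `[2, num_edges]` array, which is the layout PyG uses for `edge_index`. The helper `build_edge_index` and the toy edges are hypothetical; for the real data, `n_user_nodes` would be the number of user nodes.

```python
import numpy as np

def build_edge_index(edge_list, n_user_nodes):
  """Return a [2, 2 * E] array with user->item and item->user edges."""
  edges = np.asarray(edge_list, dtype=np.int64).reshape(-1, 2)
  users = edges[:, 0]
  items = edges[:, 1] + n_user_nodes  # shift item IDs past the user IDs
  src = np.concatenate([users, items])
  dst = np.concatenate([items, users])
  return np.stack([src, dst])

# Toy graph: 2 users, 2 items, 3 interactions.
toy_edges = [(0, 0), (0, 1), (1, 1)]
edge_index = build_edge_index(toy_edges, n_user_nodes=2)
print(edge_index.shape)  # (2, 6)
```

Assuming torch is installed, `torch.from_numpy(edge_index)` would turn this array into a tensor that can be passed as `edge_index` to `torch_geometric.data.Data`.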


# Step 2
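For each user-item pair $(u, i)$, we define the message from $i$ to $u$ as

$m_{u \leftarrow i} = \dfrac{1}{\sqrt{|N_u||N_i|}} (W_1 e_i + W_2(e_i \odot e_u))$

where $N_u$ and $N_i$ are the first-hop neighbors of $u$ and $i$, $e_u$ and $e_i$ are their embeddings, $W_1$ and $W_2$ are trainable weight matrices, and $\odot$ denotes the element-wise product.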
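
As a sanity check of the formula, the sketch below computes a single message with NumPy. Everything in it is an illustrative assumption: the function name `ngcf_message`, the embedding size of 4, the random embeddings, and the identity matrices standing in for $W_1$ and $W_2$ (in a trained model these would be learned).

```python
import numpy as np

def ngcf_message(e_u, e_i, W1, W2, n_u, n_i):
  """Message from item i to user u; n_u and n_i are the neighbor counts."""
  norm = 1.0 / np.sqrt(n_u * n_i)
  return norm * (W1 @ e_i + W2 @ (e_i * e_u))

d = 4  # toy embedding size
rng = np.random.RandomState(0)
e_u, e_i = rng.randn(d), rng.randn(d)
W1, W2 = np.eye(d), np.eye(d)  # stand-ins for the trainable weights

# User u has 2 neighbors and item i has 8, so the norm is 1 / sqrt(16).
m = ngcf_message(e_u, e_i, W1, W2, n_u=2, n_i=8)
print(m.shape)  # (4,)
```

With identity weights the message reduces to $0.25\,(e_i + e_i \odot e_u)$, which makes the effect of the normalization term easy to verify by hand.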

In [0]:
for userid in 