# LightGCN for Recommendation
This notebook demonstrates how to train LightGCN (https://arxiv.org/abs/2002.02126) for recommendation with TigerGraph. The model is built on PyTorch Geometric's `LGConv` layer. We train it on the LastFM dataset from the PyG datasets, with TigerGraph as the data store. The dataset contains 1,892 users, 17,632 items, and 92,834 edges between users and items, and it is already split into train and test sets. The evaluation metric is recall@k.


## Data Processing
Here we assume the dataset is already ingested into the TigerGraph database. If not, please refer to the example on data ingestion first.

Each edge has the attributes `is_train` and `is_test`, which mark the split it belongs to. Each node has a `role` attribute: **0 for a user, 1 for an item**.

## Connect to TigerGraph

In [16]:
from pyTigerGraph import TigerGraphConnection
import torch

conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database server's
    graphname="LastFM",
    username="tigergraph",
    password="tigergraph",
    useCert=False
)

In [17]:
conn.getVertexCount("*")

{'User': 1892, 'Item': 17632}

In [18]:
conn.getEdgeCount('*')

{'Interact': 92834}

## Train on whole graph
Here we use the full graph for recommendation. This will NOT work when the graph is very large; see the Stochastic Training section for huge graphs. We still include this example for illustration purposes.

We load the whole graph from TigerGraph.


In [19]:
graph_loader = conn.gds.graphLoader(
    num_batches=1,
    v_extra_feats=["role", 'id'],
    e_extra_feats=["is_train","is_test"],
    output_format = "PyG")

**Note**: After loading with the graph loader, the node IDs are reindexed. This dataset doesn't contain node 

In [20]:
data = graph_loader.data

In [21]:
data

Data(edge_index=[2, 185668], is_train=[185668], is_test=[185668], role=[19524], id=[19524])

In [22]:
train_user_item = data.edge_index[:, data.is_train]
test_user_item = data.edge_index[:, data.is_test]
train_user_item = train_user_item[:, data.role[train_user_item[0]] == 0]
test_user_item = test_user_item[:, data.role[test_user_item[0]] == 0]

In [23]:
users = (data.role == 0).nonzero().squeeze().tolist()
items = (data.role == 1).nonzero().squeeze()

## Evaluation
We use recall@k to evaluate the performance. Specifically, for each user, we rank all items by their scores and then calculate the test recall@k from the top k items.
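
As a minimal illustration of the metric, the sketch below computes recall@k for a single user on toy data. The helper name `recall_at_k`, the scores, and the item indices are made up for this example and are not part of the notebook's pipeline; plain Python is used so the example runs without any dependencies.

```python
def recall_at_k(scores, test_items, train_items, k):
    """Recall@k for one user (hypothetical helper, for illustration only).

    scores: dict mapping item -> predicted score.
    Items already seen in training are removed from the ranking first.
    """
    seen = set(train_items)
    ranked = sorted(scores, key=scores.get, reverse=True)
    ranked = [item for item in ranked if item not in seen]
    hits = len(set(ranked[:k]) & set(test_items))
    return hits / len(test_items)


# Toy data: six items scored for a single user.
scores = {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.3, 4: 0.7, 5: 0.2}
train_items = [0]    # interactions already used for training
test_items = [2, 3]  # held-out interactions

for k in (2, 3):
    print(k, recall_at_k(scores, test_items, train_items, k))
```

After item 0 is filtered out, the ranking is 2, 4, 3, 5, 1, so one of the two test items is in the top 2 (recall 0.5) and both are in the top 3 (recall 1.0).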

In [24]:
def get_user_record(user_item, users):
    user_record = {}
    for u in users:
        user_record[u] = user_item[1][user_item[0]==u].tolist()
    return user_record

In [25]:
train_user_record = get_user_record(train_user_item, users)
test_user_record = get_user_record(test_user_item, users)
# The user record is {user1:[item1, item2], user2:[], ...}

In [26]:
from collections import defaultdict

In [27]:
def recallk(embeddings, items, train_user_record, test_user_record):
    k_list = [2, 5, 10, 30]
    recall = defaultdict(list)
    items = torch.as_tensor(items, dtype=torch.long)  # accepts a list or a tensor
    items_vec = embeddings[items]
    for user, test_items in test_user_record.items():
        if len(test_items) == 0:
            continue
        user_vec = embeddings[user]
        scores = torch.sum(user_vec * items_vec, dim=1)
        scores, indices = torch.topk(scores, 200)
        predict_items = items[indices].numpy()
        # Filter out the items already in the training set
        predict_items = [i for i in predict_items if i not in train_user_record[user]]
        for k in k_list:
            num = len(set(test_items) & set(predict_items[:k]))
            recall[k].append(num / len(test_items))
    for k in k_list:
        print('recall@{}:, {}'.format(k, sum(recall[k])/len(recall[k])))

## Model

We use LightGCN as our model, with the BPR loss function. For more details, please refer to https://arxiv.org/abs/2002.02126.
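
To make the BPR loss concrete, here is a small dependency-free sketch. The function name `bpr_loss` and the toy scores are assumptions for illustration; the notebook's own version is the `decode` method of the model, which applies the same formula to tensors.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(pos - neg)) over (user, positive, negative) triples."""
    losses = []
    for pos, neg in zip(pos_scores, neg_scores):
        diff = pos - neg
        # -log(sigmoid(diff)) is the same as log(1 + exp(-diff))
        losses.append(math.log1p(math.exp(-diff)))
    return sum(losses) / len(losses)


well_ranked = bpr_loss([2.0, 3.0], [0.0, -1.0])   # positives score higher
badly_ranked = bpr_loss([0.0, -1.0], [2.0, 3.0])  # negatives score higher
untrained = bpr_loss([0.0], [0.0])                # no preference yet

print(well_ranked, badly_ranked, untrained)
```

The loss shrinks as positive items are scored above their sampled negatives. When the model has no preference, each term equals log 2 (about 0.693), which is close to the training loss printed at epoch 0 further down.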

In [28]:
import torch
import torch.nn.functional as F
from torch.nn import Embedding, ModuleList
from torch_geometric.nn import LGConv


class LightGCN(torch.nn.Module):
    def __init__(self, num_nodes, embedding_dim, num_layers, dropout, **kwargs):
        super(LightGCN, self).__init__()
        self.num_nodes = num_nodes
        self.embedding_dim = embedding_dim
        self.num_layers = num_layers
        self.alpha = torch.tensor([1.0/(num_layers + 1)] * (num_layers + 1))
        self.embedding = Embedding(num_nodes, embedding_dim)
        self.convs = torch.nn.ModuleList([LGConv(**kwargs) for _ in range(num_layers)])
        self.dropout = dropout  # stored only; forward() below does not apply dropout
        self.reset_parameters()
        
    def reset_parameters(self):
        torch.nn.init.normal_(self.embedding.weight, std=0.1)
        # torch.nn.init.xavier_uniform_(self.embedding.weight)
        for conv in self.convs:
            conv.reset_parameters()
            
    def forward(self, edge_index, nodes):
        x = self.embedding(nodes)
        out = x * self.alpha[0]
        for i in range(self.num_layers):
            x = self.convs[i](x, edge_index)
            out = out + x * self.alpha[i + 1]
        return out

    def decode(self, z, users, pos_items, neg_items):
        pos_scores = (z[users] * z[pos_items]).sum(dim=-1)
        neg_scores = (z[users] * z[neg_items]).sum(dim=-1)
        # BPR Loss
        maxi = F.logsigmoid(pos_scores - neg_scores)
        loss = -maxi.mean()
        return loss
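
In equation form, the `forward` method above can be read as follows. This is a sketch in notation chosen here, not taken from the notebook: $K$ is `num_layers`, $E^{(0)}$ is the embedding table, and $\tilde{A}$ is the normalized adjacency matrix applied by `LGConv`, assuming its default symmetric normalization.

$$
E^{(k+1)} = \tilde{A}\,E^{(k)}, \qquad \tilde{A} = D^{-1/2} A D^{-1/2}
$$

$$
Z = \sum_{k=0}^{K} \alpha_k E^{(k)}, \qquad \alpha_k = \frac{1}{K+1}
$$

With `num_layers = 2`, as in the hyperparameters used below, the initial embeddings and the outputs of the two layers each get weight 1/3.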


## Train the model using the whole graph

In [29]:
# Hyperparameters
hp = {"embedding_dim": 64, "num_layers": 2, "dropout": 0.6, "lr": 0.001, "l2_penalty": 1e-5}

In [30]:
num_nodes = sum(conn.getVertexCount("*").values())

In [31]:
model = LightGCN(num_nodes, hp['embedding_dim'], hp['num_layers'], hp['dropout'])
optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

In [32]:
train_edge_index = data.edge_index[:, data.is_train]
test_edge_index = data.edge_index[:, data.is_test]

In [33]:
user_pos_index = train_edge_index[:, data.role[train_edge_index[0]] == 0]

In [34]:
from torch.utils.data import DataLoader

In [35]:
batch_size = 20000
pos_users = user_pos_index[0]
pos_items = user_pos_index[1]
for epoch in range(201):
    model.train()
    # Draw one uniformly random item per positive pair as its negative sample
    neg_items = items[torch.randint(len(items), (pos_items.shape[0],))]
    total_loss = 0
    count = 0
    for perm in DataLoader(range(pos_users.size(0)), batch_size,
                           shuffle=True):
        optimizer.zero_grad()
        h = model(train_edge_index, torch.LongTensor(range(0, num_nodes)))
        loss = model.decode(h, pos_users[perm], pos_items[perm], neg_items[perm])
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        count += 1
    if epoch % 40 == 0:
        print('Epoch: {}, training loss: {}'.format(epoch, total_loss / count))
        model.eval()
        with torch.no_grad():
            h = model(train_edge_index, torch.LongTensor(range(0, num_nodes)))
            recallk(h, items, train_user_record, test_user_record)

Epoch: 0, training loss: 0.6881372928619385
recall@2:, 0.00013700838168923275
recall@5:, 0.00029658284977433915
recall@10:, 0.0005121120014737035
recall@30:, 0.0022509636448935573
Epoch: 40, training loss: 0.29866740852594376
recall@2:, 0.03901264710987819
recall@5:, 0.06873777301303904
recall@10:, 0.10054871127836525
recall@30:, 0.1772820576323309
Epoch: 80, training loss: 0.20639914646744728
recall@2:, 0.040543566171695265
recall@5:, 0.07484413952929295
recall@10:, 0.11180803164368167
recall@30:, 0.19327400117238122
Epoch: 120, training loss: 0.18053408712148666
recall@2:, 0.04218936917622792
recall@5:, 0.08059376298503335
recall@10:, 0.11900254412605393
recall@30:, 0.20621602286401142
Epoch: 160, training loss: 0.16577080264687538
recall@2:, 0.04614484211206697
recall@5:, 0.0848391269737443
recall@10:, 0.12585340720227683
recall@30:, 0.21683980545744358
Epoch: 200, training loss: 0.15787839516997337
recall@2:, 0.04756660002318656
recall@5:, 0.08793789972442569
recall@10:, 0.13155483

## Stochastic Training
For stochastic training, we split the training edges into batches. For each batch, making recommendations requires the neighborhood subgraph around both endpoints of every edge in the batch.
We use the edgeNeighborLoader, which loads the neighbors of both endpoint nodes of each edge and takes the same parameters as neighborLoader().
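
The idea can be sketched on a toy graph in plain Python. This is only a conceptual illustration under assumed names (`sample_neighborhood`, the toy edge list); the real loader runs inside TigerGraph, and its sampling details may differ.

```python
import random
from collections import defaultdict


def sample_neighborhood(adj, seed_nodes, num_neighbors, num_hops, rng):
    """Collect nodes within num_hops of the seeds, sampling at most
    num_neighbors neighbors per node at each hop (hypothetical helper)."""
    visited = set(seed_nodes)
    frontier = set(seed_nodes)
    for _ in range(num_hops):
        next_frontier = set()
        for node in sorted(frontier):
            neighbors = sorted(adj[node])
            picked = rng.sample(neighbors, min(num_neighbors, len(neighbors)))
            next_frontier.update(picked)
        frontier = next_frontier - visited
        visited |= next_frontier
    return visited


# Toy bipartite graph: users u0..u2 and items i0..i3.
edges = [("u0", "i0"), ("u0", "i1"), ("u1", "i1"),
         ("u1", "i2"), ("u2", "i2"), ("u2", "i3")]
adj = defaultdict(set)
for user, item in edges:
    adj[user].add(item)
    adj[item].add(user)

rng = random.Random(0)
batch_size = 2
for start in range(0, len(edges), batch_size):
    seed_edges = edges[start:start + batch_size]
    # The seeds are both endpoints of every edge in the batch.
    seeds = {node for edge in seed_edges for node in edge}
    nodes = sample_neighborhood(adj, seeds, num_neighbors=10, num_hops=2, rng=rng)
    print(seed_edges, sorted(nodes))
```

Each batch therefore carries its seed edges plus a sampled subgraph around them, which is what the model needs to compute embeddings for the endpoints without loading the whole graph.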

In [36]:
train_edge_neighbor_loader = conn.gds.edgeNeighborLoader(
    v_extra_feats=["id", "role"],
    num_batches=5,
    e_extra_feats=["is_train", "is_test"],
    output_format="PyG",
    num_neighbors=10,
    num_hops=2,
    filter_by="is_train",
    shuffle=False,
)

### Use the whole graph for recall
Since recall is the evaluation metric, we need the user records (the train and test items of each user), which we build from the whole graph.

The whole graph is not used for training or inference.

In [37]:
graph_loader = conn.gds.graphLoader(
    num_batches=1,
    v_extra_feats=["role", 'id'],
    e_extra_feats=["is_train","is_test"],
    output_format = "PyG")

In [38]:
data = graph_loader.data

In [39]:
train_user_item = data.edge_index[:, data.is_train]
test_user_item = data.edge_index[:, data.is_test]
train_user_item = train_user_item[:, data.role[train_user_item[0]] == 0]
test_user_item = test_user_item[:, data.role[test_user_item[0]] == 0]
train_user_item_id = data.id[train_user_item]
test_user_item_id = data.id[test_user_item]
users_id = data.id[data.role==0].tolist()
items_id = data.id[data.role==1].tolist()

In [40]:
def get_user_record(user_item, users):
    user_record = {}
    for u in users:
        user_record[u] = user_item[1][user_item[0]==u].tolist()
    return user_record

In [41]:
train_user_record = get_user_record(train_user_item_id, users_id)
test_user_record = get_user_record(test_user_item_id, users_id)

### Train

In [42]:
model = LightGCN(num_nodes, hp['embedding_dim'], hp['num_layers'], hp['dropout'])
optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

In [43]:
for epoch in range(201):
    model.train()
    total_loss = 0
    for bid, batch in enumerate(train_edge_neighbor_loader):
        # get the training edges and negative edges sampled in the same batch
        train_edges = batch.edge_index[:, batch.is_seed]
        items = (batch.role == 1).nonzero().squeeze()
        users2items = train_edges[:, batch.role[train_edges[0]] == 0]
        users = users2items[0]
        pos_items = users2items[1]
        neg_items = items[torch.randint(len(items), (pos_items.shape[0],))]
        nodes_id = batch.id
        train_graph_edges = batch.edge_index[:, batch.is_train]
        optimizer.zero_grad()
        h = model(train_graph_edges, nodes_id)
        loss = model.decode(h, users, pos_items, neg_items)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()  # summed over the batches of this epoch
    if epoch % 20 == 0:
        print('Epoch: {}, training loss: {}'.format(epoch, total_loss))

Epoch: 0, training loss: 3.4278719425201416
Epoch: 20, training loss: 2.4438130855560303
Epoch: 40, training loss: 1.4304150640964508
Epoch: 60, training loss: 1.1493873298168182
Epoch: 80, training loss: 1.0042845904827118
Epoch: 100, training loss: 0.9102801829576492
Epoch: 120, training loss: 0.8601862490177155
Epoch: 140, training loss: 0.8185435086488724
Epoch: 160, training loss: 0.777618020772934
Epoch: 180, training loss: 0.7471616566181183
Epoch: 200, training loss: 0.7317674607038498


### Get nodes' embedding
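Use the neighbor loader to compute the embedding of each node. Each batch holds a set of seed nodes plus their sampled neighbors, and only the seed nodes' embeddings are written into the result.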

In [44]:
neighbor_loader = conn.gds.neighborLoader(
    v_extra_feats=["id", "role"],
    num_batches=5,
    e_extra_feats=["is_train", "is_test"],
    output_format="PyG",
    num_neighbors=10,
    num_hops=2,
    shuffle=False,
)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [45]:
@torch.no_grad()
def infer(model, neighbor_loader):
    embeddings = torch.zeros(num_nodes, hp['embedding_dim'])
    model.eval()
    for bid, batch in enumerate(neighbor_loader):
        train_graph_edges = batch.edge_index[:, batch.is_train]
        nodes_id = batch.id
        is_seed = batch.is_seed
        h = model(train_graph_edges, nodes_id)
        embeddings[nodes_id[is_seed]] = h[is_seed]
    print(embeddings)
    return embeddings
        

In [46]:
embeddings = infer(model, neighbor_loader)

tensor([[ 0.1570, -0.1485,  0.0248,  ..., -0.2449,  0.1912, -0.2190],
        [-0.2568,  0.3872,  0.4651,  ...,  0.2266, -0.1529,  0.1842],
        [ 0.2613, -0.2763, -0.1870,  ..., -0.2249,  0.2745, -0.3259],
        ...,
        [ 0.0124,  0.0258,  0.0468,  ...,  0.0222, -0.0103, -0.0133],
        [ 0.0125,  0.0272,  0.0489,  ...,  0.0214, -0.0121, -0.0113],
        [ 0.0100, -0.0144, -0.0210,  ..., -0.0118,  0.0021, -0.0202]])


In [47]:
recallk(embeddings, items_id, train_user_record, test_user_record)

recall@2:, 0.042619458636972425
recall@5:, 0.08181410490295786
recall@10:, 0.12334695609084659
recall@30:, 0.22187967001928496
