# KAMPING Tutorial 2. Homogeneous graph neural network modeling

Date created: 2024-10-25

In [54]:
# Import kamping library before starting the tutorial
import kamping

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Create a list of KeggGraph objects from a directory of KGML files

In the previous tutorial, we showed how to parse a single KGML file into a KeggGraph object that stores its information in an easily accessible way. In this tutorial, we show how such a dataset can be used with one of the most popular graph machine-learning packages, PyTorch Geometric ("pytorch-geometric"), through the utility functions that kamping provides.

Graph machine-learning models are usually trained on data containing more than one graph. You can use the `kamping.create_graphs` function to create a list of KeggGraph objects from a directory of KGML files.

In this tutorial we target homogeneous graphs, which are defined as graphs with only one type of node. In our case, a homogeneous graph is a "gene-only" graph or a "metabolite-only" graph. Training on a homogeneous graph is easy to understand, which is why we start here. Later, we will show that KAMPING can also convert heterogeneous graphs in a similar way with just a little extra effort.
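
To make the definition concrete, here is a minimal, library-free sketch that reduces a mixed edge list to edges whose endpoints share one node type. It reuses the `entry1_type` / `entry2_type` field names from the KeggGraph edge table; the toy rows and the helper `keep_homogeneous` are hypothetical. This is not how kamping builds a gene-only graph: the `compound-propagation` edges shown later suggest it does more than plain filtering.

```python
# Toy edge list; the rows are made up for illustration.
edges = [
    {"entry1": "hsa:10327", "entry2": "hsa:124",
     "entry1_type": "gene", "entry2_type": "gene"},
    {"entry1": "hsa:10327", "entry2": "cpd:C00038",
     "entry1_type": "gene", "entry2_type": "compound"},
    {"entry1": "cpd:C00038", "entry2": "cpd:C01180",
     "entry1_type": "compound", "entry2_type": "compound"},
]

def keep_homogeneous(edge_rows, node_type):
    """Keep only edges whose two endpoints both have the requested node type."""
    return [
        row for row in edge_rows
        if row["entry1_type"] == node_type and row["entry2_type"] == node_type
    ]

gene_only = keep_homogeneous(edges, "gene")
compound_only = keep_homogeneous(edges, "compound")
print(len(gene_only), len(compound_only))
```

The mixed gene-compound edge is dropped in both cases, which is exactly what separates a homogeneous graph from a heterogeneous one.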

In [3]:
gene_graphs = kamping.create_graphs('../data/kgml_hsa', type='gene', verbose=True, ignore_file=['hsa01100.xml'])


            Visit https://www.kegg.jp/kegg-bin/show_pathway?hsa00190 for pathway details.

            There are likely no edges in which to parse...
INFO:KeggGraph:Now parsing: path:hsa00220...
INFO:KeggGraph:Graph path:hsa00220 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00230...
INFO:KeggGraph:Graph path:hsa00230 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00232...
INFO:KeggGraph:Graph path:hsa00232 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00240...
INFO:KeggGraph:Graph path:hsa00240 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00250...
INFO:KeggGraph:Graph path:hsa00250 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00260...
INFO:KeggGraph:Graph path:hsa00260 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00270...
INFO:KeggGraph:Graph path:hsa00270 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00280...
INFO:KeggGraph:Graph path:hsa00280 parsed successfully!
INFO:KeggGraph:Now parsing: path:hsa00290

Batch processing of KGML files can also be useful in routine tasks. To access the result for a specific KGML file, you can use the code below.

In [4]:
gene_graph_00010 = [graph for graph in gene_graphs if graph.name == 'path:hsa00010'][0]
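
If you need to look up many pathways, scanning the list each time becomes repetitive. A small sketch of an alternative is to index the graphs by name once. `FakeGraph` is a hypothetical stand-in for KeggGraph; the only thing assumed about the real class is the `name` attribute used in the cell above.

```python
from dataclasses import dataclass

@dataclass
class FakeGraph:
    """Hypothetical stand-in for KeggGraph; only the `name` attribute matters here."""
    name: str

graphs = [FakeGraph("path:hsa00010"), FakeGraph("path:hsa00020")]

# Build the index once, then look up any pathway by name.
graphs_by_name = {graph.name: graph for graph in graphs}
selected = graphs_by_name["path:hsa00010"]

# .get() returns None instead of raising when a pathway is absent.
missing = graphs_by_name.get("path:hsa99999")
print(selected.name, missing)
```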

In [5]:
gene_graph_00010

KEGG Pathway: 
            [Title]: Glycolysis / Gluconeogenesis
            [Name]: path:hsa00010
            [Org]: hsa
            [Link]: https://www.kegg.jp/kegg-bin/show_pathway?hsa00010
            [Image]: https://www.kegg.jp/kegg/pathway/hsa/hsa00010.png
            [Link]: https://www.kegg.jp/kegg-bin/show_pathway?hsa00010
            Graph type: gene 
            Number of Genes: 67
            Number of Compounds: 0
            Gene ID type : kegg
            Compound ID type : kegg
            Number of Nodes: 67
            Number of Edges: 559

In [6]:
gene_graph_00010.edges

Unnamed: 0,entry1,entry2,type,subtype_name,subtype_value,entry1_type,entry2_type
0,hsa:10327,hsa:124,PPrel,compound-propagation,custom,gene,gene
1,hsa:10327,hsa:125,PPrel,compound-propagation,custom,gene,gene
2,hsa:10327,hsa:126,PPrel,compound-propagation,custom,gene,gene
3,hsa:10327,hsa:127,PPrel,compound-propagation,custom,gene,gene
4,hsa:10327,hsa:128,PPrel,compound-propagation,custom,gene,gene
...,...,...,...,...,...,...,...
554,hsa:9562,hsa:387712,PPrel,compound-propagation,custom,gene,gene
555,hsa:9562,hsa:441531,PPrel,compound-propagation,custom,gene,gene
556,hsa:9562,hsa:5223,PPrel,compound-propagation,custom,gene,gene
557,hsa:9562,hsa:5224,PPrel,compound-propagation,custom,gene,gene


In this tutorial, we use pre-computed protein embeddings obtained directly from UniProt, so we need to convert the KEGG gene IDs into UniProt IDs. We don't need to convert the KEGG compound IDs, so we leave them untouched. If you don't specify `compound_target` when initializing the converter, it defaults to "kegg", which is what you want when only the gene IDs need converting.

In [7]:
converter = kamping.Converter('hsa', gene_target='uniprot', verbose=True)

In [8]:
for graph in gene_graphs:
    converter.convert(graph)

INFO:kamping.parser.convert:Conversion of path:hsa00010 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00020 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00030 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00040 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00051 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00052 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00053 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00061 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00062 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00071 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00100 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00120 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00130 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00140 complete!
INFO:kamping.parser.convert:Conversion of path:hsa00220 complete!
INFO:kampi

If you did not convert the compound IDs into another ID type, you can use the `kamping.get_kegg_mol` function to retrieve the molfile from KEGG for each compound in all graphs and create a MOL object using RDKit (https://www.rdkit.org/). It returns a `pd.DataFrame` whose first column is the compound ID and whose second column is the MOL object.

In [9]:
# uncomment to run for the first time
# mols = kamping.get_kegg_mol(graphs)

This process might take a while because of the large number of compounds across so many graphs. It can be a good idea to save the resulting `pd.DataFrame` for repeated use when testing different approaches to embedding metabolites.

In [10]:
import pandas as pd

# uncomment the code below when running for the first time
# save the mols to a file
# mols.to_pickle('data/mols.pkl')
# retrieve mol from file
mols = pd.read_pickle('data/mols.pkl')
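
Instead of commenting and uncommenting lines between runs, the same idea can be written as a "load if cached, otherwise compute and save" helper. The sketch below uses only the standard library. `load_or_compute` and `slow_download` are hypothetical names, and the dictionary stands in for the DataFrame returned by the expensive step.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(cache_path, compute):
    """Return the cached object if the file exists; otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def slow_download():
    # Placeholder for an expensive step such as the KEGG molfile retrieval.
    calls.append(1)
    return {"cpd:C00038": "molblock placeholder"}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "mols.pkl"
    first = load_or_compute(cache_file, slow_download)   # computes and writes the cache
    second = load_or_compute(cache_file, slow_download)  # reads the cache, no recompute

print(len(calls), first == second)
```

The expensive function runs only once; the second call is served from disk.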

Not all compounds have a molfile in KEGG. Most compounds without a molfile are glycans, which don't have a fixed atom composition. For now, we can simply ignore them.

In [11]:
mols

Unnamed: 0,id,ROMol
0,cpd:C00038,<rdkit.Chem.rdchem.Mol object at 0x2ce43b1a0>
1,cpd:C01180,<rdkit.Chem.rdchem.Mol object at 0x2cb6ca610>
2,gl:G00083,
3,cpd:C20683,<rdkit.Chem.rdchem.Mol object at 0x2ce412480>
4,cpd:C02593,<rdkit.Chem.rdchem.Mol object at 0x2ce4db650>
...,...,...
1658,gl:G10599,
1659,cpd:C03090,<rdkit.Chem.rdchem.Mol object at 0x2ce7aa070>
1660,cpd:C00097,<rdkit.Chem.rdchem.Mol object at 0x2ce79a020>
1661,cpd:C11134,<rdkit.Chem.rdchem.Mol object at 0x2ce7a9fd0>
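
A quick way to check which kinds of entries lack a molfile is to tally the missing rows by ID prefix. The sketch below does this on toy `(id, mol)` pairs that mirror the table above, with the string `"mol"` as a placeholder for an RDKit Mol object. On the real DataFrame, something like `mols[mols['ROMol'].notna()]` should give the same filtering, assuming the column is named `ROMol` as displayed.

```python
from collections import Counter

# Toy (id, mol) pairs; None marks a missing molfile.
rows = [
    ("cpd:C00038", "mol"),
    ("cpd:C01180", "mol"),
    ("gl:G00083", None),
    ("cpd:C20683", "mol"),
    ("gl:G10599", None),
]

# Tally missing entries by ID prefix ("cpd" for compounds, "gl" for glycans).
missing_by_prefix = Counter(cid.split(":")[0] for cid, mol in rows if mol is None)

# Keep only rows that can actually be embedded.
valid_rows = [(cid, mol) for cid, mol in rows if mol is not None]

print(missing_by_prefix, len(valid_rows))
```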


Once we have the MOL object for each compound, we can use RDKit to embed them into vectors that a machine-learning model can work with.

In [12]:
# todo: Might be a good idea to depend on scikit-mol  

In [13]:
mol_embeddings = kamping.get_mol_embeddings_from_dataframe(mols, transformer='morgan')

                    total 231 Invalid rows with "None" in the ROMol column


In [14]:
protein_embeddings = kamping.get_uniprot_protein_embeddings(gene_graphs, '../data/embedding/protein_embedding.h5')

In [15]:
gene_graphs[0]

KEGG Pathway: 
            [Title]: Glycolysis / Gluconeogenesis
            [Name]: path:hsa00010
            [Org]: hsa
            [Link]: https://www.kegg.jp/kegg-bin/show_pathway?hsa00010
            [Image]: https://www.kegg.jp/kegg/pathway/hsa/hsa00010.png
            [Link]: https://www.kegg.jp/kegg-bin/show_pathway?hsa00010
            Graph type: gene 
            Number of Genes: 104
            Number of Compounds: 0
            Gene ID type : uniprot
            Compound ID type : kegg
            Number of Nodes: 104
            Number of Edges: 1278
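
Note that after conversion the same pathway reports 104 genes and 1278 edges, compared with 67 genes and 559 edges before. A likely explanation, which is an assumption on our part rather than something the output confirms, is that one KEGG gene ID can map to several UniProt accessions, so each edge is expanded to every combination of mapped IDs. The sketch below illustrates that effect with a made-up mapping; `expand_edges` is a hypothetical helper and not the converter's implementation.

```python
from itertools import product

# Hypothetical one-to-many mapping from KEGG gene IDs to UniProt accessions.
kegg_to_uniprot = {
    "hsa:geneA": ["up:A1", "up:A2"],
    "hsa:geneB": ["up:B1"],
    "hsa:geneC": ["up:C1", "up:C2", "up:C3"],
}

kegg_edges = [("hsa:geneA", "hsa:geneB"), ("hsa:geneA", "hsa:geneC")]

def expand_edges(edges, mapping):
    """Replace each edge by all combinations of the mapped source and target IDs."""
    expanded = []
    for source, target in edges:
        # IDs without a mapping are dropped in this sketch.
        for pair in product(mapping.get(source, []), mapping.get(target, [])):
            expanded.append(pair)
    return expanded

uniprot_edges = expand_edges(kegg_edges, kegg_to_uniprot)
uniprot_nodes = {node for edge in uniprot_edges for node in edge}
print(len(kegg_edges), "->", len(uniprot_edges), "edges;", len(uniprot_nodes), "nodes")
```

Here two KEGG edges become eight UniProt edges over six nodes, the same kind of growth seen in the summary above.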

## 2. Create a Pytorch-geometric data object

In [68]:
pyg_one_graph = kamping.convert_to_single_pyg(gene_graphs, embeddings=protein_embeddings)

In [69]:
data = pyg_one_graph
data

Data(x=[7275, 1024])

In [29]:
original_edge_index = data.edge_index

A PyTorch Geometric data object mainly consists of `edge_index` and `x`, the node feature matrix. Other information is also saved to help interpret predictions, such as `node_type`, the original node name (`node_name`), `edge_type`, and `edge_subtype_name`. In addition, `name="combined"` indicates that this is a graph combined from smaller graphs, and `type="gene"` indicates that it is a homogeneous "gene" graph.
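
PyTorch Geometric stores edges in COO format: `edge_index` has shape `[2, num_edges]`, and each column holds the row indices (into `x`) of one edge's source and target. The library-free sketch below shows how named edges map to such an index and how `node_name` lets us go back from an index to a name. The edge list is a toy example, and the index assignment (order of first appearance) is our assumption, not necessarily what `convert_to_single_pyg` does.

```python
# Named edges as they might look after ID conversion (toy values).
named_edges = [
    ("up:P14550", "up:P07327"),
    ("up:P14550", "up:P00325"),
    ("up:P07327", "up:P00325"),
]

# Assign each distinct node a row index in order of first appearance.
node_name = []
node_index = {}
for source, target in named_edges:
    for name in (source, target):
        if name not in node_index:
            node_index[name] = len(node_name)
            node_name.append(name)

# edge_index has shape [2, num_edges]: row 0 holds sources, row 1 holds targets.
edge_index = [
    [node_index[source] for source, _ in named_edges],
    [node_index[target] for _, target in named_edges],
]

print(edge_index)
# Map the target of the last edge back to its original name.
print(node_name[edge_index[1][2]])
```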

In [23]:
from torch_geometric.loader import DataLoader
dataloader = DataLoader([data])
batch = next(iter(dataloader))

In [25]:
batch.node_name

[['up:P14550',
  'up:P07327',
  'up:V9HWI0',
  'up:P00325',
  'up:V9HW50',
  'up:P00326',
  'up:P08319',
  'up:V9HVX7',
  'up:P11766',
  'up:Q6IRT1',
  'up:P28332',
  'up:Q8IUN7',
  'up:P40394',
  'up:P05091',
  'up:A0A384NPN7',
  'up:P30838',
  'up:Q6PKA6',
  'up:P30837',
  'up:A0A384MTJ7',
  'up:P43353',
  'up:P49189',
  'up:P51648',
  'up:P49419',
  'up:Q96C23',
  'up:P35557',
  'up:Q53Y25',
  'up:A0A384MDW6',
  'up:P19367',
  'up:B3KXY9',
  'up:A8K7J7',
  'up:Q59FD4',
  'up:P52789',
  'up:P52790',
  'up:Q2TB90',
  'up:B3KT70',
  'up:Q9BRR6',
  'up:Q6ZMR3',
  'up:P10515',
  'up:Q86YI5',
  'up:P08559',
  'up:P29803',
  'up:P11177',
  'up:A0A384MDR8',
  'up:P09622',
  'up:A0A024R713',
  'up:P00338',
  'up:V9HWB9',
  'up:P07195',
  'up:Q5U077',
  'up:P07864',
  'up:A0A140VKA7',
  'up:Q9BYZ2',
  'up:A0A140VJM9',
  'up:P06733',
  'up:Q8N0Y7',
  'up:A0A024R4F1',
  'up:P18669',
  'up:Q6FHU2',
  'up:P15259',
  'up:P30613',
  'up:P14618',
  'up:V9HWB8',
  'up:P07738',
  'up:A0A024R782',
  'u

## 3. Training a graph neural network model

In [69]:
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

from torch_geometric.utils import negative_sampling
from torch_geometric.datasets import Planetoid
import torch_geometric.transforms as T
from torch_geometric.nn import GCNConv, SAGEConv
from torch_geometric.utils import train_test_split_edges

In [122]:
from torch_geometric import transforms
data.train_mask = data.val_mask = data.test_mask = data.y = None
data = train_test_split_edges(data)



In [71]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# override the line above: this tutorial run is forced onto the CPU
device = "cpu"

In [72]:
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = SAGEConv(1024, 128)
        self.conv2 = SAGEConv(128, 64)

    def encode(self):
        x = self.conv1(data.x, data.train_pos_edge_index) # convolution 1
        x = x.relu()
        return self.conv2(x, data.train_pos_edge_index) # convolution 2

    def decode(self, z, pos_edge_index, neg_edge_index): # only pos and neg edges
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1) # concatenate pos and neg edges
        logits = (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)  # dot product 
        return logits

    def decode_all(self, z):
        prob_adj = z @ z.t() # get adj NxN
        return (prob_adj > 0).nonzero(as_tuple=False).t() # get predicted edge_list 
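
The decoder scores a candidate edge by the dot product of its two node embeddings. The sketch below works through that arithmetic in plain Python with made-up 2-dimensional embeddings (the model above produces 64 dimensions). It also shows why `decode_all` can threshold the raw scores at 0: a logit above 0 is the same as a sigmoid probability above 0.5.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Toy node embeddings, one row per node.
z = [[1.0, 2.0], [0.5, 1.0], [-1.0, 0.0]]

# Candidate edges as (source, target) index pairs.
candidate_edges = [(0, 1), (0, 2)]

logits = [dot(z[i], z[j]) for i, j in candidate_edges]
probs = [sigmoid(t) for t in logits]

# Keep edges whose logit is positive, i.e. whose probability exceeds 0.5.
predicted = [edge for edge, t in zip(candidate_edges, logits) if t > 0]
print(logits, predicted)
```

Nodes 0 and 1 point in the same direction, so their edge gets a positive logit; nodes 0 and 2 point in opposite directions along the first axis and are rejected.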

In [73]:
model, data = Net().to(device), data.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

In [74]:
def get_link_labels(pos_edge_index, neg_edge_index):
    # returns a tensor:
    # [1,1,1,1,...,0,0,0,0,0,..] where the number of ones equals the length of pos_edge_index
    # and the number of zeros equals the length of neg_edge_index
    E = pos_edge_index.size(1) + neg_edge_index.size(1)
    link_labels = torch.zeros(E, dtype=torch.float, device=device)
    link_labels[:pos_edge_index.size(1)] = 1.
    return link_labels

def train():
    model.train()

    neg_edge_index = negative_sampling(
        edge_index=data.train_pos_edge_index, #positive edges
        num_nodes=data.num_nodes, # number of nodes
        num_neg_samples=data.train_pos_edge_index.size(1)) # number of neg_sample equal to number of pos_edges

    optimizer.zero_grad()

    z = model.encode() #encode
    link_logits = model.decode(z, data.train_pos_edge_index, neg_edge_index) # decode

    link_labels = get_link_labels(data.train_pos_edge_index, neg_edge_index)
    loss = F.binary_cross_entropy_with_logits(link_logits, link_labels)
    loss.backward()
    optimizer.step()

    return loss

@torch.no_grad()
def test():
    model.eval()
    perfs = []
    for prefix in ["val", "test"]:
        pos_edge_index = data[f'{prefix}_pos_edge_index']
        neg_edge_index = data[f'{prefix}_neg_edge_index']

        z = model.encode() # encode train
        link_logits = model.decode(z, pos_edge_index, neg_edge_index) # decode test or val
        link_probs = link_logits.sigmoid() # apply sigmoid

        link_labels = get_link_labels(pos_edge_index, neg_edge_index) # get link

        perfs.append(roc_auc_score(link_labels.cpu(), link_probs.cpu())) #compute roc_auc score
    return perfs
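
In `train()`, fresh negative edges are drawn every epoch, while `test()` reuses the fixed `val`/`test` negatives stored on `data`. To show what a negative sample is, here is a minimal plain-Python sketch: it picks node pairs that are not positive edges and builds labels in the same layout as `get_link_labels`. `sample_negative_edges` is a hypothetical helper, not PyG's `negative_sampling`. Two simplifications are ours: it enumerates all candidate pairs, which is only practical for tiny graphs, and it excludes self-pairs.

```python
import random

def sample_negative_edges(positive_edges, num_nodes, num_samples, seed=0):
    """Sample node pairs that are not positive edges (toy-sized graphs only)."""
    rng = random.Random(seed)
    positives = set(positive_edges)
    candidates = [
        (i, j)
        for i in range(num_nodes)
        for j in range(num_nodes)
        if i != j and (i, j) not in positives
    ]
    return rng.sample(candidates, min(num_samples, len(candidates)))

positive_edges = [(0, 1), (1, 2), (2, 3)]
negative_edges = sample_negative_edges(
    positive_edges, num_nodes=5, num_samples=len(positive_edges)
)

# Labels follow the layout of get_link_labels: ones for positives, then zeros.
labels = [1.0] * len(positive_edges) + [0.0] * len(negative_edges)
print(negative_edges, labels)
```

Sampling as many negatives as positives keeps the two classes balanced for the binary cross-entropy loss.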

In [75]:
best_val_perf = test_perf = 0
for epoch in range(1, 101):
    train_loss = train()
    val_perf, tmp_test_perf = test()
    if val_perf > best_val_perf:
        best_val_perf = val_perf
        test_perf = tmp_test_perf
    log = 'Epoch: {:03d}, Loss: {:.4f}, AUC Val: {:.4f}, AUC Test: {:.4f}'
    print(log.format(epoch, train_loss, best_val_perf, test_perf))

Epoch: 001, Loss: 0.6950, AUC Val: 0.6688, AUC Test: 0.6845
Epoch: 002, Loss: 1.4109, AUC Val: 0.6688, AUC Test: 0.6845
Epoch: 003, Loss: 0.6942, AUC Val: 0.7254, AUC Test: 0.7318
Epoch: 004, Loss: 0.7089, AUC Val: 0.7426, AUC Test: 0.7492
Epoch: 005, Loss: 0.6844, AUC Val: 0.7426, AUC Test: 0.7492
Epoch: 006, Loss: 0.6761, AUC Val: 0.7426, AUC Test: 0.7492
Epoch: 007, Loss: 0.6645, AUC Val: 0.7426, AUC Test: 0.7492
Epoch: 008, Loss: 0.6527, AUC Val: 0.7796, AUC Test: 0.7764
Epoch: 009, Loss: 0.6355, AUC Val: 0.7874, AUC Test: 0.7786
Epoch: 010, Loss: 0.6217, AUC Val: 0.7943, AUC Test: 0.7858
Epoch: 011, Loss: 0.6071, AUC Val: 0.8031, AUC Test: 0.7971
Epoch: 012, Loss: 0.5969, AUC Val: 0.8087, AUC Test: 0.8023
Epoch: 013, Loss: 0.5799, AUC Val: 0.8231, AUC Test: 0.8175
Epoch: 014, Loss: 0.5656, AUC Val: 0.8431, AUC Test: 0.8384
Epoch: 015, Loss: 0.5548, AUC Val: 0.8625, AUC Test: 0.8593
Epoch: 016, Loss: 0.5450, AUC Val: 0.8790, AUC Test: 0.8766
Epoch: 017, Loss: 0.5341, AUC Val: 0.889
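
Two notes on reading this log. First, the loop prints the best validation AUC so far and the test AUC from that same epoch, which is why the values repeat across several epochs. Second, ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive edge receives the higher score. The sketch below computes it that way in plain Python on made-up scores. It is quadratic in the number of edges and meant only as an illustration, so keep using `roc_auc_score` in practice.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Eight of the nine positive-negative pairs are ordered correctly here; the only miss is the positive scored 0.3 against the negative scored 0.4.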

## 4. Prediction


In [134]:
z = model.encode() #encode
predict_adj = torch.sigmoid(z @ z.t())

# number of top-scoring entries to select: the top 1% of the N x N score matrix
num_top_edges = int(0.01 * predict_adj.numel())

# topk on the flattened matrix returns values and 1D indices, highest score first
value, indices = torch.topk(predict_adj.view(-1), num_top_edges, largest=True)

# convert the 1D indices to 2D indices
indices_2d = torch.stack(torch.unravel_index(indices, predict_adj.shape)).t()

print(indices_2d)

tensor([[1729, 1695],
        [3068, 3048],
        [3049, 1636],
        ...,
        [4685, 2539],
        [5464, 2331],
        [2331, 5464]])
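
Because `z @ z.t()` is symmetric, each predicted pair appears twice in the result, as `[5464, 2331]` and `[2331, 5464]` do above, and the diagonal (a node paired with itself) can also rank highly. The plain-Python sketch below shows the flat-to-2D index conversion on a toy score matrix, followed by one possible clean-up that keeps each unordered pair once. Keeping only `i < j` is our choice and assumes direction does not matter.

```python
def unravel(flat_index, num_cols):
    """Convert an index into a flattened row-major matrix back to (row, col)."""
    return divmod(flat_index, num_cols)

# Toy symmetric score matrix, shaped like the output of z @ z.T.
scores = [
    [0.9, 0.7, 0.1],
    [0.7, 0.8, 0.2],
    [0.1, 0.2, 0.6],
]
num_cols = len(scores[0])
flat = [value for row in scores for value in row]

# Indices of the k largest scores, highest first.
k = 4
top_flat = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]
top_pairs = [unravel(i, num_cols) for i in top_flat]

# Drop self-pairs and keep one direction per pair.
unique_pairs = [(i, j) for i, j in top_pairs if i < j]
print(top_pairs, unique_pairs)
```

Of the four top entries, two are diagonal self-pairs and the other two are the same pair in both directions, so only one distinct candidate edge remains.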


In [126]:
original_edge_index = original_edge_index.t()

In [127]:
indices_2d[:, 1]

tensor([1695, 3048, 1636,  ..., 2539, 2331, 5464])

In [136]:
import numpy as np
entry1 = np.array(data.node_name)[indices_2d[:, 0].numpy()]
entry2 = np.array(data.node_name)[indices_2d[:, 1].numpy()]
# combine the two entries
edges = np.stack([entry1, entry2], axis=1)
edges

array([['up:P22612', 'up:Q499G7'],
       ['up:R9QE65', 'up:A0A0S2Z392'],
       ['up:P35626', 'up:P31749'],
       ...,
       ['up:A0A384ME58', 'up:Q13255'],
       ['up:Q8N1C3', 'up:Q6FHM2'],
       ['up:Q6FHM2', 'up:Q8N1C3']], dtype='<U13')

In [ ]:
from unipressed import IdMappingClient
request = IdMappingClient.submit(
    source="UniProtKB_AC-ID",
    dest="Gene_Name",
    ids=entry1
)


In [154]:
# check whether the pair ('up:P16520', 'up:P35626') is among the predicted edges, in either direction
test = 'up:P16520'
edges[(edges[:, 0] == test) & (edges[:, 1] == 'up:P35626')]
edges[(edges[:, 0] == 'up:P35626') & (edges[:, 1] == test)]

array([['up:P35626', 'up:P16520']], dtype='<U13')

In [138]:
# each comparison needs its own parentheses because & binds more tightly than ==;
# the IDs also need the 'up:' prefix used in the edges array
edges[(edges[:, 0] == 'up:Q8NFJ5') & (edges[:, 1] == 'up:P35626')]

In [129]:
original_edge_index

tensor([[   0,    1],
        [   0,    3],
        [   0,    4],
        ...,
        [7276, 1754],
        [7276, 1755],
        [7276, 1756]])

In [130]:
entry1 = np.array(data.node_name)[original_edge_index[:, 0].numpy()]
entry2 = np.array(data.node_name)[original_edge_index[:, 1].numpy()]
# combine the two entries
edges = np.stack([entry1, entry2], axis=1)
edges

array([['up:P14550', 'up:P07327'],
       ['up:P14550', 'up:P00325'],
       ['up:P14550', 'up:V9HW50'],
       ...,
       ['up:P63165', 'up:P04637'],
       ['up:P63165', 'up:K7PPA8'],
       ['up:P63165', 'up:Q53GA5']], dtype='<U13')
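
With both the predicted and the original edges available as name pairs, one plausible next step is to list the predicted pairs that are not already in the original graph. The sketch below does this with Python sets on toy data. Treating edges as undirected and dropping self-pairs are our assumptions, since KEGG relations can be directional, and `undirected` is a hypothetical helper.

```python
def undirected(edge):
    """Canonical form of an undirected edge: endpoints in sorted order."""
    a, b = edge
    return (a, b) if a <= b else (b, a)

# Toy data; the name pairs only mimic the arrays above.
known_edges = [("up:P14550", "up:P07327"), ("up:P14550", "up:P00325")]
predicted_edges = [
    ("up:P07327", "up:P14550"),   # already known, reversed direction
    ("up:P35626", "up:P16520"),   # not among the known edges
    ("up:P16520", "up:P35626"),   # same pair, other direction
    ("up:P14550", "up:P14550"),   # self-pair
]

known = {undirected(edge) for edge in known_edges}
predicted = {undirected(edge) for edge in predicted_edges if edge[0] != edge[1]}

# Predicted pairs that the original graph does not contain.
novel = sorted(predicted - known)
print(novel)
```

Using sets of canonical pairs avoids the repeated boolean-mask lookups and handles both edge directions in one pass.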