GraphSAGE (SAmple and aggreGatE) is an inductive deep learning method for graphs, developed by Hamilton, Ying, and Leskovec (2017), that generates low-dimensional vector representations for nodes. "Inductive" means it can generalize to nodes that were not seen during training. This contrasts with earlier graph machine learning methods such as DeepWalk, or Graph Convolutional Networks as originally formulated, which are transductive, i.e. they can only generate embeddings for the nodes present in the fixed graph used for training.
If the graph later evolves and new nodes (unseen during training) appear, a transductive model has to be retrained on the whole graph to compute embeddings for them. This makes transductive approaches inefficient on ever-evolving graphs (social networks, protein-protein interaction networks, etc.). A further limitation of embedding methods such as DeepWalk is that they cannot leverage node features, e.g. text attributes, node profile information, or node degrees.
GraphSAGE, on the other hand, exploits both the node features and the topological structure of each node's neighborhood to efficiently generate representations for new nodes without retraining.
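
To make the inductive idea concrete, here is a minimal pure-Python sketch of one mean-aggregator layer. It is an illustration only, not the `torch_geometric` implementation used later: the helper names (`sage_layer`, `matvec`, `mean`), the toy graph, and the weight matrices are all assumptions made up for this example.

```python
# Toy sketch of a GraphSAGE-style mean aggregator (illustrative, hypothetical names).

def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sage_layer(node, features, neighbors, W_self, W_neigh):
    """h_v = ReLU(W_self x_v + W_neigh * mean of neighbor features)."""
    agg = mean([features[u] for u in neighbors[node]])
    pre = [a + b for a, b in zip(matvec(W_self, features[node]),
                                 matvec(W_neigh, agg))]
    return [max(0.0, z) for z in pre]

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
W_self = [[1.0, 0.0], [0.0, 1.0]]    # stands in for learned weights
W_neigh = [[0.5, 0.0], [0.0, 0.5]]   # stands in for learned weights

h_a = sage_layer("a", features, neighbors, W_self, W_neigh)

# A node that arrives after "training" only needs features and neighbors:
features["d"] = [0.0, 2.0]
neighbors["d"] = ["a", "b"]
h_d = sage_layer("d", features, neighbors, W_self, W_neigh)
```

Because the layer is a function of a node's features and its neighbors' features, the same weights can be reused for node `d` without any retraining.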

In [1]:
%cd DATA

/home/ubuntu/workspace/GNNs-on-Biological-data/DATA


In [2]:
import collections
import os.path as osp
import random
import time

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn.functional as F
import torch_geometric.transforms as T
import umap
from pandas.core.common import flatten
from scipy.special import softmax
from sklearn.model_selection import train_test_split
from torch_geometric.data import Data, InMemoryDataset, NeighborSampler
from torch_geometric.nn import SAGEConv
from tqdm import tqdm

sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set_theme(style="ticks")

In [3]:
def attribute_counter(G):
  zeros=0
  ones=0
  for node in tqdm(list(G.nodes())):
    if G.nodes[node]['label']==0:
      zeros+=1
    else:
      ones+=1
  print("#zeros: ", zeros)
  print("#ones: ", ones)
  print("portion of ones " ,ones/(ones+zeros) )

In [4]:
disease_name = "Schizophrenia"

In [5]:
balanced_G = nx.read_gpickle(f"{disease_name}_balanced_more.gpickle")
print(nx.info(balanced_G))
attribute_counter(balanced_G)




Graph with 1329 nodes and 7900 edges


100%|██████████| 1329/1329 [00:00<00:00, 3006596.56it/s]

#zeros:  314
#ones:  1015
portion of ones  0.763732129420617





In [6]:
G=balanced_G

In [7]:
# binary label per node: 1 if the node's 'label' attribute is non-zero, else 0
labels = np.asarray([G.nodes[i]['label'] != 0 for i in G.nodes]).astype(np.int64)

# build the adjacency matrix with NetworkX and SciPy; the edge index is derived from it below
adj = nx.to_scipy_sparse_matrix(G).tocoo() # COO (coordinate) format: parallel row/col index arrays
#print(adj)


The scipy.sparse array containers will be used instead of matrices
in Networkx 3.0. Use `to_scipy_sparse_array` instead.
  adj = nx.to_scipy_sparse_matrix(G).tocoo() #coordinate format


In [8]:
#create edge index in the proper way
row = torch.from_numpy(adj.row.astype(np.int64)).to(torch.long) #create a torch tensor from numpy in long format : for row indexes
col = torch.from_numpy(adj.col.astype(np.int64)).to(torch.long) #                                                   for column indexes
edge_index = torch.stack([row, col], dim=0)
#display(edge_index)

In [9]:
# Use the node degree as the only input feature. For simplicity, the feature vector
# describing each node is just its (standardized) degree, a deliberately simple choice.
# Other node embeddings could be used instead, see e.g.
# https://medium.com/@st3llasia/graph-embedding-techniques-7d5386c88c5
embeddings = np.array(list(dict(G.degree()).values())) #list the values of degree of each node as numpy array
# normalizing degree values
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
embeddings = scale.fit_transform(embeddings.reshape(-1,1))
print(len(embeddings))
print(embeddings)

1329
[[ 1.75713634]
 [ 3.30852095]
 [ 4.45954824]
 ...
 [ 0.00557307]
 [-0.54491824]
 [ 0.25579639]]
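
As a small worked example of what the standardization step does, the sketch below reproduces the z-score formula by hand on a made-up list of degrees (the values are an assumption for illustration, not taken from the graph). `StandardScaler` uses the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`.

```python
import statistics

# Hypothetical degree values, chosen only to make the arithmetic easy to follow.
degrees = [2, 4, 4, 10]

mu = statistics.fmean(degrees)        # mean degree
sigma = statistics.pstdev(degrees)    # population std (ddof=0), as in StandardScaler

# z-score: subtract the mean, divide by the standard deviation
z = [(d - mu) / sigma for d in degrees]

print(mu, sigma)
print(z)
```

After this transformation the feature has zero mean and unit variance, so a positive value means "more connected than the average node" and a negative value means "less connected".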


In [10]:
# custom pytorch dataset
class PPIDATASET(InMemoryDataset):
    def __init__(self, transform=None):
        super(PPIDATASET, self).__init__('.', transform, None, None) #pre transform and pre filter: None, we don't need them
        data = Data(edge_index=edge_index) #Data : A data object describing a homogeneous graph.  for more : https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data
        data.num_nodes = G.number_of_nodes()
        # embedding 
        data.x = torch.from_numpy(embeddings).type(torch.float32)
        # labels
        y = torch.from_numpy(labels).type(torch.long)
        data.y = y.clone().detach() # detach from any computation graph; gradients are not needed for labels
        data.num_classes = 2
        # split the nodes into train (70%) and test (30%); no validation set is created in this cell
        test_ratio = 0.30
        X_train, X_test, y_train, y_test = train_test_split(pd.Series(G.nodes()),  pd.Series(labels), test_size=test_ratio, random_state=42)
        n_nodes = G.number_of_nodes()
        # create boolean train and test masks over all nodes:
        # train_mask marks the nodes used for training,
        # test_mask marks the nodes held out for testing
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[X_train.index] = True
        test_mask[X_test.index] = True
        data['train_mask'] = train_mask
        data['test_mask'] = test_mask
        data['X_train']=X_train
        data['X_test']=X_test
        data['y_test']=y_test
        #data['y_train']=y_train
        #data['y_test']=X_test
        self.data, self.slices = self.collate([data])
    # def _download(self):
    #     return
    # def _process(self):
    #     return
    # def __repr__(self):
    #     return '{}()'.format(self.__class__.__name__)


In [11]:
dataset = PPIDATASET()
# Here the dataset contains a single undirected protein-protein interaction graph.
# Reminder: the dataset behaves like a list of graphs; this one has only one element:
data = dataset[0] #now data is ready for training and testing

In [12]:
split_idx={}
split_idx['test']=torch.tensor(sorted(data.X_test.index.values))
split_idx['train']=torch.tensor(sorted(data.X_train.index.values))

In [13]:
# let's check how many node ids ended up in the train and test splits
print('Number of training nodes:', split_idx['train'].size(0))
print('Number of test nodes:', split_idx['test'].size(0))

Number of training nodes: 930
Number of test nodes: 399
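
The split sizes printed above can be reproduced by hand. The sketch below assumes the usual `train_test_split` behavior for a float `test_size` (the test count is rounded up and the remainder goes to training); the variable names are made up for this example.

```python
import math

n_nodes = 1329      # number of nodes reported for the graph
test_size = 0.30    # fraction passed to train_test_split

# test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_nodes)
n_train = n_nodes - n_test

print(n_train, n_test)
```

Note that the split is random rather than stratified, so with roughly 76% positive labels the class ratio can differ slightly between the two subsets; `train_test_split` accepts a `stratify` argument if matching ratios are wanted.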


In [14]:
# let's check some statistics of the PPI graph
print("Number of nodes in the graph:", data.num_nodes)
print("Number of edges in the graph:", data.num_edges)
print("Node feature matrix with shape:", data.x.shape) # [num_nodes, num_node_features]
print("Graph connectivity in COO format with shape:", data.edge_index.shape) # [2, num_edges]
print("Target to train against :", data.y.shape) 
print("Node feature length", dataset.num_features)


Number of nodes in the graph: 1329
Number of edges in the graph: 15445
Node feature matrix with shape: torch.Size([1329, 1])
Graph connectivity in COO format with shape: torch.Size([2, 15445])
Target to train against : torch.Size([1329])
Node feature length 1
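
The edge count here (15445) differs from the 7900 edges reported by NetworkX because `edge_index` stores directed entries: each undirected edge appears twice, once per direction, while a self-loop appears only once. The sketch below illustrates this on a toy edge list (the helper `to_coo` and the toy edges are assumptions for illustration) and then applies the same arithmetic to the numbers above.

```python
def to_coo(undirected_edges):
    """Expand undirected edges into directed (row, col) entries."""
    rows, cols = [], []
    for u, v in undirected_edges:
        rows.append(u)
        cols.append(v)
        if u != v:              # a self-loop contributes a single entry
            rows.append(v)
            cols.append(u)
    return rows, cols

# Toy graph: two ordinary edges and one self-loop on node 2
edges = [(0, 1), (1, 2), (2, 2)]
rows, cols = to_coo(edges)
n_entries = len(rows)           # 2 * 3 edges - 1 self-loop

# Same arithmetic on the reported counts: 2 * E - S = directed entries
implied_self_loops = 2 * 7900 - 15445
print(n_entries, implied_self_loops)
```

Under this reading, the reported counts would be consistent with 355 self-loops in the graph; this is an inference from the two numbers, not something checked directly in the notebook.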


## Neighborhood Sampling

`NeighborSampler` iteratively samples neighbors (one hop per layer) and constructs bipartite graphs that simulate the actual computation flow of a GNN.

`sizes` denotes how many neighbors to sample for each node in each layer.

Each iteration of `NeighborSampler` returns the current `batch_size`, the IDs `n_id` of all nodes involved in the computation, and a list of bipartite graph objects as tuples `(edge_index, e_id, size)`, where `edge_index` holds the bipartite edges between source and target nodes, `e_id` holds the IDs of the original edges in the full graph, and `size` holds the shape of the bipartite graph.

The computation graphs are returned in reverse order, meaning that messages are passed from a larger set of nodes to a smaller one until we reach the nodes for which we originally wanted to compute embeddings.

For details, see: https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/data/sampler.html
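
The sampling idea can be sketched in a few lines of plain Python. This is a simplified illustration and not how `NeighborSampler` is implemented: the function `sample_hops`, the toy adjacency list, and the fan-out values are assumptions for this example, and only node sets are tracked (no bipartite edge lists).

```python
import random

def sample_hops(adj, targets, sizes, rng):
    """Grow the node set hop by hop, sampling at most `k` neighbors per node.

    Returns one node list per hop. Earlier nodes always stay at the front,
    which is why the target nodes can later be recovered by slicing.
    """
    layers = [list(targets)]
    for k in sizes:
        current = layers[-1]
        seen = set(current)
        grown = list(current)
        for v in current:
            nbrs = adj[v]
            picked = nbrs if len(nbrs) <= k else rng.sample(nbrs, k)
            for u in picked:
                if u not in seen:
                    seen.add(u)
                    grown.append(u)
        layers.append(grown)
    return layers

# Toy undirected graph as an adjacency list
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 5], 4: [1], 5: [3]}
rng = random.Random(0)
layers = sample_hops(adj, targets=[0], sizes=[2, 2], rng=rng)

for hop, nodes in enumerate(layers):
    print(f"hop {hop}: {len(nodes)} nodes")
```

Message passing then runs in the opposite direction, from the largest node set (`layers[-1]`) back to the targets (`layers[0]`), which matches the shrinking tensor shapes printed during training.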

In [15]:
train_idx = split_idx['train']
train_loader = NeighborSampler(data.edge_index, node_idx=train_idx,
                               sizes=[15, 10, 5], batch_size=64,
                               shuffle=True)




In [16]:
test_idx = split_idx['test']
test_loader = NeighborSampler(data.edge_index, node_idx=test_idx,
                               sizes=[15, 10, 5], batch_size=64,
                               shuffle=False)


In [17]:
from torch.nn import BatchNorm1d

class SAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=3):
        super(SAGE, self).__init__()

        self.num_layers = num_layers

        self.convs = torch.nn.ModuleList()
        self.batch_norms = torch.nn.ModuleList()

        self.convs.append(SAGEConv(in_channels, hidden_channels))
        self.batch_norms.append(BatchNorm1d(hidden_channels))

        for _ in range(num_layers - 2):
            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
            self.batch_norms.append(BatchNorm1d(hidden_channels))
        
        self.convs.append(SAGEConv(hidden_channels, out_channels))

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.batch_norms:
            bn.reset_parameters()

    def forward(self, x, adjs):
        layer_embeddings = []

        for i, (edge_index, _, size) in enumerate(adjs):
            x_target = x[:size[1]]  # Target nodes are always placed first.
            print("shape of x_target", x_target.shape)
            x = self.convs[i]((x, x_target), edge_index)

            print("shape of x", x.shape)
            if i != self.num_layers - 1:
                x = self.batch_norms[i](x)
                x = F.relu(x)
                x = F.dropout(x, p=0.5, training=self.training)
            
            # if i > 0:  # Add residual connection
            #     x = x + layer_embeddings[-1]

            print(x.shape)
            layer_embeddings.append(x)

        return tuple(layer_embeddings)

    def inference(self, x_all, subgraph_loader, device):
        
        # no fixed total: the number of target nodes depends on the loader passed in
        pbar = tqdm(desc='Evaluating')

        layer_embeddings = []
        
        xs = []
        for batch_size, n_id, adjs in subgraph_loader:
            adjs = [adj.to(device) for adj in adjs]
            # edge_index, _, size = adjs
            
            # for l in range(len(size)):
            #     size[l] = torch.tensor( [item.cpu().detach().numpy() for item in size[l]] )
            x = x_all[n_id]

            for i, (edge_index, _, size) in enumerate(adjs):
                x_target = x[:size[1]]  # Target nodes are always placed first.
                x = self.convs[i]((x, x_target), edge_index)

                if i != self.num_layers - 1:
                    x = self.batch_norms[i](x)
                    x = F.relu(x)
                    x = F.dropout(x, p=0.5, training=self.training)
            
            xs.append(x)
            pbar.update(batch_size)

        x_all = torch.cat(xs, dim=0)

        layer_embeddings = x_all
                
        pbar.close()

        return layer_embeddings

In [18]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SAGE(dataset.num_features, 512, dataset.num_classes, num_layers=3)
model = model.to(device)

In [19]:
# loading node feature matrix and node labels
x = data.x.to(device)
print(x.shape)
y = data.y.squeeze().to(device)

torch.Size([1329, 1])


In [20]:
def train(epoch):
    model.train()

    #pbar = tqdm(total=train_idx.size(0))
    #pbar.set_description(f'Epoch {epoch:02d}')

    total_loss = total_correct = 0
    for batch_size, n_id, adjs in train_loader:
        # `adjs` holds a list of `(edge_index, e_id, size)` tuples.
        print(len(n_id))
        adjs = [adj.to(device) for adj in adjs]
        optimizer.zero_grad()
        print(x[n_id].shape)  
        print(len(adjs))  
        l1_emb, l2_emb, l3_emb = model(x[n_id], adjs=adjs)
        #print("Layer 1 embeddings", l1_emb.shape)
        #print("Layer 2 embeddings", l2_emb.shape)
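        out = l3_emb.log_softmax(dim=-1)
        # `out` already holds log-probabilities, so use the NLL loss
        # (F.cross_entropy would apply log_softmax a second time)
        loss = F.nll_loss(out, y[n_id[:batch_size]])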
        loss.backward()
        optimizer.step()

        total_loss += float(loss)
        total_correct += int(out.argmax(dim=-1).eq(y[n_id[:batch_size]]).sum())
        #pbar.update(batch_size)

    #pbar.close()

    loss = total_loss / len(train_loader)
    approx_acc = total_correct / train_idx.size(0)

    return loss, approx_acc

In [21]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
tick=time.time()
for epoch in range(1, 101):
    loss, acc = train(epoch)
    print(f'Epoch {epoch:02d}, Loss: {loss:.4f}, Approx. Train: {acc:.4f}')
tock=time.time()
print ("training time(seconds):", tock-tick)
print("accuracy", acc)

946
torch.Size([946, 1])
3
shape of x_target torch.Size([784, 1])
shape of x torch.Size([784, 512])
torch.Size([784, 512])
shape of x_target torch.Size([324, 512])
shape of x torch.Size([324, 512])
torch.Size([324, 512])
shape of x_target torch.Size([64, 512])
shape of x torch.Size([64, 2])
torch.Size([64, 2])
943
torch.Size([943, 1])
3
shape of x_target torch.Size([777, 1])
shape of x torch.Size([777, 512])
torch.Size([777, 512])
shape of x_target torch.Size([320, 512])
shape of x torch.Size([320, 512])
torch.Size([320, 512])
shape of x_target torch.Size([64, 512])
shape of x torch.Size([64, 2])
torch.Size([64, 2])
942
torch.Size([942, 1])
3
shape of x_target torch.Size([784, 1])
shape of x torch.Size([784, 512])
torch.Size([784, 512])
shape of x_target torch.Size([328, 512])
shape of x torch.Size([328, 512])
torch.Size([328, 512])
shape of x_target torch.Size([64, 512])
shape of x torch.Size([64, 2])
torch.Size([64, 2])
945
torch.Size([945, 1])
3
shape of x_target torch.Size([787, 1]

In [22]:
@torch.no_grad()
def test():
    model.eval()

    l3_embeddings = model.inference(x, test_loader, device)
    out = l3_embeddings.log_softmax(dim=-1)
    y_true = y[test_idx].cpu().unsqueeze(-1)
    y_pred = out.argmax(dim=-1, keepdim=True)

    return y_true,y_pred

In [23]:
from torchmetrics import Accuracy, Precision, Recall, F1Score, ConfusionMatrix

accuracy = Accuracy(task='binary').to(device)
precision = Precision(task='binary').to(device)
recall = Recall(task='binary').to(device)
f1 = F1Score(task='binary').to(device)
confmat = ConfusionMatrix(task='binary').to(device)

# shapes
y_true, y_pred = test()
print(len(y_true), len(y_pred))
y_true = y_true.view(-1).to(device)
y_pred = y_pred.view(-1).to(device)

test_acc = accuracy(y_pred,y_true)
test_precision = precision(y_pred,y_true)
test_f1score = f1(y_pred,y_true)
test_recall = recall(y_pred,y_true)
conf_matrix = confmat(y_pred, y_true)

TN_test, FP_test, FN_test, TP_test = conf_matrix.view(-1).tolist()

Evaluating: : 399it [00:00, 6506.84it/s]           

399 399





In [24]:
print('Test Accuracy: %s' % test_acc.item())
print('test precision: %s' % test_precision.item())
print('Test f1 score: %s' % test_f1score.item())
print('Test recall: %s' % test_recall.item())
print(" #### confusion matrix test: ")
print( "TP",TP_test,"FP",FP_test)
print("TN", TN_test,"FN",FN_test)

Test Accuracy: 0.7819548845291138
test precision: 0.7922077775001526
Test f1 score: 0.8751793503761292
Test recall: 0.9775640964508057
 #### confusion matrix test: 
TP 305 FP 80
TN 7 FN 7
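
The reported test metrics can be recomputed directly from the confusion matrix, which also makes it easy to compare against a trivial baseline. The sketch below uses only the four counts printed above; the extra quantities (specificity, balanced accuracy, majority baseline) are additions for illustration and were not computed in the original run.

```python
# Counts taken from the test confusion matrix above
TP, FP, TN, FN = 305, 80, 7, 7
total = TP + FP + TN + FN

accuracy = (TP + TN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# Additional diagnostics for an imbalanced problem
specificity = TN / (TN + FP)                        # recall on the negative class
balanced_accuracy = (recall + specificity) / 2
majority_baseline = max(TP + FN, TN + FP) / total   # always predict the larger class

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"specificity={specificity:.4f} balanced_accuracy={balanced_accuracy:.4f} "
      f"majority_baseline={majority_baseline:.4f}")
```

With these counts the accuracy equals the majority-class baseline (312 of 399 test nodes are positive), and only 7 of 87 negatives are recognized, so the high recall and F1 mostly reflect the model predicting the positive class almost everywhere.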


In [25]:
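# Evaluate on the training nodes with an unshuffled loader: `train_loader` uses
# shuffle=True, so its predictions would not line up with `y[train_idx]`.
train_eval_loader = NeighborSampler(data.edge_index, node_idx=train_idx,
                                    sizes=[15, 10, 5], batch_size=64,
                                    shuffle=False)

@torch.no_grad()
def test_train():
    model.eval()

    l3_embeddings = model.inference(x, train_eval_loader, device)
    out = l3_embeddings.log_softmax(dim=-1)
    y_true = y[train_idx].cpu().unsqueeze(-1)
    y_pred = out.argmax(dim=-1, keepdim=True)

    return y_true,y_pred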

In [26]:
y_true, y_pred = test_train()
print(len(y_true), len(y_pred))
y_true = y_true.view(-1).to(device)
y_pred = y_pred.view(-1).to(device)

train_acc = accuracy(y_pred,y_true)
train_precision = precision(y_pred,y_true)
train_f1score = f1(y_pred,y_true)
train_recall = recall(y_pred,y_true)
conf_matrix = confmat(y_pred, y_true)

TN_train, FP_train, FN_train, TP_train = conf_matrix.view(-1).tolist()

print('Train Accuracy: %s' % train_acc.item())
print('train precision: %s' % train_precision.item())
print('Train f1 score: %s' % train_f1score.item())
print('Train recall: %s' % train_recall.item())
print(" #### confusion matrix train: ")
print( "TP",TP_train,"FP",FP_train)
print("TN", TN_train,"FN",FN_train)

Evaluating: : 930it [00:00, 10853.71it/s]          

930 930
Train Accuracy: 0.7322580814361572
train precision: 0.7573696374893188
Train f1 score: 0.8429021835327148
Train recall: 0.9502133727073669
 #### confusion matrix train: 
TP 668 FP 214
TN 13 FN 35





In [27]:
#compute the number of trainable parameters:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_parameter = count_parameters(model)
print(total_parameter)

530434


In [28]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

Model's state_dict:
convs.0.lin_l.weight 	 torch.Size([512, 1])
convs.0.lin_l.bias 	 torch.Size([512])
convs.0.lin_r.weight 	 torch.Size([512, 1])
convs.1.lin_l.weight 	 torch.Size([512, 512])
convs.1.lin_l.bias 	 torch.Size([512])
convs.1.lin_r.weight 	 torch.Size([512, 512])
convs.2.lin_l.weight 	 torch.Size([2, 512])
convs.2.lin_l.bias 	 torch.Size([2])
convs.2.lin_r.weight 	 torch.Size([2, 512])
batch_norms.0.weight 	 torch.Size([512])
batch_norms.0.bias 	 torch.Size([512])
batch_norms.0.running_mean 	 torch.Size([512])
batch_norms.0.running_var 	 torch.Size([512])
batch_norms.0.num_batches_tracked 	 torch.Size([])
batch_norms.1.weight 	 torch.Size([512])
batch_norms.1.bias 	 torch.Size([512])
batch_norms.1.running_mean 	 torch.Size([512])
batch_norms.1.running_var 	 torch.Size([512])
batch_norms.1.num_batches_tracked 	 torch.Size([])
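
The parameter count of 530434 can be checked by hand from the shapes listed above. The helper names below are made up for this sketch; the shapes come from the printed `state_dict`, where each `SAGEConv` has a `lin_l` weight and bias plus a `lin_r` weight without bias, and each `BatchNorm1d` has a trainable weight and bias (the running statistics are buffers, not trainable parameters).

```python
def sage_conv_params(in_ch, out_ch):
    """lin_l (weight + bias) plus lin_r (weight only), per the state_dict shapes."""
    lin_l = out_ch * in_ch + out_ch
    lin_r = out_ch * in_ch
    return lin_l + lin_r

def batch_norm_params(channels):
    """Trainable scale and shift; running_mean/var are not counted."""
    return 2 * channels

in_ch, hidden, out_ch = 1, 512, 2

total = (
    sage_conv_params(in_ch, hidden)       # convs.0
    + sage_conv_params(hidden, hidden)    # convs.1
    + sage_conv_params(hidden, out_ch)    # convs.2
    + 2 * batch_norm_params(hidden)       # batch_norms.0 and batch_norms.1
)
print(total)
```

Almost all of the parameters sit in the middle 512-to-512 layer, which is large relative to a graph of 1329 nodes with a single input feature.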


In [29]:
torch.save(model,f'graphsage_model_{disease_name}_lastversion')

Restart the runtime, then run the cells below to reload the saved model:

In [30]:
%cd DATA

[Errno 2] No such file or directory: 'DATA'
/home/ubuntu/workspace/GNNs-on-Biological-data/DATA


In [31]:
# # Installing Pytorch Geometric 
# %%capture
# !pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.10.0+cu113.html
# !pip install -q torch-sparse -f https://pytorch-geometric.com/whl/torch-1.10.0+cu113.html
# !pip install -q torch-cluster -f https://pytorch-geometric.com/whl/torch-1.10.0+cu113.html
# !pip install -q torch-geometric

# !pip install umap-learn
# !pip install networkx

In [32]:
import collections
import os.path as osp
import random
import time

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.nn.functional as F
import torch_geometric.transforms as T
import umap
from pandas.core.common import flatten
from scipy.special import softmax
from sklearn.model_selection import train_test_split
from torch_geometric.data import Data, InMemoryDataset, NeighborSampler
from torch_geometric.nn import SAGEConv
from tqdm import tqdm

sns.set(rc={'figure.figsize':(16.7,8.27)})
sns.set_theme(style="ticks")


# from torch_geometric.utils import metric

In [33]:
# `disease_name` must be defined again after restarting the runtime
disease_name = "Schizophrenia"
balanced_G = nx.read_gpickle(f"{disease_name}_balanced_more.gpickle")

In [34]:
G=balanced_G

In [35]:
# binary label per node: 1 if the node's 'label' attribute is non-zero, else 0
labels = np.asarray([G.nodes[i]['label'] != 0 for i in G.nodes]).astype(np.int64)

# build the adjacency matrix with NetworkX and SciPy; the edge index is derived from it below
adj = nx.to_scipy_sparse_matrix(G).tocoo() #coordinate format
#print(adj)
#create edge index in the proper way
row = torch.from_numpy(adj.row.astype(np.int64)).to(torch.long) #create a torch tensor from numpy in long format : for row indexes
col = torch.from_numpy(adj.col.astype(np.int64)).to(torch.long) #                                                   for column indexes
edge_index = torch.stack([row, col], dim=0)
#display(edge_index)


# Use the node degree as the only input feature. For simplicity, the feature vector
# describing each node is just its (standardized) degree, a deliberately simple choice.
# Other node embeddings could be used instead, see e.g.
# https://medium.com/@st3llasia/graph-embedding-techniques-7d5386c88c5
embeddings = np.array(list(dict(G.degree()).values())) #list the values of degree of each node as numpy array
# normalizing degree values
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
embeddings = scale.fit_transform(embeddings.reshape(-1,1))
# custom pytorch dataset
class PPIDATASET(InMemoryDataset):
    def __init__(self, transform=None):
        super(PPIDATASET, self).__init__('.', transform, None, None) #pre transform and pre filter: None, we don't need them
        data = Data(edge_index=edge_index) #Data : A data object describing a homogeneous graph.  for more : https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Data
        data.num_nodes = G.number_of_nodes()
        # embedding 
        data.x = torch.from_numpy(embeddings).type(torch.float32)
        # labels
        y = torch.from_numpy(labels).type(torch.long)
        data.y = y.clone().detach() # detach from any computation graph; gradients are not needed for labels
        data.num_classes = 2
        # split the nodes into train (70%), validation (15%) and test (15%)
        train_ratio = 0.70
        validation_ratio = 0.15
        test_ratio = 0.15
        X_train, X_test, y_train, y_test = train_test_split(pd.Series(G.nodes()),  pd.Series(labels), test_size=1 - train_ratio, random_state=42)
        # fix the seed here as well so the validation/test split is reproducible
        X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=42)
        n_nodes = G.number_of_nodes()
        # create boolean train and test masks over all nodes (no mask is built for the validation nodes):
        # train_mask marks the nodes used for training,
        # test_mask marks the nodes held out for testing
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[X_train.index] = True
        test_mask[X_test.index] = True
        data['train_mask'] = train_mask
        data['test_mask'] = test_mask
        data['X_train']=X_train
        data['X_test']=X_test
        data['X_val']=X_val
        #data['y_train']=y_train
        #data['y_test']=X_test
        self.data, self.slices = self.collate([data])
    # def _download(self):
    #     return
    # def _process(self):
    #     return
    # def __repr__(self):
    #     return '{}()'.format(self.__class__.__name__)
dataset = PPIDATASET()
# Here the dataset contains a single undirected protein-protein interaction graph.
# Reminder: the dataset behaves like a list of graphs; this one has only one element:
data = dataset[0] #now data is ready for training and testing


The scipy.sparse array containers will be used instead of matrices
in Networkx 3.0. Use `to_scipy_sparse_array` instead.
  adj = nx.to_scipy_sparse_matrix(G).tocoo() #coordinate format


In [36]:
split_idx={}
split_idx['test']=torch.tensor(sorted(data.X_test.index.values))
split_idx['train']=torch.tensor(sorted(data.X_train.index.values))
split_idx['valid']= torch.tensor(sorted(data.X_val.index.values))

In [37]:
subgraph_loader = NeighborSampler(data.edge_index, node_idx=None, sizes=[-1],
                                  batch_size=64, shuffle=False)



In [38]:
from torch.nn import BatchNorm1d

class SAGE(torch.nn.Module):
    # This class must mirror the architecture that was trained and saved
    # (including the BatchNorm layers): `torch.load` restores the stored
    # modules, but the methods come from the class defined in this session.
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=3):
        super(SAGE, self).__init__()

        self.num_layers = num_layers

        self.convs = torch.nn.ModuleList()
        self.batch_norms = torch.nn.ModuleList()
        self.convs.append(SAGEConv(in_channels, hidden_channels))
        self.batch_norms.append(BatchNorm1d(hidden_channels))
        for _ in range(num_layers - 2):
            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
            self.batch_norms.append(BatchNorm1d(hidden_channels))
        self.convs.append(SAGEConv(hidden_channels, out_channels))

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.batch_norms:
            bn.reset_parameters()

    def forward(self, x, adjs):
        # `train_loader` computes the k-hop neighborhood of a batch of nodes,
        # and returns, for each layer, a bipartite graph object, holding the
        # bipartite edges `edge_index`, the index `e_id` of the original edges,
        # and the size/shape `size` of the bipartite graph.
        # Target nodes are also included in the source nodes so that one can
        # easily apply skip-connections or add self-loops.
        layer_embeddings = []
        for i, (edge_index, _, size) in enumerate(adjs):
            x_target = x[:size[1]]  # Target nodes are always placed first.
            x = self.convs[i]((x, x_target), edge_index)
            if i != self.num_layers - 1:
                x = self.batch_norms[i](x)
                x = F.relu(x)
                x = F.dropout(x, p=0.5, training=self.training)
            layer_embeddings.append(x)
        # one embedding tensor per layer (three for the default num_layers=3)
        return tuple(layer_embeddings)

    def inference(self, x_all):
        pbar = tqdm(total=x_all.size(0) * self.num_layers)
        pbar.set_description('Evaluating')

        # Compute representations of nodes layer by layer, using *all*
        # available edges. This leads to faster computation in contrast to
        # immediately computing the final representations of each batch.
        # Relies on the global `subgraph_loader` and `device`.
        layer_embeddings = []
        for i in range(self.num_layers):
            xs = []
            for batch_size, n_id, adj in subgraph_loader:
                edge_index, _, size = adj.to(device)
                x = x_all[n_id].to(device)
                x_target = x[:size[1]]
                x = self.convs[i]((x, x_target), edge_index)
                if i != self.num_layers - 1:
                    x = self.batch_norms[i](x)
                    x = F.relu(x)
                xs.append(x)

                pbar.update(batch_size)

            # the output of this layer for all nodes is the input of the next layer
            x_all = torch.cat(xs, dim=0)
            layer_embeddings.append(x_all)

        pbar.close()

        return tuple(layer_embeddings)

In [39]:
model = torch.load(f'graphsage_model_{disease_name}_lastversion')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [40]:
# load node feature matrix and labels
x = data.x.to(device)
y = data.y.squeeze().to(device)

In [41]:
@torch.no_grad()
def test():
    model.eval()
    
    l1_embeddings, l2_embeddings, l3_embeddings = model.inference(x)
    out = l3_embeddings 
    y_true = y.cpu().unsqueeze(-1)
    y_pred = out.argmax(dim=-1, keepdim=True)

    return y_true,y_pred

In [42]:
# from torchmetrics import Accuracy, Precision, Recall, F1Score, ConfusionMatrix

# accuracy = Accuracy(task='binary').to(device)
# precision = Precision(task='binary').to(device)
# recall = Recall(task='binary').to(device)
# f1 = F1Score(task='binary').to(device)
# confmat = ConfusionMatrix(task='binary').to(device)

# # shapes
# y_true, y_pred = test()
# y_true = y_true.view(-1).to(device)
# y_pred = y_pred.view(-1).to(device)

# test_acc = accuracy(y_pred,y_true)
# test_precision = precision(y_pred,y_true)
# test_f1score = f1(y_pred,y_true)
# test_recall = recall(y_pred,y_true)
# conf_matrix = confmat(y_pred, y_true)

# TN_test, FP_test, FN_test, TP_test = conf_matrix.view(-1).tolist()

In [43]:
# print('Test Accuracy: %s' % test_acc.item())
# print('test precision: %s' % test_precision.item())
# print('Test f1 score: %s' % test_f1score.item())
# print('Test recall: %s' % test_recall.item())
# print(" #### confusion matrix test: ")
# print( "TP",TP_test,"FP",FP_test)
# print("TN", TN_test,"FN",FN_test)

In [44]:
class SAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=3):
        super(SAGE, self).__init__()

        self.num_layers = num_layers

        self.convs = torch.nn.ModuleList()
        self.convs.append(SAGEConv(in_channels, hidden_channels))
        for _ in range(num_layers - 2):
            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
        self.convs.append(SAGEConv(hidden_channels, out_channels))

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self, x, edge_index, edge_weight, adjs):
        # `train_loader` computes the k-hop neighborhood of a batch of nodes,
        # and returns, for each layer, a bipartite graph object, holding the
        # bipartite edges `edge_index`, the index `e_id` of the original edges,
        # and the size/shape `size` of the bipartite graph.
        # Target nodes are also included in the source nodes so that one can
        # easily apply skip-connections or add self-loops.
        for i, (edge_index, _, size) in enumerate(adjs):
            xs = []
            x_target = x[:size[1]]  # Target nodes are always placed first.
            x = self.convs[i]((x, x_target), edge_index)
            if i != self.num_layers - 1:
                x = F.relu(x)
                x = F.dropout(x, p=0.5, training=self.training)
            xs.append(x)
            if i == 0: 
                x_all = torch.cat(xs, dim=0)
                layer_1_embeddings = x_all
            elif i == 1:
                x_all = torch.cat(xs, dim=0)
                layer_2_embeddings = x_all
            elif i == 2:
                x_all = torch.cat(xs, dim=0)
                layer_3_embeddings = x_all    
        return x.log_softmax(dim=-1)

In [45]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model = SAGE(dataset.num_features, 512, dataset.num_classes, num_layers=3).to(device)
# data = data.to(device)
# lr = 1e-1

# optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# epochs = 200

# x, edge_index, edge_weight = data.x, data.edge_index, None

# num_epochs = 200

# for epoch in tqdm(range(num_epochs)):
#     model.train()
#     optimizer.zero_grad()
#     log_logits = model(x, edge_index, edge_weight, adj)
#     loss = F.nll_loss(log_logits[data.train_mask], data.y[data.train_mask])
#     loss.backward()
#     optimizer.step()

# with torch.no_grad():
#   model.eval()
#   logits = model(x, edge_index, edge_weight)

#   #test_mask = data['test_mask']
#   preds = logits.max(1)[1]

In [46]:
# explainer = GNNExplainer(model, epochs=200, return_type='log_prob', num_hops = 2) #if num_hops is none it is detected from the num of message passing ops.
#                                                                                   #it is needed to tell the explainer "how far to go" to look for explanations.
# node_idx = 1 
# pred_node_idx = preds[node_idx].item()
# print("Explaining node ", node_idx, " with predicted class: ", pred_node_idx)

# node_feat_mask, edge_mask = explainer.explain_node(node_idx, x, edge_index, edge_weight=edge_weight)

In [47]:
# ax, G = explainer.visualize_subgraph(node_idx, edge_index, edge_mask, y=data.y, seed = node_idx, threshold = None) # set threshold to define the hard mask according to the sparsity you want to obtain and how strict you want to be
# plt.show()