## Problem Definition: Drug-Drug Interaction Prediction
The ogbl-ddi dataset from [OGB (Open Graph Benchmark)](https://ogb.stanford.edu/) is a homogeneous, unweighted, undirected graph, representing the drug-drug interaction network. Each node represents an FDA-approved or experimental drug. Edges represent interactions between drugs and can be interpreted as a phenomenon where the joint effect of taking the two drugs together is considerably different from the expected effect in which drugs act independently of each other.

Prediction task: The task is to predict drug-drug interactions given information on already known drug-drug interactions. For evaluation, the model should rank true drug interactions higher than non-interacting drug pairs. Specifically, each true drug interaction is ranked among a set of approximately 100,000 randomly sampled negative drug interactions, and the metric is the fraction of positive edges ranked at the K-th place or above (Hits@K). We found K = 20 to be a good threshold in our preliminary experiments.
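
As a rough illustration of the metric, the sketch below computes Hits@K from two lists of scores in plain Python. The function name `hits_at_k` and the toy scores are hypothetical. The convention that a positive counts as a hit only when it scores strictly above the K-th highest negative is an assumption of this sketch; the OGB `Evaluator` remains the reference for reported numbers.

```python
def hits_at_k(pos_scores, neg_scores, k=20):
    """Fraction of positive scores that beat the k-th highest negative score."""
    if not pos_scores:
        raise ValueError("pos_scores must not be empty")
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive is trivially within the top k.
        return 1.0
    # The score of the k-th best negative acts as the threshold.
    threshold = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for s in pos_scores if s > threshold)
    return hits / len(pos_scores)


# Toy example with hypothetical scores (higher = more likely to interact).
pos = [0.9, 0.8, 0.3, 0.1]
neg = [0.85, 0.5, 0.4, 0.2, 0.05]
score = hits_at_k(pos, neg, k=2)
print(score)
```

With `k=2` the threshold is the second-highest negative score, so only the positives scoring above it count as hits.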

Main Graph Structure of OGB DDI
The OGB DDI dataset is represented as an undirected graph, where:
* Nodes (𝑉): Represent drugs (4,267 nodes).
* Edges (𝐸): Represent known drug-drug interactions (1,334,889 edges).
* Graph Type: Homogeneous graph (all nodes are of the same type: drugs).
* No Node Features: Unlike many other OGB datasets, nodes do not have feature vectors.

Edge Prediction Task: The goal is binary classification of edges, i.e. predicting whether a connection (interaction) exists between two drugs.
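
Framing the task as binary classification requires negative examples, i.e. drug pairs with no known interaction. The following is a minimal sketch of uniform negative sampling on a toy graph. The function `sample_negative_edges` and the toy edge list are hypothetical, and the official ogbl-ddi split already ships its own fixed negative edges for validation and testing.

```python
import random


def sample_negative_edges(num_nodes, known_edges, num_samples, seed=0):
    """Sample distinct node pairs that are not in known_edges (undirected)."""
    rng = random.Random(seed)
    # Store each undirected edge once as a sorted tuple for fast lookup.
    known = {tuple(sorted(e)) for e in known_edges}
    max_negatives = num_nodes * (num_nodes - 1) // 2 - len(known)
    if num_samples > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u == v:
            continue  # skip self-loops
        pair = (min(u, v), max(u, v))
        if pair not in known:
            negatives.add(pair)
    return sorted(negatives)


# Toy graph with 5 nodes and 4 known interactions.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
neg_edges = sample_negative_edges(num_nodes=5, known_edges=edges, num_samples=3)
print(neg_edges)
```

Rejection sampling like this is simple but slows down as the graph approaches a complete graph, since most random pairs are then already edges.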

Graph Properties
* Dense for Its Size: The graph has a high number of edges relative to its number of nodes.
* High Connectivity: Most drugs are interconnected, forming a dense interaction network.

# 0 - Installation & Imports

In [1]:
%%time
!pip install -q torch==2.4.0
!pip install -q pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.4.0+cpu.html
!pip install -q torch_geometric==2.4.0
!pip install -q ogb==1.3.6

!pip install -q networkx
!pip install -q networkit
!pip install -q dgl==1.0.0

CPU times: user 255 ms, sys: 22.2 ms, total: 278 ms
Wall time: 35.8 s


In [2]:
from ogb.linkproppred.dataset_dgl import DglLinkPropPredDataset
import numpy as np
from IPython.display import clear_output

import networkx as nx
from networkx.algorithms.approximation.distance_measures import diameter


from torch import serialization
from torch_geometric.data.storage import GlobalStorage
from torch_geometric.data.data import DataEdgeAttr, DataTensorAttr
serialization.add_safe_globals([GlobalStorage, DataEdgeAttr, DataTensorAttr])

import networkit as nk

from ogb.linkproppred import PygLinkPropPredDataset, LinkPropPredDataset, Evaluator

import torch_geometric.transforms as T
from torch_geometric.utils import to_networkx, to_networkit, to_trimesh, get_embeddings

* **NetworkX** is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

* **NetworKit** is a growing open-source toolkit for large-scale network analysis. Its aim is to provide tools for the analysis of large networks. For this purpose, it implements efficient graph algorithms, many of them parallel to utilize multicore architectures.

* The **Open Graph Benchmark (OGB)** is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split by the OGB data loaders, and model performance can be evaluated in a unified manner with the OGB Evaluator. OGB also maintains leaderboards to keep track of state-of-the-art results.

* **PyG (PyTorch Geometric)** is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

# 1 - Dataset Loading

Three approaches are available for accessing and loading datasets from the OGB benchmark:
1.   Library-Agnostic Loader
2.   PyTorch Geometric Loader
3.   DGL Loader

### 1.1 - Library-Agnostic Loader

In [3]:
from ogb.linkproppred import LinkPropPredDataset
dataset_dict = LinkPropPredDataset(name = "ogbl-ddi" )
clear_output()
print("Loaded dataset: ", dataset_dict)
print("Number of graphs in dataset:", len(dataset_dict))

Loaded dataset:  LinkPropPredDataset(1)
Number of graphs in dataset: 1


In [4]:
dataset_dict.meta_info

Unnamed: 0,ogbl-ddi
eval metric,hits@20
task type,link prediction
download_name,ddi
version,1
url,http://snap.stanford.edu/ogb/data/linkproppred...
add_inverse_edge,True
has_node_attr,False
has_edge_attr,False
split,target
additional node files,


In [5]:
graph_object_dictionary = dataset_dict[0]
print("First graph of dataset:\n",graph_object_dictionary)

First graph of dataset:
 {'edge_index': array([[4039, 2424, 4039, ...,  338,  835, 3554],
       [2424, 4039,  225, ...,  708, 3554,  835]]), 'edge_feat': None, 'node_feat': None, 'num_nodes': 4267}


The library-agnostic graph object is a dictionary containing the following keys: edge_index, edge_feat, node_feat, and num_nodes, which are detailed below:
* **edge_index**: numpy ndarray of shape (2, num_edges), where each column represents an edge. The first row and the second row represent the indices of source and target nodes. Undirected edges are represented by bi-directional edges.
* **edge_feat**: numpy ndarray of shape (num_edges, edgefeat_dim), where edgefeat_dim is the dimensionality of edge features and i-th row represents the feature of i-th edge. This can be None if no input edge features are available.
* **node_feat**: numpy ndarray of shape (num_nodes, nodefeat_dim), where nodefeat_dim is the dimensionality of node features and i-th row represents the feature of i-th node. This can be None if no input node features are available.
* **num_nodes**: number of nodes in the graph.
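
To make the `edge_index` layout concrete, here is a small sketch using plain Python lists (the real object is a NumPy array). It expands a toy undirected edge list into the bi-directional two-row form and then collapses it back. The helper names `to_bidirectional` and `unique_undirected` are made up for this illustration.

```python
def to_bidirectional(undirected_edges):
    """Return [sources, targets] with each undirected edge stored in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


def unique_undirected(edge_index):
    """Collapse a bi-directional edge_index to sorted unique (u, v) pairs with u < v."""
    sources, targets = edge_index
    return sorted({(min(u, v), max(u, v)) for u, v in zip(sources, targets)})


toy_edges = [(0, 1), (0, 2), (1, 2)]
edge_index = to_bidirectional(toy_edges)
print(edge_index)                     # two rows, one column per directed edge
print(len(edge_index[0]))             # number of directed edges
print(unique_undirected(edge_index))  # the original undirected edges
```

Each undirected edge occupies two columns, so the number of columns is twice the number of undirected edges.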

### 1.2 - PyTorch Geometric Loader

In [6]:
dataset_pyg = PygLinkPropPredDataset(name='ogbl-ddi')
clear_output()
print("Loaded dataset: ", dataset_pyg)
print("Number of graphs in dataset:", len(dataset_pyg))

Loaded dataset:  PygLinkPropPredDataset()
Number of graphs in dataset: 1


In [7]:
dataset_pyg.print_summary()

PygLinkPropPredDataset (#graphs=1):
+------------+----------+---------------+
|            |   #nodes |        #edges |
|------------+----------+---------------|
| mean       |     4267 |   2.13582e+06 |
| std        |      nan | nan           |
| min        |     4267 |   2.13582e+06 |
| quantile25 |     4267 |   2.13582e+06 |
| median     |     4267 |   2.13582e+06 |
| quantile75 |     4267 |   2.13582e+06 |
| max        |     4267 |   2.13582e+06 |
+------------+----------+---------------+




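The summary above reports the number of nodes and edges per graph in the dataset. Because this dataset contains only one graph, all statistics are identical and the standard deviation is undefined (`nan`).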

In [8]:
print('Number of classes:', dataset_pyg.num_classes,
      "\t\t- The number of distinct classes the nodes belong to (useful in node classification)")
print('Number of edge features:', dataset_pyg.num_edge_features,
      "\t- The number of features associated with each edge in the graph (e.g., edge weights)")
print('Number of node features:', dataset_pyg.num_node_features,
      "\t- The number of features associated with each node in the graph.")

Number of classes: 0 		- The number of distinct classes the nodes belong to (useful in node classification)
Number of edge features: 0 	- The number of features associated with each edge in the graph (e.g., edge weights)
Number of node features: 0 	- The number of features associated with each node in the graph.


In [9]:
print("Number of graphs in dataset loaded using PyG:", len(dataset_pyg))
graph_TG = dataset_pyg[0]  # contains only the training edges
print("Type of a graph within dataset loaded using PyG:", type(graph_TG))
print(graph_TG)
print(f"Number of nodes in the graph: {graph_TG.num_nodes}")
print(f"Number of edges in the graph: {graph_TG.num_edges}")
print(f"Number of node types: {graph_TG.num_node_types}")
print(f"Number of edge types: {graph_TG.num_edge_types}")

Number of graphs in dataset loaded using PyG: 1
Type of a graph within dataset loaded using PyG: <class 'torch_geometric.data.data.Data'>
Data(num_nodes=4267, edge_index=[2, 2135822])
Number of nodes in the graph: 4267
Number of edges in the graph: 2135822
Number of node types: 1
Number of edge types: 1


In [10]:
edge_split = dataset_pyg.get_edge_split()
clear_output()
print("Edge split is a:  ", type(edge_split), "  & its keys:  ",  edge_split.keys(), "" )
print("\t|\n\t|-TRAIN split is a:  ",type(edge_split['train'])," & its keys:  ",edge_split['train'].keys())
print("\t\t|\n\t\t|-TRAIN 'EDGE' split is a:  ",type(edge_split['train']['edge'])," & its shape:  ",
      edge_split['train']['edge'].shape,"\n\t\t|")
print("\t|-VALID split is a:  ",type(edge_split['valid']),"  & its keys:  ",edge_split['valid'].keys())
print("\t\t|\n\t\t|-VALID 'EDGE' split is a:  ",type(edge_split['valid']['edge'])," & its shape:  ",
      edge_split['valid']['edge'].shape)
print("\t\t|-VALID 'EDGE_NEG' split is a:  ",type(edge_split['valid']['edge_neg'])," & its shape:  ",
      edge_split['valid']['edge_neg'].shape,"\n\t\t|")

print("\t|-TEST split is a:  ",type(edge_split['test']),"  & its keys:  ",edge_split['test'].keys())
print("\t\t|\n\t\t|-TEST 'EDGE' split is a:  ",type(edge_split['test']['edge'])," & its shape:  ",
      edge_split['test']['edge'].shape)
print("\t\t|-TEST 'EDGE_NEG' split is a:  ",type(edge_split['test']['edge_neg'])," & its shape:  ",
      edge_split['test']['edge_neg'].shape,"\n\t\t|")

Edge split is a:   <class 'dict'>   & its keys:   dict_keys(['train', 'valid', 'test']) 
	|
	|-TRAIN split is a:   <class 'dict'>  & its keys:   dict_keys(['edge'])
		|
		|-TRAIN 'EDGE' split is a:   <class 'torch.Tensor'>  & its shape:   torch.Size([1067911, 2]) 
		|
	|-VALID split is a:   <class 'dict'>   & its keys:   dict_keys(['edge', 'edge_neg'])
		|
		|-VALID 'EDGE' split is a:   <class 'torch.Tensor'>  & its shape:   torch.Size([133489, 2])
		|-VALID 'EDGE_NEG' split is a:   <class 'torch.Tensor'>  & its shape:   torch.Size([101882, 2]) 
		|
	|-TEST split is a:   <class 'dict'>   & its keys:   dict_keys(['edge', 'edge_neg'])
		|
		|-TEST 'EDGE' split is a:   <class 'torch.Tensor'>  & its shape:   torch.Size([133489, 2])
		|-TEST 'EDGE_NEG' split is a:   <class 'torch.Tensor'>  & its shape:   torch.Size([95599, 2]) 
		|
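
The shapes printed above can be cross-checked with a little arithmetic. The sketch below hard-codes the positive edge counts from that output and computes each split's share of the total; the variable names are illustrative.

```python
# Positive edge counts copied from the edge split output above.
split_sizes = {"train": 1_067_911, "valid": 133_489, "test": 133_489}

total_edges = sum(split_sizes.values())
print(f"Total positive edges: {total_edges:,}")

# Share of each split relative to all positive edges.
fractions = {name: size / total_edges for name, size in split_sizes.items()}
for name, frac in fractions.items():
    print(f"{name:>5}: {split_sizes[name]:>9,} edges ({frac:.1%})")
```

The total matches the 1,334,889 interactions quoted in the dataset description, and the split works out to roughly 80/10/10.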


### 1.3 - DGL Loader

In [11]:
dataset_dgl = DglLinkPropPredDataset(name='ogbl-ddi')
print(dataset_dgl)
print("Number of graphs in dataset:", len(dataset_dgl))
graph_DGL = dataset_dgl[0]

DglLinkPropPredDataset(1)
Number of graphs in dataset: 1


In [12]:
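graph_DGL.adj             # method reference only (not called), so nothing is computed
graph_DGL.is_homogeneous  # evaluated, but not displayed by the notebook
graph_DGL.is_multigraph   # only this last expression is displayed below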

False

# 2 - EDA

In [13]:
nx_graph = to_networkx(graph_TG)
nx_undirected_graph = nx_graph.to_undirected()

In [14]:
print('Number of edges if Representing DDI as a directed graph:\t', nx_graph.number_of_edges())
print('Number of edges if Representing DDI as an undirected graph:\t', nx_undirected_graph.number_of_edges())

Number of edges if Representing DDI as a directed graph:	 2135822
Number of edges if Representing DDI as an undirected graph:	 1067911


> The ogbl-ddi dataset is a homogeneous, unweighted, undirected graph, representing the drug-drug interaction network. Each node represents an FDA-approved or experimental drug. Edges represent interactions between drugs.

In [15]:
G = nx.freeze(nx_undirected_graph)

In [16]:
print('Info of DDI graph','\n-'+'-'*40)
print('Number of nodes in graph:',G.number_of_nodes() )
print('Number of edges in graph:',G.number_of_edges())

print('\nDiameter of graph:', diameter(G))
print('\nDensity of graph:', nx.density(G))

Info of DDI graph 
-----------------------------------------
Number of nodes in graph: 4267
Number of edges in graph: 1067911

Diameter of graph: 5

Density of graph: 0.11733337464515507
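
The reported density can be reproduced by hand. For a simple undirected graph with n nodes and m edges, the density is 2m / (n(n - 1)), i.e. the number of edges divided by the number of possible node pairs. A short check with the counts printed above (variable names are illustrative):

```python
num_nodes = 4267       # nodes in the training graph
num_edges = 1_067_911  # undirected training edges

# Number of possible undirected edges between distinct nodes.
max_edges = num_nodes * (num_nodes - 1) // 2

density = num_edges / max_edges
print(f"Possible edges: {max_edges:,}")
print(f"Density: {density:.6f}")
```

In other words, about 11.7% of all possible drug pairs are connected in the training graph.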


In [18]:
graph_nk = nk.nxadapter.nx2nk(nx_undirected_graph, weightAttr=None)
asps = nk.distance.APSP(graph_nk)
asps.run()
arr = asps.getDistances(asarray=True)
np.unique(arr , return_counts=True)

(array([0., 1., 2., 3., 4., 5.]),
 array([    4267,  2135822, 11837652,  4047248,   180912,     1388]))
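
The distance histogram above already answers several structural questions. The sketch below hard-codes those counts and derives the exact diameter and the mean shortest-path length, assuming each count refers to ordered node pairs of the full distance matrix; variable names are illustrative.

```python
# Distance -> number of ordered node pairs, copied from the APSP output above.
distance_counts = {0: 4_267, 1: 2_135_822, 2: 11_837_652,
                   3: 4_047_248, 4: 180_912, 5: 1_388}
num_nodes = 4267

# Sanity check: the counts should cover the full num_nodes x num_nodes matrix.
total_pairs = sum(distance_counts.values())
all_pairs_covered = total_pairs == num_nodes ** 2

# The largest observed distance is the exact diameter.
exact_diameter = max(distance_counts)

# Mean distance over ordered pairs of distinct nodes.
distinct_pairs = total_pairs - distance_counts[0]
mean_distance = sum(d * c for d, c in distance_counts.items()) / distinct_pairs

print(all_pairs_covered, exact_diameter, round(mean_distance, 3))
```

Since the only distances present are 0 through 5, every pair of drugs is reachable and the training graph is connected. The maximum distance of 5 agrees with the value reported earlier by NetworkX's approximation routine, and the count for distance 1 equals the number of directed edges.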

In [20]:
nk.profiling.Profile.create(graph_nk).show()

