# Chapter 1: Graph 
source: https://doc.dgl.ai/en/1.1.x/guide/graph.html


## 1.1 Some Basic Definitions about Graphs (Graphs 101)

A graph $G = (V, E)$ is a structure used to represent entities (elements of the set $V$) and their relations (elements of the set $E$). $V$ is the set of nodes (vertices) and $E$ is the set of edges, where each edge $(u,v) \in E$ connects a pair of nodes $u, v \in V$.

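Graphs can be either directed or undirected. In a directed graph an edge $(u,v)$ is an ordered pair that points from $u$ to $v$, so $(u,v) \in E$ does not imply $(v,u) \in E$. In an undirected graph the pair is unordered, so $(u,v)$ and $(v,u)$ denote the same edge.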
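
To make the distinction concrete, here is a minimal pure-Python sketch (no DGL involved). It treats edges as ordered `(source, destination)` pairs and checks whether every edge has its reverse; the helper name `is_symmetric` is made up for this illustration.

```python
# Pure-Python sketch: edges are ordered (source, destination) pairs.
def is_symmetric(edges):
    """Return True if every edge (u, v) has its reverse (v, u) in the edge set."""
    edge_set = set(edges)
    return all((v, u) in edge_set for (u, v) in edge_set)


directed = [(0, 1), (0, 2), (1, 2)]
# An undirected graph can be encoded by listing each edge in both directions.
undirected_as_pairs = directed + [(v, u) for (u, v) in directed]

print(is_symmetric(directed))             # False
print(is_symmetric(undirected_as_pairs))  # True
```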

Graphs can be weighted or unweighted. In a weighted graph, each edge is associated with a scalar weight. Weights may represent distance between nodes or another measure of connectivity such as transaction amount in a transaction graph. 

Graphs can be homogeneous or heterogeneous. In a homogeneous graph all nodes are of one type and all edges are of one type. For example, in a social network we could consider nodes to be people and edges to be "is a friend of". Since there is only one node type, 'person', and one edge type, 'is a friend of', this is a homogeneous graph.

A transaction graph is an example of a heterogeneous graph: it has two node types (customers and merchants) and two edge types (customer-to-merchant purchases and customer-to-customer transfers, such as e-transferring a friend for buying lunch).

Multigraphs are graphs that can have multiple (directed) edges between the same pair of nodes, including self-loops.

## 1.2 Graphs, Nodes, Edges

DGL represents each node by a unique integer, called its **node ID**, and each edge by a pair of integers corresponding to the IDs of its end nodes.

DGL assigns each edge a unique integer, called its **edge ID**, based on the order in which it was added to the graph. Node and edge IDs both start from 0.

In DGL all edges are directed, and an edge (u,v) indicates that the direction goes from u to v. 

To specify multiple nodes, DGL uses a 1D integer tensor of node IDs (a PyTorch tensor, a TensorFlow Tensor, or an MXNet ndarray). DGL calls this format a "node-tensor".

To specify multiple edges, DGL uses a tuple of node-tensors $(U,V)$, where $(U[i], V[i])$ denotes an edge from $U[i]$ to $V[i]$.
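
The $(U,V)$ convention can be sketched without DGL. Plain Python lists stand in for the node-tensors here. The "largest ID plus one" rule for the node count is an assumption of this sketch, chosen because node IDs start at 0; it is not a statement about DGL internals.

```python
# Plain lists stand in for the 1D integer node-tensors U and V.
U = [0, 0, 0, 1]
V = [1, 2, 3, 3]

# Edge i runs from U[i] to V[i]; its edge ID is its position i.
edges_by_id = {eid: (src, dst) for eid, (src, dst) in enumerate(zip(U, V))}

# Node IDs start at 0, so the smallest node count that covers every
# endpoint is the largest ID plus one (assumption of this sketch).
inferred_num_nodes = max(U + V) + 1

print(edges_by_id)         # {0: (0, 1), 1: (0, 2), 2: (0, 3), 3: (1, 3)}
print(inferred_num_nodes)  # 4
```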

We can create a graph in DGL using the `dgl.graph()` function, which takes a set of edges as input. DGL also supports creating graphs from other data sources (see Section 1.4).

In [1]:
# An example of creating a graph with DGL
import dgl
import torch as th

# Edges: 0->1; 0->2; 0->3; 1->3
u, v = th.tensor([0,0,0,1]), th.tensor([1,2,3,3])
g = dgl.graph((u,v))
print(g)

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={})


In [4]:
# Node IDs
print(f"The nodes in the graph are: {g.nodes()}")

The nodes in the graph are: tensor([0, 1, 2, 3])


In [5]:
print(f"The edges in the graph are {g.edges()}")

The edges in the graph are (tensor([0, 0, 0, 1]), tensor([1, 2, 3, 3]))


In [7]:
# Print the source nodes, destination nodes, and edge IDs (form='all')
print(g.edges(form='all'))

(tensor([0, 0, 0, 1]), tensor([1, 2, 3, 3]), tensor([0, 1, 2, 3]))


In [8]:
# If the node with the largest ID is isolated (has no edges), set the number of nodes explicitly
g = dgl.graph((u, v), num_nodes=8)


In [10]:
# for undirected graphs we need to create a bidirected graph

bg = dgl.to_bidirected(g)
bg.edges(form='all')

(tensor([0, 0, 0, 1, 1, 2, 3, 3]),
 tensor([1, 2, 3, 0, 3, 0, 0, 1]),
 tensor([0, 1, 2, 3, 4, 5, 6, 7]))
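
The edge list printed above can be reproduced by hand, which may help build intuition for what a bidirected graph contains. The following is a pure-Python sketch, not DGL's implementation. The helper name `bidirected_edges` is hypothetical, and the sort is only there to match the order shown in the output.

```python
def bidirected_edges(src, dst):
    """Add the reverse of every edge, drop duplicates, and sort by (src, dst)."""
    pairs = set(zip(src, dst))
    pairs |= {(v, u) for (u, v) in pairs}
    return sorted(pairs)


src = [0, 0, 0, 1]
dst = [1, 2, 3, 3]
both = bidirected_edges(src, dst)

print(both)
# [(0, 1), (0, 2), (0, 3), (1, 0), (1, 3), (2, 0), (3, 0), (3, 1)]
print(len(both))  # 8
```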

In [None]:
# Check the number of edges in the bidirected graph
print(bg.num_edges())

- DGL can use either 32-bit or 64-bit integers to store node and edge IDs.
- With 64-bit IDs, DGL can handle graphs with up to $2^{63}-1$ nodes or edges.
- If a graph contains fewer than $2^{31}-1$ nodes or edges, one should use 32-bit integers for better speed and lower memory usage.
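
The rule of thumb in the list above can be written down as a small helper. This is a sketch only: `suggest_id_width` is a hypothetical function, not part of DGL, and it returns a plain string rather than a framework dtype.

```python
INT32_LIMIT = 2**31 - 1
INT64_LIMIT = 2**63 - 1


def suggest_id_width(num_nodes, num_edges):
    """Suggest an ID width following the thresholds quoted in the notes."""
    largest = max(num_nodes, num_edges)
    if largest < INT32_LIMIT:
        return "int32"  # smaller IDs: better speed and lower memory usage
    if largest <= INT64_LIMIT:
        return "int64"
    raise ValueError("graph is too large for 64-bit IDs")


print(suggest_id_width(4, 4))        # int32
print(suggest_id_width(10, 2**31))   # int64
```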

In [11]:
# Choosing the integer type for IDs: the default is 64-bit
edges = (th.tensor([2,5,3]), th.tensor([3,5,0]))
g64 = dgl.graph(edges)
print(g64.idtype)

torch.int64


In [13]:
g32 = dgl.graph(edges, idtype=th.int32)
print(g32.idtype)

torch.int32


## 1.3 Node and Edge Features

The nodes and edges of a `DGLGraph` can have several user-defined, named features for storing graph-specific properties. These are accessed through the `ndata` and `edata` interfaces, respectively.

In [19]:
g = dgl.graph(([0,0,1,5], [1,2,2,0]))
g

Graph(num_nodes=6, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

In [20]:
g.ndata['x'] = th.ones(g.num_nodes(), 3)
g.edata['x'] = th.ones(g.num_edges(), dtype=th.int32)
g

Graph(num_nodes=6, num_edges=4,
      ndata_schemes={'x': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={'x': Scheme(shape=(), dtype=torch.int32)})

In [21]:
# Different names may have different shapes
g.ndata['y'] = th.randn(g.num_nodes(), 5)
g.ndata['x'][1]

tensor([1., 1., 1.])

In [22]:
# Get features of edges 0 and 3
g.edata['x'][th.tensor([0,3])]

tensor([1, 1], dtype=torch.int32)

Key Points of ndata and edata interface:

- Only features of numerical types (e.g. float, double, int) are allowed. They can be scalars, vectors, or multidimensional tensors.
- Each node feature has a unique name and each edge feature has a unique name. The features for nodes and edges may have the same name. 
- A feature is created via tensor assignment, which assigns a feature to each node/edge in the graph. The leading dimension of that tensor must be equal to the number of nodes or edges in that graph. You cannot assign a feature to a subset of the nodes/edges in the graph. 
- Features of the same name must have the same dimensionality and data type. 
- The feature tensor is in row-major layout - each row-slice stores the feature of one node or edge. 
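
The leading-dimension rule is the one that most often trips people up, so here is a toy model of it. The `FeatureStore` class below is a hypothetical stand-in written for this illustration; it uses nested Python lists instead of tensors and is not how DGL stores features.

```python
class FeatureStore:
    """Toy stand-in for ndata/edata: maps a feature name to one row per item."""

    def __init__(self, num_items):
        self.num_items = num_items
        self._data = {}

    def __setitem__(self, name, rows):
        # The leading dimension must match the number of nodes (or edges),
        # so assigning a feature to only a subset is rejected.
        if len(rows) != self.num_items:
            raise ValueError(
                f"expected {self.num_items} rows for '{name}', got {len(rows)}"
            )
        self._data[name] = rows

    def __getitem__(self, name):
        return self._data[name]


ndata = FeatureStore(num_items=6)
ndata["x"] = [[1.0, 1.0, 1.0] for _ in range(6)]  # one 3-vector per node
print(ndata["x"][1])  # [1.0, 1.0, 1.0]

try:
    ndata["y"] = [[0.0] for _ in range(4)]  # only 4 rows for 6 nodes
except ValueError as err:
    print("rejected:", err)
```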

In [24]:
# For weighted graphs, one can store the weights as an edge feature as follows:
edges = (th.tensor([0,0,0,1]), th.tensor([1,2,3,3]))
weights = th.tensor([0.1, 0.6, 0.9, 0.7]) 
g = dgl.graph(edges)
g.edata['w'] = weights
g

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={'w': Scheme(shape=(), dtype=torch.float32)})

## 1.4 Creating Graphs from External Sources

DGL supports converting from external data sources such as:
1. Conversion from external Python libraries for graphs and sparse matrices (NetworkX and SciPy).
2. Loading graphs from disk. 

### Creating Graphs from External Libraries

In [25]:
import scipy.sparse as sp
spmat = sp.rand(100, 100, density=0.05)
dgl.from_scipy(spmat)

Graph(num_nodes=100, num_edges=500,
      ndata_schemes={}
      edata_schemes={})

In [26]:
import networkx as nx
nx_g = nx.path_graph(5)
dgl.from_networkx(nx_g)

Graph(num_nodes=5, num_edges=8,
      ndata_schemes={}
      edata_schemes={})

Note: `nx.path_graph()` creates an undirected graph by default, whereas DGL graphs are always directed. DGL internally converts each undirected edge into two directed edges, which is why the 4-edge path above becomes a graph with 8 edges. Using a directed NetworkX graph avoids this behaviour.
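
The edge count can be checked by hand with a short pure-Python sketch. The helper `undirected_to_directed` is hypothetical and only mimics the doubling described in the note. Adding a self-loop once is an assumption of this sketch.

```python
def undirected_to_directed(undirected_edges):
    """Expand each undirected edge {u, v} into (u, v) and (v, u)."""
    directed = []
    for u, v in undirected_edges:
        directed.append((u, v))
        if u != v:  # a self-loop is its own reverse, so it is added once
            directed.append((v, u))
    return directed


path_edges = [(i, i + 1) for i in range(4)]  # path on 5 nodes: 0-1-2-3-4
directed_edges = undirected_to_directed(path_edges)

print(len(path_edges), len(directed_edges))  # 4 8
```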

In [28]:
nxg = nx.DiGraph([(2,1), (1,2), (2,3), (0,0)])
dgl.from_networkx(nxg)

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

Note: DGL internally converts SciPy matrices and NetworkX graphs to tensors to construct graphs. Hence, these construction methods are not meant for performance-critical operations.

### Loading Graphs from Disk

DGL supports:
- Loading from CSV.
- Loading from JSON/GML, although this is not particularly fast because it relies on the readers built into NetworkX.
- The DGL binary format, which has an API built directly into DGL; this API also handles feature data and graph-level label data.
- S3 and HDFS are also supported through DGL.
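
As a rough idea of what loading from CSV can look like, the sketch below parses an edge list with Python's standard `csv` module. The file contents and the column names `src` and `dst` are assumptions made for this example; this is not DGL's own CSV loader.

```python
import csv
import io

# Hypothetical edge list; the column names "src" and "dst" are assumptions.
csv_text = """src,dst
0,1
0,2
0,3
1,3
"""

src, dst = [], []
for row in csv.DictReader(io.StringIO(csv_text)):
    src.append(int(row["src"]))
    dst.append(int(row["dst"]))

print(src)  # [0, 0, 0, 1]
print(dst)  # [1, 2, 3, 3]
# The two lists could then be handed to dgl.graph((src, dst)).
```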

## 1.5 Heterogeneous Graphs

A heterogeneous graph can have nodes and/or edges of multiple types. Nodes/edges of different types have *independent* ID spaces and feature storage.

Example: 

![](https://data.dgl.ai/asset/image/user_guide_graphch_2.png)

In [29]:
# The example in code
graph_data = {
    ('user', 'plays','game'): (th.tensor([0,0,1,1]), th.tensor([0,1,1,2])),
    ('user', 'follows', 'user'): (th.tensor([0]), th.tensor([1])),
}

In [31]:
g = dgl.heterograph(graph_data)
g.ntypes

['game', 'user']

In [32]:
g.etypes

['follows', 'plays']

In [33]:
g.canonical_etypes

[('user', 'follows', 'user'), ('user', 'plays', 'game')]

DGL can also represent homogeneous and bipartite graphs, since they are special cases of heterogeneous graphs.

In [35]:
# Homogeneous graph
dgl.heterograph({('node_type', 'edge_type', 'node_type'): (u,v)})

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

In [36]:
# Bipartite Graph
dgl.heterograph({('source_type', 'edge_type', 'destination_type'): (u,v)})

Graph(num_nodes={'destination_type': 4, 'source_type': 2},
      num_edges={('source_type', 'edge_type', 'destination_type'): 4},
      metagraph=[('source_type', 'destination_type', 'edge_type')])

- The metagraph associated with a heterogeneous graph is the schema of the graph. 
- The metagraph specifies type constraints on the sets of nodes and edges between nodes.
- A node $u$ in a metagraph corresponds to a node type in the associated heterograph. 
- An edge $(u,v)$ in a metagraph indicates that there are edges from nodes of type $u$ to nodes of type $v$ in the associated heterograph. 
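
For the user/game example, the metagraph can be derived by hand from the canonical edge types. This pure-Python sketch only mirrors the description in the list above; the variable names are made up, and the `(source, destination, edge type)` ordering follows the `metagraph=[...]` printout in these notes.

```python
canonical_etypes = [
    ("user", "follows", "user"),
    ("user", "plays", "game"),
]

# Metagraph nodes: every node type that appears in some relation.
meta_nodes = sorted({t for (s, _, d) in canonical_etypes for t in (s, d)})

# Metagraph edges: one per relation, from source type to destination type,
# labelled with the edge type.
meta_edges = [(s, d, e) for (s, e, d) in canonical_etypes]

print(meta_nodes)  # ['game', 'user']
print(meta_edges)  # [('user', 'user', 'follows'), ('user', 'game', 'plays')]
```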

In [37]:
g

Graph(num_nodes={'game': 3, 'user': 2},
      num_edges={('user', 'follows', 'user'): 1, ('user', 'plays', 'game'): 4},
      metagraph=[('user', 'user', 'follows'), ('user', 'game', 'plays')])

In [38]:
g.metagraph().edges()

OutMultiEdgeDataView([('user', 'user'), ('user', 'game')])

### Working with Multiple Types

When multiple node/edge types are introduced, users need to specify the particular node/edge type when invoking the DGLGraph API for type-specific information. 

In [39]:
g.num_nodes() # get number of all nodes in the graph

5

In [42]:
g.num_nodes('user') # get the number of 'user' nodes

2

In [43]:
g.nodes('user')

tensor([0, 1])

To set/get features for a specific node/edge type, DGL provides two new types of syntax: `g.nodes['node_type'].data['feat_name']` and `g.edges['edge_type'].data['feat_name']`.

In [51]:
# Set and get features for users 
g.nodes['user'].data['user_acct'] = th.ones(2,1)
g.edges['plays'].data['times_played'] = th.ones(4,1)

In [52]:
g.edges['plays'].data['times_played'] 

tensor([[1.],
        [1.],
        [1.],
        [1.]])

In [49]:
g

Graph(num_nodes={'game': 3, 'user': 2},
      num_edges={('user', 'follows', 'user'): 1, ('user', 'plays', 'game'): 4},
      metagraph=[('user', 'user', 'follows'), ('user', 'game', 'plays')])

When the edge type uniquely determines the types of source and destination nodes, one can just use one string instead of a string triplet to specify the edge type. For example, for a heterograph with two relations `('user', 'plays', 'game')` and `('user', 'likes', 'game')`, it is safe to just use `'plays'` or `'likes'` to refer to the two relations.
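
The "uniquely determines" condition can be spelled out as a lookup. The function `resolve_etype` below is a hypothetical helper written for illustration, not a DGL API. It expands a short name to its triplet and refuses when the name matches more than one relation.

```python
def resolve_etype(canonical_etypes, etype):
    """Expand a short edge type name to its (src, edge, dst) triplet."""
    if isinstance(etype, tuple):
        return etype  # already a full triplet
    matches = [c for c in canonical_etypes if c[1] == etype]
    if not matches:
        raise KeyError(f"unknown edge type: {etype!r}")
    if len(matches) > 1:
        raise ValueError(f"edge type {etype!r} is ambiguous: {matches}")
    return matches[0]


games = [("user", "plays", "game"), ("user", "likes", "game")]
print(resolve_etype(games, "plays"))  # ('user', 'plays', 'game')

# Two relations share the name 'interacts', so the short form is not enough.
drugs = [("drug", "interacts", "drug"), ("drug", "interacts", "gene")]
try:
    resolve_etype(drugs, "interacts")
except ValueError as err:
    print("ambiguous:", err)
```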

### Loading Heterographs from Disk

CSV files are commonly used to store heterographs, with the nodes and edges of each type kept in separate CSV files.

For example, a data folder might be laid out as follows:
```
data/
|-- drug.csv        # drug nodes
|-- gene.csv        # gene nodes
|-- disease.csv     # disease nodes
|-- drug-interact-drug.csv  # drug-drug interaction edges
|-- drug-interact-gene.csv  # drug-gene interaction edges
|-- drug-treat-disease.csv  # drug-treat-disease edges
```
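
One way to turn such a folder into node types and relations is to parse the file names. The sketch below assumes a `<source>-<relation>-<destination>.csv` naming convention read off the listing above. That convention, and the variable names, are assumptions of this example, not something DGL prescribes.

```python
from pathlib import PurePosixPath

filenames = [
    "data/drug.csv",
    "data/gene.csv",
    "data/disease.csv",
    "data/drug-interact-drug.csv",
    "data/drug-interact-gene.csv",
    "data/drug-treat-disease.csv",
]

node_types = []
relations = []
for name in filenames:
    parts = PurePosixPath(name).stem.split("-")
    if len(parts) == 1:
        node_types.append(parts[0])      # e.g. "drug"
    elif len(parts) == 3:
        relations.append(tuple(parts))   # (source, relation, destination)

print(node_types)  # ['drug', 'gene', 'disease']
print(relations)
# [('drug', 'interact', 'drug'), ('drug', 'interact', 'gene'), ('drug', 'treat', 'disease')]
```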

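Additionally, DGL provides `dgl.save_graphs()` and `dgl.load_graphs()` for saving and loading graphs in binary format, respectively.

### Edge Type Subgraph

One can create a subgraph of a heterograph by specifying the relations to retain; any features are copied along with it.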

In [55]:
g = dgl.heterograph({
   ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
   ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
   ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
})
g.nodes['drug'].data['hv'] = th.ones(3,1)

# Retain the drug-drug 'interacts' and 'treats' relations; all 'drug' and 'disease' nodes are kept
eg = dgl.edge_type_subgraph(g, [('drug', 'interacts', 'drug'), 
                                ('drug', 'treats', 'disease')])
eg


Graph(num_nodes={'disease': 3, 'drug': 3},
      num_edges={('drug', 'interacts', 'drug'): 2, ('drug', 'treats', 'disease'): 1},
      metagraph=[('drug', 'drug', 'interacts'), ('drug', 'disease', 'treats')])

In [56]:
# Make sure the associated features were copied as well
eg.nodes['drug'].data['hv']

tensor([[1.],
        [1.],
        [1.]])

### Converting Heterogeneous Graphs to Homogeneous Graphs

Heterographs provide an interface for managing nodes/edges of different types and their associated features. This is helpful when:

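1. The features for nodes/edges of different types have different data types or sizes.
2. We want to apply different operations to nodes/edges of different types.

When neither condition holds and the model does not need to distinguish node/edge types, the heterograph can be converted to a homogeneous graph with `dgl.to_homogeneous()`.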


In [58]:
g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0,1]), th.tensor([1,2])),
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
    
})
g.nodes['drug'].data['hv'] = th.zeros(3,1)
g.nodes['disease'].data['hv'] = th.ones(3,1)
g.edges['interacts'].data['he'] = th.zeros(2,1)
g.edges['treats'].data['he'] = th.zeros(1,2)

In [59]:
hg = dgl.to_homogeneous(g) # by default this does not merge features


In [63]:
'hv' in hg.ndata # we see hv is not in the node data

False

In [64]:
g

Graph(num_nodes={'disease': 3, 'drug': 3},
      num_edges={('drug', 'interacts', 'drug'): 2, ('drug', 'treats', 'disease'): 1},
      metagraph=[('drug', 'drug', 'interacts'), ('drug', 'disease', 'treats')])

In [65]:
# Copy the edge features
# For feature copy, DGL expects the features to have the same size and dtype across node/edge types
hg = dgl.to_homogeneous(g, edata=['he'])

DGLError: Cannot concatenate column he with shape Scheme(shape=(2,), dtype=torch.float32) and shape Scheme(shape=(1,), dtype=torch.float32)

In [68]:
g.ndata

defaultdict(<class 'dict'>, {'hv': {'disease': tensor([[1.],
        [1.],
        [1.]]), 'drug': tensor([[0.],
        [0.],
        [0.]])}})

In [66]:
# Copy the node features 
# This works because 'hv' has shape (3, 1) for both node types
hg = dgl.to_homogeneous(g, ndata=['hv'])
hg.ndata['hv']

tensor([[1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.]])

The original node/edge types and type-specific IDs are stored in `ndata` and `edata` under the keys `dgl.NTYPE`, `dgl.NID`, `dgl.ETYPE`, and `dgl.EID`.

In [69]:
# Order of the node types in the heterograph
g.ntypes

['disease', 'drug']

In [70]:
# Original node types
hg.ndata[dgl.NTYPE]

tensor([0, 0, 0, 1, 1, 1])

In [71]:
# Original type-specific node IDs
hg.ndata[dgl.NID]

tensor([0, 1, 2, 0, 1, 2])

In [72]:
# Order of the edge types in the heterograph 
g.etypes

['interacts', 'treats']

In [74]:
# Original edge types
hg.edata[dgl.ETYPE]

tensor([0, 0, 1])

In [75]:
# Original type-specific edge IDs
hg.edata[dgl.EID]

tensor([0, 1, 0])
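
The node bookkeeping shown above can be reproduced in a few lines of plain Python, which may make the type and ID columns easier to read. This is only a sketch of the idea (concatenate node types in their listed order). The lists below play the role of `hg.ndata[dgl.NTYPE]` and `hg.ndata[dgl.NID]`, and `original_node` is a hypothetical helper.

```python
num_nodes_per_type = {"disease": 3, "drug": 3}
ntypes = sorted(num_nodes_per_type)  # ['disease', 'drug'], as in g.ntypes

ntype_ids = []  # plays the role of hg.ndata[dgl.NTYPE]
orig_ids = []   # plays the role of hg.ndata[dgl.NID]
for type_index, ntype in enumerate(ntypes):
    for nid in range(num_nodes_per_type[ntype]):
        ntype_ids.append(type_index)
        orig_ids.append(nid)

print(ntype_ids)  # [0, 0, 0, 1, 1, 1]
print(orig_ids)   # [0, 1, 2, 0, 1, 2]


def original_node(homo_id):
    """Map a node ID in the homogeneous graph back to (node type, original ID)."""
    return ntypes[ntype_ids[homo_id]], orig_ids[homo_id]


print(original_node(4))  # ('drug', 1)
```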

For some modelling tasks we may want to group some relations together and apply the same operation to them. To address this need we can first take an edge type subgraph of the heterograph and then convert the subgraph to a homogeneous graph. 

In [76]:
g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0,1]), th.tensor([1,2])),
    ('drug', 'interacts', 'gene'): (th.tensor([0,1]), th.tensor([2,3])), 
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
})

In [79]:
sub_g = dgl.edge_type_subgraph(g, 
                               [('drug', 'interacts', 'drug'), ('drug', 'interacts', 'gene')])

In [80]:
h_sub_g = dgl.to_homogeneous(sub_g)
h_sub_g

Graph(num_nodes=7, num_edges=4,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)})

## 1.6 DGLGraph on a GPU

TODO
