# 01-10 - Representing Graphs with `pathpyG`

*April 17 2024*

We have covered the basics of data science with `python`. With this out of the way, we now introduce `pathpyG`, a network analysis and visualization package that is being developed at my chair. If you are interested in participating in the development of this open-source project, you can sign up for one of our labs or apply for a position as a student assistant in our group. For more information, please refer to our website.

`pathpyG` has several advantages over many other network analysis packages. First, it is easy to install since it is a pure `python` package that does not require compilation. Second, `pathpyG` has a user-friendly API that makes it easy to handle directed and undirected networks, networks where nodes or edges have attributes, as well as temporal networks. Third, it provides interactive HTML visualizations that can be displayed directly inside `jupyter` notebooks, making it particularly suitable for educational settings. And finally, unlike other packages, it directly supports the analysis and visualization of time series data on networked systems, such as time-stamped edges or data on paths in networks.

Most importantly, `pathpyG` is based on `PyTorch` and `torch-geometric` and uses tensors as its internal data representation. This facilitates the analysis of large data sets and makes it easy to apply graph neural networks in later chapters of the course.

To get started, we first import `pathpyG` and assign the local alias `pp`:

In [3]:
import pathpyG as pp
import numpy as np

If the `import` statement completes without an error message, the installation was successful and we can now use `pathpyG` to generate, analyse, and visualise networks.

## Creating networks

`pathpyG` provides the `Graph` class. The constructor takes a `pyG` `Data` object, which can hold an edge index that captures the edges of a graph as well as arbitrary node-, edge- or graph-level attributes. To simplify the creation of small example networks, you can use the static function `from_edge_list`, which creates a `Graph` object from a list of edges represented as tuples of integers or strings.

Printing the `Graph` object gives a short string summary that tells us whether the network is directed or undirected, as well as the number of unique nodes and edges.

In [7]:
g1 = pp.Graph.from_edge_list([(0, 1), (1, 2), (2, 0)])
print(g1)

Directed graph with 3 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>



A network is directed by default, but we can create an undirected network by calling the `to_undirected` function of the `Graph` instance. Internally, this generates edges for both directions, i.e. for the example above it additionally generates the edges that connect the nodes in the opposite direction.

In [9]:
g2 = g1.to_undirected()
print(g2)

Undirected graph with 3 nodes and 6 (directed) edges

Graph attributes
	num_nodes		<class 'int'>



In the example above, we have used integer numbers to refer to different nodes. Internally, this will create a `pyG` Data object with an edge index, where nodes are always represented by integer indices. Let us have a look at this internal data structure:

In [11]:
print(g2.data.edge_index)

EdgeIndex([[0, 0, 1, 1, 2, 2],
           [1, 2, 0, 2, 0, 1]], sparse_size=(3, 3), nnz=6, sort_order=row,
          is_undirected=True)


We see that this edge index contains all edges for both directions, where the first row of the tensor contains all source nodes (ordered by their index) while the second row contains all target nodes. This sorted tensor representation allows us to easily convert the edge index to sparse matrix representations, which can be used e.g. to calculate matrix-based measures or Laplacian operators.
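To make the layout of the edge index concrete, here is a minimal plain-Python sketch. It does not use `pathpyG`, and the helper name `symmetrise` is hypothetical. It adds the reverse of every directed edge, sorts the pairs by source index, and splits them into one list of sources and one list of targets, which reproduces the two rows printed above for the triangle example.

```python
def symmetrise(edges):
    """Return sorted (source, target) pairs that contain both directions of every edge.

    Using a set removes duplicates, so this sketch only covers graphs
    without multiple edges between the same pair of nodes.
    """
    both = set(edges) | {(w, v) for v, w in edges}
    return sorted(both)


edges = [(0, 1), (1, 2), (2, 0)]
pairs = symmetrise(edges)

# Split the sorted pairs into the two rows of an edge index
sources = [v for v, _ in pairs]
targets = [w for _, w in pairs]

print(sources)
print(targets)
```

This is only an illustration of the data layout, not of how `pathpyG` builds the tensor internally.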

While using integers is easy to understand, it is often more convenient to use string-based node labels. Unlike `pyG`, `pathpyG` supports this, i.e. we can create a network as follows:

In [15]:
n = pp.Graph.from_edge_list([('Tom', 'Bert'), ('Bert', 'Bill'), ('Bill', 'Tom')]).to_undirected()
print(n)

Undirected graph with 3 nodes and 6 (directed) edges

Graph attributes
	num_nodes		<class 'int'>



Checking the edge index, we find that we again have integer-based node indices:

In [16]:
n.data.edge_index

EdgeIndex([[0, 0, 1, 1, 2, 2],
           [1, 2, 0, 2, 0, 1]], sparse_size=(3, 3), nnz=6, sort_order=row,
          is_undirected=True)

However, `pathpyG` automatically generates an ID-to-index mapping object that is applied whenever we, for example, enumerate the nodes. We can inspect this mapping manually and use it to resolve IDs to integer indices and vice versa:

In [19]:
print(n.mapping)

Tom -> 0
Bert -> 1
Bill -> 2



In [22]:
n.mapping.to_id(0)

'Tom'

In [23]:
n.mapping.to_idx('Bert')

1
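The idea behind such a mapping can be sketched in a few lines of plain Python. The sketch below does not use `pathpyG`; the names `id_to_idx` and `idx_to_id` are hypothetical, and assigning indices in order of first appearance is an assumption that happens to reproduce the mapping printed above for this example.

```python
edges = [('Tom', 'Bert'), ('Bert', 'Bill'), ('Bill', 'Tom')]

# Assign consecutive integer indices to node IDs in order of first appearance
id_to_idx = {}
for v, w in edges:
    for node in (v, w):
        if node not in id_to_idx:
            id_to_idx[node] = len(id_to_idx)

# Inverse mapping from integer index back to node ID
idx_to_id = {idx: node for node, idx in id_to_idx.items()}

# Translate the string-based edge list into integer index pairs
edge_index = [(id_to_idx[v], id_to_idx[w]) for v, w in edges]

print(id_to_idx)
print(idx_to_id[0])
print(edge_index)
```

Keeping both directions of the mapping means that tensors can be indexed by integers while users continue to work with readable node IDs.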

If we want to check explicitly whether a node exists before creating an edge, we can test this with the `in` operator on the set of nodes available via `Graph.nodes`:

In [25]:
print('Tom' in n.nodes)

True


To count the number of nodes and (directed) edges in a network we can use the `N` and `M` properties:

In [27]:
print('Network has {0} nodes and {1} edges'.format(n.N, n.M))

Network has 3 nodes and 6 edges


### Enumerating nodes and edges

We can iterate through nodes via the nodes iterator as follows. If the Graph object includes an index-ID mapping, this will be applied automatically:

In [28]:
for v in n.nodes:
    print(v)

Tom
Bert
Bill


Similar to `nodes`, the `edges` iterator of the network contains all edges of a network. Each edge is returned as a tuple and the id-index mapping is applied automatically if such a mapping exists (otherwise we obtain a tuple of node indices):

In [29]:
for e in n.edges:
    print(e)

('Tom', 'Bert')
('Tom', 'Bill')
('Bert', 'Tom')
('Bert', 'Bill')
('Bill', 'Tom')
('Bill', 'Bert')


We often want to check whether an edge exists between a specific pair of nodes. We can do this by using the `is_edge` function:

In [35]:
print(n.is_edge('Tom', 'Bert'))

True


We can access the degrees of nodes, i.e. the number of other nodes to which a node is connected, via the `degrees()` function of the `Graph`. For an undirected network, `degrees()` gives the undirected degrees (i.e. irrespective of the directionality of an edge). For directed networks, we can use the `mode` parameter to calculate the in- or out-degree of nodes, i.e. from how many other nodes edges point to the given node, or to how many other nodes the edges of a node point.

The function returns a dictionary that can be indexed via the node IDs.

In [44]:
n.degrees(mode='in')['Tom']

2
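The difference between in- and out-degrees can be illustrated without `pathpyG` by counting how often each node appears as a target or as a source in an edge list. The following is a minimal sketch using only the standard library; the small directed example with nodes `a`, `b` and `c` is made up for illustration.

```python
from collections import Counter

# The six directed edges that represent the undirected triangle from above
undirected = [('Tom', 'Bert'), ('Tom', 'Bill'), ('Bert', 'Tom'),
              ('Bert', 'Bill'), ('Bill', 'Tom'), ('Bill', 'Bert')]

# In-degree counts appearances as target, out-degree counts appearances as source
in_degree = Counter(w for _, w in undirected)
out_degree = Counter(v for v, _ in undirected)
print(in_degree['Tom'], out_degree['Tom'])

# Hypothetical directed example in which in- and out-degrees differ
directed = [('a', 'b'), ('a', 'c'), ('b', 'c')]
d_in = Counter(w for _, w in directed)
d_out = Counter(v for v, _ in directed)
print(d_in['c'], d_out['c'])
```

In the symmetric edge list both counts coincide for every node, which is why the in-degree of an undirected network equals its undirected degree.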

## Networks, Nodes and Edges with attributes

We often want to use networks to model relational data that contain additional information on nodes, edges, or networks. To support this, `pathpyG` stores data in a `pyG` `Data` object, which allows us to store arbitrary additional information at the level of nodes, edges or the graph in the form of `torch` tensors.

In [76]:
n = pp.Graph.from_edge_list([('Tom', 'Bert'), ('Bert', 'Bill'), ('Bill', 'Tom')])
print(n)

Directed graph with 3 nodes and 3 edges

Graph attributes
	num_nodes		<class 'int'>



In the following example, we add an attribute to the nodes of the graph. We can directly assign this to the underlying `pyG` data object. All node attributes must be prefixed with `node_`:

In [77]:
import torch
n.data.node_age = torch.tensor([[44], [28], [125]])

The assignment of these values to the nodes is based on the indices of nodes, which we can check via the mapping object. We can now use the following code to access the properties of individual nodes:

In [80]:
n['node_age', 'Bill']

tensor([125])

To retrieve the tensor containing the ages of all nodes, we can do the following:

In [82]:
n['node_age']

tensor([[ 44],
        [ 28],
        [125]])
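The index-based alignment between attribute rows and nodes can be sketched in plain Python as follows. This is an illustration of the lookup logic only; the helper `node_attribute` is hypothetical and not part of `pathpyG`, and the mapping is copied from the output printed earlier.

```python
# ID-to-index mapping as printed earlier for this example
id_to_idx = {'Tom': 0, 'Bert': 1, 'Bill': 2}

# Row i of the attribute belongs to the node with index i
node_age = [[44], [28], [125]]


def node_attribute(values, node_id):
    """Look up the attribute row of a node via its integer index."""
    return values[id_to_idx[node_id]]


print(node_attribute(node_age, 'Bill'))
```

Because the rows are matched purely by position, it is worth checking the mapping before assigning attribute values.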

Just like nodes, edges can store arbitrary attributes that we can add as a tensor. The name of the attribute must be prefixed with `edge_`:

In [90]:
n.data.edge_type = torch.tensor([[1], [2], [1]])

We can access those as follows:

In [91]:
print(n['edge_type'])
print(n['edge_type', 'Tom', 'Bert'])

tensor([[1],
        [2],
        [1]])
tensor([1])


## Adjacency matrices

Adjacency matrices are an important mathematical representation of graphs and networks. The topology of a graph can be represented in the entries of a matrix $A$, where an entry $A[i,j]=1$ indicates that an edge exists from the $i$-th to the $j$-th node of the network. The absence of edges is encoded by zero entries. The size of an adjacency matrix representation of a network with $n$ nodes is generally $n^2$, which is not suitable for networks with thousands or millions of nodes. `pathpyG` nevertheless supports efficient adjacency matrix calculation for *sparse* networks, i.e. networks where the majority of node pairs are not connected by an edge. Instead of a fully populated matrix, a call to `Graph.get_sparse_adj_matrix()` returns a *sparse matrix object*, which is an efficient adjacency-list representation capturing the indices and values of non-zero entries:

In [92]:
print(n.get_sparse_adj_matrix())

  (0, 1)	1.0
  (1, 2)	1.0
  (2, 0)	1.0


This enables us to directly apply matrix algebra operations from the sparse linear algebra module that is contained in `scipy`. If we instead want a dense matrix that includes zero entries, we can write:

In [93]:
print(n.get_sparse_adj_matrix().todense())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]


The fact that the matrix is asymmetric tells us that this is a directed network. By default, a binary matrix representation is returned where entries store the presence or absence of edges as 0 or 1 entries. If we want to use numerical attributes of edges instead, we can pass the name of a numerical attribute that should be used:

In [96]:
print(n.get_sparse_adj_matrix(edge_attr='edge_type').todense())

[[0 1 0]
 [0 0 2]
 [1 0 0]]


How does `pathpyG` populate adjacency matrices if the network contains multiple edges between the same pair of nodes? Let's try this by creating another edge between Tom and Bert, and let's further add a weight attribute:

In [101]:
n = pp.Graph.from_edge_list([('Tom', 'Bert'), ('Bert', 'Bill'), ('Bill', 'Tom'), ('Tom', 'Bert')])
n.data.edge_weight = torch.tensor([[2], [0.5], [1.2], [3.7]])
print(n)

Directed graph with 3 nodes and 4 edges

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([4, 1])

Graph attributes
	num_nodes		<class 'int'>



If we now generate an adjacency matrix, the entries contain the *number of different edge objects* between pairs of nodes:

In [102]:
n.get_sparse_adj_matrix().todense()

matrix([[0., 2., 0.],
        [0., 0., 1.],
        [1., 0., 0.]], dtype=float32)

If we use a numerical attribute to calculate the matrix entries in such a network, the attributes of all edges between the same pair of nodes are automatically summed:

In [104]:
n.get_sparse_adj_matrix(edge_attr='edge_weight').todense()

matrix([[0. , 2.5, 0. ],
        [0. , 0. , 1.2],
        [3.7, 0. , 0. ]], dtype=float32)
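The counting and summing behaviour for multiple edges can be reproduced with a small coordinate-based sketch in plain Python. The helpers `adjacency_entries` and `to_dense` are hypothetical and not part of `pathpyG`. The edge order and the position-wise alignment of the weights below are chosen so that the result matches the output above; this is an assumption about the example, not a statement about how `pathpyG` orders edges internally.

```python
from collections import defaultdict


def adjacency_entries(edge_index, weights=None):
    """Map (i, j) to the number of edges from i to j, or to their summed weights."""
    entries = defaultdict(float)
    for k, (i, j) in enumerate(edge_index):
        # Repeated (i, j) pairs accumulate in the same entry
        entries[(i, j)] += 1.0 if weights is None else weights[k]
    return dict(entries)


def to_dense(entries, n):
    """Expand the non-zero entries into a full n x n list of lists."""
    return [[entries.get((i, j), 0.0) for j in range(n)] for i in range(n)]


# Two parallel edges from node 0 to node 1 (assumed order, see lead-in)
edge_index = [(0, 1), (0, 1), (1, 2), (2, 0)]
weights = [2.0, 0.5, 1.2, 3.7]

counts = to_dense(adjacency_entries(edge_index), 3)
summed = to_dense(adjacency_entries(edge_index, weights), 3)

print(counts)
print(summed)
```

Storing only the non-zero coordinates keeps the memory footprint proportional to the number of edges rather than to $n^2$.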