# 06-01 - Random Walks in Graphs

*June 3 2024*  

In the first notebook we explore random walks in graphs. Random walk models are important for the study of diffusion processes, and their stationary distributions can be used to define centrality measures such as PageRank. We show how to calculate transition matrices and visitation probabilities, and how to use `pathpyG` to simulate and visualize random walks in complex networks.

In [1]:
import pathpyG as pp
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from numpy import linalg as npl
import scipy as sp
import torch

plt.style.use('default')
sns.set_style("whitegrid")

pp.config['torch']['device']='cpu'

## Transition matrices

We first write a function that computes the row-stochastic (also called right-stochastic) transition matrix of a random walk process for a given network. The function should work for weighted, unweighted, directed, and undirected networks represented by a `pathpyG` `Graph` object.

In [2]:
def transition_matrix(network, weight=True):
    # we first create an adjacency matrix that also considers weights
    if weight:
        A = network.get_sparse_adj_matrix(edge_attr='edge_weight').todense()
    else:
        A = network.get_sparse_adj_matrix().todense()
    # we compute the (weighted) node degrees for the purpose of normalization
    D = A.sum(axis=1)
    # we generate an empty sparse transition matrix, we use the sparse "list of lists" format,
    # which allows for a more efficient change of entries in the following loop
    T = sp.sparse.lil_matrix((network.N, network.N))
    # we calculate transition probabilities
    for i in range(network.N):
        T[i,:] = A[i,:]/D[i]
    return T

In [3]:
g = pp.Graph.from_edge_list([['a','b'],['a','c'],['b','c'],['b','a'],['c','a']])

In [4]:
g.data['edge_weight'] = torch.tensor([[2],[1],[1],[2],[5]],device=pp.config['torch']['device'])

In [5]:
print(g)

Directed graph with 3 nodes and 5 edges

Edge attributes
	edge_weight		<class 'torch.Tensor'> -> torch.Size([5, 1])

Graph attributes
	num_nodes		<class 'int'>



In [6]:
transition_matrix(g).todense()

matrix([[0.        , 0.66666667, 0.33333333],
        [0.66666667, 0.        , 0.33333333],
        [1.        , 0.        , 0.        ]])
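The row normalisation performed by the loop above can also be written as a single vectorised division. The following is a minimal sketch in plain `numpy`, not part of `pathpyG`: the dense matrix `A` is typed in by hand from the edge weights of the example graph, and the names `out_degree` and `row_sums` are chosen for illustration only. It assumes that every node has at least one outgoing edge, since otherwise the division would fail for that row.

```python
import numpy as np

# Weighted adjacency matrix of the example graph, rows and columns ordered a, b, c.
# Entries are copied by hand from the edge weights assigned above.
A = np.array([
    [0.0, 2.0, 1.0],   # a -> b (weight 2), a -> c (weight 1)
    [2.0, 0.0, 1.0],   # b -> a (weight 2), b -> c (weight 1)
    [5.0, 0.0, 0.0],   # c -> a (weight 5)
])

# Weighted out-degree of each node; keepdims=True keeps shape (3, 1) for broadcasting.
out_degree = A.sum(axis=1, keepdims=True)

# Divide every row by its out-degree in one step (assumes no zero out-degrees).
T = A / out_degree

# Each row of a row-stochastic matrix must sum to one.
row_sums = T.sum(axis=1)
print(T)
print(row_sums)
```

Checking the row sums is a cheap sanity test whenever a transition matrix is built by hand.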

We can write a function that computes the visitation probabilities $\pi^{(t)}$ of a random walk after $t$ steps, for a given network and initial distribution $\pi^{(0)}$. With transition matrix $\mathbf{T}$, these are given by $\pi^{(t)} = \pi^{(0)} \mathbf{T}^t$. We use the built-in `numpy.linalg.matrix_power` function to calculate the power of the transition matrix.

In [7]:
def visitation_probabilities(network, initial_dist, t):
    T = transition_matrix(network).todense()
    p_t = np.dot(initial_dist, npl.matrix_power(T,t))
    return p_t

If we start in node $a$, naturally after zero steps we are in the same node.

In [14]:
visitation_probabilities(g, np.array([1, 0, 0]), 0)

matrix([[1., 0., 0.]])

After one step, due to the edge weights, we have a $\frac{2}{3}$ probability to be in node $b$ and a $\frac{1}{3}$ probability to be in node $c$.

In [15]:
visitation_probabilities(g, np.array([1, 0, 0]), 1)

matrix([[0.        , 0.66666667, 0.33333333]])

After 100 steps, the probability of finding the walker in node $a$ is $0.45$, while the probabilities of finding it in nodes $b$ and $c$ are $0.3$ and $0.25$, respectively.

In [16]:
visitation_probabilities(g, np.array([1, 0, 0]), 100)

matrix([[0.45, 0.3 , 0.25]])

We note that those visitation probabilities do not change if we consider 500 steps of a random walk, which indicates that the walk has converged to its stationary distribution.

In [17]:
visitation_probabilities(g, np.array([1, 0, 0]), 500)

matrix([[0.45, 0.3 , 0.25]])
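The fact that the probabilities no longer change suggests a fixed point $\pi = \pi \mathbf{T}$, i.e. a left eigenvector of the transition matrix for eigenvalue one. As a hedged illustration, the sketch below computes this vector directly with `numpy.linalg.eig` instead of taking matrix powers. The matrix `T` is typed in from the output above, and the variable names are assumptions made for this example.

```python
import numpy as np

# Transition matrix of the example graph (rows and columns ordered a, b, c).
T = np.array([
    [0.0, 2/3, 1/3],
    [2/3, 0.0, 1/3],
    [1.0, 0.0, 0.0],
])

# Left eigenvectors of T are right eigenvectors of its transpose.
eigenvalues, eigenvectors = np.linalg.eig(T.T)

# Select the eigenvector whose eigenvalue is closest to one.
idx = np.argmin(np.abs(eigenvalues - 1.0))
v = np.real(eigenvectors[:, idx])

# Normalise so that the entries sum to one; this also removes the arbitrary sign.
pi = v / v.sum()
print(pi)
```

For this small example the result should agree with the probabilities obtained after 100 and 500 steps.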

`pathpyG` offers direct support for the simulation of random walk processes in a network, as well as for the calculation of transition matrices, stationary states and total variation distance. To create a random walk in a given network (including weight-based biases of transition probabilities), we can simply write:

In [18]:
rw = pp.processes.RandomWalk(g, weight=True)

We can print the transition matrix of the random walk as follows:

In [19]:
print(rw.transition_matrix)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5 stored elements and shape (3, 3)>
  Coords	Values
  (0, 1)	0.6666666666666666
  (0, 2)	0.3333333333333333
  (1, 0)	0.6666666666666666
  (1, 2)	0.3333333333333333
  (2, 0)	1.0


By default, the transition matrix is returned as a sparse matrix. If we want to inspect it, we can return it as a nicely-formatted `pandas` `DataFrame`:

In [20]:
rw.transition_matrix_pd()

Unnamed: 0,a,b,c
a,0.0,0.666667,0.333333
b,0.666667,0.0,0.333333
c,1.0,0.0,0.0


We can use the function `rw.visitation_probabilities` to compute the node visitation probabilities after $t$ steps of a random walk starting in a specific `seed` node:

In [21]:
rw.visitation_probabilities(t=100, seed='a')

matrix([[0.45, 0.3 , 0.25]])
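The total variation distance mentioned above measures how far a distribution $\pi^{(t)}$ is from the stationary distribution $\pi$, and is defined as $\frac{1}{2}\sum_i |\pi^{(t)}_i - \pi_i|$. The following sketch computes it by hand in plain `numpy` rather than through `pathpyG`. The helper `total_variation_distance` and the dictionary `distances` are hypothetical names introduced here, and the stationary values are taken from the output above.

```python
import numpy as np

# Transition matrix of the example graph (rows and columns ordered a, b, c).
T = np.array([
    [0.0, 2/3, 1/3],
    [2/3, 0.0, 1/3],
    [1.0, 0.0, 0.0],
])

pi = np.array([0.45, 0.30, 0.25])   # stationary values reported above
p0 = np.array([1.0, 0.0, 0.0])      # walker starts in node a


def total_variation_distance(p, q):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()


# Distance to the stationary distribution after t steps, for a few values of t.
distances = {}
for t in [0, 1, 5, 10, 100]:
    p_t = p0 @ np.linalg.matrix_power(T, t)
    distances[t] = total_variation_distance(p_t, pi)
    print(t, distances[t])
```

The distance starts at $0.55$ for $t=0$ and shrinks towards zero as $t$ grows, which is another way of seeing the convergence observed earlier.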

We can simulate a given number of steps of a random walk sequence, starting in a given `seed` node. The following loop outputs 100 steps of a random walk starting in node `a`. After each step, the observed visitation frequencies are computed:

In [22]:
for time, updated_nodes in rw.simulation_run(steps=100, seed='a'):
    print('time = {0}, current node = {1}'.format(time, updated_nodes[0]))
    print(rw.visitation_frequencies)

time = 1, current node = b
[0.5 0.5 0. ]
time = 2, current node = a
[0.66666667 0.33333333 0.        ]
time = 3, current node = b
[0.5 0.5 0. ]
time = 4, current node = a
[0.6 0.4 0. ]
time = 5, current node = b
[0.5 0.5 0. ]
time = 6, current node = a
[0.57142857 0.42857143 0.        ]
time = 7, current node = b
[0.5 0.5 0. ]
time = 8, current node = c
[0.44444444 0.44444444 0.11111111]
time = 9, current node = a
[0.5 0.4 0.1]
time = 10, current node = b
[0.45454545 0.45454545 0.09090909]
time = 11, current node = a
[0.5        0.41666667 0.08333333]
time = 12, current node = b
[0.46153846 0.46153846 0.07692308]
time = 13, current node = a
[0.5        0.42857143 0.07142857]
time = 14, current node = b
[0.46666667 0.46666667 0.06666667]
time = 15, current node = a
[0.5    0.4375 0.0625]
time = 16, current node = c
[0.47058824 0.41176471 0.11764706]
time = 17, current node = a
[0.5        0.38888889 0.11111111]
time = 18, current node = b
[0.47368421 0.42105263 0.10526316]
time = 19, cu

We note that the observed visitation frequencies after 100 steps closely resemble the visitation probabilities that we have computed above. 
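To make this convergence of frequencies concrete without relying on `pathpyG`, the following sketch simulates a single long walk in plain `numpy` and counts how often each node is visited. It is an illustration under stated assumptions, not the library's implementation: the matrix `T` is typed in from the output above, the number of steps and the fixed random seed are arbitrary choices, and the start node is counted as a visit, as the frequencies printed above appear to do.

```python
import numpy as np

# Transition matrix of the example graph (rows and columns ordered a, b, c).
T = np.array([
    [0.0, 2/3, 1/3],
    [2/3, 0.0, 1/3],
    [1.0, 0.0, 0.0],
])

rng = np.random.default_rng(seed=42)  # fixed seed only for reproducibility
steps = 20_000                        # arbitrary, chosen to be "long enough"

current = 0                           # index 0 corresponds to node a
visits = np.zeros(3)
visits[current] += 1                  # count the start node as a visit

for _ in range(steps):
    # Draw the next node according to the row of the current node.
    current = rng.choice(3, p=T[current])
    visits[current] += 1

# Fraction of time spent in each node.
frequencies = visits / visits.sum()
print(frequencies)
```

With this many steps the printed frequencies should lie close to $(0.45, 0.3, 0.25)$, although the exact values depend on the random seed.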

The `RandomWalk` class is implemented based on `pathpyG`'s capabilities to simulate general discrete-time dynamical processes that change the state of nodes. For a random walk, we have binary node states that capture whether a node is currently visited by the random walker or not. We can use the `run_experiment` function to generate multiple simulation runs for a given number of steps, starting from different `seed` states. To simulate two random walks with ten steps, one starting from node `a` and a second one starting from node `b`, we can write:

In [23]:
pp.plot(g, node_label=[v for v in g.mapping.node_ids]);

In [24]:
data = rw.run_experiment(steps=10, runs=['a', 'b'])
print(data)

100%|██████████| 2/2 [00:00<00:00, 4017.53it/s]

    run_id seed  time node  state
0        0    a     0    a   True
1        0    a     0    b  False
2        0    a     0    c  False
3        0    a     1    b   True
4        0    a     1    a  False
5        0    a     2    a   True
6        0    a     2    b  False
7        0    a     3    b   True
8        0    a     3    a  False
9        0    a     4    a   True
10       0    a     4    b  False
11       0    a     5    b   True
12       0    a     5    a  False
13       0    a     6    a   True
14       0    a     6    b  False
15       0    a     7    c   True
16       0    a     7    a  False
17       0    a     8    a   True
18       0    a     8    c  False
19       0    a     9    b   True
20       0    a     9    a  False
21       0    a    10    a   True
22       0    a    10    b  False
23       1    b     0    a  False
24       1    b     0    b   True
25       1    b     0    c  False
26       1    b     1    c   True
27       1    b     1    b  False
28       1    




In [25]:
print(g.mapping)

a -> 0
b -> 1
c -> 2



This generates a `pandas` `DataFrame` that stores all node state changes generated by the process. Each run of the simulation has a numerical identifier (here run\_id `0` for the walk starting in node `a` and run\_id `1` for the walk starting in node `b`). Each row in the data frame captures a state change of one node at a given time step `time`. In the first step, the visitation state of all nodes based on the initial distribution is recorded. For each subsequent time step, we see two state changes, one corresponding to the node from which the random walker moved away (setting the visitation state to `False`) and one to which the random walker moved (setting the visitation state to `True`). The beauty of this approach is that we can simulate many other processes (like, e.g., epidemic spreading) that generate exactly the same output, so we can analyze and visualize the output of various discrete-time simulations in a unified way. 

We can, for instance, use the `get_path` function to extract the `path` (in this case the walk) that a process has taken through the graph. For this, we have to specify the `run_id` corresponding to the walk that we want to return. This returns a `PathData` object that can be used for further analysis.

In [26]:
p = rw.get_path(data, run_id=0)
print(p.get_walk(0))

('a', 'b', 'a', 'b', 'a', 'b', 'a', 'c', 'a', 'b', 'a')
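Because every step records exactly one node switching to `True`, a walk can also be recovered from the state-change table by hand. The sketch below illustrates this idea with `pandas` only. It is not how `get_path` is implemented: the small `DataFrame` is rebuilt manually from the first rows of run `0` shown in the experiment output, and the names `rows`, `visited`, and `walk` are chosen for this example.

```python
import pandas as pd

# First rows of run 0, copied from the experiment output above
# (columns: run_id, seed, time, node, state).
rows = [
    (0, 'a', 0, 'a', True),
    (0, 'a', 0, 'b', False),
    (0, 'a', 0, 'c', False),
    (0, 'a', 1, 'b', True),
    (0, 'a', 1, 'a', False),
    (0, 'a', 2, 'a', True),
    (0, 'a', 2, 'b', False),
    (0, 'a', 3, 'b', True),
    (0, 'a', 3, 'a', False),
]
data = pd.DataFrame(rows, columns=['run_id', 'seed', 'time', 'node', 'state'])

# Keep only the rows of run 0 in which a node becomes visited.
visited = data[(data['run_id'] == 0) & (data['state'])]

# Ordering those rows by time yields the sequence of visited nodes.
walk = tuple(visited.sort_values('time')['node'])
print(walk)
```

The result matches the first four nodes of the walk returned by `get_walk` above.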


We can also return a `PathData` object that contains all paths generated by the runs of our experiment:

In [27]:
path_data = rw.get_paths(data)
print(path_data)

PathData with 2 paths with total weight 2.0
