<a href="https://colab.research.google.com/github/AnacletoLAB/grape/blob/main/tutorials/First_and_Second_order_random_walks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First and Second order random walks
First- and second-order random walks are a node-sampling mechanism used by a large number of algorithms. In this notebook we briefly show how to use Ensmallen to sample a large number of random walks from big graphs.

To install the GraPE library run:

```bash
pip install grape
```

To install only the Ensmallen module, which may be useful when the TensorFlow dependency causes problems, run:

```bash
pip install ensmallen
```

In [1]:
! pip install -q ensmallen

## Retrieving a graph to run the sampling on
In this tutorial we sample walks on one of the graphs available through Ensmallen's automatic graph retrieval, namely the [Homo Sapiens graph from STRING](https://string-db.org/cgi/organisms). If you want to load a graph from an edge list, follow the examples in the Loading a Graph in Ensmallen tutorial.

In [2]:
from ensmallen.datasets.string import HomoSapiens

We retrieve and load the graph:

In [3]:
graph = HomoSapiens()
# We also create a version of the graph without edge weights
unweighted_graph = graph.remove_edge_weights()

We compute the graph report:

In [4]:
graph

and the unweighted graph report:

In [5]:
unweighted_graph

## Random walks are heavily parallelized
All the random-walk sampling algorithms provided by Ensmallen are heavily parallelized, so running them on machines with a large number of threads leads to better time performance. This notebook is being executed on a Colab instance with only 2 cores, so the timings below are not as good as they could be: you would likely get better ones on your own laptop or, absurdly, even on your cellphone.

Yes, Ensmallen can run on Android phones.

In [6]:
from multiprocessing import cpu_count

cpu_count()

2

## Unweighted first-order random walks
Computation of first-order random walks ignoring the edge weights.

In [7]:
%%time
unweighted_graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 50.3 ms, sys: 934 µs, total: 51.2 ms
Wall time: 26 ms


array([[ 1146, 13682, 13013, ..., 14112, 13835,   392],
       [13810, 13006,  4562, ...,  9738, 13417, 13068],
       [16210, 12726,  8488, ..., 12584,  5887,  5231],
       ...,
       [12804, 12797, 14554, ...,  7671,  9512,  2006],
       [ 9400,  4187, 17448, ...,  5363,  3307, 10931],
       [16196,  4533,  3517, ...,  9042, 16199,  1852]], dtype=uint32)

In [8]:
%%time
unweighted_graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 2.85 s, sys: 13.2 ms, total: 2.86 s
Wall time: 1.54 s


array([[    0,  1631,  8318, ...,  9300,  2799, 11793],
       [    1,  1144,  3625, ...,  2753, 10907,  9618],
       [    2, 17250, 15733, ..., 13807, 19065, 11865],
       ...,
       [19560,  5977, 13636, ..., 19393, 11458,  1352],
       [19562, 12312,  9881, ..., 14733,  5278,  4319],
       [19565,  9476,  8155, ...,  9590,  8602,  4687]], dtype=uint32)
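To make the idea of an unweighted first-order walk concrete, here is a minimal pure-Python sketch on a toy graph. It is not how Ensmallen implements the sampling; the `toy_adjacency` dictionary and the `first_order_walk` helper are hypothetical names used only for illustration. At each step the next node is drawn uniformly among the neighbours of the current node.

```python
import random


def first_order_walk(adjacency, start, walk_length, rng):
    """Sample one unweighted first-order walk with `walk_length` nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:
            # Nodes without neighbours end the walk early.
            break
        # Every neighbour is equally likely: edge weights are ignored.
        walk.append(rng.choice(neighbours))
    return walk


# Hypothetical toy graph, stored as an adjacency list.
toy_adjacency = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1],
    3: [1],
}

rng = random.Random(42)
# Two walks of 8 nodes from every node, mirroring `iterations=2`.
walks = [
    first_order_walk(toy_adjacency, node, 8, rng)
    for node in toy_adjacency
    for _ in range(2)
]
print(walks[0])
```

Each walk is a list of node IDs, which is analogous to one row of the arrays shown above.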

## Weighted first-order random walks
Computation of first-order random walks, biased using the edge weights.

In [9]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 1.49 s, sys: 14.9 ms, total: 1.51 s
Wall time: 794 ms


array([[ 1146,  4660,  1854, ..., 19501,  9628,  8156],
       [13810, 13434, 10163, ...,  9455, 18078, 17508],
       [16210, 16269, 17795, ..., 10410, 13178, 14251],
       ...,
       [12804,  9542,  4835, ...,     2,  2229,  3049],
       [ 9400,  4586,  9115, ...,   268,  1562,  4805],
       [16196, 10247,  1345, ..., 11918, 13508, 10388]], dtype=uint32)

Similarly, to get random walks from all of the nodes in the graph (that are not singletons) it is possible to use:

In [10]:
%%time
graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 1min 33s, sys: 401 ms, total: 1min 34s
Wall time: 48 s


array([[    0,  4593,  1382, ...,  6509,   488,  1957],
       [    1, 13558, 10177, ..., 15955, 11001, 10995],
       [    2, 16486, 17828, ..., 14133, 14340, 15957],
       ...,
       [19560,  5977, 11670, ...,  5827, 15979, 15332],
       [19562, 10620, 12832, ..., 12157,  9968,  9984],
       [19565,   285,  6903, ...,  4169,  5657,  7724]], dtype=uint32)
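The only conceptual difference from the unweighted case is that the next node is drawn with probability proportional to the edge weight. The following is a minimal sketch under that assumption, using only the Python standard library; `toy_weighted` and `weighted_walk` are hypothetical names and do not belong to the Ensmallen API.

```python
import random


def weighted_walk(weighted_adjacency, start, walk_length, rng):
    """Sample one first-order walk biased by the edge weights."""
    walk = [start]
    while len(walk) < walk_length:
        edges = weighted_adjacency[walk[-1]]
        if not edges:
            break
        destinations = [dst for dst, _ in edges]
        weights = [weight for _, weight in edges]
        # The probability of each destination is proportional to its weight.
        walk.append(rng.choices(destinations, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: node -> list of (destination, edge weight).
toy_weighted = {
    0: [(1, 9.0), (2, 1.0)],
    1: [(0, 9.0), (2, 1.0)],
    2: [(0, 1.0), (1, 1.0)],
}

rng = random.Random(7)
# Look at the first step taken from node 0 over many walks.
first_steps = [weighted_walk(toy_weighted, 0, 2, rng)[1] for _ in range(2000)]
print(first_steps.count(1), first_steps.count(2))
```

With weights 9 and 1, the walk should move from node 0 to node 1 roughly nine times as often as to node 2.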

## Second-order random walks
Next, we show the computation of second-order random walks, that is, random walks that use [Node2Vec parameters](https://arxiv.org/abs/1607.00653) to bias the walk towards a breadth-first (BFS) or a depth-first (DFS) exploration.

In [11]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 2.48 s, sys: 14.9 ms, total: 2.5 s
Wall time: 1.33 s


array([[ 1146,  4660,  1856, ..., 19014,  9022,  7592],
       [13810, 13434, 10634, ...,  9589, 18161, 17276],
       [16210, 16269, 17795, ..., 11193, 15925, 13808],
       ...,
       [12804,  9542,  4835, ...,    13,  2830,  2922],
       [ 9400,  4586,  9055, ...,   178,  1830,  6196],
       [16196, 10247,  1388, ..., 12116, 14751, 10853]], dtype=uint32)

In [12]:
%%time
unweighted_graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 2.38 s, sys: 18.7 ms, total: 2.4 s
Wall time: 1.28 s


array([[ 1146,  4994,  1652, ..., 19337,  8975,  7314],
       [13810, 13583,  9998, ...,  8368, 17814, 17716],
       [16210, 16453, 17949, ..., 11054, 15721, 15902],
       ...,
       [12804,  9259,  5257, ...,     6,  2794,  3461],
       [ 9400,  4586,  8637, ...,   323,  1778,  5330],
       [16196, 10247,  1275, ..., 12055, 13706, 10827]], dtype=uint32)

In [13]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 47.6 s, sys: 163 ms, total: 47.8 s
Wall time: 24.3 s


array([[    0,  4593,  1382, ..., 19436,  8768,  8217],
       [    1, 13558, 10368, ...,  7844, 17360, 18342],
       [    2, 16486, 17975, ..., 11684, 16029, 14435],
       ...,
       [19560,  5977, 11670, ...,  4931,   390,  1779],
       [19562, 10620, 12868, ...,  3213,  1859,    45],
       [19565,   285,  6903, ...,  2266,  2796,  1909]], dtype=uint32)

In [14]:
%%time
unweighted_graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 46.2 s, sys: 149 ms, total: 46.4 s
Wall time: 23.6 s


array([[    0,  4508,  1517, ..., 19362,  9034,  8253],
       [    1, 13063,  9565, ...,  9410, 17762, 17536],
       [    2, 16543, 18087, ..., 11381, 15877, 14984],
       ...,
       [19560,  5977, 12294, ...,  5521,   447,  1687],
       [19562, 10620, 12901, ...,  2835,  1599,   130],
       [19565,   285,  6909, ...,  2708,  2757,  1450]], dtype=uint32)
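The sketch below illustrates how a second-order walk can bias each step depending on the previously visited node, following the Node2Vec scheme. It assumes that `return_weight` and `explore_weight` act as multiplicative factors playing the role of 1/p and 1/q in the Node2Vec paper; this mapping is an assumption of the example, as are the helper names `second_order_weights` and `second_order_walk`. It works on an unweighted toy graph and is not Ensmallen's implementation.

```python
import random


def second_order_weights(adjacency, previous, current, return_weight, explore_weight):
    """Unnormalized transition weights out of `current`, having come from `previous`."""
    weights = []
    for candidate in adjacency[current]:
        if candidate == previous:
            # Going back to where we came from.
            weights.append(return_weight)
        elif candidate in adjacency[previous]:
            # Staying close: the candidate is also a neighbour of `previous`.
            weights.append(1.0)
        else:
            # Moving further away from `previous`.
            weights.append(explore_weight)
    return weights


def second_order_walk(adjacency, start, walk_length, return_weight, explore_weight, rng):
    """Sample one second-order walk; the first step has no history and is uniform."""
    walk = [start, rng.choice(adjacency[start])]
    while len(walk) < walk_length:
        previous, current = walk[-2], walk[-1]
        weights = second_order_weights(
            adjacency, previous, current, return_weight, explore_weight
        )
        walk.append(rng.choices(adjacency[current], weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph, stored as an adjacency list.
toy_adjacency = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1],
    3: [1],
}

# Coming from node 0 and standing on node 1, the candidates are 0, 2 and 3.
example_weights = second_order_weights(toy_adjacency, 0, 1, 2.0, 0.5)
print(example_weights)  # [2.0, 1.0, 0.5]

walk = second_order_walk(toy_adjacency, 0, 10, 2.0, 0.5, random.Random(3))
```

Because the weights depend on the previous node, they have to be recomputed at every step, which is why second-order walks are slower than first-order ones in the timings above.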

## Approximated second-order random walks
On graphs that include nodes with extremely high degrees, for instance above 50000, computing the transition weights of those nodes can become a bottleneck. In such cases, approximated random walks can make the computation considerably faster by randomly subsampling each node's neighbours down to a provided maximum number. In the graph considered here, the most central nodes have a degree of at most around 7000, so the impact won't be particularly significant.

We have shown in the GraPE paper that even significant subsampling of the neighbours (maximum node degree clipped at 10) does not change the performance of an edge prediction model trained on the SkipGram node embedding when moving from exact to approximated random walks. This is likely because of the massive number of random walks that can be sampled.

In [15]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 946 ms, sys: 5.94 ms, total: 952 ms
Wall time: 502 ms


array([[ 1146,  2247,  1779, ..., 18680,  6406,  8389],
       [13810, 12798, 10451, ...,  9103, 17828, 17127],
       [16210, 14880, 16765, ...,  9671, 15815, 13398],
       ...,
       [12804,  6645,  4719, ...,    14,  2452,  2670],
       [ 9400,  4026,  7053, ...,   192,  2174,  5087],
       [16196, 10006,   748, ..., 11339, 11392, 10824]], dtype=uint32)

In [16]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 18.5 s, sys: 57.7 ms, total: 18.6 s
Wall time: 9.45 s


array([[    0,  4359,  1254, ..., 18952,  8488,  6979],
       [    1, 10242,  9150, ...,  9298, 17537, 17480],
       [    2, 16137, 17173, ..., 11075, 14614, 13193],
       ...,
       [19560,  5977, 10871, ...,  5339,   406,  1949],
       [19562, 10620, 12821, ...,  3072,  1798,   129],
       [19565,   285,  6338, ...,  2261,  2530,  1869]], dtype=uint32)
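The subsampling idea described in this section can be sketched as follows: before computing the transition weights of a high-degree node, keep only a random subset of its neighbours. This is a simplified illustration with a hypothetical `subsample_neighbours` helper, and it assumes uniform sampling without replacement; the actual strategy used by Ensmallen may differ.

```python
import random


def subsample_neighbours(neighbours, max_neighbours, rng):
    """Return at most `max_neighbours` neighbours, sampled without replacement."""
    if len(neighbours) <= max_neighbours:
        # Low-degree nodes are left untouched.
        return list(neighbours)
    return rng.sample(neighbours, max_neighbours)


rng = random.Random(0)

# A hypothetical hub with 7000 neighbours, similar to the largest degree quoted above.
hub_neighbours = list(range(7000))
sampled = subsample_neighbours(hub_neighbours, 100, rng)

# A low-degree node keeps all of its neighbours.
small = subsample_neighbours([4, 8, 15], 100, rng)

print(len(sampled), small)
```

The transition weights then only need to be computed for the sampled subset, so the cost per step is bounded by `max_neighbours` rather than by the node degree.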

## Enabling the speedups
As explained in more detail in the tutorial [add reference to tutorial], Ensmallen offers several time-memory tradeoffs that act as speedups: they use more RAM in exchange for faster computation. Generally speaking, on graphs with fewer than a few hundred million edges these speedups have a minimal impact on the memory footprint while providing an almost free acceleration of most graph algorithms.

In [17]:
graph.enable()
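As a generic illustration of what a time-memory tradeoff looks like in weighted sampling (and not a description of what `enable()` does internally, which this notebook does not specify), one can precompute and store the cumulative edge weights of every node. Each sampling step then becomes a binary search instead of a fresh pass over the weights. All names below are hypothetical.

```python
import bisect
import itertools
import random


def build_cumulative_weights(weighted_adjacency):
    """Spend extra memory once: store the running sum of weights for each node."""
    return {
        node: list(itertools.accumulate(weight for _, weight in edges))
        for node, edges in weighted_adjacency.items()
    }


def sample_neighbour(weighted_adjacency, cumulative, node, rng):
    """Draw a neighbour proportionally to its weight using a binary search."""
    totals = cumulative[node]
    threshold = rng.random() * totals[-1]
    # Clamp the index to guard against floating point rounding at the upper end.
    index = min(bisect.bisect_right(totals, threshold), len(totals) - 1)
    return weighted_adjacency[node][index][0]


# Hypothetical toy graph: node -> list of (destination, edge weight).
toy_weighted = {
    0: [(1, 9.0), (2, 1.0)],
    1: [(0, 9.0), (2, 1.0)],
    2: [(0, 1.0), (1, 1.0)],
}

cumulative = build_cumulative_weights(toy_weighted)
rng = random.Random(11)
draws = [sample_neighbour(toy_weighted, cumulative, 0, rng) for _ in range(2000)]
print(cumulative[0], draws.count(1), draws.count(2))
```

The cached lists cost memory proportional to the number of edges, which is consistent with the remark above that the footprint stays small on graphs that are not extremely large.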

### Weighted first order random walks with speedups
With the speedups enabled, the first-order random walks become about an order of magnitude faster.

In [18]:
%%time
graph.random_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 396 ms, sys: 3.99 ms, total: 400 ms
Wall time: 222 ms


array([[ 1146,  4660,  1854, ...,  6509,   488,  1957],
       [13810, 13434, 10163, ..., 15955, 11001, 10995],
       [16210, 16269, 17795, ..., 15144, 14380, 16018],
       ...,
       [12804,  9542,  4835, ...,  8797,  5619,  4153],
       [ 9400,  4586,  9115, ...,  8979,  7397, 17274],
       [16196, 10247,  1345, ...,    26,  2183,  3226]], dtype=uint32)

In [19]:
%%time
graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 7.66 s, sys: 41.8 ms, total: 7.7 s
Wall time: 3.99 s


array([[    0,  4593,  1382, ...,  6509,   488,  1957],
       [    1, 13558, 10177, ..., 15955, 11001, 10995],
       [    2, 16486, 17828, ..., 14133, 14340, 15957],
       ...,
       [19560,  5977, 11670, ...,  5827, 15979, 15332],
       [19562, 10620, 12832, ..., 12157,  9968,  9984],
       [19565,   285,  6903, ...,  4169,  5657,  7724]], dtype=uint32)

### Second order random walks with speedups


In [20]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 1.13 s, sys: 2.93 ms, total: 1.13 s
Wall time: 617 ms


array([[ 1146,  4660,  1856, ..., 19014,  9022,  7592],
       [13810, 13434, 10634, ...,  9589, 18161, 17276],
       [16210, 16269, 17795, ..., 11193, 15925, 13808],
       ...,
       [12804,  9542,  4835, ...,    13,  2830,  2922],
       [ 9400,  4586,  9055, ...,   178,  1830,  6196],
       [16196, 10247,  1388, ..., 12116, 14751, 10853]], dtype=uint32)

In [21]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 21.9 s, sys: 105 ms, total: 22 s
Wall time: 11.2 s


array([[    0,  4593,  1382, ..., 19436,  8768,  8217],
       [    1, 13558, 10368, ...,  7844, 17360, 18342],
       [    2, 16486, 17975, ..., 11684, 16029, 14435],
       ...,
       [19560,  5977, 11670, ...,  4931,   390,  1779],
       [19562, 10620, 12868, ...,  3213,  1859,    45],
       [19565,   285,  6903, ...,  2266,  2796,  1909]], dtype=uint32)

### Approximated second-order random walks with speedups

In [22]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 320 ms, sys: 1.98 ms, total: 322 ms
Wall time: 173 ms


array([[ 1146,  2247,  1779, ..., 18680,  6406,  8389],
       [13810, 12798, 10451, ...,  9103, 17828, 17127],
       [16210, 14880, 16765, ...,  9671, 15815, 13398],
       ...,
       [12804,  6645,  4719, ...,    14,  2452,  2670],
       [ 9400,  4026,  7053, ...,   192,  2174,  5087],
       [16196, 10006,   748, ..., 11339, 11392, 10824]], dtype=uint32)

In [24]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 6.3 s, sys: 22.7 ms, total: 6.33 s
Wall time: 3.23 s


array([[    0,  4359,  1254, ..., 18952,  8488,  6979],
       [    1, 10242,  9150, ...,  9298, 17537, 17480],
       [    2, 16137, 17173, ..., 11075, 14614, 13193],
       ...,
       [19560,  5977, 10871, ...,  5339,   406,  1949],
       [19562, 10620, 12821, ...,  3072,  1798,   129],
       [19565,   285,  6338, ...,  2261,  2530,  1869]], dtype=uint32)