# First- and second-order random walks
First- and second-order random walks are a node-sampling mechanism that can be employed in a large number of algorithms. In this notebook we briefly show how to use Ensmallen to sample a large number of random walks from big graphs.

To install the GraPE library run:


In [2]:
pip install -qU grape

Note: you may need to restart the kernel to use updated packages.


## Retrieving a graph to run the sampling on
In this tutorial we will run examples on the [Homo Sapiens graph from STRING](https://string-db.org/cgi/organisms). If you want to load a graph from an edge list, follow the examples provided in the Loading a Graph in Ensmallen tutorial.

In [3]:
from grape.datasets.string import HomoSapiens

Retrieving and loading the graph

In [4]:
graph = HomoSapiens()
# Filter the graph weights at 700
graph = graph.filter_from_names(min_edge_weight=700)
# We also create a version of the graph without edge weights
unweighted_graph = graph.remove_edge_weights()

We compute the graph report:

In [5]:
graph

and the unweighted graph report:

In [6]:
unweighted_graph

## Random walks are heavily parallelized
All the algorithms provided by Ensmallen to sample random walks are heavily parallelized, so running them on machines with a large number of threads leads to better time performance. The cell below reports the number of logical cores of the machine used for the timings in this notebook; on an instance with fewer cores, such as a COLAB instance with only 2 cores, the performance will not be as good as it could be even on your laptop or your cellphone (Ensmallen can run on Android phones).

In [7]:
from multiprocessing import cpu_count

cpu_count()

12
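To make the idea of parallel sampling concrete, here is a minimal pure-Python sketch, not Ensmallen's implementation. It relies on the fact that each walk is independent of the others, so walks can be handed to a pool of workers. The toy graph, the `seeded_walk` helper and the per-walk seeds are all hypothetical names introduced for illustration; giving each walk its own seed keeps the result identical whether the walks run sequentially or in parallel.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy undirected graph as an adjacency list (hypothetical data).
ADJACENCY = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}


def seeded_walk(task):
    """Sample one unweighted walk; `task` is (start node, walk length, seed)."""
    start, walk_length, seed = task
    rng = random.Random(seed)  # private generator, so walks do not share state
    walk = [start]
    while len(walk) < walk_length:
        walk.append(rng.choice(ADJACENCY[walk[-1]]))
    return walk


# Two walks from each node, every walk with its own seed.
tasks = [(start, 8, 1000 + i) for i, start in enumerate([0, 1, 2, 3] * 2)]

# Reference result, computed one walk at a time.
sequential = [seeded_walk(task) for task in tasks]

# Same walks distributed over a pool; `map` preserves the task order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(seeded_walk, tasks))
```

Because of the interpreter lock, Python threads will not show a real speed-up here; the sketch only illustrates how the work decomposes into independent tasks.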

## Unweighted first-order random walks
Computation of first-order random walks ignoring the edge weights. In the following examples random walks are computed (on unweighted and weighted graphs) by either invoking method *random_walks* or method *complete_walks*.

*random_walks* automatically chooses between exact and sampled random walks; use this method if you want to let *GraPE* choose the best option.

*complete_walks* is the method used to compute exact walks.
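As a rough mental model of what an unweighted first-order walk does, the following pure-Python sketch picks each next node uniformly among the neighbours of the current one. It is only an illustration under our own assumptions: the toy graph and the helpers `first_order_walk`, `sample_walks` and `all_node_walks` are hypothetical and do not reflect how Ensmallen lays out its results. The two wrappers mirror the parameters used in this notebook: walks from `quantity` random start nodes, or walks from every node, each repeated `iterations` times.

```python
import random

# Toy undirected graph as an adjacency list (hypothetical data, not the STRING graph).
ADJACENCY = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1],
    3: [1],
}


def first_order_walk(adjacency, start, walk_length, rng):
    """Return a walk of up to `walk_length` nodes with uniform transitions."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbours))
    return walk


def sample_walks(adjacency, walk_length, quantity, iterations, seed=42):
    """`iterations` walks from each of `quantity` randomly chosen start nodes."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    starts = [rng.choice(nodes) for _ in range(quantity)]
    return [
        first_order_walk(adjacency, start, walk_length, rng)
        for _ in range(iterations)
        for start in starts
    ]


def all_node_walks(adjacency, walk_length, iterations, seed=42):
    """`iterations` walks starting from every node of the graph."""
    rng = random.Random(seed)
    return [
        first_order_walk(adjacency, start, walk_length, rng)
        for _ in range(iterations)
        for start in sorted(adjacency)
    ]


sampled = sample_walks(ADJACENCY, walk_length=5, quantity=3, iterations=2)
exhaustive = all_node_walks(ADJACENCY, walk_length=5, iterations=2)
```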

In [8]:
%%time
unweighted_graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 17 ms, sys: 1.08 ms, total: 18.1 ms
Wall time: 2.03 ms


array([[ 9426, 15402,  1973, ..., 10025, 17136, 12075],
       [18629, 13618,   817, ...,  6522, 19277, 18985],
       [ 4659,  2992, 14766, ...,  2127,  8491, 12404],
       ...,
       [ 3387,  9653,  1728, ...,  7196, 14670, 16603],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316,  9055,  4075, ...,   130, 16937,  3506]], dtype=uint32)

In [9]:
%%time
unweighted_graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 849 ms, sys: 3.9 ms, total: 852 ms
Wall time: 82.7 ms


array([[    0, 17164,  6401, ...,  9888,  1510, 16834],
       [    1,  8997,  8240, ..., 14994,  3191, 12276],
       [    2, 12438, 10922, ..., 16152,   721,  4907],
       ...,
       [19551,  7490,  5253, ...,  3221,  6992, 17486],
       [19557, 10177,  8618, ...,   667,  9035,  7346],
       [19558,  5960,  9034, ...,  5685,   508, 18766]], dtype=uint32)

## Weighted first-order random walks
Computation of first-order random walks, biased using the edge weights.
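The bias introduced by the edge weights can be sketched as follows: the probability of moving to a neighbour is proportional to the weight of the connecting edge. This is a minimal pure-Python illustration, assuming a simple proportional rule; the toy weights and the helpers `weighted_step`, `weighted_walk` and `transition_probabilities` are hypothetical and are not part of GraPE.

```python
import random

# Toy weighted graph: node -> list of (neighbour, edge weight). Hypothetical data.
WEIGHTED_ADJACENCY = {
    0: [(1, 900.0), (2, 700.0)],
    1: [(0, 900.0), (2, 750.0)],
    2: [(0, 700.0), (1, 750.0)],
}


def weighted_step(adjacency, node, rng):
    """Pick the next node with probability proportional to the edge weight."""
    neighbours = [neighbour for neighbour, _ in adjacency[node]]
    weights = [weight for _, weight in adjacency[node]]
    return rng.choices(neighbours, weights=weights, k=1)[0]


def weighted_walk(adjacency, start, walk_length, rng):
    """Return a walk of `walk_length` nodes biased by the edge weights."""
    walk = [start]
    while len(walk) < walk_length:
        walk.append(weighted_step(adjacency, walk[-1], rng))
    return walk


def transition_probabilities(adjacency, node):
    """Normalised transition probabilities out of `node`."""
    total = sum(weight for _, weight in adjacency[node])
    return {neighbour: weight / total for neighbour, weight in adjacency[node]}


rng = random.Random(0)
walk = weighted_walk(WEIGHTED_ADJACENCY, start=0, walk_length=6, rng=rng)
probabilities = transition_probabilities(WEIGHTED_ADJACENCY, 0)
```

In this toy graph, node 0 moves to node 1 with probability 900 / 1600 and to node 2 with probability 700 / 1600.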

In [10]:
%%time
graph.random_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 1.07 s, sys: 4.31 ms, total: 1.07 s
Wall time: 93.6 ms


array([[ 9426, 15402,  6110, ...,  2982,    44,   926],
       [18629,  4007,  3840, ..., 13705, 10240, 10145],
       [ 4659,  4019, 10169, ...,   299, 10828,  9044],
       ...,
       [ 3387,  9653,  2538, ...,  5158, 15892, 13498],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10195, ...,  3758,  1016,  4500]], dtype=uint32)

Similarly, to get random walks from all of the nodes in the graph it is possible to use:

In [11]:
%%time
graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 18.1 s, sys: 28.7 ms, total: 18.1 s
Wall time: 1.54 s


array([[    0,  2678,  2106, ...,  3481,   113,  1771],
       [    1,  2967,  2227, ...,  9148, 10725, 10120],
       [    2,  2661,  9333, ...,   830,  8968,  7574],
       ...,
       [19551, 18771, 13619, ...,  6669, 16629,  1656],
       [19557, 16304, 15886, ...,  5088, 10724,  9159],
       [19558, 12842,  5867, ..., 10237,  4187,  7050]], dtype=uint32)

## Second-order random walks
In the following we show the computation of second-order random walks, that is random walks that use [Node2Vec parameters](https://arxiv.org/abs/1607.00653) to bias the random walk towards a BFS or a DFS.
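To see how the two parameters bias a walk, here is a small pure-Python sketch of the Node2Vec transition rule on an unweighted toy graph. We assume that `return_weight` and `explore_weight` play the roles of 1/p and 1/q from the Node2Vec paper; that is our reading of the parameter names, not something stated in this notebook. The helpers `second_order_weights` and `second_order_walk` are hypothetical.

```python
import random

# Toy undirected graph as an adjacency list (hypothetical data).
ADJACENCY = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}


def second_order_weights(adjacency, previous, current, return_weight, explore_weight):
    """Unnormalised weights for leaving `current` after arriving from `previous`."""
    previous_neighbours = set(adjacency[previous])
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            weights[candidate] = return_weight   # step back to where we came from
        elif candidate in previous_neighbours:
            weights[candidate] = 1.0             # stay one hop away from `previous`
        else:
            weights[candidate] = explore_weight  # move two hops away from `previous`
    return weights


def second_order_walk(adjacency, start, walk_length, return_weight, explore_weight, rng):
    """Sample a walk whose transitions depend on the previous node as well."""
    # The first step has no previous node, so it is drawn uniformly.
    walk = [start, rng.choice(adjacency[start])]
    while len(walk) < walk_length:
        previous, current = walk[-2], walk[-1]
        weights = second_order_weights(
            adjacency, previous, current, return_weight, explore_weight
        )
        candidates = list(weights)
        walk.append(
            rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]
        )
    return walk[:walk_length]


example_weights = second_order_weights(
    ADJACENCY, previous=0, current=1, return_weight=2.0, explore_weight=0.5
)
walk = second_order_walk(
    ADJACENCY, start=0, walk_length=10,
    return_weight=2.0, explore_weight=0.5, rng=random.Random(7),
)
```

In this sketch a larger `return_weight` makes stepping back more likely, while a larger `explore_weight` favours moving away from the previous node. On a weighted graph each of these factors would additionally be multiplied by the edge weight.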

In [12]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 351 ms, sys: 2.11 ms, total: 353 ms
Wall time: 32.1 ms


array([[ 9426, 15402,  6110, ..., 15356,  9814, 15356],
       [18629,  4007,  3840, ...,  2523,  9546, 12910],
       [ 4659,  4019,  4659, ...,  3359, 11058, 16488],
       ...,
       [ 3387,  9653,  2538, ..., 18061,  3797,  2599],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10700, ...,    87,  8651,  1641]], dtype=uint32)

In [13]:
%%time
unweighted_graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 345 ms, sys: 2.85 ms, total: 348 ms
Wall time: 31.7 ms


array([[ 9426, 15402,  6110, ..., 12097, 11796,   770],
       [18629,  4007,  3840, ...,   575,  4294, 18042],
       [ 4659,  3549,  8307, ...,  1364,  2697, 19107],
       ...,
       [ 3387,  9653,  2588, ..., 14895,  4428,  5682],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10195, ...,   120, 11824,  1623]], dtype=uint32)

In [14]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 5.97 s, sys: 9.36 ms, total: 5.98 s
Wall time: 509 ms


array([[    0,  2678,  1961, ..., 10003, 10745,  1052],
       [    1,  2967,  2227, ...,   406,  4394, 18911],
       [    2,  2661,  9414, ...,  3188,  5415, 16506],
       ...,
       [19551, 18771, 13619, ..., 15779,  4872, 15779],
       [19557, 16304, 15906, ..., 14340,  9833, 14340],
       [19558, 12842,  6061, ...,  6325, 12398,  6928]], dtype=uint32)

In [15]:
%%time
unweighted_graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 5.86 s, sys: 11.9 ms, total: 5.87 s
Wall time: 514 ms


array([[    0,  2678,  2106, ..., 11919, 11616,   907],
       [    1,  2967,  2227, ...,  4467,  3956, 18173],
       [    2,  2661,  9333, ...,  2272,  4489, 13232],
       ...,
       [19551, 18771, 13619, ...,  9103,  4462,  5424],
       [19557, 16304, 15886, ..., 15276, 10350,  8966],
       [19558, 12697,  5938, ...,   406,  9387,   406]], dtype=uint32)

## Approximated second-order random walks
When working on graphs where some nodes have an extremely high degree *d* (e.g. *d > 50000*), computing the transition weights can become a bottleneck. In those cases approximated random walks can make the computation considerably faster, by randomly subsampling each node's neighbourhood down to a maximum size provided by the user. In the graph considered here, the highest node degree is *d* $\approx$ *7000*.

In the GraPE paper we show experiments comparing the edge-prediction performance of a model trained on graph embeddings obtained with the Skipgram model, using either exact random walks or random walks obtained with significant subsampling of each node's neighbourhood (maximum node degree clipped at 10). The comparative evaluation shows no decrease in performance.
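The subsampling idea can be sketched in a few lines of pure Python: before choosing the next node, the neighbourhood is cut down to at most `max_neighbours` randomly chosen candidates, so the cost of a step no longer grows with the degree of a hub. This is only an illustration under our own assumptions; the helpers `subsample_neighbours` and `approximated_step` are hypothetical and do not describe how Ensmallen performs the subsampling internally.

```python
import random


def subsample_neighbours(neighbours, max_neighbours, rng):
    """Keep at most `max_neighbours` neighbours, drawn without replacement."""
    if len(neighbours) <= max_neighbours:
        return list(neighbours)
    return rng.sample(neighbours, max_neighbours)


def approximated_step(adjacency, node, max_neighbours, rng):
    """Choose the next node uniformly among a subsample of the neighbourhood."""
    candidates = subsample_neighbours(adjacency[node], max_neighbours, rng)
    return rng.choice(candidates)


# Toy graph with one hub (node 0) connected to 1000 other nodes. Hypothetical data.
hub_neighbours = list(range(1, 1001))
adjacency = {0: hub_neighbours, 1: [0]}

rng = random.Random(3)
candidates = subsample_neighbours(hub_neighbours, 10, rng)
unchanged = subsample_neighbours(adjacency[1], 10, rng)
next_node = approximated_step(adjacency, 0, max_neighbours=10, rng=rng)
```

In a second-order walk the transition weights would then be computed only for the retained candidates, which is where the saving comes from on very high-degree nodes.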

In [16]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 353 ms, sys: 2.35 ms, total: 355 ms
Wall time: 32.7 ms


array([[ 9426, 15402,  6110, ..., 15356,  9814, 15356],
       [18629,  4007,  3840, ...,  2523,  9546, 12910],
       [ 4659,  4019,  4659, ...,  3359, 11058, 16488],
       ...,
       [ 3387,  9653,  2538, ..., 18061,  3797,  2599],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10700, ...,    87,  8651,  1641]], dtype=uint32)

In [17]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 5.95 s, sys: 11 ms, total: 5.97 s
Wall time: 510 ms


array([[    0,  2678,  1961, ..., 10003, 10745,  1052],
       [    1,  2967,  2227, ...,   406,  4394, 18911],
       [    2,  2661,  9414, ...,  3188,  5415, 16506],
       ...,
       [19551, 18771, 13619, ..., 15779,  4872, 15779],
       [19557, 16304, 15906, ..., 14340,  9833, 14340],
       [19558, 12842,  6061, ...,  6325, 12398,  6928]], dtype=uint32)

## Enabling the speedups
Ensmallen provides numerous speed-ups based on time-memory tradeoffs, which allow faster computation. The automatic speed-ups can be enabled by simply setting a flag:

In [18]:
graph.enable()
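As a generic illustration of a time-memory tradeoff for weighted sampling, the sketch below precomputes the running sum of each node's edge weights once and then draws each step with a binary search instead of rescanning the weights. We do not claim this is what `enable()` caches; it is a pure-Python example of the general idea, and the helpers `build_cumulative_table` and `fast_weighted_step` are hypothetical.

```python
import bisect
import itertools
import random

# Toy weighted graph: node -> list of (neighbour, edge weight). Hypothetical data.
WEIGHTED_ADJACENCY = {
    0: [(1, 900.0), (2, 700.0)],
    1: [(0, 900.0), (2, 750.0)],
    2: [(0, 700.0), (1, 750.0)],
}


def build_cumulative_table(adjacency):
    """Spend memory once: store the running sum of the edge weights of each node."""
    return {
        node: list(itertools.accumulate(weight for _, weight in edges))
        for node, edges in adjacency.items()
    }


def fast_weighted_step(adjacency, cumulative, node, rng):
    """Save time at every step: binary search in the precomputed running sums."""
    totals = cumulative[node]
    threshold = rng.random() * totals[-1]
    index = bisect.bisect_right(totals, threshold)
    index = min(index, len(totals) - 1)  # guard against floating-point rounding
    return adjacency[node][index][0]


cumulative = build_cumulative_table(WEIGHTED_ADJACENCY)
rng = random.Random(0)
steps = [fast_weighted_step(WEIGHTED_ADJACENCY, cumulative, 0, rng) for _ in range(20)]
```

The table costs one extra number per edge, and in exchange each step takes time logarithmic in the node degree rather than linear.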

### Weighted first order random walks with speedups
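With the speed-ups enabled, the weighted first-order random walks run roughly four times faster than before: the wall time drops from 93.6 ms to 22.6 ms for *random_walks* and from 1.54 s to 379 ms for *complete_walks*.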

In [19]:
%%time
graph.random_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 236 ms, sys: 3.11 ms, total: 240 ms
Wall time: 22.6 ms


array([[ 9426, 15402,  6110, ...,  2982,    44,   926],
       [18629,  4007,  3840, ..., 13705, 10240, 10145],
       [ 4659,  4019, 10169, ...,   299, 10828,  9044],
       ...,
       [ 3387,  9653,  2538, ...,  5158, 15892, 13498],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10195, ...,  3758,  1016,  4500]], dtype=uint32)

In [20]:
%%time
graph.complete_walks(
    # We want random walks with length 100
    walk_length=100,
    # We want 2 iterations from each node
    iterations=2
)

CPU times: user 4.09 s, sys: 24.6 ms, total: 4.12 s
Wall time: 379 ms


array([[    0,  2678,  2106, ...,  3481,   113,  1771],
       [    1,  2967,  2227, ...,  9148, 10725, 10120],
       [    2,  2661,  9333, ...,   830,  8968,  7574],
       ...,
       [19551, 18771, 13619, ...,  6669, 16629,  1656],
       [19557, 16304, 15886, ...,  5088, 10724,  9159],
       [19558, 12842,  5867, ..., 10237,  4187,  7050]], dtype=uint32)

### Second order random walks with speedups


In [21]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 123 ms, sys: 2.74 ms, total: 126 ms
Wall time: 13.3 ms


array([[ 9426, 15402,  6110, ..., 15356,  9814, 15356],
       [18629,  4007,  3840, ...,  2523,  9546, 12910],
       [ 4659,  4019,  4659, ...,  3359, 11058, 16488],
       ...,
       [ 3387,  9653,  2538, ..., 18061,  3797,  2599],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10700, ...,    87,  8651,  1641]], dtype=uint32)

In [22]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
)

CPU times: user 1.88 s, sys: 17.3 ms, total: 1.9 s
Wall time: 177 ms


array([[    0,  2678,  1961, ..., 10003, 10745,  1052],
       [    1,  2967,  2227, ...,   406,  4394, 18911],
       [    2,  2661,  9414, ...,  3188,  5415, 16506],
       ...,
       [19551, 18771, 13619, ..., 15779,  4872, 15779],
       [19557, 16304, 15906, ..., 14340,  9833, 14340],
       [19558, 12842,  6061, ...,  6325, 12398,  6928]], dtype=uint32)

### Approximated second-order random walks with speedups

In [23]:
%%time
graph.random_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want to get random walks starting from 1000 random nodes
    quantity=1000,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 139 ms, sys: 2.4 ms, total: 141 ms
Wall time: 14.8 ms


array([[ 9426, 15402,  6110, ..., 15356,  9814, 15356],
       [18629,  4007,  3840, ...,  2523,  9546, 12910],
       [ 4659,  4019,  4659, ...,  3359, 11058, 16488],
       ...,
       [ 3387,  9653,  2538, ..., 18061,  3797,  2599],
       [13708, 19212, 13708, ..., 19212, 13708, 19212],
       [14316, 16827, 10700, ...,    87,  8651,  1641]], dtype=uint32)

In [24]:
%%time
graph.complete_walks(
    # We want random walks with length 32
    walk_length=32,
    # We want 2 iterations from each node
    iterations=2,
    return_weight=2.0,
    explore_weight=2.0,
    # We will subsample the neighbours of the nodes
    # dynamically to 100.
    max_neighbours=100
)

CPU times: user 1.9 s, sys: 16.6 ms, total: 1.91 s
Wall time: 175 ms


array([[    0,  2678,  1961, ..., 10003, 10745,  1052],
       [    1,  2967,  2227, ...,   406,  4394, 18911],
       [    2,  2661,  9414, ...,  3188,  5415, 16506],
       ...,
       [19551, 18771, 13619, ..., 15779,  4872, 15779],
       [19557, 16304, 15906, ..., 14340,  9833, 14340],
       [19558, 12842,  6061, ...,  6325, 12398,  6928]], dtype=uint32)