# 07-02 Walk-based Node Similarities

*June 7 2023*  

Going beyond local similarity measures that only consider the immediate neighborhood of nodes, we explore walk-based similarity scores that consider the number of walks between nodes.

In [2]:
import pathpy as pp
import numpy as np

from matplotlib import pyplot as plt
from collections import defaultdict

import scipy as sp

import pandas as pd
import seaborn as sns 

plt.style.use('default')
sns.set_style("whitegrid")

plot_style = {
    'edge_color': 'grey'
}

We use the same toy example and the same functions as in the previous notebook:

In [3]:
n = pp.Network(directed=False)
n.add_edge('a', 'b')
n.add_edge('b', 'c')
n.add_edge('c', 'a')
n.add_edge('d', 'e')
n.add_edge('e', 'f')
n.add_edge('f', 'g')
n.add_edge('g', 'd')
n.add_edge('d', 'f')
n.add_edge('b', 'd')
n.plot(**plot_style)

In [4]:
def highlight_edges(styler, edge_df):
    styler.set_table_styles([{'selector': '.true', 'props': 'background-color: #e6ffe6;'},
    {'selector': '.false', 'props': 'background-color: #ffe6e6;'}], overwrite=False)
    styler.set_td_classes(edge_df)
    return styler

def get_node_similarities(n, score, **kwargs):
    sim_scores = defaultdict(defaultdict)
    edges = defaultdict(defaultdict)
    
    nodes = [v.uid for v in n.nodes]

    for v in n.nodes:
        for w in n.nodes:
            sim_scores[v][w] = score(n, v.uid, w.uid, **kwargs)
            if (v,w) in n.edges:
                edges[v][w] = 'true '
            else:
                edges[v][w] = 'false '

    df = pd.DataFrame.from_dict(sim_scores, orient='index')
    edge_df = pd.DataFrame.from_dict(edges, orient='index')

    edge_df.columns = nodes
    edge_df.index = nodes
    df.columns = nodes
    df.index = nodes
    return df, edge_df

### Number of walks of length k

In the previous notebook, we motivated similarity scores using the inverse path length, a non-local score that uses the inverse of the topological distance between nodes. In our toy example, we found that the inverse path length from node $e$ to the nodes $g$ and $b$ is the same, even though nodes $e$ and $g$ are located in the same *community*, while nodes $e$ and $b$ are located in different communities (e.g. using the notion of communities introduced in L02). 

Intuitively, we may be interested in a similarity score that assigns higher similarities to nodes in the same communities as opposed to nodes in different communities. The issue with the inverse path length is that we only consider a single, i.e. the shortest, path between two nodes. However, for pairs of nodes that are located in the same densely connected community we expect *more* different paths (or walks) of a given length to exist, compared to pairs of nodes located in different clusters. 

A simple way to account for this is to compute the total number of walks of a given length $k$. Referring to L02, we can compute this based on the $k$-th power of the (binary) adjacency matrix. The following function calculates the number of walks of exactly length two between any given pair of nodes. 

In [5]:
def num_walks(n, v, w, **kwargs):
    A = pp.algorithms.adjacency_matrix(n)
    Ak = A**kwargs['k']
    return Ak[n.nodes.index[v], n.nodes.index[w]]

Using this as a "similarity" score, we find that node "e" is now more similar to node "g" compared to node $b$, even though it is connected by a path of length two two both "g" and "b". Due to the dense connectivity in the community two different paths of length two exist from $e$ to $g$, while a single path of length two exists from $e$ to $b$.

In [6]:
df, edges = get_node_similarities(n, num_walks, k=2)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,2.0,1.0,1.0,1.0,0.0,0.0,0.0
b,1.0,3.0,1.0,0.0,1.0,1.0,1.0
c,1.0,1.0,2.0,1.0,0.0,0.0,0.0
d,1.0,0.0,1.0,4.0,1.0,2.0,1.0
e,0.0,1.0,0.0,1.0,2.0,1.0,2.0
f,0.0,1.0,0.0,2.0,1.0,3.0,1.0
g,0.0,1.0,0.0,1.0,2.0,1.0,2.0


A problem with this approach is that we somehow must adjust the parameter $k$, which gives the length of the paths that we are interested in, to the size of the communities. If, for instance, we calculate the number of paths of length three, we find that two paths exist from node $f$ to $b$, because those paths can take different routes within the community of node $b$, while taking the same edge connecting this community to node $b$.

In [7]:
df, edges = get_node_similarities(n, num_walks, k=3)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,2.0,4.0,3.0,1.0,1.0,1.0,1.0
b,4.0,2.0,4.0,6.0,1.0,2.0,1.0
c,3.0,4.0,2.0,1.0,1.0,1.0,1.0
d,1.0,6.0,1.0,4.0,6.0,6.0,6.0
e,1.0,1.0,1.0,6.0,2.0,5.0,2.0
f,1.0,2.0,1.0,6.0,5.0,4.0,5.0
g,1.0,1.0,1.0,6.0,2.0,5.0,2.0


In [8]:
df, edges = get_node_similarities(n, num_walks, k=4)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,7.0,6.0,6.0,7.0,2.0,3.0,2.0
b,6.0,14.0,6.0,6.0,8.0,8.0,8.0
c,6.0,6.0,7.0,7.0,2.0,3.0,2.0
d,7.0,6.0,7.0,24.0,10.0,16.0,10.0
e,2.0,8.0,2.0,10.0,11.0,10.0,11.0
f,3.0,8.0,3.0,16.0,10.0,16.0,10.0
g,2.0,8.0,2.0,10.0,11.0,10.0,11.0


Above, we have only counted walks of a specific length. We can instead take the sum of all matrix powers to calculate the number of walks with a length up to a given value $k$.

In [11]:
def num_walks(n, v, w, **kwargs):
    A = pp.algorithms.adjacency_matrix(n)
    Ak = A
    for i in range(2, kwargs['k']):
        Ak += A**kwargs['k']
    return Ak[n.nodes.index[v], n.nodes.index[w]]

In [12]:
df, edges = get_node_similarities(n, num_walks, k=3)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,2.0,5.0,4.0,1.0,1.0,1.0,1.0
b,5.0,2.0,5.0,7.0,1.0,2.0,1.0
c,4.0,5.0,2.0,1.0,1.0,1.0,1.0
d,1.0,7.0,1.0,4.0,7.0,7.0,7.0
e,1.0,1.0,1.0,7.0,2.0,6.0,2.0
f,1.0,2.0,1.0,7.0,6.0,4.0,6.0
g,1.0,1.0,1.0,7.0,2.0,6.0,2.0


In [13]:
df, edges = get_node_similarities(n, num_walks, k=4)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,14.0,13.0,13.0,14.0,4.0,6.0,4.0
b,13.0,28.0,13.0,13.0,16.0,16.0,16.0
c,13.0,13.0,14.0,14.0,4.0,6.0,4.0
d,14.0,13.0,14.0,48.0,21.0,33.0,21.0
e,4.0,16.0,4.0,21.0,22.0,21.0,22.0
f,6.0,16.0,6.0,33.0,21.0,32.0,21.0
g,4.0,16.0,4.0,21.0,22.0,21.0,22.0


In [14]:
df, edges = get_node_similarities(n, num_walks, k=30)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,20072797668420.0,35586842038049.0,20072797668393.0,58538707968692.0,38817356557784.0,49105887922084.0,38817356557784.0
b,35586842038049.0,63097461267456.0,35586842038049.0,103788735107009.0,68827239332992.0,87067533162176.0,68827239332992.0
c,20072797668393.0,35586842038049.0,20072797668420.0,58538707968692.0,38817356557784.0,49105887922084.0,38817356557784.0
d,58538707968692.0,103788735107009.0,58538707968692.0,170742057158232.0,113221555153617.0,143231437928825.0,113221555153617.0
e,38817356557784.0,68827239332992.0,38817356557784.0,113221555153617.0,75082497214816.0,94981261324432.98,75082497214816.0
f,49105887922084.0,87067533162176.0,49105887922084.0,143231437928825.0,94981261324432.98,120155111654424.0,94981261324432.98
g,38817356557784.0,68827239332992.0,38817356557784.0,113221555153617.0,75082497214816.0,94981261324432.98,75082497214816.0


## Katz index

The issue with this approach is two-fold: We first need to specify the parameter $k$ up to which we want to consider walks, which is likely to depend on the diameter of the network. Moreover, we assign equal importance to paths of any length and there are (most likely) many more paths with larger length $k$ compared to smaller length $k$.

We can address this with the so-called Katz index, which is defined as the following infinite sum:

$$ K(v,w) = \sum_{l=1}^{\infty} \beta^l |\text{walks}_{v,w}^{l}| = \beta A_{vw} + \beta^2(A^2)_{vw} + \beta^3 (A^3)_{vw}$$

where $\text{walks}_{vw}^l$ is the set of all walks of exactly length $l$ that start in $v$ and end in $w$. $\beta$ is a damping factor, where smaller values of $\lambda$ assign more weight to shorter walks.

If $\beta$ is chosen smaller than the reciprocal largest eigenvalue of $A$, this converges to the following equation (in matrix form):

$$ \mathbf{S} = \left(\mathbf{I}- \beta \mathbf{A} \right)^{-1} - \mathbf{I} $$

This implies that we can calculate a matrix $\mathbf{S}$ that captures all pairwise node similarities by a simple inversion of matrix $\mathbf{I}-\beta \mathbf{A}$ (and subtracting the identity matrix). Using `scipy` and `pathpy` we can compute this as follows:

In [15]:
def katz_index(network: pp.Network, v, w, **kwargs) -> float:
    A = pp.algorithms.adjacency_matrix(network)
    I = sp.sparse.identity(network.number_of_nodes())
    S = sp.sparse.linalg.inv(I - kwargs['beta'] * A) - I
    return S[network.nodes.index[v], network.nodes.index[w]]

The resulting similarity score depends on a parameter $\beta$ that controls the dampening factor of longer walks. For small $\beta$ the similarity is mostly based on very short walks, i.e. node pairs that are not directly connected tend to have small similarity.

In [16]:
katz_index(n, 'e', 'g', beta=0.02)

  warn('spsolve is more efficient when sparse b '


0.0008178288343514751

By increasing $\beta$ we can increase the influence of longer walks on the similarity scores:

In [17]:
katz_index(n, 'e', 'g', beta=0.2)

0.12958435207823965

In [18]:
df, edges = get_node_similarities(n, katz_index, beta=0.02)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,0.000817,0.020433,0.020425,0.000409,8e-06,9e-06,8e-06
b,0.020433,0.001218,0.020433,0.020049,0.000409,0.000417,0.000409
c,0.020425,0.020433,0.000817,0.000409,8e-06,9e-06,8e-06
d,0.000409,0.020049,0.000409,0.001636,0.02045,0.020851,0.02045
e,8e-06,0.000409,8e-06,0.02045,0.000818,0.020442,0.000818
f,9e-06,0.000417,9e-06,0.020851,0.020442,0.001235,0.020442
g,8e-06,0.000409,8e-06,0.02045,0.000818,0.020442,0.000818


In [19]:
df, edges = get_node_similarities(n, katz_index, beta=0.2)
df.style.pipe(highlight_edges, edges)

  warn('spsolve is more efficient when sparse b '


Unnamed: 0,a,b,c,d,e,f,g
a,0.115016,0.293399,0.281683,0.070293,0.018337,0.021394,0.018337
b,0.293399,0.173594,0.293399,0.281174,0.07335,0.085575,0.07335
c,0.281683,0.293399,0.115016,0.070293,0.018337,0.021394,0.018337
d,0.070293,0.281174,0.070293,0.265281,0.330073,0.385086,0.330073
e,0.018337,0.07335,0.018337,0.330073,0.129584,0.317848,0.129584
f,0.021394,0.085575,0.021394,0.385086,0.317848,0.204156,0.317848
g,0.018337,0.07335,0.018337,0.330073,0.129584,0.317848,0.129584


## Leicht-Holme-Newman Index

One issue of the Katz index is that it does not correct for the number of walks that we can expect based on the degrees of nodes. To address this, we can define an index that divides the number of walks between each pair of nodes by the number of expected walks. 

This idea yields the Leicht-Holme-Newman index, which can be rewritten to the following equation: 

$$ S = D^{-1}\left(I-\frac{\alpha A}{\lambda_1} \right)^{-1}D^{-1} $$

where $\alpha$ is a free parameter in $\left[0, 1\right]$ and $\lambda_1$ is the largest eigenvalue of the adjacency matrix. Like for the Katz index, values of $\alpha$ assign more weight to short paths. We can implement it as follows:

In [20]:
def LHN_index(network: pp.Network, v, w, **kwargs) -> float:
    A = pp.algorithms.adjacency_matrix(network)
    ev = sp.sparse.linalg.eigs(A, which="LM", k=2, return_eigenvectors=False)
    m = network.number_of_edges()
    eigenvalues_sorted = np.sort(np.absolute(ev))
    lambda_1 = eigenvalues_sorted[0]
    D = sp.sparse.diags(pp.statistics.degree_sequence(network))
    I = sp.sparse.identity(network.number_of_nodes())
    S = 2*m*lambda_1*sp.sparse.linalg.inv(D) * sp.sparse.linalg.inv(I - kwargs['alpha']*A/lambda_1)*sp.sparse.linalg.inv(D)
    return S[network.nodes.index[v], network.nodes.index[w]]


In [21]:
LHN_index(n, 'e', 'g', alpha=0.02)

  warn('spsolve requires A be CSC or CSR matrix format',
  warn('spsolve is more efficient when sparse b '


0.0018967811367335168

In [22]:
LHN_index(n, 'e', 'g', alpha=0.2)

0.2215694282551531

In [23]:
df, edges = get_node_similarities(n, LHN_index, alpha=0.02)
df.style.pipe(highlight_edges, edges)

Unnamed: 0,a,b,c,d,e,f,g
a,8.636903,0.060652,0.090968,0.000474,1e-05,7e-06,1e-05
b,0.060652,3.839041,0.060652,0.03002,0.000632,0.000426,0.000632
c,0.090968,0.060652,8.636903,0.000474,1e-05,7e-06,1e-05
d,0.000474,0.03002,0.000474,2.1597,0.045499,0.030645,0.045499
e,1e-05,0.000632,1e-05,0.045499,8.636903,0.060659,0.001897
f,7e-06,0.000426,7e-06,0.030645,0.060659,3.839049,0.060659
g,1e-05,0.000632,1e-05,0.045499,0.001897,0.060659,8.636903


In [24]:
df, edges = get_node_similarities(n, LHN_index, alpha=0.2)
df.style.pipe(highlight_edges, edges)

  warn('spsolve requires A be CSC or CSR matrix format',
  warn('spsolve is more efficient when sparse b '


Unnamed: 0,a,b,c,d,e,f,g
a,8.851053,0.694499,1.031096,0.05643,0.013278,0.009687,0.013278
b,0.694499,3.979227,0.694499,0.323325,0.076076,0.055504,0.076076
c,1.031096,0.694499,8.851053,0.05643,0.013278,0.009687,0.013278
d,0.05643,0.323325,0.05643,2.270163,0.534152,0.389713,0.534152
e,0.013278,0.076076,0.013278,0.534152,8.856576,0.705022,0.221569
f,0.009687,0.055504,0.009687,0.389713,0.705022,3.989915,0.705022
g,0.013278,0.076076,0.013278,0.534152,0.221569,0.705022,8.856576
