# 07-01 - Neighborhood-based Similarity Scores

*June 12 2024*  

A simple approach to predict links in networks is to calculate scores that capture pairwise similarities between nodes. In the first practice session, we implement different approaches that use the neighborhood of a node to calculate similarity scores.

In [None]:
import numpy as np

from matplotlib import pyplot as plt
from collections import defaultdict

import scipy as sp

import pandas as pd
import seaborn as sns

import pathpyG as pp

plt.style.use('default')
sns.set_style("whitegrid")

plot_style = {
    'edge_color': 'grey'
}

We first create our well-known toy example.

In [2]:
n = pp.Graph.from_edge_list([('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'), ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]).to_undirected()
pp.plot(n, node_label=[v for v in n.mapping.node_ids]);

This network exhibits two "clusters", where one could argue that in the second cluster one link from node "e" to "g" is "missing", i.e. there is one pair of nodes between which a link is intuitively most likely.

We start with some boilerplate code that helps us to visualize pair-wise node similarities and which we will use to address link prediction. 

We first implement a function `get_node_similarities` that accepts three kinds of arguments. The first argument is the network for which we wish to calculate pairwise node similarities. The second argument is a function `score` that will be called for each pair of nodes. Any further keyword arguments are passed on to `score` unchanged. The function `score` must accept the network, the uid of the first node and the uid of the second node, followed by arbitrary keyword arguments that can be used to parameterize the calculation of the score.

The function `get_node_similarities` calculates all similarity scores and generates a `pandas` data frame `df` in which the nodes are both the rows and the columns. We further generate a second data frame `edge_df` with the same shape, in which each entry records whether an edge exists between the corresponding pair of nodes. These entries are stored as the strings `true` and `false`, so that they can later be used as CSS class names to style the table.

In [9]:
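def get_node_similarities(n, score, **kwargs):
    """Return (df, edge_df) with pairwise scores and edge indicators for all node pairs."""
    sim_scores = defaultdict(defaultdict)
    edges = defaultdict(defaultdict)
    nodes = [v for v in n.nodes]
    for v in nodes:
        for w in nodes:
            # keyword arguments are forwarded to the score function unchanged
            sim_scores[v][w] = score(n, v, w, **kwargs)
            # the strings 'true' and 'false' serve as CSS classes in highlight_edges
            if (v,w) in n.edges:
                edges[v][w] = 'true '
            else:
                edges[v][w] = 'false '

    df = pd.DataFrame.from_dict(sim_scores, orient='index')
    edge_df = pd.DataFrame.from_dict(edges, orient='index')

    # use the node uids as row and column labels of both data frames
    edge_df.columns = nodes
    edge_df.index = nodes
    df.columns = nodes
    df.index = nodes
    return df, edge_df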

To simplify the visualization of pairwise similarities, the function `highlight_edges` uses the `styler` object of a `pandas` data frame to color cells based on the absence or presence of edges: pairs of nodes connected by an edge are colored in green and pairs of nodes not connected by an edge are colored in red. In the resulting table, we can check whether high scores are assigned to green cells (where an edge is present) and low scores are assigned to red cells (where no edge exists). In addition, we would like some of the red cells that correspond to node pairs that *should* be connected to contain high scores as well.

In [10]:
def highlight_edges(styler, edge_df):
    styler.set_table_styles([{'selector': '.true', 'props': 'background-color: #e6ffe6;'},
    {'selector': '.false', 'props': 'background-color: #ffe6e6;'}], overwrite=False)
    styler.set_td_classes(edge_df)
    return styler

### Inverse Path Length

To test these two functions, we implement a first similarity score that we can pass to the `get_node_similarities` function. 

We start with a simple approach based on shortest paths as introduced in L07. Here, we consider pairs of nodes more similar if a short path exists between them. For nodes $v$ and $w$ we define the similarity as the inverse of the length of the shortest path from $v$ to $w$. Note that the resulting similarity score is symmetric only for undirected networks, as in a directed network the length of the shortest path from $v$ to $w$ can differ from the length of the shortest path from $w$ to $v$. This asymmetry is actually an advantage that we could use to predict *directed* links.

We must consider the special case in which the nodes $v$ and $w$ are identical, i.e. $v=w$. Then the shortest path length is zero and we define the inverse as infinite. 

If two nodes are in two different connected components, i.e. no path exists that connects them, the path length is defined as infinite, i.e. the inverse is zero. We thus consider two nodes in different connected components as maximally dissimilar.

We use `pathpy`'s implementation of Dijkstra's algorithm to calculate the path length between two nodes.

In [11]:
def inverse_path_length(network, v, w):
    dist, _ = pp.algorithms.shortest_paths.shortest_paths_dijkstra(network)
    d = dist[network.mapping.to_idx(v), network.mapping.to_idx(w)]
    if d == 0:
        return np.inf
    else:
        return 1/d

We test this for two node pairs:

In [12]:
inverse_path_length(n, 'a', 'a')

inf

In [13]:
inverse_path_length(n, 'e', 'g')

0.5

We can now pass this score to the `get_node_similarities` function, which returns two `pandas` data frames: The first data frame contains all pair-wise node similarity scores. The second data frame contains the edges, which can be used to style the table using the `DataFrame.style.pipe` function and the `highlight_edges` function defined above.

In [14]:
df, edges = get_node_similarities(n, inverse_path_length)
df.style.pipe(highlight_edges, edges)

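|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | inf | 1.0 | 1.0 | 0.5 | 0.333333 | 0.333333 | 0.333333 |
| b | 1.0 | inf | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 |
| c | 1.0 | 1.0 | inf | 0.5 | 0.333333 | 0.333333 | 0.333333 |
| d | 0.5 | 1.0 | 0.5 | inf | 1.0 | 1.0 | 1.0 |
| e | 0.333333 | 0.5 | 0.333333 | 1.0 | inf | 1.0 | 0.5 |
| f | 0.333333 | 0.5 | 0.333333 | 1.0 | 1.0 | inf | 1.0 |
| g | 0.333333 | 0.5 | 0.333333 | 1.0 | 0.5 | 1.0 | inf |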


For link prediction, we are interested in the red cells of the table, i.e. pairs of nodes not already connected by a link. In the table above, we find that many pairs of nodes that are not already connected by an edge have the same (highest) similarity score of 0.5. Since the smallest possible path length for nodes not connected by a link is two, the value of $0.5$ is actually the maximum possible similarity for node pairs that are not directly connected. All nodes connected by an edge have an inverse path length similarity of one, which is true for all green cells in the table.
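To make the ties among unconnected pairs explicit, the following minimal sketch recomputes the score with a breadth-first search on a plain adjacency dictionary. The dictionary `adj` and the helper names `bfs_distances` and `inverse_path_length_bfs` are assumptions of this sketch, chosen so that it runs without `pathpyG`. They are not part of the library.

```python
import math
from collections import deque

# Toy network from above as a plain adjacency dictionary (an assumption of
# this sketch, used instead of a pathpyG graph to keep it self-contained).
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]

adj = {}
for v, w in EDGES:
    adj.setdefault(v, set()).add(w)
    adj.setdefault(w, set()).add(v)


def bfs_distances(adj, source):
    """Return hop distances from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for x in adj[u]:
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return dist


def inverse_path_length_bfs(adj, v, w):
    # nodes that cannot be reached get an infinite distance, i.e. score zero
    d = bfs_distances(adj, v).get(w, math.inf)
    return math.inf if d == 0 else 1 / d


# score all node pairs that are not directly connected by an edge
nodes = sorted(adj)
unconnected = [(v, w) for i, v in enumerate(nodes)
               for w in nodes[i + 1:] if w not in adj[v]]
scores = {pair: inverse_path_length_bfs(adj, *pair) for pair in unconnected}
top = max(scores.values())
tied = sorted(pair for pair, s in scores.items() if s == top)
print(top, tied)
```

In this sketch, six of the twelve unconnected pairs share the top score of 0.5, so the inverse path length alone cannot single out the pair $e$ and $g$.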

### Common Neighbors

To evaluate whether two nodes are part of the same local cluster of densely connected nodes, we can compare the neighborhoods of the two nodes. A simple approach is to count the number of *Common Neighbors*, i.e. 

$$ C(v,w) := | N(v) \cap N(w) | $$

Two nodes are maximally dissimilar if they do not share a single neighbor. They are more similar if they have more neighbors in common. We can compute the common neighbors similarity as follows:

In [17]:
def common_neighbors(n, v, w):
    N_v = set([x for x in n.successors(v)])
    N_w = set([x for x in n.successors(w)])
    return len(N_v.intersection(N_w))

We test this in our network:

In [18]:
df, edges = get_node_similarities(n, common_neighbors)
df.style.pipe(highlight_edges, edges)

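|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 2 | 1 | 1 | 1 | 0 | 0 | 0 |
| b | 1 | 3 | 1 | 0 | 1 | 1 | 1 |
| c | 1 | 1 | 2 | 1 | 0 | 0 | 0 |
| d | 1 | 0 | 1 | 4 | 1 | 2 | 1 |
| e | 0 | 1 | 0 | 1 | 2 | 1 | 2 |
| f | 0 | 1 | 0 | 2 | 1 | 3 | 1 |
| g | 0 | 1 | 0 | 1 | 2 | 1 | 2 |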


By definition, the number of common neighbors of a node with itself corresponds to the degree of that node, so the diagonal in the matrix above actually contains the degree sequence of the network.

For red node pairs not already connected by a link, we find that the pair $e$ and $g$ is considered most similar. Both have a degree of two and they have two (i.e. all) neighbors in common. 
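For link prediction we would typically rank the unconnected pairs by their score. The following sketch does this for the common neighbors count. It assumes the toy network is stored as a plain adjacency dictionary `adj`, and the helper `common_neighbors_count` is a hypothetical stand-in for the `common_neighbors` function above.

```python
from itertools import combinations

# Toy network from above as a plain adjacency dictionary (assumption of this sketch).
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]

adj = {}
for v, w in EDGES:
    adj.setdefault(v, set()).add(w)
    adj.setdefault(w, set()).add(v)


def common_neighbors_count(adj, v, w):
    """Number of nodes adjacent to both v and w."""
    return len(adj[v] & adj[w])


# candidate links are all node pairs that are not yet connected by an edge
candidates = [(v, w) for v, w in combinations(sorted(adj), 2) if w not in adj[v]]

# rank candidates by decreasing score; ties keep their alphabetical order
ranking = sorted(candidates,
                 key=lambda pair: common_neighbors_count(adj, *pair),
                 reverse=True)

for v, w in ranking[:3]:
    print(v, w, common_neighbors_count(adj, v, w))
```

The first entry of `ranking` is the pair $e$ and $g$ with two common neighbors. All other unconnected pairs share at most one neighbor.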

### Szymkiewicz–Simpson coefficient

Since the common neighbors score depends on the degrees of the involved nodes, it is difficult to compare node pairs with different degrees.

On slide 6 of the lecture, we highlighted that the node pairs $(a,c)$ and $(b,f)$ are considered equally similar by the common neighbors score, since both pairs share one neighbor. However, we should consider nodes $a$ and $c$ more similar than nodes $b$ and $f$, since $a$ and $c$ share all neighbors (except for the one that is due to the direct link $(a,c)$). Moreover, the common neighbors score is not normalized, as it can take any positive (or zero) value.

There are different approaches to normalize the size of the common neighbor set based on the degrees of nodes. The so-called Overlap or Szymkiewicz–Simpson coefficient is defined as follows:

$$ O(v,w) := \frac{|N(v) \cap N(w)|}{\min(d_v, d_w)}$$

This coefficient is one if the neighbors of one node are a subset of the neighbors of the other node, i.e. if the overlap is maximal given the degrees of the two involved nodes. In `pathpy` we can implement it as follows:

In [19]:
def overlap_coefficient(n, v, w):
    N_v = set([x for x in n.successors(v)])
    N_w = set([x for x in n.successors(w)])
    return len(N_v.intersection(N_w))/min(len(N_v), len(N_w))

In [20]:
df, edges = get_node_similarities(n, overlap_coefficient)
df.style.pipe(highlight_edges, edges)

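|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 1.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 |
| b | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.333333 | 0.5 |
| c | 0.5 | 0.5 | 1.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| d | 0.5 | 0.0 | 0.5 | 1.0 | 0.5 | 0.666667 | 0.5 |
| e | 0.0 | 0.5 | 0.0 | 0.5 | 1.0 | 0.5 | 1.0 |
| f | 0.0 | 0.333333 | 0.0 | 0.666667 | 0.5 | 1.0 | 0.5 |
| g | 0.0 | 0.5 | 0.0 | 0.5 | 1.0 | 0.5 | 1.0 |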


We obtain a normalized similarity score in the range of $[0,1]$. We find that among the pairs of unconnected nodes, the pair $e$ and $g$ has the maximum possible similarity score of one.
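As a worked check of the normalization, we can plug the two pairs $(a,c)$ and $(b,f)$ into the definition, using the degrees $d_a = d_c = 2$ and $d_b = d_f = 3$ of the toy network:

$$ O(a,c) = \frac{|\{b\}|}{\min(2,2)} = \frac{1}{2}, \qquad O(b,f) = \frac{|\{d\}|}{\min(3,3)} = \frac{1}{3} $$

Both pairs share exactly one neighbor, but the overlap coefficient ranks $(a,c)$ above $(b,f)$, in line with the entries $0.5$ and $0.333333$ in the table.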

### Jaccard similarity coefficient

A different approach to normalize the size of the intersection between the two neighbor sets is the so-called **Jaccard similarity coefficient**, which was proposed by Paul Jaccard. Here, we normalize the size of the intersection by the size of the union of the two sets. 

$$ J(v,w) := \frac{|N(v) \cap N(w)|}{|N(v) \cup N(w)|} = \frac{|N(v) \cap N(w)|}{d_v + d_w - |N(v) \cap N(w)|}$$

We again obtain a normalized coefficient in $[0,1]$. If both nodes have zero degree we define $J(v,w):=1$. Apart from nodes with zero degree, the maximum similarity of one is only assumed for pairs of nodes $v$ and $w$ where the two neighbor sets of $v$ and $w$ are identical (in which case the size of the union corresponds to the size of the intersection). Note that this is different for the overlap coefficient, where a node pair can also assume a maximum overlap coefficient of one if the neighbor set of one node is larger than that of the other one. 
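The difference between the two normalizations can be illustrated with a small sketch on plain Python sets. The neighbor sets `N_v` and `N_w` below are hypothetical and not taken from the toy network. They are chosen so that one set is a strict subset of the other.

```python
def overlap(N_v, N_w):
    """Overlap coefficient of two non-empty neighbor sets."""
    return len(N_v & N_w) / min(len(N_v), len(N_w))


def jaccard(N_v, N_w):
    """Jaccard coefficient, defined as one if both sets are empty."""
    if not N_v and not N_w:
        return 1.0
    return len(N_v & N_w) / len(N_v | N_w)


# hypothetical neighbor sets where N_v is a strict subset of N_w
N_v = {'x', 'y'}
N_w = {'x', 'y', 'z'}

print(overlap(N_v, N_w))   # intersection has size 2, smaller set has size 2
print(jaccard(N_v, N_w))   # intersection has size 2, union has size 3
```

Here the overlap coefficient is one, while the Jaccard coefficient is $2/3$ because the union contains the additional neighbor `z`.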

In `pathpy` we can implement the Jaccard similarity coefficient as follows. For our toy example, we find that the nodes $e$ and $g$ have the exact same set of neighbors, which yields a Jaccard similarity of one.

In [21]:
def jaccard_similarity(n, v, w):
    N_v = set([x for x in n.successors(v)])
    N_w = set([x for x in n.successors(w)])
    intersection = N_v.intersection(N_w)
    if len(N_v) == 0 and len(N_w) == 0:
        return 1
    else:
        return len(intersection)/(len(N_v) + len(N_w) - len(intersection))

In [22]:
df, edges = get_node_similarities(n, jaccard_similarity)
df.style.pipe(highlight_edges, edges)

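|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 1.0 | 0.25 | 0.333333 | 0.2 | 0.0 | 0.0 | 0.0 |
| b | 0.25 | 1.0 | 0.25 | 0.0 | 0.25 | 0.2 | 0.25 |
| c | 0.333333 | 0.25 | 1.0 | 0.2 | 0.0 | 0.0 | 0.0 |
| d | 0.2 | 0.0 | 0.2 | 1.0 | 0.2 | 0.4 | 0.2 |
| e | 0.0 | 0.25 | 0.0 | 0.2 | 1.0 | 0.25 | 1.0 |
| f | 0.0 | 0.2 | 0.0 | 0.4 | 0.25 | 1.0 | 0.25 |
| g | 0.0 | 0.25 | 0.0 | 0.2 | 1.0 | 0.25 | 1.0 |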


### Adamic-Adar Index

Intuitively, a common neighbor $u$ between two nodes $v$ and $w$ is less informative for the similarity of $v$ and $w$ if node $u$ has a large number of connections to other nodes as well, i.e. if it also contributes to the similarity score of many other node pairs. So far, we only considered similarity scores in which we account for the degrees of nodes $v$ and $w$, while we did not consider the degree of node $u$. We can define a coefficient such that common neighbors with large degree contribute less to the similarity scores of node pairs compared to common neighbors with a small degree.

Building on this idea, the Adamic-Adar index was proposed for the purpose of link prediction by Lada Adamic and Eytan Adar in 2003. It is defined as the sum of the inverse logarithmic degrees of all common neighbors for a given pair of nodes, i.e. 

$$ A(v,w) := \sum_{u \in N(v) \cap N(w)} \frac{1}{\log|N(u)|} $$

Like the common neighbors count, this score assumes values in the range $[0, \infty)$. Pairs of nodes that have many common neighbors with small degrees will assume higher similarity scores compared to node pairs with fewer common neighbors or common neighbors that have a large degree.

In `pathpy` we can implement this as follows:

In [29]:
def adamic_adar_index(n, v, w):
    A = 0
    N_v = set([x for x in n.successors(v)])
    N_w = set([x for x in n.successors(w)])
    for u in N_v.intersection(N_w):
        A += 1 / np.log(n.out_degrees[u])
    return A
    

In [30]:
df, edges = get_node_similarities(n, adamic_adar_index)
df.style.pipe(highlight_edges, edges)

|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 2.352934 | 1.442695 | 0.910239 | 0.910239 | 0.0 | 0.0 | 0.0 |
| b | 1.442695 | 3.606738 | 1.442695 | 0.0 | 0.721348 | 0.721348 | 0.721348 |
| c | 0.910239 | 1.442695 | 2.352934 | 0.910239 | 0.0 | 0.0 | 0.0 |
| d | 0.910239 | 0.0 | 0.910239 | 4.705869 | 0.910239 | 2.88539 | 0.910239 |
| e | 0.0 | 0.721348 | 0.0 | 0.910239 | 1.631587 | 0.721348 | 1.631587 |
| f | 0.0 | 0.721348 | 0.0 | 2.88539 | 0.721348 | 3.606738 | 0.721348 |
| g | 0.0 | 0.721348 | 0.0 | 0.910239 | 1.631587 | 0.721348 | 1.631587 |


In our example network, again node $e$ and node $g$ assume the largest Adamic-Adar index among all pairs of nodes not connected by a link. For this node pair, the score is given as:

$$ A(e, g) = \frac{1}{\log{3}} + \frac{1}{\log{4}} $$

where $\log$ is the natural logarithm.

Note that this index is well defined as long as we calculate it for nodes $v \neq w$. In this case the degree of a node $u$ that has a link to both $v$ and $w$ is necessarily larger than one and the logarithm in the denominator is larger than zero. The Adamic-Adar index may not be properly defined if we compute the similarity of a node to itself: if that node has a neighbor $u$ with degree one, we have $\log|N(u)| = \log 1 = 0$ and thus divide by zero. This happens, for instance, in a network that consists of a single edge:

In [32]:
network_one_edge = pp.Graph.from_edge_list([('a', 'b')]).to_undirected()

adamic_adar_index(network_one_edge, 'a', 'a')

inf
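One possible way to avoid the infinite value is to skip common neighbors with degree one. This is only one convention among several and an assumption of the following sketch, not the definition used above. The sketch works on plain adjacency dictionaries, and the names `build_adjacency` and `adamic_adar_guarded` are hypothetical helpers.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency dictionary from an edge list."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)
    return adj


def adamic_adar_guarded(adj, v, w):
    """Adamic-Adar index that ignores common neighbors of degree one."""
    total = 0.0
    for u in adj[v] & adj[w]:
        degree = len(adj[u])
        if degree > 1:  # log(1) = 0 would lead to a division by zero
            total += 1 / math.log(degree)
    return total


one_edge = build_adjacency([('a', 'b')])
toy = build_adjacency([('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
                       ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')])

print(adamic_adar_guarded(one_edge, 'a', 'a'))  # no warning, no infinite value
print(adamic_adar_guarded(toy, 'e', 'g'))       # 1/log(3) + 1/log(4)
```

For pairs $v \neq w$ the guard never triggers, so the scores in the table above are unchanged. Only the diagonal entries of nodes with a degree-one neighbor are affected.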

### Cosine similarity

We close our introduction of neighborhood-based similarity scores with an interesting score that is based on a geometric interpretation of the adjacency matrix. We can treat the rows or columns of an adjacency matrix as vectors, i.e. we have one vector associated with each node. Let us denote the vector associated with node $v$ as $A_v$, e.g. the associated row vector of the adjacency matrix. 

Intuitively, we can consider two vectors as more similar if they point in the same direction. This can be calculated based on the dot product, which is zero if two vectors are orthogonal, assumes positive values if they point in a similar direction and negative values if they point in opposing directions. The dot product between two vectors depends both on the angle between the vectors and the Euclidean length (i.e. magnitude) of the two vectors. If we divide it by the product of the Euclidean lengths of the two vectors, we obtain the cosine of the angle between the vectors. 

This yields the so-called cosine similarity between nodes, which is defined as

$$ \text{cos\_sim}(v,w) := \cos(\alpha) = \frac{A_v \cdot A_w}{\|A_v\| \|A_w\|} $$

where $A_v$ and $A_w$ are the row vectors in the adjacency matrix that correspond to $v$ and $w$ respectively, and $\alpha$ is the angle between vectors $A_v$ and $A_w$.

Note that an adjacency matrix can only have non-negative entries, i.e. from a geometric point of view all vectors point into the non-negative direction (for all $n$ dimensions). Hence, two vectors cannot point in opposite directions, which implies that the cosine similarity for adjacency matrix rows/columns can only assume values from 0 to 1, where one indicates that $v$ and $w$ have exactly the same neighbors and zero indicates that $v$ and $w$ have no neighbors in common.

Note that the cosine similarity is not defined if one of the nodes has no neighbors, in which case the magnitude of the corresponding vector would be zero. In this case, we define it as zero.

In [40]:
def cosine_similarity(network, v, w):    
    if network.degrees()[v] == 0 or network.degrees()[w] == 0:
        return 0
    else:
        A = network.get_sparse_adj_matrix().todense()
        v_v = A[network.mapping.to_idx(v)].A1
        v_w = A[network.mapping.to_idx(w)].A1
        return np.dot(v_v, v_w)/(np.linalg.norm(v_v)*np.linalg.norm(v_w))

In [41]:
df, edges = get_node_similarities(n, cosine_similarity)
df.style.pipe(highlight_edges, edges)

|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 1.0 | 0.408248 | 0.5 | 0.353553 | 0.0 | 0.0 | 0.0 |
| b | 0.408248 | 1.0 | 0.408248 | 0.0 | 0.408248 | 0.333333 | 0.408248 |
| c | 0.5 | 0.408248 | 1.0 | 0.353553 | 0.0 | 0.0 | 0.0 |
| d | 0.353553 | 0.0 | 0.353553 | 1.0 | 0.353553 | 0.57735 | 0.353553 |
| e | 0.0 | 0.408248 | 0.0 | 0.353553 | 1.0 | 0.408248 | 1.0 |
| f | 0.0 | 0.333333 | 0.0 | 0.57735 | 0.408248 | 1.0 | 0.408248 |
| g | 0.0 | 0.408248 | 0.0 | 0.353553 | 1.0 | 0.408248 | 1.0 |


By definition, the cosine similarity of nodes to themselves is one, so we have ones on the diagonal of our table. We again find that cosine similarity is maximal for the pair `e` and `g`. This is due to the fact that those two nodes have exactly the same set of neighbors.

As you will see in this week's exercise session, the cosine similarity of two nodes $v$ and $w$ can also be interpreted as a degree-normalized set overlap coefficient. It actually corresponds to the number of common neighbors between nodes $v$ and $w$ divided by the geometric mean of $d_v$ and $d_w$. We can test this for the cosine similarity of nodes $b$ and $e$ in the example above. These two nodes have one neighbor in common:

In [42]:
common_neighbors(n, 'b', 'e')

1

The geometric mean of the node degrees is

In [43]:
np.sqrt(n.degrees()['b']*n.degrees()['e'])

2.449489742783178

and thus

In [44]:
common_neighbors(n, 'b', 'e')/np.sqrt(n.degrees()['b']*n.degrees()['e'])

0.4082482904638631
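The same check can be repeated for all node pairs at once. The following sketch compares the cosine similarity of the 0/1 adjacency rows with the number of common neighbors divided by the geometric mean of the degrees. It uses a plain adjacency dictionary `adj` and the hypothetical helpers `cosine` and `normalized_overlap`, which are assumptions of this sketch and independent of `pathpyG`.

```python
import math
from itertools import combinations_with_replacement

# Toy network from above as a plain adjacency dictionary (assumption of this sketch).
EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]

adj = {}
for v, w in EDGES:
    adj.setdefault(v, set()).add(w)
    adj.setdefault(w, set()).add(v)

nodes = sorted(adj)
# one 0/1 adjacency row per node
row = {v: [1 if w in adj[v] else 0 for w in nodes] for v in nodes}


def cosine(v, w):
    """Cosine of the angle between the adjacency rows of v and w."""
    dot = sum(x * y for x, y in zip(row[v], row[w]))
    norm_v = math.sqrt(sum(x * x for x in row[v]))
    norm_w = math.sqrt(sum(x * x for x in row[w]))
    return dot / (norm_v * norm_w)


def normalized_overlap(v, w):
    """Common neighbors divided by the geometric mean of the degrees."""
    return len(adj[v] & adj[w]) / math.sqrt(len(adj[v]) * len(adj[w]))


pairs = list(combinations_with_replacement(nodes, 2))
all_equal = all(math.isclose(cosine(v, w), normalized_overlap(v, w))
                for v, w in pairs)
print(all_equal)
```

The two quantities agree because, for 0/1 entries, the dot product of two rows counts the common neighbors and the squared length of a row equals the degree of the node.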