# Node Centrality Measures

**September 08 2020**  
*Vincenzo Perri*

One of the first steps in exploring a network is to look at the importance of its nodes. Several measures exist, as there are several ways to define importance in a network.

In [1]:
from collections import defaultdict, Counter

import pathpy as pp
import numpy as np

Let's start by building two toy networks that will help develop an understanding of the different measures.

In [2]:
n_undirected = pp.Network(directed=False)
n_undirected.add_edge('a', 'b')
n_undirected.add_edge('b', 'c')
n_undirected.add_edge('c', 'a')
n_undirected.add_edge('d', 'e')
n_undirected.add_edge('e', 'f')
n_undirected.add_edge('f', 'g')
n_undirected.add_edge('g', 'd')
n_undirected.add_edge('d', 'f')
n_undirected.add_edge('b', 'd')
n_undirected.plot()

In [3]:
n_directed = pp.Network(directed=True)
n_directed.add_edge('a', 'b')
n_directed.add_edge('b', 'c')
n_directed.add_edge('c', 'a')
n_directed.add_edge('d', 'e')
n_directed.add_edge('e', 'f')
n_directed.add_edge('f', 'g')
n_directed.add_edge('g', 'd')
n_directed.add_edge('d', 'f')
n_directed.add_edge('b', 'd')
n_directed.plot()

## Degree Centrality

A simple, local notion of node importance in networks can be defined based on the degrees of nodes. In `pathpy` the (in- or out-) degrees of nodes can be easily computed as follows:

In [4]:
n_undirected.degrees()

{'a': 2, 'b': 3, 'c': 2, 'd': 4, 'e': 2, 'f': 3, 'g': 2}

In [5]:
n_directed.indegrees()

{'a': 1, 'b': 1, 'c': 1, 'd': 2, 'e': 1, 'f': 2, 'g': 1}

In [6]:
n_directed.outdegrees()

{'a': 1, 'b': 2, 'c': 1, 'd': 2, 'e': 1, 'f': 1, 'g': 1}
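
The degree values above can be cross-checked without `pathpy`. The following is a minimal sketch in plain Python that counts edge endpoints of the toy edge list; the variable names are chosen for illustration, and this is not how `pathpy` computes degrees internally.

```python
from collections import Counter

# Edge list of the two toy networks defined above
edges = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]

# Undirected reading: every edge adds one to the degree of both endpoints
degrees = Counter()
for v, w in edges:
    degrees[v] += 1
    degrees[w] += 1

# Directed reading: count edge targets (in-degree) and edge sources (out-degree)
indegrees = Counter(w for _, w in edges)
outdegrees = Counter(v for v, _ in edges)

print(dict(sorted(degrees.items())))
print(dict(sorted(indegrees.items())))
print(dict(sorted(outdegrees.items())))
```

The three printed dictionaries should agree with the outputs of `degrees()`, `indegrees()`, and `outdegrees()` shown above.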

In order to provide a unified API to all centrality measures, `pathpy` additionally includes a `degree_centrality` function in the module `pp.algorithms.centralities`. Using the `mode` parameter, we can switch between degree, in-degree, and out-degree.

In [7]:
pp.algorithms.centralities.degree_centrality(n_undirected)

{'a': 2, 'b': 3, 'c': 2, 'd': 4, 'e': 2, 'f': 3, 'g': 2}

In [8]:
pp.algorithms.centralities.degree_centrality(n_directed, mode='indegree')

{'a': 1, 'b': 1, 'c': 1, 'd': 2, 'e': 1, 'f': 2, 'g': 1}

In [9]:
pp.algorithms.centralities.degree_centrality(n_directed, mode='outdegree')

{'a': 1, 'b': 2, 'c': 1, 'd': 2, 'e': 1, 'f': 1, 'g': 1}

A common task in network analysis is the ranking of nodes by centrality. Since a dictionary in `python` is not sorted by its values, this requires a different data structure. To simplify this frequent task, `pathpy` comes with a `rank_centralities` function that takes a dictionary with centrality values as parameter and returns a list of tuples with node uids and centrality values, arranged in descending order of centrality:

In [10]:
ranking = pp.algorithms.centralities.rank_centralities(pp.algorithms.centralities.degree_centrality(n_undirected))
print(ranking)

print('The most important node is', ranking[0][0])

[('d', 4), ('f', 3), ('b', 3), ('g', 2), ('e', 2), ('c', 2), ('a', 2)]
The most important node is d
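
For comparison, a similar ranking can be produced with Python's built-in `sorted`. This is only a sketch of the idea and not the implementation of `rank_centralities`; in particular, nodes with equal centrality may come out in a different order than in the output above.

```python
# Degree centralities of the undirected toy network (copied from the output above)
degree = {'a': 2, 'b': 3, 'c': 2, 'd': 4, 'e': 2, 'f': 3, 'g': 2}

# Sort (node, value) tuples by value, largest first
ranking = sorted(degree.items(), key=lambda item: item[1], reverse=True)

print(ranking)
print('The most important node is', ranking[0][0])
```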


## Betweenness Centrality

If we want a centrality measure that considers the topology of links (and not only the number of links incident to nodes), we can use the betweenness centrality, which measures how often a node lies on shortest paths between other nodes.

In [11]:
pp.algorithms.centralities.betweenness_centrality(n_undirected)

defaultdict(float,
            {'b': 16.0,
             'd': 19.0,
             'f': 1.0,
             'a': 0.0,
             'e': 0.0,
             'g': 0.0,
             'c': 0.0})

In [12]:
pp.algorithms.centralities.rank_centralities(n_undirected.betweenness_centrality())

[('d', 19.0),
 ('b', 16.0),
 ('f', 1.0),
 ('c', 0.0),
 ('g', 0.0),
 ('e', 0.0),
 ('a', 0.0)]

To better understand why we obtain those betweenness centralities, we can look at all shortest paths between all pairs of nodes. We can then confirm that node `b` is on all shortest paths for 16 different (ordered) pairs of nodes, thus yielding a betweenness centrality of 16. Node `f`, on the other hand, has a betweenness centrality of one because it is on half of the shortest paths from `e` to `g` and on half of the shortest paths from `g` to `e`. The nodes `a`, `c`, `e`, and `g` are not an intermediate node on any shortest path, so they have a betweenness centrality of zero.

In [13]:
all_paths = pp.algorithms.shortest_paths.all_shortest_paths(n_undirected, return_distance_matrix=False)
all_paths

defaultdict(<function pathpy.algorithms.shortest_paths.all_shortest_paths.<locals>.<lambda>()>,
            {'a': defaultdict(set,
                         {'b': {('a', 'b')},
                          'c': {('a', 'c')},
                          'a': {('a',)},
                          'e': {('a', 'b', 'd', 'e')},
                          'f': {('a', 'b', 'd', 'f')},
                          'g': {('a', 'b', 'd', 'g')},
                          'd': {('a', 'b', 'd')}}),
             'b': defaultdict(set,
                         {'a': {('b', 'a')},
                          'c': {('b', 'c')},
                          'd': {('b', 'd')},
                          'e': {('b', 'd', 'e')},
                          'f': {('b', 'd', 'f')},
                          'g': {('b', 'd', 'g')},
                          'b': {('b',)}}),
             'c': defaultdict(set,
                         {'b': {('c', 'b')},
                          'a': {('c', 'a')},
                          'd': {(

Since it is based on shortest paths, the betweenness centrality can be naturally computed in directed networks. In our example, we obtain a more complex picture of node importance:


In [14]:
pp.algorithms.centralities.rank_centralities(pp.algorithms.centralities.betweenness_centrality(n_directed))

[('d', 12.0),
 ('b', 9.0),
 ('f', 6.0),
 ('a', 5.0),
 ('g', 3.0),
 ('c', 1.0),
 ('e', 0.0)]
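
The betweenness values of both toy networks can be reproduced with a short breadth-first search in plain Python. The sketch below assumes the definition used in the explanation above: for every ordered pair of distinct nodes, each intermediate node receives the fraction of shortest paths that pass through it. The helper names (`adjacency`, `bfs`, `betweenness`) are hypothetical, and the code is not `pathpy`'s implementation.

```python
from collections import deque

EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]


def adjacency(edges, directed):
    """Map every node to the set of nodes reachable over one edge."""
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set())
        if not directed:
            adj[w].add(v)
    return adj


def bfs(adj, source):
    """Return shortest-path distances and shortest-path counts from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(edges, directed):
    adj = adjacency(edges, directed)
    info = {s: bfs(adj, s) for s in adj}
    bw = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in dist_s:
                if v in (s, t):
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    bw[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return bw


undirected_bw = betweenness(EDGES, directed=False)
directed_bw = betweenness(EDGES, directed=True)
print(sorted(undirected_bw.items(), key=lambda kv: -kv[1]))
print(sorted(directed_bw.items(), key=lambda kv: -kv[1]))
```

Comparing all pairs of breadth-first searches is simple but slow; it is adequate for seven nodes, while larger networks call for a dedicated algorithm such as the one a library provides.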

## Closeness Centrality

The betweenness centrality measure emphasizes the importance of nodes for the shortest paths between other nodes. We can instead study the length of shortest paths from a node to all other nodes, where we define nodes to be more central if they are connected by short paths to all other nodes. This is the idea behind the closeness centrality. Using `pathpy` we can compute it as follows:

In [15]:
pp.algorithms.centralities.rank_centralities(pp.algorithms.centralities.closeness_centrality(n_undirected))

[('d', 0.125),
 ('b', 0.1111111111111111),
 ('f', 0.09090909090909091),
 ('g', 0.08333333333333333),
 ('e', 0.08333333333333333),
 ('a', 0.07692307692307693),
 ('c', 0.07692307692307693)]

In [16]:
pp.algorithms.centralities.rank_centralities(pp.algorithms.centralities.closeness_centrality(n_directed))

[('a', 0.3333333333333333),
 ('c', 0.3333333333333333),
 ('b', 0.3333333333333333),
 ('d', 0.08333333333333333),
 ('f', 0.07692307692307693),
 ('e', 0.06666666666666667),
 ('g', 0.058823529411764705)]
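
The printed closeness values are consistent with taking the inverse of the sum of shortest-path distances, for example $1/8 = 0.125$ for node `d` in the undirected network. In the directed network the values match a sum over distances *to* a node from all nodes that can reach it. The sketch below rests on that reading of the output; it is an assumption inferred from the numbers, not a description of `pathpy`'s code, and the function name `closeness` is chosen for illustration.

```python
from collections import deque

EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]


def closeness(edges, directed):
    # Successors of each node; undirected edges are stored in both directions
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set())
        if not directed:
            adj[w].add(v)

    # total[t] accumulates the distances from every node that can reach t
    total = {v: 0 for v in adj}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    total[w] += dist[w]
                    queue.append(w)

    # Nodes that nobody can reach get a closeness of zero
    return {v: 1 / total[v] if total[v] else 0.0 for v in adj}


undirected_cl = closeness(EDGES, directed=False)
directed_cl = closeness(EDGES, directed=True)
print(undirected_cl['d'], undirected_cl['b'])
print(directed_cl['a'], directed_cl['g'])
```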

# Centralities in Empirical Networks

We close the discussion of centralities with an exploration of node centralities in empirical networks. We first use `pathpy`'s SQLite integration to read the table `gentoo` from the database file as a **directed** network. We further read the table `highschool` in the database as an undirected network. We then apply the functions in the `pathpy.algorithms.centralities` module to rank nodes according to the following centrality measures:

1) in- and out-degree (for directed network), degree (for undirected network)  
2) closeness centrality   
3) betweenness centrality

In [17]:
n_gentoo = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM gentoo', directed=True)
n_highschool = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM highschool', directed=False, multiedges=True)
n_physicians = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM physicians', directed=False, multiedges=True)

In [18]:
highschool_bw = n_highschool.betweenness_centrality()
highschool_cl = n_highschool.closeness_centrality()
highschool_deg = pp.algorithms.centralities.degree_centrality(n_highschool)

In [19]:
print('\nTop five most important nodes:\n=============================')
print('betweenness centrality: ', pp.algorithms.centralities.rank_centralities(highschool_bw)[:5])
print('closeness centrality: ', pp.algorithms.centralities.rank_centralities(highschool_cl)[:5])
print('degree centrality: ', pp.algorithms.centralities.rank_centralities(highschool_deg)[:5])


Top five most important nodes:
betweenness centrality:  [('687', 4974.336731146571), ('496', 4836.858333333073), ('691', 4523.788334945413), ('954', 4168.054258449269), ('582', 3629.6290294000446)]
closeness centrality:  [('691', 0.00228310502283105), ('502', 0.002188183807439825), ('954', 0.0021645021645021645), ('1332', 0.0021413276231263384), ('642', 0.0021321961620469083)]
degree centrality:  [('407', 17), ('1295', 17), ('945', 17), ('845', 16), ('15', 16)]


# Visualising centralities

We next study how we can visually represent node centralities in terms of node sizes. Your task is to visualise the node centralities in the `highschool` network. Use the `node_size` visualisation parameter of the `pathpy.visualisation.plot` method to scale the nodes according to their closeness, degree, and betweenness centralities respectively. 

In [20]:
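# linearly rescale centralities to node sizes in [min_size, min_size + added_size]
def scale_centrality(centr, min_size=5, added_size=30):
    ma = max(centr.values())
    mi = min(centr.values())
    # if all values are equal, avoid a division by zero (all nodes get min_size)
    spread = (ma - mi) or 1

    scaled = {}
    for v in centr:
        scaled[v] = min_size + added_size*((centr[v]-mi)/spread)
    return scaled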

In [22]:
# n_highschool.plot(node_size=scale_centrality(highschool_cl))
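
To see what such a min-max scaling produces, it can be applied by hand to the degree values of the undirected toy network. The snippet is a compact, self-contained restatement of the same arithmetic; the name `min_max_sizes` is hypothetical.

```python
def min_max_sizes(values, min_size=5, added_size=30):
    """Map the smallest value to min_size and the largest to min_size + added_size."""
    lo, hi = min(values.values()), max(values.values())
    spread = (hi - lo) or 1  # guard against identical values
    return {k: min_size + added_size * (v - lo) / spread for k, v in values.items()}


# Degrees of the undirected toy network
degrees = {'a': 2, 'b': 3, 'c': 2, 'd': 4, 'e': 2, 'f': 3, 'g': 2}
sizes = min_max_sizes(degrees)
print(sizes)
```

Nodes with the minimum degree of 2 get size 5, node `d` with the maximum degree of 4 gets size 35, and nodes with degree 3 land halfway at size 20.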

# Clustering coefficient(s)

Besides the importance of nodes, other aspects of a network's topology can be studied.
The *clustering coefficient* of a node quantifies the tendency of the neighbors of a node to be neighbors themselves, i.e. to form cliques.

We start with a function that computes the local clustering coefficient of a node, i.e. the fraction of neighbor pairs of a node where the neighbors are directly connected to each other:

In [23]:
n_undirected

<pathpy.core.network.Network object at 0x0000022D885C7670>

In [24]:
print(pp.statistics.local_clustering_coefficient(n_undirected, 'a'))
print(pp.statistics.local_clustering_coefficient(n_undirected, 'b'))
print(pp.statistics.local_clustering_coefficient(n_undirected, 'f'))

1.0
0.3333333333333333
0.6666666666666666
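
The definition can be checked by hand for the undirected toy network. The following sketch counts, for a given node, how many pairs of its neighbors are themselves connected; it is a plain-Python illustration with a hypothetical function name, not the code behind `pp.statistics.local_clustering_coefficient`.

```python
from itertools import combinations

EDGES = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
         ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]

# Neighbor sets of the undirected network
neighbors = {}
for v, w in EDGES:
    neighbors.setdefault(v, set()).add(w)
    neighbors.setdefault(w, set()).add(v)


def local_clustering(node):
    """Fraction of neighbor pairs of node that are directly connected."""
    pairs = list(combinations(sorted(neighbors[node]), 2))
    if not pairs:
        return 0.0
    closed = sum(1 for x, y in pairs if y in neighbors[x])
    return closed / len(pairs)


for node in ('a', 'b', 'f'):
    print(node, local_clustering(node))
```

Node `a` has a single neighbor pair (`b`, `c`), which is connected; node `b` has three pairs of which only (`a`, `c`) is connected; node `f` has three pairs of which two are connected.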


In the directed network, all three nodes have zero clustering coefficient:

In [25]:
n_directed

<pathpy.core.network.Network object at 0x0000022D885E0220>

In [26]:
print(pp.statistics.local_clustering_coefficient(n_directed, 'a'))
print(pp.statistics.local_clustering_coefficient(n_directed, 'b'))
print(pp.statistics.local_clustering_coefficient(n_directed, 'f'))

0.0
0.0
0.0


In this network, the only node with non-zero clustering coefficient is node `d`:

In [27]:
pp.statistics.local_clustering_coefficient(n_directed, 'd')

0.5

For a directed network that only consists of three edges $(a,b), (a,c)$ and $(b,c)$ the local clustering coefficient of $a$ is $0.5$ because only half of the possible (ordered) node pairs are connected:

In [28]:
n = pp.Network()
n.add_edge('a','b')
n.add_edge('a','c')
n.add_edge('b','c')
n.plot()
print(pp.statistics.local_clustering_coefficient(n, 'a'))

0.5
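
The directed values printed in this section can be reproduced if we assume that the neighbors of a node are its successors and that every ordered pair of successors counts as one possible connection. This assumption is inferred from the outputs above and may not match `pathpy`'s implementation in every case; the function name is hypothetical.

```python
from itertools import permutations


def directed_local_clustering(edges, node):
    """Fraction of ordered successor pairs of node that are linked by an edge."""
    edge_set = set(edges)
    successors = {w for v, w in edges if v == node}
    pairs = list(permutations(sorted(successors), 2))
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in edge_set) / len(pairs)


# The three-edge example: only (b, c) of the ordered pairs (b, c), (c, b) exists
triangle = [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(directed_local_clustering(triangle, 'a'))

# The directed toy network from the beginning of this unit
toy = [('a', 'b'), ('b', 'c'), ('c', 'a'), ('d', 'e'), ('e', 'f'),
       ('f', 'g'), ('g', 'd'), ('d', 'f'), ('b', 'd')]
toy_clustering = {v: directed_local_clustering(toy, v) for v in 'abcdefg'}
print(toy_clustering)
```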


A global clustering coefficient can be computed as the average of all local clustering coefficients. Note that there are also other definitions that count the total number of closed triads in the whole network and then calculate a single clustering coefficient for the whole network rather than averaging local clustering coefficients.

In [26]:
pp.statistics.avg_clustering_coefficient(n_undirected)

0.7619047619047619
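
This value can be verified by hand. Working through the definition for each node of the undirected toy network gives local coefficients of $1$ for `a`, `c`, `e`, and `g`, $\tfrac{1}{3}$ for `b` and `d`, and $\tfrac{2}{3}$ for `f`, so that

$$\bar{C} = \frac{1}{7}\left(1 + \tfrac{1}{3} + 1 + \tfrac{1}{3} + 1 + \tfrac{2}{3} + 1\right) = \frac{16}{21} \approx 0.7619$$

For comparison, one common triad-based definition (often called transitivity) divides three times the number of triangles by the number of connected triples. The toy network has three triangles ($abc$, $def$, $dfg$) and $\sum_v \binom{k_v}{2} = 1 + 3 + 1 + 6 + 1 + 3 + 1 = 16$ connected triples, which gives

$$T = \frac{3 \cdot 3}{16} = 0.5625$$

The two definitions therefore yield different values for the same network.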

If we want to find the set of closed triads around a node, we can use the function `closed_triads`. It returns all direct connections between neighbors of a given node. In our example, `f` has a degree of three and there is a direct edge between two of the three neighbor pairs:

In [27]:
for edge in pp.statistics.closed_triads(n_undirected, 'f'):
    print(edge)

Uid:		0x28553826e20
Type:		Edge
Source node:	Node d
Target node:	Node e
Uid:		0x2855382f400
Type:		Edge
Source node:	Node g
Target node:	Node d


## Clustering coefficient in empirical networks

Concluding this unit, we calculate the clustering coefficient of empirical networks:

In [28]:
n_gentoo = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM gentoo', directed=True)
n_highschool = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM highschool', directed=False, multiedges=True)
n_physicians = pp.io.sql.read_network('networks.db', sql='SELECT source, target FROM physicians', directed=False, multiedges=True)

The first network is a directed communication and collaboration network, which has a small clustering coefficient.

In [29]:
print(n_gentoo.avg_clustering_coefficient())

0.002892847366817798


The other two networks are undirected social networks, for which we expect a large clustering coefficient. The reason for this is that two friends of a person are likely to be friends with each other as well:

In [30]:
print(n_highschool.avg_clustering_coefficient())
print(n_physicians.avg_clustering_coefficient())

0.3269545310354134
0.29494028552167456
