# 04-01 - Stochastic Block Model

*May 17 2023*  

In the first notebook, we implement the stochastic block model, a probabilistic generative model that generalizes the Erdős-Rényi model and can be used to generate random networks with a given community structure.

In [1]:
import pathpy as pp
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook alias

plt.style.use('default')
sns.set_style("whitegrid")
    
from numpy import linalg as npl
import scipy as sp

import scipy.special
from decimal import *
import string
from collections import Counter

from collections import defaultdict

We first implement a Python function that, for a given stochastic block matrix $\mathbf{M}$ and a fixed block assignment vector $z$ that assigns nodes to communities, generates a random undirected network according to the stochastic block model. This generalizes the $G(n,p)$ model, which we recover if we consider a single block ($B=1$) with a single link probability $M_{11}=p$.

In [2]:
def stochastic_block_model(M, z, node_names=None):
    
    # the number of nodes is implicitly given by the length of block assignment vector z 
    n = len(z)
    B = len(set(z))

    # we can use pre-defined node names, if not given, we use contiguous numbers
    if node_names is None:
        node_names = [str(i) for i in range(n)]    
    
    # generate nodes and store block assignment vector entry (i.e. community label) as node attribute
    network = pp.Network(directed = False)
    for name, z_i in zip(node_names, z):
        network.add_node(name, z=z_i)

    # randomly generate links with probabilities given by entries of the stochastic block matrix M
    for u in range(n):
        for v in range(u): # note that since the upper bound is exclusive, we do not add self-loops!
            # np.random.random() lies in [0, 1), so a strict comparison never links for probability 0
            if np.random.random() < M[z[u], z[v]]:
                network.add_edge(node_names[u], node_names[v])

    # return network
    return network

We test our function in an example with four nodes and two communities consisting of two nodes each.

In [3]:
M = np.matrix('1 0; 0 1')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, list('abcd'))
print(pp.algorithms.adjacency_matrix(n).todense())
n.plot(edge_color='gray')

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]
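
The same two-community test can be reproduced without `pathpy`. The following is a minimal sketch, not the notebook's implementation: `sample_sbm_edges` is a hypothetical helper that uses only the standard library and returns a plain list of index pairs instead of a network object. Because this block matrix contains only the probabilities 0 and 1, the outcome does not depend on the random seed.

```python
import random

def sample_sbm_edges(M, z, rng=None):
    """Sample the links of an undirected SBM without self-loops.

    M is a nested list of link probabilities, z the block assignment vector.
    Hypothetical helper for illustration; returns index pairs (u, v) with u > v.
    """
    rng = rng or random.Random()
    edges = []
    for u in range(len(z)):
        for v in range(u):  # v < u: each unordered pair is visited once, no self-loops
            # rng.random() lies in [0, 1): probability 1 always links, probability 0 never does
            if rng.random() < M[z[u]][z[v]]:
                edges.append((u, v))
    return edges

M = [[1, 0], [0, 1]]
z = [0, 0, 1, 1]
edges = sample_sbm_edges(M, z, random.Random(42))
print(edges)  # [(1, 0), (3, 2)]
```

With node names `a` to `d` mapped to indices 0 to 3, these two pairs are the links a-b and c-d shown in the adjacency matrix above.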


We can add links that interconnect those communities with probability $0.5$. We further color nodes according to their communities.

In [4]:
M = np.matrix('1 0.5; 0.5 1')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, list('abcd'))

# color nodes 
colors = {0: 'red', 1: 'blue'}
for v in n.nodes:
    v['color'] = colors[z[n.nodes.index[v.uid]]]
n.plot(edge_color='gray')

With the following model parameters, the stochastic block model corresponds to a $G(n,p)$ model with $p=0.5$. 

In [5]:
M = np.matrix('0.5')
z = np.array([0, 0, 0, 0])
n = stochastic_block_model(M, z, list('abcd'))

# color nodes 
colors = {0: 'red', 1: 'blue'}
for v in n.nodes:
    v['color'] = colors[z[n.nodes.index[v.uid]]]
n.plot(edge_color='gray')
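
One way to check the correspondence with $G(n,p)$ is to compare expected link counts. The sketch below assumes an undirected model without self-loops, as implemented above; `expected_links` is a hypothetical helper, not part of `pathpy`. It sums $\binom{C_k}{2} M_{kk}$ over all blocks and $C_k C_l M_{kl}$ over all pairs of different blocks, where $C_k$ is the number of nodes in block $k$. For a single block this reduces to $p \binom{n}{2}$, the expected number of links in the $G(n,p)$ model.

```python
from collections import Counter
from math import comb

def expected_links(M, z):
    """Expected number of links of an undirected SBM without self-loops."""
    C = Counter(z)          # C[k] = number of nodes in block k
    blocks = sorted(C)
    total = 0.0
    for i, k in enumerate(blocks):
        total += comb(C[k], 2) * M[k][k]       # node pairs within block k
        for l in blocks[i + 1:]:
            total += C[k] * C[l] * M[k][l]     # node pairs between blocks k and l
    return total

# single block with p = 0.5 and n = 4: 0.5 * binom(4, 2)
print(expected_links([[0.5]], [0, 0, 0, 0]))                # 3.0
# two blocks of two nodes each, linked across with probability 0.5
print(expected_links([[1, 0.5], [0.5, 1]], [0, 0, 1, 1]))   # 4.0
```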

We can also create networks where nodes in *different* communities have a preference to be connected.

In [6]:
M = np.matrix('0.2 1; 1 0.2')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, list('abcd'))

colors = {0: 'red', 1: 'blue'}
node_color = {}
for v in n.nodes:
    node_color[v.uid] = colors[z[n.nodes.index[v.uid]]]
n.plot(edge_color='gray', node_color=node_color)

For those of you who attended our course "Statistical Network Analysis" last semester: the networks generated with these parameters exhibit a negative community assortativity coefficient (see L03 in the script "Statistical Network Analysis" on WueCampus):

In [7]:
pp.statistics.Q_assortativity_coefficient(n, cluster_mapping={'a': 0, 'b': 0, 'c': 1, 'd': 1})

-0.3333333333333333

We test the model with the two stochastic block matrices defined in the lecture. We plot the resulting networks and color the nodes according to the block assignment vector.

In [9]:
M = np.matrix('0.33 0.55; 0.55 0.33')
z = [0, 0, 0, 1, 1, 1]

n1 = stochastic_block_model(M, z, node_names=list("abcdef"))

colors = {0: 'red', 1: 'blue'}
for v in n1.nodes:
    v['color'] = colors[z[n1.nodes.index[v.uid]]]
n1.plot(edge_color='gray')

In [10]:
M = np.matrix('1 0.11; 0.11 1')
z = [0, 0, 0, 1, 1, 1]

n2 = stochastic_block_model(M, z, node_names=list("abcdef"))

colors = {0: 'red', 1: 'blue'}
for v in n2.nodes:
    v['color'] = colors[z[n2.nodes.index[v.uid]]]
n2.plot(edge_color='gray')

Since the stochastic block model is a *generative* model, we can use it to generate new instances of a data set. Let us consider the famous [`karate club` network](https://networks.skewed.de/net/karate), an example of a social network for which we have **ground truth communities**. This ground truth is based on two factions that emerged in the club. We can read this network directly from the `netzschleuder` repository. The information on ground truth communities is contained in the node attribute `groups`.

In [11]:
n = pp.io.graphtool.read_netzschleuder_network('karate', '77')
print(n)

Uid:			0x7f15101355d0
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	34
Number of edges:	77

Network attributes
------------------
name:	karate (77)
description:	Network of friendships among members of a university karate club. Includes metadata for faction membership after a social partition. Note: there are two versions of this network, one with 77 edges and one with 78, due to an ambiguous typo in the original study. (The most commonly used is the one with 78 edges.)[^icon]
[^icon]: Description obtained from the [ICON](https://icon.colorado.edu) project.
citation:	[['W. W. Zachary, "An information flow model for conflict and fission in small groups." Journal of Anthropological Research 33, 452-473 (1977).', 'https://doi.org/10.1086/jar.33.4.3629752']]
url:	https://aaronclauset.github.io/datacode.htm
tags:	['Social', 'Offline', 'Unweighted']
title:	Zachary Karate Club
bibtex:	['@article{zachary_information_1977,\n  doi          = {10.1086/jar.33.4.3629752},\n  

We plot the network while coloring nodes based on ground truth communities:

In [12]:
for v in n.nodes:
    v['color'] = colors[v['groups']-1]
n.plot(edge_color='gray')

The `nodes.index` dictionary allows us to map nodes to a numerical node index that ranges from $0$ to $n-1$. To map an index to a node (rather than a node to its index), we need a reverse dictionary, which we can generate as follows:

In [13]:
reverse_index = {v:k for k, v in n.nodes.index.items()}
print(reverse_index)

{0: '2', 1: '28', 2: '33', 3: '20', 4: '32', 5: '23', 6: '3', 7: '7', 8: '26', 9: '12', 10: '29', 11: '13', 12: '4', 13: '6', 14: '0', 15: '10', 16: '5', 17: '16', 18: '8', 19: '30', 20: '9', 21: '1', 22: '19', 23: '25', 24: '24', 25: '27', 26: '11', 27: '17', 28: '21', 29: '31', 30: '14', 31: '15', 32: '18', 33: '22'}


We can now create a block assignment vector, where each entry $z[i]$ contains the block label of the node with index $i$. To generate new randomized versions of the network with the same community structure, we construct the block assignment vector based on the ground truth communities:

In [14]:
z = [n.nodes[reverse_index[i]]['groups']-1 for i in range(n.number_of_nodes())]
print(z)

[0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]


Finally, we need to specify the entries of the stochastic block matrix based on the observed network. We need to match the link probabilities within and across communities to the number of links observed in the network. In the next part of the lecture, we will formally estimate those entries by likelihood maximization, along the lines discussed in Lecture 03. Here, we adopt the same approach as in the previous lecture, where we estimated the link probability $p$ in the $G(n,p)$ model based on the number of observed links $m$ in a network with $n$ nodes as 

$$\hat{p} = \frac{\langle k \rangle}{n} = \frac{\frac{2m}{n}}{n}$$

In the exercise, you have shown that, for an undirected network without self-loops, where each node has only $n-1$ possible neighbors, the estimate becomes:

$$\hat{p} = \frac{m}{\binom{n}{2}} $$

i.e. the number of observed links divided by the number of possible links. For a given network, the block assignment vector determines the number of nodes in each block as well as the number of links within each block and between each pair of blocks. For the number of links within a block $k$ with $C_k$ nodes we have $\binom{C_k}{2}$ possible links. For the number of links between two different blocks $k$ and $l$ with $C_k$ and $C_l$ nodes respectively, we have $C_k \cdot C_l$ possible links. This allows us to calculate an estimate for the block matrix entries as follows:

In [15]:
def estimate_block_matrix(network, z):
    
    # B is the number of blocks
    B = len(set(z))

    # C[k] counts the number of nodes in block k
    C = Counter(z)

    # E[k,l] counts how many links exist between nodes in block k and block l
    E = np.zeros((B,B))
    for e in network.edges:  # use the network passed as argument, not the global variable n
        if e.v.uid != e.w.uid:     
            E[z[network.nodes.index[e.v.uid]],z[network.nodes.index[e.w.uid]]] += 1
            # increment count for both directions
            if z[network.nodes.index[e.v.uid]] != z[network.nodes.index[e.w.uid]]:
                E[z[network.nodes.index[e.w.uid]],z[network.nodes.index[e.v.uid]]] += 1

    # N[k,l] counts how many links can possibly exist between nodes in block k and l
    N = np.zeros((B,B))

    # calculate number of possible links N[k,l] for all pairs of blocks k, l (see comments above)
    for k in range(B):
        for l in range(B):            
            if k == l:
                N[k,l] = scipy.special.binom(C[k], 2)
            else:
                N[k,l]= C[k] * C[l]

    # estimate block matrix entries by dividing number of observed edges between k, l by number of edges possible between k, l
    M = np.zeros((B,B))
    for k in range(B):
        for l in range(B):
            M[k,l] = E[k,l] / N[k,l]
    return M

We can now estimate the link probabilities in the block matrix such that the expected number of links within and across communities corresponds to the observed numbers in the empirical network. 

In [16]:
M = estimate_block_matrix(n, z)
print(M)

[[0.25735294 0.03806228]
 [0.03806228 0.22794118]]
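
The printed estimates can be traced back to simple fractions. The sketch below uses only the standard library and the block assignment vector printed above. The block sizes and the numbers of possible links follow directly from $z$. The link counts 35, 11 and 31 are not printed anywhere in the notebook; they are an assumption, back-computed from the estimates above, and they add up to the 77 links of the network.

```python
from collections import Counter
from math import comb

# block assignment vector as printed above
z = [0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
     0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
C = Counter(z)  # number of nodes per block

# number of possible links within and between the two blocks
N = {(0, 0): comb(C[0], 2), (0, 1): C[0] * C[1], (1, 1): comb(C[1], 2)}

# ASSUMPTION: link counts back-computed from the printed estimates
E = {(0, 0): 35, (0, 1): 11, (1, 1): 31}

M_hat = {kl: E[kl] / N[kl] for kl in N}
for kl in sorted(M_hat):
    print(kl, E[kl], "/", N[kl], "=", round(M_hat[kl], 8))
print("total links:", sum(E.values()))
```

Both blocks contain 17 nodes, so there are $\binom{17}{2} = 136$ possible links within each block and $17 \cdot 17 = 289$ possible links between them.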


We can now use the fitted block matrix and the block assignment vector constructed from the ground truth communities to generate randomized versions of the empirical network, where the block structure of the original network is preserved:

In [17]:
n_synthetic = stochastic_block_model(M, z, [reverse_index[i] for i in range(n.number_of_nodes())])
print(n_synthetic)
colors = sns.color_palette("Set1", 10)
for v in n_synthetic.nodes:
    v['color'] = colors[z[n_synthetic.nodes.index[v.uid]]]
n_synthetic.plot(edge_color='gray')

Uid:			0x7f15101372e0
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	34
Number of edges:	72
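
The synthetic network has 72 links rather than 77, which is expected: the model fixes only link probabilities, so the number of links varies between realizations. As a rough sketch, each possible link is an independent Bernoulli trial, so the total number of links has mean $\sum N_{kl} M_{kl}$ and variance $\sum N_{kl} M_{kl} (1 - M_{kl})$, where the sums run over the within-block and between-block groups of node pairs. The counts of possible links follow from the two blocks of 17 nodes; the observed link counts per group are an assumption, back-computed from the estimated block matrix printed above.

```python
from math import sqrt

# possible links per group of node pairs (two blocks with 17 nodes each)
N = {(0, 0): 136, (0, 1): 289, (1, 1): 136}
# ASSUMPTION: observed link counts back-computed from the estimated block matrix
E = {(0, 0): 35, (0, 1): 11, (1, 1): 31}

# fitted link probability per group
p = {kl: E[kl] / N[kl] for kl in N}

mean = sum(N[kl] * p[kl] for kl in N)
sd = sqrt(sum(N[kl] * p[kl] * (1 - p[kl]) for kl in N))

observed = 72  # number of links in the synthetic network printed above
print(f"expected links: {mean:.1f}, standard deviation: {sd:.2f}")
print(f"deviation of the sample: {(observed - mean) / sd:.2f} standard deviations")
```

Under these assumptions, the sampled network lies within one standard deviation of the expected 77 links.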
