# 04-01 - Stochastic Block Model

*May 15 2024*  

In the first notebook, we implement the stochastic block model, a probabilistic generative model that generalizes the Erdős-Rényi model and that can be used to generate random networks with a given community structure.

In [1]:
import pathpyG as pp
import numpy as np
import torch

import seaborn as sns
import matplotlib.pyplot as plt

from tqdm.notebook import tqdm

plt.style.use('default')
sns.set_style("whitegrid")
    
from numpy import linalg as npl
import scipy as sp

import scipy.special
from decimal import *
import string
from collections import Counter

from collections import defaultdict

We first implement a `python` function which - for a given stochastic block matrix $\mathbf{M}$ and a fixed block assignment vector $z$ that assigns nodes to communities - generates a random undirected network according to the stochastic block model. This generalizes the $G(n,p)$ model, which we recover if we consider a single block, i.e. $B=1$, with a single link probability $M_{11}=p$.

In [2]:
def stochastic_block_model(M: np.matrix, z: np.array, mapping: pp.IndexMap =None) -> pp.Graph:
    
    # the number of nodes is implicitly given by the length of block assignment vector z 
    n = len(z)
    B = len(set(z))

    # we can use pre-defined node names, if not given, we use contiguous numbers
    if mapping is None:
        mapping = pp.IndexMap([str(i) for i in range(n)])

    edges = []

    # randomly generate links with probabilities given by entries of the stochastic block matrix M
    for u in range(n):
        for v in range(u): # note that since the upper bound is exclusive, we do not add self-loops!
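            # random() draws from [0, 1), so the strict comparison ensures that probability 0 never creates a link
            if np.random.random() < M[z[u], z[v]]: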
                edges.append((mapping.to_id(u), mapping.to_id(v)))
                edges.append((mapping.to_id(v), mapping.to_id(u)))

    G = pp.Graph.from_edge_list(edges, mapping=mapping, num_nodes=n)
    G.data.node_label = torch.tensor(z)
    return G
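
The same sampling step can also be written without the double loop. The following is a minimal sketch that only uses `numpy` and returns a plain adjacency matrix rather than a `pathpyG` graph. The function name `sample_sbm_adjacency` and the `rng` parameter are illustrative choices that are not part of the notebook.

```python
import numpy as np

def sample_sbm_adjacency(M, z, rng=None):
    """Sample a symmetric 0/1 adjacency matrix without self-loops (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    M = np.asarray(M, dtype=float)
    z = np.asarray(z)
    n = len(z)

    # P[u, v] is the link probability of the node pair (u, v), looked up via the block labels
    P = M[np.ix_(z, z)]

    # draw one uniform number per node pair and keep only pairs with u > v (no self-loops)
    lower = np.tril(rng.random((n, n)) < P, k=-1)

    # mirror the strict lower triangle to obtain an undirected network
    return (lower | lower.T).astype(int)

# with probabilities 1 within and 0 across blocks, the result is deterministic
A = sample_sbm_adjacency([[1.0, 0.0], [0.0, 1.0]], [0, 0, 1, 1])
print(A)
```

Because each node pair is decided by a single draw in the lower triangle, the resulting matrix is symmetric by construction, which mirrors the way the loop above adds both directions of each link.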

We test our function in an example with four nodes and two communities consisting of two nodes each.

In [3]:
M = np.matrix('1 0; 0 1')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, pp.IndexMap(list('abcd')))
print(n.get_sparse_adj_matrix().todense())
pp.plot(n, node_label=[v for v in n.mapping.node_ids])
print(n.data.node_label)

[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]


tensor([0, 0, 1, 1])


We can add links that interconnect those communities with probability $0.5$. We further color nodes according to their communities.

In [4]:
M = np.matrix('1 0.5; 0.5 1')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, pp.IndexMap(list('abcd')))

# color nodes 
colors = {0: 'red', 1: 'blue'}
pp.plot(n, node_color = [colors[z[n.mapping.to_idx(v)]] for v in n.nodes], node_label = [v for v in n.mapping.node_ids]);

With the following model parameters, the stochastic block model corresponds to a $G(n,p)$ model with $p=0.5$. 

In [5]:
M = np.matrix('0.5')
z = np.array([0, 0, 0, 0])
n = stochastic_block_model(M, z, pp.IndexMap(list('abcd')))

# color nodes
colors = {0: 'red', 1: 'blue'}
pp.plot(n, node_color = [colors[z[n.mapping.to_idx(v)]] for v in n.nodes], node_label = [v for v in n.mapping.node_ids]);
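
One way to check this correspondence is to compare the expected number of links. The sketch below computes this expectation for an arbitrary block matrix and block assignment vector. The helper `expected_links` is a hypothetical name introduced here for illustration. It assumes that block labels are integers that index directly into the block matrix.

```python
from collections import Counter
from math import comb

import numpy as np

def expected_links(M, z):
    """Expected number of undirected links under the stochastic block model (sketch)."""
    M = np.asarray(M, dtype=float)
    C = Counter(z)          # C[k] is the number of nodes in block k
    blocks = sorted(C)
    total = 0.0
    for i, k in enumerate(blocks):
        # links within block k: binom(C[k], 2) node pairs
        total += M[k, k] * comb(C[k], 2)
        # links between block k and every later block l: C[k] * C[l] node pairs
        for l in blocks[i + 1:]:
            total += M[k, l] * C[k] * C[l]
    return total

# single block with p = 0.5 and four nodes: 0.5 * binom(4, 2)
single_block = expected_links([[0.5]], [0, 0, 0, 0])

# two blocks in which all entries are 0.5 describe the same G(n,p) model
two_blocks = expected_links([[0.5, 0.5], [0.5, 0.5]], [0, 0, 1, 1])

print(single_block, two_blocks)
```

Both parameterizations give the same expectation, because every one of the six node pairs is linked with probability $0.5$ regardless of the block labels.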

We can also create networks where nodes in *different* communities have a preference to be connected.

In [6]:
M = np.matrix('0.1 1; 1 0.1')
z = np.array([0, 0, 1, 1])
n = stochastic_block_model(M, z, pp.IndexMap(list('abcd')))

# color nodes 
colors = {0: 'red', 1: 'blue'}
pp.plot(n, node_color = [colors[z[n.mapping.to_idx(v)]] for v in n.nodes], node_label = [v for v in n.mapping.node_ids]);

We test the model with the two stochastic block matrices defined in the lecture. We plot the resulting networks and color the nodes according to the block assignment vector.

In [7]:
M = np.matrix('0.33 0.55; 0.55 0.33')
z = [0, 0, 0, 1, 1, 1]

n1 = stochastic_block_model(M, z, pp.IndexMap(list("abcdef")))

# color nodes 
colors = {0: 'red', 1: 'blue'}
pp.plot(n1, node_color = [colors[z[n1.mapping.to_idx(v)]] for v in n1.nodes], node_label = [v for v in n1.mapping.node_ids]);

In [8]:
M = np.matrix('1 0.11; 0.11 1')
z = [0, 0, 0, 1, 1, 1]

n2 = stochastic_block_model(M, z, pp.IndexMap(list("abcdef")))

# color nodes 
colors = {0: 'red', 1: 'blue'}
pp.plot(n2, node_color = [colors[z[n2.mapping.to_idx(v)]] for v in n2.nodes], node_label = [v for v in n2.mapping.node_ids]);

Since the stochastic block model is a *generative* model, we can use it to generate new instances of a data set. Let us consider the famous [`karate club` network](https://networks.skewed.de/net/karate), an example of a social network for which we have **ground truth communities**. This ground truth is based on two factions that emerged in the club. We can directly read this network from the `netzschleuder` repository. The information on ground truth communities is contained in the node attribute `node_groups`.

In [9]:
n = pp.io.read_netzschleuder_network('karate', '77')
print(n)

Undirected graph with 34 nodes and 154 (directed) edges

Node attributes
	node_groups		<class 'list'>
	node__pos		<class 'list'>
	node_name		<class 'list'>

Graph attributes
	num_nodes		<class 'int'>
	citation		<class 'str'>
	url		<class 'str'>
	tags		<class 'list'>
	description		<class 'str'>
	name		<class 'str'>



We plot the network while coloring nodes based on ground truth communities:

In [10]:
# color nodes 
colors = {1: 'red', 2: 'blue'}
pp.plot(n, node_color = [colors[n['node_groups', v][0]] for v in n.nodes]);

We can now create a block assignment vector, where entry $z[i]$ contains the block label of the node with index $i$. To generate new randomized versions of the network with the same community structure, we construct the block assignment vector based on the ground truth communities. Since the group labels in the data are $1$ and $2$, we subtract one to obtain zero-based block labels:

In [11]:
z = [n['node_groups', v][0]-1 for v in n.nodes]
print(z)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We finally need to specify the entries of the stochastic block matrix based on the observed network. We need to match the link probabilities within and across communities to the number of links observed in the network. In the next part of the lecture, we will formally estimate those entries based on a likelihood maximization, along the lines discussed in Lecture 03. Here, we adopt the same approach as in the previous lecture, where we estimated the link probability $p$ in the $G(n,p)$ model based on the number of observed links $m$ in a network with $n$ nodes as 

$$\hat{p} = \frac{\langle k \rangle}{n} = \frac{\frac{2m}{n}}{n}$$

In the exercise, you have shown that, for an undirected network without self-loops, where each node has only $n-1$ possible neighbors, the estimate becomes:

$$\hat{p} = \frac{m}{\binom{n}{2}} $$

i.e. the number of observed links divided by the number of possible links. For a given network, the block assignment vector determines the number of nodes in each block as well as the number of links within each block and between each pair of blocks. For the number of links within a block $k$ with $C_k$ nodes we have $\binom{C_k}{2}$ possible links. For the number of links between two different blocks $k$ and $l$ with $C_k$ and $C_l$ nodes respectively, we have $C_k \cdot C_l$ possible links. This allows us to calculate an estimate for the block matrix entries as follows:

In [12]:
def estimate_block_matrix(graph, z):
    
    # B is the number of blocks
    B = len(set(z))

    # C[k] counts the number of nodes in block k
    C = Counter(z)

    # E[k,l] counts how many links exist between nodes in block k and block l
    E = np.zeros((B,B))
    for v, w in graph.edges:
        # each undirected link is stored as two directed edges, so we count it only once
        if v <= w:
            k = z[graph.mapping.to_idx(v)]
            l = z[graph.mapping.to_idx(w)]
            E[k,l] += 1
            # keep E symmetric for links between different blocks
            if k != l:
                E[l,k] += 1

    # N[k,l] counts how many links can possibly exist between nodes in block k and l
    N = np.zeros((B,B))

    # calculate number of possible links N[k,l] for all pairs of blocks k, l (see comments above)
    for k in range(B):
        for l in range(B):            
            if k == l:
                N[k,l] = scipy.special.binom(C[k], 2)
            else:
                N[k,l]= C[k] * C[l]

    # estimate block matrix entries by dividing number of observed edges between k, l by number of edges possible between k, l
    M = np.zeros((B,B))
    for k in range(B):
        for l in range(B):
            M[k,l] = E[k,l] / N[k,l]
    return M

We can now estimate the link probabilities in the block matrix such that the expected number of links within and across communities corresponds to the observed numbers in the empirical network.

In [13]:
M = estimate_block_matrix(n, z)
print(M)

[[0.25735294 0.03806228]
 [0.03806228 0.22794118]]
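
As a sanity check, the printed estimates can be reproduced by hand. The block assignment vector printed above contains 17 nodes in each block, which gives $\binom{17}{2} = 136$ possible links within each block and $17 \cdot 17 = 289$ possible links between them. The link counts in the sketch below (35, 31 and 11) are not read from the data. They are the integers implied by the printed estimates, and they add up to the 77 undirected links (154 directed edges) reported for the network.

```python
from math import comb

import numpy as np

# block sizes counted from the block assignment vector printed above
C = [17, 17]

# number of possible links within and between blocks
N = np.array([[comb(C[0], 2), C[0] * C[1]],
              [C[1] * C[0], comb(C[1], 2)]], dtype=float)

# link counts implied by the printed estimates (derived here, not read from the data)
E = np.array([[35, 11],
              [11, 31]], dtype=float)

# estimated block matrix: observed links divided by possible links
M_hat = E / N

print(N)
print(M_hat)
```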


We can now use the fitted block matrix and the block assignment vector constructed from the ground truth communities to generate randomized versions of the empirical network, where the block structure of the original network is preserved:

In [14]:
n_synthetic = stochastic_block_model(M, z, n.mapping)
print(n_synthetic)
colors = {1: 'red', 2: 'blue'}
pp.plot(n_synthetic, node_color = [colors[n['node_groups', v][0]] for v in n_synthetic.nodes]);

Undirected graph with 34 nodes and 146 (directed) edges

Node attributes
	node_label		<class 'torch.Tensor'> -> torch.Size([34])

Graph attributes
	num_nodes		<class 'int'>
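
The synthetic network has 146 directed edges, i.e. 73 links, while the empirical network has 77. This difference is expected, because the model only fixes the link counts in expectation. The sketch below quantifies the fluctuation. It uses the fact that the number of links in each block pair follows a binomial distribution and draws these counts directly with `numpy` instead of generating full networks. The block sizes (17 and 17) and the fitted probabilities are taken from the outputs above, written as fractions of the implied link counts. The seed and the number of samples are arbitrary choices.

```python
import numpy as np

# possible links: within block 0, within block 1, between the two blocks
possible = np.array([136, 136, 289])

# fitted link probabilities for the same three groups of node pairs
probs = np.array([35 / 136, 31 / 136, 11 / 289])

# mean and standard deviation of the total number of links (sum of independent binomials)
expected = float(np.sum(possible * probs))
std = float(np.sqrt(np.sum(possible * probs * (1 - probs))))

# draw the three link counts for many synthetic networks and sum them up
rng = np.random.default_rng(42)
samples = rng.binomial(possible, probs, size=(10_000, 3)).sum(axis=1)

print(expected, std)
print(samples.mean(), samples.std())
```

The expected number of links equals the 77 observed links, with a standard deviation of roughly 7.8 links, so a realization with 73 links is well within the typical range.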

