# 06-03 - Compressing random walks

*June 3 2024*  

We now show how minimizing the code length of random walks helps us detect community structures while avoiding both under- and over-fitting of communities.

In [1]:
import pathpyG as pp
import scipy as sp
import numpy as np
import seaborn as sns
import torch
import queue
from collections import Counter

from collections import defaultdict

For the illustrative example in the lecture, we simplified the discussion of flow compression by assuming a dedicated delimiter character between the community and the node labels. Here we avoid this simplification: as in Lecture 05, we encode the community sequences with a Huffman tree, which yields a prefix-free code that needs no delimiter.
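
To make the prefix-free property concrete, the following minimal sketch checks whether a set of codewords is prefix-free and decodes a bit string without any delimiter. The codebook `example_code` and the helper names `is_prefix_free` and `decode` are illustrative assumptions and not part of pathpyG or the lecture code.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # after lexicographic sorting, any codeword that is a prefix of another
    # codeword is directly followed by a codeword that starts with it
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def decode(bits, code):
    """Decode a bit string by reading it left to right, without delimiters."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    symbols, buffer = [], ''
    for bit in bits:
        buffer += bit
        # in a prefix-free code, the first match is the only possible match
        if buffer in inverse:
            symbols.append(inverse[buffer])
            buffer = ''
    return symbols


# hypothetical codebook for three symbols
example_code = {'0': '1', '1': '00', '2': '01'}
encoded = ''.join(example_code[s] for s in ['1', '0', '0', '2', '0'])
print(is_prefix_free(example_code.values()))
print(encoded, decode(encoded, example_code))
```

For a set of codewords that is not prefix-free, such as `0`, `01` and `11`, a decoder reading `0` could not tell whether the codeword is already complete.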

For this, we first reuse last week's function that constructs a Huffman tree:

In [2]:
def huffman_tree(sequence):

    counts = Counter(sequence).most_common()
    seq_length = len(sequence)

    # symbols with lowest frequency have highest priority
    q = queue.PriorityQueue()
    
    labels = {}
    node_type = {}
    for (symbol, count) in counts:
        # create leaf nodes and add to queue
        labels[symbol]='{0} / {1:.3f}'.format(symbol, count/seq_length)
        q.put((count, symbol))
        node_type[symbol] = 'leaf'        
    # Create huffman tree
    i = 0
    
    edges = []
    edge_symbols = []
    
    while q.qsize()>1:

        # retrieve two symbols with minimal frequency
        left = q.get()
        right = q.get()

        total_frequency = left[0] + right[0]

        # create internal node v with total frequency as label
        v = 'n_' + str(i)
        label = '{:.2f}'.format(total_frequency/seq_length)
        labels[v] = label
        node_type[v] = 'internal'
        edges.append((v,left[1]))
        edge_symbols.append('0')
        edges.append((v,right[1]))
        edge_symbols.append('1')

        q.put((left[0] + right[0], v))
        i += 1

    # the remaining entry corresponds to the root node
    root = q.get()
    # use a local name that does not shadow the function huffman_tree
    tree = pp.Graph.from_edge_list(edges)
    tree.data.node_labels = [labels[v] for v in tree.nodes]
    tree.data.node_type = [node_type[v] for v in tree.nodes]
    tree.data.edge_symbols = edge_symbols
    return tree, root[1]

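def generate_mapping(tree, node, prefix='', lookup_table=None):
    """Recursively turns Huffman tree into a lookup dictionary"""
    # create a fresh table for each top-level call: a mutable default
    # argument would be shared and keep stale entries from earlier calls
    if lookup_table is None:
        lookup_table = {}
    if tree.out_degrees[node] == 0:
        # leaf node: the accumulated prefix is the codeword of this symbol
        lookup_table[node] = prefix
    else:
        for x in tree.successors(node):
            # append the bit that labels the edge (node, x)
            symbol = tree['edge_symbols'][list(tree.edges).index((node, x))]
            generate_mapping(tree, x, prefix=prefix + symbol, lookup_table=lookup_table)

    return lookup_table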

We further generate the example network from the lecture, which has four ground truth communities:

In [3]:
edges = []
for i in range(4):
    for v in range(i*4, i*4+4):
        for w in range(i*4, i*4+4):
            if v != w and (str(v),str(w)) not in edges:
                edges.append((str(v), str(w)))

edges.append(('2', '4'))
edges.append(('6', '8'))
edges.append(('10', '12'))
edges.append(('14', '0'))

g = pp.Graph.from_edge_list(edges)
pp.plot(g, node_label = [g.mapping.to_id(i) for i in range(g.N)]);

The example sequence of 32 nodes traversed by the random walk is:

In [4]:
walk = "0,1,2,3,1,0,3,2,4,5,6,7,4,5,7,6,8,9,10,11,9,8,11,10,12,13,14,15,13,12,14,15".split(",")
print(walk)

['0', '1', '2', '3', '1', '0', '3', '2', '4', '5', '6', '7', '4', '5', '7', '6', '8', '9', '10', '11', '9', '8', '11', '10', '12', '13', '14', '15', '13', '12', '14', '15']


Let us try to compress this sequence. For the compression using a Huffman code we obtain the following coding tree:

In [5]:
tree, root = huffman_tree(walk)
print(tree)
tree.data.node_colors = ['orange' if x == 'leaf' else 'grey' for x in tree.data.node_type]

pp.plot(tree, node_label = tree.data.node_labels, edge_color= ['green' if x == '1' else 'red' for x in tree.data.edge_symbols], node_color = tree.data.node_colors)

Directed graph with 31 nodes and 30 edges

Node attributes
	node_type		<class 'list'>
	node_labels		<class 'list'>

Edge attributes
	edge_symbols		<class 'list'>

Graph attributes
	num_nodes		<class 'int'>




We see that each node appears with the same relative frequency of $1/16 = 0.0625$ in this sequence. The Huffman code therefore assigns a four-bit codeword to every node, i.e. for this simple example we cannot use differences in the *node visitations* to compress the sequence.
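
This can be checked with a minimal sketch that computes the Shannon entropy of the empirical node frequencies, which is a lower bound for the average codeword length of any prefix-free code that encodes the walk node by node. The helper name `entropy_bits` is an assumption made for illustration.

```python
from collections import Counter
from math import log2


def entropy_bits(sequence):
    """Shannon entropy (in bits per symbol) of the empirical symbol frequencies."""
    n = len(sequence)
    return -sum(c / n * log2(c / n) for c in Counter(sequence).values())


walk = "0,1,2,3,1,0,3,2,4,5,6,7,4,5,7,6,8,9,10,11,9,8,11,10,12,13,14,15,13,12,14,15".split(",")
h = entropy_bits(walk)
# bits per node and lower bound for the whole walk
print(h, h * len(walk))
```

For the uniform frequencies of this walk the entropy is four bits per node, so the 128 bits of the fixed-length code are already optimal for a node-by-node code.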

In [6]:
lookup_table = generate_mapping(tree, root)
compressed = ''.join([lookup_table[v] for v in walk])
print(compressed)
print('Original:   {0} bits'.format(len(walk)*4))
print('Compressed: {0} bits'.format(len(compressed)))

01010100111111100100010111101111110111001011101011011100101010111000100101100111100110000111011000010000001100100000000100110010
Original:   128 bits
Compressed: 128 bits


However, we can use knowledge about community structures in the network to generate a hierarchical coding scheme that consists of two components: The first component encodes the currently visited community, while the second component stores a node identifier that is relative to the current community encoded in the first component.
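
The two components can be derived from a walk as in the following sketch. It assumes that communities are consecutive blocks of node labels (here two blocks of eight nodes) and that the first step always announces the community in which the walk starts. The function name `hierarchical_labels` and the dictionaries `community_of` and `local_id` are illustrative assumptions.

```python
def hierarchical_labels(walk, community_of, local_id):
    """Split a walk into community-switch symbols and community-relative node ids."""
    switches, local = [], []
    current = None
    for v in walk:
        c = community_of[v]
        # '0' means that we stay in the current community,
        # any other symbol names the community that we enter
        switches.append('0' if c == current else str(c))
        local.append(str(local_id[v]))
        current = c
    return switches, local


walk = "0,1,2,3,1,0,3,2,4,5,6,7,4,5,7,6,8,9,10,11,9,8,11,10,12,13,14,15,13,12,14,15".split(",")

# assumption: two communities that consist of nodes 0-7 and 8-15
block_size = 8
community_of = {str(v): v // block_size + 1 for v in range(16)}
local_id = {str(v): v % block_size for v in range(16)}

switches, local = hierarchical_labels(walk, community_of, local_id)
print(",".join(switches))
print(",".join(local))
```

Setting `block_size` to 4 or 2 yields the corresponding sequences for four or eight communities.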

Let us start with a mapping of nodes to two communities. We encode community switches with three different symbols: `0` indicates that we did not switch to another community, `1` indicates that we switched to community 1 and `2` indicates that we switched to community 2.

By using a Huffman coding scheme, we obtain a prefix-free code for the community labels. A naive fixed-length encoding of the three symbols would require two bits per step, i.e. we would need 2 * 32 = 64 bits to encode the sequence of visited communities.

Since the majority of random walk transitions do not switch to another community under this mapping, the symbol `0` dominates the sequence and Huffman coding compresses the sequence of community labels to 34 bits.
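
The 34 bits can be reproduced without building a graph. The following sketch is a dependency-free alternative based on `heapq` that only tracks the codeword length of each symbol; it is not the pathpyG-based implementation used in this notebook, and the name `huffman_code_lengths` is an assumption.

```python
import heapq
from collections import Counter


def huffman_code_lengths(sequence):
    """Return the Huffman codeword length (in bits) of each symbol in sequence."""
    counts = Counter(sequence)
    if len(counts) == 1:
        # a single symbol still needs one bit per occurrence
        return {next(iter(counts)): 1}
    # heap entries: (weight, tie breaker, symbols contained in the subtree)
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # merging two subtrees moves all of their symbols one level down
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


communities = "1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0".split(",")
lengths = huffman_code_lengths(communities)
total_bits = sum(lengths[s] for s in communities)
print(lengths, total_bits)
```

The frequent symbol `0` receives a one-bit codeword and the two rare switch symbols receive two bits each, which gives 30 * 1 + 2 * 2 = 34 bits.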

In [7]:
communities = "1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0".split(",")
tree, root = huffman_tree(communities)
lookup_table = generate_mapping(tree, root)
compressed = ''.join([lookup_table[v] for v in communities])
print(compressed)
print('Original:   {0} bits'.format(len(communities)*2))
print('Compressed: {0} bits'.format(len(compressed)))

tree.data.node_colors = ['orange' if x == 'leaf' else 'grey' for x in tree.data.node_type]
pp.plot(tree,node_label = tree.data.node_labels, edge_color= ['green' if x == '1' else 'red' for x in tree.data.edge_symbols], node_color = tree.data.node_colors)

0011111111111111101111111111111111
Original:   64 bits
Compressed: 34 bits



Since each of the two communities contains eight nodes, we now need only three bits per step to encode the node within its community, i.e. we obtain a total of 3 * 32 = 96 bits.
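
The number of bits per step follows from the community size. A minimal sketch, assuming a fixed-length binary code for the node identifiers within a community (the helper name `fixed_length_bits` is hypothetical):

```python
def fixed_length_bits(num_symbols):
    """Bits per symbol of a fixed-length binary code for num_symbols symbols."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2
    return max(1, (num_symbols - 1).bit_length())


steps = 32
for community_size in (16, 8, 4, 2):
    bits = fixed_length_bits(community_size)
    print(community_size, 'nodes per community:', bits, 'bits per step,', bits * steps, 'bits in total')
```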

In [8]:
walk = "0,1,2,3,1,0,3,2,4,5,6,7,4,5,7,6,0,1,2,3,1,0,3,2,4,5,6,7,5,4,6,7".split(",")
print('Original:   {0} bits'.format(len(walk)*3))

Original:   96 bits


With this hierarchical coding scheme, we need a total of 96 + 34 = 130 bits for a prefix-free code. Note that this is longer than the 128 bits of our naive encoding of the node sequence, where we did not consider any community labels. The reason is that the 34 bits spent on the three community symbols (entering either of the two communities, or staying in the current community) outweigh the 32 bits that we save on node identifiers.

Let's now consider a coding where we use four communities that correspond to the ground truth communities, with labels `0` (no switch) and `1` to `4` (for a switch to the four communities). Using Huffman coding, we now need a total of 40 bits for a prefix-free encoding of the community switches:

In [9]:
communities = "1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0".split(",")
tree, root = huffman_tree(communities)
lookup_table = generate_mapping(tree, root)
compressed = ''.join([lookup_table[v] for v in communities])
print(compressed)
print('Original:   {0} bits'.format(len(communities)*3))
print('Compressed: {0} bits'.format(len(compressed)))

tree.data.node_colors = ['orange' if x == 'leaf' else 'grey' for x in tree.data.node_type]
pp.plot(tree,node_label = tree.data.node_labels, edge_color= ['green' if x == '1' else 'red' for x in tree.data.edge_symbols], node_color = tree.data.node_colors)

0001111111001111111101011111110111111111
Original:   96 bits
Compressed: 40 bits



The encoding of nodes now only requires 2*32 = 64 bits, so we get a total of 64 + 40 = 104 bits for the prefix-free encoding:

In [10]:
walk = "0,1,2,3,1,0,3,2,0,1,2,3,0,1,3,2,0,1,2,3,1,0,3,2,0,1,2,3,1,0,2,3".split(",")
print('Original:   {0} bits'.format(len(walk)*2))

Original:   64 bits


We finally consider the case where we use a hierarchical coding scheme with eight communities. Here we switch communities 16 times, so we have:

In [11]:
communities = "1,0,2,0,1,0,2,0,3,0,4,0,3,0,4,0,5,0,6,0,5,0,6,0,7,0,8,0,7,0,8,0".split(",")
tree, root = huffman_tree(communities)
lookup_table = generate_mapping(tree, root)
compressed = ''.join([lookup_table[v] for v in communities])
print(compressed)
print('Original:   {0} bits'.format(len(communities)*4))
print('Compressed: {0} bits'.format(len(compressed)))

tree.data.node_colors = ['orange' if x == 'leaf' else 'grey' for x in tree.data.node_type]
pp.plot(tree,node_label = tree.data.node_labels, edge_color= ['green' if x == '1' else 'red' for x in tree.data.edge_symbols], node_color = tree.data.node_colors)

10000100101000010010101001011010100101101100011010110001101011100111101110011110
Original:   128 bits
Compressed: 80 bits



Since each of the eight communities contains only two nodes, a single bit per step suffices to identify the node, i.e. we need 32 bits to encode the node sequence. Adding the 80 bits for the compressed community sequence, we obtain 80 + 32 = 112 bits, which is more than the 104 bits obtained for the coding scheme with four communities.

In [16]:
# node identifiers relative to each two-node community (node label modulo 2)
walk = "0,1,0,1,1,0,1,0,0,1,0,1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,0,1,1,0,0,1".split(",")
print('Original:   {0} bits'.format(len(walk)))

Original:   32 bits


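This simple example shows that the mapping of nodes to the four ground truth communities compresses the walk sequence best: it requires 104 bits, compared to 130 bits for two communities and 112 bits for eight communities. With too few communities we spend too many bits on node identifiers, while with too many communities we spend too many bits on community switches. The optimal mapping lets us exploit the redundancy in the community labels by means of an optimal prefix-free Huffman code. We note that in a real example, we could also use Huffman coding to exploit different frequencies of node visitations, i.e. we can assign shorter codewords to nodes that appear more frequently in a random walk sequence.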
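
The comparison can be summarized in a short dependency-free sketch. It assumes that communities are consecutive blocks of node labels and that node identifiers within a community use a fixed-length binary code, and it relies on the fact that the total length of a Huffman-encoded sequence equals the sum of the weights of all merged (internal) tree nodes. The names `huffman_total_bits` and `hierarchical_bits` are illustrative assumptions.

```python
import heapq
from collections import Counter


def huffman_total_bits(sequence):
    """Total length of the Huffman-encoded sequence in bits."""
    heap = list(Counter(sequence).values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # each merge adds one bit to every symbol occurrence below it
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def hierarchical_bits(walk, block_size):
    """Code length for communities given by consecutive blocks of block_size nodes."""
    switches, current = [], None
    for v in walk:
        c = int(v) // block_size
        # 0 = stay in the current community, c + 1 = enter community c + 1
        switches.append(0 if c == current else c + 1)
        current = c
    node_bits = len(walk) * max(1, (block_size - 1).bit_length())
    return huffman_total_bits(switches) + node_bits


walk = "0,1,2,3,1,0,3,2,4,5,6,7,4,5,7,6,8,9,10,11,9,8,11,10,12,13,14,15,13,12,14,15".split(",")
for block_size in (8, 4, 2):
    print(16 // block_size, 'communities:', hierarchical_bits(walk, block_size), 'bits')
```

For this walk the sketch reproduces the 130, 104 and 112 bits obtained above for two, four and eight communities.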