# Dataset generation

There are several ways to obtain data for working with multilayer networks:

1. Generate a synthetic dataset
2. Generate a subset of an existing large dataset for faster iteration
3. Download and prepare a large real-world dataset

## 0. Imports and paths

In [1]:
#| default_exp utils

In [2]:
#| hide
from nbdev.showdoc import *

In [3]:
#| export
import networkx as nx
import csv
from pathlib import Path
import random
import string

In [4]:
#| hide
import nbdev; nbdev.nbdev_export()

In [5]:
path_data = Path("../data")
path_data.mkdir(exist_ok=True)

## 1. Generate synthetic dataset
Create a small multilayer graph for testing purposes.

In [6]:
#| export 
def obfuscate_nodes(n_nodes:int, str_len:int=10) -> dict:
    """
    Maps nodes from range [0, n_nodes) to unique random strings of length str_len.
    Strings consist of uppercase letters and digits and always start with a letter.
    """
    def encode(str_len):
        return ''.join(random.choices(string.ascii_uppercase, k=1)) +\
               ''.join(random.choices(string.ascii_uppercase + string.digits, k=(str_len-1)))
    
    d = {}
    for i in range(n_nodes):
        x = encode(str_len)
        # guarantee uniqueness
        while x in d.values():
            x = encode(str_len)
        
        d[i] = x
    
    return d

In [7]:
d = obfuscate_nodes(n_nodes=10)
d

{0: 'EH3XDJO27A',
 1: 'VCR3LQTI4J',
 2: 'H2TLNQRT25',
 3: 'E1YYG4ZYOW',
 4: 'VS6IXQ1XV2',
 5: 'N0WNCK2YCT',
 6: 'WDCEJ2MUZ8',
 7: 'UEC61ZVTEW',
 8: 'PNQJIJWMSF',
 9: 'PG03UORXEN'}
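
The labels above change on every run because the global `random` state is unseeded. As a minimal sketch, assuming reproducible labels are wanted, the variant below takes a `seed` and tracks used labels in a set, so the uniqueness check no longer scans `d.values()`. The name `obfuscate_nodes_seeded` and its `seed` parameter are hypothetical and not part of the exported module.

```python
import random
import string


def obfuscate_nodes_seeded(n_nodes, str_len=10, seed=None):
    """Hypothetical seeded variant of obfuscate_nodes (illustration only)."""
    first = string.ascii_uppercase
    rest = string.ascii_uppercase + string.digits
    # Number of distinct labels available; beyond it the retry loop could not end.
    capacity = len(first) * len(rest) ** (str_len - 1)
    if n_nodes > capacity:
        raise ValueError("not enough distinct labels for the requested str_len")

    rng = random.Random(seed)  # private generator, global random state untouched
    used = set()               # set membership test instead of scanning dict values
    mapping = {}
    for i in range(n_nodes):
        label = rng.choice(first) + ''.join(rng.choices(rest, k=str_len - 1))
        while label in used:
            label = rng.choice(first) + ''.join(rng.choices(rest, k=str_len - 1))
        used.add(label)
        mapping[i] = label
    return mapping


d_seeded = obfuscate_nodes_seeded(n_nodes=10, seed=42)
```

Calling the function twice with the same `seed` returns the same mapping, which can help when comparing experiments across runs.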

The following is a thin wrapper around the random graph generators from the [networkx](https://networkx.org/documentation/stable/reference/generators.html#module-networkx.generators.random_graphs) package.

In [8]:
#| export
def create_random_graph(generator:callable, obfuscate:bool=True, **kwargs):
    """Thin wrapper around networkx's random graph generator

    Args:
        generator (callable): networkx's random graph generator
        obfuscate (bool, optional): if True, relabels nodes with random strings. Defaults to True.
        **kwargs: passed to generator
    Example:
        >>> G = create_random_graph(nx.erdos_renyi_graph, n=10, p=0.6, directed=False)
    """
    G = generator(**kwargs)
    if obfuscate:
        G = nx.relabel_nodes(G, obfuscate_nodes(G.number_of_nodes()))
    return G

In [9]:
G = create_random_graph(nx.erdos_renyi_graph, n=10, p=0.6, directed=False)

In [10]:
G.nodes()

NodeView(('LD93XCQ5OS', 'ZV29NBKL64', 'CBYHJPXKEF', 'GFEJQT1UT3', 'ARJEENQ08A', 'JBT7NVJXW5', 'PE1P3YDXL5', 'BYMXLVBB2E', 'EXR76ODCYA', 'Q54BXC0AKM'))

Create a random graph and write it to disk:

In [11]:
#| export
def create_and_save_random_graph(generator:callable,
                                 label:str,
                                 path_to:str|Path,
                                 obfuscate:bool=True,
                                 **kwargs):
    """
    Creates a random graph and saves it to path_to as a space-delimited edge list,
    one `source target label` row per edge.

    Args:
        generator (callable): networkx's random graph generator
        label (str): layer label written as the third column of every row
        path_to (str|Path): path to save the graph
        obfuscate (bool, optional): if true, obfuscates node names. Defaults to True.
        **kwargs: passed to generator
    
    Example:
        >>> create_and_save_random_graph(nx.erdos_renyi_graph, label='first', path_to=path_data / 'synthetic' / '1.csv', n=10, p=0.6, directed=False)
    """

    G = create_random_graph(generator=generator, obfuscate=obfuscate, **kwargs)
    edges = G.edges()
    with open(path_to, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=' ')
        for edge in edges:
            writer.writerow([edge[0], edge[1], label])
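
To check what the function above writes, the file can be read back with the same `csv` settings. The snippet below is a small sketch using only the standard library; `read_layer_edges` is a hypothetical helper, and the two rows are made-up sample data written to a temporary file in the same `source target label` format.

```python
import csv
import tempfile
from pathlib import Path


def read_layer_edges(path):
    """Hypothetical helper: read `source target label` rows back as tuples."""
    with open(path, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=' ')
        # Skip empty rows, keep the three columns written per edge.
        return [(row[0], row[1], row[2]) for row in reader if row]


# Round trip on a temporary file that mimics the writer used above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'layer.csv'
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=' ')
        writer.writerow(['A1', 'B2', 'first'])  # made-up sample edge
        writer.writerow(['B2', 'C3', 'first'])  # made-up sample edge
    edges = read_layer_edges(path)
```

Because the obfuscated labels contain only letters and digits, the space delimiter never collides with a node name and no quoting is needed.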

Create three layers. For experiments with the CmmD algorithm we need at least several hundred nodes to see how the clustering changes across a range of $\gamma$ values:

In [12]:
path_dir_to = path_data / 'synthetic'
path_dir_to.mkdir(exist_ok=True)

In [13]:
n, p, label = 300, 0.6, "first"
create_and_save_random_graph(nx.erdos_renyi_graph, label, path_dir_to / '1.csv', n=n, p=p)

n, p, label = 500, 0.4, "second"
create_and_save_random_graph(nx.erdos_renyi_graph, label, path_dir_to / '2.csv', n=n, p=p)

n, p, label = 400, 0.7, "third"
create_and_save_random_graph(nx.erdos_renyi_graph, label, path_dir_to / '3.csv', n=n, p=p)

The method above generates obfuscated graphs whose node labels differ from layer to layer. However, multilayer community analysis requires that at least some nodes share labels across layers.
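
One possible way to get shared labels, sketched here under the assumption that each layer is first generated with integer node ids (for example with `obfuscate=False`), is to build a single mapping and apply it to every layer. The helper `relabel_layers`, the tiny hand-written `mapping`, and the sample edges are hypothetical and serve only to illustrate the idea.

```python
def relabel_layers(layers, mapping):
    """Hypothetical helper: apply ONE node mapping to the edges of every layer.

    layers:  dict of layer label -> iterable of (u, v) integer edges
    mapping: dict of integer node id -> obfuscated label, shared by all layers
    """
    rows = []
    for layer_label, edges in layers.items():
        for u, v in edges:
            # The same integer id always receives the same label, whatever the layer.
            rows.append((mapping[u], mapping[v], layer_label))
    return rows


# Made-up example: node 0 appears in both layers and keeps one label.
mapping = {0: 'AAA0000001', 1: 'BBB0000002', 2: 'CCC0000003'}
layers = {'first': [(0, 1), (1, 2)], 'second': [(0, 2)]}
rows = relabel_layers(layers, mapping)
```

With networkx graphs, the same effect could be obtained by calling `nx.relabel_nodes(G, mapping)` on each unobfuscated layer with one shared `mapping` sized for the largest layer.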