# Tutorial: Generating model-based networks using the `CRep` algorithm

Welcome to this tutorial on using the _Probabilistic Generative Models for Network Analysis_ (`pgm`) package. In this tutorial, we'll walk through the process of generating a network that follows a particular probabilistic model. We'll use the `CRep` algorithm to generate a network with planted reciprocity and community structure.


## Generating a network using the `CRep` algorithm

The first step in generating a network is to set up the configuration file, which contains the parameters that the model uses to generate the network. As explained in reference [1], the `CRep` algorithm has several parameters that can be set, including the number of nodes, the number of communities, the reciprocity coefficient, and the community strengths. Instead of setting these parameters manually, we start from the example configuration file included in the package, which illustrates the model's basic needs.
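For orientation, the model behind these parameters can be summarised as follows. This is the notation as we recall it from [1]; the paper remains the authoritative statement. Each directed entry of the adjacency matrix is Poisson-distributed conditionally on the entry in the opposite direction:

$$
A_{ij} \mid A_{ji} \sim \mathrm{Pois}\left(\lambda_{ij} + \eta\, A_{ji}\right),
\qquad
\lambda_{ij} = \sum_{k,q=1}^{K} u_{ik}\, v_{jq}\, w_{kq}
$$

Here $u$ and $v$ are the out-going and in-coming community memberships, $w$ is the affinity matrix between the $K$ communities, and $\eta$ is the reciprocity coefficient. Note that $\eta$ is a model parameter, not the observed fraction of reciprocated edges.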

In [1]:
# We import the `open_binary` function from the `importlib.resources` module. This function is used to open a binary file included in a package.
from importlib.resources import open_binary

# We import the `yaml` module to convert the data from a YAML formatted string into a Python dictionary.
import yaml

# Define the path to the configuration file for the 'CRep' algorithm.
config_path = 'setting_syn_data_CRep.yaml'

# Open the configuration file for the 'CRep' algorithm
with open_binary('pgm.data.model', config_path) as fp:
    # Load the contents of the configuration file into a dictionary
    synthetic_configuration = yaml.load(fp, Loader=yaml.Loader)

In [2]:
synthetic_configuration

{'N': 600,
 'K': 3,
 'eta': 0.5,
 'k': 20,
 'ExpM': None,
 'over': 0.0,
 'corr': 0.0,
 'seed': 0,
 'alpha': 0.1,
 'ag': 0.1,
 'beta': 0.1,
 'Normalization': 0,
 'structure': 'assortative',
 'end_file': '',
 'out_folder': '../data/input/',
 'output_parameters': True,
 'output_adj': True,
 'outfile_adj': None}

For a more detailed explanation of the parameters, we refer the reader to the publication [1]. As the dictionary shows, the reciprocity coefficient `eta` is set to 0.5, which gives the network a moderate level of reciprocity. We will increase this value to 0.8 to generate a network with a higher level of reciprocity. We will also modify some settings that control the output of the algorithm, as follows.

In [3]:
# Increase the reciprocity coefficient
synthetic_configuration['eta'] = 0.8

# The flag 'output_parameters' determines whether the parameters of the model should be saved to a file.
synthetic_configuration['output_parameters'] = False
# The flag 'output_adj' determines whether the adjacency matrices should be saved to a file.
synthetic_configuration['output_adj'] = True
# The entry 'out_folder' sets the output folder for the adjacency matrices.
synthetic_configuration['out_folder'] = 'tutorial_outputs/CRep_synthetic/'
# The entry 'outfile_adj' sets the name of the file for the adjacency matrices.
synthetic_configuration['outfile_adj'] = 'syn_dataframe.dat'
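
Because the configuration is a plain dictionary, a misspelled key (say `'outfolder'`) is added silently instead of overriding the intended entry. A minimal sketch of a guard against this, assuming a hypothetical helper `override_config` that is not part of `pgm`:

```python
def override_config(base, **updates):
    """Return a copy of `base` with `updates` applied, rejecting unknown keys.

    Hypothetical helper for illustration; it is not part of the `pgm` package.
    """
    unknown = sorted(set(updates) - set(base))
    if unknown:
        raise KeyError(f"Unknown configuration keys: {unknown}")
    new_config = dict(base)  # shallow copy, so `base` is left untouched
    new_config.update(updates)
    return new_config


# A small stand-in for the dictionary loaded from the YAML file.
example_configuration = {
    "eta": 0.5,
    "output_parameters": True,
    "out_folder": "../data/input/",
}

updated = override_config(example_configuration, eta=0.8, output_parameters=False)
print(updated["eta"])  # 0.8
```

The helper returns a new dictionary, so the original configuration stays available for comparison.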

Once the parameters are set, we can generate the network using the `GM_reciprocity` class.

In [4]:
# We load the `GM_reciprocity` class from the `pgm.input.generate_network` module.
from pgm.input.generate_network import GM_reciprocity

In [5]:
# We create `gen`, an instance of the `GM_reciprocity` class, using the configuration parameters.
gen = GM_reciprocity(**synthetic_configuration)

We can check that the parameters were indeed set correctly by inspecting the attributes of the `gen` object.

In [6]:
gen.__dict__

{'N': 600,
 'K': 3,
 'k': 20,
 'seed': 0,
 'alpha': 0.1,
 'ag': 0.1,
 'beta': 0.1,
 'end_file': '',
 'out_folder': 'tutorial_outputs/CRep_synthetic/',
 'output_parameters': False,
 'output_adj': True,
 'outfile_adj': 'syn_dataframe.dat',
 'eta': 0.8,
 'ExpM': 6000,
 'over': 0.0,
 'corr': 0.0,
 'Normalization': 0,
 'structure': 'assortative'}
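
One attribute differs from the configuration: `ExpM` was `None` in the YAML file but is `6000` on the `gen` object. That value is consistent with an expected number of edges of `N * k / 2` when `ExpM` is left unset, reading `k` as the average degree. This reading is inferred from the printed values rather than taken from the package documentation. The arithmetic can be checked directly:

```python
# Values copied from the configuration printed above.
N = 600  # number of nodes
k = 20   # assumed to be the average degree

# Assumption: when ExpM is None, the expected number of edges is N * k / 2.
expected_edges = int(N * k / 2)
print(expected_edges)  # 6000
```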

As we can see, the parameters were set correctly. Now, we can generate the network using the `reciprocity_planted_network` method. This method creates a dataframe with the network's edges and saves it to the file `syn_dataframe.dat` in the output folder.

In [7]:
# We generate the network using the `reciprocity_planted_network` method.
gen.reciprocity_planted_network();

We can now load the dataframe and inspect what the network looks like.

In [8]:
import pandas as pd
# Load the dataframe
df = pd.read_csv(synthetic_configuration['out_folder'] + synthetic_configuration['outfile_adj'], sep=' ', header=None)
# Print the first 5 rows of the dataframe 
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | source | target | w |
| 1 | 0 | 18 | 1 |
| 2 | 0 | 50 | 1 |
| 3 | 0 | 95 | 1 |
| 4 | 0 | 196 | 1 |


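As we can see, the dataframe contains the edges of the network. The first column holds the source node, the second column the target node, and the last column the weight of the edge. Because the file was read with `header=None`, the column names (`source`, `target`, `w`) appear as the first row of data rather than as the header.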
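
The file itself is a plain space-separated edge list with a header row. A minimal sketch of reading such a file with only the standard library, using an in-memory sample built from the rows shown above (the variable names are illustrative):

```python
import csv
import io

# In-memory stand-in for `syn_dataframe.dat`, built from the rows shown above.
sample = io.StringIO(
    "source target w\n"
    "0 18 1\n"
    "0 50 1\n"
    "0 95 1\n"
    "0 196 1\n"
)

# DictReader consumes the first row as column names.
reader = csv.DictReader(sample, delimiter=" ")
edges = [(int(row["source"]), int(row["target"]), int(row["w"])) for row in reader]

print(edges[0])    # (0, 18, 1)
print(len(edges))  # 4
```

With pandas, passing `header=0` instead of `header=None` to `pd.read_csv` has the same effect of treating the first row as column names.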

This way we have generated a network using the `CRep` algorithm. In the next section, we will use the `pgm` package to analyze the network and extract the community structure and reciprocity coefficient.

## Analyzing the network using the `pgm` package

First, we import the data using the `pgm` package. That is, we load the data from the `syn_dataframe.dat` file and build the adjacency matrices needed to run the `CRep` algorithm.

In [9]:
from pgm.input.loader import import_data
from pathlib import Path

# Define the names of the columns in the input file that 
# represent the source and target nodes of each edge
ego = 'source'
alter = 'target'

# Set the 'force_dense' flag to False
force_dense = False

# Call the 'import_data' function to load the data from the input file
A, B, B_T, data_T_vals = import_data(Path(synthetic_configuration['out_folder']) / synthetic_configuration['outfile_adj'],
                                     ego=ego,
                                     alter=alter,
                                     force_dense=force_dense,
                                     header=0)

Notice that `import_data` prints some information about the data, such as the number of nodes and edges, together with the actual reciprocity of the network. Although we set the reciprocity coefficient to 0.8, the reciprocity measured on the generated network is 0.638. This is because the generated network must also follow other principles of the model, such as the community structure. We will store this value for later comparison with the inferred reciprocity coefficient.
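
To make the reported quantity concrete, here is a minimal sketch that computes reciprocity as the fraction of directed edges whose reverse edge is also present. That this matches the exact definition used by `import_data` is an assumption, and the function name `edge_reciprocity` is ours:

```python
def edge_reciprocity(edges):
    """Fraction of directed edges (i, j) for which (j, i) is also an edge."""
    edge_set = set(edges)  # duplicates are ignored
    if not edge_set:
        return 0.0
    reciprocated = sum(1 for (i, j) in edge_set if (j, i) in edge_set)
    return reciprocated / len(edge_set)


# Toy directed graph: 0 <-> 1 is reciprocated, 1 -> 2 and 2 -> 0 are not.
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 0)]
print(edge_reciprocity(toy_edges))  # 0.5
```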

In [10]:
actual_reciprocity = 0.638

Once the data is loaded, we can pass it to the `CRep` algorithm to obtain estimates of the latent
variables describing the network. To do so, we need to set the configuration file for the `CRep` algorithm.

In [11]:
# Set the algorithm to 'CRep'
algorithm = 'CRep'

# Define the path to the configuration file for the 'CRep' algorithm
config_path = 'setting_' + algorithm + '.yaml'

We load the configuration file using the data files in the `pgm` package instead of a relative path to it:

In [12]:
# Open the configuration file for the 'CRep' algorithm
with open_binary('pgm.data.model', config_path) as fp:
    conf = yaml.load(fp, Loader=yaml.Loader)

In [13]:
# Print the configuration file
print(yaml.dump(conf))

K: 3
assortative: true
constrained: true
end_file: _CRep
eta0: null
files: config/data/input/theta_gt111.npz
fix_eta: false
initialization: 0
out_folder: outputs/
out_inference: true
rseed: 0
undirected: false


The previous file shows the parameters actually needed to _run_ the model. These parameters cover the algorithm's basic needs.

Now, let's change the path to the output folder, so we can save the results into the same folder 
where the input data is located. 

In [14]:
# Set the output folder for the 'CRep' algorithm
conf['out_folder'] = synthetic_configuration['out_folder']

# Set the end file for the 'CRep' algorithm
conf['end_file'] = '_' + algorithm

## Running the Model
Finally, we are ready to run the _CRep_ model! This is a two-step process: first, we instantiate the `CRep` class, which initializes the model. Then, we call the `fit` method, which runs the algorithm.

In [15]:
# Import the 'CRep' class from the 'pgm.model.crep' module
from pgm.model.crep import CRep
import numpy as np

# Import the 'time' module
import time

# Get the list of nodes from the first graph in the list 'A'
nodes = A[0].nodes()

# Create an instance of the 'CRep' class
model = CRep()

# Print all the attributes of the 'CRep' instance
# The '__dict__' attribute of an object is a dictionary containing 
# the object's attributes.
print(model.__dict__)

{'inf': 10000000000.0, 'err_max': 1e-12, 'err': 0.1, 'num_realizations': 5, 'convergence_tol': 0.0001, 'decision': 10, 'max_iter': 1000, 'flag_conv': 'log'}
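
The attributes `convergence_tol`, `decision`, and `max_iter` suggest a stopping rule based on the (pseudo) log-likelihood. Below is a minimal sketch of one such rule, assuming the run stops once the log-likelihood changes by less than `convergence_tol` for more than `decision` consecutive checks, or when `max_iter` is reached. It illustrates the general idea only and is not the package's actual implementation; `run_until_converged` and `loglik_at` are hypothetical names.

```python
def run_until_converged(loglik_at, convergence_tol=1e-4, decision=10, max_iter=1000):
    """Iterate until the log-likelihood stabilises.

    `loglik_at(it)` is a hypothetical callable returning the log-likelihood
    after iteration `it`. Returns (iterations_used, converged).
    """
    previous = float("-inf")
    stable_checks = 0
    for it in range(1, max_iter + 1):
        current = loglik_at(it)
        if abs(current - previous) < convergence_tol:
            stable_checks += 1
        else:
            stable_checks = 0  # any large change resets the counter
        if stable_checks > decision:
            return it, True
        previous = current
    return max_iter, False


# Toy log-likelihood that approaches -100 geometrically.
iterations, converged = run_until_converged(lambda it: -100.0 - 0.5 ** it)
print(iterations, converged)
```

Requiring several consecutive stable checks, rather than a single one, guards against stopping on a temporary plateau.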


Model created! Now, we can run the model using the `fit` method. As mentioned before, this method takes as input the data and the configuration parameters.

Before running the model, we set the logging level to `INFO` to print relevant information about the model's progress.

In [18]:
# Import the logging module
import logging

# Get the root logger and set its level to INFO.
logging.getLogger().setLevel(logging.INFO)
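
The level acts as a threshold: records below it are dropped before they reach any handler. A minimal sketch using only the standard `logging` module, which captures records in memory to show the effect (the logger name `demo` is illustrative):

```python
import io
import logging

# Use a dedicated logger and an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

demo_logger = logging.getLogger("demo")
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the records out of the root logger

demo_logger.setLevel(logging.WARNING)
demo_logger.info("hidden: below the WARNING threshold")

demo_logger.setLevel(logging.INFO)
demo_logger.info("shown: INFO records now pass")

captured = stream.getvalue()
print(captured)
```

Only the second message ends up in `captured`, because the first was emitted while the threshold was still `WARNING`.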

In [17]:
# Print a message indicating the start of the 'CRep' algorithm
print(f'\n### Run {algorithm} ###')

# Get the current time
time_start = time.time()

# Run the 'CRep' model
inferred_parameters = model.fit(data=B,
              data_T=B_T,
              data_T_vals=data_T_vals,
              nodes=nodes,
              **conf)

# Print the time elapsed since the start of the 'CRep' algorithm
print(f'\nTime elapsed: {np.round(time.time() - time_start, 2)} seconds.')

INFO:root:eta is initialized randomly.
INFO:root:u, v and w are initialized randomly.
INFO:root:Updating realization 0 ...



### Run CRep ###


INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15341.547528742449 - iterations = 100 - time = 0.41 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15333.775232560947 - iterations = 200 - time = 0.91 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15331.614264064836 - iterations = 300 - time = 1.64 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15330.479566359936 - iterations = 400 - time = 2.31 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15330.064278151229 - iterations = 500 - time = 2.87 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15329.23669960981 - iterations = 600 - time = 3.41 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15328.947482113104 - iterations = 700 - time = 3.88 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15328.945289296786 - iterations = 800 - time = 4.33 seconds
INFO:root:Nreal = 0 - Pseudo Log-likelihood = -15328.94476636945 - iterations = 900 - time = 4.76 seconds
INFO:root:Nreal = 0 - Pseudo Log-likeli


Time elapsed: 16.82 seconds.


Done! The model has been run and the results are stored in the variable `inferred_parameters`. We can unpack the latent variables from it to take a closer look at them.

## Analyzing the results
Next, we will examine the results produced by the model. Rather than loading them from the saved file `CRep/theta_CRep.npz`, we unpack them directly from `inferred_parameters`.

In [17]:
len(inferred_parameters)

5

In [18]:
u_inf, v_inf, w_inf, eta_inf, _ = inferred_parameters

Since we know the ground truth configuration of the network, we can compare the inferred 
parameters with the true ones. We will do this by comparing the inferred reciprocity coefficient with the true one.


In [19]:
# Print the reciprocity coefficient requested in the configuration
print(f'Desired reciprocity coefficient: {synthetic_configuration["eta"]}')
# Print the actual reciprocity coefficient
print(f'Actual reciprocity coefficient: {actual_reciprocity}')
# Print the inferred reciprocity coefficient
print(f'Inferred reciprocity coefficient: {eta_inf}')


Desired reciprocity coefficient: 0.8
Actual reciprocity coefficient: 0.638
Inferred reciprocity coefficient: 0.632408070678144


As we can see, the inferred reciprocity coefficient (about 0.632) is very close to the actual one (0.638), although both are lower than the desired value of 0.8. This means that the model was able to capture the reciprocity of the network.
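
The comparison can be made quantitative with relative differences. A minimal sketch using the three values printed above (the variable names are illustrative):

```python
# Values copied from the output above.
desired_eta = 0.8
actual_reciprocity = 0.638
inferred_eta = 0.632408070678144

# Relative difference between the inferred value and each reference value.
rel_to_actual = abs(inferred_eta - actual_reciprocity) / actual_reciprocity
rel_to_desired = abs(inferred_eta - desired_eta) / desired_eta

print(f"Relative difference to actual:  {rel_to_actual:.1%}")
print(f"Relative difference to desired: {rel_to_desired:.1%}")
```

The inferred value is within 1% of the actual reciprocity, while it sits roughly 21% below the desired value.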

## Summary

In this tutorial, we have shown how to generate a network that follows the probabilistic rules underlying the `CRep` algorithm. We have also shown how to load the network and infer its latent variables using the `pgm` package. Since the data was generated according to `CRep`, we used the same model as our main tool for inference. The inferred reciprocity coefficient is very close to the reciprocity measured on the generated network, which means the model successfully recovered the network's reciprocity.

## References
[1] Safdari H., Contisciani M. & De Bacco C. (2021). Generative model for reciprocity and 
community detection in networks, _Phys. Rev. Research_ 3, 023209.