# *WoMG*: tutorial

This is a tutorial for *WoMG*. The WoMG software generates synthetic datasets of document cascades on a network. It takes any (un)directed, (un)weighted graph and a collection of documents as input, and it outputs the propagation DAGs of the documents through the network. The diffusion process is guided by the nodes' underlying preferences.

### TOC:
* [Demo](#Demo)
* [Help](#Help)
* [Usage](#Usage)
* [Analysis](#Analysis)
* [Statistics](#Statistics)

## Demo

The following code will run WoMG with the default parameters:

In [None]:
!python3 ../src/main.py

Check the outputs:

In [None]:
!cd ../Output/ ; ls ; #cat Diffusion_formatted_output_sim0.txt



## Help

Let's check the parameters on the help page:

In [5]:
!python3 ../src/main.py --help

usage: main.py [-h] [-v] [--topics [K]] [--docs [D]] [--steps [T]]
               [--homophily [H]] [--actives [A]] [--virality [V]]
               [--graph [GRAPH]] [--weighted] [--unweighted] [--directed]
               [--undirected] [--docs-folder [DOCS]] [--output OUTPUT]
               [--format FORMAT] [--seed [SEED]] [--dimensions d]
               [--walk-length w] [--num-walks nw] [--window-size ws]
               [--iter ITER] [--workers WORKERS] [--p P] [--q Q]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --topics [K]          Number of topics in the topic model. Default 15. K<d
  --docs [D]            Number of docs to be generated. Default 100
  --steps [T]           Number of time steps for diffusion
  --homophily [H]       0<=H<=1 :degree of homophily decoded from the given
                        network. 1-H is degree of influence between nodes;
           

The first set of quantitative parameters is:

1. The number of topics to be considered in the topic distributions of the documents and of the nodes' interests. It has to be less than the number of dimensions of the node space provided by node2vec.
2. The number of documents to be generated by LDA. Giving this parameter sets LDA directly to generative mode.
3. The number of steps of the diffusion simulation.
4. H, the degree of homophily. node2vec is used as the baseline for generating the nodes' interest vectors from the given graph. Its parameters *p* and *q* yield different decoded degrees of homophily and structural equivalence (see paper). The best mix of the two can only be found through a deep analysis of the network and a grid search on the parameters. To stay general with respect to the input graph, we use three degrees of mixing: structural equivalence predominant, DeepWalk (p=1, q=1), and homophily predominant (these are not the best settings for representing the graph!). 1-H is the degree of social influence between nodes, that is, the percentage of the average norm of the interest vectors assigned to the influence vectors.
5. The percentage of active nodes, with respect to the total number of nodes, in the initial configuration (before diffusion) for each doc.
6. The virality of the doc. If virality is high, the exponent of the power law is high and the threshold for activation is low.


The next parameters concern the input graph, the input documents, and the original node2vec parameters.
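As an illustration of how the first set of parameters maps onto the command line, here is a minimal sketch that assembles a WoMG invocation from Python. The helper `build_womg_command` is hypothetical (it is not part of WoMG); it only uses flags that appear in the help output above and checks the two constraints stated there (`0<=H<=1` and `K<d`).

```python
def build_womg_command(topics=None, docs=None, steps=None, homophily=None,
                       actives=None, virality=None, dimensions=None,
                       script="../src/main.py"):
    """Return the argument list for a WoMG run.

    Hypothetical helper for illustration; not part of WoMG itself.
    Only the flags that are explicitly given are added to the command.
    """
    # Constraint from the help page: 0 <= H <= 1
    if homophily is not None and not 0 <= homophily <= 1:
        raise ValueError("homophily H must satisfy 0 <= H <= 1")
    # Constraint from the help page: K < d (checked only when both are given)
    if topics is not None and dimensions is not None and topics >= dimensions:
        raise ValueError("the number of topics K must be less than d")

    cmd = ["python3", script]
    flags = [("--topics", topics), ("--docs", docs), ("--steps", steps),
             ("--homophily", homophily), ("--actives", actives),
             ("--virality", virality), ("--dimensions", dimensions)]
    for flag, value in flags:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_womg_command(docs=3553, steps=100, actives=0.065, virality=0.009)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the simulation from a script instead of a notebook cell.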

## Usage

The following code produces a synthetic propagation dataset on the [Digg network dataset](https://www.isi.edu/~lerman/downloads/digg2009.html). This dataset consists of a graph dataset and a diffusion dataset. We used the former as input to WoMG to generate diffusions and analyse the results.

We set:

1. the number of steps equal to 100
2. the maximum percentage of active nodes per doc equal to 0.065
3. number of generated docs equal to 3553
4. virality exponent of the docs distribution equal to 0.009

In [8]:
#!python3 ../src/main.py --graph ../data/graph/digg/digg_edgelist.txt --directed --steps 100 --actives 0.065 --docs 3553  --virality 0.009

##### Output

The actions generated with the Digg network as input can be analysed by setting `simulation_index = "_tutorial"`.

The real dataset analysis provides the following results:

    items actions [max, min, avg]:   6265 105 505

    users actions [max, min, avg]:   3415 20 115

## Analysis

In [15]:
import ast
import pathlib
import pickle

import matplotlib.pyplot as plt
import numpy as np

In [18]:
simulation_index = "_tutorial"
output_path = pathlib.Path.cwd() / "Output"

In [19]:
file_info = output_path / f"Network_info_sim{simulation_index}.txt"
file_prop = output_path / f"Diffusion_formatted_output_sim{simulation_index}.txt"

###### import functions

In [20]:
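def extract(file_in):
    '''
    Returns the content of the file at the given input path:
    a list of lines for .txt, the unpickled object for .pickle
    '''
    suffix = pathlib.Path(file_in).suffix
    if suffix == '.txt':
        with open(file_in, 'r') as f:
            s = f.readlines()
    elif suffix == '.pickle':
        with open(file_in, 'rb') as f:
            s = pickle.load(f)
    else:
        # fail early instead of returning an undefined variable
        raise ValueError(f"unsupported file type: {suffix!r}")
    return s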

In [21]:
def to_dict(inp, typ=False):
    '''
    if typ:
        Returns info input (inp) in dict format
    if not typ:
        Returns cascades in a dict format:
        (outer)first key: time, (inner)second key: item, 
        value: new active nodes
    '''
    
    if typ:
        info_dict = ast.literal_eval(str(inp).replace('[','').replace(']','').replace('"',''))
        out_dict = info_dict
        
    else:   
        prop_dict = {}
        # cascade records sit on every other line, starting from line index 2
        for index, i in enumerate(range(2, len(inp), 2)):
            # map 'set()' to None so that ast.literal_eval can parse the record
            line = inp[i].replace('\n', '').replace('set()', 'None')
            prop_dict[index] = ast.literal_eval(line)

        out_dict = prop_dict
           
    return out_dict

In [22]:
info_str = extract(file_info)
info = to_dict(info_str, typ=True)
prop = extract(file_prop)
cascades = to_dict(prop, typ=False)
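To make the cascade format concrete, the sketch below builds a toy `cascades`-like dict from a few invented records and computes the size of each item's cascade. The sample lines are an assumption about the record shape inferred from `to_dict` (one dict per time step, mapping each item to the set of newly active nodes); they are not actual WoMG output.

```python
import ast

# Invented records, one per time step: {item: set of newly activated nodes},
# with 'set()' when no node activates on that item at that step.
sample_lines = [
    "{0: {1, 2}, 1: {3}}\n",
    "{0: {4}, 1: set()}\n",
    "{0: set(), 1: {1, 5}}\n",
]

toy_cascades = {}
for step, line in enumerate(sample_lines):
    # literal_eval accepts 'set()' only from Python 3.9 on, so map it to None
    cleaned = line.strip().replace("set()", "None")
    toy_cascades[step] = ast.literal_eval(cleaned)

# Cascade size per item: activations summed over all time steps
sizes = {}
for step, items in toy_cascades.items():
    for item, nodes in items.items():
        sizes[item] = sizes.get(item, 0) + (len(nodes) if nodes is not None else 0)

print(toy_cascades[1])
print(sizes)
```

The outer key is the time step and the inner key is the item, which is the same layout the analysis functions below iterate over.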

##### analysis functions

In [23]:
# items actions
def items_actions(cascades, plot=False):
    '''
    Returns the vector of all items' actions:
    each entry is the number of activations (actions)
    for the item identified by the entry index
    '''
    numb_docs = max(cascades[0].keys())
    items_action_vec = [0 for i in range(numb_docs+1)]
    for step in cascades.keys():
        for item in cascades[step].keys():
            # None marks a step with no new activations for this item
            if cascades[step][item] is not None:
                items_action_vec[item] += len(cascades[step][item])
    if plot:
        # requires matplotlib.pyplot imported as plt
        plt.hist(items_action_vec)
        plt.show()
    
    print('items actions [max, min, avg]: ', int(max(items_action_vec)), int(min(items_action_vec)), int(np.mean(items_action_vec)))
    return items_action_vec

In [24]:
items_data = items_actions(cascades)

items actions [max, min, avg]:  7559 0 505


In [25]:
# users actions
def users_actions(cascades):
    '''
    Returns the vector of all users' actions:
    each entry is the number of activations (actions) of one user,
    with users ordered by sorted node id
    '''
    # `info` is the network-info dict loaded in the cells above
    numb_nodes = int(info['numb_nodes'])

    # counting the activations of every node that acted at least once
    users_actions_dict = {}
    for step in cascades.keys():
        for item in cascades[step].keys():
            if cascades[step][item] is not None:
                for node in cascades[step][item]:
                    users_actions_dict[node] = users_actions_dict.get(node, 0) + 1

    # entries beyond the number of acting nodes stay at 0
    users_actions_vec = [0 for i in range(numb_nodes)]
    for key, index in zip(sorted(users_actions_dict.keys()), range(numb_nodes)):
        users_actions_vec[index] = users_actions_dict[key]
    
    print('users actions [max, min, avg]: ',int(max(users_actions_vec)),
          int(min(users_actions_vec)), int(np.mean(users_actions_vec)))
    return users_actions_vec

In [26]:
users_data = users_actions(cascades)

users actions [max, min, avg]:  157 73 113
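One quick way to read the synthetic numbers against the real Digg ones is to compute the relative gap for each statistic. The sketch below is only an illustrative comparison: the triples are copied from the outputs reported in this tutorial, and `relative_gap` is a hypothetical helper, not part of WoMG.

```python
# [max, min, avg] triples copied from the outputs reported in this tutorial
real = {"items": (6265, 105, 505), "users": (3415, 20, 115)}
synthetic = {"items": (7559, 0, 505), "users": (157, 73, 113)}


def relative_gap(synth, ref):
    """Relative difference of a synthetic value with respect to the real one."""
    return (synth - ref) / ref


for kind in ("items", "users"):
    for name, s, r in zip(("max", "min", "avg"), synthetic[kind], real[kind]):
        print(f"{kind} {name}: real={r} synthetic={s} gap={relative_gap(s, r):+.1%}")
```

With these numbers the averages are close for both items and users, while the extremes differ much more, especially the maximum number of actions per user.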


## Statistics