# Extracting paths from temporal networks

[Run notebook in Google Colab](https://colab.research.google.com/github/pathpy/pathpy/blob/master/doc/tutorial/path_extraction.ipynb)  
[Download notebook](https://github.com/pathpy/pathpy/raw/master/doc/tutorial/path_extraction.ipynb)

This short tutorial demonstrates (and explains) how to calculate time-respecting path frequencies in a temporal network.

In [None]:
pip install git+https://github.com/pathpy/pathpy.git

In [1]:
import pathpy as pp
import io
import numpy as np

We first generate a maximally simple temporal network with three (instantaneous) time-stamped edges. Each time-stamped edge is automatically assigned a duration that depends on the user configuration, namely the value of `duration_value` in the `[temporal]` section of `config.cfg` in the `pathpy` package folder. To change the default value of 1, set:

```
pp.config['temporal']['duration_value'] = 2
```

In [13]:
pc = pp.PathCollection(multipaths=False)

p1 = pp.Path('a', 'x', 'c', uid='a-x-c')
p2 = pp.Path('b', 'x', 'd', uid='b-x-d')

pc.add(p1)
pc.add(p2)
pc.add(p2)
print(pc.counter)

PathPyCounter({'b-x-d': 2, 'a-x-c': 1})


In [14]:
pc.nodes

{'a': Empty a, 'x': Empty x, 'c': Empty c, 'b': Empty b, 'd': Empty d}

In [15]:
p3 = pp.Path('b', 'x', 'd', uid='b-x-d-2')
pc.add(p3)
print(pc.counter)

PathPyCounter({'b-x-d': 3, 'a-x-c': 1})


In [21]:
n = pp.Network(multiedges=False)
n.add_edge('a', 'b')
n.add_edge('a', 'b')
print(n.edges)
print(n.edges.counter)

{Edge ('a', 'b')}
PathPyCounter({'0x209fb759a90': 2})


In [22]:
n = pp.Network(multiedges=False)
n.add_edge('a', 'b', uid='x')
n.add_edge('a', 'b', uid='y')
print(n.edges)
print(n.edges.counter)

{Edge x}
PathPyCounter({'x': 2})


In [26]:
tn = pp.TemporalNetwork()
tn.add_edge('a', 'b', timestamp=1)
tn.add_edge('b', 'c', timestamp=2)
tn.add_edge('b', 'd', timestamp=5)
print(tn)
tn.plot()

Uid:			0x1ca22890fd0
Type:			TemporalNetwork
Directed:		True
Multi-Edges:		False
Number of unique nodes:	4
Number of unique edges:	3
Number of temp nodes:	4
Number of temp edges:	3
Observation periode:	1 - 6.0
Observation time:	5.0


As a first step, we can turn this temporal network into a time-unfolded directed acyclic graph. For this, we have to specify the maximum time difference `delta` between any two time-stamped edges that shall constitute a time-respecting or causal path. In addition to occurring within the maximum time difference, time-stamped edges also have to occur in the correct temporal order.

In the resulting time-unfolded directed acyclic graph, each time-unfolded node `v_t` represents a node `v` at a given time stamp `t`. Each edge (`v_t`, `w_{t'}`) between such time-unfolded nodes represents a possible causal influence (i.e. a time-respecting or causal path) by which node `v` at time `t` can influence node `w` at `t'>t`.

By definition, each time-stamped edge (`v`, `w`, `t`) is a causal path of length one by which node `v` at time `t` can influence node `w` at the next time stamp `t+1` (i.e. we assume that it takes one unit of time for influence to traverse an edge). For a maximum time difference of one between two edges, the only causal path of length two connects node `a` (at time 1) via node `b` (at time 2) to node `c` (at time 3). We can see this in the resulting time-unfolded directed acyclic graph:

In [6]:
dag = pp.DirectedAcyclicGraph.from_temporal_network(tn, delta=1)
dag.plot()

If we increase the maximum time difference to `delta=2`, three additional time-respecting paths of length one emerge (one from `b` at time 5 to `d` at time 7, one from `a` at time 1 to `b` at time 3, and one from `b` at time 2 to `c` at time 4). This further implies one additional time-respecting path of length two, which is represented in the DAG below:

In [7]:
dag = pp.DirectedAcyclicGraph.from_temporal_network(tn, delta=2)
dag.plot()
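The unfolding rule used in the two examples above can be sketched in plain Python, without pathpy. This is only an illustration under a stated assumption: a time-stamped edge (`v`, `w`, `t`) is taken to connect `v_t` to `w_{t+k}` for every `k = 1, ..., delta`, which is consistent with the node labels that appear in this tutorial. The function name `unfold` is hypothetical and not part of the pathpy API.

```python
def unfold(edges, delta):
    """Return the edges of a time-unfolded DAG as (source, target) label pairs.

    Assumption of this sketch: the edge (v, w, t) links v_t to w_{t+k}
    for every k = 1, ..., delta (finite integer delta only).
    """
    dag_edges = []
    for v, w, t in edges:
        for k in range(1, delta + 1):
            dag_edges.append((f"{v}_{t}", f"{w}_{t + k}"))
    return dag_edges


# The three time-stamped edges of the example network
edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 5)]

dag_1 = unfold(edges, delta=1)
dag_2 = unfold(edges, delta=2)

print(dag_1)
# Edges that only exist for delta=2
print(sorted(set(dag_2) - set(dag_1)))
```

For `delta=2` the extra edges are `a_1 -> b_3`, `b_2 -> c_4` and `b_5 -> d_7`, i.e. the three additional paths of length one mentioned above.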

If we set the delta to a maximum value of `infinity`, only the time-ordering of time-stamped edges is considered, i.e. any time gap between edges is allowed. In the example above, this implies that the state of node `a` at time `t=1` can influence any other node at any later time. In the directed acyclic graph this is represented as:

In [8]:
dag = pp.DirectedAcyclicGraph.from_temporal_network(tn, delta=np.inf)
dag.plot()

Thanks to its acyclicity, a directed acyclic graph can be used to calculate a finite set of all paths from any root node (the potential start node/time of a causal path in a temporal network) to any leaf node (the potential end node/time of a causal path) in the DAG. We can use the `roots` and `leafs` properties of the `DirectedAcyclicGraph` class to return those:

In [9]:
print([v.uid for v in dag.roots])
print([v.uid for v in dag.leafs])

['a_1']
['c_6', 'b_3', 'c_4', 'b_4', 'c_5', 'd_6', 'b_6', 'c_3']
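What roots and leaves mean can be checked with a few lines of plain Python. The edge list below is written out by hand from the `delta=np.inf` example (an assumption of this sketch; it is not read from the `dag` object), and the set operations are an illustration rather than pathpy's implementation.

```python
# Edges of the delta=inf DAG, written out by hand for this sketch
dag_edges = (
    [("a_1", f"b_{t}") for t in range(2, 7)]
    + [("b_2", f"c_{t}") for t in range(3, 7)]
    + [("b_5", "d_6")]
)

sources = {v for v, _ in dag_edges}
targets = {w for _, w in dag_edges}

roots = sorted(sources - targets)   # nodes without incoming edges
leaves = sorted(targets - sources)  # nodes without outgoing edges

print(roots)
print(leaves)
```

Up to ordering, this yields the same single root `a_1` and the same eight leaves as the output above.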


We can calculate all possible paths from a given root node to any leaf node in the DAG as follows. This returns a `PathCollection` object which contains all `Path` instances as well as a `Counter` that contains the counts of all paths.

In [10]:
paths = dag.routes_from('a_1')
print(paths)
print(paths.counter)

{Path a_1-b_2-c_6, Path a_1-b_2-c_5, Path a_1-b_2-c_4, Path a_1-b_2-c_3, Path a_1-b_3, Path a_1-b_4, Path a_1-b_5-d_6, Path a_1-b_6}
PathPyCounter({'a_1-b_3': 1, 'a_1-b_4': 1, 'a_1-b_6': 1, 'a_1-b_5-d_6': 1, 'a_1-b_2-c_5': 1, 'a_1-b_2-c_4': 1, 'a_1-b_2-c_6': 1, 'a_1-b_2-c_3': 1})
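Because the graph is acyclic, all root-to-leaf paths can be enumerated with a simple depth-first search. The following is a minimal sketch on a hand-written edge list for the `delta=np.inf` example; the helper `paths_from` is hypothetical and only mimics what `routes_from` returns here.

```python
from collections import defaultdict


def paths_from(root, dag_edges):
    """Enumerate all paths from `root` to any leaf of an acyclic edge list."""
    successors = defaultdict(list)
    for v, w in dag_edges:
        successors[v].append(w)

    found = []
    stack = [(root,)]
    while stack:
        path = stack.pop()
        next_nodes = successors.get(path[-1], [])
        if not next_nodes:
            # The last node has no outgoing edge, so the path ends in a leaf
            found.append(path)
        for w in next_nodes:
            stack.append(path + (w,))
    return found


# Edges of the delta=inf DAG, written out by hand for this sketch
dag_edges = (
    [("a_1", f"b_{t}") for t in range(2, 7)]
    + [("b_2", f"c_{t}") for t in range(3, 7)]
    + [("b_5", "d_6")]
)

routes = paths_from("a_1", dag_edges)
print(sorted("-".join(p) for p in routes))
```

The search terminates because a DAG contains no cycles, so every partial path must eventually reach a leaf.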


To calculate the statistics of all paths from all roots, we can call the following function. In the example above, this is identical to the paths from the root `a_1` since there is a single root node only in this DAG.

In [11]:
paths = pp.algorithms.path_extraction.all_paths_from_dag(dag)
print(paths)
print(paths.counter)

{Path a_1-b_2-c_3, Path a_1-b_3, Path a_1-b_2-c_6, Path a_1-b_4, Path a_1-b_2-c_5, Path a_1-b_2-c_4, Path a_1-b_5-d_6, Path a_1-b_6}
PathPyCounter({'a_1-b_3': 1, 'a_1-b_4': 1, 'a_1-b_6': 1, 'a_1-b_5-d_6': 1, 'a_1-b_2-c_5': 1, 'a_1-b_2-c_4': 1, 'a_1-b_2-c_6': 1, 'a_1-b_2-c_3': 1})


The problem with this is that we are actually not interested in paths in the time-unfolded directed acyclic graph, but in causal, i.e. time-respecting, paths in the original network. In our example, different nodes in the directed acyclic graph actually correspond to the same node in the network at different times. For instance, nodes `b_4` and `b_6` represent the same node `b` at times `4` and `6`.

For the calculation of paths in the original network, we must incorporate this information. pathpy supports this via a custom node mapping, which we can pass to the path calculation. Moreover, each node in the directed acyclic graph generated by the `from_temporal_network` function has a node attribute `original` that contains the original node in the temporal network. To map DAG nodes to such nodes we can thus write:

In [12]:
paths = pp.algorithms.path_extraction.all_paths_from_dag(dag, node_mapping={ v.uid: v['original'].uid for v in dag.nodes })
print(paths)
print(paths.counter)

{Path a-b-c, Path a-b-d, Path a-b}
PathPyCounter({'a-b': 1, 'a-b-d': 1, 'a-b-c': 1})


We have now mapped four different paths `a->b->c` to a single causal path, because the different paths in the original DAG all represent the same path from a single root node to the same (mapped) leaf node. The only difference is that transitions between nodes happen at different times.
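The effect of the mapping can be reproduced by hand. The sketch below takes the eight DAG path identifiers printed earlier, strips the `_<time>` suffix from each node label (an assumption about the label format used in this tutorial), and counts how many DAG paths collapse onto each mapped path. It is an illustration only and does not use pathpy's `node_mapping` mechanism.

```python
from collections import Counter

# The eight root-to-leaf paths of the delta=inf DAG, copied from the output above
dag_paths = [
    "a_1-b_2-c_3", "a_1-b_2-c_4", "a_1-b_2-c_5", "a_1-b_2-c_6",
    "a_1-b_3", "a_1-b_4", "a_1-b_6", "a_1-b_5-d_6",
]


def map_path(uid):
    """Replace every time-unfolded label such as 'b_2' by its original node 'b'."""
    return "-".join(node.rsplit("_", 1)[0] for node in uid.split("-"))


# Number of DAG paths that map onto each path in the original network
multiplicity = Counter(map_path(uid) for uid in dag_paths)
print(multiplicity)
# Counter({'a-b-c': 4, 'a-b': 3, 'a-b-d': 1})
```

The keys are exactly the three mapped paths reported above. The values show that four DAG paths collapse onto `a-b-c`, whereas pathpy reports each mapped path once.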

It might appear that this is all we need to calculate statistics of causal paths in a temporal network. However, the situation is more complicated if we additionally consider which of the shorter paths are already contained in the longer paths. Let us reconsider the DAG generated above:

In [13]:
dag.plot()

In the original temporal network, there are actually only three temporal edges `(a,b)`, `(b,c)` and `(b,d)` occurring in sequence. This leads to two different longest causal paths of length two, which contain the three shorter causal paths of length one (i.e. the edges).

The idea to focus on longest causal paths only is the basis for the calculation of path statistics using the following function:

In [14]:
paths = pp.algorithms.path_extraction.all_paths_from_temporal_network(tn, delta=np.inf)
print(paths)
print(paths.counter)

{Path a-b-d, Path a-b-c}
PathPyCounter({'a-b-d': 1, 'a-b-c': 1})
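The idea of keeping only longest paths can be illustrated with a small filter that drops every path occurring as a contiguous sub-path of a longer one. This is a sketch of the idea under that assumption, not a description of how pathpy computes the result; the helper names are hypothetical.

```python
def is_contained(short, long):
    """True if `short` occurs as a contiguous sub-sequence of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def longest_only(paths):
    """Keep only the paths that are not contained in a strictly longer path."""
    return [
        p for p in paths
        if not any(len(q) > len(p) and is_contained(p, q) for q in paths)
    ]


# The three mapped paths obtained from the DAG for delta=inf
paths = [("a", "b", "c"), ("a", "b", "d"), ("a", "b")]
print(longest_only(paths))
# [('a', 'b', 'c'), ('a', 'b', 'd')]
```

The path `a-b` is dropped because it is already contained in both longer paths, which leaves the same two paths as in the output above.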


In the example above, we only have a single root node, which is why the path statistics for the whole DAG are identical to the statistics returned for the paths originating in the single root node.

In our temporal network with `delta=2` we actually have two different roots, hence we have to consider causal paths starting in different nodes:

In [15]:
dag = pp.DirectedAcyclicGraph.from_temporal_network(tn, delta=2)
dag.plot()

In [16]:
paths = dag.routes_from('a_1')
print(paths)
print(paths.counter)

{Path a_1-b_2-c_3, Path a_1-b_2-c_4, Path a_1-b_3}
PathPyCounter({'a_1-b_3': 1, 'a_1-b_2-c_3': 1, 'a_1-b_2-c_4': 1})


In [17]:
paths = dag.routes_from('b_5')
print(paths)
print(paths.counter)

{Path b_5-d_6, Path b_5-d_7}
PathPyCounter({'b_5-d_7': 1, 'b_5-d_6': 1})


In [18]:
paths = pp.algorithms.path_extraction.all_paths_from_dag(dag)
print(paths)
print(paths.counter)

{Path a_1-b_3, Path a_1-b_2-c_4, Path b_5-d_7, Path a_1-b_2-c_3, Path b_5-d_6}
PathPyCounter({'a_1-b_3': 1, 'a_1-b_2-c_3': 1, 'a_1-b_2-c_4': 1, 'b_5-d_7': 1, 'b_5-d_6': 1})


In [19]:
paths = pp.algorithms.path_extraction.all_paths_from_temporal_network(tn, delta=2)
print(paths)
print(paths.counter)

{Path b-d, Path a-b-c}
PathPyCounter({'b-d': 1, 'a-b-c': 1})


## Calculating paths in undirected temporal networks

To understand how paths are calculated in temporal networks with undirected edges, we must consider that, while an undirected edge is an unordered tuple, the time dimension necessarily imposes an order in which nodes are visited, i.e. there is a direction in which an edge is traversed in time. This becomes clear in the following toy example. We first consider a temporal network with four directed temporal edges.

In [21]:
tn = pp.TemporalNetwork(directed=True)
tn.add_edge('a', 'b', timestamp=1)
tn.add_edge('b', 'c', timestamp=2)
tn.add_edge('a', 'c', timestamp=3)
tn.add_edge('b', 'a', timestamp=3)
tn.plot()

It is easy to see that this gives rise to the following three causal paths:

In [23]:
paths = pp.algorithms.path_extraction.all_paths_from_temporal_network(tn, delta=1)
print(paths)
print(paths.counter)

{Path b-a, Path a-c, Path a-b-c}
Counter({'b-a': 1, 'a-b-c': 1, 'a-c': 1})


If we consider the same temporal network with undirected edges, the situation is more complex. Now the undirected edges can be traversed in both directions, which generates additional causal paths.

In [24]:
tn = pp.TemporalNetwork(directed=False)
tn.add_edge('a', 'b', timestamp=1)
tn.add_edge('b', 'c', timestamp=2)
tn.add_edge('a', 'c', timestamp=3)
tn.add_edge('b', 'a', timestamp=3)
tn.plot()

In [25]:
paths = pp.algorithms.path_extraction.all_paths_from_temporal_network(tn, delta=1)
print(paths)
print(paths.counter)

{Path b-a, Path a-c, Path c-b-a, Path a-b-c-a, Path a-b}
Counter({'a-b-c-a': 1, 'c-b-a': 1, 'b-a': 1, 'a-c': 1, 'a-b': 1})


The reason for those additional paths becomes clear if we consider the time-unfolded DAG representation of the temporal network. The direction in which undirected edges are traversed is captured in the directionality of edges in the DAG, i.e. even an **undirected** temporal network turns into a **directed** acyclic graph if we unfold along time. This is due to the fundamental directedness of the time dimension.

In [26]:
dag = pp.DirectedAcyclicGraph.from_temporal_network(tn, delta=1)
dag.plot()
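The same point can be made without pathpy. In the sketch below, each undirected time-stamped edge is expanded into both traversal directions before unfolding, under the assumption that an edge at time `t` links a node at `t` to its neighbour at `t+k` for `k = 1, ..., delta`. The function `unfold_undirected` is hypothetical and only serves as an illustration.

```python
def unfold_undirected(edges, delta):
    """Unfold undirected time-stamped edges into directed DAG edges."""
    dag_edges = set()
    for v, w, t in edges:
        # An undirected edge can be traversed in either direction
        for src, dst in ((v, w), (w, v)):
            for k in range(1, delta + 1):
                dag_edges.add((f"{src}_{t}", f"{dst}_{t + k}"))
    return dag_edges


edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("b", "a", 3)]
dag_edges = unfold_undirected(edges, delta=1)
print(sorted(dag_edges))
```

The result contains the chains `a_1 -> b_2 -> c_3 -> a_4` and `c_2 -> b_3 -> a_4`, which correspond to the additional causal paths `a-b-c-a` and `c-b-a` reported above.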

## Exporting path statistics to state files

In [2]:
tn = pp.TemporalNetwork()
tn.add_edge('a', 'b', timestamp=1)
tn.add_edge('b', 'c', timestamp=2)
tn.add_edge('b', 'd', timestamp=5)
tn.add_edge('a', 'b', timestamp=12)
tn.add_edge('b', 'c', timestamp=13)
paths = pp.algorithms.path_extraction.all_paths_from_temporal_network(tn, delta=2)
print(paths)
print(paths.counter)

{Path b-d, Path a-b-c}
Counter({'a-b-c': 2, 'b-d': 1})


In [3]:
pp.io.infomap.to_state_file(paths, 'test.state')

In [4]:
with io.open('test.state', 'r') as f:
    print(f.read())

*Vertices 4
1 "a"
3 "c"
2 "b"
4 "d"
*States
1 2 "{}_b"
2 4 "{}_d"
3 1 "{}_a"
4 2 "{a}_b"
5 3 "{}_c"
*Links
1 2 1
3 4 2
4 5 2


In [20]:
n = pp.io.konect.read_konect_name('sociopatterns-hypertext')
print(n)

Uid:			0x1ca1dc36940
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		False
Number of unique nodes:	226
Number of unique edges:	2196
Number of temp nodes:	226
Number of temp edges:	20818
Observation periode:	1246255220 - 1246467561.0
Observation time:	212341.0

Network attributes
------------------
category:	HumanContact
code:	HY
name:	Hypertext 2009
description:	Visitor–visitor face-to-face contacts
extr:	sociopatterns
url:	http://www.sociopatterns.org/
long-description:	This is the network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. The ACM Conference on Hypertext and Hypermedia 2009 (HT 2009, http://www.ht2009.org/) was held in Turin, Italy over three days from June 29 to July 1, 2009. In the network, a node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 seconds. Multiple edges denote multiple contacts. Each edge is annotated with the time at which the contact took place.
enti

In [21]:
validation, train = pp.algorithms.evaluation.train_test_split(n, train_size = 0.8, split = "time")

# PaCo: Path Counting in Temporal Networks

We will look at two temporal networks:

In [None]:
tn1 = pp.TemporalNetwork(directed=True)
tn1.add_edge("a", "b", timestamp=1)  # 0
tn1.add_edge("a", "b", timestamp=2)  # 1
tn1.add_edge("b", "a", timestamp=3)  # 2
tn1.add_edge("b", "c", timestamp=3)  # 3
tn1.add_edge("d", "c", timestamp=3)  # 4
tn1.add_edge("d", "c", timestamp=4)  # 5
tn1.add_edge("c", "d", timestamp=5)  # 6
tn1.add_edge("c", "b", timestamp=6)  # 7
tn1.add_edge("b", "c", timestamp=7)  # 8
tn1.plot()

In [None]:
tn2 = pp.TemporalNetwork(directed=True)
tn2.add_edge("a", "b", timestamp=1)  # 0
tn2.add_edge("a", "c", timestamp=2)  # 1
tn2.add_edge("b", "c", timestamp=2)  # 2
tn2.add_edge("c", "d", timestamp=3)  # 3
tn2.add_edge("b", "d", timestamp=4)  # 4
tn2.add_edge("d", "c", timestamp=4)  # 5
tn2.add_edge("d", "c", timestamp=5)  # 6
tn2.add_edge("d", "a", timestamp=5)  # 7
tn2.add_edge("c", "b", timestamp=6)  # 8
tn2.plot()

For PaCo, we define paths in temporal networks to consist of temporal links which 
* can be continued topologically: the destination of a previous link is the same as the source of the next link, e.g. $(a,b,t_1)$ and $(b, e, t_2)$.
* can be continued temporally: the timestamps of successive links $t_i$, $t_{i+1}$ satisfy
$$t_i < t_{i+1}$$
$$t_{i+1} - t_i \leq \delta $$
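The two conditions above can be turned into a short brute-force counter. This is a sketch of the path definition only, not of the PaCo algorithm as implemented in pathpy; the function `count_paths` is hypothetical. For every link it collects all node sequences that end with that link by extending the sequences of compatible earlier links.

```python
from collections import Counter


def count_paths(links, delta):
    """Count all paths whose successive links satisfy 0 < t_next - t_prev <= delta."""
    links = sorted(links, key=lambda link: link[2])
    ending_at = []     # ending_at[i]: node sequences whose last link is links[i]
    total = Counter()
    for i, (src, dst, t) in enumerate(links):
        current = Counter({(src, dst): 1})
        for j in range(i):
            _, prev_dst, prev_t = links[j]
            # Topological and temporal continuation conditions
            if prev_dst == src and 0 < t - prev_t <= delta:
                for seq, n in ending_at[j].items():
                    current[seq + (dst,)] += n
        ending_at.append(current)
        total.update(current)
    return total


# Links of the first temporal network as (source, target, timestamp)
links_tn1 = [
    ("a", "b", 1), ("a", "b", 2), ("b", "a", 3), ("b", "c", 3), ("d", "c", 3),
    ("d", "c", 4), ("c", "d", 5), ("c", "b", 6), ("b", "c", 7),
]

counts = count_paths(links_tn1, delta=2)
print(counts[("a", "b", "c", "d")])  # 2
print(counts[("d", "c", "b", "c")])  # 1
```

Unlike the longest-path statistics computed earlier, this definition counts every path, including those contained in longer ones.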

For the first temporal network, and choices $\delta = 2$ and $\delta = 3$ the paths are as follows:

In [None]:
tn1_delta2= {
    1: {('a', 'b'): 2,
        ('b', 'a'): 1,
        ('b', 'c'): 2,
        ('c', 'b'): 1,
        ('c', 'd'): 1,
        ('d', 'c'): 2},
    2: {('a', 'b', 'a'): 2,
        ('a', 'b', 'c'): 2,
        ('b', 'c', 'd'): 1,
        ('c', 'b', 'c'): 1,
        ('d', 'c', 'b'): 1,
        ('d', 'c', 'd'): 2},
    3: {('a', 'b', 'c', 'd'): 2,
        ('d', 'c', 'b', 'c'): 1}
}

In [None]:
tn1_delta3 =  {
    1: {('a', 'b'): 2,
        ('b', 'a'): 1,
        ('b', 'c'): 2,
        ('c', 'b'): 1,
        ('c', 'd'): 1,
        ('d', 'c'): 2},
    2: {('a', 'b', 'a'): 2,
        ('a', 'b', 'c'): 2,
        ('b', 'c', 'b'): 1,
        ('b', 'c', 'd'): 1,
        ('c', 'b', 'c'): 1,
        ('d', 'c', 'b'): 2,
        ('d', 'c', 'd'): 2},
    3: {('a', 'b', 'c', 'b'): 2,
        ('a', 'b', 'c', 'd'): 2,
        ('b', 'c', 'b', 'c'): 1,
        ('d', 'c', 'b', 'c'): 2},
    4: {('a', 'b', 'c', 'b', 'c'): 2}
}

For the second temporal network and choices $\delta = 1$ and $\delta = 2$ the paths are as follows:

In [None]:
tn2_delta1 = {
    1: {('a', 'b'): 1,
        ('a', 'c'): 1,
        ('b', 'c'): 1,
        ('b', 'd'): 1,
        ('c', 'b'): 1,
        ('c', 'd'): 1,
        ('d', 'a'): 1,
        ('d', 'c'): 2},
    2: {('a', 'b', 'c'): 1,
        ('a', 'c', 'd'): 1,
        ('b', 'c', 'd'): 1,
        ('b', 'd', 'a'): 1,
        ('b', 'd', 'c'): 1,
        ('c', 'd', 'c'): 1,
        ('d', 'c', 'b'): 1},
    3: {('a', 'b', 'c', 'd'): 1,
        ('a', 'c', 'd', 'c'): 1,
        ('b', 'c', 'd', 'c'): 1,
        ('b', 'd', 'c', 'b'): 1},
    4: {('a', 'b', 'c', 'd', 'c'): 1}
}

In [None]:
tn2_delta2 = {
    1: {('a', 'b'): 1,
        ('a', 'c'): 1,
        ('b', 'c'): 1,
        ('b', 'd'): 1,
        ('c', 'b'): 1,
        ('c', 'd'): 1,
        ('d', 'a'): 1,
        ('d', 'c'): 2},
    2: {('a', 'b', 'c'): 1,
        ('a', 'c', 'd'): 1,
        ('b', 'c', 'd'): 1,
        ('c', 'd', 'c'): 2,
        ('b', 'd', 'c'): 1,
        ('c', 'd', 'a'): 1,
        ('b', 'd', 'a'): 1,
        ('d', 'c', 'b'): 2},
    3: {('a', 'b', 'c', 'd'): 1,
        ('a', 'c', 'd', 'c'): 2,
        ('b', 'c', 'd', 'c'): 2,
        ('b', 'd', 'c', 'b'): 1,
        ('a', 'c', 'd', 'a'): 1,
        ('b', 'c', 'd', 'a'): 1,
        ('c', 'd', 'c', 'b'): 2},
    4: {('a', 'b', 'c', 'd', 'a'): 1,
        ('a', 'c', 'd', 'c', 'b'): 2,
        ('b', 'c', 'd', 'c', 'b'): 2,
        ('a', 'b', 'c', 'd', 'c'): 2},
    5: {('a', 'b', 'c', 'd', 'c', 'b'): 2}
}

PaCo finds these paths, and outputs them in a `PathCollection`.

In [None]:
def test_PaCo():
    """
    Test the PaCo algorithm against the hand-computed path counts above.
    """
    cases = [(tn1, 2, tn1_delta2), (tn1, 3, tn1_delta3),
             (tn2, 1, tn2_delta1), (tn2, 2, tn2_delta2)]
    for tn, delta, solution in cases:
        PaCo_paths = pp.algorithms.path_extraction.PaCo(tn, delta, skip_first=0, up_to_k=10)
        for l in solution:
            for path in solution[l]:
                expected = solution[l][path]
                computed = PaCo_paths[path]["frequency"]
                assert computed == expected, f"Mismatch in counts for path {path}, correct counter is {expected}, PaCo computed {computed}."

                # Remove each verified path so that only unexpected paths remain
                PaCo_paths.remove(path)
        assert len(PaCo_paths) == 0, "PaCo found some non-existing paths."
    return True

In [None]:
test_PaCo()

Let's now try this on the actual sociopatterns data:

In [None]:
paths = pp.algorithms.path_extraction.PaCo(train, 300, skip_first=0, up_to_k=10)

In [None]:
pp.io.infomap.to_state_file(paths, 'test.state', max_memory=1)

In [None]:
with io.open('test.state', 'r') as f:
    print(f.read())

In [None]:
pp.io.infomap.to_state_file(paths, 'test.state', max_memory=2)

In [None]:
with io.open('test.state', 'r') as f:
    print(f.read())