# Cross-validation experiments with networks

[Run notebook in Google Colab](https://colab.research.google.com/github/pathpy/pathpy/blob/master/doc/tutorial/cross_validation.ipynb)  
[Download notebook](https://github.com/pathpy/pathpy/raw/master/doc/tutorial/cross_validation.ipynb)

`pathpy` provides basic support for cross-validation experiments. In particular, the `train_test_split` function creates train and test splits of a network. Its semantics and arguments are similar to those of the [corresponding function in `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

To demonstrate its use, we first install `pathpy` and then generate a random graph:

In [None]:
pip install git+https://github.com/pathpy/pathpy.git

In [1]:
import pathpy as pp

n = pp.generators.ER_np(100, 0.04)
print(n)
n.plot()

Uid:			0x22e7a4a1320
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	209


To generate train and test network instances, where the test network contains a random 25 % of the nodes, we can write:

In [2]:
train, test = pp.algorithms.evaluation.train_test_split(n, test_size = 0.25)
print(train)
print(test)

Uid:			0x22e7a4a1320_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	75
Number of edges:	123
Uid:			0x22e7a4a1320_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	25
Number of edges:	9


The function generates two new `Network` instances that refer to the same node and edge objects as the original network, so the new objects consume little additional memory. The original network instance is not changed. The uids of the new networks are the original uid with the suffixes `_train` and `_test`, respectively.

By default, the split is made on the nodes, and each of the two networks contains the edges among the nodes in its own node set. Edges whose endpoints fall into different sets are therefore lost: in the example above, only 123 + 9 = 132 of the original 209 edges remain. To preserve the number of edges, we can set the `split` argument to `'edge'`. This samples a random fraction of the edges and adds all nodes to both networks, so the node sets of the two networks are identical and the edge counts of the training and test networks sum to the number of edges in the original network.

In [3]:
train, test = pp.algorithms.evaluation.train_test_split(n, test_size = 0.25, split='edge')
print(train)
print(test)

Uid:			0x22e7a4a1320_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	157
Uid:			0x22e7a4a1320_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	52
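
The difference between the two split modes can be illustrated without `pathpy`. The following is a minimal sketch on a plain edge list, not the library's implementation: the helper names `split_by_nodes` and `split_by_edges` are hypothetical, and the rounding of the test-set size is an assumption.

```python
import random


def split_by_nodes(nodes, edges, test_size, rng):
    """Put a random fraction of nodes into the test set; keep only edges inside each set."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    n_test = int(test_size * len(shuffled))  # rounding rule is an assumption
    test_nodes, train_nodes = set(shuffled[:n_test]), set(shuffled[n_test:])
    train_edges = [(u, v) for u, v in edges if u in train_nodes and v in train_nodes]
    test_edges = [(u, v) for u, v in edges if u in test_nodes and v in test_nodes]
    return (train_nodes, train_edges), (test_nodes, test_edges)


def split_by_edges(nodes, edges, test_size, rng):
    """Put a random fraction of edges into the test set; both sides keep all nodes."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(test_size * len(shuffled))  # rounding rule is an assumption
    return (set(nodes), shuffled[n_test:]), (set(nodes), shuffled[:n_test])


rng = random.Random(0)
nodes = list(range(8))
edges = [(i, (i + 1) % 8) for i in range(8)]  # ring with 8 nodes and 8 edges

node_train, node_test = split_by_nodes(nodes, edges, 0.25, rng)
edge_train, edge_test = split_by_edges(nodes, edges, 0.25, rng)

print("node split, edges kept:", len(node_train[1]) + len(node_test[1]), "of", len(edges))
print("edge split, edges kept:", len(edge_train[1]) + len(edge_test[1]), "of", len(edges))
```

On this ring, the node-based split drops every edge that crosses the two node sets, whereas the edge-based split keeps all eight edges and gives both networks the full node set.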


We can alternatively set the size of the training set:

In [5]:
train, test = pp.algorithms.evaluation.train_test_split(n, train_size = 0.25, split='edge')
print(train)
print(test)

Uid:			0x22e7a4a1320_train
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	53
Uid:			0x22e7a4a1320_test
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	100
Number of edges:	156


Apart from static networks, we can also create cross-validation sets for temporal networks. For this, we first load a temporal network from the KONECT database:

In [6]:
tn = pp.io.konect.read_konect_name('sociopatterns-hypertext')
print(tn)
tn.plot()

Uid:			0x22e5aeb9748
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		False
Number of unique nodes:	113
Number of unique edges:	2196
Number of temp nodes:	113
Number of temp edges:	2196
Observation periode:	1246255220 - 1246467081.0

Network attributes
------------------
category:	HumanContact
code:	HY
name:	Hypertext 2009
description:	Visitor–visitor face-to-face contacts
extr:	sociopatterns
url:	http://www.sociopatterns.org/
long-description:	This is the network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. The ACM Conference on Hypertext and Hypermedia 2009 (HT 2009, http://www.ht2009.org/) was held in Turin, Italy over three days from June 29 to July 1, 2009. In the network, a node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 seconds. Multiple edges denote multiple contacts. Each edge is annotated with the time at which the contact took place.
entity-names:	visitor
relationsh

TypeError: 'int' object is not callable

We can call the same function on a temporal network instance. By default, the split is based on the observed interactions: in the following example, the first 75 % of all time-stamped interactions are included in the training network and the last 25 % in the test network.

In [7]:
train, test = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25)
print(train)
print(test)

AttributeError: 'TemporalNetwork' object has no attribute 'tedges'

In [8]:
train.plot()

In [9]:
test.plot()
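
Since the cells above did not produce output in this run, the interaction-based split can be illustrated on a plain list of time-stamped events. This is a minimal sketch of the idea rather than `pathpy`'s implementation: the helper `split_by_interactions`, the toy data, and the rounding of the test-set size are all assumptions.

```python
def split_by_interactions(events, test_size):
    """Split time-stamped events (source, target, timestamp) by count, in time order."""
    ordered = sorted(events, key=lambda event: event[2])
    n_test = int(test_size * len(ordered))  # rounding rule is an assumption
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: a burst of early contacts followed by a few late ones.
events = [("a", "b", 0), ("b", "c", 1), ("a", "c", 2), ("c", "d", 3),
          ("a", "d", 10), ("b", "d", 20), ("a", "b", 30), ("c", "d", 100)]

train_events, test_events = split_by_interactions(events, test_size=0.25)
print(len(train_events), len(test_events))  # 6 2
print([t for _, _, t in test_events])       # [30, 100]
```

The split point depends only on the number of interactions, not on how much time elapses between them.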

We can also split based on the observed time, i.e. here we include all interactions occurring within the first 75 % of the observed time period in the training network, while the remaining interactions are included in the test network.

In [10]:
train, test = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25, split='time')
print(train)
print(test)

TypeError: 'int' object is not callable
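
A time-based split can be sketched in the same spirit on a plain list of events. The helper `split_by_time` and the toy data are hypothetical, and whether an event exactly at the cutoff belongs to the training or the test set is an assumption made here for illustration.

```python
def split_by_time(events, test_size):
    """Split time-stamped events (source, target, timestamp) at a fraction of the observed period."""
    timestamps = [t for _, _, t in events]
    start, end = min(timestamps), max(timestamps)
    cutoff = start + (1.0 - test_size) * (end - start)
    train = [e for e in events if e[2] <= cutoff]  # inclusive boundary is an assumption
    test = [e for e in events if e[2] > cutoff]
    return train, test, cutoff


# Hypothetical toy data: a burst of early contacts followed by a few late ones.
events = [("a", "b", 0), ("b", "c", 1), ("a", "c", 2), ("c", "d", 3),
          ("a", "d", 10), ("b", "d", 20), ("a", "b", 30), ("c", "d", 100)]

train_events, test_events, cutoff = split_by_time(events, test_size=0.25)
print(cutoff)                               # 75.0
print(len(train_events), len(test_events))  # 7 1
```

Because the events are bursty, the last 25 % of the observation period contains only one of the eight interactions, so the two networks can differ strongly in size when splitting by time.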

## Randomizing networks

To detect patterns in networks, it is often helpful to randomize their topology while preserving the nodes as well as certain aggregate statistics of the network. We can try this with the Karate club network:

In [11]:
n = pp.io.graphtool.read_netzschleuder_network('karate', '77')
n.plot()

We can use the `randomize` functions in the module `generators`, which are coupled to the corresponding random graph models. To generate a randomized version of a network in which the number of nodes $n$ and the number of edges $m$ are preserved, we can use the `randomize` function associated with the Erdős-Rényi $G(n,m)$ model. The result has the same number of nodes and edges as the original network, but a different, randomized topology:

In [12]:
r1 = pp.generators.ER_nm_randomize(n)
r1.plot()
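
What a $G(n,m)$ randomization preserves can be made concrete with a small stand-alone sketch. It assumes an undirected network without self-loops or multi-edges and draws $m$ distinct node pairs uniformly at random; the helper `randomize_nm` is hypothetical and not part of `pathpy`.

```python
import itertools
import random


def randomize_nm(nodes, edges, rng):
    """Keep the node set and the edge count; draw the edges uniformly from all node pairs."""
    candidates = list(itertools.combinations(sorted(nodes), 2))  # undirected, no self-loops
    return rng.sample(candidates, len(edges))                    # distinct pairs, so no multi-edges


rng = random.Random(42)
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]  # a path on five nodes

randomized = randomize_nm(nodes, edges, rng)
print(len(nodes), len(randomized))  # 5 4
print(randomized)
```

The node set and the number of edges are unchanged, while the specific node pairs that are connected are random.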

If we use the `randomize` function of the Erdős-Rényi $G(n,p)$ model, we obtain a randomized network with the same number of nodes and the same *expected* number of edges:

In [13]:
r2 = pp.generators.ER_np_randomize(n)
r2.plot()
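
The meaning of "expected" can be checked numerically. The sketch below assumes an undirected network without self-loops and assumes that the connection probability is fitted as $p = m / \binom{n}{2}$; both the helper `randomize_np` and this fitting rule are illustrative assumptions, not a description of `pathpy`'s internals.

```python
import itertools
import random


def randomize_np(nodes, edges, rng):
    """Keep the node set; connect each node pair independently with probability p."""
    pairs = list(itertools.combinations(sorted(nodes), 2))
    p = len(edges) / len(pairs)  # assumed fit: p = m / (n * (n - 1) / 2)
    sampled = [pair for pair in pairs if rng.random() < p]
    return sampled, p


rng = random.Random(7)
nodes = list(range(10))                 # n = 10, so 45 possible node pairs
edges = [(i, i + 1) for i in range(9)]  # m = 9 (a path)

sizes = []
for _ in range(2000):
    sampled, p = randomize_np(nodes, edges, rng)
    sizes.append(len(sampled))

mean_size = sum(sizes) / len(sizes)
print(p)  # 0.2
print(min(sizes), max(sizes), round(mean_size, 2))
```

Individual randomized networks have varying numbers of edges, but the average over many realizations is close to the original $m$.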

## Shuffling temporal networks

To study temporal networks, it is often helpful to randomize the timestamps of edges while preserving the frequency and topology of edges, as well as the inter-event time distribution. Consider the following example:

In [14]:
tn = pp.TemporalNetwork(directed=False, uid='temporal_network_42')
tn.add_edge('a', 'b', timestamp=1, color='red')
tn.add_edge('b', 'c', timestamp=3, color='green')
tn.add_edge('c', 'a', timestamp=6, color='blue')
tn.add_edge('a', 'b', timestamp=12, color='orange')
tn.plot()

KeyError: '0x22e5baa25c0'

We can use the `shuffle_temporal_network` function to randomly permute the timestamps and reassign them to edges. This function preserves the time-varying attributes of edges. In the example above, the shuffled network contains two occurrences of the edge $(a,b)$, one red and the other orange. However, those edges will occur at randomly chosen timestamps.

In [15]:
shuffled_tempnet = pp.algorithms.evaluation.shuffle_temporal_network(tn)

AttributeError: 'TemporalNetwork' object has no attribute 'tedges'

In [16]:
print(shuffled_tempnet)

NameError: name 'shuffled_tempnet' is not defined

In [17]:
shuffled_tempnet.plot()

NameError: name 'shuffled_tempnet' is not defined
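
Since the cells above failed in this run, the idea of the shuffling can be sketched on a plain list of events. The helper `shuffle_timestamps` is hypothetical and only mirrors the description given above: the observed timestamps are permuted and reassigned, while each edge occurrence keeps its endpoints and attributes.

```python
import random


def shuffle_timestamps(events, rng):
    """Permute the timestamps of events (source, target, timestamp, attributes)."""
    timestamps = [t for _, _, t, _ in events]
    rng.shuffle(timestamps)
    # Each event keeps its endpoints and attributes and receives one of the observed timestamps.
    return [(u, v, t, attrs) for (u, v, _, attrs), t in zip(events, timestamps)]


# The four time-stamped edges from the example above.
events = [("a", "b", 1, {"color": "red"}),
          ("b", "c", 3, {"color": "green"}),
          ("c", "a", 6, {"color": "blue"}),
          ("a", "b", 12, {"color": "orange"})]

shuffled = shuffle_timestamps(events, random.Random(1))
for event in shuffled:
    print(event)

print(sorted(t for _, _, t, _ in shuffled))  # [1, 3, 6, 12]
```

The multiset of timestamps and the number of occurrences of each edge are unchanged, so edge frequencies, topology, and the overall sequence of event times are preserved; only the assignment of times to edges is random.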