# Reading networks from netzschleuder

[Run notebook in Google Colab](https://colab.research.google.com/github/pathpy/pathpy/blob/master/doc/tutorial/netzschleuder.ipynb)

The [netzschleuder](https://networks.skewed.de) repository is an online collection of more than 100,000 networks maintained by [Tiago Peixoto](https://skewed.de/tiago). With `pathpy` you can read any network directly from the netzschleuder repository to analyze and visualize it.

In [None]:
pip install git+https://github.com/pathpy/pathpy.git

In [1]:
import pathpy as pp

from pprint import pprint

Since the `netzschleuder` repository uses the graphtool binary format to store network data, support for retrieving networks from the repository is included in `pathpy`'s `io.graphtool` submodule.

Each `netzschleuder` data set can contain one or more networks. If there is more than one network in a data set, we have to additionally specify the name of the network that we wish to retrieve. In a first step, we can use the function `list_netzschleuder_records` to retrieve a list of all data sets. In the following, we only print the first 20 records:

In [2]:
datasets = pp.io.graphtool.list_netzschleuder_records()
pprint(datasets[:20])

['7th_graders',
 'academia_edu',
 'add_health',
 'adjnoun',
 'advogato',
 'amazon_copurchases',
 'amazon_ratings',
 'ambassador',
 'anybeat',
 'arxiv_authors',
 'arxiv_citation',
 'arxiv_collab',
 'as_skitter',
 'baidu',
 'baseball',
 'bible_nouns',
 'bibsonomy',
 'bison',
 'bitcoin',
 'bitcoin_alpha']


We can use keyword arguments to set additional query parameters (e.g. to look for data with specific tags or to return full records with all attributes). The supported query parameters can be found in the [API description](https://networks.skewed.de/api). To list all social networks in the `netzschleuder` repository, we call the function as follows (here we only print records 50 through 69):

In [3]:
datasets = pp.io.graphtool.list_netzschleuder_records(tags='Social')
pprint(datasets[50:70])

['football_tsevans',
 'foursquare',
 'foursquare_friendships',
 'foursquare_global',
 'freshmen',
 'game_thrones',
 'google_plus',
 'hens',
 'high_tech_company',
 'highschool',
 'hiv_transmission',
 'hyves',
 'inploid',
 'jazz_collab',
 'kangaroo',
 'karate',
 'kidnappings',
 'lastfm',
 'lastfm_aminer',
 'law_firm']
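
To make the role of the query parameters more concrete, the following minimal sketch builds a query URL from keyword arguments using only the standard library. The endpoint path `https://networks.skewed.de/api/nets` and the helper `build_query_url` are assumptions made for illustration; how `pathpy` forwards keyword arguments internally is not shown in this tutorial, so please check the API description for the actual routes and parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint for listing data sets; verify against the API description.
API_BASE = "https://networks.skewed.de/api/nets"


def build_query_url(base=API_BASE, **params):
    """Append the given keyword arguments to `base` as a URL query string."""
    if not params:
        return base
    return f"{base}?{urlencode(params)}"


# Hypothetical helper, not part of pathpy: mirrors the call with tags='Social'.
url = build_query_url(tags="Social")
print(url)
```

No request is sent here; the sketch only shows how a keyword argument such as `tags='Social'` could end up in the request URL.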


To retrieve detailed metadata on a specific data set, we can use the following function:

In [4]:
datasets = pp.io.graphtool.read_netzschleuder_record('karate')
pprint(datasets)

{'analyses': {'77': {'average_degree': 4.529411764705882,
                     'degree_assortativity': -0.46747895436420067,
                     'degree_std_dev': 3.751355003177064,
                     'diameter': 5,
                     'edge_properties': [],
                     'edge_reciprocity': 1.0,
                     'global_clustering': 0.2583170254403131,
                     'hashimoto_radius': 5.250999453080816,
                     'is_bipartite': False,
                     'is_directed': False,
                     'knn_proj_1': 3.6195042006501628,
                     'knn_proj_2': 1.4574919498236765,
                     'largest_component_fraction': 1.0,
                     'mixing_time': 6.981247460909331,
                     'num_edges': 77,
                     'num_vertices': 34,
                     'transition_gap': 0.8665453140179857,
                     'vertex_properties': [['name', 'int16_t'],
                                           ['groups', 'int1

These metadata contain citation information (including a BibTeX record), the original URL from which the data was retrieved, a textual description of the data, and a list of the networks contained in the data set. In the example above, the `karate` data set contains two networks named `77` and `78`, referring to different versions of the data. For each network, the metadata contain a number of network-level metrics.
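
Since the record is a plain nested dictionary, the network-level metrics can be inspected with ordinary Python. The sketch below uses a small excerpt of the output above, copied by hand for network `77` only, and checks that the reported `average_degree` agrees with the usual formula 2m/n for an undirected network. The helper `mean_degree_undirected` is introduced here for illustration and is not a `pathpy` function.

```python
# Excerpt of the record printed above, copied by hand (network '77' only).
record = {
    "analyses": {
        "77": {
            "num_vertices": 34,
            "num_edges": 77,
            "is_directed": False,
            "average_degree": 4.529411764705882,
        }
    }
}


def mean_degree_undirected(analysis):
    """Mean degree of an undirected network: twice the edges per vertex."""
    return 2 * analysis["num_edges"] / analysis["num_vertices"]


for name, analysis in record["analyses"].items():
    k = mean_degree_undirected(analysis)
    matches = abs(k - analysis["average_degree"]) < 1e-9
    print(f"network {name}: mean degree {k:.4f}, matches metadata: {matches}")
```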

## Reading static networks

Let us now read the network into an instance of `pathpy.Network`. For this, we can use the function `read_netzschleuder_network`. To read a specific network, we must specify both the name of the data set and the name of the network (in case there is more than one). The function automatically determines the type of network to return, i.e. static or temporal, directed or undirected, single- or multi-edge.

In [5]:
n = pp.io.graphtool.read_netzschleuder_network('karate', '77')
print(n)

Uid:			0x1f4744b7c88
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	34
Number of edges:	77

Network attributes
------------------
name:	karate (77)
description:	Network of friendships among members of a university karate club. Includes metadata for faction membership after a social partition. Note: there are two versions of this network, one with 77 edges and one with 78, due to an ambiguous typo in the original study. (The most commonly used is the one with 78 edges.)[^icon]
[^icon]: Description obtained from the [ICON](https://icon.colorado.edu) project.
citation:	[['W. W. Zachary, "An information flow model for conflict and fission in small groups." Journal of Anthropological Research 33, 452-473 (1977).', 'https://doi.org/10.1086/jar.33.4.3629752']]
url:	http://tuvalu.santafe.edu/~aaronc/data/
tags:	['Social', 'Offline', 'Unweighted']
title:	Zachary Karate Club
bibtex:	['@article{Zachary_1977,\n\tdoi = {10.1086/jar.33.4.3629752},\n\turl = {https://doi.org/10.

In [6]:
pp.plot(n)

## Reading temporal networks

`karate` is an example of a static network, where edges do not have associated timestamps. However, the `netzschleuder` repository also contains a number of temporal networks where edges are observed at specific times. To retrieve a list of temporal networks in the netzschleuder database, we can again use the function `list_netzschleuder_records`, setting the query parameter `tag='Temporal'`. We only output records 200 through 249:

In [7]:
pp.io.graphtool.list_netzschleuder_records(tag='Temporal')[200:250]

['rhesus_monkey',
 'roadnet',
 'route_views',
 'sa_companies',
 'scotus_majority',
 'slashdot_threads',
 'slashdot_zoo',
 'soc_net_comms',
 'social_location',
 'sp_high_school',
 'sp_high_school_new',
 'sp_hospital',
 'sp_hypertext',
 'sp_infectious',
 'sp_kenyan_households',
 'sp_office',
 'sp_primary_school',
 'stackoverflow',
 'student_cooperation',
 'swingers',
 'terrorists_911',
 'topology',
 'trackers',
 'train_terrorists',
 'trec',
 'trec_web',
 'twitter',
 'twitter_15m',
 'twitter_2009',
 'twitter_events',
 'twitter_higgs',
 'twitter_sample',
 'twitter_social',
 'ugandan_village',
 'un_migrations',
 'uni_email',
 'unicodelang',
 'us_agencies',
 'us_air_traffic',
 'us_congress',
 'us_patents',
 'us_roads',
 'visualizeus',
 'webkb',
 'wiki_article_words',
 'wiki_categories',
 'wiki_link_dyn',
 'wiki_rfa',
 'wiki_science',
 'wiki_talk']

To retrieve the full information on a specific record, we again call `read_netzschleuder_record` with the associated data set name:

In [8]:
pp.io.graphtool.read_netzschleuder_record('sp_hospital')

{'title': 'Hospital ward dynamic contacts (2010)',
 'description': 'This dataset contains the temporal network of contacts between patients, patients and health-care workers (HCWs) and among HCWs in a hospital ward in Lyon, France, from Monday, December 6, 2010 at 1:00 pm to Friday, December 10, 2010 at 2:00 pm. The study included 46 HCWs and 29 patients.[^icon]\n\nThe file contains a tab-separated list representing the active contacts during 20-second intervals of the data collection. Each line has the form “t i j Si Sj“, where i and j are the anonymous IDs of the persons in contact, Si and Sj are their statuses (NUR=paramedical staff, i.e. nurses and nurses’ aides; PAT=Patient; MED=Medical doctor; ADM=administrative staff), and the interval during which this contact was active is [ t – 20s, t ]. If multiple contacts are active in a given interval, you will see multiple lines starting with the same value of t. Time is measured in seconds.\n\n[^icon]: Description obtained from the [ICO
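
The description above specifies the raw file format: tab-separated lines of the form `t i j Si Sj`, where a contact is active during the interval [t - 20s, t]. As a rough illustration of that format, the sketch below parses such lines into records with an explicit start and end time. The sample lines are made up for this example and are not taken from the data set; `Contact` and `parse_contact` are hypothetical names, and `pathpy` itself reads the graphtool binary file rather than this text format.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    start: int      # beginning of the contact interval in seconds
    end: int        # end of the contact interval in seconds (the value t)
    i: str          # anonymous ID of the first person
    j: str          # anonymous ID of the second person
    status_i: str   # status of person i, e.g. NUR, PAT, MED, ADM
    status_j: str   # status of person j


def parse_contact(line, interval=20):
    """Parse one tab-separated line 't i j Si Sj' into a Contact."""
    t, i, j, status_i, status_j = line.rstrip("\n").split("\t")
    t = int(t)
    return Contact(t - interval, t, i, j, status_i, status_j)


# Made-up sample lines in the described format (not real data).
sample = [
    "140\t15\t31\tMED\tADM",
    "160\t15\t22\tMED\tMED",
    "160\t16\t22\tNUR\tMED",
]
contacts = [parse_contact(line) for line in sample]
for contact in contacts:
    print(contact)
```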

If there is only a single network in the data set, we can omit the network name (it then defaults to the name of the data set). In the network above, each edge has a `time` attribute. `pathpy` will thus return an instance of `TemporalNetwork`:

In [9]:
tn = pp.io.graphtool.read_netzschleuder_network('sp_hospital')
print(tn)

Uid:			0x1f47811db00
Type:			TemporalNetwork
Directed:		False
Multi-Edges:		False
Number of unique nodes:	75
Number of unique edges:	1139
Number of temp nodes:	75
Number of temp edges:	32424
Observation periode:	140 - 347641.0

Network attributes
------------------
name:	sp_hospital
description:	This dataset contains the temporal network of contacts between patients, patients and health-care workers (HCWs) and among HCWs in a hospital ward in Lyon, France, from Monday, December 6, 2010 at 1:00 pm to Friday, December 10, 2010 at 2:00 pm. The study included 46 HCWs and 29 patients.[^icon]

The file contains a tab-separated list representing the active contacts during 20-second intervals of the data collection. Each line has the form “t i j Si Sj“, where i and j are the anonymous IDs of the persons in contact, Si and Sj are their statuses (NUR=paramedical staff, i.e. nurses and nurses’ aides; PAT=Patient; MED=Medical doctor; ADM=administrative staff), and the interval during which this co
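
The choice between `Network` and `TemporalNetwork` described above depends on whether the edges carry a `time` attribute. The following sketch mimics that decision on a metadata dictionary, assuming that `edge_properties` is a list of `[name, type]` pairs as in the `karate` record. It is a hypothetical illustration, not `pathpy`'s actual implementation, and the property type `int32_t` in the temporal example is an assumption.

```python
def expected_network_type(analysis):
    """Guess the class name to expect from a network's metadata."""
    # Collect the names of all edge properties, ignoring their data types.
    edge_props = {name for name, _dtype in analysis.get("edge_properties", [])}
    return "TemporalNetwork" if "time" in edge_props else "Network"


# Metadata in the style of the 'karate' record: no edge properties at all.
static_meta = {"is_directed": False, "edge_properties": []}

# Hypothetical metadata for a data set whose edges carry a timestamp.
temporal_meta = {"is_directed": False, "edge_properties": [["time", "int32_t"]]}

print(expected_network_type(static_meta))
print(expected_network_type(temporal_meta))
```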

To generate a dynamic visualisation of this temporal network, we can simply call:

In [10]:
pp.plot(tn)

## Reading temporal data as static networks

Sometimes we have network data sets in which edges include timestamps, but we want to ignore the timestamps and treat them as multiple observations of the same edge instead. To return a static projection of such a network, we can set `ignore_temporal=True`. By default, an unweighted single-edge network is generated, i.e. additional observations of the same edge at different timestamps are simply discarded. To highlight that we ignore part of the data, `pathpy` issues a warning:

In [11]:
n = pp.io.graphtool.read_netzschleuder_network('sp_hypertext', 'contacts', ignore_temporal=True)
print(n)



Uid:			0x1f415beb748
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	113
Number of edges:	2196

Network attributes
------------------
name:	sp_hypertext (contacts)
description:	The temporal network of contacts among attendees of the ACM Hypertext 2009 conference, which spanned 2.5 days of time.[^icon]

This dataset was collected during the ACM Hypertext 2009 conference, where the SocioPatterns project deployed the Live Social Semantics application. Conference attendees volunteered to wear radio badges that monitored their face-to-face proximity. The dataset published here represents the dynamical network of face-to-face proximity of ~110 conference attendees over about 2.5 days. No personal data are released here, and no metadata collected by the Live Social Semantics application are exposed. We provide two data files, described below.

Contact List. This is a tab-separated list representing the active contacts during 20-second intervals of the data collection. Ea
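
The default projection can be illustrated without `pathpy`. The sketch below takes a handful of made-up timestamped contacts and keeps each undirected node pair only once, discarding the timestamps and any repeated observations. The data and the function `static_projection` are hypothetical and only meant to show the idea.

```python
# Made-up timestamped contacts (source, target, time), not real data.
temporal_edges = [
    ("a", "b", 20),
    ("b", "a", 40),  # same undirected pair as (a, b), observed again
    ("a", "c", 40),
    ("a", "b", 60),
]


def static_projection(edges):
    """Keep the first observation of every undirected node pair."""
    seen = set()
    projected = []
    for v, w, _time in edges:
        key = frozenset((v, w))  # order-independent, since edges are undirected
        if key not in seen:
            seen.add(key)
            projected.append((v, w))
    return projected


static_edges = static_projection(temporal_edges)
print(static_edges)
```

Four temporal observations collapse to two static edges here, in the same way that the many contacts in `sp_hypertext` collapse to the 2196 edges reported above.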

We may instead want to keep all information on the edges, either by returning a multi-edge network in which multiple edges between the same nodes are allowed, or by projecting the multiple observations to a numerical `weight` attribute of edges, where an edge weight of `n` indicates that this specific edge has been observed `n` times. We can control this behavior using the additional parameter `multiedges`:

In [12]:
n = pp.io.graphtool.read_netzschleuder_network('sp_hypertext', 'contacts', 
                                               ignore_temporal=True, multiedges=True)
print(n)

Uid:			0x1f4751134a8
Type:			Network
Directed:		False
Multi-Edges:		True
Number of nodes:	113
Number of edges:	20818

Network attributes
------------------
name:	sp_hypertext (contacts)
description:	The temporal network of contacts among attendees of the ACM Hypertext 2009 conference, which spanned 2.5 days of time.[^icon]

This dataset was collected during the ACM Hypertext 2009 conference, where the SocioPatterns project deployed the Live Social Semantics application. Conference attendees volunteered to wear radio badges that monitored their face-to-face proximity. The dataset published here represents the dynamical network of face-to-face proximity of ~110 conference attendees over about 2.5 days. No personal data are released here, and no metadata collected by the Live Social Semantics application are exposed. We provide two data files, described below.

Contact List. This is a tab-separated list representing the active contacts during 20-second intervals of the data collection. Ea

We can easily turn this into a **weighted** network, where each edge is included only once while an additional `weight` attribute counts the occurrences of that edge:

In [13]:
weighted_net = pp.Network.to_weighted_network(n)
print(weighted_net)

Uid:			0x1f4753aa7b8
Type:			Network
Directed:		False
Multi-Edges:		False
Number of nodes:	113
Number of edges:	2196
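
The idea behind the weighted projection can be sketched with a `collections.Counter`: every undirected node pair is counted once per observation, and the count becomes the edge weight. The edge list and the helper `to_weights` below are hypothetical and independent of how `pathpy` implements `to_weighted_network`.

```python
from collections import Counter

# Made-up multi-edge list for an undirected network (not real data).
multi_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")]


def to_weights(edges):
    """Count how often each undirected node pair occurs."""
    # Sorting the endpoints makes (a, b) and (b, a) the same key.
    return Counter(tuple(sorted(edge)) for edge in edges)


weights = to_weights(multi_edges)
for (v, w), weight in weights.items():
    print(v, w, weight)
```

In this sketch the weights sum to the number of multi-edges, while the number of distinct keys gives the number of edges in the weighted network.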
