# Network Data & Software


- Where to find network data
    - Google, or a search on GitHub (the universal solution)
    - http://snap.stanford.edu/data/ (mostly large networks)
    - https://networkdata.ics.uci.edu/ (mostly small networks)
    - http://networkrepository.com/index.php (a more modern repository)

- Software for network data
    - Visualization and basic analysis: Gephi (lightweight); Cytoscape
    - Python: networkx/igraph
    - R: igraph

- Different formats of network data
    - See [Gephi’s documentation](https://gephi.org/users/supported-graph-formats/) for examples. Different software may support different formats.
    - CSV is the most basic and most portable format. You can write a network as a matrix, an edge list, or an adjacency list.
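
To make the three representations concrete, here is a minimal sketch that writes the same tiny network as an edge list, an adjacency list, and an adjacency matrix. The node names and the `Source`/`Target` header are made up for illustration.

```python
import csv
import io

# A tiny undirected example network (hypothetical data, for illustration only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nodes = sorted({n for edge in edges for n in edge})

# 1. Edge list: one CSV row per edge (written to memory instead of a file).
edge_list_csv = io.StringIO()
writer = csv.writer(edge_list_csv)
writer.writerow(["Source", "Target"])  # assumed header names
writer.writerows(edges)

# 2. Adjacency list: each node mapped to the sorted list of its neighbors.
neighbors = {n: [] for n in nodes}
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)
adjacency_list = {n: sorted(nbrs) for n, nbrs in neighbors.items()}

# 3. Adjacency matrix: entry [i][j] is 1 if nodes i and j share an edge.
index = {n: i for i, n in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1  # undirected, so the matrix is symmetric

print(edge_list_csv.getvalue())
print(adjacency_list)
for name, row in zip(nodes, matrix):
    print(name, row)
```

The edge list is usually the most compact choice for sparse networks, while the matrix grows with the square of the number of nodes.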

# Network data using Gephi

We use the data on the network of the characters in Les Misérables for illustration.


# NetworkX in Python 

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. https://networkx.org/documentation/stable/index.html

Online tutorial: https://ericmjl.github.io/Network-Analysis-Made-Simple/

## Read raw data

In [None]:
import networkx as nx

import csv
from operator import itemgetter

In [None]:
nx.__version__

- We first use the `csv` module to load the nodes and edges.
- There are two datasets:
    - a list of nodes, and
    - a list of edge pairs (as tuples of nodes).

In [None]:
!head -10 quakers_nodelist.csv

In [None]:
!head -20 quakers_edgelist.csv

In [None]:
with open('quakers_nodelist.csv', 'r') as nodecsv:
    nodereader = csv.reader(nodecsv) 
    nodes = [n for n in nodereader][1:]
nodes

In [None]:
node_names = [n[0] for n in nodes] # Get a list of only the node names
node_names

In [None]:
with open('quakers_edgelist.csv', 'r') as edgecsv: 
    edgereader = csv.reader(edgecsv) 
    edges = [tuple(e) for e in edgereader][1:]
    
edges

In [None]:
len(node_names)

In [None]:
len(edges)
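
The loading pattern above can be tried without the data files. The sketch below uses two small in-memory CSV strings as stand-ins (the names and column headers are hypothetical), skips the header row with `next()`, and checks that every edge endpoint appears in the node list.

```python
import csv
import io

# Hypothetical stand-ins for the two CSV files; column names are assumed.
node_text = "Name,Gender\nAlice,female\nBob,male\nCarol,female\n"
edge_text = "Source,Target\nAlice,Bob\nBob,Carol\n"

def read_rows(text):
    """Return the data rows of a CSV string, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # discard the header
    return list(reader)

nodes = read_rows(node_text)
edges = [tuple(row) for row in read_rows(edge_text)]
node_names = [row[0] for row in nodes]

# Every endpoint of every edge should appear in the node list.
unknown = {name for edge in edges for name in edge} - set(node_names)
print(len(node_names), "nodes,", len(edges), "edges, unknown endpoints:", unknown)
```

A non-empty `unknown` set would point to a spelling mismatch between the two files, which is worth catching before building the graph.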

## Create a `Graph` object

In [None]:
G = nx.Graph()

- `nx.Graph()` creates a “graph object,” a special NetworkX data type.

- The call above creates a new Graph object, `G`, with nothing in it. Now you can add your lists of nodes and edges like so:

In [None]:
G.add_nodes_from(node_names)
G.add_edges_from(edges)

- This is one of several ways to add data to a network object. 
- You can check out the [NetworkX documentation](https://networkx.org/documentation/stable/tutorial.html#adding-attributes-to-graphs-nodes-and-edges) for information about adding weighted edges, or adding nodes and edges one-at-a-time.
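
As a minimal sketch of those other options, the example below adds nodes and edges one at a time and then adds weighted edges in bulk. The graph `H`, the names, and the weights are made up for illustration.

```python
import networkx as nx

H = nx.Graph()

# Add a single node, optionally with attributes.
H.add_node("Alice", gender="female")

# Add a single edge; endpoints that do not exist yet are created automatically.
H.add_edge("Alice", "Bob")

# Add an edge that carries a weight attribute.
H.add_edge("Bob", "Carol", weight=2.5)

# Add several weighted edges at once from (u, v, weight) triples.
H.add_weighted_edges_from([("Alice", "Carol", 1.0), ("Carol", "Dave", 4.0)])

print(H.number_of_nodes(), H.number_of_edges())
print(H["Bob"]["Carol"]["weight"])
```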

In [None]:
list(G.nodes)

In [None]:
list(G.edges)[:10]

## Add Attributes

- For NetworkX, a Graph object is one big thing (your network) made up of two kinds of smaller things (your nodes and your edges). 
- So far we’ve imported the nodes and the edges (as pairs of nodes).
- NetworkX also allows us to add attributes to both nodes and edges, providing more information about each of them.


- The `nodes` list contains all of the rows from `quakers_nodelist.csv`, including columns for name, historical significance, gender, birth year, death year, and SDFB ID. You’ll want to loop through this list and add this information to the graph.
- There are a couple of ways to do this, but NetworkX provides two convenient functions for adding attributes to all of a Graph’s nodes or edges at once: `nx.set_node_attributes()` and `nx.set_edge_attributes()`.
- To use these functions, you’ll need your attribute data in the form of a **Python dictionary**, in which node names are the keys and the attributes you want to add are the values. You’ll want to create one dictionary for each attribute, and then add them using the functions above.
- The first step is to create five empty dictionaries, using curly braces:

In [None]:
hist_sig = {}
gender = {}
birth = {}
death = {}
id = {}

- Now we can loop through our nodes list and add the appropriate items to each dictionary. 
- We do this by knowing in advance the position, or index, of each attribute. Because the `quakers_nodelist.csv` file is well organized, we know that the person’s name will always be the first item in each row: index 0, since counting always starts at 0 in Python.
- The person’s historical significance will be index 1, their gender will be index 2, and so on. Therefore we can construct our dictionaries like so:

In [None]:
nodes[:3]

In [None]:
for node in nodes: # Loop through the list, one row at a time
    hist_sig[node[0]] = node[1]
    gender[node[0]] = node[2]
    birth[node[0]] = node[3]
    death[node[0]] = node[4]
    id[node[0]] = node[5]

In [None]:
hist_sig

- Now we have a set of dictionaries for attributes that can be added to nodes in the Graph object.
- The `set_node_attributes` function takes three arguments:
    - the Graph to which you’re adding the attribute,
    - the dictionary that maps each node name to its attribute value,
    - the name of the new attribute.
- The code for adding your five attributes looks like this:

In [None]:
nx.set_node_attributes(G, hist_sig, 'historical_significance')
nx.set_node_attributes(G, gender, 'gender')
nx.set_node_attributes(G, birth, 'birth_year')
nx.set_node_attributes(G, death, 'death_year')
nx.set_node_attributes(G, id, 'sdfb_id')
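
As an alternative sketch, `nx.set_node_attributes()` also accepts a dictionary of dictionaries when no attribute name is given, so several attributes can be attached in one call. The rows below are hypothetical and only mimic the layout of a node list. The birth year is converted to `int` here because `csv.reader` returns every field as a string.

```python
import networkx as nx

# Hypothetical rows in the layout name, gender, birth year (all strings, as read from CSV).
rows = [["Alice", "female", "1620"], ["Bob", "male", "1635"]]

H = nx.Graph()
H.add_nodes_from(row[0] for row in rows)

# Map each node name to a dictionary holding all of its attributes.
attrs = {
    name: {"gender": gender, "birth_year": int(birth)}
    for name, gender, birth in rows
}

# With no attribute name given, the keys of each inner dictionary become attribute names.
nx.set_node_attributes(H, attrs)

print(H.nodes["Alice"])
```

Whether one dictionary per attribute or one nested dictionary is clearer depends on how many attributes there are; both end up storing the same data on the nodes.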

- Now all of the nodes have these five attributes.
- Below, we first look at the attributes of a single node, and then print the birth years of all nodes by looping through them and accessing the `birth_year` attribute:

In [None]:
G.nodes['James Logan']

In [None]:
for n in G.nodes: # Loop through every node, in our data "n" will be the name of the person
    print(n, G.nodes[n]['birth_year']) # Access every node by its name, and then by the attribute "birth_year"

## Drawing graphs

NetworkX is not primarily a graph drawing package, but it includes basic drawing with Matplotlib as well as an interface to the open source Graphviz software package. These are part of the `networkx.drawing` module and will be imported if possible.

In [None]:
import matplotlib.pyplot as plt

G1 = nx.petersen_graph()

plt.subplot(121)
nx.draw(G1, with_labels=True, font_weight='bold')
plt.subplot(122)
nx.draw_shell(G1, nlist=[range(5, 10), range(5)], with_labels=True, font_weight='bold')
plt.show()

In [None]:
nx.draw(G, with_labels=True)
plt.show()

In [None]:
plt.figure(figsize=(10,7))

nx.draw_networkx(G, alpha=0.3, with_labels=False, edge_color='.4')
plt.tight_layout()

## Metrics available in NetworkX

### Network Density

You can calculate network density by running `nx.density(G)`.

In [None]:
density = nx.density(G)
print("Network density:", density)

In this case, the density of our network is approximately 0.0248. On a scale of 0 to 1, this is not a very dense network, which agrees with what you can see in the visualization.

A density of 0 would mean that there are no connections at all, and a density of 1 would indicate that all possible edges are present (a perfectly connected network).

Our network is on the lower end of that scale, but still far from 0.
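
For an undirected graph with n nodes and m edges, density is the ratio of existing edges to possible edges, 2m / (n(n - 1)). The sketch below checks this by hand against `nx.density()` on a small made-up graph and on the two extreme cases.

```python
import networkx as nx

# A small hypothetical graph: 4 nodes and 4 edges.
H = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])

n = H.number_of_nodes()
m = H.number_of_edges()

# Undirected density: actual edges divided by the n*(n-1)/2 possible edges.
manual_density = 2 * m / (n * (n - 1))

print("Manual:", manual_density)
print("NetworkX:", nx.density(H))

# The two ends of the scale: no edges at all, and every possible edge.
print(nx.density(nx.empty_graph(5)), nx.density(nx.complete_graph(5)))
```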

### Shortest Path

- A shortest path measurement calculates the shortest possible series of nodes and edges that stand between any two nodes, something hard to see in large network visualizations. 

- This measure essentially finds friends-of-friends: if my mother knows someone that I don’t, then the shortest path between me and that person runs through my mother.

- To calculate a shortest path, you’ll need to pass three arguments (information you give to a Python function): the whole graph, your source node, and your target node.

- An example: the shortest path between `Margaret Fell` and `George Whitehead`.

- Since we used names to uniquely identify our nodes in the network, you can access those nodes (as the source and target of your path), using the names directly.

In [None]:
fell_whitehead_path = nx.shortest_path(G, source="Margaret Fell", target="George Whitehead")

print("Shortest path between Fell and Whitehead:", fell_whitehead_path)

In [None]:
print("Length of that path:", len(fell_whitehead_path)-1)
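
A minimal sketch on a toy graph shows two related details: `nx.shortest_path_length()` returns the number of edges directly, and `nx.shortest_path()` raises `nx.NetworkXNoPath` when the two nodes sit in different components. The node names are made up for illustration.

```python
import networkx as nx

# Hypothetical friendship chain plus one isolated node.
H = nx.Graph([("Me", "Mom"), ("Mom", "Friend"), ("Friend", "Stranger")])
H.add_node("Hermit")  # no connections to anyone

path = nx.shortest_path(H, source="Me", target="Friend")
length = nx.shortest_path_length(H, source="Me", target="Friend")
print(path, length)

# Nodes in different components have no path between them.
try:
    nx.shortest_path(H, source="Me", target="Hermit")
    reachable = True
except nx.NetworkXNoPath:
    reachable = False
print("Hermit reachable:", reachable)
```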

### Connectivity

In [None]:
# If your Graph has more than one component, this will return False:
nx.is_connected(G)

In [None]:
# Next, use nx.connected_components to get the list of components,
components = nx.connected_components(G)
list(components)

In [None]:
# then use the max() command to find the largest one:
components = nx.connected_components(G)
largest_component = max(components, key=len)

largest_component

In [None]:
# Create a "subgraph" of just the largest component
# Then calculate the diameter of the subgraph, just like we did with density.

subgraph = G.subgraph(largest_component) # effectively the 'filter' we see in Gephi
diameter = nx.diameter(subgraph)
print("Network diameter of largest component:", diameter)
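
The reason for taking the largest component first can be seen on a toy graph: `nx.diameter()` raises an error on a disconnected graph, because some pairs of nodes have no path between them. The sketch also turns the components into a list up front, since `nx.connected_components()` returns a generator that can only be consumed once. The graph below is made up for illustration.

```python
import networkx as nx

# Hypothetical graph with two components: a path A-B-C-D and a separate pair X-Y.
H = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")])
print("Connected:", nx.is_connected(H))

# Materialize the generator so the components can be reused, largest first.
components = sorted(nx.connected_components(H), key=len, reverse=True)
largest = H.subgraph(components[0])

# The diameter is undefined for the whole (disconnected) graph.
try:
    nx.diameter(H)
    raised = False
except nx.NetworkXError:
    raised = True

diameter = nx.diameter(largest)
print("Error on full graph:", raised, "| diameter of largest component:", diameter)
```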

### Degree

In [None]:
G.degree('Alexander Parker')

In [None]:
degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')

In [None]:
G.nodes['William Penn']

In [None]:
sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Print the 20 nodes with the highest degree. The same sorting code also works for betweenness (or eigenvector) centrality if you change the variable names:

In [None]:
print("Top 20 nodes by degree:")
for d in sorted_degree[:20]:
    print(d)
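
As a sketch of that variation, the example below computes betweenness and eigenvector centrality with `nx.betweenness_centrality()` and `nx.eigenvector_centrality()`, stores them as node attributes, and sorts by betweenness using the same `itemgetter` pattern. The small graph is made up for illustration.

```python
import networkx as nx
from operator import itemgetter

# Hypothetical graph: a triangle A-B-C with a tail C-D-E.
H = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Each function returns a dictionary mapping node -> centrality score.
betweenness = nx.betweenness_centrality(H)
eigenvector = nx.eigenvector_centrality(H)

# Store the scores on the nodes, just like the degree attribute.
nx.set_node_attributes(H, betweenness, "betweenness")
nx.set_node_attributes(H, eigenvector, "eigenvector")

# Same sorting pattern as for degree, with the variable names changed.
sorted_betweenness = sorted(betweenness.items(), key=itemgetter(1), reverse=True)

print("Top nodes by betweenness:")
for b in sorted_betweenness[:3]:
    print(b)
```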

## Exporting Data

NetworkX supports a very large number of file formats for data export. You could even export your graph as a pandas DataFrame if there were more advanced statistical operations you wanted to run.

Most of the export options work in roughly the same way, so for this tutorial you’ll learn how to export your data into Gephi’s GEXF format. Once you’ve exported the file, you can open it directly in Gephi for visualization.

In [None]:
nx.write_gexf(G, 'quaker_network.gexf')

In [None]:
G = nx.read_gexf('quaker_network.gexf')

In [None]:
G.nodes['William Penn']
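
To check that node attributes survive the export, here is a minimal round-trip sketch on a made-up two-node graph, written to a temporary directory so that no files are left behind. The reloaded graph may carry a few extra attributes added by the GEXF format itself (for example a node `label`), so the sketch compares only the attribute that was set explicitly.

```python
import os
import tempfile

import networkx as nx

# Hypothetical two-node graph with one node attribute.
H = nx.Graph()
H.add_edge("Alice", "Bob")
nx.set_node_attributes(H, {"Alice": "female", "Bob": "male"}, "gender")

# Write to a temporary file and read it straight back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.gexf")
    nx.write_gexf(H, path)
    H2 = nx.read_gexf(path)

print(sorted(H2.nodes))
print(H2.nodes["Alice"]["gender"])
print(H2.has_edge("Alice", "Bob"))
```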