## Data 620 - Week 3
### Liam Byrne

#### Assignment: Graph Visualization

***

+ **1. Load a graph database of your choosing from a text file or other source. If you take a large network dataset from the web (such as from https://snap.stanford.edu/data/), please feel free at this point to load just a small subset of the nodes and edges.**

The graph used in this example is the email communication network released during the Enron investigation ([available here](https://snap.stanford.edu/data/email-Enron.html)). A brief description follows:

>Enron email communication network covers all the email communication within a dataset of around half million emails. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. Nodes of the network are email addresses and if an address i sent at least one email to address j, the graph contains an undirected edge from i to j. Note that non-Enron email addresses act as sinks and sources in the network as we only observe their communication with the Enron email addresses.

First, we read the file, mirrored in a GitHub repo, into a **NetworkX** graph.

In [1]:
import networkx as nx
import pandas as pd

repo = "https://raw.githubusercontent.com/Liam-O/Data620/master/Wk3/Email-Enron.txt"
enron = pd.read_csv(repo, sep = "\t", skiprows = 3)
enron.columns = ("FromNodeID", "ToNodeID")
# Create graph object from df
g = nx.from_pandas_dataframe(enron, source = "FromNodeID",
                             target = "ToNodeID", create_using = nx.DiGraph())

# Convert to undirected, which merges reciprocal edges (i -> j and j -> i) into one
g = g.to_undirected()
print(nx.info(g))

Name: 
Type: Graph
Number of nodes: 36692
Number of edges: 183831
Average degree:  10.0202
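
The cell above was run against an older NetworkX release. As a minimal sketch of the same loading steps on newer releases, the example below assumes `nx.from_pandas_edgelist` is available (NetworkX 2.0 or later) and uses a small, made-up sample in the SNAP layout rather than the real file; `sample`, `edges` and `g_small` are illustrative names only.

```python
import io

import networkx as nx
import pandas as pd

# Hypothetical sample in the SNAP layout: "#" comment lines, then tab-separated pairs.
sample = "# Directed graph\n# FromNodeId\tToNodeId\n0\t1\n1\t0\n1\t2\n"

# comment="#" skips the header lines, so no skiprows count is needed.
edges = pd.read_csv(io.StringIO(sample), sep="\t", comment="#",
                    names=["FromNodeID", "ToNodeID"])

# Build a directed graph from the edge list, then collapse it to undirected.
g_small = nx.from_pandas_edgelist(edges, source="FromNodeID", target="ToNodeID",
                                  create_using=nx.DiGraph())
g_small = g_small.to_undirected()

print(g_small.number_of_nodes(), g_small.number_of_edges())
```

The reciprocal pair `0 -> 1` and `1 -> 0` ends up as a single undirected edge, which is the same effect `to_undirected()` has on the full Enron edge list.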


+ **2. Create basic analysis on the graph, including the graph’s diameter, and at least one other metric of your choosing. You may either code the functions by hand (to build your intuition and insight), or use functions in an existing package.**

The graph is not connected, i.e. it is made up of several connected components. *The diameter*, the greatest shortest-path distance between any two vertices, is only defined for a connected graph, so we will compute it on the largest connected component of the data. A small diameter would lead one to believe that the network is close-knit, i.e. if there was an issue, word of it would only need a few hops to reach everyone.

In [2]:
nx.is_connected(g)

False

In [3]:
g_max = max(nx.connected_component_subgraphs(g), key = len)
nx.is_connected(g_max)

True
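
For readers on a newer NetworkX release, where `connected_component_subgraphs` may not be available, the same idea can be sketched with `nx.connected_components` and `Graph.subgraph`. The toy graph and the names `g_toy`, `largest_nodes` and `g_toy_max` are assumptions made for illustration.

```python
import networkx as nx

# Toy graph with two components: the path 0-1-2-3 and the isolated edge 10-11.
g_toy = nx.Graph([(0, 1), (1, 2), (2, 3), (10, 11)])

# connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(g_toy), key=len)

# subgraph() returns a view; copy() makes an independent graph.
g_toy_max = g_toy.subgraph(largest_nodes).copy()

print(nx.is_connected(g_toy), nx.is_connected(g_toy_max))
```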

We can see that the attributes of `g_max` differ from those of `g` that were output above:

In [4]:
print(nx.info(g_max))

Name: 
Type: Graph
Number of nodes: 33696
Number of edges: 180811
Average degree:  10.7319


In [5]:
nx.diameter(g_max)

13
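
To build some intuition for what `nx.diameter` reports, the diameter can also be coded by hand: run a breadth-first search from every node, record the farthest distance reached (the node's eccentricity), and take the maximum. The sketch below is a simple illustration on a small path graph, not necessarily how NetworkX computes it. The helper names `eccentricity_bfs` and `diameter_bfs` are hypothetical, and the graph is assumed to be connected and unweighted.

```python
from collections import deque

import networkx as nx


def eccentricity_bfs(graph, source):
    """Greatest shortest-path distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:  # first visit gives the shortest distance
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


def diameter_bfs(graph):
    """Largest eccentricity over all nodes; assumes a connected graph."""
    return max(eccentricity_bfs(graph, node) for node in graph)


path = nx.path_graph(5)  # 0-1-2-3-4
print(diameter_bfs(path), nx.diameter(path))
```

Running one search per node is quadratic in the worst case, so on a graph the size of `g_max` this hand-rolled version would be slow; it is meant only to show what the number means.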

As another example of the graph's attributes, we will find the node with the highest degree, i.e. the node with the most edges incident to it. The output will be the email ID associated with that node. If an investigation were looking into wrongdoing at the company, the individual who corresponded with the most addresses would be a good place to start.

In [8]:
# degree_centrality returns each node's degree divided by (n - 1)
deg_max = nx.degree_centrality(g_max)
n_max = max(deg_max, key = deg_max.get)
# Raw degree: the number of distinct addresses that exchanged email with n_max
n_max_count = g_max.degree(n_max)
print("The individual who had the widest communication network in emails was email ID {0}. "
"The number of people who had communication with the individual was {1}.".format(
        n_max, n_max_count))

The individual who had the widest communication network in emails was email ID 5038. The number of people who had communication with the individual was 1383.
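
Degree centrality and raw degree are easy to mix up, so here is a small sketch on a made-up graph. `nx.degree_centrality` returns a dictionary of normalized values (degree divided by n - 1). Calling `max` on that dictionary without a `key` returns the largest node ID, not the largest value. The graph and the names `star`, `hub`, `largest_key` and `hub_degree` are illustrative assumptions.

```python
import networkx as nx

# Hub 0 connected to nodes 1..4, plus one extra node with a large ID.
star = nx.star_graph(4)
star.add_edge(4, 99)

centrality = nx.degree_centrality(star)

hub = max(centrality, key=centrality.get)  # node with the highest centrality
largest_key = max(centrality)              # largest node ID, unrelated to degree
hub_degree = star.degree(hub)              # raw neighbor count of the hub

print(hub, centrality[hub], hub_degree, largest_key)
```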


We will focus on the individual within the network with the most neighbors and create a subgraph around that node. The subgraph, an egocentric graph, will contain only the central node and the nodes one degree of separation from it.

In [11]:
ego_center = nx.ego_graph(g_max, n = n_max, undirected = True)
print(nx.info(ego_center))

Name: 
Type: Graph
Number of nodes: 1384
Number of edges: 1831
Average degree:   2.6460
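
To make the ego graph's contents concrete, the sketch below builds one on a tiny made-up graph. With the default `radius=1`, `nx.ego_graph` keeps the center, its direct neighbors, and every edge among those nodes, so the node count is the center's degree plus one. The names `g_demo` and `ego` are illustrative.

```python
import networkx as nx

# "a" has neighbors "b" and "c"; "d" and "e" are two or more hops away.
g_demo = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

ego = nx.ego_graph(g_demo, "a", radius=1)

print(sorted(ego.nodes), ego.number_of_edges())
```

The edge between `b` and `c` is retained even though it does not touch the center, which is why an ego graph can have more edges than the center has neighbors.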
