# Assignment Graph Visualization
### DATA 620
### Euclid Zhang &emsp; David Moste &emsp; Samuel Reeves  
### 6/3/2021

Presentation Video Link: TD

Data Source: https://snap.stanford.edu/data/email-EuAll.html

Import required libraries

In [221]:
import requests
import networkx as nx
import pandas as pd

Read the data file, then split it into lines. The first 3 lines contain descriptions of the data.

In [222]:
response = requests.get("https://raw.githubusercontent.com/ezaccountz/Data_620/main/week%202%20part%201/Email-EuAll.txt").text
[line for line in response.split("\n")[:3]]

['# Directed graph (each unordered pair of nodes is saved once): Email-EuAll.txt ',
 '# Email network of a large European Research Institution (directed edge means at least one email was sent between October 2003 and March 2005)',
 '# Nodes: 265214 Edges: 420045']

The 4th line contains the column names of the data: FromNodeId and ToNodeId. The last line is blank. Therefore, we only need the lines from line 5 through the second-to-last line of the text file.

In [223]:
lines = response.split("\n")[4:-1]

Read the two node IDs from each line, which are separated by a '\t' (tab) character.
Then convert the node IDs from strings to int, which takes less storage and allows numeric comparison.

In [224]:
data_full = pd.DataFrame(data = [n.split("\t") for n in lines], columns = ['from','to'])
# Convert both ID columns from str to int in one vectorized step
data_full = data_full.astype(int)
data_full

Unnamed: 0,from,to
0,0,1
1,0,4
2,0,5
3,0,8
4,0,11
...,...,...
420040,265209,422
420041,265210,422
420042,265211,192
420043,265212,875
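
As an alternative to splitting the text by hand, pandas can skip the description lines itself. The sketch below is only an illustration: it uses a small inline sample (hypothetical, not the real file) and assumes that every non-data line, including the column-name line, starts with `#`, as is usual for SNAP edge lists.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: '#' description lines, then tab-separated edges.
sample_text = (
    "# Directed graph: toy sample\n"
    "# Nodes: 4 Edges: 3\n"
    "# FromNodeId\tToNodeId\n"
    "0\t1\n"
    "0\t4\n"
    "1\t5\n"
)

# comment="#" drops the description lines; names= supplies our own column labels.
edges = pd.read_csv(
    io.StringIO(sample_text),
    sep="\t",
    comment="#",
    names=["from", "to"],
)
print(edges)
```

With this approach the ID columns are parsed as integers directly, so no separate conversion step is needed.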


The full dataset is too large to analyze comfortably, so we select a small subset. The n_subset variable is a node ID cutoff: only nodes with IDs from 0 to n_subset - 1 are kept, so the subset contains at most n_subset unique nodes.

In [225]:
n_subset = 500

Keep only the edges whose two endpoints are both among the first n_subset node IDs

In [226]:
data_sub = data_full[data_full['from'] < n_subset]
data_sub = data_sub[data_sub['to'] < n_subset]
data_sub = data_sub.reset_index(drop=True)
data_sub

Unnamed: 0,from,to
0,0,1
1,0,4
2,0,5
3,0,8
4,0,11
...,...,...
2951,497,452
2952,497,455
2953,497,457
2954,497,496


Now, we create a directed graph using the finished dataframe

In [227]:
g = nx.from_pandas_edgelist(data_sub, 'from', 'to', create_using=nx.DiGraph())

The number of nodes is

In [228]:
nx.number_of_nodes(g)

499

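The graph has 499 nodes rather than the 500 allowed by n_subset. A node is only added when it appears in at least one edge, so one node ID below n_subset has no edge to or from any other node in the subset.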
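
One way to find which IDs are absent is a set difference between the expected ID range and the graph's nodes. The following is a minimal sketch on made-up edges; `n_toy`, `toy_edges`, and `toy_g` are hypothetical names standing in for `n_subset`, `data_full`, and `g`.

```python
import networkx as nx
import pandas as pd

n_toy = 5  # hypothetical cutoff, playing the role of n_subset
toy_edges = pd.DataFrame({"from": [0, 0, 1, 4, 7], "to": [1, 4, 0, 6, 2]})

# Keep edges whose endpoints are both below the cutoff.
mask = (toy_edges["from"] < n_toy) & (toy_edges["to"] < n_toy)
toy_sub = toy_edges[mask].reset_index(drop=True)

toy_g = nx.from_pandas_edgelist(toy_sub, "from", "to", create_using=nx.DiGraph())

# IDs below the cutoff that never appear in a surviving edge.
missing = sorted(set(range(n_toy)) - set(toy_g.nodes))
print(missing)  # [2, 3]
```

Applied to the notebook's objects, the same expression would be `set(range(n_subset)) - set(g.nodes)`.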

The number of directed edges is

In [229]:
nx.number_of_edges(g)

2956

Let's check if there are any isolated groups

In [230]:
[len(c) for c in sorted(nx.strongly_connected_components(g), key=len, reverse=True)][0:9]

[252, 2, 2, 2, 2, 1, 1, 1, 1]

The largest strongly connected component contains 252 nodes, while the other components listed contain only 1 or 2 nodes each.

We will focus on this main group, which we call lcc (the largest strongly connected component).

In [231]:
largest_cc = max(nx.strongly_connected_components(g), key=len)

# For an undirected graph, use the following instead:
#g = nx.from_pandas_edgelist(data_sub, 'from', 'to')
#[len(c) for c in sorted(nx.connected_components(g), key=len, reverse=True)]
#largest_cc = max(nx.connected_components(g), key=len)
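
For a directed graph there is also a middle ground between the two options above: weakly connected components, which ignore edge direction without rebuilding the graph. The toy graph below is a hypothetical example, not the email data, and only illustrates how the two notions differ.

```python
import networkx as nx

# A 3-node directed cycle plus one node that can be reached but cannot reach back.
toy = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4)])

strong = sorted((len(c) for c in nx.strongly_connected_components(toy)), reverse=True)
weak = sorted((len(c) for c in nx.weakly_connected_components(toy)), reverse=True)

print(strong)  # [3, 1]
print(weak)    # [4]
```

Node 4 receives an edge but sends none, so it is its own strongly connected component while still belonging to the single weak component.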

In [232]:
lcc = g.subgraph(largest_cc).copy() 

Again, the number of nodes in the lcc is

In [233]:
nx.number_of_nodes(lcc)

252

The number of edges in the lcc is

In [234]:
nx.number_of_edges(lcc)

2513

The diameter of the lcc is 7: the longest of the shortest directed paths between any two nodes has 7 steps, which suggests a fairly well-connected social network

In [235]:
nx.diameter(lcc)

7
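
To make the meaning of the diameter concrete, here is a small sketch on a hypothetical directed ring (not the email data). It also computes the average shortest path length, which is often reported next to the diameter.

```python
import networkx as nx

# Directed ring 0 -> 1 -> 2 -> 3 -> 0: strongly connected, one direction only.
ring = nx.cycle_graph(4, create_using=nx.DiGraph)

ring_diameter = nx.diameter(ring)
ring_avg_path = nx.average_shortest_path_length(ring)

print(ring_diameter)  # 3, e.g. the path from node 1 back to node 0
print(ring_avg_path)  # 2.0, the mean of distances 1, 2 and 3
```

Both measures are finite only when every node can reach every other node along directed edges, which is one reason to extract the strongly connected component before computing them.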

Now, let's check the other properties of our graph

In [236]:
summarys = pd.DataFrame(dict(
    DEGREE = dict(lcc.degree),
    DEGREE_CENTRALITY = nx.degree_centrality(lcc),
    EIGENVECTOR = nx.eigenvector_centrality(lcc),
    KATZ = nx.katz_centrality_numpy(lcc),
    CLOSENESS_CENTRALITY = nx.closeness_centrality(lcc),
    BETWEENNESS_CENTRALITY = nx.betweenness_centrality(lcc),
    CLUSTCOEF = nx.clustering(lcc),
)) 
#code used is from 
#https://stackoverflow.com/questions/51974825/networkx-pandas-cant-output-the-degree-of-the-nodes-into-a-txt-file-error

In [237]:
summarys.sort_values(by = 'DEGREE', ascending = False).head(20)

Unnamed: 0,DEGREE,DEGREE_CENTRALITY,EIGENVECTOR,KATZ,CLOSENESS_CENTRALITY,BETWEENNESS_CENTRALITY,CLUSTCOEF
314,127,0.505976,0.251725,-0.031513,0.496047,0.135286,0.126526
10,100,0.398406,0.170728,0.06943,0.470919,0.140121,0.096966
192,97,0.386454,0.242733,-0.019339,0.4631,0.057408,0.203005
175,81,0.322709,0.175969,-0.061403,0.447415,0.049527,0.180952
240,76,0.302789,0.169645,-0.148683,0.458029,0.050219,0.190868
336,75,0.298805,0.21841,-0.070358,0.435009,0.021719,0.315921
446,71,0.282869,0.094778,-0.034898,0.449821,0.091287,0.117934
467,68,0.270916,0.147885,-0.10003,0.43958,0.038645,0.189355
425,66,0.262948,0.086513,0.03683,0.386749,0.040221,0.117133
457,65,0.258964,0.143047,-0.067836,0.457195,0.035086,0.179979
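
Several KATZ values in the table above are negative. Katz centrality is only well defined when the attenuation factor alpha is below 1/λ_max, where λ_max is the largest eigenvalue of the adjacency matrix, so the negative values suggest that the default alpha of 0.1 may be too large for this graph. The sketch below shows one possible way to choose alpha from λ_max. It runs on a hypothetical toy graph and swaps in the iterative `nx.katz_centrality` instead of the `katz_centrality_numpy` call used in the notebook; the factor 0.5 is an arbitrary safety margin.

```python
import networkx as nx
import numpy as np

# Toy stand-in for lcc: a complete directed graph on 5 nodes.
toy = nx.complete_graph(5, create_using=nx.DiGraph)

# Largest eigenvalue magnitude of the adjacency matrix (4 for this toy graph).
lam_max = np.abs(np.linalg.eigvals(nx.to_numpy_array(toy))).max()

# Pick alpha safely below the 1 / lam_max bound (0.5 is an assumed margin).
alpha = 0.5 / lam_max
katz = nx.katz_centrality(toy, alpha=alpha)

print(all(v > 0 for v in katz.values()))  # True
```

The same alpha could be passed to `nx.katz_centrality_numpy(lcc, alpha=alpha)` after computing `lam_max` from `lcc`.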


In [238]:
summarys.mean()

DEGREE                    19.944444
DEGREE_CENTRALITY          0.079460
EIGENVECTOR                0.040200
KATZ                       0.003336
CLOSENESS_CENTRALITY       0.341816
BETWEENNESS_CENTRALITY     0.008054
CLUSTCOEF                  0.319141
dtype: float64

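The average degree of the nodes is about 20 (2 × 2513 edges / 252 nodes ≈ 19.9; for a directed graph, the degree counts incoming plus outgoing edges), which again suggests a well-connected social network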
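
Because the graph is directed, it can help to separate incoming from outgoing edges. A minimal sketch on a hypothetical three-node graph:

```python
import networkx as nx

toy = nx.DiGraph([(1, 2), (2, 1), (2, 3)])

in_deg = dict(toy.in_degree)    # emails received: {1: 1, 2: 1, 3: 1}
out_deg = dict(toy.out_degree)  # emails sent:     {1: 1, 2: 2, 3: 0}
total_deg = dict(toy.degree)    # in + out:        {1: 2, 2: 3, 3: 1}

# The mean total degree always equals 2 * edges / nodes.
mean_degree = sum(total_deg.values()) / toy.number_of_nodes()
print(mean_degree)  # 2.0
```

In the notebook, `dict(lcc.in_degree)` and `dict(lcc.out_degree)` could be added as extra columns of the summary table in the same way as `DEGREE`.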

Export the graph as a GEXF file that can be opened in Gephi or other programs for visualization

In [239]:
nx.write_gexf(lcc, 'E:/SPS/DATA 620/assignments/emails.gexf')
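
A quick way to check the export is to read the file back and compare edges. The sketch below uses a hypothetical toy graph and a temporary file name rather than the path above.

```python
import os
import tempfile
import networkx as nx

toy = nx.DiGraph([(1, 2), (2, 3)])

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "toy_emails.gexf")
nx.write_gexf(toy, out_path)

# node_type=int converts the node IDs, which GEXF stores as text, back to integers.
back = nx.read_gexf(out_path, node_type=int)
print(sorted(back.edges()))  # [(1, 2), (2, 3)]
```

Writing to a path that does not depend on a specific drive letter also makes the notebook easier to run on another machine.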