# DATA620 Basic Graph Analysis
## Erik Nylander

In [1]:
# Loading needed packages
%matplotlib inline
import networkx as nx
import pandas as pd

For this week's assignment we selected data from the Stanford Network Analysis Project (SNAP) on scientific collaborations between authors whose papers were submitted to the Astro Physics category of arXiv. Each node in the graph is an author, and each edge indicates that two authors collaborated on a paper. We downloaded and extracted the data into a text file. While working with it we found issues with the connectedness of the graph when using the full set of nodes and edges. The full graph also has 198,110 edges, which we felt might be too large to work with for this first project, so we elected to look at only the first 501 elements of the data set.

We took this data and generated a sample from its first 501 entries. This is the data set that we will load for the assignment, and it can be found on my GitHub site. The data was originally sourced from https://snap.stanford.edu/data/ca-AstroPh.html.

In [8]:
# Reading the data into a pandas data frame
data = pd.read_csv('https://raw.githubusercontent.com/eriknylander99/DATA620/master/data/CA-AstroPh-sample.csv', header = 0)

Now that we have loaded the data, let's take a look at it. There is a FromNodeId column and a ToNodeId column, and each row is an edge connecting two authors who have collaborated on a given work. Because we have a subset of the original data, our set contains 439 nodes, which represent the authors, and 496 edges, which represent the collaborations that have occurred on various papers. We also notice from looking at the data that the authors are listed in order, so we have the full collaboration history for 7 authors and the 438 authors that they have collaborated with.

In [9]:
# Previewing the first few rows of the sampled edge list.
data.head()

Unnamed: 0.1,Unnamed: 0,FromNodeId,ToNodeId
0,0,84424,276
1,1,84424,1662
2,2,84424,5089
3,3,84424,6058
4,4,84424,6229


In [14]:
# Number of authors that we have the full records for.
len(data.FromNodeId.unique())

7

In [15]:
# Number of authors that our 7 authors have collaborated with.
len(data.ToNodeId.unique())

438
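
The two counts above do not simply add up to the number of nodes, because an author can appear in both columns. A minimal sketch with a made-up edge list (the IDs and variable names below are hypothetical, not taken from the data set) shows how the node count comes from the union of the two columns:

```python
# Toy edge list; these IDs are made up for illustration only.
edges = [(1, 2), (1, 3), (2, 3), (2, 4)]

from_ids = {src for src, _ in edges}   # authors in the "from" column
to_ids = {dst for _, dst in edges}     # authors in the "to" column

# Every author is a node, whichever column it appears in.
nodes = from_ids | to_ids
both = from_ids & to_ids

print(len(from_ids), len(to_ids), len(nodes), len(both))
```

Applied to the counts in this notebook, 7 + 438 = 445 while the graph has 439 nodes, which would mean that 6 of the authors show up in both columns.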

Now that we have our data in a pandas data frame, we can use the built-in NetworkX feature that reads an edge list directly from a pandas data frame.

In [10]:
# Building a NetworkX graph from the edge list stored in pandas.
G = nx.from_pandas_dataframe(data, 'FromNodeId', 'ToNodeId')

Now that we have our graph object we can do some basic analysis. First we check the number of nodes and edges in the graph. The import has resulted in a graph with 439 nodes and 496 edges. The discrepancy between the length of the edge list that we imported and the resulting number of edges comes from repeated edges that describe the same connection between two authors in each direction.

In [11]:
# Number of nodes in the graph.
nx.number_of_nodes(G)

439

In [12]:
# Number of edges in the graph.
nx.number_of_edges(G)

496
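
The `from_pandas_dataframe` call used above belongs to the NetworkX 1.x API. In NetworkX 2.0 and later the same helper is named `from_pandas_edgelist`. The sketch below assumes a recent NetworkX and uses a small made-up data frame (not the real data) with the notebook's column names. It also illustrates how a pair of rows listing the same connection in both directions collapses into a single undirected edge:

```python
import networkx as nx
import pandas as pd

# Hypothetical three-row edge list with the notebook's column names.
# Rows (1, 2) and (2, 1) describe the same undirected connection.
toy = pd.DataFrame({'FromNodeId': [1, 1, 2], 'ToNodeId': [2, 3, 1]})

# Same arguments as before: data frame, source column, target column.
toy_graph = nx.from_pandas_edgelist(toy, 'FromNodeId', 'ToNodeId')

print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
```

Three rows go in, but only two edges come out, because an undirected `nx.Graph` stores each pair of nodes once.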

We will use the built-in diameter algorithm from NetworkX to find the diameter of our graph of authorship relationships, that is, the longest of the shortest paths between any two authors. The diameter of our graph is 6.

In [17]:
# Calculating the diameter of the graph using the built in NetworkX function.
nx.diameter(G)

6
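
`nx.diameter` is only defined for a connected graph, which matters given the connectedness issues noted for the full data set. The sample used here returned a value, so it is connected and needs no extra step. For a disconnected graph, one common workaround (not something this notebook does) is to measure the largest connected component. A hedged sketch on a made-up graph:

```python
import networkx as nx

# Made-up graph: a 4-node path plus a separate pair of nodes.
toy = nx.path_graph(4)          # 0 - 1 - 2 - 3
toy.add_edge(10, 11)            # second, disconnected component

# Pick the component with the most nodes and measure only that part.
largest = max(nx.connected_components(toy), key=len)
largest_diameter = nx.diameter(toy.subgraph(largest))

print(largest_diameter)
```

The longest shortest path in the 4-node path runs from node 0 to node 3, so the reported diameter is 3.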

Given that we are looking at the Astro Physics collaboration network, we decided to see how many people any given author tends to work with. To do this we found the degree of each node and averaged these values. In our smaller sample, the average author collaborates with 2.26 other authors. We also wanted to find out whether there were any "super" collaborators who had worked with a large number of people, so we looked for authors who had collaborated with more than 20 other individuals. This gave us 7 "super" authors, led by author 106274, who has collaborated with 185 other authors.

In [19]:
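# Calculating the average number of collaborators (degree) per author.
# dict() keeps num_neighbors a plain {node: degree} mapping.
num_neighbors = dict(nx.degree(G))
avg = 0
for key in num_neighbors:
    avg = avg + num_neighbors[key]
print(avg/float(len(num_neighbors)))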

2.25968109339
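
As a cross-check, the average degree of an undirected graph can also be computed directly from the node and edge counts, since each edge contributes to the degree of two nodes. A short sketch using the counts reported earlier in this notebook (the variable names are illustrative):

```python
# Counts reported by NetworkX earlier in the notebook.
num_nodes = 439
num_edges = 496

# Each undirected edge adds one to the degree of both of its endpoints,
# so the degrees sum to twice the edge count.
avg_degree = 2.0 * num_edges / num_nodes

print(round(avg_degree, 4))
```

This gives the same value as the loop over node degrees, about 2.26.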


In [20]:
# Finding the maximum degree: what is the largest number of collaborators that any author has?
max(num_neighbors.values())

185

In [24]:
# Selecting the "super" collaborators: authors with more than 20 collaborators.
{k:v for (k,v) in num_neighbors.items() if v > 20}

{60471: 46,
 63225: 27,
 84424: 75,
 89308: 32,
 94138: 81,
 106274: 185,
 127393: 55}
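
To make the ranking easier to read, the dictionary above can be sorted by degree. A small sketch that copies the seven values printed above into a literal (the variable names are illustrative):

```python
# Degrees of the seven "super" collaborators, copied from the output above.
super_authors = {60471: 46, 63225: 27, 84424: 75, 89308: 32,
                 94138: 81, 106274: 185, 127393: 55}

# Sort (author, degree) pairs from most to fewest collaborators.
ranked = sorted(super_authors.items(), key=lambda pair: pair[1], reverse=True)

for author, degree in ranked:
    print(author, degree)
```

Author 106274 comes first with 185 collaborators, followed by 94138 with 81 and 84424 with 75.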

Finally, we will write out our graph as a .gexf file that we can then import into Gephi for visualization and further analysis.

In [13]:
# Writing the graph as a GEXF file.
nx.write_gexf(G, 'd:/data/data620/CA-AstroPh.gexf')
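
Before opening the file in Gephi, it can be worth confirming that the export reads back correctly. The sketch below is only an illustration on a made-up graph written to a temporary directory (the path and names are hypothetical). It assumes `nx.read_gexf` with `node_type=int` so that node IDs come back as integers rather than strings:

```python
import os
import tempfile

import networkx as nx

# Made-up graph standing in for the collaboration network.
toy = nx.Graph()
toy.add_edges_from([(1, 2), (1, 3)])

# Write to a temporary file (hypothetical location) and read it back.
path = os.path.join(tempfile.mkdtemp(), 'toy.gexf')
nx.write_gexf(toy, path)
restored = nx.read_gexf(path, node_type=int)

print(sorted(restored.nodes()), restored.number_of_edges())
```

If the node and edge counts match the original graph, the file should be safe to load into Gephi.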