# Constructing a network diagram from CrossRef records exported as author-to-author edges and aggregated author details

To ensure that we don't lose any edges, we will create the graphs as MultiGraphs (which allow multiple edges between the same two nodes).
We can then sum the weights by converting each MultiGraph to a normal Graph, or do this in Gephi on import.




In [76]:
# library for data import and handling
import pandas as pd

# for construction and analysis of networks
import networkx as nx

---
Defining a function before we import our data


### During our data processing we will want to reduce the network from a MultiGraph (many edges connecting the same nodes, i.e. many publications connecting the same authors) to a Graph in which each pair of authors has a single edge with a summed weight.


In [77]:
# Function to create a Graph from weighted MultiGraph
# using source (s), target (t) and weight data from edges

def merge_edge_weights(Graph_in):
    Graph_out = nx.Graph()
    for s,t,data in Graph_in.edges(data=True):
        w = data['weight'] if 'weight' in data else 1.0
        if Graph_out.has_edge(s,t):
            Graph_out[s][t]['weight'] += w
        else:
            Graph_out.add_edge(s, t, weight=w)
    return Graph_out
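
As a small check of the merging logic, the sketch below builds a toy MultiGraph and collapses its parallel edges in the same way as `merge_edge_weights`. The edge values and the names `toy_edges`, `toy_multi` and `toy_simple` are made up for illustration and are not part of the OxBRC2 data.

```python
import networkx as nx
import pandas as pd

# Toy edge list (hypothetical values): authors 1 and 2 share two publications
toy_edges = pd.DataFrame({
    'source': [1, 2, 1],
    'target': [2, 1, 3],
    'weight': [0.5, 0.25, 1.0],
})

# A MultiGraph keeps both 1-2 edges as separate parallel edges
toy_multi = nx.from_pandas_edgelist(toy_edges, edge_attr=['weight'],
                                    create_using=nx.MultiGraph())

# Collapse parallel edges by summing their weights (same logic as merge_edge_weights)
toy_simple = nx.Graph()
for s, t, w in toy_multi.edges(data='weight', default=1.0):
    if toy_simple.has_edge(s, t):
        toy_simple[s][t]['weight'] += w
    else:
        toy_simple.add_edge(s, t, weight=w)

print(toy_multi.number_of_edges(), toy_simple.number_of_edges())
print(toy_simple[1][2]['weight'])
```

Because the graph is undirected, the edges (1, 2) and (2, 1) count as the same pair and end up in a single edge.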


### First we will bring in the list of edges that we acquired by separating the author lists into 1-to-1 relationships

In [78]:
all_edges = pd.read_csv('./D4out_edge_coded_for_graphing.csv')

all_edges.head()

Unnamed: 0,source,target,source_auth,target_auth,timestamp,weight,DOI,CR_citations,DOI_low,research_group,group_type
0,1684,8996,A. L. Fenwick,J. A. Goos,2014-08-30 14:03:56,0.111111,10.1186/s12881-014-0095-4,5,10.1186/s12881-014-0095-4,['genomic medicine'],['Theme']
1,1684,8693,A. L. Fenwick,J. Rankin,2014-08-30 14:03:56,0.111111,10.1186/s12881-014-0095-4,5,10.1186/s12881-014-0095-4,['genomic medicine'],['Theme']
2,1684,7142,A. L. Fenwick,H. Lord,2014-08-30 14:03:56,0.111111,10.1186/s12881-014-0095-4,5,10.1186/s12881-014-0095-4,['genomic medicine'],['Theme']
3,1684,18477,A. L. Fenwick,T. Lester,2014-08-30 14:03:56,0.111111,10.1186/s12881-014-0095-4,5,10.1186/s12881-014-0095-4,['genomic medicine'],['Theme']
4,1684,1637,A. L. Fenwick,A. J. M. Hoogeboom,2014-08-30 14:03:56,0.111111,10.1186/s12881-014-0095-4,5,10.1186/s12881-014-0095-4,['genomic medicine'],['Theme']


### Now we bring in our additional information about particular authors (Nodes)

In [79]:
author_nodes = pd.read_csv('./D3out_Author_nodes_processed.csv', index_col=0)
author_nodes.shape #.columns

(20225, 10)

In [80]:
author_nodes.head()

Unnamed: 0,author,DOI,DOI_count,CR_citations,primary_affiliation,research_group,group_type,primary_group,primary_type,Ox_author
0,A. Abdel-Gadir,['10.1016/j.jcmg.2015.11.008'],1,80,{nan},['cardiovascular'],['Theme'],cardiovascular,Theme,0
1,A. Abe,['10.1080/15548627.2015.1100356'],1,3007,{nan},['immunity and inflammation'],['Theme'],immunity and inflammation,Theme,0
2,A. Abizaid,['10.1016/j.ijcard.2013.03.064'],1,1,{nan},['cardiovascular'],['Theme'],cardiovascular,Theme,0
3,A. Abubakar,['10.1371/journal.pone.0113360'],1,34,{nan},['translational physiology'],['Theme'],translational physiology,Theme,0
4,A. Abulí,"['10.1136/gutjnl-2011-300537', '10.1186/1471-2...",2,106,{nan},"['genomic medicine', 'genomic medicine']","['Theme', 'Theme']",genomic medicine,Theme,0


In [81]:
author_nodes.dtypes

author                 object
DOI                    object
DOI_count               int64
CR_citations            int64
primary_affiliation    object
research_group         object
group_type             object
primary_group          object
primary_type           object
Ox_author               int64
dtype: object

In [82]:
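# Build a MultiGraph so that every publication-level edge is kept;
# the 'source' and 'target' columns hold the author node ids
G = nx.from_pandas_edgelist(all_edges, source='source', target='target',
                            edge_attr=['weight', 'timestamp'],
                            create_using=nx.MultiGraph())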

In [83]:
# Add some more data to the nodes in the network graph

nx.set_node_attributes(G, pd.Series(author_nodes.author, index=author_nodes.index).to_dict(), 'author')
nx.set_node_attributes(G, pd.Series(author_nodes.DOI_count, index=author_nodes.index).to_dict(), 'DOI_count')
nx.set_node_attributes(G, pd.Series(author_nodes.CR_citations, index=author_nodes.index).to_dict(), 'CR_counts')

nx.set_node_attributes(G, pd.Series(author_nodes.research_group, index=author_nodes.index).to_dict(), 'research_group')
nx.set_node_attributes(G, pd.Series(author_nodes.group_type, index=author_nodes.index).to_dict(), 'group_type')
nx.set_node_attributes(G, pd.Series(author_nodes.primary_group, index=author_nodes.index).to_dict(), 'primary_group')
nx.set_node_attributes(G, pd.Series(author_nodes.primary_type, index=author_nodes.index).to_dict(), 'primary_group_type')
nx.set_node_attributes(G, pd.Series(author_nodes.Ox_author, index=author_nodes.index).to_dict(), 'Ox_mentions')
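
The same eight calls are repeated for each graph in this notebook. One possible way to avoid the repetition is a small helper that loops over a column-to-attribute mapping. This is only a sketch: `ATTRIBUTE_MAP` and `add_author_attributes` are hypothetical names, and the toy DataFrame stands in for `author_nodes` (indexed by node id).

```python
import networkx as nx
import pandas as pd

# Hypothetical mapping: DataFrame column -> node attribute name used in this notebook
ATTRIBUTE_MAP = {
    'author': 'author',
    'DOI_count': 'DOI_count',
    'CR_citations': 'CR_counts',
    'research_group': 'research_group',
    'group_type': 'group_type',
    'primary_group': 'primary_group',
    'primary_type': 'primary_group_type',
    'Ox_author': 'Ox_mentions',
}


def add_author_attributes(graph, nodes_df, attribute_map=ATTRIBUTE_MAP):
    """Attach the mapped columns of nodes_df (indexed by node id) to graph."""
    for column, attr_name in attribute_map.items():
        if column in nodes_df.columns:
            # Ids in nodes_df that are not nodes of the graph are skipped by networkx
            nx.set_node_attributes(graph, nodes_df[column].to_dict(), attr_name)
    return graph


# Toy stand-in for author_nodes (made-up values)
toy_nodes = pd.DataFrame(
    {'author': ['A. Example', 'B. Example', 'C. Example'],
     'CR_citations': [5, 0, 12],
     'Ox_author': [1, 0, 0]},
    index=[1, 2, 3],
)
toy_graph = nx.MultiGraph([(1, 2)])
add_author_attributes(toy_graph, toy_nodes)
print(toy_graph.nodes[1])
```

With such a helper, each period graph would need a single call such as `add_author_attributes(StartG, author_nodes)`.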

In [84]:
# nx.info() was removed in NetworkX 3.0; with newer versions, print(G) gives a similar summary
nx.info(G)

'Name: \nType: MultiGraph\nNumber of nodes: 20225\nNumber of edges: 5395986\nAverage degree: 533.5956'

In [85]:
# Uncomment to save (approx 1.2 GB file)

# nx.write_gexf(G, "./author_networks/All_OxBRC2_multi.gexf")

In [86]:
# now we can merge multiple edges between the same nodes

AllG_merge = merge_edge_weights(G)

In [87]:
# same number of nodes, fewer edges
nx.info(AllG_merge)

'Name: \nType: Graph\nNumber of nodes: 20225\nNumber of edges: 4296252\nAverage degree: 424.8457'

--- 

## Dividing OxBRC2 publications into 3 equal sections

---

Based on the previous finding that the publication rate over the course of OxBRC2 is linear overall, splitting the publications into 3 sections allows comparison of publications from the start, middle, and end:

    - START  = up to and including '2013-11-22 23:59:59+00:00'
    - MIDDLE = after '2013-11-22 23:59:59+00:00' and up to and including '2015-07-09 23:59:59+00:00'
    - END    = after '2015-07-09 23:59:59+00:00'
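
The cells that follow filter the edges with string comparisons on `timestamp`. As an alternative sketch, the split can be done in one pass with parsed timestamps. The cut-offs are the ones used in the filtering cells; `period_of` and `split_by_period` are hypothetical helper names, the edge timestamps are assumed to be in UTC, and the toy timestamps are made up.

```python
import networkx as nx
import pandas as pd

# Cut-offs taken from the filtering cells; written without the UTC offset because
# the edge timestamps shown in all_edges.head() carry none (assumed to be UTC)
START_CUTOFF = pd.Timestamp('2013-11-22 23:59:59')
MIDDLE_CUTOFF = pd.Timestamp('2015-07-09 23:59:59')


def period_of(timestamp):
    """Return 'start', 'middle' or 'end' for a timestamp string."""
    ts = pd.Timestamp(timestamp)
    if ts.tzinfo is not None:
        # Normalise offset-aware values to naive UTC so the comparison is valid
        ts = ts.tz_convert('UTC').tz_localize(None)
    if ts <= START_CUTOFF:
        return 'start'
    if ts <= MIDDLE_CUTOFF:
        return 'middle'
    return 'end'


def split_by_period(multigraph):
    """Distribute the edges of a MultiGraph over three period MultiGraphs."""
    periods = {name: nx.MultiGraph() for name in ('start', 'middle', 'end')}
    for s, t, data in multigraph.edges(data=True):
        periods[period_of(data['timestamp'])].add_edge(s, t, **data)
    return periods


# Toy data (made-up timestamps) to show the call
toy = nx.MultiGraph()
toy.add_edge(1, 2, weight=0.5, timestamp='2013-11-22 23:59:59')
toy.add_edge(1, 2, weight=0.5, timestamp='2014-08-30 14:03:56')
toy.add_edge(2, 3, weight=1.0, timestamp='2016-01-01 00:00:00')
toy_periods = split_by_period(toy)
print({name: g.number_of_edges() for name, g in toy_periods.items()})
```

Parsing the timestamps avoids relying on every string sharing exactly the same format, at the cost of a slower pass over several million edges.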
    

---

In [88]:
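# Timestamps are strings in 'YYYY-MM-DD HH:MM:SS' form, so comparing them
# as strings orders them chronologically
StartG = nx.MultiGraph([(s, t, edge_attr)
                        for s, t, edge_attr in G.edges(data=True)
                        if edge_attr['timestamp'] <= '2013-11-22 23:59:59+00:00'])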

In [89]:
nx.info(StartG)

'Name: \nType: MultiGraph\nNumber of nodes: 6684\nNumber of edges: 368354\nAverage degree: 110.2196'

In [90]:
# Add some more data to the nodes in the network graph

nx.set_node_attributes(StartG, pd.Series(author_nodes.author, index=author_nodes.index).to_dict(), 'author')
nx.set_node_attributes(StartG, pd.Series(author_nodes.DOI_count, index=author_nodes.index).to_dict(), 'DOI_count')
nx.set_node_attributes(StartG, pd.Series(author_nodes.CR_citations, index=author_nodes.index).to_dict(), 'CR_counts')

nx.set_node_attributes(StartG, pd.Series(author_nodes.research_group, index=author_nodes.index).to_dict(), 'research_group')
nx.set_node_attributes(StartG, pd.Series(author_nodes.group_type, index=author_nodes.index).to_dict(), 'group_type')
nx.set_node_attributes(StartG, pd.Series(author_nodes.primary_group, index=author_nodes.index).to_dict(), 'primary_group')
nx.set_node_attributes(StartG, pd.Series(author_nodes.primary_type, index=author_nodes.index).to_dict(), 'primary_group_type')
nx.set_node_attributes(StartG, pd.Series(author_nodes.Ox_author, index=author_nodes.index).to_dict(), 'Ox_mentions')

In [91]:
StartG.size(weight='weight')

5727.499999999998

In [92]:
nx.density(StartG)

0.016492537627516356

In [93]:
nx.is_connected(StartG)


False

In [94]:
# Uncomment to save  (approx 90MB file)

#nx.write_gexf(StartG, "./author_networks/Start_OxBRC2_multi.gexf")

In [95]:
StartG_merge = merge_edge_weights(StartG)

In [96]:
nx.info(StartG_merge)

'Name: \nType: Graph\nNumber of nodes: 6684\nNumber of edges: 288614\nAverage degree:  86.3597'

In [97]:
nx.density(StartG_merge)

0.012922290119906409

In [98]:
MidG = nx.MultiGraph([(s, t, edge_attr)
                      for s, t, edge_attr in G.edges(data=True)
                      if '2013-11-22 23:59:59+00:00' < edge_attr['timestamp'] <= '2015-07-09 23:59:59+00:00'])


In [99]:
nx.info(MidG)

'Name: \nType: MultiGraph\nNumber of nodes: 8697\nNumber of edges: 1021205\nAverage degree: 234.8407'

In [100]:
# Add some more data to the nodes in the network graph

nx.set_node_attributes(MidG, pd.Series(author_nodes.author, index=author_nodes.index).to_dict(), 'author')
nx.set_node_attributes(MidG, pd.Series(author_nodes.DOI_count, index=author_nodes.index).to_dict(), 'DOI_count')
nx.set_node_attributes(MidG, pd.Series(author_nodes.CR_citations, index=author_nodes.index).to_dict(), 'CR_counts')

nx.set_node_attributes(MidG, pd.Series(author_nodes.research_group, index=author_nodes.index).to_dict(), 'research_group')
nx.set_node_attributes(MidG, pd.Series(author_nodes.group_type, index=author_nodes.index).to_dict(), 'group_type')
nx.set_node_attributes(MidG, pd.Series(author_nodes.primary_group, index=author_nodes.index).to_dict(), 'primary_group')
nx.set_node_attributes(MidG, pd.Series(author_nodes.primary_type, index=author_nodes.index).to_dict(), 'primary_group_type')
nx.set_node_attributes(MidG, pd.Series(author_nodes.Ox_author, index=author_nodes.index).to_dict(), 'Ox_mentions')

In [101]:
MidG.size(weight='weight')

7208.500000000002

In [102]:
nx.density(MidG)

0.02700560598939731

In [103]:
# Uncomment to save  (approx 240MB file)

#nx.write_gexf(MidG, "./author_networks/Mid_OxBRC2_multi.gexf")

In [104]:
MidG_merge = merge_edge_weights(MidG)

In [105]:
nx.info(MidG_merge)

'Name: \nType: Graph\nNumber of nodes: 8697\nNumber of edges: 786972\nAverage degree: 180.9755'

In [106]:
nx.density(MidG_merge)

0.02081135105751341

In [107]:
EndG = nx.MultiGraph([(s, t, edge_attr)
                      for s, t, edge_attr in G.edges(data=True)
                      if edge_attr['timestamp'] > '2015-07-09 23:59:59+00:00'])


In [108]:
nx.info(EndG)

'Name: \nType: MultiGraph\nNumber of nodes: 10913\nNumber of edges: 4006427\nAverage degree: 734.2485'

In [109]:
# Add some more data to the nodes in the network graph

nx.set_node_attributes(EndG, pd.Series(author_nodes.author, index=author_nodes.index).to_dict(), 'author')
nx.set_node_attributes(EndG, pd.Series(author_nodes.DOI_count, index=author_nodes.index).to_dict(), 'DOI_count')
nx.set_node_attributes(EndG, pd.Series(author_nodes.CR_citations, index=author_nodes.index).to_dict(), 'CR_counts')

nx.set_node_attributes(EndG, pd.Series(author_nodes.research_group, index=author_nodes.index).to_dict(), 'research_group')
nx.set_node_attributes(EndG, pd.Series(author_nodes.group_type, index=author_nodes.index).to_dict(), 'group_type')
nx.set_node_attributes(EndG, pd.Series(author_nodes.primary_group, index=author_nodes.index).to_dict(), 'primary_group')
nx.set_node_attributes(EndG, pd.Series(author_nodes.primary_type, index=author_nodes.index).to_dict(), 'primary_group_type')
nx.set_node_attributes(EndG, pd.Series(author_nodes.Ox_author, index=author_nodes.index).to_dict(), 'Ox_mentions')

In [110]:
EndG.nodes()[11]

{'author': 'A.  Afshin',
 'DOI_count': 2,
 'CR_counts': 1503,
 'research_group': "['prevention and population care', 'biomedical informatics and technology', 'prevention and population care']",
 'group_type': "['Theme', 'Theme', 'Theme']",
 'primary_group': 'prevention and population care',
 'primary_group_type': 'Theme',
 'Ox_mentions': 0}

In [111]:
EndG.size(weight='weight')

8303.499999999995

In [112]:
nx.density(EndG)

0.06728816999177445

In [113]:
# Uncomment to save (approx 0.9 GB file)

#nx.write_gexf(EndG, "./author_networks/End_OxBRC2_multi.gexf")

In [114]:
EndG_merge = merge_edge_weights(EndG)

In [115]:
nx.density(EndG_merge)

0.05935710738598239

In [116]:
df_stages_table = pd.DataFrame(data=[[AllG_merge.number_of_nodes(),
                                      AllG_merge.number_of_edges(),
                                      nx.density(AllG_merge)],
                                     [StartG_merge.number_of_nodes(),
                                      StartG_merge.number_of_edges(),
                                      nx.density(StartG_merge)],
                                     [MidG_merge.number_of_nodes(),
                                      MidG_merge.number_of_edges(),
                                      nx.density(MidG_merge) ],
                                     [EndG_merge.number_of_nodes(),
                                      EndG_merge.number_of_edges(),
                                      nx.density(EndG_merge) ]],
                               index=['All','Start','Middle', 'End'],
                               columns=['Nodes (n)', 'Edges (n)','Network Density'])

In [124]:
df_stages_table

Unnamed: 0,Nodes (n),Edges (n),Network Density
All,20225,4296252,0.021007
Start,6684,288614,0.012922
Middle,8697,786972,0.020811
End,10913,3534201,0.059357
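
The densities in the table can be reproduced from the node and edge counts alone. For an undirected graph with n nodes and m edges, networkx computes density as 2m / (n(n - 1)). The short sketch below applies that formula to the Start row; `undirected_density` is a hypothetical helper name.

```python
def undirected_density(n_nodes, n_edges):
    """Density of an undirected graph: edges present over edges possible."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges


# Node and edge counts of the merged Start graph, taken from the table
start_density = undirected_density(6684, 288614)
print(round(start_density, 6))
```

For a MultiGraph the same formula counts parallel edges, which is why the unmerged graphs report higher densities than their merged counterparts.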


---
## Oxford nodes

A rough approximation of the number of authors with associations to Oxford (the geographically local network), obtained by searching for terms in the authors' institution information. This again highlights the large variation in publisher policies and entries, and the need for better persistent and unique identifiers for researchers and organisations if we want to investigate further with any accuracy.

---

In [118]:
AllG_Ox = G.subgraph([x for x,y in G.nodes(data=True) if y['Ox_mentions']>=1])
nx.info(AllG_Ox)

'Name: \nType: MultiGraph\nNumber of nodes: 1606\nNumber of edges: 24561\nAverage degree:  30.5866'

In [119]:
StartG_Ox = StartG.subgraph([x for x,y in StartG.nodes(data=True) if y['Ox_mentions']>=1]).copy()
nx.info(StartG_Ox)

'Name: \nType: MultiGraph\nNumber of nodes: 868\nNumber of edges: 6848\nAverage degree:  15.7788'

In [120]:
MidG_Ox = MidG.subgraph([x for x,y in MidG.nodes(data=True) if y['Ox_mentions']>=1])
nx.info(MidG_Ox)

'Name: \nType: MultiGraph\nNumber of nodes: 1006\nNumber of edges: 7878\nAverage degree:  15.6620'

In [121]:
EndG_Ox = EndG.subgraph([x for x,y in EndG.nodes(data=True) if y['Ox_mentions']>=1]).copy()
nx.info(EndG_Ox)

'Name: \nType: MultiGraph\nNumber of nodes: 1068\nNumber of edges: 9835\nAverage degree:  18.4176'
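
The subgraph cells above index `y['Ox_mentions']` directly, which assumes every node received that attribute. A slightly more defensive sketch, using the hypothetical helper `oxford_subgraph`, treats a missing attribute as zero mentions and always returns an independent copy rather than a view. The toy graph is made up for illustration.

```python
import networkx as nx


def oxford_subgraph(graph, min_mentions=1):
    """Return a copy of the subgraph of nodes with at least min_mentions Oxford mentions."""
    keep = [node for node, attrs in graph.nodes(data=True)
            if attrs.get('Ox_mentions', 0) >= min_mentions]
    return graph.subgraph(keep).copy()


# Toy graph: node 3 has no 'Ox_mentions' attribute at all
toy = nx.MultiGraph()
toy.add_node(1, Ox_mentions=2)
toy.add_node(2, Ox_mentions=1)
toy.add_node(3)
toy.add_edges_from([(1, 2), (1, 2), (2, 3)])

toy_ox = oxford_subgraph(toy)
print(sorted(toy_ox.nodes()), toy_ox.number_of_edges())
```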

--- 
A quick point to highlight: you can merge edges in Gephi as well as in networkx, and this may give easier exploration and visualization for some. To analyse further in networkx you may want to bring node attributes into the 'merged' networks (currently not needed here).

In [122]:
StartG.nodes[1684]

{'author': 'A. L.  Fenwick',
 'DOI_count': 8,
 'CR_counts': 424,
 'research_group': "['genomic medicine', 'genomic medicine', 'genomic medicine', 'genomic medicine', 'genomic medicine', 'genomic medicine', 'genomic medicine', 'cancer', 'molecular diagnostics', 'genomic medicine']",
 'group_type': "['Theme', 'Theme', 'Theme', 'Theme', 'Theme', 'Theme', 'Theme', 'Theme', 'Working Group', 'Theme']",
 'primary_group': 'genomic medicine',
 'primary_group_type': 'Theme',
 'Ox_mentions': 2}

In [123]:
StartG_merge.nodes[1684]

{}
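
The empty dictionary shows that the merged graph keeps the edges but none of the node attributes. If they are needed later, one possible approach is to copy them across from the MultiGraph. This is a minimal sketch on made-up data; `copy_node_attributes` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import networkx as nx


def copy_node_attributes(source_graph, target_graph):
    """Copy node attributes from source_graph onto matching nodes of target_graph."""
    for node, attrs in source_graph.nodes(data=True):
        if node in target_graph:
            target_graph.nodes[node].update(attrs)
    return target_graph


# Toy MultiGraph with attributes, and a merged Graph without them
toy_multi = nx.MultiGraph()
toy_multi.add_node(1, author='A. Example', Ox_mentions=2)
toy_multi.add_node(2, author='B. Example', Ox_mentions=0)
toy_multi.add_edge(1, 2, weight=0.5)
toy_multi.add_edge(1, 2, weight=0.25)

toy_merged = nx.Graph()
toy_merged.add_edge(1, 2, weight=0.75)

copy_node_attributes(toy_multi, toy_merged)
print(toy_merged.nodes[1])
```

For the graphs above the call would look like `copy_node_attributes(StartG, StartG_merge)`. Nodes that exist only in the source graph are skipped rather than added.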